COMPREHENSIVE APPROACHES TO HIGH-DIMENSIONAL HETEROGENEOUS DATA: SEMIPARAMETRIC METHODS AND FEATURE SELECTION

By Sang Kyu Lee

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Doctor of Philosophy

2024

ABSTRACT

The advent of new technologies has introduced a broad spectrum of data types, necessitating advanced methodologies beyond traditional linear models. In scenarios where covariates include both linear and unknown or non-linear relationships with the response variable, relying solely on either parametric or nonparametric methods can be inadequate. This dissertation addresses these complexities using a semiparametric approach called the partial linear trend filtering model. Instead of classical nonparametric methods, a more recent technique, trend filtering, is integrated into the high-dimensional partial linear model. Simulation studies indicate that these models handle heterogeneous data more effectively than traditional approaches, and rigorous theoretical results, including convergence rates for the estimators, are established.

Additionally, this dissertation explores high-dimensional partial linear quantile regression to assess the heterogeneous effects of covariates on different quantiles of the response variable. By applying trend filtering to partial linear quantile regression, the strengths of both quantile regression and trend filtering are combined, supported by rigorous theoretical and simulation results. For practical validation, the partial linear quantile trend filtering model is applied to the Environment and Genetics in Lung Cancer Etiology (EAGLE) study data, showcasing its applicability and effectiveness in real-world data analysis.

Furthermore, only a small number of works have addressed feature selection and False Discovery Rate (FDR) control within the high-dimensional quantile regression framework. Inspired by the model-X knockoff procedure [9], a new method is introduced for simultaneously controlling FDR and detecting important covariates via the regional quantile regression approach. This three-step procedure identifies signals within the quantile region of interest rather than at a specific quantile level, effectively controlling the FDR. Simulation studies demonstrate the utility of this method, which is also applied to National Health and Nutrition Examination Survey (NHANES) data to identify risk factors associated with high body mass index.

Copyright by SANG KYU LEE 2024

ACKNOWLEDGEMENTS

I am immensely thankful to my advisors, Drs. Hyokyoung Grace Hong and Haolei Weng, for their unwavering support, invaluable guidance, and expert mentorship throughout my thesis journey. Their encouragement, insights, and constructive feedback have been instrumental in shaping my research and academic growth. I would also like to extend my heartfelt appreciation to my committee members, Drs. Antonio Galvao and Yuehua Cui, for their valuable contributions and insights that enriched my research and thesis. Furthermore, I am deeply grateful to my parents for their unconditional love, unwavering support, and sacrifices that have been my foundation throughout my PhD program. Their belief in me has been a constant source of motivation. I am also profoundly thankful to my wife, Hyun Jin, for her unwavering support, understanding, and patience during this challenging period. Her encouragement and belief in my dreams have been a source of strength.
Special thanks are also due to my friends Nicholas Yi and Jong In Lim, among others, for their friendship, support, and understanding during the ups and downs of my academic journey. I am equally grateful to my New Hope Baptist Church family, including Pastor Jeong and Dr. Pyeon, for their prayers, encouragement, and unwavering support throughout my PhD program. Their faith and community have been a source of inspiration and strength.

Together, my advisors, committee members, parents, wife, friends, and church family have been my pillars of support, enabling me to navigate the academic challenges with determination and resilience. I am truly blessed to have such exceptional individuals in my life.

TABLE OF CONTENTS

CHAPTER 1 OVERVIEW
CHAPTER 2 HIGH-DIMENSIONAL PARTIAL LINEAR MODEL WITH TREND FILTERING
2.1 Introduction
2.2 Model Description and Assumptions
2.3 Main Theoretical Results
2.4 Simulation Studies
2.5 Conclusion and Discussion
2.6 Proofs
CHAPTER 3 HIGH-DIMENSIONAL PARTIAL LINEAR QUANTILE REGRESSION WITH TREND FILTERING
3.1 Introduction
3.2 Model Description and Assumptions
3.3 Main Theoretical Results
3.4 Simulation Studies
3.5 Real Data Analysis
3.6 Conclusion and Discussion
3.7 Proofs
CHAPTER 4 FALSE DISCOVERY RATE CONTROL VIA REGIONAL QUANTILE REGRESSION ON ULTRA-HIGH DIMENSION
4.1 Introduction
4.2 Proposed Screening Method
4.3 FDR Control via Regional QR-Knockoff
4.4 Simulation Results
4.5 Real Data Analysis
4.6 Conclusion and Discussion
4.7 Proofs
CHAPTER 5 FUTURE STUDIES
BIBLIOGRAPHY

CHAPTER 1 OVERVIEW

This dissertation presents a comprehensive exploration of advanced statistical methodologies for high-dimensional data analysis, addressing critical challenges in handling complex and diverse data structures. The research is structured into three interconnected chapters, each building upon and extending state-of-the-art techniques to enhance the accuracy, robustness, and applicability of data analysis in scenarios where traditional models prove inadequate.
The second chapter introduces an innovative hybrid modeling approach, integrating parametric and nonparametric regression techniques within high-dimensional partial linear models. This methodology synthesizes the strengths of least squares regression and nonparametric methods, offering a sophisticated solution for complex biological data, particularly in gene expression studies. The chapter leverages penalization techniques such as LASSO and the local adaptivity of trend filtering, establishing a robust framework for high-dimensional heterogeneous data. Rigorous theoretical proofs and convergence rate results for the estimators underscore the statistical validity of this approach.

The third chapter extends the research paradigm to partial linear quantile regression, incorporating trend filtering to address high-dimensional sparse heterogeneous data. This integration combines the local adaptivity and computational efficiency of trend filtering with the quantile-adaptive, sparsity-inducing properties of the quantile LASSO. The chapter provides a more nuanced understanding of covariate-response relationships by estimating conditional quantiles at various levels. Theoretical contributions include comprehensive proofs and convergence rates for the estimators, distinguishing this work from previous research in the field. Additionally, this chapter includes a real data analysis utilizing the Environment and Genetics in Lung Cancer Etiology (EAGLE) study, a large-scale, population-based case-control study. This practical application demonstrates the methodology's efficacy in analyzing complex, high-dimensional epidemiological data, particularly in the context of lung cancer research, thereby bridging the gap between theoretical advancements and real-world applicability.

The fourth chapter addresses the unique challenges posed by ultra-high dimensional data, where the number of covariates (p) grows exponentially relative to the number of observations (n). The research employs regional quantile regression, modeling quantile regression over an interval of quantile levels to enhance the stability and robustness of variable selection. A novel knockoff procedure is proposed, specifically designed for regional quantile regression, guaranteeing exact False Discovery Rate (FDR) control with finite samples. This innovative approach ensures robust variable selection across multiple quantile levels, significantly advancing the field of ultra-high dimensional data analysis with quantile regression.

Collectively, this dissertation presents a cohesive and progressive framework for tackling the multifaceted challenges of heterogeneous high-dimensional data analysis, offering both theoretical advancements and practical methodologies for researchers and practitioners in fields ranging from bioinformatics to econometrics. Throughout the dissertation, notation may vary between chapters to best suit the specific methodologies and concepts presented. To ensure clarity and consistency, each chapter provides clear definitions of its notation in its introductory sections, facilitating reader comprehension and maintaining the precision necessary for advanced statistical discourse.

CHAPTER 2 HIGH-DIMENSIONAL PARTIAL LINEAR MODEL WITH TREND FILTERING

2.1 Introduction

With the advent of new technologies, we are now dealing with a wide variety of data types. This diversity necessitates methodologies that go beyond the traditional linear model.
In this research, we focus on a scenario where the response variable y is influenced by two predictors, x and z. Specifically, the relationship between y and x is assumed to be linear, while the relationship between y and z is non-linear or unknown. This modeling approach is particularly prevalent in biological studies, such as those involving gene expression data, where nonlinear factors like age or other related variables play a significant role.

To address this type of data, it is essential to use techniques that separately account for the linear and non-linear components. Neglecting the non-linear part and using standard least squares regression (parametric) can lead to incorrect inferences. Conversely, applying non-linear regression (nonparametric) to both components results in an overly rough estimation and can also suffer from the curse of dimensionality when the data are high-dimensional. The partial linear model resolves this issue by incorporating both parametric and nonparametric regression simultaneously. This semiparametric approach has been extensively studied in low-dimensional settings [75, 23, 80], where the number of covariates for the linear part is smaller than the number of observations or fixed as a constant. However, these models are not directly applicable to high-dimensional data where the number of covariates significantly exceeds the number of observations.

For high-dimensional data involving only x, substantial work has been done using penalization techniques such as LASSO (Least Absolute Shrinkage and Selection Operator) by [67], SCAD (Smoothly Clipped Absolute Deviation) by [16], and Elastic Net by [90]. The LASSO, in particular, has been widely studied and has shown remarkable results in sparse contexts [8]. Recent advances have extended these approaches to high-dimensional semiparametric models based on the partial linear model framework [46, 38, 89, 84]. These models combine penalized techniques with classical nonparametric methods, such as smoothing splines, to achieve robust and interpretable results.

In recent years, there has been substantial interest and progress in the nonparametric method known as trend filtering, as proposed by [66, 31, 69]. [69] describes trend filtering as a discrete analog of the locally adaptive regression splines introduced by [44]. Locally adaptive regression splines solve the following optimization problem for n observations:
\[
\min_{f \in \mathcal{F}_k} \frac{1}{2} \sum_{i=1}^{n} \big(y_i - f(z_i)\big)^2 + \lambda \cdot TV(f^{(k)}). \tag{2.1}
\]
Here, \(\mathcal{F}_k\) denotes the function space defined as
\[
\mathcal{F}_k = \big\{ f : [0,1] \to \mathbb{R} : f \text{ is } k \text{ times weakly differentiable and } TV(f^{(k)}) < \infty \big\},
\]
where TV(f^(k)) denotes the total variation of the kth derivative of f. The discrete analog of (2.1) using trend filtering is
\[
\min_{\theta \in \mathbb{R}^n} \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \big\| D^{(z,k+1)} \theta \big\|_1, \tag{2.2}
\]
where bold symbols denote vectors, ∥·∥_1 denotes the ℓ_1-norm, and D^(z,k+1) is the discrete difference operator of order k + 1 based on z. Equation (2.2) is referred to as the univariate kth order trend filtering problem. The details of these notations and definitions are elaborated on in Section 2.2. Following the proofs of equivalence between (2.1) and (2.2) given by [69], trend filtering estimates combine the benefits of locally adaptive regression splines with the advantages of the banded difference operator structure in (2.2).
This dual benefit provides local adaptivity and computational efficiency, making trend filtering favorable compared to other nonparametric methods such as smoothing splines and B-spline regression, especially when the smoothness of the underlying function varies over its domain.

Trend filtering has been successfully applied to various statistical fields, including graphical models [78], additive models [56], quantile regression [41], and spatiotemporal models [49]. However, its application within the partial linear model framework has not been explored to date. Our contribution lies in integrating trend filtering with the partial linear model and utilizing the LASSO to effectively address high-dimensional data. This approach leverages the local adaptivity and computational efficiency of trend filtering along with the sparsity-inducing properties of the LASSO, providing significant advantages when dealing with complex and heterogeneous data. Furthermore, our work provides rigorous theoretical details and establishes convergence rates for the estimators.

The rest of the chapter is organized as follows. Section 2.2 presents the preliminaries, including assumptions and model descriptions. Section 2.3 provides the main theoretical results and their interpretations. Section 2.4 details the simulation studies and computational aspects. Section 2.5 discusses our results, and Section 2.6 includes all the proofs.

2.1.1 Notations

Before delving into the model description and assumptions, we first clarify the notation used throughout this chapter. We use bold capital letters (e.g., Σ) to denote matrices and bold small letters (e.g., x, y) to denote vectors. For a = (a_1, . . . , a_p) ∈ R^p, denote ∥a∥_q = (Σ_{i=1}^p |a_i|^q)^{1/q} for q ∈ [1, ∞) and ∥a∥_∞ = max_{1≤i≤p} |a_i|. For two vectors a, b ∈ R^n, we write ∥a∥²_n = (1/n) a′a and ⟨a, b⟩_n = (1/n) a′b. Given a square matrix A = (a_{ij}) ∈ R^{p×p}, λ_max(A) and λ_min(A) represent its largest and smallest eigenvalues, respectively. For a general matrix A = (a_{ij}) ∈ R^{p×q}, ∥A∥_2 denotes its spectral norm, ∥A∥_max = max_{ij} |a_{ij}|, and ∥A∥_F = (Σ_{i,j} a²_{ij})^{1/2}. For a, b ∈ R, a ∧ b = min(a, b) and a ∨ b = max(a, b). For a set A, 1_A(·) is the usual indicator function, and |A| denotes its cardinality. Moreover, a_n ≲ b_n (a_n ≳ b_n) means there exists some constant C > 0 such that a_n ≤ C b_n (a_n ≥ C b_n) for all n; thus a_n ≲ b_n (a_n ≳ b_n) is equivalent to a_n = O(b_n) (a_n = Ω(b_n)); a_n ≍ b_n if and only if a_n ≲ b_n and a_n ≳ b_n; a_n ≫ b_n means b_n = o(a_n). We put the subscript p on O and o for random variables. For i.i.d. samples {w_1, . . . , w_n} from a distribution Q supported on some space W, denote by Q_n the associated empirical distribution. The L_2(Q) and L_2(Q_n) norms for functions f : W → R are ∥f∥²_{L_2(Q)} = ∫_W f²(w) dQ(w) and ∥f∥²_{L_2(Q_n)} = (1/n) Σ_{i=1}^n f²(w_i). For simplicity we abbreviate subscripts and write ∥f∥ and ∥f∥_n for ∥f∥_{L_2(Q)} and ∥f∥_{L_2(Q_n)}, respectively, whenever Q is the underlying distribution of the covariates.

2.2 Model Description and Assumptions

In this section, we present a detailed introduction and description of the model under study. Following an initial overview of the model, we delineate additional conditions necessary for our proofs, accompanied by brief explanatory comments for each condition.

2.2.1 Model Description and Estimation Method

We consider a partial linear regression model as below:
\[
y = x'\beta_0 + g_0(z) + \varepsilon,
\]
where ε is independent of (x, z) ∈ R^{p+1}, β_0 ∈ R^p has support S = {j : β_{0j} ≠ 0} with |S| ≤ s, and g_0 : [0, 1] → R is a nonparametric function.
In general, z ∈ [0, 1] may not hold, but it can be achieved by a simple affine transformation, so we assume it for technical simplicity. We let {(y_i, x_i, z_i)}_{i=1}^n be n i.i.d. observations of (y, x, z), and without loss of generality we assume that these observations are re-ordered based on z_i, i = 1, . . . , n. Moreover, as discussed in the introduction, we focus on functions g_0 whose kth derivative has total variation bounded by a constant, that is, TV(g_0^{(k)}) ≤ L_g with L_g > 0. We consider the case where p ≫ n, indicating a high-dimensional linear model with an unknown function. This model assumption is prevalent in certain biological studies, such as those involving gene expression data, where nonlinear factors such as age or other related variables are present.

Expanding upon the univariate trend filtering of [31] and [69], for a given integer k ≥ 0, the kth order partial linear trend filtering (PLTF) estimate for the partial linear model is defined as
\[
(\hat{\beta}, \hat{\theta}) \in \arg\min_{\beta \in \mathbb{R}^p,\, \theta \in \mathbb{R}^n} \frac{1}{2} \| y - X\beta - \theta \|_n^2 + \lambda \|\beta\|_1 + \gamma \| D^{(z,k+1)} \theta \|_1, \tag{2.3}
\]
where D^{(z,k+1)} ∈ R^{(n−k−1)×n} is the discrete derivative operator of order k + 1. When k = 0,
\[
D^{(z,1)} =
\begin{pmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix}
\in \mathbb{R}^{(n-1) \times n}. \tag{2.4}
\]
For k ≥ 1, the derivative operator is defined recursively, that is,
\[
D^{(z,k+1)} = D^{(z,1)} \cdot \mathrm{diag}\Big( \frac{k}{z_{k+1} - z_1}, \cdots, \frac{k}{z_n - z_{n-k}} \Big) \cdot D^{(z,k)},
\]
where in the above equation D^{(z,1)} takes the form of (2.4) with dimension (n − k − 1) × (n − k). When k = 0, this estimate is equivalent to the one-dimensional fused lasso [68], also known as one-dimensional total variation denoising in signal processing [55].

Before further discussing estimation methods, we briefly introduce the falling factorial basis. [69] and [77] demonstrate a connection between univariate trend filtering and falling factorial basis functions, showing that the trend filtering problem can be interpreted as a sparse basis regression problem using these functions. This finding also applies to our case since our model is a generalization of the univariate trend filtering model. For given knot points t_1 < · · · < t_n ∈ R, the kth order falling factorial basis functions are defined as
\[
q_i(t) = \prod_{l=1}^{i-1} (t - t_l), \quad i = 1, \ldots, k+1,
\]
\[
q_{i+k+1}(t) = \prod_{l=1}^{k} (t - t_{i+l}) \cdot 1\{ t > t_{i+k} \}, \quad i = 1, \ldots, n-k-1,
\]
where we write q_1(t) = 1 as a convention. With these definitions, the estimation in (2.3) is equivalent to
\[
(\hat{\beta}, \hat{\alpha}) \in \arg\min_{\beta \in \mathbb{R}^p,\, \alpha \in \mathbb{R}^n} \frac{1}{2} \Big\| y - X\beta - \sum_{l=1}^{n} \alpha_l q_l(z) \Big\|_n^2 + \lambda \|\beta\|_1 + \gamma\, k! \sum_{l=k+2}^{n} |\alpha_l|, \tag{2.5}
\]
which is a lasso form. Indeed, an equivalent alternative expression of (2.5) is
\[
(\hat{\beta}, \hat{g}) \in \arg\min_{\beta \in \mathbb{R}^p,\, g \in \mathcal{H}^k_n} \frac{1}{2} \| y - X\beta - g(z) \|_n^2 + \lambda \|\beta\|_1 + \gamma\, TV(g^{(k)}), \tag{2.6}
\]
where \(\mathcal{H}^k_n\) is the span of the kth order falling factorial basis functions with knots z_{k+1} ≤ z_{k+2} ≤ · · · ≤ z_{n−1}. Throughout the chapter, we write \(\mathcal{H}_n\) for \(\mathcal{H}^k_n\) for notational simplicity. We omit the proofs of the equivalences among (2.3), (2.5), and (2.6), since they follow directly from Lemma 1 in [69] for (2.3) and (2.5), and from Lemma 2 in [69] and Lemma 1 in [56] for (2.5) and (2.6). Based on these equivalences, the estimate from (2.3) is the same as that from (2.6). By constructing an appropriate linear combination of falling factorial functions, the form in (2.5) enables interpolation and extrapolation for partial linear trend filtering. Consequently, this formulation enhances the practical applicability of partial linear trend filtering; a small illustrative sketch of these building blocks follows.
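To make the definitions above concrete, here is a minimal R sketch, not taken from the dissertation or its plmR package, that constructs the difference operator D^(z,k+1) by the recursion above and evaluates the falling factorial basis. The function names (diff_op1, diff_op, falling_factorial_basis) are ours and chosen only for illustration.

```r
# First-order difference operator of dimension (m - 1) x m, as in (2.4)
diff_op1 <- function(m) {
  D <- matrix(0, m - 1, m)
  D[cbind(1:(m - 1), 1:(m - 1))] <- -1
  D[cbind(1:(m - 1), 2:m)] <- 1
  D
}

# Recursive construction of D^(z,k+1) for inputs z sorted increasingly
diff_op <- function(z, k) {
  n <- length(z)
  D <- diff_op1(n)                            # D^(z,1), (n-1) x n
  if (k == 0) return(D)
  for (j in 1:k) {
    # row weights j / (z_{i+j} - z_i), i = 1, ..., n - j
    w <- j / (z[(j + 1):n] - z[1:(n - j)])
    D <- diff_op1(n - j) %*% diag(w) %*% D    # now (n - j - 1) x n
  }
  D
}

# kth order falling factorial basis q_1, ..., q_n (k >= 1 assumed here),
# evaluated at points t with knots z
falling_factorial_basis <- function(t, z, k) {
  n <- length(z)
  Q <- matrix(1, length(t), n)                # q_1(t) = 1 by convention
  for (i in 2:(k + 1)) {                      # pure polynomial terms
    for (l in 1:(i - 1)) Q[, i] <- Q[, i] * (t - z[l])
  }
  for (i in 1:(n - k - 1)) {                  # truncated, knot-based terms
    col <- i + k + 1
    for (l in 1:k) Q[, col] <- Q[, col] * (t - z[i + l])
    Q[, col] <- Q[, col] * (t > z[i + k])
  }
  Q
}

# Example: the penalty gamma * ||D^(z,k+1) theta||_1 appearing in (2.3)
set.seed(1)
z     <- sort(runif(20))
theta <- sin(2 * pi * z)
D     <- diff_op(z, k = 2)
pen   <- sum(abs(D %*% theta))
B     <- falling_factorial_basis(seq(0, 1, length.out = 50), z, k = 2)
```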
Our estimation approach in (2.3) is a doubly penalized least squares estimator, similar to those proposed by [46] and [84]. This approach involves two penalties: the first controls the sparsity of β̂, while the second regulates the variability of the function ĝ. The primary distinction between our approach and theirs lies in the estimation method for the function g. While both of their methods employ the smoothing spline technique, our method utilizes trend filtering. Given the model structure, where the estimation of g significantly influences that of β, accurately estimating g is critical. As discussed in the introduction, prior literature demonstrates that when the smoothness of g_0 is heterogeneous, the trend filtering method outperforms the smoothing spline method, in terms of local smoothness as well as estimation accuracy. We show the corresponding simulation results later in Section 2.4.

We now introduce the conditions on our model in (2.6) that are needed for the proofs of the lemmas and theorems throughout.

Condition 1 (Design Condition). The covariate x = (x_1, . . . , x_p) has sub-Gaussian coordinates: max_{1≤j≤p} ∥x_j∥_{ψ2} ≤ K_x, where ∥·∥_{ψ2} denotes the sub-Gaussian norm.

Condition 2 (Sub-Gaussian Condition). The noise ε is sub-Gaussian: E(ε) = 0, Var(ε) = σ², ∥ε∥_{ψ2} ≤ K_ε σ.

Conditions 1 and 2 are more general than those in [46], where the errors are assumed to be standard Gaussian and the maximum value of the design matrix is assumed to be bounded by a positive constant.

Condition 3 (Bounded Input Density). z has a continuous distribution supported on [0, 1]. Its density is bounded below by a constant ℓ_z > 0.

Condition 2 is the same as in [69], and Condition 3 is the same as in [56]. These two conditions are common in the trend filtering literature.

Condition 4 (Eigenvalue Condition). Define h(z) = (h_1(z), . . . , h_p(z)) = E(x | z) and x̃ = x − h(z). Assume λ_min(E x̃x̃′) ≥ Λ_min > 0 and λ_max(E h(z)h(z)′) ≤ Λ_max < ∞.

Condition 4 is common in the semiparametric literature [83, 46, 84]. This condition ensures that there is enough information in the data to identify the parameters in the linear part.

Condition 5. max_{1≤j≤p} TV(h_j^{(k)}) ≤ L_h.

Condition 6. (s + log p) s log p / n = o(1) and p → ∞ as n → ∞.

Condition 5 is similar to Condition 2.6 in [46] and Condition A.5 in [84]; note that L_h = 0 when z and x are independent. This condition is used in the proof of Theorem 2. Condition 6 is a technical condition used in the proofs of Theorems 1 and 2.

2.3 Main Theoretical Results

In this section, we present our main results, which consist of two parts: the result for g and the result for β. We begin with the convergence rate result for g.

Theorem 1. Assume Conditions 1-4 and 6. Choose λ = c_1 √(log p / n) and γ = c_2 (s log p / n + n^{−(2k+2)/(2k+3)}) with large enough constants c_1, c_2 > 0. Then, there exist constants c_3, c_4, n_0 > 0 such that any solution ĝ in (2.6) satisfies
\[
\| \hat{g}(z) - g_0(z) \|^2 \le c_3 \Big( \frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}} \Big), \qquad
\| \hat{g} - g_0 \|_n^2 \le c_3 \Big( \frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}} \Big)
\]
with probability at least 1 − p^{−c_4} − n^{−c_4}, as long as n ≥ n_0. The constants c_1, c_2, c_3, c_4, n_0 may depend on k, L_g, L_h, K_x, K_ε, ℓ_z, Λ_min, Λ_max, σ. A brief numerical sketch of these tuning scalings is given below.
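As a purely illustrative aside (not part of the dissertation), the following R lines evaluate the two terms in the Theorem 1 bound and the corresponding tuning scalings for example problem sizes, with the unspecified constants c_1, c_2 simply set to one.

```r
# Illustrative only: the two terms in the Theorem 1 bound and the tuning scalings,
# with the theorem's unspecified constants c1, c2 set to 1
rate_terms <- function(n, p, s, k) {
  c(lambda   = sqrt(log(p) / n),                        # lambda ~ c1 * sqrt(log p / n)
    gamma    = s * log(p) / n + n^(-(2 * k + 2) / (2 * k + 3)),
    linear   = s * log(p) / n,                          # s log p / n
    nonparam = n^(-(2 * k + 2) / (2 * k + 3)))          # n^{-(2k+2)/(2k+3)}
}
round(rate_terms(n = 500, p = 1000, s = 4, k = 2), 4)
# whichever of `linear` and `nonparam` is larger governs the bound on g-hat
```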
Theorem 1 implies that the rate for ĝ depends on two terms: s log p / n and n^{−(2k+2)/(2k+3)}. The first term indicates that the convergence rate of ĝ indeed depends on the convergence rate of β̂, which is discussed in Theorem 2. When the rate for the linear part is faster than the second term, that is, when s log p / n = o(n^{−(2k+2)/(2k+3)}), the convergence rate is governed by the second term, which implies that ∥ĝ(z) − g_0(z)∥² = O_p(n^{−(2k+2)/(2k+3)}) and ∥ĝ − g_0∥²_n = O_p(n^{−(2k+2)/(2k+3)}). This convergence rate is the same as the rate achieved by [46], and it also matches the minimax convergence rate achieved by [84]; both were shown in the Sobolev function space. In other situations, when the function g is sufficiently smooth or p is sufficiently large, the rate is governed by the first term. The proof of Theorem 1 can be found in Section 2.6.

We proceed to the convergence rate results for β.

Theorem 2. Assume Conditions 1-6. With the same choice of λ and γ as in Theorem 1, any solution β̂ in (2.6) satisfies
\[
\| \hat{\beta} - \beta_0 \|_2^2 \le c_3 \frac{s \log p}{n}, \qquad
\| x'(\hat{\beta} - \beta_0) \|^2 \le c_3 \frac{s \log p}{n}, \qquad
\| X(\hat{\beta} - \beta_0) \|_n^2 \le c_3 \frac{s \log p}{n}
\]
with probability at least 1 − p^{−c_4} − n^{−c_4}, as long as n ≥ n_0.

Theorem 2 implies two things about the estimator β̂: (1) an estimation result and (2) a prediction result. For the estimation rate and the prediction rates, both in-sample and out-of-sample, the estimator achieves the oracle rate, which is the same as in the standard lasso case where the function g is known. This rate coincides with results in classical high-dimensional research [81, 54, 74] as well as with the work based on smoothing splines [46, 84]. Note that the smoothness of g has no effect on the results for β, whereas the results for g are affected by the estimation of β. This one-way relationship is due to the orthogonal decomposition that we use in the proofs. The proof of Theorem 2 can be found in Section 2.6.

2.4 Simulation Studies

Through empirical experiments, we evaluate the performance of partial linear trend filtering (PLTF) in comparison to partial linear smoothing splines (PLSS), as presented by [46] and [84]. We also tested a B-spline basis for the nonparametric component, even though it has not been used for this model in the literature; its performance was significantly inferior to the other methods, so those results are not included here.

2.4.1 Computational Details

We use a block coordinate descent (BCD) algorithm, also known as the backfitting approach, for our problem in (2.3). Our approach involves two blocks: the first pertains to β and the second to θ. The algorithm iterates over these blocks, updating the estimate of each component at every step using the lasso for β and the univariate trend filtering algorithm for θ. The detailed steps of our algorithm are outlined in Algorithm 2.1, and an illustrative implementation sketch is given after it.

Algorithm 2.1 A BCD algorithm for high-dimensional partial linear trend filtering
Data: {y_i, x_i, z_i}, i = 1, . . . , n. Fixed (tuning) parameters: λ, γ, k.
1. Set t = 0 and initialize θ_i^{(0)} = 0, i = 1, . . . , n.
2. For the (t + 1)-th iteration, where t = 0, 1, 2, . . .:
   a) Block 1: Let y_i^{(t)*} = y_i − θ_i^{(t)}, and update β by fitting the lasso:
      β^{(t+1)} = argmin_β ∥y^{(t)*} − Xβ∥²_n + λ∥β∥_1.
   b) Block 2: Let y_i^{(t)**} = y_i − x_i′β^{(t+1)}, and update θ by fitting the univariate trend filtering:
      θ^{(t+1)} = argmin_θ ∥y^{(t)**} − θ∥²_n + γ∥D^{(z,k+1)}θ∥_1.
   c) If both β and θ have converged, stop the iteration; otherwise, continue until the predefined maximum number of iterations is reached.
3. Return β̂ and θ̂, the parameters at convergence.

The convergence of the BCD algorithm has been rigorously established by [71]: for a convex criterion that decomposes into a smooth term plus separable terms, the iterates of the BCD algorithm converge to an optimal solution. Therefore, we do not prove the convergence of the algorithm here, since our convergence result can be derived directly from this.
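To make Algorithm 2.1 concrete, the following is a minimal R sketch of the backfitting loop, assuming the glmnet package for the lasso block and a user-supplied univariate trend filtering solver tf_fit; for instance, glmgen::trendfilter could be wrapped to play that role. This is an illustration under those assumptions, not the authors' plmR implementation referenced in the next subsection.

```r
## Minimal illustrative sketch of Algorithm 2.1 (not the authors' plmR package).
## `tf_fit(z, y, k, gamma)` is a placeholder for any univariate trend filtering
## solver that returns fitted values theta at the observed z's.
library(glmnet)

pltf_bcd <- function(X, y, z, lambda, gamma, k = 2, tf_fit,
                     max_iter = 100, tol = 1e-6) {
  ord <- order(z)                           # re-order observations by z
  X <- X[ord, , drop = FALSE]; y <- y[ord]; z <- z[ord]
  n <- length(y)
  theta <- rep(0, n)                        # initialize theta^(0) = 0
  beta  <- rep(0, ncol(X))
  for (t in seq_len(max_iter)) {
    beta_old <- beta; theta_old <- theta
    # Block 1: lasso on the partial residual y - theta
    fit  <- glmnet(X, y - theta, alpha = 1, lambda = lambda,
                   intercept = FALSE, standardize = FALSE)
    beta <- as.numeric(coef(fit))[-1]       # drop the (zero) intercept
    # Block 2: univariate trend filtering on y - X %*% beta
    theta <- tf_fit(z, y - as.numeric(X %*% beta), k, gamma)
    # stop when both blocks have (numerically) converged
    if (max(abs(beta - beta_old)) < tol &&
        max(abs(theta - theta_old)) < tol) break
  }
  list(beta = beta, theta = theta, iters = t)
}
```

Note that glmnet's objective, (1/(2n))∥y − Xβ∥² + λ∥β∥_1, matches the scaling of the Block 1 subproblem, since (1/2)∥·∥²_n = (1/(2n))∥·∥².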
For the tuning parameters λ and γ, we use 5-fold cross-validation (CV) to find the tuned parameters that minimize the overall CV error. However, since there are two grids of tuning parameters, the search space is the product of the grid lengths for λ and γ, resulting in a multiplicative increase in computation time compared to an algorithm with a single tuning parameter. To address this, we utilize the warm start method [53] over our tuning parameter grids. Empirical evidence in that work shows that warm starting significantly reduces the number of iterations needed for convergence, thereby enhancing the speed of the algorithm.

Our algorithm is implemented using the R software [52]. For the lasso component, we utilize the R package glmnet [20], and for trend filtering, we employ the R package glmgen [1]. However, there is no publicly available software for the partial linear smoothing spline method, so we also use the BCD algorithm to compute the smoothing spline model, with the R function stats::smooth.spline used in the second block. The implemented BCD algorithms for the smoothing spline and trend filtering methods, named plmR, are available at https://sangkyustat.github.io/plmR/ for public use and reproducibility.

2.4.2 Simulation Setting and Results

We now proceed to demonstrate our method through simulations. We first generate X̃ = (x̃_1, . . . , x̃_{p+1}) from N_{p+1}(0_{p+1}, Σ), where Σ = (σ_{jk}) with σ_{jk} = 0.5^{|j−k|} and j, k = 1, . . . , p + 1. Then we set z = Φ(x̃_{25}), x_j = x̃_j for j = 1, . . . , 24 and x_j = x̃_{j−1} for j = 26, . . . , p + 1. The true models we consider are as follows:
\[
\text{Model 1 (Smooth function):} \quad y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + \sin(2\pi z_i) + \epsilon_i,
\]
\[
\text{Model 2 (Less smooth function):} \quad y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + e^{3 z_i}\sin(6\pi z_i)/7 + \epsilon_i,
\]
\[
\text{Model 3 (Doppler-type function):} \quad y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + \sin(4/z_i) + \epsilon_i,
\]
for i = 1, . . . , n, where ϵ_i ∼ N(0, σ²_ϵ). The actual forms of these functions are displayed in Figure 2.1. The values of β_j, j = 1, . . . , 4 are (0.5, 1, 1, 1.5). Various values of σ²_ϵ are used to vary the signal-to-noise ratio (SNR) of the models. Similar to [46] and [69], we define the total signal-to-noise ratio (tSNR) as
\[
\mathrm{tSNR} = \sqrt{ \frac{ \| x'\beta_0 + g_0(z) \|^2 }{ \sigma_\epsilon^2 } },
\]
where the expectation in the numerator can be approximated by Monte Carlo; an illustrative sketch of this data-generating process appears after the list of error metrics below. We vary the tSNR from 4 to 16 on a logarithmic scale and then calculate the metrics for each model.

Figure 2.1 Configurations of the function g(z) for Models 1 through 3. The structures of the functions become increasingly locally heterogeneous from Model 1 (left) to Model 3 (right).

Specifically, we consider three different error metrics, including the in-sample prediction mean squared error (MSE) and the ℓ_2-norm squared error, as follows:
1. ∥Xβ̂ + ĝ(z) − (Xβ_0 + g_0(z))∥²_n : overall MSE considering both β̂ and ĝ.
2. ∥β̂ − β_0∥²_2 : ℓ_2-norm squared error for β̂.
3. ∥ĝ(z) − g_0(z)∥²_n : MSE for ĝ.
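The following R sketch gives our reading of the data-generating process described above; it is not the authors' simulation code. In particular, we assume the 25th latent coordinate is used only to form z and is dropped from the design matrix, and g0 below is the Model 3 (Doppler-type) choice.

```r
## Illustrative sketch of the simulation design (our reading, not the authors' code)
library(MASS)

gen_pltf_data <- function(n = 500, p = 1000, sigma_eps = 1,
                          g0 = function(z) sin(4 / z)) {
  Sigma <- 0.5^abs(outer(1:(p + 1), 1:(p + 1), "-"))    # sigma_jk = 0.5^|j - k|
  Xt <- mvrnorm(n, mu = rep(0, p + 1), Sigma = Sigma)   # latent Gaussian design
  z  <- pnorm(Xt[, 25])                                 # z = Phi(x_tilde_25), in [0, 1]
  X  <- Xt[, -25]                                       # remaining p covariates (assumption)
  beta0 <- rep(0, p)
  beta0[c(6, 12, 15, 20)] <- c(0.5, 1, 1, 1.5)          # sparse linear signal
  signal <- as.numeric(X %*% beta0) + g0(z)
  y <- signal + rnorm(n, sd = sigma_eps)
  list(y = y, X = X, z = z, beta0 = beta0, signal = signal)
}

# Monte Carlo approximation of E||x'beta0 + g0(z)||^2, used to back out sigma_eps
# for a target tSNR = sqrt(E||x'beta0 + g0(z)||^2 / sigma_eps^2)
set.seed(1)
big <- gen_pltf_data(n = 20000, p = 200, sigma_eps = 1)
signal_second_moment <- mean(big$signal^2)
target_tSNR <- 8
sigma_eps   <- sqrt(signal_second_moment) / target_tSNR
```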
The metrics are computed over 150 repetitions of randomly generated datasets for each tSNR value, and the median of each metric is reported as the final result. We consider p equal to 100 and 1000 for the low- and high-dimensional cases, respectively, with n fixed at 500. For each method, the algorithm is fitted to the training dataset and the metrics are then calculated on the in-sample testing dataset. We present results using both optimally tuned parameters and CV-tuned parameters. The parameters are finely tuned for a fair comparison across all models. We fix k = 2 for the trend filtering method and use cubic smoothing splines for the smoothing spline method, tuning only λ and γ, for two reasons: (1) varying these settings has minimal impact on the results, and (2) computational efficiency. The simulation results for each model are displayed in Figures 2.2, 2.3, and 2.4.

Figure 2.2 Model 1 error comparisons over a grid of tSNR values from 4 to 16. n is set to 500 for all cases, and the repetition number for the simulation is 150.

Figure 2.3 Model 2 error comparisons over a grid of tSNR values from 4 to 16. n is set to 500 for all cases, and the repetition number for the simulation is 150.

Figure 2.4 Model 3 error comparisons over a grid of tSNR values from 4 to 16. n is set to 500 for all cases, and the repetition number for the simulation is 150.

Figure 2.2 demonstrates that when g is a smooth function, the performance of the PLTF method is comparable to that of the PLSS method, with no significant differences observed. However, as shown in Figure 2.3, when the function becomes more heterogeneous, the difference between the two methods becomes more pronounced. Nevertheless, there is still no definitive superiority of one method over the other in terms of overall mean squared error (MSE). In Figure 2.4, the MSE of PLTF is significantly lower than that of PLSS in both the low- and high-dimensional cases. This disparity becomes more apparent as the tSNR increases. This result indicates that PLTF performs much better than PLSS when the underlying function is more locally heterogeneous.

These findings suggest that for smooth functions, our method performs as well as the well-known smoothing spline method. However, as the function becomes more heterogeneous, our method significantly outperforms the smoothing spline approach in both low- and high-dimensional scenarios, particularly when the tSNR is high. This result aligns with the findings of [56], which reported similar outcomes for the Doppler-type function we considered.

2.5 Conclusion and Discussion

In this research, we have explored the integration of trend filtering with the partial linear model, incorporating the LASSO to address high-dimensional data structures. Our approach leverages the local adaptivity and computational efficiency of trend filtering along with the sparsity-inducing properties of the LASSO, providing a robust framework for handling complex and heterogeneous data.

Our work contributes to the field by extending the application of trend filtering to semiparametric models, particularly the partial linear model, where it had not previously been explored. This integration allows for more precise modeling of high-dimensional data, maintaining both local adaptivity and computational efficiency.

The results and discussions provided in this work underscore the potential of our method to enhance the analysis of complex data structures in various statistical fields.
Future re- search could further explore the practical applications of this approach and extend it to other types of semiparametric models. For future studies regarding methodology, our work can be expanded to the additive model like [56], which is the expansion version of the univariate trend filtering. This expansion allows us to put multivariate variables as z not only one. In terms of sparsity and variable selection, our work can be modified by using other nonconvex models such as SCAD instead of LASSO. We expect that this will give us more advantages in terms of the accuracy with variable selection of the covariates. 2.6 Proofs Throughout the proofs, we use C1, C2, . . . to denote universal constants and use D1, D2, . . . , ... to denote constants that may only depend on the constants k, Lg, Lh, Kx, Kϵ, ℓz, Λmin, Λmax from Conditions 1-5. An explicit (though not opti- mal) dependence of {Dj, j = 1, 2, . . .} on the aforementioned constants can be tracked down. However, since it does not provide much more insight, we will often not present the explicit forms of {Dj, j = 1, 2, . . .}, and this will greatly help streamline the proofs. The constants {Cj, Dj, j = 1, 2, . . .} may vary from lines to lines. 2.6.1 Proof of Theorem 1 2.6.1.1 Roadmap of the proof For a given function f (x, z) = x′β + g(z) with β ∈ Rp, T V (g(k)) < ∞, introduce the functional τδ0,R(f ) = λ∥β∥1 + γT V (g(k)) 20δ0R + ∥x′β + g(z)∥, δ0 ∈ (0, 1), R > 0, (2.7) where for notational simplicity we have suppressed the dependence of τδ0,R(f ) on the tuning parameters λ, γ > 0. This functional will serve as a critical measure for β and g(z). Define 19 the following events: T1(δ0, R) = T2(δ0, R) = (cid:40) (cid:40) (cid:40) sup τδ0,R(f )≤R sup τδ0,R(f )≤R (cid:12) (cid:12)∥f ∥2 (cid:12) n − ∥f ∥2(cid:12) (cid:12) ≤ δ0R2 (cid:12) (cid:41) , (cid:12) (cid:12) (cid:12)⟨ε, f (X, Z)⟩n (cid:41) , (cid:12) (cid:12) ≤ δ0R2 (cid:12) (cid:41) . 22 log n ℓzn T3 = max 1≤i≤n−1 (z(i+1) − z(i)) ≤ The general proof idea is largely inspired by [46] (see also [84]). We first use a localization technique based on convexity to show that under the event T1(δ0, R) ∩ T2(δ0, R) ∩ T3, the bound on ∥(cid:98)g −g0∥ in Theorem 1 (bound on ∥(cid:98)g −g0∥n is derived similarly) holds. This is done in Lemmas 1-2. Then in Lemmas 3 and 4 we show that the event T1(δ0, R) ∩ T2(δ0, R) ∩ T3 happens with high probability. Along the way we need to choose R, λ, γ, δ0 in (2.7) properly to meet conditions in Lemmas 1-4 and to achieve the desirable error rate in Theorem 1. We complete this step in Section 2.6.1.3. 2.6.1.2 Key lemmas Lemma 1. Assuming Condition 4 and δ0 ≤ 1 100, δ0R2 ≥ D1( log2 n n2 + γ + λ2s), then on T1(δ0, R) ∩ T2(δ0, R) ∩ T3, there exists ¯g ∈ Hn such that (i) T V (¯g(k)) ≤ ak ·T V (g(k) 0 ), ∥¯g−g0∥∞ ≤ 22bk log n ℓzn ·T V (g(k) 0 ), where ak, bk > 0 are constants only dependent on k. (ii) τδ0,R(x′((cid:98)β − β0) + (cid:98)g(z) − ¯g(z)) ≤ R. Proof. According to Lemma 16 in [70] (see also Lemma 13 in [56]), there exists ¯g ∈ Hn such that result (i) holds on T3. We use such a ¯g in the rest of the proof. Consider the convex combination Accordingly, define (cid:101)β = t(cid:98)β + (1 − t)β0, (cid:101)g = t(cid:98)g + (1 − t)¯g, t ∈ [0, 1]. (cid:98)f (x, z) = x′ (cid:98)β + (cid:98)g(z), ¯f (x, z) = x′β0 + ¯g(z), (cid:101)f (x, z) = t (cid:98)f + (1 − t) ¯f = x′ (cid:101)β + (cid:101)g(z). 
20 For the choice of t = R R+τδ0,R( (cid:98)f − ¯f ) , it is straightforward to verify that τδ0,R( (cid:101)f − ¯f ) = R · τδ0,R( (cid:98)f − ¯f ) R + τδ0,R( (cid:98)f − ¯f ) ≤ R. (2.8) Hence, to prove result (ii) it is sufficient to show τδ0,R( (cid:101)f − ¯f ) ≤ R inequality due to the convex problem (2.6), 2 . We start with the basic 1 2 ∥y − X(cid:101)β − (cid:101)g(Z)∥2 n + λ∥(cid:101)β∥1 + γT V ((cid:101)g(k)) ≤ 1 2 ∥y − Xβ0 − ¯g(Z)∥2 n + λ∥β0∥1 + γT V (¯g(k)), which can be simplified to 1 ∥X(β0 − (cid:101)β) + g0(Z) − (cid:101)g(Z)∥2 2 1 2 ∥g0(Z) − ¯g(Z)∥2 ≤ n + λ∥(cid:101)β∥1 + γT V ((cid:101)g(k)) n + ⟨ε, X((cid:101)β − β0) + (cid:101)g(Z) − ¯g(Z)⟩n + λ∥β0∥1 + γT V (¯g(k)). Using the triangle inequality for ∥ · ∥1 and T V (·), we can continue from the above to obtain 1 ∥X(β0 − (cid:101)β) + g0(Z) − (cid:101)g(Z)∥2 2 1 2 ∥g0(Z) − ¯g(Z)∥2 ≤ n + λ∥(cid:101)β − β0∥1 + γT V ((cid:101)g(k) − ¯g(k)) n + ⟨ε, X((cid:101)β − β0) + (cid:101)g(Z) − ¯g(Z)⟩n + 2λ∥(cid:101)βS − β0 S∥1 + 2γT V (¯g(k)). Observing from (2.8) that (cid:101)f − ¯f = x′((cid:101)β − β0) + (cid:101)g(z) − ¯g(z) belongs to the set {f : τδ0,R(f ) ≤ R}, on T1(δ0, R) ∩ T2(δ0, R) ∩ T3 we can proceed with (2.9) ∥ (cid:101)f − ¯f ∥2 + 4λ∥(cid:101)β − β0∥1 + 4γT V ((cid:101)g(k) − ¯g(k)) ≤ ∥X(β0 − (cid:101)β) + ¯g(Z) − (cid:101)g(Z)∥2 ≤ 2∥X(β0 − (cid:101)β) + g0(Z) − (cid:101)g(Z)∥2 n + δ0R2 + 4λ∥(cid:101)β − β0∥1 + 4γT V ((cid:101)g(k) − ¯g(k)) n + δ0R2 + 4λ∥(cid:101)β − β0∥1 n + 2∥¯g(Z) − g0(Z)∥2 + 4γT V ((cid:101)g(k) − ¯g(k)) ≤ 4∥g0(Z) − ¯g(Z)∥2 n + 4⟨ε, X((cid:101)β − β0) + (cid:101)g(Z) − ¯g(Z)⟩n + 8λ∥β0 S − (cid:101)βS∥1 + 8γT V (¯g(k)) + δ0R2 ≤ 5δ0R2 + 4b2 kL2 g 222 log2 n √ zn2 + 8akLgγ + 8λ ℓ2 s∥(cid:101)β − β0∥2, (2.10) 21 where the third inequality holds due to (2.9), and in the last inequality we have used result (i) and Cauchy-Schwarz inequality for ∥β0 S − (cid:101)βS∥1. Now Condition 4 implies that √ 8λ s∥(cid:101)β − β0∥2 ≤ 8λ √ ≤ 32λ2sΛ−1 min + 1 2 sΛ−1/2 min ∥(cid:101)x′((cid:101)β − β0)∥ ∥(cid:101)x′((cid:101)β − β0)∥2 ≤ 32λ2sΛ−1 min + 1 2 ∥ (cid:101)f − ¯f ∥2. Here, the second inequality follows ab ≤ 1 2b2, and the third inequality is due to the 2a2 + 1 orthogonal decomposition ∥x′β + g(z)∥2 = ∥(cid:101)x′β∥2 + ∥h(z)′β + g(z)∥2. This together with (2.10) yields ∥ (cid:101)f − ¯f ∥2 + 8λ∥(cid:101)β − β0∥1 + 8γT V ((cid:101)g(k) − ¯g(k)) 222 log2 n zn2 + 16akLgγ + 64λ2sΛ−1 ℓ2 ≤ 10δ0R2 + 8b2 kL2 g min ≤ 16δ0R2, where the last inequality holds under the assumed condition on δ0R2 (with proper choice of constant D1). The above bound further implies ∥ (cid:101)f − ¯f ∥ ≤ 4 λ∥(cid:101)β−β0∥1+γT V ((cid:101)g(k)−¯g(k)) 20δ0R 10R, leading to the bound τδ0,R( (cid:101)f − ¯f ) ≤ R 5R since δ0 ≤ 1 . 10 + 2R δ0R ≤ 2 5 = R , and ≤ 1 √ 100 2 Lemma 2. Under the same conditions of Lemma 1, then on T1(δ0, R) ∩ T2(δ0, R) ∩ T3, it holds that ∥(cid:98)g(z) − g0(z)∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + 22bkLgℓ−1 z log n n . Moreover, define a subevent of T1(δ0, R): T0(δ0, R, (cid:101)R) = (cid:40) sup τδ0,R(f )≤ (cid:101)R (cid:12) (cid:12)∥f ∥2 (cid:12) n − ∥f ∥2(cid:12) (cid:12) ≤ δ0R2 (cid:12) (cid:41) with (cid:101)R = (2 + it holds that (cid:113) Λmax Λmin )R + 22bkLgℓ−1 z log n n + (ak+1)Lgγ 20δ0R . Then on T0(δ0, R, (cid:101)R) ∩ T2(δ0, R) ∩ T3, ∥(cid:98)g(z) − g0(z)∥n ≤ (cid:16) 1 + (cid:112) δ0 + (cid:114) Λmax Λmin (cid:17) R + 22bkLgℓ−1 z log n n . Proof. Part (ii) in Lemma 1 together with Part (i) of Lemma 5 implies that ∥(cid:98)g(z) − ¯g(z)∥ ≤ (1 + )R. 
As a result, combining it with Part (i) of Lemma 1 gives (cid:113) Λmax Λmin ∥(cid:98)g(z) − g0(z)∥ ≤ ∥(cid:98)g(z) − ¯g(z)∥ + ∥¯g(z) − g0(z)∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + 22bkLg log n ℓzn . 22 To prove the second result, we first use results (i)(ii) of Lemma 1 to bound γT V ((cid:98)g(k) − g(k) 0 ) 20δ0R γT V ((cid:98)g(k) − ¯g(k)) 20δ0R Combining the above with the first result on ∥(cid:98)g(z) − g0(z)∥, we see that γT V (¯g(k) − g(k) 0 ) 20δ0R ≤ R + ≤ + (ak + 1)Lgγ 20δ0R . τδ0,R((cid:98)g − g0) = γT V ((cid:98)g(k) − g(k) 0 ) 20δ0R + ∥(cid:98)g(z) − g0(z)∥ ≤ (cid:101)R. Hence, on T0(δ0, R, (cid:101)R) ∩ T2(δ0, R) ∩ T3, we can use the bound on ∥(cid:98)g(z) − g0(z)∥ to obtain the bound on ∥(cid:98)g(z) − g0(z)∥n: ∥(cid:98)g(z) − g0(z)∥n ≤ (cid:112)∥(cid:98)g(z) − g0(z)∥2 + δ0R2 ≤ ∥(cid:98)g(z) − g0(z)∥ + (cid:112) δ0R. Lemma 3. Assume Conditions 1, 3, 4 hold. Define KF = (1 ∨ Cz,k) (cid:16)20δ0R γ + 1 + (cid:114) Λmax Λmin (cid:17) R, (2.11) where Cz,k > 0 is the constant in Lemma 5. Suppose that (cid:114) log p n ≤ 1, δ0 log p n R2 λ2 ≤ D1, D2δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 ≤ R2, D3(1 + K 2 F )R2 log R−1 ≤ nλ2, D4(λ2nR −2k−1 k+1 K −1 k+1 F ∧ λ √ nR−1) ≥ log n. Then, P(T1(δ0, R)) ≥ 1 − 2p−10 − 8pe−D4(λ2nR −2k−1 k+1 K −1 k+1 F √ ∧λ − nR−1) − 8pe 0 R2n δ2 D5K2 F . Further assume 1 2δ0R2 ≥ D1γ, R ≥ log n n where this constant D1 is the one from Lemma 1. It holds that P(T0(δ0, R, (cid:101)R)) ≥ 1 − 2p−10 − 8pe−D4(λ2nR −2k−1 k+1 K −1 k+1 F √ ∧λ nR−1) − 8pe − δ2 0 R2n D5K2 F . Proof. For a given function f (x, z) = x′β + g(z), it is clear that ∥f ∥2 n − ∥f ∥2 = 1 n n (cid:88) ((x′ iβ)2 − ∥x′β∥2) + 1 n n (cid:88) (g2(zi) − ∥g(z)∥2) i=1 (x′ iβg(zi) − Ex′βg(z)), i=1 2 n + n (cid:88) i=1 23 such that sup τδ0,R(f )≤R (cid:12) (cid:12)∥f ∥2 (cid:12) n − ∥f ∥2(cid:12) (cid:12) (cid:12) ≤ sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12) 1 n n (cid:88) ((x′ (cid:12) iβ)2 − ∥x′β∥2) (cid:12) (cid:12) + sup τδ0,R(f )≤R + sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) i=1 1 n 2 n n (cid:88) i=1 n (cid:88) i=1 (cid:12) (g2(zi) − ∥g(z)∥2) (cid:12) (cid:12) (x′ (cid:12) iβg(zi) − Ex′βg(z)) (cid:12) (cid:12) := I + II + III. We now bound the above three terms separately. According to Lemma 5 Part (i), the β in any f (x, z) = x′β + g(z) with τδ0,R(f ) ≤ R satisfies β ∈ B := (cid:110) β : ∥β∥1 ≤ 20δ0R2λ−1, ∥x′β∥ ≤ (2 + (cid:114) Λmax Λmin (cid:111) . )R (2.12) Therefore, I ≤ sup β∈B (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 ((x′ (cid:12) iβ)2 − ∥x′β∥2) (cid:12) (cid:12) ≤ max 1≤j,k≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:12) (xijxik − Exjxk) (cid:12) (cid:12) · sup β∈B ∥β∥2 1, (2.13) where in the last inequality we have used the fact |a′Aa| ≤ ∥a∥2 1 ·∥A∥max, ∀a ∈ Rp, A ∈ Rp×p. are independent, zero-mean, sub-exponential Condition 1 implies that {xijxik − Exjxk}n i=1 random variables with the sub-exponential norm ∥xijxik − Exjxk∥ψ1 ≤ C1K 2 Bernstein’s inequality in Theorem 3 together with a simple union bound to obtain that when . We then use x log p n ≤ 1, max 1≤j,k≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (xijxik − Exjxk) (cid:12) (cid:12) ≤ C2K 2 (cid:12) x (cid:114) log p n holds with probability at least 1 − 2p−10. Putting this result together with (2.13) enables us to conclude P(I ≤ δ0R2/3) ≥ 1 − 2p−10, as long as δ0 (cid:113) log p n Now we bound II. Using Lemma 5 we have II ≤ supg∈F R2 λ2 ≤ D1. 
(cid:12) (cid:12)∥g∥2 n − ∥g∥2(cid:12) (cid:12), where F := (cid:110) g : ∥g∥ ≤ RF , T V (g(k)) ≤ 20δ0R2γ−1, ∥g∥∞ ≤ KF RF = (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R, KF = (1 ∨ Cz,k) (cid:16)20δ0R γ + 1 + (cid:111) , (cid:114) Λmax Λmin (cid:17) R. (2.14) 24 We aim to apply Theorem 5 to bound supg∈F Theorem 5, we first calculate J0(t, F): (cid:12) (cid:12)∥g∥2 n − ∥g∥2(cid:12) (cid:12). Adopting the notation in J0(t, F) ≤ C3t (cid:115) (cid:90) 1 0 C4 (cid:16) ut 2KF (cid:17)− 1 k+1 du = C3 (cid:124) (cid:112) C4K 2 1 2k+2 F (cid:123)(cid:122) ¯C3 1 2k+2 (2k + 2) 2k + 1 (cid:125) 2k+1 2k+2 , ·t (2.15) where the first inequality is due to the entropy bound of Corollary 1 in [56]. As a result, H(u) = sup t≥0 (cid:16)2k + 1 4k + 4 = (cid:0)ut − (J −1 0 (t, F))2(cid:1) ≤ sup t≥0 (cid:0)ut − ¯C − 4k+4 2k+1 3 4k+4 2k+1 (cid:1) t (cid:17) 4k+4 2k+3 2k + 3 2k + 1 4k+4 ¯C 2k+3 3 u 4k+4 2k+3 = D2K 2 2k+3 F 4k+4 2k+3 . u We are ready to invoke Theorem 5. The condition R2 F ≥ H(4KF / n) is reduced to √ R2 ≥ D3K 2 F n −2k−2 2k+3 , (2.16) and under this condition it holds with probability at least 1 − e−t that (cid:12) (cid:12)∥g∥2 (cid:12) n − ∥g∥2(cid:12) (cid:12) (cid:12) ≤ D4 · (cid:16) K sup g∈F 2k+1 2k+2 2k+3 2k+2 F R √ n + √ KF R √ n t + K 2 F t n (cid:17) . Given the above, choosing t = δ2 and assuming a slightly stronger condition R2 ≥ 0 R2n D5K2 F −2k−2 2k+3 compared to (2.16), it is then straightforward to verify that P(II ≤ −4k−4 2k+3 D6δ 0 K 2 F n δ0R2/3) ≥ 1 − e − 0 R2n δ2 F . D5K2 Next we bound III. We first use Hölder’s inequality and Lemma 5 to obtain III ≤ sup 2∥β∥1 · τδ0,R(f )≤R 40δ0R2 λ · ≤ sup g∈F ,1≤j≤p (cid:12) (cid:12) (cid:12) i=1 n 1 (cid:88) n i=1 n (cid:88) (cid:13) (cid:13) (cid:13) 1 n (cid:13) (xig(zi) − Exg(z)) (cid:13) (cid:13)∞ (cid:12) (xijg(zi) − Exjg(z)) (cid:12) (cid:12). We then aim to use Theorem 6 to bound supg∈F | 1 j ≤ p. The quantities QF , (cid:98)QF in Theorem 6 become QF = supg∈F i=1(xijg(zi) − Exjg(z))| for each 1 ≤ (cid:113)E(x2 j g2(z)), (cid:98)QF = ijg2(zi). Let (cid:98)RF = supg∈F ∥g∥n, Xj = (x1j, . . . , xnj), Z = (z1, . . . , zn), supg∈F and ϵ1, . . . , ϵn be a Rademacher sequence independent of Xj, Z. Applying Dudley’s entropy (cid:113) 1 n i=1 x2 (cid:80)n n (cid:80)n 25 integral gives (cid:16) E 1 n | sup g∈F ≤C5 (cid:98)QF√ n (cid:90) 1 0 n (cid:88) i=1 (cid:113) (cid:12) (cid:17) (cid:12) (cid:12)Xj, Z ϵixijg(zi)| ≤ C5 (cid:98)QF√ n (cid:90) 1 0 (cid:113) H(u (cid:98)QF , F, ∥ · ∥w n )du, H(u (cid:98)QF ∥Xj∥−1 ∞ , F, ∥ · ∥n)du ≤ J0(2 (cid:98)QF ∥Xj∥−1 ∞ , F), C5∥Xj∥∞ √ n 2C3 2k+1 2k+2 (cid:98)R F 1 2k+2 F ∥Xj∥∞, ≤D7n−1/2K 1 2k+2 F 2k+1 2k+2 (cid:98)Q F 1 2k+2 ∥Xj∥ ∞ ≤ D7n−1/2K (2.17) where J0(·, F) is from (2.15) and the second to last inequality is due to (2.15); ∥ · ∥w n denotes the weighted empirical norm ∥g∥w n = ∥Xj∥∞∥g∥n and (cid:98)QF ≤ ∥Xj∥∞ (cid:98)RF . Now applying the second result in Theorem 6 yields ijg2(zi) and it is clear that ∥g∥w i=1 x2 n ≤ (cid:113) 1 n (cid:80)n (cid:16) P 1 n | sup g∈F n (cid:88) i=1 ϵixijg(zi)| ≥ C6∥Xj∥∞ (cid:16) D7n−1/2K 1 2k+2 F 2k+1 2k+2 (cid:98)R F + (cid:98)RF (cid:112)t/n (cid:17)(cid:12) (cid:12) (cid:12)Xj, Z (cid:17) ≤ e−t. (2.18) Define the events A1 = (cid:110) ∥Xj∥∞ ≤ C7Kx((cid:112)log n + √ t) (cid:111) , A2 = (cid:110) sup g∈F |∥g∥2 n − ∥g∥2| ≤ δ0R2/3 (cid:111) . Condition 1 together with standard union bound for sub-Gaussian tails gives P(Ac δ2 0 R2n D5K2 (with a proper choice of C7). 
The previous result on term II shows P(Ac − 1) ≤ e−t F , and it 2) ≤ e holds on A2 that (cid:98)RF ≤ (cid:114) sup g∈F |∥g∥2 n − ∥g∥2| + sup g∈F ∥g∥2 ≤ (cid:113) δ0/3 + (1 + (cid:112)Λmax/Λmin)2R. (2.19) As a result, we can further bound (2.18) on A1 ∩ A2 to arrive at (cid:16) P 1 n | sup g∈F n (cid:88) i=1 ϵixijg(zi)| ≥ D8((cid:112)log n + √ (cid:16) n−1/2K 1 2k+2 F R 2k+1 2k+2 + R(cid:112)t/n (cid:17)(cid:12) (cid:17) (cid:12) (cid:12)Xj, Z · t) We integrate out the above conditional probability to obtain the unconditional one: (cid:16) P 1 n | sup g∈F n (cid:88) i=1 ϵixijg(zi)| ≥ D8((cid:112)log n + √ t)(cid:0)n−1/2K 1 2k+2 F R 2k+1 2k+2 + R(cid:112)t/n(cid:1)(cid:17) 1A1∩A2 ≤ e−t ≤ e−t + P(Ac 1 ∪ Ac 2) ≤ 2e−t + e − δ2 0 R2n D5K2 F . 26 It is then direct to verify that choosing t = D9(λ2nR −2k−1 k+1 K −1 k+1 F ∧ λ √ nR−1) ≥ log n gives (cid:16) P 1 n | sup g∈F n (cid:88) i=1 ϵixijg(zi)| ≥ (cid:17) λ 480 ≤ 2e−D9(λ2nR −2k−1 k+1 K −1 k+1 F √ ∧λ nR−1) + e − 0 R2n δ2 D5K2 F . (2.20) We will invoke the symmetrization result of Theorem 6 to transfer the above bound to (cid:80)n i=1(xijg(zi) − Exjg(z))|. In order to do so, we need to show 8 × 4802Q2 F ≤ nλ2. supg∈F | 1 n Note that Q2 F = sup g∈F Ex2 j g2(z) = sup g∈F (cid:0)Ex2 j g2(z)1(|xj| ≤ s) + Ex2 ≤ s2 sup g∈F ∥g∥2 + sup g∈F ∥g∥2 ∞ · Ex2 j 1(|xj| > s) ≤ s2(cid:16) j g2(z)1(|xj| > s)(cid:1) (cid:114) Λmax Λmiin (cid:17)2 1 + R2 + K 2 F · Ex2 j 1(|xj| > s). Using the sub-Gaussian tail bound for xj to bound Ex2 that Q2 F ≤ D10(1 + K 2 condition D12(1 + K 2 F )R2 log R−1 with the choice s = D11 F )R2 log R−1 ≤ nλ2, Theorem 6 together with (2.20) leads to j 1(|xj| > s), it is not hard to obtain (cid:112)log R−1. Therefore, under the (cid:16) P 1 n | sup g∈F n (cid:88) (xijg(zi) − Exjg(z))| ≥ i=1 (cid:17) λ 120 ≤ 8e−D9(λ2nR We follow via a union bound, −2k−1 k+1 K −1 k+1 F √ ∧λ − nR−1) + 4e 0 R2n δ2 D5K2 F . P(III ≥ δ0R2/3) ≤ P (cid:16) sup g∈F ,1≤j≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:12) (xijg(zi) − Exjg(z)) (cid:12) (cid:12) ≥ (cid:17) λ 120 ≤ p · max 1≤j≤p (cid:16) P 1 n | sup g∈F n (cid:88) i=1 (xijg(zi) − Exjg(z))| ≥ (cid:17) λ 120 ≤ 8pe−D9(λ2nR −2k−1 k+1 K −1 k+1 F √ ∧λ nR−1) + 4pe − 0 R2n δ2 D5K2 F . Finally, we collect the results on I, II, III to bound P(T1(δ0, R)) via P(T1(δ0, R)) ≥ 1−P(I ≥ δ0R2/3) − P(II ≥ δ0R2/3) − P(III ≥ δ0R2/3). Regarding the bound for P(T0(δ0, R, (cid:101)R)), first note that 2R ≤ (cid:101)R ≤ D12R under the . Moreover, {f : τδ0,R(f ) ≤ (cid:101)R} ⊂ {f : τδ0, (cid:101)R(f ) ≤ (cid:101)R}. 2δ0R2 ≥ D1γ, R ≥ log n assumption 1 n Hence, P(T0(δ0, R, (cid:101)R)) ≥ P (cid:16) sup τδ0, (cid:101)R(f )≤ (cid:101)R 27 (cid:12) (cid:12)∥f ∥2 (cid:12) n − ∥f ∥2(cid:12) (cid:12) ≤ D−2 (cid:12) 12 δ0 (cid:101)R2(cid:17) . Essentially, all the previous arguments in bounding P(T1(δ0, R)) carry over into P(T0(δ0, R, (cid:101)R)), up to constants Dj’s. We only need to update those constants in the condi- tions and results, in a way so that P(T1(δ0, R)) and P(T0(δ0, R, (cid:101)R)) share the same constants (e.g., taking a minimum or maximum among constants). Lemma 4. Assume Conditions 1-4 hold. (i) When R2 ≥ D1δ 0 −4k−4 2k+3 n 3, then −2k−2 2k+3 (K 2 F +K 2 2k+3 F σ 4k+4 2k+3 ) where KF is defined in (2.11) of Lemma (cid:16) P T2(δ0, R) (cid:17) ≥ 1 − 2pe−C1n min (cid:0) λ2 (KxKεσ)2 , λ KxKεσ (cid:1) − 2e−D2σ−2δ2 − 0 nR2 − e 0 R2n δ2 D3K2 F . (ii) As long as 11 log n ℓzn < 1, then P(cid:0)T3 (cid:1) ≥ 1 − 2ℓzn−10. Proof. 
Recalling the sets B in (2.12) and F in (2.14) from the proof of Lemma 3, we have sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12)⟨ε, f (X, Z)⟩n (cid:12) (cid:12) (cid:12) ≤ sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12) 1 n ≤ sup β∈B ∥β∥1 · (cid:13) (cid:13) (cid:13) n (cid:88) εix′ iβ (cid:12) (cid:12) (cid:12) + sup τδ0,R(f )≤R (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 εig(zi) (cid:12) (cid:12) (cid:12) n (cid:88) i=1 1 n i=1 xiεi (cid:13) (cid:13) (cid:13)∞ + sup g∈F (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:12) (cid:12) εig(zi) (cid:12) ≤ 20δ0R2 λ (cid:13) (cid:13) (cid:13) 1 n n (cid:88) xiεi (cid:13) (cid:13) (cid:13)∞ + sup g∈F (cid:12) (cid:12) (cid:12) 1 n n (cid:88) (cid:12) (cid:12) (cid:12) := I + II. εig(zi) i=1 We first bound term I. Using Bernstein’s inequality in Theorem 3 we obtain that i=1 P(| 1 n (cid:80)n i=1 εixij| ≥ λ 40) ≤ 2e−C1n min (cid:0) λ2 (KxKεσ)2 , λ KxKεσ (cid:1) . This combined with a union bound gives P(I ≤ δ0R2/2) ≥ P (cid:16)(cid:13) (cid:13) (cid:13) 1 n n (cid:88) i=1 xiεi (cid:13) (cid:13) (cid:13)∞ ≤ (cid:17) λ 40 ≥ 1 − 2pe−C1n min (cid:0) λ2 (KxKεσ)2 , λ KxKεσ (cid:1) . To bound term II, let (cid:98)RF = supg∈F ∥g∥n. Conditional on z1, . . . , zn (which are independent from ε1, . . . , εn), Theorem 7 implies that the following holds with probability at least 1−2e−t: II ≤ C2Kεσ · (cid:32) (cid:98)RF√ n (cid:90) 1 0 (cid:113) H(u (cid:98)RF , F, ∥ · ∥n)du + (cid:98)RF (cid:33) , (cid:114) t n ≤ C2Kεσ · (cid:16) D1n−1/2K 1 2k+2 F 2k+1 2k+2 (cid:98)R F + (cid:98)RF (cid:17) (cid:112)t/n , 28 where the last inequality follows as in (2.17). Integrating over z1, . . . , zn, the above holds marginally with probability at least 1 − 2e−t as well. Moreover, from (2.19) we already know that, P( (cid:98)RF ≤ D3R) ≥ 1 − e − δ2 0 R2n D4K2 F , when R2 ≥ D2δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 . Combining these two results via a union bound gives us (cid:16) P II ≤ D5σ (cid:16) n−1/2K 1 2k+2 F R 2k+1 2k+2 + R(cid:112)t/n (cid:17)(cid:17) ≥ 1 − 2e−t − e − 0 R2n δ2 D4K2 F . Choosing t = D6σ−2δ2 0nR2 and assuming R2 ≥ D7δ −4k−4 2k+3 0 n direct to confirm −2k−2 2k+3 (K 2 F + K 2 2k+3 F 4k+4 2k+3 ), it is σ P(II ≤ δ0R2/2) ≥ 1 − 2e−D6σ−2δ2 − 0 nR2 − e 0 R2n δ2 D4K2 F . Combining the bounds for I, II gives the bound for P(T2(δ0, R)) ≥ P(I ≤ δ0R2/2, II ≤ δ0R2/2) ≥ 1 − P(I > δ0R2/2) − P(II > δ0R2/2). Finally, the bound on P(cid:0)T3 (cid:1) is directly taken from Lemma 5 in [77]. Lemma 5. Consider any f (x, z) = x′β + g(z) that satisfies τδ0,R(f ) ≤ R. (i) Under Condition 4 it holds that ∥β∥2 ≤ Λ−1/2 (cid:16) ∥g∥ ≤ 1 + min R, ∥β∥1 ≤ (cid:114)Λmax Λmin (cid:17) R, T V (g(k)) ≤ 20δ0R2 γ . 20δ0R2 λ , ∥x′β∥ ≤ (cid:16) 2 + (cid:114) Λmax Λmin (cid:17) R, (ii) Under additional Condition 3, it holds that ∥g∥∞ ≤ Cz,k (cid:16)20δ0R γ + 1 + (cid:114) Λmax Λmin (cid:17) R, where Cz,k > 0 is a constant only dependent on k and ℓz. Proof. The bound on ∥β∥1 and T V (g(k)) is clear from the definition of τδ0,R(f ) in (2.7). Using the orthogonal decomposition: ∥x′β + g(z)∥2 = ∥(cid:101)x′β∥2 + ∥h(z)′β + g(z)∥2 ≤ R2, 29 we have ∥β∥2 ≤ Λ−1/2 min ∥(cid:101)x′β∥ ≤ Λ−1/2 min R, and ∥g(z)∥ ≤ ∥h(z)′β + g(z)∥ + ∥h(z)′β∥ ≤ R + Λ1/2 max∥β∥2 ≤ (cid:16) 1 + (cid:114)Λmax Λmin (cid:17) R. Having the bound on ∥g∥, we can further obtain ∥x′β∥ ≤ ∥x′β + g(z)∥ + ∥g(z)∥ ≤ (2 + (cid:113) Λmax Λmin It remains to bound ∥g∥∞. Define g∗(z) = ∗ ) ≤ 1, ∥g∗∥ ≤ 1 , then T V (g(k) )R. g(z) ( 20δ0R γ +1+ (cid:113) Λmax Λmin )R due to the derived bounds on ∥g∥ and T V (g(k)). 
Hence it is sufficient to show that sup ∗ )≤1,∥g∗∥≤1 g∗:T V (g(k) ∥g∗∥∞ ≤ Cz,k. To prove the above, we first decompose g∗ = p∗ + q∗, where p∗ is a polynomial of degree k and q∗ is orthogonal to all polynomials of degree k (with respect to the L2 inner product (cid:82) 1 0 p(z)q(z)dz). Note that T V (q(k) ∗ ) = T V (g(k) ∥q∗∥∞ ≤ ck for a constant ck > 0 only depending on k. Now write p∗(z) = (cid:80)k ∗ ) ≤ 1. Then Lemma 5 in [56] implies that j=0 ajzj. We have (cid:112) a′V0a = ∥p∗∥ ≤ ∥g∗∥ + ∥q∗∥ ≤ 1 + ck, where a = (a0, . . . , ak) and V0 = E(v(z)v(z)′) with v(z) = (1, z, z2, . . . , zk). Thus, we obtain ∥p∗∥∞ ≤ ∥a∥1 ≤ √ k + 1∥a∥2 ≤ √ k + 1(1 + ck)λ−1/2 min (V0), which implies ∥g∗∥∞ ≤ ∥p∗∥∞ + ∥q∗∥∞ ≤ min (V0) + ck. Finally, we need to show λmin(V0) is bounded below by a constant only depending on k and ℓz. Under Condition k + 1(1 + ck)λ−1/2 √ 3 we have λmin(V0) = min ∥b∥2=1 E(b′v(z))2 ≥ ℓz · min ∥b∥2=1 E(b′v((cid:101)z))2, (cid:101)z ∼ U nif (0, 1) = ℓz · E(b′ ∗v((cid:101)z))2, where b∗ is the minimizer; E(b′ ∗v((cid:101)z))2 is a constant only depending on k, and it is positive because the roots of any polynomial (not identically zero) form a set of Lebesgue measure zero. 30 2.6.1.3 Completion of the proof We are in the position to complete the proof of Theorem 1. In fact, we shall prove Theorems 1 and 2 altogether. We will choose R2, λ, γ, δ0 such that the conditions in Lemmas 1-4 (for Theorem 1) and Lemmas 6-7 (for Theorem 2) are all satisfied. As a result, we will then conclude that the bounds on (cid:98)g(z) in Lemma 2 and bounds on (cid:98)β in Lemma 6 hold with probability at least P(T0(δ0, R, (cid:101)R) ∩ T2(δ0, R) ∩ T3 ∩ S1 ∩ S2 ∩ S3). Towards this end, we first list the conditions to meet2: (i) Conditions from Lemmas 1-2: δ0 ≤ 1 n2 + γ + λ2s). 100, δ0R2 ≥ D1( log2 n (cid:113) log p n (ii) Conditions from Lemma 3: log p n ≤ 1, δ0 F )R2 log R−1 ≤ nλ2, D5(λ2nR R2 λ2 ≤ D2, D3δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 ≤ R2, −2k−1 k+1 K −1 k+1 F ∧ λ √ nR−1) ≥ log n, 1 2δ0R2 ≥ D4(1 + K 2 D1γ, R ≥ log n n . Recall that KF is defined in (2.11). (iii) Conditions from Lemma 4: R2 ≥ D6δ 0 −4k−4 2k+3 n −2k−2 2k+3 (K 2 F + K 2 2k+3 F 4k+4 2k+3 ), σ 11 log n ℓzn < 1. (iv) Conditions from Lemma 6: δ0 ≤ 1 100, δ0R2 ≥ D1( log2 n n2 +γ +λ2s), λ ≥ D7(R2λ−1 log2 n n2 + γ). (v) Conditions from Lemma 7: log p n ≤ 1, δ0 F )R2 log R−1 ≤ nλ2, D11(λ2nR (cid:113) log p n R2 λ2 ≤ D8, D9δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 ≤ R2, −2k−1 k+1 K −1 k+1 F ∧ λ √ nR−1) ≥ log n, log n n (1 + D10(1 + K 2 KF + σ + R2λ−1) + σ−1R2 ≤ D12λ. We choose δ0 = 1 100, λ = Dσ (cid:113) log p n , γ = δ0R2 2D1 , where D > 0 is any given constant that may only depend on k, Lg, Lh, Kx, Kϵ, ℓz, Λmin, Λmax. It remains to specify R2. We will consider R2 ≤ 1, which together with γ = δ0R2 2D1 implies KF = (1 ∨ Cz,k) (cid:16) 20δ0R γ + 1 + (cid:114)Λmax Λmin (cid:17) R ≤ (1 ∨ Cz,k) (cid:16) 40D1 + 1 + (cid:114) Λmax Λmin (cid:17) := D13. Define R2 0 = 2D1 δ0 (cid:16)log2 n n2 + λ2s (cid:17) + δ −4k−4 2k+3 0 −2k−2 2k+3 n (cid:16) (D3 + D9)D2 13 + D6(D2 13 + D 2 2k+3 13 σ 4k+4 2k+3 ) (cid:17) . 2With a bit abuse of notation, the subscripts for constants Dj’s we use here may be different from the ones in the lemmas. 31 For any R2 ∈ [R2 the conditions involving n 0, 1], it is straightforward to verify that Conditions from Lemmas 1-2 and all 2k+3 are satisfied. To meet other conditions, we will pick R2 = cR2 0 −2k−2 for any constant c ≥ 1. 
We verify the remaining conditions by showing that under the scaling = o(1), p → ∞ as n → ∞ (cid:0)so that λ ≍ (s+log p)s log p n o(λ)(cid:1): (cid:113) log p n , R2 ≍ γ ≍ s log p n + n −2k−2 2k+3 , R2 = (ii) Conditions from Lemma 3: log p n = o(1), (cid:113) log p n o(1) ≪ log p ≍ nλ2, λ √ nR−1 ≫ nR ≳ n √ R2 λ2 = O(R2/λ) = o(1), R2 log R−1 = . Moreover, 2k+3 ≫ log n n 4k+6 ≫ log n, R ≳ n −k−1 1 λ2nR −2k−1 k+1 ≳ n holds that 2k+1 2k+3 · log p ≫ log n when s log p n ≤ n −2k−2 2k+3 ; when s log p n > n −2k−2 2k+3 , it still λ2nR −2k−1 k+1 ≳ (cid:16) n s log p (cid:17) 2k+1 2k+2 · log p ≫ s 2k+1 2k+2 · log p ≳ n 2k+1 (2k+2)(2k+3) · (log p) 1 2k+2 ≫ log n. (iii) Conditions from Lemma 4: n log n = o(1). (iv) Conditions from Lemma 6: R2λ−1 log2 n n2 + γ ≍ R2 ≪ λ. (v) Conditions from Lemma 7: Most conditions have been verified for conditions in Lemma 3. It remains to see that log n n (1 + R2λ−1) + R2 ≲ log n n + R2 ≪ λ. Finally, given what we have obtained, it is straightforward to further evaluate the high probability (by choosing the constant D in λ and c in R2 large enough): P(T0(δ0, R, (cid:101)R) ∩ T2(δ0, R) ∩ T3 ∩ S1 ∩ S2 ∩ S3) ≥ 1 − p−D14 − n−D15, when n is large enough under the scaling s log2 p n = o(1), p → ∞. 32 2.6.2 Proof of Theorem 2 2.6.2.1 Roadmap of the proof Recall the sets B in (2.12) and F in (2.14) from the proof of Lemma 3. Define the following set and events: (cid:110) (cid:101)F = S1 = g : ∥g∥∞ ≤ 22bk(Lg + Lh)ℓ−1 z n−1 log n (cid:13) (∆(Z) + (cid:101)X)′(g(Z) − h(Z)β + ε) (cid:13) (cid:13)∞ (cid:111) , λ, ∀β ∈ B, g ∈ F ∪ (cid:101)F, ∆j ∈ (cid:101)F, 1 ≤ j ≤ p (cid:111) , (cid:110)(cid:13) (cid:13) (cid:13) 1 n 1 8 (cid:110)(cid:12) (cid:12)∥ (cid:101)Xβ∥2 (cid:12) ≤ S2 = n − ∥(cid:101)x′β∥2(cid:12) (cid:12) (cid:12) ≤ λ∥β∥1, ∀β ∈ B (cid:111) , S3 = (cid:110)(cid:12) (cid:12)∥Xβ∥2 (cid:12) n − ∥x′β∥2(cid:12) (cid:12) (cid:12) ≤ λ∥β∥1, ∀β ∈ B (cid:111) . The proof of Theorem 1 already yields an error rate for (cid:98)β (see Lemmas 1 and 5), however, the rate is not optimal yet. We will base on this “slow rate" result to perform a finer analysis of (cid:98)β to obtain the “fast rate". This is again inspired by [46]. Specifically, we prove in Lemma 6 that the “fast rate" results on (cid:98)β in Theorem 2 hold by intersecting with another event S1 ∩ S2 ∩ S3. We then show in Lemma 7 that this additional event S1 ∩ S2 ∩ S3 also has large probability. 2.6.2.2 Key lemmas Lemma 6. Assume Conditions 3-5 and δ0 ≤ 1 100 , δ0R2 ≥ D1(n−2 log2 n + γ + λ2s), λ ≥ D2(R2λ−1n−2 log2 n + γ), where the constant D1 is the same as in Lemma 1. Then on T1(δ0, R)∩T2(δ0, R)∩T3 ∩S1 ∩S2, ∥(cid:98)β − β0∥2 2 ≤ 16Λ−2 minsλ2, ∥x′((cid:98)β − β0)∥2 ≤ 32(Λmin + Λmax)Λ−2 minsλ2. Further intersecting with the event S3, we have ∥X(cid:98)β − Xβ0∥2 n ≤ 16(3Λmin + 2Λmax)Λ−2 minsλ2. Proof. Recall j ) ≤ Lh where h(z) = (h1(z), . . . , hp(z)) = E(x|z). According to Lemma 16 in [70], there exist ¯hj ∈ Hn for all that under Condition 5, max1≤j≤p T V (h(k) 33 1 ≤ j ≤ p such that T V (¯h(k) j ) ≤ akLh, ∥¯hj − hj∥∞ ≤ bkLh · max (z(i+1) − z(i)), 1≤i≤n−1 where ak, bk > 0 are constants only depending on k, and Hn is the span of the kth order falling factorial basis functions with knots z(k+1) ≤ z(k+2) . . . ≤ z(n−1). Define ¯h(z) = (¯h1(z), . . . , ¯hp(z)), ˇg(z) = (cid:98)g(z) + ¯h′(z)((cid:98)β − β0). Note that ˇg ∈ Hn since (cid:98)g, ¯hj ∈ Hn. 
Therefore, we can write down a basic inequality, 1 2 ∥y − X(cid:98)β − (cid:98)g(Z)∥2 n + λ∥(cid:98)β∥1 + γT V ((cid:98)g(k)) ≤ 1 2 ∥y − Xβ0 − ˇg(Z)∥2 n + λ∥β0∥1 + γT V (ˇg(k)), which can be reformulated as 1 2 ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β∥1 ≤ 1 2 ∥(h(Z) − ¯h(Z))((cid:98)β − β0)∥2 n + γ(T V (ˇg(k)) − T V ((cid:98)g(k)))+ (cid:10)g0(Z) − (cid:98)g(Z) − h(Z)((cid:98)β − β0) + ε, (h(Z) − ¯h(Z) + (cid:101)X)((cid:98)β − β0)(cid:11) n + λ∥β0∥1. Using triangle inequality for T V (·) and Hölder’s inequality, we can proceed to obtain 1 2 ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β∥1 ≤ ∥(h(Z) − ¯h(Z))′(h(Z) − ¯h(Z))∥max · ∥(cid:98)β − β0∥2 1 1 2n + γakLh∥(cid:98)β − β0∥1 + (cid:13) (cid:13) 1 n (h(Z) − ¯h(Z) + (cid:101)X)′(g0(Z) − (cid:98)g(Z) − h(Z)((cid:98)β − β0) + ε)(cid:13) (cid:13)∞ · ∥(cid:98)β − β0∥1 + λ∥β0∥1. (2.21) To further simplify the above inequality, we observe the following: (i) 1 n ∥(h(Z) − ¯h(Z))′(h(Z) − ¯h(Z))∥max ≤ maxj ∥hj − ¯hj∥2 ∞ ≤ b2 kL2 h222ℓ−2 z n−2 log2 n on T3 (ii) Conditions in Lemma 6 imply the conditions in Lemma 1 and Lemma 5. Hence, combining Lemmas 1 and 5 shows (cid:98)β − β0 ∈ B, ¯g − (cid:98)g ∈ F, g0 − ¯g ∈ (cid:101)F. 34 Based on these two results and assuming 4840b2 kL2 hδ0ℓ−2 z R2λ−1n−2 log2 n + γakLh ≤ λ/4, we continue from (2.21) to achieve that on T1(δ0, R) ∩ T2(δ0, R) ∩ T3 ∩ S1, 1 2 ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β∥1 ≤(4840b2 kL2 hδ0ℓ−2 z R2λ−1n−2 log2 n + γakLh + λ/4) · ∥(cid:98)β − β0∥1 + λ∥β0∥1 λ 2 ∥(cid:98)β − β0∥1 + λ∥β0∥1. ≤ A standard argument in the literature of Lasso (e.g., Lemma 6.3 in [8]) simplifies the above to n + λ∥(cid:98)β − β0∥1 ≤ 4λ∥(cid:98)βS − β0 S∥1 ≤ 4 sλ∥(cid:98)β − β0∥2 √ ∥ (cid:101)X((cid:98)β − β0)∥2 √ ≤4 sλΛ−1/2 min ∥(cid:101)x′((cid:98)β − β0)∥ ≤ 8sλ2Λ−1 min + 1 2 ∥(cid:101)x′((cid:98)β − β0)∥2 ≤8sλ2Λ−1 min + 1 2 ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ 2 ∥(cid:98)β − β0∥1, where the second to last equality is due to ab ≤ a2/2 + b2/2, and the last inequality holds on S2. Rearranging the terms leads to ∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β − β0∥1 ≤ 16Λ−1 minsλ2. (2.22) Now (2.22) combined with the condition from S2 implies ∥(cid:98)β − β0∥2 2 ≤ Λ−1 min · ∥(cid:101)x′((cid:98)β − β0)∥2 ≤ Λ−1 min · (∥ (cid:101)X((cid:98)β − β0)∥2 n + λ∥(cid:98)β − β0∥1) ≤ 16Λ−2 minsλ2, ∥x′((cid:98)β − β0)∥2 = (∥(cid:101)x′((cid:98)β − β0)∥ + ∥h(z)′((cid:98)β − β0)∥)2 ≤ 2∥(cid:101)x′((cid:98)β − β0)∥2 + 2∥h(z)′((cid:98)β − β0)∥2 ≤ 32Λ−1 minsλ2 + 2Λmax∥(cid:98)β − β0∥2 2 ≤ 32(Λmin + Λmax)Λ−2 minsλ2. Further intersecting with S3, we obtain ∥X((cid:98)β − β0)∥2 n ≤ ∥x′((cid:98)β − β0)∥2 + λ∥(cid:98)β − β0∥1 ≤ 32(Λmin + Λmax)Λ−2 minsλ2 + 16Λ−1 minsλ2. Lemma 7. Assume Conditions 1-5 hold. 35 (2.23) (2.24) (2.25) (i) Suppose D1δ −4k−4 2k+3 0 K 2 F n −2k−2 2k+3 ≤ R2, D2(1 + K 2 F )R2 log R−1 ≤ nλ2, D3(λ2nR −2k−1 k+1 K √ −1 k+1 F ∧ λ nR−1) ≥ log n, n−1 log n(1 + KF + σ + R2λ−1) + σ−1R2 ≤ D4λ, where KF is defined in (2.11) of Lemma 3. Then, P(S1) ≥1 − 6pe−n − 2(p + p2)e−D5n min(σ−2λ2,σ−1λ) 0 R2n δ2 D7K2 F . − nR−1) − 4pe − 8pe−D6(λ2nR −1 k+1 F k+1 K −2k−1 ∧λ √ (ii) Suppose log p n ≤ 1, δ0 (cid:113) log p n R2 λ2 ≤ D8. Then, P(S2) ≥ 1 − 2p−10, P(S3) ≥ 1 − 2p−10. Proof. For ∆(z) = (∆1(z), . . . , ∆p(z)) with ∆j ∈ (cid:101)F, 1 ≤ j ≤ p, (cid:13) (cid:13) (cid:13) 1 n ∆(Z)′(g(Z) − h(Z)β + ε) (cid:13) (cid:13) (cid:13)∞ ≤ ∥∆(Z)∥max · (cid:16) 1 1 n n (cid:16) ≤ 22bk(Lg + Lh)ℓ−1 z n−1 log n · ∥g(Z)∥1 + ∥h(Z)β∥1 + (cid:17) ∥ε∥1 1 n ∥g∥∞ + ∥β∥1 · max j ∥hj(Z)∥1/n + ∥ε∥1/n (cid:17) . 
Using Hoeffding’s inequality in Theorem 3, we obtain3 P(∥ε∥1/n ≤ C3σKε) ≥ 1 − 2e−n, P(max j ∥hj(Z)∥1/n ≤ C4Kx) ≥ 1 − 2pe−n. Thus with probability at least 1 − 2(p + 1)e−n, sup β∈B,g∈F∪ (cid:101)F ,∆j ∈ (cid:101)F (cid:13) (cid:13) (cid:13) 1 n (cid:13) ∆(Z)′(g(Z) − h(Z)β + ε) (cid:13) (cid:13)∞ ≤D1n−1 log n · (cid:0)KF + n−1 log n + R2λ−1 + σ(cid:1). (2.26) 3Given that ∥xj∥ψ2 ≤ Kx, it holds that ∥hj(z)∥ψ2 ≤ C1Kx, ∥(cid:101)xj∥ψ2 ≤ C2Kx. 36 Next we have (cid:13) (cid:13) (cid:13) 1 n (cid:13) (cid:101)X′(g(Z) − h(Z)β + ε) (cid:13) (cid:13)∞ ≤ (cid:13) (cid:13) (cid:13) 1 n ≤ max 1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) + ∥β∥1 · max 1≤j,k≤p + (cid:13) (cid:13) 1 (cid:101)X′g(Z) (cid:13) (cid:13) (cid:13)∞ (cid:13) n n (cid:88) (cid:101)xijhk(zi)(cid:12) (cid:12) (cid:12) 1 n i=1 (cid:101)X′h(Z)β (cid:12) + (cid:13) (cid:13) 1 n (cid:13) (cid:13) (cid:13)∞ n (cid:88) + (cid:13) (cid:13) (cid:13) 1 n (cid:101)X′ε (cid:13) (cid:13) (cid:13)∞ (cid:101)xiεi (cid:13) (cid:13)∞. i=1 Note that we have used Bernstein’s inequality to bound (cid:13) (cid:13) 1 n (cid:80)n i=1 xiεi (cid:13) (cid:13)∞ in the proof of Lemma 4. The same result holds here, P (cid:16)(cid:13) (cid:13) (cid:13) 1 n n (cid:88) i=1 (cid:101)xiεi (cid:13) (cid:13) (cid:13)∞ ≤ (cid:17) λ 40 ≥ 1 − 2pe−C5n min (cid:0) λ2 (KxKεσ)2 , λ KxKεσ (cid:1) . are independent, zero-mean, sub-exponential random variables with i=1 Since {(cid:101)xijhk(zi)}n ∥(cid:101)xijhk(zi)∥ψ1 ≤ ∥(cid:101)xij∥ψ2 · ∥hk(zi)∥ψ2 ≤ C6K 2 1 n (cid:101)xijhk(zi)(cid:12) max 1≤j,k≤p n (cid:88) P (cid:16) (cid:12) (cid:12) x i=1 , we use again Bernstein’s inequality to obtain (cid:12) ≤ λ/(20σ) (cid:17) ≥ 1 − 2p2e −C7n min (cid:0) λ2 xσ2 , λ K2 K4 xσ (cid:1) . Hence with probability at least 1 − 2(p + p2)e −C8n min (cid:0) λ2 xσ2(K2 K2 ε +K2 x) , λ Kxσ(Kε+Kx) (cid:1) , sup β∈B,g∈F∪ (cid:101)F ,∆j ∈ (cid:101)F (cid:13) (cid:13) (cid:13) 1 n (cid:13) (cid:101)X′(g(Z) − h(Z)β + ε) (cid:13) (cid:13)∞ ≤ sup g∈g∈F∪ (cid:101)F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) + σ−1δ0R2 + λ 40 . We continue in the following way, sup g∈F∪ (cid:101)F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) ≤ sup g∈F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) + sup g∈ (cid:101)F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) ≤ sup g∈F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) + 22bk(Lg + Lh) log n ℓzn · max 1≤j≤p As in bounding maxj ∥hj(Z)∥1/n earlier, we bound (cid:16) P max 1≤j≤p 1 n n (cid:88) i=1 |(cid:101)xij| ≤ C9Kx (cid:17) ≥ 1 − 2pe−n. (2.27) 1 n n (cid:88) i=1 |(cid:101)xij| (2.28) Moreover, applying the same arguments for bounding supg∈F | 1 the proof of Lemma 3, we have that under the conditions (2.23)-(2.24), n (cid:80)n i=1(xijg(zi) − Exjg(z))| in (cid:16) P sup g∈F ,1≤j≤p (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:101)xijg(zi)(cid:12) (cid:12) ≥ (cid:17) λ 120 ≤ 8pe−D2(λ2nR 37 −2k−1 k+1 K −1 k+1 F √ ∧λ nR−1) + 4pe − δ2 0 R2n D3K2 F . (2.29) Combining (2.25)-(2.29) completes the bound for P(S1). Regarding the bound for P(S2) and P(S3), the proof follows the same arguments used for bounding term I in the proof of Lemma 3. 2.6.2.3 Completion of the proof See Section 2.6.1.3. 2.6.3 Reference materials Theorem 3. (General Hoeffding’s inequality). Let x1, . . . , xn ∈ R be independent, zero- mean, sub-gaussian random variables. 
Then for every t ≥ 0, we have P (cid:18)(cid:12) (cid:12) (cid:12) n (cid:88) i=1 (cid:19) (cid:12) (cid:12) (cid:12) ≥ t xi ≤ 2 exp (cid:16) − ct2 i=1 ∥xi∥2 ψ2 (cid:80)n (cid:17) , where c > 0 is an absolute constant, and ∥ · ∥ψ2 inf{t > 0 : Eex2/t2 ≤ 2}. is the sub-gaussian norm defined as ∥x∥ψ2 = (Bernstein’s inequality). Let x1, . . . , xn ∈ R be independent, zero-mean, sub-exponential random variables. Then for every t ≥ 0, we have P (cid:18)(cid:12) (cid:12) (cid:12) n (cid:88) i=1 (cid:19) (cid:12) (cid:12) (cid:12) ≥ t xi (cid:34) ≤ 2 exp − c min (cid:18) t2 i=1 ∥xi∥2 ψ1 (cid:80)n , t maxi ∥xi∥ψ1 (cid:19)(cid:35) , where c > 0 is an absolute constant, and ∥ · ∥ψ1 ∥x∥ψ1 = inf{t > 0 : Ee|x|/t ≤ 2}. is the sub-exponential norm defined as The above two results are Theorem 2.6.2 and Theorem 2.8.1, respectively in [73]. Theorem 4. Let x1, . . . , xp ∈ R be sub-gaussian random variables, which are not necessarily independent. Then there exists an absolute constant c > 0 such that for all p > 1, E max 1≤i≤p |xi| ≤ c(cid:112)log p max 1≤i≤p ∥xi∥ψ2. The above result can be found in Lemma 2.4 of [7]. 38 Theorem 5. (Uniform convergence of empirical norms). Denote ∥g∥2 n = 1 n i.i.d. samples zi ∈ Z. For a class F of functions on Z, let J0(t, F) = C0t (cid:82) 1 0 sup{z1,...,zn}⊆Z (cid:112)H(ut/2, F, ∥ · ∥n)du, t > 0, where C0 > 0 is some universal constant and H is the metric entropy. Let RF := supg∈F ∥g∥, KF := supg∈F ∥g∥∞, and H(t) be the convex conjugate of i=1 g(zi)2 for (cid:80)n G(t) := (J −1 0 (cid:32) (t, F))2. Then, for all R2 n − ∥g∥2(cid:12) (cid:12) (cid:12) ≥ C1 (cid:12) (cid:12)∥g∥2 (cid:12) P sup g∈F F ≥ H(4KF / n) and all t > 0, (cid:18) 2KF J0(2RF , F) + KF RF √ t (cid:19)(cid:33) + K 2 F t n ≤ e−t, √ √ n where C1 > 0 is a universal constant. The above is Theorem 2.2 from [21]. Theorem 6. (Symmetrization and concentration). Let Y = (y1, . . . , yn) be i.i.d. samples in some sample space Y and let F be a class of real-valued functions on Y. Define QF = supf ∈F ∥f ∥, (cid:98)QF = supf ∈F ∥f ∥n. Then, (cid:16) P (cid:16) P (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 n 1 n sup f ∈F sup f ∈F n (cid:88) i=1 n (cid:88) i=1 (cid:12) (f (yi) − Ef (yi)) (cid:12) (cid:12) ≥ 4QF (cid:17) (cid:114) 2t n (cid:16) ≤ 4P (cid:12) (cid:12) (cid:12) 1 n sup f ∈F n (cid:88) i=1 (cid:12) (cid:12) (cid:12) ≥ QF ϵif (yi) (cid:17) (cid:114) 2t n , ∀t ≥ 4, (cid:12) (cid:12) (cid:12) ≥ C2 · ϵif (yi) (cid:16) (cid:16) E (cid:12) (cid:12) (cid:12) 1 n sup f ∈F n (cid:88) i=1 (cid:12) (cid:12) ϵif (yi) (cid:12) (cid:12) (cid:12) (cid:12)Y (cid:17) + (cid:98)QF (cid:112)t/n (cid:17)(cid:12) (cid:12) (cid:12)Y (cid:17) ≤ e−t, ∀t > 0, where ϵ1, . . . , ϵn is a Rademacher sequence independent of Y and C2 > 0 is a universal constant. The first result is Lemma 16.1 in [72] and the second result is implied by Theorem 16.4 in [72]. Theorem 7. (Dudley’s integral tail bound). Let (Xt)t∈T be a separable random process on a metric space (T, d) with sub-Gaussian increments: ∥Xt − Xs∥ψ2 ≤ Kd(t, s), ∀t, s ∈ T . Then, for every u > 0, the event |Xt − Xs| ≤ CK sup t,s∈T (cid:104) (cid:90) diam(T ) 0 (cid:105) (cid:112)H(ε, T, d)dε + u · diam(T ) holds with probability at least 1 − 2e−u2. Here, C > 0 is a universal constant, H is the metric entropy, and diam(T ) = sups,t∈T d(s, t). The result above is Theorem 8.1.6 in [73]. 
CHAPTER 3
HIGH-DIMENSIONAL PARTIAL LINEAR QUANTILE REGRESSION WITH TREND FILTERING

3.1 Introduction

This research addresses the challenges posed by diverse data types in modern analytics, focusing on scenarios where the response variable y is influenced by both linear (x) and non-linear (z) predictors. Such complex relationships, common in biological studies such as gene expression analysis, require specialized methodologies that go beyond traditional linear models. The study emphasizes the importance of accounting for the linear and non-linear components separately in order to avoid potential pitfalls: using standard least squares regression alone may lead to incorrect inferences, while applying non-linear regression to both components can produce overly rough estimates and run into the curse of dimensionality in high-dimensional datasets. This approach aims to optimize model accuracy and efficiency when handling complex, heterogeneous data structures.

The partial linear model addresses this methodological challenge by integrating parametric and nonparametric regression techniques, thereby offering a more flexible and robust approach to complex data structures. This semiparametric approach has been studied extensively in both low-dimensional and high-dimensional settings by combining classical nonparametric approaches with various types of linear models [75, 23, 80, 46, 38, 89, 84]. For high-dimensional data, where the number of covariates significantly exceeds the number of observations, several penalization techniques are employed, such as the LASSO [67], SCAD (Smoothly Clipped Absolute Deviation) [16], and MCP (Minimax Concave Penalty) [85].

Nonetheless, the impact of covariates on the central tendency of the conditional distribution may differ significantly from their effects on the distribution's extremities. Consequently, an exclusive focus on the conditional mean function can lead to misinterpretation. Quantile regression, introduced by [32], provides a solution by estimating conditional quantiles at various levels, offering a more nuanced understanding of the relationship between covariates and the response variable. Integrating quantile regression into the partial linear model framework allows the strengths of both methods to be harnessed, resulting in a more robust and comprehensive analytical approach.

Partial linear quantile regression has been applied in various lines of research. In low-dimensional settings, starting from [24], it has been used for predicting high-cost patients [42], for data with dropouts [43], and in deep learning applications [88], among others. In high-dimensional settings, it has been applied to datasets with missing covariates [61] and to functional predictors [39], among others. Notably, [63] explored the high-dimensional partial linear framework with various penalty functions, combining LASSO, SCAD, and MCP with the B-spline method and providing solid theoretical properties for their estimators. We use their results as the main benchmark for comparison.

In recent years, there has been substantial interest in and progress on the nonparametric method known as trend filtering, proposed by [66, 31, 69]. [69] describes trend filtering as a discrete analog of the locally adaptive regression splines introduced by [44]. Locally adaptive regression splines solve the following optimization problem for n observations:
\[
\min_{f \in \mathcal{F}_k} \; \frac{1}{2} \sum_{i=1}^{n} \big(y_i - f(z_i)\big)^2 + \lambda \cdot TV\big(f^{(k)}\big). \tag{3.1}
\]
Here, $\mathcal{F}_k$ denotes the function space
\[
\mathcal{F}_k = \big\{ f : [0,1] \to \mathbb{R} : f \text{ is } k \text{ times weakly differentiable and } TV(f^{(k)}) < \infty \big\},
\]
where $TV(f^{(k)})$ denotes the total variation of the $k$th derivative of $f$. The discrete analog of (3.1) using trend filtering is
\[
\min_{\theta \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \lambda \|D^{(z,k+1)}\theta\|_1, \tag{3.2}
\]
where bold symbols denote vectors, $\|\cdot\|_1$ denotes the $\ell_1$-norm, and $D^{(z,k+1)}$ is the $(k+1)$th order discrete difference operator based on $z$. Problem (3.2) is referred to as univariate $k$th order trend filtering. The details of these notations and definitions are elaborated in Section 3.2. Following the equivalence between (3.1) and (3.2) established by [69], trend filtering estimates combine the benefits of locally adaptive regression splines with the advantages of the banded difference operator structure in (3.2). This dual benefit provides local adaptivity and computational efficiency, giving trend filtering an advantage over other nonparametric methods such as smoothing splines and B-spline regression.

Trend filtering has been successfully applied in various statistical areas, including graphical models [78], additive models [56], and spatiotemporal models [49], among others. In particular, [41] studies several univariate quantile trend filtering estimators and their theoretical properties. The quantile regression analog of (3.2) considered in their work is
\[
\min_{\theta \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(y_i - \theta_i) + \lambda \|D^{(z,k+1)}\theta\|_1,
\]
where $\rho_\tau(x) = x(\tau - I(x \leq 0))$ is the check loss. However, their risk and error bounds are stated in terms of a specific Huber-type loss [27] rather than the squared loss. Therefore, even though their work concerns quantile regression, their theoretical contributions and derivations differ substantially from ours, since our error bounds are stated in terms of the squared loss. Moreover, the adaptation of quantile trend filtering within the partial linear model framework has not been explored to date, which makes our theoretical contribution distinct.

Our contribution lies in integrating trend filtering with the partial linear quantile regression model to effectively address high-dimensional, sparse, heterogeneous data. Our approach leverages the local adaptivity and computational efficiency of trend filtering together with the quantile-adaptive, sparsity-inducing properties of the quantile LASSO, providing significant advantages when dealing with complex and heterogeneous data. Furthermore, our work provides a rigorous theoretical analysis and establishes convergence rates for the estimators, with the error measured in the squared loss, which distinguishes our theory from the previous work on quantile trend filtering. We also address the additional challenges induced by this integration.

The rest of this chapter is organized as follows. Section 3.2 presents the preliminaries, including assumptions and model descriptions. Section 3.3 provides the main theoretical results and their interpretations. Section 3.4 details the simulation studies and computational aspects. Section 3.5 presents the real data analysis. Section 3.6 discusses our results, and Section 3.7 contains all the proofs.

3.1.1 Notations

The notation for this chapter is the same as in the previous chapter, so we omit the description.
3.2 Model Description and Assumptions

3.2.1 Model Description and Estimation Method

We consider the following partial linear quantile regression model:
\[
Q_{y\mid x,z}(\tau) = x'\beta^0 + g_0(z),
\]
where $\beta^0 \in \mathbb{R}^p$ has support $S = \{j : \beta^0_j \neq 0\}$ with $|S| \leq s$, and $g_0 : [0,1] \to \mathbb{R}$ is a nonparametric function. Here $Q_{y\mid x,z}(\tau) = \inf\{t : \mathbb{P}(y \leq t \mid x, z) \geq \tau\}$ denotes the $\tau$-th conditional quantile of $y$ given $x$ and $z$. For notational convenience, we suppress $\tau$ in $\beta^0$, $g_0$, and the corresponding estimators whenever it is clear from the context. In general $z \in [0,1]$ may not hold, but it can be achieved by a simple affine transformation, so we assume it for technical simplicity. We let $\{(y_i, x_i, z_i)\}_{i=1}^n$ be $n$ i.i.d. observations of $(y, x, z)$, and without loss of generality we assume that these observations are re-ordered according to $z_i$, $i = 1, \ldots, n$. Moreover, as discussed in the introduction, we focus on functions $g_0$ whose $k$th derivative has bounded total variation, that is, $TV(g_0^{(k)}) \leq L_g$ with $L_g > 0$.

Expanding upon the univariate trend filtering of [31, 69] and the quantile trend filtering of [41], for a given integer $k \geq 0$, the $k$th order partial linear quantile trend filtering (PLQTF) estimator for the partial linear quantile regression model is defined as
\[
(\widehat{\beta}, \widehat{\theta}) \in \arg\min_{\beta \in \mathbb{R}^p,\, \theta \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta - \theta_i) + \lambda \|\beta\|_1 + \gamma \|D^{(z,k+1)}\theta\|_1, \tag{3.3}
\]
where $D^{(z,k+1)} \in \mathbb{R}^{(n-k-1) \times n}$ is the discrete difference operator of order $k+1$. When $k = 0$,
\[
D^{(z,1)} =
\begin{pmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 \\
 & & & \ddots & & \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix} \in \mathbb{R}^{(n-1) \times n}. \tag{3.4}
\]
For $k \geq 1$, the difference operator is defined recursively as
\[
D^{(z,k+1)} = D^{(z,1)} \cdot \mathrm{diag}\!\left(\frac{k}{z_{k+1} - z_1}, \ldots, \frac{k}{z_n - z_{n-k}}\right) \cdot D^{(z,k)},
\]
where in this equation $D^{(z,1)}$ has the form of (3.4) with dimension $(n-k-1) \times (n-k)$. When $k = 0$, this estimator is equivalent to the one-dimensional quantile fused lasso [29].

Before discussing estimation methods further, we briefly introduce the falling factorial basis. [69] and [77] demonstrate a connection between univariate trend filtering and falling factorial basis functions, showing that the trend filtering problem can be interpreted as a sparse basis regression problem in these functions. This finding also applies to our case, since our model generalizes the univariate trend filtering model. For given knot points $t_1 < \cdots < t_n \in \mathbb{R}$, the $k$th order falling factorial basis functions are defined as
\[
q_i(t) = \prod_{l=1}^{i-1} (t - t_l), \quad i = 1, \ldots, k+1,
\qquad
q_{i+k+1}(t) = \prod_{l=1}^{k} (t - t_{i+l}) \cdot 1\{t > t_{i+k}\}, \quad i = 1, \ldots, n-k-1,
\]
where we write $q_1(t) = 1$ as a convention. With these definitions, the estimation in (3.3) is equivalent to
\[
(\widehat{\beta}, \widehat{\alpha}) \in \arg\min_{\beta \in \mathbb{R}^p,\, \alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \rho_\tau\!\Big(y_i - x_i'\beta - \sum_{l=1}^{n} \alpha_l q_l(z_i)\Big) + \lambda \|\beta\|_1 + \gamma\, k! \sum_{l=k+2}^{n} |\alpha_l|, \tag{3.5}
\]
which is of lasso form. Indeed, an equivalent alternative expression of (3.5) is
\[
(\widehat{\beta}, \widehat{g}) \in \arg\min_{\beta \in \mathbb{R}^p,\, g \in \mathcal{H}_n} \; \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta - g(z_i)) + \lambda \|\beta\|_1 + \gamma\, TV(g^{(k)}), \tag{3.6}
\]
where $\mathcal{H}^k_n$ is the span of the $k$th order falling factorial basis functions with knots $z_{k+1} \leq z_{k+2} \leq \ldots \leq z_{n-1}$. Throughout the chapter we write $\mathcal{H}_n$ for $\mathcal{H}^k_n$ for notational simplicity. We omit the proofs of the equivalences between (3.3), (3.5) and (3.6), since they can be taken directly from Lemma 1 in [69] for (3.3) and (3.5), and from Lemma 2 in [69] and Lemma 1 in [56] for (3.5) and (3.6). Based on these equivalences, the estimate from (3.3) is the same as that from (3.6). By constructing an appropriate linear combination of falling factorial functions, the form in (3.5) enables interpolation and extrapolation for partial linear quantile trend filtering, which enhances its practical applicability. A small numerical sketch of the difference operator and the falling factorial basis is given below.
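To make the construction above concrete, the following R sketch builds the first-order operator in (3.4), the recursion for $D^{(z,k+1)}$, and the falling factorial basis, and then checks that $D^{(z,k+1)}$ annihilates degree-$k$ polynomials evaluated at $z$. This is a minimal illustration only, not the implementation used in this dissertation; the function names are ours.

# Minimal R sketch (not the plmR implementation) of the building blocks above.

# First-order difference operator of size (m-1) x m, as in (3.4)
diff_op1 <- function(m) {
  D <- matrix(0, m - 1, m)
  D[cbind(1:(m - 1), 1:(m - 1))] <- -1
  D[cbind(1:(m - 1), 2:m)] <- 1
  D
}

# Recursive construction of D^(z,k+1) for sorted, unevenly spaced z (ord = k + 1)
diff_op <- function(z, ord) {
  n <- length(z)
  D <- diff_op1(n)                                   # D^(z,1)
  if (ord == 1) return(D)
  for (k in 1:(ord - 1)) {
    # diagonal weights k / (z_{i+k} - z_i), i = 1, ..., n - k
    W <- diag(k / (z[(k + 1):n] - z[1:(n - k)]))
    D <- diff_op1(n - k) %*% W %*% D                 # D^(z,k+1) = D^(z,1) W D^(z,k)
  }
  D
}

# kth order falling factorial basis with knots z (this sketch assumes k >= 1)
falling_factorial_basis <- function(z, k) {
  n <- length(z)
  B <- matrix(1, n, n)                               # first column: q_1(t) = 1
  for (j in 2:(k + 1))                               # pure polynomial terms
    B[, j] <- apply(outer(z, z[1:(j - 1)], "-"), 1, prod)
  for (i in 1:(n - k - 1))                           # truncated (knot-based) terms
    B[, i + k + 1] <- apply(outer(z, z[(i + 1):(i + k)], "-"), 1, prod) * (z > z[i + k])
  B
}

# Check: D^(z,k+1) theta = 0 when theta is a degree-k polynomial evaluated at z
z <- sort(runif(30))
theta <- 1 + 2 * z - 3 * z^2                         # degree 2, so take k = 2
max(abs(diff_op(z, ord = 3) %*% theta))              # numerically zero
dim(falling_factorial_basis(z, k = 2))               # 30 x 30 basis matrix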
3.2.2 Assumptions

We introduce the conditions on our model in (3.6) that are needed for the proofs of the lemmas and theorems throughout.

Condition 7 (Design Condition). The covariates $x = (x_1, \ldots, x_p)$ are bounded: $\max_{1 \leq j \leq p} |x_j| \leq K_x$.

Condition 7 is the same as Condition 2 in [63]. This condition is common in high-dimensional frameworks.

Condition 8 (Bounded Conditional Response Density). The derivative $\frac{\partial}{\partial y} f_{y\mid x,z}(y)$ of the conditional density is bounded in absolute value by a constant $\bar{f}'$ uniformly in $(y, x, z)$. Moreover, the conditional density evaluated at the conditional quantile is bounded away from zero, i.e. $f_{y\mid x,z}(x'\beta^0 + g_0(z)) \geq \underline{f} > 0$.

Condition 8 is the same as in [5, 63, 41], among others. This condition is common in the quantile regression literature, and it is less restrictive than the conditions for linear models, which assume specific error distributions.

Condition 9 (Bounded Input Density). $z$ has a continuous distribution supported on $[0, 1]$, with density bounded below by a constant $\ell_z > 0$.

Condition 9 is the same as in [69, 56]. This condition is common in the trend filtering literature.

Condition 10 (Eigenvalue Condition). Define $h(z) = (h_1(z), \ldots, h_p(z)) = E(x \mid z)$ and $\widetilde{x} = x - h(z)$. Assume $\lambda_{\min}(E\,\widetilde{x}\widetilde{x}') \geq \Lambda_{\min} > 0$ and $\lambda_{\max}(E\,h(z)h(z)') \leq \Lambda_{\max} < \infty$.

Condition 10 is common in the semiparametric literature [83, 46, 84]. It ensures that there is enough information in the data to identify the parameters in the linear part.

Condition 11. $\max_{1 \leq j \leq p} TV(h_j^{(k)}) \leq L_h$.

Condition 12. $\frac{s^2 \log p}{n} = o(1)$, $p \to \infty$, as $n \to \infty$.

Condition 11 is similar to Condition 2.6 in [46] and Condition A.5 in [84], and $L_h = 0$ when $z$ and $x$ are independent. This condition is used in the proof of Theorem 9, and this type of condition is common in the smoothing spline and trend filtering literature. Condition 12 is a technical condition used in the proofs of Theorems 8 and 9.

3.3 Main Theoretical Results

In this section we present our main results, which consist of two parts: the result for $g$ and the result for $\beta$. We begin with the convergence rate for $g$.

Theorem 8. Assume Conditions 7-10 and 12. Choose $\lambda = c_1 \sqrt{\frac{\log p}{n}}$ and $\gamma = c_2\big(\frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}}\big)$ with large enough constants $c_1, c_2 > 0$. Then there exist constants $c_3, c_4, n_0 > 0$ such that any solution $\widehat{g}$ in (3.6) satisfies
\[
\|\widehat{g}(z) - g_0(z)\|^2 \leq c_3 \Big(\frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}}\Big), \qquad
\|\widehat{g} - g_0\|_n^2 \leq c_3 \Big(\frac{s \log p}{n} + n^{-\frac{2k+2}{2k+3}}\Big),
\]
with probability at least $1 - p^{-c_4} - n^{-c_4}$, as long as $n \geq n_0$. The constants $c_1, c_2, c_3, c_4, n_0$ may depend on $k, L_g, L_h, K_x, \ell_z, \Lambda_{\min}, \Lambda_{\max}, \underline{f}, \bar{f}'$.

Theorem 8 implies that the rate of $\widehat{g}$ is driven by two terms: $s \log p / n$ and $n^{-\frac{2k+2}{2k+3}}$. This means that the convergence rate of $\widehat{g}$ depends on the convergence rate of $\widehat{\beta}$, which is discussed in Theorem 9. When the rate for the linear part is faster than the second term, that is when $s \log p / n = o\big(n^{-\frac{2k+2}{2k+3}}\big)$, the convergence rate is governed by the second term, which implies that $\|\widehat{g}(z) - g_0(z)\|^2 = O_p\big(n^{-\frac{2k+2}{2k+3}}\big)$ and $\|\widehat{g} - g_0\|_n^2 = O_p\big(n^{-\frac{2k+2}{2k+3}}\big)$. This phenomenon also appears in [63], which uses a B-spline approximation instead, and the resulting convergence rate is similar to theirs: the nonparametric term is the same, and only the other term differs, owing to their more restrictive conditions and different penalty terms. Their nonparametric rate is obtained over a Hölder class, whereas our rate is obtained over the total variation function class, on which linear smoothers are suboptimal. This implies that our rate is faster within the total variation class, especially when $p$ does not grow quickly relative to $n$ and $g$ is heterogeneous. The proof of Theorem 8 can be found in Section 3.7. A small numerical comparison of the two rate terms is given below.
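As a quick numerical illustration of this trade-off, one can evaluate the two terms directly; the values of n, p, s, and k below are our own illustrative choices, not taken from the dissertation.

# Illustrative comparison of the two terms in the rate of Theorem 8
rate_terms <- function(n, p, s, k) {
  c(parametric    = s * log(p) / n,
    nonparametric = n^(-(2 * k + 2) / (2 * k + 3)))
}
rate_terms(n = 300, p = 600, s = 4, k = 2)    # parametric term is larger here
rate_terms(n = 5000, p = 50, s = 1, k = 0)    # nonparametric term is larger here
# The nonparametric term governs the rate of g-hat only when s*log(p)/n is of
# smaller order than n^{-(2k+2)/(2k+3)}.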
We proceed to the convergence rate results for $\beta$.

Theorem 9. Assume Conditions 7-12. With the same choice of $\lambda, \gamma$ as in Theorem 8, any solution $\widehat{\beta}$ in (3.6) satisfies
\[
\|\widehat{\beta} - \beta^0\|_2^2 \leq c_3 \frac{s \log p}{n}, \qquad
\|x'(\widehat{\beta} - \beta^0)\|^2 \leq c_3 \frac{s \log p}{n}, \qquad
\|X(\widehat{\beta} - \beta^0)\|_n^2 \leq c_3 \frac{s \log p}{n},
\]
with probability at least $1 - p^{-c_4} - n^{-c_4}$, as long as $n \geq n_0$.

Theorem 9 gives two types of guarantees for the estimator $\widehat{\beta}$: (1) estimation results and (2) prediction results. Both the estimation and prediction rates, for in-sample and out-of-sample data, achieve the oracle rate. This rate is consistent with findings in classical high-dimensional quantile regression research [5]. It is noteworthy that the smoothness of $g$ does not affect the results for $\beta$, whereas the results for $g$ are influenced by $\beta$. This one-way relationship arises from the orthogonal decomposition used in the proofs, a phenomenon also observed in [63]. The proof of Theorem 9 is provided in Section 3.7.

3.4 Simulation Studies

Through empirical experiments, we evaluate the performance of partial linear quantile trend filtering (PLQTF) in comparison to partial linear quantile B-splines (PLQBS), as presented by [63]. Additionally, we report the performance of the partial linear quantile smoothing spline (PLQSS) as a side-by-side comparison, even though it is not a main object of study.

3.4.1 Computational Details

We use a combination of the Blockwise Coordinate Descent (BCD) algorithm, also known as the backfitting approach, and the Majorization-Minimization (MM) algorithm for our problem in (3.3); we call the resulting algorithm qBCD. Our approach involves two blocks: the first pertains to $\beta$ for the linear part and the second to $\theta$ for the trend filtering part. The algorithm iterates over these blocks, updating each component at every step using the quantile LASSO for $\beta$ and the univariate trend filtering algorithm for $\theta$. Within each block, however, the non-smooth check loss requires care in choosing the optimization strategy. For the quantile LASSO part, the R package rqPen [62] provides a linear programming approach. For the trend filtering part, no publicly available software handles the check loss. Therefore, we deploy the re-iteratively weighted approximation [57, 48, 58], which has been shown to be a type of MM algorithm [34]. Instead of minimizing the check loss from (3.3), in the trend filtering block we minimize the weighted least squares approximation
\[
\sum_{i=1}^{n} w^{(r)}\big(y_i, x_i, \beta, \theta_i^{(r)}, \tau\big)\, \big(y_i - x_i'\beta - \theta_i\big)^2, \tag{3.7}
\]
where, for a sufficiently small $\eta > 0$,
\[
w^{(r)}\big(y_i, x_i, \beta, \theta_i^{(r)}, \tau\big) = \frac{\tau\, I\big(y_i > x_i'\beta + \theta_i^{(r)}\big) + (1-\tau)\, I\big(y_i \leq x_i'\beta + \theta_i^{(r)}\big)}{\sqrt{\big(y_i - x_i'\beta - \theta_i^{(r)}\big)^2 + \eta^2}}.
\]
A small numerical illustration of this reweighting is given below.
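The following R sketch illustrates the mechanics of the reweighting in (3.7) on a toy problem: with the weights above held at the current iterate, each inner step is a weighted least squares problem, and iterating the weighted update for a single location parameter drives it to (approximately) the tau-th sample quantile. This is a minimal illustration under our own simplifications (a pure location problem, no penalty), not the implementation used in this chapter.

# Minimal sketch of the reweighting in (3.7) on a toy location problem
set.seed(1)
check_loss <- function(r, tau) r * (tau - (r <= 0))

# Weight evaluated at the current residual r0, as in (3.7)
mm_weight <- function(r0, tau, eta = 1e-4) {
  (tau * (r0 > 0) + (1 - tau) * (r0 <= 0)) / sqrt(r0^2 + eta^2)
}

# Estimate the tau-th quantile of y by repeated weighted least squares:
# each step solves min_m sum_i w_i (y_i - m)^2, i.e. a weighted mean.
y <- rnorm(500, mean = 2, sd = 3)
tau <- 0.8
m <- mean(y)
for (it in 1:500) {
  w <- mm_weight(y - m, tau)
  m_new <- sum(w * y) / sum(w)
  if (abs(m_new - m) < 1e-10) break
  m <- m_new
}
c(reweighted = m, sample_quantile = unname(quantile(y, tau)))  # close agreement
# With the weights held at the current iterate, the weighted quadratic
# sum(w * (y - m)^2) matches the check loss sum(check_loss(y - m, tau)) up to
# the eta-smoothing, which is what makes each block update tractable.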
The big advantage of this approximation is that the problem becomes a weighted least squares problem. The descent property of the MM algorithm ensures the convergence and numerical stability of this optimization; for further details on the MM algorithm we refer to [28]. The convergence of the Block Coordinate Descent (BCD) algorithm has been rigorously established by [71], who proved that for a convex criterion decomposable into a smooth term and separable terms, the iterates of the BCD algorithm converge to an optimal solution. Consequently, we do not need to prove the convergence of the algorithm again in this study: the convergence of our qBCD algorithm is guaranteed by this existing result. The detailed steps of the entire algorithm are outlined in Algorithm 3.1.

For the tuning parameters $\lambda$ and $\gamma$, we employ 5-fold cross-validation (CV) to determine the parameters that minimize the overall CV error. However, because there are two tuning parameters, the search space is the product of the lengths of the $\lambda$ and $\gamma$ grids, leading to a substantial increase in computation time compared to an algorithm with a single tuning parameter. Furthermore, the additional inner loop within the blocks further slows down convergence. To mitigate this, we use the warm start method [53] over our tuning parameter grids. Empirical evidence in [53] demonstrates that warm starts significantly reduce the number of iterations required for convergence, thereby speeding up the entire algorithm (a schematic sketch of this cross-validation with warm starts is given after Algorithm 3.1 below).

Our algorithm is implemented in R [52]. For the quantile lasso component we use the R package rqPen [62], and for trend filtering we use the R package glmgen [1] with the weight approximation from (3.7). However, since there is no publicly available software for the partial linear smoothing spline method, we also use the qBCD algorithm for the smoothing spline model in our computations. The R function stats::smooth.spline is used for this procedure in the second block, again with the weight approximation from (3.7).

Algorithm 3.1 A qBCD algorithm for high-dimensional partial linear quantile trend filtering

Data: $\{y_i, x_i, z_i\}$, $i = 1, \ldots, n$
Fixed (tuning) parameters: $\lambda$, $\gamma$, $k$

1. Set $t = 0$ and initialize $\theta_i^{(0)} = 0$, $i = 1, \ldots, n$.
2. For the $(t+1)$-th iteration, where $t = 0, 1, 2, \ldots$:
   a) Block 1: Let $y_i^{(t)*} = y_i - \theta_i^{(t)}$, and update $\beta^{(t)}$ by fitting the quantile lasso:
      $\beta^{(t+1)} = \arg\min_{\beta} \frac{1}{n}\sum_{i=1}^{n} \rho_\tau\big(y_i^{(t)*} - x_i'\beta\big) + \lambda\|\beta\|_1$.
   b) Block 2: Let $y_i^{(t)**} = y_i - x_i'\beta^{(t+1)}$, set $r = 0$ and $\theta^{(t+1,0)} = \theta^{(t)}$.
      For the $(r+1)$-th inner iteration, where $r = 0, 1, 2, \ldots$:
      i. Using the approximation with $w(\cdot)$, update $\theta$ by fitting the re-iteratively weighted univariate trend filtering model
         $\theta^{(t+1,r+1)} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} w_i^{(r)} \big(y_i^{(t)**} - \theta_i\big)^2 + \gamma\|D^{(z,k+1)}\theta\|_1$,
         where $w_i^{(r)} = w^{(r)}\big(y_i, x_i, \beta^{(t+1)}, \theta_i^{(t+1,r)}, \tau\big)$ as in (3.7).
      ii. If $\theta$ converges, stop the inner iteration over $r$; otherwise continue until the predefined maximum number of inner iterations is reached.
      iii. Set $\theta^{(t+1)}$ to the value at convergence.
   c) If both $\beta$ and $\theta$ converge, stop; otherwise continue until the predefined maximum number of outer iterations is reached.
3. Return $\widehat{\beta}$ and $\widehat{\theta}$ as the parameters at convergence.
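To make the tuning strategy concrete, the sketch below shows the structure of 5-fold cross-validation over the (lambda, gamma) grid with warm starts. It is purely schematic: fit_plqtf() is a hypothetical placeholder for one qBCD fit (it is not the plmR interface), and its stub body exists only so the sketch runs end to end.

# Schematic 5-fold CV over the (lambda, gamma) grid with warm starts.
# fit_plqtf() is a hypothetical placeholder, NOT the plmR API.
check_loss <- function(r, tau) r * (tau - (r <= 0))

fit_plqtf <- function(X, y, z, tau, lambda, gamma, k = 2, init = NULL) {
  # Placeholder: a real implementation would run Algorithm 3.1, starting its
  # iterations from `init` (the warm start). Here we return a null fit of the
  # right shape so that the sketch is executable.
  list(beta = if (is.null(init)) rep(0, ncol(X)) else init$beta,
       g = local({ m <- mean(y); function(znew) rep(m, length(znew)) }))
}

cv_plqtf <- function(X, y, z, tau, lambdas, gammas, k = 2, nfolds = 5) {
  folds  <- sample(rep_len(1:nfolds, length(y)))
  cv_err <- matrix(0, length(lambdas), length(gammas))
  for (f in 1:nfolds) {
    tr  <- folds != f
    fit <- NULL
    # traverse the grid from the largest penalties downward, reusing the
    # previous solution as the warm start for the next grid point
    for (i in order(lambdas, decreasing = TRUE)) {
      for (j in order(gammas, decreasing = TRUE)) {
        fit  <- fit_plqtf(X[tr, ], y[tr], z[tr], tau, lambdas[i], gammas[j],
                          k = k, init = fit)
        pred <- drop(X[!tr, ] %*% fit$beta) + fit$g(z[!tr])
        cv_err[i, j] <- cv_err[i, j] + mean(check_loss(y[!tr] - pred, tau)) / nfolds
      }
    }
  }
  best <- which(cv_err == min(cv_err), arr.ind = TRUE)[1, ]
  list(lambda = lambdas[best[1]], gamma = gammas[best[2]], cv_error = cv_err)
}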
The implemented qBCD algorithms for the smoothing spline and trend filtering methods, named plmR, are available at https://github.com/SangkyuStat/plmR for public use and reproducibility.

Figure 3.1 Configurations of the function g(z) for Models 1 and 2. The structures of the functions become increasingly locally heterogeneous from left to right.

3.4.2 Simulation Setting and Results

We now demonstrate our method through simulations under several scenarios and settings. We first generate $\widetilde{X} = (\widetilde{x}_1, \ldots, \widetilde{x}_{p+1})$ from $N_{p+1}(0_{p+1}, \Sigma)$, where $\Sigma = (\sigma_{jk})$ with $\sigma_{jk} = 0.5^{|j-k|}$ and $j, k = 1, \ldots, p+1$. We then set $z = \Phi(\widetilde{x}_{25})$, $x_j = \widetilde{x}_j$ for $j = 1, \ldots, 24$, and $x_{j-1} = \widetilde{x}_j$ for $j = 26, \ldots, p+1$. The true models we consider are as follows:

Model 1 (Less heterogeneous):
\[
y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + e^{3z_i}\sin(6\pi z_i)/7 + \epsilon_i,
\]
Model 2 (More heterogeneous):
\[
y_i = x_{i6}\beta_1 + x_{i12}\beta_2 + x_{i15}\beta_3 + x_{i20}\beta_4 + 2\min(z_i, 1 - z_i)^{0.2}\sin\!\left(\frac{2.85\pi}{0.3 + \min(z_i, 1 - z_i)}\right) + \epsilon_i,
\]
for $i = 1, \ldots, n$. We vary the error settings to create different scenarios:

Scenario 1 (Normal): $\epsilon_i \sim \sigma_\epsilon N(0, 1)$
Scenario 2 (Heavy Tail): $\epsilon_i \sim \sigma_\epsilon T_{df=3}$
Scenario 3 (Heteroskedastic): $\epsilon_i = \sigma_\epsilon x_{i1}\zeta_i$, where $\zeta_i \sim T_{df=3}$

The actual forms of these functions are displayed in Figure 3.1. Various values of $\sigma_\epsilon$ are used to vary the signal-to-noise ratio (SNR) across the models and scenarios; a similar simulation design is adopted in [5] for quantile regression. We define the total signal-to-noise ratio (tSNR) for a quantile level $\tau$ as
\[
\mathrm{tSNR} = \sqrt{\frac{\|x'\beta^0 + g_0(z)\|^2}{\mathrm{Var}(\epsilon)}},
\]
where the expectation in the numerator and the variance in the denominator can be approximated by Monte Carlo (a small sketch of this calculation is given below). We vary the tSNR level from 4 to 16 on a logarithmic scale and then calculate the metrics for each model. Specifically, we consider three error metrics similar to those of [63]: the in-sample mean of quantile error (MQE), the $\ell_2$-norm error, and the mean absolute deviation error (MADE), defined as follows:

1. $n^{-1} \sum_{i=1}^{n} \rho_\tau\big(x_i'\widehat{\beta} + \widehat{g}(z_i) - (x_i'\beta^0 + g_0(z_i))\big)$: overall MQE accounting for both $\widehat{\beta}$ and $\widehat{g}$.
2. $\|\widehat{\beta} - \beta^0\|_2$: $\ell_2$-norm error for $\widehat{\beta}$.
3. $n^{-1} \sum_{i=1}^{n} |\widehat{g}(z_i) - g_0(z_i)|$: MADE for $\widehat{g}$.

The metrics are computed over 150 repetitions of randomly generated datasets for each tSNR value, and the median of each metric is reported as the final result. We consider $p = 100$ and $p = 600$ for the low- and high-dimensional cases, respectively, with $n$ fixed at 300. The values of $\beta$ are $(0.8, 1.6, 1.6, 2.4)$ in both cases. For each method, the algorithm is fitted to the training dataset and the metrics are then calculated on the in-sample testing dataset. We present results using both optimally tuned and CV-tuned parameters; the optimally tuned parameters are those with the smallest error along the tuning path, and the parameters are tuned finely to allow a fair comparison across all models. We fix $k = 2$ for the trend filtering method, use cubic smoothing splines for the smoothing spline method, and use cubic B-splines for the B-spline analysis. We tune only $\lambda$ and $\gamma$ for the smoothing spline and trend filtering methods, and only $\lambda$ and the number of knots for the B-spline model, for two reasons: (1) varying the remaining settings has minimal impact on the results, and (2) computational efficiency.
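As a small illustration of how the tSNR can be approximated, the R sketch below draws Monte Carlo samples from the design described above (Model 1, Scenario 1), estimates the signal norm, and backs out the value of sigma_eps that attains a target tSNR. The Monte Carlo sample size and the target tSNR are our own illustrative choices.

# Monte Carlo approximation of tSNR for Model 1 / Scenario 1 (illustrative values)
set.seed(42)
p <- 100
N <- 5e4                                              # Monte Carlo sample size
Sigma <- 0.5^abs(outer(1:(p + 1), 1:(p + 1), "-"))
Xt <- matrix(rnorm(N * (p + 1)), N) %*% chol(Sigma)   # draws from N_{p+1}(0, Sigma)
z <- pnorm(Xt[, 25])                                  # z = Phi(x~_25)
X <- Xt[, -25]                                        # remaining p columns form x
beta <- c(0.8, 1.6, 1.6, 2.4)
signal <- X[, 6] * beta[1] + X[, 12] * beta[2] + X[, 15] * beta[3] +
  X[, 20] * beta[4] + exp(3 * z) * sin(6 * pi * z) / 7
signal_norm2 <- mean(signal^2)        # approximates ||x'beta^0 + g_0(z)||^2
# Scenario 1: eps = sigma_eps * N(0, 1), so Var(eps) = sigma_eps^2 and
# tSNR = sqrt(signal_norm2) / sigma_eps.  Solve for sigma_eps at a target tSNR:
target_tSNR <- 8
sigma_eps <- sqrt(signal_norm2) / target_tSNR
c(signal_norm2 = signal_norm2, sigma_eps = sigma_eps)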
The simulation results for each model and case are displayed in Figures 3.2, 3.3, 3.4, and 3.5. Figures 3.2 and 3.4 demonstrate that when g is a less heterogeneous function, the performance of the PLQTF method is comparable to that of PLQSS, with no significant differences observed. PLQBS, however, does not perform as well as the other methods, especially when the tSNR is large. For the more heterogeneous function on low-dimensional data (Figure 3.3), there is a significant difference between PLQTF and the other models in Scenarios 1 and 2, and this gap widens as the tSNR grows; for Scenario 3, the performance of PLQTF and PLQSS is similar across the tSNR levels. In Figure 3.5, the high-dimensional case, PLQTF performs better than the other methods across the tSNR levels except in Scenario 3, where trend filtering outperforms the other methods only as the tSNR grows large. In these scenarios we also observe that PLQBS is unstable in terms of MQE, which appears as a marked gap between the CV-tuned and optimally tuned errors.

Figure 3.2 Model 1 low-dimensional error comparisons for a sequence of tSNR values from 4 to 16. n is set to 300 for all cases, and the number of simulation repetitions is 150.

Figure 3.3 Model 2 low-dimensional error comparisons for a sequence of tSNR values from 4 to 16. n is set to 300 for all cases, and the number of simulation repetitions is 150.

Figure 3.4 Model 1 high-dimensional error comparisons for a sequence of tSNR values from 4 to 16. n is set to 300 for all cases, and the number of simulation repetitions is 150.

Figure 3.5 Model 2 high-dimensional error comparisons for a sequence of tSNR values from 4 to 16. n is set to 300 for all cases, and the number of simulation repetitions is 150.

3.5 Real Data Analysis

3.5.1 Environment and Genetics in Lung Cancer Etiology (EAGLE) Study

The EAGLE study, a comprehensive molecular epidemiological investigation, employed a case-control design to examine lung cancer in the Lombardy Region of Italy from 2002 to 2005. This large-scale, multicenter research initiative incorporated a significant biospecimen component. The study enrolled a cohort exceeding 2,000 newly diagnosed lung cancer patients, encompassing both sexes and spanning ages 35 to 79. These cases, representing all histological types of verified lung cancer, were matched with an equivalent number of healthy population-based controls; the matching criteria included age, gender, and residence. Patient recruitment occurred across 13 hospitals, while control subjects were randomly selected from the same residential areas as the cases. This approach ensured a representative sample and minimized potential geographical biases. The research protocol involved extensive data collection through two primary methods: an interview-based computer-assisted questionnaire and a self-administered questionnaire. These instruments gathered comprehensive information on various factors, including demographic characteristics, detailed smoking history, family history of lung cancer and other cancers, previous lung diseases, medications, diet, alcohol, attempts at quitting smoking, anxiety, depression, personality scores, occupation, and reproductive and residential history. This multifaceted approach to data collection facilitated a thorough examination of potential risk factors and confounding variables associated with lung cancer development.
Further details of the study can be found in [33].

3.5.2 Real Data Analysis Results

For our real data analysis, the dataset comprises 132 tumor samples from the EAGLE study, each characterized by whole genome sequencing data, DNA methylation profiles from the EPIC array, and detailed clinical information regarding smoking history. The model aims to identify the relationship between DNA methylation in tumors and the effect of tobacco smoking on mutational signatures. Specifically, the outcome variable is the mutational signature, with a focus on the tobacco-related SBS4 signature, quantified by the number of mutations attributed to this signature in the whole genome sequencing data of the EAGLE study. The number of mutations is log-transformed for the data analysis. The methylation data, represented by real values ranging from 0 to 1 for each probe, serve as covariates. To manage computational demands, the analysis is restricted to 1,713 CpG probes that show significant differential methylation in tumors compared to normal tissues. Smoking history is treated as a factor that has a nonlinear relationship with the mutational signature, and we use the total period (in years) during which the subject smoked cigarettes regularly, provided this duration is at least one year. After removing missing observations, we have n = 125 observations and p = 1,713 variables. As discussed, z is set to be the total duration of smoking. We examine the heterogeneous effect of smoking and the selected features across quantile levels, considering three quantile levels, $\tau = 0.2$, $0.5$, and $0.8$.

The aim of this analysis is twofold: to examine the variables selected by PLQTF, and to compare the prediction errors of all methods. We applied 5-fold cross-validation (CV) to the full dataset and present the results.

Figure 3.6 Estimated $\widehat{g}$ for the real data analysis with different quantile levels, $\tau = 0.2$, $0.5$, and $0.8$. Here y is set to be the number of mutational signatures, and z is set to be the total duration of smoking (in years).

The fitted nonparametric term $\widehat{g}$ is shown in Figure 3.6, and its effect varies across quantile levels. While the effects at the 0.2 and 0.5 quantiles appear similar, the effect at the 0.8 quantile shows a notably larger increment. Although the estimated function shapes are similar across quantiles, their magnitudes differ, and at each quantile level the function clearly deviates from linearity. For all three quantile levels, the mutational signature increases until the smoking duration reaches around 20 years, after which it stabilizes. This implies that, for people who have smoked for 20 years or less, mutations accumulate rapidly during this period and then level off. Notably, the 0.8 quantile level experiences a greater increase than the others.
Quantile Level | Selected Variables (from the largest coefficient to the smallest)
τ = 0.2 | cg13906823, cg10792987, cg27275023, cg12492087, cg23497016, cg18786873
τ = 0.5 | cg07207982, cg07197230, cg10420527, cg10792987, cg04878000, cg09295202, cg13692543, cg27275023, cg02018277, cg18786873, cg00716257, cg07226481, cg08358166, cg07405182, cg09281805, cg13906823, cg24676817, cg06173663, cg17698295, cg06623698
τ = 0.8 | cg07197230, cg18451588, cg24127719, cg00558689, cg19080053, cg12091642, cg06947913, cg07981013, cg06796713, cg21207665, cg07686872

Table 3.1 List of selected CpG probes at different quantile levels of the number of mutations.

The selected features for the three quantile levels are presented in Table 3.1. Among these, we highlight specific features of interest. Notably, cg10420527 (LRP5 gene), selected at the median, has previously been reported as related to lung cancer, particularly in the context of smoking history [86]; this aligns our results with those findings, which were based on mean regression. Additionally, cg07197230 (CECR2 gene) is selected at both the 0.5 and 0.8 quantile levels and has been associated with alcohol use disorder, a condition closely linked to smoking [26]. cg07207982, selected at the median, has been reported to be associated with cholangiocarcinoma, which can metastasize to the lung [45]. cg10792987 is selected at the 0.2 and 0.5 quantiles, and cg13906823 at the 0.2 quantile; these findings have not yet been documented in the related literature, suggesting that they could represent new discoveries. Other features have smaller coefficients than those mentioned, and since we did not apply false discovery control, some of them may be false discoveries.

For the prediction error calculation, we partition the full dataset into training and test sets, with 115 observations for training and 10 for testing, randomly selected in each repetition. Tuning parameters are optimized using 5-fold cross-validation on the training data. This procedure is repeated 50 times to calculate the prediction error. For each repetition, the quantile-adaptive prediction error of each method is calculated as
\[
\mathrm{PE}(\tau) = \frac{\sum_{i=1}^{n} 1\{\text{subject } i \text{ in testing set}\}\, \rho_\tau\big(y_i - x_i'\widehat{\beta} - \widehat{g}(z_i)\big)}{\sum_{i=1}^{n} 1\{\text{subject } i \text{ in testing set}\}},
\]
where $\widehat{\beta}$ and $\widehat{g}$ are the estimates from the method considered. We report the median of the PE values. Because the smoothing spline method performs worse than the other methods, we only report results for PLQTF and PLQBS. The results are presented in Table 3.2. PLQTF outperforms PLQBS at every quantile level; however, since the standard errors are fairly large, the improvement is significant only at the 0.2 quantile.

Quantile Level | PLQTF | PLQBS
τ = 0.2 | 0.690 (0.084) | 1.180 (0.078)
τ = 0.5 | 0.792 (0.056) | 0.807 (0.063)
τ = 0.8 | 0.431 (0.033) | 0.460 (0.025)

Table 3.2 Median prediction error comparison over 50 repetitions. Numbers in parentheses denote standard errors.

3.6 Conclusion and Discussion

This study addresses the complexities of high-dimensional data influenced by both linear and non-linear predictors. We introduce a novel approach that integrates trend filtering with partial linear quantile regression, providing a robust and efficient method for analyzing such data.
Our model leverages the strengths of both parametric and nonparametric regression tech- niques, along with the comprehensive insights offered by quantile regression. Trend filtering enhances local adaptivity and computational efficiency, making our approach superior to traditional nonparametric methods like smoothing splines and B-spline regressions. Empirical results and theoretical analysis demonstrate that our model outperforms exist- ing methods in both accuracy and efficiency. We established rigorous theoretical properties, including convergence rates for the estimators, distinguishing our work from previous ap- proaches. In summary, our contributions include integrating parametric and nonparametric regres- sion within a partial linear framework, using quantile regression to capture the impact of covariates across the conditional distribution, and enhancing model performance with trend filtering for local adaptivity and computational efficiency. We have demonstrated that our model performs better than existing models. Future work could explore extending this methodology to different data structures and domains, as well as developing more efficient computational algorithms. 62 3.7 Proofs Throughout the proofs, we use C1, C2, . . . to denote universal constants and use D1, D2, . . . , ... k, Lg, Lh, Kx, ¯f ′, f , ℓz, Λmin, Λmax from Conditions 7-11. An explicit (though not opti- that may only depend on the constants to denote constants mal) dependence of {Dj, j = 1, 2, . . .} on the aforementioned constants can be tracked down. However, since it does not provide much more insight, we will often not present the explicit forms of {Dj, j = 1, 2, . . .}, and this will greatly help streamline the proofs. The constants {Cj, Dj, j = 1, 2, . . .} may vary from line to line. 3.7.1 Proof of Theorem 8 The proof has similar ideas as in the partial linear regression problem. Hence, we will not repeat arguments in some parts of the proof. For a given function f (x, z) = x′β + g(z) with β ∈ Rp, T V (g(k)) < ∞, we adopt the same functional τδ0,R(f ) = λ∥β∥1 + γT V (g(k)) 20δ0R + ∥x′β + g(z)∥, δ0 ∈ (0, 1), R > 0. (3.8) Define the event (cid:26) T1 := max 1≤i≤n−1 (z(i+1) − z(i)) ≤ (cid:27) . 22 log n ℓzn According to Lemma 16 in [70], there exists ¯g ∈ Hn such that on T1, the following holds T V (¯g(k)) ≤ ak · T V (g(k) 0 ), ∥¯g − g0∥∞ ≤ 22bk log n ℓzn · T V (g(k) 0 ), (3.9) where ak, bk > 0 are constants only dependent on k. Denote ∆ := 22bkLg log n ℓzn and (cid:40) B := β : ∥β∥2 ≤ Λ−1/2 min R, ∥β∥1 ≤ 20δ0R2 λ , ∥x′β∥ ≤ (cid:16) 2 + (cid:114) Λmax Λmin (cid:41) , (cid:17) R (3.10) (cid:40) G := g : ∥g∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + ∆, ∥g∥∞ ≤ Cz,k (cid:16) 20δ0R γ + 1 + (cid:114)Λmax Λmin (cid:17) R + ∆, T V (g(k)) ≤ 20δ0R2 γ + (1 + ak)Lg (cid:41) , (3.11) 63 where Cz,k > 0 is a constant that only depends on k, ℓz and it can be found in Lemma 11. Further define the following event (cid:12) (cid:12)Gn[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12) ≤ δ0 (cid:26) T2(δ0, R) := sup β−β0∈B,g−g0∈G 3.7.1.1 Key lemmas Lemma 8. Assume that 20Kxδ0R2 λ + Cz,k (cid:16) 20δ0R γ + 1 + (cid:114)Λmax Λmin (cid:17) R + ∆ ≤ 3f 2 ¯f ′ , ∆ + 2γakLg + 8 f λ2sΛ−1 min ≤ 8δ0R2, ∆ ≤ 1 20 R, δ0 ≤ f 200 · 242 . Then on T1 ∩ T2(δ0, R), it holds that τδ0,R(x′((cid:98)β − β0) + (cid:98)g(z) − ¯g(z)) ≤ R, where ¯g ∈ Hn is the one introduced in (3.9). Proof. Consider the convex combination (cid:101)β = t(cid:98)β + (1 − t)β0, (cid:101)g = t(cid:98)g + (1 − t)¯g, t ∈ [0, 1]. 
Accordingly, define √ nR2 (cid:27) . (3.12) (3.13) (3.14) (3.15) (3.16) (cid:98)f (x, z) = x′ (cid:98)β + (cid:98)g(z), ¯f (x, z) = x′β0 + ¯g(z), (cid:101)f (x, z) = t (cid:98)f + (1 − t) ¯f = x′ (cid:101)β + (cid:101)g(z). We choose t = R R+τδ0,R( (cid:98)f − ¯f ) so that τδ0,R( (cid:101)f − ¯f ) = R · τδ0,R( (cid:98)f − ¯f ) R + τδ0,R( (cid:98)f − ¯f ) ≤ R. Combining (3.9) with Lemma 11, it is straightforward to verify that on T1, (cid:101)β − β0 ∈ B, (cid:101)g − g0 ∈ G. 64 (3.17) (3.18) Moreover, (3.17) implies that to prove (3.16) it is sufficient to show τδ0,R( (cid:101)f − ¯f ) ≤ R 2 . We start with the basic inequality due to the convex problem (3.6), Enρτ (yi − x′ ≤ Enρτ (yi − x′ i (cid:101)β − (cid:101)g(zi)) + λ∥(cid:101)β∥1 + γT V ((cid:101)g(k)) iβ0 − ¯g(zi)) + λ∥β0∥1 + γT V (¯g(k)) ≤ Enρτ (yi − x′ iβ0 − g0(zi)) + En|g0(zi) − ¯g(zi)| + λ∥β0∥1 + γT V (¯g(k)) ≤ Enρτ (yi − x′ iβ0 − g0(zi)) + ∆ + λ∥β0∥1 + γT V (¯g(k)), (3.19) where the second inequality uses the Lipschitz property of check loss and the third inequality is due to (3.9). Based on (3.18), we can re-organize the inequality (3.19) to obtain ≤ Eρτ (y − x′ (cid:101)β − (cid:101)g(z)) − Eρτ (y − x′β0 − g0(z)) (cid:12) (cid:12)n−1/2Gn[ρτ (yi − x′ sup β−β0∈B,g−g0∈G ∆ + λ∥β0∥1 + γT V (¯g(k)) − λ∥(cid:101)β∥1 − γT V ((cid:101)g(k)) iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12)+ ≤ δ0R2 + ∆ + 2λ∥(cid:101)βS − β0 S∥1 − λ∥(cid:101)β − β0∥1 + 2γT V (¯g(k)) − γT V ((cid:101)g(k) − ¯g(k)). (3.20) Here in the last step, we have used (3.12) and the triangle inequality for ∥ · ∥1 and T V (·). We further develop lower bounds for (3.20). Using the identity of Knight, we obtain Eρτ (y − x′ (cid:101)β − (cid:101)g(z)) − Eρτ (y − x′β0 − g0(z)) = E = E f 2 (cid:32) ≥ ≥ (cid:90) x′((cid:101)β−β0)+(cid:101)g(z)−g0(z) (cid:0)Fy|x,z(x′β0 + g0(z) + t) − Fy|x,z(x′β0 + g0(z))(cid:1)dt 0 (cid:90) x′((cid:101)β−β0)+(cid:101)g(z)−g0(z) (cid:16) fy|x,z(x′β0 + g0(z))t + y|x,z(x′β0 + g0(z) + (cid:101)t)t2(cid:17) f ′ dt f or (cid:101)t ∈ [0, t] 1 2 0 ∥ (cid:101)f − f0∥2 − f 2 − ¯f ′ 6 · ¯f ′ 6 (cid:16) 20Kxδ0R2 λ E|x′((cid:101)β − β0) + (cid:101)g(z) − g0(z)|3, (cid:16)20δ0R γ + Cz,k + 1 + f0 = x′β0 + g0(z), (cid:114) Λmax Λmin (cid:33) (cid:17) (cid:17) R + ∆ · ∥ (cid:101)f − f0∥2 ≥ f 4 ∥ (cid:101)f − f0∥2. Here, the second equality applies Taylor expansion; the first inequality is due to Condition 8; the second inequality uses (3.18), and the last inequality holds by the condition (3.13). 65 Putting the above lower bound together with (3.20), we can proceed to obtain f 4 ∥ (cid:101)f − f0∥2 + λ∥(cid:101)β − β0∥1 + γT V ((cid:101)g(k) − ¯g(k)) √ s∥(cid:101)β − β0∥2 ≤ δ0R2 + ∆ + 2γT V (¯g(k)) + 2λ ≤ δ0R2 + ∆ + 2γT V (¯g(k)) + 2λ √ ≤ δ0R2 + ∆ + 2γT V (¯g(k)) + 8 f sΛ−1/2 min ∥(cid:101)x((cid:101)β − β0)∥ f 8 ∥ (cid:101)f − f0∥2, min + λ2sΛ−1 (3.21) where the first inequality is from Cauchy–Schwarz inequality, and the last inequality holds by the basic inequality ab ≤ 1 2b2 and the orthogonal decomposition ∥ (cid:101)f − f0∥2 = ∥(cid:101)x′((cid:101)β − β0)∥2 + ∥h(z)′((cid:101)β − β0) + (cid:101)g(z) − g0(z)∥2. Now (3.21) combined with the condition (3.14) yields 2a2 + 1 f 8 This implies that λ∥(cid:101)β−β0∥1+γT V ((cid:101)g(k)−¯g(k)) 20δ0R ∥ (cid:101)f − f0∥2 + λ∥(cid:101)β − β0∥1 + γT V ((cid:101)g(k) − ¯g(k)) ≤ 9δ0R2. ≤ 9 20R, and ∥ (cid:101)f − ¯f ∥ ≤ (cid:113) 72δ0 f R + ∆ ≤ 1 20R under the condition (3.15). Lemma 9. Assume the same conditions of Lemma 8 are satisfied. 
Then on T1 ∩ T2(δ0, R), it holds that ∥(cid:98)g(z) − g0(z)∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + ∆. Moreover, define another event: T3(δ0, R) = (cid:40) sup g∈G (cid:12) (cid:12)∥g∥2 (cid:12) n − ∥g∥2(cid:12) (cid:12) ≤ δ0R2 (cid:12) (cid:41) , where G was introduced in (3.11). Then on T1 ∩ T2(δ0, R) ∩ T3(δ0, R), it holds that ∥(cid:98)g(z) − g0(z)∥n ≤ (cid:16) 1 + (cid:112) δ0 + (cid:114)Λmax Λmin (cid:17) R + ∆. Proof. Result (3.16) in Lemma 8 together with Part (i) of Lemma 11 implies that ∥(cid:98)g(z) − ¯g(z)∥ ≤ (1 + )R. Combining this result with (3.9) gives (cid:113) Λmax Λmin ∥(cid:98)g(z) − g0(z)∥ ≤ ∥(cid:98)g(z) − ¯g(z)∥ + ∥¯g(z) − g0(z)∥ ≤ (cid:16) 1 + (cid:114) Λmax Λmin (cid:17) R + ∆. 66 To prove the second result, we first use (3.9) and (3.16) to obtain T V ((cid:98)g(k) − g(k) 0 ) ≤ T V ((cid:98)g(k) − ¯g(k)) + T V (¯g(k) − g(k) 0 ) ≤ 20δ0R2 γ + (ak + 1)Lg. Also, from Lemma 11 we know ∥(cid:98)g(z) − g0(z)∥∞ ≤ ∥(cid:98)g(z) − ¯g(z)∥∞ + ∥¯g(z) − g0(z)∥∞ ≤ Cz,k (cid:16) 20δ0R γ + 1 + (cid:114) Λmax Λmin (cid:17) R + ∆. These results show that (cid:98)g − g0 ∈ G. Hence, on T1 ∩ T2(δ0, R) ∩ T3(δ0, R), we can use the bound on ∥(cid:98)g(z) − g0(z)∥ to obtain the bound on ∥(cid:98)g(z) − g0(z)∥n, ∥(cid:98)g(z) − g0(z)∥n ≤ (cid:112)∥(cid:98)g(z) − g0(z)∥2 + δ0R2 ≤ ∥(cid:98)g(z) − g0(z)∥ + (cid:112) δ0R. Lemma 10. Denote KG := (1 ∨ Cz,k) (cid:16) 20δ0R γ + 1 + (cid:113) Λmax Λmin (cid:17) R + ∆ + (1 + ak)Lg. (i) As long as 11 log n ℓzn < 1, then P(cid:0)T1 (cid:1) ≥ 1 − 2ℓzn−10. (ii) If ((1 + (cid:113) Λmax Λmin )R + ∆)2 ≥ D1K 2 Gn −2k−2 2k+3 and K 2k+3 G (R + ∆)2k+1 ≤ D2δ2k+2 0 R4k+4nk+1, − P(T3(δ0, R)) ≥ 1 − e D3nδ2 0 R4 KG (R+∆)2 . (iii) In addition to the conditions in Part (ii), If nδ2 0R4 ≥ D4(R + ∆)2, λ ≥ D5 (cid:112)(log p)/n, (R + ∆)(cid:112)(log p)/n ≤ D6δ0R2, n−1/2K 0 R4 − P(T2(δ0, R)) ≥ 1 − 4e D8nδ2 KG (R+∆)2 − 4p−10. 1 2k+2 G (R + ∆) 2k+1 2k+2 ≤ D7δ0R2, Proof. Proof of Part (i): This is directly taken from Lemma 5 in [77]. Proof of Part (ii). A similar result (with a slightly different G) has been derived in the partial linear regression problem. Hence, we will not repeat all the arguments here. We aim n − ∥g∥2(cid:12) (cid:12). We first calculate J0(t, G) in Theorem (cid:12) (cid:12)∥g∥2 to apply Theorem 12 to bound supg∈G 12, J0(t, G) ≤ C1t (cid:115) (cid:16) ut 2KG (cid:90) 1 0 (cid:17)− 1 k+1 du = C1K 2 1 2k+2 G (cid:124) 67 (cid:123)(cid:122) ¯C1 1 2k+2 (2k + 2) 2k + 1 2k+1 2k+2 . ·t (cid:125) We then calculate the function H(·) in Theorem 12, H(u) = sup t≥0 (cid:16)2k + 1 4k + 4 = (cid:0)ut − (J −1 0 (t, G))2(cid:1) ≤ sup t≥0 (cid:0)ut − ¯C − 4k+4 2k+1 1 4k+4 2k+1 (cid:1) t (cid:17) 4k+4 2k+3 2k + 3 2k + 1 4k+4 ¯C 2k+3 1 u 4k+4 2k+3 = D1K 2 2k+3 G 4k+4 2k+3 . u The condition supg∈G ∥g∥2 ≥ H(4KG/ √ n) required by Theorem 12 is satisfied if (cid:16)(cid:0)1 + (cid:114)Λmax Λmin (cid:1)R + ∆ (cid:17)2 ≥ D2K 2 Gn −2k−2 2k+3 . Then it holds under this condition that with probability at least 1 − e−t, (cid:12) (cid:12)∥g∥2 (cid:12) n − ∥g∥2(cid:12) (cid:12) (cid:12) ≤ D3 · sup g∈G (cid:16) K 2k+3 2k+2 G 2k+1 2k+2 (R + ∆) n √ + KG(R + ∆) √ n √ t + K 2 Gt n (cid:17) . Choosing t = D4nδ2 0 R4 KG (R+∆)2 pletes the proof. and further assuming K 2k+3 G (R + ∆)2k+1 ≤ D5δ2k+2 0 R4k+4nk+1 com- Proof of Part (iii): Recall the definition of T2(δ0, R) in (3.12). We first apply the sym- metrization from Theorem 13. 
Adopting the notation there, we have QF := sup β−β0∈B,g−g0∈G ∥ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))∥ ≤ sup β−β0∈B,g−g0∈G ∥x′ i(β − β0) + g(zi) − g0(zi)∥ ≤ sup β−β0∈B ∥x′ i(β − β0)∥ + sup g−g0∈G ∥g(zi) − g0(zi)∥ (cid:16) ≤ 2 + (cid:114) Λmax Λmin 0 R4 where the first inequality is due to the Lipschitz property of ρτ (·). Setting t = nδ2 32Q2 F (cid:114) Λmax Λmin (cid:114)Λmax Λmin R + ∆ = R + ∆, 3 + 2 R + 1 + (cid:16) (cid:17) (cid:17) (cid:17) (cid:16) (3.22) , the bound (3.22) together with the condition nδ2 0R4 ≥ 128(cid:2)(3 + 2 t ≥ 4. Hence, we can invoke Theorem 13 to obtain (cid:113) Λmax Λmin )R + ∆(cid:3)2 shows that 1 − P(T2(δ0, R)) (cid:18) ≤ 4P sup β−β0∈B,g−g0∈G (cid:12) (cid:12)Enϵi[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12) ≥ (cid:19) , δ0R2 1 4 (3.23) 68 where ϵ1, . . . , ϵn is a Rademacher sequence independent from the data. To further bound the above, we will apply the concentration result in Theorem 13. Towards this end, let Y := {(yi, xi, zi)}n (cid:18) i=1 . We have E ≤ 2E sup β−β0∈B,g−g0∈G (cid:18) sup β−β0∈B,g−g0∈G (cid:12) (cid:12)Enϵi[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ (cid:12) (cid:12)Enϵi[x′ i(β − β0) + g(zi) − g0(zi)](cid:12) (cid:12) (cid:12) (cid:12) (cid:12)Y (cid:19) (cid:12) (cid:12) (cid:12)Y iβ0 − g0(zi))](cid:12) (cid:12) (cid:19) (cid:18) ≤ 2E sup β−β0∈B (cid:12) (cid:12)Enϵi[x′ i(β − β0)](cid:12) (cid:12) (cid:19) (cid:12) (cid:12) (cid:12)Y (cid:18) + 2E sup g−g0∈G (cid:12)Enϵi[g(zi) − g0(zi)](cid:12) (cid:12) (cid:12) (cid:19) , (cid:12) (cid:12) (cid:12)Y (3.24) where we have used the contraction theorem (e.g., Theorem 16.2 in [72]) in the first step. To bound the first term above, starting with Hölder’s inequality, we have (cid:18) E sup β−β0∈B (cid:12) (cid:12)Enϵi[x′ i(β − β0)](cid:12) (cid:12) (cid:19) (cid:12) (cid:12) (cid:12)Y ≤ sup β−β0∈B ∥β − β0∥1 · E(cid:0)∥En(ϵixi)∥∞ (cid:12) (cid:12)Y(cid:1) ≤ 20δ0R2 λ · C1Kx (cid:114) log p n . (3.25) In the last step, we have used the standard bound for maximum of sub-Gaussian variables (e.g. Theorem 11). To bound the second term, let (cid:98)QG := supg∈G ∥g∥n and H be the metric entropy. Applying Dudley’s entropy integral gives (cid:16) E (cid:12) (cid:12) 1 n sup g∈G ≤C3 (cid:98)QG√ n (cid:90) 1 0 i=1 (cid:16) u (cid:98)QG KG n (cid:88) ϵig(zi)(cid:12) (cid:12) (cid:17) (cid:12) (cid:12) (cid:12)Y ≤ C2 (cid:98)QG√ n (cid:90) 1 0 (cid:113) H(u (cid:98)QG, G, ∥ · ∥n)du, (cid:17) −1 2k+2 du = (2k + 2)C3 √ n (2k + 1) 1 2k+2 G 2k+1 2k+2 G , (cid:98)Q K (3.26) where the second inequality is due to the entropy bound of Corollary 1 in [56]. Now write down (cid:98)QF from Theorem 13, (cid:98)Q2 F := sup β−β0∈B,g−g0∈G ≤ sup β−β0∈B,g−g0∈G 1 n 1 n n (cid:88) (cid:2)ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))(cid:3)2 i=1 n (cid:88) i=1 (cid:2)x′ i(β − β0) + g(zi) − g0(zi)(cid:3)2 ≤ sup β−β0∈B 2 n ≤ 2K 2 x 400δ2 λ2 n (cid:88) i=1 0R4 (cid:2)x′ i(β − β0)(cid:3)2 + sup g−g0∈G 2 n n (cid:88) i=1 (cid:2)g(zi) − g0(zi)(cid:3)2 + 2 (cid:98)Q2 G. 69 We can then use the second result in Theorem 13, together with (3.24)-(3.26), to obtain (cid:32) P sup β−β0∈B,g−g0∈G (cid:12) (cid:12)Enϵi[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12) ≥ D1 · √ (cid:18) δ0R2( log p + √ n λ √ t) + (cid:98)QG (cid:114) t n + n−1/2K 1 2k+2 G 2k+1 2k+2 (cid:98)Q G (cid:33) (cid:19)(cid:12) (cid:12) (cid:12)Y ≤ e−t, t > 0. 
(3.27) Note that on T3(δ0, R)) it holds that (cid:98)QG ≤ (cid:114) sup g∈G |∥g∥2 n − ∥g∥2| + sup g∈G ∥g∥ ≤ ( (cid:112) δ0 + 1 + (cid:112)Λmax/Λmin)R + ∆, (3.28) − and P(T3(δ0, R)) ≥ 1−e 0 R4 D2nδ2 KG (R+∆)2 from Part (ii). We can combine (3.27)-(3.28) and integrate out the conditional probability to reach (cid:32) P sup β−β0∈B,g−g0∈G (cid:18) δ0R2( ≥ D3 · (cid:12) (cid:12)Enϵi[ρτ (yi − x′ iβ − g(zi)) − ρτ (yi − x′ iβ0 − g0(zi))](cid:12) (cid:12) √ t) √ log p + √ n λ + (R + ∆) (cid:114) t n + n−1/2K 1 2k+2 G (R + ∆) 2k+1 2k+2 (cid:19)(cid:33) − ≤ e−t + e D2nδ2 0 R4 KG (R+∆)2 . It is direct to verify that choosing t = 10 log p under the given conditions finishes the proof. Lemma 11. Consider any f (x, z) = x′β + g(z) that satisfies τδ0,R(f ) ≤ R. (i) Under Condition 10 it holds that ∥β∥2 ≤ Λ−1/2 (cid:16) ∥g∥ ≤ 1 + min R, ∥β∥1 ≤ (cid:114)Λmax Λmin (cid:17) R, T V (g(k)) ≤ 20δ0R2 γ . 20δ0R2 λ , ∥x′β∥ ≤ (cid:16) 2 + (cid:114) Λmax Λmin (cid:17) R, (ii) Under additional Condition 9, it holds that ∥g∥∞ ≤ Cz,k (cid:16)20δ0R γ + 1 + (cid:114) Λmax Λmin (cid:17) R, where Cz,k > 0 is a constant only dependent on k and ℓz. 70 Proof. The bound on ∥β∥1 and T V (g(k)) is clear from the definition of τδ0,R(f ) in (3.8). Using the orthogonal decomposition: ∥x′β + g(z)∥2 = ∥(cid:101)x′β∥2 + ∥h(z)′β + g(z)∥2 ≤ R2, we have ∥β∥2 ≤ Λ−1/2 min ∥(cid:101)x′β∥ ≤ Λ−1/2 min R, and ∥g(z)∥ ≤ ∥h(z)′β + g(z)∥ + ∥h(z)′β∥ ≤ R + Λ1/2 max∥β∥2 ≤ (cid:16) 1 + (cid:114)Λmax Λmin (cid:17) R. Having the bound on ∥g∥, we can further obtain ∥x′β∥ ≤ ∥x′β + g(z)∥ + ∥g(z)∥ ≤ (2 + (cid:113) Λmax Λmin It remains to bound ∥g∥∞. Define g∗(z) = ∗ ) ≤ 1, ∥g∗∥ ≤ 1 , then T V (g(k) )R. g(z) ( 20δ0R γ +1+ (cid:113) Λmax Λmin )R due to the derived bounds on ∥g∥ and T V (g(k)). Hence it is sufficient to show that sup ∗ )≤1,∥g∗∥≤1 g∗:T V (g(k) ∥g∗∥∞ ≤ Cz,k. To prove the above, we first decompose g∗ = p∗ + q∗, where p∗ is a polynomial of degree k and q∗ is orthogonal to all polynomials of degree k (with respect to the L2 inner product (cid:82) 1 ∗ ) = T V (g(k) 0 p(z)q(z)dz). Note that T V (q(k) ∥q∗∥∞ ≤ ck for a constant ck > 0 only depending on k. Now write p∗(z) = (cid:80)k ∗ ) ≤ 1. Then Lemma 5 in [56] implies that j=0 ajzj. We have (cid:112) a′V0a = ∥p∗∥ ≤ ∥g∗∥ + ∥q∗∥ ≤ 1 + ck, where a = (a0, . . . , ak) and V0 = E(v(z)v(z)′) with v(z) = (1, z, z2, . . . , zk). Thus, we obtain ∥p∗∥∞ ≤ ∥a∥1 ≤ √ k + 1∥a∥2 ≤ √ k + 1(1 + ck)λ−1/2 min (V0), which implies ∥g∗∥∞ ≤ ∥p∗∥∞ + ∥q∗∥∞ ≤ min (V0) + ck. Finally, we need to show λmin(V0) is bounded below by a constant only depending on k and ℓz. Under Condition k + 1(1 + ck)λ−1/2 √ 9 we have λmin(V0) = min ∥b∥2=1 E(b′v(z))2 ≥ ℓz · min ∥b∥2=1 E(b′v((cid:101)z))2, (cid:101)z ∼ U nif (0, 1) = ℓz · E(b′ ∗v((cid:101)z))2, 71 where b∗ is the minimizer; E(b′ ∗v((cid:101)z))2 is a constant only depending on k, and it is positive because the roots of any polynomial (not identically zero) form a set of Lebesgue measure zero. 3.7.1.2 Completion of the proof We choose R2 = κ1 · (cid:16) s log p n + n− 2k+2 2k+3 (cid:17) , γ = κ2 · δ0R2, λ = κ3 · (cid:113) log p n . Under Condition 12, it is straightforward to verify that by choosing sufficiently small constant δ0, proper constant κ2 and sufficiently large constants κ1, κ3, the conditions in Lemmas 8-10 are all satisfied when n is large enough. 3.7.2 Proof of Theorem 9 Denote the following event: T4 = (cid:110)(cid:12) (cid:12)∥Xβ∥2 (cid:12) n − ∥x′β∥2(cid:12) (cid:12) (cid:12) ≤ λ∥β∥1, ∀β ∈ B (cid:111) , where the set B was introduced in (3.10). 
The proof of Theorem 8 already shows (cid:98)β −β0 ∈ B. We aim to perform a finer analysis to obtain the desirable rates specified in Theorem 9. Recall that under Condition 11, max1≤j≤p T V (h(k) According to Lemma 16 in [70], there exist ¯hj ∈ Hn for all 1 ≤ j ≤ p such that on T1, j ) ≤ Lh where h(z) = (h1(z), . . . , hp(z)) = E(x|z). T V (¯h(k) j ) ≤ akLh, ∥¯hj − hj∥∞ ≤ 22bkLh log n ℓzn := (cid:101)∆. (3.29) Define (cid:101)yi = yi − h′(zi)β0 − g0(zi), ζi = h′(zi)(β0 − (cid:98)β) + g0(zi) − (cid:98)g(zi), and ¯h(z) = (¯h1(z), . . . , ¯hp(z)), ˇg(z) = (cid:98)g(z) + ¯h′(z)((cid:98)β − β0). Note that ˇg ∈ Hn since (cid:98)g, ¯hj ∈ Hn. Hence the optimization (3.6) implies the basic inequality Enρτ (yi − x′ ≤ Enρτ (yi − x′ i (cid:98)β − (cid:98)g(zi)) + λ∥(cid:98)β∥1 + γT V ((cid:98)g(k)) iβ0 − ˇg(zi)) + λ∥β0∥1 + γT V (ˇg(k)) 72 Using the triangle inequality for T V (·) and Lipschitz property of ρτ (·), we can continue from the above to have i (cid:98)β + ζi) + λ∥(cid:98)β∥1 Enρτ ((cid:101)yi − (cid:101)x′ ≤ Enρτ ((cid:101)yi − (cid:101)x′ ≤ Enρτ ((cid:101)yi − (cid:101)x′ iβ0 + ζi) + λ∥β0∥1 + En|(¯h(zi) − h(zi))′(β0 − (cid:98)β)| + γT V ((¯h(k)(z))′(β0 − (cid:98)β)) iβ0 + ζi) + λ∥β0∥1 + ( (cid:101)∆ + γakLh)∥(cid:98)β − β0∥1, (3.30) where the last step is due to (3.29). Moreover, using the identity of Knight, we also obtain Enρτ ((cid:101)yi − (cid:101)x′ (cid:90) (cid:101)x′ i((cid:98)β−β0) = En 0 = En(cid:101)x′ i((cid:98)β − β0) iβ0 + ζi) − Enρτ ((cid:101)yi − (cid:101)x′ (cid:16) i (cid:98)β + ζi) + Enρτ ((cid:101)yi − (cid:101)x′ (cid:17) 1((cid:101)yi − (cid:101)xiβ0 ≤ t) − 1((cid:101)yi − (cid:101)xiβ0 + ζi ≤ t) (cid:90) 1 dt (cid:16) 0 1((cid:101)yi − (cid:101)xiβ0 ≤ (cid:101)x′ (cid:90) 1 (cid:16) i((cid:98)β − β0)t) − 1((cid:101)yi − (cid:101)xiβ0 + ζi ≤ (cid:101)x′ (cid:17) i((cid:98)β − β0)t) dt i (cid:98)β) − Enρτ ((cid:101)yi − (cid:101)x′ iβ0) ≤ ∥(cid:98)β − β0∥1 · (cid:13) En(cid:101)xi (cid:13) (cid:13) 0 1((cid:101)yi − (cid:101)xiβ0 ≤ (cid:101)x′ i((cid:98)β − β0)t) − 1((cid:101)yi − (cid:101)xiβ0 + ζi ≤ (cid:101)x′ (cid:17) i((cid:98)β − β0)t) dt (cid:13) (cid:13) (cid:13)∞ ≤ λ 4 ∥∥(cid:98)β − β0∥1 with probability at least 1 − p−D1. Here, the second equality is by a change of variable, the first inequality is by Hölder’s inequality, and the last inequality is due to the fact that each (cid:101)xi is a centered sub-Gaussian vector and we have used the standard tail bound on maximum of sub-Gaussian variables. Combining the above with (3.30) yields i (cid:98)β) + λ∥(cid:98)β∥1 Enρτ ((cid:101)yi − (cid:101)x′ ≤ Enρτ ((cid:101)yi − (cid:101)x′ ≤ Enρτ ((cid:101)yi − (cid:101)x′ iβ0) + λ∥β0∥1 + (λ/4 + (cid:101)∆ + γakLh)∥(cid:98)β − β0∥1 iβ0) + λ∥β0∥1 + λ 3 ∥(cid:98)β − β0∥1, (3.31) where the last inequality holds when n is large enough since γ = o(λ), (cid:101)∆ = o(λ). Given that (cid:101)x′β0 and the conditional density the τ -level conditional quantile of only differs from fy|x,z(·) by a shifting, we can base on (3.31) to perform standard high- (cid:101)y (conditioning on (cid:101)x) is dimensional analysis of ℓ1-penalized quantile regression. Specifically, following the essential arguments in [5] (which we skip for simplicity), we can obtain that under an additional event 73 that happens with probability at least 1 − p−D2, it holds that ∥(cid:101)x′((cid:98)β − β0)∥2 ≤ D3 s log p n , ∥(cid:98)β − β0∥2 2 ≤ D3 s log p n . We should clarify that the term log(p ∨ n) in [5] is replaced by log p here since we are not aimed for uniform results over the quantile levels. 
Also, the restricted nonlinear impact condition required in [5] is not needed in our problem, because we can bound the population quantile regression objective function as in the proof of Lemma 8 thanks to Condition 12. Now we can proceed to have ∥x′((cid:98)β − β0)∥2 = (∥(cid:101)x′((cid:98)β − β0)∥ + ∥h(z)′((cid:98)β − β0)∥)2 ≤ 2∥(cid:101)x′((cid:98)β − β0)∥2 + 2∥h(z)′((cid:98)β − β0)∥2 ≤ 2∥(cid:101)x′((cid:98)β − β0)∥2 + 2Λmax∥(cid:98)β − β0∥2 2 ≤ D4 s log p n . Further intersecting with T4, we obtain ∥X((cid:98)β − β0)∥2 n ≤ ∥x′((cid:98)β − β0)∥2 + λ∥(cid:98)β − β0∥1 ≤ ∥x′((cid:98)β − β0)∥2 + 4 √ sλ∥(cid:98)β − β0∥2 ≤ D5 s log p n , where we have used the fact that (cid:98)β − β0 belongs to a cone {ν ∈ Rp : |νSc| ≤ 3|νS|}. It remains to show T4 holds with high probability. We first have ∀β ∈ B, (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 ((x′ iβ)2 − ∥x′β∥2) (cid:12) (cid:12) (cid:12) ≤ max 1≤j,k≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (xijxik − Exjxk) (cid:12) (cid:12) · ∥β∥2 (cid:12) 1, (3.32) where in the last inequality we have used the fact |a′Aa| ≤ ∥a∥2 1 ·∥A∥max, ∀a ∈ Rp, A ∈ Rp×p. are independent, zero-mean, sub-Gaussian Condition 7 implies that {xijxik − Exjxk}n i=1 random variables with the sub-Gaussian norm ∥xijxik − Exjxk∥ψ2 ≤ C1K 2 Hoeffding’s inequality in Theorem 10 together with a simple union bound to obtain that . We then use x max 1≤j,k≤p (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (xijxik − Exjxk) (cid:12) (cid:12) ≤ C2K 2 (cid:12) x (cid:114) log p n holds with probability at least 1 − 2p−10. Putting this result together with (3.32) shows that (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 ((x′ iβ)2 − ∥x′β∥2) (cid:12) (cid:12) (cid:12) ≤ λ∥β∥1, ∀β ∈ B, 74 (cid:113) log p n · 20δ0R2 λ ≤ λ. This condition is satisfied when n is large enough since as long as C2K 2 x (cid:113) log p n λ ∝ and R2 = o(λ). 3.7.3 Reference materials Theorem 10. (General Hoeffding’s inequality). Let x1, . . . , xn ∈ R be independent, zero- mean, sub-gaussian random variables. Then for every t ≥ 0, we have P (cid:18)(cid:12) (cid:12) (cid:12) n (cid:88) i=1 (cid:19) (cid:12) (cid:12) (cid:12) ≥ t xi ≤ 2 exp (cid:16) − ct2 i=1 ∥xi∥2 ψ2 (cid:80)n (cid:17) , where c > 0 is an absolute constant, and ∥ · ∥ψ2 inf{t > 0 : Eex2/t2 ≤ 2}. The above result is Theorem 2.6.2 in [73]. is the sub-gaussian norm defined as ∥x∥ψ2 = Theorem 11. Let x1, . . . , xp ∈ R be sub-gaussian random variables, which are not necessarily independent. Then there exists an absolute constant c > 0 such that for all p > 1, E max 1≤i≤p |xi| ≤ c(cid:112)log p max 1≤i≤p ∥xi∥ψ2. The above result can be found in Lemma 2.4 of [7]. Theorem 12. (Uniform convergence of empirical norms). Denote ∥g∥2 n = 1 n i.i.d. samples zi ∈ Z. For a class F of functions on Z, let J0(t, F) = C0t (cid:82) 1 0 sup{z1,...,zn}⊆Z (cid:112)H(ut/2, F, ∥ · ∥n)du, t > 0, where C0 > 0 is some universal constant and H is the metric entropy. Let RF := supg∈F ∥g∥, KF := supg∈F ∥g∥∞, and H(t) be the convex conjugate of i=1 g(zi)2 for (cid:80)n G(t) := (J −1 0 (t, F))2. Then, for all R2 F ≥ H(4KF / n) and all t > 0, √ (cid:32) P sup g∈F (cid:12) (cid:12)∥g∥2 (cid:12) n − ∥g∥2(cid:12) (cid:12) (cid:12) ≥ C1 (cid:18) 2KF J0(2RF , F) + KF RF √ n √ t (cid:19)(cid:33) + K 2 F t n ≤ e−t, where C1 > 0 is a universal constant. The above is Theorem 2.2 from [21]. 75 Theorem 13. (Symmetrization and concentration). Let Y = (y1, . . . , yn) be i.i.d. samples in some sample space Y and let F be a class of real-valued functions on Y. 
Define QF = supf ∈F ∥f ∥, (cid:98)QF = supf ∈F ∥f ∥n. Then, (cid:16) P (cid:16) P (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 n 1 n sup f ∈F sup f ∈F n (cid:88) i=1 n (cid:88) i=1 (cid:12) (f (yi) − Ef (yi)) (cid:12) (cid:12) ≥ 4QF (cid:17) (cid:114) 2t n (cid:16) ≤ 4P (cid:12) (cid:12) (cid:12) 1 n sup f ∈F n (cid:88) i=1 (cid:12) (cid:12) (cid:12) ≥ QF ϵif (yi) (cid:17) (cid:114) 2t n , ∀t ≥ 4, (cid:12) (cid:12) (cid:12) ≥ C2 · ϵif (yi) (cid:16) (cid:16) E (cid:12) (cid:12) (cid:12) 1 n sup f ∈F n (cid:88) i=1 (cid:12) (cid:12) ϵif (yi) (cid:12) (cid:12) (cid:12) (cid:12)Y (cid:17) + (cid:98)QF (cid:112)t/n (cid:17)(cid:12) (cid:12) (cid:12)Y (cid:17) ≤ e−t, ∀t > 0, where ϵ1, . . . , ϵn is a Rademacher sequence independent of Y and C2 > 0 is a universal constant. The first result is Lemma 16.1 in [72] and the second result is implied by Theorem 16.4 in [72]. 76 CHAPTER 4 FALSE DISCOVERY RATE CONTROL VIA REGIONAL QUANTILE REGRESSION ON ULTRA-HIGH DIMENSION 4.1 Introduction The advancement of technologies and biotechnologies has enabled researchers to collect vast amounts of data. Ultra-high dimensional data, characterized by an exponentially grow- ing number of candidate covariates p relative to the number of observations n, is increasingly prevalent in diverse scientific domains such as bioinformatics, machine learning, and finance. Despite the large number of covariates, often only a small subset significantly influences the response variable. Identifying these influential covariates amidst the numerous noisy ones is a formidable task. To tackle this issue, researchers employ a technique known as the screening method, which aims to isolate and retain the useful variables while discarding the irrelevant ones. Significant progress has been made in screening ultra-high dimensional data. Starting with [17], who proposed the sure independence screening (SIS) based on the marginal cor- relation between a covariate and a response variable, subsequent research expanded this model. For instance, [18] adapted the SIS for generalized linear models, [4] developed condi- tional sure independence screening for situations with prior information on covariates, and [15] introduced nonparametric independence screening for additive models. However, these methods primarily address the conditional mean of the response, making them less robust to outliers or heteroscedastic errors. This limitation can be mitigated by employing quantile regression (QR). Introduced by [32], quantile regression models the effects of covariates on conditional quantiles, offering a robust alternative to mean-based approaches. In high-dimensional and ultra-high dimen- sional settings, various researchers have advanced this field: [76] used penalized methods, [35] applied the Bayesian information criterion for variable selection, [14] considered the adaptive robust lasso model using a two-step procedure based on the ℓ1-penalized quantile regression 77 model [5], and [40] employed quantile partial correlation for screening. However, these works have focused on local quantile regression, considering single or multiple quantile levels independently. As highlighted by [87], local quantile regression can lead to delicate issues. For instance, when determining which covariates affect low quantiles, the choice of the quantile level is crucial. An incorrect choice, such as τ = 0.1, might result in missing important covariates that significantly influence the conditional quantile. 
To address this, the regional quantile regression method [50], also known as global quantile regression [87] or quantile function regression [82], can be employed. [51] pioneered this approach for the varying coefficient model, which inspired [87] to use the adaptive robust lasso model [14] in the regional setting. In variable selection, this method models quantile regression over an interval of quantile levels, ∆ ⊂ (0, 1), rather than a single level τ . Using ∆ reduces the risk of overlooking covariates that significantly contribute to other quantiles, enhancing the stability of variable selection. Thus, the regional quantile regression method offers a more general approach than the local quantile regression, as it encompasses the latter when ∆ is a singleton set. Although some researchers have developed methods for variable selection in ultra-high dimensional data using regional quantile regression, no work has considered the control of the false discovery rate (FDR). As noted by [36], recent screening methods tend to sacrifice FDR by adopting conservative cutoffs, a problem also present in quantile regression variable selection. Therefore, there is a need for a new method in quantile regression that robustly selects variables while concurrently considering multiple quantile levels and controlling the FDR. To control the FDR in variable selection problems, [3] proposed an innovative method called knockoff variable selection. The core concept of the knockoff method involves creating artificial variables, known as knockoff variables, and then performing variable selection based on both these knockoff variables and the original variables. The primary advantage of the knockoff method is its ability to guarantee exact FDR control with finite samples. While the 78 method was initially developed for fixed design matrices X, recent extensions have adapted it for random design matrices within the model-X framework, as demonstrated by [9], [60], and [79]. However, in our study, we adopt a multi-step procedure similar to the approaches of [36] and [2] for ultra-high dimensional variable selection with a fixed design matrix. Ad- ditionally, [65] applied this two-step method directly to the log-contrast model, highlighting its versatility and effectiveness. To the best of our knowledge, we are the first to propose the knockoff method for a regional quantile regression setting. In a subsequent section, we demonstrate that adapting the knockoff method to this context is not a straightforward modification of the original approach due to the involvement of multiple quantile levels. Consequently, we present a new knockoff procedure specifically designed for regional quantile regression, which addresses the limitation of using a single quantile level and concurrently controls the FDR level. The rest of this chapter is organized as follows: In Section 4.2, we introduce our new screening method. In Section 4.3, we present our novel knockoff procedure for global quantile regression. Section 4.4 provides numerical results from various experimental setups. In Section 4.5, we demonstrate the effectiveness of our method using real data. Finally, in Section 4.6, we summarize our findings. 4.2 Proposed Screening Method In this section, we introduce our screening method. Before we proceed to the estimation, we begin with introducing the models. 
We consider the following regional quantile regression form: Qy|X(τ ) = α∗(τ ) + Xβ∗(τ ), τ ∈ ∆, (4.1) where ∆ = [τ1, τ2] ⊆ (0, 1) is a closed interval of quantile levels, y = (y1, . . . , yn)′ ∈ Rn is a response variable, Xj = (x1j, . . . , xnj)′ ∈ Rn is a column vector for covariate matrix X = (X1, . . . , Xp) ∈ Rn×p, β(τ ) = (β1(τ ), . . . , βp(τ ))′ ∈ Rp is a vector of coefficients, 79 α(τ ) ∈ R is an intercept. We denote x′ i to be a row vector for X. 4.2.1 Marginal screening via integrated regional quantile estimator We now move on to our screening procedure. Here, the covariate matrix X is ultrahigh- dimensional and contains a vast number of covariates. We assume that the dimension p of each row of X is allowed to grow with sample size n at an exponential rate, that is log p = O(na) for some constant a. Our goal is to identify a set of active covariates associated with the response by filtering out as many unrelated variables as possible, and coincidently find for which we propose a marginal quantity to rank the covariates. To denote the significant variables, we define the index set M(∆) = (cid:8)1 ≤ j ≤ p : ∃τ ∈ ∆, β∗ j (τ ) ̸= 0(cid:9) . We assume that the number of active variables is smaller than the sample size: |M(∆)| ≤ n. Intuitively, we can expect that if the corresponding variable Xj is important, then the integration of the absolute value of its coefficients within the region should be large, and it should not be close to 0. Then naturally, we can utilize this property as a measure of importance for each covariate within the quantile region. For the estimation, we consider a nonparametric regression approach using B-spline ap- proximation to the quantile coefficients and the intercept within the quantile region. That is, let βj(τ ) = B(τ )T bj, αj(τ ) = B(τ )T aj, where B(τ ) is a B-Spline function for τ which is explained to be later. Without loss of generality, we assume each covariate xj is standardized to have E(xj) = 0, V ar(xj) = 1. Then we proceed to a marginal quantile screening ((cid:98)aj,(cid:98)bj) ∈ arg min a,b L (cid:88) n (cid:88) ℓ=1 i=1 (cid:16) yi − xijB(τℓ)T b − B(τℓ)T a (cid:17) , ρτℓ j = 1, . . . , p, where ρτ (x) = x(τ − I(x ≤ 0)) is a check loss function for a quantile level τ , {τl} is a partition of the interval ∆, and using them as knots we have constructed N = L + k normalized B-spline basis functions B(τ ) = (B1(τ ), . . . , BN (τ )) of order k + 1 and L is the number of discrete quantile levels. For the measure of the importance of each predictor, we 80 (cid:82) use the integrated coefficient estimator ∆ | (cid:98)βj(τ )|2dτ , where |∆| denotes the cardinality. Here, (cid:98)βj(τ ) = B(τ )T(cid:98)bj, so the key integral to evaluate is (cid:82) ∆ B(τ )B(τ )T dτ , which can be even computed or approximated before obtaining (cid:98)bj for each predictor. Then our index set, derived from our model, can be written as follows 1 |∆| (cid:26) (cid:99)M(∆) = 1 ≤ j ≤ p : 1 |∆| (cid:90) ∆ | (cid:98)βj(τ )|2dτ ≥ ν0 (cid:27) , where ν0 > 0 is a cutoff threshold. We show the theoretical result, the sure screening property, for our model. Before we prove the theorem, we first introduce some regularity conditions and some population quantities: For each j ∈ M(∆) (a∗ j , b∗ j ) ∈ arg min a,b∈RN 1 L L (cid:88) ℓ=1 (cid:16) Eρτℓ y − xjB(τℓ)T b − B(τℓ)T a (cid:17) (fj(τ ), gj(τ )) ∈ arg min f,g∈R Eρτ (y − xjf − g). Conditions (4.2) (4.3) 1. (Bounded covariates) The covariates are bounded: maxj∈M(∆) |xj| ≤ Kx. 2. 
(Lipschitz condition) The functions {fj(τ), gj(τ)}, j ∈ M(∆), belong to the class of functions whose kth derivative satisfies a Lipschitz condition of order c: |h^(k)(s) − h^(k)(t)| ≤ c0 |s − t|^c for some positive constant c0, where k is a nonnegative integer and c ∈ (0, 1] satisfies d = k + c > 0.5.
3. (Bounded response density) For each j ∈ M(∆), the conditional density f_{y|xj}(t), evaluated at t = xj fj(τ) + gj(τ) and at t = xj B(τ)^T b∗j + B(τ)^T a∗j, is bounded uniformly in (y, xj, τ): 0 < f ≤ f_{y|xj}(t) ≤ f̄ < ∞. Moreover, its derivative is uniformly bounded: sup_t |f′_{y|xj}(t)| ≤ f̄′.
4. (Identifiability of the true model) The nonlinear impact coefficient
q := (3f / (2f̄′)) · inf_{a,b, j∈M(∆)} (E(a xj + b)^2)^{3/2} / E|a xj + b|^3
is bounded away from zero.
5. (Strong marginal regional signal) min_{j∈M(∆)} (1/|∆|) ∫_∆ |fj(τ)| dτ ≥ κ n^{−γ} for some constants κ, γ > 0.
6. γ < 1/2, N^{−d} n^γ = o(1), N^{3/2} n^{−1/2} √(log n) = o(1), and max(N^2 log N, n) = o(L).
Conditions 1, 2, and 3 are common in the quantile regression literature that uses B-spline bases [25, 63, 82]. Condition 4 is similar to conditions in [87, 5]; it guarantees identifiability of the true model. Condition 5 is common in the screening literature [25]. Condition 6 describes how fast the number of basis functions may grow with the sample size. Under these conditions, we have the following sure screening property.
Theorem 14. Suppose τ1, . . . , τL are i.i.d. Unif(∆) and assume Conditions 1–6. Choosing ν0 = (κ/4) n^{−γ}, the sure screening property holds:
P(M(∆) ⊆ M̂(∆)) → 1, as n → ∞.
We defer the proof of Theorem 14 to Section 4.7.1. Theorem 14 ensures that, given a suitable number of observations, all important variables are retained. However, because we aim to retain every important variable, some false discoveries are inevitably included among the selected variables and should be removed. We now move on to controlling the FDR for these selected variables.
4.3 FDR Control via Regional QR-Knockoff
4.3.1 A short review of the model-X knockoff
Various works related to the knockoff method have recently been proposed. We focus on the model-X approach, which allows the knockoff method to be extended to high-dimensional models. For detailed explanations of the model-X approach and knockoff methods, see [9] and the references therein. We introduce one definition from [9].
Definition 1. A variable xj is said to belong to the null set H0 (we also say xj is null) if and only if y is independent of xj conditionally on the other variables in X, for j = 1, . . . , p.
We now introduce the knockoffs. Let y be a response vector and X be a design matrix. We say that X̃ is the model-X knockoff matrix for X if it has the following two properties:
1. (X, X̃) =_d (X, X̃)_swap(S);
2. X̃ ⊥⊥ y | X,
where S ⊂ H0 and swap(S) means interchanging the positions of the knockoff and original variables corresponding to the index set S. Because of properties 1 and 2, we have (X, X̃) | y =_d (X, X̃)_swap(S) | y; that is, the Xj with j ∈ S are conditionally independent of y. This result allows us to proceed with the knockoff method. However, this fact is generally not true in the regional quantile regression setting; we explain the reason in the next section.
4.3.2 Problem with the model-X knockoff for the regional quantile regression
In the standard linear model setting, we say that a variable xj is null if βj = 0, for j ∈ {1, . . . , p}; this is satisfied when we assume Proposition 2.2 of [9].
That is,
xj is null ⇔ βj = 0 ⇔ xj is conditionally independent of y.
For quantile regression, the same holds at a single quantile level, since
βj(τ) = 0 for all τ ∈ (0, 1) ⇔ xj is conditionally independent of y.
Therefore, the original knockoff method can be adapted directly to that case using the original response y. However, the method cannot be adapted directly to the regional case, because the absence of a signal within a region does not imply conditional independence. In particular,
βj(τ) = 0 for all τ ∈ ∆ does not guarantee that xj is conditionally independent of y;
that is, the left-hand side does not imply the right-hand side.
4.3.3 Model-X knockoff for the regional quantile regression
Because of the problem described in the previous section, we cannot directly apply the knockoff method to our model. To solve this problem, we propose the following proposition.
Proposition 1. Let ỹ ∈ R^n be constructed as
ỹi = Q_{yi|xi}(τ1)   if yi ≤ Q_{yi|xi}(τ1),
ỹi = yi              if Q_{yi|xi}(τ1) ≤ yi ≤ Q_{yi|xi}(τ2),
ỹi = Q_{yi|xi}(τ2)   if yi ≥ Q_{yi|xi}(τ2),
where xi collects the variables selected by the screening method, for i = 1, . . . , n. Then Xj is conditionally independent of ỹ if and only if βj(τ) = 0 for all τ ∈ ∆ = [τ1, τ2].
The proof of Proposition 1 is provided in Section 4.7.2.
Remark 1. In practice, the true conditional quantiles used as cutoffs in Proposition 1 are unknown, so we use estimated conditional quantiles instead. The conditional quantiles can be estimated by various methods; we use the non-crossing model of [6] and [37], which we call the non-crossing quantile lasso, because it guarantees non-crossing conditional quantile estimates for high-dimensional data across multiple quantile levels. This property ensures that Q̂_{yi|xi}(τ1) ≤ Q̂_{yi|xi}(τ2) for i = 1, . . . , n whenever τ1 ≤ τ2. We provide more details in the simulation studies of Section 4.4.
Proposition 1 restores the conditional-independence characterization on the region ∆. As discussed in Definition 1, this is the key property that needs to hold, and with it in place we can incorporate the model-X knockoff method.
There are various ways to construct knockoff variables. If prior information on the distribution of the design matrix is available, the exact construction can be used. In most cases, however, such prior knowledge is unavailable, so we use the second-order approximation: we construct X̃ so that (X, X̃) and (X, X̃)_swap(S) match in their first and second moments rather than in their full distributions. Clearly, E(X) = E(X̃). For the covariance structure, we write
Cov(X, X̃) = ( Σ             Σ − diag(s)
               Σ − diag(s)   Σ ),
where s is a p-dimensional vector chosen so that this joint covariance matrix is positive definite, and Cov(X) = Σ. When n > 2p, [3] introduced two methods to select the vector s:
1. Equi-correlated knockoffs: take sj = 2λmin(Σ) ∧ 1 for all j = 1, . . . , p, where ∧ denotes the minimum operator.
2. Semidefinite programme (SDP) knockoffs: take s to minimize Σ_j (1 − sj) subject to sj ≥ 0 for all j = 1, . . . , p and diag(s) ⪯ 2Σ.
For some special cases, such as high-dimensional or non-Gaussian designs, this second-order approximation may not work well; refinements that address these issues are discussed in [9] and [60].
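To make the second-order construction concrete, the following is a minimal sketch of sampling Gaussian second-order knockoffs with the equi-correlated choice of s. It is an illustration only: the function names are ours, the mean μ and covariance Σ are treated as known (in practice they would be estimated), and a small ridge term is added for numerical stability; this is not the reference implementation of [9].

```python
import numpy as np

def equicorrelated_s(Sigma):
    # Equi-correlated choice from the text: s_j = 2 * lambda_min(Sigma) ∧ 1.
    lam_min = np.linalg.eigvalsh(Sigma).min()
    return np.full(Sigma.shape[0], min(2.0 * lam_min, 1.0))

def sample_gaussian_knockoffs(X, mu, Sigma, s, seed=None):
    """Sample X_tilde row by row from the Gaussian conditional distribution
    implied by the joint covariance [[Sigma, Sigma - S], [Sigma - S, Sigma]]:
      mean = X - (X - mu) Sigma^{-1} S,  cov = 2S - S Sigma^{-1} S,  S = diag(s)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    S = np.diag(s)
    Sigma_inv_S = np.linalg.solve(Sigma, S)              # Sigma^{-1} diag(s)
    cond_mean = X - (X - mu) @ Sigma_inv_S
    cond_cov = 2.0 * S - S @ Sigma_inv_S
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))  # stabilizing ridge
    return cond_mean + rng.standard_normal((n, p)) @ L.T
```

The conditional mean and covariance above are the usual Gaussian conditioning formulas applied to the joint covariance displayed earlier; with a valid s (for example, the equi-correlated choice) the resulting pair (X, X̃) matches the required first and second moments.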
A detailed treatment of these refinements is beyond the scope of this chapter, so we do not discuss them here.
4.3.4 FDR control for the regional quantile regression
To find important variables, we compute statistics Wj for j = 1, . . . , p. The statistic Wj compares the jth variable in X with its knockoff in X̃ and measures how important the variable is. We calculate Wj as
Wj(∆) = Σ_{l=1}^{L} wl ( |β̂j(τl)| − |β̃j(τl)| ),
where L denotes the number of quantile levels τl within ∆ and wl is the weight assigned to level τl. Here β̂j(τl) denotes the non-crossing quantile lasso estimator for Xj at level τl, and β̃j(τl) denotes the corresponding estimator for the knockoff X̃j at level τl. We omit the argument (∆) afterward whenever it is clear from the context. A value of Wj far above 0 suggests that the jth variable is active, whereas a value close to 0 suggests that it is inactive. For the null variables, we have the following lemma.
Lemma 12. The signs of the null Wj's are i.i.d. coin flips conditionally on (|W1|, . . . , |Wp|).
Lemma 12 is Lemma 3.3 in [9]; therefore, we omit the proof. With Lemma 12, for fixed t > 0, we get
# {j : Wj ≤ −t} ≥ # {null j : Wj ≤ −t} =_d # {null j : Wj ≥ t} .
The false discovery proportion (FDP) is
FDP(t) = # {null j : Wj ≥ t} / # {j : Wj ≥ t} ,
and we can estimate it conservatively by
FDP̂(t) = # {j : Wj ≤ −t} / # {j : Wj ≥ t} .
To control the FDR at level q, we set the threshold based on knockoff+ from [3]:
τ+ = min { t : (1 + # {j : Wj ≤ −t}) / # {j : Wj ≥ t} ≤ q } .
Setting Â = {j : Wj ≥ τ+}, we control the usual FDR; that is,
FDR = E [ |{j ∈ Â ∩ H0}| / (|Â| ∨ 1) ] ≤ q,
where ∨ denotes the maximum operator. As a result, if Proposition 1 and Lemma 12 hold, then by Theorem 3.4 in [9] we attain FDR control on the quantile region ∆.
4.4 Simulation Results
4.4.1 Screening performance
In this subsection, we use simulated examples to assess the finite-sample performance of the proposed method and compare it with related methods. The original sure independence screening paper uses the sample correlation between the response and each covariate; in our setting, it is natural to base this marginal measure on the quantile correlation instead. [40] introduced the sample estimate of the quantile correlation:
qcor̂_τ{y, x} = (1 / √((τ − τ^2) σ̂x^2)) · (1/n) Σ_{i=1}^{n} ψτ(yi − Q̂τ,y)(xi − x̄),
where ψτ(x) = τ − I(x < 0), x̄ = (1/n) Σ_{i=1}^{n} xi, and σ̂x^2 = (1/n) Σ_{i=1}^{n} (xi − x̄)^2. Also, Q̂τ,y = inf {y : Fn(y) ≥ τ} is the sample τ-th quantile of y1, . . . , yn, and Fn(y) = (1/n) Σ_{i=1}^{n} I(yi ≤ y) is the empirical distribution function. We denote this quantile correlation screening (QCOR, Method 1) as one of our screening methods. The other screening methods are well-known methods from the literature: model-free screening using the projection correlation (PCOR, Method 2) from [36] and quantile-adaptive screening (QaSIS, Method 3) from [25]. These are all model-free and flexible screening methods that are adaptable to various scenarios.
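Returning to Method 1, a minimal sketch of the quantile-correlation statistic defined above is given below; the NumPy implementation and function names are our own illustration of the formula, and the sample quantile is computed with NumPy's default definition rather than the inf-based definition stated above.

```python
import numpy as np

def quantile_correlation(y, x, tau):
    """Sample quantile correlation of [40]:
    qcor_tau{y, x} = (1/n) sum_i psi_tau(y_i - Q_hat) * (x_i - xbar)
                     / sqrt((tau - tau^2) * sigma_hat_x^2),
    with psi_tau(u) = tau - 1{u < 0} and Q_hat the sample tau-quantile of y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    q_hat = np.quantile(y, tau)
    psi = tau - (y - q_hat < 0).astype(float)
    num = np.mean(psi * (x - x.mean()))
    return num / np.sqrt((tau - tau**2) * x.var())

def qcor_screen(y, X, tau, top_k):
    # Rank covariates by the absolute quantile correlation and keep the top_k.
    scores = np.array([abs(quantile_correlation(y, X[:, j], tau))
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]
```

The same marginal-ranking idea underlies the comparisons below, with QCOR computed at a single level τ.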
We denote our method to be RQRS – ∆ (Method 4) for a model with a regional quantile level and denote RQRS – τ (Method 5) for a model with a singleton quantile level. For the simulation studies, we show multiple examples with various settings. For every simulation, we repeat every procedure 100 times, and within each replication, we rank the features in descending order by the above four screening methods and record the minimum model size (MMS) that contains all active features and True Positives. We report the 87 quantiles of these results for repetitions with cutoff percentages: 5%, 35%, 65%, 95%, the best to the worst. Throughout all examples, we generate design matrix X with N (0, Σ), where each i, jth component of Σ is defined by Σij = σ|i−j| for i, j = 1, . . . , p and σ = 0.5. The level for the targeting quantile region is set to ∆ = [0.40, 0.60], and we set τ = 0.5 for the QCOR and the singleton RCRS model. For the tuning parameters of our model, we use the quadratic B-spline function for the approximations and set L = 5. Example 1 (Linear Models) In this example, we consider linear models generated from the following linear model y = Xβ + ϵ where β = (15, 0p−5)′. Here 1s denotes the vector of s ones and 0s denotes the vector of s zeros. We set n = 200 and p = 1000. We vary the setting by changing the error distribution. We consider the following distributions: Example 1.1 : ϵi ∼ N (0, 1), i = 1, . . . , n Example 1.2 : ϵi ∼ Cauchy(0, 1), i = 1, . . . , n The results for Example 1.1 and Example 1.2 are presented in Table 4.1. It shows that every method performed well with linear models, even though we include a heavy tailed model. Example 2 (Nonlinear Models) As we mentioned before, we use the same settings for n, p, X, and ∆ as in Example 1. For these models, we consider ϵi ∼ N (0, 1). For We set n = 200 and p = 1000. Consider the 88 5% 35% 65% 95% Method 1 Example 1.1 (QCOR) Example 1.2 Method 2 Example 1.1 (PCOR) Example 1.2 Method 3 Example 1.1 (QaSIS) Example 1.2 Method 4 Example 1.1 (RQRS – ∆) Example 1.2 Method 5 Example 1.1 (RQRS – τ ) Example 1.2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 5 5 5 5 Table 4.1 MMS result table for Example 1 with 100 repetitions following nonlinear data-generating models: Example 2.1 : yi = 5xi1 + 2 sin(πxi2/2) + 2xi3I(xi3 > 0) + 2 exp(5xi4) + ϵi, i = 1, . . . , n Example 2.2 : yi = 3xi1 + 3x3 i2 + 3x−1 i3 + 5I(xi4 > 0) + ϵi, i = 1, . . . , n Here, I(·) denotes the indicator function. The results for Example 2.1 and Example 2.2 are presented in Table 4.2. Among the methods, the performance of nonlinear models was moderately good except for Method 3 (QaSIS) in Example 2.1. Even though this model is nonparametric, we see that it does not work on some complex nonlinear models. Method 4 also worked slightly less than the other methods, but it was not distinguishably worse. Example 3 (Heteroskedastic Error Model) We consider a heteroskedastic error here, where the error is multiplied by a covariate. We 89 5% 35% 65% 95% Method 1 Example 2.1 (QCOR) Example 2.2 Method 2 Example 2.1 (PCOR) Example 2.2 4 4 4 4 4 4 4 4 Method 3 Example 2.1 9.95 20 (QaSIS) Example 2.2 Method 4 Example 2.1 (RQRS – ∆) Example 2.2 Method 5 Example 2.1 (RQRS – τ ) Example 2.2 4 4 4 4 4 4 5 4 5 4 5 4 4 4 40 4 5.05 5 5 4 74.05 7 5.35 14.05 4 5 4 5 6.05 5.05 Table 4.2 MMS result table for Example 2 with 100 repetitions generate (cid:101)X ∼ N500(0, Σ) where Σ = σ|i−j|, σ = 0.5, X501 ∼ Ber(0.5). Further let X = [ (cid:101)X X501] ∈ Rn×501. 
Then we generate the responses with yi = xi1 − xi3 + ϵi where ϵi ∼ (1+xi501)ξi and ξi ∼ N (0, 1). We set n = 200, just like the former examples. Note that the methods should also select xi501 as an important covariate since it is incorporated with the error term. However, for this example, Method 3 is not working because of its limitation on the design matrix. For QCOR and the singleton RQRS, we let τ = 0.6 to see the difference. The result is presented in Table 4.3. As we can see in the result, PCOR presents the worst performance here. QCOR method is quantile adaptive, but it still performs worse than that of our singleton method. For this example, the regional method works more stably than the other models. 90 5% 35% 65% 95% Method 1 (QCOR – τ = 0.6) 18.9 123.65 328 446.5 Method 2 (PCOR) 100.5 314 438.05 492.1 Method 4 (RQRS – ∆) Method 5 (RQRS – τ = 0.6) 3 3 18.60 8 78 116 248.35 424 Table 4.3 MMS result table for Example 3 with 100 repetitions 4.4.2 FDR control performance In this section, we use simulated examples to numerically assess the FDR control of the proposed RQRS-Knockoff procedure. In practice, if crossing happens while we estimate the quantiles in Proposition 1, we cannot censor the response. Therefore during this simulation, we put the response to be original yi, if crossings happen. For FDR control performance, we vary simulation settings by changing models, the num- ber of variables, and observations and see how the statistical powers and FDR change. It is assumed that the dimension of data is successfully reduced by the screening method. We consider a multivariate linear regression model: yi = x′ iβ + ϵi, i = 1, . . . , n where xi ∼ M V N (µ, Σ), µ = 0p, Σ = Ip and ϵi ∼ N (0, 1). Here, β = (2 ∗ 15, −0.5 ∗ 15, 0, . . . , 0)′ ∈ Rp, and 1s denotes a vector of s ones. The target FDR is 0.2, and consider the quantile region to be (0.3, 0.7). The data is split into two disjoint groups. We define the number of observations for the first group as n1 and for the second group as n2. Using n1 observations, conditional quantiles are estimated on the edges of the region. We denote these two quantiles to be (cid:98)Qstart and (cid:98)Qend. Then n2 responses are censored by the estimated conditional quantiles. These are estimated by using the quantile lasso and non-crossing quantile lasso. By using n2 censored response as a new response variable, adopt the model-X 91 knockoff procedure. During the knockoff procedure, the quantile lasso method is also used for the estimation, and the tuning parameters are derived from cross-validation. We repeat 100 times to generate the results. We record FDR, Power, Mean Squared Errors (MSE) for both (cid:98)Qstart and (cid:98)Qend, and the average percentage of crossing that happened among repetition. We refer Oracle to be the one using the true conditional quantiles, QLR to be quantile lasso regression, and nX-QLR to be non-crossing quantile lasso regression. Models n1 Oracle 150 QLR 150 nX-QLR 150 n2 50 50 50 p FDR Power (cid:98)Qstart (cid:98)Qend Crossings (%) 100 0.162 0.828 - - - 100 0.186 0.719 0.590 0.560 6.7% 100 0.172 0.790 0.432 0.423 Oracle 100 100 100 0.184 1 - - QLR 100 100 100 0.255 0.958 1.003 0.986 12.4% nX-QLR 100 100 100 0.305 0.976 0.780 0.754 Oracle 300 100 100 0.161 QLR 300 100 100 0.182 1 1 - - 0.255 0.247 1.1% 0% - 0% - nX-QLR 300 100 100 0.175 0.999 0.195 0.192 0% Table 4.4 Result for FDR control of various settings with 100 repetitions The results are presented in Table 4.4. 
When n1 is large and n2 is small, every result for nX-QLR is better than the corresponding result for QLR in terms of FDR and the MSEs of both quantile estimates; for both methods the FDR is still controlled at the target level. When n1 is small and n2 is larger, the power increases for both methods, but the crossing rates and the FDR results deteriorate. When n1 and n2 both grow larger, all results improve, and the MSEs remain lower for nX-QLR.
4.5 Real Data Analysis
4.5.1 Obesity risk factors via body mass index and other dietary related variables for Non-Hispanic Black women
The body mass index (BMI) is a widely used indicator of several health risk factors. It is calculated as
BMI = weight (kg) / height² (m²).
Being either underweight or overweight can lead to weight-related health issues. Studies have shown that underweight individuals are at risk of malnutrition and compromised immune function [12], while overweight and obese individuals are more likely to develop chronic health conditions compared to those with a normal BMI [47, 19]. However, the relationship between risk factors and the distribution of BMI is complex. The lower and upper parts of the BMI distribution may be influenced by different covariates, and the effect size of each covariate may vary across quantiles of the BMI distribution. Classical mean-based regression models may not be sufficient to capture this relationship. We therefore focus on quantile regions for both the underweight and the overweight ranges and examine how much the selected important variables differ under our approach.
For the data analysis, we use the National Health and Nutrition Examination Survey (NHANES) dataset [10], specifically the data collected from 2011 to 2014. The NHANES dataset provides comprehensive information on demographics, socioeconomic status, and dietary habits. Since we are interested in variables relevant to obesity in the non-Hispanic Black (NHB) women population, we select dietary and health-related variables of potential importance for BMI.
4.5.2 Real data analysis results
There are a total of 93 variables and 383 observations after removing all observations with missing values. We set the response variable to be the BMI of this population. The aim of this study is to find important variables related to obesity in a high-dimensional setting, so we intentionally add 1000 fake variables to the dataset and examine whether our method still works well. We expect that, if the FDR control works properly, most of these fake variables will not be selected after the final step. To apply our procedure, we split the dataset into three disjoint groups (G1, G2, G3) and carry out each step on the corresponding group. The detailed steps are as follows:
• Step 1 (Screening): Using n1 = 153 observations (G1), we screen variables with our method (RQRS) from Section 4.2.1.
• Step 2 (Censoring): Using n2 = 130 observations (G2), we obtain the estimators β̂(∆1) and β̂(∆2) with the non-crossing quantile lasso. From these coefficients we obtain Q̂∆1(y|x′) and Q̂∆2(y|x′) and construct ỹ, where ∆ = [∆1, ∆2].
• Step 3 (Selection by Knockoff): Using n3 = 100 observations (G3), we perform FDR control with the regional quantile regression knockoff. The pre-specified FDR level is 0.2.
To compare the effects for the lower and higher BMI groups, we use two specific ∆ intervals, [0.15, 0.45] and [0.55, 0.85].
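As a concrete illustration of Steps 2 and 3, the sketch below censors the response at estimated conditional quantiles as in Proposition 1 and then applies the knockoff+ threshold of Section 4.3.4 to precomputed statistics W. It is a simplified illustration under our own naming conventions: the conditional quantile estimates are assumed to come from a non-crossing quantile lasso fit, and W is assumed to already contain the regional statistics Wj(∆) computed from the original and knockoff designs.

```python
import numpy as np

def censor_response(y, q_low, q_high):
    """Construct y_tilde as in Proposition 1: clip y_i to the estimated
    conditional quantiles Q_hat(tau_1 | x_i) and Q_hat(tau_2 | x_i).
    If crossing occurs (q_low > q_high), keep the original response."""
    y = np.asarray(y, dtype=float)
    q_low = np.asarray(q_low, dtype=float)
    q_high = np.asarray(q_high, dtype=float)
    y_tilde = np.minimum(np.maximum(y, q_low), q_high)
    crossed = q_low > q_high
    y_tilde[crossed] = y[crossed]
    return y_tilde

def knockoff_plus_selection(W, q=0.2):
    """Knockoff+ rule: tau_plus is the smallest t with
    (1 + #{j : W_j <= -t}) / #{j : W_j >= t} <= q; select {j : W_j >= tau_plus}."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        n_neg = np.sum(W <= -t)
        n_pos = np.sum(W >= t)
        if n_pos > 0 and (1 + n_neg) / n_pos <= q:
            return np.where(W >= t)[0], t
    return np.array([], dtype=int), np.inf
```

In the analysis above, censor_response would be applied with the quantiles estimated on G2 at the endpoints ∆1 and ∆2, and knockoff_plus_selection to the statistics computed on G3 with the pre-specified level q = 0.2.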
The results are in Table 4.5 and Table 4.6.
Screened variables for ∆ = [0.15, 0.45]: Beta-cryptoxanthin, Octadecatetraenoic, Arm length, Arm circumference, Waist circumference, Abdominal, Creatinine, HDL Cholesterol, Triglyceride, Serum copper, Glycohemoglobin, Fasting glucose, 42, 96, 157, 162, 227, 292, 372, 503, 515, 516, 530, 582, 664
Screened variables for ∆ = [0.55, 0.85]: Beta-cryptoxanthin, Docosenoic, Arm length, Arm circumference, Waist circumference, Abdominal, Creatinine, HDL Cholesterol, Triglyceride, Serum copper, Glycohemoglobin, Fasting glucose, 29, 137, 162, 207, 258, 317, 502, 536, 582, 598, 657, 691, 738, 800, 834, 909, 929, 963
Table 4.5 Screened variables selected by RQRS for the two quantile regions, where the response variable is the BMI of NHB women
FDR controlled variables for ∆ = [0.15, 0.45]: Arm length, Arm circumference, Waist circumference, Abdominal, Triglyceride, Fasting glucose
FDR controlled variables for ∆ = [0.55, 0.85]: Arm length, Arm circumference, Waist circumference, Abdominal, Triglyceride
Table 4.6 FDR-controlled variables selected by RQRS-Knockoff for the two quantile regions, where the response variable is the BMI of NHB women
In Table 4.5, the numbers stand for the indices of the fake variables. This shows that screening alone may admit many false discoveries. The problem could be reduced by using a more conservative cutoff, but we instead proceed with our FDR-controlling procedure. As Table 4.6 shows, all of the fake variables chosen by the screening procedure are removed after the FDR-controlling step. The selected variables related to body measurements are sensible, since they are all closely related to BMI. Triglyceride, which is selected for both regions, is a type of fat found in the blood and is an important obesity-related measure; several studies, such as [13, 64], use both BMI and triglyceride as measures of obesity, which supports its relevance to BMI. Our study also reveals that the effect of fasting glucose differs across the BMI groups, since it is selected only in the lower BMI group. This result aligns
Our findings indicate that the regional quantile regression knockoff method offers a signif- icant improvement over existing approaches, providing a more accurate and stable variable selection process. This advancement opens new avenues for research and application in various scientific domains where ultra-high dimensional data is prevalent. Additionally, our work lays the groundwork for integrating post-selection inference tech- niques, which can provide valid statistical inference after variable selection, further enhancing the utility of our approach. Moreover, sample splitting [11], an alternative to the knockoff method, could be adapted to control FDR, offering a simpler yet effective strategy for robust variable selection. In conclusion, the proposed regional quantile regression knockoff method represents a 96 substantial step forward in high-dimensional variable selection, offering a robust, reliable, and comprehensive solution that balances the need for accuracy and FDR control. Future research could explore the incorporation of post-selection inference techniques and the adaptation of sample-splitting methods, potentially broadening the applicability and impact of our approach in high-dimensional data analysis. 4.7 Proofs 4.7.1 Proof of Theorem 14 We summarize the notations used throughout the proof here for convenience. For a = (a1, . . . , ap) ∈ Rp, denote ∥a∥q = ((cid:80)p i=1 |ai|q) 1 q for q ∈ [1, ∞) and ∥a∥∞ = max1≤i≤p |ai|. Given a square matrix A = (aij) ∈ Rp×p, λmax(A) and λmin(A) represent its largest and smallest eigenvalues respectively. For a general matrix A = (aij) ∈ Rp×q, ∥A∥2 denotes its spectral norm; ∥A∥max = maxij |aij|, ∥A∥F = (cid:113)(cid:80) i,j a2 ij . For a, b ∈ R, a ∧ b = min(a, b), a ∨ b = max(a, b). For a set A, 1A(·) is the usual indicator function. Moreover, an ≲ bn (an ≳ bn) means there exists some constant C > 0 such that an ≤ Cbn (an ≥ Cbn) for all n; thus an ≲ bn (an ≳ bn) is equivalent to an = O(bn) (an = Ω(bn)); an ≍ bn if and only if an ≲ bn and bn ≳ an; an ≫ bn means bn = o(an). Theorem 14 is a direct consequence of Lemmas 13 and 14. Hence, in the rest of the proof, we focus on these two lemmas. Throughout the proof, we use C1, C2, . . . to denote universal constants and use D1, D2, . . . , ... to denote constants that may only depend on the constants Kx, ¯f , ¯f , ¯f ′, κ, q from Conditions 1-5. An explicit (though not optimal) dependence of {Dj, j = 1, 2, . . .} on the aforementioned constants can be tracked down. However, since it does not provide much more insight, we will often not present the explicit forms of {Dj, j = 1, 2, . . .}, and this will greatly help streamline the proofs. The constants {Cj, Dj, j = 1, 2, . . .} may vary from lines to lines. Lemma 13. Assume cn → ∞ and N 3/2n−1/2cn = o(1). With probability at least 1 − 97 n(e−D1c2 n + N 2e− D2L N 2 ), the following holds: ∥(cid:98)bj − b∗ j ∥2 ≤ D3N n−1/2cn, ∀j ∈ M(∆). Proof. For notational simplicity, we will drop the subscript j throughout the proof. 
Let θ = (a, b) and introduce the following notations: L (cid:88) n (cid:88) (cid:0)yi − xiB(τℓ)T b − B(τℓ)T a(cid:1) ρτℓ i=1 ℓ=1 L (cid:88) Eρτℓ (cid:0)y − xB(τℓ)T b − B(τℓ)T a(cid:1) ℓn(θ) := 1 nL ℓ(θ) := vi(θ) := 1 L 1 L ℓ=1 L (cid:88) ℓ=1 (cid:104) Eρτℓ (cid:0)yi − xiB(τℓ)T b − B(τℓ)T a(cid:1) − Eρτℓ (cid:0)yi − xiB(τℓ)T b∗ − B(τℓ)T a∗(cid:1)(cid:105) A standard argument based on convexity shows that (cid:16) P ∥θ − θ∗∥2 ≥ δ (cid:32) (cid:17) ≤ P inf ∥θ−θ∗∥2=δ ℓ(θ) − ℓ(θ∗) ≤ sup ∥θ−θ∗∥2≤δ (cid:12) (vi(θ) − Evi(θ)) (cid:12) (cid:12) (cid:33) (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (4.4) We first lower bound inf ∥θ−θ∗∥2=δ ℓ(θ)−ℓ(θ∗). Note that θ∗ ∈ arg minθ ℓ(θ) according to (4.2). This optimality together with the identity of Knight yields that for θ satisfying ∥θ −θ∗∥2 = δ, ℓ(θ) − ℓ(θ∗) = 1 L L (cid:88) E (cid:90) xBT (τℓ)(b∗−b)+B(τℓ)(a∗−a) (cid:16) ℓ=1 0 Fy|x(xB(τℓ)T b∗ + B(τℓ)T a∗ + t) − Fy|x(xB(τℓ)T b∗ + B(τℓ)T a∗) (cid:17) dt xBT (τℓ)(b∗ − b) + B(τℓ)(a∗ − a) (cid:17)2 L (cid:88) (cid:16) E ℓ=1 − ¯f ′ 6L L (cid:88) ℓ=1 (cid:12)xBT (τℓ)(b∗ − b) + B(τℓ)(a∗ − a)(cid:12) E(cid:12) (cid:12) 3 (cid:17) δN 1/2(1 + Kx) · 1 L (cid:17) δN 1/2(1 + Kx) λmin L (cid:88) (BT (τℓ)(b∗ − b))2 + (B(τℓ)(a∗ − a))2(cid:17) (cid:16) ℓ=1 (cid:16) 1 L L (cid:88) ℓ=1 (cid:17) B(τℓ)BT (τℓ) δ2 98 (4.5) ≥ ≥ ≥ f 2L (cid:16)f 2 (cid:16)f 2 − − ¯f ′ 6 ¯f ′ 6 We now turn to upper bounding G := sup∥θ−θ∗∥2≤δ conditioning on {τℓ}, the vi(θ)’s are i.i.d. random functions, and 1 n (cid:12) (cid:12) (cid:12) (cid:80)n (cid:12) i=1(vi(θ) − Evi(θ)) (cid:12) (cid:12) . Note that sup ∥θ−θ∗∥2≤δ |vi(θ)| ≤ (1 + Kx)δ (cid:118) (cid:117) (cid:117) (cid:116)N λmax (cid:16) 1 L L (cid:88) ℓ=1 B(τℓ)BT (τℓ) (cid:17) := (1 + Kx)δCN τ Hence, we can apply bounded difference concentration inequality to obtain (cid:16) P G ≥ E(G(cid:12) (cid:12){τℓ}) + t(cid:12) (cid:12){τℓ} (cid:17) ≤ exp (cid:16)−D1nt2 δ2C 2 N τ (cid:17) , ∀t > 0. (4.6) Moreover, we apply symmetrization and contraction to derive that conditional on {τℓ}, E(G) ≤ 2E (cid:16) sup ∥θ−θ∗∥2≤δ (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:17) (cid:12) (cid:12) ϵivi(θ) (cid:12) ≤ 2 L L (cid:88) (cid:16) E ℓ=1 sup ∥θ−θ∗∥2≤δ (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:2)Eρτℓ ϵi (cid:0)yi − xiB(τℓ)T b − B(τℓ)T a(cid:1) − Eρτℓ (cid:0)yi − xiB(τℓ)T b∗ − B(τℓ)T a∗(cid:1)(cid:3)(cid:12) (cid:12) (cid:12) (cid:17) ≤ ≤ 4 L 4 L ℓ=1 L (cid:88) ℓ=1 L (cid:88) (cid:16) E sup ∥θ−θ∗∥2≤δ (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:0)xi(B(τℓ)T (b − b∗) + B(τℓ)T (a − a∗))(cid:1)(cid:12) (cid:12) (cid:12) (cid:17) ϵi (cid:32)(cid:118) (cid:117) (cid:117) (cid:12) E (cid:116) (cid:12) 1 n n (cid:88) i=1 ϵixi (cid:12) (cid:12) 2 + (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:33) (cid:12) (cid:12) ϵi 2 · ∥B(τℓ)∥2δ ≤ 4δCN τ · E (cid:16)(cid:12) (cid:12) 1 n n (cid:88) i=1 ϵixi (cid:12) + (cid:12) (cid:12) (cid:12) 1 n n (cid:88) i=1 (cid:17) (cid:12) (cid:12) ϵi ≤ D2CN τ δn−1/2, (4.7) where in the fourth and fifth inequalities we have used Cauchy–Schwarz inequality and Jensen’s inequality. From Lemma 16 we obtain P(CN τ ≤ D3) ≥ 1 − 2N 2e− D4L N 2 This result combined with (4.6) and (4.7) shows that (cid:16) (cid:17) G ≤ D5δn−1/2 + t P ≥ 1 − e −D6nt2 δ2 − 2N 2e− D4L N 2 . (4.8) Finally, putting together (4.4), (4.5) and (4.8) and applying a union bound completes the proof. 99 Lemma 14. Assume N −d = o(n−γ). 
Then, with probability at least 1 − 4 L · (cid:80) j∈M(∆) (cid:82) ((cid:82) ∆ |fj (τ )|2dτ ∆ |fj (τ )|dτ )2 , it holds that 1 L L (cid:88) (B(τℓ)T b∗ j )2 ≥ ℓ=1 κ2 4 n−2γ, ∀j ∈ M(∆). Proof. It is known (e.g., [59]) that there exist a0, b0 ∈ RN and a constant C1 > 0 such that |fj(τ ) − B(τ )T b0 j | ≤ C1N −d, sup τ sup τ |gj(τ ) − B(τ )T a0 j | ≤ C1N −d. (4.9) According to the definitions (4.3) and (4.2), we have 1 L L (cid:88) ℓ=1 Eρτℓ(y − xjfj(τℓ) − gj(τℓ)) ≤ ≤ 1 L 1 L L (cid:88) ℓ=1 L (cid:88) ℓ=1 Eρτℓ(y − xjB(τℓ)T b∗ j − B(τℓ)T a∗ j ) Eρτℓ(y − xjB(τℓ)T b0 j − B(τℓ)T a0 j ). (4.10) Then using the identity of Knight like in (4.13), we obtain 0 ≤ ≤ 1 L 1 L L (cid:88) ℓ=1 L (cid:88) ℓ=1 (cid:2)Eρτℓ(y − xjB(τℓ)T b0 j − B(τℓ)T a0 j ) − Eρτℓ(y − xjfj(τℓ) − gj(τℓ))(cid:3) (cid:16) ¯f 2 E(xj(B(τℓ)T b0 j − fj(τℓ)) + B(τℓ)T a0 j − gj(τℓ))2 + ¯f ′ 6 ≤ D1N −2d when N is large, E|xj(B(τℓ)T b0 j − fj(τℓ)) + B(τℓ)T a0 j − gj(τℓ)|3(cid:17) where the last inequality is due to (4.9). This result combined with (4.10) and Lemma 15 shows that when N is large, f 4 (cid:16) 1 L L (cid:88) ℓ=1 |B(τℓ)T b∗ j − fj(τℓ)| (cid:17)2 ≤ 1 L L (cid:88) ℓ=1 (cid:2)Eρτℓ(y − xjB(τℓ)T b∗ j − B(τℓ)T a∗ j ) − Eρτℓ(y − xjfj(τℓ) − gj(τℓ))(cid:3) ≤D1N −2d, 100 which implies 1 L (cid:80)L ℓ=1 |B(τℓ)T b∗ j − fj(τℓ)| ≤ D2N −d. We thus have (cid:118) (cid:117) (cid:117) (cid:116) 1 L L (cid:88) ℓ=1 (B(τℓ)T b∗ j )2 ≥ ≥ ≥ = 1 L 1 L 1 L L (cid:88) ℓ=1 L (cid:88) ℓ=1 L (cid:88) ℓ=1 (cid:90) 1 |∆| ∆ |B(τℓ)T b∗ j | |fj(τℓ)| − 1 L L (cid:88) ℓ=1 |B(τℓ)T b∗ j − fj(τℓ)| |fj(τℓ)| − D2N −d |fj(τ )|dτ − D2N −d + L (cid:88) ℓ=1 1 L (cid:124) |fj(τℓ)| − 1 |∆| (cid:90) ∆ |fj(τ )|dτ (4.11) (cid:123)(cid:122) :=ZL j (cid:125) Now, using Chebyshev’s inequality together with a union bound gives us P (cid:16)(cid:12) (cid:12)Z L j (cid:12) (cid:12) ≥ 1 2|∆| (cid:90) ∆ |fj(τ )|dτ, ∃j ∈ M(∆) (cid:17) ≤ 4 L · (cid:88) j∈M(∆) (cid:82) ((cid:82) ∆ |fj(τ )|2dτ ∆ |fj(τ )|dτ )2 (4.12) Combining (4.11), (4.12) and Condition 5 completes the proof. Lemma 15. Under Conditions 3-4, it holds that Eρτ (y − xjf (τ ) − g(τ )) − Eρτ (y − xjfj(τ ) − gj(τ ))   if E|f (τ ) − fj(τ )| ≤ q 2 4 (E|f (τ ) − fj(τ )|)2 f ≥  f 4 (qE|f (τ ) − fj(τ )| − q2 4 ) if E|f (τ ) − fj(τ )| > q 2 Here, the expectation E(·) is allowed to be taken additionally with respect to τ that is inde- pendent from (y, xj). Proof. The proof is largely motivated by the proof of Lemma 4 in [5]. We use E−τ to denote the expectation taken only with respect to (y, xj). Using the identity of Knight, we first 101 obtain E−τ ρτ (y − xjf (τ ) − g(τ )) − E−τ ρτ (y − xjfj(τ ) − gj(τ )) = E−τ (cid:16)(cid:2)xj(fj(τ ) − f (τ )) + gj(τ ) − g(τ )(cid:3) · (cid:2)τ − 1(y ≤ xjfj(τ ) + gj(τ ))(cid:3)(cid:17) (cid:90) xj (−fj (τ )+f (τ ))−gj (τ )+g(τ ) + E−τ 0 ≥ f 2 E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|2 − (cid:0)Fy|xj (xjfj(τ ) + gj(τ ) + t) − Fy|xj (xjfj(τ ) + gj(τ ))(cid:1)dt ¯f ′ 6 E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|3, (4.13) where in the last step we have used Taylor expansion and Condition 3 to bound the second term, and the first term equals zero due to the definition of fj(τ ), gj(τ ) in (4.3). When E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|2 ≤ q2, it implies that E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|3 ≤ 6f 4 ¯f ′ · E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|2. Plugging this result into (4.13) yields E−τ ρτ (y − xjf (τ ) − g(τ )) − E−τ ρτ (y − xjfj(τ ) − gj(τ )) f 4 · E−τ |xj(f (τ ) − fj(τ )) + g(τ ) − gj(τ )|2. 
Lemma 15. Under Conditions 3–4, it holds that
\[
E\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \ge
\begin{cases}
\dfrac{f}{4}\big(E|f(\tau) - f_j(\tau)|\big)^2 & \text{if } E|f(\tau) - f_j(\tau)| \le \dfrac{q}{2},\\[6pt]
\dfrac{f}{4}\Big(q\,E|f(\tau) - f_j(\tau)| - \dfrac{q^2}{4}\Big) & \text{if } E|f(\tau) - f_j(\tau)| > \dfrac{q}{2}.
\end{cases}
\]
Here, the expectation $E(\cdot)$ is allowed to be taken additionally with respect to $\tau$, which is independent of $(y, x_j)$.

Proof. The proof is largely motivated by the proof of Lemma 4 in [5]. We use $E_{-\tau}$ to denote the expectation taken only with respect to $(y, x_j)$. Using the identity of Knight, we first obtain
\begin{align}
&E_{-\tau}\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \nonumber\\
&\quad= E_{-\tau}\Big(\big[x_j(f_j(\tau) - f(\tau)) + g_j(\tau) - g(\tau)\big]\cdot\big[\tau - \mathbf{1}\big(y \le x_j f_j(\tau) + g_j(\tau)\big)\big]\Big) \nonumber\\
&\qquad+ E_{-\tau}\int_{0}^{x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)}\Big(F_{y|x_j}\big(x_j f_j(\tau) + g_j(\tau) + t\big) - F_{y|x_j}\big(x_j f_j(\tau) + g_j(\tau)\big)\Big)\,dt \nonumber\\
&\quad\ge \frac{f}{2}\,E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^2 - \frac{\bar{f}'}{6}\,E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^3, \tag{4.13}
\end{align}
where in the last step we have used a Taylor expansion and Condition 3 to bound the second term, and the first term equals zero due to the definition of $f_j(\tau), g_j(\tau)$ in (4.3).

When $E_{-\tau}|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)|^2 \le q^2$, it implies that
\[
E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^3 \le \frac{6f}{4\bar{f}'}\cdot E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^2.
\]
Plugging this result into (4.13) yields
\[
E_{-\tau}\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \ge \frac{f}{4}\cdot E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^2. \tag{4.14}
\]
When $E_{-\tau}|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)|^2 > q^2$, define $\bar{f}(\tau) = t f(\tau) + (1-t) f_j(\tau)$ and $\bar{g}(\tau) = t g(\tau) + (1-t) g_j(\tau)$, with
\[
t = \frac{q}{\sqrt{E_{-\tau}|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)|^2}} \in [0, 1].
\]
Using the convexity of the check loss, we have
\begin{align}
&E_{-\tau}\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \nonumber\\
&\quad\ge \frac{1}{t}\cdot\Big(E_{-\tau}\rho_\tau\big(y - x_j \bar{f}(\tau) - \bar{g}(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big)\Big) \nonumber\\
&\quad\ge \frac{f q}{4}\sqrt{E_{-\tau}\big|x_j(f(\tau) - f_j(\tau)) + g(\tau) - g_j(\tau)\big|^2}, \tag{4.15}
\end{align}
where the second inequality holds by invoking (4.14). Denote $h_1(t) := \frac{f}{4}\min(t^2, tq)$ and
\[
h_2(t) := \begin{cases} \dfrac{f}{4}\, t^2 & \text{if } t \le \dfrac{q}{2},\\[6pt] \dfrac{f}{4}\Big(qt - \dfrac{q^2}{4}\Big) & \text{if } t > \dfrac{q}{2}. \end{cases}
\]
It is straightforward to verify that (4.14) and (4.15) together imply
\[
E_{-\tau}\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E_{-\tau}\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \ge h_1\big(|f(\tau) - f_j(\tau)|\big) \ge h_2\big(|f(\tau) - f_j(\tau)|\big).
\]
Moreover, $h_2(t)$ is convex over $t \in [0, \infty)$. Hence, we take the expectation with respect to $\tau$ in the above inequality and apply Jensen's inequality to conclude
\[
E\rho_\tau\big(y - x_j f(\tau) - g(\tau)\big) - E\rho_\tau\big(y - x_j f_j(\tau) - g_j(\tau)\big) \ge h_2\big(E|f(\tau) - f_j(\tau)|\big).
\]

Lemma 16. The B-spline basis vector $B(\tau) = (B_1(\tau), \ldots, B_N(\tau))$ satisfies the following properties:

(i) There exist two constants $b_1, b_2$ such that
\[
b_1 N^{-1} \le \lambda_{\min}\big(E B(\tau)B(\tau)^T\big) \le \lambda_{\max}\big(E B(\tau)B(\tau)^T\big) \le b_2 N^{-1},
\]
where $\tau \sim \mathrm{Unif}(\Delta)$.

(ii) Suppose $\tau_1, \ldots, \tau_L \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(\Delta)$. Define $H := \frac{1}{L}\sum_{\ell=1}^{L} B(\tau_\ell)B(\tau_\ell)^T - E B(\tau)B(\tau)^T$. There exists a constant $b_3 > 0$ such that
\[
P\big(\|H\| \ge t\big) \le 2N^2 \exp\Big(-\frac{t^2 L}{2N(b_3 + t/3)}\Big), \qquad \forall t > 0.
\]
Proof. These results are directly taken from Appendix A.1 in [25].

4.7.2 Proof of Proposition 1

Proof. We consider the simple single-observation case where $x' = (x_1, x_2)$. By assumption, $x$ consists of the variables retained by the screening method. We write $F_x$ for the cumulative distribution function of a random variable $x$.

First, $\beta_1(\tau) = 0$ for all $\tau \in \Delta$ holds if and only if $Q_{y|x_2}(\tau) = Q_{y|x_1,x_2}(\tau)$ for all $\tau \in \Delta$. Then, by the definition of the conditional quantile, $Q_{y|x}(\tau) = \inf\{a : F_{y|x}(a) \ge \tau\}$, we have
\[
F_{y|x_1,x_2}\big(Q_{y|x_1,x_2}(\tau)\big) = F_{y|x_2}\big(Q_{y|x_2}(\tau)\big) = F_{y|x_2}\big(Q_{y|x_1,x_2}(\tau)\big) \tag{4.16}
\]
for $\tau \in \Delta$. Since $y = \tilde{y}$ when $Q_{y|x_1,x_2}(\tau_1) \le y \le Q_{y|x_1,x_2}(\tau_2)$, it follows that for any $a$ with $Q_{y|x_1,x_2}(\tau_1) \le a \le Q_{y|x_1,x_2}(\tau_2)$ we have $\tau_1 \le F_{\tilde{y}|x_1,x_2}(a) = F_{y|x_1,x_2}(a) \le \tau_2$. Therefore, this and (4.16) imply the conditional independence of $\tilde{y}$ and $x_1$ given $x_2$.

Conversely, since $\tilde{y}$ is conditionally independent of $x_1$ given $x_2$, we have $F_{\tilde{y}|x_2}(a) = F_{\tilde{y}|x_1,x_2}(a)$ for $Q_{\tilde{y}|x_1,x_2}(\tau_1) \le a \le Q_{\tilde{y}|x_1,x_2}(\tau_2)$. Since the conditional quantile function and the conditional cumulative distribution function are one-to-one, this implies that $F^{-1}_{\tilde{y}|x_2}(a) = F^{-1}_{\tilde{y}|x_1,x_2}(a)$, which holds if and only if $\beta_1(\tau) = 0$ for all $\tau \in \Delta$. This finishes the proof for this case; the same argument extends directly to $X_j$ and $X_{-j}$, which completes the proof.

CHAPTER 5

FUTURE STUDIES

In the future, several avenues can be explored to expand and enhance the methodologies developed in this thesis. In Chapters 2 and 3 we consider only univariate trend filtering models. However, these models can be generalized to additive models in which the number of additive components is finite, and [56] show that additive trend filtering can be extended further to settings where the number of additive components diverges to infinity. Our work can therefore be expanded to such settings, potentially offering improved performance for more complex data structures.
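To make this direction concrete, one natural objective for an additive partial linear trend filtering model is sketched below; this assumes the additive extension follows the formulation of [56], and the penalty orders, tuning parameters, and identifiability constraints would still need to be specified:
\[
\min_{\beta,\ \theta_1,\ldots,\theta_K}\ \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - x_i^{T}\beta - \sum_{k=1}^{K}\theta_{k,i}\Big)^{2} \;+\; \lambda_0\|\beta\|_1 \;+\; \sum_{k=1}^{K}\lambda_k\big\|D^{(r+1)}_{k}\theta_k\big\|_1,
\]
where $\theta_k \in \mathbb{R}^n$ collects the evaluations of the $k$-th nonparametric component at the observed values of its covariate (in increasing order of that covariate), $D^{(r+1)}_{k}$ is the associated discrete difference operator of order $r+1$, and $K$ is the number of additive components; the quantile analogue replaces the squared loss by the check loss $\rho_\tau$.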
Both the partial linear and quantile trend filtering models can benefit from the adoption of other penalization methods, such as SCAD (Smoothly Clipped Absolute Deviation) and MCP (Minimax Concave Penalty). These penalties can improve variable selection performance, making our approach more robust and efficient in identifying significant predictors.

Currently, there is no publicly available R package for partial linear quantile trend filtering or quantile trend filtering. [22] propose an ADMM algorithm for quantile regression that is fast and applicable to high-dimensional data, and ADMM algorithms have already been deployed successfully for trend filtering [53]. By employing an ADMM algorithm for our models, we expect to achieve faster and more efficient computations; a minimal sketch of such an iteration is given at the end of this chapter. Making this implementation publicly available would allow broader use and validation of our methods.

For the partial linear models and the FDR control via regional quantile screening method, deploying a post-selection inference procedure after feature selection could further strengthen our methods. This would pair inference with our estimators, providing more reliable and interpretable results.

Instead of using the model-X knockoff for controlling the FDR, we can explore the use of the sample-splitting method [11] with mirror statistics. This approach may offer additional flexibility and robustness in FDR control, especially in ultra-high dimensional settings.

By addressing these future research directions, we can further enhance the capabilities and applications of the methodologies developed in this thesis, providing even more robust and efficient tools for high-dimensional heterogeneous data analysis.
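As referenced above, the following is a minimal, self-contained R sketch of an ADMM iteration for univariate quantile trend filtering. It only illustrates the splitting idea and is not the algorithm of [22] or [53]; the step size, penalty level, iteration count, and the simulated data in the usage example are placeholder choices.

```r
## Minimal ADMM sketch for quantile trend filtering:
##   minimize over theta:  sum_i rho_tau(y_i - theta_i) + lambda * || D theta ||_1,
## where D is the discrete difference operator of order k + 1.
## Splitting: r = y - theta (residual block) and z = D theta (difference block).
quantile_tf_admm <- function(y, tau = 0.5, k = 1, lambda = 1, rho = 1, iters = 200) {
  n <- length(y)
  D <- diff(diag(n), differences = k + 1)   # (n - k - 1) x n difference matrix
  A <- diag(n) + crossprod(D)               # fixed matrix for the theta-update
  A_chol <- chol(A)                         # factor once, reuse every iteration

  theta <- y                                # primal variable (fitted values)
  r <- rep(0, n); z <- rep(0, nrow(D))      # auxiliary blocks
  u <- rep(0, n); w <- rep(0, nrow(D))      # scaled dual variables

  # proximal operator of t * rho_tau: shifted soft-thresholding of the residuals
  prox_check <- function(v, t, tau) {
    ifelse(v > t * tau, v - t * tau,
           ifelse(v < -t * (1 - tau), v + t * (1 - tau), 0))
  }
  soft <- function(v, t) sign(v) * pmax(abs(v) - t, 0)  # soft-thresholding

  for (it in seq_len(iters)) {
    # theta-update: solve (I + D'D) theta = (y - r + u) + D'(z - w)
    rhs   <- (y - r + u) + drop(crossprod(D, z - w))
    theta <- drop(backsolve(A_chol, forwardsolve(t(A_chol), rhs)))
    # r-update: proximal step on the check loss
    r <- prox_check(y - theta + u, 1 / rho, tau)
    # z-update: soft-threshold the penalized differences
    z <- soft(drop(D %*% theta) + w, lambda / rho)
    # dual updates
    u <- u + (y - theta - r)
    w <- w + (drop(D %*% theta) - z)
  }
  theta
}

## Hypothetical usage on simulated piecewise-linear data
set.seed(1)
y   <- c(seq(0, 2, length.out = 50), seq(2, -1, length.out = 50)) + rnorm(100, sd = 0.3)
fit <- quantile_tf_admm(y, tau = 0.5, k = 1, lambda = 5)
```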
BIBLIOGRAPHY

[1] Taylor Arnold, Veeranjaneyulu Sadhanala, and Ryan Tibshirani. glmgen: Fast algorithms for generalized lasso problems. R package version 0.0.3. 2014.
[2] Rina Foygel Barber and Emmanuel J. Candès. “A knockoff filter for high-dimensional selective inference”. In: The Annals of Statistics 47.5 (2019), pp. 2504–2537.
[3] Rina Foygel Barber and Emmanuel J. Candès. “Controlling the false discovery rate via knockoffs”. In: The Annals of Statistics 43.5 (2015), pp. 2055–2085.
[4] Emre Barut, Jianqing Fan, and Anneleen Verhasselt. “Conditional Sure Independence Screening”. In: Journal of the American Statistical Association 111.515 (2016), pp. 1266–1277.
[5] Alexandre Belloni and Victor Chernozhukov. “ℓ1-penalized quantile regression in high-dimensional sparse models”. In: The Annals of Statistics 39.1 (2011), pp. 82–130.
[6] Howard D. Bondell, Brian J. Reich, and Huixia Wang. “Noncrossing quantile regression curve estimation”. In: Biometrika 97.4 (Aug. 2010), pp. 825–838.
[7] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Feb. 2013.
[8] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[9] Emmanuel Candès et al. “Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.3 (2018), pp. 551–577.
[10] Centers for Disease Control and Prevention (CDC) and National Center for Health Statistics (NCHS). “National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention”. In: (2011–2018).
[11] Chenguang Dai et al. “False discovery rate control via data splitting”. In: Journal of the American Statistical Association 118.544 (2023), pp. 2503–2520.
[12] S. Eli et al. “Weight loss and BMI criteria in GLIM’s definition of malnutrition is associated with postoperative complications following abdominal resections – Results from a National Quality Registry”. In: Clinical Nutrition 39.5 (2020), pp. 1593–1599.
[13] Leay-Kiaw Er et al. “Triglyceride glucose-body mass index is a simple and clinically useful surrogate marker for insulin resistance in nondiabetic individuals”. In: PLoS ONE 11.3 (2016), e0149731.
[14] Jianqing Fan, Yingying Fan, and Emre Barut. “Adaptive robust variable selection”. In: The Annals of Statistics 42.1 (2014), pp. 324–351.
[15] Jianqing Fan, Yang Feng, and Rui Song. “Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models”. In: Journal of the American Statistical Association 106.494 (2011), pp. 544–557.
[16] Jianqing Fan and Runze Li. “Variable selection via nonconcave penalized likelihood and its oracle properties”. In: Journal of the American Statistical Association 96.456 (2001), pp. 1348–1360.
[17] Jianqing Fan and Jinchi Lv. “Sure independence screening for ultrahigh dimensional feature space”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.5 (2008), pp. 849–911.
[18] Jianqing Fan and Rui Song. “Sure independence screening in generalized linear models with NP-dimensionality”. In: The Annals of Statistics 38.6 (2010), pp. 3567–3604.
[19] K. M. Flegal et al. “Association of all-cause mortality with overweight and obesity using standard body mass index categories: a systematic review and meta-analysis”. In: JAMA 309.1 (2013), pp. 71–82.
[20] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent”. In: Journal of Statistical Software 33.1 (2010), pp. 1–22.
[21] Sara van de Geer. “On the uniform convergence of empirical norms and inner products, with application to causal inference”. In: Electronic Journal of Statistics 8 (2014), pp. 543–574.
[22] Yuwen Gu et al. “ADMM for high-dimensional sparse penalized quantile regression”. In: Technometrics 60.3 (2018), pp. 319–331.
[23] Wolfgang Härdle, Hua Liang, and Jiti Gao. Partially linear models. Springer Science & Business Media, 2000.
[24] Xuming He and Peide Shi. “Bivariate Tensor-Product B-Splines in a Partly Linear Model”. In: Journal of Multivariate Analysis 58.2 (1996), pp. 162–181.
[25] Xuming He, Lan Wang, and Hyokyoung Grace Hong. “Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data”. In: The Annals of Statistics 41.1 (2013), pp. 342–369.
[26] Ruifeng Hu et al. “ANCO-GeneDB: annotations and comprehensive analysis of candidate genes for alcohol, nicotine, cocaine and opioid dependence”. In: Database 2018 (Nov. 2018), bay121.
[27] Peter J Huber. “Robust Estimation of a Location Parameter”. In: The Annals of Mathematical Statistics 35.1 (1964), pp. 73–101.
[28] David R Hunter and Kenneth Lange. “A tutorial on MM algorithms”. In: The American Statistician 58.1 (2004), pp. 30–37.
[29] Liewen Jiang, Howard D Bondell, and Huixia Judy Wang. “Interquantile shrinkage and variable selection in quantile regression”. In: Computational Statistics & Data Analysis 69 (2014), pp. 208–219.
[30] Mi-Hyang Jung et al. “Complex interaction of fasting glucose, body mass index, age and sex on all-cause mortality: a cohort study in 15 million Korean adults”. In: Diabetologia 63.8 (2020), pp. 1616–1625.
[31] Seung-Jean Kim et al. “ℓ1 trend filtering”. In: SIAM Review 51.2 (2009), pp. 339–360.
[32] Roger Koenker and Gilbert Bassett. “Regression quantiles”. In: Econometrica: Journal of the Econometric Society (1978), pp. 33–50.
[33] Maria Teresa Landi et al. “Environment And Genetics in Lung cancer Etiology (EAGLE) study: An integrative population-based case-control study of lung cancer”. In: BMC Public Health 8.1 (2008), pp. 1–10.
[34] Kenneth Lange and Janet S Sinsheimer. “Normal/independent distributions and their applications in robust regression”. In: Journal of Computational and Graphical Statistics 2.2 (1993), pp. 175–198.
[35] Eun Ryung Lee, Hohsuk Noh, and Byeong U. Park. “Model Selection via Bayesian Information Criterion for Quantile Regression Models”. In: Journal of the American Statistical Association 109.505 (2014), pp. 216–229.
[36] Wanjun Liu et al. “Model-Free Feature Screening and FDR Control With Knockoff Features”. In: Journal of the American Statistical Association 117.537 (2022), pp. 428–443.
[37] Yufeng Liu and Yichao Wu. “Simultaneous multiple non-crossing quantile regression estimation using kernel constraints”. In: Journal of Nonparametric Statistics 23.2 (2011), pp. 415–437.
[38] Chi Ma and Jian Huang. “Asymptotic properties of lasso in high-dimensional partially linear models”. In: Science China Mathematics 59 (2016), pp. 769–788.
[39] Haiqiang Ma et al. “Quantile regression for functional partially linear model in ultra-high dimensions”. In: Computational Statistics & Data Analysis 129 (2019), pp. 135–147.
[40] Shujie Ma, Runze Li, and Chih-Ling Tsai. “Variable Screening via Quantile Partial Correlation”. In: Journal of the American Statistical Association 112.518 (2017), pp. 650–663.
[41] Oscar Hernan Madrid Padilla and Sabyasachi Chatterjee. “Risk bounds for quantile trend filtering”. In: Biometrika 109.3 (2022), pp. 751–768.
[42] Adam Maidman and Lan Wang. “New semiparametric method for predicting high-cost patients”. In: Biometrics 74.3 (2018), pp. 1104–1111.
[43] Adam Maidman et al. “Quantile partially linear additive model for data with dropouts and an application to modeling cognitive decline”. In: Statistics in Medicine 42.16 (2023), pp. 2729–2745.
[44] Enno Mammen and Sara Van De Geer. “Locally adaptive regression splines”. In: The Annals of Statistics 25.1 (1997), pp. 387–413.
[45] Nitish Kumar Mishra et al. “Identification of prognostic markers in cholangiocarcinoma using altered DNA methylation and gene expression profiles”. In: Frontiers in Genetics 11 (2020), p. 522125.
[46] Patric Müller and Sara Van de Geer. “The partial linear model in high dimensions”. In: Scandinavian Journal of Statistics 42.2 (2015), pp. 580–608.
[47] A. Must et al. “The disease burden associated with overweight and obesity”. In: JAMA 282.16 (1999), pp. 1523–1529.
[48] Doug Nychka et al. “A nonparametric regression approach to syringe grading for quality improvement”. In: Journal of the American Statistical Association 90.432 (1995), pp. 1171–1178.
[49] Carlos Misael Madrid Padilla, Oscar Hernan Madrid Padilla, and Daren Wang. “Temporal-spatial model via Trend Filtering”. In: arXiv preprint arXiv:2308.16172 (2023).
[50] Seyoung Park and Xuming He. “Hypothesis testing for regional quantiles”. In: Journal of Statistical Planning and Inference 191 (2017), pp. 13–24.
[51] Limin Peng, Jinfeng Xu, and Nancy Kutner. “Shrinkage estimation of varying covariate effects based on quantile regression”. In: Statistics and Computing 24 (2014), pp. 853–869.
[52] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2022.
[53] Aaditya Ramdas and Ryan J Tibshirani. “Fast and flexible ADMM algorithms for trend filtering”. In: Journal of Computational and Graphical Statistics 25.3 (2016), pp. 839–858.
[54] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. “Minimax rates of estimation for high-dimensional linear regression over ℓq-balls”. In: IEEE Transactions on Information Theory 57.10 (2011), pp. 6976–6994.
[55] Leonid I Rudin, Stanley Osher, and Emad Fatemi. “Nonlinear total variation based noise removal algorithms”. In: Physica D: Nonlinear Phenomena 60.1-4 (1992), pp. 259–268.
[56] Veeranjaneyulu Sadhanala and Ryan J Tibshirani. “Additive models with trend filtering”. In: The Annals of Statistics 47.6 (2019), pp. 3032–3068.
[57] EJ Schlossmacher. “An iterative technique for absolute deviations curve fitting”. In: Journal of the American Statistical Association 68.344 (1973), pp. 857–859.
[58] Sabine K Schnabel and Paul HC Eilers. “Simultaneous estimation of quantile curves using quantile sheets”. In: AStA Advances in Statistical Analysis 97.1 (2013), pp. 77–87.
[59] Larry Schumaker. Spline functions: basic theory. Cambridge University Press, 2007.
[60] M Sesia, C Sabatti, and E J Candès. “Gene hunting with hidden Markov model knockoffs”. In: Biometrika 106.1 (Aug. 2018), pp. 1–18.
[61] Ben Sherwood. “Variable selection for additive partial linear quantile regression with missing covariates”. In: Journal of Multivariate Analysis 152 (2016), pp. 206–223.
[62] Ben Sherwood, Adam Maidman, and Shaobo Li. rqPen: Penalized Quantile Regression. R package version 3.0. 2022.
[63] Ben Sherwood and Lan Wang. “Partially linear additive quantile regression in ultra-high dimension”. In: The Annals of Statistics 44.1 (2016), pp. 288–317.
[64] Bei Song et al. “Triglyceride glucose-body mass index and risk of incident type 2 diabetes mellitus in Japanese people with normal glycemic level: A population-based longitudinal cohort study”. In: Frontiers in Endocrinology 13 (2022), p. 907973.
[65] Arun Srinivasan, Lingzhou Xue, and Xiang Zhan. “Compositional knockoff filter for high-dimensional regression analysis of microbiome data”. In: Biometrics 77.3 (2021), pp. 984–995.
[66] Gabriele Steidl, Stephan Didas, and Julia Neumann. “Splines in higher order TV regularization”. In: International Journal of Computer Vision 70 (2006), pp. 241–255.
[67] Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 58.1 (1996), pp. 267–288.
[68] Robert Tibshirani et al. “Sparsity and smoothness via the fused lasso”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 67.1 (2005), pp. 91–108.
[69] Ryan J Tibshirani. “Adaptive piecewise polynomial estimation via trend filtering”. In: The Annals of Statistics 42.1 (2014), pp. 285–323.
[70] Ryan J Tibshirani. “Divided differences, falling factorials, and discrete splines: Another look at trend filtering and related problems”. In: Foundations and Trends® in Machine Learning 15.6 (2022), pp. 694–846.
[71] Paul Tseng. “Convergence of a block coordinate descent method for nondifferentiable minimization”. In: Journal of Optimization Theory and Applications 109 (2001), pp. 475–494.
[72] Sara Van de Geer. Estimation and testing under sparsity. Springer, 2016.
[73] Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Vol. 47. Cambridge University Press, 2018.
[74] Nicolas Verzelen. “Minimax risks for sparse regressions: Ultra-high dimensional phenomenons”. In: Electronic Journal of Statistics 6 (2012), pp. 38–90.
[75] Grace Wahba. Spline models for observational data. Society for Industrial and Applied Mathematics, 1990.
[76] Lan Wang, Yichao Wu, and Runze Li. “Quantile Regression for Analyzing Heterogeneity in Ultra-High Dimension”. In: Journal of the American Statistical Association 107.497 (2012), pp. 214–222.
[77] Yu-Xiang Wang, Alex Smola, and Ryan Tibshirani. “The falling factorial basis and its statistical applications”. In: International Conference on Machine Learning. PMLR, 2014, pp. 730–738.
[78] Yu-Xiang Wang et al. “Trend Filtering on Graphs”. In: Journal of Machine Learning Research 17.105 (2016), pp. 1–41.
[79] Fang Xie and Johannes Lederer. “Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data”. In: Entropy 23.2 (2021).
[80] Huiliang Xie and Jian Huang. “SCAD-penalized regression in high-dimensional partially linear models”. In: (2009).
[81] Fei Ye and Cun-Hui Zhang. “Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls”. In: The Journal of Machine Learning Research 11 (2010), pp. 3519–3540.
[82] Takuma Yoshida. “Quantile function regression and variable selection for sparse models”. In: Canadian Journal of Statistics 49.4 (2021), pp. 1196–1221.
[83] Kyusang Yu, Enno Mammen, and Byeong Uk Park. “Semi-parametric regression: Efficiency gains from modeling the nonparametric part”. In: Bernoulli 17.2 (2011), pp. 736–748.
[84] Zhuqing Yu, Michael Levine, and Guang Cheng. “Minimax optimal estimation in partially linear additive models under high dimension”. In: Bernoulli 25.2 (2019), pp. 1289–1325.
[85] Cun Hui Zhang. “Nearly unbiased variable selection under minimax concave penalty”. In: The Annals of Statistics 38.2 (2010), pp. 894–942.
[86] Tongwu Zhang et al. “APOBEC shapes tumor evolution and age at onset of lung cancer in smokers”. In: bioRxiv (2024).
[87] Qi Zheng, Limin Peng, and Xuming He. “Globally adaptive quantile regression with ultra-high dimensional data”. In: The Annals of Statistics 43.5 (2015), pp. 2225–2258.
[88] Qixian Zhong and Jane-Ling Wang. “Neural networks for partially linear quantile regression”. In: Journal of Business & Economic Statistics 42.2 (2024), pp. 603–614.
[89] Ying Zhu. “Nonasymptotic Analysis of Semiparametric Regression Models with High-Dimensional Parametric Coefficients”. In: The Annals of Statistics 45.5 (2017), pp. 2274–2298.
[90] Hui Zou and Trevor Hastie. “Regularization and variable selection via the elastic net”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 67.2 (2005), pp. 301–320.