THEORY OF SPLINE REGRESSION WITH APPLICATIONS TO TIME SERIES, LONGITUDINAL, AND CATEGORICAL DATA, AND DATA WITH JUMPS By Shujie Ma A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Statistics 2011

ABSTRACT

THEORY OF SPLINE REGRESSION WITH APPLICATIONS TO TIME SERIES, LONGITUDINAL, AND CATEGORICAL DATA, AND DATA WITH JUMPS By Shujie Ma

Modern technological advances have led to an explosion in the collection of complex data such as functional/longitudinal data, nonlinear time series, mixed data, and data with jumps. In recent years, there has been growing interest in developing statistical tools to analyze these data, primarily because traditional parametric methods are unrealistic in applications. Non- and semi-parametric methods, as alternatives, have been widely recognized as powerful tools for complex data analysis; they relax the usual assumptions of parametric methods and enable us to explore the data more flexibly, so as to uncover data structure that might otherwise be missed. This dissertation develops statistical theory and methods in spline regression for the complex data mentioned above, with applications to medical science, finance, and economics.

In Chapter 2, procedures to detect jumps in the regression function via constant and linear splines are proposed, based on the maximal differences of the spline estimators among neighboring knots. Simulation experiments corroborate the asymptotic theory, while the computing is extremely fast. The detection procedure is illustrated by analyzing the thickness of pennies data set.

In Chapter 3, asymptotically simultaneous confidence bands are obtained for the mean function of the functional regression model, using piecewise constant spline estimation. Simulation experiments corroborate the asymptotic theory. The confidence band procedure is illustrated by analyzing CD4 cell counts of HIV infected patients.

A spline-backfitted kernel smoothing method is proposed in Chapter 4 for the partially linear additive autoregression model. Under assumptions of stationarity and geometric mixing, the proposed function and parameter estimators are oracle efficient and fast to compute. Simulation experiments confirm the asymptotic results. Application to the Boston housing data serves as a practical illustration of the method.

Chapter 5 considers the problem of estimating a relationship nonparametrically using regression splines when there exist both continuous and categorical predictors. The resulting estimator possesses substantially better finite-sample performance than either its frequency-based peer, cross-validated local linear kernel regression, or even additive regression splines (when additivity does not hold). Theoretical underpinnings are provided and Monte Carlo simulations are undertaken to assess finite-sample behavior.

I dedicate this thesis to my husband, Fulin Wang, my parents, Sanbao Ma and Xiaoli Xue, and my grandmother, Peilan Sun.

ACKNOWLEDGMENT

This thesis would not have been possible without the support of many people. First of all, I would like to express my deepest gratitude to my supervisor, Professor Lijian Yang, for his patient guidance, invaluable assistance, encouragement, and excellent advice throughout this study. His continued support led me to the right way.
I would like to thank Professor Yuehua Cui, Professor Lyudmila Sakhanenko, and Professor Jeffrey Wooldridge for serving as members of my doctoral committee and for their invaluable suggestions. Thanks also to my collaborator Professor Jeff Racine for his help and for many interesting discussions and insights. I am grateful to the entire faculty and staff in the Department of Statistics and Probability who have taught me and assisted me during my study at MSU. My special thanks go to Professor James Stapleton and Professor Dennis Gilliland for their help, support, and encouragement. I am thankful to Dr. Steven J. Pierce for accepting me as one of the statistical consultants at CSTAT, which provided me with plenty of opportunities to work with students and faculty from different disciplines. I am thankful to the Graduate School and the Department of Statistics and Probability for providing me with the Dissertation Completion Fellowship (2011), the Summer Support Fellowship (2010), and the Stapleton Fellowship for working on this dissertation. This dissertation is also supported in part by NSF award 0706518. I also thank my academic sisters and brothers: Dr. Lan Xue, Dr. Jing Wang, Dr. Li Wang, Dr. Rong Liu, Dr. Qiongxia Song, Guanqun Cao, and Shuzhuan Zheng for their generous help. Finally, I take this opportunity to express my profound gratitude to my beloved husband, my parents, and my grandmother for their love, endless support, and encouragement over all these years.

TABLE OF CONTENTS

List of Tables
List of Figures
1 Introduction
1.1 Nonparametric Regression
1.2 Partially Linear Additive Models
1.3 Polynomial Splines
1.4 Tensor Product Splines
2 A Jump-detecting Procedure based on Polynomial Spline Estimation
2.1 Introduction
2.2 Main Results
2.3 Implementation
2.4 Examples
2.4.1 Simulation example
2.4.2 Real data analysis
2.5 Appendix A
2.5.1 Preliminaries
2.5.2 Proof of Theorem 2.1 for p = 1
2.6 Appendix B
2.6.1 Preliminaries
2.6.2 Variance calculation
2.6.3 Proof of Theorem 2.1 for p = 2
3 A Simultaneous Confidence Band for Sparse Longitudinal Regression
3.1 Introduction
3.2 Main Results
3.3 Decomposition
3.4 Implementation
3.5 Simulation
3.6 Empirical Example
3.7 Discussion
3.8 Appendix
3.8.1 A.1. Preliminaries
4 Spline-backfitted Kernel Smoothing of Partially Linear Additive Model
4.1 Introduction
4.2 The SBK Estimators
4.3 Simulation
4.4 Application
4.5 Appendix
5 Spline Regression in the Presence of Categorical Predictors
5.1 Background
5.2 Methods and Main Results
5.3 Cross-Validated Choice of N and λ
5.4 Monte Carlo Simulations
5.5 Concluding Remarks
5.6 Appendix
Bibliography

LIST OF TABLES

2.1 Power calculation for the simulated example in Chapter 2
2.2 Computing time for simulated example in Chapter 2
3.1 Uniform coverage rates in Chapter 3
3.2 Uniform coverage rates and average maximal widths in Chapter 3
3.3 Confidence limits for CD4 data set
4.1 Estimation of parameters for the linear part in Chapter 4
5.1 Relative median MSE in Chapter 5
5.2 Median values for smoothing parameters in Chapter 5

LIST OF FIGURES

1.1 Spline estimate plot for the regression function of the fossil data
1.2 Confidence band plot for spinal bone mineral density
2.1 Kernel density plots of τ̂1 in Chapter 2
2.2 Kernel density plots of ĉ1 in Chapter 2
2.3 Plots of the spline estimator for c = 0 in Chapter 2
2.4 Plots of the spline estimator for c = 2 in Chapter 2
2.5 Spline estimator for the thickness of pennies data
3.1 Plots of simulated data for n = 20 in Chapter 3
3.2 Plots of simulated data for n = 50 in Chapter 3
3.3 Plots of confidence bands at 1 − α = 0.95, n = 20 in Chapter 3
3.4 Plots of confidence bands at 1 − α = 0.95, n = 50 in Chapter 3
3.5 Plots of confidence bands at 1 − α = 0.99, n = 20 in Chapter 3
3.6 Plots of confidence bands at 1 − α = 0.99, n = 50 in Chapter 3
3.7 Plots of confidence bands for CD4 data
3.8 Plots of confidence intervals for CD4 data
4.1 Kernel density plots for α = 2, d1 = 4, d2 = 3 in Chapter 4
4.2 Kernel density plots in Chapter 4
4.3 Kernel density plots for α = 2, d1 = 4, d2 = 30 in Chapter 4
4.4 Kernel density plots for α = 3, d1 = 4, d2 = 30 in Chapter 4
4.5 Plots of the estimator for the nonparametric part at α = 2, n = 200
4.6 Plots of the estimator for the nonparametric part at α = 2, n = 500
4.7 Plots of the estimator for the nonparametric part at α = 3, n = 200
4.8 Plots of the estimator for the nonparametric part at α = 3, n = 500
4.9 Plots of the estimators for RM for Boston housing data
4.10 Plots of the estimators for log(TAX) for Boston housing data
4.11 Plots of the estimators for log(LSTAT) for Boston housing data
5.1 Example with n = 500 and a variety of DGPs in Chapter 5

Chapter 1 Introduction

1.1 Nonparametric Regression

Regression analysis is one of the most widely used tools in data analysis. It models and analyzes the relationship between a dependent variable Y and a vector of independent variables X; most commonly, it estimates the conditional expectation of the dependent variable given the independent variables. Classic parametric models, although their properties are very well established, require restrictive assumptions, which has limited them in various applications: mis-specification of a parametric model can lead to large bias. Nonparametric modelling, as an alternative, reduces modeling bias by imposing no specific model structure and enables one to explore the data more flexibly. We begin by considering the nonparametric regression model

Y = m(X) + ϵ,   (1.1)

where the error term ϵ contributes roughness to the raw data, and the functional form m(·) is unknown and satisfies m(x) = E(Y | X = x). In many cases the unknown function m(·) is smooth, and we can then estimate it nonparametrically by such methods as kernel and spline smoothing. For example, Figure 1.1 shows a nonparametric smoothing curve fitted by a quadratic polynomial spline, together with the data points, for a fossil data set. The data contain ratios of strontium isotopes found in fossil shells millions of years ago, which can reflect global climate. In other cases, however, the unknown function m(·) is not smooth. Ignoring possible jumps in the regression function m(·) may result in a serious error in drawing inference about the process under study. In Chapter 2, we propose procedures to detect jumps in the regression function m(·) via constant and linear spline estimation methods in a random-design nonparametric regression model with i.i.d. data. The detection procedure is illustrated by simulation experiments and by analyzing the thickness of pennies data set. In Chapter 3, we consider a sparse functional data case, with data of the form {Xij, Yij}, 1 ≤ i ≤ n, 1 ≤ j ≤ Ni, in which Ni observations are taken for the ith subject, with Xij and Yij the jth predictor and response variables, respectively, for the ith subject, and the Ni's are i.i.d. copies of an integer-valued positive random variable.
Asymptotically simultaneous confidence bands are obtained for the mean function m(·) of the functional regression model, using piecewise constant spline estimation. For illustration, Figure 1.2 shows a confidence band and a confidence interval at the 95% confidence level, together with the piecewise constant spline estimate of the function, for a spinal bone mineral density data set.

Figure 1.1: Spline estimate plot for the regression function of the fossil data (strontium ratio versus age in million years). Note: the quadratic spline estimator (solid line) and the data points (circles) for ratios of strontium isotopes over time. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Figure 1.2: Confidence band plot for spinal bone mineral density (spinal bone mineral density versus age). Note: the constant spline estimator (middle curve), the confidence band (solid lines) and the confidence interval (thin lines) at confidence level 95% for spinal bone mineral density over age.

1.2 Partially Linear Additive Models

Nonparametric modeling imposes no specific model structure and enables one to explore the data more flexibly, but it does not perform well when the dimension of the covariates is high: the variances of the resulting estimates tend to be unacceptably large due to the sparseness of data, the so-called "curse of dimensionality". To overcome these difficulties, many semi-parametric models have been proposed and developed, among which the partially linear additive model has become very popular in data analysis; it is an important multivariate semiparametric model that allows linearity in some of the variables. A partially linear additive model (PLAM) is given as

Y_i = m(X_i, T_i) + σ(X_i, T_i) ε_i,  m(x, t) = c_00 + ∑_{l=1}^{d_1} c_{0l} t_l + ∑_{α=1}^{d_2} m_α(x_α),   (1.2)

in which the sequence {Y_i, X_i^T, T_i^T}_{i=1}^n = {Y_i, X_{i1}, ..., X_{id_2}, T_{i1}, ..., T_{id_1}}_{i=1}^n. The functions m and σ are the mean and standard deviation of the response Y_i conditional on the predictor vector {X_i, T_i}, and ε_i is a white noise conditional on {X_i, T_i}. From equation (1.2) we observe that a PLAM consists of a linear part and a nonparametric additive part, and so retains the merits of both additive and linear models; see Stone (1985) and Hastie and Tibshirani (1990) for additive models. It is much more flexible than parametric models, since it eschews many assumptions, such as finitely many unknown parameters imposed on the model and a presumed distributional structure for the data, and meanwhile it is more interpretable than nonparametric regression surfaces. It avoids the so-called "curse of dimensionality" for high dimensional models, since the unknown functions are one-dimensional. Moreover, PLAMs are more parsimonious, more flexible, and easier to interpret than purely additive models: they allow a subset of the independent variables to be discrete, while additive models only admit continuous predictors, and they allow interaction terms among the elements of the additive part to enter the linear part of the model.
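To make model (1.2) concrete, the short sketch below simulates data from one such PLAM. It is an illustration only: the intercept, linear coefficients, component functions and noise level are arbitrary choices for this sketch, not the settings used later in Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 500, 2, 2            # sample size, # linear (T) terms, # additive (X) terms

# Discrete predictors T entering linearly, continuous predictors X on [0, 1]
T = rng.integers(0, 2, size=(n, d1)).astype(float)
X = rng.uniform(0.0, 1.0, size=(n, d2))

# Illustrative choices (not from the dissertation):
c0 = 1.0                          # intercept c_00
c = np.array([0.5, -1.0])         # linear coefficients c_01, ..., c_0d1
m1 = lambda x: np.sin(2 * np.pi * x)    # additive component m_1
m2 = lambda x: x ** 2 - 1.0 / 3.0       # additive component m_2 (centered)
sigma = 0.2                       # constant conditional standard deviation

# PLAM mean m(x, t) = c_00 + sum_l c_0l t_l + sum_a m_a(x_a), then model (1.2)
mean = c0 + T @ c + m1(X[:, 0]) + m2(X[:, 1])
Y = mean + sigma * rng.standard_normal(n)
```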
In Chapter 4, we extend the "spline-backfitted kernel smoothing" (SBK) of Wang and Yang (2007) to partially linear additive autoregression models. The uniform oracle efficiency of the estimators is established by "reducing bias via undersmoothing (step one) and averaging out the variance (step two)", using the joint asymptotic properties of kernel and spline functions. The proposed SBK estimators are (i) computationally expedient, (ii) theoretically reliable, and (iii) intuitively appealing.

1.3 Polynomial Splines

Polynomial splines have been widely accepted as an appealing and effective tool for nonparametric regression estimation because of their great flexibility, fast computation, simple implementation and explicit expression, and they have been applied to a wide range of statistical problems such as hazard regression, derivative estimation, semi-parametric models, longitudinal, functional and high dimensional data analysis, jump detection, etc. In this section, I will introduce polynomial spline functions.

Without loss of generality, we take the range of X, a univariate predictor, to be [0, 1]. To introduce the spline functions, divide the finite interval [0, 1] into (N + 1) equal subintervals χ_J = [t_J, t_{J+1}), J = 0, ..., N − 1, χ_N = [t_N, 1]. A sequence of equally-spaced points {t_J}_{J=1}^N, called interior knots, is given as

t_0 = 0 < t_1 < ··· < t_N < 1 = t_{N+1}, t_J = Jh, J = 0, ..., N + 1,

in which h = 1/(N + 1) is the distance between neighboring knots. We denote by G^(p−2) = G^(p−2)[0, 1] the space of functions that are polynomials of degree p − 1 on each χ_J and have continuous (p − 2)-th derivatives. For example, G^(−1) denotes the space of functions that are constant on each χ_J, and G^(0) denotes the space of functions that are linear on each χ_J and continuous on [0, 1].

Denote by ∥ϕ∥_2 the theoretical L² norm of a function ϕ on [0, 1], ∥ϕ∥²_2 = E{ϕ²(X)} = ∫_0^1 ϕ²(x) f(x) dx = ∫_0^1 ϕ²(x) dx, and the empirical L² norm as ∥ϕ∥²_{2,n} = n^{−1} ∑_{i=1}^n ϕ²(X_i). The corresponding inner products are defined by

⟨ϕ, φ⟩ = ∫_0^1 ϕ(x) φ(x) f(x) dx = ∫_0^1 ϕ(x) φ(x) dx = E{ϕ(X) φ(X)},
⟨ϕ, φ⟩_n = n^{−1} ∑_{i=1}^n ϕ(X_i) φ(X_i),

for any L²-integrable functions ϕ, φ on [0, 1]. Clearly E⟨ϕ, φ⟩_n = ⟨ϕ, φ⟩.

We now introduce the B-spline basis {b_{J,p}(x), 1 − p ≤ J ≤ N} of G^(p−2), the space of splines of order p, for theoretical analysis. The B-spline basis of G^(−1), the space of piecewise constant splines, consists of the indicator functions of the intervals χ_J, b_{J,1}(x) = I_J(x) = I_{χ_J}(x), 0 ≤ J ≤ N. The B-spline basis of G^(0), the space of piecewise linear splines, is {b_{J,2}(x)}_{J=−1}^N with

b_{J,2}(x) = K((x − t_{J+1})/h), −1 ≤ J ≤ N, for K(u) = (1 − |u|)_+.

In Chapters 2, 3, and 4, polynomial spline techniques are applied in different circumstances.
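As a quick computational companion to the definitions above, the sketch below evaluates the constant basis b_{J,1} and the linear basis b_{J,2} on equally spaced knots and checks that the linear basis functions sum to one on [0, 1]. This is a minimal Python illustration only; the dissertation's own computations are carried out in XploRe.

```python
import numpy as np

def knots(N):
    """Equally spaced knots t_J = J*h on [0, 1] with h = 1/(N+1)."""
    h = 1.0 / (N + 1)
    return np.arange(0, N + 2) * h, h

def basis_constant(x, N):
    """b_{J,1}(x) = indicator of chi_J, J = 0, ..., N (piecewise constant splines)."""
    _, h = knots(N)
    x = np.asarray(x, dtype=float)
    J = np.minimum(np.floor(x / h).astype(int), N)   # index of chi_J containing x
    B = np.zeros((x.size, N + 1))
    B[np.arange(x.size), J] = 1.0
    return B

def basis_linear(x, N):
    """b_{J,2}(x) = K((x - t_{J+1})/h), J = -1, ..., N, with K(u) = (1 - |u|)_+."""
    t, h = knots(N)
    x = np.asarray(x, dtype=float)[:, None]
    centers = t[None, :]                             # t_{J+1} for J = -1, ..., N
    return np.clip(1.0 - np.abs((x - centers) / h), 0.0, None)

# The linear basis is a partition of unity on [0, 1]: the rows below all sum to 1
x = np.linspace(0.0, 1.0, 11)
print(basis_linear(x, N=4).sum(axis=1))
```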
1.4 Tensor Product Splines

In Section 1.3, we introduced spline smoothing and B-spline functions for univariate variables. In this section, we generalize B-splines to functions of multivariate variables by the tensor product construction. Let X = (X_1, ..., X_q)^T be a q-dimensional vector of continuous predictors. Assume that for 1 ≤ l ≤ q, each X_l is distributed on a compact interval [a_l, b_l], and without loss of generality we take all intervals [a_l, b_l] = [0, 1]. Let G_l = G_l^(m_l−2) be the space of polynomial splines of order m_l and pre-select an integer N_l = N_{n,l}. Divide [0, 1] into (N_l + 1) subintervals I_{J_l,l} = [t_{J_l,l}, t_{J_l+1,l}), J_l = 0, ..., N_l − 1, I_{N_l,l} = [t_{N_l,l}, 1], where {t_{J_l,l}}_{J_l=1}^{N_l} is a sequence of equally-spaced points, called interior knots, given as

t_{−(m_l−1),l} = ··· = t_{0,l} = 0 < t_{1,l} < ··· < t_{N_l,l} < 1 = t_{N_l+1,l} = ··· = t_{N_l+m_l,l},

in which t_{J_l,l} = J_l h_l, J_l = 0, 1, ..., N_l + 1, and h_l = 1/(N_l + 1) is the distance between neighboring knots. Then G_l consists of functions ϖ satisfying (i) ϖ is a polynomial of degree m_l − 1 on each of the subintervals I_{J_l,l}, J_l = 0, ..., N_l; (ii) for m_l ≥ 2, ϖ is m_l − 2 times continuously differentiable on [0, 1]. Let K_l = K_{n,l} = N_l + m_l, where N_l is the number of interior knots and m_l is the spline order, and let B_l(x_l) = {B_{J_l,l}(x_l) : 1 − m_l ≤ J_l ≤ N_l}^T be a basis system of the space G_l. We define the space of tensor-product polynomial splines by G = ⊗_{l=1}^q G_l. It is clear that G is a linear space of dimension K_n = ∏_{l=1}^q K_l. Then

B(x) = [{B_{J_1,...,J_q}(x)}_{J_1=1−m_1,...,J_q=1−m_q}^{N_1,...,N_q}]_{K_n×1} = B_1(x_1) ⊗ ··· ⊗ B_q(x_q)   (1.3)

is a basis system of the space G, where x = (x_l)_{l=1}^q.

In Chapter 5, tensor product splines and categorical kernel functions, which will be introduced in that chapter, are applied to study nonparametric regression with multivariate continuous and categorical variables.

Chapter 2 A Jump-detecting Procedure based on Polynomial Spline Estimation

2.1 Introduction

This chapter is based on Ma and Yang (2011a). In applications of regression methods, ignoring possible jump points may result in a serious error in drawing inference about the process under study. Whenever no appropriate parametric method is available, one may start from nonparametric regression. Two popular nonparametric techniques are kernel and spline smoothing. For properties of kernel estimators in the absence of jump points, see Mack and Silverman (1982), Fan and Gijbels (1996), Xia (1998) and Claeskens and Van Keilegom (2003); for spline estimators, see Zhou, Shen and Wolfe (1998), Huang (2003) and Wang and Yang (2009a).

One is often interested in detecting jump points and estimating a regression function with jumps. We assume that the observations {(X_i, Y_i)}_{i=1}^n and unobserved errors {ε_i}_{i=1}^n are i.i.d. copies of (X, Y, ε) satisfying the regression model

Y = m(X) + σε,   (2.1)

where the joint distribution of (X, ε) satisfies Assumptions (A3) and (A4) in Section 2.2. The unknown mean function m(x), defined on an interval [a, b], may have a finite number of jump points. Jump regression analysis started in the early 1990s and has become an important research topic in statistics. See, for instance, Qiu, Asano and Li (1991), Müller (1992), Wu and Chu (1993), Qiu (1994) and Qiu and Yandell (1998) for procedures that detect the jumps explicitly before estimating the regression curve; Kang, Koo and Park (2000) for comparing two estimators of the regression curve after the jump points are detected; Qiu (2003) and Gijbels, Lambert and Qiu (2007) for jump-preserving curve estimators; and Joo and Qiu (2009) for jump detection in not only the regression curve but also its derivatives. For a comprehensive view on jump regression, see Qiu (2005). Jump detection has been tackled with many techniques, including local polynomial smoothing [Qiu (2003) and Gijbels et al. (2007)], smoothing splines [Shiau (1987)], and wavelet methods [Hall and Patil (1995), Wang (1995) and Park and Kim (2004, 2006)]; for two-dimensional cases, see Qiu (2007).
We propose a spline smoothing method to detect jumps by solving one optimization problem over the range of x instead of at each point, which is computationally more expedient than the kernel-type method in Müller (1992). A spline method was also discussed in Koo (1997), which proposed estimating a discontinuous regression function without providing theoretical justification. In contrast, the asymptotic distributions in Theorem 2.1 are established by using the strong approximation results in Wang and Yang (2009a), the Normal Comparison Lemma in Leadbetter, Lindgren and Rootzén (1983), and a convenient formula from Kılıç (2008) for inverting a tridiagonal matrix. The automatic procedures proposed for detecting jumps are based on implementing the asymptotics of Theorem 2.1.

This chapter is organized as follows. Section 2.2 states the main theoretical results based on (piecewise) constant and linear splines. Section 2.3 provides steps to implement the procedure based on the asymptotic result. Section 2.4 reports findings in both simulation and real data studies. All technical proofs are contained in the Appendices.

2.2 Main Results

We denote the space of p-th order smooth functions as C^(p)[a, b] = {φ : φ^(p) ∈ C[a, b]}, for p = 1, 2. Without loss of generality, we take the range of X to be [0, 1]. We use the spline functions introduced in Section 1.3 of Chapter 1. Define the spline estimator based on data {(X_i, Y_i)}_{i=1}^n drawn from model (2.1) as

m̂_p(x) = argmin_{g ∈ G^(p−2)[0,1]} ∑_{i=1}^n {Y_i − g(X_i)}²,  p = 1, 2.   (2.2)

The unknown function m(x) in (2.1) may be smooth, or may have jump points {τ_i}_{i=1}^k, for 0 = τ_0 < τ_1 < ··· < τ_k < τ_{k+1} = 1. The technical assumptions are listed as follows:

(A1) There exist a function m_0(x) ∈ C^(p)[0, 1] and a vector c = (c_1, ..., c_k) of jump magnitudes such that the regression function m(x) = c_l + m_0(x), x ∈ [τ_l, τ_{l+1}), for l = 1, ..., k − 1, m(x) = m_0(x), x ∈ [τ_0, τ_1), and m(x) = c_k + m_0(x), x ∈ [τ_k, τ_{k+1}].

(A2) The number of interior knots N = o(n^{1/(2p+1)+ϑ}) for any ϑ > 0, while N^{−1} = o(n^{−1/(2p+1)}/log n).

(A3) X is uniformly distributed on the interval [0, 1], i.e. the density function of X is f(x) = I(0 ≤ x ≤ 1).

(A4) The joint distribution F(x, ε) of the random variables (X, ε) satisfies the following: (a) The error is a white noise: E(ε | X = x) = 0, E(ε² | X = x) = 1. (b) There exist a positive value η > 1 and a finite positive M_η such that E|ε|^{2+η} < M_η and sup_{x∈[0,1]} E(|ε|^{2+η} | X = x) < M_η.

Assumption (A1) is similar to Müller and Song (1997). Assumption (A2) is similar to the undersmoothing condition in Claeskens and Van Keilegom (2003); thus the subinterval length h = o(n^{−1/(2p+1)}/log n) while h^{−1} = o(n^{1/(2p+1)+ϑ}) for any ϑ > 0. The uniform distribution of X in Assumption (A3) is for simplicity of the proofs, and can be relaxed to any distribution with a continuous and positive density function on [0, 1]. Assumption (A4) is identical to (C2)(a) of Mack and Silverman (1982). All are typical assumptions for nonparametric regression, with Assumption (A4) weaker than the corresponding assumption in Härdle (1989).
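As a computational illustration of the estimator in (2.2), the sketch below fits the constant (p = 1) and linear (p = 2) spline estimators by ordinary least squares on the bases of Section 1.3. It is a schematic Python translation rather than the XploRe implementation used in this chapter; the knot number is passed in by hand rather than chosen by the rules of Section 2.3, and the toy data are arbitrary.

```python
import numpy as np

def design(x, N, p):
    """Spline design matrix on [0,1]: p=1 constant basis b_{J,1}, p=2 linear basis b_{J,2}."""
    h = 1.0 / (N + 1)
    x = np.asarray(x, dtype=float)
    if p == 1:
        J = np.minimum((x / h).astype(int), N)
        B = np.zeros((x.size, N + 1))
        B[np.arange(x.size), J] = 1.0
        return B
    knots = np.arange(0, N + 2) * h                  # t_0, ..., t_{N+1}
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)

def spline_fit(X, Y, N, p):
    """Least-squares spline estimator of (2.2); returns a callable \\hat m_p."""
    B = design(X, N, p)
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)     # minimizes sum_i (Y_i - g(X_i))^2
    return lambda xg: design(xg, N, p) @ coef

# Toy usage with a jump of size 2 at x = 0.5 (illustrative data only)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 400)
Y = np.sin(2 * np.pi * X) + 2.0 * (X >= 0.5) + 0.2 * rng.standard_normal(400)
m_hat = spline_fit(X, Y, N=10, p=2)
print(m_hat(np.array([0.49, 0.51])))                 # estimates on either side of the jump
```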
We use the B-spline basis introduced in Section 1.3 of Chapter 1 for theoretical analysis. Define next the theoretical norms of the spline functions:

c_{J,n} = ∥b_{J,1}∥²_2 = ∫_0^1 I²_J(x) dx = h, 0 ≤ J ≤ N,
d_{J,n} = ∥b_{J,2}∥²_2 = ∫_0^1 K²((x − t_{J+1})/h) dx = 2h/3 for 0 ≤ J ≤ N − 1, and h/3 for J = −1, N,
⟨b_{J,2}, b_{J′,2}⟩ = ∫_0^1 K((x − t_{J+1})/h) K((x − t_{J′+1})/h) dx = h/6 if |J − J′| = 1, and 0 if |J − J′| > 1.   (2.3)

We introduce the rescaled B-spline basis {B_{J,p}(x)}_{J=1−p}^N for G^(p−2),

B_{J,p}(x) ≡ b_{J,p}(x) ∥b_{J,p}∥_2^{−1}, J = 1 − p, ..., N.   (2.4)

The inner product matrix V of the B-spline basis {B_{J,2}(x)}_{J=−1}^N is denoted as

V = (v_{J,J′})_{J,J′=−1}^N = (⟨B_{J,2}, B_{J′,2}⟩)_{J,J′=−1}^N =
\begin{pmatrix}
1 & \sqrt{2}/4 & & & & 0 \\
\sqrt{2}/4 & 1 & 1/4 & & & \\
 & 1/4 & 1 & \ddots & & \\
 & & \ddots & \ddots & 1/4 & \\
 & & & 1/4 & 1 & \sqrt{2}/4 \\
0 & & & & \sqrt{2}/4 & 1
\end{pmatrix}_{(N+2)\times(N+2)},   (2.5)

which is computed via (2.3). Denote the inverse of V by S; for J = 1, ..., N, the 3 × 3 submatrices of S are expressed as

S = (s_{J,J′})_{J,J′=−1}^N = V^{−1},  S_J = \begin{pmatrix} s_{J−2,J−2} & s_{J−2,J−1} & s_{J−2,J} \\ s_{J−1,J−2} & s_{J−1,J−1} & s_{J−1,J} \\ s_{J,J−2} & s_{J,J−1} & s_{J,J} \end{pmatrix}.   (2.6)

To detect jumps in m, one tests the hypotheses H_0: m ∈ C^(p)[0, 1] vs. H_1: m ∈ C[0, 1] \ C^(p)[0, 1]. Denote by ∥c∥_2 = (c_1² + ··· + c_k²)^{1/2} the Euclidean norm of the vector c of all k jump magnitudes; then under Assumption (A1), one can write alternatively H_0: ∥c∥_2 = 0 vs. H_1: ∥c∥_2 > 0. For m̂_p(x) given in (2.2), p = 1, 2, define the test statistics

T_{1n} = max_{0≤J≤N−1} |δ̂_{1J}|, δ̂_{1J} = {m̂_1(t_{J+1}) − m̂_1(t_J)}/σ_{n,1},
T_{2n} = max_{1≤J≤N} |δ̂_{2J}|, δ̂_{2J} = [{m̂_2(t_{J+1}) + m̂_2(t_{J−1})}/2 − m̂_2(t_J)]/σ_{n,2,J},   (2.7)

where σ²_{n,1} = 2σ²/(nh), σ²_{n,2,J} = σ²(8nh/3)^{−1} ζ^T S_J ζ, ζ = (1, −2, 1)^T,   (2.8)

with S_J defined in (2.6). To state our main results, denote

d_N(α) = 1 − {2 log N}^{−1} [log{−(1/2) log(1 − α)} + (1/2){log log(N) + log 4π}].   (2.9)

Theorem 2.1. Under Assumptions (A1)-(A4) and H_0,

lim_{n→∞} P[T_{pn} > {2 log(N − 2p + 2)}^{1/2} d_{N−2p+2}(α)] = α, p = 1, 2.

A similar result by kernel smoothing for the fixed-design regression model exists in Theorem 3 of Wu and Chu (1993); the proof of that result, however, does not contain sufficient detail for us to comment further. It is feasible to derive a similar asymptotic result for T_{pn} under H_1, but that is beyond the scope of this chapter, so we leave it to future work.

2.3 Implementation

In this section, we describe how to implement in XploRe (Härdle, Hlávka and Klinke (2000)) the jump point detection procedures, using the results in Theorem 2.1. Given any sample {(X_i, Y_i)}_{i=1}^n from model (2.1), denote X_min = min(X_1, ..., X_n) and X_max = max(X_1, ..., X_n). We transform {X_i}_{i=1}^n onto the interval [0, 1] by subtracting X_min from each X_i and then dividing by X_max − X_min. The definition of m̂_p(x) in (2.2) entails

m̂_p(x) ≡ ∑_{J=1−p}^N λ̂′_{J,p} b_{J,p}(x), p = 1, 2,   (2.10)

where the coefficients {λ̂′_{1−p,p}, ..., λ̂′_{N,p}}^T solve the least squares problem

{λ̂′_{1−p,p}, ..., λ̂′_{N,p}}^T = argmin_{{λ_{1−p,p},...,λ_{N,p}} ∈ R^{N+p}} ∑_{i=1}^n {Y_i − ∑_{J=1−p}^N λ_{J,p} b_{J,p}(X_i)}².

By Assumption (A2), the number of interior knots N is taken to be N = [n^{1/3}(log n)²/5] for the constant spline (p = 1) and N = [n^{1/5}(log n)²/5] for the linear spline (p = 2), in which [a] denotes the integer part of a. Denote the estimator of Y_i by Ŷ_{i,p} = m̂_p(X_i), for i = 1, ..., n, with m̂_p given in (2.10), and define the estimator of σ as

σ̂_p = {∑_{i=1}^n (Y_i − Ŷ_{i,p})²/(n − N − p)}^{1/2}.
Basic spline smoothing theory, as in Wang and Yang (2009a), ensures that σ̂²_p →_p σ² as n → ∞; hence Theorem 2.1 holds if one replaces σ by σ̂_p. The asymptotic p-value p_{value,p} is obtained by solving the equation T_{pn} = {2 log(N − 2p + 2)}^{1/2} d_{N−2p+2}(p_{value,p}), p = 1, 2, with T_{pn} defined in (2.7) and estimated by replacing σ² with σ̂²_p, so that

p_{value,p} = 1 − exp(−2 exp[2 log(N − 2p + 2){1 − {2 log(N − 2p + 2)}^{−1/2} T_{pn}} − 2^{−1}{log log(N − 2p + 2) + log 4π}]).   (2.11)

When the p-value is below a pre-determined α, one concludes that there exist jump points in m. The jump locations and magnitudes are estimated as follows. We use the BIC criterion proposed in Xue and Yang (2006) to select the "optimal" N, denoted N̂_opt, from [[4n^{1/3}] + 4, min([10n^{1/3}], [n/2] − 1)], which minimizes the BIC value BIC(N) = log(σ̂²_1) + (N + 1) × log(n)/n. By letting p = 1 and replacing T_{1n} with |δ̂_{1J}|, for 0 ≤ J ≤ N − 1, in (2.11), we obtain a p-value p_{value,1,J} for each δ̂_{1J}. The jump locations τ_i, 1 ≤ i ≤ k, are estimated by τ̂_i = (t_{l_i} + t_{l_i+1})/2 for p_{value,1,l_i} < α, with ĉ_i = m̂_1(t_{l_i+1}) − m̂_1(t_{l_i}) as the estimated jump magnitudes, for 0 ≤ l_1, ..., l_k ≤ N − 1. It is apparent that for τ_i ∈ [t_{l_i}, t_{l_i+1}], τ̂_i → τ_i, 1 ≤ i ≤ k, as n → ∞.
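A minimal end-to-end sketch of the steps just described (linear-spline fit, variance estimate σ̂_2, test statistic T_2n from (2.7)-(2.8), and p-value from (2.11)) is given below for the case p = 2. The original computations were done in XploRe, so this Python version is illustrative only; it omits the BIC-based knot re-selection and the jump-location step.

```python
import numpy as np

def lin_design(x, N):
    """Linear B-spline design matrix b_{J,2}, J = -1,...,N, on [0,1] with h = 1/(N+1)."""
    h = 1.0 / (N + 1)
    knots = np.arange(N + 2) * h                     # t_0, ..., t_{N+1}
    return np.clip(1.0 - np.abs(np.asarray(x)[:, None] - knots[None, :]) / h, 0.0, None)

def jump_test_linear(X, Y, N):
    """Linear-spline (p = 2) jump test: returns (T_2n, asymptotic p-value)."""
    n, h = len(Y), 1.0 / (N + 1)
    B = lin_design(X, N)
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)     # coefficients of m_hat_2
    sigma2_hat = np.sum((Y - B @ coef) ** 2) / (n - N - 2)

    # Inner-product matrix V of the rescaled linear basis, (2.5), and S = V^{-1}
    V = np.eye(N + 2) + np.diag(np.full(N + 1, 0.25), 1) + np.diag(np.full(N + 1, 0.25), -1)
    V[0, 1] = V[1, 0] = V[-1, -2] = V[-2, -1] = np.sqrt(2.0) / 4.0
    S = np.linalg.inv(V)

    knots = np.arange(N + 2) * h
    m_at_knots = lin_design(knots, N) @ coef
    zeta = np.array([1.0, -2.0, 1.0])
    stats = []
    for J in range(1, N + 1):                        # delta_hat_{2J}, J = 1, ..., N
        SJ = S[J - 1:J + 2, J - 1:J + 2]             # 3x3 block as in (2.6)
        var_J = sigma2_hat * (3.0 / (8.0 * n * h)) * (zeta @ SJ @ zeta)
        num = (m_at_knots[J + 1] + m_at_knots[J - 1]) / 2.0 - m_at_knots[J]
        stats.append(abs(num) / np.sqrt(var_J))
    T2n = max(stats)

    Ntil = N - 2                                     # N - 2p + 2 with p = 2
    logN = np.log(Ntil)
    expo = 2.0 * logN * (1.0 - T2n / np.sqrt(2.0 * logN)) \
           - 0.5 * (np.log(logN) + np.log(4.0 * np.pi))
    pvalue = 1.0 - np.exp(-2.0 * np.exp(expo))       # (2.11)
    return T2n, pvalue
```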
2.4 Examples

2.4.1 Simulation example

Here, we examine the finite-sample performance of the procedure described in Section 2.3 when m(x) has at most one jump. The data are generated from model (2.1) with X ∼ U[−1/2, 1/2], ε ∼ N(0, 1), and m(x) = sin(2πx) + c × I(τ_1 ≤ x ≤ 1/2), for τ_1 = √2/4. The noise level is σ = 0.2, 0.5, the sample size n = 200, 600, 1000, and the significance level α = 0.05, 0.01. Let n_s be the number of replications. Denote the asymptotic powers based on the constant and linear splines by β̂_p(c), p = 1, 2, calculated from

β̂_p(c) = ∑_{q=1}^{n_s} I[T_{n,p,q} > {2 log(N − 2p + 2)}^{1/2} d_{N−2p+2}(α)]/n_s,

where T_{n,p,q} is the q-th replication of T_{pn}, with T_{pn} given in (2.7) and d_N(α) given in (2.9), for p = 1, 2. Table 2.1 shows values of β̂_p(c) for c = 0 and c = 2.

Table 2.1: Power calculation for the simulated example in Chapter 2. Note: powers calculated from the test statistic T_pn defined in (2.7) by constant and linear splines, respectively, over n_s = 500 replications.

                        β̂2(c)        β̂2(c)        β̂1(c)        β̂1(c)
 c    σ       n       α = 0.05     α = 0.01     α = 0.05     α = 0.01
 0   0.2     200       0.100        0.032        0.640        0.280
             600       0.062        0.014        0.390        0.140
            1000       0.046        0.010        0.220        0.050
     0.5     200       0.058        0.012        0.220        0.070
             600       0.054        0.006        0.180        0.040
            1000       0.050        0.010        0.120        0.030
 2   0.2     200       1.000        0.998        1.000        1.000
             600       1.000        1.000        1.000        1.000
            1000       1.000        1.000        1.000        1.000
     0.5     200       0.942        0.776        0.890        0.680
             600       1.000        0.980        1.000        0.970
            1000       1.000        1.000        1.000        1.000

In Table 2.1, β̂_p(2), p = 1, 2, approach 1 rapidly. Meanwhile β̂_2(0) approaches α as the sample size increases, a very positive confirmation of Theorem 2.1, in contrast to β̂_1(0), whose convergence is much slower, indicating that the linear spline method outperforms the constant spline method. Table 2.1 also shows that the noise level has more influence on the constant spline method than on the linear spline method. Table 2.2 shows the average computing time (in seconds) for generating data and detecting jumps by the constant and linear spline methods, which are comparable.

Table 2.2: Computing time for simulated example in Chapter 2. Note: computing time (in seconds) per replication over n_s = 500 replications of generating data and detecting jumps by the constant and linear spline methods.

 sample size n   constant   linear
      200          0.04      0.06
      600          0.21      0.30
     1000          0.55      0.60

There are 500 replications for n = 200, 600 satisfying p_{value,2} < α = 0.05, with p_{value,2} given in (2.11), when c = 2, n_s = 500. Figures 2.1 and 2.2 show the kernel estimators of the densities of τ̂_1 and ĉ_1 given in Section 2.3 with sample size n = 200 (thick lines) and n = 600 (medium lines) at σ = 0.2. The vertical lines at √2/4 and 2 are the reference lines for comparing τ̂_1 to τ_1 and ĉ_1 to c_1, respectively. One clearly sees that the centers of both density plots move toward the reference lines, with much narrower spread, as the sample size n increases. The frequencies over 500 replications of τ_1 falling between t_{l_1} and t_{l_1+1}, as described in Section 2.3, are 0.994 and 1 for n = 200 and 600, respectively. For a visual impression of the actual function estimates, at noise level σ = 0.2 with sample size n = 600, we plot the spline estimator m̂_2(x) (solid curves) together with the true functions m(x) (thick solid curves) in Figures 2.3 and 2.4. The spline estimators seem rather satisfactory.

Figure 2.1: Kernel density plots of τ̂_1 in Chapter 2. Note: kernel density plots of τ̂_1 over 500 replications of sample size n = 200 (thick solid) and n = 600 (solid) for which H_0 is rejected.

Figure 2.2: Kernel density plots of ĉ_1 in Chapter 2. Note: kernel density plots of ĉ_1 over 500 replications of sample size n = 200 (thick solid) and n = 600 (solid) for which H_0 is rejected.

Figure 2.3: Plots of the spline estimator for c = 0 in Chapter 2 (n = 600, c = 0). Note: the true function m(x) (thick solid curve), the spline estimator m̂_2(x) (solid curve) and the data scatter plot at σ = 0.2.

Figure 2.4: Plots of the spline estimator for c = 2 in Chapter 2 (n = 600, c = 2). Note: the true function m(x) (thick solid curve), the spline estimator m̂_2(x) (solid curve) and the data scatter plot at σ = 0.2.

2.4.2 Real data analysis

We apply the jump detection procedures in Section 2.3 to the thickness of pennies data set given in Scott (1992), which consists of measurements in mils of the thickness of 90 US Lincoln pennies. There are two measurements taken as the response variable Y each year, from 1945 through 1989. Penny thickness was reduced in World War II, restored to its original thickness sometime around 1960, and reduced again in the 70's. The asymptotic p-value is p_{value,2} < 10^{−20}. Two jump points are detected with p-values 0.014468 and 0.00077337, located around the year 1958 with an increase of magnitude 2.80, and around the year 1974 with a decrease of magnitude 3.75, respectively, which confirms the result in Gijbels et al. (2007). Figure 2.5 depicts the data points and the spline estimator m̂_2(x) (solid line), which confirms these findings visually. Findings from both the simulated and the real data demonstrate the effectiveness of our approach in detecting the existence of jumps. The plots of m̂_2(x) in Figures 2.3 and 2.5 give an outline of the true function, without breaking the curve at the jumps.
Obtaining jump-preserving spline estimator of the true non-smooth function is beyond the scope of this chapter, but makes an interesting topic for further research. 2.5 Appendix A 2.5.1 Preliminaries Denote by ∥·∥∞ the supremum norm of a function r on [0, 1], i.e. ∥r∥∞ = supx∈[0,1] |r (x)|. We denote by the same letters c, C, any positive constants without distinction. The following extension of Leadbetter, Lindgren and Rootz´n, H. (1983), Theorem 6.2.1 is a key e result on the absolute maximum of discrete time Gaussian processes. ( ) (n) (n) (n) (n) 2 Lemma 2.1. Let ξ 1 , ..., ξ n have jointly normal distribution with Eξ i ≡ 0, E ξ i ≡ 1, 1 ≤ i ≤ n and there exists constants C > 0, a > 1, r ∈ (0, 1) such that the corre) ( (n) (n) (n) lations rij = rij = Eξ i ξ j , 1 ≤ i ̸= j ≤ n satisfy rij ≤ min r, Ca−|i−j| for 24 58 56 52 54 Y 1950 1960 1970 1980 X Figure 2.5: Spline estimator for the thickness of pennies data Note: the thickness of pennies data (points) and the spline estimator m2 (x). ˆ 25 { (n) (n) ξ 1 , ..., ξ n } 1 ≤ i ̸= j ≤ n. Then the absolute maximum Mn,ξ = max satisfies for ) ( ) ( any τ ∈ R, P Mn,ξ ≤ τ /an + bn → exp −2e−τ , as n → ∞, where an = (2 log n)1/2 , bn = an − 1 a−1 (log log n + log 4π). 2 n Proof. Take ε > 0 such that (2 − ε) (1 + r)−1 = 1 + δ, for some δ > 0. Let τ n = τ /an + bn , then τ 2 / (2 log n) → 1, as n → ∞, so for large n, τ 2 > (2 − ε) log n. By n n ( ) ( ) 2 −1/2 −|i−j| < 1 for i ̸= j, one has r the condition rij ≤ min r, Ca ≤ ij 1 − rij ( ) −|i−j| 1 − r2 −1/2 . Let M Ca n,η = max {|η 1 | , . . . , |η n |}, where η 1 , . . . , η n are i.i.d. copies of N (0, 1). By Leadbetter, Lindgren and Rootz´n (1983), Theorem 1.5.3, e ( ) ( ) P Mn,η ≤ τ n → exp −2e−τ , as n → ∞. The Normal Comparison Lemma (Leadbetter, Lindgren and Rootz´n, H. (1983), Lemma 11.1.2) entails that e ( P ( ) ( ) (n) −τ n < ξ j ≤ τ n for j = 1, . . . , n − P −τ n < η j ≤ τ n for j = 1, . . . , n ( ) { ( )} ∑ 2 −1/2 rij 1 − rij ≤ (4/2π) exp −τ 2 / 1 + rij . n 1≤j 0, p ≥ 1 such that for any m ∈ C (p) [0, 1], there exists a function g ∈ G(p−2) [0, 1] such that ∥g − m∥∞ ≤ Cp m(p) ∞ hp and m (x) defined in ˜ (2.14), with probability approaching 1, satisfies mp (x) − m (x) ∞ = O (hp ). ˜ 2.5.2 Proof of Theorem 2.1 for p = 1 For x ∈ [0, 1], define its location and relative position indices J (x) , δ (x) as J (x) = Jn (x) = { } min {[x/h] , N } , δ (x) = x − tJ(x) h−1 . It is clear that tJ (x) ≤ x < tJ (x)+1 , 0 ≤ n n ⟩ ⟨ = 0 unless J = J ′ , for BJ,1 (x) δ (x) < 1, ∀x ∈ [0, 1) , and δ (1) = 1. Since B ′ ,BJ,1 J ,1 n given in (2.4). ˜1 (x) in (2.14) can be written as ε ˜1 (x) = ε ∑N ∑n ( ) ε∗ BJ,1 (x) BJ −2 , ε∗ = n−1 BJ,1 Xi σεi . 2,n J,1 J=0 J,1 i=1 27 ∑N 2 ∗ 2 ε J=0 εJ,1 BJ,1 (x), it is easy to prove that E {ˆ1 (x)} = σ / (nh) and for 0 ≤ ) { ( ( )} J ≤ N −1, E ˆ1 tJ+1 − ˆ1 tJ 2 = 2σ 2 / (nh), which is σ 2 defined in (2.8). Define for ε ε n,1 { ( ) ( )} { ( ) ( )} 0 ≤ J ≤ N −1, ˜n,1,J = σ −1 ˜1 tJ+1 − ˜1 tJ , ˆn,1,J = σ −1 ˆ1 tJ+1 − ˆ1 tJ . ξ ε ε ξ ε ε n,1 n,1 Let ˆ1 (x) = ε ˜ Lemma 2.3. Under Assumptions (A2)-(A4), as n → ∞, sup ξ n,1,J − ˆn,1,J = ξ 0≤J≤N −1 ( ) −1/2 h−1/2 log n = o (1). Oa.s n a.s { }N −1 (k) Lemma 2.4. Under Assumptions (A2)-(A4), there exist ˆn,1,J ξ , k = 1, 2, 3 such that J=0 (1) (3) (2) as n → ∞, sup0≤J≤N −1 ˆn,1,J − ˆn,1,J + sup0≤J≤N −1 ˆn,1,J − ˆn,1,J = oa.s (1). 
ξ ξ ξ ξ { }N −1 { }N −1 { }N −1 (1) (2) (3) ˆ ξ n,1,J has the same probability distribution as ˆn,1,J ξ , and ˆn,1,J ξ J=0 J=0 J=0 is a Gaussian process with mean 0, variance 1, and covariance  }  −1/2, for J − J ′ = 1 {  ˆ(3) , ˆ(3) = . cov ξ n,1,J ξ  n,1,J ′  0, for J − J ′ > 1 Lemmas 2.3 and 2.4 follow from Appendix A of Wang and Yang (2009a). Proof of Theorem 2.1 for p = 1: It is clear from Lemma 2.4 that the Gaussian process { }N −1 ˆ(3) satisfies the conditions of Lemma 2.1, hence as n → ∞, ξ n,1,J J=0 )} {( ( ) ˆ(3) → exp −2e−τ . P sup0≤J≤N −1 ξ n,1,J ≤ τ /aN + bN By letting τ = − log {− (1/2) log (1 − α)}, and using the definition of aN , bN and dN (α) 28 we obtain { (3) sup0≤J≤N −1 ˆn,1,J ≤ − log {− (1/2) log (1 − α)} (2 log N )−1/2 ξ } 1/2 − (1/2) (2 log N )−1/2 (log log N + log 4π) = 1 − α. + (2 log N ) { } 1/2 d (α) = 1 − α. ˆ(3) limn→∞ P sup0≤J≤N −1 ξ n,1,J ≤ (2 log N ) N limn→∞ P By Lemmas 2.3 and 2.4, we have } { lim P n→∞ ˜ ξ n,1,J ≤ (2 log N )1/2 dN (α) sup 0≤J≤N −1 = 1 − α, which implies for 0 ≤ J ≤ N − 1 { lim P n→∞ dN (α)−1 (2 log N )−1/2 σ −1 n,1 ( ) ( sup ˜1 tJ+1 − ˜1 tJ ε ε 0≤J≤N −1 ) } ≤1 = 1 − α. ( ) ( ) Lemma 2.2 entails that under H0 sup0≤J≤N −1 m1 tJ − m tJ = Op (h) and ˜ ( ) ( ) sup0≤J≤N −1 m tJ+1 − m tJ = Op (h) , which imply that ( ) ( ) sup m tJ+1 − m tJ 0≤J≤N −1 { } { } = Op (nh)1/2 (log N )−1/2 h = op (log n)−2 . σ −1 (log N )−1/2 n,1 { }} )} { ( ) ) ( { ( ) ( ) ( + ˜ ˜ ˆ By (2.13), m1 tJ+1 − m1 tJ = m1 tJ+1 − m tJ+1 − m1 tJ − m tJ ˆ 29 { ( ) ( )} { ( ) ( )} m tJ+1 − m tJ + ˜1 tJ+1 − ˜1 tJ , then for dN (α) defined in (2.9), ε ε { } ( ) ( ) limn→∞ P sup0≤J≤N −1 m1 tJ+1 − m1 tJ ≤ σ n,1 (2 log N )1/2 dN (α) = ˆ ˆ { } ( ) ( ) 1/2 d (α) = 1 − α. limn→∞ P sup0≤J≤N −1 ˜1 tJ+1 − ˜1 tJ ≤ σ n,1 (2 log N ) ε ε N 2.6 Appendix B 2.6.1 Preliminaries The following lemma from Wang and Yang (2009a) shows that multiplication by V defined in (2.5) behaves similarly to multiplication by a constant. We use |T | to denote the maximal absolute value of any matrix T . ( )N Lemma 2.5. Given matrix Ω = V + Γ, in which Γ = γ ′ satisfies γ ′ ≡ 0 if jj J,J ′ =−1 jj p J − J ′ > 1 and |Γ| → 0. Then there exist constants c, C > 0 independent of n and Γ, such that with probability approaching 1 c |ξ| ≤ |Ωξ| ≤ C |ξ| , C −1 |ξ| ≤ Ω−1 ξ ≤ c−1 |ξ| , ∀ξ ∈ RN +2 . (2.15) To prove Theorem 2.1 for p = 2, we need the result below (Corollary 16 of Kılı¸ (2008)), c which gives explicit formula for the inverse of symmetric tridiagonal matrix.        Lemma 2.6. For any symmetric tridiagonal matrix Gn =      30 x1 y 1    ..  . y 1 x2  , the  ... ... yn−1    yn−1 xn [ ] inverse of the matrix Gn , G−1 = wij is given by n  ( ) ( ) ( )−1 ∏  k−1 C b −2 y 2 , i = J  C b −1 + ∑n b  t i t=i } t k=i+1 Ck { wij = ∏i−1 ( b )−1   (−1)i+J yt wii , i > J  t=J Ct ( )−1 b b b 2 in which C1 = x1 , Cn = xn − Cn−1 yn−1 , n = 2, 3, .... Lemma 2.7. There exists a constant Cs > 0, such that ∑N J=−1 sJ ′ J ≤ Cs , and 17/16 ≤ sjj ≤ 5/4, where s ′ , 0 ≤ J, J ′ ≤ N − 1, is the element of S defined in (2.6). J J { ( )}N ∑ ˜ ˜ ξ , then N Proof. By (2.15), let ˜ ′ = sgn s ′ J=−1 sJ ′ J ≤ Sξ J ′ ≤ Cs ξ J ′ = J J J J=−1 Cs , ∀J ′ = −1, 0, ..., N . Applying Lemma 2.6 to V with x−1 = · · · = xN = 1, y−1 = ( )−1 N √ ∏ ∑ ( b )−1 k−1 ( b )−2 2 b Ct yt . Ck yN −1 = 2/4, y0 = · · · = yN −1 = 1/4, sjj = CJ + t=J k=J+1 b By mathematical induction, one obtains that 9/10 ≤ CJ ≤ 1, for 0 ≤ J ≤ N − 1. 
So, ( )−1 b 1 ≤ CJ ≤ 10/9, and for 0 ≤ J ≤ N − 1, sjj ≥ 1 + 2.6.2 ∑N ∏k−1 ∑N 2 yt ≥ 1 + (16)−(k−J) ≥ 17/16, k=J+1 t=J k=J+1 ∑N sjj ≤ 10/9 + (10/9) (1/9)k−J ≤ 5/4. k=J+1 Variance calculation ( Vector ˜2 = a−1,2 , ..., aN,2 a ˜ ˜ (⟨ )T given in (2.14) solves the normal equations, )N ⟩ )N ( ∑n ( ) , BJ,2 , B ′ ˜2 = n−1 a BJ,2 Xi σεi J ,2 n J,J ′ =−1 i=1 J=−1 31 ( )−1 ( ) −1 ∑n B (X ) σε N ˜ for BJ,2 (x) in (2.4). In other words, ˜2 = V + B a n , i i=1 J,2 i (√ ) J=−1 (⟨ ⟩ )N ˜ ˜ where B = BJ,2 , B ′ − V satisfies B = Op n−1 h−1 log (n) accordJ ,2 n J,J ′ =−1 ing to Subsection B.2 of the supplement to Wang and Yang (2009a). ( )T ( )−1 ˜ Now define ˆ2 = a−1,2 , ..., aN,2 a ˆ ˆ by replacing V + B with V−1 = S in above (∑ ) ( ) ∑ N formula, i.e., ˆ2 = a sJ ′ J n−1 n BJ,2 Xi σεi ′ , and define for i=1 J=−1 J =−1,..,N x ∈ [0, 1] ˆ2 (x) = ε ∑N a B (x) = ˆ J=−1 J,2 J,2 ∑N s n J,J ′ =−1 J ′ J −1 ∑n ( ) BJ,2 Xi σεi BJ ′ ,2 (x) , i=1 ( )} { ( ( ) ) ˆ ε ξ 2,J = ˆ2 tJ+1 + ˆ2 tJ−1 /2 − ˆ2 tJ , 2 ≤ J ≤ N − 1, ε ε        D =     (2.16) 0 1 −2 1 . . . . . . 0 0 .. . 0 0   .  .. .. .  . . .  .  .  .. .. .. .  . . . .   1 −2 1 0 (N −2)×(N +2) Lemma 2.8. With S and D given in (2.6) and (2.17), }N −1 { ˆ has covariance matrix ξ 2,J J=2 [ ( )]N −1 cov ˆ2,J , ˆ ′ ξ ξ = σ 2 (8nh/3)−1 DSDT . 2,J J,J ′ =2 32 (2.17) (2.18) ( ) ∑ ( ) ( ) ∑ Proof. For 0 ≤ J, J ′ ≤ N +1, ˆ2 tJ = N ′ ε s ′ n−1 n Bk,2 Xi σεi B ′ tJ i=1 k ,2 k,k =−1 k k ∑N ∑n ( ) ( ) =σ n−1 Bk,2 Xi εi s(J−1),k B(J−1),2 tJ . k=−1 i=1 [∑ ] [ ( ) ( )] ∑n ( ) ( ) N 2E −1 X εs B t E ˆ2 tJ ˆ2 tJ ′ = σ ε ε n B k=−1 i=1 k,2 i i J−1,k J−1,2 J [∑ ( )] ∑n ( ) N −1 t B X εs B × n i=1 k ′ ,2 i i J ′ −1,k ′ J ′ −1,2 J ′ k ′ =−1 ( ) ∑N ( ) 2 n−1 EBk,2 (X) B ′ (X) =σ B t B ′ t s s k ,2 J −1,2 J ′ J−1,k J ′ −1,k ′ k,k ′ =−1 J−1,2 J = σ 2 n−1 ( ) ( ) BJ−1,2 tJ B ′ v t s s J −1,2 J ′ J−1,k J ′ −1,k ′ k,k ′ k,k ′ =−1 ∑N ( ) ∑N ∑N ( ) = σ 2 n−1 BJ−1,2 tJ B ′ t ′ s v ′ =−1 sJ ′ −1,k ′ J −1,2 J k k=−1 J−1,k k,k ′ ( ) ∑N ( ) = σ 2 n−1 BJ−1,2 tJ B ′ t s δ J −1,2 J ′ k ′ =−1 J ′ −1,k ′ J−1,k ′ ( ) ( ) −1/2 −1/2 s . = σ 2 n−1 BJ−1,2 tJ B ′ t ′ s ′ = σ 2 n−1 dJ−1,n d ′ J −1,2 J J −1,J−1 J −1,n J ′ −1,J−1 ( ) ′ ≤ N − 1, E ˆ ˆ ˆ ,d By definitions of ξ 2,J J,n in (2.16), (2.3), for 2 ≤ J, J ξ 2,J ξ 2,J ′ is ( σ 2 (8nh/3)−1 s ′ + s ′ − 2s ′ +s ′ +s ′ J ,J J −2,J J −1,J J ,J−2 J −2,J−2 ) −2s ′ − 2s ′ − 2s ′ + 4s ′ J −1,J−2 J ,J−1 J −2,J−1 J −1,J−1   s s s )  J ′ −2,J−2 J ′ −2,J−1 J ′ −2,J  ( )T (     2 (8nh/3)−1 . =σ 1 −2 1  sJ ′ −1,J−2 sJ ′ −1,J−1 sJ ′ −1,J  1 −2 1     s ′ s ′ s ′ J ,J−2 J ,J−1 J ,J [ ( )]N −1 −1 2 T Therefore, for 2 ≤ J, J ′ ≤ N − 1, cov ˆ2,J , ˆ ′ ξ ξ ′ =2 = σ (8nh/3) DSD . 2,J J,J 33 −1 2 Lemma 2.9. For 2 ≤ J ≤ N − 1, σ 2 n,2,J defined in (2.8) satisfies that cσ (8nh/3) σ ≤ −1 2 σ2 n,2,J ≤ Cσ (8nh/3) σ , for cσ = (65/8) (17/16), Cσ = 100/9. ˆ2 Proof. It follows from (2.18) that σ 2 n,2,J = E ξ 2,J . Then by Lemmas 2.6 and 2.8, for { }−1 2 ≤ J ≤ N − 1, σ 2 (8nh/3)−1 σ2 n,2,J is sJ,J − 4sJ,J−1 + 2sJ,J−2 + 4sJ−1,J−1 − 4sJ−1,J−2 + sJ−2,J−2 {( } )−1 b = sJ−2,J−2 + 4 CJ−2 yJ−2 + 1 sJ−1,J−1 + { ( } ( )−1 )−1 b Cb b 2 CJ−2 J−1 yJ−2 yJ−1 + 4 CJ−1 yJ−1 + 1 sJ,J , 2 −1 thus, σ 2 n,2,J ≤ {1 + 4 (1/3 + 1) + (2/9 + 4/3 + 1)} (5/4) σ (8nh/3) = (100/9) (8nh/3)−1 σ 2 , 2 −1 σ2 n,2,J ≥ {1 + 4 (1/4 + 1) + (2/16 + 1 + 1)} (17/16) σ (8nh/3) = (65/8) (17/16) (8nh/3)−1 σ 2 . 2.6.3 Proof of Theorem 2.1 for p = 2 Several lemmas will be given below for proving Theorem 2.1 for p = 2. 
With ˜2 (x), ˆ2,J ε ξ and σ n,2,J defined in (2.14), (2.16) and (2.8), define for 2 ≤ J ≤ N − 1 ˜ ξ n,2,J = σ −1 n,2,J )} ( [{ ( ( )] ) ξ ε ε ˜2 tJ+1 + ˜2 tJ−1 /2 − ˜2 tJ , ˆn,2,J = σ −1 ˆ2,J , (2.19) ε n,2,J ξ 34 ˆ Lemma 2.10. Under Assumptions (A2)-(A4), as n → ∞, sup ξ n,2,J − ˜n,2,J = ξ 2≤J≤N −1 (√ ) Oa.s log n/ (nh) = oa.s (1). { }N −1 (k) Lemma 2.11. Under Assumptions (A2)-(A4), there exist ˆn,2,J , k = 1, 2, 3, such ξ J=2 (2) (3) (1) ξ ξ that as n → ∞, sup2≤J≤N −1 ˆn,2,J − ˆn,2,J +sup2≤J≤N −1 ˆn,2,J − ˆn,2,J = oa.s (1). ξ ξ { } ˆ(1) has the same probability distribution as ˆ(2) . ˆ(3) ξ n,2,J ξ n,2,J ξ n,2,J is a Gaussian process with ( ) ( ) (3) ˆ(3) ξ −1 σ −1 E ˆ ˆ ˆ = σ n,2,J ξ 2,J ξ 2,J ′ mean 0, variance 1, covariance r ′ = cov ξ n,2,J , ξ J,J n,2,J ′ n,2,J ′ for which there exist constants 0 < C, 0 < r < 1 such that for large n, ( ) − J−J ′ ξ r ′ ≤ min r, C3 , 2 ≤ J, J ′ ≤ N − 1. J,J ( − J−J ′ ξ Proof. We only prove r ′ ≤ min r, C3 J,J ) . Lemma 2.10 and the rest of Lemma 2.11 follow from Appendix B of the supplement to Wang and Yang (2009a). By (2.18), ) ( − 2s ′ +s ′ +s ′ σ −2 (8nh/3)−1 E ˆ2,J ˆ ′ = s ′ + s ′ ξ ξ 2,J J ,J J −2,J J −1,J J ,J−2 J −2,J−2 −2s ′ − 2s ′ − 2s ′ + 4s ′ . J −1,J−2 J ,J−1 J −2,J−1 J −1,J−1 ∏ ′ J−1 ( b )−1 yt sjj , then for Ct By Lemma 2.6, for −1 ≤ J ′ < J ≤ N , s ′ = (−1)J+J J,J ′ t=J 35 2 ≤ J, J ′ ≤ N − 1 and J − J ′ > 2, by Lemma 2.7, { ] }−1 [ σ 2 (8nh/3)−1 E ˆ2,J ˆ ′ ξ ξ 2,J } {( )−1 ( )−1 b J+J ′ b y ′ +1 y ′ +2 C ′ = (−1) C ′ J −2 J −1 J −1 J −2 { ( )−1 b × sJ−2,J−2 + 2 CJ−2 yJ−2 sJ−1,J−1 + } J−3 ( ) )−1 ∏ b Cb b −1 y CJ−2 J−1 yJ−2 yJ−1 sJ,J Ct t ′ ( t=J ) ( ) { } − J−J ′ −2 − J−J ′ ≤ (5/4) (1/3 + 2/3 + 1) 1 + 2/3 + (1/3)2 3 ≤ 40 × 3 . ( { } 2 (8nh/3)−1 −1 σ 2 By Lemma 2.9, σ n,2,J ≥ (65/8) (17/16), for 2 ≤ J ≤ N − 1. Therefore, ( ) − J−J ′ ξ for 2 ≤ J, J ′ ≤ N − 1 and J − J ′ > 2, r ′ ≤ C3 ≤ r < 1, with C = J,J 40 (8/65) (16/17) and r = 40 (8/65) (16/17) /33 < 1. For J − J ′ = 1, 2, the result can be proved following the same procedure above. Proof of Theorem 2.1 for p = 2: It is clear from Lemma 2.11 that the Gaussian { }N −1 (3) process ˆn,2,J ξ satisfies the conditions of Lemma 2.1, hence as n → ∞, J=2 ( P (3) sup2≤J≤N −1 ˆn,2,J ≤ τ /aN + bN ξ ) ( ) → exp −2e−τ . By Lemmas 2.10, 2.11, with τ = − log {− (1/2) log (1 − α)} and definitions of aN and bN , ) ( ξ limn→∞ P sup2≤J≤N −1 ˜n,2,J ≤ {2 log (N − 2)}1/2 dN −2 (α) = 1 − α, for any 0 < α < 1, ˆn,2,J and dN (α) defined in (2.19) and (2.9). By (2.13) and (2.16), ξ 36 { ( )} ( ( ) ) m2 tJ+1 + m2 tJ−1 /2 − m2 tJ is ˆ ˆ ˆ { { ( ) ( )} ( ) ( )} { ( ) ( )} ˜ ˜ m2 tJ+1 − m tJ+1 /2 + m2 tJ−1 − m tJ−1 /2 − m2 tJ − m tJ ˜ [{ ( ( )} ) ( )] ξ + m tJ+1 + m tJ−1 /2 − m tJ + ˆ2,J+1 ( ) 2 , hence Now Lemma 2.2 implies that under H0 , ∥m2 − m∥∞ = Op h ˜ (nh)1/2 {log (N − 2)}−1/2 × { ( ) ( )} { ( ) ( )} m2 tJ+1 − m tJ+1 /2 + m2 tJ−1 − m tJ−1 /2 ˜ ˜ sup 2≤J≤N −1 ] } [ { { ( ) ( )} − m2 tJ − m tJ ˜ = Op (nh)1/2 {log (N − 2)}−1/2 h2 = op (log n)−3 . By Taylor expansion, sup2≤J≤N −1 { ( ( )} ( ) ) ( ) m tJ+1 + m tJ−1 /2 − m tJ = Op h2 under H0 , as n → ∞. Hence { ( ( )} ) ( ) sup m tJ+1 + m tJ−1 /2 − m tJ 2≤J≤N −1 [√ ] { } −1/2 h2 = o (log n)−3 . = Op nh {log (N − 2)} p (nh)1/2 {log (N − 2)}−1/2 By above results, for T2n defined in (2.7), { } limn→∞ P T2n ≤ {2 log (N − 2)}1/2 dN −2 (α) = 1 − α. 37 Chapter 3 A Simultaneous Confidence Band for Sparse Longitudinal Regression 3.1 Introduction This chapter is based on Ma, Yang and Carroll (2011). 
Functional data analysis (FDA) has in recent years become a focal area in statistics research, and much has been published in this area. An incomplete list includes Cardot, Ferraty, and Sarda (2003), Cardot and Sarda (2005), Ferraty and Vieu (2006), Hall and Heckman (2002), Hall, Müller, and Wang (2006), Izem and Marron (2007), James, Hastie, and Sugar (2000), James (2002), James and Silverman (2005), James and Sugar (2003), Li and Hsing (2007), Li and Hsing (2010), Morris and Carroll (2006), Müller and Stadtmüller (2005), Müller, Stadtmüller, and Yao (2006), Müller and Yao (2008), Ramsay and Silverman (2005), Wang, Carroll, and Lin (2005), Yao and Lee (2006), Yao, Müller, and Wang (2005a), Yao, Müller, and Wang (2005b), Yao (2007), Zhang and Chen (2007), Zhao, Marron, and Wells (2004), and Zhou, Huang, and Carroll (2008).

According to Ferraty and Vieu (2006), a functional data set consists of iid realizations {ξ_i(x), x ∈ X}, 1 ≤ i ≤ n, of a smooth stochastic process (random curve) {ξ(x), x ∈ X} over an entire interval X. A more data-oriented alternative in Ramsay and Silverman (2005) emphasizes smooth functional features inherent in discretely observed longitudinal data, so that the recording of each random curve ξ_i(x) is over a finite number of points in X, and contaminated with noise. This second view is taken in this chapter.

A typical functional data set therefore has the form {Xij, Yij}, 1 ≤ i ≤ n, 1 ≤ j ≤ Ni, in which Ni observations are taken for the ith subject, with Xij and Yij the jth predictor and response variables, respectively, for the ith subject. Generally, the predictor Xij takes values in a compact interval X = [a, b]. For the ith subject, its sample path {Xij, Yij} is the noisy realization of a continuous time stochastic process ξ_i(x) in the sense that

Yij = ξ_i(Xij) + σ(Xij) εij,   (3.1)

with errors εij satisfying E(εij) = 0, E(ε²_ij) = 1, and {ξ_i(x), x ∈ X} are iid copies of a process {ξ(x), x ∈ X} which is L², i.e., E∫_X ξ²(x) dx < +∞.

For the standard process {ξ(x), x ∈ X}, one defines the mean function m(x) = E{ξ(x)} and the covariance function G(x, x′) = cov{ξ(x), ξ(x′)}. Let the sequences {λ_k}_{k=1}^∞ and {ψ_k(x)}_{k=1}^∞ be the eigenvalues and eigenfunctions of G(x, x′), respectively, in which λ_1 ≥ λ_2 ≥ ··· ≥ 0, ∑_{k=1}^∞ λ_k < ∞, {ψ_k}_{k=1}^∞ form an orthonormal basis of L²(X) and G(x, x′) = ∑_{k=1}^∞ λ_k ψ_k(x) ψ_k(x′), which implies that ∫ G(x, x′) ψ_k(x′) dx′ = λ_k ψ_k(x).

The process {ξ_i(x), x ∈ X} allows the Karhunen-Loève L² representation

ξ_i(x) = m(x) + ∑_{k=1}^∞ ξ_{ik} ϕ_k(x),   (3.2)

where the random coefficients ξ_{ik} are uncorrelated with mean 0 and variances 1, and the functions ϕ_k = √λ_k ψ_k. In what follows, we assume that λ_k = 0 for k > κ, where κ is a positive integer, thus G(x, x′) = ∑_{k=1}^κ ϕ_k(x) ϕ_k(x′). Based on (3.2), the data generating process is now written as

Yij = m(Xij) + ∑_{k=1}^κ ξ_{ik} ϕ_k(Xij) + σ(Xij) εij.   (3.3)

The sequences {λ_k}_{k=1}^κ, {ϕ_k(x)}_{k=1}^κ and the random coefficients ξ_{ik} exist mathematically, but are unknown and unobservable. Two distinct types of functional data have been studied. Li and Hsing (2007) and Li and Hsing (2010) concern dense functional data, which in the context of model (3.1) means min_{1≤i≤n} Ni → ∞ as n → ∞. On the other hand, Yao, Müller, and Wang (2005a), Yao, Müller, and Wang (2005b), and Yao (2007) studied sparse longitudinal data for which Ni’s are i.i.d.
copies of an integer-valued positive random variable. Pointwise asymptotic distributions were obtained in Yao (2007) for local polynomial estimators of m(x) based on sparse functional data, but without uniform confidence bands. Nonparametric simultaneous confidence bands are a powerful tool of global inference for functions, see Claeskens and Van Keilegom (2003), Fan and Zhang (2000), Hall and Titterington (1988), H¨rdle (1989), a H¨rdle and Marron (1991), Huang, Wang, Yang, and Kravchenko (2008), Ma and Yang a (2010), Song and Yang (2009), Wang and Yang (2009), Wu and Zhao (2007), Zhao and 40 Wu (2008), and Zhou, Shen, and Wolfe (1998) for its theory and applications. The fact that a simultaneous confidence band has not been established for functional data analysis is certainly not due to lack of interesting applications, but to the greater technical difficulty in formulating such bands for functional data and establishing their theoretical properties. Specifically, the strong approximation results used to establish the asymptotic confidence level in nearly all published works on confidence bands, commonly known as “Hungarian embedding”, are unavailable for sparse functional data. In this chapter, we present simultaneous confidence bands for m(x) in sparse functional data via a piecewise-constant spline smoothing approach. While there exist a number of ( ) ′ such as kernels (Yao, M¨ller and, smoothing methods for estimating m(x) and G x, x u Wang (2005a); Yao, M¨ller, and Wang (2005b); Yao (2007)), penalized splines (Cardot, u Ferraty, and Sarda (2003); Cardot and Sarda (2005); Yao and Lee (2006)), wavelets Morris and Carroll (2006), and parametric splines James (2002), we choose B splines (Zhou, Huang, and Carroll (2008)) for simple implementation, fast computation and explicit expression, see Huang and Yang (2004), Wang and Yang (2007), and Xue and Yang (2006) for discussion of the relative merits of various smoothing methods. We organize this chapter as follows. In Section 3.2 we state our main results on confidence bands constructed from piecewise constant splines. In Section 3.3 we provide further insights into the error structure of spline estimators. Section 3.4 describes the actual steps to implement the confidence bands. Section 3.5 reports findings of a simulation study. An empirical example in Section 3.6 illustrates how to use the proposed confidence band for inference. Proofs of technical lemmas are in the Appendix. 41 3.2 Main Results For convenience, we denote the modulus of continuity of a continuous function r on [a, b] ( ) by ω (r, δ) = max ′ r(x) − r x′ . Denote the empirical L2 norm as x,x ∈[a,b], x−x′ ≤δ ( ) −1 ∑n ∑Ni g 2 X , where we denote the total sample size by N = ∥g∥2 = NT ij T i=1 j=1 2,NT ∑n i=1 Ni . Without loss of generality, we take the range of X, X = [a, b], to be [0, 1]. For any β ∈ (0, 1], we denote the collection of order β H˜lder continuous function on [0, 1] by o C 0,β [0, 1] =    ϕ : ∥ϕ∥0,β = sup ′ ,x,x′ ∈[0,1] x̸=x ( ) ϕ(x) − ϕ x′ x − x′ β   < +∞  , in which ∥ϕ∥0,β is the C 0,β -seminorm of ϕ. Let C [0, 1] be the collection of continuous function on [0, 1]. Clearly, C 0,β [0, 1] ⊂ C [0, 1] and, if ϕ ∈ C 0,β [0, 1], then ω (ϕ, δ) ≤ ∥ϕ∥0,β δ β . We use the piecewise constant spline functions introduced in Section 1.3 of Chapter 1 with the number of interior knots denoted as Ns . Let hs = 1/ (Ns + 1) be the distance between neighboring knots. The mean function m(x) is estimated by m(x) = argmin g∈G(−1) ∑n { ( )}2 i Y −g X . 
ij j=1 ij ∑N i=1 (3.4) The technical assumptions we need are as follows (A1) The regression function m(x) ∈ C 0,1 [0, 1]. (A2) The functions f (x), σ(x), and ϕk (x) ∈ C 0,β [0, 1] for some β ∈ (2/3, 1] with f (x) ∈ ] [ cf , Cf , σ(x) ∈ [cσ , Cσ ] , x ∈ [0, 1], for constants 0 < cf ≤ Cf < ∞, 0 < cσ ≤ Cσ < ∞. 42 (A3) The set of random variables ( ) ( ) Ni n is a subset of Ni ∞ consisting of indei=1 i=1 pendent variables Ni , the numbers of observations made for the i-th subject, i = 1, 2, ..., with Ni N , where N > 0 is a positive integer-valued random variable with E{N 2r } ≤ r!cr , r = 2, 3, ... for some constant cN > 0. The set of ranN ( )n,N ( )∞,∞ i dom variables Xij , Yij , εij is a subset of Xij , Yij , εij in which i=1,j=1 i=1,j=1 ( )∞,∞ are iid. The number κ of nonzero eigenvalues is finite and the ranXij , εij i=1,j=1 ( ) dom coefficients ξ ik , k = 1, ..., κ, i = 1, ..., ∞ are iid N (0, 1). The variables Ni ∞ , i=1 ( )∞,∞ ( )∞,∞ ( )∞,κ , εij ξ ik , Xij are independent. i=1,k=1 i=1,j=1 i=1,j=1 ( ) ϑ for some ϑ ∈ (1/3, 2β − 1) (A4) As n → ∞, the number of interior knots Ns = o n { } −1 = o n−1/3 (logn) −1/3 . The subinterval length h ∼ N −1 . while Ns s s (A5) There exists r > 2/ {β − (1 + ϑ) /2} such that E |ε11 |r < ∞. Assumptions (A1), (A2), (A4) and (A5) are similar to (A1)–(A4) in Wang and Yang (2009), with (A1) weaker than its counterpart. Assumption (A3) is the same as (A1.1), (A1.2), and (A5) in Yao, M¨ller, and Wang (2005b), without requiring joint normality of u the measurement errors εij . We use the B-spline basis of G(−1) , the space of piecewise constant splines, denoted as { } bJ (x) Ns for theoretical analysis. Define J=0 cJ,n = bJ 2 = 2 ∫ 1 0 bJ (x)f (x)dx, J = 0, ..., Ns , σ 2 (x) = var(Y |X = x ) = G (x, x) + σ 2 (x), ∀x ∈ [0, 1] , Y 43 (3.5)   2  κ  E {N (N − 1)} ∑ ∫ 1 1  σ 2 (x) = c−2 {nE(N1 )}−1 ϕk (u) f (u) du n J(x),n  EN1 χJ(x)  k=1  ∫  + σ 2 (u) f (u) du . (3.6)  χJ(x) Y In addition, define QNs +1 (α) = bNs +1 − a−1+1 log{−(1/2)log(1 − α)}, Ns ( ) 2 log 2πaN +1 s aNs +1 = {2log (Ns + 1)}1/2 , bNs +1 = aNs +1 − , 2aNs +1 (3.7) for any α ∈ (0, 1). We now state our main results. Theorem 3.1. Under Assumptions (A1)-(A5), for any α ∈ (0, 1) , { } lim P supx∈[0,1] |m(x) − m(x)| /σ n (x) ≤ QNs +1 (α) = 1 − α, n→∞ } { lim P |m(x) − m(x)| /σ n (x) ≤ Z1−α/2 = 1 − α, ∀x ∈ [0, 1], n→∞ where σ n (x) and QNs +1 (α) are given in (3.6) and (3.7), respectively, while Z1−α/2 is the 100 (1 − α/2)th percentile of the standard normal distribution. The definition of σ n (x) in (3.6) does not allow for practical use. The next proposition provides two data-driven alternatives Proposition 3.1. Under Assumptions (A2), (A3), and (A5), as n → ∞, { sup x∈[0,1] σ −1 (x)σ n,IID (x) − 1 + σ −1 (x)σ n,LONG (x) − 1 n n 44 } ) β = O hs , ( in which for x ∈ [0, 1], σ n,IID (x) ≡ σ Y (x) {f (x)hs nE(N1 )}−1/2 and { σ n,LONG (x) ≡ σ n,IID (x) 1 + E {N1 (N1 − 1)} G (x, x) f (x) hs EN1 σ 2 (x) Y }1/2 . Using σ n,IID (x) instead of σ n (x) means to treat the (Xij , Yij ) as iid data rather than as sparse longitudinal data, while using σ n,LONG (x) means to correctly account for the longitudinal correlation structure. The difference of the two approaches, although asymptotically negligible uniformly for x ∈ [0, 1] according to Proposition 3.1, is significant in finite samples, as shown in the simulation results of Section 3.5. For similar phenomenon with kernel smoothing, see Wang, Carroll, and Lin (2005). Corollary 3.1. 
Under Assumptions (A1)-(A5), for any α ∈ (0, 1), as n → ∞, an asymptotic 100 (1 − α) % simultaneous confidence band for m(x), x ∈ [0, 1] is m(x) ± σ n (x)QNs +1 (α) , while an asymptotic 100 (1 − α) % pointwise confidence interval for m(x), x ∈ [0, 1], is m(x)± σ n (x)Z1−α/2 . 3.3 Decomposition In this section, we decompose the estimation error m(x) − m(x) by the representation of Yij ( ) ( ) ( ) ∑ as the sum of m Xij , κ ξ ik ϕk Xij , and σ Xij εij . k=1 { } We introduce the rescaled B-spline basis BJ (x) Ns for G(−1) , which is BJ (x) ≡ J=0 45 bJ (x) bJ −1 , J = 0, ..., Ns . Therefore, 2 { }−1/2 BJ (x) ≡ bJ (x) cJ,n , J = 0, ..., Ns . (3.8) ⟨ ⟩ It is easily verified that BJ 2 = 1, J = 0, 1, . . . , Ns , BJ , B ′ ≡ 0, J ̸= J ′ . 2 J The definition of m(x) in (3.4) means that m(x) ≡ ∑Ns J=0 ′ λJ bJ (x), (3.9) }T { ′ ′ as solutions of the least squares problem with coefficients λ0 , ..., λNs { }T ( )}2 ∑n ∑N { ∑Ns ′ ′ i ={ argmin λ0 , ..., λNs Yij − λ b Xij . } i=1 j=1 J=0 J J Ns +1 λ0 ,...,λNs ∈R { }T ∑Ns Simple linear algebra shows m(x) ≡ J=0 λJ BJ (x), where the coefficients λ0 , ..., λNs are solutions of the least squares problem { }T ( )} 2 ∑n ∑N { ∑Ns i λ0 , ..., λNs ={ argmin Yij − λ B Xij . } i=1 j=1 J=0 J J λ0 ,...,λNs ∈RNs +1 (3.10) Projecting the relationship in model (3.3) onto the linear subspace of RNT spanned by ( )} { , we obtain the following crucial decomposition in BJ Xij 1≤j≤Ni ,1≤i≤n,0≤J≤Ns the space G(−1) of spline functions: m(x) = m(x) + e(x) = m(x) + ε(x) + 46 ∑κ ξ (x), k=1 k (3.11) m(x) = ∑Ns J=0 λJ BJ (x), ε(x) = ξ k (x) = ∑Ns ∑Ns a B (x), J=0 J J τ B (x). J=0 k,J J (3.12) }T { }T { The vectors {λ0 , ..., λNs }T , a0 , ..., aNs , and τ k,0 , ..., τ k,Ns are solutions to (3.10) ( ) ( ) ( ) with Yij replaced by m Xij , σ Xij εij , and ξ ik ϕk Xij , respectively. We cite next an important result concerning the function m(x). The first part is from de Boor (2001), p. 149, and the second is from Theorem 5.1 of Huang (2003). Theorem 3.2. There is an absolute constant Cg > 0 such that for every ϕ ∈ C [0, 1], there exists a function g ∈ G(−1) [0, 1] that satisfies ∥g − ϕ∥∞ ≤ Cg ω (ϕ, hs ). In particular, if β ϕ ∈ C 0,β [0, 1] for some β ∈ (0, 1], then ∥g − ϕ∥∞ ≤ Cg ∥ϕ∥0,β hs . Under Assumptions (A1) and (A4), with probability approaching 1, the function m(x) defined in (3.12) satisfies ∥m(x) − m(x)∥∞ = O (hs ) . The next proposition concerns the function e(x) given in (3.11). Proposition 3.2. Under Assumptions (A2)-(A5), for any τ ∈ R, and σ n (x), aNs +1 , and bNs +1 as given in (3.6) and (3.7), { } ( ) lim P supx∈[0,1] σ n (x)−1 e(x) ≤ τ /aNs +1 + bNs +1 = exp −2e−τ . n→∞ 3.4 Implementation In this section, we describe procedures to implement the confidence bands and intervals ( )N ,n from model (3.3), the spline given in Corollary 3.1. Given any data set Xij , Yij i j=1,i=1 estimator m(x) is obtained by (3.9), and the number of interior knots in (3.9) is taken 47 1/3 to be Ns = [cNT (logn)], in which [a] denotes the integer part of a and c is a positive constant. When constructing the confidence bands, one needs to evaluate the function σ 2 (x) n by estimating the unknown functions f (x), σ 2 (x), and G (x, x), and then plugging in these Y estimators: the same approach is taken in Wang and Yang (2009). The number of interior knots for pilot estimation of f (x), σ 2 (x), and G (x, x) is taken Y [ ] ( ) ∗ ∗ to be Ns = n1/3 , and h∗ = 1/ 1 + Ns . The histogram pilot estimator of the density s function f (x) is f (x) = { ∑n ∑N ∑n i b (Xij )}/{( N )h∗ }. 
i=1 j=1 J(x) i=1 i s {( )2 } T T Yij − m(Xij ) , the Defining the vector R ={Rij }1≤j≤N ,1≤i≤n = i 1≤j≤Ni ,1≤i≤n ∗ ∑Ns estimator of σ 2 (x) is σ 2 (x) = J=0 ρJ bJ (x), where the coefficients {ρ0 , ..., ρN ∗ }T are Y Y s solutions of the least squares problem: { }T ( )}2 ∑n ∑N { ∑N ∗ s ρ b X i ρ0 , ..., ρN ∗ ={ argmin Rij − . ij } J=0 J J i=1 j=1 s ρ0 ,...,ρN ∗ ∈RNs +1 s ) ( The pilot estimator of covariance function G x, x′ is ) ( G x, x′ = arg where C ′ = ijj min (−1) ⊗G(−1) g∈G { ( )}2 ∑N i Cijj ′ − g Xij , Xij ′ , i=1 j,j ′ =1,j̸=j ′ ∑n { ( )} { ( )} Yij − m Xij Y ′ −m X ′ , 1 ≤ j, j ′ ≤ Ni , 1 ≤ i ≤ n. The ij ij 48 { }−1/2 function σ n (x) is estimated by either σ n,IID (x) ≡ σ Y (x) f (x)hs NT or { σ n,LONG (x) ≡ σ n,IID (x) 1 + }1/2 ) G (x, x) N 2 /NT − 1 f (x)hs . i=1 i σ 2 (x) Y (∑ n We now state a result. That is easily proved by standard theory of kernel and spline smoothing, as in Wang and Yang (2009). Proposition 3.3. Under Assumptions (A1)-(A5), as n → ∞ { supx∈[0,1] σ n,IID (x)σ −1 (x) − 1 + σ n,LONG (x)σ −1 n,IID n,LONG (x) − 1 ( ) β −1/2 N −1 (logn)1/2 . = Oa.s. hs + n s } Proposition 3.1, about how σ n,IID (x) and σ n,LONG (x) uniformly approximate σ n (x), and Proposition 3.3 together imply that both σ n,IID (x) and σ n,LONG (x) approximate ) ( −1/2+1/3 (logn)1/2−1/3 , according to Assumpσ n (x) uniformly at a rate faster than n tion (A5). Therefore as n → ∞, the confidence bands m(x) ± σ n,IID (x)QNs +1 (α) , (3.13) m(x) ± σ n,LONG (x)QNs +1 (α) , (3.14) with QNs +1 (α) given in (3.7), and the pointwise intervals m(x)±σ n,IID (x)Z1−α/2 , m(x)± σ n,LONG (x)Z1−α/2 have asymptotic confidence level 1 − α. 49 3.5 Simulation To illustrate the finite-sample performance of the spline approach, we generated data from the model ( ) Yij = m Xij + ∑2 ( ) ξ ϕ Xij + σεij , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n, k=1 ik k with X ∼ Uniform[0, 1], ξ k ∼ Normal(0, 1), k = 1, 2, ε ∼ Normal(0, 1), Ni having a discrete uniform distribution from 25, · · · , 35, for 1 ≤ i ≤ n, and m(x) = sin {2π (x − 1/2)} , ϕ1 (x) = √ √ −2 cos {π (x − 1/2)} / 5, ϕ2 (x) = sin {π (x − 1/2)} / 5, thus λ1 = 2/5, λ2 = 1/10. The noise levels were σ = 0.5, 1.0, the number of subjects n was taken to be 20, 50, 100, 200, the confidence levels were 1 − α = 0.95, 0.99, and the constant c in the definition of Ns in Section 3.4 was taken to be 1, 2, 3. We found that the confidence band (3.13) did not have good coverage rates for moderate sample sizes, and hence in Table 3.1 we report the coverage as the percentage out of the total 200 replications for which the true curve was covered by (3.14) at the 101 points {k/100, k = 0, . . . , 100}. At all noise levels, the coverage percentages for the confidence band (3.14) are very close to the nominal confidence levels 0.95 and 0.99 for c = 1, 2, but decline for c = 3 when n = 20, 50. The coverage percentages thus depend on the choice of Ns , and the dependency becomes stronger when sample sizes decrease. For large sample sizes n = 100, 200, the effect of the choice of Ns on the coverage percentages is insignificant. Because Ns varies with Ni , for 1 ≤ i ≤ n, the data-driven selection of some ”optimal” Ns remains an open problem. 
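To make the implementation of Sections 3.4-3.5 concrete, the following minimal sketch (in Python, with all variable and function names ours) simulates sparse functional data from the design above with σ = 0.5 and assembles the confidence band (3.14). Two simplifications are assumed: the pilot estimators of f, σ²_Y and G(x, x) are computed on the same Ns + 1 subintervals as m̂ rather than on a separate pilot grid with N*s = [n^{1/3}] knots, and Ĝ(x, x) is approximated by within-subject cross-products of residuals inside each subinterval instead of the full tensor-product spline fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate sparse functional data as in Section 3.5 (sigma = 0.5)
n = 50
m_true = lambda x: np.sin(2 * np.pi * (x - 0.5))
phi1 = lambda x: -2 * np.cos(np.pi * (x - 0.5)) / np.sqrt(5)
phi2 = lambda x: np.sin(np.pi * (x - 0.5)) / np.sqrt(5)
sigma = 0.5

X, Y, subj = [], [], []
for i in range(n):
    Ni = rng.integers(25, 36)                     # N_i uniform on {25, ..., 35}
    x = rng.uniform(0, 1, Ni)
    xi1, xi2 = rng.standard_normal(2)             # random coefficients xi_{i1}, xi_{i2}
    y = m_true(x) + xi1 * phi1(x) + xi2 * phi2(x) + sigma * rng.standard_normal(Ni)
    X.append(x); Y.append(y); subj.append(np.full(Ni, i, dtype=int))
X, Y, subj = map(np.concatenate, (X, Y, subj))
NT = len(Y)

# piecewise-constant spline fit (3.9): bin means over the Ns + 1 equal subintervals
Ns = int(NT ** (1 / 3) * np.log(n))               # Ns = [c * NT^{1/3} * log n] with c = 1
hs = 1.0 / (Ns + 1)
J = np.minimum((X / hs).astype(int), Ns)          # subinterval index J(x)
m_hat = np.array([Y[J == j].mean() for j in range(Ns + 1)])

# pilot estimates of f, sigma_Y^2 and G(x, x) on the same subintervals (simplified)
f_hat = np.bincount(J, minlength=Ns + 1) / (NT * hs)
resid = Y - m_hat[J]
s2Y_hat = np.array([np.mean(resid[J == j] ** 2) for j in range(Ns + 1)])
G_hat = np.empty(Ns + 1)
for j in range(Ns + 1):
    prods = []
    for i in range(n):
        r = resid[(subj == i) & (J == j)]
        if len(r) > 1:                            # within-subject cross-products, j != j'
            prods.append((r.sum() ** 2 - (r ** 2).sum()) / (len(r) * (len(r) - 1)))
    G_hat[j] = np.mean(prods) if prods else 0.0
G_hat = np.maximum(G_hat, 0.0)                    # G(x, x) is a variance, truncate at zero

# sigma_{n,LONG}(x) and the extreme-value quantile Q_{Ns+1}(alpha) of (3.7)
Ni_all = np.bincount(subj)
sig_iid = np.sqrt(s2Y_hat / (f_hat * hs * NT))
sig_long = sig_iid * np.sqrt(1 + ((Ni_all ** 2).sum() / NT - 1) * G_hat * f_hat * hs / s2Y_hat)
alpha = 0.05
a = np.sqrt(2 * np.log(Ns + 1))
b = a - np.log(2 * np.pi * a ** 2) / (2 * a)
Q = b - np.log(-0.5 * np.log(1 - alpha)) / a

lower, upper = m_hat - sig_long * Q, m_hat + sig_long * Q
mid = (np.arange(Ns + 1) + 0.5) * hs              # midpoints of the subintervals
print("Ns =", Ns, " band covers the true mean on this run:",
      bool(np.all((m_true(mid) >= lower) & (m_true(mid) <= upper))))
```

Repeating the last step over independent replications and recording the coverage indicator yields uniform coverage frequencies of the kind reported in Table 3.1.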
We next examine two alternative methods to compute the confidence band, based on the observation that the estimated mean function m(x) and the confidence intervals are 50 Table 3.1: Uniform coverage rates in Chapter 3 σ n 20 0.5 50 100 200 20 1.0 50 100 200 1−α 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 c=1 0.920 0.990 0.960 0.995 0.955 1.000 0.950 0.985 0.935 0.990 0.975 0.995 0.950 0.995 0.940 0.985 c=2 0.930 0.990 0.965 0.995 0.955 1.000 0.965 0.985 0.930 0.990 0.960 0.995 0.940 0.990 0.965 0.995 c=3 0.800 0.900 0.910 0.965 0.955 0.985 0.975 0.990 0.735 0.870 0.895 0.980 0.935 0.990 0.960 0.995 Note: uniform coverage rates from 200 replications using the confidence band (3.14). For each sample size n, the first row is the coverage of a nominal 95% confidence band, while the second row is for a 99% confidence band. step functions that remain the same on each subinterval χJ , 0 ≤ J ≤ Ns . Follwing an associate editor’s suggestion, locally weighted smoothing was applied to the upper and lower confidence limits to generate a smoothed confidence band. Following a referee’s suggestion to treat the number (Ns + 1) of subintervals as fixed instead of growing to infinity, a naive parametric confidence band was computed as m(x) ± σ n,LONG (x)Q1−α.Ns +1 (3.15) } is the (1 − α) quantile of the maximal in which Q1−α.Ns +1 = Z{ 1+(1−α)1/(Ns +1) /2 absolute values of (Ns + 1) iid N (0, 1) random variables. We compare the performance of 51 the confidence band in (3.14), the smoothed band and naive parametric band in (3.15). Given n = 20 with Ns = 8, 12, and n = 50 Ns = 44 (by taking c = 1 in the definition of Ns in Section 3.4), σ = 0.5, 1.0, and 1 − α = 0.99, Table 3.2 reports the coverage percentages P , Pnaive , Psmooth and the average maximal widths W, Wnaive , Wsmooth of Ns + 1 intervals out of 200 replications calculated from confidence bands (3.14), (3.15), and the smoothed confidence bands, respectively. Table 3.2: Uniform coverage rates and average maximal widths in Chapter 3 n σ 20 0.5 1.0 50 0.5 1.0 Ns 8 12 8 12 44 44 P 0.820 0.930 0.910 0.960 0.990 0.990 Pnaive 0.505 0.765 0.655 0.820 0.960 0.975 Psmooth 0.910 0.955 0.970 0.985 0.990 1.000 W 1.490 1.644 1.725 1.937 1.651 2.054 Wnaive 1.210 1.363 1.401 1.606 1.522 1.893 Wsmooth 1.480 1.628 1.721 1.928 1.609 2.016 Note: uniform coverage rates and average maximal widths of confidence intervals from 200 replications using the confidence bands (3.14), (3.15), and the smoothed bands respectively, for 1 − α = 0.99. In all experiments, one has Psmooth > P > Pnaive and W > Wsmooth > Wnaive . The coverage percentages for both the confidence bands in (3.14) and the smoothed bands are much closer to the nominal level than those of the naive bands in (3.15), while the smoothed bands perform slightly better than the constant spline bands in (3.14), with coverage percentages closer to the nominal and smaller widths. Based on these observations, the naive band is not recommended due to poor coverage. As for the smoothed band, although it has slightly better coverage than the constant spline band, its asymptotic property has yet to be established, and the second step smoothing adds to its conceptual complexity and computational burden. Therefore with everything considered, the constant spline band is recommended for 52 its satisfactory theoretical property, fast computing, and conceptual simplicity. 
For visualization of the actual function estimates, at σ = 0.5 with n = 20, 50, Figures 3.1 and 3.2 depict the simulated data points and the true curve, and Figures 3.3, 3.4, 3.5 and 3.6 show the true curve, the estimated curve, the uniform confidence band, and the pointwise confidence intervals.

Figure 3.1: Plots of simulated data for n = 20 in Chapter 3. Note: plots of simulated data scatter points at σ = 0.5, n = 20, and the true curve.

Figure 3.2: Plots of simulated data for n = 50 in Chapter 3. Note: plots of simulated data scatter points at σ = 0.5, n = 50, and the true curve.

3.6 Empirical Example

In this section, we apply the confidence band procedure of Section 3.4 to data collected in a study by the AIDS Clinical Trials Group, ACTG 315 (Zhou, Huang, and Carroll (2008)). In this study, 46 HIV-1 infected patients were treated with potent antiviral therapy consisting of ritonavir, 3TC and AZT. After initiation of the treatment on day 0, patients were followed for up to 10 visits. Scheduled visit times common to all patients were 7, 14, 21, 28, 35, 42, 56, 70, 84, and 168 days. Since the patients did not follow the scheduled times exactly and/or missed some visits, the actual visit times Tij were irregularly spaced and varied from day 0 to day 196. The CD4+ cell counts during HIV/AIDS treatment are taken as the response variable Y from day 0 to day 196. Figure 3.7 shows that the data points (dots) are extremely sparse between day 100 and day 150, so we first transform the data by Xij = Tij^{1/3}. A histogram (not shown) indicates that the Xij values are distributed fairly uniformly. The number of interior knots in (3.9) is taken to be Ns = 6, so that the range of the visit time T, which is [0, 196], is divided into seven unequal subintervals; on each subinterval, the estimated mean CD4+ cell count and the confidence limits are constant. Table 3.3 gives the mean CD4+ cell counts and the confidence limits on each subinterval at simultaneous confidence level 0.95. For instance, from day 4 to day 14, the mean CD4+ cell count is 241.62, with lower and upper limits 171.81 and 311.43, respectively.

Table 3.3: Confidence limits for CD4 data set

  Days         Mean CD4+ cell counts   Confidence limits
  [0, 1)       178.23                  [106.73, 249.72]
  [1, 4)       200.32                  [130.51, 270.13]
  [4, 15)      241.62                  [171.81, 311.43]
  [15, 36)     271.87                  [194.70, 349.04]
  [36, 71)     299.51                  [222.34, 376.68]
  [71, 123)    280.78                  [203.50, 358.06]
  [123, 196]   299.27                  [221.99, 376.55]

Note: the mean CD4+ cell counts and the confidence limits on each subinterval at simultaneous confidence level 0.95.

Figure 3.7 depicts the 95% simultaneous (smoothed) confidence band according to (3.14) in medium (thin) lines, and Figure 3.8 depicts the pointwise 95% confidence intervals in thin lines. The center thick line is the piecewise-constant spline fit m̂(x). As expected, the pointwise confidence intervals are narrower than the uniform confidence band, and by the same constant factor at every x. Figure 3.7 is essentially a graphical representation of Table 3.3; both confirm that the mean CD4+ cell count generally increases over time, as Zhou, Huang, and Carroll (2008) pointed out. The advantage of the current method is that such inference on the overall trend is made with a predetermined type I error probability, in this case 0.05.

3.7 Discussion

In this chapter, we have constructed a simultaneous confidence band for the mean function m(x) for sparse longitudinal data via piecewise-constant spline fitting.
Our approach extends the asymptotic results in Wang and Yang (2009) for i.i.d. random designs to a much more complicated data structure by allowing dependence of measurements within each subject. 56 3 2 1 0 -2 -1 Y 0 0.5 X 1 Figure 3.3: Plots of confidence bands at 1 − α = 0.95, n = 20 in Chapter 3 Note: plots of confidence bands (3.14) (upper and lower solid lines), pointwise confidence intervals (upper and lower dashed lines), the spline estimator (middle thin line), and the true function (middle thick line) at 1 − α = 0.95, n = 20. 57 3 2 1 0 -2 -1 Y 0 0.5 X 1 Figure 3.4: Plots of confidence bands at 1 − α = 0.95, n = 50 in Chapter 3 Note: plots of confidence bands (3.14) (upper and lower solid lines), pointwise confidence intervals (upper and lower dashed lines), the spline estimator (middle thin line), and the true function (middle thick line) at 1 − α = 0.95, n = 50. 58 3 2 1 0 -2 -1 Y 0 0.5 X 1 Figure 3.5: Plots of confidence bands at 1 − α = 0.99, n = 20 in Chapter 3 Note: plots of confidence bands (3.14) (upper and lower solid lines), pointwise confidence intervals (upper and lower dashed lines), the spline estimator (middle thin line), and the true function (middle thick line) at 1 − α = 0.99, n = 20. 59 3 2 1 0 -2 -1 Y 0 0.5 X 1 Figure 3.6: Plots of confidence bands at 1 − α = 0.99, n = 50 in Chapter 3 Note: plots of confidence bands (3.14) (upper and lower solid lines), pointwise confidence intervals (upper and lower dashed lines), the spline estimator (middle thin line), and the true function (middle thick line) at 1 − α = 0.99, n = 50. 60 4 3 1 2 Y*E2 0 50 100 T 150 200 Figure 3.7: Plots of confidence bands for CD4 data Note: plots of the piecewise-constant spline estimator (thick line), the data (dots), confidence band (3.14) (upper and lower solid lines), and the smoothed band (upper and lower thin lines) at confidence level 0.95. 61 4 3 1 2 Y*E2 0 50 100 T 150 200 Figure 3.8: Plots of confidence intervals for CD4 data Note: plots of the piecewise-constant spline estimator (thick line), the data (dots), pointwise confidence intervals (upper and lower thin lines), and the smoothed band (upper and lower thin lines) at confidence level 0.95. 62 The proposed estimator has good asymptotic behavior, and the confidence band had coverage very close to the nominal in our simulation study. An empirical study for the mean CD4+ cell counts illustrates the practical use of the confidence band. Clearly the simultaneous confidence band in (3.14) can be improved in terms of both theoretical and numerical performance if higher order spline or local linear estimators are used. Constant piecewise spline estimators are less appealing and have sub-optimal convergence rates in the sense of Hall, M¨ller, and Wang (2006), which uses local linear approaches. u Establishing the asymptotic confidence level for such extensions, however, requires highly sophisticated extreme value theory, for sequences of non-stationary Gaussian processes over intervals growing to infinity. That is much more difficult than the proofs of this chapter. We consider the confidence band in (3.14) significant because it is the first of its kind for the longitudinal case with complete theoretical justification, and with satisfactory numerical performance for commonly encountered data sizes. Our methodology can be applied to construct simultaneous confidence bands for other ( ) functional objects, such as the covariance function G x, x′ and its eigenfunctions, see Yao (2007). 
It can also be adapted to the estimation of regression functions in the functional linear model, as in Li and Hsing (2007). We expect further research along these lines to yield deep theoretical results with interesting applications. 3.8 Appendix Throughout this section, an ∼ bn means lim bn /an = c, where c is some nonzero constant, n→∞ and for functions an (x), bn (x), an (x) = u {bn (x)} means an (x)/bn (x) → 0 as n → ∞ uniformly for x ∈ [0, 1]. 63 3.8.1 A.1. Preliminaries We first state some results on strong approximation, extreme value theory and the classic Bernstein inequality. These are used in the proofs of Lemma 3.7, Theorem 3.1, and Lemma 3.6. Lemma 3.1. (Theorem 2.6.7 of Cs˝rg˝ and R´v´sz (1981)) Suppose that ξ i , 1 ≤ i ≤ n o o e e are iid with E(ξ 1 ) = 0, E(ξ 2 ) = 1, and H(x) > 0 (x ≥ 0) is an increasing continuous 1 function such that x−2−γ H(x) is increasing for some γ > 0 and x−1 logH(x) is decreasing with EH (|ξ 1 |) < ∞. Then there exists a Wiener process {W (t) , 0 ≤ t < ∞} that is a Borel function of ξ i , 1 ≤ i ≤ n, and constants C1 , C2 , a > 0 which depend only on the distribution of ξ 1 , such that for any {xn }∞ satisfying H −1 (n) < xn < C1 (nlogn)1/2 n=1 ∑k and Sk = i=1 ξ i , { P } max Sk − W (k) > xn 1≤k≤n ≤ C2 n {H (axn )}−1 . (n) (n) Lemma 3.2. Let ξ i , 1 ≤ i ≤ n, be jointly normal with ξ i ∼ N (0, 1). Let (n) (n) (n) Eξ i ξ j be such that for γ > 0, Cr > 0, rij < Cr /nγ , i ̸= j. Then for τ ∈ R, { { } ( ) (n) −τ , in which M ∞, P Mn,ξ ≤ τ /an + bn → exp −2e n,ξ = max ξ 1 , . . . , (n) rij = as n → } (n) ξn and an , bn are as in (3.7) with Ns + 1 replaced by n. Proof. Let { }n { } { } η i i=1 be i.i.d. standard normal r.v.’s, u = ui n , v = vi n i=1 i=1 be vectors of real numbers, and w = min (|u1 | , . . . , |un | , |v1 | , . . . , |vn |). By the Normal Comparison Lemma (Leadbetter, Lindgren and Rootz´n (1983), Lemma 11.1.2), e { P } { } (n) −vj < ξ j ≤ uj for j = 1, . . . , n − P −vj < η j ≤ uj for j = 1, . . . , n 64 ≤ 4 ∑ (n) r 1≤i (2 − ε) logn, for any ε > 0 and large n. Since 1 − rij ≥ 1 − (Cr /nγ )2 → 1 n (n)2 (n) as n → ∞, i ̸= j, for i ̸= j, ∃Cr2 > 0 such that 1 − rij ≥ Cr2 > 0 and 1 + rij < 1 + ϵ for any ϵ > 0 and large n. Let Mn,η = max {|η 1 | , . . . , |η n |}. By Leadbetter, Lindgren and ( ) { } Rootz´n (1983), Theorem 1.5.3, P Mn,η ≤ τ n → exp −2e−τ as n → ∞, while the e above results entail ( ( ) ) 2 −1/2   ( ) 4 ∑ (n) (n)  −w2  P Mn,ξ ≤ τ n − P Mn,η ≤ τ n ≤ rij 1 − rij exp   (n) 2π 1 + rij i 0 such that for r = 3, 4, ..., E |ξ 1 |r ≤ cr−2 r!Eξ 2 < +∞. 1 1 ( ) √ √ Then for each n > 1, t > 0, P (|Sn | ≥ nσt) ≤ 2 exp −t2 (4 + 2ct/ nσ)−1 , in which Sn = ∑n i=1 ξ i . Lemma 3.4. Under Assumption (A2), as n → ∞ for cJ,n defined in (3.5), ) ⟨ ⟩ ( ) ( cJ,n = f tJ hs 1 + rJ,n , bJ , b ′ ≡ 0, J ̸= J ′ , where max0≤J≤Ns rJ,n ≤ Cω (f, hs ). J ( )}r { 1−r/2 1−r/2 There exist constants CB > cB > 0 such that cB hs ≤ CB hs ≤ E BJ Xij for r = 1, 2, . . . and 1 ≤ J ≤ Ns + 1, 1 ≤ j ≤ Ni , 1 ≤ i ≤ n. 65 Proof. By the definition of cJ,n in (3.5), ∫ cJ,n = ∫ ∫ ( ) { ( )} bJ (x)f (x)dx = [ ] f (x)dx = f tJ hs + [ ] f (x) − f tJ dx. tJ ,tJ+1 tJ ,tJ+1 ( ) ( ) ∫ ] f (x) − f t Hence for all J = 0, ..., Ns , cJ,n − f tJ hs ≤ [ J dx ≤ ω (f, hs ) hs , tJ ,tJ+1 { ( ) } ( ) or rJ,n = cJ,n − f tJ hs f tJ hs −1 ≤ Cω (f, hs ) , J = 0, ..., Ns . By (3.8), { ( ( )}r ( )−r/2 ∫ )1−r/2 1−r/2 E BJ Xij = cJ,n bJ (x)f (x)dx = cJ,n ∼ hs . Proof of Proposition 3.1. 
By Lemma 3.4 and Assumption (A2) on the continuity of functions ϕ2 (x), σ 2 (x) and f (x) on [0, 1], for any x ∈ [0, 1] k ∫ ∫ χJ(x) ϕk (x)f (x)du − ∫ J(x) χJ(x) ( ) ( ) 1+β ϕk (u) f (u) du ≤ ω ϕk f, hs hs = O hs , { } ( ) ( ) 1+β σ 2 (x)f (x) − σ 2 (u) f (u) du ≤ ω σ 2 f, hs hs = O hs . Y Y Y Hence, σ 2 (x) = c−2 (nEN1 )−1 n J(x),n ∫ σ 2 (u) f (u) du× Y J(x)   2 { }−1  ∫ ∫ κ  E {N1 (N1 − 1)} ∑  2 (u) f (u) du  1+ ϕk (u) f (u) du σ   EN1 χJ(x) J(x) Y   k=1    )} { ( )} { ( 1+β 1+β −2 {1+ (nEN1 )−1 σ 2 (x)f (x)hs + U hs = f (x)hs + U hs Y  κ )}2 { )}−1  ( ( ∑{ E {N1 (N1 − 1)} 1+β 1+β σ 2 (x)f (x)hs + U hs ϕk (x)f (x)hs + U hs Y  EN1 k=1 66 { E {N1 (N1 − 1)} = (f (x)hs nEN1 )−1 σ 2 (x) 1 + Y EN1 } ∑κ ( )} ϕ2 (x)f (x)hs { β k=1 k 1 + U hs σ 2 (x) Y { ( )} { ( )} β β = σ2 (x) 1 + U hs = σ2 (x) 1 + U hs . n,LONG n,IID A.2. Proof of Theorem 1 −1/2 Note that BJ(x) (x) ≡ c , x ∈ [0, 1], so the terms ξ k (x) and ε(x) defined in (3.12) J(x),n are NS ∑ n Ni ( ) ( ) −2 ∑ ∑ B X −1 ξ k (x) = NT BJ (x) BJ 2,N ϕk Xij ξ ik ij J T i=1 j=1 J=0 n Ni ( ) ( ) ∑∑ −2 −1/2 −1 BJ(x) Xij ϕk Xij ξ ik , =c BJ(x) N J(x),n 2,NT T i=1 j=1 n Ni ( ) ( ) ∑∑ −2 −1/2 −1 BJ(x) Xij σ Xij εij . ε(x) = c BJ(x) NT J(x),n 2,NT i=1 j=1 Let 2 −1/2 −1 ∑n ξ k (x) = BJ(x) ξ k (x) = c N R ξ , J(x),n T i=1 ik,ξ,J(x) ik 2,NT 2 −1/2 −1 ∑n ∑Ni ε(x) = BJ(x) ε(x) = c N R ε , J(x),n T i=1 j=1 ij,ε,J(x) ij 2,NT (3.16) where ( ) ( ) ) ( ) ( ∑N i B X Rik,ξ,J = ij ϕk Xij , Rij,ε,J = BJ Xij σ Xij , 0 ≤ J ≤ Ns . j=1 J (3.17) 67 Lemma 3.5. Under Assumption (A3), for e(x) given in (3.11) and ξ k (x), ε(x) given in (3.16), we have e(x) −  κ ∑  k=1   ξ k (x) + ε(x)  ≤ An (1 − An )−1 κ ∑ ξ k (x) + ε(x) , x ∈ [0, 1], k=1 where An = sup0≤J≤N BJ 2 2,NT − 1 . There exists CA > 0, such that for large n, S ( ) ) (√ √ P An ≥ CA log (n) / (nhs ) ≤ 2n−3 . An = Oa.s. log (n) / (nhs ) as n → ∞. See the Supplement of Wang and Yang (2009) for a detailed proof. Lemma 3.6. Under Assumptions (A2) and (A3), for R1k,ξ,J , R11,ε,J in (3.17), [ ∫ −1 E (N ) b (u) ϕ2 (u) f (u) du+ 2 ER1k,ξ,J = cJ,n 1 J k )2 ] (∫ , E {N1 (N1 − 1)} bJ (u) ϕk (u) f (u) du 2 ER11,ε,J = c−1 J,n ∫ bJ (u) σ 2 (u) f (u) du, 0 ≤ J ≤ Ns , ] [ 2 2 there exist 0 < cR < CR < ∞, such that ER1k,ξ,J , ER11,ε,J ∈ cR , CR for 0 ≤ J ≤ (√ ) ∑ 2 2 Ns , sup0≤J≤Ns n−1 n Rik,ξ,J − ER1k,ξ,J = Oa.s. logn/ (nhs ) , 1 ≤ k ≤ κ, i=1 (√ ) ∑n ∑Ni −1 2 2 NT logn/ (nhs ) as n → ∞. sup Rij,ε,J − ER11,ε,J = Oa.s. i=1 j=1 0≤J≤Ns 68 Proof. By independence of X1j , 1 ≤ j ≤ N1 and N1 and (3.8), {∑ { ( ) ( ) ( ) ( ) }} N1 E BJ X1j BJ X ′ ϕk X1j ϕk X ′ |N1 1j 1j j,j ′ =1 {∑ } { ( ) ( ) } N1 2 X 2 X =E E BJ + 1j ϕk 1j |N1 j=1 {∑ { ( ) ( ) ( ) ( ) }} N1 E E BJ X1j BJ X ′ ϕk X1j ϕk X ′ |N1 1j 1j j̸=j ′ { ∫ = c−1 E (N1 ) bJ (u) ϕ2 (u) f (u) du+ k J(x),n 2 ER1k,ξ,J = E )2 } (∫ E {N1 (N1 − 1)} bJ (u) ϕk (u) f (u) du . 2 It is easily shown that ∃ 0 < cR < CR < ∞ such that cR ≤ ER1k,ξ,J ≤ CR , 0 ≤ J ≤ Ns . ( ) 2 ∗ =ζ Let ζ i,J = ζ i,k,J = Rik,ξ,J , ζ i,J i,J − E ζ 1,J for r ≥ 1 and large n, {∑ {∑ ( ) ( )}2r ( )}2r Ni Ni 2r E E ζ i,J =E B Xij ϕk Xij ≤ Cϕ B Xij j=1 J j=1 J   ν 1 +···+ν N =2r (  Ni  )∏ {   ( )}ν  ∑ i 2r j 2r E = Cϕ E BJ Xij    0≤ν ···ν ≤2r ν 1 · · · ν Ni j=1    1 Ni ( )r   ( ) { ( )}ν  j 2r EN 2r C h1−r 2r E N 2r max ≤ Cϕ E BJ Xij ≤ Cϕ B s 1   1   j=1      ∏  Ni 1−r = C r!h1−r , 2r ≤ Cϕ CB cr r!hs ζ s N {∑ )r ( )}2r } ( { Ni ≥ c2r E ≥ c2r (EN1 ) cB h1−r , E ζ i,J E BJ Xij s ϕ ϕ j=1 by Lemma 3.4. 
So )}r )}r { ( )r { ( ( ′ ∼ 1, E ζ i,J for r ≥ 2, and ∃Cζ > E ζ 1,J ≫ E ζ 1,J 69 { ( ) }1/2 ′ > 0 such that C ′ h−1 ≥ σ 2 ≥ c′ h−1 , for σ ∗ = E ζ ∗ 2 cζ . We obtain ζ i,J ζ s ζ s ζ∗ )2 ( ) 1 { }n ( r with c∗ = Cζ /c′ r−2 h−1 , which implies that ζ ∗ E ζ∗ ≤ cr−2 r!E ζ ∗ s ∗ i,J i,J i,J i=1 ζ ∑n satisfies Cram´r’s condition. Applying Lemma 3.3 to i=1 ζ ∗ , for r > 2 and any large e i,J { } √ ∑n ∗ enough δ > 0, P n−1 i=1 ζ i,J ≥ δ logn/ (nhs ) is bounded by      ( )−1      ′ −δ 2 Cζ (logn) 2 exp ( ) 1 ( )     4 + 2 C /c′ r−2 δ c′ −1 h1/2 (logn)1/2 n−1/2     s ζ ζ ζ    −δ 2 (logn)  ≤ 2n−3 . ≤ 2 exp ′   4C ζ    n ∞  ∑ ∑ 2Ns √ 1 2 2 Rik,ξ,J − ER1k,ξ,J ≥ δ logn/ (nhs ) ≤ Hence P sup < ∞. 0≤J≤N n  n3 s i=1 n=1 n=1 ) (√ ∑ 2 2 Thus, sup0≤J≤Ns n−1 n Rik,ξ,J − ER1k,ξ,J = Oa.s. logn/ (nhs ) as n → ∞ i=1 ∞ ∑ by Borel-Cantelli Lemma. The properties of Rij,ε,J are obtained similarly. ), Order all Xij , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n from large to small as X(t) , X(1) ≥ · · · ≥ X( NT and denote the εij corresponding to X(t) as ε(t) . By (3.16), ( ) ( ) ∑N Tb X(t) σ X(t) ε(t) t=1 J(x) ( ) ( ){ ∑N } −1 −1 Tb = c NT X(t) σ X(t) St − St−1 , J(x),n t=1 J(x) ε(x) = c−1 N −1 J(x),n T where Sq = ∑q t=1 ε(t) , q ≥ 1 and S0 = 0. Lemma 3.7. Under Assumptions (A2)-(A5), there is a Wiener process {W (t) , 0 ≤ t < ∞} { } independent of Ni , Xij , 1 ≤ j ≤ Ni , ξ ik , 1 ≤ k ≤ κ, 1 ≤ i ≤ n , such that as n → ∞, 70 ( ) sup ε(0) (x) − ε(x) = oa.s. nt for some t < − (1 − ϑ) /2 < 0, where ε(0) (x) is x∈[0,1] ( cJ(x),n NT ) ) ( ( )−1 ∑N Tb X(t) σ X(t) {W (t) − W (t − 1)} , x ∈ [0, 1]. t=1 J(x) (3.18) Proof. Define MN = max1≤q≤N Sq − W (q) , in which {W (t) , 0 ≤ t < ∞} is the T T { } Wiener process in Lemma 3.1 that a Borel function of the set of variables ε(t) 1 ≤ t ≤ NT { } { } is independent of Ni , Xij , 1 ≤ j ≤ Ni , ξ ik , 1 ≤ k ≤ κ, 1 ≤ i ≤ n since ε(t) 1 ≤ t ≤ NT is. Further, supx∈[0,1] ε(0) (x) − ε(x) equals to ( ) ( ){ } ( ) −1 −1 b ( ) σ X( ) supx∈[0,1] c NT X W N T − SN J(x) NT NT J(x),n T ( ) ( ) ( ) ( )} ∑N −1 { T + bJ(x) X(t) σ X(t) − bJ(x) X(t+1) σ X(t+1) {W (t) − St } t=1 { ( ) ( ) −1 N −1 b ( ) σ X( ) + ≤ max c J X N NT T 0≤J≤Ns +1 Jn T ( ) ( ) ( ) ( )} ∑N −1 T bJ X(t) σ X(t) − bJ X(t+1) σ X(t+1) MN t=1 T ≤    )     ( ) ( ∑ −1 3Cσ + σ X(t) − σ X(t+1) max c−1 NT MN  T 0≤J≤Ns +1 J,n     1≤t≤NT −1,X(t) ∈bJ which, by the H¨lder continuity of σ in Assumption (A2), is bounded by o        ∑ β −1 3C + ∥σ∥ −1 M max c X(t) − X(t+1) ≤ NT σ NT 0,β  0≤J≤Ns +1 J,n      1≤t≤NT −1,X(t) ∈bJ  β          ∑    1−β  −1 3C + ∥σ∥ −1 M X(t) − X(t+1)  NT NT max cJ,n  σ 0,β nJ    J     1≤t≤NT −1,X(t) ∈bJ   71 −1 ≤ NT MN T ( )  β max c−1 J,n 3Cσ + ∥σ∥0,β hs 0≤J≤Ns +1 )1−β   ( max nJ 0≤J≤Ns +1  ) ∑NT ( where nJ = t=1 I X(t) ∈ χJ , 0 ≤ J ≤ Ns + 1, has a binomial distribution with ( ) ∫ parameters NT , pJ,n , where pJ,n = χ f (x)dx. Simple application of Lemma 3.3 entails ( )J −1 . Meanwhile, by letting H(x) = xr , x = nt′ , max0≤J≤Ns +1 nJ = Oa.s. NT Ns n t′ ∈ (2/r, β − (1 + ϑ) /2), the existence of which is due to the Assumption (A4) that r > { }N 2/ {β − (1 + ϑ) /2}. It is clear that ε(t) T satisfies the conditions in Lemma 3.1. Since t=1 ( ) ′ n = a−r n1−rt = O n−γ 1 for some γ 1 > 1, one can use the probability inequality H(axn ) ( ) ′ in Lemma 3.1 and the Borel-Cantelli Lemma to obtain MN = Oa.s. (xn ) = Oa.s. nt . T Hence Lemma 3.4 and the above imply sup x∈[0,1] } ( ){ ( ) −β −1 1−β (0) (x) − ε(x) = O t′ −1 ε NT Ns 1 + Ns a.s. 
Ns n ( ) t′ −1 + N nt′ −1 × N −1 n1−β = Oa.s. Ns n s s ( ) ( ) t′ −1 + N nt′ −β = o t′ −β+ϑ = Oa.s. Ns n s a.s. n since t′ < β − (1 + ϑ) /2 by definition, implying t′ − 1 ≤ t′ − β < − (1 + ϑ) /2. The Lemma follows by setting t = t′ − β + ϑ. Now ( ) ( ) ∑N −1 Tb X(t) σ X(t) Z(t) ε(0) (x) = c−1 NT J(x),n t=1 J(x) ( ) ( ) ∑n ∑N −1 −1 i b =c N Xij σ Xij Zij , J(x),n T i=1 j=1 J(x) 72 (3.19) where Z(t) = W (t) − W (t − 1) , 1 ≤ t ≤ NT , are i.i.d N (0, 1), ξ ik , Zij , Xij , Ni are independent, for 1 ≤ k ≤ κ, 1 ≤ j ≤ Ni , 1 ≤ i ≤ n, and ξ k (x), ε(0) (x) are conditional independent of Xij , Ni , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n. If the conditional variances of ξ k (x), ε(0) (x) ( ) on Xij , Ni are σ 2 ,n (x), σ 2 (x), we have ε,n ξk 1≤j≤Ni ,1≤i≤n { c−1 N −2 J(x),n T ∑n R2 i=1 ik,ξ,J(x) }1/2 σ ξ ,n (x) = k { }1/2 ∑n ∑N −1 −2 2 i R σ ε,n (x) = c N , J(x),n T i=1 j=1 ij,ε,J(x) (3.20) where Rik,ξ,J(x) , Rij,ε,J(x) , and cJ(x),n are given in (3.17) and (3.5). Lemma 3.8. Under Assumptions (A2) and (A3), let  −1/2  κ ∑   (0) (x) , 2 (x) + σ 2 (x) ξ k (x) + ε η(x) = σ ξ ,n ε,n     k k=1 k=1  κ ∑ (3.21) with σ ξ ,n (x), σ ε,n (x), ξ k (x), ε(0) (x), and cJ(x),n given in (3.20), (3.16), (3.18), and (3.5). k { } Then η(x) is a Gaussian process consisting of (Ns + 1) standard normal variables η J Ns J=0 such that η(x) = η J(x) for x ∈ [0, 1], and there exists a constant C > 0 such that for large n, sup Eη J η ′ ≤ Chs . 0≤J̸=J ′ ≤Ns J ( ) } { Proof. It is apparent that L η J Xij , Ni , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n = N (0, 1) for { } 0 ≤ J ≤ Ns , so L η J = N (0, 1), for 0 ≤ J ≤ Ns . For J ̸= J ′ , by (3.17) and (3.8), ( ) ( ) ( ) 2 X Rij,ε,J R = BJ Xij B ′ Xij σ ij = 0, along with (3.19), (3.18), the condiJ ij,ε,J ′ tional independence of ξ k (x), ε(0) (x) on Xij , Ni , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n, and independence 73 ( ) of ξ ik , Zij , Xij , Ni , 1 ≤ k ≤ κ, 1 ≤ j ≤ Ni , 1 ≤ i ≤ n, E η J η ′ is J {{ ∑n {∑ }}−1/2 {∑ {∑ ∑N κ n κ 2 i R2 E Rik,ξ,J + R2 + ij,ε,J i=1 k=1 j=1 i=1 k=1 ik,ξ,J ′ }}−1/2 {∑ } {∑n } ∑N κ {∑n i R2 E Rik,ξ,J ξ ik Rik,ξ,J ′ ξ ik + j=1 ij,ε,J ′ k=1 i=1 i=1 } ( {∑ } {∑ ) }} n ∑Ni n ∑Ni Z Xij , Ni R Z R i=1 j=1 ij,ε,J ij i=1 j=1 ij,ε,J ′ ij = ECn,J,J ′   C = N −1 n,J,J ′  T in which  n ∑ κ ∑ i=1  2 Rik,ξ,J + −1/2  ∑N i R2 j=1 ij,ε,J  × k=1 −1/2   Ni  κ n    ∑ ∑ ∑ −1 −1 Rik,ξ,J R . NT R2 R2 + NT ′ ik,ξ,J ′  ij,ε,J ′      k=1 ik,ξ,J  j=1 i=1 k=1 i=1     n ∑ ∑ κ Note that according to definitions of Rik,ξ,J , Rij,ε,J , and Lemma 3.5, −1 NT −1 ≥ c2 NT σ {∑ } ∑N κ 2 i R2 R + i=1 k=1 ik,ξ,J j=1 ij,ε,J ∑n ( ) ∑N 2 2 2 i B2 X ij = cσ BJ 2,NT ≥ cσ (1 − An ) , for 0 ≤ J ≤ Ns , i=1 j=1 J ∑n         n κ  ∑ ∑  −1 2 2 P inf N Rik,ξ,J + Rij,ε,J  ×  0≤J̸=J ′ ≤Ns  T     i=1 k=1 j=1    2   √ Ni   n κ   ∑ ∑ ∑ log (n)    −1 −3 4 NT + R2 R2  ≥ 1 − 2n ,  ≥ cσ 1 − CA  ′  ′ ij,ε,J  ik,ξ,J  nhs  j=1 i=1 k=1  74 Ni ∑ by Lemma 3.5. Thus for large n, with probability ≥ 1 − 2n−3 , the numerator of C is n,J,J ′ uniformly greater than c2 /2. Applying Bernstein’s inequality to σ {∑ } −1 κ ∑n R NT R k=1 i=1 ik,ξ,J ik,ξ,J ′ , there exists C0 > 0 such that, for large n,  P sup 0≤J̸=J ′ ≤Ns  ∑κ ∑n −1 R R NT ≤ C0 hs  ≥ 1 − 2n−3 . k=1 i=1 ik,ξ,J ik,ξ,J ′ ( )−1 , Putting the above together, for large n, C1 = C0 c2 /2 σ ( P sup ) 0≤J̸=J ′ ≤Ns C ≤ C1 hs ≥ 1 − 4n−3 . n,J,J ′ Note that as a continuous random variable, sup0≤J̸=J ′ ≤N Cn,J,J ′ ∈ [0, 1] , thus s ( ∫ 1 ) E sup C 0≤J̸=J ′ ≤Ns n,J,J ′ = ( ) P sup C > t dt. 
0≤J̸=J ′ ≤Ns n,J,J ′ 0 ( ) For large n, C1 hs < 1 and then E sup C 0≤J̸=J ′ ≤Ns n,J,J ′ ∫ C h 1 s 0 is      ∫ 1    P sup C > t dt + P sup C ′ > t dt n,J,J ′   C1 hs 0≤J̸=J ′ ≤Ns n,J,J 0≤J̸=J ′ ≤Ns ≤ ∫ C h 1 s ∫ 1 1dt + 0 C1 hs 4n−3 dt ≤ C1 hs + 4n−3 ≤ Chs for some C > 0 and large enough n. The lemma now follows from ( sup0≤J̸=J ′ ≤N E Cn,J,J ′ s ) ( ≤ E sup0≤J̸=J ′ ≤N Cn,J,J ′ s 75 ) ≤ Chs . By Lemma 3.8, the (Ns + 1) standard normal variables η 0 , ..., η Ns satisfy the conditions of Lemma 3.2 Hence for any τ ∈ R, ( ) ( ) lim P supx∈[0,1] |η(x)| ≤ τ /aNs +1 + bNs +1 = exp −2e−τ . n→∞ (3.22) For x ∈ [0, 1], Rik,ξ,J , Rij,ε,J given in (3.17), define the ratio of population and sample { }1/2 { } ¯ ¯ quantities as rn (x) = nE (N1 ) /NT Rn (x)/R(x) 1/2 , with (∑ )} ∑N n κ 2 i R2 R + j=1 ij,ε,J(x) i=1 k=1 ik,ξ,J(x) ∑κ 2 2 ¯ R(x) = (EN1 )−1 ER1k,ξ,J(x) + ER11,ε,J(x) . k=1 ¯ Rn (x) = N −1 T {∑ Lemma 3.9. Under Assumptions (A2), (A3), for η(x), σ n (x) in (3.21), (3.6), σ n (x)−1 {∑κ } (0) (x) − η(x) = |r (x) − 1| |η(x)| ξ (x) + ε n k=1 k (3.23) { } (√ ) as n → ∞, supx∈[0,1] aNs +1 |rn (x) − 1| = Oa.s. {log (Ns + 1)} (logn) / (nhs ) . Proof. Equation (3.23) follows from the definitions of η(x) and σ n (x). By Lemma 3.6, (√ ) n Ni −1 ∑ ∑ R2 = Oa.s. − ER2 supx∈[0,1] N logn/ (nhs ) , 11,ε,J(x) T i=1 j=1 ij,ε,J(x) supx∈[0,1] N −1 T ∑κ k=1 ∑n i=1 2 Rik,ξ,J(x) − (EN1 )−1 76 ∑κ k=1 2 ER1k,ξ,J(x) ≤ supx∈[0,1] (EN1 )−1 + supx∈[0,1] (EN1 )−1 ∑κ k=1 n−1 ∑n ∑κ 2 R2 − ER1k,ξ,J(x) i=1 ik,ξ,J(x) n (EN1 ) N −1 − 1 n−1 T k=1 ∑n R2 i=1 ik,ξ,J(x) ) (√ (√ ( ) ) −1/2 = O = Oa.s. logn/ (nhs ) + Oa.s. n logn/ (nhs ) , a.s. ¯ and there exist constants 0 < cR < CR < ∞ such that for all x ∈ [0, 1], cR < R(x) < CR . ¯ ¯ ¯ ¯ ¯ ¯ Thus, supx∈[0,1] Rn (x) − R(x) is bounded by ∑κ ∑κ ∑n 2 2 Rik,ξ,J(x) − (EN1 )−1 ER1k,ξ,J(x) + supx∈[0,1] N −1 T k=1 i=1 k=1 (√ ) ∑n ∑N 2 i R2 − ER11,ε,J(x) = Oa.s. supx∈[0,1] N −1 logn/ (nhs ) . T i=1 j=1 ij,ε,J(x) { } { } { } ¯ ¯ ¯ ¯ ¯ sup Rn (x) 1/2 − R(x) 1/2 ≤ sup Rn (x) − R(x) sup R(x) −1/2 x∈[0,1] (√ ) {x∈[0,1] } x∈[0,1] = Oa.s. logn/ (nhs ) . Then supx∈[0,1] aNs +1 |rn (x) − 1| is bounded by Thus   { }1/2 { }1/2  { }1/2 ¯ ¯ nE (N1 ) /NT sup aNs +1 Rn (x)/R(x) − 1 + 1 − nE (N1 ) /NT   x∈[0,1]  { }1/2 { } { } { } ¯ ¯ ¯ sup R(x) −1/2 sup ≤ aNs +1 nE (N1 ) /NT Rn (x) 1/2 − R(x) 1/2  x∈[0,1] x∈[0,1] } }1/2 ) { (√ {log (Ns + 1)} (logn) / (nhs ) . + 1 − nE (N1 ) /NT = Oa.s. Proof of Proposition 3.2. The proof follows from Lemmas 3.5, 3.7, 3.9, (3.22), and Slutsky’s Theorem. 77 Proof of Theorem 3.1. By Theorem 3.2, ∥m(x) − m(x)∥∞ = Op (hs ), so   { } √  sup σ −1 (x) |m(x) − m(x)| = Op (nhs )1/2 log (Ns + 1)hs = op (1) , aNs +1 n x∈[0,1]   κ ∑ ξ k (x) + ε(x)  = op (1) . aNs +1  sup σ −1 (x) |m(x) − m(x)| − sup σ −1 (x) n n x∈[0,1] x∈[0,1] k=1 Meanwhile, (3.11) and Proposition 3.2 entail that, for any τ ∈ R,     sup σ −1 (x) lim P a n n→∞  Ns +1 x∈[0,1] κ ∑ k=1    ( ) ξ k (x) + ε(x) − bNs +1  ≤ τ = exp −2e−τ .  Thus Slutsky’s Theorem implies that       ( ) lim P aNs +1  sup σ −1 (x) |m(x) − m(x)| − bNs +1  ≤ τ = exp −2e−τ . n n→∞   x∈[0,1] { } Let τ = −log − 1 log (1 − α) , definitions of aNs +1 , bNs +1 , and QNs +1 (α) in (3.7) entail 2 { } lim P m(x) ∈ m(x) ± σ n (x)QNs +1 (α) , ∀x ∈ [0, 1] n→∞     = lim P Q−1+1 (α) sup σ −1 (x) |e(x) + m(x) − m(x)| ≤ 1 = 1 − α. n n→∞  Ns  x∈[0,1] by (3.11). That σ n (x)−1 {m(x) − m(x)} →d N (0, 1) for any x ∈ [0, 1] follows by directly using η(x) ∼ N (0, 1), without reference to supx∈[0,1] |η(x)|. 
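The Gumbel-type limit exp(−2e^{−τ}) used in Lemma 3.2 and in the proof of Theorem 3.1 can be checked numerically. The standalone snippet below (our own illustration, not part of the proof) takes the idealized case of independent standard normals, corresponding to rij = 0, and compares the empirical distribution of the maximum of Ns + 1 absolute values, normalized with the constants aNs+1 and bNs+1 of (3.7), against the limit; for a moderate number of variables the agreement is only approximate, reflecting the well-known slow convergence of extremes.

```python
import numpy as np

rng = np.random.default_rng(1)
Nb = 45                                   # e.g. Ns + 1 = 45, as in the n = 50 simulation
a = np.sqrt(2 * np.log(Nb))
b = a - np.log(2 * np.pi * a ** 2) / (2 * a)

for tau in (0.0, 1.0, 2.0, 3.0):
    z = np.abs(rng.standard_normal((100_000, Nb))).max(axis=1)
    emp = np.mean(z <= tau / a + b)
    lim = np.exp(-2 * np.exp(-tau))
    print(f"tau = {tau:.1f}: empirical {emp:.3f}  vs  limit exp(-2e^-tau) = {lim:.3f}")
```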
78 Chapter 4 Spline-backfitted Kernel Smoothing of Partially Linear Additive Model 4.1 Introduction This chapter is based on Ma and Yang (2011b). Since the 1980’s, non- and semiparametric analysis of time series has been vigorously pursued, see, for example, Tjøstheim and Auestad (1994) and Huang and Yang (2004). There are few satisfactory smoothing tools for multidimensional time series data, however, due to the poor convergence rate of nonparametric estimation of multivariate functions, known as the “curse of dimensionality”. One solution is the partially linear additive model (PLAM) studied in Li (2000), Fan and Li (2003) and Liang, Thurston, Ruppert, Apanasovich and Hauser (2008) ∑d ∑d ( ) ( ) 1 c t + 2 m (x ) Yi = m Xi , Ti + σ Xi , Ti εi , m(x, t) = c00 + 0l l l=1 α=1 α α 79 (4.1) }n { }n { in which the sequence Yi , XT , TT = Yi , Xi1 , ..., Xid , Ti1 , . . . Tid . The funci i i=1 2 1 i=1 tions m and σ are the mean and standard deviation of the response Yi conditional on the predictor vector {Xi , Ti }, and εi is a white noise conditional on {Xi , Ti }. For identifiabil( ) ity, both additive and linear components must be centered, i.e., Emα Xiα ≡ 0, 1 ≤ α ≤ d2 , ETil = 0, 1 ≤ l ≤ d1 . ( ) { }n If parameters c0l ≡ 0, 1 ≤ l ≤ d1 , Ti1 , . . . Tid are redundant. Yi , Xi1 , ..., Xid 1 2 i=1 follow an additive model. For applications of additive model, see, N´cher, Ojeda, Cadarsoa Su´rez, Roca-Pardi˜as and Acu˜a (2006), Roca-Pardi˜as, Cadarso-Su´rez, N´cher and Acu˜a a n n n a a n (2006), Gonz´lez-Manteiga, Mart´ a ınez-Miranda and Raya-Miranda (2008). Additive model, ( however, is only appropriate to model nonparametric effects of continuous predictors Xi1 , ) ..., Xid supported on compact intervals. The effects of possibly discrete and/or unbounded 2 ) ( predictors can be neatly modeled as some of the variables Ti1 , . . . Tid in the PLAM (4.1), 1 see the simulation example in Section 4.3 where Ti1 , Ti2 are normal conditional on Xi , Ti3 is discrete and Ti4 has positive density over a compact interval, and Section 4.4 which shows that the simpler PLAM fits the Boston housing data much better than an additive model. For general references on partially linear model, see Schimek (2000) and Liang (2006). For applications of partially linear model to panel data, see Su and Ullah (2006), while for data with measurement errors, see Liang, Wang and Carroll (2007) and Liang et al. (2008). { }d d2 Satisfactory estimators of functions {mα (xα )}α=1 and constants c0l 1 in model l=0 }n { (4.1) based on Yi , XT , TT i i i=1 should be (i) computationally expedient; (ii) theoretically reliable and (iii) intuitively appealing. Kernel procedures for PLAM, such as Fan and Li (2003) and Liang et al. (2008) satisfy criterion (iii) and partly (ii) but not (i) since they are computationally intensive when sample size n is large, as illustrated in the Monte-Carlo 80 results of Xue and Yang (2006). It is mentioned in Li (2000) that the computation time of estimating a PLAM is about n times of estimating a partially linear model with d2 = 1 by using the kernel marginal integration method. For discussion of computation burden issues by kernel methods, see Li (2000). Spline approaches of Li (2000), Schimek (2000) to PLAM, do not satisfy criterion (ii) as they lack limiting distribution, but are fast to compute, thus satisfying (i). 
The SBK estimator we propose combines the best features of both kernel and spline methods, and is essentially as fast and accurate as an univariate kernel smoothing, satisfying all three criteria (i)-(iii). We propose to extend the “spline-backfitted kernel smoothing” (SBK) of Wang and { }d Yang (2007) to PLAM (4.1). If the regression coefficients c0l 1 and the component l=0 { ( )}d { } 2 functions mβ xβ were known by “oracle”, one could create Yiα , Xiα n i=1 β=1,β̸=α ( ) ∑d2 ∑d1 m Xiβ c0l Til − with Yiα = Yi − c00 − β=1,β̸=α β l=1 ( ) ( ) = mα Xiα +σ Xi , Ti εi , from which one could compute an “oracle smoother” to estimate the only unknown function mα (xα ), bypassing the “curse of dimensionality”. The idea was ( ) to obtain approximations to the unobservable variables Yiα by substituting mβ Xiβ , 1 ≤ i ≤ n, 1 ≤ β ≤ d2 , β ̸= α, with spline estimates and argue that the error incurred by ( ) this “cheating” is of smaller magnitude than the rate O n−2/5 for estimating mα (xα ) from the unobservable data. Lemmas 4.9, 4.14, 4.17 and 4.18 establish the estimators’ uniform oracle efficiency by “reducing bias via undersmoothing (step one) and averaging out the variance (step two)”, via the joint asymptotics of kernel and spline functions. A major theoretical innovation is to resolve the dependence between T and X, making use of Assumption (A5), which is not needed in Wang and Yang (2007). Another significant innovation is the √ n-consistency and asymptotic distribution of estimators for parameters 81 { c0l }d1 , which is trivial for the additive model of Wang and Yang (2007). l=0 This chapter is organized as follows. The SBK estimators are introduced in Section 4.2 with theoretical properties. Section 4.3 contains Monte Carlo results to demonstrate the asymptotic properties of SBK estimators for moderate dimensions. The SBK estimator is applied to the Boston housing data in Section 4.4. Proofs of technical lemmas are in the Appendix. 4.2 The SBK Estimators ( ) For convenience, we denote vectors as x = x1 , ..., xd T and take ∥ · ∥ as the usual Euclidean √ d , i.e., ∥x∥ = ∑d 2 norm on R α=1 xα , and ∥·∥∞ the sup norm, i.e., ∥x∥∞ = sup1≤α≤d |xα |. We denote by Ir the r × r identity matrix, 0r×s the zero matrix of dimension r × s, and { }n diag(a, b) the 2 × 2 diagonal matrix with diagonal entries a, b. Let Yi , XT , TT i i i=1 be a sequence of strictly stationary observations from a geometrically α-mixing process following )T ( ( ) , (Ti1 , . . . Tid )T } are the i-th model (4.1), where Yi and Xi , Ti = { Xi1 , ..., Xid 2 1 response and predictor vector. Denote Y = (Y1 , ..., Yn )T the response vector. Without loss of generality, we assume Xα is distributed on [0, 1] , 1 ≤ α ≤ d2 . An integer N = Nn ∼ n1/4 log n is pre-selected. Denote the class of Lipschitz continuous functions for a constant } { ′ ) ≤ C x − x′ , ∀x, x′ ∈ [0, 1] . C > 0 as Lip([0, 1] , C) = m| m(x) − m(x We use the second order B spline (or linear B spline) basis bJ (x) = bJ,2 (x), 0 ≤ J ≤ N + 1 which is defined in Section 1.3 of Chapter 1. Let the distance between neighboring interior or boundary knots be H = Hn = (N + 1)−1 . Define next the space G of partially 82 linear additive spline functions as the linear space spanned by { } 1, tl , bJ (xα ) , 1 ≤ l ≤ d1, 1 ≤ α ≤ d2 , 1 ≤ J ≤ N + 1 , { { } ( )}n and let 1, Tl , bJ Xiα i=1 , 1 ≤ l ≤ d1, 1 ≤ α ≤ d2 , 1 ≤ J ≤ N + 1 span the space Gn ⊂ Rn . As n → ∞, with probability approaching 1, the dimension of Gn becomes {1 + d1 + d2 (N + 1)}. 
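For concreteness, the short sketch below (an illustration with our own variable names and an arbitrary i.i.d. design, not the mixing design of the chapter) constructs the linear B-spline basis bJ,2 on equally spaced knots as hat functions and assembles the design matrix spanning Gn with columns {1, tl, bJ(xα), J = 1, ..., N + 1}; its column count equals the dimension 1 + d1 + d2(N + 1) noted above.

```python
import numpy as np

def linear_bspline_basis(x, N):
    """Linear B-splines b_0, ..., b_{N+1} (hat functions) on [0, 1]
    with N equally spaced interior knots, H = 1/(N+1)."""
    H = 1.0 / (N + 1)
    knots = np.arange(N + 2) * H
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / H)

def plam_design(X, T, N):
    """Design matrix of G_n with columns {1, T_l, b_J(X_alpha), J = 1, ..., N+1}."""
    n, d2 = X.shape
    cols = [np.ones((n, 1)), T]
    for alpha in range(d2):
        cols.append(linear_bspline_basis(X[:, alpha], N)[:, 1:])   # drop b_0; 1 is already a column
    return np.hstack(cols)

rng = np.random.default_rng(2)
n, d1, d2 = 500, 2, 3
N = int(n ** 0.25 * np.log(n))            # N ~ n^{1/4} log n, as pre-selected above
X = rng.uniform(0, 1, (n, d2))
T = rng.standard_normal((n, d1))
D = plam_design(X, T, N)
print(D.shape[1], "=", 1 + d1 + d2 * (N + 1))   # dimension 1 + d1 + d2(N + 1)
```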
The spline estimator of m (x, t) is the unique element m (x, t) = ˆ { ( )} mn (x, t) from G so that m Xi , Ti T ˆ ˆ 1≤i≤n best approximates the response vector Y. To be precise, we define m (x, t) = c00 + ˆ ˆ ∑d 1 c t + ˆ l=1 0l l ∑d ∑N +1 2 c b (x ) , ˆ α=1 J=1 J,α J α (4.2) ( ) ˆ ˆ ˆ minimize where the coefficients c00 , c0l , cJ,α 1≤l≤d1, 1≤J≤N +1,1≤α≤d2 { ∑n i=1 Yi − c0 − } ∑N +1 ∑d ( ) 2 2 1 cT − c b Xiα . α=1 J=1 J,α J l=1 l il ∑d { }d { }d Pilot estimators of cT = c0l 1 and mα (xα ) are ˆT = c0l 1 and mα (xα ) = c ˆ ˆ l=0 l=0 ( ) ∑N +1 ∑ ∑N +1 cJ,α bJ (xα ) − n−1 n ˆ ˆ i=1 J=1 cJ,α bJ Xiα , which are used to define pseudo J=1 ˆ responses Yiα , estimates of the unobservable “oracle” responses Yiα : ) ( ∑d ∑d 1 c T − 2 ˆ0l il mβ Xiβ , ˆ l=1 β=1,β̸=α ( ) ∑d ∑d 1 c T − 2 Yiα = Yi − c00 − m Xiβ . l=1 0l il β=1,β̸=α β ˆ Yiα = Yi − c00 − ˆ (4.3) { }n ˆ , the SBK estimator mSBK,α (xα ) of mα (xα ) mimics the would-be Based on Yiα , Xiα ˆ i=1 83 { } Nadaraya-Watson estimator mK,α (xα ) of mα (xα ) based on Yiα , Xiα n , if the unob˜ i=1 { }n servable responses Yiα i=1 were available } { ∑n ( ) ˆ ˆ mSBK,α (xα ) = n−1 ˆ Kh Xiα − xα Yiα /fα (xα ), i=1 { } ∑n ( ) ˆ mK,α (xα ) = n−1 ˜ Kh Xiα − xα Yiα /fα (xα ), i=1 (4.4) ( ) ∑ ˆ ˆ with Yiα , Yiα in (4.3), fα (xα ) = n−1 n Kh Xiα − xα an estimator of fα (xα ). i=1 Without loss of generality, let α = 1. Under Assumptions A1-A5 and A7, it is straightforward to verify (as in Bosq 1998) that as n → ∞, ( ) supx ∈[h,1−h] mK,1 (x1 ) − m1 (x1 ) = op n−2/5 log n , ˜ 1 } { } √ { D 2 nh mK,1 (x1 ) − m1 (x1 ) − b1 (x1 ) h2 → N 0, v1 (x1 ) , ˜ ∫ { } 2 K (u) du m′′ (x ) f (x ) /2 + m′ (x ) f ′ (x ) f −1 (x ) , where, b1 (x1 ) = u 1 1 1 1 1 1 1 1 1 1 ∫ ] [ −1 2 v1 (x1 ) = K 2 (u) duE σ 2 (X, T) |X1 = x1 f1 (x1 ) . (4.5) It is shown in Li (2000) and Schimek (2000) that the spline estimator m1 (x1 ) in the first step ˆ uniformly converges to m1 (x1 ) with certain convergence rate, but lacks asymptotic distribution. Theorem 4.1 below states that the difference between mSBK,1 (x1 ) and mK,1 (x1 ) is ˆ ˜ ( ) op n−2/5 uniformly, dominated by the asymptotic uniform size of mK,1 (x1 ) − m1 (x1 ). ˜ So mSBK,1 (x1 ) has identical asymptotic distribution as mK,1 (x1 ). ˆ ˜ Theorem 4.1. Under Assumptions A1-A7, as n → ∞, the SBK estimator mSBK,1 (x1 ) ˆ ( ) ˆ ˜ given in (4.4) satisfies supx ∈[0,1] mSBK,1 (x1 ) − mK,1 (x1 ) = op n−2/5 . Hence with 1 84 2 b1 (x1 ) and v1 (x1 ) as defined in (4.5), for any x1 ∈ [h, 1 − h], √ { } { } D 2 ˆ nh mSBK,1 (x1 ) − m1 (x1 ) − b1 (x1 ) h2 → N 0, v1 (x1 ) . Instead of Nadaraya-Watson estimator, one can use local polynomial estimator, see Fan and Gijbels (1996). Under Assumptions A1-A7, for any α ∈ (0, 1), an asymptotic 100 (1 − α) % confidence intervals for m1 (x1 ) is mSBK,1 (x1 ) − ˆ1 (x1 ) h2 ± zα/2 v1 (x1 ) (nh)−1/2 ˆ b ˆ (4.6) 2 where ˆ1 (x1 ) and v1 (x1 ) are estimators of b1 (x1 ) and v1 (x1 ) respectively. b ˆ2 The following corollary provides the asymptotic distribution of mSBK (x). The proof of ˆ this corollary is straightforward and therefore omitted. Corollary 4.1. Under Assumptions A1-A7 and the additional assumption mα ∈ C (2) [0, 1] , 2 ≤ d2 d2 d2 ∑ 2 ∑ ∑ vα (xα ), mSBK,α (xα ), b (x) = ˆ bα (xα ), v 2 (x) = α ≤ d2 . 
Let mSBK (x) = ˆ α=1 α=1 α=1 for any x ∈ [0, 1]d2 , with SBK estimators mSBK,α (xα ) , 1 ≤ α ≤ d2 , defined in (4.4), and ˆ 2 bα (xα ), vα (xα ) similarly defined as in (4.5), as n → ∞, √ } { { } ∑d D 2 → N 0, v 2 (x) 2 m (x ) − b (x) h nh mSBK (x) − ˆ α=1 α α Next theorem describes the asymptotic behavior of estimator ˆ for c. c Theorem 4.2. Under Assumptions A1-A6, as n → ∞,       √ 0, σ 2 c additional Assumption A8, n (ˆ − c) →d N  0   85 ( ) −1/2 . With the ∥ˆ − c∥ = Op n c    ( ) 1 0T  d1 , for Σ = cov T ˜   0d Σ−1  1 ˜ with random vector T defined in (4.10). To construct confidence sets for c, Σ is consistently estimated by ∑n ˆ ∑d2 ∑N +1 ∗ ∗ ˆ ˆ ˆ i=1 Ti,l,n Ti,l′ ,n in which Tl,n = Tl − α=1 J=1 aJ,α bJ,α (xα ), where bJ,α (xα ) ≡ ( ) ∑ bJ (xα ) − n−1 n bJ Xiα is the empirical centering of bJ (x) for the α-th variable Xα , i=1 ( ) defined in the Appendix and aJ,α ˆ minimize 1≤J≤N +1,1≤α≤d2 n−1 2 ∑d ∑N +1 ∗ (X ) . 2 Tl − a b α=1 J=1 J,α J,α α n 4.3 Simulation In this section, we analyze some simulated data examples to illustrate the finite-sample behavior of SBK estimators. The number of interior knots N in (4.2) is given by N = ([ ] [ ]) min c1 n1/4 log n + c2 , (n/2 − 1 − d1 ) d−1 − 1 , in which [a] denotes the integer part 2 of a. In our implementation, we have used c1 = c2 = 1. The additional constraint that N ≤ (n/2 − 1 − d1 ) d−1 − 1 ensures that the number of terms in the linear least squares 2 problem (4.2), 1 + d1 + d2 (N + 1), is no greater than n/2, which is necessary when the sample size n is moderate. { } The i.i.d. data Yi , Xi , Ti n is generated according to the partially linear additive i=1 model (4.1), which satisfies Assumptions A1-A5, and A8 Yi = 2 + ∑d ∑d 1 T + 2 m (X ) + σ ε , m (x) ≡ sin (2πx) , 1 ≤ α ≤ d , 0 i α 2 il l=1 α=1 α iα ( ( ) ) where σ 0 = 2, εi ∼ N (0, 1) is independent of Xi , Ti , Ti = Ti1 , Ti2 , Ti3 , Ti4 such that ( ) Ti3 , Ti4 , Ti1 , Ti2 are independent, Ti3 = ±1 with probability 1/2, Ti4 ∼ U (−0.5, 0.5), 86 ( ) ( ( ) ( ))) 5−sin(2πx) Ti1 , Ti2 ′ ∼ N (0, 0)′ , diag a Xi1 , a Xi2 , a (x) = . 5+sin(2πx) { }T ( )d2 Xi = Xiα α=1 is generated from the vector autoregression (VAR) equation Xiα = {( } ) 2 1/2 Z Φ 1−a iα − 1/2, 1 ≤ α ≤ d2 with stationary distribution ( ) ( )T ( ) 2 −1 Σ Zi = Zi1 , ..., Zid ∼ N 0d , 1 − a 2 2 ( ( ) ) 2 −1 Σ , Z = aZ Z1 ∼ N 0d , 1 − a i i−1 + εi , εi ∼ N (0, Σ) , 2 ≤ i ≤ n, 2 Σ = (1 − r) Id ×d + r1d 1T , 0 < a < 1, 0 < r < 1, 2 2 2 d2 ( { } So Xi n is geometrically α-mixing with marginal distribution U [−0.5, 0.5]. i=1 We obtained for comparison the SBK estimator mSBK,α (xα ) and the “oracle” smoother ˆ mK,α (xα ) by Nadaraya-Watson regression using quartic kernel and the rule-of-thumb band˜ width. To see that mSBK,α (xα ) is as efficient as mK,α (xα ) for numerical performance, we ˆ ˜ define the empirical relative efficiency of mSBK,α (xα ) with respect to mK,α (xα ) as ˆ ˜  ( )}2 1/2 ∑n { ˜ i=1 mK,α (xα ) − mα Xiα   effα =  . { ( )}2  ∑n ˆ i=1 mSBK,α (xα ) − mα Xiα (4.7) Theorem 4.1 indicates effα should be close to 1 for 1 ≤ α ≤ d2 . Figures 4.1, 4.2, 4.3, and 4.4 provide the kernel density estimates of 100 empirical efficiencies α = 2, 3, sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at σ 0 = 2, d2 = 3, and d2 = 30. The vertical line at efficiency = 1 is the standard line for ˆ ˜ the comparison of mSBK,α (xα ) and mK,α (xα ). 
One clearly sees that the center of the density plots is going toward the standard line at 1 with narrower spread when sample size n is increasing, confirmative to Theorem 4.1. 87 0 0.5 1 Y 1.5 2 2.5 Efficiency of the 2-nd estimator 0 0.5 1 1.5 2 2.5 X Figure 4.1: Kernel density plots for α = 2, d1 = 4, d2 = 3 in Chapter 4 Note: kernel density plots of the 100 empirical efficiencies of mSBK,α (xα ) to mK,α (xα ), ˆ ˜ computed according to (4.7) for sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at α = 2, d1 = 4, d2 = 3. 88 0 0.5 1 Y 1.5 2 2.5 Efficiency of the 3-rd estimator 0 0.5 1 1.5 2 2.5 X Figure 4.2: Kernel density plots in Chapter 4 Note: kernel density plots of the 100 empirical efficiencies of mSBK,α (xα ) to mK,α (xα ), ˆ ˜ computed according to (4.7) for sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at α = 3, d1 = 4, d2 = 3. 89 0 0.5 1 Y 1.5 2 2.5 Efficiency of the 2-nd estimator 0 0.5 1 1.5 2 2.5 X Figure 4.3: Kernel density plots for α = 2, d1 = 4, d2 = 30 in Chapter 4 Note: kernel density plots of the 100 empirical efficiencies of mSBK,α (xα ) to mK,α (xα ), ˆ ˜ computed according to (4.7) for sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at α = 2, d1 = 4, d2 = 30. 90 0 0.5 1 Y 1.5 2 2.5 Efficiency of the 3-rd estimator 0 0.5 1 1.5 2 2.5 X Figure 4.4: Kernel density plots for α = 3, d1 = 4, d2 = 30 in Chapter 4 Note: kernel density plots of the 100 empirical efficiencies of mSBK,α (xα ) to mK,α (xα ), ˆ ˜ computed according to (4.7) for sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at α = 3, d1 = 4, d2 = 30. 91 To see that c0l is as efficient as c0l , we define the asymptotic efficiency of c0l with respect ˆ ˆ ˜  1/2 }2 ∑100 { c0l,t −c0l /100  ˜  t=1 to c0l as effl =  ∑ ˜ ˜ ˆ ˜ ˆ , where c0l,t , c0l,t are values of c0l , c0l for the  { }2 100 c −c t=1 ˆ0l,t 0l /100 t-th replication in the simulation. For n = 200, 500, d2 = 3, Table 4.1 lists the frequencies of 95% confidence interval coverage of the SBK estimators for the regression coefficients { }d1 c0l , the sample mean squared error (MSE) and the asymptotic efficiency. The coverage l=0 frequencies are all close to the nominal level of 95%. As expected, increase in sample size reduces the sample MSE and increases the asymptotic efficiency. 
Table 4.1: Estimation of parameters for the linear part in Chapter 4 r c00 c01 c02 c03 c04 a 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0 0.3 0.3 0 0 0.3 0.3 0 0 0.3 0.3 0 0 0.3 0.3 0 0 0.3 0.3 95% CI coverage frequency 0.92 (0.92) 0.92 (0.91) 0.89 (0.92) 0.89 (0.92) 0.95 (0.90) 0.99 (0.91) 0.98 (0.95) 0.96 (0.94) 0.96 (0.95) 0.97 (0.97) 0.95 (0.95) 0.97 (0.97) 0.96 (0.97) 0.92 (0.98) 0.93 (0.96) 0.96 (0.97) 0.95 (0.96) 0.94 (0.95) 0.92 (0.95) 0.93 (0.96) MSE 0.0241 (0.010) 0.0264 (0.0095) 0.0263 (0.0096) 0.0282 (0.0103) 0.0330 (0.0152) 0.0297 (0.0143) 0.0296 (0.0121) 0.0336 (0.0134) 0.0306 (0.0115) 0.0378 (0.0118) 0.0329 (0.0112) 0.0336 (0.0104) 0.0259 (0.0087) 0.0301 (0.0074) 0.0362 (0.0078) 0.0258 (0.0078) 0.4006 (0.1229) 0.3771 (0.1111) 0.3867 (0.1154) 0.3533 (0.1138) Asymptotic Efficiency 0.8806 (0.8406) 0.8403 (0.8588) 0.8446 (0.8536) 0.8146 (0.8270) 0.8795 (0.8892) 0.9217 (0.9069) 0.9157 (0.9949) 0.8635 (0.9491) 0.8809 (0.8659) 0.7914 (0.8553) 0.8523 (0.8757) 0.8397 (0.9039) 0.8892 (0.8983) 0.7527 (0.9327) 0.8264 (0.9178) 0.8919 (0.9101) 0.7873 (0.9181) 0.8117 (0.9661) 0.8019 (0.9470) 0.8388 (0.9537) Note: estimation of c = (c00 , c01 , c02 , c03 , c04 )′ with d2 = 3, n = 200 (outside parentheses), n = 500 (inside parentheses). 92 For visualization of the actual function estimates, at noise level σ 0 = 2 with sample size n = 200, 500, we plot mK,α (xα ) (thin curves), mSBK,α (xα ) (thick curves) and their 95% ˜ ˆ pointwise confidence intervals (upper and lower medium curves) for mα (dashed curves) in Figures 4.5, 4.6, 4.7, and 4.8. The SBK estimators seem rather satisfactory and their performance improves with increasing n. 4.4 Application In this section we apply our method to the well-known Boston housing data, which contains 506 different houses from a variety of locations in Boston Standard Metropolitan Statistical Area in 1970. The median value and 13 sociodemographic statistics values of the Boston houses were first studied by Harrison and Rubinfeld (1978) to estimate the housing price index model. Breiman and Friedman (1985) did further analysis to deal with the multi-collinearity for overfitting by using a stepwise method. The response and explanatory variables of interest are: MEDV: Median value of owner-occupied homes in $1000’s RM: average number of rooms per dwelling TAX: full-value property-tax rate per $10, 000 PTRATIO: pupil-teacher ratio by town school district LSTAT: proportion of population that is of ”lower status” in %. Wang and Yang (2009b) fitted an additive model using RM, log(TAX), PTRSATIO and log(LSTAT) as predictors to test the linearity of the components and found that only PTRATIO is accepted at the significance level 0.05 for the linearity hypothesis test. Based on the conclusion drawn from Wang and Yang (2009b), we fitted a partial linear additive 93 0 -1.5 -1 -0.5 Y 0.5 1 1.5 2 n=200, Estimation of component #2 0 X 0.5 Figure 4.5: Plots of the estimator for the nonparametric part at α = 2, n = 200 Note: plots of the oracle smoother mK,α (xα ) (thin curve), the SBK estimator ˜ mSBK,α (xα ) (thick curve) defined in (4.4), and the 95% pointwise confidence intervals ˆ constructed by (4.6) (upper and lower medium curves) of the function components mα (xα ), α = 2 (dashed curve), d1 = 4, d2 = 3. 
Figure 4.6: Plots of the estimator for the nonparametric part at α = 2, n = 500. Note: as in Figure 4.5, with α = 2 and n = 500, d1 = 4, d2 = 3.

Figure 4.7: Plots of the estimator for the nonparametric part at α = 3, n = 200. Note: as in Figure 4.5, with α = 3 and n = 200, d1 = 4, d2 = 3.

Figure 4.8: Plots of the estimator for the nonparametric part at α = 3, n = 500. Note: as in Figure 4.5, with α = 3 and n = 500, d1 = 4, d2 = 3.

The fitted model is
$$\mathrm{MEDV} = c_{00} + c_{01}\times\mathrm{PTRATIO} + m_1(\mathrm{RM}) + m_2(\log(\mathrm{TAX})) + m_3(\log(\mathrm{LSTAT})) + \varepsilon.$$
As in Wang and Yang (2009b), the number of interior knots is N = 5. In Figure 4.9, the univariate nonlinear function estimates (dashed lines) and the corresponding simultaneous confidence bands (thin lines) are displayed together with the "pseudo data points" (dots), whose pseudo responses are the backfitted responses obtained by subtracting the sum function of the remaining covariates as in (4.3). The confidence bands are used to test the linearity of the nonparametric components. In Figures 4.9, 4.10 and 4.11 the straight solid lines are the least squares regression lines through the pseudo data points. In Figure 4.9, the confidence band with confidence level 0.999999 does not totally cover the straight regression line, i.e., the p-value is less than 0.000001. Similarly, the linearity of the component functions for log(TAX) and log(LSTAT) is rejected at the significance levels 0.017 and 0.007, respectively. The estimators $\hat{c}_{00}$ and $\hat{c}_{01}$ of $c_{00}$ and $c_{01}$ are 33.393 and −0.58845, and both are significant with p-values close to 0. The correlation between the estimated and observed values of MEDV is 0.89944, much higher than the 0.80112 obtained by Wang and Yang (2009b). This improvement comes from fitting the variable PTRATIO directly as linear, with the higher accuracy of a parametric fit, instead of treating it unnecessarily as a nonparametric variable. In other words, our simpler partially linear additive model (PLAM) fits the housing data much better than the additive model of Wang and Yang (2009b).

Figure 4.9: Plots of the estimators for RM for Boston housing data (confidence level = 0.999999). Note: plots of the least squares regression estimator (solid line), confidence bands (upper and lower thin lines), the spline estimator (dashed line) and the data (dots).
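As a rough illustration of how such a partially linear additive fit can be computed, the Python sketch below regresses MEDV on a linear PTRATIO term plus centered spline bases in RM, log(TAX) and log(LSTAT) by ordinary least squares. It is only a sketch under assumptions: the truncated-power basis stands in for the B-spline basis of Chapter 1, the data are assumed to be available as numpy arrays (e.g., read from a local copy of the Boston housing file), and the spline-backfitted kernel refinement and the confidence bands of this chapter are not reproduced, so the reported numbers above should not be expected to match exactly.

```python
import numpy as np

def centered_spline_basis(x, n_knots=5, degree=3):
    """Cubic truncated-power basis (a stand-in for the B-spline basis):
    columns x, x^2, x^3 and (x - kappa_j)_+^3 at interior knots kappa_j
    placed at sample quantiles; columns are centered."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    cols = [x ** d for d in range(1, degree + 1)]
    cols += [np.clip(x - k, a_min=0.0, a_max=None) ** degree for k in knots]
    B = np.column_stack(cols)
    return B - B.mean(axis=0)

def fit_plam(medv, ptratio, rm, tax, lstat, n_knots=5):
    """Least squares fit of
    MEDV = c00 + c01*PTRATIO + m1(RM) + m2(log TAX) + m3(log LSTAT) + error,
    with each nonparametric component represented by a centered spline basis."""
    design = np.column_stack([
        np.ones_like(medv),
        ptratio,
        centered_spline_basis(rm, n_knots),
        centered_spline_basis(np.log(tax), n_knots),
        centered_spline_basis(np.log(lstat), n_knots),
    ])
    coef, *_ = np.linalg.lstsq(design, medv, rcond=None)
    fitted = design @ coef
    corr = np.corrcoef(fitted, medv)[0, 1]   # compare with the correlation reported above
    return coef[0], coef[1], corr            # rough analogues of c00 and c01
```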
Figure 4.10: Plots of the estimators for log(TAX) for Boston housing data (confidence level = 0.983). Note: plots of the least squares regression estimator (solid line), confidence bands (upper and lower thin lines), the spline estimator (dashed line) and the data (dots).

Figure 4.11: Plots of the estimators for log(LSTAT) for Boston housing data (confidence level = 0.993). Note: plots of the least squares regression estimator (solid line), confidence bands (upper and lower thin lines), the spline estimator (dashed line) and the data (dots).

4.5 Appendix

Throughout this section, $a_n \gg b_n$ means $\lim_{n\to\infty} b_n/a_n = 0$, and $a_n \sim b_n$ means $\lim_{n\to\infty} b_n/a_n = c$, where c is some nonzero constant. We state the following assumptions.

A1. Given $1 \le \alpha \le d_2$, $m_\alpha \in C^{(2)}[0,1]$, while there is a constant $0 < C_\infty < \infty$ such that $m'_\beta \in \mathrm{Lip}([0,1], C_\infty)$, $\forall 1 \le \beta \le d_2$ and $\beta \neq \alpha$.

A2. The vector process $\{Z_t\}_{t=-\infty}^{\infty} = \{(X_t^T, T_t^T, \varepsilon_t)\}_{t=-\infty}^{\infty}$ is strictly stationary and geometrically strongly mixing, that is, its $\alpha$-mixing coefficient satisfies $\alpha(k) \le K_0 e^{-\lambda_0 k}$ for some $K_0, \lambda_0 > 0$, where
$$\alpha(k) = \sup_{B\in\sigma\{Z_t,\,t\le 0\},\,C\in\sigma\{Z_t,\,t\ge k\}} |P(B\cap C) - P(B)P(C)|. \qquad (4.8)$$

A3. The noise $\varepsilon_i$ satisfies $E(\varepsilon_i\,|\,\mathcal{F}_i) = 0$, $E(\varepsilon_i^2\,|\,\mathcal{F}_i) = 1$, $E(|\varepsilon_i|^{2+\delta}\,|\,\mathcal{F}_i) < M_\delta < +\infty$ for some $\delta > 2/3$, $M_\delta > 0$, and $\sigma$-fields $\mathcal{F}_i = \sigma\{(X_{i'}, T_{i'}),\, i' \le i;\ \varepsilon_{i'},\, i' \le i-1\}$, $1 \le i \le n$. The function $\sigma(x,t)$ is continuous with
$$0 < c_\sigma \le \inf_{x\in[0,1]^{d_2},\,t\in R^{d_1}} \sigma(x,t) \le \sup_{x\in[0,1]^{d_2},\,t\in R^{d_1}} \sigma(x,t) \le C_\sigma < \infty.$$

A4. The density function $f(x)$ of $X$ and the marginal densities $f_\alpha(x_\alpha)$ of $X_\alpha$ satisfy $f \in C[0,1]^{d_2}$, $0 < c_f \le \inf_{x\in[0,1]^{d_2}} f(x) \le \sup_{x\in[0,1]^{d_2}} f(x) \le C_f < \infty$, and $f_\alpha \in C^{(1)}[0,1]$.

A5. There exist constants $0 < c_\delta \le C_\delta < +\infty$, $0 < c_Q \le C_Q < +\infty$ such that $c_\delta \le E(|T_l|^{2+\delta}\,|\,X = x) \le C_\delta$, $\forall x \in [0,1]^{d_2}$, $1 \le l \le d_1$, and $c_Q I_{(d_1+1)\times(d_1+1)} \le Q(x) \le C_Q I_{(d_1+1)\times(d_1+1)}$, where $Q(x) = E\{(1, T^T)^T(1, T^T)\,|\,X = x\}$.

A6. The number of interior knots $N = N_n \sim n^{1/4}\log n$, i.e., $c_N n^{1/4}\log n \le N \le C_N n^{1/4}\log n$ for some positive constants $c_N, C_N$.

A7. The kernel function $K \in \mathrm{Lip}([-1,1], C_\infty)$ for some $C_\infty > 0$ is bounded, nonnegative, symmetric, and supported on $[-1,1]$. The bandwidth $h \sim n^{-1/5}$, i.e., $c_h n^{-1/5} \le h \le C_h n^{-1/5}$ for positive constants $c_h, C_h$.

Assumption A1 on the smoothness of the component functions is greatly relaxed and is close to being minimal. Assumption A2 is typical in the time series literature, while Assumptions A3-A5 are typical in the nonparametric smoothing literature; see, for instance, Fan and Gijbels (1996).

For $\phi, \varphi$ on $[0,1]^{d_2}\times R^{d_1}$, define the empirical inner product and empirical norm as $\langle\phi,\varphi\rangle_n = n^{-1}\sum_{i=1}^n \phi(X_i,T_i)\varphi(X_i,T_i)$, $\|\phi\|_n^2 = n^{-1}\sum_{i=1}^n \phi^2(X_i,T_i)$. If $\phi, \varphi$ are $L_2$-integrable, we define the theoretical inner product and theoretical $L_2$ norm as $\langle\phi,\varphi\rangle = E\{\phi(X_i,T_i)\varphi(X_i,T_i)\}$, $\|\phi\|^2 = E\{\phi^2(X_i,T_i)\}$, and denote $E_n\phi = \langle\phi,1\rangle_n$. $\phi$ is empirically (theoretically) centered if $E_n\phi = 0$ ($E\phi = 0$). For theoretical analysis, define the centered spline basis as $b_{J,\alpha}(x_\alpha) = b_J(x_\alpha) - \frac{c_{J,\alpha}}{c_{J-1,\alpha}}\, b_{J-1}(x_\alpha)$, $\forall 1 \le \alpha \le d_2$, $1 \le J \le N+1$, where $c_{J,\alpha} = Eb_J(X_\alpha) = \int b_J(x_\alpha) f_\alpha(x_\alpha)\,dx_\alpha$. The standardized basis is $B_{J,\alpha}(x_\alpha) = b_{J,\alpha}(x_\alpha)/\|b_{J,\alpha}\|$, $\forall 1 \le \alpha \le d_2$, $1 \le J \le N+1$.
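Assumptions A6 and A7 pin down the orders of the two tuning parameters. A minimal sketch of the corresponding default choices follows; the constants c_N and c_h are set to 1 purely for illustration, and the equally spaced knot placement is a common default rather than necessarily the exact choice used in the implementation.

```python
import numpy as np

def tuning_parameters(n, c_N=1.0, c_h=1.0):
    """Orders prescribed by Assumptions A6 and A7 (constants illustrative):
    N ~ n^{1/4} log n interior knots and kernel bandwidth h ~ n^{-1/5}."""
    N = int(np.ceil(c_N * n ** 0.25 * np.log(n)))
    h = c_h * n ** (-1.0 / 5.0)
    interior_knots = np.linspace(0.0, 1.0, N + 2)[1:-1]  # equally spaced on [0, 1]
    return N, h, interior_knots

N, h, knots = tuning_parameters(500)
```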
103 (4.9) For the proof of Theorem 4.2, define the Hilbert space { } ∑d 2 p (x ) , Ep (X ) = 0, E 2 p (X ) < ∞ H = p (x) = α α α α α=1 α α of theoretically centered L2 additive functions on [0, 1]d2 , while denote by Hn its subspace { } spanned by BJ,α (xα ) , 1 ≤ α ≤ d2 , 1 ≤ J ≤ N + 1 . Denote { } ˜ ProjH Tl = pl (X) = argmin p∈H E Tl − p (X) 2 , Tl = Tl − ProjH Tl , { } ˜ ProjHn Tl = argmin p∈Hn E Tl − p (X) 2 , Tl,n = Tl − ProjHn Tl , for 1 ≤ l ≤ d1 , where ProjH Tl and ProjHn Tl are orthogonal projections of Tl unto subspaces H and Hn respectively. Denote next in vector form { } { } ˜ ˜ ˜ ˜ Tn = Tl,n , T = Tl . 1≤l≤d1 1≤l≤d1 (4.10) The next assumption is needed for the second part of Theorem 4.2. A8. Functions pl ∈ C [0, 1]d2 , 1 ≤ l ≤ d1 while σ (x, t) ≡ σ 0 , (x, t) ∈ [0, 1]d2 × Rd1 . m (x, t) can be expressed in terms of the standardized basis ˆ m (x, t) = c00 + ˆ ˆ ( ∑d ∑d ∑N +1 1 c t + 2 ˆ0l l c B ˆ (x ) , l=1 α=1 J=1 J,α J,α α (4.11) ) minimize where c00 , c0l , cJ,α ˆ ˆ ˆ 1≤l≤d1, 1≤J≤N +1,1≤α≤d2 } ∑n { ∑d ∑d ∑N +1 ( ) 2 1 cT − 2 X Yi − c0 − c B . i=1 l=1 l il α=1 J=1 J,α J,α iα 104 (4.12) While (4.2) is used for statistical implementation, algebraically equivalent (4.11) is for mathematical analysis. Pilot estimators of mα (xα ) and cT are mα (xα ) = ˆ ∑N +1 { }d ∗ cJ,α BJ,α (xα ) , ˆT = c00 , c0l 1 ˆ c ˆ ˆ l=1 J=1 (4.13) ( ) ∑ ∗ where BJ,α (xα ) ≡ BJ,α (xα ) − En BJ,α = BJ,α (xα ) − n−1 n BJ,α Xiα is the emi=1 pirical centering of BJ,α (xα ). The evaluation of m (x, t) at the n observations results in an ˆ { ( )} n-dimensional vector m Xi , Ti T ˆ 1≤i≤n , the projection of Y onto Gn with respect to the { } empirical inner product ⟨·, ·⟩n . In general, for any n-dimensional vector Λ = Λi T 1≤i≤n , we ˆ ∑d1 λ t + ˆ define Pn Λ (x, t) as the projection of Λ onto (Gn , ⟨·, ·⟩n ), i.e., Pn Λ (x, t) = λ0 + l=1 l l { } ∑d2 ∑N +1 ˆ ˆ ,λ ,λ ˆ ˆ α=1 J=1 λJ,α BJ,α (xα ), with Yi replaced by Λi and coefficients λ0 l J,α given in (4.12). Define the empirically centered additive components as ( ) ∑ ∗ ˆ Pn,α Λ (xα ) = N +1 λJ,α BJ,α (xα ) , 1 ≤ α ≤ d2 , and the linear component as Pn,c Λ T = J=1 { } ˆ ˆ λ0 , λl . Rewrite estimators m (x), mα (xα ), ˆ defined in (4.11) and (4.13) as ˆ ˆ c 1≤l≤d1 m (x, t) = Pn Y (x, t) , mα (xα ) = Pn,α Y (xα ) , ˆ = Pn,c Y. The noiseless and noisy comˆ ˆ c ponents are m (x, t) = Pn m (x, t) , mα (xα ) = Pn,α m (xα ) , ˜m = Pn,c m, ˜ ˜ c ˜ (x, t) = Pn E, ˜α (xα ) = Pn,α E (xα ) , ˜ε = Pn,c E, ε ε c (4.14) { ( )} { ( ) } for m = m Xi , Ti n , E = σ Xi , Ti εi n . Linearity of Pn , Pn,c , Pn,α , 1 ≤ α ≤ i=1 i=1 d2 , and the relation Y = m + E imply a crucial decomposition m (x, t) = m (x, t) + ˜ (x, t) , mα (xα ) = mα (xα ) + ˜α (xα ) , ˆ = ˜m + ˜ε . ˆ ˜ ε ˆ ˜ ε c c c 105 (4.15) ( )T Let ˜ = a0 , al , aJ,α a ˜ ˜ ˜ be the minimizer of 1≤l≤d1, 1≤J≤N +1,1≤α≤d2 { } ∑d ∑d ∑N +1 ( ) ( ) 2 1 aT − 2 σ Xi , Ti εi − a0 − a B X . i=1 l=1 l il α=1 J=1 J,α J,α iα ∑n (4.16) ( )T Similarly, ˜ = c00 , c0l , cJ,α c ˜ ˜ ˜ minimizes 1≤l≤d1, 1≤J≤N +1,1≤α≤d2 } ∑d ∑d ∑N +1 ( ) 2 1 cT − 2 m Xi , Ti − c0 − c B X . 
(4.17) i=1 l=1 l il α=1 J=1 J,α J,α iα ∑n { ( ) ( )−1 Then ˜ (x, t) = ˜T B (x, t) , ˜ = BT B ε a a BT E, with matrices { }T { ( )} B (x, t) = 1, tl , BJ,α (xα ) , B = B Xi , Ti T 1≤i≤n , 1≤l≤d1 ,1≤J≤N +1,1≤α≤d2 (4.18) and ˜, the solution of (4.16), equals to a    1     (   (     En T ′ l )T En B ′ ′ J ,α )T En Tl En BJ,α ⟨ ⟩ ⟨ ⟩ T ′ , Tl T ′ , BJ,α l n ⟩ ⟨ l n ⟩ ⟨ B ′ ′ , Tl B ′ ′ , BJ,α J ,α J ,α n n  ( )  −1 ∑n  n σ Xi , Ti εi   i=1   ( ) ∑n × n−1 Til σ Xi , Ti εi  i=1    ( ) ( ) ∑n   n−1 BJ,α Xiα σ Xi , Ti εi i=1        −1             1≤l,l′ ≤d1 ,1≤α,α′ ≤d2 , 1≤J,J ′ ≤N +1 (4.19)       1≤l≤d1 ,1≤J≤N +1,1≤α≤d2 Bernstein inequality below under geometric α-mixing is used in many proofs. 106 Lemma 4.1. [Theorem 1.4, page 31 of Bosq (1998)] Let {ξ t , t ∈ Z} be a zero mean real ∑n valued α-mixing process, Sn = t=1 ξ t . Suppose there exists c > 0 such that for t = 1, ..., n, k = 3, 4, ..., E |ξ t |k ≤ ck−2 k!Eξ 2 < +∞ (Cram´r’s condition) then for n > 1, e t integer q ∈ [1, n/2], ε > 0 and k ≥ 3 ( P (|Sn | ≥ nε) ≤ a1 exp − ) qε2 25m2 + 5cε 2 ([ + a2 (k) α n q+1 ])2k/(2k+1) , where α(·) is the α-mixing coefficient defined in (4.8) and ( ε2 n a1 = 2 + 2 1 + q 25m2 + 5cε 2 )   2k/(2k+1) 5m k , , a2 (k) = 11n 1 + ε with mr = max1≤i≤n ∥ξ t ∥r , r ≥ 2. Lemma 4.2. Under Assumptions A2, A4, A6, cf H/2 ≤ cJ,α ≤ Cf H and (i) ∃ con2 stants c0 (f ) , C0 (f ) depending on fα (xα ) , 1 ≤ α ≤ d2 , such that c0 (f ) H ≤ bJ,α ≤ C0 (f )H. (ii) uniformly for 1 ≤ J, J ′ ≤ N + 1, 1 ≤ α, α′ ≤ d2 , E BJ,α (Xα ) ≤ CH 1/2 , ( )2 ( )k ( ( ) ) E BJ,α Xiα B ′ ′ X ′ ≥ cf C −2 > 0, E BJ,α Xiα B ′ ′ X ′ ≤ J ,α iα J ,α iα f C k H 2−k , k ≥ 1. (iii) uniformly for 1 ≤ J, J ′ ≤ N + 1, 1 ≤ α ≤ d2 ,    1 J′ = J    { ( ) ( )}  ′ , X E BJ,α Xiα B ′ ∼ J ,α iα  −1/3 J − J = 1      1/6 J′ − J = 2   ( ) ( ) k  ≤ C k H 1−k J ′ − J ≤ 2 X , k ≥ 1. E BJ,α Xiα B ′ J ,α iα   0 ′−J >2 J 107 Lemma 4.3. Under Assumptions A2, A4, A6 and A7, denote ( ) ( ) ( ) ( ) ω J,α Xl , x1 = Kh Xl1 − x1 BJ,α Xlα , µJ,α (x1 ) = Eω J,α Xl , x1 , (4.20) (√ ) as n → ∞, supx ∈[0,1] sup1≤α≤d ,1≤J≤N +1 µJ,α (x1 ) = O H . 2 1 Lemma 4.4. Under Assumptions A2, A4, A6 and A7, as n → ∞, sup sup x1 ∈[0,1] 1≤α≤d2 ,1≤J≤N +1 n−1 ∑n l=1 { } { √ } ( ) ω J,α Xl , x1 − µJ,α (x1 ) = Op log n/ nh , ( ) where ω J,α Xl , x1 and µJ,α (x1 ) are given in (4.20). Lemma 4.5. Under Assumptions A2, A4-A6, as n → ∞, An,2 = sup J,J ′ ,α ( ) An,1 = supJ,α En BJ,α = Op n−1/2 log n , ⟨ ⟩ ⟨ ⟩ { } −1/2 log n , BJ,α , B ′ − BJ,α , B ′ = Op (nH) J ,α n J ,α ⟩ ⟨ ⟩ ( ) BJ,α , B ′ ′ − BJ,α , B ′ ′ = Op n−1/2 log n , J ,α n J ,α ⟨ ⟩ ⟨ ⟩ ( ) An,4 = supl,J,α Tl , BJ,α − Tl , BJ,α = Op n−1/2 log n . n (4.21) (4.22) ⟨ An,3 = sup α̸=α′ (4.23) (4.24) Lemma 4.6. Under Assumptions A2, A4-A7, denote ( ) ( ) ( ) ζ l Xi1 , Til , x1 = Kh Xi1 − x1 Til , µl (x1 ) = Eζ l Xi1 , Til , x1 , (4.25) as n → ∞, sup1≤l≤d ,x ∈[0,1] µl (x1 ) = O (1), while 1 1 sup1≤l≤d ,x ∈[0,1] n−1 1 1 { } ( ) } ζ l Xi1 , Til , x1 − µl (x1 ) = Op n−1/2 h−1/2 log n . i=1 ∑n { 108 ( For proofs of Lemmas 4.2-4.6, see Ma and Yang (2011b). Let v( ⟨ ) J ′ ,α′ ,(J,α) ) = BJ ′ ,α′ , ) ⟨ ) ⟨ ⟩ ( ) ⟨ ⟩ ( ⟩ ( ⟩ BJ,α , v ′ = T ′ , Tl , v ′ = T ′ BJ,α , v( ′ ′ ) = B ′ ′ , Tl . Del ,l l l ,(J,α) l, J ,α J ,α ,l note by V the theoretical inner product matrix of the standardized basis { } 1, tl , BJ,α (xα ) , 1 ≤ l ≤ d1 , 1 ≤ J ≤ N + 1, 1 ≤ α ≤ d2 , i.e. 
    V=    1 0T 0T d1 (d2 (N +1)) ( ) 0d vl′ ,l vl′ ,(J,α) 1 ( ) ( ) 0d (N +1) v( ′ ′ ) v( ′ ′ ) J ,α ,l J ,α ,(J,α) 2      . (4.26)    1≤l,l′ ≤d1 ,1≤α≤α′ ≤d2 , 1≤J,J ′ ≤N +1 Denote by S the inverse matrix of V     S=    1 0T 0T d1 (d2 (N +1)) ( ) sl′ ,l 0d sl′ ,(J,α) 1 ( ) ) ( 0d (N +1) s( ′ ′ ) s( ′ ′ ) J ,α ,(J,α) J ,α ,l 2      . (4.27)    1≤l,l′ ≤d1 ,1≤α≤α′ ≤d2 , 1≤J,J ′ ≤N +1 ˆ Next, we denote by V the empirical version of V, i.e.    0  ( ) ˆ = E T T V  n l′   ( )T En BJ ′ ,α′ En Tl En BJ,α ⟩ ⟨ ⟩ ⟨ T ′ , Tl T ′ , BJ,α l n ⟩ n ⟩ ⟨ l ⟨ BJ ′ ,α′ , BJ,α BJ ′ ,α′ , Tl n n 109    .    1≤l,l′ ≤d1 ,1≤α≤α′ ≤d2 , 1≤J,J ′ ≤N +1 (4.28) ˆ Lemma 4.7. Under Assumptions A2, A4-A7, for matrices V, S and V defined in (4.26), −1 (4.27), and (4.28) (i) there exist constants CV > cV > 0, CS = c−1 , cS = CV such that V cV I1+d +d (N +1) ≤ V ≤ CV I1+d +d (N +1) , 1 2 1 2 cS I1+d +d (N +1) ≤ S ≤ CS I1+d +d (N +1) . 1 2 1 2 (4.29) (ii) Define An = supg ,g ∈G ⟨g1 , g2 ⟩n − ⟨g1 , g2 ⟩ ∥g1 ∥−1 ∥g2 ∥−1 , then 1 2 ( ) An = Op n−1/2 H −1 log n . (iii) With probability approaching 1 as n → ∞, ˆ cV I1+d +d (N +1) ≤ V ≤ CV I1+d +d (N +1) , 1 2 1 2 ˆ cS I1+d +d (N +1) ≤ V−1 ≤ CS I1+d +d (N +1) . 1 2 1 2 (4.30) Lemma 4.8. Under Assumptions A2-A7, as n → ∞, n−1 ∑n ∑n ( ) ( ) ( ) σ Xi , Ti εi + max n−1 BJ,α Xiα σ Xi , Ti εi i=1 i=1 J,α ( ) ∑n ( ) −1 −1/2 log n . + max n T σ Xi , Ti εi = Op n i=1 il l For proofs of Lemmas 4.7 and 4.8, see Ma and Yang (2011b). Corollary 4.2. Under Assumptions A2-A7, as n → ∞, n−1 BT E = Op (n−1/2 N 1/2 log n), n−1 BT E ∞ = Op (n−1/2 log n). Corollary 4.2 follows from Lemma 4.8 directly. Next we study the difference between mSBK (x1 ) and mK,1 (x1 ) , both given in (4.4). ˆ ˜ { }d Denote c = c0l 1 , the decomposition (4.15) implies that mK,1 (x1 ) − mSBK (x1 ) = ˜ ˆ l=0 110 { } ˆ ΨT b (x1 ) + ΨT v (x1 ) + Ψb (x1 ) + Ψv (x1 ) /f1 (x1 ), where ) ( )( Kh Xi1 − x1 1, TT (˜m − c) , c i i=1 ) ∑n ( )( −1 T ˜ , ΨT v (x1 ) = n K X − x1 1, Ti cε i=1 h i1 ∑n ( ) ∑d2 { ( ) ( )} Ψb (x1 ) = n−1 Kh Xi1 − x1 mα Xiα − mα Xiα , ˜ i=1 α=2 ∑n ( ) ∑d2 ( ) Ψv (x1 ) = n−1 Kh Xi1 − x1 ˜α Xiα . ε i=1 α=2 ΨT b (x1 ) = n−1 ∑n (4.31) (4.32) ( ) First we show Ψb (x1 ) is uniformly of order Op n−1/2 for x1 ∈ [0, 1]. Lemma 4.9. Under Assumptions A1, A2, A4-A7, as n → ∞, ( ) ( ) supx ∈[0,1] Ψb (x1 ) = Op n−1/2 + H 2 = Op n−1/2 . 1 Lemma 4.10. Under Assumptions A1, A2 and A6, there exist functions gα ∈ G, 1 ≤ α ≤ d2 , ( ) ∑d2 such that as n → ∞, m − g + α=1 En gα (Xα ) = Op n−1/2 + H 2 , where g(x, t) = ˜ n ∑d1 ∑d2 c00 + c t + α=1 gα (xα ) and m is defined in (4.14). ˜ l=1 0l l For proofs of Lemma 4.10 and Lemma 4.9, see Ma and Yang (2011b). Next we show Ψv (x1 ) ( ) in (4.32) is uniformly of order op n−2/5 . For aJ,α given in (4.19), define an auxiliary ˜ entity ˜∗ = εα ∑N +1 a B ˜ (x ) , J=1 J,α J,α α (4.33) The ˜α (xα ) in (4.14) is the empirical centering of ˜∗ (x2 ), i.e. ε ε2 ˜α (xα ) ≡ ˜∗ (xα ) − n−1 ε εα 111 ∑n ( ) ˜∗ Xiα . εα i=1 (4.34) (2) (1) According to (4.34), we can write Ψv (x1 ) = Ψv (x1 ) − Ψv (x1 ), where ∑n ∑n ( ) ∑d2 ( ) (1) Ψv (x1 ) = n−1 Kh Xl1 − x1 n−1 ˜∗ Xiα , εα l=1 α=2 i=1 ∑n ( ) ∑d2 ( ) (2) Ψv (x1 ) = n−1 Kh Xl1 − x1 ˜∗ Xlα , εα l=1 α=2 (4.35) (4.36) ( ) (2) for ˜∗ Xiα in (4.33). By (4.19) and (4.33), Ψv (x1 ) can be rewritten as εα ∑d ∑n ∑N +1 ( ) (2) 2 Ψv (x1 ) = n−1 aJ,α ω J Xl , x1 , ˜ α=2 l=1 J=1 (4.37) ( ) (1) (2) for ω J,α Xl , x1 given in (4.20). 
Ψv (x1 ) and Ψv (x1 ) are of order } ( ) { Op N n−1 (log n)2 and Op n−1/2 log n uniformly, given in Lemmas 4.12 and 4.13. The next lemma provides the size of ˜T ˜, where ˜ is the least square solution defined by (4.16). a a a Lemma 4.11. Under Assumptions A2-A6, as n → ∞, ˜T ˜ = a2 + a a ˜0 { } ∑d ∑N +1 ∑d 2 1 a2 + ˜l a2 = Op N n−1 (log n)2 . ˜J,α α=1 l=1 J=1 (4.38) Proof. By (4.18), (4.19), (4.30), with probability approaching 1 as n → ∞, ∥˜∥ n−1 BT E ≥ a ( ) ˜T n−1 BT E = ˜T V˜ ≥ cV ∥˜∥2 with Corollary 4.2 imply ∥˜∥2 ≤ c−1 ∥˜∥ n−1 BT E = a a ˆa a a V a { } { } −1 a cV ∥˜∥ × Op N 1/2 n−1/2 log n . Thus ∥˜∥ = Op N 1/2 n−1/2 log n , n → ∞. a (1) Lemma 4.12. Under Assumptions A2-A7, as n → ∞, Ψv (x1 ) in (4.35) satisfies { } (1) supx ∈[0,1] Ψv (x1 ) = Op N n−1 (log n)2 . 1 For proof, see Ma and Yang (2011b). The vector ˜ in (4.19) is a ( )−1 ( ) ˆ ˜= V a n−1 BT E . 112 (4.39) (2) ˆ (2) Next define theoretical versions ˆ of ˜ and Ψv (x1 ) of Ψv (x1 ) in (4.37) as a a ( ) ( ) ˆ= V−1 n−1 BT E = S n−1 BT E , a ˆ (2) Ψv (x1 ) = n−1 (4.40) ∑n ∑d ∑N +1 ( ) 2 aJ,α ω J,α Xi , x1 . ˆ i=1 α=1 J=1 (4.41) (2) Lemma 4.13. Under Assumptions A2-A7, as n → ∞, Ψv (x1 ) in (4.36) satisfies ( ) ( ) (2) supx ∈[0,1] Ψv (x1 ) = Op n−1/2 log n = op n−2/5 . 1 ( ) ( ) ˆ a ˆ Proof. According to (4.39) and (4.40), one has Vˆ = V ˜, which implies V − V ˜ = a a ( ) ˆ V (ˆ − ˜). Using (4.21), (4.22), (4.23), (4.24) one obtains that ∥V (ˆ − ˜)∥ = V − V ˜ a a a a a ( ) ( ) a ≤ Op n−1/2 H −1 log n ∥˜∥. According to Lemma 4.11, ∥˜∥ = Op n−1/2 N 1/2 log n , a } { −1 N 3/2 (log n)2 . By Lemmas 4.7 and 4.11, as n → ∞, so as n → ∞, ∥V (ˆ − ˜)∥ ≤ Op n a a { } ∥ˆ − ˜∥ = Op n−1 N 3/2 (log n)2 , a a ( ) ∥ˆ∥ ≤ ∥ˆ − ˜∥ + ∥˜∥ = Op n−1/2 N 1/2 log n . a a a a (4.42) (4.43) d2 ∑ ) n ( ) ∑ N +1 ( ∑ (2) ˆ (2) sup Ψv (x1 ) − Ψv (x1 ) ≤ sup aJ,α − aJ,α n−1 ˜ ˆ ω J Xl , x1 x1 ∈[0,1] x1 ∈[0,1] α=2 J=1 l=1 = √ ( ( ( ) ) ) d2 (N + 1)Op n−1 H −3/2 log2 n Op H 1/2 = Op n−1 H −3/2 log2 n . (4.44) ˆ (2) Note that Ψv (x1 ) ≤ Q1 (x1 ) + Q2 (x1 ), where Q1 (x1 ) = Q2 (x1 ) = d2 ∑ ∑ N +1 α=2 J=1 aJ,α µJ,α (x1 ) , ˆ } ∑d ∑N +1 ∑n { ( ) 2 aJ,α n−1 ˆ ω J,α Xi , x1 − µJ,α (x1 ) . α=2 J=1 i=1 113 Using the discretization idea, we divide interval [0, 1] into Mn ∼ n equally spaced intervals with disjoint endpoints 0 = x1,0 < · · · < x1,Mn = 1, then supx ∈[0,1] Q1 (x1 ) ≤ T1 + T2 , 1 ( ) ∑d2 ∑N +1 where T1 = max1≤k≤M ˆ α=2 J=1 aJ,α µJ,α x1,k , T2 equals to n ( )} ∑d ∑N +1 { 2 supx ∈[x aJ,α µJ,α (x1 ) − aJ,α µJ,α x1,k ˆ ˆ . α=2 J=1 1 1,k−1 ,x1,k ],1≤k≤Mn   d2 N +1  d1  n ( ∑  ( ) ∑ ∑ ) 1∑ ( ) , σ Xi , Ti εi Til s(J,α),l + B ′ ′ X ′ s aJ,α = ˆ J ,α iα (J,α), J ′ ,α′   n l=1  ′ =1 J=1 i=1 α ( ) ( ) ∑N +1 ∑ ( ) aJ,α µJ,α x1,k = n−1 ˆ µJ,α x1,k s(J,α),l Til σ Xi , Ti εi J=1 i,l,J ( ) ( ) ( ∑ ) ( )B +n−1 µJ,α x1,k s X ′ σ Xi , Ti εi . ′ ,α′ J ′ ,α′ ′ ,J,J ′ iα (J,α), J i,α ( ) ( ) −1 ∑n ∑N +1 µ Define next Wα,l = max1≤k≤Mn n x1,k s(J,α),l Til σ Xi , Ti εi , i=1 J=1 J,α ( ) ( ) ( ) ∑ ∑ ( )B W = max n−1 i µJ,α x1,k s X ′ σ Xi , Ti εi . ′ ′ ′ ,α′ J ′ ,α′ α,α J,J iα (J,α), J 1≤k≤Mn ) ( ∑d2 ∑d1 ∑d2 W Wα,l + . supα,l Wα,l is bounded by So T1 ≤ α=1 l=1 α′ =1 α,α′ supl n−1 ( ) ∑N +1 ( ) Til σ Xi , Ti εi max1≤k≤M supα,l µJ,α x1,k s(J,α),l n i=1 J=1 ( ) ( ) −1/2 log n O (1) = O n−1/2 log n , = Op n p p ∑n which follows from Lemma 4.8 and Lemma 4.3. Let Dn = nθ0 , (2 + δ)−1 < θ0 < 3/8, where δ is the same as in Assumption A3. 
Define ( ) − = ε I ( ε ≤ D ) , ε+ = ε I ( ε > D ) , ε∗ = ε− − E ε− X , T , εi,D n n i i i i i,D i,D i,D i,D i i } { }T ( )T { ) ) ( BJ ′ ,α′ (Xiα′ ) ′ σ Xi , Ti ε∗ . s Ui,k = µα x1,k ′ ,α′ i,D (J,α), J J 1≤J,J ′ ≤N +1 ( 114 ∑ Denote W D ′ = max n−1 n Ui,k as truncated centered version of W . To i=1 α,α′ α,α 1≤k≤Mn ( ) −1/2 log n , note W D − W D ′ ≤ Λ1 + Λ2 , show W −W = Up n α,α′ α,α′ α,α′ α,α ) ) ( ( ) ( ) ( 1 ∑ ( )B Λ1 = max µJ,α x1,k s X ′ σ Xi , Ti E ε− Xi , Ti i,D iα (J,α), J ′ ,α′ J ′ ,α′ k n ′ i,J,J Λ2 = ( ) ( ) ( ) 1∑∑ ( )B max µJ,α x1,k s X ′ σ Xi , Ti ε+ . i,D iα (J,α), J ′ ,α′ J ′ ,α′ 1≤k≤Mn n i J,J ′ ( ) { ( ) ( )}T Let µα x1,k = µ1,α x1,k , · · · , µN +1,α x1,k , then [ )T { } ( ) Λ1 = max1≤k≤M µα x1,k s n (J,α), J ′ ,α′ J,J ′ { ( ) ( )} ] ∑n ) ( − −1 n B Xiα′ σ Xi , Ti E εi,D Xi , Ti ′ i=1 J ′ ,α′ J [ ( [ ) { } ( ) ≤ max1≤k≤Mn µα x1,k µα x1,k s × (J,α), J ′ ,α′ 1≤J,J ′ ≤N +1 [ } { ( ) ( )} ]T { ) ( − ∑n −1 ( ) n s i=1 BJ ′ ,α′ Xiα′ σ Xi , Ti E εi,D Xi , Ti J ′ (J,α), J ′ ,α′ J,J ′ { ( ) ( )}N +1 ]]1/2 ) ( − −1 ∑n B × n ≤ i=1 J ′ ,α′ Xiα′ σ Xi , Ti E εi,D Xi , Ti J ′ =1   ∑ ′ CS maxk   J,J ′ ( )T ( 2 1/2  )   )1 ∑ ( µ2 J,α x1,k  n ( ) ( ) ( BJ ′ ,α′ Xiα′ σ Xi , Ti E ε− Xi , Ti i,D    i . ( ) ( ) −(1+δ) . By Lemma By Assumption A3, E ε− Xi , Ti = E ε+ Xi , Ti ≤ Mδ Dn i,D i,D 115 4.1, sup ′ ′ n−1 J ,α ( ) ( ( ) ) ∑n B ′ ′ X ′ σ Xi , Ti = Op n−1/2 log n . So i=1 J ,α iα n ( ) ( ) ( ) 1∑ −(1+δ) sup Λ1 ≤ CS Mδ Dn N max sup µJ,α x1,k B ′ ′ X ′ σ Xi , Ti sup J ,α iα n k J,α α,α′ J ′ ,α′ i=1 { } } { −1 D−(1+δ) log2 n = O (log n)2 N n−3/2 , = Op N n p n where the last step follows from the choice of Dn . Meanwhile ( ) ∞ ∞ ∞ ∑ E E |εn |2+δ |Xn , Tn ∑ M ∑ E |εn |2+δ δ ≤ ∞, P (|εn | ≥ Dn ) ≤ = ≤ 2+δ 2+δ 2+δ Dn n=1 n=1 Dn n=1 n=1 Dn ∞ ∑ since δ > 2/3. By Borel-Cantelli Lemma, for large n, with probability 1, n−1 ( ) ( ) ( ∑N +1 ) ( )B µJ,α x1,k s Xiα′ σ Xi , Ti ε+ = 0, ′ ,α′ J ′ ,α′ ′ ,J=1 i,D (J,α), J i=1 J ∑n ( ) sup W − W D ′ ≤ sup Λ1 + sup Λ2 = Op (log n)2 N n−3/2 . Next we α,α′ α,α′ α,α′ α,α′ α,α ( ) show that W D ′ = Up n−1/2 log n . The variance of Ui,k is α,α ( µα x1,k )T { } ({ ) }T ( ) ∗ ( ) BJ,α′ (Xiα′ ) s var σ Xi , Ti εi,D (J,α), J ′ ,α′ J,J ′ 1≤J≤N +1 { }T ( ) ( ) × s µ x . (J,α), J ′ ,α′ 1≤J,J ′ ≤N +1 α 1,k ( ) is bounded by By Assumption A3, c2 v( ′ ) ( ′ ′ ) σ ′ J,α , J ,α 1≤J,J ≤N +1 ({ ) ( ) }T ( ) 2 v( )( ) var B ′ (X ′ ) σ Xi , Ti ≤ Cσ J,α iα 1≤J≤N +1 J,α′ , J ′ ,α′ 116 . ′ 1≤J,J ≤N +1 } ( ) )T { ( ) ( )( ) var Ui,k ∼ µα x1,k s v (J,α), J ′ ,α′ 1≤J,J ′ ≤N +1 J,α′ , J ′ ,α′ 1≤J,J ′ ≤N +1 ( ) ( { ( × s (J,α), J ′ ,α′ ( µα x1,k )T { ) }T 1≤J,J ′ ≤N +1 ) ( µα x1,k Vε,D ∼ } { }T ( ) ( ) ( ) s s µα x1,k Vε,D , (J,α), J ′ ,α′ J,J ′ (J,α), J ′ ,α′ J,J ′ ( )}1/2 { } ( ) { ( )T ∗ X , T . Let κ x , µα x1,k where Vε,D = var εi,D i i 1,k = µα x1,k )}2 ( ) )}2 ′2 { ( ′2 2 { ( cS c2 κ x1,k Vε,D ≤ var Ui,k ≤ CS Cσ κ x1,k Vε,D . σ Since E Ui,k r ≤ { ( ) }r−2 { }n 2 c0 κ x1,k Dn H −1/2 r!E Ui,k < +∞, for r ≥ 3, Ui,k i=1 satisfies the Cram´r’s condition with Cram´r’s constant e e ( ) c∗ = c0 κ x1,k Dn H −1/2 , hence by Lemma 4.1   P  n−1 n ∑ l=1   ( qρ2 n Ui,k ≥ ρn ≤ a1 exp − 2 + 5c∗ ρ  25m2 n ) ([ + a2 (3) α n q+1 ])6/7 , ( ) { ( ) }1/3 √ 2 ∼ κ2 x 3 x −1/2 D V where m2 Vε,D , m3 ≤ cκ H , ρn = ρ log n/ n, n ε,D 1,k 1,k   ( ) 6/7 2 5m ρn a1 = 2 n + 2 1 + , a2 (3) = 11n 1 + ρ3 . 
Similar as in Lemma 4.4, q n 25m2 +5c∗ ρn 2 ( )2 cn log n−1 ρn−1/2 log n 2 qρn ( ) ≥ ∼ log n, as n → ∞, 25m2 + 5c∗ ρn 25c∗ + 5c0 κ x1,k Dn H −1/2 ρn−1/2 log n 2 117 Taking c0 , ρ large enough, one has   n { } 1 ∑ P Ui,k > ρn−1/2 log n ≤ c log n exp −c2 ρ2 log n + Cn2−6λ0 c0 /7 ≤ n−3 , n i=1 ∑ for n large enough. Hence ∞ P n=1 ∑∞ ∑Mn k=1 n=1 ( P n−1 ( ) D −1/2 log n is bounded by W ≥ ρn α,α′ ) ∑∞ ∑n Ui,k ≥ ρn−1/2 N 1/2 log n ≤ M n−3 < ∞. i=1 n=1 n ( ) D = U n−1/2 log n , as n → ∞. Note Thus, Borel-Cantelli Lemma entails W p α,α′ ( ) √ Wα,α′ − W D ′ = Up (log n)2 N n−3/2 , then Wα,α′ = Up (log n/ n). Thus T1 ≤ ) ( α,α ∑d2 ∑d1 ∑d2 W W + . α=1 l=1 α,l α′ =1 α,α′ ( ) ( ) ( ) So as n → ∞, T1 ≤ d1 Op n−1/2 log n + d2 2 Op n−1/2 log n = Op n−1/2 log n . Employing Cauchy-Schwartz inequality and Lipschitz continuity of kernel K, Assumption A5, Lemma 4.2 (ii) and (4.43) lead to }1/2 ( ) {∑N +1 )−1 ( ) −1/2 N 1/2 log n 2 (X ) T2 ≤ d2 Op n h2 Mn = op n−1/2 . EBJ,2 12 J=1 ( ( ) Therefore, supx ∈[0,1] Q1 (x1 ) ≤ T1 + T2 = Op n−1/2 log n . Noting that 1 ( sup x1 ∈[0,1] 1/2 Q2 (x1 ) = Op n−1/2 N 1/2 (log n) d2 N 1/2 n−1/2 h−1/2 log n ) ) ( = op n−1/2 , by Cauchy-Schwartz inequality, (4.43), Lemma 4.4, Assumptions A6 and A7. Thus 118 ˆ (2) Ψv (x1 ) ≤ sup sup {Q1 (x1 ) + Q2 (x1 )} = Op x1 ∈[0,1] x1 ∈[0,1] The desired result follows from the above result and (4.44). ( n−1/2 log n ) ( ) −2/5 . = op n Lemma 4.14. Under Assumptions A2-A4, A6 and A7, as n → ∞, ( ) ( ) −1/2 log n = o n−2/5 . supx ∈[0,1] |Ψv (x1 )| = Op n p 1 Lemma 4.14 follows from Lemmas 4.12 and 4.13. Next we bound ˜T − cT , ˜T defined cm cε ( ) in (4.17), (4.16). Denote by Ir×d the matrix Ir 0 . r×d ( ) Lemma 4.15. Under Assumptions A1, A2, A4-A7, as n → ∞, ∥˜m − c∥ = op n−1/2 . c Proof. By the result on page 149 of de Boor (2001), ∃ C∞ > 0 such that for mα ∈ ′ C 1 [0, 1] with mα ∈ Lip ([0, 1] , C∞ ) , there is a function gα ∈ G such that Egα (Xα ) = ′ 0, ∥gα − mα ∥∞ ≤ C∞ mα H 2 , 1 ≤ α ≤ d2 . Then ( ) T B −1 BT m − c ˜m − c = I(1+d )×{d (N +1)} × B c 1 2 { } ( )−1 ∑d ∑d ( ) TB T c + 1 cT + 2 g X = I(1+d )×{d (N +1)} B B −c 0 l=1 l il α=1 α iα 1≤i≤n 1 2 } {∑ ( ) ( ) ( ) ∑d2 d2 T B −1 BT +I(1+d )×{d (N +1)} B − g X m X α=1 α iα 1≤i≤n α=1 α iα 1 2 ] [∑ ( )} ( ) d2 { ˆ −1 n−1 BT = I(1+d )×{d (N +1)} V mα Xiα − gα Xiα α=1 1 2 1≤i≤n ˆ with V defined in (4.28). So by Lemma 4.7, ∥˜m − c∥2 is bounded with probability apc 119 [ ] 2 −1 n−1 BT ∑d2 {m (X ) − g (X )} ˆ proaching 1 by V α iα α iα α=1 1≤i≤n [∑ ] 2 ( ) ( )} d2 { mα Xiα − gα Xiα α=1 1≤i≤n ( )2 ( )2 ∑d2 2 ∑d2 ∥g − m ∥ 2 ∑d1 −1 ∑n ≤ CS + CS α ∞ i=1 Til α=1 α α=1 ∥gα − mα ∥∞ n l=1 ) ( ( ) 2 ∑d2 ∑N +1 ∑d2 ∥g − m ∥ n−1 ∑n +CS α ∞ i=1 BJ,α′ Xiα′ α=1 α α′ =1 J=1 ( )2 { ( )2 2 ∑d2 ∥g − m ∥ −1 ∑n ≤ CS 1 + d1 sup1≤l≤d n α ∞ i=1 Til α=1 α 1 } ( ( ) )2 ∑n + (N + 1) d2 sup n−1 i=1 B ′ X ′ . 1≤α′ ≤d2 ,1≤J≤N +1 J,α iα 2 ≤ CS n−1 BT ( ) ( ) −1 ∑n 1/2 , sup n i=1 BJ,α′ Xiα′ = Oa.s. H 1≤α′ ≤d2 ,1≤J≤N +1 ∑ sup1≤l≤d n−1 n i=1 Til = Oa.s. (1), so as n → ∞, 1 [ ]2 ( ) 2 ∼ ∑d2 ∥g − m ∥ 4 ∥˜m − c∥ c α α ∞ = Op H , α=1 ( ) Lemma 4.16. Under Assumptions A1-A7, as n → ∞, ∥˜ε ∥ = Op n−1/2 . c a a a Proof. Let (Q1 , Q2 ) = I(1+d )×{d (N +1)} (ˆ, ˜ − ˆ). By (4.40), (4.42), (4.27) 1 2 ˜T = I(1+d )×{d (N +1)} ˜ = Q1 + Q2 , cε a 1 2 (4.45) } { so ∥Q2 ∥ = Op n−1 N 3/2 (log n)2 , while ( )T ∑n ∑n ( ) Q1 = n−1 σ Xi , Ti εi , n−1 ξ il′ i=1 i=1 1≤l′ ≤d1 { (4.46) } ( ) ( ) ∑ ∑ in which ξ il′ = l sl′ l Til + α,J sl′ ,(J,α) BJ,α Xiα σ Xi , Ti εi . 
Clearly Eξ il′ = 0 ( ) }]2 [{ ∑ ∑ 2 0, s , s 2 ≤ C2E × ≤ Cσ while Eξ ′ σ l sl′ l Tl + α,J sl′ ,(J,α) BJ,α (Xα ) l′ l l′ ,(J,α) il 120 ( )T ( ) 2 2 2 2 V 0, s ′ , s ′ ≤ Cσ CV 0, s ′ , s ′ ≤ Cσ CV CS . It is easily checked l l l ,(J,α) l l l ,(J,α) ( ) −1 ∑n ξ E ξ ′ ξ ′ = 0 for i ̸= i′ thus by Markov Inequality, sup ′ ≤d n i=1 il′ = il jl 1≤l 1 ( ) ∑ Op (n−1/2 ). Likewise n−1 n σ Xi , Ti εi = Op (n−1/2 ). Then, Lemma 4.16 follows i=1 from the above results. Lemma 4.17. Under Assumptions A1, A2, A4-A7, as n → ∞, supx ∈[0,1] ΨT b (x1 ) = 1 √ op (1/ n) . For proof, see Ma and Yang (2011b). Define a theoretical version of ( ) −1 ∑n K (X − x ) a + ∑d1 a T ΨT v (x1 ) = n ˜ as 1 ˜0 i=1 h i1 l=1 l il ( ) ∑d1 ( ) ∑ ∑ ˆ ΨT v (x1 ) = a0 n−1 n Kh Xi1 − x1 + ˆ al n−1 n ζ l Xi1 , Til , x1 . ˆ i=1 i=1 l=1 Lemma 4.18. Under Assumptions A1-A7, as n → ∞, } ( ) { supx ∈[0,1] ΨT v (x1 ) = Op n−1/2 (log n)4 = op n−2/5 . 1 For proof, see Ma and Yang (2011b). ( ) ( )−1 ˜ Lemma 4.19. Under Assumptions A1-A6 and A8, s ′ = cov Tn and as n → ∞, l ,l ( )−1 ( )−1 ( ) ˜ ˜ ˜ ˜ cov Tn → cov T , where s ′ defined in (4.27), Tn and T defined in (4.10). l ,l ( Proof. ) ( )−1 ˜ s′ = cov Tn is induced by basic linear algebra. By the result on page 149 l ,l of de Boor (2001), there is a constant C∞ > 0 and functions gl (x) ∈ Hn , 1 ≤ α ≤ d2 ( ) such that sup1≤l≤d gl − pl ∞ = o (1). Since ProjHn Tl = ProjHn ProjH Tl , for ∀1 ≤ 1 )2 ( { } ≤ E gl (X) − ProjH Tl 2 = l ≤ d1 by Hilbert space theory, E ProjHn Tl − ProjH Tl ( )−1 ( )−1 { } ˜ ˜ E gl (X) − pl (X) 2 = o (1), as n → ∞. Thus cov Tn → cov T , as n → ∞. Proof of Theorem 4.1. The term ΨT b (x1 ) in (4.31) has order op (n−1/2 ) and other terms have order op (n−2/5 ) by Lemmas 4.17, 4.18, 4.9, 4.14. Standard theory ensures that 121 ( ) ∑ ˆ f1 (x1 ) = n−1 n Kh Xi1 − x1 has a positive lower bound. Theorem 4.1 then follows. i=1 2 Proof of Theorem 4.2. The first part of Theorem 4.2 follows from Lemmas 4.15 and 4.16. √ √ n˜ε + n (˜m − c) = n (Q1 + Q2 ) + c c √ √ √ n (˜m − c) = nQ1 + op {1}. It is easily verified E ( nQ1 ) = 0. By (4.46), Lemma 4.19, c By (4.15), (4.45) and Lemma 4.15, √ n (ˆ − c) = c √ ( ) √ var ( nQ1 ) = n × σ 2 I(1+d )×{d (N+1)} var ˆˆT IT aa 0 (1+d1 )×{d2 (N +1)} 1 2 = n × n−1 σ 2 I(1+d )×{d (N+1)} SIT 0 2 (1+d1 )×{d2 (N +1)}  1        1  1       0T 0T d1 d1 2 2 =σ 0 ( )  → σ0  ( )−1  , as n → ∞.   0  0   ˜     sl′ ,l d1 d1 cov T Theorem 4.2 then follows by applying Theorem 1 of Sunklodas (1984). 2 122 Chapter 5 Spline Regression in the Presence of Categorical Predictors 5.1 Background This chapter is based on Ma, Racine and Yang (2011). Applied researchers must frequently model relationships involving both continuous and categorical predictors, and a range of nonparametric kernel regression methods have recently been proposed for such settings. These developments have extended the reach of kernel smoothing methods beyond the traditional continuous-only predictor case and have had a marked impact on applied nonparametric research; see Li and Racine (2007) for examples and an overview. Though kernel methods hold much appeal for practitioners, many in the applied community continue to resist their use, often for valid reasons. In particular, nonparametric kernel methods are local in nature, bandwidth selection can be fragile and numerically demanding, interpretation can be challenging, while imposing constraints on the resulting estimate can be difficult. 
123 Regression spline methods, on the other hand, are global in nature and involve straightforward least squares solutions hence unconstrained and constrained estimation is much easier to handle and faster to compute. Furthermore, their least squares underpinnings render the methods immediately accessible to those who routinely use least squares approaches. As such, regression splines provide an immediately accessible and attractive alternative to kernel methods for the nonparametric estimation of regression models. For excellent overviews of spline modeling we direct the interested reader to Stone (1985), Stone (1994), Huang (2003), and Wang and Yang (2009a). For applications of spline approaches, see Huang (1998) for functional ANOVA models, Huang and Yang (2004), Wang and Yang (2007) and Xue (2009) for additive models, Wang and Yang (2009a) and Wang (2009) for single-index models, Liu and Yang (2010) and Xue and Liang (2010) for additive coefficient models, and Ma and Yang (2011) for jump detection in regression functions. However, just like their traditional kernel-based continuous-only predictor kin, regression splines are limited by their inability to handle the presence of categorical predictors without resorting to sample-splitting which can entail a substantial loss in efficiency. In this chapter we consider a regression spline alternative motivated by recent developments in the kernel smoothing of relationships involving categorical covariates. The proposed spline approach possesses intuitive appeal by producing a piecewise polynomial, computational expedience as discussed before and theoretical reliability according to the mean square and uniform convergence rates, and the pointwise asymptotic distribution results established in this chapter. The remainder of this chapter proceeds as follows. Section 5.2 outlines the framework and presents theorems detailing rates of convergence and the asymptotic distribution of the proposed approach. Section 5.3 considers cross-validatory selection of the spline knot vector 124 and kernel bandwidth vector. Section 5.4 examines the finite-sample performance of the proposed method versus the traditional ‘frequency’ (i.e. ‘sample-splitting’) estimator, the additive regression spline estimator, and the cross-validated local linear kernel regression estimator, while proofs are to be found in the appendix. 5.2 Methods and Main Results In what follows we presume that the reader is interested in the unknown conditional mean in the following location-scale model, Y = g (X, Z) + σ (X, Z) ε, (5.1) ( ) where g(·) is an unknown function, X = X1 , . . . , Xq T is a q-dimensional vector of continuous predictors, and Z = (Z1 , . . . , Zr )T is an r-dimensional vector of categorical predictors. Letting z = (zs )r , we assume that zs takes cs different values in Ds ≡ {0, 1, . . . , cs − 1}, s=1 ( )n s = 1, . . . , r, and let cs be a finite positive constant. Let Yi , XT , ZT i i i=1 be an i.i.d copy ( ) [ ] of Y, XT , ZT . Assume for 1 ≤ l ≤ q, each Xl is distributed on a compact interval al , bl , [ ] and without loss of generality, we take all intervals al , bl = [0, 1]. In order to handle the presence of categorical predictors, we define    l (Zs , zs , λs ) = L (Z, z, λ) = 1,when Zs = zs   λ , otherwise. s , r ∏ r ∏ 1(Z ̸=z ) λs s s , s=1 l (Zs , zs , λs ) = s=1 125 (5.2) where l(·) is a variant of Aitchison and Aitken (1976) univariate categorical kernel function, L(·) is a product categorical kernel function, and λ = (λ1 , λ2 , . . . 
, λr )T is the vector of bandwidths for each of the categorical predictors. We use the tensor product splines intro[ ] duced in Section 1.4 of Chapter 1. Let B = {B (X1 ) , . . . , B (Xn )}T , where B (x) is n×Kn defined in (1.3). Then g (x, z) can be approximated by B (x)T β (z), where β (z) is a Kn × 1 vector. We estimate β (z) by minimizing the following weighted least squares criterion, β (z) = arg n }2 ( ∑{ ( ) ) min Yi − B Xi T β (z) L Zi , z, λ . β(z)∈RKn i=1 ( ) Let Lz = diag {L (Z1 , z, λ) , . . . , L (Zn , z, λ)} be a diagonal matrix with L Zi , z, λ , 1 ≤ i ≤ n as the diagonal entries. Then β (z) can be written as ( ) ( ) −1 BT L B −1 n−1 BT L Y , β (z) = n z z (5.3) where Y = (Y1 , . . . , Yn )T . g (x, z) is estimated by g (x, z) = B (x)T β (z). Given any z ∈D, for any µ ∈ (0, 1], we denote by C 0,µ [0, 1]q the space of order µ H¨lder o continuous functions on [0, 1]q , i.e., ( )   ϕ(x, z) − ϕ x′ , z C 0,µ [0, 1]q = ϕ : ∥ϕ∥0,µ,z = sup < +∞ µ   x − x′ 2 x̸=x′ ,x,x′ ∈[0,1]q   )1/2 (∑ q x2 is the Euclidean norm of x , and ∥ϕ∥0,µ,z is the C 0,µ l=1 l -norm of ϕ. Let C [0, 1]q be the space of continuous functions on [0, 1]q . Given a q-tuple ) ( α = α1 , . . . , αq of nonnegative integers, let [α] =α1 + · · · + αq and let Dα denote the ∂ [α] differential operator defined by Dα = αq . For positive numbers an and bn , α1 ∂x1 ···∂xq in which ∥x∥2 = 126 n ≥ 1, let an ≍ bn mean that limn→∞ an /bn = c, where c is some nonzero constant. The assumptions employed for the asymptotic results are listed below: (A1) For any given z ∈D, the regression function satisfies Dα g ∈ C 0,1 [0, 1]q , for all α ( ) with [α] = p − 1 and 1 ≤ p ≤ min m1 , . . . , mq . (A2) The marginal density f (x) of X satisfies f (x) ∈ C [0, 1]q and f (x) ∈ [ ] c f , Cf for constants 0 < cf ≤ Cf < ∞. There exists a constant cP > 0, such that P (Z = z |X) ≥ cP for all z ∈D. ( ) 2 |X, Z = 1. There exists a positive value (A3) The noise ε satisfies E (ε |X, Z) = 0, E ε ( ) δ > 0 and finite positive Mδ such that supx∈[0,1]q ,z∈D E |ε|2+δ |X = x, Z = z < ) ( Mδ and E |ε|2+δ < Mδ . The standard deviation function σ (x, z) is continuous on [0, 1]q × D for D =D1 × · · · × Dr and 0 < cσ ≤ inf x∈[0,1]q ,z∈D σ (x, z) ≤ supx∈[0,1]q ,z∈D σ (x, z) ≤ Cσ < ∞. ) ( (A4) The number of interior knots Nl , 1 ≤ l ≤ q satisfy as n → ∞, max1≤l≤q N −1 = l { } ∏ } { q o n−1/(2p+q) , N = o n (log n)−1 , and the bandwidths λs , 1 ≤ s ≤ r l=1 l } {( ) ∑r −1 ∏q N 1/2 . satisfy as n → ∞, s=1 λs = o n l=1 l Theorem 5.1. Under assumptions (A1)-(A4), as n → ∞,   { (∏ }1/2 q r )−1 ∑ p ∑ 1 q . λs + sup |g (x, z) − g (x, z)| = Oa.s.  h + h log n l l=1 l n q ,z∈D x∈[0,1] s=1 l=1 Theorem 5.2. Under assumptions (A1)-(A4), for σ 2 (x, z) in (5.14), as n → ∞, n σ −1 (x, z) {g (x, z) − g (x, z)} −→ N (0, 1). For any given (x, z) ∈ [0, 1]q × D, c∗ n−1 × n σ )−1 (∏ )−1 (∏ q q ∗ ∗ h , for some constants 0 < c∗ < Cσ < ∞. h ≤ σ 2 (x, z) ≤ Cσ n−1 n σ l=1 l l=1 l 127 Proofs of these theorems are presented in the appendix. Having outlined the theoretical underpinnings of the proposed method, we now consider an illustrative simulated example that demonstrates how smoothing the categorical predictors in the manner prescribed above impacts the resulting estimator g (x, z). 
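To make the estimator concrete, the following Python sketch computes the weighted least squares coefficients of (5.3) using the product categorical kernel of (5.2). It assumes the spline design matrix B has already been constructed (for instance by the tensor-product construction of Section 1.4); the function and variable names are illustrative rather than those of any particular implementation.

```python
import numpy as np

def categorical_kernel(Z, z, lam):
    """Product categorical kernel of (5.2): L(Z_i, z, lam) = prod_s lam_s^{1(Z_is != z_s)}.
    Z is n x r, z is a length-r target cell, lam is a length-r bandwidth vector."""
    return np.prod(np.where(Z != z, lam, 1.0), axis=1)

def spline_estimate(B, Y, Z, z, lam, x_basis):
    """Estimator (5.3): beta(z) = (B'L_z B)^{-1} B'L_z Y, then g_hat(x, z) = B(x)'beta(z).
    B       : n x K_n spline design matrix evaluated at the sample points X_i
    x_basis : length-K_n spline basis vector B(x) at the evaluation point x."""
    w = categorical_kernel(Z, z, lam)       # diagonal entries of L_z
    BW = B * w[:, None]                     # rows of B scaled by the kernel weights
    beta_z = np.linalg.solve(B.T @ BW, BW.T @ Y)
    return x_basis @ beta_z
```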
5.3 Cross-Validated Choice of N and λ

Cross-validation (Stone (1977)) has a rich pedigree in the regression spline arena: it has been used for decades to choose the appropriate number of interior knots and is the basis for Friedman's (1991) multivariate adaptive regression spline (MARS) methodology, among others; see Wahba (1990) for an overview in the spline context. It has also been used extensively for bandwidth selection for kernel estimators such as the local linear kernel estimator proposed by Li and Racine (2004) that appears in the simulations in Section 5.4 (see also Racine and Li (2004) for the local constant counterpart). Following in this tradition, we choose the number of interior knots (i.e. the vector N) and the smoothing parameters (i.e. the bandwidth vector λ) by minimizing the cross-validation function defined by
$$CV(N,\lambda) = n^{-1}\sum_{i=1}^{n}\left\{Y_i - B_m(X_i)^T\hat{\beta}_{-i}(Z_i)\right\}^2, \qquad (5.4)$$
where $\hat{\beta}_{-i}(Z_i)$ denotes the leave-one-out estimate of β.

To illustrate the behavior of the data-driven cross-validated selection of N and λ, we consider two simple data generating processes (DGPs) and plot the resulting cross-validated regression estimate based on the popular cubic spline basis. For each figure we regress $Y_i$ on $X_i$ and $Z_i$ using the proposed method. Figure 5.1 presents four simulated data sets and the cross-validated values of N and λ (maximum N is 15, n = 500, λ ∈ [0, 1]).

Figure 5.1: Example with n = 500 and a variety of DGPs in Chapter 5. Note: for the lower left DGP there is no difference in the function when Z = 0 versus Z = 1 and cross-validation ought to select λ = 1 (N is the number of interior knots, λ the bandwidth).

Figure 5.1 illustrates how the cross-validated choices of N and λ differ depending on the underlying DGP. For instance, the plot on the upper left is one for which $Y_i = \cos(\pi X_i) + \varepsilon_i$ if $Z_i = 0$ and $Y_i = 1 + \cos(\pi X_i) + \varepsilon_i$ if $Z_i = 1$. It is evident that N ≈ 0 and λ ≈ 0.0012 are appropriate here. The figure on the upper right is one for which $Y_i = \cos(\pi X_i) + \varepsilon_i$ regardless of the value taken on by $Z_i$. Those on the lower left and right are similar but use $\cos(4\pi X_i)$ instead. For the upper and lower figures on the right, N ≈ 0 and N ≈ 3, respectively, appear to be appropriate, while λ ≈ 1 is appropriate in both cases since $Z_i$ is independent of $Y_i$. These simple examples serve to illustrate how cross-validation delivers values of N and λ that are appropriate for the DGP at hand.

Before proceeding, a few words on the numerical optimization of (5.4) are in order. Search takes place over $N_1, \ldots, N_q$ and $\lambda_1, \ldots, \lambda_r$, where the λ are continuous, lying in [0, 1], and the N are integers. Clearly this is a mixed integer combinatorial optimization problem, which would render exhaustive search infeasible when facing a non-trivial number of predictors. However, in settings such as these one can leverage recent advances in mixed integer search algorithms, which is the avenue we pursue in the Monte Carlo simulations reported below. In particular, we adopt the 'Nonsmooth Optimization by Mesh Adaptive Direct Search' (NOMAD) approach (Abramson, Audet, Couture, Dennis Jr., and Le Digabel (2011)). Given that the objective function can be trivially computed for large sample sizes, as it involves nothing more than computing the hat matrix for weighted least squares, it turns out that the computational burden is in fact nowhere near as costly as, say, cross-validated kernel regression for moderate to large data sets.
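To make the criterion concrete, here is a minimal brute-force Python sketch of (5.4) for a fixed spline design and bandwidth vector; it simply refits the kernel-weighted spline estimator of (5.3) with each observation deleted in turn. In practice one would minimize this quantity over the knot vector N and bandwidth vector λ, e.g. with a mixed-integer search such as NOMAD as described above, or use the hat-matrix shortcut; the names below are illustrative.

```python
import numpy as np

def cv_criterion(B, Y, Z, lam):
    """Leave-one-out cross-validation criterion (5.4):
    CV = n^{-1} sum_i {Y_i - B(X_i)' beta_{-i}(Z_i)}^2,
    where beta_{-i}(Z_i) is the categorical-kernel-weighted spline fit
    computed without observation i.  B is the n x K_n spline design matrix
    (implicitly fixing the knot vector N), Z is n x r, lam has length r."""
    n = len(Y)
    sq_err = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        # product categorical kernel weights L(Z_j, Z_i, lam) for j != i
        w = np.prod(np.where(Z[keep] != Z[i], lam, 1.0), axis=1)
        BW = B[keep] * w[:, None]
        beta_minus_i = np.linalg.solve(B[keep].T @ BW, BW.T @ Y[keep])
        sq_err[i] = (Y[i] - B[i] @ beta_minus_i) ** 2
    return sq_err.mean()
```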
As such, the proposed approach constitutes a computationally attractive alternative to multivariate cross-validated kernel regression. In addition, in the next section we shall also see that the proposed approach constitutes a statistically attractive alternative as well, at least from the perspective of finite-sample square error loss.

5.4 Monte Carlo Simulations

In this section we consider a modest Monte Carlo experiment designed to assess the finite-sample performance of the proposed method. We consider two DGPs with q = 2 continuous predictors and r = 2 categorical predictors, given by
$$\text{DGP-M: } Y_i = \cos(2\pi X_{i1})\times\sin(2\pi X_{i2}) + (Z_{i1}+Z_{i2})/10 + \varepsilon_i,$$
$$\text{DGP-A: } Y_i = \cos(2\pi X_{i1}) + \sin(2\pi X_{i2}) + (Z_{i1}+Z_{i2})/10 + \varepsilon_i, \qquad (5.5)$$
where the continuous predictors are drawn from the uniform distribution ($X_j \sim U[0,1]$, j = 1, 2), the categorical predictors ($Z_j$, j = 1, 2) are drawn from the rectangular distribution with equal probability ($z_s \in \{0, 1, \ldots, c_s - 1\}$, where $c_s$ is the number of categorical outcomes, $c_s \ge 2$, s = 1, 2), and $\varepsilon \sim N(0, \sigma^2)$ with σ = 0.1. For what follows we set $c_s = c$, s = 1, 2. Observe that DGP-M is multiplicative in the continuous components while DGP-A is additive.

We draw M = 1,000 Monte Carlo replications, and for each replication we compute the cross-validated frequency estimator (i.e. that based only on the (Y, X) pairs lying in each 'cell' defined by Z), the proposed cross-validated categorical regression spline estimator, the cross-validated additive categorical regression spline estimator of Ma and Racine (2011), and the cross-validated local linear kernel estimator that is often used to model continuous and categorical predictors in a manner similar to that undertaken here (note that the local linear kernel estimator is minimax efficient and has the best boundary bias correction properties of the class of kernel regression estimators; see Li and Racine (2004) for details). For the regression spline estimators we set the spline degree vector equal to (3, 3) (a popular choice) and cross-validate the number of knots vector $(N_1, N_2)$ and the bandwidth vector $(\lambda_1, \lambda_2)$. We then compute the MSE of each estimator based upon (5.5) for each replication and report the relative median MSE over all M replications in Table 5.1. Table 5.2 reports a summary of the smoothing parameters chosen by cross-validation.

Table 5.1: Relative median MSE in Chapter 5

  DGP-M: Multiplicative Specification
  n      c   Frequency   Additive   Kernel
  500    2   0.591       0.008      0.487
  500    3   0.572       0.014      0.618
  500    4   0.519       0.020      0.718
  1000   2   0.745       0.004      0.403
  1000   3   0.591       0.008      0.494
  1000   4   0.609       0.013      0.585
  1500   2   0.815       0.003      0.359
  1500   3   0.636       0.006      0.446
  1500   4   0.626       0.010      0.518
  2000   2   0.838       0.002      0.337
  2000   3   0.722       0.005      0.408
  2000   4   0.628       0.008      0.480

  DGP-A: Additive Specification
  n      c   Frequency   Additive   Kernel
  500    2   0.554       2.028      0.489
  500    3   0.488       1.939      0.593
  500    4   0.434       1.788      0.702
  1000   2   0.776       2.072      0.417
  1000   3   0.518       2.096      0.494
  1000   4   0.523       2.072      0.571
  1500   2   0.822       2.053      0.381
  1500   3   0.660       2.119      0.446
  1500   4   0.520       2.176      0.514
  2000   2   0.852       2.087      0.368
  2000   3   0.754       2.143      0.418
  2000   4   0.572       2.226      0.478

Note: relative median MSE of the proposed spline regression estimator versus the frequency regression spline, additive regression spline, and local linear kernel regression estimators. Numbers less than one indicate superior performance of the proposed spline estimator. c denotes the number of outcomes for the discrete predictors, n the sample size.
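For reference, a single replication from (5.5) can be drawn with a few lines of Python; the sketch below uses illustrative names and covers only the data-generation step, not the competing estimators.

```python
import numpy as np

def simulate_dgp(n, c=2, sigma=0.1, multiplicative=True, seed=0):
    """One Monte Carlo sample from DGP-M or DGP-A in (5.5):
    q = 2 continuous predictors ~ U[0,1], r = 2 categorical predictors
    uniform on {0, ..., c-1}, Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    Z = rng.integers(0, c, size=(n, 2))
    if multiplicative:                          # DGP-M
        m = np.cos(2 * np.pi * X[:, 0]) * np.sin(2 * np.pi * X[:, 1])
    else:                                       # DGP-A
        m = np.cos(2 * np.pi * X[:, 0]) + np.sin(2 * np.pi * X[:, 1])
    Y = m + (Z[:, 0] + Z[:, 1]) / 10.0 + rng.normal(0.0, sigma, size=n)
    return X, Z, Y

X, Z, Y = simulate_dgp(500, c=3, multiplicative=True)
```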
Table 5.1 illustrates how, for a given sample size, the relative performance of the proposed approach that smooths the categorical predictors versus the frequency approach that breaks the data into $c_1\times c_2$ = (4, 9, 16) subsets improves as c increases, as expected (each categorical predictor takes on $c_1 = c_2 = c$ = (2, 3, 4) values). Furthermore, for a given c, as n increases the proposed estimator approaches the frequency estimator since λ → 0 as n → ∞. Table 5.1 further illustrates how the proposed cross-validated method dominates the popular local linear kernel estimator for all sample sizes and values of c considered.

Often additive spline models are recommended in applied settings due to the curse of dimensionality (the property that the multiplicative tensor regression spline has a rate of convergence that deteriorates with the number of continuous predictors, q). Observe, however, that even in small sample settings such as n = 500, if the additive model is used when additivity is not present, the square error properties of the additive regression spline model can be orders of magnitude worse than those of the multiplicative tensor regression spline model (the tensor model has roughly 1/100 the MSE of the additive model for DGP-M). Naturally, if additivity is appropriate the additive model that incorporates this information will have better finite-sample properties (the tensor model has roughly 2 times the MSE of the additive model for DGP-A, the additive DGP). Simulations not reported here for space considerations indicate that the finite-sample mean square error improvement over the kernel regression estimator holds a) whether or not there exist categorical predictors, and b) in higher-dimension settings than those reported here. Table 5.2 reveals how the cross-validated bandwidths tend to zero as n increases. These findings are consistent with the theoretical properties detailed in the appendix.

In the above simulations the tensor-based multivariate regression spline approach dominates the popular local linear kernel regression approach for the range of sample sizes and number of predictors considered. However, we caution the reader that this is not guaranteed to always be the case. The dimension of the tensor spline basis grows exponentially with the number of continuous predictors for a given order/knot combination for each predictor. Thus, for a fixed sample size n, as the number of continuous predictors q increases, degrees of freedom will quickly fall and the square error properties of the resulting estimator will naturally deteriorate. So, in small-n, large-q, low-degrees-of-freedom settings one could readily construct instances in which the kernel regression approach will display better finite-sample behavior than the regression spline approach.

Table 5.2: Median values for smoothing parameters in Chapter 5

  DGP-M: Multiplicative Specification
  n      c   N1   N2   λ1     λ2
  500    2   2    3    0.16   0.16
  500    3   2    3    0.11   0.11
  500    4   2    1    0.07   0.08
  1000   2   2    3    0.10   0.10
  1000   3   2    3    0.07   0.07
  1000   4   2    3    0.05   0.05
  1500   2   2    3    0.07   0.07
  1500   3   2    3    0.05   0.05
  1500   4   2    3    0.04   0.04
  2000   2   2    3    0.06   0.06
  2000   3   2    3    0.04   0.04
  2000   4   2    3    0.03   0.03

  DGP-A: Additive Specification
  n      c   N1   N2   λ1     λ2
  500    2   2    3    0.16   0.16
  500    3   2    3    0.11   0.11
  500    4   2    3    0.09   0.09
  1000   2   2    3    0.10   0.10
  1000   3   2    3    0.07   0.07
  1000   4   2    3    0.05   0.05
  1500   2   2    3    0.07   0.07
  1500   3   2    3    0.05   0.05
  1500   4   2    3    0.04   0.04
  2000   2   2    3    0.06   0.06
  2000   3   2    3    0.04   0.04
  2000   4   2    3    0.03   0.03

Note: median values for the number of interior knots and bandwidths for the proposed spline regression estimator. c = c1 = c2 denotes the number of outcomes for the discrete predictors, n the sample size, Nj the number of interior knots for continuous predictor Xj, j = 1, 2, and λj the bandwidth for categorical predictor Zj, j = 1, 2.
We therefore offer the following advice for the sound practical application of the methods proposed herein:

1. The proposed methods are best suited to settings in which q is not overly large and n not overly small. Our experience shows that for a range of DGPs the regression spline outperforms kernel regression when n ≥ 500 and q ≤ 5. One practical advantage is the reduced computational burden of cross-validation for regression splines versus their kernel counterpart, and in large sample settings (say, n ≥ 10,000) one can push the dimension q much higher than that considered here.

2. Of course, when the dimension of the multivariate tensor spline becomes a practical barrier to its sound application, one can always resort to additive spline models; see Ma and Racine (2011) for details. The drawback of the additive spline approach is that if the DGP is non-additive, the inefficiency of the additive spline approach can be much worse than that of the multivariate kernel approach, as clearly demonstrated above. Of course, it is a simple matter to compare the value of the cross-validation function for each of the tensor-based, additive-based, and kernel-based cross-validated approaches, and it is perfectly sensible to use this as a guide in applied settings. But our experience is that the tensor-based multivariate regression spline will indeed be competitive and ought to be part of every practitioner's toolkit.

In summary, the simulations outlined above indicate that the proposed method is capable of outperforming the frequency estimator that breaks the sample into subsets, while it provides a compelling alternative to kernel methods when faced with a mix of categorical and continuous predictors, and to additive regression spline models for general nonlinear DGPs for which additivity is not fully present.

5.5 Concluding Remarks

Applied researchers frequently must model relationships containing categorical predictors, yet may require nonparametric estimators of, say, regression functions. The traditional kernel and spline estimators break the data into subsets defined by the categorical predictors and then model the resulting relationship involving continuous predictors only. Though consistent, these approaches are acknowledged to be inefficient. In this chapter we provide an approach that combines regression splines with categorical kernel functions and is capable of overcoming the efficiency losses present in the traditional sample-splitting approach. Furthermore, the proposed approach constitutes an attractive alternative to cross-validated kernel estimators that admit categorical predictors. Theoretical underpinnings are provided and simulations are undertaken to assess the finite-sample performance of the proposed method. We hope that these methods are of interest to those modelling regression functions nonparametrically when faced with both continuous and categorical predictors.

5.6 Appendix

For any vector $\zeta = (\zeta_1, \ldots, \zeta_s) \in R^s$, denote the norm $\|\zeta\|_r = (|\zeta_1|^r + \cdots + |\zeta_s|^r)^{1/r}$, $1 \le r < +\infty$, and $\|\zeta\|_\infty = \max(|\zeta_1|, \ldots, |\zeta_s|)$. For any functions $\phi, \varphi$, define the empirical inner product and norm as $\langle\phi,\varphi\rangle_{n,L_z} = n^{-1}\sum_{i=1}^n \phi(X_i)\varphi(X_i)L(Z_i, z, \lambda)$, $\|\phi\|^2_{n,L_z} = n^{-1}\sum_{i=1}^n \phi^2(X_i)L(Z_i, z, \lambda)$.
If the functions ϕ, φ are L2 -integrable, we define the theoi=1 retical inner product and the corresponding norm as ⟨ϕ, φ⟩Lz = E {ϕ (X) φ (X) L (Z, z, λ)}, { } ∥ϕ∥2 = E ϕ2 (X) L (Z, z, λ) . We denote by the same letters c, C, any positive conLz stants without distinction. For any s × s symmetric matrix A, denote its Lr norm as ∑ ∥A∥r = maxζ∈Rs ,ζ̸=0 ∥Aζ∥r ∥ζ∥−1 . Let ∥A∥∞ = max1≤i≤s s r j=1 Aij . Let Is be the s × s identity matrix. β (z) can be decomposed into β g (z) and β ε (z), such that β (z) = β g (z) + β ε (z), and ( ) ( ) ( ) ( ) −1 BT L B −1 n−1 BT L g , β (z) = n−1 BT L B −1 n−1 BT L E , β g (z) = n z z z z ε (5.6) where g = {g (X1 , Z1 ) , . . . , g (Xn , Zn )}T , E = {σ (X1 , Zn ) ε1 , . . . , σ (Xn , Zn ) εn }T . Then g (x, z) = gg (x, z) + gε (x, z), in which gg (x, z) = B (x)T β g (z) , gε (x, z) = B (x)T β ε (z) . 136 (5.7) Next we cite a result in the Appendix of Huang (2003). Lemma 5.1. Under assumptions (A2), for any α ∈RKn , there exist constants 0 < cB < CB < ∞ that do not depend on n, such that for any z ∈ D cB ( ∏q ) h ∥α∥2 ≤ E 2 l=1 l [{ } ] ( ∏q ) T B (X) 2 ≤ C α hl ∥α∥2 . B 2 l=1 Since   r ∏  { } k(Zs ̸=zs ) k (Z, z, λ) |X = E E L λs |X ≥ P (Z = z |X ) ≥ cp > 0,   s=1 { } E Lk (Z, z, λ) |X ≤ 1, (5.8) for any integer k ≥ 1. Thus, for CB = CB and cB = cp cB , one has [{ } ] ( ∏q ) T B 2 ≤ E αT B (X) 2 ≤ C α hl ∥α∥2 , B 2 l=1 Lz [{ } ] ( ∏q ) T B 2 ≥ c E αT B (X) 2 ≥ c α h ∥α∥2 . p B 2 l=1 l Lz (5.9) Lemma 5.2. Under assumptions (A2) and (A4), as n → ∞ , ⟨ ⟩ ⟨ ⟩ Bj ,...,jq , B ′ − Bj ,...,jq , B ′ ′ ′ j1 ,...,jq j1 ,...,jq 1 1 n,Lz Lz [{ ) }1/2 ] (∏q −1 . h log n =Oa.s. n l=1 l max max ′ z∈D j ,...,jq ,j ′ ,...jq 1 1 ( ) ( ) ( ) Proof. Let ζ ′ ,...,j ′ ,i = Bj1 ,...,jq Xi Bj ′ ,...,j ′ Xi L Zi , z, λ − j1 ,...,jq ,j1 q q } 1 { ( ) ( ) ( ) ′ E Bj ,...,jq Xi B ′ ′ Xi L Zi , z, λ . When jl − jl ≤ ml − 1 for all 1 ≤ l ≤ q, j1 ,...,jq 1 137 by the properties of the B-spline basis, there exist constants 0 < cB,k < CB,k < ∞ and (∏ ) ( ) ( )k q ′ < C ′ < ∞, such that c h ≤ E Bj ,...,jq Xi B ′ 0 < cB ≤ ′ Xi B,k B l=1 l j1 ,...,jq 1 } { (∏ ) ( ) ( ) ( ) k q ′ ∏q h k ≤ E B CB,k h and cB ≤ ′ j1 ,...,jq Xi Bj ′ ,...,jq Xi l=1 l l=1 l 1 ) ( ′ ∏q h k , thus by (5.8), CB l=1 l { } ) 2 ( ) 2 Eζ 2 ′ ′ ≥ cp E Bj1 ,...,jq Xi Bj ′ ,...,j ′ Xi j1 ,...,jq ,j1 ,...,jq ,i q 1 [ { }] ( ) ( ) 2 − E Bj ,...,jq Xi B ′ ′ Xi j1 ,...,jq 1 ) ( ∏q (∏q ) ( ∏q )2 ′ h , ≥ cp cB,2 h − CB h ≥c 2 l=1 l l=1 l l=1 l ζ { } ( ∏q ) ( ) ( ) 2 Eζ 2 ≤ E Bj ,...,j Xi B 2′ Xi ≤ CB,2 hl , ′ ′ ′ q l=1 j1 ,...,jq ,j1 ,...,jq ,i j1 ,...,jq 1 k E ζ ′ ′ j1 ,...,jq ,j1 ,...,jq ,i ( [ ( ) ( ) ( )k ≤ 2k−1 E Bj ,...,jq Xi B ′ ′ Xi L Zi , z, λ j1 ,...,jq 1 { } ] ( ) ( ) ( ) k + E Bj ,...,jq Xi B ′ ′ Xi L Zi , z, λ j1 ,...,jq 1 ( ( ∏q ( ∏q ) )k ) ( ∏q ) k−1 C ′ ≤2 hl + CB hl ≤c k hl . B,k l=1 l=1 l=1 ζ k There exists a constant c = c k c−1 , such that E ζ ≤ ck!Eζ 2 ′ ′ ′ ′ j1 ,...,jq ,j1 ,...,jq ,i ζ ζ2 j1 ,...,jq ,j1 ,...,jq ,i < ∞, for k ≥ 3. Then by Bernstein’s inequality (p.24, Theorem 1.2, BOSQ(1998)), ( P n−1 ) }1/2 ) (∏q { ′ n−1 h log n ζ ′ ′ ≥ c l=1 l i=1 j1 ,...,jq ,j1 ,...,jq ,i ∑n 138     ( )  ′ n ∏q h log n  c l=1 l ≤ 2 exp − (∏ ) { (∏ ) }1/2   q q  4C  ′n h + 2c c h log n B,2 n l=1 l l=1 l ( )−1 −c′ 4CB,2 = 2n ≤ 2n−4 , for any c′ ≥ 16CB,2 , [ ∑∞ −1 ∑n ζ which implies n=1 P maxz∈D max ′ ′ ,...,j ′ n i=1 j1 ,...,jq ,j ′ ,...,jq ,i j1 ,...,jq ,j1 q 1 )r { (∏q ) }1/2 ] ∑∞ ( ′ n−1 ≥ c h log n max c K2 n−4 < ∞. 
≤2 n l=1 l n=1 1≤s≤r s Thus, the Borel-Cantelli Lemma entails that max max ′ z∈D j ,...,jq ,j ′ ,...,jq 1 1 n−1 [{ ( ∏q ) }1/2 ] −1 ζ h log n . ′ ′ = Oa.s. n i=1 j1 ,...,jq ,j1 ,...,jq ,i l=1 l ∑n ′ When jl − jl > ml − 1 for some 1 ≤ l ≤ q, ⟨ ⟩ ⟨ ⟩ Bj ,...,jq , B ′ − Bj ,...,jq , B ′ = 0. ′ ′ j1 ,...,jq j1 ,...,jq 1 1 n,Lz Lz Lemma 5.3. Under assumptions (A2) and (A4), as n → ∞, [{ }1/2 ] )−1 ( ∏q ⟨γ 1 , γ 2 ⟩n,L − ⟨γ 1 , γ 2 ⟩L z =O z . log n h n−1 An = sup sup a.s. l=1 l ∥γ 1 ∥L ∥γ 2 ∥L z∈D γ 1 ,γ 2 ∈G z z (5.10) Proof. Any γ 1 , γ 2 ∈ G can be written as γ 1 (x) = α1 B (x), γ 2 (x) = α2 B (x) for some 139 ( ) ( ) vectors α1 = αj ,...,jq ,1 ∈ RKn , α2 = αj ,...,jq ,2 ∈ RKn . ⟨γ 1 , γ 2 ⟩n,L is z 1 1    n ∑ ∑ )  ∑  ( n−1 αj ,...,jq ,1 Bj ,...,jq   αj ,...,jq ,2 Bj ,...,jq  L Zi , z, λ  1 1 1 1 i=1 j1 ,...,jq j1 ,...,jq = ∑ ⟨ ⟩ α ′ Bj ,...,jq , B ′ ′ ′ ′ ′α j1 ,...,jq j1 ,...,jq ,j1 ,...,jq j1 ,...,jq ,1 ,j1 ,...,jq ,2 1 n,Lz ⟨γ 1 , γ 2 ⟩Lz = ∑ ⟨ ⟩ . α ′ Bj ,...,jq , B ′ ′ ′ ′ ′α j1 ,...,jq j1 ,...,jq ,j1 ,...,jq j1 ,...,jq ,1 ,j1 ,...,jq ,2 1 Lz According to (5.9), one has for any z ∈ D, there exist constants 0 < c1 < C1 < ∞ and 0 < (∏ (∏ )1/2 )1/2 q q c2 < C2 < ∞, such that c1 hl ∥α1 ∥2 ≤ ∥γ 1 ∥Lz ≤ C1 hl ∥α1 ∥2 , l=1 l=1 (∏ )1/2 (∏ )1/2 q q h ∥α2 ∥2 , thus h ∥α2 ∥2 ≤ ∥γ 2 ∥L ≤ C2 c2 z l=1 l l=1 l (∏ ) (∏ ) q q h ∥α1 ∥2 ∥α2 ∥2 . Hence h ∥α1 ∥2 ∥α2 ∥2 ≤ ∥γ 1 ∥L ∥γ 1 ∥L ≤ C1 C2 c1 c2 z z l=1 l l=1 l ∥α ∥ ∥α ∥ (∏ 1 2) 2 2 × q hl ∥α1 ∥2 ∥α2 ∥2 c1 c2 l=1 ⟨ ⟩ ⟨ ⟩ max max Bj ,...,jq , B ′ − Bj ,...,j,q , B ′ ′ ′ j1 ,...,jq j1 ,...,jq 1 1 ′ z∈D j ,...,jq ,j ′ ,...jq n,Lz Lz 1 1 [{ }1/2 ] (∏q )−1 −1 =Oa.s. n . h log n l=1 l An ≤ Let Vz,n = n−1 BT Lz B = {⟨ ⟩ } , Bj ,...,jq , B ′ ′ j1 ,...,jq 1 n,Lz Kn ×Kn 140 (5.11) {⟨ Vz,n = ⟩ } Bj ,...,jq , B ′ . ′ j1 ,...,jq 1 Lz Kn ×Kn (5.12) A result in Demko (1986) is described as follows, which plays an essential role in the proof of Lemma 5.5. Lemma 5.4. For positive definite Hermitian matrices A such that A has no more than k √ 5/4 1/4 nonzero entries in each row, then A−1 ≤ 33 k A−1 ∥A∥2 . ∞ 2 Lemma 5.5. Under assumptions (A2) and (A4), there exists a constant 0 < Cm < ∞ (∏ )−1 q −1 depending only on ml , 1 ≤ l ≤ q, such that supz∈D Vz,n ≤ Cm h , where l=1 l ∞ Vz,n is defined in (5.12) (∏ ) (∏ ) q q 2 ≤ wT V w ≤C Kn , c Proof. For any vector w ∈R h ∥w∥2 h ∥w∥2 z,n B B 2 l=1 l l=1 l which follows directly from (5.9), and it is clear that Vz,n is symmetric, thus Vz,n is a positive definite Hermitian matrix . Vz,n 2 = sup w {( }1/2 ) ( ) Vz,n w T Vz,n w / ∥w∥2 2 { (∏q )( }1/2 ) ) −1 ( ≤ sup CB hl Vz,n w T Vz,n Vz,n w / ∥w∥2 2 l=1 w ( )1/2 { }1/2 (∏q ) 1/2 ∏q = CB hl sup wT Vz,n w/ ∥w∥2 ≤ CB hl 2 l=1 l=1 w −1 Similarly Vz,n ≤ 2 ) ∏q ( than 2ml − 1 l=1 ) ( −1 ∏q h −1 . By tensor spline properties, V cB z,n has no more l=1 l nonzero entries in each row, thus ( )−1 √ q q ) 1/4 5/4 ∏ ∏ ( −1 −1 −1 = Cm hl , where Vz,n ≤ 33 2ml − 1 Vz,n supz∈D Vz,n 2 2 ∞ l=1 l=1 √ q ) ∏ ( −1 2ml − 1 cB CB . Cm = 33 l=1 141 −1 Lemma 5.6. For Vz,n defined in (5.11), and for n large enough, supz∈D Vz,n ≤ ∞ )−1 (∏ q h in which Cm > 0 is the constant in Lemma 5.5. 2Cm l=1 l Proof. By Lemma 5.2, as n → ∞, sup Vz,n − Vz,n = ∞ z∈D ⟩ ⟨ ⟩ ∑ ⟨ ≤ − Bj ,...,jq , B ′ sup max Bj ,...,jq , B ′ ′ ′ j1 ,...,jq j1 ,...,jq 1 1 j1 ,...,jq ′ z∈D n,Lz Lz ′ j1 ,...jq ⟨ ⟩ ⟨ ⟩ ) 2ml − 1 sup max Bj ,...,jq , B ′ − Bj ,...,jq , B ′ ′ ′ j1 ,...,jq j1 ,...,jq 1 1 ′ ′ z∈D j1 ,...,jq ,j1 ,...jq n,Lz Lz l=1 [{ ( ∏q ) }1/2 ] −1 (5.13) = Oa.s. n h log n . 
Lemma 5.6. For $V_{z,n}$ defined in (5.11) and $n$ large enough, $\sup_{z\in D}\|V_{z,n}^{-1}\|_\infty\le 2C_m(\prod_{l=1}^q h_l)^{-1}$, in which $C_m>0$ is the constant in Lemma 5.5.

Proof. By Lemma 5.2, as $n\to\infty$,
$$\sup_{z\in D}\left\|V_{z,n}-\bar V_{z,n}\right\|_\infty=\sup_{z\in D}\max_{j_1,\ldots,j_q}\sum_{j_1',\ldots,j_q'}\left|\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{n,L_z}-\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{L_z}\right|$$
$$\le\prod_{l=1}^q(2m_l-1)\ \sup_{z\in D}\max_{j_1,\ldots,j_q,\,j_1',\ldots,j_q'}\left|\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{n,L_z}-\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{L_z}\right|=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)\log n\right\}^{1/2}\right].\tag{5.13}$$
Let $\xi=\bar V_{z,n}\eta$ for any given vector $\eta$ of dimension $K_n\times 1$; then for any given $z\in D$, $\|\bar V_{z,n}^{-1}\xi\|_\infty\le\|\bar V_{z,n}^{-1}\|_\infty\|\xi\|_\infty\le C_m(\prod_{l=1}^q h_l)^{-1}\|\xi\|_\infty$ by Lemma 5.5, and thus $\|\bar V_{z,n}\eta\|_\infty\ge C_m^{-1}(\prod_{l=1}^q h_l)\|\eta\|_\infty$. By (5.13) and $\|(V_{z,n}-\bar V_{z,n})\eta\|_\infty\le\|V_{z,n}-\bar V_{z,n}\|_\infty\|\eta\|_\infty$, for $n$ large enough $\|V_{z,n}\eta\|_\infty\ge(1/2)C_m^{-1}(\prod_{l=1}^q h_l)\|\eta\|_\infty$. Letting $\xi_1=V_{z,n}\eta$, we then have $\|V_{z,n}^{-1}\xi_1\|_\infty\le 2C_m(\prod_{l=1}^q h_l)^{-1}\|\xi_1\|_\infty$ for any given $z\in D$ and $n$ large enough. Therefore the result follows.

Lemma 5.7. Under assumptions (A2)-(A4), as $n\to\infty$,
$$\sup_{z\in D}\left\|n^{-1}B^TL_z\mathbf{E}\right\|_\infty=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)\log n\right\}^{1/2}\right].$$

Proof. Let $D_n=n^\vartheta$ with $\vartheta<1/2$, $\vartheta(2+\delta)>1$, $\vartheta(1+\delta)>1/2$, which can be satisfied since $\delta>0$. We decompose the noise variable $\varepsilon_i$ into a truncated part and a tail part, $\varepsilon_i=\varepsilon_{i,1}^{D_n}+\varepsilon_{i,2}^{D_n}+\varepsilon_{i,3}^{D_n}$, where $\varepsilon_{i,1}^{D_n}=\varepsilon_i1(|\varepsilon_i|>D_n)$, $\varepsilon_{i,2}^{D_n}=\varepsilon_i1(|\varepsilon_i|\le D_n)-\varepsilon_{i,3}^{D_n}$, and $\varepsilon_{i,3}^{D_n}=E\{\varepsilon_i1(|\varepsilon_i|\le D_n)\,|\,X_i,Z_i\}$. Since $|\varepsilon_{i,3}^{D_n}|\le E\{|\varepsilon_i|^{2+\delta}\,|\,X_i,Z_i\}/D_n^{1+\delta}=o(n^{-1/2})$, one has
$$\sup_{z\in D,\,j_1,\ldots,j_q}\left|n^{-1}\sum_{i=1}^nB_{j_1,\ldots,j_q}(X_i)L(Z_i,z,\lambda)\sigma(X_i,Z_i)\varepsilon_{i,3}^{D_n}\right|=o\left(n^{-1/2}\right).$$
The tail part vanishes almost surely, since $\sum_{n=1}^\infty P(|\varepsilon_n|>D_n)\le M_\delta\sum_{n=1}^\infty n^{-\vartheta(2+\delta)}<\infty$; the Borel-Cantelli Lemma then implies that
$$\sup_{z\in D,\,j_1,\ldots,j_q}\left|n^{-1}\sum_{i=1}^nB_{j_1,\ldots,j_q}(X_i)L(Z_i,z,\lambda)\sigma(X_i,Z_i)\varepsilon_{i,1}^{D_n}\right|=O\left(n^{-k}\right),\quad\text{for any }k>0.$$
For the truncated part, using Bernstein's inequality in Theorem 1.2 of Bosq (1998), one has, as $n\to\infty$,
$$\sup_{z\in D,\,j_1,\ldots,j_q}\left|n^{-1}\sum_{i=1}^nB_{j_1,\ldots,j_q}(X_i)L(Z_i,z,\lambda)\sigma(X_i,Z_i)\varepsilon_{i,2}^{D_n}\right|=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)\log n\right\}^{1/2}\right].$$
Therefore the result of Lemma 5.7 follows.

Lemma 5.8. Under assumptions (A2)-(A4), for $g_\varepsilon(x,z)$ in (5.7), as $n\to\infty$,
$$\sup_{x\in[0,1]^q,\,z\in D}\left|g_\varepsilon(x,z)\right|=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\log n\right\}^{1/2}\right].$$

Proof. By Theorem 5.4.2 in DeVore and Lorentz (1993), Lemma 5.6, Lemma 5.7 and the definition of $g_\varepsilon(x,z)$, one has
$$\sup_{z\in D}\left\|\beta_\varepsilon(z)\right\|_\infty=\sup_{z\in D}\left\|V_{z,n}^{-1}\left(n^{-1}B^TL_z\mathbf{E}\right)\right\|_\infty\le\sup_{z\in D}\left\|V_{z,n}^{-1}\right\|_\infty\sup_{z\in D}\left\|n^{-1}B^TL_z\mathbf{E}\right\|_\infty=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\log n\right\}^{1/2}\right],$$
$$\sup_{x\in[0,1]^q,\,z\in D}\left|g_\varepsilon(x,z)\right|\le\sup_{z\in D}\left\|\beta_\varepsilon(z)\right\|_\infty=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\log n\right\}^{1/2}\right].$$

Let $\Sigma_z=\left\{\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{W_z}\right\}_{K_n\times K_n}$, where
$$\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{W_z}=E\left\{B_{j_1,\ldots,j_q}(X)B_{j_1',\ldots,j_q'}(X)L^2(Z,z,\lambda)\sigma^2(X,Z)\right\},$$
and
$$W_z=L_z\,\mathrm{diag}\left\{\sigma^2(X_1,Z_1),\ldots,\sigma^2(X_n,Z_n)\right\}L_z=\mathrm{diag}\left\{L^2(Z_1,z,\lambda)\sigma^2(X_1,Z_1),\ldots,L^2(Z_n,z,\lambda)\sigma^2(X_n,Z_n)\right\}.$$
For $(x,z)\in[0,1]^q\times D$ define
$$\sigma_n^2(x,z)=n^{-1}B(x)^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}B(x).\tag{5.14}$$
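As a numerical aside (again a sketch in the same toy setting as before, not taken from the text), the sandwich form (5.14) can be evaluated by plugging in the empirical counterparts of $\bar V_{z,n}$ and $\Sigma_z$, as is common in practice; the constant noise variance and all names below are assumptions of the sketch.

import numpy as np

def const_basis(x, num_knots):
    # piecewise-constant B-spline basis: B_j(x) = 1{ t_{j-1} <= x < t_j }
    knots = np.linspace(0.0, 1.0, num_knots + 2)
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, num_knots)
    B = np.zeros((len(x), num_knots + 1))
    B[np.arange(len(x)), idx] = 1.0
    return B

rng = np.random.default_rng(2)
n, lam, z = 400, 0.3, 1
X = rng.uniform(size=n)
Z = rng.integers(0, 2, size=n)
sigma2 = 0.04 * np.ones(n)                        # sigma^2(X_i, Z_i), constant in this toy example
w = np.where(Z != z, lam, 1.0)                    # L(Z_i, z, lambda) with one binary predictor (r = 1)

B = const_basis(X, 5)
V = (B.T * w) @ B / n                             # empirical counterpart of V_bar_{z,n}
Sigma = (B.T * (w**2 * sigma2)) @ B / n           # empirical counterpart of Sigma_z
Vinv = np.linalg.inv(V)
Bx = const_basis(np.linspace(0, 1, 9), 5)
var_hat = np.einsum("ij,jk,ik->i", Bx, Vinv @ Sigma @ Vinv, Bx) / n   # sigma_n^2(x, z), cf. (5.14)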
Lemma 5.9. Under assumptions (A2)-(A4), for $g_\varepsilon(x,z)$ in (5.7) and $\sigma_n^2(x,z)$ in (5.14), as $n\to\infty$, $\sigma_n^{-1}(x,z)\,g_\varepsilon(x,z)\to N(0,1)$. Moreover, for any given $(x,z)\in[0,1]^q\times D$,
$$c_\sigma^*\,n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\le\sigma_n^2(x,z)\le C_\sigma^*\,n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1},$$
for some constants $0<c_\sigma^*<C_\sigma^*<\infty$.

Proof. For any given $z\in D$, by the definition of $\beta_\varepsilon(z)$ in (5.6), for any $c\in\mathbb{R}^{K_n}$ with $\|c\|_2=1$ we can write $c^T\beta_\varepsilon(z)=\sum_{i=1}^na_i\varepsilon_i$, where
$$a_i^2=n^{-2}c^TV_{z,n}^{-1}B(X_i)B(X_i)^TL^2(Z_i,z,\lambda)\sigma^2(X_i,Z_i)V_{z,n}^{-1}c.$$
For any given $z\in D$, by Lemma 5.6, as $n\to\infty$ with probability 1,
$$\max_{1\le i\le n}a_i^2\le C_\sigma^2(2C_m)^2n^{-2}\left(\prod_{l=1}^q h_l\right)^{-2}.$$
Also, as $n\to\infty$ with probability 1,
$$\sum_{i=1}^na_i^2\ge c_\sigma^2C_B^{-2}\left(\prod_{l=1}^q h_l\right)^{-2}n^{-2}\sum_{i=1}^n\left\{c^TB(X_i)\right\}^2L^2(Z_i,z,\lambda)\ge c_\sigma^2C_B^{-2}\left(\prod_{l=1}^q h_l\right)^{-2}n^{-1}E\left[\left\{c^TB(X)\right\}^2L^2(Z,z,\lambda)\right](1-A_n),$$
the proof of which follows the same pattern as in Lemmas 5.1 and 5.3; thus as $n\to\infty$ with probability 1, $\sum_{i=1}^na_i^2\ge C'n^{-1}(\prod_{l=1}^q h_l)^{-1}$. Hence for any given $z\in D$,
$$\max_{1\le i\le n}a_i^2\Big/\sum_{i=1}^na_i^2=O_{a.s.}\!\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\right\}=o_{a.s.}(1).$$
Given any $\xi>0$, one has
$$\lim_{n\to\infty}\|a\|_2^{-2}\sum_{k=1}^nE\left\{a_k^2\varepsilon_k^2\,1\left(|a_k\varepsilon_k|>\xi\|a\|_2\right)\right\}\le\lim_{n\to\infty}\|a\|_2^{-2}\sum_{k=1}^na_k^2\left\{E|\varepsilon|^{2+\delta}\right\}^{2/(2+\delta)}\left\{P\left(|\varepsilon|>\xi a_k^{-1}\|a\|_2\right)\right\}^{\delta/(2+\delta)}\le M_\delta^{2/(2+\delta)}\lim_{n\to\infty}\max_{1\le k\le n}\left\{P\left(|\varepsilon|>\xi a_k^{-1}\|a\|_2\right)\right\}^{\delta/(2+\delta)}=0;$$
thus the Lindeberg condition is satisfied. By the Lindeberg-Feller CLT, as $n\to\infty$, $\sum_{i=1}^na_i\varepsilon_i\big/\left(\sum_{i=1}^na_i^2\right)^{1/2}\to N(0,1)$. Therefore $\left[\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}\right]^{-1/2}g_\varepsilon(x,z)\to N(0,1)$, where
$$\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}=n^{-1}B(x)^TV_{z,n}^{-1}\Sigma_{n,z}V_{z,n}^{-1}B(x),$$
with $\Sigma_{n,z}=\left\{\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{W_z,n}\right\}_{K_n\times K_n}$ and
$$\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{W_z,n}=n^{-1}\sum_{i=1}^nB_{j_1,\ldots,j_q}(X_i)B_{j_1',\ldots,j_q'}(X_i)L^2(Z_i,z,\lambda)\sigma^2(X_i,Z_i).$$
By Lemma 5.3 one can show that $w^T\Sigma_zw(1-A_n)\le w^T\Sigma_{n,z}w\le w^T\Sigma_zw(1+A_n)$ and $c_\Sigma(\prod_{l=1}^q h_l)\|w\|_2^2\le w^T\Sigma_zw\le C_\Sigma(\prod_{l=1}^q h_l)\|w\|_2^2$, thus
$$w^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}w\le C_\Sigma\left(\prod_{l=1}^q h_l\right)w^T\bar V_{z,n}^{-1}\bar V_{z,n}^{-1}w\le C_\Sigma\bar c_B^{-2}\left(\prod_{l=1}^q h_l\right)^{-1}\|w\|_2^2,$$
$$w^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}w\ge c_\Sigma\left(\prod_{l=1}^q h_l\right)w^T\bar V_{z,n}^{-1}\bar V_{z,n}^{-1}w\ge c_\Sigma\bar C_B^{-2}\left(\prod_{l=1}^q h_l\right)^{-1}\|w\|_2^2.$$
For any given $(x,z)\in[0,1]^q\times D$ it follows that $\sigma_n^2(x,z)\le n^{-1}C_\Sigma\bar c_B^{-2}(\prod_{l=1}^q h_l)^{-1}\|B(x)\|_2^2\le C_\sigma^*n^{-1}(\prod_{l=1}^q h_l)^{-1}$, where $\sigma_n^2(x,z)$ is defined in (5.14), and similarly $\sigma_n^2(x,z)\ge c_\sigma^*n^{-1}(\prod_{l=1}^q h_l)^{-1}$, for some constants $0<c_\sigma^*<C_\sigma^*<\infty$. For any given $(x,z)\in[0,1]^q\times D$, $\lim_{n\to\infty}\left[\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}\right]^{1/2}\left\{\sigma_n(x,z)\right\}^{-1}=1$, since
$$\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}\le n^{-1}B(x)^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}B(x)(1+A_n)(1-A_n)^{-2},$$
$$\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}\ge n^{-1}B(x)^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}B(x)(1-A_n)(1+A_n)^{-2}.$$
Thus the result in Lemma 5.9 follows.
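The standardization in Lemma 5.9 can be examined in a small simulation: conditional on the design, $g_\varepsilon(x,z)$ is a weighted sum of the noise, and its ratio to $\sigma_n(x,z)$ should be approximately standard normal. The sketch below (illustrative only; toy setup with $q=r=1$, non-Gaussian noise, and names that are assumptions of the sketch rather than the text's) checks the sample mean and standard deviation of that ratio over Monte Carlo replications.

import numpy as np

def const_basis(x, num_knots):
    knots = np.linspace(0.0, 1.0, num_knots + 2)
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, num_knots)
    B = np.zeros((len(x), num_knots + 1))
    B[np.arange(len(x)), idx] = 1.0
    return B

rng = np.random.default_rng(3)
n, lam, z, sigma0 = 500, 0.3, 1, 0.5
X = rng.uniform(size=n)
Z = rng.integers(0, 2, size=n)
w = np.where(Z != z, lam, 1.0)                        # kernel weights L(Z_i, z, lambda)
B = const_basis(X, 5)
Bx = const_basis(np.array([0.37]), 5)                 # evaluation point x
V = (B.T * w) @ B / n
Vinv = np.linalg.inv(V)
Sigma = (B.T * (w**2 * sigma0**2)) @ B / n
sn = np.sqrt((Bx @ Vinv @ Sigma @ Vinv @ Bx.T)[0, 0] / n)   # sigma_n(x, z), cf. (5.14)

ratios = []
for _ in range(2000):                                  # Monte Carlo replications, design held fixed
    eps = rng.exponential(size=n) - 1.0                # centered, non-Gaussian noise with variance 1
    E = sigma0 * eps
    g_eps = (Bx @ Vinv @ (B.T * w) @ E / n)[0]         # g_eps(x, z) = B(x)' V^{-1} n^{-1} B'L_z E
    ratios.append(g_eps / sn)
print(np.mean(ratios), np.std(ratios))                 # expected: roughly 0 and 1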
Lemma 5.10. Under assumptions (A1), (A2) and (A4), for $g_g(x,z)$ in (5.7), as $n\to\infty$,
$$\sup_{x\in[0,1]^q,\,z\in D}\left|g_g(x,z)-g(x,z)\right|=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s\right).$$

Proof. For $1\le i\le n$, $1\le s\le r$, let $Z_{i,-s}$ be the vector $Z_i$ with its $s$-th component left out; then
$$L(Z_i,z,\lambda)=\prod_{s=1}^r\lambda_s^{1(Z_{is}\neq z_s)}=1(Z_i=z)+\sum_{s=1}^r\lambda_s1\left(Z_{is}\neq z_s,\,Z_{i,-s}=z_{-s}\right)+o\left(\sum_{s=1}^r\lambda_s\right).$$
Denote $L_z=L_{z,1}+L_{z,2}+L_{z,3}$, where $L_{z,1}=\mathrm{diag}\{1(Z_1=z),\ldots,1(Z_n=z)\}$,
$$L_{z,2}=\mathrm{diag}\left\{\sum_{s=1}^r\lambda_s1\left(Z_{1s}\neq z_s,\,Z_{1,-s}=z_{-s}\right),\ldots,\sum_{s=1}^r\lambda_s1\left(Z_{ns}\neq z_s,\,Z_{n,-s}=z_{-s}\right)\right\},$$
and $L_{z,3}=o(\sum_{s=1}^r\lambda_s)I_n$. Thus by the definition of $g_g(x,z)$ in (5.7),
$$g_g(x,z)-g(x,z)=B(x)^TV_{z,n}^{-1}\left(n^{-1}B^TL_z\mathbf{g}\right)-g(x,z)=\Pi_1+\Pi_2+\Pi_3,$$
where $\Pi_1=B(x)^TV_{z,n}^{-1}(n^{-1}B^TL_{z,1}\mathbf{g})-g(x,z)$, $\Pi_2=B(x)^TV_{z,n}^{-1}(n^{-1}B^TL_{z,2}\mathbf{g})$, and $\Pi_3=B(x)^TV_{z,n}^{-1}(n^{-1}B^TL_{z,3}\mathbf{g})$.

By Theorem 12.8 and (13.69) on p. 149 of de Boor (2001), for any $z\in D$ there exists $\beta(z)\in\mathbb{R}^{K_n}$ such that $\sup_{x\in[0,1]^q}|B(x)^T\beta(z)-g(x,z)|=O(\sum_{l=1}^qh_l^p)$. Let $\mathbf{g}_z=\{g(X_1,z),\ldots,g(X_n,z)\}^T$; since the diagonal of $L_{z,1}$ vanishes unless $Z_i=z$, one has $L_{z,1}\mathbf{g}=L_{z,1}\mathbf{g}_z$ and hence
$$\Pi_1=B(x)^TV_{z,n}^{-1}\left(n^{-1}B^TL_{z,1}\mathbf{g}_z\right)-g(x,z)=\Pi_{11}+\Pi_{12},$$
where $\Pi_{11}=B(x)^TV_{z,n}^{-1}\left[n^{-1}B^TL_{z,1}\{\mathbf{g}_z-B\beta(z)\}\right]$ and $\Pi_{12}=B(x)^TV_{z,n}^{-1}\left\{n^{-1}B^TL_{z,1}B\beta(z)\right\}-g(x,z)$. By Theorem 5.4.2 in DeVore and Lorentz (1993), Lemma 5.6 and properties of the B-spline basis,
$$\sup_{z\in D}\left\|V_{z,n}^{-1}\left[n^{-1}B^TL_{z,1}\{\mathbf{g}_z-B\beta(z)\}\right]\right\|_\infty\le\sup_{z\in D}\left\|V_{z,n}^{-1}\left(n^{-1}B^TL_{z,1}\right)\right\|_\infty\sup_{z\in D}\left\|\mathbf{g}_z-B\beta(z)\right\|_\infty=O_{a.s.}\left(\sum_{l=1}^qh_l^p\right),$$
so that
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\Pi_{11}\right|\le\sup_{z\in D}\left\|V_{z,n}^{-1}\left[n^{-1}B^TL_{z,1}\{\mathbf{g}_z-B\beta(z)\}\right]\right\|_\infty=O_{a.s.}\left(\sum_{l=1}^qh_l^p\right).$$
Since $n^{-1}B^TL_{z,1}B\beta(z)=V_{z,n}\beta(z)-n^{-1}B^T(L_{z,2}+L_{z,3})B\beta(z)$,
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\Pi_{12}\right|\le\sup_{x\in[0,1]^q,\,z\in D}\left|B(x)^T\beta(z)-g(x,z)\right|+\sup_{x\in[0,1]^q,\,z\in D}\left|B(x)^TV_{z,n}^{-1}\left\{n^{-1}B^T\left(L_{z,2}+L_{z,3}\right)B\beta(z)\right\}\right|\le O\left(\sum_{l=1}^qh_l^p\right)+\sup_{z\in D}\left\|V_{z,n}^{-1}\right\|_\infty\sup_{z\in D}\left\|n^{-1}B^TB\beta(z)\right\|_\infty O\left(\sum_{s=1}^r\lambda_s\right).$$
By Lemmas 5.1 and 5.2, one has $\sup_{z\in D}\|n^{-1}B^TB\|_\infty=O_{a.s.}(\prod_{l=1}^qh_l)$, hence
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\Pi_{12}\right|\le O\left(\sum_{l=1}^qh_l^p\right)+\sup_{z\in D}\left\|V_{z,n}^{-1}\right\|_\infty\sup_{z\in D}\left\|n^{-1}B^TB\right\|_\infty\sup_{z\in D}\left\|\beta(z)\right\|_\infty O\left(\sum_{s=1}^r\lambda_s\right)=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s\right),$$
and therefore
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\Pi_1\right|\le\sup_{x\in[0,1]^q,\,z\in D}\left(\left|\Pi_{11}\right|+\left|\Pi_{12}\right|\right)=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s\right).$$
Similarly,
$$\sup_{z\in D}\left\|V_{z,n}^{-1}\left(n^{-1}B^TL_{z,2}\mathbf{g}\right)\right\|_\infty\le\sup_{z\in D}\left\|V_{z,n}^{-1}\right\|_\infty\sup_{z\in D}\left\|n^{-1}B^T\right\|_\infty\left\|\mathbf{g}\right\|_\infty O\left(\sum_{s=1}^r\lambda_s\right)=O_{a.s.}\left(\sum_{s=1}^r\lambda_s\right),$$
and the analogous bound holds for the term involving $L_{z,3}$; thus
$$\sup_{x\in[0,1]^q,\,z\in D}\left|g_g(x,z)-g(x,z)\right|\le\sup_{x\in[0,1]^q,\,z\in D}\left(\left|\Pi_1\right|+\left|\Pi_2\right|+\left|\Pi_3\right|\right)=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s\right).$$

Proofs of Theorems 5.1 and 5.2. Theorem 5.1 follows directly from Lemmas 5.8 and 5.10, while Theorem 5.2 follows directly from Lemmas 5.9 and 5.10 together with assumption (A4).
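The exact statements of Theorems 5.1 and 5.2 appear earlier in the chapter and are not reproduced here; the way Lemmas 5.8 and 5.10 presumably combine for the uniform rate is the elementary triangle-inequality step
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\hat g(x,z)-g(x,z)\right|\le\sup_{x,z}\left|g_g(x,z)-g(x,z)\right|+\sup_{x,z}\left|g_\varepsilon(x,z)\right|=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s+\left\{\frac{\log n}{n\prod_{l=1}^qh_l}\right\}^{1/2}\right),$$
with the first term controlled by Lemma 5.10 and the second by Lemma 5.8.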
BIBLIOGRAPHY

[1] Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method. Biometrika, 63, 413–420.
[2] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.
[3] Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. Annals of Statistics, 1, 1071–1095.
[4] Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes. Springer-Verlag, New York.
[5] Breiman, L. & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–619.
[6] Cai, Z., Fan, J. and Li, R. (2000). Efficient estimation and inferences for varying-coefficient models. Journal of American Statistical Association, 95, 888–902.
[7] Caspi, A. and Moffitt, T. E. (2006). Gene-environment interactions in psychiatry: joining forces with neuroscience. Nature Reviews Neuroscience, 7, 583–590.
[8] Caspi, A., Sugden, K., Moffitt, T. E., Taylor, A., Craig, I. W., Harrington, H., McClay, J., Mill, J., Martin, J., Braithwaite, A. and Poulton, R. (2003). Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene. Science, 301, 386–389.
[9] Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, 13, 571–591.
[10] Cardot, H. and Sarda, P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis, 92, 24–41.
[11] Chatterjee, N. and Carroll, R. J. (2005). Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika, 92, 399–418.
[12] Chiang, C. T., Rice, J. A. and Wu, C. O. (2001). Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables. Journal of American Statistical Association, 96, 605–619.
[13] Claeskens, G., and Van Keilegom, I. (2003). Bootstrap confidence bands for regression curves and their derivatives. The Annals of Statistics, 31, 1852–1884.
[14] Cleveland, W. S., Grosse, E. and Shyu, W. M. (1991). Local regression models. In Statistical Models in S, J. M. Chambers and T. J. Hastie, 309–376. Pacific Grove: Wadsworth & Brooks.
[15] Cockerham, C. C. (1954). An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics, 39, 859–882.
[16] Costa, L. G. and Eaton, D. L. (2006). Gene-Environment Interactions: Fundamentals of Ecogenetics. Hoboken, NJ: John Wiley & Sons.
[17] Csörgő, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics. Academic Press, New York-London.
[18] de Boor, C. (2001). A Practical Guide to Splines. Springer-Verlag, New York.
[19] Demko, S. (1986). Spectral bounds for ∥A⁻¹∥∞. Journal of Approximation Theory, 48, 207–212.
[20] DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation. Springer-Verlag, New York.
[21] Falconer, D. S. (1952). The problem of environment and selection. Natural American, 86, 293–298.
[22] Fan, J. (1993). Local linear regression smoothers and their minimax efficiency. The Annals of Statistics, 21, 196–216.
[23] Fan, J. Q., and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
[24] Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. The Annals of Statistics, 27, 1491–1518.
[25] Fan, J. and Zhang, W. Y. (2000). Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scandinavian Journal of Statistics, 27, 715–731.
[26] Fan, J. and Zhang, W. (2008). Statistical methods with varying coefficient models. Statistics and its Interface, 1, 179–195.
[27] Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer, Berlin.
[28] Friedman, J. H. (1991). Multivariate Adaptive Regression Splines. The Annals of Statistics, 19, 1–67.
[29] Gijbels, I., Lambert, A., and Qiu, P. H. (2007). Jump-preserving regression and smoothing using local linear fitting: a compromise. Annals of the Institute of Statistical Mathematics, 59, 235–272.
[30] González-Manteiga, W., Martínez-Miranda, M. D. & Raya-Miranda, R. (2008). SiZer Map for inference with additive models. Statistics and Computing, 18, 297–312.
[31] Guo, S. W. (2000). Gene-environment interaction and the mapping of complex traits: some statistical models and their implications. Human Heredity, 50, 286–303.
[32] Hall, P. and Heckman, N. (2002). Estimating and depicting the structure of a distribution of random functions. Biometrika, 89, 145–158.
[33] Hall, P., Müller, H. G., and Wang, J. L. (2006). Properties of principal component methods for functional and longitudinal data analysis. Annals of Statistics, 34, 1493–1517.
[34] Hall, P., and Patil, P. (1995). Formulae for mean integrated squared error of nonlinear wavelet-based density estimators. Annals of Statistics, 23, 905–928.
[35] Hall, P. and Titterington, D. M. (1988). On confidence bands in nonparametric density estimation and regression. Journal of Multivariate Analysis, 27, 228–254.
[36] Härdle, W. (1989). Asymptotic maximal deviation of M-smoothers. Journal of Multivariate Analysis, 29, 163–179.
[37] Härdle, W., Hlávka, Z., and Klinke, S. (2000). XploRe Application Guide. Springer-Verlag, Berlin.
[38] Härdle, W. and Marron, J. S. (1991). Bootstrap simultaneous error bars for nonparametric regression. Annals of Statistics, 19, 778–796.
[39] Härdle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regression fits. Annals of Statistics, 21, 1926–1947.
[40] Harrison, D. & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for cleaning air. Journal of Economics and Management, 5, 81–102.
[41] Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.
[42] Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B, 55, 757–796.
[43] Hahn, L. W., Ritchie, M. D., Moore, J. H. (2003). Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics, 19, 376–382.
[44] Hoover, D., Rice, J., Wu, C. O. and Yang, L. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85, 809–822.
[45] Huang, J. Z. (1998). Projection estimation in multiple regression with application to functional ANOVA models. Annals of Statistics, 26, 242–272.
[46] Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. The Annals of Statistics, 31, 1600–1635.
[47] Huang, J. Z. & Yang, L. (2004). Identification of nonlinear additive autoregression models. Journal of the Royal Statistical Society, Series B, 66, 463–477.
[48] Huang, X., Wang, L., Yang, L., and Kravchenko, A. N. (2008). Management practice effects on relationships of grain yields with topography and precipitation. Agronomy Journal, 100, 1463–1471.
[49] Huang, J., Wu, C. and Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 14, 763–788.
[50] Izem, R. and Marron, J. S. (2007). Analysis of nonlinear modes of variation for functional data. Electronic Journal of Statistics, 1, 641–676.
[51] James, G. M., Hastie, T., and Sugar, C. (2000). Principal component models for sparse functional data. Biometrika, 87, 587–602.
[52] James, G. M. (2002). Generalized linear models with functional predictors. Journal of the Royal Statistical Society B, 64, 411–432.
[53] James, G. M. and Silverman, B. W. (2005). Functional adaptive model estimation. Journal of the American Statistical Association, 100, 565–576.
[54] James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98, 397–408.
[55] Joo, J., and Qiu, P. (2009). Jump detection in a regression curve and its derivative. Technometrics, 51, 289–305.
[56] Kang, K. H., Koo, J. Y., and Park, C. W. (2000). Kernel estimation of discontinuous regression function. Statistics and Probability Letters, 47, 277–285.
[57] Kılıç, E. (2008). Explicit formula for the inverse of a tridiagonal matrix by backward continued fractions. Applied Mathematics and Computation, 197, 345–357.
[58] Koo, J. Y. (1997). Spline estimation of discontinuous regression functions. Journal of Computational and Graphical Statistics, 6, 266–284.
[59] Kraft, P., Yen, Y. C., Stram, D. O., Morrison, J. and Gauderman, W. J. (2007). Exploiting gene-environment interaction to detect genetic associations. Human Heredity, 63, 111–119.
[60] Lamb, K., and Rizzino, A. (1998). Effects of differentiation on the transcriptional regulation of the FGF-4 gene: Critical roles played by a distal enhancer. Molecular Reproduction and Development, 51, 218–224.
[61] Leadbetter, M. R., Lindgren, G., and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag, New York.
[62] Li, S. Y., Lu, Q., Fu, W., Romero, R., and Cui, Y. (2009). A regularized regression approach for dissecting genetic conflicts that increase disease risk in pregnancy. Statistical Applications in Genetics and Molecular Biology, Vol. 8, Iss. 1, Article 45.
[63] Li, Q. (2000). Efficient estimation of additive partially linear models. International Economic Review, 41, 1073–1092.
[64] Li, Q. and J. S. Racine (2004). Cross-validated local linear nonparametric regression. Statistica Sinica, 14, 485–512.
[65] Li, Y. and Hsing, T. (2007). On rates of convergence in functional linear regression. Journal of Multivariate Analysis, 98, 1782–1804.
[66] Li, Y. and Hsing, T. (2010). Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. Annals of Statistics, 6, 3321–3351.
[67] Liang, H. (2006). Estimation in partially linear models and numerical comparisons. Computational Statistics & Data Analysis, 50, 675–687.
[68] Liang, H., Wang, S. & Carroll, R. (2007). Partially linear models with missing response variables and error-prone covariates. Biometrika, 94, 185–198.
[69] Liang, H., Thurston, S., Ruppert, D., Apanasovich, T. & Hauser, R. (2008). Additive partial linear models with measurement errors. Biometrika, 95, 667–678.
[70] Li, Q. and J. Racine (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press.
[71] Liu, R. and Yang, L. (2010). Spline-backfitted kernel smoothing of additive coefficient model. Econometric Theory, 26, 29–59.
[72] Ma, S. and Yang, L. (2011a). A jump-detecting procedure based on polynomial spline estimation. Journal of Nonparametric Statistics, 23, 67–81.
[73] Ma, S. and Yang, L. (2011b). Spline-backfitted kernel smoothing of partially linear additive model. Journal of Statistical Planning and Inference, 141, 204–219.
[74] Ma, S., Yang, L. and Carroll, R. (2011). A simultaneous confidence band for sparse longitudinal regression. Statistica Sinica, accepted.
[75] Ma, S. and Racine, J. (2011). Additive regression splines with irrelevant regressors. Manuscript.
[76] Ma, S., Racine, J. and Yang, L. (2011). Spline regression in the presence of categorical predictors. Manuscript.
[77] Mack, Y. P., and Silverman, B. W. (1982). Weak and strong uniform consistency of kernel regression estimates. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 61, 405–415.
[78] Maity, A., Carroll, R. J., Mammen, E. and Chatterjee, N. (2009). Testing in semiparametric models with interaction, with applications to gene-environment interactions. Journal of the Royal Statistical Society B, 71, 75–96.
[79] Morris, J. S. and Carroll, R. J. (2006). Wavelet-based functional mixed models. Journal of the Royal Statistical Society B, 68, 179–199.
[80] Müller, H. G. (1992). Change-points in nonparametric regression analysis. The Annals of Statistics, 20, 737–761.
[81] Müller, H. G. and Stadtmüller, U. (2005). Generalized functional linear models. Annals of Statistics, 33, 774–805.
[82] Müller, H. G., Stadtmüller, U., and Yao, F. (2006). Functional variance processes. Journal of the American Statistical Association, 101, 1007–1018.
[83] Müller, H. G., and Song, K. S. (1997). Two-stage change-point estimators in smooth regression models. Statistics & Probability Letters, 34, 323–335.
[84] Müller, H. G. and Yao, F. (2008). Functional additive models. Journal of American Statistical Association, 103, 1534–1544.
[85] Nácher, V., Ojeda, S., Cadarso-Suárez, C., Roca-Pardiñas, J. & Acuña, C. (2006). Neural correlates of memory retrieval in the prefrontal cortex. European Journal of Neuroscience, 24, 925–936.
[86] Abramson, M. A., Audet, C., Couture, G., Dennis Jr., J. E. and Le Digabel, S. (2011). The NOMAD project. Software available at http://www.gerad.ca/nomad.
[87] Park, C., and Kim, W. (2004). Estimation of a regression function with a sharp change point using boundary wavelets. Statistics & Probability Letters, 66, 435–448.
[88] Park, C., and Kim, W. (2006). Wavelet estimation of a regression function with a sharp change point in a random design. Journal of Statistical Planning and Inference, 136, 2381–2394.
[89] Peacock, M., Turner, C. H., Econs, M. J. and Foroud, T. (2002). Genetics of osteoporosis. Endocrine Reviews, 23, 303–326.
[90] Qiu, P. H., Asano, C., and Li, X. (1991). Estimation of jump regression function. Bulletin of Informatics and Cybernetics, 24, 197–212.
[91] Qiu, P. H. (1994). Estimation of the number of jumps of the jump regression functions. Communications in Statistics-Theory and Methods, 23, 2141–2155.
[92] Qiu, P. H., and Yandell, B. (1998). A local polynomial jump detection algorithm in nonparametric regression. Technometrics, 40, 141–152.
[93] Qiu, P. H. (2003). A jump-preserving curve fitting procedure based on local piecewise-linear kernel estimation. Journal of Nonparametric Statistics, 15, 437–453.
[94] Qiu, P. H. (2005). Image Processing and Jump Regression Analysis. John Wiley & Sons, New York.
[95] Qiu, P. H. (2007). Jump surface estimation, edge detection, and image restoration. Journal of the American Statistical Association, 102, 745–756.
[96] Racine, J. S. and Q. Li (2004). Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119, 99–130.
[97] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, Second Edition. Springer, New York.
[98] Roca-Pardiñas, J., Cadarso-Suárez, C., Nácher, V. & Acuña, C. (2006). Bootstrap-based methods for testing factor-by-curve interactions in generalized additive models: assessing prefrontal cortex neural activity related to decision-making. Statistics in Medicine, 25, 2483–2501.
[99] Schimek, M. (2000). Estimation and inference in partially linear models with smoothing splines. Journal of Statistical Planning and Inference, 91, 525–540.
[100] Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
[101] Shiau, J. (1987). A note on MSE coverage intervals in a partial spline model. Communications in Statistics-Theory and Methods, 16, 1851–1866.
[102] Song, Q. and Yang, L. (2009). Spline confidence bands for variance function. Journal of Nonparametric Statistics, 21, 589–609.
[103] Stone, C. J. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39, 44–47.
[104] Stone, C. J. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13, 689–705.
[105] Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics, 22, 118–184.
[106] Su, L. & Ullah, A. (2006). Profile likelihood estimation of partially linear panel data models with fixed effects. Economics Letters, 92, 75–81.
[107] Sunklodas, J. (1984). On the rate of convergence in the central limit theorem for strongly mixing random variables. Lithuanian Mathematical Journal, 24, 182–190.
[108] Tjøstheim, D. & Auestad, B. (1994). Nonparametric identification of nonlinear time series: projections. Journal of the American Statistical Association, 89, 1398–1409.
[109] Ulrich, C. M., Kampman, E., Bigler, J., Schwartz, S. M., Chen, C., Bostick, R., Fosdick, L., Beresford, S., Yasui, Y. and Potter, J. (1999). Colorectal adenomas and the C677T MTHFR polymorphism: evidence for gene-environment interaction. Cancer Epidemiology, Biomarkers & Prevention, 8, 659–668.
[110] Xia, Y. C. (1998). Bias-corrected confidence bands in nonparametric regression. Journal of the Royal Statistical Society, Series B, 60, 797–811.
[111] Xia, Y. and Li, W. K. (1999). On the estimation and testing of functional-coefficient linear models. Statistica Sinica, 3, 735–757.
[112] Xue, L., and Yang, L. J. (2006). Additive coefficient modeling via polynomial spline. Statistica Sinica, 16, 1423–1446.
[113] Xue, L. (2006). Variable selection in additive models. Statistica Sinica, 19, 1281–1296.
[114] Xue, L. and Liang, H. (2010). Polynomial spline estimation for the generalized additive coefficient model. Scandinavian Journal of Statistics, 37, 26–46.
[115] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
[116] Wang, N., Carroll, R. J., and Lin, X. (2005). Efficient semiparametric marginal estimation for longitudinal/clustered data. Journal of the American Statistical Association, 100, 147–157.
[117] Wang, L. and Yang, L. (2007). Spline-backfitted kernel smoothing of nonlinear additive autoregression model. Annals of Statistics, 35, 2474–2503.
[118] Wang, L. (2009). Single-index model-assisted estimation in survey sampling. Journal of Nonparametric Statistics, 21, 487–504.
[119] Wang, J., and Yang, L. J. (2009a). Polynomial spline confidence bands for regression curves. Statistica Sinica, 19, 325–342.
[120] Wang, J. & Yang, L. (2009b). Efficient and fast spline-backfitted kernel smoothing of additive models. Annals of the Institute of Statistical Mathematics, 61, 663–690.
[121] Wang, Y. (1995). Jump and sharp cusp detection by wavelets. Biometrika, 82, 385–397.
[122] Wu, J. S., and Chu, C. K. (1993). Kernel-type estimators of jump points and values of a regression function. The Annals of Statistics, 3, 1545–1566.
[123] Wu, W. and Zhao, Z. (2007). Inference of trends in time series. Journal of the Royal Statistical Society B, 69, 391–410.
[124] Yao, F. and Lee, T. C. M. (2006). Penalized spline models for functional principal component analysis. Journal of the Royal Statistical Society B, 68, 3–25.
[125] Yao, F., Müller, H. G., and Wang, J. L. (2005a). Functional linear regression analysis for longitudinal data. Annals of Statistics, 33, 2873–2903.
[126] Yao, F., Müller, H. G., and Wang, J. L. (2005b). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100, 577–590.
[127] Yao, F. (2007). Asymptotic distributions of nonparametric regression estimators for longitudinal or functional data. Journal of Multivariate Analysis, 98, 40–56.
[128] Zhang, J. T. and Chen, J. (2007). Statistical inferences for functional data. Annals of Statistics, 35, 1052–1079.
[129] Zhao, X., Marron, J. S., and Wells, M. T. (2004). The functional data analysis view of longitudinal data. Statistica Sinica, 14, 789–808.
[130] Zhao, Z. and Wu, W. (2008). Confidence bands in nonparametric time series regression. Annals of Statistics, 36, 1854–1878.
[131] Zhou, L., Huang, J., and Carroll, R. J. (2008). Joint modelling of paired sparse functional data using principal components. Biometrika, 95, 601–619.
[132] Zhou, S., Shen, X. and Wolfe, D. A. (1998). Local asymptotics of regression splines and confidence regions. The Annals of Statistics, 26, 1760–1782.
[133] Zhou, X. and You, J. (2004). Wavelet estimation in varying-coefficient partially linear regression models. Statistics & Probability Letters, 68, 91–104.