THEORY OF SPLINE REGRESSION WITH APPLICATIONS TO TIME SERIES, LONGITUDINAL, AND CATEGORICAL DATA, AND DATA WITH JUMPS By Shujie Ma A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Statistics 2011

ABSTRACT

THEORY OF SPLINE REGRESSION WITH APPLICATIONS TO TIME SERIES, LONGITUDINAL, AND CATEGORICAL DATA, AND DATA WITH JUMPS By Shujie Ma

Modern technological advances have led to an explosion in the collection of complex data such as functional/longitudinal data, nonlinear time series, mixed data, and data with jumps. In recent years, there has been growing interest in developing statistical tools to analyze these data, primarily because traditional parametric methods are unrealistic in applications. Non- and semi-parametric methods, as alternatives, have been widely recognized as powerful tools for complex data analysis; they relax the usual assumptions of parametric methods and enable us to explore the data more flexibly, so as to uncover data structure that might otherwise be missed. This dissertation develops statistical theory and methods in spline regression for the complex data mentioned above, with applications to medical science, finance, and economics.

In Chapter 2, procedures to detect jumps in the regression function via constant and linear splines are proposed, based on the maximal differences of the spline estimators among neighboring knots. Simulation experiments corroborate the asymptotic theory, while the computing is extremely fast. The detection procedure is illustrated by analyzing the thickness of pennies data set.

In Chapter 3, asymptotically simultaneous confidence bands are obtained for the mean function of the functional regression model, using piecewise constant spline estimation. Simulation experiments corroborate the asymptotic theory. The confidence band procedure is illustrated by analyzing CD4 cell counts of HIV infected patients.

A spline-backfitted kernel smoothing method is proposed in Chapter 4 for the partially linear additive autoregression model. Under assumptions of stationarity and geometric mixing, the proposed function and parameter estimators are oracle efficient and fast to compute. Simulation experiments confirm the asymptotic results. Application to the Boston housing data serves as a practical illustration of the method.

Chapter 5 considers the problem of estimating a relationship nonparametrically using regression splines when there exist both continuous and categorical predictors. The resulting estimator possesses substantially better finite-sample performance than either its frequency-based peer, cross-validated local linear kernel regression, or even additive regression splines (when additivity does not hold). Theoretical underpinnings are provided and Monte Carlo simulations are undertaken to assess finite-sample behavior.

I dedicate this thesis to my husband, Fulin Wang, my parents, Sanbao Ma and Xiaoli Xue, and my grandmother, Peilan Sun.

ACKNOWLEDGMENT

This thesis would not have been possible without the support of many people. First of all, I would like to express my deepest gratitude to my supervisor, Professor Lijian Yang, for his patient guidance, invaluable assistance, encouragement, and excellent advice throughout this study. His continued support led me to the right way.
I would like to thank Professor Yuehua Cui, Professor Lyudmila Sakhanenko, and Professor Jeffrey Wooldridge for serving as members of my doctoral committee and for their invaluable suggestions. Thanks also to my collaborator Professor Jeff Racine for his help and for many interesting discussions and insights. I am grateful to the entire faculty and staff in the Department of Statistics and Probability who have taught me and assisted me during my study at MSU. My special thanks go to Professor James Stapleton and Professor Dennis Gilliland for their help, support, and encouragement. I am thankful to Dr. Steven J. Pierce for accepting me as one of the statistical consultants at CSTAT, which provided me with plenty of opportunities to work with students and faculty from different disciplines. I am thankful to the Graduate School and the Department of Statistics and Probability for providing me with the Dissertation Completion Fellowship (2011), the Summer Support Fellowship (2010), and the Stapleton Fellowship for working on this dissertation. This dissertation is also supported in part by NSF award 0706518. I also thank my academic sisters and brothers: Dr. Lan Xue, Dr. Jing Wang, Dr. Li Wang, Dr. Rong Liu, Dr. Qiongxia Song, Guanqun Cao, and Shuzhuan Zheng for their generous help. Finally, I take this opportunity to express my profound gratitude to my beloved husband, my parents, and my grandmother for their love, endless support, and encouragement over all these years.

TABLE OF CONTENTS

List of Tables
List of Figures
1 Introduction
1.1 Nonparametric Regression
1.2 Partially Linear Additive Models
1.3 Polynomial Splines
1.4 Tensor Product Splines
2 A Jump-detecting Procedure based on Polynomial Spline Estimation
2.1 Introduction
2.2 Main Results
2.3 Implementation
2.4 Examples
2.4.1 Simulation example
2.4.2 Real data analysis
2.5 Appendix A
2.5.1 Preliminaries
2.5.2 Proof of Theorem 2.1 for p = 1
2.6 Appendix B
2.6.1 Preliminaries
2.6.2 Variance calculation
2.6.3 Proof of Theorem 2.1 for p = 2
3 A Simultaneous Confidence Band for Sparse Longitudinal Regression
3.1 Introduction
3.2 Main Results
3.3 Decomposition
3.4 Implementation
3.5 Simulation
3.6 Empirical Example
3.7 Discussion
3.8 Appendix
3.8.1 A.1. Preliminaries
4 Spline-backfitted Kernel Smoothing of Partially Linear Additive Model
4.1 Introduction
4.2 The SBK Estimators
4.3 Simulation
4.4 Application
4.5 Appendix
5 Spline Regression in the Presence of Categorical Predictors
5.1 Background
5.2 Methods and Main Results
5.3 Cross-Validated Choice of N and λ
5.4 Monte Carlo Simulations
5.5 Concluding Remarks
5.6 Appendix
Bibliography

LIST OF TABLES

2.1 Power calculation for the simulated example in Chapter 2
2.2 Computing time for simulated example in Chapter 2
3.1 Uniform coverage rates in Chapter 3
3.2 Uniform coverage rates and average maximal widths in Chapter 3
3.3 Confidence limits for CD4 data set
4.1 Estimation of parameters for the linear part in Chapter 4
5.1 Relative median MSE in Chapter 5
5.2 Median values for smoothing parameters in Chapter 5

LIST OF FIGURES

1.1 Spline estimate plot for the regression function of the fossil data
1.2 Confidence band plot for spinal bone mineral density
2.1 Kernel density plots of τ̂1 in Chapter 2
2.2 Kernel density plots of ĉ1 in Chapter 2
2.3 Plots of the spline estimator for c = 0 in Chapter 2
2.4 Plots of the spline estimator for c = 2 in Chapter 2
2.5 Spline estimator for the thickness of pennies data
3.1 Plots of simulated data for n = 20 in Chapter 3
3.2 Plots of simulated data for n = 50 in Chapter 3
3.3 Plots of confidence bands at 1 − α = 0.95, n = 20 in Chapter 3
3.4 Plots of confidence bands at 1 − α = 0.95, n = 50 in Chapter 3
3.5 Plots of confidence bands at 1 − α = 0.99, n = 20 in Chapter 3
3.6 Plots of confidence bands at 1 − α = 0.99, n = 50 in Chapter 3
3.7 Plots of confidence bands for CD4 data
3.8 Plots of confidence intervals for CD4 data
4.1 Kernel density plots for α = 2, d1 = 4, d2 = 3 in Chapter 4
4.2 Kernel density plots in Chapter 4
4.3 Kernel density plots for α = 2, d1 = 4, d2 = 30 in Chapter 4
4.4 Kernel density plots for α = 3, d1 = 4, d2 = 30 in Chapter 4
4.5 Plots of the estimator for the nonparametric part at α = 2, n = 200
4.6 Plots of the estimator for the nonparametric part at α = 2, n = 500
4.7 Plots of the estimator for the nonparametric part at α = 3, n = 200
4.8 Plots of the estimator for the nonparametric part at α = 3, n = 500
4.9 Plots of the estimators for RM for Boston housing data
4.10 Plots of the estimators for log(TAX) for Boston housing data
4.11 Plots of the estimators for log(LSTAT) for Boston housing data
5.1 Example with n = 500 and a variety of DGPs in Chapter 5

Chapter 1 Introduction

1.1 Nonparametric Regression

Regression analysis is one of the most widely used tools in data analysis. It models and analyzes the relationship between a dependent variable Y and a vector of independent variables X; most commonly, it estimates the conditional expectation of the dependent variable given the independent variables. Classic parametric models, although their properties are very well established, require restrictive assumptions, which has limited them in various applications: mis-specification of a parametric model can lead to large bias. Nonparametric modelling, as an alternative, reduces modeling bias by imposing no specific model structure and enables one to explore the data more flexibly. We begin by considering the nonparametric regression model

Y = m(X) + ϵ,   (1.1)

where the error term ϵ contributes roughness to the raw data, and the functional form m(·) is unknown and satisfies m(x) = E(Y | X = x). In many cases the unknown function m(·) is smooth, and we can then estimate it nonparametrically by such methods as kernel and spline smoothing. For example, Figure 1.1 shows a nonparametric smoothing curve fitted by a quadratic polynomial spline, together with the data points, for a fossil data set. The data contain ratios of strontium isotopes found in fossil shells millions of years ago, which can reflect global climate. In other cases, however, the unknown function m(·) is not smooth. Ignoring possible jumps in the regression function m(·) may result in a serious error in drawing inference about the process under study. In Chapter 2, we propose procedures to detect jumps in the regression function m(·) via constant and linear spline estimation methods in a random-design nonparametric regression model with i.i.d. data. The detection procedure is illustrated by simulation experiments and by analyzing the thickness of pennies data set. In Chapter 3, we consider a sparse functional data case, with data of the form {Xij, Yij}, 1 ≤ i ≤ n, 1 ≤ j ≤ Ni, in which Ni observations are taken for the ith subject, with Xij and Yij the jth predictor and response variables, respectively, for the ith subject, and the Ni's are i.i.d. copies of an integer-valued positive random variable.
Asymptotically simultaneous confidence bands are obtained for the mean function m(·) of the functional regression model, using piecewise constant spline estimation. For illustration, Figure 1.2 shows a confidence band and a confidence interval at the 95% confidence level, together with the piecewise constant spline estimate of the function, for a spinal bone mineral density data set.

Figure 1.1: Spline estimate plot for the regression function of the fossil data (strontium ratio versus age in million years). Note: the quadratic spline estimator (solid line) and the data points (circles) for ratios of strontium isotopes over time. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Figure 1.2: Confidence band plot for spinal bone mineral density (spinal bone mineral density versus age). Note: the constant spline estimator (middle curve), the confidence band (solid lines) and the confidence interval (thin lines) at confidence level 95% for spinal bone mineral density over age.

1.2 Partially Linear Additive Models

Nonparametric modeling imposes no specific model structure and enables one to explore the data more flexibly, but it does not perform well when the dimension of the covariates is high: the variances of the resulting estimates tend to be unacceptably large due to the sparseness of data, the so-called "curse of dimensionality". To overcome these difficulties, many semi-parametric models have been proposed and developed, among which the partially linear additive model has become very popular in data analysis; it is an important multivariate semiparametric model that allows linearity in some of the variables. A partially linear additive model (PLAM) is given as

Y_i = m(X_i, T_i) + σ(X_i, T_i) ε_i,  m(x, t) = c_00 + ∑_{l=1}^{d_1} c_{0l} t_l + ∑_{α=1}^{d_2} m_α(x_α),   (1.2)

in which the sequence {Y_i, X_i^T, T_i^T}_{i=1}^n = {Y_i, X_{i1}, ..., X_{id_2}, T_{i1}, ..., T_{id_1}}_{i=1}^n. The functions m and σ are the mean and standard deviation of the response Y_i conditional on the predictor vector {X_i, T_i}, and ε_i is a white noise conditional on {X_i, T_i}. From equation (1.2) we observe that a PLAM consists of a linear part and a nonparametric additive part, and so retains the merits of both additive and linear models; see Stone (1985) and Hastie and Tibshirani (1990) for additive models. It is much more flexible than parametric models, since it eschews many assumptions, such as finitely many unknown parameters imposed on the model and a presumed distributional structure for the data, and meanwhile it is more interpretable than nonparametric regression surfaces. It avoids the so-called "curse of dimensionality" for high dimensional models, since the unknown functions are one-dimensional. Moreover, PLAMs are more parsimonious, more flexible, and easier to interpret than purely additive models: they allow a subset of the independent variables to be discrete, while additive models only admit continuous predictors, and they allow interaction terms among the elements of the additive part to enter the linear part of the model.
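To make model (1.2) concrete, the short sketch below simulates data from one such PLAM. It is an illustration only: the intercept, linear coefficients, component functions and noise level are arbitrary choices for this sketch, not the settings used later in Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 500, 2, 2            # sample size, # linear (T) terms, # additive (X) terms

# Discrete predictors T entering linearly, continuous predictors X on [0, 1]
T = rng.integers(0, 2, size=(n, d1)).astype(float)
X = rng.uniform(0.0, 1.0, size=(n, d2))

# Illustrative choices (not from the dissertation):
c0 = 1.0                          # intercept c_00
c = np.array([0.5, -1.0])         # linear coefficients c_01, ..., c_0d1
m1 = lambda x: np.sin(2 * np.pi * x)    # additive component m_1
m2 = lambda x: x ** 2 - 1.0 / 3.0       # additive component m_2 (centered)
sigma = 0.2                       # constant conditional standard deviation

# PLAM mean m(x, t) = c_00 + sum_l c_0l t_l + sum_a m_a(x_a), then model (1.2)
mean = c0 + T @ c + m1(X[:, 0]) + m2(X[:, 1])
Y = mean + sigma * rng.standard_normal(n)
```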
In Chapter 4, we extend the "spline-backfitted kernel smoothing" (SBK) of Wang and Yang (2007) to partially linear additive autoregression models. The uniform oracle efficiency of the estimators is established by "reducing bias via undersmoothing (step one) and averaging out the variance (step two)", using the joint asymptotic properties of kernel and spline functions. The proposed SBK estimators are (i) computationally expedient, (ii) theoretically reliable, and (iii) intuitively appealing.

1.3 Polynomial Splines

Polynomial splines have been widely accepted as an appealing and effective tool for nonparametric regression estimation because of their great flexibility, fast computation, simple implementation and explicit expression, and they have been applied to a wide range of statistical problems such as hazard regression, derivative estimation, semi-parametric models, longitudinal, functional and high dimensional data analysis, jump detection, etc. In this section, I will introduce polynomial spline functions.

Without loss of generality, we take the range of X, a univariate predictor, to be [0, 1]. To introduce the spline functions, divide the finite interval [0, 1] into (N + 1) equal subintervals χ_J = [t_J, t_{J+1}), J = 0, ..., N − 1, χ_N = [t_N, 1]. A sequence of equally-spaced points {t_J}_{J=1}^N, called interior knots, is given as

t_0 = 0 < t_1 < ··· < t_N < 1 = t_{N+1}, t_J = Jh, J = 0, ..., N + 1,

in which h = 1/(N + 1) is the distance between neighboring knots. We denote by G^(p−2) = G^(p−2)[0, 1] the space of functions that are polynomials of degree p − 1 on each χ_J and have continuous (p − 2)-th derivatives. For example, G^(−1) denotes the space of functions that are constant on each χ_J, and G^(0) denotes the space of functions that are linear on each χ_J and continuous on [0, 1].

Denote by ∥ϕ∥_2 the theoretical L² norm of a function ϕ on [0, 1], ∥ϕ∥²_2 = E{ϕ²(X)} = ∫_0^1 ϕ²(x) f(x) dx = ∫_0^1 ϕ²(x) dx, and the empirical L² norm as ∥ϕ∥²_{2,n} = n^{−1} ∑_{i=1}^n ϕ²(X_i). The corresponding inner products are defined by

⟨ϕ, φ⟩ = ∫_0^1 ϕ(x) φ(x) f(x) dx = ∫_0^1 ϕ(x) φ(x) dx = E{ϕ(X) φ(X)},
⟨ϕ, φ⟩_n = n^{−1} ∑_{i=1}^n ϕ(X_i) φ(X_i),

for any L²-integrable functions ϕ, φ on [0, 1]. Clearly E⟨ϕ, φ⟩_n = ⟨ϕ, φ⟩.

We now introduce the B-spline basis {b_{J,p}(x), 1 − p ≤ J ≤ N} of G^(p−2), the space of splines of order p, for theoretical analysis. The B-spline basis of G^(−1), the space of piecewise constant splines, consists of the indicator functions of the intervals χ_J, b_{J,1}(x) = I_J(x) = I_{χ_J}(x), 0 ≤ J ≤ N. The B-spline basis of G^(0), the space of piecewise linear splines, is {b_{J,2}(x)}_{J=−1}^N with

b_{J,2}(x) = K((x − t_{J+1})/h), −1 ≤ J ≤ N, for K(u) = (1 − |u|)_+.

In Chapters 2, 3, and 4, polynomial spline techniques are applied in different circumstances.
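As a quick computational companion to the definitions above, the sketch below evaluates the constant basis b_{J,1} and the linear basis b_{J,2} on equally spaced knots and checks that the linear basis functions sum to one on [0, 1]. This is a minimal Python illustration only; the dissertation's own computations are carried out in XploRe.

```python
import numpy as np

def knots(N):
    """Equally spaced knots t_J = J*h on [0, 1] with h = 1/(N+1)."""
    h = 1.0 / (N + 1)
    return np.arange(0, N + 2) * h, h

def basis_constant(x, N):
    """b_{J,1}(x) = indicator of chi_J, J = 0, ..., N (piecewise constant splines)."""
    _, h = knots(N)
    x = np.asarray(x, dtype=float)
    J = np.minimum(np.floor(x / h).astype(int), N)   # index of chi_J containing x
    B = np.zeros((x.size, N + 1))
    B[np.arange(x.size), J] = 1.0
    return B

def basis_linear(x, N):
    """b_{J,2}(x) = K((x - t_{J+1})/h), J = -1, ..., N, with K(u) = (1 - |u|)_+."""
    t, h = knots(N)
    x = np.asarray(x, dtype=float)[:, None]
    centers = t[None, :]                             # t_{J+1} for J = -1, ..., N
    return np.clip(1.0 - np.abs((x - centers) / h), 0.0, None)

# The linear basis is a partition of unity on [0, 1]: the rows below all sum to 1
x = np.linspace(0.0, 1.0, 11)
print(basis_linear(x, N=4).sum(axis=1))
```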
1.4 Tensor Product Splines

In Section 1.3, we introduced spline smoothing and B-spline functions for univariate variables. In this section, we generalize B-splines to functions of multivariate variables by the tensor product construction. Let X = (X_1, ..., X_q)^T be a q-dimensional vector of continuous predictors. Assume that for 1 ≤ l ≤ q, each X_l is distributed on a compact interval [a_l, b_l], and without loss of generality we take all intervals [a_l, b_l] = [0, 1]. Let G_l = G_l^(m_l−2) be the space of polynomial splines of order m_l and pre-select an integer N_l = N_{n,l}. Divide [0, 1] into (N_l + 1) subintervals I_{J_l,l} = [t_{J_l,l}, t_{J_l+1,l}), J_l = 0, ..., N_l − 1, I_{N_l,l} = [t_{N_l,l}, 1], where {t_{J_l,l}}_{J_l=1}^{N_l} is a sequence of equally-spaced points, called interior knots, given as

t_{−(m_l−1),l} = ··· = t_{0,l} = 0 < t_{1,l} < ··· < t_{N_l,l} < 1 = t_{N_l+1,l} = ··· = t_{N_l+m_l,l},

in which t_{J_l,l} = J_l h_l, J_l = 0, 1, ..., N_l + 1, and h_l = 1/(N_l + 1) is the distance between neighboring knots. Then G_l consists of functions ϖ satisfying (i) ϖ is a polynomial of degree m_l − 1 on each of the subintervals I_{J_l,l}, J_l = 0, ..., N_l; (ii) for m_l ≥ 2, ϖ is m_l − 2 times continuously differentiable on [0, 1]. Let K_l = K_{n,l} = N_l + m_l, where N_l is the number of interior knots and m_l is the spline order, and let B_l(x_l) = {B_{J_l,l}(x_l) : 1 − m_l ≤ J_l ≤ N_l}^T be a basis system of the space G_l. We define the space of tensor-product polynomial splines by G = ⊗_{l=1}^q G_l. It is clear that G is a linear space of dimension K_n = ∏_{l=1}^q K_l. Then

B(x) = [{B_{J_1,...,J_q}(x)}_{J_1=1−m_1,...,J_q=1−m_q}^{N_1,...,N_q}]_{K_n×1} = B_1(x_1) ⊗ ··· ⊗ B_q(x_q)   (1.3)

is a basis system of the space G, where x = (x_l)_{l=1}^q.

In Chapter 5, tensor product splines and categorical kernel functions, which will be introduced in that chapter, are applied to study nonparametric regression with multivariate continuous and categorical variables.

Chapter 2 A Jump-detecting Procedure based on Polynomial Spline Estimation

2.1 Introduction

This chapter is based on Ma and Yang (2011a). In applications of regression methods, ignoring possible jump points may result in a serious error in drawing inference about the process under study. Whenever no appropriate parametric method is available, one may start from nonparametric regression. Two popular nonparametric techniques are kernel and spline smoothing. For properties of kernel estimators in the absence of jump points, see Mack and Silverman (1982), Fan and Gijbels (1996), Xia (1998) and Claeskens and Van Keilegom (2003); for spline estimators, see Zhou, Shen and Wolfe (1998), Huang (2003) and Wang and Yang (2009a).

One is often interested in detecting jump points and estimating a regression function with jumps. We assume that the observations {(X_i, Y_i)}_{i=1}^n and unobserved errors {ε_i}_{i=1}^n are i.i.d. copies of (X, Y, ε) satisfying the regression model

Y = m(X) + σε,   (2.1)

where the joint distribution of (X, ε) satisfies Assumptions (A3) and (A4) in Section 2.2. The unknown mean function m(x), defined on an interval [a, b], may have a finite number of jump points. Jump regression analysis started in the early 1990s and has become an important research topic in statistics. See, for instance, Qiu, Asano and Li (1991), Müller (1992), Wu and Chu (1993), Qiu (1994) and Qiu and Yandell (1998) for procedures that detect the jumps explicitly before estimating the regression curve; Kang, Koo and Park (2000) for comparing two estimators of the regression curve after the jump points are detected; Qiu (2003) and Gijbels, Lambert and Qiu (2007) for jump-preserving curve estimators; and Joo and Qiu (2009) for jump detection in not only the regression curve but also its derivatives. For a comprehensive view on jump regression, see Qiu (2005). Jump detection has been tackled with many techniques, including local polynomial smoothing [Qiu (2003) and Gijbels et al. (2007)], smoothing splines [Shiau (1987)], and wavelet methods [Hall and Patil (1995), Wang (1995) and Park and Kim (2004, 2006)]; for two-dimensional cases, see Qiu (2007).
We propose a spline smoothing method to detect jumps by solving one optimization problem over the range of x instead of at each point, which is computationally more expedient than the kernel-type method in Müller (1992). A spline method was also discussed in Koo (1997), which proposed estimating a discontinuous regression function without providing theoretical justification. In contrast, the asymptotic distributions in Theorem 2.1 are established by using the strong approximation results in Wang and Yang (2009a), the Normal Comparison Lemma in Leadbetter, Lindgren and Rootzén (1983), and a convenient formula from Kılıç (2008) for inverting a tridiagonal matrix. The automatic procedures proposed for detecting jumps are based on implementing the asymptotics of Theorem 2.1.

This chapter is organized as follows. Section 2.2 states the main theoretical results based on (piecewise) constant and linear splines. Section 2.3 provides steps to implement the procedure based on the asymptotic result. Section 2.4 reports findings in both simulation and real data studies. All technical proofs are contained in the Appendices.

2.2 Main Results

We denote the space of p-th order smooth functions as C^(p)[a, b] = {φ : φ^(p) ∈ C[a, b]}, for p = 1, 2. Without loss of generality, we take the range of X to be [0, 1]. We use the spline functions introduced in Section 1.3 of Chapter 1. Define the spline estimator based on data {(X_i, Y_i)}_{i=1}^n drawn from model (2.1) as

m̂_p(x) = argmin_{g ∈ G^(p−2)[0,1]} ∑_{i=1}^n {Y_i − g(X_i)}²,  p = 1, 2.   (2.2)

The unknown function m(x) in (2.1) may be smooth, or may have jump points {τ_i}_{i=1}^k, for 0 = τ_0 < τ_1 < ··· < τ_k < τ_{k+1} = 1. The technical assumptions are listed as follows:

(A1) There exist a function m_0(x) ∈ C^(p)[0, 1] and a vector c = (c_1, ..., c_k) of jump magnitudes such that the regression function m(x) = c_l + m_0(x), x ∈ [τ_l, τ_{l+1}), for l = 1, ..., k − 1, m(x) = m_0(x), x ∈ [τ_0, τ_1), and m(x) = c_k + m_0(x), x ∈ [τ_k, τ_{k+1}].

(A2) The number of interior knots N = o(n^{1/(2p+1)+ϑ}) for any ϑ > 0, while N^{−1} = o(n^{−1/(2p+1)}/log n).

(A3) X is uniformly distributed on the interval [0, 1], i.e. the density function of X is f(x) = I(0 ≤ x ≤ 1).

(A4) The joint distribution F(x, ε) of the random variables (X, ε) satisfies the following: (a) The error is a white noise: E(ε | X = x) = 0, E(ε² | X = x) = 1. (b) There exist a positive value η > 1 and a finite positive M_η such that E|ε|^{2+η} < M_η and sup_{x∈[0,1]} E(|ε|^{2+η} | X = x) < M_η.

Assumption (A1) is similar to Müller and Song (1997). Assumption (A2) is similar to the undersmoothing condition in Claeskens and Van Keilegom (2003); thus the subinterval length h = o(n^{−1/(2p+1)}/log n) while h^{−1} = o(n^{1/(2p+1)+ϑ}) for any ϑ > 0. The uniform distribution of X in Assumption (A3) is for simplicity of the proofs, and can be relaxed to any distribution with a continuous and positive density function on [0, 1]. Assumption (A4) is identical to (C2)(a) of Mack and Silverman (1982). All are typical assumptions for nonparametric regression, with Assumption (A4) weaker than the corresponding assumption in Härdle (1989).
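As a computational illustration of the estimator in (2.2), the sketch below fits the constant (p = 1) and linear (p = 2) spline estimators by ordinary least squares on the bases of Section 1.3. It is a schematic Python translation rather than the XploRe implementation used in this chapter; the knot number is passed in by hand rather than chosen by the rules of Section 2.3, and the toy data are arbitrary.

```python
import numpy as np

def design(x, N, p):
    """Spline design matrix on [0,1]: p=1 constant basis b_{J,1}, p=2 linear basis b_{J,2}."""
    h = 1.0 / (N + 1)
    x = np.asarray(x, dtype=float)
    if p == 1:
        J = np.minimum((x / h).astype(int), N)
        B = np.zeros((x.size, N + 1))
        B[np.arange(x.size), J] = 1.0
        return B
    knots = np.arange(0, N + 2) * h                  # t_0, ..., t_{N+1}
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)

def spline_fit(X, Y, N, p):
    """Least-squares spline estimator of (2.2); returns a callable \\hat m_p."""
    B = design(X, N, p)
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)     # minimizes sum_i (Y_i - g(X_i))^2
    return lambda xg: design(xg, N, p) @ coef

# Toy usage with a jump of size 2 at x = 0.5 (illustrative data only)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 400)
Y = np.sin(2 * np.pi * X) + 2.0 * (X >= 0.5) + 0.2 * rng.standard_normal(400)
m_hat = spline_fit(X, Y, N=10, p=2)
print(m_hat(np.array([0.49, 0.51])))                 # estimates on either side of the jump
```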
We use the B-spline basis introduced in Section 1.3 of Chapter 1 for theoretical analysis. Define next the theoretical norms of the spline functions:

c_{J,n} = ∥b_{J,1}∥²_2 = ∫_0^1 I²_J(x) dx = h, 0 ≤ J ≤ N,
d_{J,n} = ∥b_{J,2}∥²_2 = ∫_0^1 K²((x − t_{J+1})/h) dx = 2h/3 for 0 ≤ J ≤ N − 1, and h/3 for J = −1, N,
⟨b_{J,2}, b_{J′,2}⟩ = ∫_0^1 K((x − t_{J+1})/h) K((x − t_{J′+1})/h) dx = h/6 if |J − J′| = 1, and 0 if |J − J′| > 1.   (2.3)

We introduce the rescaled B-spline basis {B_{J,p}(x)}_{J=1−p}^N for G^(p−2),

B_{J,p}(x) ≡ b_{J,p}(x) ∥b_{J,p}∥_2^{−1}, J = 1 − p, ..., N.   (2.4)

The inner product matrix V of the B-spline basis {B_{J,2}(x)}_{J=−1}^N is denoted as

V = (v_{J,J′})_{J,J′=−1}^N = (⟨B_{J,2}, B_{J′,2}⟩)_{J,J′=−1}^N =
\begin{pmatrix}
1 & \sqrt{2}/4 & & & & 0 \\
\sqrt{2}/4 & 1 & 1/4 & & & \\
 & 1/4 & 1 & \ddots & & \\
 & & \ddots & \ddots & 1/4 & \\
 & & & 1/4 & 1 & \sqrt{2}/4 \\
0 & & & & \sqrt{2}/4 & 1
\end{pmatrix}_{(N+2)\times(N+2)},   (2.5)

which is computed via (2.3). Denote the inverse of V by S; for J = 1, ..., N, the 3 × 3 submatrices of S are expressed as

S = (s_{J,J′})_{J,J′=−1}^N = V^{−1},  S_J = \begin{pmatrix} s_{J−2,J−2} & s_{J−2,J−1} & s_{J−2,J} \\ s_{J−1,J−2} & s_{J−1,J−1} & s_{J−1,J} \\ s_{J,J−2} & s_{J,J−1} & s_{J,J} \end{pmatrix}.   (2.6)

To detect jumps in m, one tests the hypotheses H_0: m ∈ C^(p)[0, 1] vs. H_1: m ∈ C[0, 1] \ C^(p)[0, 1]. Denote by ∥c∥_2 = (c_1² + ··· + c_k²)^{1/2} the Euclidean norm of the vector c of all k jump magnitudes; then under Assumption (A1), one can write alternatively H_0: ∥c∥_2 = 0 vs. H_1: ∥c∥_2 > 0. For m̂_p(x) given in (2.2), p = 1, 2, define the test statistics

T_{1n} = max_{0≤J≤N−1} |δ̂_{1J}|, δ̂_{1J} = {m̂_1(t_{J+1}) − m̂_1(t_J)}/σ_{n,1},
T_{2n} = max_{1≤J≤N} |δ̂_{2J}|, δ̂_{2J} = [{m̂_2(t_{J+1}) + m̂_2(t_{J−1})}/2 − m̂_2(t_J)]/σ_{n,2,J},   (2.7)

where σ²_{n,1} = 2σ²/(nh), σ²_{n,2,J} = σ²(8nh/3)^{−1} ζ^T S_J ζ, ζ = (1, −2, 1)^T,   (2.8)

with S_J defined in (2.6). To state our main results, denote

d_N(α) = 1 − {2 log N}^{−1} [log{−(1/2) log(1 − α)} + (1/2){log log(N) + log 4π}].   (2.9)

Theorem 2.1. Under Assumptions (A1)-(A4) and H_0,

lim_{n→∞} P[T_{pn} > {2 log(N − 2p + 2)}^{1/2} d_{N−2p+2}(α)] = α, p = 1, 2.

A similar result by kernel smoothing for the fixed-design regression model exists in Theorem 3 of Wu and Chu (1993); the proof of that result, however, does not contain sufficient detail for us to comment further. It is feasible to derive a similar asymptotic result for T_{pn} under H_1, but that is beyond the scope of this chapter, so we leave it to future work.

2.3 Implementation

In this section, we describe how to implement in XploRe (Härdle, Hlávka and Klinke (2000)) the jump point detection procedures, using the results in Theorem 2.1. Given any sample {(X_i, Y_i)}_{i=1}^n from model (2.1), denote X_min = min(X_1, ..., X_n) and X_max = max(X_1, ..., X_n). We transform {X_i}_{i=1}^n onto the interval [0, 1] by subtracting X_min from each X_i and then dividing by X_max − X_min. The definition of m̂_p(x) in (2.2) entails

m̂_p(x) ≡ ∑_{J=1−p}^N λ̂′_{J,p} b_{J,p}(x), p = 1, 2,   (2.10)

where the coefficients {λ̂′_{1−p,p}, ..., λ̂′_{N,p}}^T solve the least squares problem

{λ̂′_{1−p,p}, ..., λ̂′_{N,p}}^T = argmin_{{λ_{1−p,p},...,λ_{N,p}} ∈ R^{N+p}} ∑_{i=1}^n {Y_i − ∑_{J=1−p}^N λ_{J,p} b_{J,p}(X_i)}².

By Assumption (A2), the number of interior knots N is taken to be N = [n^{1/3}(log n)²/5] for the constant spline (p = 1) and N = [n^{1/5}(log n)²/5] for the linear spline (p = 2), in which [a] denotes the integer part of a. Denote the estimator of Y_i by Ŷ_{i,p} = m̂_p(X_i), for i = 1, ..., n, with m̂_p given in (2.10), and define the estimator of σ as

σ̂_p = {∑_{i=1}^n (Y_i − Ŷ_{i,p})²/(n − N − p)}^{1/2}.
Basic spline smoothing theory, as in Wang and Yang (2009a), ensures that σ̂²_p →_p σ² as n → ∞; hence Theorem 2.1 holds if one replaces σ by σ̂_p. The asymptotic p-value p_{value,p} is obtained by solving the equation T_{pn} = {2 log(N − 2p + 2)}^{1/2} d_{N−2p+2}(p_{value,p}), p = 1, 2, with T_{pn} defined in (2.7) and estimated by replacing σ² with σ̂²_p, so that

p_{value,p} = 1 − exp(−2 exp[2 log(N − 2p + 2){1 − {2 log(N − 2p + 2)}^{−1/2} T_{pn}} − 2^{−1}{log log(N − 2p + 2) + log 4π}]).   (2.11)

When the p-value is below a pre-determined α, one concludes that there exist jump points in m. The jump locations and magnitudes are estimated as follows. We use the BIC criterion proposed in Xue and Yang (2006) to select the "optimal" N, denoted N̂_opt, from [[4n^{1/3}] + 4, min([10n^{1/3}], [n/2] − 1)], which minimizes the BIC value BIC(N) = log(σ̂²_1) + (N + 1) × log(n)/n. By letting p = 1 and replacing T_{1n} with |δ̂_{1J}|, for 0 ≤ J ≤ N − 1, in (2.11), we obtain a p-value p_{value,1,J} for each δ̂_{1J}. The jump locations τ_i, 1 ≤ i ≤ k, are estimated by τ̂_i = (t_{l_i} + t_{l_i+1})/2 for p_{value,1,l_i} < α, with ĉ_i = m̂_1(t_{l_i+1}) − m̂_1(t_{l_i}) as the estimated jump magnitudes, for 0 ≤ l_1, ..., l_k ≤ N − 1. It is apparent that for τ_i ∈ [t_{l_i}, t_{l_i+1}], τ̂_i → τ_i, 1 ≤ i ≤ k, as n → ∞.
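A minimal end-to-end sketch of the steps just described (linear-spline fit, variance estimate σ̂_2, test statistic T_2n from (2.7)-(2.8), and p-value from (2.11)) is given below for the case p = 2. The original computations were done in XploRe, so this Python version is illustrative only; it omits the BIC-based knot re-selection and the jump-location step.

```python
import numpy as np

def lin_design(x, N):
    """Linear B-spline design matrix b_{J,2}, J = -1,...,N, on [0,1] with h = 1/(N+1)."""
    h = 1.0 / (N + 1)
    knots = np.arange(N + 2) * h                     # t_0, ..., t_{N+1}
    return np.clip(1.0 - np.abs(np.asarray(x)[:, None] - knots[None, :]) / h, 0.0, None)

def jump_test_linear(X, Y, N):
    """Linear-spline (p = 2) jump test: returns (T_2n, asymptotic p-value)."""
    n, h = len(Y), 1.0 / (N + 1)
    B = lin_design(X, N)
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)     # coefficients of m_hat_2
    sigma2_hat = np.sum((Y - B @ coef) ** 2) / (n - N - 2)

    # Inner-product matrix V of the rescaled linear basis, (2.5), and S = V^{-1}
    V = np.eye(N + 2) + np.diag(np.full(N + 1, 0.25), 1) + np.diag(np.full(N + 1, 0.25), -1)
    V[0, 1] = V[1, 0] = V[-1, -2] = V[-2, -1] = np.sqrt(2.0) / 4.0
    S = np.linalg.inv(V)

    knots = np.arange(N + 2) * h
    m_at_knots = lin_design(knots, N) @ coef
    zeta = np.array([1.0, -2.0, 1.0])
    stats = []
    for J in range(1, N + 1):                        # delta_hat_{2J}, J = 1, ..., N
        SJ = S[J - 1:J + 2, J - 1:J + 2]             # 3x3 block as in (2.6)
        var_J = sigma2_hat * (3.0 / (8.0 * n * h)) * (zeta @ SJ @ zeta)
        num = (m_at_knots[J + 1] + m_at_knots[J - 1]) / 2.0 - m_at_knots[J]
        stats.append(abs(num) / np.sqrt(var_J))
    T2n = max(stats)

    Ntil = N - 2                                     # N - 2p + 2 with p = 2
    logN = np.log(Ntil)
    expo = 2.0 * logN * (1.0 - T2n / np.sqrt(2.0 * logN)) \
           - 0.5 * (np.log(logN) + np.log(4.0 * np.pi))
    pvalue = 1.0 - np.exp(-2.0 * np.exp(expo))       # (2.11)
    return T2n, pvalue
```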
2.4 Examples

2.4.1 Simulation example

Here, we examine the finite-sample performance of the procedure described in Section 2.3 when m(x) has at most one jump. The data are generated from model (2.1) with X ∼ U[−1/2, 1/2], ε ∼ N(0, 1), and m(x) = sin(2πx) + c × I(τ_1 ≤ x ≤ 1/2), for τ_1 = √2/4. The noise level is σ = 0.2, 0.5, the sample size n = 200, 600, 1000, and the significance level α = 0.05, 0.01. Let n_s be the number of replications. Denote the asymptotic powers based on the constant and linear splines by β̂_p(c), p = 1, 2, calculated from

β̂_p(c) = ∑_{q=1}^{n_s} I[T_{n,p,q} > {2 log(N − 2p + 2)}^{1/2} d_{N−2p+2}(α)]/n_s,

where T_{n,p,q} is the q-th replication of T_{pn}, with T_{pn} given in (2.7) and d_N(α) given in (2.9), for p = 1, 2. Table 2.1 shows values of β̂_p(c) for c = 0 and c = 2.

Table 2.1: Power calculation for the simulated example in Chapter 2. Note: powers calculated from the test statistic T_pn defined in (2.7) by constant and linear splines, respectively, over n_s = 500 replications.

                        β̂2(c)        β̂2(c)        β̂1(c)        β̂1(c)
 c    σ       n       α = 0.05     α = 0.01     α = 0.05     α = 0.01
 0   0.2     200       0.100        0.032        0.640        0.280
             600       0.062        0.014        0.390        0.140
            1000       0.046        0.010        0.220        0.050
     0.5     200       0.058        0.012        0.220        0.070
             600       0.054        0.006        0.180        0.040
            1000       0.050        0.010        0.120        0.030
 2   0.2     200       1.000        0.998        1.000        1.000
             600       1.000        1.000        1.000        1.000
            1000       1.000        1.000        1.000        1.000
     0.5     200       0.942        0.776        0.890        0.680
             600       1.000        0.980        1.000        0.970
            1000       1.000        1.000        1.000        1.000

In Table 2.1, β̂_p(2), p = 1, 2, approach 1 rapidly. Meanwhile β̂_2(0) approaches α as the sample size increases, a very positive confirmation of Theorem 2.1, in contrast to β̂_1(0), whose convergence is much slower, indicating that the linear spline method outperforms the constant spline method. Table 2.1 also shows that the noise level has more influence on the constant spline method than on the linear spline method. Table 2.2 shows the average computing time (in seconds) for generating data and detecting jumps by the constant and linear spline methods, which are comparable.

Table 2.2: Computing time for simulated example in Chapter 2. Note: computing time (in seconds) per replication over n_s = 500 replications of generating data and detecting jumps by the constant and linear spline methods.

 sample size n   constant   linear
      200          0.04      0.06
      600          0.21      0.30
     1000          0.55      0.60

There are 500 replications for n = 200, 600 satisfying p_{value,2} < α = 0.05, with p_{value,2} given in (2.11), when c = 2, n_s = 500. Figures 2.1 and 2.2 show the kernel estimators of the densities of τ̂_1 and ĉ_1 given in Section 2.3 with sample size n = 200 (thick lines) and n = 600 (medium lines) at σ = 0.2. The vertical lines at √2/4 and 2 are the reference lines for comparing τ̂_1 to τ_1 and ĉ_1 to c_1, respectively. One clearly sees that the centers of both density plots move toward the reference lines, with much narrower spread, as the sample size n increases. The frequencies over 500 replications of τ_1 falling between t_{l_1} and t_{l_1+1}, as described in Section 2.3, are 0.994 and 1 for n = 200 and 600, respectively. For a visual impression of the actual function estimates, at noise level σ = 0.2 with sample size n = 600, we plot the spline estimator m̂_2(x) (solid curves) together with the true functions m(x) (thick solid curves) in Figures 2.3 and 2.4. The spline estimators seem rather satisfactory.

Figure 2.1: Kernel density plots of τ̂_1 in Chapter 2. Note: kernel density plots of τ̂_1 over 500 replications of sample size n = 200 (thick solid) and n = 600 (solid) for which H_0 is rejected.

Figure 2.2: Kernel density plots of ĉ_1 in Chapter 2. Note: kernel density plots of ĉ_1 over 500 replications of sample size n = 200 (thick solid) and n = 600 (solid) for which H_0 is rejected.

Figure 2.3: Plots of the spline estimator for c = 0 in Chapter 2 (n = 600, c = 0). Note: the true function m(x) (thick solid curve), the spline estimator m̂_2(x) (solid curve) and the data scatter plot at σ = 0.2.

Figure 2.4: Plots of the spline estimator for c = 2 in Chapter 2 (n = 600, c = 2). Note: the true function m(x) (thick solid curve), the spline estimator m̂_2(x) (solid curve) and the data scatter plot at σ = 0.2.

2.4.2 Real data analysis

We apply the jump detection procedures in Section 2.3 to the thickness of pennies data set given in Scott (1992), which consists of measurements in mils of the thickness of 90 US Lincoln pennies. There are two measurements taken as the response variable Y each year, from 1945 through 1989. Penny thickness was reduced in World War II, restored to its original thickness sometime around 1960, and reduced again in the 70's. The asymptotic p-value is p_{value,2} < 10^{−20}. Two jump points are detected with p-values 0.014468 and 0.00077337, located around the year 1958 with an increase of magnitude 2.80, and around the year 1974 with a decrease of magnitude 3.75, respectively, which confirms the result in Gijbels et al. (2007). Figure 2.5 depicts the data points and the spline estimator m̂_2(x) (solid line), which confirms these findings visually. Findings from both the simulated and the real data demonstrate the effectiveness of our approach in detecting the existence of jumps. The plots of m̂_2(x) in Figures 2.3 and 2.5 give an outline of the true function, without breaking the curve at the jumps.
Obtaining jump-preserving spline estimator of the true non-smooth function is beyond the scope of this chapter, but makes an interesting topic for further research. 2.5 Appendix A 2.5.1 Preliminaries Denote by ∥·∥∞ the supremum norm of a function r on [0, 1], i.e. ∥r∥∞ = supx∈[0,1] |r (x)|. We denote by the same letters c, C, any positive constants without distinction. The following extension of Leadbetter, Lindgren and Rootz´n, H. (1983), Theorem 6.2.1 is a key e result on the absolute maximum of discrete time Gaussian processes. ( ) (n) (n) (n) (n) 2 Lemma 2.1. Let ξ 1 , ..., ξ n have jointly normal distribution with Eξ i ≡ 0, E ξ i ≡ 1, 1 ≤ i ≤ n and there exists constants C > 0, a > 1, r ∈ (0, 1) such that the corre) ( (n) (n) (n) lations rij = rij = Eξ i ξ j , 1 ≤ i ̸= j ≤ n satisfy rij ≤ min r, Ca−|i−j| for 24 58 56 52 54 Y 1950 1960 1970 1980 X Figure 2.5: Spline estimator for the thickness of pennies data Note: the thickness of pennies data (points) and the spline estimator m2 (x). ˆ 25 { (n) (n) ξ 1 , ..., ξ n } 1 ≤ i ̸= j ≤ n. Then the absolute maximum Mn,ξ = max satisfies for ) ( ) ( any τ ∈ R, P Mn,ξ ≤ τ /an + bn → exp −2e−τ , as n → ∞, where an = (2 log n)1/2 , bn = an − 1 a−1 (log log n + log 4π). 2 n Proof. Take ε > 0 such that (2 − ε) (1 + r)−1 = 1 + δ, for some δ > 0. Let τ n = τ /an + bn , then τ 2 / (2 log n) → 1, as n → ∞, so for large n, τ 2 > (2 − ε) log n. By n n ( ) ( ) 2 −1/2 −|i−j| < 1 for i ̸= j, one has r the condition rij ≤ min r, Ca ≤ ij 1 − rij ( ) −|i−j| 1 − r2 −1/2 . Let M Ca n,η = max {|η 1 | , . . . , |η n |}, where η 1 , . . . , η n are i.i.d. copies of N (0, 1). By Leadbetter, Lindgren and Rootz´n (1983), Theorem 1.5.3, e ( ) ( ) P Mn,η ≤ τ n → exp −2e−τ , as n → ∞. The Normal Comparison Lemma (Leadbetter, Lindgren and Rootz´n, H. (1983), Lemma 11.1.2) entails that e ( P ( ) ( ) (n) −τ n < ξ j ≤ τ n for j = 1, . . . , n − P −τ n < η j ≤ τ n for j = 1, . . . , n ( ) { ( )} ∑ 2 −1/2 rij 1 − rij ≤ (4/2π) exp −τ 2 / 1 + rij . n 1≤j 0, p ≥ 1 such that for any m ∈ C (p) [0, 1], there exists a function g ∈ G(p−2) [0, 1] such that ∥g − m∥∞ ≤ Cp m(p) ∞ hp and m (x) defined in ˜ (2.14), with probability approaching 1, satisfies mp (x) − m (x) ∞ = O (hp ). ˜ 2.5.2 Proof of Theorem 2.1 for p = 1 For x ∈ [0, 1], define its location and relative position indices J (x) , δ (x) as J (x) = Jn (x) = { } min {[x/h] , N } , δ (x) = x − tJ(x) h−1 . It is clear that tJ (x) ≤ x < tJ (x)+1 , 0 ≤ n n ⟩ ⟨ = 0 unless J = J ′ , for BJ,1 (x) δ (x) < 1, ∀x ∈ [0, 1) , and δ (1) = 1. Since B ′ ,BJ,1 J ,1 n given in (2.4). ˜1 (x) in (2.14) can be written as ε ˜1 (x) = ε ∑N ∑n ( ) ε∗ BJ,1 (x) BJ −2 , ε∗ = n−1 BJ,1 Xi σεi . 2,n J,1 J=0 J,1 i=1 27 ∑N 2 ∗ 2 ε J=0 εJ,1 BJ,1 (x), it is easy to prove that E {ˆ1 (x)} = σ / (nh) and for 0 ≤ ) { ( ( )} J ≤ N −1, E ˆ1 tJ+1 − ˆ1 tJ 2 = 2σ 2 / (nh), which is σ 2 defined in (2.8). Define for ε ε n,1 { ( ) ( )} { ( ) ( )} 0 ≤ J ≤ N −1, ˜n,1,J = σ −1 ˜1 tJ+1 − ˜1 tJ , ˆn,1,J = σ −1 ˆ1 tJ+1 − ˆ1 tJ . ξ ε ε ξ ε ε n,1 n,1 Let ˆ1 (x) = ε ˜ Lemma 2.3. Under Assumptions (A2)-(A4), as n → ∞, sup ξ n,1,J − ˆn,1,J = ξ 0≤J≤N −1 ( ) −1/2 h−1/2 log n = o (1). Oa.s n a.s { }N −1 (k) Lemma 2.4. Under Assumptions (A2)-(A4), there exist ˆn,1,J ξ , k = 1, 2, 3 such that J=0 (1) (3) (2) as n → ∞, sup0≤J≤N −1 ˆn,1,J − ˆn,1,J + sup0≤J≤N −1 ˆn,1,J − ˆn,1,J = oa.s (1). 
ξ ξ ξ ξ { }N −1 { }N −1 { }N −1 (1) (2) (3) ˆ ξ n,1,J has the same probability distribution as ˆn,1,J ξ , and ˆn,1,J ξ J=0 J=0 J=0 is a Gaussian process with mean 0, variance 1, and covariance  }  −1/2, for J − J ′ = 1 {  ˆ(3) , ˆ(3) = . cov ξ n,1,J ξ  n,1,J ′  0, for J − J ′ > 1 Lemmas 2.3 and 2.4 follow from Appendix A of Wang and Yang (2009a). Proof of Theorem 2.1 for p = 1: It is clear from Lemma 2.4 that the Gaussian process { }N −1 ˆ(3) satisfies the conditions of Lemma 2.1, hence as n → ∞, ξ n,1,J J=0 )} {( ( ) ˆ(3) → exp −2e−τ . P sup0≤J≤N −1 ξ n,1,J ≤ τ /aN + bN By letting τ = − log {− (1/2) log (1 − α)}, and using the definition of aN , bN and dN (α) 28 we obtain { (3) sup0≤J≤N −1 ˆn,1,J ≤ − log {− (1/2) log (1 − α)} (2 log N )−1/2 ξ } 1/2 − (1/2) (2 log N )−1/2 (log log N + log 4π) = 1 − α. + (2 log N ) { } 1/2 d (α) = 1 − α. ˆ(3) limn→∞ P sup0≤J≤N −1 ξ n,1,J ≤ (2 log N ) N limn→∞ P By Lemmas 2.3 and 2.4, we have } { lim P n→∞ ˜ ξ n,1,J ≤ (2 log N )1/2 dN (α) sup 0≤J≤N −1 = 1 − α, which implies for 0 ≤ J ≤ N − 1 { lim P n→∞ dN (α)−1 (2 log N )−1/2 σ −1 n,1 ( ) ( sup ˜1 tJ+1 − ˜1 tJ ε ε 0≤J≤N −1 ) } ≤1 = 1 − α. ( ) ( ) Lemma 2.2 entails that under H0 sup0≤J≤N −1 m1 tJ − m tJ = Op (h) and ˜ ( ) ( ) sup0≤J≤N −1 m tJ+1 − m tJ = Op (h) , which imply that ( ) ( ) sup m tJ+1 − m tJ 0≤J≤N −1 { } { } = Op (nh)1/2 (log N )−1/2 h = op (log n)−2 . σ −1 (log N )−1/2 n,1 { }} )} { ( ) ) ( { ( ) ( ) ( + ˜ ˜ ˆ By (2.13), m1 tJ+1 − m1 tJ = m1 tJ+1 − m tJ+1 − m1 tJ − m tJ ˆ 29 { ( ) ( )} { ( ) ( )} m tJ+1 − m tJ + ˜1 tJ+1 − ˜1 tJ , then for dN (α) defined in (2.9), ε ε { } ( ) ( ) limn→∞ P sup0≤J≤N −1 m1 tJ+1 − m1 tJ ≤ σ n,1 (2 log N )1/2 dN (α) = ˆ ˆ { } ( ) ( ) 1/2 d (α) = 1 − α. limn→∞ P sup0≤J≤N −1 ˜1 tJ+1 − ˜1 tJ ≤ σ n,1 (2 log N ) ε ε N 2.6 Appendix B 2.6.1 Preliminaries The following lemma from Wang and Yang (2009a) shows that multiplication by V defined in (2.5) behaves similarly to multiplication by a constant. We use |T | to denote the maximal absolute value of any matrix T . ( )N Lemma 2.5. Given matrix Ω = V + Γ, in which Γ = γ ′ satisfies γ ′ ≡ 0 if jj J,J ′ =−1 jj p J − J ′ > 1 and |Γ| → 0. Then there exist constants c, C > 0 independent of n and Γ, such that with probability approaching 1 c |ξ| ≤ |Ωξ| ≤ C |ξ| , C −1 |ξ| ≤ Ω−1 ξ ≤ c−1 |ξ| , ∀ξ ∈ RN +2 . (2.15) To prove Theorem 2.1 for p = 2, we need the result below (Corollary 16 of Kılı¸ (2008)), c which gives explicit formula for the inverse of symmetric tridiagonal matrix.        Lemma 2.6. For any symmetric tridiagonal matrix Gn =      30 x1 y 1    ..  . y 1 x2  , the  ... ... yn−1    yn−1 xn [ ] inverse of the matrix Gn , G−1 = wij is given by n  ( ) ( ) ( )−1 ∏  k−1 C b −2 y 2 , i = J  C b −1 + ∑n b  t i t=i } t k=i+1 Ck { wij = ∏i−1 ( b )−1   (−1)i+J yt wii , i > J  t=J Ct ( )−1 b b b 2 in which C1 = x1 , Cn = xn − Cn−1 yn−1 , n = 2, 3, .... Lemma 2.7. There exists a constant Cs > 0, such that ∑N J=−1 sJ ′ J ≤ Cs , and 17/16 ≤ sjj ≤ 5/4, where s ′ , 0 ≤ J, J ′ ≤ N − 1, is the element of S defined in (2.6). J J { ( )}N ∑ ˜ ˜ ξ , then N Proof. By (2.15), let ˜ ′ = sgn s ′ J=−1 sJ ′ J ≤ Sξ J ′ ≤ Cs ξ J ′ = J J J J=−1 Cs , ∀J ′ = −1, 0, ..., N . Applying Lemma 2.6 to V with x−1 = · · · = xN = 1, y−1 = ( )−1 N √ ∏ ∑ ( b )−1 k−1 ( b )−2 2 b Ct yt . Ck yN −1 = 2/4, y0 = · · · = yN −1 = 1/4, sjj = CJ + t=J k=J+1 b By mathematical induction, one obtains that 9/10 ≤ CJ ≤ 1, for 0 ≤ J ≤ N − 1. 
So, ( )−1 b 1 ≤ CJ ≤ 10/9, and for 0 ≤ J ≤ N − 1, sjj ≥ 1 + 2.6.2 ∑N ∏k−1 ∑N 2 yt ≥ 1 + (16)−(k−J) ≥ 17/16, k=J+1 t=J k=J+1 ∑N sjj ≤ 10/9 + (10/9) (1/9)k−J ≤ 5/4. k=J+1 Variance calculation ( Vector ˜2 = a−1,2 , ..., aN,2 a ˜ ˜ (⟨ )T given in (2.14) solves the normal equations, )N ⟩ )N ( ∑n ( ) , BJ,2 , B ′ ˜2 = n−1 a BJ,2 Xi σεi J ,2 n J,J ′ =−1 i=1 J=−1 31 ( )−1 ( ) −1 ∑n B (X ) σε N ˜ for BJ,2 (x) in (2.4). In other words, ˜2 = V + B a n , i i=1 J,2 i (√ ) J=−1 (⟨ ⟩ )N ˜ ˜ where B = BJ,2 , B ′ − V satisfies B = Op n−1 h−1 log (n) accordJ ,2 n J,J ′ =−1 ing to Subsection B.2 of the supplement to Wang and Yang (2009a). ( )T ( )−1 ˜ Now define ˆ2 = a−1,2 , ..., aN,2 a ˆ ˆ by replacing V + B with V−1 = S in above (∑ ) ( ) ∑ N formula, i.e., ˆ2 = a sJ ′ J n−1 n BJ,2 Xi σεi ′ , and define for i=1 J=−1 J =−1,..,N x ∈ [0, 1] ˆ2 (x) = ε ∑N a B (x) = ˆ J=−1 J,2 J,2 ∑N s n J,J ′ =−1 J ′ J −1 ∑n ( ) BJ,2 Xi σεi BJ ′ ,2 (x) , i=1 ( )} { ( ( ) ) ˆ ε ξ 2,J = ˆ2 tJ+1 + ˆ2 tJ−1 /2 − ˆ2 tJ , 2 ≤ J ≤ N − 1, ε ε        D =     (2.16) 0 1 −2 1 . . . . . . 0 0 .. . 0 0   .  .. .. .  . . .  .  .  .. .. .. .  . . . .   1 −2 1 0 (N −2)×(N +2) Lemma 2.8. With S and D given in (2.6) and (2.17), }N −1 { ˆ has covariance matrix ξ 2,J J=2 [ ( )]N −1 cov ˆ2,J , ˆ ′ ξ ξ = σ 2 (8nh/3)−1 DSDT . 2,J J,J ′ =2 32 (2.17) (2.18) ( ) ∑ ( ) ( ) ∑ Proof. For 0 ≤ J, J ′ ≤ N +1, ˆ2 tJ = N ′ ε s ′ n−1 n Bk,2 Xi σεi B ′ tJ i=1 k ,2 k,k =−1 k k ∑N ∑n ( ) ( ) =σ n−1 Bk,2 Xi εi s(J−1),k B(J−1),2 tJ . k=−1 i=1 [∑ ] [ ( ) ( )] ∑n ( ) ( ) N 2E −1 X εs B t E ˆ2 tJ ˆ2 tJ ′ = σ ε ε n B k=−1 i=1 k,2 i i J−1,k J−1,2 J [∑ ( )] ∑n ( ) N −1 t B X εs B × n i=1 k ′ ,2 i i J ′ −1,k ′ J ′ −1,2 J ′ k ′ =−1 ( ) ∑N ( ) 2 n−1 EBk,2 (X) B ′ (X) =σ B t B ′ t s s k ,2 J −1,2 J ′ J−1,k J ′ −1,k ′ k,k ′ =−1 J−1,2 J = σ 2 n−1 ( ) ( ) BJ−1,2 tJ B ′ v t s s J −1,2 J ′ J−1,k J ′ −1,k ′ k,k ′ k,k ′ =−1 ∑N ( ) ∑N ∑N ( ) = σ 2 n−1 BJ−1,2 tJ B ′ t ′ s v ′ =−1 sJ ′ −1,k ′ J −1,2 J k k=−1 J−1,k k,k ′ ( ) ∑N ( ) = σ 2 n−1 BJ−1,2 tJ B ′ t s δ J −1,2 J ′ k ′ =−1 J ′ −1,k ′ J−1,k ′ ( ) ( ) −1/2 −1/2 s . = σ 2 n−1 BJ−1,2 tJ B ′ t ′ s ′ = σ 2 n−1 dJ−1,n d ′ J −1,2 J J −1,J−1 J −1,n J ′ −1,J−1 ( ) ′ ≤ N − 1, E ˆ ˆ ˆ ,d By definitions of ξ 2,J J,n in (2.16), (2.3), for 2 ≤ J, J ξ 2,J ξ 2,J ′ is ( σ 2 (8nh/3)−1 s ′ + s ′ − 2s ′ +s ′ +s ′ J ,J J −2,J J −1,J J ,J−2 J −2,J−2 ) −2s ′ − 2s ′ − 2s ′ + 4s ′ J −1,J−2 J ,J−1 J −2,J−1 J −1,J−1   s s s )  J ′ −2,J−2 J ′ −2,J−1 J ′ −2,J  ( )T (     2 (8nh/3)−1 . =σ 1 −2 1  sJ ′ −1,J−2 sJ ′ −1,J−1 sJ ′ −1,J  1 −2 1     s ′ s ′ s ′ J ,J−2 J ,J−1 J ,J [ ( )]N −1 −1 2 T Therefore, for 2 ≤ J, J ′ ≤ N − 1, cov ˆ2,J , ˆ ′ ξ ξ ′ =2 = σ (8nh/3) DSD . 2,J J,J 33 −1 2 Lemma 2.9. For 2 ≤ J ≤ N − 1, σ 2 n,2,J defined in (2.8) satisfies that cσ (8nh/3) σ ≤ −1 2 σ2 n,2,J ≤ Cσ (8nh/3) σ , for cσ = (65/8) (17/16), Cσ = 100/9. ˆ2 Proof. It follows from (2.18) that σ 2 n,2,J = E ξ 2,J . Then by Lemmas 2.6 and 2.8, for { }−1 2 ≤ J ≤ N − 1, σ 2 (8nh/3)−1 σ2 n,2,J is sJ,J − 4sJ,J−1 + 2sJ,J−2 + 4sJ−1,J−1 − 4sJ−1,J−2 + sJ−2,J−2 {( } )−1 b = sJ−2,J−2 + 4 CJ−2 yJ−2 + 1 sJ−1,J−1 + { ( } ( )−1 )−1 b Cb b 2 CJ−2 J−1 yJ−2 yJ−1 + 4 CJ−1 yJ−1 + 1 sJ,J , 2 −1 thus, σ 2 n,2,J ≤ {1 + 4 (1/3 + 1) + (2/9 + 4/3 + 1)} (5/4) σ (8nh/3) = (100/9) (8nh/3)−1 σ 2 , 2 −1 σ2 n,2,J ≥ {1 + 4 (1/4 + 1) + (2/16 + 1 + 1)} (17/16) σ (8nh/3) = (65/8) (17/16) (8nh/3)−1 σ 2 . 2.6.3 Proof of Theorem 2.1 for p = 2 Several lemmas will be given below for proving Theorem 2.1 for p = 2. 
With ˜2 (x), ˆ2,J ε ξ and σ n,2,J defined in (2.14), (2.16) and (2.8), define for 2 ≤ J ≤ N − 1 ˜ ξ n,2,J = σ −1 n,2,J )} ( [{ ( ( )] ) ξ ε ε ˜2 tJ+1 + ˜2 tJ−1 /2 − ˜2 tJ , ˆn,2,J = σ −1 ˆ2,J , (2.19) ε n,2,J ξ 34 ˆ Lemma 2.10. Under Assumptions (A2)-(A4), as n → ∞, sup ξ n,2,J − ˜n,2,J = ξ 2≤J≤N −1 (√ ) Oa.s log n/ (nh) = oa.s (1). { }N −1 (k) Lemma 2.11. Under Assumptions (A2)-(A4), there exist ˆn,2,J , k = 1, 2, 3, such ξ J=2 (2) (3) (1) ξ ξ that as n → ∞, sup2≤J≤N −1 ˆn,2,J − ˆn,2,J +sup2≤J≤N −1 ˆn,2,J − ˆn,2,J = oa.s (1). ξ ξ { } ˆ(1) has the same probability distribution as ˆ(2) . ˆ(3) ξ n,2,J ξ n,2,J ξ n,2,J is a Gaussian process with ( ) ( ) (3) ˆ(3) ξ −1 σ −1 E ˆ ˆ ˆ = σ n,2,J ξ 2,J ξ 2,J ′ mean 0, variance 1, covariance r ′ = cov ξ n,2,J , ξ J,J n,2,J ′ n,2,J ′ for which there exist constants 0 < C, 0 < r < 1 such that for large n, ( ) − J−J ′ ξ r ′ ≤ min r, C3 , 2 ≤ J, J ′ ≤ N − 1. J,J ( − J−J ′ ξ Proof. We only prove r ′ ≤ min r, C3 J,J ) . Lemma 2.10 and the rest of Lemma 2.11 follow from Appendix B of the supplement to Wang and Yang (2009a). By (2.18), ) ( − 2s ′ +s ′ +s ′ σ −2 (8nh/3)−1 E ˆ2,J ˆ ′ = s ′ + s ′ ξ ξ 2,J J ,J J −2,J J −1,J J ,J−2 J −2,J−2 −2s ′ − 2s ′ − 2s ′ + 4s ′ . J −1,J−2 J ,J−1 J −2,J−1 J −1,J−1 ∏ ′ J−1 ( b )−1 yt sjj , then for Ct By Lemma 2.6, for −1 ≤ J ′ < J ≤ N , s ′ = (−1)J+J J,J ′ t=J 35 2 ≤ J, J ′ ≤ N − 1 and J − J ′ > 2, by Lemma 2.7, { ] }−1 [ σ 2 (8nh/3)−1 E ˆ2,J ˆ ′ ξ ξ 2,J } {( )−1 ( )−1 b J+J ′ b y ′ +1 y ′ +2 C ′ = (−1) C ′ J −2 J −1 J −1 J −2 { ( )−1 b × sJ−2,J−2 + 2 CJ−2 yJ−2 sJ−1,J−1 + } J−3 ( ) )−1 ∏ b Cb b −1 y CJ−2 J−1 yJ−2 yJ−1 sJ,J Ct t ′ ( t=J ) ( ) { } − J−J ′ −2 − J−J ′ ≤ (5/4) (1/3 + 2/3 + 1) 1 + 2/3 + (1/3)2 3 ≤ 40 × 3 . ( { } 2 (8nh/3)−1 −1 σ 2 By Lemma 2.9, σ n,2,J ≥ (65/8) (17/16), for 2 ≤ J ≤ N − 1. Therefore, ( ) − J−J ′ ξ for 2 ≤ J, J ′ ≤ N − 1 and J − J ′ > 2, r ′ ≤ C3 ≤ r < 1, with C = J,J 40 (8/65) (16/17) and r = 40 (8/65) (16/17) /33 < 1. For J − J ′ = 1, 2, the result can be proved following the same procedure above. Proof of Theorem 2.1 for p = 2: It is clear from Lemma 2.11 that the Gaussian { }N −1 (3) process ˆn,2,J ξ satisfies the conditions of Lemma 2.1, hence as n → ∞, J=2 ( P (3) sup2≤J≤N −1 ˆn,2,J ≤ τ /aN + bN ξ ) ( ) → exp −2e−τ . By Lemmas 2.10, 2.11, with τ = − log {− (1/2) log (1 − α)} and definitions of aN and bN , ) ( ξ limn→∞ P sup2≤J≤N −1 ˜n,2,J ≤ {2 log (N − 2)}1/2 dN −2 (α) = 1 − α, for any 0 < α < 1, ˆn,2,J and dN (α) defined in (2.19) and (2.9). By (2.13) and (2.16), ξ 36 { ( )} ( ( ) ) m2 tJ+1 + m2 tJ−1 /2 − m2 tJ is ˆ ˆ ˆ { { ( ) ( )} ( ) ( )} { ( ) ( )} ˜ ˜ m2 tJ+1 − m tJ+1 /2 + m2 tJ−1 − m tJ−1 /2 − m2 tJ − m tJ ˜ [{ ( ( )} ) ( )] ξ + m tJ+1 + m tJ−1 /2 − m tJ + ˆ2,J+1 ( ) 2 , hence Now Lemma 2.2 implies that under H0 , ∥m2 − m∥∞ = Op h ˜ (nh)1/2 {log (N − 2)}−1/2 × { ( ) ( )} { ( ) ( )} m2 tJ+1 − m tJ+1 /2 + m2 tJ−1 − m tJ−1 /2 ˜ ˜ sup 2≤J≤N −1 ] } [ { { ( ) ( )} − m2 tJ − m tJ ˜ = Op (nh)1/2 {log (N − 2)}−1/2 h2 = op (log n)−3 . By Taylor expansion, sup2≤J≤N −1 { ( ( )} ( ) ) ( ) m tJ+1 + m tJ−1 /2 − m tJ = Op h2 under H0 , as n → ∞. Hence { ( ( )} ) ( ) sup m tJ+1 + m tJ−1 /2 − m tJ 2≤J≤N −1 [√ ] { } −1/2 h2 = o (log n)−3 . = Op nh {log (N − 2)} p (nh)1/2 {log (N − 2)}−1/2 By above results, for T2n defined in (2.7), { } limn→∞ P T2n ≤ {2 log (N − 2)}1/2 dN −2 (α) = 1 − α. 37 Chapter 3 A Simultaneous Confidence Band for Sparse Longitudinal Regression 3.1 Introduction This chapter is based on Ma, Yang and Carroll (2011). 
Functional data analysis (FDA) has in recent years become a focal area in statistics research, and much has been published in this area. An incomplete list includes Cardot, Ferraty, and Sarda (2003), Cardot and Sarda (2005), Ferraty and Vieu (2006), Hall and Heckman (2002), Hall, Müller, and Wang (2006), Izem and Marron (2007), James, Hastie, and Sugar (2000), James (2002), James and Silverman (2005), James and Sugar (2003), Li and Hsing (2007), Li and Hsing (2010), Morris and Carroll (2006), Müller and Stadtmüller (2005), Müller, Stadtmüller, and Yao (2006), Müller and Yao (2008), Ramsay and Silverman (2005), Wang, Carroll, and Lin (2005), Yao and Lee (2006), Yao, Müller, and Wang (2005a), Yao, Müller, and Wang (2005b), Yao (2007), Zhang and Chen (2007), Zhao, Marron, and Wells (2004), and Zhou, Huang, and Carroll (2008).

According to Ferraty and Vieu (2006), a functional data set consists of iid realizations {ξ_i(x), x ∈ X}, 1 ≤ i ≤ n, of a smooth stochastic process (random curve) {ξ(x), x ∈ X} over an entire interval X. A more data-oriented alternative in Ramsay and Silverman (2005) emphasizes smooth functional features inherent in discretely observed longitudinal data, so that the recording of each random curve ξ_i(x) is over a finite number of points in X, and contaminated with noise. This second view is taken in this chapter.

A typical functional data set therefore has the form {Xij, Yij}, 1 ≤ i ≤ n, 1 ≤ j ≤ Ni, in which Ni observations are taken for the ith subject, with Xij and Yij the jth predictor and response variables, respectively, for the ith subject. Generally, the predictor Xij takes values in a compact interval X = [a, b]. For the ith subject, its sample path {Xij, Yij} is the noisy realization of a continuous time stochastic process ξ_i(x) in the sense that

Yij = ξ_i(Xij) + σ(Xij) εij,   (3.1)

with errors εij satisfying E(εij) = 0, E(ε²_ij) = 1, and {ξ_i(x), x ∈ X} are iid copies of a process {ξ(x), x ∈ X} which is L², i.e., E∫_X ξ²(x) dx < +∞.

For the standard process {ξ(x), x ∈ X}, one defines the mean function m(x) = E{ξ(x)} and the covariance function G(x, x′) = cov{ξ(x), ξ(x′)}. Let the sequences {λ_k}_{k=1}^∞ and {ψ_k(x)}_{k=1}^∞ be the eigenvalues and eigenfunctions of G(x, x′), respectively, in which λ_1 ≥ λ_2 ≥ ··· ≥ 0, ∑_{k=1}^∞ λ_k < ∞, {ψ_k}_{k=1}^∞ form an orthonormal basis of L²(X) and G(x, x′) = ∑_{k=1}^∞ λ_k ψ_k(x) ψ_k(x′), which implies that ∫ G(x, x′) ψ_k(x′) dx′ = λ_k ψ_k(x).

The process {ξ_i(x), x ∈ X} allows the Karhunen-Loève L² representation

ξ_i(x) = m(x) + ∑_{k=1}^∞ ξ_{ik} ϕ_k(x),   (3.2)

where the random coefficients ξ_{ik} are uncorrelated with mean 0 and variances 1, and the functions ϕ_k = √λ_k ψ_k. In what follows, we assume that λ_k = 0 for k > κ, where κ is a positive integer, thus G(x, x′) = ∑_{k=1}^κ ϕ_k(x) ϕ_k(x′). Based on (3.2), the data generating process is now written as

Yij = m(Xij) + ∑_{k=1}^κ ξ_{ik} ϕ_k(Xij) + σ(Xij) εij.   (3.3)

The sequences {λ_k}_{k=1}^κ, {ϕ_k(x)}_{k=1}^κ and the random coefficients ξ_{ik} exist mathematically, but are unknown and unobservable. Two distinct types of functional data have been studied. Li and Hsing (2007) and Li and Hsing (2010) concern dense functional data, which in the context of model (3.1) means min_{1≤i≤n} Ni → ∞ as n → ∞. On the other hand, Yao, Müller, and Wang (2005a), Yao, Müller, and Wang (2005b), and Yao (2007) studied sparse longitudinal data for which Ni’s are i.i.d.
copies of an integer-valued positive random variable. Pointwise asymptotic distributions were obtained in Yao (2007) for local polynomial estimators of m(x) based on sparse functional data, but without uniform confidence bands. Nonparametric simultaneous confidence bands are a powerful tool of global inference for functions, see Claeskens and Van Keilegom (2003), Fan and Zhang (2000), Hall and Titterington (1988), H¨rdle (1989), a H¨rdle and Marron (1991), Huang, Wang, Yang, and Kravchenko (2008), Ma and Yang a (2010), Song and Yang (2009), Wang and Yang (2009), Wu and Zhao (2007), Zhao and 40 Wu (2008), and Zhou, Shen, and Wolfe (1998) for its theory and applications. The fact that a simultaneous confidence band has not been established for functional data analysis is certainly not due to lack of interesting applications, but to the greater technical difficulty in formulating such bands for functional data and establishing their theoretical properties. Specifically, the strong approximation results used to establish the asymptotic confidence level in nearly all published works on confidence bands, commonly known as “Hungarian embedding”, are unavailable for sparse functional data. In this chapter, we present simultaneous confidence bands for m(x) in sparse functional data via a piecewise-constant spline smoothing approach. While there exist a number of ( ) ′ such as kernels (Yao, M¨ller and, smoothing methods for estimating m(x) and G x, x u Wang (2005a); Yao, M¨ller, and Wang (2005b); Yao (2007)), penalized splines (Cardot, u Ferraty, and Sarda (2003); Cardot and Sarda (2005); Yao and Lee (2006)), wavelets Morris and Carroll (2006), and parametric splines James (2002), we choose B splines (Zhou, Huang, and Carroll (2008)) for simple implementation, fast computation and explicit expression, see Huang and Yang (2004), Wang and Yang (2007), and Xue and Yang (2006) for discussion of the relative merits of various smoothing methods. We organize this chapter as follows. In Section 3.2 we state our main results on confidence bands constructed from piecewise constant splines. In Section 3.3 we provide further insights into the error structure of spline estimators. Section 3.4 describes the actual steps to implement the confidence bands. Section 3.5 reports findings of a simulation study. An empirical example in Section 3.6 illustrates how to use the proposed confidence band for inference. Proofs of technical lemmas are in the Appendix. 41 3.2 Main Results For convenience, we denote the modulus of continuity of a continuous function r on [a, b] ( ) by ω (r, δ) = max ′ r(x) − r x′ . Denote the empirical L2 norm as x,x ∈[a,b], x−x′ ≤δ ( ) −1 ∑n ∑Ni g 2 X , where we denote the total sample size by N = ∥g∥2 = NT ij T i=1 j=1 2,NT ∑n i=1 Ni . Without loss of generality, we take the range of X, X = [a, b], to be [0, 1]. For any β ∈ (0, 1], we denote the collection of order β H˜lder continuous function on [0, 1] by o C 0,β [0, 1] =    ϕ : ∥ϕ∥0,β = sup ′ ,x,x′ ∈[0,1] x̸=x ( ) ϕ(x) − ϕ x′ x − x′ β   < +∞  , in which ∥ϕ∥0,β is the C 0,β -seminorm of ϕ. Let C [0, 1] be the collection of continuous function on [0, 1]. Clearly, C 0,β [0, 1] ⊂ C [0, 1] and, if ϕ ∈ C 0,β [0, 1], then ω (ϕ, δ) ≤ ∥ϕ∥0,β δ β . We use the piecewise constant spline functions introduced in Section 1.3 of Chapter 1 with the number of interior knots denoted as Ns . Let hs = 1/ (Ns + 1) be the distance between neighboring knots. The mean function m(x) is estimated by m(x) = argmin g∈G(−1) ∑n { ( )}2 i Y −g X . 
ij j=1 ij ∑N i=1 (3.4) The technical assumptions we need are as follows (A1) The regression function m(x) ∈ C 0,1 [0, 1]. (A2) The functions f (x), σ(x), and ϕk (x) ∈ C 0,β [0, 1] for some β ∈ (2/3, 1] with f (x) ∈ ] [ cf , Cf , σ(x) ∈ [cσ , Cσ ] , x ∈ [0, 1], for constants 0 < cf ≤ Cf < ∞, 0 < cσ ≤ Cσ < ∞. 42 (A3) The set of random variables ( ) ( ) Ni n is a subset of Ni ∞ consisting of indei=1 i=1 pendent variables Ni , the numbers of observations made for the i-th subject, i = 1, 2, ..., with Ni N , where N > 0 is a positive integer-valued random variable with E{N 2r } ≤ r!cr , r = 2, 3, ... for some constant cN > 0. The set of ranN ( )n,N ( )∞,∞ i dom variables Xij , Yij , εij is a subset of Xij , Yij , εij in which i=1,j=1 i=1,j=1 ( )∞,∞ are iid. The number κ of nonzero eigenvalues is finite and the ranXij , εij i=1,j=1 ( ) dom coefficients ξ ik , k = 1, ..., κ, i = 1, ..., ∞ are iid N (0, 1). The variables Ni ∞ , i=1 ( )∞,∞ ( )∞,∞ ( )∞,κ , εij ξ ik , Xij are independent. i=1,k=1 i=1,j=1 i=1,j=1 ( ) ϑ for some ϑ ∈ (1/3, 2β − 1) (A4) As n → ∞, the number of interior knots Ns = o n { } −1 = o n−1/3 (logn) −1/3 . The subinterval length h ∼ N −1 . while Ns s s (A5) There exists r > 2/ {β − (1 + ϑ) /2} such that E |ε11 |r < ∞. Assumptions (A1), (A2), (A4) and (A5) are similar to (A1)–(A4) in Wang and Yang (2009), with (A1) weaker than its counterpart. Assumption (A3) is the same as (A1.1), (A1.2), and (A5) in Yao, M¨ller, and Wang (2005b), without requiring joint normality of u the measurement errors εij . We use the B-spline basis of G(−1) , the space of piecewise constant splines, denoted as { } bJ (x) Ns for theoretical analysis. Define J=0 cJ,n = bJ 2 = 2 ∫ 1 0 bJ (x)f (x)dx, J = 0, ..., Ns , σ 2 (x) = var(Y |X = x ) = G (x, x) + σ 2 (x), ∀x ∈ [0, 1] , Y 43 (3.5)   2  κ  E {N (N − 1)} ∑ ∫ 1 1  σ 2 (x) = c−2 {nE(N1 )}−1 ϕk (u) f (u) du n J(x),n  EN1 χJ(x)  k=1  ∫  + σ 2 (u) f (u) du . (3.6)  χJ(x) Y In addition, define QNs +1 (α) = bNs +1 − a−1+1 log{−(1/2)log(1 − α)}, Ns ( ) 2 log 2πaN +1 s aNs +1 = {2log (Ns + 1)}1/2 , bNs +1 = aNs +1 − , 2aNs +1 (3.7) for any α ∈ (0, 1). We now state our main results. Theorem 3.1. Under Assumptions (A1)-(A5), for any α ∈ (0, 1) , { } lim P supx∈[0,1] |m(x) − m(x)| /σ n (x) ≤ QNs +1 (α) = 1 − α, n→∞ } { lim P |m(x) − m(x)| /σ n (x) ≤ Z1−α/2 = 1 − α, ∀x ∈ [0, 1], n→∞ where σ n (x) and QNs +1 (α) are given in (3.6) and (3.7), respectively, while Z1−α/2 is the 100 (1 − α/2)th percentile of the standard normal distribution. The definition of σ n (x) in (3.6) does not allow for practical use. The next proposition provides two data-driven alternatives Proposition 3.1. Under Assumptions (A2), (A3), and (A5), as n → ∞, { sup x∈[0,1] σ −1 (x)σ n,IID (x) − 1 + σ −1 (x)σ n,LONG (x) − 1 n n 44 } ) β = O hs , ( in which for x ∈ [0, 1], σ n,IID (x) ≡ σ Y (x) {f (x)hs nE(N1 )}−1/2 and { σ n,LONG (x) ≡ σ n,IID (x) 1 + E {N1 (N1 − 1)} G (x, x) f (x) hs EN1 σ 2 (x) Y }1/2 . Using σ n,IID (x) instead of σ n (x) means to treat the (Xij , Yij ) as iid data rather than as sparse longitudinal data, while using σ n,LONG (x) means to correctly account for the longitudinal correlation structure. The difference of the two approaches, although asymptotically negligible uniformly for x ∈ [0, 1] according to Proposition 3.1, is significant in finite samples, as shown in the simulation results of Section 3.5. For similar phenomenon with kernel smoothing, see Wang, Carroll, and Lin (2005). Corollary 3.1. 
Under Assumptions (A1)-(A5), for any α ∈ (0, 1), as n → ∞, an asymptotic 100 (1 − α) % simultaneous confidence band for m(x), x ∈ [0, 1] is m(x) ± σ n (x)QNs +1 (α) , while an asymptotic 100 (1 − α) % pointwise confidence interval for m(x), x ∈ [0, 1], is m(x)± σ n (x)Z1−α/2 . 3.3 Decomposition In this section, we decompose the estimation error m(x) − m(x) by the representation of Yij ( ) ( ) ( ) ∑ as the sum of m Xij , κ ξ ik ϕk Xij , and σ Xij εij . k=1 { } We introduce the rescaled B-spline basis BJ (x) Ns for G(−1) , which is BJ (x) ≡ J=0 45 bJ (x) bJ −1 , J = 0, ..., Ns . Therefore, 2 { }−1/2 BJ (x) ≡ bJ (x) cJ,n , J = 0, ..., Ns . (3.8) ⟨ ⟩ It is easily verified that BJ 2 = 1, J = 0, 1, . . . , Ns , BJ , B ′ ≡ 0, J ̸= J ′ . 2 J The definition of m(x) in (3.4) means that m(x) ≡ ∑Ns J=0 ′ λJ bJ (x), (3.9) }T { ′ ′ as solutions of the least squares problem with coefficients λ0 , ..., λNs { }T ( )}2 ∑n ∑N { ∑Ns ′ ′ i ={ argmin λ0 , ..., λNs Yij − λ b Xij . } i=1 j=1 J=0 J J Ns +1 λ0 ,...,λNs ∈R { }T ∑Ns Simple linear algebra shows m(x) ≡ J=0 λJ BJ (x), where the coefficients λ0 , ..., λNs are solutions of the least squares problem { }T ( )} 2 ∑n ∑N { ∑Ns i λ0 , ..., λNs ={ argmin Yij − λ B Xij . } i=1 j=1 J=0 J J λ0 ,...,λNs ∈RNs +1 (3.10) Projecting the relationship in model (3.3) onto the linear subspace of RNT spanned by ( )} { , we obtain the following crucial decomposition in BJ Xij 1≤j≤Ni ,1≤i≤n,0≤J≤Ns the space G(−1) of spline functions: m(x) = m(x) + e(x) = m(x) + ε(x) + 46 ∑κ ξ (x), k=1 k (3.11) m(x) = ∑Ns J=0 λJ BJ (x), ε(x) = ξ k (x) = ∑Ns ∑Ns a B (x), J=0 J J τ B (x). J=0 k,J J (3.12) }T { }T { The vectors {λ0 , ..., λNs }T , a0 , ..., aNs , and τ k,0 , ..., τ k,Ns are solutions to (3.10) ( ) ( ) ( ) with Yij replaced by m Xij , σ Xij εij , and ξ ik ϕk Xij , respectively. We cite next an important result concerning the function m(x). The first part is from de Boor (2001), p. 149, and the second is from Theorem 5.1 of Huang (2003). Theorem 3.2. There is an absolute constant Cg > 0 such that for every ϕ ∈ C [0, 1], there exists a function g ∈ G(−1) [0, 1] that satisfies ∥g − ϕ∥∞ ≤ Cg ω (ϕ, hs ). In particular, if β ϕ ∈ C 0,β [0, 1] for some β ∈ (0, 1], then ∥g − ϕ∥∞ ≤ Cg ∥ϕ∥0,β hs . Under Assumptions (A1) and (A4), with probability approaching 1, the function m(x) defined in (3.12) satisfies ∥m(x) − m(x)∥∞ = O (hs ) . The next proposition concerns the function e(x) given in (3.11). Proposition 3.2. Under Assumptions (A2)-(A5), for any τ ∈ R, and σ n (x), aNs +1 , and bNs +1 as given in (3.6) and (3.7), { } ( ) lim P supx∈[0,1] σ n (x)−1 e(x) ≤ τ /aNs +1 + bNs +1 = exp −2e−τ . n→∞ 3.4 Implementation In this section, we describe procedures to implement the confidence bands and intervals ( )N ,n from model (3.3), the spline given in Corollary 3.1. Given any data set Xij , Yij i j=1,i=1 estimator m(x) is obtained by (3.9), and the number of interior knots in (3.9) is taken 47 1/3 to be Ns = [cNT (logn)], in which [a] denotes the integer part of a and c is a positive constant. When constructing the confidence bands, one needs to evaluate the function σ 2 (x) n by estimating the unknown functions f (x), σ 2 (x), and G (x, x), and then plugging in these Y estimators: the same approach is taken in Wang and Yang (2009). The number of interior knots for pilot estimation of f (x), σ 2 (x), and G (x, x) is taken Y [ ] ( ) ∗ ∗ to be Ns = n1/3 , and h∗ = 1/ 1 + Ns . The histogram pilot estimator of the density s function f (x) is f (x) = { ∑n ∑N ∑n i b (Xij )}/{( N )h∗ }. 
i=1 j=1 J(x) i=1 i s {( )2 } T T Yij − m(Xij ) , the Defining the vector R ={Rij }1≤j≤N ,1≤i≤n = i 1≤j≤Ni ,1≤i≤n ∗ ∑Ns estimator of σ 2 (x) is σ 2 (x) = J=0 ρJ bJ (x), where the coefficients {ρ0 , ..., ρN ∗ }T are Y Y s solutions of the least squares problem: { }T ( )}2 ∑n ∑N { ∑N ∗ s ρ b X i ρ0 , ..., ρN ∗ ={ argmin Rij − . ij } J=0 J J i=1 j=1 s ρ0 ,...,ρN ∗ ∈RNs +1 s ) ( The pilot estimator of covariance function G x, x′ is ) ( G x, x′ = arg where C ′ = ijj min (−1) ⊗G(−1) g∈G { ( )}2 ∑N i Cijj ′ − g Xij , Xij ′ , i=1 j,j ′ =1,j̸=j ′ ∑n { ( )} { ( )} Yij − m Xij Y ′ −m X ′ , 1 ≤ j, j ′ ≤ Ni , 1 ≤ i ≤ n. The ij ij 48 { }−1/2 function σ n (x) is estimated by either σ n,IID (x) ≡ σ Y (x) f (x)hs NT or { σ n,LONG (x) ≡ σ n,IID (x) 1 + }1/2 ) G (x, x) N 2 /NT − 1 f (x)hs . i=1 i σ 2 (x) Y (∑ n We now state a result. That is easily proved by standard theory of kernel and spline smoothing, as in Wang and Yang (2009). Proposition 3.3. Under Assumptions (A1)-(A5), as n → ∞ { supx∈[0,1] σ n,IID (x)σ −1 (x) − 1 + σ n,LONG (x)σ −1 n,IID n,LONG (x) − 1 ( ) β −1/2 N −1 (logn)1/2 . = Oa.s. hs + n s } Proposition 3.1, about how σ n,IID (x) and σ n,LONG (x) uniformly approximate σ n (x), and Proposition 3.3 together imply that both σ n,IID (x) and σ n,LONG (x) approximate ) ( −1/2+1/3 (logn)1/2−1/3 , according to Assumpσ n (x) uniformly at a rate faster than n tion (A5). Therefore as n → ∞, the confidence bands m(x) ± σ n,IID (x)QNs +1 (α) , (3.13) m(x) ± σ n,LONG (x)QNs +1 (α) , (3.14) with QNs +1 (α) given in (3.7), and the pointwise intervals m(x)±σ n,IID (x)Z1−α/2 , m(x)± σ n,LONG (x)Z1−α/2 have asymptotic confidence level 1 − α. 49 3.5 Simulation To illustrate the finite-sample performance of the spline approach, we generated data from the model ( ) Yij = m Xij + ∑2 ( ) ξ ϕ Xij + σεij , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n, k=1 ik k with X ∼ Uniform[0, 1], ξ k ∼ Normal(0, 1), k = 1, 2, ε ∼ Normal(0, 1), Ni having a discrete uniform distribution from 25, · · · , 35, for 1 ≤ i ≤ n, and m(x) = sin {2π (x − 1/2)} , ϕ1 (x) = √ √ −2 cos {π (x − 1/2)} / 5, ϕ2 (x) = sin {π (x − 1/2)} / 5, thus λ1 = 2/5, λ2 = 1/10. The noise levels were σ = 0.5, 1.0, the number of subjects n was taken to be 20, 50, 100, 200, the confidence levels were 1 − α = 0.95, 0.99, and the constant c in the definition of Ns in Section 3.4 was taken to be 1, 2, 3. We found that the confidence band (3.13) did not have good coverage rates for moderate sample sizes, and hence in Table 3.1 we report the coverage as the percentage out of the total 200 replications for which the true curve was covered by (3.14) at the 101 points {k/100, k = 0, . . . , 100}. At all noise levels, the coverage percentages for the confidence band (3.14) are very close to the nominal confidence levels 0.95 and 0.99 for c = 1, 2, but decline for c = 3 when n = 20, 50. The coverage percentages thus depend on the choice of Ns , and the dependency becomes stronger when sample sizes decrease. For large sample sizes n = 100, 200, the effect of the choice of Ns on the coverage percentages is insignificant. Because Ns varies with Ni , for 1 ≤ i ≤ n, the data-driven selection of some ”optimal” Ns remains an open problem. 
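To make the implementation of Sections 3.4-3.5 concrete, the following minimal sketch (in Python, with all variable and function names ours) simulates sparse functional data from the design above with σ = 0.5 and assembles the confidence band (3.14). Two simplifications are assumed: the pilot estimators of f, σ²_Y and G(x, x) are computed on the same Ns + 1 subintervals as m̂ rather than on a separate pilot grid with N*s = [n^{1/3}] knots, and Ĝ(x, x) is approximated by within-subject cross-products of residuals inside each subinterval instead of the full tensor-product spline fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate sparse functional data as in Section 3.5 (sigma = 0.5)
n = 50
m_true = lambda x: np.sin(2 * np.pi * (x - 0.5))
phi1 = lambda x: -2 * np.cos(np.pi * (x - 0.5)) / np.sqrt(5)
phi2 = lambda x: np.sin(np.pi * (x - 0.5)) / np.sqrt(5)
sigma = 0.5

X, Y, subj = [], [], []
for i in range(n):
    Ni = rng.integers(25, 36)                     # N_i uniform on {25, ..., 35}
    x = rng.uniform(0, 1, Ni)
    xi1, xi2 = rng.standard_normal(2)             # random coefficients xi_{i1}, xi_{i2}
    y = m_true(x) + xi1 * phi1(x) + xi2 * phi2(x) + sigma * rng.standard_normal(Ni)
    X.append(x); Y.append(y); subj.append(np.full(Ni, i, dtype=int))
X, Y, subj = map(np.concatenate, (X, Y, subj))
NT = len(Y)

# piecewise-constant spline fit (3.9): bin means over the Ns + 1 equal subintervals
Ns = int(NT ** (1 / 3) * np.log(n))               # Ns = [c * NT^{1/3} * log n] with c = 1
hs = 1.0 / (Ns + 1)
J = np.minimum((X / hs).astype(int), Ns)          # subinterval index J(x)
m_hat = np.array([Y[J == j].mean() for j in range(Ns + 1)])

# pilot estimates of f, sigma_Y^2 and G(x, x) on the same subintervals (simplified)
f_hat = np.bincount(J, minlength=Ns + 1) / (NT * hs)
resid = Y - m_hat[J]
s2Y_hat = np.array([np.mean(resid[J == j] ** 2) for j in range(Ns + 1)])
G_hat = np.empty(Ns + 1)
for j in range(Ns + 1):
    prods = []
    for i in range(n):
        r = resid[(subj == i) & (J == j)]
        if len(r) > 1:                            # within-subject cross-products, j != j'
            prods.append((r.sum() ** 2 - (r ** 2).sum()) / (len(r) * (len(r) - 1)))
    G_hat[j] = np.mean(prods) if prods else 0.0
G_hat = np.maximum(G_hat, 0.0)                    # G(x, x) is a variance, truncate at zero

# sigma_{n,LONG}(x) and the extreme-value quantile Q_{Ns+1}(alpha) of (3.7)
Ni_all = np.bincount(subj)
sig_iid = np.sqrt(s2Y_hat / (f_hat * hs * NT))
sig_long = sig_iid * np.sqrt(1 + ((Ni_all ** 2).sum() / NT - 1) * G_hat * f_hat * hs / s2Y_hat)
alpha = 0.05
a = np.sqrt(2 * np.log(Ns + 1))
b = a - np.log(2 * np.pi * a ** 2) / (2 * a)
Q = b - np.log(-0.5 * np.log(1 - alpha)) / a

lower, upper = m_hat - sig_long * Q, m_hat + sig_long * Q
mid = (np.arange(Ns + 1) + 0.5) * hs              # midpoints of the subintervals
print("Ns =", Ns, " band covers the true mean on this run:",
      bool(np.all((m_true(mid) >= lower) & (m_true(mid) <= upper))))
```

Repeating the last step over independent replications and recording the coverage indicator yields uniform coverage frequencies of the kind reported in Table 3.1.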
We next examine two alternative methods to compute the confidence band, based on the observation that the estimated mean function m(x) and the confidence intervals are 50 Table 3.1: Uniform coverage rates in Chapter 3 σ n 20 0.5 50 100 200 20 1.0 50 100 200 1−α 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 0.950 0.990 c=1 0.920 0.990 0.960 0.995 0.955 1.000 0.950 0.985 0.935 0.990 0.975 0.995 0.950 0.995 0.940 0.985 c=2 0.930 0.990 0.965 0.995 0.955 1.000 0.965 0.985 0.930 0.990 0.960 0.995 0.940 0.990 0.965 0.995 c=3 0.800 0.900 0.910 0.965 0.955 0.985 0.975 0.990 0.735 0.870 0.895 0.980 0.935 0.990 0.960 0.995 Note: uniform coverage rates from 200 replications using the confidence band (3.14). For each sample size n, the first row is the coverage of a nominal 95% confidence band, while the second row is for a 99% confidence band. step functions that remain the same on each subinterval χJ , 0 ≤ J ≤ Ns . Follwing an associate editor’s suggestion, locally weighted smoothing was applied to the upper and lower confidence limits to generate a smoothed confidence band. Following a referee’s suggestion to treat the number (Ns + 1) of subintervals as fixed instead of growing to infinity, a naive parametric confidence band was computed as m(x) ± σ n,LONG (x)Q1−α.Ns +1 (3.15) } is the (1 − α) quantile of the maximal in which Q1−α.Ns +1 = Z{ 1+(1−α)1/(Ns +1) /2 absolute values of (Ns + 1) iid N (0, 1) random variables. We compare the performance of 51 the confidence band in (3.14), the smoothed band and naive parametric band in (3.15). Given n = 20 with Ns = 8, 12, and n = 50 Ns = 44 (by taking c = 1 in the definition of Ns in Section 3.4), σ = 0.5, 1.0, and 1 − α = 0.99, Table 3.2 reports the coverage percentages P , Pnaive , Psmooth and the average maximal widths W, Wnaive , Wsmooth of Ns + 1 intervals out of 200 replications calculated from confidence bands (3.14), (3.15), and the smoothed confidence bands, respectively. Table 3.2: Uniform coverage rates and average maximal widths in Chapter 3 n σ 20 0.5 1.0 50 0.5 1.0 Ns 8 12 8 12 44 44 P 0.820 0.930 0.910 0.960 0.990 0.990 Pnaive 0.505 0.765 0.655 0.820 0.960 0.975 Psmooth 0.910 0.955 0.970 0.985 0.990 1.000 W 1.490 1.644 1.725 1.937 1.651 2.054 Wnaive 1.210 1.363 1.401 1.606 1.522 1.893 Wsmooth 1.480 1.628 1.721 1.928 1.609 2.016 Note: uniform coverage rates and average maximal widths of confidence intervals from 200 replications using the confidence bands (3.14), (3.15), and the smoothed bands respectively, for 1 − α = 0.99. In all experiments, one has Psmooth > P > Pnaive and W > Wsmooth > Wnaive . The coverage percentages for both the confidence bands in (3.14) and the smoothed bands are much closer to the nominal level than those of the naive bands in (3.15), while the smoothed bands perform slightly better than the constant spline bands in (3.14), with coverage percentages closer to the nominal and smaller widths. Based on these observations, the naive band is not recommended due to poor coverage. As for the smoothed band, although it has slightly better coverage than the constant spline band, its asymptotic property has yet to be established, and the second step smoothing adds to its conceptual complexity and computational burden. Therefore with everything considered, the constant spline band is recommended for 52 its satisfactory theoretical property, fast computing, and conceptual simplicity. 
For visualization of the actual function estimates, at σ = 0.5 with n = 20, 50, Figures 3.1 and 3.2 depict the simulated data points and the true curve, and Figures 3.3, 3.4, 3.5 and 3.6 show the true curve, the estimated curve, the uniform confidence band, and the pointwise confidence intervals.

Figure 3.1: Plots of simulated data for n = 20 in Chapter 3. Note: plots of simulated data scatter points at σ = 0.5, n = 20, and the true curve.

Figure 3.2: Plots of simulated data for n = 50 in Chapter 3. Note: plots of simulated data scatter points at σ = 0.5, n = 50, and the true curve.

3.6 Empirical Example

In this section, we apply the confidence band procedure of Section 3.4 to data collected in a study by the AIDS Clinical Trials Group, ACTG 315 (Zhou, Huang, and Carroll (2008)). In this study, 46 HIV-1 infected patients were treated with potent antiviral therapy consisting of ritonavir, 3TC and AZT. After initiation of the treatment on day 0, patients were followed for up to 10 visits. Scheduled visit times common to all patients were 7, 14, 21, 28, 35, 42, 56, 70, 84, and 168 days. Since the patients did not follow the scheduled times exactly and/or missed some visits, the actual visit times Tij were irregularly spaced and varied from day 0 to day 196. The CD4+ cell counts during HIV/AIDS treatment are taken as the response variable Y from day 0 to day 196. Figure 3.7 shows that the data points (dots) are extremely sparse between day 100 and day 150, so we first transform the data by Xij = Tij^{1/3}. A histogram (not shown) indicates that the Xij values are distributed fairly uniformly. The number of interior knots in (3.9) is taken to be Ns = 6, so that the range of the visit time T, which is [0, 196], is divided into seven unequal subintervals; on each subinterval, the estimated mean CD4+ cell count and the confidence limits are constant. Table 3.3 gives the mean CD4+ cell counts and the confidence limits on each subinterval at simultaneous confidence level 0.95. For instance, from day 4 to day 14, the mean CD4+ cell count is 241.62, with lower and upper limits 171.81 and 311.43, respectively.

Table 3.3: Confidence limits for CD4 data set

  Days         Mean CD4+ cell counts   Confidence limits
  [0, 1)       178.23                  [106.73, 249.72]
  [1, 4)       200.32                  [130.51, 270.13]
  [4, 15)      241.62                  [171.81, 311.43]
  [15, 36)     271.87                  [194.70, 349.04]
  [36, 71)     299.51                  [222.34, 376.68]
  [71, 123)    280.78                  [203.50, 358.06]
  [123, 196]   299.27                  [221.99, 376.55]

Note: the mean CD4+ cell counts and the confidence limits on each subinterval at simultaneous confidence level 0.95.

Figure 3.7 depicts the 95% simultaneous (smoothed) confidence band according to (3.14) in medium (thin) lines, and Figure 3.8 depicts the pointwise 95% confidence intervals in thin lines. The center thick line is the piecewise-constant spline fit m̂(x). As expected, the pointwise confidence intervals are narrower than the uniform confidence band, and by the same constant factor at every x. Figure 3.7 is essentially a graphical representation of Table 3.3; both confirm that the mean CD4+ cell count generally increases over time, as Zhou, Huang, and Carroll (2008) pointed out. The advantage of the current method is that such inference on the overall trend is made with a predetermined type I error probability, in this case 0.05.

3.7 Discussion

In this chapter, we have constructed a simultaneous confidence band for the mean function m(x) for sparse longitudinal data via piecewise-constant spline fitting.
Our approach extends the asymptotic results in Wang and Yang (2009) for i.i.d. random designs to a much more complicated data structure by allowing dependence of measurements within each subject. 56 3 2 1 0 -2 -1 Y 0 0.5 X 1 Figure 3.3: Plots of confidence bands at 1 − α = 0.95, n = 20 in Chapter 3 Note: plots of confidence bands (3.14) (upper and lower solid lines), pointwise confidence intervals (upper and lower dashed lines), the spline estimator (middle thin line), and the true function (middle thick line) at 1 − α = 0.95, n = 20. 57 3 2 1 0 -2 -1 Y 0 0.5 X 1 Figure 3.4: Plots of confidence bands at 1 − α = 0.95, n = 50 in Chapter 3 Note: plots of confidence bands (3.14) (upper and lower solid lines), pointwise confidence intervals (upper and lower dashed lines), the spline estimator (middle thin line), and the true function (middle thick line) at 1 − α = 0.95, n = 50. 58 3 2 1 0 -2 -1 Y 0 0.5 X 1 Figure 3.5: Plots of confidence bands at 1 − α = 0.99, n = 20 in Chapter 3 Note: plots of confidence bands (3.14) (upper and lower solid lines), pointwise confidence intervals (upper and lower dashed lines), the spline estimator (middle thin line), and the true function (middle thick line) at 1 − α = 0.99, n = 20. 59 3 2 1 0 -2 -1 Y 0 0.5 X 1 Figure 3.6: Plots of confidence bands at 1 − α = 0.99, n = 50 in Chapter 3 Note: plots of confidence bands (3.14) (upper and lower solid lines), pointwise confidence intervals (upper and lower dashed lines), the spline estimator (middle thin line), and the true function (middle thick line) at 1 − α = 0.99, n = 50. 60 4 3 1 2 Y*E2 0 50 100 T 150 200 Figure 3.7: Plots of confidence bands for CD4 data Note: plots of the piecewise-constant spline estimator (thick line), the data (dots), confidence band (3.14) (upper and lower solid lines), and the smoothed band (upper and lower thin lines) at confidence level 0.95. 61 4 3 1 2 Y*E2 0 50 100 T 150 200 Figure 3.8: Plots of confidence intervals for CD4 data Note: plots of the piecewise-constant spline estimator (thick line), the data (dots), pointwise confidence intervals (upper and lower thin lines), and the smoothed band (upper and lower thin lines) at confidence level 0.95. 62 The proposed estimator has good asymptotic behavior, and the confidence band had coverage very close to the nominal in our simulation study. An empirical study for the mean CD4+ cell counts illustrates the practical use of the confidence band. Clearly the simultaneous confidence band in (3.14) can be improved in terms of both theoretical and numerical performance if higher order spline or local linear estimators are used. Constant piecewise spline estimators are less appealing and have sub-optimal convergence rates in the sense of Hall, M¨ller, and Wang (2006), which uses local linear approaches. u Establishing the asymptotic confidence level for such extensions, however, requires highly sophisticated extreme value theory, for sequences of non-stationary Gaussian processes over intervals growing to infinity. That is much more difficult than the proofs of this chapter. We consider the confidence band in (3.14) significant because it is the first of its kind for the longitudinal case with complete theoretical justification, and with satisfactory numerical performance for commonly encountered data sizes. Our methodology can be applied to construct simultaneous confidence bands for other ( ) functional objects, such as the covariance function G x, x′ and its eigenfunctions, see Yao (2007). 
It can also be adapted to the estimation of regression functions in the functional linear model, as in Li and Hsing (2007). We expect further research along these lines to yield deep theoretical results with interesting applications. 3.8 Appendix Throughout this section, an ∼ bn means lim bn /an = c, where c is some nonzero constant, n→∞ and for functions an (x), bn (x), an (x) = u {bn (x)} means an (x)/bn (x) → 0 as n → ∞ uniformly for x ∈ [0, 1]. 63 3.8.1 A.1. Preliminaries We first state some results on strong approximation, extreme value theory and the classic Bernstein inequality. These are used in the proofs of Lemma 3.7, Theorem 3.1, and Lemma 3.6. Lemma 3.1. (Theorem 2.6.7 of Cs˝rg˝ and R´v´sz (1981)) Suppose that ξ i , 1 ≤ i ≤ n o o e e are iid with E(ξ 1 ) = 0, E(ξ 2 ) = 1, and H(x) > 0 (x ≥ 0) is an increasing continuous 1 function such that x−2−γ H(x) is increasing for some γ > 0 and x−1 logH(x) is decreasing with EH (|ξ 1 |) < ∞. Then there exists a Wiener process {W (t) , 0 ≤ t < ∞} that is a Borel function of ξ i , 1 ≤ i ≤ n, and constants C1 , C2 , a > 0 which depend only on the distribution of ξ 1 , such that for any {xn }∞ satisfying H −1 (n) < xn < C1 (nlogn)1/2 n=1 ∑k and Sk = i=1 ξ i , { P } max Sk − W (k) > xn 1≤k≤n ≤ C2 n {H (axn )}−1 . (n) (n) Lemma 3.2. Let ξ i , 1 ≤ i ≤ n, be jointly normal with ξ i ∼ N (0, 1). Let (n) (n) (n) Eξ i ξ j be such that for γ > 0, Cr > 0, rij < Cr /nγ , i ̸= j. Then for τ ∈ R, { { } ( ) (n) −τ , in which M ∞, P Mn,ξ ≤ τ /an + bn → exp −2e n,ξ = max ξ 1 , . . . , (n) rij = as n → } (n) ξn and an , bn are as in (3.7) with Ns + 1 replaced by n. Proof. Let { }n { } { } η i i=1 be i.i.d. standard normal r.v.’s, u = ui n , v = vi n i=1 i=1 be vectors of real numbers, and w = min (|u1 | , . . . , |un | , |v1 | , . . . , |vn |). By the Normal Comparison Lemma (Leadbetter, Lindgren and Rootz´n (1983), Lemma 11.1.2), e { P } { } (n) −vj < ξ j ≤ uj for j = 1, . . . , n − P −vj < η j ≤ uj for j = 1, . . . , n 64 ≤ 4 ∑ (n) r 1≤i (2 − ε) logn, for any ε > 0 and large n. Since 1 − rij ≥ 1 − (Cr /nγ )2 → 1 n (n)2 (n) as n → ∞, i ̸= j, for i ̸= j, ∃Cr2 > 0 such that 1 − rij ≥ Cr2 > 0 and 1 + rij < 1 + ϵ for any ϵ > 0 and large n. Let Mn,η = max {|η 1 | , . . . , |η n |}. By Leadbetter, Lindgren and ( ) { } Rootz´n (1983), Theorem 1.5.3, P Mn,η ≤ τ n → exp −2e−τ as n → ∞, while the e above results entail ( ( ) ) 2 −1/2   ( ) 4 ∑ (n) (n)  −w2  P Mn,ξ ≤ τ n − P Mn,η ≤ τ n ≤ rij 1 − rij exp   (n) 2π 1 + rij i 0 such that for r = 3, 4, ..., E |ξ 1 |r ≤ cr−2 r!Eξ 2 < +∞. 1 1 ( ) √ √ Then for each n > 1, t > 0, P (|Sn | ≥ nσt) ≤ 2 exp −t2 (4 + 2ct/ nσ)−1 , in which Sn = ∑n i=1 ξ i . Lemma 3.4. Under Assumption (A2), as n → ∞ for cJ,n defined in (3.5), ) ⟨ ⟩ ( ) ( cJ,n = f tJ hs 1 + rJ,n , bJ , b ′ ≡ 0, J ̸= J ′ , where max0≤J≤Ns rJ,n ≤ Cω (f, hs ). J ( )}r { 1−r/2 1−r/2 There exist constants CB > cB > 0 such that cB hs ≤ CB hs ≤ E BJ Xij for r = 1, 2, . . . and 1 ≤ J ≤ Ns + 1, 1 ≤ j ≤ Ni , 1 ≤ i ≤ n. 65 Proof. By the definition of cJ,n in (3.5), ∫ cJ,n = ∫ ∫ ( ) { ( )} bJ (x)f (x)dx = [ ] f (x)dx = f tJ hs + [ ] f (x) − f tJ dx. tJ ,tJ+1 tJ ,tJ+1 ( ) ( ) ∫ ] f (x) − f t Hence for all J = 0, ..., Ns , cJ,n − f tJ hs ≤ [ J dx ≤ ω (f, hs ) hs , tJ ,tJ+1 { ( ) } ( ) or rJ,n = cJ,n − f tJ hs f tJ hs −1 ≤ Cω (f, hs ) , J = 0, ..., Ns . By (3.8), { ( ( )}r ( )−r/2 ∫ )1−r/2 1−r/2 E BJ Xij = cJ,n bJ (x)f (x)dx = cJ,n ∼ hs . Proof of Proposition 3.1. 
By Lemma 3.4 and Assumption (A2) on the continuity of functions ϕ2 (x), σ 2 (x) and f (x) on [0, 1], for any x ∈ [0, 1] k ∫ ∫ χJ(x) ϕk (x)f (x)du − ∫ J(x) χJ(x) ( ) ( ) 1+β ϕk (u) f (u) du ≤ ω ϕk f, hs hs = O hs , { } ( ) ( ) 1+β σ 2 (x)f (x) − σ 2 (u) f (u) du ≤ ω σ 2 f, hs hs = O hs . Y Y Y Hence, σ 2 (x) = c−2 (nEN1 )−1 n J(x),n ∫ σ 2 (u) f (u) du× Y J(x)   2 { }−1  ∫ ∫ κ  E {N1 (N1 − 1)} ∑  2 (u) f (u) du  1+ ϕk (u) f (u) du σ   EN1 χJ(x) J(x) Y   k=1    )} { ( )} { ( 1+β 1+β −2 {1+ (nEN1 )−1 σ 2 (x)f (x)hs + U hs = f (x)hs + U hs Y  κ )}2 { )}−1  ( ( ∑{ E {N1 (N1 − 1)} 1+β 1+β σ 2 (x)f (x)hs + U hs ϕk (x)f (x)hs + U hs Y  EN1 k=1 66 { E {N1 (N1 − 1)} = (f (x)hs nEN1 )−1 σ 2 (x) 1 + Y EN1 } ∑κ ( )} ϕ2 (x)f (x)hs { β k=1 k 1 + U hs σ 2 (x) Y { ( )} { ( )} β β = σ2 (x) 1 + U hs = σ2 (x) 1 + U hs . n,LONG n,IID A.2. Proof of Theorem 1 −1/2 Note that BJ(x) (x) ≡ c , x ∈ [0, 1], so the terms ξ k (x) and ε(x) defined in (3.12) J(x),n are NS ∑ n Ni ( ) ( ) −2 ∑ ∑ B X −1 ξ k (x) = NT BJ (x) BJ 2,N ϕk Xij ξ ik ij J T i=1 j=1 J=0 n Ni ( ) ( ) ∑∑ −2 −1/2 −1 BJ(x) Xij ϕk Xij ξ ik , =c BJ(x) N J(x),n 2,NT T i=1 j=1 n Ni ( ) ( ) ∑∑ −2 −1/2 −1 BJ(x) Xij σ Xij εij . ε(x) = c BJ(x) NT J(x),n 2,NT i=1 j=1 Let 2 −1/2 −1 ∑n ξ k (x) = BJ(x) ξ k (x) = c N R ξ , J(x),n T i=1 ik,ξ,J(x) ik 2,NT 2 −1/2 −1 ∑n ∑Ni ε(x) = BJ(x) ε(x) = c N R ε , J(x),n T i=1 j=1 ij,ε,J(x) ij 2,NT (3.16) where ( ) ( ) ) ( ) ( ∑N i B X Rik,ξ,J = ij ϕk Xij , Rij,ε,J = BJ Xij σ Xij , 0 ≤ J ≤ Ns . j=1 J (3.17) 67 Lemma 3.5. Under Assumption (A3), for e(x) given in (3.11) and ξ k (x), ε(x) given in (3.16), we have e(x) −  κ ∑  k=1   ξ k (x) + ε(x)  ≤ An (1 − An )−1 κ ∑ ξ k (x) + ε(x) , x ∈ [0, 1], k=1 where An = sup0≤J≤N BJ 2 2,NT − 1 . There exists CA > 0, such that for large n, S ( ) ) (√ √ P An ≥ CA log (n) / (nhs ) ≤ 2n−3 . An = Oa.s. log (n) / (nhs ) as n → ∞. See the Supplement of Wang and Yang (2009) for a detailed proof. Lemma 3.6. Under Assumptions (A2) and (A3), for R1k,ξ,J , R11,ε,J in (3.17), [ ∫ −1 E (N ) b (u) ϕ2 (u) f (u) du+ 2 ER1k,ξ,J = cJ,n 1 J k )2 ] (∫ , E {N1 (N1 − 1)} bJ (u) ϕk (u) f (u) du 2 ER11,ε,J = c−1 J,n ∫ bJ (u) σ 2 (u) f (u) du, 0 ≤ J ≤ Ns , ] [ 2 2 there exist 0 < cR < CR < ∞, such that ER1k,ξ,J , ER11,ε,J ∈ cR , CR for 0 ≤ J ≤ (√ ) ∑ 2 2 Ns , sup0≤J≤Ns n−1 n Rik,ξ,J − ER1k,ξ,J = Oa.s. logn/ (nhs ) , 1 ≤ k ≤ κ, i=1 (√ ) ∑n ∑Ni −1 2 2 NT logn/ (nhs ) as n → ∞. sup Rij,ε,J − ER11,ε,J = Oa.s. i=1 j=1 0≤J≤Ns 68 Proof. By independence of X1j , 1 ≤ j ≤ N1 and N1 and (3.8), {∑ { ( ) ( ) ( ) ( ) }} N1 E BJ X1j BJ X ′ ϕk X1j ϕk X ′ |N1 1j 1j j,j ′ =1 {∑ } { ( ) ( ) } N1 2 X 2 X =E E BJ + 1j ϕk 1j |N1 j=1 {∑ { ( ) ( ) ( ) ( ) }} N1 E E BJ X1j BJ X ′ ϕk X1j ϕk X ′ |N1 1j 1j j̸=j ′ { ∫ = c−1 E (N1 ) bJ (u) ϕ2 (u) f (u) du+ k J(x),n 2 ER1k,ξ,J = E )2 } (∫ E {N1 (N1 − 1)} bJ (u) ϕk (u) f (u) du . 2 It is easily shown that ∃ 0 < cR < CR < ∞ such that cR ≤ ER1k,ξ,J ≤ CR , 0 ≤ J ≤ Ns . ( ) 2 ∗ =ζ Let ζ i,J = ζ i,k,J = Rik,ξ,J , ζ i,J i,J − E ζ 1,J for r ≥ 1 and large n, {∑ {∑ ( ) ( )}2r ( )}2r Ni Ni 2r E E ζ i,J =E B Xij ϕk Xij ≤ Cϕ B Xij j=1 J j=1 J   ν 1 +···+ν N =2r (  Ni  )∏ {   ( )}ν  ∑ i 2r j 2r E = Cϕ E BJ Xij    0≤ν ···ν ≤2r ν 1 · · · ν Ni j=1    1 Ni ( )r   ( ) { ( )}ν  j 2r EN 2r C h1−r 2r E N 2r max ≤ Cϕ E BJ Xij ≤ Cϕ B s 1   1   j=1      ∏  Ni 1−r = C r!h1−r , 2r ≤ Cϕ CB cr r!hs ζ s N {∑ )r ( )}2r } ( { Ni ≥ c2r E ≥ c2r (EN1 ) cB h1−r , E ζ i,J E BJ Xij s ϕ ϕ j=1 by Lemma 3.4. 
So )}r )}r { ( )r { ( ( ′ ∼ 1, E ζ i,J for r ≥ 2, and ∃Cζ > E ζ 1,J ≫ E ζ 1,J 69 { ( ) }1/2 ′ > 0 such that C ′ h−1 ≥ σ 2 ≥ c′ h−1 , for σ ∗ = E ζ ∗ 2 cζ . We obtain ζ i,J ζ s ζ s ζ∗ )2 ( ) 1 { }n ( r with c∗ = Cζ /c′ r−2 h−1 , which implies that ζ ∗ E ζ∗ ≤ cr−2 r!E ζ ∗ s ∗ i,J i,J i,J i=1 ζ ∑n satisfies Cram´r’s condition. Applying Lemma 3.3 to i=1 ζ ∗ , for r > 2 and any large e i,J { } √ ∑n ∗ enough δ > 0, P n−1 i=1 ζ i,J ≥ δ logn/ (nhs ) is bounded by      ( )−1      ′ −δ 2 Cζ (logn) 2 exp ( ) 1 ( )     4 + 2 C /c′ r−2 δ c′ −1 h1/2 (logn)1/2 n−1/2     s ζ ζ ζ    −δ 2 (logn)  ≤ 2n−3 . ≤ 2 exp ′   4C ζ    n ∞  ∑ ∑ 2Ns √ 1 2 2 Rik,ξ,J − ER1k,ξ,J ≥ δ logn/ (nhs ) ≤ Hence P sup < ∞. 0≤J≤N n  n3 s i=1 n=1 n=1 ) (√ ∑ 2 2 Thus, sup0≤J≤Ns n−1 n Rik,ξ,J − ER1k,ξ,J = Oa.s. logn/ (nhs ) as n → ∞ i=1 ∞ ∑ by Borel-Cantelli Lemma. The properties of Rij,ε,J are obtained similarly. ), Order all Xij , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n from large to small as X(t) , X(1) ≥ · · · ≥ X( NT and denote the εij corresponding to X(t) as ε(t) . By (3.16), ( ) ( ) ∑N Tb X(t) σ X(t) ε(t) t=1 J(x) ( ) ( ){ ∑N } −1 −1 Tb = c NT X(t) σ X(t) St − St−1 , J(x),n t=1 J(x) ε(x) = c−1 N −1 J(x),n T where Sq = ∑q t=1 ε(t) , q ≥ 1 and S0 = 0. Lemma 3.7. Under Assumptions (A2)-(A5), there is a Wiener process {W (t) , 0 ≤ t < ∞} { } independent of Ni , Xij , 1 ≤ j ≤ Ni , ξ ik , 1 ≤ k ≤ κ, 1 ≤ i ≤ n , such that as n → ∞, 70 ( ) sup ε(0) (x) − ε(x) = oa.s. nt for some t < − (1 − ϑ) /2 < 0, where ε(0) (x) is x∈[0,1] ( cJ(x),n NT ) ) ( ( )−1 ∑N Tb X(t) σ X(t) {W (t) − W (t − 1)} , x ∈ [0, 1]. t=1 J(x) (3.18) Proof. Define MN = max1≤q≤N Sq − W (q) , in which {W (t) , 0 ≤ t < ∞} is the T T { } Wiener process in Lemma 3.1 that a Borel function of the set of variables ε(t) 1 ≤ t ≤ NT { } { } is independent of Ni , Xij , 1 ≤ j ≤ Ni , ξ ik , 1 ≤ k ≤ κ, 1 ≤ i ≤ n since ε(t) 1 ≤ t ≤ NT is. Further, supx∈[0,1] ε(0) (x) − ε(x) equals to ( ) ( ){ } ( ) −1 −1 b ( ) σ X( ) supx∈[0,1] c NT X W N T − SN J(x) NT NT J(x),n T ( ) ( ) ( ) ( )} ∑N −1 { T + bJ(x) X(t) σ X(t) − bJ(x) X(t+1) σ X(t+1) {W (t) − St } t=1 { ( ) ( ) −1 N −1 b ( ) σ X( ) + ≤ max c J X N NT T 0≤J≤Ns +1 Jn T ( ) ( ) ( ) ( )} ∑N −1 T bJ X(t) σ X(t) − bJ X(t+1) σ X(t+1) MN t=1 T ≤    )     ( ) ( ∑ −1 3Cσ + σ X(t) − σ X(t+1) max c−1 NT MN  T 0≤J≤Ns +1 J,n     1≤t≤NT −1,X(t) ∈bJ which, by the H¨lder continuity of σ in Assumption (A2), is bounded by o        ∑ β −1 3C + ∥σ∥ −1 M max c X(t) − X(t+1) ≤ NT σ NT 0,β  0≤J≤Ns +1 J,n      1≤t≤NT −1,X(t) ∈bJ  β          ∑    1−β  −1 3C + ∥σ∥ −1 M X(t) − X(t+1)  NT NT max cJ,n  σ 0,β nJ    J     1≤t≤NT −1,X(t) ∈bJ   71 −1 ≤ NT MN T ( )  β max c−1 J,n 3Cσ + ∥σ∥0,β hs 0≤J≤Ns +1 )1−β   ( max nJ 0≤J≤Ns +1  ) ∑NT ( where nJ = t=1 I X(t) ∈ χJ , 0 ≤ J ≤ Ns + 1, has a binomial distribution with ( ) ∫ parameters NT , pJ,n , where pJ,n = χ f (x)dx. Simple application of Lemma 3.3 entails ( )J −1 . Meanwhile, by letting H(x) = xr , x = nt′ , max0≤J≤Ns +1 nJ = Oa.s. NT Ns n t′ ∈ (2/r, β − (1 + ϑ) /2), the existence of which is due to the Assumption (A4) that r > { }N 2/ {β − (1 + ϑ) /2}. It is clear that ε(t) T satisfies the conditions in Lemma 3.1. Since t=1 ( ) ′ n = a−r n1−rt = O n−γ 1 for some γ 1 > 1, one can use the probability inequality H(axn ) ( ) ′ in Lemma 3.1 and the Borel-Cantelli Lemma to obtain MN = Oa.s. (xn ) = Oa.s. nt . T Hence Lemma 3.4 and the above imply sup x∈[0,1] } ( ){ ( ) −β −1 1−β (0) (x) − ε(x) = O t′ −1 ε NT Ns 1 + Ns a.s. 
Ns n ( ) t′ −1 + N nt′ −1 × N −1 n1−β = Oa.s. Ns n s s ( ) ( ) t′ −1 + N nt′ −β = o t′ −β+ϑ = Oa.s. Ns n s a.s. n since t′ < β − (1 + ϑ) /2 by definition, implying t′ − 1 ≤ t′ − β < − (1 + ϑ) /2. The Lemma follows by setting t = t′ − β + ϑ. Now ( ) ( ) ∑N −1 Tb X(t) σ X(t) Z(t) ε(0) (x) = c−1 NT J(x),n t=1 J(x) ( ) ( ) ∑n ∑N −1 −1 i b =c N Xij σ Xij Zij , J(x),n T i=1 j=1 J(x) 72 (3.19) where Z(t) = W (t) − W (t − 1) , 1 ≤ t ≤ NT , are i.i.d N (0, 1), ξ ik , Zij , Xij , Ni are independent, for 1 ≤ k ≤ κ, 1 ≤ j ≤ Ni , 1 ≤ i ≤ n, and ξ k (x), ε(0) (x) are conditional independent of Xij , Ni , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n. If the conditional variances of ξ k (x), ε(0) (x) ( ) on Xij , Ni are σ 2 ,n (x), σ 2 (x), we have ε,n ξk 1≤j≤Ni ,1≤i≤n { c−1 N −2 J(x),n T ∑n R2 i=1 ik,ξ,J(x) }1/2 σ ξ ,n (x) = k { }1/2 ∑n ∑N −1 −2 2 i R σ ε,n (x) = c N , J(x),n T i=1 j=1 ij,ε,J(x) (3.20) where Rik,ξ,J(x) , Rij,ε,J(x) , and cJ(x),n are given in (3.17) and (3.5). Lemma 3.8. Under Assumptions (A2) and (A3), let  −1/2  κ ∑   (0) (x) , 2 (x) + σ 2 (x) ξ k (x) + ε η(x) = σ ξ ,n ε,n     k k=1 k=1  κ ∑ (3.21) with σ ξ ,n (x), σ ε,n (x), ξ k (x), ε(0) (x), and cJ(x),n given in (3.20), (3.16), (3.18), and (3.5). k { } Then η(x) is a Gaussian process consisting of (Ns + 1) standard normal variables η J Ns J=0 such that η(x) = η J(x) for x ∈ [0, 1], and there exists a constant C > 0 such that for large n, sup Eη J η ′ ≤ Chs . 0≤J̸=J ′ ≤Ns J ( ) } { Proof. It is apparent that L η J Xij , Ni , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n = N (0, 1) for { } 0 ≤ J ≤ Ns , so L η J = N (0, 1), for 0 ≤ J ≤ Ns . For J ̸= J ′ , by (3.17) and (3.8), ( ) ( ) ( ) 2 X Rij,ε,J R = BJ Xij B ′ Xij σ ij = 0, along with (3.19), (3.18), the condiJ ij,ε,J ′ tional independence of ξ k (x), ε(0) (x) on Xij , Ni , 1 ≤ j ≤ Ni , 1 ≤ i ≤ n, and independence 73 ( ) of ξ ik , Zij , Xij , Ni , 1 ≤ k ≤ κ, 1 ≤ j ≤ Ni , 1 ≤ i ≤ n, E η J η ′ is J {{ ∑n {∑ }}−1/2 {∑ {∑ ∑N κ n κ 2 i R2 E Rik,ξ,J + R2 + ij,ε,J i=1 k=1 j=1 i=1 k=1 ik,ξ,J ′ }}−1/2 {∑ } {∑n } ∑N κ {∑n i R2 E Rik,ξ,J ξ ik Rik,ξ,J ′ ξ ik + j=1 ij,ε,J ′ k=1 i=1 i=1 } ( {∑ } {∑ ) }} n ∑Ni n ∑Ni Z Xij , Ni R Z R i=1 j=1 ij,ε,J ij i=1 j=1 ij,ε,J ′ ij = ECn,J,J ′   C = N −1 n,J,J ′  T in which  n ∑ κ ∑ i=1  2 Rik,ξ,J + −1/2  ∑N i R2 j=1 ij,ε,J  × k=1 −1/2   Ni  κ n    ∑ ∑ ∑ −1 −1 Rik,ξ,J R . NT R2 R2 + NT ′ ik,ξ,J ′  ij,ε,J ′      k=1 ik,ξ,J  j=1 i=1 k=1 i=1     n ∑ ∑ κ Note that according to definitions of Rik,ξ,J , Rij,ε,J , and Lemma 3.5, −1 NT −1 ≥ c2 NT σ {∑ } ∑N κ 2 i R2 R + i=1 k=1 ik,ξ,J j=1 ij,ε,J ∑n ( ) ∑N 2 2 2 i B2 X ij = cσ BJ 2,NT ≥ cσ (1 − An ) , for 0 ≤ J ≤ Ns , i=1 j=1 J ∑n         n κ  ∑ ∑  −1 2 2 P inf N Rik,ξ,J + Rij,ε,J  ×  0≤J̸=J ′ ≤Ns  T     i=1 k=1 j=1    2   √ Ni   n κ   ∑ ∑ ∑ log (n)    −1 −3 4 NT + R2 R2  ≥ 1 − 2n ,  ≥ cσ 1 − CA  ′  ′ ij,ε,J  ik,ξ,J  nhs  j=1 i=1 k=1  74 Ni ∑ by Lemma 3.5. Thus for large n, with probability ≥ 1 − 2n−3 , the numerator of C is n,J,J ′ uniformly greater than c2 /2. Applying Bernstein’s inequality to σ {∑ } −1 κ ∑n R NT R k=1 i=1 ik,ξ,J ik,ξ,J ′ , there exists C0 > 0 such that, for large n,  P sup 0≤J̸=J ′ ≤Ns  ∑κ ∑n −1 R R NT ≤ C0 hs  ≥ 1 − 2n−3 . k=1 i=1 ik,ξ,J ik,ξ,J ′ ( )−1 , Putting the above together, for large n, C1 = C0 c2 /2 σ ( P sup ) 0≤J̸=J ′ ≤Ns C ≤ C1 hs ≥ 1 − 4n−3 . n,J,J ′ Note that as a continuous random variable, sup0≤J̸=J ′ ≤N Cn,J,J ′ ∈ [0, 1] , thus s ( ∫ 1 ) E sup C 0≤J̸=J ′ ≤Ns n,J,J ′ = ( ) P sup C > t dt. 
0≤J̸=J ′ ≤Ns n,J,J ′ 0 ( ) For large n, C1 hs < 1 and then E sup C 0≤J̸=J ′ ≤Ns n,J,J ′ ∫ C h 1 s 0 is      ∫ 1    P sup C > t dt + P sup C ′ > t dt n,J,J ′   C1 hs 0≤J̸=J ′ ≤Ns n,J,J 0≤J̸=J ′ ≤Ns ≤ ∫ C h 1 s ∫ 1 1dt + 0 C1 hs 4n−3 dt ≤ C1 hs + 4n−3 ≤ Chs for some C > 0 and large enough n. The lemma now follows from ( sup0≤J̸=J ′ ≤N E Cn,J,J ′ s ) ( ≤ E sup0≤J̸=J ′ ≤N Cn,J,J ′ s 75 ) ≤ Chs . By Lemma 3.8, the (Ns + 1) standard normal variables η 0 , ..., η Ns satisfy the conditions of Lemma 3.2 Hence for any τ ∈ R, ( ) ( ) lim P supx∈[0,1] |η(x)| ≤ τ /aNs +1 + bNs +1 = exp −2e−τ . n→∞ (3.22) For x ∈ [0, 1], Rik,ξ,J , Rij,ε,J given in (3.17), define the ratio of population and sample { }1/2 { } ¯ ¯ quantities as rn (x) = nE (N1 ) /NT Rn (x)/R(x) 1/2 , with (∑ )} ∑N n κ 2 i R2 R + j=1 ij,ε,J(x) i=1 k=1 ik,ξ,J(x) ∑κ 2 2 ¯ R(x) = (EN1 )−1 ER1k,ξ,J(x) + ER11,ε,J(x) . k=1 ¯ Rn (x) = N −1 T {∑ Lemma 3.9. Under Assumptions (A2), (A3), for η(x), σ n (x) in (3.21), (3.6), σ n (x)−1 {∑κ } (0) (x) − η(x) = |r (x) − 1| |η(x)| ξ (x) + ε n k=1 k (3.23) { } (√ ) as n → ∞, supx∈[0,1] aNs +1 |rn (x) − 1| = Oa.s. {log (Ns + 1)} (logn) / (nhs ) . Proof. Equation (3.23) follows from the definitions of η(x) and σ n (x). By Lemma 3.6, (√ ) n Ni −1 ∑ ∑ R2 = Oa.s. − ER2 supx∈[0,1] N logn/ (nhs ) , 11,ε,J(x) T i=1 j=1 ij,ε,J(x) supx∈[0,1] N −1 T ∑κ k=1 ∑n i=1 2 Rik,ξ,J(x) − (EN1 )−1 76 ∑κ k=1 2 ER1k,ξ,J(x) ≤ supx∈[0,1] (EN1 )−1 + supx∈[0,1] (EN1 )−1 ∑κ k=1 n−1 ∑n ∑κ 2 R2 − ER1k,ξ,J(x) i=1 ik,ξ,J(x) n (EN1 ) N −1 − 1 n−1 T k=1 ∑n R2 i=1 ik,ξ,J(x) ) (√ (√ ( ) ) −1/2 = O = Oa.s. logn/ (nhs ) + Oa.s. n logn/ (nhs ) , a.s. ¯ and there exist constants 0 < cR < CR < ∞ such that for all x ∈ [0, 1], cR < R(x) < CR . ¯ ¯ ¯ ¯ ¯ ¯ Thus, supx∈[0,1] Rn (x) − R(x) is bounded by ∑κ ∑κ ∑n 2 2 Rik,ξ,J(x) − (EN1 )−1 ER1k,ξ,J(x) + supx∈[0,1] N −1 T k=1 i=1 k=1 (√ ) ∑n ∑N 2 i R2 − ER11,ε,J(x) = Oa.s. supx∈[0,1] N −1 logn/ (nhs ) . T i=1 j=1 ij,ε,J(x) { } { } { } ¯ ¯ ¯ ¯ ¯ sup Rn (x) 1/2 − R(x) 1/2 ≤ sup Rn (x) − R(x) sup R(x) −1/2 x∈[0,1] (√ ) {x∈[0,1] } x∈[0,1] = Oa.s. logn/ (nhs ) . Then supx∈[0,1] aNs +1 |rn (x) − 1| is bounded by Thus   { }1/2 { }1/2  { }1/2 ¯ ¯ nE (N1 ) /NT sup aNs +1 Rn (x)/R(x) − 1 + 1 − nE (N1 ) /NT   x∈[0,1]  { }1/2 { } { } { } ¯ ¯ ¯ sup R(x) −1/2 sup ≤ aNs +1 nE (N1 ) /NT Rn (x) 1/2 − R(x) 1/2  x∈[0,1] x∈[0,1] } }1/2 ) { (√ {log (Ns + 1)} (logn) / (nhs ) . + 1 − nE (N1 ) /NT = Oa.s. Proof of Proposition 3.2. The proof follows from Lemmas 3.5, 3.7, 3.9, (3.22), and Slutsky’s Theorem. 77 Proof of Theorem 3.1. By Theorem 3.2, ∥m(x) − m(x)∥∞ = Op (hs ), so   { } √  sup σ −1 (x) |m(x) − m(x)| = Op (nhs )1/2 log (Ns + 1)hs = op (1) , aNs +1 n x∈[0,1]   κ ∑ ξ k (x) + ε(x)  = op (1) . aNs +1  sup σ −1 (x) |m(x) − m(x)| − sup σ −1 (x) n n x∈[0,1] x∈[0,1] k=1 Meanwhile, (3.11) and Proposition 3.2 entail that, for any τ ∈ R,     sup σ −1 (x) lim P a n n→∞  Ns +1 x∈[0,1] κ ∑ k=1    ( ) ξ k (x) + ε(x) − bNs +1  ≤ τ = exp −2e−τ .  Thus Slutsky’s Theorem implies that       ( ) lim P aNs +1  sup σ −1 (x) |m(x) − m(x)| − bNs +1  ≤ τ = exp −2e−τ . n n→∞   x∈[0,1] { } Let τ = −log − 1 log (1 − α) , definitions of aNs +1 , bNs +1 , and QNs +1 (α) in (3.7) entail 2 { } lim P m(x) ∈ m(x) ± σ n (x)QNs +1 (α) , ∀x ∈ [0, 1] n→∞     = lim P Q−1+1 (α) sup σ −1 (x) |e(x) + m(x) − m(x)| ≤ 1 = 1 − α. n n→∞  Ns  x∈[0,1] by (3.11). That σ n (x)−1 {m(x) − m(x)} →d N (0, 1) for any x ∈ [0, 1] follows by directly using η(x) ∼ N (0, 1), without reference to supx∈[0,1] |η(x)|. 
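The Gumbel-type limit exp(−2e^{−τ}) used in Lemma 3.2 and in the proof of Theorem 3.1 can be checked numerically. The standalone snippet below (our own illustration, not part of the proof) takes the idealized case of independent standard normals, corresponding to rij = 0, and compares the empirical distribution of the maximum of Ns + 1 absolute values, normalized with the constants aNs+1 and bNs+1 of (3.7), against the limit; for a moderate number of variables the agreement is only approximate, reflecting the well-known slow convergence of extremes.

```python
import numpy as np

rng = np.random.default_rng(1)
Nb = 45                                   # e.g. Ns + 1 = 45, as in the n = 50 simulation
a = np.sqrt(2 * np.log(Nb))
b = a - np.log(2 * np.pi * a ** 2) / (2 * a)

for tau in (0.0, 1.0, 2.0, 3.0):
    z = np.abs(rng.standard_normal((100_000, Nb))).max(axis=1)
    emp = np.mean(z <= tau / a + b)
    lim = np.exp(-2 * np.exp(-tau))
    print(f"tau = {tau:.1f}: empirical {emp:.3f}  vs  limit exp(-2e^-tau) = {lim:.3f}")
```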
78 Chapter 4 Spline-backfitted Kernel Smoothing of Partially Linear Additive Model 4.1 Introduction This chapter is based on Ma and Yang (2011b). Since the 1980’s, non- and semiparametric analysis of time series has been vigorously pursued, see, for example, Tjøstheim and Auestad (1994) and Huang and Yang (2004). There are few satisfactory smoothing tools for multidimensional time series data, however, due to the poor convergence rate of nonparametric estimation of multivariate functions, known as the “curse of dimensionality”. One solution is the partially linear additive model (PLAM) studied in Li (2000), Fan and Li (2003) and Liang, Thurston, Ruppert, Apanasovich and Hauser (2008) ∑d ∑d ( ) ( ) 1 c t + 2 m (x ) Yi = m Xi , Ti + σ Xi , Ti εi , m(x, t) = c00 + 0l l l=1 α=1 α α 79 (4.1) }n { }n { in which the sequence Yi , XT , TT = Yi , Xi1 , ..., Xid , Ti1 , . . . Tid . The funci i i=1 2 1 i=1 tions m and σ are the mean and standard deviation of the response Yi conditional on the predictor vector {Xi , Ti }, and εi is a white noise conditional on {Xi , Ti }. For identifiabil( ) ity, both additive and linear components must be centered, i.e., Emα Xiα ≡ 0, 1 ≤ α ≤ d2 , ETil = 0, 1 ≤ l ≤ d1 . ( ) { }n If parameters c0l ≡ 0, 1 ≤ l ≤ d1 , Ti1 , . . . Tid are redundant. Yi , Xi1 , ..., Xid 1 2 i=1 follow an additive model. For applications of additive model, see, N´cher, Ojeda, Cadarsoa Su´rez, Roca-Pardi˜as and Acu˜a (2006), Roca-Pardi˜as, Cadarso-Su´rez, N´cher and Acu˜a a n n n a a n (2006), Gonz´lez-Manteiga, Mart´ a ınez-Miranda and Raya-Miranda (2008). Additive model, ( however, is only appropriate to model nonparametric effects of continuous predictors Xi1 , ) ..., Xid supported on compact intervals. The effects of possibly discrete and/or unbounded 2 ) ( predictors can be neatly modeled as some of the variables Ti1 , . . . Tid in the PLAM (4.1), 1 see the simulation example in Section 4.3 where Ti1 , Ti2 are normal conditional on Xi , Ti3 is discrete and Ti4 has positive density over a compact interval, and Section 4.4 which shows that the simpler PLAM fits the Boston housing data much better than an additive model. For general references on partially linear model, see Schimek (2000) and Liang (2006). For applications of partially linear model to panel data, see Su and Ullah (2006), while for data with measurement errors, see Liang, Wang and Carroll (2007) and Liang et al. (2008). { }d d2 Satisfactory estimators of functions {mα (xα )}α=1 and constants c0l 1 in model l=0 }n { (4.1) based on Yi , XT , TT i i i=1 should be (i) computationally expedient; (ii) theoretically reliable and (iii) intuitively appealing. Kernel procedures for PLAM, such as Fan and Li (2003) and Liang et al. (2008) satisfy criterion (iii) and partly (ii) but not (i) since they are computationally intensive when sample size n is large, as illustrated in the Monte-Carlo 80 results of Xue and Yang (2006). It is mentioned in Li (2000) that the computation time of estimating a PLAM is about n times of estimating a partially linear model with d2 = 1 by using the kernel marginal integration method. For discussion of computation burden issues by kernel methods, see Li (2000). Spline approaches of Li (2000), Schimek (2000) to PLAM, do not satisfy criterion (ii) as they lack limiting distribution, but are fast to compute, thus satisfying (i). 
The SBK estimator we propose combines the best features of both kernel and spline methods, and is essentially as fast and accurate as an univariate kernel smoothing, satisfying all three criteria (i)-(iii). We propose to extend the “spline-backfitted kernel smoothing” (SBK) of Wang and { }d Yang (2007) to PLAM (4.1). If the regression coefficients c0l 1 and the component l=0 { ( )}d { } 2 functions mβ xβ were known by “oracle”, one could create Yiα , Xiα n i=1 β=1,β̸=α ( ) ∑d2 ∑d1 m Xiβ c0l Til − with Yiα = Yi − c00 − β=1,β̸=α β l=1 ( ) ( ) = mα Xiα +σ Xi , Ti εi , from which one could compute an “oracle smoother” to estimate the only unknown function mα (xα ), bypassing the “curse of dimensionality”. The idea was ( ) to obtain approximations to the unobservable variables Yiα by substituting mβ Xiβ , 1 ≤ i ≤ n, 1 ≤ β ≤ d2 , β ̸= α, with spline estimates and argue that the error incurred by ( ) this “cheating” is of smaller magnitude than the rate O n−2/5 for estimating mα (xα ) from the unobservable data. Lemmas 4.9, 4.14, 4.17 and 4.18 establish the estimators’ uniform oracle efficiency by “reducing bias via undersmoothing (step one) and averaging out the variance (step two)”, via the joint asymptotics of kernel and spline functions. A major theoretical innovation is to resolve the dependence between T and X, making use of Assumption (A5), which is not needed in Wang and Yang (2007). Another significant innovation is the √ n-consistency and asymptotic distribution of estimators for parameters 81 { c0l }d1 , which is trivial for the additive model of Wang and Yang (2007). l=0 This chapter is organized as follows. The SBK estimators are introduced in Section 4.2 with theoretical properties. Section 4.3 contains Monte Carlo results to demonstrate the asymptotic properties of SBK estimators for moderate dimensions. The SBK estimator is applied to the Boston housing data in Section 4.4. Proofs of technical lemmas are in the Appendix. 4.2 The SBK Estimators ( ) For convenience, we denote vectors as x = x1 , ..., xd T and take ∥ · ∥ as the usual Euclidean √ d , i.e., ∥x∥ = ∑d 2 norm on R α=1 xα , and ∥·∥∞ the sup norm, i.e., ∥x∥∞ = sup1≤α≤d |xα |. We denote by Ir the r × r identity matrix, 0r×s the zero matrix of dimension r × s, and { }n diag(a, b) the 2 × 2 diagonal matrix with diagonal entries a, b. Let Yi , XT , TT i i i=1 be a sequence of strictly stationary observations from a geometrically α-mixing process following )T ( ( ) , (Ti1 , . . . Tid )T } are the i-th model (4.1), where Yi and Xi , Ti = { Xi1 , ..., Xid 2 1 response and predictor vector. Denote Y = (Y1 , ..., Yn )T the response vector. Without loss of generality, we assume Xα is distributed on [0, 1] , 1 ≤ α ≤ d2 . An integer N = Nn ∼ n1/4 log n is pre-selected. Denote the class of Lipschitz continuous functions for a constant } { ′ ) ≤ C x − x′ , ∀x, x′ ∈ [0, 1] . C > 0 as Lip([0, 1] , C) = m| m(x) − m(x We use the second order B spline (or linear B spline) basis bJ (x) = bJ,2 (x), 0 ≤ J ≤ N + 1 which is defined in Section 1.3 of Chapter 1. Let the distance between neighboring interior or boundary knots be H = Hn = (N + 1)−1 . Define next the space G of partially 82 linear additive spline functions as the linear space spanned by { } 1, tl , bJ (xα ) , 1 ≤ l ≤ d1, 1 ≤ α ≤ d2 , 1 ≤ J ≤ N + 1 , { { } ( )}n and let 1, Tl , bJ Xiα i=1 , 1 ≤ l ≤ d1, 1 ≤ α ≤ d2 , 1 ≤ J ≤ N + 1 span the space Gn ⊂ Rn . As n → ∞, with probability approaching 1, the dimension of Gn becomes {1 + d1 + d2 (N + 1)}. 
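For concreteness, the short sketch below (an illustration with our own variable names and an arbitrary i.i.d. design, not the mixing design of the chapter) constructs the linear B-spline basis bJ,2 on equally spaced knots as hat functions and assembles the design matrix spanning Gn with columns {1, tl, bJ(xα), J = 1, ..., N + 1}; its column count equals the dimension 1 + d1 + d2(N + 1) noted above.

```python
import numpy as np

def linear_bspline_basis(x, N):
    """Linear B-splines b_0, ..., b_{N+1} (hat functions) on [0, 1]
    with N equally spaced interior knots, H = 1/(N+1)."""
    H = 1.0 / (N + 1)
    knots = np.arange(N + 2) * H
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / H)

def plam_design(X, T, N):
    """Design matrix of G_n with columns {1, T_l, b_J(X_alpha), J = 1, ..., N+1}."""
    n, d2 = X.shape
    cols = [np.ones((n, 1)), T]
    for alpha in range(d2):
        cols.append(linear_bspline_basis(X[:, alpha], N)[:, 1:])   # drop b_0; 1 is already a column
    return np.hstack(cols)

rng = np.random.default_rng(2)
n, d1, d2 = 500, 2, 3
N = int(n ** 0.25 * np.log(n))            # N ~ n^{1/4} log n, as pre-selected above
X = rng.uniform(0, 1, (n, d2))
T = rng.standard_normal((n, d1))
D = plam_design(X, T, N)
print(D.shape[1], "=", 1 + d1 + d2 * (N + 1))   # dimension 1 + d1 + d2(N + 1)
```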
The spline estimator of m (x, t) is the unique element m (x, t) = ˆ { ( )} mn (x, t) from G so that m Xi , Ti T ˆ ˆ 1≤i≤n best approximates the response vector Y. To be precise, we define m (x, t) = c00 + ˆ ˆ ∑d 1 c t + ˆ l=1 0l l ∑d ∑N +1 2 c b (x ) , ˆ α=1 J=1 J,α J α (4.2) ( ) ˆ ˆ ˆ minimize where the coefficients c00 , c0l , cJ,α 1≤l≤d1, 1≤J≤N +1,1≤α≤d2 { ∑n i=1 Yi − c0 − } ∑N +1 ∑d ( ) 2 2 1 cT − c b Xiα . α=1 J=1 J,α J l=1 l il ∑d { }d { }d Pilot estimators of cT = c0l 1 and mα (xα ) are ˆT = c0l 1 and mα (xα ) = c ˆ ˆ l=0 l=0 ( ) ∑N +1 ∑ ∑N +1 cJ,α bJ (xα ) − n−1 n ˆ ˆ i=1 J=1 cJ,α bJ Xiα , which are used to define pseudo J=1 ˆ responses Yiα , estimates of the unobservable “oracle” responses Yiα : ) ( ∑d ∑d 1 c T − 2 ˆ0l il mβ Xiβ , ˆ l=1 β=1,β̸=α ( ) ∑d ∑d 1 c T − 2 Yiα = Yi − c00 − m Xiβ . l=1 0l il β=1,β̸=α β ˆ Yiα = Yi − c00 − ˆ (4.3) { }n ˆ , the SBK estimator mSBK,α (xα ) of mα (xα ) mimics the would-be Based on Yiα , Xiα ˆ i=1 83 { } Nadaraya-Watson estimator mK,α (xα ) of mα (xα ) based on Yiα , Xiα n , if the unob˜ i=1 { }n servable responses Yiα i=1 were available } { ∑n ( ) ˆ ˆ mSBK,α (xα ) = n−1 ˆ Kh Xiα − xα Yiα /fα (xα ), i=1 { } ∑n ( ) ˆ mK,α (xα ) = n−1 ˜ Kh Xiα − xα Yiα /fα (xα ), i=1 (4.4) ( ) ∑ ˆ ˆ with Yiα , Yiα in (4.3), fα (xα ) = n−1 n Kh Xiα − xα an estimator of fα (xα ). i=1 Without loss of generality, let α = 1. Under Assumptions A1-A5 and A7, it is straightforward to verify (as in Bosq 1998) that as n → ∞, ( ) supx ∈[h,1−h] mK,1 (x1 ) − m1 (x1 ) = op n−2/5 log n , ˜ 1 } { } √ { D 2 nh mK,1 (x1 ) − m1 (x1 ) − b1 (x1 ) h2 → N 0, v1 (x1 ) , ˜ ∫ { } 2 K (u) du m′′ (x ) f (x ) /2 + m′ (x ) f ′ (x ) f −1 (x ) , where, b1 (x1 ) = u 1 1 1 1 1 1 1 1 1 1 ∫ ] [ −1 2 v1 (x1 ) = K 2 (u) duE σ 2 (X, T) |X1 = x1 f1 (x1 ) . (4.5) It is shown in Li (2000) and Schimek (2000) that the spline estimator m1 (x1 ) in the first step ˆ uniformly converges to m1 (x1 ) with certain convergence rate, but lacks asymptotic distribution. Theorem 4.1 below states that the difference between mSBK,1 (x1 ) and mK,1 (x1 ) is ˆ ˜ ( ) op n−2/5 uniformly, dominated by the asymptotic uniform size of mK,1 (x1 ) − m1 (x1 ). ˜ So mSBK,1 (x1 ) has identical asymptotic distribution as mK,1 (x1 ). ˆ ˜ Theorem 4.1. Under Assumptions A1-A7, as n → ∞, the SBK estimator mSBK,1 (x1 ) ˆ ( ) ˆ ˜ given in (4.4) satisfies supx ∈[0,1] mSBK,1 (x1 ) − mK,1 (x1 ) = op n−2/5 . Hence with 1 84 2 b1 (x1 ) and v1 (x1 ) as defined in (4.5), for any x1 ∈ [h, 1 − h], √ { } { } D 2 ˆ nh mSBK,1 (x1 ) − m1 (x1 ) − b1 (x1 ) h2 → N 0, v1 (x1 ) . Instead of Nadaraya-Watson estimator, one can use local polynomial estimator, see Fan and Gijbels (1996). Under Assumptions A1-A7, for any α ∈ (0, 1), an asymptotic 100 (1 − α) % confidence intervals for m1 (x1 ) is mSBK,1 (x1 ) − ˆ1 (x1 ) h2 ± zα/2 v1 (x1 ) (nh)−1/2 ˆ b ˆ (4.6) 2 where ˆ1 (x1 ) and v1 (x1 ) are estimators of b1 (x1 ) and v1 (x1 ) respectively. b ˆ2 The following corollary provides the asymptotic distribution of mSBK (x). The proof of ˆ this corollary is straightforward and therefore omitted. Corollary 4.1. Under Assumptions A1-A7 and the additional assumption mα ∈ C (2) [0, 1] , 2 ≤ d2 d2 d2 ∑ 2 ∑ ∑ vα (xα ), mSBK,α (xα ), b (x) = ˆ bα (xα ), v 2 (x) = α ≤ d2 . 
Let mSBK (x) = ˆ α=1 α=1 α=1 for any x ∈ [0, 1]d2 , with SBK estimators mSBK,α (xα ) , 1 ≤ α ≤ d2 , defined in (4.4), and ˆ 2 bα (xα ), vα (xα ) similarly defined as in (4.5), as n → ∞, √ } { { } ∑d D 2 → N 0, v 2 (x) 2 m (x ) − b (x) h nh mSBK (x) − ˆ α=1 α α Next theorem describes the asymptotic behavior of estimator ˆ for c. c Theorem 4.2. Under Assumptions A1-A6, as n → ∞,       √ 0, σ 2 c additional Assumption A8, n (ˆ − c) →d N  0   85 ( ) −1/2 . With the ∥ˆ − c∥ = Op n c    ( ) 1 0T  d1 , for Σ = cov T ˜   0d Σ−1  1 ˜ with random vector T defined in (4.10). To construct confidence sets for c, Σ is consistently estimated by ∑n ˆ ∑d2 ∑N +1 ∗ ∗ ˆ ˆ ˆ i=1 Ti,l,n Ti,l′ ,n in which Tl,n = Tl − α=1 J=1 aJ,α bJ,α (xα ), where bJ,α (xα ) ≡ ( ) ∑ bJ (xα ) − n−1 n bJ Xiα is the empirical centering of bJ (x) for the α-th variable Xα , i=1 ( ) defined in the Appendix and aJ,α ˆ minimize 1≤J≤N +1,1≤α≤d2 n−1 2 ∑d ∑N +1 ∗ (X ) . 2 Tl − a b α=1 J=1 J,α J,α α n 4.3 Simulation In this section, we analyze some simulated data examples to illustrate the finite-sample behavior of SBK estimators. The number of interior knots N in (4.2) is given by N = ([ ] [ ]) min c1 n1/4 log n + c2 , (n/2 − 1 − d1 ) d−1 − 1 , in which [a] denotes the integer part 2 of a. In our implementation, we have used c1 = c2 = 1. The additional constraint that N ≤ (n/2 − 1 − d1 ) d−1 − 1 ensures that the number of terms in the linear least squares 2 problem (4.2), 1 + d1 + d2 (N + 1), is no greater than n/2, which is necessary when the sample size n is moderate. { } The i.i.d. data Yi , Xi , Ti n is generated according to the partially linear additive i=1 model (4.1), which satisfies Assumptions A1-A5, and A8 Yi = 2 + ∑d ∑d 1 T + 2 m (X ) + σ ε , m (x) ≡ sin (2πx) , 1 ≤ α ≤ d , 0 i α 2 il l=1 α=1 α iα ( ( ) ) where σ 0 = 2, εi ∼ N (0, 1) is independent of Xi , Ti , Ti = Ti1 , Ti2 , Ti3 , Ti4 such that ( ) Ti3 , Ti4 , Ti1 , Ti2 are independent, Ti3 = ±1 with probability 1/2, Ti4 ∼ U (−0.5, 0.5), 86 ( ) ( ( ) ( ))) 5−sin(2πx) Ti1 , Ti2 ′ ∼ N (0, 0)′ , diag a Xi1 , a Xi2 , a (x) = . 5+sin(2πx) { }T ( )d2 Xi = Xiα α=1 is generated from the vector autoregression (VAR) equation Xiα = {( } ) 2 1/2 Z Φ 1−a iα − 1/2, 1 ≤ α ≤ d2 with stationary distribution ( ) ( )T ( ) 2 −1 Σ Zi = Zi1 , ..., Zid ∼ N 0d , 1 − a 2 2 ( ( ) ) 2 −1 Σ , Z = aZ Z1 ∼ N 0d , 1 − a i i−1 + εi , εi ∼ N (0, Σ) , 2 ≤ i ≤ n, 2 Σ = (1 − r) Id ×d + r1d 1T , 0 < a < 1, 0 < r < 1, 2 2 2 d2 ( { } So Xi n is geometrically α-mixing with marginal distribution U [−0.5, 0.5]. i=1 We obtained for comparison the SBK estimator mSBK,α (xα ) and the “oracle” smoother ˆ mK,α (xα ) by Nadaraya-Watson regression using quartic kernel and the rule-of-thumb band˜ width. To see that mSBK,α (xα ) is as efficient as mK,α (xα ) for numerical performance, we ˆ ˜ define the empirical relative efficiency of mSBK,α (xα ) with respect to mK,α (xα ) as ˆ ˜  ( )}2 1/2 ∑n { ˜ i=1 mK,α (xα ) − mα Xiα   effα =  . { ( )}2  ∑n ˆ i=1 mSBK,α (xα ) − mα Xiα (4.7) Theorem 4.1 indicates effα should be close to 1 for 1 ≤ α ≤ d2 . Figures 4.1, 4.2, 4.3, and 4.4 provide the kernel density estimates of 100 empirical efficiencies α = 2, 3, sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at σ 0 = 2, d2 = 3, and d2 = 30. The vertical line at efficiency = 1 is the standard line for ˆ ˜ the comparison of mSBK,α (xα ) and mK,α (xα ). 
One clearly sees that the center of the density plots is going toward the standard line at 1 with narrower spread when sample size n is increasing, confirmative to Theorem 4.1. 87 0 0.5 1 Y 1.5 2 2.5 Efficiency of the 2-nd estimator 0 0.5 1 1.5 2 2.5 X Figure 4.1: Kernel density plots for α = 2, d1 = 4, d2 = 3 in Chapter 4 Note: kernel density plots of the 100 empirical efficiencies of mSBK,α (xα ) to mK,α (xα ), ˆ ˜ computed according to (4.7) for sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at α = 2, d1 = 4, d2 = 3. 88 0 0.5 1 Y 1.5 2 2.5 Efficiency of the 3-rd estimator 0 0.5 1 1.5 2 2.5 X Figure 4.2: Kernel density plots in Chapter 4 Note: kernel density plots of the 100 empirical efficiencies of mSBK,α (xα ) to mK,α (xα ), ˆ ˜ computed according to (4.7) for sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at α = 3, d1 = 4, d2 = 3. 89 0 0.5 1 Y 1.5 2 2.5 Efficiency of the 2-nd estimator 0 0.5 1 1.5 2 2.5 X Figure 4.3: Kernel density plots for α = 2, d1 = 4, d2 = 30 in Chapter 4 Note: kernel density plots of the 100 empirical efficiencies of mSBK,α (xα ) to mK,α (xα ), ˆ ˜ computed according to (4.7) for sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at α = 2, d1 = 4, d2 = 30. 90 0 0.5 1 Y 1.5 2 2.5 Efficiency of the 3-rd estimator 0 0.5 1 1.5 2 2.5 X Figure 4.4: Kernel density plots for α = 3, d1 = 4, d2 = 30 in Chapter 4 Note: kernel density plots of the 100 empirical efficiencies of mSBK,α (xα ) to mK,α (xα ), ˆ ˜ computed according to (4.7) for sample sizes n = 100 (solid lines), 200 (dashed lines), 500 (thin lines) and 1000 (thick lines) at α = 3, d1 = 4, d2 = 30. 91 To see that c0l is as efficient as c0l , we define the asymptotic efficiency of c0l with respect ˆ ˆ ˜  1/2 }2 ∑100 { c0l,t −c0l /100  ˜  t=1 to c0l as effl =  ∑ ˜ ˜ ˆ ˜ ˆ , where c0l,t , c0l,t are values of c0l , c0l for the  { }2 100 c −c t=1 ˆ0l,t 0l /100 t-th replication in the simulation. For n = 200, 500, d2 = 3, Table 4.1 lists the frequencies of 95% confidence interval coverage of the SBK estimators for the regression coefficients { }d1 c0l , the sample mean squared error (MSE) and the asymptotic efficiency. The coverage l=0 frequencies are all close to the nominal level of 95%. As expected, increase in sample size reduces the sample MSE and increases the asymptotic efficiency. 
Table 4.1: Estimation of parameters for the linear part in Chapter 4 r c00 c01 c02 c03 c04 a 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0.3 0 0 0.3 0.3 0 0 0.3 0.3 0 0 0.3 0.3 0 0 0.3 0.3 0 0 0.3 0.3 95% CI coverage frequency 0.92 (0.92) 0.92 (0.91) 0.89 (0.92) 0.89 (0.92) 0.95 (0.90) 0.99 (0.91) 0.98 (0.95) 0.96 (0.94) 0.96 (0.95) 0.97 (0.97) 0.95 (0.95) 0.97 (0.97) 0.96 (0.97) 0.92 (0.98) 0.93 (0.96) 0.96 (0.97) 0.95 (0.96) 0.94 (0.95) 0.92 (0.95) 0.93 (0.96) MSE 0.0241 (0.010) 0.0264 (0.0095) 0.0263 (0.0096) 0.0282 (0.0103) 0.0330 (0.0152) 0.0297 (0.0143) 0.0296 (0.0121) 0.0336 (0.0134) 0.0306 (0.0115) 0.0378 (0.0118) 0.0329 (0.0112) 0.0336 (0.0104) 0.0259 (0.0087) 0.0301 (0.0074) 0.0362 (0.0078) 0.0258 (0.0078) 0.4006 (0.1229) 0.3771 (0.1111) 0.3867 (0.1154) 0.3533 (0.1138) Asymptotic Efficiency 0.8806 (0.8406) 0.8403 (0.8588) 0.8446 (0.8536) 0.8146 (0.8270) 0.8795 (0.8892) 0.9217 (0.9069) 0.9157 (0.9949) 0.8635 (0.9491) 0.8809 (0.8659) 0.7914 (0.8553) 0.8523 (0.8757) 0.8397 (0.9039) 0.8892 (0.8983) 0.7527 (0.9327) 0.8264 (0.9178) 0.8919 (0.9101) 0.7873 (0.9181) 0.8117 (0.9661) 0.8019 (0.9470) 0.8388 (0.9537) Note: estimation of c = (c00 , c01 , c02 , c03 , c04 )′ with d2 = 3, n = 200 (outside parentheses), n = 500 (inside parentheses). 92 For visualization of the actual function estimates, at noise level σ 0 = 2 with sample size n = 200, 500, we plot mK,α (xα ) (thin curves), mSBK,α (xα ) (thick curves) and their 95% ˜ ˆ pointwise confidence intervals (upper and lower medium curves) for mα (dashed curves) in Figures 4.5, 4.6, 4.7, and 4.8. The SBK estimators seem rather satisfactory and their performance improves with increasing n. 4.4 Application In this section we apply our method to the well-known Boston housing data, which contains 506 different houses from a variety of locations in Boston Standard Metropolitan Statistical Area in 1970. The median value and 13 sociodemographic statistics values of the Boston houses were first studied by Harrison and Rubinfeld (1978) to estimate the housing price index model. Breiman and Friedman (1985) did further analysis to deal with the multi-collinearity for overfitting by using a stepwise method. The response and explanatory variables of interest are: MEDV: Median value of owner-occupied homes in $1000’s RM: average number of rooms per dwelling TAX: full-value property-tax rate per $10, 000 PTRATIO: pupil-teacher ratio by town school district LSTAT: proportion of population that is of ”lower status” in %. Wang and Yang (2009b) fitted an additive model using RM, log(TAX), PTRSATIO and log(LSTAT) as predictors to test the linearity of the components and found that only PTRATIO is accepted at the significance level 0.05 for the linearity hypothesis test. Based on the conclusion drawn from Wang and Yang (2009b), we fitted a partial linear additive 93 0 -1.5 -1 -0.5 Y 0.5 1 1.5 2 n=200, Estimation of component #2 0 X 0.5 Figure 4.5: Plots of the estimator for the nonparametric part at α = 2, n = 200 Note: plots of the oracle smoother mK,α (xα ) (thin curve), the SBK estimator ˜ mSBK,α (xα ) (thick curve) defined in (4.4), and the 95% pointwise confidence intervals ˆ constructed by (4.6) (upper and lower medium curves) of the function components mα (xα ), α = 2 (dashed curve), d1 = 4, d2 = 3. 
Figure 4.6: Plots of the estimator for the nonparametric part at α = 2, n = 500. Note: as in Figure 4.5, with α = 2 and n = 500, d1 = 4, d2 = 3.

Figure 4.7: Plots of the estimator for the nonparametric part at α = 3, n = 200. Note: as in Figure 4.5, with α = 3 and n = 200, d1 = 4, d2 = 3.

Figure 4.8: Plots of the estimator for the nonparametric part at α = 3, n = 500. Note: as in Figure 4.5, with α = 3 and n = 500, d1 = 4, d2 = 3.

The fitted model is
$$\mathrm{MEDV} = c_{00} + c_{01}\times\mathrm{PTRATIO} + m_1(\mathrm{RM}) + m_2(\log(\mathrm{TAX})) + m_3(\log(\mathrm{LSTAT})) + \varepsilon.$$
As in Wang and Yang (2009b), the number of interior knots is N = 5. In Figure 4.9, the univariate nonlinear function estimates (dashed lines) and the corresponding simultaneous confidence bands (thin lines) are displayed together with the "pseudo data points" (dots), whose pseudo responses are the backfitted responses obtained by subtracting the sum function of the remaining covariates as in (4.3). The confidence bands are used to test the linearity of the nonparametric components. In Figures 4.9, 4.10 and 4.11 the straight solid lines are the least squares regression lines through the pseudo data points. In Figure 4.9, the confidence band with confidence level 0.999999 does not totally cover the straight regression line, i.e., the p-value is less than 0.000001. Similarly, the linearity of the component functions for log(TAX) and log(LSTAT) is rejected at the significance levels 0.017 and 0.007, respectively. The estimators $\hat{c}_{00}$ and $\hat{c}_{01}$ of $c_{00}$ and $c_{01}$ are 33.393 and −0.58845, and both are significant with p-values close to 0. The correlation between the estimated and observed values of MEDV is 0.89944, much higher than the 0.80112 obtained by Wang and Yang (2009b). This improvement comes from fitting the variable PTRATIO directly as linear, with the higher accuracy of a parametric fit, instead of treating it unnecessarily as a nonparametric variable. In other words, our simpler partially linear additive model (PLAM) fits the housing data much better than the additive model of Wang and Yang (2009b).

Figure 4.9: Plots of the estimators for RM for Boston housing data (confidence level = 0.999999). Note: plots of the least squares regression estimator (solid line), confidence bands (upper and lower thin lines), the spline estimator (dashed line) and the data (dots).
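As a rough illustration of how such a partially linear additive fit can be computed, the Python sketch below regresses MEDV on a linear PTRATIO term plus centered spline bases in RM, log(TAX) and log(LSTAT) by ordinary least squares. It is only a sketch under assumptions: the truncated-power basis stands in for the B-spline basis of Chapter 1, the data are assumed to be available as numpy arrays (e.g., read from a local copy of the Boston housing file), and the spline-backfitted kernel refinement and the confidence bands of this chapter are not reproduced, so the reported numbers above should not be expected to match exactly.

```python
import numpy as np

def centered_spline_basis(x, n_knots=5, degree=3):
    """Cubic truncated-power basis (a stand-in for the B-spline basis):
    columns x, x^2, x^3 and (x - kappa_j)_+^3 at interior knots kappa_j
    placed at sample quantiles; columns are centered."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    cols = [x ** d for d in range(1, degree + 1)]
    cols += [np.clip(x - k, a_min=0.0, a_max=None) ** degree for k in knots]
    B = np.column_stack(cols)
    return B - B.mean(axis=0)

def fit_plam(medv, ptratio, rm, tax, lstat, n_knots=5):
    """Least squares fit of
    MEDV = c00 + c01*PTRATIO + m1(RM) + m2(log TAX) + m3(log LSTAT) + error,
    with each nonparametric component represented by a centered spline basis."""
    design = np.column_stack([
        np.ones_like(medv),
        ptratio,
        centered_spline_basis(rm, n_knots),
        centered_spline_basis(np.log(tax), n_knots),
        centered_spline_basis(np.log(lstat), n_knots),
    ])
    coef, *_ = np.linalg.lstsq(design, medv, rcond=None)
    fitted = design @ coef
    corr = np.corrcoef(fitted, medv)[0, 1]   # compare with the correlation reported above
    return coef[0], coef[1], corr            # rough analogues of c00 and c01
```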
Figure 4.10: Plots of the estimators for log(TAX) for Boston housing data (confidence level = 0.983). Note: plots of the least squares regression estimator (solid line), confidence bands (upper and lower thin lines), the spline estimator (dashed line) and the data (dots).

Figure 4.11: Plots of the estimators for log(LSTAT) for Boston housing data (confidence level = 0.993). Note: plots of the least squares regression estimator (solid line), confidence bands (upper and lower thin lines), the spline estimator (dashed line) and the data (dots).

4.5 Appendix

Throughout this section, $a_n \gg b_n$ means $\lim_{n\to\infty} b_n/a_n = 0$, and $a_n \sim b_n$ means $\lim_{n\to\infty} b_n/a_n = c$, where c is some nonzero constant. We state the following assumptions.

A1. Given $1 \le \alpha \le d_2$, $m_\alpha \in C^{(2)}[0,1]$, while there is a constant $0 < C_\infty < \infty$ such that $m'_\beta \in \mathrm{Lip}([0,1], C_\infty)$, $\forall 1 \le \beta \le d_2$ and $\beta \neq \alpha$.

A2. The vector process $\{Z_t\}_{t=-\infty}^{\infty} = \{(X_t^T, T_t^T, \varepsilon_t)\}_{t=-\infty}^{\infty}$ is strictly stationary and geometrically strongly mixing, that is, its $\alpha$-mixing coefficient satisfies $\alpha(k) \le K_0 e^{-\lambda_0 k}$ for some $K_0, \lambda_0 > 0$, where
$$\alpha(k) = \sup_{B\in\sigma\{Z_t,\,t\le 0\},\,C\in\sigma\{Z_t,\,t\ge k\}} |P(B\cap C) - P(B)P(C)|. \qquad (4.8)$$

A3. The noise $\varepsilon_i$ satisfies $E(\varepsilon_i\,|\,\mathcal{F}_i) = 0$, $E(\varepsilon_i^2\,|\,\mathcal{F}_i) = 1$, $E(|\varepsilon_i|^{2+\delta}\,|\,\mathcal{F}_i) < M_\delta < +\infty$ for some $\delta > 2/3$, $M_\delta > 0$, and $\sigma$-fields $\mathcal{F}_i = \sigma\{(X_{i'}, T_{i'}),\, i' \le i;\ \varepsilon_{i'},\, i' \le i-1\}$, $1 \le i \le n$. The function $\sigma(x,t)$ is continuous with
$$0 < c_\sigma \le \inf_{x\in[0,1]^{d_2},\,t\in R^{d_1}} \sigma(x,t) \le \sup_{x\in[0,1]^{d_2},\,t\in R^{d_1}} \sigma(x,t) \le C_\sigma < \infty.$$

A4. The density function $f(x)$ of $X$ and the marginal densities $f_\alpha(x_\alpha)$ of $X_\alpha$ satisfy $f \in C[0,1]^{d_2}$, $0 < c_f \le \inf_{x\in[0,1]^{d_2}} f(x) \le \sup_{x\in[0,1]^{d_2}} f(x) \le C_f < \infty$, and $f_\alpha \in C^{(1)}[0,1]$.

A5. There exist constants $0 < c_\delta \le C_\delta < +\infty$, $0 < c_Q \le C_Q < +\infty$ such that $c_\delta \le E(|T_l|^{2+\delta}\,|\,X = x) \le C_\delta$, $\forall x \in [0,1]^{d_2}$, $1 \le l \le d_1$, and $c_Q I_{(d_1+1)\times(d_1+1)} \le Q(x) \le C_Q I_{(d_1+1)\times(d_1+1)}$, where $Q(x) = E\{(1, T^T)^T(1, T^T)\,|\,X = x\}$.

A6. The number of interior knots $N = N_n \sim n^{1/4}\log n$, i.e., $c_N n^{1/4}\log n \le N \le C_N n^{1/4}\log n$ for some positive constants $c_N, C_N$.

A7. The kernel function $K \in \mathrm{Lip}([-1,1], C_\infty)$ for some $C_\infty > 0$ is bounded, nonnegative, symmetric, and supported on $[-1,1]$. The bandwidth $h \sim n^{-1/5}$, i.e., $c_h n^{-1/5} \le h \le C_h n^{-1/5}$ for positive constants $c_h, C_h$.

Assumption A1 on the smoothness of the component functions is greatly relaxed and is close to being minimal. Assumption A2 is typical in the time series literature, while Assumptions A3-A5 are typical in the nonparametric smoothing literature; see, for instance, Fan and Gijbels (1996).

For $\phi, \varphi$ on $[0,1]^{d_2}\times R^{d_1}$, define the empirical inner product and empirical norm as $\langle\phi,\varphi\rangle_n = n^{-1}\sum_{i=1}^n \phi(X_i,T_i)\varphi(X_i,T_i)$, $\|\phi\|_n^2 = n^{-1}\sum_{i=1}^n \phi^2(X_i,T_i)$. If $\phi, \varphi$ are $L_2$-integrable, we define the theoretical inner product and theoretical $L_2$ norm as $\langle\phi,\varphi\rangle = E\{\phi(X_i,T_i)\varphi(X_i,T_i)\}$, $\|\phi\|^2 = E\{\phi^2(X_i,T_i)\}$, and denote $E_n\phi = \langle\phi,1\rangle_n$. $\phi$ is empirically (theoretically) centered if $E_n\phi = 0$ ($E\phi = 0$). For theoretical analysis, define the centered spline basis as $b_{J,\alpha}(x_\alpha) = b_J(x_\alpha) - \frac{c_{J,\alpha}}{c_{J-1,\alpha}}\, b_{J-1}(x_\alpha)$, $\forall 1 \le \alpha \le d_2$, $1 \le J \le N+1$, where $c_{J,\alpha} = Eb_J(X_\alpha) = \int b_J(x_\alpha) f_\alpha(x_\alpha)\,dx_\alpha$. The standardized basis is $B_{J,\alpha}(x_\alpha) = b_{J,\alpha}(x_\alpha)/\|b_{J,\alpha}\|$, $\forall 1 \le \alpha \le d_2$, $1 \le J \le N+1$.
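Assumptions A6 and A7 pin down the orders of the two tuning parameters. A minimal sketch of the corresponding default choices follows; the constants c_N and c_h are set to 1 purely for illustration, and the equally spaced knot placement is a common default rather than necessarily the exact choice used in the implementation.

```python
import numpy as np

def tuning_parameters(n, c_N=1.0, c_h=1.0):
    """Orders prescribed by Assumptions A6 and A7 (constants illustrative):
    N ~ n^{1/4} log n interior knots and kernel bandwidth h ~ n^{-1/5}."""
    N = int(np.ceil(c_N * n ** 0.25 * np.log(n)))
    h = c_h * n ** (-1.0 / 5.0)
    interior_knots = np.linspace(0.0, 1.0, N + 2)[1:-1]  # equally spaced on [0, 1]
    return N, h, interior_knots

N, h, knots = tuning_parameters(500)
```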
103 (4.9) For the proof of Theorem 4.2, define the Hilbert space { } ∑d 2 p (x ) , Ep (X ) = 0, E 2 p (X ) < ∞ H = p (x) = α α α α α=1 α α of theoretically centered L2 additive functions on [0, 1]d2 , while denote by Hn its subspace { } spanned by BJ,α (xα ) , 1 ≤ α ≤ d2 , 1 ≤ J ≤ N + 1 . Denote { } ˜ ProjH Tl = pl (X) = argmin p∈H E Tl − p (X) 2 , Tl = Tl − ProjH Tl , { } ˜ ProjHn Tl = argmin p∈Hn E Tl − p (X) 2 , Tl,n = Tl − ProjHn Tl , for 1 ≤ l ≤ d1 , where ProjH Tl and ProjHn Tl are orthogonal projections of Tl unto subspaces H and Hn respectively. Denote next in vector form { } { } ˜ ˜ ˜ ˜ Tn = Tl,n , T = Tl . 1≤l≤d1 1≤l≤d1 (4.10) The next assumption is needed for the second part of Theorem 4.2. A8. Functions pl ∈ C [0, 1]d2 , 1 ≤ l ≤ d1 while σ (x, t) ≡ σ 0 , (x, t) ∈ [0, 1]d2 × Rd1 . m (x, t) can be expressed in terms of the standardized basis ˆ m (x, t) = c00 + ˆ ˆ ( ∑d ∑d ∑N +1 1 c t + 2 ˆ0l l c B ˆ (x ) , l=1 α=1 J=1 J,α J,α α (4.11) ) minimize where c00 , c0l , cJ,α ˆ ˆ ˆ 1≤l≤d1, 1≤J≤N +1,1≤α≤d2 } ∑n { ∑d ∑d ∑N +1 ( ) 2 1 cT − 2 X Yi − c0 − c B . i=1 l=1 l il α=1 J=1 J,α J,α iα 104 (4.12) While (4.2) is used for statistical implementation, algebraically equivalent (4.11) is for mathematical analysis. Pilot estimators of mα (xα ) and cT are mα (xα ) = ˆ ∑N +1 { }d ∗ cJ,α BJ,α (xα ) , ˆT = c00 , c0l 1 ˆ c ˆ ˆ l=1 J=1 (4.13) ( ) ∑ ∗ where BJ,α (xα ) ≡ BJ,α (xα ) − En BJ,α = BJ,α (xα ) − n−1 n BJ,α Xiα is the emi=1 pirical centering of BJ,α (xα ). The evaluation of m (x, t) at the n observations results in an ˆ { ( )} n-dimensional vector m Xi , Ti T ˆ 1≤i≤n , the projection of Y onto Gn with respect to the { } empirical inner product ⟨·, ·⟩n . In general, for any n-dimensional vector Λ = Λi T 1≤i≤n , we ˆ ∑d1 λ t + ˆ define Pn Λ (x, t) as the projection of Λ onto (Gn , ⟨·, ·⟩n ), i.e., Pn Λ (x, t) = λ0 + l=1 l l { } ∑d2 ∑N +1 ˆ ˆ ,λ ,λ ˆ ˆ α=1 J=1 λJ,α BJ,α (xα ), with Yi replaced by Λi and coefficients λ0 l J,α given in (4.12). Define the empirically centered additive components as ( ) ∑ ∗ ˆ Pn,α Λ (xα ) = N +1 λJ,α BJ,α (xα ) , 1 ≤ α ≤ d2 , and the linear component as Pn,c Λ T = J=1 { } ˆ ˆ λ0 , λl . Rewrite estimators m (x), mα (xα ), ˆ defined in (4.11) and (4.13) as ˆ ˆ c 1≤l≤d1 m (x, t) = Pn Y (x, t) , mα (xα ) = Pn,α Y (xα ) , ˆ = Pn,c Y. The noiseless and noisy comˆ ˆ c ponents are m (x, t) = Pn m (x, t) , mα (xα ) = Pn,α m (xα ) , ˜m = Pn,c m, ˜ ˜ c ˜ (x, t) = Pn E, ˜α (xα ) = Pn,α E (xα ) , ˜ε = Pn,c E, ε ε c (4.14) { ( )} { ( ) } for m = m Xi , Ti n , E = σ Xi , Ti εi n . Linearity of Pn , Pn,c , Pn,α , 1 ≤ α ≤ i=1 i=1 d2 , and the relation Y = m + E imply a crucial decomposition m (x, t) = m (x, t) + ˜ (x, t) , mα (xα ) = mα (xα ) + ˜α (xα ) , ˆ = ˜m + ˜ε . ˆ ˜ ε ˆ ˜ ε c c c 105 (4.15) ( )T Let ˜ = a0 , al , aJ,α a ˜ ˜ ˜ be the minimizer of 1≤l≤d1, 1≤J≤N +1,1≤α≤d2 { } ∑d ∑d ∑N +1 ( ) ( ) 2 1 aT − 2 σ Xi , Ti εi − a0 − a B X . i=1 l=1 l il α=1 J=1 J,α J,α iα ∑n (4.16) ( )T Similarly, ˜ = c00 , c0l , cJ,α c ˜ ˜ ˜ minimizes 1≤l≤d1, 1≤J≤N +1,1≤α≤d2 } ∑d ∑d ∑N +1 ( ) 2 1 cT − 2 m Xi , Ti − c0 − c B X . 
(4.17) i=1 l=1 l il α=1 J=1 J,α J,α iα ∑n { ( ) ( )−1 Then ˜ (x, t) = ˜T B (x, t) , ˜ = BT B ε a a BT E, with matrices { }T { ( )} B (x, t) = 1, tl , BJ,α (xα ) , B = B Xi , Ti T 1≤i≤n , 1≤l≤d1 ,1≤J≤N +1,1≤α≤d2 (4.18) and ˜, the solution of (4.16), equals to a    1     (   (     En T ′ l )T En B ′ ′ J ,α )T En Tl En BJ,α ⟨ ⟩ ⟨ ⟩ T ′ , Tl T ′ , BJ,α l n ⟩ ⟨ l n ⟩ ⟨ B ′ ′ , Tl B ′ ′ , BJ,α J ,α J ,α n n  ( )  −1 ∑n  n σ Xi , Ti εi   i=1   ( ) ∑n × n−1 Til σ Xi , Ti εi  i=1    ( ) ( ) ∑n   n−1 BJ,α Xiα σ Xi , Ti εi i=1        −1             1≤l,l′ ≤d1 ,1≤α,α′ ≤d2 , 1≤J,J ′ ≤N +1 (4.19)       1≤l≤d1 ,1≤J≤N +1,1≤α≤d2 Bernstein inequality below under geometric α-mixing is used in many proofs. 106 Lemma 4.1. [Theorem 1.4, page 31 of Bosq (1998)] Let {ξ t , t ∈ Z} be a zero mean real ∑n valued α-mixing process, Sn = t=1 ξ t . Suppose there exists c > 0 such that for t = 1, ..., n, k = 3, 4, ..., E |ξ t |k ≤ ck−2 k!Eξ 2 < +∞ (Cram´r’s condition) then for n > 1, e t integer q ∈ [1, n/2], ε > 0 and k ≥ 3 ( P (|Sn | ≥ nε) ≤ a1 exp − ) qε2 25m2 + 5cε 2 ([ + a2 (k) α n q+1 ])2k/(2k+1) , where α(·) is the α-mixing coefficient defined in (4.8) and ( ε2 n a1 = 2 + 2 1 + q 25m2 + 5cε 2 )   2k/(2k+1) 5m k , , a2 (k) = 11n 1 + ε with mr = max1≤i≤n ∥ξ t ∥r , r ≥ 2. Lemma 4.2. Under Assumptions A2, A4, A6, cf H/2 ≤ cJ,α ≤ Cf H and (i) ∃ con2 stants c0 (f ) , C0 (f ) depending on fα (xα ) , 1 ≤ α ≤ d2 , such that c0 (f ) H ≤ bJ,α ≤ C0 (f )H. (ii) uniformly for 1 ≤ J, J ′ ≤ N + 1, 1 ≤ α, α′ ≤ d2 , E BJ,α (Xα ) ≤ CH 1/2 , ( )2 ( )k ( ( ) ) E BJ,α Xiα B ′ ′ X ′ ≥ cf C −2 > 0, E BJ,α Xiα B ′ ′ X ′ ≤ J ,α iα J ,α iα f C k H 2−k , k ≥ 1. (iii) uniformly for 1 ≤ J, J ′ ≤ N + 1, 1 ≤ α ≤ d2 ,    1 J′ = J    { ( ) ( )}  ′ , X E BJ,α Xiα B ′ ∼ J ,α iα  −1/3 J − J = 1      1/6 J′ − J = 2   ( ) ( ) k  ≤ C k H 1−k J ′ − J ≤ 2 X , k ≥ 1. E BJ,α Xiα B ′ J ,α iα   0 ′−J >2 J 107 Lemma 4.3. Under Assumptions A2, A4, A6 and A7, denote ( ) ( ) ( ) ( ) ω J,α Xl , x1 = Kh Xl1 − x1 BJ,α Xlα , µJ,α (x1 ) = Eω J,α Xl , x1 , (4.20) (√ ) as n → ∞, supx ∈[0,1] sup1≤α≤d ,1≤J≤N +1 µJ,α (x1 ) = O H . 2 1 Lemma 4.4. Under Assumptions A2, A4, A6 and A7, as n → ∞, sup sup x1 ∈[0,1] 1≤α≤d2 ,1≤J≤N +1 n−1 ∑n l=1 { } { √ } ( ) ω J,α Xl , x1 − µJ,α (x1 ) = Op log n/ nh , ( ) where ω J,α Xl , x1 and µJ,α (x1 ) are given in (4.20). Lemma 4.5. Under Assumptions A2, A4-A6, as n → ∞, An,2 = sup J,J ′ ,α ( ) An,1 = supJ,α En BJ,α = Op n−1/2 log n , ⟨ ⟩ ⟨ ⟩ { } −1/2 log n , BJ,α , B ′ − BJ,α , B ′ = Op (nH) J ,α n J ,α ⟩ ⟨ ⟩ ( ) BJ,α , B ′ ′ − BJ,α , B ′ ′ = Op n−1/2 log n , J ,α n J ,α ⟨ ⟩ ⟨ ⟩ ( ) An,4 = supl,J,α Tl , BJ,α − Tl , BJ,α = Op n−1/2 log n . n (4.21) (4.22) ⟨ An,3 = sup α̸=α′ (4.23) (4.24) Lemma 4.6. Under Assumptions A2, A4-A7, denote ( ) ( ) ( ) ζ l Xi1 , Til , x1 = Kh Xi1 − x1 Til , µl (x1 ) = Eζ l Xi1 , Til , x1 , (4.25) as n → ∞, sup1≤l≤d ,x ∈[0,1] µl (x1 ) = O (1), while 1 1 sup1≤l≤d ,x ∈[0,1] n−1 1 1 { } ( ) } ζ l Xi1 , Til , x1 − µl (x1 ) = Op n−1/2 h−1/2 log n . i=1 ∑n { 108 ( For proofs of Lemmas 4.2-4.6, see Ma and Yang (2011b). Let v( ⟨ ) J ′ ,α′ ,(J,α) ) = BJ ′ ,α′ , ) ⟨ ) ⟨ ⟩ ( ) ⟨ ⟩ ( ⟩ ( ⟩ BJ,α , v ′ = T ′ , Tl , v ′ = T ′ BJ,α , v( ′ ′ ) = B ′ ′ , Tl . Del ,l l l ,(J,α) l, J ,α J ,α ,l note by V the theoretical inner product matrix of the standardized basis { } 1, tl , BJ,α (xα ) , 1 ≤ l ≤ d1 , 1 ≤ J ≤ N + 1, 1 ≤ α ≤ d2 , i.e. 
    V=    1 0T 0T d1 (d2 (N +1)) ( ) 0d vl′ ,l vl′ ,(J,α) 1 ( ) ( ) 0d (N +1) v( ′ ′ ) v( ′ ′ ) J ,α ,l J ,α ,(J,α) 2      . (4.26)    1≤l,l′ ≤d1 ,1≤α≤α′ ≤d2 , 1≤J,J ′ ≤N +1 Denote by S the inverse matrix of V     S=    1 0T 0T d1 (d2 (N +1)) ( ) sl′ ,l 0d sl′ ,(J,α) 1 ( ) ) ( 0d (N +1) s( ′ ′ ) s( ′ ′ ) J ,α ,(J,α) J ,α ,l 2      . (4.27)    1≤l,l′ ≤d1 ,1≤α≤α′ ≤d2 , 1≤J,J ′ ≤N +1 ˆ Next, we denote by V the empirical version of V, i.e.    0  ( ) ˆ = E T T V  n l′   ( )T En BJ ′ ,α′ En Tl En BJ,α ⟩ ⟨ ⟩ ⟨ T ′ , Tl T ′ , BJ,α l n ⟩ n ⟩ ⟨ l ⟨ BJ ′ ,α′ , BJ,α BJ ′ ,α′ , Tl n n 109    .    1≤l,l′ ≤d1 ,1≤α≤α′ ≤d2 , 1≤J,J ′ ≤N +1 (4.28) ˆ Lemma 4.7. Under Assumptions A2, A4-A7, for matrices V, S and V defined in (4.26), −1 (4.27), and (4.28) (i) there exist constants CV > cV > 0, CS = c−1 , cS = CV such that V cV I1+d +d (N +1) ≤ V ≤ CV I1+d +d (N +1) , 1 2 1 2 cS I1+d +d (N +1) ≤ S ≤ CS I1+d +d (N +1) . 1 2 1 2 (4.29) (ii) Define An = supg ,g ∈G ⟨g1 , g2 ⟩n − ⟨g1 , g2 ⟩ ∥g1 ∥−1 ∥g2 ∥−1 , then 1 2 ( ) An = Op n−1/2 H −1 log n . (iii) With probability approaching 1 as n → ∞, ˆ cV I1+d +d (N +1) ≤ V ≤ CV I1+d +d (N +1) , 1 2 1 2 ˆ cS I1+d +d (N +1) ≤ V−1 ≤ CS I1+d +d (N +1) . 1 2 1 2 (4.30) Lemma 4.8. Under Assumptions A2-A7, as n → ∞, n−1 ∑n ∑n ( ) ( ) ( ) σ Xi , Ti εi + max n−1 BJ,α Xiα σ Xi , Ti εi i=1 i=1 J,α ( ) ∑n ( ) −1 −1/2 log n . + max n T σ Xi , Ti εi = Op n i=1 il l For proofs of Lemmas 4.7 and 4.8, see Ma and Yang (2011b). Corollary 4.2. Under Assumptions A2-A7, as n → ∞, n−1 BT E = Op (n−1/2 N 1/2 log n), n−1 BT E ∞ = Op (n−1/2 log n). Corollary 4.2 follows from Lemma 4.8 directly. Next we study the difference between mSBK (x1 ) and mK,1 (x1 ) , both given in (4.4). ˆ ˜ { }d Denote c = c0l 1 , the decomposition (4.15) implies that mK,1 (x1 ) − mSBK (x1 ) = ˜ ˆ l=0 110 { } ˆ ΨT b (x1 ) + ΨT v (x1 ) + Ψb (x1 ) + Ψv (x1 ) /f1 (x1 ), where ) ( )( Kh Xi1 − x1 1, TT (˜m − c) , c i i=1 ) ∑n ( )( −1 T ˜ , ΨT v (x1 ) = n K X − x1 1, Ti cε i=1 h i1 ∑n ( ) ∑d2 { ( ) ( )} Ψb (x1 ) = n−1 Kh Xi1 − x1 mα Xiα − mα Xiα , ˜ i=1 α=2 ∑n ( ) ∑d2 ( ) Ψv (x1 ) = n−1 Kh Xi1 − x1 ˜α Xiα . ε i=1 α=2 ΨT b (x1 ) = n−1 ∑n (4.31) (4.32) ( ) First we show Ψb (x1 ) is uniformly of order Op n−1/2 for x1 ∈ [0, 1]. Lemma 4.9. Under Assumptions A1, A2, A4-A7, as n → ∞, ( ) ( ) supx ∈[0,1] Ψb (x1 ) = Op n−1/2 + H 2 = Op n−1/2 . 1 Lemma 4.10. Under Assumptions A1, A2 and A6, there exist functions gα ∈ G, 1 ≤ α ≤ d2 , ( ) ∑d2 such that as n → ∞, m − g + α=1 En gα (Xα ) = Op n−1/2 + H 2 , where g(x, t) = ˜ n ∑d1 ∑d2 c00 + c t + α=1 gα (xα ) and m is defined in (4.14). ˜ l=1 0l l For proofs of Lemma 4.10 and Lemma 4.9, see Ma and Yang (2011b). Next we show Ψv (x1 ) ( ) in (4.32) is uniformly of order op n−2/5 . For aJ,α given in (4.19), define an auxiliary ˜ entity ˜∗ = εα ∑N +1 a B ˜ (x ) , J=1 J,α J,α α (4.33) The ˜α (xα ) in (4.14) is the empirical centering of ˜∗ (x2 ), i.e. ε ε2 ˜α (xα ) ≡ ˜∗ (xα ) − n−1 ε εα 111 ∑n ( ) ˜∗ Xiα . εα i=1 (4.34) (2) (1) According to (4.34), we can write Ψv (x1 ) = Ψv (x1 ) − Ψv (x1 ), where ∑n ∑n ( ) ∑d2 ( ) (1) Ψv (x1 ) = n−1 Kh Xl1 − x1 n−1 ˜∗ Xiα , εα l=1 α=2 i=1 ∑n ( ) ∑d2 ( ) (2) Ψv (x1 ) = n−1 Kh Xl1 − x1 ˜∗ Xlα , εα l=1 α=2 (4.35) (4.36) ( ) (2) for ˜∗ Xiα in (4.33). By (4.19) and (4.33), Ψv (x1 ) can be rewritten as εα ∑d ∑n ∑N +1 ( ) (2) 2 Ψv (x1 ) = n−1 aJ,α ω J Xl , x1 , ˜ α=2 l=1 J=1 (4.37) ( ) (1) (2) for ω J,α Xl , x1 given in (4.20). 
Ψv (x1 ) and Ψv (x1 ) are of order } ( ) { Op N n−1 (log n)2 and Op n−1/2 log n uniformly, given in Lemmas 4.12 and 4.13. The next lemma provides the size of ˜T ˜, where ˜ is the least square solution defined by (4.16). a a a Lemma 4.11. Under Assumptions A2-A6, as n → ∞, ˜T ˜ = a2 + a a ˜0 { } ∑d ∑N +1 ∑d 2 1 a2 + ˜l a2 = Op N n−1 (log n)2 . ˜J,α α=1 l=1 J=1 (4.38) Proof. By (4.18), (4.19), (4.30), with probability approaching 1 as n → ∞, ∥˜∥ n−1 BT E ≥ a ( ) ˜T n−1 BT E = ˜T V˜ ≥ cV ∥˜∥2 with Corollary 4.2 imply ∥˜∥2 ≤ c−1 ∥˜∥ n−1 BT E = a a ˆa a a V a { } { } −1 a cV ∥˜∥ × Op N 1/2 n−1/2 log n . Thus ∥˜∥ = Op N 1/2 n−1/2 log n , n → ∞. a (1) Lemma 4.12. Under Assumptions A2-A7, as n → ∞, Ψv (x1 ) in (4.35) satisfies { } (1) supx ∈[0,1] Ψv (x1 ) = Op N n−1 (log n)2 . 1 For proof, see Ma and Yang (2011b). The vector ˜ in (4.19) is a ( )−1 ( ) ˆ ˜= V a n−1 BT E . 112 (4.39) (2) ˆ (2) Next define theoretical versions ˆ of ˜ and Ψv (x1 ) of Ψv (x1 ) in (4.37) as a a ( ) ( ) ˆ= V−1 n−1 BT E = S n−1 BT E , a ˆ (2) Ψv (x1 ) = n−1 (4.40) ∑n ∑d ∑N +1 ( ) 2 aJ,α ω J,α Xi , x1 . ˆ i=1 α=1 J=1 (4.41) (2) Lemma 4.13. Under Assumptions A2-A7, as n → ∞, Ψv (x1 ) in (4.36) satisfies ( ) ( ) (2) supx ∈[0,1] Ψv (x1 ) = Op n−1/2 log n = op n−2/5 . 1 ( ) ( ) ˆ a ˆ Proof. According to (4.39) and (4.40), one has Vˆ = V ˜, which implies V − V ˜ = a a ( ) ˆ V (ˆ − ˜). Using (4.21), (4.22), (4.23), (4.24) one obtains that ∥V (ˆ − ˜)∥ = V − V ˜ a a a a a ( ) ( ) a ≤ Op n−1/2 H −1 log n ∥˜∥. According to Lemma 4.11, ∥˜∥ = Op n−1/2 N 1/2 log n , a } { −1 N 3/2 (log n)2 . By Lemmas 4.7 and 4.11, as n → ∞, so as n → ∞, ∥V (ˆ − ˜)∥ ≤ Op n a a { } ∥ˆ − ˜∥ = Op n−1 N 3/2 (log n)2 , a a ( ) ∥ˆ∥ ≤ ∥ˆ − ˜∥ + ∥˜∥ = Op n−1/2 N 1/2 log n . a a a a (4.42) (4.43) d2 ∑ ) n ( ) ∑ N +1 ( ∑ (2) ˆ (2) sup Ψv (x1 ) − Ψv (x1 ) ≤ sup aJ,α − aJ,α n−1 ˜ ˆ ω J Xl , x1 x1 ∈[0,1] x1 ∈[0,1] α=2 J=1 l=1 = √ ( ( ( ) ) ) d2 (N + 1)Op n−1 H −3/2 log2 n Op H 1/2 = Op n−1 H −3/2 log2 n . (4.44) ˆ (2) Note that Ψv (x1 ) ≤ Q1 (x1 ) + Q2 (x1 ), where Q1 (x1 ) = Q2 (x1 ) = d2 ∑ ∑ N +1 α=2 J=1 aJ,α µJ,α (x1 ) , ˆ } ∑d ∑N +1 ∑n { ( ) 2 aJ,α n−1 ˆ ω J,α Xi , x1 − µJ,α (x1 ) . α=2 J=1 i=1 113 Using the discretization idea, we divide interval [0, 1] into Mn ∼ n equally spaced intervals with disjoint endpoints 0 = x1,0 < · · · < x1,Mn = 1, then supx ∈[0,1] Q1 (x1 ) ≤ T1 + T2 , 1 ( ) ∑d2 ∑N +1 where T1 = max1≤k≤M ˆ α=2 J=1 aJ,α µJ,α x1,k , T2 equals to n ( )} ∑d ∑N +1 { 2 supx ∈[x aJ,α µJ,α (x1 ) − aJ,α µJ,α x1,k ˆ ˆ . α=2 J=1 1 1,k−1 ,x1,k ],1≤k≤Mn   d2 N +1  d1  n ( ∑  ( ) ∑ ∑ ) 1∑ ( ) , σ Xi , Ti εi Til s(J,α),l + B ′ ′ X ′ s aJ,α = ˆ J ,α iα (J,α), J ′ ,α′   n l=1  ′ =1 J=1 i=1 α ( ) ( ) ∑N +1 ∑ ( ) aJ,α µJ,α x1,k = n−1 ˆ µJ,α x1,k s(J,α),l Til σ Xi , Ti εi J=1 i,l,J ( ) ( ) ( ∑ ) ( )B +n−1 µJ,α x1,k s X ′ σ Xi , Ti εi . ′ ,α′ J ′ ,α′ ′ ,J,J ′ iα (J,α), J i,α ( ) ( ) −1 ∑n ∑N +1 µ Define next Wα,l = max1≤k≤Mn n x1,k s(J,α),l Til σ Xi , Ti εi , i=1 J=1 J,α ( ) ( ) ( ) ∑ ∑ ( )B W = max n−1 i µJ,α x1,k s X ′ σ Xi , Ti εi . ′ ′ ′ ,α′ J ′ ,α′ α,α J,J iα (J,α), J 1≤k≤Mn ) ( ∑d2 ∑d1 ∑d2 W Wα,l + . supα,l Wα,l is bounded by So T1 ≤ α=1 l=1 α′ =1 α,α′ supl n−1 ( ) ∑N +1 ( ) Til σ Xi , Ti εi max1≤k≤M supα,l µJ,α x1,k s(J,α),l n i=1 J=1 ( ) ( ) −1/2 log n O (1) = O n−1/2 log n , = Op n p p ∑n which follows from Lemma 4.8 and Lemma 4.3. Let Dn = nθ0 , (2 + δ)−1 < θ0 < 3/8, where δ is the same as in Assumption A3. 
Define ( ) − = ε I ( ε ≤ D ) , ε+ = ε I ( ε > D ) , ε∗ = ε− − E ε− X , T , εi,D n n i i i i i,D i,D i,D i,D i i } { }T ( )T { ) ) ( BJ ′ ,α′ (Xiα′ ) ′ σ Xi , Ti ε∗ . s Ui,k = µα x1,k ′ ,α′ i,D (J,α), J J 1≤J,J ′ ≤N +1 ( 114 ∑ Denote W D ′ = max n−1 n Ui,k as truncated centered version of W . To i=1 α,α′ α,α 1≤k≤Mn ( ) −1/2 log n , note W D − W D ′ ≤ Λ1 + Λ2 , show W −W = Up n α,α′ α,α′ α,α′ α,α ) ) ( ( ) ( ) ( 1 ∑ ( )B Λ1 = max µJ,α x1,k s X ′ σ Xi , Ti E ε− Xi , Ti i,D iα (J,α), J ′ ,α′ J ′ ,α′ k n ′ i,J,J Λ2 = ( ) ( ) ( ) 1∑∑ ( )B max µJ,α x1,k s X ′ σ Xi , Ti ε+ . i,D iα (J,α), J ′ ,α′ J ′ ,α′ 1≤k≤Mn n i J,J ′ ( ) { ( ) ( )}T Let µα x1,k = µ1,α x1,k , · · · , µN +1,α x1,k , then [ )T { } ( ) Λ1 = max1≤k≤M µα x1,k s n (J,α), J ′ ,α′ J,J ′ { ( ) ( )} ] ∑n ) ( − −1 n B Xiα′ σ Xi , Ti E εi,D Xi , Ti ′ i=1 J ′ ,α′ J [ ( [ ) { } ( ) ≤ max1≤k≤Mn µα x1,k µα x1,k s × (J,α), J ′ ,α′ 1≤J,J ′ ≤N +1 [ } { ( ) ( )} ]T { ) ( − ∑n −1 ( ) n s i=1 BJ ′ ,α′ Xiα′ σ Xi , Ti E εi,D Xi , Ti J ′ (J,α), J ′ ,α′ J,J ′ { ( ) ( )}N +1 ]]1/2 ) ( − −1 ∑n B × n ≤ i=1 J ′ ,α′ Xiα′ σ Xi , Ti E εi,D Xi , Ti J ′ =1   ∑ ′ CS maxk   J,J ′ ( )T ( 2 1/2  )   )1 ∑ ( µ2 J,α x1,k  n ( ) ( ) ( BJ ′ ,α′ Xiα′ σ Xi , Ti E ε− Xi , Ti i,D    i . ( ) ( ) −(1+δ) . By Lemma By Assumption A3, E ε− Xi , Ti = E ε+ Xi , Ti ≤ Mδ Dn i,D i,D 115 4.1, sup ′ ′ n−1 J ,α ( ) ( ( ) ) ∑n B ′ ′ X ′ σ Xi , Ti = Op n−1/2 log n . So i=1 J ,α iα n ( ) ( ) ( ) 1∑ −(1+δ) sup Λ1 ≤ CS Mδ Dn N max sup µJ,α x1,k B ′ ′ X ′ σ Xi , Ti sup J ,α iα n k J,α α,α′ J ′ ,α′ i=1 { } } { −1 D−(1+δ) log2 n = O (log n)2 N n−3/2 , = Op N n p n where the last step follows from the choice of Dn . Meanwhile ( ) ∞ ∞ ∞ ∑ E E |εn |2+δ |Xn , Tn ∑ M ∑ E |εn |2+δ δ ≤ ∞, P (|εn | ≥ Dn ) ≤ = ≤ 2+δ 2+δ 2+δ Dn n=1 n=1 Dn n=1 n=1 Dn ∞ ∑ since δ > 2/3. By Borel-Cantelli Lemma, for large n, with probability 1, n−1 ( ) ( ) ( ∑N +1 ) ( )B µJ,α x1,k s Xiα′ σ Xi , Ti ε+ = 0, ′ ,α′ J ′ ,α′ ′ ,J=1 i,D (J,α), J i=1 J ∑n ( ) sup W − W D ′ ≤ sup Λ1 + sup Λ2 = Op (log n)2 N n−3/2 . Next we α,α′ α,α′ α,α′ α,α′ α,α ( ) show that W D ′ = Up n−1/2 log n . The variance of Ui,k is α,α ( µα x1,k )T { } ({ ) }T ( ) ∗ ( ) BJ,α′ (Xiα′ ) s var σ Xi , Ti εi,D (J,α), J ′ ,α′ J,J ′ 1≤J≤N +1 { }T ( ) ( ) × s µ x . (J,α), J ′ ,α′ 1≤J,J ′ ≤N +1 α 1,k ( ) is bounded by By Assumption A3, c2 v( ′ ) ( ′ ′ ) σ ′ J,α , J ,α 1≤J,J ≤N +1 ({ ) ( ) }T ( ) 2 v( )( ) var B ′ (X ′ ) σ Xi , Ti ≤ Cσ J,α iα 1≤J≤N +1 J,α′ , J ′ ,α′ 116 . ′ 1≤J,J ≤N +1 } ( ) )T { ( ) ( )( ) var Ui,k ∼ µα x1,k s v (J,α), J ′ ,α′ 1≤J,J ′ ≤N +1 J,α′ , J ′ ,α′ 1≤J,J ′ ≤N +1 ( ) ( { ( × s (J,α), J ′ ,α′ ( µα x1,k )T { ) }T 1≤J,J ′ ≤N +1 ) ( µα x1,k Vε,D ∼ } { }T ( ) ( ) ( ) s s µα x1,k Vε,D , (J,α), J ′ ,α′ J,J ′ (J,α), J ′ ,α′ J,J ′ ( )}1/2 { } ( ) { ( )T ∗ X , T . Let κ x , µα x1,k where Vε,D = var εi,D i i 1,k = µα x1,k )}2 ( ) )}2 ′2 { ( ′2 2 { ( cS c2 κ x1,k Vε,D ≤ var Ui,k ≤ CS Cσ κ x1,k Vε,D . σ Since E Ui,k r ≤ { ( ) }r−2 { }n 2 c0 κ x1,k Dn H −1/2 r!E Ui,k < +∞, for r ≥ 3, Ui,k i=1 satisfies the Cram´r’s condition with Cram´r’s constant e e ( ) c∗ = c0 κ x1,k Dn H −1/2 , hence by Lemma 4.1   P  n−1 n ∑ l=1   ( qρ2 n Ui,k ≥ ρn ≤ a1 exp − 2 + 5c∗ ρ  25m2 n ) ([ + a2 (3) α n q+1 ])6/7 , ( ) { ( ) }1/3 √ 2 ∼ κ2 x 3 x −1/2 D V where m2 Vε,D , m3 ≤ cκ H , ρn = ρ log n/ n, n ε,D 1,k 1,k   ( ) 6/7 2 5m ρn a1 = 2 n + 2 1 + , a2 (3) = 11n 1 + ρ3 . 
Similar as in Lemma 4.4, q n 25m2 +5c∗ ρn 2 ( )2 cn log n−1 ρn−1/2 log n 2 qρn ( ) ≥ ∼ log n, as n → ∞, 25m2 + 5c∗ ρn 25c∗ + 5c0 κ x1,k Dn H −1/2 ρn−1/2 log n 2 117 Taking c0 , ρ large enough, one has   n { } 1 ∑ P Ui,k > ρn−1/2 log n ≤ c log n exp −c2 ρ2 log n + Cn2−6λ0 c0 /7 ≤ n−3 , n i=1 ∑ for n large enough. Hence ∞ P n=1 ∑∞ ∑Mn k=1 n=1 ( P n−1 ( ) D −1/2 log n is bounded by W ≥ ρn α,α′ ) ∑∞ ∑n Ui,k ≥ ρn−1/2 N 1/2 log n ≤ M n−3 < ∞. i=1 n=1 n ( ) D = U n−1/2 log n , as n → ∞. Note Thus, Borel-Cantelli Lemma entails W p α,α′ ( ) √ Wα,α′ − W D ′ = Up (log n)2 N n−3/2 , then Wα,α′ = Up (log n/ n). Thus T1 ≤ ) ( α,α ∑d2 ∑d1 ∑d2 W W + . α=1 l=1 α,l α′ =1 α,α′ ( ) ( ) ( ) So as n → ∞, T1 ≤ d1 Op n−1/2 log n + d2 2 Op n−1/2 log n = Op n−1/2 log n . Employing Cauchy-Schwartz inequality and Lipschitz continuity of kernel K, Assumption A5, Lemma 4.2 (ii) and (4.43) lead to }1/2 ( ) {∑N +1 )−1 ( ) −1/2 N 1/2 log n 2 (X ) T2 ≤ d2 Op n h2 Mn = op n−1/2 . EBJ,2 12 J=1 ( ( ) Therefore, supx ∈[0,1] Q1 (x1 ) ≤ T1 + T2 = Op n−1/2 log n . Noting that 1 ( sup x1 ∈[0,1] 1/2 Q2 (x1 ) = Op n−1/2 N 1/2 (log n) d2 N 1/2 n−1/2 h−1/2 log n ) ) ( = op n−1/2 , by Cauchy-Schwartz inequality, (4.43), Lemma 4.4, Assumptions A6 and A7. Thus 118 ˆ (2) Ψv (x1 ) ≤ sup sup {Q1 (x1 ) + Q2 (x1 )} = Op x1 ∈[0,1] x1 ∈[0,1] The desired result follows from the above result and (4.44). ( n−1/2 log n ) ( ) −2/5 . = op n Lemma 4.14. Under Assumptions A2-A4, A6 and A7, as n → ∞, ( ) ( ) −1/2 log n = o n−2/5 . supx ∈[0,1] |Ψv (x1 )| = Op n p 1 Lemma 4.14 follows from Lemmas 4.12 and 4.13. Next we bound ˜T − cT , ˜T defined cm cε ( ) in (4.17), (4.16). Denote by Ir×d the matrix Ir 0 . r×d ( ) Lemma 4.15. Under Assumptions A1, A2, A4-A7, as n → ∞, ∥˜m − c∥ = op n−1/2 . c Proof. By the result on page 149 of de Boor (2001), ∃ C∞ > 0 such that for mα ∈ ′ C 1 [0, 1] with mα ∈ Lip ([0, 1] , C∞ ) , there is a function gα ∈ G such that Egα (Xα ) = ′ 0, ∥gα − mα ∥∞ ≤ C∞ mα H 2 , 1 ≤ α ≤ d2 . Then ( ) T B −1 BT m − c ˜m − c = I(1+d )×{d (N +1)} × B c 1 2 { } ( )−1 ∑d ∑d ( ) TB T c + 1 cT + 2 g X = I(1+d )×{d (N +1)} B B −c 0 l=1 l il α=1 α iα 1≤i≤n 1 2 } {∑ ( ) ( ) ( ) ∑d2 d2 T B −1 BT +I(1+d )×{d (N +1)} B − g X m X α=1 α iα 1≤i≤n α=1 α iα 1 2 ] [∑ ( )} ( ) d2 { ˆ −1 n−1 BT = I(1+d )×{d (N +1)} V mα Xiα − gα Xiα α=1 1 2 1≤i≤n ˆ with V defined in (4.28). So by Lemma 4.7, ∥˜m − c∥2 is bounded with probability apc 119 [ ] 2 −1 n−1 BT ∑d2 {m (X ) − g (X )} ˆ proaching 1 by V α iα α iα α=1 1≤i≤n [∑ ] 2 ( ) ( )} d2 { mα Xiα − gα Xiα α=1 1≤i≤n ( )2 ( )2 ∑d2 2 ∑d2 ∥g − m ∥ 2 ∑d1 −1 ∑n ≤ CS + CS α ∞ i=1 Til α=1 α α=1 ∥gα − mα ∥∞ n l=1 ) ( ( ) 2 ∑d2 ∑N +1 ∑d2 ∥g − m ∥ n−1 ∑n +CS α ∞ i=1 BJ,α′ Xiα′ α=1 α α′ =1 J=1 ( )2 { ( )2 2 ∑d2 ∥g − m ∥ −1 ∑n ≤ CS 1 + d1 sup1≤l≤d n α ∞ i=1 Til α=1 α 1 } ( ( ) )2 ∑n + (N + 1) d2 sup n−1 i=1 B ′ X ′ . 1≤α′ ≤d2 ,1≤J≤N +1 J,α iα 2 ≤ CS n−1 BT ( ) ( ) −1 ∑n 1/2 , sup n i=1 BJ,α′ Xiα′ = Oa.s. H 1≤α′ ≤d2 ,1≤J≤N +1 ∑ sup1≤l≤d n−1 n i=1 Til = Oa.s. (1), so as n → ∞, 1 [ ]2 ( ) 2 ∼ ∑d2 ∥g − m ∥ 4 ∥˜m − c∥ c α α ∞ = Op H , α=1 ( ) Lemma 4.16. Under Assumptions A1-A7, as n → ∞, ∥˜ε ∥ = Op n−1/2 . c a a a Proof. Let (Q1 , Q2 ) = I(1+d )×{d (N +1)} (ˆ, ˜ − ˆ). By (4.40), (4.42), (4.27) 1 2 ˜T = I(1+d )×{d (N +1)} ˜ = Q1 + Q2 , cε a 1 2 (4.45) } { so ∥Q2 ∥ = Op n−1 N 3/2 (log n)2 , while ( )T ∑n ∑n ( ) Q1 = n−1 σ Xi , Ti εi , n−1 ξ il′ i=1 i=1 1≤l′ ≤d1 { (4.46) } ( ) ( ) ∑ ∑ in which ξ il′ = l sl′ l Til + α,J sl′ ,(J,α) BJ,α Xiα σ Xi , Ti εi . 
Clearly Eξ il′ = 0 ( ) }]2 [{ ∑ ∑ 2 0, s , s 2 ≤ C2E × ≤ Cσ while Eξ ′ σ l sl′ l Tl + α,J sl′ ,(J,α) BJ,α (Xα ) l′ l l′ ,(J,α) il 120 ( )T ( ) 2 2 2 2 V 0, s ′ , s ′ ≤ Cσ CV 0, s ′ , s ′ ≤ Cσ CV CS . It is easily checked l l l ,(J,α) l l l ,(J,α) ( ) −1 ∑n ξ E ξ ′ ξ ′ = 0 for i ̸= i′ thus by Markov Inequality, sup ′ ≤d n i=1 il′ = il jl 1≤l 1 ( ) ∑ Op (n−1/2 ). Likewise n−1 n σ Xi , Ti εi = Op (n−1/2 ). Then, Lemma 4.16 follows i=1 from the above results. Lemma 4.17. Under Assumptions A1, A2, A4-A7, as n → ∞, supx ∈[0,1] ΨT b (x1 ) = 1 √ op (1/ n) . For proof, see Ma and Yang (2011b). Define a theoretical version of ( ) −1 ∑n K (X − x ) a + ∑d1 a T ΨT v (x1 ) = n ˜ as 1 ˜0 i=1 h i1 l=1 l il ( ) ∑d1 ( ) ∑ ∑ ˆ ΨT v (x1 ) = a0 n−1 n Kh Xi1 − x1 + ˆ al n−1 n ζ l Xi1 , Til , x1 . ˆ i=1 i=1 l=1 Lemma 4.18. Under Assumptions A1-A7, as n → ∞, } ( ) { supx ∈[0,1] ΨT v (x1 ) = Op n−1/2 (log n)4 = op n−2/5 . 1 For proof, see Ma and Yang (2011b). ( ) ( )−1 ˜ Lemma 4.19. Under Assumptions A1-A6 and A8, s ′ = cov Tn and as n → ∞, l ,l ( )−1 ( )−1 ( ) ˜ ˜ ˜ ˜ cov Tn → cov T , where s ′ defined in (4.27), Tn and T defined in (4.10). l ,l ( Proof. ) ( )−1 ˜ s′ = cov Tn is induced by basic linear algebra. By the result on page 149 l ,l of de Boor (2001), there is a constant C∞ > 0 and functions gl (x) ∈ Hn , 1 ≤ α ≤ d2 ( ) such that sup1≤l≤d gl − pl ∞ = o (1). Since ProjHn Tl = ProjHn ProjH Tl , for ∀1 ≤ 1 )2 ( { } ≤ E gl (X) − ProjH Tl 2 = l ≤ d1 by Hilbert space theory, E ProjHn Tl − ProjH Tl ( )−1 ( )−1 { } ˜ ˜ E gl (X) − pl (X) 2 = o (1), as n → ∞. Thus cov Tn → cov T , as n → ∞. Proof of Theorem 4.1. The term ΨT b (x1 ) in (4.31) has order op (n−1/2 ) and other terms have order op (n−2/5 ) by Lemmas 4.17, 4.18, 4.9, 4.14. Standard theory ensures that 121 ( ) ∑ ˆ f1 (x1 ) = n−1 n Kh Xi1 − x1 has a positive lower bound. Theorem 4.1 then follows. i=1 2 Proof of Theorem 4.2. The first part of Theorem 4.2 follows from Lemmas 4.15 and 4.16. √ √ n˜ε + n (˜m − c) = n (Q1 + Q2 ) + c c √ √ √ n (˜m − c) = nQ1 + op {1}. It is easily verified E ( nQ1 ) = 0. By (4.46), Lemma 4.19, c By (4.15), (4.45) and Lemma 4.15, √ n (ˆ − c) = c √ ( ) √ var ( nQ1 ) = n × σ 2 I(1+d )×{d (N+1)} var ˆˆT IT aa 0 (1+d1 )×{d2 (N +1)} 1 2 = n × n−1 σ 2 I(1+d )×{d (N+1)} SIT 0 2 (1+d1 )×{d2 (N +1)}  1        1  1       0T 0T d1 d1 2 2 =σ 0 ( )  → σ0  ( )−1  , as n → ∞.   0  0   ˜     sl′ ,l d1 d1 cov T Theorem 4.2 then follows by applying Theorem 1 of Sunklodas (1984). 2 122 Chapter 5 Spline Regression in the Presence of Categorical Predictors 5.1 Background This chapter is based on Ma, Racine and Yang (2011). Applied researchers must frequently model relationships involving both continuous and categorical predictors, and a range of nonparametric kernel regression methods have recently been proposed for such settings. These developments have extended the reach of kernel smoothing methods beyond the traditional continuous-only predictor case and have had a marked impact on applied nonparametric research; see Li and Racine (2007) for examples and an overview. Though kernel methods hold much appeal for practitioners, many in the applied community continue to resist their use, often for valid reasons. In particular, nonparametric kernel methods are local in nature, bandwidth selection can be fragile and numerically demanding, interpretation can be challenging, while imposing constraints on the resulting estimate can be difficult. 
123 Regression spline methods, on the other hand, are global in nature and involve straightforward least squares solutions hence unconstrained and constrained estimation is much easier to handle and faster to compute. Furthermore, their least squares underpinnings render the methods immediately accessible to those who routinely use least squares approaches. As such, regression splines provide an immediately accessible and attractive alternative to kernel methods for the nonparametric estimation of regression models. For excellent overviews of spline modeling we direct the interested reader to Stone (1985), Stone (1994), Huang (2003), and Wang and Yang (2009a). For applications of spline approaches, see Huang (1998) for functional ANOVA models, Huang and Yang (2004), Wang and Yang (2007) and Xue (2009) for additive models, Wang and Yang (2009a) and Wang (2009) for single-index models, Liu and Yang (2010) and Xue and Liang (2010) for additive coefficient models, and Ma and Yang (2011) for jump detection in regression functions. However, just like their traditional kernel-based continuous-only predictor kin, regression splines are limited by their inability to handle the presence of categorical predictors without resorting to sample-splitting which can entail a substantial loss in efficiency. In this chapter we consider a regression spline alternative motivated by recent developments in the kernel smoothing of relationships involving categorical covariates. The proposed spline approach possesses intuitive appeal by producing a piecewise polynomial, computational expedience as discussed before and theoretical reliability according to the mean square and uniform convergence rates, and the pointwise asymptotic distribution results established in this chapter. The remainder of this chapter proceeds as follows. Section 5.2 outlines the framework and presents theorems detailing rates of convergence and the asymptotic distribution of the proposed approach. Section 5.3 considers cross-validatory selection of the spline knot vector 124 and kernel bandwidth vector. Section 5.4 examines the finite-sample performance of the proposed method versus the traditional ‘frequency’ (i.e. ‘sample-splitting’) estimator, the additive regression spline estimator, and the cross-validated local linear kernel regression estimator, while proofs are to be found in the appendix. 5.2 Methods and Main Results In what follows we presume that the reader is interested in the unknown conditional mean in the following location-scale model, Y = g (X, Z) + σ (X, Z) ε, (5.1) ( ) where g(·) is an unknown function, X = X1 , . . . , Xq T is a q-dimensional vector of continuous predictors, and Z = (Z1 , . . . , Zr )T is an r-dimensional vector of categorical predictors. Letting z = (zs )r , we assume that zs takes cs different values in Ds ≡ {0, 1, . . . , cs − 1}, s=1 ( )n s = 1, . . . , r, and let cs be a finite positive constant. Let Yi , XT , ZT i i i=1 be an i.i.d copy ( ) [ ] of Y, XT , ZT . Assume for 1 ≤ l ≤ q, each Xl is distributed on a compact interval al , bl , [ ] and without loss of generality, we take all intervals al , bl = [0, 1]. In order to handle the presence of categorical predictors, we define    l (Zs , zs , λs ) = L (Z, z, λ) = 1,when Zs = zs   λ , otherwise. s , r ∏ r ∏ 1(Z ̸=z ) λs s s , s=1 l (Zs , zs , λs ) = s=1 125 (5.2) where l(·) is a variant of Aitchison and Aitken (1976) univariate categorical kernel function, L(·) is a product categorical kernel function, and λ = (λ1 , λ2 , . . . 
, λr )T is the vector of bandwidths for each of the categorical predictors. We use the tensor product splines intro[ ] duced in Section 1.4 of Chapter 1. Let B = {B (X1 ) , . . . , B (Xn )}T , where B (x) is n×Kn defined in (1.3). Then g (x, z) can be approximated by B (x)T β (z), where β (z) is a Kn × 1 vector. We estimate β (z) by minimizing the following weighted least squares criterion, β (z) = arg n }2 ( ∑{ ( ) ) min Yi − B Xi T β (z) L Zi , z, λ . β(z)∈RKn i=1 ( ) Let Lz = diag {L (Z1 , z, λ) , . . . , L (Zn , z, λ)} be a diagonal matrix with L Zi , z, λ , 1 ≤ i ≤ n as the diagonal entries. Then β (z) can be written as ( ) ( ) −1 BT L B −1 n−1 BT L Y , β (z) = n z z (5.3) where Y = (Y1 , . . . , Yn )T . g (x, z) is estimated by g (x, z) = B (x)T β (z). Given any z ∈D, for any µ ∈ (0, 1], we denote by C 0,µ [0, 1]q the space of order µ H¨lder o continuous functions on [0, 1]q , i.e., ( )   ϕ(x, z) − ϕ x′ , z C 0,µ [0, 1]q = ϕ : ∥ϕ∥0,µ,z = sup < +∞ µ   x − x′ 2 x̸=x′ ,x,x′ ∈[0,1]q   )1/2 (∑ q x2 is the Euclidean norm of x , and ∥ϕ∥0,µ,z is the C 0,µ l=1 l -norm of ϕ. Let C [0, 1]q be the space of continuous functions on [0, 1]q . Given a q-tuple ) ( α = α1 , . . . , αq of nonnegative integers, let [α] =α1 + · · · + αq and let Dα denote the ∂ [α] differential operator defined by Dα = αq . For positive numbers an and bn , α1 ∂x1 ···∂xq in which ∥x∥2 = 126 n ≥ 1, let an ≍ bn mean that limn→∞ an /bn = c, where c is some nonzero constant. The assumptions employed for the asymptotic results are listed below: (A1) For any given z ∈D, the regression function satisfies Dα g ∈ C 0,1 [0, 1]q , for all α ( ) with [α] = p − 1 and 1 ≤ p ≤ min m1 , . . . , mq . (A2) The marginal density f (x) of X satisfies f (x) ∈ C [0, 1]q and f (x) ∈ [ ] c f , Cf for constants 0 < cf ≤ Cf < ∞. There exists a constant cP > 0, such that P (Z = z |X) ≥ cP for all z ∈D. ( ) 2 |X, Z = 1. There exists a positive value (A3) The noise ε satisfies E (ε |X, Z) = 0, E ε ( ) δ > 0 and finite positive Mδ such that supx∈[0,1]q ,z∈D E |ε|2+δ |X = x, Z = z < ) ( Mδ and E |ε|2+δ < Mδ . The standard deviation function σ (x, z) is continuous on [0, 1]q × D for D =D1 × · · · × Dr and 0 < cσ ≤ inf x∈[0,1]q ,z∈D σ (x, z) ≤ supx∈[0,1]q ,z∈D σ (x, z) ≤ Cσ < ∞. ) ( (A4) The number of interior knots Nl , 1 ≤ l ≤ q satisfy as n → ∞, max1≤l≤q N −1 = l { } ∏ } { q o n−1/(2p+q) , N = o n (log n)−1 , and the bandwidths λs , 1 ≤ s ≤ r l=1 l } {( ) ∑r −1 ∏q N 1/2 . satisfy as n → ∞, s=1 λs = o n l=1 l Theorem 5.1. Under assumptions (A1)-(A4), as n → ∞,   { (∏ }1/2 q r )−1 ∑ p ∑ 1 q . λs + sup |g (x, z) − g (x, z)| = Oa.s.  h + h log n l l=1 l n q ,z∈D x∈[0,1] s=1 l=1 Theorem 5.2. Under assumptions (A1)-(A4), for σ 2 (x, z) in (5.14), as n → ∞, n σ −1 (x, z) {g (x, z) − g (x, z)} −→ N (0, 1). For any given (x, z) ∈ [0, 1]q × D, c∗ n−1 × n σ )−1 (∏ )−1 (∏ q q ∗ ∗ h , for some constants 0 < c∗ < Cσ < ∞. h ≤ σ 2 (x, z) ≤ Cσ n−1 n σ l=1 l l=1 l 127 Proofs of these theorems are presented in the appendix. Having outlined the theoretical underpinnings of the proposed method, we now consider an illustrative simulated example that demonstrates how smoothing the categorical predictors in the manner prescribed above impacts the resulting estimator g (x, z). 
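To make the estimator concrete, the following Python sketch computes the weighted least squares coefficients of (5.3) using the product categorical kernel of (5.2). It assumes the spline design matrix B has already been constructed (for instance by the tensor-product construction of Section 1.4); the function and variable names are illustrative rather than those of any particular implementation.

```python
import numpy as np

def categorical_kernel(Z, z, lam):
    """Product categorical kernel of (5.2): L(Z_i, z, lam) = prod_s lam_s^{1(Z_is != z_s)}.
    Z is n x r, z is a length-r target cell, lam is a length-r bandwidth vector."""
    return np.prod(np.where(Z != z, lam, 1.0), axis=1)

def spline_estimate(B, Y, Z, z, lam, x_basis):
    """Estimator (5.3): beta(z) = (B'L_z B)^{-1} B'L_z Y, then g_hat(x, z) = B(x)'beta(z).
    B       : n x K_n spline design matrix evaluated at the sample points X_i
    x_basis : length-K_n spline basis vector B(x) at the evaluation point x."""
    w = categorical_kernel(Z, z, lam)       # diagonal entries of L_z
    BW = B * w[:, None]                     # rows of B scaled by the kernel weights
    beta_z = np.linalg.solve(B.T @ BW, BW.T @ Y)
    return x_basis @ beta_z
```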
5.3 Cross-Validated Choice of N and λ

Cross-validation (Stone (1977)) has a rich pedigree in the regression spline arena: it has been used for decades to choose the appropriate number of interior knots and is the basis for Friedman's (1991) multivariate adaptive regression spline (MARS) methodology, among others; see Wahba (1990) for an overview in the spline context. It has also been used extensively for bandwidth selection for kernel estimators such as the local linear kernel estimator proposed by Li and Racine (2004) that appears in the simulations in Section 5.4 (see also Racine and Li (2004) for the local constant counterpart). Following in this tradition, we choose the number of interior knots (i.e. the vector N) and the smoothing parameters (i.e. the bandwidth vector λ) by minimizing the cross-validation function defined by
$$CV(N,\lambda) = n^{-1}\sum_{i=1}^{n}\left\{Y_i - B_m(X_i)^T\hat{\beta}_{-i}(Z_i)\right\}^2, \qquad (5.4)$$
where $\hat{\beta}_{-i}(Z_i)$ denotes the leave-one-out estimate of β.

To illustrate the behavior of the data-driven cross-validated selection of N and λ, we consider two simple data generating processes (DGPs) and plot the resulting cross-validated regression estimate based on the popular cubic spline basis. For each figure we regress $Y_i$ on $X_i$ and $Z_i$ using the proposed method. Figure 5.1 presents four simulated data sets and the cross-validated values of N and λ (maximum N is 15, n = 500, λ ∈ [0, 1]).

Figure 5.1: Example with n = 500 and a variety of DGPs in Chapter 5. Note: for the lower left DGP there is no difference in the function when Z = 0 versus Z = 1 and cross-validation ought to select λ = 1 (N is the number of interior knots, λ the bandwidth).

Figure 5.1 illustrates how the cross-validated choices of N and λ differ depending on the underlying DGP. For instance, the plot on the upper left is one for which $Y_i = \cos(\pi X_i) + \varepsilon_i$ if $Z_i = 0$ and $Y_i = 1 + \cos(\pi X_i) + \varepsilon_i$ if $Z_i = 1$. It is evident that N ≈ 0 and λ ≈ 0.0012 are appropriate here. The figure on the upper right is one for which $Y_i = \cos(\pi X_i) + \varepsilon_i$ regardless of the value taken on by $Z_i$. Those on the lower left and right are similar but use $\cos(4\pi X_i)$ instead. For the upper and lower figures on the right, N ≈ 0 and N ≈ 3, respectively, appear to be appropriate, while λ ≈ 1 is appropriate in both cases since $Z_i$ is independent of $Y_i$. These simple examples serve to illustrate how cross-validation delivers values of N and λ that are appropriate for the DGP at hand.

Before proceeding, a few words on the numerical optimization of (5.4) are in order. Search takes place over $N_1, \ldots, N_q$ and $\lambda_1, \ldots, \lambda_r$, where the λ are continuous, lying in [0, 1], and the N are integers. Clearly this is a mixed integer combinatorial optimization problem, which would render exhaustive search infeasible when facing a non-trivial number of predictors. However, in settings such as these one can leverage recent advances in mixed integer search algorithms, which is the avenue we pursue in the Monte Carlo simulations reported below. In particular, we adopt the 'Nonsmooth Optimization by Mesh Adaptive Direct Search' (NOMAD) approach (Abramson, Audet, Couture, Dennis Jr., and Le Digabel (2011)). Given that the objective function can be trivially computed for large sample sizes, as it involves nothing more than computing the hat matrix for weighted least squares, it turns out that the computational burden is in fact nowhere near as costly as, say, cross-validated kernel regression for moderate to large data sets.
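To make the criterion concrete, here is a minimal brute-force Python sketch of (5.4) for a fixed spline design and bandwidth vector; it simply refits the kernel-weighted spline estimator of (5.3) with each observation deleted in turn. In practice one would minimize this quantity over the knot vector N and bandwidth vector λ, e.g. with a mixed-integer search such as NOMAD as described above, or use the hat-matrix shortcut; the names below are illustrative.

```python
import numpy as np

def cv_criterion(B, Y, Z, lam):
    """Leave-one-out cross-validation criterion (5.4):
    CV = n^{-1} sum_i {Y_i - B(X_i)' beta_{-i}(Z_i)}^2,
    where beta_{-i}(Z_i) is the categorical-kernel-weighted spline fit
    computed without observation i.  B is the n x K_n spline design matrix
    (implicitly fixing the knot vector N), Z is n x r, lam has length r."""
    n = len(Y)
    sq_err = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        # product categorical kernel weights L(Z_j, Z_i, lam) for j != i
        w = np.prod(np.where(Z[keep] != Z[i], lam, 1.0), axis=1)
        BW = B[keep] * w[:, None]
        beta_minus_i = np.linalg.solve(B[keep].T @ BW, BW.T @ Y[keep])
        sq_err[i] = (Y[i] - B[i] @ beta_minus_i) ** 2
    return sq_err.mean()
```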
As such, the proposed approach constitutes a computationally attractive alternative to multivariate cross-validated kernel regression. In addition, in the next section we shall also see that the proposed approach constitutes a statistically attractive alternative as well, at least from the perspective of finite-sample square error loss.

5.4 Monte Carlo Simulations

In this section we consider a modest Monte Carlo experiment designed to assess the finite-sample performance of the proposed method. We consider two DGPs with q = 2 continuous predictors and r = 2 categorical predictors, given by
$$\text{DGP-M: } Y_i = \cos(2\pi X_{i1})\times\sin(2\pi X_{i2}) + (Z_{i1}+Z_{i2})/10 + \varepsilon_i,$$
$$\text{DGP-A: } Y_i = \cos(2\pi X_{i1}) + \sin(2\pi X_{i2}) + (Z_{i1}+Z_{i2})/10 + \varepsilon_i, \qquad (5.5)$$
where the continuous predictors are drawn from the uniform distribution ($X_j \sim U[0,1]$, j = 1, 2), the categorical predictors ($Z_j$, j = 1, 2) are drawn from the rectangular distribution with equal probability ($z_s \in \{0, 1, \ldots, c_s - 1\}$, where $c_s$ is the number of categorical outcomes, $c_s \ge 2$, s = 1, 2), and $\varepsilon \sim N(0, \sigma^2)$ with σ = 0.1. For what follows we set $c_s = c$, s = 1, 2. Observe that DGP-M is multiplicative in the continuous components while DGP-A is additive.

We draw M = 1,000 Monte Carlo replications, and for each replication we compute the cross-validated frequency estimator (i.e. that based only on the (Y, X) pairs lying in each 'cell' defined by Z), the proposed cross-validated categorical regression spline estimator, the cross-validated additive categorical regression spline estimator of Ma and Racine (2011), and the cross-validated local linear kernel estimator that is often used to model continuous and categorical predictors in a manner similar to that undertaken here (note that the local linear kernel estimator is minimax efficient and has the best boundary bias correction properties of the class of kernel regression estimators; see Li and Racine (2004) for details). For the regression spline estimators we set the spline degree vector equal to (3, 3) (a popular choice) and cross-validate the number of knots vector $(N_1, N_2)$ and the bandwidth vector $(\lambda_1, \lambda_2)$. We then compute the MSE of each estimator based upon (5.5) for each replication and report the relative median MSE over all M replications in Table 5.1. Table 5.2 reports a summary of the smoothing parameters chosen by cross-validation.

Table 5.1: Relative median MSE in Chapter 5

  DGP-M: Multiplicative Specification
  n      c   Frequency   Additive   Kernel
  500    2   0.591       0.008      0.487
  500    3   0.572       0.014      0.618
  500    4   0.519       0.020      0.718
  1000   2   0.745       0.004      0.403
  1000   3   0.591       0.008      0.494
  1000   4   0.609       0.013      0.585
  1500   2   0.815       0.003      0.359
  1500   3   0.636       0.006      0.446
  1500   4   0.626       0.010      0.518
  2000   2   0.838       0.002      0.337
  2000   3   0.722       0.005      0.408
  2000   4   0.628       0.008      0.480

  DGP-A: Additive Specification
  n      c   Frequency   Additive   Kernel
  500    2   0.554       2.028      0.489
  500    3   0.488       1.939      0.593
  500    4   0.434       1.788      0.702
  1000   2   0.776       2.072      0.417
  1000   3   0.518       2.096      0.494
  1000   4   0.523       2.072      0.571
  1500   2   0.822       2.053      0.381
  1500   3   0.660       2.119      0.446
  1500   4   0.520       2.176      0.514
  2000   2   0.852       2.087      0.368
  2000   3   0.754       2.143      0.418
  2000   4   0.572       2.226      0.478

Note: relative median MSE of the proposed spline regression estimator versus the frequency regression spline, additive regression spline, and local linear kernel regression estimators. Numbers less than one indicate superior performance of the proposed spline estimator. c denotes the number of outcomes for the discrete predictors, n the sample size.
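For reference, a single replication from (5.5) can be drawn with a few lines of Python; the sketch below uses illustrative names and covers only the data-generation step, not the competing estimators.

```python
import numpy as np

def simulate_dgp(n, c=2, sigma=0.1, multiplicative=True, seed=0):
    """One Monte Carlo sample from DGP-M or DGP-A in (5.5):
    q = 2 continuous predictors ~ U[0,1], r = 2 categorical predictors
    uniform on {0, ..., c-1}, Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    Z = rng.integers(0, c, size=(n, 2))
    if multiplicative:                          # DGP-M
        m = np.cos(2 * np.pi * X[:, 0]) * np.sin(2 * np.pi * X[:, 1])
    else:                                       # DGP-A
        m = np.cos(2 * np.pi * X[:, 0]) + np.sin(2 * np.pi * X[:, 1])
    Y = m + (Z[:, 0] + Z[:, 1]) / 10.0 + rng.normal(0.0, sigma, size=n)
    return X, Z, Y

X, Z, Y = simulate_dgp(500, c=3, multiplicative=True)
```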
Table 5.1 illustrates how, for a given sample size, the relative performance of the proposed approach that smooths the categorical predictors versus the frequency approach that breaks the data into $c_1\times c_2$ = (4, 9, 16) subsets improves as c increases, as expected (each categorical predictor takes on $c_1 = c_2 = c$ = (2, 3, 4) values). Furthermore, for a given c, as n increases the proposed estimator approaches the frequency estimator since λ → 0 as n → ∞. Table 5.1 further illustrates how the proposed cross-validated method dominates the popular local linear kernel estimator for all sample sizes and values of c considered.

Often additive spline models are recommended in applied settings due to the curse of dimensionality (the property that the multiplicative tensor regression spline has a rate of convergence that deteriorates with the number of continuous predictors, q). Observe, however, that even in small sample settings such as n = 500, if the additive model is used when additivity is not present, the square error properties of the additive regression spline model can be orders of magnitude worse than those of the multiplicative tensor regression spline model (the tensor model has roughly 1/100 the MSE of the additive model for DGP-M). Naturally, if additivity is appropriate the additive model that incorporates this information will have better finite-sample properties (the tensor model has roughly 2 times the MSE of the additive model for DGP-A, the additive DGP). Simulations not reported here for space considerations indicate that the finite-sample mean square error improvement over the kernel regression estimator holds a) whether or not there exist categorical predictors, and b) in higher-dimension settings than those reported here. Table 5.2 reveals how the cross-validated bandwidths tend to zero as n increases. These findings are consistent with the theoretical properties detailed in the appendix.

In the above simulations the tensor-based multivariate regression spline approach dominates the popular local linear kernel regression approach for the range of sample sizes and number of predictors considered. However, we caution the reader that this is not guaranteed to always be the case. The dimension of the tensor spline basis grows exponentially with the number of continuous predictors for a given order/knot combination for each predictor. Thus, for a fixed sample size n, as the number of continuous predictors q increases, degrees of freedom will quickly fall and the square error properties of the resulting estimator will naturally deteriorate. So, in small-n, large-q, low-degrees-of-freedom settings one could readily construct instances in which the kernel regression approach will display better finite-sample behavior than the regression spline approach.

Table 5.2: Median values for smoothing parameters in Chapter 5

  DGP-M: Multiplicative Specification
  n      c   N1   N2   λ1     λ2
  500    2   2    3    0.16   0.16
  500    3   2    3    0.11   0.11
  500    4   2    1    0.07   0.08
  1000   2   2    3    0.10   0.10
  1000   3   2    3    0.07   0.07
  1000   4   2    3    0.05   0.05
  1500   2   2    3    0.07   0.07
  1500   3   2    3    0.05   0.05
  1500   4   2    3    0.04   0.04
  2000   2   2    3    0.06   0.06
  2000   3   2    3    0.04   0.04
  2000   4   2    3    0.03   0.03

  DGP-A: Additive Specification
  n      c   N1   N2   λ1     λ2
  500    2   2    3    0.16   0.16
  500    3   2    3    0.11   0.11
  500    4   2    3    0.09   0.09
  1000   2   2    3    0.10   0.10
  1000   3   2    3    0.07   0.07
  1000   4   2    3    0.05   0.05
  1500   2   2    3    0.07   0.07
  1500   3   2    3    0.05   0.05
  1500   4   2    3    0.04   0.04
  2000   2   2    3    0.06   0.06
  2000   3   2    3    0.04   0.04
  2000   4   2    3    0.03   0.03

Note: median values for the number of interior knots and bandwidths for the proposed spline regression estimator. c = c1 = c2 denotes the number of outcomes for the discrete predictors, n the sample size, Nj the number of interior knots for continuous predictor Xj, j = 1, 2, and λj the bandwidth for categorical predictor Zj, j = 1, 2.
We therefore offer the following advice for the sound practical application of the methods proposed herein:

1. The proposed methods are best suited to settings in which q is not overly large and n not overly small. Our experience shows that for a range of DGPs the regression spline outperforms kernel regression when n ≥ 500 and q ≤ 5. One practical advantage is the reduced computational burden of cross-validation for regression splines versus their kernel counterpart, and in large sample settings (say, n ≥ 10,000) one can push the dimension q much higher than that considered here.

2. Of course, when the dimension of the multivariate tensor spline becomes a practical barrier to its sound application, one can always resort to additive spline models; see Ma and Racine (2011) for details. The drawback of the additive spline approach is that if the DGP is non-additive, the inefficiency of the additive spline approach can be much worse than that of the multivariate kernel approach, as clearly demonstrated above. Of course, it is a simple matter to compare the value of the cross-validation function for each of the tensor-based, additive-based, and kernel-based cross-validated approaches, and it is perfectly sensible to use this as a guide in applied settings. But our experience is that the tensor-based multivariate regression spline will indeed be competitive and ought to be part of every practitioner's toolkit.

In summary, the simulations outlined above indicate that the proposed method is capable of outperforming the frequency estimator that breaks the sample into subsets, while it provides a compelling alternative to kernel methods when faced with a mix of categorical and continuous predictors, and to additive regression spline models for general nonlinear DGPs for which additivity is not fully present.

5.5 Concluding Remarks

Applied researchers frequently must model relationships containing categorical predictors, yet may require nonparametric estimators of, say, regression functions. The traditional kernel and spline estimators break the data into subsets defined by the categorical predictors and then model the resulting relationship involving continuous predictors only. Though consistent, these approaches are acknowledged to be inefficient. In this chapter we provide an approach that combines regression splines with categorical kernel functions and is capable of overcoming the efficiency losses present in the traditional sample-splitting approach. Furthermore, the proposed approach constitutes an attractive alternative to cross-validated kernel estimators that admit categorical predictors. Theoretical underpinnings are provided and simulations are undertaken to assess the finite-sample performance of the proposed method. We hope that these methods are of interest to those modelling regression functions nonparametrically when faced with both continuous and categorical predictors.

5.6 Appendix

For any vector $\zeta = (\zeta_1, \ldots, \zeta_s) \in R^s$, denote the norm $\|\zeta\|_r = (|\zeta_1|^r + \cdots + |\zeta_s|^r)^{1/r}$, $1 \le r < +\infty$, and $\|\zeta\|_\infty = \max(|\zeta_1|, \ldots, |\zeta_s|)$. For any functions $\phi, \varphi$, define the empirical inner product and norm as $\langle\phi,\varphi\rangle_{n,L_z} = n^{-1}\sum_{i=1}^n \phi(X_i)\varphi(X_i)L(Z_i, z, \lambda)$, $\|\phi\|^2_{n,L_z} = n^{-1}\sum_{i=1}^n \phi^2(X_i)L(Z_i, z, \lambda)$.
If the functions ϕ, φ are L2 -integrable, we define the theoi=1 retical inner product and the corresponding norm as ⟨ϕ, φ⟩Lz = E {ϕ (X) φ (X) L (Z, z, λ)}, { } ∥ϕ∥2 = E ϕ2 (X) L (Z, z, λ) . We denote by the same letters c, C, any positive conLz stants without distinction. For any s × s symmetric matrix A, denote its Lr norm as ∑ ∥A∥r = maxζ∈Rs ,ζ̸=0 ∥Aζ∥r ∥ζ∥−1 . Let ∥A∥∞ = max1≤i≤s s r j=1 Aij . Let Is be the s × s identity matrix. β (z) can be decomposed into β g (z) and β ε (z), such that β (z) = β g (z) + β ε (z), and ( ) ( ) ( ) ( ) −1 BT L B −1 n−1 BT L g , β (z) = n−1 BT L B −1 n−1 BT L E , β g (z) = n z z z z ε (5.6) where g = {g (X1 , Z1 ) , . . . , g (Xn , Zn )}T , E = {σ (X1 , Zn ) ε1 , . . . , σ (Xn , Zn ) εn }T . Then g (x, z) = gg (x, z) + gε (x, z), in which gg (x, z) = B (x)T β g (z) , gε (x, z) = B (x)T β ε (z) . 136 (5.7) Next we cite a result in the Appendix of Huang (2003). Lemma 5.1. Under assumptions (A2), for any α ∈RKn , there exist constants 0 < cB < CB < ∞ that do not depend on n, such that for any z ∈ D cB ( ∏q ) h ∥α∥2 ≤ E 2 l=1 l [{ } ] ( ∏q ) T B (X) 2 ≤ C α hl ∥α∥2 . B 2 l=1 Since   r ∏  { } k(Zs ̸=zs ) k (Z, z, λ) |X = E E L λs |X ≥ P (Z = z |X ) ≥ cp > 0,   s=1 { } E Lk (Z, z, λ) |X ≤ 1, (5.8) for any integer k ≥ 1. Thus, for CB = CB and cB = cp cB , one has [{ } ] ( ∏q ) T B 2 ≤ E αT B (X) 2 ≤ C α hl ∥α∥2 , B 2 l=1 Lz [{ } ] ( ∏q ) T B 2 ≥ c E αT B (X) 2 ≥ c α h ∥α∥2 . p B 2 l=1 l Lz (5.9) Lemma 5.2. Under assumptions (A2) and (A4), as n → ∞ , ⟨ ⟩ ⟨ ⟩ Bj ,...,jq , B ′ − Bj ,...,jq , B ′ ′ ′ j1 ,...,jq j1 ,...,jq 1 1 n,Lz Lz [{ ) }1/2 ] (∏q −1 . h log n =Oa.s. n l=1 l max max ′ z∈D j ,...,jq ,j ′ ,...jq 1 1 ( ) ( ) ( ) Proof. Let ζ ′ ,...,j ′ ,i = Bj1 ,...,jq Xi Bj ′ ,...,j ′ Xi L Zi , z, λ − j1 ,...,jq ,j1 q q } 1 { ( ) ( ) ( ) ′ E Bj ,...,jq Xi B ′ ′ Xi L Zi , z, λ . When jl − jl ≤ ml − 1 for all 1 ≤ l ≤ q, j1 ,...,jq 1 137 by the properties of the B-spline basis, there exist constants 0 < cB,k < CB,k < ∞ and (∏ ) ( ) ( )k q ′ < C ′ < ∞, such that c h ≤ E Bj ,...,jq Xi B ′ 0 < cB ≤ ′ Xi B,k B l=1 l j1 ,...,jq 1 } { (∏ ) ( ) ( ) ( ) k q ′ ∏q h k ≤ E B CB,k h and cB ≤ ′ j1 ,...,jq Xi Bj ′ ,...,jq Xi l=1 l l=1 l 1 ) ( ′ ∏q h k , thus by (5.8), CB l=1 l { } ) 2 ( ) 2 Eζ 2 ′ ′ ≥ cp E Bj1 ,...,jq Xi Bj ′ ,...,j ′ Xi j1 ,...,jq ,j1 ,...,jq ,i q 1 [ { }] ( ) ( ) 2 − E Bj ,...,jq Xi B ′ ′ Xi j1 ,...,jq 1 ) ( ∏q (∏q ) ( ∏q )2 ′ h , ≥ cp cB,2 h − CB h ≥c 2 l=1 l l=1 l l=1 l ζ { } ( ∏q ) ( ) ( ) 2 Eζ 2 ≤ E Bj ,...,j Xi B 2′ Xi ≤ CB,2 hl , ′ ′ ′ q l=1 j1 ,...,jq ,j1 ,...,jq ,i j1 ,...,jq 1 k E ζ ′ ′ j1 ,...,jq ,j1 ,...,jq ,i ( [ ( ) ( ) ( )k ≤ 2k−1 E Bj ,...,jq Xi B ′ ′ Xi L Zi , z, λ j1 ,...,jq 1 { } ] ( ) ( ) ( ) k + E Bj ,...,jq Xi B ′ ′ Xi L Zi , z, λ j1 ,...,jq 1 ( ( ∏q ( ∏q ) )k ) ( ∏q ) k−1 C ′ ≤2 hl + CB hl ≤c k hl . B,k l=1 l=1 l=1 ζ k There exists a constant c = c k c−1 , such that E ζ ≤ ck!Eζ 2 ′ ′ ′ ′ j1 ,...,jq ,j1 ,...,jq ,i ζ ζ2 j1 ,...,jq ,j1 ,...,jq ,i < ∞, for k ≥ 3. Then by Bernstein’s inequality (p.24, Theorem 1.2, BOSQ(1998)), ( P n−1 ) }1/2 ) (∏q { ′ n−1 h log n ζ ′ ′ ≥ c l=1 l i=1 j1 ,...,jq ,j1 ,...,jq ,i ∑n 138     ( )  ′ n ∏q h log n  c l=1 l ≤ 2 exp − (∏ ) { (∏ ) }1/2   q q  4C  ′n h + 2c c h log n B,2 n l=1 l l=1 l ( )−1 −c′ 4CB,2 = 2n ≤ 2n−4 , for any c′ ≥ 16CB,2 , [ ∑∞ −1 ∑n ζ which implies n=1 P maxz∈D max ′ ′ ,...,j ′ n i=1 j1 ,...,jq ,j ′ ,...,jq ,i j1 ,...,jq ,j1 q 1 )r { (∏q ) }1/2 ] ∑∞ ( ′ n−1 ≥ c h log n max c K2 n−4 < ∞. 
≤2 n l=1 l n=1 1≤s≤r s Thus, the Borel-Cantelli Lemma entails that max max ′ z∈D j ,...,jq ,j ′ ,...,jq 1 1 n−1 [{ ( ∏q ) }1/2 ] −1 ζ h log n . ′ ′ = Oa.s. n i=1 j1 ,...,jq ,j1 ,...,jq ,i l=1 l ∑n ′ When jl − jl > ml − 1 for some 1 ≤ l ≤ q, ⟨ ⟩ ⟨ ⟩ Bj ,...,jq , B ′ − Bj ,...,jq , B ′ = 0. ′ ′ j1 ,...,jq j1 ,...,jq 1 1 n,Lz Lz Lemma 5.3. Under assumptions (A2) and (A4), as n → ∞, [{ }1/2 ] )−1 ( ∏q ⟨γ 1 , γ 2 ⟩n,L − ⟨γ 1 , γ 2 ⟩L z =O z . log n h n−1 An = sup sup a.s. l=1 l ∥γ 1 ∥L ∥γ 2 ∥L z∈D γ 1 ,γ 2 ∈G z z (5.10) Proof. Any γ 1 , γ 2 ∈ G can be written as γ 1 (x) = α1 B (x), γ 2 (x) = α2 B (x) for some 139 ( ) ( ) vectors α1 = αj ,...,jq ,1 ∈ RKn , α2 = αj ,...,jq ,2 ∈ RKn . ⟨γ 1 , γ 2 ⟩n,L is z 1 1    n ∑ ∑ )  ∑  ( n−1 αj ,...,jq ,1 Bj ,...,jq   αj ,...,jq ,2 Bj ,...,jq  L Zi , z, λ  1 1 1 1 i=1 j1 ,...,jq j1 ,...,jq = ∑ ⟨ ⟩ α ′ Bj ,...,jq , B ′ ′ ′ ′ ′α j1 ,...,jq j1 ,...,jq ,j1 ,...,jq j1 ,...,jq ,1 ,j1 ,...,jq ,2 1 n,Lz ⟨γ 1 , γ 2 ⟩Lz = ∑ ⟨ ⟩ . α ′ Bj ,...,jq , B ′ ′ ′ ′ ′α j1 ,...,jq j1 ,...,jq ,j1 ,...,jq j1 ,...,jq ,1 ,j1 ,...,jq ,2 1 Lz According to (5.9), one has for any z ∈ D, there exist constants 0 < c1 < C1 < ∞ and 0 < (∏ (∏ )1/2 )1/2 q q c2 < C2 < ∞, such that c1 hl ∥α1 ∥2 ≤ ∥γ 1 ∥Lz ≤ C1 hl ∥α1 ∥2 , l=1 l=1 (∏ )1/2 (∏ )1/2 q q h ∥α2 ∥2 , thus h ∥α2 ∥2 ≤ ∥γ 2 ∥L ≤ C2 c2 z l=1 l l=1 l (∏ ) (∏ ) q q h ∥α1 ∥2 ∥α2 ∥2 . Hence h ∥α1 ∥2 ∥α2 ∥2 ≤ ∥γ 1 ∥L ∥γ 1 ∥L ≤ C1 C2 c1 c2 z z l=1 l l=1 l ∥α ∥ ∥α ∥ (∏ 1 2) 2 2 × q hl ∥α1 ∥2 ∥α2 ∥2 c1 c2 l=1 ⟨ ⟩ ⟨ ⟩ max max Bj ,...,jq , B ′ − Bj ,...,j,q , B ′ ′ ′ j1 ,...,jq j1 ,...,jq 1 1 ′ z∈D j ,...,jq ,j ′ ,...jq n,Lz Lz 1 1 [{ }1/2 ] (∏q )−1 −1 =Oa.s. n . h log n l=1 l An ≤ Let Vz,n = n−1 BT Lz B = {⟨ ⟩ } , Bj ,...,jq , B ′ ′ j1 ,...,jq 1 n,Lz Kn ×Kn 140 (5.11) {⟨ Vz,n = ⟩ } Bj ,...,jq , B ′ . ′ j1 ,...,jq 1 Lz Kn ×Kn (5.12) A result in Demko (1986) is described as follows, which plays an essential role in the proof of Lemma 5.5. Lemma 5.4. For positive definite Hermitian matrices A such that A has no more than k √ 5/4 1/4 nonzero entries in each row, then A−1 ≤ 33 k A−1 ∥A∥2 . ∞ 2 Lemma 5.5. Under assumptions (A2) and (A4), there exists a constant 0 < Cm < ∞ (∏ )−1 q −1 depending only on ml , 1 ≤ l ≤ q, such that supz∈D Vz,n ≤ Cm h , where l=1 l ∞ Vz,n is defined in (5.12) (∏ ) (∏ ) q q 2 ≤ wT V w ≤C Kn , c Proof. For any vector w ∈R h ∥w∥2 h ∥w∥2 z,n B B 2 l=1 l l=1 l which follows directly from (5.9), and it is clear that Vz,n is symmetric, thus Vz,n is a positive definite Hermitian matrix . Vz,n 2 = sup w {( }1/2 ) ( ) Vz,n w T Vz,n w / ∥w∥2 2 { (∏q )( }1/2 ) ) −1 ( ≤ sup CB hl Vz,n w T Vz,n Vz,n w / ∥w∥2 2 l=1 w ( )1/2 { }1/2 (∏q ) 1/2 ∏q = CB hl sup wT Vz,n w/ ∥w∥2 ≤ CB hl 2 l=1 l=1 w −1 Similarly Vz,n ≤ 2 ) ∏q ( than 2ml − 1 l=1 ) ( −1 ∏q h −1 . By tensor spline properties, V cB z,n has no more l=1 l nonzero entries in each row, thus ( )−1 √ q q ) 1/4 5/4 ∏ ∏ ( −1 −1 −1 = Cm hl , where Vz,n ≤ 33 2ml − 1 Vz,n supz∈D Vz,n 2 2 ∞ l=1 l=1 √ q ) ∏ ( −1 2ml − 1 cB CB . Cm = 33 l=1 141 −1 Lemma 5.6. For Vz,n defined in (5.11), and for n large enough, supz∈D Vz,n ≤ ∞ )−1 (∏ q h in which Cm > 0 is the constant in Lemma 5.5. 2Cm l=1 l Proof. By Lemma 5.2, as n → ∞, sup Vz,n − Vz,n = ∞ z∈D ⟩ ⟨ ⟩ ∑ ⟨ ≤ − Bj ,...,jq , B ′ sup max Bj ,...,jq , B ′ ′ ′ j1 ,...,jq j1 ,...,jq 1 1 j1 ,...,jq ′ z∈D n,Lz Lz ′ j1 ,...jq ⟨ ⟩ ⟨ ⟩ ) 2ml − 1 sup max Bj ,...,jq , B ′ − Bj ,...,jq , B ′ ′ ′ j1 ,...,jq j1 ,...,jq 1 1 ′ ′ z∈D j1 ,...,jq ,j1 ,...jq n,Lz Lz l=1 [{ ( ∏q ) }1/2 ] −1 (5.13) = Oa.s. n h log n . 
Lemma 5.6. For $V_{z,n}$ defined in (5.11) and $n$ large enough, $\sup_{z\in D}\|V_{z,n}^{-1}\|_\infty\le 2C_m(\prod_{l=1}^q h_l)^{-1}$, in which $C_m>0$ is the constant in Lemma 5.5.

Proof. By Lemma 5.2, as $n\to\infty$,
$$\sup_{z\in D}\left\|V_{z,n}-\bar V_{z,n}\right\|_\infty=\sup_{z\in D}\max_{j_1,\ldots,j_q}\sum_{j_1',\ldots,j_q'}\left|\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{n,L_z}-\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{L_z}\right|$$
$$\le\prod_{l=1}^q(2m_l-1)\ \sup_{z\in D}\max_{j_1,\ldots,j_q,\,j_1',\ldots,j_q'}\left|\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{n,L_z}-\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{L_z}\right|=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)\log n\right\}^{1/2}\right].\tag{5.13}$$
Let $\xi=\bar V_{z,n}\eta$ for any given vector $\eta$ of dimension $K_n\times 1$; then for any given $z\in D$, $\|\bar V_{z,n}^{-1}\xi\|_\infty\le\|\bar V_{z,n}^{-1}\|_\infty\|\xi\|_\infty\le C_m(\prod_{l=1}^q h_l)^{-1}\|\xi\|_\infty$ by Lemma 5.5, and thus $\|\bar V_{z,n}\eta\|_\infty\ge C_m^{-1}(\prod_{l=1}^q h_l)\|\eta\|_\infty$. By (5.13) and $\|(V_{z,n}-\bar V_{z,n})\eta\|_\infty\le\|V_{z,n}-\bar V_{z,n}\|_\infty\|\eta\|_\infty$, for $n$ large enough $\|V_{z,n}\eta\|_\infty\ge(1/2)C_m^{-1}(\prod_{l=1}^q h_l)\|\eta\|_\infty$. Letting $\xi_1=V_{z,n}\eta$, we then have $\|V_{z,n}^{-1}\xi_1\|_\infty\le 2C_m(\prod_{l=1}^q h_l)^{-1}\|\xi_1\|_\infty$ for any given $z\in D$ and $n$ large enough. Therefore the result follows.

Lemma 5.7. Under assumptions (A2)-(A4), as $n\to\infty$,
$$\sup_{z\in D}\left\|n^{-1}B^TL_z\mathbf{E}\right\|_\infty=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)\log n\right\}^{1/2}\right].$$

Proof. Let $D_n=n^\vartheta$ with $\vartheta<1/2$, $\vartheta(2+\delta)>1$, $\vartheta(1+\delta)>1/2$, which can be satisfied since $\delta>0$. We decompose the noise variable $\varepsilon_i$ into a truncated part and a tail part, $\varepsilon_i=\varepsilon_{i,1}^{D_n}+\varepsilon_{i,2}^{D_n}+\varepsilon_{i,3}^{D_n}$, where $\varepsilon_{i,1}^{D_n}=\varepsilon_i1(|\varepsilon_i|>D_n)$, $\varepsilon_{i,2}^{D_n}=\varepsilon_i1(|\varepsilon_i|\le D_n)-\varepsilon_{i,3}^{D_n}$, and $\varepsilon_{i,3}^{D_n}=E\{\varepsilon_i1(|\varepsilon_i|\le D_n)\,|\,X_i,Z_i\}$. Since $|\varepsilon_{i,3}^{D_n}|\le E\{|\varepsilon_i|^{2+\delta}\,|\,X_i,Z_i\}/D_n^{1+\delta}=o(n^{-1/2})$, one has
$$\sup_{z\in D,\,j_1,\ldots,j_q}\left|n^{-1}\sum_{i=1}^nB_{j_1,\ldots,j_q}(X_i)L(Z_i,z,\lambda)\sigma(X_i,Z_i)\varepsilon_{i,3}^{D_n}\right|=o\left(n^{-1/2}\right).$$
The tail part vanishes almost surely, since $\sum_{n=1}^\infty P(|\varepsilon_n|>D_n)\le M_\delta\sum_{n=1}^\infty n^{-\vartheta(2+\delta)}<\infty$; the Borel-Cantelli Lemma then implies that
$$\sup_{z\in D,\,j_1,\ldots,j_q}\left|n^{-1}\sum_{i=1}^nB_{j_1,\ldots,j_q}(X_i)L(Z_i,z,\lambda)\sigma(X_i,Z_i)\varepsilon_{i,1}^{D_n}\right|=O\left(n^{-k}\right),\quad\text{for any }k>0.$$
For the truncated part, using Bernstein's inequality in Theorem 1.2 of Bosq (1998), one has, as $n\to\infty$,
$$\sup_{z\in D,\,j_1,\ldots,j_q}\left|n^{-1}\sum_{i=1}^nB_{j_1,\ldots,j_q}(X_i)L(Z_i,z,\lambda)\sigma(X_i,Z_i)\varepsilon_{i,2}^{D_n}\right|=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)\log n\right\}^{1/2}\right].$$
Therefore the result of Lemma 5.7 follows.

Lemma 5.8. Under assumptions (A2)-(A4), for $g_\varepsilon(x,z)$ in (5.7), as $n\to\infty$,
$$\sup_{x\in[0,1]^q,\,z\in D}\left|g_\varepsilon(x,z)\right|=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\log n\right\}^{1/2}\right].$$

Proof. By Theorem 5.4.2 in DeVore and Lorentz (1993), Lemma 5.6, Lemma 5.7 and the definition of $g_\varepsilon(x,z)$, one has
$$\sup_{z\in D}\left\|\beta_\varepsilon(z)\right\|_\infty=\sup_{z\in D}\left\|V_{z,n}^{-1}\left(n^{-1}B^TL_z\mathbf{E}\right)\right\|_\infty\le\sup_{z\in D}\left\|V_{z,n}^{-1}\right\|_\infty\sup_{z\in D}\left\|n^{-1}B^TL_z\mathbf{E}\right\|_\infty=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\log n\right\}^{1/2}\right],$$
$$\sup_{x\in[0,1]^q,\,z\in D}\left|g_\varepsilon(x,z)\right|\le\sup_{z\in D}\left\|\beta_\varepsilon(z)\right\|_\infty=O_{a.s.}\!\left[\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\log n\right\}^{1/2}\right].$$

Let $\Sigma_z=\left\{\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{W_z}\right\}_{K_n\times K_n}$, where
$$\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{W_z}=E\left\{B_{j_1,\ldots,j_q}(X)B_{j_1',\ldots,j_q'}(X)L^2(Z,z,\lambda)\sigma^2(X,Z)\right\},$$
and
$$W_z=L_z\,\mathrm{diag}\left\{\sigma^2(X_1,Z_1),\ldots,\sigma^2(X_n,Z_n)\right\}L_z=\mathrm{diag}\left\{L^2(Z_1,z,\lambda)\sigma^2(X_1,Z_1),\ldots,L^2(Z_n,z,\lambda)\sigma^2(X_n,Z_n)\right\}.$$
For $(x,z)\in[0,1]^q\times D$ define
$$\sigma_n^2(x,z)=n^{-1}B(x)^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}B(x).\tag{5.14}$$
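As a numerical aside (again a sketch in the same toy setting as before, not taken from the text), the sandwich form (5.14) can be evaluated by plugging in the empirical counterparts of $\bar V_{z,n}$ and $\Sigma_z$, as is common in practice; the constant noise variance and all names below are assumptions of the sketch.

import numpy as np

def const_basis(x, num_knots):
    # piecewise-constant B-spline basis: B_j(x) = 1{ t_{j-1} <= x < t_j }
    knots = np.linspace(0.0, 1.0, num_knots + 2)
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, num_knots)
    B = np.zeros((len(x), num_knots + 1))
    B[np.arange(len(x)), idx] = 1.0
    return B

rng = np.random.default_rng(2)
n, lam, z = 400, 0.3, 1
X = rng.uniform(size=n)
Z = rng.integers(0, 2, size=n)
sigma2 = 0.04 * np.ones(n)                        # sigma^2(X_i, Z_i), constant in this toy example
w = np.where(Z != z, lam, 1.0)                    # L(Z_i, z, lambda) with one binary predictor (r = 1)

B = const_basis(X, 5)
V = (B.T * w) @ B / n                             # empirical counterpart of V_bar_{z,n}
Sigma = (B.T * (w**2 * sigma2)) @ B / n           # empirical counterpart of Sigma_z
Vinv = np.linalg.inv(V)
Bx = const_basis(np.linspace(0, 1, 9), 5)
var_hat = np.einsum("ij,jk,ik->i", Bx, Vinv @ Sigma @ Vinv, Bx) / n   # sigma_n^2(x, z), cf. (5.14)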
Lemma 5.9. Under assumptions (A2)-(A4), for $g_\varepsilon(x,z)$ in (5.7) and $\sigma_n^2(x,z)$ in (5.14), as $n\to\infty$, $\sigma_n^{-1}(x,z)\,g_\varepsilon(x,z)\to N(0,1)$. Moreover, for any given $(x,z)\in[0,1]^q\times D$,
$$c_\sigma^*\,n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\le\sigma_n^2(x,z)\le C_\sigma^*\,n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1},$$
for some constants $0<c_\sigma^*<C_\sigma^*<\infty$.

Proof. For any given $z\in D$, by the definition of $\beta_\varepsilon(z)$ in (5.6), for any $c\in\mathbb{R}^{K_n}$ with $\|c\|_2=1$ we can write $c^T\beta_\varepsilon(z)=\sum_{i=1}^na_i\varepsilon_i$, where
$$a_i^2=n^{-2}c^TV_{z,n}^{-1}B(X_i)B(X_i)^TL^2(Z_i,z,\lambda)\sigma^2(X_i,Z_i)V_{z,n}^{-1}c.$$
For any given $z\in D$, by Lemma 5.6, as $n\to\infty$ with probability 1,
$$\max_{1\le i\le n}a_i^2\le C_\sigma^2(2C_m)^2n^{-2}\left(\prod_{l=1}^q h_l\right)^{-2}.$$
Also, as $n\to\infty$ with probability 1,
$$\sum_{i=1}^na_i^2\ge c_\sigma^2C_B^{-2}\left(\prod_{l=1}^q h_l\right)^{-2}n^{-2}\sum_{i=1}^n\left\{c^TB(X_i)\right\}^2L^2(Z_i,z,\lambda)\ge c_\sigma^2C_B^{-2}\left(\prod_{l=1}^q h_l\right)^{-2}n^{-1}E\left[\left\{c^TB(X)\right\}^2L^2(Z,z,\lambda)\right](1-A_n),$$
the proof of which follows the same pattern as in Lemmas 5.1 and 5.3; thus as $n\to\infty$ with probability 1, $\sum_{i=1}^na_i^2\ge C'n^{-1}(\prod_{l=1}^q h_l)^{-1}$. Hence for any given $z\in D$,
$$\max_{1\le i\le n}a_i^2\Big/\sum_{i=1}^na_i^2=O_{a.s.}\!\left\{n^{-1}\left(\prod_{l=1}^q h_l\right)^{-1}\right\}=o_{a.s.}(1).$$
Given any $\xi>0$, one has
$$\lim_{n\to\infty}\|a\|_2^{-2}\sum_{k=1}^nE\left\{a_k^2\varepsilon_k^2\,1\left(|a_k\varepsilon_k|>\xi\|a\|_2\right)\right\}\le\lim_{n\to\infty}\|a\|_2^{-2}\sum_{k=1}^na_k^2\left\{E|\varepsilon|^{2+\delta}\right\}^{2/(2+\delta)}\left\{P\left(|\varepsilon|>\xi a_k^{-1}\|a\|_2\right)\right\}^{\delta/(2+\delta)}\le M_\delta^{2/(2+\delta)}\lim_{n\to\infty}\max_{1\le k\le n}\left\{P\left(|\varepsilon|>\xi a_k^{-1}\|a\|_2\right)\right\}^{\delta/(2+\delta)}=0;$$
thus the Lindeberg condition is satisfied. By the Lindeberg-Feller CLT, as $n\to\infty$, $\sum_{i=1}^na_i\varepsilon_i\big/\left(\sum_{i=1}^na_i^2\right)^{1/2}\to N(0,1)$. Therefore $\left[\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}\right]^{-1/2}g_\varepsilon(x,z)\to N(0,1)$, where
$$\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}=n^{-1}B(x)^TV_{z,n}^{-1}\Sigma_{n,z}V_{z,n}^{-1}B(x),$$
with $\Sigma_{n,z}=\left\{\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{W_z,n}\right\}_{K_n\times K_n}$ and
$$\left\langle B_{j_1,\ldots,j_q},B_{j_1',\ldots,j_q'}\right\rangle_{W_z,n}=n^{-1}\sum_{i=1}^nB_{j_1,\ldots,j_q}(X_i)B_{j_1',\ldots,j_q'}(X_i)L^2(Z_i,z,\lambda)\sigma^2(X_i,Z_i).$$
By Lemma 5.3 one can show that $w^T\Sigma_zw(1-A_n)\le w^T\Sigma_{n,z}w\le w^T\Sigma_zw(1+A_n)$ and $c_\Sigma(\prod_{l=1}^q h_l)\|w\|_2^2\le w^T\Sigma_zw\le C_\Sigma(\prod_{l=1}^q h_l)\|w\|_2^2$, thus
$$w^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}w\le C_\Sigma\left(\prod_{l=1}^q h_l\right)w^T\bar V_{z,n}^{-1}\bar V_{z,n}^{-1}w\le C_\Sigma\bar c_B^{-2}\left(\prod_{l=1}^q h_l\right)^{-1}\|w\|_2^2,$$
$$w^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}w\ge c_\Sigma\left(\prod_{l=1}^q h_l\right)w^T\bar V_{z,n}^{-1}\bar V_{z,n}^{-1}w\ge c_\Sigma\bar C_B^{-2}\left(\prod_{l=1}^q h_l\right)^{-1}\|w\|_2^2.$$
For any given $(x,z)\in[0,1]^q\times D$ it follows that $\sigma_n^2(x,z)\le n^{-1}C_\Sigma\bar c_B^{-2}(\prod_{l=1}^q h_l)^{-1}\|B(x)\|_2^2\le C_\sigma^*n^{-1}(\prod_{l=1}^q h_l)^{-1}$, where $\sigma_n^2(x,z)$ is defined in (5.14), and similarly $\sigma_n^2(x,z)\ge c_\sigma^*n^{-1}(\prod_{l=1}^q h_l)^{-1}$, for some constants $0<c_\sigma^*<C_\sigma^*<\infty$. For any given $(x,z)\in[0,1]^q\times D$, $\lim_{n\to\infty}\left[\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}\right]^{1/2}\left\{\sigma_n(x,z)\right\}^{-1}=1$, since
$$\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}\le n^{-1}B(x)^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}B(x)(1+A_n)(1-A_n)^{-2},$$
$$\mathrm{Var}\left\{g_\varepsilon(x,z)\,|\,\mathbf{X},\mathbf{Z}\right\}\ge n^{-1}B(x)^T\bar V_{z,n}^{-1}\Sigma_z\bar V_{z,n}^{-1}B(x)(1-A_n)(1+A_n)^{-2}.$$
Thus the result in Lemma 5.9 follows.
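The standardization in Lemma 5.9 can be examined in a small simulation: conditional on the design, $g_\varepsilon(x,z)$ is a weighted sum of the noise, and its ratio to $\sigma_n(x,z)$ should be approximately standard normal. The sketch below (illustrative only; toy setup with $q=r=1$, non-Gaussian noise, and names that are assumptions of the sketch rather than the text's) checks the sample mean and standard deviation of that ratio over Monte Carlo replications.

import numpy as np

def const_basis(x, num_knots):
    knots = np.linspace(0.0, 1.0, num_knots + 2)
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, num_knots)
    B = np.zeros((len(x), num_knots + 1))
    B[np.arange(len(x)), idx] = 1.0
    return B

rng = np.random.default_rng(3)
n, lam, z, sigma0 = 500, 0.3, 1, 0.5
X = rng.uniform(size=n)
Z = rng.integers(0, 2, size=n)
w = np.where(Z != z, lam, 1.0)                        # kernel weights L(Z_i, z, lambda)
B = const_basis(X, 5)
Bx = const_basis(np.array([0.37]), 5)                 # evaluation point x
V = (B.T * w) @ B / n
Vinv = np.linalg.inv(V)
Sigma = (B.T * (w**2 * sigma0**2)) @ B / n
sn = np.sqrt((Bx @ Vinv @ Sigma @ Vinv @ Bx.T)[0, 0] / n)   # sigma_n(x, z), cf. (5.14)

ratios = []
for _ in range(2000):                                  # Monte Carlo replications, design held fixed
    eps = rng.exponential(size=n) - 1.0                # centered, non-Gaussian noise with variance 1
    E = sigma0 * eps
    g_eps = (Bx @ Vinv @ (B.T * w) @ E / n)[0]         # g_eps(x, z) = B(x)' V^{-1} n^{-1} B'L_z E
    ratios.append(g_eps / sn)
print(np.mean(ratios), np.std(ratios))                 # expected: roughly 0 and 1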
Lemma 5.10. Under assumptions (A1), (A2) and (A4), for $g_g(x,z)$ in (5.7), as $n\to\infty$,
$$\sup_{x\in[0,1]^q,\,z\in D}\left|g_g(x,z)-g(x,z)\right|=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s\right).$$

Proof. For $1\le i\le n$, $1\le s\le r$, let $Z_{i,-s}$ be the vector $Z_i$ with its $s$-th component left out; then
$$L(Z_i,z,\lambda)=\prod_{s=1}^r\lambda_s^{1(Z_{is}\neq z_s)}=1(Z_i=z)+\sum_{s=1}^r\lambda_s1\left(Z_{is}\neq z_s,\,Z_{i,-s}=z_{-s}\right)+o\left(\sum_{s=1}^r\lambda_s\right).$$
Denote $L_z=L_{z,1}+L_{z,2}+L_{z,3}$, where $L_{z,1}=\mathrm{diag}\{1(Z_1=z),\ldots,1(Z_n=z)\}$,
$$L_{z,2}=\mathrm{diag}\left\{\sum_{s=1}^r\lambda_s1\left(Z_{1s}\neq z_s,\,Z_{1,-s}=z_{-s}\right),\ldots,\sum_{s=1}^r\lambda_s1\left(Z_{ns}\neq z_s,\,Z_{n,-s}=z_{-s}\right)\right\},$$
and $L_{z,3}=o(\sum_{s=1}^r\lambda_s)I_n$. Thus by the definition of $g_g(x,z)$ in (5.7),
$$g_g(x,z)-g(x,z)=B(x)^TV_{z,n}^{-1}\left(n^{-1}B^TL_z\mathbf{g}\right)-g(x,z)=\Pi_1+\Pi_2+\Pi_3,$$
where $\Pi_1=B(x)^TV_{z,n}^{-1}(n^{-1}B^TL_{z,1}\mathbf{g})-g(x,z)$, $\Pi_2=B(x)^TV_{z,n}^{-1}(n^{-1}B^TL_{z,2}\mathbf{g})$, and $\Pi_3=B(x)^TV_{z,n}^{-1}(n^{-1}B^TL_{z,3}\mathbf{g})$.

By Theorem 12.8 and (13.69) on p. 149 of de Boor (2001), for any $z\in D$ there exists $\beta(z)\in\mathbb{R}^{K_n}$ such that $\sup_{x\in[0,1]^q}|B(x)^T\beta(z)-g(x,z)|=O(\sum_{l=1}^qh_l^p)$. Let $\mathbf{g}_z=\{g(X_1,z),\ldots,g(X_n,z)\}^T$; since the diagonal of $L_{z,1}$ vanishes unless $Z_i=z$, one has $L_{z,1}\mathbf{g}=L_{z,1}\mathbf{g}_z$ and hence
$$\Pi_1=B(x)^TV_{z,n}^{-1}\left(n^{-1}B^TL_{z,1}\mathbf{g}_z\right)-g(x,z)=\Pi_{11}+\Pi_{12},$$
where $\Pi_{11}=B(x)^TV_{z,n}^{-1}\left[n^{-1}B^TL_{z,1}\{\mathbf{g}_z-B\beta(z)\}\right]$ and $\Pi_{12}=B(x)^TV_{z,n}^{-1}\left\{n^{-1}B^TL_{z,1}B\beta(z)\right\}-g(x,z)$. By Theorem 5.4.2 in DeVore and Lorentz (1993), Lemma 5.6 and properties of the B-spline basis,
$$\sup_{z\in D}\left\|V_{z,n}^{-1}\left[n^{-1}B^TL_{z,1}\{\mathbf{g}_z-B\beta(z)\}\right]\right\|_\infty\le\sup_{z\in D}\left\|V_{z,n}^{-1}\left(n^{-1}B^TL_{z,1}\right)\right\|_\infty\sup_{z\in D}\left\|\mathbf{g}_z-B\beta(z)\right\|_\infty=O_{a.s.}\left(\sum_{l=1}^qh_l^p\right),$$
so that
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\Pi_{11}\right|\le\sup_{z\in D}\left\|V_{z,n}^{-1}\left[n^{-1}B^TL_{z,1}\{\mathbf{g}_z-B\beta(z)\}\right]\right\|_\infty=O_{a.s.}\left(\sum_{l=1}^qh_l^p\right).$$
Since $n^{-1}B^TL_{z,1}B\beta(z)=V_{z,n}\beta(z)-n^{-1}B^T(L_{z,2}+L_{z,3})B\beta(z)$,
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\Pi_{12}\right|\le\sup_{x\in[0,1]^q,\,z\in D}\left|B(x)^T\beta(z)-g(x,z)\right|+\sup_{x\in[0,1]^q,\,z\in D}\left|B(x)^TV_{z,n}^{-1}\left\{n^{-1}B^T\left(L_{z,2}+L_{z,3}\right)B\beta(z)\right\}\right|\le O\left(\sum_{l=1}^qh_l^p\right)+\sup_{z\in D}\left\|V_{z,n}^{-1}\right\|_\infty\sup_{z\in D}\left\|n^{-1}B^TB\beta(z)\right\|_\infty O\left(\sum_{s=1}^r\lambda_s\right).$$
By Lemmas 5.1 and 5.2, one has $\sup_{z\in D}\|n^{-1}B^TB\|_\infty=O_{a.s.}(\prod_{l=1}^qh_l)$, hence
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\Pi_{12}\right|\le O\left(\sum_{l=1}^qh_l^p\right)+\sup_{z\in D}\left\|V_{z,n}^{-1}\right\|_\infty\sup_{z\in D}\left\|n^{-1}B^TB\right\|_\infty\sup_{z\in D}\left\|\beta(z)\right\|_\infty O\left(\sum_{s=1}^r\lambda_s\right)=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s\right),$$
and therefore
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\Pi_1\right|\le\sup_{x\in[0,1]^q,\,z\in D}\left(\left|\Pi_{11}\right|+\left|\Pi_{12}\right|\right)=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s\right).$$
Similarly,
$$\sup_{z\in D}\left\|V_{z,n}^{-1}\left(n^{-1}B^TL_{z,2}\mathbf{g}\right)\right\|_\infty\le\sup_{z\in D}\left\|V_{z,n}^{-1}\right\|_\infty\sup_{z\in D}\left\|n^{-1}B^T\right\|_\infty\left\|\mathbf{g}\right\|_\infty O\left(\sum_{s=1}^r\lambda_s\right)=O_{a.s.}\left(\sum_{s=1}^r\lambda_s\right),$$
and the analogous bound holds for the term involving $L_{z,3}$; thus
$$\sup_{x\in[0,1]^q,\,z\in D}\left|g_g(x,z)-g(x,z)\right|\le\sup_{x\in[0,1]^q,\,z\in D}\left(\left|\Pi_1\right|+\left|\Pi_2\right|+\left|\Pi_3\right|\right)=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s\right).$$

Proofs of Theorems 5.1 and 5.2. Theorem 5.1 follows directly from Lemmas 5.8 and 5.10, while Theorem 5.2 follows directly from Lemmas 5.9 and 5.10 together with assumption (A4).
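The exact statements of Theorems 5.1 and 5.2 appear earlier in the chapter and are not reproduced here; the way Lemmas 5.8 and 5.10 presumably combine for the uniform rate is the elementary triangle-inequality step
$$\sup_{x\in[0,1]^q,\,z\in D}\left|\hat g(x,z)-g(x,z)\right|\le\sup_{x,z}\left|g_g(x,z)-g(x,z)\right|+\sup_{x,z}\left|g_\varepsilon(x,z)\right|=O_{a.s.}\left(\sum_{l=1}^qh_l^p+\sum_{s=1}^r\lambda_s+\left\{\frac{\log n}{n\prod_{l=1}^qh_l}\right\}^{1/2}\right),$$
with the first term controlled by Lemma 5.10 and the second by Lemma 5.8.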
BIBLIOGRAPHY

[1] Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method. Biometrika, 63, 413–420.
[2] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.
[3] Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. Annals of Statistics, 1, 1071–1095.
[4] Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes. Springer-Verlag, New York.
[5] Breiman, L. & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–619.
[6] Cai, Z., Fan, J. and Li, R. (2000). Efficient estimation and inferences for varying-coefficient models. Journal of American Statistical Association, 95, 888–902.
[7] Caspi, A. and Moffitt, T. E. (2006). Gene-environment interactions in psychiatry: joining forces with neuroscience. Nature Reviews Neuroscience, 7, 583–590.
[8] Caspi, A., Sugden, K., Moffitt, T. E., Taylor, A., Craig, I. W., Harrington, H., McClay, J., Mill, J., Martin, J., Braithwaite, A. and Poulton, R. (2003). Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene. Science, 301, 386–389.
[9] Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, 13, 571–591.
[10] Cardot, H. and Sarda, P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis, 92, 24–41.
[11] Chatterjee, N. and Carroll, R. J. (2005). Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika, 92, 399–418.
[12] Chiang, C. T., Rice, J. A. and Wu, C. O. (2001). Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables. Journal of American Statistical Association, 96, 605–619.
[13] Claeskens, G., and Van Keilegom, I. (2003). Bootstrap confidence bands for regression curves and their derivatives. The Annals of Statistics, 31, 1852–1884.
[14] Cleveland, W. S., Grosse, E. and Shyu, W. M. (1991). Local regression models. In Statistical Models in S, J. M. Chambers and T. J. Hastie, 309–376. Pacific Grove: Wadsworth & Brooks.
[15] Cockerham, C. C. (1954). An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics, 39, 859–882.
[16] Costa, L. G. and Eaton, D. L. (2006). Gene-Environment Interactions: Fundamentals of Ecogenetics. Hoboken, NJ: John Wiley & Sons.
[17] Csörgő, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics. Academic Press, New York-London.
[18] de Boor, C. (2001). A Practical Guide to Splines. Springer-Verlag, New York.
[19] Demko, S. (1986). Spectral bounds for ∥A⁻¹∥∞. Journal of Approximation Theory, 48, 207–212.
[20] DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation. Springer-Verlag, New York.
[21] Falconer, D. S. (1952). The problem of environment and selection. Natural American, 86, 293–298.
[22] Fan, J. (1993). Local linear regression smoothers and their minimax efficiency. The Annals of Statistics, 21, 196–216.
[23] Fan, J. Q., and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
[24] Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient models. The Annals of Statistics, 27, 1491–1518.
[25] Fan, J. and Zhang, W. Y. (2000). Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scandinavian Journal of Statistics, 27, 715–731.
[26] Fan, J. and Zhang, W. (2008). Statistical methods with varying coefficient models. Statistics and its Interface, 1, 179–195.
[27] Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer, Berlin.
[28] Friedman, J. H. (1991). Multivariate Adaptive Regression Splines. The Annals of Statistics, 19, 1–67.
[29] Gijbels, I., Lambert, A., and Qiu, P. H. (2007). Jump-preserving regression and smoothing using local linear fitting: a compromise. Annals of the Institute of Statistical Mathematics, 59, 235–272.
[30] González-Manteiga, W., Martínez-Miranda, M. D. & Raya-Miranda, R. (2008). SiZer Map for inference with additive models. Statistics and Computing, 18, 297–312.
[31] Guo, S. W. (2000). Gene-environment interaction and the mapping of complex traits: some statistical models and their implications. Human Heredity, 50, 286–303.
[32] Hall, P. and Heckman, N. (2002). Estimating and depicting the structure of a distribution of random functions. Biometrika, 89, 145–158.
[33] Hall, P., Müller, H. G., and Wang, J. L. (2006). Properties of principal component methods for functional and longitudinal data analysis. Annals of Statistics, 34, 1493–1517.
[34] Hall, P., and Patil, P. (1995). Formulae for mean integrated squared error of nonlinear wavelet-based density estimators. Annals of Statistics, 23, 905–928.
[35] Hall, P. and Titterington, D. M. (1988). On confidence bands in nonparametric density estimation and regression. Journal of Multivariate Analysis, 27, 228–254.
[36] Härdle, W. (1989). Asymptotic maximal deviation of M-smoothers. Journal of Multivariate Analysis, 29, 163–179.
[37] Härdle, W., Hlávka, Z., and Klinke, S. (2000). XploRe Application Guide. Springer-Verlag, Berlin.
[38] Härdle, W. and Marron, J. S. (1991). Bootstrap simultaneous error bars for nonparametric regression. Annals of Statistics, 19, 778–796.
[39] Härdle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regression fits. Annals of Statistics, 21, 1926–1947.
[40] Harrison, D. & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for cleaning air. Journal of Economics and Management, 5, 81–102.
[41] Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.
[42] Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B, 55, 757–796.
[43] Hahn, L. W., Ritchie, M. D., Moore, J. H. (2003). Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics, 19, 376–382.
[44] Hoover, D., Rice, J., Wu, C. O. and Yang, L. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85, 809–822.
[45] Huang, J. Z. (1998). Projection estimation in multiple regression with application to functional ANOVA models. Annals of Statistics, 26, 242–272.
[46] Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. The Annals of Statistics, 31, 1600–1635.
[47] Huang, J. Z. & Yang, L. (2004). Identification of nonlinear additive autoregression models. Journal of the Royal Statistical Society, Series B, 66, 463–477.
[48] Huang, X., Wang, L., Yang, L., and Kravchenko, A. N. (2008). Management practice effects on relationships of grain yields with topography and precipitation. Agronomy Journal, 100, 1463–1471.
[49] Huang, J., Wu, C. and Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 14, 763–788.
[50] Izem, R. and Marron, J. S. (2007). Analysis of nonlinear modes of variation for functional data. Electronic Journal of Statistics, 1, 641–676.
[51] James, G. M., Hastie, T., and Sugar, C. (2000). Principal component models for sparse functional data. Biometrika, 87, 587–602.
[52] James, G. M. (2002). Generalized linear models with functional predictors. Journal of the Royal Statistical Society B, 64, 411–432.
[53] James, G. M. and Silverman, B. W. (2005). Functional adaptive model estimation. Journal of the American Statistical Association, 100, 565–576.
[54] James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98, 397–408.
[55] Joo, J., and Qiu, P. (2009). Jump detection in a regression curve and its derivative. Technometrics, 51, 289–305.
[56] Kang, K. H., Koo, J. Y., and Park, C. W. (2000). Kernel estimation of discontinuous regression function. Statistics and Probability Letters, 47, 277–285.
[57] Kılıç, E. (2008). Explicit formula for the inverse of a tridiagonal matrix by backward continued fractions. Applied Mathematics and Computation, 197, 345–357.
[58] Koo, J. Y. (1997). Spline estimation of discontinuous regression functions. Journal of Computational and Graphical Statistics, 6, 266–284.
[59] Kraft, P., Yen, Y. C., Stram, D. O., Morrison, J. and Gauderman, W. J. (2007). Exploiting gene-environment interaction to detect genetic associations. Human Heredity, 63, 111–119.
[60] Lamb, K., and Rizzino, A. (1998). Effects of differentiation on the transcriptional regulation of the FGF-4 gene: Critical roles played by a distal enhancer. Molecular Reproduction and Development, 51, 218–224.
[61] Leadbetter, M. R., Lindgren, G., and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag, New York.
[62] Li, S. Y., Lu, Q., Fu, W., Romero, R., and Cui, Y. (2009). A regularized regression approach for dissecting genetic conflicts that increase disease risk in pregnancy. Statistical Applications in Genetics and Molecular Biology, Vol. 8, Iss. 1, Article 45.
[63] Li, Q. (2000). Efficient estimation of additive partially linear models. International Economic Review, 41, 1073–1092.
[64] Li, Q. and J. S. Racine (2004). Cross-validated local linear nonparametric regression. Statistica Sinica, 14, 485–512.
[65] Li, Y. and Hsing, T. (2007). On rates of convergence in functional linear regression. Journal of Multivariate Analysis, 98, 1782–1804.
[66] Li, Y. and Hsing, T. (2010). Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. Annals of Statistics, 6, 3321–3351.
[67] Liang, H. (2006). Estimation in partially linear models and numerical comparisons. Computational Statistics & Data Analysis, 50, 675–687.
[68] Liang, H., Wang, S. & Carroll, R. (2007). Partially linear models with missing response variables and error-prone covariates. Biometrika, 94, 185–198.
[69] Liang, H., Thurston, S., Ruppert, D., Apanasovich, T. & Hauser, R. (2008). Additive partial linear models with measurement errors. Biometrika, 95, 667–678.
[70] Li, Q. and J. Racine (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press.
[71] Liu, R. and Yang, L. (2010). Spline-backfitted kernel smoothing of additive coefficient model. Econometric Theory, 26, 29–59.
[72] Ma, S. and Yang, L. (2011a). A jump-detecting procedure based on polynomial spline estimation. Journal of Nonparametric Statistics, 23, 67–81.
[73] Ma, S. and Yang, L. (2011b). Spline-backfitted kernel smoothing of partially linear additive model. Journal of Statistical Planning and Inference, 141, 204–219.
[74] Ma, S., Yang, L. and Carroll, R. (2011). A simultaneous confidence band for sparse longitudinal regression. Statistica Sinica, accepted.
[75] Ma, S. and Racine, J. (2011). Additive regression splines with irrelevant regressors. Manuscript.
[76] Ma, S., Racine, J. and Yang, L. (2011). Spline regression in the presence of categorical predictors. Manuscript.
[77] Mack, Y. P., and Silverman, B. W. (1982). Weak and strong uniform consistency of kernel regression estimates. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 61, 405–415.
[78] Maity, A., Carroll, R. J., Mammen, E. and Chatterjee, N. (2009). Testing in semiparametric models with interaction, with applications to gene-environment interactions. Journal of the Royal Statistical Society B, 71, 75–96.
[79] Morris, J. S. and Carroll, R. J. (2006). Wavelet-based functional mixed models. Journal of the Royal Statistical Society B, 68, 179–199.
[80] Müller, H. G. (1992). Change-points in nonparametric regression analysis. The Annals of Statistics, 20, 737–761.
[81] Müller, H. G. and Stadtmüller, U. (2005). Generalized functional linear models. Annals of Statistics, 33, 774–805.
[82] Müller, H. G., Stadtmüller, U., and Yao, F. (2006). Functional variance processes. Journal of the American Statistical Association, 101, 1007–1018.
[83] Müller, H. G., and Song, K. S. (1997). Two-stage change-point estimators in smooth regression models. Statistics & Probability Letters, 34, 323–335.
[84] Müller, H. G. and Yao, F. (2008). Functional additive models. Journal of American Statistical Association, 103, 1534–1544.
[85] Nácher, V., Ojeda, S., Cadarso-Suárez, C., Roca-Pardiñas, J. & Acuña, C. (2006). Neural correlates of memory retrieval in the prefrontal cortex. European Journal of Neuroscience, 24, 925–936.
[86] Abramson, M. A., Audet, C., Couture, G., Dennis Jr., J. E. and Le Digabel, S. (2011). The NOMAD project. Software available at http://www.gerad.ca/nomad.
[87] Park, C., and Kim, W. (2004). Estimation of a regression function with a sharp change point using boundary wavelets. Statistics & Probability Letters, 66, 435–448.
[88] Park, C., and Kim, W. (2006). Wavelet estimation of a regression function with a sharp change point in a random design. Journal of Statistical Planning and Inference, 136, 2381–2394.
[89] Peacock, M., Turner, C. H., Econs, M. J. and Foroud, T. (2002). Genetics of osteoporosis. Endocrine Reviews, 23, 303–326.
[90] Qiu, P. H., Asano, C., and Li, X. (1991). Estimation of jump regression function. Bulletin of Informatics and Cybernetics, 24, 197–212.
[91] Qiu, P. H. (1994). Estimation of the number of jumps of the jump regression functions. Communications in Statistics-Theory and Methods, 23, 2141–2155.
[92] Qiu, P. H., and Yandell, B. (1998). A local polynomial jump detection algorithm in nonparametric regression. Technometrics, 40, 141–152.
[93] Qiu, P. H. (2003). A jump-preserving curve fitting procedure based on local piecewise-linear kernel estimation. Journal of Nonparametric Statistics, 15, 437–453.
[94] Qiu, P. H. (2005). Image Processing and Jump Regression Analysis. John Wiley & Sons, New York.
[95] Qiu, P. H. (2007). Jump surface estimation, edge detection, and image restoration. Journal of the American Statistical Association, 102, 745–756.
[96] Racine, J. S. and Q. Li (2004). Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119, 99–130.
[97] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, Second Edition. Springer, New York.
[98] Roca-Pardiñas, J., Cadarso-Suárez, C., Nácher, V. & Acuña, C. (2006). Bootstrap-based methods for testing factor-by-curve interactions in generalized additive models: assessing prefrontal cortex neural activity related to decision-making. Statistics in Medicine, 25, 2483–2501.
[99] Schimek, M. (2000). Estimation and inference in partially linear models with smoothing splines. Journal of Statistical Planning and Inference, 91, 525–540.
[100] Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
[101] Shiau, J. (1987). A note on MSE coverage intervals in a partial spline model. Communications in Statistics-Theory and Methods, 16, 1851–1866.
[102] Song, Q. and Yang, L. (2009). Spline confidence bands for variance function. Journal of Nonparametric Statistics, 21, 589–609.
[103] Stone, C. J. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39, 44–47.
[104] Stone, C. J. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13, 689–705.
[105] Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics, 22, 118–184.
[106] Su, L. & Ullah, A. (2006). Profile likelihood estimation of partially linear panel data models with fixed effects. Economics Letters, 92, 75–81.
[107] Sunklodas, J. (1984). On the rate of convergence in the central limit theorem for strongly mixing random variables. Lithuanian Mathematical Journal, 24, 182–190.
[108] Tjøstheim, D. & Auestad, B. (1994). Nonparametric identification of nonlinear time series: projections. Journal of the American Statistical Association, 89, 1398–1409.
[109] Ulrich, C. M., Kampman, E., Bigler, J., Schwartz, S. M., Chen, C., Bostick, R., Fosdick, L., Beresford, S., Yasui, Y. and Potter, J. (1999). Colorectal adenomas and the C677T MTHFR polymorphism: evidence for gene-environment interaction. Cancer Epidemiology, Biomarkers & Prevention, 8, 659–668.
[110] Xia, Y. C. (1998). Bias-corrected confidence bands in nonparametric regression. Journal of the Royal Statistical Society, Series B, 60, 797–811.
[111] Xia, Y. and Li, W. K. (1999). On the estimation and testing of functional-coefficient linear models. Statistica Sinica, 3, 735–757.
[112] Xue, L., and Yang, L. J. (2006). Additive coefficient modeling via polynomial spline. Statistica Sinica, 16, 1423–1446.
[113] Xue, L. (2006). Variable selection in additive models. Statistica Sinica, 19, 1281–1296.
[114] Xue, L. and Liang, H. (2010). Polynomial spline estimation for the generalized additive coefficient model. Scandinavian Journal of Statistics, 37, 26–46.
[115] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
[116] Wang, N., Carroll, R. J., and Lin, X. (2005). Efficient semiparametric marginal estimation for longitudinal/clustered data. Journal of the American Statistical Association, 100, 147–157.
[117] Wang, L. and Yang, L. (2007). Spline-backfitted kernel smoothing of nonlinear additive autoregression model. Annals of Statistics, 35, 2474–2503.
[118] Wang, L. (2009). Single-index model-assisted estimation in survey sampling. Journal of Nonparametric Statistics, 21, 487–504.
[119] Wang, J., and Yang, L. J. (2009a). Polynomial spline confidence bands for regression curves. Statistica Sinica, 19, 325–342.
[120] Wang, J. & Yang, L. (2009b). Efficient and fast spline-backfitted kernel smoothing of additive models. Annals of the Institute of Statistical Mathematics, 61, 663–690.
[121] Wang, Y. (1995). Jump and sharp cusp detection by wavelets. Biometrika, 82, 385–397.
[122] Wu, J. S., and Chu, C. K. (1993). Kernel-type estimators of jump points and values of a regression function. The Annals of Statistics, 3, 1545–1566.
[123] Wu, W. and Zhao, Z. (2007). Inference of trends in time series. Journal of the Royal Statistical Society B, 69, 391–410.
[124] Yao, F. and Lee, T. C. M. (2006). Penalized spline models for functional principal component analysis. Journal of the Royal Statistical Society B, 68, 3–25.
[125] Yao, F., Müller, H. G., and Wang, J. L. (2005a). Functional linear regression analysis for longitudinal data. Annals of Statistics, 33, 2873–2903.
[126] Yao, F., Müller, H. G., and Wang, J. L. (2005b). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100, 577–590.
[127] Yao, F. (2007). Asymptotic distributions of nonparametric regression estimators for longitudinal or functional data. Journal of Multivariate Analysis, 98, 40–56.
[128] Zhang, J. T. and Chen, J. (2007). Statistical inferences for functional data. Annals of Statistics, 35, 1052–1079.
[129] Zhao, X., Marron, J. S., and Wells, M. T. (2004). The functional data analysis view of longitudinal data. Statistica Sinica, 14, 789–808.
[130] Zhao, Z. and Wu, W. (2008). Confidence bands in nonparametric time series regression. Annals of Statistics, 36, 1854–1878.
[131] Zhou, L., Huang, J., and Carroll, R. J. (2008). Joint modelling of paired sparse functional data using principal components. Biometrika, 95, 601–619.
[132] Zhou, S., Shen, X. and Wolfe, D. A. (1998). Local asymptotics of regression splines and confidence regions. The Annals of Statistics, 26, 1760–1782.
[133] Zhou, X. and You, J. (2004). Wavelet estimation in varying-coefficient partially linear regression models. Statistics & Probability Letters, 68, 91–104.