This is to certify that the dissertation entitled

ADDITIVE COEFFICIENT MODELING VIA MARGINAL INTEGRATION AND POLYNOMIAL SPLINE SMOOTHING

presented by

LAN XUE

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Statistics.

Major Professor's Signature                                   Date

MSU is an Affirmative Action/Equal Opportunity Institution


ADDITIVE COEFFICIENT MODELLING VIA MARGINAL INTEGRATION AND POLYNOMIAL SPLINE SMOOTHING

By

Lan Xue

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Statistics and Probability

2005


ABSTRACT

ADDITIVE COEFFICIENT MODELLING VIA MARGINAL INTEGRATION AND POLYNOMIAL SPLINE SMOOTHING

By

Lan Xue

In this dissertation, we propose a flexible semi-parametric model called the additive coefficient model (ACM). In the ACM, one assumes that the response depends linearly on some covariates, whose regression coefficients, however, are additive functions of another set of covariates. The ACM can be viewed as a generalization of the classic linear model: instead of assuming the coefficients to be constants as the linear model does, it allows the regression coefficients to vary with another set of covariates through an additive functional form. This dissertation focuses on the estimation of the ACM. Two different approaches are considered. One is the local polynomial based marginal integration method, and the other is polynomial spline estimation. Local polynomial smoothing is local in nature, whereas the polynomial spline is a global smoothing method. This difference, in turn, leads to differences in the asymptotic behavior of the two types of estimators. Under weak dependence, pointwise asymptotic normality is established for the marginal integration estimators. It is found that the estimators of the parameters in the regression coefficients have rate of convergence $1/\sqrt{n}$, and the nonparametric additive components are estimated at the same rate of convergence as in univariate smoothing. In contrast, only mean square convergence is established for the polynomial spline estimators. However, the polynomial spline method is much simpler in both computation and inference. Nonparametric versions of AIC and BIC based on the polynomial spline estimation are easily adopted for model selection. Monte Carlo studies are conducted to compare the numerical performance of the two estimation methods, as well as the model selection procedures.
The simulation studies show that, besides being highly efficient in terms of computing, the polynomial spline estimators are also more accurate than, or at least as good as, the local polynomial based estimators. The ACM is also successfully applied to several interesting empirical examples: West German GNP, housing price, and sunspot data, where the semi-parametric additive coefficient model demonstrates superior performance in terms of out-of-sample forecasts.

COPYRIGHT BY

LAN XUE

2005

To my grandfather, my parents, my sisters, and Li

ACKNOWLEDGEMENTS

I would like to thank my adviser, Professor Lijian Yang, for his unwavering support in the past five years of my doctoral program. I am very fortunate to have him as adviser. As an adviser, he always made himself available for all types of responsibilities. He provided continued guidance, an endless supply of patience, and tremendous insight for my research. I owe every single piece of my achievement to him.

I'd like to thank Professors Dennis Gilliland, V. S. Mandrekar, Shlomo Levental and Jiaguo Qi for serving on my thesis committee. Professors Gilliland and Mandrekar provided me with tremendous help and support during my last year in the program, especially during the critical time of my job search. Also I am very grateful to Professor Qi for supporting me in the CLIP program. Working in CLIP has been an invaluable experience for me.

I would also like to express my gratitude to Professor Connie Page, who taught me how to be a good consultant and provided me with a great deal of encouragement and guidance when I worked at the statistical consulting service center. I also want to thank Professor Yijun Zuo for teaching me a wonderful course in Robust Statistics. My special thanks go to Professor James Stapleton for his continued support and friendship throughout the years. Also I want to thank Professor Habib Salehi and Cathy for assistance in my simulation.

Last, but not least, I'd like to thank all the professors and friends who ever helped me during my stay at Michigan State University. It was so nice being with you.

Contents

List of Tables  ix
List of Figures  x

1 The model  1
  1.1 Introduction  1
  1.2 The Additive Coefficient Model  4
  1.3 Model Identification  6
  1.4 Data Generating Process  8

2 Marginal Integration Estimation  12
  2.1 Introduction  12
  2.2 Estimators of constants  14
  2.3 Estimators of function components  16
  2.4 Implementation  19
  2.5 Assumption and proofs  23
    2.5.1 Assumptions  23
    2.5.2 Technical lemmas  25
    2.5.3 Proof of Theorem 2.2.1  31
    2.5.4 Proof of Theorem 2.3.1  39

3 Polynomial Spline Estimation  47
  3.1 Introduction  47
  3.2 The Set-up and Notations  49
  3.3 Polynomial Spline Estimation  52
    3.3.1 The estimators  52
    3.3.2 Knot number selection  55
    3.3.3 Model selection  56
  3.4 Assumption and Proofs  58
    3.4.1 Assumptions and notations  58
    3.4.2 Technical lemmas  60
    3.4.3 Proof of mean square consistency  67
    3.4.4 Proof of BIC consistency  71

4 Examples  74
  4.1 Monte Carlo Studies  74
    4.1.1 An i.i.d example  75
    4.1.2 A nonlinear autoregressive example  78
  4.2 Empirical Examples  80
    4.2.1 West German GNP  80
    4.2.2 Wolf's annual sunspot number  82
    4.2.3 Housing price  85

BIBLIOGRAPHY  112

List of Tables

4.1 GNP data: the ASEs and ASPEs of six fits.
4.2 Simulated i.i.d example: estimation of constants.
4.3 Simulated i.i.d example: estimation of function components.
4.4 Simulated i.i.d example: model selection with BIC and AIC.
4.5 Simulated nonlinear AR model: estimation of constants.
4.6 Simulated nonlinear AR model: estimation of function components.
4.7 Simulated nonlinear AR model: model selection with AIC and BIC.
4.8 Wolf's Sunspot Number: out-of-sample absolute prediction errors.
4.9 Tucson housing price: estimation and prediction results.

List of Figures

4.1 GNP data: one-step prediction performance.
4.2 Kernel density estimates of $\hat{h}_{1,opt}/h_{1,opt}$.
4.3 Plots of the estimated coefficient functions using marginal integration.
4.4 Plots of the estimated coefficient functions using cubic spline.
4.5 Time plot of a simulated series from model (4.2), with n = 100.
4.6 Plots of the estimated coefficient functions using linear spline.
4.7 GNP data: time plot of the series $\{G_t\}_{t=1}^{124}$.
4.8 GNP data after transformation: time plot of the series $\{Y_t\}$.
4.9 Scatter plot of $Y_t$, $Y_{t-2}$ at three levels of $Y_{t-1}$.
4.10 Scatter plot of $Y_t$, $Y_{t-2}$ at three levels of $Y_{t-8}$.
4.11 Estimated functions and their bootstrap 95% confidence intervals.
4.12 Spline approximations of the functions.
4.13 Estimated function components.
4.14 Time plot of the fitted values based on marginal integration.
4.15 Estimated functions with cubic spline approximation.
4.16 Time plot of the fitted values with cubic approximation.

Chapter 1

The model

1.1 Introduction

An important task in statistical analysis is to quantify the association between two sets of variables, say a univariate variable Y and a d-dimensional vector $\mathbf{X}$. In regression analysis, one focuses on the averaged (or expected) response of Y given $\mathbf{X}$, i.e., $m(\mathbf{X}) = E(Y|\mathbf{X})$, which is also known as the regression function. To estimate the unknown regression function $m(\mathbf{X})$, parametric regression analysis begins by assuming that $m(\mathbf{X})$ takes a pre-determined functional form with only finitely many unknown parameters, i.e., $m(\mathbf{X}) = m(\boldsymbol{\beta}, \mathbf{X})$, where $\boldsymbol{\beta}$ is a set of unknown coefficients and the function $m(\boldsymbol{\beta}, \mathbf{x})$ is specified in advance. As a special case, linear regression assumes that $m(\boldsymbol{\beta}, \mathbf{x})$ is a linear function in $\boldsymbol{\beta}$. The unknown coefficients $\boldsymbol{\beta}$ can be estimated using, for example, the least squares method. However, the restricted parametric form often cannot explain (or approximate) well a complicated data structure. Furthermore, parametric regression can lead to excessive estimation biases and erroneous inferences if a wrong model function $m(\boldsymbol{\beta}, \mathbf{x})$ is used. On the other hand, nonparametric regression makes minimal assumptions about the regression function m.
Without assuming that $m(\boldsymbol{\beta}, \mathbf{x})$ takes any particular form, it allows the data to speak for themselves, and thus it can uncover data structures that linear and parametric regressions are unable to detect. To estimate the nonparametric regression function m, several smoothing methods have been developed, for example, kernel smoothing (Nadaraya 1964, Watson 1964, Gasser & Müller 1984), local polynomial smoothing (Cleveland 1979, Wand & Jones 1995, Fan & Gijbels 1996), polynomial splines (Stone 1985), smoothing splines (Eubank 1988, Wahba 1990), penalized splines (Eilers & Marx 1996, Ruppert, Wand & Carroll 2003) and wavelet thresholding (Chiu 1992, Donoho & Johnstone 1995, Härdle et al. 1998). In this dissertation, we focus on two of them: local polynomial smoothing and the polynomial spline.

A serious limitation of the general nonparametric model is the "curse of dimensionality" phenomenon. This term refers to the fact that the convergence rate of nonparametric smoothing estimators becomes rather slow when the estimation target is a general function of a large number of variables without additional structure. Many efforts have been made to impose structure on the regression function to partly alleviate the "curse of dimensionality", an approach broadly described as dimension reduction. Some well-known dimension reduction approaches are: (generalized) additive models (Chen & Tsay 1993a, Hastie & Tibshirani 1990, Sperlich, Tjøstheim & Yang 2002, Stone 1985), partially linear models (Härdle, Liang & Gao 2000) and varying coefficient models (Hastie & Tibshirani 1993).

The idea of the varying coefficient model is especially appealing. It allows a response variable to depend linearly on some regressors, with coefficients that are smooth functions of some other predictor variables. The additive-linear structure enables simple interpretation and avoids the curse of dimensionality problem in high dimensional cases. Specifically, consider a multivariate regression model in which a sample $\{(Y_i, \mathbf{X}_i, \mathbf{T}_i)\}_{i=1}^{n}$ is drawn that satisfies

$$Y_i = m(\mathbf{X}_i, \mathbf{T}_i) + \sigma(\mathbf{X}_i, \mathbf{T}_i)\,\varepsilon_i, \qquad (1.1)$$

where, for the response variable $Y_i$ and predictor vectors $\mathbf{X}_i$ and $\mathbf{T}_i$, m and $\sigma^2$ are the conditional mean and variance functions

$$m(\mathbf{X}_i, \mathbf{T}_i) = E(Y_i|\mathbf{X}_i, \mathbf{T}_i), \qquad \sigma^2(\mathbf{X}_i, \mathbf{T}_i) = \mathrm{var}(Y_i|\mathbf{X}_i, \mathbf{T}_i), \qquad (1.2)$$

and $E(\varepsilon_i|\mathbf{X}_i, \mathbf{T}_i) = 0$, $\mathrm{var}(\varepsilon_i|\mathbf{X}_i, \mathbf{T}_i) = 1$. For the varying coefficient model, the conditional mean takes the following form

$$m(\mathbf{X}, \mathbf{T}) = \sum_{l=1}^{d} \alpha_l(X_l)\, T_l, \qquad (1.3)$$

in which all tuning variables $X_l$, $l = 1, \ldots, d$, make up the vector $\mathbf{X}$, and all linear predictor variables $T_l$, $l = 1, \ldots, d$, are univariate and distinct.

Hastie & Tibshirani (1993) proposed a backfitting algorithm to estimate the varying coefficient functions $\{\alpha_l(x_l)\}_{l=1}^{d}$, but gave no asymptotic justification of the algorithm. A somewhat restricted model, the functional coefficient model, was proposed in the time series context by Chen & Tsay (1993b) and later in the context of longitudinal data by Hoover, Rice, Wu & Yang (1998), in which all the tuning variables $X_l$, $l = 1, \ldots, d$, are the same and univariate. For more recent developments of the functional coefficient model, see Cai, Fan & Yao (2000). In a different direction, Yang, Härdle, Park & Xue (2004) studied inference for model (1.3) when all the tuning variables $\{X_l\}_{l=1}^{d}$ are univariate but have a joint d-dimensional density. This model breaks the restrictive nature of the functional coefficient model that all the tuning variables $X_l$, $l = 1, \ldots, d$, have to be equal.
On the other hand, it requires that none of the tuning variables $X_l$, $l = 1, \ldots, d$, are equal. In this dissertation, we propose a more flexible additive coefficient model, which includes functional/varying coefficient models as special cases.

1.2 The Additive Coefficient Model

We propose the following additive coefficient model, which has a more flexible form, namely

$$m(\mathbf{X}, \mathbf{T}) = \sum_{l=1}^{d_1} \alpha_l(\mathbf{X})\, T_l, \qquad \alpha_l(\mathbf{X}) = \sum_{s=1}^{d_2} \alpha_{ls}(X_s), \qquad \forall\, 1 \le l \le d_1, \qquad (1.4)$$

in which the coefficient functions $\{\alpha_l(\mathbf{X})\}_{l=1}^{d_1}$ are additive functions of the tuning variables $\mathbf{X} = (X_1, \ldots, X_{d_2})^T$. Note that without the additivity restriction on the coefficient functions $\{\alpha_l(\mathbf{X})\}_{l=1}^{d_1}$, model (1.4) would be a kind of functional coefficient model with a multivariate tuning variable $\mathbf{X}$ instead of a univariate one as in the existing literature. The additive structure is imposed on the coefficient functions $\{\alpha_l(\mathbf{X})\}_{l=1}^{d_1}$ so that inference can be made on them without the "curse of dimensionality". To understand the flexibility of this model, we look at some of the models that are included as special cases:

1. When the dimension of $\mathbf{X}$ is 1 ($d_2 = 1$), (1.4) reduces to the functional coefficient model of Chen & Tsay (1993b).

2. When the linear regressor vector $\mathbf{T}$ is constant ($d_1 = 1$ and $T_1 \equiv 1$), (1.4) reduces to the additive model of Chen & Tsay (1993a), Hastie & Tibshirani (1990).

3. When, for any fixed $l = 1, \ldots, d_1$, $\alpha_{ls}(x_s) \equiv 0$ for all but one $s = 1, \ldots, d_2$, (1.4) reduces to the varying coefficient model (1.3) of Hastie & Tibshirani (1993).

4. When $d_1 = d_2 = d$ and $\alpha_{ls}(x_s) \equiv 0$ for $l \ne s$, (1.4) reduces to the varying-coefficient model of Yang, Härdle, Park & Xue (2004).

The additive coefficient model is a useful nonparametric alternative to parametric models. To gain some insight into it, consider the application of our estimation procedure to the quarterly West German real GNP data from January 1960 to December 1990. Denote this time series by $\{G_t\}_{t=1}^{124}$, where $G_t$ is the real GNP in the t-th quarter (the first quarter being from January 1, 1960 to April 1, 1960, the 124-th quarter being from September 1, 1990 to December 1, 1990). Yang & Tschernig (2002) deseasonalized this series by removing the four seasonal means from the series $\log(G_{t+4}/G_t)$, $t = 1, \ldots, 120$. Denote the transformed time series as $\{Y_t\}_{t=1}^{120}$. As the nonparametric alternative to the optimal linear autoregressive model selected by the Bayesian Information Criterion (BIC),

$$Y_t = c_1 Y_{t-2} + c_2 Y_{t-4} + \varepsilon_t, \qquad (1.5)$$

we have fitted the following additive coefficient model (details in subsection 4.2.1),

$$Y_t = \{c_1 + \alpha_{11}(Y_{t-1}) + \alpha_{12}(Y_{t-8})\}\, Y_{t-2} + \{c_2 + \alpha_{21}(Y_{t-1}) + \alpha_{22}(Y_{t-8})\}\, Y_{t-4} + \sigma\varepsilon_t. \qquad (1.6)$$

Using this model, we can efficiently take into account the phenomenon that the effects of $Y_{t-2}$ and $Y_{t-4}$ on $Y_t$ vary with $Y_{t-1}$ and $Y_{t-8}$. The efficiency is evidenced by its superior out-of-sample one-step prediction at each of the last ten quarters. The averaged squared prediction error (ASPE) is 0.000112 for the linear autoregressive fit in (1.5), and 0.000077 to 0.000085 for fits of the additive coefficient model (1.6). Hence the reduction in ASPE is between 31% and 46%; see Table 4.2.3. Figure 4.1 clearly illustrates this improvement in prediction power, in which circles denote the observed values, and crosses (triangles) denote the predictions by the linear autoregressive model (1.5) and the additive coefficient model (1.6), respectively. One can see that the additive coefficient model out-performs the linear autoregressive model in prediction for 8 of the 10 quarters.
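To make the data-generating structure of model (1.4) concrete, the following short sketch (in Python, used here purely for illustration) simulates a sample $\{(Y_i, \mathbf{X}_i, \mathbf{T}_i)\}_{i=1}^{n}$ with $d_1 = d_2 = 2$. The coefficient functions below are invented for this illustration only; they are neither the GNP fit (1.6) nor the simulation designs of Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Invented additive coefficient functions, for illustration only:
# alpha_1(X) = 1 + sin(2*pi*X1) + X2**2,   alpha_2(X) = 0.5 + cos(pi*X2)
def alpha1(x1, x2):
    return 1.0 + np.sin(2 * np.pi * x1) + x2 ** 2

def alpha2(x1, x2):
    return 0.5 + np.cos(np.pi * x2)

X = rng.uniform(-1.0, 1.0, size=(n, 2))   # tuning variables X = (X1, X2)
T = rng.normal(size=(n, 2))               # linear predictor variables T = (T1, T2)
eps = rng.normal(size=n)

# Response follows the additive coefficient structure (1.4) plus noise
Y = alpha1(X[:, 0], X[:, 1]) * T[:, 0] + alpha2(X[:, 0], X[:, 1]) * T[:, 1] + 0.5 * eps
```

Data generated in this way can be fed to either of the two estimation procedures developed in Chapters 2 and 3.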
1.3 Model Identification

For the additive coefficient model, the regression function $m(\mathbf{X}, \mathbf{T})$ in (1.4) needs to be identified. One practical solution is to rewrite it as

$$m(\mathbf{X}, \mathbf{T}) = \sum_{l=1}^{d_1} \alpha_l(\mathbf{X})\, T_l, \qquad \alpha_l(\mathbf{X}) = \alpha_{l0} + \sum_{s=1}^{d_2} \alpha_{ls}(X_s), \qquad \forall\, 1 \le l \le d_1, \qquad (1.7)$$

with the identification conditions

$$E\{w(\mathbf{X})\,\alpha_{ls}(X_s)\} \equiv 0, \qquad l = 1, \ldots, d_1,\ s = 1, \ldots, d_2, \qquad (1.8)$$

for some nonnegative weight function w with $E\{w(\mathbf{X})\} = 1$. The weight function w is introduced so that estimation of the unknown functions $\{\alpha_l(\mathbf{X})\}_{1 \le l \le d_1}$ will be carried out only on the support of w, $\mathrm{supp}(w)$, which is compact according to assumption (A7). This is important, as most of the asymptotic results for nonparametric estimators are developed only for values over compact sets. By having this weight function, the support of the distribution of $\mathbf{X}$ is not required to be compact. This relaxation is very desirable since most time series distributions are not compactly supported. See Yang & Tschernig (2002), p. 1414, for a similar use of the weight function. Note that (1.8) does not impose any restriction on the model, since any regression function $m(\mathbf{X}, \mathbf{T}) = \sum_{l=1}^{d_1} \sum_{s=1}^{d_2} \alpha'_{ls}(X_s)\, T_l$ can be reorganized to satisfy (1.8), by writing

$$\alpha_{l0} = E\Big\{w(\mathbf{X}) \sum_{s=1}^{d_2} \alpha'_{ls}(X_s)\Big\}, \qquad \alpha_{ls}(X_s) = \alpha'_{ls}(X_s) - E\{w(\mathbf{X})\,\alpha'_{ls}(X_s)\}.$$

In addition, for the functions $\{\alpha_{ls}(X_s)\}$ and constants $\{\alpha_{l0}\}$ to be identifiable, we assume (A0): there exists a constant $c > 0$ such that for any set of measurable functions $\{b_{ls}(X_s)\}_{l=1,s=1}^{d_1,d_2}$ that satisfy (1.8) and any set of constants $\{a_l\}_{1 \le l \le d_1}$, the following holds:

$$E\Big[\sum_{l=1}^{d_1} \Big\{a_l + \sum_{s=1}^{d_2} b_{ls}(X_s)\Big\}\, T_l\Big]^2 \qquad (1.9)$$
$$\ge c\,\Big[\sum_{l=1}^{d_1} a_l^2 + \sum_{l=1}^{d_1} \sum_{s=1}^{d_2} E\{b_{ls}^2(X_s)\}\Big]. \qquad (1.10)$$

Lemma 1.3.1. Under assumptions (A0) and (A5) in subsection 2.5.1, the representation in (1.7) subject to (1.8) is unique.

Proof. Suppose that

$$m(\mathbf{X}, \mathbf{T}) = \sum_{l=1}^{d_1} \Big\{\alpha_{l0} + \sum_{s=1}^{d_2} \alpha_{ls}(X_s)\Big\} T_l = \sum_{l=1}^{d_1} \Big\{\alpha'_{l0} + \sum_{s=1}^{d_2} \alpha'_{ls}(X_s)\Big\} T_l$$

with both the set $\{\alpha_{ls}(X_s)\}_{l=1,s=1}^{d_1,d_2}$, $\{\alpha_{l0}\}_{l=1}^{d_1}$ and the set $\{\alpha'_{ls}(X_s)\}_{l=1,s=1}^{d_1,d_2}$, $\{\alpha'_{l0}\}_{l=1}^{d_1}$ satisfying (1.8). Then, upon defining for all s, l

$$b_{ls}(X_s) \equiv \alpha'_{ls}(X_s) - \alpha_{ls}(X_s), \qquad a_l \equiv \alpha'_{l0} - \alpha_{l0},$$

one has $\sum_{l=1}^{d_1} \{a_l + \sum_{s=1}^{d_2} b_{ls}(X_s)\} T_l \equiv 0$. Hence by assumption (A0)

$$0 = E\Big[\sum_{l=1}^{d_1} \Big\{a_l + \sum_{s=1}^{d_2} b_{ls}(X_s)\Big\} T_l\Big]^2 \ge c\,\Big[\sum_{l=1}^{d_1} a_l^2 + \sum_{l=1}^{d_1}\sum_{s=1}^{d_2} E\{b_{ls}^2(X_s)\}\Big],$$

entailing that for all s, l, $a_l \equiv 0$ and $b_{ls}(X_s) \equiv 0$ almost surely. Since assumption (A5) requires that all $X_s$ are continuous random variables, one has $b_{ls}(x) \equiv 0$ for all s, l. ∎

1.4 Data Generating Process

In this dissertation, we consider $\{(Y_i, \mathbf{X}_i, \mathbf{T}_i)\}_{i=1}^{n}$ a sample generated from the regression model (1.1) and (1.2), with its conditional mean function described by (1.7) and the identifiability conditions (1.8), (1.10). Furthermore, its error terms $\{\varepsilon_i\}_{i=1}^{n}$ are assumed to be i.i.d. with $E\varepsilon_i = 0$, $E\varepsilon_i^2 = 1$, and with the additional property that $\varepsilon_i$ is independent of $\{(\mathbf{X}_j, \mathbf{T}_j),\, j \le i\}$, $i = 1, \ldots, n$. With this error structure, the explanatory variable vector $(\mathbf{X}_i, \mathbf{T}_i)$ can contain exogenous variables and/or lagged variables of $Y_i$. If $(\mathbf{X}_i, \mathbf{T}_i)$ contains only the lags of $Y_i$, it is a semi-parametric autoregressive time series model, which is a useful extension of many existing nonlinear time series models such as the exponential autoregressive model (EXPAR), the threshold autoregressive model (TAR), and the functional autoregressive model (FAR), as well as the linear autoregressive model.

To obtain the asymptotics of the estimators proposed in this dissertation, we need some additional properties of the data generating process $\{\zeta_i\}_{i=1}^{\infty} = \{(Y_i, \mathbf{X}_i, \mathbf{T}_i)\}_{i=1}^{\infty}$. First we assume that $\{\zeta_i\}_{i=1}^{\infty}$ is strictly stationary. The following definition of strict stationarity is from Brockwell & Davis (1991).

Definition 1.4.1.
(Strict Stationarity) The series $\{\zeta_t\}_{t=1}^{\infty}$ is said to be strictly stationary if the joint distributions of $(\zeta_{t_1}, \ldots, \zeta_{t_k})$ and $(\zeta_{t_1+h}, \ldots, \zeta_{t_k+h})$ are the same for all positive integers h and $t_1, \ldots, t_k \in \mathbb{Z}^+$.

Second, we assume that $\{\zeta_i\}_{i=1}^{\infty}$ is weakly dependent. Generally speaking, weak dependence allows the observation at time t to be dependent on the observations at other times, say $t + k$, but requires this dependence to diminish to zero as the observations grow far apart, i.e., as $|k| \to \infty$. There are several definitions of weak dependence (or mixing), in which the dependence is measured by different mixing coefficients. Here we quote the definitions of two commonly used kinds of weak dependence, the so-called $\alpha$-mixing and $\beta$-mixing, from Bosq (1998).

Definition 1.4.2. ($\alpha$-mixing) Let $\{\zeta_i\}_{i=1}^{\infty} = \{(Y_i, \mathbf{X}_i, \mathbf{T}_i)\}_{i=1}^{\infty}$ be a strictly stationary vector process. Let $\mathcal{F}_{n+k}^{\infty}$ and $\mathcal{F}_0^{n}$ denote the $\sigma$-algebras generated by $\{\zeta_i, i \ge n+k\}$ and $\{\zeta_0, \ldots, \zeta_n\}$, respectively. Then the $\alpha$-coefficient, which measures the dependence between $\mathcal{F}_{n+k}^{\infty}$ and $\mathcal{F}_0^{n}$, is given as

$$\alpha(k) = \sup_{n \ge 1}\ \sup_{A \in \mathcal{F}_0^{n},\, B \in \mathcal{F}_{n+k}^{\infty}} |P(A)P(B) - P(AB)|.$$

The vector process $\{\zeta_i\}_{i=1}^{\infty}$ is $\alpha$-mixing (or strongly mixing) if its $\alpha$-coefficient $\alpha(k) \to 0$ as $|k| \to \infty$. In particular, the vector process $\{\zeta_i\}_{i=1}^{\infty}$ is geometrically $\alpha$-mixing if its $\alpha$-coefficient goes to 0 geometrically fast, i.e., $\alpha(k) \le c\rho^{k}$ for some constants $c > 0$, $0 < \rho < 1$.

Definition 1.4.3. ($\beta$-mixing) Let $\{\zeta_i\}_{i=1}^{\infty} = \{(Y_i, \mathbf{X}_i, \mathbf{T}_i)\}_{i=1}^{\infty}$ be a strictly stationary vector process. Let $\mathcal{F}_{n+k}^{\infty}$ and $\mathcal{F}_0^{n}$ denote the $\sigma$-algebras generated by $\{\zeta_i, i \ge n+k\}$ and $\{\zeta_0, \ldots, \zeta_n\}$, respectively. Then the $\beta$-coefficient, which measures the dependence between $\mathcal{F}_{n+k}^{\infty}$ and $\mathcal{F}_0^{n}$, is given as

$$\beta(k) = \sup_{n \ge 1}\ E \sup_{A \in \mathcal{F}_{n+k}^{\infty}} |P(A\,|\,\mathcal{F}_0^{n}) - P(A)|.$$

The vector process $\{\zeta_i\}_{i=1}^{\infty}$ is $\beta$-mixing if its $\beta$-coefficient $\beta(k) \to 0$ as $|k| \to \infty$. Similarly, the vector process $\{\zeta_i\}_{i=1}^{\infty}$ is geometrically $\beta$-mixing if its $\beta$-coefficient goes to 0 geometrically fast, i.e., $\beta(k) \le c\rho^{k}$ for some constants $c > 0$, $0 < \rho < 1$.

$\beta$-mixing is stronger than $\alpha$-mixing, because the coefficients satisfy the inequality $\alpha(k) \le \beta(k)/2$. Both $\alpha$- and $\beta$-mixing are weaker than m-dependence, i.e., the requirement that $\sigma\{\zeta_t, t \le T\}$ and $\sigma\{\zeta_t, t \ge T+k\}$ be independent for all $k > m$. Most importantly, the $\alpha$- and $\beta$-mixing classes contain the usual linear autoregressive and moving average (ARMA) models. For more discussion of mixing, see Bosq (1996).

The rest of the dissertation is organized as follows. In Chapter 2, a local polynomial based marginal integration method is proposed to estimate the coefficient functions, and its asymptotic normality is developed. In Chapter 3, a fast polynomial spline method is developed for the estimation; also, a model selection procedure based on a nonparametric Bayesian Information Criterion (BIC) is proposed for inference purposes. In Chapter 4, two simulation studies are given to compare the numerical performance of the two proposed estimation methods and of the model selection procedure. The proposed methods are also successfully applied to three empirical examples.

Chapter 2

Marginal Integration Estimation

2.1 Introduction

The main focus of this dissertation is the estimation of the additive coefficient model (1.7), in which for every $l = 1, \ldots, d_1$ the coefficient of $T_l$ consists of two parts, the unknown constant $\alpha_{l0}$ and the unknown univariate functions $\{\alpha_{ls}\}_{1 \le s \le d_2}$. The first approach we propose is the local polynomial based marginal integration method.
The marginal integration method was first discussed in Linton & Nielsen (1995) in the context of additive models; see also the marginal integration method for generalized additive models in Linton & Härdle (1996). To see how the marginal integration method works in our context, observe that according to the identification condition (1.8), for every $l = 1, \ldots, d_1$ one has

$$\alpha_{l0} = E\{w(\mathbf{X})\,\alpha_l(\mathbf{X})\} = \int w(\mathbf{x})\,\alpha_l(\mathbf{x})\,\varphi(\mathbf{x})\,d\mathbf{x}, \qquad (2.1)$$

and for every point $\mathbf{x} = (x_1, \ldots, x_{d_2})^T$ and every $l = 1, \ldots, d_1$, $s = 1, \ldots, d_2$, one has

$$\alpha_{l0} + \alpha_{ls}(x_s) = E\{w_{-s}(\mathbf{X}_{-s})\,\alpha_l(x_s, \mathbf{X}_{-s})\} = \int w_{-s}(\mathbf{u}_{-s})\,\alpha_l(x_s, \mathbf{u}_{-s})\,\varphi_{-s}(\mathbf{u}_{-s})\,d\mathbf{u}_{-s}, \qquad (2.2)$$

where $\mathbf{u}_{-s} = (u_1, \ldots, u_{s-1}, u_{s+1}, \ldots, u_{d_2})^T$, $(x_s, \mathbf{u}_{-s}) = (u_1, \ldots, u_{s-1}, x_s, u_{s+1}, \ldots, u_{d_2})^T$, the density of $\mathbf{X}$ is $\varphi$, the marginal density of $\mathbf{X}_{-s} = (X_1, \ldots, X_{s-1}, X_{s+1}, \ldots, X_{d_2})^T$ is $\varphi_{-s}$, and $w_{-s}(\mathbf{x}_{-s}) = E\{w(X_s, \mathbf{x}_{-s})\} = \int w(u, \mathbf{x}_{-s})\,d\varphi_s(u)$. In addition, the marginal density of $X_s$ is denoted by $\varphi_s$. Intuitively, one has

$$\alpha_{l0} \approx \frac{\sum_{i=1}^{n} w(\mathbf{X}_i)\,\alpha_l(\mathbf{X}_i)}{\sum_{i=1}^{n} w(\mathbf{X}_i)}, \qquad \alpha_{ls}(x_s) \approx \frac{\sum_{i=1}^{n} w_{-s}(\mathbf{X}_{i,-s})\,\alpha_l(x_s, \mathbf{X}_{i,-s})}{\sum_{i=1}^{n} w_{-s}(\mathbf{X}_{i,-s})} - \alpha_{l0}, \qquad (2.3)$$

and the $d_2$-dimensional functions $\{\alpha_l(\mathbf{x})\}_{l=1}^{d_1}$ in the above equations (2.3) can be replaced by the usual local polynomial estimators. This is the essential idea behind the marginal integration method. To gain more insight into it, we consider the following simple example.

Example: Suppose we have data generated from the simple additive coefficient model

$$Y = \{2 + \sin(X_1) + X_2\}\,T_1 + \{1 + \sin(X_1)\}\,T_2 + \varepsilon,$$

where, independently of each other, $X_1$ and $X_2$ follow $U[-\pi, \pi]$, $T_1$, $T_2$ follow $N(0,1)$, and $\varepsilon$ is the normal noise term. In this case, we take w to be the identity function. Denote $\alpha_1(\mathbf{X}) = 2 + \sin(X_1) + X_2$ and $\alpha_2(\mathbf{X}) = 1 + \sin(X_1)$. Then simple calculation shows that

$$E(\alpha_1(\mathbf{X})) = 2; \qquad E(\alpha_2(\mathbf{X})) = 1;$$
$$E(\alpha_1(x_1, X_2)) = 2 + \sin(x_1); \qquad E(\alpha_2(x_1, X_2)) = 1 + \sin(x_1);$$
$$E(\alpha_1(X_1, x_2)) = 2 + x_2; \qquad E(\alpha_2(X_1, x_2)) = 1.$$

We will discuss the same example in the simulation study.

2.2 Estimators of constants

According to the first approximation equation in (2.3), to estimate the constants $\{\alpha_{l0}\}_{l=1}^{d_1}$, we first estimate the unknown functions $\{\alpha_l(\mathbf{x})\}_{l=1}^{d_1}$ at those data points $\mathbf{X}_i$ that are in the support of the weight function w. More generally, for any fixed $\mathbf{x} \in \mathrm{supp}(w)$, we approximate $\alpha_l(\mathbf{x})$ locally by a constant $a_l$, and estimate $\{\alpha_l(\mathbf{x})\}_{l=1}^{d_1}$ by minimizing the following weighted sum of squares with respect to $\mathbf{a} = (a_1, \ldots, a_{d_1})^T$:

$$\sum_{i=1}^{n}\Big(Y_i - \sum_{l=1}^{d_1} a_l\,T_{il}\Big)^2 K_H(\mathbf{X}_i - \mathbf{x}), \qquad (2.4)$$

where K is a $d_2$-variate kernel function of order $q_1$ (see assumption (A1)), $H = \mathrm{diag}\{h_{0,1}, \ldots, h_{0,d_2}\}$ is a diagonal matrix of positive numbers $h_{0,1}, \ldots, h_{0,d_2}$, called bandwidths, and

$$K_H(\mathbf{x}) = \frac{1}{\prod_{s=1}^{d_2} h_{0,s}}\, K\Big(\frac{x_1}{h_{0,1}}, \ldots, \frac{x_{d_2}}{h_{0,d_2}}\Big).$$

Let $\hat{\mathbf{a}} = (\hat{a}_1, \ldots, \hat{a}_{d_1})^T$ be the solution to the least squares problem in (2.4).
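As a concrete illustration of the weighted least squares problem (2.4), the following sketch computes the local constant fit at a single point $\mathbf{x}$ by solving the kernel-weighted normal equations. The product quartic (biweight) kernel used here mirrors the kernel choice described later in subsection 2.4, but the bandwidth vector is left to the caller; both are illustrative choices rather than part of the formal procedure.

```python
import numpy as np

def quartic_kernel(u):
    """Univariate quartic (biweight) kernel, one simple choice for the factors of K."""
    return np.where(np.abs(u) <= 1, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)

def local_constant_alpha(x, X, T, Y, h):
    """Solve the kernel-weighted least squares (2.4) at a point x.

    X: (n, d2) tuning variables, T: (n, d1) linear variables, Y: (n,) responses,
    h: (d2,) bandwidth vector (the diagonal of H).
    Returns the vector of local coefficient estimates of length d1.
    """
    # Product kernel weights K_H(X_i - x)
    u = (X - x) / h                                    # (n, d2)
    w = quartic_kernel(u).prod(axis=1) / np.prod(h)    # (n,)
    # Weighted least squares of Y on T with weights w; the 1/n factor cancels
    TW = T * w[:, None]
    lhs = T.T @ TW          # sum_i w_i T_i T_i'
    rhs = TW.T @ Y          # sum_i w_i T_i Y_i
    return np.linalg.solve(lhs, rhs)
```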
The rate of 1/\/r_i at which 5110 converges to (1,0 is due to two special features of d)(x). First, the bias of m(x) in estimating m(x) consists of terms of order fig] 1, . . . ,hgl d2’ bounded by l/fi according to assumption (A6) (a), see the deriva- tion of Lemma 2.5.5 about the term P61. Second, the usual variance of 611(x) in estimating m(x) is proportional to n‘1h& } - - - I261 (12’ which gets reduced to 1 / n due 15 to the effect of averaging in (2.6), see the derivation of the term Peg in (2.18) and (2.19). This technique of simultaneously reducing the bias by the use of a higher order kernel and “integrating out the variance ”is the common feature of all marginal integration procedures. 2.3 Estimators of function components In the following, we illustrate a procedure for estimating the functions {013(T3)};11: 1, for any fixed 3 = 1,. . . , d1. Let I3 be a point at which we want to evaluate the func- tions {als(:r3)}1 S l E (II . According to (2.3), we need to estimate {a¢(x)};il___ 1 at those points (:rs,X,-,_s) that lie in the support of 21). For any x E supp (w) , dif- ferently from estimating the constants, we approximate the function m(u) locally at x by m(u) z (I, + ELI/30.01., — 978)], and estimate {o)(x)};11= 1 by minimiz- ing the following weighted sum of squares with respect to a = (a1, . . . ,adl)T, ,3 = (511w-afilp""’fll111""’fidlp)T 2 11 d1 p Z Y,- — 2: {m + 2 sum, _ 3,)2‘} T,- khs(X,-, — .r,)L03(x,-,_, — x_,) i=1 (:1 ‘ 1:1 in which k is a univariate kernel, L is a (d2 — 1)-variate kernel of order q2, as in assumption (A1) in the subsection 2.5.1, the bandwidth matrix C, = diag {911:"293 _ 1193 +1,...,gd2}, 16 and ta Q to I: I I L ELI Us—i ”5+1 312 Ill 98’ 91’ ’93—1 , 934-1, , gd2 ' _ d2,s' 76 s for u_s = (u1,u8 _ bus +1,...,ud2) . Let (”1,3 be the solution of the above least squares problem. Then the components in (3! give the estimators for {a1(x)};11: 1 , which is given by d,(x) )=e, T{sz,( x)z,}“sz,(x)Y, (2.7) where e, is a (12+ 1)d1-dimensional vector with all entries 0 except the l-th entry being 1, = ' T1 . ._ . _ Ws(x) _ d1ag{n khs(X,s x,)LGS(X,,_s x—8)}1$ign and Tf1{(Xls — Is)/hs} TI, "-i {(Xls — xsl/hSIP TIP Z, = TI, {(Xm, — 173)/hs}T£,...,{(an — :rs)/h3}p T: If) {(XIS "' fail/half ® TI = (2.8) lp{(X 113—1. ISM/h llT (ET: in which p(u) = (1,u, . . . , u”)T and (8) denotes the Kronecker product of matrices. Then for each s, we can construct the marginal integration estimators of a), for l = 1,. . . ,dl simultaneously, which are given by . 2'11'lu-s(X.--s)dz(:rs.X.~_.-) . al ($3) = :_ n v ‘ L _ 010: (29) 3 21:1 w_3(X,-,_,) 17 where the term (3110 is the fi-consistent estimator of am in Theorem 2.2.1. The estimator (3136123) is referred to as the p—th order local polynomial estimator, where p is the highest polynomial degree of variables Xis — 2:3, 2' = 1, ..., n, in the definition of the design matrix Z, in (2.8). In particular, the local linear (p = 1) and the local cubic estimators (p = 3) are the most commonly used. Theorem 2.3.1. Under assumptions A1-A7 in the subsection 2. 5.1, one has, for any x = (1'1, ...,md2)T E supp(w), andl=1,...,d1, s = 1, ...,(lg, Vnhs {6113(Is) — 013(238) — h§+lms(1:3)} A N {0, 0124333)} , (2.10) where 17,8(233) and 0123063) are defined in (2.28) and (2.30), respectively. Finally, based on (2.6) and (2.9), one can predict Y given any realization (x, t) of (X, T) by the predictor d1 d2 mm): 2 a,0+ Z and.) t;. 
To appreciate why $\alpha_{ls}$ can be estimated by $\hat{\alpha}_{ls}$ at the rate of $1/\sqrt{n h_s}$, which is the same as the rate of estimating a nonparametric function in the univariate case, we discuss two special features of $\hat{\alpha}_l(\mathbf{x})$ given in (2.7), which are similar to those discussed in subsection 2.2. First, the bias of $\hat{\alpha}_l(\mathbf{x})$ in estimating $\alpha_l(\mathbf{x})$ is of order $h_s^{p+1} + g_{\max}^{q_2}$, where the first term can be understood as the approximation bias caused by locally approximating $\alpha_{ls}$ using a p-th degree polynomial (see the derivation of $P_{s2}$ in Lemma 2.5.9), and the second term can be considered as the approximation bias from locally approximating the functions $\{\alpha_{ls'}\}_{s'\ne s}$ using a constant, which is bounded by $g_{\max}^{q_2}$ since the kernel L is of order $q_2$ (see $P_{s3}$ in Lemma 2.5.9). The order $g_{\max}^{q_2}$ of the second bias term is negligible compared to the rescaling factor of order $1/\sqrt{n h_s}$, according to (A6) (b). Hence only the first bias term appears in the asymptotic distribution formula (2.10). As for the variance of $\hat{\alpha}_l(\mathbf{x})$ in estimating $\alpha_l(\mathbf{x})$, it is proportional to $n^{-1}h_s^{-1}g_1^{-1}\cdots g_{s-1}^{-1}g_{s+1}^{-1}\cdots g_{d_2}^{-1}$, but due to the marginal averaging over the variables $\mathbf{X}_{i,-s}$, the bandwidths $g_1, \ldots, g_{s-1}, g_{s+1}, \ldots, g_{d_2}$ related to $\mathbf{X}_{i,-s}$ are integrated out (see $P_{s1}$ in Lemma 2.5.9). Then the variance of $\hat{\alpha}_{ls}$ is reduced to the order $n^{-1}h_s^{-1}$. If the same bandwidth $h_s$ were used for all variable directions in $\mathbf{X}$, then assumption (A6) (b) would imply that $n^{1 - d_2/(2p+3)} \to \infty$, hence restricting $d_2$ to be less than $2p+3$ for the asymptotic results of Theorem 2.3.1 to hold. That is why we prefer the flexibility of using a set of bandwidths $g_1, \ldots, g_{s-1}, g_{s+1}, \ldots, g_{d_2}$ different from $h_s$.

2.4 Implementation

Practical implementation of the estimators defined in (2.6) and (2.9) requires a rather intelligent choice of the bandwidths $H = \mathrm{diag}\{h_{0,1}, \ldots, h_{0,d_2}\}$, $\{h_s\}_{1\le s\le d_2}$, and $G_s = \mathrm{diag}\{g_1, \ldots, g_{s-1}, g_{s+1}, \ldots, g_{d_2}\}$. In the following, we discuss the choices of such bandwidths.

• Note from Theorem 2.2.1 that the asymptotic distributions of the estimators $\{\hat{\alpha}_{l0}\}_{l=1}^{d_1}$ depend only on the quantity $\sigma_l^2$, not on the bandwidths in H. Hence we have only required that H satisfy the order assumptions in (A6) (a), by taking

$$h_{0,1} = \cdots = h_{0,d_2} = \mathrm{var}(\mathbf{X})\,\log(n)\,n^{-1/(2q_1 - 1)},$$

where $q_1$ is the order of the kernel K, required to be greater than $(d_2+1)/2$, and $\mathrm{var}(\mathbf{X}) = \{\prod_{s=1}^{d_2}\mathrm{var}(X_s)\}^{1/d_2}$, in which $\mathrm{var}(X_s)$ denotes the sample variance of $X_s$, $s = 1, \ldots, d_2$.

• The asymptotic distributions of the estimators $\{\hat{\alpha}_{ls}\}_{1\le l\le d_1,\,1\le s\le d_2}$ depend not only on the functions $\eta_{ls}(x_s)$ and $\sigma_{ls}^2(x_s)$ but also, crucially, on the choice of the bandwidths $h_s$. Moreover, for each $s = 1, \ldots, d_2$, the coefficient functions $\{\alpha_{ls}(x_s)\}_{l=1}^{d_1}$ are estimated simultaneously. So we define the optimal bandwidth $h_{s,opt}$ of $h_s$ as the minimizer of the total asymptotic mean integrated squared error of $\{\hat{\alpha}_{ls}(x_s), l = 1, \ldots, d_1\}$, which is defined as

$$\sum_{l=1}^{d_1}\mathrm{AMISE}\{\hat{\alpha}_{ls}\} = \frac{1}{n h_s}\sum_{l=1}^{d_1}\int\sigma_{ls}^2(x_s)\,dx_s + h_s^{2(p+1)}\sum_{l=1}^{d_1}\int\eta_{ls}^2(x_s)\,dx_s.$$

Then $h_{s,opt}$ is found to be

$$h_{s,opt} = \left[\frac{\sum_{l=1}^{d_1}\int\sigma_{ls}^2(x_s)\,dx_s}{2n(p+1)\sum_{l=1}^{d_1}\int\eta_{ls}^2(x_s)\,dx_s}\right]^{1/(2p+3)},$$

in which $\eta_{ls}(x_s)$ and $\sigma_{ls}^2(x_s)$ are the asymptotic bias and variance of $\hat{\alpha}_{ls}$, as in (2.28) and (2.30). According to the definitions of $\eta_{ls}(x_s)$ and $\sigma_{ls}^2(x_s)$, $\sum_{l=1}^{d_1}\int\eta_{ls}^2(x_s)\,dx_s$ and $\sum_{l=1}^{d_1}\int\sigma_{ls}^2(x_s)\,dx_s$ can be approximated respectively by

$$\sum_{l=1}^{d_1}\int\left\{\frac{\alpha_{ls}^{(p+1)}(x_s)}{(p+1)!}\right\}^2\left[\int u^{p+1}\,\frac{1}{n}\sum_{i=1}^{n} w_{-s}(\mathbf{X}_{i,-s})\,T_{il}\,K_{ls}^{*}(u, x_s, \mathbf{X}_{i,-s}, \mathbf{T}_i)\,du\right]^2 dx_s,$$

$$\sum_{l=1}^{d_1}\frac{1}{n}\sum_{i=1}^{n}\frac{w_{-s}^2(\mathbf{X}_{i,-s})\,\sigma^2(\mathbf{X}_i, \mathbf{T}_i)}{\varphi(\mathbf{X}_i)}\int K_{ls}^{*2}(u, \mathbf{X}_i, \mathbf{T}_i)\,du,$$

where the functions $K_{ls}^{*}$ are defined in (2.29).
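Given plug-in estimates of the two integrated quantities above, the bandwidth $h_{s,opt}$ itself is a one-line computation. In the sketch below, `int_sigma2` and `int_eta2` are hypothetical inputs standing for estimates of $\sum_l \int \sigma_{ls}^2(x_s)\,dx_s$ and $\sum_l \int \eta_{ls}^2(x_s)\,dx_s$, for instance obtained from the pilot estimates described next.

```python
def amise_optimal_bandwidth(int_sigma2, int_eta2, n, p=3):
    """Plug-in h_{s,opt} from the AMISE expression above.

    int_sigma2: estimate of sum_l int sigma_ls^2(x_s) dx_s   (hypothetical input)
    int_eta2:   estimate of sum_l int eta_ls^2(x_s) dx_s     (hypothetical input)
    """
    return (int_sigma2 / (2.0 * n * (p + 1) * int_eta2)) ** (1.0 / (2 * p + 3))
```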
To implement this, one needs to evaluate terms such as a)? +1) (1,), 02(x,t), cp(x), 90(x_3) and Kfs. We propose the following simple estimation methods for those quantities. The resulting bandwidth is denoted as (imm. 20 (p+1) 1. The derivative functions 01, (x3) are estimated by fitting a polynomial regression model of degree p + 2 ([1 [2+2 (Y|=XT Z Z Zazstle l=13=1k=0 Then of?” ($3) is estimated as (p + 1)!alls’p+1 + (p + 2)!a.,rs,p+2:r,. As a by-product, the mean squared error of this model, is used as an estimate of 02(x). 2. Density functions 90(x) and cp(x_s), are estimated as i?" 2: 1h (X d2)¢{x(x 2:3} 1 Xis' - $3, x..3) — £1;H8'¢3h (X—32d2 “1)¢{h(x-8’d2 _1)} with the standard normal density 4) and the rule-of-the—thumb bandwidth h(,X m) =\/var(X )({4/ (m + 2)}1/(m+4)n—1/(m+4) 3. According to the definition in (2.29), the dependence of the functions Kl‘;(u,x,t) on u and t is explicitly known. The only unknown term E (TTT|X = x) contained in S C: l (x) is estimated by fitting matrix poly— nomial regression d2 p E (TTTIX = X) = C + Z ch,k-’L': S=1k=l in which the coefficients c,c3,1c are all x (11 matrices. In this procedure, one simply uses polynomial regression to estimate some of the unknown quantities, which is easy to implement, but may lead the estimated 21 optimal bandwidths to be biased relative to the true optimal bandwidths. The deveIOpment of a more sophisticated bandwidth selection method requires fur- ther investigation. 0 Since Theorem 2.3.1 implies that the asymptotic distributions of the estimators 1 < s< {01,}1- .<_ l <— (1 2do not depend on (G, }f912_ _1, we only specify that the G, sat- isfies the order assumption in (A6) (b) g1 = = 93—1 2 gs+1 = = 51% = 12595;?“12/ log(n ), in which qg, the order of the kernel function L, is required to be greater than (d2 — 1) / 2, and hwp, is the optimal bandwidth obtained using the above procedure. Following the above discussion, the order of the kernels K and L are required to be greater than ((12 + 1)/2 and (d2 — 1) /2 respectively. If the dimension of X equals to 2, kernels K and L can have order 2. We have used the quadratic kernel 15 16 kernels K, L are product kernels. k (u): —(1 — 11.22) 1{1u151}, where 1“,,151} is the indicator function of [—1,1] and the Lastly, the matrix ZTW(x)Z in (2.4) is computed as ZTW(x)Z + n‘lTTT, and the matrix ZSTWs(x)Z, in (2.7) as ZZWS(X)Z, + (nh,,,opt)—l var (X) {/k(u)p(u)p(u)Tdu} ®TTT, following the ridge regression idea of Seifert & Gasser ( 1996). 22 2.5 Assumption and proofs 2.5.1 Assumptions We have listed below some assumptions necessary for proving Theorems 2.2.1 and 2.3.1. Throughout this subsection, we denote by the same letters c,C etc., any positive constants, without distinction in each case. (A1) (A3) The kernel functions k, K and L are symmetric, Lipschitz continuous and compactly supported. The function k is a univariate probability density function, while K is d2 variate, and of order ql, i.e. f K (u)du = 1 while /K(u)u’1“-~u2‘:2du = 0, for 1 S r1+---+ rd2 _<_ q1 — 1. Kernel L is (d2 — 1) variate and of order (12. Denote p" = max(p + 1, q;, (12). Then we assume further that The functions (113(333) have bounded continuous p*-th derivatives for 1 S l 3 d1, 1 S S S d2. The vector process {(32, = {(Y,-,X,,T,-)}f:1 is strictly stationary and geo- metrically fi-mizing. According to (1.1) of Bosq (1998), the strong mixing coefficient a (k) S 13(k)/2, hence a (k) g cpk/2. (2.12) (A4) The error term satisfies: 23 (a) The innovations {5,}:1 are i.i.d with E5,- = 0, E5? 
= 1 and E |eil2+6 < +00 for some 5 > 0. Also, the term 5, is independent of {(Xj,TJ-) ,j S i} for alli> 1. (b) The conditional standard deviation function a (x,t) is bounded and Lip- schitz continuous. (A5) The vector (X, T) has a joint probability density w (x,t). The marginal densi- ties of X, X, and X_, are denoted by (,9, 4,05 and 30‘, respectively. (a) Letting q‘ = max(ql, (12) — 1, we assume that w (x, t) has bounded contin- uous (f-th partial derivatives with respect to x. And the marginal density go is bounded away from zero on the support of the weight function w. (b) Let S (x) = E (TTTIX = x) . We assume there exists a c > 0, such that S (x) 2 cId2 uniformly for x E supp(w). Here Id2is the d2 x d2 identity matrix. (c) The random matrix TTT satisfies the Cramer’s moment condition, i.e, there exists a positive constant c, such that E |T1T11|k g ck‘leE 17177212, and EITlTlll2 S c holds uniformly for k = 3,4, . . ., and 1 S l,l' g ([1. (A6) The bandwidths satisfy: (a) For 11 = diag{h01, . . . ,h0d2} in Theorem 2.2.1, filiflfiax —> 0 and nhpmd oc a . d n for some a > 0, where h,,,,,,, = max{hOl, . . . , 110,12}, hpmd = H22: Ibo“ and or means proportional to. 24 (b) For the bandwidths h,, and G, : diag {91, . . . ,g, _1,g$ +1,...,gd2 _1} of Theorem 2.3.1, hS = O {'n‘l/(2p+3)}, nhsgprod 0C ”0 fOT some 01 > 0 and (Tlhs 1n ”)1/2 93122“ —’ 0,» Where gmax = max {911 ' ' ' 193-17 93+“ ' ’ "9(12 " 1}’ gprod = H392, gs' ' (A7) The weight function w is nonnegative, has compact support with nonempty in- terior, and is Lipschitz continuous on its support. 2.5.2 Technical lemmas The proof of many results in this dissertation makes use of some inequalities about U ~statistics and von Mises’ statistics of dependent variables derived from Yoshihara (1976). In general, let 5,,1 g i S n denote a strictly stationary sequence of random variables with values in Rd and fi-mixing coefficients 13(k), k = 1, 2, ..., and r a fixed positive integer. Let {6” (F)} denote the functionals of the distribution function F 0f 52' 6,, (F) z/gn ($1,...,xm)dF(a:1)~--dF(.rm), where {g,,} are measurable functions symmetric in their 711 arguments such that flgn ($11"'1$m)|2+6 (1F(1‘1)- ' ' dF($m) S (Wu < +00- sup fly" (.r1,....:rm)|2+6dF€_ ,...,€- (.171, ...,;I:,,,) S Mm2 < +00, )6 S, ‘1 "" (2'1,....,2;,,, for some 6 > 0, where SC 2 {(i1,....,i,,.,)|#,(i1,....,i,,,) = c},c = 0,...,m — 1 and for every (i1,....,i,,,),1 3 i1 5 g in, S n, #,(i1,....,i,,,) = the number ofj = 1, ..., m — 1 satisfying ij+1 — i, < r. Clearly, the cardinality of each set SC is less than m—c The von Mises’ differentiable statistic and the U -statistic 0" (F71) ll 3:3 /gn( "-vxm) an(-Tl)"'an(xm) l :11... 3 Q: :5 A f!" ,2». 3 v Un : (,1, Z gn(€i11'"1€im) V allow decompositions as C 9,, (F,,) : 6,, (F) + 2 C”) v56), v56) : f 9,, (22,, 2,) 1’1 1dF,(2:,) -— dF(:r,-)], j=1 m m C U, : 0,(F)+Z(C)U,g>, Up z (22;)! Z / g... (1,1,...,:r,-c)x ign<-~ 6' > 0, then 31/50)? + 50,5)? 3 c (m, 5, r) 22-0 x n. m—l r Alf/(2+6) Z 111,136/(2+6)(k)+ 222-6211meEksi/<2+5>(A~)} (2.13) { k=r+1 c’=O ‘ k=l 26 for some constant C(m,6,r) > 0. In particular, if one has fi(k) _<_ Cgpk,0 < p < 1 then m—l E1456)? + 15115)2 g C (m, 6, r) C2C(p)n‘c {Mg/(2+5) + Z n-C’M5‘fg,‘2+°')}. (2.14) c’=0 Proof. The proof of Lemma 2 in Yoshihara (1976), which dealt with the special case of 9,, E g,r = 1, Mn = M,’, and yielded (2.13), provides an obvious venue of extension to the more general setup. Elementary arguments then establish (2.14) under geometric mixing conditions. 
I For any x E supp (w) , we can write zTW(x 23—12190: x,——x),-TT,T, z’fw,(x : :Zkhs (X,,—.:,) We (x,,_,—x_,) x iléi®w>1 in which, as before, ® denotes the Kronecker product of matrices. Define also the following matrix 5,,(x) : {/ k(u)p(u)p(u)Tdu} ®S(x) (2.15) where S(x) = E(TTT1X = x) as defined in (A5) (b). For any matrix A, |A| denotes the maximum absolute value of all elements in A. Lemma 2.5.2. Let bl = 1n n (th5, + 1/\/71h'prod)b2 , =11] n (h, + ganzax + 1/V 7lhsgpr0d) r 27 and define the compact set B = supp(w) C R612. Under assumptions (A 1)-(A6), as n —2 00, with probability one sup IZTW(x)Z — go(x)S(x)| = 0(b1), xEB sup 123w.z. — 120080001 = 202). x68 Proof: We only give the proof of the second part. Without loss of generality, one may assume B is bounded by the unit hypercube in Rdz. Observe that sup szw.(x)z. — r(x)Sa(x)l x68 3 sup 1E25w,(x)z, — ¢(x)Sa(x)l + sup |Z3W2(x)Zs — E(ZfW.(x)Z.)l. xEB xeB By a Taylor expansion and the fact that the kernel function L is of order (12, we can show that bgl sup IE {ZZW,(X)Z,} — co(x)Sa(x)| —> 0. x68 For the second term, consider a covering of B by val? closed hypercubes d2 B,” = {x : ||x — X) H S v; 1 , where (x, )2); denote the center points of the 21532 closed hypercubes, and ||~|| denotes the supremum norm. Then b,1.up1zzw,(x)z, — E {ZZW.zs}l x63 3 bglsup sup IZZW,(X)Z, -— ZZW,(x,-)Z, j x6 Bjn +b2‘1 sup sup lEZZW,(x)Z, - E {Z:W,(x,~)Z,}| +b§1 sup IZZW,(x,-)Zs — E {ZZW,,(X,~)Z,} . j (2.16) Note that the elements in Z:W,(x)Z, are of the form 1 n X. _ 1‘ k .7; Z khs (‘Xis _ 373) L03 (Xi,—s — x-S) ( l5h 3) 7:17:1' i=1 '8 28 2 1 for k = 0, . . . , 2p, 1 g l,l 3 d1, which is denoted as U,, (x) = — Z?=1Un,,(x). Index n k, l, l, are suppressed for notation convenience. Then the elements in 'ZZW3(X)Z3 — Z:W,(xj)Z,I are w. (x) — U1 (xm s EDI/22(10— U2.2(xj)| 1 n X2 — a: k S r; 2 k2. (X22 — .2.) La. (x.,_. - 2-.) (ii—i) 2:1 ’ Xis — 1"3 k _ khs (X13 — LII-7'3) LGs (X,,_s _ xj,-s) (——;;—J—-) 71171.14 . Under the assumption (A1), there exists a positive constant c, such that C n lUn (X) _ Un (Kill S ———‘2——: Tilzz’l/n S (hsgprod) v71 i=1 (h'sgprod)2 ”n C almost surely, as a result of assumption (A5) (c) entails that E (TTT) < 00. Choosing vn = [(hsgpmd)—3] (note vn —> 00), we have b;1 sup sup IZZW,(X)Z, — Z:W,(x,)Z,| = 0(1) almost surely. Similarly, one can show that b;1 sup sup IE {Z:W,(X)Zs} — E {Z:W,(x,)Z,}| = 0(1). J XE Bjn For the last term in (2.16), note that the elements in Z:W,(xj)Zs — E {ZZW,(x,-)Z,} are of the form 5,12,) = Una.) — E {Um->1 -—— 5: 1E...- }1 = 51,211,142). i=1 By assumptions (A1) and (A5) (c) that TTT satisfies the Cramer’s moment condi— tions, we have, for d = 3, 4, . .. X k d is _ 17's E lUn,2' (leld = E kh, (X23 — 5512) LG, (XL—3 — xj,—s) (f) TuTw |/\ .d , (1 11-2 4 2 (InfllTuml SL1, (111: r,,r,,,1 , 29 where on = C0 (hsgpmdf1 for some CO > 0. Meanwhile E1v;;,(x,)1d = ElUn.,2-(xj)—E{Uni(xj)}ld |/\ ZIEWm-(XJ) 11" ( d)E was.» < «2'. 2EEIT2T..1 r=0 as long as the constant Co is sufficiently large. Applying Theorem 1.4 (Bosq 1998) and inequality (2.12), we have, for any integer q E [1, g], 5 > 0 and each k 2 3 n 521% C [ ]2k/(2k+1) P{|S,, (xj)| > bge} 3 a1 exp (—25m§ + 5cnb25) + a2 (k) 5p (1+ 1 , where 2n 82 = — 2 1 th E U‘ a1 q + < + 25mg +5cnb25) 102 m2: { (19)}2 , k/(2k+2) a2(k) =11n(1+ T) with m, : 11v‘ (x,)11p. By taking q = [n/ (In n)2], the first term a1 exp — q€2b§ < C1 9X13 {—02 (In ”)2} 25mg + 5cnb25 — and the second term ]2k/(2k+l) 212(k) §p[q+1 _<. Caexp{—C4(lnn)2}, where the cfis are strictly positive constants. 
So, for any integer 1 S j 3 vii, we have P{|S,, (xj)| > be} 3 c1 exp {-C2 (ln n)2} + c3 exp (—c4 (ln n)2). Then for any 8 > 0 P{b,'sap1s,,(x ,)g1>2} ZP{b21|S,,(x,-)1>s} J 3 vi [(:1 exp {—02( 1n n) 2} + (:3 exp (—c4 (lnn) 2)] . 30 Since we have taken vn = [(hsgpmd)_3], :P{b§lsup|Sn (x,)1 > a} n J' S Zvfl [cl exp {—c2 (1n n)2} + C3 exp (—c4 (1n n)2)] < +00. n By the Borel-Cantelli lemma, we have, b; 1 sup]. |Sn (xj)| —> 0 almost surely. The rest of the lemma follows immediately. I 2.5.3 Proof of Theorem 2.2.1 By observing that, e, T{ZTW( X,~)Z}_l ZTW(X,-)Ze1r = 6”,, where 61,, equals to 1 if I = l, and equals to 0 otherwise, we have —Zw(x1{a,0—am}= 1+11+111 (2.17) i=1 in which _ 1 n 1 T T T 1 _ n;w(x.)e,{ZW(x 1‘12} ZW(X )E, 1 " _ 11 = ;Zw(x,)ef{zTW(x.)z} 1zTW(x,-)x d1 d2 M— clr+ 2 (11's ( X”) Zey , l’:]_ 3:]. 111 = — m(X.) Z az..(X.-.) M H [V] .52 + 'W :9 le’ j=1,...,n 31 and E = {0 (X1, T1) 51, . . . ,0 (Xn, Tn) 5n}T, the vector of errors. Next, observe that d1 M — 2 C1! + :2 (11's X...) 261! l, = 1d 921 = [Ii d: {ap.(X )s_al' ( Xi8)}Tfl' — ”18 — 1 3:1,...,n Define d1 d2 Rmx.) ——- Z Z {...(x..1_....(x..1}cr,. l, = 13 = 1 J:l,...,n one can rewrite I I as = £210 (x.1 [of {ZTW 1x.) 2}"1 zTW (x.1R1 (x.1] . Now let v1 be the integer such that I)“ 1 + 1: 0(1 (11 + 2 Lmax ). Following immediately from Lemma 2.5.2, one has T _‘S'(x)’1 _ S(x)‘1v1 _ZTW(x1zs-1(x1 " . x ”W" Z} o ‘ 90(X)Z{Id1 o +00 |Dn1| + ll),.2| = 0(l2(11+ 2) w.p.1. 'max Lemma 2.5.4. For fixed V = 1, v1, define n 1 Flu = E w(X.)Q1U(X.)ZTW(X.)E, i=1 F2” 2 % ’UJ(X,‘)Q1U(X.‘)ZTW(Xi)R1(Xi). i=1 Then as n —> +00 |F1ul + 1172.] = 0 (Hf/fl) w.p.1. Proof: For simplicity of notation, we only consider the case of Fly with z/ = 1 Fu(X) = i ; m(x.1s-'(x.1 (Ii—g?) — Higgi‘flq S-'(x.1zTW(x.-1E = P. —P. Let g. = (x., T., 5.1, and define gn(€..£.) = m(an-‘(xa [3&8— E{Z;.V(V,S')Z}] >< S—1(X,‘)I{}1 (Xj — X1) TjO' (Xj,Tj) EJ‘ S(x,1 _ E{ZTW(X,-)Z}] 99(le 9020(1) S-1(Xj)KH (X; — Xj) TiU (Xi, T1) 51'. +w (X.)S'1(Xj) [ 33 Then P1 can be written as the von Mises’ differential statistic 1 72 P1 :fizgn(£i2€j)a i,j=1 which can be decomposed as 1 _ 5 {0..(F1 + 2W,” + V9} in which 0,. (F) = [9,. (114.0(1ng (11.)dFEj (v) = 0. In order to write down the explicit expressions of V5.1),Vlf), let E.- denote taking expectation with respect to the random vector indexed by i and Em denote taking expectation with respect to the random vector indexed by j using the empirical measure, both under the presumption of independence between 5, and 53-. One has an) ZEiEnJgn (Epgj) %;gn1(€1 in which ( — 'w z “I z §£z_) — E {ZTW(Z)Z} .ln.l (£1) _‘ / ( )Ls ( ) [90(Z) (p2(Z) 5—1(Z)KH (Xi _ Z) 9&‘(Z)dZTj0 (X2: T2) 52‘- 1 Clearly 911,1 has mean 0 and variance of order bf. So Vnm = 5 22:1 9n,1 (5].) = op(b1/\/n) Finally for V,.2 ),by Lemma 2.5 .,1 under assumption (A3), one has for some small 6 > 0 2 2 2 E(V£,2))2 S CTZ-z Mum" +1l1n20+0 +AI2+on —1 71,1 34 where M.., MM) and M... are the quantities which satisfy the following inequalities E1E219.121.£2112+“ s M. < +oo l/\ 8.1;? E... 1% 12.5.)?” M... < +00 2 J E.- 19.15.15.112” 3 M... < +00 And observe that 2+6 Ei,j lgn(£i1€j)| |/\ cbilz+6E [w (X.) KH (X,- _ X2) T30 (X11 Tj) 5,12” l/\ cbi+6cpl‘3(){ 1KH<2+'/<'+2"c12+6, and by setting the mixing coefficient ((1+26)(2+6)/ 2+26) cb2+6 p to 0, one also gets Mn =hpmd Similarly, we can show that Mm = cbfwhgfozia). So by taking (5 small, one has E (P12) 2/(2+6) 2/(2+6) S m—z (hprgg26)(2+6)/(2+261brew) + m— 3(b2+6hpr(::5)) + cbf/n S cn-2bfhgrig+26)/(2+26) + (n 3112l1pr0H§+6V 2+6) + cbl/n g cn—lbf. 
Similarly, we can show that E1322 _<_ cn‘lbf. So we have F11 = op(b1/\/n). I Lemma 2.5.5. Define Plnziztfi i=1 :)){8lS Xi)ZTW(X1)R1(Xi)} then P1,. = Op (hum)- — 0,D (71’1”) as n —> oo. Proof: Let K,” (X,T) = eETS"l (X) T, then 1 n m(Xi) P": — K' XbT- K Xt—X, l ”2;,j=1(p(xi) l( J) H( J )X (11 (12 Z Z {01'.(st)-az'.(Xis)}T.-z' l’=lS =1 which is again a von Mises’ statistic. Its 8,, is of the form (11 d2 /::E:; KI, (Z,t) KH (X — Z) Z 2 {0,1,8 ($3) _ 02’. (2,3)} t1! X (=1S=1 ( (,0 (z)1,L' x, t) dzdxdt. After changing of variable 11 = H '1 (x — z), the above becomes d1 d2 / wK;(z,t>K Z Z {..., (z. + ...,...) — 01'. 02.)} t. x ' = 15 = 1 1/1(z + Hu, t) dudzdt = 00,611) m &X where the last step is obtained by Taylor expansion of 0.13 (23 + limits) to ql-th degree and of 1/2 (z + H u,t) t0 (ql — 1)-th degree, which exist according to assumptions (A2) and (A5) (a). By assumption (A1), all the terms with order smaller than hfilax disappear. So the leading term left is of 1233.... order. It is routine to verify that VS.” and V?) are 0100223....) as well. Hence P1,, = 0,, (hgi...) and assumption (A6) (a) entails that 0,, (hglu) = 0,, (71—1/2). I Finally we can finish the proof of Theorem 2.2.1 as follows. Define P2,, = ;11_ i w (Xi) {82118-1 (Xi) ZTW (X1) E} . i=1 90(Xz') Then 1 n m(X‘) q - 2n = }; 900%)!“ (X13113) K11 (Xj — X.)a (Xj,T]-)5j (2.18) i,j=l 36 which again, by a von Mises’ statistic argument, becomes iZ/gg—lKHXTflKMX.—X)¢(X)dx0(xi’Tj)5j+0p( 10%" ) ”2 h prod which, after changing of variable Xj = x + H 11 becomes -1— Z/w (x, — Hu) K; (x,- — Hu, T.) K (11) a (xj,T,-) sjdu + 0,, (1705-1) n j_1 n. hpmd 1 n ,, 10 n = —ZW(X1)K1 (XjaTj)0(vaTj)51 + 0p (“—2 g )+ 0:2 (hrqnlax) n j=1 n hpwd = %Z w (x,) K; (X,, '13-) 0 (x,, T.) e,- + 0,, (71—1/2). (2-19) j=1 1 Now come back to the decomposition of fi 2;, w (Xi) (0&0 — am) as in (2.17), and by Lemmas 2.5.2, 2.5.3, 2.5.4, 2.5.5, one has 1 n . 7—1 2: 100(1) (0:0 — 01(0) i=1 d2 1 n l - = ; ZIU(XJ') Kl (Xj,Tj)0 (Xj,Tj)Ej + Z (113 (st) + 0,, (TL 1/2) . i=1 3 =1 Now define d2 72‘ : “1(le Kf(xj.Tj)0(Xj.Tj)Ej+ Z 013(st) 3 =1 = le + Tjg. Then by the condition that 51- is independent of {(X,, T.)},.Sj, we have E {w (Xj) K; (XJ', Tj) 0’ (Xj,Tj) Sj} = E {11)(XJ‘) K; (Xj,Tj)0'(Xj,Tj)} E(Ej) = 0 and by the identification condition that E {w(X) 23,2: 1 013 (X5)} = 0- SO E (le = 0. Furthermore, by assumption (A3), {Tj} is a stationary ,B-niixing process, with 37 geometric fi-mixing coefficient. By Minkowski’s inequality, for some 6 > 0 , 1/(2+6) 1/(2+6) ElTj|2+6 S{(EI7-j1|2+6) + (EITjQIQ-Hs) } 2+6 By assumptions (A1), (A4), (A5) and (A7), we have E lTj1l2+<5 = lill/U)(X_j)1{;I (XjaTj)U(XjTj) |2+6 (Y) N. = Elw(X1)615090710(Xj.Tj)|2+‘S E |5j|2+6 2+6 d1 5 CE 2 lszl E|51|2+6 z: 1 1/(2+5) g c 2 (513.12”) E|5j|2+6<+00 z: 1 By assumption (A7) that weight function w has compact support and the continuity 2+6 2+6 of the functions 112,013, one has E I732] < +00. So E lle < +00. Next, define +00 +00 0,2 = Z cov (70,71) = 2Zcov (70,73) + var (7'0) j=-OO j=l +00 = 2 Zcov (T0, 73-2) + var (TO) (2.20) j=1 which is finite by Theorem 1.5 of Bosq (1998). Applying the central limit theorem for strongly mixing process (Theorem 1.7 of Bosq 1998), we have 1 11 17.52711 2} N(O,0’l2) . i=1 Theorem 2.2.1 now follows immediately by the assumption (A6) (a) on the bandwidths 1 and the fact that — 221:1 w(X,-) —> 1 as. I 'n 38 2.5.4 Proof of Theorem 2.3.1 Following similarly as in the proof of Theorem 2.2.1, let 112 be an integer which satisfies (15’? =2 0013”). 
Then by Lemma 2.5.2, one has T x -1_~8‘.:‘(x)_S.:1(x)v2 x. x {Z.W.(1z.} m, — W) ”2:21.“ 1 +c2.(1 (2.211 where A“) = 10+ 11d. - and the matrix Q. (x) satisfies suple (x)| = 0 (h§+2) w.p. 1. x68 Also as in the proof of Theorem 2.2.1, by the equation that (.31 {sz. (x..) z.}‘1 23w. (x..) 2.... = 5..., 1’ = 1,. . . ,d. for fixed I = 1,... ,dl and s = 1,. . . ,dg, we have the following decomposition —;ZU’_3(Xi,—3){013(Is)‘_ 01:. (173)} = 5 2:: w-.(X.-,_.1 [..T {zZW. (x....1Z.}”1 23w. (x.,_.1Y — a. (x.1 — am] 2 .1. Z w_. (XL—3) [63m {ZIW. (x.-.) z.}‘1 zg‘ws (x.-.) {Y - M :0 a(v)( d1 (12 + M- 2112—:_L——agzse(d1v + 1') — E C" + 2011181 (Xis’) Zsel' l_ l’_ — 1 s';£s 1 T! + 5 gm... (X.-.) 2.1:“ (X .1+— 71,: X.- .1 (..,. — «.01 (2.221 39 where M is the mean vector, as defined in Theorem 2.2.1. Next define (11 1» flow R.=R.(:r.1=[Z {a..(..X1—§_ja" (X..—.”:c1}T.] . I l = 1 j=1,...,n d1 d2 R2 (X.-.) = Z 2 {01’3’ (st’) ‘ az’s’ (Xis')} sz’ 1 li=lsl7és R3 = :11”; w-8 (Xi.-8) {Z 013' (Xis')} 7 s';és j=l,...,n R4 = 71" {Z w_s (XL—3)} (élo — (1(0), (2.23) 0.111(th )lew—sm xi —)3 )3{(1T Q3( 1731 xi,-s) ZTWs (Tsax i, —s )E} (224) n 1 032 ()Is :31; w—s( Xi,-s) {elrQs (I37 Xi,—s) ZIWS (I31 Xi,-s) R1} 1 1 11 033(33) : 7" Z w—s (XL—3) {81!st ($31 xi,—s) was (3331 Xi,—s) I12 (Xi,—s)} a ' i=1 1 11.10:.) = E 2 UL. (x._.) [1.3“{A(:.:.,x._.)}r sz. (..., x.-.) E] (2.25) i=1 1 " , R..(:c.) = g 2:; (11.. (X.-.) [(:.T {A (..., X.,-.)} ZZW. (..., X.-.) R.] . 1 " ,. R,3(1's) = g 2 10"3 (XL—3) [elf{A($31Xi,—s)} ZZWs (1'3, Xi.—s) R2 (xi,-s)] 1 i=1 1 n 10.3 Xi.-3) _ . Psl(xs) = E Z W {efSa IZZWS ($51Xi,—s) E} , (2.26) 1:1 7 1,—3 1 n w—s (XL—S) _ [352(173) = a Z W {811180 IZZWS ($3, Xi,-s) R1}, £21 31 1,~s 1 n 11)_S(Xi’-3 _ P33($3)= —st_)){elrsalzzws (IS?X1— ‘3) )(XR2 "- 3)}. 40 One can then write (2.22) as T! 1 ; w-3(Xi.-S) {ésl($3) _ asl($s)} i=1 3 3 ”U2 3 = Z Pam) + Z Dsz-(Ts) + Z Z R..(:r.) + R3 + R4. (2.27) The proof of Theorem 2.3.1 is completed by applying assumption (A6) (b) on the bandwidths h. and Gs, and the asymptotic results on each term of the decomposition in (2.27). These asymptotic results are presented in the following lemmas: Lemma 2.5.6. As n —1 +00 ME. = 0. («27.) , MR. = 0. (x/h—s) Lemma 2.5.7. As n —+ +oo sup |Dsl (23.) + 032 (13.) + 033 (1:3)| = o 015”) w.p. 1. x. E supp(ws) Lemma 2.5.8. For any fixed 1" = 1, ...,vs, as n ——> +oo sup IR..1(:L-3)|+|R.,2(;1:3)|+|R,3(;z:s)|=o(b§/\/nh,s) w.p. 1. 2:, E supp(ws) Lemma 2.5.9. As 71 —1 +oo P31 1 n ILL—s (Xj-s) 1 X‘s—IS = _ ___’___K* __3_*_ s, ‘-s, . '—-s .,T. . n 14:; ¢($31Xj,—s) hs ls( hs 117 X]. T3) ‘P—s (X1. )0 (X1 J)5J +0., {(71,115 log 704/2} , P32 (353) = h§+1771.(rs) + 01,015“) 1 P33 (Is) 1' Op( anzax) = 0P {(7th log "VI/2} ’ 41 in which 611 Z Giff” ($5) fup+1E {w-s (X—s) TI’KI‘s (“1 $31 X_s, T)} du _ 1 "13 (x3) _ (p+1)! ll (2.28) with Kl; (u,x, T) — e, Tng (x )q * (u,T) k(u), q'(u,T) = (T,uT, ...,in)T. (2.29) Furthermore WPSIAN{O.UIZS (3:3)} in which 2 _ was (z—S) 0.. (1.1 — / ————¢. (..,.-.) x K122 (11,1133, z_,, t) $33 (L3) 02 (x8, z-.. t) w (:rs, z..., t) dudz_sdt. (2.30) Proof of Lemma 2.5.6. According to Theorem 2.2.1 MR4: {2.11- _ H}a,o—a,0)= —\/?zTO,, (M)=0,,( 1..) Meanwhile, according to the identify condition (1.8) and the central limit theorem for strongly mixing process (Theorem 1.7 of Bosq 1998), we have 1 n \/ nth3 = #11113; ;11Ls(X,-,_s) 2 (1,3! (Xis’) = 71,130,, (V Up) = 0,, ( [1.3) . s'963 These two equations have completed the proof of lemma. I Proof of Lemmas 2.5.7 and 2.5.8. 
We have left these out as they are similar to Lemmas 2.5.3, 2.5.4. I 42 Proof of Lemma 2.5.9. From the definition in (2.26) and using the von Mises’ statistic argument 1 " w_.(x,-_s)1 X-s—rrs Pt == - -----L----1(' -l--- ,5 i—s: ' ‘ nZ.(x..)<.-,-.1h. "( h. ,.,x, T’)X z.j=l LG, (XL—s "‘ xi,—s) 0 (Xj: Tj) 51 1 n w_. (z_ X- —x =: — _—-—s s K,“ "J 3. 31 —8 T nh.;/w(xs.z_.) ”( h. 'x z ’ J)x L08 (X.-. — Z—s) 0 (X). T.) 51' X 90-5 (z_s) dz_3 + 0p {(nhs log n)_1/2} which after changing of variable z-.. = X.-. — Gav, one has _ 1 :n: [LU-3 (Xj‘_s — GsV) * Xj _ 13 P31 — fills 1:1 / 90(1331 Xj.-s " GsV) K15 < [ls .333. X175 Gsv1T3) L(V) —1/2 xw_s (x.-. — G,v)dva (19,13).,- + op{(nh,10gn) } _ 1 " LE4. st-Ts _ 11,18 j=l (p($5, ij‘fii) (8 ha +0p {(nhs log 10‘1”} . x.-. T.) .-.(X.._.)a (X. T.) By assumption (A4) (a) that 5,- is independent of {g}. j S i}, the first term is the average of a sequence of martingale differences. Then by the martingale central limit theorem of Liptser and Shirjaev (1980), the term #711131)“. or Vnhs n w_3 (X ‘,_3) N X—s — 1:3 7th., _ 1 (p (1:33 X1], _s) [3 ——J ’18— 1x81Xj,—31Tj C10—3(Xj,—:5)(7 (Xj, Tj) 51' J: . 43 is asymptotically normal with mean 0 and variance h;l / flfif (33 gsxs,zs,z_s,t) 992—. (z_,.)o2 (z,t) 1/2 (z,t) dzdt = fW—tgéif—Tz—zlj—JK122(u,xs,z_s,t)w2_s (z...)02 (:13. + hsu,z_s,t) x w (1:. + hsu, z_3, t) dmlz_sdt = f—flilflzwflmzflfl) (p3 (z_8)02(:rs,z_3,t) >< (.9 (x..z_.) s s w (.733, z.., t) dudz_sdt + 0(hs) = 0123 (£133) + 0(hs) in which the leading term 0,23 (138) is as defined in (2.30). Hence we have shown that ./n/.,,10,l 5» N {0,0,23(;1:3)} . For the term P32($s) 1 n 10-3 (Xi _s) T _l T P. .=—§ —' S zwsxm ‘20?) n i1 cp(l'.1,){i,-s) {61 O! 8 ( 1 )RI} w_s( X£_s) 1 X] —.’Its = .23.. 1.."(8(—T—XT)L05(XJ'"S”X"”) ”=1 d v 1 X P 05,5)(1133) X v T x 2:1: a..( .1—2: .1 ( .s—m.) .1 v=0 w -8 _—(_1x—s) 1 —' 1133 :/d :E—(r81x—s) ’Tifz‘sl(18 (g)hs ,1‘3,X_3.t LGS (Z_5 X_s) p Z (a...) _ZL .(1 MW}. x v=0 l 1 w (z, t) 80—... (X—s) dzdx-s(1t {1 + 011(1)} - 44 After changing of variable 23 = :rs + hsu and L. = x_s + Gsv, the above equals to (x d1 a(p+11 (1.) -8 _s) ———-K' . s, -.,t L -LL———hp+1 “It. //————— (m... ..(ux x ) (v) 1'2 (pH), 11 x w (:13. + h. u d,x_s + Gsv,t) 97—... (x..) dudvdtdx_3 {1 + 010(1)} hp“ 0(p+1)( w_3 (x-..) _. p+1., _ (p+1)ll 5:1 (aw-M] (W )K1‘;( (1, 2:3,x_3,t)u ’1 x1!) (rat x-s, t1)¢_3 (x_s) dudtdx_s {1 + op(1)} hp+l _ s (ps+l)( " (p+1)!l, 210‘ “in/WAX”) X {/K12(U.T..x-..t) upfltm/J (tIT..x_s) dudt} so... (x—s) (ix—s {1 + 0p(1)} (h:1 = )l le—ll 0(PS’)(1~+13)_3‘/UP+1E{w (X—s)TI’Kl;(u1$81X-81T)}du +01) pl:(h§+l) = h?“ (95s) 771. (Is) + 0p (”3“) with 77,3 (13.) as defined in (2.28). Lastly, the term P33 is 3:1: £11111 {673371sz. (X.-.) R. (X.,_.)} _1 c10(33111)(i,-s) _ _ w—s(xi, -3 _______)__:_l_ ,. st—xs . _ _ _ . _iz'JZl—T—_ (X1113, ' -8)hs K18 ( ha ,Is’xz’_8, T1) LGs (XL—3 X"_s) d1 d2 x Z Z {al’s’ (st’) _ al's' (Xis')}731 l, = 13' aé s (x__.,)1K S _ 173 :{fi//— (W )E-Kl 1352; hs 3:1;81x—sgt) LGS (Z_3 -— X_s) X d1 d2 2 Z {..,..( z w. (.;3.)}.,. 1,1)(z,t)¢_s(x-s)dzdx_sdt{1+op(1)} l = 13 #3 45 which after changing of variable, 23 = x8 + hsu and L, = x_s + Gav, equals to //————-;”( 11:33) w Wm) d1612 Z Z {01?(13' + gS/Us I) — alts; (1:81)} tll X l, = 13 51$ 3 w (335 + hsu, x_3 + Gsv, t) 0 such that (11 (12 d1 d2 llmlli 2 0 Z 0120 + Z [[011ng an = Z 0'10 + Zals it E M 3:] i=1 (:1 5:1 Hence for any m E M, |lm||2 = 0 implies that am = 0,011,, = 0 (2.3., for all 1 g l S d1,1 S s g (12. Consequently the model space M is theoretically identifiable. d2 Proof. 
Let A1(X) = 0110+ Z (m(Xs), and vector A(X) = (A1(X), . . . , Ad1(X))T. 3:1 Under assumption (C2), one has d1 d2 2 “mil: = E Z a.o+za..(x.> T. =E[A(X>TTTTA(X>] [=1 8=1 Cl} (12 2 2 c3E[A(X)TA(X)] =c3E a,o+Za,,(X,) which, by (3.1), equals to di C3 Zazzo+:z:E Zals( X3) Applying Lemma 1 of Stone (1985), one gets 611 d2 llmll2_ > C3 :010 +{( (1" (ll/2ld2 —1:ZEOI23(Xs i=1 s=l where 6 = (1 — cl/c2)1/2 with 0 < c1 5 c2 as specified in assumption (C1). By taking C = 03 {(1 — (5)/2}d2 _ 1, the first part is proved. To show identifiability, notice that d1 d2 for any m = 2 (am + 2 £113) t, E M, with ||m||2 = 0, we have (=1 3:1 (11 d2 d1 d2 0:3[Z 020+Zazsfxs)}7l2 >C[ZOIO+ZZE{013 1:1 3:1 I- 1 s—l which entails that am = 0 and (m(Xs) = 0 as. for all 1 S l g d1,1 S s 3 d2, or m = 0 as. I 51 3.3 Polynomial Spline Estimation 3.3.1 The estimators For each of the tuning variable direction, i.e. s = 1,... ,d2, we introduce a knot sequence k5,, on [0,1], which has Nn interior knots and is denoted as, k3,":{0=x3,0 0, so that for any as 6 A2 fl Cp+1([0,1]), there exists a gs 6 9:2, such that “as — gsll00 S c ”agwlll] hg’“. 00 Proof. According to de Boer (2001), p.149, there exists a constant c > 0 and spline function g; 6 gas, such that Ila, — ggllw _<_ c ”(19“)” hf”. Note next that IE (g;)| g 00 IE (93 - asll + IE (as)l S Ila; - asllx. Thus for gs = 9; - 3(93) E 903, one has Ha. — two. 3 Ha. — ggum + E (9;) 3 2c ||a§P+l>||m hi“. - Lemma 3.3.1 entails that if the functions {(113 (.733) [(1:11’ :21 1n (1.4) are smooth, they are approximated well by centered splines {91, (x3) 6 p2 [6111121. As the definition of (,9? depends on the unknown distribution of X3, the empirically defined space 952‘" = {gs :93 E cps, En(g,s) = 0} is used. Intuitively, function m E M is approximated by some function from the approximate space d2 Mn = mn( X 13) =::91(X) )lz; 91(X X): 0110 + Zgzs(-’Es);gzs E w?" 321 Given a sequence of observations {(Y,,X,~, T0}? _1 generated from the regression model (1.1), the estimator of the unknown regression function m is defined as its ‘best’ approximation from the space Mn, i.e. n . . 2 m = argmmmn 6 Mn Z (Y, — mn (Xi‘T,)} . (3.2) i=1 To be precise, we introduCe the following basis notations. Let Jn = Nn + p and {11233, w“, . . . ,ws Jn} be a set of basis for the polynomial spline space 903, for s = 1,... ,d2. For example, we have used the well-known truncated power basis in 53 the implementation p {1,223, . . . ,xg’, (:53 — 15,1): , . . . , (x5 — :55, N”) } (3.3) + in which (:c)’; = (x+)p. Let W: {1,w1,1,...,wl’Jnyu,wd2,1,...,wd2,Jn}, then {wt1,... thdl} is a set of basis oan, which has dimension R" = d1{d2Jn +1}, and (3.2) amounts to d1 d2 J, fir(x,t)=z ao+ZZa,,,-w,,,(x,) i, (3.4) 1:1 3:1 j=l in which the coefficients {6,0,613‘1J = 1,... ,d1,s = 1,. . .,d2,j = 1, . . .,J,,} minimize the sum of squares 2 71 d1 d2 Jn Z X _' 2 cm + Z: Z cts.jU)s,j (Xis) Ti (35) i=1 (=1 s=l j=l with respect to {c,0,c13,j,l = 1, . . . ,d1,s =1,...,d2,j = 1,... ,Jn}. Note Lemma3.4.5 entails that, with probability approaching one, the sum of squares in (3.5) has a unique minimizer. For t = 1,...,d1,s = 1,...,d2, denote Jn 022(223) = Z agitating). (3.6) 1:1 Then the estimators of {0'10}zd=11 and {als (rs)}f1=11’:1:21 in (1.4) are given as (12 A A * . 0110 = 6104-2 E11013: l=1,~-:d1, s=1 é13(Is) = 0730103) — End}; l = 1,. . . ,d1,s = 1,. . . ,dg. (3.7) where {6115(3123) {1:111:21 are empirically centered to consistently estimate the theoreti- cally centered function components in (1.4). 
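The least squares problem (3.5) is an ordinary linear regression once the basis functions have been evaluated, so the estimator is straightforward to compute. The following sketch (in Python/NumPy, not part of the original implementation) illustrates the construction with the truncated power basis of (3.3) and the empirical centering of (3.7); the knot placement and all function names here are illustrative assumptions.

```python
import numpy as np

def truncated_power_basis(x, knots, p):
    """Evaluate the truncated power basis {x, ..., x^p, (x - k_1)_+^p, ...}
    of degree p at the points x (the constant term is handled separately)."""
    cols = [x ** d for d in range(1, p + 1)]
    cols += [np.clip(x - k, 0.0, None) ** p for k in knots]
    return np.column_stack(cols)              # shape (n, p + N_n)

def fit_acm_spline(Y, X, T, knots, p=1):
    """Least squares fit of the additive coefficient model:
    Y_i = sum_l {c_l0 + sum_s sum_j c_ls,j w_s,j(X_is)} T_il + error."""
    n, d2 = X.shape
    d1 = T.shape[1]
    # basis matrix B_i = [1, w_11(X_i1), ..., w_d2,Jn(X_id2)]
    blocks = [np.ones((n, 1))]
    blocks += [truncated_power_basis(X[:, s], knots[s], p) for s in range(d2)]
    B = np.hstack(blocks)                      # shape (n, 1 + d2 * Jn)
    # full design: every basis column multiplied by every linear variable T_l
    G = np.hstack([B * T[:, [l]] for l in range(d1)])
    coef, *_ = np.linalg.lstsq(G, Y, rcond=None)
    coef = coef.reshape(d1, -1)                # row l: coefficients attached to T_l
    # recover the coefficient functions and center them empirically, as in (3.7)
    Jn = (B.shape[1] - 1) // d2
    alpha0, alpha_fun = np.empty(d1), []
    for l in range(d1):
        a0, parts = coef[l, 0], []
        for s in range(d2):
            cols = slice(1 + s * Jn, 1 + (s + 1) * Jn)
            vals = B[:, cols] @ coef[l, cols]  # alpha*_ls evaluated at the data
            a0 += vals.mean()                  # shift the empirical mean into the constant
            parts.append(vals - vals.mean())   # centered component at the data points
        alpha0[l] = a0
        alpha_fun.append(parts)
    return alpha0, alpha_fun, coef
```

In practice the number of interior knots would be chosen by the AIC rule of subsection 3.3.2 below.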
These estimators are determined by the knot sequences $\{k_{s,n}\}_{s=1}^{d_2}$ and by the polynomial degree $p$, which relates to the smoothness of the regression function. We will refer to an estimator by its degree $p$; for example, a linear spline fit corresponds to $p = 1$.

Theorem 3.3.1. If $\alpha_{ls} \in C^{p+1}([0,1])$ for $l = 1,\dots,d_1$, $s = 1,\dots,d_2$, and under assumptions (C1)-(C5) of subsection 3.4.1, one has
\[ \|\hat m - m\|_2 = O_p\big(h^{p+1} + \sqrt{1/(nh)}\big), \]
and, for $l = 1,\dots,d_1$, $s = 1,\dots,d_2$,
\[ |\hat\alpha_{l0} - \alpha_{l0}| = O_p\big(h^{p+1} + \sqrt{1/(nh)}\big), \qquad \|\hat\alpha_{ls} - \alpha_{ls}\|_2 = O_p\big(h^{p+1} + \sqrt{1/(nh)}\big). \]

It follows from Theorem 3.3.1 that the optimal order of $h$ is $n^{-1/(2p+3)}$, and in that case $\|\hat\alpha_{ls} - \alpha_{ls}\|_2 = O_p\big(n^{-(p+1)/(2p+3)}\big)$, the same convergence rate (in mean square) as that of the marginal integration estimators of Chapter 2. The constants $\{\alpha_{l0}\}_{l=1}^{d_1}$, however, are estimated at the faster parametric rate $1/\sqrt n$ by the marginal integration method.

3.3.2 Knot number selection

An appropriate choice of the knot sequence is important for an efficient implementation of the proposed polynomial spline estimation method. Stone (1986) found that the number of knots is more crucial than their location, so we discuss an approach for selecting the number of knots $N_n$ using the Akaike Information Criterion (AIC). For the knot locations, we use either equally spaced knots (the same distance between any two adjacent knots) or quantile knots (sample quantiles, with the same number of observations between any two adjacent knots).

According to Theorem 3.3.1, the optimal order of $N_n$ is $n^{1/(2p+3)}$. We therefore propose to select the 'optimal' $N_n$, denoted $N_n^{\mathrm{opt}}$, from the set of integers in $[0.5N_r, \min(5N_r, T_b)]$, with $N_r = n^{1/(2p+3)}$ and $T_b = \{n/(4d_1) - 1\}/d_2$, which ensures that the total number of parameters in the least squares estimation is less than $n/4$. To be specific, denote the estimator of the $i$-th response $Y_i$ by $\hat Y_i(N_n) = \hat m(X_i, T_i)$, $i = 1,\dots,n$; here $\hat m$ depends on the knot sequence as given in (3.4). Let $q_n = (1 + d_2 N_n)d_1$ be the total number of parameters in the least squares problem (3.5). Then $N_n^{\mathrm{opt}}$ is the minimizer of the AIC value,
\[ N_n^{\mathrm{opt}} = \mathop{\mathrm{argmin}}_{N_n \in [0.5N_r, \min(5N_r, T_b)]} \mathrm{AIC}(N_n), \tag{3.8} \]
where $\mathrm{AIC}(N_n) = \log(\mathrm{MSE}) + 2q_n/n$ with $\mathrm{MSE} = \sum_{i=1}^n \{Y_i - \hat Y_i(N_n)\}^2/n$.
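The search in (3.8) only requires refitting the least squares problem for each candidate knot number. A small illustrative sketch of this search is given below (it is not the original implementation); it assumes a user-supplied fitting routine, for instance one built on the least squares fit sketched in subsection 3.3.1, and equally spaced interior knots.

```python
import numpy as np

def aic_knot_number(Y, X, T, fit_predict, p=1):
    """Knot-number search (3.8): minimize AIC(N_n) = log(MSE) + 2*q_n/n,
    with q_n = (1 + d2*N_n)*d1.  `fit_predict(Y, X, T, knots, p)` is an
    assumed helper returning the in-sample fitted values."""
    n, d2 = X.shape
    d1 = T.shape[1]
    Nr = n ** (1.0 / (2 * p + 3))
    Tb = (n / (4.0 * d1) - 1.0) / d2
    best_N, best_aic = None, np.inf
    for Nn in range(max(1, int(0.5 * Nr)), int(min(5 * Nr, Tb)) + 1):
        # equally spaced interior knots on the observed range of each X_s
        knots = [np.linspace(X[:, s].min(), X[:, s].max(), Nn + 2)[1:-1]
                 for s in range(d2)]
        mse = np.mean((Y - fit_predict(Y, X, T, knots, p)) ** 2)
        qn = (1 + d2 * Nn) * d1
        aic = np.log(mse) + 2.0 * qn / n
        if aic < best_aic:
            best_N, best_aic = Nn, aic
    return best_N
```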
3.3.3 Model selection

For the full model (1.4), a natural question is whether all the functions $\{\alpha_{ls}(x_s)\}_{l=1,s=1}^{d_1,d_2}$ are significant. A simpler model obtained by setting some of the $\{\alpha_{ls}(x_s)\}_{l=1,s=1}^{d_1,d_2}$ to zero may perform as well as the full model. For $1 \le l \le d_1$, let $S_l$ denote the set of indices of the tuning variables that are significant in the coefficient function of $T_l$, and let $S$ be the collection of the sets $S_l$; the set $S$ is called the model indices. In particular, the model indices of the full model are $S_f = \{S_{f1},\dots,S_{fd_1}\}$, where $S_{fl} = \{1,\dots,d_2\}$, $1 \le l \le d_1$. For two indices $S = \{S_1,\dots,S_{d_1}\}$ and $S' = \{S'_1,\dots,S'_{d_1}\}$, we say that $S \subset S'$ if and only if $S_l \subset S'_l$ for all $1 \le l \le d_1$ and $S'_l \ne S_l$ for some $l$. The goal is to select the smallest sub-model with indices $S \subset S_f$ that carries the same information as the full additive coefficient model. Following Huang & Yang (2004), both AIC and BIC are considered.

For a submodel $m_S$ with indices $S = \{S_1,\dots,S_{d_1}\}$, let $N_{n,S}$ be the number of interior knots used to estimate the model $m_S$ and $J_{n,S} = N_{n,S} + p$. As in the full model estimation, let $\{\tilde c_{l0}, \tilde c_{ls,j},\ 1 \le l \le d_1,\ s \in S_l,\ 1 \le j \le J_{n,S}\}$ be the minimizer of the sum of squares
\[ \sum_{i=1}^n \Big[ Y_i - \sum_{l=1}^{d_1} \Big\{ c_{l0} + \sum_{s \in S_l} \sum_{j=1}^{J_{n,S}} c_{ls,j}\, w_{s,j}(X_{is}) \Big\} T_{il} \Big]^2. \tag{3.9} \]
Define
\[ \hat m_S(\mathbf{x}, \mathbf{t}) = \sum_{l=1}^{d_1} \Big\{ \tilde c_{l0} + \sum_{s \in S_l} \sum_{j=1}^{J_{n,S}} \tilde c_{ls,j}\, w_{s,j}(x_s) \Big\} t_l. \tag{3.10} \]
Denote $\hat Y_i^S = \hat m_S(X_i, T_i)$, $i = 1,\dots,n$, $\mathrm{MSE}_S = \sum_{i=1}^n (Y_i - \hat Y_i^S)^2/n$, and the total number of parameters in (3.9) by $q_S = \sum_{l=1}^{d_1}\{1 + \#(S_l) J_{n,S}\}$. The submodel with the smallest AIC (or BIC) value is then selected, where
\[ \mathrm{AIC}_S = \log(\mathrm{MSE}_S) + 2q_S/n, \qquad \mathrm{BIC}_S = \log(\mathrm{MSE}_S) + \log(n)\, q_S/n. \]
Let $S_0$ and $\hat S$ be the index sets of the true model and of the selected model, respectively. The outcome is called correct fitting if $\hat S = S_0$; overfitting if $S_0 \subset \hat S$; and underfitting if $S_0 \not\subset \hat S$, that is, $S_{0l} \not\subset \hat S_l$ for some $l$. For either overfitting or underfitting we write $\hat S \ne S_0$.

Theorem 3.3.2. Under the same conditions as in Theorem 3.3.1, and with $N_{n,S} \asymp N_{n,S_0} \asymp n^{1/(2p+3)}$, the BIC is consistent: for any $S \ne S_0$, $\lim_{n\to\infty} P(\mathrm{BIC}_S > \mathrm{BIC}_{S_0}) = 1$, hence $\lim_{n\to\infty} P(\hat S = S_0) = 1$.

The condition that $N_{n,S} \asymp N_{n,S_0}$ is essential for the BIC to be consistent. The number of parameters $q_S$ depends both on the number of knots and on the number of additive terms in the model. To ensure BIC consistency, roughly the same (sufficient) number of knots should be used to estimate the various models, so that $q_S$ varies only through the number of function terms. In the implementation, we have used the same number of interior knots $N_n^{\mathrm{opt}}$ (see (3.8), the optimal knot number for the full additive coefficient model) in the estimation of all the submodels.

3.4 Assumption and Proofs

3.4.1 Assumptions and notations

The following assumptions are needed for our theoretical results.

(C1) The tuning variables $X = (X_1,\dots,X_{d_2})^T$ are compactly supported and, without loss of generality, we assume the support is $\mathcal{X} = [0,1]^{d_2}$. The joint density of $X$, denoted $f(\mathbf{x})$, is absolutely continuous and bounded away from zero and infinity; that is, $0 < c_1 \le \min_{\mathbf{x}\in\mathcal{X}} f(\mathbf{x}) \le \max_{\mathbf{x}\in\mathcal{X}} f(\mathbf{x}) \le c_2 < \infty$.

Instead of assuming that $T = (T_1,\dots,T_{d_1})^T$ is bounded, as in Huang, Wu and Zhou (2002), we impose the following (conditional) moment conditions on $T$.

(C2) (i) There exist positive constants $c_3 \le c_4$ such that $c_3 I_{d_1} \le E(TT^T \mid X = \mathbf{x}) \le c_4 I_{d_1}$ uniformly for all $\mathbf{x} \in \mathcal{X}$; here $I_{d_1}$ denotes the $d_1 \times d_1$ identity matrix. (ii) For some sufficiently large $m > 0$, $E|T_l|^m < +\infty$ for $l = 1,\dots,d_1$. (iii) Furthermore, there exist positive constants $c_5, c_6$ such that $c_5 \le E\{(T_l T_{l'})^{2+\delta_0} \mid X = \mathbf{x}\} \le c_6$ a.s. for some $\delta_0 > 0$ and $l, l' = 1,\dots,d_1$.

(C3) The $d_2$ sets of knots, denoted $k_{s,n} = \{0 = x_{s,0} < x_{s,1} < \cdots < x_{s,N_n} < x_{s,N_n+1} = 1\}$, $s = 1,\dots,d_2$, are quasi-uniform; that is, there exists $c_7 > 0$ such that
\[ \max_{s = 1,\dots,d_2} \frac{\max(x_{s,j+1} - x_{s,j},\ j = 0,\dots,N_n)}{\min(x_{s,j+1} - x_{s,j},\ j = 0,\dots,N_n)} \le c_7. \]
Furthermore, the number of interior knots satisfies $N_n \asymp n^{1/(2p+3)}$, where $p$ denotes the degree of the spline space and '$\asymp$' means that both sides have the same order. Let $h = \max_{s = 1,\dots,d_2;\ j = 0,\dots,N_n} |x_{s,j+1} - x_{s,j}|$. Then (C3) implies that $h \asymp n^{-1/(2p+3)}$.

(C4) The vector process $\{\zeta_t\}_{t=-\infty}^{\infty} = \{(Y_t, X_t, T_t)\}_{t=-\infty}^{\infty}$ is strictly stationary and geometrically strongly mixing ($\alpha$-mixing).

(C5) The conditional variance function $\sigma^2(\mathbf{x}, \mathbf{t})$ is continuous and bounded.

Assumptions (C1)-(C5) are common in the nonparametric regression literature. Assumption (C1) is the same as Condition 1, p.693 of Stone (1985), and assumption (C), p.468 of Huang & Yang (2004). Assumption (C2) (i) is a direct extension of condition (ii), p.531 of Huang & Shen (2004). Assumption (C2) (ii) is a direct extension of condition (v), p.531 of Huang & Shen (2004), and of the moment condition A.2 (c), p.952 of Cai, Fan & Yao (2000).
Assumption (C3) is the same as in equation (6), 59 p.249 of Huang (1998a), and also p.59, Huang (1998b). Assumption (C4) is similar to condition (iv), p.531 of Huang & Shen (2004). Assumption (C5) is the same as p.242 of Huang (1998a), and p.465 of Huang & Yang (2004). 3.4.2 Technical lemmas For notational convenience, we introduce, for 1 S l 3 d1, 1 S s 3 d2, d2 a“, = a“, + 23 Bag, a,,(x,) = am.) — Bays. (3.11) s=l . . . . _ d1 - d2 - Then one can rewrite m defined in (3.2) as m — 21:1 CY“) + 25:1 azs(xs) t), We center the aunts) in (3.11) with respect to the theoretical mean, instead of the em- ' ‘ ‘ " ' x " d1 ‘" dlidQ pirical mean as a15(x3) does in (3.7). The terms {alohzv {015(175) l=1,s=1 are not directly observable and serve only as the intermediate step in the proof of Theorem 3.3.1. By observing that, for 1 g l 3 (11,1 3 s S d2 d2 ézsfl‘s) = 513(333) — Enézs, 5110 = 5'10 — 2 3715113, (312) 3:1 the terms {€110}?le {643(3) (dig; and {a,0}f’=1,,{a,,(x,)}f’=1;fj, differ only by a d1: d2 constant. In section 3.4.3, we first prove the consistency of {amfiig , {&13(.’E5)}l=1‘3=1 in Theorem 3.4.1. Then Theorem 3.3.1 follows by showing {E,,Et15}:1=lijl:21 negligible. We use the B-spline basis for the proofs, which is equivalent to the truncated power basis used in implementation, but has nice local properties that each base is supported on a finite number of the knot intervals, see de Boor (2001) for more de- tails. With .1,, = Nn +p, we denote the B-spline basis of ass by b, = {53.0, . . . vbs, Jn}' For the polynomial spline spaces {99352, defined in subsection 3.3.1, define the 60 corresponding subspaces: 992 = {g E 903,E {g (X,)} = 0}. Note that the functions {&,s(.xs),1 S l S (11} E 992. For 1 S s S (12,denote B3 = {8,1, . . . , Bs Jnl’ in which E (113,-) B” = N" (be ‘ m (2,0) ,j =1,...,J,,. (3.13) Note that under assumption (C1), E (bag) > 0. Thus 83,,- is well defined. Now, let B = (1,131,1,. . . , Bl, Jn" . . , Bd2,1, . . . , Bd2 Jan’ in which 1 denotes the identity function defined on x. Define T G = (Bt1,...,Btdl) = B®t, T where t = (t1, . . . , (d1) . Then G is a set of basis for Mn. For easy reference of the T elements in G, we write G = (01,. . . ,GRn) , with R, = d1(d2Jn +1). Using (3.2), one gets an alternative representation of d1 Jn m:m(x,t)=z c,0+Z2Zc;,,B,, 1,, (=1 8:]. j: —l in which 6‘ ,a‘ -,1 S l S d1,1 S s S d2,1 S ' S Jn minimizes the sum of s uares IO [3,] .7 q as in (3.5), with w,,- replaced by BS, Applying Lemma 3.2.1, a function mu 6 m (12 Mn has a unique representation as mn( x ,=t) Z)- dll{azo + 2929(xsl} tug}; 5:1 992. Thus for {a,0}f’:1,, {(3130175)},d___ll’,:1__2_l defined in (3.11), one has filo = fibril, = #1 ZCISst,j(Is),1SlSd1,1SSSdQ. i=1 Theorem 3.4.1. Under assumptions (CU-(C5), if (st E (.Ip+1([0,1]), forl S l S d1,1 S s S d2, one has Hm — ml]2 = 0,, (NH + l/nh), max [(110 — “(0| + max ”(S-l — ms“. = O (th + 1/nh) . 11. (iii) There exists a constant C > 0 such that for any vector a = ((11,.. . ,aJn)T, as 2 J Ja n E: astJ 2 C E: 012-. j=1 j=1 n—>oo, 2 Proof. (i) is trivial. (ii) follows from Theorem 5.4.2 of DeVore & Lorentz (1993), and assumptions (C 1), (C3). To prove (iii), we introduce the auxiliary knots for{ks',,}:1:,. Recall that k3,, is a knot sequence on [0, 1] with Nn interior knots, ksan={0=$s.0 . C 2N,,d, - a] " 3” ds - ‘1 gal '1 + J; E(bsp) '0 ~1an > chZag Nndsj >(.:(1C()/p+1c72a§. I J'= —1 j=1 Lemma 3.4.2. 
There exists a constant C > 0, such that as n —> 00, for any sets of coefficients, {c10,c)3,,-,l=1,...,d1;s =1,...,d2;j=1,...,Jn} d1 d2 .1an Z CIO+ZZCwBsd ‘1 > 0: Czo+dZZCzsj Proof. By Lemma 3.2.1, there exists a constant C1 > 0 such that d1 d2 J, 2 d1 d2 Jn 2 Z cw + 22%an 3 2 01 Z a, + Z 2%ij 1:1 8:1 j=l 2 (=1 s=l j=1 2 Lemma 3.4.1 provides a constant Cg > 0, so that the above is d1 d2 Jn 2 z: + 02:31:..- [=1 8:1 j=l The lemma now follows by taking C = min(C2,1)C1. I Lemma 3.4.3. Let (G, G) be the Rn x Rn matrix defined as (G, G) = ((C,-,G,~)).R7l I.J=l' Define (G, G)" similarly as (G, G), but replace the theoretical inner product with the empirical inner product, and let D = diag( (G, G)). Define Q, —— —sup ID 1("2( ((G,G)n — (G, G))D"1/2|, where sup is taken oven all the elements in the random matrix. Then as n —+ 00, on = o, (\/n‘1h‘1 10g2(n)). 63 Proof. For notation simplicity, we consider the diagonal terms. For any 1 S l S . 1 '2 (12,153 g (11,1313 J, fixed, 1a 5 = (En—E){B§,(X,)7}2} = —§jg,, in ’ n i=1 which 5,- = Bi, (Xm)??? — E {83,- (X,s) TS} Define Ta = T,)I{ for some ITul S n6}, 0 < 6 < 1, and define E, E,- similarly as 5 and 5,, but replace T; with Ty. Then for any 6 > 0, one has 12 ~ 12 ~ P |€l26 32773752 SP Mae 031,3”) +P(€¢€) (3.14) in which Eszlm nmé—l ' P(€#E) S P(Tu 7971-1, for somei: 1,...,n) S ZPUTill Zn“) S i=1 Also note that sup |B.,,-(xs)l= lf{bSJE Mb l< b,50 US$331 0 0. Then by Minkowski’s inequality, for any positive integer k 2 3 - I: ~ k x37?) + {E ..,-(X3711) ] 2k—l [nw‘ckNJf + (an)k] S n26kckNJf. k E |/\ E.- 2""1 [E |/\ On the other hand ~ 2 B2 . (X )r2 5,] 3 l .. Z 2 l — E2 {Big (X3) 7?} - 2 £1 E E 33, (X )T,21{ EIB:J~(X.>TE|2 ml > n} _ 2 - E2 {33.) (Xs)T12} - 2 in which, under assumption (C2) 2 1 6 13132.X r31 13134.,Y,E(— r4+0x) s,]( 8) [{lTll > ”6} — s,]( ) "6,5071 l an S n:‘50 [W3 XS” S ”6150’ 64 where 60 is as in assumption (C2). Furthermore E2 {33, (X3) TE} 3 c3132 {33, (X,)} _<_ c, E |B§, (X,)T,2|2 2 CSE (BS, (X,)|4 2 Clo-5 / (33,, (.1,-3)?de 2 cc1c5Nn. > an. So there exists a constant c > 0, such that .. 2 Thus E lel 2 av, — n6," 0 forallk>2 ’2 2 ” k—2 ~ E ‘£,| S ”261:6st S ((:r1.66N,2,) klE _, Then one can apply Theorem 1.4 of Bosq (1998) to 2;, Eu with the Cramer’s con- stantcr= cn66N2. Thatis, for anye>0, qE [1, 3,] andk>3, one has qe210g2(n)/nh %lZ:—1€i2l €(/L——Oi2h(n) S a1eXp -d 25mg + 56c,(/log2(n)/nh n 2k/(2k+1) k __ was where n e2 log2(n)/nh 2 ~2 a1 = 2—+2 1+ ,m2=E§, q 25mg + 5w, log2(n)/nh k/(2k+1) ~ a2(k) = Mn 1+ 5m” log2(n)/nh Observe that 560, ccn6‘sN3V—10g2hn 10(), by taking 6 <———— —(2———122pp.+3) Then by taking q— - 71/ {c0 log (77)}, one has a1 = O =(3)— - O{log(n) ),} a2(k ) = Nk/(2k+1) 0 n n = o (n3/2). Thus, for n large enough log2(n)/nh 10g2(n)) nh 2 1 _ 650:5;2) } + 0n? exp {— log(p)(,:010g (71)} . S clog(n) exp {- 65 Thus by (3.14), taking on, e, m large enough and use assumption (C4), one has that 0° 10g2(n) 2,1,1” sup1. — men 2 . 7,,— < 2:, {d1d2( (N +2)}2 {clog(n)exp{_i;95gc%l} E T m +cn3/2 exp {— log(p)(50 10g (7‘)} + nrlmsl-ll } < 2:1 {d1d2( (711V '1" 2)}2 71—3 < +00 in which N, x n21): 3. Then the lemma follows from Borel—Cantelli Lemma and Lemma 3.4.1. I Lemma 3.4.4. As n —1 00, one has (Sblr 962) " ($14152) 10g2(n) n ‘ ___ 0 ___ (b, 5 Mill; E M, ll¢1ll2 ll¢2ll2 p nh In particular, there exist constants 0 < c < 1 < C such that, except on an event whose probability tends to zero as n ——> oo, cllmll2 S ||m||2m S C ||m||2 ,Vm E M,. Proof. 
Using the vector notation, one can write (23, = aTG, (b2— — a2 G, for the R, x 1 vectors a1, a2. Rn Rn Rn Rat |<¢11¢2ln - ($1,¢2>| = <2: 01101, 02101) - l j=l j=1 i=1 2': , Rn R11 2 |a1,a.2,||(G',-,G,-)n—(G,~,G,)S| Qn Z lanavl “Gilb llGj||2 i,j=1 i.j=1 R, S QnC Z lalia2jl S anv 333313332- i,j=1 On the other hand by Lemma 3.4.2, Hall: Halli = (a? (G, G> al) (at a 2) > (‘2a12a1ai‘a2 66 Then <¢lf¢2>n _ <é17¢2> < QnC V alraV a§a2 = O (Q ): O 10g2(n) ll¢1|l2l|¢2l|2 _ CVafan/agag p n p nh Lemma 3.4.4 shows that the empirical and theoretical inner products are uniformly close over the approximation space M". This lemma plays the crucial role analogous to that of Lemma 10 in Huang (19983). Our result is new in that (i) the spline basis of Huang (1998a) must be bounded, whereas the term t in basis G makes it possibly bounded; (ii) Huang (1998a)’s setting is i.i.d. with uniform approximation rate of op ( 1), while our setting is a-mixing, broadly applicable to time series data, with approximation rate the sharper Op ( log2(n) / Till). The next lemma follows immediately from Lemmas 3.4.2 and 3.4.4. Lemma 3.4.5. There exists constant C > 0 such that except on an event whose probability tends to zero as n —+ 00 d1 d2 J, 2 d1 d2 J, z z: a 20: aim. l=l s=1 i=1 2," 1:1 3:1 j=l 3.4.3 Proof of mean square consistency Proof of Theorem 3.4.1. We denote Y = (Y1,...,Y,,)T,m ={'m(X1,T1),...,m(X,,,T,,)}T, E = {0(x1, T1)51, . . . , o(Xn, Tn)s,,}T. Note that Y = m + E, and projecting this relationship onto the approximation space Mn, one has m = iTH—E, where m is defined in (3.2), and m, E are the solution to (3.2) 67 with Y,- replaced by 7n(X,-,T,~) and o(X,,T,-)s,~ respectively. Also one can uniquely represent m as m = 2:121, (310 + 23221313) 1,1,6“ 6 «pg. With these notations, one has the error decomposition rii — m = Wt — m +6, where fi— m is the bias term, and E is the variance term. Since for 1 _<_ l S d1,1 S s S d2,a(s E C”+1 ([0,1]), by Lemma 3.3.1, there exist C > O and spline functions 91.. E (02, such that “013 -— nglloo S Clip-H. (3.15) Let mn (x,t) = d__1 1-{010 + 23319155 (x3)} t; 6 Mn. One has (11 d2 d1 d2 “m ‘- mnllz S E Z l|{018(15) gls (1133)}tl||2_ C4 2 2 Hats (1.8)—.913 (38mm 1:1 8:1 (=1 s=1 S (:4(7h.p+1. (3.16) Also ||m - mnllg,” S Ch?“ 3.5. Then by the definition of projection, one has Hm - Wilts... S llm - mn||2,,. S Chp“ which also implies llfii — mn||2m< _ Hm- mllz'n + Ilm— mull.” S Clip“. By Lemma 3.4.4 — — 1 2 “m' — m'nllz S “m — Tnnll2m(1_ Qn) / = 0p (hp-H) . Together with (3.16), one has Hm — fill2 = Op(hp+1) . (3.17) Next we consider the variance term E. For some set of coefficients 68 one can write 5 (x, t) = 2:”; 61,0,- (x, t). By the definition of projection, one has 1 K 511 \ K ; Z?=1G1 (Xi7Ti)U(X37Ti)Ei \ - 1 n Rn a2 ; Zi=1G2(XiaTi)U(XiaTi)5i ((013 Gj)n)j,j’=l . : (are, ) (fizz‘scegxt'radxe'rae. ) Multiplying both sides with the same vector, one gets ( a, i A (12 ((11 (12 (Allin ) (n).fj?=l . l as t { £2.11 G1 (Xini)a(Xi1Ti)€i 1 ) ‘7; 2:1 G2 (Xi,T.-)0 (XiaTi)5i 1 n K '7; Zi=1GRfl(XiaTi)0(XoTi)€i ) 2 Now, by Lemmas 3.4.2, 3.4.4, the LHS is "2qu (1,-G, ” 2 c (1 — on) 2,713; (3;, while 2,7: the RHS is 1 fl ll /‘—\ 3° 3.3, V S /\ Q. T.) | «0 2 ‘1 A MED A :3 l*-‘ :5 2 1/2 Gj (X,,T,)o(X,~,T,-) 5i) } and as a result 2 1 n I|e|l§_ < 00- Q.) Z R:(; :0,- (x.,T.)a 0, one has d1 d2 ”m — m”: 2 0 Z (5110 — 0110)2 + Z “(315 — 013”: (=1 3:1 ThusforlSlSd1,1SsSd2, |am — 0110] = 0,, (W1 + 1/nh) , “a, — as”, = 0,, (W1 + l/nh) . . Proof of Theorem 3.3.1. 
By (3.12), one only needs to show that $|E_n\tilde\alpha_{ls}| = O_p\big(h^{p+1} + \sqrt{1/(nh)}\big)$ for $1 \le l \le d_1$, $1 \le s \le d_2$. Note that $|E_n\tilde\alpha_{ls}| \le |E_n\{\tilde\alpha_{ls} - \alpha_{ls}\}| + |E_n\alpha_{ls}|$, whose first term satisfies
\[ |E_n\{\tilde\alpha_{ls} - \alpha_{ls}\}| \le \|\tilde\alpha_{ls} - \alpha_{ls}\|_{2,n} \le \|\alpha_{ls} - g_{ls}\|_{2,n} + \|\bar\alpha_{ls} - g_{ls}\|_{2,n} + \|\tilde\alpha_{ls} - \bar\alpha_{ls}\|_{2,n}, \]
with $\|\alpha_{ls} - g_{ls}\|_{2,n} \le \|\alpha_{ls} - g_{ls}\|_\infty \le Ch^{p+1}$, and, applying Lemmas 3.2.1 and 3.4.3,
\[ \|\bar\alpha_{ls} - g_{ls}\|_{2,n} \le (1+Q_n)\|\bar\alpha_{ls} - g_{ls}\|_2 \le (1+Q_n)\|\bar m - m_n\|_2 = O_p(h^{p+1}), \]
\[ \|\tilde\alpha_{ls} - \bar\alpha_{ls}\|_{2,n} \le (1+Q_n)\|\tilde\alpha_{ls} - \bar\alpha_{ls}\|_2 \le (1+Q_n)\|\tilde\varepsilon\|_2 = O_p\big(\sqrt{1/(nh)}\big). \]
Thus $|E_n\{\tilde\alpha_{ls} - \alpha_{ls}\}| = O_p\big(h^{p+1} + \sqrt{1/(nh)}\big)$. Since $|E_n\alpha_{ls}| = O_p(1/\sqrt n)$, one now has $|E_n\tilde\alpha_{ls}| = O_p\big(h^{p+1} + \sqrt{1/(nh)}\big)$. Theorem 3.3.1 now follows from the triangle inequality. ∎

3.4.4 Proof of BIC consistency

We denote the model space $\mathcal{M}_S$ corresponding to the submodel $m_S$ as
\[ \mathcal{M}_S = \Big\{ m_S(\mathbf{x},\mathbf{t}) = \sum_{l=1}^{d_1} \alpha_l(\mathbf{x}) t_l;\ \alpha_l(\mathbf{x}) = \alpha_{l0} + \sum_{s\in S_l} \alpha_{ls}(x_s);\ \alpha_{ls} \in \mathcal{H}_s^0 \Big\}, \]
and its spline approximation space $\mathcal{M}_{n,S}$ as
\[ \mathcal{M}_{n,S} = \Big\{ m_n(\mathbf{x},\mathbf{t}) = \sum_{l=1}^{d_1} g_l(\mathbf{x}) t_l;\ g_l(\mathbf{x}) = \alpha_{l0} + \sum_{s\in S_l} g_{ls}(x_s);\ g_{ls} \in \varphi_s^0 \Big\}, \]
where $\mathcal{H}_s^0 = \{\alpha_s : E\{\alpha_s^2(X_s)\} < +\infty,\ E\{\alpha_s(X_s)\} = 0\}$. For $S \subset S_f$, $\mathcal{M}_S \subset \mathcal{M}_{S_f}$ and $\mathcal{M}_{n,S} \subset \mathcal{M}_{n,S_f}$. Let $\mathrm{Proj}_S$ (and $\mathrm{Proj}_{n,S}$) be the orthogonal least squares projection operator onto $\mathcal{M}_S$ (and $\mathcal{M}_{n,S}$) with respect to the empirical inner product. Then $\hat m_S$ defined in (3.10) can be viewed as $\hat m_S = \mathrm{Proj}_{n,S}(Y)$. As a special case of Theorem 3.3.1, one has the following result.

Lemma 3.4.6. Under the same conditions as in Theorem 3.3.1, one has $\|\hat m_S - m_S\|_2 = O_p\big(h_S^{p+1} + \sqrt{N_{n,S}/n}\big)$.

Now denote $c(S,m) = \|\mathrm{Proj}_S\, m - m\|_2$. One has the following: if $m \in \mathcal{M}_{S_0}$, then $\mathrm{Proj}_{S_0} m = m$, so $c(S_0,m) = 0$; if $S$ overfits, then since $m \in \mathcal{M}_{S_0} \subset \mathcal{M}_S$, $c(S,m) = 0$; and if $S$ underfits, $c(S,m) > 0$.

Proof of Theorem 3.3.2. Notice that
\[ \mathrm{BIC}_S - \mathrm{BIC}_{S_0} = \log\Big\{1 + \frac{\mathrm{MSE}_S - \mathrm{MSE}_{S_0}}{\mathrm{MSE}_{S_0}}\Big\} + \frac{q_S - q_{S_0}}{n}\log(n), \]
where $(q_S - q_{S_0})\log(n)/n \asymp n^{-(2p+2)/(2p+3)}\log(n)$ since $q_S - q_{S_0} \asymp n^{1/(2p+3)}$, and
\[ \mathrm{MSE}_{S_0} = \frac1n\sum_{i=1}^n\{Y_i - m(X_i,T_i)\}^2 + \frac1n\sum_{i=1}^n\{\hat m_{S_0}(X_i,T_i) - m(X_i,T_i)\}^2 + o_p(1) = E\{\sigma^2(X,T)\}\{1+o_p(1)\}. \]

Case 1 (Overfitting). Suppose that $S_0 \subset S$ and $S_0 \ne S$. One has
\[ \mathrm{MSE}_{S_0} - \mathrm{MSE}_S = \|\hat m_S - \hat m_{S_0}\|_{2,n}^2 = \|\hat m_S - \hat m_{S_0}\|_2^2\{1+o_p(1)\} \le \big(\|\hat m_S - m\|_2 + \|\hat m_{S_0} - m\|_2\big)^2\{1+o_p(1)\} = O_p\big(n^{-(2p+2)/(2p+3)}\big), \]
which is of smaller order than the BIC penalty difference $n^{-(2p+2)/(2p+3)}\log(n)$. Thus $\lim_{n\to+\infty}\{P(\mathrm{BIC}_S - \mathrm{BIC}_{S_0} > 0)\} = 1$. To see why the assumption $q_S - q_{S_0} \asymp n^{1/(2p+3)}$ is necessary, suppose instead that $q_{S_0} \asymp n^r$ with $r > 1/(2p+3)$. Then it can be shown that $\mathrm{MSE}_S - \mathrm{MSE}_{S_0}$ and the penalty difference $(q_S - q_{S_0})\log(n)/n$ are both of order $n^{r-1}$ up to a $\log(n)$ factor, which leads to $\lim_{n\to+\infty}\{P(\mathrm{BIC}_S - \mathrm{BIC}_{S_0} < 0)\} = 1$ instead.

Case 2 (Underfitting). Similarly as in Huang & Yang (2004), one can show that if $S$ underfits, $\mathrm{MSE}_S - \mathrm{MSE}_{S_0} \ge c^2(S,m) + o_p(1)$. Then
\[ \mathrm{BIC}_S - \mathrm{BIC}_{S_0} \ge \frac{c^2(S,m) + o_p(1)}{E\{\sigma^2(X,T)\}\{1+o_p(1)\}} + o_p(1), \]
which implies that $\lim_{n\to+\infty}\{P(\mathrm{BIC}_S - \mathrm{BIC}_{S_0} > 0)\} = 1$. ∎

Chapter 4

Examples

4.1 Monte Carlo Studies

In this section, we study the finite-sample performance of the proposed methods, which include the two estimation methods (marginal integration and polynomial spline estimation), the bandwidth selection procedure for the integration method, and the model selection procedures based on nonparametric AIC and BIC proposed for the polynomial spline estimation. For these purposes, two Monte Carlo studies are designed: one with an i.i.d. set-up and the other with a nonlinear time series set-up. In both examples, the sample sizes are $n = 100$, $250$ and $500$, and the number of replications is 100.

To assess the performance of the estimators of the function components, we introduce the averaged integrated squared error (AISE). Denoting by $\hat\alpha_{ls}^{(i)}$ the estimate of $\alpha_{ls}$ in the $i$-th replication, we define
\[ \mathrm{ISE}\big(\hat\alpha_{ls}^{(i)}\big) = \frac{1}{n_{\mathrm{grid}}}\sum_{m=1}^{n_{\mathrm{grid}}}\big\{\hat\alpha_{ls}^{(i)}(x_m) - \alpha_{ls}(x_m)\big\}^2 \quad\text{and}\quad \mathrm{AISE}(\hat\alpha_{ls}) = \frac{1}{100}\sum_{i=1}^{100}\mathrm{ISE}\big(\hat\alpha_{ls}^{(i)}\big), \]
where $\{x_m\}_{m=1}^{n_{\mathrm{grid}}}$ are the grid points at which the functions are evaluated.
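As a concrete illustration (not from the original text), the ISE and AISE of the last display can be computed as follows, given the per-replication estimates evaluated on a grid; the grid, seed and the fake estimates are arbitrary and purely illustrative.

```python
import numpy as np

def ise(alpha_hat_grid, alpha_true_grid):
    """ISE of one replication: average squared error over the grid points."""
    return np.mean((alpha_hat_grid - alpha_true_grid) ** 2)

def aise(alpha_hat_all, alpha_true, grid):
    """AISE over replications: alpha_hat_all has shape (n_rep, n_grid),
    alpha_true is the true coefficient function, grid the evaluation points."""
    truth = alpha_true(grid)
    return np.mean([ise(row, truth) for row in alpha_hat_all])

# example: 100 replications of a (hypothetical) estimate of sin(x) on a grid
rng = np.random.default_rng(0)
grid = np.linspace(-0.975 * np.pi, 0.975 * np.pi, 62)
fake_estimates = np.sin(grid) + 0.1 * rng.standard_normal((100, grid.size))
print(aise(fake_estimates, np.sin, grid))
```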
4.1.1 An i.i.d. example

The data are generated from the following model:
\[ Y = \{c_1 + \alpha_{11}(X_1) + \alpha_{12}(X_2)\}\, T_1 + \{c_2 + \alpha_{21}(X_1) + \alpha_{22}(X_2)\}\, T_2 + \varepsilon \tag{4.1} \]
with $c_1 = 2$, $c_2 = 1$, $\alpha_{11}(x_1) = \alpha_{21}(x_1) = \sin(x_1)$, $\alpha_{12}(x_2) = x_2$ and $\alpha_{22}(x_2) = 0$, where $X = (X_1, X_2)^T$ is uniformly distributed on $[-\pi,\pi]\times[-\pi,\pi]$ and $T = (T_1, T_2)^T$ follows the bivariate standard normal distribution. The vectors $X$ and $T$ are generated independently. The error term $\varepsilon$ is a standard normal random variable, independent of $(X, T)$.

First, to assess the performance of the data-driven bandwidth selector of Section 2.4, we plot in Figure 4.2 the kernel estimates of the sampling density of the ratio $\hat h_{1,\mathrm{opt}}/h_{1,\mathrm{opt}}$, where $h_{1,\mathrm{opt}}$ is the optimal bandwidth for estimating $\alpha_{11}$ and $\alpha_{21}$; the solid curve is for $n = 100$, the dotted curve for $n = 250$, and the dot-dashed curve for $n = 500$. One can see that the sampling distribution of the ratio $\hat h_{1,\mathrm{opt}}/h_{1,\mathrm{opt}}$ concentrates around 1 rapidly as the sample size increases. Similar results are obtained for $h_{2,\mathrm{opt}}$, the optimal bandwidth for estimating $\alpha_{12}$ and $\alpha_{22}$; the plot is omitted. The simulation results indicate that the proposed bandwidth selection method is reliable in this instance. The fact that the distribution of the selected bandwidth is skewed toward larger values is due to the use of a simple polynomial function as a plug-in substitute for the true regression function.

Second, we use three different methods, linear spline ($p = 1$), cubic spline ($p = 3$) and marginal integration, to estimate this additive coefficient model. In the polynomial spline estimation, we use equally spaced knots with the number of interior knots chosen by the proposed AIC procedure. For $s = 1, 2$, let $x^i_{s,\min}$ and $x^i_{s,\max}$ denote the smallest and largest observations of the variable $x_s$ in the $i$-th replication. Knots are placed evenly on the interval $[x^i_{s,\min}, x^i_{s,\max}]$, with the number of interior knots $N_n$ selected by AIC as in subsection 3.3.2. To make a fair comparison, the functions $\{\alpha_{ls}\}_{l=1,s=1}^{2,2}$ are estimated on a grid of equally spaced points $x_m$, $m = 1,\dots,n_{\mathrm{grid}}$, with $x_1 = -0.975\pi$, $x_{n_{\mathrm{grid}}} = 0.975\pi$ and $n_{\mathrm{grid}} = 62$.

Tables 4.2 and 4.3 report the means and standard errors (in parentheses) of $\{\hat c_l\}_{l=1,2}$ and the averaged integrated squared errors (AISE) of $\{\hat\alpha_{ls}\}_{l=1,s=1}^{2,2}$ for the three fits. One observes that for all three fits the standard errors of the constant estimators and the AISEs of the function component estimators decrease as the sample size increases; this numerically confirms our asymptotic convergence results. The polynomial spline method also performs overall better than the marginal integration method. The two spline fits ($p = 1, 3$) are generally comparable, but the cubic fit ($p = 3$) is slightly better than the linear fit ($p = 1$) at the larger sample sizes ($n = 250, 500$).

The fitting results are also presented visually in Figures 4.3 and 4.4, which plot the 100 estimated curves using marginal integration and cubic spline fitting, respectively. In both figures, panels (a1-a4) plot the 100 estimated curves of $\alpha_{11}(x_1) = \sin(x_1)$, $\alpha_{12}(x_2) = x_2$, $\alpha_{21}(x_1) = \sin(x_1)$ and $\alpha_{22}(x_2) = 0$ for $n = 100$; (b1-b4) and (c1-c4) are the same as (a1-a4) but for sample sizes $n = 250$ and $n = 500$, respectively. They clearly illustrate the improvement in estimation as the sample size increases for both fittings. Panels (d1-d4) plot the typical estimated curves, whose ISE is the median of the 100 ISEs from the replications; the solid curve is the true curve, the dotted curve the typical estimate for $n = 100$, and the dot-dashed and dashed curves those for $n = 250$ and $n = 500$, respectively. Even for a sample size as small as 100, the fits are satisfactory.
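To make the simulation design concrete, the following sketch (an illustration, not the original code) generates one replication from model (4.1); the estimation step itself would then proceed by any of the three methods compared above.

```python
import numpy as np

def generate_iid_example(n, rng):
    """One replication from model (4.1):
    Y = {2 + sin(X1) + X2} T1 + {1 + sin(X1)} T2 + eps  (alpha_22 = 0)."""
    X = rng.uniform(-np.pi, np.pi, size=(n, 2))
    T = rng.standard_normal(size=(n, 2))        # bivariate standard normal
    eps = rng.standard_normal(n)
    coef1 = 2.0 + np.sin(X[:, 0]) + X[:, 1]     # coefficient of T1
    coef2 = 1.0 + np.sin(X[:, 0])               # coefficient of T2
    Y = coef1 * T[:, 0] + coef2 * T[:, 1] + eps
    return Y, X, T

rng = np.random.default_rng(1)
Y, X, T = generate_iid_example(250, rng)
```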
As mentioned earlier, the polynomial spline method enjoys great computational efficiency. It takes merely 20 seconds or less to run 100 simulations with the polynomial spline method on a Pentium 4 PC, and the computation time is almost the same for the different sample sizes. For the marginal integration method, however, the computational burden increases dramatically with the sample size: it takes about 2 hours to run 100 simulations for sample size $n = 100$, and about 20 hours for $n = 500$.

Next we test the model selection criteria proposed in subsection 3.3.3. For each replication used for estimation, a model selection is also conducted, with polynomial splines of degree $p = 1, 2, 3$ used for estimation. The model selection results are presented in Table 4.4; for each set-up, the first, second and third columns give the numbers of underfitting, correct fitting and overfitting over the 100 simulations. The BIC gives rather accurate selection results (more than 86% correct selection) even when the sample size is as small as 100, and selects the correct model in every replication when the sample size increases to 250 and 500; this supports our assertion that BIC is consistent. Compared with BIC, AIC tends to overfit, but AIC has the advantage that it never underfits.

4.1.2 A nonlinear autoregressive example

In this example, the data are generated from the nonlinear autoregressive time series model
\[ Y_t = \{c_1 + \alpha_{11}(Y_{t-1}) + \alpha_{12}(Y_{t-2})\}\, Y_{t-3} + \{c_2 + \alpha_{21}(Y_{t-1}) + \alpha_{22}(Y_{t-2})\}\, Y_{t-4} + 0.1\,\varepsilon_t \tag{4.2} \]
with $c_1 = 0.2$, $c_2 = -0.3$ and
\[ \alpha_{11}(u) = (0.3+u)\exp(-4u^2), \quad \alpha_{12}(u) = 0.3/\{1+(u-1)^4\}, \quad \alpha_{21}(u) = 0, \quad \alpha_{22}(u) = -(0.6+1.2u)\exp(-4u^2). \]
The $\varepsilon_t$ are i.i.d. standard normal. In each replication, a total of $1000 + n$ observations is generated and only the last $n$ observations are used, to ensure stationarity. An example of a simulated series with $n = 100$ is given in Figure 4.5.

For estimation we have used the linear polynomial spline ($p = 1$) with quantile knot sequences, which were found to work better than equally spaced knots in this example. The coefficient functions $\{\alpha_{ls}\}_{l=1,s=1}^{2,2}$ are estimated on a grid of equally spaced points on the interval $[-1,1]$, with $n_{\mathrm{grid}} = 41$ grid points. Tables 4.5 and 4.6 summarize the estimation results, which include the means and standard errors (in parentheses) of $\{\hat c_l\}_{l=1,2}$ and the averaged integrated squared errors (AISE) of $\{\hat\alpha_{ls}\}_{l=1,s=1}^{2,2}$. As in the i.i.d. example, the estimation improves as the sample size increases, which again supports the asymptotic results.

For visual presentation, the fitting results are shown in Figure 4.6, which plots the 100 estimated curves from the linear spline fits. Panels (a1-a4) plot the 100 estimated curves of $\{\hat\alpha_{ls}\}_{l=1,s=1}^{2,2}$ for $n = 100$; (b1-b4) and (c1-c4) are the same as (a1-a4) but for $n = 250$ and $n = 500$, respectively. Panels (d1-d4) plot the typical estimated curves, whose ISE is the median of the 100 ISEs from the replications; the solid curve is the true curve, the dotted curve the typical estimate for $n = 100$, and the dot-dashed and dashed curves those for $n = 250$ and $n = 500$, respectively. Even for a sample size as small as 100, the fits are satisfactory.
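For clarity, here is an illustrative sketch (not the original code) of how a series from model (4.2) can be generated with the burn-in described above; the initial values are an assumption.

```python
import numpy as np

def generate_ar_example(n, rng, burn_in=1000):
    """Simulate model (4.2): the coefficients of Y_{t-3} and Y_{t-4} are
    additive functions of the tuning variables Y_{t-1} and Y_{t-2}."""
    a11 = lambda u: (0.3 + u) * np.exp(-4.0 * u ** 2)
    a12 = lambda u: 0.3 / (1.0 + (u - 1.0) ** 4)
    a21 = lambda u: 0.0 * u
    a22 = lambda u: -(0.6 + 1.2 * u) * np.exp(-4.0 * u ** 2)
    total = burn_in + n
    y = np.zeros(total)
    y[:4] = 0.1 * rng.standard_normal(4)        # assumed initial values
    for t in range(4, total):
        c1 = 0.2 + a11(y[t - 1]) + a12(y[t - 2])
        c2 = -0.3 + a21(y[t - 1]) + a22(y[t - 2])
        y[t] = c1 * y[t - 3] + c2 * y[t - 4] + 0.1 * rng.standard_normal()
    return y[-n:]                               # keep only the last n values

series = generate_ar_example(100, np.random.default_rng(2))
```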
The model selection results are presented in Table 4.7. AIC is again found to overfit more often than BIC. As the degree of the polynomial spline increases, the selection results improve, except when the sample size is small ($n = 100$). For the sample sizes $n = 250$ and $500$, quite desirable model selection results are obtained.

4.2 Empirical Examples

4.2.1 West German GNP

In this subsection, we discuss in detail the West German real GNP data first mentioned in the introduction. Yang & Tschernig (2002) found that it has an autoregressive structure on lags 4, 2, 8 according to FPE and AIC, and on lags 4, 2 according to BIC, where FPE, AIC and BIC are lag selection criteria for linear time series models as in Brockwell & Davis (1991). On the other hand, lags 4, 2, 8 are selected by the semi-parametric seasonal shift criterion, and lags 4, 1, 7 by the semi-parametric seasonal dummy criterion, for the slightly different series $\{\log(G_{t+4}/G_{t+3})\}$; both semi-parametric criteria are developed in Yang & Tschernig (2002). According to Brockwell & Davis (1991), p.304, the lag selection criteria AIC and FPE for linear time series models are asymptotically efficient but inconsistent, while BIC selects the correct set of variables consistently. Therefore one may fit a linear autoregressive model with either $Y_{t-2}, Y_{t-4}$ or $Y_{t-2}, Y_{t-4}, Y_{t-8}$ as the regressors, with the understanding that the variable $Y_{t-8}$ may be redundant for linear modelling:
\[ \text{Linear AR(24):} \quad Y_t = a_1 Y_{t-2} + a_2 Y_{t-4} + \sigma\varepsilon_t, \tag{4.3} \]
\[ \text{Linear AR(248):} \quad Y_t = b_1 Y_{t-2} + b_2 Y_{t-4} + b_3 Y_{t-8} + \sigma\varepsilon_t. \tag{4.4} \]
From Table 4.2.3, it is clear that, besides being more parsimonious, the linear model (4.3) has a smaller average squared prediction error (ASPE) than model (4.4); thus model (4.3) is the preferred linear autoregressive model.

Moreover, Figures 4.9 and 4.10 show that the scatter plots of $Y_t$ against the two significant linear predictors, $Y_{t-2}$ and $Y_{t-4}$, together with the least squares regression lines, vary significantly across different levels of $Y_{t-1}$ and $Y_{t-3}$. Here the three levels are defined as follows: H, the high level, is the top 33% of the data; L, the low level, is the bottom 33% of the data; and M, the middle level, is the rest of the data. We have therefore fitted the additive coefficient model (1.6) of the introduction. We use the first 110 observations for estimation and perform one-step prediction on the last 10 observations.

When estimating the coefficient functions in model (1.6), we first use marginal integration with local cubic fitting. According to the bandwidth selection method of Section 2.4, bandwidths 0.0031 and 0.0020 are used for estimating the functions of $Y_{t-1}$ and $Y_{t-3}$, respectively. The estimated coefficient functions are plotted in Figure 4.11. We have also generated 500 wild bootstrap (Mammen 1992) samples and obtained 95% point-wise bootstrap confidence intervals for the estimated coefficient functions. From Figure 4.11, one may observe that the estimated functions have clearly non-constant forms. In addition, their 95% confidence intervals cannot completely cover a horizontal line through zero in any of the four plots. This supports the hypothesis that the coefficient functions in (1.6) are significantly different from constants. (Notice that, by the restrictions proposed in (1.8), if a coefficient function is constant, it has to be zero.)
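The point-wise intervals described in the previous paragraph follow the usual wild bootstrap recipe of Mammen (1992): resample residuals with a mean-zero, unit-variance two-point multiplier, refit, and take empirical quantiles of the bootstrap curves. The sketch below outlines these steps under stated assumptions; the `fit_curve` routine and all names are hypothetical, and this is not the dissertation's implementation.

```python
import numpy as np

def wild_bootstrap_bands(Y, X, T, fit_curve, grid, n_boot=500, level=0.95, seed=0):
    """Point-wise wild bootstrap confidence intervals for one estimated
    coefficient function.  `fit_curve(Y, X, T, grid)` is an assumed routine
    returning (fitted values at the data, estimated curve on `grid`)."""
    rng = np.random.default_rng(seed)
    fitted, curve_hat = fit_curve(Y, X, T, grid)
    resid = Y - fitted
    # Mammen's two-point multiplier distribution (mean 0, variance 1)
    a, b = -(np.sqrt(5) - 1) / 2, (np.sqrt(5) + 1) / 2
    p_a = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    curves = np.empty((n_boot, grid.size))
    for k in range(n_boot):
        mult = np.where(rng.random(Y.size) < p_a, a, b)
        Y_star = fitted + resid * mult         # wild bootstrap response
        _, curves[k] = fit_curve(Y_star, X, T, grid)
    lo = np.quantile(curves, (1 - level) / 2, axis=0)
    hi = np.quantile(curves, 1 - (1 - level) / 2, axis=0)
    return curve_hat, lo, hi
```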
To assess the sensitivity of the marginal integration estimation method to the degree of the local polynomial, we have also fitted model (1.6) using local linear estimation (i.e., taking $p = 1$ in $Z_s$). Table 4.2.3 shows that overall the marginal integration estimation of model (1.6) is not sensitive to the order of the local polynomial used. We have also applied the polynomial splines ($p = 1, 3$) to fit the model. The curve estimates are plotted in Figure 4.12, in which solid lines denote the estimates from the linear spline ($p = 1$) and dotted lines the estimates from the cubic spline ($p = 3$); they are generally in agreement with those obtained from marginal integration. For the two linear autoregressive models, the constant coefficients are estimated by maximum likelihood; the estimates are $\hat a_1 = -0.2436$, $\hat a_2 = 0.5622$ and $\hat b_1 = -0.1191$, $\hat b_2 = 0.6458$, $\hat b_3 = 0.0704$.

Table 4.2.3 gives the ASEs (averaged squared estimation errors) and ASPEs (averaged squared prediction errors) of the above six fits. The spline fits are overall better than those from the local polynomial, and all four fits of the additive coefficient model provide significant improvements over the two linear autoregressive models in both estimation and prediction.

4.2.2 Wolf's annual sunspot number

In this example, we consider Wolf's annual sunspot number data for the period 1700-1987. Many authors have analyzed this data set. Tong (1990) used a TAR model with lag 8 as the tuning variable. Chen & Tsay (1993b) and Cai, Fan & Yao (2000) both used a FAR model with lag 3 as the tuning variable. Xia & Li (1999) proposed a single index coefficient model using a linear combination of lag 3 and lag 8 as the tuning variable. Motivated by those models, we propose the additive coefficient model (4.5), in which both lag 3 and lag 8 are used as additive tuning variables:
\[ Y_t = \{c_1 + \alpha_{11}(Y_{t-3}) + \alpha_{12}(Y_{t-8})\}\, Y_{t-1} + \{c_2 + \alpha_{21}(Y_{t-3}) + \alpha_{22}(Y_{t-8})\}\, Y_{t-2} + \{c_3 + \alpha_{31}(Y_{t-3}) + \alpha_{32}(Y_{t-8})\}\, Y_{t-8} + \sigma\varepsilon_t. \tag{4.5} \]
Following the convention in the literature, we use the transformed data $Y_t = 2\{\sqrt{X_t + 1} - 1\}$, where $X_t$ denotes the observed sunspot number in year $t$. We use the first 280 data points (years 1700-1979) to estimate the coefficient functions, and leave out the years 1980-1987 for prediction. We have used marginal integration with local cubic fitting (MI), the linear spline (PS1) and the cubic spline (PS3) to estimate the unknown coefficient functions. In the marginal integration, the bandwidths 6.87 and 6.52 are selected for estimating the functions of $Y_{t-3}$ and of $Y_{t-8}$, respectively. The estimated coefficient functions from the integration fits are plotted in Figure 4.13. The time plot of the fitted values is given in Figure 4.14, in which the solid line represents the fitted values and the circles the observed values. The fittings using splines are similar and are therefore omitted. The averaged squared estimation errors (ASE) using integration, linear spline and cubic spline are 4.18, 3.64 and 3.72, respectively.
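As a small illustration (assumed helper, not the original code), the transformation and the lagged design used by model (4.5) can be set up as follows.

```python
import numpy as np

def sunspot_design(counts):
    """Transform raw sunspot counts X_t to Y_t = 2*(sqrt(X_t + 1) - 1) and
    build the response and regressors of model (4.5):
    linear variables (Y_{t-1}, Y_{t-2}, Y_{t-8}), tuning variables (Y_{t-3}, Y_{t-8})."""
    y = 2.0 * (np.sqrt(np.asarray(counts, dtype=float) + 1.0) - 1.0)
    t = np.arange(8, len(y))                  # first usable index: needs lag 8
    response = y[t]
    T_lin = np.column_stack([y[t - 1], y[t - 2], y[t - 8]])
    X_tune = np.column_stack([y[t - 3], y[t - 8]])
    return response, T_lin, X_tune

# usage: the first 280 transformed observations for estimation, the last 8 years held out
# response, T_lin, X_tune = sunspot_design(raw_counts)
```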
Finally, we use our estimated model to predict the sunspot numbers for 1980-1987, and compare these predictions with those based on the TAR model of Tong (1990), the FAR model of Chen & Tsay (1993), denoted FAR1, and the following two models: the FAR model of Cai, Fan & Yao (2000), denoted FAR2,
\[ Y_t = \alpha_1(Y_{t-3})\, Y_{t-1} + \alpha_2(Y_{t-3})\, Y_{t-2} + \alpha_3(Y_{t-3})\, Y_{t-3} + \alpha_6(Y_{t-3})\, Y_{t-6} + \alpha_8(Y_{t-3})\, Y_{t-8} + \sigma\varepsilon_t, \tag{4.6} \]
and the single index coefficient model of Xia & Li (1999), denoted SIND,
\[ Y_t = \phi_0\{g_4(\theta, Y_{t-3}, Y_{t-8})\} + \phi_1\{g_4(\theta, Y_{t-3}, Y_{t-8})\}\, Y_{t-1} + \phi_2\{g_4(\theta, Y_{t-3}, Y_{t-8})\}\, Y_{t-2} + \phi_3\{g_4(\theta, Y_{t-3}, Y_{t-8})\}\, Y_{t-3} + \phi_4\{g_4(\theta, Y_{t-3}, Y_{t-8})\}\, Y_{t-8} + \sigma\varepsilon_t, \tag{4.7} \]
in which $g_4(\theta, Y_{t-3}, Y_{t-8}) = \cos(\theta)\, Y_{t-3} + \sin(\theta)\, Y_{t-8}$.

According to Condition (A.1) b, p.952 of Cai, Fan & Yao (2000), the conditional density of $Y_{t-3}$ given the variables $(Y_{t-1}, Y_{t-2}, Y_{t-3}, Y_{t-6}, Y_{t-8})$ should be bounded. It is clear, however, that $Y_{t-3}$ is completely predictable from $(Y_{t-1}, Y_{t-2}, Y_{t-3}, Y_{t-6}, Y_{t-8})$, and hence the distribution of $Y_{t-3}$ given these variables is a probability mass at one point, not a continuous distribution with any kind of density. Thus the use of model (4.6) has not been theoretically justified. Similarly, model (4.7) is not theoretically justified either, since according to Condition C5, p.1277 of Xia & Li (1999), the conditional density of $g_4(\theta, Y_{t-3}, Y_{t-8})$ given the variables $(Y_{t-1}, Y_{t-2}, Y_{t-3}, Y_{t-8}, Y_t)$ should be bounded, whereas again the distribution of $g_4(\theta, Y_{t-3}, Y_{t-8})$ given these variables is a point mass.

In addition, we illustrate that model (4.7) is unidentifiable. For any set of functions $\{\phi_0, \dots, \phi_4\}$ that satisfy (4.7), one can always pick an arbitrary nonzero function $f(u)$ and define
\[ \tilde\phi_0(u) = \phi_0(u) + u f(u), \quad \tilde\phi_1(u) = \phi_1(u), \quad \tilde\phi_2(u) = \phi_2(u), \quad \tilde\phi_3(u) = \phi_3(u) - \cos(\theta) f(u), \]