STATISTICAL INFERENCE FOR FUNCTIONAL AND LONGITUDINAL DATA

By

Guanqun Cao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Statistics

2012

ABSTRACT

STATISTICAL INFERENCE FOR FUNCTIONAL AND LONGITUDINAL DATA

By

Guanqun Cao

Advances in modern technology have facilitated the collection of high-dimensional functional data and low-dimensional longitudinal data. For these data, it is often of interest to describe the key signals of the data (mean functions, covariance functions, derivative functions, etc.). Functional data analysis (FDA) and longitudinal data analysis (LDA) techniques have played a central role in the analysis of such data. The primary goal of this dissertation is to provide novel statistical inference methods for FDA and LDA.

In Chapter 1, we describe the structure (design, notation, etc.) of functional data and introduce the spline smoothing technique as a tool for analyzing these data. Longitudinal data analysis with a missing-not-at-random response is also discussed.

In Chapter 2, a polynomial spline estimator is proposed for the mean function of dense functional data, together with a simultaneous confidence band which is asymptotically correct. In addition, the spline estimator and its accompanying confidence band enjoy semiparametric efficiency in the sense that they are asymptotically the same as if all random trajectories were observed entirely and without errors. The confidence band is also extended to the difference of the mean functions of two populations of functional data. Simulation experiments provide strong evidence corroborating the asymptotic theory, and computation is efficient. The confidence band procedure is illustrated by analyzing near infrared spectroscopy data.

A nonparametric estimator of the covariance function of dense functional data, based on tensor product B-splines, is considered in Chapter 3. We develop both local and global asymptotic distributions for the proposed estimator, and show that our estimator is as efficient as an "oracle" estimator. Monte Carlo simulation experiments and two real data examples are also provided to illustrate the proposed method.

In Chapter 4, we develop a new procedure to construct simultaneous confidence bands for derivatives of mean curves in FDA. The technique involves polynomial splines that approximate the derivatives of the mean functions, the covariance functions and the associated eigenfunctions. The confidence band procedure is illustrated through numerical simulation studies and a real-life example.

In Chapter 5, we consider data generated from a longitudinal study with potentially non-random missing data. For these data, a joint model for the missing-data process and the outcome process is found to be at best weakly identifiable. Due to these identifiability concerns, tests concerning the parameters of interest cannot rely on conventional asymptotic theory, and it may be unclear how to assess statistical significance. We extend the literature by developing a testing procedure that can be used to evaluate hypotheses under non- and weakly identifiable semiparametric models. We derive the limiting distribution of this statistic and propose theoretically justified resampling approaches to approximate its asymptotic distribution. The methodology's practical utility is illustrated in simulations and an analysis of quality-of-life outcomes from a longitudinal study on breast cancer.
To my beloved parents

ACKNOWLEDGMENTS

Foremost, I would like to express my sincere gratitude to my advisor, Professor Lijian Yang, for the continuous support of my Ph.D. study and research, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the research and the writing of this thesis. I could not have imagined having a better mentor for my Ph.D. study.

I would like to express my gratitude to my co-advisor, Professor David Todem, for offering invaluable assistance, support and guidance. He provided me a fabulous opportunity to study biostatistics topics. Moreover, I could not have finished my job applications promptly without his generous help.

Besides my advisors, I would like to thank the rest of my thesis committee, Professors Lyudmila Sakhanenko and Lifeng Wang, for their hard questions, encouragement and insightful comments. My sincere thanks also go to Professor Jiaguo Qi of the Center for Global Change and Earth Observations at MSU, for offering me collaboration opportunities in his group and leading me to work on diverse, exciting projects. Thanks also go to the entire faculty and staff of the Department of Statistics and Probability, who taught and helped me during my study at MSU. My special thanks go to Professor James Stapleton, for listening to my presentation rehearsals many times and for helpful suggestions that led to successful presentations.

Thanks to the Graduate School and the Department of Statistics and Probability, which provided me with the Dissertation Continuation Fellowship and the Dissertation Completion Fellowship for working on this dissertation. This dissertation was also supported in part by NSF awards DMS-0706518 and DMS-1007594, and NCI/NIH K-award 1K01 CA131259.

I also would like to thank my academic family members, Dr. Lan Xue, Dr. Jing Wang, Dr. Li Wang, Dr. Rong Liu, Dr. Qiongxia Song and Shuzhuan Zheng, for the stimulating discussions, for the sleepless nights we spent working together before submitting papers, and for all the fun we have had in the last five years.

Last but not least, I would like to thank my beloved parents, Jinxiang Cao and Lazhen Sun, for giving birth to me in the first place and supporting me spiritually throughout my life.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1  Introduction
  1.1  Functional data analysis
    1.1.1  The basics
    1.1.2  Polynomial spline
    1.1.3  The near infrared spectroscopy data
    1.1.4  Speech recognition data
  1.2  Longitudinal data analysis
    1.2.1  Problems
    1.2.2  The breast cancer data
Chapter 2  Confidence Bands for Mean Functions
  2.1  Introduction
  2.2  Main results
  2.3  Error decomposition
  2.4  Implementation
  2.5  Simulation
  2.6  Empirical example

Chapter 3  Confidence Envelopes for Covariance Functions
  3.1  Introduction
  3.2  Spline covariance estimation
    3.2.1  Data structure and model assumptions
    3.2.2  Spline covariance estimator
  3.3  Asymptotic theory and simultaneous confidence envelopes
    3.3.1  Assumptions and the oracle property
    3.3.2  Asymptotic confidence envelopes
    3.3.3  Extension to two-sample test problems
  3.4  Implementation
  3.5  Simulation
  3.6  Empirical examples
    3.6.1  Tecator near infrared spectra data
    3.6.2  Speech recognition data
  3.7  Summary

Chapter 4  Spline Confidence Bands for Functional Derivatives
  4.1  Introduction
  4.2  Models and spline estimators
    4.2.1  Models
    4.2.2  Spline estimators
    4.2.3  Convergence rate
  4.3  Confidence bands
    4.3.1  Asymptotic confidence bands
    4.3.2  Implementation
  4.4  Numerical studies
    4.4.1  Simulated examples
    4.4.2  Tecator data

Chapter 5  Testing Hypotheses Under Weak Identifiability
  5.1  Introduction
  5.2  The method
    5.2.1  The general framework
    5.2.2  Large sample properties of $\hat\theta(\beta)$
    5.2.3  Global sensitivity testing
  5.3  Numerical studies
    5.3.1  Pseudo-likelihood models with missing data
    5.3.2  Real data analysis
    5.3.3  Simulation study
  5.4  Discussion

Bibliography
LIST OF TABLES

Table 2.1  Coverage frequencies from 500 replications using linear spline (2.14) with $p = 2$, $N_m = [cn^{1/(2p)}\log(n)]$ and $\sigma = 0.5$
Table 2.2  Coverage frequencies from 500 replications using linear spline (2.14) with $p = 2$, $N_m = [cn^{1/(2p)}\log(n)]$ and $\sigma = 0.3$
Table 2.3  Coverage frequencies from 500 replications using cubic spline (2.14) with $p = 4$, $N_m = [cn^{1/(2p)}\log(n)]$ and $\sigma = 0.5$
Table 2.4  Coverage frequencies from 500 replications using cubic spline (2.14) with $p = 4$, $N_m = [cn^{1/(2p)}\log(n)]$ and $\sigma = 0.3$
Table 2.5  Coverage frequencies from 500 replications using the least squares Bonferroni band and the least squares bootstrap band
Table 2.6  Five number summary of ratios of confidence band widths
Table 2.7  Empirical power and type I error of two-sample test using cubic spline
Table 3.1  Simulation results: uniform coverage rates from 500 replications
Table 4.1  Uniform coverage rates using cubic spline in Model I
Table 4.2  Uniform coverage rates using cubic spline in Model II
Table 5.1  P-values for evaluating the Treatment and Time effects using data from Study VI of the IBCSG trials
Table 5.2  Empirical type I error and power of the infimum test and Wald test
Table 5.3  Empirical power of the infimum test to detect the interaction effect $\alpha_3 = 1$ for two ranges $\Xi$ of $\beta$ with true value $\beta_0 = -1$
Table 5.4  Empirical power of the infimum test to detect the interaction effect $\alpha_3 = 0.7$ for two sets $\Xi$ of $\beta$ with true value $\beta_0 = -1$

LIST OF FIGURES

Figure 2.1  Plots of the linear spline estimator for simulated data and 95% confidence bands for $m(x)$
Figure 2.2  Plots of the cubic spline estimator for simulated data and 95% confidence bands for $m(x)$
Figure 2.3  Left: Plot of Tecator data. Right: Sample curves for the Tecator data. Each class has 10 sample curves
Figure 2.4  Plots of the fitted linear and cubic spline regressions of $m_1(x) - m_2(x)$ and confidence bands
Figure 3.1  Plots of the true covariance functions of the simulated data and their 95% confidence envelopes: $n = 200$, $N = 100$, $\sigma = 0.5$, $p_1 = p_2 = 2$
Figure 3.2  Plots of the true covariance functions of the simulated data and their 95% confidence envelopes: $n = 200$, $N = 100$, $\sigma = 0.5$, $p_1 = p_2 = 4$
Figure 3.3  Plot of the Tecator data
Figure 3.4  Plots of the cubic tensor spline estimator for the Tecator data
Figure 3.5  Plots of hypothesis test (3.12) results
Figure 3.6  Plots of the speech recognition data
Figure 3.7  Plots of tensor spline estimators for speech recognition data sets
Figure 3.8  Plots of hypothesis test (3.13) results
Figure 4.1  Plots of the cubic spline estimators and 99% confidence bands of $m^{(1)}(t)$ in Model I; $n = 30$, $N = 30$
Figure 4.2  Plots of the cubic spline estimators and 99% confidence bands of $m^{(1)}(t)$ in Model I; $n = 30$, $N = 60$
Figure 4.3  Plots of the cubic spline estimators and 99% confidence bands of $m^{(1)}(t)$ in Model I; $n = 50$, $N = 50$
Figure 4.4  Plots of the cubic spline estimators and 99% confidence bands of $m^{(1)}(t)$ in Model I; $n = 50$, $N = 100$
Figure 4.5  Plots of the cubic spline estimators and 99% confidence bands of $m^{(1)}(t)$ in Model II; $n = 30$, $N = 30$
Figure 4.6  Plots of the cubic spline estimators and 99% confidence bands of $m^{(1)}(t)$ in Model II; $n = 30$, $N = 60$
Figure 4.7  Plots of the cubic spline estimators and 99% confidence bands of $m^{(1)}(t)$ in Model II; $n = 50$, $N = 50$
Figure 4.8  Plots of the cubic spline estimators and 99% confidence bands of $m^{(1)}(t)$ in Model II; $n = 50$, $N = 100$
Figure 4.9  Plots of the cubic spline estimators and 99% confidence bands
Figure 5.1  Supremum of the pseudo-likelihood function profiled across $\beta$
Figure 5.2  Plot of the exact and the resampled CDF
Figure 5.3  Simultaneous confidence bands for time effects on PACIS and Mood

Chapter 1

Introduction

1.1 Functional data analysis

1.1.1 The basics

Functional data analysis (FDA) has recently become a focal area in statistical research, as technological progress in measuring devices now allows one to observe spatiotemporal phenomena on arbitrarily fine grids, that is, almost in a continuous manner. The area remains distinct due to its usefulness in climatology, medicine, meteorology, etc.

The functional data that we consider are a collection of trajectories $\{\eta_i(x)\}_{i=1}^n$ which are i.i.d. realizations of a smooth random function $\eta(x)$ defined on a continuous interval $\mathcal{X}$. Assume that $\{\eta(x), x\in\mathcal{X}\}$ is an $L^2$ process, i.e., $E\int_{\mathcal{X}}\eta^2(x)\,dx < +\infty$, and define the mean and covariance functions $m(x) = E\{\eta(x)\}$ and $G(x,x') = \operatorname{cov}\{\eta(x),\eta(x')\}$. The covariance function is a symmetric nonnegative-definite function with a spectral decomposition $G(x,x') = \sum_{k=1}^{\kappa}\lambda_k\psi_k(x)\psi_k(x')$, where $\lambda_1\ge\lambda_2\ge\cdots\ge 0$, $\sum_{k=1}^{\kappa}\lambda_k < +\infty$, are the eigenvalues, $\kappa$ is either a positive integer or infinity, and $\{\psi_k(x)\}_{k=1}^{\kappa}$ are the corresponding eigenfunctions, which form a set of orthonormal functions in $L^2(\mathcal{X})$. By the Karhunen–Loève representation, $\eta_i(x) = m(x) + \sum_{k=1}^{\kappa}\xi_{ik}\phi_k(x)$, where the random coefficients $\xi_{ik}$ are uncorrelated with mean 0 and variance 1, and the functions $\phi_k = \sqrt{\lambda_k}\psi_k$.

Characterizing nonlinear variation in FDA is a challenging problem. In particular, when random curves are observed on regular dense grids, the existing literature on FDA focuses on pointwise estimation and inference. This, however, is not sufficient to provide an understanding of the variability of the estimator of the whole regression curve, its derivatives and the covariance function, nor can it be used to correctly answer questions about the shape of the curve or surface. In Chapters 2, 3 and 4, we propose oracle estimators and establish the asymptotic correctness of the proposed simultaneous confidence bands/envelopes for the mean function, the covariance function and derivatives of the mean function, as the number of observations for each subject tends to infinity with the sample size, using various properties of spline smoothing.
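As an aside for readers who wish to experiment with such data, a minimal sketch of this data-generating mechanism is given below (Python with NumPy; the function name simulate_kl and its default arguments are our own choices, not part of the dissertation). It draws dense curves from a truncated Karhunen–Loève expansion, using the mean and the two rescaled eigenfunctions of the simulation design (2.16) in Section 2.5.

\begin{verbatim}
import numpy as np

def simulate_kl(n=100, N=50, sigma=0.5, seed=0):
    """Draw n dense curves Y_ij = m(j/N) + sum_k xi_ik phi_k(j/N) + sigma eps_ij,
    with the mean and phi_1, phi_2 of model (2.16), so lambda_1 = 2, lambda_2 = 0.5."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, N + 1) / N                       # regular design points j/N
    m = 10 + np.sin(2 * np.pi * (x - 0.5))            # mean function m(x)
    phi = np.vstack([-2 * np.cos(np.pi * (x - 0.5)),  # phi_1 = sqrt(lambda_1) psi_1
                     np.sin(np.pi * (x - 0.5))])      # phi_2 = sqrt(lambda_2) psi_2
    xi = rng.standard_normal((n, 2))                  # KL scores: mean 0, variance 1
    eps = rng.standard_normal((n, N))                 # i.i.d. measurement errors
    return x, m + xi @ phi + sigma * eps              # grid and n x N data matrix
\end{verbatim}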
1.1.2 Polynomial spline

To describe the spline functions, we first introduce a sequence of equally-spaced points $\{t_J\}_{J=1}^{N_m}$, called interior knots, which divide the interval $[0,1]$ into $N_m+1$ equal subintervals $I_J = [t_J,t_{J+1})$, $J = 0,\ldots,N_m-1$, and $I_{N_m} = [t_{N_m},1]$. For any positive integer $p$, introduce left boundary knots $t_{1-p},\ldots,t_0$ and right boundary knots $t_{N_m+1},\ldots,t_{N_m+p}$:
\[ t_{1-p} = \cdots = t_0 = 0 < t_1 < \cdots < t_{N_m} < 1 = t_{N_m+1} = \cdots = t_{N_m+p}, \qquad t_J = Jh_m,\ 0\le J\le N_m+1,\ h_m = 1/(N_m+1), \]
in which $h_m$ is the distance between neighboring knots. Denote by $H^{(p-2)}$ the space of $p$-th order splines, i.e., $p-2$ times continuously differentiable functions on $[0,1]$ that are polynomials of degree $p-1$ on each $[t_J,t_{J+1}]$, $J = 0,\ldots,N_m$. The $J$-th B-spline of order $p$ is defined recursively by
\[ B_{J,p}(x) = \frac{x-t_J}{t_{J+p-1}-t_J}B_{J,p-1}(x) + \frac{t_{J+p}-x}{t_{J+p}-t_{J+1}}B_{J+1,p-1}(x), \qquad 1-p\le J\le N_m, \]
for $p>1$, with $B_{J,1}(x) = 1\{t_J\le x<t_{J+1}\}$.

For the dense functional data considered in Chapter 2, it is assumed that $\lambda_k = 0$ for $k>\kappa$, where $\kappa$ is a positive integer or $\infty$; thus $G(x,x') = \sum_{k=1}^{\kappa}\phi_k(x)\phi_k(x')$, and the model that we consider is
\[ Y_{ij} = m(j/N) + \sum_{k=1}^{\kappa}\xi_{ik}\phi_k(j/N) + \sigma(j/N)\varepsilon_{ij}. \tag{2.1} \]
Although the sequences $\{\lambda_k\}_{k=1}^{\kappa}$, $\{\phi_k(\cdot)\}_{k=1}^{\kappa}$ and the random coefficients $\xi_{ik}$ exist mathematically, they are unknown or unobservable.

The existing literature focuses on two data types. Yao, Müller and Wang (2005) studied sparse longitudinal data for which $N_i$, the number of observations for the $i$-th curve, is bounded and follows a given distribution, in which case Ma, Yang and Carroll (2012) obtained an asymptotically simultaneous confidence band for the mean function of the functional data, using piecewise constant spline estimation. Li and Hsing (2010a) established uniform convergence rates for local linear estimators of the mean and covariance functions of dense functional data, where $\min_{1\le i\le n}N_i\gg(n/\log n)^{1/4}$ as $n\to\infty$, similar to our Assumption (A3), but did not provide the asymptotic distribution of the maximal deviation or simultaneous confidence bands. Degras (2011) built asymptotically correct simultaneous confidence bands for dense functional data using local linear estimators. Bunea, Ivanescu and Wegkamp (2011) proposed an asymptotically conservative, rather than correct, confidence set for the mean function of Gaussian functional data.

In this chapter, we propose polynomial spline confidence bands for the mean function of dense functional data. In function estimation problems, simultaneous confidence bands are an important tool for addressing the variability of the mean curve; see Zhao and Wu (2008), Zhou, Shen and Wolfe (1998) and Zhou and Wu (2010) for related theory and applications. The fact that simultaneous confidence bands have not been widely used in functional data analysis is certainly not due to a lack of interesting applications, but to the greater technical difficulty of formulating such bands for functional data and establishing their theoretical properties. In this work, we establish the asymptotic correctness of the proposed confidence bands using various properties of spline smoothing. The spline estimators and the accompanying confidence bands are asymptotically the same as if all $n$ random curves were recorded over the entire interval, without measurement errors. They are oracally efficient despite the use of spline smoothing; see Remark 2.2.1. This provides partial theoretical justification for treating functional data as perfectly recorded random curves over the entire data range, as in Ferraty and Vieu (2006).
Theorem 3 of Hall, Müller and Wang (2006) stated mean square (rather than the stronger uniform) oracle efficiency for local linear estimation of eigenfunctions and eigenvalues (rather than the mean function), under assumptions similar to ours, but provided only an outline of proof. Among the existing works on functional data analysis, Ma, Yang and Carroll (2012) proposed simultaneous confidence bands for sparse functional data. However, their result does not enjoy the oracle efficiency stated in Theorem 2.2.1, since there are not enough observations per subject to obtain a good estimate of the individual trajectories. As a result, it has the slow nonparametric convergence rate of $n^{-1/3}\log n$ instead of the parametric rate $n^{-1/2}$ of the present work. This essential difference completely separates dense functional data from sparse ones.

The aforementioned confidence bands are also extended to the difference of two regression functions. This is motivated by Li and Yu (2008), who applied functional segment discriminant analysis to a Tecator data set; see Figure 2.3. In this data set, each observation (a meat sample) consists of a 100-channel absorbance spectrum over a range of wavelengths, together with its fat, water and protein percentages. Li and Yu (2008) used the spectra to predict whether the fat percentage is greater than 20%. In contrast, we are interested in building a $100(1-\alpha)\%$ confidence band for the difference between the regression functions of the spectra of the below-20% fat group and the at-least-20% fat group. If this $100(1-\alpha)\%$ confidence band fails to cover the zero line entirely, one rejects the null hypothesis of no difference between the two groups, with p-value no greater than $\alpha$. Tests for equality of two groups of curves based on the adaptive Neyman test and wavelet thresholding techniques were proposed in Fan and Lin (1998), which did not provide an estimator of the difference of the two mean functions nor a simultaneous confidence band for such an estimator. As a result, their test does not extend to testing other important hypotheses on the difference of the two mean functions, while our Theorem 2.2.3 provides a benchmark for all such testing. More recently, Benko, Härdle and Kneip (2009) developed two-sample bootstrap tests for the equality of eigenfunctions, eigenvalues and mean functions using common functional principal components.

This chapter is organized as follows. Section 2.2 states the main theoretical results on confidence bands constructed from polynomial splines. Section 2.3 provides further insight into the error structure of the spline estimators. The actual steps to implement the confidence bands are provided in Section 2.4. A simulation study is presented in Section 2.5, and an empirical illustration of how to use the proposed spline confidence band for inference is reported in Section 2.6. Technical proofs are collected in the Appendix.

2.2 Main results

To describe the spline functions, we use the sequence of equally-spaced interior knots $\{t_J\}_{J=1}^{N_m}$ introduced in Chapter 1. Denote by $H^{(p-2)}$ the space of $p$-th order splines, i.e., $p-2$ times continuously differentiable functions on $[0,1]$ that are polynomials of degree $p-1$ on each $[t_J,t_{J+1}]$, $J = 0,\ldots,N_m$. Then
\[ H^{(p-2)} = \left\{\sum_{J=1-p}^{N_m} b_{J,p}B_{J,p}(x),\ b_{J,p}\in\mathbb{R},\ x\in[0,1]\right\}, \]
where $B_{J,p}$ is the $J$-th B-spline basis function of order $p$ as defined in de Boor (2001).
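For concreteness, this basis can be evaluated directly from the Cox–de Boor recursion of Section 1.1.2. The sketch below (Python/NumPy; bspline_basis is our own helper, not part of any package) builds the $N\times(N_m+p)$ design matrix on the knot sequence with repeated boundary knots.

\begin{verbatim}
import numpy as np

def bspline_basis(x, Nm, p):
    """Design matrix of the Nm + p order-p B-splines B_{J,p} on [0, 1] with Nm
    equally spaced interior knots, computed via the Cox-de Boor recursion."""
    # knots: t_{1-p} = ... = t_0 = 0 < t_1 < ... < t_Nm < 1 = t_{Nm+1} = ... = t_{Nm+p}
    t = np.concatenate([np.zeros(p), np.arange(1, Nm + 1) / (Nm + 1), np.ones(p)])
    x = np.asarray(x, dtype=float)
    # order 1: indicators of [t_J, t_{J+1}); the last nontrivial interval also gets x = 1
    B = np.array([((x >= t[J]) & (x < t[J + 1]))
                  | ((t[J] < 1.0) & (t[J + 1] == 1.0) & (x == 1.0))
                  for J in range(len(t) - 1)], dtype=float)
    for q in range(2, p + 1):                          # raise the order up to p
        Bq = np.zeros((len(t) - q, x.size))
        for J in range(len(t) - q):
            if t[J + q - 1] > t[J]:                    # left term of the recursion
                Bq[J] += (x - t[J]) / (t[J + q - 1] - t[J]) * B[J]
            if t[J + q] > t[J + 1]:                    # right term of the recursion
                Bq[J] += (t[J + q] - x) / (t[J + q] - t[J + 1]) * B[J + 1]
        B = Bq
    return B.T                                         # columns J = 1-p, ..., Nm
\end{verbatim}

A standard sanity check is that the columns sum to one at every $x$ (partition of unity).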
We propose to estimate the mean function $m(x)$ by
\[ \hat m_p(x) = \operatorname*{argmin}_{g(\cdot)\in H^{(p-2)}} \sum_{i=1}^{n}\sum_{j=1}^{N}\left\{Y_{ij} - g(j/N)\right\}^2. \tag{2.2} \]
The technical assumptions we need are as follows:

(A1) The regression function $m\in C^{p-1,1}[0,1]$, i.e., $m^{(p-1)}\in C^{0,1}[0,1]$.

(A2) The standard deviation function $\sigma(x)\in C^{0,\mu}[0,1]$ for some $\mu\in(0,1]$.

(A3) As $n\to\infty$, $N^{-1}n^{1/(2p)}\to 0$ and $N = O(n^{\theta})$ for some $\theta > 1/(2p)$; the number of interior knots $N_m$ satisfies $NN_m^{-1}\to\infty$, $N_m^{-p}n^{1/2}\to 0$, $N^{-1/2}N_m^{1/2}\log n\to 0$, or equivalently $Nh_m\to\infty$, $h_m^{p}n^{1/2}\to 0$, $N^{-1/2}h_m^{-1/2}\log n\to 0$.

(A4) There exists $C_G > 0$ such that $G(x,x)\ge C_G$, $x\in[0,1]$; for $k\in\{1,\ldots,\kappa\}$, $\phi_k(x)\in C^{0,\mu}[0,1]$, $\sum_{k=1}^{\kappa}\|\phi_k\|_{\infty} < \infty$, and as $n\to\infty$, $h_m^{\mu}\sum_{k=1}^{\kappa_n}\|\phi_k\|_{0,\mu} = o(1)$ for a sequence $\{\kappa_n\}_{n=1}^{\infty}$ of increasing integers with $\lim_{n\to\infty}\kappa_n = \kappa$ and the constant $\mu\in(0,1]$ as in Assumption (A2). In particular, $\sum_{k=\kappa_n+1}^{\kappa}\|\phi_k\|_{\infty} = o(1)$.

(A5) There are constants $C_1, C_2\in(0,+\infty)$, $\gamma_1,\gamma_2\in(1,+\infty)$, $\beta\in(0,1/2)$ and i.i.d. $N(0,1)$ variables $\{Z_{ik,\xi}\}_{i=1,k=1}^{n,\kappa}$, $\{Z_{ij,\varepsilon}\}_{i=1,j=1}^{n,N}$ such that
\[ \max_{1\le k\le\kappa} P\left\{\max_{1\le t\le n}\left|\sum_{i=1}^{t}\xi_{ik} - \sum_{i=1}^{t}Z_{ik,\xi}\right| > C_1n^{\beta}\right\} < C_2n^{-\gamma_1}, \tag{2.3} \]
\[ \max_{1\le j\le N} P\left\{\max_{1\le t\le n}\left|\sum_{i=1}^{t}\varepsilon_{ij} - \sum_{i=1}^{t}Z_{ij,\varepsilon}\right| > C_1n^{\beta}\right\} < C_2n^{-\gamma_2}. \tag{2.4} \]

Assumptions (A1)–(A2) are typical for spline smoothing; see Huang and Yang (2004), Xue and Yang (2006), Wang and Yang (2009a), Liu and Yang (2010) and Ma and Yang (2011). Assumption (A3) concerns the number of observations for each subject and the number of knots of the B-splines. Assumption (A4) ensures that the principal components have collectively bounded smoothness. Assumption (A5) provides a Gaussian approximation of the estimation error process, and is ensured by the following elementary assumption:

(A5') There exist $\tau_1 > 4$, $\tau_2 > 4+2\theta$ such that $E|\xi_{ik}|^{\tau_1} + E|\varepsilon_{ij}|^{\tau_2} < +\infty$ for $1\le i<\infty$, $1\le k\le\kappa$, $1\le j<\infty$. The number $\kappa$ of nonzero eigenvalues is finite, or $\kappa$ is infinite while the variables $\{\xi_{ik}\}_{1\le i<\infty,1\le k<\infty}$ are i.i.d.

Degras (2011) makes a restrictive assumption (A.2) on the Hölder continuity of the stochastic process $\eta(x) = m(x) + \sum_{k=1}^{\infty}\xi_k\phi_k(x)$. It is elementary to construct examples where our Assumptions (A4) and (A5) are satisfied while assumption (A.2) of Degras (2011) is not. The part of Assumption (A4) on the $\phi_k$'s holds trivially if $\kappa$ is finite and all $\phi_k(x)\in C^{0,\mu}[0,1]$. Note also that by definition $\phi_k = \sqrt{\lambda_k}\psi_k$, $\|\phi_k\|_{\infty} = \sqrt{\lambda_k}\|\psi_k\|_{\infty}$, $\|\phi_k\|_{0,\mu} = \sqrt{\lambda_k}\|\psi_k\|_{0,\mu}$, in which $\{\psi_k\}_{k=1}^{\infty}$ form an orthonormal basis of $L^2([0,1])$; hence Assumption (A4) is fulfilled for $\kappa = \infty$ as long as $\lambda_k$ decreases to zero sufficiently fast. Following one referee's suggestion, we provide the following example. One takes $\lambda_k = \rho^{2[k/2]}$, $k = 1,2,\ldots$, for any $\rho\in(0,1)$, with $\{\psi_k\}_{k=1}^{\infty}$ the canonical orthonormal Fourier basis of $L^2([0,1])$:
\[ \psi_1(x)\equiv 1,\quad \psi_{2k}(x) = \sqrt{2}\sin(k\pi x),\quad \psi_{2k+1}(x) = \sqrt{2}\cos(k\pi x),\quad k = 1,2,\ldots,\ x\in[0,1]. \]
In this case $\sum_{k=1}^{\infty}\|\phi_k\|_{\infty} = 1 + \sum_{k=1}^{\infty}\rho^{k}(\sqrt2+\sqrt2) = 1 + 2\sqrt2\rho(1-\rho)^{-1} < \infty$, while for any $\{\kappa_n\}_{n=1}^{\infty}$ with $\kappa_n$ increasing, odd and $\kappa_n\to\infty$, and Lipschitz order $\mu = 1$,
\[ h_m\sum_{k=1}^{\kappa_n}\|\phi_k\|_{0,1} = h_m\sum_{k=1}^{(\kappa_n-1)/2}\rho^{k}\left(\sqrt2 k\pi + \sqrt2 k\pi\right) \le 2\sqrt2\pi h_m\rho\sum_{k=1}^{\infty}\rho^{k-1}k = 2\sqrt2\pi h_m\rho(1-\rho)^{-2} = O(h_m) = o(1). \]

Denote by $\zeta(x)$, $x\in[0,1]$, a standardized Gaussian process such that $E\zeta(x)\equiv 0$, $E\zeta^2(x)\equiv 1$, $x\in[0,1]$, with covariance function
\[ E\zeta(x)\zeta(x') = G(x,x')\left\{G(x,x)G(x',x')\right\}^{-1/2},\quad x,x'\in[0,1], \]
and define the $100\times(1-\alpha)$-th percentile of the absolute maxima distribution of $\zeta(x)$, i.e., $P\{\sup_{x\in[0,1]}|\zeta(x)|\le Q_{1-\alpha}\} = 1-\alpha$, $\forall\alpha\in(0,1)$. Denote by $z_{1-\alpha/2}$ the $100(1-\alpha/2)$-th percentile of the standard normal distribution.
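The quantile $Q_{1-\alpha}$ has no closed form, but for a known finite-rank covariance $G(x,x') = \sum_k\phi_k(x)\phi_k(x')$ it is easily approximated by simulating $\zeta(x) = G(x,x)^{-1/2}\sum_k Z_k\phi_k(x)$ on a fine grid. A minimal sketch follows (Python/NumPy; simul_quantile is our own name, the Monte Carlo size 1000 an arbitrary default; Section 2.4 describes the plug-in version with estimated quantities, and positivity of $G(x,x)$ is Assumption (A4)).

\begin{verbatim}
import numpy as np

def simul_quantile(phi, alpha=0.05, n_sim=1000, seed=0):
    """(1-alpha) quantile of sup_x |zeta(x)| for the standardized Gaussian
    process with covariance G(x,x') = sum_k phi_k(x) phi_k(x').
    phi: (kappa x n_grid) array of phi_k on a fine grid; needs G(x,x) > 0."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt((phi ** 2).sum(axis=0))             # G(x, x)^{1/2} on the grid
    Z = rng.standard_normal((n_sim, phi.shape[0]))   # i.i.d. N(0, 1) coefficients
    sup_abs = np.abs(Z @ phi / sd).max(axis=1)       # sup_x |zeta_b(x)|, b = 1..n_sim
    return np.quantile(sup_abs, 1 - alpha)

# example: the rank-two covariance of model (2.16) in Section 2.5
x = np.linspace(0, 1, 101)
phi = np.vstack([-2 * np.cos(np.pi * (x - 0.5)), np.sin(np.pi * (x - 0.5))])
Q95 = simul_quantile(phi, alpha=0.05)
\end{verbatim}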
Define also the following "infeasible estimator" of the function $m$:
\[ \bar m(x) = \bar\eta(x) = n^{-1}\sum_{i=1}^{n}\eta_i(x),\quad x\in[0,1]. \tag{2.5} \]
The term "infeasible" refers to the fact that $\bar m(x)$ is computed from the unknown quantities $\eta_i(x)$, $x\in[0,1]$, and it would be the natural estimator of $m(x)$ if all the i.i.d. random curves $\eta_i(x)$, $x\in[0,1]$, were observed, a view taken in Ferraty and Vieu (2006). We now state our main results in the following theorem.

Theorem 2.2.1. Under Assumptions (A1)–(A5), for any $\alpha\in(0,1)$, as $n\to\infty$, the "infeasible estimator" $\bar m(x)$ converges at the $\sqrt n$ rate:
\[ P\left\{\sup_{x\in[0,1]} n^{1/2}\left|\bar m(x) - m(x)\right|G(x,x)^{-1/2} \le Q_{1-\alpha}\right\} \to 1-\alpha, \]
\[ P\left\{n^{1/2}\left|\bar m(x) - m(x)\right|G(x,x)^{-1/2} \le z_{1-\alpha/2}\right\} \to 1-\alpha,\quad \forall x\in[0,1], \]
while the spline estimator $\hat m_p$ is asymptotically equivalent to $\bar m$ up to order $n^{1/2}$, i.e.,
\[ \sup_{x\in[0,1]} n^{1/2}\left|\bar m(x) - \hat m_p(x)\right| = o_P(1). \]

Remark 2.2.1. The significance of Theorem 2.2.1 lies in the fact that one does not need to distinguish between the spline estimator $\hat m_p$ and the "infeasible estimator" $\bar m$ in (2.5), which converges at the $\sqrt n$ rate like a parametric estimator. We have therefore established the oracle efficiency of the nonparametric estimator $\hat m_p$.

Corollary 2.2.2. Under Assumptions (A1)–(A5), as $n\to\infty$, an asymptotically $100(1-\alpha)\%$ correct confidence band for $m(x)$, $x\in[0,1]$, is
\[ \hat m_p(x) \pm G(x,x)^{1/2}Q_{1-\alpha}n^{-1/2},\quad \forall\alpha\in(0,1), \]
while an asymptotic $100(1-\alpha)\%$ pointwise confidence interval for $m(x)$, $x\in[0,1]$, is $\hat m_p(x) \pm G(x,x)^{1/2}z_{1-\alpha/2}n^{-1/2}$.

We next describe a two-sample extension of Theorem 2.2.1. Denote two samples, indicated by $d = 1,2$, which satisfy
\[ Y_{dij} = m_d(j/N) + \sum_{k=1}^{\kappa_d}\xi_{dik}\phi_{dk}(j/N) + \sigma_d(j/N)\varepsilon_{dij},\quad 1\le i\le n_d,\ 1\le j\le N, \]
with covariance functions $G_d(x,x') = \sum_{k=1}^{\kappa_d}\phi_{dk}(x)\phi_{dk}(x')$, respectively. We denote the ratio of the two sample sizes by $\hat r = n_1/n_2$ and assume that $\lim_{n_1\to\infty}\hat r = r > 0$. For both groups, let $\hat m_{1p}(x)$ and $\hat m_{2p}(x)$ be the order-$p$ spline estimates of the mean functions $m_1(x)$ and $m_2(x)$ given by (2.2). Also denote by $\zeta_{12}(x)$, $x\in[0,1]$, a standardized Gaussian process with $E\zeta_{12}(x)\equiv 0$, $E\zeta_{12}^2(x)\equiv 1$, $x\in[0,1]$, and covariance function
\[ E\zeta_{12}(x)\zeta_{12}(x') = \frac{G_1(x,x') + rG_2(x,x')}{\left\{G_1(x,x)+rG_2(x,x)\right\}^{1/2}\left\{G_1(x',x')+rG_2(x',x')\right\}^{1/2}},\quad x,x'\in[0,1]. \]
Denote by $Q_{12,1-\alpha}$ the $(1-\alpha)$-th quantile of the absolute maximal deviation of $\zeta_{12}(x)$, $x\in[0,1]$, as above. We mimic the two-sample $t$-test and state the following theorem, whose proof is analogous to that of Theorem 2.2.1.

Theorem 2.2.3. If Assumptions (A1)–(A5) are modified for each group accordingly, then for any $\alpha\in(0,1)$, as $n_1\to\infty$, $\hat r\to r > 0$,
\[ P\left\{\sup_{x\in[0,1]} \frac{n_1^{1/2}\left|\left(\hat m_{1p} - \hat m_{2p} - m_1 + m_2\right)(x)\right|}{\left\{(G_1+rG_2)(x,x)\right\}^{1/2}} \le Q_{12,1-\alpha}\right\} \to 1-\alpha. \]

Theorem 2.2.3 yields a uniform asymptotic confidence band for $m_1(x) - m_2(x)$, $x\in[0,1]$.

Corollary 2.2.4. If Assumptions (A1)–(A5) are modified for each group accordingly, as $n_1\to\infty$, $\hat r\to r > 0$, a $100(1-\alpha)\%$ asymptotically correct confidence band for $m_1(x)-m_2(x)$, $x\in[0,1]$, is
\[ \left(\hat m_{1p} - \hat m_{2p}\right)(x) \pm n_1^{-1/2}Q_{12,1-\alpha}\left\{(G_1+rG_2)(x,x)\right\}^{1/2},\quad \forall\alpha\in(0,1). \]

If the confidence band in Corollary 2.2.2 is used to test the hypothesis
\[ H_0: m(x) = m_0(x),\ \forall x\in[0,1] \longleftrightarrow H_a: m(x)\ne m_0(x)\ \text{for some}\ x\in[0,1], \]
for some given function $m_0(x)$, then, as one referee pointed out, the asymptotic power of the test is $\alpha$ under $H_0$ and 1 under $H_a$, due to Theorem 2.2.1. The same can be said for testing hypotheses about $m_1(x)-m_2(x)$ using the confidence band in Corollary 2.2.4.

2.3 Error decomposition

In this section, we break the estimation error $\hat m_p(x) - m(x)$ into three terms.
We begin by discussing the representation of the spline estimator $\hat m_p(x)$ in (2.2).

The definition of $\hat m_p(x)$ in (2.2) means that $\hat m_p(x)\equiv\sum_{J=1-p}^{N_m}\hat\beta_{J,p}B_{J,p}(x)$, with coefficients $(\hat\beta_{1-p,p},\ldots,\hat\beta_{N_m,p})^T$ solving the following least squares problem:
\[ \left(\hat\beta_{1-p,p},\ldots,\hat\beta_{N_m,p}\right)^T = \operatorname*{argmin}_{(\beta_{1-p,p},\ldots,\beta_{N_m,p})\in\mathbb{R}^{N_m+p}} \sum_{i=1}^{n}\sum_{j=1}^{N}\left\{Y_{ij} - \sum_{J=1-p}^{N_m}\beta_{J,p}B_{J,p}(j/N)\right\}^2. \tag{2.6} \]
Applying elementary algebra, one obtains
\[ \hat m_p(x) = \left\{B_{1-p,p}(x),\ldots,B_{N_m,p}(x)\right\}\left(\mathbf X^T\mathbf X\right)^{-1}\mathbf X^T\bar{\mathbf Y}, \tag{2.7} \]
where $\bar{\mathbf Y} = (\bar Y_{\cdot 1},\ldots,\bar Y_{\cdot N})^T$, $\bar Y_{\cdot j} = n^{-1}\sum_{i=1}^{n}Y_{ij}$, $1\le j\le N$, and the design matrix $\mathbf X$ is
\[ \mathbf X = \begin{pmatrix} B_{1-p,p}(1/N) & \cdots & B_{N_m,p}(1/N) \\ \vdots & & \vdots \\ B_{1-p,p}(N/N) & \cdots & B_{N_m,p}(N/N) \end{pmatrix}_{N\times(N_m+p)}. \]
Projecting via (2.7) the relationship in model (2.1) onto the linear subspace of $\mathbb{R}^{N_m+p}$ spanned by $\{B_{J,p}(j/N)\}_{1\le j\le N,\,1-p\le J\le N_m}$, we obtain the following crucial decomposition in the space $H^{(p-2)}$ of spline functions:
\[ \hat m_p(x) = \tilde m_p(x) + \tilde e_p(x) + \tilde\xi_p(x), \tag{2.8} \]
where
\[ \tilde m_p(x) = \sum_{J=1-p}^{N_m}\tilde\beta_{J,p}B_{J,p}(x),\quad \tilde e_p(x) = \sum_{J=1-p}^{N_m}\tilde a_{J,p}B_{J,p}(x),\quad \tilde\xi_p(x) = \sum_{k=1}^{\kappa}\tilde\xi_{k,p}(x),\quad \tilde\xi_{k,p}(x) = \sum_{J=1-p}^{N_m}\tilde\tau_{k,J,p}B_{J,p}(x). \tag{2.9} \]
The vectors $(\tilde\beta_{1-p},\ldots,\tilde\beta_{N_m})^T$, $(\tilde a_{1-p},\ldots,\tilde a_{N_m})^T$ and $(\tilde\tau_{k,1-p},\ldots,\tilde\tau_{k,N_m})^T$ in (2.9) are solutions to (2.6) with $Y_{ij}$ replaced by $m(j/N)$, $\sigma(j/N)\varepsilon_{ij}$ and $\xi_{ik}\phi_k(j/N)$, respectively. Alternatively,
\[ \tilde m_p(x) = \left\{B_{1-p,p}(x),\ldots,B_{N_m,p}(x)\right\}\left(\mathbf X^T\mathbf X\right)^{-1}\mathbf X^T\mathbf m, \]
\[ \tilde e_p(x) = \left\{B_{1-p,p}(x),\ldots,B_{N_m,p}(x)\right\}\left(\mathbf X^T\mathbf X\right)^{-1}\mathbf X^T\mathbf e, \]
\[ \tilde\xi_{k,p}(x) = \bar\xi_{\cdot k}\left\{B_{1-p,p}(x),\ldots,B_{N_m,p}(x)\right\}\left(\mathbf X^T\mathbf X\right)^{-1}\mathbf X^T\boldsymbol\phi_k,\quad 1\le k\le\kappa, \]
in which $\mathbf m = (m(1/N),\ldots,m(N/N))^T$ is the signal vector, $\mathbf e = (\sigma(1/N)\bar\varepsilon_{\cdot 1},\ldots,\sigma(N/N)\bar\varepsilon_{\cdot N})^T$, with $\bar\varepsilon_{\cdot j} = n^{-1}\sum_{i=1}^{n}\varepsilon_{ij}$, $1\le j\le N$, is the noise vector, $\boldsymbol\phi_k = (\phi_k(1/N),\ldots,\phi_k(N/N))^T$ are the eigenfunction vectors, and $\bar\xi_{\cdot k} = n^{-1}\sum_{i=1}^{n}\xi_{ik}$, $1\le k\le\kappa$.

We cite next an important result from de Boor (2001), p. 149.

Theorem 2.3.1. There is an absolute constant $C_{p-1,\mu} > 0$ such that for every $\phi\in C^{p-1,\mu}[0,1]$ for some $\mu\in(0,1]$, there exists a function $g\in H^{(p-1)}[0,1]$ for which $\|g-\phi\|_{\infty}\le C_{p-1,\mu}\|\phi^{(p-1)}\|_{0,\mu}h_m^{\mu+p-1}$.

The next three propositions concern $\tilde m_p(x)$, $\tilde e_p(x)$ and $\tilde\xi_p(x)$ given in (2.8).

Proposition 2.3.1. Under Assumptions (A1) and (A3), as $n\to\infty$,
\[ \sup_{x\in[0,1]} n^{1/2}\left|\tilde m_p(x) - m(x)\right| = o(1). \tag{2.10} \]

Proposition 2.3.2. Under Assumptions (A2)–(A4), as $n\to\infty$,
\[ \sup_{x\in[0,1]} n^{1/2}\left|\tilde e_p(x)\right| = o_P(1). \tag{2.11} \]

Proposition 2.3.3. Under Assumptions (A2)–(A4), as $n\to\infty$,
\[ \sup_{x\in[0,1]} n^{1/2}\left|\tilde\xi_p(x) - \left(\bar m(x) - m(x)\right)\right| = o_P(1), \tag{2.12} \]
and also, for any $\alpha\in(0,1)$,
\[ P\left\{\sup_{x\in[0,1]} n^{1/2}\left|\tilde\xi_p(x)\right|G(x,x)^{-1/2} \le Q_{1-\alpha}\right\} \to 1-\alpha. \tag{2.13} \]

Equations (2.10), (2.11) and (2.12) yield the asymptotic efficiency of the spline estimator $\hat m_p$, i.e., $\sup_{x\in[0,1]}n^{1/2}|\bar m(x) - \hat m_p(x)| = o_P(1)$. The Appendix contains the proofs of the above three propositions, which together with (2.8) imply Theorem 2.2.1.

2.4 Implementation

This section describes procedures to implement the confidence band of Corollary 2.2.2. Given any data set $\{(j/N, Y_{ij})\}_{j=1,i=1}^{N,n}$ from model (2.1), the spline estimator $\hat m_p(x)$ is obtained from (2.7); the number of interior knots in estimating $m(x)$ is taken to be $N_m = [cn^{1/(2p)}\log(n)]$, in which $[a]$ denotes the integer part of $a$. Our experience shows that the choices of constant $c = 0.2, 0.3, 0.5, 1, 2$ are quite adequate, and that is what we recommend.
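Combining this knot rule with the bspline_basis helper sketched in Section 2.2 gives a compact implementation of (2.2) via (2.7) (Python/NumPy; spline_mean is our own name; fitting to the pooled means $\bar Y_{\cdot j}$ is equivalent to (2.6) because all curves share the same design).

\begin{verbatim}
import numpy as np

def spline_mean(x, Y, p=4, c=0.5):
    """Order-p spline estimator (2.2) of m, with Nm = [c n^{1/(2p)} log(n)]
    interior knots, computed as the least squares fit (2.7) to pooled means."""
    n = Y.shape[0]
    Nm = int(c * n ** (1.0 / (2 * p)) * np.log(n))    # knot rule of Section 2.4
    B = bspline_basis(x, Nm, p)                       # N x (Nm + p) design matrix
    beta, *_ = np.linalg.lstsq(B, Y.mean(axis=0), rcond=None)
    return B @ beta                                   # fitted mean on the grid
\end{verbatim}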
When constructing the confidence bands, one needs to estimate the unknown function $G(\cdot,\cdot)$ and the quantile $Q_{1-\alpha}$, and then plug in these estimators; the same approach is taken in Ma, Yang and Carroll (2012) and Wang and Yang (2009a). The pilot estimator $\hat G_p(x,x')$ of the covariance function $G(x,x')$ is
\[ \hat G_p = \operatorname*{argmin}_{g(\cdot,\cdot)\in H^{(p-2),2}} \sum_{j\ne j'} \left\{\bar C_{\cdot jj'} - g(j/N, j'/N)\right\}^2, \]
with $\bar C_{\cdot jj'} = n^{-1}\sum_{i=1}^{n}\{Y_{ij} - \hat m_p(j/N)\}\{Y_{ij'} - \hat m_p(j'/N)\}$, $1\le j\ne j'\le N$, and the tensor product spline space
\[ H^{(p-2),2} = \left\{\sum_{J,J'=1-p}^{N_G} b_{JJ'}B_{J,p}(t)B_{J',p}(s),\ b_{JJ'}\in\mathbb{R},\ t,s\in[0,1]\right\}, \]
in which $N_G = [n^{1/(2p)}\log(\log(n))]$. A detailed discussion of the consistency of this plug-in estimator can be found in Chapter 3.

In order to estimate $Q_{1-\alpha}$, one first carries out the eigenfunction decomposition of $\hat G_p(x,x')$, i.e., solves $N^{-1}\sum_{j=1}^{N}\hat G_p(j/N, j'/N)\hat\psi_k(j/N) = \hat\lambda_k\hat\psi_k(j'/N)$, to obtain the estimated eigenvalues $\hat\lambda_k$ and eigenfunctions $\hat\psi_k$. Next, one chooses the number $\kappa$ of eigenfunctions by the following standard and efficient criterion:
\[ \kappa = \operatorname*{argmin}_{1\le l\le T}\left\{\sum_{k=1}^{l}\hat\lambda_k\Big/\sum_{k=1}^{T}\hat\lambda_k > 0.95\right\}, \]
where $\{\hat\lambda_k\}_{k=1}^{T}$ are the first $T$ estimated positive eigenvalues. Finally, one simulates $\hat\zeta_b(x) = \hat G_p(x,x)^{-1/2}\sum_{k=1}^{\kappa}Z_{k,b}\hat\phi_k(x)$, where $\hat\phi_k = \hat\lambda_k^{1/2}\hat\psi_k$, the $Z_{k,b}$ are i.i.d. standard normal variables with $1\le k\le\kappa$ and $b = 1,\ldots,b_M$, and $b_M$ is a preset large integer whose default is 1000. We take the maximal absolute value of each copy $\hat\zeta_b(x)$ and estimate $Q_{1-\alpha}$ by the empirical quantile $\hat Q_{1-\alpha}$ of these maximum values. We then use the confidence band
\[ \hat m_p(x) \pm n^{-1/2}\hat G_p(x,x)^{1/2}\hat Q_{1-\alpha},\quad x\in[0,1], \tag{2.14} \]
for the mean function. We estimate $Q_{12,1-\alpha}$ in a similar way as $Q_{1-\alpha}$ and compute
\[ \left(\hat m_{1p} - \hat m_{2p}\right)(x) \pm n_1^{-1/2}\hat Q_{12,1-\alpha}\left\{\left(\hat G_{1p} + \hat r\hat G_{2p}\right)(x,x)\right\}^{1/2} \tag{2.15} \]
as a confidence band for $m_1(x) - m_2(x)$. Although beyond the scope of this work, as one referee pointed out, the confidence band in (2.14) is expected to enjoy the same asymptotic coverage as if the true values of $Q_{1-\alpha}$ and $G(x,x)$ were used, due to the consistency of $\hat G_p(x,x)$ for $G(x,x)$. The same holds for the confidence band in (2.15).

2.5 Simulation

To demonstrate the practical performance of our theoretical results, we perform a set of simulation studies. Data are generated from the model
\[ Y_{ij} = m(j/N) + \sum_{k=1}^{2}\xi_{ik}\phi_k(j/N) + \sigma\varepsilon_{ij},\quad 1\le j\le N,\ 1\le i\le n, \tag{2.16} \]
where $\xi_{ik}\sim N(0,1)$, $k = 1,2$, $\varepsilon_{ij}\sim N(0,1)$, for $1\le i\le n$, $1\le j\le N$, $m(x) = 10 + \sin\{2\pi(x-1/2)\}$, $\phi_1(x) = -2\cos\{\pi(x-1/2)\}$ and $\phi_2(x) = \sin\{\pi(x-1/2)\}$. This setting implies $\lambda_1 = 2$ and $\lambda_2 = 0.5$. The noise levels are set to $\sigma = 0.5$ and $0.3$. The number of subjects $n$ is taken to be 60, 100, 200, 300 and 500, and under each sample size the number of observations per curve is $N = [n^{0.25}\log^2(n)]$. This simulated process has a design similar to one of the simulation models in Yao, Müller and Wang (2005), except that each subject is densely observed. We consider both linear and cubic spline estimators, and use confidence levels $1-\alpha = 0.95$ and $0.99$ for our simultaneous confidence bands. The constant $c$ in the definition of $N_m$ in Section 2.4 is taken to be 0.2, 0.3, 0.5, 1 and 2. Each simulation is repeated 500 times.

Figures 2.1 and 2.2 show the estimated mean functions and their 95% confidence bands for the true curve $m(\cdot)$ in model (2.16) with $\sigma = 0.3$ and $n = 100, 200, 300, 500$, respectively. As expected, when $n$ increases the confidence bands become narrower and the linear and cubic spline estimators move closer to the true curve.
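The band (2.14) is assembled from three ingredients: the spline fit, a pilot covariance estimate, and a simulated quantile. The sketch below (Python/NumPy, reusing simulate_kl, spline_mean and the quantile idea sketched earlier; eigen_phi and confidence_band are our own names) substitutes a truncated eigen-reconstruction of the empirical covariance for the tensor-product spline pilot estimator $\hat G_p$, which keeps the example short; the 0.95 criterion for choosing $\kappa$ is as described above.

\begin{verbatim}
import numpy as np

def eigen_phi(x, Y, p=4, c=0.5, var_frac=0.95):
    """Estimate the mean by (2.2), then phi_k = lambda_k^{1/2} psi_k from the
    eigendecomposition of the empirical covariance of the centered curves,
    keeping kappa components by the var_frac (default 0.95) criterion.
    (A simplification: Section 2.4 instead smooths the raw covariances by
    tensor-product splines to obtain the pilot estimator G_p.)"""
    n, N = Y.shape
    mhat = spline_mean(x, Y, p, c)
    R = Y - mhat                                       # centered curves
    w, V = np.linalg.eigh(R.T @ R / n)                 # ascending eigenvalues
    lam = np.clip(w[::-1] / N, 0, None)                # lambda_k on the L2 scale
    psi = V[:, ::-1] * np.sqrt(N)                      # N^{-1} sum_j psi_k^2 = 1
    kappa = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_frac)) + 1
    return mhat, (np.sqrt(lam[:kappa]) * psi[:, :kappa]).T

def confidence_band(x, Y, alpha=0.05, n_sim=1000, seed=0):
    """Simultaneous band (2.14): mhat(x) +/- n^{-1/2} Ghat(x,x)^{1/2} Qhat."""
    mhat, phi = eigen_phi(x, Y)
    G_diag = (phi ** 2).sum(axis=0) + 1e-12            # Ghat(x, x), guarded from 0
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_sim, phi.shape[0]))
    Q = np.quantile(np.abs(Z @ phi / np.sqrt(G_diag)).max(axis=1), 1 - alpha)
    half = Q * np.sqrt(G_diag / Y.shape[0])
    return mhat - half, mhat + half
\end{verbatim}

For data from simulate_kl, the empirical coverage of this band can then be checked by Monte Carlo exactly as in Tables 2.1–2.4.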
Tables 2.1 to 2.4 report the empirical frequency with which the true curve $m(\cdot)$ is covered by the linear and cubic spline confidence bands (2.14) at the 100 points $\{1/100,\ldots,99/100,1\}$. At all noise levels, the coverage percentages for the confidence bands are close to the nominal confidence levels 0.95 and 0.99 for linear splines with $c = 0.5, 1$ (Tables 2.1 and 2.2) and for cubic splines with $c = 0.3, 0.5$ (Tables 2.3 and 2.4), but decline slightly for $c = 2$ and markedly for $c = 0.2$. The coverage percentages thus depend on the choice of $N_m$, and the dependence becomes stronger as the sample size decreases. For the large sample sizes $n = 300, 500$, the effect of the choice of $N_m$ on the coverage percentages is negligible. Although our theory indicates no optimal choice of $c$, we recommend using $c = 0.5$ for data analysis, as its performance in simulation for both linear and cubic splines is either optimal or near optimal.

Figure 2.1: For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation. Plots of the linear spline estimator (2.2) for simulated data (dashed-dotted line) and 95% confidence bands (2.14) (upper and lower dashed lines) for $m(x)$ (solid lines). Panels correspond to $n = 100, 200, 300, 500$; in all panels, $\sigma = 0.3$.

Figure 2.2: Plots of the cubic spline estimator (2.2) for simulated data (dashed-dotted line) and 95% confidence bands (2.14) (upper and lower dashed lines) for $m(x)$ (solid lines). Panels correspond to $n = 100, 200, 300, 500$; in all panels, $\sigma = 0.3$.

Table 2.1: Coverage frequencies from 500 replications using linear spline (2.14) with $p = 2$, $N_m = [cn^{1/(2p)}\log(n)]$ and $\sigma = 0.5$. [Coverage frequencies for $n = 60, 100, 200, 300, 500$ at $1-\alpha = 0.950, 0.990$ and $c = 0.2, 0.3, 0.5, 1, 2$.]

Table 2.2: Coverage frequencies from 500 replications using linear spline (2.14) with $p = 2$, $N_m = [cn^{1/(2p)}\log(n)]$ and $\sigma = 0.3$. [Same layout as Table 2.1.]

Table 2.3: Coverage frequencies from 500 replications using cubic spline (2.14) with $p = 4$, $N_m = [cn^{1/(2p)}\log(n)]$ and $\sigma = 0.5$.
[Coverage frequencies for $n = 60, 100, 200, 300, 500$ at $1-\alpha = 0.950, 0.990$ and $c = 0.2, 0.3, 0.5, 1, 2$.]

Table 2.4: Coverage frequencies from 500 replications using cubic spline (2.14) with $p = 4$, $N_m = [cn^{1/(2p)}\log(n)]$ and $\sigma = 0.3$. [Same layout as Table 2.3.]

We compare by simulation the proposed spline confidence band to the least squares Bonferroni and least squares bootstrap bands of Bunea, Ivanescu and Wegkamp (2011) (BIW). Table 2.5 presents the empirical frequency with which the true curve $m(\cdot)$ of model (2.16) is covered by these bands at $\{1/100,\ldots,99/100,1\}$, as in Tables 2.1 and 2.2. The coverage frequency of the BIW Bonferroni band is much higher than the nominal level, making it too conservative. The coverage frequency of the BIW bootstrap band is consistently lower than the nominal level by at least 10%, and it is thus not recommended for practical use.

We also compare the widths of the three bands. For each replication, we calculate the ratios of the widths of the two BIW bands against the spline band at $\{1/100,\ldots,99/100,1\}$ and then average these 100 ratios. Table 2.6 shows the five number summary of these 500 averaged ratios for $\sigma = 0.3$ and $p = 4$. The BIW Bonferroni band is much wider than
Simulation for other cases (e.g. p = 2, σ = 0.5) leads to the same conclusion. To examine the performance of the two-sample test based on spline confidence band, Table 2.7 reports the empirical power and type I error for the proposed two-sample test. The data were generated from (2.16) with σ = 0.5 and m1 (x) = 10 + sin {2π (x − 1/2)} + δ (x), n = n1 for the first group, and m2 (x) = 10 + sin {2π (x − 1/2)}, n = n2 for the another group. The remaining parameters, ξ ik , εij , φ1 (x) and φ2 (x) were set to the same values for each group as in (2.16). In order to mimic the real data in Section 2.6, we set N = 50, 100 and 200 when n1 = 160, 80 and 40 and n2 = 320, 160 and 80 accordingly. The studied 29 Table 2.7: Empirical power and type I error of two-sample test using cubic spline. δ (x) n1 = 160, n2 = 320 n1 = 80, n2 = 160 n1 = 40, n2 = 80 Nominal test level Nominal test level Nominal test level 0.05 0.01 0.05 0.01 0.05 0.01 0.6t 1.000 1.000 0.980 0.918 0.794 0.574 0.7sin(x) 1.000 1.000 0.978 0.910 0.788 0.566 0 0.058 0.010 0.068 0.010 0.096 0.028 Monte Carlo SE 0.001 0.004 0.001 0.004 0.001 0.004 hypotheses are: H0 : m1 (x) = m2 (x), ∀x ∈ [0, 1] ←→ Ha : m1 (x) = m2 (x), for some x ∈ [0, 1]. Table 2.7 shows the empirical frequencies of rejecting H0 in this simulation study with nominal test level equal to 0.05 and 0.01. If δ(x) = 0, these empirical powers should be close to 1, and for δ(x) ≡ 0, the nominal levels. Each set of simulations consists of 500 Monte Carlo runs. Asymptotic standard errors (as the number of Monte Carlo iterations tends to infinity) are reported in the last row of the table. Results are listed only for cubic spline confidence bands, as those of the linear spline are similar. Overall, the two-sample test performs well, even with a rather small difference (δ (x) = 0.7 sin(x)), providing a reasonable empirical power. Moreover, the differences between nominal levels and empirical type I error do diminish as the sample size increases. 30 2.6 Empirical example In this section, we revisit the Tecator data mentioned in Section 1, which can be downloaded at http://lib.stat.cmu.edu/datasets/tecator. In this data set, there are measurements on n = 240 meat samples, where for each sample a N = 100 channel near-infrared spectrum of absorbance measurements was recorded, and contents of moisture (water), fat and protein were also obtained. The Feed Analyzer worked in the wavelength range from 850 nm to 1050 nm. Figure 2.3 shows the scatter plot of this data set. The spectral data can be naturally considered as functional data, and we will perform a two-sample test to see whether absorbance from the spectrum differs significantly due to difference in fat content. This data set has been used for comparing four classification methods (Li and Yu, 2008), building a regression model to predict the fat content from the spectrum (Li and Hsing, 2010b). Following Li and Yu (2008), we separate samples according to their fat contents being less than 20% or not. The right panel of Figure 2.3 shows 10 samples from each group. Here, hypothesis of interest is: H0 : m1 (x) = m2 (x), ∀x ∈ [850, 1050] ←→ Ha : m1 (x) = m2 (x), for some x ∈ [850, 1050], where m1 (x) and m2 (x) are the regression functions of absorbance on spectrum, for samples with fat content less than 20% and greater than or equal to 20%, respectively. Among 240 samples, there are n1 = 155 with fat content less than 20%, the rest n2 = 85 no less than 20%. 
The numbers of interior knots in (2.2) are computed as in Section 2.4 with c = 0.5 and are N1m = 4 and N2m = 3 for cubic spline fit and N1m = 8 and N2m = 6 for linear spline fit. Figure 2.4 depicts the linear and cubic spline confidence bands according 31 5.5 4.0 2.0 2.5 3.0 3.5 absorbance 4.5 5.0 5.5 5.0 4.5 4.0 3.5 2.0 2.5 3.0 absorbance 850 900 950 1000 1050 850 wavelength 900 950 1000 1050 wavelength Figure 2.3: Left: Plot of Tecator data. Right: Sample curves for the Tecator data. Each class has 10 sample curves. Dashed lines represent spectra with fact > 20% and solid lines represent spectra with fact < 20%. 32 to (2.15) at confidence levels 0.99 (upper and lower dashed lines) and 0.999995 (upper and lower dotted lines), with the center dashed-dotted line representing the spline estimator m1 (x) − m2 (x) and a solid line representing zero. Since even the 99.9995% confidence band ˆ ˆ does not contain the zero line entirely, the difference of low fat and high fat populations’ absorbance was extremely significant. In fact, Figure 2.4 clearly indicates that the less the fat contained, the higher the absorbance is. 33 0.6 0.0 0.2 0.4 absorbance 0.4 0.0 0.2 absorbance 0.6 0.8 Cubic spline 0.8 Linear spline 850 900 950 1000 1050 850 wavelength 900 950 1000 1050 wavelength Figure 2.4: Plots of the fitted linear and cubic spline regressions of m1 (x) − m2 (x) for the Tecator data (dashed-dotted line), 99% confidence bands (2.15) (upper and lower dashed lines), 99.9995% confidence bands (2.15) (upper and lower dotted lines) and the zero line (solid line). 34 APPENDIX 35 In this appendix, we use C to denote a generic positive constant unless stated otherwise. Preliminaries For any vector ζ = (ζ 1 , ..., ζ s ) ∈ Rs , denote the norm ζ r = (|ζ 1 |r + · · · + |ζ s |r )1/r , 1 ≤ r < +∞, ζ ∞ = max (|ζ 1 | , ..., |ζ s |). For any s × s symmetric matrix A, we define λmin (A) and λmax (A) as its smallest and largest eigenvalues, and its Lr norm as A r = maxζ∈Rs ,ζ=0 ζ −1 Aζ r . In particular, A 2 = λmax (A), and if A is also r nonsingular, A−1 = λ−1 (A). min 2 For any functions φ, ϕ ∈ L2 [0, 1], define the theoretical and empirical inner products as 1 φ, ϕ = 0 φ (x) ϕ (x) dx and φ, ϕ 2,N = N −1 N φ (j/N ) ϕ (j/N ). The corresponding j=1 norms are φ 2 = φ, φ , φ 2 = φ, φ 2,N . 2 2,N We state a strong approximation result, which is used in the proof of Lemma 2.6.6. Lemma 2.6.1. [Theorem 2.6.7 of Cs˝rg˝ and R´v´sz (1981)] Suppose that ξ i , 1 ≤ i < ∞ are o o e e i.i.d. with E(ξ 1 ) = 0, E(ξ 2 ) = 1 and H(x) > 0 (x ≥ 0) is an increasing continuous function 1 such that x−2−γ H(x) is increasing in x for some γ > 0 and x−1 logH(x) is decreasing in x with EH (|ξ 1 |) < ∞. Then there exist a sequence of Brownian motions {Wn (l)}∞ and n=1 constants C1 , C2 , a > 0, depending only on the distribution of ξ 1 and such that for any {xn }∞ satisfying H −1 (n) < xn < C1 (nlogn)1/2 and Sl = n=1 P max Sl − Wn (l) > xn 1≤l≤n l i=1 ξ i , ≤ C2 n {H (axn )}−1 . The next lemma is a special case of Theorem 13.4.3, Page 404 of DeVore and Lorentz (1993). Let p be a positive integer. A matrix A = aij is said to have bandwidth p if aij = 0 when |i − j| ≥ p, and p is the smallest integer with this property. Denote by 36 d = A 2 A−1 and d is the condition number of A. 2 Lemma 2.6.2. If a matrix A with bandwidth p has an inverse A−1 , then A−1 1/(4p) 2c0 (1 − τ )−1 , with c0 = τ −2p A−1 , τ = (d2 − 1)/(d2 + 1) . 
2 ∞ ≤ N B (j/N ) Y Nm ¯ .j J=1−p , where the theoretj=1 J,p Nm ical and empirical inner product matrices of BJ,p (x) are denoted as J=1−p ˆ One writes XT X = N Vp , XT Y = Vp = BJ,p , BJ ,p Nm ˆ , Vp = J,J =1−p Nm BJ,p , B . J ,p 2,N J,J =1−p (2.17) We establish next that the theoretical inner product matrix Vp defined in (2.17) has an inverse with bounded L∞ norm. Lemma 2.6.3. For any positive integer p, there exists a constant Mp > 0 depending only −1 on p, such that Vp ≤ Mp h−1 , where hm = (Nm + 1)−1 . m ∞ Proof. According to Lemma A.1 in Wang and Yang (2009b), Vp is invertible since −1 it is a symmetric matrix with all eigenvalues positive, i.e. 0 < cp Nm ≤ λmin Vp ≤ −1 λmax Vp ≤ Cp Nm < ∞, where cp and Cp are positive real numbers. The compact support of B-spline basis makes Vp of bandwidth p, hence one can apply Lemma 2.6.2. Since dp = λmax Vp /λmin Vp ≤ Cp /cp , hence τ p = d2 − 1 p 1/4p d2 + 1 p −1/4p −1/4p 1/4p 2 2 Cp c−2 + 1 < 1. ≤ Cp c−2 − 1 p p −1 If p = 1, then Vp = h−1 INm +p , the lemma holds with Mp = 1. If p > 1, let u1−p = m T T 1, 0T +p−1 , u0 = 0T , 1, 0T p−1 Nm Nm , then u1−p 2 = u0 2 = 1. Also lemma A.1 37 in Wang and Yang (2009b) implies that 2 2 = λmin Vp u1−p ≤ uT Vp u1−p = B1−p,p , 1−p 2 2 2 uT Vp u0 = B0,p ≤ λmax Vp u0 2 = λmax Vp , 2 0 2 λmin Vp hence dp = λmax Vp /λmin Vp ≥ 2 −2 = rp > 1, where rp is an B1−p,p B0,p 2 2 absolute constant depending only on p. Thus 1 1 −1 −1 2 2 τ p = d2 − 1 4p d2 + 1 4p ≥ rp − 1 4p rp + 1 4p > 0. Applying Lemma 2.6.2 p p and putting the above bounds together, one obtains −1 Vp −2p −1 hm ≤ 2τ p Vp 1 − τ p −1 hm ∞ 2   1/4p −1 1/2  2 2   rp + 1 Cp c−2 − 1 p hm ≤ 2 λ−1 Vp × 1 − min 2 2   rp − 1   Cp c−2 + 1 p  −1  2 c−2 − 1 1/4p  2 + 1 1/2   rp −1 1 − Cp p cp ≡ Mp . ≤ 2 2 2   rp − 1   Cp c−2 + 1 p The lemma is proved. T For any function φ ∈ C [0, 1], denote the vector φ = (φ (1/N ) , . . . , φ (N/N )) and function ˜ φ (x) ≡ B1−p,p (x) , . . . , BNm ,p (x) −1 T T X X X φ. ˆ ˆ Lemma 2.6.4. Under Assumption (A3), for Vp and Vp defined in (2.17), Vp − Vp ∞ = ˆ −1 and Vp ≤ 2Mp h−1 , for large enough n. There exists cφ,p ∈ (0, ∞) such m ∞ ˜ that when n is large enough, φ ≤ cφ,p φ ∞ for any φ ∈ C [0, 1]. Furthermore, if ∞ O N −1 38 ˜ φ ∈ C p−1,µ [0, 1] for some µ ∈ (0, 1], then for Cp−1,µ = cφ,p + 1 Cp−1,µ ˜ φ−φ ∞ ˜ ≤ Cp−1,µ φ(p−1) ˆ Proof. We first show that Vp − Vp ∞ µ+p−1 hm . 0,µ (2.18) = O N −1 . Suppose p = 1, define for any 0 ≤ J ≤ Nm , the number of design points j/N in the J-th interval IJ as NJ , then NJ = # {j : j ∈ [N J/ (Nm + 1) , N (J + 1) / (Nm + 1))}, for 0 ≤ J < Nm , and NJ = # {j : j ∈ [N J/ (Nm + 1) , N (J + 1) / (Nm + 1)]}, for J = Nm . Clearly max0≤J≤Nm NJ − N hm ≤ 1, and hence ˆ V 1 − V1 ∞ = max 0≤J≤Nm 2 2 − BJ,1 BJ,1 2 2,N N N −1 max B 2 (j/N ) − hm j=1 J,1 0≤J≤Nm NJ − N hm ≤ N −1 . N −1 NJ − hm = N −1 max = max 0≤J≤Nm 0≤J≤Nm = For p > 1, de Boor (2001), Page 96, B-spline property ensures that there exists a constant C1,p > 0 such that max max sup BJ,p (j/N ) B (j/N ) − BJ,p (x) B (x) J ,p J ,p 1−p≤J,J ≤Nm 1≤j≤N x∈[(j−1)/N,j/N ] ≤ C1,p N −1 h−1 , m while there exists a constant C2,p > 0 such that max1−p≤J,J ≤N NJ,J ≤ C2,p N hm m 39 where N = # j : 1 ≤ j ≤ N, BJ,p (j/N ) B (j/N ) > 0 . Hence J,J J ,p ˆ V p − Vp ∞ 1 N BJ,p (x) BJ ,p (x) dx BJ,p (j/N ) BJ ,p (j/N ) − j=1 0 j/N N ≤ max B (j/N ) BJ ,p (j/N ) − BJ,p (x) BJ ,p (x) dx j=1 (j−1)/N J,p 1−p≤J,J ≤Nm ≤ C2,p N hm × N −1 × C1,p N −1 h−1 ≤ CN −1 . m = max N −1 1−p≤J,J ≤Nm −1 ≤ h−1 γ ∞ . 
Hence, According to Lemma 2.6.3, for any (Nm + p) vector γ, Vp γ m ∞ −1 Vp γ ∞ ≥ Mp hm γ ∞ . By Assumption (A3), N −1 = o (hm ) so if N is large enough, for any γ, one has −1 γ −1 ˆ ≥ hm Mp ≥ Vp γ ∞ − Vp γ − Vp γ ∞−O N ∞ ∞ hm −1 Mp γ ∞. = 2 ˆ Vp γ γ ∞ ˆ −1 Hence Vp ≤ 2Mp h−1 . m ∞ To prove the last statement of the lemma, note that for any x ∈ [0, 1] at most (p + 1) of the numbers B1−p,p (x) , . . . , BNm ,p (x) are between 0 and 1, others being 0, so ˜ φ (x) ≤ (p + 1) −1 T T T ˆ −1 X φN −1 X X X φ = (p + 1) Vp T T ˆ −1 ≤ (p + 1) Vp X φN −1 ≤ 2 (p + 1) Mp h−1 X IN N −1 m ∞ φ ∞ T T ˜ in which IN = (1, ..., 1) . Clearly X IN N −1 ≤ Chm for some C > 0, hence φ (x) ≤ 2Mp (p + 1) C φ ∞ = cφ,p φ ∞ . Now if φ ∈ C p−1,µ [0, 1] for some µ ∈ (0, 1], let 40 µ+p−1 hm according to g ∈ H(p−1) [0, 1] be such that g − φ ∞ ≤ Cp−1,µ φ(p−1) 0,µ Theorem 2.3.1, then g ≡ g as g ∈ H(p−1) [0, 1] hence ˜ ∞ = ˜ ˜ φ − g − (φ − g) ≤ ˜ φ−φ cφ,p + 1 ∞ ˜ ˜ ≤ φ−g ∞ + φ−g ∞ µ+p−1 hm φ − g ∞ ≤ cφ,p + 1 Cp−1,µ φ(p−1) 0,µ proving (2.18). Lemma 2.6.5. Under Assumption (A5), for C0 = C1 1 + βC2 ∞ sβ−1−γ 1 s=1 and n≥1 ¯ ξ max E ¯.,k − Z.k,ξ 1≤k≤κ ¯ max ¯.,j − Z.j,ε ε 1≤j≤N ¯ where Z.k,ξ = n−1 n Z −1 ¯ i=1 ik,ξ , Z.j,ε = n ≤ C0 nβ−1 , (2.19) = Oa.s. nβ−1 (2.20) n Z i=1 ij,ε , 1 ≤ j ≤ N , 1 ≤ k ≤ κ. Also max E ¯.,k ≤ n−1/2 (2/π)1/2 + C0 nβ−1 . ξ 1≤k≤κ Proof. The proof of (2.20) is trivial. Assumption (A5) entails that ¯ Fn+t,k < C2 (n + t)−γ 1 , k = 1, ..., κ, t = 0, 1, ..., ∞, in which 41 (2.21) n ξ − i=1 ik ¯ Fn+t,k = P E n Z β i=1 ik,ξ > C1 (n + t) . Taking expectation, one has n ξ − i=1 ik ≤ C1 (n + 0)β + n Z i=1 ik,ξ ∞ ¯ ¯ C (n + t)β Fn+t−1,k − Fn+t,k t=1 1 ∞ C C (n + t)−γ 1 β (n + t)β−1 t=0 1 2 ∞ ≤ C1 nβ + βC2 (n + t)β−1−γ 1 t=0 ∞ sn−1 ≤ nβ C1 1 + βC2 n−1−γ 1 (1 + t/n)β−1−γ 1 s=1 t=sn−n ∞ β−1−γ 1 ≤ C 0 nβ , ≤ nβ C1 1 + βC2 n−1−γ 1 × n t t=1 ≤ C 1 nβ + ¯ which proves (2.19) if one divides the above inequalities by n. The fact that Z.k,ξ ∼ ¯ N (0, 1/n) entails that E Z.k,ξ = n−1/2 (2/π)1/2 and thus ξ max1≤k≤κ E ¯.,k ≤ n−1/2 (2/π)1/2 + C0 nβ−1 . Lemma 2.6.6. Assumption (A5) holds under Assumption (A5’). τ Proof. Under Assumption (A5 ), E ξ ik τ 1 < +∞, τ 1 > 4, E εij 2 < +∞, τ 2 > 4 + 2θ, so there exists some β ∈ (0, 1/2) such that τ 1 > 2/β, τ 2 > (2 + θ) /β. Now let H(x) = xτ 1 , then Lemma 2.6.1 entails that there exists constants C1k , C2k , ak which depend on the distribution of ξ ik , such that for −τ −τ n xn = C1k nβ , = a 1 C 1 n1−τ 1 β and i.i.d. N (0, 1) variables Zik,ξ such that 1k k H ak xn P max 1≤t≤n t ξ − i=1 ik t −τ −τ Zik,ξ > C1k nβ < C2k a 1 C 1 n1−τ 1 β . k 1k i=1 Since τ 1 > 2/β, γ 1 = τ 1 β − 1 > 1. If the number κ of k is finite, so there are common con42 t ξ − i=1 ik stants C1 , C2 > 0 such that P max1≤t≤n −γ 1 t Z β i=1 ik,ξ > C1 n < C2 n which entails (2.3) since κ is finite. If κ is infinite but all the ξ ik ’s are i.i.d., then C1k , C2k , ak are the same for all k, so the above is again true. Likewise, under Assumption (A5 ), Lemma 2.6.1 applied to H(x) = xτ 2 implies that there exists constants C1 , C2 and a which depend on the distribution of εij and i.i.d. N (0, 1) −τ n variables Zij,ε , such that with xn = C1 nβ , = a−τ 2 C1 2 n1−τ 2 β and H ak x n max P 1≤j≤N t ε − i=1 ij max 1≤t≤n t Z > C1 nβ i=1 ij,ε −τ ≤ C2 a−τ 2 C1 2 n1−τ 2 β , Hence, τ 2 β > 2 + θ which implies that there is γ 2 > 1 such that τ 2 β − 1 > γ 2 + θ and (2.4) follows. Proof of Proposition 2.3.1. Applying (2.18), p mp − m ∞ ≤ Cp−1,1 hm . ˜ Since Assumption (A3) implies that p O hm n1/2 = o (1), equation (2.10) is proved. 
Proof of Proposition 2.3.2. Denote $\tilde Z_{p,\varepsilon}(x) = (B_{1-p,p}(x),\ldots,B_{N_m,p}(x))(X^TX)^{-1}X^T Z$, where $Z = (\sigma(1/N)\bar Z_{\cdot 1,\varepsilon}, \ldots, \sigma(N/N)\bar Z_{\cdot N,\varepsilon})^T$. By (2.20), one has $\|Z - e\|_\infty = O_{a.s.}(n^{\beta-1})$, while

$$\|N^{-1}X^T(Z - e)\|_\infty \le \|Z - e\|_\infty\max_{1-p\le J\le N_m}\langle B_{J,p}, 1\rangle_{2,N} \le \|Z - e\|_\infty\max_{1-p\le J\le N_m}\#\{j: B_{J,p}(j/N) > 0\}\, N^{-1} \le C\|Z - e\|_\infty h_m.$$

Also, for any fixed $x \in [0,1]$, one has

$$\left|\tilde Z_{p,\varepsilon}(x) - \tilde e_p(x)\right| = \left|(B_{1-p,p}(x),\ldots,B_{N_m,p}(x))\hat V_p^{-1}N^{-1}X^T(Z - e)\right| \le C\|\hat V_p^{-1}\|_\infty\|Z - e\|_\infty h_m = O_{a.s.}(n^{\beta-1}).$$

Note next that the random vector $\hat V_p^{-1}N^{-1}X^T Z$ is $(N_m+p)$-dimensional normal with covariance matrix $N^{-2}\hat V_p^{-1}X^T\mathrm{var}(Z)X\hat V_p^{-1}$, bounded above by

$$\max_{x\in[0,1]}\sigma^2(x)\, n^{-1}N^{-1}\hat V_p^{-1}\hat V_p\hat V_p^{-1} \le CN^{-1}n^{-1}\|\hat V_p^{-1}\|_\infty \le CN^{-1}n^{-1}h_m^{-1};$$

bounding the tail probabilities of the entries of $\hat V_p^{-1}N^{-1}X^T Z$ and applying the Borel-Cantelli Lemma leads to

$$\|\hat V_p^{-1}N^{-1}X^T Z\|_\infty = O_{a.s.}\left(N^{-1/2}n^{-1/2}h_m^{-1/2}\log^{1/2}(N_m+p)\right) = O_{a.s.}\left(N^{-1/2}n^{-1/2}h_m^{-1/2}\log^{1/2}n\right).$$

Hence, $\sup_{x\in[0,1]}n^{1/2}|\tilde Z_{p,\varepsilon}(x)| = O_{a.s.}(N^{-1/2}h_m^{-1/2}\log^{1/2}n)$ and

$$\sup_{x\in[0,1]}n^{1/2}|\tilde e_p(x)| = O_{a.s.}\left(n^{\beta-1/2} + N^{-1/2}h_m^{-1/2}\log^{1/2}n\right) = o_{a.s.}(1).$$

Thus (2.11) holds according to Assumption (A3).

Proof of Proposition 2.3.3. We denote $\tilde\zeta_k(x) = \bar Z_{\cdot k,\xi}\phi_k(x)$, $k = 1,\ldots,\kappa$, and define

$$\tilde\zeta(x) = n^{1/2}\left\{\sum_{k=1}^\kappa\phi_k^2(x)\right\}^{-1/2}\sum_{k=1}^\kappa\tilde\zeta_k(x) = n^{1/2}G(x,x)^{-1/2}\sum_{k=1}^\kappa\tilde\zeta_k(x).$$

It is clear that $\tilde\zeta(x)$ is a Gaussian process with mean 0, variance 1 and covariance $E\tilde\zeta(x)\tilde\zeta(x') = G(x,x)^{-1/2}G(x',x')^{-1/2}G(x,x')$, for any $x, x' \in [0,1]$. Thus $\{\tilde\zeta(x), x \in [0,1]\}$ has the same distribution as $\{\zeta(x), x \in [0,1]\}$. Using Lemma 2.6.4, one obtains

$$\|\tilde\phi_k\|_\infty \le c_{\phi,p}\|\phi_k\|_\infty, \qquad \|\tilde\phi_k - \phi_k\|_\infty \le \tilde C_{0,\mu}\|\phi_k\|_{0,\mu}h_m^\mu, \qquad 1\le k\le\kappa. \quad (2.22)$$

Applying (2.22), (2.21) and Assumptions (A3), (A4), one has

$$E\, n^{1/2}\sup_{x\in[0,1]}G(x,x)^{-1/2}\left|\sum_{k=1}^\kappa\bar\xi_{\cdot k}\left\{\tilde\phi_k(x) - \phi_k(x)\right\}\right| \le Cn^{1/2}\left\{\sum_{k=1}^{\kappa_n}E|\bar\xi_{\cdot k}|\,\|\phi_k\|_{0,\mu}h_m^\mu + \sum_{k=\kappa_n+1}^\kappa E|\bar\xi_{\cdot k}|\,\|\phi_k\|_\infty\right\} \le C\left\{\sum_{k=1}^{\kappa_n}\|\phi_k\|_{0,\mu}h_m^\mu + \sum_{k=\kappa_n+1}^\kappa\|\phi_k\|_\infty\right\} = o(1),$$

hence $n^{1/2}\sup_{x\in[0,1]}G(x,x)^{-1/2}|\sum_{k=1}^\kappa\bar\xi_{\cdot k}\{\tilde\phi_k(x) - \phi_k(x)\}| = o_P(1)$. In addition, (2.19) and Assumptions (A3) and (A4) entail

$$E\, n^{1/2}\sup_{x\in[0,1]}G(x,x)^{-1/2}\left|\sum_{k=1}^\kappa\left(\bar Z_{\cdot k,\xi} - \bar\xi_{\cdot k}\right)\phi_k(x)\right| \le Cn^{\beta-1/2}\sum_{k=1}^\kappa\|\phi_k\|_\infty = o(1). \quad (2.23)$$

Hence $n^{1/2}\sup_{x\in[0,1]}G(x,x)^{-1/2}|\sum_{k=1}^\kappa(\bar Z_{\cdot k,\xi} - \bar\xi_{\cdot k})\phi_k(x)| = o_P(1)$. Note that

$$\bar m(x) - m(x) - \tilde\xi_p(x) = \sum_{k=1}^\kappa\bar\xi_{\cdot k}\left\{\phi_k(x) - \tilde\phi_k(x)\right\}, \qquad n^{-1/2}G(x,x)^{1/2}\tilde\zeta(x) - \left\{\bar m(x) - m(x)\right\} = \sum_{k=1}^\kappa\left(\bar Z_{\cdot k,\xi} - \bar\xi_{\cdot k}\right)\phi_k(x), \quad (2.24)$$

hence

$$n^{1/2}\sup_{x\in[0,1]}G(x,x)^{-1/2}\left|\bar m(x) - m(x) - \tilde\xi_p(x)\right| = o_P(1), \qquad \sup_{x\in[0,1]}\left|\tilde\zeta(x) - n^{1/2}G(x,x)^{-1/2}\left\{\bar m(x) - m(x)\right\}\right| = o_P(1),$$

according to (2.23) and (2.24), which leads to both (2.12) and (2.13).

Chapter 3

Confidence Envelopes for Covariance Functions

3.1 Introduction

Covariance estimation is crucial in both functional and longitudinal data analysis. For longitudinal data, a good estimate of the covariance function improves the estimation efficiency of the mean parameters (Wang, Carroll and Lin, 2005; Fan, Huang and Li, 2007). In functional data analysis (Ramsay and Silverman, 2005), covariance estimation plays a critical role in functional principal component analysis (James, Hastie and Sugar, 2000; Zhao, Marron and Wells, 2004; Yao, Müller and Wang, 2005a; Hall, Müller and Wang, 2006; Yao and Lee, 2006; Zhou, Huang and Carroll, 2008; Li and Hsing, 2010a), functional generalized linear models (Cai and Hall, 2005; Yao, Müller and Wang, 2005b; Li, Wang and Carroll, 2010), and other functional nonlinear models (James and Silverman, 2005; Li and Hsing, 2010b).
Other related work on functional data analysis includes Ferraty and Vieu (2006) and Morris and Carroll (2006).

There are some important recent works on nonparametric covariance estimation in functional data, mostly based on kernel smoothing, for example Yao et al. (2005a), Hall et al. (2006) and Li and Hsing (2010a). More recently, Cai and Yuan (2010) also proposed a smoothing spline covariance estimator. So far, all existing work has concentrated on estimation and the corresponding asymptotic convergence rate. There is no theoretical or methodological development for inference procedures on the covariance functions, such as simultaneous or uniform confidence envelopes. Nonparametric simultaneous confidence regions are powerful tools for making global inference on functions; see Härdle and Marron (1991), Claeskens and Van Keilegom (2003) and Zhao and Wu (2008) for related theory and applications.

In this chapter, we consider a typical functional data setting where the functions are recorded on a dense regular grid in an interval $\mathcal X$ and the measurements are contaminated with measurement errors. Some recent applications of this type of functional data include near infrared spectra (Li and Hsing, 2010a), recorded speeches for voice recognition (Hastie, Tibshirani and Buja, 1995), and electroencephalogram (EEG) data (Crainiceanu, Staicu and Di, 2009). We propose to estimate the covariance function by tensor product B-splines. In contrast with the kernel methods (Yao et al., 2005a; Hall et al., 2006; Li and Hsing, 2010b), our proposed spline estimator is much more efficient in terms of computation. The reason is that kernel smoothers are evaluated pointwise, while for the spline estimator, we only need to solve for a small number of spline coefficients to obtain an explicit expression for the whole function. For smoothing a two-dimensional covariance surface with a moderate sample size, the kernel smoother might take up to half an hour, while our spline estimator takes only a few seconds. Computational efficiency is a huge advantage of spline methods in analyzing large data sets and in performing simulation studies. The reader is referred to Huang and Yang (2004) for more discussion of the computational merits of spline methods. Compared with the smoothing spline approach of Cai and Yuan (2010), our method uses reduced rank tensor product B-splines, and is potentially faster when analyzing large data sets.

We show that the estimation error in the mean function is asymptotically negligible in estimating the covariance function, and that our covariance estimator is as efficient as an "oracle" estimator for which the true mean function is known. We derive both local and global asymptotic distributions for the proposed spline covariance estimator. In particular, based on the asymptotic distribution of the maximum deviation of the estimator, we propose a new simultaneous confidence envelope for the covariance function, which can be used to visualize the variability of the covariance estimator and to make global inferences on the shape of the true covariance. We apply the proposed confidence envelope method to a Tecator near infrared spectra data set to test the hypothesis that the covariance is stationary. In a speech recognition application, the classic functional linear discriminant analysis (Hastie et al., 1995; James and Hastie, 2001) assumes that the random curves from different classes share a common covariance function.
We further extend our confidence envelope method to a two-sample problem, where one can test whether the covariance functions of two groups are different.

We organize this chapter as follows. In Section 3.2 we describe the data structure and the proposed spline covariance estimator. In Section 3.3, we study the local and global asymptotic properties of the proposed estimator. Based on the theory, we propose a new confidence envelope approach and extend the method to two-sample hypothesis testing problems. More implementation details of the proposed confidence envelopes are provided in Section 3.4. We present simulation studies in Section 3.5 and applications to the Tecator infrared spectroscopy and speech recognition data sets in Section 3.6. Some concluding remarks are provided in Section 3.7. All proofs of the theorems and technical lemmas are provided in the appendix and the supplementary material.

3.2 Spline covariance estimation

3.2.1 Data structure and model assumptions

Following Ramsay and Silverman (2005), the data that we consider are a collection of trajectories $\{\eta_i(x)\}_{i=1}^n$ which are i.i.d. realizations of a smooth random function $\eta(x)$, defined on a continuous interval $\mathcal X$. Assume that $\{\eta(x), x\in\mathcal X\}$ is an $L^2$ process, i.e. $E\int_{\mathcal X}\eta^2(x)\,dx < +\infty$, and define the mean and covariance functions as $m(x) = E\{\eta(x)\}$ and $G(x,x') = \mathrm{cov}\{\eta(x), \eta(x')\}$. The covariance function is a symmetric nonnegative-definite function with a spectral decomposition $G(x,x') = \sum_{k=1}^\kappa\lambda_k\psi_k(x)\psi_k(x')$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$, $\sum_{k=1}^\kappa\lambda_k < +\infty$, are the eigenvalues and $\{\psi_k(x)\}_{k=1}^\kappa$ are the corresponding eigenfunctions, a set of orthonormal functions in $L^2(\mathcal X)$. By the Karhunen-Loève representation, $\eta_i(x) = m(x) + \sum_{k=1}^\kappa\xi_{ik}\phi_k(x)$, where the random coefficients $\xi_{ik}$ are uncorrelated with mean 0 and variance 1, and the functions $\phi_k = \sqrt{\lambda_k}\psi_k$. In the standard Karhunen-Loève expansion, $\kappa$ can diverge to $\infty$. For practical consideration, $\kappa$ is always truncated at a finite number in real data analysis. Our main theoretical results are developed under the assumption that $\kappa$ is a finite positive integer, but some of our theoretical results can be further generalized to the infinite-dimensional case, which will be discussed in Section 3.3.

Without loss of generality, we take $\mathcal X = [0,1]$. Then the observed data are $Y_{ij} = \eta_i(X_{ij}) + \sigma(X_{ij})\varepsilon_{ij}$, for $1\le i\le n$, $1\le j\le N$, where $X_{ij} = j/N$, the $\varepsilon_{ij}$ are i.i.d. random errors with $E(\varepsilon_{11}) = 0$ and $E(\varepsilon_{11}^2) = 1$, and $\sigma^2(x)$ is the variance function of the measurement errors. By the Karhunen-Loève representation, the observed data can be written as

$$Y_{ij} = m(j/N) + \sum_{k=1}^\kappa\xi_{ik}\phi_k(j/N) + \sigma(j/N)\varepsilon_{ij}.$$

We model $m(\cdot)$ and $G(\cdot,\cdot)$ as nonparametric functions, and hence $\{\lambda_k\}_{k=1}^\kappa$, $\{\phi_k(\cdot)\}_{k=1}^\kappa$ and $\{\xi_{ik}\}_{k=1}^\kappa$ are unknown and need to be estimated.

3.2.2 Spline covariance estimator

To describe the tensor product spline estimator of the covariance function, we first introduce some notation. Denote a sequence of equally spaced points $\{t_J\}_{J=1}^{N_s}$, called interior knots, which have been defined in Chapter 1. Let $h_s = 1/(N_s+1)$ be the distance between neighboring knots. Let $H^{(p-2)} = H^{(p-2)}[0,1]$ be the polynomial spline space of order $p$. The $J$-th B-spline of order $p$ is denoted by $B_{J,p}$ as in de Boor (2001). Thus we define the tensor product spline space as

$$H^{(p-2),2}[0,1]^2 \equiv H^{(p-2),2} = H^{(p-2)}\otimes H^{(p-2)} = \left\{\sum_{J,J'=1-p}^{N_s}b_{JJ'p}B_{J,p}(x)B_{J',p}(x'),\ b_{JJ'p}\in\mathbb R,\ x,x'\in[0,1]\right\}.$$
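To make this construction concrete, the sketch below evaluates such a tensor product basis on a grid (Python with numpy/scipy; the helper names, orders and knot counts are illustrative assumptions, not code from this dissertation). Each column of the returned matrix is one product $B_{J,p}(x)B_{J',p}(x')$, so fitting over this basis reduces to ordinary least squares.

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_basis(x, n_knots, p):
        # All B-splines of order p (degree p - 1) on [0, 1] with n_knots
        # equally spaced interior knots; returns shape (len(x), n_knots + p).
        interior = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
        knots = np.concatenate([np.zeros(p), interior, np.ones(p)])
        # Identity coefficients yield one column per basis function.
        return BSpline(knots, np.eye(n_knots + p), p - 1)(x)

    def tensor_basis(x1, x2, n_knots, p):
        # Row-wise products B_{J,p}(x1[i]) * B_{J',p}(x2[i]) for paired inputs.
        B1 = bspline_basis(np.asarray(x1), n_knots, p)
        B2 = bspline_basis(np.asarray(x2), n_knots, p)
        return np.einsum("ij,ik->ijk", B1, B2).reshape(len(B1), -1)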
If the mean function $m(x)$ were known, one could compute the errors

$$U_{ij} \equiv Y_{ij} - m(j/N) = \sum_{k=1}^\kappa\xi_{ik}\phi_k(j/N) + \sigma(j/N)\varepsilon_{ij}, \quad 1\le i\le n,\ 1\le j\le N.$$

Denote $\bar U_{\cdot jj'} = n^{-1}\sum_{i=1}^n U_{ij}U_{ij'}$, $1\le j\ne j'\le N$; one can then define the "oracle" estimator of the covariance function

$$\tilde G_{p_2}(\cdot,\cdot) = \mathop{\mathrm{argmin}}_{g(\cdot,\cdot)\in H^{(p_2-2),2}}\sum_{1\le j\ne j'\le N}\left\{\bar U_{\cdot jj'} - g(j/N, j'/N)\right\}^2, \quad (3.1)$$

using tensor product splines of order $p_2 \ge 2$. Since the mean function $m(x)$ is unavailable when one analyzes data, one can use instead the spline smoother of $m(x)$, i.e.,

$$\hat m_{p_1}(\cdot) = \mathop{\mathrm{argmin}}_{g(\cdot)\in H^{(p_1-2)}}\sum_{i=1}^n\sum_{j=1}^N\left\{Y_{ij} - g(j/N)\right\}^2, \quad p_1\ge 1.$$

To mimic the above "oracle" smoother, we define

$$\hat G_{p_1,p_2}(\cdot,\cdot) = \mathop{\mathrm{argmin}}_{g(\cdot,\cdot)\in H^{(p_2-2),2}}\sum_{1\le j\ne j'\le N}\left\{\hat{\bar U}_{\cdot jj',p_1} - g(j/N, j'/N)\right\}^2, \quad (3.2)$$

where $\hat{\bar U}_{\cdot jj',p_1} = n^{-1}\sum_{i=1}^n\hat U_{ijp_1}\hat U_{ij'p_1}$ with $\hat U_{ijp_1} = Y_{ij} - \hat m_{p_1}(j/N)$. Let $N_{s_1}$ be the number of interior knots for mean estimation, and $N_{s_2}$ be the number of interior knots for $\hat G_{p_1,p_2}(x,x')$ in each coordinate. In other words, we have $N_{s_2}^2$ interior knots for the tensor product spline space $H^{(p_2-2),2}$.

We now provide a detailed algorithm for the spline covariance estimator. For simplicity, denote

$$B_{JJ',p_2}(x,x') = B_{J,p_2}(x)B_{J',p_2}(x'),$$
$$B_{p_2}(x,x') = \left(B_{1-p_2,1-p_2,p_2}(x,x'),\ldots,B_{N_{s_2},1-p_2,p_2}(x,x'),\ldots,B_{1-p_2,N_{s_2},p_2}(x,x'),\ldots,B_{N_{s_2},N_{s_2},p_2}(x,x')\right)^T,$$
$$\mathbb X = \left(B_{p_2}\left(\tfrac2N,\tfrac1N\right),\ldots,B_{p_2}\left(1,\tfrac1N\right),\ldots,B_{p_2}\left(\tfrac1N,1\right),\ldots,B_{p_2}\left(1-\tfrac1N,1\right)\right)^T.$$

Then $\hat G_{p_1,p_2}(x,x')$ defined in (3.2) can be rewritten as

$$\hat G_{p_1,p_2}(x,x') \equiv \hat\beta_{p_1,p_2}^T B_{p_2}(x,x'), \quad (3.3)$$

where $\hat\beta_{p_1,p_2}$ is the collection of the estimated spline coefficients obtained by solving the least squares problem

$$\hat\beta_{p_1,p_2} = \mathop{\mathrm{argmin}}_{b_{p_2}\in\mathbb R^{(N_{s_2}+p_2)^2}}\sum_{1\le j\ne j'\le N}\left\{\hat{\bar U}_{\cdot jj',p_1} - b_{p_2}^T B_{p_2}(j/N, j'/N)\right\}^2.$$

By elementary algebra, one obtains

$$\hat G_{p_1,p_2}(x,x') = B_{p_2}^T(x,x')(\mathbb X^T\mathbb X)^{-1}\mathbb X^T\hat{\bar U}_{p_1}, \qquad \tilde G_{p_2}(x,x') = B_{p_2}^T(x,x')(\mathbb X^T\mathbb X)^{-1}\mathbb X^T\bar U, \quad (3.4)$$

where $\hat{\bar U}_{p_1} = (\hat{\bar U}_{\cdot 21,p_1},\ldots,\hat{\bar U}_{\cdot N1,p_1},\ldots,\hat{\bar U}_{\cdot 1N,p_1},\ldots,\hat{\bar U}_{\cdot (N-1)N,p_1})^T$ and $\bar U = (\bar U_{\cdot 21},\ldots,\bar U_{\cdot N1},\ldots,\bar U_{\cdot 1N},\ldots,\bar U_{\cdot (N-1)N})^T$.

3.3 Asymptotic theory and simultaneous confidence envelopes

3.3.1 Assumptions and the oracle property

For any $\nu\in(0,1]$, we denote by $C^{q,\nu}[0,1]$ the space of $\nu$-Hölder continuous functions on $[0,1]$,

$$C^{q,\nu}[0,1] = \left\{\phi:\ \|\phi\|_{q,\nu} = \sup_{x\ne x',\, x,x'\in[0,1]}\frac{|\phi^{(q)}(x) - \phi^{(q)}(x')|}{|x - x'|^\nu} < +\infty\right\}.$$

We need the following technical assumptions.

(B1) The regression function $m\in C^{p_1-1,1}[0,1]$.

(B2) The standard deviation function $\sigma(x)\in C^{0,\nu}[0,1]$. For any $k = 1,2,\ldots,\kappa$, $\phi_k(x)\in C^{p_2-1,\nu}[0,1]$. Also $\sup_{(x,x')\in[0,1]^2}G(x,x') < C$ for some positive constant $C$, and $\min_{x\in[0,1]}G(x,x) > 0$.

(B3) The numbers of knots $N_{s_1}$ and $N_{s_2}$ satisfy $n^{1/(4p_1)} \ll N_{s_1} \ll \min(N^{1/2}, n^{1/3})$ and $N_{s_2} \ll N_{s_1}^{-1}N$, $n^{1/(2p_2)} \ll N_{s_2}$.

(B4) The number $\kappa$ of nonzero eigenvalues is finite. The variables $\{\xi_{ik}\}_{i=1,k=1}^{\infty,\kappa}$ and $\{\varepsilon_{ij}\}_{i=1,j=1}^{\infty,\infty}$ are independent. In addition, $E\varepsilon_{11} = 0$, $E\varepsilon_{11}^2 = 1$, $E\xi_{1k} = 0$, $E\xi_{1k}^2 = 1$ and $\max_{1\le k\le\kappa}E|\xi_{1k}|^{\delta_1} < +\infty$, $E|\varepsilon_{11}|^{\delta_2} < +\infty$, for some $\delta_1, \delta_2 > 4$.

Assumptions (B1)-(B4) are standard in the spline smoothing literature; see Huang (2003), for instance. In particular, (B1) and (B2) guarantee the orders of the bias terms of the spline smoothers for $m(x)$ and $\phi_k(x)$. Assumption (B3) is a weak assumption to ensure the order of the bias and noise terms in Propositions 3.3.1 and 3.3.2. Assumption (B4) is necessary for strong approximation. More discussion about the assumptions is in Section 3.4.
To gain a deeper understanding of the behavior of the spline covariance estimator $\hat G_{p_1,p_2}$ in (3.2), we first study the asymptotic property of $\tilde G_{p_2}$ in (3.1). Denote

$$\Delta(x,x') = \sum_{k,k'=1}^\kappa\phi_k(x)\phi_{k'}(x')\left(\bar\xi_{\cdot kk'} - \delta_{kk'}\right), \quad (3.5)$$

where $\bar\xi_{\cdot kk'} = n^{-1}\sum_{i=1}^n\xi_{ik}\xi_{ik'}$ and $\delta_{kk'} = 1$ for $k = k'$ and 0 otherwise.

Proposition 3.3.1. Under Assumptions (B2)-(B4), one has

$$\sup_{(x,x')\in[0,1]^2}\left|\tilde G_{p_2}(x,x') - G(x,x') - \Delta(x,x')\right| = o_p(n^{-1/2}). \quad (3.6)$$

The proof of Proposition 3.3.1 is provided in the supplementary material. The next proposition shows that the tensor product spline estimator $\hat G_{p_1,p_2}$ is uniformly close to the "oracle" smoother at the rate of $o_p(n^{-1/2})$.

Proposition 3.3.2. Under Assumptions (B1)-(B4), one has

$$\sup_{(x,x')\in[0,1]^2}\left|\hat G_{p_1,p_2}(x,x') - \tilde G_{p_2}(x,x')\right| = o_p(n^{-1/2}).$$

The proof of Proposition 3.3.2 is provided in the supplementary material. As a result of Propositions 3.3.1 and 3.3.2,

$$\sup_{(x,x')\in[0,1]^2}\left|\hat G_{p_1,p_2}(x,x') - G(x,x') - \Delta(x,x')\right| = o_p(n^{-1/2}).$$

3.3.2 Asymptotic confidence envelopes

The next theorem provides a pointwise approximation to the mean squared error of $\hat G_{p_1,p_2}(x,x')$.

Theorem 3.3.1. Under Assumptions (B1)-(B4),

$$nE\left[\hat G_{p_1,p_2}(x,x') - G(x,x')\right]^2 = V(x,x') + o(1),$$

where $V(x,x') = G(x,x')^2 + G(x,x)G(x',x') + \sum_{k=1}^\kappa\phi_k^2(x)\phi_k^2(x')\left(E\xi_{1k}^4 - 3\right)$.

Remark 3.3.1. Although the convergence result in Theorem 3.3.1 is derived under the assumption that $\kappa$ is a finite positive number, it continues to hold when $\kappa\to\infty$ as long as the sum in the definition of $V(x,x')$ still converges. The existence of $V(x,x')$ in the infinite-dimensional case is guaranteed by imposing the additional assumption that $E[\{\sup_{x\in\mathcal X}|\eta(x)|\}^{4+\delta}] < \infty$ for some $\delta > 0$, which is commonly assumed in functional data analysis (see Li and Hsing, 2010a). The assumption that $\kappa$ is finite is merely a practical consideration for the development of inference procedures. Further discussion on how to choose $\kappa$ in real data will be provided in Section 3.4.

To obtain the quantile of the distribution of $n^{1/2}|\hat G_{p_1,p_2}(x,x') - G(x,x')|V^{-1/2}(x,x')$, one defines

$$\zeta_Z(x,x') = \left\{\sum_{k\ne k'}^\kappa Z_{kk'}\phi_k(x)\phi_{k'}(x') + \sum_{k=1}^\kappa\phi_k(x)\phi_k(x')Z_k\left(E\xi_{1k}^4 - 1\right)^{1/2}\right\}V^{-1/2}(x,x'), \quad (3.7)$$

where $Z_{kk'} = Z_{k'k}$ and $Z_k$ are i.i.d. standard Gaussian random variables. Hence, for any $(x,x')\in[0,1]^2$, $\zeta_Z(x,x')$ is a standardized Gaussian field such that $E\zeta_Z(x,x') = 0$, $E\zeta_Z^2(x,x') = 1$. Define $Q_{1-\alpha}$ as the $100(1-\alpha)$th percentile of the absolute maxima distribution of $\zeta_Z(x,x')$, $\forall(x,x')\in[0,1]^2$, i.e.

$$P\left\{\sup_{(x,x')\in[0,1]^2}|\zeta_Z(x,x')| \le Q_{1-\alpha}\right\} = 1-\alpha, \quad \forall\alpha\in(0,1).$$

The following theorem and corollary address the simultaneous envelopes for $G(x,x')$.

Theorem 3.3.2. Under Assumptions (B1)-(B4), for any $\alpha\in(0,1)$,

$$\lim_{n\to\infty}P\left\{\sup_{(x,x')\in[0,1]^2}n^{1/2}\left|\hat G_{p_1,p_2}(x,x') - G(x,x')\right|V^{-1/2}(x,x') \le Q_{1-\alpha}\right\} = 1-\alpha,$$
$$\lim_{n\to\infty}P\left\{n^{1/2}\left|\hat G_{p_1,p_2}(x,x') - G(x,x')\right|V^{-1/2}(x,x') \le Z_{1-\alpha/2}\right\} = 1-\alpha, \quad \forall(x,x')\in[0,1]^2,$$

where $Z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution.

Remark 3.3.2. Although this covariance function estimator cannot be guaranteed to be positive definite, it tends to the true positive-definite covariance function in probability.

The next result follows directly from Theorem 3.3.2.
Corollary 3.3.3. Under Assumptions (B1)-(B4), as $n\to\infty$, an asymptotic $100(1-\alpha)\%$ confidence envelope for $G(x,x')$, $\forall(x,x')\in[0,1]^2$, is

$$\hat G_{p_1,p_2}(x,x') \pm n^{-1/2}Q_{1-\alpha}V^{1/2}(x,x'), \quad \forall\alpha\in(0,1), \quad (3.8)$$

while an asymptotic $100(1-\alpha)\%$ pointwise confidence envelope for $G(x,x')$, $\forall(x,x')\in[0,1]^2$, is

$$\hat G_{p_1,p_2}(x,x') \pm n^{-1/2}Z_{1-\alpha/2}V^{1/2}(x,x'), \quad \forall\alpha\in(0,1).$$

3.3.3 Extension to two-sample test problems

In functional analysis of variance and linear discriminant analysis, it is commonly assumed that the covariance functions are the same across different treatment groups. It is natural to extend our method to two-sample problems, where we can construct confidence envelopes for the difference between the covariance functions of two independent groups. This procedure is analogous to the two-sample t-test.

Suppose we have two independent groups of curves with sample sizes $n_1$ and $n_2$, respectively. We denote the ratio of the two sample sizes by $\hat r = n_1/n_2$ and assume that $\lim_{n_1\to\infty}\hat r = r > 0$. Let $\hat G^{(1)}_{p_1,p_2}(x,x')$ and $\hat G^{(2)}_{p_1,p_2}(x,x')$ be the spline estimates of the covariance functions $G^{(1)}(x,x')$ and $G^{(2)}(x,x')$ given by (3.2). Also denote by $\zeta_{12}(x,x')$, $\forall(x,x')\in[0,1]^2$, a standardized Gaussian process such that $E\zeta_{12}(x,x') \equiv 0$, $E\zeta_{12}^2(x,x') \equiv 1$, $\forall(x,x')\in[0,1]^2$, with covariance function, for $x, x'\in[0,1]$,

$$E\,\zeta_{12}(x,x)\,\zeta_{12}(x',x') = \frac{V_1(x,x') + rV_2(x,x')}{\left\{V_1(x,x) + rV_2(x,x)\right\}^{1/2}\left\{V_1(x',x') + rV_2(x',x')\right\}^{1/2}}.$$

Denote by $Q_{12,1-\alpha}$ the $(1-\alpha)$th quantile of the absolute maxima distribution of $\zeta_{12}(x,x')$, $\forall(x,x')\in[0,1]^2$, as above. We have the following theorem, whose proof is analogous to that of Theorem 3.3.2 and therefore omitted.

Theorem 3.3.4. Under Assumptions (B1)-(B4) modified for each group accordingly, for any $\alpha\in(0,1)$, as $n_1\to\infty$, $\hat r\to r > 0$,

$$P\left\{\sup_{(x,x')\in[0,1]^2}n_1^{1/2}\frac{\left|\hat G^{(1)}_{p_1,p_2}(x,x') - \hat G^{(2)}_{p_1,p_2}(x,x') - G^{(1)}(x,x') + G^{(2)}(x,x')\right|}{\left\{V_1(x,x') + rV_2(x,x')\right\}^{1/2}} \le Q_{12,1-\alpha}\right\} = 1-\alpha.$$

Remark 3.3.3. Under Assumptions (B1)-(B4), an asymptotic $100(1-\alpha)\%$ confidence envelope for $G^{(1)}(x,x') - G^{(2)}(x,x')$, $\forall(x,x')\in[0,1]^2$, is

$$\hat G^{(1)}_{p_1,p_2}(x,x') - \hat G^{(2)}_{p_1,p_2}(x,x') \pm n_1^{-1/2}Q_{12,1-\alpha}\left\{V_1(x,x') + rV_2(x,x')\right\}^{1/2}, \quad \forall\alpha\in(0,1). \quad (3.9)$$

One can use this confidence envelope to test any hypothesis on $G^{(1)}(x,x') - G^{(2)}(x,x')$.

3.4 Implementation

In this section, we describe the procedure for implementing the confidence envelopes. Given the data set $\{(j/N, Y_{ij})\}_{j=1,i=1}^{N,n}$, the number of interior knots $N_{s_1}$ for $\hat m_{p_1}(x)$ is taken to be $[n^{1/(4p_1)}\log n]$, where $[a]$ denotes the integer part of $a$. Meanwhile, the spline estimator $\hat G_{p_1,p_2}(x,x')$ is obtained by (3.3) with the number of interior knots $N_{s_2} = [n^{1/(2p_2)}\log\log n]$. These choices of knots satisfy condition (B3) in our theory.
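As a small illustration of these knot rules, a minimal sketch (the function name is ours):

    import numpy as np

    def knot_numbers(n, p1, p2):
        # Rule-of-thumb interior knot counts from Section 3.4:
        # N_s1 = [n^{1/(4 p1)} log n],  N_s2 = [n^{1/(2 p2)} log log n].
        Ns1 = int(n ** (1.0 / (4 * p1)) * np.log(n))
        Ns2 = int(n ** (1.0 / (2 * p2)) * np.log(np.log(n)))
        return max(Ns1, 1), max(Ns2, 1)

    print(knot_numbers(200, 4, 4))  # cubic splines (p1 = p2 = 4), n = 200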
To construct the confidence envelopes, one needs to evaluate the percentile $Q_{1-\alpha}$ and estimate the variance function $V(x,x')$. An estimator $\hat V(x,x')$ of $V(x,x')$ is

$$\hat V(x,x') = \hat G_{p_1,p_2}(x,x')^2 + \hat G_{p_1,p_2}(x,x)\hat G_{p_1,p_2}(x',x') + \sum_{k=1}^\kappa\hat\phi_k^2(x)\hat\phi_k^2(x')\left(n^{-1}\sum_{i=1}^n\hat\xi_{ik}^4 - 3\right),$$

where $\hat\phi_k$ and $\hat\xi_{ik}$ are the estimators of $\phi_k$ and $\xi_{ik}$, respectively. According to Yao et al. (2005b), the estimates of the eigenfunctions and eigenvalues correspond to the solutions $\hat\phi_k$ and $\hat\lambda_k$ of the eigenequations

$$\int_0^1\hat G_{p_1,p_2}(x,x')\hat\phi_k(x)\,dx = \hat\lambda_k\hat\phi_k(x'), \quad (3.10)$$

where the $\hat\phi_k$ are subject to $\int_0^1\hat\phi_k^2(t)\,dt = \hat\lambda_k$ and $\int_0^1\hat\phi_k(t)\hat\phi_{k'}(t)\,dt = 0$ for $k' < k$. Since $N$ is sufficiently large, (3.10) can be approximated by

$$N^{-1}\sum_{j=1}^N\hat G_{p_1,p_2}(j/N, j'/N)\hat\phi_k(j/N) = \hat\lambda_k\hat\phi_k(j'/N).$$

For the same reason, the estimator of $\xi_{ik}$ has the form

$$\hat\xi_{ik} = N^{-1}\sum_{j=1}^N\hat\lambda_k^{-1}\left\{Y_{ij} - \hat m_{p_1}(j/N)\right\}\hat\phi_k(j/N).$$

To choose the number of principal components, $\kappa$, Müller (2009) described two methods. The first is the "pseudo-AIC" criterion proposed in Yao et al. (2005a). The second is a simple "fraction of variation explained" method, i.e., select the number of eigenvalues that can explain, say, 95% of the variation in the data. In our experience from the numerical studies, the simple "fraction of variation explained" method often works well.

Finally, to evaluate $Q_{1-\alpha}$, we need to simulate the Gaussian random field $\zeta_Z(x,x')$ in (3.7). The definition of $\zeta_Z(x,x')$ involves $\phi_k(x)$ and $V(x,x')$, which are replaced by their estimators described above. The fourth moment of $\xi_{1k}$ is replaced by the empirical moment of the $\hat\xi_{ik}$. We simulate a large number of independent realizations of $\zeta_Z(x,x')$, and take the maximal absolute deviation for each copy of $\zeta_Z(x,x')$. Then $Q_{1-\alpha}$ is estimated by the empirical percentile of these maximum values.

For the two-sample hypothesis testing problem, we center the two groups of curves separately by their own mean functions, since we allow each group to have a different mean function. Analogously to $Q_{1-\alpha}$, we estimate $\hat Q_{12,1-\alpha}$, and the envelope

$$\hat G^{(1)}_{p_1,p_2}(x,x') - \hat G^{(2)}_{p_1,p_2}(x,x') \pm n_1^{-1/2}\hat Q_{12,1-\alpha}\left\{\hat V_1(x,x') + \hat r\hat V_2(x,x')\right\}^{1/2}, \quad \forall\alpha\in(0,1), \quad (3.11)$$

is applied to the two-sample covariance functions in practice. The rest of the procedure follows as described in Section 3.3.3.

3.5 Simulation

To illustrate the finite-sample performance of the spline approach, we generate data from the model

$$Y_{ij} = m(j/N) + \sum_{k=1}^2\xi_{ik}\phi_k(j/N) + \sigma\varepsilon_{ij}, \quad 1\le j\le N,\ 1\le i\le n,$$

where $\xi_{ik}\sim N(0,1)$, $k = 1, 2$, $\varepsilon_{ij}\sim N(0,1)$, for $1\le i\le n$, $1\le j\le N$, $m(x) = 10 + \sin\{2\pi(x - 1/2)\}$, $\phi_1(x) = -2\cos\{\pi(x - 1/2)\}$ and $\phi_2(x) = \sin\{\pi(x - 1/2)\}$. This setting implies $\lambda_1 = 2$ and $\lambda_2 = 0.5$. The noise levels are set to $\sigma = 0.5$ and $1.0$. The number of subjects $n$ is taken to be 50, 100, 200, 300 and 500, and under each sample size the number of observations per curve is $N = 4[n^{0.3}\log(n)]$. This simulated process has a similar design to one of the simulation models in Yao et al. (2005a), except that each subject is densely observed. We consider both linear and cubic spline estimators, and use confidence levels $1-\alpha = 0.95$ and $0.99$ for our simultaneous confidence envelopes. Each simulation is repeated 500 times.

Table 3.1 shows the empirical frequency that the true surface $G(x,x')$ is entirely covered by the confidence envelopes. At both noise levels, one observes that, as the sample size increases, the true coverage probability of the confidence envelopes becomes closer to the nominal confidence level, which is a positive confirmation of Theorem 3.3.2. We present two estimation schemes: (a) both mean and covariance functions are estimated by linear splines, i.e., $p_1 = p_2 = 2$; (b) both are estimated by cubic splines, i.e., $p_1 = p_2 = 4$. Since the true covariance function is smooth in our simulation, the cubic spline estimator provides a better estimate of the covariance function. However, as can be seen from Table 3.1, the two spline estimators behave rather similarly in terms of coverage probability. We also ran simulation studies for the cases $p_1 = 4, p_2 = 2$ and $p_1 = 2, p_2 = 4$; the coverage rates are not shown here because they are similar to the cases presented in Table 3.1.
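Data from the simulation model above can be generated in a few lines; a minimal sketch under the stated design (names ours):

    import numpy as np

    def simulate(n, sigma, rng):
        # Draw n curves from the Section 3.5 model on an equally spaced grid.
        N = 4 * int(n ** 0.3 * np.log(n))        # observations per curve
        x = np.arange(1, N + 1) / N
        m = 10 + np.sin(2 * np.pi * (x - 0.5))   # mean function
        phi1 = -2 * np.cos(np.pi * (x - 0.5))    # carries lambda_1 = 2
        phi2 = np.sin(np.pi * (x - 0.5))         # carries lambda_2 = 0.5
        xi = rng.standard_normal((n, 2))         # standardized scores
        eps = rng.standard_normal((n, N))
        Y = m + xi[:, [0]] * phi1 + xi[:, [1]] * phi2 + sigma * eps
        return x, Y

    x, Y = simulate(200, 0.5, np.random.default_rng(1))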
We show in Figures 3.1 and 3.2 the spline covariance estimator and the 95% confidence envelopes for $n = 200$ and $\sigma = 0.5$. Figures 3.1 and 3.2 correspond to the linear ($p_1 = p_2 = 2$) and cubic ($p_1 = p_2 = 4$) spline estimators, respectively. In each plot, the true covariance function is overlaid by the two confidence envelopes.

Figure 3.1: Plots of the true covariance function (middle surface) of the simulated data and its 95% confidence envelopes (3.11) (upper and lower surfaces): n = 200, N = 100, sigma = 0.5, p1 = p2 = 2.

Figure 3.2: Plots of the true covariance function (middle surface) of the simulated data and its 95% confidence envelopes (3.11) (upper and lower surfaces): n = 200, N = 100, sigma = 0.5, p1 = p2 = 4.

3.6 Empirical examples

3.6.1 Tecator near infrared spectra data

We first apply our methodology to the Tecator data mentioned in Chapter 1 and Section 2.6. Figure 3.3 shows the scatter plot of the spectra. As we can see, the spectra can naturally be considered functional data, since they are recorded on a dense grid of points with little measurement error. On the other hand, there is a lot of variation among different curves. We show the estimated covariance function and the 95% confidence envelope in Figure 3.4. These results are obtained by applying cubic spline smoothing to both the mean and covariance functions, with the numbers of knots $N_{s_1} = 10$ and $N_{s_2} = 6$, respectively. We also tried other combinations of knot numbers and linear spline estimators. The results are very similar, and hence are not shown here. From Figure 3.4, we can see that the within-curve covariance is positive and quite significant, since the zero hyperplane is far below the lower bound of the confidence envelope.

Using the simultaneous confidence envelopes, one can test other interesting hypotheses on the true covariance function, such as the true covariance being stationary. Specifically, we are interested in the following hypothesis:

$$H_0: G(x,x') \equiv g(|x - x'|),\ \forall(x,x')\in[a,b]^2 \quad \text{v.s.} \quad H_a: G(x,x') \ne g(|x - x'|),\ \exists(x,x')\in[a,b]^2, \quad (3.12)$$

where $g(\cdot)$ is a stationary covariance function and $[a,b]$ is the range of wavelength. To test the hypothesis in (3.12), we need to generate a new estimator under the stationarity assumption and check whether this estimator is covered by the simultaneous confidence envelope. Letting $\hat G(x,x')$ be the tensor product B-spline covariance estimator, we define $\hat G_S(u) = (b-a-u)^{-1}\int_a^{b-u}\hat G(x, x+u)\,dx$ for $0\le u\le b-a$, and $\hat G_S(u) = \hat G_S(-u)$ for $a-b < u < 0$. Similarly to $\hat G$, the new estimator $\hat G_S$ is not guaranteed to be positive semi-definite, but it is sufficient for our purpose. Under the stationarity assumption, $\hat G_S$ is a better estimator of the true covariance. We pretend that $\hat G_S$ is the true covariance and reject the null hypothesis if this function is not covered by the confidence envelope. Figure 3.5 shows cubic tensor spline envelopes with 0.9995 confidence level, and the center surface is $\hat G_S(x - x')$ viewed as a two-dimensional function. As we can see, even at such a high confidence level, the estimator under the stationarity assumption is still not fully covered by the envelopes. We conclude that the covariance structure of these Tecator spectra is non-stationary. The same conclusion can be drawn using the linear tensor spline method.
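Computing $\hat G_S$ amounts to averaging $\hat G$ along the diagonals $|x - x'| = $ const of a grid evaluation; a minimal sketch (assuming $\hat G$ has already been evaluated on an equally spaced grid over $[a,b]$; the helper name is ours):

    import numpy as np

    def stationarized(G_hat):
        # Average a covariance surface along its diagonals, returning a
        # surface G_S(x - x') of the same shape as the input grid.
        N = G_hat.shape[0]
        g = np.array([np.mean(np.diagonal(G_hat, offset=u)) for u in range(N)])
        idx = np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
        return g[idx]   # G_S evaluated at all (x, x') grid pairs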
Figure 3.3: Plot of the Tecator data.

Figure 3.4: Plots of the cubic tensor product spline covariance estimator (3.3) for the Tecator data (middle surface) and the 95% simultaneous confidence envelope (3.11) (upper and lower surfaces).

Figure 3.5: Plot for testing hypothesis (3.12) for the Tecator data. The upper and lower surfaces are the 99.95% confidence envelopes for the covariance function, and the middle surface is the covariance estimator under the stationarity assumption, G_S(x - x').

3.6.2 Speech recognition data

The data were extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce), a widely used resource for research in speech recognition. The data set we use was formed by selecting five phonemes for classification based on digitized speech from this database. From continuous speech of 50 male speakers, 4509 speech frames of 32 msec duration were selected. From each speech frame, a log-periodogram was computed as the transformation casting speech data into a form suitable for speech recognition. The five phonemes in this data set are transcribed as follows: "sh" as in "she", "dcl" as in "dark", "iy" as the vowel in "she", "aa" as the vowel in "dark", and "ao" as the first vowel in "water". For illustration purposes, we focus on the "sh" and "ao" phoneme classes as representatives of consonants and vowels. There are $n_1 = 872$
log-periodograms in the "sh" class and $n_2 = 1022$ log-periodograms in the "ao" group. Each log-periodogram consists of $N = 256$ equally spaced points. Figure 3.6 shows a sample of 10 log-periodograms from each of the two phoneme classes. This data set was first analyzed by Hastie et al. (1995) using penalized linear discriminant analysis. One of its basic assumptions is that the covariance functions are the same for different classes.
Judging from the scatter plot of the data in Figure 3.6, despite the clear difference between the mean functions of the two groups, there is no obvious indication of a difference in covariance structures. We first obtain the cubic tensor product spline covariance estimators for the two phoneme classes separately; they are shown in Figure 3.7. These results are obtained using $N_{s_1} = 10$, $N_{s_2} = 4$ knots for the "sh" class, and $N_{s_1} = 11$, $N_{s_2} = 5$ for the "ao" class. The different numbers of knots between the two groups reflect their different sample sizes. Comparing the covariance estimators in Figure 3.7, there seems to be a visible difference between the two classes. We now would like to test the equal covariance assumption formally. The hypotheses of interest are

$$H_0: G^{(1)}(x,x') \equiv G^{(2)}(x,x'),\ \forall(x,x')\in[0,1]^2 \quad \text{v.s.} \quad H_a: G^{(1)}(x,x') \ne G^{(2)}(x,x'),\ \exists(x,x')\in[0,1]^2. \quad (3.13)$$

The 99.95% confidence envelopes for the difference of the two covariance functions are provided in Figure 3.8, with the zero hyperplane used as a reference. Since the zero hyperplane is not covered by the envelopes, the equal covariance hypothesis is rejected with p-value < 0.0005. We also tried different numbers of knots, and the test result is not sensitive to this choice. Our results suggest that a quadratic discriminant analysis, which takes into account the difference in the within-group covariance functions, might yield a better classification error rate than a linear discriminant analysis for this data set.

Figure 3.6: Plots of the speech recognition data.

Figure 3.7: Plots of tensor spline estimators for the "sh" and "ao" data sets. Right: "sh"; Left: "ao".

Figure 3.8: Plots of the hypothesis test (3.13) results, with the cubic tensor product spline 99.95% confidence envelopes (3.11) (upper and lower surfaces) for the speech recognition data and the zero surface (middle flat surface).

3.7 Summary

In this chapter, we consider covariance estimation for functional data and propose a new computationally efficient tensor product B-spline estimator. The proposed estimator can be used as a building block for further data analysis, such as principal component analysis, linear discriminant analysis and analysis of variance. We study both local and global asymptotic properties of our estimator and propose a simultaneous envelope approach to making inference on the true covariance function. The method is applied to the Tecator near infrared spectra data to test the stationarity assumption on the covariance. In a classification problem, we further extend our method to a two-sample setting to test the equal covariance assumption between different treatment groups.

APPENDIX

Throughout this appendix, $C$ denotes some positive constant. The detailed proofs of the following lemmas and of Propositions 3.3.1 and 3.3.2 can be found in the supplement.

Preliminaries

For any positive integer $p$, denote the theoretical and empirical inner product matrices of $\{B_{J,p}(x)\}_{J=1-p}^{N_s}$ as

$$V_p = \{\langle B_{J,p}, B_{J',p}\rangle\}_{J,J'=1-p}^{N_s}, \qquad \hat V_p = \{\langle B_{J,p}, B_{J',p}\rangle_{2,N}\}_{J,J'=1-p}^{N_s}.$$

The following lemma, which establishes the upper bound of $\|V_p^{-1}\|_\infty$, is from Chapter 2 (Lemma 2.6.3).

Lemma 3.7.1. For any positive integer $p$, there exists a constant $M_p > 0$ depending only on $p$, such that $\|V_p^{-1}\|_\infty \le M_p h_s^{-1}$ for large enough $n$, where $h_s = (N_s+1)^{-1}$.

Denote by "$\otimes$" the Kronecker product of two matrices.
Note that $(A\otimes A)^{-1} = A^{-1}\otimes A^{-1}$ and $\|A^{-1}\otimes A^{-1}\|_\infty \le \|A^{-1}\|_\infty^2$ for any invertible matrix $A$, which, together with Lemma 3.7.1, leads to the following result.

Lemma 3.7.2. For any positive integer $p$, there exists a constant $M_p > 0$ depending only on $p$, such that $\|(V_p\otimes V_p)^{-1}\|_\infty \le M_p^2 h_s^{-2}$.

Next we define the theoretical and empirical inner product matrices of the tensor product spline basis $\{B_{JJ',p_2}(x,x')\}_{J,J'=1-p_2}^{N_{s_2}}$ as

$$V_{p_2,2} = \left\{\langle B_{JJ',p_2}, B_{J''J''',p_2}\rangle\right\}_{J,J',J'',J'''=1-p_2}^{N_{s_2}}, \qquad \hat V_{p_2,2} = \left\{\langle B_{JJ',p_2}, B_{J''J''',p_2}\rangle_{2,N}\right\}_{J,J',J'',J'''=1-p_2}^{N_{s_2}}. \quad (3.14)$$

The following results show that the difference between $\hat V_{p_2,2}$ and $V_{p_2,2}$ is negligible. Using the results from Lemma 3.7.2, we obtain the upper bound of the norm of $\hat V_{p_2,2}^{-1}$.

Lemma 3.7.3. Under Assumption (B3), for $V_{p_2,2}$ and $\hat V_{p_2,2}$ defined in (3.14), $\|\hat V_{p_2,2} - V_{p_2,2}\|_\infty = O(N^{-1})$ and $\|\hat V_{p_2,2}^{-1}\|_\infty = O(h_{s_2}^{-2})$.

Lemma 3.7.4. For $\hat V_{p_2,2}$ defined in (3.14) and any $N(N-1)$-vector $\rho = (\rho_{jj'})$, there exists a constant $C > 0$ such that $\sup_{(x,x')\in[0,1]^2}|N^{-2}B_{p_2}^T(x,x')\hat V_{p_2,2}^{-1}\mathbb X^T\rho| \le C\|\rho\|_\infty$.

Denote $\tilde\phi_{kk'}(x,x') = B_{p_2}^T(x,x')(\mathbb X^T\mathbb X)^{-1}\mathbb X^T\boldsymbol\phi_{kk'}$, and

$$\boldsymbol\phi_{kk'} = \left(\phi_k(2/N)\phi_{k'}(1/N),\ldots,\phi_k(1)\phi_{k'}(1/N),\ldots,\phi_k(1/N)\phi_{k'}(1),\ldots,\phi_k(1-1/N)\phi_{k'}(1)\right)^T.$$

The following lemma is a direct result of de Boor (2001), p. 149, and Theorem 5.1 of Huang (2003); the proof is therefore omitted.

Lemma 3.7.5. There is an absolute constant $C_g > 0$ such that for every $g\in C^{p-1,\mu}[0,1]$ and some $\mu\in(0,1]$, there exists a function $g^*\in H^{(p-1)}[0,1]$ such that $\sup_{x\in[0,1]}|g(x) - g^*(x)| \le C_g h_s^{p-1+\mu}$. If Assumption (B2) holds, then

$$\sup_{(x,x')\in[0,1]^2}\left|\tilde\phi_{kk'}(x,x') - \phi_k(x)\phi_{k'}(x')\right| = O\left(h_{s_2}^{p_2}\right).$$

Proofs of Theorems 3.3.1 and 3.3.2

Proof of Theorem 3.3.1. By Proposition 3.3.1,

$$E\left[\tilde G_{p_2}(x,x') - G(x,x')\right]^2 = E\Delta^2(x,x') + o(1),$$

where $\Delta(x,x')$ is defined in (3.5). Let $\bar\xi_{\cdot kk'} = n^{-1}\sum_{i=1}^n\xi_{ik}\xi_{ik'}$, $1\le k, k'\le\kappa$. According to (3.16) and (3.17), one has

$$\Delta(x,x') = \sum_{k\ne k'}^\kappa\bar\xi_{\cdot kk'}\phi_k(x)\phi_{k'}(x') + \sum_{k=1}^\kappa\left(\bar\xi_{\cdot kk} - 1\right)\phi_k(x)\phi_k(x').$$

Since

$$nE\Delta^2(x,x') = \sum_{k,k'=1}^\kappa\phi_k^2(x)\phi_{k'}^2(x') + \sum_{k,k'=1}^\kappa\phi_k(x)\phi_{k'}(x)\phi_k(x')\phi_{k'}(x') + \sum_{k=1}^\kappa\phi_k^2(x)\phi_k^2(x')\left(E\xi_{1k}^4 - 3\right) = G(x,x')^2 + G(x,x)G(x',x') + \sum_{k=1}^\kappa\phi_k^2(x)\phi_k^2(x')\left(E\xi_{1k}^4 - 3\right) \equiv V(x,x'),$$

the desired result follows from Proposition 3.3.2.

Next define $\zeta(x,x') = n^{1/2}V^{-1/2}(x,x')\Delta(x,x')$.

Lemma 3.7.6. Under Assumptions (B2)-(B4), $\sup_{(x,x')\in[0,1]^2}|\zeta(x,x') - \zeta_Z(x,x')| = o_{a.s.}(1)$, where $\zeta_Z(x,x')$ is given in (3.7).

Proof of Theorem 3.3.2. According to Lemma 3.7.6, Propositions 3.3.1 and 3.3.2 and Theorem 3.3.1, for $\forall\alpha\in(0,1)$, as $n\to\infty$,

$$\lim_{n\to\infty}P\left\{\sup_{(x,x')\in[0,1]^2}n^{1/2}\left|\hat G_{p_1,p_2}(x,x') - G(x,x')\right|V^{-1/2}(x,x') \le Q_{1-\alpha}\right\} = \lim_{n\to\infty}P\left\{\sup_{(x,x')\in[0,1]^2}n^{1/2}\left|\tilde G_{p_2}(x,x') - G(x,x')\right|V^{-1/2}(x,x') \le Q_{1-\alpha}\right\} = \lim_{n\to\infty}P\left\{\sup_{(x,x')\in[0,1]^2}\left|\zeta(x,x')\right| \le Q_{1-\alpha}\right\} = \lim_{n\to\infty}P\left\{\sup_{(x,x')\in[0,1]^2}\left|\zeta_Z(x,x')\right| \le Q_{1-\alpha}\right\} = 1-\alpha.$$

Supplement

This supplement contains proofs of Lemmas 3.7.3, 3.7.4 and 3.7.6, and of Propositions 3.3.1 and 3.3.2.

Proof of Lemma 3.7.3. Note that

$$\hat V_{p_2,2} = \left\{N^{-2}\sum_{1\le j\ne j'\le N}B_{JJ',p_2}(j/N, j'/N)\, B_{J''J''',p_2}(j/N, j'/N)\right\}_{J,J',J'',J'''=1-p_2}^{N_{s_2}} = \hat V_{p_2}\otimes\hat V_{p_2} - \left\{N^{-2}\sum_{j=1}^N B_{JJ',p_2}(j/N, j/N)\, B_{J''J''',p_2}(j/N, j/N)\right\}_{J,J',J'',J'''=1-p_2}^{N_{s_2}}.$$
The entries of the matrix $\hat V_{p_2,2} - \hat V_{p_2}\otimes\hat V_{p_2}$ are zero when the maximum absolute difference between any two of the indices $J, J', J'', J'''$ is greater than $p$; otherwise

$$N^{-2}\sum_{j=1}^N B_{JJ',p_2}(j/N, j/N)\, B_{J''J''',p_2}(j/N, j/N) = N^{-1}\left\{\int_0^1 B_{JJ',p_2}(x,x)B_{J''J''',p_2}(x,x)\,dx + O(N^{-1}h_{s_2}^{-1})\right\} = O\left(N^{-1}h_{s_2} + N^{-2}h_{s_2}^{-1}\right).$$

Hence, $\|\hat V_{p_2,2} - \hat V_{p_2}\otimes\hat V_{p_2}\|_\infty = O(N^{-1}h_{s_2} + N^{-2}h_{s_2}^{-1})$. Since

$$\|\hat V_{p_2}\otimes\hat V_{p_2} - V_{p_2}\otimes V_{p_2}\|_\infty = \max_{1-p_2\le J'',J'''\le N_{s_2}}\sum_{J,J'=1-p_2}^{N_{s_2}}\left|N^{-2}\sum_{j,j'=1}^N B_{JJ',p_2}(j/N, j'/N)B_{J''J''',p_2}(j/N, j'/N) - \int_0^1\int_0^1 B_{JJ',p_2}(x,x')B_{J''J''',p_2}(x,x')\,dx\,dx'\right| \le \max_{1-p_2\le J'',J'''\le N_{s_2}}\sum_{J,J'=1-p_2}^{N_{s_2}}\sum_{j,j'=1}^N\left|\int_{(j-1)/N}^{j/N}\int_{(j'-1)/N}^{j'/N}\left\{B_{JJ',p_2}(j/N, j'/N)B_{J''J''',p_2}(j/N, j'/N) - B_{JJ',p_2}(x,x')B_{J''J''',p_2}(x,x')\right\}dx\,dx'\right| \le CN^{-2}h_{s_2}^{-2},$$

applying Assumption (B3) one has $\|\hat V_{p_2}\otimes\hat V_{p_2} - V_{p_2}\otimes V_{p_2}\|_\infty = o(N^{-1})$, and the first claim follows.

According to Lemma 3.7.2, for any $(N_{s_2}+p_2)^2$-vector $\tau$, one has $\|(V_{p_2}\otimes V_{p_2})^{-1}\tau\|_\infty \le M_{p_2}^2 h_{s_2}^{-2}\|\tau\|_\infty$. Hence $\|(V_{p_2}\otimes V_{p_2})\tau\|_\infty \ge M_{p_2}^{-2}h_{s_2}^2\|\tau\|_\infty$. Note that

$$\|\hat V_{p_2,2}\tau\|_\infty \ge \|(V_{p_2}\otimes V_{p_2})\tau\|_\infty - \|(V_{p_2}\otimes V_{p_2})\tau - \hat V_{p_2,2}\tau\|_\infty = O(h_{s_2}^2)\|\tau\|_\infty.$$

It follows that $\|\hat V_{p_2,2}^{-1}\tau\|_\infty \le O(h_{s_2}^{-2})\|\tau\|_\infty$, i.e. $\|\hat V_{p_2,2}^{-1}\|_\infty = O(h_{s_2}^{-2})$, and the lemma is proved.

Proof of Lemma 3.7.4. Note that for any matrix $A = (a_{ij})_{i=1,j=1}^{m,n}$ and any $n\times 1$ vector $\alpha = (\alpha_1,\ldots,\alpha_n)^T$, one has $\|A\alpha\|_\infty \le \|A\|_\infty\|\alpha\|_\infty$. It is clear that

$$\|N^{-2}\mathbb X^T\rho\|_\infty = N^{-2}\left\|\left\{\sum_{1\le j\ne j'\le N}B_{JJ',p_2}(j/N, j'/N)\,\rho_{jj'}\right\}_{J,J'=1-p_2}^{N_{s_2}}\right\|_\infty \le \|\rho\|_\infty\max_{1-p_2\le J,J'\le N_{s_2}}N^{-2}\sum_{1\le j\ne j'\le N}B_{JJ',p_2}(j/N, j'/N) \le h_{s_2}^2\|\rho\|_\infty.$$

One also observes that

$$\left|N^{-2}B_{p_2}^T(x,x')\hat V_{p_2,2}^{-1}\mathbb X^T\rho\right| \le \|B_{p_2}(x,x')\|_\infty\,\|\hat V_{p_2,2}^{-1}\|_\infty\,\|N^{-2}\mathbb X^T\rho\|_\infty,$$

which, together with the boundedness of spline functions and Lemma 3.7.3, leads to the desired result.

Proof of Lemma 3.7.6. According to Assumption (B4) and the multivariate central limit theorem, the vector $\sqrt n\,\{\bar\xi_{\cdot kk'},\ (\bar\xi_{\cdot kk} - 1)(E\xi_{1k}^4 - 1)^{-1/2}\}_{1\le k'<k\le\kappa}$ converges in distribution to a vector of independent standard normal variables, where $\kappa$ is a positive integer or $\infty$.

We consider a typical functional data setting where $X_i(\cdot)$ is recorded on a regular grid in $\mathcal T$, and assumed to be contaminated with measurement errors. Without loss of generality, we take $\mathcal T = [0,1]$. Then the observed data are $Y_{ij} = X_i(T_{ij}) + \sigma(T_{ij})\varepsilon_{ij}$, for $1\le i\le n$, $1\le j\le N$, where $T_{ij} = j/N$, the $\varepsilon_{ij}$ are independent random errors with $E\varepsilon_{ij} = 0$ and $E(\varepsilon_{ij}^2) = 1$, and $\sigma(\cdot)$ is the standard deviation of the measurement errors. By the Karhunen-Loève representation, the observed data can be written as

$$Y_{ij} = m(j/N) + \sum_{k=1}^\kappa\xi_{ik}\phi_k(j/N) + \sigma(j/N)\varepsilon_{ij}, \quad (4.1)$$

where $m(\cdot)$, $\sigma(\cdot)$ and $\{\phi_k(\cdot)\}_{k=1}^\kappa$ are smooth but unknown functions of $t$. In addition, $\{\phi_k(\cdot)\}_{k=1}^\kappa$ are further subject to the constraints $\int_0^1\phi_k^2(t)\,dt = 1$ and $\int_0^1\phi_k(t)\phi_{k'}(t)\,dt = 0$ for $k\ne k'$.

4.2.2 Spline estimators

We first introduce some notation for the B-spline space. Divide the interval $\mathcal T = [0,1]$ into $(N_m+1)$ subintervals $I_J = [\omega_J, \omega_{J+1})$, $J = 0,\ldots,N_m-1$, $I_{N_m} = [\omega_{N_m}, 1]$, where $\{\omega_J\}_{J=1}^{N_m}$ is a sequence of equally spaced points, called interior knots. Let $H^{(p-2)}$ be the polynomial spline space of order $p$ on $[0,1]$. The $J$-th B-spline of order $p$ is denoted by $B_{J,p}$. We augment the boundary knots and the number of interior knots as

$$\omega_{1-p} = \cdots = \omega_{-1} = \omega_0 = 0 < \omega_1 < \cdots < \omega_{N_m} < 1 = \omega_{N_m+1} = \cdots = \omega_{N_m+p},$$

in which $\omega_J = Jh_m$, $J = 0, 1,\ldots,N_m+1$, and $h_m = 1/(N_m+1)$ is the distance between neighboring knots.
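The augmented knot sequence above translates directly into code; a minimal sketch (the function name is ours):

    import numpy as np

    def augmented_knots(Nm, p):
        # Equally spaced interior knots omega_J = J * h_m on [0, 1], with the
        # boundary knots 0 and 1 each repeated p times.
        hm = 1.0 / (Nm + 1)
        interior = hm * np.arange(1, Nm + 1)
        return np.concatenate([np.zeros(p), interior, np.ones(p)])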
Following Cao, Yang and Todem (2012), we estimate the mean function $m(\cdot)$ in (4.1) by

$$\hat m(\cdot) = \mathop{\mathrm{argmin}}_{g(\cdot)\in H^{(p-2)}}\sum_{i=1}^n\sum_{j=1}^N\left\{Y_{ij} - g(j/N)\right\}^2 = \sum_{J=1-p}^{N_m}\hat b_{J,p}B_{J,p}(\cdot),$$

where the coefficients

$$\hat b_p = (\hat b_{1-p,p},\ldots,\hat b_{N_m,p})^T = \mathop{\mathrm{argmin}}_{\{b_{J,p}\}\in\mathbb R^{N_m+p}}\sum_{i=1}^n\sum_{j=1}^N\left\{Y_{ij} - \sum_{J=1-p}^{N_m}b_{J,p}B_{J,p}(j/N)\right\}^2.$$

Let $\bar Y = (\bar Y_{\cdot 1},\ldots,\bar Y_{\cdot N})^T$ with $\bar Y_{\cdot j} = n^{-1}\sum_{i=1}^n Y_{ij}$, $1\le j\le N$. Applying elementary algebra, one obtains

$$\hat m(t) = B_p(t)(\mathbf B^T\mathbf B)^{-1}\mathbf B^T\bar Y, \quad (4.2)$$

in which $B_p(t) = (B_{1-p,p}(t),\ldots,B_{N_m,p}(t))$ and $\mathbf B = (B_p^T(1/N),\ldots,B_p^T(N/N))^T$ is the design matrix.

We denote by $m^{(\nu)}(t)$ the $\nu$-th order derivative of $m(t)$ with respect to $t$. Since $\hat m(t)$ is an estimator of $m(t)$, it is natural to consider $\hat m^{(\nu)}(t)$ as the estimator of $m^{(\nu)}(t)$, for any $\nu = 1,\ldots,p-2$, i.e.

$$\hat m^{(\nu)}(t) = B_p^{(\nu)}(t)(\mathbf B^T\mathbf B)^{-1}\mathbf B^T\bar Y, \quad (4.3)$$

where $B_p^{(\nu)}(t) = (B_{1-p,p}^{(\nu)}(t),\ldots,B_{N_m,p}^{(\nu)}(t))$. According to the B-spline property in de Boor (2001), for $p > 2$ and $2-p\le J\le N_m-1$,

$$\frac{d}{dt}B_{J,p}(t) = (p-1)\left\{\frac{B_{J,p-1}(t)}{\omega_{J+p-1} - \omega_J} - \frac{B_{J+1,p-1}(t)}{\omega_{J+p} - \omega_{J+1}}\right\}.$$

Therefore, $B_p^{(\nu)}(t) = B_{p-\nu}(t)D^{(\nu)T}$, in which $D^{(\nu)} = D_1\cdots D_{\nu-1}D_\nu$, with the bidiagonal differencing matrix

$$D_l = (p-l)\begin{pmatrix} \frac{-1}{\omega_1-\omega_{1-p+l}} & 0 & 0 & \cdots & 0 \\ \frac{1}{\omega_1-\omega_{1-p+l}} & \frac{-1}{\omega_2-\omega_{2-p+l}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\omega_2-\omega_{2-p+l}} & \frac{-1}{\omega_3-\omega_{3-p+l}} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \frac{1}{\omega_{N_m+p-l}-\omega_{N_m}} \end{pmatrix}$$

for $1\le l\le\nu\le p-2$, which is the same as equation (6) in Zhou and Wolfe (2000).
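In practice one need not assemble the matrices $D_l$ explicitly: once the spline coefficients in (4.2) are computed, the fitted spline can be differentiated directly, which implements the same recursion. A minimal sketch (helper names ours; scipy's BSpline.derivative carries out the B-spline differentiation):

    import numpy as np
    from scipy.interpolate import BSpline

    def spline_derivative_estimate(Y, p, Nm, nu):
        # Estimate m^{(nu)} per (4.3) by differentiating the least-squares
        # spline fit of the pointwise means of the n x N data matrix Y.
        n, N = Y.shape
        grid = np.arange(1, N + 1) / N
        knots = np.concatenate([np.zeros(p),
                                np.arange(1, Nm + 1) / (Nm + 1), np.ones(p)])
        B = BSpline(knots, np.eye(Nm + p), p - 1)(grid)   # design matrix
        coef, *_ = np.linalg.lstsq(B, Y.mean(axis=0), rcond=None)
        return BSpline(knots, coef, p - 1).derivative(nu)  # callable estimate

    # Usage: m1_hat = spline_derivative_estimate(Y, p=4, Nm=10, nu=1); m1_hat(0.5)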
Thus combining with the results in Theo¯ rem 4.2.1, one has ˆ supt∈[0,1] m(ν) (t) − m(ν) (t) = OP n−1/2 . 4.3 Confidence bands In this section, we develop the simultaneous confidence bands for the derivative function m(ν) (·). 4.3.1 Asymptotic confidence bands Let Σ(·, ·) be a positive definite function, and defined as Σ(t, s) = κ λ φ(ν)(t)φ(ν) (s), k=1 k k k t, s ∈ [0, 1]. Denote by ζ (t), t ∈ [0, 1] a standardized Gaussian process such that Eζ (t) ≡ 0, Eζ 2 (t) ≡ 1 with covariance function Eζ (t) ζ (s) = Σ (t, s) {Σ (t, t) Σ (s, s)}−1/2 , t, s ∈ [0, 1]. Denote by q1−α the 100 (1 − α)th percentile of the absolute maxima distribution of ζ (t), t ∈ [0, 1], i.e. P supt∈[0,1] |ζ (t)| ≤ q1−α = 1 − α, ∀α ∈ (0, 1) . 103 Theorem 4.3.1. Under Assumptions (C1)-(C6) in Appendix, ∀α ∈ (0, 1), as n → ∞, P supt∈[0,1] n1/2 m(ν) (t) − m(ν) (t) Σ (t, t)−1/2 ≤ q1−α → 1 − α. ¯ Applying Theorems 4.2.1 and 4.3.1 gives asymptotic confidence bands for m(ν) (t), t ∈ [0, 1]. Corollary 4.3.2. Under Assumptions (C1)-(C6) in Appendix, ∀α ∈ (0, 1), as n → ∞, an asymptotic 100 × (1 − α) % exact confidence band for m(ν) (t) is P m(ν) (t) ∈ m(ν) (t) ± n−1/2 q1−α Σ (t, t)1/2 , ˆ 4.3.2 t ∈ [0, 1] → 1 − α. Implementation When constructing the confidence bands, one needs to estimate the unknown function Σ(t, s). (ν) Note that Σ(t, s) = G(t, s), when ν = 0. Following Liu and M¨ller (2009), we estimate φ u k through the derivatives of G(t, s). According to Cao, Wang, Li and Yang (2012), G (t, s) is estimated by ˆ G (t, s) = ˆ where R = n−1 ·jj NG ˆ b B (t) B (s) J ,p J,J =1−p JJ J,p n ˆ i=1 Yij − m (j/N ) Y ij − m j /N ˆ (4.9) , 1 ≤ j = j ≤ N , NG is the number of interior knots for B-spline, and the coefficients ˆ bJJ = NG J,J =1−p 2    N   ˆ arg min R − b B (j/N ) B j /N JJ J,p J ,p  ·jj    NG +p ⊗RNG +p R j=j 1−p≤J,J ≤NG 104 ˆ with “⊗” being the tensor product of two spaces. They showed that G converges to G as n ˆ goes to ∞. In this section, we further show that G and G are asymptomatically equivalent up to the ν-th partial derivative. We define the ν-th derivative with respect to s for G(t, s) ˆ and G(t, s) as ∂ν ˆ ∂ν ˆ G(0,ν) (t, s) = ν G(t, s), G(0,ν) (t, s) = ν G(t, s) = ∂s ∂s (ν) ˆ bJJ BJ,p (t) B (s) . (4.10) J ,p J,J Theorem 4.3.3. Under Assumptions (C1)-(C6), one has ˆ sup G(0,ν) (t, s) − G(0,ν) (t, s) = oP (1) , (t,s)∈[0,1]2 1 ≤ ν ≤ p − 2. The proof of Theorem 4.3.3 is given in Appendix A.3. According to Liu and M¨ller (2009), we estimate the ν-th derivative of eigenfunctions u ˆ (ν) using the following eigenequations, φ k 1 ∂ν 1 dν ˆ ˆ ˆ ˆ (ν) ˆ ˆ G(t, s)φk (t) dt = ν ν G(t, s)φk (t) dt = λk φk (s) , ds 0 0 ∂s (4.11) 1 ˆ2 1ˆ ˆ ˆ where φk are subject to 0 φk (t) dt = 1 and 0 φk (t) φk (t) dt = 0 for k < k. If N is sufficiently large, the left hand side of (4.11) can be approximated by j N G(0,ν) ( j , j )φ ˆ ˆ j=1 N N k N . Then we estimate Σ(t, s) by ˆ ˆ (ν) ˆ (ν) (s). ˆ Σ(t, s) = κ λk φ (t)φ k=1 k k 1 N ˆ The following theorem shows that Σ(·, ·) and Σ(·, ·) are asymptomatically equivalent. 105 Theorem 4.3.4. Under Assumptions (C1)-(C6), one has ˆ sup Σ(t, s) − Σ(t, s) = oP (1) . (s,t)∈[0,1]2 The proof of Theorem 4.3.4 is given in Appendix A.4. ˆ ˆ In practice, we choose the first L positive eigenvalues λ1 ≥. . . ≥ λL > 0 by eigenvalue ˆ decomposition of G(t, s). Then we apply a standard criterion in M¨ller (2009), to choose the u number of eigenfunctions, i.e. κ = arg min1≤l≤L l ˆ k=1 λk / L λ > 0.95 . 
M¨ller ˆ u k=1 k (2009) suggests the “pseudo-AIC” and this simple method of counting the percentage of variation explained can be used to choose the number of principal components. The simple method performed well in our simulations and is used for our numerical studies. To construct the confidence bands, we use cubic splines to estimate the mean and covariance functions and their first order derivatives. Generalized cross-validation is used to choose the number of knots Nm (from 2 to 20), to smooth out the mean function. According to Assumption (C3), the number of knots for smoothing the covariance function is taken to be NG = [n1/(2p) log(n)], where [a] denotes the integer part of a. Finally, in order to estimate q1−α , we generate i.i.d standard normal variables Zk,b , ˆ ˆ ˆ (ν) ˆ 1 ≤ k ≤ κ, b = 1, . . . , 5000. Let ζ b (t) = Σ (t, t)−1/2 κ k=1 λk Zk,b φk (t), t ∈ [0, 1], ˆ q1−α can be estimated by 100(1 − α)-th percentile of {supt∈[0,1] |ζ b (t)|}5000 . Therefore, b=1 in application we recommend the following band ˆ m(ν) (t) ± n−1/2 Σ (t, t)1/2 q1−α , ˆ ˆ 106 t ∈ [0, 1]. (4.12) 4.4 Numerical studies 4.4.1 Simulated examples To illustrate the finite-sample performance of the confidence band in (4.12), we generate data from the following: Yij = m (j/N ) + κ ξ φ (j/N ) + εij , k=1 ik k i.i.d. εij ∼ N (0, 0.12 ), for 1 ≤ i ≤ n, 1 ≤ j ≤ N . We consider two scenarios. √ √ Model I: m(t) = 5t+4 sin(2π(t−0.5)), φ1 (t) = − 2 cos(2π(t−0.5)), φ2 (t) = 2 sin(4π(t− 0.5)), ξ ik ∼ N (0, λk ), λ1 = 2, λ2 = 1, κ = 2; √ (t−0.5)2 Model II: m(t) = 4t + √ 1 exp − , φk (t) = 2 sin (πkt), ξ ik ∼ N (0, λk ), 2π0.1 2(0.1)2 λk = 2−(k−1) , k = 1, 2, . . . , κ = 8. The second case has similar design as in Simulation C of Liu and M¨ller (2009). We use u the proposed method in (4.12) and its “oracle” version with true Σ (t, t) to construct the confidence bands for m(1) (·) respectively in both studies. We consider two confidence levels: 1 − α = 0.95, 0.99. The number of trajectories n is taken to be 30, 50, 100, 200, and for each n, we try different numbers of observations on the trajectory. Each simulation consists of 1000 Monte Carlo samples. We evaluate the coverage of the bands over 200 equally spaced points on [0, 1] and test whether the true functions are covered by the confidence bands at these points. Tables 4.1 and 4.2 show the empirical coverage probabilities out of 1000 replications for Models I and II, respectively. From Tables 4.1 and 4.2, we observe that coverage probabilities for 107 Table 4.1: Coverage rates of the spline confidence bands in Model I. 95% 99% n N Est. Oracle Est. Oracle 30 30 0.756 0.831 0.901 0.936 60 0.833 0.900 0.926 0.969 50 0.824 0.863 0.932 0.933 100 0.856 0.904 0.947 0.979 100 100 0.851 0.897 0.943 0.971 200 0.856 0.907 0.949 0.971 200 200 0.866 0.910 0.950 0.972 400 0.869 0.944 0.959 0.987 50 both estimated bands and “oracle” bands approach the nominal levels, which show positive confirmation of Theorem 4.3.1. In most of the scenarios the “oracle” confidence bands outperform the estimated bands, and the “oracle” bands arrive at about the nominal coverage for large n and N . The convergence rates of estimated bands are slower than those “oracle” bands, but the convergence trend to nominal level is clearly. Figure 4.1 to Figure 4.8 show the estimated functions and their 99% confidence bands for the first order derivative curve m(1) (·) for Models I and II, respectively. 
As expected when n and N increase, the confidence band is narrower and the cubic spline estimator is closer to the true derivative curve. For Model I, the boundary effects in all four panels are almost unnoticeable. For Model II, there seems to be some boundary effects for small n and N , which are attenuated as n and N increase. 108 −20 0 m ( 1) 20 40 n=30, N=30 0.0 0.2 0.4 0.6 0.8 1.0 Figure 4.1: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of m(1) (t) (solid line) in Model I. n = 30, N = 30. 109 −20 0 m ( 1) 20 40 n=30, N=60 0.0 0.2 0.4 0.6 0.8 1.0 Figure 4.2: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of m(1) (t) (solid line) in Model I. n = 30, N = 60. 110 −20 0 m ( 1) 20 40 n=50, N=100 0.0 0.2 0.4 0.6 0.8 1.0 Figure 4.3: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of m(1) (t) (solid line) in Model I. n = 50, N = 50. 111 −20 0 m(1) 20 40 n=50, N=100 0.0 0.2 0.4 0.6 0.8 1.0 Figure 4.4: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of m(1) (t) (solid line) in Model I. n = 50, N = 100. 112 −20 0 m ( 1) 20 40 n=30, N=30 0.0 0.2 0.4 0.6 0.8 1.0 Figure 4.5: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of m(1) (t) (solid line) in Model II. n = 30, N = 30. 113 −20 0 m ( 1) 20 40 n=30, N=60 0.0 0.2 0.4 0.6 0.8 1.0 Figure 4.6: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of m(1) (t) (solid line) in Model II. n = 30, N = 60. 114 −20 0 m ( 1) 20 40 n=50, N=50 0.0 0.2 0.4 0.6 0.8 1.0 Figure 4.7: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of m(1) (t) (solid line) in Model II. n = 50, N = 50. 115 −20 0 m(1) 20 40 n=50, N=100 0.0 0.2 0.4 0.6 0.8 1.0 Figure 4.8: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of m(1) (t) (solid line) in Model II. n = 50, N = 100. 116 Table 4.2: Coverage rates of the spline confidence bands in Model II. 95% 99% n N Est. Oracle Est. Oracle 30 30 0.650 0.730 0.830 0.854 60 0.695 0.839 0.860 0.941 50 0.817 0.911 0.930 0.982 100 0.830 0.929 0.933 0.986 100 100 0.858 0.940 0.948 0.986 200 0.876 0.939 0.960 0.986 200 200 0.874 0.939 0.949 0.980 400 0.889 0.946 0.963 0.991 50 4.4.2 Tecator data Here we apply the proposed method to the Tecator dataset, which can be introduced in chapter 2.6. Figure 4.9 shows the estimated mean absorbance measurements m(·) and its estimated first order derivative m(1) (·) in the lower panel. Their 99% confidence bands (dashed lines) are also included in the figure, both bands have similar band width around 0.1 to 0.2 even though the bands for m(1) (·) looks much narrower in the figure. As shown in Figure 4.9, in the region of 850 − 950 nm the derivative estimate of mean absorbance is increasing gradually above 0, which corroborates with the convex behavior of the corresponding estimated mean function. For wavelength between 950 and 970 nm, the big bump in the derivative graph is consistent with the changing pattern of the mean estimate before it reaches the turning point at around 970 nm. 
When wavelength is larger than 970 nm, its estimated derivative turns negative and is relatively flat after 1000 nm, and 117 it is in accordance with the quick dip of the mean absorbance and a linear decreasing trend for wavelength larger than 1000 nm. 118 3.6 3.4 3.2 2.8 3.0 absorbance 850 900 950 1000 1050 1 −2 −1 0 µ(1) 2 3 4 wavelength 0.0 0.2 0.4 0.6 0.8 1.0 Wavelength Figure 4.9: Plots of the cubic spline estimators (dotted-dashed line) and 99% confidence bands (upper and lower dashed lines) of the mean function and its first order derivative. 119 APPENDIX 120 For any r ∈ (0, 1], we denote C q,r [0, 1] as the space of H¨lder continuous functions on o [0, 1], C q,r [0, 1] = φ : φ q,r < +∞ . The technical assumptions we need are as follows: (C1) The regression function m ∈ C p−1,1 [0, 1]; (C2) The standard deviation function σ (t) ∈ C 0,δ [0, 1] for some δ ∈ (0, 1]; nθ for some θ > 1+2ν ; the 2(p−ν) 1 1 Nm (N/log(n)) 1+2ν , n 2p (C3) The number of observations for each trajectory N 1 2(p−ν) number of interior knots satisfies n 1 NG n 2+2ν ; (C4) There exists a constant C > 0 such that Σ (t, t) > C, for any t ∈ [0, 1]; (ι) (C5) For k ∈ {1, . . . , κ}, ι = 0, 1, . . . , p − 2, φ (t) ∈ C 0,δ [0, 1], for some δ ∈ (0, 1], k (ι) κ < ∞; and for a sequence {κn }∞ of increasing integers with n=1 k=1 λk φk ∞ (ι) −δ κn = o (1); λk φ limn→∞ κn = κ, Nm k 0,δ k=1 4+δ 2 (C6) There exist δ 1 > 0, δ 2 > 0 such that E ξ ik 4+δ 1 +E εij < +∞, for 1 ≤ i < ∞, 1 ≤ k ≤ κ, 1 ≤ j < ∞. The number κ of nonzero eigenvalues is finite or κis infinite while the variables ξ ik 1≤i<∞,1≤k<∞ are i.i.d.. Assumptions (C1)-(C6) are standard in the spline smoothing literature; see Huang (2003) and Cao, Yang and Todem (2012), for instance. In particular, (C1) and (C2) guarantee the convergence rates of m(t) and its derivatives. Assumption (C3) states the requirement of the ˆ number of observations within each curve to the sample size, and the order of the number of knots of splines. Assumptions (C4) and (C5) concern that the derivatives of principal components have collectively bounded smoothness. When ι = 0, Assumption (C5) is the 121 same as (C4) in Cao, Yang and Todem (2012). Assumption (C6) is necessary when using strong approximation result in Lemma 4.4.4. (ν) (ν) If κ is finite and all φ (t) ∈ C 0,δ [0, 1], then Assumption (C5) on φ ’s holds trivially. k k For κ = ∞, Assumption (C5) is fulfilled as long as λk decreases to zero sufficiently fast. For example, considering the following canonical orthonormal Fourier basis of L2 ([0, 1]): √ φ1 (t) ≡ 1, φ2k+1 (t) ≡ 2 cos (kπt) √ 2 sin (kπt) , k = 1, 2, ..., t ∈ [0, 1] , φ2k (t) ≡ we can take λ1 = 1 and λk = ([k/2] π)−2ν ρ2[k/2] , k = 2, ..., for any ρ ∈ (0, 1), then √ √ √ (ν) ∞ = 1 + ∞ ρk λk φ 2 + 2 = 1 + 2 2ρ (1 − ρ)−1 < ∞. While for k=1 k=1 k ∞ any δ = 1 and {κn }∞ with κn increasing, odd and κn → ∞, one has n=1 −1 Nm κn k=1 √ (κn −1)/2 k √ (ν) −1 λk φ = Nm ρ 2kπ + 2kπ k 0,1 k=1 √ √ ∞ −1 −1 ≤ 2 2πNm ρ ρk−1 k = 2 2πNm (1 − ρ)−2 k=1 −1 = O Nm = o (1) . In the following, define the theoretical and empirical inner product matrices of Nm BJ,p (t) as J=1−p Vp = ˆ Vp = Nm Nm = v , JJ ,p J,J =1−p J,J =1−p Nm Nm BJ,p , B = v ˆ . J ,p 2,N J,J =1−p JJ ,p JJ =1−p BJ,p , B J ,p ˆ We establish next that Vp has an inverse with bounded L∞ norm. 122 (4.13) ˆ Lemma 4.4.1. [Cao, Yang and Todem (2012)] Under Assumption (C3), for Vp and Vp ˆ defined in (4.13), Vp − Vp ∞ ˆ −1 = O (Nm ). 
= O N −1 and Vp ∞ Proof of Proposition 4.2.1 Following Wang and Yang (2009b), we introduce the p-th order quasi-interpolant of m corresponding to the knots , denoted by Q (m); see equation (4.12) on page 146 of DeVore and Lorentz (1993) for details. According to Theorem 7.7.4, DeVore and Lorentz (1993), the following lemma holds. Lemma 4.4.2. There exists a constant C > 0, such that for 0 ≤ ν ≤ p − 2 and m ∈ C p,1 [0, 1], (m − Q (m))(ν) ∞ ≤ C m(p) p−ν h . ∞ m Lemma 4.4.3. Under Assumptions (C2), (C3) and (C6), one has 1 N  N BJ,p (j/N ) σ (j/N ) ¯·j = OP  ε j=1  log(n)  . nN Nm Proof. We first truncate the random error εij by un = (nN )γ (2/9 < γ < 1/3) and write εij = εij,1 +εij,2 +aij , where εij,1 = εij I εij > un , εij,2 = εij I aij = E εij I εij ≤ un . It is easy to see that aij = −E εij I εij > un 4+δ 2 −(3+δ 2 ) εij un . It is straightforward from the boundedness of spline basis that aij and ≤E εij ≤ un − 1 nN N n BJ,p (j/N ) σ (j/N ) j=1 i=1 123 −(3+δ 2 ) aij = O(un ). Next we show that the tail part vanishes almost surely. Note that ∞ P n=1 4+δ 2 ∞ ∞ E ε ij −(4+δ 2 ) un < ∞. ≤ Mδ εij > un ≤ 4+δ 2 n=1 n=1 un By the Borel-Cantelli Lemma, P ω|∃N (ω) , εij (ω) ≤ un f or n > N (ω) = 1. Let vε = max εij , 1 ≤ i, j ≤ N (ω) and there exists N1 (ω) > N (ω), uN (ω) > vε . Since un = 1 (nN )γ is an increasing function, un > uN (ω) > vε and n > N1 (ω). 1 Therefore P {ω|∃N (ω) , εij (ω) ≤ un , 1 ≤ i ≤ n, 1 ≤ j ≤ N , for min (n, N ) > N (ω)} = 1, which implies P {ω|∃N (ω) , εij,1 = 0, 1 ≤ i ≤ n, 1 ≤ j ≤ N for min (n, N ) > N (ω)} = 1. Thus 1 nN n N BJ,p (j/N ) σ (j/N ) εij,1 = Oa.s. (nN )−k , for any k > 0. i=1 j=1 Next denote Dj = (nN )−1 BJ,p (j/N ) σ (j/N ) Var εij,2 = 1 − E ε2 I ij 2 one has Vn = Var N D j=1 j εij > un n ε i=1 ij,2 . Since − a2 = 1 + O P ij −δ −2(1+δ 2 ) un 2 + un , = c (nN Nm )−1 for a constant c > 0. Now Minkowski’s inequality implies that E εij,2 Thus E Dj k ≤ 2k−1 E εk I ij k k−2 2 ≤ 2n−1 N −1 un k!E(Dj ) < ∞ with the Cramer constant c∗ = 2un . nN εij ≤ un + ak ij 124 2 k−2 ≤ 2k−2 un E εij,2 k!, k ≥ 2. log(n) nN Nm . By the Bernstein inequality, for any large enough δ > 0, For any δ, let δ n = δ P   N  Hence j=1   Dj ≥ δ n ≤ 2 exp  ∞ P n=1 1 nN     2 −δ n = 2 exp ≤ 2n−3 .  4c + 4 un δ n  nN Nm nN −δ 2 n 2 4Vn + 2c∗ δ n N B (j/N ) σ(j/N ) j=1 J,p n ε i=1 ij,2 ≥ δ n ≤2 ∞ n−3 < ∞, n=1 for such δ > 0. Thus Borel-Cantelli’s lemma implies the desired result. Lemma 4.4.4. [Theorem 2.6.7 of Cs˝rg˝ and R´v´sz (1981)] Suppose that ξ i , 1 ≤ i ≤ n o o e e are i.i.d. with E(ξ 1 ) = 0, E(ξ 2 ) = 1 and H(x) > 0 (x ≥ 0) is an increasing continuous 1 function such that x−2−γ H(x) is increasing for some γ > 0 and x−1 logH(x) is decreasing with EH (|ξ 1 |) < ∞. Then there exist constants C1 , C2 , a > 0 which depend only on the distribution of ξ 1 and a sequence of Brownian motions {Wn (m)}∞ , such that n=1 for any {xn }∞ satisfying H −1 (n) < xn < C1 (nlog(n))1/2 and Sm = n=1 m ξ , then i=1 i P max1≤m≤n |Sm − Wn (m)| > xn ≤ C2 n {H (axn )}−1 . Proof of Proposition 4.2.1. We first show (4.6). According to Theorem A.1 of Huang (2003), there exists an absolute constant C > 0, such that m−m ∞ ≤C ˜ p inf g − m ∞ ≤ C m(p) hm . ∞ g∈C p,1 (4.14) Applying Lemma 4.4.2, for 0 ≤ ν ≤ p − 2, {Q (m) − m}(ν) ∞ ≤ C m(p) As a consequence of (4.14) and (4.15) if ν = 0, one has Q 125 p−ν hm . ∞ (4.15) p (m) − m ∞ ≤ C m(p) ˜ hm , ∞ which, according to the differentiation of B-spline given in de Boor (2001), entails that {Q (m) − m}(ν) ˜ ∞ ≤ C m(p) p−ν h , ∞ m (4.16) for 0 ≤ ν ≤ p − 2. 
Combining (4.15) and (4.16) proves (4.6) for ν = 1, ..., p − 2. (ν) Next we prove (4.7). Similar to the definition of m(ν) (t) and ˜ (t), in the following ˜ ξ k (ν) ˜ we denote φ (t) = Γ(ν) (t) φk , for any k ≥ 1. Using the similar arguments as in the proof k of (4.6), we can show that, for any k ≥ 1, (ν) ˜ (ν) p−ν ≤ Cφ hm . φ −φ k k ∞ (4.17) (ν) ˜ (ν) Also, according to triangle inequality one has that φ ≤ cφ φ = O(1). k ∞ k ∞ According to Assumption (C6), E ξ ik 4+δ 1 < +∞, δ 1 > 0, so there exists some n β ∈ (0, 1/2), such that 4 + δ 1 > 2/β. Let H(x) = x4+δ 1 , xn = nβ , then = H(axn ) a−4−δ 1 n1−(4+δ 1 )β = O n−γ 1 for some γ 1 > 1. Applying Lemma 4.4.4 and BorelCantelli Lemma, one finds i.i.d. variables Zik,ξ ∼ N (0, 1) such that ¯ max ¯·k − Z·k,ξ ξ k≥1 ¯ where Z·k,ξ = n−1 λk = Oa.s. nβ−1 , (4.18) n Z i=1 ik,ξ , k ≥ 1. ¯ If κ is finite, according to (4.18) note that ¯·k ≤ Z·k,ξ ξ ¯ λk + ¯·k − Z·k,ξ ξ λk , 1 ≤ k ≤ κ, so max1≤k≤κ ¯·k = OP (n−1/2 + nβ−1 ). Then the definition of m(t) in (4.4) ξ ¯ 126 and (4.17) entail that (ν) m(ν) − m(ν) − ˜ ¯ ξ (ν) ˜ (ν) = oP n−1/2 . ξ −φ ≤ κ max ¯·k max φ k k ∞ 1≤k≤κ 1≤k≤κ ∞ Thus (4.8) holds according to Assumption (C3). If κ = ∞, using similar arguments in Cao, Yang and Todem (2012), by (4.18) one can −1/2 −1/2 ¯ −1/2 ¯ ξ = ξ − Z·k,ξ , for any k ≥ 1, so ¯·k λ ξ ≤ Z·k,ξ + ¯·k λ show that ¯·k λ k k k OP (n−1/2 + nβ−1 ). Also following Assumption (C5) one has ≤ ≤ + ≤ Hence, (ν) E sup m(ν) (t) − m(ν) (t) − ˜ ¯ ξ (t) t∈[0,1] κ κ (ν) (ν) ˜ (ν) (t) ≤ C ¯ (t) − φ E ¯·k sup φ ξ (t) E ξ ·k sup φ k k k t∈[0,1] t∈[0,1] k=1 k=1   κn −1/2 1/2 (ν) φ E ¯·k λ ξ λ hδ C k k k 0,δ m  k=1  κ  −1/2 1/2 (ν) φ E ¯·k λ ξ λ k k k ∞ k=κn +1   κ  κn  1/2 (ν) 1/2 (ν) Cn−1/2 λ φ hδ + λ φ = o n−1/2 . k k 0,δ m k k ∞  k=1 k=κn +1 (ν) m(ν) − m(ν) − ˜ ¯ ξ ∞ = oP n−1/2 . According to Lemmas 4.4.1 and 4.4.3, finally we have e(ν) ˜ ∞ (ν) ˆ −1 Bp (t) Vp N −1 BT e ˆ −1 ≤ Ch−ν Vp N −1 BT e m ∞ ∞ ∞ −1/2−ν = Oa.s. n−1/2 N −1/2 hm log(n) . = 127 Thus (4.8) holds according to Assumption (C3). Proof of Theorem 4.3.1 ˜ We denote ζ k (t) = ˜ ζ (t) = n1/2 (ν) ¯ λk Z·k,ξ φ (t) , k = 1, . . . , κ and define k 2 −1/2 κ (ν) λ φ (t) k k=1 k κ ˜ ζ (t) = n1/2 Σ (t, t)−1/2 k=1 k κ ˜ ζ (t). k=1 k ˜ It is clear that, for any t ∈ [0, 1], ζ (t) is Gaussian with mean 0 and variance 1, and the ˜ ˜ covariance E ζ (t) ζ (s) = Σ (t, t)−1/2 Σ (s, s)−1/2 Σ (t, s), for any t, s ∈ [0, 1]. That is, the ˜ distribution of ζ (t), t ∈ [0, 1] and the distribution of ζ (t), t ∈ [0, 1] in Section 3.1 are identical. Similar to the proof of (4.7), Note that (ν) ˜ E supt∈[0,1] ζ (t) − n1/2 Σ (t, t)−1/2 ˜ (t) ξ = n1/2 E supt∈[0,1] Σ (t, t)−1/2 κ ˜ (ν) ζ (t) − ˜ (t) ξ k=1 k κ 1/2 E sup Σ (t, t)−1/2 ≤ n t∈[0,1] k=1 ¯ Z·k,ξ (ν) (ν) ˜ (ν) λk − ¯·k φ (t) + ¯·k φ (t) − φ (t) ξ ξ k k k 128 (ν) p−ν ˜ ξ If κ is finite, then E supt∈[0,1] ζ (t) − n1/2 Σ (t, t)−1/2 ˜ (t) = O nβ−1/2 + hm = o(1). If κ = ∞, by Assumption (C5) one has κ (ν) (ν) 1/2 E sup Σ (t, t)−1/2 ˜ (ν) ¯ n ξ λk Z·k,ξ − ¯·k φ (t) + ¯·k φ (t) − φ (t) ξ k k k t∈[0,1] k=1 ≤ n1/2 sup Σ (t, t)−1/2 t∈[0,1]   κ κ   (ν) (ν) β−1 −1/2 ˜ (ν) (t) n λk φ (t) + n λk φ (t) − φ k k k   k=1 k=1 = o(1). Theorem 4.3.1 follows directly. Proof of Theorem 4.3.3 Following Cao, Wang, Li and Yang (2012), we define the tensor product spline space as H(p−2),2 [0, 1]2 ≡ H(p−2),2 = H(p−2) ⊗ H(p−2)    NG    b B (t) B (s) , t, s ∈ [0, 1] . = JJ J,p J ,p     J,J =1−p ¯ Let Rij ≡ Yij − m(j/N ), 1 ≤ i ≤ n, 1 ≤ j ≤ N and R·jj = n−1 n R R , i=1 ij ij 1 ≤ j, j ≤ N . 
Note that the spline estimator in (4.9) is ˆ G(·, ·) = arg min g(·,·)∈H(p−2),2 N j=j 129 ˆ R − g j/N, j /N ·jj 2 , so we define the “infeasible estimator” of the covariance function ˜ G (·, ·) = argmin g(·,·)∈H(p−2),2 ¯ R·jj − g j/N, j /N 2 . (4.19) 1≤j=j ≤N (ν) (0,ν) Denote X = B ⊗ B, Bp (t, s) = Bp (t) ⊗ Bp (s) and Bp (t, s) = Bp (t) ⊗ Bp (s). Let ¯ ¯ vector R = R ·jj 1≤j,j ≤N , then we have ν −1 T ˆ ˜ (0,ν) (t, s) ≡ ∂ G(t, s) = B(0,ν) (t, s) XT X ¯ X R. G p ∂sν (4.20) (0,ν) (ν) In the following we write φ (t, s) = φk (t) φ (s), φ (t, s) = φk (t) φ (s). Let kk k kk k φ = φk ⊗ φ , where φk is a N -dimensional vector defined in Section 2.3, kk k −1 T φ (t, s) = Bp (t, s) XT X X φ . Further we denote kk kk −1 T (0,ν) (0,ν) φ (t, s) = Bp (t, s) XT X X φ . kk kk Lemma 4.4.5. Under Assumptions (C5), for any 1 ≤ ν ≤ p − 2, and κ = ∞, one has λk λ k k,k ≥1 (0,ν) (0,ν) φ −φ = o (1) . kk kk ∞ Proof. When κ = ∞, one has λk > 0 for any k ≥ 1. Note that λk λ k k,k ≥1 (0,ν) φ ≤ kk ∞ k≥1 λk φk ∞ λ k k ≥1 130 (ν) φ = o (1) . k ∞ Also similarly, λk λk k,k ≥1 (0,ν) ≤ φ kk ∞ k≥1 ≤ C λk φk ∞ λk k ≥1 λk φk ∞ k≥1 λ k k ≥1 (ν) φ k ∞ (ν) = o (1) . φ k ∞ Lemma 4.4.6. Under Assumption (C5), for any 0 ≤ ν ≤ p − 2 and k, k ≥ 1, one has (0,ν) (0,ν) p−ν . = O hG φ −φ kk kk ∞ Proof. According to Theorem 12.8 of Schumaker (2007), there exists an absolute constant C > 0, such that (p) (p) p φ −φ ≤ C φ φ + φk φ hG , kk kk ∞ k k k ∞ which proves (4.6) for the case υ = 0. Let Q (f ) the p-th order quasi-interpolant of a function f ; see the definition in (12.29) of Schumaker (2007), for 0 ≤ ν ≤ p − 2, Q φ kk −φ kk (0,ν) ≤C φ ∞ (p) (p) p−ν φ + φk φ hG . k k k ∞ For the case ν = 0, one has Q φkk − φkk ∞ Q φkk − Q(φkk ) ≤ C φkk − φkk ∞ ∞ (p) (p) p ≤ C φ φk + φk φ hG , k k ∞ = which, according to the differentiation of B-spline given in de Boor (2001), entails that Q φ kk −φ kk (0,ν) (p) (p) p−ν ≤ C φ φ + φk φ h , for 0 ≤ ν ≤ p − 2. k k k ∞ G ∞ 131 Lemma 4.4.7. Under Assumptions (C1)-(C6), for any 1 ≤ ν ≤ p − 2 ˜ G(0,ν) − G(0,ν) ∞ p−ν = OP n−1/2 + hG + oP N −1 n−1/2 h−1−ν log1/2 n , (4.21) G ˆ ˜ G(0,ν) − G(0,ν) = OP ∞ −3/2−ν p n−1 hG + n−1/2 h−1−ν hm , G (4.22) ˆ ˆ where G(0,ν) and G(0,ν) are given in (4.10) and (4.20). Proof. Let ¯ ξ = n−1 ·kk n ξ ξ −1 ε i=1 ik ik and ¯·jj = n n ε ε . To show (4.21), i=1 ij ij ¯ we decompose R in (4.19) into ·jj κ ¯ R 1jj = k=k ¯ R3jj = σ ¯ R = n−1 4jj j N  n  κ σ j N ξ ik φk  i=1 k=1 κ j j , N N ¯ ξ φ ·kk kk ¯ φ ξ ·kk kk ¯ , R2jj = k=1 j j , N N , ¯·jj , ε j N κ j N σ ε ij ξ ik φk + k=1 j N σ j N εij   .  −1 T ¯ ¯ ˜ (0,ν) (t, s) = B(0,ν) (t, s) XT X ¯ Denote Ri = R X Ri , i = 1, 2, 3, 4. , Ri p ijj 1≤j,j ≤N ˜ ˜ (0,ν) (t, s) + R(0,ν) (t, s) + R(0,ν) (t, s) + R(0,ν) (t, s). Next we define ˜ ˜ ˜ Then G(0,ν) (t, s) = R1 2 3 4 (0,ν) R1 (t, s) = κ (0,ν) ¯ ξ ·kk φ (t, s) , kk k=k (0,ν) R2 (t, s) = G(0,ν) (t, s) + κ (0,ν) φ (t, s) ¯·kk − λk ξ kk . k=1 ˜ (0,ν) (t, s) = Note that R1 (0,ν) κ ¯ ξ φ (t, s), then Lemma 4.4.5 and Assumption (C5) ·kk kk k=k 132 imply that if κ is ∞, λk λ > 0 and one has k ˜ (0,ν) (t, s) − R(0,ν) (t, s) sup E R1 1 (t,s)∈[0,1]2 κ −1/2 (0,ν) (0,ν) ξ λk λ = o (1) . ≤ E ¯ λk λ φ −φ ·kk k k kk kk ∞ k=k Similarly, one has ˜ (0,ν) (t, s) − R(0,ν) (t, s) sup E R2 2 (t,s)∈[0,1]2 κ (0,ν) (0,ν) E ¯·kk λ−1 − 1 λk φ ξ ≤ −φ = o (1) . kk k kk ∞ k=1 If κ is finite, then Lemma A.6 and Assumption (C3) imply that ˜ (0,ν) − R(0,ν) R1 1 ≤ κ2 ∞ (0,ν) (0,ν) p−ν φ −φ = OP hG n−1/2 . 
kk kk ∞ ¯ max ξ ·kk 1≤k=k ≤κ Similarly, ˜ (0,ν) − R(0,ν) R2 2 ≤ κ max ¯·kk − λk ξ 1≤k≤κ ∞ ˜ (0,ν) − R(0,ν) Hence, R2 2 ˜ (0,ν) − R(0,ν) + R1 1 ∞ Wang, Li and Yang (2012), one has ¯ N −2 XT R3 ∞ (0,ν) (0,ν) p−ν φ −φ = OP hG . kk kk ∞ ¯ = N −2 XT R4 ∞ = o (1). By Proposition 3.1 in Cao, ∞ = oP N −1 n−1/2 hG log1/2 (n) . 133 ˜ (0,ν) Hence, R3 ∞ ˜ (0,ν) = R4 ˜ G(0,ν) − G(0,ν) ∞ = oP N −1 n−1/2 h−1−ν log1/2 n . Therefore, G κ ∞ ¯ ξ ·kk ≤ ˜ (0,ν) − R(0,ν) + R1 1 k=k κ (0,ν) ¯ + ξ ·kk − λk φ kk ∞ k=1 ˜ (0,ν) − R(0,ν) + R2 2 ˜ (0,ν) + R3 ∞ ∞ ∞ −1/2 + hp−ν + o −1 n−1/2 h−1−ν log1/2 n . = OP n P N G G (0,ν) φ kk ∞ ˜ (0,ν) + R4 ∞ The proof of (4.22) is similar to Proposition 2.1 in Cao, Wang, Li and Yang (2012), thus omitted. Proof of Theorem 4.3.3: According to Lemma 4.4.7, one has ˆ G(0,ν) − G(0,ν) ∞ ˜ ≤ G(0,ν) − G(0,ν) ∞ ˆ ˜ + G(0,ν) − G(0,ν) ∞ = oP (1). Proof of Theorem 4.3.4 ˆ ˆ We first show asymptotic consistency of λk and φk (·), for k ≥ 1, in the following lemma. Lemma 4.4.8. Under Assumptions (C1)-(C6), one has ˆ λk − λk = oP (1) , ˆ φk − φk = oP (1) , ∞ k ≥ 1. Proof. We first want to show that for any k ≥ 1, ∆φk ∞ = oP (1), in which ∆ is the ˆ integral operator with kernel G − G. Note that ∆φk (t) = ˆ Theorem 4.3.3, when ν = 0, G − G ∞ ˆ (G − G) (s, t) φk (s) ds. By = oP (1). Thus, for any k ≥ 1, ∆φk ∞ = oP (1). 134 Hall and Hosseini-Nasab (2006) gives the L2 expansion ˆ φk − φk = λk − λ −1 ∆φk , φ k k φ +O k ∆ 2 , 2 k =k ˆ G (s, t) − G (s, t) where ∆ 2 = 2 1/2 dsdt . By Bessel’s inequality, one has u ∆φk 2 + ∆ 2 = oP (1). By (4.9) in Hall, M¨ller and Wang (2006) ∞ 2 ˆ φk − φk ≤ C 2 and Theorem 4.3.3 ˆ λk − λk = ˆ (G − G) (s, t) φk (s) φk (t) dsdt + O ∆φk 2 = oP (1) . 2 ˆ Thus λk − λk = oP (1) for any k ≥ 1. Next note that ˆ ˆ λk φk (t) − λk φk (t) = ˆ ˆ G (s, t) φk (s) ds − ˆ ˆ (G − G) (s, t) φk (s) − φk (s) ds + = + G (s, t) φk (s) ds ˆ (G − G) (s, t) φk (s) ds ˆ G (s, t) φk (s) − φk (s) ds. By the Cauchy-Schwarz inequality, uniformly for all t ∈ [0, 1] ˆ G (s, t) φk (s) − φk (s) ds ≤ G2 (s, t) ds Similar arguments and Theorem 4.3.3 imply that 1/2 ˆ φk − φk = oP (1) . 2 ˆ ˆ (G − G) (s, t) φk (s) − φk (s) ds = ˆ ˆ ˆ oP (1) and (G−G) (s, t) φk (s) ds = oP (1). All the above together yield λk φk − λk φk = ∞ 135 oP (1). By the triangle inequality and ˆ ˆ ˆ ˆ + λk − λk ≤ λk φk − λk φk λk φk − φk ∞ ∞ ˆ = oP (1) , φk ∞ the second result in Lemma 4.4.8 follows directly. Proof of Theorem 4.3.4. According to Lemma 4.4.8 and Theorem 4.3.3, one has 1 1 ˆ ˆ ˆ (ν) (s) − φ(ν) (s) = λ−1 ˆ G(0,ν) (t, s)φk (t) dt G(0,ν) (t, s)φk (t) dt − λ−1 φ k k k k 0 0 1 ˆ −1 ˆ ˆ G(0,ν) (t, s)φk (t) dt ≤ λk − λ−1 × k 0 1 ˆ ˆ G(0,ν) (t, s)φk (t) − G(0,ν) (t, s)φk (t) dt. +λ−1 k 0 Hence, sups∈[0,1] 1 0 ˆ ˆ ˆ G(0,ν) − G(0,ν) (t, s)φk (t) + G(t, s) φk − φk (t) dt = oP (1) ˆ (ν) − φ(ν) = oP (1). Theorem 4.3.4 follows from the definition of Σ(t, s). and φ k k ∞ 136 Chapter 5 Testing Hypotheses Under Weakly Identifiability 5.1 Introduction Data in statistical research are often well described by models, in which the scientific questions of interest are described by an unknown, finite-dimensional parameter vector. Such models may be either fully parametric or semiparametric, where other aspects of the model may be described by infinite dimensional parameters which are completely unspecified. In such settings, it is often of interest to use the observed data in order to draw inferences about the parameters of interest. 
Standard inferential techniques may be applied if the parameters of interest can be well estimated by minimizing a parametric loss function or more generally by solving a parametric estimating function which does not involve infinite dimensional nuisance parameters. In many situations, however, these parameters may be nonidentifiable or at best weakly identifiable from the estimating function so that the standard inferential theories may not be valid. The goal of this paper is to develop hypothesis tests for scenarios 137 in which the model parameters are weakly identifiable. Conceptually, the term weak identifiability refers to the situations where data contain some information about model parameters but not enough to identify them uniquely. To illustrate the problem quite sharply, we consider a simple theoretical example where a fully parametric model is indexed by an unknown parameter vector (θ, β) for an observable random quantity Y . We assume that realizations {Yi }n of Y are independent and i=1 identically distributed (i.i.d) normal N (θ + β, 1) variates. The objective is to evaluate the hypothesis H0 : θ0 = 0, where θ0 is the true value of θ. Using only observed data and assuming that β 0 , the true value of β is unknown, inferences for θ0 may not be conducted using standard techniques due to identifiability problems arising from the mean model being overparameterized. Another interesting, more practical illustration of this problem comes from the missing data literature where weakly identifiable models are frequently encountered. Specific examples include the study of publication bias in meta-analysis (Chambers and Welsh, 1993; Copas, 1999; Copas and Li, 1997) and the analysis of longitudinal data subject to nonrandom nonresponses (Scharfstein et al., 1999; Kenward et al., 2001; Rotnitzky et al., 2001; Little and Rubin, 2002). Identifiability issues commonly arise with non-random missing data, where the parameters in the model for the missingness may not be jointly identifiable with those in the model for the outcomes of interest using only the observed data, particularly with semiparametric models, where some of the nuisance parameters may be infinite dimensional. Analyses which assume identifiability may be unreliable, with the joint selection and outcome model yielding flat “estimation” surfaces potentially having multiple modes. These phenomena have previously been reported by several authors in modeling potentially non-ignorable missing data models (Scharfstein et al., 1999; Todem et al., 2010) . 138 In section 3, we consider these missing data issues when analyzing longitudinal data with informative dropout employing the model of Troxel et al.(1998b). The model is semiparametric, with the parameter being estimated denoted by (θ, β), where β is the selection parameter that measures the extent of non-randomness of the missing data mechanism and θ consists of the remaining finite dimensional parameters of the selection and outcome models. The hypotheses of interest concern covariate effects on the outcome, which are contained in θ. In Troxel et al. (1998b), a so-called pseudo-likelihood analysis, described in detail in Section 3, was carried out under the assumption of parameter identifiability. The resulting estimating function only involves (θ, β), with the longitudinal dependence in the outcomes completely unspecified and not estimated. We investigated the parameter identifiability assumption in a reanalysis of the cancer data from Troxel et al. 
(1998b) by profiling the pseudo-likelihood analysis in β (Figure 5.1). The profile pseudolikelihood is flat in β, suggesting a model that is at best weakly identifiable. These results draw into question inferences which assume identifiability of θ and β. Due to identifiability concerns, tests concerning the model parameters cannot use conventional theory to assess statistical significance. Essentially, the standard estimation and inference techniques may fail due to the models being overparameterized. A natural remedy is to partition the parameter indexing the estimating function into certain parameters of interest and other parameters which may be viewed as secondary parameters. For the theoretical example discussed earlier, where Y is normally distributed, the parameter of interest in light of the hypothesis under study is θ, while β is the secondary parameter. In the missing data application (Troxel et al., 1998b), the parameter β which describes the informativeness may be viewed as the secondary parameter, while the covariate effects in θ may be of primary interest in hypothesis testing. In practice, the choice of θ and β will 139 −1680 −1720 −1760 −1800 Sup(LogL(θ, β)) −2 0 2 4 6 8 10 β Figure 5.1: Supremum of the pseudo-likelihood function profiled across β, the parameter measuring the extent of non-randomness of the missing data mechanism in the study. 140 depend on the application. Various approaches to the problem of non identifiable parameters that have appeared in the literature focused primarily on maximum likelihood based procedures. Almost all previous works in hypothesis testing deal with the case where non identifiability only occurs under the null hypothesis. Examples include Davies (1977, 1987), Hansen (1996), Ritz and Skovgaard (2005) and Song et al. (2009). Generally, this requires that the model is identifiable under the alternative hypothesis. In sensitivity analysis, the testing problem has a different formulation. The model may not be identifiable under either the null or the alternative hypothesis. Moreover, even after fixing a set of parameters, it may not be clear whether the parameters of interest can be consistently estimated under the null hypothesis. To be concrete, in the normal example, for each value of β, the maximum likelihood estimator of θ consistently estimates θ0 + β − β 0 , where θ0 and β 0 are the true values of θ and β. This only equals θ0 when β = β 0 . Our approach to inference about the parameters of interest is to adapt the profiling strategy from the earlier works described above. Because the testing problem is fundamentally different, the resulting developments are nonstandard, with relatively little work in the literature on this problem. Since the model may not be identifiable even after profiling, we need to consider the behavior of the profile estimator under model misspecification under the null. This inferential strategy poses substantial technical challenges beyond those encountered with supremum tests which assume identifiability under the alternative. In missing data applications used to motivate the sensitivity analysis, rigorous results for full likelihood analyses have been established (Lu and Copas, 2004), essentially requiring model identifiability. More recently, Todem et al. (2010) demonstrated how to conduct likelihood inference via infimum tests, including a precise analysis of the behavior of the profile estimators under 141 model misspecification and the distribution of the corresponding infimum test. 
Such tests are particularly important when the quantity being tested does not increase or decrease monotonically as the nonidentified parameters are increased or decreased. Under monotonicity, it is only necessary to perform the tests at the limits of the nonidentified parameter space. They developed simultaneous confidence bands which enable identification of those values of the sensitivity parameter for which significant results are obtained. Although these likelihood-based methods are useful, they require a full distribution specification for the data. This can be a difficult task in practice, especially when observed data do not have enough information to fully identify the parameter of interest. In this paper, we extend the profiling idea to arbitrary estimating functions involving θ and β but which do not require a complete parametric model specification. Our set-up includes the likelihood score functions as a special case. The generalization of the infimum test and confidence bands to non-likelihood settings is nontrivial. The infimum test has the advantage that it is simply defined directly in terms of contrasts whereas the supremum tests are obtained through nontrivial derivations using the log-likelihood functions (DacunhaCastelle and Gassiat, 1999). We present generic conditions which establish the large sample properties of the estimating function for θ profiled on β, including the uniform consistency and weak convergence of the θ estimator as a function of β. To our knowledge, these theoretical results are novel, with issues related to nonidentifiable estimating functions not having been studied rigorously, previously. We accommodate misspecification and uniformity in β in a general paradigm which permits the profiling to be carried out with respect to any suitable estimating function. Owing to the complexity of the asymptotic distributions of the infimum test and confidence bands, resampling is needed. A theoretically justified procedure is discussed for approximating such distributions. 142 The rest of this chapter is organized as follows. In Section 5.2, we present the general framework of the problem, the proposed test and the resampling procedure, along with a proof of the key asymptotic properties. In Section 5.3, the methodology is exhibited using the cancer dataset in Troxel et al. (1998b) and in simulations, where the naive Wald test may have either inflated type I error rate or reduced power. Some remaining issues are discussed in Section 5.4. 5.2 5.2.1 The method The general framework We consider a model involving a finite dimensional parameter random quantity Y . The parameter ∈ Ω for an observable may not completely determine the distribution of Y , that is, there may be other aspects of the model which are unspecified. The interest is drawing inferences about function SY ( ). Denote by ˆ of with i.i.d realizations {Yi }n of Y and a general estimating i=1 0 the true value of . If E{SY ( 0 )} = 0, then an estimator 0 usually can be obtained by solving the estimating equation, SY ( ) = 0; see Chapter 5 of van der Vaart (2000b) for an overview of Z-estimators. If SY identifies 0, then under other mild regularity conditions, this estimating equation yields a consistent and asymptotically normal parameter estimator. Under such regularity conditions, inferences about 0 can be conducted using the large sample properties of ˆ . Problems may occur if the model as a function of is “overparameterized”, with multiple values of satisfying E{SY ( )} = 0. 
In this case, the estimator may not have the usual asymptotic properties. Nonidentifiability can be addressed by fixing some components of 143 , conditional upon which the remaining parameters are uniquely defined by SY . One may partition = (θ, β), where θ, a p-dimensional vector, is assumed to be “identifiable” for a fixed q-dimensional vector β, as defined in Section 2.2 below. If the true value β 0 of the nonidentified parameter β is known, the estimator ˆ0 at β = β 0 can be used to conduct reliable inferences about θ0 , θ the true value of θ. This estimator is readily available by solving the estimating equation SY (θ, β 0 ) = 0, for fixed and known β 0 . The approach is unfeasible, as the true value β 0 is usually unknown to the analyst in practice. A common strategy is to fix β and study the estimator of θ at various values of β ∈ Ξ. To highlight the dependence on β, we denote by ˆ θ(β), the estimator of θ for a fixed β. The estimator of θ when β = β 0 is ˆ0 = ˆ 0 ). θ θ(β ¯ ¯ ¯ For the simple normal example, ˆ θ(β) = Y − β and ˆ 0 ) = Y − β 0 , where Y is the θ(β sample mean. This estimator is normally distributed with mean θ0 + β 0 − β and variance n−1 , uniformly in β, for each fixed n. Of course, in general, it is not possible to obtain clean finite sample results and large sample approximations are needed. In the subsection below, we study the uniform asymptotic properties of ˆ θ(β) for β ∈ Ξ. 5.2.2 Large sample properties of ˆ θ(β) When β is fixed at its true value β 0 , it is well established that for an estimating function SY (θ, β 0 ) which is smooth in θ, the estimator ˆ is consistent and approximately normal θ under mild regularity conditions (see, for example, van der Vaart and Wellner, 2000a). That 1 is, n 2 {ˆ 0 )−θ0 } →d N (0, Σ0 ), where Σ0 = (D(θ0 ))−1 var(SY (θ0 , β 0 ))(D−1 (θ0 ))T , with θ(β D(θ0 ) being the expected value of the first order derivative of SY (θ, β 0 ) with respect to θ. These properties of ˆ 0 ) can be used to conduct large-sample inferences about θ0 . θ(β For a given β, the estimator ˆ θ(β) will converge to a quantity θ∗ (β), which is generally 144 different from θ0 if β = β 0 . For the simple normal example, θ∗ (β) = θ0 + β 0 − β. This contrasts with set-ups on testing with nonidentifiability under the null (Davies, 1977, 1987), where it is generally assumed that θ∗ (β) = θ0 for all β. Moreover, appropriately standardized, ˆ θ(β) will be asymptotically normal, with variance which may be estimated using a sandwich variance approach. This is an extension of standard pointwise asymptotic theory for maximum likelihood estimation with misspecified models, originating in the seminal work of Huber (1967) and White (1982). We study below the uniform convergence of this estimator across all values of β ∈ Ξ. Suppose the data consist of iid realizations {Yi }n of Y . Let sY (θ, β) be the contribui=1 i tion of subject i to the estimating function SY (θ, β). Define Sn (θ, β) = n−1 n sY (θ, β) i=1 i ˜ and S(θ, β) = E{sY (θ, β)}. Let gY (θ, β) = ∂sY (θ, β)/∂θ, WY (θ, β) = n−1 n gY (θ, β) i=1 i 1 i i ˜ and W (θ, β) = E{gY (θ, β)}. For any given β ∈ Ξ, let ˆ θ(β) denote the solution to 1 SY (θ, β) = 0, that is SY (ˆ θ(β), β) = 0. The “least false” (White, 1982) parameter θ∗ (β), ˜ satisfies S(θ∗ (β), β) = 0. Define G1 = {sY (θ, β) : i = 1, . . . , n, θ ∈ Θ, β ∈ Ξ} and i G2 = {gY (θ, β) : i = 1, . . . , n, θ ∈ Θ, β ∈ Ξ}. i We assume the following regularity conditions: C1. The sets Θ ⊂ Rp and Ξ ⊂ Rq are compact and θ∗ (β) is an interior point of Θ for any β ∈ Ξ. 
C2. The function classes, G1 and G2 , are pointwise measurable and satisfy the uniform entropy condition (van der Vaart and Wellner, 2000a). ˜ C3. inf θ∈Θ,β∈Ξ λmin {−W (θ, β)} > 0, where λmin (·) denotes the minimum eigenvalue of a matrix. 145 C4. The estimating function SY (θ, β) has continuous first order derivatives with respect to θ for any given β ∈ Ξ. Condition C1 defines the parameter space for the implied parameter θ∗ (β) for a given β. Because θ∗ (β) may be nonconstant in β, the parameter space for θ∗ (β) across β is contained in a suitably defined functional space. Conditions C2 and C3 give conditions under which uniform asymptotic results for θ∗ (β) may be obtained. The entropy condition C2 ensures that the estimating function is well behaved across all β. The condition is satisfied by functions which are uniformly bounded and uniformly Lipschitz of order > {dim(θ) + dim(β)}/2, where dim(·) denotes the dimension of a vector. Condition C3 guarantees the identifiability of θ∗ (β) for all β. The longitudinal data model presented in Section 3 meets these requirements. Note that the smoothness specified in condition C4 only applies to θ. Differentiability in β is not assumed. Non-smoothness in θ could be accommodated under stronger assumptions. The proof of the following theorem is provided in the appendix. Theorem 5.2.1. Under Conditions C1-C4, supβ∈Ξ ˆ θ(β) − θ∗ (β) →p 0, where · 1 represents the Euclidean norm. Furthermore, n 2 ˆ θ(β) − θ∗ (β) converge weakly to a tight Gaussian process with positive definite covariance function Σ∗ (β 1 , β 2 ) = = 1 1 lim cov n 2 ˆ 1 ) − θ∗ (β 1 ) , n 2 ˆ 2 ) − θ∗ (β 2 ) θ(β θ(β n→∞ ˜ W (θ∗ (β 1 ), β 1 ) −1 T E sY (θ, β 1 )sT (θ, β 2 ) Y1 1 146 ˜ W (θ∗ (β 2 ), β 2 ) −1 . For fixed β, Σ∗ (β, β) = = 1 θ(β) − θ∗ (β) lim var n 2 ˆ n→∞ ˜ W (θ∗ (β), β) −1 T E sY (θ, β)sT (θ, β) Y1 1 ˜ W (θ∗ (β), β) −1 . The covariance function may be easily estimated using a robust sandwich variance estimator along the lines of White (1982), which is valid under model misspecification. This estimator may be used to construct pointwise confidence intervals for θ∗ (β) at fixed β using the pointwise asymptotic normality of ˆ θ(β). However, for the testing and confidence band procedures described below, the complexity of the limiting distribution across β is prohibitive for conducting inference, even with variance estimation. For such scenarios, we suggest resampling to approximate the distribution of the estimator. It can easily be shown that the regularity conditions are satisfied for the simple normal ¯ example. Interestingly, ˆ θ(β) − θ∗ (β) = Y − θ0 − β 0 , which does not depend on β. This greatly simplifies the results of Theorem 1, since the standardized estimators are identical for all β, which is not generally true. One should note that the form of the mean model is critical. If we assumed that E(Y ) = θβ, then the eigenvalue condition, C3, would be violated at β = 0 and the uniform convergence in Theorem 1 would fail to hold on intervals containing zero. 5.2.3 Global sensitivity testing Suppose we are interested in evaluating the null hypothesis: H0 : Cθ0 = c, where θ0 is the true value of θ and C an r × dim(θ0 ) contrast matrix for assessing single and multiple 147 linear combinations of model parameters. For example, when testing the jth component of θ, one takes C to be 1 × dim(θ) vector with a one at the jth position and zeros elsewhere. Under nonidentifiability, the above hypothesis cannot be tested without imposing unverifiable restrictions. 
If the true sensitivity parameter β 0 is known, then H0 : Cθ∗ (β 0 ) = c, where θ∗ (β 0 ) = θ0 . In practice, where β 0 is unknown, one may consider the process θ∗ (β), observing that the trivial inequality, 0 ≤ inf Cθ∗ (β) − c ≤ Cθ∗ (β 0 ) − c ≤ sup Cθ∗ (β) − c , β∈Ξ β∈Ξ permits a conservative assessment of H0 . To do so, we formulate the infimum hypothesis: Hinf : inf β∈Ξ Cθ∗ (β) − c = 0. The infimum statistic Tinf = inf β∈Ξ C ˆ θ(β)−c can be used to evaluate this hypothesis. The distribution of this statistic can be derived analytically in some simple situations. As an example, we revisit the normal scenario discussed earlier where the interest is in evaluating ¯ the hypothesis, H0 : θ0 = 0, using the processes ˆ θ(β) = Y − β. For ease of illustration, ¯ assume Ξ = [0, 1], such that the infimum statistic becomes Tinf = inf β∈[0,1] |Y − β|. This ¯ is a mixture of a point mass at 0 with probability Pr(Y ∈ [0, 1]) and two truncated normal 148 distributions. Specifically,       ¯  −Y          ¯ inf |Y − β| =  β∈[0,1]               0 ¯ if Y < 0, ¯ if Y ∈ [0, 1], ¯ ¯ Y − 1 if Y > 1. ¯ The corresponding cumulative distribution function-CDF Finf (x) = Pr(inf β∈[0,1] |Y − β| ≤ x) is ¯ ¯ Finf (x) = Pr(Y ≤ x + 1) − Pr(Y ≤ −x), for x ≥ 0. (5.1) ¯ ¯ ¯ In particular, we have Finf (0) = Pr(Y ≤ 1) − Pr(Y ≤ 0) = Pr(Y ∈ [0, 1]), reflecting the point mass at 0 for Tinf . In general, because of the complexity of the limiting distribution of the infimum of the test process, simple general analytic results do not appear tractable. Instead, resampling may be utilized. A simple nonparametric bootstrap (Efron and Tibshirani, 1993) may be used to compute variance estimators, and to carry out the simultaneous inferences necessary for the infimum tests and the confidence bands, described below. The validity of the bootstrap follows automatically from empirical process theory under the regularity conditions given in van der Vaart and Wellner (2000b) even under model misspecification. This requires the boundedness of the estimating function for fixed β ∈ Ξ. A difficulty with the nonparametric bootstrap is that it requires solving the estimating function for all β in each bootstrap sample, which may be computationally demanding. An alternative resampling technique which does not require repeatedly solving the estimating function may be constructed. The basic 149 idea is to generate realizations directly from the limiting distribution of ˆ θ(β) and to use these realizations to approximate the distribution of the infimum test and confidence bands. This resampling technique has been extensively used in the literature when the true asymptotic distribution is hard if not impossible to derive analytically (see for example, Parzen et al., 1994 and Zhu and Zhang, 2006). To do this, one fixes the estimator based on the observed data and then “perturbs” this estimator using a disturbance which conditionally on data has mean zero and variance-covariance in β equalling that of ˆ θ(β) in Theorem 5.2.1. The procedure is given by the following steps: Step 1. Generate n i.i.d random variables from a standard normal model ζ, denoted (b) (b) {ζ 1 , . . . , ζ n }, where superscript (b) represents replications. (b) Step 2. Given the realizations of the data, {Yi }n , and values of β ∈ Ξ, calculate ˜ (β) θ i=1 (b) (b) using the simulated {ζ 1 , . . . 
, ζ n } and the equation, ˜(b) (β) = ˆ θ θ(β) + n−1 (b) −1 θ(β), β), n s (ˆ WY (ˆ i=1 Yi θ(β), β)ζ i (5.2) o where the statistic ˆ θ(β) takes value ˆ (β) for observed data {Yi }n . θ i=1 Step 3. Calculate T (b) (b) (b) = inf β∈Ξ C ˜ (β) − c using ˜ (β), β ∈ Ξ. θ θ inf By repeatedly generating the normal variates {ζ j }n , B times, and repeating steps 2 and j=1 (b) 3 for each generated sample, we obtain the empirical distribution of T given observed inf data. Theorem 2 below establishes that this empirical distribution converges to the marginal asymptotic distribution of Tinf as n → ∞. Let 1(E) be the indicator function for event (b) o E. The p-value of the test is then B−1 B 1(T ≥ Tinf ), the proportion of these b=1 inf o bootstrap observations which exceed Tinf the observed value of the statistic. 150 For the simple normal example, we compare the resampling null distribution of T (b) inf to the analytical distribution Finf (.) in (5.1) for a finite sample size. Setting θ0 = 0 under the null and β 0 = 0.5, we generate {Yi }n from a normal distribution N (0.5, 102 ). i=1 (b) Furthermore, we take Ξ = [0, 1] and for each resample b = 1, · · · , B, we compute T = inf (b) (b) ¯ (b) ¯ inf β∈[0,1] |˜ (β)|, where ˜ (β) = Y − β − n−1 n (Yi − Y )ζ i . Results with n = 100 θ θ i=1 and B = 10000 resamples are plotted in Figure 5.2. The resampling distribution provides a good approximation to the analytical distribution for this simple hypothetical example. If the infimum (null) hypothesis cannot be rejected, then a supremum test or equivalently a simultaneous confidence region may be used to check whether Cθ∗ (β) − c > 0 in some o regions of Ξ. The supremum hypothesis Hsup may be tested with the statistic Tsup = (b) supβ∈Ξ C ˆ θ(β) − c using the bootstrap realizations of ˜ (β), β ∈ Ξ. The p-value of θ (b) (b) o the supremum test is then B−1 B b=1 1(Tsup ≥ Tsup ), where Tsup are the bootstrap realizations of the statistic. Alternatively, a simultaneous confidence region for Cθ∗ (β) − c across all values of β may be constructed. Let 0 < ϕ < 1. A simultaneous confidence region for Cθ∗ (β) − c, β ∈ Ξ is given by {ϑ(β) : Ξ → Rr ; ϑ(β) − C ˆ θ(β) + c < ρϕ }, where ρϕ is o o (b) θ the (1 − ϕ)th empirical percentile of {supβ∈Ξ C ˜ (β) − C ˆ (β) }B , with ˆ (β) being θ θ b=1 the value of the statistic ˆ θ(β) for observed data {Yi }n . i=1 The following result supports the validity of the resampling based infimum test and confidence bands. Theorem 5.2.2. Under Conditions C1-C4, the conditional distribution of the process o n1/2 {˜ θ(β) − ˆ (β)} given realizations {Yi }n θ i=1 of Y , is asymptotically equivalent to the unconditional distribution of the process n1/2 {ˆ θ(β) − θ∗ (β)}, β ∈ Ξ. Theorem 5.2.2 (proof provided in the appendix) coupled with a continuous mapping 151 1.0 0.8 0.6 0.4 0.0 0.2 CDF(x) 0 1 2 3 4 5 x Figure 5.2: Plot of the exact (solid line) and the resampled (dashed line) CDF (CDF(x) = ¯ Pr(inf β∈[0,1] |Y − β| ≤ x)) of the infimum test statistic under the null θ0 = 0 for the simple normal example, assuming the true parameter β 0 = 0.5, sample size n = 100 and B = 10000 resamples. 152 theorem gives that the infimum and supremum tests can be carried out using this resampling ¯ procedure. For the simple normal example, n1/2 {ˆ θ(β) − θ∗ (β)} = n1/2 (Y − θ0 − β 0 ) and o ¯ n1/2 {˜ θ(β) − ˆ (β)} = −n−1/2 n (Yi − Y )ζ i , which do not depend on β. The random θ i=1 ¯ quantity n1/2 (Y −θ0 −β 0 ) is normally distributed with mean 0 and variance 1, uniformly in β, for each fixed n. 
Given observed data {Yi }n , −n−1/2 i=1 distributed with mean 0 and variance n−1 n (Y − Y )ζ is also normally ¯ i i=1 i n (Y − Y )2 which converges almost surely ¯ i=1 i to 1 as n → ∞. The choice of the support Ξ of β is critically important in performing the test in practice. If values of β are selected in some data-driven fashion, the limiting distribution in Theorem 1 will be invalid. This is similar to Hansen (1996) for the case where the model is identifiable under the null after profiling on β, that is, when θ∗ (β) = θ0 , ∀β ∈ Ξ. On the other hand, an approach which ignores sample information about Ξ may be unnecessarily conservative and potentially sacrifices power. One possible solution is to consult with subject-matter experts on the choice of Ξ. This choice ideally should be based on prior studies, as in the breast cancer analysis in Section 3, where closely related datasets were used to select the range for the sensitivity parameter. From a technical standpoint, this choice should also be computationally feasible. 5.3 Numerical studies 5.3.1 Pseudo-likelihood models with missing data We consider the data set-up and model described in Troxel et al. (1998b) for potentially nonrandom missing data in longitudinal studies. The model will be referred to as the TLH model. 153 The data arise from a longitudinal study where each subject i (i = 1, . . . , n), is to be observed ∗ ∗ at K occasions. For subject i, we have a K ×1 response vector, Yi∗ = (Yi1 , . . . , YiK )T which may not be fully observed. To accommodate missingness, subject i has a vector of missing ∗ data indicators Ri = (Ri1 , . . . , RiK )T , where Rit = 1 if Yit is observed and 0 otherwise. ∗ ∗ Let Yi,obs and Yi,miss denote the observed and missing components of Yi∗ , respectively. Each individual also has a K × J covariate matrix Xi , which is assumed fully observed. The ∗ response Yi in our general formulation is {Yi,obs , Ri , Xi }. ∗ The key idea of the TLH methodology is to model the time point pair (Yit , Rit ), without accounting for the dependence on other time points. Let f (u | w) denote the density function of random quantity u conditional on possibly non random quantity w. We assume ∗ a simple selection model given by, f (Yit , Rit | Xit , where ∗ ) = f (Yit | Xit , ∗ )f (Rit | Yit , Xit , ), is a finite but unknown parameter and Xit may contain both time dependent and independent covariates. ∗ The TLH model assumes that density f (Yit | Xit , µit and σ t (t = 1, · · · , K) are elements of ) is that of normal N (µit , σ t ), where . The missing data process is assumed to satisfy ∗ Rit ∼ Bernoulli(1 − π it ) where the failure probability π it = Pr(Rit = 0 | Yit , Xit , ). We assume a logistic regression model relating the missing data probability to potentially unobserved responses, that is, ∗ logit(π it ) = γ 0t + γ 1t Xit + β t Yit , where γ jt and β t (j = 0, 1; t = 1, · · · , K) are unknown parameters and elements of (5.3) . The parameter β t measures the extent of non-randomness of the missing data mechanism in the study at time t. Specifically, exp{β t } represents the odds ratio for missing response at time t 154 ∗ for each additional unit increase of the hypothetical response Yit . Here, π it in (5.3) depends ∗ on Yit and not on previous elements of Yi∗ . Following warnings by Troxel et al. (1998b), we emphasize that this model could suffer from misspecification if the approximation of the logistic link function to the true link function fails. 
The TLH model lends itself to a pseudo-likelihood analysis (Gong and Samaniego, 1981), where the longitudinal association is naively ignored in the likelihood construction. Specifi∗ cally, the independence pseudo-likelihood function based on observed data {Yi,obs , Ri , Xi }n i=1 is n ind ( ) = ∗ f (Yi,obs , Ri | n ··· )= ∗ {f (Yit , Rit | )}Rit = i=1 t=1 n K ∗ ∗ {f (Yit | )f (Rit | Yit , = i=1 t=1 n K ∗ )dYi,miss i=1 i=1 n K ∗ f (Yit | ∗ ∗ f (Yi,obs , Yi,miss , Ri | ∗ )f (Rit | Yit , ∗ = {f (Yit | i=1 t=1 ∗ f (Yit , Rit | 1−Rit )}Rit ∗ )dYit )(1 − π it )}Rit ∗ )dYit 1−Rit ∗ f (Yit | ∗ )π it dYit 1−Rit . We have suppressed the dependence on covariates in ind ( ). As a pseudo-likelihood model, conditions C1-C4 are easily verified and the asymptotic results hold. The densities in the TLH model are normal and Bernoulli, which are smooth functions of the unknown parameters. 155 5.3.2 Real data analysis To illustrate our methodology, we consider data from the International Breast Cancer Study Group-IBCSG, previously reported by H¨rny et al. (1992); and Troxel et al. (1998b). This u is a group of randomized breast cancer studies with primary endpoints being survival and relapse; and quality of life being a secondary endpoint. One study, Study VI, is a randomized trial of adjuvant chemotherapy following surgery for the treatment of breast cancer. In this study, 4 treatments (A, B, C and D) were randomly assigned to 431 pre-menopausal cancer patients and several domains of quality of life were assessed. In this paper, we focus on three quality-of-life domains; 1) PACIS (perceived adjustment to chronic illness scale), 2) Mood and 3) Appetite. These variables were originally measured on a 0 − 100 scale but are normalized using a square-root transformation as recommended by Troxel et al. (1998b). Questionnaires for the quality of life assessment were administered to study patients at baseline and every three months for two years. Our analysis employs the first three time points, with rates of missing data equalling 16%, 33% and 37% for PACIS, 16%, 33% and 38% for Mood, and 15%, 33% and 38% for Appetite. A full description of Study VI and other IBCSG trials may be found elsewhere (H¨rny et al., 1992; Troxel et al., 1998a). u As in earlier analyses of Study VI, we consider the following model for the measurement outcome, µit = µ0t + α1 X1i + α2 X2i + α3 X3i , (t = 1, 2, 3), where µ0t is a time-dependent intercept and αj is a slope associated with Xji , j = 1, 2, 3.       (1, 0, 0) if treatment A,  (0, 0, 1) if treatment C  Here (X1i , X2i , X3i ) =      (0, 1, 0)  if treatment B, (0, 0, 0) if treatment D.  156 The missing data model is ∗ logit(π it ) = γ 0t + βYit (i = 1, . . . , 431; t = 1, 2, 3), ∗ where γ 0t is a time-dependent intercept and β is a slope associated with Yit . As discussed previously, β quantifies the nonrandomness of the missing data process. A constant σ t is assumed across time. Our objective is to assess the treatment and time effects on the mean quality of life. Under the assumed model, the hypotheses of interest are α1 = α2 = α3 = 0 and µ01 = µ02 = µ03 for the treatment and time effects, respectively. As a preliminary analysis, we first evaluated these hypotheses under identifiability assumptions. Specifically, we fit the TLH model by simultaneously estimating both β and θ = (α1 , α2 , α3 , µ01 , µ02 , µ03 , γ 01 , γ 02 , γ 03 ) via the independence pseudolikelihood estimating function. 
A Wald test based on the sandwich estimator of the covariance matrix of the regression parameter estimates was performed to evaluate the hypotheses of interest. P-values of these Wald tests for the three responses are given in Table 1. In brief, these inferences suggest that there is no treatment effect on PACIS and Appetite and no time effect on PACIS and Mood. The treatment effect on Mood and the time effect on Appetite are significant at 1% level. In addition to these analyses, we also conducted two crude analyses that do not explicitly model the missing data mechanism. The first analysis used only subjects with complete data sequences, therefore removing subjects with incomplete data profiles. The second analysis ignored the missing data and conduct the so-called ignorable (missing at random) inferences by forcing β, the non-randomness missing data parameter, to 0. Results of these analyses are also summarized in Table 1. From these additional exploratory analyses, the treatment and time effects are found to be statistically 157 significant for Mood at 5% level. The ignorable analysis also appears to yield a statistically significant time effects on Appetite. Of course, these crude analyses may not be reliable as they rely on assumptions that are not verifiable using observed data at hand. The Wald tests conducted under the assumption of identifiability may not have desirable properties if identifiability is violated. As illustrated in Figure 5.1, the model was at best weakly identifiable for the outcome PACIS. Model identifiability was also a concern for β for the other two responses. We performed the infimum test to conservatively evaluate the treatment and time effects on the three quality of life domains. To conduct these tests, the set Ξ for the range of β was obtained from an independent source. We considered data on post-menopausal cancer patients from Study VII of the IBCSG trials. Objectives of this study were similar to those of Study VI, except that the menopausal status of study participants differed. The joint model appeared to be identifiable when applied to Study VII data. Based on these results, we derived 99% confidence intervals to use as ranges for β in the infimum tests for Study VI. The ranges for PACIS, Mood and Appetite were [−4, 0], [−3, 0] and [−5.6, −1.6], respectively. Recall that in the missing data model, exp{−β} represents the odds ratio of being observed at any time point for each additional unit increase ∗ ∗ of the hypothetical response Yit . Since Yit takes values in the range 0 − 10 on a squareroot scale, for the selected ranges, the odds ratio may be as high as; exp{4} = 54.60 for PACIS, exp{3} = 20.09 for Mood, and exp{5.6} = 270.43 for Appetite. One might criticize these upper bounds as being scientifically unreasonable. However, permitting such extreme scenarios provides for a conservative test, which is in the spirit of sensitivity analysis. For computational feasibility, the ranges were approximated on fine grids with equally spaced points of 0.02. P-values of the infimum tests are given in Table 1. 
The infimum hypothesis for the treatment effect was rejected for Mood at the 5% level (p-value = 0.025), but not for PACIS (p-value = 0.281) or Appetite (p-value = 0.231). For the time variable, a strongly significant effect was detected only for Appetite (p-value < 0.001). For nonsignificant infimum test results, a supremum test was conducted to see whether one could fail to reject the null hypothesis for all values $\beta \in \Xi$. The supremum test for the treatment effect was not rejected for PACIS (p-value = 0.522) or Appetite (p-value = 0.369), but was strongly rejected for the time effect on PACIS (p-value < 0.001) and Mood (p-value < 0.001).

Table 5.1: P-values for evaluating the treatment and time effects using data from Study VI of the IBCSG trials

                                   Responses
Effect      Test             PACIS             Mood              Appetite
Treatment   Wald: TLH        0.229             0.008             0.281
            Wald: COM†       0.170             0.041             0.370
            Wald: IGN‡       0.303             0.011             0.376
            Infimum test     0.281 (0.522*)    0.025             0.231 (0.369*)
Time        Wald: TLH        0.216             0.552             <0.001
            Wald: COM†       0.302             0.030             0.163
            Wald: IGN‡       0.142             0.006             0.023
            Infimum test     0.134 (<0.001*)   0.063 (<0.001*)   <0.001

* P-value of the supremum test. † Complete-case analysis. ‡ Ignorable (missing at random) analysis.

When the supremum test was rejected, a sensitivity analysis was conducted using a simultaneous 95% confidence band approach to identify the regions of $\beta$ for which the pointwise null hypotheses are rejected. Plots of these analyses for the contrasts $\mu^*_{02}(\beta) - \mu^*_{01}(\beta)$ and $\mu^*_{03}(\beta) - \mu^*_{02}(\beta)$ for PACIS and Mood are given in Figure 5.3. For PACIS, the 95% simultaneous confidence bands for $\mu^*_{02}(\beta) - \mu^*_{01}(\beta)$, $-4 \le \beta < -0.4$, and for $\mu^*_{03}(\beta) - \mu^*_{02}(\beta)$, $-4 \le \beta \le -3$, did not contain 0. Similar analyses for Mood revealed that the 95% simultaneous confidence band for $\mu^*_{02}(\beta) - \mu^*_{01}(\beta)$, $-3 \le \beta < -0.7$, did not contain 0. The confidence band for $\mu^*_{03}(\beta) - \mu^*_{02}(\beta)$ did not exclude 0 over the selected range of $\beta$ ($-3 \le \beta < 0$). The Wald tests, which assume identifiability, were nonsignificant for all pairwise comparisons at the 5% level.

[Figure 5.3 appears here: two rows of two panels each, with horizontal axes $\beta$ and panels titled "Time effect 2 vs 1" and "Time effect 3 vs 2".]

Figure 5.3: The top panel corresponds to PACIS and the bottom panel to Mood. In each panel, the solid lines represent $\hat\mu_{02}(\beta) - \hat\mu_{01}(\beta)$ (on the left) and $\hat\mu_{03}(\beta) - \hat\mu_{02}(\beta)$ (on the right) for fixed values of the parameter $\beta$. The dashed lines are the corresponding 95% simultaneous confidence bands and the dotted lines are the null values.

5.3.3 Simulation study

In this section, we report the results of a simulation study comparing the performance of the infimum test with that of the naive Wald test derived under identifiability assumptions. The simulations were conducted under a TLH model specified so as to roughly approximate data from Study VI of the IBCSG trials. For simplicity, only two treatments (A and B) and two time points ($K = 2$) were considered. The outcome vector $(Y^*_{i1}, Y^*_{i2})$ for subject $i$ was generated from a two-dimensional normal distribution with univariate mean models $\mu_{it} = \mu_{0t} + \alpha_t X_{1i}$, $t = 1, 2$, time-point variances $\sigma_t$, $t = 1, 2$, and correlation coefficient $\rho$. The parameters $\mu_{0t}$ and $\alpha_t$ are time-dependent intercepts and slopes associated with the covariate $X_{1i}$, which equals 1 for treatment B and 0 otherwise. We reparameterized $\mu_{0t}$ and $\alpha_t$ as $\mu_{0t} = \tilde\alpha_0 + \tilde\alpha_1 I(t = 2)$ and $\alpha_t = \tilde\alpha_2 + \tilde\alpha_3 I(t = 2)$, where $I(t = 2)$ is an indicator variable taking the value 1 at the second time point.
Throughout our simulations, we fixed the variances $\sigma_t$, $t = 1, 2$, to 1 and the correlation coefficient $\rho$ to 0.4. Missing observations were generated using a logistic model relating the dropout probability $\pi_{it}$ to the response $Y^*_{it}$ as

\[
\operatorname{logit}(\pi_{it}) = \gamma_{0t} + \gamma_{1t} X_{1i} + \beta Y^*_{it},
\]

where $\gamma_{0t}$ is a time-dependent intercept, and $\gamma_{1t}$ and $\beta$ are slopes associated with $X_{1i}$ and $Y^*_{it}$, respectively. The time-dependent parameters $\gamma_{0t}$ and $\gamma_{1t}$ were reparameterized as $\gamma_{0t} = \tilde\gamma_0 + \tilde\gamma_1 I(t = 2)$ and $\gamma_{1t} = \tilde\gamma_2 + \tilde\gamma_3 I(t = 2)$, $t = 1, 2$.

We study the size and power of the infimum and Wald tests for $\tilde\alpha_3$, the parameter that captures the interaction effect of time and treatment on the mean response. We set $\tilde\alpha_3 = 0$ and $\tilde\alpha_3 = 1$ when evaluating the size and the power of the tests, respectively. Additionally, $(\tilde\alpha_1, \tilde\alpha_2) = (0, 0)$ and $(\tilde\gamma_1, \tilde\gamma_2, \tilde\gamma_3) = (0.5, -2, 0.2)$ when evaluating size, and $(\tilde\alpha_1, \tilde\alpha_2) = (0.1, 1)$ and $(\tilde\gamma_1, \tilde\gamma_2, \tilde\gamma_3) = (1, -3, 1)$ when evaluating power. The parameter $\tilde\gamma_0$ was varied throughout our simulations to produce different missing-data rates. Specifically, to study the size of the tests, $\tilde\gamma_0$ was fixed at 0.5, producing about 15% and 22% missing observations at the first and second time points, respectively, and at 1.8, producing about 33% and 43% missing observations at the first and second time points, respectively. For the power, $\tilde\gamma_0$ was fixed at 0.5, producing missing-data rates of roughly 14% and 26% at the first and second time points, and at 2, producing rates of roughly 32% and 46%. Finally, throughout our simulations we set the true $\beta$ to $-1$.

One thousand datasets were generated with sample sizes 100 and 300. Equal proportions of subjects were assigned to treatments A and B. The infimum tests were performed on the interval $\Xi = [-2, 0]$; to ensure computational feasibility, a fine grid with points equally spaced by 0.02 was used. We used 1000 resamples from the alternative resampling scheme discussed in Section 5.2 to approximate the null distribution of the infimum test. The infimum and Wald tests were performed using working regression models of the same form as those used to generate the data. These models saturate the number of parameters, leading to potential nonidentifiability as a result of overparameterization.

Table 5.2 shows the rejection rates for nominal test levels 0.01, 0.05 and 0.1. Asymptotic standard errors (as the number of Monte Carlo iterations tends to infinity) are reported in the last row of the table. Overall, the infimum tests perform well, with the resampling distribution of the test providing a reasonable approximation at the nominal levels. The Wald test appears to be very liberal compared with the infimum test, and its anti-conservativeness does not diminish as the sample size increases. Based on these results, our recommendation is to avoid the Wald test when identifiability is of concern. Because the empirical type I error rates of the infimum test and the Wald test differ, comparing their empirical powers is not appropriate. Nevertheless, for both methods a larger sample size improves the power of detecting the alternatives under consideration, a finding consistent with the literature. Moreover, the power decreases with increasing missing-data rates.
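For reference, a minimal data-generating sketch for this simulation design might look as follows. The default parameter values correspond to the size configuration described above ($\tilde\alpha_3 = 0$, $\tilde\gamma_0 = 0.5$); the baseline intercept $\tilde\alpha_0$ is not specified in the text and is set to 0 here as an assumption, and all function and argument names are illustrative.

import numpy as np

def simulate_dataset(n=300, rho=0.4, beta=-1.0,
                     alpha=(0.0, 0.0, 0.0, 0.0),   # (a0~, a1~, a2~, a3~); a0~ assumed 0
                     gamma=(0.5, 0.5, -2.0, 0.2),  # (g0~, g1~, g2~, g3~): size setting
                     seed=0):
    """One dataset under the two-treatment, two-time-point TLH design."""
    rng = np.random.default_rng(seed)
    a0, a1, a2, a3 = alpha
    g0, g1, g2, g3 = gamma

    # Equal allocation to treatments A (X1 = 0) and B (X1 = 1); n assumed even.
    X1 = rng.permutation(np.repeat([0, 1], n // 2)).astype(float)

    # Mean model: mu_it = mu_0t + alpha_t * X1, with mu_0t = a0 + a1*I(t=2)
    # and alpha_t = a2 + a3*I(t=2).
    mu = np.column_stack([a0 + a2 * X1, a0 + a1 + (a2 + a3) * X1])

    # Bivariate normal outcomes with unit variances and correlation rho.
    Z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    Ystar = mu + Z

    # Dropout model: logit(pi_it) = gamma_0t + gamma_1t * X1 + beta * Y*_it,
    # with gamma_0t = g0 + g1*I(t=2) and gamma_1t = g2 + g3*I(t=2).
    lin = np.array([g0, g0 + g1]) + np.array([g2, g2 + g3]) * X1[:, None] \
          + beta * Ystar
    pi_miss = 1.0 / (1.0 + np.exp(-lin))
    R = (rng.uniform(size=(n, 2)) >= pi_miss).astype(int)  # R = 1: observed

    Y = np.where(R == 1, Ystar, np.nan)  # observed responses; NaN if missing
    return Y, R, X1

With these defaults and $\beta = -1$, the routine should yield missing-data rates comparable to those reported above; the power configurations are obtained by changing alpha and gamma accordingly.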
Table 5.2: Empirical type I error and power of the infimum test and the Wald test (in parentheses) for evaluating the interaction effect represented by the parameter $\tilde\alpha_3$

                                           n = 100                    n = 300
True                Missing+               Nominal test level         Nominal test level
value               data rate              0.1     0.05    0.01       0.1     0.05    0.01
a3~ = 0             15%, 22% (g0~ = 0.5)   0.099   0.049   0.011      0.103   0.057   0.010
                                           (0.147) (0.092) (0.034)    (0.158) (0.117) (0.059)
                    33%, 43% (g0~ = 1.8)   0.110   0.044   0.010      0.122   0.062   0.012
                                           (0.149) (0.097) (0.051)    (0.165) (0.113) (0.073)
a3~ = 1             14%, 26% (g0~ = 0.5)   0.982   0.959   0.870      >0.999  >0.999  >0.999
                                           (0.961) (0.944) (0.848)    (0.948) (0.942) (0.934)
                    32%, 46% (g0~ = 2)     0.905   0.846   0.675      0.999   0.999   0.993
                                           (0.889) (0.827) (0.659)    (0.961) (0.949) (0.926)
Monte Carlo SE                             0.009   0.007   0.003      0.009   0.007   0.003

Tests performed using $\Xi = [-2, 0]$. + Missing-data rates at the first and second time points.

While the ability to choose an appropriate support set $\Xi$ of $\beta$ on which to perform the infimum tests is highly desirable in practice, our simulations (results not shown) indicate that only a minimal inflation of the type I error rate is observed under a modest misspecification of the set $\Xi$. For example, when $\Xi$ does not contain the true $\beta$ but $\beta_0$ is not far from the boundaries of the set, a size close to the nominal level is still achieved under the null hypothesis. As an example, we performed the infimum test on the interval $[0, 2]$, which does not contain $\beta_0 = -1$; for this range of $\beta$, the infimum tests nearly maintained their sizes at all significance levels. However, when $[10, 12]$ was selected as the range of $\beta$, the infimum tests were overly anti-conservative.

Another simulation study was conducted to evaluate the effect of the choice of the set $\Xi$ on the power of the infimum tests. Specifically, we generated data as before but performed the infimum tests on wider intervals, namely $[-3, 3]$ and $[-5, 5]$. Results of this simulation study are given in Table 5.3. As expected, the power decreases as the interval widens, regardless of the missing-data rate. Following a referee's recommendation, further simulations were conducted to evaluate the loss of power when the infimum test is performed on a given support set of $\beta$ rather than the ideal set $\Xi = \{\beta_0\}$. For this, we generated the data as before, with the only difference that $\tilde\alpha_3 = 0.7$, and performed the infimum test using $\Xi = [-2, 0]$ and $\Xi = \{-1\}$. The results revealed only a minor loss of power for the infimum test on $\Xi = [-2, 0]$ compared with the ideal set $\Xi = \{-1\}$ (see Table 5.4).

Table 5.3: Empirical power of the infimum test to detect the interaction effect $\tilde\alpha_3 = 1$ for two ranges $\Xi$ of $\beta$, with true value $\beta_0 = -1$

                              n = 100                    n = 300
            Missing+          Nominal test level         Nominal test level
Xi          data rate         0.10    0.05    0.01       0.10    0.05    0.01
[-3, 3]     14%, 26%          0.912   0.865   0.720      0.999   0.997   0.987
            32%, 46%          0.783   0.706   0.529      0.981   0.960   0.905
[-5, 5]     14%, 26%          0.899   0.836   0.655      0.997   0.990   0.970
            32%, 46%          0.767   0.695   0.492      0.953   0.925   0.858
Monte Carlo SE                0.009   0.007   0.003      0.009   0.007   0.003

+ Missing-data rates at the first and second time points.

Table 5.4: Empirical power of the infimum test to detect the interaction effect $\tilde\alpha_3 = 0.7$ for two sets $\Xi$ of $\beta$, with true value $\beta_0 = -1$

                              n = 100                    n = 300
            Missing+          Nominal test level         Nominal test level
Xi          data rate         0.10    0.05    0.01       0.10    0.05    0.01
[-2, 0]     13%, 23%          0.806   0.699   0.470      0.996   0.989   0.949
{-1}        13%, 23%          0.829   0.725   0.469      0.998   0.998   0.982
[-2, 0]     31%, 44%          0.677   0.570   0.342      0.978   0.957   0.834
{-1}        31%, 44%          0.707   0.589   0.348      0.992   0.974   0.915
Monte Carlo SE                0.009   0.007   0.003      0.009   0.007   0.003

+ Missing-data rates at the first and second time points.

5.4 Discussion

While hypothesis testing under nonidentifiability has been previously considered, the existing framework is often too restrictive for sensitivity analyses.
In a sensitivity analysis, the model may not be identifiable under either the null or the alternative hypothesis, and profiling may not lead to consistent estimation of the parameter of interest under the null. As a result, the supremum test may not be appropriate. As discussed in this paper, a theoretically rigorous approach to this testing problem may be based on infimum statistics, whose distribution must be carefully considered under model misspecification under the null hypothesis. The infimum testing approach was previously studied for likelihood analyses of parametric models (Todem et al., 2010). In this paper, we have extended these results to general estimating functions for parametric models. This includes limiting results for the profile estimators, the infimum test and the confidence bands, as well as the validity of the bootstrap procedure. Such results are critically important in sensitivity analyses of complex data arising in longitudinal studies, where full model specification may be difficult and partially specified models may be more easily analyzed using non-likelihood-based approaches.

APPENDIX

Proof of Theorem 5.2.1

i) We show that $\sup_{\beta \in \Xi} \|\hat\theta(\beta) - \theta^*(\beta)\| \to_p 0$. Condition C2 implies that $\mathcal{G}_1$ and $\mathcal{G}_2$ are Donsker and hence Glivenko-Cantelli (van der Vaart and Wellner, 2000a, 2000b). Therefore,

\[
\sup_{\theta\in\Theta,\,\beta\in\Xi} \|S_Y(\theta,\beta) - \tilde S(\theta,\beta)\| \to_p 0
\quad\text{and}\quad
\sup_{\theta\in\Theta,\,\beta\in\Xi} \|W_Y(\theta,\beta) - \tilde W(\theta,\beta)\| \to_p 0. \tag{5.4}
\]

The definitions of $\hat\theta(\beta)$ and $\theta^*(\beta)$ and Condition C4 imply that

\[
\begin{aligned}
0 &= S_Y(\hat\theta(\beta),\beta) - \tilde S(\theta^*(\beta),\beta) \\
  &= \bigl\{S_Y(\hat\theta(\beta),\beta) - S_Y(\theta^*(\beta),\beta)\bigr\}
   + \bigl\{S_Y(\theta^*(\beta),\beta) - \tilde S(\theta^*(\beta),\beta)\bigr\} \\
  &= \bigl\{\tilde W(\check\theta(\beta),\beta) + v_{2n}(\beta)\bigr\}\bigl\{\hat\theta(\beta) - \theta^*(\beta)\bigr\} + \nu_{1n}(\beta),
\end{aligned}
\tag{5.5}
\]

where $\check\theta(\beta)$ is on the line segment between $\hat\theta(\beta)$ and $\theta^*(\beta)$, $\sup_{\beta\in\Xi} |v_{2n}(\beta)| \to_p 0$, and $\nu_{1n}(\beta) = S_Y(\theta^*(\beta),\beta) - \tilde S(\theta^*(\beta),\beta)$. From (5.5), we have

\[
\sup_{\beta\in\Xi} \bigl\|\hat\theta(\beta) - \theta^*(\beta)\bigr\|
 = \sup_{\beta\in\Xi} \Bigl\| -\bigl\{\tilde W(\check\theta(\beta),\beta) + v_{2n}(\beta)\bigr\}^{-1} \nu_{1n}(\beta) \Bigr\|
 \le \sup_{\beta\in\Xi} \Bigl\| \bigl\{\tilde W(\check\theta(\beta),\beta) + v_{2n}(\beta)\bigr\}^{-1} \Bigr\| \,
     \sup_{\beta\in\Xi} \bigl\|\nu_{1n}(\beta)\bigr\|.
\]

By Condition C3, for any $\theta \in \Theta$ and any $\beta \in \Xi$, there exists a positive number $\lambda_1$ such that $\lambda_{\min}(\beta) > \lambda_1 > 0$. For any $s \times s$ symmetric matrix $A$, denote its Euclidean norm by $\|A\| = \lambda_{\max}(A)$, where $\lambda_{\max}(A)$ is the largest eigenvalue of $A$; if $A$ is also nonsingular, $\|A^{-1}\| = \lambda_{\min}^{-1}(A)$. Therefore $\|-\tilde W^{-1}(\check\theta(\beta),\beta)\| = \lambda_{\min}^{-1}(\beta) \le \lambda_1^{-1}$, and $\sup_{\beta\in\Xi} \|\hat\theta(\beta) - \theta^*(\beta)\| \le \lambda_1^{-1} \sup_{\beta\in\Xi} \|\nu_{1n}(\beta)\|$.
The uniform consistency of $\hat\theta(\beta)$ to $\theta^*(\beta)$ then follows from

\[
\sup_{\beta\in\Xi} \|\nu_{1n}(\beta)\|
 = \sup_{\beta\in\Xi} \|S_Y(\theta^*(\beta),\beta) - \tilde S(\theta^*(\beta),\beta)\|
 \le \sup_{\theta\in\Theta,\,\beta\in\Xi} \|S_Y(\theta,\beta) - \tilde S(\theta,\beta)\| \to_p 0,
\]

according to (5.4).

ii) We show that $n^{1/2}\{\hat\theta(\beta) - \theta^*(\beta)\}$ converges weakly to a tight Gaussian process. Based on the uniform consistency of $\hat\theta(\beta)$, together with (5.4) and (5.5), applying a Taylor expansion to $S_Y(\hat\theta(\beta),\beta)$ around $S_Y(\theta^*(\beta),\beta)$ gives

\[
\begin{aligned}
n^{1/2}\bigl\{\hat\theta(\beta) - \theta^*(\beta)\bigr\}
 &\approx -n^{1/2}\,\tilde W^{-1}(\theta^*(\beta),\beta)\,\nu_{1n}(\beta) \\
 &= -n^{-1/2} \sum_{i=1}^{n} \tilde W^{-1}(\theta^*(\beta),\beta)
    \bigl\{s_{Y_i}(\theta^*(\beta),\beta) - E\,s_{Y_i}(\theta^*(\beta),\beta)\bigr\}
 \equiv -n^{-1/2} \sum_{i=1}^{n} \eta_i(\beta),
\end{aligned}
\]

where $\approx$ denotes asymptotic equivalence uniformly in $\beta \in \Xi$. Because Condition C2 implies that $\mathcal{G}_1$ is Donsker, and using the previous result that $\tilde W^{-1}(\theta^*(\beta),\beta)$ is uniformly bounded for $\beta \in \Xi$, the function class $\{\tilde W^{-1}(\theta^*(\beta),\beta)\, s_{Y_i}(\theta^*(\beta),\beta):\ \beta \in \Xi,\ i = 1, \ldots, n\}$ is Donsker. This permits the application of a functional central limit theorem to establish the weak convergence of $\hat\theta(\beta)$. Therefore,

\[
\lim_{n\to\infty} \operatorname{cov}\Bigl( n^{1/2}\bigl\{\hat\theta(\beta_1) - \theta^*(\beta_1)\bigr\},\;
 n^{1/2}\bigl\{\hat\theta(\beta_2) - \theta^*(\beta_2)\bigr\} \Bigr)
 = E\bigl\{\eta_1(\beta_1)\,\eta_1^{T}(\beta_2)\bigr\} = \Sigma^*(\beta_1,\beta_2).
\]

For a given $\beta$, $\operatorname{var}\bigl(n^{1/2}\{\hat\theta(\beta) - \theta^*(\beta)\}\bigr) = E\{\eta_1(\beta)\,\eta_1^{T}(\beta)\}$.

Proof of Theorem 5.2.2

Applying a Taylor expansion to $s_{Y_i}(\theta^*(\beta),\beta)$ around $s_{Y_i}(\hat\theta^o(\beta),\beta)$ gives

\[
n^{-1/2}\sum_{i=1}^{n} s_{Y_i}(\theta^*(\beta),\beta)\,\zeta_i
 = n^{-1/2}\sum_{i=1}^{n} s_{Y_i}(\hat\theta^o(\beta),\beta)\,\zeta_i
 + n^{-1/2}\sum_{i=1}^{n} g_{Y_i}(\bar\theta(\beta),\beta)\,\zeta_i\,
   \bigl\{\theta^*(\beta) - \hat\theta^o(\beta)\bigr\},
\]

where $\bar\theta(\beta)$ is on the line segment between $\hat\theta^o(\beta)$ and $\theta^*(\beta)$. Given the observations $\{Y_i\}_{i=1}^{n}$, Condition C4 and $\sup_{\beta\in\Xi}\|\theta^*(\beta) - \hat\theta^o(\beta)\| = o_p(1)$, one has

\[
n^{-1/2}\sum_{i=1}^{n} s_{Y_i}(\theta^*(\beta),\beta)\,\zeta_i
 \approx n^{-1/2}\sum_{i=1}^{n} s_{Y_i}(\hat\theta^o(\beta),\beta)\,\zeta_i.
\]

Based on the definition of $\tilde\theta(\beta)$ in (5.2) and on (5.4), one has

\[
\begin{aligned}
n^{1/2}\bigl\{\tilde\theta(\beta) - \hat\theta^o(\beta)\bigr\}
 &= -n^{-1/2}\, W_Y^{-1}(\hat\theta^o(\beta),\beta) \sum_{i=1}^{n} s_{Y_i}(\hat\theta^o(\beta),\beta)\,\zeta_i \\
 &\approx -n^{-1/2}\, W_Y^{-1}(\hat\theta^o(\beta),\beta) \sum_{i=1}^{n} s_{Y_i}(\theta^*(\beta),\beta)\,\zeta_i \\
 &\approx -n^{-1/2}\, \tilde W^{-1}(\theta^*(\beta),\beta) \sum_{i=1}^{n} s_{Y_i}(\theta^*(\beta),\beta)\,\zeta_i
 = -n^{-1/2} \sum_{i=1}^{n} \eta_i(\beta)\,\zeta_i.
\end{aligned}
\]

Hence, conditional on the observations $\{Y_i\}_{i=1}^{n}$, $n^{1/2}\{\tilde\theta(\beta) - \hat\theta^o(\beta)\}$ converges weakly to a Gaussian process with mean 0 and covariance function $\Sigma^o(\beta_1,\beta_2) = \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} E\{\eta_i(\beta_1)\,\zeta_i\,\zeta_i^{T}\,\eta_i^{T}(\beta_2) \mid Y\}$. We also have

\[
\begin{aligned}
\Sigma^o(\beta_1,\beta_2) - \Sigma^*(\beta_1,\beta_2)
&= \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} E\bigl\{\eta_i(\beta_1)\,\zeta_i\,\zeta_i^{T}\,\eta_i^{T}(\beta_2) \mid Y\bigr\}
 - E\bigl\{\eta_1(\beta_1)\,\eta_1^{T}(\beta_2)\bigr\} \\
&= \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} \eta_i(\beta_1)\,\eta_i^{T}(\beta_2)
 - E\bigl\{\eta_1(\beta_1)\,\eta_1^{T}(\beta_2)\bigr\} = 0.
\end{aligned}
\]

Hence $\lim_{n\to\infty}\operatorname{cov}\bigl(n^{1/2}\{\tilde\theta(\beta_1) - \hat\theta^o(\beta_1)\},\, n^{1/2}\{\tilde\theta(\beta_2) - \hat\theta^o(\beta_2)\}\bigr) = \Sigma^*(\beta_1,\beta_2)$, and the conditional distribution of $n^{1/2}\{\tilde\theta(\beta) - \hat\theta^o(\beta)\}$ is asymptotically equivalent to the unconditional distribution of $n^{1/2}\{\hat\theta(\beta) - \theta^*(\beta)\}$.

BIBLIOGRAPHY

[1] Cai, T. and Hall, P. (2005). Prediction in functional linear regression. Annals of Statistics, 34, 2159-2179.

[2] Cai, T. and Yuan, M. (2010). Nonparametric covariance function estimation for functional and longitudinal data. Technical report.

[3] Cao, G., Wang, L., Li, Y. and Yang, L. (2012). Spline confidence envelopes for covariance function in dense functional/longitudinal data. Manuscript.

[4] Cao, G., Yang, L. and Todem, D. (2012). Simultaneous inference for the mean function of dense functional data. Journal of Nonparametric Statistics, 24, 359-377.

[5] Cao, G., Wang, J., Wang, L. and Todem, D. (2012). Spline confidence bands for functional derivatives. Journal of Statistical Planning and Inference, 142, 1557-1570.

[6] Cao, G., Todem, D., Yang, L. and Fine, J. (2012). Evaluating statistical hypotheses using weakly identifiable estimating functions.
Scandinavian Journal of Statistics, accepted.

[7] Chambers, R. L. and Welsh, A. H. (1993). Log-linear models for survey data with non-ignorable non-response. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 55, 157-170.

[8] Claeskens, G. and Van Keilegom, I. (2003). Bootstrap confidence bands for regression curves and their derivatives. Annals of Statistics, 31, 1852-1884.

[9] Crainiceanu, C. M., Staicu, A. and Di, C. (2009). Generalized multilevel functional regression. Journal of the American Statistical Association, 104, 1550-1561.

[10] Copas, J. (1999). What works?: Selectivity models and meta-analysis. Journal of the Royal Statistical Society, Series A: Statistics in Society, 162, 95-109.

[11] Copas, J. B. and Li, H. G. (1997). Inference for non-random samples. Journal of the Royal Statistical Society, Series B: Methodological, 59, 55-77.

[12] Csörgő, M. and Révész, P. (1981). Strong Approximations in Probability and Statistics. Academic Press, New York-London.

[13] Dacunha-Castelle, D. and Gassiat, E. (1999). Testing the order of a model using locally conic parametrization: population mixtures and stationary ARMA processes. Annals of Statistics, 27, 1178-1209.

[14] Davies, R. B. (1977). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika, 64, 247-254.

[15] Davies, R. B. (1987). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika, 74, 33-43.

[16] de Boor, C. (2001). A Practical Guide to Splines. Springer-Verlag, New York.

[17] Degras, D. A. (2011). Simultaneous confidence bands for nonparametric regression with functional data. Statistica Sinica, 21, 1735-1765.

[18] DeVore, R. and Lorentz, G. (1993). Constructive Approximation: Polynomials and Splines Approximation. Springer-Verlag, Berlin.

[19] Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall Ltd.

[20] Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and its Applications. Chapman & Hall, London.

[21] Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York.

[22] Gasser, T. and Müller, H. G. (1984). Estimating regression functions and their derivatives by the kernel method. Scandinavian Journal of Statistics, 11, 171-185.

[23] Gong, G. and Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. The Annals of Statistics, 9, 861-869.

[24] Hall, P. and Hosseini-Nasab, M. (2006). On properties of functional principal components analysis. Journal of the Royal Statistical Society, Series B, 68, 109-126.

[25] Hall, P., Müller, H. G. and Wang, J. L. (2006). Properties of principal component methods for functional and longitudinal data analysis. Annals of Statistics, 34, 1493-1517.

[26] Hall, P., Müller, H. G. and Yao, F. (2009). Estimation of functional derivatives. Annals of Statistics, 37, 3307-3329.

[27] Hansen, B. E. (1996). Inference when a nuisance parameter is not identified under the null hypothesis. Econometrica, 64, 413-430.

[28] Härdle, W. (1989). Asymptotic maximal deviation of M-smoothers. Journal of Multivariate Analysis, 29, 163-179.

[29] Huang, J. (2003). Local asymptotics for polynomial spline regression. Annals of Statistics, 31, 1600-1635.

[30] Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Berkeley: University of California Press.

[31] Hürny, C., Bernhard, J., Gelber, R., Coates, A., Castiglione, M., Isley, M., Dreher, D., Peterson, H., Goldhirsch, A. and Senn, H. (1992). Quality of life measures for patients receiving adjuvant therapy for breast cancer: an international trial. European Journal of Cancer, 28, 118-124.

[32] James, G. M., Hastie, T. and Sugar, C. (2000). Principal component models for sparse functional data. Biometrika, 87, 587-602.

[33] James, G. and Hastie, T. (2001). Functional linear discriminant analysis for irregularly sampled curves. Journal of the Royal Statistical Society, Series B, 63, 533-550.

[34] James, G. M. and Silverman, B. W. (2005). Functional adaptive model estimation. Journal of the American Statistical Association, 100, 565-576.

[35] Kenward, M. G., Goetghebeur, E. J. T. and Molenberghs, G. (2001). Sensitivity for incomplete categorical data. Statistical Modelling, 1, 31-48.

[36] Li, Y. and Hsing, T. (2010a). Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. Annals of Statistics, 38, 3321-3351.

[37] Li, Y. and Hsing, T. (2010b). Deciding the dimension of effective dimension reduction space for functional and high-dimensional data. Annals of Statistics, 38, 3028-3062.

[38] Li, Y., Wang, N. and Carroll, R. J. (2010). Generalized functional linear models with semiparametric single-index interactions. Journal of the American Statistical Association, 105, 621-633.

[39] Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. John Wiley & Sons.

[40] Liu, B. and Müller, H. G. (2009). Estimating derivatives for samples of sparsely observed functions, with application to online auction dynamics. Journal of the American Statistical Association, 104, 704-717.

[41] Lu, G. and Copas, J. B. (2004). Missing at random, likelihood ignorability and model completeness. Annals of Statistics, 32, 754-765.

[42] Ma, S., Yang, L. and Carroll, R. J. (2012). A simultaneous confidence band for sparse longitudinal regression. Statistica Sinica, 22, 95-122.

[43] Morris, J. S. and Carroll, R. J. (2006). Wavelet-based functional mixed models. Journal of the Royal Statistical Society, Series B, 68, 179-199.

[44] Müller, H. G. (2009). Functional modeling of longitudinal data. In: Longitudinal Data Analysis (Handbooks of Modern Statistical Methods), Ed. Fitzmaurice, G., Davidian, M., Verbeke, G. and Molenberghs, G., Wiley, New York, 223-252.

[45] Parzen, M. I., Wei, L. J. and Ying, Z. (1994). A resampling method based on pivotal estimating functions. Biometrika, 81, 341-350.

[46] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer, New York.

[47] Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics, 57, 253-259.

[48] Ritz, C. and Skovgaard, I. M. (2005). Likelihood ratio tests in curved exponential families with nuisance parameters present only under the alternative. Biometrika, 92, 507-517.

[49] Rotnitzky, A., Scharfstein, D., Su, T.-L. and Robins, J. (2001). Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring. Biometrics, 57, 103-113.

[50] Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association, 94, 1096-1120.

[51] Schumaker, L. (2007). Spline Functions: Basic Theory. Third Edition.
Cambridge University Press, Cambridge.

[52] Shi, M., Weiss, R. E. and Taylor, J. M. G. (1996). An analysis of paediatric CD4 counts for acquired immune deficiency syndrome using flexible random curves. Journal of the Royal Statistical Society, Series C: Applied Statistics, 45, 151-163.

[53] Song, R., Kosorok, M. R. and Fine, J. P. (2009). On asymptotically optimal tests under loss of identifiability in semiparametric models. Annals of Statistics, 37, 2409-2444.

[54] Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics, 22, 118-184.

[55] Todem, D., Fine, J. and Peng, L. (2010). A global sensitivity test for evaluating statistical hypotheses with nonidentifiable models. Biometrics, 66, 558-566.

[56] Troxel, A. B., Harrington, D. P. and Lipsitz, S. R. (1998a). Analysis of longitudinal data with non-ignorable non-monotone missing values. Journal of the Royal Statistical Society, Series C: Applied Statistics, 47, 425-438.

[57] Troxel, A. B., Lipsitz, S. R. and Harrington, D. P. (1998b). Marginal models for the analysis of longitudinal measurements with nonignorable non-monotone missing data. Biometrika, 85, 661-672.

[58] van der Vaart, A. W. and Wellner, J. A. (2000a). Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli theorems. In High Dimensional Probability II (eds E. Giné, D. M. Mason and J. A. Wellner). Birkhäuser, Boston.

[59] van der Vaart, A. W. and Wellner, J. A. (2000b). Weak Convergence and Empirical Processes. Springer, New York.

[60] Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics, 59. SIAM, Philadelphia, PA.

[61] Wang, J. and Yang, L. (2009a). Polynomial spline confidence bands for regression curves. Statistica Sinica, 19, 325-342.

[62] Wang, L. and Yang, L. (2009b). Spline estimation of single index model. Statistica Sinica, 19, 765-783.

[63] Wang, N., Carroll, R. J. and Lin, X. (2005). Efficient semiparametric marginal estimation for longitudinal/clustered data. Journal of the American Statistical Association, 100, 147-157.

[64] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-25.

[65] Xia, Y. (1998). Bias-corrected confidence bands in nonparametric regression. Journal of the Royal Statistical Society, Series B, 60, 797-811.

[66] Yao, F. and Lee, T. C. M. (2006). Penalized spline models for functional principal component analysis. Journal of the Royal Statistical Society, Series B, 68, 3-25.

[67] Yao, F., Müller, H. G. and Wang, J. L. (2005a). Functional linear regression analysis for longitudinal data. Annals of Statistics, 33, 2873-2903.

[68] Yao, F., Müller, H. G. and Wang, J. L. (2005b). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100, 577-590.

[69] Zhao, X., Marron, J. S. and Wells, M. T. (2004). The functional data analysis view of longitudinal data. Statistica Sinica, 14, 789-808.

[70] Zhao, Z. and Wu, W. (2008). Confidence bands in nonparametric time series regression. Annals of Statistics, 36, 1854-1878.

[71] Zhou, L., Huang, J. and Carroll, R. J. (2008). Joint modelling of paired sparse functional data using principal components. Biometrika, 95, 601-619.

[72] Zhou, S., Shen, X. and Wolfe, D. A. (1998). Local asymptotics of regression splines and confidence regions. Annals of Statistics, 26, 1760-1782.

[73] Zhou, S. and Wolfe, D. A. (2000). On derivative estimation in spline regression. Statistica Sinica, 10, 93-108.

[74] Zhu, H. and Zhang, H. (2006).
Generalized score test of homogeneity for mixed effects models. Annals of Statistics, 34, 1545-1569.