MODEL CHECKING PROBLEMS IN MEASUREMENT ERROR MODELS WITH VALIDATION DATA

By

Pei Geng

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2017

ABSTRACT

MODEL CHECKING PROBLEMS IN MEASUREMENT ERROR MODELS WITH VALIDATION DATA

By Pei Geng

This thesis addresses some aspects of regression model checking problems when the covariates are observed with measurement errors. Both classical error-in-variables models and Berkson models are investigated when validation data are available.

In Tobit error-in-variables regression models, the response is truncated at a given level while the covariate is collected with errors. In this thesis we assume the density of the measurement error to be unknown. Using the calibration idea, a new regression function is derived under the null hypothesis and estimated by the kernel smoothing method using validation data. Then a class of test statistics is constructed from the nonparametric residuals based on kernel regression estimators. The proposed class of tests is shown to be robust to the choice of parameter estimators and consistent against a large class of fixed alternatives. The asymptotic normality of these test statistics is established under the null hypothesis and under a sequence of local alternatives. A practical bandwidth selection strategy is developed. A finite sample simulation study shows the superiority of a member of the proposed class of tests over two existing tests in terms of empirical power. A real data application is presented to validate the current understanding of the data set.

In Berkson models, without specifying the measurement error density, the calibrated regression function is estimated using both the primary data containing the responses and the validation data. A kernel smoothed integrated square distance between the responses and the regression estimator is defined, and the parameter estimators are obtained by minimizing this distance. The test statistics are then constructed from the minimized distances. The consistency and asymptotic normality of these estimators are proved. The asymptotic null distribution of the proposed class of test statistics and the test consistency against certain alternatives are also established. A simulation study shows desirable behavior of a member of these minimum distance estimators and tests.

Copyright by PEI GENG 2017

To my dear and beloved parents Jinchen Geng and Jingmiao Liu, aunt Jingzhi Liu and sister Ying Geng.

ACKNOWLEDGMENTS

First and foremost, I wish to convey my greatest and most sincere appreciation to my advisor Professor Hira L. Koul. I am indebted to him for his constant guidance, encouragement and enthusiasm in research, and for his endless time and prompt replies to my questions and confusion over the years. Without his valuable suggestions and comments, vast and sharp knowledge, and passionate and caring attitude, I would not have grown so fast in academia. From Professor Koul, I have learned not only professional knowledge and critical thinking in statistics but, more importantly, the rigorous and responsible spirit towards science, which has built a profound basis for my future work.

Secondly, I would like to thank Dr. Lyudmila Sakhanenko for all her advice and helpful discussions in the past. Her broad knowledge and innovative ideas have greatly enlightened me and helped me overcome difficulties.
I would like to thank Dr. Qing Lu for shedding light on the applications of statistics for me. I also would like to thank Dr. Ping-shou Zhong for serving as a member of my doctoral committee.

Thirdly, I wish to express my sincere gratitude to Professor Yimin Xiao, my MSc advisor Professor Wensheng Wang and my undergraduate professor Dr. Zifeng Yang. Without their encouragement and support, I would not have come to MSU to pursue my doctoral degree.

Last but not least, with my deepest affection, I would like to thank my parents Jinchen Geng and Jingmiao Liu, my sister Ying Geng and my aunt Jingzhi Liu for their endless love and support over the years.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
Chapter 1  Introduction
Chapter 2  Model checking in Tobit EIV regression using validation data
  2.1 Introduction
  2.2 A class of tests
  2.3 Asymptotic distributions
    2.3.1 Asymptotic null distribution
    2.3.2 Asymptotic power
  2.4 Estimation of θ0
  2.5 Data analysis
    2.5.1 Simulations
    2.5.2 Real data application
  2.6 Proofs
Chapter 3  Minimum distance model checking in Berkson models
  3.1 Introduction
  3.2 A class of tests
  3.3 Estimation of θ0
    3.3.1 Consistency of θ̂n
    3.3.2 Asymptotic normality of θ̂n
  3.4 Testing
  3.5 Simulation
    3.5.1 Finite sample performance of θ̂n
    3.5.2 Test performance
  3.6 Proofs
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Comparison of estimators: absolute bias and RMSE in parentheses
Table 2.2: Empirical levels for p = 1 at nominal level 0.05
Table 2.3: Empirical power comparison for p = 1 at nominal level 0.05
Table 2.4: Empirical level and power of the V_TME test for p = 2 at nominal level 0.05
Table 2.5: Estimation and testing results of the enzyme reaction dataset
Table 3.1: Performance of θ̂n, θ̃n in the linear case (3.37), p = 1
Table 3.2: Performance of θ̂n, θ̃n in the nonlinear case (3.38), p = 1
Table 3.3: Performance of θ̂n, θ̃n in the linear case with p = 2
Table 3.4: Empirical level and power under the linear null model (left panel) and the nonlinear null model (right panel) for p = 1
Table 3.5: Empirical level and power under the linear null model for p = 2

LIST OF FIGURES

Figure 2.1: Estimated regression functions using three estimation methods

KEY TO ABBREVIATIONS

x^T — the transpose of a Euclidean vector x
I_A — the indicator of event A
→_d — convergence in distribution
→_p — convergence in probability
n ∧ N — minimum of n and N
N_q(ν, Σ) — q-variate normal distribution with mean vector ν and covariance matrix Σ
KN — Koul and Ni (2004)
KS — Koul and Song (2009)

Chapter 1

Introduction

In the area of statistical inference, regression model checking is a critical topic for studying the significant relationships between responses and covariates. Extensive research has focused on this topic since the early 1990s, as is evidenced by the recent review article of González-Manteiga and Crujeiras (2013). Among the main classes of tests, kernel-based test procedures are important tools for investigating whether the regression function belongs to a particular parametric family.

Most of the literature on regression model checking assumes that the covariates are fully observed. However, in reality, many covariates are observed with errors. Instead of observing the true covariate of interest, one observes a surrogate variable. The regression models where covariates are observed with errors are known as measurement error regression models. In general, there are two types of measurement error models: error-in-variables (EIV) models and Berkson models. The monographs of Fuller (1987), Cheng and van Ness (1999) and Carroll, Ruppert, Stefanski and Crainiceanu (2006) provide ample examples of such models and contain systematic analysis of the underlying issues involved in these models. In general, a naive analysis treating the error-prone covariates as the true ones causes heavy bias in the estimators of the underlying parameters and loss of power in tests. Hence regression calibration is needed for parameter estimation and hypothesis testing.

In economics and other social sciences, Tobit regression, first introduced by Tobin (1958), is a useful model for analyzing truncated responses. In Tobit models, early studies mainly focused on parameter estimation. For example, Amemiya (1984) summarizes comprehensive estimation procedures when the regression error follows a Gaussian distribution, while Abarin and Wang (2009) propose second-order least squares estimation when the error has a general parametric distribution. In Tobit EIV models, Wang (1998) proposed two-step moment estimators in the linear case when both the covariate and the measurement error are normally distributed. When the measurement error distribution is unknown, but with the help of validation data, the least squares estimation procedure proposed by Song (2009) is applicable after the regression model is calibrated to the one based on the surrogate variable.
Regarding the model checking problem in Tobit EIV models, existing methods include a score-type test proposed by Song (2009) and a transformation-based distribution-free test proposed by Song (2011). The former test applies to the least squares estimators, and the latter test achieves superior performance for a one-dimensional covariate when the measurement error distribution is known. However, the measurement error distribution is hardly ever known in reality. An alternative approach is to conduct statistical inference with an available validation sample.

Chapter 2 of this thesis aims to develop a robust model checking procedure in Tobit EIV models with the help of validation data. In that chapter, a kernel-based nonparametric test is proposed for fitting a parametric function to the regression function in Tobit EIV models. Given a consistent parameter estimator, the calibrated regression function is first estimated by the Nadaraya-Watson estimator based on validation data; then a class of test statistics is constructed based on the nonparametric residuals. This class of tests is obtained by adapting the test of Zheng (1996) to the current set up. The proposed tests are shown to be robust to the choice of parameter estimator, and the consistency and asymptotic distributions of the test statistics are established under the null hypothesis and under certain alternatives. With two bandwidth parameters involved in these tests, a practical bandwidth selection strategy is developed and applied in a simulation study. The simulation study shows attractive empirical power performance of a member of the proposed class of tests compared to the two existing tests.

When a predicting variable is observed with errors, in many cases it is more appropriate to assume that the true variable equals the observed one plus an error. These are the so-called Berkson measurement error models. For example, the levels of a certain pollutant, such as lead, in a place are usually measured at fixed spots, while the actual exposure of an individual depends on location, time and air condition. Therefore, it is natural to treat the actual exposure as the measured pollutant level plus a small random error term.

To ensure identifiability in Berkson models, it is often assumed that the density of the measurement error is known. For parameter estimation, Wang (2004) proposed a minimum distance procedure based on the first two moments of the responses in nonlinear regressions. As for model checking, the main contribution is made by Koul and Song (2009), where a minimum distance test is constructed based on a kernel smoothing technique. In all the studies for Berkson models mentioned above, the authors assume that the measurement error density is known, or known up to an unknown Euclidean parameter. This is a restrictive assumption, as it limits the applicability of the inference procedures. However, the availability of validation data helps to circumvent this assumption.

Regarding parameter estimation in Berkson models with validation data, a collection of attractive methodologies have been studied. In linear and nonlinear EIV models with validation data, Lee and Sepanski (1995) constructed an estimation procedure based on least squares methods with the regression functions replaced by their corresponding wide-sense conditional expectation functions.
In linear EIV models, Wang and Rao (2002) developed an estimated empirical log-likelihood based on the validation data and then constructed an estimated empirical likelihood confidence region for the parameters in the regression functions. For general Berkson-type models, Du, Zou and Wang (2011) proposed a nonparametric regression function estimator based on kernel smoothing techniques. But the model checking literature for Berkson models with validation data appears to be sparse.

To fill this gap, in Chapter 3 below, we adapt the minimum distance methodology of Koul and Song (2009) to propose analogous procedures for obtaining the parameter estimators and further performing lack-of-fit hypothesis testing. The regression function given the surrogate variable is nonparametrically estimated based on validation data. Then an integrated square distance between the responses and the regression estimator is defined by means of kernel smoothing and is minimized for parameter estimation. Eventually the minimized distance is used as the test statistic. Both consistency and asymptotic normality of the proposed estimator are established. The asymptotic distributions of the proposed test statistics under the null hypothesis and their consistency against certain fixed alternative hypotheses are also derived. It is shown that the asymptotic distributions of these test statistics are the same as in the case of known measurement error density, while those of the corresponding estimators of the parameters in the null model are affected by the estimation of the regression function using validation data. A finite sample study shows negligible bias in the proposed minimum distance estimators. Empirical levels and powers are obtained for different choices of the sample size ratio between the primary data and the validation data under various alternative hypotheses. The empirical level is well controlled in most of the chosen cases, and the empirical power increases significantly as the sample size increases for all the chosen alternatives.

Chapter 2

Model checking in Tobit EIV regression using validation data

2.1 Introduction

In economics and other social sciences, many response variables are observed with lower or upper thresholds. For instance, household expenditure on certain durable goods is zero for some families, depending on other factors, and positive for other families; hours worked are zero for women who choose not to work and positive for others; and, as a third example, the demand for tickets for a game or conference is limited by the capacity of the event. Regression models with truncated response data were first studied by Tobin (1958). Since then these models have been called Tobit regression models. Bhattacharya, Chernoff and Yang (1983) developed a nonparametric Mann-Whitney type estimator of the parameter in linear Tobit models. The survey paper of Amemiya (1984) provides a comprehensive introduction to these regression models with Gaussian errors.

To proceed a bit more precisely, in the Tobit regression model of interest here, the scalar response variable Y* is observed only when it is positive and is related to the p-dimensional predicting variable vector X by the relation
\[
Y^* = \mu(X) + \varepsilon, \qquad Y = Y^* I(Y^* > 0). \tag{2.1}
\]
Here Y is the observed response, and the scalar random error ε is assumed to have zero mean and to be independent of X, so that µ(x) = E(Y*|X = x), x ∈ R^p.
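To fix ideas, here is a minimal Python sketch (ours, not part of the thesis) that draws data from the Tobit model (2.1) with a linear µ(x) = α + βx; the parameter values mirror those of Simulation 1 in Section 2.5.1, where the truncation rate is about 26%.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_tobit(n, alpha=1.0, beta=0.6, sigma_x=2.0, sigma_eps=1.0):
    """Draw (X, Y) from model (2.1) with mu(x) = alpha + beta*x (illustrative)."""
    X = rng.normal(0.0, sigma_x, n)
    eps = rng.normal(0.0, sigma_eps, n)
    y_star = alpha + beta * X + eps          # latent response Y*
    Y = np.where(y_star > 0, y_star, 0.0)    # observed response Y = Y* I(Y* > 0)
    return X, Y

X, Y = simulate_tobit(500)
print("truncation rate:", np.mean(Y == 0))   # fraction of censored responses
```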
The last two and a half decades have seen intense research activity on the topic of testing for lack-of-fit of a regression model, as is evidenced in the recent review paper of González-Manteiga and Crujeiras (2013). Let Θ ⊂ R^q and M := {m(x, θ), x ∈ R^p, θ ∈ Θ} be a family of known parametric functions. In the lack-of-fit testing problem of interest here, we wish to test the hypothesis

H0: µ(x) = m(x, θ), for some θ ∈ Θ and for all x ∈ C, versus H1: H0 is not true,

where C is a compact subset of R^p. In the case of no measurement error, i.e., when X is fully observed, there are a few tests for fitting a parametric model to µ(x) in the above Tobit regression model. Wang (2007) developed a nonparametric test to diagnose nonlinearity in the median Tobit regression model, where the covariate is non-random and one-dimensional, and the function µ(x) represents the median of the distribution of Y* at the design variable x. Song (2011) proposed an asymptotically distribution-free test for fitting a parametric model to µ(x) of (2.1). This test is based on the supremum of the Stute, Thies and Zhu (1998) type transformation of a partial sum process of calibrated residuals and is applicable only when the dimension p of X equals 1. Koul, Song and Liu (2014) adapted Zheng's (1996) statistic to test H0 for a large class of given families M and for p ≥ 1.

The goal here is to develop tests for H0 in the model (2.1) when there is measurement error in the covariate vector X, i.e., when one does not observe X. Instead one observes a surrogate variable Z related to X by the relation
\[
Z = X + U, \tag{2.2}
\]
where the random error U is distributed with mean 0 and unknown covariance matrix Σ_u. The r.v.'s X, U, ε are assumed to be mutually independent. Throughout the chapter, the primary data set consists of a random sample of n observations {(Y_i, Z_i), i = 1, 2, ..., n} obtained from the models (2.1) and (2.2). We further assume that there is a validation data set consisting of N i.i.d. observations {(X_j, Z_j), j = 1, ..., N} on (X, Z) of (2.2), independent of the primary data set.

To proceed further, we recall the testing methodology developed in Koul et al. (2014) (KSL) when there is no measurement error in X. For any r.v. V, let f_V denote its density. Under H0, the regression function of Y, given X, is
\[
q(x, \theta) = E(Y|X = x) = m(x, \theta)\, Q_{\varepsilon,0}(-m(x, \theta)) + Q_{\varepsilon,1}(-m(x, \theta)), \qquad Q_{\varepsilon,j}(z) = \int_z^{\infty} u^j f_\varepsilon(u)\, du, \quad j = 0, 1.
\]
Thus one has the regression model
\[
Y = q(X, \theta) + \xi, \qquad E(\xi|X) = 0. \tag{2.3}
\]
The function q is monotone as a function of m(x, θ). Hence the original testing problem is equivalent to testing for E(Y|X = x) = q(x, θ). As in KSL, in order to ensure the model identifiability, and for simplicity, we assume f_ε to be known. See Remark 2.3.1 for the case when f_ε belongs to a parametric family.

In the error-in-variables model considered here, the regression model (2.3) is of little help, because X is not observable. Instead we now derive a new regression model, given Z. Let g(z, θ) = E(q(X, θ)|Z = z). Direct calculations show that under H0, E(Y|Z = z) = g(z, θ), so that we have the regression model
\[
Y = g(Z, \theta) + \eta, \qquad E(\eta|Z) = 0. \tag{2.4}
\]
The original testing problem is thus transformed to the problem of testing
\[
\tilde H_0: E(Y|Z = z) = g(z, \theta), \ \text{for some } \theta \in \Theta, \text{ for all } z \in C, \quad \text{versus} \quad \tilde H_1: \tilde H_0 \text{ is not true}. \tag{2.5}
\]
Clearly H0 implies H̃0. The converse need not be true in general. Song (2008) gives a sufficient condition for the equivalence of H0 and H̃0: it suffices to require the family of densities f_U(z − ·), z ∈ R^p, to be a complete family. From now on, we focus on testing (2.5) under (2.4), given both the primary data and the validation data.
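As a concrete check of the calibration step above, the following sketch (ours) evaluates q(x, θ) for a standard normal f_ε by numerically integrating Q_{ε,j} and compares it with the closed form (α + βx)Φ(α + βx) + φ(α + βx) given in Remark 2.3.1 below (with σ = 1); all function names are assumptions of this illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def Q_eps(z, j):
    """Q_{eps,j}(z) = int_z^inf u^j f_eps(u) du for standard normal f_eps."""
    val, _ = quad(lambda u: u**j * norm.pdf(u), z, np.inf)
    return val

def q(m):
    """Calibrated regression q = m*Q_{eps,0}(-m) + Q_{eps,1}(-m)."""
    return m * Q_eps(-m, 0) + Q_eps(-m, 1)

m = 1.0 + 0.6 * 0.5  # m(x, theta) = alpha + beta*x at x = 0.5
print(q(m))                            # numerical integration
print(m * norm.cdf(m) + norm.pdf(m))   # closed form (Remark 2.3.1, sigma = 1)
```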
The existing literature on parametric measurement error Tobit regression models mainly focuses on parameter estimation. Song (2009) obtained consistent estimators by a modified least squares procedure, while Wang (1998) proposed method of moments estimators when X, U and ε all follow normal distributions. Both estimators are shown to be √n-consistent for the true parameter θ0 and asymptotically normal. As far as lack-of-fit testing in these models is concerned, under the assumption that the measurement error density f_U is known, Song and Yao (2011) generalized the test procedure in Song (2011) to measurement error Tobit models with a one-dimensional covariate, while Song (2009) provided a score-type test based on the least squares residuals, which were constructed using the validation data. In the current chapter we assume the availability of a validation data set, which is used to estimate g and f_U, thereby avoiding the assumption of a known f_U.

In the next section we describe the proposed tests for this problem. Section 2.3 establishes the asymptotic normality of the proposed test statistics under H0 and under some alternatives, along with the needed assumptions. Some parameter estimators under H0 are described in Section 2.4. Section 2.5.1 reports the findings of a finite sample simulation study, which shows some superiority of a member of the proposed class of tests compared to the tests in Song and Yao (2011) and Song (2009) in terms of empirical power. The proposed test is also applied to a real data example in Section 2.5.2, and the results validate the current understanding of the dataset. All proofs are deferred to the final Section 2.6.

2.2 A class of tests

To describe the proposed class of tests, we need to construct residuals in the model (2.4). Since the function g is unknown, we need to estimate it nonparametrically, and the validation data is critical for doing this. Let K be a kernel density function, w be a window width associated with the sample sizes n and N, and set K_w(x) = K(x/w)/w^p, x ∈ R^p. Let
\[
W_N(z, \theta) := N^{-1} \sum_{k=1}^{N} K_w(z - Z_k)\, q(X_k, \theta), \qquad \tilde f(z) := N^{-1} \sum_{k=1}^{N} K_w(z - Z_k), \qquad z \in \mathbb{R}^p, \ \theta \in \Theta.
\]
Then, for a given θ, a kernel estimator of g(z, θ) using the validation data set is
\[
\hat g(z, \theta) := \frac{W_N(z, \theta)}{\tilde f(z)}, \qquad z \in \mathbb{R}^p. \tag{2.6}
\]
Because the validation data is independent of the primary data, ĝ(z, θ) is independent of the primary data for each θ, and so is the kernel density estimator f̃ of the density f_Z of Z.
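For illustration, here is a minimal sketch of the estimator (2.6) for p = 1, using the Epanechnikov kernel employed later in Section 2.5.1; the helper names are ours, and the handling of a vanishing denominator is a simplification.

```python
import numpy as np

def K(u):
    """Epanechnikov kernel, the density used in Section 2.5.1."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def g_hat(z, q_vals, Z_val, w):
    """Nadaraya-Watson estimator (2.6) from validation data (p = 1):
    q_vals[k] = q(X_k, theta), Z_val[k] = Z_k, bandwidth w."""
    weights = K((z - Z_val) / w) / w        # K_w(z - Z_k)
    f_tilde = np.mean(weights)              # density estimate f_tilde(z)
    W_N = np.mean(weights * q_vals)         # numerator W_N(z, theta)
    return W_N / f_tilde if f_tilde > 0 else np.nan
```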
Let θ0 be the true value of the parameter θ for which H0 holds. Let θ̂n be a √n-consistent estimator of θ0, and define the residuals η̂i = Yi − ĝ(Zi, θ̂n), i = 1, ..., n. Let h = hn be another sequence of window widths. Following the idea proposed by Zheng (1996), under H0, since E(ηi|Zi = z) = 0 for all z ∈ C, we have
\[
E[\eta_i E(\eta_i|Z_i) f_Z(Z_i)] = 0, \quad \forall\, i \ge 1, \tag{2.7}
\]
while the left-hand side is strictly positive under H1. In order to use the empirical version of (2.7) in the primary dataset to form the test statistic, the conditional expectation in the above equation can be estimated by
\[
\hat E[\eta_i | Z_i] = \frac{1}{(n-1)h^p} \sum_{j=1, j \ne i}^{n} K\Big(\frac{Z_j - Z_i}{h}\Big)\, \hat\eta_j\, I_C(Z_j) \Big/ \tilde f(Z_i). \tag{2.8}
\]
Upon multiplying this by η̂i f̃(Zi) I_C(Zi) and then summing over i, we arrive at the class of test statistics, one for each K,
\[
V_n = \frac{1}{n(n-1)h^p} \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} I_C(Z_i) I_C(Z_j)\, K\Big(\frac{Z_i - Z_j}{h}\Big)\, \hat\eta_i \hat\eta_j, \tag{2.9}
\]
useful in the current set up. The reason for restricting the covariate Z to a compact set C is to avoid the usual difficulty associated with a vanishing f̃(z).

We first decompose the residuals as
\[
\hat\eta_i = Y_i - \hat g(Z_i, \hat\theta_n) = [Y_i - g(Z_i, \theta_0)] + [g(Z_i, \theta_0) - \hat g(Z_i, \theta_0)] + [\hat g(Z_i, \theta_0) - \hat g(Z_i, \hat\theta_n)] := \eta_i - e_i - \delta_i, \ \text{say}. \tag{2.10}
\]
Compared to the test statistic in Zheng (1996), it is important to observe that there is an extra nonparametric estimation residual term ei involved here, which is shown later to contribute to the asymptotic distribution of Vn.
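The statistic (2.9) is straightforward to compute once the residuals are in hand. Below is an illustrative vectorized computation for p = 1; the names are ours, and the compact set C is passed as an interval.

```python
import numpy as np

def V_n(Z, eta_hat, h, C):
    """Test statistic (2.9) for p = 1; eta_hat are residuals Y_i - g_hat(Z_i)."""
    a, b = C
    inC = (Z >= a) & (Z <= b)
    D = (Z[:, None] - Z[None, :]) / h                    # (Z_i - Z_j)/h
    Kmat = 0.75 * (1.0 - D**2) * (np.abs(D) <= 1.0)      # Epanechnikov kernel
    W = np.outer(inC * eta_hat, inC * eta_hat) * Kmat    # I_C I_C K eta_i eta_j
    np.fill_diagonal(W, 0.0)                             # exclude i = j terms
    n = len(Z)
    return W.sum() / (n * (n - 1) * h)                   # h^p with p = 1
```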
2.3 Asymptotic distributions

In this section we shall derive the asymptotic distributions of Vn under H0 and under some alternatives.

2.3.1 Asymptotic null distribution

We shall first state the assumptions needed for obtaining the asymptotic null distribution of Vn. Let
\[
\sigma^2(z) := E(\eta^2|Z = z), \qquad \gamma^2(z) := E\big\{[q(X, \theta_0) - g(Z, \theta_0)]^2 \mid Z = z\big\}, \qquad z \in \mathbb{R}^p,
\]
let N0 denote an open neighborhood of θ0, and let ||·|| denote the Euclidean norm of a vector or a matrix. For a given positive integer k, a density kernel K is said to be of order k if
\[
\int u^j K(u)\, du = 0 \ \text{ for all } 1 \le j \le k - 1, \qquad \int u^k K(u)\, du \ne 0.
\]
We are now ready to state our assumptions. All limits are taken as N ∧ n → ∞, unless mentioned otherwise.

(C1) The density function fZ is continuously differentiable and inf_{z∈C} fZ(z) > 0.
(C2) The regression function m(x, θ) is differentiable with respect to θ, for each x ∈ R^p, with its vector of derivatives ṁ(x, θ) satisfying E[sup_{θ∈Θ} |m(X, θ)|² + sup_{θ∈Θ} ||ṁ(X, θ)||²] < ∞.
(C3) There exists an estimator θ̂n of θ0 such that √n ||θ̂n − θ0|| = Op(1) under H0.
(C4) For some ∆ > 0, sup_{z∈R^p} σ^{2+∆}(z) fZ(z) < ∞. The functions σ²(z)fZ(z) and g(z, θ0)fZ(z) are continuous and uniformly bounded. The functions fZ(z) and g(z, θ0)fZ(z) and their first and second derivatives are continuous and uniformly bounded.
(C5) For each z ∈ R^p, g(z, θ) is twice continuously differentiable in θ at θ0. The second derivative g̈ satisfies E{sup_{θ∈N0} ||g̈(Z, θ)||²} < ∞.
(C6) E(σ²(Z))² + E(γ²(Z))² < ∞.
(C7) K is a continuous and symmetric density function on R^p with bounded partial derivatives, and of order k > p/4.
(C8) N/n → λ, h → 0, w → 0, and w/h → c, 0 ≤ c < ∞.
(C9) (n ∧ N)h^p → ∞, (n ∧ N)w^p → ∞.
(C10) With k as in (C7), nh^{p/2} w^{2k} → 0.

Remark 2.3.1. Parametric fε. Suppose fε belongs to a parametric family of densities with an unknown parameter vector ν. Then Qε,1, Qε,0 and g(z, θ) will also depend on ν. Let γ := (θ^T, ν^T)^T, and let γ̂ be a √n-consistent estimator of γ under H0. Then one can apply the above tests with g(z, θ̂) replaced by g(z, γ̂) throughout. The asymptotic distributions of the thus modified test statistics are not affected by this modification. For example, if ε ∼ N1(0, σ²) and m(x, θ) = α + βx, then γ = (θ^T, σ)^T = (α, β, σ)^T and
\[
q(x, \gamma) = m(x, \theta)\, Q_{\varepsilon,0}(-m(x, \theta)) + Q_{\varepsilon,1}(-m(x, \theta)) = (\alpha + \beta x)\, \Phi\Big(\frac{\alpha + \beta x}{\sigma}\Big) + \sigma\, \phi\Big(\frac{\alpha + \beta x}{\sigma}\Big),
\]
where Φ and φ are the cumulative distribution function and the density function of the standard normal distribution. For more estimation details, see Wang (1998) and Amemiya (1984).

Remark 2.3.2. The conditions (C3)–(C5) are essentially used to ensure the √n-consistency of the least squares parameter estimators in Section 2.4, while the assumptions (C7)–(C10) about K and the bandwidths are needed to derive the asymptotic distributions of the proposed test statistics.

Remark 2.3.3. Note that the order k of the kernel function K needs to be larger than p/4 in order to obtain a valid bandwidth h, since both nh^p → ∞ and nh^{p/2+2k} → 0 should be satisfied. For example, if p < 8, (C10) will automatically hold for any symmetric kernel density. However, the test will suffer from the curse of dimensionality, since the asymptotic bias of kernel regression estimators is of the order O(h^k). As p increases, the basic assumption nh^p → ∞ requires a wider bandwidth. As a consequence, the order k of the kernel function has to be increased in order to make the bias negligible compared to the asymptotic rate.

To state the main theorem, we need to introduce
\[
K_1 := \int K^2(u)\, du, \qquad K_2 := \int \Big[\int\!\!\int K(u) K(v)\, \tfrac{1}{2}\big\{K(s + c(u - v)) + K(-s + c(u - v))\big\}\, du\, dv\Big]^2 ds, \tag{2.11}
\]
\[
\tau_1 := \int I_C(z)\, [\sigma^2(z)]^2 f_Z^2(z)\, dz \cdot K_1, \qquad \tau_2 := \int I_C(z)\, [\gamma^2(z)]^2 f_Z^2(z)\, dz \cdot K_2,
\]
where c is as in assumption (C8). We are now ready to state the following theorem, describing the asymptotic null distribution of Vn. Throughout, →p and →D denote convergence in probability and in distribution, respectively.

Theorem 2.3.1. Under (2.1) and (2.2), the assumptions (C1)–(C10) and H0, the following result holds. With λ as in assumption (C8), if 0 < λ < ∞, then
\[
nh^{p/2} V_n \to_d N_1\big(0,\ 2\tau_1 + 2\tau_2/\lambda^2\big). \tag{2.12}
\]
Moreover, τ1 and τ2 can be consistently estimated by
\[
\hat\tau_1 := \frac{1}{n(n-1)} \sum_{i \ne j} I_C(Z_i) I_C(Z_j)\, K_h(Z_i - Z_j)\, \hat\eta_i^2 \hat\eta_j^2\, K_1, \qquad \hat\tau_2 := \frac{1}{N(N-1)} \sum_{k \ne l} I_C(Z_k) I_C(Z_l)\, K_w(Z_k - Z_l)\, \tilde\eta_k^2 \tilde\eta_l^2\, K_2,
\]
where η̃k = q(Xk, θ̂n) − ĝ(Zk, θ̂n), k = 1, ..., N. Consequently, the test that rejects the null hypothesis whenever
\[
\big|nh^{p/2} V_n\big| \big/ \sqrt{2\hat\tau_1 + 2\hat\tau_2/\lambda^2} > z_{\alpha/2}
\]
will have asymptotic size α, where zα is the upper α quantile of the N1(0, 1) distribution.

The proof of the above theorem is given in Section 2.6. Here we briefly sketch the idea of the proof, which is also helpful in discussing the case λ = ∞. For the sake of brevity, write Σ_{i≠j} for Σ_{i=1}^n Σ_{j=1, j≠i}^n, Σ_{k≠l} for Σ_{k=1}^N Σ_{l=1, l≠k}^N, and let
\[
K_{h,ij} := K_h(Z_i - Z_j) = h^{-p} K\big((Z_i - Z_j)/h\big), \qquad 1 \le i, j \le n.
\]
Then, using the decomposition (2.10), the statistic Vn can be decomposed as
\[
V_n = \frac{1}{n(n-1)} \sum_{i \ne j} I_C(Z_i) I_C(Z_j) K_{h,ij}\, \hat\eta_i \hat\eta_j \tag{2.13}
\]
\[
= \frac{1}{n(n-1)} \sum_{i \ne j} I_C(Z_i) I_C(Z_j) K_{h,ij} \big(\eta_i \eta_j + e_i e_j + \delta_i \delta_j - 2\eta_i e_j - 2\eta_i \delta_j - 2 e_i \delta_j\big) := V_{n1} + V_{n2} + V_{n3} - 2U_{n1} - 2U_{n2} - 2U_{n3}, \ \text{say}.
\]
We will first show that only Vn1 and Vn2 contribute to the asymptotic variance of nh^{p/2}Vn, and that the asymptotic mean of nh^{p/2}Vn is 0 under the assumed conditions. Then both Vn1 and Vn2 are approximated by degenerate U statistics, constructed from the projection of Vn1 based only on the primary data and that of Vn2 based only on the validation data. Eventually we obtain that Vn is asymptotically normally distributed with the convergence rate nh^{p/2}.

In fact, τ1 = E_C{[σ²(Z)]² fZ(Z)} K1 = E_C{η² [fZ(Z) E(η²|Z)]} K1, where E_C denotes the expectation over the compact subset C. The unconditional expectation can be consistently estimated by the sample average
\[
\frac{1}{n} \sum_{i=1}^{n} I_C(Z_i)\, \hat\eta_i^2\, \hat f(Z_i)\, \hat E(\eta_i^2|Z_i),
\]
where f̂ is a kernel density estimator of fZ based on {Zi, i = 1, ..., n} in the primary data, and the conditional expectation is estimated by the kernel estimator
\[
\hat E(\eta_i^2|Z_i) = \frac{1}{n-1} \sum_{j=1, j \ne i}^{n} I_C(Z_j)\, K_{h,ij}\, \hat\eta_j^2 \big/ \hat f(Z_i).
\]
Plugging the estimated conditional residuals into the sample average, we obtain
\[
\hat\tau_1 = \frac{1}{n(n-1)} \sum_{i \ne j} I_C(Z_i) I_C(Z_j)\, K_{h,ij}\, \hat\eta_i^2 \hat\eta_j^2\, K_1.
\]
The parameter τ2 can be estimated similarly. Actually, both {Zi, i = 1, ..., n} and {Zk, k = 1, ..., N} can be used to form the kernel density estimator in τ̂1, to make the estimation more efficient, as long as they are i.i.d. copies of Z.
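For illustration, the estimator τ̂1 of Theorem 2.3.1 and the resulting test decision might be coded as follows for p = 1; the two-sided rejection rule in the comment reflects the standardization above, and all names are ours.

```python
import numpy as np

K1 = 0.6  # int K^2(u) du for the Epanechnikov kernel

def tau1_hat(Z, eta_hat, h, C):
    """tau_1 estimator of Theorem 2.3.1 (p = 1):
    (1/(n(n-1))) sum_{i != j} I_C I_C K_h(Z_i - Z_j) eta_i^2 eta_j^2 * K1."""
    a, b = C
    inC = (Z >= a) & (Z <= b)
    D = (Z[:, None] - Z[None, :]) / h
    Kh = 0.75 * (1.0 - D**2) * (np.abs(D) <= 1.0) / h     # K_h(Z_i - Z_j)
    W = np.outer(inC * eta_hat**2, inC * eta_hat**2) * Kh
    np.fill_diagonal(W, 0.0)
    n = len(Z)
    return K1 * W.sum() / (n * (n - 1))

# tau2_hat is computed the same way from the validation residuals
# eta_k = q(X_k, theta_hat) - g_hat(Z_k, theta_hat), bandwidth w and constant K2.
# Reject H0 at level alpha when
#   abs(n * np.sqrt(h) * Vn) / np.sqrt(2*t1 + 2*t2/lam**2) > z_{alpha/2}.
```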
Remark 2.3.4. Alternative consistent estimators of τ1 and τ2 are given by
\[
\tilde\tau_1 := \frac{1}{n(n-1)} \sum_{i \ne j} I_C(Z_i) I_C(Z_j)\, h^p K_{h,ij}^2\, \hat\eta_i^2 \hat\eta_j^2, \qquad \tilde\tau_2 := \frac{1}{N(N-1)} \sum_{k \ne l} I_C(Z_k) I_C(Z_l)\, w^p K_{w,kl}^2\, \tilde\eta_k^2 \tilde\eta_l^2. \tag{2.14}
\]
A modification of the proofs in Zheng (1996) yields that τ̃j, j = 1, 2, are indeed consistent for τj, j = 1, 2, respectively. Details are skipped. A simulation study also shows little difference between the two proposed estimation methods when the sample size is large.

Remark 2.3.5. The cases λ = ∞ and λ = 0. When the validation data size N is much larger than the primary data size n, i.e., when λ is sufficiently large, the regression function g(Z, θ0) can be efficiently recovered by means of the validation data; hence the primary data dominates the asymptotic behavior of the tests, while the validation data does not play a role in their asymptotic variances. This is justified by letting N/n → ∞, i.e., λ = ∞. From the proof of the above theorem, we see that nh^{p/2} Vn1 →D N(0, 2τ1) and N h^{p/2} Vn2 →D N(0, 2τ2). Moreover, Vn1 and Vn2 are asymptotically independent. All other terms in Vn are asymptotically negligible compared to these two. Hence
\[
nh^{p/2} V_n \sim nh^{p/2} V_{n1} + nh^{p/2} V_{n2} = nh^{p/2} V_{n1} + \frac{n}{N}\, N h^{p/2} V_{n2} \to_D N(0, 2\tau_1),
\]
since now n/N → 0. On the other hand, when the primary data set is much larger than the validation data set, the asymptotic convergence rate is limited by the validation sample size. By going through the proof of Theorem 2.3.1 again with N/n → 0, we obtain

Theorem 2.3.2. Under (2.1), (2.2), assumptions (C1)–(C10), and H0, as N/n → λ = 0,
\[
N h^{p/2} V_n \to_D N(0, 2\tau_2).
\]

2.3.2 Asymptotic power

In this section we shall investigate the asymptotic power of the proposed tests under the fixed alternative H′: µ(x) = ℓ(x), where ℓ ∉ M and Eℓ²(X) < ∞. Let h(Z) = E(ℓ(X)|Z). Then the relation between Y and Z takes the form Y = h(Z) + η. Note that Eℓ²(X) < ∞ implies that Eh²(Z) < ∞. Additionally, we assume the following.

(C11) E[(h(Z) − g(Z, θ))² fZ²(Z)] has a unique minimizer θa.

Under H′, the decomposition of the residuals (2.10) becomes
\[
\hat\eta_i = [Y_i - g(Z_i, \theta_a)] + [g(Z_i, \theta_a) - \hat g(Z_i, \theta_a)] + [\hat g(Z_i, \theta_a) - \hat g(Z_i, \hat\theta_n)] = \bar\eta_i - \bar e_i - \bar\delta_i,
\]
where η̄i = Yi − g(Zi, θa). Because E(η̄i|Zi) = h(Zi) − g(Zi, θa) is no longer 0 under H′, Vn1 is a non-degenerate U statistic. As shown in Lemma 2.6.1, under H′ the asymptotic behavior of Vn1 is still dominated by
\[
T_{n1} = \frac{1}{n(n-1)} \sum_{i \ne j} \tilde\varphi_2(Z_i, Z_j), \qquad \tilde\varphi_2(Z_i, Z_j) = I_C(Z_i) I_C(Z_j)\, K_{h,ij}\, [h(Z_i) - g(Z_i, \theta_a)][h(Z_j) - g(Z_j, \theta_a)].
\]
Tn1 is a non-degenerate U statistic as well. By Lemma 3.1 in Zheng (1996),
\[
T_{n1} = \frac{2}{n} \sum_{i=1}^{n} E[\tilde\varphi_2(Z_i, Z_j)|Z_i] - E[\tilde\varphi_2(Z_1, Z_2)] + o_p(1/\sqrt n).
\]
By the weak law of large numbers, the first term above converges in probability to 2E[φ̃2(Z1, Z2)]. Algebra shows that
\[
E[\tilde\varphi_2(Z_1, Z_2)] = E\{I_C(Z)[h(Z) - g(Z, \theta_a)]^2 f_Z(Z)\} + o(1).
\]
Hence
\[
T_{n1} \to_p E\big\{I_C(Z)[h(Z) - g(Z, \theta_a)]^2 f_Z(Z)\big\}. \tag{2.15}
\]
Since Tn1 is the only leading term in Vn1, we have
\[
V_{n1} \to_p E\big\{I_C(Z)[h(Z) - g(Z, \theta_a)]^2 f_Z(Z)\big\}. \tag{2.16}
\]
Under H′, both ēi and δ̄i share the properties of ei and δi with θ0 replaced by θa.
Thus, arguing as under H0, one can verify that all the terms except Vn1 in (2.13) are Op(1/(nh^{p/2})). Moreover,
\[
\tilde\tau_1 = \frac{1}{n(n-1)} \sum_{i \ne j} I_C(Z_i) I_C(Z_j)\, h^p K_{h,ij}^2\, \hat\eta_i^2 \hat\eta_j^2 = \frac{1}{n(n-1)} \sum_{i \ne j} I_C(Z_i) I_C(Z_j)\, h^p K_{h,ij}^2\, \bar\eta_i^2 \bar\eta_j^2 + o_p(1)
\]
\[
\to_p \int K^2(u)\, du\; E\big\{I_C(Z)\big(\sigma^2(Z) + [h(Z) - g(Z, \theta_a)]^2\big)^2 f_Z(Z)\big\} := \bar\tau_1.
\]
Unlike τ̃1, τ̃2 does not involve any information about Yi. Instead, since η̃k = q(Xk, θ̂n) − ĝ(Zk, θ̂n) and the kernel regression estimator ĝ(Zk, θ̂n) always consistently estimates E(q(Xk, θ̂n)|Zk), we have τ̃2 →p τ2 under H′. Now we state the asymptotic behavior of the proposed test under H′.

Theorem 2.3.3. Under the conditions (C1)–(C11) and under the alternative hypothesis H′, for finite 0 < λ < ∞,
\[
\frac{V_n}{\sqrt{2\hat\tau_1 + 2\hat\tau_2/\lambda^2}} \to_p \frac{E\{I_C(Z)[h(Z) - g(Z, \theta_a)]^2 f_Z(Z)\}}{\sqrt{2\bar\tau_1 + 2\tau_2/\lambda^2}} > 0.
\]
The standardized test statistic nh^{p/2} Vn / √(2τ̂1 + 2τ̂2/λ²) →p ∞ under H′. Hence the proposed test is consistent against the alternatives H′.

Now we consider a sequence of local alternatives:
\[
H_a: \ E(Y_i^*|X_i) = m(X_i, \theta_0) + b_n a(X_i),
\]
where a(·) ∉ M, a(·) is continuously differentiable, E[a(X)]² < ∞, and bn → 0.

Theorem 2.3.4. Under (C1)–(C11) and Ha with bn = (nh^{p/2})^{−1/2} and 0 < λ < ∞, we have
\[
\frac{nh^{p/2} V_n}{\sqrt{2\hat\tau_1 + 2\hat\tau_2/\lambda^2}} \to_D N(\gamma, 1), \qquad \gamma = E\{I_C(Z)\, E[a(X)|Z]^2 f_Z(Z)\} \big/ \sqrt{2\tau_1 + 2\tau_2/\lambda^2}.
\]

2.4 Estimation of θ0

To perform the proposed testing procedure, we first need to obtain a √n-consistent estimator of θ0. Song (2009) uses the least squares estimator
\[
\hat\theta_{OLS} = \arg\min_{\theta} \sum_{i=1}^{n} I_C(Z_i)\big(Y_i - \hat g_{w_s}(Z_i, \theta)\big)^2, \tag{2.17}
\]
where the Nadaraya-Watson regression estimator ĝ_{ws} employs a bandwidth ws. In order to assess the performance of the proposed test under different estimation procedures, we also introduce weighted least squares estimators. In particular, we choose the weight f̃(Zi) for each observation Zi to avoid the instability of kernel regression estimation when f̃(Zi) is close to 0. In other words, we consider the weighted least squares estimator
\[
\hat\theta_{WLS} = \arg\min_{\theta} \sum_{i=1}^{n} \big(Y_i - \hat g_{w_s}(Z_i, \theta)\big)^2\, [\tilde f(Z_i)]^2. \tag{2.18}
\]
An argument similar to the one used in Song (2009) shows that θ̂WLS is also asymptotically normal with the convergence rate √n.
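As an illustration of (2.17), here is a sketch of the least squares estimation with a numerical optimizer; g_hat re-implements (2.6) with the Gaussian-error q of Remark 2.3.1 (σ = 1), and the starting value and all names are assumptions of this sketch. The weighted version (2.18) is obtained by replacing the indicator weights with [f̃(Zi)]².

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def g_hat(z, theta, Z_val, X_val, w):
    """Estimator (2.6) with q(x, theta) = m*Phi(m) + phi(m), m = a + b*x."""
    a, b = theta
    m = a + b * X_val
    q = m * norm.cdf(m) + norm.pdf(m)
    wts = 0.75 * (1 - ((z - Z_val) / w)**2) * (np.abs(z - Z_val) <= w)
    return np.sum(wts * q) / np.sum(wts)     # simplification: assumes positive mass

def theta_ols(Y, Z, Z_val, X_val, ws, C, theta0=(1.0, 0.5)):
    """Least squares estimator (2.17): minimize sum I_C (Y_i - g_hat(Z_i, theta))^2."""
    a, b = C
    inC = (Z >= a) & (Z <= b)
    def loss(theta):
        g = np.array([g_hat(z, theta, Z_val, X_val, ws) for z in Z])
        return np.sum(inC * (Y - g)**2)
    return minimize(loss, theta0, method="Nelder-Mead").x
```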
In addition, when all the r.v.'s X, u and ε are Gaussian in the above linear errors-in-variables Tobit model, Wang (1998) obtained two-step moment estimators θ̂TME. In order to conveniently apply this estimator in the next section, we briefly describe it here. Assume
\[
Y^* = \alpha + \beta^T X + \varepsilon, \qquad Y = \max\{Y^*, 0\}, \qquad Z = X + u,
\]
\[
X \sim N(\mu_X, \Sigma_X), \qquad u \sim N(0, \Sigma_u), \qquad \varepsilon \sim N_1(0, \sigma_\varepsilon^2),
\]
where X, u and ε are mutually independent and ∆ := Σ_X^{−1}Σ_u is known. Under the normality assumption, the first and second moments can be calculated:
\[
\mu_{Y^*} = \alpha + \beta^T \mu_X, \qquad \sigma_{Y^*}^2 = \beta^T \Sigma_X \beta + \sigma_\varepsilon^2, \tag{2.19}
\]
\[
\mu_X = \mu_Z, \qquad \sigma_{ZY^*} = \Sigma_X \beta, \qquad \Sigma_Z = \Sigma_X + \Sigma_u. \tag{2.20}
\]
One can construct estimating equations by substituting sample moments in the above equations in order to obtain the parameter estimators. However, the moments of Y* are also needed. Algebra shows that
\[
E(Y) = \Phi(\delta)\, E(Y|Y > 0), \qquad \delta := \mu_{Y^*}/\sigma_{Y^*}, \tag{2.21}
\]
\[
E(Y|Y > 0) = \mu_{Y^*} + \sigma_{Y^*}\, \phi(\delta)/\Phi(\delta), \tag{2.22}
\]
\[
E(ZY|Y > 0) = \sigma_{ZY^*} + \mu_Z\, E(Y|Y > 0). \tag{2.23}
\]
Let µ̂Y denote the overall sample mean of the responses Yi and µ̂Y+ the mean of the positive Yi's only. Using sample moments in equation (2.21), we obtain δ̂ = Φ^{−1}(µ̂Y/µ̂Y+). Then, combining this with (2.22), we obtain the estimates of µY* and σY* as follows:
\[
\hat\mu_{Y^*} = \hat\delta\, \hat\mu_{Y^+} \big/ \big[\hat\delta + \phi(\hat\delta)/\Phi(\hat\delta)\big], \qquad \hat\sigma_{Y^*} = \hat\mu_{Y^*}/\hat\delta.
\]
By equation (2.23), one can further estimate σZY* by σ̂ZY* = µ̂ZY+ − µ̂Z µ̂Y+, where µ̂ZY+ is the sample mean of ZiYi over the positive Yi's. Then, plugging µ̂Y*, σ̂Y*, σ̂ZY* into (2.19) and (2.20), and using the known ∆ = Σ_X^{−1}Σ_u, we obtain the following estimators of α and β:
\[
\hat\beta_{TME} = \hat\Sigma_X^{-1}\, \hat\sigma_{ZY^*}, \qquad \hat\Sigma_X = \hat\Sigma_Z (I + \Delta)^{-1}, \qquad \hat\alpha_{TME} = \hat\mu_{Y^*} - \hat\mu_X^T \hat\beta, \qquad \hat\mu_X = \hat\mu_Z.
\]
Computationally, θ̂TME = (α̂TME, β̂TME^T)^T is more efficient due to its closed form, while the other two estimators require numerical optimization. We use all three estimators in the simulation study of the next section.
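The closed-form TME steps above translate directly into code. A sketch for p = 1, under the stated normality assumptions and known ∆; the names are ours.

```python
import numpy as np
from scipy.stats import norm

def theta_tme(Y, Z, Delta):
    """Two-step moment estimator (alpha_TME, beta_TME) for p = 1;
    Delta = Sigma_X^{-1} Sigma_u is assumed known."""
    pos = Y > 0
    mu_Y, mu_Yp = Y.mean(), Y[pos].mean()
    delta = norm.ppf(mu_Y / mu_Yp)                                       # from (2.21)
    mu_Ys = delta * mu_Yp / (delta + norm.pdf(delta) / norm.cdf(delta))  # from (2.22)
    sigma_ZYs = np.mean(Z[pos] * Y[pos]) - Z.mean() * mu_Yp              # from (2.23)
    sigma_X2 = np.var(Z) / (1.0 + Delta)          # Sigma_Z = Sigma_X + Sigma_u
    beta = sigma_ZYs / sigma_X2                   # sigma_ZY* = Sigma_X * beta
    alpha = mu_Ys - beta * Z.mean()               # mu_Y* = alpha + beta*mu_X, mu_X = mu_Z
    return alpha, beta
```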
We used the kernel density K(u) = 0.75(1 − u2 )I(|u| ≤ 1) of order k = 2 for all the kernel density related tests. In the case p = 2, the only other available test is the Song’s score-type test. However, 2p in Song’s test, the bandwidth ws should satisfy both conditions N ws /log(N ) → ∞ and N ws2k → 0, where k is a positive even integer specifying the order of the symmetric kernel density K. When p = 2, the above bandwidth conditions require k to be larger than 2, which means that symmetric densities like the above kernel K(u) can not be used here. If we take k = 4, K will take negative values at certain points. In a simulation, using the fourth order density kernel, K(4) (u, v) = (0.086 − 0.2u2 )(0.086 − 0.2v 2 )K(u)K(v)/0.0462 as introduced in Jones and Signorini (1997), the least square estimator of θ0 showed large bias and mean square error, which in turn makes the score-type test difficult to implement. For these reasons we only report the finite sample behavior of the Vθˆ tests. We find both the empirical levels and powers under the chosen alternatives are satisfying. Bandwidth selection: As is evident, the implementation of the proposed tests requires the selection of the two bandwidths. One is the kernel regression bandwidth w and the other is the bandwidth h used in forming the test statistics Vn . It is thus important to provide a practical strategy for the selection of these bandwidths. ˆ we propose to obtain the optimal w, denoted by wb , by Given a consistent estimator θ, minimizing the mean square error of the kernel regression estimator gˆw as follows. M SE1 (w) := 1 n n ˆ 2, IC (Zi )(Yi − gˆw (Zi , θ)) i=1 25 wb := arg min M SE1 (w). w Note that no cross validation is needed since the Nadaraya-Watson estimator gˆw is constructed based on the independent validation data {(q(Xk ), Zk ), k = 1, ..., N } instead of {(Yj , Zj ), j = 1, ..., n}. Regarding the bandwidth h, recall that h was originally used to estimate E(ηi |Zi ) by the estimator given in (2.8). Since, under H0 , E(η|Z = z) ≡ 0, we propose to obtain an optimal h by minimizing the mean square error M SE2 (h) = 1 n n ˆ i |Zi ])2 = IC (Zi )(E[η i=1 1 n n n i=1 j=i,j=1 2 IC (Zi )IC (Zj ) Zj − Zi ˜(Zi ) . f η ˆ K j (n − 1)hp h To satisfy that h → 0 and w/h → c < ∞ in (C8), we enforce the constraint 0.1wb ≤ h ≤ 10wb in the above minimization. In our simulation, we applied the grid search of bandwidths starting from 0.1wb with step 0.02 to obtain the optimal bandwidth hopt . In some cases, a grid search study showed that M SE2 (h) decays slowly to 0, for all sufficient large values of h. This will cause the chosen bandwidth to be much larger. Hence, we set a threshold 0.05 for M SE2 to avoid choosing too large a bandwidth. To summarize, hopt := min h : h = argmin0.1w ≤h≤10w max{M SE2 (h), 0.05} . b b Simulation 1: p = 1. In this simulation, the data were generated from the model (2.1) and under H0 , where m(x, θ) = α + βx, θ = (α, β)T , with α = 1, β = 0.6, and Z = X + u, where X ∼ N1 (0, 22 ), ε ∼ N1 (0, 1) and u ∼ N (0, 0.52 ) so that the ratio σu2 /σx2 = 1/16 in the TME procedure of Wang (1998). The truncation rate is approximately 26%. Moreover, here q(x, θ) = (α + βx)Φ((α + βx)) + φ((α + βx)), where Φ and φ are distribution function and density function of the N1 (0, 1) r.v., respectively. 26 Following the estimation procedure in Song (2009), ws was set as N −1/3 to obtain estimators θˆOLS and θˆW LS . 
Simulation 1: p = 1. In this simulation, the data were generated from the model (2.1) under H0, where m(x, θ) = α + βx, θ = (α, β)^T, with α = 1, β = 0.6, and Z = X + u, where X ∼ N1(0, 2²), ε ∼ N1(0, 1) and u ∼ N1(0, 0.5²), so that the ratio σu²/σx² = 1/16 in the TME procedure of Wang (1998). The truncation rate is approximately 26%. Moreover, here q(x, θ) = (α + βx)Φ(α + βx) + φ(α + βx), where Φ and φ are the distribution function and the density function of the N1(0, 1) r.v., respectively. Following the estimation procedure in Song (2009), ws was set to N^{−1/3} to obtain the estimators θ̂OLS and θ̂WLS.

Table 2.1 reports the absolute bias and the square root of the mean square error (RMSE, in parentheses) of the three estimators of θ0 described in Section 2.4. As expected, both the bias and the RMSE decrease as the sample size n increases for all three estimators, which indicates their consistency. Moreover, under the Gaussian scenario, θ̂TME is seen to be superior among the three and θ̂WLS is the least favorable. In the power analysis below, we will see that the empirical power of the Vθ̂ test shows a similar pattern.

Table 2.1: Comparison of estimators: absolute bias and RMSE in parentheses

(n, N)      α̂OLS           α̂WLS           α̂TME
(50, 100)   0.019 (0.196)   0.018 (0.237)   0.008 (0.166)
(100, 200)  0.005 (0.129)   0.010 (0.152)   0.004 (0.116)
(200, 400)  0.004 (0.090)   0.005 (0.104)   0.004 (0.082)
(300, 600)  0.002 (0.076)   0.0001 (0.087)  0.001 (0.069)

(n, N)      β̂OLS           β̂WLS           β̂TME
(50, 100)   0.021 (0.142)   0.023 (0.183)   0.005 (0.103)
(100, 200)  0.003 (0.083)   0.011 (0.113)   0.004 (0.069)
(200, 400)  0.004 (0.057)   0.004 (0.081)   0.0006 (0.049)
(300, 600)  0.001 (0.049)   0.002 (0.068)   0.0003 (0.041)

In the discussion below, we write VOLS, VWLS and VTME for Vθ̂ when θ̂ equals the θ̂OLS, θ̂WLS and θ̂TME of the previous section, respectively. To implement the proposed test, the set C was taken to be the overlap interval of {Zi, i = 1, ..., n} and {Zk, k = 1, ..., N}; in other words, C is the interval [a, b], where a = max{min{Zi}, min{Zk}} and b = min{max{Zi}, max{Zk}}. As mentioned in the main theorem, there are two options for estimating the asymptotic variances. To simplify the computation, we used the estimators given at (2.14). Applying the above bandwidth selection scheme, Table 2.2 shows that the empirical levels of all the V tests are well controlled for the large sample sizes.

Table 2.2: Empirical levels for p = 1 at nominal level 0.05

(n, N)       VOLS   VWLS   VTME
(50, 100)    0.008  0.022  0.011
(100, 200)   0.011  0.019  0.014
(200, 400)   0.030  0.036  0.033
(300, 600)   0.029  0.031  0.034
(400, 800)   0.043  0.041  0.049
(500, 1000)  0.046  0.051  0.048

To investigate the power performance, we compared the proposed tests with the Wn and Sn tests mentioned above. We performed a finite sample power comparison by generating data under the model (2.1) and the alternatives H1: µ(x) = m(x, θ, b), for all x ∈ C and some b ∈ R, where m(x, θ, b) = 1 + 0.6x + b sin(x), b ∈ R. Table 2.3 displays the empirical power of the tests for increasing sample sizes. One can see that the empirical power of all tests increases as the sample sizes n, N and the nonlinear effect b increase. For small and moderate sample sizes, VTME performs the best, and both Wn and VTME achieve the highest power for the large sample size among the five tests. All three V tests outperform the score-type test for the larger sample sizes. Among the three V tests, VTME performs the best, followed by VOLS and VWLS. This finding also matches the behavior of the three estimators of θ0 presented in Table 2.1.

Table 2.3: Empirical power comparison for p = 1 at nominal level 0.05

(n, N)       b    Wn     Sn     VOLS   VWLS   VTME
(100, 200)   0    0.059  0.039  0.011  0.019  0.014
             0.5  0.213  0.119  0.115  0.08   0.218
             1    0.382  0.306  0.448  0.300  0.743
(300, 600)   0    0.067  0.049  0.029  0.031  0.034
             0.5  0.563  0.317  0.592  0.400  0.643
             1    0.927  0.504  0.983  0.966  0.984
(500, 1000)  0    0.077  0.058  0.046  0.051  0.048
             0.5  0.822  0.394  0.805  0.640  0.834
             1    0.991  0.583  0.984  0.979  0.983
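A skeleton of this power study (ours) under the alternative µ(x) = 1 + 0.6x + b sin(x); run_test is a placeholder for the full pipeline sketched earlier (estimate θ0, form residuals, compute Vn and standardize), returning the 5% rejection decision.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_replication(n, b, run_test):
    """Generate one primary + validation sample under the alternative and test."""
    N = 2 * n                                    # validation size, as in the study
    X = rng.normal(0, 2, n)
    Z = X + rng.normal(0, 0.5, n)
    y_star = 1 + 0.6 * X + b * np.sin(X) + rng.normal(0, 1, n)
    Y = np.maximum(y_star, 0.0)
    Xv = rng.normal(0, 2, N)
    Zv = Xv + rng.normal(0, 0.5, N)
    return run_test(Y, Z, Xv, Zv)                # True if H0 rejected at 5%

def empirical_power(n, b, run_test, reps=1000):
    return np.mean([one_replication(n, b, run_test) for _ in range(reps)])
```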
Simulation 2: p = 2. We conducted a brief simulation study for the case of bivariate predicting variables. Here, both the primary sample {(Yi, Zi), i = 1, ..., n} and the validation sample {(Xk, Zk), k = 1, ..., N} are generated from the model
\[
Y^* = \alpha + \beta_1 X_1 + \beta_2 X_2 + b\,(X_1^2 + X_2^2) + \varepsilon, \qquad Y = Y^* I(Y^* > 0), \qquad Z = X + u,
\]
where α = β1 = β2 = 1, ε ∼ N1(0, 0.5²), X = (X1, X2)^T ∼ N2(0, Σx), Σx = (σij)2×2 with σ11 = σ22 = 1, σ12 = 0.5, and u ∼ N2(0, Σu), Σu = 0.5² I2. Here I2 is the 2 × 2 identity matrix, and b = 0 corresponds to the null model. The Gaussian distribution assumption and the known covariance ratio Σx^{−1}Σu suggest the use of the VTME test. The compact set C is now a rectangle with sides chosen in a similar way as in the case p = 1. We used the kernel function K(u, v) = K(u)K(v) of order k = 2, where K is the same as in the case p = 1, and the bandwidths were selected by the above MSE criteria. In Table 2.4, the empirical level is seen to be slightly liberal for the larger sample sizes, and the empirical power increases as b, n and N increase.

Table 2.4: Empirical level and power of the VTME test for p = 2 at nominal level 0.05

  b    (100,200)  (200,400)  (300,600)  (400,800)  (500,1000)
  0    0.032      0.047      0.051      0.055      0.068
  0.1  0.051      0.060      0.096      0.154      0.164
  0.3  0.156      0.269      0.502      0.683      0.782
  0.5  0.385      0.758      0.896      0.975      0.989

2.5.2 Real data application

The enzyme reaction speed data was originally collected in 1974 to study the relationship between the initial rate of an enzyme reaction and the concentration of UDP-galactose. The data set has been analyzed by both Stute, Xue and Zhu (2007) and Du et al. (2011). The primary sample contains n = 30 observations of (Yi, Zi), where Yi is the initial reaction speed and Zi denotes the basal density of UDP-galactose for the ith individual, 1 ≤ i ≤ 30. The basal density can be measured in the two ways described in Du et al. (2011): by a simple chemical treatment and by an expensive precision machine. The former treatment produces the surrogate observation Z, while the latter provides the accurate observation X. A validation data set consisting of N = 10 pairs of basal densities was obtained. We manually truncated the responses at 125, with truncation rate 27%, to apply the proposed test. It is commonly believed that the Michaelis-Menten model, given by m(x, θ) = θ1 x/(θ2 + x), is an appropriate model for this data set.

In the estimation step, we adopted the ad hoc bandwidth ws = σ̂_Z N^{−1/3}, as recommended in Sepanski and Carroll (1993), and obtained both θ̂OLS and θ̂WLS. Besides these two estimators, the empirical likelihood estimator θ̂EL obtained in Stute, Xue and Zhu (2007) is also presented in Table 2.5. In the testing step, we applied the bandwidth selection method introduced above. The parameter estimators, optimal bandwidths and p-values of the V tests are presented in Table 2.5, and the curves estimated by both least squares methods and the empirical likelihood method of Stute, Xue and Zhu (2007) are displayed in Figure 2.1. None of the V tests using the three estimators is significant, which validates the current understanding that the above Michaelis-Menten model is proper for the data set.
[Figure 2.1: Estimated regression functions using the three estimation methods (reaction speed versus basal density; fitted curves for θ̂OLS, θ̂WLS and θ̂EL).]

Table 2.5: Estimation and testing results of the enzyme reaction dataset

        (θ̂1, θ̂2)         wb    hopt  p-value
VOLS    (217.37, 0.071)   0.12  0.49  0.464
VWLS    (218.41, 0.072)   0.12  0.45  0.592
VEL     (212.70, 0.065)   0.12  0.29  0.462

2.6 Proofs

Recall the notation in (2.10). Throughout this section, f stands for the density fZ of Z. We begin by listing some important facts about the first and second moments of the three parts of the residuals, where θ̄ denotes a vector such that ||θ̄ − θ0|| ≤ ||θ̂n − θ0||:
\[
E(\eta_i|Z_i) = 0, \qquad E(\eta_i^2|Z_i) = \sigma^2(Z_i), \quad \text{for all } 1 \le i \le n. \tag{2.24}
\]
\[
E(e_i|Z_i) = O(w^k), \qquad E(e_i^2|Z_i) = O\big(1/(N w^p) + w^{2k}\big), \quad \text{uniformly in } i \text{ for } Z_i \in C. \tag{2.25}
\]
\[
\delta_i = \hat g(Z_i, \theta_0) - \hat g(Z_i, \hat\theta_n) = \dot{\hat g}(Z_i, \theta_0)(\theta_0 - \hat\theta_n) + \frac{1}{2}(\theta_0 - \hat\theta_n)^T \ddot{\hat g}(Z_i, \bar\theta)(\theta_0 - \hat\theta_n). \tag{2.26}
\]
Fact (2.24) follows from the model (2.4). Fact (2.25) follows from Theorem 2.2.1 of Bierens (1987) pertaining to Nadaraya-Watson regression estimators, while the claim (2.26) follows from a Taylor expansion of ĝ at θ0. Intuitively, both e and δ are asymptotically negligible compared to η; however, {ei, i = 1, ..., n} are not independent, since they are all based on the validation data set (Xk, Zk), k = 1, ..., N. Hence we need to study the terms that involve {ei, i = 1, ..., n}. In the sequel, Di = (Zi, ηi), 1 ≤ i ≤ n; Dk = (Zk, Xk), 1 ≤ k ≤ N. We have

Lemma 2.6.1. Under H0 and (C1)–(C10),
\[
nh^{p/2} V_{n1} \to_d N_1(0, 2\tau_1), \tag{2.27}
\]
where τ1 is defined at (2.11).

Proof. Recall (2.13) and rewrite
\[
V_{n1} = \frac{1}{n(n-1)} \sum_{i \ne j} I_C(Z_i) I_C(Z_j)\, K_h(Z_i - Z_j)\, \eta_i \eta_j.
\]
Define Hn(Di, Dj) = IC(Zi)IC(Zj)Kh(Zi − Zj)ηiηj. It can be seen that Vn1 is a degenerate U statistic, since E[Hn|Di] = 0. The result follows by a slight modification of the proof in Zheng (1996).

The proofs of Lemmas 2.6.2 and 2.6.3 use Lemmas 2.6.4 and 2.6.5, which are given at the end of the section.

Lemma 2.6.2. Assume that (C1)–(C10) and H0 hold and that 0 < λ < ∞. Then, with τ2 as in (2.11),
\[
N h^{p/2} V_{n2} \to_d N_1(0, 2\tau_2). \tag{2.28}
\]

Proof. Let Iij = IC(Zi)IC(Zj). In this proof the indices i, j vary from 1 to n. Let
\[
V'_{n2} = \frac{1}{n(n-1)N^2} \sum_{i \ne j} \sum_{k=1}^{N} \sum_{l=1}^{N} \frac{I_{ij} K_{h,ij}}{f(Z_i) f(Z_j)}\, K_{w,ik} K_{w,jl}\, [q(X_k) - g(Z_i)][q(X_l) - g(Z_j)].
\]
Direct calculations show that, under (C1) and (C7)–(C9),
\[
\sup_{z \in C} \Big|\frac{\tilde f(z)}{f(z)} - 1\Big| = o_p(1). \tag{2.29}
\]
Now rewrite
\[
V_{n2} = \frac{1}{n(n-1)} \sum_{i \ne j} I_{ij} K_{h,ij}\, e_i e_j = \frac{1}{n(n-1)} \sum_{i \ne j} I_{ij} K_{h,ij} \Big(g(Z_i) - \frac{W_N(Z_i)}{\tilde f(Z_i)}\Big)\Big(g(Z_j) - \frac{W_N(Z_j)}{\tilde f(Z_j)}\Big)
\]
\[
= \frac{1}{n(n-1)N^2} \sum_{i \ne j} \frac{I_{ij} K_{h,ij}}{\tilde f(Z_i)\tilde f(Z_j)} \sum_{k=1}^{N} \sum_{l=1}^{N} K_{w,ik} K_{w,jl} [q(X_k) - g(Z_i)][q(X_l) - g(Z_j)] = V'_{n2} + o_p(V'_{n2}),
\]
where the last step follows from (2.29). We shall now analyze V'n2. Rewrite
\[
V'_{n2} = \frac{1}{n(n-1)N^2} \sum_{i \ne j} \sum_{k=1}^{N} \frac{I_{ij} K_{h,ij}}{f(Z_i) f(Z_j)} K_{w,ik} K_{w,jk} [q(X_k) - g(Z_i)][q(X_k) - g(Z_j)]
\]
\[
\quad + \frac{1}{n(n-1)N^2} \sum_{i \ne j} \sum_{k \ne l} \frac{I_{ij} K_{h,ij}}{f(Z_i) f(Z_j)} K_{w,ik} K_{w,jl} [q(X_k) - g(Z_i)][q(X_l) - g(Z_j)] = V_{n21} + V_{n22}.
\]
For Vn21, define the symmetric function
\[
\psi_1(D_i, D_j, D_k) = \frac{1}{N}\, \frac{I_{ij} K_{h,ij}}{f(Z_i) f(Z_j)}\, K_{w,ik} K_{w,jk}\, [q(X_k) - g(Z_i)][q(X_k) - g(Z_j)]
\]
and L(Zi, Zj, Zk) = E[(q(Xk) − g(Zi))²(q(Xk) − g(Zj))² | Zi, Zj, Zk]. Note that this kernel depends on both n and N, but this dependence is not exhibited for the sake of brevity. In order to apply Lemma 2.6.4, we need to calculate the variances of all projections of ψ1.
Rigorous calculation shows that
\[
\mathrm{Var}(\psi_1) \le E\psi_1^2 = \frac{1}{N^2}\, E\Big[\frac{I_{ij} K_{h,ij}^2}{f^2(Z_i) f^2(Z_j)}\, K_{w,ik}^2 K_{w,jk}^2\, [q(X_k) - g(Z_i)]^2 [q(X_k) - g(Z_j)]^2\Big] \tag{2.30}
\]
\[
= \frac{1}{N^2}\, E\Big[\frac{I_{ij} K_{h,ij}^2}{f^2(Z_i) f^2(Z_j)}\, E\{K_{w,ik}^2 K_{w,jk}^2 L(Z_i, Z_j, Z_k) \mid Z_i, Z_j\}\Big]
= \frac{1}{N^2 w^{3p}}\, E\Big[\frac{K_{h,ij}^2}{f^2(Z_i) f^2(Z_j)} \int K^2(u)\, K^2\Big(\frac{Z_j - Z_i}{w} + u\Big) L(Z_i, Z_j, Z_i - wu)\, f(Z_i - wu)\, du\Big]
\]
\[
= O\Big(\frac{1}{N^2 w^{3p} h^{2p}}\Big)\, E_C\Big[K^2\Big(\frac{Z_j - Z_i}{h}\Big)\, L(Z_i, Z_j, Z_i)\, f(Z_i)\Big] = O\Big(\frac{1}{N^2 w^{3p} h^{p}}\Big),
\]
where the successive equalities follow by conditioning on (Zi, Zj) and changing variables. In the above derivation we used assumption (C1), which guarantees that the density f is bounded from below on C. Next, consider
\[
E(\psi_1|D_i, D_j) = \frac{1}{N}\, \frac{I_{ij} K_{h,ij}}{f(Z_i) f(Z_j)}\, E\{K_{w,ik} K_{w,jk}\, [q(X_k) - g(Z_i)][q(X_k) - g(Z_j)] \mid D_i, D_j\}
\]
\[
= \frac{1}{N}\, \frac{I_{ij} K_{h,ij}}{f(Z_i) f(Z_j)}\, E\{K_{w,ik} K_{w,jk}\, [\mu_2(Z_k) - (g(Z_i) + g(Z_j)) g(Z_k) + g(Z_i) g(Z_j)] \mid D_i, D_j\}
\]
\[
= \frac{1}{N}\, \frac{I_{ij} K_{h,ij}}{f(Z_i) f(Z_j)}\, \frac{1}{w^{2p}} \int K\Big(\frac{Z_i - x}{w}\Big) K\Big(\frac{Z_j - x}{w}\Big) \big(\mu_2(x) - (g(Z_i) + g(Z_j)) g(x) + g(Z_i) g(Z_j)\big) f(x)\, dx
\]
\[
= O_p\Big(\frac{1}{N w^{p}}\Big)\, \frac{I_{ij} K_{h,ij}\, \sigma^2(Z_i)}{f(Z_i) f(Z_j)} \int K(u)\, K\Big(\frac{Z_j - Z_i}{w} + u\Big)\, du.
\]
Furthermore, uniformly in i, j,
\[
\mathrm{Var}\big(E(\psi_1|D_i, D_j)\big) \le E\big[E(\psi_1|D_i, D_j)\big]^2 = O\Big(\frac{1}{N^2 w^{2p} h^p}\Big).
\]
Similar arguments imply that, uniformly in 1 ≤ i ≤ n and 1 ≤ k ≤ N,
\[
\mathrm{Var}(E(\psi_1|D_i)) = O\Big(\frac{1}{N^2 w^{2p}}\Big), \qquad \mathrm{Var}(E(\psi_1|D_k)) = O\Big(\frac{1}{N^2 w^{2p}}\Big), \qquad \mathrm{Var}(E(\psi_1|D_i, D_k)) = O\Big(\frac{1}{N^2 w^{p} h^{2p}}\Big).
\]
Assumptions (C9) and (C10) and 0 < λ < ∞, together with Lemma 2.6.4, yield
\[
\mathrm{Var}(N h^{p/2} V_{n21}) = O\Big(\frac{2}{(N w^p)^3} + \frac{2}{(N w^p)^2} + \frac{4}{(N w^p)(N h^p)} + \frac{4 n h^p}{(N w^p)^2} + \frac{n h^p}{N^2 w^{2p}}\Big) = o(1).
\]
Moreover, direct calculations show that E[Vn21] = Eψ1 = o(1/N). Hence
\[
E[N h^{p/2} V_{n21}]^2 = \mathrm{Var}(N h^{p/2} V_{n21}) + E^2[N h^{p/2} V_{n21}] = o(1), \tag{2.31}
\]
i.e., Vn21 = op(1/(N h^{p/2})).

In order to analyze the variance of Vn22, for i ≠ j, 1 ≤ i, j ≤ n, and k ≠ l, 1 ≤ k, l ≤ N, we define the following symmetric function:
\[
\psi_2(D_i, D_j, D_k, D_l) = \frac{I_{ij} K_{h,ij}}{2 f(Z_i) f(Z_j)} \big\{K_{w,ik} K_{w,jl}\, [q(X_k) - g(Z_i)][q(X_l) - g(Z_j)] + K_{w,il} K_{w,jk}\, [q(X_l) - g(Z_i)][q(X_k) - g(Z_j)]\big\},
\]
i.e., ψ2 is symmetric within the blocks (Di, Dj) and (Dk, Dl). Then we have
\[
V_{n22} = \frac{1}{n(n-1)N^2} \sum_{i \ne j}^{n} \sum_{k \ne l}^{N} \psi_2(D_i, D_j, D_k, D_l).
\]
In order to apply Lemma 2.6.5, we need to calculate the variances of all projections of ψ2. Computations similar to those used for analyzing Var(Vn21) yield the following facts:
\[
\mathrm{Var}(\psi_2) = O\Big(\frac{1}{h^p w^{2p}}\Big), \qquad \mathrm{Var}(E(\psi_2|D_k)) = O(w^{2k}), \qquad \mathrm{Var}(E(\psi_2|D_i)) = O(w^{4k}), \tag{2.32}
\]
\[
\mathrm{Var}(E(\psi_2|D_i, D_j)) = O\Big(\frac{w^{4k}}{h^p}\Big), \qquad \mathrm{Var}(E(\psi_2|D_k, D_l)) = O\Big(\frac{1}{h^p}\Big), \qquad \mathrm{Var}(E(\psi_2|D_i, D_k)) = O\Big(\frac{w^{2k}}{w^p}\Big),
\]
\[
\mathrm{Var}(E(\psi_2|D_i, D_j, D_k)) = O\Big(\frac{w^{2k}}{h^p w^p}\Big), \qquad \mathrm{Var}(E(\psi_2|D_i, D_k, D_l)) = O\Big(\frac{1}{h^{2p}}\Big).
\]
Given the above projection variances, (C8)–(C10), 0 < λ < ∞ and Lemma 2.6.5 imply that
\[
\mathrm{Var}(V_{n22}) = O\Big(\frac{4 w^{4k}}{n} + \frac{4 w^{2k}}{N} + \frac{2 w^{4k}}{n^2 h^p} + \frac{2}{N^2 h^p} + \frac{16 w^{2k}}{n N w^p} + \frac{8}{n N^2 h^{2p}} + \frac{8 w^{2k}}{n^2 N h^p w^p} + \frac{4}{n^2 N^2 h^p w^{2p}}\Big)
\]
\[
= \frac{2}{N^2}\, \mathrm{Var}\big(E(\psi_2|D_k, D_l)\big) + o\Big(\frac{1}{N^2 h^p}\Big) = O\Big(\frac{1}{N^2 h^p}\Big). \tag{2.33}
\]
In fact, the variance term of E(ψ2|Dk, Dl) dominates the variance of Vn22. The facts (2.31) and (2.33) in turn imply that
\[
V_{n2} = V_{n22} + o_p(1/(N h^{p/2})). \tag{2.34}
\]
To further investigate Vn22, we study the projection of ψ2 on the validation data set, E(ψ2|Dk, Dl). From the form of ψ2, we only need to study this projection when Zk ∈ C and Zl ∈ C. Indeed, for fixed Zk ∉ C, there is a small enough r such that Nr(Zk) ∩ C = ∅, where Nr(Zk) is the neighborhood of Zk of radius Mr and M bounds the support of the density K in each coordinate; this leads to Kw,ik = 0 and hence ψ2 = 0.
For kernel function with density support on Rp such as normal density, one can argue with large enough MK such that the kernel density is arbitrarily small outside of [−MK , MK ]p . Details are skipped. Hence asymptotically, change of variables and Taylor expansion yield E(ψ2 |Dk , Dl ) = IC (Zk )IC (Zl ) [q(Xk ) − g(Zk )][q(Xl ) − g(Zl )] +wk C1 K(u)K(v)Kh,kl (u, v)dudv K(u)K(v)Kh,kl [q(Xk ) − g(Zk )]g (k) (Zl )v k +[q(Xl ) − g(Zl )]g (k) (Zk )uk dudv + Op (w2k ) := ψ2 (Dk , Dl ) + R2 (Dk , Dl ) + Op (w2k ), where Kh,kl (u, v) = 1/2{K((Zk − Zl )/h+w(u−v)/h)/hp +K((Zl − Zk )/h+w(u−v)/h)/hp }, C1 = 1/k!. Notice that ψ2 (Dk , Dl ) is the leading term, the other two terms are negligible as w → 0. Hence (2.33) can be rewritten as Var(Vn22 ) = 1 2 V ar[ ψ ( D , D )] + o 2 k l N2 N 2 hp 37 =O 1 . N 2 hp (2.35) Next, we will show that Vn22 is asymptotically equivalent to Vn22 defined below: Vn22 = 1 N2 {ψ2 (Dk , Dl ) + R2 (Dk , Dl )} + Op (w2k ) (2.36) k=l := Tn2 + Tn2 + Op (w2k ). It can be seen that Vn22 is the projection of Vn22 on the validation data. Hence E{(Vn22 − Vn22 )Vn22 } = 0, and Var(Vn22 − Vn22 ) = Var(Vn22 ) − Var(Vn22 ). (2.37) Now we will prove that Tn2 dominates Vn22 by showing the asymptotic properties of each term. The last term in (2.36) is negligible with asymptotic rate N hp/2 since N hp/2 w2k → 0 by assumption (C10). First, note that Tn2 is a degenerate U statistic. After verifying the conditions in Theorem 1 of Hall (1984), we apply the theorem and obtain that N Tn2 {2E ψ22 (D1 , D2 )}1/2 →d N1 (0, 1). (2.38) Since E(ψ2 (D1 , D2 )) = 0, (2.32) further implies that E ψ22 (D1 , D2 ) = Var[ψ2 (D1 , D2 )] = O(1/hp ). Moreover, (2.38) implies that V ar(Tn2 ) = 1 2 Var(ψ2 (D1 , D2 )) + o 2 2 N N hp =O 1 N 2 hp . (2.39) Second, note that Tn2 is a non-degenerate U statistic with mean 0. By applying the 38 central limit theorem for non-degenerate U statistics presented in Serfling (1981), we obtain √ N Tn2 {4Var(E(R2 |D1 ))}1/2 →d N1 (0, 1). Straightforward calculation indicates that Var(E(R2 |D1 )) = O(w2k ). Then, as n ∧ N → 0, Var(N hp/2 Tn2 ) = O(N 2 hp w2k /N ) = O(N hp w2k ) = o(1), under the condition that nhp/2 w2k → 0 and 0 < λ < ∞. Therefore Tn2 = op (1/(N hp/2 )). This fact combined with (2.39) and (2.36) yield Var(Vn22 ) = 2 1 Var[ ψ ( D , D )] + o . 2 1 2 N2 N 2 hp (2.40) The results (2.35), (2.37) and (2.40) together imply that Var(Vn22 − Vn22 ) = o 1 . N 2 hp Hence Vn2 = Vn22 + op (1/(N hp/2 )) = Vn22 + op (1/(N hp/2 )) = Tn2 + op (1/(N hp/2 )). 39 (2.41) Given the asymptotic results in (2.38), we have hp E ψ2 (D1 , D2 ) → K(u)K(v)[K(s + c(u − v)) + K(−s + c(u − v))]/2dudv 2 ds IC (x)(γ 2 (x))2 f 2 (x)dx = τ2 . × By connecting the above limiting variance with (2.38) and (2.41), eventually we obtain that N hp/2 Tn2 →1 N1 (0, 2τ2 ), hence N hp/2 Vn2 →d N1 (0, 2τ2 ). (2.42) This in turn completes the proof of Lemma 2.6.2. For the next lemma recall the decomposition (2.13). Lemma 2.6.3. Under assumptions (C1)–(C10) and 0 < λ < ∞, the following holds when H0 is true. Vn3 = op (1/(nhp/2 )); Unj = op (1/(nhp/2 )), j = 1, 2, 3. Proof. The proof of the claim about Vn3 is similar to that of Lemma 6.2. 
Rewrite Vn3 = = = 1 n(n − 1) 1 n(n − 1) Iij Kh,ij δi δj i=j i=j 1 n(n − 1)N 2 Iij Kh,ij ˆ ˆ [WN (Zi , θ0 ) − WN (Zi , θ)][W N (Zj , θ0 ) − WN (Zj , θ)] ˜ ˜ f (Zi )f (Zj ) i=j k,l Iij Kh,ij ˆ − q(Xk , θ0 )] Kw,ik Kw,jl [q(Xk , θ) ˜ ˜ f (Zi )f (Zj ) ˆ − q(Xl , θ0 )] ×[q(Xl , θ) = Vn3 + op (Vn3 ), 40 where Vn3 = 1 n(n − 1) i=j k,l IC (Zi )IC (Zj )Kh,ij ˆ − q(Xk , θ0 )][q(Xl , θ) ˆ − q(Xl , θ0 )]. [q(Xk , θ) f (Zi )f (Zj ) Furthermore, Vn3 is decomposed as the sum of the following two terms. 1 Vn3 = n(n − 1)N 2 + N i=j k=1 1 n(n − 1)N 2 Iij Kh,ij K K [q(Xk , θˆn ) − q(Xk , θ0 )]2 f (Zi )f (Zj ) w,ik w,jk i=j k=l Iij Kh,ij K K f (Zi )f (Zj ) w,ik w,jl ×[q(Xk , θˆn ) − q(Xk , θ0 )][q(Xl , θˆn ) − q(Xl , θ0 )] = Vn31 + Vn32 , say. Similar to the analysis of Vn21 , define the symmetric function φ1 (Di , Dj , Dk ) = Because √ Iij Kh,ij K K [q(Xk , θˆn ) − q(Xk , θ0 )]2 . f (Zi )f (Zj ) w,ik w,jk n(θˆn − θ0 ) = Op (1) and by the Taylor expansion, we have √ q(Xk , θˆn ) − q(Xk , θ0 ) = Op (1/ n). Then we can easily check that Vn31 = op (1). Following the routine argument showed in the proof of Lemma 3.3d in Zheng (1996), we obtain that Vn32 = op (1) under H0 and (C2), (C3), (C5), (C7)–(C9). 41 Similarly, Un1 can be written as Un1 = = 1 n(n − 1)N 2 1 n(n − 1)N 2 Iij Kh,ij Kw,ik Kw,jl ηi (q(Xl ) − g(Zj )) i=j,k,l Iij Kh,ij Kw,ik Kw,jk ηi (q(Xk ) − g(Zj )) i=j,k 1 + n(n − 1)N 2 = Un11 + Un12 , Iij Kh,ij Kw,ik Kw,jl ηi (q(Xl ) − g(Zj )) i=j,k=l say. Analogous to the analysis of Vn1 and Vn2 , similar results can be derived for Un1 as follows: Un11 = op (1/(nhp/2 )), and Un12 can be formulated as a non-degenerate U statistic with the kernel function φ2 (Di , Dj , Dk , Dl ) = Iij Kh,ij Kw,ik Kw,jl [ηi (q(Xl ) − g(Zj )) + ηj (q(Xk ) − g(Zi ))]/4 +Iij Kh,ij Kw,il Kw,jk [ηi (q(Xk ) − g(Zj )) + ηj (q(Xl ) − g(Zi ))]/4. By the central limit theorem of non-degenerate U statistics, we can see that √ nUn12 = Op (wk ). Thus nhp/2 Un1 = Op √ √ nhp · nUn12 = Op ( nhp w2k ) = op (1) under the assumption (C10). The proofs of the claims pertaining to Un2 and Un3 are similar. Details are omitted for the sake of brevity. 42 Proof of Theorem 2.3.4. Similar to the proof of Vn under H0 , we can show that under Ha Vn = (2.43) 1 n(n − 1) 1 = n(n − 1) 1 + 2 N IC (Zi )IC (Zj )Kh,ij ηˆi ηˆj i=j IC (Zi )IC (Zj )Kh,ij η¯i η¯j i=j IC (Zk )IC (Zl ) [q(Xk ) − g(Zk )][q(Xl ) − g(Zl )] K(u)K(v)Kh,kl (u, v)dudv k=l +op (1/(nhp/2 )) = T¯n1 + T¯n2 + op (1/(nhp/2 )), say. One can verify that T¯n1 and T¯n2 are the leading terms of Vn1 and Vn2 , respectively, as derived in Lemma 2.6.2. Rewrite η¯i = ηi + bn E(a(Xi )|Zi ), where E(ηi |Zi ) = 0. T¯n1 = 1 n(n − 1) = 1 n(n − 1) = 1 n(n − 1) +b2n := Iij Kh,ij η¯i η¯j i=j Iij Kh,ij [ηi + bn E(a(Xi )|Zi )][ηj + bn E(a(Xj )|Zj )] i=j Iij Kh,ij ηi ηj + bn i=j 1 n(n − 1) 2 n(n − 1) Iij Kh,ij ηi E[a(Xj )|Zj ] i=j Iij Kh,ij E[a(Xi )|Zi ]E[a(Xj )|Zj ] i=j W1 + bn W2 + b2n W3 . W1 is a degenerate two sample U statistic, hence nhp/2 W1 →d N1 (0, 2τ1 ). 43 After symmetrization, W2 can be written as a non-degenerate U statistic, hence √ nW2 = Op (1), furthermore, √ nhp/2 bn W2 = hp/4 ( nW2 ) →p 0. A similar argument as (2.15) indicates that W3 →p E{IC (Z)E[a(X)|Z]2 f (Z)}. Hence nhp/2 T¯n1 →d N1 (E{IC (Z)E[a(X)|Z]2 f (Z)}, 2τ1 ). (2.44) As for T¯n2 , the result of Tn2 in (2.42) still holds since T¯n2 only involves the validation data and it is irrelevant to the hypothesis of the regression model, i.e., nhp/2 T¯n2 →d N1 (0, 2τ2 ). 
(2.45) Note that T¯n1 and T¯n2 are independent since they are constructed based on independent samples. Combining (2.43), (2.44) and (2.45), we obtain that nhp/2 Vn →d N1 (E{IC (Z)E[a(X)|Z]2 f (Z)}, 2τ1 + (2τ2 )/λ2 ). This completes the proof of Theorem 2.3.4. Lemma 2.6.4. Let {Di , i = 1, · · · , n} be a set of i.i.d. r.v.’s and {Dk , k = 1, · · · , N } be 44 another set of i.i.d. r.v.’s, which is independent of {Di }. Define the two sample U statistic 1 T = n(n − 1)N n N ϕn (Di , Dj , Dk ), i=j=1 k=1 where ϕn is a symmetric function with regard to permutation of (Di , Dj ) and square integrable for each n. Then 1 4 2 Var(E(ϕ |D )) + Var(E(ϕn |D1 )) Var(ϕ ) + n n 1 n N n2 N 2 4 + 2 Var(E(ϕn |D1 , D2 )) + Var(E(ϕn |D1 , D1 )) . nN n Var(T ) = O (2.46) Proof. Algebra shows that Var(N n(n − 1)T ) [ϕn (Di , Dj , Dk ) − Eϕn (Di , Dj , Dk )] = E 2 i=j,k E [ϕn (Di , Dj , Dk ) − Eϕn (Di , Dj , Dk )][ϕn (Ds , Dt , Dl ) − Eϕn (Ds , Dt , Dl )] = i=j,k s=t,l = + {s,t}={i,j},l=k +4 {s,t}={i,j},k=l s=i,t=j,k=l s=i,t=j,k=l s=i,t=j,k=l E{[ϕn (Di , Dj , Dk ) − Eϕn ][ϕn (Ds , Dt , Dl ) − Eϕn ]} + + +4 s=i,t=j,k=l = 2n(n − 1)N Var(ϕn ) + 2n(n − 1)N (N − 1)Var(E(ϕn |D1 , D2 )) +4n(n − 1)(n − 2)N Var(E(ϕn |D1 , D1 )) + 4n(n − 1)(n − 2)N (N − 1)Var(E(ϕn |D1 )) +n(n − 1)(n − 2)(n − 3)N Var(E(ϕn |D1 )). The claim (2.46) follows from this identity upon dividing both sides by N n(n − 1) using the fact that (n − k)/n → 1, and (N − k)/N → 1, for k = 1, 2, 3. 45 2 and Furthermore, define 1 S= n(n − 1) n i=j=1 1 N (N − 1) N ψn (Di , Dj , Dk , Dl ), k=l=1 where ψn is square integrable and symmetric with regard to permutation of (Di , Dj ) as well as (Dk , Dl ), i.e., ψn (Di , Dj , ·, ·) = ψn (Dj , Di , ·, ·) and ψn (·, ·, Dk , Dl ) = ψn (·, ·, Dl , Dk ), for each n. An argument similar to the one used in Lemma 2.6.4 yields the following lemma. Lemma 2.6.5. Suppose {Di , 1 ≤ i ≤ n} and {Dk , 1 ≤ k ≤ N } are the two independent random samples and S is the two sample statistic defined above. Then 4 4 4 Var(ψ ) + Var(E(ψ |D )) + Var(E(ψn |Dk )) n n i n N n2 N 2 2 2 16 + 2 Var(E(ψn |Di , Dj )) + 2 Var(E(ψn |Dk , Dl )) + Var(E(ψn |Di , Dk )) nN n N 8 8 Var(E(ψn |Di , Dk , Dl )) + 2 Var(E(ψn |Di , Dj , Dk )) . + 2 nN n N Var(S) = O 46 Chapter 3 Minimum distance model checking in Berkson models 3.1 Introduction In statistical data analysis, the data is often collected subject to measurement error. One typical way to treat the measurement error is the errors-in-variables model which assumes that the real observation Z is a surrogate of the true unobserved variable X, i.e., Z = X + η, where η is the measurement error variable. Regression models with measurement error in covariates has received broad attention in the literature over the last century. In the last three decades it has been the focus of numerous researchers, as is evidenced in the three monographs by Fuller (1987), Cheng and Van Ness (1999) and Carroll, Ruppert, Stefanski and Crainiceanu (2006), and the references therein. However, as Berskon (1950) argued that in many situations it is more appropriate to treat the true unobserved variable X as the observed variable Z plus an error, i.e., X = Z + η. For instance, in economics, the household income is usually not precisely collected due to the survey design or data sensitivity. 
It was described in Kim, Chao and Härdle (2016) that when income data were collected by asking individuals which salary range category they belong to, such as between 5,000 USD and 9,999 USD, the midpoint of the range interval was used in the analysis. In this case, it is sensible to assume that the true income fluctuates around the midpoint observation subject to errors. Another example is an individual's exposure to some contaminant in epidemiological studies, for instance, atmospheric particulate matter with a diameter of less than 2.5 micrometers (PM2.5). Usually the concentration of PM2.5 in an area is reported hourly or daily as an average measurement, whereas the true exposure of an individual depends on the specific location and the time of day. This type of data also favors the Berkson error model. More examples can be found in Du et al. (2011) and Carroll et al. (2006).

Proceeding a bit more precisely, in the Berkson measurement error regression model of interest here one has the triple X, Y, Z obeying the relations
\[
Y = \mu(X) + \varepsilon, \qquad X = Z + \eta. \tag{3.1}
\]
Here Y is a scalar response variable and ε is an error variable with Eε = 0. The random vectors X, Z, η are p-dimensional, with X being the true unobservable covariate vector, Z representing an observation on X, and η denoting the measurement error having Eη = 0. For identifiability reasons, the three r.v.'s ε, Z, η are assumed to be mutually independent. Thus µ(x) = E(Y | X = x), for all x ∈ R^p.

Let Θ ⊂ R^q be a compact set, {mθ(x); θ ∈ Θ, x ∈ R^p} be a family of given functions, and C be a compact subset of R^p. The problem of interest here is to test

H0 : µ(x) = mθ0(x), for some θ0 ∈ Θ and all x ∈ C, versus H1 : H0 is not true,

based on the primary sample {(Zi, Yi), i = 1, ..., n} and an independent validation sample {(Zk, Xk), k = 1, ..., N}, all satisfying (3.1). Empirical versions of η are then naturally obtained as ηk := Xk − Zk, 1 ≤ k ≤ N.

The literature contains several references that address the estimation of the underlying parameters in the model (3.1). In the case µ(x) is linear, Berkson (1950) showed that the ordinary least squares estimators continue to be unbiased and consistent for the underlying parameters. For polynomial regression, Huwang and Huang (2000) used the method of moments, based on the first two conditional moments of Y given Z when Z, ε, η are Gaussian, to produce consistent estimators of the underlying parameters. Relaxing the normality assumption on the error distribution to a parametric density family, Wang (2004) developed a minimum distance approach based on the first two conditional moments of the response variable to consistently estimate more general parametric regression functions. In the case the measurement error density fη is known, Delaigle, Hall and Qiu (2006) constructed nonparametric estimators of µ(x) by means of trigonometric series and deconvolution techniques. In the case fη is unknown but validation data is available, Du et al. (2011) used integrated local linear smoothing and a Fourier transformation to formulate a nonparametric estimator of µ(x). Schennach (2013) obtained a sieve-based nonparametric regression estimator with the help of an instrumental variable, without assuming fη to be known.

By comparison, the literature on the above testing problem is scant. Koul and Song (2009) are the first authors to address this problem.
Assuming fη is known, they proposed parameter estimation by minimizing an integrated square distance between a nonparametric regression function estimator and the model being fitted and then utilized the minimized distance to implement the hypothesis test. In the current chapter, we extend this methodology to the case when fη is unknown, but when validation data is available. A surprising finding is that the asymptotic distributions of the minimum distance (m.d.) 49 test statistics in the case of unknown fη is the same as in the case of known fη . The asymptotic distributions of the corresponding m.d. estimators of the null model parameters are affected by not knowing fη in general. Exceptions are provided by the linear models when the set C and the integrating measure used in the definition of the above mentioned distances are symmetric around zero. This chapter is organized as follows. Section 3.2 describes the proposed m.d. estimators and test statistics and the needed assumptions for the derivation of their consistency and asymptotic normality. Section 3.3 establishes the consistency and asymptotic normality of the m.d. estimators while in Section 3.4 we state the main results of the proposed tests under the null and certain fixed alternative hypotheses and provide sketches of the proofs. It is worth mentioning that the variation in validation data contributes to the asymptotic distributions of the proposed m.d. estimators of the null model parameters but not to the asymptotic distributions of the m.d. test statistics. Section 3.5 reports findings of a Monte Carlo study that assesses some finite sample properties of an estimator and a test in the proposed classes of these inference procedures. Some of the proofs are relegated to the last Section 3.6 of the chapter. 3.2 A class of tests This section describes a class of the proposed tests and estimators of the null model parameters along with the needed assumptions. To overcome the difficulty created by not observing X we use the calibration idea as used in Koul and Song (2009). Accordingly, 50 assume E|µ(X)| < ∞, E|mθ (X)| < ∞, for all θ ∈ Θ and z ∈ Rp , define H(z) := E µ(X) Z = z = Hθ (z) := E[mθ (X)|Z = z] = µ(x)fη (x − z)dx = µ(y + z)fη (y)dy, mθ (x)fη (x − z)dx = mθ (y + z)fη (y)dy. Then the original model can be transformed to Y = H(Z) + ξ, E(ξ|Z) = 0, (3.2) and the hypothesis testing becomes H0 : H(z) = Hθ0 (z), for some θ0 ∈ Θ and all z ∈ C, vs. H1 : H0 is not true. To proceed further, let w ≡ wn = c(log n/n)1/(p+4) , c > 0, and h ≡ hn be two bandwidth sequences associated with sample sizes n and N , K be a density kernel and G be a nondecreasing right continuous real valued function on R and define z − Zi 1 1 , fˆw (z) = Khi (z) = p K h h n Mn (θ) = Mn (θ) = Wn (θ) = 1 nfˆw (z) 1 ˆ nfw (z) 1 nfˆw (z) n N Kwi (z), Hθ (z) = N −1 i=1 n k=1 2 Khi (z)[Yi − Hθ (Zi )] dG(z), i=1 n 2 Khi (z)[Yi − Hθ (Zi )] dG(z), i=1 n 2 θ˜n = argminθ Mn (θ), θˆn = argminθ Mn (θ), Khi (z)[Hθ (Zi ) − Hθ (Zi )] dG(z). i=1 51 mθ (z + ηk ),(3.3) Note that the density estimator fˆw is based on a bandwidth w that is different from the bandwidth h employed in the numerator of the regression function estimator. This plausible scheme was proposed in Koul and Ni (2004) (KN) in order to have an nhp/2 -consistent estimator of the asymptotic bias in Mn (θ˜n ). In the case fη is known then Hθ is a known parametric function and Koul and Song (2009) (KS) proposed the minimum distance testing procedure based on Mn (θ˜n ). 
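For concreteness, the following minimal sketch evaluates the known-fη criterion Mn(θ) of (3.3) for p = 1 and the linear null mθ(x) = θx, for which Hθ(z) = θz is known, and minimizes it by a grid search. The data-generating law, the bandwidth constants, the integration grid and the grid-search minimizer are illustrative assumptions of ours, not the settings used later in the thesis.

```python
import numpy as np

# Sketch of M_n(theta) from (3.3), p = 1, linear null: H_theta(z) = theta * z.
rng = np.random.default_rng(0)
n = 200
Z = rng.uniform(-1, 1, n)
X = Z + rng.normal(0, 0.1, n)            # Berkson error: X = Z + eta
Y = 1.0 * X + rng.normal(0, 0.2, n)      # model (3.1) with theta_0 = 1

h = np.std(Z) * n ** (-1 / 3)            # bandwidth in the numerator
w = (np.log(n) / n) ** (1 / 5)           # bandwidth for f_hat_w, p = 1

def epan(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)

zgrid = np.linspace(-1, 1, 201)          # G = Lebesgue measure on C = [-1, 1]
dz = zgrid[1] - zgrid[0]
Kh = epan((zgrid[:, None] - Z[None, :]) / h) / h                  # K_hi(z)
fw = epan((zgrid[:, None] - Z[None, :]) / w).mean(axis=1) / w     # f_hat_w(z)

def M_n(theta):
    # [ (1/(n f_hat_w(z))) * sum_i K_hi(z) (Y_i - H_theta(Z_i)) ]^2, integrated dG
    num = (Kh * (Y - theta * Z)[None, :]).mean(axis=1)
    return np.sum((num / fw) ** 2 * dz)

thetas = np.linspace(0.5, 1.5, 101)
theta_tilde = thetas[np.argmin([M_n(t) for t in thetas])]
print(theta_tilde)                        # should land near theta_0 = 1
```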
However, this method is not feasible without the knowledge of fη , which renders the regression function Hθ to be unknown also. But, with the availability of validation sample {Xk , Zk }, where ηk := Xk − Zk , 1 ≤ k ≤ N is a random sample from fη , we are able to estimate Hθ (z) by Hθ (z) defined above. This then leads to the class, one for each G and K, of m.d. test statistics Mn (θˆn ). We shall now present the needed assumptions for establishing the consistency and asymptotic normality of θˆn and Mn (θˆn ). Many of these assumptions are the same as in KS. Define, for x, y ∈ Rp and θ ∈ Θ, σθ (x, y) := Cov mθ (x + η), mθ (y + η) , σθ2 (x) := σθ (x, x) = Var(mθ (x + η)). (A1) {(Yi , Zi ), Zi ∈ Rp , i = 1, ..., n} is an i.i.d. sample with regression function H(z) = E(Y |Z = z) satisfying H 2 dG < ∞, where G is a σ-finite measure with continuous Legesgue density g on C while {(Zk , Xk ), Zk ∈ Rp , Xk ∈ Rp , k = 1, ..., N } is an i.i.d. sample from Berkson measurement error model X = Z + η. (A2) 0 < σε2 := Var(ε) < ∞, τ 2 (z) = E[(mθ0 (X) − Hθ0 (Z))2 |Z = z] is a.e. (G) continuous on C. (A3) Both E|ε|2+δ and E|(mθ0 (X) − Hθ0 (Z)|2+δ are finite for some δ > 0. 52 (A4) Both E|ε|4 and E|(mθ0 (X) − Hθ0 (Z)|4 are finite. (A5) σθ2 (z)dG(z) < ∞, for all θ ∈ Θ. (F1) The density fZ is uniformly continuous and bounded away from 0 in C. (F2) The density fZ is twice continuously differentiable in C. (H1) mθ (x) is a.e. continuous in x, for every θ ∈ Θ. (H2) The parametric function family Hθ (z) is identifiable with respect to θ, i.e, Hθ1 (z) = Hθ2 (z) a.e. in z implies θ1 = θ2 . (H3) For some positive continuous function r on C, and for some 0 < β ≤ 1, |Hθ1 (z) − Hθ2 (z)| ≤ θ1 − θ2 β r(z), for all θ1 , θ2 ∈ Θ and z ∈ C. (H4) For each x, mθ (x) is differentiable with respect to θ in a neighborhood of θ0 with the derivative vector m ˙ θ (x) such that for every sequence 0 < δn → 0, sup N −1 N k=1 [mθ (Zi + ηk ) − mθ0 (Zi + ηk ) − (θ − θ0 )T m ˙ θ0 (Zi + ηk )] θ − θ0 i,θ = op (1), where the supremum is taken over 1 ≤ i ≤ n, θ − θ0 ≤ δn . (H5) The vector function m ˙ θ0 (x) is continuous in x ∈ C and for every > 0, there are n and N such that for every 0 < a < ∞, and for all n > n , N > N , P max 1≤i≤n,1≤k≤N,(nhp )1/2 θ−θ0 ≤a (H6) H˙ θ0 2 dG < ∞ and Σ0 = h−p/2 m ˙ θ (Zi + ηk ) − m ˙ θ0 (Zi + ηk ) ≥ H˙ θ0 H˙ θT dG is positive definite. 0 (K) The density kernel K is positive symmetric and square integrable on [−1, 1]p . (W1) nh2p → ∞ and N/n → λ, λ > 0. 53 ≤ . (W2) h ∼ n−a , where 0 < a < min(1/2p, 4/(p(p + 4))). We state some facts that will be often used in the proofs below. Note that (H4) implies that for every 0 < a < ∞, N −1 sup N k=1 mθ (Zi + ηk ) − mθ0 (Zi + ηk ) − (θ − θ0 )T m ˙ θ0 (Zi + ηk ) θ − θ0 i,θ where the supremum is taken over 1 ≤ i ≤ n, (nhp )1/2 θ − θ0 = op (1), (3.4) ≤ a. From Mack and Silverman (1982) we obtain that under (F1), (K1), (W1) and (W2), f 2 (z) − 1 = op (1). (3.5) sup |fˆh (z) − fZ (z)| = op (1), sup |fˆw (z) − fZ (z)| = op (1), sup Z 2 (z) z∈C z∈C z∈C fˆw Theorem 2.2 part (2) in Bosq (1998) yields that under assumptions (F2) and (K1), (logk n)−1 (n/ log n)2/(p+4) sup |fˆw (z) − fZ (z)| → 0, a.s., ∀ integer k > 0. (3.6) z∈C We also recall the following facts from KN and KS. Let dϕ = fZ−2 dG, dϕˆ = fˆw−2 dG. For any continuous function α(z) with α(z)dϕ(z) ˆ − α(z)dϕ(z) |α(z)|dϕ(z) < ∞, (3.5) implies fZ2 (z) −1 2 (z) z∈C fˆw |α(z)|dϕ(z) = op α(z)dϕ(z) + op |α(z)|dϕ(z) . ≤ sup |α(z)|dϕ(z) . 
Hence α(z)dϕ(z) ˆ = 54 (3.7) From (3.9) in KN, for any function α(z) as above, (F1), (K1) and (W1) imply 1 E n n Kh (z − Zi )α(Zi ) 2 dϕ(z) = α2 dG + o(1) = O(1). (3.8) i=1 In the sequel, we shall not exhibit the set C in the integrals. All the integrals with respect to the measure G are supposed to be over this set, unless specified otherwise. 3.3 Estimation of θ0 In this section, we establish the consistency and asymptotic normality of θˆn under H0 . To begin with, consider the following decomposition that shows a connection between Mn (θ) and Mn (θ), where Wn (θ) is as in (3.3). Mn (θ) = 1 nfˆw (z) n 2 Khi (z)[Yi − Hθ (Zi ) + Hθ (Zi ) − Hθ (Zi )] dG(z) (3.9) i=1 = Mn (θ) + Wn (θ) + 2Rn (θ), where Rn (θ) is the cross product term. We can see that the validation data is involved through the extra terms Wn and Rn . The following lemma about Wn is found to be useful in deriving various results in the sequel. Its proof is given in the last Section 3.6 of the chapter. Let K1 be as in (2.11) and let γ(θ) := σθ2 (x, y)dG(x)dG(y), AN (θ) = 1 N σθ2 (z)dG(z). Lemma 3.3.1. Suppose (A1), (A2), (A5), (F1), (H1), (K), and (W1) hold. Then for every 55 θ ∈ Θ for which µ(x) = mθ (x), x ∈ C, we have N (Wn (θ) − AN (θ)) → N1 (0, γ(θ)). 3.3.1 (3.10) Consistency of θˆn We first establish the consistency of the proposed m.d. parameter estimators θˆn . Many details below are similar to those in KN and KS. Recall µ(x) = E(Y |X = x). Let H(z) = E(µ(X)|Z = z), and define ρ(ν, Hθ ) = (ν − Hθ )2 dG, T (ν) = argminθ (ν − Hθ )2 dG = argminθ ρ(ν, Hθ ), ν ∈ L2 (G). Lemma 3.3.2. Suppose (A1), (A2), (A5), (F1), (H1), (H3), (K) and (W1) hold. If, in addition T (H) is unique, then θˆn = T (H) + op (1). The proof is deferred to the last Section 3.6 of the chapter. Assumption (H2), Lemmas 3.3.1 and 3.3.2 immediately imply the consistency of the proposed estimators θˆn as stated in the following theorem. Theorem 3.3.1. Suppose (A1), (A2), (A5), (F1), (H1)–(H3), (K), (W1) and H0 hold. Then θˆn →p θ0 . 56 3.3.2 Asymptotic normality of θˆn Here we present the asymptotic normality result about θˆn under H0 . Theorem 3.3.2. Suppose (A1)–(A3), (A5), (F1), (F2), (H1)–(H6), (K), (W1), (W2) and H0 hold, then under H0 , √ −1 , −1 n(θˆn − θ0 ) →d Nq 0, Σ−1 0 (Σ1 + λ Σ2 )Σ0 where Σ0 is given in (H6) and Σ1 = Σ2 = (σε2 + τ 2 (u))H˙ θ0 (u)H˙ θT (u)g 2 (u) 0 fZ (u) du, (3.11) σθ0 (x, y)H˙ θ0 (x)H˙ θT (y)dG(x)dG(y). 0 √ This theorem shows that θˆn is n-consistent for θ0 and the asymptotic covariance matrix is mainly determined by the two terms Σ1 and Σ2 . The matrix Σ1 represents the variation in Berkson measurement error model when fη is known as in KS while Σ2 represents the contribution due to the estimation of Hθ by Hθ using the validation data. Moreover, the covariance tends to decay as N/n increases. When N/n → ∞, in other words, when the validation sample size N is sufficiently large, compared to the primary sample size n, not surprisingly the above asymptotic covariance degenerates to the case as if fη is known. Remark 3.3.1. Here we verify that the quantities Σ1 and Σ2 in the asymptotic variancecovariance matrix are well defined under the given assumptions. Given (A2) and the compactness of C, τ 2 (u) is bounded on C. Assumption (H6) further implies that Σ1 is finite and positive definite. 57 Next, consider Σ2 . 
The Cauchy-Schwarz inequality implies that σθ (x, y) ≤ σθ (x)σθ (y) for all x, y ∈ R, θ ∈ Θ, and that for any a ∈ Rq , |aT Σ2 a| ≤ = = σθ0 (x, y) aT H˙ θ0 (x)H˙ θT (y)a dG(x)dG(y) 0 σθ0 (x)σθ0 (y) aT H˙ θ0 (x) aT H˙ θ0 (y) dG(x)dG(y) σθ0 (x) aT H˙ θ0 (x) dG(x) 2 ≤ a 2 σθ2 (x)dG(x) 0 Hθ0 (x) 2 dG(x). Hence assumptions (A5) and (H6) ensure that the entries of Σ2 exist and are finite. Moreover, as seen in the proofs below, Σ2 is a positive definite covariance matrix. Now we describe some parametric function families along with the corresponding Σ2 that satisfy the assumptions (H3)–(H5). Example 3.3.1. The linear and polynomial cases. Suppose q = p, mθ (x) = θT x, θ, x ∈ Rp . Then Hθ (z) = θT z is a known function. In this case there is no need to estimate this function and one can also use θ˜n as a m.d. estimator of θ. See Remark 3.3.2 for an asymptotic equivalence between θˆn and θ˜n . In the polynomial regression of order p, q = p + 1 and mθ (x) = θT (x), x ∈ R, where θ = (θ1 , ..., θp+1 )T and (x) := (1, x, ..., xp )T such that E (X) < ∞, where · denotes the Euclidean norm. Then L(z) := E( (X)|Z = z) = (1, z, E(z + η)2 , ..., E(z + η)p )T , Hθ (z) = θT L(z). This model is a simple deviation from the linear model and one already sees the need to 58 estimate Hθ (z). Given the validation data, an estimate of Hθ (z) in this case is given by 1 Hθ (z) = N N k=1 1 mθ (z + ηk ) = N N θ1 + θ2 (z + ηk ) + θ3 (z + ηk )2 + ... + θp+1 (z + ηk )p . k=1 Here (H3) is satisfied with r = L. Furthermore, m ˙ θ (x) = (x) and H˙ θ (z) = L(z) for all θ ∈ Θ. Therefore, similar to the linear case, (H4) and (H5) hold. Moreover, σθ (x, y) = θT E (x + η) T (y + η) − L(x)LT (y) θ, Σ2 = θ0T [E (x + η) T (y + η) − L(x)LT (y)]θ0 L(x)LT (y)dG(x)dG(y). Example 3.3.2. The nonlinear case. In biochemistry, one of the well known models for enzyme kinetics relates enzyme reaction rate to the concentration of a substrate x by the formula α0 x/(θ + x), α0 > 0, θ > 0, x > 0. This is the so called Michaelis–Menten model. The ratio γ0 = α0 /θ is defined as the catalytic efficiency that measures how efficiently an enzyme converts a substrate into product. With γ0 known, the function can be written as mθ (x) := γ0 θx , θ+x θ > 0, x > 0. (3.12) We will verify that this nonlinear function satisfies (H3)–(H5). Regarding (H3), as shown in KS, one sufficient condition is that the regression function mθ (x) satisfies the same condition in (H3). In this case, direct calculation shows that |mθ1 (x) − mθ0 (x)| = γ0 x2 |θ1 − θ0 | ≤ γ0 |θ1 − θ0 |. (θ0 + x)(θ1 + x) Hence (H3) holds for (3.12). 59 Furthermore, suppose for each x ∈ Rp , the q × q matrix m ¨ θ (x) := ∂ 2 mθ (x)/∂θ2 exists for all θ in a neighborhood U0 of θ0 and m ¨ θ (x) ≤ C, for all θ ∈ U0 and x ∈ Rp , where the constant C may depend on θ0 . Then, (H4) holds because by the Mean Value Theorem, with probability 1, for all 1 ≤ i ≤ n, N ≥ 1, θ − θ0 ≤ δn , N −1 N k=1 [mθ (Zi + ηk ) − mθ0 (Zi + ηk ) − (θ − θ0 )T m ˙ θ0 (Zi + ηk )] θ − θ0 ≤ C θ − θ0 ≤ Cδn . In particular, for the function (3.12), p = 1 = q, the second derivative of the function m ¨ θ (x) = −2γ0 x2 /(θ + x)3 is bounded for θ > 0 and x > 0, so (H4) holds in this case. √ As for (H5), with nhp |θ − θ0 | ≤ a and θ1∗ falling between θ and θ0 , we have sup h−p/2 |m ˙ θ (Zi + ηk ) − m ˙ θ0 (Zi + ηk )| = sup h−p/2 |m ¨ θ∗ (Zi + ηk )(θ − θ0 )| i,k,θ∗ i,k,θ √ √ ≤ sup Ch−p/2 |θ − θ0 | = Op (h−p/2 / nhp ) = Op 1/( nhp ) = op (1), θ where C is the upper bound for the second derivative m ¨ θ (x). Therefore (H5) is satisfied. 
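Before turning to a second nonlinear example, here is a minimal sketch of how the calibrated function Ĥθ of Section 3.2 is formed for the Michaelis–Menten model (3.12) from the empirical measurement errors of a validation sample; the values of γ0 and θ, the Gaussian law of η, and the sample size are illustrative assumptions only.

```python
import numpy as np

# Validation-sample estimate of H_theta(z) = E[m_theta(z + eta)] for the
# Michaelis-Menten function (3.12), with gamma0 treated as known.
rng = np.random.default_rng(1)
gamma0, theta = 2.0, 0.5                 # assumed constants for illustration

def m_theta(x, theta):
    # m_theta(x) = gamma0 * theta * x / (theta + x),  theta > 0, x > 0
    return gamma0 * theta * x / (theta + x)

eta_hat = rng.normal(0, 0.1, 500)        # eta_k = X_k - Z_k from validation data

def H_hat(z, theta):
    # H_hat_theta(z) = N^{-1} sum_k m_theta(z + eta_k)
    return m_theta(z + eta_hat, theta).mean()

print(H_hat(1.0, theta), m_theta(1.0, theta))   # calibrated vs. naive value
```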
Another nonlinear example is the exponential function mθ (x) = eθx , θ, x ∈ R. In practice, it is reasonable to assume that both Θ and the domain of X are bounded subsets in R, i.e., |θ| ≤ C1 and |x| ≤ C2 . Again it suffices to verify that the condition in (H3) holds with Hθ (z) replaced by mθ (x). With θ∗ falling between θ1 and θ2 , we obtain ∗ |mθ2 (x) − mθ1 (x)| = |m ˙ θ∗ (x)(θ2 − θ1 )| = |xeθ x (θ2 − θ1 )| ≤ (|x|eC1 |x| )|θ2 − θ1 | := r(x)|θ2 − θ1 |. Therefore (H3) holds for the exponential regression function. Moreover, the second derivative 60 m ¨ θ (x) = x2 eθx is bounded by the constant C12 eC1 C2 . Hence the argument similar to that for (3.12) yields that the exponential function also satisfies (H4) and (H5). Next, we provide a sketch of the proof of Theorem 3.3.2. The most of the details of the proof are the same as in KN and KS. So we shall be briefly indicating only the major differences. Proof of Theorem 3.3.2. We first show that nhp θˆn − θ0 2 = Op (1). (3.13) Define Dn (θ) = Dn (θ) = 1 n 1 n n 2 Khi (z)(Hθ (Zi ) − Hθ0 (Zi )) dϕ(z), ˆ i=1 n 2 Khi (z)(Hθ (Zi ) − Hθ0 (Zi )) dϕ(z). ˆ (3.14) (3.15) i=1 We shall shortly prove the following two facts. nhp Dn (θˆn ) = Op (1). (3.16) For any 0 < a < ∞, there exist na and Na such that P Dn (θˆn )/ θˆn − θ0 2 ≥ a + inf bT Σ0 b > 1 − a b =1 ∀ n > n a , N > Na , (3.17) where Σ0 is defined in (H6). Then, as in KS, (3.13) follows from (3.16), (3.17) and the 61 relation nhp Dn (θˆn ) = [nhp θˆn − θ0 2 ][Dn (θˆn )/ θˆn − θ0 2 ]. Proof of (3.16). Subtracting and adding Yi to the ith summand in (3.14) and the triangular inequality yield Dn (θˆn ) ≤ 2(Mn (θˆn ) + Mn (θ0 )) ≤ 2(Mn (θ0 ) + Mn (θ0 )), because θˆn is the minimizer of Mn . From (3.4) of KS, we obtain nhp Mn (θ0 ) = Op (1). Lemma 3.3.1 and the decomposition (3.9) of Mn imply that nhp Mn (θ0 ) = Op (1). Therefore nhp Dn (θˆn ) = Op (1). (3.18) Next, subtracting and adding Hθ0 (Zi ) to the ith summand in (3.15) and the triangular inequality yield Dn (θˆn ) ≤ 2(Wn (θ0 ) + Dn (θˆn )). Lemma 3.3.1 implies that N Wn (θ0 ) = Op (1). This fact, (3.18) and (W1) yield (3.16). To prove (3.17), define un = θˆn − θ0 , µ ˆ˙ n (z, θ) = Dn1 = 1 n 1 nN ˙ θ0 (Zi + ηk ), dnik = mθˆ (Zi + ηk ) − mθ0 (Zi + ηk ) − uTn m n n ˙ Khi (z)H θ (Zi )dϕ(z) ˆ = i=1 n N Khi (z) i=1 k=1 dnik un 2 1 nN dϕ(z), ˆ 62 n (3.19) N Khi (z)m ˙ θ (Zi + ηk )dϕ(z), ˆ i=1 k=1 Dn2 = uTn µ ˆ˙ n (z, θ) 2 dϕ(z). ˆ un Then by the Cauchy-Schwarz inequality, Dn (θˆn ) = un 2 1 nN n N i=1 k=1 uTn m ˙ θ0 (Zi + ηk ) dnik + Khi (z) un un 2 dϕ(z) ˆ ≥ Dn1 + Dn2 − 2 Dn1 Dn2 . By (3.7) and (3.8), 1 n n 2 Khi (z) dϕ(z) ˆ = Op (1). (3.20) i=1 The consistency of θˆn , (H4) and (3.20) in turn imply N Dn1 ≤ max i N −1 k=1 dnik 2 un 1 n n Khi (z) 2 dϕ(z) ˆ = op (1). (3.21) i=1 An argument similar to the one used in KN in the analysis of the analog of Dn2 yields (3.17) for Dn2 , thereby completing the proof of (3.13). Now we provide a sketch to derive the asymptotic variance of √ ˆ n(θn − θ0 ). Proceeding as in KN and KS, θˆn is the root of the score equation ˙ M n (θ) = −2 1 n n Khi (z)(Yi − Hθ (Zi )) i=1 1 n n ˙ Khi (z)Hθ (Zi )) dϕ(z) ˆ = 0. (3.22) i=1 Arguing as in Lemma 4.2 of KN pertaining to gn1 , the above equation becomes ˙ M n (θ) = −2 1 n n Khi (z)(Yi − Hθ (Zi )) i=1 63 1 n n Khi (z)H˙ θ (Zi )) dϕ(z) ˆ = 0. 
i=1 (3.23) Define 1 µ˙ n (z, θ) = n 1 Un1 (z) = n Sn = n Khi (z)H˙ θ (Zi ), i=1 n Khi (z)ξi , i=1 1 Un2 (z) = n Un1 (z)µ˙ h (z, θ0 )dϕ(z), 1 Vn (z, θ) = n µ˙ h (z, θ) = E[Kh (z − Z)H˙ θ (Z)], ξi := Yi − Hθ0 (Zi ), n Khi (z)(Hθ0 (Zi ) − Hθ0 (Zi )), i=1 Tn = Un2 (z)µ˙ h (z, θ0 )dϕ(z), n Khi (z)(Hθ (Zi ) − Hθ0 (Zi )), Σ0 = H˙ θ0 (x)H˙ θT (x)dG(x). 0 i=1 Then the equation (3.23) is equivalent to [Un1 (z) − Un2 (z)]µ˙ n (z, θˆn )dϕ(z) ˆ = Vn (z, θˆn )µ˙ n (z, θˆn )dϕ(z). ˆ (3.24) A major difference between the proofs in KN, KS and here is the presence of the additional term Un2 (z)µ˙ n (z, θˆn )dϕ(z) ˆ in (3.24) due to the estimation of Hθ0 (z) by Hθ0 (z). A slight modification of the arguments in the proofs of Lemmas 4.1–4.3 of KN yield √ n Un1 (z)µ˙ n (z, θˆn )dϕ(z) ˆ = √ n √ nSn + op (1), Un2 (z)µ˙ n (z, θˆn )dϕ(z) ˆ = √ √ nSn →d Nq (0, Σ1 ), nTn + op (1). It thus remains to investigate the asymptotic property of Tn . For that purpose, define φT (Zi , ηk ) := (x) := Khi (z)[mθ0 (Zi + ηk ) − Hθ0 (Zi )]µ˙ h (z, θ0 )dϕ(z), [mθ0 (z + x) − Hθ0 (z)]µ˙ h (z, θ0 )fZ (z)dϕ(z). 64 1 ≤ i ≤ n, 1 ≤ k ≤ N, Then n √ 1 nTn = √ nN N φT (Zi , ηk ). i=1 k=1 The statistic Tn is a two sample U statistic with kernel function φT . We shall be using Theorem B.1 in Sepanski and Lee (1995) to derive asymptotic distribution of Tn and some other statistics. For the sake of completeness we include statement of this theorem as Lemma 3.6.1 in the last Section 3.6. In order to apply Lemma 3.6.1 to Tn , we need to identify the limits of projections of φT , i.e., limn→∞ E(φT |Z1 ) and limn→∞ E(φT |η1 ) as well as their corresponding variances. Algebra shows that E(φT |Z1 ) ≡ 0, E(φT |η1 ) →p (η1 ), Var( (η1 )) = Σ2 , where Σ2 is as in (3.11). Applying Lemma 3.6.1 in Sepanski and Lee (1995) yields that √ nTn →d Nq (0, Σ2 /λ), where λ > 0 is as in assumption (W1). Note that the asymptotic property of Tn is dominated by the behavior of E(φT |η1 ), the projection of φT on the validation sample space and Sn is constructed only based on the primary sample (Yi , Zi ). Hence Sn and Tn are asymptotically independent. Therefore the left hand side of (3.24) is asymptotically normally distributed with convergence rate √ n and variance-covariance matrix Σ1 + (Σ2 /λ). Now we will show that the right hand side of (3.24) equals Ωn (θˆn − θ0 ), where Ωn = Σ0 + op (1). 65 (3.25) Let en := un / un , Vn := 1 nN n N i=1 k=1 d ˆ Khi (z) nik µ˙ n (z, θˆn )dϕ(z), un Ln := ˆ µ˙ n (z, θˆn )µ˙ Tn (z, θ0 )dϕ(z). Then the right hand side of (3.24) can be rewritten as Vn (z, θˆn )µ˙ n (z, θˆn )dϕ(z) ˆ = [Vn eTn + Ln ]un . Argue as in KN to show that (H4) implies Vn = op (1) and Ln = Σ0 + op (1). Moreover, en being a unit vector, we obtain Vn eTn = op (1). This completes the sketch of the proof of (3.25), thereby that of Theorem 3.3.2. Remark 3.3.2. Connection between θˆn and θ˜n in linear regression. Here we shall investigate a relation between the estimators θˆn and θ˜n in the linear model. Assume µ(x) = mθ (x) = θT x, x ∈ C ⊂ Rp , for some θ ∈ Θ ⊂ Rp . (3.26) Then Hθ (z) = θT z and a closed form of θˆn can be derived by taking derivative of Mn (θ) and solving the equation ∂ Mn (θ)/∂θ = 0, i.e., Bn θˆn = An , where An = Bn = 1 n 1 n n Khi (z)Yi i=1 n 1 n n Khi (z)(Zi + η¯) dϕ(z), ˆ i=1 Khi (z)(Zi + η¯) i=1 66 1 n n Khi (z)(Zi + η¯)T dϕ(z), ˆ i=1 N k=1 ηk . with η¯ = N −1 Similarly, Bn θ˜n = An , where 1 n An = 1 n Bn = n 1 n Khi (z)Yi i=1 n 1 n Khi (z)Zi i=1 n Khi (z)Zi dϕ(z), ˆ i=1 n ˆ Khi (z)ZiT dϕ(z). 
i=1 Roughly speaking, because η¯ →p 0, An − An = op (1), Bn − Bn = op (1) and hence θˆn − θ˜n →p 0. Furthermore, under some specific conditions, both θˆn and θ˜n can achieve the same asymptotic efficiency. We present two such assumptions here. (A6) Eη 2 < ∞. τ1 (z) := E |ε| Z = z is a.e. (G) continuous. (A7) νG := zdG(z) = 0, zz T dG(z) is positive definite. Proposition 3.3.1. Suppose (3.1) and (3.26) hold with θ = θ0 . In addition suppose (A1), (F1), (K), (W1), (A6) and (A7) hold, then √ ˆ n(θn − θ˜n ) →p 0. Proof. For the transparency of the exposition, we give details for the case p = 1 only. Then Bn = n−1 2 n ˆ i=1 Khi (z)Zi dϕ(z). Bn = κG + op (1), where κG = By (3.5), (3.7), (3.8) and direct calculations, z 2 dG(z). By (A7), κG > 0. Then θ˜n = Bn−1 An is well defined for all sufficiently large n and the consistency of θ˜n yields that An = Op (1). We shall shortly show that (a) √ (b) Bn = Bn + op (n−1/2 ). n(An − An ) = op (1), Then for all sufficiently large n, θˆn = Bn−1 An and √ √ n(θˆn − θ˜n ) = n(An Bn − An Bn ) Bn Bn 67 √ = n[An Bn − An (Bn + op (n−1/2 ))] Bn (Bn + op (n−1/2 )) (3.27) √ = n(An − An )Bn − op (An ) = op (1). κ2G + op (1) To prove (3.27)(a), rewrite √ n(An − An ) = By (A6) and CLT, √ √ 1 n n¯ η n 1 n Khi (z)Yi i=1 n Khi (z) dϕ(z) ˆ := √ n¯ η An . i=1 n¯ η = Op (1). It thus suffices to show that An = op (1). Let A∗n denote the An with ϕˆ replaced by ϕ. Then the facts (3.7), E(|Y | Z = z) ≤ |θ0 z|+τ1 (z), assumption (A6) and rigorous calculation yield that 1 n2 |An − A∗n | = op n |Khi (z)Khj (z)Yi |dϕ(z) = op (1). i,j=1 Now we rewrite 1 A∗n = 2 n n i=1 2 (z)Y dϕ(z) + 1 Khi i n2 n n Khi (z)Khj (z)Yi dϕ(z) := An1 + An2 . i=1 j=i=1 Calculation of moments shows that EAn1 = O((nh)−1 ), EAn2 = θ0 νG + o(1), Var(An1 ) = O(n−3 h−2 ) and Var(An2 ) = O(n−1 ). Hence A∗n = θ0 νG + op (1), and (A7) implies (3.27)(a). Now we prove (3.27)(b). Let Bn := 1 n n 1 n Khi (z)Zi i=1 n Khi (z) dϕ(z). ˆ i=1 Then, by (3.8), Bn − Bn = 2¯ η Bn + η¯2 1 n n 2 Khi (z) dϕ(z) ˆ = 2¯ η Bn + Op (n−1 ). i=1 68 √ Argue as in the analysis of An to obtain that Bn = νG + op (1). This fact and n¯ η = Op (1) √ √ η )νG +Op (n−1/2 ), which together with (A7) imply (3.27)(b). imply that n(Bn −Bn ) = 2( n¯ This also completes the proof of the lemma. 3.4 Testing In this section we establish the asymptotic behavior of the proposed tests associated with Mn (θˆn ) under the null and certain fixed alternative hypotheses. Let ξi = Yi − Hθ0 (Zi ), 1 Cn = 2 n 1 Cn = 2 n n ξˆi = Yi − Hθˆ (Zi ), n 2 (z)ξ 2 dϕ(z), Khi i Γn = i=1 n 2 (z)ξˆ2 dϕ(z), Khi i ˆ Γn = i=1 2hp n2 2hp n2 Khi (z)Khj (z)ξi ξj dϕ(z) 2 , i=j ˆ Khi (z)Khj (z)ξˆi ξˆj dϕ(z) 2 . i=j Because ξ = Y − Hθ0 (Z) = ε + mθ0 (X) − Hθ0 (Z) and because Z, η and ε are mutually independent, E(ξ 2 |Z = z) = σε2 + τ 2 (z), where τ 2 is as in (A2). Since C is compact, and by (A2), τ 2 is continuous, we obtain E ξ 2 |Z = z dG(z) < ∞. The following theorem gives the main result of this section. Theorem 3.4.1. Suppose (A1), (A2), (A4), (A5), (F1)–(F2), (K), (H1)–(H6), (W1) and −1/2 (W2) hold. Then, under H0 , nhp/2 Γn Mn (θˆn ) − Cn →d N1 (0, 1). −1/2 Consequently, the null hypothesis is rejected by the test if Tn := nhp/2 Γn |Mn (θˆn ) − Cn | > zα/2 with the asymptotic size α > 0, where zα is the upper 100αth percentile of the standard normal distribution. The theorem shows that the ratio parameter N/n does not play a role in the limiting 69 null distribution. 
This finding is also reflected in the finite sample simulation study through the empirical level and power with different choices of N/n. Here we provide a sketch of the proof of the above theorem. Rewrite Mn (θˆn ) = 1 n n i=1 Khi (z) Yi − Hθ0 (Zi ) + Hθ0 (Zi ) − Hθ0 (Zi ) + Hθ0 (Zi ) − Hθˆ (Zi ) n = [Un1 (z) − Un2 (z) − Vn (z, θˆn )]2 dϕ(z) ˆ = [Un1 (z) − Un2 (z)]2 dϕ(z) ˆ + dϕ(z) ˆ [Vn (z, θˆn )]2 dϕ(z) ˆ [Un1 (z) − Un2 (z)]Vn (z, θˆn )dϕ(z) ˆ −2 =: Jn + Dn (θˆn ) − 2Kn (θˆn ), 2 say. The following three lemmas are needed for the proof of Theorem 3.4.1. Lemma 3.4.1. Suppose assumptions (A1), (A2), (A4), (A5), (F1)–(F2), (K), (H1)–(H6), (W1), (W2) and H0 hold. Then −1/2 nhp/2 Γn (Jn − Cn ) →d N1 (0, 1). (3.28) Lemma 3.4.2. Under the assumptions of Lemma 3.4.1, the following holds. (a) nhp/2 Dn (θˆn ) = op (1), (b) nhp/2 Kn (θˆn ) = op (1). (3.29) Lemma 3.4.3. Suppose assumptions (A1), (A2), (F1), (K), (H1)–(H6), (W1) with λ < ∞, 70 (W2) and H0 hold. Then (a) nhp/2 (Cn − Cn ) = op (1). (b) Γn − Γn = op (1). (3.30) The above three lemmas yield the asymptotic normality of Mn (θˆn ) in Theorem 3.4.1 in a routine fashion. Here we provide the proofs of these lemmas. Proof of Lemma 3.4.1. Let Jn∗ denote the Jn with ϕˆ replaced by ϕ. Algebra shows that EJn∗ = E 2 (z)dϕ(z) + E Un1 2 (z)dϕ(z) = O((nhp )−1 ) + O(N −1 ) = O (nhp )−1 . Un2 Then, by (3.6) and (W2), nhp/2 Jn − Jn∗ f 2 (z) −1 2 (z) z∈C fˆw ≤ nhp/2 Jn∗ sup = nhp/2 Op ((nhp )−1 )Op logk (n) log(n) p/(p+4) = op (1), n Therefore, Jn = Jn∗ + op ((nhp/2 )−1 ) = Op ((nhp )−1 ). (3.31) It thus suffices to prove (3.28) with Jn replaced by Jn∗ . To proceed further, define for 1 ≤ i, j ≤ n, 1 ≤ k, l ≤ N , i = j, k = l, ∆ik = mθ0 (Zi + ηk ) − Hθ0 (Zi ), Di = (Zi , ξi ), (3.32) 1 ψ1 (Di , Dj , ηk , ηl ) = Khi (z)Khj (z)[(ξi − ∆ik )(ξj − ∆jl ) + (ξj − ∆jk )(ξi − ∆il )]dϕ(z), 2 71 ψ2 (Di , ηk , ηl ) = ψ3 (Di , Dj , ηk ) = ψ4 (Di , ηk ) = 2 (z)(ξ − ∆ )(ξ − ∆ )dϕ(z), Khi i i ik il Khi (z)Khj (z)(ξi − ∆ik )(ξj − ∆jk )dϕ(z), 2 (z)(ξ − ∆ )2 dϕ(z). Khi i ik Rewrite Jn∗ 1 nN = = = 1 n2 N 2 n N Khi (z)(ξi − ∆ik ) dϕ(z) i=1 k=1 n N Khi (z)Khj (z)(ξi − ∆ik )(ξj − ∆jl )dϕ(z) i,j=1 k,l=1 1 n2 N 2 2 i=j,k=l i=j,k=l Khi (z)Khj (z)(ξi − ∆ik )(ξj − ∆jl )dϕ(z) + + + i=j,k=l i=j,k=l =: Jn1 + Jn2 + Jn3 + Jn4 . All these four quantities are similar to the two sample U statistics. We will show that only Jn2 contributes to the asymptotic expectation and only Jn1 contributes to the asymptotic variance in the limiting distribution. Note that E(∆ik |Zi ) ≡ 0 and E(ξi |Zi ) ≡ 0, a.s. Hence EJn1 = 0, E(Jn2 − Cn ) = E (3.33) 1 n2 N 2 n E(ψ2 (Di , ηk , ηl )|Di ) − Cn i=1 k=l 1 2 (z)ξ 2 dϕ(z) = O((N nhp )−1 ), E Kh1 1 Nn n−1 EJn3 = E Kh1 (z)Kh2 (z)∆11 ∆21 dϕ(z) = O(N −1 ), nN = EJn4 = O((N nhp )−1 ). 72 Now we investigate the variances of Jnj , j = 1, 2, 3, 4, using Lemmas 2.6.4 and 2.6.5. We verify that Jn1 is the only leading term. Note that 1 Jn1 = 2 2 n N ψ1 (Di , Dj , ηk , ηl ). i=j,k=l In order to apply Lemma 2.6.5, we first calculate the projections of ψ1 : E(ψ1 |Di , Dj ) = Khi (z)Khj (z)ξi ξj dϕ(z), (mθ0 (z + ηk ) − Hθ0 (z))(mθ0 (z + ηl ) − Hθ0 (z))f 2 (z)dϕ(z) + op (1), E(ψ1 |ηk , ηl ) = 1 2 1 E(ψ|Di , ηk , ηj ) = 2 E(ψ1 |Di , Dj , ηk ) = Khi (z)Khj (z)[(ξi − ∆ik )ξj + (ξj − ∆jk )ξi ]dϕ(z), Khi (z)[(ξi − ∆ik )(mθ0 (z + ηk ) − Hθ0 (z)) +(ξi − ∆il )(mθ0 (z + ηl ) − Hθ0 (z))]f (z)dϕ(z) + op (1). All other projections vanish. 
We also verify the variances of the above projections Var(ψ1 ) = O(h−p ), Var E(ψ1 |Di , Dj ) = O(h−p ), Var E(ψ1 |Di , Dj , ηk ) = O(h−p ), Var E(ψ1 |ηk , ηl ) = O(1), Var E(ψ1 |Di , ηk , ηl ) = O(1). Therefore, Lemma 2.6.5 implies that Var(Jn1 ) = O 1 n2 N 2 hp 1 1 1 1 + 2 p+ 2+ 2 p+ n h N n Nh nN 2 =O 1 n2 hp . Furthermore, it is seen that only the variance term associated with E(ψ1 |Di , Dj ) dominates 73 the variance of Jn1 and all other projection variances are o(1/(n2 hp )). Thus, if we let 1 Jn1 = n(n − 1) n n E(ψ1 |Di , Dj ), i=1 j=i=1 then nhp/2 (Jn1 − Jn1 ) = op (1). −1/2 From Lemma 5.1 in KN, we obtain nhp/2 Γn −1/2 nhp/2 Γn Jn1 →d N1 (0, 1). Hence Jn1 →d N1 (0, 1). (3.34) By using arguments similar to those used in the proof of Lemma 3.3.1, one can verify Var(nhp/2 Jn2 ) = o(1), Var(nhp/2 Jn3 ) = o(1), Var(nhp/2 Jn4 ) = o(1). Combining these facts with the expectation results in (3.33), we have Jn2 = Cn + op (1/(nhp/2 )), Jn3 = op (1/(nhp/2 )), Jn4 = op (1/(nhp/2 )). Therefore, (3.31) and these facts above imply −1/2 nhp/2 Γn −1/2 (Jn − Cn ) = nhp/2 Γn This fact together with (3.34) yield the conclusion (3.28). 74 Jn1 + op (1). Proof of Lemma 3.4.2. Recall the notation from (3.19). We have 1 nN Dn (θˆn ) = ≤ 2 = 2 un n ˙ θ0 (Zi + ηk ) Khi (z) dnik + uTn m i=1 k=1 n N 1 nN 2 N 2 2 dϕˆ (3.35) 2 uTn µ ˆ˙ n (z, θ0 ) dϕ(z) ˆ Khi (z)dnik dϕ(z) ˆ +2 i=1 k=1 Dn1 + Dn2 . The Cauchy-Schwarz inequality and (3.5) imply that Dn2 ≤ µ ˆ˙ n (z, θ0 ) 2 dϕ(z) ˆ = [µ ˆ˙ n (z, θ0 )]T µ ˆ˙ n (z, θ0 )dϕ(z) + op (1). Calculation shows that E [µ ˆ˙ n (z, θ0 )]T µ ˆ˙ n (z, θ0 )dϕ(z) = O(1) under (H6). Hence Dn2 = Op (1). This fact, (3.21) and the fact n un 2 = Op (1), implied by Theorem 3.3.2, together with (3.35) imply Dn (θˆn ) = op (nhp/2 )−1 , thereby proving (3.29)(a). In order to prove (3.29)(b), let Un := Un1 − Un2 and rewrite Kn (θˆn ) = = = 1 Un (z) nN Un (z) Un (z) 1 nN 1 nN n N Khi (z)[mθˆ (Zi + ηk ) − mθ0 (Zi + ηk )] dϕ(z) ˆ n i=1 k=1 n N Khi (z)[dnik + uTn m ˙ θ0 (Zi + ηk )] dϕ(z) ˆ i=1 k=1 n N Khi (z)dnik dϕ(z) ˆ + i=1 k=1 =: R1 + R2 . 75 Un (z)uTn µ ˆ˙ n (z, θ0 )dϕ(z) ˆ The facts (3.4), (3.31), and the Cauchy-Schwarz inequality imply that nhp/2 R1 ≤ 1/2 Jn = nhp/2 o 1 nN × p( un )Op n N Khi (z)dnik i=1 k=1 ((nhp )−1/2 ) 2 dϕ(z) ˆ 1/2 = op (1). Next, rewrite n R2 = uTn Un (z) n−1 i=1 n = uTn Un (z) n−1 ˙ Khi (z)H θ0 (Zi ) dϕ(z) ˆ ˙ Khi (z)H θˆ (Zi ) dϕ(z) ˆ n i=1 n −uTn Un (z) ˙ ˙ Khi (z)(H θˆ (Zi ) − H θ0 (Zi )) dϕ(z) ˆ n−1 n i=1 =: R21 − R22 . The score equation (3.22) implies that n R21 = uTn Vn (z, θˆn ) n−1 i=1 n = uTn Vn (z, θˆn ) n−1 ˙ Khi (z)H θˆ (Zi ) dϕ(z) ˆ n ˙ Khi (z)H θ0 (Zi ) dϕ(z) ˆ i=1 n +uTn Vn (z, θˆn ) n−1 ˙ ˙ ˆ Khi (z)(H θˆ (Zi ) − H θ0 (Zi )) dϕ(z) n i=1 =: R211 + R212 . Direct calculations together with (3.29)(a) and (3.8) yield n nhp/2 R211 ≤ nhp/2 n−1 un [Dn (θˆn )]1/2 i=1 76 −1/2 ˙ Khi (z)H θ0 (Zi ) 2 dϕ(z) ˆ = nhp/2 Op (n−1/2 )op ((nhp/2 )−1/2 )Op (1) = op (1). Similarly, assumption (H5), n1/2 un = Op (1) and (3.29)(a) imply that nhp/2 R212 = op (1) thereby nhp/2 R21 = op (1). Regarding R22 , the Cauchy-Schwarz inequality implies that nhp/2 R22 ≤ 1/2 un Jn × 1/2 1 Khi (z)(m ˙ θˆ (Zi + ηk ) − m ˙ θ0 (Zi + ηk )) 2 dϕ(z) ˆ n nN = nhp/2 Op (n−1/2 )Op ((nhp )−1/2 )op (hp/2 ) = op (1). The last equality holds because of assumption (H5) and (3.8). This completes the proof of the lemma. Proof of Lemma 3.4.3. Recall that ξˆi = Yi − Hθˆ(Zi ) = [Yi − Hθ0 (Zi )] + [Hθ0 (Zi ) − Hθˆ (Zi )] = ξi + δ˜i . 
n Note that δ˜i are not independent due to the common use of validation sample, we further decompose the residual as δ˜i = Hθ0 (Zi ) − Hθ0 (Zi ) + Hθ0 (Zi ) − Hθˆ (Zi ) = si + ti , n say. Proof of (3.30)(a). Let 1 C¯n = 2 n n 1 Bn = 2 n 2 (z)δ˜2 dϕ(z), Khi i i=1 1 φ5 (Zi , ηk ) = nN n 2 (z)ξ δ˜ dϕ(z), Khi i i i=1 2 (z)[m (Z + η ) − H (Z )]2 dϕ(z), Khi θ0 i k θ0 i 77 φ6 (Zi , ηk , ηl ) = 1 n 2 (z)[m (Z + η ) − H (Z )][m (Z + η ) − H (Z )]dϕ(z). Khi θ0 i k θ0 i θ0 i l θ0 i Let Cn∗ denote the Cn with dϕˆ replaced by dϕ. Arguing as for (3.31), it suffices to show that (3.30)(a) holds with Cn replaced by Cn∗ . Decompose Cn∗ 1 = 2 n n 2 (z)(ξ + δ˜ )2 dϕ(z) = C + C ¯n + 2Bn . Khi n i i i=1 We claim (a) nhp/2 C¯n = op (1), (b) nhp/2 Bn = op (1). (3.36) To prove (3.36)(a), by the triangular inequality, we obtain C¯n 1 n2 = n 2 (z)[H (Z ) − H (Z ) + H (Z ) − H (Z )]2 dϕ(z) Khi i θ0 i θ0 i θ0 i θˆ n i=1 1 ≤ 2 2 n n 2 (z)[H (Z ) − H (Z )]2 dϕ(z) Khi θ0 i θ0 i i=1 1 + 2 n =: 2 C¯n1 + 2 C¯n2 , n 2 (z)[H (Z ) − H (Z )]2 dϕ(z) Khi θ0 i θˆn i i=1 say. First, consider C¯n1 = 1 2 n N2 1 = nN n N 2 (z)[m (Z + η ) − H (Z )][m (Z + η ) − H (Z )]dϕ(z) Khi θ0 i k θ0 i θ0 i l θ0 i i=1 k,l=1 n N i=1 k=1 1 φ5 (Zi , ηk ) + nN 2 n N φ6 (Zi , ηk , ηl ) =: C¯n11 + C¯n12 . i=1 k=l=1 78 The summand C¯n11 is a two sample U statistic with the kernel function φ5 . Direct calculations yield the following facts. Eφ5 = O(1/(nN hp )), E(φ5 |ηk ) = K2 nN hp Var(E(φ5 |Zi )) = O E(φ5 |Zi ) = 1 nN 2 (z)σ 2 (Z )dϕ(z), Khi i [mθ0 (z + ηk ) − Hθ0 (z)]2 f (z)dϕ(z) + op (1/(nN hp )), 1 n2 N 2 h3p Var(E(φ5 |ηk )) = O , 1 n2 N 2 h2p . Because λ = lim N/n < ∞, by Lemma 3.6.1, we obtain √ N C¯n11 = Op Var(E(φ5 |Zi )) + λVar(E(φ5 |ηk )) = Op 1/(nN h3p/2 ) . Therefore, (W1) implies nhp/2 C¯n11 = Op 1 √ N hp N = op (1). Next, consider C¯n12 . It is a two sample degenerated U statistic with the kernel function φ6 . Similar to the analysis of Q3 in Lemma 3.3.1, we have Var(Cn12 ) = O N −2 (nhp )−2 . Hence under (W1), nhp/2 Cn12 = Op nhp/2 N nhp 1 = Op √ √ N N hp = op (1). Therefore nhp/2 C¯n1 = op (1). Next, consider C¯n2 = 1 n2 n 2 (z) Khi i=1 1 N N [mθˆ (Zi + ηk ) − mθ0 (Zi + ηk )] n k=1 79 2 dϕ(z) = ≤ 1 n2 2 n2 n 2 (z) Khi i=1 n 2 (z) Khi i=1 1 N 1 N 2 + 2 n := C¯n21 + C¯n22 , N dnik + uTn m ˙ θ0 (Zi + ηk ) k=1 N 2 dnik 2 dϕ(z) dϕ(z) k=1 n 2 (z) Khi i=1 1 N N uTn m ˙ θ0 (Zi + ηk ) 2 dϕ(z) k=1 say. By the facts (3.4) and n un 2 = Op (1), we obtain C¯n21 = op ( un 2 )Op The facts N −1 N ˙ θ0 (z k=1 m C¯n22 = Op 2 n2 2 n2 n 2 (z)dϕ(z) = o (n−2 h−p ). Khi p i=1 + ηk ) = H˙ θ0 (z) + op (1), n un 2 = Op (1) and (H6) yield n 2 −2 −p 2 (z) uT H ˙ Khi n θ0 (Zi ) dϕ(z) = Op (n h ). i=1 Hence, by assumption (W1), we obtain nhp/2 C¯n2 = nhp/2 Op (n2 hp )−1 = op (1), thereby completing the proof of (3.36)(a). Next, consider 1 Bn = 2 n n 2 (z)ξ s dϕ(z) + Khi i i i=1 1 n2 n 2 (z)ξ t dϕ(z) =: B + B . Khi n1 n2 i i i=1 Recall the notation in (3.32). Rewrite 1 Bn1 = 2 n N n N 2 (z)ξ ∆ dϕ(z). Khi i ik i=1 k=1 80 Algebra shows that E(Bn1 ) = 0 and 1 Var(Bn1 ) = 4 n N2 = O n N E 2 (y)K 2 (z)ξ ξ ∆ ∆ dϕ(y)dϕ(z) Khi i j ik jl hj i,j=1 k,l=1 1 n3 N h2p . Therefore, nhp/2 Bn1 = op (1). An argument similar to the one used in the analysis of C¯n2 yields that nhp/2 Bn2 = op (1), thereby completing the proof of (3.36)(b), and also of (3.30)(a). Proof of (3.30)(b). 
Rewrite Γn = 2hp n2 Khi (z)Khj (z)(ξi + δ˜i )(ξj + δ˜j )dϕ(z) ˆ 2 i=j = Γn + 2hp n2 ˆ Khi (z)Khj (z)(ξi δ˜j + ξj δ˜i + δ˜i δ˜j )dϕ(z) i=j 4hp + 2 n 2 ˆ Khi (z)Khj (z)(ξi δ˜j + ξj δ˜i + δ˜i δ˜j )dϕ(z) Khi (z)Khj (z)ξi ξj dϕ(z) ˆ i=j =: Γn + Γn1 + Γn2 , say. It suffices to show that Γn1 = op (1) and Γn2 = op (1). The triangular inequality implies that Γn1 ≤ 6hp n2 Khi (z)Khj (z)ξi δ˜j dϕ(z) ˆ i=j p 6h + 2 n 2 Khi (z)Khj (z)δ˜i δ˜j dϕ(z) ˆ i=j =: I1 + I2 + I3 . 81 6hp + 2 n 2 Khi (z)Khj (z)δ˜i ξj dϕ(z) ˆ i=j 2 Substituting sj + tj for δ˜j in I1 , it can be seen that 12hp n2 ≤ I1 Khi (z)Khj (z)ξi ti dϕ(z) ˆ 2 + i=j 12hp n2 Khi (z)Khj (z)ξi si dϕ(z) ˆ 2 i=j =: I11 + I12 . Rewrite 12hp I11 = 2 n n i=j=1 1 N N Khi (z)Khj (z)ξi [mθˆ (Zj + ηk ) − mθ0 (Zj + ηk )]dϕ(z) ˆ 2 n k=1 . Analogous to the analysis of C¯n2 , by (3.7) and the Cauchy-Schwarz inequality, for 1 ≤ i = j ≤ n, we obtain 1 Khi (z)Khj (z)ξi N = Op ≤ Op = N k=1 dϕ(z) ˆ N 1 Khi (z)Khj (z)ξi N k=1 2 (z)ξ 2 dϕ(z) × Khi i Op (h−p/2 )Op ((nhp )−1/2 ) mθˆ (Zj + ηk ) − mθ0 (Zj + ηk ) n mθˆ (Zj + ηk ) − mθ0 (Zj + ηk ) dϕ(z) n 2 (z) Khj = Op 1 N N dnik + uTn m ˙ θ0 (Zj + ηk ) 2 dϕ(z) 1/2 k=1 −1/2 −p (n h ). Hence I11 = hp Op (n−1/2 h−p ) = op (1). ∗ denote the I with ϕ Regarding I12 , let I12 ˆ replaced by ϕ, then it suffices to prove that 12 ∗ = o (1) by (3.5). Rewrite I12 p ∗ I12 hp = 2 2 n N n N Khi (y)Khj (y)Khi (z)Khj (z)ξi2 ∆jk ∆jl dϕ(y)dϕ(z). i=j=1 k,l=1 82 Define 1 2N 1 φ8 (Di , Dj , ηk , ηl ) = 2 Khi (y)Khj (y)Khi (z)Khj (z)[ξi2 ∆2jk + ξj2 ∆2ik ]dϕ(y)dϕ(z), φ7 (Di , Dj , ηk ) = Khi (y)Khj (y)Khi (z)Khj (z)[ξi2 ∆jk ∆jl + ξj2 ∆ik ∆il ]dϕ(y)dϕ(z). ∗ can be rewritten as Then I12 ∗ I12 = hp n2 N n N i=j=1 k=1 =: L1 + L2 , hp φ7 (Di , Dj , ηk ) + 2 2 n N n N φ8 (Di , Dj , ηk , ηl ) i=j=1 k=l=1 say. Both L1 and L2 are two sample U statistics. Verify that by (A2), (A5) and (W1), ∗ ) = E(L ) = E(I12 1 (n − 1)hp Eφ6 = O((N hp )−1 ) = o(1). nN Furthermore, by calculating the second moments of the conditional expectations in Lemma 2.6.4, it can be shown that, under (A2), (A4), (A5) and (W1), Eφ27 = O N −2 h−3p , Var(E(φ7 |Di )) = O (N hp )−2 , Var(E(φ7 |Di , Dj )) = O N −2 h−3p , Var(E(φ7 |ηk )) = O (N hp )−1 . Lemma 2.6.4 implies that Var(L1 ) = o(1). Thereby L1 = op (1). Similarly, Lemma 2.6.5 yields that L2 = op (1). The results I2 = op (1) and I3 = op (1) are obtained in a similar manner. Details are skipped for the sake of brevity of the chapter. The fact Γn2 = op (1) is 83 derived by using the fact that hp n−2 Khi (z)Khj (z)|ξi ξj |dϕ(z) ˆ 2 = Op (1) i=j proved in KN, the application of Cauchy-Scharwz inequality and the fact that Γn1 = op (1). This completes the proof of (3.30)(b) and also of Lemma 3.4.3. We further briefly discuss the consistency of these tests. We establish that under some regularity conditions, |Tn | →p ∞, under certain fixed alternatives, which implies the consistency of the sequences of tests based on Tn . Recall the definitions of H(z) and T (H) in the beginning of Section 3.3.1. Let θn be an consistent estimator of T (H) and define ξi = Yi − H(Zi ), 1 Cn = 2 n ξni = Yi − Hθn (Zi ), n 2 (z)ξ 2 dϕ(z), Khi ni ˆ i=1 −1/2 Let Tn := nhp/2 Γn 2hp Γn = 2 n n Khi (z)Khj (z)ξni ξnj dϕ(z) ˆ 2 . i=j=1 (Mn (θn ) − Cn ). Then the theorem below presents the asymptotic behavior of the proposed test under certain alternative hypotheses. Theorem 3.4.2. Suppose (A1), (A2), (A4), (A5),(F1), (F2), (H3), (K), (W1) and (W2) hold and the alternative hypothesis H1 : µ(x) = m(x), x ∈ C satisfies that inf θ ρ(H, Hθ ) > 0 and T (H) is unique. 
Then |Tn| →p ∞ for any consistent estimator θn of T(H).

By Lemma 3.3.2, θ̂n is consistent for T(H); therefore the above theorem implies that |Tn| → ∞ in probability under the same regularity conditions, and the test based on Tn is consistent against any alternative m for which inf_θ ρ(H, Hθ) > 0. The proof of Theorem 3.4.2 is similar to that of Theorem 5.1 in KS with slight modifications. The techniques used for analyzing Wn(θ) in Lemma 3.3.1 and Dn(θ) in the proof of Theorem 3.3.2 are enough to produce the conclusions. Details are skipped for the sake of brevity.

3.5 Simulation

In this section, we present the results of a Monte Carlo study of the proposed estimation and testing procedures for p = 1, 2. For p = 1, both linear and nonlinear functions are chosen as the underlying true regression to generate the primary and validation data. For p = 2, a linear regression is assumed. Various values of the ratio N/n are selected to demonstrate its role in the performance of these inference procedures. Throughout the simulation, the kernel function K is chosen as K(u) = 0.75(1 − u²)I(|u| ≤ 1) for p = 1, and as K(u) = 0.75²(1 − u1²)(1 − u2²)I(|u1| ≤ 1, |u2| ≤ 1) for p = 2. All of the results obtained are based on 1000 replications.

We need to determine the two bandwidths for the implementation of the above estimation and testing procedures. As mentioned in the beginning of Section 3.4, the bandwidth used for estimating fZ is w = c(log n/n)^{1/(p+4)}, c > 0. We propose to obtain c by minimizing, w.r.t. c, the unbiased cross-validation criterion UCV(w) developed in Härdle, Marron and Wand (1990), where
\[
UCV(w) = \frac{(R(K))^p}{n w^p} + \frac{1}{n(n-1) w^p} \sum_{i \ne j = 1}^{n} (K*K - 2K)\Big(\frac{Z_i - Z_j}{w}\Big),
\]
with R(K) = ∫K²(x)dx and K*K(x) = ∫K(y)K(x − y)dy. We apply a grid search to choose the optimal coefficient c, starting from 0.1 with step 0.02, i.e.,
\[
c_n^* := \mathop{\mathrm{argmin}}_{0.1 \le c \le 10}\, UCV\big(c(\log n/n)^{1/(p+4)}\big), \qquad w_{opt} = c_n^*\,(\log n/n)^{1/(p+4)}.
\]
For the bandwidth h, in order to satisfy (W2), we use h = σ̂_Z n^{−1/3} for p = 1, as recommended by Sepanski and Carroll (1993), and h = n^{−1/4.5} for p = 2, as used in KS.

In order to interpret the performance of the proposed estimator θ̂n, we also present the performance of the KS estimator θ̃_KS. Recall that in KS the measurement error density fη is assumed to be known. Both the means and the root mean squared errors (RMSE) of the two estimators are reported. For the proposed estimator θ̂n, N/n is chosen as 1, 1/2 and 1/10 to illustrate how N/n affects the estimator's performance. In both the linear and nonlinear cases, the bias and RMSE decrease as the sample sizes increase. In the linear case, as shown in Example 3.3.1, the asymptotic variance of θ̂n is the same as that of θ̃_KS. This is also reflected in this finite sample study: the RMSEs of θ̂n and θ̃_KS in Table 3.1 are very similar for all three choices of N/n. In the nonlinear case, Table 3.2 shows that the RMSE of θ̂n is larger than that of θ̃_KS and decreases as N/n increases from 1/10 to 1.

In the testing procedure, with nominal level 0.05, the empirical level and power are obtained by computing #{|Tn| ≥ 1.96}/1000. The sample size ratios are chosen as N/n = 4, 1 and 1/4. Both the linear and the nonlinear regressions are used as the null for p = 1, while the linear regression is chosen as the null for p = 2. The empirical power is obtained under various choices of alternative models.
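As a concrete illustration of the UCV bandwidth search described above, here is a minimal sketch for p = 1 with the Epanechnikov kernel; the sample, the seed, the numerical convolution grid and all sizes are illustrative assumptions of ours, not the settings used in the tables below.

```python
import numpy as np

# Sketch of the UCV grid search for w = c (log n/n)^{1/(p+4)}, p = 1.
rng = np.random.default_rng(3)
n = 100
Z = rng.uniform(-1, 1, n)

def epan(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)

y = np.linspace(-1, 1, 201)
dy = y[1] - y[0]

def epan_conv(x):
    # (K*K)(x) = int K(u) K(x - u) du, by a Riemann sum to keep the sketch short
    return (epan(y) * epan(x[..., None] - y)).sum(axis=-1) * dy

RK = 0.6   # R(K) = int K^2(u) du = 3/5 for the Epanechnikov kernel

def ucv(w):
    D = (Z[:, None] - Z[None, :]) / w
    off = ~np.eye(n, dtype=bool)            # keep only the i != j terms
    return RK / (n * w) + ((epan_conv(D) - 2 * epan(D))[off]).sum() / (n * (n - 1) * w)

cs = np.arange(0.1, 10.0, 0.02)             # grid search over c in [0.1, 10]
ws = cs * (np.log(n) / n) ** (1 / 5)        # candidate bandwidths, p = 1
w_opt = ws[np.argmin([ucv(w) for w in ws])]
print(w_opt)
```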
3.5.1 Finite sample performance of θ̂n

In this subsection we report the findings of a finite sample performance study of the estimator θ̂n in linear and nonlinear cases.

The linear case with q = 1 = p. In this case, we generated the data from (3.1) with
\[
m_\theta(x) = \theta x, \qquad \theta_0 = 1, \tag{3.37}
\]
where ε ∼ N1(0, 0.2²), η ∼ N1(0, 0.1²), Z ∼ U[−1, 1]. Then
\[
H_\theta(z) = \theta z, \qquad \widehat H_\theta(z) = \frac{1}{N}\sum_{k=1}^{N} \theta(z + \eta_k) = \theta(z + \bar\eta).
\]
The two bandwidths are chosen as described above. Throughout the simulation, C = [−1, 1], and G is the uniform measure on [−1, 1]. Hence, as noted in Example 3.3.1, here Σ2 = 0 and the asymptotic variance of θ̂n is the same as that of θ̃_KS. This fact is also reflected in this finite sample study: the RMSE of θ̂n remains essentially the same across the different choices of N/n, as seen in Table 3.1.

N/n = 1       (n, N)        (100,100)  (200,200)  (400,400)  (600,600)
              mean(θ̂n)      1.0016     0.9994     1.0005     0.9996
              RMSE(θ̂n)      0.0389     0.0298     0.0194     0.0163
N/n = 1/2     (n, N)        (100,50)   (200,100)  (400,200)  (600,300)
              mean(θ̂n)      0.9979     0.9985     1.0005     1.0005
              RMSE(θ̂n)      0.0381     0.0295     0.0194     0.0165
N/n = 1/10    (n, N)        (100,10)   (200,20)   (400,40)   (600,60)
              mean(θ̂n)      0.9974     0.9984     1.0001     0.9999
              RMSE(θ̂n)      0.0399     0.0299     0.0195     0.0170
KS            n             100        200        400        600
              mean(θ̃_KS)    0.9999     0.9996     1.0006     0.9995
              RMSE(θ̃_KS)    0.0393     0.0298     0.0194     0.0170

Table 3.1: Performance of θ̂n, θ̃n in the linear case (3.37), p = 1.

The nonlinear case with q = 1 = p. In this case, the regression function is
\[
m_\theta(x) = e^{\theta x}, \qquad \theta_0 = -1, \tag{3.38}
\]
and all other settings are the same as in the above simulation for the linear case. Then the calibrated regression function, given z, is
\[
H_\theta(z) = e^{\theta^2 \sigma_\eta^2/2}\, e^{\theta z}, \qquad \widehat H_\theta(z) = \frac{1}{N}\sum_{k=1}^{N} e^{\theta(z + \eta_k)} = e^{\theta z}\, \frac{1}{N}\sum_{k=1}^{N} e^{\theta \eta_k}.
\]
In this case, the second term Σ2 in the asymptotic variance is calculated as
\[
\sigma_{\theta_0}(x, y) = e^{\sigma_\eta^2}\big(e^{\sigma_\eta^2} - 1\big)\, e^{\theta_0(x+y)}, \qquad
\Sigma_2 = e^{2\sigma_\eta^2}\big(e^{\sigma_\eta^2} - 1\big)\Big(\int (x + \sigma_\eta^2 \theta_0)\, e^{2\theta_0 x}\, dG(x)\Big)^2 > 0.
\]
Table 3.2 shows the consistency of θ̂n: the bias is small and the RMSE decreases as the sample sizes increase. The RMSEs of θ̂n are larger than those of θ̃_KS for all chosen values of N/n. Furthermore, the RMSE of θ̂n decreases as N/n increases.

N/n = 1       (n, N)        (100,100)  (200,200)  (400,400)  (600,600)
              mean(θ̂n)      -0.9999    -1.0004    -0.9999    -1.0002
              RMSE(θ̂n)      0.0360     0.0249     0.0172     0.0141
N/n = 1/2     (n, N)        (100,50)   (200,100)  (400,200)  (600,300)
              mean(θ̂n)      -1.0032    -1.0004    -1.0000    -1.0001
              RMSE(θ̂n)      0.0392     0.0267     0.0181     0.0149
N/n = 1/10    (n, N)        (100,10)   (200,20)   (400,40)   (600,60)
              mean(θ̂n)      -1.0023    -1.0009    -1.0004    -1.0003
              RMSE(θ̂n)      0.0498     0.0358     0.0245     0.0200
KS            n             100        200        400        600
              mean(θ̃_KS)    -1.0005    -0.9998    -0.9998    -1.0002
              RMSE(θ̃_KS)    0.0321     0.0233     0.0162     0.0132

Table 3.2: Performance of θ̂n, θ̃n in the nonlinear case (3.38), p = 1.
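As a quick numerical check on the calibration in this nonlinear case, the following sketch compares Hθ(z) = e^{θ²ση²/2} e^{θz} with its validation-sample estimate Ĥθ(z); the seed, the evaluation point and the validation sample size are illustrative assumptions.

```python
import numpy as np

# Check: for eta ~ N(0, sig_eta^2), E exp(theta * eta) = exp(theta^2 sig_eta^2 / 2),
# so H_hat_theta(z) = exp(theta z) * mean_k exp(theta eta_k) should approach H_theta(z).
rng = np.random.default_rng(2)
theta, sig_eta = -1.0, 0.1
eta = rng.normal(0, sig_eta, 10_000)     # a large validation sample of errors
z = 0.3
H_true = np.exp(theta ** 2 * sig_eta ** 2 / 2) * np.exp(theta * z)
H_hat = np.exp(theta * z) * np.mean(np.exp(theta * eta))
print(H_true, H_hat)                      # the two values should nearly agree
```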
Both the means and the RMSEs of the estimators $\hat\theta_n = (\hat\theta_{n,1}, \hat\theta_{n,2})^T$ and $\tilde\theta_{KS} = (\tilde\theta_{KS,1}, \tilde\theta_{KS,2})^T$ are presented in Table 3.3. It shows small estimation bias and a decreasing RMSE as the sample sizes increase.

  N/n = 1     (n, N)        (100,100)  (200,200)  (300,300)  (400,400)
              mean(θ̂n,1)    0.9953     1.0004     0.9999     1.0013
              RMSE(θ̂n,1)    0.0728     0.0397     0.0332     0.0275
              mean(θ̂n,2)    0.9989     1.0032     1.0013     0.9999
              RMSE(θ̂n,2)    0.0634     0.0398     0.0307     0.0271
  N/n = 1/2   (n, N)        (100,50)   (200,100)  (300,150)  (400,200)
              mean(θ̂n,1)    0.9943     1.0000     0.9998     1.0007
              RMSE(θ̂n,1)    0.0779     0.0399     0.0332     0.0269
              mean(θ̂n,2)    0.9975     1.0011     1.0009     0.9983
              RMSE(θ̂n,2)    0.0644     0.0395     0.0308     0.0261
  N/n = 1/10  (n, N)        (100,10)   (200,20)   (300,30)   (400,40)
              mean(θ̂n,1)    0.9928     0.9990     0.9990     1.0006
              RMSE(θ̂n,1)    0.0813     0.0400     0.0333     0.0275
              mean(θ̂n,2)    0.9892     0.9965     0.9980     0.9971
              RMSE(θ̂n,2)    0.0679     0.0399     0.0311     0.0273
  KS          n             100        200        300        400
              mean(θ̃KS,1)   0.9957     1.0006     0.9999     1.0013
              RMSE(θ̃KS,1)   0.0732     0.0397     0.0334     0.0275
              mean(θ̃KS,2)   0.9999     1.0037     1.0017     1.0002
              RMSE(θ̃KS,2)   0.0633     0.0399     0.0306     0.0271

  Table 3.3: Performance of θ̂n and θ̃KS in the linear case with q = 2 = p.

3.5.2 Test performance

Here we present the performance of the proposed test associated with $M_n(\hat\theta_n)$, in terms of empirical level and power, for different alternative hypotheses and various choices of the sample size ratio.

The case $q = 1 = p$. The finite sample performance of the $T_n$ test is assessed with both the linear model (3.37) and the nonlinear model (3.38) above as the null. For each case, three different alternatives are chosen to obtain the empirical power of a member of the class of the proposed tests.

              Linear null                                 Nonlinear null
  Model 0:    Y = X + ε                                   Y = e^{-X} + ε
  Model 1:    Y = X + 0.2X² + ε                           Y = e^{-X} + 0.2X² + ε
  Model 2:    Y = X + 0.5 sin(2X) + ε                     Y = e^{-X} + 0.5 sin(2X) + ε
  Model 3:    Y = X I(X ≤ 0.5) + 0.5 I(X > 0.5) + ε       Y = e^{-X} I(X ≤ 0.5) + e^{-0.5} I(X > 0.5) + ε

The entities $G$, $K$, $f_Z$, $U$ and $\varepsilon$ are as in the $q = 1 = p$ cases in Section 3.5.1. The empirical levels under Model 0 and the empirical power under Models 1, 2 and 3 are shown in Table 3.4 for increasing sample sizes; the two panels of Table 3.4 correspond to the left and right columns of the models above, respectively.

  Linear null model (3.37)
  N/n = 4     (n, N)     (100,400)  (200,800)  (500,2000)
              Model 0    0.031      0.041      0.032
              Model 1    0.686      0.982      1.000
              Model 2    0.392      0.781      0.996
              Model 3    0.913      1.000      1.000
  N/n = 1     (n, N)     (100,100)  (200,200)  (500,500)
              Model 0    0.029      0.037      0.041
              Model 1    0.671      0.984      1.000
              Model 2    0.409      0.790      0.996
              Model 3    0.921      1.000      1.000
  N/n = 1/4   (n, N)     (100,25)   (200,50)   (500,125)
              Model 0    0.056      0.048      0.052
              Model 1    0.679      0.967      1.000
              Model 2    0.483      0.814      0.995
              Model 3    0.902      1.000      1.000

  Nonlinear null model (3.38)
  N/n = 4     (n, N)     (100,400)  (200,800)  (500,2000)
              Model 0    0.046      0.022      0.039
              Model 1    0.326      0.748      1.000
              Model 2    1.000      1.000      1.000
              Model 3    0.321      0.740      1.000
  N/n = 1     (n, N)     (100,100)  (200,200)  (500,500)
              Model 0    0.029      0.032      0.040
              Model 1    0.345      0.746      1.000
              Model 2    1.000      1.000      1.000
              Model 3    0.327      0.730      1.000
  N/n = 1/4   (n, N)     (100,25)   (200,50)   (500,125)
              Model 0    0.037      0.033      0.038
              Model 1    0.344      0.744      1.000
              Model 2    0.996      1.000      1.000
              Model 3    0.339      0.733      0.996

  Table 3.4: Empirical level and power under the linear null model (top panel) and the nonlinear null model (bottom panel), p = 1.

With nominal level 0.05, the empirical level is well controlled in the linear case, while it is slightly conservative under the exponential null model for larger sample sizes. The proposed test rejects the null hypothesis with high power for moderate and large sample sizes under all three chosen alternatives. Moreover, for the same primary sample size $n$, the empirical power changes little as the validation sample size $N$ increases. This finding is also consistent with the theoretical result that the sample size ratio $N/n$ does not play a critical role in the asymptotic behavior of the proposed test statistic.
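As an illustration of how the data are generated under the null and the alternatives in this power study, here is a minimal sketch for the $p = 1$ case. It is a sketch under the stated simulation setup ($Z \sim U[-1,1]$, $\eta \sim N_1(0, 0.1^2)$, $\varepsilon \sim N_1(0, 0.2^2)$), and the function name is our own.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_p1(model, n, N, null="linear"):
    # One replication of primary and validation data for the p = 1 power study.
    # `model` in {0, 1, 2, 3} selects the null (0) or one of the alternatives;
    # `null` selects the linear family (3.37) or the exponential family (3.38).
    Z = rng.uniform(-1.0, 1.0, n)
    eta = rng.normal(0.0, 0.1, n)
    X = Z + eta                              # unobserved Berkson covariate
    eps = rng.normal(0.0, 0.2, n)
    base = X if null == "linear" else np.exp(-X)
    if model == 0:
        Y = base + eps
    elif model == 1:
        Y = base + 0.2 * X**2 + eps
    elif model == 2:
        Y = base + 0.5 * np.sin(2.0 * X) + eps
    else:                                    # Model 3: the change-point alternative
        hi = 0.5 if null == "linear" else np.exp(-0.5)
        Y = np.where(X <= 0.5, base, hi) + eps
    eta_val = rng.normal(0.0, 0.1, N)        # validation sample of measurement errors
    return Y, Z, eta_val
```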
The case $q = 2 = p$. In this case the setup is the same as in the estimation Subsection 3.5.1 for $p = 2$. We investigate the empirical level of the proposed test under Model ∅ and its power under the alternative Models I, II and III below, where $X = (X_1, X_2)^T$.

  Model ∅:    Y = θ₀ᵀX + ε,  θ₀ = (1, 1)ᵀ
  Model I:    Y = θ₀ᵀX + 0.2X₁X₂ + ε
  Model II:   Y = θ₀ᵀX + 0.5 sin(2X₁X₂) + ε
  Model III:  Y = θ₀ᵀX I(θ₀ᵀX ≤ 0.5) + 0.5 I(θ₀ᵀX > 0.5) + ε

  N/n = 4     (n, N)      (40,160)  (100,400)  (200,800)  (400,1600)
              Model ∅     0.041     0.034      0.048      0.047
              Model I     0.052     0.103      0.278      0.596
              Model II    0.581     0.900      0.953      0.998
              Model III   0.654     0.910      0.961      0.999
  N/n = 1     (n, N)      (40,40)   (100,100)  (200,200)  (400,400)
              Model ∅     0.040     0.035      0.049      0.046
              Model I     0.056     0.121      0.295      0.608
              Model II    0.585     0.900      0.958      0.998
              Model III   0.636     0.908      0.963      0.997
  N/n = 1/4   (n, N)      (40,10)   (100,25)   (200,50)   (400,100)
              Model ∅     0.092     0.081      0.077      0.069
              Model I     0.105     0.188      0.340      0.658
              Model II    0.604     0.901      0.961      0.999
              Model III   0.638     0.912      0.964      0.999

  Table 3.5: Empirical level and power under the linear null model, p = 2.

The numerical findings are summarized in Table 3.5. The empirical levels preserve the nominal size 0.05 for larger sample sizes when $N/n = 1$ and 4. The empirical levels for $N/n = 1/4$ are slightly inflated, due to the limited validation sample size, and decrease towards the nominal size 0.05 as the sample sizes increase. The empirical power under all chosen alternatives increases as the sample size increases.
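Both the empirical level and the empirical power reported above are Monte Carlo rejection frequencies. The following schematic driver records the computation $\#\{|T_n| \ge 1.96\}/1000$; it is a sketch only, and `tn_statistic` is a hypothetical placeholder for an implementation of the studentized statistic $T_n$ of Section 3.4, which is not reproduced here.

```python
def empirical_rejection_rate(simulate, tn_statistic, reps=1000, crit=1.96):
    # Monte Carlo estimate of the rejection probability #{|Tn| >= crit}/reps.
    # `simulate()` returns one replication (Y, Z, eta_val); `tn_statistic` is a
    # user-supplied function computing the studentized test statistic Tn.
    rejections = 0
    for _ in range(reps):
        Y, Z, eta_val = simulate()
        if abs(tn_statistic(Y, Z, eta_val)) >= crit:
            rejections += 1
    return rejections / reps

# Example: level under Model 0 and power under Model 2 of the p = 1 study,
# reusing simulate_p1 from the sketch above (my_tn is hypothetical):
# level = empirical_rejection_rate(lambda: simulate_p1(0, 200, 200), my_tn)
# power = empirical_rejection_rate(lambda: simulate_p1(2, 200, 200), my_tn)
```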
3.6 Proofs

In this section we provide detailed proofs of Lemmas 3.3.1 and 3.3.2. To proceed, we state Theorem B.1 of Sepanski and Lee (1995) and recall Lemmas 2.6.4 and 2.6.5 pertaining to some two-sample U-statistics.

Lemma 3.6.1. Let $\{x_i\}$, $i = 1, \dots, n$, be an i.i.d. sample and $\{v_j\}$, $j = 1, \dots, m$, be another i.i.d. sample, independent of $\{x_i\}$. Let $\psi_n(v, x, h)$ be a sequence of random functions with a bandwidth $h$. In addition, suppose the following hold.
(1) There exist square integrable functions $q_1(v)$ and $q_2(x)$ such that $|E\{\psi_n(v, x, h)|v\}| \le q_1(v)$ and $|E\{\psi_n(v, x, h)|x\}| \le q_2(x)$;
(2) $\lim_{n\to\infty} E\{\psi_n(v, x, h)|v\} = p_1(v)$, a.e., and $\lim_{n\to\infty} E\{\psi_n(v, x, h)|x\} = p_2(x)$, a.e., for some measurable functions $p_1(v)$ and $p_2(x)$; and
(3) $\lim_{n\to\infty} \sqrt{n}\, E\{\psi_n(v, x, h)\} = 0$.
Then
$$\frac{1}{m\sqrt{n}} \sum_{i=1}^{n} \sum_{j=1}^{m} \psi_n(v_j, x_i, h) \to_d N_1\big(0, \lambda \mathrm{Var}\{p_1(v)\} + \mathrm{Var}\{p_2(x)\}\big),$$
where $\lambda = \lim_{n \wedge m \to \infty} (n/m)$ is assumed to be finite.

Proof of Lemma 3.3.1. Let $W_n^*(\theta)$ be $W_n(\theta)$ with $d\hat\varphi$ replaced by $d\varphi$. To proceed, for $1 \le i \ne j \le n$ and $1 \le k \ne l \le N$, define
$$\phi_1(Z_i, \eta_k) = \int K_{hi}^2(z) [m_\theta(Z_i + \eta_k) - H_\theta(Z_i)]^2 \, d\varphi(z),$$
$$\phi_2(Z_i, Z_j, \eta_k, \eta_l) = \int K_{hi}(z) K_{hj}(z) [m_\theta(Z_i + \eta_k) - H_\theta(Z_i)][m_\theta(Z_j + \eta_l) - H_\theta(Z_j)] \, d\varphi(z),$$
$$\phi_3(Z_i, Z_j, \eta_k) = \int K_{hi}(z) K_{hj}(z) [m_\theta(Z_i + \eta_k) - H_\theta(Z_i)][m_\theta(Z_j + \eta_k) - H_\theta(Z_j)] \, d\varphi(z),$$
$$\phi_4(Z_i, \eta_k, \eta_l) = \int K_{hi}^2(z) [m_\theta(Z_i + \eta_k) - H_\theta(Z_i)][m_\theta(Z_i + \eta_l) - H_\theta(Z_i)] \, d\varphi(z).$$
Rewrite
$$W_n^*(\theta) = \frac{1}{n^2} \sum_{i,j=1}^{n} \int K_{hi}(z) K_{hj}(z) [\hat H_\theta(Z_i) - H_\theta(Z_i)][\hat H_\theta(Z_j) - H_\theta(Z_j)] \, d\varphi(z)$$
$$= \frac{1}{n^2 N^2} \sum_{i,j=1}^{n} \sum_{k,l=1}^{N} \int K_{hi}(z) K_{hj}(z) [m_\theta(Z_i + \eta_k) - H_\theta(Z_i)][m_\theta(Z_j + \eta_l) - H_\theta(Z_j)] \, d\varphi(z)$$
$$= \frac{1}{n^2 N^2} \Big[ \sum_{i=1}^{n} \sum_{k=1}^{N} \phi_1(Z_i, \eta_k) + \sum_{i \ne j} \sum_{k \ne l} \phi_2(Z_i, Z_j, \eta_k, \eta_l) + \sum_{i \ne j} \sum_{k} \phi_3(Z_i, Z_j, \eta_k) + \sum_{i} \sum_{k \ne l} \phi_4(Z_i, \eta_k, \eta_l) \Big]$$
$$=: Q_1 + Q_2 + Q_3 + Q_4, \quad \text{say}.$$
Because $E[m_\theta(Z + \eta)|Z = z] = H_\theta(z)$, we have $EQ_2 = EQ_4 = 0$. Therefore
$$E(W_n^*(\theta)) = EQ_1 + EQ_3 = \frac{1}{nN} E\phi_1(Z, \eta) + \frac{1}{N} E\phi_3(Z_1, Z_2, \eta)$$
$$= \frac{1}{nN} E\int K_h^2(z - Z_1) \sigma_\theta^2(Z_1) \, d\varphi(z) + \frac{1}{N} E\int K_h(z - Z_1) K_h(z - Z_2) \sigma_\theta(Z_1, Z_2) \, d\varphi(z)$$
$$= O\Big( \frac{1}{N} \int \sigma_\theta^2(z) \, dG(z) \Big) = O\big(A_N(\theta)\big).$$
This fact and (3.5) imply that $W_n(\theta) = W_n^*(\theta) + o_p(W_n^*(\theta)) = W_n^*(\theta) + o_p(1/N)$. Therefore it suffices to prove that (3.10) holds with $W_n(\theta)$ replaced by $W_n^*(\theta)$. We investigate each term in the above decomposition of $W_n^*(\theta)$.

First, $Q_1$ is a two-sample U-statistic with kernel function $\phi_1$. In order to apply Lemma 3.6.1, it is necessary to calculate the projections of $\phi_1$, i.e.,
$$E(\phi_1|Z_i) = h^p \int K_{hi}^2(z) \sigma_\theta^2(Z_i) \, d\varphi(z), \qquad E(\phi_1|\eta_k) = O_p\Big( K_1 \int [m_\theta(z + \eta_k) - H_\theta(z)]^2 \, dG(z) \Big).$$
It can be verified that $E\phi_1$, $\mathrm{Var}(E(\phi_1|Z_i))$ and $\mathrm{Var}(E(\phi_1|\eta_k))$ are all finite. Therefore Lemma 3.6.1 implies that, for finite $0 < \lambda < \infty$,
$$Z_n := \frac{1}{n\sqrt{N}} \sum_{i=1}^{n} \sum_{k=1}^{N} [\phi_1(Z_i, \eta_k) - E\phi_1]$$
is asymptotically normally distributed. Hence
$$NQ_1 = \frac{1}{n\sqrt{N}} Z_n + \frac{1}{n} E(\phi_1) = o_p(1). \qquad (3.39)$$
Similarly, $Q_2$ is a two-sample U-statistic with kernel function $\phi_2$. Note that
$$E[\phi_2|Z_i] = E[\phi_2|Z_i, Z_j] = E[\phi_2|Z_i, Z_j, \eta_k] = E[\phi_2|\eta_k] = E[\phi_2|Z_i, \eta_k] = 0$$
for $1 \le i \ne j \le n$ and $1 \le k \le N$. To proceed further, define
$$\tilde\phi_2(\eta_k, \eta_l) = \int [m_\theta(z + \eta_k) - H_\theta(z)][m_\theta(z + \eta_l) - H_\theta(z)] f_Z^2(z) \, d\varphi(z), \qquad \bar Q_2 = \frac{1}{N(N-1)} \sum_{k \ne l = 1}^{N} \tilde\phi_2(\eta_k, \eta_l).$$
Calculation shows that
$$\mathrm{Var}(\phi_2) = O\Big(\frac{1}{h^{2p}}\Big), \qquad E(\phi_2|\eta_k, \eta_l) = O_p\big(\tilde\phi_2(\eta_k, \eta_l)\big), \qquad \mathrm{Var}\big(\tilde\phi_2(\eta_k, \eta_l)\big) = \Sigma_\theta.$$
Then Lemma 2.6.5 implies that $\mathrm{Var}\big(N(Q_2 - \bar Q_2)\big) = O\big((nh^p)^{-2}\big) = o(1)$, whereby $N(Q_2 - \bar Q_2) = o_p(1)$. Furthermore, $\bar Q_2$ being a degenerate U-statistic, applying Theorem 1 of Hall (1984) to $\bar Q_2$ yields $N\bar Q_2 \to_d N_1(0, 2\gamma(\theta))$, which in turn implies
$$NQ_2 \to_d N_1(0, 2\gamma(\theta)). \qquad (3.40)$$
Next, consider $Q_3$, which is defined with kernel function $\phi_3$. Algebra shows
$$E(\phi_3|Z_i, Z_j) = O_p\Big(\frac{1}{N} \int K_{hi}(z) K_{hj}(z) \sigma_\theta(Z_i, Z_j) \, d\varphi(z)\Big), \qquad E(\phi_3|Z_i) = O_p\Big(\frac{1}{N} \int K_{hi}(z) \sigma_\theta(Z_i, z) f(z) \, d\varphi(z)\Big),$$
$$E(\phi_3) = O_p\big(A_N(\theta)\big), \qquad E(\phi_3|\eta_k) = O_p\Big(\frac{1}{N} \int [m_\theta(z + \eta_k) - H_\theta(z)]^2 f_Z^2(z) \, d\varphi(z)\Big),$$
$$E(\phi_3|Z_i, \eta_k) = O_p\Big(\frac{1}{N} \int K_{hi}(z) [m_\theta(Z_i + \eta_k) - H_\theta(Z_i)][m_\theta(z + \eta_k) - H_\theta(z)] f_Z(z) \, d\varphi(z)\Big).$$
Furthermore, the second moments of the above projections can be derived:
$$E\phi_3^2 = O(N^{-2} h^{-2p}), \qquad E[E(\phi_3|Z_i, Z_j)]^2 = O(N^{-2} h^{-2p}), \qquad E[E(\phi_3|\eta_k)]^2 = O(N^{-2}),$$
$$E[E(\phi_3|Z_i)]^2 = O(N^{-2} h^{-p}), \qquad E[E(\phi_3|Z_i, \eta_k)]^2 = O(N^{-2} h^{-p}).$$
Then Lemma 2.6.4 yields that
$$\mathrm{Var}(Q_3) = O\Big( \frac{4}{n} \mathrm{Var}(E(\phi_3|Z_1)) + \frac{4}{N} \mathrm{Var}(E(\phi_3|\eta_1)) \Big) = O\Big( \frac{1}{n h^p N^2} + \frac{1}{N^3} \Big).$$
Hence $N(Q_3 - E\phi_3) = o_p(1)$ for sufficiently large $N$, and $E\phi_3 = O(1/N)$. Therefore
$$Q_3 = Q_3 - E\phi_3 + E\phi_3 = E\phi_3 + o_p(1/N) = A_N(\theta) + o_p(1/N). \qquad (3.41)$$
The same routine argument and Lemma 2.6.4, applied to $Q_4$, lead to
$$\mathrm{Var}(Q_4) = O\Big( \frac{1}{(n h^p)^2 N^2} \Big) \quad \text{and} \quad EQ_4 = 0. \qquad (3.42)$$
Combining the results (3.39)–(3.42) for the components of $W_n^*$, one sees that $Q_2$ dominates the convergence rate of $W_n$ and that only $Q_3$ contributes to the mean of $W_n$ asymptotically, which in turn yields (3.10).

Proof of Lemma 3.3.2. KS have shown that $\tilde\theta_n = T(H) + o_p(1)$ by proving that
$$\sup_{\theta \in \Theta} |\tilde M_n(\theta) - \rho(H, H_\theta)| = o_p(1), \qquad (3.43)$$
where $\tilde M_n$ denotes their version of the minimum distance criterion. In the current setup, if we show that
$$\sup_{\theta \in \Theta} |M_n(\theta) - \tilde M_n(\theta)| = o_p(1), \qquad (3.44)$$
then, by (3.43),
$$\sup_{\theta \in \Theta} |M_n(\theta) - \rho(H, H_\theta)| = o_p(1).$$
Then arguing as in KS yields the lemma.

Proof of (3.44). By the Cauchy–Schwarz inequality,
$$|M_n(\theta) - \tilde M_n(\theta)| \le W_n(\theta) + 2[W_n(\theta) \tilde M_n(\theta)]^{1/2}.$$
It therefore suffices to show that $\sup_\theta |W_n(\theta)| = o_p(1)$ and $\sup_\theta |\tilde M_n(\theta)| = O_p(1)$. The compactness of $\Theta$ and $H_\theta \in L_2(G)$ imply that $\sup_\theta |\rho(H, H_\theta)|$ is finite. Furthermore, (3.43) shows that $\sup_\theta |\tilde M_n(\theta)| = O_p(1)$.

Now we study $W_n(\theta)$. By Lemma 3.3.1, $W_n(\theta) = o_p(1)$ for every $\theta \in \Theta$. Moreover, for any $\theta_1, \theta_2 \in \Theta$,
$$|W_n(\theta_1) - W_n(\theta_2)| \le \Big( \int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) \{\hat H_{\theta_1}(Z_i) - H_{\theta_1}(Z_i) + \hat H_{\theta_2}(Z_i) - H_{\theta_2}(Z_i)\} \Big]^2 d\hat\varphi(z) \Big)^{1/2}$$
$$\times \Big( \int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) \{\hat H_{\theta_1}(Z_i) - H_{\theta_1}(Z_i) - \hat H_{\theta_2}(Z_i) + H_{\theta_2}(Z_i)\} \Big]^2 d\hat\varphi(z) \Big)^{1/2}.$$
The first factor on the right hand side above is $O_p(1)$, due to the boundedness of $H_\theta$ and the compactness of $\Theta$. Similar to the proof on p. 143 of KS, the square of the second factor is bounded above by
$$2 \int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) \{\hat H_{\theta_1}(Z_i) - \hat H_{\theta_2}(Z_i)\} \Big]^2 d\hat\varphi(z) + 2 \int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) \{H_{\theta_1}(Z_i) - H_{\theta_2}(Z_i)\} \Big]^2 d\hat\varphi(z). \qquad (3.45)$$
The first term in (3.45) can be rewritten as 2 times the factor
$$\int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) \{\hat H_{\theta_1}(Z_i) - H_{\theta_1}(Z_i) + H_{\theta_1}(Z_i) - H_{\theta_2}(Z_i) + H_{\theta_2}(Z_i) - \hat H_{\theta_2}(Z_i)\} \Big]^2 d\hat\varphi(z)$$
$$\le 3\Big( W_n(\theta_1) + \int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) \{H_{\theta_1}(Z_i) - H_{\theta_2}(Z_i)\} \Big]^2 d\hat\varphi(z) + W_n(\theta_2) \Big)$$
$$= 3 \int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) \{H_{\theta_1}(Z_i) - H_{\theta_2}(Z_i)\} \Big]^2 d\hat\varphi(z) + o_p(1).$$
The last claim holds because $N A_N(\theta) = O(1)$ and hence, by Lemma 3.3.1, $W_n(\theta) = o_p(1)$ for all $\theta \in \Theta$. Then, by (H3), the bound (3.45) is further bounded from above by
$$8 \int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) \{H_{\theta_1}(Z_i) - H_{\theta_2}(Z_i)\} \Big]^2 d\hat\varphi(z) + o_p(1) \le 8 \|\theta_1 - \theta_2\|^{2\beta} \int \Big[ \frac{1}{n} \sum_{i=1}^{n} K_{hi}(z) r(Z_i) \Big]^2 d\hat\varphi(z) + o_p(1) = \|\theta_1 - \theta_2\|^{2\beta} O_p(1),$$
by (3.8) applied with $\alpha = r$. This result and the compactness of $\Theta$ imply, by a routine argument, that $\sup_{\theta \in \Theta} |W_n(\theta)| = o_p(1)$.

BIBLIOGRAPHY

Abarin, T. and Wang, L. (2009). Second-order least squares estimation of censored regression models. Journal of Stat. Plann. and Infer., 139, 125–135.

Amemiya, T. (1984). Tobit models: A survey. Journal of Econometrics, 24, 3–61.

Berkson, J. (1950). Are there two regressions? J. Amer. Statist. Assoc., 45, 164–180.

Bhattacharya, P. K., Chernoff, H. and Yang, S. S. (1983). Nonparametric estimation of the slope of a truncated regression. Ann. Statist., 11, 505–514.

Bierens, H. (1987). Kernel estimators of regression functions. Advances in Econometrics, Fifth World Congress, Vol. 1, Bewley (ed.), 99–144.

Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes, 2nd Edition. Springer, Berlin.

Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition. Chapman and Hall, London.

Cheng, C. L. and Van Ness, J. W. (1999). Statistical Regression with Measurement Error. John Wiley & Sons.

Delaigle, A., Hall, P. and Qiu, P. (2006). Nonparametric methods for solving the Berkson errors-in-variables problem. J. R. Statist. Soc. B, 68(2), 201–220.

Du, L., Zou, C. and Wang, Z. (2011). Nonparametric regression function estimation for errors-in-variables models with validation data. Statistica Sinica, 21, 1093–1113.

Fuller, W. A. (1987). Measurement Error Models. John Wiley & Sons.

González-Manteiga, W. and Crujeiras, R. M. (2013). An updated review of goodness-of-fit tests for regression models. Test, 22, 361–411.

Hall, P. (1984). Central limit theorem for integrated square error of multivariate nonparametric density estimators. J. Multivariate Anal., 14, 1–16.

Härdle, W., Marron, J. S. and Wand, M. P. (1990). Bandwidth choice for density derivatives. J. R. Statist. Soc. B, 52(1), 223–232.

Huwang, L. and Huang, Y. H. (2000). On errors-in-variables in polynomial regression – Berkson case. Statistica Sinica, 10, 923–936.

Jones, M. C. and Signorini, D. F. (1997). A comparison of higher-order bias kernel density estimators. J. Amer. Statist. Assoc., 92(439), 1063–1073.
Kim, K. H., Chao, S. and Härdle, W. K. (2016). Simultaneous inference for the partially linear model with a multivariate unknown function when the covariates are measured with errors. SFB 649 Discussion Paper 2016-024.

Koul, H. L. and Ni, P. (2004). Minimum distance regression model checking. Journal of Stat. Plann. and Infer., 119, 109–141.

Koul, H. L. and Song, W. (2009). Minimum distance regression model checking with Berkson measurement errors. Annals of Statistics, 37(1), 132–156.

Koul, H. L., Song, W. and Liu, S. (2014). Model checking in Tobit regression via nonparametric smoothing. J. Multivariate Anal., 125, 36–49.

Lee, L. F. and Sepanski, J. H. (1995). Estimation of linear and nonlinear errors-in-variables models using validation data. Journal of the American Statistical Association, 90(429), 130–140.

Mack, Y. P. and Silverman, B. W. (1982). Weak and strong uniform consistency of kernel regression estimates. Z. Wahrsch. Verw. Gebiete, 61, 405–415.

Schennach, S. M. (2013). Regressions with Berkson errors in covariates – a nonparametric approach. The Annals of Statistics, 41(3), 1642–1668.

Sepanski, J. H. and Carroll, R. J. (1993). Semiparametric quasilikelihood and variance function estimation in measurement error models. J. Econometrics, 58, 223–256.

Sepanski, J. H. and Lee, L. (1995). Semiparametric estimation of nonlinear errors-in-variables models with validation study. Journal of Nonparametric Statistics, 4(4), 365–394.

Serfling, R. J. (1981). Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ.

Song, W. (2008). Model checking in errors-in-variables regression. J. Multivariate Anal., 99, 2406–2443.

Song, W. (2009). Lack-of-fit testing in errors-in-variables regression model with validation data. Statist. Probab. Lett., 79, 765–773.

Song, W. (2011). Distribution-free test in Tobit mean regression model. Journal of Stat. Plann. and Infer., 141, 2891–2901.

Song, W. and Yao, W. (2011). A lack-of-fit test in Tobit errors-in-variables regression models. Statist. Probab. Lett., 81, 1792–1801.

Stute, W., Thies, S. and Zhu, L. (1998). Model checks for regression: an innovation process approach. Ann. Statist., 26, 1916–1934.

Stute, W., Xue, L. and Zhu, L. (2007). Empirical likelihood inference in nonlinear errors-in-covariables models with validation data. J. Amer. Statist. Assoc., 102, 332–346.

Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica, 26(1), 24–36.

Wang, L. (1998). Estimation of censored linear errors-in-variables models. J. Econometrics, 84, 383–400.

Wang, L. (2004). Estimation of nonlinear models with Berkson measurement errors. Annals of Statistics, 32, 2559–2579.

Wang, L. (2007). A simple nonparametric test for diagnosing nonlinearity in Tobit median regression model. Statist. Probab. Lett., 77, 1034–1042.

Wang, Q. and Rao, J. N. K. (2002). Empirical likelihood-based inference in linear errors-in-covariables models with validation data. Biometrika, 89(2), 345–358.

Zheng, J. X. (1996). A consistent test of functional form via nonparametric estimation techniques. J. Econometrics, 75(2), 263–289.