ESSAYS IN ECONOMETRICS

By

Otávio Augusto Camargo Bartalotti

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Economics
2012

ABSTRACT

ESSAYS IN ECONOMETRICS

By

Otávio Augusto Camargo Bartalotti

This dissertation is divided into three self-contained chapters. The first extends the GMM redundancy results of Prokhorov and Schmidt (2009) to nonsmooth objective functions, giving sharp guidelines for obtaining efficient estimates of parameters of interest (β_o) in the presence of nuisance parameters (γ_o). One-step GMM estimation of both sets of parameters is asymptotically more efficient than two-step procedures. These results are applied to Wooldridge's (2007) inverse probability weighted (IPW) estimator, generalizing the framework to deal with missing data in this context. Even though two-step estimation of β_o is more efficient than using known probabilities of selection, it is in turn dominated by a one-step joint estimation procedure. Examples for quantile regression with missing data and instrumental variable quantile regression are provided.

The second chapter analyzes the asymptotic distribution of local polynomial estimators in the context of regression discontinuity designs. The standard "small-h" approach in the literature (Hahn et al., 2001; Porter, 2003; Imbens and Lemieux, 2008; Lee and Lemieux, 2009) is to assume that the bandwidth, h, around the discontinuity shrinks towards zero as the sample size increases. However, in practice, the researcher has to choose an h > 0 to implement the estimator. This chapter derives the fixed-h asymptotic distribution, which allows the bandwidth to be positive, providing refined approximations for the estimator's behavior. When h > 0, the small-h asymptotic variance is equivalent to assuming that the density of the running variable and the conditional variance of the dependent variable are constant around the cutoff.
Simulations provide evidence that fixed-h asymptotic distributions better describe the behavior of both the bias and the variance of the estimator, leading to improved inference. Estimators for fixed-h standard errors are proposed and incorporate the theoretical gains of the improved approximations. The fixed-h variance estimators improve markedly over small-h estimators in the presence of some forms of heteroskedasticity. Interestingly, in the special case of homoskedastic errors using a local linear estimator, the variance estimators based on small-h asymptotics produce tests with size similar to the fixed-h variance estimators proposed in this chapter.

Chapter 3 develops the asymptotic properties of quantile regression estimators under standard stratified sampling, following Wooldridge (2001). Formulas for the asymptotic variance and feasible estimators are provided. Under exogenous stratification the usual quantile regression estimators and standard errors are still valid.

To Beatriz.

ACKNOWLEDGMENTS

"We have not journeyed all this way across the centuries, across the oceans, across the mountains, across the prairies, because we are made of sugar candy." Winston Churchill

I would like to acknowledge the great help and support given to this endeavor by the Michigan State University faculty. Especially my advisor, Jeff Wooldridge, who supported me in my goals even before I started the Ph.D. program and kept me upbeat and motivated through the toughest moments; I would not have been able to accomplish this without his guidance, help and friendship. Also, Tim Vogelsang, who was fundamental to every part of this dissertation through his insights, truthfulness and great openness to my ideas. I cannot express enough thanks to Gary Solon, who always kept his door open and a great interest in my research, providing insight, guidance and great advice in the five years I spent at MSU's Economics Department.
Peter Schmidt, always sharp in his comments and open-minded, provided formidable advice that greatly improved this dissertation. I am also thankful to several other great faculty, such as Todd Elder, Steve Haider, Matias Cattaneo (at the University of Michigan) and many others. I cannot forget to thank Carlos Pereira for believing in me when few would. Finally, none of this would have been achieved without the support of Maria Carolina Leme, who guided me through my first steps in academia and still teaches me so much to this day.

My dissertation and life have been made better by great friends, their inputs, ideas and long coffee breaks: Steve Dieterle, Quentin Brummet, Thomas Fujiwara, Ilya Rahkovsky, Max Melstrom, Breno Braga, Valentin Verdier, Maksym Ivanyna, Elizabeth Quin, Ehren Schuttringer, Cuicui Lu, Stacy Miller, Iraj Rahmani, Paul Burkander, Hassan Enayati, Laura Rees, Jason Stornelli, Adithya Pattabhiramaiah, among others.

I thank my parents Bete and Flávio and my stepmother Hilda for their love and support throughout this journey and for embedding in my formation the desire for knowledge that drove me to this career. Also my sister Laís, whom I love and who helps me keep things in perspective. To my grandparents for the great life example they provide, and especially to Antonio Bartalotti, who did not live to see this day but is in my thoughts and heart every day. Finally, I acknowledge the courage, help and love I have received from my wife, Beatriz, who shared with me every moment of the path and was understanding and reassuring through my achievements and failures.

TABLE OF CONTENTS

LIST OF FIGURES

1 GMM Efficiency and IPW for Nonsmooth Functions
  1.1 Introduction
  1.2 General Estimation Problem
  1.3 Estimation with missing data
    1.3.1 Data Selection under Ignorability
    1.3.2 Data Selection under Exogeneity of Selection
  1.4 Examples
    1.4.1 Quantile Regression under Ignorability of Selection
    1.4.2 Instrumental Variable Quantile Regression
  1.5 Conclusion

2 Fixed Bandwidth Asymptotics for Regression Discontinuity Designs
  2.1 Introduction
  2.2 Model
    2.2.1 Sharp Regression Discontinuity Design
    2.2.2 Fuzzy Regression Discontinuity Design
  2.3 Estimators
  2.4 Assumptions
  2.5 Asymptotic Distributions
    2.5.1 Fuzzy Regression Discontinuity Design
  2.6 Variance Estimators
  2.7 Simulations
    2.7.1 Simulations for Infeasible Inference
    2.7.2 Simulations for Feasible Inference
  2.8 Conclusion

3 Asymptotic Properties of Quantile Regression for Standard Stratified Samples
  3.1 Introduction
  3.2 The Quantile Regression Population Problem
    3.2.1 Quantile Regression under Stratified Sampling
    3.2.2 Quantile Regression Estimation under Exogenous Stratification
    3.2.3 Sequence of Quantile Regressions
  3.3 Asymptotic Variance Estimation
  3.4 Conclusion

APPENDICES

A Proofs to "GMM Efficiency and IPW for Nonsmooth Functions"
B Figures to "Fixed Bandwidth Asymptotics for Regression Discontinuity Designs"
  B.1 Simulations for Infeasible Inference
    B.1.1 Nadaraya-Watson Estimator
    B.1.2 Local Linear Estimator
    B.1.3 Heteroskedastic Errors
  B.2 Simulations for Feasible Inference
    B.2.1 Local Linear Estimator
    B.2.2 Heteroskedastic Errors
    B.2.3 Bandwidth Choice for f_o(x)
C Proofs to "Fixed Bandwidth Asymptotics for Regression Discontinuity Designs"
D Proofs to "Asymptotic Properties of Quantile Regression for Standard Stratified Samples"

BIBLIOGRAPHY

LIST OF FIGURES

B.1 Nadaraya-Watson Estimator - DGP: No X - Homosk. Errors
B.2 Nadaraya-Watson Estimator - DGP: Linear - Homosk. Errors
B.3 Nadaraya-Watson Estimator - DGP: Linear - Bias Corrected - Homosk. Errors
B.4 Nadaraya-Watson Estimator - DGP: Linear - Comparison - Homosk. Errors
B.5 Local Linear Estimator - DGP: Linear - Homosk. Errors
B.6 Local Linear Estimator - DGP: Quadratic - Homosk. Errors
B.7 Local Linear Estimator - DGP: Quadratic - Bias Corrected - Homosk. Errors
B.8 Local Linear Estimator - DGP: Quadratic - Comparison - Homosk. Errors
B.9 Local Linear Estimator - DGP: Linear - Heterosk. Errors Case 1
B.10 Local Linear Estimator - DGP: Quadratic - Heterosk. Errors Case 1
B.11 Local Linear Estimator - DGP: Linear - Heterosk. Errors Case 2
B.12 Local Linear Estimator - DGP: Quadratic - Heterosk. Errors Case 2
B.13 Nadaraya-Watson Estimator - DGP: No X - Feasible - Homosk. Errors
B.14 Nadaraya-Watson Estimator - DGP: Linear - Feasible - Homosk. Errors
B.15 Local Linear Estimator - DGP: Linear - Feasible - Homosk. Errors
B.16 Local Linear Estimator - DGP: Quadratic - Feasible - Homosk. Errors
B.17 Local Linear Estimator - DGP: Linear - Feasible - Heterosk. Errors Case 1
B.18 Local Linear Estimator - DGP: Quadratic - Feasible - Heterosk. Errors Case 1
B.19 Local Linear Estimator - DGP: Linear - Feasible - Heterosk. Errors Case 2
B.20 Local Linear Estimator - DGP: Quadratic - Feasible - Heterosk. Errors Case 2
B.21 Small-h Sensitivity to Density Bandwidth - DGP: Linear
B.22 Small-h Sensitivity to Density Bandwidth - DGP: Quadratic

CHAPTER 1

GMM Efficiency and IPW for Nonsmooth Functions

1.1 Introduction

This chapter extends the analysis of Prokhorov and Schmidt (2009) to the estimation of a general GMM problem with nonsmooth objective functions in which nuisance parameters are present. The framework developed encompasses several interesting problems in econometrics, such as missing data, censored or truncated data, treatment effects and instrumental variables. More importantly, by allowing nonsmooth objective functions, the analysis extends to models that have gained additional importance in recent years, e.g., least absolute deviations (LAD), quantile regression (QR), censored LAD, quantile treatment effects and instrumental variables quantile regression (IVQR).

The core results of this chapter extend the Prokhorov and Schmidt (2009) results on GMM redundancy by allowing the use of nonsmooth objective functions. These results rely on Newey and McFadden (1994) to obtain the asymptotic variance of the GMM estimator under less restrictive assumptions on the smoothness of the objective functions. To that end, consider two sets of moment conditions, where the first includes both the parameters of interest (β_o) and certain nuisance parameters (γ_o), while the second set includes only the nuisance parameters. By defining four competing estimators based on different assumptions regarding the information available about these nuisance parameters and the moment conditions utilized, results about the relative efficiency of each proposed estimator are derived. These results provide guidance for applied work in the presence of nuisance parameters.
As discussed by Prokhorov and Schmidt (2009), joint estimation of the nuisance parameters and the parameters of interest is more efficient than a two-step procedure, or than knowing the true nuisance parameters and disregarding the second set of moment conditions. This is due to the information contained in the correlation between the two sets of moment conditions, which is useful even when γ_o is known. Using only the first set of moment conditions and the known values of γ_o in the estimation procedure ignores the additional information embedded in the second set of moment conditions. These results are shown to hold when the objective functions are nonsmooth.

The general results are directly applicable to missing data problems and encompass Wooldridge's (2002b, 2007) analysis of inverse probability weighting (IPW) estimators, extending its use to nonsmooth objective functions under the usual "ignorability" assumptions about the selection process. The general estimation results confirm the validity of the puzzle described by Wooldridge (2007), i.e., that it is better (in an efficiency sense) to estimate the selection probabilities, even if the latter are known. In other terms, we obtain more efficient estimates of β_o if we estimate γ_o than if we use the true γ_o. This result is "puzzling" because knowledge of γ_o, if properly exploited, cannot be harmful. Previous work has discussed this result, such as Wooldridge (2002b, 2007) in the context of IPW. Hirano et al. (2003), Hitomi et al. (2008) and Prokhorov and Schmidt (2009) addressed the problem for the smooth objective function case.
Even though this issue has been considered by Chen, Hong, and Tarozzi (2008) in a semiparametric context with nonsmooth objective functions, the parametric approach proposed here provides, as a novelty, the conditions under which this puzzle is valid and, furthermore, shows that the two-step estimator is usually dominated by a one-step joint estimation procedure that uses both the weighted moment conditions and the conditions associated with the selection model.

Several papers have been devoted to general theories of estimation in settings where nonsmooth objective functions are allowed, following Daniels (1961) and Huber (1967). Studies that allow for estimation of models based on nonsmooth objective functions include, among others, Pollard (1985), Pakes and Pollard (1989) and Newey and McFadden (1994, section 7). Recent studies have approached the problem of nonsmoothness with a focus on semiparametric models; see Chen, Linton, and Van Keilegom (2003) for a general estimation approach, Chen et al. (2008) for an approach to missing data problems with a nonparametric first stage, and Cattaneo (2010) for the estimation of multi-valued treatment effects in a semiparametric framework.

The remainder of the chapter is organized as follows. Section 1.2 sets up the general GMM framework used in the analysis and presents results regarding efficiency and redundancy of the proposed estimators, as well as estimators for the asymptotic variances of the estimated parameters. Section 1.3 studies the IPW approach to missing data problems proposed by Wooldridge (2002b, 2007), extending its scope to nonsmooth objective functions. Section 1.4 provides examples of the use of the proposed framework: first, a model for the conditional quantile in a context with missing data; second, a simplified IVQR model as proposed by Chernozhukov and Hansen (2005, 2006). Section 1.5 concludes.
1.2 General Estimation Problem

Let ω* ∈ Q* ⊂ R^{dim(ω*)} be a random vector and θ ∈ Θ ⊂ R^P a parameter vector, where Θ is a compact set, and consider the population condition

g_o(θ_o) = E[g(ω*, θ_o)] = 0    (1.1)

where g : Q* × Θ → R^m is a vector of known real-valued moment functions. Newey and McFadden (1994) have shown consistency and asymptotic normality of the generalized method of moments (GMM) estimator that minimizes a squared Euclidean distance of the random sample analogues of the population moments, g_n(θ) = n^{-1} Σ_{i=1}^n g(ω_i*, θ), from their population counterparts (which equal zero). I am interested in the case in which the moment functions, g(·), are allowed to be nonsmooth, so that a wider range of interesting problems can be handled. The GMM estimator minimizes the objective function

g_n(θ)' Ŵ g_n(θ)    (1.2)

where Ŵ converges in probability to W, the appropriate positive semidefinite weighting matrix. Assume ω_i*, i = 1, ..., n, are i.i.d. Two useful results from Newey and McFadden (1994) will be used to derive the asymptotic variance of the estimators. The first concerns the consistency of the GMM estimator.

Theorem 1 (Newey and McFadden, 1994, Theorem 2.6) Let ||·|| denote the Euclidean norm. Suppose that:
(i) ω_i* are i.i.d. for i = 1, 2, ...;
(ii) Ŵ →p W;
(iii) W is positive semidefinite and W E[g(ω*, θ)] = 0 only if θ = θ_o;
(iv) θ_o ∈ Θ ⊂ R^P, and Θ is compact;
(v) g(ω*, θ) is continuous at each θ with probability one;
(vi) E[sup_{θ∈Θ} ||g(ω*, θ)||] < ∞.
Then θ̂ →p θ_o.

This result relies on relatively weak conditions and allows for discontinuities in the objective function. The second theorem establishes the asymptotic normality of the GMM estimator under a certain form of nonsmoothness of the objective function.
Theorem 2 (Newey and McFadden, 1994, Theorem 7.2) Suppose that:
(i) g_n(θ̂)' Ŵ g_n(θ̂) ≤ inf_{θ∈Θ} g_n(θ)' Ŵ g_n(θ) + o_p(n^{-1});
(ii) θ̂ →p θ_o;
(iii) Ŵ →p W, and W is positive semidefinite;
(iv) g_o(θ_o) = 0;
(v) g_o(θ) is differentiable at θ_o with derivative G such that G'WG is nonsingular;
(vi) θ_o is an interior point of Θ;
(vii) √n g_n(θ_o) →d N(0, Σ);
(viii) for any δ_n → 0,

sup_{||θ−θ_o|| ≤ δ_n}  √n ||g_n(θ) − g_n(θ_o) − g_o(θ)|| / [1 + √n ||θ − θ_o||]  →p 0.

Then √n(θ̂ − θ_o) →d N(0, (G'WG)^{-1} G'W Σ W G (G'WG)^{-1}).

As shown by Pollard (1985), the differentiability of the objective function g(ω_i*, θ) can be replaced by the differentiability of g_o(θ) for the purpose of obtaining the asymptotic normality of these estimators. As emphasized by Newey and McFadden (1994), the key condition allowing for nonsmooth objective functions is condition (viii), a "stochastic equicontinuity" assumption that guarantees uniform convergence in probability of the linear approximation of g_o(θ) by g_n(θ) in a shrinking neighborhood of θ_o. This is similar to the stochastic differentiability condition in Pollard (1985), and primitive conditions are available in Pollard (1985), Andrews (1994) and Chen et al. (2003). Those simplify the task of checking its validity for specific moment functions; however, this is beyond the scope of this work and I refer the reader to those papers.

Suppose that θ can be partitioned into subsets of parameters (β', γ')' ∈ B × Γ ⊂ R^{p_1} × R^{p_2} and that g(·) can be partitioned into subsets of functions (g_1(·)', g_2(·)')' as defined below. For notational convenience, ω* is suppressed in the following discussion. Then

E[g_1(β_o, γ_o)] = 0    (1.3)

E[g_2(γ_o)] = 0    (1.4)

where β ∈ B, γ ∈ Γ, and g_1(·) and g_2(·) are m_1- and m_2-vectors of known functions, respectively (m = m_1 + m_2). Note that the second set of moment conditions does not depend on β, while the first set of moment conditions depends on the full parameter vector θ.
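To fix ideas, here is a minimal numerical sketch (illustrative only, not taken from the chapter) of a GMM problem with a nonsmooth moment function: the median moment g(ω, θ) = 1{ω ≤ θ} − 1/2 satisfies the population condition 1.1 at the true median, and the sample criterion 1.2 is piecewise constant in θ, so it is minimized here by a search over the sample points rather than by a derivative-based routine.

```python
import numpy as np

def gmm_objective(theta, omega, g, W):
    """Sample GMM criterion g_n(theta)' W g_n(theta), as in eq. (1.2)."""
    gn = g(omega, theta).mean(axis=0)   # n^{-1} sum_i g(omega_i, theta)
    return gn @ W @ gn

def g_median(omega, theta):
    """Nonsmooth median moment: 1{omega <= theta} - 0.5, a step function in theta."""
    return (omega <= theta).astype(float).reshape(-1, 1) - 0.5

rng = np.random.default_rng(0)
omega = rng.normal(loc=2.0, scale=1.0, size=1000)   # true median = 2.0

# The criterion is piecewise constant, so gradient-based optimizers fail;
# searching over the sample points finds a minimizer.
W = np.eye(1)
candidates = np.sort(omega)
values = [gmm_objective(t, omega, g_median, W) for t in candidates]
theta_hat = candidates[int(np.argmin(values))]
```

With ω ~ N(2, 1), the minimizer recovers the sample median, illustrating that consistency (Theorem 1) does not require a smooth objective.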
Let g_n1(θ) = n^{-1} Σ_{i=1}^n g_1(ω_i*, θ) and g_n2(γ) = n^{-1} Σ_{i=1}^n g_2(ω_i*, γ). The framework developed here is valid for the general case of overidentification, i.e., m_1 ≥ p_1 and m_2 ≥ p_2. This guarantees that γ_o is identified by 1.4 alone and that, for a given γ, β_o can be identified by 1.3 alone; hence, two-step estimation is possible. Let the asymptotic covariance matrix of the moment functions, Σ, be defined as

Σ = V[g(θ_o)] ≡ \begin{pmatrix} C_11 & C_12 \\ C_21 & C_22 \end{pmatrix}

where we assume Σ is finite and nonsingular, so its inverse exists:

Σ^{-1} ≡ \begin{pmatrix} C^11 & C^12 \\ C^21 & C^22 \end{pmatrix} = \begin{pmatrix} C_11^{-1}(I + C_12 E^{-1} C_21 C_11^{-1}) & −C_11^{-1} C_12 E^{-1} \\ −E^{-1} C_21 C_11^{-1} & E^{-1} \end{pmatrix}

for E ≡ C_22 − C_21 C_11^{-1} C_12; since Σ (and Σ^{-1}) is symmetric, C_12 = C_21' and the second equality holds (see White, 1984, p. 80). Define the matrix of derivatives as

G ≡ ∇_θ g_o(θ_o) = ∇_θ E[g(θ_o)] ≡ \begin{pmatrix} G_11 & G_12 \\ 0 & G_22 \end{pmatrix}

with G_11 ≡ ∇_β E[g_1(β_o, γ_o)], G_12 ≡ ∇_γ E[g_1(β_o, γ_o)] and G_22 ≡ ∇_γ E[g_2(γ_o)], where the lower off-diagonal block equals zero since the second set of moment conditions does not depend on β.

Following Prokhorov and Schmidt (2009), define four different possible GMM estimators that differ in which moment conditions are used and/or whether γ is treated as known.

Definition 1 Call the estimator of θ_o that minimizes

g_n(θ)' Ŵ g_n(θ)    (1.5)

with the weighting matrix Ŵ = Σ^{-1} the ONE-STEP estimator.

This is the usual GMM estimator that uses all the available orthogonality conditions jointly to estimate β_o and γ_o.

Definition 2 Call the estimator of β_o that minimizes

g_n1(β, γ_o)' C_11^{-1} g_n1(β, γ_o)    (1.6)

with γ_o treated as known the KNOW-γ estimator.

This estimator ignores the second set of orthogonality conditions 1.4, treating γ_o as a known vector of parameters and estimating β_o using only the information available in the first set of moment conditions.
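A sketch of the ONE-STEP estimator of Definition 1 on a hypothetical toy model (all variable names and moments are illustrative, not from the chapter): g_1 contains two moments involving both β and γ, g_2 identifies the nuisance γ alone, and the stacked moments are weighted by an estimate of Σ^{-1} computed from a preliminary first-round fit, as in 1.5.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 2000
w = rng.normal(1.0, 1.0, n)                          # gamma_o = E[w] = 1
x = rng.normal(0.0, 1.0, n)
y = 0.5 * x + (w - 1.0) + rng.normal(0.0, 0.3, n)    # beta_o = 0.5

def g_i(theta):
    """Stacked moment contributions g(omega_i, theta), one row per observation."""
    beta, gamma = theta
    resid = y - beta * x - (w - gamma)
    return np.column_stack([resid, resid * x,        # g1(beta, gamma)
                            w - gamma])              # g2(gamma)

def criterion(theta, W):
    gn = g_i(theta).mean(axis=0)                     # sample analogue, eq. (1.5)
    return gn @ W @ gn

# First round with identity weighting; second round with W = Sigma_hat^{-1}.
theta1 = minimize(criterion, x0=[0.0, 0.0], args=(np.eye(3),)).x
gmat = g_i(theta1)
Sigma_hat = gmat.T @ gmat / n
theta_hat = minimize(criterion, x0=theta1, args=(np.linalg.inv(Sigma_hat),)).x
beta_hat, gamma_hat = theta_hat
```

The same ingredients give the sequential TWO-STEP estimator of Definition 4: minimize the g_2 block alone for γ̂, then the g_1 block with γ̂ plugged in.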
This could arise if one has information about the true values of γ_o, or if the researcher disregards the fact that γ_o was estimated in a first stage, and hence its variability, in what could be called a "naive" estimator.

Definition 3 Call the estimator of β_o that minimizes

g_n(β, γ_o)' Σ^{-1} g_n(β, γ_o)    (1.7)

with γ_o treated as known the KNOW-γ-JOINT estimator.

This is the GMM estimator for β_o in the form considered by Qian and Schmidt (1999). In this case, one has information about the true values of γ_o but still uses both sets of moment conditions in obtaining an estimate of β_o.

Definition 4 Call the estimator of θ_o obtained in the following fashion the TWO-STEP estimator:

(i) the estimator γ̂ is obtained by minimizing

g_n2(γ)' C_22^{-1} g_n2(γ)    (1.8)

(ii) the estimator β̂ is obtained by minimizing

g_n1(β, γ̂)' C_11^{-1} g_n1(β, γ̂)    (1.9)

with γ̂ treated as given.

This is the sequential estimator that uses only the second set of moment conditions 1.4 to obtain a consistent estimator of the unknown parameter vector γ_o, and then uses only the first set of moment conditions 1.3 to obtain the estimator of β_o. This estimator is widely used in the applied economics literature and encompasses several common problems.

The estimators defined above depend on a known Σ. In practice, Σ is not known and has to be replaced by an initial consistent estimate. To compare the properties of these different estimators we need to obtain their asymptotic variances. These follow directly from Theorem 2.

Theorem 3 Let V_ONE-STEP, V_KNOW-γ, V_KNOW-γ-JOINT and V_TWO-STEP denote the asymptotic variances of ONE-STEP, KNOW-γ, KNOW-γ-JOINT and TWO-STEP, respectively. Then, under the conditions described in Theorems 1 and 2,
V_ONE-STEP = (G' Σ^{-1} G)^{-1}    (1.10)

V_KNOW-γ = (G_11' C_11^{-1} G_11)^{-1}    (1.11)

V_KNOW-γ-JOINT = (G_11' C^11 G_11)^{-1}    (1.12)

V_TWO-STEP = B Σ B'    (1.13)

where

B = \begin{pmatrix} B_11 & B_12 \\ 0 & B_22 \end{pmatrix}

with

B_11 = −(G_11' C_11^{-1} G_11)^{-1} G_11' C_11^{-1}

B_12 = (G_11' C_11^{-1} G_11)^{-1} G_11' C_11^{-1} G_12 (G_22' C_22^{-1} G_22)^{-1} G_22' C_22^{-1}

B_22 = −(G_22' C_22^{-1} G_22)^{-1} G_22' C_22^{-1}

Proof. All proofs are provided in the appendix.

It is possible to analyze the relative asymptotic efficiency of these estimators.¹

Theorem 4 For the estimators defined above as ONE-STEP, KNOW-γ, KNOW-γ-JOINT and TWO-STEP, with asymptotic variances given by 1.10, 1.11, 1.12 and 1.13, respectively, the following statements hold:

1. KNOW-γ-JOINT is no less efficient than ONE-STEP, KNOW-γ and TWO-STEP for β_o.
2. If C_12 = 0, then KNOW-γ-JOINT and KNOW-γ are equally efficient for β_o.
3. If G_12 = 0, then TWO-STEP and KNOW-γ are equally efficient for β_o.
4. If C_12 = 0 and G_12 = 0, then ONE-STEP, KNOW-γ, KNOW-γ-JOINT and TWO-STEP are equally efficient for β_o, and ONE-STEP and TWO-STEP are equally efficient for γ_o.
5. ONE-STEP is no less efficient than TWO-STEP.
6. If m_1 = p_1, then the ONE-STEP and TWO-STEP estimates of γ_o are equal.
7. If m_1 = p_1 and m_2 = p_2, then the ONE-STEP and TWO-STEP estimates are equal for both β_o and γ_o.
8. If m_1 = p_1 and C_12 = 0, then the ONE-STEP and TWO-STEP estimates are equally efficient for both β_o and γ_o.
9. If G_12 = C_12 C_22^{-1} G_22, then KNOW-γ-JOINT and ONE-STEP are equally efficient for β_o.
10. If G_12 = C_12 C_22^{-1} G_22, then ONE-STEP, KNOW-γ-JOINT and TWO-STEP are no less efficient for β_o than KNOW-γ.

¹ I denote the asymptotic variance of θ̂ as V, meaning that √n(θ̂ − θ_o) converges in distribution to N(0, V).

The statements that form Theorem 4 are direct extensions of Prokhorov and Schmidt (2009) to the case in which nonsmooth objective functions are allowed. Statement 1 shows, as expected, that KNOW-γ-JOINT dominates the other estimators.
This is an intuitive result, since the known value of γ_o is at least as efficient as any estimate of γ_o, and KNOW-γ-JOINT uses the full set of relevant moment conditions. Statement 2 is the result of Qian and Schmidt (1999), where it is shown that using additional moment conditions that include no unknown parameters (as is the case for KNOW-γ-JOINT) improves efficiency except in the special case in which C_12 = 0. In other words, the second set of moments is redundant in the estimation of β_o; Prokhorov and Schmidt (2009) call this M-redundancy.

Statement 3 gives the condition under which the first-stage estimation of the nuisance parameter γ_o does not affect the asymptotic behavior of the second-stage estimate of β_o. This result is similar to the one shown in Wooldridge (2002a); however, in this case we are dealing with a nonsmooth objective function and, therefore, the restriction G_12 ≡ ∇_γ E[g_1(β_o, γ_o)] = 0 differs from the one proposed by Wooldridge, since the derivative of g_1(β_o, γ_o) is not necessarily available.

Statement 4 provides conditions under which the ONE-STEP, KNOW-γ, KNOW-γ-JOINT and TWO-STEP estimators are equally efficient for β_o; in that case, the use of the additional moment conditions in 1.4 by the ONE-STEP, KNOW-γ-JOINT and TWO-STEP estimators does not improve the precision of the estimated parameters of interest, as in the previous statement, and knowledge of γ_o does not help in estimating β_o. This holds if the two sets of moment conditions are asymptotically uncorrelated (C_12 = 0) and γ is not present in the first set of moment conditions (G_12 = 0).

Statement 5 is the usual result that, in general, sequential estimation procedures are less efficient than joint (one-step) estimation. Statements 6, 7 and 8 follow directly from Ahn and Schmidt (1995) and show that GMM separability holds in the framework that allows nonsmooth objective functions.
The GMM estimates of γ_o are not improved by the inclusion of an equal number of additional moment conditions and parameters. It can be shown that if G_11 is nonsingular, the ONE-STEP estimator for β_o can be written in terms of the ONE-STEP estimator of γ_o using the equation g_n1(β, γ) = C_12 C_22^{-1} g_n2(γ) (see the appendix for details). Thus, as described by Prokhorov and Schmidt (2009), the ONE-STEP and TWO-STEP estimators for β_o will be derived from the same equation as long as g_n2(γ̂) = 0, which will be true under exact identification of γ_o, and will be asymptotically equally efficient if C_12 = 0, since the moment conditions will then be asymptotically uncorrelated, adding nothing to the information set exploited by ONE-STEP relative to TWO-STEP.

Statements 9 and 10 are direct extensions of Prokhorov and Schmidt (2009). Statement 9 says that KNOW-γ-JOINT and ONE-STEP are equally efficient for the estimation of β_o, which means that knowledge of γ_o is not useful in terms of the efficiency of the estimates of β_o if we are using the full set of moment conditions and G_12 = C_12 C_22^{-1} G_22. Statement 10 shows that, under the same condition on G_12, KNOW-γ is dominated by ONE-STEP, KNOW-γ-JOINT and TWO-STEP. This happens because knowledge of γ_o is not useful in the estimation of β_o in this case, and the KNOW-γ estimator does not use the information in the second set of moment conditions, which is useful unless C_12 = 0.

The statements presented in Theorem 4 show that the GMM redundancy results presented by Prokhorov and Schmidt (2009) extend to GMM estimation procedures based on nonsmooth objective functions. Under the conditions of parts 9 and 10 of Theorem 4, the following corollary can be obtained.
Corollary 1 If G_12 = C_12 C_22^{-1} G_22 and G_22 is invertible, then

V(β̂_TWO-STEP) = (G_11' C_11^{-1} G_11)^{-1} G_11' C_11^{-1} D_o C_11^{-1} G_11 (G_11' C_11^{-1} G_11)^{-1}    (1.14)

Additionally, if G_11 is invertible, then

V(β̂_TWO-STEP) = G_11^{-1} D_o (G_11')^{-1}    (1.15)

where

D_o = E[e_i e_i'],   e_i = g_1(ω_i*, θ_o) − C_12 C_22^{-1} g_2(ω_i*, γ_o)

Note that e_i is the residual of the linear projection of the first set of moment conditions on the second set of moment conditions. This result is useful in the estimation of the asymptotic variance of the estimators, as I discuss below. Unfortunately, it applies only if the second set of moment conditions is exactly identified, for formula 1.14, and if both sets of moment conditions are exactly identified, for formula 1.15.

A remaining issue is obtaining estimates of the variance matrices described in Theorem 3. The nonsmoothness of the objective function creates some obstacles for the usual estimation procedures. As described by Lee (2008), the fact that the estimates of the variances depend on the derivative of the expectation of the estimating function in the nonsmooth case warrants a more careful approach to estimating the variances used for inference.

A general approach that works in most cases is offered in Newey and McFadden (1994) and consists of obtaining consistent estimators for the separate components of the variance matrix. For estimating Σ or its relevant components, a standard estimator is available. This procedure can be used in a first step to obtain consistent estimates of the appropriate weighting matrix for the desired estimation procedure.
Σ̂ = n^{-1} Σ_{i=1}^n g(ω_i*, θ̂) g(ω_i*, θ̂)'    (1.16)

Ĉ_11 = n^{-1} Σ_{i=1}^n g_1(ω_i*, θ̂) g_1(ω_i*, θ̂)'    (1.17)

Ĉ_12 = n^{-1} Σ_{i=1}^n g_1(ω_i*, θ̂) g_2(ω_i*, γ̂)'    (1.18)

Ĉ_22 = n^{-1} Σ_{i=1}^n g_2(ω_i*, γ̂) g_2(ω_i*, γ̂)'    (1.19)

Ĉ_21 = Ĉ_12'    (1.20)

To be able to plug these estimates into the equations derived in Theorem 3, we need to obtain estimates of G, which can be difficult due to the nonsmoothness of the objective function. In this approach, an estimate of G is obtained by numerical derivatives. Following Newey and McFadden (1994), let e_j denote the jth unit vector and ε_n a small positive constant that depends on the sample size. Define the estimators of G and its components, column by column, as

Ĝ_j = (2ε_n)^{-1} n^{-1} Σ_{i=1}^n [g(ω_i*, θ̂ + e_j ε_n) − g(ω_i*, θ̂ − e_j ε_n)]

Ĝ_11,j = (2ε_n)^{-1} n^{-1} Σ_{i=1}^n [g_1(ω_i*, β̂ + e_j ε_n, γ̂) − g_1(ω_i*, β̂ − e_j ε_n, γ̂)]

Ĝ_12,j = (2ε_n)^{-1} n^{-1} Σ_{i=1}^n [g_1(ω_i*, β̂, γ̂ + e_j ε_n) − g_1(ω_i*, β̂, γ̂ − e_j ε_n)]

Ĝ_22,j = (2ε_n)^{-1} n^{-1} Σ_{i=1}^n [g_2(ω_i*, γ̂ + e_j ε_n) − g_2(ω_i*, γ̂ − e_j ε_n)]

where the subscript j denotes the jth column of the matrix being estimated. Newey and McFadden (1994, Theorem 7.4) show that if ε_n converges to zero and √n ε_n converges to infinity as n gets larger, these estimators are consistent for the terms of the variances presented in Theorem 3, and plugging them into the formulas provides consistent estimators for the variances of the parameters being estimated.

However, these estimators are cumbersome and not practical. As emphasized by Newey and McFadden, the choice of ε_n is a difficult problem, and the formulation described above, using a single value of ε_n, would be adequate only if the estimated parameters had been scaled to have similar magnitudes. If that is not done, different ε_n would have to be picked for different components. In specific cases, other estimators are available.
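The numerical-derivative estimator Ĝ_j above can be sketched as follows (a hypothetical implementation; the moment function and the value of ε_n are illustrative). With the nonsmooth median moment, the derivative of the expected moment at θ_o is the density of ω there, which the central difference recovers for a suitable ε_n.

```python
import numpy as np

def numerical_G(g, omega, theta_hat, eps):
    """Central-difference estimate of G, one column per parameter, as in
    Newey and McFadden (1994): perturb theta_hat by +/- eps along each
    unit vector e_j and average the moment contributions over the sample."""
    p = len(theta_hat)
    m = g(omega, theta_hat).shape[1]
    G_hat = np.zeros((m, p))
    for j in range(p):
        e_j = np.zeros(p)
        e_j[j] = 1.0
        g_plus = g(omega, theta_hat + eps * e_j).mean(axis=0)
        g_minus = g(omega, theta_hat - eps * e_j).mean(axis=0)
        G_hat[:, j] = (g_plus - g_minus) / (2.0 * eps)
    return G_hat

# Nonsmooth median moment: E[g] is differentiable in theta (derivative =
# density of omega) even though g itself is a step function in theta.
def g_med(omega, theta):
    return (omega <= theta[0]).astype(float).reshape(-1, 1) - 0.5

rng = np.random.default_rng(2)
omega = rng.normal(size=100_000)          # density at the median: 0.3989
G_hat = numerical_G(g_med, omega, np.array([np.median(omega)]), eps=0.3)
```

The value eps=0.3 is ad hoc, echoing the text's caveat that selecting ε_n is the difficult part of this approach.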
As discussed in Newey and McFadden (1994), if $g(\omega^*, \theta)$ is differentiable with probability one, with $\nabla_\theta\, g(\omega^*, \theta)$ continuous at $\theta_o$ with probability one and dominated by an integrable function in a neighborhood of $\theta_o$, then $\hat G = n^{-1}\sum_{i=1}^n \nabla_\theta\, g(\omega_i^*, \hat\theta)$ is a consistent estimator for $G$. Hence the more standard estimator is available and easier to implement. Clearly, alternatives may be available for specific moment conditions. Section 1.4 provides the example of the leading case of IPW for linear quantile regression. Even in this case, calculating the matrix $B$ that appears in the asymptotic variance of the TWO-STEP estimator can be cumbersome. For the cases in which the condition from parts 9 and 10 of Theorem 4 holds, namely $G_{12} = C_{12}C_{22}^{-1}G_{22}$, Corollary 1 offers a different approach to estimating the asymptotic variance (even though we still need to resort to one of the estimators above to obtain $G_{11}$). We can estimate the matrix $E[e_i e_i']$ by regressing the first set of moment conditions on the second set in the sample to obtain the residuals
$$\hat e_i = g_1(\omega_i^*, \hat\beta, \hat\gamma) - \left[n^{-1}\sum_{i=1}^n g_1(\omega_i^*, \hat\beta, \hat\gamma)\,g_2(\omega_i^*, \hat\gamma)'\right]\left[n^{-1}\sum_{i=1}^n g_2(\omega_i^*, \hat\gamma)\,g_2(\omega_i^*, \hat\gamma)'\right]^{-1} g_2(\omega_i^*, \hat\gamma)$$
and then computing the sample analogue of the desired matrix, $\hat D = n^{-1}\sum_{i=1}^n \hat e_i \hat e_i'$. Unfortunately, this simple procedure is valid only for the asymptotic variance of the TWO-STEP estimator under the condition above and under exact identification of at least the second set of moment conditions. For most relevant problems, a bootstrap procedure could be used to obtain consistent estimates of the variance of $\hat\theta$ directly, but this can be computationally demanding for models in which the solution of the optimization problem for both sets of moment conditions requires numerical optimization of the objective function.
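The projection-residual estimator of $D_o$ can be sketched as follows (an illustration on simulated moment vectors, not the dissertation's code; `projection_residual_variance` is a hypothetical name), assuming NumPy. The residual is exactly the sample linear projection residual in the formula above.

```python
import numpy as np

def projection_residual_variance(g1, g2):
    """Estimate D_o = E[e_i e_i'], where e_i is the residual from the sample
    linear projection of the first moment vector g1_i on the second, g2_i.
    g1: (n, m1) array; g2: (n, m2) array."""
    n = g1.shape[0]
    C12 = g1.T @ g2 / n                          # n^{-1} sum g1_i g2_i'
    C22 = g2.T @ g2 / n                          # n^{-1} sum g2_i g2_i'
    e = g1 - g2 @ np.linalg.solve(C22, C12.T)    # e_i = g1_i - C12 C22^{-1} g2_i
    return e.T @ e / n                           # D-hat = n^{-1} sum e_i e_i'

# Simulated moments: g1 is correlated with g2 plus unit-variance noise,
# so the residual variance should be close to 1.
rng = np.random.default_rng(1)
g2 = rng.normal(size=(1000, 2))
noise = rng.normal(size=(1000, 1))
g1 = g2 @ np.array([[1.0], [0.5]]) + noise
D_hat = projection_residual_variance(g1, g2)
```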
1.3 Estimation with missing data

This section specializes the results of Section 1.2 to a model in which missing data are allowed, in a framework that extends the one proposed by Wooldridge (2002b, 2007) to nonsmooth objective functions. Consider a random vector $\omega \in Q \subset \mathbb{R}^{\dim(\omega)}$ with density $f(\omega)$ and a parameter vector $\beta \in B \subset \mathbb{R}^{p_1}$, where $B$ is a compact set. Suppose the population moment condition
$$g_o(\beta_o) = E[g(\omega, \beta_o)] = 0 \qquad (1.21)$$
holds, where $g: Q \times B \to \mathbb{R}^{m_1}$ is a vector of known real-valued moment functions with $m_1 \ge p_1$, so $\beta_o$ may be overidentified. Assume $\beta_o$ is the unique solution to 1.21. I am interested in estimating $\beta_o$. Note that the moment conditions presented above hold in the unselected population. Assume nonrandom sampling occurs, characterized by a selection indicator $s \in \{0, 1\}$ such that $\omega_i$ is observed if and only if $s_i = 1$. Keep in mind that all or part of $\omega_i$ is not observed when $s_i = 0$. The GMM estimator based on 1.21 using the selected sample in effect makes the empirical moments $n^{-1}\sum_{i=1}^n s_i\, g(\omega_i, \beta)$ close to zero. These empirical moments are the sample analogues of population moments of the form
$$E[s\, g(\omega, \beta)] = 0 \qquad (1.22)$$
which are referred to as the unweighted selected population moments (Prokhorov and Schmidt, 2009; Wooldridge, 2002b). The name emphasizes that they are evaluated in the selected rather than the full population of interest and distinguishes them from the weighted selected population moments defined below. The selectivity problem occurs exactly because 1.22 may not hold; in other words, the value $\beta_o$ that solves 1.21 may not also solve 1.22 (Prokhorov and Schmidt, 2009). If that happens, the estimate of $\beta_o$ obtained through this procedure is not generally consistent. In fact, its consistency, and potential solutions to the data selection problem, depend on the relationship between the selection process and both the dependent and independent variables.
1.3.1 Data Selection under Ignorability

A straightforward solution to the nonrandom sampling problem is inverse probability weighting (IPW), as shown by Wooldridge (2002b, 2007). To use IPW we need some variables that are reasonable predictors of selection, as described in Wooldridge (2007). This is formally stated as an "ignorability" of selection assumption.

Assumption 1 (Wooldridge, 2007, Assumption 3.1) (i) $\omega_i$ is observed whenever $s_i = 1$; (ii) for a random vector $z_i$, $P(s_i = 1 \mid \omega_i, z_i) = P(s_i = 1 \mid z_i) \equiv p(z_i)$; (iii) for all $z \in Z \subset \mathbb{R}^J$, $p(z) > 0$; (iv) $z_i$ is observed whenever $s_i = 1$.

Item (ii) of this assumption requires that $s \perp \omega \mid z$; in other words, selection must be independent of $y$ and $x$ conditional on $z$. As discussed at length by Wooldridge (2007), Assumption 1 encompasses a variety of selection schemes common in the missing data literature, including "missing at random", "variable probability sampling", "selection on observables", etc. It allows, for example, the probability of observing $\omega_i$ to depend on the stratum into which $\omega_i$ falls; or $z_i$ to be observed only along with $\omega_i$; or partial information to be known about the incompletely observed data. Assumption 1 does not apply to the "selection on unobservables"2 case as generally used in econometrics. I will not explore these possibilities directly here, referring the reader to Wooldridge (2007). Assume that a conditional density determining selection is correctly specified and that a maximum likelihood estimator of the selection model is available.
Assumption 2 (Wooldridge, 2007, Assumption 3.2) (i) $G(z, \gamma)$ is a parametric model for $p(z)$, where $\gamma \in \Gamma \subset \mathbb{R}^{p_2}$ and $G(z, \gamma) > 0$ for all $z \in Z$ and $\gamma \in \Gamma$; (ii) there exists $\gamma_o$ in the interior of $\Gamma$ such that $p(z) = G(z, \gamma_o)$; (iii) for a random vector $v_i$ such that $D(v_i \mid \omega_i, z_i) = D(v_i \mid z_i)$, the estimator $\hat\gamma$ solves a conditional maximum likelihood problem of the form
$$\max_{\gamma \in \Gamma}\; \sum_{i=1}^n \ln\left[f(v_i \mid z_i, \gamma)\right] \qquad (1.23)$$
where $f(v \mid z, \gamma) > 0$ is a conditional density function known up to the parameters $\gamma_o$, and $s_i = h(v_i, z_i)$ for some nonstochastic function $h(\cdot, \cdot)$; (iv) the solution to 1.23 has the first-order representation
$$\sqrt{n}(\hat\gamma - \gamma_o) = \left[E\, d_i(\gamma_o)\, d_i(\gamma_o)'\right]^{-1} n^{-1/2}\sum_{i=1}^n d_i(\gamma_o) + o_p(1)$$
with $d_i(\gamma) \equiv \nabla_\gamma f(v_i \mid z_i, \gamma)/f(v_i \mid z_i, \gamma)$, the $p_2 \times 1$ score vector for the MLE.

The assumption above requires standard regularity conditions on $G(z, \gamma)$, including smoothness of the parametric model. Even though this restricts how the selection process can be modeled, it includes the probability models most commonly used in the literature. By imposing it, we concentrate on the impact of nonsmoothness in the model of interest and provide results on the use of IPW to correct sample selection in those cases. Assumption 2 covers the cases presented by Wooldridge (2002b), in which the conditional log-likelihood was for a binary response model. The advantage of using this slightly more complicated framework is that it allows $z_i$ to be only partially observed and permits $s_i$ to be a function of another random variable $v_i$, which accommodates a broader class of selection problems. For a deeper discussion of the extensions allowed by Assumption 2, see Wooldridge (2007). Note that the MLE for $\gamma_o$ described above can be obtained in a GMM setting as follows.

2 A quantile regression estimator for the case where selection is on unobservables is provided by Buchinsky (1998).
Let $\hat\gamma$ be the maximum likelihood estimator (MLE) of $\gamma_o$; that is, $\hat\gamma$ solves
$$\max_{\gamma \in \Gamma}\; \sum_{i=1}^n \ln\left[f(v_i \mid z_i, \gamma)\right].$$
Define $g_2(z, \gamma, s) \equiv d(\gamma) = \nabla_\gamma f(v \mid z, \gamma)/f(v \mid z, \gamma)$ and $g_{n2}(\gamma) \equiv n^{-1}\sum_{i=1}^n g_2(z_i, \gamma, s_i)$. Hence $g_{n2}(\gamma) \overset{p}{\to} g_{2o}(\gamma) \equiv E[g_2(z, \gamma, s)]$. Then the problem above is characterized by the first order conditions
$$n^{-1}\sum_{i=1}^n g_2(z_i, \gamma, s_i) = n^{-1}\sum_{i=1}^n \frac{\nabla_\gamma f(v_i \mid z_i, \gamma)}{f(v_i \mid z_i, \gamma)} = n^{-1}\sum_{i=1}^n d_i(\gamma) = o_p(n^{-1/2})$$
and
$$E[g_2(z, \gamma_o, s)] = E[d(\gamma_o)] = 0.$$
Under Assumption 1, the following lemma, presented in Wooldridge (2002b), is valid.

Lemma 1 (Wooldridge, 2002b, Lemma 3.1) Under the conditions presented in Assumptions 1 and 2, for any real-valued function $g(\omega)$ such that $E\left[|g(\omega, \beta_o)|/G(z, \gamma_o)\right] < \infty$,
$$E\left[\frac{s}{G(z, \gamma_o)}\, g(\omega, \beta_o)\right] = E\left[\frac{s}{p(z)}\, g(\omega, \beta_o)\right] = E[g(\omega, \beta_o)] \qquad (1.24)$$

Lemma 1 suggests using the sampling probabilities to consistently estimate $\beta_o$. Consider the weighted selected population moments, which weight 1.22 by the inverse of the selection probability:
$$E\left[\frac{s}{G(z, \gamma_o)}\, g(\omega, \beta_o)\right] = 0 \qquad (1.25)$$
Given an estimator $\hat\gamma$ for $\gamma_o$, we can form $G(z_i, \hat\gamma)$ for all $i$ with $s_i = 1$ and obtain consistent estimates of $\beta_o$ using the weighted selected population moments 1.25, as described in Wooldridge (2007). Note that, by the law of large numbers, the law of iterated expectations, Assumptions 1 and 2, and consistency of $\hat\gamma$ for $\gamma_o$ (see Wooldridge, 2002b, Theorem 3.1),
$$n^{-1}\sum_{i=1}^n \frac{s_i}{G(z_i, \hat\gamma)}\, g(\omega_i, \beta) \;\overset{p}{\to}\; E\left[\frac{s}{p(z)}\, g(\omega, \beta)\right] = E\left[\frac{E[s \mid \omega, z]}{p(z)}\, g(\omega, \beta)\right] = E\left[\frac{p(z)}{p(z)}\, g(\omega, \beta)\right] = E[g(\omega, \beta)] = g_o(\beta).$$
Therefore,
$$n^{-1}\sum_{i=1}^n \frac{s_i}{G(z_i, \hat\gamma)}\, g(\omega_i, \beta_o) \;\overset{p}{\to}\; g_o(\beta_o) = 0.$$
Hence, this provides a set of valid moment conditions that can be used to estimate $\beta_o$.
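A quick simulation illustrates Lemma 1 and the consistency argument above in the simplest setting, a population mean with selection driven by an observed $z$. This is an illustrative Python sketch assuming NumPy; the data-generating process is invented for the example. The unweighted selected mean is biased upward, while the inverse-probability-weighted moment recovers $E[y] = 1$.

```python
import numpy as np

# Minimal simulation of Lemma 1: weighting the selected sample by the inverse
# of the (known) selection probability recovers a full-population moment.
rng = np.random.default_rng(2)
n = 200_000
z = rng.uniform(size=n)
y = 2.0 * z + rng.normal(size=n)          # E[y] = 1
p = 0.2 + 0.6 * z                         # p(z) = P(s = 1 | z), bounded away from 0
s = rng.uniform(size=n) < p               # selection depends on z only

naive = y[s].mean()                       # unweighted selected mean: biased (about 1.2)
ipw = np.mean(s * y / p)                  # sample analogue of E[s y / p(z)] = E[y]
```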
Efficiency Comparisons

The relative efficiency of estimators for $\beta_o$ that use IPW to correct a missing data problem under Assumptions 1 and 2 can be analyzed within the framework developed in Section 1.2. Consider the two sets of moment functions
$$g_1(\omega, z, \beta, \gamma, s) = \frac{s}{G(z, \gamma)}\, g(\omega, \beta), \qquad g_2(z, \gamma, s) \equiv d(\gamma) = \frac{\nabla_\gamma f(v \mid z, \gamma)}{f(v \mid z, \gamma)},$$
for which the following moment conditions are valid:
$$E[g_1(\omega, z, \beta_o, \gamma_o, s)] = 0 \qquad (1.26)$$
$$E[g_2(z, \gamma_o, s)] = 0 \qquad (1.27)$$
Any of the estimators discussed in Section 1.2 can be used; they differ in the set of moment conditions used and in the knowledge about the weights. Under the assumptions on the moment conditions and the selection process discussed in this section, the following lemma holds.

Lemma 2 If the conditions of Theorems 1 and 2 and Assumptions 1 and 2 hold, and the moment conditions are defined by 1.26 and 1.27, then $G_{12} = C_{12}C_{22}^{-1}G_{22}$.

Given this result, under these assumptions the results of Theorem 4 apply directly to this specific case, and the ONE-STEP, TWO-STEP and KNOW-γ-JOINT estimators will be no less efficient than KNOW-γ.

Theorem 5 Under the conditions of Lemma 2, ONE-STEP, KNOW-γ-JOINT and TWO-STEP are no less efficient for $\beta_o$ than KNOW-γ. Furthermore, ONE-STEP and KNOW-γ-JOINT are equally efficient for $\beta_o$.

Hence, unless $C_{12} = 0$ (in which case the four estimators would be equally efficient), using ONE-STEP or TWO-STEP, which estimate $\gamma_o$ by MLE, produces more efficient estimates of $\beta_o$ than using known weights (if we knew them) in the KNOW-γ estimator. The KNOW-γ-JOINT estimator is as efficient as ONE-STEP as well, indicating that knowledge of $\gamma_o$ is not useful for the efficiency of the estimates of $\beta_o$. The efficiency gains relative to KNOW-γ come from using the information in the second set of moment conditions.
Therefore, the puzzle described in Wooldridge (2002b, 2007), that KNOW-γ is inefficient relative to TWO-STEP, extends to a larger set of estimators in which the original set of unweighted moment conditions is nonsmooth, as discussed by Chen et al. (2008) and Hitomi et al. (2008). In these cases we are better off estimating the weights by conditional MLE than knowing them. Nonetheless, the TWO-STEP estimator is dominated by both ONE-STEP and KNOW-γ-JOINT, and those should be used to obtain relatively efficient estimates of $\beta_o$. It is important to note that the framework developed in this chapter does not extend directly to semiparametric cases in which the probability of selection is estimated nonparametrically. That can be a serious inconvenience when we have limited information about the selection process and would benefit from a more flexible estimator of those probabilities. However, as shown in Section 1.3.2, we can obtain consistent estimates of $\beta_o$ even when using misspecified selection probabilities, as long as the data selection is exogenous.

1.3.2 Data Selection under Exogeneity of Selection

The sample selection literature has long established that sample selection does not necessarily cause bias in unweighted estimators. As shown in Wooldridge (2007), if selection is exogenous conditional on the vector of covariates $x$, the estimators of interest using the unweighted moment conditions are consistent and, in fact, more efficient (Prokhorov and Schmidt, 2009) than the weighted estimators. Following Wooldridge (2007), I analyze the properties of the estimators obtained under exogenous selection but with potential misspecification of the selection model. The main consistency results for the parameters of interest shown in Wooldridge (2007) and Prokhorov and Schmidt (2009) are unaltered by allowing the moment conditions to be nonsmooth, as summarized below.
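The puzzle can be seen in a Monte Carlo exercise in the most elementary case, data missing completely at random with a constant selection probability (a sketch under invented parameters, not a result from the chapter). With constant $p$, the IPW estimator that plugs in the estimated probability reduces to the selected-sample mean, and it has strictly smaller variance than the IPW estimator that uses the true $p$.

```python
import numpy as np

# Monte Carlo illustration of the weighting "puzzle" in the simplest case:
# with data missing completely at random (p(z) = p constant), the IPW mean
# using the *estimated* probability is just the selected-sample mean and
# has smaller variance than the IPW mean using the *true* p.
rng = np.random.default_rng(7)
p, n, reps = 0.5, 500, 4000
known, estimated = [], []
for _ in range(reps):
    y = rng.normal(loc=1.0, size=n)           # E[y] = 1
    s = rng.uniform(size=n) < p               # missing completely at random
    known.append(np.mean(s * y / p))          # KNOW-gamma: true weights
    estimated.append(y[s].mean())             # estimated p: selected-sample mean
var_known = np.var(known)
var_est = np.var(estimated)
```

Both estimators are unbiased for 1, but across replications `var_est` comes out clearly below `var_known`, matching the efficiency ranking in Theorem 5.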
Consider a potentially misspecified model for the probability of selection, $G(z, \gamma^*)$, which is not necessarily equal to the true $p(z_i)$. Assume that the estimate $\hat\gamma$ obtained from that model is consistent for some parameter vector $\gamma^*$ with $\sqrt{n}(\hat\gamma - \gamma^*) = O_p(1)$. In this case, the weighted moment condition satisfies
$$n^{-1}\sum_{i=1}^n \frac{s_i}{G(z_i, \hat\gamma)}\, g(\omega_i, \beta) \;\overset{p}{\to}\; E\left[\frac{s}{G(z, \gamma^*)}\, g(\omega, \beta)\right] \qquad (1.28)$$
instead of converging to $E[g(\omega, \beta)]$, as in Section 1.3.1. Assume that the selection process is exogenous conditional on $z$.

Assumption 3 (Wooldridge, 2007, Assumption 4.1) (i) $\omega_i$ is observed whenever $s_i = 1$; (ii) for a random vector $z_i$, $P(s_i = 1 \mid \omega_i, z_i) = P(s_i = 1 \mid z_i) \equiv p(z_i)$; (iii) $z_i$ is observed whenever $s_i = 1$; (iv) $\beta_o \in B$ solves $E[g(\omega, \beta) \mid z] = 0$ for all $z \in Z$.

This assumption is the same as in Prokhorov and Schmidt (2009), and as shown in their Lemma 4.1 and Theorem 4.1 (p. 53), which are not altered by the use of nonsmooth objective functions, it implies
$$E[g(\omega, \beta_o) \mid z, s] = 0.$$
Hence any function of $z$ and $s$ is uncorrelated with $g(\omega, \beta_o)$, and both weighted and unweighted moment conditions hold in the selected sample for any weighting (that is a function of $z$ and $s$) we might use. Therefore, the weighted moment condition in equation 1.28 holds in the selected sample for any misspecified model $G(z, \gamma^*)$. Obviously, this holds for the unweighted moment conditions as well, since they correspond to the special case $G(z, \gamma^*) = 1$. We conclude that under exogeneity of selection, the proposed IPW estimator for $\beta_o$ is consistent regardless of misspecification of the model for the probability of selection3. This robustness is an important feature of the IPW procedure and adds to its usefulness in applications.

3 This conclusion is equivalent to Theorem 4.1 in Wooldridge (2007), extending it to nonsmooth objective functions.
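This robustness is easy to see numerically (an illustrative sketch with an invented design, assuming NumPy; not the dissertation's code): selection depends only on $x$, and a deliberately wrong weighting function of $x$ leaves the slope estimator consistent.

```python
import numpy as np

# Sketch of Section 1.3.2: when selection depends only on x and
# E[g(w, beta) | x] = 0, any weighting that is a function of x (even a
# misspecified "probability" model) leaves the estimator consistent.
rng = np.random.default_rng(4)
n = 100_000
x = rng.uniform(1.0, 2.0, size=n)
y = 2.0 * x + rng.normal(size=n)            # E[y | x] = 2x, so beta_o = 2
s = rng.uniform(size=n) < (0.3 + 0.3 * x)   # exogenous selection: depends on x only

w_wrong = 1.0 / (0.5 + 0.1 * x[s])          # deliberately misspecified weights
xs, ys = x[s], y[s]
# Weighted through-origin least squares on the selected sample
beta_hat = np.sum(w_wrong * xs * ys) / np.sum(w_wrong * xs * xs)
```

Despite the wrong weights and the selected sample, `beta_hat` converges to the true slope of 2, because the moment condition holds conditional on $x$ and the weights are a function of $x$ alone.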
1.4 Examples

1.4.1 Quantile Regression under Ignorability of Selection

Quantile regression is one of the main motivations for this research. As an example of the use of the results presented here, suppose I am interested in estimating the conditional quantile function (CQF) of a random variable $Y$ conditional on a vector of explanatory variables $X$. This is defined by
$$Q_\tau(Y \mid X) = \inf\{y : F_Y(y \mid X) \ge \tau\}$$
where $\tau \in (0, 1)$ indexes the $\tau$th quantile of the conditional distribution of $Y$. Suppose that the CQF is a linear model, $Y = X'\beta_{\tau o} + \varepsilon$ with $Q_\tau(\varepsilon \mid X) = 0$. Then
$$Q_\tau(Y \mid X) = X'\beta_{\tau o}.$$
In the population, $\beta_{\tau o}$ solves the problem
$$\min_{\beta \in B}\; E\left[\rho_\tau(Y - X'\beta_\tau)\right]$$
where $\rho_\tau(u) = (\tau - 1[u \le 0])\,u$. Given a random sample of size $n$ from the population, it is possible to obtain consistent estimates of $\beta_{\tau o}$ by the standard quantile regression (QR) estimator
$$\min_{\beta \in B}\; n^{-1}\sum_{i=1}^n \rho_\tau(y_i - x_i'\beta_\tau).$$
Note that the population minimization problem has the first order conditions
$$E\left[\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = 0$$
and their sample analogue is (Buchinsky, 1998)
$$n^{-1}\sum_{i=1}^n \left(\tau - 1\left[y_i - x_i'\beta_\tau \le 0\right]\right)x_i = o_p(n^{-1/2}).$$
Hence, we can frame this problem as a GMM estimator that uses the first order conditions of the QR problem as moment conditions, since these identify $\beta_{\tau o}$. However, suppose a random sample of $(y, x)$ is not observed. We have a selection problem such that the full vector $(y_i, x_i)$ is observed only if a certain binary selection variable equals one, $s_i = 1$; if $s_i = 0$, at least some part of $(y_i, x_i)$ is not observed. Then, in the selected sample, we can only compute
$$n^{-1}\sum_{i=1}^n s_i\left(\tau - 1\left[y_i - x_i'\beta_\tau \le 0\right]\right)x_i = o_p(n^{-1/2})$$
which is the sample analogue of
$$E\left[s\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = 0,$$
but the value $\beta_{\tau o}$ that solves the population moment condition does not necessarily solve the selected population moment condition.
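The check function and its first order condition can be illustrated numerically in the intercept-only case (an illustrative Python sketch assuming NumPy, not the dissertation's code): the minimizer of the empirical check-function objective is the sample $\tau$-quantile, which approximately zeroes the moment $\tau - n^{-1}\sum_i 1[y_i \le b]$.

```python
import numpy as np

# The check function rho_tau and its first-order condition for an
# intercept-only quantile "regression": the sample tau-quantile
# (approximately) minimizes the objective and zeroes the moment.
def rho(u, tau):
    return (tau - (u <= 0)) * u

rng = np.random.default_rng(3)
y = rng.normal(size=5001)
tau = 0.25
q = np.quantile(y, tau)

# Grid search for the minimizer of the empirical check-function objective
grid = np.linspace(q - 0.5, q + 0.5, 201)
obj = [rho(y - b, tau).sum() for b in grid]
b_star = grid[int(np.argmin(obj))]

# Approximate first-order condition at the sample quantile
moment = tau - np.mean(y - q <= 0)
```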
Additionally, assume that the probability of selection can be written as a parametric function of a vector of variables $(x_i, z_i)$ and parameters $\gamma_o$, and that conditional on $z_i$, both $y_i$ and the elements of $x_i$ not included in $z_i$ are irrelevant for the probability of selection (Assumption 1):
$$P(s_i = 1 \mid y_i, x_i, z_i) = P(s_i = 1 \mid z_i) \equiv p(z_i, \gamma_o).$$
In this situation, we can obtain consistent and asymptotically normal estimates of $\beta_{\tau o}$ using the selected sample by weighting the original observations by the inverse of the probability of selection. Note that
$$E\left[\frac{s}{p(z, \gamma_o)}\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = E\left[E\left[\frac{s}{p(z, \gamma_o)}\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X \,\middle|\, x, y, z\right]\right]$$
$$= E\left[\frac{E(s \mid x, y, z)}{p(z, \gamma_o)}\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = E\left[\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = 0,$$
so the weighted moment condition holds. Therefore, we can estimate $\beta_{\tau o}$ using these weighted moment conditions. Naturally, the weights must be estimated if they are unknown. For simplicity, let the true selection model be a standard binary response model. Then estimate the probability of selection by MLE or, more conveniently, by a GMM procedure that uses the first order conditions of the MLE for the selection model as moment conditions. The MLE maximization problem and its first order condition are given by, respectively,
$$\max_{\gamma \in \Gamma}\; \sum_{i=1}^n \left\{s_i \ln\left[p(z_i, \gamma)\right] + (1 - s_i)\ln\left[1 - p(z_i, \gamma)\right]\right\} \qquad (1.29)$$
$$n^{-1}\sum_{i=1}^n \frac{s_i - p(z_i, \gamma)}{p(z_i, \gamma)\left(1 - p(z_i, \gamma)\right)}\,\nabla_\gamma p(z_i, \gamma) = 0 \qquad (1.30)$$
where the estimator of $\gamma_o$ is defined as the solving vector $\hat\gamma$. Again, 1.30 is the sample analogue of the moment condition
$$E\left[\frac{s - p(z, \gamma_o)}{p(z, \gamma_o)\left(1 - p(z, \gamma_o)\right)}\,\nabla_\gamma p(z, \gamma_o)\right] = 0.$$
Hence, we have two sets of moment conditions that can be used to estimate both the selection model and the conditional quantile model.
The GMM estimator in this case is given by any of the four estimators proposed in Section 1.2, with
$$g_{n1}(\theta) = n^{-1}\sum_{i=1}^n \frac{s_i}{p(z_i, \gamma)}\left(\tau - 1\left[y_i - x_i'\beta_\tau \le 0\right]\right)x_i$$
$$g_{n2}(\gamma) = n^{-1}\sum_{i=1}^n \frac{s_i - p(z_i, \gamma)}{p(z_i, \gamma)\left(1 - p(z_i, \gamma)\right)}\,\nabla_\gamma p(z_i, \gamma);$$
the variance of the estimates depends on the choice of estimator, as stated by Theorem 4. To estimate the variance of the estimated parameters we need to obtain valid estimates of the components of $G$ in the variance of $\hat\theta$. Note that, for example,
$$G_{11} \equiv \nabla_\beta\, E[g_1(\beta_o, \gamma_o)] = \nabla_\beta\, E\left[\frac{s}{p(z, \gamma_o)}\left(\tau - 1\left[y - x'\beta_{\tau o} \le 0\right]\right)x\right]$$
$$= \nabla_\beta\, E\left[\frac{s}{p(z, \gamma_o)}\, E\left[\tau - 1\left[y - x'\beta_{\tau o} \le 0\right] \,\middle|\, z, x, s\right]x\right] = \nabla_\beta\, E\left[\frac{s}{p(z, \gamma_o)}\left(\tau - F_{y|z,x,s}(x'\beta_{\tau o})\right)x\right]$$
$$= -E\left[\frac{s}{p(z, \gamma_o)}\, f_{y|z,x,s}(x'\beta_{\tau o})\, x x'\right],$$
hence consistent estimates can be obtained by the sample analogues
$$\hat G_{11} = -n^{-1}\sum_{i=1}^n \frac{s_i}{p(z_i, \hat\gamma)}\, \hat f_{y|z,x,s}(x_i'\hat\beta_\tau)\, x_i x_i'$$
$$\hat G_{12} = -n^{-1}\sum_{i=1}^n \frac{s_i}{\left[p(z_i, \hat\gamma)\right]^2}\left(\tau - 1\left[y_i - x_i'\hat\beta_\tau \le 0\right]\right)x_i\, \nabla_\gamma p(z_i, \hat\gamma)'$$
$$\hat G_{22} = -n^{-1}\sum_{i=1}^n \left[\frac{s_i - p(z_i, \hat\gamma)}{p(z_i, \hat\gamma)\left(1 - p(z_i, \hat\gamma)\right)}\right]^2 \nabla_\gamma p(z_i, \hat\gamma)\, \nabla_\gamma p(z_i, \hat\gamma)'$$
where the last estimator is a direct application of the generalized information matrix equality (GIME) and $\hat f_{y|z,x,s}(\cdot)$ is a suitable estimator of the conditional density of $y$, commonly a kernel estimator. Note that the same asymptotic variance formula for the KNOW-γ estimator of $\beta_\tau$ is obtained by a simple extension of the results for weighted quantile regression presented in Koenker (2005), as shown in Claim 1 in the appendix. Since the conditions of Theorem 5 hold, we obtain more efficient estimates by estimating the inverse probability weights than by using the "true" weights, which is the puzzle described in the literature (Wooldridge, 2002b, 2007). The relatively more efficient estimate of $\beta_{\tau o}$ is given by the one-step estimator that jointly estimates both the probability weights and the parameters of interest, $\beta_{\tau o}$.
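The conditional density term $\hat f$ in $\hat G_{11}$ is commonly obtained by a kernel plug-in. Below is a minimal sketch of a Gaussian-kernel density estimate at a point (unconditional, for illustration only; the bandwidth 0.1 is an arbitrary choice for the example, not a recommendation), assuming NumPy.

```python
import numpy as np

# A kernel estimator for a density at a point, the kind of plug-in used for
# the f-hat term in quantile regression standard errors (sketched
# unconditionally here for simplicity).
def kernel_density_at(u, pts, bw):
    z = (pts - u) / bw
    # Gaussian kernel: phi(z) averaged over the sample, scaled by 1/bw
    return np.mean(np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)) / bw

rng = np.random.default_rng(8)
pts = rng.normal(size=100_000)
f0 = kernel_density_at(0.0, pts, bw=0.1)   # true N(0,1) density at 0 is about 0.3989
```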
One interesting point is that even this relatively restrictive, linear model for the CQF can be very informative about a potentially nonlinear true CQF. As discussed in detail by Angrist, Chernozhukov, and Fernández-Val (2006), a linear quantile regression provides the best linear approximation to the true CQF in the sense that it minimizes a weighted mean square error loss function. So even if we have reason to believe that the true CQF of interest is nonlinear, the linear quantile regression in the example above provides the "best linear approximation" to it, in much the same way that linear OLS offers the best linear approximation to the conditional mean function. Hence, by using IPW to correct the selection bias caused by missing data, we can recover this linear approximation to the CQF of interest even without knowing its true specification. Nevertheless, this framework can also be applied to nonlinear conditional quantiles of the form $Q_\tau(Y \mid X) = m(X, \beta_{\tau o})$, with
$$g_{n1}(\theta) = n^{-1}\sum_{i=1}^n \frac{s_i}{p(z_i, \gamma)}\left(\tau - 1\left[y_i - m(x_i, \beta_\tau) \le 0\right]\right)\nabla_\beta\, m(x_i, \beta_\tau)$$
$$\hat G_{11} = -n^{-1}\sum_{i=1}^n \frac{s_i}{p(z_i, \hat\gamma)}\, \hat f_{y|z,x,s}\!\left(m(x_i, \hat\beta_\tau)\right)\nabla_\beta\, m(x_i, \hat\beta_\tau)\, \nabla_\beta\, m(x_i, \hat\beta_\tau)'$$
$$\hat G_{12} = -n^{-1}\sum_{i=1}^n \frac{s_i}{\left[p(z_i, \hat\gamma)\right]^2}\left(\tau - 1\left[y_i - m(x_i, \hat\beta_\tau) \le 0\right]\right)\nabla_\beta\, m(x_i, \hat\beta_\tau)\, \nabla_\gamma p(z_i, \hat\gamma)'$$
and the remaining equations unchanged.

1.4.2 Instrumental Variable Quantile Regression

Consider a simplified version of the IVQR estimator described in Chernozhukov and Hansen (2006). Focus on the basic linear model that allows for heterogeneous effects, given by
$$Y_d = q(d, x, \tau) = d'\alpha_\tau + x'\beta_\tau$$
where $d$ is a vector of (potentially endogenous) multi-valued treatment variables and $x$ is a vector of covariates.
Under the conditions described in Assumption 1 of Chernozhukov and Hansen (2006), the IVQR estimator of the parameter vector $(\alpha_\tau', \beta_\tau')'$ proposed in that paper approximately solves the estimating equation4
$$n^{-1}\sum_{i=1}^n \left(1\left[y_i - d_i'\alpha_\tau - x_i'\beta_\tau \le 0\right] - \tau\right)(x_i', \Phi_{i\tau}')' = o_p(n^{-1/2})$$
where $\Phi_{i\tau} \equiv \Phi_\tau(\tau, x_i, z_i)$ is a vector of transformations of the instruments. In a simple model $\Phi_{i\tau}$ can be formed by the least squares projection of $d$ on $z$ and $x$ (and their powers) (Chernozhukov and Hansen, 2006, 2008). In that simple case, we could write the sample analogues of the moment conditions that identify the parameters of the model as
$$g_{n1}(\theta) = n^{-1}\sum_{i=1}^n \left(1\left[y_i - d_i'\alpha_\tau - x_i'\beta_\tau \le 0\right] - \tau\right)\left(x_i', \left((x_i', z_i')\gamma\right)'\right)'$$
$$g_{n2}(\gamma) = n^{-1}\sum_{i=1}^n (x_i', z_i')'\left[d_i - (x_i', z_i')\gamma\right].$$
Hence, the analysis developed in Section 1.2 can be applied to the IVQR estimator proposed by Chernozhukov and Hansen (2005, 2006, 2008), and the results shown above are valid in its scope. Nevertheless, it is important to note that the framework developed here does not extend directly to semiparametric cases in which the "first stage" is estimated nonparametrically. That can be a serious inconvenience when we have limited information about the form of the transformation of the vector of instruments that would be preferable in estimating IVQR.

1.5 Conclusion

This chapter (i) extends the GMM efficiency and redundancy results of Prokhorov and Schmidt (2009) to nonsmooth objective functions; (ii) analyzes the extent to which these results can be useful in the context of inverse probability weighting (IPW) as a mechanism to correct missing data issues, thus allowing its use in the LAD and quantile regression

4 For simplicity I assume that the weights $V_{i\tau}$ in Chernozhukov and Hansen (2006) are equal to one.
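The simple instrument transformation $\Phi$ used above, the least squares projection of $d$ on $(x, z)$, can be sketched as follows (illustrative Python with an invented first stage, assuming NumPy; not the dissertation's code). By construction the projection residual is orthogonal in sample to the instruments and covariates, which is what makes $\Phi$ a valid transformation.

```python
import numpy as np

# Phi as the least squares projection of the treatment d on the
# instruments z and covariates x (plus an intercept).
rng = np.random.default_rng(5)
n = 10_000
x = rng.normal(size=n)
z = rng.normal(size=n)
d = 0.8 * z + 0.4 * x + rng.normal(size=n)    # invented first stage

W = np.column_stack([np.ones(n), x, z])
gamma_hat, *_ = np.linalg.lstsq(W, d, rcond=None)
phi = W @ gamma_hat                           # fitted values: projection of d
resid = d - phi                               # orthogonal to columns of W
```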
framework; (iii) verifies the conditions under which the puzzle of the selectivity literature, i.e., that weighting by known probabilities of selection leads to less efficient estimates than weighting by estimated probabilities of selection (Wooldridge, 2002b, 2007; Prokhorov and Schmidt, 2009; Hitomi et al., 2008), remains valid under nonsmoothness of the objective functions that characterize the models of interest; and (iv) shows that even in that case the widely used two-step estimator is less efficient than a one-step joint estimator.

Section 1.2 extends the redundancy and efficiency results of Prokhorov and Schmidt (2009) so that they can be applied to a wide range of contexts in which nonsmooth objective functions are useful, including LAD, quantile regression, censored LAD and quantile treatment effects. Joint estimation of nuisance parameters and parameters of interest is more efficient than a two-step procedure or than knowing the true nuisance parameters in the nonsmooth case. This springs from the information contained in the correlation between the two sets of moment conditions, which is useful even when $\gamma_o$ is known. Using only the first set of moment conditions and known values of $\gamma_o$ in the estimation procedure ignores the additional information embedded in the second set of moment conditions and is therefore inefficient. Some possible consistent estimators for the variance of both sets of parameters are presented.

Section 1.3 analyzes the missing data problem described in Wooldridge (2007). The selection model is estimated by a conditional MLE procedure, but the assumptions about the selection model are weak enough to cover most of the common parametric selection processes in the literature, such as attrition, variable probability sampling and "missing at random". One important case not covered is "selection on unobservables". The results from Wooldridge (2007) and Prokhorov and Schmidt (2009) extend to nonsmooth objective functions.
If we use both sets of moment conditions, knowledge of the nuisance parameters does not improve the efficiency of the estimates of the parameters of interest. Additionally, the moment conditions associated with the selection model are not redundant, except in special cases. Estimating the parameters of interest using only the first set of moment conditions, with known probabilities of selection as weights, is inefficient because it ignores information in the second set of moment conditions. This is the type of puzzle referred to in the selectivity literature, especially in the IPW approach to missing data. In summary, IPW can be used to correct missing data problems when the model of interest is based on nonsmooth objective functions. Furthermore, two-step estimation of $\beta_o$ is more efficient than using known probabilities of selection. Nonetheless, the two-step estimator is dominated by a one-step joint estimation procedure that uses both the weighted moment conditions and the selection model's conditions. Hence, the analysis by Prokhorov and Schmidt (2009) extends to the relative efficiency of an IPW approach to missing data problems in which the moment conditions of interest are nonsmooth, encompassing, for example, LAD, quantile regression, censored LAD and IVQR. Finally, two illustrative examples of models encompassed by the general framework developed in this work are provided. The first is a quantile regression model with missing data; the second is a simplified version of the instrumental variable quantile regression (IVQR) estimator presented by Chernozhukov and Hansen (2006).
CHAPTER 2 Fixed Bandwidth Asymptotics for Regression Discontinuity Designs

2.1 Introduction

Regression discontinuity (RD) designs have been propelled to the spotlight of economic analysis in recent years1, especially in the policy and treatment evaluation literatures, as a way of estimating treatment effects in a non-experimental setting. The appeal of RD comes from the relatively weak assumptions needed for the identification of treatment effects and for inference, which rely on RD's "quasi-experimental" characteristics. The standard approach to deriving the asymptotic properties of estimates obtained in RD settings relies on the traditional assumption that the bandwidth, $h$, used in the estimation procedure shrinks towards zero as the sample size grows. This guarantees identification of the parameter of interest under mild conditions. Hence, the asymptotic distribution of the estimator used as the basis for inference depends crucially on this small-h condition. In practice, to obtain an estimate of the parameter of interest and perform inference about it, the empiricist is required to choose a fixed bandwidth greater than zero. Hence, even though the asymptotic theory requires that $h \to 0$, in practice $h > 0$ and fixed. Asymptotic distributions that treat $h$ as fixed can provide a more refined approximation to the behavior of the estimator than those derived under the assumption that $h \to 0$. This chapter derives the asymptotic distribution of the local polynomial estimator when the bandwidth is allowed to be any positive real number. The results shown in Section 2.5 provide a new, fixed-h approximation to the estimator's bias and variance that incorporates the bandwidth size chosen by the researcher.

1 Lee and Lemieux (2009), in a broad review of the RD literature, compile a list of more than 60 papers that apply RD designs in many different contexts. The overwhelming majority of the papers have been published in the last decade.
Corollary 2 shows that the standard small-h asymptotic distribution of the parameter of interest is a special case of fixed-h in which $h \to 0$. Also, Corollary 3 shows that when a fixed $h > 0$ is used, the standard small-h result for the variance of the estimators is equivalent to assuming that the density of the running variable and the conditional variance of the outcome variable are constant around the discontinuity. The increased theoretical interest in RD started with Hahn, Todd, and Van der Klaauw (1999, 2001), who presented the conditions for identification of the average treatment effect of interest and for its estimation by exploiting discontinuities in the probability of treatment provision, which are determined by the so-called running variable. They also derived the asymptotic distribution of the estimators by looking at a shrinking bandwidth around the discontinuity. Porter (2003) provided widely used results on the asymptotic properties of estimators of the treatment effect of interest, obtaining limiting distributions for estimators based on local polynomial regression and partially linear estimation. Imbens and Lemieux (2008) and Lee and Lemieux (2009) offer broad reviews of the theoretical and applied literature, with emphasis on the identification of the parameter of interest and its potential interpretation as a weighted average treatment effect. The analysis of the asymptotic properties of estimators for fixed bandwidths has received some attention in other literatures. Notably, Neave (1970), in the framework of spectral density estimation, obtains more accurate approximations to the variance of nonparametric spectral estimates by acknowledging that, with a finite sample, the bandwidth used is fixed. He asserts that, in the context of his paper, the assumption equivalent to the bandwidth converging to zero: “(...)
is a convenient assumption mathematically in that, in particular, it ensures consistency of the estimates, but it is unrealistic when such results are used as approximations to the finite case (...)” (Neave, 1970, p. 70). Also, Fan (1998) provides an alternative approximation for goodness-of-fit tests of density function estimates in which the bandwidth used in the test is fixed, obtaining improved approximations to the asymptotic behavior of the test and more appropriate critical values for inference. The same can be said of the regression discontinuity design: even though $h \to 0$ is a convenient assumption that guarantees consistency of the estimates of the average treatment effect, it is unrealistic. It is of theoretical and practical interest to obtain more accurate asymptotic distributions by treating $h$ as fixed, so that the theory used for inference is more accurate and better aligned with the practice of applied economists. Monte Carlo simulations in Section 2.7.1 indicate that, compared with small-h, asymptotic distributions derived under fixed-h better characterize the behavior of the estimators and provide improved inference about the treatment effect, reducing size distortions in tests and better approximating the bias in the estimates. These improvements are more important when the bandwidth is farther from zero, as one would expect. Section 2.6 proposes estimators for the asymptotic variance based on the fixed-h results and provides evidence, through Monte Carlo simulations (Section 2.7.2), that feasible inference incorporates the improvements predicted by the theory, suggesting that the theoretical gains in robustness can be translated into practical benefits in applied work. Section 2.7.2 also compares the inference performance of the small-h standard error estimators proposed in the literature. The fixed-h variance estimators can improve markedly over small-h estimators in the presence of some forms of heteroskedasticity.
Simulations using heteroskedastic errors provide evidence that feasible tests based on the fixed-h approach obtain better coverage, outperforming small-h starting at relatively small bandwidths. Interestingly, in the case of the widely used local linear estimator with homoskedastic errors, the variance estimators based on small-h asymptotics suggested in the literature produce well-behaved tests with size performance similar to the fixed-h variance estimators, performing better than the standard theory would predict.

2.2 Model

The interest lies in estimating the average treatment effect, τ, of a certain treatment or policy that affects part of a population of interest. As discussed in Porter (2003), Imbens and Lemieux (2008), and Lee and Lemieux (2009), RD designs are closely associated with the treatment effect literature.2 There are two types of RD designs, sharp and fuzzy, and they differ as to how treatment is assigned to a given observation and the role of the discontinuity in its assignment. I will focus on the sharp design in this section and emphasize the differences of the fuzzy design when needed.

2.2.1 Sharp Regression Discontinuity Design

In the sharp design, the treatment status, D, is a deterministic function of a so-called “running” or “forcing” variable, x, such that
$$d_i = \begin{cases} 1 & \text{if } x_i \ge \bar{x} \\ 0 & \text{if } x_i < \bar{x} \end{cases}$$
where $\bar{x}$ is the known cut-off point. Then, let $Y_1$ and $Y_0$ be the potential outcomes corresponding to the two possible treatment assignments. As usual, we cannot observe both potential outcomes, having access only to $Y = dY_1 + (1 - d)Y_0$. As described by Hahn et al. (2001) and Porter (2003), under a smoothness assumption that $E[Y_j \mid X = x]$ is continuous at $\bar{x}$ for $j = 0, 1$, the average treatment effect can be estimated by comparing points just above and just below the discontinuity.

2 Angrist and Pischke (2009) provide a simple introduction to the intuition of regression discontinuity.
The discontinuity in treatment assignment at $\bar{x}$ provides the opportunity to identify the average treatment effect at the cutoff without any additional parametric functional form restrictions on the conditional expectations of the outcome variable. The average causal effect of the treatment at the discontinuity is (Imbens and Lemieux, 2008)
$$\tau^S \equiv E[Y_1 - Y_0 \mid X = \bar{x}] = \lim_{x \downarrow \bar{x}} E[Y \mid X = x] - \lim_{x \uparrow \bar{x}} E[Y \mid X = x]$$
where the second equality holds under some smoothness assumptions regarding the conditional expectations (discussed below). The sharp regression discontinuity design uses the discontinuity in the conditional expectation of Y given X to uncover the average treatment effect. If the treatment effect is deemed constant across individuals, $\tau^S$ is the effect of treatment for each individual in the population. If we allow the treatment effect to differ among individuals, $\tau^S$ is the average treatment effect for individuals at the cutoff. Interestingly, Lee and Lemieux (2009) show that the so-called RD gap obtained by the comparison of observations just above and just below the cutoff can be interpreted as a weighted average treatment effect across all individuals, not only the individuals around the cutoff. In this case each individual receives a weight directly proportional to the ex ante likelihood that the individual's realization of X will be close to the threshold. For a comprehensive review of RD designs and their applications and interpretation, see Lee and Lemieux (2009).
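The identification argument above can be illustrated with a minimal numerical sketch: estimate the RD gap as the difference of local means just above and just below the cutoff. All data-generating values below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP for illustration: linear m(x) plus a jump of tau_true at the cutoff.
cutoff, tau_true, h = 0.0, 2.0, 0.5
x = rng.uniform(-2.0, 2.0, 5000)
d = (x >= cutoff).astype(float)
y = 1.0 + 0.5 * x + tau_true * d + rng.normal(0.0, 0.3, x.size)

# RD gap: difference of local means within a bandwidth h on each side of the
# cutoff (the p = 0 estimator with a rectangular kernel).
above = y[(x >= cutoff) & (x <= cutoff + h)].mean()
below = y[(x < cutoff) & (x >= cutoff - h)].mean()
tau_hat = above - below
print(tau_hat)
```

Because m(x) has a nonzero slope in this sketch, the difference of local means overstates the jump by roughly the slope times h; this is exactly the fixed-h bias that later sections make explicit.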
2.2.2 Fuzzy Regression Discontinuity Design

In the fuzzy design the probability of receiving treatment still changes discontinuously at the threshold, but is not required to go from 0 to 1, allowing for a smaller jump in the probability of receiving treatment at the cutoff,
$$\lim_{x \downarrow \bar{x}} \Pr(d = 1 \mid X = x) \neq \lim_{x \uparrow \bar{x}} \Pr(d = 1 \mid X = x)$$
This framework allows for a greater range of applications since it includes cases in which the incentives to receive (or assign) treatment change discontinuously at the threshold, but are not strong enough to induce all individuals above it to be treated (and those below not to be treated). The average treatment effect at the cutoff can be identified by the ratio of the change in the conditional expectation of the outcome variable to the change in the conditional probability of receiving treatment (Imbens and Lemieux, 2008):
$$\tau^F \equiv \frac{\lim_{x \downarrow \bar{x}} E[Y \mid X = x] - \lim_{x \uparrow \bar{x}} E[Y \mid X = x]}{\lim_{x \downarrow \bar{x}} E[d \mid X = x] - \lim_{x \uparrow \bar{x}} E[d \mid X = x]}$$
This parameter's interpretation is closely linked to the instrumental variables approach. As emphasized by Hahn, Todd, and Van der Klaauw (2001), Imbens and Lemieux (2008), and Lee and Lemieux (2009), a causal interpretation of this ratio requires the same assumptions as the local average treatment effect (LATE) framework presented in Imbens and Angrist (1994). For that we assume monotonicity, i.e., that crossing the cutoff moves treatment status in only one direction, or, as stated by Lee and Lemieux (2009, p. 23): “(...) X crossing the cutoff cannot simultaneously cause some units to take up and others to reject the treatment.” Also, crossing the cutoff cannot affect the outcome other than through the receipt of treatment; otherwise we would erroneously attribute to the treatment changes in the conditional expectation of Y that are due to changes in X.
Under these additional assumptions, $\tau^F$ has an interpretation similar to the IV estimator: the average treatment effect for the individuals at the threshold (due to the RD design), and only for those whose participation in treatment was affected by the cutoff. Those individuals are described as compliers in the average treatment effect literature.3 Hence (Imbens and Lemieux, 2008),
$$\tau^F \equiv E[Y_1 - Y_0 \mid \text{individual is a complier and } X = \bar{x}]$$
As in the sharp RD design, Lee and Lemieux (2009) show that the fuzzy RD estimator can be interpreted as a weighted LATE, with an individual's weight directly proportional to the ex ante likelihood that the individual's realization of X will be close to the threshold.

2.3 Estimators

I analyze estimates of the parameter of interest, $\tau^S$ or $\tau^F$, obtained by local polynomial estimators. In applied work local polynomial estimators are a staple for estimation of treatment effects in RD settings. This is partially due to their easy implementation, their good properties, and the fact that the local linear estimator has been the focus of several papers that helped to disseminate the technique (Hahn, Todd, and Van der Klaauw, 1999, 2001; Imbens and Lemieux, 2008; Lee and Lemieux, 2009). Theoretically as well, local polynomial estimators are attractive for estimation in the regression discontinuity setting given their good boundary behavior, as described by Fan and Gijbels (1996). The order-p local polynomial estimator is defined as follows. In the sharp design case, given data $(y_i, x_i)_{i=1,2,\ldots,n}$, let $d_i = \mathbf{1}[x_i \ge \bar{x}]$, let $k(\cdot)$ be a kernel function, and let $h$ denote a bandwidth that controls the size of the local neighborhood to be averaged over.

3 See Imbens and Angrist (1994); Hahn, Todd, and Van der Klaauw (2001); Imbens and Lemieux (2008) and Lee and Lemieux (2009).
Also, define the $(p+1) \times 1$ vector $Z(x) = \left(1, (x - \bar{x}), (x - \bar{x})^2, \ldots, (x - \bar{x})^p\right)'$ and let $\hat{\alpha}_{p+}, \hat{\beta}_{p+}$ be the solution to the minimization problem:4
$$\min_{a, b_1, \ldots, b_p} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h}\, k\!\left(\frac{x_i - \bar{x}}{h}\right) d_i \left(y_i - a - b_1 (x_i - \bar{x}) - \cdots - b_p (x_i - \bar{x})^p\right)^2$$
and, similarly, $\hat{\alpha}_{p-}, \hat{\beta}_{p-}$ minimizes the same objective function but with $1 - d_i$ replacing $d_i$. The estimator of the parameter of interest is given by
$$\hat{\tau}^S \equiv \hat{\alpha}_p = \hat{\alpha}_{p+} - \hat{\alpha}_{p-}$$
This estimator fits a polynomial in X on a neighborhood just above and just below the cutoff for treatment, $\bar{x}$, and encompasses some familiar estimators as special cases.

Case 1 If p = 0 is used, the Nadaraya-Watson estimator is obtained. The Nadaraya-Watson estimator takes the difference of kernel weighted averages of the observations on each side of the discontinuity:
$$\hat{\alpha} = \frac{\sum_i k\!\left(\frac{x_i - \bar{x}}{h}\right) y_i d_i}{\sum_j k\!\left(\frac{x_j - \bar{x}}{h}\right) d_j} - \frac{\sum_i k\!\left(\frac{x_i - \bar{x}}{h}\right) y_i (1 - d_i)}{\sum_j k\!\left(\frac{x_j - \bar{x}}{h}\right) (1 - d_j)}$$
If in addition we use the rectangular kernel, $\hat{\alpha}$ simplifies to the difference of the means of $y_i$ in the bandwidths above and below the cutoff.

Case 2 If the rectangular kernel is used, the local least squares estimator of $y_i$ on a polynomial in $(x_i - \bar{x})$ of order p is obtained on the neighborhood on each side of the cutoff. Moreover, if the condition $\hat{\beta}_{p+} = \hat{\beta}_{p-}$ is imposed, the estimator $\hat{\alpha}_p$ is the coefficient on $d_i$ in the OLS regression of $y_i$ on $d_i$ and the polynomial in $(x_i - \bar{x})$ of order p using the data inside the bandwidth on both sides of the cutoff.

In both the theoretical and applied literatures, emphasis has been given to the case in which a linear model (p = 1) in X is fitted on each side of the cutoff (Hahn, Todd, and Van der Klaauw, 1999, 2001; Imbens and Lemieux, 2008; Lee and Lemieux, 2009).

4 Note that in the sharp RD design, $d_i$ will be identical to the treatment assignment variable $D_i$ since the probability of being treated is zero below the threshold and one above it.
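As a concrete sketch, the order-p estimator above amounts to two weighted least squares fits, one on each side of the cutoff, followed by a difference of intercepts. The code below is a minimal illustration; the function name and the Bartlett (triangular) kernel default are my own choices, not prescriptions from the text.

```python
import numpy as np

def local_poly_rd(y, x, cutoff, h, p=1,
                  kernel=lambda u: np.maximum(1.0 - np.abs(u), 0.0)):
    """Order-p local polynomial sharp-RD estimate: weighted polynomial fit
    in (x - cutoff) on each side of the cutoff, difference of intercepts.
    The default kernel is the Bartlett (triangular) kernel."""
    u = (x - cutoff) / h
    w = kernel(u) / h                                  # kernel weights
    Z = np.vander(x - cutoff, p + 1, increasing=True)  # columns 1, (x-c), ..., (x-c)^p
    intercepts = []
    for side in (x >= cutoff, x < cutoff):
        ws = w * side                                  # keep one side only
        WZ = Z * ws[:, None]
        beta = np.linalg.solve(Z.T @ WZ, WZ.T @ y)     # weighted least squares
        intercepts.append(beta[0])                     # alpha_{p+} or alpha_{p-}
    return intercepts[0] - intercepts[1]

# Example on hypothetical data with a jump of 2 at the cutoff; the local
# linear fit (p = 1) captures the linear m(x), so the estimate is unbiased.
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 5000)
y = 1.0 + 0.5 * x + 2.0 * (x >= 0) + rng.normal(0.0, 0.3, x.size)
print(local_poly_rd(y, x, cutoff=0.0, h=0.5, p=1))  # close to 2
```

With p = 0 the same routine reduces to the Nadaraya-Watson estimator of case 1, since the design matrix collapses to a column of ones.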
Case 3 If p = 1, the local linear estimator is obtained. For the rectangular kernel, $\hat{\alpha}$ simplifies to the difference of the intercepts from the linear regressions of $y_i$ on 1 and $(x_i - \bar{x})$ in the ranges above and below the cutoff. If, additionally, $\hat{\beta}_{1+} = \hat{\beta}_{1-}$ is imposed, the estimator of the ATE of interest is the coefficient on $d_i$ in the OLS regression of $y_i$ on 1, $d_i$ and $(x_i - \bar{x})$ using the data inside the ranges on both sides of the cutoff.

2.4 Assumptions

To derive the asymptotic distribution of the estimator of τ, the following assumptions are sufficient.

Assumption 4 (a) $k(\cdot)$ is a symmetric, bounded, Lipschitz function, zero outside a bounded set; $\int k(u)\,du = 1$. (b) For a positive integer s, $\int k(u)\,u^j\,du = 0$, $1 \le j \le s - 1$.

Assumption 4 allows for higher order kernels,5 and a bounded support set for the kernel avoids the use of a trimming function. Let $f_o$ denote the marginal density of x and let $m(x)$ denote the conditional expectation of y given x minus the discontinuity, i.e., $m(x) = E[y \mid x] - \alpha \mathbf{1}[x \ge \bar{x}]$, where $\bar{x}$ is the value of the running variable at which the discontinuity occurs.

Assumption 5 Suppose the data $(y_i, x_i)_{i=1,2,\ldots,n}$ are i.i.d. and α is defined by
$$\alpha = \lim_{x \downarrow \bar{x}} E[y \mid X = x] - \lim_{x \uparrow \bar{x}} E[y \mid X = x]$$
For some compact interval ℵ of x with $\bar{x} \in \mathrm{int}(\aleph)$, $f_o$ is $l_f$ times continuously differentiable and bounded away from zero; $m(x)$ is $l_m$ times continuously differentiable for $x \in \aleph \setminus \{\bar{x}\}$, and m is continuous at $\bar{x}$ with finite right and left-hand derivatives up to order $l_m$.

In the sharp RD design $\tau^S = \alpha$ and the average treatment effect is obtained directly from the discontinuity in the conditional expectation of Y. In the following, I discuss the estimation of α and interpret it as the estimate of the average treatment effect of interest.

5 If s ≥ 3, the kernel has to be negative in some region of its domain to satisfy part (b) of the assumption.
For the fuzzy RD design the average treatment effect will be given by the ratio of two such discontinuities, in the conditional expectations of the outcome and of the probability of receiving “treatment”. Assumption 5 guarantees smoothness of the density of x and of the conditional expectation of y on both sides of the discontinuity while allowing for different right- and left-side derivatives of m at $\bar{x}$. Also, bounding the density of x away from zero in the neighborhood around $\bar{x}$ guarantees there is density (“data”) around the discontinuity with which to estimate the jump size. Assumption 6 describes the behavior of the moments of the outcome variable around the discontinuity. Define $\varepsilon = y - E[y \mid X = x] = y - m(x) - \alpha \mathbf{1}[x \ge \bar{x}]$.

Assumption 6 (a) $\sigma^2(x) = E[\varepsilon^2 \mid X = x]$ is continuous for $x \neq \bar{x}$, $x \in \aleph$, and right and left-hand limits at $\bar{x}$ exist. (b) For some $\zeta > 0$, $E[|\varepsilon|^{2+\zeta} \mid X = x]$ is uniformly bounded on ℵ.

Assumption 6(a) allows the conditional variance of the outcome variable to be a function of the running variable and ensures it is well behaved around the cutoff. Part (b) bounds the moments so that a central limit theorem can be applied. The fixed-h asymptotic distributions described in section 2.5 do not require additional assumptions beyond what is used in the standard, small-h literature, e.g., Hahn, Todd, and Van der Klaauw (2001); Porter (2003), etc.

2.5 Asymptotic Distributions

This section develops the asymptotic distribution of the local polynomial estimator of the average treatment effect for a fixed bandwidth, h.

Theorem 6 Suppose Assumptions 4(a) and 6 hold, and Assumption 5 holds with $l_m \ge p + 1$ and $l_f$ any nonnegative integer. If h is fixed and positive, then as $n \to \infty$,
$$\sqrt{nh}\,(\hat{\alpha}_p - \alpha^*_p) \xrightarrow{d} N\!\left(0, V_{fixed\text{-}h}\right) \quad (2.1)$$
where
$$V_{fixed\text{-}h} = e_1' \left( \Gamma^{*\,-1}_{-} \Delta^*_{-} \Gamma^{*\,-1}_{-} + \Gamma^{*\,-1}_{+} \Delta^*_{+} \Gamma^{*\,-1}_{+} \right) e_1 \quad (2.2)$$
$$\alpha^*_p = \alpha + B_{fixed\text{-}h}$$
$$B_{fixed\text{-}h} = e_1' \left[ \Gamma^{*\,-1}_{+} \int_0^\infty k(u)\, Z(\bar{x} + uh)\, m(\bar{x} + uh)\, f_o(\bar{x} + uh)\,du - \Gamma^{*\,-1}_{-} \int_0^\infty k(u)\, Z(\bar{x} - uh)\, m(\bar{x} - uh)\, f_o(\bar{x} - uh)\,du \right] \quad (2.3)$$
and
$$\Gamma^*_{+(-)} = \begin{pmatrix} \gamma_0^{+(-)} & \cdots & \gamma_p^{+(-)} \\ \vdots & \ddots & \vdots \\ \gamma_p^{+(-)} & \cdots & \gamma_{2p}^{+(-)} \end{pmatrix}, \quad \Delta^*_{+(-)} = \begin{pmatrix} \delta_0^{+(-)} & \cdots & \delta_p^{+(-)} \\ \vdots & \ddots & \vdots \\ \delta_p^{+(-)} & \cdots & \delta_{2p}^{+(-)} \end{pmatrix},$$
$$e_1 = (1, 0, \ldots, 0)', \quad \gamma_j^{+} = \int_0^\infty k(u)\,u^j f_o(\bar{x} + uh)\,du, \quad \gamma_j^{-} = (-1)^j \int_0^\infty k(u)\,u^j f_o(\bar{x} - uh)\,du,$$
$$\delta_j^{+} = \int_0^\infty k^2(u)\,u^j \sigma^2(\bar{x} + uh) f_o(\bar{x} + uh)\,du, \quad \delta_j^{-} = (-1)^j \int_0^\infty k^2(u)\,u^j \sigma^2(\bar{x} - uh) f_o(\bar{x} - uh)\,du$$

The proof is given in the appendix. Theorem 6 provides the asymptotic distribution of the local polynomial estimator of the parameter of interest for any bandwidth value. The formula for the asymptotic variance explicitly takes into consideration the choice of bandwidth, without assuming h → 0. The fixed-h approach used in theorem 6 captures the impact of h on the asymptotic variance, $V_{fixed\text{-}h}$. Even though the asymptotic variance formulas are somewhat cumbersome, they are still functions of known data and can be calculated for given functions $f_o(x)$ and $\sigma^2(x)$ or estimated in a dataset (see section 2.6).

The bias term that arises under the fixed-h assumption does not vanish as the sample size increases, as the standard approximations suggest; for a given bandwidth, it converges to $B_{fixed\text{-}h}$. The bias is the difference of the (scaled) linear projections of $m(x)$ on Z evaluated at $x = \bar{x}$ (i.e., the difference in intercepts) inside the bandwidth above and below the cutoff. Intuitively, the bias in $\hat{\alpha}$ is the difference between the conditional expectations of the outcome above and below the cutoff that would have arisen in the absence of treatment, i.e., the difference that would have occurred regardless and is erroneously attributed to the treatment or policy being analyzed. The fixed-h approach tackles the bias problem “head on”, making explicit the impact of the bandwidth choice on the bias of the estimate obtained.
The local polynomial approach mitigates the bias problem if it is able to approximate m(x) appropriately, since by using the higher order polynomial terms it partially captures changes in m(x) above and below the cutoff that would exist even in the absence of treatment. Note that, as h → 0, the results for the asymptotic distribution of $\hat{\alpha}$ in theorem 6 approach the asymptotic variance and bias of small-h asymptotics (Porter, 2003).

Corollary 2 If the conditions in theorem 6 hold and h → 0, then the asymptotic variance and bias for $\hat{\alpha}_p$ are equal to the small-h approximations (Porter, 2003)
$$V_{small\text{-}h} = \frac{\sigma^{2+}(\bar{x}) + \sigma^{2-}(\bar{x})}{f_o(\bar{x})}\; e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1 \quad (2.4)$$
$$\alpha^*_{small\text{-}h} = \alpha + B_{small\text{-}h}$$
$$B_{small\text{-}h} = \lim_{h \to 0}\, h^{p+1}\, \frac{m^{(p+1)+}(\bar{x}) - (-1)^{p+1} m^{(p+1)-}(\bar{x})}{(p+1)!}\; e_1' \Gamma^{-1} \begin{pmatrix} \gamma_{p+1} \\ \vdots \\ \gamma_{2p+1} \end{pmatrix} \quad (2.5)$$
and
$$\Gamma = \begin{pmatrix} \gamma_0 & \cdots & \gamma_p \\ \vdots & \ddots & \vdots \\ \gamma_p & \cdots & \gamma_{2p} \end{pmatrix}, \quad \Delta = \begin{pmatrix} \delta_0 & \cdots & \delta_p \\ \vdots & \ddots & \vdots \\ \delta_p & \cdots & \delta_{2p} \end{pmatrix},$$
$e_1 = (1, 0, \ldots, 0)'$, $\gamma_j = \int_0^\infty k(u)\,u^j\,du$, $\delta_j = \int_0^\infty k^2(u)\,u^j\,du$, and $m^{(l)+(-)}(\bar{x})$ is the $l$th right (left)-hand derivative of $m(x)$ at the point $\bar{x}$.

Additionally, the small-h asymptotic variance and bias in corollary 2 are equal to those obtained by assuming that $f_o(x)$ and $\sigma^2(x)$ are constant around the cutoff and that $m(x)$ can be exactly approximated by a polynomial of order p + 1.

Corollary 3 Under the assumptions in theorem 6, suppose h > 0 and, in the bandwidth around the cutoff, $f_o(x)$ and $\sigma^2(x)$ are constant and $m(x)$ can be exactly approximated by an expansion of order p + 1. Then the asymptotic variance and bias of $\sqrt{nh}\,(\hat{\alpha}_p - \alpha)$ obtained by fixed-h (theorem 6) and small-h (Porter, 2003) are the same.

Focusing on the formula for the asymptotic variance in both the fixed-h and small-h approaches, it is clear that the refinements obtained by fixed-h are due to incorporating the behavior of $f_o(x)$ and $\sigma^2(x)$ in the ranges around the cutoff, while small-h ignores it by considering only the values at the cutoff, $f_o(\bar{x})$ and $\sigma^2(\bar{x})$.
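The equivalence in corollary 3 can be checked numerically for the Nadaraya-Watson case (p = 0) with a rectangular kernel on [−1, 1], for which the scaling $e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1$ equals 1. The density and variance functions below are illustrative choices, not taken from the text: with constant $f_o$ and $\sigma^2$ the two variances coincide, while heteroskedasticity inside the bandwidth drives them apart.

```python
import numpy as np

def integral(vals, u):
    # simple trapezoidal rule, to keep the sketch dependency-free
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(u)))

xbar, h = 0.0, 1.0
u = np.linspace(0.0, 1.0, 20001)   # rectangular kernel k(u) = 1/2 on [-1, 1]

def v_fixed(f, s2):
    """V_fixed-h for p = 0: delta_0 / gamma_0^2 on each side, summed."""
    out = 0.0
    for sign in (+1.0, -1.0):
        pts = xbar + sign * u * h
        num = integral(0.25 * s2(pts) * f(pts), u)   # k^2(u) = 1/4
        den = integral(0.5 * f(pts), u) ** 2         # k(u)   = 1/2
        out += num / den
    return out

def v_small(f, s2):
    """Small-h formula: (sigma2+ + sigma2-)/f at the cutoff, scaling = 1."""
    return 2.0 * s2(np.array([xbar]))[0] / f(np.array([xbar]))[0]

f_const = lambda x: 0.5 + 0.0 * x
s2_const = lambda x: 1.0 + 0.0 * x
s2_het = lambda x: 1.0 + 2.0 * np.abs(x)   # heteroskedastic in the bandwidth

print(v_fixed(f_const, s2_const), v_small(f_const, s2_const))  # both 4.0
print(v_fixed(f_const, s2_het), v_small(f_const, s2_het))      # 8.0 vs 4.0
```

With the constant functions both formulas give 4.0; once $\sigma^2$ varies inside the bandwidth, the fixed-h variance (8.0 here) departs from the small-h value, which only sees the cutoff point.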
Hence, the benefits of using fixed-h asymptotics are expected to be larger when the density of X and the conditional variance change markedly inside the bandwidths around the cutoff; i.e., heteroskedasticity inside the bandwidth could lead to poor performance of the small-h variance approximation relative to fixed-h. It is relevant to note that both the fixed-h and small-h asymptotic approximations are based on the same estimator of α. For a given bandwidth the bias present in the estimate is set. Small-h asymptotics may lead one to ignore the bias by arguing that the bandwidth was chosen to “undersmooth”. However, once a bandwidth is chosen the bias is given and should not be ignored. To clarify the intuition behind the results in theorem 6, it is interesting to analyze the special case of the Nadaraya-Watson estimator.

Case 4 For the Nadaraya-Watson estimator we have
$$\sqrt{nh}\,(\hat{\alpha} - \alpha^*_{NW}) \xrightarrow{d} N(0, V_{NW})$$
where
$$V_{NW} = \frac{\int_0^\infty k^2(u)\,\sigma^{2}(\bar{x}+uh)\, f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty k(u)\, f_o(\bar{x}+uh)\,du\right)^2} + \frac{\int_0^\infty k^2(u)\,\sigma^{2}(\bar{x}-uh)\, f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty k(u)\, f_o(\bar{x}-uh)\,du\right)^2}$$
$$\alpha^*_{NW} = \alpha + B_{NW}$$
$$B_{NW} = \frac{\int_0^\infty k(u)\,m(\bar{x}+uh)\, f_o(\bar{x}+uh)\,du}{\int_0^\infty k(u)\, f_o(\bar{x}+uh)\,du} - \frac{\int_0^\infty k(u)\,m(\bar{x}-uh)\, f_o(\bar{x}-uh)\,du}{\int_0^\infty k(u)\, f_o(\bar{x}-uh)\,du}$$
If the rectangular kernel is used, the asymptotic variance and bias simplify to
$$V_{NW} = \frac{\int_0^\infty \sigma^{2}(\bar{x}+uh)\, f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty f_o(\bar{x}+uh)\,du\right)^2} + \frac{\int_0^\infty \sigma^{2}(\bar{x}-uh)\, f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty f_o(\bar{x}-uh)\,du\right)^2}$$
$$B_{NW} = \frac{\int_0^\infty m(\bar{x}+uh)\, f_o(\bar{x}+uh)\,du}{\int_0^\infty f_o(\bar{x}+uh)\,du} - \frac{\int_0^\infty m(\bar{x}-uh)\, f_o(\bar{x}-uh)\,du}{\int_0^\infty f_o(\bar{x}-uh)\,du}$$
The asymptotic variance is a weighted average of the conditional variance of Y above and below the cutoff, and the asymptotic bias is simply the difference in the (local) averages of m(x) above and below the cutoff, i.e., the difference in outcomes that would have arisen even in the absence of treatment. Intuitively, it is interesting to draw a parallel between the results in theorem 6 and the issue of model misspecification.
The problem of estimating the ATE at the cutoff discussed here can be seen as one of correctly estimating $E[Y \mid X]$ on both sides of the cutoff. In this sense, the local polynomial estimator is a polynomial approximation to the unknown conditional expectation inside the bandwidth on each side, no different from standard parametric methods. By choosing a relatively small bandwidth we are fitting the conditional expectation on a restricted support and, hence, expect a polynomial of order p to produce a better fit than if we were trying to fit $E[Y \mid X]$ globally. This is the benefit associated with a nonparametric approach, since it allows the conditional expectation to be unrestricted outside the bandwidth. Clearly, one does not expect the conditional expectation in the bandwidth to be completely described by a polynomial of the chosen order p, so we can draw some intuition by viewing the asymptotic results in theorem 6 as those arising from potentially misspecified models (White, 1982, 1996). The estimator converges to $\alpha^*$, which is not equal to the parameter of interest but still provides relevant information about the population.

2.5.1 Fuzzy Regression Discontinuity Design

In the fuzzy RD design the estimator of the parameter of interest is given by the ratio
$$\hat{\tau}^F = \frac{\hat{\alpha}}{\hat{\theta}}$$
where $\hat{\alpha}$ is any of the estimators described in the previous section and $\hat{\theta}$ is the estimator of the change in the probability of being in the treated group at the cutoff. Note that $\hat{\theta}$ is obtained by using the estimators described above with the treatment assignment variable, $D_i$, as the dependent variable. To obtain the asymptotic distribution of the fuzzy RD estimator, the delta method can be used, similarly to the result in Porter (2003).
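In code, the ratio estimator and its delta-method standard error can be sketched as follows; the numerical inputs are hypothetical placeholders for the variance and covariance terms of the joint normal limit.

```python
import numpy as np

def fuzzy_rd(alpha_hat, theta_hat, V_a, V_t, C_at, nh):
    """Fuzzy-RD point estimate alpha/theta and its delta-method standard
    error, given the asymptotic variance/covariance terms of the joint
    normal limit of sqrt(nh)*(alpha_hat, theta_hat)."""
    tau_f = alpha_hat / theta_hat
    # Delta-method asymptotic variance of the ratio:
    #   V_a/theta^2 - 2*alpha*C_at/theta^3 + alpha^2*V_t/theta^4
    V = (V_a / theta_hat**2
         - 2.0 * alpha_hat * C_at / theta_hat**3
         + alpha_hat**2 * V_t / theta_hat**4)
    return tau_f, np.sqrt(V / nh)

# Hypothetical numbers: a jump of 1.2 in E[Y|X] and of 0.6 in Pr(treatment).
tau_f, se = fuzzy_rd(1.2, 0.6, V_a=2.0, V_t=0.25, C_at=0.1, nh=500.0)
print(tau_f, se)  # 2.0 and about 0.12
```

In practice the limits $\alpha^*$ and $\theta^*$ include the fixed-h bias terms discussed below; the sketch simply plugs in the point estimates.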
Theorem 7 If
$$\sqrt{nh} \begin{pmatrix} \hat{\alpha} - \alpha^* \\ \hat{\theta} - \theta^* \end{pmatrix} \xrightarrow{d} N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} V_\alpha & C_{\alpha\theta} \\ C_{\alpha\theta} & V_\theta \end{pmatrix} \right)$$
then
$$\sqrt{nh} \left( \frac{\hat{\alpha}}{\hat{\theta}} - \frac{\alpha^*}{\theta^*} \right) \xrightarrow{d} N\!\left(0,\; \frac{1}{\theta^{*2}} V_\alpha - 2\frac{\alpha^*}{\theta^{*3}} C_{\alpha\theta} + \frac{\alpha^{*2}}{\theta^{*4}} V_\theta \right)$$
where $\alpha^* = \alpha + B_\alpha$, $\theta^* = \theta + B_\theta$, and $B_\alpha$ and $B_\theta$ are the bias terms for the estimators as defined in theorem 6 for local polynomial estimators.

The proof of the proposition follows directly from the delta method and is omitted. The condition of multivariate normality required in this proposition follows from the usual multivariate central limit theorem using a Cramer-Wold device (James, 2004; Pagan and Ullah, 1999). Note that
$$\frac{\alpha^*}{\theta^*} = \frac{\alpha + B_\alpha}{\theta + B_\theta} = \frac{\alpha}{\theta}\,\frac{\theta}{\theta + B_\theta} + \frac{B_\alpha}{\theta}\,\frac{\theta}{\theta + B_\theta} \quad (2.6)$$
For given values of α and θ, if $|\theta| < |\theta + B_\theta|$ then $0 \le \frac{\theta}{\theta + B_\theta} \le 1$. Clearly, if there is no bias in the estimates of α or θ, i.e., $B_\alpha = 0$ and $B_\theta = 0$, the fuzzy design RD estimator will be consistent for the true treatment effect. If $B_\alpha = 0$ and $B_\theta \neq 0$, the estimator will suffer an attenuation bias and tests of the null hypothesis that the treatment is unimportant will be conservative. If $B_\alpha \neq 0$ and $B_\theta = 0$, the estimator's bias is similar to the one seen in the sharp RD design, only scaled by $\frac{1}{\theta}$. Finally, if $B_\alpha \neq 0$ and $B_\theta \neq 0$, any increase in $B_\alpha$ increases the bias in the ATE estimator, but there will be a trade-off regarding the size of $B_\theta$ since its impacts on the first and second terms go in opposite directions.

All the terms that appear in the asymptotic distribution above, except for $C_{\alpha\theta}$, can be obtained from theorem 6 by using the local polynomial estimators discussed in section 2.3. It is necessary to specify $C_{\alpha\theta}$ in order to obtain the asymptotic distribution of the estimator in the fuzzy RD design.

Theorem 8 Suppose $\sigma_{\varepsilon\eta}(x) = E[\varepsilon\eta \mid X = x]$ is continuous for $x \neq \bar{x}$, $x \in \aleph$, and the left and right-hand limits at $\bar{x}$ exist. If $\hat{\alpha}$ and $\hat{\theta}$ are the local polynomial estimators and the conditions of theorem 6 hold for both estimators, then
$$C_{\alpha\theta} = e_1' \left( \Gamma^{*\,-1}_{+} \Delta^{\rho}_{+} \Gamma^{*\,-1}_{+} + \Gamma^{*\,-1}_{-} \Delta^{\rho}_{-} \Gamma^{*\,-1}_{-} \right) e_1$$
where
$$\Delta^{\rho}_{+(-)} = \begin{pmatrix} \rho_0^{+(-)} & \cdots & \rho_p^{+(-)} \\ \vdots & \ddots & \vdots \\ \rho_p^{+(-)} & \cdots & \rho_{2p}^{+(-)} \end{pmatrix},$$
$\rho_j^{+} = \int_0^\infty k^2(u)\,u^j \sigma_{\varepsilon\eta}(\bar{x}+uh) f_o(\bar{x}+uh)\,du$, $\rho_j^{-} = (-1)^j \int_0^\infty k^2(u)\,u^j \sigma_{\varepsilon\eta}(\bar{x}-uh) f_o(\bar{x}-uh)\,du$, and $\Gamma^*_{+}$ and $\Gamma^*_{-}$ are defined as in the previous corollaries.

As h → 0, the standard small-h asymptotic covariance is the same as the one in theorem 8, as one would expect.

Corollary 4 Letting h → 0 in the expressions of theorem 8, the asymptotic covariance, $C_{\alpha\theta}$, obtained by fixed-h (theorem 8) and small-h (Porter, 2003) are the same:
$$C_{\alpha\theta} = \frac{\sigma^{+}_{\varepsilon\eta}(\bar{x}) + \sigma^{-}_{\varepsilon\eta}(\bar{x})}{f_o(\bar{x})}\; e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1$$
Also, a result similar to corollary 3 is readily available.

Corollary 5 Under the assumptions in theorem 8, if h > 0 and, in the bandwidth around the cutoff, $f_o(x)$ and $\sigma_{\varepsilon\eta}(x)$ are constant, then the asymptotic covariance, $C_{\alpha\theta}$, obtained by fixed-h (theorem 8) and small-h (Porter, 2003) are the same.

In the case of the Nadaraya-Watson estimator, the asymptotic covariance simplifies in similar fashion to the asymptotic variance in case 4 and provides intuition about the refinements obtained by the fixed-h asymptotic distribution relative to small-h. Those improvements arise from incorporating the behavior of $\sigma_{\varepsilon\eta}(x)$ and $f_o(x)$ in the range around the cutoff, while small-h does not.

Case 5 For the Nadaraya-Watson estimator we have
$$C_{\alpha\theta} = \frac{\int_0^\infty k^2(u)\,\sigma_{\varepsilon\eta}(\bar{x}+uh) f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty k(u)\, f_o(\bar{x}+uh)\,du\right)^2} + \frac{\int_0^\infty k^2(u)\,\sigma_{\varepsilon\eta}(\bar{x}-uh) f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty k(u)\, f_o(\bar{x}-uh)\,du\right)^2}$$
If the rectangular kernel is used, $C_{\alpha\theta}$ simplifies to
$$C_{\alpha\theta} = \frac{\int_0^\infty \sigma_{\varepsilon\eta}(\bar{x}+uh) f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty f_o(\bar{x}+uh)\,du\right)^2} + \frac{\int_0^\infty \sigma_{\varepsilon\eta}(\bar{x}-uh) f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty f_o(\bar{x}-uh)\,du\right)^2}$$

2.6 Variance Estimators

To perform inference about α using the information in a given sample, appropriate estimates of the unknown terms in the asymptotic variance formulas from theorem 6 are necessary.
Note that the components of the asymptotic variance of $\sqrt{nh}\,(\hat{\alpha}_p - \alpha^*_p)$ can be written as expectations of population quantities and estimated using sample analogues. We have
$$\gamma_j^{+} = E\left[h^{-1} k\!\left(\tfrac{x - \bar{x}}{h}\right) \left(\tfrac{x - \bar{x}}{h}\right)^{\!j} d\right], \quad \gamma_j^{-} = E\left[h^{-1} k\!\left(\tfrac{x - \bar{x}}{h}\right) \left(\tfrac{x - \bar{x}}{h}\right)^{\!j} (1 - d)\right],$$
$$\delta_j^{+} = E\left[h^{-1} k^2\!\left(\tfrac{x - \bar{x}}{h}\right) \left(\tfrac{x - \bar{x}}{h}\right)^{\!j} d\,\varepsilon^2\right], \quad \delta_j^{-} = E\left[h^{-1} k^2\!\left(\tfrac{x - \bar{x}}{h}\right) \left(\tfrac{x - \bar{x}}{h}\right)^{\!j} (1 - d)\,\varepsilon^2\right]$$
Then, define the sample analog estimators of those quantities as
$$\hat{\gamma}_j^{+} = (nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) \left(\tfrac{x_i - \bar{x}}{h}\right)^{\!j} d_i, \quad \hat{\gamma}_j^{-} = (nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) \left(\tfrac{x_i - \bar{x}}{h}\right)^{\!j} (1 - d_i),$$
$$\hat{\delta}_j^{+} = (nh)^{-1} \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) \left(\tfrac{x_i - \bar{x}}{h}\right)^{\!j} d_i\,\hat{\varepsilon}_i^2, \quad \hat{\delta}_j^{-} = (nh)^{-1} \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) \left(\tfrac{x_i - \bar{x}}{h}\right)^{\!j} (1 - d_i)\,\hat{\varepsilon}_i^2;$$
these are consistent by standard arguments based on the law of large numbers. The residuals used in these estimators depend on the order of the local polynomial used to estimate the average treatment effect of interest and are given by
$$\hat{\varepsilon}_i = y_i - d_i \left( \hat{\alpha}_{p+} + \hat{\beta}_{1,p+}(x_i - \bar{x}) + \cdots + \hat{\beta}_{p,p+}(x_i - \bar{x})^p \right) - (1 - d_i) \left( \hat{\alpha}_{p-} + \hat{\beta}_{1,p-}(x_i - \bar{x}) + \cdots + \hat{\beta}_{p,p-}(x_i - \bar{x})^p \right)$$
Even though these estimators require the calculation of 4(2p + 1) terms6 to obtain the plug-in estimator of the fixed-h variance-covariance matrix,
$$\hat{\Gamma}^{*\,-1}_{+} \hat{\Delta}^{*}_{+} \hat{\Gamma}^{*\,-1}_{+} + \hat{\Gamma}^{*\,-1}_{-} \hat{\Delta}^{*}_{-} \hat{\Gamma}^{*\,-1}_{-}, \quad (2.7)$$
these are very simple averages of the data and kernel weights.

6 In fact, since we are interested only in the estimate of the ATE, $\hat{\alpha}$, one can potentially estimate only the terms of both matrices that show up in the element at the [1, 1] position of the variance-covariance matrix for the estimators.
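A sketch of the plug-in calculation: build the sample analogues $\hat{\gamma}_j$ and $\hat{\delta}_j$, assemble $\hat{\Gamma}^*$ and $\hat{\Delta}^*$ on each side, and read off the [1,1] element of the sandwich in (2.7). The function name and the Bartlett kernel default are mine, and the residuals are assumed to come from a prior local polynomial fit.

```python
import numpy as np

def fixed_h_variance(x, resid, cutoff, h, p=1,
                     kernel=lambda u: np.maximum(1.0 - np.abs(u), 0.0)):
    """Plug-in fixed-h variance for the order-p local polynomial sharp-RD
    estimator: e1'(G+^-1 D+ G+^-1 + G-^-1 D- G-^-1)e1, built from the
    sample analogues gamma_j-hat and delta_j-hat, j = 0, ..., 2p."""
    n = x.size
    u = (x - cutoff) / h
    k = kernel(u)
    d = (x >= cutoff).astype(float)
    out = 0.0
    for side in (d, 1.0 - d):
        # u^j carries the (-1)^j sign automatically on the side below the cutoff
        g = [(k * u**j * side).sum() / (n * h) for j in range(2 * p + 1)]
        dl = [(k**2 * u**j * side * resid**2).sum() / (n * h)
              for j in range(2 * p + 1)]
        G = np.array([[g[i + j] for j in range(p + 1)] for i in range(p + 1)])
        D = np.array([[dl[i + j] for j in range(p + 1)] for i in range(p + 1)])
        Gi = np.linalg.inv(G)
        out += (Gi @ D @ Gi)[0, 0]
    return out  # estimate of V_fixed-h; Var(alpha_hat) is this divided by n*h

# Example with hypothetical homoskedastic data (true errors stand in for residuals):
rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, 20000)
eps = rng.normal(0.0, 1.0, x.size)
print(fixed_h_variance(x, eps, cutoff=0.0, h=0.3, p=1))
```

Because the sample analogues use $u_i = (x_i - \bar{x})/h$ directly, the $(-1)^j$ factors in the population $\gamma_j^-, \delta_j^-$ are generated automatically by the negative values of $u_i$ below the cutoff.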
Porter (2003) suggests an estimator for the variance of $\hat{\alpha}$ using the small-h approximation in corollary 2, which requires only estimation of the conditional variance of the errors at the cutoff, approached from the right and from the left, and of the density of x at the cutoff:7
$$\hat{\sigma}^{2+}(\bar{x}) = \frac{(nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\,\hat{\varepsilon}_i^2}{\tfrac{1}{2}\, \hat{f}_o(\bar{x})}, \quad (2.8)$$
$$\hat{\sigma}^{2-}(\bar{x}) = \frac{(nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\,\hat{\varepsilon}_i^2}{\tfrac{1}{2}\, \hat{f}_o(\bar{x})}, \quad (2.9)$$
$$\hat{f}_o(\bar{x}) = (nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right); \quad (2.10)$$
then,
$$\frac{\hat{\sigma}^{2+}(\bar{x}) + \hat{\sigma}^{2-}(\bar{x})}{\hat{f}_o(\bar{x})}\; e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1 \quad (2.11)$$
is the estimator of the asymptotic variance. This small-h variance estimator avoids estimating each component of the matrices by assuming h → 0, which is similar to assuming that $f_o(x)$ and $\sigma^2(x)$ are constant in the bandwidth around the cutoff, as shown in corollary 3. The matrix $\Gamma^{-1} \Delta \Gamma^{-1}$ can be calculated directly because it is a deterministic function of the kernel. A drawback of the variance estimator in formula (2.11) is the need to estimate $f_o(\bar{x})$, which is not necessary if one uses the fixed-h variance estimator in formula (2.7). To obtain $\hat{f}_o(\bar{x})$ we need to choose a kernel and a bandwidth for the density estimator, increasing the number of tuning parameters to be chosen. A natural choice is the kernel and bandwidth used in the estimation of the parameter of interest. In section 2.7, I present evidence that using the same bandwidth not only saves one the trouble of choosing another bandwidth, but also provides more reliable inference than choosing a bandwidth that differs from the one used to estimate τ.

For the Nadaraya-Watson estimator, the variance estimator simplifies greatly.

7 The estimator presented in formula (2.11) is not exactly the one presented in Porter (2003). He never suggested a specific estimator for $f_o(\bar{x})$, so I chose the standard Rosenblatt-Parzen kernel estimator for $f_o(\bar{x})$ presented in Pagan and Ullah (1999).
Case 6 The fixed-h estimator of the asymptotic variance for the Nadaraya-Watson estimator is given by
$$\frac{(nh)^{-1} \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\,\hat{\varepsilon}_i^2}{\left[(nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\right]^2} + \frac{(nh)^{-1} \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\,\hat{\varepsilon}_i^2}{\left[(nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\right]^2}$$
$$= \frac{(nh) \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\,\hat{\varepsilon}_i^2}{\left[\sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\right]^2} + \frac{(nh) \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\,\hat{\varepsilon}_i^2}{\left[\sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\right]^2}$$
Let $n_l$ and $n_u$ be the number of observations used below and above the cutoff, respectively, and let $\bar{\varepsilon}^2_u = n_u^{-1} \sum_{\bar{x} \le x_i \le \bar{x}+h} \hat{\varepsilon}_i^2$ and $\bar{\varepsilon}^2_l = n_l^{-1} \sum_{\bar{x}-h \le x_i \le \bar{x}} \hat{\varepsilon}_i^2$. For the case of the rectangular kernel, the expression above simplifies to
$$\frac{nh}{n_u}\, \bar{\varepsilon}^2_u + \frac{nh}{n_l}\, \bar{\varepsilon}^2_l$$
and, if $n_u = n_l$, simplifies further to
$$\frac{2nh}{n_u + n_l} \left( \bar{\varepsilon}^2_u + \bar{\varepsilon}^2_l \right) \quad (2.12)$$
which is the estimator proposed for the asymptotic variance by Imbens and Lemieux (2008) in the local linear case, adapted for the Nadaraya-Watson estimator. Imbens and Lemieux (2008) propose the plug-in estimator in formula (2.12) for $\frac{\sigma^{2+}(\bar{x}) + \sigma^{2-}(\bar{x})}{f_o(\bar{x})}$ and obtain their estimate of the asymptotic variance of the local linear estimator by scaling it by $e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1$. Note that $e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1$ equals 4 for the local linear estimator and 1 for the Nadaraya-Watson estimator. If higher polynomial orders are used in the estimator, the only change in the formula for the variance estimator is the scaling term. In fact, both small-h variance estimators (Porter, 2003; Imbens and Lemieux, 2008) are based on an estimate of $\frac{\sigma^{2+}(\bar{x}) + \sigma^{2-}(\bar{x})}{f_o(\bar{x})}$ and a scaling parameter that depends on the order of the polynomial and the kernel used in the estimation of the parameter of interest. In section 2.7 I present simulation evidence on test coverage using the variance estimators in formulas (2.7) and (2.11).

2.7 Simulations

This section presents simulation evidence displaying the empirical coverage of a standard t-statistic used to perform inference about the treatment effect of interest. All simulations are based on a sharp RD design.
The objective of the simulations is to evaluate the relative performance of the asymptotic distributions obtained by fixed-h and small-h regarding inference about the parameter of interest. As shown by corollary 3, assuming that h → 0 provides asymptotic variance and bias approximations that are equal to the ones obtained by assuming that the probability density function of X and the conditional variance of the outcome are constant in the bandwidth around the cutoff. In fact, one would reasonably expect the approximations to be similar for bandwidth values close to zero. Evidence from the simulations presented below indicates that inference about the treatment effect of interest using the fixed-h theoretical approximation has better size behavior than the small-h approach, especially for larger bandwidths. Simulations using feasible estimators of the asymptotic variance indicate that tests based on the fixed-h approach can improve over tests based on small-h, especially for larger bandwidths and when some forms of heteroskedasticity are present; however, fixed-h can show slightly worse size behavior in tests that use small bandwidths.

Let X be the running variable, Y the outcome variable for which we would like to estimate the average treatment effect at the cutoff, and u the error term. The details of the simulations are listed below.

• Sample size (n): 750
• Number of replications of the experiment: 2,000
• X is drawn from a Normal(50, 100)
• u is drawn from a Normal(0, 1)
• The cutoff for receiving treatment is $\bar{x} = 55$.
• The treatment variable is defined as $d_i = \mathbf{1}[x_i \ge \bar{x}]$
• The bandwidths range from 0.2 to 20, i.e., from 1/50 to 2 standard deviations of the running variable.

The empirical coverages presented are the fraction of rejections in the 2,000 repetitions for a test of size 5% (two-sided). I analyze 5 different data generating processes (DGPs) for the outcome variable, Y.
• DGP 1: $y_i = \mu + \alpha d_i + u_i$
• DGP 2: $y_i = \mu + \beta_1 x_i + \alpha d_i + u_i$
• DGP 3: $y_i = \mu + \beta_1 x_i + \beta_2 x_i^2 + \alpha d_i + u_i$
• DGP 4: $y_i = \mu + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \alpha d_i + u_i$
• DGP 5: $y_i = \exp\!\left(\tfrac{x_i}{20}\right) + \alpha d_i + u_i$

The true values of the parameters are µ = 3, α = 10, β1 = 0.5, β2 = −0.005, β3 = 0.00002. Two estimators of the parameter of interest α are used in the simulations: the first is the Nadaraya-Watson estimator presented in case 1 and the second is the widely used local linear estimator presented in case 3. For both estimators I use the Bartlett kernel.8 The next subsection compares the test coverages obtained by the theoretical fixed-h and small-h asymptotic distributions derived in theorem 6 and corollary 2. The results obtained are infeasible since they depend on knowledge of $f_o(x)$, $\sigma^2(x)$ and $m(x)$ around the cutoff. Nevertheless, they demonstrate the theoretical improvements that fixed-h provides over small-h asymptotics. Subsection 2.7.2 compares the empirical coverages obtained with (feasible) estimated standard errors.

8 Similar results were obtained when using a rectangular kernel and a truncated Gaussian kernel (weighted so that the kernel integrates to one). They are available from the author upon request.

2.7.1 Simulations for Infeasible Inference

Nadaraya-Watson Estimator

The first set of figures9 shows the empirical coverage of the test of the (true) null hypothesis that α = 10 when the Nadaraya-Watson estimator is used to obtain $\hat{\alpha}$ and the (infeasible) variances for $\sqrt{nh}\,(\hat{\alpha} - \alpha^*)$ presented in theorem 6 and corollary 2 are used. For DGP 1, shown in figure B.1, the dependent variable does not depend on X directly and no bias is expected in the estimates of α using the Nadaraya-Watson estimator, since the relationship between Y and X would be correctly captured even for larger bandwidths.

9 See appendix.
As expected, the empirical sizes of the tests using the fixed-h and small-h standard error approximations behave very similarly for small bandwidths, but the differences increase with the bandwidth, suggesting that the fixed-h asymptotic distribution presented in theorem 6 provides a better approximation to the behavior of the estimator α̂. Figures B.2, B.3 and B.4 refer to DGP 2, in which Y is linearly related to X. In general, a large bias in the Nadaraya-Watson estimate is expected to arise for any bandwidth away from zero, since the estimator does not capture the relationship between Y and X. Hence, the estimates erroneously attribute differences in m(x) above and below the cutoff to the treatment or policy. The steep decline in the empirical coverage in figure B.2 reflects the deleterious effects of the bias on the estimate and on inference. This effect overwhelms the gains obtained by the better approximation for the variance of the estimates. Nevertheless, since the DGP, fo(x) and σ²(x) are known, I can obtain a bias approximation for this estimator using B_NW for fixed-h and B_small-h with p = 0 for small-h.10 Figure B.3 shows the empirical coverage for the (infeasible) bias-corrected test. To better understand the roles of the improved bias and standard error approximations separately, figure B.4 adds the empirical coverage that would be obtained if the small-h bias and the fixed-h variance approximation were used to obtain the test statistic, and vice-versa. In this case, the majority of the improvement is due to the better (infeasible) approximation of the bias, but the more precise calculation of the standard error provides a non-trivial improvement.

10 Formulas (2.6) and (2.5), respectively.
The results for the remaining DGPs are qualitatively similar to the ones observed for DGP 2 and the graphs are omitted for brevity.11 The "speed" with which the bias becomes a problem for inference varies depending on the DGP, but in general it becomes relevant for relatively small bandwidths. To be fair, the comparisons between the Nadaraya-Watson bias approximations under fixed-h and small-h are not adequate for any true model in which the relationship of y and x could be described by a polynomial of order higher than linear (or of order higher than p + 1 in the local polynomial case), since the small-h approximation (Porter, 2003) describes the bias as a function of derivatives of m(x), limiting its accuracy for more complex functional forms. Nevertheless, from the simulations it is clear that the fixed-h approximation for the bias developed in theorem 6 describes the asymptotic behavior of the estimator better than the small-h bias given by corollary 2. In summary, for all the simulations we have evidence that the asymptotic distribution of √(nh)(α̂ − α*) is better described by the fixed-h approach developed in theorem 6, which explicitly considers the effects of the choice of bandwidth, than by the standard small-h asymptotic approximation, which assumes that h → 0. The gains in the approximation are, as one would expect, larger for bandwidths further away from zero. The importance of the asymptotic bias is substantial in the Nadaraya-Watson estimator's case and serves as cautionary evidence of the risks of dismissing the presence of bias in the estimation by arguing some "undersmoothing" in the choice of bandwidth. The bias can be greatly reduced by the use of local polynomial estimators (see next subsection).

11 The graphs are available from the author upon request.

Local Polynomial Estimator

This section presents simulations in which the "empiricist's favorite" local linear estimator is used.
This estimator has been a staple of the applied literature that uses RD designs and has been shown to have nice theoretical bias reduction properties. For DGPs 1 and 2, no bias is expected in the estimates, since the local linear estimator correctly captures the relationship between Y and X inside any of the bandwidths used. The empirical coverage for DGP 2 is presented in figure B.5.12 For smaller bandwidths, the use of the small-h asymptotic variance generates empirical coverages similar to the ones obtained using the refined fixed-h variance approximation, but there is a significant decrease in the small-h coverage as the bandwidth increases, with the fixed-h approach outperforming the standard approximation in both DGPs 1 and 2. The improvement increases with the bandwidth size, as one would expect. For the remaining DGPs (X has a quadratic, cubic or exponential relationship to Y), both the asymptotic bias and variance approximations are relevant.13 Figures B.6, B.7 and B.8 show the empirical coverage under DGP 3. Figure B.6 compares the test coverages using the fixed-h versus small-h standard error approximations while ignoring the bias. It is clear that the general pattern observed up to this point remains, with fixed-h outperforming small-h, especially for larger bandwidths. Figure B.7 graphs the empirical coverage obtained by (infeasible) bias-corrected tests, and figure B.8 separates the gains due to the bias and variance refinements by adding the graphs of the "counterfactual" coverages

12 The empirical coverages for DGPs 1 and 2 are very similar. Only DGP 2's graph is reported here.

13 As discussed in section 2.5, the local polynomial estimator could be analyzed in a parametric framework as the problem of estimating a (potentially) misspecified model (White, 1982, 1996).
that calculate the test statistics using the small-h bias and fixed-h variance approximations, and vice-versa.14 For DGPs 3 and 4, the bias does not seem to be greatly important, being successfully reduced by the use of the local linear estimator. Naturally, then, the difference in bias approximations is not the main source of improvement in the empirical coverage, as can be seen in figure B.8. This contrasts with the results for the same DGPs using the Nadaraya-Watson estimator. In that case, the bias had a large effect on the test's coverage and the majority of the gains associated with the use of fixed-h asymptotics were due to the bias refinement. For DGP 5, even though the bias is substantially mitigated by the use of the local linear estimator, the empirical coverage is still significantly reduced for bandwidths larger than one standard deviation of the running variable (h = 10). As in the case of the Nadaraya-Watson estimator, the bias is an important component of the improvement in coverage, and the infeasible fixed-h bias correction correctly captures the bias and provides coverages that outperform the small-h approach. As described in section 2.5, the refinements obtained by fixed-h are due to considering the behavior of fo(x) and σ²(x) inside the bandwidth, while small-h in effect ignores it. Note that the previous simulations were based on DGPs with homoskedastic errors. In the presence of heteroskedasticity, one would expect the improvements of the fixed-h approximation to be even more important. To exemplify the distortions heteroskedasticity can create and how well the fixed-h asymptotic approximation can capture it, I have simulated empirical coverages for two heteroskedastic cases. In the first and second cases the standard deviation of the error term is defined as σ(x) = 1 + 0.25x² and σ(x) = 1 + 0.25(x − x̄)², respectively.15 The (infeasible) tests based

14 The graphs for DGPs 1, 4 and 5 are available from the author upon request.
on the fixed-h asymptotic approximation behave very well in both cases, highlighting its robustness. In the first case, for DGPs 2 and 3 (figures B.9 and B.10), the small-h asymptotic approximation holds up relatively well, with a pattern similar to the one obtained in the homoskedastic case. In contrast, in the second case, shown in figures B.11 and B.12, the small-h based test shows a steep decline16 in coverage as the bandwidth increases, since it is not able to properly capture the effect of the heteroskedasticity in its asymptotic variance. The difference in the small-h performance between the two cases provides useful intuition for when its weaknesses can prove most relevant. The second case was designed to be a "worst case scenario" heteroskedasticity for small-h asymptotics, since the conditional variance of the error at the cutoff, σ²(x̄), is at the extreme of the range of values assumed by σ²(x) in any given bandwidth. As can be seen from formula (2.4), the small-h and fixed-h asymptotic variances will be more similar the closer σ²(x̄) is to the "weighted average" of σ²(x) inside the bandwidths. In the first case, since σ²(x̄) is in the "middle" of the range of the conditional variance, the distortion produced by the heteroskedasticity is less marked than in the second case. Some points are worth emphasizing. First, the general pattern is that, as expected, the empirical coverages obtained using the fixed-h results from theorem 6 outperform the small-h approximations, especially for larger bandwidths. Second, both the refined asymptotic variance and asymptotic bias calculations improve the precision of inference relative to the standard approach.

15 These examples are not intended to be representative of any empirical problem commonly faced in the applied literature; they are intended to highlight the behavior of the fixed-h and small-h approximations in different heteroskedastic contexts.
For smaller bandwidths small-h asymptotics provide coverages similar to the fixed-h approach, making clear that the core difference is due to the suitability of the restrictions imposed on fo(x), σ²(x) and m(x) as the bandwidth increases (corollary 3). Naturally, those restrictions tend to be less realistic for larger bandwidths.

16 Note the change in the scale of the y-axis, which now encompasses the interval from 0 to 1.

Third, the use of the local linear estimator significantly reduces the coverage distortion17 introduced by the bias present in the estimates, even when the linear approximation does not fully capture the local relationship between Y and X. This is in line with the results in Porter (2003) and justifies the reliance on the local linear estimator in applications. Fourth, in the presence of heteroskedasticity, the small-h asymptotic approximation can have very poor performance, while the fixed-h approach still provides a reliable asymptotic approximation for the estimator's behavior.

2.7.2 Simulations for Feasible Inference

The simulations in the previous subsection established that the fixed-h asymptotic distribution approximations based on theorem 6 improve over the usual approximations in the literature, delivering better test size behavior by incorporating the researcher's choice of bandwidth into the formulas for the asymptotic variance and bias. In obtaining those results I used knowledge about the true DGP that is unavailable to the practitioner when implementing such estimators. As described in section 2.6, natural estimators for the asymptotic variance of the parameters of interest are readily available and can be easily calculated for a given sample. This section presents simulations of the empirical coverage of the tests using two different estimated standard errors. The first is based on the fixed-h asymptotic distribution and is given by formula (2.7). The second is proposed by Porter (2003) and described by formula (2.11).
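The mechanics of such a size experiment can be sketched as follows (my own illustrative code; the variance plug-in below is a generic kernel-weighted-mean variance, not formula (2.7) or (2.11) from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def bartlett(u):
    return np.maximum(1.0 - np.abs(u), 0.0)

def side_stats(x, y, cutoff, h, above):
    # Kernel-weighted mean on one side of the cutoff and a plug-in
    # variance for it; a generic stand-in, NOT formula (2.7) or (2.11).
    mask = (x >= cutoff) if above else (x < cutoff)
    w = bartlett((x[mask] - cutoff) / h)
    m = np.sum(w * y[mask]) / np.sum(w)
    v = np.sum((w * (y[mask] - m)) ** 2) / np.sum(w) ** 2
    return m, v

def rejection_rate(alpha0=10.0, n=750, reps=200, h=5.0, cutoff=55.0):
    # Fraction of two-sided 5% rejections of the true null under DGP 1
    rej = 0
    for _ in range(reps):
        x = rng.normal(50.0, 10.0, n)
        y = 3.0 + alpha0 * (x >= cutoff) + rng.normal(0.0, 1.0, n)
        m_p, v_p = side_stats(x, y, cutoff, h, True)
        m_m, v_m = side_stats(x, y, cutoff, h, False)
        t = (m_p - m_m - alpha0) / np.sqrt(v_p + v_m)
        rej += abs(t) > 1.96
    return rej / reps

rate = rejection_rate()
print(rate)  # should be in the vicinity of the 5% nominal size
```

The dissertation's experiments use 2,000 replications and the specific variance formulas above; this sketch only shows the structure of the rejection-rate calculation.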
Figures B.13 and B.14 present the empirical coverages for the tests based on the Nadaraya-Watson estimator for DGPs 1 and 2 described above.18 They also include the coverages obtained by the (infeasible) theoretical formulas,19 so one can compare the feasible coverage to the infeasible coverage in section 2.7.1. Figures B.15 and B.16 perform the same exercise using the local linear estimator for DGPs 2 and 3. In figure B.13,20 it is clear that even though both tests tend to overreject for small bandwidths, due to the small amount of data available in those cases, the coverages obtained with the fixed-h variance estimators provide meaningful improvements for larger bandwidths over the tests based on small-h variance estimators. For DGPs 2 through 5, the presence of a strong bias overwhelms the tests even for relatively small bandwidths, as expected given the results from section 2.7.1. Nevertheless, the general pattern that the variance estimators based on formula (2.7) reflect the theoretical gains is maintained. Similarly, when the local linear estimator is used, the empirical coverage obtained using the fixed-h standard error estimator incorporates the gains of improved inference described in the theory and shown in the infeasible simulations, even for large bandwidths. As in the Nadaraya-Watson case, the tests overreject for very small bandwidths, probably due to the relatively small amount of data available in these cases, but show very good size behavior for larger bandwidths.

17 Note the change in the y-axis' range relative to most cases presented in the Nadaraya-Watson simulations.

18 The graphs for DGPs 3, 4 and 5 in the Nadaraya-Watson estimator case and 1, 4 and 5 in the local linear estimator case are available from the author upon request.

19 Those lines, named fixed-h and small-h, are exactly the same as presented in figures B.1 and B.2, respectively.

20 In figure B.13, the Nadaraya-Watson estimator is used for DGP 1, correctly capturing the relationship between Y and X around the cutoff. As discussed in section 2.7.1, there is no asymptotic bias in this case.
Surprisingly, when the local linear estimator is used, the tests obtained using small-h standard error estimates behave very similarly to the fixed-h ones, especially for larger bandwidths, for which the results in section 2.7.1 would lead one to expect a significantly smaller coverage based on the small-h approach. In fact, the small-h estimators provide tests with better size behavior for small (close to zero) bandwidths, due to the fact that fixed-h standard errors require the estimation of several more terms for the components of the asymptotic variance, suffering more acutely from the restricted amount of data in the smaller bandwidths. It seems that the small-h variance estimators benefit from the fact that, in practice, one cannot actually restrict the bandwidth too close to zero. Since the estimator for the standard errors sums across the ε̂ᵢ², it (partially) captures the behavior of fo(x) in the range around the cutoff that the small-h asymptotic approximation ignores by forcing h → 0. As discussed in section 2.7.1, the presence of heteroskedasticity can generate substantial problems for the size of tests using the theoretical small-h approximation. Figures B.17 through B.20 show simulations of the coverage of feasible tests using the fixed-h and small-h asymptotic variance estimators in the two heteroskedastic cases described in section 2.7.1.21 The results clearly show that, differently from the homoskedastic case, the fixed-h variance estimator produces better test sizes than the one based on the small-h approach.
In the first case (figures B.17 and B.18), the use of the fixed-h estimator in formula (2.7) provides better empirical size for bandwidths larger than 5. It is important to emphasize that, even though the small-h tests have better sizes for small bandwidths, both tend to overreject due to constrained data availability, and a researcher would be ill advised to use too small a bandwidth. In figures B.19 and B.20, the second case, the fixed-h variance estimator produces tests with coverage very close to the test's nominal size, while for small-h the coverage rapidly increases to 1 as the bandwidth increases. Hence, there is evidence that heteroskedasticity can be accurately captured by tests based on fixed-h asymptotic approximations, while small-h estimators can produce tests that perform substantially worse. It is possible that these results are somewhat dependent on the DGPs chosen and, even though the empirical coverages obtained are similar under either of the asymptotic variance estimators in some cases, it seems the fixed-h standard error estimator is a "safer choice" for practitioners, since it is based on a more robust asymptotic approximation and its computation is very easy once a kernel and bandwidth are chosen. Using standard error estimates based on small-h asymptotics can lead to serious size distortions for larger bandwidths, especially in the presence of heteroskedasticity, even in the absence of bias. Furthermore, the fixed-h variance estimator has the advantage of not requiring the estimation of fo(x), which entails the choice of a (potentially different) kernel and bandwidth for fo(x). The additional choice of these two tuning parameters might significantly alter the empirical size of the tests performed about τ and depends on the discretion of the researcher.

21 In the first case the standard deviation of the error term is defined as σ(x) = 1 + 0.25x²; in the second case ("worst case scenario") it is given by σ(x) = 1 + 0.25(x − x̄)².
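In code, estimating fo at the cutoff with its own bandwidth hf looks roughly as follows (an illustrative sketch of the extra tuning step, not the dissertation's formula (2.10); the function name `bartlett_kde` is my own):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(50.0, 10.0, 750)   # running variable, sd = 10
cutoff = 55.0

def bartlett_kde(x, point, hf):
    # Kernel density estimate of fo at `point` with its own bandwidth hf;
    # the Bartlett kernel K(u) = 1 - |u| on [-1, 1] integrates to one,
    # so the estimator is mean(K((x - point)/hf)) / hf.
    u = (x - point) / hf
    return np.mean(np.maximum(1.0 - np.abs(u), 0.0)) / hf

# The true density of Normal(50, 100) at the cutoff is about 0.0352;
# very large hf values smooth over a wide region of the density.
for hf in (1.0, 5.0, 10.0, 20.0):
    print(hf, bartlett_kde(x, cutoff, hf))
est5 = bartlett_kde(x, cutoff, 5.0)
```

The point of the sketch is simply that hf is a second, freely chosen tuning parameter, separate from the bandwidth h used to estimate the treatment effect.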
To exemplify this issue, figures B.21 and B.22 show the simulated empirical coverages obtained using the small-h variance estimator for DGPs 2 and 3,22 using the Bartlett kernel, in five different scenarios. The scenarios differ by the choice of the bandwidth, hf, used in formula (2.10) to estimate fo(x). The first reproduces the small-h result described above by choosing the same bandwidth used to estimate τ, i.e., hf = h; the other lines are the empirical coverages obtained using bandwidths of 1, 5, 10 and 20 for fo(x),23 independent of the bandwidth used for τ̂. The choice of bandwidth in the estimation of fo(x) can have a relevant impact on the test coverages. Interestingly, choosing the same bandwidth as used in estimating the parameter of interest provides more stable empirical coverages over a wide range of h relative to the cases in which the bandwidths are different. The cautious practitioner using the small-h variance estimator would be well advised to choose the same bandwidth for both estimators. One key problem (especially with the Nadaraya-Watson estimator) is how to deal with the bias in practice. The bias is a main contributor to the divergence between the empirical and nominal sizes of the tests being performed. Adequate estimation of the bias based on the results in theorem 6 would require observing or estimating the counterfactual conditional expectation of y around the cutoff in the absence of treatment, which is not available in most cases where the RD design is relevant.24

22 The graphs for DGPs 1, 4 and 5 are available from the author upon request.

23 1/10, 1/2, 1 and 2 standard deviations of the running variable, respectively.

2.8 Conclusion

Regression discontinuity designs have been widely used in recent years by researchers in economics to obtain estimates of the treatment effect, τ. Special attention has been given to the use of local polynomial estimators to obtain the ATE of interest.
The standard literature on RD designs (Hahn, Todd, and Van der Klaauw, 2001; Porter, 2003; Imbens and Lemieux, 2008) assumes that the bandwidth around the discontinuity, h, shrinks fast enough towards zero, h → 0, to guarantee identification of the parameter of interest (small-h asymptotics). This chapter derives, in the RD design context, a refined asymptotic distribution for the local polynomial estimators by treating h as fixed. Fixed-h asymptotics explicitly acknowledges that a researcher has to choose a bandwidth to implement the estimator and is usually bounded in the ability to reduce the bandwidth by data availability constraints. The fixed-h asymptotic distributions obtained are more precise and provide refined inference relative to asymptotic distributions based on the small-h approach. The fixed-h asymptotic approximation provides more precise formulas for both the bias and the variance of the estimators of interest (theorem 6). The standard small-h asymptotic bias and variance can be obtained by allowing h → 0 in the fixed-h distribution (corollary 2). Also, when h > 0, the small-h result for the variance of the estimators is equivalent to assuming that the density of the running variable and the conditional variance of the dependent variable are constant around the cutoff (corollary 3). Similar results are shown for both sharp and fuzzy RD designs.

24 The small-h asymptotic bias approximation (Porter, 2003) lends itself to estimation, since estimates for m^(p+1) above and below the cutoff can be obtained. However, the results in section 2.7.1 indicate that this would be a relatively poor approximation. Furthermore, to a large degree, bias reduction could be obtained by increasing the order of the local polynomial fitted above and below the cutoff, reducing the "misspecification" in the model (see section 2.5).
Simulations provide evidence that fixed-h asymptotic distributions more accurately describe the behavior of both the bias and the variance than the usual small-h results used in the literature. This is reflected in improved test size, especially when larger bandwidths are used. Simple feasible estimators for the refined, fixed-h, standard errors are provided and shown in simulations to incorporate the theoretical gains of the improved approximations. These estimators are simple to implement and have the advantage of not requiring the estimation of the density of the running variable at the discontinuity. The fixed-h variance estimators can improve markedly over small-h estimators in the presence of heteroskedasticity and should be generally preferred. Simulations using heteroskedastic errors provide evidence that feasible tests based on the fixed-h approach obtain better coverage, outperforming small-h starting at relatively small bandwidths. Interestingly, in the case of the widely used local linear estimator with homoskedastic errors, the variance estimators based on small-h asymptotics suggested in the literature produce well behaved tests with size performance similar to the fixed-h variance estimators, performing better than the standard theory would predict. In other words, when errors are homoskedastic (inside the bandwidth) the inability of the empiricist to mimic what theory suggests ends up improving the properties of the tests and their robustness to the choice of bandwidth, relative to what the theory that spawned those estimators would have provided. The results indicate that the fixed-h standard error estimator is a "safer choice" for practitioners, since it is based on a more robust asymptotic approximation and its computation is very easy once a kernel and bandwidth are chosen.
CHAPTER 3

Asymptotic Properties of Quantile Regression for Standard Stratified Samples

3.1 Introduction

Quantile regression (QR) has been widely used in the social sciences in recent decades, and provides a useful characterization of the distributional features of the variables in which the researcher is interested. In economics, for example, a very natural use of quantile regression has been to analyze the wage structure and potential differences in the determinants of observed wages at different levels of the wage distribution, e.g., Albrecht et al. (2003); Buchinsky (1998, 2001); Machado and Mata (2005); Martins and Pereira (2004) and Melly (2005). In those analyses it is very common to use datasets generated by stratified sampling. In standard stratified sampling (SS sampling) the population is divided into J mutually exclusive, exhaustive strata, and a random sample of size N_j is taken from stratum j. As described by Wooldridge (2001), when stratification is based on exogenous variables it usually does not cause serious problems: the usual estimators that ignore stratification are consistent and asymptotically normal, and the usual variance estimators are still valid. When stratification is based on endogenous variables, the standard unweighted estimators are generally inconsistent. Wooldridge (2001) studies the asymptotic properties of general weighted M-estimators under SS sampling, which are consistent and asymptotically normal, and provides estimators for the standard errors of the parameters of interest that can be used to perform inference under general stratification. However, those results are not directly applicable to the quantile regression case due to the nonsmoothness of the objective function that produces the QR estimates.
This chapter fills that gap: it extends the analysis to the quantile regression case, analyzes the asymptotic properties of the weighted QR estimates under general SS sampling, and provides consistent estimators for the standard errors that take the stratification of the data into account. Under exogenous stratification the usual unweighted QR estimators remain valid, as do their standard error estimates.

3.2 The Quantile Regression Population Problem

We are interested in estimating the conditional quantile function (CQF) of a random variable Y conditional on a vector of q explanatory variables X. This is defined by

Q_τ(Y | X) = inf{y : F_Y(y | X) ≥ τ}

where τ ∈ (0, 1) indexes the τth quantile of the conditional distribution of Y. Let the CQF be described by a known function g(·) of the parameters and the explanatory variables,

Q_τ(Y | X) = g(X, β_τo)

A special case of interest is given by the linear model1

Y = X′β_τo + ε, with Q_τ(ε | X) = 0.

Throughout this chapter I concentrate on the linear CQF, since it is the most widely used by practitioners and for ease of exposition. Nevertheless, the results presented are valid for a nonlinear, correctly specified CQF, g(·). In the population, β_τo solves the following problem:

min_{β_τ ∈ B} E[ρ_τ(Y − X′β_τ)]   (3.1)

where ρ_τ(u) = (τ − 1[u ≤ 0])u and B ⊂ R^K is the parameter space. Given a random sample of size n from the population, it is possible to obtain consistent estimates of β_τo by the standard quantile regression (QR) estimator,

min_{β_τ ∈ B} n^{-1} Σ_{i=1}^n ρ_τ(y_i − x_i′β_τ)   (3.2)

Note that the minimization problem has the following population first order condition and sample analogue (Buchinsky, 1998):

E[(τ − 1[y − x′β_τo ≤ 0])x] = 0
n^{-1} Σ_{i=1}^n (τ − 1[y_i − x_i′β_τ ≤ 0])x_i = 0   (3.3)

Hence, we can frame this problem as a GMM estimator that uses as moment conditions the first order conditions of the QR problem that identify β_τo.

1 This formulation assumes the error term is additive and, hence, separable. For a treatment of the more general formulation with (potentially) non-separable ε see Powell (1991).
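The check function and its first order condition can be illustrated numerically (my own sketch, not from the dissertation): for an intercept-only model, minimizing the sample analogue of (3.1) recovers the empirical τ-quantile, at which the sample moment (3.3) is approximately zero.

```python
import numpy as np

def check_fn(u, tau):
    # rho_tau(u) = (tau - 1[u <= 0]) * u, the check function in (3.1)
    return (tau - (u <= 0)) * u

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, 10_000)
tau = 0.75

# Minimizing the sample analogue of (3.1) for an intercept-only model
# recovers the empirical tau-quantile of y (grid search for transparency).
grid = np.linspace(-3.0, 3.0, 1201)
obj = np.array([check_fn(y - b, tau).mean() for b in grid])
b_hat = grid[np.argmin(obj)]

# The sample first order condition (3.3) is approximately zero at b_hat
moment = np.mean(tau - (y - b_hat <= 0))
print(b_hat, np.quantile(y, tau), moment)
```

The objective is piecewise linear and convex, which is exactly the nonsmoothness that prevents a direct application of standard M-estimation theory and motivates the GMM framing in the text.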
Under random sampling, the standard QR procedures can be used to estimate β_τo and perform inference.

3.2.1 Quantile Regression under Stratified Sampling

Suppose our sample is obtained by a standard stratification scheme as formally described by Wooldridge (2001). Assume that the population is divided into J nonempty, mutually exclusive, and exhaustive strata, W_1, W_2, ..., W_J, where J is a finite integer. Let w denote a random variable having the population distribution of interest.

Definition 5 (Standard stratified sampling) For j = 1, ..., J, draw a random sample of size N_j from stratum j. For each j, denote this random sample by {w_ij : i = 1, 2, ..., N_j}. The strata sample sizes N_j are nonrandom. Therefore, the total sample size, N = N_1 + ... + N_J, is nonrandom.

Notice that, for a given j, {w_ij : i = 1, 2, ..., N_j} is an independent, identically distributed (i.i.d.) sequence having the population distribution conditional on being part of stratum j, D(w | w ∈ W_j). Then one can rewrite the minimization problem and its moment conditions as

min_{β_τ ∈ B} E[ρ_τ(Y − X′β_τ)] = min_{β_τ ∈ B} Σ_{j=1}^J Q_j E[ρ_τ(Y − X′β_τ) | w ∈ W_j]

E[(τ − 1[y − x′β_τo ≤ 0])x] = Σ_{j=1}^J Q_j E[(τ − 1[y − x′β_τo ≤ 0])x | w ∈ W_j] = 0   (3.4)

where Q_j = P(w ∈ W_j), j = 1, ..., J. The sample analogue is

Σ_{j=1}^J Q_j [ N_j^{-1} Σ_{i=1}^{N_j} (τ − 1[y_ij − x_ij′β_τ ≤ 0]) x_ij ] = (1/N) Σ_{j=1}^J (Q_j/H_j) [ Σ_{i=1}^{N_j} (τ − 1[y_ij − x_ij′β_τ ≤ 0]) x_ij ] = 0

with H_j = N_j/N. This can be rewritten as

(1/N) Σ_{i=1}^N (Q_{j_i}/H_{j_i}) (τ − 1[y_i − x_i′β_τ ≤ 0]) x_i = 0   (3.5)

where j_i denotes the stratum of observation i. This is the empirical moment condition used to estimate the parameters of interest, defining the weighted quantile regression estimator.
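For an intercept-only model, the weighted moment condition (3.5) amounts to a weighted quantile with weights Q_j/H_j. The sketch below (my own illustration; `weighted_quantile` is a hypothetical helper) shows how endogenous SS sampling biases the unweighted estimator while the weighted one recovers the population median:

```python
import numpy as np

rng = np.random.default_rng(4)
tau = 0.5

# Endogenous SS sampling from an N(0,1) population: stratify on the
# outcome itself (W1: y < 0, W2: y >= 0) and heavily oversample W2.
Q = np.array([0.5, 0.5])     # population stratum shares Q_j
N = np.array([200, 800])     # stratum sample sizes N_j
H = N / N.sum()              # sample shares H_j = N_j / N

# By symmetry of N(0,1), -|Z| and |Z| are draws from D(y | y < 0)
# and D(y | y >= 0), respectively.
y = np.concatenate([-np.abs(rng.normal(size=N[0])),
                    np.abs(rng.normal(size=N[1]))])
w = np.repeat(Q / H, N)      # weights Q_{j_i} / H_{j_i}

def weighted_quantile(y, w, tau):
    # Solves the weighted moment condition (3.5) for an intercept-only
    # model: the smallest b whose weighted CDF reaches tau.
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    return y[order][np.searchsorted(cdf, tau)]

med_unw = np.median(y)                 # unweighted: pulled toward W2
med_w = weighted_quantile(y, w, tau)   # weighted: near the true median 0
print(med_unw, med_w)
```

The unweighted median is dragged toward the oversampled stratum, while reweighting each observation by Q_j/H_j restores the population quantile, which is the content of the consistency result cited next.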
This estimator is consistent for the parameters of interest under standard stratified sampling (Wooldridge, 2001, theorem 3.1).2 The asymptotic distribution of the weighted quantile regression estimator can be obtained by a direct application of Newey and McFadden (1994), theorem 7.1, with careful consideration of the formulation of Var[g(β_τo)] due to the stratified nature of the data.

Corollary 6 If the conditions in Newey and McFadden (1994) theorem 7.1 hold, {w_ij : i = 1, 2, ..., N_j; j = 1, 2, ..., J} follows the standard stratified sampling scheme, and N_j/N → H_j > 0 as N → ∞ for each j, then √N(β̂_τ − β_τo) ~a N(0, A_w^{-1} B_w A_w^{-1}), where

A_w = E[f_{y|x}(g(x, β_τo)) ġ ġ′]

B_w = Σ_{j=1}^J (Q_j²/H_j) Var[(τ − 1[y − g(x, β_τo) ≤ 0]) ġ | w ∈ W_j]

and ġ ≡ ∂g(x, β)/∂β |_{β=β_τo}.

In the special case of the linear CQF, g(X, β) = x′β,

A_w = E[f_{y|x}(x′β_τo) xx′] and B_w = Σ_{j=1}^J (Q_j²/H_j) Var[(τ − 1[y − x′β_τo ≤ 0]) x | w ∈ W_j]

Corollary 6 provides a general form for the asymptotic variance of the quantile regression estimator under standard stratification. Two main points are relevant when analyzing B_w. The first, which is general to the standard stratification literature, is that we cannot replace Var[(τ − 1[y_ij − g(x_ij, β_τo) ≤ 0]) ġ_i | w ∈ W_j] by the outer product of the score as in the random sampling case because, in general, E[(τ − 1[y − x′β_τo ≤ 0])x | w ∈ W_j] ≠ 0, as pointed out by Wooldridge (2001). It is also interesting to note that, differently from the standard results in quantile regression for random sampling, B_w does not simplify to τ(1 − τ)E[xx′] in this case.

2 As a minor point, note that if one wants to implement the weighted estimator by applying a standard quantile regression to weighted data, the weights for each observation will be given by Q_{j_i}/H_{j_i} instead of the (Q_{j_i}/H_{j_i})^{1/2} usually required when implementing least squares estimators and their variants.
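A quick numerical illustration of this point for an intercept-only model (my own sketch, not part of the dissertation): at the population median b = 0 of an N(0,1) outcome, the score indicator has different variances across strata defined on y, so no single τ(1 − τ) factor applies.

```python
import numpy as np

rng = np.random.default_rng(5)
tau, b = 0.5, 0.0                  # b = 0 is the population median of N(0, 1)
y = rng.normal(0.0, 1.0, 200_000)

# Strata defined on the outcome: W1 = {y < 1}, W2 = {y >= 1}
in_w2 = (y >= 1.0)
v1 = np.var(tau - (y[~in_w2] <= b))   # ~0.241, since P(y <= 0 | W1) != tau
v2 = np.var(tau - (y[in_w2] <= b))    # 0: the indicator is constant in W2
print(v1, v2)  # neither equals tau*(1 - tau) = 0.25
```

Because the indicator 1[y ≤ b] does not have probability τ within each stratum, the stratum-specific variances must be kept in B_w rather than collapsed into τ(1 − τ).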
That is due to the fact that the variance of the binary variable 1[y_ij − x_ij′β_τo ≤ 0] is not necessarily the same for each stratum; in other words, x_ij′β_τo will not represent the τth quantile in every stratum. If we assume that

E[(τ − 1[y − x′β_τo ≤ 0])x | w ∈ W_j] = 0, for j = 1, 2, ..., J,   (3.6)

then an alternative formula for B_w is available:

B_w = Σ_{j=1}^J (Q_j²/H_j) E[(τ − 1[y − x′β_τo ≤ 0])² xx′ | w ∈ W_j]

A sufficient condition for equation (3.6) is that the conditional distribution of Y is independent of the strata, in which case E[1[y − g(x, β_τo) ≤ 0] | x, w ∈ W_j] = τ for all j. Then

B_w = τ(1 − τ) Σ_{j=1}^J (Q_j²/H_j) E[xx′ | w ∈ W_j]

3.2.2 Quantile Regression Estimation under Exogenous Stratification

If we are modeling the quantiles of Y given X, and stratification is based solely on the conditioning variables X, stratification is said to be exogenous. Let the sample space of X be partitioned into J nonempty, mutually exclusive, and exhaustive strata, χ_1, χ_2, ..., χ_J, where J is a finite integer; then we can analyze the effects of stratification within the framework of the previous section. The weighted quantile regression estimator in equation (3.5) can be used whether or not the stratification is exogenous. Under exogenous stratification the unweighted quantile regression estimator is also consistent (Wooldridge, 2001).3 This estimator drops the weights Q_j/H_j, solving

(1/N) Σ_{i=1}^N (τ − 1[y_i − x_i′β_τ ≤ 0]) x_i = Σ_{j=1}^J H_j [ N_j^{-1} Σ_{i=1}^{N_j} (τ − 1[y_ij − x_ij′β_τ ≤ 0]) x_ij ] = 0   (3.7)

so that each stratum average is weighted simply by its sample frequency; i.e., this is the usual QR estimator that would be used under random sampling. Under exogenous stratification the unweighted quantile regression estimator has the asymptotic distribution described in the corollary below.
Corollary 7 If the conditions in Newey and McFadden (1994) theorem 7.1 hold, {w_ij : i = 1, 2, ..., N_j; j = 1, 2, ..., J} follows the standard stratified sampling scheme, stratification is a deterministic function of X, and N_j/N → H_j > 0 as N → ∞ for each j, then √N(β̂_τ − β_τo) ~a N(0, A_u^{-1} B_u A_u^{-1}), where

A_u = Σ_{j=1}^J H_j E[f_{y|x}(g(x, β_τo)) ġ ġ′ | x ∈ χ_j]

and

B_u = τ(1 − τ) Σ_{j=1}^J H_j E[ġ ġ′ | x ∈ χ_j]

where ġ ≡ ∂g(x, β)/∂β |_{β=β_τo}.

In the special case of the linear CQF, g(X, β) = x′β,

A_u = Σ_{j=1}^J H_j E[f_{y|x}(x′β_τo) xx′ | x ∈ χ_j]

and

B_u = τ(1 − τ) Σ_{j=1}^J H_j E[xx′ | x ∈ χ_j]

3 As emphasized in Wooldridge (2001), the consistency of the unweighted estimator under exogenous stratification holds if the conditional quantiles of Y are correctly specified. The unweighted estimator solves

min_{β_τ ∈ B} Σ_{j=1}^J H_j E[ρ_τ(Y − X′β_τ) | x ∈ χ_j]

whose solution is not necessarily β_τo under misspecification.

3.2.3 Sequence of Quantile Regressions

The discussion above considers only the estimation of a single quantile regression for a given value of τ, but one might be interested in estimating several quantile regressions at diverse points of the conditional distribution of Y. As emphasized by Buchinsky (1998), because the coefficients are estimated using the same data with different weighting schemes, the estimators will be correlated. Suppose we are still interested in estimating the linear conditional quantile function, now for p separate quantiles,

Y = X′β_τr + ε_τr, with Q_τr(ε_τr | X) = 0 for r = 1, ..., p.

Also, let 0 < τ_1 < τ_2 < ... < τ_p < 1 and β_τ = (β_τ1′, β_τ2′, ..., β_τp′)′. For each τ_r define

ψ_r(y, x, β_τr) ≡ τ_r − 1[y − x′β_τr ≤ 0]

and

Ψ(y, x, β_τ1, β_τ2, ..., β_τp) = x (ψ_1(y, x, β_τ1), ψ_2(y, x, β_τ2), ..., ψ_p(y, x, β_τp))

Hence

E[Ψ(y, x, β_τ1, β_τ2, ..., β_τp)] = Σ_{j=1}^J Q_j E[Ψ(y, x, β_τ1, β_τ2, ..., β_τp) | w ∈ W_j] = 0
By the analogy principle, the estimator $\hat\beta_\tau$ of $\beta_\tau$ solves
$$\frac{1}{N}\sum_{j=1}^{J}\frac{Q_j}{\hat H_j}\sum_{i=1}^{N_j}\Psi(y_{ij}, x_{ij}, \beta_{\tau_1}, \dots, \beta_{\tau_p}) = \frac{1}{N}\sum_{i=1}^{N}\frac{Q_{j_i}}{\hat H_{j_i}}\,\Psi(y_i, x_i, \beta_{\tau_1}, \dots, \beta_{\tau_p}) = 0,$$
which can be solved separately if no cross-quantile restrictions are imposed on $\beta_{\tau_1}, \dots, \beta_{\tau_p}$, or simultaneously otherwise. Then we can show that $\hat\beta_\tau$ follows an asymptotically multivariate normal distribution as well.

Corollary 8. If the conditions of Newey and McFadden (1994, Theorem 7.1) hold, $\{w_{ij}: i = 1, 2, \dots, N_j;\ j = 1, 2, \dots, J\}$ follows the standard stratified sampling scheme, and $N_j/N \to H_j > 0$ as $N \to \infty$ for each $j$, then $\sqrt N(\hat\beta_\tau - \beta_\tau) \overset{a}{\sim} N(0, \Lambda_\tau)$, where $\Lambda_\tau = [\Lambda_{\tau_{l,k}}]_{l,k=1,\dots,p}$ with typical element
$$\Lambda_{\tau_{l,k}} = E\!\left[f_{y|x}(g(x,\beta_{\tau_l}))\,\dot g_l\dot g_l'\right]^{-1}\left(\sum_{j=1}^{J}\frac{Q_j^2}{H_j}\operatorname{Cov}\!\left(\psi_l(y,x,\beta_{\tau_l}),\,\psi_k(y,x,\beta_{\tau_k}) \,\middle|\, w \in W_j\right)\right)E\!\left[f_{y|x}(g(x,\beta_{\tau_k}))\,\dot g_k\dot g_k'\right]^{-1},$$
where, in this case, $\psi_r(y,x,\beta_{\tau_r}) = \left(\tau_r - 1[y - g(x,\beta_{\tau_r}) \le 0]\right)\dot g_r$ for $r = l, k$ and $\dot g_l \equiv \partial g(x,\beta)/\partial\beta\,\big|_{\beta=\beta_{\tau_l}}$.

In the special case of the linear CQF, $g(X,\beta) = x\beta$,
$$\Lambda_{\tau_{l,k}} = E\!\left[f_{y|x}(x\beta_{\tau_l})\,xx'\right]^{-1}\left(\sum_{j=1}^{J}\frac{Q_j^2}{H_j}\operatorname{Cov}\!\left(\psi_l(y,x,\beta_{\tau_l}),\,\psi_k(y,x,\beta_{\tau_k}) \,\middle|\, w \in W_j\right)\right)E\!\left[f_{y|x}(x\beta_{\tau_k})\,xx'\right]^{-1},$$
with $\psi_r(y,x,\beta_{\tau_r}) = \left(\tau_r - 1[y - x\beta_{\tau_r} \le 0]\right)x$ for $r = l, k$.

Once again, the notable difference of this result relative to the usual QR under random sampling is that the center term of $\Lambda_{\tau_{l,k}}$ does not simplify as neatly as in the standard random sampling case, since the covariance of the binary variables $1[y_{ij} - x_{ij}\beta_{\tau_l} \le 0]$ and $1[y_{ij} - x_{ij}\beta_{\tau_k} \le 0]$ is not necessarily the same for each stratum.

3.3 Asymptotic Variance Estimation

To perform inference based on the parameter estimates $\hat\beta_\tau$, we need valid estimators of their asymptotic variance.
From Corollary 6 we have, for the linear CQF,
$$\operatorname{Var}\!\left(\sqrt N(\hat\beta_\tau - \beta_{\tau o})\right) = A_w^{-1} B_w A_w^{-1} = E\!\left[f_{y|x}(x\beta_{\tau o})xx'\right]^{-1}\left(\sum_{j=1}^{J}\frac{Q_j^2}{H_j}\operatorname{Var}\!\left(\left(\tau - 1[y - x\beta_{\tau o}\le 0]\right)x \,\middle|\, w\in W_j\right)\right)E\!\left[f_{y|x}(x\beta_{\tau o})xx'\right]^{-1}.$$
Natural estimators of $A_w$ and $B_w$, as suggested by Wooldridge (2001), are
$$\hat A_w \equiv \sum_{j=1}^{J} Q_j\left(N_j^{-1}\sum_{i=1}^{N_j}\hat f_{y|x,w\in W_j}(x_{ij}\hat\beta_\tau)\,x_{ij}x_{ij}'\right) = N^{-1}\sum_{i=1}^{N}\frac{Q_{j_i}}{\hat H_{j_i}}\hat f_{y|x,w\in W_{j_i}}(x_i\hat\beta_\tau)\,x_i x_i',$$
$$\hat B_w \equiv \sum_{j=1}^{J}\frac{Q_j^2}{\hat H_j}\left(N_j^{-1}\sum_{i=1}^{N_j}\left(s_{ij}(\hat\beta_\tau) - \bar s_j(\hat\beta_\tau)\right)\left(s_{ij}(\hat\beta_\tau) - \bar s_j(\hat\beta_\tau)\right)'\right) = N^{-1}\sum_{j=1}^{J}\left(\frac{Q_j}{\hat H_j}\right)^2\sum_{i=1}^{N_j}\left(s_{ij}(\hat\beta_\tau) - \bar s_j(\hat\beta_\tau)\right)\left(s_{ij}(\hat\beta_\tau) - \bar s_j(\hat\beta_\tau)\right)',$$
where $s_{ij}(\hat\beta_\tau) \equiv \left(\tau - 1[y_{ij} - x_{ij}\hat\beta_\tau \le 0]\right)x_{ij}$, $\bar s_j(\hat\beta_\tau) = N_j^{-1}\sum_{i=1}^{N_j}s_{ij}(\hat\beta_\tau)$, and $Q_{j_i}/\hat H_{j_i}$ is the weight for the stratum $j_i$ of observation $i$.

In the more general, nonlinear framework, a similar estimator of the covariance matrix is obtained by replacing $s_{ij}(\hat\beta_\tau) \equiv \left(\tau - 1[y_{ij} - g(x_{ij},\hat\beta_\tau)\le 0]\right)\dot{\hat g}_i$ and the outer terms by
$$\left(N^{-1}\sum_{i=1}^{N}\frac{Q_{j_i}}{\hat H_{j_i}}\hat f_{y|x,w\in W_{j_i}}(g(x_{ij},\hat\beta_\tau))\,\dot{\hat g}_i\dot{\hat g}_i'\right)^{-1},$$
with $\dot{\hat g}_i \equiv \partial g(x_{ij},\beta)/\partial\beta\,\big|_{\beta=\hat\beta_\tau}$. Then
$$\widehat{\operatorname{Var}}\!\left(\sqrt N(\hat\beta_\tau - \beta_{\tau o})\right) = \hat A_w^{-1}\hat B_w\hat A_w^{-1}.$$
To estimate the off-diagonal terms $\Lambda_{\tau_{l,k}}$ when performing inference on a sequence of quantile regressions, we can estimate the middle term by the analogous quantity
$$\sum_{j=1}^{J}\frac{Q_j^2}{\hat H_j}\left(N_j^{-1}\sum_{i=1}^{N_j}\left(s_{ij}(\hat\beta_{\tau_l}) - \bar s_j(\hat\beta_{\tau_l})\right)\left(s_{ij}(\hat\beta_{\tau_k}) - \bar s_j(\hat\beta_{\tau_k})\right)'\right),$$
and we can still use the same estimators, $\hat A_w$, for the outer terms of $\Lambda_{\tau_{l,k}}$, noting that they are based on the estimates for $\tau_l$ and $\tau_k$, respectively. Finally, under exogenous stratification and correct specification of the underlying CQF, the score generally has zero conditional mean for all $X$ when evaluated at $\beta_{\tau o}$.
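The stratum-by-stratum construction of the middle term $\hat B_w$ can be sketched as follows (Python; a simplified illustration, not the chapter's implementation, assuming the data arrive as lists of $(y, x)$ pairs grouped by stratum, with hypothetical function names):

```python
def score(y, x, b, tau):
    """s(b) = (tau - 1[y - x'b <= 0]) * x, for x given as a list of regressors."""
    psi = tau - (1.0 if y - sum(xk * bk for xk, bk in zip(x, b)) <= 0 else 0.0)
    return [psi * xk for xk in x]

def b_w_hat(strata, Q, b, tau):
    """Middle term of the sandwich: sum_j (Q_j^2 / H_j_hat) times the
    within-stratum covariance matrix of the scores.

    `strata` is a list of lists of (y, x) pairs, one inner list per stratum;
    Q[j] is the (known) population share of stratum j."""
    N = sum(len(obs) for obs in strata)
    dim = len(strata[0][0][1])
    B = [[0.0] * dim for _ in range(dim)]
    for j, obs in enumerate(strata):
        Nj = len(obs)
        H = Nj / N  # estimated sampling share H_j_hat = N_j / N
        s = [score(y, x, b, tau) for y, x in obs]
        sbar = [sum(si[k] for si in s) / Nj for k in range(dim)]
        for si in s:
            d = [si[k] - sbar[k] for k in range(dim)]
            for r in range(dim):
                for c in range(dim):
                    B[r][c] += (Q[j] ** 2 / H) * d[r] * d[c] / Nj
    return B

# Two identical strata, intercept-only model (x = [1]), b = [0], tau = 0.5.
strata = [[(1.0, [1.0]), (-1.0, [1.0])], [(1.0, [1.0]), (-1.0, [1.0])]]
print(b_w_hat(strata, [0.5, 0.5], [0.0], 0.5)[0][0])  # -> 0.25
```

Note that the scores are demeaned within each stratum before the outer products are accumulated, matching the $s_{ij} - \bar s_j$ centering in the formula for $\hat B_w$.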
Then the asymptotic variance estimator for the unweighted quantile regression estimator (with linear CQF) is given by
$$\widehat{\operatorname{Var}}\!\left(\sqrt N(\hat\beta_\tau - \beta_{\tau o})\right) = \hat A_u^{-1}\hat B_u\hat A_u^{-1}$$
with
$$\hat A_u = \frac{1}{N}\sum_{i=1}^{N}\hat f_{y|x}(x_i\hat\beta_\tau)\,x_i x_i', \qquad \hat B_u = \tau(1-\tau)\frac{1}{N}\sum_{i=1}^{N} x_i x_i',$$
which are the usual variance matrix estimators from the quantile regression literature under random sampling, as described by Buchinsky (1998) and Koenker (2005). This confirms that the usual quantile regression estimators and standard errors are valid when the sample is exogenously stratified; it is the same result as obtained by Wooldridge (2001) for general M-estimators with smooth objective functions.

A main issue, specific to quantile regression, is that we need to estimate $f_{y|x,w\in W_j}(x_i\hat\beta_\tau)$ taking the stratification into consideration. An intuitive approach is to exploit the fact that we have assumed random sampling within each stratum: apply a standard nonparametric density estimator stratum by stratum and simply plug it into the formula above. For example, using the Rosenblatt-Parzen kernel estimator described in Pagan and Ullah (1999),
$$\hat f_{y|x,w\in W_j}(x_{ij}\hat\beta_\tau) = (N_j h_{nj})^{-1}\sum_{i=1}^{N_j} K\!\left(\frac{y_{ij} - x_{ij}\hat\beta_\tau}{h_{nj}}\right),$$
where $K(\cdot)$ is a kernel function and $h_{nj}$ is a bandwidth parameter such that $h_{nj}\to 0$ and $N_j h_{nj}\to\infty$. Another option is to bypass the estimation of $f_{y|x,w\in W_j}(x_i\hat\beta_\tau)$ itself and revert to an estimator of
$$A_{wj} = E\!\left[f_{y|x,w\in W_j}(x\beta_{\tau o})\,xx' \mid w \in W_j\right],$$
which Koenker (2005) refers to as the Powell sandwich. This takes account of the fact that estimating $A_{wj}$ amounts to estimating a matrix-weighted density (Koenker, 2005):
$$\hat A_{wj} = (N_j h_{nj})^{-1}\sum_{i=1}^{N_j} K\!\left(\frac{y_{ij} - x_{ij}\hat\beta_\tau}{h_{nj}}\right)x_{ij}x_{ij}'.$$
Then we can obtain $\hat A_w = \sum_{j=1}^{J} Q_j \hat A_{wj}$. Powell (1991) has shown that, under some additional conditions on $f(\cdot)$, $\hat A_{wj}$ converges in probability to $A_{wj}$.
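As an illustration of the plug-in step, here is a minimal Rosenblatt-Parzen sketch with a Gaussian kernel (Python; the residuals and bandwidth are hypothetical, and the Gaussian kernel is one choice among the options discussed in Pagan and Ullah, 1999):

```python
import math

def gaussian_kernel(v):
    """Standard normal density used as the kernel K."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def parzen_density(residuals, point, h):
    """Rosenblatt-Parzen estimate of the residual density at `point`,
    computed from one stratum's residuals y_ij - x_ij * beta_hat:
        f_hat(point) = (N_j * h)^-1 * sum_i K((point - u_i) / h)."""
    n = len(residuals)
    return sum(gaussian_kernel((point - u) / h) for u in residuals) / (n * h)

# Residuals from a hypothetical stratum; density evaluated at zero.
u = [-1.5, -0.5, 0.0, 0.5, 1.5]
print(round(parzen_density(u, 0.0, 1.0), 4))
```

In the stratified setting this function would be applied separately within each stratum, with its own bandwidth $h_{nj}$, and the resulting $\hat f_{y|x,w\in W_j}(\cdot)$ plugged into $\hat A_w$.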
Two drawbacks of this estimator are the need to choose kernels and bandwidths (which could differ across strata) and the fact that, in practice, the researcher might have some strata for which only a small amount of data is available, reducing the confidence in the resulting estimates of $A_{wj}$.$^4$ When the stratification is exogenous, we can take advantage of the fact that $f_{y|x,x\in\chi_j}(x_{ij}\hat\beta_\tau) = f_{y|x}(x_i\hat\beta_\tau)$ for all strata and obtain $\hat f_{y|x}(x_i\hat\beta_\tau)$ or $\hat A_w$ directly as
$$\hat f_{y|x}(x_i\hat\beta_\tau) = (Nh_n)^{-1}\sum_{i=1}^{N}K\!\left(\frac{y_i - x_i\hat\beta_\tau}{h_n}\right), \qquad \hat A_w = (Nh_n)^{-1}\sum_{i=1}^{N}K\!\left(\frac{y_i - x_i\hat\beta_\tau}{h_n}\right)x_i x_i'.$$
The choice of estimator for $f_{y|x,w\in W_j}(x_{ij}\hat\beta_\tau)$, as well as of any nuisance parameters associated with its estimation, is an important issue that remains open. It seems advisable for researchers to be cautious about the impact of these choices on the standard errors used in QR. Pagan and Ullah (1999) present several possible estimators for such densities and discuss their advantages and drawbacks.

3.4 Conclusion

This chapter addressed the issue of inference in quantile regressions when the data are obtained through standard stratified sampling. Extending results from Wooldridge (2001), I derive the asymptotic distribution of the weighted quantile regression estimator under general stratification, and of the unweighted quantile regression estimator when the stratification is a deterministic function of the conditioning variables. Valid estimators for the asymptotic variance matrices of those estimators are provided. The results shown here confirm the intuition that the results for general M-estimators with smooth objective functions transfer neatly to the quantile regression case. This adds to the literature a more careful treatment of the quantile regression case under stratified sampling that had not previously been available.$^5$

$^4$ For the nonlinear case, one can use $\hat A_{wj} = (N_j h_{nj})^{-1}\sum_{i=1}^{N_j}K\!\left(\frac{y_{ij} - g(x_{ij},\hat\beta_\tau)}{h_{nj}}\right)\dot{\hat g}_i\dot{\hat g}_i'$.
K hnj i=1 5 And this fact has probably played a significant role in the absence of probability 76 Weighted quantile regression provides a simple and reliable way to deal with data obtained through stratified sampling, albeit requiring adjustments to the usual standard errors in the literature. Valid estimators for the standard errors are provided. Under exogenous stratification one could use the usual unweighted estimators, which retain its properties of consistency (Wooldridge, 2001) and asymptotic normality. Even more relevant, in that case, some usual standard error estimators in the literature (Koenker, 2005; Buchinsky, 1998, etc) are still valid. weighted and “survey” methods for quantile regression in popular statistical packages like STATA. 77 APPENDICES 78 APPENDIX A Proofs to “GMM Efficiency and IPW for Nonsmooth Functions” Proof. [Proof of Theorem 1] For VON E−ST EP , VKN OW −γ and VKN OW −γ−JOIN T this result is a direct application of known results in the literature (see, e.g., p. 2186 in Newey and McFadden 1994 or more generally p. 1594 in Chen, Linton and Van Keilegom 2003) and the simplifications that take effect by the use of the appropriate weighting matrix. For VT W O−ST EP I rely on the approximations used by Newey and McFadden (1994) in theorem 7.2 and Pakes and Pollard (1989) theorem 3.3 and lemma 3.5. Following Pakes and Pollard (1989), I claim that gn (θ) is very well approximated by the linear function Ln (θ) = = Ln1 (θ) = gn (θo ) + G(θ − θo ) Ln2 (θ) gn1 (β o , γ o ) + G11 (β − β o ) + G12 (γ − γ o ) gn2 (γ o ) + G22 (γ − γ o ) 1 − within a Op (n 2 ) neighborhood of θo . More precisely, I need the approximation error to 1 − be of order op (n 2 ) at θ and at θ∗ which minimizes Ln (θ) globally. 
In the case analyzed 79 here, gn (θ) − Ln (θ) = gn (θ) − gn (θo ) − G(θ − θo ) = gn (θ) − gn (θo ) − G(θ − θo ) − go (θ) + go (θ) ≤ gn (θ) − go (θ) − gn (θo ) + go (θ) − G(θ − θo ) 1 − ≤ op (1)n 2 1 + √ n (θ−θ o ) + op ( (θ−θ o ) ) 1 − = op (n 2 ) 1 − where in the last equality I used the fact that (θ−θ o ) ≤ Op (n 2 ) (see Newey and Mc- Fadden, 1994, p. 2191). To correspond to a minimum of Ln (θ) , the vector G(θ∗ − θo ) must be equal to the linear projection of −gn (θo ) onto the space G. Hence, G(θ∗ − θo ) = −G(G G)−1 G gn (θo ) from this equation, we can obtain √ √ ∗ n(θ − θo ) = − n(G G)−1 G gn (θo ) from Pakes and Pollard (1989, lemma 3.5) the result above holds for the case in which we use the appropriate positive semidefinite weighting matrix W that converges in probability to W , in which case √ √ n(θ∗ − θo ) = − n(G W G)−1 G W gn (θo ) as shown by Pakes and Pollard (1989, p. 2042) under the conditions listed above θ∗ and θ are close enough in this shrinking neighborhood around θo such that we can write √ n(θ − θo ) = √ n(θ∗ − θo ) + op (1) Hence, for the first step estimator, the following approximation is valid √ −1 √ −1 −1 n(γ − γ o ) = − n G22 C22 G22 G22 C22 gn2 (γ o ) + op (1) 80 (A.1) Then, for the second step, using the same results, we can approximate √ −1 √ −1 −1 G11 C11 gn1 (β o , γ) + op (1) n(β − β o ) = − n G11 C11 G11 −1 √ −1 −1 = − n G11 C11 G11 G11 C11 [gn1 (β o , γ o ) + G12 (γ−γ o )] + op (1) −1 √ −1 −1 = − n G11 C11 G11 G11 C11 gn1 (β o , γ o )+ + −1 −1 √ −1 −1 −1 × G11 C11 G12 G22 C22 G22 n G11 C11 G11 (A.2) −1 × G22 C22 gn2 (γ o ) + op (1) then, by combining A.1 and A.2 we can write √ √ n(θ − θo ) = B ngn (θ o ) + op (1) where, B= B11 B12 0 B22 with −1 B11 = − G11 C11 G11 B12 = −1 G11 C11 G11 −1 −1 −1 B22 = − G22 C22 G22 −1 G11 C11 −1 −1 G11 C11 G12 G22 C22 G22 −1 −1 −1 G22 C22 −1 G22 C22 hence, VT W O−ST EP = BΣB Proof. 
[Proof of Corollary 1] The proof follows directly from Prokhorov and Schmidt (2009), since Theorem 3 has shown that the variance structure of the four estimators considered is the same as in Prokhorov and Schmidt (2009). The proof that the result holds directly for the case in which the objective functions considered are nonsmooth is available upon request.

Proof. [Proof of Corollary 2] Note that the asymptotic variance of $\sqrt n(\hat\beta_{TWO\text{-}STEP} - \beta_o)$ can be rewritten as (note that $B_{12} = -B_{11}G_{12}\left(G_{22}'C_{22}^{-1}G_{22}\right)^{-1}G_{22}'C_{22}^{-1}$)
$$\begin{aligned}
V(\beta_{TWO\text{-}STEP}) &= B_{11}C_{11}B_{11}' + B_{12}C_{21}B_{11}' + B_{11}C_{12}B_{12}' + B_{12}C_{22}B_{12}'\\
&= B_{11}E\!\left[g_1(\omega_i^*,\theta)g_1(\omega_i^*,\theta)'\right]B_{11}' + B_{12}E\!\left[g_2(\omega_i^*,\gamma)g_1(\omega_i^*,\theta)'\right]B_{11}'\\
&\quad + B_{11}E\!\left[g_1(\omega_i^*,\theta)g_2(\omega_i^*,\gamma)'\right]B_{12}' + B_{12}E\!\left[g_2(\omega_i^*,\gamma)g_2(\omega_i^*,\gamma)'\right]B_{12}'\\
&= B_{11}\,E\!\left[\left(g_1(\omega_i^*,\theta) - G_{12}\left(G_{22}'C_{22}^{-1}G_{22}\right)^{-1}G_{22}'C_{22}^{-1}g_2(\omega_i^*,\gamma)\right)\left(\,\cdot\,\right)'\right]B_{11}',
\end{aligned}$$
where $(\,\cdot\,)$ repeats the preceding bracketed expression. If $G_{12} = C_{12}C_{22}^{-1}G_{22}$,
$$V(\beta_{TWO\text{-}STEP}) = B_{11}\,E\!\left[\left(g_1(\omega_i^*,\theta) - C_{12}C_{22}^{-1}G_{22}\left(G_{22}'C_{22}^{-1}G_{22}\right)^{-1}G_{22}'C_{22}^{-1}g_2(\omega_i^*,\gamma)\right)\left(\,\cdot\,\right)'\right]B_{11}'.$$
Since it is assumed that $G_{22}$ is invertible, this simplifies to
$$V(\beta_{TWO\text{-}STEP}) = B_{11}\,E\!\left[\left(g_1(\omega_i^*,\theta) - C_{12}C_{22}^{-1}g_2(\omega_i^*,\gamma)\right)\left(g_1(\omega_i^*,\theta) - C_{12}C_{22}^{-1}g_2(\omega_i^*,\gamma)\right)'\right]B_{11}'.$$
If we define $e_i = g_1(\omega_i^*,\theta) - C_{12}C_{22}^{-1}g_2(\omega_i^*,\gamma)$ and $D_o = E[e_i e_i']$, we can write
$$V(\beta_{TWO\text{-}STEP}) = \left(G_{11}'C_{11}^{-1}G_{11}\right)^{-1}G_{11}'C_{11}^{-1}\,D_o\,C_{11}^{-1}G_{11}\left(G_{11}'C_{11}^{-1}G_{11}\right)^{-1}.$$
In this case, the variance of the two-step estimator of $\beta_o$ takes a quadratic form in which the middle term is the variance of the residual from the linear projection of the first set of moment conditions on the second set. If, in addition to the conditions above, we assume $G_{11}$ is invertible, the result follows:
$$V(\beta_{TWO\text{-}STEP}) = G_{11}^{-1}D_o G_{11}^{-1\prime}.$$

Proof.
[Proof of Lemma 2] First, note that
$$\begin{aligned}
E\!\left[s\,g_2(z,\gamma_o,s)' \mid z\right] &= E\!\left[s\,\frac{\nabla_\gamma f(v_i\mid z_i,\gamma)'}{f(v_i\mid z_i,\gamma)} \,\middle|\, z\right]
= \int_{-\infty}^{\infty} h(v,z)\,\frac{\nabla_\gamma f(v\mid z,\gamma)'}{f(v\mid z,\gamma)}\,f(v\mid z,\gamma)\,dv\\
&= \nabla_\gamma\!\left[\int_{-\infty}^{\infty} h(v,z)f(v\mid z,\gamma)\,dv\right] = \nabla_\gamma E[s\mid z] = \nabla_\gamma p(z,\gamma_o),
\end{aligned}$$
which is nonzero in general. Hence
$$\begin{aligned}
C_{12} &= E\!\left[g_1(\omega^*,\beta_o,\gamma_o,s)\,g_2(z,\gamma_o,s)'\right]
= E\!\left[\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\,g_2(z,\gamma_o,s)'\right]\\
&= E\!\left\{E\!\left[\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\,g_2(z,\gamma_o,s)' \,\middle|\, z\right]\right\}
= E\!\left\{\frac{1}{p(z,\gamma_o)}\,E\!\left[g(\omega,\beta_o)\,s\,g_2(z,\gamma_o,s)' \mid z\right]\right\}\\
&= E\!\left\{\frac{1}{p(z,\gamma_o)}\,E\!\left[g(\omega,\beta_o)\mid z\right]E\!\left[s\,g_2(z,\gamma_o,s)'\mid z\right]\right\} \quad\text{(by ignorability)}\\
&= E\!\left[\frac{E[g(\omega,\beta_o)\mid z]\,\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\right],
\end{aligned}$$
which is generally nonzero. Analyzing $G_{12}$,
$$G_{12} = \nabla_\gamma E\!\left[g_1(\omega^*,\beta_o,\gamma_o,s)\right] = \nabla_\gamma E\!\left[\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\right].$$
Since $g_1(\omega^*,\beta_o,\gamma,s) = \frac{s}{p(z,\gamma)}\,g(\omega,\beta)$ is smooth in $\gamma$,
$$\begin{aligned}
G_{12} &= E\!\left[\nabla_\gamma\,\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\right]
= -E\!\left[\frac{s\,\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)^2}\,g(\omega,\beta_o)\right]
= -E\!\left[\frac{s}{p(z,\gamma_o)}\,\frac{\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\,g(\omega,\beta_o)\right]\\
&= -E\!\left[\frac{\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\,E\!\left[\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\,\middle|\,z\right]\right] \quad\text{(by the LIE)}\\
&= -E\!\left[\frac{\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\,\frac{E[s\mid z]}{p(z,\gamma_o)}\,E[g(\omega,\beta_o)\mid z]\right]
= -E\!\left[\frac{\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\,E[g(\omega,\beta_o)\mid z]\right], \ \text{since } E[s\mid z]=p(z,\gamma_o)\\
&= -C_{12}.
\end{aligned}$$
Then, to prove Lemma 2, I need $G_{22} = -C_{22}$, which follows from the Generalized Information Equality (recalling that $g_2(z,\gamma_o,s)$ is a smooth function):
$$G_{22} = \nabla_\gamma E[g_2(z,\gamma_o,s)] = E[\nabla_\gamma g_2(z,\gamma_o,s)] = -E\!\left[g_2(z,\gamma_o,s)g_2(z,\gamma_o,s)'\right] = -C_{22}.$$
Hence $G_{12} = -C_{12} = -C_{12}\left(-C_{22}^{-1}G_{22}\right) = C_{12}C_{22}^{-1}G_{22}$.

Proof. [Proof of Theorem 2] This follows directly from Lemma 2 and statements 9 and 10 in Corollary 1.

Claim 1. Consider the conditional quantile function $Q_\tau(Y\mid X) = X\beta_{\tau o}$ and the weighted linear quantile estimator obtained as
$$\hat\beta_\tau = \arg\min_{b\in\mathbb R^p}\sum_i w_i\,\rho_\tau(y_i - x_i b)$$
for some known weights $w_i$ that could be functions of exogenous variables. Under Assumptions 7 and 8, we have
$$\sqrt n(\hat\beta_\tau - \beta_\tau) \overset{a}{\sim} N\!\left(0,\,\tau(1-\tau)\,D_1^{-1}D_o D_1^{-1}\right),$$
with $D_1 = \lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i f_i(x_i\beta_{\tau o})x_i x_i'$ and $D_o = \lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i^2 x_i x_i'$.

Assumption 7. For $Y_1, Y_2, \dots$
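The inverse-probability-weighting logic underlying these moment conditions can be illustrated with a toy sketch (Python; the data, selection indicators, and probabilities are hypothetical): weighting each observed outcome by $s/p(z)$ recovers, in expectation, the full-data moment, because $E[s\mid z] = p(z,\gamma_o)$.

```python
def ipw_mean(y, s, p):
    """IPW estimate of E[y] when y is only used if observed (s = 1),
    with known selection probabilities p_i = P(s_i = 1 | z_i):
        (1/n) * sum_i s_i * y_i / p_i."""
    n = len(y)
    return sum(si * yi / pi for yi, si, pi in zip(y, s, p)) / n

# Sanity check: with no missing data (s = 1, p = 1) this is the plain mean.
y = [1.0, 2.0, 3.0, 4.0]
print(ipw_mean(y, [1, 1, 1, 1], [1.0] * 4))  # -> 2.5
```

In the chapter's setting $y_i$ would be replaced by the moment function $g(\omega_i, \beta)$, and $p(z,\gamma)$ would itself be estimated, which is exactly what generates the $C_{12}$ and $G_{12}$ terms analyzed above.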
, $Y_n$ independent random variables with distribution functions $F_1, F_2, \dots, F_n$: the $\{F_i\}$ are absolutely continuous, with continuous densities $f_i(\cdot)$ uniformly bounded away from $0$ and $\infty$ at the points $x_i\beta_{\tau o}$, and with weights $w_i$ bounded, for every $i$.

Assumption 8. There exist positive definite matrices $D_o$ and $D_1$ such that
(i) $\lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i^2 x_i x_i' = D_o$;
(ii) $\lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i f_i(x_i\beta_{\tau o})x_i x_i' = D_1$;
(iii) $\max_i \|x_i\|/\sqrt n \to 0$.

Proof. [Proof of Claim 1] This proof follows the steps presented in Koenker (2005, p. 120). Consider $u_i = y_i - x_i\beta_{\tau o}$; then
$$\hat\beta_\tau = \arg\min_{b\in\mathbb R^p}\sum_{i=1}^n w_i\,\rho_\tau(y_i - x_i b).$$
Consider the following convex objective function, with unique minimizer at $\hat\delta_n = \sqrt n(\hat\beta_\tau - \beta_{\tau o})$:
$$Z_n(\delta) = \sum_{i=1}^n w_i\left[\rho_\tau\!\left(u_i - \frac{x_i\delta}{\sqrt n}\right) - \rho_\tau(u_i)\right].$$
Using Knight's identity, $\rho_\tau(u - v) - \rho_\tau(u) = -v\,\Psi_\tau(u) + \int_0^v\left(1[u\le s] - 1[u\le 0]\right)ds$ with $\Psi_\tau(u) = \tau - 1[u\le 0]$,
$$Z_n(\delta) = \sum_{i=1}^n w_i\left[-\frac{x_i\delta}{\sqrt n}\,\Psi_\tau(u_i) + \int_0^{x_i\delta/\sqrt n}\left(1[u_i\le s] - 1[u_i\le 0]\right)ds\right] = Z_{1n}(\delta) + \sum_{i=1}^n Z_{2ni}(\delta) = Z_{1n}(\delta) + Z_{2n}(\delta).$$
Note that, by the Lindeberg-Feller central limit theorem,
$$Z_{1n}(\delta) = -\delta'\frac{1}{\sqrt n}\sum_{i=1}^n w_i x_i\,\Psi_\tau(u_i) = -\delta'\frac{1}{\sqrt n}\sum_{i=1}^n w_i x_i\left(\tau - 1[u_i\le 0]\right) \rightsquigarrow -\delta'W, \qquad W \sim N\!\left(0,\,\tau(1-\tau)\lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i^2 x_i x_i'\right).$$
Also,
$$Z_{2n}(\delta) = \sum_{i=1}^n Z_{2ni}(\delta) = \sum_{i=1}^n E[Z_{2ni}(\delta)] + \sum_{i=1}^n\left(Z_{2ni}(\delta) - E[Z_{2ni}(\delta)]\right),$$
but
$$\sum_{i=1}^n E[Z_{2ni}(\delta)] = \sum_{i=1}^n w_i\int_0^{x_i\delta/\sqrt n}E\!\left[1[u_i\le s] - 1[u_i\le 0]\right]ds = \sum_{i=1}^n w_i\int_0^{x_i\delta/\sqrt n}\left[F_i(x_i\beta_{\tau o} + s) - F_i(x_i\beta_{\tau o})\right]ds.$$
Letting $s = t/\sqrt n$,
$$\sum_{i=1}^n E[Z_{2ni}(\delta)] = \frac1n\sum_{i=1}^n w_i\int_0^{x_i\delta}\sqrt n\left[F_i\!\left(x_i\beta_{\tau o} + \frac{t}{\sqrt n}\right) - F_i(x_i\beta_{\tau o})\right]dt = \frac1n\sum_{i=1}^n w_i\int_0^{x_i\delta}f_i(x_i\beta_{\tau o})\,t\,dt + o(1) = \frac{1}{2n}\sum_{i=1}^n w_i f_i(x_i\beta_{\tau o})\,\delta'x_i x_i'\delta + o(1) \to \frac12\,\delta'D_1\delta.$$
Under Assumption 8(iii),
$$Z_n(\delta) \rightsquigarrow Z_o(\delta) = -\delta'W + \frac12\,\delta'D_1\delta,$$
and then, by convexity,
$$\sqrt n(\hat\beta_\tau - \beta_\tau) = \hat\delta_n = \arg\min Z_n(\delta) \rightsquigarrow \delta_o = \arg\min Z_o(\delta) = D_1^{-1}W;$$
hence
$$\sqrt n(\hat\beta_\tau - \beta_\tau) \overset{a}{\sim} N\!\left(0,\,\tau(1-\tau)D_1^{-1}D_o D_1^{-1}\right).$$

APPENDIX B

Figures to "Fixed Bandwidth Asymptotics for Regression
Discontinuity Designs"

B.1 Simulations for Infeasible Inference

B.1.1 Nadaraya-Watson Estimator

[Each figure in this appendix plots empirical coverage (vertical axis) against the bandwidth h, from 0 to 20 (horizontal axis), comparing small-h and fixed-h inference against the 95% nominal line. Only the figure captions are reproduced here.]

For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Figure B.1. Nadaraya-Watson Estimator - DGP: No X - Homosk. Errors
Figure B.2. Nadaraya-Watson Estimator - DGP: Linear - Homosk. Errors
Figure B.3. Nadaraya-Watson Estimator - DGP: Linear - Bias Corrected - Homosk. Errors
Figure B.4. Nadaraya-Watson Estimator - DGP: Linear - Comparison - Homosk. Errors

B.1.2 Local Linear Estimator

Figure B.5. Local Linear Estimator - DGP: Linear - Homosk. Errors
Figure B.6. Local Linear Estimator - DGP: Quadratic - Homosk. Errors
Figure B.7. Local Linear Estimator - DGP: Quadratic - Bias Corrected - Homosk.
Errors
Figure B.8. Local Linear Estimator - DGP: Quadratic - Comparison - Homosk. Errors

B.1.3 Heteroskedastic Errors

Figure B.9. Local Linear Estimator - DGP: Linear - Heterosk. Errors Case 1
Figure B.10. Local Linear Estimator - DGP: Quadratic - Heterosk. Errors Case 1
Figure B.11. Local Linear Estimator - DGP: Linear - Heterosk. Errors Case 2
Figure B.12. Local Linear Estimator - DGP: Quadratic - Heterosk. Errors Case 2

B.2 Simulations for Feasible Inference

Figure B.13. Nadaraya-Watson Estimator - DGP: No X - Feasible - Homosk.
Errors
Figure B.14. Nadaraya-Watson Estimator - DGP: Linear - Feasible - Homosk. Errors

B.2.1 Local Linear Estimator

Figure B.15. Local Linear Estimator - DGP: Linear - Feasible - Homosk. Errors
Figure B.16. Local Linear Estimator - DGP: Quadratic - Feasible - Homosk. Errors

B.2.2 Heteroskedastic Errors

Figure B.17. Local Linear Estimator - DGP: Linear - Feasible - Heterosk. Errors Case 1
Figure B.18. Local Linear Estimator - DGP: Quadratic - Feasible - Heterosk. Errors Case 1
Figure B.19. Local Linear Estimator - DGP: Linear - Feasible - Heterosk.
Errors Case 2
Figure B.20. Local Linear Estimator - DGP: Quadratic - Feasible - Heterosk. Errors Case 2

B.2.3 Bandwidth Choice for fo(x)

Figure B.21. Small-h Sensitivity to Density Bandwidth (hf) - DGP: Linear
Figure B.22. Small-h Sensitivity to Density Bandwidth (hf) - DGP: Quadratic

APPENDIX C

Proofs to "Fixed Bandwidth Asymptotics for Regression Discontinuity Designs"

Proof. [Proof of Theorem 6] The local polynomial estimator is given by $\hat\alpha_p = \hat\alpha_{p+} - \hat\alpha_{p-}$ (writing $\bar x$ for the cutoff). Note that
$$\hat\alpha_{p+} = e_1'\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i Z_i'\right)^{-1}\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i y_i\right) = e_1'D_{n+}\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i y_i\right),$$
with $D_{n+} = \left[\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i Z_i'\right]^{-1}$, and
$$\hat\alpha_{p-} = e_1'D_{n-}\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_i y_i\right)
,$$
with $D_{n-} = \left[\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_i Z_i'\right]^{-1}$. Then
$$\hat\alpha_{p+} = e_1'D_{n+}\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i\left[m(x_i) + \alpha d_i + \varepsilon_i\right]\right) = e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_i Z_i m(x_i)\right) + \alpha\,e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_i Z_i\right) + e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_i Z_i\varepsilon_i\right).$$
Note that $Z_i = Z_iZ_i'e_1$ and $e_1'e_1 = 1$, so
$$e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_i\right) = e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_iZ_i'\right)e_1 = 1,$$
and therefore
$$\hat\alpha_{p+} - \alpha = e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_im(x_i)\right) + e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_i\varepsilon_i\right).$$
Similarly,
$$\hat\alpha_{p-} = e_1'D_{n-}\left(\frac{1}{nh}\sum_i k(\cdot)\,(1-d_i)Z_im(x_i)\right) + e_1'D_{n-}\left(\frac{1}{nh}\sum_i k(\cdot)\,(1-d_i)Z_i\varepsilon_i\right).$$
Then
$$\sqrt{nh}(\hat\alpha_p - \alpha) = e_1'D_{n+}\left[\sqrt{nh}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_im(x_i)\right) + \frac{1}{\sqrt{nh}}\sum_i k(\cdot)\,d_iZ_i\varepsilon_i\right] - e_1'D_{n-}\left[\sqrt{nh}\left(\frac{1}{nh}\sum_i k(\cdot)\,(1-d_i)Z_im(x_i)\right) + \frac{1}{\sqrt{nh}}\sum_i k(\cdot)\,(1-d_i)Z_i\varepsilon_i\right].$$
For the denominator terms $D_{n+}$ and $D_{n-}$, note that $D_{n+}^{-1} = \frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_iZ_i'$, and each element of this matrix is given by
$$\left[D_{n+}^{-1}\right]_{l,j} = \frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)\left(\frac{x_i - \bar x}{h}\right)^{j+l-2}d_i,$$
whose variance converges to zero, since
$$\operatorname{Var}\!\left(\left[D_{n+}^{-1}\right]_{l,j}\right) = \frac{1}{(nh)^2}\sum_i\operatorname{Var}\!\left(k\!\left(\frac{x_i - \bar x}{h}\right)\left(\frac{x_i - \bar x}{h}\right)^{j+l-2}d_i\right) = \frac{1}{nh^2}\operatorname{Var}\!\left(k\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^{j+l-2}d\right) \le \frac{1}{nh}\int_{\bar x}^{\bar x+h}\frac1h k^2\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^{2(j+l-2)}f_o(x)\,dx.$$
The terms in the integral and the integral itself are $O(1)$ and $\frac{1}{nh} = o(1)$; hence $\operatorname{Var}([D_{n+}^{-1}]_{l,j}) \to 0$. Now,
$$\left[D_{n+}^{-1}\right]_{l,j} = E\!\left[\left[D_{n+}^{-1}\right]_{l,j}\right] + o_p(1) = \int_{\bar x}^{\bar x+h}\frac1h k\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^{j+l-2}f_o(x)\,dx + o_p(1) = \int_0^\infty k(u)\,u^{j+l-2}f_o(\bar x + uh)\,du + o_p(1).$$
Let $\gamma_j^+ = \int_0^\infty k(u)\,u^j f_o(\bar x + uh)\,du$, and let $\Gamma_+^*$ be the $(p+1)\times(p+1)$ matrix with $(j,l)$ element $\gamma_{j+l-2}^+$ for $j, l = 1, \dots, p+1$.
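The algebra above is one-sided weighted least squares; the following toy sketch (Python; rectangular kernel, $p = 1$, and wholly hypothetical data) computes the local linear intercepts on each side of the cutoff and their difference, which is the sharp RD estimate:

```python
def local_linear_intercept(xs, ys, cutoff, h, side):
    """One-sided local linear fit with a rectangular kernel: regress y on
    (1, x - cutoff) using only observations within h of the cutoff on the
    requested side, and return the intercept (the fitted value at the cutoff)."""
    pts = [(x - cutoff, y) for x, y in zip(xs, ys)
           if abs(x - cutoff) <= h
           and ((x >= cutoff) if side == "+" else (x < cutoff))]
    n = float(len(pts))
    sx = sum(d for d, _ in pts)
    sxx = sum(d * d for d, _ in pts)
    sy = sum(y for _, y in pts)
    sxy = sum(d * y for d, y in pts)
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det  # intercept from the 2x2 normal equations

# A piecewise-linear outcome with a jump of 1 at the cutoff x = 0.
xs = [-0.4, -0.3, -0.2, -0.1, 0.1, 0.2, 0.3, 0.4]
ys = [2.0 + 3.0 * x + (1.0 if x >= 0 else 0.0) for x in xs]
jump = (local_linear_intercept(xs, ys, 0.0, 0.5, "+")
        - local_linear_intercept(xs, ys, 0.0, 0.5, "-"))
print(round(jump, 10))  # -> 1.0
```

Because the data inside the bandwidth are exactly linear on each side, the estimator recovers the jump exactly here; with curvature in $m(x)$ it would instead exhibit the bias term derived in the proof.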
Then $D_{n+} \overset{p}{\to} \Gamma_+^{*\,-1}$. Similarly, $D_{n-} \overset{p}{\to} \Gamma_-^{*\,-1}$, where $\Gamma_-^*$ is the $(p+1)\times(p+1)$ matrix with $(j,l)$ element $\gamma_{j+l-2}^-$ for $j, l = 1, \dots, p+1$, and $\gamma_j^- = (-1)^j\int_0^\infty k(u)\,u^jf_o(\bar x - uh)\,du$.

Now we derive the asymptotic distribution of $\frac{1}{\sqrt{nh}}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_i\varepsilon_i$. Following Porter (2003), I use the Cramer-Wold device. Let $\lambda$ be a nonzero, finite vector. A standard bound verifies the Liapunov condition:
$$\sum_{i=1}^n E\left|\frac{1}{\sqrt{nh}}\,k\!\left(\frac{x_i - \bar x}{h}\right)d_i\lambda'Z_i\varepsilon_i\right|^{2+\zeta} \le \left(\frac{1}{nh}\right)^{\zeta/2}\sup_{x\in\aleph}E\!\left(|\varepsilon|^{2+\zeta}\mid X = x\right)\cdot\frac1h\,E\!\left[k^{2+\zeta}\!\left(\frac{x - \bar x}{h}\right)d\,\big|\lambda'Z(x)\big|^{2+\zeta}\right] = O\!\left(\left(\frac{1}{nh}\right)^{\zeta/2}\right)O(1) = o(1);$$
hence $\frac{1}{\sqrt{nh}}\sum_i k(\cdot)\,d_iZ_i\varepsilon_i$ satisfies Liapunov's CLT. Note that
$$E\!\left[\frac{1}{\sqrt{nh}}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_i\varepsilon_i\right] = \sqrt{\frac{n}{h}}\,E\!\left[k\!\left(\frac{x - \bar x}{h}\right)dZ\,E(\varepsilon\mid X = x)\right] = 0$$
and
$$\operatorname{Var}\!\left[\frac{1}{\sqrt{nh}}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_i\varepsilon_i\right] = \frac1h\,E\!\left[k^2\!\left(\frac{x - \bar x}{h}\right)dZZ'\,E(\varepsilon^2\mid X = x)\right] = \int_{\bar x}^{\bar x+h}\frac1h\,k^2\!\left(\frac{x - \bar x}{h}\right)ZZ'\sigma^2(x)f_o(x)\,dx.$$
It helps to remember that $Z_iZ_i'$ is a function of $x$:
$$Z_iZ_i' = \begin{pmatrix} 1 & \frac{x_i - \bar x}{h} & \cdots & \left(\frac{x_i - \bar x}{h}\right)^p\\ \frac{x_i - \bar x}{h} & \left(\frac{x_i - \bar x}{h}\right)^2 & \cdots & \left(\frac{x_i - \bar x}{h}\right)^{p+1}\\ \vdots & \vdots & \ddots & \vdots\\ \left(\frac{x_i - \bar x}{h}\right)^p & \left(\frac{x_i - \bar x}{h}\right)^{p+1} & \cdots & \left(\frac{x_i - \bar x}{h}\right)^{2p}\end{pmatrix}.$$
Let $\delta_j^+ = \int_{\bar x}^{\bar x+h}\frac1h\,k^2\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^j\sigma^2(x)f_o(x)\,dx = \int_0^\infty k^2(u)\,u^j\sigma^2(\bar x + uh)f_o(\bar x + uh)\,du$, and let $\Delta_+^*$ be the $(p+1)\times(p+1)$ matrix with $(j,l)$ element $\delta_{j+l-2}^+$ for $j, l = 1, \dots, p+1$.
Then
$$\frac{1}{\sqrt{nh}}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_i\varepsilon_i \overset{d}{\to} N(0, \Delta_+^*).$$
Similarly we can show that
$$\frac{1}{\sqrt{nh}}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_i\varepsilon_i \overset{d}{\to} N(0, \Delta_-^*),$$
where $\Delta_-^*$ is the $(p+1)\times(p+1)$ matrix with $(j,l)$ element $\delta_{j+l-2}^-$ for $j, l = 1, \dots, p+1$, and $\delta_j^- = \int_{\bar x-h}^{\bar x}\frac1h\,k^2\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^j\sigma^2(x)f_o(x)\,dx = (-1)^j\int_0^\infty k^2(u)\,u^j\sigma^2(\bar x - uh)f_o(\bar x - uh)\,du$.

The bias term is given by
$$\sqrt{nh}\,e_1'\left[D_{n+}\left(\frac{1}{nh}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_im(x_i)\right) - D_{n-}\left(\frac{1}{nh}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_im(x_i)\right)\right].$$
Notice that if the rectangular kernel is used, this is nothing other than the difference between the intercepts estimated by the linear projection of $m(x)$ on $Z$ above and below the cutoff point, using only the data inside the bandwidth. Note that
$$E\!\left[\frac{1}{nh}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_im(x_i)\right] = \int_{\bar x}^{\bar x+h}\frac1h\,k\!\left(\frac{x - \bar x}{h}\right)Z(x)m(x)f_o(x)\,dx = \int_0^\infty k(u)\,Z(\bar x + uh)m(\bar x + uh)f_o(\bar x + uh)\,du,$$
and similarly
$$\frac{1}{nh}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_im(x_i) \overset{p}{\to} \int_{\bar x-h}^{\bar x}\frac1h\,k\!\left(\frac{x - \bar x}{h}\right)Z(x)m(x)f_o(x)\,dx.$$
Hence the bias term can be approximated by
$$e_1'\left[\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)m(\bar x + uh)f_o(\bar x + uh)\,du - \Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)m(\bar x - uh)f_o(\bar x - uh)\,du\right].$$

Proof. [Proof of Corollary 2] First, note that, if $h \to 0$,
$$\gamma_j^+ = \lim_{h\to0}\int_0^\infty k(u)\,u^jf_o(\bar x + uh)\,du = f_o(\bar x)\int_0^\infty k(u)\,u^j\,du = f_o(\bar x)\gamma_j \tag{C.1}$$
and
$$\delta_j^+ = \lim_{h\to0}\int_0^\infty k^2(u)\,u^j\sigma^2(\bar x + uh)f_o(\bar x + uh)\,du = \sigma^{2+}(\bar x)f_o(\bar x)\int_0^\infty k^2(u)\,u^j\,du = \sigma^{2+}(\bar x)f_o(\bar x)\delta_j, \tag{C.2}$$
and similarly for $\gamma_j^-$ and $\delta_j^-$. Then, for the variance,
$$\lim_{h\to0}\left[\Gamma_+^{*\,-1}\Delta_+^*\Gamma_+^{*\,-1} + \Gamma_-^{*\,-1}\Delta_-^*\Gamma_-^{*\,-1}\right] = (f_o(\bar x)\Gamma)^{-1}\sigma^{2+}(\bar x)f_o(\bar x)\Delta\,(f_o(\bar x)\Gamma)^{-1} + (f_o(\bar x)\Gamma)^{-1}\sigma^{2-}(\bar x)f_o(\bar x)\Delta\,(f_o(\bar x)\Gamma)^{-1} = \frac{\sigma^{2+}(\bar x) + \sigma^{2-}(\bar x)}{f_o(\bar x)}\,e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1.$$
For the bias, approximate $m(x)$ just above $\bar x$ by
$$m(x) = LP^+\!\left(m(x)\text{ on }Z(x)\right) + \frac{1}{(p+1)!}\,m^{(p+1)+}(\bar x)(x - \bar x)^{p+1} + o\!\left(h^{p+1}\right),$$
and similarly for approximating $m(x)$ just below the cutoff. When we evaluate $LP^+\!\left(m(x)\text{ on }Z(x)\right)$ at $\bar x$, we get the intercept $m(\bar x)$ and the "residual" as described above.
A helpful fact is that, by the definition of $Z(x)$,
$$\int_0^\infty k(u)\,Z(\bar x + uh)\,u^{p+1}\,du = \int_0^\infty k(u)\begin{pmatrix}1\\u\\\vdots\\u^p\end{pmatrix}u^{p+1}\,du = \begin{pmatrix}\gamma_{p+1}\\\gamma_{p+2}\\\vdots\\\gamma_{2p+1}\end{pmatrix} \tag{C.3}$$
and
$$\int_0^\infty k(u)\,Z(\bar x - uh)\,u^{p+1}\,du = \int_0^\infty k(u)\begin{pmatrix}1\\-u\\\vdots\\(-u)^p\end{pmatrix}u^{p+1}\,du. \tag{C.4}$$
Note that $\Gamma^{-1}(\gamma_{p+1}, \dots, \gamma_{2p+1})'$ is equal both above and below the cutoff. The bias formula in Theorem 6 is given by
$$e_1'\left[\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)m(\bar x + uh)f_o(\bar x + uh)\,du - \Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)m(\bar x - uh)f_o(\bar x - uh)\,du\right]. \tag{C.5}$$
As discussed in Section 2.5, the main term is just the difference between the intercepts of the linear projections of $k(u)m(x)$ on $k(u)Z(x)$ in the bandwidth above and below the cutoff, which equals the linear projections evaluated at $\bar x$. Hence, plugging in the bias formula for the linear projection, formula (C.5) can be written (plus an $o(h^{p+1})$ term) as
$$\begin{aligned}
&e_1'\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)\frac{m^{(p+1)+}(\bar x)}{(p+1)!}(uh)^{p+1}f_o(\bar x + uh)\,du\\
&\quad - e_1'\Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)\frac{(-1)^{p+1}m^{(p+1)-}(\bar x)}{(p+1)!}(uh)^{p+1}f_o(\bar x - uh)\,du\\
&= \frac{h^{p+1}}{(p+1)!}\left[m^{(p+1)+}(\bar x)\,e_1'\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)u^{p+1}f_o(\bar x + uh)\,du\right.\\
&\qquad\left. - (-1)^{p+1}m^{(p+1)-}(\bar x)\,e_1'\Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)u^{p+1}f_o(\bar x - uh)\,du\right],
\end{aligned}$$
where $\left(m(\bar x), m_{p+(-)}'\right)'$ is the vector of coefficients of the linear projection of $m(x)$ on $Z(x)$ in the bandwidth above (below) the cutoff. If $h \to 0$, using the equalities in equations (C.3), (C.4), (C.2) and (C.1), this becomes
$$\lim_{h\to0}\frac{h^{p+1}}{(p+1)!}\left[m^{(p+1)+}(\bar x) - (-1)^{p+1}m^{(p+1)-}(\bar x)\right]e_1'\Gamma^{-1}\begin{pmatrix}\gamma_{p+1}\\\vdots\\\gamma_{2p+1}\end{pmatrix}.$$

Proof. [Proof of Corollary 3] First, note that, if $h > 0$ and, in the bandwidth around the cutoff, $f_o(x) = f_o(\bar x)$, $\sigma^2(x) = \sigma^2(\bar x)$ and
$$m(x) = m(\bar x) + m^{(1)+}(\bar x)(x - \bar x) + \dots + \frac{1}{p!}m^{(p)+}(\bar x)(x - \bar x)^p + \frac{1}{(p+1)!}m^{(p+1)+}(\bar x)(x - \bar x)^{p+1},$$
then
$$\gamma_j^+ = \int_0^\infty k(u)\,u^jf_o(\bar x + uh)\,du = f_o(\bar x)\int_0^\infty k(u)\,u^j\,du = f_o(\bar x)\gamma_j \tag{C.6}$$
and
$$\delta_j^+ = \int_0^\infty k^2(u)\,u^j\sigma^2(\bar x + uh)f_o(\bar x + uh)\,du = \sigma^{2+}(\bar x)f_o(\bar x)\int_0^\infty k^2(u)\,u^j\,du = \sigma^{2+}(\bar x)f_o(\bar x)\delta_j, \tag{C.7}$$
and similarly for $\gamma_j^-$ and $\delta_j^-$. Then, for the variance,
$$\Gamma_+^{*\,-1}\Delta_+^*\Gamma_+^{*\,-1} + \Gamma_-^{*\,-1}\Delta_-^*\Gamma_-^{*\,-1} = \frac{\sigma^{2+}(\bar x) + \sigma^{2-}(\bar x)}{f_o(\bar x)}\,e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1.$$
For the bias, the strategy is basically the same as in the proof of Corollary 2:
$$m(x) = LP^+\!\left(m(x)\text{ on }Z(x)\right) + \frac{1}{(p+1)!}\,m^{(p+1)+}(\bar x)(x - \bar x)^{p+1},$$
and similarly for approximating $m(x)$ just below the cutoff. When we evaluate $LP^+\!\left(m(x)\text{ on }Z(x)\right)$ at $\bar x$, we get the intercept $m(\bar x)$ and the "residual" as described above. Once again, using formulas (C.3) and (C.4), the bias formula in Theorem 6 is
$$e_1'\left[\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)m(\bar x + uh)f_o(\bar x + uh)\,du - \Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)m(\bar x - uh)f_o(\bar x - uh)\,du\right].$$
As discussed in Section 2.5, the main term is the difference between the intercepts of the linear projections of $k(u)m(x)$ on $k(u)Z(x)$ in the bandwidth above and below the cutoff, which equals the linear projections evaluated at $\bar x$. Hence, plugging in the bias formula for the linear projection,
$$\begin{aligned}
&e_1'\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)\frac{m^{(p+1)+}(\bar x)}{(p+1)!}(uh)^{p+1}f_o(\bar x + uh)\,du\\
&\quad - e_1'\Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)\frac{(-1)^{p+1}m^{(p+1)-}(\bar x)}{(p+1)!}(uh)^{p+1}f_o(\bar x - uh)\,du\\
&= \frac{h^{p+1}}{(p+1)!}\left[m^{(p+1)+}(\bar x)\,e_1'\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)u^{p+1}f_o(\bar x + uh)\,du - (-1)^{p+1}m^{(p+1)-}(\bar x)\,e_1'\Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)u^{p+1}f_o(\bar x - uh)\,du\right],
\end{aligned}$$
where $\left(m(\bar x), m_{p+(-)}'\right)'$ is the vector of coefficients of the linear projection of $m(x)$ on $Z(x)$ in the bandwidth above (below) the cutoff. Using $f_o(x) = f_o(\bar x)$ and the equalities in formulas (C.3), (C.4), (C.7) and (C.6),
 m(p+1)+ (x) − (−1)p+1 m(p+1)− (x) e1 Γ−1  .  . γ 2p+1 Proof. [Proof of Theorem 8] To obtain the Covariance term for the asymptotic variance of the Fuzzy Regression Discontinuity estimator, note that the covariance will be determined by the expectation of the product of the stochastic terms. The covariance between the estimators for the outcome of interest and the treatment probability will be given by two independent terms, one for each side of the threshold. The upper side is given by   n   1 E e1 Dn+ k  nh i=1 xi − x h  di Z i y i   1 nh n k i=1 xi − x h  di Zi ti  Dn+ e1    Where ti is the dummy variable indicating that the observation has received treatment. In obtaining the asymptotic covariance, the bias term of the estimator can be ignored, hence   E e1 Dn+   1 nh  = E e1 Dn+  1 nh n k i=1 n n xi − x h k i=1 j=1  n d i Z i εi   k i=1 xi − x h k 120 xj − x h xi − x h   di Zi η i  Dn+ e1    di dj Zi Zj εi η j  Dn+ e1    = E e1 Dn+   1 nh  = E e1 Dn+  1 nh n k k i=1 h x h h   di Zi Zi E [εi η i | X = x] Dn+ e1    di Zi Zi σ εη (xi ) Dn+ e1  xi − x 2 di Zi Zi σ εη (xi ) Dn+ e1 h x+h 1 = e1 Dn+ xi − x 2 i=1 n 1 k h = E e1 Dn+ xi − x 2 x−x 2 ZZ σ εη (x)fo (x)dx Dn+ e1 h k where I used the assumption that E εi η j | X = x = 0 for j = i. Similarly for the second term, x−x 2 ZZ σ εη (x)fo (x)dx Dn− e1 h x e1 Dn− 1 k x−h h x−x j σ (x)f (x)dx = εη o h ∞ 2 j 0 k (u) u σ εη (x + uh)fo (x + ρ − uh)du, ∆+ is the (p+1)×(p+1) matrix with (j, l) element ρ+ j+l−2 for j, l = 1, ..., p+1, ρj = x 1 k 2 x−x x−x j σ (x)f (x)dx = (−1)j ∞ k 2 (u) uj σ (x−uh)f (x−uh)du and εη o εη o x−h h 0 h h ρ ∆− is the (p + 1) × (p + 1) matrix with (j, l) element ρ− j+l−2 for j, l = 1, ..., p + 1 Then the x+h 1 k 2 x−x Let ρ+ = x j h h asymptotic covariance is given by Cαθ = e1 ρ ρ Γ∗ −1 ∆+ Γ∗ −1 + Γ∗ −1 ∆− Γ∗ −1 e1 + + − − The asymptotic covariance for the Nadaraya-Watson estimator will be given by the special case when p = 0. 
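The scalar $e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1$ appearing in the limiting variance expressions above can be evaluated numerically for any kernel. A minimal sketch, assuming a one-sided uniform kernel $k(u)=\mathbb{1}[0\le u\le 1]$ and a local linear fit ($p=1$), both illustrative choices of mine rather than the chapter's preferred kernel: here $\gamma_j=\delta_j=1/(j+1)$, so $\Gamma$ and $\Delta$ coincide.

```python
import numpy as np

p = 1  # local linear estimator
# For the uniform one-sided kernel, gamma_j = delta_j = 1/(j+1), so the
# (j, l) entries of both Gamma and Delta are 1/(j + l - 1).
idx = np.arange(p + 1)
Gamma = 1.0 / (idx[:, None] + idx[None, :] + 1.0)
Delta = Gamma.copy()

e1 = np.zeros(p + 1)
e1[0] = 1.0
Ginv = np.linalg.inv(Gamma)
constant = e1 @ Ginv @ Delta @ Ginv @ e1  # = 4 for p = 1
```

Since $k^2=k$ for the uniform kernel, $\Delta=\Gamma$ and the constant collapses to the $(1,1)$ element of $\Gamma^{-1}$, which is 4 here; other kernels change $\gamma_j$ and $\delta_j$ and hence the constant.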
\[
C_{\alpha\theta}=\frac{\int_0^\infty k^2(u)\,\sigma_{\varepsilon\eta}(\bar{x}+uh)f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty k(u)\,f_o(\bar{x}+uh)\,du\right)^2}+\frac{\int_0^\infty k^2(u)\,\sigma_{\varepsilon\eta}(\bar{x}-uh)f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty k(u)\,f_o(\bar{x}-uh)\,du\right)^2}.
\]

Proof. [Proof of Corollary 4] Using the results in equations (C.1) and noting that, if $h\to 0$,
\[
\lim_{h\to 0}\rho_j^+=\lim_{h\to 0}\int_0^\infty k^2(u)\,u^j\sigma_{\varepsilon\eta}(\bar{x}+uh)f_o(\bar{x}+uh)\,du=\sigma_{\varepsilon\eta}^{+}(\bar{x})f_o(\bar{x})\delta_j
\]
and similarly for $\rho_j^-$. Then
\begin{align*}
\lim_{h\to 0}e_1'\left[\Gamma_+^{*-1}\Delta_+^{\rho}\Gamma_+^{*-1}+\Gamma_-^{*-1}\Delta_-^{\rho}\Gamma_-^{*-1}\right]e_1
&=e_1'\left[(f_o(\bar{x})\Gamma)^{-1}\sigma_{\varepsilon\eta}^{+}(\bar{x})f_o(\bar{x})\Delta\,(f_o(\bar{x})\Gamma)^{-1}\right.\\
&\qquad\left.+\,(f_o(\bar{x})\Gamma)^{-1}\sigma_{\varepsilon\eta}^{-}(\bar{x})f_o(\bar{x})\Delta\,(f_o(\bar{x})\Gamma)^{-1}\right]e_1\\
&=\frac{\sigma_{\varepsilon\eta}^{+}(\bar{x})+\sigma_{\varepsilon\eta}^{-}(\bar{x})}{f_o(\bar{x})}\,e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1.
\end{align*}

Proof. [Proof of Corollary 5] The proof follows Corollary 4 very closely. Using the results in equations (C.1) and noting that, if $h>0$ and $f_o(x)=f_o(\bar{x})$, $\sigma_{\varepsilon\eta}(x)=\sigma_{\varepsilon\eta}^{+}(\bar{x})$ above and $\sigma_{\varepsilon\eta}(x)=\sigma_{\varepsilon\eta}^{-}(\bar{x})$ below the cutoff for any $x$ in the range around the cutoff,
\[
\rho_j^+=\int_0^\infty k^2(u)\,u^j\sigma_{\varepsilon\eta}(\bar{x}+uh)f_o(\bar{x}+uh)\,du=\sigma_{\varepsilon\eta}^{+}(\bar{x})f_o(\bar{x})\delta_j
\]
and similarly for $\rho_j^-$. Then
\[
e_1'\left[\Gamma_+^{*-1}\Delta_+^{\rho}\Gamma_+^{*-1}+\Gamma_-^{*-1}\Delta_-^{\rho}\Gamma_-^{*-1}\right]e_1=\frac{\sigma_{\varepsilon\eta}^{+}(\bar{x})+\sigma_{\varepsilon\eta}^{-}(\bar{x})}{f_o(\bar{x})}\,e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1.
\]

APPENDIX D

Proofs to "Asymptotic Properties of Quantile Regression for Standard Stratified Samples"

Proof. [Proof of Corollary 6] The general form of the variance follows directly from Newey and McFadden (1994), Theorem 7.1. The specific formulas for $A_w$ and $B_w$ are obtained by checking that the proof used by Wooldridge (2001) still holds for the estimator that minimizes the objective function given by equation (3.2). I follow his procedure below. Since within each stratum we have an i.i.d. sequence $\{w_{ij}:i=1,2,\ldots,N_j\}$ for each $j$, a CLT for i.i.d. observations can be applied for each stratum:
\[
N_j^{-1/2}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_{\tau o}\right)-\mu_j\right]\xrightarrow{d}N\left(0,B_j\right),
\]
where
\[
s_{ij}\left(\beta_{\tau o}\right)\equiv\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i,\qquad
\dot{g}_i\equiv\frac{\partial g\left(x_{ij},\beta\right)}{\partial\beta}\Big|_{\beta=\beta_{\tau o}},
\]
\[
\mu_j\equiv E\left[s_{ij}\left(\beta_{\tau o}\right)\mid w\in\mathcal{W}_j\right],\qquad
B_j\equiv Var\left[s_{ij}\left(\beta_{\tau o}\right)\right]=Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid w\in\mathcal{W}_j\right].
\]
As seen in equation (3.4),
\[
\sum_{j=1}^{J}Q_j\mu_j=0.
\]
The score of the objective function, multiplied by $N^{-1/2}$ and evaluated at $\beta_{\tau o}$, can be written as
\begin{align*}
N^{-1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}s_{ij}\left(\beta_{\tau o}\right)
&=N^{-1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_{\tau o}\right)-\mu_j\right]\\
&=\sum_{j=1}^{J}\frac{Q_j}{H_j^{1/2}}\,N_j^{-1/2}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_{\tau o}\right)-\mu_j\right]\xrightarrow{d}N\left(0,B_w\right)
\end{align*}
with
\[
B_w=\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j=\sum_{j=1}^{J}\frac{Q_j^2}{H_j}Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid w\in\mathcal{W}_j\right],
\]
where I used the fact that sampling is random within each stratum and the observations are independent across strata. In the linear case this formula simplifies to
\[
B_w=\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j=\sum_{j=1}^{J}\frac{Q_j^2}{H_j}Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-x_{ij}'\beta_{\tau o}\le 0\right]\right)x_{ij}\mid w\in\mathcal{W}_j\right].
\]
For the outer part of the variance matrix it is enough to note that
\begin{align*}
A_w&=\sum_{j=1}^{J}Q_j\nabla_\beta E\left[\left(\tau-\mathbb{1}\left[y-g\left(x,\beta_{\tau o}\right)\le 0\right]\right)\dot{g}\mid w\in\mathcal{W}_j\right]\\
&=\sum_{j=1}^{J}Q_j\nabla_\beta E\left[E\left[\tau-\mathbb{1}\left[y-g\left(x,\beta_{\tau o}\right)\le 0\right]\mid x,w\in\mathcal{W}_j\right]\dot{g}\mid w\in\mathcal{W}_j\right]\\
&=\sum_{j=1}^{J}Q_j\nabla_\beta E\left[\left(\tau-F_{y|x,w\in\mathcal{W}_j}\left(g\left(x,\beta_{\tau o}\right)\right)\right)\dot{g}\mid w\in\mathcal{W}_j\right]\\
&=\sum_{j=1}^{J}Q_jE\left[f_{y|x,w\in\mathcal{W}_j}\left(g\left(x,\beta_{\tau o}\right)\right)\dot{g}\dot{g}'\mid w\in\mathcal{W}_j\right]\\
&=E\left[f_{y|x}\left(g\left(x,\beta_{\tau o}\right)\right)\dot{g}\dot{g}'\right],
\end{align*}
and that the Jacobian of
\[
\sum_{j=1}^{J}Q_j\frac{1}{N_j}\sum_{i=1}^{N_j}\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau}\right)\le 0\right]\right)\dot{g}_i
\]
converges in probability uniformly to $A_w$. In the linear case this formula simplifies to
\[
A_w=\sum_{j=1}^{J}Q_jE\left[f_{y|x,w\in\mathcal{W}_j}\left(x'\beta_{\tau o}\right)xx'\mid w\in\mathcal{W}_j\right]=E\left[f_{y|x}\left(x'\beta_{\tau o}\right)xx'\right].
\]

Proof. [Proof of Corollary 7] The general form of the variance follows directly from Newey and McFadden (1994), Theorem 7.1. Under exogenous stratification, the following changes need to be made to the definitions used to prove Corollary 6:
(a) $\mu_j\equiv E\left[s_{ij}\left(\beta_{\tau o}\right)\right]=E\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid x\in\chi_j\right]=0$ for every stratum $j$;
(b) $B_j\equiv Var\left[s_{ij}\left(\beta_{\tau o}\right)\right]=Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid x\in\chi_j\right]$.
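The practical role of the $Q_j/H_j$ weights in the stratified estimating equation can be illustrated numerically before proceeding. The sketch below uses a hypothetical two-stratum population of my own construction, with a `weighted_quantile` helper standing in for the full quantile regression problem (an intercept-only case): one stratum is oversampled relative to its population share, the weighted solution roughly recovers the population median, and the unweighted sample median does not.

```python
import numpy as np

def weighted_quantile(y, w, tau):
    """Smallest y whose cumulative weight share reaches tau."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    cw = np.cumsum(w)
    return y[np.searchsorted(cw, tau * cw[-1])]

# Two strata with equal population shares Q1 = Q2 = 0.5,
# but stratum 1 is heavily oversampled: H1 = 0.8, H2 = 0.2.
stratum1 = np.repeat(np.arange(1, 101), 4)   # 400 draws from {1, ..., 100}
stratum2 = np.arange(101, 201)               # 100 draws from {101, ..., 200}
y = np.concatenate([stratum1, stratum2])

# Q_j / H_j weights, constant within each stratum
w = np.concatenate([np.full(400, 0.5 / 0.8), np.full(100, 0.5 / 0.2)])

unweighted = weighted_quantile(y, np.ones_like(w), 0.5)  # -> 63
weighted = weighted_quantile(y, w, 0.5)                  # -> 100
```

The population median of the combined grid $\{1,\ldots,200\}$ is about 100.5; the $Q_j/H_j$-weighted solution sits at 100, while the raw sample median is pulled down to 63 by the oversampled stratum.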
Then, the score of the objective function, multiplied by $N^{-1/2}$ and evaluated at $\beta_{\tau o}$, can be written as
\[
N^{-1/2}\sum_{i=1}^{N}s_{ij}\left(\beta_{\tau o}\right)=\sum_{j=1}^{J}H_j^{1/2}\,N_j^{-1/2}\sum_{i=1}^{N_j}s_{ij}\left(\beta_{\tau o}\right)\xrightarrow{d}N\left(0,B_u\right)
\]
with
\[
B_u=\sum_{j=1}^{J}H_jB_j=\sum_{j=1}^{J}H_jVar\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid x\in\chi_j\right],
\]
where I used the fact that sampling is random within each stratum and the observations are independent across strata. In the linear case this formula simplifies to
\[
B_u=\sum_{j=1}^{J}H_jB_j=\sum_{j=1}^{J}H_jVar\left[\left(\tau-\mathbb{1}\left[y_{ij}-x_{ij}'\beta_{\tau o}\le 0\right]\right)x_{ij}\mid x\in\chi_j\right].
\]
Note that in this case, since $E\left[\mathbb{1}\left[y_{ij}-x_{ij}'\beta_{\tau o}\le 0\right]\mid x\right]$ is the same for every stratum,
\[
Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid x\in\chi_j\right]=\tau(1-\tau)E\left[\dot{g}_i\dot{g}_i'\mid x\in\chi_j\right]
\]
and
\[
B_u=\sum_{j=1}^{J}H_j\tau(1-\tau)E\left[\dot{g}_i\dot{g}_i'\mid x\in\chi_j\right]=\tau(1-\tau)\sum_{j=1}^{J}H_jE\left[\dot{g}_i\dot{g}_i'\mid x\in\chi_j\right].
\]
In the linear CQF case,
\[
B_u=\tau(1-\tau)\sum_{j=1}^{J}H_jE\left[xx'\mid x\in\chi_j\right].
\]
For the outer part of the variance matrix it is enough to note that
\begin{align*}
A_u&=\sum_{j=1}^{J}H_j\nabla_\beta E\left[\left(\tau-\mathbb{1}\left[y-g\left(x,\beta_{\tau o}\right)\le 0\right]\right)\dot{g}\mid x\in\chi_j\right]\\
&=\sum_{j=1}^{J}H_j\nabla_\beta E\left[E\left[\tau-\mathbb{1}\left[y-g\left(x,\beta_{\tau o}\right)\le 0\right]\mid x\right]\dot{g}\mid x\in\chi_j\right]\\
&=\sum_{j=1}^{J}H_j\nabla_\beta E\left[\left(\tau-F_{y|x}\left(g\left(x,\beta_{\tau o}\right)\right)\right)\dot{g}\mid x\in\chi_j\right]\\
&=\sum_{j=1}^{J}H_jE\left[f_{y|x}\left(g\left(x,\beta_{\tau o}\right)\right)\dot{g}\dot{g}'\mid x\in\chi_j\right].
\end{align*}
In the linear case this formula simplifies to
\[
A_u=\sum_{j=1}^{J}H_jE\left[f_{y|x}\left(x_{ij}'\beta_{\tau o}\right)xx'\mid x\in\chi_j\right].
\]

Proof. [Proof of Corollary 8] The asymptotic multivariate normality result follows directly from a standard Cramér-Wold device argument applied to the vector of scores for each quantile, separately for each stratum. Let
\[
s_{ij}\left(\beta_\tau\right)=\left(s_{ij}^{1}\left(y,x,\beta_{\tau_1}\right)',\,s_{ij}^{2}\left(y,x,\beta_{\tau_2}\right)',\,\ldots,\,s_{ij}^{p}\left(y,x,\beta_{\tau_p}\right)'\right)'
\]
and
\[
\mu_j\equiv E\left[s_{ij}\left(\beta_\tau\right)\right]=E\left[\left(s_{ij}^{1}\left(y,x,\beta_{\tau_1}\right)',\,s_{ij}^{2}\left(y,x,\beta_{\tau_2}\right)',\,\ldots,\,s_{ij}^{p}\left(y,x,\beta_{\tau_p}\right)'\right)'\right].
\]
Then,
\[
N_j^{-1/2}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_\tau\right)-\mu_j\right]\xrightarrow{d}N\left(0,B_j\right)
\]
with $B_j$ a $p\times p$ variance-covariance matrix with typical element
\[
B_j^{l,k}\equiv Cov\left(s_{ij}\left(\beta_{\tau_l}\right),s_{ij}\left(\beta_{\tau_k}\right)\right)=Cov\left(\left(\tau_l-\mathbb{1}\left[y-g\left(x,\beta_{\tau_l}\right)\le 0\right]\right)\dot{g}^{l},\left(\tau_k-\mathbb{1}\left[y-g\left(x,\beta_{\tau_k}\right)\le 0\right]\right)\dot{g}^{k}\mid w\in\mathcal{W}_j\right).
\]
Then,
\begin{align*}
N^{-1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}s_{ij}\left(\beta_\tau\right)
&=N^{-1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_\tau\right)-\mu_j\right]\\
&=\sum_{j=1}^{J}\frac{Q_j}{H_j^{1/2}}\,N_j^{-1/2}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_\tau\right)-\mu_j\right]\xrightarrow{d}N\left(0,B_s\right),
\end{align*}
where $B_s=\left[B_s^{l,k}\right]_{l,k=1,\ldots,p}$ with typical element
\[
B_s^{l,k}=\left[\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j\right]_{l,k},
\]
and the outer part of each term is given by $A_w^{l}\equiv E\left[f_{y|x}\left(g\left(x,\beta_{\tau_l}\right)\right)\dot{g}^{l}\dot{g}^{l\prime}\right]$ and $A_w^{k}\equiv E\left[f_{y|x}\left(g\left(x,\beta_{\tau_k}\right)\right)\dot{g}^{k}\dot{g}^{k\prime}\right]$, as argued in Buchinsky (1998). Hence, $\sqrt{N}\left(\hat{\beta}_\tau-\beta_\tau\right)\overset{a}{\sim}N\left(0,\Lambda_\tau\right)$, where $\Lambda_\tau=\left[\Lambda_\tau^{l,k}\right]_{l,k=1,\ldots,p}$ with typical element defined as
\[
\Lambda_\tau^{l,k}=E\left[f_{y|x}\left(g\left(x,\beta_{\tau_l}\right)\right)\dot{g}^{l}\dot{g}^{l\prime}\right]^{-1}\left[\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j\right]_{l,k}E\left[f_{y|x}\left(g\left(x,\beta_{\tau_k}\right)\right)\dot{g}^{k}\dot{g}^{k\prime}\right]^{-1}
\]
and, in the special case of the linear CQF, $g(x,\beta)=x'\beta$,
\[
\Lambda_\tau^{l,k}=E\left[f_{y|x}\left(x'\beta_{\tau_l}\right)xx'\right]^{-1}\left[\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j\right]_{l,k}E\left[f_{y|x}\left(x'\beta_{\tau_k}\right)xx'\right]^{-1}.
\]

BIBLIOGRAPHY

S. Ahn and P. Schmidt. A separability result for GMM estimation, with applications to GLS prediction and conditional moment tests. Econometric Reviews, 14(1):19–34, 1995.

J. Albrecht, A. Björklund, and S. Vroman. Is there a glass ceiling in Sweden? Journal of Labor Economics, 21(1):145–177, 2003.

D. Andrews. Empirical process methods in econometrics. In R. Engle and D. McFadden, editors, Handbook of Econometrics, volume 4, pages 2247–2294. Elsevier, 1994.

J. Angrist and J. Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2009.

J. Angrist, V. Chernozhukov, and I. Fernández-Val. Quantile regression under misspecification, with an application to the US wage structure. Econometrica, 74(2):539–563, 2006.

M. Buchinsky. Recent advances in quantile regression models: a practical guideline for empirical research.
Journal of Human Resources, 33(1):88–126, 1998.

M. Buchinsky. Quantile regression with sample selection: Estimating women's return to education in the US. Empirical Economics, 26(1):87–113, 2001.

M. Cattaneo. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155(2):138–154, 2010.

X. Chen, O. Linton, and I. Van Keilegom. Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71(5):1591–1608, 2003.

X. Chen, H. Hong, and A. Tarozzi. Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36(2):808–843, 2008.

V. Chernozhukov and C. Hansen. An IV model of quantile treatment effects. Econometrica, 73(1):245–261, 2005.

V. Chernozhukov and C. Hansen. Instrumental quantile regression inference for structural and treatment effect models. Journal of Econometrics, 132(2):491–525, 2006.

V. Chernozhukov and C. Hansen. Instrumental variable quantile regression: A robust inference approach. Journal of Econometrics, 142(1):379–398, 2008.

H. Daniels. The asymptotic efficiency of a maximum likelihood estimator. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 151–163. Berkeley: University of California Press, 1961.

J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman & Hall/CRC, 1996.

Y. Fan. Goodness-of-fit tests based on kernel density estimators with fixed smoothing parameters. Econometric Theory, 14(5):604–621, 1998.

J. Hahn, P. Todd, and W. Van der Klaauw. Evaluating the effect of an antidiscrimination law using a regression-discontinuity design. NBER Working Paper Series, 1999.

J. Hahn, P. Todd, and W. Van der Klaauw. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1):201–209, 2001.

K. Hirano, G. Imbens, and G. Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

K. Hitomi, Y. Nishiyama, and R. Okui. A puzzling phenomenon in semiparametric estimation problems with infinite-dimensional nuisance parameters. Econometric Theory, 24(6):1717–1728, 2008.

P. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233, 1967.

G. Imbens and J. Angrist. Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475, 1994.

G. Imbens and T. Lemieux. Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2):615–635, 2008.

B. James. Probabilidade: Um curso em nível intermediário. IMPA, Rio de Janeiro, third edition, 2004.

R. Koenker. Quantile Regression. Cambridge University Press, 2005.

D. Lee and T. Lemieux. Regression discontinuity designs in economics. Technical report, National Bureau of Economic Research, 2009.

W. Lee. Robust tests of hypotheses in models with M-estimation. Working paper, Department of Economics, National Chung Cheng University, 2008.

J. Machado and J. Mata. Counterfactual decomposition of changes in wage distributions using quantile regression. Journal of Applied Econometrics, 20(4):445–465, 2005.

P. Martins and P. Pereira. Does education reduce wage inequality? Quantile regression evidence from 16 countries. Labour Economics, 11(3):355–371, 2004.

B. Melly. Decomposition of differences in distribution using quantile regression. Labour Economics, 12(4):577–590, 2005.

H. Neave. An improved formula for the asymptotic variance of spectrum estimates. The Annals of Mathematical Statistics, 41(1):70–77, 1970.

W. Newey and D. McFadden. Large sample estimation and hypothesis testing. In R. Engle and D. McFadden, editors, Handbook of Econometrics, volume 4, pages 2111–2245. Elsevier, 1994.

A. Pagan and A. Ullah. Nonparametric Econometrics. Cambridge University Press, 1999.

A. Pakes and D. Pollard. Simulation and the asymptotics of optimization estimators. Econometrica, 57(5):1027–1057, 1989.

D. Pollard. New ways to prove central limit theorems. Econometric Theory, 1(3):295–313, 1985.

J. Porter. Estimation in the regression discontinuity model. Unpublished manuscript, Department of Economics, University of Wisconsin at Madison, 2003.

J. Powell. Estimation of monotonic regression models under quantile restrictions. In W. Barnett, J. Powell, and G. Tauchen, editors, Nonparametric and Semiparametric Methods in Econometrics and Statistics, pages 357–384, 1991.

A. Prokhorov and P. Schmidt. GMM redundancy results for general missing data problems. Journal of Econometrics, 151(1):47–55, 2009.

H. Qian and P. Schmidt. Improved instrumental variables and generalized method of moments estimators. Journal of Econometrics, 91(1):145–169, 1999.

H. White. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1–25, 1982.

H. White. Asymptotic Theory for Econometricians. Academic Press, New York, 1984.

H. White. Estimation, Inference and Specification Analysis. Cambridge University Press, 1996.

J. Wooldridge. Asymptotic properties of weighted M-estimators for standard stratified samples. Econometric Theory, 17(2):451–470, 2001.

J. Wooldridge. Econometric Analysis of Cross Section and Panel Data. MIT Press, first edition, 2002a.

J. Wooldridge. Inverse probability weighted M-estimators for sample selection, attrition, and stratification. Portuguese Economic Journal, 1(2):117–139, 2002b.

J. Wooldridge. Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2):1281–1301, 2007.