ESSAYS IN ECONOMETRICS

By

Otávio Augusto Camargo Bartalotti

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Economics
2012

ABSTRACT

ESSAYS IN ECONOMETRICS

By

Otávio Augusto Camargo Bartalotti

This dissertation is divided into three self-contained chapters. The first extends the GMM redundancy results of Prokhorov and Schmidt (2009) to nonsmooth objective functions, giving sharp guidelines for obtaining efficient estimates of parameters of interest (β_o) in the presence of nuisance parameters (γ_o). One-step GMM estimation of both sets of parameters is asymptotically more efficient than two-step procedures. These results are applied to Wooldridge's (2007) inverse probability weighted (IPW) estimator, generalizing the framework to deal with missing data in this context. Even though two-step estimation of β_o is more efficient than using known probabilities of selection, it is in turn dominated by a one-step joint estimation procedure. Examples for quantile regression with missing data and instrumental variable quantile regression are provided.

The second chapter analyzes the asymptotic distribution of local polynomial estimators in the context of regression discontinuity designs. The standard "small-h" approach in the literature (Hahn et al., 2001; Porter, 2003; Imbens and Lemieux, 2008; Lee and Lemieux, 2009) is to assume that the bandwidth, h, around the discontinuity shrinks towards zero as the sample size increases. However, in practice, the researcher has to choose an h > 0 to implement the estimator. This chapter derives the fixed-h asymptotic distribution, which allows the bandwidth to be positive, providing refined approximations for the estimator's behavior. When h > 0, the small-h asymptotic variance is equivalent to assuming that the density of the running variable and the conditional variance of the dependent variable are constant around the cutoff.
Simulations provide evidence that fixed-h asymptotic distributions better describe the behavior of both the bias and the variance of the estimator, leading to improved inference. Estimators for fixed-h standard errors are proposed and incorporate the theoretical gains of the improved approximations. The fixed-h variance estimators improve markedly over small-h estimators in the presence of some forms of heteroskedasticity. Interestingly, in the special case of homoskedastic errors using a local linear estimator, the variance estimators based on small-h asymptotics produce tests with size similar to the fixed-h variance estimators proposed in this chapter.

Chapter 3 develops the asymptotic properties of quantile regression estimators under standard stratified sampling, following Wooldridge (2001). Formulas for the asymptotic variance and feasible estimators are provided. Under exogenous stratification the usual quantile regression estimators and standard errors are still valid.

To Beatriz.

ACKNOWLEDGMENTS

"We have not journeyed all this way across the centuries, across the oceans, across the mountains, across the prairies, because we are made of sugar candy." Winston Churchill

I would like to acknowledge the great help and support given to this endeavor by the Michigan State University faculty. Especially my advisor, Jeff Wooldridge, who supported me in my goals even before I started the Ph.D. program and kept me upbeat and motivated through the toughest moments; I would not have been able to accomplish this without his guidance, help and friendship. Also, Tim Vogelsang, who was fundamental to every part of this dissertation through his insights, truthfulness and great openness to my ideas. I cannot express enough thanks to Gary Solon, who always kept his door open and a great interest in my research, providing insight, guidance and great advice in the five years I spent at MSU's Economics Department.
Peter Schmidt, always sharp in his comments and open-minded, provided formidable advice that greatly improved this dissertation. I am also thankful to several other great faculty, such as Todd Elder, Steve Haider, Matias Cattaneo (at the University of Michigan) and many others. I cannot forget to thank Carlos Pereira for believing in me when few would. Finally, none of this would have been achieved without the support of Maria Carolina Leme, who guided me through my first steps in academia and still teaches me so much to this day.

My dissertation and life have been made better by great friends, their inputs, ideas and long coffee breaks: Steve Dieterle, Quentin Brummet, Thomas Fujiwara, Ilya Rahkovsky, Max Melstrom, Breno Braga, Valentin Verdier, Maksym Ivanyna, Elizabeth Quin, Ehren Schuttringer, Cuicui Lu, Stacy Miller, Iraj Rahmani, Paul Burkander, Hassan Enayati, Laura Rees, Jason Stornelli, Adithya Pattabhiramaiah, among others.

I thank my parents Bete and Flávio and my stepmother Hilda for their love and support throughout this journey and for embedding in my formation the desire for knowledge that drove me to this career. Also my sister Laís, whom I love and who helps me keep things in perspective. To my grandparents for the great life example they provide, and especially to Antonio Bartalotti, who did not live to see this day but is in my thoughts and heart every day. Finally, I acknowledge the courage, help and love I have received from my wife, Beatriz, who shared with me every moment of the path and was understanding and reassuring through my achievements and failures.

TABLE OF CONTENTS

LIST OF FIGURES

1 GMM Efficiency and IPW for Nonsmooth Functions
  1.1 Introduction
  1.2 General Estimation Problem
  1.3 Estimation with missing data
    1.3.1 Data Selection under Ignorability
    1.3.2 Data Selection under Exogeneity of Selection
  1.4 Examples
    1.4.1 Quantile Regression under Ignorability of Selection
    1.4.2 Instrumental Variable Quantile Regression
  1.5 Conclusion

2 Fixed Bandwidth Asymptotics for Regression Discontinuity Designs
  2.1 Introduction
  2.2 Model
    2.2.1 Sharp Regression Discontinuity Design
    2.2.2 Fuzzy Regression Discontinuity Design
  2.3 Estimators
  2.4 Assumptions
  2.5 Asymptotic Distributions
    2.5.1 Fuzzy Regression Discontinuity Design
  2.6 Variance Estimators
  2.7 Simulations
    2.7.1 Simulations for Infeasible Inference
    2.7.2 Simulations for Feasible Inference
  2.8 Conclusion

3 Asymptotic Properties of Quantile Regression for Standard Stratified Samples
  3.1 Introduction
  3.2 The Quantile Regression Population Problem
    3.2.1 Quantile Regression under Stratified Sampling
    3.2.2 Quantile Regression Estimation under Exogenous Stratification
    3.2.3 Sequence of Quantile Regressions
  3.3 Asymptotic Variance Estimation
  3.4 Conclusion

APPENDICES

A Proofs to "GMM Efficiency and IPW for Nonsmooth Functions"
B Figures to "Fixed Bandwidth Asymptotics for Regression Discontinuity Designs"
  B.1 Simulations for Infeasible Inference
    B.1.1 Nadaraya-Watson Estimator
    B.1.2 Local Linear Estimator
    B.1.3 Heteroskedastic Errors
  B.2 Simulations for Feasible Inference
    B.2.1 Local Linear Estimator
    B.2.2 Heteroskedastic Errors
    B.2.3 Bandwidth Choice for f_o(x)
C Proofs to "Fixed Bandwidth Asymptotics for Regression Discontinuity Designs"
D Proofs to "Asymptotic Properties of Quantile Regression for Standard Stratified Samples"

BIBLIOGRAPHY

LIST OF FIGURES

B.1 Nadaraya-Watson Estimator - DGP: No X - Homosk. Errors
B.2 Nadaraya-Watson Estimator - DGP: Linear - Homosk. Errors
B.3 Nadaraya-Watson Estimator - DGP: Linear - Bias Corrected - Homosk. Errors
B.4 Nadaraya-Watson Estimator - DGP: Linear - Comparison - Homosk. Errors
B.5 Local Linear Estimator - DGP: Linear - Homosk. Errors
B.6 Local Linear Estimator - DGP: Quadratic - Homosk. Errors
B.7 Local Linear Estimator - DGP: Quadratic - Bias Corrected - Homosk. Errors
B.8 Local Linear Estimator - DGP: Quadratic - Comparison - Homosk. Errors
B.9 Local Linear Estimator - DGP: Linear - Heterosk. Errors Case 1
B.10 Local Linear Estimator - DGP: Quadratic - Heterosk. Errors Case 1
B.11 Local Linear Estimator - DGP: Linear - Heterosk. Errors Case 2
B.12 Local Linear Estimator - DGP: Quadratic - Heterosk. Errors Case 2
B.13 Nadaraya-Watson Estimator - DGP: No X - Feasible - Homosk. Errors
B.14 Nadaraya-Watson Estimator - DGP: Linear - Feasible - Homosk. Errors
B.15 Local Linear Estimator - DGP: Linear - Feasible - Homosk. Errors
B.16 Local Linear Estimator - DGP: Quadratic - Feasible - Homosk. Errors
B.17 Local Linear Estimator - DGP: Linear - Feasible - Heterosk. Errors Case 1
B.18 Local Linear Estimator - DGP: Quadratic - Feasible - Heterosk. Errors Case 1
B.19 Local Linear Estimator - DGP: Linear - Feasible - Heterosk. Errors Case 2
B.20 Local Linear Estimator - DGP: Quadratic - Feasible - Heterosk. Errors Case 2
B.21 Small-h Sensitivity to Density Bandwidth - DGP: Linear
B.22 Small-h Sensitivity to Density Bandwidth - DGP: Quadratic

CHAPTER 1

GMM Efficiency and IPW for Nonsmooth Functions

1.1 Introduction

This chapter extends the analysis of Prokhorov and Schmidt (2009) to the estimation of a general GMM problem with nonsmooth objective functions in which nuisance parameters are present. The framework developed encompasses several interesting problems in econometrics, such as missing data, censored or truncated data, treatment effects and instrumental variables. More importantly, by allowing nonsmooth objective functions, the analysis extends to models that have gained additional importance in recent years, e.g., least absolute deviations (LAD), quantile regression (QR), censored LAD, quantile treatment effects and instrumental variables quantile regression (IVQR).

The core results of this chapter extend the Prokhorov and Schmidt (2009) results on GMM redundancy by allowing the use of nonsmooth objective functions. These results rely on Newey and McFadden (1994) to obtain the asymptotic variance of the GMM estimator under less restrictive assumptions on the smoothness of the objective functions. To that end, consider two sets of moment conditions, where the first includes both the parameters of interest (β_o) and certain nuisance parameters (γ_o), while the second set includes only the nuisance parameters. By defining four competing estimators based on different assumptions regarding the information available about these nuisance parameters and the moment conditions utilized, results about the relative efficiency of each proposed estimator are derived. These results provide guidance for applied work in the presence of nuisance parameters.
As discussed by Prokhorov and Schmidt (2009), joint estimation of the nuisance parameters and the parameters of interest is more efficient than a two-step procedure, or than knowing the true nuisance parameters and disregarding the second set of moment conditions. This is due to the information contained in the correlation between the two sets of moment conditions, which is useful even when γ_o is known. Using only the first set of moment conditions and the known values of γ_o in the estimation procedure ignores the additional information embedded in the second set of moment conditions. These results are shown to hold when the objective functions are nonsmooth.

The general results are directly applicable to missing data problems and encompass Wooldridge's (2002b, 2007) analysis of inverse probability weighting (IPW) estimators, extending its use to nonsmooth objective functions under the usual "ignorability" assumptions about the selection process. The general estimation results confirm the validity of the puzzle described by Wooldridge (2007), i.e., that it is better (in an efficiency sense) to estimate the selection probabilities, even if the latter are known. In other terms, we obtain more efficient estimates of β_o if we estimate γ_o than if we use the true γ_o. This result is "puzzling" because knowledge of γ_o, if properly exploited, cannot be harmful. Previous work has discussed this result, such as Wooldridge (2002b, 2007) in the context of IPW. Hirano et al. (2003), Hitomi et al. (2008) and Prokhorov and Schmidt (2009) addressed the problem for the smooth objective function case.
Even though this issue has been considered by Chen, Hong, and Tarozzi (2008) in a semiparametric context with nonsmooth objective functions, the parametric approach proposed here provides, as a novelty, the conditions under which this puzzle is valid and, furthermore, shows that the two-step estimator is usually dominated by a one-step joint estimation procedure that uses both the weighted moment conditions and the conditions associated with the selection model.

Several papers have been devoted to general theories of estimation in settings where nonsmooth objective functions are allowed, following Daniels (1961) and Huber (1967). Studies that allow for estimation of models based on nonsmooth objective functions include, among others, Pollard (1985), Pakes and Pollard (1989) and Newey and McFadden (1994, section 7). Recent studies have approached the problem of nonsmoothness with a focus on semiparametric models; see Chen, Linton, and Van Keilegom (2003) for a general estimation approach, Chen et al. (2008) for an approach to missing data problems with a nonparametric first stage, and Cattaneo (2010) for the estimation of multi-valued treatment effects in a semiparametric framework.

The remainder of the chapter is organized as follows. Section 1.2 sets up the general GMM framework used in the analysis and presents results regarding efficiency and redundancy of the proposed estimators, as well as estimators for the asymptotic variances of the estimated parameters. Section 1.3 studies the IPW approach to missing data problems proposed by Wooldridge (2002b, 2007), extending its scope to nonsmooth objective functions. Section 1.4 provides examples of the use of the proposed framework: first, a model for the conditional quantile in a context with missing data; second, a simplified IVQR model as proposed by Chernozhukov and Hansen (2005, 2006). Section 1.5 concludes.
1.2 General Estimation Problem

Let ω* ∈ Q* ⊂ R^{dim(ω*)} be a random vector and θ ∈ Θ ⊂ R^P a parameter vector, where Θ is a compact set, and consider the population condition

g_o(θ_o) = E[g(ω*, θ_o)] = 0    (1.1)

where g : Q* × Θ → R^m is a vector of known real-valued moment functions. Newey and McFadden (1994) have shown consistency and asymptotic normality of the generalized method of moments (GMM) estimator that minimizes a squared Euclidean distance of the random sample analogues of the population moments, g_n(θ) = n^{-1} Σ_{i=1}^n g(ω_i*, θ), from their population counterparts (which equal zero). I am interested in the case in which the moment functions, g(·), are allowed to be nonsmooth, so that a wider range of interesting problems can be handled. The GMM estimator minimizes the objective function

g_n(θ)' Ŵ g_n(θ)    (1.2)

where Ŵ converges in probability to W, the appropriate positive semidefinite weighting matrix. Assume ω_i*, i = 1, ..., n, are i.i.d. Two useful results from Newey and McFadden (1994) will be used to derive the asymptotic variance of the estimators. The first concerns the consistency of the GMM estimator.

Theorem 1 (Newey and McFadden, 1994, Theorem 2.6) Let ||·|| denote the Euclidean norm. Suppose that:
(i) ω_i* are i.i.d. for i = 1, 2, ...;
(ii) Ŵ →p W;
(iii) W is positive semidefinite and W E[g(ω*, θ)] = 0 only if θ = θ_o;
(iv) θ_o ∈ Θ ⊂ R^P, and Θ is compact;
(v) g(ω*, θ) is continuous at each θ with probability one;
(vi) E[sup_{θ∈Θ} ||g(ω*, θ)||] < ∞.
Then θ̂ →p θ_o.

This result relies on relatively weak conditions and allows for discontinuities in the objective function. The second theorem establishes the asymptotic normality of the GMM estimator under a certain form of nonsmoothness of the objective function.
Theorem 2 (Newey and McFadden, 1994, Theorem 7.2) Suppose that:
(i) g_n(θ̂)' Ŵ g_n(θ̂) ≤ inf_{θ∈Θ} g_n(θ)' Ŵ g_n(θ) + o_p(n^{-1});
(ii) θ̂ →p θ_o;
(iii) Ŵ →p W, and W is positive semidefinite;
(iv) g_o(θ_o) = 0;
(v) g_o(θ) is differentiable at θ_o with derivative G such that G'WG is nonsingular;
(vi) θ_o is an interior point of Θ;
(vii) √n g_n(θ_o) →d N(0, Σ);
(viii) for any δ_n → 0,

sup_{||θ−θ_o|| ≤ δ_n}  √n ||g_n(θ) − g_n(θ_o) − g_o(θ)|| / [1 + √n ||θ − θ_o||]  →p 0.

Then √n(θ̂ − θ_o) →d N(0, (G'WG)^{-1} G'W Σ W G (G'WG)^{-1}).

As shown by Pollard (1985), the differentiability of the objective function g(ω_i*, θ) can be replaced by the differentiability of g_o(θ) for the purpose of obtaining the asymptotic normality of these estimators. As emphasized by Newey and McFadden (1994), the key condition allowing for nonsmooth objective functions is condition (viii), a "stochastic equicontinuity" assumption that guarantees uniform convergence in probability of the linear approximation of g_o(θ) by g_n(θ) in a shrinking neighborhood of θ_o. This is similar to the stochastic differentiability condition in Pollard (1985), and primitive conditions are available in Pollard (1985), Andrews (1994) and Chen et al. (2003). Those simplify the task of checking its validity for specific moment functions; however, this is beyond the scope of this work and I refer the reader to those papers.

Suppose that θ can be partitioned into subsets of parameters (β', γ')' ∈ B × Γ ⊂ R^{p_1} × R^{p_2} and that g(·) can be partitioned into subsets of functions (g_1(·)', g_2(·)')' as defined below. For notational convenience, ω* is suppressed in the following discussion. Then

E[g_1(β_o, γ_o)] = 0    (1.3)

E[g_2(γ_o)] = 0    (1.4)

where β ∈ B, γ ∈ Γ, and g_1(·) and g_2(·) are m_1- and m_2-vectors of known functions, respectively (m = m_1 + m_2). Note that the second set of moment conditions does not depend on β, while the first set of moment conditions depends on the full parameter vector θ.
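To fix ideas, here is a minimal numerical sketch (illustrative only, not taken from the chapter) of a GMM problem with a nonsmooth moment function: the median moment g(ω, θ) = 1{ω ≤ θ} − 1/2 satisfies the population condition 1.1 at the true median, and the sample criterion 1.2 is piecewise constant in θ, so it is minimized here by a search over the sample points rather than by a derivative-based routine.

```python
import numpy as np

def gmm_objective(theta, omega, g, W):
    """Sample GMM criterion g_n(theta)' W g_n(theta), as in eq. (1.2)."""
    gn = g(omega, theta).mean(axis=0)   # n^{-1} sum_i g(omega_i, theta)
    return gn @ W @ gn

def g_median(omega, theta):
    """Nonsmooth median moment: 1{omega <= theta} - 0.5, a step function in theta."""
    return (omega <= theta).astype(float).reshape(-1, 1) - 0.5

rng = np.random.default_rng(0)
omega = rng.normal(loc=2.0, scale=1.0, size=1000)   # true median = 2.0

# The criterion is piecewise constant, so gradient-based optimizers fail;
# searching over the sample points finds a minimizer.
W = np.eye(1)
candidates = np.sort(omega)
values = [gmm_objective(t, omega, g_median, W) for t in candidates]
theta_hat = candidates[int(np.argmin(values))]
```

With ω ~ N(2, 1), the minimizer recovers the sample median, illustrating that consistency (Theorem 1) does not require a smooth objective.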
Let g_n1(θ) = n^{-1} Σ_{i=1}^n g_1(ω_i*, θ) and g_n2(γ) = n^{-1} Σ_{i=1}^n g_2(ω_i*, γ). The framework developed here is valid for the general case of overidentification, i.e., m_1 ≥ p_1 and m_2 ≥ p_2. This guarantees that γ_o is identified by 1.4 alone and that, for a given γ, β_o can be identified by 1.3 alone; hence, two-step estimation is possible. Let the asymptotic covariance matrix of the moment functions, Σ, be defined as

Σ = V[g(θ_o)] ≡ \begin{pmatrix} C_11 & C_12 \\ C_21 & C_22 \end{pmatrix}

where we assume Σ is finite and nonsingular, so its inverse exists:

Σ^{-1} ≡ \begin{pmatrix} C^11 & C^12 \\ C^21 & C^22 \end{pmatrix} = \begin{pmatrix} C_11^{-1}(I + C_12 E^{-1} C_21 C_11^{-1}) & −C_11^{-1} C_12 E^{-1} \\ −E^{-1} C_21 C_11^{-1} & E^{-1} \end{pmatrix}

for E ≡ C_22 − C_21 C_11^{-1} C_12; since Σ (and Σ^{-1}) is symmetric, C_12 = C_21' and the second equality holds (see White, 1984, p. 80). Define the matrix of derivatives as

G ≡ ∇_θ g_o(θ_o) = ∇_θ E[g(θ_o)] ≡ \begin{pmatrix} G_11 & G_12 \\ 0 & G_22 \end{pmatrix}

with G_11 ≡ ∇_β E[g_1(β_o, γ_o)], G_12 ≡ ∇_γ E[g_1(β_o, γ_o)] and G_22 ≡ ∇_γ E[g_2(γ_o)], where the lower off-diagonal block equals zero since the second set of moment conditions does not depend on β.

Following Prokhorov and Schmidt (2009), define four different possible GMM estimators that differ in which moment conditions are used and/or whether γ is treated as known.

Definition 1 Call the estimator of θ_o that minimizes

g_n(θ)' Ŵ g_n(θ)    (1.5)

with the weighting matrix Ŵ = Σ^{-1} the ONE-STEP estimator.

This is the usual GMM estimator that uses all the available orthogonality conditions jointly to estimate β_o and γ_o.

Definition 2 Call the estimator of β_o that minimizes

g_n1(β, γ_o)' C_11^{-1} g_n1(β, γ_o)    (1.6)

with γ_o treated as known the KNOW-γ estimator.

This estimator ignores the second set of orthogonality conditions 1.4, treating γ_o as a known vector of parameters and estimating β_o using only the information available in the first set of moment conditions.
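A sketch of the ONE-STEP estimator of Definition 1 on a hypothetical toy model (all variable names and moments are illustrative, not from the chapter): g_1 contains two moments involving both β and γ, g_2 identifies the nuisance γ alone, and the stacked moments are weighted by an estimate of Σ^{-1} computed from a preliminary first-round fit, as in 1.5.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 2000
w = rng.normal(1.0, 1.0, n)                          # gamma_o = E[w] = 1
x = rng.normal(0.0, 1.0, n)
y = 0.5 * x + (w - 1.0) + rng.normal(0.0, 0.3, n)    # beta_o = 0.5

def g_i(theta):
    """Stacked moment contributions g(omega_i, theta), one row per observation."""
    beta, gamma = theta
    resid = y - beta * x - (w - gamma)
    return np.column_stack([resid, resid * x,        # g1(beta, gamma)
                            w - gamma])              # g2(gamma)

def criterion(theta, W):
    gn = g_i(theta).mean(axis=0)                     # sample analogue, eq. (1.5)
    return gn @ W @ gn

# First round with identity weighting; second round with W = Sigma_hat^{-1}.
theta1 = minimize(criterion, x0=[0.0, 0.0], args=(np.eye(3),)).x
gmat = g_i(theta1)
Sigma_hat = gmat.T @ gmat / n
theta_hat = minimize(criterion, x0=theta1, args=(np.linalg.inv(Sigma_hat),)).x
beta_hat, gamma_hat = theta_hat
```

The same ingredients give the sequential TWO-STEP estimator of Definition 4: minimize the g_2 block alone for γ̂, then the g_1 block with γ̂ plugged in.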
This could arise if one has information about the true values of γ_o, or if the researcher disregards the fact that γ_o was estimated in a first stage, and hence its variability, in what could be called a "naive" estimator.

Definition 3 Call the estimator of β_o that minimizes

g_n(β, γ_o)' Σ^{-1} g_n(β, γ_o)    (1.7)

with γ_o treated as known the KNOW-γ-JOINT estimator.

This is the GMM estimator for β_o in the form considered by Qian and Schmidt (1999). In this case, one has information about the true values of γ_o but still uses both sets of moment conditions in obtaining an estimate of β_o.

Definition 4 Call the estimator of θ_o obtained in the following fashion the TWO-STEP estimator:

(i) the estimator γ̂ is obtained by minimizing

g_n2(γ)' C_22^{-1} g_n2(γ)    (1.8)

(ii) the estimator β̂ is obtained by minimizing

g_n1(β, γ̂)' C_11^{-1} g_n1(β, γ̂)    (1.9)

with γ̂ treated as given.

This is the sequential estimator that uses only the second set of moment conditions 1.4 to obtain a consistent estimator of the unknown parameter vector γ_o, and then uses only the first set of moment conditions 1.3 to obtain the estimator of β_o. This estimator is widely used in the applied economics literature and encompasses several common problems.

The estimators defined above depend on a known Σ. In practice, Σ is not known and has to be replaced by an initial consistent estimate. To compare the properties of these different estimators we need to obtain their asymptotic variances. These follow directly from Theorem 2.

Theorem 3 Let V_ONE-STEP, V_KNOW-γ, V_KNOW-γ-JOINT and V_TWO-STEP denote the asymptotic variances of ONE-STEP, KNOW-γ, KNOW-γ-JOINT and TWO-STEP, respectively. Then, under the conditions described in Theorems 1 and 2,
V_ONE-STEP = (G' Σ^{-1} G)^{-1}    (1.10)

V_KNOW-γ = (G_11' C_11^{-1} G_11)^{-1}    (1.11)

V_KNOW-γ-JOINT = (G_11' C^11 G_11)^{-1}    (1.12)

V_TWO-STEP = B Σ B'    (1.13)

where

B = \begin{pmatrix} B_11 & B_12 \\ 0 & B_22 \end{pmatrix}

with

B_11 = −(G_11' C_11^{-1} G_11)^{-1} G_11' C_11^{-1}

B_12 = (G_11' C_11^{-1} G_11)^{-1} G_11' C_11^{-1} G_12 (G_22' C_22^{-1} G_22)^{-1} G_22' C_22^{-1}

B_22 = −(G_22' C_22^{-1} G_22)^{-1} G_22' C_22^{-1}

Proof. All proofs are provided in the appendix.

It is possible to analyze the relative asymptotic efficiency of these estimators.¹

Theorem 4 For the estimators defined above as ONE-STEP, KNOW-γ, KNOW-γ-JOINT and TWO-STEP, with asymptotic variances given by 1.10, 1.11, 1.12 and 1.13, respectively, the following statements hold:

1. KNOW-γ-JOINT is no less efficient than ONE-STEP, KNOW-γ and TWO-STEP for β_o.
2. If C_12 = 0, then KNOW-γ-JOINT and KNOW-γ are equally efficient for β_o.
3. If G_12 = 0, then TWO-STEP and KNOW-γ are equally efficient for β_o.
4. If C_12 = 0 and G_12 = 0, then ONE-STEP, KNOW-γ, KNOW-γ-JOINT and TWO-STEP are equally efficient for β_o, and ONE-STEP and TWO-STEP are equally efficient for γ_o.
5. ONE-STEP is no less efficient than TWO-STEP.
6. If m_1 = p_1, then the ONE-STEP and TWO-STEP estimates of γ_o are equal.
7. If m_1 = p_1 and m_2 = p_2, then the ONE-STEP and TWO-STEP estimates are equal for both β_o and γ_o.
8. If m_1 = p_1 and C_12 = 0, then the ONE-STEP and TWO-STEP estimates are equally efficient for both β_o and γ_o.
9. If G_12 = C_12 C_22^{-1} G_22, then KNOW-γ-JOINT and ONE-STEP are equally efficient for β_o.
10. If G_12 = C_12 C_22^{-1} G_22, then ONE-STEP, KNOW-γ-JOINT and TWO-STEP are no less efficient for β_o than KNOW-γ.

¹ I denote the asymptotic variance of θ̂ as V, meaning that √n(θ̂ − θ_o) converges in distribution to N(0, V).

The statements that form Theorem 4 are direct extensions of Prokhorov and Schmidt (2009) to the case in which nonsmooth objective functions are allowed. Statement 1 shows, as expected, that KNOW-γ-JOINT dominates the other estimators.
This is an intuitive result, since the known value of γ_o is at least as efficient as any estimate of γ_o, and KNOW-γ-JOINT uses the full set of relevant moment conditions. Statement 2 is the result of Qian and Schmidt (1999), where it is shown that using additional moment conditions that include no unknown parameters (as is the case for KNOW-γ-JOINT) improves efficiency except in the special case in which C_12 = 0. In other words, the second set of moments is redundant in the estimation of β_o; Prokhorov and Schmidt (2009) call this M-redundancy.

Statement 3 gives the condition under which the first-stage estimation of the nuisance parameter γ_o does not affect the asymptotic behavior of the second-stage estimate of β_o. This result is similar to the one shown in Wooldridge (2002a); however, in this case we are dealing with a nonsmooth objective function and, therefore, the restriction G_12 ≡ ∇_γ E[g_1(β_o, γ_o)] = 0 differs from the one proposed by Wooldridge, since the derivative of g_1(β_o, γ_o) is not necessarily available.

Statement 4 provides conditions under which the ONE-STEP, KNOW-γ, KNOW-γ-JOINT and TWO-STEP estimators are equally efficient for β_o; in that case, the use of the additional moment conditions in 1.4 by the ONE-STEP, KNOW-γ-JOINT and TWO-STEP estimators does not improve the precision of the estimated parameters of interest, as in the previous statement, and knowledge of γ_o does not help in estimating β_o. This holds if the two sets of moment conditions are asymptotically uncorrelated (C_12 = 0) and γ is not present in the first set of moment conditions (G_12 = 0).

Statement 5 is the usual result that, in general, sequential estimation procedures are less efficient than joint (one-step) estimation. Statements 6, 7 and 8 follow directly from Ahn and Schmidt (1995) and show that GMM separability holds in the framework that allows nonsmooth objective functions.
The GMM estimates of γ_o are not improved by the inclusion of an equal number of additional moment conditions and parameters. It can be shown that if G_11 is nonsingular, the ONE-STEP estimator for β_o can be written in terms of the ONE-STEP estimator of γ_o using the equation g_n1(β, γ) = C_12 C_22^{-1} g_n2(γ) (see the appendix for details). Thus, as described by Prokhorov and Schmidt (2009), the ONE-STEP and TWO-STEP estimators for β_o will be derived from the same equation as long as g_n2(γ̂) = 0, which will be true under exact identification of γ_o, and will be asymptotically equally efficient if C_12 = 0, since the moment conditions will then be asymptotically uncorrelated, adding nothing to the information set exploited by ONE-STEP relative to TWO-STEP.

Statements 9 and 10 are direct extensions of Prokhorov and Schmidt (2009). Statement 9 says that KNOW-γ-JOINT and ONE-STEP are equally efficient for the estimation of β_o, which means that knowledge of γ_o is not useful in terms of the efficiency of the estimates of β_o if we are using the full set of moment conditions and G_12 = C_12 C_22^{-1} G_22. Statement 10 shows that, under the same condition on G_12, KNOW-γ is dominated by ONE-STEP, KNOW-γ-JOINT and TWO-STEP. This happens because knowledge of γ_o is not useful in the estimation of β_o in this case, and the KNOW-γ estimator does not use the information in the second set of moment conditions, which is useful unless C_12 = 0.

The statements presented in Theorem 4 show that the GMM redundancy results presented by Prokhorov and Schmidt (2009) extend to GMM estimation procedures based on nonsmooth objective functions. Under the conditions of parts 9 and 10 of Theorem 4, the following corollary can be obtained.
Corollary 1 If G_12 = C_12 C_22^{-1} G_22 and G_22 is invertible, then

V(β̂_TWO-STEP) = (G_11' C_11^{-1} G_11)^{-1} G_11' C_11^{-1} D_o C_11^{-1} G_11 (G_11' C_11^{-1} G_11)^{-1}    (1.14)

Additionally, if G_11 is invertible, then

V(β̂_TWO-STEP) = G_11^{-1} D_o (G_11')^{-1}    (1.15)

where

D_o = E[e_i e_i'],   e_i = g_1(ω_i*, θ_o) − C_12 C_22^{-1} g_2(ω_i*, γ_o)

Note that e_i is the residual of the linear projection of the first set of moment conditions on the second set of moment conditions. This result is useful in the estimation of the asymptotic variance of the estimators, as I discuss below. Unfortunately, it applies only if the second set of moment conditions is exactly identified, for formula 1.14, and if both sets of moment conditions are exactly identified, for formula 1.15.

A remaining issue is obtaining estimates of the variance matrices described in Theorem 3. The nonsmoothness of the objective function creates some obstacles for the usual estimation procedures. As described by Lee (2008), the fact that the estimates of the variances depend on the derivative of the expectation of the estimating function in the nonsmooth case warrants a more careful approach to estimating the variances used for inference.

A general approach that works in most cases is offered in Newey and McFadden (1994) and consists of obtaining consistent estimators for the separate components of the variance matrix. For estimating Σ or its relevant components, a standard estimator is available. This procedure can be used in a first step to obtain consistent estimates of the appropriate weighting matrix for the desired estimation procedure.
Σ̂ = n^{-1} Σ_{i=1}^n g(ω_i*, θ̂) g(ω_i*, θ̂)'    (1.16)

Ĉ_11 = n^{-1} Σ_{i=1}^n g_1(ω_i*, θ̂) g_1(ω_i*, θ̂)'    (1.17)

Ĉ_12 = n^{-1} Σ_{i=1}^n g_1(ω_i*, θ̂) g_2(ω_i*, γ̂)'    (1.18)

Ĉ_22 = n^{-1} Σ_{i=1}^n g_2(ω_i*, γ̂) g_2(ω_i*, γ̂)'    (1.19)

Ĉ_21 = Ĉ_12'    (1.20)

To be able to plug these estimates into the equations derived in Theorem 3, we need to obtain estimates of G, which can be difficult due to the nonsmoothness of the objective function. In this approach, an estimate of G is obtained by numerical derivatives. Following Newey and McFadden (1994), let e_j denote the jth unit vector and ε_n a small positive constant that depends on the sample size. Define the estimators of G and its components, column by column, as

Ĝ_j = (2ε_n)^{-1} n^{-1} Σ_{i=1}^n [g(ω_i*, θ̂ + e_j ε_n) − g(ω_i*, θ̂ − e_j ε_n)]

Ĝ_11,j = (2ε_n)^{-1} n^{-1} Σ_{i=1}^n [g_1(ω_i*, β̂ + e_j ε_n, γ̂) − g_1(ω_i*, β̂ − e_j ε_n, γ̂)]

Ĝ_12,j = (2ε_n)^{-1} n^{-1} Σ_{i=1}^n [g_1(ω_i*, β̂, γ̂ + e_j ε_n) − g_1(ω_i*, β̂, γ̂ − e_j ε_n)]

Ĝ_22,j = (2ε_n)^{-1} n^{-1} Σ_{i=1}^n [g_2(ω_i*, γ̂ + e_j ε_n) − g_2(ω_i*, γ̂ − e_j ε_n)]

where the subscript j denotes the jth column of the matrix being estimated. Newey and McFadden (1994, Theorem 7.4) show that if ε_n converges to zero and √n ε_n converges to infinity as n gets larger, these estimators are consistent for the terms of the variances presented in Theorem 3, and plugging them into the formulas provides consistent estimators for the variances of the parameters being estimated.

However, these estimators are cumbersome and not practical. As emphasized by Newey and McFadden, the choice of ε_n is a difficult problem, and the formulation described above, using a single value of ε_n, would be adequate only if the estimated parameters had been scaled to have similar magnitudes. If that is not done, different ε_n would have to be picked for different components. In specific cases, other estimators are available.
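The numerical-derivative estimator Ĝ_j above can be sketched as follows (a hypothetical implementation; the moment function and the value of ε_n are illustrative). With the nonsmooth median moment, the derivative of the expected moment at θ_o is the density of ω there, which the central difference recovers for a suitable ε_n.

```python
import numpy as np

def numerical_G(g, omega, theta_hat, eps):
    """Central-difference estimate of G, one column per parameter, as in
    Newey and McFadden (1994): perturb theta_hat by +/- eps along each
    unit vector e_j and average the moment contributions over the sample."""
    p = len(theta_hat)
    m = g(omega, theta_hat).shape[1]
    G_hat = np.zeros((m, p))
    for j in range(p):
        e_j = np.zeros(p)
        e_j[j] = 1.0
        g_plus = g(omega, theta_hat + eps * e_j).mean(axis=0)
        g_minus = g(omega, theta_hat - eps * e_j).mean(axis=0)
        G_hat[:, j] = (g_plus - g_minus) / (2.0 * eps)
    return G_hat

# Nonsmooth median moment: E[g] is differentiable in theta (derivative =
# density of omega) even though g itself is a step function in theta.
def g_med(omega, theta):
    return (omega <= theta[0]).astype(float).reshape(-1, 1) - 0.5

rng = np.random.default_rng(2)
omega = rng.normal(size=100_000)          # density at the median: 0.3989
G_hat = numerical_G(g_med, omega, np.array([np.median(omega)]), eps=0.3)
```

The value eps=0.3 is ad hoc, echoing the text's caveat that selecting ε_n is the difficult part of this approach.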
As discussed in Newey and McFadden (1994), if $g(\omega^*, \theta)$ is differentiable with probability one, with $\nabla_\theta\, g(\omega^*, \theta)$ continuous at $\theta_o$ with probability one and dominated by an integrable function in a neighborhood of $\theta_o$, then $\hat G = n^{-1}\sum_{i=1}^n \nabla_\theta\, g(\omega_i^*, \hat\theta)$ is a consistent estimator for $G$. Hence the more standard estimator is available and easier to implement. Clearly, alternatives may be available for specific moment conditions. Section 1.4 provides the example of the leading case of IPW for linear quantile regression. Even in this case, calculating the matrix $B$ that appears in the asymptotic variance of the TWO-STEP estimator can be cumbersome. For the cases in which the condition from parts 9 and 10 of Theorem 4 holds, namely $G_{12} = C_{12}C_{22}^{-1}G_{22}$, Corollary 1 offers a different approach to estimating the asymptotic variance (even though we still need to resort to one of the estimators above to obtain $G_{11}$). We can estimate the matrix $E[e_i e_i']$ by regressing the first set of moment conditions on the second set in the sample to obtain the residuals
$$\hat e_i = g_1(\omega_i^*, \hat\beta, \hat\gamma) - \left[n^{-1}\sum_{i=1}^n g_1(\omega_i^*, \hat\beta, \hat\gamma)\,g_2(\omega_i^*, \hat\gamma)'\right]\left[n^{-1}\sum_{i=1}^n g_2(\omega_i^*, \hat\gamma)\,g_2(\omega_i^*, \hat\gamma)'\right]^{-1} g_2(\omega_i^*, \hat\gamma)$$
and then computing the sample analogue of the desired matrix, $\hat D = n^{-1}\sum_{i=1}^n \hat e_i \hat e_i'$. Unfortunately, this simple procedure is valid only for the asymptotic variance of the TWO-STEP estimator under the condition above and under exact identification of at least the second set of moment conditions. For most relevant problems, a bootstrap procedure could be used to obtain consistent estimates of the variance of $\hat\theta$ directly, but this can be computationally demanding for models in which the solution of the optimization problem for both sets of moment conditions requires numerical optimization of the objective function.
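The projection-residual estimator of $D_o$ can be sketched as follows (an illustration on simulated moment vectors, not the dissertation's code; `projection_residual_variance` is a hypothetical name), assuming NumPy. The residual is exactly the sample linear projection residual in the formula above.

```python
import numpy as np

def projection_residual_variance(g1, g2):
    """Estimate D_o = E[e_i e_i'], where e_i is the residual from the sample
    linear projection of the first moment vector g1_i on the second, g2_i.
    g1: (n, m1) array; g2: (n, m2) array."""
    n = g1.shape[0]
    C12 = g1.T @ g2 / n                          # n^{-1} sum g1_i g2_i'
    C22 = g2.T @ g2 / n                          # n^{-1} sum g2_i g2_i'
    e = g1 - g2 @ np.linalg.solve(C22, C12.T)    # e_i = g1_i - C12 C22^{-1} g2_i
    return e.T @ e / n                           # D-hat = n^{-1} sum e_i e_i'

# Simulated moments: g1 is correlated with g2 plus unit-variance noise,
# so the residual variance should be close to 1.
rng = np.random.default_rng(1)
g2 = rng.normal(size=(1000, 2))
noise = rng.normal(size=(1000, 1))
g1 = g2 @ np.array([[1.0], [0.5]]) + noise
D_hat = projection_residual_variance(g1, g2)
```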
1.3 Estimation with missing data

This section specializes the results of Section 1.2 to a model in which missing data are allowed, in a framework that extends the one proposed by Wooldridge (2002b, 2007) to nonsmooth objective functions. Consider a random vector $\omega \in Q \subset \mathbb{R}^{\dim(\omega)}$ with density $f(\omega)$ and a parameter vector $\beta \in B \subset \mathbb{R}^{p_1}$, where $B$ is a compact set. Suppose the population moment condition
$$g_o(\beta_o) = E[g(\omega, \beta_o)] = 0 \qquad (1.21)$$
holds, where $g: Q \times B \to \mathbb{R}^{m_1}$ is a vector of known real-valued moment functions with $m_1 \ge p_1$, so $\beta_o$ may be overidentified. Assume $\beta_o$ is the unique solution to 1.21. I am interested in estimating $\beta_o$. Note that the moment conditions presented above hold in the unselected population. Assume nonrandom sampling occurs, characterized by a selection indicator $s \in \{0, 1\}$ such that $\omega_i$ is observed if and only if $s_i = 1$. Keep in mind that all or part of $\omega_i$ is not observed when $s_i = 0$. The GMM estimator based on 1.21 using the selected sample in effect makes the empirical moments $n^{-1}\sum_{i=1}^n s_i\, g(\omega_i, \beta)$ close to zero. These empirical moments are the sample analogues of population moments of the form
$$E[s\, g(\omega, \beta)] = 0 \qquad (1.22)$$
which are referred to as the unweighted selected population moments (Prokhorov and Schmidt, 2009; Wooldridge, 2002b). The name emphasizes that they are evaluated in the selected rather than the full population of interest and distinguishes them from the weighted selected population moments defined below. The selectivity problem occurs exactly because 1.22 may not hold; in other words, the value $\beta_o$ that solves 1.21 may not also solve 1.22 (Prokhorov and Schmidt, 2009). If that happens, the estimate of $\beta_o$ obtained through this procedure is not generally consistent. In fact, its consistency, and potential solutions to the data selection problem, depend on the relationship between the selection process and both the dependent and independent variables.
1.3.1 Data Selection under Ignorability

A straightforward solution to the nonrandom sampling problem is inverse probability weighting (IPW), as shown by Wooldridge (2002b, 2007). To use IPW we need some variables that are reasonable predictors of selection, as described in Wooldridge (2007). This is formally stated as an "ignorability" of selection assumption.

Assumption 1 (Wooldridge, 2007, Assumption 3.1) (i) $\omega_i$ is observed whenever $s_i = 1$; (ii) for a random vector $z_i$, $P(s_i = 1 \mid \omega_i, z_i) = P(s_i = 1 \mid z_i) \equiv p(z_i)$; (iii) for all $z \in Z \subset \mathbb{R}^J$, $p(z) > 0$; (iv) $z_i$ is observed whenever $s_i = 1$.

Item (ii) of this assumption requires that $s \perp \omega \mid z$; in other words, selection must be independent of $y$ and $x$ conditional on $z$. As discussed at length by Wooldridge (2007), Assumption 1 encompasses a variety of selection schemes common in the missing data literature, including "missing at random", "variable probability sampling", "selection on observables", etc. It allows, for example, the probability of observing $\omega_i$ to depend on the stratum into which $\omega_i$ falls; or $z_i$ to be observed only along with $\omega_i$; or partial information to be known about the incompletely observed data. Assumption 1 does not apply to the "selection on unobservables"2 case as generally used in econometrics. I will not explore these possibilities directly here, referring the reader to Wooldridge (2007). Assume that a conditional density determining selection is correctly specified and that a maximum likelihood estimator of the selection model is available.
Assumption 2 (Wooldridge, 2007, Assumption 3.2) (i) $G(z, \gamma)$ is a parametric model for $p(z)$, where $\gamma \in \Gamma \subset \mathbb{R}^{p_2}$ and $G(z, \gamma) > 0$ for all $z \in Z$ and $\gamma \in \Gamma$; (ii) there exists $\gamma_o$ in the interior of $\Gamma$ such that $p(z) = G(z, \gamma_o)$; (iii) for a random vector $v_i$ such that $D(v_i \mid \omega_i, z_i) = D(v_i \mid z_i)$, the estimator $\hat\gamma$ solves a conditional maximum likelihood problem of the form
$$\max_{\gamma \in \Gamma}\; \sum_{i=1}^n \ln\left[f(v_i \mid z_i, \gamma)\right] \qquad (1.23)$$
where $f(v \mid z, \gamma) > 0$ is a conditional density function known up to the parameters $\gamma_o$, and $s_i = h(v_i, z_i)$ for some nonstochastic function $h(\cdot, \cdot)$; (iv) the solution to 1.23 has the first-order representation
$$\sqrt{n}(\hat\gamma - \gamma_o) = \left[E\, d_i(\gamma_o)\, d_i(\gamma_o)'\right]^{-1} n^{-1/2}\sum_{i=1}^n d_i(\gamma_o) + o_p(1)$$
with $d_i(\gamma) \equiv \nabla_\gamma f(v_i \mid z_i, \gamma)/f(v_i \mid z_i, \gamma)$, the $p_2 \times 1$ score vector for the MLE.

The assumption above requires standard regularity conditions on $G(z, \gamma)$, including smoothness of the parametric model. Even though this restricts how the selection process can be modeled, it includes the probability models most commonly used in the literature. By imposing it, we concentrate on the impact of nonsmoothness in the model of interest and provide results on the use of IPW to correct sample selection in those cases. Assumption 2 covers the cases presented by Wooldridge (2002b), in which the conditional log-likelihood was for a binary response model. The advantage of using this slightly more complicated framework is that it allows $z_i$ to be only partially observed and permits $s_i$ to be a function of another random variable $v_i$, which accommodates a broader class of selection problems. For a deeper discussion of the extensions allowed by Assumption 2, see Wooldridge (2007). Note that the MLE for $\gamma_o$ described above can be obtained in a GMM setting as follows.

2 A quantile regression estimator for the case where selection is on unobservables is provided by Buchinsky (1998).
Let $\hat\gamma$ be the maximum likelihood estimator (MLE) of $\gamma_o$; that is, $\hat\gamma$ solves
$$\max_{\gamma \in \Gamma}\; \sum_{i=1}^n \ln\left[f(v_i \mid z_i, \gamma)\right].$$
Define $g_2(z, \gamma, s) \equiv d(\gamma) = \nabla_\gamma f(v \mid z, \gamma)/f(v \mid z, \gamma)$ and $g_{n2}(\gamma) \equiv n^{-1}\sum_{i=1}^n g_2(z_i, \gamma, s_i)$. Hence $g_{n2}(\gamma) \overset{p}{\to} g_{2o}(\gamma) \equiv E[g_2(z, \gamma, s)]$. Then the problem above is characterized by the first order conditions
$$n^{-1}\sum_{i=1}^n g_2(z_i, \gamma, s_i) = n^{-1}\sum_{i=1}^n \frac{\nabla_\gamma f(v_i \mid z_i, \gamma)}{f(v_i \mid z_i, \gamma)} = n^{-1}\sum_{i=1}^n d_i(\gamma) = o_p(n^{-1/2})$$
and
$$E[g_2(z, \gamma_o, s)] = E[d(\gamma_o)] = 0.$$
Under Assumption 1, the following lemma, presented in Wooldridge (2002b), is valid.

Lemma 1 (Wooldridge, 2002b, Lemma 3.1) Under the conditions presented in Assumptions 1 and 2, for any real-valued function $g(\omega)$ such that $E\left[|g(\omega, \beta_o)|/G(z, \gamma_o)\right] < \infty$,
$$E\left[\frac{s}{G(z, \gamma_o)}\, g(\omega, \beta_o)\right] = E\left[\frac{s}{p(z)}\, g(\omega, \beta_o)\right] = E[g(\omega, \beta_o)] \qquad (1.24)$$

Lemma 1 suggests using the sampling probabilities to consistently estimate $\beta_o$. Consider the weighted selected population moments, which weight 1.22 by the inverse of the selection probability:
$$E\left[\frac{s}{G(z, \gamma_o)}\, g(\omega, \beta_o)\right] = 0 \qquad (1.25)$$
Given an estimator $\hat\gamma$ for $\gamma_o$, we can form $G(z_i, \hat\gamma)$ for all $i$ with $s_i = 1$ and obtain consistent estimates of $\beta_o$ using the weighted selected population moments 1.25, as described in Wooldridge (2007). Note that, by the law of large numbers, the law of iterated expectations, Assumptions 1 and 2, and consistency of $\hat\gamma$ for $\gamma_o$ (see Wooldridge, 2002b, Theorem 3.1),
$$n^{-1}\sum_{i=1}^n \frac{s_i}{G(z_i, \hat\gamma)}\, g(\omega_i, \beta) \;\overset{p}{\to}\; E\left[\frac{s}{p(z)}\, g(\omega, \beta)\right] = E\left[\frac{E[s \mid \omega, z]}{p(z)}\, g(\omega, \beta)\right] = E\left[\frac{p(z)}{p(z)}\, g(\omega, \beta)\right] = E[g(\omega, \beta)] = g_o(\beta).$$
Therefore,
$$n^{-1}\sum_{i=1}^n \frac{s_i}{G(z_i, \hat\gamma)}\, g(\omega_i, \beta_o) \;\overset{p}{\to}\; g_o(\beta_o) = 0.$$
Hence, this provides a set of valid moment conditions that can be used to estimate $\beta_o$.
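A quick simulation illustrates Lemma 1 and the consistency argument above in the simplest setting, a population mean with selection driven by an observed $z$. This is an illustrative Python sketch assuming NumPy; the data-generating process is invented for the example. The unweighted selected mean is biased upward, while the inverse-probability-weighted moment recovers $E[y] = 1$.

```python
import numpy as np

# Minimal simulation of Lemma 1: weighting the selected sample by the inverse
# of the (known) selection probability recovers a full-population moment.
rng = np.random.default_rng(2)
n = 200_000
z = rng.uniform(size=n)
y = 2.0 * z + rng.normal(size=n)          # E[y] = 1
p = 0.2 + 0.6 * z                         # p(z) = P(s = 1 | z), bounded away from 0
s = rng.uniform(size=n) < p               # selection depends on z only

naive = y[s].mean()                       # unweighted selected mean: biased (about 1.2)
ipw = np.mean(s * y / p)                  # sample analogue of E[s y / p(z)] = E[y]
```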
Efficiency Comparisons

The relative efficiency of estimators for $\beta_o$ that use IPW to correct a missing data problem under Assumptions 1 and 2 can be analyzed within the framework developed in Section 1.2. Consider the two sets of moment functions
$$g_1(\omega, z, \beta, \gamma, s) = \frac{s}{G(z, \gamma)}\, g(\omega, \beta), \qquad g_2(z, \gamma, s) \equiv d(\gamma) = \frac{\nabla_\gamma f(v \mid z, \gamma)}{f(v \mid z, \gamma)},$$
for which the following moment conditions are valid:
$$E[g_1(\omega, z, \beta_o, \gamma_o, s)] = 0 \qquad (1.26)$$
$$E[g_2(z, \gamma_o, s)] = 0 \qquad (1.27)$$
Any of the estimators discussed in Section 1.2 can be used; they differ in the set of moment conditions used and in the knowledge about the weights. Under the assumptions on the moment conditions and the selection process discussed in this section, the following lemma holds.

Lemma 2 If the conditions of Theorems 1 and 2 and Assumptions 1 and 2 hold, and the moment conditions are defined by 1.26 and 1.27, then $G_{12} = C_{12}C_{22}^{-1}G_{22}$.

Given this result, under these assumptions the results of Theorem 4 apply directly to this specific case, and the ONE-STEP, TWO-STEP and KNOW-γ-JOINT estimators will be no less efficient than KNOW-γ.

Theorem 5 Under the conditions of Lemma 2, ONE-STEP, KNOW-γ-JOINT and TWO-STEP are no less efficient for $\beta_o$ than KNOW-γ. Furthermore, ONE-STEP and KNOW-γ-JOINT are equally efficient for $\beta_o$.

Hence, unless $C_{12} = 0$ (in which case the four estimators would be equally efficient), using ONE-STEP or TWO-STEP, which estimate $\gamma_o$ by MLE, produces more efficient estimates of $\beta_o$ than using known weights (if we knew them) in the KNOW-γ estimator. The KNOW-γ-JOINT estimator is as efficient as ONE-STEP as well, indicating that knowledge of $\gamma_o$ is not useful for the efficiency of the estimates of $\beta_o$. The efficiency gains relative to KNOW-γ come from using the information in the second set of moment conditions.
Therefore, the puzzle described in Wooldridge (2002b, 2007), that KNOW-γ is inefficient relative to TWO-STEP, extends to a larger set of estimators in which the original set of unweighted moment conditions is nonsmooth, as discussed by Chen et al. (2008) and Hitomi et al. (2008). In these cases we are better off estimating the weights by conditional MLE than knowing them. Nonetheless, the TWO-STEP estimator is dominated by both ONE-STEP and KNOW-γ-JOINT, and those should be used to obtain relatively efficient estimates of $\beta_o$. It is important to note that the framework developed in this chapter does not extend directly to semiparametric cases in which the probability of selection is estimated nonparametrically. That can be a serious inconvenience when we have limited information about the selection process and would benefit from a more flexible estimator of those probabilities. However, as shown in Section 1.3.2, we can obtain consistent estimates of $\beta_o$ even when using misspecified selection probabilities, as long as the data selection is exogenous.

1.3.2 Data Selection under Exogeneity of Selection

The sample selection literature has long established that sample selection does not necessarily cause bias in unweighted estimators. As shown in Wooldridge (2007), if selection is exogenous conditional on the vector of covariates $x$, the estimators of interest using the unweighted moment conditions are consistent and, in fact, more efficient (Prokhorov and Schmidt, 2009) than the weighted estimators. Following Wooldridge (2007), I analyze the properties of the estimators obtained under exogenous selection but with potential misspecification of the selection model. The main consistency results for the parameters of interest shown in Wooldridge (2007) and Prokhorov and Schmidt (2009) are unaltered by allowing the moment conditions to be nonsmooth, as summarized below.
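The puzzle can be seen in a Monte Carlo exercise in the most elementary case, data missing completely at random with a constant selection probability (a sketch under invented parameters, not a result from the chapter). With constant $p$, the IPW estimator that plugs in the estimated probability reduces to the selected-sample mean, and it has strictly smaller variance than the IPW estimator that uses the true $p$.

```python
import numpy as np

# Monte Carlo illustration of the weighting "puzzle" in the simplest case:
# with data missing completely at random (p(z) = p constant), the IPW mean
# using the *estimated* probability is just the selected-sample mean and
# has smaller variance than the IPW mean using the *true* p.
rng = np.random.default_rng(7)
p, n, reps = 0.5, 500, 4000
known, estimated = [], []
for _ in range(reps):
    y = rng.normal(loc=1.0, size=n)           # E[y] = 1
    s = rng.uniform(size=n) < p               # missing completely at random
    known.append(np.mean(s * y / p))          # KNOW-gamma: true weights
    estimated.append(y[s].mean())             # estimated p: selected-sample mean
var_known = np.var(known)
var_est = np.var(estimated)
```

Both estimators are unbiased for 1, but across replications `var_est` comes out clearly below `var_known`, matching the efficiency ranking in Theorem 5.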
Consider a potentially misspecified model for the probability of selection, $G(z, \gamma^*)$, which is not necessarily equal to the true $p(z_i)$. Assume that the estimate $\hat\gamma$ obtained from that model is consistent for some parameter vector $\gamma^*$ with $\sqrt{n}(\hat\gamma - \gamma^*) = O_p(1)$. In this case, the weighted moment condition satisfies
$$n^{-1}\sum_{i=1}^n \frac{s_i}{G(z_i, \hat\gamma)}\, g(\omega_i, \beta) \;\overset{p}{\to}\; E\left[\frac{s}{G(z, \gamma^*)}\, g(\omega, \beta)\right] \qquad (1.28)$$
instead of converging to $E[g(\omega, \beta)]$, as in Section 1.3.1. Assume that the selection process is exogenous conditional on $z$.

Assumption 3 (Wooldridge, 2007, Assumption 4.1) (i) $\omega_i$ is observed whenever $s_i = 1$; (ii) for a random vector $z_i$, $P(s_i = 1 \mid \omega_i, z_i) = P(s_i = 1 \mid z_i) \equiv p(z_i)$; (iii) $z_i$ is observed whenever $s_i = 1$; (iv) $\beta_o \in B$ solves $E[g(\omega, \beta) \mid z] = 0$ for all $z \in Z$.

This assumption is the same as in Prokhorov and Schmidt (2009), and as shown in their Lemma 4.1 and Theorem 4.1 (p. 53), which are not altered by the use of nonsmooth objective functions, it implies
$$E[g(\omega, \beta_o) \mid z, s] = 0.$$
Hence any function of $z$ and $s$ is uncorrelated with $g(\omega, \beta_o)$, and both weighted and unweighted moment conditions hold in the selected sample for any weighting (that is a function of $z$ and $s$) we might use. Therefore, the weighted moment condition in equation 1.28 holds in the selected sample for any misspecified model $G(z, \gamma^*)$. Obviously, this holds for the unweighted moment conditions as well, since they correspond to the special case $G(z, \gamma^*) = 1$. We conclude that under exogeneity of selection, the proposed IPW estimator for $\beta_o$ is consistent regardless of misspecification of the model for the probability of selection3. This robustness is an important feature of the IPW procedure and adds to its usefulness in applications.

3 This conclusion is equivalent to Theorem 4.1 in Wooldridge (2007), extending it to nonsmooth objective functions.
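This robustness is easy to see numerically (an illustrative sketch with an invented design, assuming NumPy; not the dissertation's code): selection depends only on $x$, and a deliberately wrong weighting function of $x$ leaves the slope estimator consistent.

```python
import numpy as np

# Sketch of Section 1.3.2: when selection depends only on x and
# E[g(w, beta) | x] = 0, any weighting that is a function of x (even a
# misspecified "probability" model) leaves the estimator consistent.
rng = np.random.default_rng(4)
n = 100_000
x = rng.uniform(1.0, 2.0, size=n)
y = 2.0 * x + rng.normal(size=n)            # E[y | x] = 2x, so beta_o = 2
s = rng.uniform(size=n) < (0.3 + 0.3 * x)   # exogenous selection: depends on x only

w_wrong = 1.0 / (0.5 + 0.1 * x[s])          # deliberately misspecified weights
xs, ys = x[s], y[s]
# Weighted through-origin least squares on the selected sample
beta_hat = np.sum(w_wrong * xs * ys) / np.sum(w_wrong * xs * xs)
```

Despite the wrong weights and the selected sample, `beta_hat` converges to the true slope of 2, because the moment condition holds conditional on $x$ and the weights are a function of $x$ alone.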
1.4 Examples

1.4.1 Quantile Regression under Ignorability of Selection

Quantile regression is one of the main motivations for this research. As an example of the use of the results presented here, suppose I am interested in estimating the conditional quantile function (CQF) of a random variable $Y$ conditional on a vector of explanatory variables $X$. This is defined by
$$Q_\tau(Y \mid X) = \inf\{y : F_Y(y \mid X) \ge \tau\}$$
where $\tau \in (0, 1)$ indexes the $\tau$th quantile of the conditional distribution of $Y$. Suppose that the CQF is a linear model, $Y = X'\beta_{\tau o} + \varepsilon$ with $Q_\tau(\varepsilon \mid X) = 0$. Then
$$Q_\tau(Y \mid X) = X'\beta_{\tau o}.$$
In the population, $\beta_{\tau o}$ solves the problem
$$\min_{\beta \in B}\; E\left[\rho_\tau(Y - X'\beta_\tau)\right]$$
where $\rho_\tau(u) = (\tau - 1[u \le 0])\,u$. Given a random sample of size $n$ from the population, it is possible to obtain consistent estimates of $\beta_{\tau o}$ by the standard quantile regression (QR) estimator
$$\min_{\beta \in B}\; n^{-1}\sum_{i=1}^n \rho_\tau(y_i - x_i'\beta_\tau).$$
Note that the population minimization problem has the first order conditions
$$E\left[\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = 0$$
and their sample analogue is (Buchinsky, 1998)
$$n^{-1}\sum_{i=1}^n \left(\tau - 1\left[y_i - x_i'\beta_\tau \le 0\right]\right)x_i = o_p(n^{-1/2}).$$
Hence, we can frame this problem as a GMM estimator that uses the first order conditions of the QR problem as moment conditions, since these identify $\beta_{\tau o}$. However, suppose a random sample of $(y, x)$ is not observed. We have a selection problem such that the full vector $(y_i, x_i)$ is observed only if a certain binary selection variable equals one, $s_i = 1$; if $s_i = 0$, at least some part of $(y_i, x_i)$ is not observed. Then, in the selected sample, we can only compute
$$n^{-1}\sum_{i=1}^n s_i\left(\tau - 1\left[y_i - x_i'\beta_\tau \le 0\right]\right)x_i = o_p(n^{-1/2})$$
which is the sample analogue of
$$E\left[s\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = 0,$$
but the value $\beta_{\tau o}$ that solves the population moment condition does not necessarily solve the selected population moment condition.
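The check function and its first order condition can be illustrated numerically in the intercept-only case (an illustrative Python sketch assuming NumPy, not the dissertation's code): the minimizer of the empirical check-function objective is the sample $\tau$-quantile, which approximately zeroes the moment $\tau - n^{-1}\sum_i 1[y_i \le b]$.

```python
import numpy as np

# The check function rho_tau and its first-order condition for an
# intercept-only quantile "regression": the sample tau-quantile
# (approximately) minimizes the objective and zeroes the moment.
def rho(u, tau):
    return (tau - (u <= 0)) * u

rng = np.random.default_rng(3)
y = rng.normal(size=5001)
tau = 0.25
q = np.quantile(y, tau)

# Grid search for the minimizer of the empirical check-function objective
grid = np.linspace(q - 0.5, q + 0.5, 201)
obj = [rho(y - b, tau).sum() for b in grid]
b_star = grid[int(np.argmin(obj))]

# Approximate first-order condition at the sample quantile
moment = tau - np.mean(y - q <= 0)
```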
Additionally, assume that the probability of selection can be written as a parametric function of a vector of variables $(x_i, z_i)$ and parameters $\gamma_o$, and that conditional on $z_i$, both $y_i$ and the elements of $x_i$ not included in $z_i$ are irrelevant for the probability of selection (Assumption 1):
$$P(s_i = 1 \mid y_i, x_i, z_i) = P(s_i = 1 \mid z_i) \equiv p(z_i, \gamma_o).$$
In this situation, we can obtain consistent and asymptotically normal estimates of $\beta_{\tau o}$ using the selected sample by weighting the original observations by the inverse of the probability of selection. Note that
$$E\left[\frac{s}{p(z, \gamma_o)}\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = E\left[E\left[\frac{s}{p(z, \gamma_o)}\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X \,\middle|\, x, y, z\right]\right]$$
$$= E\left[\frac{E(s \mid x, y, z)}{p(z, \gamma_o)}\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = E\left[\left(\tau - 1\left[Y - X'\beta_{\tau o} \le 0\right]\right)X\right] = 0,$$
so the weighted moment condition holds. Therefore, we can estimate $\beta_{\tau o}$ using these weighted moment conditions. Naturally, the weights must be estimated if they are unknown. For simplicity, let the true selection model be a standard binary response model. Then estimate the probability of selection by MLE or, more conveniently, by a GMM procedure that uses the first order conditions of the MLE for the selection model as moment conditions. The MLE maximization problem and its first order condition are given by, respectively,
$$\max_{\gamma \in \Gamma}\; \sum_{i=1}^n \left\{s_i \ln\left[p(z_i, \gamma)\right] + (1 - s_i)\ln\left[1 - p(z_i, \gamma)\right]\right\} \qquad (1.29)$$
$$n^{-1}\sum_{i=1}^n \frac{s_i - p(z_i, \gamma)}{p(z_i, \gamma)\left(1 - p(z_i, \gamma)\right)}\,\nabla_\gamma p(z_i, \gamma) = 0 \qquad (1.30)$$
where the estimator of $\gamma_o$ is defined as the solving vector $\hat\gamma$. Again, 1.30 is the sample analogue of the moment condition
$$E\left[\frac{s - p(z, \gamma_o)}{p(z, \gamma_o)\left(1 - p(z, \gamma_o)\right)}\,\nabla_\gamma p(z, \gamma_o)\right] = 0.$$
Hence, we have two sets of moment conditions that can be used to estimate both the selection model and the conditional quantile model.
The GMM estimator in this case is given by any of the four estimators proposed in Section 1.2, with
$$g_{n1}(\theta) = n^{-1}\sum_{i=1}^n \frac{s_i}{p(z_i, \gamma)}\left(\tau - 1\left[y_i - x_i'\beta_\tau \le 0\right]\right)x_i$$
$$g_{n2}(\gamma) = n^{-1}\sum_{i=1}^n \frac{s_i - p(z_i, \gamma)}{p(z_i, \gamma)\left(1 - p(z_i, \gamma)\right)}\,\nabla_\gamma p(z_i, \gamma);$$
the variance of the estimates depends on the choice of estimator, as stated by Theorem 4. To estimate the variance of the estimated parameters we need to obtain valid estimates of the components of $G$ in the variance of $\hat\theta$. Note that, for example,
$$G_{11} \equiv \nabla_\beta\, E[g_1(\beta_o, \gamma_o)] = \nabla_\beta\, E\left[\frac{s}{p(z, \gamma_o)}\left(\tau - 1\left[y - x'\beta_{\tau o} \le 0\right]\right)x\right]$$
$$= \nabla_\beta\, E\left[\frac{s}{p(z, \gamma_o)}\, E\left[\tau - 1\left[y - x'\beta_{\tau o} \le 0\right] \,\middle|\, z, x, s\right]x\right] = \nabla_\beta\, E\left[\frac{s}{p(z, \gamma_o)}\left(\tau - F_{y|z,x,s}(x'\beta_{\tau o})\right)x\right]$$
$$= -E\left[\frac{s}{p(z, \gamma_o)}\, f_{y|z,x,s}(x'\beta_{\tau o})\, x x'\right],$$
hence consistent estimates can be obtained by the sample analogues
$$\hat G_{11} = -n^{-1}\sum_{i=1}^n \frac{s_i}{p(z_i, \hat\gamma)}\, \hat f_{y|z,x,s}(x_i'\hat\beta_\tau)\, x_i x_i'$$
$$\hat G_{12} = -n^{-1}\sum_{i=1}^n \frac{s_i}{\left[p(z_i, \hat\gamma)\right]^2}\left(\tau - 1\left[y_i - x_i'\hat\beta_\tau \le 0\right]\right)x_i\, \nabla_\gamma p(z_i, \hat\gamma)'$$
$$\hat G_{22} = -n^{-1}\sum_{i=1}^n \left[\frac{s_i - p(z_i, \hat\gamma)}{p(z_i, \hat\gamma)\left(1 - p(z_i, \hat\gamma)\right)}\right]^2 \nabla_\gamma p(z_i, \hat\gamma)\, \nabla_\gamma p(z_i, \hat\gamma)'$$
where the last estimator is a direct application of the generalized information matrix equality (GIME) and $\hat f_{y|z,x,s}(\cdot)$ is a suitable estimator of the conditional density of $y$, commonly a kernel estimator. Note that the same asymptotic variance formula for the KNOW-γ estimator of $\beta_\tau$ is obtained by a simple extension of the results for weighted quantile regression presented in Koenker (2005), as shown in Claim 1 in the appendix. Since the conditions of Theorem 5 hold, we obtain more efficient estimates by estimating the inverse probability weights than by using the "true" weights, which is the puzzle described in the literature (Wooldridge, 2002b, 2007). The relatively more efficient estimate of $\beta_{\tau o}$ is given by the one-step estimator that jointly estimates both the probability weights and the parameters of interest, $\beta_{\tau o}$.
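The conditional density term $\hat f$ in $\hat G_{11}$ is commonly obtained by a kernel plug-in. Below is a minimal sketch of a Gaussian-kernel density estimate at a point (unconditional, for illustration only; the bandwidth 0.1 is an arbitrary choice for the example, not a recommendation), assuming NumPy.

```python
import numpy as np

# A kernel estimator for a density at a point, the kind of plug-in used for
# the f-hat term in quantile regression standard errors (sketched
# unconditionally here for simplicity).
def kernel_density_at(u, pts, bw):
    z = (pts - u) / bw
    # Gaussian kernel: phi(z) averaged over the sample, scaled by 1/bw
    return np.mean(np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)) / bw

rng = np.random.default_rng(8)
pts = rng.normal(size=100_000)
f0 = kernel_density_at(0.0, pts, bw=0.1)   # true N(0,1) density at 0 is about 0.3989
```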
One interesting point is that even this relatively restrictive, linear model for the CQF can be very informative about a potentially nonlinear true CQF. As discussed in detail by Angrist, Chernozhukov, and Fernández-Val (2006), a linear quantile regression provides the best linear approximation to the true CQF in the sense that it minimizes a weighted mean square error loss function. So even if we have reason to believe that the true CQF of interest is nonlinear, the linear quantile regression in the example above provides the "best linear approximation" to it, in much the same way that linear OLS offers the best linear approximation to the conditional mean function. Hence, by using IPW to correct the selection bias caused by missing data, we can recover this linear approximation to the CQF of interest even without knowing its true specification. Nevertheless, this framework can also be applied to nonlinear conditional quantiles of the form $Q_\tau(Y \mid X) = m(X, \beta_{\tau o})$, with
$$g_{n1}(\theta) = n^{-1}\sum_{i=1}^n \frac{s_i}{p(z_i, \gamma)}\left(\tau - 1\left[y_i - m(x_i, \beta_\tau) \le 0\right]\right)\nabla_\beta\, m(x_i, \beta_\tau)$$
$$\hat G_{11} = -n^{-1}\sum_{i=1}^n \frac{s_i}{p(z_i, \hat\gamma)}\, \hat f_{y|z,x,s}\!\left(m(x_i, \hat\beta_\tau)\right)\nabla_\beta\, m(x_i, \hat\beta_\tau)\, \nabla_\beta\, m(x_i, \hat\beta_\tau)'$$
$$\hat G_{12} = -n^{-1}\sum_{i=1}^n \frac{s_i}{\left[p(z_i, \hat\gamma)\right]^2}\left(\tau - 1\left[y_i - m(x_i, \hat\beta_\tau) \le 0\right]\right)\nabla_\beta\, m(x_i, \hat\beta_\tau)\, \nabla_\gamma p(z_i, \hat\gamma)'$$
and the remaining equations unchanged.

1.4.2 Instrumental Variable Quantile Regression

Consider a simplified version of the IVQR estimator described in Chernozhukov and Hansen (2006). Focus on the basic linear model that allows for heterogeneous effects, given by
$$Y_d = q(d, x, \tau) = d'\alpha_\tau + x'\beta_\tau$$
where $d$ is a vector of (potentially endogenous) multi-valued treatment variables and $x$ is a vector of covariates.
Under the conditions described in Assumption 1 of Chernozhukov and Hansen (2006), the IVQR estimator of the parameter vector $(\alpha_\tau', \beta_\tau')'$ proposed in that paper approximately solves the estimating equation4
$$n^{-1}\sum_{i=1}^n \left(1\left[y_i - d_i'\alpha_\tau - x_i'\beta_\tau \le 0\right] - \tau\right)(x_i', \Phi_{i\tau}')' = o_p(n^{-1/2})$$
where $\Phi_{i\tau} \equiv \Phi_\tau(\tau, x_i, z_i)$ is a vector of transformations of the instruments. In a simple model $\Phi_{i\tau}$ can be formed by the least squares projection of $d$ on $z$ and $x$ (and their powers) (Chernozhukov and Hansen, 2006, 2008). In that simple case, we could write the sample analogues of the moment conditions that identify the parameters of the model as
$$g_{n1}(\theta) = n^{-1}\sum_{i=1}^n \left(1\left[y_i - d_i'\alpha_\tau - x_i'\beta_\tau \le 0\right] - \tau\right)\left(x_i', \left((x_i', z_i')\gamma\right)'\right)'$$
$$g_{n2}(\gamma) = n^{-1}\sum_{i=1}^n (x_i', z_i')'\left[d_i - (x_i', z_i')\gamma\right].$$
Hence, the analysis developed in Section 1.2 can be applied to the IVQR estimator proposed by Chernozhukov and Hansen (2005, 2006, 2008), and the results shown above are valid in its scope. Nevertheless, it is important to note that the framework developed here does not extend directly to semiparametric cases in which the "first stage" is estimated nonparametrically. That can be a serious inconvenience when we have limited information about the form of the transformation of the vector of instruments that would be preferable in estimating IVQR.

1.5 Conclusion

This chapter (i) extends the GMM efficiency and redundancy results of Prokhorov and Schmidt (2009) to nonsmooth objective functions; (ii) analyzes the extent to which these results can be useful in the context of inverse probability weighting (IPW) as a mechanism to correct missing data issues, thus allowing its use in the LAD and quantile regression

4 For simplicity I assume that the weights $V_{i\tau}$ in Chernozhukov and Hansen (2006) are equal to one.
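The simple instrument transformation $\Phi$ used above, the least squares projection of $d$ on $(x, z)$, can be sketched as follows (illustrative Python with an invented first stage, assuming NumPy; not the dissertation's code). By construction the projection residual is orthogonal in sample to the instruments and covariates, which is what makes $\Phi$ a valid transformation.

```python
import numpy as np

# Phi as the least squares projection of the treatment d on the
# instruments z and covariates x (plus an intercept).
rng = np.random.default_rng(5)
n = 10_000
x = rng.normal(size=n)
z = rng.normal(size=n)
d = 0.8 * z + 0.4 * x + rng.normal(size=n)    # invented first stage

W = np.column_stack([np.ones(n), x, z])
gamma_hat, *_ = np.linalg.lstsq(W, d, rcond=None)
phi = W @ gamma_hat                           # fitted values: projection of d
resid = d - phi                               # orthogonal to columns of W
```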
framework; (iii) verifies the conditions under which the puzzle of the selectivity literature, i.e., that weighting by known probabilities of selection leads to less efficient estimates than weighting by estimated probabilities of selection (Wooldridge, 2002b, 2007; Prokhorov and Schmidt, 2009; Hitomi et al., 2008), remains valid under nonsmoothness of the objective functions that characterize the models of interest; and (iv) shows that even in that case the widely used two-step estimator is less efficient than a one-step joint estimator.

Section 1.2 extends the redundancy and efficiency results of Prokhorov and Schmidt (2009) so that they can be applied to a wide range of contexts in which nonsmooth objective functions are useful, including LAD, quantile regression, censored LAD and quantile treatment effects. Joint estimation of nuisance parameters and parameters of interest is more efficient than a two-step procedure or than knowing the true nuisance parameters in the nonsmooth case. This springs from the information contained in the correlation between the two sets of moment conditions, which is useful even when $\gamma_o$ is known. Using only the first set of moment conditions and known values of $\gamma_o$ in the estimation procedure ignores the additional information embedded in the second set of moment conditions and is therefore inefficient. Some possible consistent estimators for the variance of both sets of parameters are presented.

Section 1.3 analyzes the missing data problem described in Wooldridge (2007). The selection model is estimated by a conditional MLE procedure, but the assumptions about the selection model are weak enough to cover most of the common parametric selection processes in the literature, such as attrition, variable probability sampling and "missing at random". One important case not covered is "selection on unobservables". The results from Wooldridge (2007) and Prokhorov and Schmidt (2009) extend to nonsmooth objective functions.
If we use both sets of moment conditions, knowledge of the nuisance parameters does not improve the efficiency of the estimates of the parameters of interest. Additionally, the moment conditions associated with the selection model are not redundant, except in special cases. Estimating the parameters of interest using only the first set of moment conditions, with known probabilities of selection as weights, is inefficient because it ignores information in the second set of moment conditions. This is the type of puzzle referred to in the selectivity literature, especially in the IPW approach to missing data. In summary, IPW can be used to correct missing data problems when the model of interest is based on nonsmooth objective functions. Furthermore, two-step estimation of $\beta_o$ is more efficient than using known probabilities of selection. Nonetheless, the two-step estimator is dominated by a one-step joint estimation procedure that uses both the weighted moment conditions and the selection model's conditions. Hence, the analysis by Prokhorov and Schmidt (2009) extends to the relative efficiency of an IPW approach to missing data problems in which the moment conditions of interest are nonsmooth, encompassing, for example, LAD, quantile regression, censored LAD and IVQR. Finally, two illustrative examples of models encompassed by the general framework developed in this work are provided. The first is a quantile regression model with missing data; the second is a simplified version of the instrumental variable quantile regression (IVQR) estimator presented by Chernozhukov and Hansen (2006).
CHAPTER 2 Fixed Bandwidth Asymptotics for Regression Discontinuity Designs

2.1 Introduction

Regression discontinuity (RD) designs have been propelled to the spotlight of economic analysis in recent years1, especially in the policy and treatment evaluation literatures, as a way of estimating treatment effects in a non-experimental setting. The appeal of RD comes from the relatively weak assumptions needed for the identification of treatment effects and for inference, which rely on RD's "quasi-experimental" characteristics. The standard approach to deriving the asymptotic properties of estimates obtained in RD settings relies on the traditional assumption that the bandwidth, $h$, used in the estimation procedure shrinks towards zero as the sample size grows. This guarantees identification of the parameter of interest under mild conditions. Hence, the asymptotic distribution of the estimator used as the basis for inference depends crucially on this small-h condition. In practice, to obtain an estimate of the parameter of interest and perform inference about it, the empiricist is required to choose a fixed bandwidth greater than zero. Hence, even though the asymptotic theory requires that $h \to 0$, in practice $h > 0$ and fixed. Asymptotic distributions that treat $h$ as fixed can provide a more refined approximation to the behavior of the estimator than those derived under the assumption that $h \to 0$. This chapter derives the asymptotic distribution of the local polynomial estimator when the bandwidth is allowed to be any positive real number. The results shown in Section 2.5 provide a new, fixed-h approximation to the estimator's bias and variance that incorporates the bandwidth size chosen by the researcher.

1 Lee and Lemieux (2009), in a broad review of the RD literature, compile a list of more than 60 papers that apply RD designs in many different contexts. The overwhelming majority of the papers have been published in the last decade.
Corollary 2 shows that the standard small-h asymptotic distribution of the parameter of interest is a special case of fixed-h in which $h \to 0$. Also, Corollary 3 shows that when a fixed $h > 0$ is used, the standard small-h result for the variance of the estimators is equivalent to assuming that the density of the running variable and the conditional variance of the outcome variable are constant around the discontinuity. The increased theoretical interest in RD started with Hahn, Todd, and Van der Klaauw (1999, 2001), who presented the conditions for identification of the average treatment effect of interest and for its estimation by exploiting discontinuities in the probability of treatment provision, which are determined by the so-called running variable. They also derived the asymptotic distribution of the estimators by looking at a shrinking bandwidth around the discontinuity. Porter (2003) provided widely used results on the asymptotic properties of estimators of the treatment effect of interest, obtaining limiting distributions for estimators based on local polynomial regression and partially linear estimation. Imbens and Lemieux (2008) and Lee and Lemieux (2009) offer broad reviews of the theoretical and applied literature, with emphasis on the identification of the parameter of interest and its potential interpretation as a weighted average treatment effect. The analysis of the asymptotic properties of estimators for fixed bandwidths has received some attention in other literatures. Notably, Neave (1970), in the framework of spectral density estimation, obtains more accurate approximations to the variance of nonparametric spectral estimates by acknowledging that, with a finite sample, the bandwidth used is fixed. He asserts that, in the context of his paper, the assumption equivalent to the bandwidth converging to zero: “(...)
is a convenient assumption mathematically in that, in particular, it ensures consistency of the estimates, but it is unrealistic when such results are used as approximations to the finite case (...)” (Neave, 1970, p. 70). Also, Fan (1998) provides an alternative approximation for goodness-of-fit tests of density function estimates in which the bandwidth used in the test is fixed, obtaining improved approximations to the asymptotic behavior of the test and more appropriate critical values for inference. The same can be said of the regression discontinuity design: even though $h \to 0$ is a convenient assumption that guarantees consistency of the estimates of the average treatment effect, it is unrealistic. It is of theoretical and practical interest to obtain more accurate asymptotic distributions by treating $h$ as fixed, so that the theory used for inference is more accurate and better aligned with the practice of applied economists. Monte Carlo simulations in Section 2.7.1 indicate that, compared with small-h, asymptotic distributions derived under fixed-h better characterize the behavior of the estimators and provide improved inference about the treatment effect, reducing size distortions in tests and better approximating the bias in the estimates. These improvements are more important when the bandwidth is farther from zero, as one would expect. Section 2.6 proposes estimators for the asymptotic variance based on the fixed-h results and provides evidence, through Monte Carlo simulations (Section 2.7.2), that feasible inference incorporates the improvements predicted by the theory, suggesting that the theoretical gains in robustness can be translated into practical benefits in applied work. Section 2.7.2 also compares the inference performance of the small-h standard error estimators proposed in the literature. The fixed-h variance estimators can improve markedly over small-h estimators in the presence of some forms of heteroskedasticity.
Simulations using heteroskedastic errors provide evidence that feasible tests based on the fixed-h approach obtain better coverage, outperforming small-h starting at relatively small bandwidths. Interestingly, in the case of the widely used local linear estimator with homoskedastic errors, the variance estimators based on small-h asymptotics suggested in the literature produce well-behaved tests with size performance similar to the fixed-h variance estimators, performing better than the standard theory would predict.

2.2 Model

The interest lies in estimating the average treatment effect, τ, of a certain treatment or policy that affects part of a population of interest. As discussed in Porter (2003), Imbens and Lemieux (2008), and Lee and Lemieux (2009), RD designs are closely associated with the treatment effect literature.2 There are two types of RD designs, sharp and fuzzy, and they differ as to how treatment is assigned to a given observation and the role of the discontinuity in its assignment. I will focus on the sharp design in this section and emphasize the differences of the fuzzy design when needed.

2.2.1 Sharp Regression Discontinuity Design

In the sharp design, the treatment status, D, is a deterministic function of a so-called “running” or “forcing” variable, x, such that
$$d_i = \begin{cases} 1 & \text{if } x_i \ge \bar{x} \\ 0 & \text{if } x_i < \bar{x} \end{cases}$$
where $\bar{x}$ is the known cut-off point. Then, let $Y_1$ and $Y_0$ be the potential outcomes corresponding to the two possible treatment assignments. As usual, we cannot observe both potential outcomes, having access only to $Y = dY_1 + (1 - d)Y_0$. As described by Hahn et al. (2001) and Porter (2003), under a smoothness assumption that $E[Y_j \mid X = x]$ is continuous at $\bar{x}$ for $j = 0, 1$, the average treatment effect can be estimated by comparing points just above and just below the discontinuity.

2 Angrist and Pischke (2009) provide a simple introduction to the intuition of regression discontinuity.
The discontinuity in treatment assignment at $\bar{x}$ provides the opportunity to identify the average treatment effect at the cutoff without any additional parametric functional form restrictions on the conditional expectations of the outcome variable. The average causal effect of the treatment at the discontinuity is (Imbens and Lemieux, 2008)
$$\tau^S \equiv E[Y_1 - Y_0 \mid X = \bar{x}] = \lim_{x \downarrow \bar{x}} E[Y \mid X = x] - \lim_{x \uparrow \bar{x}} E[Y \mid X = x]$$
where the second equality holds under some smoothness assumptions regarding the conditional expectations (discussed below). The sharp regression discontinuity design uses the discontinuity in the conditional expectation of Y given X to uncover the average treatment effect. If the treatment effect is deemed constant across individuals, $\tau^S$ is the effect of treatment for each individual in the population. If we allow the treatment effect to differ among individuals, $\tau^S$ is the average treatment effect for individuals at the cutoff. Interestingly, Lee and Lemieux (2009) show that the so-called RD gap obtained by the comparison of observations just above and just below the cutoff can be interpreted as a weighted average treatment effect across all individuals, not only the individuals around the cutoff. In this case each individual receives a weight directly proportional to the ex ante likelihood that the individual's realization of X will be close to the threshold. For a comprehensive review of RD designs and their applications and interpretation, see Lee and Lemieux (2009).
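The identification argument above can be illustrated with a minimal numerical sketch: estimate the RD gap as the difference of local means just above and just below the cutoff. All data-generating values below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP for illustration: linear m(x) plus a jump of tau_true at the cutoff.
cutoff, tau_true, h = 0.0, 2.0, 0.5
x = rng.uniform(-2.0, 2.0, 5000)
d = (x >= cutoff).astype(float)
y = 1.0 + 0.5 * x + tau_true * d + rng.normal(0.0, 0.3, x.size)

# RD gap: difference of local means within a bandwidth h on each side of the
# cutoff (the p = 0 estimator with a rectangular kernel).
above = y[(x >= cutoff) & (x <= cutoff + h)].mean()
below = y[(x < cutoff) & (x >= cutoff - h)].mean()
tau_hat = above - below
print(tau_hat)
```

Because m(x) has a nonzero slope in this sketch, the difference of local means overstates the jump by roughly the slope times h; this is exactly the fixed-h bias that later sections make explicit.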
2.2.2 Fuzzy Regression Discontinuity Design

In the fuzzy design the probability of receiving treatment still changes discontinuously at the threshold, but is not required to go from 0 to 1, allowing for a smaller jump in the probability of receiving treatment at the cutoff,
$$\lim_{x \downarrow \bar{x}} \Pr(d = 1 \mid X = x) \neq \lim_{x \uparrow \bar{x}} \Pr(d = 1 \mid X = x)$$
This framework allows for a greater range of applications since it includes cases in which the incentives to receive (or assign) treatment change discontinuously at the threshold, but are not strong enough to induce all individuals above it to be treated (and those below not to be treated). The average treatment effect at the cutoff can be identified by the ratio of the change in the conditional expectation of the outcome variable to the change in the conditional probability of receiving treatment (Imbens and Lemieux, 2008):
$$\tau^F \equiv \frac{\lim_{x \downarrow \bar{x}} E[Y \mid X = x] - \lim_{x \uparrow \bar{x}} E[Y \mid X = x]}{\lim_{x \downarrow \bar{x}} E[d \mid X = x] - \lim_{x \uparrow \bar{x}} E[d \mid X = x]}$$
This parameter's interpretation is closely linked to the instrumental variables approach. As emphasized by Hahn, Todd, and Van der Klaauw (2001), Imbens and Lemieux (2008), and Lee and Lemieux (2009), a causal interpretation of this ratio requires the same assumptions as the local average treatment effect (LATE) framework presented in Imbens and Angrist (1994). For that we assume monotonicity, i.e., that crossing the cutoff moves treatment status in only one direction, or, as stated by Lee and Lemieux (2009, p. 23): “(...) X crossing the cutoff cannot simultaneously cause some units to take up and others to reject the treatment.” Also, crossing the cutoff cannot affect the outcome other than through the receipt of treatment; otherwise we would erroneously attribute to the treatment changes in the conditional expectation of Y that are due to changes in X.
Under these additional assumptions, $\tau^F$ has an interpretation similar to the IV estimator: the average treatment effect for the individuals at the threshold (due to the RD design), and only for those whose participation in treatment was affected by the cutoff. Those individuals are described as compliers in the average treatment effect literature.3 Hence (Imbens and Lemieux, 2008),
$$\tau^F \equiv E[Y_1 - Y_0 \mid \text{individual is a complier and } X = \bar{x}]$$
As in the sharp RD design, Lee and Lemieux (2009) show that the fuzzy RD estimator can be interpreted as a weighted LATE, with an individual's weight directly proportional to the ex ante likelihood that the individual's realization of X will be close to the threshold.

2.3 Estimators

I analyze estimates of the parameter of interest, $\tau^S$ or $\tau^F$, obtained by local polynomial estimators. In applied work local polynomial estimators are a staple for estimation of treatment effects in RD settings. This is partially due to their easy implementation, their good properties, and the fact that the local linear estimator has been the focus of several papers that helped to disseminate the technique (Hahn, Todd, and Van der Klaauw, 1999, 2001; Imbens and Lemieux, 2008; Lee and Lemieux, 2009). Theoretically as well, local polynomial estimators are attractive for estimation in the regression discontinuity setting given their good boundary behavior, as described by Fan and Gijbels (1996). The order-p local polynomial estimator is defined as follows. In the sharp design case, given data $(y_i, x_i)_{i=1,2,\ldots,n}$, let $d_i = \mathbf{1}[x_i \ge \bar{x}]$, let $k(\cdot)$ be a kernel function, and let $h$ denote a bandwidth that controls the size of the local neighborhood to be averaged over.

3 See Imbens and Angrist (1994); Hahn, Todd, and Van der Klaauw (2001); Imbens and Lemieux (2008) and Lee and Lemieux (2009).
Also, define the $(p+1) \times 1$ vector $Z(x) = \left(1, (x - \bar{x}), (x - \bar{x})^2, \ldots, (x - \bar{x})^p\right)'$ and let $\hat{\alpha}_{p+}, \hat{\beta}_{p+}$ be the solution to the minimization problem:4
$$\min_{a, b_1, \ldots, b_p} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h}\, k\!\left(\frac{x_i - \bar{x}}{h}\right) d_i \left(y_i - a - b_1 (x_i - \bar{x}) - \cdots - b_p (x_i - \bar{x})^p\right)^2$$
and, similarly, $\hat{\alpha}_{p-}, \hat{\beta}_{p-}$ minimizes the same objective function but with $1 - d_i$ replacing $d_i$. The estimator of the parameter of interest is given by
$$\hat{\tau}^S \equiv \hat{\alpha}_p = \hat{\alpha}_{p+} - \hat{\alpha}_{p-}$$
This estimator fits a polynomial in X on a neighborhood just above and just below the cutoff for treatment, $\bar{x}$, and encompasses some familiar estimators as special cases.

Case 1 If p = 0 is used, the Nadaraya-Watson estimator is obtained. The Nadaraya-Watson estimator takes the difference of kernel weighted averages of the observations on each side of the discontinuity:
$$\hat{\alpha} = \frac{\sum_i k\!\left(\frac{x_i - \bar{x}}{h}\right) y_i d_i}{\sum_j k\!\left(\frac{x_j - \bar{x}}{h}\right) d_j} - \frac{\sum_i k\!\left(\frac{x_i - \bar{x}}{h}\right) y_i (1 - d_i)}{\sum_j k\!\left(\frac{x_j - \bar{x}}{h}\right) (1 - d_j)}$$
If in addition we use the rectangular kernel, $\hat{\alpha}$ simplifies to the difference of the means of $y_i$ in the bandwidths above and below the cutoff.

Case 2 If the rectangular kernel is used, the local least squares estimator of $y_i$ on a polynomial in $(x_i - \bar{x})$ of order p is obtained on the neighborhood on each side of the cutoff. Moreover, if the condition $\hat{\beta}_{p+} = \hat{\beta}_{p-}$ is imposed, the estimator $\hat{\alpha}_p$ is the coefficient on $d_i$ in the OLS regression of $y_i$ on $d_i$ and the polynomial in $(x_i - \bar{x})$ of order p using the data inside the bandwidth on both sides of the cutoff.

In both the theoretical and applied literatures, emphasis has been given to the case in which a linear model (p = 1) in X is fitted on each side of the cutoff (Hahn, Todd, and Van der Klaauw, 1999, 2001; Imbens and Lemieux, 2008; Lee and Lemieux, 2009).

4 Note that in the sharp RD design, $d_i$ will be identical to the treatment assignment variable $D_i$ since the probability of being treated is zero below the threshold and one above it.
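As a concrete sketch, the order-p estimator above amounts to two weighted least squares fits, one on each side of the cutoff, followed by a difference of intercepts. The code below is a minimal illustration; the function name and the Bartlett (triangular) kernel default are my own choices, not prescriptions from the text.

```python
import numpy as np

def local_poly_rd(y, x, cutoff, h, p=1,
                  kernel=lambda u: np.maximum(1.0 - np.abs(u), 0.0)):
    """Order-p local polynomial sharp-RD estimate: weighted polynomial fit
    in (x - cutoff) on each side of the cutoff, difference of intercepts.
    The default kernel is the Bartlett (triangular) kernel."""
    u = (x - cutoff) / h
    w = kernel(u) / h                                  # kernel weights
    Z = np.vander(x - cutoff, p + 1, increasing=True)  # columns 1, (x-c), ..., (x-c)^p
    intercepts = []
    for side in (x >= cutoff, x < cutoff):
        ws = w * side                                  # keep one side only
        WZ = Z * ws[:, None]
        beta = np.linalg.solve(Z.T @ WZ, WZ.T @ y)     # weighted least squares
        intercepts.append(beta[0])                     # alpha_{p+} or alpha_{p-}
    return intercepts[0] - intercepts[1]

# Example on hypothetical data with a jump of 2 at the cutoff; the local
# linear fit (p = 1) captures the linear m(x), so the estimate is unbiased.
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 5000)
y = 1.0 + 0.5 * x + 2.0 * (x >= 0) + rng.normal(0.0, 0.3, x.size)
print(local_poly_rd(y, x, cutoff=0.0, h=0.5, p=1))  # close to 2
```

With p = 0 the same routine reduces to the Nadaraya-Watson estimator of case 1, since the design matrix collapses to a column of ones.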
Case 3 If p = 1, the local linear estimator is obtained. For the rectangular kernel, $\hat{\alpha}$ simplifies to the difference of the intercepts from the linear regressions of $y_i$ on 1 and $(x_i - \bar{x})$ in the ranges above and below the cutoff. If, additionally, $\hat{\beta}_{1+} = \hat{\beta}_{1-}$ is imposed, the estimator of the ATE of interest is the coefficient on $d_i$ in the OLS regression of $y_i$ on 1, $d_i$ and $(x_i - \bar{x})$ using the data inside the ranges on both sides of the cutoff.

2.4 Assumptions

To derive the asymptotic distribution of the estimator of τ, the following assumptions are sufficient.

Assumption 4 (a) $k(\cdot)$ is a symmetric, bounded, Lipschitz function, zero outside a bounded set; $\int k(u)\,du = 1$. (b) For a positive integer s, $\int k(u)\,u^j\,du = 0$, $1 \le j \le s - 1$.

Assumption 4 allows for higher order kernels,5 and a bounded support set for the kernel avoids the use of a trimming function. Let $f_o$ denote the marginal density of x and let $m(x)$ denote the conditional expectation of y given x minus the discontinuity, i.e., $m(x) = E[y \mid x] - \alpha \mathbf{1}[x \ge \bar{x}]$, where $\bar{x}$ is the value of the running variable at which the discontinuity occurs.

Assumption 5 Suppose the data $(y_i, x_i)_{i=1,2,\ldots,n}$ are i.i.d. and α is defined by
$$\alpha = \lim_{x \downarrow \bar{x}} E[y \mid X = x] - \lim_{x \uparrow \bar{x}} E[y \mid X = x]$$
For some compact interval ℵ of x with $\bar{x} \in \mathrm{int}(\aleph)$, $f_o$ is $l_f$ times continuously differentiable and bounded away from zero; $m(x)$ is $l_m$ times continuously differentiable for $x \in \aleph \setminus \{\bar{x}\}$, and m is continuous at $\bar{x}$ with finite right and left-hand derivatives up to order $l_m$.

In the sharp RD design $\tau^S = \alpha$ and the average treatment effect is obtained directly from the discontinuity in the conditional expectation of Y. In the following, I discuss the estimation of α and interpret it as the estimate of the average treatment effect of interest.

5 If s ≥ 3, the kernel has to be negative in some region of its domain to satisfy part (b) of the assumption.
For the fuzzy RD design the average treatment effect will be given by the ratio of two such discontinuities, in the conditional expectations of the outcome and of the probability of receiving “treatment”. Assumption 5 guarantees smoothness of the density of x and of the conditional expectation of y on both sides of the discontinuity while allowing for different right- and left-side derivatives of m at $\bar{x}$. Also, bounding the density of x away from zero in the neighborhood around $\bar{x}$ guarantees there is density (“data”) around the discontinuity with which to estimate the jump size. Assumption 6 describes the behavior of the moments of the outcome variable around the discontinuity. Define $\varepsilon = y - E[y \mid X = x] = y - m(x) - \alpha \mathbf{1}[x \ge \bar{x}]$.

Assumption 6 (a) $\sigma^2(x) = E[\varepsilon^2 \mid X = x]$ is continuous for $x \neq \bar{x}$, $x \in \aleph$, and right and left-hand limits at $\bar{x}$ exist. (b) For some $\zeta > 0$, $E[|\varepsilon|^{2+\zeta} \mid X = x]$ is uniformly bounded on ℵ.

Assumption 6(a) allows the conditional variance of the outcome variable to be a function of the running variable and ensures it is well behaved around the cutoff. Part (b) bounds the moments so that a central limit theorem can be applied. The fixed-h asymptotic distributions described in section 2.5 do not require additional assumptions beyond what is used in the standard, small-h literature, e.g., Hahn, Todd, and Van der Klaauw (2001); Porter (2003), etc.

2.5 Asymptotic Distributions

This section develops the asymptotic distribution of the local polynomial estimator of the average treatment effect for a fixed bandwidth, h.

Theorem 6 Suppose Assumptions 4(a) and 6 hold, and Assumption 5 holds with $l_m \ge p + 1$ and $l_f$ any nonnegative integer. If h is fixed and positive, then as $n \to \infty$,
$$\sqrt{nh}\,(\hat{\alpha}_p - \alpha^*_p) \xrightarrow{d} N\!\left(0, V_{fixed\text{-}h}\right) \quad (2.1)$$
where
$$V_{fixed\text{-}h} = e_1' \left( \Gamma^{*\,-1}_{-} \Delta^*_{-} \Gamma^{*\,-1}_{-} + \Gamma^{*\,-1}_{+} \Delta^*_{+} \Gamma^{*\,-1}_{+} \right) e_1 \quad (2.2)$$
$$\alpha^*_p = \alpha + B_{fixed\text{-}h}$$
$$B_{fixed\text{-}h} = e_1' \left[ \Gamma^{*\,-1}_{+} \int_0^\infty k(u)\, Z(\bar{x} + uh)\, m(\bar{x} + uh)\, f_o(\bar{x} + uh)\,du - \Gamma^{*\,-1}_{-} \int_0^\infty k(u)\, Z(\bar{x} - uh)\, m(\bar{x} - uh)\, f_o(\bar{x} - uh)\,du \right] \quad (2.3)$$
and
$$\Gamma^*_{+(-)} = \begin{pmatrix} \gamma_0^{+(-)} & \cdots & \gamma_p^{+(-)} \\ \vdots & \ddots & \vdots \\ \gamma_p^{+(-)} & \cdots & \gamma_{2p}^{+(-)} \end{pmatrix}, \quad \Delta^*_{+(-)} = \begin{pmatrix} \delta_0^{+(-)} & \cdots & \delta_p^{+(-)} \\ \vdots & \ddots & \vdots \\ \delta_p^{+(-)} & \cdots & \delta_{2p}^{+(-)} \end{pmatrix},$$
$$e_1 = (1, 0, \ldots, 0)', \quad \gamma_j^{+} = \int_0^\infty k(u)\,u^j f_o(\bar{x} + uh)\,du, \quad \gamma_j^{-} = (-1)^j \int_0^\infty k(u)\,u^j f_o(\bar{x} - uh)\,du,$$
$$\delta_j^{+} = \int_0^\infty k^2(u)\,u^j \sigma^2(\bar{x} + uh) f_o(\bar{x} + uh)\,du, \quad \delta_j^{-} = (-1)^j \int_0^\infty k^2(u)\,u^j \sigma^2(\bar{x} - uh) f_o(\bar{x} - uh)\,du$$

The proof is given in the appendix. Theorem 6 provides the asymptotic distribution of the local polynomial estimator of the parameter of interest for any bandwidth value. The formula for the asymptotic variance explicitly takes into consideration the choice of bandwidth, without assuming h → 0. The fixed-h approach used in theorem 6 captures the impact of h on the asymptotic variance, $V_{fixed\text{-}h}$. Even though the asymptotic variance formulas are somewhat cumbersome, they are still functions of known data and can be calculated for given functions $f_o(x)$ and $\sigma^2(x)$ or estimated in a dataset (see section 2.6).

The bias term that arises under the fixed-h assumption does not vanish as the sample size increases, as the standard approximations suggest; for a given bandwidth, it converges to $B_{fixed\text{-}h}$. The bias is the difference of the (scaled) linear projections of $m(x)$ on Z evaluated at $x = \bar{x}$ (i.e., the difference in intercepts) inside the bandwidth above and below the cutoff. Intuitively, the bias in $\hat{\alpha}$ is the difference between the conditional expectations of the outcome above and below the cutoff that would have arisen in the absence of treatment, i.e., the difference that would have occurred regardless and is erroneously attributed to the treatment or policy being analyzed. The fixed-h approach tackles the bias problem “head on”, making explicit the impact of the bandwidth choice on the bias of the estimate obtained.
The local polynomial approach mitigates the bias problem if it is able to approximate m(x) appropriately, since by using the higher order polynomial terms it partially captures changes in m(x) above and below the cutoff that would exist even in the absence of treatment. Note that, as h → 0, the results for the asymptotic distribution of $\hat{\alpha}$ in theorem 6 approach the asymptotic variance and bias of small-h asymptotics (Porter, 2003).

Corollary 2 If the conditions in theorem 6 hold and h → 0, then the asymptotic variance and bias for $\hat{\alpha}_p$ are equal to the small-h approximations (Porter, 2003)
$$V_{small\text{-}h} = \frac{\sigma^{2+}(\bar{x}) + \sigma^{2-}(\bar{x})}{f_o(\bar{x})}\; e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1 \quad (2.4)$$
$$\alpha^*_{small\text{-}h} = \alpha + B_{small\text{-}h}$$
$$B_{small\text{-}h} = \lim_{h \to 0}\, h^{p+1}\, \frac{m^{(p+1)+}(\bar{x}) - (-1)^{p+1} m^{(p+1)-}(\bar{x})}{(p+1)!}\; e_1' \Gamma^{-1} \begin{pmatrix} \gamma_{p+1} \\ \vdots \\ \gamma_{2p+1} \end{pmatrix} \quad (2.5)$$
and
$$\Gamma = \begin{pmatrix} \gamma_0 & \cdots & \gamma_p \\ \vdots & \ddots & \vdots \\ \gamma_p & \cdots & \gamma_{2p} \end{pmatrix}, \quad \Delta = \begin{pmatrix} \delta_0 & \cdots & \delta_p \\ \vdots & \ddots & \vdots \\ \delta_p & \cdots & \delta_{2p} \end{pmatrix},$$
$e_1 = (1, 0, \ldots, 0)'$, $\gamma_j = \int_0^\infty k(u)\,u^j\,du$, $\delta_j = \int_0^\infty k^2(u)\,u^j\,du$, and $m^{(l)+(-)}(\bar{x})$ is the $l$th right (left)-hand derivative of $m(x)$ at the point $\bar{x}$.

Additionally, the small-h asymptotic variance and bias in corollary 2 are equal to those obtained by assuming that $f_o(x)$ and $\sigma^2(x)$ are constant around the cutoff and that $m(x)$ can be exactly approximated by a polynomial of order p + 1.

Corollary 3 Under the assumptions in theorem 6, suppose h > 0 and, in the bandwidth around the cutoff, $f_o(x)$ and $\sigma^2(x)$ are constant and $m(x)$ can be exactly approximated by an expansion of order p + 1. Then the asymptotic variance and bias of $\sqrt{nh}\,(\hat{\alpha}_p - \alpha)$ obtained by fixed-h (theorem 6) and small-h (Porter, 2003) are the same.

Focusing on the formula for the asymptotic variance in both the fixed-h and small-h approaches, it is clear that the refinements obtained by fixed-h are due to incorporating the behavior of $f_o(x)$ and $\sigma^2(x)$ in the ranges around the cutoff, while small-h ignores it by considering only the values at the cutoff, $f_o(\bar{x})$ and $\sigma^2(\bar{x})$.
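The equivalence in corollary 3 can be checked numerically for the Nadaraya-Watson case (p = 0) with a rectangular kernel on [−1, 1], for which the scaling $e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1$ equals 1. The density and variance functions below are illustrative choices, not taken from the text: with constant $f_o$ and $\sigma^2$ the two variances coincide, while heteroskedasticity inside the bandwidth drives them apart.

```python
import numpy as np

def integral(vals, u):
    # simple trapezoidal rule, to keep the sketch dependency-free
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(u)))

xbar, h = 0.0, 1.0
u = np.linspace(0.0, 1.0, 20001)   # rectangular kernel k(u) = 1/2 on [-1, 1]

def v_fixed(f, s2):
    """V_fixed-h for p = 0: delta_0 / gamma_0^2 on each side, summed."""
    out = 0.0
    for sign in (+1.0, -1.0):
        pts = xbar + sign * u * h
        num = integral(0.25 * s2(pts) * f(pts), u)   # k^2(u) = 1/4
        den = integral(0.5 * f(pts), u) ** 2         # k(u)   = 1/2
        out += num / den
    return out

def v_small(f, s2):
    """Small-h formula: (sigma2+ + sigma2-)/f at the cutoff, scaling = 1."""
    return 2.0 * s2(np.array([xbar]))[0] / f(np.array([xbar]))[0]

f_const = lambda x: 0.5 + 0.0 * x
s2_const = lambda x: 1.0 + 0.0 * x
s2_het = lambda x: 1.0 + 2.0 * np.abs(x)   # heteroskedastic in the bandwidth

print(v_fixed(f_const, s2_const), v_small(f_const, s2_const))  # both 4.0
print(v_fixed(f_const, s2_het), v_small(f_const, s2_het))      # 8.0 vs 4.0
```

With the constant functions both formulas give 4.0; once $\sigma^2$ varies inside the bandwidth, the fixed-h variance (8.0 here) departs from the small-h value, which only sees the cutoff point.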
Hence, the benefits of using fixed-h asymptotics are expected to be larger when the density of X and the conditional variance change markedly inside the bandwidths around the cutoff; i.e., heteroskedasticity inside the bandwidth could lead to poor performance of the small-h variance approximation relative to fixed-h. It is relevant to note that both the fixed-h and small-h asymptotic approximations are based on the same estimator of α. For a given bandwidth the bias present in the estimate is set. Small-h asymptotics may lead one to ignore the bias by arguing that the bandwidth was chosen to “undersmooth”. However, once a bandwidth is chosen the bias is given and should not be ignored. To clarify the intuition behind the results in theorem 6, it is interesting to analyze the special case of the Nadaraya-Watson estimator.

Case 4 For the Nadaraya-Watson estimator we have
$$\sqrt{nh}\,(\hat{\alpha} - \alpha^*_{NW}) \xrightarrow{d} N(0, V_{NW})$$
where
$$V_{NW} = \frac{\int_0^\infty k^2(u)\,\sigma^{2}(\bar{x}+uh)\, f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty k(u)\, f_o(\bar{x}+uh)\,du\right)^2} + \frac{\int_0^\infty k^2(u)\,\sigma^{2}(\bar{x}-uh)\, f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty k(u)\, f_o(\bar{x}-uh)\,du\right)^2}$$
$$\alpha^*_{NW} = \alpha + B_{NW}$$
$$B_{NW} = \frac{\int_0^\infty k(u)\,m(\bar{x}+uh)\, f_o(\bar{x}+uh)\,du}{\int_0^\infty k(u)\, f_o(\bar{x}+uh)\,du} - \frac{\int_0^\infty k(u)\,m(\bar{x}-uh)\, f_o(\bar{x}-uh)\,du}{\int_0^\infty k(u)\, f_o(\bar{x}-uh)\,du}$$
If the rectangular kernel is used, the asymptotic variance and bias simplify to
$$V_{NW} = \frac{\int_0^\infty \sigma^{2}(\bar{x}+uh)\, f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty f_o(\bar{x}+uh)\,du\right)^2} + \frac{\int_0^\infty \sigma^{2}(\bar{x}-uh)\, f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty f_o(\bar{x}-uh)\,du\right)^2}$$
$$B_{NW} = \frac{\int_0^\infty m(\bar{x}+uh)\, f_o(\bar{x}+uh)\,du}{\int_0^\infty f_o(\bar{x}+uh)\,du} - \frac{\int_0^\infty m(\bar{x}-uh)\, f_o(\bar{x}-uh)\,du}{\int_0^\infty f_o(\bar{x}-uh)\,du}$$
The asymptotic variance is a weighted average of the conditional variance of Y above and below the cutoff, and the asymptotic bias is simply the difference in the (local) averages of m(x) above and below the cutoff, i.e., the difference in outcomes that would have arisen even in the absence of treatment. Intuitively, it is interesting to draw a parallel between the results in theorem 6 and the issue of model misspecification.
The problem of estimating the ATE at the cutoff discussed here can be seen as one of correctly estimating $E[Y \mid X]$ on both sides of the cutoff. In this sense, the local polynomial estimator is a polynomial approximation to the unknown conditional expectation inside the bandwidth on each side, no different from standard parametric methods. By choosing a relatively small bandwidth we are fitting the conditional expectation on a restricted support and, hence, expect a polynomial of order p to produce a better fit than if we were trying to fit $E[Y \mid X]$ globally. This is the benefit associated with a nonparametric approach, since it allows the conditional expectation to be unrestricted outside the bandwidth. Clearly, one does not expect the conditional expectation in the bandwidth to be completely described by a polynomial of the chosen order p, so we can draw some intuition by viewing the asymptotic results in theorem 6 as those arising from potentially misspecified models (White, 1982, 1996). The estimator converges to $\alpha^*$, which is not equal to the parameter of interest but still provides relevant information about the population.

2.5.1 Fuzzy Regression Discontinuity Design

In the fuzzy RD design the estimator of the parameter of interest is given by the ratio
$$\hat{\tau}^F = \frac{\hat{\alpha}}{\hat{\theta}}$$
where $\hat{\alpha}$ is any of the estimators described in the previous section and $\hat{\theta}$ is the estimator of the change in the probability of being in the treated group at the cutoff. Note that $\hat{\theta}$ is obtained by using the estimators described above with the treatment assignment variable, $D_i$, as the dependent variable. To obtain the asymptotic distribution of the fuzzy RD estimator, the delta method can be used, similarly to the result in Porter (2003).
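In code, the ratio estimator and its delta-method standard error can be sketched as follows; the numerical inputs are hypothetical placeholders for the variance and covariance terms of the joint normal limit.

```python
import numpy as np

def fuzzy_rd(alpha_hat, theta_hat, V_a, V_t, C_at, nh):
    """Fuzzy-RD point estimate alpha/theta and its delta-method standard
    error, given the asymptotic variance/covariance terms of the joint
    normal limit of sqrt(nh)*(alpha_hat, theta_hat)."""
    tau_f = alpha_hat / theta_hat
    # Delta-method asymptotic variance of the ratio:
    #   V_a/theta^2 - 2*alpha*C_at/theta^3 + alpha^2*V_t/theta^4
    V = (V_a / theta_hat**2
         - 2.0 * alpha_hat * C_at / theta_hat**3
         + alpha_hat**2 * V_t / theta_hat**4)
    return tau_f, np.sqrt(V / nh)

# Hypothetical numbers: a jump of 1.2 in E[Y|X] and of 0.6 in Pr(treatment).
tau_f, se = fuzzy_rd(1.2, 0.6, V_a=2.0, V_t=0.25, C_at=0.1, nh=500.0)
print(tau_f, se)  # 2.0 and about 0.12
```

In practice the limits $\alpha^*$ and $\theta^*$ include the fixed-h bias terms discussed below; the sketch simply plugs in the point estimates.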
Theorem 7 If
$$\sqrt{nh} \begin{pmatrix} \hat{\alpha} - \alpha^* \\ \hat{\theta} - \theta^* \end{pmatrix} \xrightarrow{d} N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} V_\alpha & C_{\alpha\theta} \\ C_{\alpha\theta} & V_\theta \end{pmatrix} \right)$$
then
$$\sqrt{nh} \left( \frac{\hat{\alpha}}{\hat{\theta}} - \frac{\alpha^*}{\theta^*} \right) \xrightarrow{d} N\!\left(0,\; \frac{1}{\theta^{*2}} V_\alpha - 2\frac{\alpha^*}{\theta^{*3}} C_{\alpha\theta} + \frac{\alpha^{*2}}{\theta^{*4}} V_\theta \right)$$
where $\alpha^* = \alpha + B_\alpha$, $\theta^* = \theta + B_\theta$, and $B_\alpha$ and $B_\theta$ are the bias terms for the estimators as defined in theorem 6 for local polynomial estimators.

The proof of the proposition follows directly from the delta method and is omitted. The condition of multivariate normality required in this proposition follows from the usual multivariate central limit theorem using a Cramer-Wold device (James, 2004; Pagan and Ullah, 1999). Note that
$$\frac{\alpha^*}{\theta^*} = \frac{\alpha + B_\alpha}{\theta + B_\theta} = \frac{\alpha}{\theta}\,\frac{\theta}{\theta + B_\theta} + \frac{B_\alpha}{\theta}\,\frac{\theta}{\theta + B_\theta} \quad (2.6)$$
For given values of α and θ, if $|\theta| < |\theta + B_\theta|$ then $0 \le \frac{\theta}{\theta + B_\theta} \le 1$. Clearly, if there is no bias in the estimates of α or θ, i.e., $B_\alpha = 0$ and $B_\theta = 0$, the fuzzy design RD estimator will be consistent for the true treatment effect. If $B_\alpha = 0$ and $B_\theta \neq 0$, the estimator will suffer an attenuation bias and tests of the null hypothesis that the treatment is unimportant will be conservative. If $B_\alpha \neq 0$ and $B_\theta = 0$, the estimator's bias is similar to the one seen in the sharp RD design, only scaled by $\frac{1}{\theta}$. Finally, if $B_\alpha \neq 0$ and $B_\theta \neq 0$, any increase in $B_\alpha$ increases the bias in the ATE estimator, but there will be a trade-off regarding the size of $B_\theta$ since its impacts on the first and second terms go in opposite directions.

All the terms that appear in the asymptotic distribution above, except for $C_{\alpha\theta}$, can be obtained from theorem 6 by using the local polynomial estimators discussed in section 2.3. It is necessary to specify $C_{\alpha\theta}$ in order to obtain the asymptotic distribution of the estimator in the fuzzy RD design.

Theorem 8 Suppose $\sigma_{\varepsilon\eta}(x) = E[\varepsilon\eta \mid X = x]$ is continuous for $x \neq \bar{x}$, $x \in \aleph$, and the left and right-hand limits at $\bar{x}$ exist. If $\hat{\alpha}$ and $\hat{\theta}$ are the local polynomial estimators and the conditions of theorem 6 hold for both estimators, then
$$C_{\alpha\theta} = e_1' \left( \Gamma^{*\,-1}_{+} \Delta^{\rho}_{+} \Gamma^{*\,-1}_{+} + \Gamma^{*\,-1}_{-} \Delta^{\rho}_{-} \Gamma^{*\,-1}_{-} \right) e_1$$
where
$$\Delta^{\rho}_{+(-)} = \begin{pmatrix} \rho_0^{+(-)} & \cdots & \rho_p^{+(-)} \\ \vdots & \ddots & \vdots \\ \rho_p^{+(-)} & \cdots & \rho_{2p}^{+(-)} \end{pmatrix},$$
$\rho_j^{+} = \int_0^\infty k^2(u)\,u^j \sigma_{\varepsilon\eta}(\bar{x}+uh) f_o(\bar{x}+uh)\,du$, $\rho_j^{-} = (-1)^j \int_0^\infty k^2(u)\,u^j \sigma_{\varepsilon\eta}(\bar{x}-uh) f_o(\bar{x}-uh)\,du$, and $\Gamma^*_{+}$ and $\Gamma^*_{-}$ are defined as in the previous corollaries.

As h → 0, the standard small-h asymptotic covariance is the same as the one in theorem 8, as one would expect.

Corollary 4 Letting h → 0 in the expressions of theorem 8, the asymptotic covariance, $C_{\alpha\theta}$, obtained by fixed-h (theorem 8) and small-h (Porter, 2003) are the same:
$$C_{\alpha\theta} = \frac{\sigma^{+}_{\varepsilon\eta}(\bar{x}) + \sigma^{-}_{\varepsilon\eta}(\bar{x})}{f_o(\bar{x})}\; e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1$$
Also, a result similar to corollary 3 is readily available.

Corollary 5 Under the assumptions in theorem 8, if h > 0 and, in the bandwidth around the cutoff, $f_o(x)$ and $\sigma_{\varepsilon\eta}(x)$ are constant, then the asymptotic covariance, $C_{\alpha\theta}$, obtained by fixed-h (theorem 8) and small-h (Porter, 2003) are the same.

In the case of the Nadaraya-Watson estimator, the asymptotic covariance simplifies in similar fashion to the asymptotic variance in case 4 and provides intuition about the refinements obtained by the fixed-h asymptotic distribution relative to small-h. Those improvements arise from incorporating the behavior of $\sigma_{\varepsilon\eta}(x)$ and $f_o(x)$ in the range around the cutoff, while small-h does not.

Case 5 For the Nadaraya-Watson estimator we have
$$C_{\alpha\theta} = \frac{\int_0^\infty k^2(u)\,\sigma_{\varepsilon\eta}(\bar{x}+uh) f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty k(u)\, f_o(\bar{x}+uh)\,du\right)^2} + \frac{\int_0^\infty k^2(u)\,\sigma_{\varepsilon\eta}(\bar{x}-uh) f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty k(u)\, f_o(\bar{x}-uh)\,du\right)^2}$$
If the rectangular kernel is used, $C_{\alpha\theta}$ simplifies to
$$C_{\alpha\theta} = \frac{\int_0^\infty \sigma_{\varepsilon\eta}(\bar{x}+uh) f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty f_o(\bar{x}+uh)\,du\right)^2} + \frac{\int_0^\infty \sigma_{\varepsilon\eta}(\bar{x}-uh) f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty f_o(\bar{x}-uh)\,du\right)^2}$$

2.6 Variance Estimators

To perform inference about α using the information in a given sample, appropriate estimates of the unknown terms in the asymptotic variance formulas from theorem 6 are necessary.
Note that the components of the asymptotic variance of $\sqrt{nh}\,(\hat{\alpha}_p - \alpha^*_p)$ can be written as expectations of population quantities and estimated using sample analogues. We have
$$\gamma_j^{+} = E\left[h^{-1} k\!\left(\tfrac{x - \bar{x}}{h}\right) \left(\tfrac{x - \bar{x}}{h}\right)^{\!j} d\right], \quad \gamma_j^{-} = E\left[h^{-1} k\!\left(\tfrac{x - \bar{x}}{h}\right) \left(\tfrac{x - \bar{x}}{h}\right)^{\!j} (1 - d)\right],$$
$$\delta_j^{+} = E\left[h^{-1} k^2\!\left(\tfrac{x - \bar{x}}{h}\right) \left(\tfrac{x - \bar{x}}{h}\right)^{\!j} d\,\varepsilon^2\right], \quad \delta_j^{-} = E\left[h^{-1} k^2\!\left(\tfrac{x - \bar{x}}{h}\right) \left(\tfrac{x - \bar{x}}{h}\right)^{\!j} (1 - d)\,\varepsilon^2\right]$$
Then, define the sample analog estimators of those quantities as
$$\hat{\gamma}_j^{+} = (nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) \left(\tfrac{x_i - \bar{x}}{h}\right)^{\!j} d_i, \quad \hat{\gamma}_j^{-} = (nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) \left(\tfrac{x_i - \bar{x}}{h}\right)^{\!j} (1 - d_i),$$
$$\hat{\delta}_j^{+} = (nh)^{-1} \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) \left(\tfrac{x_i - \bar{x}}{h}\right)^{\!j} d_i\,\hat{\varepsilon}_i^2, \quad \hat{\delta}_j^{-} = (nh)^{-1} \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) \left(\tfrac{x_i - \bar{x}}{h}\right)^{\!j} (1 - d_i)\,\hat{\varepsilon}_i^2;$$
these are consistent by standard arguments based on the law of large numbers. The residuals used in these estimators depend on the order of the local polynomial used to estimate the average treatment effect of interest and are given by
$$\hat{\varepsilon}_i = y_i - d_i \left( \hat{\alpha}_{p+} + \hat{\beta}_{1,p+}(x_i - \bar{x}) + \cdots + \hat{\beta}_{p,p+}(x_i - \bar{x})^p \right) - (1 - d_i) \left( \hat{\alpha}_{p-} + \hat{\beta}_{1,p-}(x_i - \bar{x}) + \cdots + \hat{\beta}_{p,p-}(x_i - \bar{x})^p \right)$$
Even though these estimators require the calculation of 4(2p + 1) terms6 to obtain the plug-in estimator of the fixed-h variance-covariance matrix,
$$\hat{\Gamma}^{*\,-1}_{+} \hat{\Delta}^{*}_{+} \hat{\Gamma}^{*\,-1}_{+} + \hat{\Gamma}^{*\,-1}_{-} \hat{\Delta}^{*}_{-} \hat{\Gamma}^{*\,-1}_{-}, \quad (2.7)$$
these are very simple averages of the data and kernel weights.

6 In fact, since we are interested only in the estimate of the ATE, $\hat{\alpha}$, one can potentially estimate only the terms of both matrices that show up in the element at the [1, 1] position of the variance-covariance matrix for the estimators.
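A sketch of the plug-in calculation: build the sample analogues $\hat{\gamma}_j$ and $\hat{\delta}_j$, assemble $\hat{\Gamma}^*$ and $\hat{\Delta}^*$ on each side, and read off the [1,1] element of the sandwich in (2.7). The function name and the Bartlett kernel default are mine, and the residuals are assumed to come from a prior local polynomial fit.

```python
import numpy as np

def fixed_h_variance(x, resid, cutoff, h, p=1,
                     kernel=lambda u: np.maximum(1.0 - np.abs(u), 0.0)):
    """Plug-in fixed-h variance for the order-p local polynomial sharp-RD
    estimator: e1'(G+^-1 D+ G+^-1 + G-^-1 D- G-^-1)e1, built from the
    sample analogues gamma_j-hat and delta_j-hat, j = 0, ..., 2p."""
    n = x.size
    u = (x - cutoff) / h
    k = kernel(u)
    d = (x >= cutoff).astype(float)
    out = 0.0
    for side in (d, 1.0 - d):
        # u^j carries the (-1)^j sign automatically on the side below the cutoff
        g = [(k * u**j * side).sum() / (n * h) for j in range(2 * p + 1)]
        dl = [(k**2 * u**j * side * resid**2).sum() / (n * h)
              for j in range(2 * p + 1)]
        G = np.array([[g[i + j] for j in range(p + 1)] for i in range(p + 1)])
        D = np.array([[dl[i + j] for j in range(p + 1)] for i in range(p + 1)])
        Gi = np.linalg.inv(G)
        out += (Gi @ D @ Gi)[0, 0]
    return out  # estimate of V_fixed-h; Var(alpha_hat) is this divided by n*h

# Example with hypothetical homoskedastic data (true errors stand in for residuals):
rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, 20000)
eps = rng.normal(0.0, 1.0, x.size)
print(fixed_h_variance(x, eps, cutoff=0.0, h=0.3, p=1))
```

Because the sample analogues use $u_i = (x_i - \bar{x})/h$ directly, the $(-1)^j$ factors in the population $\gamma_j^-, \delta_j^-$ are generated automatically by the negative values of $u_i$ below the cutoff.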
Porter (2003) suggests an estimator for the variance of $\hat{\alpha}$ using the small-h approximation in corollary 2, which requires only estimation of the conditional variance of the errors at the cutoff, approached from the right and from the left, and of the density of x at the cutoff:7
$$\hat{\sigma}^{2+}(\bar{x}) = \frac{(nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\,\hat{\varepsilon}_i^2}{\tfrac{1}{2}\, \hat{f}_o(\bar{x})}, \quad (2.8)$$
$$\hat{\sigma}^{2-}(\bar{x}) = \frac{(nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\,\hat{\varepsilon}_i^2}{\tfrac{1}{2}\, \hat{f}_o(\bar{x})}, \quad (2.9)$$
$$\hat{f}_o(\bar{x}) = (nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right); \quad (2.10)$$
then,
$$\frac{\hat{\sigma}^{2+}(\bar{x}) + \hat{\sigma}^{2-}(\bar{x})}{\hat{f}_o(\bar{x})}\; e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1 \quad (2.11)$$
is the estimator of the asymptotic variance. This small-h variance estimator avoids estimating each component of the matrices by assuming h → 0, which is similar to assuming that $f_o(x)$ and $\sigma^2(x)$ are constant in the bandwidth around the cutoff, as shown in corollary 3. The matrix $\Gamma^{-1} \Delta \Gamma^{-1}$ can be calculated directly because it is a deterministic function of the kernel. A drawback of the variance estimator in formula (2.11) is the need to estimate $f_o(\bar{x})$, which is not necessary if one uses the fixed-h variance estimator in formula (2.7). To obtain $\hat{f}_o(\bar{x})$ we need to choose a kernel and a bandwidth for the density estimator, increasing the number of tuning parameters to be chosen. A natural choice is the kernel and bandwidth used in the estimation of the parameter of interest. In section 2.7, I present evidence that using the same bandwidth not only saves one the trouble of choosing another bandwidth, but also provides more reliable inference than choosing a bandwidth that differs from the one used to estimate τ.

For the Nadaraya-Watson estimator, the variance estimator simplifies greatly.

7 The estimator presented in formula (2.11) is not exactly the one presented in Porter (2003). He never suggested a specific estimator for $f_o(\bar{x})$, so I chose the standard Rosenblatt-Parzen kernel estimator for $f_o(\bar{x})$ presented in Pagan and Ullah (1999).
Case 6 The fixed-h estimator of the asymptotic variance for the Nadaraya-Watson estimator is given by
$$\frac{(nh)^{-1} \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\,\hat{\varepsilon}_i^2}{\left[(nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\right]^2} + \frac{(nh)^{-1} \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\,\hat{\varepsilon}_i^2}{\left[(nh)^{-1} \sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\right]^2}$$
$$= \frac{(nh) \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\,\hat{\varepsilon}_i^2}{\left[\sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) d_i\right]^2} + \frac{(nh) \sum_{i=1}^{n} k^2\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\,\hat{\varepsilon}_i^2}{\left[\sum_{i=1}^{n} k\!\left(\tfrac{x_i - \bar{x}}{h}\right) (1 - d_i)\right]^2}$$
Let $n_l$ and $n_u$ be the number of observations used below and above the cutoff, respectively, and let $\bar{\varepsilon}^2_u = n_u^{-1} \sum_{\bar{x} \le x_i \le \bar{x}+h} \hat{\varepsilon}_i^2$ and $\bar{\varepsilon}^2_l = n_l^{-1} \sum_{\bar{x}-h \le x_i \le \bar{x}} \hat{\varepsilon}_i^2$. For the case of the rectangular kernel, the expression above simplifies to
$$\frac{nh}{n_u}\, \bar{\varepsilon}^2_u + \frac{nh}{n_l}\, \bar{\varepsilon}^2_l$$
and, if $n_u = n_l$, simplifies further to
$$\frac{2nh}{n_u + n_l} \left( \bar{\varepsilon}^2_u + \bar{\varepsilon}^2_l \right) \quad (2.12)$$
which is the estimator proposed for the asymptotic variance by Imbens and Lemieux (2008) in the local linear case, adapted for the Nadaraya-Watson estimator. Imbens and Lemieux (2008) propose the plug-in estimator in formula (2.12) for $\frac{\sigma^{2+}(\bar{x}) + \sigma^{2-}(\bar{x})}{f_o(\bar{x})}$ and obtain their estimate of the asymptotic variance of the local linear estimator by scaling it by $e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1$. Note that $e_1' \Gamma^{-1} \Delta \Gamma^{-1} e_1$ equals 4 for the local linear estimator and 1 for the Nadaraya-Watson estimator. If higher polynomial orders are used in the estimator, the only change in the formula for the variance estimator is the scaling term. In fact, both small-h variance estimators (Porter, 2003; Imbens and Lemieux, 2008) are based on an estimate of $\frac{\sigma^{2+}(\bar{x}) + \sigma^{2-}(\bar{x})}{f_o(\bar{x})}$ and a scaling parameter that depends on the order of the polynomial and the kernel used in the estimation of the parameter of interest. In section 2.7 I present simulation evidence on test coverage using the variance estimators in formulas (2.7) and (2.11).

2.7 Simulations

This section presents simulation evidence displaying the empirical coverage of a standard t-statistic used to perform inference about the treatment effect of interest. All simulations are based on a sharp RD design.
The objective of the simulations is to evaluate the relative performance of the asymptotic distributions obtained by fixed-h and small-h regarding inference about the parameter of interest. As shown by corollary 3, assuming that h → 0 provides asymptotic variance and bias approximations that are equal to the ones obtained by assuming that the probability density function of X and the conditional variance of the outcome are constant in the bandwidth around the cutoff. In fact, one would reasonably expect the approximations to be similar for bandwidth values close to zero. Evidence from the simulations presented below indicates that inference about the treatment effect of interest using the fixed-h theoretical approximation has better size behavior than the small-h approach, especially for larger bandwidths. Simulations using feasible estimators of the asymptotic variance indicate that tests based on the fixed-h approach can improve over tests based on small-h, especially for larger bandwidths and when some forms of heteroskedasticity are present; however, fixed-h can show slightly worse size behavior in tests that use small bandwidths.

Let X be the running variable, Y the outcome variable for which we would like to estimate the average treatment effect at the cutoff, and u the error term. The details of the simulations are listed below.

• Sample size (n): 750
• Number of replications of the experiment: 2,000
• X is drawn from a Normal(50, 100)
• u is drawn from a Normal(0, 1)
• The cutoff for receiving treatment is $\bar{x} = 55$.
• The treatment variable is defined as $d_i = \mathbf{1}[x_i \ge \bar{x}]$
• The bandwidths range from 0.2 to 20, i.e., from 1/50 to 2 standard deviations of the running variable.

The empirical coverages presented are the fraction of rejections in the 2,000 repetitions for a test of size 5% (two-sided). I analyze 5 different data generating processes (DGPs) for the outcome variable, Y.
• DGP 1: $y_i = \mu + \alpha d_i + u_i$
• DGP 2: $y_i = \mu + \beta_1 x_i + \alpha d_i + u_i$
• DGP 3: $y_i = \mu + \beta_1 x_i + \beta_2 x_i^2 + \alpha d_i + u_i$
• DGP 4: $y_i = \mu + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \alpha d_i + u_i$
• DGP 5: $y_i = \exp\!\left(\tfrac{x_i}{20}\right) + \alpha d_i + u_i$

The true values of the parameters are µ = 3, α = 10, β1 = 0.5, β2 = −0.005, β3 = 0.00002. Two estimators of the parameter of interest α are used in the simulations: the first is the Nadaraya-Watson estimator presented in case 1 and the second is the widely used local linear estimator presented in case 3. For both estimators I use the Bartlett kernel.8 The next subsection compares the test coverages obtained by the theoretical fixed-h and small-h asymptotic distributions derived in theorem 6 and corollary 2. The results obtained are infeasible since they depend on knowledge of $f_o(x)$, $\sigma^2(x)$ and $m(x)$ around the cutoff. Nevertheless, they demonstrate the theoretical improvements that fixed-h provides over small-h asymptotics. Subsection 2.7.2 compares the empirical coverages obtained with (feasible) estimated standard errors.

8 Similar results were obtained when using a rectangular kernel and a truncated Gaussian kernel (weighted so that the kernel integrates to one). They are available from the author upon request.

2.7.1 Simulations for Infeasible Inference

Nadaraya-Watson Estimator

The first set of figures9 shows the empirical coverage of the test of the (true) null hypothesis that α = 10 when the Nadaraya-Watson estimator is used to obtain $\hat{\alpha}$ and the (infeasible) variances for $\sqrt{nh}\,(\hat{\alpha} - \alpha^*)$ presented in theorem 6 and corollary 2 are used. For DGP 1, shown in figure B.1, the dependent variable does not depend on X directly and no bias is expected in the estimates of α using the Nadaraya-Watson estimator, since the relationship between Y and X would be correctly captured even for larger bandwidths.

9 See appendix.
As expected, the empirical sizes of the tests using the fixed-h and small-h standard error approximations behave very similarly for small bandwidths, but the differences increase with the bandwidth, suggesting that the fixed-h asymptotic distribution presented in theorem 6 provides a better approximation to the behavior of the estimator α̂. Figures B.2, B.3 and B.4 refer to DGP 2, in which Y is linearly related to X. In general, a large bias in the Nadaraya-Watson estimate is expected to arise for any bandwidth away from zero, since the estimator does not capture the relationship between Y and X. Hence, the estimates erroneously attribute differences in m(x) above and below the cutoff to the treatment or policy. The steep decline in the empirical coverage in figure B.2 reflects the deleterious effects of the bias on the estimate and on inference. This effect overwhelms the gains obtained by the better approximation for the variance of the estimates. Nevertheless, since the DGP, fo(x) and σ²(x) are known, I can obtain a bias approximation for this estimator using B_NW for fixed-h and B_small-h with p = 0 for small-h.10 Figure B.3 shows the empirical coverage for the (infeasible) bias-corrected test. To better understand the roles of the improved bias and standard error approximations separately, figure B.4 adds the empirical coverage that would be obtained if the small-h bias and the fixed-h variance approximation were used to obtain the test statistic, and vice-versa. In this case, the majority of the improvement is due to the better (infeasible) approximation of the bias, but the more precise calculation of the standard error provides a non-trivial improvement.

10 Formulas (2.6) and (2.5), respectively.
The results for the remaining DGPs are qualitatively similar to the ones observed for DGP 2 and the graphs are omitted for brevity.11 The "speed" with which the bias becomes a problem for inference varies depending on the DGP, but in general it becomes relevant for relatively small bandwidths. To be fair, the comparisons between the Nadaraya-Watson bias approximations under fixed-h and small-h are not adequate for any true model in which the relationship of y and x could be described by a polynomial of order higher than linear (or of order higher than p + 1 in the local polynomial case), since the small-h approximation (Porter, 2003) describes the bias as a function of derivatives of m(x), limiting its accuracy for more complex functional forms. Nevertheless, from the simulations it is clear that the fixed-h approximation for the bias developed in theorem 6 describes the asymptotic behavior of the estimator better than the small-h bias given by corollary 2. In summary, for all the simulations we have evidence that the asymptotic distribution of √(nh)(α̂ − α*) is better described by the fixed-h approach developed in theorem 6, which explicitly considers the effects of the choice of bandwidth, than by the standard small-h asymptotic approximation, which assumes that h → 0. The gains in the approximation are, as one would expect, larger for bandwidths further away from zero. The importance of the asymptotic bias is substantial in the Nadaraya-Watson estimator's case and serves as cautionary evidence of the risks of dismissing the presence of bias in the estimation by arguing some "undersmoothing" in the choice of bandwidth. The bias can be greatly reduced by the use of local polynomial estimators (see next subsection).

11 The graphs are available from the author upon request.

Local Polynomial Estimator

This section presents simulations in which the "empiricist's favorite" local linear estimator is used.
This estimator has been a staple of the applied literature that uses RD designs and has been shown to have nice theoretical bias reduction properties. For DGPs 1 and 2, no bias is expected in the estimates, since the local linear estimator correctly captures the relationship between Y and X inside any of the bandwidths used. The empirical coverage for DGP 2 is presented in figure B.5.12 For smaller bandwidths, the use of the small-h asymptotic variance generates empirical coverages similar to the ones obtained using the refined fixed-h variance approximation, but there is a significant decrease in the small-h coverage as the bandwidth increases, with the fixed-h approach outperforming the standard approximation in both DGPs 1 and 2. The improvement increases with the bandwidth size, as one would expect. For the remaining DGPs (X has a quadratic, cubic or exponential relationship to Y), both the asymptotic bias and variance approximations are relevant.13 Figures B.6, B.7 and B.8 show the empirical coverage under DGP 3. Figure B.6 compares the test coverages using the fixed-h versus small-h standard error approximations while ignoring the bias. It is clear that the general pattern observed up to this point remains, with fixed-h outperforming small-h, especially for larger bandwidths. Figure B.7 graphs the empirical coverage obtained by (infeasible) bias-corrected tests, and figure B.8 separates the gains due to the bias and variance refinements by adding the graphs of the "counterfactual" coverages

12 The empirical coverages for DGPs 1 and 2 are very similar. Only DGP 2's graph is reported here.

13 As discussed in section 2.5, the local polynomial estimator could be analyzed in a parametric framework as the problem of estimating a (potentially) misspecified model (White, 1982, 1996).
that calculate the test statistics using the small-h bias and fixed-h variance approximations, and vice-versa.14 For DGPs 3 and 4, the bias does not seem to be greatly important, being successfully reduced by the use of the local linear estimator. Naturally, then, the difference in bias approximations is not the main source of improvement in the empirical coverage, as can be seen in figure B.8. This contrasts with the results for the same DGPs using the Nadaraya-Watson estimator. In that case, the bias had a large effect on the test's coverage and the majority of the gains associated with the use of fixed-h asymptotics were due to the bias refinement. For DGP 5, even though the bias is substantially mitigated by the use of the local linear estimator, the empirical coverage is still significantly reduced for bandwidths larger than one standard deviation of the running variable (h = 10). As in the case of the Nadaraya-Watson estimator, the bias is an important component of the improvement in coverage, and the infeasible fixed-h bias correction correctly captures the bias and provides coverages that outperform the small-h approach. As described in section 2.5, the refinements obtained by fixed-h are due to considering the behavior of fo(x) and σ²(x) inside the bandwidth, while small-h in effect ignores it. Note that the previous simulations were based on DGPs with homoskedastic errors. In the presence of heteroskedasticity, one would expect the improvements of the fixed-h approximation to be even more important. To exemplify the distortions heteroskedasticity can create and how well the fixed-h asymptotic approximation can capture it, I have simulated empirical coverages for two heteroskedastic cases. In the first and second cases the standard deviation of the error term is defined as σ(x) = 1 + 0.25x² and σ(x) = 1 + 0.25(x − x̄)², respectively.15 The (infeasible) tests based

14 The graphs for DGPs 1, 4 and 5 are available from the author upon request.
on the fixed-h asymptotic approximation behave very well in both cases, highlighting its robustness. In the first case, for DGPs 2 and 3 (figures B.9 and B.10), the small-h asymptotic approximation holds up relatively well, with a pattern similar to the one obtained in the homoskedastic case. In contrast, in the second case, shown in figures B.11 and B.12, the small-h based test shows a steep decline16 in coverage as the bandwidth increases, since it is not able to properly capture the effect of the heteroskedasticity in its asymptotic variance. The difference in the small-h performance between the two cases provides useful intuition for when its weaknesses can prove most relevant. The second case was designed to be a "worst case scenario" heteroskedasticity for small-h asymptotics, since the conditional variance of the error at the cutoff, σ²(x̄), is at the extreme of the range of values assumed by σ²(x) in any given bandwidth. As can be seen from formula (2.4), the small-h and fixed-h asymptotic variances will be more similar the closer σ²(x̄) is to the "weighted average" of σ²(x) inside the bandwidths. In the first case, since σ²(x̄) is in the "middle" of the range of the conditional variance, the distortion produced by the heteroskedasticity is less marked than in the second case. Some points are worth emphasizing. First, the general pattern is that, as expected, the empirical coverages obtained using the fixed-h results from theorem 6 outperform the small-h approximations, especially for larger bandwidths. Second, both the refined asymptotic variance and asymptotic bias calculations improve the precision of inference relative to the standard approach.

15 These examples are not intended to be representative of any empirical problem commonly faced in the applied literature; they are intended to highlight the behavior of the fixed-h and small-h approximations in different heteroskedastic contexts.
For smaller bandwidths small-h asymptotics provide coverages similar to the fixed-h approach, making clear that the core difference is due to the suitability of the restrictions imposed on fo(x), σ²(x) and m(x) as the bandwidth increases (corollary 3). Naturally, those restrictions tend to be less realistic for larger bandwidths.

16 Note the change in the scale of the y-axis, which now encompasses the interval from 0 to 1.

Third, the use of the local linear estimator significantly reduces the coverage distortion17 introduced by the bias present in the estimates, even when the linear approximation does not fully capture the local relationship between Y and X. This is in line with the results in Porter (2003) and justifies the reliance on the local linear estimator in applications. Fourth, in the presence of heteroskedasticity, the small-h asymptotic approximation can have very poor performance, while the fixed-h approach still provides a reliable asymptotic approximation for the estimator's behavior.

2.7.2 Simulations for Feasible Inference

The simulations in the previous subsection established that the fixed-h asymptotic distribution approximations based on theorem 6 improve over the usual approximations in the literature, delivering better test size behavior by incorporating the researcher's choice of bandwidth into the formulas for the asymptotic variance and bias. In obtaining those results I used knowledge about the true DGP that is unavailable to the practitioner when implementing such estimators. As described in section 2.6, natural estimators for the asymptotic variance of the parameters of interest are readily available and can be easily calculated for a given sample. This section presents simulations of the empirical coverage of the tests using two different estimated standard errors. The first is based on the fixed-h asymptotic distribution and is given by formula (2.7). The second is proposed by Porter (2003) and described by formula (2.11).
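The mechanics of such a size experiment can be sketched as follows (my own illustrative code; the variance plug-in below is a generic kernel-weighted-mean variance, not formula (2.7) or (2.11) from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def bartlett(u):
    return np.maximum(1.0 - np.abs(u), 0.0)

def side_stats(x, y, cutoff, h, above):
    # Kernel-weighted mean on one side of the cutoff and a plug-in
    # variance for it; a generic stand-in, NOT formula (2.7) or (2.11).
    mask = (x >= cutoff) if above else (x < cutoff)
    w = bartlett((x[mask] - cutoff) / h)
    m = np.sum(w * y[mask]) / np.sum(w)
    v = np.sum((w * (y[mask] - m)) ** 2) / np.sum(w) ** 2
    return m, v

def rejection_rate(alpha0=10.0, n=750, reps=200, h=5.0, cutoff=55.0):
    # Fraction of two-sided 5% rejections of the true null under DGP 1
    rej = 0
    for _ in range(reps):
        x = rng.normal(50.0, 10.0, n)
        y = 3.0 + alpha0 * (x >= cutoff) + rng.normal(0.0, 1.0, n)
        m_p, v_p = side_stats(x, y, cutoff, h, True)
        m_m, v_m = side_stats(x, y, cutoff, h, False)
        t = (m_p - m_m - alpha0) / np.sqrt(v_p + v_m)
        rej += abs(t) > 1.96
    return rej / reps

rate = rejection_rate()
print(rate)  # should be in the vicinity of the 5% nominal size
```

The dissertation's experiments use 2,000 replications and the specific variance formulas above; this sketch only shows the structure of the rejection-rate calculation.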
Figures B.13 and B.14 present the empirical coverages for the tests based on the Nadaraya-Watson estimator for DGPs 1 and 2 described above.18 They also include the coverages obtained by the (infeasible) theoretical formulas,19 so one can compare the feasible coverage to the infeasible coverage in section 2.7.1. Figures B.15 and B.16 perform the same exercise using the local linear estimator for DGPs 2 and 3. In figure B.13,20 it is clear that even though both tests tend to overreject for small bandwidths, due to the small amount of data available in those cases, the coverages obtained with the fixed-h variance estimators provide meaningful improvements for larger bandwidths over the tests based on small-h variance estimators. For DGPs 2 through 5, the presence of a strong bias overwhelms the tests even for relatively small bandwidths, as expected given the results from section 2.7.1. Nevertheless, the general pattern that the variance estimators based on formula (2.7) reflect the theoretical gains is maintained. Similarly, when the local linear estimator is used, the empirical coverage obtained using the fixed-h standard error estimator incorporates the gains of improved inference described in the theory and shown in the infeasible simulations, even for large bandwidths. As in the Nadaraya-Watson case, the tests overreject for very small bandwidths, probably due to the relatively small amount of data available in these cases, but show very good size behavior for larger bandwidths.

17 Note the change in the y-axis' range relative to most cases presented in the Nadaraya-Watson simulations.

18 The graphs for DGPs 3, 4 and 5 in the Nadaraya-Watson estimator case and 1, 4 and 5 in the local linear estimator case are available from the author upon request.

19 Those lines, named fixed-h and small-h, are exactly the same as presented in figures B.1 and B.2, respectively.

20 In figure B.13, the Nadaraya-Watson estimator is used for DGP 1, correctly capturing the relationship between Y and X around the cutoff. As discussed in section 2.7.1, there is no asymptotic bias in this case.
Surprisingly, when the local linear estimator is used, the tests obtained using small-h standard error estimates behave very similarly to the fixed-h ones, especially for larger bandwidths, for which the results in section 2.7.1 would lead one to expect a significantly smaller coverage based on the small-h approach. In fact, the small-h estimators provide tests with better size behavior for small (close to zero) bandwidths, due to the fact that fixed-h standard errors require the estimation of several more terms for the components of the asymptotic variance, suffering more acutely from the restricted amount of data in the smaller bandwidths. It seems that the small-h variance estimators benefit from the fact that, in practice, one cannot actually restrict the bandwidth too close to zero. Since the estimator for the standard errors sums across the ε̂ᵢ², it (partially) captures the behavior of fo(x) in the range around the cutoff that the small-h asymptotic approximation ignores by forcing h → 0. As discussed in section 2.7.1, the presence of heteroskedasticity can generate substantial problems for the size of tests using the theoretical small-h approximation. Figures B.17 through B.20 show simulations of the coverage of feasible tests using the fixed-h and small-h asymptotic variance estimators in the two heteroskedastic cases described in section 2.7.1.21 The results clearly show that, differently from the homoskedastic case, the fixed-h variance estimator produces better test sizes than the one based on the small-h approach.
In the first case (figures B.17 and B.18), the use of the fixed-h estimator in formula (2.7) provides better empirical size for bandwidths larger than 5. It is important to emphasize that, even though the small-h tests have better sizes for small bandwidths, both tend to overreject due to constrained data availability, and a researcher would be ill advised to use too small a bandwidth. In figures B.19 and B.20, the second case, the fixed-h variance estimator produces tests with coverage very close to the test's nominal size, while for small-h the coverage rapidly increases to 1 as the bandwidth increases. Hence, there is evidence that heteroskedasticity can be accurately captured by tests based on fixed-h asymptotic approximations, while small-h estimators can produce tests that perform substantially worse. It is possible that these results are somewhat dependent on the DGPs chosen and, even though the empirical coverages obtained are similar under either of the asymptotic variance estimators in some cases, it seems the fixed-h standard error estimator is a "safer choice" for practitioners, since it is based on a more robust asymptotic approximation and its computation is very easy once a kernel and bandwidth are chosen. Using standard error estimates based on small-h asymptotics can lead to serious size distortions for larger bandwidths, especially in the presence of heteroskedasticity, even in the absence of bias. Furthermore, the fixed-h variance estimator has the advantage of not requiring the estimation of fo(x), which entails the choice of a (potentially different) kernel and bandwidth for fo(x). The additional choice of these two tuning parameters might significantly alter the empirical size of the tests performed about τ and depends on the discretion of the researcher.

21 In the first case the standard deviation of the error term is defined as σ(x) = 1 + 0.25x²; in the second case ("worst case scenario") it is given by σ(x) = 1 + 0.25(x − x̄)².
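In code, estimating fo at the cutoff with its own bandwidth hf looks roughly as follows (an illustrative sketch of the extra tuning step, not the dissertation's formula (2.10); the function name `bartlett_kde` is my own):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(50.0, 10.0, 750)   # running variable, sd = 10
cutoff = 55.0

def bartlett_kde(x, point, hf):
    # Kernel density estimate of fo at `point` with its own bandwidth hf;
    # the Bartlett kernel K(u) = 1 - |u| on [-1, 1] integrates to one,
    # so the estimator is mean(K((x - point)/hf)) / hf.
    u = (x - point) / hf
    return np.mean(np.maximum(1.0 - np.abs(u), 0.0)) / hf

# The true density of Normal(50, 100) at the cutoff is about 0.0352;
# very large hf values smooth over a wide region of the density.
for hf in (1.0, 5.0, 10.0, 20.0):
    print(hf, bartlett_kde(x, cutoff, hf))
est5 = bartlett_kde(x, cutoff, 5.0)
```

The point of the sketch is simply that hf is a second, freely chosen tuning parameter, separate from the bandwidth h used to estimate the treatment effect.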
To exemplify this issue, figures B.21 and B.22 show the simulated empirical coverages obtained using the small-h variance estimator for DGPs 2 and 3,22 using the Bartlett kernel, in five different scenarios. The scenarios differ by the choice of the bandwidth, hf, used in formula (2.10) to estimate fo(x). The first reproduces the small-h result described above by choosing the same bandwidth used to estimate τ, i.e., hf = h; the other lines are the empirical coverages obtained using bandwidths of 1, 5, 10 and 20 for fo(x),23 independent of the bandwidth used for τ̂. The choice of bandwidth in the estimation of fo(x) can have a relevant impact on the test coverages. Interestingly, choosing the same bandwidth as used in estimating the parameter of interest provides more stable empirical coverages over a wide range of h relative to the cases in which the bandwidths are different. The cautious practitioner using the small-h variance estimator would be well advised to choose the same bandwidth for both estimators. One key problem (especially with the Nadaraya-Watson estimator) is how to deal with the bias in practice. The bias is a main contributor to the divergence between the empirical and nominal sizes of the tests being performed. Adequate estimation of the bias based on the results in theorem 6 would require observing or estimating the counterfactual conditional expectation of y around the cutoff in the absence of treatment, which is not available in most cases where the RD design is relevant.24

22 The graphs for DGPs 1, 4 and 5 are available from the author upon request.

23 1/10, 1/2, 1 and 2 standard deviations of the running variable, respectively.

2.8 Conclusion

Regression discontinuity designs have been widely used in recent years by researchers in economics to obtain estimates of the treatment effect, τ. Special attention has been given to the use of local polynomial estimators to obtain the ATE of interest.
The standard literature on RD designs (Hahn, Todd, and Van der Klaauw, 2001; Porter, 2003; Imbens and Lemieux, 2008) assumes that the bandwidth around the discontinuity, h, shrinks fast enough towards zero, h → 0, to guarantee identification of the parameter of interest (small-h asymptotics). This chapter derives, in the RD design context, a refined asymptotic distribution for the local polynomial estimators by treating h as fixed. Fixed-h asymptotics explicitly acknowledges that a researcher has to choose a bandwidth to implement the estimator and is usually bounded in the ability to reduce the bandwidth by data availability constraints. The fixed-h asymptotic distributions obtained are more precise and provide refined inference relative to asymptotic distributions based on the small-h approach. The fixed-h asymptotic approximation provides more precise formulas for both the bias and the variance of the estimators of interest (theorem 6). The standard small-h asymptotic bias and variance can be obtained by allowing h → 0 in the fixed-h distribution (corollary 2). Also, when h > 0, the small-h result for the variance of the estimators is equivalent to assuming that the density of the running variable and the conditional variance of the dependent variable are constant around the cutoff (corollary 3). Similar results are shown for both sharp and fuzzy RD designs.

24 The small-h asymptotic bias approximation (Porter, 2003) lends itself to estimation, since estimates for m^(p+1) above and below the cutoff can be obtained. However, the results in section 2.7.1 indicate that this would be a relatively poor approximation. Furthermore, to a large degree, bias reduction could be obtained by increasing the order of the local polynomial fitted above and below the cutoff, reducing the "misspecification" in the model (see section 2.5).
Simulations provide evidence that fixed-h asymptotic distributions more accurately describe the behavior of both the bias and the variance than the usual small-h results used in the literature. This is reflected in improved test size, especially when larger bandwidths are used. Simple feasible estimators for the refined, fixed-h, standard errors are provided and shown in simulations to incorporate the theoretical gains of the improved approximations. These estimators are simple to implement and have the advantage of not requiring the estimation of the density of the running variable at the discontinuity. The fixed-h variance estimators can improve markedly over small-h estimators in the presence of heteroskedasticity and should be generally preferred. Simulations using heteroskedastic errors provide evidence that feasible tests based on the fixed-h approach obtain better coverage, outperforming small-h starting at relatively small bandwidths. Interestingly, in the case of the widely used local linear estimator with homoskedastic errors, the variance estimators based on small-h asymptotics suggested in the literature produce well behaved tests with size performance similar to the fixed-h variance estimators, performing better than the standard theory would predict. In other words, when errors are homoskedastic (inside the bandwidth) the inability of the empiricist to mimic what theory suggests ends up improving the properties of the tests and their robustness to the choice of bandwidth, relative to what the theory that spawned those estimators would have provided. The results indicate that the fixed-h standard error estimator is a "safer choice" for practitioners, since it is based on a more robust asymptotic approximation and its computation is very easy once a kernel and bandwidth are chosen.
CHAPTER 3

Asymptotic Properties of Quantile Regression for Standard Stratified Samples

3.1 Introduction

Quantile regression (QR) has been widely used in the social sciences in recent decades, and provides a useful characterization of the distributional features of the variables in which the researcher is interested. In economics, for example, a very natural use of quantile regression has been to analyze the wage structure and potential differences in the determinants of observed wages at different levels of the wage distribution, e.g., Albrecht et al. (2003); Buchinsky (1998, 2001); Machado and Mata (2005); Martins and Pereira (2004) and Melly (2005). In those analyses it is very common to use datasets generated by stratified sampling. In standard stratified sampling (SS sampling) the population is divided into J mutually exclusive, exhaustive strata, and a random sample of size N_j is taken from stratum j. As described by Wooldridge (2001), when stratification is based on exogenous variables it usually does not cause serious problems: the usual estimators that ignore stratification are consistent and asymptotically normal, and the usual variance estimators are still valid. When stratification is based on endogenous variables, the standard unweighted estimators are generally inconsistent. Wooldridge (2001) studies the asymptotic properties of general weighted M-estimators under SS sampling, which are consistent and asymptotically normal, and provides estimators for the standard errors of the parameters of interest that can be used to perform inference under general stratification. However, those results are not directly applicable to the quantile regression case due to the nonsmoothness of the objective function that produces the QR estimates.
This chapter fills that gap: it extends the analysis to the quantile regression case, analyzes the asymptotic properties of the weighted QR estimates under general SS sampling, and provides consistent estimators for the standard errors that take the stratification of the data into account. Under exogenous stratification the usual unweighted QR estimators remain valid, as do their standard error estimates.

3.2 The Quantile Regression Population Problem

We are interested in estimating the conditional quantile function (CQF) of a random variable Y conditional on a vector of q explanatory variables X. This is defined by

Q_τ(Y | X) = inf{y : F_Y(y | X) ≥ τ}

where τ ∈ (0, 1) indexes the τth quantile of the conditional distribution of Y. Let the CQF be described by a known function g(·) of the parameters and the explanatory variables,

Q_τ(Y | X) = g(X, β_τo)

A special case of interest is given by the linear model1

Y = X′β_τo + ε, with Q_τ(ε | X) = 0.

Throughout this chapter I concentrate on the linear CQF, since it is the most widely used by practitioners and for ease of exposition. Nevertheless, the results presented are valid for a nonlinear, correctly specified CQF, g(·). In the population, β_τo solves the following problem:

min_{β_τ ∈ B} E[ρ_τ(Y − X′β_τ)]   (3.1)

where ρ_τ(u) = (τ − 1[u ≤ 0])u and B ⊂ R^K is the parameter space. Given a random sample of size n from the population, it is possible to obtain consistent estimates of β_τo by the standard quantile regression (QR) estimator,

min_{β_τ ∈ B} n^{-1} Σ_{i=1}^n ρ_τ(y_i − x_i′β_τ)   (3.2)

Note that the minimization problem has the following population first order condition and sample analogue (Buchinsky, 1998):

E[(τ − 1[y − x′β_τo ≤ 0])x] = 0
n^{-1} Σ_{i=1}^n (τ − 1[y_i − x_i′β_τ ≤ 0])x_i = 0   (3.3)

Hence, we can frame this problem as a GMM estimator that uses as moment conditions the first order conditions of the QR problem that identify β_τo.

1 This formulation assumes the error term is additive and, hence, separable. For a treatment of the more general formulation with (potentially) non-separable ε see Powell (1991).
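The check function and its first order condition can be illustrated numerically (my own sketch, not from the dissertation): for an intercept-only model, minimizing the sample analogue of (3.1) recovers the empirical τ-quantile, at which the sample moment (3.3) is approximately zero.

```python
import numpy as np

def check_fn(u, tau):
    # rho_tau(u) = (tau - 1[u <= 0]) * u, the check function in (3.1)
    return (tau - (u <= 0)) * u

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, 10_000)
tau = 0.75

# Minimizing the sample analogue of (3.1) for an intercept-only model
# recovers the empirical tau-quantile of y (grid search for transparency).
grid = np.linspace(-3.0, 3.0, 1201)
obj = np.array([check_fn(y - b, tau).mean() for b in grid])
b_hat = grid[np.argmin(obj)]

# The sample first order condition (3.3) is approximately zero at b_hat
moment = np.mean(tau - (y - b_hat <= 0))
print(b_hat, np.quantile(y, tau), moment)
```

The objective is piecewise linear and convex, which is exactly the nonsmoothness that prevents a direct application of standard M-estimation theory and motivates the GMM framing in the text.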
Under random sampling, the standard QR procedures can be used to estimate β_τo and perform inference.

3.2.1 Quantile Regression under Stratified Sampling

Suppose our sample is obtained by a standard stratification scheme as formally described by Wooldridge (2001). Assume that the population is divided into J nonempty, mutually exclusive, and exhaustive strata, W_1, W_2, ..., W_J, where J is a finite integer. Let w denote a random variable having the population distribution of interest.

Definition 5 (Standard stratified sampling) For j = 1, ..., J, draw a random sample of size N_j from stratum j. For each j, denote this random sample by {w_ij : i = 1, 2, ..., N_j}. The strata sample sizes N_j are nonrandom. Therefore, the total sample size, N = N_1 + ... + N_J, is nonrandom.

Notice that, for a given j, {w_ij : i = 1, 2, ..., N_j} is an independent, identically distributed (i.i.d.) sequence having the population distribution conditional on being part of stratum j, D(w | w ∈ W_j). Then one can rewrite the minimization problem and its moment conditions as

min_{β_τ ∈ B} E[ρ_τ(Y − X′β_τ)] = min_{β_τ ∈ B} Σ_{j=1}^J Q_j E[ρ_τ(Y − X′β_τ) | w ∈ W_j]

E[(τ − 1[y − x′β_τo ≤ 0])x] = Σ_{j=1}^J Q_j E[(τ − 1[y − x′β_τo ≤ 0])x | w ∈ W_j] = 0   (3.4)

where Q_j = P(w ∈ W_j), j = 1, ..., J. The sample analogue is

Σ_{j=1}^J Q_j [ N_j^{-1} Σ_{i=1}^{N_j} (τ − 1[y_ij − x_ij′β_τ ≤ 0]) x_ij ] = (1/N) Σ_{j=1}^J (Q_j/H_j) [ Σ_{i=1}^{N_j} (τ − 1[y_ij − x_ij′β_τ ≤ 0]) x_ij ] = 0

with H_j = N_j/N. This can be rewritten as

(1/N) Σ_{i=1}^N (Q_{j_i}/H_{j_i}) (τ − 1[y_i − x_i′β_τ ≤ 0]) x_i = 0   (3.5)

where j_i denotes the stratum of observation i. This is the empirical moment condition used to estimate the parameters of interest, defining the weighted quantile regression estimator.
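For an intercept-only model, the weighted moment condition (3.5) amounts to a weighted quantile with weights Q_j/H_j. The sketch below (my own illustration; `weighted_quantile` is a hypothetical helper) shows how endogenous SS sampling biases the unweighted estimator while the weighted one recovers the population median:

```python
import numpy as np

rng = np.random.default_rng(4)
tau = 0.5

# Endogenous SS sampling from an N(0,1) population: stratify on the
# outcome itself (W1: y < 0, W2: y >= 0) and heavily oversample W2.
Q = np.array([0.5, 0.5])     # population stratum shares Q_j
N = np.array([200, 800])     # stratum sample sizes N_j
H = N / N.sum()              # sample shares H_j = N_j / N

# By symmetry of N(0,1), -|Z| and |Z| are draws from D(y | y < 0)
# and D(y | y >= 0), respectively.
y = np.concatenate([-np.abs(rng.normal(size=N[0])),
                    np.abs(rng.normal(size=N[1]))])
w = np.repeat(Q / H, N)      # weights Q_{j_i} / H_{j_i}

def weighted_quantile(y, w, tau):
    # Solves the weighted moment condition (3.5) for an intercept-only
    # model: the smallest b whose weighted CDF reaches tau.
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    return y[order][np.searchsorted(cdf, tau)]

med_unw = np.median(y)                 # unweighted: pulled toward W2
med_w = weighted_quantile(y, w, tau)   # weighted: near the true median 0
print(med_unw, med_w)
```

The unweighted median is dragged toward the oversampled stratum, while reweighting each observation by Q_j/H_j restores the population quantile, which is the content of the consistency result cited next.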
This estimator is consistent for the parameters of interest under standard stratified sampling (Wooldridge, 2001, theorem 3.1).2 The asymptotic distribution of the weighted quantile regression estimator can be obtained by a direct application of Newey and McFadden (1994), theorem 7.1, with careful consideration of the formulation of Var[g(β_τo)] due to the stratified nature of the data.

Corollary 6 If the conditions in Newey and McFadden (1994) theorem 7.1 hold, {w_ij : i = 1, 2, ..., N_j; j = 1, 2, ..., J} follows the standard stratified sampling scheme, and N_j/N → H_j > 0 as N → ∞ for each j, then √N(β̂_τ − β_τo) ~a N(0, A_w^{-1} B_w A_w^{-1}), where

A_w = E[f_{y|x}(g(x, β_τo)) ġ ġ′]

B_w = Σ_{j=1}^J (Q_j²/H_j) Var[(τ − 1[y − g(x, β_τo) ≤ 0]) ġ | w ∈ W_j]

and ġ ≡ ∂g(x, β)/∂β |_{β=β_τo}.

In the special case of the linear CQF, g(X, β) = x′β,

A_w = E[f_{y|x}(x′β_τo) xx′] and B_w = Σ_{j=1}^J (Q_j²/H_j) Var[(τ − 1[y − x′β_τo ≤ 0]) x | w ∈ W_j]

Corollary 6 provides a general form for the asymptotic variance of the quantile regression estimator under standard stratification. Two main points are relevant when analyzing B_w. The first, which is general to the standard stratification literature, is that we cannot replace Var[(τ − 1[y_ij − g(x_ij, β_τo) ≤ 0]) ġ_i | w ∈ W_j] by the outer product of the score as in the random sampling case because, in general, E[(τ − 1[y − x′β_τo ≤ 0])x | w ∈ W_j] ≠ 0, as pointed out by Wooldridge (2001). It is also interesting to note that, differently from the standard results in quantile regression for random sampling, B_w does not simplify to τ(1 − τ)E[xx′] in this case.

2 As a minor point, note that if one wants to implement the weighted estimator by applying a standard quantile regression to weighted data, the weights for each observation will be given by Q_{j_i}/H_{j_i} instead of the (Q_{j_i}/H_{j_i})^{1/2} usually required when implementing least squares estimators and their variants.
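A quick numerical illustration of this point for an intercept-only model (my own sketch, not part of the dissertation): at the population median b = 0 of an N(0,1) outcome, the score indicator has different variances across strata defined on y, so no single τ(1 − τ) factor applies.

```python
import numpy as np

rng = np.random.default_rng(5)
tau, b = 0.5, 0.0                  # b = 0 is the population median of N(0, 1)
y = rng.normal(0.0, 1.0, 200_000)

# Strata defined on the outcome: W1 = {y < 1}, W2 = {y >= 1}
in_w2 = (y >= 1.0)
v1 = np.var(tau - (y[~in_w2] <= b))   # ~0.241, since P(y <= 0 | W1) != tau
v2 = np.var(tau - (y[in_w2] <= b))    # 0: the indicator is constant in W2
print(v1, v2)  # neither equals tau*(1 - tau) = 0.25
```

Because the indicator 1[y ≤ b] does not have probability τ within each stratum, the stratum-specific variances must be kept in B_w rather than collapsed into τ(1 − τ).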
That is due to the fact that the variance of the binary variable 1[y_ij − x_ij′β_τo ≤ 0] is not necessarily the same for each stratum; in other words, x_ij′β_τo will not represent the τth quantile in every stratum. If we assume that

E[(τ − 1[y − x′β_τo ≤ 0])x | w ∈ W_j] = 0, for j = 1, 2, ..., J,   (3.6)

then an alternative formula for B_w is available:

B_w = Σ_{j=1}^J (Q_j²/H_j) E[(τ − 1[y − x′β_τo ≤ 0])² xx′ | w ∈ W_j]

A sufficient condition for equation (3.6) is that the conditional distribution of Y is independent of the strata, in which case E[1[y − g(x, β_τo) ≤ 0] | x, w ∈ W_j] = τ for all j. Then

B_w = τ(1 − τ) Σ_{j=1}^J (Q_j²/H_j) E[xx′ | w ∈ W_j]

3.2.2 Quantile Regression Estimation under Exogenous Stratification

If we are modeling the quantiles of Y given X, and stratification is based solely on the conditioning variables X, stratification is said to be exogenous. Let the sample space of X be partitioned into J nonempty, mutually exclusive, and exhaustive strata, χ_1, χ_2, ..., χ_J, where J is a finite integer; then we can analyze the effects of stratification within the framework of the previous section. The weighted quantile regression estimator in equation (3.5) can be used whether or not the stratification is exogenous. Under exogenous stratification the unweighted quantile regression estimator is also consistent (Wooldridge, 2001).3 This estimator drops the weights Q_j/H_j, solving

(1/N) Σ_{i=1}^N (τ − 1[y_i − x_i′β_τ ≤ 0]) x_i = Σ_{j=1}^J H_j [ N_j^{-1} Σ_{i=1}^{N_j} (τ − 1[y_ij − x_ij′β_τ ≤ 0]) x_ij ] = 0   (3.7)

so that each stratum average is weighted simply by its sample frequency; i.e., this is the usual QR estimator that would be used under random sampling. Under exogenous stratification the unweighted quantile regression estimator has the asymptotic distribution described in the corollary below.
Corollary 7 If the conditions in Newey and McFadden (1994) theorem 7.1 hold, {w_ij : i = 1, 2, ..., N_j; j = 1, 2, ..., J} follows the standard stratified sampling scheme, stratification is a deterministic function of X, and N_j/N → H_j > 0 as N → ∞ for each j, then √N(β̂_τ − β_τo) ~a N(0, A_u^{-1} B_u A_u^{-1}), where

A_u = Σ_{j=1}^J H_j E[f_{y|x}(g(x, β_τo)) ġ ġ′ | x ∈ χ_j]

and

B_u = τ(1 − τ) Σ_{j=1}^J H_j E[ġ ġ′ | x ∈ χ_j]

where ġ ≡ ∂g(x, β)/∂β |_{β=β_τo}.

In the special case of the linear CQF, g(X, β) = x′β,

A_u = Σ_{j=1}^J H_j E[f_{y|x}(x′β_τo) xx′ | x ∈ χ_j]

and

B_u = τ(1 − τ) Σ_{j=1}^J H_j E[xx′ | x ∈ χ_j]

3 As emphasized in Wooldridge (2001), the consistency of the unweighted estimator under exogenous stratification holds if the conditional quantiles of Y are correctly specified. The unweighted estimator solves

min_{β_τ ∈ B} Σ_{j=1}^J H_j E[ρ_τ(Y − X′β_τ) | x ∈ χ_j]

whose solution is not necessarily β_τo under misspecification.

3.2.3 Sequence of Quantile Regressions

The discussion above considers only the estimation of a single quantile regression for a given value of τ, but one might be interested in estimating several quantile regressions at diverse points of the conditional distribution of Y. As emphasized by Buchinsky (1998), because the coefficients are estimated using the same data with different weighting schemes, the estimators will be correlated. Suppose we are still interested in estimating the linear conditional quantile function, now for p separate quantiles,

Y = X′β_τr + ε_τr, with Q_τr(ε_τr | X) = 0 for r = 1, ..., p.

Also, let 0 < τ_1 < τ_2 < ... < τ_p < 1 and β_τ = (β_τ1′, β_τ2′, ..., β_τp′)′. For each τ_r define

ψ_r(y, x, β_τr) ≡ τ_r − 1[y − x′β_τr ≤ 0]

and

Ψ(y, x, β_τ1, β_τ2, ..., β_τp) = x (ψ_1(y, x, β_τ1), ψ_2(y, x, β_τ2), ..., ψ_p(y, x, β_τp))

Hence

E[Ψ(y, x, β_τ1, β_τ2, ..., β_τp)] = Σ_{j=1}^J Q_j E[Ψ(y, x, β_τ1, β_τ2, ..., β_τp) | w ∈ W_j] = 0
By the analogy principle, the estimator $\hat\beta_\tau$ of $\beta_\tau$ solves
$$\frac{1}{N}\sum_{j=1}^{J}\frac{Q_j}{\hat H_j}\sum_{i=1}^{N_j}\Psi(y_{ij}, x_{ij}, \beta_{\tau_1}, \dots, \beta_{\tau_p}) = \frac{1}{N}\sum_{i=1}^{N}\frac{Q_{j_i}}{\hat H_{j_i}}\,\Psi(y_i, x_i, \beta_{\tau_1}, \dots, \beta_{\tau_p}) = 0,$$
which can be solved separately if no cross-quantile restrictions are imposed on $\beta_{\tau_1}, \dots, \beta_{\tau_p}$, or simultaneously otherwise. Then we can show that $\hat\beta_\tau$ follows an asymptotically multivariate normal distribution as well.

Corollary 8. If the conditions of Newey and McFadden (1994, Theorem 7.1) hold, $\{w_{ij}: i = 1, 2, \dots, N_j;\ j = 1, 2, \dots, J\}$ follows the standard stratified sampling scheme, and $N_j/N \to H_j > 0$ as $N \to \infty$ for each $j$, then $\sqrt N(\hat\beta_\tau - \beta_\tau) \overset{a}{\sim} N(0, \Lambda_\tau)$, where $\Lambda_\tau = [\Lambda_{\tau_{l,k}}]_{l,k=1,\dots,p}$ with typical element
$$\Lambda_{\tau_{l,k}} = E\!\left[f_{y|x}(g(x,\beta_{\tau_l}))\,\dot g_l\dot g_l'\right]^{-1}\left(\sum_{j=1}^{J}\frac{Q_j^2}{H_j}\operatorname{Cov}\!\left(\psi_l(y,x,\beta_{\tau_l}),\,\psi_k(y,x,\beta_{\tau_k}) \,\middle|\, w \in W_j\right)\right)E\!\left[f_{y|x}(g(x,\beta_{\tau_k}))\,\dot g_k\dot g_k'\right]^{-1},$$
where, in this case, $\psi_r(y,x,\beta_{\tau_r}) = \left(\tau_r - 1[y - g(x,\beta_{\tau_r}) \le 0]\right)\dot g_r$ for $r = l, k$ and $\dot g_l \equiv \partial g(x,\beta)/\partial\beta\,\big|_{\beta=\beta_{\tau_l}}$.

In the special case of the linear CQF, $g(X,\beta) = x\beta$,
$$\Lambda_{\tau_{l,k}} = E\!\left[f_{y|x}(x\beta_{\tau_l})\,xx'\right]^{-1}\left(\sum_{j=1}^{J}\frac{Q_j^2}{H_j}\operatorname{Cov}\!\left(\psi_l(y,x,\beta_{\tau_l}),\,\psi_k(y,x,\beta_{\tau_k}) \,\middle|\, w \in W_j\right)\right)E\!\left[f_{y|x}(x\beta_{\tau_k})\,xx'\right]^{-1},$$
with $\psi_r(y,x,\beta_{\tau_r}) = \left(\tau_r - 1[y - x\beta_{\tau_r} \le 0]\right)x$ for $r = l, k$.

Once again, the notable difference of this result relative to the usual QR under random sampling is that the center term of $\Lambda_{\tau_{l,k}}$ does not simplify as neatly as in the standard random sampling case, since the covariance of the binary variables $1[y_{ij} - x_{ij}\beta_{\tau_l} \le 0]$ and $1[y_{ij} - x_{ij}\beta_{\tau_k} \le 0]$ is not necessarily the same for each stratum.

3.3 Asymptotic Variance Estimation

To perform inference based on the parameter estimates $\hat\beta_\tau$, we need valid estimators of their asymptotic variance.
From Corollary 6 we have, for the linear CQF,
$$\operatorname{Var}\!\left(\sqrt N(\hat\beta_\tau - \beta_{\tau o})\right) = A_w^{-1} B_w A_w^{-1} = E\!\left[f_{y|x}(x\beta_{\tau o})xx'\right]^{-1}\left(\sum_{j=1}^{J}\frac{Q_j^2}{H_j}\operatorname{Var}\!\left(\left(\tau - 1[y - x\beta_{\tau o}\le 0]\right)x \,\middle|\, w\in W_j\right)\right)E\!\left[f_{y|x}(x\beta_{\tau o})xx'\right]^{-1}.$$
Natural estimators of $A_w$ and $B_w$, as suggested by Wooldridge (2001), are
$$\hat A_w \equiv \sum_{j=1}^{J} Q_j\left(N_j^{-1}\sum_{i=1}^{N_j}\hat f_{y|x,w\in W_j}(x_{ij}\hat\beta_\tau)\,x_{ij}x_{ij}'\right) = N^{-1}\sum_{i=1}^{N}\frac{Q_{j_i}}{\hat H_{j_i}}\hat f_{y|x,w\in W_{j_i}}(x_i\hat\beta_\tau)\,x_i x_i',$$
$$\hat B_w \equiv \sum_{j=1}^{J}\frac{Q_j^2}{\hat H_j}\left(N_j^{-1}\sum_{i=1}^{N_j}\left(s_{ij}(\hat\beta_\tau) - \bar s_j(\hat\beta_\tau)\right)\left(s_{ij}(\hat\beta_\tau) - \bar s_j(\hat\beta_\tau)\right)'\right) = N^{-1}\sum_{j=1}^{J}\left(\frac{Q_j}{\hat H_j}\right)^2\sum_{i=1}^{N_j}\left(s_{ij}(\hat\beta_\tau) - \bar s_j(\hat\beta_\tau)\right)\left(s_{ij}(\hat\beta_\tau) - \bar s_j(\hat\beta_\tau)\right)',$$
where $s_{ij}(\hat\beta_\tau) \equiv \left(\tau - 1[y_{ij} - x_{ij}\hat\beta_\tau \le 0]\right)x_{ij}$, $\bar s_j(\hat\beta_\tau) = N_j^{-1}\sum_{i=1}^{N_j}s_{ij}(\hat\beta_\tau)$, and $Q_{j_i}/\hat H_{j_i}$ is the weight for the stratum $j_i$ of observation $i$.

In the more general, nonlinear framework, a similar estimator of the covariance matrix is obtained by replacing $s_{ij}(\hat\beta_\tau) \equiv \left(\tau - 1[y_{ij} - g(x_{ij},\hat\beta_\tau)\le 0]\right)\dot{\hat g}_i$ and the outer terms by
$$\left(N^{-1}\sum_{i=1}^{N}\frac{Q_{j_i}}{\hat H_{j_i}}\hat f_{y|x,w\in W_{j_i}}(g(x_{ij},\hat\beta_\tau))\,\dot{\hat g}_i\dot{\hat g}_i'\right)^{-1},$$
with $\dot{\hat g}_i \equiv \partial g(x_{ij},\beta)/\partial\beta\,\big|_{\beta=\hat\beta_\tau}$. Then
$$\widehat{\operatorname{Var}}\!\left(\sqrt N(\hat\beta_\tau - \beta_{\tau o})\right) = \hat A_w^{-1}\hat B_w\hat A_w^{-1}.$$
To estimate the off-diagonal terms $\Lambda_{\tau_{l,k}}$ when performing inference on a sequence of quantile regressions, we can estimate the middle term by the analogous quantity
$$\sum_{j=1}^{J}\frac{Q_j^2}{\hat H_j}\left(N_j^{-1}\sum_{i=1}^{N_j}\left(s_{ij}(\hat\beta_{\tau_l}) - \bar s_j(\hat\beta_{\tau_l})\right)\left(s_{ij}(\hat\beta_{\tau_k}) - \bar s_j(\hat\beta_{\tau_k})\right)'\right),$$
and we can still use the same estimators, $\hat A_w$, for the outer terms of $\Lambda_{\tau_{l,k}}$, noting that they are based on the estimates for $\tau_l$ and $\tau_k$, respectively. Finally, under exogenous stratification and correct specification of the underlying CQF, the score generally has zero conditional mean for all $X$ when evaluated at $\beta_{\tau o}$.
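The stratum-by-stratum construction of the middle term $\hat B_w$ can be sketched as follows (Python; a simplified illustration, not the chapter's implementation, assuming the data arrive as lists of $(y, x)$ pairs grouped by stratum, with hypothetical function names):

```python
def score(y, x, b, tau):
    """s(b) = (tau - 1[y - x'b <= 0]) * x, for x given as a list of regressors."""
    psi = tau - (1.0 if y - sum(xk * bk for xk, bk in zip(x, b)) <= 0 else 0.0)
    return [psi * xk for xk in x]

def b_w_hat(strata, Q, b, tau):
    """Middle term of the sandwich: sum_j (Q_j^2 / H_j_hat) times the
    within-stratum covariance matrix of the scores.

    `strata` is a list of lists of (y, x) pairs, one inner list per stratum;
    Q[j] is the (known) population share of stratum j."""
    N = sum(len(obs) for obs in strata)
    dim = len(strata[0][0][1])
    B = [[0.0] * dim for _ in range(dim)]
    for j, obs in enumerate(strata):
        Nj = len(obs)
        H = Nj / N  # estimated sampling share H_j_hat = N_j / N
        s = [score(y, x, b, tau) for y, x in obs]
        sbar = [sum(si[k] for si in s) / Nj for k in range(dim)]
        for si in s:
            d = [si[k] - sbar[k] for k in range(dim)]
            for r in range(dim):
                for c in range(dim):
                    B[r][c] += (Q[j] ** 2 / H) * d[r] * d[c] / Nj
    return B

# Two identical strata, intercept-only model (x = [1]), b = [0], tau = 0.5.
strata = [[(1.0, [1.0]), (-1.0, [1.0])], [(1.0, [1.0]), (-1.0, [1.0])]]
print(b_w_hat(strata, [0.5, 0.5], [0.0], 0.5)[0][0])  # -> 0.25
```

Note that the scores are demeaned within each stratum before the outer products are accumulated, matching the $s_{ij} - \bar s_j$ centering in the formula for $\hat B_w$.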
Then the asymptotic variance estimator for the unweighted quantile regression estimator (with linear CQF) is given by
$$\widehat{\operatorname{Var}}\!\left(\sqrt N(\hat\beta_\tau - \beta_{\tau o})\right) = \hat A_u^{-1}\hat B_u\hat A_u^{-1}$$
with
$$\hat A_u = \frac{1}{N}\sum_{i=1}^{N}\hat f_{y|x}(x_i\hat\beta_\tau)\,x_i x_i', \qquad \hat B_u = \tau(1-\tau)\frac{1}{N}\sum_{i=1}^{N} x_i x_i',$$
which are the usual variance matrix estimators from the quantile regression literature under random sampling, as described by Buchinsky (1998) and Koenker (2005). This confirms that the usual quantile regression estimators and standard errors are valid when the sample is exogenously stratified; it is the same result as obtained by Wooldridge (2001) for general M-estimators with smooth objective functions.

A main issue, specific to quantile regression, is that we need to estimate $f_{y|x,w\in W_j}(x_i\hat\beta_\tau)$ taking the stratification into consideration. An intuitive approach is to exploit the fact that we have assumed random sampling within each stratum: apply a standard nonparametric density estimator stratum by stratum and simply plug it into the formula above. For example, using the Rosenblatt-Parzen kernel estimator described in Pagan and Ullah (1999),
$$\hat f_{y|x,w\in W_j}(x_{ij}\hat\beta_\tau) = (N_j h_{nj})^{-1}\sum_{i=1}^{N_j} K\!\left(\frac{y_{ij} - x_{ij}\hat\beta_\tau}{h_{nj}}\right),$$
where $K(\cdot)$ is a kernel function and $h_{nj}$ is a bandwidth parameter such that $h_{nj}\to 0$ and $N_j h_{nj}\to\infty$. Another option is to bypass the estimation of $f_{y|x,w\in W_j}(x_i\hat\beta_\tau)$ itself and revert to an estimator of
$$A_{wj} = E\!\left[f_{y|x,w\in W_j}(x\beta_{\tau o})\,xx' \mid w \in W_j\right],$$
which Koenker (2005) refers to as the Powell sandwich. This takes account of the fact that estimating $A_{wj}$ amounts to estimating a matrix-weighted density (Koenker, 2005):
$$\hat A_{wj} = (N_j h_{nj})^{-1}\sum_{i=1}^{N_j} K\!\left(\frac{y_{ij} - x_{ij}\hat\beta_\tau}{h_{nj}}\right)x_{ij}x_{ij}'.$$
Then we can obtain $\hat A_w = \sum_{j=1}^{J} Q_j \hat A_{wj}$. Powell (1991) has shown that, under some additional conditions on $f(\cdot)$, $\hat A_{wj}$ converges in probability to $A_{wj}$.
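As an illustration of the plug-in step, here is a minimal Rosenblatt-Parzen sketch with a Gaussian kernel (Python; the residuals and bandwidth are hypothetical, and the Gaussian kernel is one choice among the options discussed in Pagan and Ullah, 1999):

```python
import math

def gaussian_kernel(v):
    """Standard normal density used as the kernel K."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def parzen_density(residuals, point, h):
    """Rosenblatt-Parzen estimate of the residual density at `point`,
    computed from one stratum's residuals y_ij - x_ij * beta_hat:
        f_hat(point) = (N_j * h)^-1 * sum_i K((point - u_i) / h)."""
    n = len(residuals)
    return sum(gaussian_kernel((point - u) / h) for u in residuals) / (n * h)

# Residuals from a hypothetical stratum; density evaluated at zero.
u = [-1.5, -0.5, 0.0, 0.5, 1.5]
print(round(parzen_density(u, 0.0, 1.0), 4))
```

In the stratified setting this function would be applied separately within each stratum, with its own bandwidth $h_{nj}$, and the resulting $\hat f_{y|x,w\in W_j}(\cdot)$ plugged into $\hat A_w$.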
Two drawbacks of this estimator are the need to choose kernels and bandwidths (which could differ across strata) and the fact that, in practice, the researcher might have some strata for which only a small amount of data is available, reducing the confidence in the resulting estimates of $A_{wj}$.$^4$ When the stratification is exogenous, we can take advantage of the fact that $f_{y|x,x\in\chi_j}(x_{ij}\hat\beta_\tau) = f_{y|x}(x_i\hat\beta_\tau)$ for all strata and obtain $\hat f_{y|x}(x_i\hat\beta_\tau)$ or $\hat A_w$ directly as
$$\hat f_{y|x}(x_i\hat\beta_\tau) = (Nh_n)^{-1}\sum_{i=1}^{N}K\!\left(\frac{y_i - x_i\hat\beta_\tau}{h_n}\right), \qquad \hat A_w = (Nh_n)^{-1}\sum_{i=1}^{N}K\!\left(\frac{y_i - x_i\hat\beta_\tau}{h_n}\right)x_i x_i'.$$
The choice of estimator for $f_{y|x,w\in W_j}(x_{ij}\hat\beta_\tau)$, as well as of any nuisance parameters associated with its estimation, is an important issue that remains open. It seems advisable for researchers to be cautious about the impact of these choices on the standard errors used in QR. Pagan and Ullah (1999) present several possible estimators for such densities and discuss their advantages and drawbacks.

3.4 Conclusion

This chapter addressed the issue of inference in quantile regressions when the data are obtained through standard stratified sampling. Extending results from Wooldridge (2001), I derive the asymptotic distribution of the weighted quantile regression estimator under general stratification, and of the unweighted quantile regression estimator when the stratification is a deterministic function of the conditioning variables. Valid estimators for the asymptotic variance matrices of those estimators are provided. The results shown here confirm the intuition that the results for general M-estimators with smooth objective functions transfer neatly to the quantile regression case. This adds to the literature a more careful treatment of the quantile regression case under stratified sampling that had not previously been available.$^5$

$^4$ For the nonlinear case, one can use $\hat A_{wj} = (N_j h_{nj})^{-1}\sum_{i=1}^{N_j}K\!\left(\frac{y_{ij} - g(x_{ij},\hat\beta_\tau)}{h_{nj}}\right)\dot{\hat g}_i\dot{\hat g}_i'$.
K hnj i=1 5 And this fact has probably played a significant role in the absence of probability 76 Weighted quantile regression provides a simple and reliable way to deal with data obtained through stratified sampling, albeit requiring adjustments to the usual standard errors in the literature. Valid estimators for the standard errors are provided. Under exogenous stratification one could use the usual unweighted estimators, which retain its properties of consistency (Wooldridge, 2001) and asymptotic normality. Even more relevant, in that case, some usual standard error estimators in the literature (Koenker, 2005; Buchinsky, 1998, etc) are still valid. weighted and “survey” methods for quantile regression in popular statistical packages like STATA. 77 APPENDICES 78 APPENDIX A Proofs to “GMM Efficiency and IPW for Nonsmooth Functions” Proof. [Proof of Theorem 1] For VON E−ST EP , VKN OW −γ and VKN OW −γ−JOIN T this result is a direct application of known results in the literature (see, e.g., p. 2186 in Newey and McFadden 1994 or more generally p. 1594 in Chen, Linton and Van Keilegom 2003) and the simplifications that take effect by the use of the appropriate weighting matrix. For VT W O−ST EP I rely on the approximations used by Newey and McFadden (1994) in theorem 7.2 and Pakes and Pollard (1989) theorem 3.3 and lemma 3.5. Following Pakes and Pollard (1989), I claim that gn (θ) is very well approximated by the linear function Ln (θ) = = Ln1 (θ) = gn (θo ) + G(θ − θo ) Ln2 (θ) gn1 (β o , γ o ) + G11 (β − β o ) + G12 (γ − γ o ) gn2 (γ o ) + G22 (γ − γ o ) 1 − within a Op (n 2 ) neighborhood of θo . More precisely, I need the approximation error to 1 − be of order op (n 2 ) at θ and at θ∗ which minimizes Ln (θ) globally. 
In the case analyzed 79 here, gn (θ) − Ln (θ) = gn (θ) − gn (θo ) − G(θ − θo ) = gn (θ) − gn (θo ) − G(θ − θo ) − go (θ) + go (θ) ≤ gn (θ) − go (θ) − gn (θo ) + go (θ) − G(θ − θo ) 1 − ≤ op (1)n 2 1 + √ n (θ−θ o ) + op ( (θ−θ o ) ) 1 − = op (n 2 ) 1 − where in the last equality I used the fact that (θ−θ o ) ≤ Op (n 2 ) (see Newey and Mc- Fadden, 1994, p. 2191). To correspond to a minimum of Ln (θ) , the vector G(θ∗ − θo ) must be equal to the linear projection of −gn (θo ) onto the space G. Hence, G(θ∗ − θo ) = −G(G G)−1 G gn (θo ) from this equation, we can obtain √ √ ∗ n(θ − θo ) = − n(G G)−1 G gn (θo ) from Pakes and Pollard (1989, lemma 3.5) the result above holds for the case in which we use the appropriate positive semidefinite weighting matrix W that converges in probability to W , in which case √ √ n(θ∗ − θo ) = − n(G W G)−1 G W gn (θo ) as shown by Pakes and Pollard (1989, p. 2042) under the conditions listed above θ∗ and θ are close enough in this shrinking neighborhood around θo such that we can write √ n(θ − θo ) = √ n(θ∗ − θo ) + op (1) Hence, for the first step estimator, the following approximation is valid √ −1 √ −1 −1 n(γ − γ o ) = − n G22 C22 G22 G22 C22 gn2 (γ o ) + op (1) 80 (A.1) Then, for the second step, using the same results, we can approximate √ −1 √ −1 −1 G11 C11 gn1 (β o , γ) + op (1) n(β − β o ) = − n G11 C11 G11 −1 √ −1 −1 = − n G11 C11 G11 G11 C11 [gn1 (β o , γ o ) + G12 (γ−γ o )] + op (1) −1 √ −1 −1 = − n G11 C11 G11 G11 C11 gn1 (β o , γ o )+ + −1 −1 √ −1 −1 −1 × G11 C11 G12 G22 C22 G22 n G11 C11 G11 (A.2) −1 × G22 C22 gn2 (γ o ) + op (1) then, by combining A.1 and A.2 we can write √ √ n(θ − θo ) = B ngn (θ o ) + op (1) where, B= B11 B12 0 B22 with −1 B11 = − G11 C11 G11 B12 = −1 G11 C11 G11 −1 −1 −1 B22 = − G22 C22 G22 −1 G11 C11 −1 −1 G11 C11 G12 G22 C22 G22 −1 −1 −1 G22 C22 −1 G22 C22 hence, VT W O−ST EP = BΣB Proof. 
[Proof of Corollary 1] The proof follows directly from Prokhorov and Schmidt (2009), since Theorem 3 has shown that the variance structure of the four estimators considered is the same as in Prokhorov and Schmidt (2009). The proof that the result holds directly for the case in which the objective functions considered are nonsmooth is available upon request.

Proof. [Proof of Corollary 2] Note that the asymptotic variance of $\sqrt n(\hat\beta_{TWO\text{-}STEP} - \beta_o)$ can be rewritten as (note that $B_{12} = -B_{11}G_{12}\left(G_{22}'C_{22}^{-1}G_{22}\right)^{-1}G_{22}'C_{22}^{-1}$)
$$\begin{aligned}
V(\beta_{TWO\text{-}STEP}) &= B_{11}C_{11}B_{11}' + B_{12}C_{21}B_{11}' + B_{11}C_{12}B_{12}' + B_{12}C_{22}B_{12}'\\
&= B_{11}E\!\left[g_1(\omega_i^*,\theta)g_1(\omega_i^*,\theta)'\right]B_{11}' + B_{12}E\!\left[g_2(\omega_i^*,\gamma)g_1(\omega_i^*,\theta)'\right]B_{11}'\\
&\quad + B_{11}E\!\left[g_1(\omega_i^*,\theta)g_2(\omega_i^*,\gamma)'\right]B_{12}' + B_{12}E\!\left[g_2(\omega_i^*,\gamma)g_2(\omega_i^*,\gamma)'\right]B_{12}'\\
&= B_{11}\,E\!\left[\left(g_1(\omega_i^*,\theta) - G_{12}\left(G_{22}'C_{22}^{-1}G_{22}\right)^{-1}G_{22}'C_{22}^{-1}g_2(\omega_i^*,\gamma)\right)\left(\,\cdot\,\right)'\right]B_{11}',
\end{aligned}$$
where $(\,\cdot\,)$ repeats the preceding bracketed expression. If $G_{12} = C_{12}C_{22}^{-1}G_{22}$,
$$V(\beta_{TWO\text{-}STEP}) = B_{11}\,E\!\left[\left(g_1(\omega_i^*,\theta) - C_{12}C_{22}^{-1}G_{22}\left(G_{22}'C_{22}^{-1}G_{22}\right)^{-1}G_{22}'C_{22}^{-1}g_2(\omega_i^*,\gamma)\right)\left(\,\cdot\,\right)'\right]B_{11}'.$$
Since it is assumed that $G_{22}$ is invertible, this simplifies to
$$V(\beta_{TWO\text{-}STEP}) = B_{11}\,E\!\left[\left(g_1(\omega_i^*,\theta) - C_{12}C_{22}^{-1}g_2(\omega_i^*,\gamma)\right)\left(g_1(\omega_i^*,\theta) - C_{12}C_{22}^{-1}g_2(\omega_i^*,\gamma)\right)'\right]B_{11}'.$$
If we define $e_i = g_1(\omega_i^*,\theta) - C_{12}C_{22}^{-1}g_2(\omega_i^*,\gamma)$ and $D_o = E[e_i e_i']$, we can write
$$V(\beta_{TWO\text{-}STEP}) = \left(G_{11}'C_{11}^{-1}G_{11}\right)^{-1}G_{11}'C_{11}^{-1}\,D_o\,C_{11}^{-1}G_{11}\left(G_{11}'C_{11}^{-1}G_{11}\right)^{-1}.$$
In this case, the variance of the two-step estimator of $\beta_o$ takes a quadratic form in which the middle term is the variance of the residual from the linear projection of the first set of moment conditions on the second set. If, in addition to the conditions above, we assume $G_{11}$ is invertible, the result follows:
$$V(\beta_{TWO\text{-}STEP}) = G_{11}^{-1}D_o G_{11}^{-1\prime}.$$

Proof.
[Proof of Lemma 2] First, note that
$$\begin{aligned}
E\!\left[s\,g_2(z,\gamma_o,s)' \mid z\right] &= E\!\left[s\,\frac{\nabla_\gamma f(v_i\mid z_i,\gamma)'}{f(v_i\mid z_i,\gamma)} \,\middle|\, z\right]
= \int_{-\infty}^{\infty} h(v,z)\,\frac{\nabla_\gamma f(v\mid z,\gamma)'}{f(v\mid z,\gamma)}\,f(v\mid z,\gamma)\,dv\\
&= \nabla_\gamma\!\left[\int_{-\infty}^{\infty} h(v,z)f(v\mid z,\gamma)\,dv\right] = \nabla_\gamma E[s\mid z] = \nabla_\gamma p(z,\gamma_o),
\end{aligned}$$
which is nonzero in general. Hence
$$\begin{aligned}
C_{12} &= E\!\left[g_1(\omega^*,\beta_o,\gamma_o,s)\,g_2(z,\gamma_o,s)'\right]
= E\!\left[\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\,g_2(z,\gamma_o,s)'\right]\\
&= E\!\left\{E\!\left[\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\,g_2(z,\gamma_o,s)' \,\middle|\, z\right]\right\}
= E\!\left\{\frac{1}{p(z,\gamma_o)}\,E\!\left[g(\omega,\beta_o)\,s\,g_2(z,\gamma_o,s)' \mid z\right]\right\}\\
&= E\!\left\{\frac{1}{p(z,\gamma_o)}\,E\!\left[g(\omega,\beta_o)\mid z\right]E\!\left[s\,g_2(z,\gamma_o,s)'\mid z\right]\right\} \quad\text{(by ignorability)}\\
&= E\!\left[\frac{E[g(\omega,\beta_o)\mid z]\,\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\right],
\end{aligned}$$
which is generally nonzero. Analyzing $G_{12}$,
$$G_{12} = \nabla_\gamma E\!\left[g_1(\omega^*,\beta_o,\gamma_o,s)\right] = \nabla_\gamma E\!\left[\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\right].$$
Since $g_1(\omega^*,\beta_o,\gamma,s) = \frac{s}{p(z,\gamma)}\,g(\omega,\beta)$ is smooth in $\gamma$,
$$\begin{aligned}
G_{12} &= E\!\left[\nabla_\gamma\,\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\right]
= -E\!\left[\frac{s\,\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)^2}\,g(\omega,\beta_o)\right]
= -E\!\left[\frac{s}{p(z,\gamma_o)}\,\frac{\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\,g(\omega,\beta_o)\right]\\
&= -E\!\left[\frac{\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\,E\!\left[\frac{s}{p(z,\gamma_o)}\,g(\omega,\beta_o)\,\middle|\,z\right]\right] \quad\text{(by the LIE)}\\
&= -E\!\left[\frac{\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\,\frac{E[s\mid z]}{p(z,\gamma_o)}\,E[g(\omega,\beta_o)\mid z]\right]
= -E\!\left[\frac{\nabla_\gamma p(z,\gamma_o)}{p(z,\gamma_o)}\,E[g(\omega,\beta_o)\mid z]\right], \ \text{since } E[s\mid z]=p(z,\gamma_o)\\
&= -C_{12}.
\end{aligned}$$
Then, to prove Lemma 2, I need $G_{22} = -C_{22}$, which follows from the Generalized Information Equality (recalling that $g_2(z,\gamma_o,s)$ is a smooth function):
$$G_{22} = \nabla_\gamma E[g_2(z,\gamma_o,s)] = E[\nabla_\gamma g_2(z,\gamma_o,s)] = -E\!\left[g_2(z,\gamma_o,s)g_2(z,\gamma_o,s)'\right] = -C_{22}.$$
Hence $G_{12} = -C_{12} = -C_{12}\left(-C_{22}^{-1}G_{22}\right) = C_{12}C_{22}^{-1}G_{22}$.

Proof. [Proof of Theorem 2] This follows directly from Lemma 2 and statements 9 and 10 in Corollary 1.

Claim 1. Consider the conditional quantile function $Q_\tau(Y\mid X) = X\beta_{\tau o}$ and the weighted linear quantile estimator obtained as
$$\hat\beta_\tau = \arg\min_{b\in\mathbb R^p}\sum_i w_i\,\rho_\tau(y_i - x_i b)$$
for some known weights $w_i$ that could be functions of exogenous variables. Under Assumptions 7 and 8, we have
$$\sqrt n(\hat\beta_\tau - \beta_\tau) \overset{a}{\sim} N\!\left(0,\,\tau(1-\tau)\,D_1^{-1}D_o D_1^{-1}\right),$$
with $D_1 = \lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i f_i(x_i\beta_{\tau o})x_i x_i'$ and $D_o = \lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i^2 x_i x_i'$.

Assumption 7. For $Y_1, Y_2, \dots$
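The inverse-probability-weighting logic underlying these moment conditions can be illustrated with a toy sketch (Python; the data, selection indicators, and probabilities are hypothetical): weighting each observed outcome by $s/p(z)$ recovers, in expectation, the full-data moment, because $E[s\mid z] = p(z,\gamma_o)$.

```python
def ipw_mean(y, s, p):
    """IPW estimate of E[y] when y is only used if observed (s = 1),
    with known selection probabilities p_i = P(s_i = 1 | z_i):
        (1/n) * sum_i s_i * y_i / p_i."""
    n = len(y)
    return sum(si * yi / pi for yi, si, pi in zip(y, s, p)) / n

# Sanity check: with no missing data (s = 1, p = 1) this is the plain mean.
y = [1.0, 2.0, 3.0, 4.0]
print(ipw_mean(y, [1, 1, 1, 1], [1.0] * 4))  # -> 2.5
```

In the chapter's setting $y_i$ would be replaced by the moment function $g(\omega_i, \beta)$, and $p(z,\gamma)$ would itself be estimated, which is exactly what generates the $C_{12}$ and $G_{12}$ terms analyzed above.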
, $Y_n$ independent random variables with distribution functions $F_1, F_2, \dots, F_n$: the $\{F_i\}$ are absolutely continuous, with continuous densities $f_i(\cdot)$ uniformly bounded away from $0$ and $\infty$ at the points $x_i\beta_{\tau o}$, and with weights $w_i$ bounded, for every $i$.

Assumption 8. There exist positive definite matrices $D_o$ and $D_1$ such that
(i) $\lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i^2 x_i x_i' = D_o$;
(ii) $\lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i f_i(x_i\beta_{\tau o})x_i x_i' = D_1$;
(iii) $\max_i \|x_i\|/\sqrt n \to 0$.

Proof. [Proof of Claim 1] This proof follows the steps presented in Koenker (2005, p. 120). Consider $u_i = y_i - x_i\beta_{\tau o}$; then
$$\hat\beta_\tau = \arg\min_{b\in\mathbb R^p}\sum_{i=1}^n w_i\,\rho_\tau(y_i - x_i b).$$
Consider the following convex objective function, with unique minimizer at $\hat\delta_n = \sqrt n(\hat\beta_\tau - \beta_{\tau o})$:
$$Z_n(\delta) = \sum_{i=1}^n w_i\left[\rho_\tau\!\left(u_i - \frac{x_i\delta}{\sqrt n}\right) - \rho_\tau(u_i)\right].$$
Using Knight's identity, $\rho_\tau(u - v) - \rho_\tau(u) = -v\,\Psi_\tau(u) + \int_0^v\left(1[u\le s] - 1[u\le 0]\right)ds$ with $\Psi_\tau(u) = \tau - 1[u\le 0]$,
$$Z_n(\delta) = \sum_{i=1}^n w_i\left[-\frac{x_i\delta}{\sqrt n}\,\Psi_\tau(u_i) + \int_0^{x_i\delta/\sqrt n}\left(1[u_i\le s] - 1[u_i\le 0]\right)ds\right] = Z_{1n}(\delta) + \sum_{i=1}^n Z_{2ni}(\delta) = Z_{1n}(\delta) + Z_{2n}(\delta).$$
Note that, by the Lindeberg-Feller central limit theorem,
$$Z_{1n}(\delta) = -\delta'\frac{1}{\sqrt n}\sum_{i=1}^n w_i x_i\,\Psi_\tau(u_i) = -\delta'\frac{1}{\sqrt n}\sum_{i=1}^n w_i x_i\left(\tau - 1[u_i\le 0]\right) \rightsquigarrow -\delta'W, \qquad W \sim N\!\left(0,\,\tau(1-\tau)\lim_{n\to\infty}\frac1n\sum_{i=1}^n w_i^2 x_i x_i'\right).$$
Also,
$$Z_{2n}(\delta) = \sum_{i=1}^n Z_{2ni}(\delta) = \sum_{i=1}^n E[Z_{2ni}(\delta)] + \sum_{i=1}^n\left(Z_{2ni}(\delta) - E[Z_{2ni}(\delta)]\right),$$
but
$$\sum_{i=1}^n E[Z_{2ni}(\delta)] = \sum_{i=1}^n w_i\int_0^{x_i\delta/\sqrt n}E\!\left[1[u_i\le s] - 1[u_i\le 0]\right]ds = \sum_{i=1}^n w_i\int_0^{x_i\delta/\sqrt n}\left[F_i(x_i\beta_{\tau o} + s) - F_i(x_i\beta_{\tau o})\right]ds.$$
Letting $s = t/\sqrt n$,
$$\sum_{i=1}^n E[Z_{2ni}(\delta)] = \frac1n\sum_{i=1}^n w_i\int_0^{x_i\delta}\sqrt n\left[F_i\!\left(x_i\beta_{\tau o} + \frac{t}{\sqrt n}\right) - F_i(x_i\beta_{\tau o})\right]dt = \frac1n\sum_{i=1}^n w_i\int_0^{x_i\delta}f_i(x_i\beta_{\tau o})\,t\,dt + o(1) = \frac{1}{2n}\sum_{i=1}^n w_i f_i(x_i\beta_{\tau o})\,\delta'x_i x_i'\delta + o(1) \to \frac12\,\delta'D_1\delta.$$
Under Assumption 8(iii),
$$Z_n(\delta) \rightsquigarrow Z_o(\delta) = -\delta'W + \frac12\,\delta'D_1\delta,$$
and then, by convexity,
$$\sqrt n(\hat\beta_\tau - \beta_\tau) = \hat\delta_n = \arg\min Z_n(\delta) \rightsquigarrow \delta_o = \arg\min Z_o(\delta) = D_1^{-1}W;$$
hence
$$\sqrt n(\hat\beta_\tau - \beta_\tau) \overset{a}{\sim} N\!\left(0,\,\tau(1-\tau)D_1^{-1}D_o D_1^{-1}\right).$$

APPENDIX B

Figures to "Fixed Bandwidth Asymptotics for Regression
Discontinuity Designs"

B.1 Simulations for Infeasible Inference

B.1.1 Nadaraya-Watson Estimator

[Each figure in this appendix plots empirical coverage (vertical axis) against the bandwidth h, from 0 to 20 (horizontal axis), comparing small-h and fixed-h inference against the 95% nominal line. Only the figure captions are reproduced here.]

For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Figure B.1. Nadaraya-Watson Estimator - DGP: No X - Homosk. Errors
Figure B.2. Nadaraya-Watson Estimator - DGP: Linear - Homosk. Errors
Figure B.3. Nadaraya-Watson Estimator - DGP: Linear - Bias Corrected - Homosk. Errors
Figure B.4. Nadaraya-Watson Estimator - DGP: Linear - Comparison - Homosk. Errors

B.1.2 Local Linear Estimator

Figure B.5. Local Linear Estimator - DGP: Linear - Homosk. Errors
Figure B.6. Local Linear Estimator - DGP: Quadratic - Homosk. Errors
Figure B.7. Local Linear Estimator - DGP: Quadratic - Bias Corrected - Homosk.
Errors
Figure B.8. Local Linear Estimator - DGP: Quadratic - Comparison - Homosk. Errors

B.1.3 Heteroskedastic Errors

Figure B.9. Local Linear Estimator - DGP: Linear - Heterosk. Errors Case 1
Figure B.10. Local Linear Estimator - DGP: Quadratic - Heterosk. Errors Case 1
Figure B.11. Local Linear Estimator - DGP: Linear - Heterosk. Errors Case 2
Figure B.12. Local Linear Estimator - DGP: Quadratic - Heterosk. Errors Case 2

B.2 Simulations for Feasible Inference

Figure B.13. Nadaraya-Watson Estimator - DGP: No X - Feasible - Homosk.
Errors
Figure B.14. Nadaraya-Watson Estimator - DGP: Linear - Feasible - Homosk. Errors

B.2.1 Local Linear Estimator

Figure B.15. Local Linear Estimator - DGP: Linear - Feasible - Homosk. Errors
Figure B.16. Local Linear Estimator - DGP: Quadratic - Feasible - Homosk. Errors

B.2.2 Heteroskedastic Errors

Figure B.17. Local Linear Estimator - DGP: Linear - Feasible - Heterosk. Errors Case 1
Figure B.18. Local Linear Estimator - DGP: Quadratic - Feasible - Heterosk. Errors Case 1
Figure B.19. Local Linear Estimator - DGP: Linear - Feasible - Heterosk.
Errors Case 2
Figure B.20. Local Linear Estimator - DGP: Quadratic - Feasible - Heterosk. Errors Case 2

B.2.3 Bandwidth Choice for fo(x)

Figure B.21. Small-h Sensitivity to Density Bandwidth (hf) - DGP: Linear
Figure B.22. Small-h Sensitivity to Density Bandwidth (hf) - DGP: Quadratic

APPENDIX C

Proofs to "Fixed Bandwidth Asymptotics for Regression Discontinuity Designs"

Proof. [Proof of Theorem 6] The local polynomial estimator is given by $\hat\alpha_p = \hat\alpha_{p+} - \hat\alpha_{p-}$ (writing $\bar x$ for the cutoff). Note that
$$\hat\alpha_{p+} = e_1'\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i Z_i'\right)^{-1}\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i y_i\right) = e_1'D_{n+}\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i y_i\right),$$
with $D_{n+} = \left[\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i Z_i'\right]^{-1}$, and
$$\hat\alpha_{p-} = e_1'D_{n-}\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_i y_i\right)
,$$
with $D_{n-} = \left[\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_i Z_i'\right]^{-1}$. Then
$$\hat\alpha_{p+} = e_1'D_{n+}\left(\frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_i Z_i\left[m(x_i) + \alpha d_i + \varepsilon_i\right]\right) = e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_i Z_i m(x_i)\right) + \alpha\,e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_i Z_i\right) + e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_i Z_i\varepsilon_i\right).$$
Note that $Z_i = Z_iZ_i'e_1$ and $e_1'e_1 = 1$, so
$$e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_i\right) = e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_iZ_i'\right)e_1 = 1,$$
and therefore
$$\hat\alpha_{p+} - \alpha = e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_im(x_i)\right) + e_1'D_{n+}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_i\varepsilon_i\right).$$
Similarly,
$$\hat\alpha_{p-} = e_1'D_{n-}\left(\frac{1}{nh}\sum_i k(\cdot)\,(1-d_i)Z_im(x_i)\right) + e_1'D_{n-}\left(\frac{1}{nh}\sum_i k(\cdot)\,(1-d_i)Z_i\varepsilon_i\right).$$
Then
$$\sqrt{nh}(\hat\alpha_p - \alpha) = e_1'D_{n+}\left[\sqrt{nh}\left(\frac{1}{nh}\sum_i k(\cdot)\,d_iZ_im(x_i)\right) + \frac{1}{\sqrt{nh}}\sum_i k(\cdot)\,d_iZ_i\varepsilon_i\right] - e_1'D_{n-}\left[\sqrt{nh}\left(\frac{1}{nh}\sum_i k(\cdot)\,(1-d_i)Z_im(x_i)\right) + \frac{1}{\sqrt{nh}}\sum_i k(\cdot)\,(1-d_i)Z_i\varepsilon_i\right].$$
For the denominator terms $D_{n+}$ and $D_{n-}$, note that $D_{n+}^{-1} = \frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_iZ_i'$, and each element of this matrix is given by
$$\left[D_{n+}^{-1}\right]_{l,j} = \frac{1}{nh}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)\left(\frac{x_i - \bar x}{h}\right)^{j+l-2}d_i,$$
whose variance converges to zero, since
$$\operatorname{Var}\!\left(\left[D_{n+}^{-1}\right]_{l,j}\right) = \frac{1}{(nh)^2}\sum_i\operatorname{Var}\!\left(k\!\left(\frac{x_i - \bar x}{h}\right)\left(\frac{x_i - \bar x}{h}\right)^{j+l-2}d_i\right) = \frac{1}{nh^2}\operatorname{Var}\!\left(k\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^{j+l-2}d\right) \le \frac{1}{nh}\int_{\bar x}^{\bar x+h}\frac1h k^2\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^{2(j+l-2)}f_o(x)\,dx.$$
The terms in the integral and the integral itself are $O(1)$ and $\frac{1}{nh} = o(1)$; hence $\operatorname{Var}([D_{n+}^{-1}]_{l,j}) \to 0$. Now,
$$\left[D_{n+}^{-1}\right]_{l,j} = E\!\left[\left[D_{n+}^{-1}\right]_{l,j}\right] + o_p(1) = \int_{\bar x}^{\bar x+h}\frac1h k\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^{j+l-2}f_o(x)\,dx + o_p(1) = \int_0^\infty k(u)\,u^{j+l-2}f_o(\bar x + uh)\,du + o_p(1).$$
Let $\gamma_j^+ = \int_0^\infty k(u)\,u^j f_o(\bar x + uh)\,du$, and let $\Gamma_+^*$ be the $(p+1)\times(p+1)$ matrix with $(j,l)$ element $\gamma_{j+l-2}^+$ for $j, l = 1, \dots, p+1$.
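The algebra above is one-sided weighted least squares; the following toy sketch (Python; rectangular kernel, $p = 1$, and wholly hypothetical data) computes the local linear intercepts on each side of the cutoff and their difference, which is the sharp RD estimate:

```python
def local_linear_intercept(xs, ys, cutoff, h, side):
    """One-sided local linear fit with a rectangular kernel: regress y on
    (1, x - cutoff) using only observations within h of the cutoff on the
    requested side, and return the intercept (the fitted value at the cutoff)."""
    pts = [(x - cutoff, y) for x, y in zip(xs, ys)
           if abs(x - cutoff) <= h
           and ((x >= cutoff) if side == "+" else (x < cutoff))]
    n = float(len(pts))
    sx = sum(d for d, _ in pts)
    sxx = sum(d * d for d, _ in pts)
    sy = sum(y for _, y in pts)
    sxy = sum(d * y for d, y in pts)
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det  # intercept from the 2x2 normal equations

# A piecewise-linear outcome with a jump of 1 at the cutoff x = 0.
xs = [-0.4, -0.3, -0.2, -0.1, 0.1, 0.2, 0.3, 0.4]
ys = [2.0 + 3.0 * x + (1.0 if x >= 0 else 0.0) for x in xs]
jump = (local_linear_intercept(xs, ys, 0.0, 0.5, "+")
        - local_linear_intercept(xs, ys, 0.0, 0.5, "-"))
print(round(jump, 10))  # -> 1.0
```

Because the data inside the bandwidth are exactly linear on each side, the estimator recovers the jump exactly here; with curvature in $m(x)$ it would instead exhibit the bias term derived in the proof.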
Then $D_{n+} \overset{p}{\to} \Gamma_+^{*\,-1}$. Similarly, $D_{n-} \overset{p}{\to} \Gamma_-^{*\,-1}$, where $\Gamma_-^*$ is the $(p+1)\times(p+1)$ matrix with $(j,l)$ element $\gamma_{j+l-2}^-$ for $j, l = 1, \dots, p+1$, and $\gamma_j^- = (-1)^j\int_0^\infty k(u)\,u^jf_o(\bar x - uh)\,du$.

Now we derive the asymptotic distribution of $\frac{1}{\sqrt{nh}}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_i\varepsilon_i$. Following Porter (2003), I use the Cramer-Wold device. Let $\lambda$ be a nonzero, finite vector. A standard bound verifies the Liapunov condition:
$$\sum_{i=1}^n E\left|\frac{1}{\sqrt{nh}}\,k\!\left(\frac{x_i - \bar x}{h}\right)d_i\lambda'Z_i\varepsilon_i\right|^{2+\zeta} \le \left(\frac{1}{nh}\right)^{\zeta/2}\sup_{x\in\aleph}E\!\left(|\varepsilon|^{2+\zeta}\mid X = x\right)\cdot\frac1h\,E\!\left[k^{2+\zeta}\!\left(\frac{x - \bar x}{h}\right)d\,\big|\lambda'Z(x)\big|^{2+\zeta}\right] = O\!\left(\left(\frac{1}{nh}\right)^{\zeta/2}\right)O(1) = o(1);$$
hence $\frac{1}{\sqrt{nh}}\sum_i k(\cdot)\,d_iZ_i\varepsilon_i$ satisfies Liapunov's CLT. Note that
$$E\!\left[\frac{1}{\sqrt{nh}}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_i\varepsilon_i\right] = \sqrt{\frac{n}{h}}\,E\!\left[k\!\left(\frac{x - \bar x}{h}\right)dZ\,E(\varepsilon\mid X = x)\right] = 0$$
and
$$\operatorname{Var}\!\left[\frac{1}{\sqrt{nh}}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_i\varepsilon_i\right] = \frac1h\,E\!\left[k^2\!\left(\frac{x - \bar x}{h}\right)dZZ'\,E(\varepsilon^2\mid X = x)\right] = \int_{\bar x}^{\bar x+h}\frac1h\,k^2\!\left(\frac{x - \bar x}{h}\right)ZZ'\sigma^2(x)f_o(x)\,dx.$$
It helps to remember that $Z_iZ_i'$ is a function of $x$:
$$Z_iZ_i' = \begin{pmatrix} 1 & \frac{x_i - \bar x}{h} & \cdots & \left(\frac{x_i - \bar x}{h}\right)^p\\ \frac{x_i - \bar x}{h} & \left(\frac{x_i - \bar x}{h}\right)^2 & \cdots & \left(\frac{x_i - \bar x}{h}\right)^{p+1}\\ \vdots & \vdots & \ddots & \vdots\\ \left(\frac{x_i - \bar x}{h}\right)^p & \left(\frac{x_i - \bar x}{h}\right)^{p+1} & \cdots & \left(\frac{x_i - \bar x}{h}\right)^{2p}\end{pmatrix}.$$
Let $\delta_j^+ = \int_{\bar x}^{\bar x+h}\frac1h\,k^2\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^j\sigma^2(x)f_o(x)\,dx = \int_0^\infty k^2(u)\,u^j\sigma^2(\bar x + uh)f_o(\bar x + uh)\,du$, and let $\Delta_+^*$ be the $(p+1)\times(p+1)$ matrix with $(j,l)$ element $\delta_{j+l-2}^+$ for $j, l = 1, \dots, p+1$.
Then
$$\frac{1}{\sqrt{nh}}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_i\varepsilon_i \overset{d}{\to} N(0, \Delta_+^*).$$
Similarly we can show that
$$\frac{1}{\sqrt{nh}}\sum_{i=1}^n k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_i\varepsilon_i \overset{d}{\to} N(0, \Delta_-^*),$$
where $\Delta_-^*$ is the $(p+1)\times(p+1)$ matrix with $(j,l)$ element $\delta_{j+l-2}^-$ for $j, l = 1, \dots, p+1$, and $\delta_j^- = \int_{\bar x-h}^{\bar x}\frac1h\,k^2\!\left(\frac{x - \bar x}{h}\right)\left(\frac{x - \bar x}{h}\right)^j\sigma^2(x)f_o(x)\,dx = (-1)^j\int_0^\infty k^2(u)\,u^j\sigma^2(\bar x - uh)f_o(\bar x - uh)\,du$.

The bias term is given by
$$\sqrt{nh}\,e_1'\left[D_{n+}\left(\frac{1}{nh}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_im(x_i)\right) - D_{n-}\left(\frac{1}{nh}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_im(x_i)\right)\right].$$
Notice that if the rectangular kernel is used, this is nothing other than the difference between the intercepts estimated by the linear projection of $m(x)$ on $Z$ above and below the cutoff point, using only the data inside the bandwidth. Note that
$$E\!\left[\frac{1}{nh}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)d_iZ_im(x_i)\right] = \int_{\bar x}^{\bar x+h}\frac1h\,k\!\left(\frac{x - \bar x}{h}\right)Z(x)m(x)f_o(x)\,dx = \int_0^\infty k(u)\,Z(\bar x + uh)m(\bar x + uh)f_o(\bar x + uh)\,du,$$
and similarly
$$\frac{1}{nh}\sum_i k\!\left(\frac{x_i - \bar x}{h}\right)(1 - d_i)Z_im(x_i) \overset{p}{\to} \int_{\bar x-h}^{\bar x}\frac1h\,k\!\left(\frac{x - \bar x}{h}\right)Z(x)m(x)f_o(x)\,dx.$$
Hence the bias term can be approximated by
$$e_1'\left[\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)m(\bar x + uh)f_o(\bar x + uh)\,du - \Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)m(\bar x - uh)f_o(\bar x - uh)\,du\right].$$

Proof. [Proof of Corollary 2] First, note that, if $h \to 0$,
$$\gamma_j^+ = \lim_{h\to0}\int_0^\infty k(u)\,u^jf_o(\bar x + uh)\,du = f_o(\bar x)\int_0^\infty k(u)\,u^j\,du = f_o(\bar x)\gamma_j \tag{C.1}$$
and
$$\delta_j^+ = \lim_{h\to0}\int_0^\infty k^2(u)\,u^j\sigma^2(\bar x + uh)f_o(\bar x + uh)\,du = \sigma^{2+}(\bar x)f_o(\bar x)\int_0^\infty k^2(u)\,u^j\,du = \sigma^{2+}(\bar x)f_o(\bar x)\delta_j, \tag{C.2}$$
and similarly for $\gamma_j^-$ and $\delta_j^-$. Then, for the variance,
$$\lim_{h\to0}\left[\Gamma_+^{*\,-1}\Delta_+^*\Gamma_+^{*\,-1} + \Gamma_-^{*\,-1}\Delta_-^*\Gamma_-^{*\,-1}\right] = (f_o(\bar x)\Gamma)^{-1}\sigma^{2+}(\bar x)f_o(\bar x)\Delta\,(f_o(\bar x)\Gamma)^{-1} + (f_o(\bar x)\Gamma)^{-1}\sigma^{2-}(\bar x)f_o(\bar x)\Delta\,(f_o(\bar x)\Gamma)^{-1} = \frac{\sigma^{2+}(\bar x) + \sigma^{2-}(\bar x)}{f_o(\bar x)}\,e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1.$$
For the bias, approximate $m(x)$ just above $\bar x$ by
$$m(x) = LP^+\!\left(m(x)\text{ on }Z(x)\right) + \frac{1}{(p+1)!}\,m^{(p+1)+}(\bar x)(x - \bar x)^{p+1} + o\!\left(h^{p+1}\right),$$
and similarly for approximating $m(x)$ just below the cutoff. When we evaluate $LP^+\!\left(m(x)\text{ on }Z(x)\right)$ at $\bar x$, we get the intercept $m(\bar x)$ and the "residual" as described above.
A helpful fact is that, by the definition of $Z(x)$,
$$\int_0^\infty k(u)\,Z(\bar x + uh)\,u^{p+1}\,du = \int_0^\infty k(u)\begin{pmatrix}1\\u\\\vdots\\u^p\end{pmatrix}u^{p+1}\,du = \begin{pmatrix}\gamma_{p+1}\\\gamma_{p+2}\\\vdots\\\gamma_{2p+1}\end{pmatrix} \tag{C.3}$$
and
$$\int_0^\infty k(u)\,Z(\bar x - uh)\,u^{p+1}\,du = \int_0^\infty k(u)\begin{pmatrix}1\\-u\\\vdots\\(-u)^p\end{pmatrix}u^{p+1}\,du. \tag{C.4}$$
Note that $\Gamma^{-1}(\gamma_{p+1}, \dots, \gamma_{2p+1})'$ is equal both above and below the cutoff. The bias formula in Theorem 6 is given by
$$e_1'\left[\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)m(\bar x + uh)f_o(\bar x + uh)\,du - \Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)m(\bar x - uh)f_o(\bar x - uh)\,du\right]. \tag{C.5}$$
As discussed in Section 2.5, the main term is just the difference between the intercepts of the linear projections of $k(u)m(x)$ on $k(u)Z(x)$ in the bandwidth above and below the cutoff, which equals the linear projections evaluated at $\bar x$. Hence, plugging in the bias formula for the linear projection, formula (C.5) can be written (plus an $o(h^{p+1})$ term) as
$$\begin{aligned}
&e_1'\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)\frac{m^{(p+1)+}(\bar x)}{(p+1)!}(uh)^{p+1}f_o(\bar x + uh)\,du\\
&\quad - e_1'\Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)\frac{(-1)^{p+1}m^{(p+1)-}(\bar x)}{(p+1)!}(uh)^{p+1}f_o(\bar x - uh)\,du\\
&= \frac{h^{p+1}}{(p+1)!}\left[m^{(p+1)+}(\bar x)\,e_1'\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)u^{p+1}f_o(\bar x + uh)\,du\right.\\
&\qquad\left. - (-1)^{p+1}m^{(p+1)-}(\bar x)\,e_1'\Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)u^{p+1}f_o(\bar x - uh)\,du\right],
\end{aligned}$$
where $\left(m(\bar x), m_{p+(-)}'\right)'$ is the vector of coefficients of the linear projection of $m(x)$ on $Z(x)$ in the bandwidth above (below) the cutoff. If $h \to 0$, using the equalities in equations (C.3), (C.4), (C.2) and (C.1), this becomes
$$\lim_{h\to0}\frac{h^{p+1}}{(p+1)!}\left[m^{(p+1)+}(\bar x) - (-1)^{p+1}m^{(p+1)-}(\bar x)\right]e_1'\Gamma^{-1}\begin{pmatrix}\gamma_{p+1}\\\vdots\\\gamma_{2p+1}\end{pmatrix}.$$

Proof. [Proof of Corollary 3] First, note that, if $h > 0$ and, in the bandwidth around the cutoff, $f_o(x) = f_o(\bar x)$, $\sigma^2(x) = \sigma^2(\bar x)$ and
$$m(x) = m(\bar x) + m^{(1)+}(\bar x)(x - \bar x) + \dots + \frac{1}{p!}m^{(p)+}(\bar x)(x - \bar x)^p + \frac{1}{(p+1)!}m^{(p+1)+}(\bar x)(x - \bar x)^{p+1},$$
then
$$\gamma_j^+ = \int_0^\infty k(u)\,u^jf_o(\bar x + uh)\,du = f_o(\bar x)\int_0^\infty k(u)\,u^j\,du = f_o(\bar x)\gamma_j \tag{C.6}$$
and
$$\delta_j^+ = \int_0^\infty k^2(u)\,u^j\sigma^2(\bar x + uh)f_o(\bar x + uh)\,du = \sigma^{2+}(\bar x)f_o(\bar x)\int_0^\infty k^2(u)\,u^j\,du = \sigma^{2+}(\bar x)f_o(\bar x)\delta_j, \tag{C.7}$$
and similarly for $\gamma_j^-$ and $\delta_j^-$. Then, for the variance,
$$\Gamma_+^{*\,-1}\Delta_+^*\Gamma_+^{*\,-1} + \Gamma_-^{*\,-1}\Delta_-^*\Gamma_-^{*\,-1} = \frac{\sigma^{2+}(\bar x) + \sigma^{2-}(\bar x)}{f_o(\bar x)}\,e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1.$$
For the bias, the strategy is basically the same as in the proof of Corollary 2:
$$m(x) = LP^+\!\left(m(x)\text{ on }Z(x)\right) + \frac{1}{(p+1)!}\,m^{(p+1)+}(\bar x)(x - \bar x)^{p+1},$$
and similarly for approximating $m(x)$ just below the cutoff. When we evaluate $LP^+\!\left(m(x)\text{ on }Z(x)\right)$ at $\bar x$, we get the intercept $m(\bar x)$ and the "residual" as described above. Once again, using formulas (C.3) and (C.4), the bias formula in Theorem 6 is
$$e_1'\left[\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)m(\bar x + uh)f_o(\bar x + uh)\,du - \Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)m(\bar x - uh)f_o(\bar x - uh)\,du\right].$$
As discussed in Section 2.5, the main term is the difference between the intercepts of the linear projections of $k(u)m(x)$ on $k(u)Z(x)$ in the bandwidth above and below the cutoff, which equals the linear projections evaluated at $\bar x$. Hence, plugging in the bias formula for the linear projection,
$$\begin{aligned}
&e_1'\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)\frac{m^{(p+1)+}(\bar x)}{(p+1)!}(uh)^{p+1}f_o(\bar x + uh)\,du\\
&\quad - e_1'\Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)\frac{(-1)^{p+1}m^{(p+1)-}(\bar x)}{(p+1)!}(uh)^{p+1}f_o(\bar x - uh)\,du\\
&= \frac{h^{p+1}}{(p+1)!}\left[m^{(p+1)+}(\bar x)\,e_1'\Gamma_+^{*\,-1}\int_0^\infty k(u)\,Z(\bar x + uh)u^{p+1}f_o(\bar x + uh)\,du - (-1)^{p+1}m^{(p+1)-}(\bar x)\,e_1'\Gamma_-^{*\,-1}\int_0^\infty k(u)\,Z(\bar x - uh)u^{p+1}f_o(\bar x - uh)\,du\right],
\end{aligned}$$
where $\left(m(\bar x), m_{p+(-)}'\right)'$ is the vector of coefficients of the linear projection of $m(x)$ on $Z(x)$ in the bandwidth above (below) the cutoff. Using $f_o(x) = f_o(\bar x)$ and the equalities in formulas (C.3), (C.4), (C.7) and (C.6),
 m(p+1)+ (x) − (−1)p+1 m(p+1)− (x) e1 Γ−1  .  . γ 2p+1 Proof. [Proof of Theorem 8] To obtain the Covariance term for the asymptotic variance of the Fuzzy Regression Discontinuity estimator, note that the covariance will be determined by the expectation of the product of the stochastic terms. The covariance between the estimators for the outcome of interest and the treatment probability will be given by two independent terms, one for each side of the threshold. The upper side is given by   n   1 E e1 Dn+ k  nh i=1 xi − x h  di Z i y i   1 nh n k i=1 xi − x h  di Zi ti  Dn+ e1    Where ti is the dummy variable indicating that the observation has received treatment. In obtaining the asymptotic covariance, the bias term of the estimator can be ignored, hence   E e1 Dn+   1 nh  = E e1 Dn+  1 nh n k i=1 n n xi − x h k i=1 j=1  n d i Z i εi   k i=1 xi − x h k 120 xj − x h xi − x h   di Zi η i  Dn+ e1    di dj Zi Zj εi η j  Dn+ e1    = E e1 Dn+   1 nh  = E e1 Dn+  1 nh n k k i=1 h x h h   di Zi Zi E [εi η i | X = x] Dn+ e1    di Zi Zi σ εη (xi ) Dn+ e1  xi − x 2 di Zi Zi σ εη (xi ) Dn+ e1 h x+h 1 = e1 Dn+ xi − x 2 i=1 n 1 k h = E e1 Dn+ xi − x 2 x−x 2 ZZ σ εη (x)fo (x)dx Dn+ e1 h k where I used the assumption that E εi η j | X = x = 0 for j = i. Similarly for the second term, x−x 2 ZZ σ εη (x)fo (x)dx Dn− e1 h x e1 Dn− 1 k x−h h x−x j σ (x)f (x)dx = εη o h ∞ 2 j 0 k (u) u σ εη (x + uh)fo (x + ρ − uh)du, ∆+ is the (p+1)×(p+1) matrix with (j, l) element ρ+ j+l−2 for j, l = 1, ..., p+1, ρj = x 1 k 2 x−x x−x j σ (x)f (x)dx = (−1)j ∞ k 2 (u) uj σ (x−uh)f (x−uh)du and εη o εη o x−h h 0 h h ρ ∆− is the (p + 1) × (p + 1) matrix with (j, l) element ρ− j+l−2 for j, l = 1, ..., p + 1 Then the x+h 1 k 2 x−x Let ρ+ = x j h h asymptotic covariance is given by Cαθ = e1 ρ ρ Γ∗ −1 ∆+ Γ∗ −1 + Γ∗ −1 ∆− Γ∗ −1 e1 + + − − The asymptotic covariance for the Nadaraya-Watson estimator will be given by the special case when p = 0. 
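The scalar $e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1$ appearing in the limiting variance expressions above can be evaluated numerically for any kernel. A minimal sketch, assuming a one-sided uniform kernel $k(u)=\mathbb{1}[0\le u\le 1]$ and a local linear fit ($p=1$), both illustrative choices of mine rather than the chapter's preferred kernel: here $\gamma_j=\delta_j=1/(j+1)$, so $\Gamma$ and $\Delta$ coincide.

```python
import numpy as np

p = 1  # local linear estimator
# For the uniform one-sided kernel, gamma_j = delta_j = 1/(j+1), so the
# (j, l) entries of both Gamma and Delta are 1/(j + l - 1).
idx = np.arange(p + 1)
Gamma = 1.0 / (idx[:, None] + idx[None, :] + 1.0)
Delta = Gamma.copy()

e1 = np.zeros(p + 1)
e1[0] = 1.0
Ginv = np.linalg.inv(Gamma)
constant = e1 @ Ginv @ Delta @ Ginv @ e1  # = 4 for p = 1
```

Since $k^2=k$ for the uniform kernel, $\Delta=\Gamma$ and the constant collapses to the $(1,1)$ element of $\Gamma^{-1}$, which is 4 here; other kernels change $\gamma_j$ and $\delta_j$ and hence the constant.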
\[
C_{\alpha\theta}=\frac{\int_0^\infty k^2(u)\,\sigma_{\varepsilon\eta}(\bar{x}+uh)f_o(\bar{x}+uh)\,du}{\left(\int_0^\infty k(u)\,f_o(\bar{x}+uh)\,du\right)^2}+\frac{\int_0^\infty k^2(u)\,\sigma_{\varepsilon\eta}(\bar{x}-uh)f_o(\bar{x}-uh)\,du}{\left(\int_0^\infty k(u)\,f_o(\bar{x}-uh)\,du\right)^2}.
\]

Proof. [Proof of Corollary 4] Using the results in equations (C.1) and noting that, if $h\to 0$,
\[
\lim_{h\to 0}\rho_j^+=\lim_{h\to 0}\int_0^\infty k^2(u)\,u^j\sigma_{\varepsilon\eta}(\bar{x}+uh)f_o(\bar{x}+uh)\,du=\sigma_{\varepsilon\eta}^{+}(\bar{x})f_o(\bar{x})\delta_j
\]
and similarly for $\rho_j^-$. Then
\begin{align*}
\lim_{h\to 0}e_1'\left[\Gamma_+^{*-1}\Delta_+^{\rho}\Gamma_+^{*-1}+\Gamma_-^{*-1}\Delta_-^{\rho}\Gamma_-^{*-1}\right]e_1
&=e_1'\left[(f_o(\bar{x})\Gamma)^{-1}\sigma_{\varepsilon\eta}^{+}(\bar{x})f_o(\bar{x})\Delta\,(f_o(\bar{x})\Gamma)^{-1}\right.\\
&\qquad\left.+\,(f_o(\bar{x})\Gamma)^{-1}\sigma_{\varepsilon\eta}^{-}(\bar{x})f_o(\bar{x})\Delta\,(f_o(\bar{x})\Gamma)^{-1}\right]e_1\\
&=\frac{\sigma_{\varepsilon\eta}^{+}(\bar{x})+\sigma_{\varepsilon\eta}^{-}(\bar{x})}{f_o(\bar{x})}\,e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1.
\end{align*}

Proof. [Proof of Corollary 5] The proof follows Corollary 4 very closely. Using the results in equations (C.1) and noting that, if $h>0$ and $f_o(x)=f_o(\bar{x})$, $\sigma_{\varepsilon\eta}(x)=\sigma_{\varepsilon\eta}^{+}(\bar{x})$ above and $\sigma_{\varepsilon\eta}(x)=\sigma_{\varepsilon\eta}^{-}(\bar{x})$ below the cutoff for any $x$ in the range around the cutoff,
\[
\rho_j^+=\int_0^\infty k^2(u)\,u^j\sigma_{\varepsilon\eta}(\bar{x}+uh)f_o(\bar{x}+uh)\,du=\sigma_{\varepsilon\eta}^{+}(\bar{x})f_o(\bar{x})\delta_j
\]
and similarly for $\rho_j^-$. Then
\[
e_1'\left[\Gamma_+^{*-1}\Delta_+^{\rho}\Gamma_+^{*-1}+\Gamma_-^{*-1}\Delta_-^{\rho}\Gamma_-^{*-1}\right]e_1=\frac{\sigma_{\varepsilon\eta}^{+}(\bar{x})+\sigma_{\varepsilon\eta}^{-}(\bar{x})}{f_o(\bar{x})}\,e_1'\Gamma^{-1}\Delta\Gamma^{-1}e_1.
\]

APPENDIX D

Proofs to "Asymptotic Properties of Quantile Regression for Standard Stratified Samples"

Proof. [Proof of Corollary 6] The general form of the variance follows directly from Newey and McFadden (1994), Theorem 7.1. The specific formulas for $A_w$ and $B_w$ are obtained by checking that the proof used by Wooldridge (2001) still holds for the estimator that minimizes the objective function given by equation (3.2). I follow his procedure below. Since within each stratum we have an i.i.d. sequence $\{w_{ij}:i=1,2,\ldots,N_j\}$ for each $j$, a CLT for i.i.d. observations can be applied for each stratum:
\[
N_j^{-1/2}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_{\tau o}\right)-\mu_j\right]\xrightarrow{d}N\left(0,B_j\right),
\]
where
\[
s_{ij}\left(\beta_{\tau o}\right)\equiv\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i,\qquad
\dot{g}_i\equiv\frac{\partial g\left(x_{ij},\beta\right)}{\partial\beta}\Big|_{\beta=\beta_{\tau o}},
\]
\[
\mu_j\equiv E\left[s_{ij}\left(\beta_{\tau o}\right)\mid w\in\mathcal{W}_j\right],\qquad
B_j\equiv Var\left[s_{ij}\left(\beta_{\tau o}\right)\right]=Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid w\in\mathcal{W}_j\right].
\]
As seen in equation (3.4),
\[
\sum_{j=1}^{J}Q_j\mu_j=0.
\]
The score of the objective function, multiplied by $N^{-1/2}$ and evaluated at $\beta_{\tau o}$, can be written as
\begin{align*}
N^{-1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}s_{ij}\left(\beta_{\tau o}\right)
&=N^{-1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_{\tau o}\right)-\mu_j\right]\\
&=\sum_{j=1}^{J}\frac{Q_j}{H_j^{1/2}}\,N_j^{-1/2}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_{\tau o}\right)-\mu_j\right]\xrightarrow{d}N\left(0,B_w\right)
\end{align*}
with
\[
B_w=\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j=\sum_{j=1}^{J}\frac{Q_j^2}{H_j}Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid w\in\mathcal{W}_j\right],
\]
where I used the fact that sampling is random within each stratum and the observations are independent across strata. In the linear case this formula simplifies to
\[
B_w=\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j=\sum_{j=1}^{J}\frac{Q_j^2}{H_j}Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-x_{ij}'\beta_{\tau o}\le 0\right]\right)x_{ij}\mid w\in\mathcal{W}_j\right].
\]
For the outer part of the variance matrix it is enough to note that
\begin{align*}
A_w&=\sum_{j=1}^{J}Q_j\nabla_\beta E\left[\left(\tau-\mathbb{1}\left[y-g\left(x,\beta_{\tau o}\right)\le 0\right]\right)\dot{g}\mid w\in\mathcal{W}_j\right]\\
&=\sum_{j=1}^{J}Q_j\nabla_\beta E\left[E\left[\tau-\mathbb{1}\left[y-g\left(x,\beta_{\tau o}\right)\le 0\right]\mid x,w\in\mathcal{W}_j\right]\dot{g}\mid w\in\mathcal{W}_j\right]\\
&=\sum_{j=1}^{J}Q_j\nabla_\beta E\left[\left(\tau-F_{y|x,w\in\mathcal{W}_j}\left(g\left(x,\beta_{\tau o}\right)\right)\right)\dot{g}\mid w\in\mathcal{W}_j\right]\\
&=\sum_{j=1}^{J}Q_jE\left[f_{y|x,w\in\mathcal{W}_j}\left(g\left(x,\beta_{\tau o}\right)\right)\dot{g}\dot{g}'\mid w\in\mathcal{W}_j\right]\\
&=E\left[f_{y|x}\left(g\left(x,\beta_{\tau o}\right)\right)\dot{g}\dot{g}'\right],
\end{align*}
and that the Jacobian of
\[
\sum_{j=1}^{J}Q_j\frac{1}{N_j}\sum_{i=1}^{N_j}\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau}\right)\le 0\right]\right)\dot{g}_i
\]
converges in probability uniformly to $A_w$. In the linear case this formula simplifies to
\[
A_w=\sum_{j=1}^{J}Q_jE\left[f_{y|x,w\in\mathcal{W}_j}\left(x'\beta_{\tau o}\right)xx'\mid w\in\mathcal{W}_j\right]=E\left[f_{y|x}\left(x'\beta_{\tau o}\right)xx'\right].
\]

Proof. [Proof of Corollary 7] The general form of the variance follows directly from Newey and McFadden (1994), Theorem 7.1. Under exogenous stratification, the following changes need to be made to the definitions used to prove Corollary 6:
(a) $\mu_j\equiv E\left[s_{ij}\left(\beta_{\tau o}\right)\right]=E\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid x\in\chi_j\right]=0$ for every stratum $j$;
(b) $B_j\equiv Var\left[s_{ij}\left(\beta_{\tau o}\right)\right]=Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid x\in\chi_j\right]$.
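The practical role of the $Q_j/H_j$ weights in the stratified estimating equation can be illustrated numerically before proceeding. The sketch below uses a hypothetical two-stratum population of my own construction, with a `weighted_quantile` helper standing in for the full quantile regression problem (an intercept-only case): one stratum is oversampled relative to its population share, the weighted solution roughly recovers the population median, and the unweighted sample median does not.

```python
import numpy as np

def weighted_quantile(y, w, tau):
    """Smallest y whose cumulative weight share reaches tau."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    cw = np.cumsum(w)
    return y[np.searchsorted(cw, tau * cw[-1])]

# Two strata with equal population shares Q1 = Q2 = 0.5,
# but stratum 1 is heavily oversampled: H1 = 0.8, H2 = 0.2.
stratum1 = np.repeat(np.arange(1, 101), 4)   # 400 draws from {1, ..., 100}
stratum2 = np.arange(101, 201)               # 100 draws from {101, ..., 200}
y = np.concatenate([stratum1, stratum2])

# Q_j / H_j weights, constant within each stratum
w = np.concatenate([np.full(400, 0.5 / 0.8), np.full(100, 0.5 / 0.2)])

unweighted = weighted_quantile(y, np.ones_like(w), 0.5)  # -> 63
weighted = weighted_quantile(y, w, 0.5)                  # -> 100
```

The population median of the combined grid $\{1,\ldots,200\}$ is about 100.5; the $Q_j/H_j$-weighted solution sits at 100, while the raw sample median is pulled down to 63 by the oversampled stratum.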
Then, the score of the objective function, multiplied by $N^{-1/2}$ and evaluated at $\beta_{\tau o}$, can be written as
\[
N^{-1/2}\sum_{i=1}^{N}s_{ij}\left(\beta_{\tau o}\right)=\sum_{j=1}^{J}H_j^{1/2}\,N_j^{-1/2}\sum_{i=1}^{N_j}s_{ij}\left(\beta_{\tau o}\right)\xrightarrow{d}N\left(0,B_u\right)
\]
with
\[
B_u=\sum_{j=1}^{J}H_jB_j=\sum_{j=1}^{J}H_jVar\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid x\in\chi_j\right],
\]
where I used the fact that sampling is random within each stratum and the observations are independent across strata. In the linear case this formula simplifies to
\[
B_u=\sum_{j=1}^{J}H_jB_j=\sum_{j=1}^{J}H_jVar\left[\left(\tau-\mathbb{1}\left[y_{ij}-x_{ij}'\beta_{\tau o}\le 0\right]\right)x_{ij}\mid x\in\chi_j\right].
\]
Note that in this case, since $E\left[\mathbb{1}\left[y_{ij}-x_{ij}'\beta_{\tau o}\le 0\right]\mid x\right]$ is the same for every stratum,
\[
Var\left[\left(\tau-\mathbb{1}\left[y_{ij}-g\left(x_{ij},\beta_{\tau o}\right)\le 0\right]\right)\dot{g}_i\mid x\in\chi_j\right]=\tau(1-\tau)E\left[\dot{g}_i\dot{g}_i'\mid x\in\chi_j\right]
\]
and
\[
B_u=\sum_{j=1}^{J}H_j\tau(1-\tau)E\left[\dot{g}_i\dot{g}_i'\mid x\in\chi_j\right]=\tau(1-\tau)\sum_{j=1}^{J}H_jE\left[\dot{g}_i\dot{g}_i'\mid x\in\chi_j\right].
\]
In the linear CQF case,
\[
B_u=\tau(1-\tau)\sum_{j=1}^{J}H_jE\left[xx'\mid x\in\chi_j\right].
\]
For the outer part of the variance matrix it is enough to note that
\begin{align*}
A_u&=\sum_{j=1}^{J}H_j\nabla_\beta E\left[\left(\tau-\mathbb{1}\left[y-g\left(x,\beta_{\tau o}\right)\le 0\right]\right)\dot{g}\mid x\in\chi_j\right]\\
&=\sum_{j=1}^{J}H_j\nabla_\beta E\left[E\left[\tau-\mathbb{1}\left[y-g\left(x,\beta_{\tau o}\right)\le 0\right]\mid x\right]\dot{g}\mid x\in\chi_j\right]\\
&=\sum_{j=1}^{J}H_j\nabla_\beta E\left[\left(\tau-F_{y|x}\left(g\left(x,\beta_{\tau o}\right)\right)\right)\dot{g}\mid x\in\chi_j\right]\\
&=\sum_{j=1}^{J}H_jE\left[f_{y|x}\left(g\left(x,\beta_{\tau o}\right)\right)\dot{g}\dot{g}'\mid x\in\chi_j\right].
\end{align*}
In the linear case this formula simplifies to
\[
A_u=\sum_{j=1}^{J}H_jE\left[f_{y|x}\left(x_{ij}'\beta_{\tau o}\right)xx'\mid x\in\chi_j\right].
\]

Proof. [Proof of Corollary 8] The asymptotic multivariate normality result follows directly from a standard Cramér-Wold device argument applied to the vector of scores for each quantile, separately for each stratum. Let
\[
s_{ij}\left(\beta_\tau\right)=\left(s_{ij}^{1}\left(y,x,\beta_{\tau_1}\right)',\,s_{ij}^{2}\left(y,x,\beta_{\tau_2}\right)',\,\ldots,\,s_{ij}^{p}\left(y,x,\beta_{\tau_p}\right)'\right)'
\]
and
\[
\mu_j\equiv E\left[s_{ij}\left(\beta_\tau\right)\right]=E\left[\left(s_{ij}^{1}\left(y,x,\beta_{\tau_1}\right)',\,s_{ij}^{2}\left(y,x,\beta_{\tau_2}\right)',\,\ldots,\,s_{ij}^{p}\left(y,x,\beta_{\tau_p}\right)'\right)'\right].
\]
Then,
\[
N_j^{-1/2}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_\tau\right)-\mu_j\right]\xrightarrow{d}N\left(0,B_j\right)
\]
with $B_j$ a $p\times p$ variance-covariance matrix with typical element
\[
B_j^{l,k}\equiv Cov\left(s_{ij}\left(\beta_{\tau_l}\right),s_{ij}\left(\beta_{\tau_k}\right)\right)=Cov\left(\left(\tau_l-\mathbb{1}\left[y-g\left(x,\beta_{\tau_l}\right)\le 0\right]\right)\dot{g}^{l},\left(\tau_k-\mathbb{1}\left[y-g\left(x,\beta_{\tau_k}\right)\le 0\right]\right)\dot{g}^{k}\mid w\in\mathcal{W}_j\right).
\]
Then,
\begin{align*}
N^{-1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}s_{ij}\left(\beta_\tau\right)
&=N^{-1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_\tau\right)-\mu_j\right]\\
&=\sum_{j=1}^{J}\frac{Q_j}{H_j^{1/2}}\,N_j^{-1/2}\sum_{i=1}^{N_j}\left[s_{ij}\left(\beta_\tau\right)-\mu_j\right]\xrightarrow{d}N\left(0,B_s\right),
\end{align*}
where $B_s=\left[B_s^{l,k}\right]_{l,k=1,\ldots,p}$ with typical element
\[
B_s^{l,k}=\left[\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j\right]_{l,k},
\]
and the outer part of each term is given by $A_w^{l}\equiv E\left[f_{y|x}\left(g\left(x,\beta_{\tau_l}\right)\right)\dot{g}^{l}\dot{g}^{l\prime}\right]$ and $A_w^{k}\equiv E\left[f_{y|x}\left(g\left(x,\beta_{\tau_k}\right)\right)\dot{g}^{k}\dot{g}^{k\prime}\right]$, as argued in Buchinsky (1998). Hence, $\sqrt{N}\left(\hat{\beta}_\tau-\beta_\tau\right)\overset{a}{\sim}N\left(0,\Lambda_\tau\right)$, where $\Lambda_\tau=\left[\Lambda_\tau^{l,k}\right]_{l,k=1,\ldots,p}$ with typical element defined as
\[
\Lambda_\tau^{l,k}=E\left[f_{y|x}\left(g\left(x,\beta_{\tau_l}\right)\right)\dot{g}^{l}\dot{g}^{l\prime}\right]^{-1}\left[\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j\right]_{l,k}E\left[f_{y|x}\left(g\left(x,\beta_{\tau_k}\right)\right)\dot{g}^{k}\dot{g}^{k\prime}\right]^{-1}
\]
and, in the special case of the linear CQF, $g(x,\beta)=x'\beta$,
\[
\Lambda_\tau^{l,k}=E\left[f_{y|x}\left(x'\beta_{\tau_l}\right)xx'\right]^{-1}\left[\sum_{j=1}^{J}\frac{Q_j^2}{H_j}B_j\right]_{l,k}E\left[f_{y|x}\left(x'\beta_{\tau_k}\right)xx'\right]^{-1}.
\]

BIBLIOGRAPHY

S. Ahn and P. Schmidt. A separability result for GMM estimation, with applications to GLS prediction and conditional moment tests. Econometric Reviews, 14(1):19–34, 1995.

J. Albrecht, A. Björklund, and S. Vroman. Is there a glass ceiling in Sweden? Journal of Labor Economics, 21(1):145–177, 2003.

D. Andrews. Empirical process methods in econometrics. In R. Engle and D. McFadden, editors, Handbook of Econometrics, volume 4, pages 2247–2294. Elsevier, 1994.

J. Angrist and J. Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2009.

J. Angrist, V. Chernozhukov, and I. Fernández-Val. Quantile regression under misspecification, with an application to the US wage structure. Econometrica, 74(2):539–563, 2006.

M. Buchinsky. Recent advances in quantile regression models: a practical guideline for empirical research.
Journal of Human Resources, 33(1):88–126, 1998.

M. Buchinsky. Quantile regression with sample selection: Estimating women's return to education in the US. Empirical Economics, 26(1):87–113, 2001.

M. Cattaneo. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155(2):138–154, 2010.

X. Chen, O. Linton, and I. Van Keilegom. Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71(5):1591–1608, 2003.

X. Chen, H. Hong, and A. Tarozzi. Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36(2):808–843, 2008.

V. Chernozhukov and C. Hansen. An IV model of quantile treatment effects. Econometrica, 73(1):245–261, 2005.

V. Chernozhukov and C. Hansen. Instrumental quantile regression inference for structural and treatment effect models. Journal of Econometrics, 132(2):491–525, 2006.

V. Chernozhukov and C. Hansen. Instrumental variable quantile regression: A robust inference approach. Journal of Econometrics, 142(1):379–398, 2008.

H. Daniels. The asymptotic efficiency of a maximum likelihood estimator. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 151–163. Berkeley: University of California Press, 1961.

J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman & Hall/CRC, 1996.

Y. Fan. Goodness-of-fit tests based on kernel density estimators with fixed smoothing parameters. Econometric Theory, 14(5):604–621, 1998.

J. Hahn, P. Todd, and W. Van der Klaauw. Evaluating the effect of an antidiscrimination law using a regression-discontinuity design. NBER Working Paper Series, 1999.

J. Hahn, P. Todd, and W. Van der Klaauw. Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1):201–209, 2001.

K. Hirano, G. Imbens, and G. Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

K. Hitomi, Y. Nishiyama, and R. Okui. A puzzling phenomenon in semiparametric estimation problems with infinite-dimensional nuisance parameters. Econometric Theory, 24(6):1717–1728, 2008.

P. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233, 1967.

G. Imbens and J. Angrist. Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475, 1994.

G. Imbens and T. Lemieux. Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2):615–635, 2008.

B. James. Probabilidade: Um curso em nível intermediário. IMPA, Rio de Janeiro, third edition, 2004.

R. Koenker. Quantile Regression. Cambridge University Press, 2005.

D. Lee and T. Lemieux. Regression discontinuity designs in economics. Technical report, National Bureau of Economic Research, 2009.

W. Lee. Robust tests of hypotheses in models with M-estimation. Working paper, Department of Economics, National Chung Cheng University, 2008.

J. Machado and J. Mata. Counterfactual decomposition of changes in wage distributions using quantile regression. Journal of Applied Econometrics, 20(4):445–465, 2005.

P. Martins and P. Pereira. Does education reduce wage inequality? Quantile regression evidence from 16 countries. Labour Economics, 11(3):355–371, 2004.

B. Melly. Decomposition of differences in distribution using quantile regression. Labour Economics, 12(4):577–590, 2005.

H. Neave. An improved formula for the asymptotic variance of spectrum estimates. The Annals of Mathematical Statistics, 41(1):70–77, 1970.

W. Newey and D. McFadden. Large sample estimation and hypothesis testing. In R. Engle and D. McFadden, editors, Handbook of Econometrics, volume 4, pages 2111–2245. Elsevier, 1994.

A. Pagan and A. Ullah. Nonparametric Econometrics. Cambridge University Press, 1999.

A. Pakes and D. Pollard. Simulation and the asymptotics of optimization estimators. Econometrica, 57(5):1027–1057, 1989.

D. Pollard. New ways to prove central limit theorems. Econometric Theory, 1(3):295–313, 1985.

J. Porter. Estimation in the regression discontinuity model. Unpublished manuscript, Department of Economics, University of Wisconsin at Madison, 2003.

J. Powell. Estimation of monotonic regression models under quantile restrictions. In W. Barnett, J. Powell, and G. Tauchen, editors, Nonparametric and Semiparametric Methods in Econometrics and Statistics, pages 357–384, 1991.

A. Prokhorov and P. Schmidt. GMM redundancy results for general missing data problems. Journal of Econometrics, 151(1):47–55, 2009.

H. Qian and P. Schmidt. Improved instrumental variables and generalized method of moments estimators. Journal of Econometrics, 91(1):145–169, 1999.

H. White. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1–25, 1982.

H. White. Asymptotic Theory for Econometricians. Academic Press, New York, 1984.

H. White. Estimation, Inference and Specification Analysis. Cambridge University Press, 1996.

J. Wooldridge. Asymptotic properties of weighted M-estimators for standard stratified samples. Econometric Theory, 17(2):451–470, 2001.

J. Wooldridge. Econometric Analysis of Cross Section and Panel Data. MIT Press, first edition, 2002a.

J. Wooldridge. Inverse probability weighted M-estimators for sample selection, attrition, and stratification. Portuguese Economic Journal, 1(2):117–139, 2002b.

J. Wooldridge. Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2):1281–1301, 2007.