ROBUST AND EFFICIENT ESTIMATION OF TREATMENT EFFECTS IN EXPERIMENTAL AND NON-EXPERIMENTAL SETTINGS

By

Akanksha Negi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2020

ABSTRACT

ROBUST AND EFFICIENT ESTIMATION OF TREATMENT EFFECTS IN EXPERIMENTAL AND NON-EXPERIMENTAL SETTINGS

By

Akanksha Negi

Broadly, this dissertation identifies and addresses issues that arise in experimental and observational data contexts for estimating causal effects. In particular, the three chapters in this dissertation focus on issues of consistent and efficient estimation of causal effects using methods that are robust to misspecification of a conditional model of interest.

Chapter 1: Revisiting regression adjustment in experiments with heterogeneous treatment effects

Regression adjustment with covariates in experiments is intended to improve precision over a simple difference in means between the treated and control outcomes. The efficiency argument in favor of regression adjustment has come under criticism lately, with papers such as Freedman (2008a,b) finding no systematic gain in asymptotic efficiency for the covariate-adjusted estimator. This chapter shows that, as in Lin (2013), estimating separate regressions for the control and treated groups is guaranteed to do no worse than both the simple difference-in-means estimator and simply including the covariates in additive fashion. This result appears to be new, and simulations show that the efficiency gains can be substantial. This chapter also discusses some important cases – applicable to binary, fractional, count, and other nonnegative responses – where nonlinear regression adjustment is consistent without any restrictions on the conditional mean functions.

Chapter 2: Robust and efficient estimation of potential outcome means under random assignment

This chapter studies improvements in efficiency for estimating the entire vector of potential outcome means using linear regression adjustment with two or more assignment levels. This chapter shows that using separate regression adjustments for each assignment level is never worse asymptotically than using the subsample averages and that separate regression adjustment generally improves over pooled regression adjustment, except in the obvious case where slope parameters in the linear projections are identical across the different assignment levels. An especially promising direction is to use certain nonlinear regression adjustment methods, which we show to be robust to misspecification of the conditional means. We apply this general potential outcomes framework to a contingent valuation study which seeks to estimate the lower bound mean willingness to pay (WTP) for an oil spill prevention program in California.

Chapter 3: Doubly weighted M-estimation for nonrandom assignment and missing outcomes

This chapter studies the problems of nonrandom assignment and missing outcomes, which, together, undermine the validity of standard causal inference. While the econometrics literature has used weighting to address each issue in isolation, empirical analysis is often complicated by the presence of both. This chapter proposes a new class of inverse probability weighted M-estimators that deal with the two issues by combining propensity score weighting with weighting for missing data.
This chapter also discusses applications of the proposed method for robust estimation of the two prominent causal parameters, namely, the average treatment effect and quantile treatment effects, under misspecification of the framework's parametric components. This chapter also demonstrates the proposed estimator's viability in empirical settings by applying it to the sample of Aid to Families with Dependent Children (AFDC) women from the National Supported Work program compiled by Calónico and Smith (2017).

ACKNOWLEDGEMENTS

I would like to extend a sincere thanks to my advisor, Jeffrey M. Wooldridge, for his unrelenting support, guidance, and insights, which have been invaluable for the development of this project and which have helped me grow into an independent researcher. I would also like to thank my committee members, Steven Haider, Ben Zou, and Kenneth Frank, for discussions from which this work has benefited immensely. A special thanks to Timothy Vogelsang, Jon X. Eguia, Wendun Wang, Mike Conlin, Todd Elder, and Alyssa Carlson for their helpful comments, advice, and encouragement in the years leading up to the job market. This work has evolved significantly thanks to comments and suggestions received from participants at the Econometrics seminar series, Econometrics reading group meetings, and graduate student seminar series at Michigan State University. I am also grateful for the discussions at the Midwest Econometric Group Meetings, International Association for Applied Econometrics, and International Econometrics PhD Conference in Rotterdam.

I would like to acknowledge financial support from the Department of Economics, the Graduate School, the College of Social Sciences, and the Council of Graduate Students, along with the fellowships (the Dissertation Completion Fellowship, the Kelly Research Fellowship, and the Supplemental Research Fellowship) received over the course of these PhD years, which have been instrumental in the successful completion of this work. The administrative help and support received from Lori Jean Nichols and Jay Feight has made navigating this PhD program easier than it would otherwise have been. My warmest gratitude to family and friends who continue to inspire and motivate me to take up new adventures both personally and professionally. Finally, a note of thanks to my dear friend and colleague, Christian Cox, with whom I have shared the vicissitudes of PhD life and who has supported me throughout this journey.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1 REVISITING REGRESSION ADJUSTMENT IN EXPERIMENTS WITH HETEROGENEOUS TREATMENT EFFECTS
  1.1 Introduction
  1.2 Potential outcomes and parameter of interest
  1.3 Random assignment and random sampling
  1.4 Estimators
    1.4.1 Simple difference in means (SDM)
    1.4.2 Pooled regression adjustment (PRA)
    1.4.3 Linear projections and infeasible regression adjustment (IRA)
    1.4.4 Full regression adjustment (FRA)
  1.5 Asymptotic variances and efficiency comparisons
  1.6 Simulations
    1.6.1 Design details
    1.6.2 Discussion of simulation findings
  1.7 Nonlinear regression adjustment
    1.7.1 Pooled nonlinear regression adjustment
    1.7.2 Full nonlinear regression adjustment
    1.7.3 Simulations
  1.8 Concluding remarks
CHAPTER 2 ROBUST AND EFFICIENT ESTIMATION OF POTENTIAL OUTCOME MEANS UNDER RANDOM ASSIGNMENT
  2.1 Introduction
  2.2 Potential outcomes, random assignment, and random sampling
  2.3 Subsample means and linear regression adjustment
    2.3.1 Subsample means
    2.3.2 Full regression adjustment
    2.3.3 Pooled regression adjustment
  2.4 Comparing the asymptotic variances
    2.4.1 Comparing FRA to subsample means
    2.4.2 Full RA versus pooled RA
  2.5 Nonlinear regression adjustment
    2.5.1 Full regression adjustment
    2.5.2 Pooled regression adjustment
  2.6 Applications
    2.6.1 Treatment effects with multiple treatment levels
    2.6.2 Difference-in-Differences designs
    2.6.3 Estimating lower bound mean willingness-to-pay
    2.6.4 Application to California oil data
  2.7 Monte Carlo
    2.7.1 Population models
  2.8 Discussion
  2.9 Conclusion
CHAPTER 3 DOUBLY WEIGHTED M-ESTIMATION FOR NONRANDOM ASSIGNMENT AND MISSING OUTCOMES
  3.1 Introduction
  3.2 Doubly weighted framework
    3.2.1 Potential outcomes and the population models
    3.2.2 The unweighted M-estimator
    3.2.3 Ignorable missingness and unconfoundedness
    3.2.4 Population problem with double weighting
  3.3 Estimation
    3.3.1 Estimated weights using binary response MLE
    3.3.2 Doubly weighted M-estimator
  3.4 Asymptotic theory
    3.4.1 Consistency
    3.4.2 Asymptotic normality
    3.4.3 Efficiency gain with estimated weights
  3.5 Some feature of interest is correctly specified
  3.6 Robust estimation
    3.6.1 Average treatment effect
    3.6.2 Monte Carlo evidence
    3.6.3 Quantile effects
    3.6.4 Monte Carlo evidence
  3.7 Application to Calónico and Smith (2017)
    3.7.1 Results
  3.8 Conclusion
APPENDICES
  APPENDIX A FIGURES FOR CHAPTER 1
  APPENDIX B TABLES FOR CHAPTER 1
  APPENDIX C PROOFS FOR CHAPTER 1
  APPENDIX D TABLES FOR CHAPTER 2
  APPENDIX E PROOFS FOR CHAPTER 2
  APPENDIX F AUXILIARY RESULTS FOR CHAPTER 3
  APPENDIX G APPLICATION APPENDIX FOR CHAPTER 3
  APPENDIX H FIGURES FOR CHAPTER 3
  APPENDIX I TABLES FOR CHAPTER 3
  APPENDIX J PROOFS FOR CHAPTER 3
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Description of the data generating processes
Table 2.1: Combinations of means and QLLFs to ensure consistency
Table B.1: QLL and mean function combinations
Table B.2: Bias and standard deviation for N=100
Table B.3: Bias and standard deviation for N=500
Table B.4: Bias and standard deviation for N=1000
Table B.5: Bias and standard deviation for binary outcome
Table B.6: Bias and standard deviation for non-negative outcome
Table D.1: Summary of yes votes at different bid amounts
Table D.2: Lower bound mean willingness to pay estimate using ABERS and FRA estimators
Table D.3: Bias and standard deviation of RA estimators for DGP 1 across four assignment vectors
Table D.4: Bias and standard deviation of RA estimators for DGP 2 across four assignment vectors
Table D.5: Bias and standard deviation of RA estimators for DGP 3 across four assignment vectors
Table I.1: An illustration of the observed sample (✓ means observed, ? means missing)
Table I.2: Different scenarios under ignorability and unconfoundedness
Table I.3: Different scenarios under exogeneity of missingness and unconfoundedness
Table I.4: When is unweighted more efficient than weighted assuming ignorability and unconfoundedness and D(y(g)|x) correctly specified?
Table I.5: When the conditional mean model is correctly specified
Table I.6: Misspecified conditional mean model
Table I.7: A) Both probability models are correct
Table I.8: B) When missing data probability is misspecified and propensity score is correct
Table I.9: C) When missing data probability is correct and propensity score is misspecified
Table I.10: D) Both probability models are misspecified
Table I.11: Proportion of missing earnings in the experimental sample
Table I.12: Proportion of missing data in the PSID samples
Table I.13: Unweighted and weighted pre-training earnings comparisons using NSW and PSID comparison groups
Table I.14: Estimation summary for ATE under different cases of misspecification
Table I.15: Estimation summary for quantile effects under different cases of misspecification
Table I.16: Covariate means and p-values from the test of equality of two means, by treatment status
Table I.17: Covariate means and p-values from the test of equality of two means for the observed and missing samples
Table I.18: Unweighted and weighted earnings comparisons and estimated training effects using NSW and PSID comparison groups
Table I.19: Unconditional quantile treatment effect (UQTE) using PSID-1 comparison group
Table I.20: Unconditional quantile treatment effect (UQTE) using PSID-2 comparison group

LIST OF FIGURES

Figure A.1: Quadratic design, continuous covariates (mild heterogeneity)
Figure A.2: Quadratic design, continuous covariates (strong heterogeneity)
Figure A.3: Quadratic design, one binary covariate (mild heterogeneity)
Figure A.4: Quadratic design, one binary covariate (strong heterogeneity)
Figure A.5: Probit design, continuous covariates (mild heterogeneity)
Figure A.6: Probit design, continuous covariates (strong heterogeneity)
Figure A.7: Probit design, one binary covariate (mild heterogeneity)
Figure A.8: Probit design, one binary covariate (strong heterogeneity)
Figure A.9: Binary outcome, Bernoulli QLL with logistic mean
Figure A.10: Non-negative outcome, Poisson QLL with exponential mean
Figure G.1: Kernel density plots for the composite probability
Figure G.2: Kernel density plots for the estimated propensity score
Figure G.3: Kernel density plots for the estimated missing outcomes probability
Figure H.1: Relative estimated bias in UQTE estimates at different quantiles of the 1979 earnings distribution
Figure H.2: Empirical distribution of estimated ATE for N=5000
Figure H.3: Estimated CQTE with true CQTE as a function of x1, N=5000
Figure H.4: Bias in estimated linear projection relative to true linear projection as a function of x1 using Angrist et al. (2006b) methodology, N=5000
Figure H.5: Empirical distribution of estimated UQTE for N=5000

INTRODUCTION

Many questions in health, labor, development, and other areas of applied microeconomics are questions of causal inference. Establishing causality not only helps to quantify cause-and-effect relationships but also helps to formulate counterfactual scenarios, which can be useful for informing policy and contributing to policy debates. Randomized experiments are generally considered to be a reasoned basis for conducting causal inference since the experimental design creates groups that are comparable on average and do not differ systematically in measured and unmeasured dimensions. On the other hand, observational or non-experimental data make it difficult to isolate causal mechanisms due to the absence of random assignment, which introduces overt and hidden biases into treatment-control comparisons. This dissertation studies issues of consistency and efficiency in estimating causal effects under experimental and non-experimental data settings. Given this objective, the focus of this dissertation is on methods that do not rely on correct functional form assumptions about some conditional feature of interest to achieve consistent or efficient estimation of treatment effects.

Chapter 1 of this dissertation revisits the argument for regression adjustment, which is routinely employed to improve precision of the estimated average treatment effect in experiments over a simple difference in means. As has been noted in the statistics literature by Freedman (2008a,b) and Lin (2013), such an argument may be misguided. This is because additively controlling for covariates in a regression is not guaranteed to produce more precise estimates than a simple difference in means estimate. A similar result, however, is formally lacking in the economics literature, where random sampling from an infinite population is the mainstream asymptotic paradigm. Chapter 2 builds on this idea of regression (or covariate) adjustment in experiments and extends it to the case of more than two treatment levels. This is a more general setting which is able to encompass a variety of applications such as experiments with multiple treatment levels, difference-in-differences designs, and willingness to pay studies. Chapter 3 relaxes the experimental setting to consider treatment effects estimation with observational data when some of the observed outcomes are missing. This is a widely encountered problem in most micro-econometric empirical analyses. In this setting, consistent estimation of treatment effects is complicated unless the researcher imposes structure on the treatment assignment and missing data mechanisms.
Chapter 1 in this dissertation acknowledges the criticism leveled against simple regression adjustment for improving precision of the estimated average treatment effect. This chapter then proposes a regression estimator, which allows for separate estimation of the slopes in the treatment and control groups, that is guaranteed to be no less precise asymptotically than the simple difference in means and pooled regression adjustment estimators. This result is easily observed in the simulations, where we consider different data generating processes and a range of assignment probabilities. Chapter 2 extends this efficiency result of the separate slopes regression estimator to the case of G arbitrary treatment levels and provides an empirical illustration and simulation results which provide favorable evidence for the separate slopes estimator. Finally, the third chapter proposes an inverse probability weighted estimator that weights twice, once for the problem of nonrandom assignment and once for missing outcomes.

The methodology used in each of the chapters does not rely on correct specification of some conditional feature to propose efficient or consistent estimators for the causal parameters in question. In the first two chapters, efficiency gains with the separate slopes estimator are not based on correct conditional mean specification, but on linear projections, which are consistently estimated under mild assumptions. Similarly, the doubly weighted IPW estimator proposed in the third chapter is consistent even when the outcome model is misspecified, as long as the weights are correct. This dissertation aims to provide empiricists with a wide range of treatment effect settings where the methods studied here may prove useful for conducting sound and robust causal analysis.

CHAPTER 1
REVISITING REGRESSION ADJUSTMENT IN EXPERIMENTS WITH HETEROGENEOUS TREATMENT EFFECTS†

†This work is joint with Jeffrey M. Wooldridge and is unpublished.

1.1 Introduction

The role of covariates in randomized experiments has been studied since the early 1930s [Fisher (1935)].¹ When compared with the simple difference-in-means (SDM) estimator, the main benefit of adjusting for covariates is that the precision of the estimated average treatment effect (ATE) can be improved if the covariates are sufficiently predictive of the outcome [Cochran (1957); Lin (2013)]. Nevertheless, regression adjustment is not uniformly accepted as being preferred over the SDM estimator. For example, Freedman (2008a,b) argues against using regression adjustment because it is not guaranteed to be unbiased unless one makes the strong assumption that the conditional expectation functions are correctly specified (and linear in parameters).

¹ In the statistics and biomedical literature, such variables are also known as prognostic factors or concomitant variables. In the econometrics of program evaluation literature, they are known as pre-treatment covariates. As the name suggests, they are ideally measured before the treatment is administered.

It is important to understand that there are two different, potentially valid criticisms of regression adjustment (RA). The first is the issue of bias just mentioned: unlike the SDM estimator, which is unbiased conditional on having some units in both the control and treatment groups, RA estimators are only guaranteed to be consistent, not unbiased. Therefore, in experiments with small sample sizes, one might be willing to forego potential efficiency gains in order to ensure an unbiased estimator of the average treatment effect. As Bruhn and McKenzie (2009) point out, samples of 100 to 500 individuals, or 20 to 100 schools or health clinics, are fairly common in experiments conducted in development economics. In situations where unbiased estimation is the highest priority, the current paper has little
to add, other than to provide simulation evidence that using our preferred RA procedure often results in small bias. Henceforth, we are not interested in small-sample problems as a criticism of RA. More and more economic experiments, especially those conducted online, include enough units to make consistency and asymptotic efficiency relevant criteria for evaluating estimators of ATEs. In cases where effect sizes are small (but important in the aggregate), improving precision can be important even when sample sizes seem fairly large.

A second criticism of RA methods, and the one most relevant for this paper, is that RA methods may not improve over the SDM estimator even if we focus on asymptotic efficiency. Freedman (2008a,b) and Lin (2013) level this criticism of RA when the covariates are simply added as controls along with a treatment indicator in a linear regression analysis. Freedman (2008b), for example, finds no systematic efficiency gain from using covariates. Lin (2013) provides an in-depth discussion of how simply adding covariates will not necessarily produce efficiency gains when treatment effects are heterogeneous. Both Freedman and Lin operate under a finite population paradigm where all population units are observed in the sample. Therefore, uncertainty in the estimators is due to the assignment into treatment and control, and not due to sampling from a population [Abadie et al. (2017b), Abadie et al. (2017a), and Rosenbaum (2002) discuss similar settings].

Random sampling from a population is still an important setting for empirical work, and the findings in Freedman (2008b) and Lin (2013) do not extend to the random sampling scenario, which involves accounting for both types of uncertainty, sampling-based and design-based. Our paper will not have more to say about the differences between these two types of uncertainty; for a deeper discussion of sampling-based and design-based uncertainty, see Abadie et al. (2017b). Imbens and Rubin (2015) study linear regression adjustment in the same sampling setting that we use here: independent and identically distributed (i.i.d.) draws from a population. However, they state efficiency results only in the case that the population means of the covariates are known, even though only a random sample is available. In addition, Imbens and Rubin only consider linear regression adjustment.

Regression adjustment in experiments has also been studied in the statistics literature by papers like Yang and Tsiatis (2001), Tsiatis et al. (2008), Ansel et al. (2018), and Berk et al. (2013) for the case of linear adjustment and by Rosenblum and Van Der Laan (2010) and Bartlett (2018) for the case of nonlinear regression adjustment. While some of the results derived in these papers overlap with what we show, the expressions and exposition are not as transparent and simple as ours. For nonlinear regression adjustment, we establish consistency by distinguishing between pooled and separate regression adjustment, which is missing from the discussion in Rosenblum and Van Der Laan (2010) and Bartlett (2018). In this paper, we revisit regression adjustment and resolve some outstanding issues in the literature.
We cover the standard case of i.i.d. draws from an underlying population, so that randomness comes from sampling error as well as assignment into control and treatment groups. Further, unlike Imbens and Rubin (2015), we consider the realistic case where the population means of the covariates must be estimated using the sample averages. In the case of linear regression adjustment, we study four estimators: the SDM estimator, the pooled regression adjusted (PRA) estimator, the full regression adjusted (FRA) estimator – which uses separate regressions for the control and treatment groups – and what we call the infeasible regression adjusted (IRA) estimator, which is like the FRA estimator but assumes the population means of the covariates are known. We include IRA for completeness, as it is studied in Imbens and Rubin (2015), and doing so allows us to characterize the lost efficiency due to having to estimate the population means.

Our most important results in the linear regression case can be easily summarized. First, even when accounting for the sampling error of the covariates in estimating the population means, using separate linear regressions for the control and treatment groups leads to an ATE estimator that is never less precise (asymptotically) than the SDM estimator and the PRA estimator. Unless small sample bias is a concern, there is no reason not to use full regression adjustment. Further, there are two interesting cases where there will be no precision gain when using full RA compared with pooled RA. The first is when there is no heterogeneity in the slopes of the linear projections of the potential outcomes (although there could be in the unobservables). In this case, it is not surprising that using pooled RA is sufficient to capture the efficiency gains of using covariates. The second important case where there is no additional gain from FRA is when the design is balanced: the probability of being in the treatment group is equal to 0.5. Therefore, if one has imposed a balanced design and is considering only linear regression adjustment, the pooled method is probably preferred (due to conserving degrees of freedom). A final result, which is pretty obvious, is that there is no efficiency gain when the covariates are not predictive of the potential outcomes; then, SDM is asymptotically efficient. We want to emphasize that there is no (asymptotic) cost to doing the regression adjustment, whether PRA or FRA: each estimator has the same asymptotic variance. In situations where one has good predictors of the outcome, regression adjustment can be attractive. Our simulation study illustrates the special cases derived from our theoretical results.

Another important contribution of our paper is to characterize situations where nonlinear regression adjustment preserves consistency of average treatment effect estimators without imposing additional assumptions. In particular, we show that when the response is binary, fractional, a count, or another nonnegative outcome, certain kinds of nonlinear regression adjustment consistently estimate the average treatment effect. Our simulations for the case of binary and non-negative responses suggest that nonlinear RA, especially the full version, can produce sampling variances that are smaller than SDM and also linear regression adjustment. In terms of bias, nonlinear FRA (NFRA) is comparable to SDM, which we know is unbiased.

The rest of the paper is organized as follows.
Section 1.2 briefly introduces the potential outcomes setting and defines the population average treatment effect – the parameter of interest in this paper. Section 1.3 discusses the random assignment mechanism and the random sampling assumption. Section 1.4 is important and describes linear regression adjustment in the population in terms of linear projections, which are consistently estimated by ordinary least squares (OLS) given a random sample. Importantly, we need not impose any assumptions on the conditional mean functions of the potential outcomes. Section 1.5 presents the asymptotic variances of the four linear estimators and ranks them on the basis of asymptotic efficiency. We also characterize the cases under which estimating two separate regressions does not improve efficiency over SDM or PRA (or both). Section 1.6 presents Monte Carlo simulations that compare the bias and root mean squared error (RMSE) of the estimators for eight different data generating processes. In Section 1.7 we draw on results from the doubly robust estimation literature and characterize the nonlinear RA estimators – both pooled and full – that produce consistent estimators of the ATE. Our simulations in this section show that nonlinear methods have modest bias while considerably improving efficiency compared with both SDM and linear RA methods. Section 1.8 concludes the paper with a discussion of some future research topics. All proofs are included in the appendix.

1.2 Potential outcomes and parameter of interest

Our framework is the standard Neyman-Rubin causal model, involving potential (or counterfactual) outcomes. Let $\{Y(0), Y(1)\}$ be the two potential outcomes corresponding to the control and treatment states, respectively, where $\{Y(0), Y(1)\}$ has a joint distribution in the population. The setup is nonparametric in that we make no assumptions about the distribution of $\{Y(0), Y(1)\}$ other than the finite moment conditions needed to apply standard asymptotic theory. In particular, $\{Y(0), Y(1)\}$ may be discrete, continuous, or mixed random variables. For example, $Y(0)$ and $Y(1)$ can be binary employment indicators for nonparticipation and participation in a job training program. Or, they could be the fraction of assets held in the stock market, or counts of the number of hospital visits taken by a patient. Define the means of the potential outcomes as

$$\mu_0 = E[Y(0)], \qquad \mu_1 = E[Y(1)].$$

The parameter of interest is the population average treatment effect (PATE),

$$\tau = E[Y(1) - Y(0)] = \mu_1 - \mu_0.$$

As has often been noted in the literature, the problem of causal inference is essentially a missing data problem. We only observe one of the outcomes, $Y(0)$ or $Y(1)$, once the treatment, represented by the Bernoulli random variable $W$, is determined. Specifically, the observed $Y$ is defined by

$$Y = \begin{cases} Y(0), & \text{if } W = 0 \\ Y(1), & \text{if } W = 1 \end{cases} \qquad (1.1)$$

It is also useful to write $Y$ as

$$Y = (1 - W) \cdot Y(0) + W \cdot Y(1). \qquad (1.2)$$

1.3 Random assignment and random sampling

In determining an appropriate method to estimate $\tau$, we need to know how the treatment, $W$, is assigned. In this paper, we assume that $W$ is independent of the potential outcomes as well as the observed covariates, which we write as $X = (X_1, X_2, \ldots, X_K)$. Formally, the random assignment assumption is as follows.

Assumption 1.3.1. The binary assignment indicator, $W$, is a Bernoulli trial and is independent of $\{Y(0), Y(1), X\}$, where $X = (X_1, X_2, \ldots, X_K)$. Mathematically, $W \perp \{Y(0), Y(1), X\}$.
Letting $\rho = P(W = 1)$ be the probability of being assigned into treatment, we assume that $0 < \rho < 1$. The assumption of random assignment implies that there are many consistent estimators of $\tau$. The goal in this paper is to rank, as much as possible, commonly used estimators of $\tau$ in terms of asymptotic efficiency.

As mentioned in the introduction, both early [Neyman (1923) and Fisher (1935)] and recent [Freedman (2008a,b) and Lin (2013)] approaches to estimating ATEs assume that the entire population is the sample. Therefore, the only stochastic element of the setup is the assignment of the treatment, which is randomized. Such a perspective rules out any uncertainty stemming from unobservability of the entire population (also termed sampling-based uncertainty) and only allows uncertainty that arises due to the experimental design (also known as design-based uncertainty). Here, we adopt the assumption commonly used in studying various estimators in statistics and econometrics.

Assumption 1.3.2. For a nonrandom integer $N$, $\{(Y_i(0), Y_i(1), W_i, X_i) : i = 1, 2, \ldots, N\}$ are independent and identically distributed draws from the population.

Given the random sampling assumption, standard asymptotic theory for i.i.d. sequences of random vectors can be applied, where $N$ tends to infinity. We assume in what follows that at least the second moments of the potential outcomes and covariates are finite so that, when we use regression methods, we can apply the law of large numbers and central limit theorem. We do not state these moment assumptions explicitly, as they cannot be checked anyway.

For each unit $i$ drawn from the population, the treatment effect is $Y_i(1) - Y_i(0)$, which we can write as

$$Y_i(1) - Y_i(0) = \tau + [V_i(1) - V_i(0)],$$

where $V_i(w) = Y_i(w) - \mu_w$ for $w \in \{0, 1\}$. The treatment effects are homogeneous when the unit-specific components, $V_i(1) - V_i(0)$, are identically zero for any random draw $i$.

1.4 Estimators

We now carefully describe the estimators that we use in the linear regression context.

1.4.1 Simple difference in means (SDM)

Random assignment provides the luxury of using an estimator available from basic statistics. This estimator dates back to Neyman (1923) in the context of causal inference using potential outcomes. Let $W_i$ be the treatment indicator for unit $i$. Then $N_0 = \sum_{i=1}^N (1 - W_i)$ and $N_1 = \sum_{i=1}^N W_i$ are the number of control and treated units in the sample, respectively. These are random variables. When $N_0, N_1 > 0$ we can define the sample averages for the control and treated units:

$$\bar{Y}_0 = N_0^{-1} \sum_{i=1}^N (1 - W_i) Y_i \qquad (1.3)$$
$$\bar{Y}_1 = N_1^{-1} \sum_{i=1}^N W_i Y_i, \qquad (1.4)$$

where $Y_i$ is the observed outcome for unit $i$. The simple difference-in-means estimator is

$$\hat{\tau}_{SDM} = \bar{Y}_1 - \bar{Y}_0. \qquad (1.5)$$

Under random assignment and conditional on $N_0, N_1 > 0$, $\hat{\tau}_{SDM}$ is unbiased for $\tau$ – see, for example, Imbens and Rubin (2015). Further, $\hat{\tau}_{SDM}$ is consistent for $\tau$ as $N \to \infty$ when $0 < \rho < 1$, as we assume. As is well known, $\hat{\tau}_{SDM}$ can be obtained as the coefficient on $W_i$ in the simple regression

$$Y_i \text{ on } 1, W_i, \quad i = 1, \ldots, N. \qquad (1.6)$$

See, for example, Imbens and Rubin (2015).
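To make the equivalence between (1.5) and (1.6) concrete, here is a minimal sketch, not part of the original paper; the data generating process and all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up experimental sample, purely for illustration (true tau = 2)
N, rho = 1000, 0.4
W = rng.binomial(1, rho, size=N)
Y = 1.0 + 2.0 * W + rng.normal(size=N)

# Simple difference in means, equation (1.5)
tau_sdm = Y[W == 1].mean() - Y[W == 0].mean()

# Numerically identical to the OLS coefficient on W in regression (1.6)
Z = np.column_stack([np.ones(N), W])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
assert np.isclose(tau_sdm, coef[1])
```

The equality is exact, not asymptotic: with an intercept and a single binary regressor, the OLS slope is algebraically the difference in subsample means.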
1.4.2 Pooled regression adjustment (PRA)

When we have covariates that (hopefully) predict the outcome $Y$, the simplest way to use those covariates is to add them to the simple regression in (1.6). As documented in Słoczyński (2018), adding covariates along with a binary treatment indicator is still very common in estimating treatment effects, whether one has randomized assignment or assumes unconfoundedness conditional on the covariates. Specifically, the regression is

$$Y_i \text{ on } 1, W_i, X_i, \quad i = 1, 2, \ldots, N.$$

The coefficient on $W_i$ is the estimator of $\tau$; we call this estimator "pooled regression adjustment" (PRA) and denote it $\hat{\tau}_{PRA}$. The name "pooled" emphasizes that we are pooling across the control and treatment groups in imposing common coefficients on the vector of covariates, $X_i$. In other words, the slopes are the same for $W = 0$ and $W = 1$. It is important to understand that we are making no assumption about whether the coefficients in an underlying linear model in the population are the same, or even whether there is an underlying linear model representing a conditional expectation. This will become clear in the next subsection when we formally describe linear projections.

As is well known, adding the variables $X_i$ to the simple regression does not change the probability limit provided $W_i$ and $X_i$ are uncorrelated, which follows under random assignment. However, it is not always the case that adding $X_i$ is a good idea, even if we focus on asymptotic efficiency: it may or may not improve asymptotic efficiency compared with the SDM estimator.

1.4.3 Linear projections and infeasible regression adjustment (IRA)

The SDM estimator is an example of an estimator that can be written as $\hat{\tau} = \hat{\mu}_1 - \hat{\mu}_0$, where, in the case of SDM, $\hat{\mu}_0$ and $\hat{\mu}_1$ are the sample averages of the control and treated groups, respectively. But there are other ways to consistently estimate $\mu_0$ and $\mu_1$ when we have covariates $X$, represented as a $1 \times K$ vector. In particular, define the linear projections of the potential outcomes on the vector of covariates as

$$L[Y(0)\,|\,1, X] = \alpha_0 + X\beta_0 \qquad (1.7)$$
$$L[Y(1)\,|\,1, X] = \alpha_1 + X\beta_1, \qquad (1.8)$$

where the expressions for $\alpha_0$, $\alpha_1$, $\beta_0$, and $\beta_1$ can be found in Wooldridge (2010), Section 2.3. As discussed in Wooldridge, the linear projections always exist and are well defined provided $Y(w)$ and the elements of $X$ have finite second moments. Any of the random variables can be discrete, continuous, or mixed. The elements of $X$ can include the usual functional forms, such as logarithms, squares, and interactions. The requirement for the coefficients in the linear projections (LPs) to be unique is simply that the variance-covariance matrix of $X$, $\Omega_X$, is nonsingular, an assumption that rules out perfect collinearity in the population.

It is often helpful to slightly rewrite the LPs. Define the $1 \times K$ vector of population means of $X$ as $\mu_X = E(X)$, and let $\dot{X} = X - \mu_X$ be the deviations from the population mean. Then we can write the linear projections in terms of the means $\mu_0$ and $\mu_1$ as

$$L[Y(0)\,|\,1, X] = \mu_0 + \dot{X}\beta_0 \qquad (1.9)$$
$$L[Y(1)\,|\,1, X] = \mu_1 + \dot{X}\beta_1 \qquad (1.10)$$

The two representations make it clear that the PATE, $\tau$, can be expressed as

$$\tau = \mu_1 - \mu_0 = (\alpha_1 - \alpha_0) + \mu_X(\beta_1 - \beta_0).$$

Therefore, if we have consistent estimators of $\alpha_0$, $\alpha_1$, $\beta_0$, $\beta_1$, and $\mu_X$, then we can consistently estimate $\tau$. Importantly, as discussed in Wooldridge (2010), Chapter 4, ordinary least squares estimation using a random sample always consistently estimates the parameters in a population linear projection (subject to the mild finite second moment assumptions and the nonsingularity of $\Omega_X$). This is true regardless of the nature of $Y(w)$ or $X$. This insight is critical for understanding why regression adjustment produces consistent estimators of $\tau$, and for the asymptotic efficiency arguments later on.
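The point that OLS recovers the linear projection regardless of the true conditional mean can be illustrated numerically. The sketch below is ours, not the paper's; the quadratic DGP is invented for the example. With a scalar covariate, the population LP slope is $\text{Cov}(X, Y)/\text{Var}(X)$, and OLS on a deliberately "misspecified" linear regression recovers it in large samples:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200_000
X = rng.normal(1.0, 1.5, size=n)
Y = 0.5 + X + 0.8 * X**2 + rng.normal(size=n)   # E[Y|X] is quadratic

# Linear projection slope: Cov(X, Y) / Var(X)
beta_lp = np.cov(X, Y)[0, 1] / X.var()

# OLS of Y on (1, X) targets the same slope, despite "misspecification"
Z = np.column_stack([np.ones(n), X])
b_ols, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(beta_lp, b_ols[1])   # nearly identical at this sample size
```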
This insight is critical for understanding why regression adjustment produces consistent estimators of τ, and for the asymptotic efficiency arguments later on. Unlike in Imbens and Wooldridge 12 (2009), we do not assume that the linear projection is the same as the conditional mean. We are silent on the conditional mean functions E(cid:2)Y (0)|X(cid:3) and E(cid:2)Y (1)|X(cid:3). Given random assignment, consistent estimators of the LP coefficients are obtained from the separate regressions Yi on 1, Xi using Wi = 0 Yi on 1, Xi using Wi = 1 (1.11) (1.12) Wooldridge (2010) Chapter 19 formally shows that the linear projections are consistently esti- mated under the assumption that selection – in this case, Wi – is independent of(cid:2)Xi, Yi(w)(cid:3). This is sometimes called the “missing completely at random” (MCAR) assumption in the missing data literature [for example, Little and Rubin (2002)]. If we assume that the vector of population means µX is known, a consistent estimator of τ is ˆτIRA = (ˆα1 − ˆα0) + µX( ˆβ1 − ˆβ0), where ˆα0, ˆα1, ˆβ0, and ˆβ1 are the OLS estimators from the separate regressions. We call this the “infeasible regression adjustment” (IRA) estimator because it depends on µX, which is likely to be unknown in our context with random sampling. From the algebra of ordinary least squares (OLS) it is easily shown that ˆτIRA can be obtained as the coefficient on Wi in the regression that includes a full set of interactions between Wi and ˙Xi, namely, Yi on 1, Wi, Xi, Wi · ˙Xi, i = 1, . . . , N. The demeaning of the covariates ensures that the coefficient on Wi is ˆτIRA. This regression is also convenient for obtaining a valid standard error for ˆτIRA, as the usual Eicker-Huber- White heteroskedasticity-robust standard error is asymptotically valid. In the case where the linear projections are also the conditional expectations – that is, E(cid:2)Y (w)|X(cid:3) = L(cid:2)Y (w)|1, X(cid:3), w ∈ {0, 1} – ˆα0, ˆα1, ˆβ0, and ˆβ1 are unbiased conditional on 13 {Xi : i = 1, 2, . . . , N}, provided we rule out perfect collinearity in the control and treated subsamples. Then, ˆτIRA would also be unbiased conditional on {Xi : i = 1, 2, . . . , N}, and unbiased unconditionally if its expectation exists. But linearity of the conditional expec- tations is much too strong an assumption, and it is clearly not needed for unbiasedness or consistency of the SDM estimator. Therefore, in what follows, we make no assumptions about E(cid:2)Y (w)|X(cid:3). We simply assume enough moments are finite and rule out perfect collinearity in X in order for the linear projections to exist. 1.4.4 Full regression adjustment (FRA) ¯X = N−1(cid:80)N We can easily make the IRA estimator feasible by replacing µX with the sample average, i=1 Xi. This leads to what we will call the “full regression adjustment” (FRA) estimator: ˆτF RA = (ˆα1 − ˆα0) + ¯X( ˆβ1 − ˆβ0). This estimator can also be obtained as the OLS coefficient on Wi but from the regression Yi on 1, Wi, Xi, Wi · ¨Xi, i = 1, 2, . . . , N , where ¨Xi = Xi − ¯X, i = 1, 2, . . . , N are the demeaned covariates using the sample average. This estimator is always available given a sample (cid:8)(Yi, Wi, Xi) : i = 1, 2, . . . , N(cid:9). Generally, ˆτF RA (cid:54)= ˆτIRA. Like ˆτIRA, we can only conclude that ˆτF RA is consistent, although it will be unbiased under essentially the same assumptions discussed for ˆτIRA. In the next section, we will rank the four estimators, to the extent possible, in terms of asymptotic efficiency. 
1.5 Asymptotic variances and efficiency comparisons

We first derive the asymptotic variances of the SDM, PRA, IRA, and FRA estimators in the general case of heterogeneous treatment effects. Naturally, the formulas include homogeneous treatment effects as a special case. We then compare the asymptotic variances in general and in special cases.

In order to obtain the asymptotic variances, we need to study the linear projections of the potential outcomes on the covariates more closely. Recall that we can write the potential outcomes as

$$Y(0) = \mu_0 + V(0) \qquad (1.13)$$
$$Y(1) = \mu_1 + V(1), \qquad (1.14)$$

where $V(0)$ and $V(1)$ have zero means, by construction. Following the discussion in Section 1.4, we linearly project each of $V(0)$ and $V(1)$ onto the population demeaned covariates, $\dot{X}$:

$$V(0) = \dot{X}\beta_0 + U(0) \qquad (1.15)$$
$$V(1) = \dot{X}\beta_1 + U(1) \qquad (1.16)$$

where the intercepts are necessarily zero. Then

$$Y(0) = \mu_0 + \dot{X}\beta_0 + U(0) \qquad (1.17)$$
$$Y(1) = \mu_1 + \dot{X}\beta_1 + U(1) \qquad (1.18)$$

By definition of the linear projection,

$$E[U(0)] = E[U(1)] = 0, \qquad E[\dot{X}'U(0)] = E[\dot{X}'U(1)] = 0.$$

It follows that

$$V[Y(0)] = \beta_0'\Omega_X\beta_0 + \sigma_0^2$$
$$V[Y(1)] = \beta_1'\Omega_X\beta_1 + \sigma_1^2,$$

where $\sigma_0^2 = V[U(0)]$ and $\sigma_1^2 = V[U(1)]$. We can write the observed outcome, $Y$, as

$$Y = (1 - W)\left[\mu_0 + \dot{X}\beta_0 + U(0)\right] + W\left[\mu_1 + \dot{X}\beta_1 + U(1)\right] \qquad (1.19)$$
$$= \mu_0 + \dot{X}\beta_0 + U(0) + \tau W + (W \cdot \dot{X})\delta + W \cdot [U(1) - U(0)], \qquad (1.20)$$

where $\delta = \beta_1 - \beta_0$. The following lemma is a precursor to the efficiency comparisons.

Lemma 1.5.1. Under the assumptions of random assignment given in Assumption 1.3.1, random sampling given in Assumption 1.3.2, and finite moment assumptions, the following asymptotic distributions hold:

$$\sqrt{N}(\hat{\tau}_{SDM} - \tau) \xrightarrow{d} N\left(0, \omega^2_{SDM}\right) \qquad (1.21)$$
$$\omega^2_{SDM} = \frac{\beta_1'\Omega_X\beta_1}{\rho} + \frac{\beta_0'\Omega_X\beta_0}{1 - \rho} + \frac{\sigma_1^2}{\rho} + \frac{\sigma_0^2}{1 - \rho} \qquad (1.22)$$

$$\sqrt{N}(\hat{\tau}_{PRA} - \tau) \xrightarrow{d} N\left(0, \omega^2_{PRA}\right) \qquad (1.23)$$
$$\omega^2_{PRA} = \left[\frac{(1 - \rho)^2}{\rho} + \frac{\rho^2}{1 - \rho}\right](\beta_1 - \beta_0)'\Omega_X(\beta_1 - \beta_0) + \frac{\sigma_1^2}{\rho} + \frac{\sigma_0^2}{1 - \rho} \qquad (1.24)$$

$$\sqrt{N}(\hat{\tau}_{FRA} - \tau) \xrightarrow{d} N\left(0, \omega^2_{FRA}\right) \qquad (1.25)$$
$$\omega^2_{FRA} = (\beta_1 - \beta_0)'\Omega_X(\beta_1 - \beta_0) + \frac{\sigma_1^2}{\rho} + \frac{\sigma_0^2}{1 - \rho} \qquad (1.26)$$

$$\sqrt{N}(\hat{\tau}_{IRA} - \tau) \xrightarrow{d} N\left(0, \omega^2_{IRA}\right) \qquad (1.27)$$
$$\omega^2_{IRA} = \frac{\sigma_1^2}{\rho} + \frac{\sigma_0^2}{1 - \rho}. \qquad (1.28)$$ □

The asymptotic variance expressions allow us to determine asymptotic efficiency under various scenarios. Not surprisingly, all four asymptotic variances depend on the error variances, $\sigma_0^2$ and $\sigma_1^2$. Generally, the asymptotic variances of the three feasible estimators can depend on $\Omega_X$, $\beta_0$, and $\beta_1$.

By comparing the formulas in Lemma 1.5.1 we have the following result, which ranks the asymptotic variances of the four different estimators in the general case of heterogeneous treatments and $\rho \in (0, 1)$.

Theorem 1.5.2. Under the assumptions of Lemma 1.5.1,

(i)
$$\omega^2_{FRA} \leq \omega^2_{SDM} \qquad (1.29)$$
$$\omega^2_{FRA} \leq \omega^2_{PRA} \qquad (1.30)$$
$$\omega^2_{IRA} \leq \omega^2_{FRA} \qquad (1.31)$$

(ii) If $\beta_0 = \beta_1 = 0$, then all asymptotic variances are the same.

(iii) If $\beta_0 = \beta_1 = \beta$, then $\omega^2_{PRA} = \omega^2_{FRA} = \omega^2_{IRA}$, and if $\beta \neq 0$ then $\omega^2_{SDM}$ is strictly larger.

(iv) If $\rho = 1/2$, then $\omega^2_{PRA} = \omega^2_{FRA} \leq \omega^2_{SDM}$, with strict inequality in the latter case unless $\beta_1 = -\beta_0$.

Many of the results in Theorem 1.5.2 follow from inspection of the asymptotic variance formulas, although some are more subtle. For example, (1.31) is immediate because the first term in (1.26) is nonnegative.
Part (iii) is also immediate because all asymptotic variances equal $\sigma_1^2/\rho + \sigma_0^2/(1 - \rho)$ when $\beta_0 = \beta_1$. For part (iv), the function

$$g(\rho) \equiv \frac{(1 - \rho)^2}{\rho} + \frac{\rho^2}{1 - \rho}, \quad \rho \in (0, 1)$$

can be shown to have a minimum value of unity, uniquely achieved when $\rho = 1/2$: writing $g(\rho) = [(1 - \rho)^3 + \rho^3]/[\rho(1 - \rho)] = 1/[\rho(1 - \rho)] - 3$ shows that $g$ is minimized where $\rho(1 - \rho)$ is maximized, namely at $\rho = 1/2$, where $g(1/2) = 4 - 3 = 1$.

The most difficult inequality to establish, and the one that is most important, is (1.29). Straightforward matrix multiplication shows that

$$\omega^2_{SDM} - \omega^2_{FRA} = \frac{\beta_1'\Omega_X\beta_1}{\rho} + \frac{\beta_0'\Omega_X\beta_0}{1 - \rho} - (\beta_1 - \beta_0)'\Omega_X(\beta_1 - \beta_0) = \lambda'\Omega_X\lambda,$$

where

$$\lambda = \sqrt{\frac{1 - \rho}{\rho}}\,\beta_1 + \sqrt{\frac{\rho}{1 - \rho}}\,\beta_0.$$

Because $\Omega_X$ is assumed positive definite, $\omega^2_{SDM} = \omega^2_{FRA}$ if and only if $\lambda = 0$. One case where $\lambda = 0$ is $\beta_1 = \beta_0 = 0$, in which case the covariates do not predict the potential outcomes. It can happen in other cases, but all of the slope coefficients would have to have opposite signs in the linear projections of the two potential outcomes. For example, if $\rho = 1/2$, we would need $\beta_1 = -\beta_0$, which means the slopes in the linear projection of $Y(1)$ on $1, X$ would have the opposite signs of the slope coefficients in the linear projection of $Y(0)$ on $1, X$. This seems highly unlikely. For example, we would expect pre-training education to have a positive effect on earnings whether or not someone participates in a job training program. In the homogeneous case $\beta_1 = \beta_0 \neq 0$, $\omega^2_{SDM} > \omega^2_{FRA} = \omega^2_{PRA}$.

We can never know for sure whether $\beta_1 = \beta_0$, but we should know whether $\rho = 1/2$ based on the design of the experiment. If $\rho = 1/2$, then Theorem 1.5.2 suggests that the pooled estimator is probably preferred: it is as asymptotically efficient as the full RA estimator and conserves on degrees of freedom, which may be important if $N$ is not large and $K$ (the number of covariates) is somewhat large. For $\rho \neq 1/2$, Theorem 1.5.2 shows that the full RA estimator is attractive provided small-sample issues are not important. In particular, $\hat{\tau}_{FRA}$ is always asymptotically more efficient than both $\hat{\tau}_{SDM}$ and $\hat{\tau}_{PRA}$ in the presence of heterogeneous slopes, and there is no (asymptotic) price to pay if $\beta_1 = \beta_0$ or even if $\beta_1 = \beta_0 = 0$. Estimating the $2K$ parameters is, asymptotically, harmless when it comes to the precision in estimating $\tau$.

It may be helpful to provide intuition as to why $\hat{\tau}_{FRA}$ is more efficient than $\hat{\tau}_{SDM}$. Consider estimating the mean of the potential outcome in the treated state, $\mu_1$. The FRA estimator is $\hat{\mu}_{1,FRA} = \hat{\alpha}_1 + \bar{X}\hat{\beta}_1$, where $\bar{X}$ is the sample average across the entire sample. By the simple mechanics of OLS,

$$\bar{Y}_1 = \hat{\alpha}_1 + \bar{X}_1\hat{\beta}_1,$$

where

$$\bar{X}_1 = N_1^{-1}\sum_{i=1}^N W_i X_i$$

is the sample average of the $X_i$ over the treated units. Because of random assignment, $\bar{X}_1$ is also a $\sqrt{N}$-consistent, asymptotically normal estimator of $\mu_X$. But it is inefficient compared with $\bar{X}$ because the latter uses the entire sample. The same is true of $\hat{\mu}_{0,FRA}$ and $\bar{Y}_0$. This does not quite prove that $\hat{\tau}_{FRA}$ is asymptotically more efficient because $\hat{\mu}_{1,FRA}$ and $\hat{\mu}_{0,FRA}$ are not (asymptotically) uncorrelated, but it does provide some intuition. The same sort of intuition indicates why $\hat{\tau}_{IRA}$ is asymptotically more efficient than $\hat{\tau}_{FRA}$: $\hat{\tau}_{IRA}$ is not subject to the sampling error in estimating $\mu_X$.
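As a quick numerical illustration of Lemma 1.5.1 and Theorem 1.5.2, the snippet below evaluates the variance formulas (1.22), (1.24), (1.26), and (1.28) at arbitrary made-up parameter values (scalar covariate) and confirms the ranking $\omega^2_{IRA} \leq \omega^2_{FRA} \leq \min(\omega^2_{PRA}, \omega^2_{SDM})$:

```python
import numpy as np

# Illustrative parameter values; all numbers here are invented
Omega_X = np.array([[2.0]])        # Var(X)
beta0 = np.array([1.0])
beta1 = np.array([3.0])
sig0_sq, sig1_sq, rho = 1.0, 2.0, 0.3

noise = sig1_sq / rho + sig0_sq / (1 - rho)   # common sigma^2 terms
d = beta1 - beta0

w_sdm = beta1 @ Omega_X @ beta1 / rho + beta0 @ Omega_X @ beta0 / (1 - rho) + noise
w_pra = ((1 - rho)**2 / rho + rho**2 / (1 - rho)) * (d @ Omega_X @ d) + noise
w_fra = d @ Omega_X @ d + noise
w_ira = noise

print(w_ira <= w_fra <= min(w_pra, w_sdm))    # True: Theorem 1.5.2(i)
```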
1.6 Simulations

In this section we study the finite sample properties of the four estimators just discussed. We evaluate the estimators primarily in terms of root mean squared error (RMSE), since this accounts for bias as well as sampling variance. Since bias has been cited as a concern with covariate adjustment estimators, especially in small-scale experiments, looking at the trade-off between bias and efficiency through RMSE is key to studying the finite sample performance of these estimators. In order to compute the RMSE, we first generate a population of 10,000 observations to approximate an "infinite" population setting. We then repeatedly draw random samples of sizes 100, 500, and 1,000 from this population, with one thousand replications. For a comprehensive assessment, we report the RMSE across these different sample size and treatment probability combinations, where the treatment probabilities range from 0.1 to 0.9. To keep the tables simple, we report results only for the odd treatment probabilities, even though the graphs are plotted for all values. The reported simulation results are for the case of heterogeneous treatment effects in the population, both in terms of the slopes of the linear projections and in the distribution of the projection errors, $U(0)$ and $U(1)$.

1.6.1 Design details

The treatment, $W$, is a binary variable, and so it has a Bernoulli distribution with $P(W = 1) = \rho$, and we vary the value of $\rho$. For the potential outcomes, we consider continuous and discrete responses. In the first design, the potential outcomes are conditionally normally distributed, with means linear in a quadratic in two covariates, $X_1$ and $X_2$. Specifically,

$$Y(0) = \gamma_{00} + \gamma_{01}X_1 + \gamma_{02}X_2 + \gamma_{03}X_1^2 + \gamma_{04}X_2^2 + \gamma_{05}X_1X_2 + R(0) \equiv Z\gamma_0 + R(0)$$
$$Y(1) = \gamma_{10} + \gamma_{11}X_1 + \gamma_{12}X_2 + \gamma_{13}X_1^2 + \gamma_{14}X_2^2 + \gamma_{15}X_1X_2 + R(1) \equiv Z\gamma_1 + R(1),$$

where

$$Z = \left(1, X_1, X_2, X_1^2, X_2^2, X_1X_2\right)$$

and

$$R(0)\,|\,(X_1, X_2) \sim N(0, \sigma_0^2)$$
$$R(1)\,|\,(X_1, X_2) \sim N(0, \sigma_1^2).$$

We allow the $\gamma_{wj}$ to differ across $w \in \{0, 1\}$, and so there is heterogeneity in the treatment effects in terms of the observables, $X$, and the unobservables, $R(0)$ and $R(1)$, which are allowed to be correlated.² We also allow $\sigma_0^2$ and $\sigma_1^2$ to differ.

² $R(0)$ and $R(1)$ are generated to be affine transformations of the same standard normal variable.

It is important to understand that, in order to be realistic, we do not assume that the quadratic conditional mean function is known. Instead, the researcher uses only linear regression on a constant, $X_1$, and $X_2$. In a traditional view of econometrics, these regressions would be "misspecified." Of course, to ensure we have the best mean squared error predictors of $Y(0)$ and $Y(1)$, we would use the correct specifications of $E[Y(0)\,|\,X]$ and $E[Y(1)\,|\,X]$. But it would be unusual for us to know the exact specification of the conditional mean functions. One can argue that most empirical researchers would include simple functions, such as squares and interactions. But then the true mean function could depend on higher order polynomials, or other more exotic functions. In fact, the mean might not even be linear in parameters. We take our setup as reflecting the realistic case that the researcher uses a linear regression that does not correspond to the correct conditional mean.

Our second design generates the potential outcomes as binary variables. Remember, when $W$ is randomly assigned, we can use any kind of linear regression adjustment to improve asymptotic efficiency, regardless of the nature of $Y(0), Y(1)$. Specifically, for $Z$ defined above,

$$Y(0) = 1[Z\gamma_0 + R(0) > 0]$$
$$Y(1) = 1[Z\gamma_1 + R(1) > 0],$$

where $R(0)$ and $R(1)$ are again independent of $(X_1, X_2)$ and normally distributed, this time each with unit variances. As before, $R(0)$ and $R(1)$ are allowed to be correlated.
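Both designs can be sketched in a few lines of code. In the minimal sketch below, the covariate distribution follows the bivariate normal design given in the next paragraphs and the $\gamma$ vectors are the "mild heterogeneity" values for continuous covariates; the error scales sd0 and sd1, and the device of making $R(0)$ and $R(1)$ perfectly correlated through a single common draw (one way of satisfying footnote 2), are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10_000
X = rng.multivariate_normal([1.0, 2.0], [[2.0, 0.5], [0.5, 3.0]], size=n)
x1, x2 = X[:, 0], X[:, 1]
Z = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])

base = rng.normal(size=n)           # common normal draw
sd0, sd1 = 3.0, 4.0                 # illustrative error scales
r0, r1 = sd0 * base, sd1 * base     # affine transformations of base

gamma0 = np.array([2, 2, -2, -0.05, -0.02, 0.3])
gamma1 = np.array([3, 1, -1, -0.05, -0.03, 0.6])

# Quadratic design: continuous potential outcomes
y0_quad, y1_quad = Z @ gamma0 + r0, Z @ gamma1 + r1

# Probit design: binary potential outcomes with unit-variance errors
y0_bin = (Z @ gamma0 + base > 0).astype(float)
y1_bin = (Z @ gamma1 + base > 0).astype(float)

# Random assignment and the switching equation (1.2)
W = rng.binomial(1, 0.5, size=n)
Y = (1 - W) * y0_quad + W * y1_quad
```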
In the binary response case, one might traditionally think of two forms of "misspecification" in using linear regression adjustment on $(1, X_1, X_2)$. First, we are using what is traditionally called a "linear probability model" rather than the correct probit model. Second, we are omitting the terms $X_1^2$, $X_2^2$, and $X_1X_2$. Thus, there are two kinds of functional form "misspecification." Our view is that, to make a case for linear regression adjustment, it should produce notable efficiency gains even when the potential outcomes are discrete (although we return to this issue in the next section).

We consider two different designs for generating the covariates. Both are based on an underlying bivariate normal distribution:

$$X^{*\prime} = \begin{pmatrix} X_1^* \\ X_2^* \end{pmatrix} \sim N\left[\begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 & 0.5 \\ 0.5 & 3 \end{pmatrix}\right].$$

In the first design, $X = X^*$. In the second design, $X_1 = X_1^*$ and $X_2 = 1[X_2^* > 0]$, so that $X_2$ is binary (in which case $X_2^2$ is redundant in the mechanism generating the potential outcomes).

With the linear and probit data we consider two different levels of heterogeneity across the coefficients $\gamma_0$ and $\gamma_1$, which we label "mild" and "strong." For mild heterogeneity with continuous covariates,

$$\gamma_0' = (2, 2, -2, -0.05, -0.02, 0.3), \qquad \gamma_1' = (3, 1, -1, -0.05, -0.03, 0.6),$$

and for strong heterogeneity,

$$\gamma_0' = (0, 1, -1, -0.05, 0.02, 0.6), \qquad \gamma_1' = (1, -1, 1.5, 0.03, -0.02, -0.6).$$

With the binary regressor, for mild heterogeneity we have

$$\gamma_0' = (0, 1, -1, 0.05, 0.2), \qquad \gamma_1' = (3, 3, 1, 0.05, 0.9),$$

and for strong heterogeneity we have

$$\gamma_0' = (0, 1, -2, -0.05, 0.2), \qquad \gamma_1' = (3, -1, 1, 0.05, -0.9).$$

Combining the linear and probit designs for the potential outcomes with two different levels of heterogeneity and two different covariate compositions leads to a total of eight scenarios. We allow the treatment probability, $\rho$, to range between 0.1 and 0.9 in increments of 0.1. We consider three sample sizes: 100, 500, and 1,000. Note that when $N = 100$ and $\rho = 0.1$, we expect only 10 treated units and 90 control units. The covariates are generated to ensure that they are predictive of the potential outcomes, with the population R-squareds ranging between 0.1 and 0.6. The variances $\sigma_0^2$ and $\sigma_1^2$ are allowed to differ between the two potential outcomes and across four of the eight data generating processes. We assess the relative finite sample performance of the four estimators under each such scenario, which we term a DGP. Table 1.1 describes each DGP in detail.

Table 1.1: Description of the data generating processes

DGP | Design    | Covariates       | Heterogeneity | R²₀  | R²₁  | σ²₀ | σ²₁ | PATE
 1  | Quadratic | X                | Mild          | 0.52 | 0.44 |  9  | 16  | 2.68
 2  | Quadratic | X                | Strong        | 0.31 | 0.46 | 16  |  9  | 0.93
 3  | Quadratic | X₂ = 1[X₂* > 0]  | Mild          | 0.59 | 0.33 |  4  |  1  | 7.46
 4  | Quadratic | X₂ = 1[X₂* > 0]  | Strong        | 0.27 | 0.34 |  4  |  9  | 2.92
 5  | Probit    | X                | Mild          | 0.59 | 0.38 |  1  |  1  | 0.28
 6  | Probit    | X                | Strong        | 0.51 | 0.45 |  1  |  1  | 0.09
 7  | Probit    | X₂ = 1[X₂* > 0]  | Mild          | 0.45 | 0.28 |  1  |  1  | 0.35
 8  | Probit    | X₂ = 1[X₂* > 0]  | Strong        | 0.38 | 0.40 |  1  |  1  | 0.43

1.6.2 Discussion of simulation findings

Across the eight different DGPs, we see that FRA performs better than SDM and PRA in terms of RMSE. This behavior is more pronounced at larger sample sizes, as seen from the figures. Two things are worth pointing out. First, the difference between IRA and FRA is less pronounced in cases of mild heterogeneity. In such cases, PRA also performs comparably.
1.6.2 Discussion of simulation findings

Across the eight DGPs, FRA performs better than SDM and PRA in terms of RMSE, and this behavior is more pronounced at larger sample sizes, as seen in the figures. Two things are worth pointing out. First, the difference between IRA and FRA is less prominent in the cases of mild heterogeneity; in such cases, PRA also performs comparably. This makes sense: pooling the slopes across the treatment and control groups when the slopes are not very different should produce estimates close to those from the separate slopes regression. Second, as was clear from Theorem 1.5.2, PRA and FRA have approximately the same RMSE at ρ = 0.5. This is not surprising in the graphs because, at larger sample sizes, the biases in these estimators are negligible, which means the RMSE is approximately the same as the standard deviation.

Overall, the finite sample performance of FRA is superior to that of SDM and PRA for a variety of data generating processes (see Figures A.1 and A.2 for the quadratic design with mild and strong levels of heterogeneity, A.3 and A.4 for the quadratic design with one binary covariate, A.5 and A.6 for the probit design, and A.7 and A.8 for the probit design with one binary covariate; for tables, see ??, ??, and ??).

1.7 Nonlinear regression adjustment

If the outcome Y – more precisely, the potential outcomes, Y(0) and Y(1) – is discrete, or has limited support, using nonlinear conditional mean functions, chosen to ensure that fitted values are logically consistent with E[Y|X], has considerable appeal. Intuitively, getting better approximations to E[Y(0)|X] and E[Y(1)|X] can yield estimators with smaller asymptotic variances compared with the SDM estimator and linear regression adjustment. However, as cautioned by Imbens and Rubin (2015), page 128, one should not sacrifice consistency in order to obtain an asymptotically more efficient estimator. Imbens and Rubin (2015) leave the impression that all nonlinear models should be avoided because consistency cannot be ensured. In this section, we use the features of the linear exponential family of distributions, combined with particular conditional mean models, to show that if one is careful in choosing the combination of conditional mean function and quasi-log-likelihood (QLL) function, one can preserve consistency. Unfortunately, we cannot formally show that using this particular set of nonlinear models is more efficient than the SDM estimator, but our simulations suggest the efficiency gains can be substantial. (And we have found no cases where it is worse to do nonlinear RA.)

In deciding on nonlinear RA methods, the key is to remember that τate = µ1 − µ0, and so we need to consistently estimate µ1 and µ0 without imposing additional assumptions. Earlier we showed how linear regression adjustment does just that. And linear RA, when done separately to estimate µ0 and µ1, is asymptotically more efficient than the SDM estimator. Our goal here is to summarize the nonlinear methods that produce consistent estimators of τate without additional assumptions (beyond standard regularity conditions). We start with pooled methods.

1.7.1 Pooled nonlinear regression adjustment

In the generalized linear models (GLM) literature, it is well known that certain combinations of QLLs in the linear exponential family (LEF) and link functions lead to first order conditions where, in the sample, the residuals average to zero and are uncorrelated with every explanatory variable. To state the precise results, let g(·) be a strictly increasing function on R, with a range that can be a subset of R. The inverse, g^{-1}(·), is known as the "link function" in the GLM literature.
In the context of treatment effect estimation with mean function g(α + xβ + γw), when using the so-called canonical link function [McCullagh and Nelder (1989)], the first order conditions (FOCs) are of the form

Σ_{i=1}^{N} [Yi − g(α̂ + Xiβ̂ + γ̂Wi)] = 0
Σ_{i=1}^{N} Wi[Yi − g(α̂ + Xiβ̂ + γ̂Wi)] = 0
Σ_{i=1}^{N} Xi'[Yi − g(α̂ + Xiβ̂ + γ̂Wi)] = 0.

When g(z) = z, these equations produce the first order conditions for the pooled OLS estimator. The leading cases where these conditions hold for nonlinear estimation are the Bernoulli QLL when g(z) = Λ(z) = exp(z)/[1 + exp(z)] is the logistic function, and the Poisson QLL when g(z) = exp(z).

Under random sampling and weak regularity conditions, the probability limits of the estimators solve the population versions of the sample moment conditions. Let α*, β*, and γ* denote the probability limits of α̂, β̂, and γ̂, respectively. Importantly, as argued in White (1982), these plims exist very generally without assuming that the mean function is correctly specified – just as the parameters in a linear projection exist under very weak assumptions. The first two FOCs in the population are

E[Y − g(α* + Xβ* + γ*W)] = 0     (1.32)
E{W[Y − g(α* + Xβ* + γ*W)]} = 0;     (1.33)

we will not need the last set of conditions, obtained from the gradient with respect to β. As before, we assume that ρ = P(W = 1) satisfies 0 < ρ < 1.

Now, recall that Y = (1 − W)Y(0) + WY(1). Then, by random assignment,

E(Y) = E(1 − W)E[Y(0)] + E(W)E[Y(1)] = (1 − ρ)µ0 + ρµ1.

Therefore, we can write (1.32) as

(1 − ρ)µ0 + ρµ1 = E[g(α* + Xβ* + γ*W)].

By iterated expectations,

E[g(α* + Xβ* + γ*W)] = E{E[g(α* + Xβ* + γ*W)|X]},

and, because W is independent of X with P(W = 1) = ρ,

E[g(α* + Xβ* + γ*W)|X] = (1 − ρ)g(α* + Xβ*) + ρg(α* + Xβ* + γ*).

Therefore,

(1 − ρ)µ0 + ρµ1 = (1 − ρ)E[g(α* + Xβ*)] + ρE[g(α* + Xβ* + γ*)].     (1.34)

Further, using WY = WY(1), from (1.33),

E[WY(1)] = E[Wg(α* + Xβ* + γ*W)].

Again using random assignment and iterated expectations,

ρµ1 = (1 − ρ)·0 + ρE[g(α* + Xβ* + γ*)] = ρE[g(α* + Xβ* + γ*)].

Because ρ > 0, we have

µ1 = E[g(α* + Xβ* + γ*)].     (1.35)

Also, because ρ < 1, (1.34) now implies

µ0 = E[g(α* + Xβ*)].     (1.36)

It follows that

τate = E[g(α* + Xβ* + γ*)] − E[g(α* + Xβ*)].     (1.37)

This equation is essentially the basis for proving that pooled regression adjustment, where we use a QLL in the linear exponential family and the conditional mean implied by the canonical link function, is consistent. Consistency follows because the estimated ATE using the pooled method is

τ̂ate,pooled = N^{-1} Σ_{i=1}^{N} [g(α̂ + Xiβ̂ + γ̂) − g(α̂ + Xiβ̂)].     (1.38)

By Wooldridge (2010), question 12.17, this converges in probability to (1.37).
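For concreteness, here is a minimal sketch of the pooled Bernoulli/logistic case, computing (1.38) directly. Python with statsmodels is an assumed implementation choice, not part of the original chapter; any GLM routine with a robust covariance option would serve.

```python
import numpy as np
import statsmodels.api as sm

def pooled_logit_ate(y, w, X):
    """Pooled quasi-MLE with the canonical (logit) link; returns equation (1.38)."""
    n = len(y)
    D = np.column_stack([np.ones(n), X, w])
    # "Robust" (sandwich) covariance, since the mean may be misspecified
    fit = sm.GLM(y, D, family=sm.families.Binomial()).fit(cov_type="HC0")
    d1 = np.column_stack([np.ones(n), X, np.ones(n)])   # everyone set to W = 1
    d0 = np.column_stack([np.ones(n), X, np.zeros(n)])  # everyone set to W = 0
    # Average difference of the two fitted logistic means over the full sample
    return fit.predict(d1).mean() - fit.predict(d0).mean()
```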
As a practical matter, it is convenient to note that (1.38) is exactly the quantity reported by standard software packages when one requests the average "marginal" (or "partial") effect of the binary variable W after a standard GLM estimation. Packages that have this pre-programmed also provide a valid standard error, although one must be sure to use a "robust" option during the GLM estimation so that a sandwich estimator is used for the asymptotic variance of the parameter estimators (α̂, β̂', γ̂)'.

Table B.1 summarizes the useful combinations of QLL and mean functions that lead to consistent estimation of τate without additional assumptions. The Bernoulli/logistic case applies to binary or fractional outcomes, without change. When Y is fractional, it can have probability mass at zero, one, or anywhere else. See Papke and Wooldridge (1996) for further discussion. In any case, we treat the problem as quasi-MLE because we do not wish to assume that either the distribution or the mean function is correct.

The Poisson/exponential combination is very useful for nonnegative outcomes without a natural upper bound, although it can be applied to any nonnegative outcome. This includes count outcomes but also continuous outcomes and outcomes with corner solutions at zero (or at other focal points). In the latter case, it is important to understand that commonly used models, such as the Tobit, do not provide any known robustness to misspecification of the Tobit model. By contrast, the Poisson QMLE with an exponential mean provides full robustness. Remember, we are not trying to estimate the conditional mean functions; we are trying to consistently estimate the unconditional means, µ0 and µ1. Other than linear regression, the Poisson QMLE with an exponential mean is the only sensible choice for nonnegative, unbounded responses.

If the outcome has a natural, known upper bound, say Bi, which may vary by unit i, the binomial QMLE can be used in conjunction with the mean function

m(b, x, w, θ) = b·exp(α + xβ + γw)/[1 + exp(α + xβ + γw)],

as this is known to be the mean associated with the canonical link for the binomial distribution. The data then consist of (Yi, Bi, Xi, Wi). Again, it does not matter whether Yi is an integer response or is continuous, or even has a corner at zero, at Bi, or both: using the binomial QMLE with a logistic mean is simply a way to possibly improve over SDM or linear RA. The estimated ATE is

N^{-1} Σ_{i=1}^{N} Bi { exp(α̂ + Xiβ̂ + γ̂)/[1 + exp(α̂ + Xiβ̂ + γ̂)] − exp(α̂ + Xiβ̂)/[1 + exp(α̂ + Xiβ̂)] };

again, this is simply the average partial effect with respect to the binary variable W.

The last entry in Table B.1 extends the Bernoulli QLL/logistic mean and is relevant in two general situations. The first is when the support of the response is finite (and greater than two; otherwise one would use the logistic mean function with the Bernoulli QLL). For example, Y(w) could be an ordered response, such as a measure of health on a Likert scale, or an unordered response, such as the choice of a health plan. A second situation is when the response consists of fractions that sum to unity, such as the proportions of wealth in different investment categories, in which case the model has been called "multinomial fractional logit" [Mullahy (2015)]. If there are G + 1 possible outcomes, then there are G + 1 means each for the control and treated states.
The conditional mean functions for a pooled estimation would be

mg(x, w, θ) = exp(αg + xβg + γgw)/[1 + Σ_{h=1}^{G} exp(αh + xβh + γhw)], g = 0, 1, ..., G,

with α0 = 0, β0 = 0, γ0 = 0. Then the estimated means are

µ̂wg = N^{-1} Σ_{i=1}^{N} exp(α̂g + Xiβ̂g + γ̂gw)/[1 + Σ_{h=1}^{G} exp(α̂h + Xiβ̂h + γ̂hw)], w ∈ {0, 1}, g ∈ {0, 1, ..., G},

and the estimated average treatment effect for each g is

τ̂ate,g = µ̂1g − µ̂0g.

Because Σ_{g=0}^{G} µ̂wg = 1 for each w, the sum over g of the τ̂ate,g is necessarily zero.

1.7.2 Full nonlinear regression adjustment

As in the linear case, consistency is preserved if we estimate two separate regression functions for the control and treated cases. This follows from the Wooldridge (2007) results on doubly robust estimators, where, in the current setting, the propensity score, P(W = 1|X = x) = ρ, is not a function of x. But a direct argument is easier to follow. For example, consider using a QLL with the canonical link function using only the treated units. The FOC for α̂1, the intercept inside the conditional mean function, is simply

Σ_{i=1}^{N} Wi[Yi − g(α̂1 + Xiβ̂1)] = 0.

Notice again how the treatment indicator serves to select the subsample of treated units. The population analog is

E[WY(1)] = E[Wg(α1* + Xβ1*)]

or, because of random assignment,

ρµ1 = ρE[g(α1* + Xβ1*)].

It follows that

µ1 = E[g(α1* + Xβ1*)].

The same argument works for the untreated case, where Wi is replaced with (1 − Wi) and (α̂1, β̂1')' is replaced with (α̂0, β̂0')'. The conclusion is

µ0 = E[g(α0* + Xβ0*)].

Remember, the parameters with a "*" now indicate the probability limits from the two separate estimations, rather than being the same parameters as in the pooled estimation. It follows under general regularity conditions that a consistent and asymptotically normal estimator of τate is

τ̂ate,full = N^{-1} Σ_{i=1}^{N} [g(α̂1 + Xiβ̂1) − g(α̂0 + Xiβ̂0)].

As a practical matter, some packages, such as Stata, have built-in commands for some full RA estimators, including the Bernoulli/logistic and Poisson/exponential combinations, and so a standard error is computed along with the estimate. Again, one must be sure to use a robust variance matrix estimator for the parameters. Alternatively, using a bootstrap routine is straightforward for these kinds of estimators.

In deciding on a procedure to use – linear versus nonlinear, pooled versus full – it is important to understand that all of the methods studied in this paper produce consistent estimators of τate. In the linear case, we have the result that full RA is asymptotically efficient relative to SDM and pooled RA. As mentioned earlier, a proof that full nonlinear RA is asymptotically more efficient than the pooled version is elusive. Also, we have not proven that full nonlinear RA is always at least as asymptotically efficient as SDM. We now report representative simulations showing that the nonlinear methods can improve precision substantially in some cases, without introducing bias, even in fairly small samples.
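Before turning to those simulations, here is a sketch of full nonlinear RA for the Poisson/exponential combination; statsmodels is again an assumed implementation used only for illustration.

```python
import numpy as np
import statsmodels.api as sm

def full_poisson_ate(y, w, X):
    """Full (separate) RA with exponential means: the tau-hat_{ate,full} formula."""
    n = len(y)
    D = np.column_stack([np.ones(n), X])
    # Separate Poisson QMLEs on the control and treated subsamples
    fit0 = sm.GLM(y[w == 0], D[w == 0],
                  family=sm.families.Poisson()).fit(cov_type="HC0")
    fit1 = sm.GLM(y[w == 1], D[w == 1],
                  family=sm.families.Poisson()).fit(cov_type="HC0")
    # Average both exponential-mean predictions over the FULL sample
    return fit1.predict(D).mean() - fit0.predict(D).mean()
```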
1.7.3 Simulations

For the nonlinear simulations we only consider continuous covariates, which means that for both the binary and nonnegative data generating processes, X = X*, where

X* = (X1*, X2*)' ~ N( (1, 2)', [2, 0.5; 0.5, 3] ).

As with the linear simulations, Z = (1, X1, X2, X1^2, X2^2, X1X2). The tables report bias and standard deviation for sample sizes of N = 500 and N = 1,000. To keep the tables simple, we only report values for treatment probabilities ρ = 0.1, 0.3, 0.5, 0.7, 0.9, even though the graphs are plotted for ρ ranging from 0.1 to 0.9.

1.7.3.1 Binary response

For the binary response case, the outcomes are generated using a probit mean function:

Y(0) = 1[Zγ0 + R(0) > 0]
Y(1) = 1[Zγ1 + R(1) > 0],

where

γ0' = (−2, 1, 2, 0.05, 0.02, 0.1), γ1' = (0, 3, 1, −0.05, 0.03, 0.9)

and

R(0)|(X1, X2) ~ N(0, 1)
R(1)|(X1, X2) ~ N(0, 1),

with R(0) and R(1) allowed to be correlated. When computing the nonlinear pooled and separate slopes estimators, we use case (ii) in Table B.1. We find that the separate slopes nonlinear estimator (NFRA) has the lowest root mean squared error compared with the linear estimators and the pooled nonlinear estimator (NPRA) for all treatment probabilities (see Figure A.9). The tables show that the nonlinear estimators have bias comparable to that of the SDM estimator (see Table ??).

1.7.3.2 Nonnegative response

For the nonnegative response, the outcomes are generated using a lognormal distribution:

Y(0) = exp[(Zγ0 + R(0))/10 + 0.3·N(0, 1)]
Y(1) = exp[(Zγ1 + R(1))/10 + 0.4·N(0, 1)],

where

γ0' = (0, 2, −1, −0.05, 0.02, 0.6), γ1' = (1, −1, 1.5, 0.03, −0.02, −0.6)

and

R(0)|(X1, X2) ~ N(0, 4)
R(1)|(X1, X2) ~ N(0, 9),

with R(0) and R(1) allowed to be correlated. When computing the nonlinear pooled and separate slopes estimators, we use case (iv) in Table B.1. Similar to the binary response simulations, NFRA again has the lowest root mean squared error compared with both the linear and the pooled nonlinear estimators across all treatment probabilities. In fact, NFRA performs better than both SDM and FRA. The NPRA and linear PRA estimators are very similar in terms of RMSE; see Table ?? and Figure A.10.

1.8 Concluding remarks

We have studied linear and nonlinear regression adjustment estimators of the average treatment effect in an experimental framework. For linear regression adjustment, this paper makes some key contributions to the econometrics literature on randomized experiments. First, by considering a previously ignored aspect of the separate slopes estimator, this paper fills a clear gap in the literature by showing that the full RA estimator is always the most efficient, even when the population means of the covariates are estimated using the sample averages. Second, in obtaining our results, we rely only on linear projections, and so no extra assumptions are used in establishing asymptotic efficiency. Third, the setup allows us to determine when using full RA, or RA at all, is unnecessary for achieving efficiency. Our simulation findings support the theory and show that substantial efficiency gains are possible when we have good predictors of the response. Obtaining the correct standard errors for the full RA estimator is particularly simple in commonly used software packages.
For example, Stata®, with its built-in "teffects" command, provides the correct standard errors for the FRA estimator.

As an interesting complement to our work, Słoczyński (2018) studies pooled versus full RA when assignment is unconfounded conditional on covariates. Assuming that the conditional means are linear in parameters, Słoczyński (2018) shows that using pooled RA when the treatment effects are heterogeneous is inconsistent for the ATE in a way that is particularly troublesome in designs that are heavily unbalanced. In particular, the pooled RA estimator consistently estimates the weighted average (1 − ρ)·τATT + ρ·τATU, where τATT is the average treatment effect on the treated (W = 1) and τATU is the ATE on the untreated (W = 0). The ATE can be expressed as τATE = ρ·τATT + (1 − ρ)·τATU, and so the PRA estimator, in the limit, gets the weights reversed. Under random assignment, there is no difference between τATE, τATT, and τATU, and so consistency of PRA for τATE is not the issue. But, as we showed, the pooled RA estimator is generally inefficient when treatment effects are heterogeneous. Also, when ρ = 1/2, there is no inconsistency in the pooled RA estimator when unconfoundedness holds. As we have shown in this paper, in the random assignment case, ρ = 1/2 is precisely the condition that implies no efficiency gain from full RA, even when there is arbitrary heterogeneity in the treatment effects. Our findings mesh well with those of Słoczyński (2018), with the conclusion that, in moderate samples, FRA should be used unless ρ is known to be close to 1/2.

In addition to the linear estimators, we also propose nonlinear regression adjustment estimators, characterizing the combinations of quasi-log-likelihood functions and conditional mean functions that ensure consistency regardless of misspecification. We believe this paper is the first to do so. We do not have theoretical results showing when the nonlinear RA methods unambiguously improve asymptotic efficiency, and this is a good topic for future research. However, our simulations suggest that nonlinear adjustment estimators can have bias comparable to that of the simple difference in means (SDM) and can produce sampling variances that are considerably smaller than that of SDM and, in a majority of cases, substantially smaller than that of linear feasible regression adjustment.

Going forward, there are many natural extensions. One is to study an assignment scheme different from the one considered here. This paper assumes independence across treatment assignments, but a more common design, known as the completely randomized experiment, induces dependence across units by fixing the number of treated units before sampling from the population. Also, because most randomized experiments in economics are plagued by issues of nonparticipation or nonrandom attrition, it would be fruitful to study regression adjustment in conjunction with instrumental variables (IV) methods. Comparing the efficiency of standard regression adjustment estimators under random assignment to estimators based on stratified assignment schemes is also a good area for future research.

CHAPTER 2
ROBUST AND EFFICIENT ESTIMATION OF POTENTIAL OUTCOME MEANS UNDER RANDOM ASSIGNMENT†

†This work is joint with Jeffrey M. Wooldridge and is unpublished.

2.1 Introduction

In the past several decades, the potential outcomes framework has become a staple of causal inference in statistics, econometrics, and related fields.
Envisioning each unit in a population under different states of intervention or treatment allows one to define treatment or causal effects without referencing a model. One merely needs the means of the potential outcomes, or perhaps the potential outcome (PO) means for subpopulations. When interventions are randomized – whether the assignment is to control and treatment groups in a clinical trial (Hirano and Imbens (2001)), assignment to participate in a job training program (Calónico and Smith (2017)), receiving a school voucher when studying the effects of private schooling on educational outcomes (Angrist et al. (2006a)), or a contingent valuation study, where different bid values are randomized among people (Carson et al. (2004)) – one can simply use the subsample means for each treatment level to obtain unbiased and consistent estimators of the PO means. In some cases, the precision of the subsample means will be sufficient. Nevertheless, with the availability of good predictors of the outcome or response, it is appealing to think that the precision can be improved, thereby shrinking confidence intervals and making conclusions about interventions more reliable.

In this paper we build on Negi and Wooldridge (2019), who studied the problem of estimating the average treatment effect under random assignment with one control group and one treatment group. In the context of random sampling, we showed that performing separate linear regressions for the control and treatment groups in estimating the average treatment effect never does worse, asymptotically, than the simple difference in means estimator or a pooled regression adjustment estimator. In addition, we characterized the class of nonlinear regression adjustment methods that produce consistent estimators of the ATE without any additional assumptions (except regularity conditions). The simulation findings for both the linear and nonlinear cases are quite promising when covariates are available that predict the outcomes.

In the current paper, we consider any number of "treatment" levels and study the problem of jointly estimating the vector of potential outcome means. We assume that the assignment to treatment is random – that is, independent of both the potential outcomes and observed predictors of the POs. Importantly, other than standard regularity conditions (such as finite second moments of the covariates), we impose no additional assumptions. In other words, the full RA estimators are consistent under essentially the same assumptions as the subsample means with, generally, smaller asymptotic variance. Interestingly, even if the predictors are unhelpful, or the slopes in the linear projections are the same across all groups, no asymptotic efficiency is lost by using the most general RA method.

We also extend the nonlinear RA results in Negi and Wooldridge (2019) to the general case of G assignment levels. We show that for particular kinds of responses, such as binary, fractional, or nonnegative, it is possible to consistently estimate the PO means using pooled and separate regression adjustment. Unlike in the linear regression adjustment case, we do not have any general asymptotic results comparing full nonlinear RA with pooled nonlinear RA.

Finally, we apply the full RA estimator to data from a contingent valuation study obtained from Carson et al. (2004).
The data are used to elicit a lower bound on the mean willingness to pay (WTP) for a program that would prevent future oil spills along California's central coast. Our results show that the PO means for the five different bid amounts that were randomly assigned to California residents are estimated more efficiently using separate regression adjustment than using subsample averages. This efficiency result is preserved for estimating the lower bound, since it is a linear combination of the PO means. Hence, using separate RA also delivers a more precise lower bound mean WTP for the oil spill prevention program than the ABERS estimator, which uses subsample averages to construct the estimate.

A Monte Carlo exercise substantiates the theoretical results across three kinds of data generating processes. We generate the outcomes to be either continuous nonnegative values or multinomial responses. In addition, we consider four different configurations of the assignment vector. In each setting, we find FRA to be at least as precise as SM across three different sample sizes. The performance of PRA relative to SM is less decisive, since some of the PRA estimates of the means are noisier than their SM counterparts.

The rest of the paper is organized as follows. Section 2.2 discusses the potential outcomes framework extended to the case of G treatment levels, along with a discussion of the crucial random sampling and random assignment assumptions. Section 2.3 derives the asymptotic variances of the different linear regression adjustment estimators, namely, subsample means, pooled regression adjustment, and full regression adjustment. Section 2.4 compares the asymptotic variances of the subsample means, pooled RA, and full RA estimators of the entire vector of PO means. Section 2.5 considers a class of nonlinear regression adjustment estimators that remain consistent for the PO means without imposing additional assumptions. Section 2.6 discusses applications of this framework to randomized experiments, difference-in-differences settings, and contingent valuation studies; this section also applies the full regression adjustment estimator to estimate the lower bound mean WTP in the California oil spill study using data from Carson et al. (2004). Section 2.7 constructs a Monte Carlo study of the finite sample behavior of the linear regression adjustment estimators, and Section 2.8 discusses the results of this study. Section 2.9 concludes.

2.2 Potential outcomes, random assignment, and random sampling

We use the standard potential outcomes framework, also known as the Neyman-Rubin causal model. The goal is to estimate the population means of G potential (counterfactual) outcomes, Y(g), g = 1, ..., G. Define

µg = E[Y(g)], g = 1, ..., G.

The vector of assignment indicators is

W = (W1, ..., WG),

where each Wg is binary and

W1 + W2 + ··· + WG = 1.

In other words, the groups are exhaustive and mutually exclusive. The setup applies to many situations, including the standard treatment-control setup, with G = 2; multiple treatment levels (with g = 1 the control group); the basic difference-in-differences setup, with G = 4; and contingent valuation studies, where subjects are presented with a set of G prices or bid values. We assume that each group, g, has a positive probability of being assigned:

ρg ≡ P(Wg = 1) > 0, g = 1, ..., G
ρ1 + ρ2 + ··· + ρG = 1.

Next, let

X = (X1, X2, ..., XK)

be a vector of observed covariates.
Assumption 2.2.1 (Random Assignment). Assignment is independent of the potential outcomes and observed covariates:

W ⊥ [Y(1), Y(2), ..., Y(G), X].

Further, the assignment probabilities are all strictly positive. □

Assumption 2.2.1 is what puts us in the framework of experimental interventions. It would be much too strong for an observational study.

Assumption 2.2.2 (Random Sampling). For a nonrandom integer N,

{[Wi, Yi(1), Yi(2), ..., Yi(G), Xi] : i = 1, 2, ..., N}

is independent and identically distributed. □

The IID assumption is not the only one we can make. For example, we could allow for a sampling-without-replacement scheme given a fixed sample size N. This would complicate the analysis because it generates slight correlation across draws. As discussed in Negi and Wooldridge (2019), Assumption 2.2.2 is traditional in studying the asymptotic properties of estimators and is realistic as an approximation. Importantly, it forces us to account for the sampling error in the sample average, X̄, as an estimator of µX = E(X).

For each draw i from the population, we only observe

Yi = Wi1Yi(1) + Wi2Yi(2) + ··· + WiGYi(G),

and so the data we have to work with are

{(Wi, Yi, Xi) : i = 1, 2, ..., N}.

Defining population quantities only requires the random vector (W, Y, X), which represents the population.

Assumptions 2.2.1 and 2.2.2 are the only substantive restrictions used in this paper. Subsequently, we assume that linear projections exist and that the central limit theorem holds for properly standardized sample averages of IID random vectors. Therefore, we are implicitly imposing at least finite second moment assumptions on the Y(g) and the Xj. We do not make this explicit in what follows.

2.3 Subsample means and linear regression adjustment

In this section we derive the asymptotic variances of three estimators: the subsample means, full (separate) regression adjustment, and pooled regression adjustment.

2.3.1 Subsample means

The simplest estimator of µg is the sample average within treatment group g:

Ȳg = Ng^{-1} Σ_{i=1}^{N} WigYi = Ng^{-1} Σ_{i=1}^{N} WigYi(g),

where

Ng = Σ_{i=1}^{N} Wig

is a random variable in our setting. In expressing Ȳg as a function of the Yi(g), we use WihWig = 0 for h ≠ g. Under random assignment and random sampling,

E(Ȳg | W1g, ..., WNg, Ng > 0) = Ng^{-1} Σ_{i=1}^{N} WigE[Yi(g) | W1g, ..., WNg, Ng > 0]
                             = Ng^{-1} Σ_{i=1}^{N} WigE[Yi(g)]
                             = Ng^{-1} Σ_{i=1}^{N} Wigµg = µg,

and so Ȳg is unbiased conditional on observing a positive number of units in group g. By the law of large numbers, a consistent estimator of ρg is

ρ̂g = Ng/N,

the sample share of units in group g. Therefore, by the law of large numbers and Slutsky's theorem,

Ȳg = (N/Ng)·(N^{-1} Σ_{i=1}^{N} WigYi(g)) →p ρg^{-1}E[WgY(g)] = ρg^{-1}E(Wg)E[Y(g)] = µg,

and so Ȳg is consistent for µg.

By the central limit theorem, √N(Ȳg − µg) is asymptotically normal. We need an asymptotic representation of √N(Ȳg − µg) that allows us to compare its asymptotic variance with those of the regression adjustment estimators. To this end, write

Y(g) = µg + V(g)
Ẋ = X − µX,

where Ẋ is X demeaned using the population mean, µX. Now project each V(g) linearly onto Ẋ:

V(g) = Ẋβg + U(g), g = 1, ..., G.
By construction, the population projection errors U(g) have the properties

E[U(g)] = 0, g = 1, ..., G
E[Ẋ'U(g)] = 0, g = 1, ..., G.

Plugging in gives

Y(g) = µg + Ẋβg + U(g), g = 1, ..., G.

Importantly, by random assignment, W is independent of [U(1), ..., U(G), Ẋ]. The observed outcome can be written as

Y = Σ_{g=1}^{G} Wg[µg + Ẋβg + U(g)].

Theorem 2.3.1 (Asymptotic variance of the subsample means estimator of the PO means). Under Assumptions 2.2.1 and 2.2.2 and finite second moments,

√N(Ȳ − µ) = N^{-1/2} Σ_{i=1}^{N} ( Wi1Ẋiβ1/ρ1 + Wi1Ui(1)/ρ1, ..., WiGẊiβG/ρG + WiGUi(G)/ρG )' + op(1)
           ≡ N^{-1/2} Σ_{i=1}^{N} (Li + Qi) + op(1),

where

Li ≡ ( Wi1Ẋiβ1/ρ1, Wi2Ẋiβ2/ρ2, ..., WiGẊiβG/ρG )'     (2.1)
Qi ≡ ( Wi1Ui(1)/ρ1, Wi2Ui(2)/ρ2, ..., WiGUi(G)/ρG )'.     (2.2)

By random assignment and the linear projection property, E(Li) = E(Qi) = 0 and E(LiQi') = 0. Also, because WigWih = 0 for g ≠ h, the elements of Li are pairwise uncorrelated; the same is true of the elements of Qi.

2.3.2 Full regression adjustment

To motivate full regression adjustment, write the linear projection for each g as

Y(g) = αg + Xβg + U(g)
E[U(g)] = 0
E[X'U(g)] = 0.

It follows immediately that µg = αg + µXβg.

Theorem 2.3.2 (Asymptotic variance of the full regression adjustment estimator of the PO means). Under Assumptions 2.2.1 and 2.2.2 and finite second moments,

√N(µ̂ − µ) = N^{-1/2} Σ_{i=1}^{N} ( Ẋiβ1 + Wi1Ui(1)/ρ1, ..., ẊiβG + WiGUi(G)/ρG )' + op(1)
           ≡ N^{-1/2} Σ_{i=1}^{N} (Ki + Qi) + op(1),

where

Ki = ( Ẋiβ1, Ẋiβ2, ..., ẊiβG )'     (2.3)

and Qi is given in (2.2).

Both Ki and Qi have zero means, the latter by random assignment. Further, by random assignment and the linear projection property, E(KiQi') = 0 because

E[Ẋi'WigUi(g)] = E(Wig)E[Ẋi'Ui(g)] = 0.

However, unlike the elements of Li, we must recognize that the elements of Ki are correlated, except in the trivial case where all but one of the βg are zero.

2.3.3 Pooled regression adjustment

Now consider the pooled RA estimator, µ̌, which can be obtained as the vector of coefficients on Wi = (Wi1, Wi2, ..., WiG) from the regression

Yi on Wi, Ẍi, i = 1, 2, ..., N.

We refer to this as a pooled method because the coefficients on Ẍi, say β̌, are assumed to be the same for all groups. Compared with the subsample means, we add the controls Ẍi; but, unlike FRA, the pooled method imposes the same coefficients across all g.

Theorem 2.3.3 (Asymptotic variance of the pooled regression adjustment estimator of the PO means). Under Assumptions 2.2.1 and 2.2.2, along with finite second moments,
√N(µ̌ − µ) = N^{-1/2} Σ_{i=1}^{N} ( ρ1^{-1}Wi1Ẋi(β1 − β) + Ẋiβ + Wi1Ui(1)/ρ1, ..., ρG^{-1}WiGẊi(βG − β) + Ẋiβ + WiGUi(G)/ρG )' + op(1)
           ≡ N^{-1/2} Σ_{i=1}^{N} (Fi + Ki + Qi) + op(1),

where β is the probability limit of β̌, Ki and Qi are as before, and, with δg = βg − β,

Fi = ( ρ1^{-1}(Wi1 − ρ1)Ẋiδ1, ρ2^{-1}(Wi2 − ρ2)Ẋiδ2, ..., ρG^{-1}(WiG − ρG)ẊiδG )'.     (2.4)

Notice that, again by random assignment and the linear projection property,

E(FiKi') = E(FiQi') = 0.

2.4 Comparing the asymptotic variances

We now take the representations derived in Section 2.3 and use them to compare the asymptotic variances of the three estimators. For notational clarity, it is helpful to summarize the conclusions reached in Section 2.3:

√N(µ̂SM − µ) = N^{-1/2} Σ_{i=1}^{N} (Li + Qi) + op(1)
√N(µ̂FRA − µ) = N^{-1/2} Σ_{i=1}^{N} (Ki + Qi) + op(1)
√N(µ̂PRA − µ) = N^{-1/2} Σ_{i=1}^{N} (Fi + Ki + Qi) + op(1),

where Li, Qi, Ki, and Fi are defined in (2.1), (2.2), (2.3), and (2.4), respectively.

2.4.1 Comparing FRA to subsample means

Theorem 2.4.1. Under the assumptions of Theorems 2.3.1 and 2.3.2,

Avar[√N(µ̂SM − µ)] − Avar[√N(µ̂FRA − µ)] = ΩL − ΩK,     (2.5)

which is PSD, where ΩL ≡ E(LiLi') and ΩK ≡ E(KiKi').

The one case where there is no gain in asymptotic efficiency from using FRA is when βg = 0, g = 1, ..., G, in which case X does not help predict any of the potential outcomes. Importantly, there is no gain in asymptotic efficiency from imposing βg = 0, which is what the subsample means estimator does. From an asymptotic perspective, it is harmless to separately estimate the βg even when they are zero. When they are not all zero, estimating them leads to asymptotic efficiency gains.

Theorem 2.4.1 implies that any smooth nonlinear function of µ is estimated more efficiently using µ̂FRA. For example, in estimating a percentage difference in means, we would be interested in µ2/µ1, and using the FRA estimators is asymptotically more efficient than using the SM estimators.

2.4.2 Full RA versus pooled RA

The comparison between FRA and PRA is simple given the representations above because, as stated earlier, Fi, Ki, and Qi are pairwise uncorrelated.

Theorem 2.4.2. Under the assumptions of Theorems 2.3.2 and 2.3.3,

Avar[√N(µ̂PRA − µ)] − Avar[√N(µ̂FRA − µ)] = ΩF ≡ E(FiFi'),

which is PSD. Therefore, µ̂FRA is never less asymptotically efficient than µ̂PRA.

There are some special cases where the estimators achieve the same asymptotic variance, the most obvious being when the slopes in the linear projections are homogeneous:

β1 = β2 = ··· = βG.

As with comparing FRA to the subsample means, there is no gain in efficiency from imposing this restriction when it is true. This is another fact that makes FRA attractive if the sample size is not small.

Other situations where there is no asymptotic efficiency gain from using FRA are more subtle. In general, suppose we are interested in linear combinations τ = a'µ for a given G × 1 vector a. If

a'ΩFa = 0,

then a'µ̂PRA is asymptotically as efficient as a'µ̂FRA for estimating τ. With ΩX ≡ E(Ẋ'Ẋ), the diagonal elements of ΩF are

[(1 − ρg)/ρg]·δg'ΩXδg,

because E[(Wig − ρg)^2] = ρg(1 − ρg). The off-diagonal elements of ΩF are

−δg'ΩXδh,

because E[(Wig − ρg)(Wih − ρh)] = −ρgρh.
Now consider the case covered in Negi and Wooldridge (2019), where G = 2 and a' = (−1, 1), so the parameter of interest is τ = µ2 − µ1 (the average treatment effect). If ρ1 = ρ2 = 1/2, then δ1 = β1 − (β1 + β2)/2 = (β1 − β2)/2 = −δ2. Therefore,

ΩF = [ δ1'ΩXδ1, −δ1'ΩXδ2 ; −δ2'ΩXδ1, δ2'ΩXδ2 ] = δ1'ΩXδ1·[ 1, 1 ; 1, 1 ]

and

a'ΩFa = δ1'ΩXδ1·(−1, 1)[ 1, 1 ; 1, 1 ](−1, 1)' = 0.

This finding does not extend to the case G ≥ 3; interestingly, it is then not true that PRA is asymptotically equivalent to FRA for estimating each mean separately. So, for example, for the lower bound WTP, a degeneracy might require that the bid values have the same frequency, and it is not clear that even that is sufficient. What about general G with ρg = 1/G for all g? Then

1 − ρg = 1 − 1/G = (G − 1)/G,

and so

(1 − ρg)/ρg = G − 1.

Note that

δg = βg − (β1 + β2 + ··· + βG)/G,

and it is less clear when a degeneracy occurs; one seems very likely when estimating pairwise differences.

2.5 Nonlinear regression adjustment

We now discuss a class of nonlinear regression adjustment methods that preserve consistency without adding assumptions (other than weak regularity conditions). In particular, we extend the setup in Negi and Wooldridge (2019) to allow for more than two treatment levels. We show that both separate and pooled methods are consistent, provided we choose the mean functions and objective functions appropriately. Not surprisingly, using a canonical link function in the context of quasi-maximum likelihood estimation in the linear exponential family plays a key role. Unlike in the linear case, we can only show that full RA improves over the subsample means estimator when the conditional mean is correctly specified. Whether efficiency can be established more generally is an interesting topic for future research.

2.5.1 Full regression adjustment

We model the conditional means, E[Y(g)|X], for each g = 1, 2, ..., G. Importantly, we will not assume that the means are correctly specified. As it turns out, to ensure consistency, the mean should have the index form common in the generalized linear models literature. In particular, we use mean functions m(αg + xβg), where m(·) is a smooth function defined on R. The range of m(·) is chosen to reflect the nature of Y(g). Given that the nature of Y(g) does not change across g, we choose a common function m(·) for all g. Also, as usual, the vector X can include nonlinear functions (typically squares, interactions, and so on) of underlying covariates.

As discussed in Negi and Wooldridge (2019) for the binary treatment case, the function m(·) is tied to a specific quasi-log-likelihood function in the linear exponential family (LEF). Table 2.1 gives the pairs of mean function and quasi-log-likelihood function that ensure consistent estimation. Consistent estimation follows from the results on doubly robust estimation in the context of missing data in Wooldridge (2007). Each quasi-LLF is tied to the mean function associated with the canonical link function.
Table 2.1: Combinations of means and QLLFs to ensure consistency

Support Restrictions                              Mean Function   Quasi-LLF
None                                              Linear          Gaussian (Normal)
Y(g) ∈ [0, 1] (binary, fractional)                Logistic        Bernoulli
Y(g) ∈ [0, B] (count, corners)                    Logistic        Binomial
Y(g) ≥ 0 (count, continuous, corner)              Exponential     Poisson
Yj(g) ≥ 0, Σ_{j=0}^{J} Yj(g) = 1                  Logistic        Multinomial

The binomial QMLE is rarely applied, but it is a good choice for counts with a known upper bound, even if the bound is individual-specific (so Bi is a positive integer for each i). It can also be applied to corner solution outcomes in the interval [0, Bi], where the outcome is continuous on (0, Bi) but perhaps has mass at zero or at Bi. The leading case is Bi = B for all i. Note that we do not recommend a Tobit model in such cases because the Tobit is not generally robust to distributional or mean failure. Combining the multinomial QLL and the logistic mean functions is attractive when the outcome is either a multinomial response or more than two shares that necessarily sum to unity.

As discussed in Wooldridge (2007), the key feature of the single-outcome combinations in Table 2.1 is that it is always true that

E[Y(g)] = E[m(αg* + Xβg*)],

where αg* and βg* are the probability limits of the QMLEs, whether or not the conditional mean function is correctly specified. The analog also holds for the multinomial logit objective function.

Applying nonlinear RA with multiple treatment levels is straightforward. For treatment level g, after obtaining (α̂g, β̂g) by quasi-MLE using only the units at treatment level g, the mean, µg, is estimated as

µ̂g = N^{-1} Σ_{i=1}^{N} m(α̂g + Xiβ̂g),

which includes linear RA as a special case. This estimator is consistent by a standard application of the uniform law of large numbers; see, for example, Wooldridge (2010), Chapter 12, question 12.17.

As in the linear case, any of the mean/QLL combinations in Table 2.1 allow us to write the subsample average as

Ȳg = Ng^{-1} Σ_{i=1}^{N} Wig·m(α̂g + Xiβ̂g).

It seems that µ̂g should be asymptotically more efficient than Ȳg because µ̂g averages across all of the data rather than just the units at treatment level g. Unfortunately, the proof used in the linear case does not go through in the nonlinear case. At this point, we must be satisfied with consistent estimators of the PO means that impose the logical restrictions on E[Y(g)|X]. In the binary treatment case, Negi and Wooldridge (2019) find nontrivial efficiency gains from using logit, fractional logit, and Poisson regression, even compared with full linear RA.
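With G assignment levels, the estimator above applies level by level. A sketch for the Bernoulli/logistic case follows, again using Python/statsmodels as an assumed, illustrative implementation:

```python
import numpy as np
import statsmodels.api as sm

def fra_po_means_logit(y, W, X):
    """mu-hat_g = N^{-1} sum_i m(alpha_g-hat + X_i beta_g-hat), g = 1, ..., G.

    W is an (N, G) matrix of mutually exclusive assignment indicators.
    """
    n, G = W.shape
    D = np.column_stack([np.ones(n), X])
    mu_hat = np.empty(G)
    for g in range(G):
        sel = W[:, g] == 1
        # Quasi-MLE using only the units at assignment level g ...
        fit = sm.GLM(y[sel], D[sel],
                     family=sm.families.Binomial()).fit(cov_type="HC0")
        # ... but average the fitted logistic mean over ALL N observations
        mu_hat[g] = fit.predict(D).mean()
    return mu_hat
```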
2.5.2 Pooled regression adjustment

In cases where N is not especially large, one might, just as in the linear case, resort to pooled RA. Provided the mean/QLL combinations are chosen as in Table 2.1, the pooled RA estimator is still consistent under arbitrary misspecification of the mean function. To see why, write the mean function, without an intercept in the index, as

m(γ1w1 + γ2w2 + ··· + γGwG + xβ).

The first-order conditions of the pooled QMLE include the G conditions

N^{-1} Σ_{i=1}^{N} Wig[Yi − m(γ̂1Wi1 + γ̂2Wi2 + ··· + γ̂GWiG + Xiβ̂)] = 0, g = 1, ..., G.

Therefore, assuming no degeneracies, the probability limits of the estimators, denoted with a *, solve the population analogs:

E(WgY) = E[WgY(g)] = E[Wgm(Wγ* + Xβ*)],

where W = (W1, W2, ..., WG). By random assignment, E[WgY(g)] = ρgµg. By iterated expectations and random assignment,

E[Wgm(Wγ* + Xβ*)] = E{E[Wgm(Wγ* + Xβ*)|X]}

and

E[Wgm(Wγ* + Xβ*)|X] = P(Wg = 1|X)·m(γg* + Xβ*) = ρg·m(γg* + Xβ*).

Therefore,

E[Wgm(Wγ* + Xβ*)] = ρgE[m(γg* + Xβ*)]

and, using ρg > 0, we have shown

µg = E[m(γg* + Xβ*)].

By definition, γ̂g is consistent for γg* and β̂ is consistent for β*. Therefore, after the pooled QMLE estimation, we obtain the estimated means as

µ̌g = N^{-1} Σ_{i=1}^{N} m(γ̂g + Xiβ̂),

and these are consistent by application of the uniform law of large numbers.

As in the case of comparing full nonlinear RA to the subsample averages, we have no general asymptotic efficiency results comparing full nonlinear RA to pooled nonlinear RA. As shown in Section 2.4, in the linear case it is never worse, asymptotically, to use full RA.

2.6 Applications

2.6.1 Treatment effects with multiple treatment levels

The most direct application of the previous results is in the context of a randomized intervention with more than two treatment levels. Regression adjustment can be used for any kind of response variable. With a reasonable sample size per treatment level, full regression adjustment is preferred to pooled regression adjustment. If the outcome Y(g) is restricted in some substantive way, a nonlinear RA method of the kind described in Section 2.5 can be used to exploit the logical restrictions on E[Y(g)|X]. While we cannot show that this guarantees efficiency gains compared with using subsample averages, the simulation findings in Negi and Wooldridge (2019) suggest the gains can be nontrivial – even compared with full linear RA.

2.6.2 Difference-in-differences designs

Difference-in-differences applications can be viewed as a special case of multiple treatment levels. For illustration, consider the standard setting with a single pre-treatment period and a single post-treatment period. Let C be the control group and T the treatment group, and label B the before period and A the after period. The standard DID treatment effect is a particular linear combination of the means from the four groups:

τ = (µTA − µTB) − (µCA − µCB).

Estimating the means by separate regression adjustment is generally better than not controlling for covariates, or than putting them in additively.

2.6.3 Estimating lower bound mean willingness-to-pay

In contingent valuation studies, individuals are randomly presented with the price of a new good or the tax for a new project. They are asked whether they would purchase the good at the given price, or whether they would favor the project at the given tax. Generally, the price or tax is called the "bid value." The outcome for each individual is a binary "vote" (yes = 1, no = 0).

A common approach in CV studies is to estimate a lower bound on the mean willingness-to-pay (WTP). The common estimators are based on the area under the WTP survival function:

E(WTP) = ∫_0^∞ S(a) da.

When a population of individuals is presented with a small number of bid values, it is not possible to identify E(WTP), but only a lower bound. Specifically, let b1, b2, ..., bG be G bid values and define the binary potential outcomes as

Y(g) = 1[WTP > bg], g = 1, ..., G.

In other words, if a person is presented with bid value bg, Y(g) is the binary response, which is unity if WTP exceeds the bid value.
The connection with the survival function is

µg ≡ E[Y(g)] = P(WTP > bg) = S(bg).

Notice that µg is the proportion of people in the population whose WTP exceeds bg. This fits into the potential outcomes setting because each person is presented with only one bid value. Standard consumer theory implies that µg+1 ≤ µg, which simply means that the demand curve is weakly declining in price. It can be shown that, with b0 ≡ 0 for notational ease,

τ ≡ Σ_{g=1}^{G} (bg − bg−1)µg ≤ E(WTP),

and it is this particular linear combination of {µg : g = 1, 2, ..., G} that we are interested in estimating. The so-called ABERS estimator, introduced by Ayer et al. (1955), without a downward sloping survival function imposed, replaces µg with its sample analog:

τ̂ABERS = Σ_{g=1}^{G} (bg − bg−1)Ȳg,

where

Ȳg = Ng^{-1} Σ_{i=1}^{N} Yi·1[Bi = bg]

is the fraction of yes votes at bid value bg. Of course, the Ȳg can also be obtained as the coefficients from the regression

Yi on Bid1i, Bid2i, ..., BidGi, i = 1, ..., N.

Lewbel (2000) and Watanabe (2010) allow for covariates in order to see how WTP changes with individual or family characteristics and attitudes, but here we are interested in estimating τ. We can apply the previous results on efficiency because τ is a linear combination of the µg. Therefore, using separate linear RA to estimate each µg and then forming

τ̂FRA = Σ_{g=1}^{G} (bg − bg−1)µ̂g

is generally asymptotically more efficient than τ̂ABERS. Moreover, because Y is a binary outcome, we might improve efficiency further by using logit models at each bid value to obtain the µ̂g.

2.6.4 Application to the California oil spill data

This section applies the linear RA estimators discussed in Section 2.3 to survey data from the California oil spill study of Carson et al. (2004). The study implemented a CV survey to assess the value of damages to natural resources from future oil spills along California's Central Coast. This was achieved by estimating a lower bound mean WTP measure of the cost of such spills to California's residents. The survey offered respondents the choice of voting for or against a governmental program that would prevent natural resource injuries to shorelines and wildlife along California's central coast over the next decade. In return, the public would be asked to pay a one-time lump sum income tax surcharge to set up the program.

The main survey used to elicit the yes or no votes was conducted by Westat, Inc. The data consist of a random sample of 1,085 interviews with English-speaking Californian households in which the respondent was 18 years or older and lived in a private residence that was either owned or rented. To address non-representativeness of the interviewed sample relative to the total initially chosen sample, weights were used. Each respondent was randomly assigned one of five tax amounts: $5, $25, $65, $120, or $220, and the binary choice of "yes" or "no" for the oil spill prevention program was recorded at the randomly assigned tax amount. Apart from the choice at the different bid amounts, data were also collected on demographics of the respondent and the respondent's household, such as total income, prior knowledge of the spill site, distance to the site, environmental attitudes, attitudes towards big businesses, understanding of the program and the task of voting, beliefs about the oil spill scenario, etc.
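To fix ideas, here is a sketch of the ABERS and linear-FRA lower bound computations from Section 2.6.3 (illustrative Python only; the estimates reported in Tables D.1–D.2 come from the actual survey data and covariates, not from this code):

```python
import numpy as np

def lower_bound(mu_hat, bids):
    # tau = sum_g (b_g - b_{g-1}) * mu_g, with b_0 = 0
    steps = np.diff(np.concatenate([[0.0], np.asarray(bids, float)]))
    return float(steps @ mu_hat)

def abers(y, bid, bids):
    # Subsample share of "yes" votes at each bid value
    mu = np.array([y[bid == b].mean() for b in bids])
    return lower_bound(mu, bids)

def fra_wtp(y, bid, X, bids):
    # Separate linear RA at each bid level; predictions averaged over all units
    D = np.column_stack([np.ones(len(y)), X])
    mu = np.array([(D @ np.linalg.lstsq(D[bid == b], y[bid == b],
                                        rcond=None)[0]).mean() for b in bids])
    return lower_bound(mu, bids)

# Bid values from the California study
bids = [5, 25, 65, 120, 220]
```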
Table D.1 provides a summary of yes votes at the different bid or tax amounts presented to the respondents. Table D.2 provides estimates of the PO means as well as the lower bound mean WTP. We see that the FRA estimator delivers more precise estimates of the vector of PO means. Since the treatment effect, which in this case is the lower bound mean willingness to pay for the oil spill prevention program, is a smooth function of the estimated PO means, the FRA estimate also leads to a more precise lower bound mean WTP than the ABERS estimator.

2.7 Monte Carlo

This section reports the finite sample behavior of the three linear regression adjustment estimators, namely, subsample means (SM), pooled regression adjustment (PRA), and the separate slopes (or feasible) regression adjustment estimator (FRA), for the vector of PO means. For this Monte Carlo study, we generate a population of 1 million observations and mimic the asymptotic setting of random sampling from an "infinite" population. The empirical distributions of the RA estimators are simulated for sample sizes N ∈ {500, 1000, 5000} by randomly drawing the data vector {(Yi, Xi, Wi) : i = 1, 2, ..., N} one thousand times from the above mentioned population. For a comprehensive assessment of the linear RA estimators, we consider three different populations along with four configurations of the treatment assignment vector. Tables D.3, D.5, and D.4 provide bias and standard deviation measures for the vector of PO means estimated using the different estimators for these combinations of population models, assignment vectors, and sample sizes.

To simulate multiple treatments, we consider potential outcomes, Y(g), corresponding to three different treatment states, g = 1, 2, 3. Hence, G = 3 for all the simulations. In each of the populations, the treatment vector W = (W1, W2, W3) is generated with probability mass function P(Wg = 1) = ρg, g = 1, 2, 3. To generate the vector of assignments from this distribution, we first draw a uniform random variable U ~ Uniform(0, 1), partition the unit interval into the subintervals (0, ρ1), (ρ1, ρ1 + ρ2), and (ρ1 + ρ2, ρ1 + ρ2 + ρ3), and record the interval in which the uniform draw falls. For a particular draw, if U ∈ (0, ρ1), then W = (1, 0, 0); if U ∈ (ρ1, ρ1 + ρ2), then W = (0, 1, 0); and if U ∈ (ρ1 + ρ2, ρ1 + ρ2 + ρ3), then W = (0, 0, 1). This ensures that the share of observations in each treatment group g will, on average, be close to the true assignment probability and that each observation (or draw) belongs to exactly one treatment state, i.e., Wi1 + Wi2 + Wi3 = 1 for all i. In all the simulations, we consider the following configurations of the assignment vector:

ρ ∈ { (1/3, 1/3, 1/3), (1/6, 1/6, 2/3), (1/5, 2/5, 2/5), (1/6, 1/3, 1/2) }.

2.7.1 Population models

To compare the empirical distributions of these linear RA estimators, we consider three different population models. Each model, which we term a data generating process (DGP), assumes that the potential outcomes follow a particular distribution, whether continuous or discrete. In the first two DGPs, the Y(g) are simulated as continuous nonnegative outcomes. The first model uses an exponential distribution, whereas the second uses a mixture of an exponential and a lognormal distribution.
The third DGP takes Y(g) to be a categorical response taking four discrete values. Each DGP is described in detail below.

For the first two DGPs, we consider two covariates, X = (X1, X2), drawn from a bivariate normal distribution:

X = (X1, X2)' ~ N( (1, 2)', [2, 0.5; 0.5, 3] ).

For each DGP, we choose parameters such that the covariates have some predictive power in explaining the potential outcomes, so that the benefits of regression adjustment can be reaped.

Population 1: For each g,

Y(g) ~ Exponential(λg)
λg = exp(γ0g + γ1g·X1 + γ2g·X2 + R(g)),

where R(g)|X1, X2 ~ N(0, σg^2), and R(1), R(2), and R(3) are allowed to be correlated.1 The parameter vectors, γg = (γ0g, γ1g, γ2g)', and variances, σg^2, for g = 1, 2, 3, are parameterized as follows:

γ1 = (0, −1, 1)', σ1^2 = 0.04
γ2 = (1, 1.62, −0.5)', σ2^2 = 1
γ3 = (2, −2, 0.6)', σ3^2 = 0.01

For this configuration of parameters, the covariates are only mildly predictive of the outcomes in the three treatment groups: R1^2 = 0.04, R2^2 = 0.02, and R3^2 = 0.01.

1 These are simulated to be affine transformations of the same standard normal random variable.

Population 2: In this case, we generate the outcomes as a mixture of exponential and lognormal distributions. For a uniform mixing variable V ~ Uniform(0, 1),

Y(g) ~ Exponential(λg) if 0 ≤ V < δg
Y(g) ~ Lognormal(ηg, νg^2) if δg ≤ V ≤ 1
ηg = α0g + α1g·X1 + α2g·X2 + α3g·X1^2 + α4g·X2^2 + α5g·X1·X2 + K(g),

where λg is defined exactly as above. Also, K(g)|X1, X2 ~ N(0, κg^2), and K(1), K(2), and K(3) are also allowed to be correlated. The other parameters, αg, δg, κg^2, and νg^2, are chosen as follows:

α1 = (1, −1, 3, −0.02, 0.05, 0.1)'/15, δ1 = 0.7, κ1^2 = 0.04/225, ν1^2 = 0.01
α2 = (1.2, 2, 2, −0.02, 0.03, 0.5)'/15, δ2 = 0.5, κ2^2 = 0.09/225, ν2^2 = 0.16
α3 = (0.3, 1.5, 1, 0.15, 0, 0.13)'/10, δ3 = 0.3, κ3^2 = 0.16/100, ν3^2 = 0.36

For this DGP, the population R-squareds for the three treatment groups are R1^2 = 0.119, R2^2 = 0.157, and R3^2 = 0.1177, respectively.

Finally, for the third population model, we consider each potential outcome to be a categorical response generated using a multinomial logit model. For this setting we only consider a single covariate, X, which is distributed Poisson:

X ~ Poisson(14).

As an example, one could imagine the treatment to be three different political advertisements shown to a voter, with the response (or outcome) indicating the voter's preferred candidate among four potential choices and X denoting the voter's years of schooling.

Population 3: Let Y(g) take one of four discrete values, j ∈ {1, 2, 3, 4}, for each g ∈ {1, 2, 3}, with

P{Y(g) = j} = exp(ω1gj·X + ω2gj·X^2 + Rj(g)) / Σ_{h=1}^{4} exp(ω1gh·X + ω2gh·X^2 + Rh(g)),

where Rj(g)|X ~ U(0, σg^2). For notational simplicity, we collect the index parameters in ω1g = (ω1g1, ω1g2, ω1g3, ω1g4)' and ω2g = (ω2g1, ω2g2, ω2g3, ω2g4)'. We picked the following values:
In all three cases, while estimating the PO means, we assume that the above functional forms are unknown and simply run the regression of the observed outcome on a constant and the covariates. This is meant to reflect the uncertainty that researchers often have about the underlying outcome distributions and how they are generated. Considering three different environments in which to compare the performance of the linear RA estimators also helps to mimic the variety of experimental settings that researchers may encounter, and in which separate slopes regression adjustment can produce substantial precision gains.
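Before turning to the results, the three estimators themselves can be sketched in a few lines (assuming NumPy; the function name po_mean_estimates is ours). SM uses group averages; PRA fits one pooled regression with group intercepts and common slopes; FRA fits a separate regression per assignment level and evaluates each at the full-sample covariate means.

    import numpy as np

    def po_mean_estimates(Y, X, W):
        """SM, PRA, and FRA estimates of the G potential outcome means.
        Y: (n,) outcomes; X: (n, k) covariates; W: (n, G) one-hot assignments."""
        n, G = W.shape
        xbar = X.mean(axis=0)
        # Subsample means (SM)
        sm = np.array([Y[W[:, g] == 1].mean() for g in range(G)])
        # Pooled RA (PRA): G group intercepts, common slopes
        Zp = np.column_stack([W, X])
        bp = np.linalg.lstsq(Zp, Y, rcond=None)[0]
        pra = bp[:G] + xbar @ bp[G:]
        # Separate slopes RA (FRA): one regression per group,
        # each evaluated at the full-sample covariate means
        fra = np.empty(G)
        for g in range(G):
            m = W[:, g] == 1
            Zg = np.column_stack([np.ones(m.sum()), X[m]])
            bg = np.linalg.lstsq(Zg, Y[m], rcond=None)[0]
            fra[g] = bg[0] + xbar @ bg[1:]
        return sm, pra, fra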
2.8 Discussion

Tables D.3, D.5, and D.4 below report the bias and standard deviation of the SM, PRA, and FRA estimators for the three different DGPs, respectively. Each table reports these measures across the four assignment vectors chosen in the manner described above. Note that in most cases, the bias of the FRA and PRA estimates is comparable to, and sometimes even smaller than, that of the SM counterpart. In any case, one may be willing to trade off some bias in the RA estimates in favor of efficiency, in which case we turn our attention to the standard deviations of these estimates.

Across all four configurations, we see that the standard deviation of the separate slopes estimator is weakly smaller than that of the subsample means and pooled regression estimators. The comparison of the PRA and SM estimators is less clear-cut: in almost all cases and for all sample sizes, the PRA estimator of the first PO mean is less precise than its SM counterpart. For DGPs 2 and 3, we see a similar pattern as for DGP 1. In all cases, PRA produces estimates that may or may not be more precise than the subsample means estimator, and some of the means are estimated more precisely with PRA than others. The comparison between SM and FRA, however, is unambiguous: all means estimated through FRA are weakly more precise than those estimated using just the subsample means.

2.9 Conclusion

In this paper, we build on the work of Negi and Wooldridge (2019) to study efficiency improvements from linear regression adjustment estimators when there are more than two treatment levels. In particular, we consider an arbitrary number G of treatments that have been randomly assigned. We show that jointly estimating the vector of potential outcome means using linear RA that allows separate slopes for the different assignment levels is asymptotically never worse than just using subsample averages. One case where there is no gain in asymptotic efficiency from using FRA is when the slopes are all zero: when the covariates are not predictive of the potential outcomes, using separate slopes does not produce more precise estimates compared to just estimating the subsample averages. We also show that separate slopes RA is generally more efficient than pooled RA, unless the true linear projection slopes are homogeneous across assignment levels, in which case using FRA to estimate the vector of PO means is harmless: it does not hurt, but it does not help either.

In addition, this paper extends the discussion of nonlinear regression adjustment in Negi and Wooldridge (2019) to more than two treatment levels. In particular, we show that pooled and separate nonlinear RA estimators in the quasi-maximum likelihood family are consistent if one chooses the mean and objective functions appropriately from the linear exponential family of distributions.

As an illustration of these efficiency arguments, we apply the different linear RA estimators to estimate the lower bound mean WTP using data from a contingent valuation study undertaken to provide an ex-ante measure of damages to natural resources from future oil spills along California's central coast. We find that the lower bound mean WTP is estimated more efficiently when we allow the slopes on the different bid values to be estimated separately, as opposed to the ABERS estimator, which uses subsample averages for the PO means. A comprehensive simulation study also offers finite sample evidence on the efficiency improvements of FRA over SM in three different empirical settings. We find the FRA estimator of the PO means to be unequivocally more precise than PRA and weakly better than SM for all data generating processes, despite the covariates being only mildly predictive of the potential outcomes.

CHAPTER 3

DOUBLY WEIGHTED M-ESTIMATION FOR NONRANDOM ASSIGNMENT AND MISSING OUTCOMES†

† This work is unpublished.

3.1 Introduction

Much of the applied literature in economics is interested in questions of causal inference, such as measuring the impact of job training on labor market outcomes (Calónico and Smith (2017), Ba et al. (2017), Card et al. (2011)), determining the efficacy of school voucher programs for student achievement (Muralidharan and Sundararaman (2015)), and even estimating the effects of firm competition on prices (Busso and Galiani (2019)). A key concern with causal effect estimation is that, typically, the units under comparison differ even before the treatment is assigned, rendering the task of drawing causal claims difficult. This task is made even more challenging when there is missing data on the outcome of interest, such as earnings, test scores, or prices.

The econometrics literature has proposed weighting to deal with non-random assignment (Hahn (1998), Hirano and Imbens (2001), Hirano et al. (2003), Firpo (2007)) and missing data (Robins and Rotnitzky (1995), Robins et al. (1994), Wooldridge (2002), Wooldridge (2007)).1 However, the two weighting procedures have typically been studied in isolation.2 This paper proposes a double inverse probability weighted (IPW) estimator that addresses these twin identification issues in a general M-estimation framework. Specific examples include linear regression, maximum likelihood (MLE), and quantile regression (QR).

1 See Li et al. (2013) for a review of IPW approaches to dealing with missing data under a variety of missing data patterns.
2 Huber (2014b) studies treatment effects in the presence of the double selection problem using a nested weighting procedure. He considers the traditional problem of sample selection based on unobservables and uses a nested weighting structure, which includes the first stage sample selection probability as a covariate in the second stage propensity score model. Other papers that point or set identify causal parameters in the presence of the double selection problem include Fricke et al. (2015), Frölich and Huber (2014), Vossmeyer (2016), Mattei et al. (2014) and Huber and Mellace (2015).
In particular, consider a prototypical training program. Learning about the effects of such an intervention on (say) earnings necessitates comparing individuals based on their participation status. If these individuals are not randomly assigned to the program, such a comparison will confound the true training effect with factors that simultaneously determine selection into the program and future earnings. For instance, individuals with poor labor market histories may be more likely to participate and, contemporaneously, have lower earnings than nonparticipants. Hence, the true effect of the training program is not identified in the presence of nonrandom participation. This identification problem is compounded if, say, individuals who participate in the program are also likely to drop out, introducing the additional problem of missing outcomes.

Even in randomized experiments, the problem of missing outcomes can arise due to attrition, no-shows, dropouts, or non-response (see Bloom (1984), Heckman et al. (1998b), and Hausman and Wise (1979) for a discussion). A specific example is the National Supported Work (NSW) program, where 19 percent of the randomized sample attrited between the baseline and first round of follow-up interviews. In this case, the standard simple difference in means estimator will no longer produce an unbiased training effect estimate (see Huber (2012), Huber (2014a), Behaghel et al. (2015), and Frumento et al. (2012) for alternative approaches to dealing with various post-randomization complications).

A common empirical strategy for dealing with missing data is to drop individuals with incomplete information and treat the observed units as a random sample from the population of interest.3 In a setting with only missing outcomes, such a strategy will not only waste potentially useful information on covariates but, more importantly, create a non-random sample for estimation. In turn, this can generally lead to inconsistent treatment effect estimates.

3 For example, Chen et al. (2018) drop observations with missing labor market outcomes for week 208 after random assignment using the National Job Corps Study data to derive bounds on the Average Treatment Effect as well as the Average Treatment Effect on the Treated. Drange and Havnes (2018) also report excluding children with missing data on the outcomes to study the effect of early child care on cognitive development in Norway.

One of the main contributions of this paper is to propose a new class of consistent and asymptotically normal estimators that combine propensity score weighting with weighting for missing data to address the problems of nonrandom assignment and missing outcomes. Traditionally, the weighting literature has studied each problem individually. By studying them together, this paper builds upon and extends the existing weighting literature to incorporate both issues simultaneously. A second contribution is to consider a general M-estimation problem, which is permitted to be non-smooth in the underlying parameters. Therefore, the identification and estimation arguments made in this paper encompass both average treatment effect (ATE) and quantile treatment effect (QTE) parameters. Finally, a key feature of the proposed estimator is its robustness to parametric misspecification of a conditional model of interest (such as a conditional mean or conditional quantile) and of the two weighting functions.
To obtain consistent estimation of causal parameters, this paper assumes that selection into treatment is based on observed covariates.4 Put differently, this restriction implies that the treatment is as good as randomly assigned after conditioning on pre-treatment covariates. Previous studies have found several situations where such an assumption is tenable, especially when pre-treatment values of the outcome variable are available. For example, LaLonde (1986) and Hotz et al. (2006) have shown that controlling for pre-training earnings alone removes a significant portion of the bias between non-experimental and experimental estimates. The literature assessing teacher impact on student achievement has reported similar findings with pre-test scores (Chetty et al. (2014), Kane and Staiger (2008), and Shadish et al. (2008)), indicating the plausibility of unconfoundedness in these settings.

4 This is a widely used assumption in the treatment effects literature (Imbens and Wooldridge (2009)) and is known by a variety of names, such as unconfoundedness, exogenous assignment (exogeneity), ignorability of assignment, selection on observables, and the conditional independence assumption (CIA).

This paper also assumes that the missing outcomes mechanism is ignorable once we condition on covariates and the treatment status.5 In other words, covariates and the treatment are sufficient for predicting observation into the sample (see Wooldridge (2007) for a similar ignorability assumption). This mechanism falls under the "Missing at Random" (MAR) or "selection on observables" label used in the econometrics literature (for example, Moffit et al. (1999) use it to model attrition) and allows for differential non-response, attrition, and even non-compliance to the extent that the conditioning variables predict it.6

5 Typically, covariates also include pre-treatment outcomes like pre-training earnings or pre-test scores.
6 Attrition in a two period panel is allowed as long as it is a function of key time-invariant characteristics and the assigned treatment status.

Under unconfoundedness and ignorability, the proposed strategy leads to an estimation method that proceeds in two steps: the first step estimates the treatment and missing outcome probabilities using binary response MLE.7 The second step uses these estimated probabilities as weights to minimize (or maximize) a general objective function. Given the parametric nature of the first and second steps, this paper highlights a robustness property which allows the estimator to remain consistent for a parameter of interest under misspecification of either the conditional model or the two probabilities. Consequently, the asymptotic theory in this paper distinguishes between these two important halves. The first half focuses on misspecification of either a conditional expectation function (CEF) or a conditional quantile function (CQF), whereas the second half considers arbitrary misspecification of the weighting functions. Delineating the two cases helps to clarify the interpretation of causal estimands in different misspecification scenarios. This property also nests the well known result of 'double robustness' (Słoczyński and Wooldridge (2018)) as a special case.

7 As a practical matter, researchers typically follow the convention of estimating these probabilities as flexible logit functions.

As illustrative examples, the paper discusses robust estimation of two specific causal parameters, namely, the ATE and QTEs. Consistent estimation of the ATE is achievable under both misspecification scenarios. Of particular interest is the case when the conditional mean function is misspecified. In this case, consistent estimation of the ATE relies on double weighting and on results from the generalized linear model literature.
For estimation of quantile treatment effects, the paper considers three different parameters, namely, the conditional quantile treatment effect (CQTE), a linear approximation to the CQTE, and the unconditional quantile treatment effect (UQTE), each of which may be of interest to the researcher depending on whether features of the conditional or unconditional outcome distributions are of particular interest. In the event that the underlying CQF is assumed to be correct, the double-weighted estimator is shown to be consistent for the true CQTE; otherwise, it delivers a consistent weighted linear approximation to the true CQTE (using results from Angrist et al. (2006b)). In addition, the paper underscores the importance of double weighting for a parameter like the UQTE, where the covariates, which serve to remove biases due to nonrandom assignment and missing outcomes, enter the estimating equation only through the two probability models. Simulations show that the doubly weighted ATE and QTE estimates have the lowest finite sample bias compared to alternatives that ignore one or both problems (such as the unweighted estimator that drops data, or the propensity score weighted estimator which weights only by the treatment probability).

Finally, the proposed method is applied to estimate average and distributional impacts of the NSW training program on earnings for the Aid to Families with Dependent Children (AFDC) target group. This sample is obtained from Calónico and Smith (2017), who recreate LaLonde's within-study analysis for the AFDC women. To have missing cases, these data are augmented to include women with missing earnings information in 1979 who were originally dropped from Calónico and Smith's analysis. This empirical application helps to quantify the estimated bias in the unweighted and propensity score weighted estimates, relative to the doubly weighted estimates, through the presence of an experimental benchmark. Results show that the doubly weighted estimates have an estimated bias which is smaller than that computed for the unweighted estimates, but comparable in magnitude to the bias estimated for the single (propensity score) weighted estimates. This finding indicates that, for this particular application, the missing outcomes problem is much less consequential than the nonrandom assignment problem for obtaining estimates close to the true experimental benchmark.

The rest of this paper is structured as follows. Section 3.2 describes the framework and provides a short description of the population models with an introduction to the naive unweighted estimator. Section 3.3 discusses estimation of the probability weights, which is a necessary first step in solving the doubly weighted problem. Section 3.4 develops the first half of the asymptotic theory, which is explicitly focused on misspecification of a conditional model of interest. In contrast, section 3.5 discusses the second half, which considers cases where the conditional model of interest is correctly specified. Section 3.6 studies the specifics of the robustness property for estimating the ATE and QTEs in rigorous detail.
It also provides supporting Monte Carlo evidence under different cases of misspecification. Section 3.7 illustrates the performance of double weighting using the Calónico and Smith (2017) data. Section 3.8 concludes with a direction for future research. Tables, figures, proofs, and some auxiliary results are provided in the appendix.

3.2 Doubly weighted framework

3.2.1 Potential outcomes and the population models

Let y(g) denote the potential outcome for g = 0, 1 and let wg be an indicator variable for treatment level g, where w0 + w1 = 1, implying that the two treatment groups are mutually exclusive and exhaustive. Then,

    y = y(0) · w0 + y(1) · w1    (3.1)

Let (y(g), x) denote an M × 1 random vector taking values in R^M, where x is the vector of pre-treatment characteristics.8 Some feature of the distribution of (y(g), x) is assumed to depend on a finite Pg × 1 vector θg, contained in a parameter space Θg ⊂ R^Pg.9 Let D(y(g)|x) denote the conditional distribution of y(g) given x, and let q(y(g), x, θg) be an objective function depending on y(g), x, and θg. This paper allows q(·) to be a smooth or a non-smooth function of the underlying parameter θg. The parameter of interest, denoted by θ0g, is defined to be the solution to the following population problem.

8 For instance, in the NSW program, y(1) and y(0) denote potential earnings in the event of participation and non-participation in the training program, respectively. The covariates on which information was collected in the baseline period included the individual's age, ethnicity, high-school dropout status, and real earnings, along with other socio-economic and demographic characteristics.
9 For generality, the dimension of θg is allowed to be different for the treatment and control group problems and is also different from the dimension of x, where x ∈ X ⊂ R^dim(X).

Assumption 3.2.1 (Identification of θ0g). The parameter vector θ0g ∈ Θg is the unique solution to the population minimization problem

    min_{θg ∈ Θg} E[ q(y(g), x, θg) ],   g = 0, 1    (3.2)

Notice that Assumption 3.2.1 describes a general M-estimation framework where the interest lies in minimizing some objective function, q(y(g), x, θg). Specific examples include the smooth ordinary least squares objective function, q(y(g), x, θg) = (y(g) − αg − xβg)², or the non-smooth check function of Koenker and Bassett (1978), q(y(g), x, θg) = cτ(y(g) − αg − xβg), where θg ≡ (αg, βg)′.10 Other examples of q(·) include log-likelihood and quasi-log-likelihood (QLL) functions.

10 For a random variable u, cτ(u) = (τ − 1[u < 0]) · u is the asymmetric loss function for estimating quantiles.

An implicit point made in the assumption above is that θ0g is not assumed to be correctly specified for a conditional feature like a conditional mean, a conditional variance, or even the full conditional distribution. Assumption 3.2.1 simply requires θ0g to uniquely minimize the population problem in (3.2). If θ0g is correctly specified for any of the above mentioned quantities, then the parameter is of direct interest to researchers. However, if θ0g is misspecified for any of these distributional features, Assumption 3.2.1 guarantees a unique pseudo-true solution, θ∗g (White (1982)). In the case of misspecification, determining whether θ∗g is meaningful will depend on the conditional feature being studied and the estimation method used. For example, in the OLS case, θ0g will still index a linear projection if one is agnostic about linearity of the CEF.
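To fix ideas, here is a small sketch (in Python, assuming NumPy and SciPy; all names and parameter values are ours) showing the OLS and check-function objectives as two members of the same M-estimation family:

    import numpy as np
    from scipy.optimize import minimize

    def check_loss(u, tau):
        """Koenker-Bassett check function c_tau(u) = (tau - 1[u < 0]) * u."""
        return (tau - (u < 0)) * u

    def m_estimate(Y, X, loss):
        """Generic M-estimator: minimize the sample average of loss(residual)."""
        Z = np.column_stack([np.ones(len(Y)), X])
        obj = lambda theta: loss(Y - Z @ theta).mean()
        return minimize(obj, np.zeros(Z.shape[1]), method="Nelder-Mead").x

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1))
    Y = 1.0 + 2.0 * X[:, 0] + rng.standard_t(3, size=500)
    theta_ols = m_estimate(Y, X, lambda u: u**2)               # smooth objective
    theta_med = m_estimate(Y, X, lambda u: check_loss(u, 0.5)) # non-smooth objective

A derivative-free optimizer is used here precisely because the check function is non-smooth in the parameters, which is the case the chapter's theory is designed to cover.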
Angrist et al. (2006b) establish analogous approximation properties for quantiles, where a misspecified CQF can still provide the best weighted mean square approximation to the true τ-CQF. In other words, they show that θ0g solves the following weighted mean square error loss function

    min_{θg ∈ Θg} E[ ωτ(x, θg) · (αg(τ) + xβg(τ) − Quantτ(y(g)|x))² ]

where

    ωτ(x, θg) = ∫₀¹ (1 − u) · f_{y(g)}( u · xθg + (1 − u) · Quantτ(y(g)|x) | x ) du

is the weighting function given in Angrist et al. (2006b) adapted to the potential outcomes framework, Quantτ(y(g)|x) is the true CQF, and α0g(τ) + xβ0g(τ) represents a weighted linear approximation. Hence, in this case, θ0g ≡ (α0g, β0g)′ provides an interesting interpretation that can be of practical interest to researchers.

Note that Assumption 3.2.1 only requires the parameter to solve an unconditional problem. A sufficient condition for the same is that the parameter additionally solves the conditional problem. However, this latter condition will not be required to derive the asymptotic theory in section 3.4. For the reader, an effective way to separate the results in section 3.4 from the ones discussed in section 3.5 is to consider the current section as allowing potential misspecification of the conditional feature being studied, in the sense of Assumption 3.2.1. Section 3.5 will require θ0g to be identified in the stronger conditional sense. Together, the results developed in sections 3.4 and 3.5 can then be used to characterize the robustness property of the proposed estimator.

3.2.2 The unweighted M-estimator

In this paper, the objective is to consistently estimate θ0g. If one obtains a random sample {(yi(0), yi(1), wig, xi): i = 1, 2, ..., N} from the population of interest, then one can solve

    min_{θg ∈ Θg} Σᴺi=1 wig · q(yi(g), xi, θg)    (3.3)

For the estimator which solves (3.3) to consistently estimate θ0g, the reverse analogy principle dictates that θ0g must also solve

    min_{θg ∈ Θg} E[ wg · q(y(g), x, θg) ];   g = 0, 1    (3.4)

However, this argument may not necessarily hold. For example, consider the linear model

    y(g) = αg + xβg + u(g);   g = 0, 1
    E(u(g)) = 0,   E(x′u(g)) = 0    (3.5)

If the treatment (say, job training) is correlated with baseline characteristics, as can be expected when the program is non-randomly assigned, then E[wg · x′u(g)] ≠ 0.11 In addition, suppose there is missing data on the outcome of interest. To formalize this, let s be a binary indicator for missing outcomes. Then

    y = y(0),     if g = 0, s = 1
        y(1),     if g = 1, s = 1
        missing,  if s = 0    (3.6)

where s = 1 if the outcome is observed and s = 0 if it is missing.12 In this case, a common empirical strategy is to solve

    min_{θg ∈ Θg} Σᴺi=1 si · wig · q(yi(g), xi, θg)    (3.7)

which only uses observed data to estimate θ0g. Let us refer to the estimator that solves (3.7) as the unweighted M-estimator, and denote it as ˆθug. In this case, even if the treatment is randomly assigned, the missing outcomes may still be correlated with the treatment, observable factors, or both, which implies that E[s · wg · x′u(g)] ≠ 0. Hence, identification of θ0g is now confounded on two grounds: non-random assignment, which renders the treatment and control groups incomparable, and missing outcomes, which leads to a violation of the 'random sampling' assumption.

11 When the treatment is randomized, as in the case of the NSW, or as studied in Negi and Wooldridge (2019), one will necessarily have E[wg · x′u(g)] = 0 due to the experimental design.
12 For an illustration of the observed sample, see figure I.1.
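A toy simulation makes the two confounds concrete (assuming NumPy; all parameter choices below are ours, not from the chapter): treatment depends on x, missingness depends on (x, w), and the complete-case difference in means drifts away from the true ATE of 1.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    x = rng.normal(size=n)
    p = 1 / (1 + np.exp(-x))              # P(w = 1 | x): selection on observables
    w = rng.uniform(size=n) < p
    y0 = x + rng.normal(size=n)           # potential outcomes; true ATE = 1
    y1 = 1 + x + rng.normal(size=n)
    y = np.where(w, y1, y0)
    r = 1 / (1 + np.exp(-(0.5 * x + w)))  # P(s = 1 | x, w): ignorable missingness
    s = rng.uniform(size=n) < r
    # Complete-case difference in means: biased on both grounds
    print(y[s & w].mean() - y[s & ~w].mean())   # noticeably above 1
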
The next section discusses the identification approach taken in this paper.

3.2.3 Ignorable missingness and unconfoundedness

Without imposing any structure on the assignment and missingness mechanisms in the population, identifying and estimating θ0g remains difficult because of the argument outlined in the previous section. To proceed with identification, I assume that the treatment is unconfounded conditional on covariates (Rosenbaum and Rubin (1983)).13 Formally, consider the following

Assumption 3.2.2. (Strong ignorability) Assume

    {y(0), y(1) ⊥⊥ wg} | x    (3.8)

1. (3.8) implies that P(wg = 1|y(0), y(1), x) = P(wg = 1|x) ≡ pg(x) for g = 0, 1, where p0(x) + p1(x) = 1.
2. The vector of pre-treatment covariates, x, is always observed for the entire sample.
3. For all x ∈ X ⊂ R^dim(X), pg(x) > κg > 0.

13 Like most other assumptions, unconfoundedness is non-refutable. For methods that indirectly test for its validity, see Huber and Melly (2015), de Luna and Johansson (2014), Rosenbaum (1987) and Heckman and Hotz (1989).

Assumption 3.2.2 part (1) says that conditioning on covariates is enough to parse out any systematic differences that may exist between the treatment and control groups. This is a widely used assumption in the treatment effects literature, and is known as unconfoundedness.14 One advantage of unconfoundedness is that, intuitively, it has a better chance of holding once we control for a rich set of variables in x.15 Note that unconfoundedness not only includes cases where the treatment is a deterministic function of the covariates, for example stratified (or block) experiments, but also cases where the treatment is a stochastic function of covariates.16 Part (2) requires that we observe these covariates for all individuals. Part (3) is an overlap assumption which ensures that, for all values of x in the support of the distribution, there is a chance of observing units in both treatment and control states.17

14 Imbens and Wooldridge (2009) attribute the popularity of unconfoundedness, as an identifying restriction, to the paucity of general methods for estimating treatment effects.
15 For example, Hirano and Imbens (2001) control for a rich set of prognostic factors to justify unconfoundedness while estimating the effects of right heart catheterization (RHC) on survival rates of patients.
16 The appendix discusses the case of a stratified experiment where unconfoundedness is satisfied by design if one additionally assumes the missing outcome pattern to be ignorable.
17 Methods for checking overlap involve calculating normalized sample average differences for each covariate and checking the empirical distribution of propensity scores.

With respect to the missing outcomes mechanism, I assume ignorability conditional on covariates and the treatment status. Formally, consider

Assumption 3.2.3. (Ignorability of missing outcomes) Assume

    {y(0), y(1) ⊥⊥ s} | x, wg    (3.9)

1. (3.9) implies that P(s = 1|y(0), y(1), x, wg) = P(s = 1|x, wg) ≡ r(x, wg).
2. In addition to x, wg is always observed for the entire sample.
3. For each (x, wg) ∈ R^{dim(X)+1}, r(x, wg) > η > 0.

Part (1) states that, conditional on covariates and the treatment status, the individuals whose outcomes are missing do not differ systematically from those who are observed. This implies that adjusting for x and wg renders the outcomes as good as randomly missing. In the econometrics literature, this assumption falls under the "selection on observables" tag. In the statistics literature, it is also known as MAR and represents a scenario where missingness only depends on observables and not on the missing values of the variable (Little and Rubin (2002)).
Special cases covered under this mechanism are patterns such as missing 15For example, Hirano and Imbens (2001) control for a rich set of prognostic factors to justify unconfoundedness while estimating the effects of right heart catheterization (RHC) on survival rates of patients. 16The appendix discusses the case of a stratified experiment where unconfoundedness is satisfied by design if one additionally assumes the missing outcome pattern to be ignorable. 17Methods for checking overlap involve calculating normalized sample average differences for each covariate and checking the empirical distribution of propensity scores. 74 completely at random (MCAR) and exogenous missingness, as considered in Wooldridge (2007), with z = x. Allowing the missingness probability to be a function of the treatment indicator is particularly useful in cases of differential nonresponse. For instance, in NSW, people assigned to the treatment group were less likely to drop out of the program compared to the control group. In such cases, covariates alone may not be sufficient for predicting missingness. To the extent that being observed in the sample is predicted by x and wg, assumption 3.2.3 can accommodate non-observability due to sampling design, item non- response and attrition in a two period panel.18 Part (2) of the above assumption ensures that x and wg are fully observed. Finally, part (3) imposes an overlap condition, where the probability of being observed in bounded away from zero. This implies that there is a positive probability of observing people in the sample with a given value of x and wg in the population. To study the estimation method in terms of the selected sample, I consider random sampling in the following sense, Assumption 3.2.4. (Sampling) Assume that {(yi(0), yi(1), xi, wig, si); i = 1, 2, . . . , N} are independent and identical random draws from the population where in the population 1. wig is unconfounded with respect to {yi(0), yi(1)} given xi 2. si is ignorable with respect to {yi(0), yi(1)} given (xi, wig) The next section discusses identification and estimation of θ0 g using a double inverse probability weighted procedure. 3.2.4 Population problem with double weighting Consider the following population problem 18For the case of attrition, one must assume that second period missingness is ignorable conditional on initial period covariates and the treatment status. 75 (cid:34) min θg∈Θg E s r(x, wg) · wg pg(x) (cid:35) · q(y(g), x, θg) ; g = 0, 1 (3.10) then under unconfoundedness and ignorability, solving this doubly weighted population prob- lem is the same as solving 3.2. The following lemma establishes this equivalence Lemma 3.2.5. Given assumptions 3.2.1, 3.2.2, 3.2.3 and 3.2.4, if q(y(g), x, θg) is a real valued function for all (y(g), x) ⊂ RM and for all θg ∈ Θg such that E ∞ for g = 0, 1, then we have | q(y(g),x,θg) r(x,wg)·pg(x)| < (cid:34) E s r(x, wg) · wg pg(x) · q(cid:0)y(g), x, θg (cid:1)(cid:35) (cid:104) q(cid:0)y(g), x, θg = E (cid:21) (cid:20) (cid:1)(cid:105) The proof uses two applications of the law of iterated expectations with unconfounded- ness and ignorability to arrive at the above result. This equivalence implies that one can now address the identification issue by solving the doubly weighted population problem. 
Consequently, one can obtain a consistent estimator of θ0g by solving the sample analogue of (3.10) as follows:

    min_{θg ∈ Θg} Σᴺi=1 (si / r(xi, wig)) · (wig / pg(xi)) · q(yi(g), xi, θg);   g = 0, 1    (3.11)

Let the estimator which solves eq (3.11) be denoted as ˜θg. Note, however, that this estimator is infeasible, as it depends on the unknown probabilities r(·) and pg(·). The next section discusses the first step of estimating these probabilities.

3.3 Estimation

As mentioned above, one problem with the formulation of ˜θg is that the treatment and missing outcome propensity scores are unknown. Therefore, in its current form, ˜θg cannot be implemented unless the true probabilities are known. The following assumptions posit that I have a correctly specified model for the two probabilities, which helps me formulate consistent estimators of pg(x) and r(x, wg).

Assumption 3.3.1. (Correct parametric specification of propensity score) Assume that
1. There exists a known parametric function G(x, γ) for p1(x), where γ ∈ Γ ⊂ R^I and 0 < G(x, γ) < 1 for all x ∈ X, γ ∈ Γ.
2. There exists γ0 ∈ Γ such that p1(x) = G(x, γ0).

Part (1) postulates the existence of a parametric model for the propensity score that is known to the researcher, and part (2) assumes that, for some true value of γ, say γ0, this model is correctly specified for the true assignment probability. Similarly, in order to estimate the missing outcome propensity score, I assume that R(x, wg, δ) is a correctly specified parametric model for r(x, wg). Formally,

Assumption 3.3.2. (Correct parametric specification of missing outcomes probability) Assume that
1. There exists a known parametric function R(x, wg, δ) for r(x, wg), where δ ∈ ∆ ⊂ R^K and R(x, wg, δ) > 0 for all x ∈ X, δ ∈ ∆.
2. There exists δ0 ∈ ∆ such that r(x, wg) ≡ R(x, wg, δ0).

3.3.1 Estimated weights using binary response MLE

To estimate the probability functions G(x, ·) and R(x, wg, ·), this paper uses binary response conditional maximum likelihood. Since both wg and s are binary responses, estimation of γ0 and δ0 using MLE will be asymptotically efficient under correct specification of these functions, as assumed in 3.3.1 and 3.3.2. The following two lemmas provide formal consistency and asymptotic normality conditions for MLE estimation of the two probability models. The conditions are adapted from Theorems 2.5 and 3.3 of Newey and McFadden (1994).
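Before stating the formal conditions, note that the first step is just two binary response MLEs; a minimal sketch follows (assuming NumPy; fit_logit_probs is a hypothetical helper implementing Newton-Raphson for a logit, the flexible functional form the paper uses in practice).

    import numpy as np

    def fit_logit_probs(Z, d, iters=50):
        """Binary response MLE for a logit model P(d = 1 | Z) = Lambda(Z @ beta).
        Returns the fitted probabilities; Z should include a constant column."""
        beta = np.zeros(Z.shape[1])
        for _ in range(iters):
            p = 1 / (1 + np.exp(-Z @ beta))
            grad = Z.T @ (d - p)                        # score of the log-likelihood
            info = Z.T @ ((p * (1 - p))[:, None] * Z)   # Fisher information
            beta += np.linalg.solve(info, grad)         # Newton-Raphson step
        return 1 / (1 + np.exp(-Z @ beta))

    # First step, in the chapter's notation:
    #   phat = fit_logit_probs(Zx, w)    # G(x, gamma-hat), Zx  = [1, x]
    #   rhat = fit_logit_probs(Zxw, s)   # R(x, w, delta-hat), Zxw = [1, x, w]

Note that w enters only the missingness model, mirroring the conditioning sets in Assumptions 3.2.2 and 3.2.3.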
Lemma 3.3.3. (Consistency of maximum likelihood) Assume 3.2.4, so that si and wi1 are i.i.d. with pdfs given by f(si|wi1, xi, δ) = R(xi, wi1, δ)^si · (1 − R(xi, wi1, δ))^(1−si) and f(wi1|xi, γ) = G(xi, γ)^wi1 · (1 − G(xi, γ))^(1−wi1). Additionally, assume that
1. γ0 ∈ Γ and δ0 ∈ ∆, where Γ, ∆ are compact sets.
2. If γ ≠ γ0, then f(wi1|xi, γ) ≠ f(wi1|xi, γ0), and if δ ≠ δ0, then f(si|wi1, xi, δ) ≠ f(si|wi1, xi, δ0).
3. ln f(wi1|xi, γ) and ln f(si|wi1, xi, δ) are continuous at each γ ∈ Γ and δ ∈ ∆, respectively, with probability one.
4. E[ sup_{γ∈Γ} |ln f(wi1|xi, γ)| ] < ∞ and E[ sup_{δ∈∆} |ln f(si|wi1, xi, δ)| ] < ∞.
Then ˆγ →p γ0 and ˆδ →p δ0.

The proof of the lemma is given in the appendix. For asymptotic normality, consider the following

Lemma 3.3.4. (Asymptotic normality for MLE) Assume that the conditions of Lemma 3.3.3 are satisfied and
1. γ0 ∈ interior(Γ) and δ0 ∈ interior(∆).
2. f(si|wi1, xi, δ) and f(wi1|xi, γ) are both twice continuously differentiable, with f(si|wi1, xi, δ) > 0 and f(wi1|xi, γ) > 0 in a neighborhood N of δ0 and γ0, respectively.
3. ∫ sup_{γ∈N} ||∇γ f(wi1|xi, γ)|| dw1 < ∞ and ∫ sup_{γ∈N} ||∇γγ′ f(wi1|xi, γ)|| dw1 < ∞. Similarly, ∫ sup_{δ∈N} ||∇δ f(si|wi1, xi, δ)|| ds < ∞ and ∫ sup_{δ∈N} ||∇δδ′ f(si|wi1, xi, δ)|| ds < ∞.
4. E[ ∇γ ln f(wi1|xi, γ0){∇γ ln f(wi1|xi, γ0)}′ ] exists and is non-singular. Similarly, E[ ∇δ ln f(si|wi1, xi, δ0){∇δ ln f(si|wi1, xi, δ0)}′ ] exists and is non-singular.
5. E[ sup_{γ∈N} ||∇γγ′ ln f(wi1|xi, γ)|| ] < ∞ and E[ sup_{δ∈N} ||∇δδ′ ln f(si|wi1, xi, δ)|| ] < ∞.
Then the MLE estimators ˆγ and ˆδ, which solve

    max_{γ∈Γ} Σᴺi=1 { wi1 log G(xi, γ) + (1 − wi1) log(1 − G(xi, γ)) }

and

    max_{δ∈∆} Σᴺi=1 { si log R(xi, wi1, δ) + (1 − si) log(1 − R(xi, wi1, δ)) }

respectively, are asymptotically normal. For a proof, see appendix H.

Given the estimators ˆγ and ˆδ, one can estimate the assignment and missing outcome propensity scores by G(·, ˆγ) and R(·, ˆδ), respectively. Consistency and asymptotic normality follow from applying the continuous mapping theorem and the delta method, given that G(·, ˆγ) and R(·, ˆδ) are assumed to be continuously differentiable, which is implicit in Lemmas 3.3.3 and 3.3.4. In practice, this paper follows the convention of estimating these probabilities as flexible logits, for which the above requirements of continuity and differentiability are easily satisfied.

3.3.2 Doubly weighted M-estimator

Once the probability weights have been estimated, let ˆθ1 denote the doubly weighted estimator which solves the treatment group problem

    min_{θ1 ∈ Θ1} Σᴺi=1 (si / R(xi, wi1, ˆδ)) · (wi1 / G(xi, ˆγ)) · q(yi(1), xi, θ1)    (3.12)

with weights given by G(x, ˆγ) and R(x, w1, ˆδ), and let ˆθ0 be the estimator which solves the control group problem

    min_{θ0 ∈ Θ0} Σᴺi=1 (si / R(xi, wi0, ˆδ)) · (wi0 / (1 − G(xi, ˆγ))) · q(yi(0), xi, θ0)    (3.13)

using weights (1 − G(x, ˆγ)) and R(x, w0, ˆδ). Henceforth, this estimator will be denoted as ˆθg for g = 0, 1.

Example 1 (Ordinary least squares): In the case of a misspecified conditional mean function, ˆθ1 ≡ (ˆα1, ˆβ1)′ will solve a doubly weighted version of the OLS problem, i.e.,

    ˆθ1 = argmin_{θ1 ∈ Θ1} Σᴺi=1 (si / R(xi, wi1, ˆδ)) · (wi1 / G(xi, ˆγ)) · (yi(1) − α1 − xiβ1)²

Similarly,

    ˆθ0 = argmin_{θ0 ∈ Θ0} Σᴺi=1 (si / R(xi, wi0, ˆδ)) · (wi0 / (1 − G(xi, ˆγ))) · (yi(0) − α0 − xiβ0)²

where (ˆαg, ˆβg)′ will be consistent for the linear projection of y(g) on x.

Example 2 (Quantile regression): Similarly, in the case of a misspecified conditional quantile function, ˆθg(τ) ≡ (ˆαg(τ), ˆβg(τ))′ will solve the following weighted mean square error loss functions (Angrist et al. (2006b)), i.e.,

    ˆθ1(τ) = argmin_{θ1 ∈ Θ1} Σᴺi=1 (si / R(xi, wi1, ˆδ)) · (wi1 / G(xi, ˆγ)) · ωτ(xi, θ1) · [Quantτ(yi(1)|xi) − α1(τ) − xiβ1(τ)]²

    ˆθ0(τ) = argmin_{θ0 ∈ Θ0} Σᴺi=1 (si / R(xi, wi0, ˆδ)) · (wi0 / (1 − G(xi, ˆγ))) · ωτ(xi, θ0) · [Quantτ(yi(0)|xi) − α0(τ) − xiβ0(τ)]²

where ˆθg(τ) will now be consistent for a weighted linear approximation to the true CQF of y(g)|x. Using the doubly weighted estimator, one can now consistently estimate causal parameters like the average treatment effect and different quantile treatment effects. Section 3.6 discusses each of these examples in detail. The next section develops and discusses the large sample theory of the proposed estimator.
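Example 1 can be sketched directly, reusing the first-step probabilities phat and rhat from the earlier logit sketch (assuming NumPy; doubly_weighted_ols is our label, not the chapter's notation):

    import numpy as np

    def doubly_weighted_ols(y, X, w, s, phat, rhat):
        """Doubly weighted OLS for the treatment group (g = 1): each observed
        treated unit gets weight 1 / (rhat * phat); all other units drop out."""
        m = (s == 1) & (w == 1)                 # observed treated units
        lam = 1.0 / (rhat[m] * phat[m])         # double inverse probability weights
        Z = np.column_stack([np.ones(m.sum()), X[m]])
        A = Z.T @ (lam[:, None] * Z)
        return np.linalg.solve(A, Z.T @ (lam * y[m]))

    # Control group problem (g = 0): use the mask (s == 1) & (w == 0)
    # and the weight 1 / (rhat * (1 - phat)) instead.

Restricting to the observed-treated mask first also avoids touching the missing entries of y, which carry no usable information under (3.6).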
3.4 Asymptotic theory

This paper implements the proposed estimator in a two-step procedure. The first step uses binary response MLE for the estimation of the probability weights, and the second step uses the first-step weights to estimate the parameter of interest, θ0g. The asymptotic theory utilizes results for two-step estimators with a non-smooth objective function in the second step to establish consistency and asymptotic normality of ˆθg. Therefore, the usual regularity conditions assuming continuity and twice differentiability with respect to θg are now relaxed.

3.4.1 Consistency

Using the conditions in Lemma 2.4 of Newey and McFadden (1994), it is easy to establish consistency of the doubly weighted M-estimator, ˆθg. The conditions of the lemma are quite weak, with continuity and a data dependent upper bound with finite expectation being the only substantive requirements. The following theorem fills in the primitive regularity conditions for applying the uniform law of large numbers.

Theorem 3.4.1. (Consistency) Assume 3.2.1, 3.2.2, 3.2.3, 3.2.4, 3.3.1 and 3.3.2 hold. Further, let
(1) Θg be compact for g = 0, 1;
(2) q(y(g), x, θg) be continuous at each θg ∈ Θg with probability one;
(3) for all θg ∈ Θg, |q(y(g), x, θg)| ≤ b(y(g), x) for some function b(·) such that E[b(y(g), x)] < ∞.
Then ˆθg →p θ0g as N → ∞.

The proof of the theorem can be found in the appendix. The conditions of the above theorem allow the objective function to be discontinuous at some points of Θg for a given x, which is useful for cases where q(·) is allowed to be non-smooth. Under the dominance condition given in (3), uniform convergence of sample averages holds quite generally. Compactness of the parameter space and identification, as given in Assumption 3.2.1, are both conditions that can be relaxed without affecting consistency.

3.4.2 Asymptotic normality

For establishing asymptotic normality, I provide conditions for the general case of non-smooth objective functions, since these conditions accommodate the smooth case as well. The main condition needed for establishing asymptotic normality of the doubly weighted estimator is stochastic equicontinuity, which is sufficient to guarantee uniform convergence of the objective function to its population counterpart. Before stating the conditions of the normality result, let the population problems be denoted as

    Q0(θ1) = E[ (si · wi1 / (R(xi, wi1, δ0) · G(xi, γ0))) · q(yi(1), xi, θ1) ]
    Q0(θ0) = E[ (si · wi0 / (R(xi, wi0, δ0) · (1 − G(xi, γ0)))) · q(yi(0), xi, θ0) ]

and the sample analogues be given as

    QN(θ1) = (1/N1) Σᴺi=1 (si · wi1 / (R(xi, wi1, ˆδ) · G(xi, ˆγ))) · q(yi(1), xi, θ1)
    QN(θ0) = (1/N0) Σᴺi=1 (si · wi0 / (R(xi, wi0, ˆδ) · (1 − G(xi, ˆγ)))) · q(yi(0), xi, θ0)

where N1 = Σᴺi=1 si · wi1 and N0 = Σᴺi=1 si · wi0. Then I have the following theorem for asymptotic normality, which is taken from Newey and McFadden (1994), section 7, along with primitive conditions taken from Andrews (1994).
Theorem 3.4.2. (Asymptotic normality of the doubly weighted estimator) Given Assumptions 3.2.1, 3.2.2, 3.2.3, 3.2.4:

(1) Suppose that ˆθg is an approximate minimum, i.e., QN(ˆθg) ≤ inf_{θg ∈ Θg} QN(θg) + op(N⁻¹).

(2) ˆθg →p θ0g, with θ0g ∈ int(Θg).

(3) Q0(θg) is minimized on Θg at θ0g.

(4) Q0(θg) is twice differentiable at θ0g with a nonsingular Hessian, Hg.

(5) ∇θg QN(θ0g) exists with probability one and √N ∇θg QN(θ0g) →d N(0, Ωg).

(6) Let

    l = ∇θ1 { (s · w1 / (R(x, w1, δ∗) · G(x, γ∗))) · q(y(1), x, θ1) }′
    k = ∇θ0 { (s · w0 / (R(x, w0, δ∗) · (1 − G(x, γ∗)))) · q(y(0), x, θ0) }′

and let the class

    F = { f : f(y(g), x) = l for g = 1 and k for g = 0; θg ∈ Θg, ∀ (y(g), x) ⊂ R^M }

satisfy Pollard's entropy condition with envelope F = 1 ∨ sup_{f∈F} |f(·)| for Type I classes, or Ossiander's Lp entropy condition with p = 2 and envelope F = sup_{f∈F} |f(·)| for Type II–VI classes, where these classes are defined in Andrews (1994).

(7) lim sup_{N→∞} (1/N) Σᴺi=1 E(F)^(2+ζ) < ∞ for some ζ > 0 and F given above.

(8) The conditions of Lemma 3.3.4 are satisfied, allowing the first order influence function representation for ˆγ:

    √N(ˆγ − γ0) = [E(di di′)]⁻¹ N^(−1/2) Σᴺi=1 di + op(1)    (3.14)

where

    di = wi1 · [∇γ G(xi, γ0)′ / G(xi, γ0)] − (1 − wi1) · [∇γ G(xi, γ0)′ / (1 − G(xi, γ0))]    (3.15)

is the I × 1 score of the binary response log-likelihood for the treatment probability, evaluated at the true parameter value γ0.

(9) Similarly, ˆδ has the first order influence function representation

    √N(ˆδ − δ0) = [E(bi bi′)]⁻¹ N^(−1/2) Σᴺi=1 bi + op(1)    (3.16)

where

    bi = si · [∇δ R(xi, wi1, δ0)′ / R(xi, wi1, δ0)] − (1 − si) · [∇δ R(xi, wi1, δ0)′ / (1 − R(xi, wi1, δ0))]    (3.17)

is the K × 1 score of the binary response log-likelihood for the missingness probability, evaluated at the true parameter value δ0.

Then

    √N(ˆθg − θ0g) →d N(0, Hg⁻¹ Ωg Hg⁻¹)

where

    Ω1 = E(li li′) − E(li bi′)[E(bi bi′)]⁻¹ E(bi li′) − E(li di′)[E(di di′)]⁻¹ E(di li′)
    Ω0 = E(ki ki′) − E(ki bi′)[E(bi bi′)]⁻¹ E(bi ki′) − E(ki di′)[E(di di′)]⁻¹ E(di ki′)

The primitive conditions for stochastic equicontinuity hold for classes of functions of Type I–VI as defined in Andrews (1994). Conditions (1)–(5) are standard for the case of non-smooth objective functions. Condition (5) requires that the score of the objective function exists with probability one and is normally distributed; this condition is important for establishing distributional convergence of ˆθg. Conditions (6) and (7), together with random sampling, are sufficient for stochastic equicontinuity of the remainder term in Newey and McFadden (1994).19 Checking these conditions in a particular application entails showing that f(·) belongs to one of these classes.
For instance, both the linear and quantile regression examples considered in this paper belong to the Type I class of functions. Consequently, stochastic equicontinuity follows from Theorems 1 and 4 in Andrews (1994) for the Type I and Type II–VI classes, respectively. Conditions (8) and (9) simply impose regularity conditions on R(·) and G(·) so that the influence function representations given in (3.14) and (3.16) are possible. For a proof of the theorem, see the appendix.

19 Directly verifying stochastic equicontinuity as stated in Theorem 7.2 of Newey and McFadden (1994) is difficult, and hence primitive conditions like (6) and (7) can be useful. Pollard (1985) also provides primitive conditions that are sufficient for stochastic differentiability, which is quite similar to the condition of stochastic equicontinuity.

3.4.3 Efficiency gain with estimated weights

The asymptotic variance expression derived in the previous section offers some interesting insights. First, the middle term, Ωg, represents the variance of the residual from the population regression of the weighted score on the two binary response scores, bi and di. Note that even though Ωg should involve a fourth term for the covariance between the two scores, this term is zero in the present case on account of the two scores being conditionally independent.20

20 For the proof, see appendix.

Second, the expression for Ωg derived here is different from what I obtain in section 3.5 under the stronger identification assumption. This difference has an interesting efficiency implication: when a researcher is only willing to assume identification in the weaker sense of 3.2.1, it is potentially more efficient to estimate the two probabilities in a first step. Note, though, that this result is asymptotic in nature. To see it, suppose that we know G(xi, γ0) and R(xi, wig, δ0). Then the asymptotic variance of the estimator ˜θg, which uses the known probabilities, is

    Avar[ √N(˜θg − θ0g) ] = Hg⁻¹ Σg Hg⁻¹

where Σ1 = Var(li) = E(li li′) for the treatment group and Σ0 = Var(ki) = E(ki ki′) for the control group. I formalize this result in the next theorem.

Theorem 3.4.3. (Efficiency gain with estimated weights) Under the assumptions of Theorem 3.4.2, we obtain
There- fore, inefficiency of the known probability estimator (as seen above) is due to its failure to exploit the correlation between the first and second set of moments. Hence, knowledge of the selection parameters do not play a role in efficient estimation. 3.5 Some feature of interest is correctly specified The results in the previous section were derived under the assumption that the parameter vector solves an unconditional M-estimation problem. Even though it can handle cases where the conditional feature of interest is correctly specified, the explicit focus was on examples of model misspecification such as estimating a misspecified linear model for either the true conditional mean or the true conditional quantile function. In contrast, this section focuses g indexes a true conditional feature of interest. This could be a mean, g can be said to be on situations where θ0 quantile or the entire conditional distribution of y(g)|x. In this case, θ0 86 identified in a stronger sense which is reflected in an improvisation of the basic identification assumption given in eq (3.2) to the following, Assumption 3.5.1. (Strong identification of θ0 g) The parameter vector θ0 g ∈ Θg is the unique solution to the population minimization problem E(cid:2)q(y(g), x, θg)|x, wg, s(cid:3) ; g = 0, 1 min θg∈Θg (3.18) for each (x, wg, s) ∈ V ⊂ Rdim(X)+2. In other words, under ignorability (as defined in 3.2.3) and unconfoundedness (defined in 3.2.2), θ0 g solves E(cid:2)q(y(g), x, θg)|x(cid:3) ; g = 0, 1 (3.19) min θg∈Θg for each x ∈ X ⊂ Rdim(X). in 3.2.1. The basic identification assumption simply defines θ0 The above assumption can be seen as a strengthening of the identification assumption g to be the solution to the unconditional M-estimation problem, irrespective of whether it is correctly specified for an g to solve the stronger conditional M-estimation problem. For instance, assumption 3.5.1 will be satisfied for a underlying model or not. Assumption 3.5.1 is additionally requiring θ0 correctly specified CEF given by g + xβ0 y(g) = α0 E(cid:0)u(g)|x(cid:1) = 0 g + u(g); g = 0, 1 (3.20) with either OLS or QMLE in the linear exponential family as the chosen estimation method. This would also hold for a correctly specified CQF estimated either using quantile regression or QMLE in the tick exponential family (Komunjer (2005)). Requirement for the parameter, g, to solve the stronger ID problem is an important distinction which will ultimately help θ0 me characterizing the robustness properties of the doubly weighted estimator. I will illustrate this property through two main examples; the first will study estimation of ATE and the 87 second will study estimation of quantile effects. Both these examples are studied in detail in section 3.6. Until now I have not said anything about the parametric specifications of functions R(·) In fact, under assumption 3.5.1, correct functional form assumptions on these and G(·). two probabilities can be dispensed with. This is a second important distinction between the results characterized under assumption 3.2.1 and the results characterized in this sec- g solves the objective function for each tion under 3.5.1. Therefore, the requirement that θ0 (x, wg, s) ∈ V is much stronger than the requirement in assumption 3.2.1 since assumption 3.5.1 implies assumption 3.2.1 but not the other way around. Formally, I will show that the g that solves the sample equivalent of eq (3.19) with potentially misspecified g. 
Before that, treatment and missing outcomes probabilities will still consistently estimate θ0 estimator of θ0 the following assumptions formalize possible misspecification of these probability models Assumption 3.5.2. (Parametric specification of propensity score) Assume that conditions (1) and (3) of 3.3.1 hold where condition (2) is defined for some γ∗ ∈ Γ such that plim(ˆγ) = γ∗ Assumption 3.5.2 says that we have a known parametric function for the propensity score but there is no requirement for this model to be correctly specified. I continue to assume that the estimator of γ∗ solves a binary response maximum likelihood problem and G(x, γ∗) is the model evaluated at the pseudo true value. In the event that the model is correctly specified for the propensity score, G(x, γ∗) = p1(x). I make a similar assumption for the missing outcomes model. Assumption 3.5.3. (Parametric specification of missingness probability) Assume that con- ditions (1) and (3) of 3.3.2 hold where condition (2) is defined for some δ∗ ∈ Γ such that plim(ˆδ) = δ∗ Again, assumption 3.5.3 says that we have a known parametric function for the missing outcome probability given by R(x, wg, δ) and I do not impose any requirement for this model 88 to be correctly specified. However, when this model is correctly specified, R(x, wg, δ∗) = r(x, wg). Given assumptions 3.5.1, 3.5.2 and 3.5.3, its easy to show that θ0 g solves the doubly weighted problem in the population where the weights are constructed using potentially misspecified probabilities. I provide a sketch of the argument for the treatment group pa- rameter θ0 1 and the proof for θ0 0 follows in a similar manner. Consider, E s R(x, w1, δ∗) · w1 G(x, γ∗) · q(y(1), x, θ1) (3.21) (cid:20) (cid:21) Using three applications of LIE along with ignorability and unconfoundedness, I can rewrite the above expectation as (cid:20) r(x, w1) R(x, w1, δ∗) E · p1(x) G(x, γ∗) (cid:21) · E[q(y(1), x, θ1)|x] Assumption 3.5.1 along with positive weights i.e. (x, w1), implies R(x,w1,δ∗) ≥ 0 and p1(x) r(x,w1) G(x,γ∗) ≥ 0 for all · p1(x) G(x, γ∗) · E[ q(y(1), x, θ0 r(x, w1) R(x, w1, δ∗) where the inequality is strict when θ1 (cid:54)= θ0 even if the weights are misspecified. 1)|x] ≤ r(x, w1) R(x, w1, δ∗) · p1(x) G(x, γ∗) · E[ q(y(1), x, θ1)|x], ∀ θ1 ∈ Θ1 1. Therefore, solving 3.21 identifies the parameter In general, the parameter that solves 3.21 will be g is a unique solution, solving 3.21 different from the one that solves 3.2. But as long as θ0 will identify it. When R(x, wg, δ∗) = r(x, wg) and G(x, γ∗) = p1(x), then solving 3.21 will be the same as solving 3.12 for the treatment group and 3.13 for the control group. Estimation of G(·) and R(·) follows from Lemma 3.3.3 and 3.3.4 but with probability limits given by δ∗ and γ∗ rather than δ0 and γ0 respectively. Since R(x, wg, δ∗) and G(x, γ∗) can be any positive functions of x and wg, one special case corresponds to them being constants. Since weighting by fixed constants does not affect g , which the minimization problem, this implies that the unweighted estimator, denoted by ˆθu is a special case of the doubly weighted estimator, is also be consistent for θ0 g. 89 The following theorem establishes consistency of the doubly weighted estimator under strong identification. Theorem 3.5.4. (Consistency under strong identification) Under assumptions 3.2.2, 3.2.3, 3.2.4, 3.5.1, 3.5.2 and 3.5.3 and assume regularity con- ditions (1), (2) and (3) of Theorem 3.4.1. 
we have ˆθg →p θ0g as N → ∞, where ˆθg is the doubly weighted estimator that solves problem 3.21.

The next theorem states asymptotic normality of the doubly weighted estimator that solves the conditional M-estimation problem with misspecified probabilities.

Theorem 3.5.5. (Asymptotic normality) Under the assumptions of theorem 3.5.4 and the regularity conditions of theorem 3.4.2, we obtain

    √N(ˆθg − θ0g) →d N(0, Hg⁻¹ Ωg Hg⁻¹)

where Ω1 = E(li li′) and Ω0 = E(ki ki′), with Hg as defined in condition (4) of Theorem 3.4.2, and li and ki defined as in condition (6) of Theorem 3.4.2 but with weights given by G(x, γ∗) and R(x, wg, δ∗).

Substantively, there is no real difference in the proof of the above theorem, except that now ˆγ and ˆδ converge to probability limits that could be potentially different from those indexing the true treatment and missing outcome probabilities. A consequence of the objective function solving the conditional problem is reflected in the asymptotic variance expression above: compared to section 3.4, Ωg now is just the variance of the weighted score without the first stage adjustment. Since the conditional score of the weighted problem is zero, i.e., E[ ∇θg q(y(g), x, θ0g)′ | x ] = 0, the correlation between the weighted score and the two MLE scores is zero, giving us the familiar expression above.

A consequence of this simpler expression for Ωg is that estimating the probabilities in a first step is no longer superior to using known weights. This is formalized in the following corollary.

Corollary 3.5.6. (No gain with estimated weights under strong identification) Under the assumptions of theorem 3.5.5, we obtain

    Avar[ √N(˜θg − θ0g) ] = Avar[ √N(ˆθg − θ0g) ] = Hg⁻¹ Ωg Hg⁻¹

where ˜θg is the estimator that uses known (potentially misspecified) probabilities and ˆθg is the estimator that uses estimated probabilities.

This, too, is attributable to the conditional score of the weighted problem being zero, namely, E[ ∇θg q(y(g), x, θ0g)′ | x ] = 0.

A second interesting question concerns the role of weighting in this scenario. As mentioned earlier, the unweighted estimator, or in fact any weighted estimator with possibly misspecified probabilities, will be consistent for θ0g (in fact, the estimator that only weights by the propensity score will also be consistent in this case). Interestingly, if the objective function satisfies the generalized conditional information matrix equality (GCIME) defined below, the unweighted estimator is asymptotically more efficient than any weighted estimator. The following theorem formalizes this efficiency result.
A second interesting question concerns the role of weighting in this scenario. As mentioned earlier, the unweighted estimator, or in fact any weighted estimator with possibly misspecified probabilities, will be consistent for θ0g (the estimator that weights only by the propensity score will also be consistent in this case). Interestingly, if the objective function satisfies the generalized conditional information matrix equality (GCIME) defined below, the unweighted estimator is asymptotically more efficient than any weighted estimator. The following theorem formalizes this efficiency result.

Theorem 3.5.7. (Efficiency gain with unweighted estimator under GCIME) Under the assumptions of Theorem 3.5.5, additionally assume that the objective function satisfies the generalized conditional information matrix equality (GCIME) in the population, defined as

    E[∇θg q(y(g), x, θ0g)' ∇θg q(y(g), x, θ0g)|x] = σ²0g · ∇²θg E[q(y(g), x, θ0g)|x] = σ²0g · A(x, θ0g)    (3.22)

where ∇²θg E[q(y(g), x, θ0g)|x] = A(x, θ0g). Then

    Avar[√N(θ̂g − θ0g)] = Hg⁻¹ Ωg Hg⁻¹

where

    H1 = E[ r(xi, wi1)·p1(xi) / (R(xi, wi1, δ*)·G(xi, γ*)) · A(xi, θ0_1) ],
    H0 = E[ r(xi, wi0)·p0(xi) / (R(xi, wi0, δ*)·(1 − G(xi, γ*))) · A(xi, θ0_0) ],
    Ω1 = σ²01 · E[ r(xi, wi1)·p1(xi) / (R²(xi, wi1, δ*)·G²(xi, γ*)) · A(xi, θ0_1) ],
    Ω0 = σ²00 · E[ r(xi, wi0)·p0(xi) / (R²(xi, wi0, δ*)·(1 − G(xi, γ*))²) · A(xi, θ0_0) ],

and

    Avar[√N(θ̂u_g − θ0g)] = (Hu_g)⁻¹ Ωu_g (Hu_g)⁻¹

where

    Hu_1 = E[r(xi, wi1)·p1(xi)·A(xi, θ0_1)],  Ωu_1 = σ²01 · E[r(xi, wi1)·p1(xi)·A(xi, θ0_1)],
    Hu_0 = E[r(xi, wi0)·p0(xi)·A(xi, θ0_0)],  Ωu_0 = σ²00 · E[r(xi, wi0)·p0(xi)·A(xi, θ0_0)].

Given the above, Avar[√N(θ̂g − θ0g)] − Avar[√N(θ̂u_g − θ0g)] is positive semi-definite.

The proofs of the above theorems are easy to establish and can be found in the appendix. The GCIME assumption is known in a variety of estimation contexts. In the case of full maximum likelihood, GCIME holds for q(y(g), x, θg) = −ln f(y(g)|x, θg), where f(·) is the true conditional density of y(g), with σ²0g = 1. For the case of quasi maximum likelihood in the linear exponential family for estimating the true conditional mean parameters, GCIME holds for the same q(·), but f(·) now denotes a density from the linear exponential family with Var(y(g)|x) = σ²0g · v[mg(x, θ0g)]. In other words, GCIME will be satisfied in the QMLE case when Var(y(g)|x) satisfies the generalized linear model assumption, irrespective of whether the higher-order moments of the conditional distribution of y(g) correspond to the chosen LEF density or not. For estimation using NLS, GCIME holds for q(y(g), x, θg) = (y(g) − mg(x, θg))² under the homoskedasticity assumption. Hence, in all of these cases the unweighted estimator will be more efficient than its weighted counterpart. But when GCIME is not satisfied, the two cannot be ranked efficiency-wise. A short verification of the NLS case is given below.
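As a quick check of the NLS claim, the following derivation — a sketch assuming a correct conditional mean and homoskedastic errors, with σ²u denoting E[u²(g)|x] — verifies 3.22 directly.

    % GCIME check for NLS: q(y(g),x,\theta_g) = (y(g) - m_g(x,\theta_g))^2,
    % assuming y(g) = m_g(x,\theta_g^0) + u(g), E[u(g)|x] = 0, E[u^2(g)|x] = \sigma_u^2.
    \begin{align*}
    \nabla_{\theta_g} q &= -2\,\big(y(g) - m_g\big)\,\nabla_{\theta_g} m_g
    \;\Rightarrow\;
    E\!\left[\nabla_{\theta_g} q'\,\nabla_{\theta_g} q \,\middle|\, x\right]
      = 4\,\sigma_u^2\;\nabla_{\theta_g} m_g'\,\nabla_{\theta_g} m_g \\
    E[q \mid x] &= \sigma_u^2 + \big(m_g(x,\theta_g^0) - m_g(x,\theta_g)\big)^2
    \;\Rightarrow\;
    \nabla^2_{\theta_g} E[q \mid x]\Big|_{\theta_g^0}
      = 2\,\nabla_{\theta_g} m_g'\,\nabla_{\theta_g} m_g \equiv A(x,\theta_g^0)
    \end{align*}
    % so (3.22) holds with \sigma^2_{0g} = 2\sigma_u^2.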
The next section uses the results discussed in this section and in section 3.4 to characterize the nature of this robustness property, with explicit focus on two important causal parameters: the ATE and QTEs. Before this, a flowchart outlines the different cases of misspecification that are possible under the doubly weighted framework.

3.6 Robust estimation

The asymptotic theory developed in sections 3.4 and 3.5 can now be used to characterize the robustness property of the doubly weighted estimator. Delineating the asymptotic theory using the weak and strong identification assumptions helps me to be precise about the nature of this robustness and its constituents.

3.6.1 Average treatment effect

The most common parameter of interest in applied work is the ATE, defined for an underlying population of interest.21 Given the importance of this parameter in applied work, I discuss how the current framework allows robust estimation of the ATE. Depending on which component of the doubly weighted framework is allowed to be misspecified, I utilize the asymptotic results from sections 3.4 and 3.5, along with particular estimation methods, to establish consistent estimation of the ATE.

In the presence of covariates, x, that are predictive of the potential outcomes, it is helpful to define the average treatment effect (ATE) as

    τate = E[μ1(x)] − E[μ0(x)]

where μg(x) denotes the true conditional mean (or regression function) of y(g). Let mg(x, θg) be a parametric model for E[y|x, wg = 1].22 This model is said to be correctly specified if

    μg(x) = mg(x, θ0g), for some θ0g ∈ Θg    (3.23)

21 For instance, in the NSW program, which is the main empirical application in this paper, I define the ATE to be the expectation over the population of all eligible participants.
22 Under unconfoundedness, the regression function is identified as μg(x) = E[y|x, wg = 1], ∀ g ∈ {0, 1}.

Given the parametric nature of this framework, I acknowledge and tackle misspecification of the conditional mean model, mg(x, θg), the propensity score model, G(x, γ), and the missing outcomes probability model, R(x, wg, δ). While the discussion in this section focuses on consistent estimation of θ0_1, an analogous argument can be made for estimating θ0_0. The first case considers correct specification of the missing outcomes probability model.

Case 1: Correct missing probability model, R(x, wg, δ)

In the current framework, when R(·) is correctly specified, one obtains the usual double robustness (DR) result of causal inference. DR ensures that θ0g is estimated consistently despite having either the propensity score or the conditional mean model misspecified, but not both. Naturally, what θ0g represents in this case will depend on what is being assumed about the conditional mean model. However, I will show that under each of these cases, a consistent estimate of the ATE can always be obtained.

a. First half of DR: Correct conditional mean, E(y(g)|x)

Having a correctly specified mean model implies that I can decompose the potential outcomes into their true means as follows:

    y(g) = mg(x, θ0g) + u(g),  E[u(g)|x] = 0    (3.24)

for both g = 0, 1. In this case, we know there are many estimation methods that can consistently estimate θ0g, such as nonlinear least squares (NLS) and QMLE in the linear exponential family. The question that remains to be addressed is whether any of these procedures require weighting to obtain consistent estimates of θ0g. To answer this, I look at these two estimation methods in detail and tie them to the theoretical results developed in earlier sections.

Solving for θ0g using NLS means minimizing the expected squared error between y(g) and mg(x, θg). In fact, under 3.24, θ0g is identified in the stronger sense that it solves the conditional NLS problem,

    θ0g = argmin_{θg ∈ Θg} E[(y(g) − mg(x, θg))²|x]    (3.25)

Similarly, for estimation of θ0g using QMLE in the linear exponential family (Gourieroux et al. (1984), Wooldridge (2010) chapter 13), if one chooses the range of the conditional mean function, mg(x, θg), to correspond with the range of the quasi-log likelihood for a given linear exponential density, θ0g is again identified in the conditional sense,

    θ0g = argmin_{θg ∈ Θg} E[−ln f(y(g), mg(x, θg))|x]

where f(·) is the density associated with the chosen linear exponential distribution.23 For both of these examples, results from section 3.5 dictate that weighting by either correct or misspecified probabilities is not needed for consistency. The fact that one could weight by a misspecified propensity score model and still obtain this result is what forms the 'first part' of the DR result with propensity score weighting. Once θ̂g has been estimated by solving the sample version of the NLS or QMLE problem, the ATE can be estimated as

    τ̂ate = (1/N) Σ_{i=1}^N m1(xi, θ̂1) − (1/N) Σ_{i=1}^N m0(xi, θ̂0)

If, in addition to having a correct conditional mean, I also assume the error variance of the outcomes is homoskedastic (E[u²(g)|x] = σ²0g), then the estimator that does not weight at all may be the preferred estimator from an efficiency perspective. This result is due to GCIME being satisfied under homoskedasticity with NLS. A sketch of this first-half-of-DR estimator is given below.

23 For example, if mg(x, θg) ∈ (0, 1), one would typically use the Bernoulli density, f(y(g), mg(x, θg)) = mg(x, θg)^y(g) · (1 − mg(x, θg))^(1−y(g)).
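The following is a minimal sketch of this case, assuming a binary (or fractional) outcome so that the Bernoulli QLL with a logistic mean from footnote 23 applies; the function name and arguments are illustrative. Weighting is optional here, which is the point of the first half of DR.

    import numpy as np
    import statsmodels.api as sm

    def ate_first_half_dr(y, X, s, w, Ghat=None, Rhat=None):
        # Fits a Bernoulli QMLE with a logistic mean on each arm's observed
        # subsample and averages the fitted means over the full sample.
        # Weighting by 1/(Rhat * Ghat_g) is optional: with a correct
        # conditional mean, the unweighted fit is consistent too.
        mhat = {}
        for g in (0, 1):
            use = (s == 1) & (w == g)
            lam = None
            if Ghat is not None and Rhat is not None:
                pg = Ghat if g == 1 else 1.0 - Ghat
                lam = 1.0 / (Rhat[use] * pg[use])
            fit = sm.GLM(y[use], X[use], family=sm.families.Binomial(),
                         var_weights=lam).fit()
            mhat[g] = fit.predict(X)       # m_g(x_i, theta_hat_g) for all i
        return mhat[1].mean() - mhat[0].mean()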
b. Second part of DR: Correct propensity score model, G(x, γ)

If one acknowledges misspecification of the conditional mean model, then this brings us to the second case of DR, where only the propensity score model is assumed to be correct. In this case, there is no longer a general way of consistently estimating the conditional mean parameters. However, a very useful mean fitting property of QMLEs in the linear exponential family can be utilized here to obtain consistent estimators of the unconditional means, E[y(g)], despite misspecification of mg(x, θg).24 The estimation strategy is to choose mg(x, θg) to be the inverse canonical link function, h(·), with the QLL corresponding to a choice of LEF density. In the generalized linear model (GLM) literature, the link function, h⁻¹(·), relates the mean of the distribution to a linear index:

    h⁻¹(μg(x)) = xθg    (3.26)

Then the first order conditions of such a QMLE problem give us

    E[ ∇θ mg(x, θ*g)' · (y(g) − mg(x, θ*g)) / v(mg(x, θ*g)) ] = 0    (3.27)

where θ*g denotes the pseudo-true parameter indexing the misspecified conditional mean model (White (1982)). By choosing the canonical link as the mean model of choice, the gradient in the numerator of 3.27 cancels with the variance term in the denominator. Note that this occurs only when one uses the canonical link associated with the chosen LEF density, and not with any other choice of link function. This in turn ensures that if one includes an intercept in x, the model fits the overall mean of the distribution (see Wooldridge (2010) chapter 13 for more detail), i.e.

    E[y(g)] = E[mg(x, θ*g)]

A short check of the cancellation is given below.

24 Słoczyński and Wooldridge (2018) use this mean fitting property for developing doubly robust estimators of various ATEs.
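For instance, for the Bernoulli QLL with its canonical logit link, the cancellation in 3.27 can be verified directly; this is a sketch of that one case, not of the general LEF argument.

    % Cancellation in (3.27) for the Bernoulli QLL with the logit (canonical)
    % link, m_g(x\theta_g) = \Lambda(x\theta_g) = \exp(x\theta_g)/(1+\exp(x\theta_g)):
    \nabla_{\theta_g} m_g = \Lambda(x\theta_g)\big[1-\Lambda(x\theta_g)\big]\,x,
    \qquad
    v(m_g) = m_g(1-m_g) = \Lambda(x\theta_g)\big[1-\Lambda(x\theta_g)\big],
    % so the ratio in (3.27) collapses to
    \frac{\nabla_{\theta_g} m_g'\,\big(y(g)-m_g\big)}{v(m_g)}
      = x'\big(y(g) - \Lambda(x\theta_g)\big).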
Under i.i.d. sampling, solving the sample analogue of the population FOC given in 3.27 would be sufficient to obtain consistent estimates of θ*g. However, in the presence of non-random assignment and missing outcomes, one needs to weight the first order conditions in 3.27 to ensure that θ*g is estimated consistently. In other words, one would solve the following moment conditions:

    Σ_{i=1}^N [ si·wi1 / (R(xi, wi1, δ̂)·G(xi, γ̂)) ] · xi' · [yi − h(α̂1 + xi β̂1)] = 0
    Σ_{i=1}^N [ si·wi0 / (R(xi, wi0, δ̂)·(1 − G(xi, γ̂))) ] · xi' · [yi − h(α̂0 + xi β̂0)] = 0    (3.28)

The choice of the LEF density has to be consistent with the range and nature of the outcome, y(g).25

25 The following combinations of QLL and link functions produce the mean fitting property.
1. Normal log-likelihood with identity link function, when there are no restrictions on the range of y(g):
    E[x' · (y(g) − xθ*g)] = 0
This is the first order condition for OLS, which ensures that E[y(g)] = E[xθ*g] if an intercept is included in the linear projection.
2. Poisson log-likelihood with log link function, when the range of y(g) is restricted to be non-negative (y(g) ≥ 0); after the gradient exp(xθ*g)·x' cancels with the variance exp(xθ*g):
    E[x' · (y(g) − exp(xθ*g))] = 0
3. Bernoulli log-likelihood with logit link function, when y(g) is restricted to be in the unit interval (y(g) ∈ [0, 1]); after the gradient exp(xθ*g)/(1+exp(xθ*g))² · x' cancels with the variance exp(xθ*g)/(1+exp(xθ*g))²:
    E[x' · (y(g) − exp(xθ*g)/(1+exp(xθ*g)))] = 0

Estimation summary under second part of DR: Estimation of the average treatment effect in the case of a misspecified mean model but correct propensity score and missing probability models follows in two steps; a sketch appears after this list.

1. Depending upon the range and nature of the outcome variable, y(g), choose an appropriate LEF density. Choose the mean function mg(x, θg) = h(xθg), where h(·) is the inverse canonical link function associated with this chosen density. Using this combination of mean function and quasi-log-likelihood, use the moment conditions in 3.28 to obtain consistent estimates, θ̂g.

2. Using the estimates that solve problem 3.28, one can then obtain a consistent estimate of the average treatment effect as follows:

    τ̂ate = (1/N) Σ_{i=1}^N h(α̂1 + xi β̂1) − (1/N) Σ_{i=1}^N h(α̂0 + xi β̂0)    (3.29)

where α̂g and β̂g are the solutions to 3.28. The formal proof of consistency for τ̂ate in this case is given in appendix J, and follows in a manner similar to Negi and Wooldridge (2019).
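The following is a sketch of this two-step procedure, using the Poisson QLL with its canonical log link (the second combination in footnote 25) for a nonnegative outcome; the logit case is analogous. Here statsmodels' var_weights is used to put the λi weights into the GLM score, and the function name and arguments are illustrative.

    import numpy as np
    import statsmodels.api as sm

    def ate_second_half_dr(y, X, s, w, Ghat, Rhat):
        # Doubly weighted Poisson QMLE with the (canonical) log link. With a
        # canonical link the weighted GLM score is
        #   sum_i lam_i * x_i' (y_i - exp(x_i theta)),
        # which matches the moment conditions in (3.28).
        # Rhat holds R(x_i, w_i, delta_hat) at each unit's own w_i.
        mhat = {}
        for g in (0, 1):
            use = (s == 1) & (w == g)
            pg = Ghat if g == 1 else 1.0 - Ghat
            lam = 1.0 / (Rhat[use] * pg[use])
            fit = sm.GLM(y[use], X[use], family=sm.families.Poisson(),
                         var_weights=lam).fit()
            mhat[g] = fit.predict(X)
        return mhat[1].mean() - mhat[0].mean()   # tau_hat from (3.29)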
Case 2: Misspecified missing outcomes probability model, R(x, wg, δ)

If the missing outcomes model is misspecified, then what suffices for consistent estimation of the ATE is a strengthening of the identification assumption from 3.2.1 to 3.5.1. In other words, θ0g must solve the relevant problem in the conditional sense. For instance, estimation via NLS would imply that θ0g indexes the true conditional mean function, i.e. E(y(g)|x) = mg(x, θ0g), with

    θ0g = argmin_{θg ∈ Θg} E[(y(g) − mg(x, θg))²|x]

and similarly for the QMLE example, θ0g = argmin_{θg ∈ Θg} E[−ln f(y(g), mg(x, θg))|x]. Hence, misspecification in R(x, wg, δ) can be allowed in exchange for θ0g being identified in the conditional sense.

To conclude, robust estimation of the ATE under the doubly weighted framework can be achieved as follows. If the missing outcomes probability model R(·, δ) is misspecified, then one can consistently estimate the ATE when the conditional mean model is correct. If R(·, δ) is correct, then one can estimate the ATE in the usual doubly robust manner, i.e. misspecification may be allowed either in the propensity score model or the conditional mean model, but not both. Finally, if the conditional mean model is misspecified, then both probability models, G(·, γ) (for the propensity score) and R(·, δ) (for the missing outcomes probability), need to be correct.

To illustrate robust estimation of the ATE using the proposed doubly weighted estimator, and to study its finite sample behavior, the next section discusses a simulation study which considers the different cases of misspecification mentioned above.

3.6.2 Monte carlo evidence

To allow for possible misspecification of the regression functions μg(x), I simulate two binary potential outcomes generated using a probit:

    y(g) = 1[y*(g) > 0],  y*(g) = xθ0g + u(g)

Note that x includes an intercept. The linear index, xθ0g, is parameterized so that the covariates are only mildly predictive of the potential outcomes, with R²0 = 0.19 and R²1 = 0.14 in the population.26 The two covariates and the two latent errors are drawn from two independent bivariate normal distributions:

    (x1, x2)' ~ N( (1, 2)', [[2, 0.2], [0.2, 3]] ),  (u(0), u(1))' ~ N( (0, 0)', [[1, 0.2], [0.2, 1]] )    (3.30)

The assignment and missing outcome mechanisms have been simulated to ensure that unconfoundedness and ignorability are satisfied:

    w1 = 1[w1* > 0],  s = 1[s* > 0],  where w1* = xγ0 + ξ and s* = zδ0 + ζ    (3.31)

with the errors ξ and ζ drawn from two independent standard logistic distributions.27 Misspecification in these models is allowed in both the functional form and the linear index, where for the misspecified cases I estimate a probit with x1 omitted from the linear index. For scenarios where the conditional mean is misspecified, I estimate a linear model with a correct index. The parameters γ0 and δ0, indexing the assignment and missingness mechanisms, have been chosen to ensure an average propensity of assignment of 0.41 and an average propensity of being observed of 0.38.28 The missing data have been simulated to imitate empirical settings where a significant portion of the outcomes is missing. Table ?? gives an estimation summary for the eight different cases of misspecification that are considered here. A sketch of this design follows.

26 Here θ0_0 = (0, 1, 1)' and θ0_1 = (−1, 1, 1)'. With cross-sectional data, covariates are typically seen to be mildly predictive of the outcome. For example, in the National Supported Work dataset from Calónico and Smith (2017), baseline factors explain about 26-50 percent of the variation in the non-experimental sample and about 0.04-2 percent in the experimental sample, depending upon the included subset of covariates.
27 This implies that p(w1 = 1|x) = p1(x) = Λ(xγ0) and p(s = 1|wg, x) = r(wg, x) = Λ(zδ0), where Λ(·) is the standard logistic CDF.
28 Here γ0 = (0.05, −0.2, −0.11)', δ0 = (0.01, 0.03, 0.05, −0.28)' and z = (1, wg, x1, x2).
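The following is a minimal sketch of this data-generating process, with parameter values taken from footnotes 26-28; the variable names and sample size are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 5000

    # covariates and latent errors, as in (3.30)
    x = rng.multivariate_normal([1.0, 2.0], [[2.0, 0.2], [0.2, 3.0]], size=N)
    u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]], size=N)
    X = np.column_stack([np.ones(N), x])           # x includes an intercept

    theta0 = np.array([0.0, 1.0, 1.0])             # footnote 26
    theta1 = np.array([-1.0, 1.0, 1.0])
    y0 = (X @ theta0 + u[:, 0] > 0).astype(float)  # binary probit outcomes
    y1 = (X @ theta1 + u[:, 1] > 0).astype(float)

    # assignment and missingness, as in (3.31), with logistic errors
    gamma0 = np.array([0.05, -0.2, -0.11])         # footnote 28
    w1 = (X @ gamma0 + rng.logistic(size=N) > 0).astype(int)
    Z = np.column_stack([np.ones(N), w1, x])       # z = (1, wg, x1, x2)
    delta0 = np.array([0.01, 0.03, 0.05, -0.28])
    s = (Z @ delta0 + rng.logistic(size=N) > 0).astype(int)

    y = np.where(w1 == 1, y1, y0)
    y[s == 0] = np.nan                             # outcomes unobserved when s = 0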
Results

I discuss results for cases (4) and (5), as these two scenarios are highlighted in sections 3.4 and 3.5. I also discuss case (8): even though the theory developed in this paper is silent when all three components of the framework are misspecified, the simulation results look promising. All other cases are given in the appendix.

Case (4) depicts the possibility that the conditional mean model is correct but both probability models are misspecified. For this case, one can see that weighting does not have any added bite in resolving the identification problem beyond what is already achieved from having a correct mean function. In figure H.2 d), the empirical distributions of the estimated ATE for the unweighted, propensity score weighted, and doubly weighted estimators all coincide. Moreover, all are centered on the true ATE. In terms of root mean squared error, all three perform the same for a sample size of 5000, but PS-weighting performs better when the sample size is 1000. This suggests that PS-weighting could be beneficial in terms of RMSE, at least for small sample sizes. Estimating the propensity score reduces the variance of the weighted score of the problem, which will not necessarily be the case when estimating both probability models. So it might be better to use the propensity score weighted estimator when the conditional mean function is correctly specified.

Case (5) considers a scenario where the mean function is misspecified but the two probability models are correct. This is the principal case covered in section 3.4, where weighting has a crucial role to play. The average bias in the unweighted estimator of the ATE is higher than for the doubly weighted estimator. In fact, the empirical distribution of the unweighted estimator is shifted to the right, whereas that of the doubly weighted estimator is centered on the truth (refer to figure ??). In this case, the doubly weighted estimator has both the smallest bias and the smallest RMSE among all three estimators. Under this case, I also consider the doubly weighted estimator which uses known weights (see table ?? for reference). In finite samples, estimation of the weights could result in conservative variance estimates. While estimating the weights would result in a smaller residual of the weighted score (li), the residual variance could be larger compared to the known-weights estimator because of non-zero cross correlations between the probability scores.

Finally, case (8) considers the scenario where all components of the framework are misspecified. The theory in this paper does not address this case. However, this is an interesting possibility, given that misspecification of all components is a valid concern. The simulation results do offer some insight here. The doubly weighted estimator seems to be the only estimator that delivers the true ATE on average, whereas the others are away from the truth (see table ??).

3.6.3 Quantile effects

Under treatment effect heterogeneity, distributional impacts beyond the ATE are of increasing interest to researchers, especially in program evaluation studies. However, unlike the case of the ATE, it is generally not possible to obtain robust estimation of UQTEτ.29 In this section, I employ the double weighting framework to focus attention on estimating three different quantile effects, namely UQTEτ, CQTEτ, and a weighted linear approximation (LP) to the true CQTEτ. Whether θ0g indexes the true CQF or an approximation will depend on what is being assumed about the conditional quantile model and the estimation method used.

Let us assume that the two potential outcomes are continuous on R and that the unconditional quantiles of y(g) are unique and do not have any flat spots at the τth quantile. Then, the conditional quantiles of y(0) and y(1) given covariates, x, are defined as

    Quantτ(y(g)|x) = inf{y : F_y(g)(y|x) ≥ τ},  where 0 < τ < 1

where F_y(g)(y|x) is the distribution function of y(g) conditional on x and is assumed to have density f(y(g)|x).

29 This is because averaging the CQTEτ does not give us the UQTEτ.
Then, the CQTEτ at x = x0 for the τth quantile is defined as the difference in the conditional quantiles of the two outcome distributions, i.e.

    CQTEτ(x0) = Quantτ(y(1)|x0) − Quantτ(y(0)|x0)

Similarly, UQTEτ is defined as the difference in the τth unconditional quantiles of the two outcome distributions:

    UQTEτ = Quantτ(y(1)) − Quantτ(y(0))

Let quantg,τ(x, θg) be a model for the τth conditional quantile of y(g). This is said to be correctly specified for Quantτ(y(g)|x) if

    Quantτ(y(g)|x) = quantg,τ(x, θ0g(τ)), for some θ0g(τ) ∈ Θg, g = 0, 1    (3.32)

The next section discusses estimation under the first case, when R(·) is correctly specified.

Case 1: Correct missing probability model, R(x, wg, δ)

Similar to the ATE case, when R(·) is correctly specified, one obtains the nested DR result of causal inference. However, the parameter estimable in each case depends on what is being assumed about the CQF. To consider each of these scenarios in detail, consider the first half of DR, when we have a correct CQF.

a. First half of DR: Correct conditional quantiles, quantg,τ(x, θg)

If the CQF is correctly specified, as defined in 3.32, then one can decompose the potential outcomes as

    y(g) = quantg,τ(x, θ0g(τ)) + uτ(g),  Quantτ(uτ(g)|x) = 0    (3.33)

In this case, there are two estimation methods that will ensure consistent estimation of the correct CQF parameters, θ0g(τ). The first is quantile regression (QR) of Koenker and Bassett (1978). The second is a class of quasi maximum likelihood estimators in a special 'tick-exponential' family of distributions proposed by Komunjer (2005). This method is analogous to estimation of correctly specified conditional mean parameters using QMLE in the linear exponential family. The 'first part' of this double robustness result implies that any inverse propensity score weighted version of the QR or QML objective functions, irrespective of whether those weights are correct, will also deliver a consistent and √N-asymptotically normal estimator of θ0g(τ).

For estimation that uses QR, correct specification as given in 3.33 implies that θ0g(τ) will actually solve the stronger conditional problem,

    θ0g(τ) = argmin_{θg ∈ Θg} E[cτ(y(g) − quantg,τ(x, θg))|x]    (3.34)

where cτ(u) = u·(τ − 1[u < 0]) is the check function defined for some random variable u. Since θ0g(τ) satisfies the stronger identification condition, results from section 3.5 can be applied. This means that weighting is not needed for consistent estimation of θ0g(τ), irrespective of whether the weighting functions are correct or not.

In a similar vein, estimation via QML using the tick-exponential family implies that, as long as the CQF is correct,

    θ0g(τ) = argmin_{θg ∈ Θg} E[−ln(φτ(y(g), quantτ,g(x, θg)))|x]    (3.35)

where φτ(·,·) is a density that belongs to the 'tick-exponential' family characterized by

    φτ(y, η) = exp[−(1 − τ)[a(η) − b(y)]·1{y ≤ η} + τ[a(η) − c(y)]·1{y > η}]

where τ ∈ (0, 1), a(·) is continuously differentiable, and b(·) and c(·) are continuous functions such that η ∈ M ⊂ R.30

Once we have obtained θ̂g, either by solving the QR or the QML problem, the conditional quantile treatment effect for the subgroup defined by xi can be estimated as

    ĈQTEτ(xi) = quant1,τ(xi, θ̂1) − quant0,τ(xi, θ̂0).

30 φτ(y, η) is a probability density, and η is the τ-quantile of φτ, such that ∫_{−∞}^{η} φτ(y, η)dy = τ. Komunjer (2005) shows that if one chooses a(η) = 1/(τ(1−τ)) · η and b(y) = c(y) = 1/(τ(1−τ)) · y, then the quasi log likelihood function is proportional to the check function that was originally introduced by Koenker and Bassett (1978).
b. Second half of DR: Correct propensity score model, G(x, γ)

Suppose now that the propensity score model is correctly specified while the conditional quantile model is misspecified. Traditionally, the theory of quantile estimation has not dealt with this case of misspecification.31 However, Angrist et al. (2006b) establish an approximation property of QR with a misspecified linear CQF that is analogous to the approximation property of linear regression.32 Hence, solving the QR objective function with quantτ,g(x, θg) = xθg would still identify a weighted approximation to the CQF.

31 Kim and White (2003) establish consistency and asymptotic normality of the QR estimator for a pseudo-true value in the case of a misspecified linear conditional quantile model.
32 Adapting Angrist et al. (2006b)'s notation to the potential outcomes framework, the parameters that solve the QR problem solve a weighted mean square approximation to the true CQF,

    θ0g(τ) = argmin_{θg ∈ Θg} E[ωτ(x, θg) · (Quantτ(y(g)|x) − xθg)²]

where ωτ(x, θg) = ∫₀¹ (1 − u)·f_y(g)(u·xθg + (1 − u)·Quantτ(y(g)|x)|x) du is the weighting function that determines the importance given by the minimizer, θ0g, to points in the support of x.

Under i.i.d. sampling, solving the sample QR objective function is sufficient to obtain consistent estimates of θ*g. However, as in the case of the ATE, weighting becomes crucial in the presence of non-random assignment and missing outcomes. In other words, one would need to weight the QR estimator with the correct propensity score and missing outcomes probability models to consistently estimate θ*g. For instance, one would now solve the following treatment and control group problems,

    min_{θ1 ∈ Θ1} Σ_{i=1}^N [ si·wi1 / (R(xi, wi1, δ̂)·G(xi, γ̂)) ] · cτ(yi(1) − xiθ1)
    min_{θ0 ∈ Θ0} Σ_{i=1}^N [ si·wi0 / (R(xi, wi0, δ̂)·(1 − G(xi, γ̂))) ] · cτ(yi(0) − xiθ0)    (3.36)

and the solutions to these sample problems, θ̂g, are interpretable as providing a weighted LP to the true CQTEτ. A sketch of how such weighted QR problems can be solved is given below.
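The weighted QR problems in 3.36 can be solved exactly as linear programs; the following is a minimal sketch using scipy's linprog, with the function name and arguments illustrative. For the treated-group problem, one would pass only the observed treated units with λi = 1/(R(xi, wi1, δ̂)·G(xi, γ̂)).

    import numpy as np
    from scipy.optimize import linprog

    def weighted_qr(y, X, lam, tau):
        # Minimizes sum_i lam_i * c_tau(y_i - x_i @ theta) via the standard
        # LP form of quantile regression:
        #   min  tau * lam'u_plus + (1 - tau) * lam'u_minus
        #   s.t. X theta + u_plus - u_minus = y,  u_plus, u_minus >= 0
        n, k = X.shape
        c = np.concatenate([np.zeros(k), tau * lam, (1.0 - tau) * lam])
        A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
        bounds = [(None, None)] * k + [(0, None)] * (2 * n)
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
        return res.x[:k]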
Case 2: Misspecified missing outcomes probability model, R(x, wg, δ)

If the missing outcomes model is misspecified, then what suffices for consistent estimation of θ0g is a strengthening of the identification condition from 3.2.1 to 3.5.1. For estimation via quantile regression, this means that θ0g solves the conditional QR problem

    θ0g = argmin_{θg ∈ Θg} E[cτ(y(g) − quantg,τ(x, θg))|x]

which will hold only when the conditional score of the check function is zero, i.e.

    E[ −x'{ τ·1[y(g) − quantg,τ(x, θ0g) ≥ 0] − (1 − τ)·1[y(g) − quantg,τ(x, θ0g) < 0] } | x ] = 0

and this will be true only when Quantτ(y(g)|x) = quantg,τ(x, θ0g). So, misspecification in R(x, wg, δ) can be allowed in exchange for having a correctly specified conditional quantile model.

Direct estimation of UQTEτ

As was mentioned earlier in this section, estimating UQTEτ from CQTEτ(x) is generally not possible, even if we assume a correct model for the conditional quantiles of the outcomes. This is because the mean of the quantiles is not equal to the quantiles of the mean. Hence, one cannot obtain unconditional quantiles from averaging conditional quantiles over x. In this case, one can directly estimate the marginal quantiles by running a quantile regression of y(g) on an intercept (as shown in Firpo (2007)).33 In the present case, one would weight the objective function by the two probabilities in the following manner,

    θ̂1(τ) = argmin_{θ1 ∈ Θ1} Σ_{i=1}^N [ si·wi1 / (R(xi, wi1, δ̂)·G(xi, γ̂)) ] · cτ(yi(1) − θ1)
    θ̂0(τ) = argmin_{θ0 ∈ Θ0} Σ_{i=1}^N [ si·wi0 / (R(xi, wi0, δ̂)·(1 − G(xi, γ̂))) ] · cτ(yi(0) − θ0)

whose probability limits, θ0_1(τ) and θ0_0(τ), are the marginal τ-quantiles when both probability models are correctly specified. Weighting by G(·) and R(·) is crucial here, since these weights primarily serve to remove the selection biases due to non-random assignment and missing data. Then, one can obtain the unconditional quantile treatment effect as

    UQTEτ = θ0_1(τ) − θ0_0(τ)

33 Firpo (2007) uses propensity score weighting to directly estimate unconditional quantiles in the presence of non-random assignment.
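For these intercept-only problems, the minimizer has a simple closed form: a weighted sample quantile. The following is a small sketch, with illustrative names.

    import numpy as np

    def weighted_quantile(y, lam, tau):
        # The intercept-only weighted QR reduces to a weighted sample quantile:
        # the smallest y at which the normalized cumulative weight reaches tau.
        order = np.argsort(y)
        cum = np.cumsum(lam[order]) / lam.sum()
        return y[order][np.searchsorted(cum, tau)]

    # UQTE_tau = weighted_quantile(y_treated, lam1, tau) \
    #          - weighted_quantile(y_control, lam0, tau)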
The next section explores estimation of these three quantile estimands using a Monte Carlo experiment, where I allow misspecification of the weighting functions and the conditional quantile model.

3.6.4 Monte carlo evidence

To ensure that the marginal quantiles of the potential outcome distributions are unique with no flat spots, I simulate two continuous non-negative outcomes as follows,

    y(g) = exp(xθ0g + u(g)), for g = 0, 1

where θ0_1 = (0.1, −0.36, −0.1)' and θ0_0 = (0.2, 0.24, −0.45)' are parameterized to ensure R²0 = 0.15 and R²1 = 0.13 in the population. The two covariates and the two latent errors are drawn from two independent normal distributions following eq (3.30). The missing outcomes and the treatment assignment mechanisms are also generated according to eq (3.31). Since exp(·) is an increasing continuous function, the equivariance property of quantiles implies that

    Quantτ(y(g)|x) = Quantτ(exp(xθ0g + u(g))|x) = exp(Quantτ(xθ0g + u(g)|x)) = exp(xθ0g + Quantτ(u(g)|x)) = exp(xθ0g + Φ⁻¹(τ))

where Φ⁻¹(τ) is the inverse standard normal CDF evaluated at τ. This equivariance property helps to characterize and estimate the CQTE for cases when the CQF is correct.

For brevity, I study the behavior of the unweighted, propensity score weighted, and double weighted estimators for only five of the eight cases of misspecification. These are enumerated in table ?? below. Of these, I discuss cases 4, 5 and 8 in the main text; results for the rest can be found in the appendix. Cases 4 and 5 correspond to the scenarios for which results are derived in sections 3.5 and 3.4, respectively. The last case corresponds to the scenario where all components of the doubly weighted framework are misspecified. Even though the theory in this paper does not address that specific case, the simulation results show that the proposed estimator has the lowest bias among all three.

Results

For the first case, when the CQF is correctly specified, figure H.3 plots the CQTE as a function of x1 for the 25th quantile of the outcome distribution. Results for the 50th and 75th quantiles are given in the appendix. One can see that the estimated function coincides with the true CQTE.34 To make this case interesting, I consider misspecification of both probability models. As the results in section 3.5 dictate, all three estimators (unweighted, ps-weighted and double-weighted) will be consistent for the true CQTE because the CQF is correctly specified. Hence, misspecification of the two probability models does not affect consistent estimation of the estimand. In fact, weighting by any positive function would deliver this result, including the ps-weighted estimator.

Next, I consider the case when the CQF is misspecified. Using the results in Angrist et al. (2006b), I interpret the solution to the double-weighted problem given in eq 3.36 as providing a consistent weighted linear projection to the true CQF. I use these linear projections to estimate an LP to the true CQTE. Figure H.4 plots the bias in the estimated LP relative to the true LP as a function of x1 for the three estimators. In panel A), where both probability models are correct, the relative bias from the double-weighted estimator is the lowest and coincides with the line of no bias. Panel D) considers the case where all three parametric specifications are wrong. Again, we see that the double-weighted estimator performs best in terms of bias. Even though the theory does not guide us here, double weighting seems to be the least biased procedure.

Finally, I consider direct estimation of the unconditional quantile treatment effect (UQTE) at the 25th quantile. Again, results for the 50th and 75th quantiles can be found in the appendix. Notice that estimation of the UQTE does not require parametric specification of the CQF, since it is the difference in marginal quantiles. Hence, the two probability models are the only relevant components of the framework that affect consistent estimation of the UQTE. In the first case, when both probability models are correct, the unweighted and double-weighted estimators are both close to the true quantile effect. For the second case, where both probability models are misspecified, double weighting does a little worse than not weighting at all. However, the results at other quantile levels reflect more favorably upon double weighting. Propensity score weighting performs the worst in both cases, suggesting that there are instances where correcting only for nonrandom assignment may not be better than not weighting at all.

34 For plotting these functions, I first collect the QR estimates that solve the unweighted, ps-weighted and double-weighted check function (defined in 3.34) corresponding to a particular quantile level, τ ∈ {0.25, 0.50, 0.75}, across 1000 Monte Carlo repetitions. I then draw a linearly spaced x1 vector and simulate the CQTE using the 1000 estimated QR coefficients. Averaging these 1000 functions at each point on the x1 vector gives me the estimated average CQTE function. I plot this along with the 1000 individual functions and the true CQTE, which is calculated using the population QR parameters, θ0g.

Tables below report the bias and RMSE of the three estimators, along with the double weighted estimator that uses known probability weights. When the two probability models are correct, the double-weighted estimator has the lowest RMSE. This, however, ceases to be true when the two probabilities are misspecified.
3.7 Application to Calónico and Smith (2017)

In this section, I apply the proposed estimator to the Aid to Families with Dependent Children (AFDC) sample of women from the National Supported Work (NSW) program compiled by Calónico and Smith (2017).35 NSW was a transitional and subsidized work experience program which was implemented as a randomized experiment in the United States between 1975-1979. CS replicate LaLonde (1986)'s within-study analysis for the AFDC women in the program, where the purpose of such an analysis is to evaluate how training estimates obtained from non-experimental identification strategies (for example, the CIA) compare to experimental estimates. To compute the non-experimental estimates, CS combine the NSW experimental sample with two non-experimental comparison groups drawn from the PSID, called PSID-1 and PSID-2.36

35 Henceforth, Calónico and Smith (2017) is referred to as CS.
36 The PSID-1 sample constructed by CS keeps all female household heads continuously from 1975-1979 who were between 20 and 55 years of age in 1975 and were not retired in 1975. The sample labeled PSID-2 further restricts PSID-1 to include only those women who received AFDC welfare in 1975.

In this paper, I utilize the within-study feature of this empirical application to estimate the bias in the unweighted and propensity-score weighted estimates, relative to the proposed double weighting procedure. To construct these measures, I augment the CS sample to allow for women who had missing earnings information in 1979. This renders 26% of the experimental and 11% of the PSID samples missing. I then combine the experimental treatment group of NSW with three distinct comparison groups present in the CS dataset, namely the experimental control group and the two PSID samples, to compute the unweighted, single-weighted and double-weighted training estimates.37 The difference between the non-experimental estimate, obtained from using the doubly weighted estimator, and the experimental estimate provides the first measure of estimated bias associated with the proposed strategy. Combining the experimental control group with the non-experimental comparison group gives a second measure of estimated bias (Heckman et al. (1998a)). Much like CS, I report both these measures across a range of regression specifications for the average training estimates.

37 For details regarding sample construction and other aspects of this application, see appendix G.

Given the growing importance of estimating distributional impacts of training programs, I also estimate marginal quantile treatment effects at every 10th quantile of the 1979 earnings distribution. The role of double weighting in ensuring consistency of the estimates is highlighted in the case of estimating marginal quantiles, where covariates, which primarily serve to remove biases arising from non-random assignment and missing outcomes, enter the estimating equation only through the two probability models.

3.7.1 Results

First, to evaluate whether women with missing earnings in 1979 were significantly different from those who were observed, Table I.17 reports the mean and standard deviation of the woman's age, years of schooling, pre-training earnings and other characteristics across the observed and missing samples. In terms of age, the women who were observed in the experimentally treated group of NSW and the PSID-1 sample were, on average, older than those who were missing. The observed women in PSID-1 were also more likely to be married. For the PSID-2 sample, women who were observed had, on average, more kids and higher pre-training earnings.
Apart from these minor differences, the observed women did not appear to be systematically different from those who were missing, as measured through observable characteristics.

The presence of non-experimental comparison groups implies that nonrandom assignment is an issue in the sample. This is because the comparison groups were drawn from the PSID after imposing only a partial version of the full NSW eligibility criteria. Table I.16 provides descriptive statistics for the covariates, by treatment status. As can be expected, the treatment and control groups of NSW are not observably different, indicating the strong role that randomization plays in producing comparable groups. In contrast, the women in the PSID-1 and PSID-2 groups are statistically different from the treatment group members, implying substantial scope for nonrandom assignment.

3.7.1.1 Estimated bias for average and unconditional quantile training effects

Table I.18 reports the doubly-weighted, ps-weighted and unweighted average training estimates, which use the three different comparison groups: NSW control, PSID-1 and PSID-2. The unweighted (unadjusted and adjusted) experimental estimates given in row 1 are the same as the estimates reported by CS in Table 3 of their paper. Overall, one can see that the doubly weighted experimental estimates are more stable than the single-weighted or unweighted estimates across the different regression specifications, with a range between $824-$828.

For computing the ps-weighted and double-weighted non-experimental estimates, I first trim the sample to ensure common support between the treatment and comparison groups.38 This reduces the sample size from 1,248 to 1,016 observations for the PSID-1 estimates and from 782 to 720 observations for the PSID-2 estimates. A pattern that is consistent across the two sets of non-experimental estimates is that weighting gets us much closer to the benchmark, relative to not weighting at all. For instance, the unweighted simple difference in means estimate of training, which uses the PSID-1 comparison group, is -$799, whereas the weighted estimates are $827 and $803. For the PSID-2 comparison group, the unweighted estimate which controls for all covariates is $335, whereas the weighted estimates are $905 and $904.

38 Appendix G describes estimation of the two probability weights along with the sample trimming criteria.

The second panel of Table I.18 reports the bias in training estimates from combining the experimental control group with the PSID comparison groups. A similar pattern is seen here, with the weighted bias estimates being much closer to zero than the unweighted estimates. For instance, the double-weighted estimate that adjusts for all covariates using the PSID-1 comparison group is -$21, whereas the unweighted estimate is -$568. These results suggest that the argument for weighting is strong when using a non-experimental comparison group, where nonrandom assignment and missing outcomes are significant problems.39

39 Note that the large standard errors for the non-experimental estimates can be attributed to the small sample sizes and to the large residual variance of earnings in the PSID-1 and PSID-2 populations.

Figure H.1 plots the relative bias in the UQTE estimates at every 10th quantile of the 1979 earnings distribution. Much like the average training estimates, we see that the weighted estimates consistently lie below the unweighted estimates for most quantiles, irrespective of whether we use the PSID-1 or PSID-2 non-experimental group.
Note that I do not plot the UQTE estimates for quantiles less than 0.46, since these are all zero.40

40 There are many women in the experimental and PSID samples with zero real earnings in 1979.

This empirical application illustrates the role of the proposed estimator in both experimental and observational data contexts. The comparison involving the treatment and control groups of NSW demonstrates its use in an experiment with missing outcomes, whereas the non-experimental sample demonstrates its use in the more realistic observational data setting.

3.8 Conclusion

In empirical research, the problems of nonrandom assignment and missing outcomes threaten identification of causal parameters. This paper proposes a new class of consistent and asymptotically normal estimators that address these two issues using a double inverse probability weighted procedure. The method combines propensity score weighting with weighting for missing data in a general M-estimation framework, which can be utilized to study a range of problems, such as ordinary least squares, quasi maximum likelihood, and quantile regression. In addition, the proposed class is characterized by a robustness property, which makes it resilient to parametric misspecification of a conditional model of interest (CEF or CQF) and the two weighting functions. As leading applications of this framework, the paper discusses robust estimation of the ATE and QTEs. A Monte Carlo study indicates that the doubly weighted estimates of average and quantile effects have the lowest bias, compared to naive alternatives (unweighted or propensity-score weighted estimators), for interesting cases of misspecification.

Finally, the estimator is applied to the data on AFDC women from the NSW program compiled by Calónico and Smith (2017). The presence of experimental and non-experimental comparison groups in this application helps to quantify the estimated bias in the double-weighted training estimates. Results suggest that the argument for weighting is strong whenever nonrandom assignment and (or) missing outcomes are significant concerns. Since the severity and magnitude of the bias introduced from ignoring either problem cannot be assessed ex ante, a safe bet from the practitioner's perspective is to provide both weighted and unweighted causal effect estimates.

Practically, the doubly weighted estimator is easy to implement. Appendix F.3 provides example code that uses the Stata gmm command for implementing the double-weighted estimator of the ATE. Computation of analytically correct standard errors, however, requires additional coding and is still a work in progress. Alternatively, one can use bootstrapped standard errors, which will provide asymptotically correct inference; a sketch of that approach is given below.
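The following is a minimal sketch of such a bootstrap; the interface is hypothetical, with `estimator` standing in for any routine (such as the doubly weighted ATE) that re-estimates both probability models on each resample, so that the first-stage estimation error is reflected in the standard error.

    import numpy as np

    def bootstrap_se(estimator, data, B=500, seed=0):
        # `estimator` maps a dict of equal-length arrays to a scalar estimate;
        # both probability models should be refit inside it on each resample.
        rng = np.random.default_rng(seed)
        n = len(next(iter(data.values())))
        stats = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            stats.append(estimator({k: v[idx] for k, v in data.items()}))
        return np.std(stats, ddof=1)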
Even though missing outcomes are a common concern in empirical analysis, it is equally common to encounter missing data on the covariates. A particularly important future extension will be to allow for missing data on both. In this case, using a generalized method of moments framework which incorporates information on complete and incomplete cases could provide efficiency gains over just using the observed data.

APPENDICES

APPENDIX A
FIGURES FOR CHAPTER 1

A.1 Root mean squared error across different sample sizes

Figure A.1: Quadratic design, continuous covariates (mild heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.2: Quadratic design, continuous covariates (strong heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.3: Quadratic design, one binary covariate (mild heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.4: Quadratic design, one binary covariate (strong heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.5: Probit design, continuous covariates (mild heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.6: Probit design, continuous covariates (strong heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.7: Probit design, one binary covariate (mild heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.8: Probit design, one binary covariate (strong heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.9: Binary outcome, bernoulli QLL with logistic mean. Panels: N=500, N=1000.
Figure A.10: Non-negative outcome, poisson QLL with exponential mean. Panels: N=500, N=1000.

APPENDIX B
TABLES FOR CHAPTER 1

Table B.1: QLL and mean function combinations

Restrictions on support of response       Quasi-Log Likelihood Function    Conditional Mean Function
None                                      Gaussian (Normal)                Linear
Y(w) ∈ [0, 1]                             Bernoulli                        Logistic
Y(w) ∈ [0, B]                             Binomial                         Logistic
Y(w) ≥ 0                                  Poisson                          Exponential
Yg(w) ≥ 0, Σ_{g=0}^{G} Yg(w) = 1          Multinomial                      Logistic
Table B.2: Bias and standard deviation for N=100
[Bias and standard deviation of the SDM, PRA, FRA and IRA estimators for DGP1-DGP8, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator.
b Simulation across 1000 replications.
Table B.3: Bias and standard deviation for N=500
[Bias and standard deviation of the SDM, PRA, FRA and IRA estimators for DGP1-DGP8, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator.
b Simulation across 1000 replications.
Table B.4: Bias and standard deviation for N=1000
[Bias and standard deviation of the SDM, PRA, FRA and IRA estimators for DGP1-DGP8, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator.
b Simulation across 1000 replications.

Table B.5: Bias and standard deviation for binary outcome
[Bias and standard deviation of the SDM, PRA, FRA, N-PRA and N-RA estimators, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, for N=500 and N=1000.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator, N-PRA refers to pooled non-linear regression adjustment and N-RA refers to separate nonlinear regression adjustment.
b Simulation across 1000 replications.
c True ATE is 0.037, R²0 = 0.491 and R²1 = 0.457.

Table B.6: Bias and standard deviation for non-negative outcome
[Bias and standard deviation of the SDM, PRA, FRA, N-PRA and N-RA estimators, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, for N=500 and N=1000.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator, N-PRA refers to pooled non-linear regression adjustment and N-RA refers to separate nonlinear regression adjustment.
b Simulation across 1000 replications.
c True ATE is 0.012, R²0 = 0.435 and R²1 = 0.233.

APPENDIX C
PROOFS FOR CHAPTER 1

Proof of Lemma 5.1

Proof. (Asymptotic variance of SDM) Consider the difference-in-means estimator. We can write the sample average for the treated as

    Ȳ1 = N1⁻¹ Σ_{i=1}^N Wi Yi = μ1 + N1⁻¹ Σ_{i=1}^N Wi [Ẋiβ1 + Ui(1)]

Therefore, because N1/N →p ρ,

    √N(Ȳ1 − μ1) = (N/N1)·N^{−1/2} Σ_{i=1}^N Wi[Ẋiβ1 + Ui(1)] = (1/ρ)·N^{−1/2} Σ_{i=1}^N Wi[Ẋiβ1 + Ui(1)] + op(1)
By the CLT, N^{−1/2} Σ_{i=1}^N Wi[Ẋiβ1 + Ui(1)] →d Normal(0, c1²), where

    c1² = E{ Wi[Ẋiβ1 + Ui(1)]² } = ρ(β1'ΩXβ1 + σ1²)

where the independence of Wi and (Xi, Ui(1)) is used. It follows that

    Avar[√N(Ȳ1 − μ1)] = (1/ρ)²·ρ·(β1'ΩXβ1 + σ1²) = (β1'ΩXβ1 + σ1²)/ρ    (C.1)

Similarly,

    Avar[√N(Ȳ0 − μ0)] = (β0'ΩXβ0 + σ0²)/(1 − ρ)    (C.2)

Combining results from (C.1) and (C.2), we have

    √N(τ̂SDM − τ) →d N(0, ω²SDM)

Since the sample averages are asymptotically uncorrelated,

    ω²SDM = β1'ΩXβ1/ρ + β0'ΩXβ0/(1 − ρ) + σ1²/ρ + σ0²/(1 − ρ). □

Proof. (Asymptotic variance of PRA) To find the asymptotic variance of τ̌, note that it can be obtained from the regression of Yi on 1, Wi, Ẋi. Note that Ẋi is orthogonal to (1, Wi) because E(Ẋi) = 0 and Wi is independent of Xi. We know that L(Yi|1, Wi) = μ0 + τWi because τ = E(Yi|Wi = 1) − E(Yi|Wi = 0). Therefore,

    L(Yi|1, Wi, Ẋi) = μ0 + τWi + Ẋiβ

By orthogonality,

    β = [E(Ẋi'Ẋi)]⁻¹ E(Ẋi'Yi)

Now

    Yi = (1 − Wi)μ0 + (1 − Wi)Ẋiβ0 + (1 − Wi)Ui(0) + Wiμ1 + WiẊiβ1 + WiUi(1)

Therefore,

    E(Ẋi'Yi) = E[(1 − Wi)Ẋi'Ẋi]β0 + E[WiẊi'Ẋi]β1 = (1 − ρ)E(Ẋi'Ẋi)β0 + ρE(Ẋi'Ẋi)β1

where we use the linear projection properties of the errors, E(Ẋi) = 0, and independence of Wi and [Xi, Ui(0), Ui(1)]. Plugging in gives

    β = (1 − ρ)β0 + ρβ1

Now we can write the projection error as

    Ui = (1 − Wi)μ0 + (1 − Wi)Ẋiβ0 + (1 − Wi)Ui(0) + Wiμ1 + WiẊiβ1 + WiUi(1) − μ0 − (μ1 − μ0)Wi − Ẋi[(1 − ρ)β0 + ρβ1]
       = −(Wi − ρ)Ẋiβ0 + (1 − Wi)Ui(0) + (Wi − ρ)Ẋiβ1 + WiUi(1).

Because (1, Wi) is orthogonal to Ẋi, it follows as in the previous section that

    √N(τ̂PRA − τ) = [E(Ẇi²)]⁻¹ N^{−1/2} Σ_{i=1}^N (Wi − ρ)Ui + op(1) = [ρ(1 − ρ)]⁻¹ N^{−1/2} Σ_{i=1}^N (Wi − ρ)Ui + op(1)

Then, using the asymptotic equivalence lemma and the CLT, we have

    √N(τ̂PRA − τ) →d N(0, ω²PRA)

where ω²PRA = Var[(Wi − ρ)Ui]/[ρ(1 − ρ)]². Now we need to find the asymptotic variance of N^{−1/2} Σ_{i=1}^N (Wi − ρ)Ui. The term (Wi − ρ)Ui has zero mean by the linear projection property. Further,

    (Wi − ρ)Ui = −(Wi − ρ)²Ẋiβ0 + (Wi − ρ)²Ẋiβ1 + (Wi − ρ)(1 − Wi)Ui(0) + (Wi − ρ)WiUi(1)

The covariance between the last two terms is zero, as (1 − Wi)Wi = 0. The last two terms can be written as

    −ρ(1 − Wi)Ui(0) + (1 − ρ)WiUi(1)

and so

    Var[−ρ(1 − Wi)Ui(0) + (1 − ρ)WiUi(1)] = ρ²(1 − ρ)σ0² + (1 − ρ)²ρσ1².

Write the first two terms as (Wi − ρ)²Ẋi(β1 − β0). The variance is

    E[(Wi − ρ)⁴]·(β1 − β0)'ΩX(β1 − β0).

Combining all of the terms gives

    Avar[√N(τ̂PRA − τ)] = [ρ(1 − ρ)]⁻² { E[(Wi − ρ)⁴]·(β1 − β0)'ΩX(β1 − β0) + ρ²(1 − ρ)σ0² + (1 − ρ)²ρσ1² }

Note that we can write E[(Wi − ρ)⁴]/[ρ(1 − ρ)]² = E[(Wi − ρ)⁴]/[Var(Wi)]², and Jensen's inequality tells us this is greater than unity: take Zi = (Wi − ρ)². We can also show E[(Wi − ρ)⁴] = (1 − ρ)⁴ρ + ρ⁴(1 − ρ), and so the scale factor is

    E[(Wi − ρ)⁴]/[ρ(1 − ρ)]² = [(1 − ρ)⁴ρ + ρ⁴(1 − ρ)]/[ρ²(1 − ρ)²] = (1 − ρ)²/ρ + ρ²/(1 − ρ).

Hence,
Proof. (Asymptotic variance of FRA.) Now consider the full regression adjustment estimator. Let $\hat{\alpha}_1$ and $\hat{\beta}_1$ be the OLS estimates from the $W_i = 1$ sample,
\[
Y_i \ \text{on} \ 1, X_i, \qquad W_i = 1,
\]
and then
\[
\hat{\mu}_{1,FRA} = \hat{\alpha}_1 + \bar{X}\hat{\beta}_1,
\]
where $\bar{X}$ is the sample average over the entire sample. (For intuition, it is useful to note that $\bar{Y}_1 = \hat{\alpha}_1 + \bar{X}_1\hat{\beta}_1$, and so $\hat{\mu}_1$ uses a more efficient estimator of $\mu_X$.) By least squares mechanics, $\hat{\mu}_1$ is the intercept in the regression
\[
Y_i \ \text{on} \ 1, X_i - \bar{X}, \qquad W_i = 1.
\]
Let $\ddot{X}_i = X_i - \bar{X}$ and $\ddot{R}_i = (1, \ddot{X}_i)$, and define
\[
\hat{\gamma}_1 = \begin{pmatrix} \hat{\mu}_1 \\ \hat{\beta}_1 \end{pmatrix} = \left(N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\ddot{R}_i\right)^{-1} N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'Y_i(1).
\]
Now write
\[
Y_i(1) = \mu_1 + \dot{X}_i\beta_1 + U_i(1) = \mu_1 + \ddot{X}_i\beta_1 + (\dot{X}_i - \ddot{X}_i)\beta_1 + U_i(1) = \ddot{R}_i\gamma_1 + (\bar{X} - \mu_X)\beta_1 + U_i(1).
\]
Plugging in and rearranging gives
\[
\sqrt{N}(\hat{\gamma}_1 - \gamma_1) = \left(N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\ddot{R}_i\right)^{-1}\left[\left(N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta_1 + N^{-1/2}\sum_{i=1}^{N} W_i\ddot{R}_i'U_i(1)\right].
\]
Next, because $\bar{X} \xrightarrow{p} \mu_X$, the law of large numbers and Slutsky’s Theorem imply
\[
N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\ddot{R}_i = N^{-1}\sum_{i=1}^{N} W_i\dot{R}_i'\dot{R}_i + o_p(1) \xrightarrow{p} E\left(W_i\dot{R}_i'\dot{R}_i\right) = \rho\,E\left(\dot{R}_i'\dot{R}_i\right) \equiv \rho A,
\]
where $\dot{R}_i = (1, \dot{X}_i) = (1, X_i - \mu_X)$ and
\[
A = \begin{pmatrix} 1 & 0 \\ 0 & E(\dot{X}_i'\dot{X}_i) \end{pmatrix}
\]
is block diagonal because $E(\dot{X}_i) = 0$. The terms $\sqrt{N}(\bar{X} - \mu_X)\beta_1$ and $N^{-1/2}\sum_i W_i\ddot{R}_i'U_i(1)$ are $O_p(1)$, and so
\[
\sqrt{N}(\hat{\gamma}_1 - \gamma_1) = (1/\rho)A^{-1}\left[\left(N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta_1 + N^{-1/2}\sum_{i=1}^{N} W_i\ddot{R}_i'U_i(1)\right] + o_p(1).
\]
The first element of $N^{-1}\sum_i W_i\ddot{R}_i'$ is $N^{-1}\sum_i W_i = N_1/N = \hat{\rho} \xrightarrow{p} \rho$, and the first element of $N^{-1/2}\sum_i W_i\ddot{R}_i'U_i(1)$ is $N^{-1/2}\sum_i W_iU_i(1)$. Because of the block diagonality of $A$, the first element of $\sqrt{N}(\hat{\gamma}_1 - \gamma_1)$ satisfies
\[
\sqrt{N}\left(\hat{\mu}_{1,FRA} - \mu_1\right) = (1/\rho)\rho\,\sqrt{N}(\bar{X} - \mu_X)\beta_1 + (1/\rho)N^{-1/2}\sum_{i=1}^{N} W_iU_i(1) + o_p(1)
= N^{-1/2}\sum_{i=1}^{N}\left[(X_i - \mu_X)\beta_1 + W_iU_i(1)/\rho\right] + o_p(1).
\]
A similar argument gives
\[
\sqrt{N}\left(\hat{\mu}_{0,FRA} - \mu_0\right) = N^{-1/2}\sum_{i=1}^{N}\left[(X_i - \mu_X)\beta_0 + (1-W_i)U_i(0)/(1-\rho)\right] + o_p(1),
\]
and so
\[
\sqrt{N}\left(\hat{\tau}_{FRA} - \tau\right) = N^{-1/2}\sum_{i=1}^{N}\left[\dot{X}_i(\beta_1 - \beta_0) + W_iU_i(1)/\rho - (1-W_i)U_i(0)/(1-\rho)\right] + o_p(1).
\]
Again, by the asymptotic equivalence lemma and the CLT,
\[
\sqrt{N}\left(\hat{\tau}_{FRA} - \tau\right) \xrightarrow{d} N\left(0, \omega^2_{FRA}\right), \qquad \omega^2_{FRA} = \mathrm{Var}\left[\dot{X}_i(\beta_1-\beta_0) + W_iU_i(1)/\rho - (1-W_i)U_i(0)/(1-\rho)\right].
\]
The three terms inside the variance are pairwise uncorrelated: the second and third because $W_i(1-W_i) = 0$, and the first with the other two because, for example,
\[
E\left[(\beta_1-\beta_0)'\dot{X}_i'W_iU_i(1)\right] = E(W_i)(\beta_1-\beta_0)'E\left[\dot{X}_i'U_i(1)\right] = 0,
\]
since $E[\dot{X}_i'U_i(1)] = 0$ by the linear projection properties. It follows that
\[
\mathrm{Avar}\left[\sqrt{N}\left(\hat{\tau}_{FRA} - \tau\right)\right] = (\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0) + (1/\rho^2)E(W_i)E\left[U_i^2(1)\right] + \left(1/(1-\rho)^2\right)E(1-W_i)E\left[U_i^2(0)\right]
\]
\[
= (\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0) + \sigma_1^2/\rho + \sigma_0^2/(1-\rho). \;\blacksquare
\]
Proof. (Asymptotic variance of IRA.) The derivation for $\tau^*$ follows closely that for $\hat{\tau}_{FRA}$, with the important difference that $\ddot{X}_i = X_i - \bar{X}$ is replaced with $\dot{X}_i = X_i - \mu_X$. This means that the terms $\sqrt{N}(\bar{X} - \mu_X)\beta_1$ and $\sqrt{N}(\bar{X} - \mu_X)\beta_0$ do not appear. Therefore,
\[
\mathrm{Avar}\left[\sqrt{N}\left(\hat{\tau}_{IRA} - \tau\right)\right] = \sigma_1^2/\rho + \sigma_0^2/(1-\rho). \;\blacksquare
\]

Proof of Theorem 5.2

Proof. CLAIM 1: $\omega^2_{FRA} \leq \omega^2_{SDM}$. Consider the difference
\[
\mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{SDM}-\tau)\right] - \mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{FRA}-\tau)\right] = \beta_1'\Omega_X\beta_1/\rho + \beta_0'\Omega_X\beta_0/(1-\rho) - (\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0).
\]
This difference can be written as
\[
\left(\frac{1-\rho}{\rho}\right)\beta_1'\Omega_X\beta_1 + \left(\frac{\rho}{1-\rho}\right)\beta_0'\Omega_X\beta_0 + 2\beta_0'\Omega_X\beta_1 = \delta'\Omega_X\delta, \qquad \delta = \sqrt{\frac{1-\rho}{\rho}}\,\beta_1 + \sqrt{\frac{\rho}{1-\rho}}\,\beta_0.
\]
Because $\Omega_X$ is positive definite, this proves the claim. One case where there is no efficiency gain is when $\rho = 1/2$ and $\beta_1 = -\beta_0$: for instance, with scalar $X_i$, $\Omega_X = 1$, $\rho = 1/2$, and $\beta_1 = -\beta_0 = 1$, both variances equal $4 + 2\sigma_1^2 + 2\sigma_0^2$. The second condition seems unrealistic unless both vectors are zero.

CLAIM 2: $\omega^2_{FRA} \leq \omega^2_{PRA}$. From the expressions above,
\[
\mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{PRA}-\tau)\right] - \mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{FRA}-\tau)\right] = \left[\frac{(1-\rho)^2}{\rho} + \frac{\rho^2}{1-\rho} - 1\right](\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0) \geq 0,
\]
where the bracketed factor equals $(1-2\rho)^2/[\rho(1-\rho)] \geq 0$; equivalently, the scale factor exceeds unity by Jensen’s inequality, with equality only at $\rho = 1/2$.

CLAIM 3: $\omega^2_{IRA} \leq \omega^2_{FRA}$. It is easy to see why this holds, since
\[
\mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{FRA}-\tau)\right] - \mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{IRA}-\tau)\right] = (\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0),
\]
a quadratic form in the positive semi-definite matrix $\Omega_X$, which is therefore greater than or equal to zero.

Combining the results from Claims 1, 2, and 3 gives the theorem. $\blacksquare$
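As a quick numerical companion to Theorem 5.2, the following small Monte Carlo — an illustrative sketch added here with a made-up DGP and variable names of my own choosing, not a result from the chapter — compares the sampling variability of the three feasible estimators in Stata. With $\rho = 0.8$ and different slopes in the two arms, FRA should display the smallest standard deviation across replications, PRA an intermediate one, and SDM the largest:

* Monte Carlo sketch of the variance ranking in Theorem 5.2
clear all
set seed 12345
program define ratecomp, rclass
    drop _all
    set obs 1000
    gen x = rnormal()
    gen w = runiform() < .8                    // rho = 0.8, so the PRA penalty is active
    gen y = 1 + x + w*(1 + 2*x) + rnormal()    // tau = 1 with beta1 != beta0
    regress y w                                // SDM
    return scalar sdm = _b[w]
    quietly summarize x
    gen xd = x - r(mean)                       // demeaned covariate
    regress y w xd                             // PRA
    return scalar pra = _b[w]
    regress y i.w##c.xd                        // FRA: separate slopes via full interaction
    return scalar fra = _b[1.w]
end
simulate sdm=r(sdm) pra=r(pra) fra=r(fra), reps(1000): ratecomp
summarize sdm pra fra

Under this design, the variance formulas of Lemma 5.1 give $\omega^2_{SDM} = 22.5$, $\omega^2_{PRA} = 19.25$, and $\omega^2_{FRA} = 10.25$, so with $N = 1000$ the simulated standard deviations should be close to $0.15$, $0.14$, and $0.10$, making the ranking clearly visible.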
APPENDIX D

TABLES FOR CHAPTER 2

Table D.1: Summary of yes votes at different bid amounts

Bid      Yes-votes     %
$5          219       20
$25         216       20
$65         241       22
$120        181       17
$220        228       21
Total      1085      100

Table D.2: Lower bound mean willingness to pay estimate using ABERS and FRA estimators

                ABERS (SM)        FRA
PO means
  $5           0.689 (0.0313)   0.685 (0.0288)
  $25          0.569 (0.0338)   0.597 (0.0307)
  $65          0.485 (0.0323)   0.489 (0.0294)
  $120         0.403 (0.0365)   0.378 (0.0332)
  $220         0.289 (0.0301)   0.290 (0.0286)
τ̂             85.39 (3.905)    84.67 (3.792)
Obs            1085             1085
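To see how the entries of Table D.2 fit together, the lower bound mean WTP in the $\hat\tau$ row can be recovered from the potential outcome means above it — assuming the usual ABERS lower-bound construction with $b_0 = 0$, which is my reading of the table rather than a formula stated in it:
\[
\hat{\tau} = \sum_{g=1}^{5}\left(b_g - b_{g-1}\right)\hat{\mu}_g, \qquad b_0 = 0.
\]
Using the FRA column,
\[
5(0.685) + 20(0.597) + 40(0.489) + 55(0.378) + 100(0.290) \approx 84.7,
\]
which matches the reported $\hat{\tau}_{FRA} = 84.67$ up to rounding of the displayed means; the ABERS column similarly gives approximately $85.3$ against the reported $85.39$.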
Table D.3: Bias and standard deviation of RA estimators for DGP 1 across four assignment vectors
[Bias and standard deviation of µ̂1, µ̂2, µ̂3 for SM, PRA, and FRA at N = 500, 1000, 5000, under the assignment vectors ρ = (1/3, 1/3, 1/3), (2/3, 1/6, 1/6), (1/6, 2/3, 1/6), and (1/5, 2/5, 2/5).]
a Here SM refers to subsample means, PRA refers to pooled regression adjustment, and FRA is the feasible regression adjustment estimator.
b Empirical distributions generated with 1,000 Monte Carlo repetitions.
c The true population mean vector is µ = (1.4437, 1.6662, 1.8718).

Table D.4: Bias and standard deviation of RA estimators for DGP 2 across four assignment vectors
[Same layout as Table D.3.]
a Here SM refers to subsample means, PRA refers to pooled regression adjustment, and FRA is the feasible regression adjustment estimator.
b Empirical distributions generated with 1,000 Monte Carlo repetitions.
c The true population mean vector is µ = (1.4439, 1.6665, 1.8722).
Table D.5: Bias and standard deviation of RA estimators for DGP 3 across four assignment vectors
[Same layout as Table D.3.]
a Here SM refers to subsample means, PRA refers to pooled regression adjustment, and FRA is the feasible regression adjustment estimator.
b Empirical distributions generated with 1,000 Monte Carlo repetitions.
c The true population mean vector is µ = (1.1897, 3.7310, 3.7752).

APPENDIX E

PROOFS FOR CHAPTER 2

Proof of Theorem 3

Proof. Using the expression $Y = \mu_g + \dot{X}\beta_g + U(g)$, write $\bar{Y}_g$ as
\[
\bar{Y}_g = N_g^{-1}\sum_{i=1}^{N} W_{ig}Y_i(g) = \mu_g + (N/N_g)\,N^{-1}\sum_{i=1}^{N} W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right] = \mu_g + \frac{1}{\hat{\rho}_g}\,N^{-1}\sum_{i=1}^{N} W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right].
\]
Therefore,
\[
\sqrt{N}\left(\bar{Y}_g - \mu_g\right) = \hat{\rho}_g^{-1}\,N^{-1/2}\sum_{i=1}^{N} W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right]. \tag{E.1}
\]
By random assignment,
\[
E\left\{W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right]\right\} = E(W_{ig})E\left(\dot{X}_i\right)\beta_g + E(W_{ig})E\left[U_i(g)\right] = 0,
\]
and so the CLT applies to the standardized average in (E.1).
Now use $\hat{\rho}_g = \rho_g + o_p(1)$ to obtain the first-order representation
\[
\sqrt{N}\left(\bar{Y}_g - \mu_g\right) = \rho_g^{-1}\,N^{-1/2}\sum_{i=1}^{N} W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right] + o_p(1).
\]
Our goal is to be able to make efficiency statements about both linear and nonlinear functions of the vector of means $\mu = (\mu_1, \mu_2, \ldots, \mu_G)'$, and so we stack the subsample means into the $G \times 1$ vector $\bar{Y}$. For later comparison, it is helpful to remember that $\bar{Y}$ is the vector of OLS coefficients in the regression
\[
Y_i \ \text{on} \ W_{i1}, W_{i2}, \ldots, W_{iG}, \qquad i = 1, 2, \ldots, N.
\]
We have proven the result. $\blacksquare$
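The remark that $\bar{Y}$ coincides with a vector of OLS coefficients is easy to verify directly. A one-line Stata illustration of my own, assuming an assignment variable w coded 1, …, G and an outcome y:

* The subsample means are the OLS coefficients on a full set of
* assignment dummies with no constant
regress y ibn.w, noconstant
* the coefficients match the group means from:
tabstat y, by(w) statistics(mean)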
Proof of Theorem 4

Proof. Consistent estimators of $\alpha_g$ and $\beta_g$ are obtained from the regression
\[
Y_i \ \text{on} \ 1, X_i, \qquad \text{if } W_{ig} = 1,
\]
which produces intercept and slopes $\hat{\alpha}_g$ and $\hat{\beta}_g$. Letting $\breve{X}_i = (1, X_i)$, the probability limit of $(\hat{\alpha}_g, \hat{\beta}_g')'$ is
\[
\left[E\left(W_{ig}\breve{X}_i'\breve{X}_i\right)\right]^{-1}E\left(W_{ig}\breve{X}_i'Y_i(g)\right) = \rho_g^{-1}\left[E\left(\breve{X}_i'\breve{X}_i\right)\right]^{-1}\rho_g\,E\left(\breve{X}_i'Y_i(g)\right) = \begin{pmatrix}\alpha_g \\ \beta_g\end{pmatrix},
\]
where random assignment is used so that $W_{ig}$ is independent of $[X_i, Y_i(g)]$. It follows that $(\hat{\alpha}_g, \hat{\beta}_g')'$ is consistent for $(\alpha_g, \beta_g')'$, and so a consistent estimator of $\mu_g$ is
\[
\hat{\mu}_g = \hat{\alpha}_g + \bar{X}\hat{\beta}_g.
\]
Note that this estimator, which we refer to as full (or separate) regression adjustment (FRA), is the same as an imputation procedure. Given $\hat{\alpha}_g$ and $\hat{\beta}_g$, impute a value of $Y_i(g)$ for each $i$ in the sample, whether or not $i$ is assigned to group $g$:
\[
\hat{Y}_i(g) = \hat{\alpha}_g + X_i\hat{\beta}_g, \qquad i = 1, 2, \ldots, N.
\]
Averaging these imputed values across all $i$ produces $\hat{\mu}_g$. In order to derive the asymptotic variance of $\hat{\mu}_g$, it is helpful to obtain it as the intercept from the regression
\[
Y_i \ \text{on} \ 1, X_i - \bar{X}, \qquad W_{ig} = 1.
\]
Let $\ddot{X}_i = X_i - \bar{X}$, $\ddot{R}_i = (1, \ddot{X}_i)$, and define
\[
\hat{\gamma}_g = \begin{pmatrix}\hat{\mu}_g \\ \hat{\beta}_g\end{pmatrix} = \left(N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\ddot{R}_i\right)^{-1} N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'Y_i(g).
\]
Now write
\[
Y_i(g) = \mu_g + \dot{X}_i\beta_g + U_i(g) = \mu_g + \ddot{X}_i\beta_g + (\dot{X}_i - \ddot{X}_i)\beta_g + U_i(g) = \ddot{R}_i\gamma_g + (\bar{X} - \mu_X)\beta_g + U_i(g).
\]
Plugging in for $Y_i(g)$ and rearranging gives
\[
\sqrt{N}(\hat{\gamma}_g - \gamma_g) = \left(N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\ddot{R}_i\right)^{-1}\left[\left(N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta_g + N^{-1/2}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'U_i(g)\right].
\]
Because $\bar{X} \xrightarrow{p} \mu_X$, the law of large numbers and Slutsky’s Theorem imply
\[
N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\ddot{R}_i = N^{-1}\sum_{i=1}^{N} W_{ig}\dot{R}_i'\dot{R}_i + o_p(1) \xrightarrow{p} \rho_g\,E\left(\dot{R}_i'\dot{R}_i\right) = \rho_g A, \qquad A = \begin{pmatrix}1 & 0 \\ 0 & E(\dot{X}_i'\dot{X}_i)\end{pmatrix},
\]
with $\dot{R}_i = (1, \dot{X}_i)$, where $A$ is block diagonal because $E(\dot{X}_i) = 0$ and random assignment is used. The terms $\sqrt{N}(\bar{X} - \mu_X)\beta_g$ and $N^{-1/2}\sum_i W_{ig}\ddot{R}_i'U_i(g)$ are $O_p(1)$ by the CLT, and so
\[
\sqrt{N}(\hat{\gamma}_g - \gamma_g) = (1/\rho_g)A^{-1}\left[\left(N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta_g + N^{-1/2}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'U_i(g)\right] + o_p(1).
\]
The first element of $N^{-1}\sum_i W_{ig}\ddot{R}_i'$ is $N^{-1}\sum_i W_{ig} = N_g/N = \hat{\rho}_g \xrightarrow{p} \rho_g$, and the first element of $N^{-1/2}\sum_i W_{ig}\ddot{R}_i'U_i(g)$ is $N^{-1/2}\sum_i W_{ig}U_i(g)$. Because of the block diagonality of $A$, the first element of $\sqrt{N}(\hat{\gamma}_g - \gamma_g)$ satisfies
\[
\sqrt{N}(\hat{\mu}_g - \mu_g) = \sqrt{N}(\bar{X} - \mu_X)\beta_g + (1/\rho_g)N^{-1/2}\sum_{i=1}^{N} W_{ig}U_i(g) + o_p(1)
= N^{-1/2}\sum_{i=1}^{N}\left[(X_i - \mu_X)\beta_g + W_{ig}U_i(g)/\rho_g\right] + o_p(1).
\]
The above representation holds for all $g$; stacking the RA estimates then gives Theorem 4. $\blacksquare$
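The imputation description in the proof above is straightforward to mirror in code. A small Stata sketch — an illustration with assumed variable names y, x and a group variable w taking values 1, …, G — fits each group's regression, imputes ŷi(g) for every unit, and averages:

* FRA as imputation: fit each group's regression, impute for all i, average
levelsof w, local(groups)
foreach g of local groups {
    quietly regress y x if w == `g'
    quietly predict double yhat`g', xb     // imputed outcome for every unit
    quietly summarize yhat`g', meanonly
    display "muhat(`g') = " r(mean)
}

Each displayed average equals the intercept $\hat{\alpha}_g + \bar{X}\hat{\beta}_g$ from the demeaned within-group regression used in the derivation.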
Proof of Theorem 5

Proof. In order to find a useful first-order representation of $\sqrt{N}(\check{\mu} - \mu)$, we first characterize the probability limit of $\check{\beta}$. Under random assignment,
\[
E\left(W'\dot{X}\right) = E(W)'E\left(\dot{X}\right) = 0,
\]
which means that the coefficients on $W$ in the linear projections $L(Y \mid W)$ and $L(Y \mid W, \dot{X})$ are the same and equal to $\mu$. This essentially proves that adding the demeaned covariates still consistently estimates $\mu$. Moreover, we can find the coefficients on $\dot{X}$ in $L(Y \mid W, \dot{X})$ by finding $L(Y \mid \dot{X})$. Let $\beta$ be the coefficient vector in the linear projection of $Y$ on $\dot{X}$:
\[
\beta = \left[E\left(\dot{X}'\dot{X}\right)\right]^{-1}E\left(\dot{X}'Y\right) = \Omega_X^{-1}E\left(\dot{X}'Y\right).
\]
Now use
\[
Y = \sum_{g=1}^{G} W_g\left[\mu_g + \dot{X}\beta_g + U(g)\right],
\]
so that
\[
E\left(\dot{X}'Y\right) = \sum_{g=1}^{G}\left\{E\left(\dot{X}'W_g\right)\mu_g + E\left(\dot{X}'W_g\dot{X}\right)\beta_g + E\left[\dot{X}'W_gU(g)\right]\right\} = \sum_{g=1}^{G}\left\{0 + \rho_g\Omega_X\beta_g + 0\right\} = \Omega_X\sum_{g=1}^{G}\rho_g\beta_g,
\]
where we again use random assignment, $E(\dot{X}) = 0$, and $E[\dot{X}'U(g)] = 0$. Therefore, the $\beta$ in the linear projection $L(Y \mid W, \dot{X})$ is
\[
\beta = \sum_{g=1}^{G}\rho_g\beta_g,
\]
simply a weighted average of the coefficients from the separate linear projections using the potential outcomes. Now we can write
\[
Y_i = W_i\mu + \dot{X}_i\beta + U_i,
\]
where the linear projection error is
\[
U_i = \sum_{g=1}^{G} W_{ig}\left[\mu_g + \dot{X}_i\beta_g + U_i(g)\right] - W_i\mu - \dot{X}_i\sum_{g=1}^{G}\rho_g\beta_g = \sum_{g=1}^{G}(W_{ig} - \rho_g)\dot{X}_i\beta_g + \sum_{g=1}^{G} W_{ig}U_i(g).
\]
Write $\theta = (\mu', \beta')'$ and let $\check{\theta} = (\check{\mu}', \check{\beta}')'$ be the OLS estimator from the regression of $Y_i$ on $\ddot{R}_i = (W_i, \ddot{X}_i)$; also define $\dot{R}_i = (W_i, \dot{X}_i)$. The asymptotic variance of $\sqrt{N}(\check{\mu} - \mu)$ is not the same as when $\ddot{X}_i$ is replaced with $\dot{X}_i$ (even though for $\check{\beta}$ it is). Write
\[
Y_i = W_i\mu + \ddot{X}_i\beta + (\dot{X}_i - \ddot{X}_i)\beta + U_i = \ddot{R}_i\theta + (\bar{X} - \mu_X)\beta + U_i.
\]
Then
\[
\sqrt{N}(\check{\theta} - \theta) = \left(N^{-1}\sum_{i=1}^{N}\ddot{R}_i'\ddot{R}_i\right)^{-1}\left[\left(N^{-1}\sum_{i=1}^{N}\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta + N^{-1/2}\sum_{i=1}^{N}\ddot{R}_i'U_i\right],
\]
where, by random assignment and $E(\dot{X}_i) = 0$,
\[
N^{-1}\sum_{i=1}^{N}\ddot{R}_i'\ddot{R}_i \xrightarrow{p} \begin{pmatrix} E(W_i'W_i) & 0 \\ 0 & \Omega_X \end{pmatrix},
\]
and $N^{-1}\sum_i \ddot{X}_i' = 0$ by construction, so that
\[
N^{-1}\sum_{i=1}^{N}\ddot{R}_i' = \begin{pmatrix} N^{-1}\sum_i W_i' \\ 0 \end{pmatrix}, \qquad N^{-1}\sum_{i=1}^{N} W_i' \xrightarrow{p} (\rho_1, \rho_2, \ldots, \rho_G)'.
\]
The terms in brackets are $O_p(1)$. Looking at the first $G$ elements of $\sqrt{N}(\check{\theta} - \theta)$, and using $E(W_i'W_i) = \mathrm{diag}(\rho_1, \rho_2, \ldots, \rho_G)$, we obtain
\[
\sqrt{N}(\check{\mu} - \mu) = j_G\,N^{-1/2}\sum_{i=1}^{N}\dot{X}_i\beta + \left[E\left(W_i'W_i\right)\right]^{-1} N^{-1/2}\sum_{i=1}^{N} W_i'U_i + o_p(1),
\]
where $j_G = (1, 1, \ldots, 1)'$. The $g$th element of $W_i'U_i$ is
\[
W_{ig}U_i = \sum_{h=1}^{G} W_{ig}(W_{ih} - \rho_h)\dot{X}_i\beta_h + W_{ig}U_i(g),
\]
using $W_{ig}W_{ih} = 0$ for $h \neq g$. Simplifying the first piece, for example for $g = 1$,
\[
\sum_{h=1}^{G} W_{i1}(W_{ih} - \rho_h)\dot{X}_i\beta_h/\rho_1 = \rho_1^{-1}\left[W_{i1}(1-\rho_1)\dot{X}_i\beta_1 - W_{i1}\rho_2\dot{X}_i\beta_2 - \cdots - W_{i1}\rho_G\dot{X}_i\beta_G\right] = \rho_1^{-1}W_{i1}\dot{X}_i\left(\beta_1 - \beta\right),
\]
using $\beta = \rho_1\beta_1 + \rho_2\beta_2 + \cdots + \rho_G\beta_G$. Adding $\dot{X}_i\beta$ and rearranging, the $g$th element of the influence function is
\[
\dot{X}_i\beta + \rho_g^{-1}W_{ig}\dot{X}_i(\beta_g - \beta) + W_{ig}U_i(g)/\rho_g,
\]
which gives the representation in the theorem. $\blacksquare$
Proof of Theorem 6

Proof. We now show that, asymptotically, $\hat{\mu}_{FRA}$ is never worse than $\hat{\mu}_{SM}$. From the representations above, the influence function of $\hat{\mu}_{SM}$ is $L_i + Q_i$ and that of $\hat{\mu}_{FRA}$ is $K_i + Q_i$, where
\[
K_i = \begin{pmatrix}\dot{X}_i\beta_1 \\ \vdots \\ \dot{X}_i\beta_G\end{pmatrix}, \qquad L_i = \begin{pmatrix}W_{i1}\dot{X}_i\beta_1/\rho_1 \\ \vdots \\ W_{iG}\dot{X}_i\beta_G/\rho_G\end{pmatrix}, \qquad Q_i = \begin{pmatrix}W_{i1}U_i(1)/\rho_1 \\ \vdots \\ W_{iG}U_i(G)/\rho_G\end{pmatrix}.
\]
Because $E(L_iQ_i') = 0$ and $E(K_iQ_i') = 0$, it follows that
\[
\mathrm{Avar}\left[\sqrt{N}(\hat{\mu}_{SM} - \mu)\right] = \Omega_L + \Omega_Q, \qquad \mathrm{Avar}\left[\sqrt{N}(\hat{\mu}_{FRA} - \mu)\right] = \Omega_K + \Omega_Q,
\]
where $\Omega_L = E(L_iL_i')$ and so on. Therefore, to show that $\mathrm{Avar}[\sqrt{N}(\hat{\mu}_{FRA} - \mu)]$ is smaller (in the matrix sense), we must show that $\Omega_L - \Omega_K$ is PSD. The elements of $L_i$ are uncorrelated because $W_{ig}W_{ih} = 0$ for $g \neq h$, and the variance of the $g$th element is
\[
E\left[\left(W_{ig}\dot{X}_i\beta_g/\rho_g\right)^2\right] = \rho_g^{-2}E(W_{ig})\,E\left[\left(\dot{X}_i\beta_g\right)^2\right] = \rho_g^{-1}\beta_g'\Omega_X\beta_g.
\]
Therefore,
\[
\Omega_L = E(L_iL_i') = \mathrm{diag}\left(\beta_1'\Omega_X\beta_1/\rho_1, \ldots, \beta_G'\Omega_X\beta_G/\rho_G\right) = B'\left[\mathrm{diag}\left(\rho_1^{-1}, \ldots, \rho_G^{-1}\right) \otimes \Omega_X\right]B,
\]
where $B$ is the block-diagonal matrix with $\beta_1, \ldots, \beta_G$ on its diagonal blocks. For the variance matrix of $K_i$, $\mathrm{Var}(\dot{X}_i\beta_g) = \beta_g'\Omega_X\beta_g$ and $\mathrm{Cov}(\dot{X}_i\beta_g, \dot{X}_i\beta_h) = \beta_g'\Omega_X\beta_h$, and therefore
\[
\Omega_K = E(K_iK_i') = B'\left[\left(j_Gj_G'\right) \otimes \Omega_X\right]B,
\]
with $j_G' = (1, 1, \ldots, 1)$. The comparison we need to make is thus between $\mathrm{diag}(\rho_g^{-1}) \otimes \Omega_X$ and $(j_Gj_G') \otimes \Omega_X$; that is, we need to show that
\[
\left[\mathrm{diag}\left(\rho_1^{-1}, \ldots, \rho_G^{-1}\right) - j_Gj_G'\right] \otimes \Omega_X
\]
is PSD. The Kronecker product of two PSD matrices is also PSD, so it suffices to show that $\mathrm{diag}(\rho_g^{-1}) - j_Gj_G'$ is PSD when the $\rho_g$ add to unity. Let $a$ be any $G \times 1$ vector; we have to show
\[
\sum_{g=1}^{G} a_g^2/\rho_g \geq \left(\sum_{g=1}^{G} a_g\right)^2.
\]
Define vectors $b = \left(a_1/\sqrt{\rho_1},\, a_2/\sqrt{\rho_2},\, \ldots,\, a_G/\sqrt{\rho_G}\right)'$ and $c = \left(\sqrt{\rho_1},\, \sqrt{\rho_2},\, \ldots,\, \sqrt{\rho_G}\right)'$ and apply the Cauchy–Schwarz inequality:
\[
\left(\sum_{g=1}^{G} a_g\right)^2 = (b'c)^2 \leq (b'b)(c'c) = \left(\sum_{g=1}^{G} a_g^2/\rho_g\right)\left(\sum_{g=1}^{G}\rho_g\right) = \sum_{g=1}^{G} a_g^2/\rho_g,
\]
because $\sum_{g=1}^{G}\rho_g = 1$. $\blacksquare$

Proof of Theorem 7

Proof. By random assignment and the linear projection property, $E(F_iK_i') = E(F_iQ_i') = E(K_iQ_i') = 0$. Hence, $F_i$, $K_i$, and $Q_i$ are pairwise uncorrelated. $\blacksquare$

APPENDIX F

AUXILIARY RESULTS FOR CHAPTER 3

F.1 Stratified (or block) experiment with missing outcome

Consider a stratified experiment where the population is partitioned into $J$ strata, or blocks, based on the covariates $x \in \mathcal{X} \subset \mathbb{R}^J$, given by $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_J$, such that these sets are mutually exclusive and exhaustive. Then draw a sample of size $N_j$ from stratum $j$, $j = 1, 2, \ldots, J$, with $N = \sum_{j=1}^{J} N_j$. Let $w_{ijg}$ be a binary indicator for treatment level $g = 0, 1$ for unit $i$ in stratum $j$. Then, by construction, the probability of unit $i$ getting treated in stratum $j$ is a function of the covariates that have been used to define the strata.
In other words,
\[
P\left(w_{ijg}=1 \mid y_{ij}(0), y_{ij}(1), x_{ij}\right) = P\left(w_{ijg}=1 \mid x_{ij}\right) \equiv p_g(x_{ij}) \qquad \text{for } j = 1, 2, \ldots, J;\ g = 0, 1.
\]
Hence, in a stratified experiment the treatment assignment satisfies unconfoundedness by design, where this probability is constant for all units in a particular stratum $j$ but varies across the different strata. Let $s_{ij}$ be a missing data indicator for unit $i$ belonging to stratum $j$, such that
\[
s_{ij} = \begin{cases} 1 & y_{ij} \text{ is observed} \\ 0 & y_{ij} \text{ is missing.} \end{cases}
\]
Then one can characterize a stratified sample from stratum $j$ as $\{(y_{ij}, x_{ij}, w_{ijg}, s_{ij}):\ i = 1, \ldots, N_j\}$. Now suppose that the missing outcomes are ignorable, i.e.,
\[
P\left(s_{ij}=1 \mid y_{ij}(0), y_{ij}(1), x_{ij}, w_{ijg}\right) = P\left(s_{ij}=1 \mid x_{ij}, w_{ijg}\right) \equiv r(x_{ij}, w_{ijg}),
\]
which implies that missingness is sufficiently well predicted by the covariates and the treatment indicator. Given this setup, one can use the doubly weighted estimator to consistently estimate $\theta_g^0$ as follows:
\[
\hat{\theta}_1 = \operatorname*{argmin}_{\theta_1 \in \Theta_1}\ \sum_{j=1}^{J}\sum_{i=1}^{N_j} \frac{s_{ij}\cdot w_{ij1}}{r(x_{ij}, w_{ij1})\cdot p_1(x_{ij})}\; q\left(y_{ij}(1), x_{ij}, \theta_1\right)
\]
and
\[
\hat{\theta}_0 = \operatorname*{argmin}_{\theta_0 \in \Theta_0}\ \sum_{j=1}^{J}\sum_{i=1}^{N_j} \frac{s_{ij}\cdot w_{ij0}}{r(x_{ij}, w_{ij0})\cdot p_0(x_{ij})}\; q\left(y_{ij}(0), x_{ij}, \theta_0\right),
\]
where $r(x_{ij}, w_{ijg})$ and $p_g(x_{ij})$ can be replaced by consistent estimators without changing the result. Note that even though the assignment probabilities are typically known in a stratified experiment, it can be asymptotically more efficient to estimate them using binary response MLE.
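For the leading case where $q(\cdot)$ is a least squares objective, the two minimization problems above are just weighted regressions on the complete cases. A minimal Stata sketch — my own illustration, with assumed variable names (y, w, s, x1, x2) and logits for the two probability models, as suggested above — for the treated-arm problem:

* Doubly weighted least squares for the treated arm (g = 1)
logit w x1 x2                      // assignment probability p1(x)
predict double phat, pr
logit s i.w x1 x2                  // missingness probability r(x, w)
predict double rhat, pr
regress y x1 x2 [pweight = 1/(rhat*phat)] if w == 1 & s == 1
* the control-arm problem uses 1/(rhat*(1-phat)) and w == 0 instead

On the complete treated cases the weight equals $s_i w_i / [\hat{r}(x_i, w_i)\,\hat{p}_1(x_i)]$, which is exactly the weighting in the display above.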
F.2 Consistent variance estimation

In order to construct asymptotic confidence intervals and obtain valid inference with the doubly weighted estimator, it is important to find a consistent estimator of its asymptotic variance. For smooth objective functions — OLS, NLS, MLE — this task is simple: one replaces the population Hessian and Jacobian by their sample counterparts, substituting sample averages for population expectations. For non-smooth objective functions, however, obtaining a consistent variance estimator is not straightforward, since the first- or second-order derivatives of the objective function may not exist. In such situations, numerical derivatives of the objective function can be used to approximate the true derivatives. Following Newey and McFadden (1994), let $e_j$ denote the $j$th unit vector and $\varepsilon_N$ a small positive constant that depends on the sample size. For the doubly weighted estimator $\hat{\theta}_g$ that solves the treatment problem, the asymptotic variance expression is $H_g^{-1}\Omega_g H_g^{-1}$, where $H_g$ can be estimated using a second-order numerical derivative of the objective function, $\hat{H}_g$, whose $(j, k)$th element is
\[
\hat{H}_{g,jk} = \frac{Q_N(\hat{\theta}_g + e_j\varepsilon_N + e_k\varepsilon_N) - Q_N(\hat{\theta}_g - e_j\varepsilon_N + e_k\varepsilon_N) - Q_N(\hat{\theta}_g + e_j\varepsilon_N - e_k\varepsilon_N) + Q_N(\hat{\theta}_g - e_j\varepsilon_N - e_k\varepsilon_N)}{4\varepsilon_N^2}.
\]
For the middle term of the asymptotic variance expression, $\Omega_g = E(u_{ig}u_{ig}')$, we can use the approximation $\hat{\Omega}_g = N^{-1}\sum_{i=1}^{N}\hat{u}_{ig}\hat{u}_{ig}'$, where $u_{ig}$ exists with probability one. Hence, $\hat{H}_g^{-1}\hat{\Omega}_g\hat{H}_g^{-1}$ will be consistent under the conditions of the following theorem.

Theorem F.2.1. (Consistency of asymptotic variance) Suppose that $\varepsilon_N \to 0$ and $\varepsilon_N\sqrt{N} \to \infty$; then, under the conditions of Theorem 3.4.2, $\hat{H}_g \xrightarrow{p} H_g$ and $\hat{\Omega}_g \xrightarrow{p} \Omega_g$.

The proof of this theorem is given in appendix J and follows from Theorem 7.4 in Newey and McFadden (1994). Table ?? characterizes the cases in which the weighted and unweighted estimators will be consistent for the true parameter $\theta_g^0$; Table I.4 describes situations in which the unweighted estimator is more efficient than the weighted estimator.

F.3 Asymptotic variance for ATE

Given $\sqrt{N}$-consistent and asymptotically normal estimators $\hat{\theta}_1$ and $\hat{\theta}_0$, the estimated average treatment effect
\[
\hat{\tau}_{ate} = \frac{1}{N}\sum_{i=1}^{N} m_1(x_i, \hat{\theta}_1) - \frac{1}{N}\sum_{i=1}^{N} m_0(x_i, \hat{\theta}_0)
\]
is easily shown to also be $\sqrt{N}$-consistent and asymptotically normal (Wooldridge (2010), chapter 21). Regularity conditions for such a result require that the parametric model $m_g(x, \theta_g)$ be continuously differentiable on the parameter space $\Theta_g \subset \mathbb{R}^{P_g}$ and that $\theta_g^0$ be in the interior of $\Theta_g$. Then, by the continuous mapping theorem and Slutsky’s theorem,
\[
\sqrt{N}\left(\hat{\tau}_{ate} - \tau_{ate}\right) \xrightarrow{d} N(0, \mathrm{V}),
\]
where $\mathrm{V} = E[\psi(x_i)\psi(x_i)']$. Denoting $E[\nabla_{\theta_g} m_g(x_i, \theta_g^0)] \equiv G_g^0$,
\[
\psi(x_i) = \left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\} - G_1^0 H_1^{-1}u_{i1} + G_0^0 H_0^{-1}u_{i0},
\]
where $H_g$ is the Hessian for treatment group $g$ and $u_{ig}$ is the residual from the regression of the weighted score on the scores of the two probability models. For the case when the conditional mean model is correctly specified, the variance expression simplifies to
\[
\mathrm{V} = E\left[\left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\}^2\right] + G_1^0\,\mathrm{V}_1\,G_1^{0\prime} + G_0^0\,\mathrm{V}_0\,G_0^{0\prime}. \tag{F.1}
\]
Here $\mathrm{V}_1$ and $\mathrm{V}_0$ are the asymptotic variances of the doubly weighted estimators that solve the treatment and control group problems, respectively. The formula makes clear that it is better to use more efficient estimators of $\hat{\theta}_g$. But we know from the results in section 3.5 that when the conditional mean model is correctly specified, using estimated weights is as efficient as using known weights. Another alternative in this case is to use unweighted estimators of $\theta_g^0$, since under GCIME unweighted estimators can be potentially more efficient than the doubly weighted estimators of $\theta_g^0$. For the case when the mean model is misspecified, the asymptotic variance of the ATE is given as
\[
\mathrm{V} = E\left[\left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\}^2\right] + G_1^0\,\mathrm{V}_1\,G_1^{0\prime} + G_0^0\,\mathrm{V}_0\,G_0^{0\prime}
\]
\[
\quad - 2E\left[\left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\}u_{i1}'\right]H_1^{-1}G_1^{0\prime} + 2E\left[\left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\}u_{i0}'\right]H_0^{-1}G_0^{0\prime}. \tag{F.2}
\]
In this case the variance expression is more complicated than in the previous one. Even though it is still better to have more efficient estimators of $\theta_g^0$, it is not obvious that this helps obtain a smaller variance for the ATE, since cross-correlation terms now appear in the variance expression. The proof of the asymptotic variances is provided in appendix …

F.4 Practical advice for obtaining double-weighted ATE estimates

An easy way to obtain the doubly weighted estimates $\hat{\theta}_g$ for estimating the ATE is to combine the treatment and control group problems into a one-step GMM procedure. Essentially, this means stacking the moment conditions from the first and second steps, which can then be solved jointly via GMM. Since there are no over-identifying restrictions in the double weighted framework, one-step estimation of $\theta_g^0$ is equivalent to two-step estimation (Negi (2019)). For ease of notation, let $w_{i1} = w_i$ and $w_{i0} = (1 - w_i)$.
Then consider the following set of moment conditions:
\[
\bar{m}(\theta_0, \theta_1, \gamma, \delta) = \frac{1}{N}\sum_{i=1}^{N} m_i(\theta_0, \theta_1, \gamma, \delta) = N^{-1}\begin{pmatrix}
(N/N_0)\sum_{i=1}^{N} m_{i0}(\theta_0, \gamma, \delta) \\
(N/N_1)\sum_{i=1}^{N} m_{i1}(\theta_1, \gamma, \delta) \\
\sum_{i=1}^{N} m_{i2}(\gamma) \\
\sum_{i=1}^{N} m_{i3}(\delta)
\end{pmatrix},
\]
where
\[
m_{i0}(\theta_0, \gamma, \delta) = \frac{s_i(1 - w_i)}{R(x_i, w_i, \delta)\left[1 - G(x_i, \gamma)\right]}\;\nabla_{\theta_0}\, q\left(y_i(0), x_i, \theta_0\right)',
\]
\[
m_{i1}(\theta_1, \gamma, \delta) = \frac{s_i w_i}{R(x_i, w_i, \delta)\,G(x_i, \gamma)}\;\nabla_{\theta_1}\, q\left(y_i(1), x_i, \theta_1\right)',
\]
\[
m_{i2}(\gamma) = \nabla_\gamma G(x_i, \gamma)'\;\frac{w_i - G(x_i, \gamma)}{G(x_i, \gamma)\left[1 - G(x_i, \gamma)\right]},
\]
\[
m_{i3}(\delta) = \nabla_\delta R(x_i, w_i, \delta)'\;\frac{s_i - R(x_i, w_i, \delta)}{R(x_i, w_i, \delta)\left[1 - R(x_i, w_i, \delta)\right]}.
\]
The example code below uses Stata’s gmm command to estimate two weighted linear regressions for estimating the ATE.

Example code using Stata’s gmm

local Rhat = "exp(b31+b32*w+b33*x1+b34*x2)/(1+exp(b31+b32*w+b33*x1+b34*x2))"
local Ghat = "exp(b21+b22*x1+b23*x2)/(1+exp(b21+b22*x1+b23*x2))"
gmm ((-2*s*(1-w)/(`Rhat'*(1-`Ghat')))*(y-b00-b01*x1-b02*x2)*(n/nc)) ///
    ((-2*s*w/(`Rhat'*`Ghat'))*(y-b10-b11*x1-b12*x2)*(n/nt)) ///
    (w-exp(b21+b22*x1+b23*x2)/(1+exp(b21+b22*x1+b23*x2))) ///
    (s-exp(b31+b32*w+b33*x1+b34*x2)/(1+exp(b31+b32*w+b33*x1+b34*x2))), ///
    instruments(1 2 3: x1 x2) instruments(4: w x1 x2) winitial(identity) ///
    nocommonesample onestep from(b00 0.1 b01 0.1 b02 0.1 b10 0.1 b11 0.1 b12 ///
    0.1 b21 0.1 b22 0.1 b23 0.1 b31 0.1 b32 0.1 b33 0.1 b34 0.1)

Then, using the GMM estimates, one can estimate the average treatment effect as

gen y0hat = _b[b00:_cons] + _b[b01:_cons]*x1 + _b[b02:_cons]*x2
gen y1hat = _b[b10:_cons] + _b[b11:_cons]*x1 + _b[b12:_cons]*x2
egen ate = mean(y1hat - y0hat)

Since I estimate the two probability models as logits (as is the convention in applied work), the third and fourth moments simplify to
\[
m_{i2}(\gamma) = x_i'\left(w_i - \Lambda(x_i\gamma)\right), \qquad m_{i3}(\delta) = z_i'\left(s_i - \Lambda(z_i\delta)\right).
\]
Even though this one-step estimation allows us to obtain variance estimates $\hat{\mathrm{V}}_1$ and $\hat{\mathrm{V}}_0$ for $\hat{\theta}_1$ and $\hat{\theta}_0$ respectively, obtaining analytically correct standard errors for the estimated ATE requires additional work. A command that implements the correct standard errors is still in the works. Meanwhile, one can use bootstrapped standard errors, which provide asymptotically correct inference.
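For instance, the whole two-step procedure can be wrapped in a program and passed to Stata's bootstrap prefix. The sketch below is my own illustration of that suggestion — dwate is a hypothetical wrapper, variable names follow the gmm example above, and for brevity it computes the doubly weighted estimates by weighted least squares rather than by gmm:

capture program drop dwate
program define dwate, rclass
    quietly {
        logit w x1 x2
        predict double Gb, pr
        logit s i.w x1 x2
        predict double Rb, pr
        regress y x1 x2 [pweight = 1/(Rb*Gb)] if w == 1 & s == 1
        predict double y1b, xb
        regress y x1 x2 [pweight = 1/(Rb*(1-Gb))] if w == 0 & s == 1
        predict double y0b, xb
        tempvar te
        gen double `te' = y1b - y0b
        summarize `te', meanonly
        return scalar ate = r(mean)
        drop Gb Rb y1b y0b
    }
end
bootstrap ate = r(ate), reps(999) seed(1234): dwate

Because the probability models are re-estimated inside dwate, each bootstrap replication correctly reflects the first-stage estimation error in the weights.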
Using the weak law of large numbers, one can also establish that √ N ( ˆAQEτ − AQEτ ) d→ N (0, V) where V =E E (cid:34)(cid:26)(cid:16) (cid:104)∇θ1 (cid:17) − AQEτ (cid:27)2(cid:35) (cid:104)∇θ1 (cid:104)∇θ0 (cid:105) · V0 · E + E quant1,τ (xi, θ0 1) (cid:105) · V1· (cid:105)(cid:48) quant1,τ (xi, θ0 1) − quant0,τ (xi, θ0 (cid:105)(cid:48) 0) (cid:104)∇θ1 + E quant1,τ (xi, θ0 1) quant0,τ (xi, θ0 0) quant0,τ (xi, θ0 0) and V1 and V0 are the asymptotic variances of the doubly weighted estimator that solves the QR or QMLE problem for the treatment and control groups respectively. The derivation of the two asymptotic variances is provided appendix H. For average quantile effect, the derivation proceeds in a similar manner to the case of ATE. Since the above results hinge on 161 correct quantile specification, one may use the usual robust asymptotic variance form for V1 and V0. However, one might be able to obtain a smaller finite sample variance from using estimated weights even though weighting would not have any bite in establishing consistency here. As discussed in the examples section, when the conditional quantile model is misspecified, g can still be interpreted as a weighted linear approximation parameter to the true τ-CQF θ0 of y(g). Since linear projections can be used as linear operators, the difference in the two linear projections of the two potential outcomes will give us a linear projection to the true CQTE. Formally, LP [CQTEτ ] = LP[quant1,τ (xi, θ1)] − LP[quant0,τ (xi, θ0)] Therefore, one can use θ0 g in the case of a misspecified CQF to define a linear projection to the true CQTE. 162 APPENDIX G APPLICATION APPENDIX FOR CHAPTER 3 G.1 National Supported Work Program The NSW was a transitional and subsidized work experience program that was mainly intended to target four sub-populations; ex-offenders, former drug addicts, women on AFDC welfare and high school dropouts.1 The program became operational in 1975 and continued until 1979 at fifteen locations in the United States. In ten of these sites, the program operated as a randomized experiment where individuals who qualified for the training program were randomly assigned to either the treatment or control group.2 At the time of enrollment in April 1975, individuals were given a retrospective baseline survey which was then followed by four follow-up interviews conducted at nine month intervals each. The survey data was collected using these baseline and follow-up interviews over a period of four years. The data included measurement on baseline covariates like age, years of education, number of children in 1975, high school dropout status, marital status, two race indicators for black and Hispanic sub-populations and other demographic and socio-economic information. The main outcome of interest was real earnings for the post-training year of 1979. G.2 Augmenting the Calónico and Smith (2017) sample to account for missing earnings in 1979 I obtain the data from Calónico and Smith (2017)’s supplementary data files in the Journal of Labor Economics where the authors recreate the experimental sample on AFDC 1The AFDC program is administered and funded by the federal and state governments and is meant to provide financial assistance to needy families. Source: US Census Bureau. Beyond the main eligibility criteria that was applied to all four target populations, the AFDC group was subjected to two additional criteria which were, a) no child below 6 years of age and b) on AFDC welfare for at least 30 of the last 36 months. 
2Out of the 10 sites, 7 served AFDC women with random assignment at one or more of these sites in operation from Feb 1976-Aug 1977 (Calónico and Smith (2017)). 163 women using the raw public use data files maintained by the Inter-University Consortium for Political and Social Research (ICPSR). Then, I use the PSIDcross file provided by CS along with other supplementary data files to add back the individuals whom CS originally dropped from the analysis for not having valid earnings information between 1975-1979. For this, I apply the same filters applied by CS who use them to match their PSID samples to the ones used by LaLonde (1986). These filters involve keeping all female household heads continuously from 1975-1979 who were between 20 and 55 years of age in 1975 and were not retired in 1975.3 This constitutes the first non-experimental sample that CS use in their analysis, which they call the PSID-1 sample. The second PSID sample, which they label PSID-2 further restricts the PSID-1 sample to include only those women who received AFDC welfare in 1975.4 In order to compare my sample with the original sample used by CS, I first apply all the above mentioned filters and create a dummy variable which I call “cs”. Next, I remove the filter which requires the women to be continuous household heads and instead only impose that filter for 1975 and 1976. The reason this filter is imposed for both years 1975 and 1976 but not for any other years is because in the PSID datasets, the income information in a particular year corresponds to the previous calendar year. This ensures that merging the cross-file with the separate single-year files for 1975 and 1976 guarantee that only those women are included who do not have any missing earnings information for the pre-training year of 1974 and 1975. This is important since pre-training earnings are treated as any other baseline covariate in this paper, on which I do not allow any missing information. After merging cross year individual file with the single year family files, I then merge this PSID dataset with the NSW dataset using CS’s .do files and generate the various sample 3For the additional filters that CS impose, see the Calónico and Smith (2017) supplemen- tary material provided in JLE. 4Even though the two PSID comparison groups are not perfectly representative of women who would have proven eligible for NSW, there is no clear alternative since the PSID data lacks detailed covariate information that would be needed to impose the full eligibility criteria on the PSID sample. 164 dummies essentially in the same manner as they do. After this, I further restrict the sample to include only those women who have valid earnings information in 1975, which is the pre- training year for AFDC women. I also drop the cases where the measured age or education is less than zero. In order to make sure that any observations not used by CS only correspond to the ones that have missing post-program earnings, I also drop observations that do not satisfy the CS criteria but have observed earnings in 1979. G.3 Treatment and missing outcome probability specifications and sample trimming In this application, I estimate three sets of treatment assignment and missing outcomes probability models depending upon which comparison group is used for obtaining the esti- mates. For the experimental estimates, I use the experimental treatment and control groups to estimate the propensity score model. 
For the PSID-1 estimates, I consider the NSW ex- perimental observations to be the treatment group and use PSID-1 as the control group. For estimating the PSID-2 propensity score model, I switch to PSID-2 as being the comparison control group. For estimating the missing outcome probability models, I include the treat- ment indicator depending upon the comparison group as mentioned above. The probability models are estimated as logits and include the following covariates in their specification. For the treatment probability, I include the real earnings in 1974 and 1975 along with an indicator variable for whether the individual had any zero earnings in 1974 and 1975. Be- yond these, I also include Age, Age-squared, Education, High school dropout status, the race indicators of black and Hispanic along as well as the number of children in 1975. CS also add some interaction terms in their propensity score specification which I do not. I noticed that allowing for those terms in my specifications drove the final weights for many women in the sample too close to a 0 or 1. For the missing outcomes probability, I include the treatment indicator along with the same covariates. I kept the specifications to be the same for the three sets of probabilities I estimated. However, my regression specifications 165 include the same covariates as CS to allow for some comparison across the analyses. These comparisons should be made with some caution. Except the estimates that use the NSW control group, all other estimates are obtained using samples that are different than the CS samples. The final sample used to obtain estimates for the PSID-1 comparison group is trimmed in order to ensure common support for the weights in the treatment and comparison groups. For the PSID-1 group, this meant dropping observations with final weight either less than 0.03 or greater than 0.8. For the PSID-2 sample, this meant dropping observations with final weight that was either less than 0.1 or greater than 0.86. These final weights are the weights that are specified in the regression commands in STATA and are constructed as follows: weight = (w/Ghat+(1-w)/(1-Ghat))*(s/Rhat) The trimming threshold for PS-weighted estimates is kept the same as for computing the double weighted estimates since the overlap problem was relatively more severe when using the composite weights than when using propensity scores only. The graphs below plot the kernel density for the probabilities Rhat*Ghat for the treatment group and Rhat*(1-Ghat) for the control group. The common support problem due to which the samples were appro- priately trimmed can be seen in the graphs below. Additionally, figures G.2 and G.3 plot the estimated distributions for the propensity score and missing outcomes probability, where panel (a)-(c) display these for the three treatment and comparison group combinations. A couple of points emerge from the estimated graphs. For figure G.2, panel (a), we see that the treatment and control distributions appear very similar, confirming the strong role of randomization in producing groups that are balanced in terms of covariates. For panel (b), we see that the experimental observations have a relatively high probability of being treated whereas the control group have low probabilities. Note, however, that the common support condition holds quite strongly for the PSID-1 group. 
In panel (c), while the estimated distribution for the treated units still has a higher mean, the PSID-2 comparison group distribution is relatively similar than PSID-1 in panel (b). 166 Figure G.1: Kernel density plots for the composite probability a) Experimental treatment and control b) Experimental treatment and PSID-1 c) Experimental treatment and PSID-2 Notes: The weights here correspond to the product of the estimated assignment and missing outcomes prob- abilities. Following Calónico and Smith (2017), I exploit the efficiency gain from combining the experimental treatment and control groups for estimating the treatment and missing outcome probability models. For the PSID-1 group, this means using the full experimental group to be the treatment group and the PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with the PSID-2 as the control group. These findings suggest that nonrandom assignment is predicted well by the covariates in the propensity score distributions. The same cannot be said for the estimated missing outcomes probabilities where panel (b) and (c) reveal a strong overlap problem. Moreover, we see that the treated units are less likely to be missing outcomes compared to the comparison groups. 167 Figure G.2: Kernel density plots for the estimated propensity score a) Experimental treatment and control b) Experimental treatment and PSID-1 c) Experimental treatment and PSID-2 Notes: Following Calónico and Smith (2017), I exploit the efficiency gains from combining the experimental treatment and control groups for estimating the propensity scores. For the PSID-1 group, this means using the full experimental group to be the treatment group and the PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with the PSID-2 as the control group. 168 Figure G.3: Kernel density plots for the estimated missing outcomes probability a) Experimental treatment and control b) Experimental treatment and PSID-1 c) Experimental treatment and PSID-2 Notes: Following Calónico and Smith (2017), I exploit the efficiency gains from combining the experimental treatment and control groups for estimating the missing outcome probability. For the PSID-1 group, this means using the full experimental group to be the treatment group and the PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with the PSID-2 as the control group. 169 APPENDIX H FIGURES FOR CHAPTER 3 Figure H.1: Relative estimated bias in UQTE estimates at different quantiles of the 1979 earnings distribution a) PSID-1 control group b) PSID-2 control group Notes: This graph plots the bias in the unweighted, PS-weighted and doubly-weighted UQTE estimates relative to the true experimental estimates across different quantiles of the 1979 earnings distribution. Panel (a) plots the relative bias estimates using the PSID-1 comparison group and Panel (b) plots the same using the PSID-2 comparison group. The treatment and missing outcome propensity score models have been estimated as flexible logits and the samples used for constructing these estimates have been trimmed to ensure common support across the two groups. The treatment propensity score has been estimated using the full experimental sample along with either PSID-1 or PSID-2 comparison group. 
The UQTE estimates for τ < 0.46 are omitted from the graph since these are zero. 170 Figure H.2: Empirical distribution of estimated ATE for N=5000 Case 1: When conditional mean model is correct a) Both probability models are correct b) Misspecified propensity score model c) Misspecified missingness model d) Both probability models are misspecified Notes: The empirical distribution is obtained from 1000 simulation draws of sample size 5000. However, the effective sample sizes are much smaller. Since the average propensity of treatment is a 0.41 and average propensity of being observed as 0.38, the average treated sample is N1 = 5000 × 0.41 × 0.38 = 779 and average control sample is N0 = 5000× (1− 0.41)× 0.38 = 1, 121. The true ATE = 0.096. The graphs display the empirical distribution of the estimated ATE with correct mean specification under three different cases of misspecification of the probability models. For the fourth case, see the main text. The graphs communicate the theoretical findings of this paper which state that under correct specification of the conditional model (conditional mean for these simulations), unweighted and weighted estimators will all be consistent for the true average treatment effect. Hence, correct specification of the probabilities does not have any added bite here in terms of achieving consistency. 171 Figure H.2 (cont’d) Case 2: When the conditional mean model is misspecified a) Both probability models are correct b) Misspecified propensity score model c) Misspecified missingness model d) Both probability models are misspecified Notes: The empirical distribution is obtained from 1000 simulation draws of sample size 5000. However, the effective sample sizes are much smaller. Since the average propensity of treatment is a 0.41 and average propensity of being observed as 0.38, the average treated sample is N1 = 5000 × 0.41 × 0.38 = 779 and average control sample is N0 = 5000× (1− 0.41)× 0.38 = 1, 121. The true ATE = 0.096. The graphs display the empirical distribution of the estimated ATE with misspecified mean model under two different cases of misspecification of the probability models. For the other two, see the main text. In each of these graphs we can the doubly weighted estimator is consistent for the true ATE whereas the unweighted and PS-weighted are away from the truth. 172 Figure H.3: Estimated CQTE with true CQTE as a function of x1, N=5000 D) Both probability models are misspecified a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: This figure plots the estimated CQTE along with the true CQTE as a function of x1. The figure corresponds to the scenario when the conditional quantile functions for the treated and control groups are correctly specified but the two probability models are misspecified. Along with these two graphs, the figure also plots the function across the 1000 simulations (reps). The other three cases for when the propensity score or the missing data probability is allowed to be misspecified are not considered since under correct CQF specification, all these graphs look identical. For, N = 5000, the average treated sample is N1 = 5000 × 0.41 × 0.38 = 779 and average control sample is N0 = 5000 × (1 − 0.41) × 0.38 = 1, 121. 173 Figure H.4: Bias in estimated linear projection relative to true linear projection as a function of x1 using Angrist et al. (2006b) methodology, N=5000 A) Both probability models are correct a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: Angrist et al. 
(2006b) show that under the case of misspecification of the true CQF, the check function can still estimate a weighted linear projection to the true CQF. Since this particular case corresponds to misspecification of the CQF (where I estimate it to be linear), the solutions to problem 3.36 will consistently estimate the LP’s to the two CQFs under the problems of non-random assignment and missing outcomes. Therefore, one can characterize an LP to the true CQTE using these two objects. This figure plots the bias in the doubly-weighted, PS-weighted and unweighted linear projection of the true CQTE relative to the true population LP of CQTE. For, N = 5000, the average treated sample is N1 = 5000 × 0.41 × 0.38 = 779 and average control sample is N0 = 5000 × (1 − 0.41) × 0.38 = 1, 121. For a description of how these functions were estimated, see the simulation appendix. 174 Figure H.4 (cont’d) D) Both probability models are misspecified a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: Angrist et al. (2006b) show that under the case of misspecification of the true CQF, the check function can still estimate a weighted linear projection to the true CQF. Since this particular case corresponds to misspecification of all three components of the doubly weighted framework, the solutions to problem 3.36 will not consistently estimate the LP’s to the two CQFs under the problems of non-random assignment and missing outcomes. This figure plots the bias in the doubly-weighted, PS-weighted and unweighted linear projection of the true CQTE relative to the true population LP of CQTE. For, N = 5000, the average treated sample is N1 = 5000× 0.41× 0.38 = 779 and average control sample is N0 = 5000× (1− 0.41)× 0.38 = 1, 121. For a description of how these functions were estimated, see the simulation appendix. The unweighted estimator does not weight the observed data by anything. The PS-weighted estimator weights to correct only for non-random assignment and the doubly weighted estimator weights by both the treatment and missing outcomes propensity score models to deal with non-random assignment and missing outcome problems. 175 Figure H.5: Empirical distribution of estimated UQTE for N=5000 A) Both probability models are correct a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: The empirical distribution is obtained from 1000 simulation draws for N = 5000. Since the average propensity of treatment is 0.41 and average propensity of being observed is 0.38, this implies the average treated sample is N1 = 5000×0.41×0.38 = 779 and average control sample is N0 = 5000×(1−0.41)×0.38 = 1, 121. The unweighted estimator does not weight the observed data by anything. The PS-weighted estimator weights to correct only for non-random assignment and the doubly weighted estimator weights by both the treatment and missing outcomes propensity score models to deal with non-random assignment and missing outcome problems. 176 Figure H.5 (cont’d) B) Correct missing outcomes probability but misspecified propensity score model a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: The empirical distribution is obtained from 1000 simulation draws for N = 5000. Since the average propensity of treatment is 0.41 and average propensity of being observed is 0.38, this implies the average treated sample is N1 = 5000×0.41×0.38 = 779 and average control sample is N0 = 5000×(1−0.41)×0.38 = 1, 121. The unweighted estimator does not weight the observed data by anything. 
Figure H.5 (cont'd)

B) Correct missing outcomes probability but misspecified propensity score model
a) τ = 0.25   b) τ = 0.50   c) τ = 0.75
Notes: See panel A.

C) Misspecified missing outcomes probability but correct propensity score model
a) τ = 0.25   b) τ = 0.50   c) τ = 0.75
Notes: See panel A.

D) Both probability models are misspecified
a) τ = 0.25   b) τ = 0.50   c) τ = 0.75
Notes: See panel A.

APPENDIX I

TABLES FOR CHAPTER 3

Table I.1: An illustration of the observed sample (✓ means observed, ? means missing)

 i     x     w     s     y
 1     ✓     1     0     ?
 2     ✓     1     1     y(1)
 3     ✓     0     1     y(0)
 4     ✓     0     0     ?
 ...   ...   ...   ...   ...
 N     ✓     1     1     y(1)
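The observed-data pattern in Table I.1 can be mimicked with a short simulation sketch. This is a hypothetical data-generating process for illustration only — the index coefficients below are placeholders, not the chapter's actual simulation design (only the true ATE of 0.096 is taken from the text).

```python
import numpy as np

# Hypothetical sketch of the structure in Table I.1: x is always observed,
# w is the (non-random) assignment, s flags whether the outcome is recorded,
# and y equals y(w) when s = 1 and is missing ('?') otherwise.
rng = np.random.default_rng(0)
N = 5000
x = rng.normal(size=N)
w = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))                     # assignment depends on x
s = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.4 * x + 0.3 * w))))   # missingness depends on (x, w)
y0 = x + rng.normal(size=N)
y1 = x + 0.096 + rng.normal(size=N)                                 # true ATE = 0.096
y = np.where(s == 1, np.where(w == 1, y1, y0), np.nan)              # '?' becomes NaN
```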
Table I.2: Different scenarios under ignorability and unconfoundedness

Columns: Situation; D(y(g)|x); P(s = 1|y(g), x, wg) = P(s = 1|x, wg)?; P(s = 1|x, wg) correct?; P(wg = 1|y(g), x) = P(wg = 1|x)?; P(wg = 1|x) correct?; Unweighted for D(y(g)|x)?; Weighted for D(y(g)|x)?; Weighted for 2.1?

Situations 1–9; D(y(g)|x) is correctly specified in situations 1–3 and misspecified in situations 5–9. Remaining cell entries (Yes/No/Either): No Yes Yes No Yes Yes Yes Yes; Either Either Either Either Either Yes Yes No; Either Yes No Yes Yes No Either Either; Either Either Either Either Either Either No Yes; Either Either No Yes No No No No; No No Yes No No No No No; No No Yes No No No Yes No; No No Yes No No.

a. Notice that if the missingness mechanism is not ignorable, or for that matter the assignment mechanism is not unconfounded, then nothing can be consistently estimated whether or not the other components of the framework are correctly specified. This can be seen in cases (1) and (3) in the table above. Situations (2) and (7) together form what is called robust estimation, as described in the sections above. Remember that under unconfoundedness and ignorability, D(y(g)|x) is the same as D(y(g)|x, wg, s).

Table I.3: Different scenarios under exogeneity of missingness and unconfoundedness

Columns: Situation; D(y(g)|x); P(s = 1|y(g), x, wg) = P(s = 1|x)?; P(s = 1|x) correct?; P(wg = 1|y(g), x) = P(wg = 1|x)?; P(wg = 1|x) correct?; Unweighted for D(y(g)|x)?; Weighted for D(y(g)|x)?; Weighted for 2.1?

Situations 1–9; D(y(g)|x) is correctly specified in situations 1–3 and misspecified in situations 4–9. Remaining cell entries (Yes/No/Either): Yes No Yes Yes Yes Yes Yes No Yes; Either Either Either Yes No No Yes Either Yes; Yes Either No Yes Yes Yes Yes Either No; Either Either Either Yes No Yes No Either Either; Yes No No No No No No No No; Yes No No No No No No No No; Yes No No Yes No No No No No.

a. Situations (1) and (4) combine to give the double robustness result, which says that either the conditional feature of interest needs to be correctly specified, or the treatment and missing probabilities both need to be correctly specified. Again, just as in the previous case, if the missingness mechanism is not exogenous or if the assignment mechanism is not unconfounded, then even correctly specifying either of these features will not consistently estimate the parameter of interest. This is illustrated in cases (2) and (3). If one looks at case (2) in the table above, under both these situations the unweighted estimator works to deliver a consistent estimator of $\theta_g^0$. In such a scenario, where both the unweighted and weighted estimators are consistent, how can we choose between them? The following table enumerates situations where not weighting is better than weighting.

Table I.4: When is unweighted more efficient than weighted, assuming ignorability and unconfoundedness and D(y(g)|x) correctly specified?

Situation | P(s = 1|x, wg) correct? | P(s = 1|y(g), x, wg) = P(s = 1|x)? | P(s = 1|x) correct? | P(wg = 1|x) correct? | GCIME holds? | Unweighted more efficient? | Weighted (estimated probabilities) more efficient?
1 | Either | No  | Doesn't apply | Either | Yes | Yes       | No
2 | Either | Yes | Either        | Either | Yes | Yes       | No
3 | Either | Either | Either     | Either | No  | Can't say | Can't say

I.0.1 Bias and root-mean squared error for ATE simulations

Table I.5: When the conditional mean model is correctly specified

A) Both probability models are correct

                          N = 1000                 N = 5000
Estimator                 BIAS       RMSE          BIAS       RMSE
Unweighted                -0.00082   0.02372       -0.00039   0.01074
PS-weighted               -0.00067   0.02370       -0.00037   0.01074
D-weighted (estimated)    -0.00066   0.02376       -0.00034   0.01075
D-weighted (known)        -0.00065   0.02374       -0.00034   0.01075

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the propensity score and the missingness probabilities to deal with the assignment and missing data problems. The two rows for the doubly weighted estimator report the bias and RMSE of the versions that use estimated and known probability weights, respectively. The efficiency results in section 3.5 dictate no asymptotic efficiency gains when the conditional model is correctly specified; in finite samples, however, one could obtain smaller or larger variance estimates. For N = 1000, the average treated sample is N1 = 1000 × 0.41 × 0.38 = 156 and the average control sample is N0 = 1000 × (1 − 0.41) × 0.38 = 224. For N = 5000, N1 = 5000 × 0.41 × 0.38 = 779 and N0 = 5000 × (1 − 0.41) × 0.38 = 1,121. The bias and RMSE are computed across 1000 simulation draws.
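The BIAS and RMSE entries in Tables I.5–I.10 follow the standard Monte Carlo definitions. A minimal sketch of the computation (illustrative code, not the chapter's; the true ATE of 0.096 is from the simulation design):

```python
import numpy as np

# Given `estimates`, a vector holding one ATE estimate per simulation draw
# (1000 draws here), bias and RMSE are measured against the true ATE.
def bias_rmse(estimates, truth=0.096):
    estimates = np.asarray(estimates)
    bias = estimates.mean() - truth
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    return bias, rmse
```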
Table I.5 (cont'd)

B) Correct missingness model but misspecified propensity score model

               N = 1000                N = 5000
Estimator      BIAS       RMSE         BIAS       RMSE
Unweighted     -0.00082   0.02372      -0.00039   0.01074
PS-weighted    -0.00069   0.02369      -0.00035   0.01074
D-weighted     -0.00081   0.02376      -0.00040   0.01076

Notes: See panel A; bias and RMSE are computed across 1000 simulation draws.

C) Misspecified missingness model but correct propensity score model

               N = 1000                N = 5000
Estimator      BIAS       RMSE         BIAS       RMSE
Unweighted     -0.00082   0.02372      -0.00039   0.01074
PS-weighted    -0.00067   0.02370      -0.00037   0.01074
D-weighted     -0.00067   0.02372      -0.00035   0.01075

Notes: See panel A; bias and RMSE are computed across 1000 simulation draws.

D) Both probability models are misspecified

               N = 1000                N = 5000
Estimator      BIAS       RMSE         BIAS       RMSE
Unweighted     -0.00082   0.02372      -0.00039   0.01074
PS-weighted    -0.00069   0.02369      -0.00035   0.01074
D-weighted     -0.00080   0.02373      -0.00039   0.01075

Notes: See panel A; bias and RMSE are computed across 1000 Monte Carlo repetitions.

Table I.6: Misspecified conditional mean model

A) Both probability models are correct

                          N = 1000                N = 5000
Estimator                 BIAS       RMSE         BIAS       RMSE
Unweighted                0.01087    0.03250      0.01052    0.01744
PS-weighted               0.00008    0.03038      -0.00058   0.01396
D-weighted (estimated)    0.00003    0.02979      -0.00064   0.01376
D-weighted (known)        -0.00002   0.02986      -0.00064   0.01375

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the treatment and missing outcomes probabilities to deal with the assignment and missing outcomes problems. The two rows for the doubly weighted estimator report the bias and RMSE of the versions that use estimated and known probability weights, respectively. For N = 1000, the average treated sample is N1 = 1000 × 0.41 × 0.38 = 156 and the average control sample is N0 = 1000 × (1 − 0.41) × 0.38 = 224. For N = 5000, N1 = 779 and N0 = 1,121. The bias and RMSE are computed across 1000 Monte Carlo repetitions.
Table I.6 (cont'd)

B) Correct missingness model but misspecified propensity score model

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS      RMSE
Unweighted     0.01087   0.03250      0.01052   0.01744
PS-weighted    0.00699   0.03117      0.00651   0.01532
D-weighted     0.00102   0.02984      0.00049   0.01378

Notes: See panel A; bias and RMSE are computed across 1000 simulation draws.

C) Misspecified missingness model but correct propensity score model

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS       RMSE
Unweighted     0.01087   0.03250      0.01052    0.01744
PS-weighted    0.00008   0.03038      -0.00058   0.01396
D-weighted     0.00001   0.02970      -0.00063   0.01371

Notes: See panel A; bias and RMSE are computed across 1000 simulation draws.

D) Both probability models are misspecified

               N = 1000                N = 5000
Estimator      BIAS       RMSE         BIAS       RMSE
Unweighted     0.01087    0.03250      0.01052    0.01744
PS-weighted    0.00699    0.03117      0.00651    0.01532
D-weighted     -0.00093   0.02992      -0.00150   0.01380

Notes: See panel A; bias and RMSE are computed across 1000 Monte Carlo repetitions.

I.0.2 Bias and root-mean squared error for UQTE simulations

Table I.7: A) Both probability models are correct

For τ = 0.25 (25th quantile)

                          N = 1000               N = 5000
Estimator                 BIAS      RMSE         BIAS      RMSE
Unweighted                -0.0424   0.0690       -0.0446   0.0512
PS-weighted               -0.0014   0.0554       -0.0022   0.0254
D-weighted (estimated)    0.0038    0.0549       0.0012    0.0255
D-weighted (known)        0.0046    0.0532       0.0012    0.0247

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the propensity score model and the missingness model to correct for non-random assignment and missing outcomes. For N = 1000, the average treated sample is N1 = 1000 × 0.41 × 0.38 = 156 and the average control sample is N0 = 1000 × (1 − 0.41) × 0.38 = 224. For N = 5000, N1 = 5000 × 0.41 × 0.38 = 779 and N0 = 5000 × (1 − 0.41) × 0.38 = 1,121. The bias and RMSE are computed across 1000 Monte Carlo repetitions.
For τ = 0.50 (50th quantile)

                          N = 1000               N = 5000
Estimator                 BIAS      RMSE         BIAS      RMSE
Unweighted                -0.1072   0.1543       -0.0998   0.1114
PS-weighted               -0.0206   0.1181       -0.0157   0.0522
D-weighted (estimated)    0.0007    0.1068       0.0044    0.0483
D-weighted (known)        0.0000    0.1028       0.0043    0.0462

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.

Table I.7 (cont'd)

For τ = 0.75 (75th quantile)

                          N = 1000               N = 5000
Estimator                 BIAS      RMSE         BIAS      RMSE
Unweighted                -0.2742   0.3803       -0.2687   0.2917
PS-weighted               -0.0899   0.3097       -0.0896   0.1550
D-weighted (estimated)    -0.0217   0.2523       -0.0147   0.1036
D-weighted (known)        -0.0210   0.2399       -0.0145   0.0983

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.

Table I.8: B) When missing data probability is misspecified and propensity score is correct

For τ = 0.25 (25th quantile)

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS      RMSE
Unweighted     -0.0424   0.0690       -0.0446   0.0512
PS-weighted    -0.0014   0.0554       -0.0022   0.0254
D-weighted     -0.0116   0.0557       -0.0150   0.0291

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the treatment propensity score and the missing outcomes probability model. For N = 1000, N1 = 156 and N0 = 224; for N = 5000, N1 = 779 and N0 = 1,121 (computed as in Table I.7). The bias and RMSE are computed across 1000 simulation draws.

Table I.8 (cont'd)

For τ = 0.50 (50th quantile)

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS      RMSE
Unweighted     -0.1072   0.1543       -0.0998   0.1114
PS-weighted    -0.0206   0.1181       -0.0157   0.0522
D-weighted     -0.0355   0.1119       -0.0319   0.0571

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.
Table I.8 (cont'd)

For τ = 0.75 (75th quantile)

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS      RMSE
Unweighted     -0.2742   0.3803       -0.2687   0.2917
PS-weighted    -0.0899   0.3097       -0.0896   0.1550
D-weighted     -0.0989   0.2648       -0.0896   0.1348

Notes: See the τ = 0.25 panel of Table I.8; bias and RMSE are computed across 1000 simulation draws.

Table I.9: C) When missing data probability is correct and propensity score is misspecified

For τ = 0.25 (25th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0014 (0.0554), -0.0239 (0.0352), -0.0227 (0.0598), -0.0022 (0.0254), 0.0265 (0.0614), 0.0243 (0.0348).

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the treatment propensity score and the missing outcomes probability model. For N = 1000, N1 = 156 and N0 = 224; for N = 5000, N1 = 779 and N0 = 1,121. The bias and RMSE are computed across 1000 simulation draws.

For τ = 0.50 (50th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0206 (0.1181), -0.0681 (0.1349), -0.0637 (0.0809), -0.0157 (0.0522), 0.0456 (0.1168), 0.0488 (0.0673).

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.

Table I.9 (cont'd)

For τ = 0.75 (75th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0899 (0.3097), -0.1978 (0.2346), -0.0896 (0.1550), -0.2005 (0.3611), 0.0691 (0.2894), 0.0709 (0.1377).

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.
Table I.10: D) Both probability models are misspecified

For τ = 0.25 (25th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0014 (0.0554), -0.0227 (0.0598), -0.0022 (0.0254), -0.0239 (0.0352), 0.0095 (0.0571), 0.0074 (0.0268).

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the treatment propensity score and the missing outcomes probability model to correct for non-random assignment and missing outcomes. For N = 1000, the average treated sample is N1 = 1000 × 0.41 × 0.38 = 156 and the average control sample is N0 = 1000 × (1 − 0.41) × 0.38 = 224. For N = 5000, N1 = 779 and N0 = 1,121. The bias and RMSE are computed across 1000 Monte Carlo repetitions.

Table I.10 (cont'd)

For τ = 0.50 (50th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0206 (0.1181), -0.0681 (0.1349), -0.0157 (0.0522), -0.0637 (0.0809), 0.0070 (0.1108), 0.0136 (0.0503).

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.

For τ = 0.75 (75th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0899 (0.3097), -0.0131 (0.1202), -0.0896 (0.1550), -0.2005 (0.3611), -0.0216 (0.2806), -0.1978 (0.2346).

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.
I.0.3 Calonico and Smith Application

Table I.11: Proportion of missing earnings in the experimental sample

Earnings in 1979   Treated   Control   Total
Missing            196       210       406
Observed           600       585       1,185
Total              796       795       1,591

Table I.12: Proportion of missing data in the PSID samples

Earnings in 1979   PSID-1   PSID-2
Missing            81       22
Observed           648      182
Total              729      204

Table I.13: Unweighted and weighted pre-training earnings comparisons using NSW and PSID comparison groups

Pre-training estimates

                     Unadjusted                                            Adjusted
Comparison group     Unweighted        PS-weighted     D-weighted         Unweighted        PS-weighted     D-weighted
NSW (N=1,185)        -18 (123.45)      -9 (51.07)      1 (48.76)          -22 (124.70)      -10 (51.34)     -1 (48.97)
PSID-1 (N=1,016)     -2,534 (283.95)   -222 (213.57)   -255 (205.59)      -2,804 (281.49)   -199 (212.55)   -222 (205.45)
PSID-2 (N=720)       -2,080 (411.23)   -1,371 (331.41) -1,357 (317.41)    -2,181 (427.24)   -1,505 (359.98) -1,467 (342.16)

Bias using NSW control

PSID-1 (N=1,001)     -2,517 (279.38)   289 (256.93)    236 (247.18)       -2,760 (283.09)   334 (257.50)    287 (248.20)
PSID-2 (N=705)       -2,063 (416.53)   -1,249 (323.36) -1,255 (310.59)    -2,144 (435.74)   -1,306 (354.12) -1,297 (337.68)

Adjusted covariates (included in all adjusted specifications): pre-training earnings (1975), age, age², education, high school dropout, black, hispanic, and marital status.

Notes: This table reports unadjusted and adjusted pre-training earnings differences, where the first row reports the experimental estimates. The second and third rows report non-experimental estimates computed using the PSID-1 and PSID-2 comparison groups, respectively. The second panel reports bias estimates computed by combining the NSW control group with the PSID-1 and PSID-2 comparison groups, respectively. Both the pre-training estimates and the bias estimates should be compared to zero. Bootstrapped standard errors (in parentheses) have been constructed using 10,000 replications. All values are in 1982 dollars. The samples used for estimating the training and bias estimates with the PSID-1 and PSID-2 comparison groups have been trimmed to ensure common support in the distribution of weights for the NSW treatment and comparison groups. For more detail, see appendix G.

Table I.14: Estimation summary for ATE under different cases of misspecification

           CEF                        G(·)                         R(·)
Scenario   Model   Estimation         Model   Estimation           Model   Estimation
1          C       Φ(xθg)             C       Λ(xγ)                C       Λ(zγ)
2          C       Φ(xθg)             C       Λ(xγ)                M       Φ(z(1)γ(1))
3          C       Φ(xθg)             M       Φ(x(1)γ(1))          C       Λ(zγ)
4          C       Φ(xθg)             M       Φ(x(1)γ(1))          M       Φ(z(1)γ(1))
5          M       xθg                C       Λ(xγ)                C       Λ(zγ)
6          M       xθg                C       Λ(xγ)                M       Φ(z(1)γ(1))
7          M       xθg                M       Φ(x(1)γ(1))          C       Λ(zγ)
8          M       xθg                M       Φ(x(1)γ(1))          M       Φ(z(1)γ(1))

Notes: C and M correspond to whether the estimated model is correctly specified or misspecified. x and z both include an intercept. x(1) and z(1) are the subsets of x and z left after omitting x1. Therefore, the probability models are misspecified in both the functional form and the linear index. G(·) refers to the propensity score model and R(·) refers to the missing outcomes probability model.
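As a concrete illustration of the design in Table I.14's notes, the sketch below fits the correctly specified logit in the full index alongside the misspecified probit that omits x1 (wrong functional form and wrong index). The data-generating line is a placeholder, not the chapter's simulation; only the logit/probit-with-omitted-x1 contrast is taken from the table.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder draws; the true propensity is a logit in (1, x1, x2).
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=2000), rng.normal(size=2000)
w = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.8 * x1 - 0.5 * x2))))

X_full = sm.add_constant(np.column_stack([x1, x2]))  # intercept + x1 + x2
X_omit = sm.add_constant(x2)                         # drops x1 entirely
G_correct = sm.Logit(w, X_full).fit(disp=0).predict(X_full)   # Λ(xγ)
G_misspec = sm.Probit(w, X_omit).fit(disp=0).predict(X_omit)  # Φ(x(1)γ(1))
```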
Table I.15: Estimation summary for quantile effects under different cases of misspecification

           CQF                          G(·)                         R(·)
Scenario   Model   Estimation           Model   Estimation           Model   Estimation
4          C       exp(xθg(τ))          M       Φ(x(1)γ(1))          M       Φ(x(1)γ(1))
5          M       xθg(τ)               C       Λ(xγ)                C       Λ(zγ)
6          M       xθg(τ)               C       Λ(xγ)                M       Φ(x(1)γ(1))
7          M       xθg(τ)               M       Φ(x(1)γ(1))          C       Λ(zγ)
8          M       xθg(τ)               M       Φ(x(1)γ(1))          M       Φ(x(1)γ(1))

Notes: C and M denote whether the estimated model is correctly specified or misspecified. x and z both include an intercept. x(1) and z(1) are the subsets of x and z left after omitting x1. Therefore, the probability models are misspecified in both the functional form and the linear index. G(·) refers to the propensity score model and R(·) refers to the missing outcomes probability model.

Table I.16: Covariate means and p-values from the test of equality of two means, by treatment status

Covariates                           Treatment          Control            P(|T|>|t|)   PSID-1             P(|T|>|t|)   PSID-2             P(|T|>|t|)
Age in years                         33.37 (7.42)       33.64 (7.19)       0.46         36.73 (10.60)      0.00         34.41 (9.48)       0.11
Years of education                   10.30 (1.92)       10.27 (2.00)       0.72         11.32 (2.71)       0.00         10.55 (2.09)       0.07
Proportion of high school dropouts   0.70 (0.46)        0.69 (0.46)        0.73         0.45 (0.50)        0.00         0.59 (0.49)        0.00
Proportion married                   0.02 (0.15)        0.04 (0.20)        0.03         0.02 (0.13)        0.05         0.01 (0.10)        0.08
Proportion black                     0.84 (0.37)        0.82 (0.39)        0.29         0.66 (0.47)        0.00         0.87 (0.34)        0.13
Proportion hispanic                  0.12 (0.32)        0.13 (0.33)        0.59         0.02 (0.12)        0.00         0.02 (0.16)        0.00
Number of children in 1975           2.17 (1.30)        2.26 (1.32)        0.21         1.70 (1.75)        0.00         2.91 (1.73)        0.00
Real earnings in 1975                799.88 (1931.92)   811.19 (2041.32)   0.91         7446.15 (7515.59)  0.00         2069.65 (3474.10)  0.00
Observations                         796                795                             729                             204

Notes: Along with the covariate means and standard deviations (in parentheses), the table reports p-values from the test of equality of two means. Column 4 tests for differences between the NSW treatment and control groups; columns 6 and 8 report the same using the PSID-1 and PSID-2 comparison groups, respectively. Real earnings in 1975 are expressed in 1982 dollars.
Table I.17: Covariate means and p-values from the test of equality of two means for the observed and missing samples

Control group:
Covariates                           Missing             Observed            P(|T|>|t|)
Age                                  33.36 (7.30)        33.74 (7.15)        0.51
Years of education                   10.29 (1.93)        10.26 (2.03)        0.85
Proportion of high school dropouts   0.70 (0.46)         0.68 (0.47)         0.57
Proportion married                   0.05 (0.21)         0.04 (0.19)         0.61
Proportion black                     0.81 (0.39)         0.82 (0.39)         0.81
Proportion hispanic                  0.12 (0.33)         0.13 (0.33)         0.87
Number of children in 1975           2.33 (1.29)         2.23 (1.34)         0.34
Real earnings in 1975                621.54 (1,523.00)   879.28 (2,194.93)   0.12

Treatment group:
Covariates                           Missing             Observed            P(|T|>|t|)
Age                                  32.15 (7.39)        33.77 (7.40)        0.01
Years of education                   10.29 (2.05)        10.31 (1.88)        0.89
Proportion of high school dropouts   0.69 (0.46)         0.70 (0.46)         0.77
Proportion married                   0.03 (0.16)         0.02 (0.15)         0.75
Proportion black                     0.83 (0.38)         0.84 (0.37)         0.87
Proportion hispanic                  0.13 (0.33)         0.12 (0.32)         0.64
Number of children in 1975           2.14 (1.32)         2.19 (1.29)         0.69
Real earnings in 1975                610.77 (1,677.36)   861.65 (2,005.53)   0.11

PSID-1:
Covariates                           Missing             Observed            P(|T|>|t|)
Age                                  34.00 (10.50)       37.07 (10.57)       0.01
Years of education                   11.44 (2.17)        11.30 (2.77)        0.60
Proportion of high school dropouts   0.43 (0.50)         0.45 (0.50)         0.73
Proportion married                   0.00 (0.00)         0.02 (0.14)         0.00
Proportion black                     0.74 (0.44)         0.65 (0.48)         0.10
Proportion hispanic                  0.01 (0.11)         0.02 (0.12)         0.82
Number of children in 1975           1.54 (1.45)         1.71 (1.78)         0.33
Real earnings in 1975                6927.95 (7,330.74)  7510.92 (7,541.41)  0.50

PSID-2:
Covariates                           Missing             Observed            P(|T|>|t|)
Age                                  33.32 (10.81)       34.54 (9.34)        0.62
Years of education                   11.05 (1.73)        10.49 (2.13)        0.18
Proportion of high school dropouts   0.55 (0.51)         0.59 (0.49)         0.68
Proportion married                   0.00 (0.00)         0.01 (0.10)         0.16
Proportion black                     0.91 (0.29)         0.86 (0.35)         0.50
Proportion hispanic                  0.05 (0.21)         0.02 (0.15)         0.62
Number of children in 1975           2.41 (1.14)         2.97 (1.79)         0.05
Real earnings in 1975                896.56 (2,315.12)   2211.45 (3,567.50)  0.02

Observations: 204, 204, 795, 795, 729, 729, 796, 796.

Notes: Along with the covariate means and standard deviations (in parentheses), the table reports p-values from the test of equality of two means between the observed and missing samples. Real earnings in 1975 are expressed in 1982 dollars.

Table I.18: Unweighted and weighted earnings comparisons and estimated training effects using NSW and PSID comparison groups

Post-training earnings estimates

                     Unadjusted                                        Adjusted (first set)                              Adjusted (second set)
Comparison group     Unweighted      PS-weighted     D-weighted        Unweighted      PS-weighted      D-weighted       Unweighted      PS-weighted      D-weighted
NSW (N=1,185)        821 (307.22)    848 (304.04)    824 (304.61)      845 (303.60)    852 (302.94)     828 (303.53)     864 (303.47)    850 (302.96)     826 (303.58)
PSID-1 (N=1,016)     -799 (444.84)   827 (503.00)    803 (503.26)      298 (428.60)    909 (497.76)     907 (501.54)     335 (440.18)    905 (518.54)     904 (522.97)
PSID-2 (N=720)       -31 (713.88)    569 (1041.81)   566 (1027.12)     492 (664.46)    996 (953.80)     1,040 (961.74)   698 (784.28)    1,082 (1264.18)  1,049 (1217.46)

Bias estimates using NSW control

PSID-1 (N=1,001)     -1,620 (431.75) 169 (561.74)    156 (553.07)      -493 (427.93)   -40 (499.91)     -21 (501.44)     -568 (434.59)   -38 (504.19)     -21 (507.02)
PSID-2 (N=705)       -109 (663.80)   207 (962.85)    200 (954.61)      -853 (707.87)   -17 (1195.47)    -24 (1156.39)    -212 (1025.87)  -228 (1041.44)   -378 (759.75)

Adjusted covariates: the adjusted specifications include pre-training earnings (1975), age, age², education, high school dropout, black, hispanic, and marital status; the second adjusted set additionally includes number of children (1975).

Notes: This table reports unadjusted and adjusted post-training earnings differences between the NSW treatment group and three different comparison groups, namely the NSW control, PSID-1, and PSID-2. The first row reports experimental training estimates that combine the NSW treatment and control groups, whereas the second and third rows report non-experimental estimates computed using the PSID-1 and PSID-2 groups, respectively. Each of the non-experimental estimates should be compared to the experimental benchmark.
The second panel of the table reports bias estimates computed by combining the NSW control with the PSID-1 and PSID-2 comparison groups, respectively. These represent a second measure of bias, which should be compared to zero. Bootstrapped standard errors are given in parentheses and have been constructed using 10,000 replications. All values are in 1982 dollars. The samples used for estimating the training and bias estimates have been trimmed to ensure common support in the distribution of weights for the treatment and comparison groups. For more detail, see the application appendix.

Table I.19: Unconditional quantile treatment effect (UQTE) using PSID-1 comparison group

Quantile   Experimental        Unweighted           PS-weighted          D-weighted
0.1        0 (0)               0 (0)                0 (0)                0 (0)
0.2        0 (0)               0 (0)                0 (0)                0 (0)
0.3        0 (0)               0 (12.91)            0 (0)                0 (0)
0.4        0 (11.17)           -1124.61 (552.97)    0 (207.14)           0 (174.89)
0.5        993.52 (695.93)     -2227.26 (983.43)    2076.58 (851.09)     1847.04 (829.42)
0.6        2004.40 (1112.82)   -860.55 (964.97)     3602.76 (1299.08)    3535.85 (1284.64)
0.7        2129.93 (716.04)    428.01 (728.22)      3415.47 (988.24)     3340.84 (992.95)
0.8        1753.27 (372.37)    -190.60 (519.63)     2019.44 (984.59)     2019.44 (999.47)
0.9        1134.21 (449.86)    -1563.27 (952.85)    -385.45 (1059.43)    -385.45 (1056.09)

Notes: This table reports unweighted, PS-weighted, and double-weighted UQTE estimates. The estimates are reported at every 10th quantile of the 1979 earnings distribution. The experimental and PSID-1 estimates have been constructed using N = 1,185 and N = 1,016 observations, respectively. Bootstrapped standard errors are given in parentheses and have been constructed using 1,000 replications. All values are in 1982 dollars. The samples used for constructing these estimates have been trimmed to ensure common support across the treatment and comparison groups.

Table I.20: Unconditional quantile treatment effect (UQTE) using PSID-2 comparison group

Quantile   Experimental        Unweighted           PS-weighted          D-weighted
0.1        0 (0)               0 (0)                0 (0)                0 (0)
0.2        0 (0)               0 (0)                0 (10.07)            0 (10.07)
0.3        0 (0)               0 (111.74)           0 (136.31)           0 (129.77)
0.4        0 (13.25)           -795.71 (672.87)     0 (573.22)           0 (546.78)
0.5        993.52 (693.73)     -237.98 (1232.63)    378.98 (1312.93)     372.07 (1267.28)
0.6        2004.40 (1114.65)   193.77 (1426.40)     1480.47 (1647.31)    1294.77 (1659.69)
0.7        2129.93 (710.26)    1857.64 (943.38)     2616.22 (1217.80)    2599.73 (1209.60)
0.8        1753.27 (371.73)    1148.85 (1152.92)    2010.87 (1541.14)    1990.37 (1553.67)
0.9        1134.21 (452.08)    -237.08 (1888.06)    1089.10 (3321.56)    1089.10 (3246.78)

Notes: This table reports unweighted, PS-weighted, and double-weighted UQTE estimates. The estimates are reported at every 10th quantile of the 1979 earnings distribution. The experimental and PSID-2 estimates have been computed using N = 1,185 and N = 720 observations, respectively. Bootstrapped standard errors are given in parentheses and have been constructed using 1,000 replications. All values are in 1982 dollars.
The samples used for constructing these estimates have been trimmed to ensure common support across the treatment and comparison groups.

APPENDIX J

PROOFS FOR CHAPTER 3

Proof of Lemma 3.2.5

Proof. By the law of iterated expectations (LIE),
$$
\begin{aligned}
E\left[\frac{s}{r(x,w_g)}\cdot\frac{w_g}{p_g(x)}\cdot q\big(y(g),x,\theta_g\big)\right]
&= E\left[\frac{w_g}{r(x,w_g)\,p_g(x)}\cdot E\big(s \mid y(g),x,w_g\big)\cdot q\big(y(g),x,\theta_g\big)\right]\\
&= E\left[\frac{w_g}{r(x,w_g)\,p_g(x)}\cdot P\big(s=1\mid y(g),x,w_g\big)\cdot q\big(y(g),x,\theta_g\big)\right]\\
&= E\left[\frac{w_g}{r(x,w_g)\,p_g(x)}\cdot P\big(s=1\mid x,w_g\big)\cdot q\big(y(g),x,\theta_g\big)\right]\\
&= E\left[\frac{w_g}{r(x,w_g)\,p_g(x)}\cdot r(x,w_g)\cdot q\big(y(g),x,\theta_g\big)\right]
= E\left[\frac{w_g}{p_g(x)}\cdot q\big(y(g),x,\theta_g\big)\right],
\end{aligned}
$$
where the third equality follows from ignorability. Using another application of the LIE, rewrite the above expectation as
$$
\begin{aligned}
E\left[\frac{q\big(y(g),x,\theta_g\big)}{p_g(x)}\cdot E\big(w_g \mid y(g),x\big)\right]
&= E\left[\frac{q\big(y(g),x,\theta_g\big)}{p_g(x)}\cdot P\big(w_g=1\mid y(g),x\big)\right]\\
&= E\left[\frac{q\big(y(g),x,\theta_g\big)}{p_g(x)}\cdot P\big(w_g=1\mid x\big)\right]
= E\Big[q\big(y(g),x,\theta_g\big)\Big],
\end{aligned}
$$
where the second-to-last equality follows from unconfoundedness. Hence, $\theta_g^0$ solves the weighted population problem. ∎

Proof of Lemma 3.3.3

Proof. Consistency of $\hat\gamma$ and $\hat\delta$ follows directly after verifying the conditions of Theorem 2.1 in Newey and McFadden (1994). Condition 2.1(i), which requires a unique solution to the maximization problem, is satisfied using Assumption 3.3.3(1) and (4) and Lemma 2.2 of Newey and McFadden (1994). Condition 2.1(ii), compactness of the parameter space, holds by Assumption 3.3.3(1). Conditions 2.1(iii) and (iv) follow from Lemma 2.4 in Newey and McFadden (1994). ∎

Proof of Lemma 3.3.4

Proof. The proof of asymptotic normality follows from verifying the conditions of Theorem 3.1 in Newey and McFadden (1994), which is the basic asymptotic normality result for extremum estimators. I use the arguments laid out in Newey and McFadden (1994) to prove asymptotic normality of $\sqrt{N}(\hat\gamma - \gamma_0)$; normality for $\sqrt{N}(\hat\delta - \delta_0)$ follows in a similar manner. By Lemma 3.3.3, we have $\hat\gamma \xrightarrow{p} \gamma_0$. Theorem 3.1(i) and (ii) hold because of conditions 3.4(i) and (ii). Condition 3.1(iii) holds with $\Sigma = -E\big[\nabla_{\gamma\gamma'} \ln f(w_1|x,\gamma_0)\big]$ by the information matrix equality, $E\big[\nabla_{\gamma} \ln f(w_1|x,\gamma_0)\big] = 0$ (condition 3.4(iii)), existence of $E\big[\nabla_{\gamma\gamma'} \ln f(w_1|x,\gamma_0)\big]$ (condition 3.4(iii)), and the Lindeberg–Lévy central limit theorem. Condition 3.1(iv) follows from Lemma 2.4 in Newey and McFadden (1994), which requires compactness of $\Gamma$ and $\gamma_0$ being an interior point of $\Gamma$, applied with $a(z,\theta) = \nabla_{\gamma\gamma'} \ln f(w_1|x,\gamma)$ using conditions (ii) and (v). Condition 3.1(v) follows from non-singularity of $E\big[\nabla_{\gamma\gamma'} \ln f(w_1|x,\gamma_0)\big]$ using condition 3.4(iv). Asymptotic normality then follows from the conclusion of Theorem 3.1 in Newey and McFadden (1994). ∎
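For reference in the proofs that follow, the doubly weighted sample problem can be restated compactly (a sketch in the chapter's notation, not a new result; $q$ is the objective function, $r$ and $p_g$ the missingness and assignment probabilities):
$$
\hat\theta_g \;\in\; \operatorname*{arg\,min}_{\theta_g \in \Theta_g}\;
\frac{1}{N_g} \sum_{i=1}^{N} \frac{s_i\, w_{ig}}{r(x_i, w_{ig})\, p_g(x_i)}\;
q\big(y_i(g), x_i, \theta_g\big),
$$
with the parametric weights $R(x_i, w_{ig}, \hat\delta)$ and $G(x_i, \hat\gamma)$ (or $1 - G(x_i, \hat\gamma)$ for $g = 0$) replacing $r(\cdot)$ and $p_g(\cdot)$ in the feasible version studied in Section 3.5.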
Proof of Theorem 3.4.1

Proof. I have already shown that
$$
E\left[\frac{s}{r(x,w_g)}\cdot\frac{w_g}{p_g(x)}\cdot q\big(y(g),x,\theta_g\big)\right] = E\Big[q\big(y(g),x,\theta_g\big)\Big]
$$
for both $g = 0, 1$. Now, one needs to prove uniform convergence of the weighted sample objective function to its population expectation. Formally, I need to show
$$
\sup_{\theta_g\in\Theta_g}\left\| \frac{1}{N_g}\sum_{i=1}^{N} \frac{s_i\, w_{ig}}{r(x_i,w_{ig})\, p_g(x_i)}\, q\big(y_i(g),x_i,\theta_g\big) - E\left[\frac{s\, w_g}{r(x,w_g)\, p_g(x)}\, q\big(y(g),x,\theta_g\big)\right]\right\| \xrightarrow{p} 0. \tag{J.1}
$$
Then consider
$$
\left|\frac{s\, w_g}{r(x,w_g)\, p_g(x)}\, q\big(y(g),x,\theta_g\big)\right| \;\le\; \frac{\big|q\big(y(g),x,\theta_g\big)\big|}{\eta\cdot\kappa_g} \;\le\; \frac{b\big(y(g),x\big)}{\eta\cdot\kappa_g}. \tag{J.2}
$$
Inequality (J.2) holds due to part (3) of Assumptions 3.2.2 and 3.2.3. Now, $E\big[b(y(g),x)\big] < \infty$ by condition (3) of this theorem. Therefore, uniform convergence is established by Lemma 2.4 of Newey and McFadden (1994). Hence, $\hat\theta_g \xrightarrow{p} \theta_g^0$. Replacing the true probabilities $r(\cdot)$ and $p_g(\cdot)$ by their consistent estimates does not change the above result. ∎

Proof of Theorem 3.4.2

Proof. Following Newey and McFadden (1994), with minor modifications, I first show that $\sqrt{N}\,\|\hat\theta_g - \theta_g^0\| = O_p(1)$, i.e., that $\hat\theta_g$ is $\sqrt{N}$-consistent. A second-order Taylor approximation gives
$$
\begin{aligned}
Q_0(\theta_g) &= Q_0(\theta_g^0) + \nabla_{\theta_g} Q_0(\theta_g^0)'(\theta_g - \theta_g^0) + (\theta_g - \theta_g^0)' H_g (\theta_g - \theta_g^0)/2 + o\big(\|\theta_g - \theta_g^0\|^2\big)\\
&= Q_0(\theta_g^0) + (\theta_g - \theta_g^0)' H_g (\theta_g - \theta_g^0)/2 + o\big(\|\theta_g - \theta_g^0\|^2\big), \tag{J.3}
\end{aligned}
$$
where the second equality holds because $Q_0(\theta_g)$ has a local minimum at $\theta_g^0$, so the first derivative is zero there. Since $H_g$ is positive definite and nonsingular, there exists a constant $C > 0$ and a small enough neighborhood of $\theta_g^0$ such that
$$
(\theta_g - \theta_g^0)' H_g (\theta_g - \theta_g^0)/2 + o\big(\|\theta_g - \theta_g^0\|^2\big) \ge C\,\|\theta_g - \theta_g^0\|^2.
$$
Therefore, since $\hat\theta_g \xrightarrow{p} \theta_g^0$, with probability approaching one we can rewrite (J.3) as
$$
Q_0(\theta_g) \ge Q_0(\theta_g^0) + C\,\|\theta_g - \theta_g^0\|^2.
$$
Define
$$
R_N(\theta_g) = Q_N(\theta_g) - Q_N(\theta_g^0) - \big(Q_0(\theta_g) - Q_0(\theta_g^0)\big) - \nabla_{\theta_g} Q_N(\theta_g^0)'(\theta_g - \theta_g^0).
$$
Using Ossiander's entropy conditions given in 4.2(6) and 4.2(7), along with i.i.d. sampling as given in Assumption 2.4, one obtains stochastic equicontinuity from Theorem 4 and Theorem 5 (with $p = 2$) of Andrews (1994). Hence, for any sequence $\beta_N \to 0$,
$$
\sup_{\|\theta_g - \theta_g^0\| \le \beta_N} \frac{\sqrt{N}\,|R_N(\theta_g)|}{\|\theta_g - \theta_g^0\|\big(1 + \sqrt{N}\,\|\theta_g - \theta_g^0\|\big)} = o_p(1).
$$
In other words, with probability approaching one, for all $\theta_g$,
$$
\sqrt{N}\,|R_N(\theta_g)| \le \|\theta_g - \theta_g^0\|\Big(1 + \sqrt{N}\,\|\theta_g - \theta_g^0\|\Big)\, o_p(1). \tag{J.4}
$$
Choose $U_N$ so that $\hat\theta_g \in U_N$ with probability approaching one, so that (J.4) holds.
Again, since $\hat\theta_g$ is consistent for $\theta_g^0$, we can write
$$
\begin{aligned}
0 &\ge Q_N(\hat\theta_g) - Q_N(\theta_g^0) - o_p(N^{-1})\\
&= Q_0(\hat\theta_g) - Q_0(\theta_g^0) + \nabla_{\theta_g} Q_N(\theta_g^0)'(\hat\theta_g - \theta_g^0) + R_N(\hat\theta_g) - o_p(N^{-1})\\
&\ge C\,\|\hat\theta_g - \theta_g^0\|^2 - \|\nabla_{\theta_g} Q_N(\theta_g^0)\|\,\|\hat\theta_g - \theta_g^0\| - \|\hat\theta_g - \theta_g^0\|\Big(1 + \sqrt{N}\,\|\hat\theta_g - \theta_g^0\|\Big)\, o_p(N^{-1/2}) - o_p(N^{-1})\\
&\ge \big[C + o_p(1)\big]\,\|\hat\theta_g - \theta_g^0\|^2 - \big[O_p(N^{-1/2}) + o_p(N^{-1/2})\big]\,\|\hat\theta_g - \theta_g^0\| - o_p(N^{-1}). \tag{J.5}
\end{aligned}
$$
The simplification in the last line uses $\|\nabla_{\theta_g} Q_N(\theta_g^0)\| = O_p(N^{-1/2})$ and absorbs the term $\sqrt{N}\,\|\hat\theta_g - \theta_g^0\|^2\, o_p(N^{-1/2}) = \|\hat\theta_g - \theta_g^0\|^2\, o_p(1)$ into the leading quadratic term. We now complete the square with $x = \|\hat\theta_g - \theta_g^0\|$, $b = O_p(N^{-1/2})$, and $c = -o_p(N^{-1})$ to obtain
$$
\left(\|\hat\theta_g - \theta_g^0\| + \frac{O_p(N^{-1/2})}{2}\right)^2 \le o_p(N^{-1}) + O_p(N^{-1/2})\cdot O_p(N^{-1/2}).
$$
By the rules of the asymptotic order notation, $O_p(N^{-1/2})\cdot O_p(N^{-1/2}) = O_p(N^{-1})$.
Therefore, we obtain
$$
\left(\|\hat\theta_g - \theta_g^0\| + O_p(N^{-1/2})\right)^2 \le O_p(N^{-1}).
$$
Taking a square root on both sides,
$$
\Big|\,\|\hat\theta_g - \theta_g^0\| + O_p(N^{-1/2})\,\Big| \le O_p(N^{-1/2}). \tag{J.6}
$$
Now, by the triangle inequality,
$$
\|\hat\theta_g - \theta_g^0\| = \|\hat\theta_g - \theta_g^0\| + O_p(N^{-1/2}) - O_p(N^{-1/2}) \le \Big|\,\|\hat\theta_g - \theta_g^0\| + O_p(N^{-1/2})\,\Big| + \big|{-O_p(N^{-1/2})}\big| \le O_p(N^{-1/2})
$$
by equation (J.6). Hence, we have established that $\hat\theta_g$ is $\sqrt{N}$-consistent. Now let $\ddot\theta_g = \theta_g^0 - H_g^{-1}\nabla_{\theta_g} Q_N(\theta_g^0)$; then $\ddot\theta_g$ is $\sqrt{N}$-consistent almost by construction, since $\nabla_{\theta_g} Q_N(\theta_g^0)$ is $O_p(N^{-1/2})$. Now consider
$$
Q_N(\hat\theta_g) - Q_N(\theta_g^0) = Q_0(\hat\theta_g) - Q_0(\theta_g^0) + \nabla_{\theta_g} Q_N(\theta_g^0)'(\hat\theta_g - \theta_g^0) + R_N(\hat\theta_g) + o_p(N^{-1}).
$$
Using (J.3) gives
$$
Q_N(\hat\theta_g) - Q_N(\theta_g^0) = (\hat\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0)/2 + o\big(\|\hat\theta_g - \theta_g^0\|^2\big) + \nabla_{\theta_g} Q_N(\theta_g^0)'(\hat\theta_g - \theta_g^0) + R_N(\hat\theta_g) + o_p(N^{-1}).
$$
Therefore, using the fact that $\nabla_{\theta_g} Q_N(\theta_g^0) = -H_g(\ddot\theta_g - \theta_g^0)$, we get
$$
2\big[Q_N(\hat\theta_g) - Q_N(\theta_g^0)\big] = (\hat\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0) - 2(\ddot\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0) + o_p(N^{-1}).
$$
To see that the remaining terms are of order $o_p(N^{-1})$, observe that
$$
o\big(\|\hat\theta_g - \theta_g^0\|^2\big) + R_N(\hat\theta_g) + o_p(N^{-1}) = o\big(O_p(N^{-1/2})\cdot O_p(N^{-1/2})\big) + O_p(N^{-1/2})\cdot o_p(N^{-1/2}) + o_p(N^{-1}) = o_p(N^{-1}).
$$
In a similar manner, we can write
$$
2\big[Q_N(\ddot\theta_g) - Q_N(\theta_g^0)\big] = (\ddot\theta_g - \theta_g^0)' H_g (\ddot\theta_g - \theta_g^0) - 2(\ddot\theta_g - \theta_g^0)' H_g (\ddot\theta_g - \theta_g^0) + o_p(N^{-1}) = -(\ddot\theta_g - \theta_g^0)' H_g (\ddot\theta_g - \theta_g^0) + o_p(N^{-1}).
$$
Then
$$
2\big[Q_N(\hat\theta_g) - Q_N(\theta_g^0)\big] - 2\big[Q_N(\ddot\theta_g) - Q_N(\theta_g^0)\big] = (\hat\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0) - 2(\ddot\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0) + (\ddot\theta_g - \theta_g^0)' H_g (\ddot\theta_g - \theta_g^0) + o_p(N^{-1}),
$$
where $2\big[Q_N(\hat\theta_g) - Q_N(\theta_g^0)\big] - 2\big[Q_N(\ddot\theta_g) - Q_N(\theta_g^0)\big] \le o_p(N^{-1})$, and
$$
o_p(N^{-1}) \ge (\hat\theta_g - \ddot\theta_g)' H_g (\hat\theta_g - \ddot\theta_g) \ge C\,\|\hat\theta_g - \ddot\theta_g\|^2.
$$
Hence,
$$
\Big\|\sqrt{N}(\hat\theta_g - \theta_g^0) - \big({-H_g^{-1}\sqrt{N}\,\nabla_{\theta_g} Q_N(\theta_g^0)}\big)\Big\| = \sqrt{N}\,\|\hat\theta_g - \ddot\theta_g\| \xrightarrow{p} 0.
$$
Therefore, the conclusion follows from the fact that
$$
-H_g^{-1}\sqrt{N}\,\nabla_{\theta_g} Q_N(\theta_g^0) \xrightarrow{d} N\big(0,\, H_g^{-1}\Omega_g H_g^{-1}\big). \;\blacksquare
$$

Proof of Theorem 3.4.3

Proof.
Consider
$$
\Sigma_1 - \Omega_1 = E\big(l_i l_i'\big) - \Big\{ E\big(l_i b_i'\big)\big[E\big(b_i b_i'\big)\big]^{-1} E\big(b_i l_i'\big) + E\big(l_i d_i'\big)\big[E\big(d_i d_i'\big)\big]^{-1} E\big(d_i l_i'\big) \Big\}.
$$
Since each component matrix in the above expression is positive semi-definite, the sum of the two matrices is also positive semi-definite. The proof for the control group follows analogously. ∎

Proof of Theorem 3.5.4

Proof. I have already shown that the population problem based on
$$
E\left[\frac{s\, w_1}{R(x,w_1,\delta^*)\, G(x,\gamma^*)}\; q\big(y(1),x,\theta_1\big)\right]
$$
identifies the parameter of interest, $\theta_1^0$, under the strong identification condition given in 3.5.1. In order to prove consistency of $\hat\theta_1$ for $\theta_1^0$, we need to prove uniform convergence of the weighted sample objective function to its population expectation. Formally, we need to show
$$
\sup_{\theta_1\in\Theta_1}\left\| \frac{1}{N_1}\sum_{i=1}^{N} \frac{s_i\, w_{i1}}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\, q\big(y_i(1),x_i,\theta_1\big) - E\left[\frac{s\, w_1}{R(x,w_1,\delta^*)\, G(x,\gamma^*)}\, q\big(y(1),x,\theta_1\big)\right]\right\| \xrightarrow{p} 0.
$$
Replacing $r(x,w_1)$ and $p_1(x)$ in the proof of Theorem 3.4.1 by $R(x,w_1,\delta^*)$ and $G(x,\gamma^*)$ gives the desired result. Consistency of $\hat\theta_0$ for $\theta_0^0$ can be established analogously by replacing $w_1$, $G(x,\gamma^*)$, and $R(x,w_1,\delta^*)$ above with $w_0$, $1 - G(x,\gamma^*)$, and $R(x,w_0,\delta^*)$, respectively. ∎

Proof of Theorem 3.5.5

Proof. The proof of this theorem follows in the manner of Theorem 3.4.2, but where $H_g$ now denotes the non-singular Hessian, with weights given by $G(x,\gamma^*)$ and $R(x,w_g,\delta^*)$. Also, $\Omega_g$ now denotes the variance of the doubly weighted scores, $l_i$ and $k_i$, for the treatment and control group problems, respectively. ∎

Proof of Corollary 3.5.6

Proof. This proof follows from the proof of the above theorem, 3.5.5, and the asymptotic variance of the estimator that uses known weights, which is
$$
\mathrm{Avar}\,\sqrt{N}\big(\tilde\theta_g - \theta_g^0\big) = H_g^{-1}\,\Omega_g\, H_g^{-1},
$$
where $\Omega_1 = E\big(l_i l_i'\big)$ and $\Omega_0 = E\big(k_i k_i'\big)$. The result follows immediately. ∎

Proof of Theorem 3.5.7

Proof.
Using two applications of the LIE and invoking ignorability and unconfoundedness, I can rewrite
$$
E\left[\frac{s_i\, w_{i1}}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\; q\big(y_i(1),x_i,\theta_1^0\big)\right]
= E\left[\frac{r(x_i,w_{i1})\, p_1(x_i)}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\; E\big(q(y_i(1),x_i,\theta_1^0)\mid x_i\big)\right].
$$
Then
$$
H_1 = \nabla^2_{\theta_1}\, E\left[\frac{r(x_i,w_{i1})\, p_1(x_i)}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\; E\big(q(y_i(1),x_i,\theta_1^0)\mid x_i\big)\right]
= E\left[\frac{r(x_i,w_{i1})\, p_1(x_i)}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\; A(x_i,\theta_1^0)\right].
$$
Similarly, I use the LIE to express $\Omega_1$ as
$$
\begin{aligned}
\Omega_1 &= E\left[\frac{r(x_i,w_i)\, p(x_i)}{R^2(x_i,w_i,\delta^*)\, G^2(x_i,\gamma^*)}\; E\big(\nabla_{\theta_1} q(y_i(1),x_i,\theta_1^0)'\,\nabla_{\theta_1} q(y_i(1),x_i,\theta_1^0)\mid x_i\big)\right]\\
&= \sigma^2_{01}\cdot E\left[\frac{r(x_i,w_i)\, p(x_i)}{R^2(x_i,w_i,\delta^*)\, G^2(x_i,\gamma^*)}\; A(x_i,\theta_1^0)\right].
\end{aligned}
$$
For the unweighted estimator, the variance simplifies, and this happens precisely because of the GCIME. To see this, consider $H_1^u$; using the LIE,
$$
H_1^u = E\big[r(x_i,w_{i1})\, p_1(x_i)\, A(x_i,\theta_1^0)\big],
$$
and similarly we can rewrite $\Omega_1^u$ using the LIE as
$$
\Omega_1^u = \sigma^2_{01}\cdot E\big[r(x_i,w_i)\, p(x_i)\, A(x_i,\theta_1^0)\big].
$$
Therefore, the asymptotic variance simplifies to
$$
\mathrm{Avar}\,\sqrt{N}\big(\hat\theta_1^u - \theta_1^0\big) = \sigma^2_{01}\,\Big( E\big[r(x_i,w_i)\, p(x_i)\, A(x_i,\theta_1^0)\big] \Big)^{-1}.
$$
Let $B_i = r_i^{1/2}\, p_i^{1/2}\, A_i^{1/2}$ and $D_i = \dfrac{r_i^{1/2}\, p_i^{1/2}}{R_i\, G_i}\, A_i^{1/2}$. Then
$$
\frac{1}{\sigma^2_{01}}\,\mathrm{Avar}\,\sqrt{N}\big(\hat\theta_1 - \theta_1^0\big) = \big[E(B_i' D_i)\big]^{-1}\, E(D_i' D_i)\, \big[E(D_i' B_i)\big]^{-1},
\qquad
\frac{1}{\sigma^2_{01}}\,\mathrm{Avar}\,\sqrt{N}\big(\hat\theta_1^u - \theta_1^0\big) = \big[E(B_i' B_i)\big]^{-1}.
$$
For showing that the difference of the two variances is positive semi-definite, consider
$$
E(B_i' B_i) - E(B_i' D_i)\big[E(D_i' D_i)\big]^{-1} E(D_i' B_i),
$$
where the quantity is nothing but the variance of the residuals from the population regression of $B_i$ on $D_i$. Hence, the difference is positive semi-definite. The results for the control group can be proven analogously. ∎

Proof of Theorem F.2.1
Consider, that |QN ( ˆθg + aεN ) − QN (θ0 g)| g) − Q0( ˆθg + aεN ) + Q0(θ0 Then we know that, RN (θg + aεN ) + ∇θg QN (θ0 QN (θg + aεN ) − QN (θ0 g)(cid:48)(θg + aεN − θ0 g) = g) − Q0(θg + aεN ) + Q0(θ0 g) Using eq(J.7) with eq(J.8) we obtain, |QN ( ˆθg + aεN ) − QN (θ0 = |RN ( ˆθg + aεN ) + ∇θg QN (θ0 g) − Q0( ˆθg + aεN ) + Q0(θ0 g)| g)(cid:48)( ˆθg + aεN − θ0 g)| Then using Triangle and Cauchy-Schwartz inequality, (J.7) (J.8) (J.9) |QN ( ˆθg + aεN ) − QN (θ0 ≤ |RN ( ˆθg + aεN )| + (cid:13)(cid:13)(cid:13) g) − Q0( ˆθg + aεN ) + Q0(θ0 g)| (cid:13)(cid:13)(cid:13)∇θg QN (θ0 (cid:13)(cid:13)(cid:13)(cid:18) g)(cid:48)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˆθg + aεN − θ0 (cid:13)(cid:13)(cid:13)(cid:19) (cid:13)(cid:13)(cid:13)θg + aεN − θ0 1 + g g √ op(1/ N ) Now, using stochastic equicontinuity condition, √ N RN (θg) ≤(cid:13)(cid:13)(cid:13)θg + aεN − θ0 g Then, |QN ( ˆθg + aεN ) − QN (θ0 ≤(cid:13)(cid:13)(cid:13)θg + aεN − θ0 (cid:13)(cid:13)(cid:13)(cid:18) g)| g) − Q0( ˆθg + aεN ) + Q0(θ0 √ N (cid:13)(cid:13)(cid:13)θg + aεN − θ0 (cid:13)(cid:13)(cid:13)(cid:19) 1 + g g √ N ) op(1/ + Op(N−1/2) · Op(εN ) (cid:104) √ = Op(εN ) = op(ε2 N ) 1 + N Op(εN ) √ op(1/ √ N ) N ) + Op(εN / (cid:105) 213 Hence, |QN ( ˆθg + aεN ) − QN (θ0 g) − (Q0( ˆθg + aεN ) − Q0(θ0 ε2 N Since, Q0(θg) is twice differentiable in θ0 g, − a(cid:48)Hga (cid:105) g))| (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 2 ε2 N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:104) Q0( ˆθg + aεN ) − Q0(θ0 g) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:34) (cid:12)(cid:12)(cid:12)(cid:12) 1 ( ˆθg + εN a − θ0 (cid:12)(cid:12)(cid:12)(cid:12) + ( ˆθg − θ0 g)(cid:48)Hga (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 ε2 N εN ε2 N 2 = ≤ g)(cid:48)Hg( ˆθg + εN a − θ0 g) ( ˆθg − θ0 g)(cid:48)Hg( ˆθg − θ0 g) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:104) Then using J.10, J.11 and triangle inequality, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 QN ( ˆθg + aεN ) − QN (θ0 (cid:104) g) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 QN ( ˆθg + aεN ) − QN (θ0 (cid:104) Q0( ˆθg + aεN ) − Q0(θ0 g) ε2 N ε2 N ≤ + ε2 N = op(1) (J.10) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − a(cid:48)Hga 2 (J.11) g + o (cid:13)(cid:13)(cid:13)2(cid:19)(cid:35) (cid:18)(cid:13)(cid:13)(cid:13) ˆθg + εN a − θ0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + op(1) = op(1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:105) − a(cid:48)Hga (cid:105) − a(cid:48)Hga (cid:105)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) g) − Q0( ˆθg + aεN ) + Q0(θ0 g) 2 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤op(1) + op(1) = op(1) It follows that, ˆHgjk + = (cid:35) QN ( ˆθg + ejεN + ekεN ) − QN ( ˆθg − ejεN + ekεN ) − QN ( ˆθg + ejεN − ekεN ) (cid:34) (cid:34) QN ( ˆθg − ejεN − ekεN ) p→(cid:104) (cid:104) 2(ej + ek)(cid:48)Hgjk(ej + ek) − (ej − ek)(cid:48)Hgjk(ej − ek) − (ek − ej)(cid:48)Hgjk(ek − ej) e(cid:48) jHgjkej + e(cid:48) = 2 = e(cid:48) kHgjkek − e(cid:48) iHgjkei − e(cid:48) /8 + e(cid:48) kHgjkek jHgjkek 4ε2 N 4ε2 N (cid:105) jHgjkek = Hgjk (cid:35) (cid:105) /8 214 Pooled slopes (cid:4) Proof. Let us assume that m(x, θ) = h(α + xβ + ηw1) is the chosen mean function for µ(x). 
Pooled slopes

Proof. Let us assume that $m(x, \theta) = h(\alpha + x\beta + \eta w_1)$ is the chosen mean function for $\mu(x)$. Then, in the presence of non-random sampling, we have the following first order conditions:

$$
\sum_{i=1}^{N} s_i \left( \frac{w_{i1}}{\hat R \, \hat G} + \frac{w_{i0}}{\hat R \, (1 - \hat G)} \right) \Big[ y_i - h\big(\hat\alpha + x_i\hat\beta + \hat\eta w_{i1}\big) \Big] = 0
$$
$$
\sum_{i=1}^{N} s_i \left( \frac{w_{i1}}{\hat R \, \hat G} + \frac{w_{i0}}{\hat R \, (1 - \hat G)} \right) x_i' \Big[ y_i - h\big(\hat\alpha + x_i\hat\beta + \hat\eta w_{i1}\big) \Big] = 0
$$
$$
\sum_{i=1}^{N} s_i \, \frac{w_{i1}}{\hat R \, \hat G} \Big[ y_i - h\big(\hat\alpha + x_i\hat\beta + \hat\eta w_{i1}\big) \Big] = 0
$$

where $\hat R = R(x, w, \hat\delta)$ and $\hat G = G(x, \hat\gamma)$. Ignoring the set of conditions corresponding to the slope parameter $\beta$, the population counterparts to the above FOCs are

$$
E\left[ s \left( \frac{w_1}{R \, G} + \frac{w_0}{R \, (1-G)} \right) \Big( y - h(\alpha^* + x\beta^* + \eta^* w_1) \Big) \right] = 0 \tag{J.12}
$$
$$
E\left[ \frac{s \, w_1}{R \, G} \Big( y - h(\alpha^* + x\beta^* + \eta^* w_1) \Big) \right] = 0 \tag{J.13}
$$

where $\alpha^*$, $\beta^*$, and $\eta^*$ are the probability limits of the QMLE estimators $\hat\alpha$, $\hat\beta$, and $\hat\eta$. Rearranging (J.12) and (J.13) gives us

$$
E\left[ s \left( \frac{w_1}{R G} + \frac{w_0}{R(1-G)} \right) y \right]
= E\left[ s \left( \frac{w_1}{R G} + \frac{w_0}{R(1-G)} \right) h(\alpha^* + x\beta^* + \eta^* w_1) \right] \tag{J.14}
$$
$$
E\left[ \frac{s \, w_1}{R G} \, y \right] = E\left[ \frac{s \, w_1}{R G} \, h(\alpha^* + x\beta^* + \eta^* w_1) \right] \tag{J.15}
$$

Now, $y = y(0) \, w_0 + y(1) \, w_1$, which implies that we can replace $y$ in the above two equations to obtain the LHS of (J.14) equal to

$$
E\left[ s \left( \frac{w_1 \, y(1)}{R G} + \frac{w_0 \, y(0)}{R(1-G)} \right) \right]
$$

By using iterated expectations, we can rewrite the above as

$$
E\left[ \frac{w_1}{G \, R} \, E\big(s \, y(1) \mid x, w_1\big) + \frac{w_0}{(1-G) \, R} \, E\big(s \, y(0) \mid x, w_0\big) \right]
$$

Due to ignorability of sample selection, we can split each conditional expectation into parts:

$$
E\left[ \frac{w_1}{G \, R} \, E(s \mid x, w_1) \, E\big(y(1) \mid x, w_1\big) + \frac{w_0}{(1-G) \, R} \, E(s \mid x, w_0) \, E\big(y(0) \mid x, w_0\big) \right]
$$

Note that $w_1 \cdot E(s \mid x, w_1) = w_1 \cdot R$ and, similarly, $w_0 \cdot E(s \mid x, w_0) = w_0 \cdot R$, while unconfoundedness gives $E\big(y(1) \mid x, w_1\big) = E\big(y(1) \mid x\big)$ and $E\big(y(0) \mid x, w_0\big) = E\big(y(0) \mid x\big)$. Therefore, we can simplify the above expression into

$$
E\left[ \frac{w_1}{G} \, E\big(y(1) \mid x\big) + \frac{w_0}{1-G} \, E\big(y(0) \mid x\big) \right]
$$

Another application of iterated expectations gives us

$$
E\left[ \frac{\mu_1(x)}{G} \, E(w_1 \mid x) + \frac{\mu_0(x)}{1-G} \, E(w_0 \mid x) \right]
= E\big[ \mu_1(x) + \mu_0(x) \big]
= E[y(1)] + E[y(0)]
$$

Manipulating the RHS of (J.14) using iterated expectations gives us

$$
E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \left\{ \frac{w_1}{G} \cdot \frac{1}{R} \, E(s \mid x, w_1) + \frac{w_0}{1-G} \cdot \frac{1}{R} \, E(s \mid x, w_0) \right\} \right]
= E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right] + E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_0}{1-G} \right]
$$

Therefore, combining the LHS and RHS gives the result

$$
E[y(1)] + E[y(0)]
= E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right] + E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_0}{1-G} \right] \tag{J.16}
$$
Now, consider the LHS of (J.15):

$$
E\left[ \frac{s \, w_1}{R \, G} \, y \right] = E\left[ \frac{s \, w_1}{R \, G} \, y(1) \right] = E[y(1)] \quad \text{by the LIE}
$$

Similarly, using the LIE, the RHS of (J.15) can be rewritten as

$$
E\left[ \frac{s \, w_1}{R \, G} \, h(\alpha^* + x\beta^* + \eta^* w_1) \right]
= E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \cdot \frac{1}{R} \, E(s \mid x, w_1) \right]
= E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right]
$$

Therefore, combining the LHS and RHS gives us

$$
E[y(1)] = E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right] \tag{J.17}
$$

Then using (J.17) along with (J.16) implies that

$$
E[y(0)] = E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_0}{1-G} \right] \tag{J.18}
$$

Consider

$$
E\big[ h(\alpha^* + x\beta^* + \eta^* w_1) \, w_1 \mid x \big]
= h(\alpha^* + x\beta^* + \eta^*) \cdot P(w_1 = 1 \mid x)
= h(\alpha^* + x\beta^* + \eta^*) \cdot G
$$

Therefore,

$$
E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right] = E\big[ h(\alpha^* + x\beta^* + \eta^*) \big]
$$

Similarly, we can also show that

$$
E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_0}{1-G} \right] = E\big[ h(\alpha^* + x\beta^*) \big]
$$

Hence, the pooled regression adjustment estimand can be written as

$$
\tau_{PRA} = E\big[ h(\alpha^* + x\beta^* + \eta^*) \big] - E\big[ h(\alpha^* + x\beta^*) \big]
$$

so a consistent QMLE pooled regression adjustment estimator can be obtained by replacing the population expectations with sample averages in the above expression, weighting by the appropriate probabilities where needed to recover the balance of the random sample, which gives us

$$
\hat\tau_{PRA} = \frac{1}{N} \sum_{i=1}^{N} h\big(\hat\alpha + x_i\hat\beta + \hat\eta\big) - \frac{1}{N} \sum_{i=1}^{N} h\big(\hat\alpha + x_i\hat\beta\big) \quad \square
$$

Separate slopes

Proof. Let us assume that $m_g(x, \theta_g) = h(\alpha_g + x\beta_g)$ is the chosen mean function for $\mu_g(x)$. Then the population FOCs are

$$
E\left[ \frac{s \, w_1}{R \, G} \Big( y - h(\alpha_1^* + x\beta_1^*) \Big) \right] = 0 \tag{J.19}
$$
$$
E\left[ \frac{s \, w_0}{R \, (1-G)} \Big( y - h(\alpha_0^* + x\beta_0^*) \Big) \right] = 0 \tag{J.20}
$$

where $\alpha_g^*$, $\beta_g^*$ are the probability limits of the QMLE estimators $\hat\alpha_g$, $\hat\beta_g$. Rearranging (J.19) and (J.20) just like in the pooled case gives us the following equalities:

$$
E\left[ \frac{s \, w_1}{R \, G} \, y \right] = E\left[ \frac{s \, w_1}{R \, G} \, h(\alpha_1^* + x\beta_1^*) \right] \tag{J.21}
$$
$$
E\left[ \frac{s \, w_0}{R \, (1-G)} \, y \right] = E\left[ \frac{s \, w_0}{R \, (1-G)} \, h(\alpha_0^* + x\beta_0^*) \right] \tag{J.22}
$$

Proceeding with the above two equations in the same way as in the pooled case gives us the results

$$
E[y(1)] = E\big[ h(\alpha_1^* + x\beta_1^*) \big], \qquad E[y(0)] = E\big[ h(\alpha_0^* + x\beta_0^*) \big]
$$

Therefore,

$$
\tau_{FRA} = E\big[ h(\alpha_1^* + x\beta_1^*) \big] - E\big[ h(\alpha_0^* + x\beta_0^*) \big]
$$

and so a consistent QMLE separate regression adjustment estimator can be obtained as

$$
\hat\tau_{FRA} = \frac{1}{N} \sum_{i=1}^{N} h\big(\hat\alpha_1 + x_i\hat\beta_1\big) - \frac{1}{N} \sum_{i=1}^{N} h\big(\hat\alpha_0 + x_i\hat\beta_0\big)
$$

Consistency of $\hat\tau_{PRA}$ for $\tau_{PRA}$ and of $\hat\tau_{FRA}$ for $\tau_{FRA}$ follows from the results on double weighting and from generalized linear model properties. Recall that the framework of this paper does not rely on correct specification of a conditional mean of the distribution: both cases are allowed, one in which the mean function is correctly specified but everything else about the distribution is misspecified, and one in which everything, including the mean, is misspecified. In both cases, results from quasi-maximum likelihood in the linear exponential family are instrumental in guaranteeing consistency of the pooled and separate slopes methods. $\square$
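To make the pooled and separate estimators concrete, here is a minimal sketch with a logistic mean $h(z) = 1/(1 + e^{-z})$ (appropriate for binary or fractional responses), assuming that the fitted selection and propensity probabilities `R_hat` and `G_hat` come from first-stage models estimated elsewhere; all function and variable names are illustrative, not the chapter's own code.

```python
import numpy as np
from scipy.optimize import minimize

def h(z):
    """Logistic mean function."""
    return 1.0 / (1.0 + np.exp(-z))

def weighted_bernoulli_qmle(y, Z, w):
    """Maximize the weighted Bernoulli quasi-log-likelihood (LEF family),
    whose first order conditions match the doubly weighted FOCs above."""
    def neg_qll(theta):
        m = np.clip(h(Z @ theta), 1e-10, 1 - 1e-10)
        return -np.sum(w * (y * np.log(m) + (1.0 - y) * np.log(1.0 - m)))
    return minimize(neg_qll, np.zeros(Z.shape[1]), method="BFGS").x

def doubly_weighted_ra(y, x, w1, s, R_hat, G_hat):
    """Return (tau_PRA_hat, tau_FRA_hat) under double weighting."""
    y = np.where(s == 1, y, 0.0)             # y gets zero weight when s = 0
    w0 = 1.0 - w1
    wt1 = s * w1 / (R_hat * G_hat)           # treated, selected
    wt0 = s * w0 / (R_hat * (1.0 - G_hat))   # control, selected
    N = len(y)
    # Pooled slopes: h(alpha + x beta + eta w1) on the combined weighted sample
    Zp = np.column_stack([np.ones(N), x, w1])
    tp = weighted_bernoulli_qmle(y, Zp, wt1 + wt0)
    alpha, beta, eta = tp[0], tp[1:-1], tp[-1]
    tau_pra = np.mean(h(alpha + x @ beta + eta) - h(alpha + x @ beta))
    # Separate slopes: h(alpha_g + x beta_g) within each assignment arm
    Zs = np.column_stack([np.ones(N), x])
    t1 = weighted_bernoulli_qmle(y, Zs, wt1)
    t0 = weighted_bernoulli_qmle(y, Zs, wt0)
    tau_fra = np.mean(h(Zs @ t1) - h(Zs @ t0))
    return tau_pra, tau_fra
```

Note that the final averages run over the full random sample of $x_i$, mirroring $\hat\tau_{PRA}$ and $\hat\tau_{FRA}$ above, while only selected observations carry weight in the QMLE steps.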
Asymptotic variance expression for ATE: Correct CEF

Proof. Assuming continuous differentiability of $m_g(x_i, \theta_g)$ on $\Theta_g$, a mean value expansion around $\theta_g^0$ gives

$$
\frac{1}{N}\sum_{i=1}^{N} m_g(x_i, \hat\theta_g)
= \frac{1}{N}\sum_{i=1}^{N} m_g(x_i, \theta_g^0)
+ \left[\frac{1}{N}\sum_{i=1}^{N} \nabla_{\theta_g} m_g(x_i, \ddot\theta_g)\right] \big(\hat\theta_g - \theta_g^0\big)
$$

where $\ddot\theta_g$ lies between $\hat\theta_g$ and $\theta_g^0$. Since $\hat\theta_g \xrightarrow{p} \theta_g^0$, so does $\ddot\theta_g$. Hence, using the weak law of large numbers, we obtain

$$
\frac{1}{\sqrt N}\sum_{i=1}^{N} m_g(x_i, \hat\theta_g)
= \frac{1}{\sqrt N}\sum_{i=1}^{N} m_g(x_i, \theta_g^0)
+ E\big[\nabla_{\theta_g} m_g(x_i, \theta_g^0)\big] \cdot \sqrt N \big(\hat\theta_g - \theta_g^0\big) + o_p(1)
$$

Adding and subtracting $\sqrt N \cdot E\big[m_g(x_i, \theta_g^0)\big]$ on both sides gives us

$$
\frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{m_g(x_i, \hat\theta_g) - E\big(m_g(x_i, \theta_g^0)\big)\Big\}
= \frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{m_g(x_i, \theta_g^0) - E\big(m_g(x_i, \theta_g^0)\big)\Big\}
+ E\big[\nabla_{\theta_g} m_g(x_i, \theta_g^0)\big] \cdot \sqrt N \big(\hat\theta_g - \theta_g^0\big) + o_p(1)
$$

Let $E\big[\nabla_{\theta_g} m_g(x_i, \theta_g^0)\big] \equiv G_g^0$. Then, using the asymptotic results from Section 3.5, where we posit that the conditional feature of interest is correctly specified, we have

$$
\sqrt N \big(\hat\theta_1 - \theta_1^0\big) = -H_1^{-1}\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} l_i\right) + o_p(1), \qquad
\sqrt N \big(\hat\theta_0 - \theta_0^0\big) = -H_0^{-1}\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} k_i\right) + o_p(1)
$$

Therefore,

$$
\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big)
= \frac{1}{\sqrt N}\sum_{i=1}^{N}\Big(\big\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big\} - G_1^0 H_1^{-1} l_i + G_0^0 H_0^{-1} k_i\Big) + o_p(1)
$$

We may rewrite the above using the influence function representation as

$$
\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big) = \frac{1}{\sqrt N}\sum_{i=1}^{N} \psi(x_i) + o_p(1), \qquad \text{where } E\big[\psi(x_i)\big] = 0
$$

Then, provided that $E\big[\psi(x_i)\psi(x_i)'\big]$ exists,

$$
\mathrm{Avar}\Big[\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big)\Big]
= E\Big[\big(m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big)^2\Big]
+ G_1^0 \, V_1 \, G_1^{0\prime} + G_0^0 \, V_0 \, G_0^{0\prime}
$$

where $V_g$ denotes the asymptotic variance of $\sqrt N(\hat\theta_g - \theta_g^0)$. Note that the covariance term involving $l_i$ and $k_i$ is zero since the two denote scores for the treatment and control group problems. The covariance terms involving $\big(m_1(x_i,\theta_1^0) - m_0(x_i,\theta_0^0) - \tau_{ate}\big)$ and $l_i$, and involving $\big(m_1(x_i,\theta_1^0) - m_0(x_i,\theta_0^0) - \tau_{ate}\big)$ and $k_i$, will also be zero. This is because $\theta_g^0$ solves the conditional problem, which implies that $E\big[\nabla_{\theta_g} q(y_i(g), x_i, \theta_g^0)' \mid x_i\big] = 0$ (i.e., for $g = 1$, $E(l_i \mid x_i) = 0$ and, for $g = 0$, $E(k_i \mid x_i) = 0$). Then, using the LIE, those covariance terms can be shown to be zero.
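Given estimates of the pieces entering $\psi$, the asymptotic variance can be estimated by the sample second moment of the plugged-in influence function. A sketch under the correct-CEF representation above, with all inputs (fitted means, per-observation scores, $\hat G_g^0$, $\hat H_g^{-1}$) assumed precomputed and all names illustrative:

```python
import numpy as np

def avar_tau_hat(m1, m0, tau_hat, G1, H1_inv, l_scores, G0, H0_inv, k_scores):
    """Plug-in estimate of Avar(sqrt(N)(tau_hat - tau_ate)) via
    psi_i = {m1_i - m0_i - tau} - G1 H1^{-1} l_i + G0 H0^{-1} k_i.
    m1, m0 are length-N arrays of fitted means; l_scores, k_scores are
    (N, K) arrays of score rows; G1, G0 are length-K Jacobian vectors."""
    # Hessians are symmetric, so G H^{-1} l_i equals the dot product of
    # l_i with H^{-1} G.
    psi = (m1 - m0 - tau_hat
           - l_scores @ (H1_inv @ G1)
           + k_scores @ (H0_inv @ G0))
    return float(np.mean(psi ** 2))
```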
Misspecified mean model

In the case of a misspecified mean model, we still have

$$
\frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{m_g(x_i, \hat\theta_g) - E\big(m_g(x_i, \theta_g^0)\big)\Big\}
= \frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{m_g(x_i, \theta_g^0) - E\big(m_g(x_i, \theta_g^0)\big)\Big\}
+ E\big[\nabla_{\theta_g} m_g(x_i, \theta_g^0)\big] \cdot \sqrt N \big(\hat\theta_g - \theta_g^0\big) + o_p(1)
$$

However, now, using the results from Section 3.4,

$$
\sqrt N \big(\hat\theta_1 - \theta_1^0\big)
= -H_1^{-1} \frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{ l_i - E\big(l_i b_i'\big) \big[E\big(b_i b_i'\big)\big]^{-1} b_i - E\big(l_i d_i'\big) \big[E\big(d_i d_i'\big)\big]^{-1} d_i \Big\} + o_p(1)
\equiv -H_1^{-1} \frac{1}{\sqrt N}\sum_{i=1}^{N} u_{i1} + o_p(1)
$$
$$
\sqrt N \big(\hat\theta_0 - \theta_0^0\big)
= -H_0^{-1} \frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{ k_i - E\big(k_i b_i'\big) \big[E\big(b_i b_i'\big)\big]^{-1} b_i - E\big(k_i d_i'\big) \big[E\big(d_i d_i'\big)\big]^{-1} d_i \Big\} + o_p(1)
\equiv -H_0^{-1} \frac{1}{\sqrt N}\sum_{i=1}^{N} u_{i0} + o_p(1)
$$

Then,

$$
\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big)
= \frac{1}{\sqrt N}\sum_{i=1}^{N}\Big(\big\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big\} - G_1^0 H_1^{-1} u_{i1} + G_0^0 H_0^{-1} u_{i0}\Big) + o_p(1)
= \frac{1}{\sqrt N}\sum_{i=1}^{N} \psi(x_i) + o_p(1)
$$

Then,

$$
\mathrm{Avar}\Big[\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big)\Big]
= E\Big[\big(m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big)^2\Big]
+ G_1^0 \, V_1 \, G_1^{0\prime} + G_0^0 \, V_0 \, G_0^{0\prime}
$$
$$
\qquad - 2\, E\Big[\big\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big\}\, u_{i1}'\Big] H_1^{-1\prime} G_1^{0\prime}
+ 2\, E\Big[\big\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big\}\, u_{i0}'\Big] H_0^{-1\prime} G_0^{0\prime} \quad \square
$$
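The only new ingredients relative to the correct-CEF case are the projected scores $u_{ig}$. Below is a sketch of their sample analogues, assuming `l`, `b`, and `d` stack the per-observation score rows from the outcome problem and the two first-stage (selection and propensity) problems; the names are illustrative.

```python
import numpy as np

def projected_scores(l, b, d):
    """Sample analogue of
    u_i = l_i - E(l b')[E(b b')]^{-1} b_i - E(l d')[E(d d')]^{-1} d_i.
    l, b, d are (N, K_l), (N, K_b), (N, K_d) arrays of score rows; the 1/N
    factors in the sample moments cancel within each projection matrix."""
    P_b = (l.T @ b) @ np.linalg.inv(b.T @ b)   # sample E(l b') [E(b b')]^{-1}
    P_d = (l.T @ d) @ np.linalg.inv(d.T @ d)   # sample E(l d') [E(d d')]^{-1}
    return l - b @ P_b.T - d @ P_d.T
```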