ROBUST AND EFFICIENT ESTIMATION OF TREATMENT EFFECTS IN EXPERIMENTAL AND NON-EXPERIMENTAL SETTINGS

By

Akanksha Negi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2020

ABSTRACT

ROBUST AND EFFICIENT ESTIMATION OF TREATMENT EFFECTS IN EXPERIMENTAL AND NON-EXPERIMENTAL SETTINGS

By

Akanksha Negi

Broadly, this dissertation identifies and addresses issues that arise in experimental and observational data contexts for estimating causal effects. In particular, the three chapters in this dissertation focus on issues of consistent and efficient estimation of causal effects using methods that are robust to misspecification of a conditional model of interest.

Chapter 1: Revisiting regression adjustment in experiments with heterogeneous treatment effects

Regression adjustment with covariates in experiments is intended to improve precision over a simple difference in means between the treated and control outcomes. The efficiency argument in favor of regression adjustment has come under criticism lately, with papers such as Freedman (2008a,b) finding no systematic gain in asymptotic efficiency for the covariate-adjusted estimator. This chapter shows that, as in Lin (2013), estimating separate regressions for the control and treated groups is guaranteed to do no worse than both the simple difference-in-means estimator and simply including the covariates in additive fashion. This result appears to be new, and simulations show that the efficiency gains can be substantial. This chapter also discusses some important cases – applicable to binary, fractional, count, and other nonnegative responses – where nonlinear regression adjustment is consistent without any restrictions on the conditional mean functions.

Chapter 2: Robust and efficient estimation of potential outcome means under random assignment

This chapter studies improvements in efficiency for estimating the entire vector of potential outcome means using linear regression adjustment with two or more assignment levels. This chapter shows that using separate regression adjustments for each assignment level is never worse asymptotically than using the subsample averages and that separate regression adjustment generally improves over pooled regression adjustment, except in the obvious case where slope parameters in the linear projections are identical across the different assignment levels. An especially promising direction is to use certain nonlinear regression adjustment methods, which we show to be robust to misspecification of the conditional means. We apply this general potential outcomes framework to a contingent valuation study which seeks to estimate the lower bound mean willingness to pay (WTP) for an oil spill prevention program in California.

Chapter 3: Doubly weighted M-estimation for nonrandom assignment and missing outcomes

This chapter studies the problems of nonrandom assignment and missing outcomes, which, together, undermine the validity of standard causal inference. While the econometrics literature has used weighting to address each issue in isolation, empirical analysis is often complicated by the presence of both. This chapter proposes a new class of inverse probability weighted M-estimators that deal with the two issues by combining propensity score weighting with weighting for missing data.
This chapter also discusses applications of the proposed method for robust estimation of the two prominent causal parameters, namely, the average treatment effect and quantile treatment effects, under misspecification of the framework's parametric components. This chapter also demonstrates the proposed estimator's viability in empirical settings by applying it to the sample of Aid to Families with Dependent Children (AFDC) women from the National Supported Work program compiled by Calónico and Smith (2017).

ACKNOWLEDGEMENTS

I would like to extend a sincere thanks to my advisor, Jeffrey M. Wooldridge, for his unrelenting support, guidance, and insights, which have been invaluable for the development of this project and which have helped me grow into an independent researcher. I would also like to thank my committee members, Steven Haider, Ben Zou, and Kenneth Frank, for discussions from which this work has benefited immensely. A special thanks to Timothy Vogelsang, Jon X. Eguia, Wendun Wang, Mike Conlin, Todd Elder, and Alyssa Carlson for their helpful comments, advice, and encouragement in the years leading up to the job market. This work has evolved significantly thanks to comments and suggestions received from participants at the Econometrics seminar series, Econometrics reading group meetings, and graduate student seminar series at Michigan State University. I am also grateful for the discussions at the Midwest Econometric Group Meetings, International Association for Applied Econometrics, and International Econometrics PhD Conference in Rotterdam.

I would like to acknowledge financial support from the Department of Economics, the Graduate School, the College of Social Sciences, and the Council of Graduate Students, along with the fellowships (the Dissertation Completion Fellowship, the Kelly Research Fellowship, and the Supplemental Research Fellowship) received over the course of these PhD years, which have been instrumental in the successful completion of this work. The administrative help and support received from Lori Jean Nichols and Jay Feight has made navigating this PhD program easier than it would otherwise have been. My warmest gratitude to family and friends who continue to inspire and motivate me to take up new adventures both personally and professionally. Finally, a note of thanks to my dear friend and colleague, Christian Cox, with whom I have shared the vicissitudes of PhD life and who has supported me throughout this journey.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1 REVISITING REGRESSION ADJUSTMENT IN EXPERIMENTS WITH HETEROGENEOUS TREATMENT EFFECTS
  1.1 Introduction
  1.2 Potential outcomes and parameter of interest
  1.3 Random assignment and random sampling
  1.4 Estimators
    1.4.1 Simple difference in means (SDM)
    1.4.2 Pooled regression adjustment (PRA)
    1.4.3 Linear projections and infeasible regression adjustment (IRA)
    1.4.4 Full regression adjustment (FRA)
  1.5 Asymptotic variances and efficiency comparisons
  1.6 Simulations
    1.6.1 Design details
    1.6.2 Discussion of simulation findings
  1.7 Nonlinear regression adjustment
    1.7.1 Pooled nonlinear regression adjustment
    1.7.2 Full nonlinear regression adjustment
    1.7.3 Simulations
  1.8 Concluding remarks
CHAPTER 2 ROBUST AND EFFICIENT ESTIMATION OF POTENTIAL OUTCOME MEANS UNDER RANDOM ASSIGNMENT
  2.1 Introduction
  2.2 Potential outcomes, random assignment, and random sampling
  2.3 Subsample means and linear regression adjustment
    2.3.1 Subsample means
    2.3.2 Full regression adjustment
    2.3.3 Pooled regression adjustment
  2.4 Comparing the asymptotic variances
    2.4.1 Comparing FRA to subsample means
    2.4.2 Full RA versus pooled RA
  2.5 Nonlinear regression adjustment
    2.5.1 Full regression adjustment
    2.5.2 Pooled regression adjustment
  2.6 Applications
    2.6.1 Treatment effects with multiple treatment levels
    2.6.2 Difference-in-Differences designs
    2.6.3 Estimating lower bound mean willingness-to-pay
    2.6.4 Application to California oil data
  2.7 Monte Carlo
    2.7.1 Population models
  2.8 Discussion
  2.9 Conclusion
CHAPTER 3 DOUBLY WEIGHTED M-ESTIMATION FOR NONRANDOM ASSIGNMENT AND MISSING OUTCOMES
  3.1 Introduction
  3.2 Doubly weighted framework
    3.2.1 Potential outcomes and the population models
    3.2.2 The unweighted M-estimator
    3.2.3 Ignorable missingness and unconfoundedness
    3.2.4 Population problem with double weighting
  3.3 Estimation
    3.3.1 Estimated weights using binary response MLE
    3.3.2 Doubly weighted M-estimator
  3.4 Asymptotic theory
    3.4.1 Consistency
    3.4.2 Asymptotic normality
    3.4.3 Efficiency gain with estimated weights
  3.5 Some feature of interest is correctly specified
  3.6 Robust estimation
    3.6.1 Average treatment effect
    3.6.2 Monte Carlo evidence
    3.6.3 Quantile effects
    3.6.4 Monte Carlo evidence
  3.7 Application to Calónico and Smith (2017)
    3.7.1 Results
  3.8 Conclusion
APPENDICES
  APPENDIX A FIGURES FOR CHAPTER 1
  APPENDIX B TABLES FOR CHAPTER 1
  APPENDIX C PROOFS FOR CHAPTER 1
  APPENDIX D TABLES FOR CHAPTER 2
  APPENDIX E PROOFS FOR CHAPTER 2
  APPENDIX F AUXILIARY RESULTS FOR CHAPTER 3
  APPENDIX G APPLICATION APPENDIX FOR CHAPTER 3
  APPENDIX H FIGURES FOR CHAPTER 3
  APPENDIX I TABLES FOR CHAPTER 3
  APPENDIX J PROOFS FOR CHAPTER 3
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Description of the data generating processes
Table 2.1: Combinations of means and QLLFs to ensure consistency
Table B.1: QLL and mean function combinations
Table B.2: Bias and standard deviation for N=100
Table B.3: Bias and standard deviation for N=500
Table B.4: Bias and standard deviation for N=1000
Table B.5: Bias and standard deviation for binary outcome
Table B.6: Bias and standard deviation for non-negative outcome
Table D.1: Summary of yes votes at different bid amounts
Table D.2: Lower bound mean willingness to pay estimate using ABERS and FRA estimators
Table D.3: Bias and standard deviation of RA estimators for DGP 1 across four assignment vectors
Table D.4: Bias and standard deviation of RA estimators for DGP 2 across four assignment vectors
Table D.5: Bias and standard deviation of RA estimators for DGP 3 across four assignment vectors
Table I.1: An illustration of the observed sample (✓ means observed, ? means missing)
Table I.2: Different scenarios under ignorability and unconfoundedness
Table I.3: Different scenarios under exogeneity of missingness and unconfoundedness
Table I.4: When is unweighted more efficient than weighted assuming ignorability and unconfoundedness and D(y(g)|x) correctly specified?
Table I.5: When the conditional mean model is correctly specified
Table I.6: Misspecified conditional mean model
Table I.7: A) Both probability models are correct
Table I.8: B) When missing data probability is misspecified and propensity score is correct
Table I.9: C) When missing data probability is correct and propensity score is misspecified
Table I.10: D) Both probability models are misspecified
Table I.11: Proportion of missing earnings in the experimental sample
Table I.12: Proportion of missing data in the PSID samples
Table I.13: Unweighted and weighted pre-training earnings comparisons using NSW and PSID comparison groups
Table I.14: Estimation summary for ATE under different cases of misspecification
Table I.15: Estimation summary for quantile effects under different cases of misspecification
Table I.16: Covariate means and p-values from the test of equality of two means, by treatment status
Table I.17: Covariate means and p-values from the test of equality of two means for the observed and missing samples
Table I.18: Unweighted and weighted earnings comparisons and estimated training effects using NSW and PSID comparison groups
Table I.19: Unconditional quantile treatment effect (UQTE) using PSID-1 comparison group
Table I.20: Unconditional quantile treatment effect (UQTE) using PSID-2 comparison group

LIST OF FIGURES

Figure A.1: Quadratic design, continuous covariates (mild heterogeneity)
Figure A.2: Quadratic design, continuous covariates (strong heterogeneity)
Figure A.3: Quadratic design, one binary covariate (mild heterogeneity)
Figure A.4: Quadratic design, one binary covariate (strong heterogeneity)
Figure A.5: Probit design, continuous covariates (mild heterogeneity)
Figure A.6: Probit design, continuous covariates (strong heterogeneity)
Figure A.7: Probit design, one binary covariate (mild heterogeneity)
Figure A.8: Probit design, one binary covariate (strong heterogeneity)
Figure A.9: Binary outcome, Bernoulli QLL with logistic mean
Figure A.10: Non-negative outcome, Poisson QLL with exponential mean
Figure G.1: Kernel density plots for the composite probability
Figure G.2: Kernel density plots for the estimated propensity score
Figure G.3: Kernel density plots for the estimated missing outcomes probability
Figure H.1: Relative estimated bias in UQTE estimates at different quantiles of the 1979 earnings distribution
Figure H.2: Empirical distribution of estimated ATE for N=5000
Figure H.3: Estimated CQTE with true CQTE as a function of x1, N=5000
Figure H.4: Bias in estimated linear projection relative to true linear projection as a function of x1 using Angrist et al. (2006b) methodology, N=5000
Figure H.5: Empirical distribution of estimated UQTE for N=5000

INTRODUCTION

Many questions in health, labor, development, and other areas of applied microeconomics are questions of causal inference. Establishing causality not only helps to quantify cause-and-effect relationships but also helps to formulate counterfactual scenarios, which can be useful for informing policy and contributing to policy debates. Randomized experiments are generally considered to be a reasoned basis for conducting causal inference since the experimental design creates groups that are comparable on average and do not differ systematically in measured and unmeasured dimensions. On the other hand, observational or non-experimental data make it difficult to isolate causal mechanisms due to the absence of random assignment, which introduces overt and hidden biases into treatment-control comparisons. This dissertation studies issues of consistency and efficiency in estimating causal effects under experimental and non-experimental data settings. Given this objective, the focus of this dissertation is on methods that do not rely on correct functional form assumptions about some conditional feature of interest to achieve consistent or efficient estimation of treatment effects.

Chapter 1 of this dissertation revisits the argument for regression adjustment, which is routinely employed to improve precision of the estimated average treatment effect in experiments over a simple difference in means. As has been noted in the statistics literature by Freedman (2008a,b) and Lin (2013), such an argument may be misguided. This is because additively controlling for covariates in a regression is not guaranteed to produce more precise estimates than a simple difference in means estimate. A similar result, however, is formally lacking in the economics literature, where random sampling from an infinite population is the mainstream asymptotic paradigm. Chapter 2 builds on this idea of regression (or covariate) adjustment in experiments and extends it to the case of more than two treatment levels. This is a more general setting which is able to encompass a variety of applications such as experiments with multiple treatment levels, difference-in-differences designs, and willingness to pay studies. Chapter 3 relaxes the experimental setting to consider treatment effects estimation with observational data when some of the observed outcomes are missing. This is a widely encountered problem in most micro-econometric empirical analyses. In this setting, consistent estimation of treatment effects is complicated unless the researcher imposes structure on the treatment assignment and missing data mechanisms.
Chapter 1 in this dissertation acknowledges the criticism leveled against simple regression adjustment for improving precision of the estimated average treatment effect. This chapter then proposes a regression estimator, which allows for separate estimation of the slopes in the treatment and control groups, that is guaranteed to be no less precise asymptotically than the simple difference in means and pooled regression adjustment estimators. This result is easily observed in the simulations, where we consider different data generating processes and a range of assignment probabilities. Chapter 2 extends this efficiency result of the separate slopes regression estimator to the case of G arbitrary treatment levels and provides an empirical illustration and simulation results which provide favorable evidence for the separate slopes estimator. Finally, the third chapter proposes an inverse probability weighted estimator that weights twice, once for the problem of nonrandom assignment and once for missing outcomes.

The methodology used in each of the chapters does not rely on correct specification of some conditional feature to propose efficient or consistent estimators for the causal parameters in question. In the first two chapters, efficiency gains with the separate slopes estimator are not based on correct conditional mean specification, but on linear projections, which are consistently estimated under mild assumptions. Similarly, the doubly weighted IPW estimator proposed in the third chapter is consistent even when the outcome model is misspecified, as long as the weights are correct. This dissertation aims to provide empiricists with a wide range of treatment effect settings where the methods studied here may prove useful for conducting sound and robust causal analysis.

CHAPTER 1
REVISITING REGRESSION ADJUSTMENT IN EXPERIMENTS WITH HETEROGENEOUS TREATMENT EFFECTS†

†This work is joint with Jeffrey M. Wooldridge and is unpublished.

1.1 Introduction

The role of covariates in randomized experiments has been studied since the early 1930s [Fisher (1935)].¹ When compared with the simple difference-in-means (SDM) estimator, the main benefit of adjusting for covariates is that the precision of the estimated average treatment effect (ATE) can be improved if the covariates are sufficiently predictive of the outcome [Cochran (1957); Lin (2013)]. Nevertheless, regression adjustment is not uniformly accepted as being preferred over the SDM estimator. For example, Freedman (2008a,b) argues against using regression adjustment because it is not guaranteed to be unbiased unless one makes the strong assumption that the conditional expectation functions are correctly specified (and linear in parameters).

¹ In the statistics and biomedical literature, such variables are also known as prognostic factors or concomitant variables. In the econometrics of program evaluation literature, they are known as pre-treatment covariates. As the name suggests, they are ideally measured before the treatment is administered.

It is important to understand that there are two different, potentially valid criticisms of regression adjustment (RA). The first is the issue of bias just mentioned: unlike the SDM estimator, which is unbiased conditional on having some units in both the control and treatment groups, RA estimators are only guaranteed to be consistent, not unbiased. Therefore, in experiments with small sample sizes, one might be willing to forego potential efficiency gains in order to ensure an unbiased estimator of the average treatment effect. As Bruhn and McKenzie (2009) point out, samples of 100 to 500 individuals, or 20 to 100 schools or health clinics, are fairly common in experiments conducted in development economics. In situations where unbiased estimation is the highest priority, the current paper has little
to add, other than to provide simulation evidence that using our preferred RA procedure often results in small bias. Henceforth, we are not interested in small-sample problems as a criticism of RA. More and more economic experiments, especially those conducted online, include enough units to make consistency and asymptotic efficiency relevant criteria for evaluating estimators of ATEs. In cases where effect sizes are small (but important in the aggregate), improving precision can be important even when sample sizes seem fairly large.

A second criticism of RA methods, and the one most relevant for this paper, is that RA methods may not improve over the SDM estimator even if we focus on asymptotic efficiency. Freedman (2008a,b) and Lin (2013) level this criticism of RA when the covariates are simply added as controls along with a treatment indicator in a linear regression analysis. Freedman (2008b), for example, finds no systematic efficiency gain from using covariates. Lin (2013) provides an in-depth discussion of how simply adding covariates will not necessarily produce efficiency gains when treatment effects are heterogeneous. Both Freedman and Lin operate under a finite population paradigm where all population units are observed in the sample. Therefore, uncertainty in the estimators is due to the assignment into treatment and control, and not due to sampling from a population [Abadie et al. (2017b), Abadie et al. (2017a), and Rosenbaum (2002) discuss similar settings].

Random sampling from a population is still an important setting for empirical work, and the findings in Freedman (2008b) and Lin (2013) do not extend to the random sampling scenario, which involves accounting for both types of uncertainty, sampling-based and design-based. Our paper will not have more to say about the differences between these two types of uncertainty; for a deeper discussion of sampling-based and design-based uncertainty, see Abadie et al. (2017b). Imbens and Rubin (2015) study linear regression adjustment in the same sampling setting that we use here: independent and identically distributed (i.i.d.) draws from a population. However, they state efficiency results only in the case that the population means of the covariates are known, even though only a random sample is available. In addition, Imbens and Rubin only consider linear regression adjustment.

Regression adjustment in experiments has also been studied in the statistics literature by papers like Yang and Tsiatis (2001), Tsiatis et al. (2008), Ansel et al. (2018), and Berk et al. (2013) for the case of linear adjustment and by Rosenblum and Van Der Laan (2010) and Bartlett (2018) for the case of nonlinear regression adjustment. While some of the results derived in these papers overlap with what we show, the expressions and exposition are not as transparent and simple as ours. For nonlinear regression adjustment, we establish consistency by distinguishing between pooled and separate regression adjustment, which is missing from the discussion in Rosenblum and Van Der Laan (2010) and Bartlett (2018). In this paper, we revisit regression adjustment and resolve some outstanding issues in the literature.
We cover the standard case of i.i.d. draws from an underlying population, so that randomness comes from sampling error as well as assignment into control and treatment groups. Further, unlike Imbens and Rubin (2015), we consider the realistic case where the population means of the covariates must be estimated using the sample averages. In the case of linear regression adjustment, we study four estimators: the SDM estimator, the pooled regression adjusted (PRA) estimator, the full regression adjusted (FRA) estimator – which uses separate regressions for the control and treatment groups – and what we call the infeasible regression adjusted (IRA) estimator, which is like the FRA estimator but assumes the population means of the covariates are known. We include IRA for completeness, as it is studied in Imbens and Rubin (2015), and doing so allows us to characterize the lost efficiency due to having to estimate the population means.

Our most important results in the linear regression case can be easily summarized. First, even when accounting for the sampling error of the covariates in estimating the population means, using separate linear regressions for the control and treatment groups leads to an ATE estimator that is never less precise (asymptotically) than the SDM estimator and the PRA estimator. Unless small sample bias is a concern, there is no reason not to use full regression adjustment. Further, there are two interesting cases where there will be no precision gain when using full RA compared with pooled RA. The first is when there is no heterogeneity in the slopes of the linear projections of the potential outcomes (although there could be in the unobservables). In this case, it is not surprising that using pooled RA is sufficient to capture the efficiency gains of using covariates. The second important case where there is no additional gain from FRA is when the design is balanced: the probability of being in the treatment group is equal to 0.5. Therefore, if one has imposed a balanced design and is considering only linear regression adjustment, the pooled method is probably preferred (due to conserving degrees of freedom). A final result, which is pretty obvious, is that there is no efficiency gain when the covariates are not predictive of the potential outcomes; then, SDM is asymptotically efficient. We want to emphasize that there is no (asymptotic) cost to doing the regression adjustment, whether PRA or FRA: each estimator has the same asymptotic variance. In situations where one has good predictors of the outcome, regression adjustment can be attractive. Our simulation study illustrates the special cases derived from our theoretical results.

Another important contribution of our paper is to characterize situations where nonlinear regression adjustment preserves consistency of average treatment effect estimators without imposing additional assumptions. In particular, we show that when the response is binary, fractional, a count, or another nonnegative outcome, certain kinds of nonlinear regression adjustment consistently estimate the average treatment effect. Our simulations for the case of binary and non-negative responses suggest that nonlinear RA, especially the full version, can produce sampling variances that are smaller than SDM and also linear regression adjustment. In terms of bias, nonlinear FRA (NFRA) is comparable to SDM, which we know is unbiased.

The rest of the paper is organized as follows.
Section 1.2 briefly introduces the potential outcomes setting and defines the population average treatment effect – the parameter of interest in this paper. Section 1.3 discusses the random assignment mechanism and the random sampling assumption. Section 1.4 is important and describes linear regression adjustment in the population in terms of linear projections, which are consistently estimated by ordinary least squares (OLS) given a random sample. Importantly, we need not impose any assumptions on the conditional mean functions of the potential outcomes. Section 1.5 presents the asymptotic variances of the four linear estimators and ranks them on the basis of asymptotic efficiency. We also characterize the cases under which estimating two separate regressions does not improve efficiency over SDM or PRA (or both). Section 1.6 presents Monte Carlo simulations that compare the bias and root mean squared error (RMSE) of the estimators for eight different data generating processes. In Section 1.7 we draw on results from the doubly robust estimation literature and characterize the nonlinear RA estimators – both pooled and full – that produce consistent estimators of the ATE. Our simulations in this section show that nonlinear methods have modest bias while considerably improving efficiency compared with both SDM and linear RA methods. Section 1.8 concludes the paper with a discussion of some future research topics. All proofs are included in the appendix.

1.2 Potential outcomes and parameter of interest

Our framework is the standard Neyman-Rubin causal model, involving potential (or counterfactual) outcomes. Let $\{Y(0), Y(1)\}$ be the two potential outcomes corresponding to the control and treatment states, respectively, where $\{Y(0), Y(1)\}$ has a joint distribution in the population. The setup is nonparametric in that we make no assumptions about the distribution of $\{Y(0), Y(1)\}$ other than the finite moment conditions needed to apply standard asymptotic theory. In particular, $\{Y(0), Y(1)\}$ may be discrete, continuous, or mixed random variables. For example, $Y(0)$ and $Y(1)$ can be binary employment indicators for nonparticipation and participation in a job training program. Or, they could be the fraction of assets held in the stock market, or counts of the number of hospital visits taken by a patient. Define the means of the potential outcomes as

$$\mu_0 = E[Y(0)], \qquad \mu_1 = E[Y(1)].$$

The parameter of interest is the population average treatment effect (PATE),

$$\tau = E[Y(1) - Y(0)] = \mu_1 - \mu_0.$$

As has often been noted in the literature, the problem of causal inference is essentially a missing data problem. We only observe one of the outcomes, $Y(0)$ or $Y(1)$, once the treatment, represented by the Bernoulli random variable $W$, is determined. Specifically, the observed $Y$ is defined by

$$Y = \begin{cases} Y(0), & \text{if } W = 0 \\ Y(1), & \text{if } W = 1 \end{cases} \qquad (1.1)$$

It is also useful to write $Y$ as

$$Y = (1 - W) \cdot Y(0) + W \cdot Y(1). \qquad (1.2)$$

1.3 Random assignment and random sampling

In determining an appropriate method to estimate $\tau$, we need to know how the treatment, $W$, is assigned. In this paper, we assume that $W$ is independent of the potential outcomes as well as the observed covariates, which we write as $X = (X_1, X_2, \ldots, X_K)$. Formally, the random assignment assumption is as follows.

Assumption 1.3.1. The binary assignment indicator, $W$, is a Bernoulli trial and is independent of $\{Y(0), Y(1), X\}$, where $X = (X_1, X_2, \ldots, X_K)$. Mathematically, $W \perp \{Y(0), Y(1), X\}$.
Letting $\rho = P(W = 1)$ be the probability of being assigned into treatment, we assume that $0 < \rho < 1$. The assumption of random assignment implies that there are many consistent estimators of $\tau$. The goal in this paper is to rank, as much as possible, commonly used estimators of $\tau$ in terms of asymptotic efficiency.

As mentioned in the introduction, both early [Neyman (1923) and Fisher (1935)] and recent [Freedman (2008a,b) and Lin (2013)] approaches to estimating ATEs assume that the entire population is the sample. Therefore, the only stochastic element of the setup is the assignment of the treatment, which is randomized. Such a perspective rules out any uncertainty stemming from unobservability of the entire population (also termed sampling-based uncertainty) and only allows uncertainty that arises due to the experimental design (also known as design-based uncertainty). Here, we adopt the assumption commonly used in studying various estimators in statistics and econometrics.

Assumption 1.3.2. For a nonrandom integer $N$, $\{(Y_i(0), Y_i(1), W_i, X_i) : i = 1, 2, \ldots, N\}$ are independent and identically distributed draws from the population.

Given the random sampling assumption, standard asymptotic theory for i.i.d. sequences of random vectors can be applied, where $N$ tends to infinity. We assume in what follows that at least the second moments of the potential outcomes and covariates are finite so that, when we use regression methods, we can apply the law of large numbers and central limit theorem. We do not state these moment assumptions explicitly, as they cannot be checked anyway.

For each unit $i$ drawn from the population, the treatment effect is $Y_i(1) - Y_i(0)$, which we can write as

$$Y_i(1) - Y_i(0) = \tau + [V_i(1) - V_i(0)],$$

where $V_i(w) = Y_i(w) - \mu_w$ for $w \in \{0, 1\}$. The treatment effects are homogeneous when the unit-specific components, $V_i(1) - V_i(0)$, are identically zero for any random draw $i$.

1.4 Estimators

We now carefully describe the estimators that we use in the linear regression context.

1.4.1 Simple difference in means (SDM)

Random assignment provides the luxury of using an estimator available from basic statistics. This estimator dates back to Neyman (1923) in the context of causal inference using potential outcomes. Let $W_i$ be the treatment indicator for unit $i$. Then $N_0 = \sum_{i=1}^N (1 - W_i)$ and $N_1 = \sum_{i=1}^N W_i$ are the number of control and treated units in the sample, respectively. These are random variables. When $N_0, N_1 > 0$ we can define the sample averages for the control and treated units:

$$\bar{Y}_0 = N_0^{-1} \sum_{i=1}^N (1 - W_i) Y_i \qquad (1.3)$$
$$\bar{Y}_1 = N_1^{-1} \sum_{i=1}^N W_i Y_i, \qquad (1.4)$$

where $Y_i$ is the observed outcome for unit $i$. The simple difference-in-means estimator is

$$\hat{\tau}_{SDM} = \bar{Y}_1 - \bar{Y}_0. \qquad (1.5)$$

Under random assignment and conditional on $N_0, N_1 > 0$, $\hat{\tau}_{SDM}$ is unbiased for $\tau$ – see, for example, Imbens and Rubin (2015). Further, $\hat{\tau}_{SDM}$ is consistent for $\tau$ as $N \to \infty$ when $0 < \rho < 1$, as we assume. As is well known, $\hat{\tau}_{SDM}$ can be obtained as the coefficient on $W_i$ in the simple regression

$$Y_i \text{ on } 1, W_i, \quad i = 1, \ldots, N. \qquad (1.6)$$

See, for example, Imbens and Rubin (2015).
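To make the equivalence between (1.5) and (1.6) concrete, here is a minimal sketch, not part of the original paper; the data generating process and all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up experimental sample, purely for illustration (true tau = 2)
N, rho = 1000, 0.4
W = rng.binomial(1, rho, size=N)
Y = 1.0 + 2.0 * W + rng.normal(size=N)

# Simple difference in means, equation (1.5)
tau_sdm = Y[W == 1].mean() - Y[W == 0].mean()

# Numerically identical to the OLS coefficient on W in regression (1.6)
Z = np.column_stack([np.ones(N), W])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
assert np.isclose(tau_sdm, coef[1])
```

The equality is exact, not asymptotic: with an intercept and a single binary regressor, the OLS slope is algebraically the difference in subsample means.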
1.4.2 Pooled regression adjustment (PRA)

When we have covariates that (hopefully) predict the outcome $Y$, the simplest way to use those covariates is to add them to the simple regression in (1.6). As documented in Słoczyński (2018), adding covariates along with a binary treatment indicator is still very common in estimating treatment effects, whether one has randomized assignment or assumes unconfoundedness conditional on the covariates. Specifically, the regression is

$$Y_i \text{ on } 1, W_i, X_i, \quad i = 1, 2, \ldots, N.$$

The coefficient on $W_i$ is the estimator of $\tau$; we call this estimator "pooled regression adjustment" (PRA) and denote it $\hat{\tau}_{PRA}$. The name "pooled" emphasizes that we are pooling across the control and treatment groups in imposing common coefficients on the vector of covariates, $X_i$. In other words, the slopes are the same for $W = 0$ and $W = 1$. It is important to understand that we are making no assumption about whether the coefficients in an underlying linear model in the population are the same, or even whether there is an underlying linear model representing a conditional expectation. This will become clear in the next subsection when we formally describe linear projections.

As is well known, adding the variables $X_i$ to the simple regression does not change the probability limit provided $W_i$ and $X_i$ are uncorrelated, which follows under random assignment. However, it is not always the case that adding $X_i$ is a good idea, even if we focus on asymptotic efficiency: it may or may not improve asymptotic efficiency compared with the SDM estimator.

1.4.3 Linear projections and infeasible regression adjustment (IRA)

The SDM estimator is an example of an estimator that can be written as $\hat{\tau} = \hat{\mu}_1 - \hat{\mu}_0$, where, in the case of SDM, $\hat{\mu}_0$ and $\hat{\mu}_1$ are the sample averages of the control and treated groups, respectively. But there are other ways to consistently estimate $\mu_0$ and $\mu_1$ when we have covariates $X$, represented as a $1 \times K$ vector. In particular, define the linear projections of the potential outcomes on the vector of covariates as

$$L[Y(0)\,|\,1, X] = \alpha_0 + X\beta_0 \qquad (1.7)$$
$$L[Y(1)\,|\,1, X] = \alpha_1 + X\beta_1, \qquad (1.8)$$

where the expressions for $\alpha_0$, $\alpha_1$, $\beta_0$, and $\beta_1$ can be found in Wooldridge (2010), Section 2.3. As discussed in Wooldridge, the linear projections always exist and are well defined provided $Y(w)$ and the elements of $X$ have finite second moments. Any of the random variables can be discrete, continuous, or mixed. The elements of $X$ can include the usual functional forms, such as logarithms, squares, and interactions. The requirement for the coefficients in the linear projections (LPs) to be unique is simply that the variance-covariance matrix of $X$, $\Omega_X$, is nonsingular, an assumption that rules out perfect collinearity in the population.

It is often helpful to slightly rewrite the LPs. Define the $1 \times K$ vector of population means of $X$ as $\mu_X = E(X)$, and let $\dot{X} = X - \mu_X$ be the deviations from the population mean. Then we can write the linear projections in terms of the means $\mu_0$ and $\mu_1$ as

$$L[Y(0)\,|\,1, X] = \mu_0 + \dot{X}\beta_0 \qquad (1.9)$$
$$L[Y(1)\,|\,1, X] = \mu_1 + \dot{X}\beta_1 \qquad (1.10)$$

The two representations make it clear that the PATE, $\tau$, can be expressed as

$$\tau = \mu_1 - \mu_0 = (\alpha_1 - \alpha_0) + \mu_X(\beta_1 - \beta_0).$$

Therefore, if we have consistent estimators of $\alpha_0$, $\alpha_1$, $\beta_0$, $\beta_1$, and $\mu_X$, then we can consistently estimate $\tau$. Importantly, as discussed in Wooldridge (2010), Chapter 4, ordinary least squares estimation using a random sample always consistently estimates the parameters in a population linear projection (subject to the mild finite second moment assumptions and the nonsingularity of $\Omega_X$). This is true regardless of the nature of $Y(w)$ or $X$. This insight is critical for understanding why regression adjustment produces consistent estimators of $\tau$, and for the asymptotic efficiency arguments later on.
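The point that OLS recovers the linear projection regardless of the true conditional mean can be illustrated numerically. The sketch below is ours, not the paper's; the quadratic DGP is invented for the example. With a scalar covariate, the population LP slope is $\text{Cov}(X, Y)/\text{Var}(X)$, and OLS on a deliberately "misspecified" linear regression recovers it in large samples:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200_000
X = rng.normal(1.0, 1.5, size=n)
Y = 0.5 + X + 0.8 * X**2 + rng.normal(size=n)   # E[Y|X] is quadratic

# Linear projection slope: Cov(X, Y) / Var(X)
beta_lp = np.cov(X, Y)[0, 1] / X.var()

# OLS of Y on (1, X) targets the same slope, despite "misspecification"
Z = np.column_stack([np.ones(n), X])
b_ols, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(beta_lp, b_ols[1])   # nearly identical at this sample size
```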
This insight is critical for understanding why regression adjustment produces consistent estimators of τ, and for the asymptotic efficiency arguments later on. Unlike in Imbens and Wooldridge 12 (2009), we do not assume that the linear projection is the same as the conditional mean. We are silent on the conditional mean functions E(cid:2)Y (0)|X(cid:3) and E(cid:2)Y (1)|X(cid:3). Given random assignment, consistent estimators of the LP coefficients are obtained from the separate regressions Yi on 1, Xi using Wi = 0 Yi on 1, Xi using Wi = 1 (1.11) (1.12) Wooldridge (2010) Chapter 19 formally shows that the linear projections are consistently esti- mated under the assumption that selection – in this case, Wi – is independent of(cid:2)Xi, Yi(w)(cid:3). This is sometimes called the “missing completely at random” (MCAR) assumption in the missing data literature [for example, Little and Rubin (2002)]. If we assume that the vector of population means µX is known, a consistent estimator of τ is ˆτIRA = (ˆα1 − ˆα0) + µX( ˆβ1 − ˆβ0), where ˆα0, ˆα1, ˆβ0, and ˆβ1 are the OLS estimators from the separate regressions. We call this the “infeasible regression adjustment” (IRA) estimator because it depends on µX, which is likely to be unknown in our context with random sampling. From the algebra of ordinary least squares (OLS) it is easily shown that ˆτIRA can be obtained as the coefficient on Wi in the regression that includes a full set of interactions between Wi and ˙Xi, namely, Yi on 1, Wi, Xi, Wi · ˙Xi, i = 1, . . . , N. The demeaning of the covariates ensures that the coefficient on Wi is ˆτIRA. This regression is also convenient for obtaining a valid standard error for ˆτIRA, as the usual Eicker-Huber- White heteroskedasticity-robust standard error is asymptotically valid. In the case where the linear projections are also the conditional expectations – that is, E(cid:2)Y (w)|X(cid:3) = L(cid:2)Y (w)|1, X(cid:3), w ∈ {0, 1} – ˆα0, ˆα1, ˆβ0, and ˆβ1 are unbiased conditional on 13 {Xi : i = 1, 2, . . . , N}, provided we rule out perfect collinearity in the control and treated subsamples. Then, ˆτIRA would also be unbiased conditional on {Xi : i = 1, 2, . . . , N}, and unbiased unconditionally if its expectation exists. But linearity of the conditional expec- tations is much too strong an assumption, and it is clearly not needed for unbiasedness or consistency of the SDM estimator. Therefore, in what follows, we make no assumptions about E(cid:2)Y (w)|X(cid:3). We simply assume enough moments are finite and rule out perfect collinearity in X in order for the linear projections to exist. 1.4.4 Full regression adjustment (FRA) ¯X = N−1(cid:80)N We can easily make the IRA estimator feasible by replacing µX with the sample average, i=1 Xi. This leads to what we will call the “full regression adjustment” (FRA) estimator: ˆτF RA = (ˆα1 − ˆα0) + ¯X( ˆβ1 − ˆβ0). This estimator can also be obtained as the OLS coefficient on Wi but from the regression Yi on 1, Wi, Xi, Wi · ¨Xi, i = 1, 2, . . . , N , where ¨Xi = Xi − ¯X, i = 1, 2, . . . , N are the demeaned covariates using the sample average. This estimator is always available given a sample (cid:8)(Yi, Wi, Xi) : i = 1, 2, . . . , N(cid:9). Generally, ˆτF RA (cid:54)= ˆτIRA. Like ˆτIRA, we can only conclude that ˆτF RA is consistent, although it will be unbiased under essentially the same assumptions discussed for ˆτIRA. In the next section, we will rank the four estimators, to the extent possible, in terms of asymptotic efficiency. 
1.5 Asymptotic variances and efficiency comparisons

We first derive the asymptotic variances of the SDM, PRA, IRA, and FRA estimators in the general case of heterogeneous treatment effects. Naturally, the formulas include homogeneous treatment effects as a special case. We then compare the asymptotic variances in general and in special cases.

In order to obtain the asymptotic variances, we need to study the linear projections of the potential outcomes on the covariates more closely. Recall that we can write the potential outcomes as

$$Y(0) = \mu_0 + V(0) \qquad (1.13)$$
$$Y(1) = \mu_1 + V(1), \qquad (1.14)$$

where $V(0)$ and $V(1)$ have zero means, by construction. Following the discussion in Section 1.4, we linearly project each of $V(0)$ and $V(1)$ onto the population demeaned covariates, $\dot{X}$:

$$V(0) = \dot{X}\beta_0 + U(0) \qquad (1.15)$$
$$V(1) = \dot{X}\beta_1 + U(1) \qquad (1.16)$$

where the intercepts are necessarily zero. Then

$$Y(0) = \mu_0 + \dot{X}\beta_0 + U(0) \qquad (1.17)$$
$$Y(1) = \mu_1 + \dot{X}\beta_1 + U(1) \qquad (1.18)$$

By definition of the linear projection,

$$E[U(0)] = E[U(1)] = 0, \qquad E[\dot{X}'U(0)] = E[\dot{X}'U(1)] = 0.$$

It follows that

$$V[Y(0)] = \beta_0'\Omega_X\beta_0 + \sigma_0^2$$
$$V[Y(1)] = \beta_1'\Omega_X\beta_1 + \sigma_1^2,$$

where $\sigma_0^2 = V[U(0)]$ and $\sigma_1^2 = V[U(1)]$. We can write the observed outcome, $Y$, as

$$Y = (1 - W)\left[\mu_0 + \dot{X}\beta_0 + U(0)\right] + W\left[\mu_1 + \dot{X}\beta_1 + U(1)\right] \qquad (1.19)$$
$$= \mu_0 + \dot{X}\beta_0 + U(0) + \tau W + (W \cdot \dot{X})\delta + W \cdot [U(1) - U(0)], \qquad (1.20)$$

where $\delta = \beta_1 - \beta_0$. The following lemma is a precursor to the efficiency comparisons.

Lemma 1.5.1. Under the assumptions of random assignment given in Assumption 1.3.1, random sampling given in Assumption 1.3.2, and finite moment assumptions, the following asymptotic distributions hold:

$$\sqrt{N}(\hat{\tau}_{SDM} - \tau) \xrightarrow{d} N\left(0, \omega^2_{SDM}\right) \qquad (1.21)$$
$$\omega^2_{SDM} = \frac{\beta_1'\Omega_X\beta_1}{\rho} + \frac{\beta_0'\Omega_X\beta_0}{1 - \rho} + \frac{\sigma_1^2}{\rho} + \frac{\sigma_0^2}{1 - \rho} \qquad (1.22)$$

$$\sqrt{N}(\hat{\tau}_{PRA} - \tau) \xrightarrow{d} N\left(0, \omega^2_{PRA}\right) \qquad (1.23)$$
$$\omega^2_{PRA} = \left[\frac{(1 - \rho)^2}{\rho} + \frac{\rho^2}{1 - \rho}\right](\beta_1 - \beta_0)'\Omega_X(\beta_1 - \beta_0) + \frac{\sigma_1^2}{\rho} + \frac{\sigma_0^2}{1 - \rho} \qquad (1.24)$$

$$\sqrt{N}(\hat{\tau}_{FRA} - \tau) \xrightarrow{d} N\left(0, \omega^2_{FRA}\right) \qquad (1.25)$$
$$\omega^2_{FRA} = (\beta_1 - \beta_0)'\Omega_X(\beta_1 - \beta_0) + \frac{\sigma_1^2}{\rho} + \frac{\sigma_0^2}{1 - \rho} \qquad (1.26)$$

$$\sqrt{N}(\hat{\tau}_{IRA} - \tau) \xrightarrow{d} N\left(0, \omega^2_{IRA}\right) \qquad (1.27)$$
$$\omega^2_{IRA} = \frac{\sigma_1^2}{\rho} + \frac{\sigma_0^2}{1 - \rho}. \qquad (1.28)$$ □

The asymptotic variance expressions allow us to determine asymptotic efficiency under various scenarios. Not surprisingly, all four asymptotic variances depend on the error variances, $\sigma_0^2$ and $\sigma_1^2$. Generally, the asymptotic variances of the three feasible estimators can depend on $\Omega_X$, $\beta_0$, and $\beta_1$.

By comparing the formulas in Lemma 1.5.1 we have the following result, which ranks the asymptotic variances of the four different estimators in the general case of heterogeneous treatments and $\rho \in (0, 1)$.

Theorem 1.5.2. Under the assumptions of Lemma 1.5.1,

(i)
$$\omega^2_{FRA} \leq \omega^2_{SDM} \qquad (1.29)$$
$$\omega^2_{FRA} \leq \omega^2_{PRA} \qquad (1.30)$$
$$\omega^2_{IRA} \leq \omega^2_{FRA} \qquad (1.31)$$

(ii) If $\beta_0 = \beta_1 = 0$, then all asymptotic variances are the same.

(iii) If $\beta_0 = \beta_1 = \beta$, then $\omega^2_{PRA} = \omega^2_{FRA} = \omega^2_{IRA}$, and if $\beta \neq 0$ then $\omega^2_{SDM}$ is strictly larger.

(iv) If $\rho = 1/2$, then $\omega^2_{PRA} = \omega^2_{FRA} \leq \omega^2_{SDM}$, with strict inequality in the latter case unless $\beta_1 = -\beta_0$.

Many of the results in Theorem 1.5.2 follow from inspection of the asymptotic variance formulas, although some are more subtle. For example, (1.31) is immediate because the first term in (1.26) is nonnegative.
Part (iii) is also immediate because all asymptotic variances equal $\sigma_1^2/\rho + \sigma_0^2/(1 - \rho)$ when $\beta_0 = \beta_1$. For part (iv), the function

$$g(\rho) \equiv \frac{(1 - \rho)^2}{\rho} + \frac{\rho^2}{1 - \rho}, \quad \rho \in (0, 1)$$

can be shown to have a minimum value of unity, uniquely achieved when $\rho = 1/2$: writing $g(\rho) = [(1 - \rho)^3 + \rho^3]/[\rho(1 - \rho)] = 1/[\rho(1 - \rho)] - 3$ shows that $g$ is minimized where $\rho(1 - \rho)$ is maximized, namely at $\rho = 1/2$, where $g(1/2) = 4 - 3 = 1$.

The most difficult inequality to establish, and the one that is most important, is (1.29). Straightforward matrix multiplication shows that

$$\omega^2_{SDM} - \omega^2_{FRA} = \frac{\beta_1'\Omega_X\beta_1}{\rho} + \frac{\beta_0'\Omega_X\beta_0}{1 - \rho} - (\beta_1 - \beta_0)'\Omega_X(\beta_1 - \beta_0) = \lambda'\Omega_X\lambda,$$

where

$$\lambda = \sqrt{\frac{1 - \rho}{\rho}}\,\beta_1 + \sqrt{\frac{\rho}{1 - \rho}}\,\beta_0.$$

Because $\Omega_X$ is assumed positive definite, $\omega^2_{SDM} = \omega^2_{FRA}$ if and only if $\lambda = 0$. One case where $\lambda = 0$ is $\beta_1 = \beta_0 = 0$, in which case the covariates do not predict the potential outcomes. It can happen in other cases, but all of the slope coefficients would have to have opposite signs in the linear projections of the two potential outcomes. For example, if $\rho = 1/2$, we would need $\beta_1 = -\beta_0$, which means the slopes in the linear projection of $Y(1)$ on $1, X$ would have the opposite signs of the slope coefficients in the linear projection of $Y(0)$ on $1, X$. This seems highly unlikely. For example, we would expect pre-training education to have a positive effect on earnings whether or not someone participates in a job training program. In the homogeneous case $\beta_1 = \beta_0 \neq 0$, $\omega^2_{SDM} > \omega^2_{FRA} = \omega^2_{PRA}$.

We can never know for sure whether $\beta_1 = \beta_0$, but we should know whether $\rho = 1/2$ based on the design of the experiment. If $\rho = 1/2$, then Theorem 1.5.2 suggests that the pooled estimator is probably preferred: it is as asymptotically efficient as the full RA estimator and conserves on degrees of freedom, which may be important if $N$ is not large and $K$ (the number of covariates) is somewhat large. For $\rho \neq 1/2$, Theorem 1.5.2 shows that the full RA estimator is attractive provided small-sample issues are not important. In particular, $\hat{\tau}_{FRA}$ is always asymptotically more efficient than both $\hat{\tau}_{SDM}$ and $\hat{\tau}_{PRA}$ in the presence of heterogeneous slopes, and there is no (asymptotic) price to pay if $\beta_1 = \beta_0$ or even if $\beta_1 = \beta_0 = 0$. Estimating the $2K$ parameters is, asymptotically, harmless when it comes to the precision in estimating $\tau$.

It may be helpful to provide intuition as to why $\hat{\tau}_{FRA}$ is more efficient than $\hat{\tau}_{SDM}$. Consider estimating the mean of the potential outcome in the treated state, $\mu_1$. The FRA estimator is $\hat{\mu}_{1,FRA} = \hat{\alpha}_1 + \bar{X}\hat{\beta}_1$, where $\bar{X}$ is the sample average across the entire sample. By the simple mechanics of OLS,

$$\bar{Y}_1 = \hat{\alpha}_1 + \bar{X}_1\hat{\beta}_1,$$

where

$$\bar{X}_1 = N_1^{-1}\sum_{i=1}^N W_i X_i$$

is the sample average of the $X_i$ over the treated units. Because of random assignment, $\bar{X}_1$ is also a $\sqrt{N}$-consistent, asymptotically normal estimator of $\mu_X$. But it is inefficient compared with $\bar{X}$ because the latter uses the entire sample. The same is true of $\hat{\mu}_{0,FRA}$ and $\bar{Y}_0$. This does not quite prove that $\hat{\tau}_{FRA}$ is asymptotically more efficient because $\hat{\mu}_{1,FRA}$ and $\hat{\mu}_{0,FRA}$ are not (asymptotically) uncorrelated, but it does provide some intuition. The same sort of intuition indicates why $\hat{\tau}_{IRA}$ is asymptotically more efficient than $\hat{\tau}_{FRA}$: $\hat{\tau}_{IRA}$ is not subject to the sampling error in estimating $\mu_X$.
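As a quick numerical illustration of Lemma 1.5.1 and Theorem 1.5.2, the snippet below evaluates the variance formulas (1.22), (1.24), (1.26), and (1.28) at arbitrary made-up parameter values (scalar covariate) and confirms the ranking $\omega^2_{IRA} \leq \omega^2_{FRA} \leq \min(\omega^2_{PRA}, \omega^2_{SDM})$:

```python
import numpy as np

# Illustrative parameter values; all numbers here are invented
Omega_X = np.array([[2.0]])        # Var(X)
beta0 = np.array([1.0])
beta1 = np.array([3.0])
sig0_sq, sig1_sq, rho = 1.0, 2.0, 0.3

noise = sig1_sq / rho + sig0_sq / (1 - rho)   # common sigma^2 terms
d = beta1 - beta0

w_sdm = beta1 @ Omega_X @ beta1 / rho + beta0 @ Omega_X @ beta0 / (1 - rho) + noise
w_pra = ((1 - rho)**2 / rho + rho**2 / (1 - rho)) * (d @ Omega_X @ d) + noise
w_fra = d @ Omega_X @ d + noise
w_ira = noise

print(w_ira <= w_fra <= min(w_pra, w_sdm))    # True: Theorem 1.5.2(i)
```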
1.6 Simulations

In this section we study the finite sample properties of the four estimators just discussed. We evaluate the estimators primarily in terms of root mean squared error (RMSE), since this accounts for bias as well as sampling variance. Since bias has been cited as a concern with covariate adjustment estimators, especially in small-scale experiments, looking at the trade-off between bias and efficiency through RMSE is key to studying the finite sample performance of these estimators. In order to compute the RMSE, we first generate a population of 10,000 observations to approximate an "infinite" population setting. We then repeatedly draw random samples of sizes 100, 500, and 1,000 from this population, with one thousand replications. For a comprehensive assessment, we report the RMSE across these different sample size and treatment probability combinations, where the treatment probabilities range from 0.1 to 0.9. To keep the tables simple, we report results only for the odd treatment probabilities, even though the graphs are plotted for all values. The reported simulation results are for the case of heterogeneous treatment effects in the population, both in terms of the slopes of the linear projections and in the distribution of the projection errors, $U(0)$ and $U(1)$.

1.6.1 Design details

The treatment, $W$, is a binary variable, and so it has a Bernoulli distribution with $P(W = 1) = \rho$, and we vary the value of $\rho$. For the potential outcomes, we consider continuous and discrete responses. In the first design, the potential outcomes are conditionally normally distributed, with means linear in a quadratic in two covariates, $X_1$ and $X_2$. Specifically,

$$Y(0) = \gamma_{00} + \gamma_{01}X_1 + \gamma_{02}X_2 + \gamma_{03}X_1^2 + \gamma_{04}X_2^2 + \gamma_{05}X_1X_2 + R(0) \equiv Z\gamma_0 + R(0)$$
$$Y(1) = \gamma_{10} + \gamma_{11}X_1 + \gamma_{12}X_2 + \gamma_{13}X_1^2 + \gamma_{14}X_2^2 + \gamma_{15}X_1X_2 + R(1) \equiv Z\gamma_1 + R(1),$$

where

$$Z = \left(1, X_1, X_2, X_1^2, X_2^2, X_1X_2\right)$$

and

$$R(0)\,|\,(X_1, X_2) \sim N(0, \sigma_0^2)$$
$$R(1)\,|\,(X_1, X_2) \sim N(0, \sigma_1^2).$$

We allow the $\gamma_{wj}$ to differ across $w \in \{0, 1\}$, and so there is heterogeneity in the treatment effects in terms of the observables, $X$, and the unobservables, $R(0)$ and $R(1)$, which are allowed to be correlated.² We also allow $\sigma_0^2$ and $\sigma_1^2$ to differ.

² $R(0)$ and $R(1)$ are generated to be affine transformations of the same standard normal variable.

It is important to understand that, in order to be realistic, we do not assume that the quadratic conditional mean function is known. Instead, the researcher uses only linear regression on a constant, $X_1$, and $X_2$. In a traditional view of econometrics, these regressions would be "misspecified." Of course, to ensure we have the best mean squared error predictors of $Y(0)$ and $Y(1)$, we would use the correct specifications of $E[Y(0)\,|\,X]$ and $E[Y(1)\,|\,X]$. But it would be unusual for us to know the exact specification of the conditional mean functions. One can argue that most empirical researchers would include simple functions, such as squares and interactions. But then the true mean function could depend on higher order polynomials, or other more exotic functions. In fact, the mean might not even be linear in parameters. We take our setup as reflecting the realistic case that the researcher uses a linear regression that does not correspond to the correct conditional mean.

Our second design generates the potential outcomes as binary variables. Remember, when $W$ is randomly assigned, we can use any kind of linear regression adjustment to improve asymptotic efficiency, regardless of the nature of $Y(0), Y(1)$. Specifically, for $Z$ defined above,

$$Y(0) = 1[Z\gamma_0 + R(0) > 0]$$
$$Y(1) = 1[Z\gamma_1 + R(1) > 0],$$

where $R(0)$ and $R(1)$ are again independent of $(X_1, X_2)$ and normally distributed, this time each with unit variances. As before, $R(0)$ and $R(1)$ are allowed to be correlated.
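Both designs can be sketched in a few lines of code. In the minimal sketch below, the covariate distribution follows the bivariate normal design given in the next paragraphs and the $\gamma$ vectors are the "mild heterogeneity" values for continuous covariates; the error scales sd0 and sd1, and the device of making $R(0)$ and $R(1)$ perfectly correlated through a single common draw (one way of satisfying footnote 2), are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10_000
X = rng.multivariate_normal([1.0, 2.0], [[2.0, 0.5], [0.5, 3.0]], size=n)
x1, x2 = X[:, 0], X[:, 1]
Z = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])

base = rng.normal(size=n)           # common normal draw
sd0, sd1 = 3.0, 4.0                 # illustrative error scales
r0, r1 = sd0 * base, sd1 * base     # affine transformations of base

gamma0 = np.array([2, 2, -2, -0.05, -0.02, 0.3])
gamma1 = np.array([3, 1, -1, -0.05, -0.03, 0.6])

# Quadratic design: continuous potential outcomes
y0_quad, y1_quad = Z @ gamma0 + r0, Z @ gamma1 + r1

# Probit design: binary potential outcomes with unit-variance errors
y0_bin = (Z @ gamma0 + base > 0).astype(float)
y1_bin = (Z @ gamma1 + base > 0).astype(float)

# Random assignment and the switching equation (1.2)
W = rng.binomial(1, 0.5, size=n)
Y = (1 - W) * y0_quad + W * y1_quad
```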
In the binary response case, one might traditionally think of two forms of "misspecification" in using linear regression adjustment on $(1, X_1, X_2)$. First, we are using what is traditionally called a "linear probability model" rather than the correct probit model. Second, we are omitting the terms $X_1^2$, $X_2^2$, and $X_1X_2$. Thus, there are two kinds of functional form "misspecification." Our view is that, to make a case for linear regression adjustment, it should produce notable efficiency gains even when the potential outcomes are discrete (although we return to this issue in the next section).

We consider two different designs for generating the covariates. Both are based on an underlying bivariate normal distribution:

$$X^{*\prime} = \begin{pmatrix} X_1^* \\ X_2^* \end{pmatrix} \sim N\left[\begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 & 0.5 \\ 0.5 & 3 \end{pmatrix}\right].$$

In the first design, $X = X^*$. In the second design, $X_1 = X_1^*$ and $X_2 = 1[X_2^* > 0]$, so that $X_2$ is binary (in which case $X_2^2$ is redundant in the mechanism generating the potential outcomes).

With the linear and probit data we consider two different levels of heterogeneity across the coefficients $\gamma_0$ and $\gamma_1$, which we label "mild" and "strong." For mild heterogeneity with continuous covariates,

$$\gamma_0' = (2, 2, -2, -0.05, -0.02, 0.3), \qquad \gamma_1' = (3, 1, -1, -0.05, -0.03, 0.6),$$

and for strong heterogeneity,

$$\gamma_0' = (0, 1, -1, -0.05, 0.02, 0.6), \qquad \gamma_1' = (1, -1, 1.5, 0.03, -0.02, -0.6).$$

With the binary regressor, for mild heterogeneity we have

$$\gamma_0' = (0, 1, -1, 0.05, 0.2), \qquad \gamma_1' = (3, 3, 1, 0.05, 0.9),$$

and for strong heterogeneity we have

$$\gamma_0' = (0, 1, -2, -0.05, 0.2), \qquad \gamma_1' = (3, -1, 1, 0.05, -0.9).$$

Combining the linear and probit designs for the potential outcomes with two different levels of heterogeneity and two different covariate compositions leads to a total of eight scenarios. We allow the treatment probability, $\rho$, to range between 0.1 and 0.9 in increments of 0.1. We consider three sample sizes: 100, 500, and 1,000. Note that when $N = 100$ and $\rho = 0.1$, we expect only 10 treated units and 90 control units. The covariates are generated to ensure that they are predictive of the potential outcomes, with the population R-squareds ranging between 0.1 and 0.6. The variances $\sigma_0^2$ and $\sigma_1^2$ are allowed to differ between the two potential outcomes and across four of the eight data generating processes. We assess the relative finite sample performance of the four estimators under each such scenario, which we term a DGP. Table 1.1 describes each DGP in detail.

Table 1.1: Description of the data generating processes

DGP | Design    | Covariates       | Heterogeneity | R²₀  | R²₁  | σ²₀ | σ²₁ | PATE
 1  | Quadratic | X                | Mild          | 0.52 | 0.44 |  9  | 16  | 2.68
 2  | Quadratic | X                | Strong        | 0.31 | 0.46 | 16  |  9  | 0.93
 3  | Quadratic | X₂ = 1[X₂* > 0]  | Mild          | 0.59 | 0.33 |  4  |  1  | 7.46
 4  | Quadratic | X₂ = 1[X₂* > 0]  | Strong        | 0.27 | 0.34 |  4  |  9  | 2.92
 5  | Probit    | X                | Mild          | 0.59 | 0.38 |  1  |  1  | 0.28
 6  | Probit    | X                | Strong        | 0.51 | 0.45 |  1  |  1  | 0.09
 7  | Probit    | X₂ = 1[X₂* > 0]  | Mild          | 0.45 | 0.28 |  1  |  1  | 0.35
 8  | Probit    | X₂ = 1[X₂* > 0]  | Strong        | 0.38 | 0.40 |  1  |  1  | 0.43

1.6.2 Discussion of simulation findings

Across the eight different DGPs, we see that FRA performs better than SDM and PRA in terms of RMSE. This behavior is more pronounced at larger sample sizes, as seen from the figures. Two things are worth pointing out. First, the difference between IRA and FRA is less pronounced in cases of mild heterogeneity. In such cases, PRA also performs comparably.
1.6.2 Discussion of simulation findings

Across the eight DGPs, FRA performs better than SDM and PRA in terms of RMSE, and this behavior is more pronounced at larger sample sizes, as seen in the figures. Two things are worth pointing out. First, the difference between IRA and FRA is less prominent in the cases of mild heterogeneity; in such cases, PRA also performs comparably. This makes sense: pooling the slopes across the treatment and control groups when the slopes are not very different should produce estimates close to those from the separate slopes regression. Second, as was clear from Theorem 1.5.2, PRA and FRA have approximately the same RMSE at ρ = 0.5. This is not surprising in the graphs because, at larger sample sizes, the biases in these estimators are negligible, which means the RMSE is approximately the same as the standard deviation.

Overall, the finite sample performance of FRA is superior to that of SDM and PRA for a variety of data generating processes (see Figures A.1 and A.2 for the quadratic design with mild and strong levels of heterogeneity, A.3 and A.4 for the quadratic design with one binary covariate, A.5 and A.6 for the probit design, and A.7 and A.8 for the probit design with one binary covariate; for tables, see ??, ??, and ??).

1.7 Nonlinear regression adjustment

If the outcome Y – more precisely, the potential outcomes, Y(0) and Y(1) – is discrete, or has limited support, using nonlinear conditional mean functions, chosen to ensure that fitted values are logically consistent with E[Y|X], has considerable appeal. Intuitively, getting better approximations to E[Y(0)|X] and E[Y(1)|X] can yield estimators with smaller asymptotic variances compared with the SDM estimator and linear regression adjustment. However, as cautioned by Imbens and Rubin (2015), page 128, one should not sacrifice consistency in order to obtain an asymptotically more efficient estimator. Imbens and Rubin (2015) leave the impression that all nonlinear models should be avoided because consistency cannot be ensured. In this section, we use the features of the linear exponential family of distributions, combined with particular conditional mean models, to show that if one is careful in choosing the combination of conditional mean function and quasi-log-likelihood (QLL) function, one can preserve consistency. Unfortunately, we cannot formally show that using this particular set of nonlinear models is more efficient than the SDM estimator, but our simulations suggest the efficiency gains can be substantial. (And we have found no cases where it is worse to do nonlinear RA.)

In deciding on nonlinear RA methods, the key is to remember that τate = µ1 − µ0, and so we need to consistently estimate µ1 and µ0 without imposing additional assumptions. Earlier we showed how linear regression adjustment does just that. And linear RA, when done separately to estimate µ0 and µ1, is asymptotically more efficient than the SDM estimator. Our goal here is to summarize the nonlinear methods that produce consistent estimators of τate without additional assumptions (beyond standard regularity conditions). We start with pooled methods.

1.7.1 Pooled nonlinear regression adjustment

In the generalized linear models (GLM) literature, it is well known that certain combinations of QLLs in the linear exponential family (LEF) and link functions lead to first order conditions where, in the sample, the residuals average to zero and are uncorrelated with every explanatory variable. To state the precise results, let g(·) be a strictly increasing function on R, with a range that can be a subset of R. The inverse, g^{-1}(·), is known as the "link function" in the GLM literature.
In the context of treatment effect estimation with mean function g(α + xβ + γw), when using the so-called canonical link function [McCullagh and Nelder (1989)], the first order conditions (FOCs) are of the form

Σ_{i=1}^{N} [Yi − g(α̂ + Xiβ̂ + γ̂Wi)] = 0
Σ_{i=1}^{N} Wi[Yi − g(α̂ + Xiβ̂ + γ̂Wi)] = 0
Σ_{i=1}^{N} Xi'[Yi − g(α̂ + Xiβ̂ + γ̂Wi)] = 0.

When g(z) = z, these equations produce the first order conditions for the pooled OLS estimator. The leading cases where these conditions hold for nonlinear estimation are the Bernoulli QLL when g(z) = Λ(z) = exp(z)/[1 + exp(z)] is the logistic function, and the Poisson QLL when g(z) = exp(z).

Under random sampling and weak regularity conditions, the probability limits of the estimators solve the population versions of the sample moment conditions. Let α*, β*, and γ* denote the probability limits of α̂, β̂, and γ̂, respectively. Importantly, as argued in White (1982), these plims exist very generally without assuming that the mean function is correctly specified – just as the parameters in a linear projection exist under very weak assumptions. The first two FOCs in the population are

E[Y − g(α* + Xβ* + γ*W)] = 0     (1.32)
E{W[Y − g(α* + Xβ* + γ*W)]} = 0;     (1.33)

we will not need the last set of conditions, obtained from the gradient with respect to β. As before, we assume that ρ = P(W = 1) satisfies 0 < ρ < 1.

Now, recall that Y = (1 − W)Y(0) + WY(1). Then, by random assignment,

E(Y) = E(1 − W)E[Y(0)] + E(W)E[Y(1)] = (1 − ρ)µ0 + ρµ1.

Therefore, we can write (1.32) as

(1 − ρ)µ0 + ρµ1 = E[g(α* + Xβ* + γ*W)].

By iterated expectations,

E[g(α* + Xβ* + γ*W)] = E{E[g(α* + Xβ* + γ*W)|X]},

and, because W is independent of X with P(W = 1) = ρ,

E[g(α* + Xβ* + γ*W)|X] = (1 − ρ)g(α* + Xβ*) + ρg(α* + Xβ* + γ*).

Therefore,

(1 − ρ)µ0 + ρµ1 = (1 − ρ)E[g(α* + Xβ*)] + ρE[g(α* + Xβ* + γ*)].     (1.34)

Further, using WY = WY(1), from (1.33),

E[WY(1)] = E[Wg(α* + Xβ* + γ*W)].

Again using random assignment and iterated expectations,

ρµ1 = (1 − ρ)·0 + ρE[g(α* + Xβ* + γ*)] = ρE[g(α* + Xβ* + γ*)].

Because ρ > 0, we have

µ1 = E[g(α* + Xβ* + γ*)].     (1.35)

Also, because ρ < 1, (1.34) now implies

µ0 = E[g(α* + Xβ*)].     (1.36)

It follows that

τate = E[g(α* + Xβ* + γ*)] − E[g(α* + Xβ*)].     (1.37)

This equation is essentially the basis for proving that pooled regression adjustment, where we use a QLL in the linear exponential family and the conditional mean implied by the canonical link function, is consistent. Consistency follows because the estimated ATE using the pooled method is

τ̂ate,pooled = N^{-1} Σ_{i=1}^{N} [g(α̂ + Xiβ̂ + γ̂) − g(α̂ + Xiβ̂)].     (1.38)

By Wooldridge (2010), question 12.17, this converges in probability to (1.37).
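For concreteness, here is a minimal sketch of the pooled Bernoulli/logistic case, computing (1.38) directly. Python with statsmodels is an assumed implementation choice, not part of the original chapter; any GLM routine with a robust covariance option would serve.

```python
import numpy as np
import statsmodels.api as sm

def pooled_logit_ate(y, w, X):
    """Pooled quasi-MLE with the canonical (logit) link; returns equation (1.38)."""
    n = len(y)
    D = np.column_stack([np.ones(n), X, w])
    # "Robust" (sandwich) covariance, since the mean may be misspecified
    fit = sm.GLM(y, D, family=sm.families.Binomial()).fit(cov_type="HC0")
    d1 = np.column_stack([np.ones(n), X, np.ones(n)])   # everyone set to W = 1
    d0 = np.column_stack([np.ones(n), X, np.zeros(n)])  # everyone set to W = 0
    # Average difference of the two fitted logistic means over the full sample
    return fit.predict(d1).mean() - fit.predict(d0).mean()
```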
As a practical matter, it is convenient to note that (1.38) is exactly the quantity reported by standard software packages when one requests the average "marginal" (or "partial") effect of the binary variable W after a standard GLM estimation. Packages that have this pre-programmed also provide a valid standard error, although one must be sure to use a "robust" option during the GLM estimation so that a sandwich estimator is used for the asymptotic variance of the parameter estimators (α̂, β̂', γ̂)'.

Table B.1 summarizes the useful combinations of QLL and mean functions that lead to consistent estimation of τate without additional assumptions. The Bernoulli/logistic case applies to binary or fractional outcomes, without change. When Y is fractional, it can have probability mass at zero, one, or anywhere else. See Papke and Wooldridge (1996) for further discussion. In any case, we treat the problem as quasi-MLE because we do not wish to assume that either the distribution or the mean function is correct.

The Poisson/exponential combination is very useful for nonnegative outcomes without a natural upper bound, although it can be applied to any nonnegative outcome. This includes count outcomes but also continuous outcomes and outcomes with corner solutions at zero (or at other focal points). In the latter case, it is important to understand that commonly used models, such as the Tobit, do not provide any known robustness to misspecification of the Tobit model. By contrast, the Poisson QMLE with an exponential mean provides full robustness. Remember, we are not trying to estimate the conditional mean functions; we are trying to consistently estimate the unconditional means, µ0 and µ1. Other than linear regression, the Poisson QMLE with an exponential mean is the only sensible choice for nonnegative, unbounded responses.

If the outcome has a natural, known upper bound, say Bi, which may vary by unit i, the binomial QMLE can be used in conjunction with the mean function

m(b, x, w, θ) = b·exp(α + xβ + γw)/[1 + exp(α + xβ + γw)],

as this is known to be the mean associated with the canonical link for the binomial distribution. The data then consist of (Yi, Bi, Xi, Wi). Again, it does not matter whether Yi is an integer response or is continuous, or even has a corner at zero, at Bi, or both: using the binomial QMLE with a logistic mean is simply a way to possibly improve over SDM or linear RA. The estimated ATE is

N^{-1} Σ_{i=1}^{N} Bi { exp(α̂ + Xiβ̂ + γ̂)/[1 + exp(α̂ + Xiβ̂ + γ̂)] − exp(α̂ + Xiβ̂)/[1 + exp(α̂ + Xiβ̂)] };

again, this is simply the average partial effect with respect to the binary variable W.

The last entry in Table B.1 extends the Bernoulli QLL/logistic mean and is relevant in two general situations. The first is when the support of the response is finite (and greater than two; otherwise one would use the logistic mean function with the Bernoulli QLL). For example, Y(w) could be an ordered response, such as a measure of health on a Likert scale, or an unordered response, such as the choice of a health plan. A second situation is when the response consists of fractions that sum to unity, such as the proportions of wealth in different investment categories, in which case the model has been called "multinomial fractional logit" [Mullahy (2015)]. If there are G + 1 possible outcomes, then there are G + 1 means each for the control and treated states.
The conditional mean functions for a pooled estimation would be

mg(x, w, θ) = exp(αg + xβg + γgw)/[1 + Σ_{h=1}^{G} exp(αh + xβh + γhw)], g = 0, 1, ..., G,

with α0 = 0, β0 = 0, γ0 = 0. Then the estimated means are

µ̂wg = N^{-1} Σ_{i=1}^{N} exp(α̂g + Xiβ̂g + γ̂gw)/[1 + Σ_{h=1}^{G} exp(α̂h + Xiβ̂h + γ̂hw)], w ∈ {0, 1}, g ∈ {0, 1, ..., G},

and the estimated average treatment effect for each g is

τ̂ate,g = µ̂1g − µ̂0g.

Because Σ_{g=0}^{G} µ̂wg = 1 for each w, the sum over g of the τ̂ate,g is necessarily zero.

1.7.2 Full nonlinear regression adjustment

As in the linear case, consistency is preserved if we estimate two separate regression functions for the control and treated cases. This follows from the Wooldridge (2007) results on doubly robust estimators, where, in the current setting, the propensity score, P(W = 1|X = x) = ρ, is not a function of x. But a direct argument is easier to follow. For example, consider using a QLL with the canonical link function using only the treated units. The FOC for α̂1, the intercept inside the conditional mean function, is simply

Σ_{i=1}^{N} Wi[Yi − g(α̂1 + Xiβ̂1)] = 0.

Notice again how the treatment indicator serves to select the subsample of treated units. The population analog is

E[WY(1)] = E[Wg(α1* + Xβ1*)]

or, because of random assignment,

ρµ1 = ρE[g(α1* + Xβ1*)].

It follows that

µ1 = E[g(α1* + Xβ1*)].

The same argument works for the untreated case, where Wi is replaced with (1 − Wi) and (α̂1, β̂1')' is replaced with (α̂0, β̂0')'. The conclusion is

µ0 = E[g(α0* + Xβ0*)].

Remember, the parameters with a "*" now indicate the probability limits from the two separate estimations, rather than being the same parameters as in the pooled estimation. It follows under general regularity conditions that a consistent and asymptotically normal estimator of τate is

τ̂ate,full = N^{-1} Σ_{i=1}^{N} [g(α̂1 + Xiβ̂1) − g(α̂0 + Xiβ̂0)].

As a practical matter, some packages, such as Stata, have built-in commands for some full RA estimators, including the Bernoulli/logistic and Poisson/exponential combinations, and so a standard error is computed along with the estimate. Again, one must be sure to use a robust variance matrix estimator for the parameters. Alternatively, using a bootstrap routine is straightforward for these kinds of estimators.

In deciding on a procedure to use – linear versus nonlinear, pooled versus full – it is important to understand that all of the methods studied in this paper produce consistent estimators of τate. In the linear case, we have the result that full RA is asymptotically efficient relative to SDM and pooled RA. As mentioned earlier, a proof that full nonlinear RA is asymptotically more efficient than the pooled version is elusive. Also, we have not proven that full nonlinear RA is always at least as asymptotically efficient as SDM. We now report representative simulations showing that the nonlinear methods can improve precision substantially in some cases, without introducing bias, even in fairly small samples.
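Before turning to those simulations, here is a sketch of full nonlinear RA for the Poisson/exponential combination; statsmodels is again an assumed implementation used only for illustration.

```python
import numpy as np
import statsmodels.api as sm

def full_poisson_ate(y, w, X):
    """Full (separate) RA with exponential means: the tau-hat_{ate,full} formula."""
    n = len(y)
    D = np.column_stack([np.ones(n), X])
    # Separate Poisson QMLEs on the control and treated subsamples
    fit0 = sm.GLM(y[w == 0], D[w == 0],
                  family=sm.families.Poisson()).fit(cov_type="HC0")
    fit1 = sm.GLM(y[w == 1], D[w == 1],
                  family=sm.families.Poisson()).fit(cov_type="HC0")
    # Average both exponential-mean predictions over the FULL sample
    return fit1.predict(D).mean() - fit0.predict(D).mean()
```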
1.7.3 Simulations

For the nonlinear simulations we only consider continuous covariates, which means that for both the binary and nonnegative data generating processes, X = X*, where

X* = (X1*, X2*)' ~ N( (1, 2)', [2, 0.5; 0.5, 3] ).

As with the linear simulations, Z = (1, X1, X2, X1^2, X2^2, X1X2). The tables report bias and standard deviation for sample sizes of N = 500 and N = 1,000. To keep the tables simple, we only report values for treatment probabilities ρ = 0.1, 0.3, 0.5, 0.7, 0.9, even though the graphs are plotted for ρ ranging from 0.1 to 0.9.

1.7.3.1 Binary response

For the binary response case, the outcomes are generated using a probit mean function:

Y(0) = 1[Zγ0 + R(0) > 0]
Y(1) = 1[Zγ1 + R(1) > 0],

where

γ0' = (−2, 1, 2, 0.05, 0.02, 0.1), γ1' = (0, 3, 1, −0.05, 0.03, 0.9)

and

R(0)|(X1, X2) ~ N(0, 1)
R(1)|(X1, X2) ~ N(0, 1),

with R(0) and R(1) allowed to be correlated. When computing the nonlinear pooled and separate slopes estimators, we use case (ii) in Table B.1. We find that the separate slopes nonlinear estimator (NFRA) has the lowest root mean squared error compared with the linear estimators and the pooled nonlinear estimator (NPRA) for all treatment probabilities (see Figure A.9). The tables show that the nonlinear estimators have bias comparable to that of the SDM estimator (see Table ??).

1.7.3.2 Nonnegative response

For the nonnegative response, the outcomes are generated using a lognormal distribution:

Y(0) = exp[(Zγ0 + R(0))/10 + 0.3·N(0, 1)]
Y(1) = exp[(Zγ1 + R(1))/10 + 0.4·N(0, 1)],

where

γ0' = (0, 2, −1, −0.05, 0.02, 0.6), γ1' = (1, −1, 1.5, 0.03, −0.02, −0.6)

and

R(0)|(X1, X2) ~ N(0, 4)
R(1)|(X1, X2) ~ N(0, 9),

with R(0) and R(1) allowed to be correlated. When computing the nonlinear pooled and separate slopes estimators, we use case (iv) in Table B.1. Similar to the binary response simulations, NFRA again has the lowest root mean squared error compared with both the linear and the pooled nonlinear estimators across all treatment probabilities. In fact, NFRA performs better than both SDM and FRA. The NPRA and linear PRA estimators are very similar in terms of RMSE; see Table ?? and Figure A.10.

1.8 Concluding remarks

We have studied linear and nonlinear regression adjustment estimators of the average treatment effect in an experimental framework. For linear regression adjustment, this paper makes some key contributions to the econometrics literature on randomized experiments. First, by considering a previously ignored aspect of the separate slopes estimator, this paper fills a clear gap in the literature by showing that the full RA estimator is always the most efficient, even when the population means of the covariates are estimated using the sample averages. Second, in obtaining our results, we rely only on linear projections, and so no extra assumptions are used in establishing asymptotic efficiency. Third, the setup allows us to determine when using full RA, or RA at all, is unnecessary for achieving efficiency. Our simulation findings support the theory and show that substantial efficiency gains are possible when we have good predictors of the response. Obtaining the correct standard errors for the full RA estimator is particularly simple in commonly used software packages.
For example, Stata®, with its built-in "teffects" command, provides the correct standard errors for the FRA estimator.

As an interesting complement to our work, Słoczyński (2018) studies pooled versus full RA when assignment is unconfounded conditional on covariates. Assuming that the conditional means are linear in parameters, Słoczyński (2018) shows that using pooled RA when the treatment effects are heterogeneous is inconsistent for the ATE in a way that is particularly troublesome in designs that are heavily unbalanced. In particular, the pooled RA estimator consistently estimates the weighted average (1 − ρ)·τATT + ρ·τATU, where τATT is the average treatment effect on the treated (W = 1) and τATU is the ATE on the untreated (W = 0). The ATE can be expressed as τATE = ρ·τATT + (1 − ρ)·τATU, and so the PRA estimator, in the limit, gets the weights reversed. Under random assignment, there is no difference between τATE, τATT, and τATU, and so consistency of PRA for τATE is not the issue. But, as we showed, the pooled RA estimator is generally inefficient when treatment effects are heterogeneous. Also, when ρ = 1/2, there is no inconsistency in the pooled RA estimator when unconfoundedness holds. As we have shown in this paper, in the random assignment case, ρ = 1/2 is precisely the condition that implies no efficiency gain from full RA, even when there is arbitrary heterogeneity in the treatment effects. Our findings mesh well with those of Słoczyński (2018), with the conclusion that, in moderate samples, FRA should be used unless ρ is known to be close to 1/2.

In addition to the linear estimators, we also propose nonlinear regression adjustment estimators, characterizing the combinations of quasi-log-likelihood functions and conditional mean functions that ensure consistency regardless of misspecification. We believe this paper is the first to do so. We do not have theoretical results showing when the nonlinear RA methods unambiguously improve asymptotic efficiency, and this is a good topic for future research. However, our simulations suggest that nonlinear adjustment estimators can have bias comparable to that of the simple difference in means (SDM) and can produce sampling variances that are considerably smaller than that of SDM and, in a majority of cases, substantially smaller than that of linear feasible regression adjustment.

Going forward, there are many natural extensions. One is to study an assignment scheme different from the one considered here. This paper assumes independence across treatment assignments, but a more common design, known as the completely randomized experiment, induces dependence across units by fixing the number of treated units before sampling from the population. Also, because most randomized experiments in economics are plagued by issues of nonparticipation or nonrandom attrition, it would be fruitful to study regression adjustment in conjunction with instrumental variables (IV) methods. Comparing the efficiency of standard regression adjustment estimators under random assignment to estimators based on stratified assignment schemes is also a good area for future research.

CHAPTER 2
ROBUST AND EFFICIENT ESTIMATION OF POTENTIAL OUTCOME MEANS UNDER RANDOM ASSIGNMENT†

†This work is joint with Jeffrey M. Wooldridge and is unpublished.

2.1 Introduction

In the past several decades, the potential outcomes framework has become a staple of causal inference in statistics, econometrics, and related fields.
Envisioning each unit in a population under different states of intervention or treatment allows one to define treatment or causal effects without referencing a model. One merely needs the means of the potential outcomes, or perhaps the potential outcome (PO) means for subpopulations. When interventions are randomized – whether the assignment is to control and treatment groups in a clinical trial (Hirano and Imbens (2001)), assignment to participate in a job training program (Calónico and Smith (2017)), receiving a school voucher when studying the effects of private schooling on educational outcomes (Angrist et al. (2006a)), or a contingent valuation study, where different bid values are randomized among people (Carson et al. (2004)) – one can simply use the subsample means for each treatment level to obtain unbiased and consistent estimators of the PO means. In some cases, the precision of the subsample means will be sufficient. Nevertheless, with the availability of good predictors of the outcome or response, it is appealing to think that the precision can be improved, thereby shrinking confidence intervals and making conclusions about interventions more reliable.

In this paper we build on Negi and Wooldridge (2019), who studied the problem of estimating the average treatment effect under random assignment with one control group and one treatment group. In the context of random sampling, we showed that performing separate linear regressions for the control and treatment groups in estimating the average treatment effect never does worse, asymptotically, than the simple difference in means estimator or a pooled regression adjustment estimator. In addition, we characterized the class of nonlinear regression adjustment methods that produce consistent estimators of the ATE without any additional assumptions (except regularity conditions). The simulation findings for both the linear and nonlinear cases are quite promising when covariates are available that predict the outcomes.

In the current paper, we consider any number of "treatment" levels and study the problem of jointly estimating the vector of potential outcome means. We assume that the assignment to treatment is random – that is, independent of both the potential outcomes and observed predictors of the POs. Importantly, other than standard regularity conditions (such as finite second moments of the covariates), we impose no additional assumptions. In other words, the full RA estimators are consistent under essentially the same assumptions as the subsample means with, generally, smaller asymptotic variance. Interestingly, even if the predictors are unhelpful, or the slopes in the linear projections are the same across all groups, no asymptotic efficiency is lost by using the most general RA method.

We also extend the nonlinear RA results in Negi and Wooldridge (2019) to the general case of G assignment levels. We show that for particular kinds of responses, such as binary, fractional, or nonnegative, it is possible to consistently estimate the PO means using pooled and separate regression adjustment. Unlike in the linear regression adjustment case, we do not have any general asymptotic results comparing full nonlinear RA with pooled nonlinear RA.

Finally, we apply the full RA estimator to data from a contingent valuation study obtained from Carson et al. (2004).
The data are used to elicit a lower bound on the mean willingness to pay (WTP) for a program that would prevent future oil spills along California's central coast. Our results show that the PO means for the five different bid amounts that were randomly assigned to California residents are estimated more efficiently using separate regression adjustment than using subsample averages. This efficiency result is preserved for estimating the lower bound, since it is a linear combination of the PO means. Hence, using separate RA also delivers a more precise lower bound mean WTP for the oil spill prevention program than the ABERS estimator, which uses subsample averages to construct the estimate.

A Monte Carlo exercise substantiates the theoretical results across three kinds of data generating processes. We generate the outcomes to be either continuous nonnegative values or multinomial responses. In addition, we consider four different configurations of the assignment vector. In each setting, we find FRA to be at least as precise as SM across three different sample sizes. The performance of PRA relative to SM is less decisive, since some of the PRA estimates of the means are noisier than their SM counterparts.

The rest of the paper is organized as follows. Section 2.2 discusses the potential outcomes framework extended to the case of G treatment levels, along with a discussion of the crucial random sampling and random assignment assumptions. Section 2.3 derives the asymptotic variances of the different linear regression adjustment estimators, namely, subsample means, pooled regression adjustment, and full regression adjustment. Section 2.4 compares the asymptotic variances of the subsample means, pooled RA, and full RA estimators of the entire vector of PO means. Section 2.5 considers a class of nonlinear regression adjustment estimators that remain consistent for the PO means without imposing additional assumptions. Section 2.6 discusses applications of this framework to randomized experiments, difference-in-differences settings, and contingent valuation studies; this section also applies the full regression adjustment estimator to estimate the lower bound mean WTP in the California oil spill study using data from Carson et al. (2004). Section 2.7 constructs a Monte Carlo study of the finite sample behavior of the linear regression adjustment estimators, and Section 2.8 discusses the results of this study. Section 2.9 concludes.

2.2 Potential outcomes, random assignment, and random sampling

We use the standard potential outcomes framework, also known as the Neyman-Rubin causal model. The goal is to estimate the population means of G potential (counterfactual) outcomes, Y(g), g = 1, ..., G. Define

µg = E[Y(g)], g = 1, ..., G.

The vector of assignment indicators is

W = (W1, ..., WG),

where each Wg is binary and

W1 + W2 + ··· + WG = 1.

In other words, the groups are exhaustive and mutually exclusive. The setup applies to many situations, including the standard treatment-control setup, with G = 2; multiple treatment levels (with g = 1 the control group); the basic difference-in-differences setup, with G = 4; and contingent valuation studies, where subjects are presented with a set of G prices or bid values. We assume that each group, g, has a positive probability of being assigned:

ρg ≡ P(Wg = 1) > 0, g = 1, ..., G
ρ1 + ρ2 + ··· + ρG = 1.

Next, let

X = (X1, X2, ..., XK)

be a vector of observed covariates.
Assumption 2.2.1 (Random Assignment). Assignment is independent of the potential outcomes and observed covariates:

W ⊥ [Y(1), Y(2), ..., Y(G), X].

Further, the assignment probabilities are all strictly positive. □

Assumption 2.2.1 is what puts us in the framework of experimental interventions. It would be much too strong for an observational study.

Assumption 2.2.2 (Random Sampling). For a nonrandom integer N,

{[Wi, Yi(1), Yi(2), ..., Yi(G), Xi] : i = 1, 2, ..., N}

is independent and identically distributed. □

The IID assumption is not the only one we can make. For example, we could allow for a sampling-without-replacement scheme given a fixed sample size N. This would complicate the analysis because it generates slight correlation across draws. As discussed in Negi and Wooldridge (2019), Assumption 2.2.2 is traditional in studying the asymptotic properties of estimators and is realistic as an approximation. Importantly, it forces us to account for the sampling error in the sample average, X̄, as an estimator of µX = E(X).

For each draw i from the population, we only observe

Yi = Wi1Yi(1) + Wi2Yi(2) + ··· + WiGYi(G),

and so the data we have to work with are

{(Wi, Yi, Xi) : i = 1, 2, ..., N}.

Defining population quantities only requires the random vector (W, Y, X), which represents the population.

Assumptions 2.2.1 and 2.2.2 are the only substantive restrictions used in this paper. Subsequently, we assume that linear projections exist and that the central limit theorem holds for properly standardized sample averages of IID random vectors. Therefore, we are implicitly imposing at least finite second moment assumptions on the Y(g) and the Xj. We do not make this explicit in what follows.

2.3 Subsample means and linear regression adjustment

In this section we derive the asymptotic variances of three estimators: the subsample means, full (separate) regression adjustment, and pooled regression adjustment.

2.3.1 Subsample means

The simplest estimator of µg is the sample average within treatment group g:

Ȳg = Ng^{-1} Σ_{i=1}^{N} WigYi = Ng^{-1} Σ_{i=1}^{N} WigYi(g),

where

Ng = Σ_{i=1}^{N} Wig

is a random variable in our setting. In expressing Ȳg as a function of the Yi(g), we use WihWig = 0 for h ≠ g. Under random assignment and random sampling,

E(Ȳg | W1g, ..., WNg, Ng > 0) = Ng^{-1} Σ_{i=1}^{N} WigE[Yi(g) | W1g, ..., WNg, Ng > 0]
                             = Ng^{-1} Σ_{i=1}^{N} WigE[Yi(g)]
                             = Ng^{-1} Σ_{i=1}^{N} Wigµg = µg,

and so Ȳg is unbiased conditional on observing a positive number of units in group g. By the law of large numbers, a consistent estimator of ρg is

ρ̂g = Ng/N,

the sample share of units in group g. Therefore, by the law of large numbers and Slutsky's theorem,

Ȳg = (N/Ng)·(N^{-1} Σ_{i=1}^{N} WigYi(g)) →p ρg^{-1}E[WgY(g)] = ρg^{-1}E(Wg)E[Y(g)] = µg,

and so Ȳg is consistent for µg.

By the central limit theorem, √N(Ȳg − µg) is asymptotically normal. We need an asymptotic representation of √N(Ȳg − µg) that allows us to compare its asymptotic variance with those of the regression adjustment estimators. To this end, write

Y(g) = µg + V(g)
Ẋ = X − µX,

where Ẋ is X demeaned using the population mean, µX. Now project each V(g) linearly onto Ẋ:

V(g) = Ẋβg + U(g), g = 1, ..., G.
By construction, the population projection errors U(g) have the properties

E[U(g)] = 0, g = 1, ..., G
E[Ẋ'U(g)] = 0, g = 1, ..., G.

Plugging in gives

Y(g) = µg + Ẋβg + U(g), g = 1, ..., G.

Importantly, by random assignment, W is independent of [U(1), ..., U(G), Ẋ]. The observed outcome can be written as

Y = Σ_{g=1}^{G} Wg[µg + Ẋβg + U(g)].

Theorem 2.3.1 (Asymptotic variance of the subsample means estimator of the PO means). Under Assumptions 2.2.1 and 2.2.2 and finite second moments,

√N(Ȳ − µ) = N^{-1/2} Σ_{i=1}^{N} ( Wi1Ẋiβ1/ρ1 + Wi1Ui(1)/ρ1, ..., WiGẊiβG/ρG + WiGUi(G)/ρG )' + op(1)
           ≡ N^{-1/2} Σ_{i=1}^{N} (Li + Qi) + op(1),

where

Li ≡ ( Wi1Ẋiβ1/ρ1, Wi2Ẋiβ2/ρ2, ..., WiGẊiβG/ρG )'     (2.1)
Qi ≡ ( Wi1Ui(1)/ρ1, Wi2Ui(2)/ρ2, ..., WiGUi(G)/ρG )'.     (2.2)

By random assignment and the linear projection property, E(Li) = E(Qi) = 0 and E(LiQi') = 0. Also, because WigWih = 0 for g ≠ h, the elements of Li are pairwise uncorrelated; the same is true of the elements of Qi.

2.3.2 Full regression adjustment

To motivate full regression adjustment, write the linear projection for each g as

Y(g) = αg + Xβg + U(g)
E[U(g)] = 0
E[X'U(g)] = 0.

It follows immediately that µg = αg + µXβg.

Theorem 2.3.2 (Asymptotic variance of the full regression adjustment estimator of the PO means). Under Assumptions 2.2.1 and 2.2.2 and finite second moments,

√N(µ̂ − µ) = N^{-1/2} Σ_{i=1}^{N} ( Ẋiβ1 + Wi1Ui(1)/ρ1, ..., ẊiβG + WiGUi(G)/ρG )' + op(1)
           ≡ N^{-1/2} Σ_{i=1}^{N} (Ki + Qi) + op(1),

where

Ki = ( Ẋiβ1, Ẋiβ2, ..., ẊiβG )'     (2.3)

and Qi is given in (2.2).

Both Ki and Qi have zero means, the latter by random assignment. Further, by random assignment and the linear projection property, E(KiQi') = 0 because

E[Ẋi'WigUi(g)] = E(Wig)E[Ẋi'Ui(g)] = 0.

However, unlike the elements of Li, we must recognize that the elements of Ki are correlated, except in the trivial case where all but one of the βg are zero.

2.3.3 Pooled regression adjustment

Now consider the pooled RA estimator, µ̌, which can be obtained as the vector of coefficients on Wi = (Wi1, Wi2, ..., WiG) from the regression

Yi on Wi, Ẍi, i = 1, 2, ..., N.

We refer to this as a pooled method because the coefficients on Ẍi, say β̌, are assumed to be the same for all groups. Compared with the subsample means, we add the controls Ẍi; but, unlike FRA, the pooled method imposes the same coefficients across all g.

Theorem 2.3.3 (Asymptotic variance of the pooled regression adjustment estimator of the PO means). Under Assumptions 2.2.1 and 2.2.2, along with finite second moments,
√N(µ̌ − µ) = N^{-1/2} Σ_{i=1}^{N} ( ρ1^{-1}Wi1Ẋi(β1 − β) + Ẋiβ + Wi1Ui(1)/ρ1, ..., ρG^{-1}WiGẊi(βG − β) + Ẋiβ + WiGUi(G)/ρG )' + op(1)
           ≡ N^{-1/2} Σ_{i=1}^{N} (Fi + Ki + Qi) + op(1),

where β is the probability limit of β̌, Ki and Qi are as before, and, with δg = βg − β,

Fi = ( ρ1^{-1}(Wi1 − ρ1)Ẋiδ1, ρ2^{-1}(Wi2 − ρ2)Ẋiδ2, ..., ρG^{-1}(WiG − ρG)ẊiδG )'.     (2.4)

Notice that, again by random assignment and the linear projection property,

E(FiKi') = E(FiQi') = 0.

2.4 Comparing the asymptotic variances

We now take the representations derived in Section 2.3 and use them to compare the asymptotic variances of the three estimators. For notational clarity, it is helpful to summarize the conclusions reached in Section 2.3:

√N(µ̂SM − µ) = N^{-1/2} Σ_{i=1}^{N} (Li + Qi) + op(1)
√N(µ̂FRA − µ) = N^{-1/2} Σ_{i=1}^{N} (Ki + Qi) + op(1)
√N(µ̂PRA − µ) = N^{-1/2} Σ_{i=1}^{N} (Fi + Ki + Qi) + op(1),

where Li, Qi, Ki, and Fi are defined in (2.1), (2.2), (2.3), and (2.4), respectively.

2.4.1 Comparing FRA to subsample means

Theorem 2.4.1. Under the assumptions of Theorems 2.3.1 and 2.3.2,

Avar[√N(µ̂SM − µ)] − Avar[√N(µ̂FRA − µ)] = ΩL − ΩK,     (2.5)

which is PSD, where ΩL ≡ E(LiLi') and ΩK ≡ E(KiKi').

The one case where there is no gain in asymptotic efficiency from using FRA is when βg = 0, g = 1, ..., G, in which case X does not help predict any of the potential outcomes. Importantly, there is no gain in asymptotic efficiency from imposing βg = 0, which is what the subsample means estimator does. From an asymptotic perspective, it is harmless to separately estimate the βg even when they are zero. When they are not all zero, estimating them leads to asymptotic efficiency gains.

Theorem 2.4.1 implies that any smooth nonlinear function of µ is estimated more efficiently using µ̂FRA. For example, in estimating a percentage difference in means, we would be interested in µ2/µ1, and using the FRA estimators is asymptotically more efficient than using the SM estimators.

2.4.2 Full RA versus pooled RA

The comparison between FRA and PRA is simple given the representations above because, as stated earlier, Fi, Ki, and Qi are pairwise uncorrelated.

Theorem 2.4.2. Under the assumptions of Theorems 2.3.2 and 2.3.3,

Avar[√N(µ̂PRA − µ)] − Avar[√N(µ̂FRA − µ)] = ΩF ≡ E(FiFi'),

which is PSD. Therefore, µ̂FRA is never less asymptotically efficient than µ̂PRA.

There are some special cases where the estimators achieve the same asymptotic variance, the most obvious being when the slopes in the linear projections are homogeneous:

β1 = β2 = ··· = βG.

As with comparing FRA to the subsample means, there is no gain in efficiency from imposing this restriction when it is true. This is another fact that makes FRA attractive if the sample size is not small.

Other situations where there is no asymptotic efficiency gain from using FRA are more subtle. In general, suppose we are interested in linear combinations τ = a'µ for a given G × 1 vector a. If

a'ΩFa = 0,

then a'µ̂PRA is asymptotically as efficient as a'µ̂FRA for estimating τ. With ΩX ≡ E(Ẋ'Ẋ), the diagonal elements of ΩF are

[(1 − ρg)/ρg]·δg'ΩXδg,

because E[(Wig − ρg)^2] = ρg(1 − ρg). The off-diagonal elements of ΩF are

−δg'ΩXδh,

because E[(Wig − ρg)(Wih − ρh)] = −ρgρh.
Now consider the case covered in Negi and Wooldridge (2019), where G = 2 and a' = (−1, 1), so the parameter of interest is τ = µ2 − µ1 (the average treatment effect). If ρ1 = ρ2 = 1/2, then δ1 = β1 − (β1 + β2)/2 = (β1 − β2)/2 = −δ2. Therefore,

ΩF = [ δ1'ΩXδ1, −δ1'ΩXδ2 ; −δ2'ΩXδ1, δ2'ΩXδ2 ] = δ1'ΩXδ1·[ 1, 1 ; 1, 1 ]

and

a'ΩFa = δ1'ΩXδ1·(−1, 1)[ 1, 1 ; 1, 1 ](−1, 1)' = 0.

This finding does not extend to the case G ≥ 3; interestingly, it is then not true that PRA is asymptotically equivalent to FRA for estimating each mean separately. So, for example, for the lower bound WTP, a degeneracy might require that the bid values have the same frequency, and it is not clear that even that is sufficient. What about general G with ρg = 1/G for all g? Then

1 − ρg = 1 − 1/G = (G − 1)/G,

and so

(1 − ρg)/ρg = G − 1.

Note that

δg = βg − (β1 + β2 + ··· + βG)/G,

and it is less clear when a degeneracy occurs; one seems very likely when estimating pairwise differences.

2.5 Nonlinear regression adjustment

We now discuss a class of nonlinear regression adjustment methods that preserve consistency without adding assumptions (other than weak regularity conditions). In particular, we extend the setup in Negi and Wooldridge (2019) to allow for more than two treatment levels. We show that both separate and pooled methods are consistent, provided we choose the mean functions and objective functions appropriately. Not surprisingly, using a canonical link function in the context of quasi-maximum likelihood estimation in the linear exponential family plays a key role. Unlike in the linear case, we can only show that full RA improves over the subsample means estimator when the conditional mean is correctly specified. Whether efficiency can be established more generally is an interesting topic for future research.

2.5.1 Full regression adjustment

We model the conditional means, E[Y(g)|X], for each g = 1, 2, ..., G. Importantly, we will not assume that the means are correctly specified. As it turns out, to ensure consistency, the mean should have the index form common in the generalized linear models literature. In particular, we use mean functions m(αg + xβg), where m(·) is a smooth function defined on R. The range of m(·) is chosen to reflect the nature of Y(g). Given that the nature of Y(g) does not change across g, we choose a common function m(·) for all g. Also, as usual, the vector X can include nonlinear functions (typically squares, interactions, and so on) of underlying covariates.

As discussed in Negi and Wooldridge (2019) for the binary treatment case, the function m(·) is tied to a specific quasi-log-likelihood function in the linear exponential family (LEF). Table 2.1 gives the pairs of mean function and quasi-log-likelihood function that ensure consistent estimation. Consistent estimation follows from the results on doubly robust estimation in the context of missing data in Wooldridge (2007). Each quasi-LLF is tied to the mean function associated with the canonical link function.
Table 2.1: Combinations of means and QLLFs to ensure consistency

Support Restrictions                              Mean Function   Quasi-LLF
None                                              Linear          Gaussian (Normal)
Y(g) ∈ [0, 1] (binary, fractional)                Logistic        Bernoulli
Y(g) ∈ [0, B] (count, corners)                    Logistic        Binomial
Y(g) ≥ 0 (count, continuous, corner)              Exponential     Poisson
Yj(g) ≥ 0, Σ_{j=0}^{J} Yj(g) = 1                  Logistic        Multinomial

The binomial QMLE is rarely applied, but it is a good choice for counts with a known upper bound, even if the bound is individual-specific (so Bi is a positive integer for each i). It can also be applied to corner solution outcomes in the interval [0, Bi], where the outcome is continuous on (0, Bi) but perhaps has mass at zero or at Bi. The leading case is Bi = B for all i. Note that we do not recommend a Tobit model in such cases because the Tobit is not generally robust to distributional or mean failure. Combining the multinomial QLL and the logistic mean functions is attractive when the outcome is either a multinomial response or more than two shares that necessarily sum to unity.

As discussed in Wooldridge (2007), the key feature of the single-outcome combinations in Table 2.1 is that it is always true that

E[Y(g)] = E[m(αg* + Xβg*)],

where αg* and βg* are the probability limits of the QMLEs, whether or not the conditional mean function is correctly specified. The analog also holds for the multinomial logit objective function.

Applying nonlinear RA with multiple treatment levels is straightforward. For treatment level g, after obtaining (α̂g, β̂g) by quasi-MLE using only the units at treatment level g, the mean, µg, is estimated as

µ̂g = N^{-1} Σ_{i=1}^{N} m(α̂g + Xiβ̂g),

which includes linear RA as a special case. This estimator is consistent by a standard application of the uniform law of large numbers; see, for example, Wooldridge (2010), Chapter 12, question 12.17.

As in the linear case, any of the mean/QLL combinations in Table 2.1 allow us to write the subsample average as

Ȳg = Ng^{-1} Σ_{i=1}^{N} Wig·m(α̂g + Xiβ̂g).

It seems that µ̂g should be asymptotically more efficient than Ȳg because µ̂g averages across all of the data rather than just the units at treatment level g. Unfortunately, the proof used in the linear case does not go through in the nonlinear case. At this point, we must be satisfied with consistent estimators of the PO means that impose the logical restrictions on E[Y(g)|X]. In the binary treatment case, Negi and Wooldridge (2019) find nontrivial efficiency gains from using logit, fractional logit, and Poisson regression, even compared with full linear RA.
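With G assignment levels, the estimator above applies level by level. A sketch for the Bernoulli/logistic case follows, again using Python/statsmodels as an assumed, illustrative implementation:

```python
import numpy as np
import statsmodels.api as sm

def fra_po_means_logit(y, W, X):
    """mu-hat_g = N^{-1} sum_i m(alpha_g-hat + X_i beta_g-hat), g = 1, ..., G.

    W is an (N, G) matrix of mutually exclusive assignment indicators.
    """
    n, G = W.shape
    D = np.column_stack([np.ones(n), X])
    mu_hat = np.empty(G)
    for g in range(G):
        sel = W[:, g] == 1
        # Quasi-MLE using only the units at assignment level g ...
        fit = sm.GLM(y[sel], D[sel],
                     family=sm.families.Binomial()).fit(cov_type="HC0")
        # ... but average the fitted logistic mean over ALL N observations
        mu_hat[g] = fit.predict(D).mean()
    return mu_hat
```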
2.5.2 Pooled regression adjustment

In cases where N is not especially large, one might, just as in the linear case, resort to pooled RA. Provided the mean/QLL combinations are chosen as in Table 2.1, the pooled RA estimator is still consistent under arbitrary misspecification of the mean function. To see why, write the mean function, without an intercept in the index, as

m(γ1w1 + γ2w2 + ··· + γGwG + xβ).

The first-order conditions of the pooled QMLE include the G conditions

N^{-1} Σ_{i=1}^{N} Wig[Yi − m(γ̂1Wi1 + γ̂2Wi2 + ··· + γ̂GWiG + Xiβ̂)] = 0, g = 1, ..., G.

Therefore, assuming no degeneracies, the probability limits of the estimators, denoted with a *, solve the population analogs:

E(WgY) = E[WgY(g)] = E[Wgm(Wγ* + Xβ*)],

where W = (W1, W2, ..., WG). By random assignment, E[WgY(g)] = ρgµg. By iterated expectations and random assignment,

E[Wgm(Wγ* + Xβ*)] = E{E[Wgm(Wγ* + Xβ*)|X]}

and

E[Wgm(Wγ* + Xβ*)|X] = P(Wg = 1|X)·m(γg* + Xβ*) = ρg·m(γg* + Xβ*).

Therefore,

E[Wgm(Wγ* + Xβ*)] = ρgE[m(γg* + Xβ*)]

and, using ρg > 0, we have shown

µg = E[m(γg* + Xβ*)].

By definition, γ̂g is consistent for γg* and β̂ is consistent for β*. Therefore, after the pooled QMLE estimation, we obtain the estimated means as

µ̌g = N^{-1} Σ_{i=1}^{N} m(γ̂g + Xiβ̂),

and these are consistent by application of the uniform law of large numbers.

As in the case of comparing full nonlinear RA to the subsample averages, we have no general asymptotic efficiency results comparing full nonlinear RA to pooled nonlinear RA. As shown in Section 2.4, in the linear case it is never worse, asymptotically, to use full RA.

2.6 Applications

2.6.1 Treatment effects with multiple treatment levels

The most direct application of the previous results is in the context of a randomized intervention with more than two treatment levels. Regression adjustment can be used for any kind of response variable. With a reasonable sample size per treatment level, full regression adjustment is preferred to pooled regression adjustment. If the outcome Y(g) is restricted in some substantive way, a nonlinear RA method of the kind described in Section 2.5 can be used to exploit the logical restrictions on E[Y(g)|X]. While we cannot show that this guarantees efficiency gains compared with using subsample averages, the simulation findings in Negi and Wooldridge (2019) suggest the gains can be nontrivial – even compared with full linear RA.

2.6.2 Difference-in-differences designs

Difference-in-differences applications can be viewed as a special case of multiple treatment levels. For illustration, consider the standard setting with a single pre-treatment period and a single post-treatment period. Let C be the control group and T the treatment group, and label B the before period and A the after period. The standard DID treatment effect is a particular linear combination of the means from the four groups:

τ = (µTA − µTB) − (µCA − µCB).

Estimating the means by separate regression adjustment is generally better than not controlling for covariates, or than putting them in additively.

2.6.3 Estimating lower bound mean willingness-to-pay

In contingent valuation studies, individuals are randomly presented with the price of a new good or the tax for a new project. They are asked whether they would purchase the good at the given price, or whether they would favor the project at the given tax. Generally, the price or tax is called the "bid value." The outcome for each individual is a binary "vote" (yes = 1, no = 0).

A common approach in CV studies is to estimate a lower bound on the mean willingness-to-pay (WTP). The common estimators are based on the area under the WTP survival function:

E(WTP) = ∫_0^∞ S(a) da.

When a population of individuals is presented with a small number of bid values, it is not possible to identify E(WTP), but only a lower bound. Specifically, let b1, b2, ..., bG be G bid values and define the binary potential outcomes as

Y(g) = 1[WTP > bg], g = 1, ..., G.

In other words, if a person is presented with bid value bg, Y(g) is the binary response, which is unity if WTP exceeds the bid value.
The connection with the survival function is

µg ≡ E[Y(g)] = P(WTP > bg) = S(bg).

Notice that µg is the proportion of people in the population whose WTP exceeds bg. This fits into the potential outcomes setting because each person is presented with only one bid value. Standard consumer theory implies that µg+1 ≤ µg, which simply means that the demand curve is weakly declining in price. It can be shown that, with b0 ≡ 0 for notational ease,

τ ≡ Σ_{g=1}^{G} (bg − bg−1)µg ≤ E(WTP),

and it is this particular linear combination of {µg : g = 1, 2, ..., G} that we are interested in estimating. The so-called ABERS estimator, introduced by Ayer et al. (1955), without a downward sloping survival function imposed, replaces µg with its sample analog:

τ̂ABERS = Σ_{g=1}^{G} (bg − bg−1)Ȳg,

where

Ȳg = Ng^{-1} Σ_{i=1}^{N} Yi·1[Bi = bg]

is the fraction of yes votes at bid value bg. Of course, the Ȳg can also be obtained as the coefficients from the regression

Yi on Bid1i, Bid2i, ..., BidGi, i = 1, ..., N.

Lewbel (2000) and Watanabe (2010) allow for covariates in order to see how WTP changes with individual or family characteristics and attitudes, but here we are interested in estimating τ. We can apply the previous results on efficiency because τ is a linear combination of the µg. Therefore, using separate linear RA to estimate each µg and then forming

τ̂FRA = Σ_{g=1}^{G} (bg − bg−1)µ̂g

is generally asymptotically more efficient than τ̂ABERS. Moreover, because Y is a binary outcome, we might improve efficiency further by using logit models at each bid value to obtain the µ̂g.

2.6.4 Application to the California oil spill data

This section applies the linear RA estimators discussed in Section 2.3 to survey data from the California oil spill study of Carson et al. (2004). The study implemented a CV survey to assess the value of damages to natural resources from future oil spills along California's Central Coast. This was achieved by estimating a lower bound mean WTP measure of the cost of such spills to California's residents. The survey offered respondents the choice of voting for or against a governmental program that would prevent natural resource injuries to shorelines and wildlife along California's central coast over the next decade. In return, the public would be asked to pay a one-time lump sum income tax surcharge to set up the program.

The main survey used to elicit the yes or no votes was conducted by Westat, Inc. The data consist of a random sample of 1,085 interviews with English-speaking Californian households in which the respondent was 18 years or older and lived in a private residence that was either owned or rented. To address non-representativeness of the interviewed sample relative to the total initially chosen sample, weights were used. Each respondent was randomly assigned one of five tax amounts: $5, $25, $65, $120, or $220, and the binary choice of "yes" or "no" for the oil spill prevention program was recorded at the randomly assigned tax amount. Apart from the choice at the different bid amounts, data were also collected on demographics of the respondent and the respondent's household, such as total income, prior knowledge of the spill site, distance to the site, environmental attitudes, attitudes towards big businesses, understanding of the program and the task of voting, beliefs about the oil spill scenario, etc.
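To fix ideas, here is a sketch of the ABERS and linear-FRA lower bound computations from Section 2.6.3 (illustrative Python only; the estimates reported in Tables D.1–D.2 come from the actual survey data and covariates, not from this code):

```python
import numpy as np

def lower_bound(mu_hat, bids):
    # tau = sum_g (b_g - b_{g-1}) * mu_g, with b_0 = 0
    steps = np.diff(np.concatenate([[0.0], np.asarray(bids, float)]))
    return float(steps @ mu_hat)

def abers(y, bid, bids):
    # Subsample share of "yes" votes at each bid value
    mu = np.array([y[bid == b].mean() for b in bids])
    return lower_bound(mu, bids)

def fra_wtp(y, bid, X, bids):
    # Separate linear RA at each bid level; predictions averaged over all units
    D = np.column_stack([np.ones(len(y)), X])
    mu = np.array([(D @ np.linalg.lstsq(D[bid == b], y[bid == b],
                                        rcond=None)[0]).mean() for b in bids])
    return lower_bound(mu, bids)

# Bid values from the California study
bids = [5, 25, 65, 120, 220]
```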
Table D.1 provides a summary of yes votes at the different bid or tax amounts presented to the respondents. Table D.2 provides estimates of the PO means as well as the lower bound mean WTP. We see that the FRA estimator delivers more precise estimates of the vector of PO means. Since the treatment effect, which in this case is the lower bound mean willingness to pay for the oil spill prevention program, is a smooth function of the estimated PO means, the FRA estimate also leads to a more precise lower bound mean WTP than the ABERS estimator.

2.7 Monte Carlo

This section reports the finite sample behavior of the three linear regression adjustment estimators, namely, subsample means (SM), pooled regression adjustment (PRA), and the separate slopes (or feasible) regression adjustment estimator (FRA), for the vector of PO means. For this Monte Carlo study, we generate a population of 1 million observations and mimic the asymptotic setting of random sampling from an "infinite" population. The empirical distributions of the RA estimators are simulated for sample sizes N ∈ {500, 1000, 5000} by randomly drawing the data vector {(Yi, Xi, Wi) : i = 1, 2, ..., N} one thousand times from the above mentioned population. For a comprehensive assessment of the linear RA estimators, we consider three different populations along with four configurations of the treatment assignment vector. Tables D.3, D.5, and D.4 provide bias and standard deviation measures for the vector of PO means estimated using the different estimators for these combinations of population models, assignment vectors, and sample sizes.

To simulate multiple treatments, we consider potential outcomes, Y(g), corresponding to three different treatment states, g = 1, 2, 3. Hence, G = 3 for all the simulations. In each of the populations, the treatment vector W = (W1, W2, W3) is generated with probability mass function P(Wg = 1) = ρg, g = 1, 2, 3. To generate the vector of assignments from this distribution, we first draw a uniform random variable U ~ Uniform(0, 1), partition the unit interval into the subintervals (0, ρ1), (ρ1, ρ1 + ρ2), and (ρ1 + ρ2, ρ1 + ρ2 + ρ3), and record the interval in which the uniform draw falls. For a particular draw, if U ∈ (0, ρ1), then W = (1, 0, 0); if U ∈ (ρ1, ρ1 + ρ2), then W = (0, 1, 0); and if U ∈ (ρ1 + ρ2, ρ1 + ρ2 + ρ3), then W = (0, 0, 1). This ensures that the share of observations in each treatment group g will, on average, be close to the true assignment probability and that each observation (or draw) belongs to exactly one treatment state, i.e., Wi1 + Wi2 + Wi3 = 1 for all i. In all the simulations, we consider the following configurations of the assignment vector:

ρ ∈ { (1/3, 1/3, 1/3), (1/6, 1/6, 2/3), (1/5, 2/5, 2/5), (1/6, 1/3, 1/2) }.

2.7.1 Population models

To compare the empirical distributions of these linear RA estimators, we consider three different population models. Each model, which we term a data generating process (DGP), assumes that the potential outcomes follow a particular distribution, whether continuous or discrete. In the first two DGPs, the Y(g) are simulated as continuous nonnegative outcomes. The first model uses an exponential distribution, whereas the second uses a mixture of an exponential and a lognormal distribution.
The third DGP takes Y(g) to be a categorical response taking four discrete values. Each DGP is described in detail below.

For the first two DGPs, we consider two covariates, X = (X1, X2), drawn from a bivariate normal distribution:

X = (X1, X2)' ~ N( (1, 2)', [2, 0.5; 0.5, 3] ).

For each DGP, we choose parameters such that the covariates have some predictive power in explaining the potential outcomes, so that the benefits of regression adjustment can be reaped.

Population 1: For each g,

Y(g) ~ Exponential(λg)
λg = exp(γ0g + γ1g·X1 + γ2g·X2 + R(g)),

where R(g)|X1, X2 ~ N(0, σg^2), and R(1), R(2), and R(3) are allowed to be correlated.1 The parameter vectors, γg = (γ0g, γ1g, γ2g)', and variances, σg^2, for g = 1, 2, 3, are parameterized as follows:

γ1 = (0, −1, 1)', σ1^2 = 0.04
γ2 = (1, 1.62, −0.5)', σ2^2 = 1
γ3 = (2, −2, 0.6)', σ3^2 = 0.01

For this configuration of parameters, the covariates are only mildly predictive of the outcomes in the three treatment groups: R1^2 = 0.04, R2^2 = 0.02, and R3^2 = 0.01.

1 These are simulated to be affine transformations of the same standard normal random variable.

Population 2: In this case, we generate the outcomes as a mixture of exponential and lognormal distributions. For a uniform mixing variable V ~ Uniform(0, 1),

Y(g) ~ Exponential(λg) if 0 ≤ V < δg
Y(g) ~ Lognormal(ηg, νg^2) if δg ≤ V ≤ 1
ηg = α0g + α1g·X1 + α2g·X2 + α3g·X1^2 + α4g·X2^2 + α5g·X1·X2 + K(g),

where λg is defined exactly as above. Also, K(g)|X1, X2 ~ N(0, κg^2), and K(1), K(2), and K(3) are also allowed to be correlated. The other parameters, αg, δg, κg^2, and νg^2, are chosen as follows:

α1 = (1, −1, 3, −0.02, 0.05, 0.1)'/15, δ1 = 0.7, κ1^2 = 0.04/225, ν1^2 = 0.01
α2 = (1.2, 2, 2, −0.02, 0.03, 0.5)'/15, δ2 = 0.5, κ2^2 = 0.09/225, ν2^2 = 0.16
α3 = (0.3, 1.5, 1, 0.15, 0, 0.13)'/10, δ3 = 0.3, κ3^2 = 0.16/100, ν3^2 = 0.36

For this DGP, the population R-squareds for the three treatment groups are R1^2 = 0.119, R2^2 = 0.157, and R3^2 = 0.1177, respectively.

Finally, for the third population model, we consider each potential outcome to be a categorical response generated using a multinomial logit model. For this setting we only consider a single covariate, X, which is distributed Poisson:

X ~ Poisson(14).

As an example, one could imagine the treatment to be three different political advertisements shown to a voter, with the response (or outcome) indicating the voter's preferred candidate among four potential choices and X denoting the voter's years of schooling.

Population 3: Let Y(g) take one of four discrete values, j ∈ {1, 2, 3, 4}, for each g ∈ {1, 2, 3}, with

P{Y(g) = j} = exp(ω1gj·X + ω2gj·X^2 + Rj(g)) / Σ_{h=1}^{4} exp(ω1gh·X + ω2gh·X^2 + Rh(g)),

where Rj(g)|X ~ U(0, σg^2). For notational simplicity, we collect the index parameters in ω1g = (ω1g1, ω1g2, ω1g3, ω1g4)' and ω2g = (ω2g1, ω2g2, ω2g3, ω2g4)'. We picked the following values:
In all three cases, while estimating the PO means, we assume that the above functional forms are unknown and simply run the regression of the observed outcome on a constant and the covariates. This is meant to reflect the uncertainty that researchers often have about the underlying outcome distributions and how they are generated. Considering three different environments in which to compare the performance of the linear RA estimators also helps to mimic the variety of experimental settings that researchers may encounter, and in which separate slopes regression adjustment can produce substantial precision gains.
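Before turning to the results, the three estimators themselves can be sketched in a few lines (assuming NumPy; the function name po_mean_estimates is ours). SM uses group averages; PRA fits one pooled regression with group intercepts and common slopes; FRA fits a separate regression per assignment level and evaluates each at the full-sample covariate means.

    import numpy as np

    def po_mean_estimates(Y, X, W):
        """SM, PRA, and FRA estimates of the G potential outcome means.
        Y: (n,) outcomes; X: (n, k) covariates; W: (n, G) one-hot assignments."""
        n, G = W.shape
        xbar = X.mean(axis=0)
        # Subsample means (SM)
        sm = np.array([Y[W[:, g] == 1].mean() for g in range(G)])
        # Pooled RA (PRA): G group intercepts, common slopes
        Zp = np.column_stack([W, X])
        bp = np.linalg.lstsq(Zp, Y, rcond=None)[0]
        pra = bp[:G] + xbar @ bp[G:]
        # Separate slopes RA (FRA): one regression per group,
        # each evaluated at the full-sample covariate means
        fra = np.empty(G)
        for g in range(G):
            m = W[:, g] == 1
            Zg = np.column_stack([np.ones(m.sum()), X[m]])
            bg = np.linalg.lstsq(Zg, Y[m], rcond=None)[0]
            fra[g] = bg[0] + xbar @ bg[1:]
        return sm, pra, fra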
2.8 Discussion

Tables D.3, D.5, and D.4 below report the bias and standard deviation of the SM, PRA, and FRA estimators for the three different DGPs, respectively. Each table reports these measures across the four assignment vectors chosen in the manner described above. Note that in most cases, the bias of the FRA and PRA estimates is comparable to, and sometimes even smaller than, that of the SM counterpart. In any case, one may be willing to trade off some bias in the RA estimates in favor of efficiency, in which case we turn our attention to the standard deviations of these estimates.

Across all four configurations, we see that the standard deviation of the separate slopes estimator is weakly smaller than that of the subsample means and pooled regression estimators. The comparison of the PRA and SM estimators is less clear-cut: in almost all cases and for all sample sizes, the PRA estimator of the first PO mean is less precise than its SM counterpart. For DGPs 2 and 3, we see a similar pattern as for DGP 1. In all cases, PRA produces estimates that may or may not be more precise than the subsample means estimator, and some of the means are estimated more precisely with PRA than others. The comparison between SM and FRA, however, is unambiguous: all means estimated through FRA are weakly more precise than those estimated using just the subsample means.

2.9 Conclusion

In this paper, we build on the work of Negi and Wooldridge (2019) to study efficiency improvements from linear regression adjustment estimators when there are more than two treatment levels. In particular, we consider an arbitrary number G of treatments that have been randomly assigned. We show that jointly estimating the vector of potential outcome means using linear RA that allows separate slopes for the different assignment levels is asymptotically never worse than just using subsample averages. One case where there is no gain in asymptotic efficiency from using FRA is when the slopes are all zero: when the covariates are not predictive of the potential outcomes, using separate slopes does not produce more precise estimates compared to just estimating the subsample averages. We also show that separate slopes RA is generally more efficient than pooled RA, unless the true linear projection slopes are homogeneous across assignment levels, in which case using FRA to estimate the vector of PO means is harmless: it does not hurt, but it does not help either.

In addition, this paper extends the discussion of nonlinear regression adjustment in Negi and Wooldridge (2019) to more than two treatment levels. In particular, we show that pooled and separate nonlinear RA estimators in the quasi-maximum likelihood family are consistent if one chooses the mean and objective functions appropriately from the linear exponential family of distributions.

As an illustration of these efficiency arguments, we apply the different linear RA estimators to estimate the lower bound mean WTP using data from a contingent valuation study undertaken to provide an ex-ante measure of damages to natural resources from future oil spills along California's central coast. We find that the lower bound mean WTP is estimated more efficiently when we allow the slopes on the different bid values to be estimated separately, as opposed to the ABERS estimator, which uses subsample averages for the PO means. A comprehensive simulation study also offers finite sample evidence on the efficiency improvements of FRA over SM in three different empirical settings. We find the FRA estimator of the PO means to be unequivocally more precise than PRA and weakly better than SM for all data generating processes, despite the covariates being only mildly predictive of the potential outcomes.

CHAPTER 3

DOUBLY WEIGHTED M-ESTIMATION FOR NONRANDOM ASSIGNMENT AND MISSING OUTCOMES†

† This work is unpublished.

3.1 Introduction

Much of the applied literature in economics is interested in questions of causal inference, such as measuring the impact of job training on labor market outcomes (Calónico and Smith (2017), Ba et al. (2017), Card et al. (2011)), determining the efficacy of school voucher programs for student achievement (Muralidharan and Sundararaman (2015)), and even estimating the effects of firm competition on prices (Busso and Galiani (2019)). A key concern with causal effect estimation is that, typically, the units under comparison differ even before the treatment is assigned, rendering the task of drawing causal claims difficult. This task is made even more challenging when there is missing data on the outcome of interest, such as earnings, test scores, or prices.

The econometrics literature has proposed weighting to deal with non-random assignment (Hahn (1998), Hirano and Imbens (2001), Hirano et al. (2003), Firpo (2007)) and missing data (Robins and Rotnitzky (1995), Robins et al. (1994), Wooldridge (2002), Wooldridge (2007)).1 However, the two weighting procedures have typically been studied in isolation.2 This paper proposes a double inverse probability weighted (IPW) estimator that addresses these twin identification issues in a general M-estimation framework. Specific examples include linear regression, maximum likelihood (MLE), and quantile regression (QR).

1 See Li et al. (2013) for a review of IPW approaches to dealing with missing data under a variety of missing data patterns.
2 Huber (2014b) studies treatment effects in the presence of the double selection problem using a nested weighting procedure. He considers the traditional problem of sample selection based on unobservables and uses a nested weighting structure, which includes the first stage sample selection probability as a covariate in the second stage propensity score model. Other papers that point or set identify causal parameters in the presence of the double selection problem include Fricke et al. (2015), Frölich and Huber (2014), Vossmeyer (2016), Mattei et al. (2014) and Huber and Mellace (2015).
In particular, consider a prototypical training program. Learning about the effects of such an intervention on (say) earnings necessitates comparing individuals based on their participation status. If these individuals are not randomly assigned to the program, such a comparison will confound the true training effect with factors that simultaneously determine selection into the program and future earnings. For instance, individuals with poor labor market histories may be more likely to participate and, contemporaneously, have lower earnings than nonparticipants. Hence, the true effect of the training program is not identified in the presence of nonrandom participation. This identification problem is compounded if, say, individuals who participate in the program are also likely to drop out, introducing the additional problem of missing outcomes.

Even in randomized experiments, the problem of missing outcomes can arise due to attrition, no-shows, dropouts, or non-response (see Bloom (1984), Heckman et al. (1998b), and Hausman and Wise (1979) for a discussion). A specific example is the National Supported Work (NSW) program, where 19 percent of the randomized sample attrited between the baseline and first round of follow-up interviews. In this case, the standard simple difference in means estimator will no longer produce an unbiased training effect estimate (see Huber (2012), Huber (2014a), Behaghel et al. (2015), and Frumento et al. (2012) for alternative approaches to dealing with various post-randomization complications).

A common empirical strategy for dealing with missing data is to drop individuals with incomplete information and treat the observed units as a random sample from the population of interest.3 In a setting with only missing outcomes, such a strategy will not only waste potentially useful information on covariates but, more importantly, create a non-random sample for estimation. In turn, this can generally lead to inconsistent treatment effect estimates.

3 For example, Chen et al. (2018) drop observations with missing labor market outcomes for week 208 after random assignment using the National Job Corps Study data to derive bounds on the Average Treatment Effect as well as the Average Treatment Effect on the Treated. Drange and Havnes (2018) also report excluding children with missing data on the outcomes to study the effect of early child care on cognitive development in Norway.

One of the main contributions of this paper is to propose a new class of consistent and asymptotically normal estimators that combine propensity score weighting with weighting for missing data to address the problems of nonrandom assignment and missing outcomes. Traditionally, the weighting literature has studied each problem individually. By studying them together, this paper builds upon and extends the existing weighting literature to incorporate both issues simultaneously. A second contribution is to consider a general M-estimation problem, which is permitted to be non-smooth in the underlying parameters. Therefore, the identification and estimation arguments made in this paper encompass both average treatment effect (ATE) and quantile treatment effect (QTE) parameters. Finally, a key feature of the proposed estimator is its robustness to parametric misspecification of a conditional model of interest (such as a conditional mean or conditional quantile) and of the two weighting functions.
To obtain consistent estimation of causal parameters, this paper assumes that selection into treatment is based on observed covariates.4 Put differently, this restriction implies that the treatment is as good as randomly assigned after conditioning on pre-treatment covariates. Previous studies have found several situations where such an assumption is tenable, especially when pre-treatment values of the outcome variable are available. For example, LaLonde (1986) and Hotz et al. (2006) have shown that controlling for pre-training earnings alone removes a significant portion of the bias between non-experimental and experimental estimates. The literature assessing teacher impact on student achievement has reported similar findings with pre-test scores (Chetty et al. (2014), Kane and Staiger (2008), and Shadish et al. (2008)), indicating the plausibility of unconfoundedness in these settings.

4 This is a widely used assumption in the treatment effects literature (Imbens and Wooldridge (2009)) and is known by a variety of names, such as unconfoundedness, exogenous assignment (exogeneity), ignorability of assignment, selection on observables, and the conditional independence assumption (CIA).

This paper also assumes that the missing outcomes mechanism is ignorable once we condition on covariates and the treatment status.5 In other words, covariates and the treatment are sufficient for predicting observation into the sample (see Wooldridge (2007) for a similar ignorability assumption). This mechanism falls under the "Missing at Random" (MAR) or "selection on observables" label used in the econometrics literature (for example, Moffit et al. (1999) use it to model attrition) and allows for differential non-response, attrition, and even non-compliance to the extent that the conditioning variables predict it.6

5 Typically, covariates also include pre-treatment outcomes like pre-training earnings or pre-test scores.
6 Attrition in a two period panel is allowed as long as it is a function of key time-invariant characteristics and the assigned treatment status.

Under unconfoundedness and ignorability, the proposed strategy leads to an estimation method that proceeds in two steps: the first step estimates the treatment and missing outcome probabilities using binary response MLE.7 The second step uses these estimated probabilities as weights to minimize (or maximize) a general objective function. Given the parametric nature of the first and second steps, this paper highlights a robustness property which allows the estimator to remain consistent for a parameter of interest under misspecification of either the conditional model or the two probabilities. Consequently, the asymptotic theory in this paper distinguishes between these two important halves. The first half focuses on misspecification of either a conditional expectation function (CEF) or a conditional quantile function (CQF), whereas the second half considers arbitrary misspecification of the weighting functions. Delineating the two cases helps to clarify the interpretation of causal estimands in different misspecification scenarios. This property also nests the well known result of 'double robustness' (Słoczyński and Wooldridge (2018)) as a special case.

7 As a practical matter, researchers typically follow the convention of estimating these probabilities as flexible logit functions.

As illustrative examples, the paper discusses robust estimation of two specific causal parameters, namely, the ATE and QTEs. Consistent estimation of the ATE is achievable under both misspecification scenarios. Of particular interest is the case when the conditional mean function is misspecified. In this case, consistent estimation of the ATE relies on double weighting and on results from the generalized linear model literature.
For estimation of quantile treatment effects, the paper considers three different parameters, namely, the conditional quantile treatment effect (CQTE), a linear approximation to the CQTE, and the unconditional quantile treatment effect (UQTE), each of which may be of interest to the researcher depending on whether features of the conditional or unconditional outcome distributions are of particular interest. In the event that the underlying CQF is assumed to be correct, the double-weighted estimator is shown to be consistent for the true CQTE; otherwise, it delivers a consistent weighted linear approximation to the true CQTE (using results from Angrist et al. (2006b)). In addition, the paper underscores the importance of double weighting for a parameter like the UQTE, where the covariates, which serve to remove biases due to nonrandom assignment and missing outcomes, enter the estimating equation only through the two probability models. Simulations show that the doubly weighted ATE and QTE estimates have the lowest finite sample bias compared to alternatives that ignore one or both problems (such as the unweighted estimator that drops data, or the propensity score weighted estimator which weights only by the treatment probability).

Finally, the proposed method is applied to estimate average and distributional impacts of the NSW training program on earnings for the Aid to Families with Dependent Children (AFDC) target group. This sample is obtained from Calónico and Smith (2017), who recreate LaLonde's within-study analysis for the AFDC women. To have missing cases, these data are augmented to include women with missing earnings information in 1979 who were originally dropped from Calónico and Smith's analysis. This empirical application helps to quantify the estimated bias in the unweighted and propensity score weighted estimates, relative to the doubly weighted estimates, through the presence of an experimental benchmark. Results show that the doubly weighted estimates have an estimated bias which is smaller than that computed for the unweighted estimates, but comparable in magnitude to the bias estimated for the single (propensity score) weighted estimates. This finding indicates that, for this particular application, the missing outcomes problem is much less consequential than the nonrandom assignment problem for obtaining estimates close to the true experimental benchmark.

The rest of this paper is structured as follows. Section 3.2 describes the framework and provides a short description of the population models with an introduction to the naive unweighted estimator. Section 3.3 discusses estimation of the probability weights, which is a necessary first step in solving the doubly weighted problem. Section 3.4 develops the first half of the asymptotic theory, which is explicitly focused on misspecification of a conditional model of interest. In contrast, section 3.5 discusses the second half, which considers cases where the conditional model of interest is correctly specified. Section 3.6 studies the specifics of the robustness property for estimating the ATE and QTEs in rigorous detail.
It also provides supporting Monte Carlo evidence under different cases of misspecification. Section 3.7 illustrates the performance of double weighting using the Calónico and Smith (2017) data. Section 3.8 concludes with a direction for future research. Tables, figures, proofs, and some auxiliary results are provided in the appendix.

3.2 Doubly weighted framework

3.2.1 Potential outcomes and the population models

Let y(g) denote the potential outcome for g = 0, 1 and let wg be an indicator variable for treatment level g, where w0 + w1 = 1, implying that the two treatment groups are mutually exclusive and exhaustive. Then,

    y = y(0) · w0 + y(1) · w1    (3.1)

Let (y(g), x) denote an M × 1 random vector taking values in R^M, where x is the vector of pre-treatment characteristics.8 Some feature of the distribution of (y(g), x) is assumed to depend on a finite Pg × 1 vector θg, contained in a parameter space Θg ⊂ R^Pg.9 Let D(y(g)|x) denote the conditional distribution of y(g) given x, and let q(y(g), x, θg) be an objective function depending on y(g), x, and θg. This paper allows q(·) to be a smooth or a non-smooth function of the underlying parameter θg. The parameter of interest, denoted by θ0g, is defined to be the solution to the following population problem.

8 For instance, in the NSW program, y(1) and y(0) denote potential earnings in the event of participation and non-participation in the training program, respectively. The covariates on which information was collected in the baseline period included the individual's age, ethnicity, high-school dropout status, and real earnings, along with other socio-economic and demographic characteristics.
9 For generality, the dimension of θg is allowed to be different for the treatment and control group problems and is also different from the dimension of x, where x ∈ X ⊂ R^dim(X).

Assumption 3.2.1 (Identification of θ0g). The parameter vector θ0g ∈ Θg is the unique solution to the population minimization problem

    min_{θg ∈ Θg} E[ q(y(g), x, θg) ],   g = 0, 1    (3.2)

Notice that Assumption 3.2.1 describes a general M-estimation framework where the interest lies in minimizing some objective function, q(y(g), x, θg). Specific examples include the smooth ordinary least squares objective function, q(y(g), x, θg) = (y(g) − αg − xβg)², or the non-smooth check function of Koenker and Bassett (1978), q(y(g), x, θg) = cτ(y(g) − αg − xβg), where θg ≡ (αg, βg)′.10 Other examples of q(·) include log-likelihood and quasi-log-likelihood (QLL) functions.

10 For a random variable u, cτ(u) = (τ − 1[u < 0]) · u is the asymmetric loss function for estimating quantiles.

An implicit point made in the assumption above is that θ0g is not assumed to be correctly specified for a conditional feature like a conditional mean, a conditional variance, or even the full conditional distribution. Assumption 3.2.1 simply requires θ0g to uniquely minimize the population problem in (3.2). If θ0g is correctly specified for any of the above mentioned quantities, then the parameter is of direct interest to researchers. However, if θ0g is misspecified for any of these distributional features, Assumption 3.2.1 guarantees a unique pseudo-true solution, θ∗g (White (1982)). In the case of misspecification, determining whether θ∗g is meaningful will depend on the conditional feature being studied and the estimation method used. For example, in the OLS case, θ0g will still index a linear projection if one is agnostic about linearity of the CEF.
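To fix ideas, here is a small sketch (in Python, assuming NumPy and SciPy; all names and parameter values are ours) showing the OLS and check-function objectives as two members of the same M-estimation family:

    import numpy as np
    from scipy.optimize import minimize

    def check_loss(u, tau):
        """Koenker-Bassett check function c_tau(u) = (tau - 1[u < 0]) * u."""
        return (tau - (u < 0)) * u

    def m_estimate(Y, X, loss):
        """Generic M-estimator: minimize the sample average of loss(residual)."""
        Z = np.column_stack([np.ones(len(Y)), X])
        obj = lambda theta: loss(Y - Z @ theta).mean()
        return minimize(obj, np.zeros(Z.shape[1]), method="Nelder-Mead").x

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1))
    Y = 1.0 + 2.0 * X[:, 0] + rng.standard_t(3, size=500)
    theta_ols = m_estimate(Y, X, lambda u: u**2)               # smooth objective
    theta_med = m_estimate(Y, X, lambda u: check_loss(u, 0.5)) # non-smooth objective

A derivative-free optimizer is used here precisely because the check function is non-smooth in the parameters, which is the case the chapter's theory is designed to cover.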
Angrist et al. (2006b) establish analogous approximation properties for quantiles, where a misspecified CQF can still provide the best weighted mean square approximation to the true τ-CQF. In other words, they show that θ0g solves the following weighted mean square error loss function

    min_{θg ∈ Θg} E[ ωτ(x, θg) · (αg(τ) + xβg(τ) − Quantτ(y(g)|x))² ]

where

    ωτ(x, θg) = ∫₀¹ (1 − u) · f_{y(g)}( u · xθg + (1 − u) · Quantτ(y(g)|x) | x ) du

is the weighting function given in Angrist et al. (2006b) adapted to the potential outcomes framework, Quantτ(y(g)|x) is the true CQF, and α0g(τ) + xβ0g(τ) represents a weighted linear approximation. Hence, in this case, θ0g ≡ (α0g, β0g)′ provides an interesting interpretation that can be of practical interest to researchers.

Note that Assumption 3.2.1 only requires the parameter to solve an unconditional problem. A sufficient condition for the same is that the parameter additionally solves the conditional problem. However, this latter condition will not be required to derive the asymptotic theory in section 3.4. For the reader, an effective way to separate the results in section 3.4 from the ones discussed in section 3.5 is to consider the current section as allowing potential misspecification of the conditional feature being studied, in the sense of Assumption 3.2.1. Section 3.5 will require θ0g to be identified in the stronger conditional sense. Together, the results developed in sections 3.4 and 3.5 can then be used to characterize the robustness property of the proposed estimator.

3.2.2 The unweighted M-estimator

In this paper, the objective is to consistently estimate θ0g. If one obtains a random sample {(yi(0), yi(1), wig, xi): i = 1, 2, ..., N} from the population of interest, then one can solve

    min_{θg ∈ Θg} Σᴺi=1 wig · q(yi(g), xi, θg)    (3.3)

For the estimator which solves (3.3) to consistently estimate θ0g, the reverse analogy principle dictates that θ0g must also solve

    min_{θg ∈ Θg} E[ wg · q(y(g), x, θg) ];   g = 0, 1    (3.4)

However, this argument may not necessarily hold. For example, consider the linear model

    y(g) = αg + xβg + u(g);   g = 0, 1
    E(u(g)) = 0,   E(x′u(g)) = 0    (3.5)

If the treatment (say, job training) is correlated with baseline characteristics, as can be expected when the program is non-randomly assigned, then E[wg · x′u(g)] ≠ 0.11 In addition, suppose there is missing data on the outcome of interest. To formalize this, let s be a binary indicator for missing outcomes. Then

    y = y(0),     if g = 0, s = 1
        y(1),     if g = 1, s = 1
        missing,  if s = 0    (3.6)

where s = 1 if the outcome is observed and s = 0 if it is missing.12 In this case, a common empirical strategy is to solve

    min_{θg ∈ Θg} Σᴺi=1 si · wig · q(yi(g), xi, θg)    (3.7)

which only uses observed data to estimate θ0g. Let us refer to the estimator that solves (3.7) as the unweighted M-estimator, and denote it as ˆθug. In this case, even if the treatment is randomly assigned, the missing outcomes may still be correlated with the treatment, observable factors, or both, which implies that E[s · wg · x′u(g)] ≠ 0. Hence, identification of θ0g is now confounded on two grounds: non-random assignment, which renders the treatment and control groups incomparable, and missing outcomes, which leads to a violation of the 'random sampling' assumption.

11 When the treatment is randomized, as in the case of the NSW, or as studied in Negi and Wooldridge (2019), one will necessarily have E[wg · x′u(g)] = 0 due to the experimental design.
12 For an illustration of the observed sample, see figure I.1.
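A toy simulation makes the two confounds concrete (assuming NumPy; all parameter choices below are ours, not from the chapter): treatment depends on x, missingness depends on (x, w), and the complete-case difference in means drifts away from the true ATE of 1.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    x = rng.normal(size=n)
    p = 1 / (1 + np.exp(-x))              # P(w = 1 | x): selection on observables
    w = rng.uniform(size=n) < p
    y0 = x + rng.normal(size=n)           # potential outcomes; true ATE = 1
    y1 = 1 + x + rng.normal(size=n)
    y = np.where(w, y1, y0)
    r = 1 / (1 + np.exp(-(0.5 * x + w)))  # P(s = 1 | x, w): ignorable missingness
    s = rng.uniform(size=n) < r
    # Complete-case difference in means: biased on both grounds
    print(y[s & w].mean() - y[s & ~w].mean())   # noticeably above 1
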
The next section discusses the identification approach taken in this paper.

3.2.3 Ignorable missingness and unconfoundedness

Without imposing any structure on the assignment and missingness mechanisms in the population, identifying and estimating θ0g remains difficult because of the argument outlined in the previous section. To proceed with identification, I assume that the treatment is unconfounded conditional on covariates (Rosenbaum and Rubin (1983)).13 Formally, consider the following

Assumption 3.2.2. (Strong ignorability) Assume

    {y(0), y(1) ⊥⊥ wg} | x    (3.8)

1. (3.8) implies that P(wg = 1|y(0), y(1), x) = P(wg = 1|x) ≡ pg(x) for g = 0, 1, where p0(x) + p1(x) = 1.
2. The vector of pre-treatment covariates, x, is always observed for the entire sample.
3. For all x ∈ X ⊂ R^dim(X), pg(x) > κg > 0.

13 Like most other assumptions, unconfoundedness is non-refutable. For methods that indirectly test for its validity, see Huber and Melly (2015), de Luna and Johansson (2014), Rosenbaum (1987) and Heckman and Hotz (1989).

Assumption 3.2.2 part (1) says that conditioning on covariates is enough to parse out any systematic differences that may exist between the treatment and control groups. This is a widely used assumption in the treatment effects literature, and is known as unconfoundedness.14 One advantage of unconfoundedness is that, intuitively, it has a better chance of holding once we control for a rich set of variables in x.15 Note that unconfoundedness not only includes cases where the treatment is a deterministic function of the covariates, for example stratified (or block) experiments, but also cases where the treatment is a stochastic function of covariates.16 Part (2) requires that we observe these covariates for all individuals. Part (3) is an overlap assumption which ensures that, for all values of x in the support of the distribution, there is a chance of observing units in both treatment and control states.17

14 Imbens and Wooldridge (2009) attribute the popularity of unconfoundedness, as an identifying restriction, to the paucity of general methods for estimating treatment effects.
15 For example, Hirano and Imbens (2001) control for a rich set of prognostic factors to justify unconfoundedness while estimating the effects of right heart catheterization (RHC) on survival rates of patients.
16 The appendix discusses the case of a stratified experiment where unconfoundedness is satisfied by design if one additionally assumes the missing outcome pattern to be ignorable.
17 Methods for checking overlap involve calculating normalized sample average differences for each covariate and checking the empirical distribution of propensity scores.

With respect to the missing outcomes mechanism, I assume ignorability conditional on covariates and the treatment status. Formally, consider

Assumption 3.2.3. (Ignorability of missing outcomes) Assume

    {y(0), y(1) ⊥⊥ s} | x, wg    (3.9)

1. (3.9) implies that P(s = 1|y(0), y(1), x, wg) = P(s = 1|x, wg) ≡ r(x, wg).
2. In addition to x, wg is always observed for the entire sample.
3. For each (x, wg) ∈ R^{dim(X)+1}, r(x, wg) > η > 0.

Part (1) states that, conditional on covariates and the treatment status, the individuals whose outcomes are missing do not differ systematically from those who are observed. This implies that adjusting for x and wg renders the outcomes as good as randomly missing. In the econometrics literature, this assumption falls under the "selection on observables" tag. In the statistics literature, it is also known as MAR and represents a scenario where missingness only depends on observables and not on the missing values of the variable (Little and Rubin (2002)).
Special cases covered under this mechanism are patterns such as missing 15For example, Hirano and Imbens (2001) control for a rich set of prognostic factors to justify unconfoundedness while estimating the effects of right heart catheterization (RHC) on survival rates of patients. 16The appendix discusses the case of a stratified experiment where unconfoundedness is satisfied by design if one additionally assumes the missing outcome pattern to be ignorable. 17Methods for checking overlap involve calculating normalized sample average differences for each covariate and checking the empirical distribution of propensity scores. 74 completely at random (MCAR) and exogenous missingness, as considered in Wooldridge (2007), with z = x. Allowing the missingness probability to be a function of the treatment indicator is particularly useful in cases of differential nonresponse. For instance, in NSW, people assigned to the treatment group were less likely to drop out of the program compared to the control group. In such cases, covariates alone may not be sufficient for predicting missingness. To the extent that being observed in the sample is predicted by x and wg, assumption 3.2.3 can accommodate non-observability due to sampling design, item non- response and attrition in a two period panel.18 Part (2) of the above assumption ensures that x and wg are fully observed. Finally, part (3) imposes an overlap condition, where the probability of being observed in bounded away from zero. This implies that there is a positive probability of observing people in the sample with a given value of x and wg in the population. To study the estimation method in terms of the selected sample, I consider random sampling in the following sense, Assumption 3.2.4. (Sampling) Assume that {(yi(0), yi(1), xi, wig, si); i = 1, 2, . . . , N} are independent and identical random draws from the population where in the population 1. wig is unconfounded with respect to {yi(0), yi(1)} given xi 2. si is ignorable with respect to {yi(0), yi(1)} given (xi, wig) The next section discusses identification and estimation of θ0 g using a double inverse probability weighted procedure. 3.2.4 Population problem with double weighting Consider the following population problem 18For the case of attrition, one must assume that second period missingness is ignorable conditional on initial period covariates and the treatment status. 75 (cid:34) min θg∈Θg E s r(x, wg) · wg pg(x) (cid:35) · q(y(g), x, θg) ; g = 0, 1 (3.10) then under unconfoundedness and ignorability, solving this doubly weighted population prob- lem is the same as solving 3.2. The following lemma establishes this equivalence Lemma 3.2.5. Given assumptions 3.2.1, 3.2.2, 3.2.3 and 3.2.4, if q(y(g), x, θg) is a real valued function for all (y(g), x) ⊂ RM and for all θg ∈ Θg such that E ∞ for g = 0, 1, then we have | q(y(g),x,θg) r(x,wg)·pg(x)| < (cid:34) E s r(x, wg) · wg pg(x) · q(cid:0)y(g), x, θg (cid:1)(cid:35) (cid:104) q(cid:0)y(g), x, θg = E (cid:21) (cid:20) (cid:1)(cid:105) The proof uses two applications of the law of iterated expectations with unconfounded- ness and ignorability to arrive at the above result. This equivalence implies that one can now address the identification issue by solving the doubly weighted population problem. 
Consequently, one can obtain a consistent estimator of θ0g by solving the sample analogue of (3.10) as follows:

    min_{θg ∈ Θg} Σᴺi=1 (si / r(xi, wig)) · (wig / pg(xi)) · q(yi(g), xi, θg);   g = 0, 1    (3.11)

Let the estimator which solves eq (3.11) be denoted as ˜θg. Note, however, that this estimator is infeasible, as it depends on the unknown probabilities r(·) and pg(·). The next section discusses the first step of estimating these probabilities.

3.3 Estimation

As mentioned above, one problem with the formulation of ˜θg is that the treatment and missing outcome propensity scores are unknown. Therefore, in its current form, ˜θg cannot be implemented unless the true probabilities are known. The following assumptions posit that I have a correctly specified model for the two probabilities, which helps me formulate consistent estimators of pg(x) and r(x, wg).

Assumption 3.3.1. (Correct parametric specification of propensity score) Assume that
1. There exists a known parametric function G(x, γ) for p1(x), where γ ∈ Γ ⊂ R^I and 0 < G(x, γ) < 1 for all x ∈ X, γ ∈ Γ.
2. There exists γ0 ∈ Γ such that p1(x) = G(x, γ0).

Part (1) postulates the existence of a parametric model for the propensity score that is known to the researcher, and part (2) assumes that, for some true value of γ, say γ0, this model is correctly specified for the true assignment probability. Similarly, in order to estimate the missing outcome propensity score, I assume that R(x, wg, δ) is a correctly specified parametric model for r(x, wg). Formally,

Assumption 3.3.2. (Correct parametric specification of missing outcomes probability) Assume that
1. There exists a known parametric function R(x, wg, δ) for r(x, wg), where δ ∈ ∆ ⊂ R^K and R(x, wg, δ) > 0 for all x ∈ X, δ ∈ ∆.
2. There exists δ0 ∈ ∆ such that r(x, wg) ≡ R(x, wg, δ0).

3.3.1 Estimated weights using binary response MLE

To estimate the probability functions G(x, ·) and R(x, wg, ·), this paper uses binary response conditional maximum likelihood. Since both wg and s are binary responses, estimation of γ0 and δ0 using MLE will be asymptotically efficient under correct specification of these functions, as assumed in 3.3.1 and 3.3.2. The following two lemmas provide formal consistency and asymptotic normality conditions for MLE estimation of the two probability models. The conditions are adapted from Theorems 2.5 and 3.3 of Newey and McFadden (1994).
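Before stating the formal conditions, note that the first step is just two binary response MLEs; a minimal sketch follows (assuming NumPy; fit_logit_probs is a hypothetical helper implementing Newton-Raphson for a logit, the flexible functional form the paper uses in practice).

    import numpy as np

    def fit_logit_probs(Z, d, iters=50):
        """Binary response MLE for a logit model P(d = 1 | Z) = Lambda(Z @ beta).
        Returns the fitted probabilities; Z should include a constant column."""
        beta = np.zeros(Z.shape[1])
        for _ in range(iters):
            p = 1 / (1 + np.exp(-Z @ beta))
            grad = Z.T @ (d - p)                        # score of the log-likelihood
            info = Z.T @ ((p * (1 - p))[:, None] * Z)   # Fisher information
            beta += np.linalg.solve(info, grad)         # Newton-Raphson step
        return 1 / (1 + np.exp(-Z @ beta))

    # First step, in the chapter's notation:
    #   phat = fit_logit_probs(Zx, w)    # G(x, gamma-hat), Zx  = [1, x]
    #   rhat = fit_logit_probs(Zxw, s)   # R(x, w, delta-hat), Zxw = [1, x, w]

Note that w enters only the missingness model, mirroring the conditioning sets in Assumptions 3.2.2 and 3.2.3.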
Lemma 3.3.3. (Consistency of maximum likelihood) Assume 3.2.4, so that si and wi1 are i.i.d. with pdfs given by f(si|wi1, xi, δ) = R(xi, wi1, δ)^si · (1 − R(xi, wi1, δ))^(1−si) and f(wi1|xi, γ) = G(xi, γ)^wi1 · (1 − G(xi, γ))^(1−wi1). Additionally, assume that
1. γ0 ∈ Γ and δ0 ∈ ∆, where Γ, ∆ are compact sets.
2. If γ ≠ γ0, then f(wi1|xi, γ) ≠ f(wi1|xi, γ0), and if δ ≠ δ0, then f(si|wi1, xi, δ) ≠ f(si|wi1, xi, δ0).
3. ln f(wi1|xi, γ) and ln f(si|wi1, xi, δ) are continuous at each γ ∈ Γ and δ ∈ ∆, respectively, with probability one.
4. E[ sup_{γ∈Γ} |ln f(wi1|xi, γ)| ] < ∞ and E[ sup_{δ∈∆} |ln f(si|wi1, xi, δ)| ] < ∞.
Then ˆγ →p γ0 and ˆδ →p δ0.

The proof of the lemma is given in the appendix. For asymptotic normality, consider the following

Lemma 3.3.4. (Asymptotic normality for MLE) Assume that the conditions of Lemma 3.3.3 are satisfied and
1. γ0 ∈ interior(Γ) and δ0 ∈ interior(∆).
2. f(si|wi1, xi, δ) and f(wi1|xi, γ) are both twice continuously differentiable, with f(si|wi1, xi, δ) > 0 and f(wi1|xi, γ) > 0 in a neighborhood N of δ0 and γ0, respectively.
3. ∫ sup_{γ∈N} ||∇γ f(wi1|xi, γ)|| dw1 < ∞ and ∫ sup_{γ∈N} ||∇γγ′ f(wi1|xi, γ)|| dw1 < ∞. Similarly, ∫ sup_{δ∈N} ||∇δ f(si|wi1, xi, δ)|| ds < ∞ and ∫ sup_{δ∈N} ||∇δδ′ f(si|wi1, xi, δ)|| ds < ∞.
4. E[ ∇γ ln f(wi1|xi, γ0){∇γ ln f(wi1|xi, γ0)}′ ] exists and is non-singular. Similarly, E[ ∇δ ln f(si|wi1, xi, δ0){∇δ ln f(si|wi1, xi, δ0)}′ ] exists and is non-singular.
5. E[ sup_{γ∈N} ||∇γγ′ ln f(wi1|xi, γ)|| ] < ∞ and E[ sup_{δ∈N} ||∇δδ′ ln f(si|wi1, xi, δ)|| ] < ∞.
Then the MLE estimators ˆγ and ˆδ, which solve

    max_{γ∈Γ} Σᴺi=1 { wi1 log G(xi, γ) + (1 − wi1) log(1 − G(xi, γ)) }

and

    max_{δ∈∆} Σᴺi=1 { si log R(xi, wi1, δ) + (1 − si) log(1 − R(xi, wi1, δ)) }

respectively, are asymptotically normal. For a proof, see appendix H.

Given the estimators ˆγ and ˆδ, one can estimate the assignment and missing outcome propensity scores by G(·, ˆγ) and R(·, ˆδ), respectively. Consistency and asymptotic normality follow from applying the continuous mapping theorem and the delta method, given that G(·, ˆγ) and R(·, ˆδ) are assumed to be continuously differentiable, which is implicit in Lemmas 3.3.3 and 3.3.4. In practice, this paper follows the convention of estimating these probabilities as flexible logits, for which the above requirements of continuity and differentiability are easily satisfied.

3.3.2 Doubly weighted M-estimator

Once the probability weights have been estimated, let ˆθ1 denote the doubly weighted estimator which solves the treatment group problem

    min_{θ1 ∈ Θ1} Σᴺi=1 (si / R(xi, wi1, ˆδ)) · (wi1 / G(xi, ˆγ)) · q(yi(1), xi, θ1)    (3.12)

with weights given by G(x, ˆγ) and R(x, w1, ˆδ), and let ˆθ0 be the estimator which solves the control group problem

    min_{θ0 ∈ Θ0} Σᴺi=1 (si / R(xi, wi0, ˆδ)) · (wi0 / (1 − G(xi, ˆγ))) · q(yi(0), xi, θ0)    (3.13)

using weights (1 − G(x, ˆγ)) and R(x, w0, ˆδ). Henceforth, this estimator will be denoted as ˆθg for g = 0, 1.

Example 1 (Ordinary least squares): In the case of a misspecified conditional mean function, ˆθ1 ≡ (ˆα1, ˆβ1)′ will solve a doubly weighted version of the OLS problem, i.e.,

    ˆθ1 = argmin_{θ1 ∈ Θ1} Σᴺi=1 (si / R(xi, wi1, ˆδ)) · (wi1 / G(xi, ˆγ)) · (yi(1) − α1 − xiβ1)²

Similarly,

    ˆθ0 = argmin_{θ0 ∈ Θ0} Σᴺi=1 (si / R(xi, wi0, ˆδ)) · (wi0 / (1 − G(xi, ˆγ))) · (yi(0) − α0 − xiβ0)²

where (ˆαg, ˆβg)′ will be consistent for the linear projection of y(g) on x.

Example 2 (Quantile regression): Similarly, in the case of a misspecified conditional quantile function, ˆθg(τ) ≡ (ˆαg(τ), ˆβg(τ))′ will solve the following weighted mean square error loss functions (Angrist et al. (2006b)), i.e.,

    ˆθ1(τ) = argmin_{θ1 ∈ Θ1} Σᴺi=1 (si / R(xi, wi1, ˆδ)) · (wi1 / G(xi, ˆγ)) · ωτ(xi, θ1) · [Quantτ(yi(1)|xi) − α1(τ) − xiβ1(τ)]²

    ˆθ0(τ) = argmin_{θ0 ∈ Θ0} Σᴺi=1 (si / R(xi, wi0, ˆδ)) · (wi0 / (1 − G(xi, ˆγ))) · ωτ(xi, θ0) · [Quantτ(yi(0)|xi) − α0(τ) − xiβ0(τ)]²

where ˆθg(τ) will now be consistent for a weighted linear approximation to the true CQF of y(g)|x. Using the doubly weighted estimator, one can now consistently estimate causal parameters like the average treatment effect and different quantile treatment effects. Section 3.6 discusses each of these examples in detail. The next section develops and discusses the large sample theory of the proposed estimator.
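Example 1 can be sketched directly, reusing the first-step probabilities phat and rhat from the earlier logit sketch (assuming NumPy; doubly_weighted_ols is our label, not the chapter's notation):

    import numpy as np

    def doubly_weighted_ols(y, X, w, s, phat, rhat):
        """Doubly weighted OLS for the treatment group (g = 1): each observed
        treated unit gets weight 1 / (rhat * phat); all other units drop out."""
        m = (s == 1) & (w == 1)                 # observed treated units
        lam = 1.0 / (rhat[m] * phat[m])         # double inverse probability weights
        Z = np.column_stack([np.ones(m.sum()), X[m]])
        A = Z.T @ (lam[:, None] * Z)
        return np.linalg.solve(A, Z.T @ (lam * y[m]))

    # Control group problem (g = 0): use the mask (s == 1) & (w == 0)
    # and the weight 1 / (rhat * (1 - phat)) instead.

Restricting to the observed-treated mask first also avoids touching the missing entries of y, which carry no usable information under (3.6).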
3.4 Asymptotic theory

This paper implements the proposed estimator in a two-step procedure. The first step uses binary response MLE for the estimation of the probability weights, and the second step uses the first-step weights to estimate the parameter of interest, θ0g. The asymptotic theory utilizes results for two-step estimators with a non-smooth objective function in the second step to establish consistency and asymptotic normality of ˆθg. Therefore, the usual regularity conditions assuming continuity and twice differentiability with respect to θg are now relaxed.

3.4.1 Consistency

Using the conditions in Lemma 2.4 of Newey and McFadden (1994), it is easy to establish consistency of the doubly weighted M-estimator, ˆθg. The conditions of the lemma are quite weak, with continuity and a data dependent upper bound with finite expectation being the only substantive requirements. The following theorem fills in the primitive regularity conditions for applying the uniform law of large numbers.

Theorem 3.4.1. (Consistency) Assume 3.2.1, 3.2.2, 3.2.3, 3.2.4, 3.3.1 and 3.3.2 hold. Further, let
(1) Θg be compact for g = 0, 1;
(2) q(y(g), x, θg) be continuous at each θg ∈ Θg with probability one;
(3) for all θg ∈ Θg, |q(y(g), x, θg)| ≤ b(y(g), x) for some function b(·) such that E[b(y(g), x)] < ∞.
Then ˆθg →p θ0g as N → ∞.

The proof of the theorem can be found in the appendix. The conditions of the above theorem allow the objective function to be discontinuous at some points of Θg for a given x, which is useful for cases where q(·) is allowed to be non-smooth. Under the dominance condition given in (3), uniform convergence of sample averages holds quite generally. Compactness of the parameter space and identification, as given in Assumption 3.2.1, are both conditions that can be relaxed without affecting consistency.

3.4.2 Asymptotic normality

For establishing asymptotic normality, I provide conditions for the general case of non-smooth objective functions, since these conditions accommodate the smooth case as well. The main condition needed for establishing asymptotic normality of the doubly weighted estimator is stochastic equicontinuity, which is sufficient to guarantee uniform convergence of the objective function to its population counterpart. Before stating the conditions of the normality result, let the population problems be denoted as

    Q0(θ1) = E[ (si · wi1 / (R(xi, wi1, δ0) · G(xi, γ0))) · q(yi(1), xi, θ1) ]
    Q0(θ0) = E[ (si · wi0 / (R(xi, wi0, δ0) · (1 − G(xi, γ0)))) · q(yi(0), xi, θ0) ]

and the sample analogues be given as

    QN(θ1) = (1/N1) Σᴺi=1 (si · wi1 / (R(xi, wi1, ˆδ) · G(xi, ˆγ))) · q(yi(1), xi, θ1)
    QN(θ0) = (1/N0) Σᴺi=1 (si · wi0 / (R(xi, wi0, ˆδ) · (1 − G(xi, ˆγ)))) · q(yi(0), xi, θ0)

where N1 = Σᴺi=1 si · wi1 and N0 = Σᴺi=1 si · wi0. Then I have the following theorem for asymptotic normality, which is taken from Newey and McFadden (1994), section 7, along with primitive conditions taken from Andrews (1994).
Theorem 3.4.2. (Asymptotic normality of the doubly weighted estimator) Given Assumptions 3.2.1, 3.2.2, 3.2.3, 3.2.4:

(1) Suppose that ˆθg is an approximate minimum, i.e., QN(ˆθg) ≤ inf_{θg ∈ Θg} QN(θg) + op(N⁻¹).

(2) ˆθg →p θ0g, with θ0g ∈ int(Θg).

(3) Q0(θg) is minimized on Θg at θ0g.

(4) Q0(θg) is twice differentiable at θ0g with a nonsingular Hessian, Hg.

(5) ∇θg QN(θ0g) exists with probability one and √N ∇θg QN(θ0g) →d N(0, Ωg).

(6) Let

    l = ∇θ1 { (s · w1 / (R(x, w1, δ∗) · G(x, γ∗))) · q(y(1), x, θ1) }′
    k = ∇θ0 { (s · w0 / (R(x, w0, δ∗) · (1 − G(x, γ∗)))) · q(y(0), x, θ0) }′

and let the class

    F = { f : f(y(g), x) = l for g = 1 and k for g = 0; θg ∈ Θg, ∀ (y(g), x) ⊂ R^M }

satisfy Pollard's entropy condition with envelope F = 1 ∨ sup_{f∈F} |f(·)| for Type I classes, or Ossiander's Lp entropy condition with p = 2 and envelope F = sup_{f∈F} |f(·)| for Type II–VI classes, where these classes are defined in Andrews (1994).

(7) lim sup_{N→∞} (1/N) Σᴺi=1 E(F)^(2+ζ) < ∞ for some ζ > 0 and F given above.

(8) The conditions of Lemma 3.3.4 are satisfied, allowing the first order influence function representation for ˆγ:

    √N(ˆγ − γ0) = [E(di di′)]⁻¹ N^(−1/2) Σᴺi=1 di + op(1)    (3.14)

where

    di = wi1 · [∇γ G(xi, γ0)′ / G(xi, γ0)] − (1 − wi1) · [∇γ G(xi, γ0)′ / (1 − G(xi, γ0))]    (3.15)

is the I × 1 score of the binary response log-likelihood for the treatment probability, evaluated at the true parameter value γ0.

(9) Similarly, ˆδ has the first order influence function representation

    √N(ˆδ − δ0) = [E(bi bi′)]⁻¹ N^(−1/2) Σᴺi=1 bi + op(1)    (3.16)

where

    bi = si · [∇δ R(xi, wi1, δ0)′ / R(xi, wi1, δ0)] − (1 − si) · [∇δ R(xi, wi1, δ0)′ / (1 − R(xi, wi1, δ0))]    (3.17)

is the K × 1 score of the binary response log-likelihood for the missingness probability, evaluated at the true parameter value δ0.

Then

    √N(ˆθg − θ0g) →d N(0, Hg⁻¹ Ωg Hg⁻¹)

where

    Ω1 = E(li li′) − E(li bi′)[E(bi bi′)]⁻¹ E(bi li′) − E(li di′)[E(di di′)]⁻¹ E(di li′)
    Ω0 = E(ki ki′) − E(ki bi′)[E(bi bi′)]⁻¹ E(bi ki′) − E(ki di′)[E(di di′)]⁻¹ E(di ki′)

The primitive conditions for stochastic equicontinuity hold for classes of functions of Type I–VI as defined in Andrews (1994). Conditions (1)–(5) are standard for the case of non-smooth objective functions. Condition (5) requires that the score of the objective function exists with probability one and is normally distributed; this condition is important for establishing distributional convergence of ˆθg. Conditions (6) and (7), together with random sampling, are sufficient for stochastic equicontinuity of the remainder term in Newey and McFadden (1994).19 Checking these conditions in a particular application entails showing that f(·) belongs to one of these classes.
For instance, both the linear and quantile regression examples considered in this paper belong to the Type I class of functions. Consequently, stochastic equicontinuity follows from Theorems 1 and 4 in Andrews (1994) for the Type I and Type II–VI classes, respectively. Conditions (8) and (9) simply impose regularity conditions on R(·) and G(·) so that the influence function representations given in (3.14) and (3.16) are possible. For a proof of the theorem, see the appendix.

19 Directly verifying stochastic equicontinuity as stated in Theorem 7.2 of Newey and McFadden (1994) is difficult, and hence primitive conditions like (6) and (7) can be useful. Pollard (1985) also provides primitive conditions that are sufficient for stochastic differentiability, which is quite similar to the condition of stochastic equicontinuity.

3.4.3 Efficiency gain with estimated weights

The asymptotic variance expression derived in the previous section offers some interesting insights. First, the middle term, Ωg, represents the variance of the residual from the population regression of the weighted score on the two binary response scores, bi and di. Note that even though Ωg should involve a fourth term for the covariance between the two scores, this term is zero in the present case on account of the two scores being conditionally independent.20

20 For the proof, see appendix.

Second, the expression for Ωg derived here is different from what I obtain in section 3.5 under the stronger identification assumption. This difference has an interesting efficiency implication: when a researcher is only willing to assume identification in the weaker sense of 3.2.1, it is potentially more efficient to estimate the two probabilities in a first step. Note, though, that this result is asymptotic in nature. To see it, suppose that we know G(xi, γ0) and R(xi, wig, δ0). Then the asymptotic variance of the estimator ˜θg, which uses the known probabilities, is

    Avar[ √N(˜θg − θ0g) ] = Hg⁻¹ Σg Hg⁻¹

where Σ1 = Var(li) = E(li li′) for the treatment group and Σ0 = Var(ki) = E(ki ki′) for the control group. I formalize this result in the next theorem.

Theorem 3.4.3. (Efficiency gain with estimated weights) Under the assumptions of Theorem 3.4.2, we obtain
There- fore, inefficiency of the known probability estimator (as seen above) is due to its failure to exploit the correlation between the first and second set of moments. Hence, knowledge of the selection parameters do not play a role in efficient estimation. 3.5 Some feature of interest is correctly specified The results in the previous section were derived under the assumption that the parameter vector solves an unconditional M-estimation problem. Even though it can handle cases where the conditional feature of interest is correctly specified, the explicit focus was on examples of model misspecification such as estimating a misspecified linear model for either the true conditional mean or the true conditional quantile function. In contrast, this section focuses g indexes a true conditional feature of interest. This could be a mean, g can be said to be on situations where θ0 quantile or the entire conditional distribution of y(g)|x. In this case, θ0 86 identified in a stronger sense which is reflected in an improvisation of the basic identification assumption given in eq (3.2) to the following, Assumption 3.5.1. (Strong identification of θ0 g) The parameter vector θ0 g ∈ Θg is the unique solution to the population minimization problem E(cid:2)q(y(g), x, θg)|x, wg, s(cid:3) ; g = 0, 1 min θg∈Θg (3.18) for each (x, wg, s) ∈ V ⊂ Rdim(X)+2. In other words, under ignorability (as defined in 3.2.3) and unconfoundedness (defined in 3.2.2), θ0 g solves E(cid:2)q(y(g), x, θg)|x(cid:3) ; g = 0, 1 (3.19) min θg∈Θg for each x ∈ X ⊂ Rdim(X). in 3.2.1. The basic identification assumption simply defines θ0 The above assumption can be seen as a strengthening of the identification assumption g to be the solution to the unconditional M-estimation problem, irrespective of whether it is correctly specified for an g to solve the stronger conditional M-estimation problem. For instance, assumption 3.5.1 will be satisfied for a underlying model or not. Assumption 3.5.1 is additionally requiring θ0 correctly specified CEF given by g + xβ0 y(g) = α0 E(cid:0)u(g)|x(cid:1) = 0 g + u(g); g = 0, 1 (3.20) with either OLS or QMLE in the linear exponential family as the chosen estimation method. This would also hold for a correctly specified CQF estimated either using quantile regression or QMLE in the tick exponential family (Komunjer (2005)). Requirement for the parameter, g, to solve the stronger ID problem is an important distinction which will ultimately help θ0 me characterizing the robustness properties of the doubly weighted estimator. I will illustrate this property through two main examples; the first will study estimation of ATE and the 87 second will study estimation of quantile effects. Both these examples are studied in detail in section 3.6. Until now I have not said anything about the parametric specifications of functions R(·) In fact, under assumption 3.5.1, correct functional form assumptions on these and G(·). two probabilities can be dispensed with. This is a second important distinction between the results characterized under assumption 3.2.1 and the results characterized in this sec- g solves the objective function for each tion under 3.5.1. Therefore, the requirement that θ0 (x, wg, s) ∈ V is much stronger than the requirement in assumption 3.2.1 since assumption 3.5.1 implies assumption 3.2.1 but not the other way around. Formally, I will show that the g that solves the sample equivalent of eq (3.19) with potentially misspecified g. 
Before that, treatment and missing outcomes probabilities will still consistently estimate θ0 estimator of θ0 the following assumptions formalize possible misspecification of these probability models Assumption 3.5.2. (Parametric specification of propensity score) Assume that conditions (1) and (3) of 3.3.1 hold where condition (2) is defined for some γ∗ ∈ Γ such that plim(ˆγ) = γ∗ Assumption 3.5.2 says that we have a known parametric function for the propensity score but there is no requirement for this model to be correctly specified. I continue to assume that the estimator of γ∗ solves a binary response maximum likelihood problem and G(x, γ∗) is the model evaluated at the pseudo true value. In the event that the model is correctly specified for the propensity score, G(x, γ∗) = p1(x). I make a similar assumption for the missing outcomes model. Assumption 3.5.3. (Parametric specification of missingness probability) Assume that con- ditions (1) and (3) of 3.3.2 hold where condition (2) is defined for some δ∗ ∈ Γ such that plim(ˆδ) = δ∗ Again, assumption 3.5.3 says that we have a known parametric function for the missing outcome probability given by R(x, wg, δ) and I do not impose any requirement for this model 88 to be correctly specified. However, when this model is correctly specified, R(x, wg, δ∗) = r(x, wg). Given assumptions 3.5.1, 3.5.2 and 3.5.3, its easy to show that θ0 g solves the doubly weighted problem in the population where the weights are constructed using potentially misspecified probabilities. I provide a sketch of the argument for the treatment group pa- rameter θ0 1 and the proof for θ0 0 follows in a similar manner. Consider, E s R(x, w1, δ∗) · w1 G(x, γ∗) · q(y(1), x, θ1) (3.21) (cid:20) (cid:21) Using three applications of LIE along with ignorability and unconfoundedness, I can rewrite the above expectation as (cid:20) r(x, w1) R(x, w1, δ∗) E · p1(x) G(x, γ∗) (cid:21) · E[q(y(1), x, θ1)|x] Assumption 3.5.1 along with positive weights i.e. (x, w1), implies R(x,w1,δ∗) ≥ 0 and p1(x) r(x,w1) G(x,γ∗) ≥ 0 for all · p1(x) G(x, γ∗) · E[ q(y(1), x, θ0 r(x, w1) R(x, w1, δ∗) where the inequality is strict when θ1 (cid:54)= θ0 even if the weights are misspecified. 1)|x] ≤ r(x, w1) R(x, w1, δ∗) · p1(x) G(x, γ∗) · E[ q(y(1), x, θ1)|x], ∀ θ1 ∈ Θ1 1. Therefore, solving 3.21 identifies the parameter In general, the parameter that solves 3.21 will be g is a unique solution, solving 3.21 different from the one that solves 3.2. But as long as θ0 will identify it. When R(x, wg, δ∗) = r(x, wg) and G(x, γ∗) = p1(x), then solving 3.21 will be the same as solving 3.12 for the treatment group and 3.13 for the control group. Estimation of G(·) and R(·) follows from Lemma 3.3.3 and 3.3.4 but with probability limits given by δ∗ and γ∗ rather than δ0 and γ0 respectively. Since R(x, wg, δ∗) and G(x, γ∗) can be any positive functions of x and wg, one special case corresponds to them being constants. Since weighting by fixed constants does not affect g , which the minimization problem, this implies that the unweighted estimator, denoted by ˆθu is a special case of the doubly weighted estimator, is also be consistent for θ0 g. 89 The following theorem establishes consistency of the doubly weighted estimator under strong identification. Theorem 3.5.4. (Consistency under strong identification) Under assumptions 3.2.2, 3.2.3, 3.2.4, 3.5.1, 3.5.2 and 3.5.3 and assume regularity con- ditions (1), (2) and (3) of Theorem 3.4.1. 
we have ˆθg →p θ0g as N → ∞, where ˆθg is the doubly weighted estimator that solves problem 3.21.

The next theorem states asymptotic normality of the doubly weighted estimator that solves the conditional M-estimation problem with misspecified probabilities.

Theorem 3.5.5. (Asymptotic normality) Under the assumptions of theorem 3.5.4 and the regularity conditions of theorem 3.4.2, we obtain

    √N(ˆθg − θ0g) →d N(0, Hg⁻¹ Ωg Hg⁻¹)

where Ω1 = E(li li′) and Ω0 = E(ki ki′), with Hg as defined in condition (4) of Theorem 3.4.2, and li and ki defined as in condition (6) of Theorem 3.4.2 but with weights given by G(x, γ∗) and R(x, wg, δ∗).

Substantively, there is no real difference in the proof of the above theorem, except that now ˆγ and ˆδ converge to probability limits that could be potentially different from those indexing the true treatment and missing outcome probabilities. A consequence of the objective function solving the conditional problem is reflected in the asymptotic variance expression above: compared to section 3.4, Ωg now is just the variance of the weighted score without the first stage adjustment. Since the conditional score of the weighted problem is zero, i.e., E[ ∇θg q(y(g), x, θ0g)′ | x ] = 0, the correlation between the weighted score and the two MLE scores is zero, giving us the familiar expression above.

A consequence of this simpler expression for Ωg is that estimating the probabilities in a first step is no longer superior to using known weights. This is formalized in the following corollary.

Corollary 3.5.6. (No gain with estimated weights under strong identification) Under the assumptions of theorem 3.5.5, we obtain

    Avar[ √N(˜θg − θ0g) ] = Avar[ √N(ˆθg − θ0g) ] = Hg⁻¹ Ωg Hg⁻¹

where ˜θg is the estimator that uses known (potentially misspecified) probabilities and ˆθg is the estimator that uses estimated probabilities.

This, too, is attributable to the conditional score of the weighted problem being zero, namely, E[ ∇θg q(y(g), x, θ0g)′ | x ] = 0.

A second interesting question concerns the role of weighting in this scenario. As mentioned earlier, the unweighted estimator, or in fact any weighted estimator with possibly misspecified probabilities, will be consistent for θ0g (in fact, the estimator that only weights by the propensity score will also be consistent in this case). Interestingly, if the objective function satisfies the generalized conditional information matrix equality (GCIME) defined below, the unweighted estimator is asymptotically more efficient than any weighted estimator. The following theorem formalizes this efficiency result.
A second interesting question concerns the role of weighting in this scenario. As mentioned earlier, the unweighted estimator, or in fact any weighted estimator with possibly misspecified probabilities, will be consistent for θ0g (the estimator that weights only by the propensity score will also be consistent in this case). Interestingly, if the objective function satisfies the generalized conditional information matrix equality (GCIME) defined below, the unweighted estimator is asymptotically more efficient than any weighted estimator. The following theorem formalizes this efficiency result.

Theorem 3.5.7. (Efficiency gain with unweighted estimator under GCIME) Under the assumptions of Theorem 3.5.5, additionally assume that the objective function satisfies the generalized conditional information matrix equality (GCIME) in the population, defined as

    E[∇θg q(y(g), x, θ0g)' ∇θg q(y(g), x, θ0g)|x] = σ²0g · ∇²θg E[q(y(g), x, θ0g)|x] = σ²0g · A(x, θ0g)    (3.22)

where ∇²θg E[q(y(g), x, θ0g)|x] = A(x, θ0g). Then

    Avar[√N(θ̂g − θ0g)] = Hg⁻¹ Ωg Hg⁻¹

where

    H1 = E[ r(xi, wi1)·p1(xi) / (R(xi, wi1, δ*)·G(xi, γ*)) · A(xi, θ0_1) ],
    H0 = E[ r(xi, wi0)·p0(xi) / (R(xi, wi0, δ*)·(1 − G(xi, γ*))) · A(xi, θ0_0) ],
    Ω1 = σ²01 · E[ r(xi, wi1)·p1(xi) / (R²(xi, wi1, δ*)·G²(xi, γ*)) · A(xi, θ0_1) ],
    Ω0 = σ²00 · E[ r(xi, wi0)·p0(xi) / (R²(xi, wi0, δ*)·(1 − G(xi, γ*))²) · A(xi, θ0_0) ],

and

    Avar[√N(θ̂u_g − θ0g)] = (Hu_g)⁻¹ Ωu_g (Hu_g)⁻¹

where

    Hu_1 = E[r(xi, wi1)·p1(xi)·A(xi, θ0_1)],  Ωu_1 = σ²01 · E[r(xi, wi1)·p1(xi)·A(xi, θ0_1)],
    Hu_0 = E[r(xi, wi0)·p0(xi)·A(xi, θ0_0)],  Ωu_0 = σ²00 · E[r(xi, wi0)·p0(xi)·A(xi, θ0_0)].

Given the above, Avar[√N(θ̂g − θ0g)] − Avar[√N(θ̂u_g − θ0g)] is positive semi-definite.

The proofs of the above theorems are easy to establish and can be found in the appendix. The GCIME assumption is known in a variety of estimation contexts. In the case of full maximum likelihood, GCIME holds for q(y(g), x, θg) = −ln f(y(g)|x, θg), where f(·) is the true conditional density of y(g), with σ²0g = 1. For the case of quasi maximum likelihood in the linear exponential family for estimating the true conditional mean parameters, GCIME holds for the same q(·), but f(·) now denotes a density from the linear exponential family with Var(y(g)|x) = σ²0g · v[mg(x, θ0g)]. In other words, GCIME will be satisfied in the QMLE case when Var(y(g)|x) satisfies the generalized linear model assumption, irrespective of whether the higher-order moments of the conditional distribution of y(g) correspond to the chosen LEF density or not. For estimation using NLS, GCIME holds for q(y(g), x, θg) = (y(g) − mg(x, θg))² under the homoskedasticity assumption. Hence, in all of these cases the unweighted estimator will be more efficient than its weighted counterpart. But when GCIME is not satisfied, the two cannot be ranked efficiency-wise. A short verification of the NLS case is given below.
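As a quick check of the NLS claim, the following derivation — a sketch assuming a correct conditional mean and homoskedastic errors, with σ²u denoting E[u²(g)|x] — verifies 3.22 directly.

    % GCIME check for NLS: q(y(g),x,\theta_g) = (y(g) - m_g(x,\theta_g))^2,
    % assuming y(g) = m_g(x,\theta_g^0) + u(g), E[u(g)|x] = 0, E[u^2(g)|x] = \sigma_u^2.
    \begin{align*}
    \nabla_{\theta_g} q &= -2\,\big(y(g) - m_g\big)\,\nabla_{\theta_g} m_g
    \;\Rightarrow\;
    E\!\left[\nabla_{\theta_g} q'\,\nabla_{\theta_g} q \,\middle|\, x\right]
      = 4\,\sigma_u^2\;\nabla_{\theta_g} m_g'\,\nabla_{\theta_g} m_g \\
    E[q \mid x] &= \sigma_u^2 + \big(m_g(x,\theta_g^0) - m_g(x,\theta_g)\big)^2
    \;\Rightarrow\;
    \nabla^2_{\theta_g} E[q \mid x]\Big|_{\theta_g^0}
      = 2\,\nabla_{\theta_g} m_g'\,\nabla_{\theta_g} m_g \equiv A(x,\theta_g^0)
    \end{align*}
    % so (3.22) holds with \sigma^2_{0g} = 2\sigma_u^2.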
The next section uses the results discussed in this section and in section 3.4 to characterize the nature of this robustness property, with explicit focus on two important causal parameters: the ATE and QTEs. Before this, a flowchart outlines the different cases of misspecification that are possible under the doubly weighted framework.

3.6 Robust estimation

The asymptotic theory developed in sections 3.4 and 3.5 can now be used to characterize the robustness property of the doubly weighted estimator. Delineating the asymptotic theory using the weak and strong identification assumptions helps me to be precise about the nature of this robustness and its constituents.

3.6.1 Average treatment effect

The most common parameter of interest in applied work is the ATE, defined for an underlying population of interest.21 Given the importance of this parameter in applied work, I discuss how the current framework allows robust estimation of the ATE. Depending on which component of the doubly weighted framework is allowed to be misspecified, I utilize the asymptotic results from sections 3.4 and 3.5, along with particular estimation methods, to establish consistent estimation of the ATE.

In the presence of covariates, x, that are predictive of the potential outcomes, it is helpful to define the average treatment effect (ATE) as

    τate = E[μ1(x)] − E[μ0(x)]

where μg(x) denotes the true conditional mean (or regression function) of y(g). Let mg(x, θg) be a parametric model for E[y|x, wg = 1].22 This model is said to be correctly specified if

    μg(x) = mg(x, θ0g), for some θ0g ∈ Θg    (3.23)

21 For instance, in the NSW program, which is the main empirical application in this paper, I define the ATE to be the expectation over the population of all eligible participants.
22 Under unconfoundedness, the regression function is identified as μg(x) = E[y|x, wg = 1], ∀ g ∈ {0, 1}.

Given the parametric nature of this framework, I acknowledge and tackle misspecification of the conditional mean model, mg(x, θg), the propensity score model, G(x, γ), and the missing outcomes probability model, R(x, wg, δ). While the discussion in this section focuses on consistent estimation of θ0_1, an analogous argument can be made for estimating θ0_0. The first case considers correct specification of the missing outcomes probability model.

Case 1: Correct missing probability model, R(x, wg, δ)

In the current framework, when R(·) is correctly specified, one obtains the usual double robustness (DR) result of causal inference. DR ensures that θ0g is estimated consistently despite having either the propensity score or the conditional mean model misspecified, but not both. Naturally, what θ0g represents in this case will depend on what is being assumed about the conditional mean model. However, I will show that under each of these cases, a consistent estimate of the ATE can always be obtained.

a. First half of DR: Correct conditional mean, E(y(g)|x)

Having a correctly specified mean model implies that I can decompose the potential outcomes into their true means as follows:

    y(g) = mg(x, θ0g) + u(g),  E[u(g)|x] = 0    (3.24)

for both g = 0, 1. In this case, we know there are many estimation methods that can consistently estimate θ0g, such as nonlinear least squares (NLS) and QMLE in the linear exponential family. The question that remains to be addressed is whether any of these procedures require weighting to obtain consistent estimates of θ0g. To answer this, I look at these two estimation methods in detail and tie them to the theoretical results developed in earlier sections.

Solving for θ0g using NLS means minimizing the expected squared error between y(g) and mg(x, θg). In fact, under 3.24, θ0g is identified in the stronger sense that it solves the conditional NLS problem,

    θ0g = argmin_{θg ∈ Θg} E[(y(g) − mg(x, θg))²|x]    (3.25)

Similarly, for estimation of θ0g using QMLE in the linear exponential family (Gourieroux et al. (1984), Wooldridge (2010) chapter 13), if one chooses the range of the conditional mean function, mg(x, θg), to correspond with the range of the quasi-log likelihood for a given linear exponential density, θ0g is again identified in the conditional sense,

    θ0g = argmin_{θg ∈ Θg} E[−ln f(y(g), mg(x, θg))|x]

where f(·) is the density associated with the chosen linear exponential distribution.23 For both of these examples, results from section 3.5 dictate that weighting by either correct or misspecified probabilities is not needed for consistency. The fact that one could weight by a misspecified propensity score model and still obtain this result is what forms the 'first part' of the DR result with propensity score weighting. Once θ̂g has been estimated by solving the sample version of the NLS or QMLE problem, the ATE can be estimated as

    τ̂ate = (1/N) Σ_{i=1}^N m1(xi, θ̂1) − (1/N) Σ_{i=1}^N m0(xi, θ̂0)

If, in addition to having a correct conditional mean, I also assume the error variance of the outcomes is homoskedastic (E[u²(g)|x] = σ²0g), then the estimator that does not weight at all may be the preferred estimator from an efficiency perspective. This result is due to GCIME being satisfied under homoskedasticity with NLS. A sketch of this first-half-of-DR estimator is given below.

23 For example, if mg(x, θg) ∈ (0, 1), one would typically use the Bernoulli density, f(y(g), mg(x, θg)) = mg(x, θg)^y(g) · (1 − mg(x, θg))^(1−y(g)).
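The following is a minimal sketch of this case, assuming a binary (or fractional) outcome so that the Bernoulli QLL with a logistic mean from footnote 23 applies; the function name and arguments are illustrative. Weighting is optional here, which is the point of the first half of DR.

    import numpy as np
    import statsmodels.api as sm

    def ate_first_half_dr(y, X, s, w, Ghat=None, Rhat=None):
        # Fits a Bernoulli QMLE with a logistic mean on each arm's observed
        # subsample and averages the fitted means over the full sample.
        # Weighting by 1/(Rhat * Ghat_g) is optional: with a correct
        # conditional mean, the unweighted fit is consistent too.
        mhat = {}
        for g in (0, 1):
            use = (s == 1) & (w == g)
            lam = None
            if Ghat is not None and Rhat is not None:
                pg = Ghat if g == 1 else 1.0 - Ghat
                lam = 1.0 / (Rhat[use] * pg[use])
            fit = sm.GLM(y[use], X[use], family=sm.families.Binomial(),
                         var_weights=lam).fit()
            mhat[g] = fit.predict(X)       # m_g(x_i, theta_hat_g) for all i
        return mhat[1].mean() - mhat[0].mean()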
b. Second part of DR: Correct propensity score model, G(x, γ)

If one acknowledges misspecification of the conditional mean model, then this brings us to the second case of DR, where only the propensity score model is assumed to be correct. In this case, there is no longer a general way of consistently estimating the conditional mean parameters. However, a very useful mean fitting property of QMLEs in the linear exponential family can be utilized here to obtain consistent estimators of the unconditional means, E[y(g)], despite misspecification of mg(x, θg).24 The estimation strategy is to choose mg(x, θg) to be the inverse canonical link function, h(·), with the QLL corresponding to a choice of LEF density. In the generalized linear model (GLM) literature, the link function, h⁻¹(·), relates the mean of the distribution to a linear index:

    h⁻¹(μg(x)) = xθg    (3.26)

Then the first order conditions of such a QMLE problem give us

    E[ ∇θ mg(x, θ*g)' · (y(g) − mg(x, θ*g)) / v(mg(x, θ*g)) ] = 0    (3.27)

where θ*g denotes the pseudo-true parameter indexing the misspecified conditional mean model (White (1982)). By choosing the canonical link as the mean model of choice, the gradient in the numerator of 3.27 cancels with the variance term in the denominator. Note that this occurs only when one uses the canonical link associated with the chosen LEF density, and not with any other choice of link function. This in turn ensures that if one includes an intercept in x, the model fits the overall mean of the distribution (see Wooldridge (2010) chapter 13 for more detail), i.e.

    E[y(g)] = E[mg(x, θ*g)]

A short check of the cancellation is given below.

24 Słoczyński and Wooldridge (2018) use this mean fitting property for developing doubly robust estimators of various ATEs.
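For instance, for the Bernoulli QLL with its canonical logit link, the cancellation in 3.27 can be verified directly; this is a sketch of that one case, not of the general LEF argument.

    % Cancellation in (3.27) for the Bernoulli QLL with the logit (canonical)
    % link, m_g(x\theta_g) = \Lambda(x\theta_g) = \exp(x\theta_g)/(1+\exp(x\theta_g)):
    \nabla_{\theta_g} m_g = \Lambda(x\theta_g)\big[1-\Lambda(x\theta_g)\big]\,x,
    \qquad
    v(m_g) = m_g(1-m_g) = \Lambda(x\theta_g)\big[1-\Lambda(x\theta_g)\big],
    % so the ratio in (3.27) collapses to
    \frac{\nabla_{\theta_g} m_g'\,\big(y(g)-m_g\big)}{v(m_g)}
      = x'\big(y(g) - \Lambda(x\theta_g)\big).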
Under i.i.d. sampling, solving the sample analogue of the population FOC given in 3.27 would be sufficient to obtain consistent estimates of θ*g. However, in the presence of non-random assignment and missing outcomes, one needs to weight the first order conditions in 3.27 to ensure that θ*g is estimated consistently. In other words, one would solve the following moment conditions:

    Σ_{i=1}^N [ si·wi1 / (R(xi, wi1, δ̂)·G(xi, γ̂)) ] · xi' · [yi − h(α̂1 + xi β̂1)] = 0
    Σ_{i=1}^N [ si·wi0 / (R(xi, wi0, δ̂)·(1 − G(xi, γ̂))) ] · xi' · [yi − h(α̂0 + xi β̂0)] = 0    (3.28)

The choice of the LEF density has to be consistent with the range and nature of the outcome, y(g).25

25 The following combinations of QLL and link functions produce the mean fitting property.
1. Normal log-likelihood with identity link function, when there are no restrictions on the range of y(g):
    E[x' · (y(g) − xθ*g)] = 0
This is the first order condition for OLS, which ensures that E[y(g)] = E[xθ*g] if an intercept is included in the linear projection.
2. Poisson log-likelihood with log link function, when the range of y(g) is restricted to be non-negative (y(g) ≥ 0); after the gradient exp(xθ*g)·x' cancels with the variance exp(xθ*g):
    E[x' · (y(g) − exp(xθ*g))] = 0
3. Bernoulli log-likelihood with logit link function, when y(g) is restricted to be in the unit interval (y(g) ∈ [0, 1]); after the gradient exp(xθ*g)/(1+exp(xθ*g))² · x' cancels with the variance exp(xθ*g)/(1+exp(xθ*g))²:
    E[x' · (y(g) − exp(xθ*g)/(1+exp(xθ*g)))] = 0

Estimation summary under second part of DR: Estimation of the average treatment effect in the case of a misspecified mean model but correct propensity score and missing probability models follows in two steps; a sketch appears after this list.

1. Depending upon the range and nature of the outcome variable, y(g), choose an appropriate LEF density. Choose the mean function mg(x, θg) = h(xθg), where h(·) is the inverse canonical link function associated with this chosen density. Using this combination of mean function and quasi-log-likelihood, use the moment conditions in 3.28 to obtain consistent estimates, θ̂g.

2. Using the estimates that solve problem 3.28, one can then obtain a consistent estimate of the average treatment effect as follows:

    τ̂ate = (1/N) Σ_{i=1}^N h(α̂1 + xi β̂1) − (1/N) Σ_{i=1}^N h(α̂0 + xi β̂0)    (3.29)

where α̂g and β̂g are the solutions to 3.28. The formal proof of consistency for τ̂ate in this case is given in appendix J, and follows in a manner similar to Negi and Wooldridge (2019).
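The following is a sketch of this two-step procedure, using the Poisson QLL with its canonical log link (the second combination in footnote 25) for a nonnegative outcome; the logit case is analogous. Here statsmodels' var_weights is used to put the λi weights into the GLM score, and the function name and arguments are illustrative.

    import numpy as np
    import statsmodels.api as sm

    def ate_second_half_dr(y, X, s, w, Ghat, Rhat):
        # Doubly weighted Poisson QMLE with the (canonical) log link. With a
        # canonical link the weighted GLM score is
        #   sum_i lam_i * x_i' (y_i - exp(x_i theta)),
        # which matches the moment conditions in (3.28).
        # Rhat holds R(x_i, w_i, delta_hat) at each unit's own w_i.
        mhat = {}
        for g in (0, 1):
            use = (s == 1) & (w == g)
            pg = Ghat if g == 1 else 1.0 - Ghat
            lam = 1.0 / (Rhat[use] * pg[use])
            fit = sm.GLM(y[use], X[use], family=sm.families.Poisson(),
                         var_weights=lam).fit()
            mhat[g] = fit.predict(X)
        return mhat[1].mean() - mhat[0].mean()   # tau_hat from (3.29)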
Case 2: Misspecified missing outcomes probability model, R(x, wg, δ)

If the missing outcomes model is misspecified, then what suffices for consistent estimation of the ATE is a strengthening of the identification assumption from 3.2.1 to 3.5.1. In other words, θ0g must solve the relevant problem in the conditional sense. For instance, estimation via NLS would imply that θ0g indexes the true conditional mean function, i.e. E(y(g)|x) = mg(x, θ0g), with

    θ0g = argmin_{θg ∈ Θg} E[(y(g) − mg(x, θg))²|x]

and similarly for the QMLE example, θ0g = argmin_{θg ∈ Θg} E[−ln f(y(g), mg(x, θg))|x]. Hence, misspecification in R(x, wg, δ) can be allowed in exchange for θ0g being identified in the conditional sense.

To conclude, robust estimation of the ATE under the doubly weighted framework can be achieved as follows. If the missing outcomes probability model R(·, δ) is misspecified, then one can consistently estimate the ATE when the conditional mean model is correct. If R(·, δ) is correct, then one can estimate the ATE in the usual doubly robust manner, i.e. misspecification may be allowed either in the propensity score model or the conditional mean model, but not both. Finally, if the conditional mean model is misspecified, then both probability models, G(·, γ) (for the propensity score) and R(·, δ) (for the missing outcomes probability), need to be correct.

To illustrate robust estimation of the ATE using the proposed doubly weighted estimator, and to study its finite sample behavior, the next section discusses a simulation study which considers the different cases of misspecification mentioned above.

3.6.2 Monte carlo evidence

To allow for possible misspecification of the regression functions μg(x), I simulate two binary potential outcomes generated using a probit:

    y(g) = 1[y*(g) > 0],  y*(g) = xθ0g + u(g)

Note that x includes an intercept. The linear index, xθ0g, is parameterized so that the covariates are only mildly predictive of the potential outcomes, with R²0 = 0.19 and R²1 = 0.14 in the population.26 The two covariates and the two latent errors are drawn from two independent bivariate normal distributions:

    (x1, x2)' ~ N( (1, 2)', [[2, 0.2], [0.2, 3]] ),  (u(0), u(1))' ~ N( (0, 0)', [[1, 0.2], [0.2, 1]] )    (3.30)

The assignment and missing outcome mechanisms have been simulated to ensure that unconfoundedness and ignorability are satisfied:

    w1 = 1[w1* > 0],  s = 1[s* > 0],  where w1* = xγ0 + ξ and s* = zδ0 + ζ    (3.31)

with the errors ξ and ζ drawn from two independent standard logistic distributions.27 Misspecification in these models is allowed in both the functional form and the linear index, where for the misspecified cases I estimate a probit with x1 omitted from the linear index. For scenarios where the conditional mean is misspecified, I estimate a linear model with a correct index. The parameters γ0 and δ0, indexing the assignment and missingness mechanisms, have been chosen to ensure an average propensity of assignment of 0.41 and an average propensity of being observed of 0.38.28 The missing data have been simulated to imitate empirical settings where a significant portion of the outcomes is missing. Table ?? gives an estimation summary for the eight different cases of misspecification that are considered here. A sketch of this design follows.

26 Here θ0_0 = (0, 1, 1)' and θ0_1 = (−1, 1, 1)'. With cross-sectional data, covariates are typically seen to be mildly predictive of the outcome. For example, in the National Supported Work dataset from Calónico and Smith (2017), baseline factors explain about 26-50 percent of the variation in the non-experimental sample and about 0.04-2 percent in the experimental sample, depending upon the included subset of covariates.
27 This implies that p(w1 = 1|x) = p1(x) = Λ(xγ0) and p(s = 1|wg, x) = r(wg, x) = Λ(zδ0), where Λ(·) is the standard logistic CDF.
28 Here γ0 = (0.05, −0.2, −0.11)', δ0 = (0.01, 0.03, 0.05, −0.28)' and z = (1, wg, x1, x2).
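The following is a minimal sketch of this data-generating process, with parameter values taken from footnotes 26-28; the variable names and sample size are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 5000

    # covariates and latent errors, as in (3.30)
    x = rng.multivariate_normal([1.0, 2.0], [[2.0, 0.2], [0.2, 3.0]], size=N)
    u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]], size=N)
    X = np.column_stack([np.ones(N), x])           # x includes an intercept

    theta0 = np.array([0.0, 1.0, 1.0])             # footnote 26
    theta1 = np.array([-1.0, 1.0, 1.0])
    y0 = (X @ theta0 + u[:, 0] > 0).astype(float)  # binary probit outcomes
    y1 = (X @ theta1 + u[:, 1] > 0).astype(float)

    # assignment and missingness, as in (3.31), with logistic errors
    gamma0 = np.array([0.05, -0.2, -0.11])         # footnote 28
    w1 = (X @ gamma0 + rng.logistic(size=N) > 0).astype(int)
    Z = np.column_stack([np.ones(N), w1, x])       # z = (1, wg, x1, x2)
    delta0 = np.array([0.01, 0.03, 0.05, -0.28])
    s = (Z @ delta0 + rng.logistic(size=N) > 0).astype(int)

    y = np.where(w1 == 1, y1, y0)
    y[s == 0] = np.nan                             # outcomes unobserved when s = 0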
Results

I discuss results for cases (4) and (5), as these two scenarios are highlighted in sections 3.4 and 3.5. I also discuss case (8): even though the theory developed in this paper is silent when all three components of the framework are misspecified, the simulation results look promising. All other cases are given in the appendix.

Case (4) depicts the possibility that the conditional mean model is correct but both probability models are misspecified. For this case, one can see that weighting does not have any added bite in resolving the identification problem beyond what is already achieved from having a correct mean function. In figure H.2 d), the empirical distributions of the estimated ATE for the unweighted, propensity score weighted, and doubly weighted estimators all coincide. Moreover, all are centered on the true ATE. In terms of root mean squared error, all three perform the same for a sample size of 5000, but PS-weighting performs better when the sample size is 1000. This suggests that PS-weighting could be beneficial in terms of RMSE, at least for small sample sizes. Estimating the propensity score reduces the variance of the weighted score of the problem, which will not necessarily be the case when estimating both probability models. So it might be better to use the propensity score weighted estimator when the conditional mean function is correctly specified.

Case (5) considers a scenario where the mean function is misspecified but the two probability models are correct. This is the principal case covered in section 3.4, where weighting has a crucial role to play. The average bias in the unweighted estimator of the ATE is higher than for the doubly weighted estimator. In fact, the empirical distribution of the unweighted estimator is shifted to the right, whereas that of the doubly weighted estimator is centered on the truth (refer to figure ??). In this case, the doubly weighted estimator has both the smallest bias and the smallest RMSE among all three estimators. Under this case, I also consider the doubly weighted estimator which uses known weights (see table ?? for reference). In finite samples, estimation of the weights could result in conservative variance estimates. While estimating the weights would result in a smaller residual of the weighted score (li), the residual variance could be larger compared to the known-weights estimator because of non-zero cross correlations between the probability scores.

Finally, case (8) considers the scenario where all components of the framework are misspecified. The theory in this paper does not address this case. However, this is an interesting possibility, given that misspecification of all components is a valid concern. The simulation results do offer some insight here. The doubly weighted estimator seems to be the only estimator that delivers the true ATE on average, whereas the others are away from the truth (see table ??).

3.6.3 Quantile effects

Under treatment effect heterogeneity, distributional impacts beyond the ATE are of increasing interest to researchers, especially in program evaluation studies. However, unlike the case of the ATE, it is generally not possible to obtain robust estimation of UQTEτ.29 In this section, I employ the double weighting framework to focus attention on estimating three different quantile effects, namely UQTEτ, CQTEτ, and a weighted linear approximation (LP) to the true CQTEτ. Whether θ0g indexes the true CQF or an approximation will depend on what is being assumed about the conditional quantile model and the estimation method used.

Let us assume that the two potential outcomes are continuous on R and that the unconditional quantiles of y(g) are unique and do not have any flat spots at the τth quantile. Then, the conditional quantiles of y(0) and y(1) given covariates, x, are defined as

    Quantτ(y(g)|x) = inf{y : F_y(g)(y|x) ≥ τ},  where 0 < τ < 1

where F_y(g)(y|x) is the distribution function of y(g) conditional on x and is assumed to have density f(y(g)|x).

29 This is because averaging the CQTEτ does not give us the UQTEτ.
Then, the CQTEτ at x = x0 for the τth quantile is defined as the difference in the conditional quantiles of the two outcome distributions, i.e.

    CQTEτ(x0) = Quantτ(y(1)|x0) − Quantτ(y(0)|x0)

Similarly, UQTEτ is defined as the difference in the τth unconditional quantiles of the two outcome distributions:

    UQTEτ = Quantτ(y(1)) − Quantτ(y(0))

Let quantg,τ(x, θg) be a model for the τth conditional quantile of y(g). This is said to be correctly specified for Quantτ(y(g)|x) if

    Quantτ(y(g)|x) = quantg,τ(x, θ0g(τ)), for some θ0g(τ) ∈ Θg, g = 0, 1    (3.32)

The next section discusses estimation under the first case, when R(·) is correctly specified.

Case 1: Correct missing probability model, R(x, wg, δ)

Similar to the ATE case, when R(·) is correctly specified, one obtains the nested DR result of causal inference. However, the parameter estimable in each case depends on what is being assumed about the CQF. To consider each of these scenarios in detail, consider the first half of DR, when we have a correct CQF.

a. First half of DR: Correct conditional quantiles, quantg,τ(x, θg)

If the CQF is correctly specified, as defined in 3.32, then one can decompose the potential outcomes as

    y(g) = quantg,τ(x, θ0g(τ)) + uτ(g),  Quantτ(uτ(g)|x) = 0    (3.33)

In this case, there are two estimation methods that will ensure consistent estimation of the correct CQF parameters, θ0g(τ). The first is quantile regression (QR) of Koenker and Bassett (1978). The second is a class of quasi maximum likelihood estimators in a special 'tick-exponential' family of distributions proposed by Komunjer (2005). This method is analogous to estimation of correctly specified conditional mean parameters using QMLE in the linear exponential family. The 'first part' of this double robustness result implies that any inverse propensity score weighted version of the QR or QML objective functions, irrespective of whether those weights are correct, will also deliver a consistent and √N-asymptotically normal estimator of θ0g(τ).

For estimation that uses QR, correct specification as given in 3.33 implies that θ0g(τ) will actually solve the stronger conditional problem,

    θ0g(τ) = argmin_{θg ∈ Θg} E[cτ(y(g) − quantg,τ(x, θg))|x]    (3.34)

where cτ(u) = u·(τ − 1[u < 0]) is the check function defined for some random variable u. Since θ0g(τ) satisfies the stronger identification condition, results from section 3.5 can be applied. This means that weighting is not needed for consistent estimation of θ0g(τ), irrespective of whether the weighting functions are correct or not.

In a similar vein, estimation via QML using the tick-exponential family implies that, as long as the CQF is correct,

    θ0g(τ) = argmin_{θg ∈ Θg} E[−ln(φτ(y(g), quantτ,g(x, θg)))|x]    (3.35)

where φτ(·,·) is a density that belongs to the 'tick-exponential' family characterized by

    φτ(y, η) = exp[−(1 − τ)[a(η) − b(y)]·1{y ≤ η} + τ[a(η) − c(y)]·1{y > η}]

where τ ∈ (0, 1), a(·) is continuously differentiable, and b(·) and c(·) are continuous functions such that η ∈ M ⊂ R.30

Once we have obtained θ̂g, either by solving the QR or the QML problem, the conditional quantile treatment effect for the subgroup defined by xi can be estimated as

    ĈQTEτ(xi) = quant1,τ(xi, θ̂1) − quant0,τ(xi, θ̂0).

30 φτ(y, η) is a probability density, and η is the τ-quantile of φτ, such that ∫_{−∞}^{η} φτ(y, η)dy = τ. Komunjer (2005) shows that if one chooses a(η) = 1/(τ(1−τ)) · η and b(y) = c(y) = 1/(τ(1−τ)) · y, then the quasi log likelihood function is proportional to the check function that was originally introduced by Koenker and Bassett (1978).
b. Second half of DR: Correct propensity score model, G(x, γ)

Suppose now that the propensity score model is correctly specified while the conditional quantile model is misspecified. Traditionally, the theory of quantile estimation has not dealt with this case of misspecification.31 However, Angrist et al. (2006b) establish an approximation property of QR with a misspecified linear CQF that is analogous to the approximation property of linear regression.32 Hence, solving the QR objective function with quantτ,g(x, θg) = xθg would still identify a weighted approximation to the CQF.

31 Kim and White (2003) establish consistency and asymptotic normality of the QR estimator for a pseudo-true value in the case of a misspecified linear conditional quantile model.
32 Adapting Angrist et al. (2006b)'s notation to the potential outcomes framework, the parameters that solve the QR problem solve a weighted mean square approximation to the true CQF,

    θ0g(τ) = argmin_{θg ∈ Θg} E[ωτ(x, θg) · (Quantτ(y(g)|x) − xθg)²]

where ωτ(x, θg) = ∫₀¹ (1 − u)·f_y(g)(u·xθg + (1 − u)·Quantτ(y(g)|x)|x) du is the weighting function that determines the importance given by the minimizer, θ0g, to points in the support of x.

Under i.i.d. sampling, solving the sample QR objective function is sufficient to obtain consistent estimates of θ*g. However, as in the case of the ATE, weighting becomes crucial in the presence of non-random assignment and missing outcomes. In other words, one would need to weight the QR estimator with the correct propensity score and missing outcomes probability models to consistently estimate θ*g. For instance, one would now solve the following treatment and control group problems,

    min_{θ1 ∈ Θ1} Σ_{i=1}^N [ si·wi1 / (R(xi, wi1, δ̂)·G(xi, γ̂)) ] · cτ(yi(1) − xiθ1)
    min_{θ0 ∈ Θ0} Σ_{i=1}^N [ si·wi0 / (R(xi, wi0, δ̂)·(1 − G(xi, γ̂))) ] · cτ(yi(0) − xiθ0)    (3.36)

and the solutions to these sample problems, θ̂g, are interpretable as providing a weighted LP to the true CQTEτ. A sketch of how such weighted QR problems can be solved is given below.
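The weighted QR problems in 3.36 can be solved exactly as linear programs; the following is a minimal sketch using scipy's linprog, with the function name and arguments illustrative. For the treated-group problem, one would pass only the observed treated units with λi = 1/(R(xi, wi1, δ̂)·G(xi, γ̂)).

    import numpy as np
    from scipy.optimize import linprog

    def weighted_qr(y, X, lam, tau):
        # Minimizes sum_i lam_i * c_tau(y_i - x_i @ theta) via the standard
        # LP form of quantile regression:
        #   min  tau * lam'u_plus + (1 - tau) * lam'u_minus
        #   s.t. X theta + u_plus - u_minus = y,  u_plus, u_minus >= 0
        n, k = X.shape
        c = np.concatenate([np.zeros(k), tau * lam, (1.0 - tau) * lam])
        A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
        bounds = [(None, None)] * k + [(0, None)] * (2 * n)
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
        return res.x[:k]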
Case 2: Misspecified missing outcomes probability model, R(x, wg, δ)

If the missing outcomes model is misspecified, then what suffices for consistent estimation of θ0g is a strengthening of the identification condition from 3.2.1 to 3.5.1. For estimation via quantile regression, this means that θ0g solves the conditional QR problem

    θ0g = argmin_{θg ∈ Θg} E[cτ(y(g) − quantg,τ(x, θg))|x]

which will hold only when the conditional score of the check function is zero, i.e.

    E[ −x'{ τ·1[y(g) − quantg,τ(x, θ0g) ≥ 0] − (1 − τ)·1[y(g) − quantg,τ(x, θ0g) < 0] } | x ] = 0

and this will be true only when Quantτ(y(g)|x) = quantg,τ(x, θ0g). So, misspecification in R(x, wg, δ) can be allowed in exchange for having a correctly specified conditional quantile model.

Direct estimation of UQTEτ

As was mentioned earlier in this section, estimating UQTEτ from CQTEτ(x) is generally not possible, even if we assume a correct model for the conditional quantiles of the outcomes. This is because the mean of the quantiles is not equal to the quantiles of the mean. Hence, one cannot obtain unconditional quantiles from averaging conditional quantiles over x. In this case, one can directly estimate the marginal quantiles by running a quantile regression of y(g) on an intercept (as shown in Firpo (2007)).33 In the present case, one would weight the objective function by the two probabilities in the following manner,

    θ̂1(τ) = argmin_{θ1 ∈ Θ1} Σ_{i=1}^N [ si·wi1 / (R(xi, wi1, δ̂)·G(xi, γ̂)) ] · cτ(yi(1) − θ1)
    θ̂0(τ) = argmin_{θ0 ∈ Θ0} Σ_{i=1}^N [ si·wi0 / (R(xi, wi0, δ̂)·(1 − G(xi, γ̂))) ] · cτ(yi(0) − θ0)

whose probability limits, θ0_1(τ) and θ0_0(τ), are the marginal τ-quantiles when both probability models are correctly specified. Weighting by G(·) and R(·) is crucial here, since these weights primarily serve to remove the selection biases due to non-random assignment and missing data. Then, one can obtain the unconditional quantile treatment effect as

    UQTEτ = θ0_1(τ) − θ0_0(τ)

33 Firpo (2007) uses propensity score weighting to directly estimate unconditional quantiles in the presence of non-random assignment.
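For these intercept-only problems, the minimizer has a simple closed form: a weighted sample quantile. The following is a small sketch, with illustrative names.

    import numpy as np

    def weighted_quantile(y, lam, tau):
        # The intercept-only weighted QR reduces to a weighted sample quantile:
        # the smallest y at which the normalized cumulative weight reaches tau.
        order = np.argsort(y)
        cum = np.cumsum(lam[order]) / lam.sum()
        return y[order][np.searchsorted(cum, tau)]

    # UQTE_tau = weighted_quantile(y_treated, lam1, tau) \
    #          - weighted_quantile(y_control, lam0, tau)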
The next section explores estimation of these three quantile estimands using a Monte Carlo experiment, where I allow misspecification of the weighting functions and the conditional quantile model.

3.6.4 Monte carlo evidence

To ensure that the marginal quantiles of the potential outcome distributions are unique with no flat spots, I simulate two continuous non-negative outcomes as follows,

    y(g) = exp(xθ0g + u(g)), for g = 0, 1

where θ0_1 = (0.1, −0.36, −0.1)' and θ0_0 = (0.2, 0.24, −0.45)' are parameterized to ensure R²0 = 0.15 and R²1 = 0.13 in the population. The two covariates and the two latent errors are drawn from two independent normal distributions following eq (3.30). The missing outcomes and the treatment assignment mechanisms are also generated according to eq (3.31). Since exp(·) is an increasing continuous function, the equivariance property of quantiles implies that

    Quantτ(y(g)|x) = Quantτ(exp(xθ0g + u(g))|x) = exp(Quantτ(xθ0g + u(g)|x)) = exp(xθ0g + Quantτ(u(g)|x)) = exp(xθ0g + Φ⁻¹(τ))

where Φ⁻¹(τ) is the inverse standard normal CDF evaluated at τ. This equivariance property helps to characterize and estimate the CQTE for cases when the CQF is correct.

For brevity, I study the behavior of the unweighted, propensity score weighted, and double weighted estimators for only five of the eight cases of misspecification. These are enumerated in table ?? below. Of these, I discuss cases 4, 5 and 8 in the main text; results for the rest can be found in the appendix. Cases 4 and 5 correspond to the scenarios for which results are derived in sections 3.5 and 3.4, respectively. The last case corresponds to the scenario where all components of the doubly weighted framework are misspecified. Even though the theory in this paper does not address that specific case, the simulation results show that the proposed estimator has the lowest bias among all three.

Results

For the first case, when the CQF is correctly specified, figure H.3 plots the CQTE as a function of x1 for the 25th quantile of the outcome distribution. Results for the 50th and 75th quantiles are given in the appendix. One can see that the estimated function coincides with the true CQTE.34 To make this case interesting, I consider misspecification of both probability models. As the results in section 3.5 dictate, all three estimators (unweighted, ps-weighted and double-weighted) will be consistent for the true CQTE because the CQF is correctly specified. Hence, misspecification of the two probability models does not affect consistent estimation of the estimand. In fact, weighting by any positive function would deliver this result, including the ps-weighted estimator.

Next, I consider the case when the CQF is misspecified. Using the results in Angrist et al. (2006b), I interpret the solution to the double-weighted problem given in eq 3.36 as providing a consistent weighted linear projection to the true CQF. I use these linear projections to estimate an LP to the true CQTE. Figure H.4 plots the bias in the estimated LP relative to the true LP as a function of x1 for the three estimators. In panel A), where both probability models are correct, the relative bias from the double-weighted estimator is the lowest and coincides with the line of no bias. Panel D) considers the case where all three parametric specifications are wrong. Again, we see that the double-weighted estimator performs best in terms of bias. Even though the theory does not guide us here, double weighting seems to be the least biased procedure.

Finally, I consider direct estimation of the unconditional quantile treatment effect (UQTE) at the 25th quantile. Again, results for the 50th and 75th quantiles can be found in the appendix. Notice that estimation of the UQTE does not require parametric specification of the CQF, since it is the difference in marginal quantiles. Hence, the two probability models are the only relevant components of the framework that affect consistent estimation of the UQTE. In the first case, when both probability models are correct, the unweighted and double-weighted estimators are both close to the true quantile effect. For the second case, where both probability models are misspecified, double weighting does a little worse than not weighting at all. However, the results at other quantile levels reflect more favorably upon double weighting. Propensity score weighting performs the worst in both cases, suggesting that there are instances where correcting only for nonrandom assignment may not be better than not weighting at all.

34 For plotting these functions, I first collect the QR estimates that solve the unweighted, ps-weighted and double-weighted check function (defined in 3.34) corresponding to a particular quantile level, τ ∈ {0.25, 0.50, 0.75}, across 1000 Monte Carlo repetitions. I then draw a linearly spaced x1 vector and simulate the CQTE using the 1000 estimated QR coefficients. Averaging these 1000 functions at each point on the x1 vector gives me the estimated average CQTE function. I plot this along with the 1000 individual functions and the true CQTE, which is calculated using the population QR parameters, θ0g.

Tables below report the bias and RMSE of the three estimators, along with the double weighted estimator that uses known probability weights. When the two probability models are correct, the double-weighted estimator has the lowest RMSE. This, however, ceases to be true when the two probabilities are misspecified.
3.7 Application to Calónico and Smith (2017)

In this section, I apply the proposed estimator to the Aid to Families with Dependent Children (AFDC) sample of women from the National Supported Work (NSW) program compiled by Calónico and Smith (2017).35 NSW was a transitional and subsidized work experience program which was implemented as a randomized experiment in the United States between 1975-1979. CS replicate LaLonde (1986)'s within-study analysis for the AFDC women in the program, where the purpose of such an analysis is to evaluate how training estimates obtained from non-experimental identification strategies (for example, the CIA) compare to experimental estimates. To compute the non-experimental estimates, CS combine the NSW experimental sample with two non-experimental comparison groups drawn from the PSID, called PSID-1 and PSID-2.36

35 Henceforth, Calónico and Smith (2017) is referred to as CS.
36 The PSID-1 sample constructed by CS keeps all female household heads continuously from 1975-1979 who were between 20 and 55 years of age in 1975 and were not retired in 1975. The sample labeled PSID-2 further restricts PSID-1 to include only those women who received AFDC welfare in 1975.

In this paper, I utilize the within-study feature of this empirical application to estimate the bias in the unweighted and propensity-score weighted estimates, relative to the proposed double weighting procedure. To construct these measures, I augment the CS sample to allow for women who had missing earnings information in 1979. This renders 26% of the experimental and 11% of the PSID samples missing. I then combine the experimental treatment group of NSW with three distinct comparison groups present in the CS dataset, namely the experimental control group and the two PSID samples, to compute the unweighted, single-weighted and double-weighted training estimates.37 The difference between the non-experimental estimate, obtained from using the doubly weighted estimator, and the experimental estimate provides the first measure of estimated bias associated with the proposed strategy. Combining the experimental control group with the non-experimental comparison group gives a second measure of estimated bias (Heckman et al. (1998a)). Much like CS, I report both these measures across a range of regression specifications for the average training estimates.

37 For details regarding sample construction and other aspects of this application, see appendix G.

Given the growing importance of estimating distributional impacts of training programs, I also estimate marginal quantile treatment effects at every 10th quantile of the 1979 earnings distribution. The role of double weighting in ensuring consistency of the estimates is highlighted in the case of estimating marginal quantiles, where covariates, which primarily serve to remove biases arising from non-random assignment and missing outcomes, enter the estimating equation only through the two probability models.

3.7.1 Results

First, to evaluate whether women with missing earnings in 1979 were significantly different from those who were observed, Table I.17 reports the mean and standard deviation of the woman's age, years of schooling, pre-training earnings and other characteristics across the observed and missing samples. In terms of age, the women who were observed in the experimentally treated group of NSW and the PSID-1 sample were, on average, older than those who were missing. The observed women in PSID-1 were also more likely to be married. For the PSID-2 sample, women who were observed had, on average, more kids and higher pre-training earnings.
Apart from these minor differences, the observed women did not appear to be systematically different from those who were missing, as measured through observable characteristics.

The presence of non-experimental comparison groups implies that nonrandom assignment is an issue in the sample. This is because the comparison groups were drawn from the PSID after imposing only a partial version of the full NSW eligibility criteria. Table I.16 provides descriptive statistics for the covariates, by treatment status. As can be expected, the treatment and control groups of NSW are not observably different, indicating the strong role that randomization plays in producing comparable groups. In contrast, the women in the PSID-1 and PSID-2 groups are statistically different from the treatment group members, implying substantial scope for nonrandom assignment.

3.7.1.1 Estimated bias for average and unconditional quantile training effects

Table I.18 reports the doubly-weighted, ps-weighted and unweighted average training estimates, which use the three different comparison groups: NSW control, PSID-1 and PSID-2. The unweighted (unadjusted and adjusted) experimental estimates given in row 1 are the same as the estimates reported by CS in Table 3 of their paper. Overall, one can see that the doubly weighted experimental estimates are more stable than the single-weighted or unweighted estimates across the different regression specifications, with a range between $824-$828.

For computing the ps-weighted and double-weighted non-experimental estimates, I first trim the sample to ensure common support between the treatment and comparison groups.38 This reduces the sample size from 1,248 to 1,016 observations for the PSID-1 estimates and from 782 to 720 observations for the PSID-2 estimates. A pattern that is consistent across the two sets of non-experimental estimates is that weighting gets us much closer to the benchmark, relative to not weighting at all. For instance, the unweighted simple difference in means estimate of training, which uses the PSID-1 comparison group, is -$799, whereas the weighted estimates are $827 and $803. For the PSID-2 comparison group, the unweighted estimate which controls for all covariates is $335, whereas the weighted estimates are $905 and $904.

38 Appendix G describes estimation of the two probability weights along with the sample trimming criteria.

The second panel of Table I.18 reports the bias in training estimates from combining the experimental control group with the PSID comparison groups. A similar pattern is seen here, with the weighted bias estimates being much closer to zero than the unweighted estimates. For instance, the double-weighted estimate that adjusts for all covariates using the PSID-1 comparison group is -$21, whereas the unweighted estimate is -$568. These results suggest that the argument for weighting is strong when using a non-experimental comparison group, where nonrandom assignment and missing outcomes are significant problems.39

39 Note that the large standard errors for the non-experimental estimates can be attributed to the small sample sizes and to the large residual variance of earnings in the PSID-1 and PSID-2 populations.

Figure H.1 plots the relative bias in the UQTE estimates at every 10th quantile of the 1979 earnings distribution. Much like the average training estimates, we see that the weighted estimates consistently lie below the unweighted estimates for most quantiles, irrespective of whether we use the PSID-1 or PSID-2 non-experimental group.
Note that I do not plot the UQTE estimates for quantiles less than 0.46, since these are all zero.40

40 There are many women in the experimental and PSID samples with zero real earnings in 1979.

This empirical application illustrates the role of the proposed estimator in both experimental and observational data contexts. The comparison involving the treatment and control groups of NSW demonstrates its use in an experiment with missing outcomes, whereas the non-experimental sample demonstrates its use in the more realistic observational data setting.

3.8 Conclusion

In empirical research, the problems of nonrandom assignment and missing outcomes threaten identification of causal parameters. This paper proposes a new class of consistent and asymptotically normal estimators that address these two issues using a double inverse probability weighted procedure. The method combines propensity score weighting with weighting for missing data in a general M-estimation framework, which can be utilized to study a range of problems, such as ordinary least squares, quasi maximum likelihood, and quantile regression. In addition, the proposed class is characterized by a robustness property, which makes it resilient to parametric misspecification of a conditional model of interest (CEF or CQF) and the two weighting functions. As leading applications of this framework, the paper discusses robust estimation of the ATE and QTEs. A Monte Carlo study indicates that the doubly weighted estimates of average and quantile effects have the lowest bias, compared to naive alternatives (unweighted or propensity-score weighted estimators), for interesting cases of misspecification.

Finally, the estimator is applied to the data on AFDC women from the NSW program compiled by Calónico and Smith (2017). The presence of experimental and non-experimental comparison groups in this application helps to quantify the estimated bias in the double-weighted training estimates. Results suggest that the argument for weighting is strong whenever nonrandom assignment and (or) missing outcomes are significant concerns. Since the severity and magnitude of the bias introduced from ignoring either problem cannot be assessed ex ante, a safe bet from the practitioner's perspective is to provide both weighted and unweighted causal effect estimates.

Practically, the doubly weighted estimator is easy to implement. Appendix F.3 provides example code that uses the Stata gmm command for implementing the double-weighted estimator of the ATE. Computation of analytically correct standard errors, however, requires additional coding and is still a work in progress. Alternatively, one can use bootstrapped standard errors, which will provide asymptotically correct inference; a sketch of that approach is given below.
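The following is a minimal sketch of such a bootstrap; the interface is hypothetical, with `estimator` standing in for any routine (such as the doubly weighted ATE) that re-estimates both probability models on each resample, so that the first-stage estimation error is reflected in the standard error.

    import numpy as np

    def bootstrap_se(estimator, data, B=500, seed=0):
        # `estimator` maps a dict of equal-length arrays to a scalar estimate;
        # both probability models should be refit inside it on each resample.
        rng = np.random.default_rng(seed)
        n = len(next(iter(data.values())))
        stats = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            stats.append(estimator({k: v[idx] for k, v in data.items()}))
        return np.std(stats, ddof=1)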
Even though missing outcomes are a common concern in empirical analysis, it is equally common to encounter missing data on the covariates. A particularly important future extension will be to allow for missing data on both. In this case, using a generalized method of moments framework which incorporates information on complete and incomplete cases could provide efficiency gains over just using the observed data.

APPENDICES

APPENDIX A
FIGURES FOR CHAPTER 1

A.1 Root mean squared error across different sample sizes

Figure A.1: Quadratic design, continuous covariates (mild heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.2: Quadratic design, continuous covariates (strong heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.3: Quadratic design, one binary covariate (mild heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.4: Quadratic design, one binary covariate (strong heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.5: Probit design, continuous covariates (mild heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.6: Probit design, continuous covariates (strong heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.7: Probit design, one binary covariate (mild heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.8: Probit design, one binary covariate (strong heterogeneity). Panels: N=100, N=500, N=1000.
Figure A.9: Binary outcome, bernoulli QLL with logistic mean. Panels: N=500, N=1000.
Figure A.10: Non-negative outcome, poisson QLL with exponential mean. Panels: N=500, N=1000.

APPENDIX B
TABLES FOR CHAPTER 1

Table B.1: QLL and mean function combinations

Restrictions on support of response       Quasi-Log Likelihood Function    Conditional Mean Function
None                                      Gaussian (Normal)                Linear
Y(w) ∈ [0, 1]                             Bernoulli                        Logistic
Y(w) ∈ [0, B]                             Binomial                         Logistic
Y(w) ≥ 0                                  Poisson                          Exponential
Yg(w) ≥ 0, Σ_{g=0}^{G} Yg(w) = 1          Multinomial                      Logistic
Table B.2: Bias and standard deviation for N=100
[Bias and standard deviation of the SDM, PRA, FRA and IRA estimators for DGP1-DGP8, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator.
b Simulation across 1000 replications.
Table B.3: Bias and standard deviation for N=500
[Bias and standard deviation of the SDM, PRA, FRA and IRA estimators for DGP1-DGP8, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator.
b Simulation across 1000 replications.
Table B.4: Bias and standard deviation for N=1000
[Bias and standard deviation of the SDM, PRA, FRA and IRA estimators for DGP1-DGP8, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator.
b Simulation across 1000 replications.

Table B.5: Bias and standard deviation for binary outcome
[Bias and standard deviation of the SDM, PRA, FRA, N-PRA and N-RA estimators, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, for N=500 and N=1000.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator, N-PRA refers to pooled non-linear regression adjustment and N-RA refers to separate nonlinear regression adjustment.
b Simulation across 1000 replications.
c True ATE is 0.037, R²0 = 0.491 and R²1 = 0.457.

Table B.6: Bias and standard deviation for non-negative outcome
[Bias and standard deviation of the SDM, PRA, FRA, N-PRA and N-RA estimators, at ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, for N=500 and N=1000.]
a Here SDM refers to simple difference in means, PRA refers to pooled regression adjustment, IRA is the infeasible regression adjustment estimator and FRA is the feasible regression adjustment estimator, N-PRA refers to pooled non-linear regression adjustment and N-RA refers to separate nonlinear regression adjustment.
b Simulation across 1000 replications.
c True ATE is 0.012, R²0 = 0.435 and R²1 = 0.233.

APPENDIX C
PROOFS FOR CHAPTER 1

Proof of Lemma 5.1

Proof. (Asymptotic variance of SDM) Consider the difference-in-means estimator. We can write the sample average for the treated as

    Ȳ1 = N1⁻¹ Σ_{i=1}^N Wi Yi = μ1 + N1⁻¹ Σ_{i=1}^N Wi [Ẋiβ1 + Ui(1)]

Therefore, because N1/N →p ρ,

    √N(Ȳ1 − μ1) = (N/N1)·N^{−1/2} Σ_{i=1}^N Wi[Ẋiβ1 + Ui(1)] = (1/ρ)·N^{−1/2} Σ_{i=1}^N Wi[Ẋiβ1 + Ui(1)] + op(1)
By the CLT, N^{−1/2} Σ_{i=1}^N Wi[Ẋiβ1 + Ui(1)] →d Normal(0, c1²), where

    c1² = E{ Wi[Ẋiβ1 + Ui(1)]² } = ρ(β1'ΩXβ1 + σ1²)

where the independence of Wi and (Xi, Ui(1)) is used. It follows that

    Avar[√N(Ȳ1 − μ1)] = (1/ρ)²·ρ·(β1'ΩXβ1 + σ1²) = (β1'ΩXβ1 + σ1²)/ρ    (C.1)

Similarly,

    Avar[√N(Ȳ0 − μ0)] = (β0'ΩXβ0 + σ0²)/(1 − ρ)    (C.2)

Combining results from (C.1) and (C.2), we have

    √N(τ̂SDM − τ) →d N(0, ω²SDM)

Since the sample averages are asymptotically uncorrelated,

    ω²SDM = β1'ΩXβ1/ρ + β0'ΩXβ0/(1 − ρ) + σ1²/ρ + σ0²/(1 − ρ). □

Proof. (Asymptotic variance of PRA) To find the asymptotic variance of τ̌, note that it can be obtained from the regression of Yi on 1, Wi, Ẋi. Note that Ẋi is orthogonal to (1, Wi) because E(Ẋi) = 0 and Wi is independent of Xi. We know that L(Yi|1, Wi) = μ0 + τWi because τ = E(Yi|Wi = 1) − E(Yi|Wi = 0). Therefore,

    L(Yi|1, Wi, Ẋi) = μ0 + τWi + Ẋiβ

By orthogonality,

    β = [E(Ẋi'Ẋi)]⁻¹ E(Ẋi'Yi)

Now

    Yi = (1 − Wi)μ0 + (1 − Wi)Ẋiβ0 + (1 − Wi)Ui(0) + Wiμ1 + WiẊiβ1 + WiUi(1)

Therefore,

    E(Ẋi'Yi) = E[(1 − Wi)Ẋi'Ẋi]β0 + E[WiẊi'Ẋi]β1 = (1 − ρ)E(Ẋi'Ẋi)β0 + ρE(Ẋi'Ẋi)β1

where we use the linear projection properties of the errors, E(Ẋi) = 0, and independence of Wi and [Xi, Ui(0), Ui(1)]. Plugging in gives

    β = (1 − ρ)β0 + ρβ1

Now we can write the projection error as

    Ui = (1 − Wi)μ0 + (1 − Wi)Ẋiβ0 + (1 − Wi)Ui(0) + Wiμ1 + WiẊiβ1 + WiUi(1) − μ0 − (μ1 − μ0)Wi − Ẋi[(1 − ρ)β0 + ρβ1]
       = −(Wi − ρ)Ẋiβ0 + (1 − Wi)Ui(0) + (Wi − ρ)Ẋiβ1 + WiUi(1).

Because (1, Wi) is orthogonal to Ẋi, it follows as in the previous section that

    √N(τ̂PRA − τ) = [E(Ẇi²)]⁻¹ N^{−1/2} Σ_{i=1}^N (Wi − ρ)Ui + op(1) = [ρ(1 − ρ)]⁻¹ N^{−1/2} Σ_{i=1}^N (Wi − ρ)Ui + op(1)

Then, using the asymptotic equivalence lemma and the CLT, we have

    √N(τ̂PRA − τ) →d N(0, ω²PRA)

where ω²PRA = Var[(Wi − ρ)Ui]/[ρ(1 − ρ)]². Now we need to find the asymptotic variance of N^{−1/2} Σ_{i=1}^N (Wi − ρ)Ui. The term (Wi − ρ)Ui has zero mean by the linear projection property. Further,

    (Wi − ρ)Ui = −(Wi − ρ)²Ẋiβ0 + (Wi − ρ)²Ẋiβ1 + (Wi − ρ)(1 − Wi)Ui(0) + (Wi − ρ)WiUi(1)

The covariance between the last two terms is zero, as (1 − Wi)Wi = 0. The last two terms can be written as

    −ρ(1 − Wi)Ui(0) + (1 − ρ)WiUi(1)

and so

    Var[−ρ(1 − Wi)Ui(0) + (1 − ρ)WiUi(1)] = ρ²(1 − ρ)σ0² + (1 − ρ)²ρσ1².

Write the first two terms as (Wi − ρ)²Ẋi(β1 − β0). The variance is

    E[(Wi − ρ)⁴]·(β1 − β0)'ΩX(β1 − β0).

Combining all of the terms gives

    Avar[√N(τ̂PRA − τ)] = [ρ(1 − ρ)]⁻² { E[(Wi − ρ)⁴]·(β1 − β0)'ΩX(β1 − β0) + ρ²(1 − ρ)σ0² + (1 − ρ)²ρσ1² }

Note that we can write E[(Wi − ρ)⁴]/[ρ(1 − ρ)]² = E[(Wi − ρ)⁴]/[Var(Wi)]², and Jensen's inequality tells us this is greater than unity: take Zi = (Wi − ρ)². We can also show E[(Wi − ρ)⁴] = (1 − ρ)⁴ρ + ρ⁴(1 − ρ), and so the scale factor is

    E[(Wi − ρ)⁴]/[ρ(1 − ρ)]² = [(1 − ρ)⁴ρ + ρ⁴(1 − ρ)]/[ρ²(1 − ρ)²] = (1 − ρ)²/ρ + ρ²/(1 − ρ).

Hence,
Proof. (Asymptotic variance of FRA.) Now consider the full regression adjustment estimator. Let $\hat{\alpha}_1$ and $\hat{\beta}_1$ be the OLS estimates from the $W_i = 1$ sample,
\[
Y_i \ \text{on} \ 1, X_i, \qquad W_i = 1,
\]
and then
\[
\hat{\mu}_{1,FRA} = \hat{\alpha}_1 + \bar{X}\hat{\beta}_1,
\]
where $\bar{X}$ is the sample average over the entire sample. (For intuition, it is useful to note that $\bar{Y}_1 = \hat{\alpha}_1 + \bar{X}_1\hat{\beta}_1$, and so $\hat{\mu}_1$ uses a more efficient estimator of $\mu_X$.) By least squares mechanics, $\hat{\mu}_1$ is the intercept in the regression
\[
Y_i \ \text{on} \ 1, X_i - \bar{X}, \qquad W_i = 1.
\]
Let $\ddot{X}_i = X_i - \bar{X}$ and $\ddot{R}_i = (1, \ddot{X}_i)$, and define
\[
\hat{\gamma}_1 = \begin{pmatrix} \hat{\mu}_1 \\ \hat{\beta}_1 \end{pmatrix} = \left(N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\ddot{R}_i\right)^{-1} N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'Y_i(1).
\]
Now write
\[
Y_i(1) = \mu_1 + \dot{X}_i\beta_1 + U_i(1) = \mu_1 + \ddot{X}_i\beta_1 + (\dot{X}_i - \ddot{X}_i)\beta_1 + U_i(1) = \ddot{R}_i\gamma_1 + (\bar{X} - \mu_X)\beta_1 + U_i(1).
\]
Plugging in and rearranging gives
\[
\sqrt{N}(\hat{\gamma}_1 - \gamma_1) = \left(N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\ddot{R}_i\right)^{-1}\left[\left(N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta_1 + N^{-1/2}\sum_{i=1}^{N} W_i\ddot{R}_i'U_i(1)\right].
\]
Next, because $\bar{X} \xrightarrow{p} \mu_X$, the law of large numbers and Slutsky’s Theorem imply
\[
N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\ddot{R}_i = N^{-1}\sum_{i=1}^{N} W_i\dot{R}_i'\dot{R}_i + o_p(1) \xrightarrow{p} E\left(W_i\dot{R}_i'\dot{R}_i\right) = \rho\,E\left(\dot{R}_i'\dot{R}_i\right) \equiv \rho A,
\]
where $\dot{R}_i = (1, \dot{X}_i) = (1, X_i - \mu_X)$ and
\[
A = \begin{pmatrix} 1 & 0 \\ 0 & E(\dot{X}_i'\dot{X}_i) \end{pmatrix}
\]
is block diagonal because $E(\dot{X}_i) = 0$. The terms $\sqrt{N}(\bar{X} - \mu_X)\beta_1$ and $N^{-1/2}\sum_i W_i\ddot{R}_i'U_i(1)$ are $O_p(1)$, and so
\[
\sqrt{N}(\hat{\gamma}_1 - \gamma_1) = (1/\rho)A^{-1}\left[\left(N^{-1}\sum_{i=1}^{N} W_i\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta_1 + N^{-1/2}\sum_{i=1}^{N} W_i\ddot{R}_i'U_i(1)\right] + o_p(1).
\]
The first element of $N^{-1}\sum_i W_i\ddot{R}_i'$ is $N^{-1}\sum_i W_i = N_1/N = \hat{\rho} \xrightarrow{p} \rho$, and the first element of $N^{-1/2}\sum_i W_i\ddot{R}_i'U_i(1)$ is $N^{-1/2}\sum_i W_iU_i(1)$. Because of the block diagonality of $A$, the first element of $\sqrt{N}(\hat{\gamma}_1 - \gamma_1)$ satisfies
\[
\sqrt{N}\left(\hat{\mu}_{1,FRA} - \mu_1\right) = (1/\rho)\rho\,\sqrt{N}(\bar{X} - \mu_X)\beta_1 + (1/\rho)N^{-1/2}\sum_{i=1}^{N} W_iU_i(1) + o_p(1)
= N^{-1/2}\sum_{i=1}^{N}\left[(X_i - \mu_X)\beta_1 + W_iU_i(1)/\rho\right] + o_p(1).
\]
A similar argument gives
\[
\sqrt{N}\left(\hat{\mu}_{0,FRA} - \mu_0\right) = N^{-1/2}\sum_{i=1}^{N}\left[(X_i - \mu_X)\beta_0 + (1-W_i)U_i(0)/(1-\rho)\right] + o_p(1),
\]
and so
\[
\sqrt{N}\left(\hat{\tau}_{FRA} - \tau\right) = N^{-1/2}\sum_{i=1}^{N}\left[\dot{X}_i(\beta_1 - \beta_0) + W_iU_i(1)/\rho - (1-W_i)U_i(0)/(1-\rho)\right] + o_p(1).
\]
Again, by the asymptotic equivalence lemma and the CLT,
\[
\sqrt{N}\left(\hat{\tau}_{FRA} - \tau\right) \xrightarrow{d} N\left(0, \omega^2_{FRA}\right), \qquad \omega^2_{FRA} = \mathrm{Var}\left[\dot{X}_i(\beta_1-\beta_0) + W_iU_i(1)/\rho - (1-W_i)U_i(0)/(1-\rho)\right].
\]
The three terms inside the variance are pairwise uncorrelated: the second and third because $W_i(1-W_i) = 0$, and the first with the other two because, for example,
\[
E\left[(\beta_1-\beta_0)'\dot{X}_i'W_iU_i(1)\right] = E(W_i)(\beta_1-\beta_0)'E\left[\dot{X}_i'U_i(1)\right] = 0,
\]
since $E[\dot{X}_i'U_i(1)] = 0$ by the linear projection properties. It follows that
\[
\mathrm{Avar}\left[\sqrt{N}\left(\hat{\tau}_{FRA} - \tau\right)\right] = (\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0) + (1/\rho^2)E(W_i)E\left[U_i^2(1)\right] + \left(1/(1-\rho)^2\right)E(1-W_i)E\left[U_i^2(0)\right]
\]
\[
= (\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0) + \sigma_1^2/\rho + \sigma_0^2/(1-\rho). \;\blacksquare
\]
Proof. (Asymptotic variance of IRA.) The derivation for $\tau^*$ follows closely that for $\hat{\tau}_{FRA}$, with the important difference that $\ddot{X}_i = X_i - \bar{X}$ is replaced with $\dot{X}_i = X_i - \mu_X$. This means that the terms $\sqrt{N}(\bar{X} - \mu_X)\beta_1$ and $\sqrt{N}(\bar{X} - \mu_X)\beta_0$ do not appear. Therefore,
\[
\mathrm{Avar}\left[\sqrt{N}\left(\hat{\tau}_{IRA} - \tau\right)\right] = \sigma_1^2/\rho + \sigma_0^2/(1-\rho). \;\blacksquare
\]

Proof of Theorem 5.2

Proof. CLAIM 1: $\omega^2_{FRA} \leq \omega^2_{SDM}$. Consider the difference
\[
\mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{SDM}-\tau)\right] - \mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{FRA}-\tau)\right] = \beta_1'\Omega_X\beta_1/\rho + \beta_0'\Omega_X\beta_0/(1-\rho) - (\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0).
\]
This difference can be written as
\[
\left(\frac{1-\rho}{\rho}\right)\beta_1'\Omega_X\beta_1 + \left(\frac{\rho}{1-\rho}\right)\beta_0'\Omega_X\beta_0 + 2\beta_0'\Omega_X\beta_1 = \delta'\Omega_X\delta, \qquad \delta = \sqrt{\frac{1-\rho}{\rho}}\,\beta_1 + \sqrt{\frac{\rho}{1-\rho}}\,\beta_0.
\]
Because $\Omega_X$ is positive definite, this proves the claim. One case where there is no efficiency gain is when $\rho = 1/2$ and $\beta_1 = -\beta_0$: for instance, with scalar $X_i$, $\Omega_X = 1$, $\rho = 1/2$, and $\beta_1 = -\beta_0 = 1$, both variances equal $4 + 2\sigma_1^2 + 2\sigma_0^2$. The second condition seems unrealistic unless both vectors are zero.

CLAIM 2: $\omega^2_{FRA} \leq \omega^2_{PRA}$. From the expressions above,
\[
\mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{PRA}-\tau)\right] - \mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{FRA}-\tau)\right] = \left[\frac{(1-\rho)^2}{\rho} + \frac{\rho^2}{1-\rho} - 1\right](\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0) \geq 0,
\]
where the bracketed factor equals $(1-2\rho)^2/[\rho(1-\rho)] \geq 0$; equivalently, the scale factor exceeds unity by Jensen’s inequality, with equality only at $\rho = 1/2$.

CLAIM 3: $\omega^2_{IRA} \leq \omega^2_{FRA}$. It is easy to see why this holds, since
\[
\mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{FRA}-\tau)\right] - \mathrm{Avar}\left[\sqrt{N}(\hat{\tau}_{IRA}-\tau)\right] = (\beta_1-\beta_0)'\Omega_X(\beta_1-\beta_0),
\]
a quadratic form in the positive semi-definite matrix $\Omega_X$, which is therefore greater than or equal to zero.

Combining the results from Claims 1, 2, and 3 gives the theorem. $\blacksquare$
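As a quick numerical companion to Theorem 5.2, the following small Monte Carlo — an illustrative sketch added here with a made-up DGP and variable names of my own choosing, not a result from the chapter — compares the sampling variability of the three feasible estimators in Stata. With $\rho = 0.8$ and different slopes in the two arms, FRA should display the smallest standard deviation across replications, PRA an intermediate one, and SDM the largest:

* Monte Carlo sketch of the variance ranking in Theorem 5.2
clear all
set seed 12345
program define ratecomp, rclass
    drop _all
    set obs 1000
    gen x = rnormal()
    gen w = runiform() < .8                    // rho = 0.8, so the PRA penalty is active
    gen y = 1 + x + w*(1 + 2*x) + rnormal()    // tau = 1 with beta1 != beta0
    regress y w                                // SDM
    return scalar sdm = _b[w]
    quietly summarize x
    gen xd = x - r(mean)                       // demeaned covariate
    regress y w xd                             // PRA
    return scalar pra = _b[w]
    regress y i.w##c.xd                        // FRA: separate slopes via full interaction
    return scalar fra = _b[1.w]
end
simulate sdm=r(sdm) pra=r(pra) fra=r(fra), reps(1000): ratecomp
summarize sdm pra fra

Under this design, the variance formulas of Lemma 5.1 give $\omega^2_{SDM} = 22.5$, $\omega^2_{PRA} = 19.25$, and $\omega^2_{FRA} = 10.25$, so with $N = 1000$ the simulated standard deviations should be close to $0.15$, $0.14$, and $0.10$, making the ranking clearly visible.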
APPENDIX D

TABLES FOR CHAPTER 2

Table D.1: Summary of yes votes at different bid amounts

Bid      Yes-votes     %
$5          219       20
$25         216       20
$65         241       22
$120        181       17
$220        228       21
Total      1085      100

Table D.2: Lower bound mean willingness to pay estimate using ABERS and FRA estimators

                ABERS (SM)        FRA
PO means
  $5           0.689 (0.0313)   0.685 (0.0288)
  $25          0.569 (0.0338)   0.597 (0.0307)
  $65          0.485 (0.0323)   0.489 (0.0294)
  $120         0.403 (0.0365)   0.378 (0.0332)
  $220         0.289 (0.0301)   0.290 (0.0286)
τ̂             85.39 (3.905)    84.67 (3.792)
Obs            1085             1085
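To see how the entries of Table D.2 fit together, the lower bound mean WTP in the $\hat\tau$ row can be recovered from the potential outcome means above it — assuming the usual ABERS lower-bound construction with $b_0 = 0$, which is my reading of the table rather than a formula stated in it:
\[
\hat{\tau} = \sum_{g=1}^{5}\left(b_g - b_{g-1}\right)\hat{\mu}_g, \qquad b_0 = 0.
\]
Using the FRA column,
\[
5(0.685) + 20(0.597) + 40(0.489) + 55(0.378) + 100(0.290) \approx 84.7,
\]
which matches the reported $\hat{\tau}_{FRA} = 84.67$ up to rounding of the displayed means; the ABERS column similarly gives approximately $85.3$ against the reported $85.39$.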
Table D.3: Bias and standard deviation of RA estimators for DGP 1 across four assignment vectors
[Bias and standard deviation of µ̂1, µ̂2, µ̂3 for SM, PRA, and FRA at N = 500, 1000, 5000, under the assignment vectors ρ = (1/3, 1/3, 1/3), (2/3, 1/6, 1/6), (1/6, 2/3, 1/6), and (1/5, 2/5, 2/5).]
a Here SM refers to subsample means, PRA refers to pooled regression adjustment, and FRA is the feasible regression adjustment estimator.
b Empirical distributions generated with 1,000 Monte Carlo repetitions.
c The true population mean vector is µ = (1.4437, 1.6662, 1.8718).

Table D.4: Bias and standard deviation of RA estimators for DGP 2 across four assignment vectors
[Same layout as Table D.3.]
a Here SM refers to subsample means, PRA refers to pooled regression adjustment, and FRA is the feasible regression adjustment estimator.
b Empirical distributions generated with 1,000 Monte Carlo repetitions.
c The true population mean vector is µ = (1.4439, 1.6665, 1.8722).
Table D.5: Bias and standard deviation of RA estimators for DGP 3 across four assignment vectors
[Same layout as Table D.3.]
a Here SM refers to subsample means, PRA refers to pooled regression adjustment, and FRA is the feasible regression adjustment estimator.
b Empirical distributions generated with 1,000 Monte Carlo repetitions.
c The true population mean vector is µ = (1.1897, 3.7310, 3.7752).

APPENDIX E

PROOFS FOR CHAPTER 2

Proof of Theorem 3

Proof. Using the expression $Y = \mu_g + \dot{X}\beta_g + U(g)$, write $\bar{Y}_g$ as
\[
\bar{Y}_g = N_g^{-1}\sum_{i=1}^{N} W_{ig}Y_i(g) = \mu_g + (N/N_g)\,N^{-1}\sum_{i=1}^{N} W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right] = \mu_g + \frac{1}{\hat{\rho}_g}\,N^{-1}\sum_{i=1}^{N} W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right].
\]
Therefore,
\[
\sqrt{N}\left(\bar{Y}_g - \mu_g\right) = \hat{\rho}_g^{-1}\,N^{-1/2}\sum_{i=1}^{N} W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right]. \tag{E.1}
\]
By random assignment,
\[
E\left\{W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right]\right\} = E(W_{ig})E\left(\dot{X}_i\right)\beta_g + E(W_{ig})E\left[U_i(g)\right] = 0,
\]
and so the CLT applies to the standardized average in (E.1).
Now use $\hat{\rho}_g = \rho_g + o_p(1)$ to obtain the first-order representation
\[
\sqrt{N}\left(\bar{Y}_g - \mu_g\right) = \rho_g^{-1}\,N^{-1/2}\sum_{i=1}^{N} W_{ig}\left[\dot{X}_i\beta_g + U_i(g)\right] + o_p(1).
\]
Our goal is to be able to make efficiency statements about both linear and nonlinear functions of the vector of means $\mu = (\mu_1, \mu_2, \ldots, \mu_G)'$, and so we stack the subsample means into the $G \times 1$ vector $\bar{Y}$. For later comparison, it is helpful to remember that $\bar{Y}$ is the vector of OLS coefficients in the regression
\[
Y_i \ \text{on} \ W_{i1}, W_{i2}, \ldots, W_{iG}, \qquad i = 1, 2, \ldots, N.
\]
We have proven the result. $\blacksquare$
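The remark that $\bar{Y}$ coincides with a vector of OLS coefficients is easy to verify directly. A one-line Stata illustration of my own, assuming an assignment variable w coded 1, …, G and an outcome y:

* The subsample means are the OLS coefficients on a full set of
* assignment dummies with no constant
regress y ibn.w, noconstant
* the coefficients match the group means from:
tabstat y, by(w) statistics(mean)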
Proof of Theorem 4

Proof. Consistent estimators of $\alpha_g$ and $\beta_g$ are obtained from the regression
\[
Y_i \ \text{on} \ 1, X_i, \qquad \text{if } W_{ig} = 1,
\]
which produces intercept and slopes $\hat{\alpha}_g$ and $\hat{\beta}_g$. Letting $\breve{X}_i = (1, X_i)$, the probability limit of $(\hat{\alpha}_g, \hat{\beta}_g')'$ is
\[
\left[E\left(W_{ig}\breve{X}_i'\breve{X}_i\right)\right]^{-1}E\left(W_{ig}\breve{X}_i'Y_i(g)\right) = \rho_g^{-1}\left[E\left(\breve{X}_i'\breve{X}_i\right)\right]^{-1}\rho_g\,E\left(\breve{X}_i'Y_i(g)\right) = \begin{pmatrix}\alpha_g \\ \beta_g\end{pmatrix},
\]
where random assignment is used so that $W_{ig}$ is independent of $[X_i, Y_i(g)]$. It follows that $(\hat{\alpha}_g, \hat{\beta}_g')'$ is consistent for $(\alpha_g, \beta_g')'$, and so a consistent estimator of $\mu_g$ is
\[
\hat{\mu}_g = \hat{\alpha}_g + \bar{X}\hat{\beta}_g.
\]
Note that this estimator, which we refer to as full (or separate) regression adjustment (FRA), is the same as an imputation procedure. Given $\hat{\alpha}_g$ and $\hat{\beta}_g$, impute a value of $Y_i(g)$ for each $i$ in the sample, whether or not $i$ is assigned to group $g$:
\[
\hat{Y}_i(g) = \hat{\alpha}_g + X_i\hat{\beta}_g, \qquad i = 1, 2, \ldots, N.
\]
Averaging these imputed values across all $i$ produces $\hat{\mu}_g$. In order to derive the asymptotic variance of $\hat{\mu}_g$, it is helpful to obtain it as the intercept from the regression
\[
Y_i \ \text{on} \ 1, X_i - \bar{X}, \qquad W_{ig} = 1.
\]
Let $\ddot{X}_i = X_i - \bar{X}$, $\ddot{R}_i = (1, \ddot{X}_i)$, and define
\[
\hat{\gamma}_g = \begin{pmatrix}\hat{\mu}_g \\ \hat{\beta}_g\end{pmatrix} = \left(N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\ddot{R}_i\right)^{-1} N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'Y_i(g).
\]
Now write
\[
Y_i(g) = \mu_g + \dot{X}_i\beta_g + U_i(g) = \mu_g + \ddot{X}_i\beta_g + (\dot{X}_i - \ddot{X}_i)\beta_g + U_i(g) = \ddot{R}_i\gamma_g + (\bar{X} - \mu_X)\beta_g + U_i(g).
\]
Plugging in for $Y_i(g)$ and rearranging gives
\[
\sqrt{N}(\hat{\gamma}_g - \gamma_g) = \left(N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\ddot{R}_i\right)^{-1}\left[\left(N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta_g + N^{-1/2}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'U_i(g)\right].
\]
Because $\bar{X} \xrightarrow{p} \mu_X$, the law of large numbers and Slutsky’s Theorem imply
\[
N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\ddot{R}_i = N^{-1}\sum_{i=1}^{N} W_{ig}\dot{R}_i'\dot{R}_i + o_p(1) \xrightarrow{p} \rho_g\,E\left(\dot{R}_i'\dot{R}_i\right) = \rho_g A, \qquad A = \begin{pmatrix}1 & 0 \\ 0 & E(\dot{X}_i'\dot{X}_i)\end{pmatrix},
\]
with $\dot{R}_i = (1, \dot{X}_i)$, where $A$ is block diagonal because $E(\dot{X}_i) = 0$ and random assignment is used. The terms $\sqrt{N}(\bar{X} - \mu_X)\beta_g$ and $N^{-1/2}\sum_i W_{ig}\ddot{R}_i'U_i(g)$ are $O_p(1)$ by the CLT, and so
\[
\sqrt{N}(\hat{\gamma}_g - \gamma_g) = (1/\rho_g)A^{-1}\left[\left(N^{-1}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta_g + N^{-1/2}\sum_{i=1}^{N} W_{ig}\ddot{R}_i'U_i(g)\right] + o_p(1).
\]
The first element of $N^{-1}\sum_i W_{ig}\ddot{R}_i'$ is $N^{-1}\sum_i W_{ig} = N_g/N = \hat{\rho}_g \xrightarrow{p} \rho_g$, and the first element of $N^{-1/2}\sum_i W_{ig}\ddot{R}_i'U_i(g)$ is $N^{-1/2}\sum_i W_{ig}U_i(g)$. Because of the block diagonality of $A$, the first element of $\sqrt{N}(\hat{\gamma}_g - \gamma_g)$ satisfies
\[
\sqrt{N}(\hat{\mu}_g - \mu_g) = \sqrt{N}(\bar{X} - \mu_X)\beta_g + (1/\rho_g)N^{-1/2}\sum_{i=1}^{N} W_{ig}U_i(g) + o_p(1)
= N^{-1/2}\sum_{i=1}^{N}\left[(X_i - \mu_X)\beta_g + W_{ig}U_i(g)/\rho_g\right] + o_p(1).
\]
The above representation holds for all $g$; stacking the RA estimates then gives Theorem 4. $\blacksquare$
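The imputation description in the proof above is straightforward to mirror in code. A small Stata sketch — an illustration with assumed variable names y, x and a group variable w taking values 1, …, G — fits each group's regression, imputes ŷi(g) for every unit, and averages:

* FRA as imputation: fit each group's regression, impute for all i, average
levelsof w, local(groups)
foreach g of local groups {
    quietly regress y x if w == `g'
    quietly predict double yhat`g', xb     // imputed outcome for every unit
    quietly summarize yhat`g', meanonly
    display "muhat(`g') = " r(mean)
}

Each displayed average equals the intercept $\hat{\alpha}_g + \bar{X}\hat{\beta}_g$ from the demeaned within-group regression used in the derivation.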
Proof of Theorem 5

Proof. In order to find a useful first-order representation of $\sqrt{N}(\check{\mu} - \mu)$, we first characterize the probability limit of $\check{\beta}$. Under random assignment,
\[
E\left(W'\dot{X}\right) = E(W)'E\left(\dot{X}\right) = 0,
\]
which means that the coefficients on $W$ in the linear projections $L(Y \mid W)$ and $L(Y \mid W, \dot{X})$ are the same and equal to $\mu$. This essentially proves that adding the demeaned covariates still consistently estimates $\mu$. Moreover, we can find the coefficients on $\dot{X}$ in $L(Y \mid W, \dot{X})$ by finding $L(Y \mid \dot{X})$. Let $\beta$ be the coefficient vector in the linear projection of $Y$ on $\dot{X}$:
\[
\beta = \left[E\left(\dot{X}'\dot{X}\right)\right]^{-1}E\left(\dot{X}'Y\right) = \Omega_X^{-1}E\left(\dot{X}'Y\right).
\]
Now use
\[
Y = \sum_{g=1}^{G} W_g\left[\mu_g + \dot{X}\beta_g + U(g)\right],
\]
so that
\[
E\left(\dot{X}'Y\right) = \sum_{g=1}^{G}\left\{E\left(\dot{X}'W_g\right)\mu_g + E\left(\dot{X}'W_g\dot{X}\right)\beta_g + E\left[\dot{X}'W_gU(g)\right]\right\} = \sum_{g=1}^{G}\left\{0 + \rho_g\Omega_X\beta_g + 0\right\} = \Omega_X\sum_{g=1}^{G}\rho_g\beta_g,
\]
where we again use random assignment, $E(\dot{X}) = 0$, and $E[\dot{X}'U(g)] = 0$. Therefore, the $\beta$ in the linear projection $L(Y \mid W, \dot{X})$ is
\[
\beta = \sum_{g=1}^{G}\rho_g\beta_g,
\]
simply a weighted average of the coefficients from the separate linear projections using the potential outcomes. Now we can write
\[
Y_i = W_i\mu + \dot{X}_i\beta + U_i,
\]
where the linear projection error is
\[
U_i = \sum_{g=1}^{G} W_{ig}\left[\mu_g + \dot{X}_i\beta_g + U_i(g)\right] - W_i\mu - \dot{X}_i\sum_{g=1}^{G}\rho_g\beta_g = \sum_{g=1}^{G}(W_{ig} - \rho_g)\dot{X}_i\beta_g + \sum_{g=1}^{G} W_{ig}U_i(g).
\]
Write $\theta = (\mu', \beta')'$ and let $\check{\theta} = (\check{\mu}', \check{\beta}')'$ be the OLS estimator from the regression of $Y_i$ on $\ddot{R}_i = (W_i, \ddot{X}_i)$; also define $\dot{R}_i = (W_i, \dot{X}_i)$. The asymptotic variance of $\sqrt{N}(\check{\mu} - \mu)$ is not the same as when $\ddot{X}_i$ is replaced with $\dot{X}_i$ (even though for $\check{\beta}$ it is). Write
\[
Y_i = W_i\mu + \ddot{X}_i\beta + (\dot{X}_i - \ddot{X}_i)\beta + U_i = \ddot{R}_i\theta + (\bar{X} - \mu_X)\beta + U_i.
\]
Then
\[
\sqrt{N}(\check{\theta} - \theta) = \left(N^{-1}\sum_{i=1}^{N}\ddot{R}_i'\ddot{R}_i\right)^{-1}\left[\left(N^{-1}\sum_{i=1}^{N}\ddot{R}_i'\right)\sqrt{N}(\bar{X} - \mu_X)\beta + N^{-1/2}\sum_{i=1}^{N}\ddot{R}_i'U_i\right],
\]
where, by random assignment and $E(\dot{X}_i) = 0$,
\[
N^{-1}\sum_{i=1}^{N}\ddot{R}_i'\ddot{R}_i \xrightarrow{p} \begin{pmatrix} E(W_i'W_i) & 0 \\ 0 & \Omega_X \end{pmatrix},
\]
and $N^{-1}\sum_i \ddot{X}_i' = 0$ by construction, so that
\[
N^{-1}\sum_{i=1}^{N}\ddot{R}_i' = \begin{pmatrix} N^{-1}\sum_i W_i' \\ 0 \end{pmatrix}, \qquad N^{-1}\sum_{i=1}^{N} W_i' \xrightarrow{p} (\rho_1, \rho_2, \ldots, \rho_G)'.
\]
The terms in brackets are $O_p(1)$. Looking at the first $G$ elements of $\sqrt{N}(\check{\theta} - \theta)$, and using $E(W_i'W_i) = \mathrm{diag}(\rho_1, \rho_2, \ldots, \rho_G)$, we obtain
\[
\sqrt{N}(\check{\mu} - \mu) = j_G\,N^{-1/2}\sum_{i=1}^{N}\dot{X}_i\beta + \left[E\left(W_i'W_i\right)\right]^{-1} N^{-1/2}\sum_{i=1}^{N} W_i'U_i + o_p(1),
\]
where $j_G = (1, 1, \ldots, 1)'$. The $g$th element of $W_i'U_i$ is
\[
W_{ig}U_i = \sum_{h=1}^{G} W_{ig}(W_{ih} - \rho_h)\dot{X}_i\beta_h + W_{ig}U_i(g),
\]
using $W_{ig}W_{ih} = 0$ for $h \neq g$. Simplifying the first piece, for example for $g = 1$,
\[
\sum_{h=1}^{G} W_{i1}(W_{ih} - \rho_h)\dot{X}_i\beta_h/\rho_1 = \rho_1^{-1}\left[W_{i1}(1-\rho_1)\dot{X}_i\beta_1 - W_{i1}\rho_2\dot{X}_i\beta_2 - \cdots - W_{i1}\rho_G\dot{X}_i\beta_G\right] = \rho_1^{-1}W_{i1}\dot{X}_i\left(\beta_1 - \beta\right),
\]
using $\beta = \rho_1\beta_1 + \rho_2\beta_2 + \cdots + \rho_G\beta_G$. Adding $\dot{X}_i\beta$ and rearranging, the $g$th element of the influence function is
\[
\dot{X}_i\beta + \rho_g^{-1}W_{ig}\dot{X}_i(\beta_g - \beta) + W_{ig}U_i(g)/\rho_g,
\]
which gives the representation in the theorem. $\blacksquare$
Proof of Theorem 6

Proof. We now show that, asymptotically, $\hat{\mu}_{FRA}$ is never worse than $\hat{\mu}_{SM}$. From the representations above, the influence function of $\hat{\mu}_{SM}$ is $L_i + Q_i$ and that of $\hat{\mu}_{FRA}$ is $K_i + Q_i$, where
\[
K_i = \begin{pmatrix}\dot{X}_i\beta_1 \\ \vdots \\ \dot{X}_i\beta_G\end{pmatrix}, \qquad L_i = \begin{pmatrix}W_{i1}\dot{X}_i\beta_1/\rho_1 \\ \vdots \\ W_{iG}\dot{X}_i\beta_G/\rho_G\end{pmatrix}, \qquad Q_i = \begin{pmatrix}W_{i1}U_i(1)/\rho_1 \\ \vdots \\ W_{iG}U_i(G)/\rho_G\end{pmatrix}.
\]
Because $E(L_iQ_i') = 0$ and $E(K_iQ_i') = 0$, it follows that
\[
\mathrm{Avar}\left[\sqrt{N}(\hat{\mu}_{SM} - \mu)\right] = \Omega_L + \Omega_Q, \qquad \mathrm{Avar}\left[\sqrt{N}(\hat{\mu}_{FRA} - \mu)\right] = \Omega_K + \Omega_Q,
\]
where $\Omega_L = E(L_iL_i')$ and so on. Therefore, to show that $\mathrm{Avar}[\sqrt{N}(\hat{\mu}_{FRA} - \mu)]$ is smaller (in the matrix sense), we must show that $\Omega_L - \Omega_K$ is PSD. The elements of $L_i$ are uncorrelated because $W_{ig}W_{ih} = 0$ for $g \neq h$, and the variance of the $g$th element is
\[
E\left[\left(W_{ig}\dot{X}_i\beta_g/\rho_g\right)^2\right] = \rho_g^{-2}E(W_{ig})\,E\left[\left(\dot{X}_i\beta_g\right)^2\right] = \rho_g^{-1}\beta_g'\Omega_X\beta_g.
\]
Therefore,
\[
\Omega_L = E(L_iL_i') = \mathrm{diag}\left(\beta_1'\Omega_X\beta_1/\rho_1, \ldots, \beta_G'\Omega_X\beta_G/\rho_G\right) = B'\left[\mathrm{diag}\left(\rho_1^{-1}, \ldots, \rho_G^{-1}\right) \otimes \Omega_X\right]B,
\]
where $B$ is the block-diagonal matrix with $\beta_1, \ldots, \beta_G$ on its diagonal blocks. For the variance matrix of $K_i$, $\mathrm{Var}(\dot{X}_i\beta_g) = \beta_g'\Omega_X\beta_g$ and $\mathrm{Cov}(\dot{X}_i\beta_g, \dot{X}_i\beta_h) = \beta_g'\Omega_X\beta_h$, and therefore
\[
\Omega_K = E(K_iK_i') = B'\left[\left(j_Gj_G'\right) \otimes \Omega_X\right]B,
\]
with $j_G' = (1, 1, \ldots, 1)$. The comparison we need to make is thus between $\mathrm{diag}(\rho_g^{-1}) \otimes \Omega_X$ and $(j_Gj_G') \otimes \Omega_X$; that is, we need to show that
\[
\left[\mathrm{diag}\left(\rho_1^{-1}, \ldots, \rho_G^{-1}\right) - j_Gj_G'\right] \otimes \Omega_X
\]
is PSD. The Kronecker product of two PSD matrices is also PSD, so it suffices to show that $\mathrm{diag}(\rho_g^{-1}) - j_Gj_G'$ is PSD when the $\rho_g$ add to unity. Let $a$ be any $G \times 1$ vector; we have to show
\[
\sum_{g=1}^{G} a_g^2/\rho_g \geq \left(\sum_{g=1}^{G} a_g\right)^2.
\]
Define vectors $b = \left(a_1/\sqrt{\rho_1},\, a_2/\sqrt{\rho_2},\, \ldots,\, a_G/\sqrt{\rho_G}\right)'$ and $c = \left(\sqrt{\rho_1},\, \sqrt{\rho_2},\, \ldots,\, \sqrt{\rho_G}\right)'$ and apply the Cauchy–Schwarz inequality:
\[
\left(\sum_{g=1}^{G} a_g\right)^2 = (b'c)^2 \leq (b'b)(c'c) = \left(\sum_{g=1}^{G} a_g^2/\rho_g\right)\left(\sum_{g=1}^{G}\rho_g\right) = \sum_{g=1}^{G} a_g^2/\rho_g,
\]
because $\sum_{g=1}^{G}\rho_g = 1$. $\blacksquare$

Proof of Theorem 7

Proof. By random assignment and the linear projection property, $E(F_iK_i') = E(F_iQ_i') = E(K_iQ_i') = 0$. Hence, $F_i$, $K_i$, and $Q_i$ are pairwise uncorrelated. $\blacksquare$

APPENDIX F

AUXILIARY RESULTS FOR CHAPTER 3

F.1 Stratified (or block) experiment with missing outcome

Consider a stratified experiment where the population is partitioned into $J$ strata, or blocks, based on the covariates $x \in \mathcal{X} \subset \mathbb{R}^J$, given by $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_J$, such that these sets are mutually exclusive and exhaustive. Then draw a sample of size $N_j$ from stratum $j$, $j = 1, 2, \ldots, J$, with $N = \sum_{j=1}^{J} N_j$. Let $w_{ijg}$ be a binary indicator for treatment level $g = 0, 1$ for unit $i$ in stratum $j$. Then, by construction, the probability of unit $i$ getting treated in stratum $j$ is a function of the covariates that have been used to define the strata.
In other words,
\[
P\left(w_{ijg}=1 \mid y_{ij}(0), y_{ij}(1), x_{ij}\right) = P\left(w_{ijg}=1 \mid x_{ij}\right) \equiv p_g(x_{ij}) \qquad \text{for } j = 1, 2, \ldots, J;\ g = 0, 1.
\]
Hence, in a stratified experiment the treatment assignment satisfies unconfoundedness by design, where this probability is constant for all units in a particular stratum $j$ but varies across the different strata. Let $s_{ij}$ be a missing data indicator for unit $i$ belonging to stratum $j$, such that
\[
s_{ij} = \begin{cases} 1 & y_{ij} \text{ is observed} \\ 0 & y_{ij} \text{ is missing.} \end{cases}
\]
Then one can characterize a stratified sample from stratum $j$ as $\{(y_{ij}, x_{ij}, w_{ijg}, s_{ij}):\ i = 1, \ldots, N_j\}$. Now suppose that the missing outcomes are ignorable, i.e.,
\[
P\left(s_{ij}=1 \mid y_{ij}(0), y_{ij}(1), x_{ij}, w_{ijg}\right) = P\left(s_{ij}=1 \mid x_{ij}, w_{ijg}\right) \equiv r(x_{ij}, w_{ijg}),
\]
which implies that missingness is sufficiently well predicted by the covariates and the treatment indicator. Given this setup, one can use the doubly weighted estimator to consistently estimate $\theta_g^0$ as follows:
\[
\hat{\theta}_1 = \operatorname*{argmin}_{\theta_1 \in \Theta_1}\ \sum_{j=1}^{J}\sum_{i=1}^{N_j} \frac{s_{ij}\cdot w_{ij1}}{r(x_{ij}, w_{ij1})\cdot p_1(x_{ij})}\; q\left(y_{ij}(1), x_{ij}, \theta_1\right)
\]
and
\[
\hat{\theta}_0 = \operatorname*{argmin}_{\theta_0 \in \Theta_0}\ \sum_{j=1}^{J}\sum_{i=1}^{N_j} \frac{s_{ij}\cdot w_{ij0}}{r(x_{ij}, w_{ij0})\cdot p_0(x_{ij})}\; q\left(y_{ij}(0), x_{ij}, \theta_0\right),
\]
where $r(x_{ij}, w_{ijg})$ and $p_g(x_{ij})$ can be replaced by consistent estimators without changing the result. Note that even though the assignment probabilities are typically known in a stratified experiment, it can be asymptotically more efficient to estimate them using binary response MLE.
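For the leading case where $q(\cdot)$ is a least squares objective, the two minimization problems above are just weighted regressions on the complete cases. A minimal Stata sketch — my own illustration, with assumed variable names (y, w, s, x1, x2) and logits for the two probability models, as suggested above — for the treated-arm problem:

* Doubly weighted least squares for the treated arm (g = 1)
logit w x1 x2                      // assignment probability p1(x)
predict double phat, pr
logit s i.w x1 x2                  // missingness probability r(x, w)
predict double rhat, pr
regress y x1 x2 [pweight = 1/(rhat*phat)] if w == 1 & s == 1
* the control-arm problem uses 1/(rhat*(1-phat)) and w == 0 instead

On the complete treated cases the weight equals $s_i w_i / [\hat{r}(x_i, w_i)\,\hat{p}_1(x_i)]$, which is exactly the weighting in the display above.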
F.2 Consistent variance estimation

In order to construct asymptotic confidence intervals and obtain valid inference with the doubly weighted estimator, it is important to find a consistent estimator of its asymptotic variance. For smooth objective functions — OLS, NLS, MLE — this task is simple: one replaces the population Hessian and Jacobian by their sample counterparts, substituting sample averages for population expectations. For non-smooth objective functions, however, obtaining a consistent variance estimator is not straightforward, since the first- or second-order derivatives of the objective function may not exist. In such situations, numerical derivatives of the objective function can be used to approximate the true derivatives. Following Newey and McFadden (1994), let $e_j$ denote the $j$th unit vector and $\varepsilon_N$ a small positive constant that depends on the sample size. For the doubly weighted estimator $\hat{\theta}_g$ that solves the treatment problem, the asymptotic variance expression is $H_g^{-1}\Omega_g H_g^{-1}$, where $H_g$ can be estimated using a second-order numerical derivative of the objective function, $\hat{H}_g$, whose $(j, k)$th element is
\[
\hat{H}_{g,jk} = \frac{Q_N(\hat{\theta}_g + e_j\varepsilon_N + e_k\varepsilon_N) - Q_N(\hat{\theta}_g - e_j\varepsilon_N + e_k\varepsilon_N) - Q_N(\hat{\theta}_g + e_j\varepsilon_N - e_k\varepsilon_N) + Q_N(\hat{\theta}_g - e_j\varepsilon_N - e_k\varepsilon_N)}{4\varepsilon_N^2}.
\]
For the middle term of the asymptotic variance expression, $\Omega_g = E(u_{ig}u_{ig}')$, we can use the approximation $\hat{\Omega}_g = N^{-1}\sum_{i=1}^{N}\hat{u}_{ig}\hat{u}_{ig}'$, where $u_{ig}$ exists with probability one. Hence, $\hat{H}_g^{-1}\hat{\Omega}_g\hat{H}_g^{-1}$ will be consistent under the conditions of the following theorem.

Theorem F.2.1. (Consistency of asymptotic variance) Suppose that $\varepsilon_N \to 0$ and $\varepsilon_N\sqrt{N} \to \infty$; then, under the conditions of Theorem 3.4.2, $\hat{H}_g \xrightarrow{p} H_g$ and $\hat{\Omega}_g \xrightarrow{p} \Omega_g$.

The proof of this theorem is given in appendix J and follows from Theorem 7.4 in Newey and McFadden (1994). Table ?? characterizes the cases in which the weighted and unweighted estimators will be consistent for the true parameter $\theta_g^0$; Table I.4 describes situations in which the unweighted estimator is more efficient than the weighted estimator.

F.3 Asymptotic variance for ATE

Given $\sqrt{N}$-consistent and asymptotically normal estimators $\hat{\theta}_1$ and $\hat{\theta}_0$, the estimated average treatment effect
\[
\hat{\tau}_{ate} = \frac{1}{N}\sum_{i=1}^{N} m_1(x_i, \hat{\theta}_1) - \frac{1}{N}\sum_{i=1}^{N} m_0(x_i, \hat{\theta}_0)
\]
is easily shown to also be $\sqrt{N}$-consistent and asymptotically normal (Wooldridge (2010), chapter 21). Regularity conditions for such a result require that the parametric model $m_g(x, \theta_g)$ be continuously differentiable on the parameter space $\Theta_g \subset \mathbb{R}^{P_g}$ and that $\theta_g^0$ be in the interior of $\Theta_g$. Then, by the continuous mapping theorem and Slutsky’s theorem,
\[
\sqrt{N}\left(\hat{\tau}_{ate} - \tau_{ate}\right) \xrightarrow{d} N(0, \mathrm{V}),
\]
where $\mathrm{V} = E[\psi(x_i)\psi(x_i)']$. Denoting $E[\nabla_{\theta_g} m_g(x_i, \theta_g^0)] \equiv G_g^0$,
\[
\psi(x_i) = \left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\} - G_1^0 H_1^{-1}u_{i1} + G_0^0 H_0^{-1}u_{i0},
\]
where $H_g$ is the Hessian for treatment group $g$ and $u_{ig}$ is the residual from the regression of the weighted score on the scores of the two probability models. For the case when the conditional mean model is correctly specified, the variance expression simplifies to
\[
\mathrm{V} = E\left[\left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\}^2\right] + G_1^0\,\mathrm{V}_1\,G_1^{0\prime} + G_0^0\,\mathrm{V}_0\,G_0^{0\prime}. \tag{F.1}
\]
Here $\mathrm{V}_1$ and $\mathrm{V}_0$ are the asymptotic variances of the doubly weighted estimators that solve the treatment and control group problems, respectively. The formula makes clear that it is better to use more efficient estimators of $\hat{\theta}_g$. But we know from the results in section 3.5 that when the conditional mean model is correctly specified, using estimated weights is as efficient as using known weights. Another alternative in this case is to use unweighted estimators of $\theta_g^0$, since under GCIME unweighted estimators can be potentially more efficient than the doubly weighted estimators of $\theta_g^0$. For the case when the mean model is misspecified, the asymptotic variance of the ATE is given as
\[
\mathrm{V} = E\left[\left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\}^2\right] + G_1^0\,\mathrm{V}_1\,G_1^{0\prime} + G_0^0\,\mathrm{V}_0\,G_0^{0\prime}
\]
\[
\quad - 2E\left[\left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\}u_{i1}'\right]H_1^{-1}G_1^{0\prime} + 2E\left[\left\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\right\}u_{i0}'\right]H_0^{-1}G_0^{0\prime}. \tag{F.2}
\]
In this case the variance expression is more complicated than in the previous one. Even though it is still better to have more efficient estimators of $\theta_g^0$, it is not obvious that this helps obtain a smaller variance for the ATE, since cross-correlation terms now appear in the variance expression. The proof of the asymptotic variances is provided in appendix …

F.4 Practical advice for obtaining double-weighted ATE estimates

An easy way to obtain the doubly weighted estimates $\hat{\theta}_g$ for estimating the ATE is to combine the treatment and control group problems into a one-step GMM procedure. Essentially, this means stacking the moment conditions from the first and second steps, which can then be solved jointly via GMM. Since there are no over-identifying restrictions in the double weighted framework, one-step estimation of $\theta_g^0$ is equivalent to two-step estimation (Negi (2019)). For ease of notation, let $w_{i1} = w_i$ and $w_{i0} = (1 - w_i)$.
Then consider the following set of moment conditions:
\[
\bar{m}(\theta_0, \theta_1, \gamma, \delta) = \frac{1}{N}\sum_{i=1}^{N} m_i(\theta_0, \theta_1, \gamma, \delta) = N^{-1}\begin{pmatrix}
(N/N_0)\sum_{i=1}^{N} m_{i0}(\theta_0, \gamma, \delta) \\
(N/N_1)\sum_{i=1}^{N} m_{i1}(\theta_1, \gamma, \delta) \\
\sum_{i=1}^{N} m_{i2}(\gamma) \\
\sum_{i=1}^{N} m_{i3}(\delta)
\end{pmatrix},
\]
where
\[
m_{i0}(\theta_0, \gamma, \delta) = \frac{s_i(1 - w_i)}{R(x_i, w_i, \delta)\left[1 - G(x_i, \gamma)\right]}\;\nabla_{\theta_0}\, q\left(y_i(0), x_i, \theta_0\right)',
\]
\[
m_{i1}(\theta_1, \gamma, \delta) = \frac{s_i w_i}{R(x_i, w_i, \delta)\,G(x_i, \gamma)}\;\nabla_{\theta_1}\, q\left(y_i(1), x_i, \theta_1\right)',
\]
\[
m_{i2}(\gamma) = \nabla_\gamma G(x_i, \gamma)'\;\frac{w_i - G(x_i, \gamma)}{G(x_i, \gamma)\left[1 - G(x_i, \gamma)\right]},
\]
\[
m_{i3}(\delta) = \nabla_\delta R(x_i, w_i, \delta)'\;\frac{s_i - R(x_i, w_i, \delta)}{R(x_i, w_i, \delta)\left[1 - R(x_i, w_i, \delta)\right]}.
\]
The example code below uses Stata’s gmm command to estimate two weighted linear regressions for estimating the ATE.

Example code using Stata’s gmm

local Rhat = "exp(b31+b32*w+b33*x1+b34*x2)/(1+exp(b31+b32*w+b33*x1+b34*x2))"
local Ghat = "exp(b21+b22*x1+b23*x2)/(1+exp(b21+b22*x1+b23*x2))"
gmm ((-2*s*(1-w)/(`Rhat'*(1-`Ghat')))*(y-b00-b01*x1-b02*x2)*(n/nc)) ///
    ((-2*s*w/(`Rhat'*`Ghat'))*(y-b10-b11*x1-b12*x2)*(n/nt)) ///
    (w-exp(b21+b22*x1+b23*x2)/(1+exp(b21+b22*x1+b23*x2))) ///
    (s-exp(b31+b32*w+b33*x1+b34*x2)/(1+exp(b31+b32*w+b33*x1+b34*x2))), ///
    instruments(1 2 3: x1 x2) instruments(4: w x1 x2) winitial(identity) ///
    nocommonesample onestep from(b00 0.1 b01 0.1 b02 0.1 b10 0.1 b11 0.1 b12 ///
    0.1 b21 0.1 b22 0.1 b23 0.1 b31 0.1 b32 0.1 b33 0.1 b34 0.1)

Then, using the GMM estimates, one can estimate the average treatment effect as

gen y0hat = _b[b00:_cons] + _b[b01:_cons]*x1 + _b[b02:_cons]*x2
gen y1hat = _b[b10:_cons] + _b[b11:_cons]*x1 + _b[b12:_cons]*x2
egen ate = mean(y1hat - y0hat)

Since I estimate the two probability models as logits (as is the convention in applied work), the third and fourth moments simplify to
\[
m_{i2}(\gamma) = x_i'\left(w_i - \Lambda(x_i\gamma)\right), \qquad m_{i3}(\delta) = z_i'\left(s_i - \Lambda(z_i\delta)\right).
\]
Even though this one-step estimation allows us to obtain variance estimates $\hat{\mathrm{V}}_1$ and $\hat{\mathrm{V}}_0$ for $\hat{\theta}_1$ and $\hat{\theta}_0$ respectively, obtaining analytically correct standard errors for the estimated ATE requires additional work. A command that implements the correct standard errors is still in the works. Meanwhile, one can use bootstrapped standard errors, which provide asymptotically correct inference.
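For instance, the whole two-step procedure can be wrapped in a program and passed to Stata's bootstrap prefix. The sketch below is my own illustration of that suggestion — dwate is a hypothetical wrapper, variable names follow the gmm example above, and for brevity it computes the doubly weighted estimates by weighted least squares rather than by gmm:

capture program drop dwate
program define dwate, rclass
    quietly {
        logit w x1 x2
        predict double Gb, pr
        logit s i.w x1 x2
        predict double Rb, pr
        regress y x1 x2 [pweight = 1/(Rb*Gb)] if w == 1 & s == 1
        predict double y1b, xb
        regress y x1 x2 [pweight = 1/(Rb*(1-Gb))] if w == 0 & s == 1
        predict double y0b, xb
        tempvar te
        gen double `te' = y1b - y0b
        summarize `te', meanonly
        return scalar ate = r(mean)
        drop Gb Rb y1b y0b
    }
end
bootstrap ate = r(ate), reps(999) seed(1234): dwate

Because the probability models are re-estimated inside dwate, each bootstrap replication correctly reflects the first-stage estimation error in the weights.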
Using the weak law of large numbers, one can also establish that √ N ( ˆAQEτ − AQEτ ) d→ N (0, V) where V =E E (cid:34)(cid:26)(cid:16) (cid:104)∇θ1 (cid:17) − AQEτ (cid:27)2(cid:35) (cid:104)∇θ1 (cid:104)∇θ0 (cid:105) · V0 · E + E quant1,τ (xi, θ0 1) (cid:105) · V1· (cid:105)(cid:48) quant1,τ (xi, θ0 1) − quant0,τ (xi, θ0 (cid:105)(cid:48) 0) (cid:104)∇θ1 + E quant1,τ (xi, θ0 1) quant0,τ (xi, θ0 0) quant0,τ (xi, θ0 0) and V1 and V0 are the asymptotic variances of the doubly weighted estimator that solves the QR or QMLE problem for the treatment and control groups respectively. The derivation of the two asymptotic variances is provided appendix H. For average quantile effect, the derivation proceeds in a similar manner to the case of ATE. Since the above results hinge on 161 correct quantile specification, one may use the usual robust asymptotic variance form for V1 and V0. However, one might be able to obtain a smaller finite sample variance from using estimated weights even though weighting would not have any bite in establishing consistency here. As discussed in the examples section, when the conditional quantile model is misspecified, g can still be interpreted as a weighted linear approximation parameter to the true τ-CQF θ0 of y(g). Since linear projections can be used as linear operators, the difference in the two linear projections of the two potential outcomes will give us a linear projection to the true CQTE. Formally, LP [CQTEτ ] = LP[quant1,τ (xi, θ1)] − LP[quant0,τ (xi, θ0)] Therefore, one can use θ0 g in the case of a misspecified CQF to define a linear projection to the true CQTE. 162 APPENDIX G APPLICATION APPENDIX FOR CHAPTER 3 G.1 National Supported Work Program The NSW was a transitional and subsidized work experience program that was mainly intended to target four sub-populations; ex-offenders, former drug addicts, women on AFDC welfare and high school dropouts.1 The program became operational in 1975 and continued until 1979 at fifteen locations in the United States. In ten of these sites, the program operated as a randomized experiment where individuals who qualified for the training program were randomly assigned to either the treatment or control group.2 At the time of enrollment in April 1975, individuals were given a retrospective baseline survey which was then followed by four follow-up interviews conducted at nine month intervals each. The survey data was collected using these baseline and follow-up interviews over a period of four years. The data included measurement on baseline covariates like age, years of education, number of children in 1975, high school dropout status, marital status, two race indicators for black and Hispanic sub-populations and other demographic and socio-economic information. The main outcome of interest was real earnings for the post-training year of 1979. G.2 Augmenting the Calónico and Smith (2017) sample to account for missing earnings in 1979 I obtain the data from Calónico and Smith (2017)’s supplementary data files in the Journal of Labor Economics where the authors recreate the experimental sample on AFDC 1The AFDC program is administered and funded by the federal and state governments and is meant to provide financial assistance to needy families. Source: US Census Bureau. Beyond the main eligibility criteria that was applied to all four target populations, the AFDC group was subjected to two additional criteria which were, a) no child below 6 years of age and b) on AFDC welfare for at least 30 of the last 36 months. 
2Out of the 10 sites, 7 served AFDC women with random assignment at one or more of these sites in operation from Feb 1976-Aug 1977 (Calónico and Smith (2017)). 163 women using the raw public use data files maintained by the Inter-University Consortium for Political and Social Research (ICPSR). Then, I use the PSIDcross file provided by CS along with other supplementary data files to add back the individuals whom CS originally dropped from the analysis for not having valid earnings information between 1975-1979. For this, I apply the same filters applied by CS who use them to match their PSID samples to the ones used by LaLonde (1986). These filters involve keeping all female household heads continuously from 1975-1979 who were between 20 and 55 years of age in 1975 and were not retired in 1975.3 This constitutes the first non-experimental sample that CS use in their analysis, which they call the PSID-1 sample. The second PSID sample, which they label PSID-2 further restricts the PSID-1 sample to include only those women who received AFDC welfare in 1975.4 In order to compare my sample with the original sample used by CS, I first apply all the above mentioned filters and create a dummy variable which I call “cs”. Next, I remove the filter which requires the women to be continuous household heads and instead only impose that filter for 1975 and 1976. The reason this filter is imposed for both years 1975 and 1976 but not for any other years is because in the PSID datasets, the income information in a particular year corresponds to the previous calendar year. This ensures that merging the cross-file with the separate single-year files for 1975 and 1976 guarantee that only those women are included who do not have any missing earnings information for the pre-training year of 1974 and 1975. This is important since pre-training earnings are treated as any other baseline covariate in this paper, on which I do not allow any missing information. After merging cross year individual file with the single year family files, I then merge this PSID dataset with the NSW dataset using CS’s .do files and generate the various sample 3For the additional filters that CS impose, see the Calónico and Smith (2017) supplemen- tary material provided in JLE. 4Even though the two PSID comparison groups are not perfectly representative of women who would have proven eligible for NSW, there is no clear alternative since the PSID data lacks detailed covariate information that would be needed to impose the full eligibility criteria on the PSID sample. 164 dummies essentially in the same manner as they do. After this, I further restrict the sample to include only those women who have valid earnings information in 1975, which is the pre- training year for AFDC women. I also drop the cases where the measured age or education is less than zero. In order to make sure that any observations not used by CS only correspond to the ones that have missing post-program earnings, I also drop observations that do not satisfy the CS criteria but have observed earnings in 1979. G.3 Treatment and missing outcome probability specifications and sample trimming In this application, I estimate three sets of treatment assignment and missing outcomes probability models depending upon which comparison group is used for obtaining the esti- mates. For the experimental estimates, I use the experimental treatment and control groups to estimate the propensity score model. 
For the PSID-1 estimates, I consider the NSW ex- perimental observations to be the treatment group and use PSID-1 as the control group. For estimating the PSID-2 propensity score model, I switch to PSID-2 as being the comparison control group. For estimating the missing outcome probability models, I include the treat- ment indicator depending upon the comparison group as mentioned above. The probability models are estimated as logits and include the following covariates in their specification. For the treatment probability, I include the real earnings in 1974 and 1975 along with an indicator variable for whether the individual had any zero earnings in 1974 and 1975. Be- yond these, I also include Age, Age-squared, Education, High school dropout status, the race indicators of black and Hispanic along as well as the number of children in 1975. CS also add some interaction terms in their propensity score specification which I do not. I noticed that allowing for those terms in my specifications drove the final weights for many women in the sample too close to a 0 or 1. For the missing outcomes probability, I include the treatment indicator along with the same covariates. I kept the specifications to be the same for the three sets of probabilities I estimated. However, my regression specifications 165 include the same covariates as CS to allow for some comparison across the analyses. These comparisons should be made with some caution. Except the estimates that use the NSW control group, all other estimates are obtained using samples that are different than the CS samples. The final sample used to obtain estimates for the PSID-1 comparison group is trimmed in order to ensure common support for the weights in the treatment and comparison groups. For the PSID-1 group, this meant dropping observations with final weight either less than 0.03 or greater than 0.8. For the PSID-2 sample, this meant dropping observations with final weight that was either less than 0.1 or greater than 0.86. These final weights are the weights that are specified in the regression commands in STATA and are constructed as follows: weight = (w/Ghat+(1-w)/(1-Ghat))*(s/Rhat) The trimming threshold for PS-weighted estimates is kept the same as for computing the double weighted estimates since the overlap problem was relatively more severe when using the composite weights than when using propensity scores only. The graphs below plot the kernel density for the probabilities Rhat*Ghat for the treatment group and Rhat*(1-Ghat) for the control group. The common support problem due to which the samples were appro- priately trimmed can be seen in the graphs below. Additionally, figures G.2 and G.3 plot the estimated distributions for the propensity score and missing outcomes probability, where panel (a)-(c) display these for the three treatment and comparison group combinations. A couple of points emerge from the estimated graphs. For figure G.2, panel (a), we see that the treatment and control distributions appear very similar, confirming the strong role of randomization in producing groups that are balanced in terms of covariates. For panel (b), we see that the experimental observations have a relatively high probability of being treated whereas the control group have low probabilities. Note, however, that the common support condition holds quite strongly for the PSID-1 group. 
In panel (c), while the estimated distribution for the treated units still has a higher mean, the PSID-2 comparison group distribution is relatively similar than PSID-1 in panel (b). 166 Figure G.1: Kernel density plots for the composite probability a) Experimental treatment and control b) Experimental treatment and PSID-1 c) Experimental treatment and PSID-2 Notes: The weights here correspond to the product of the estimated assignment and missing outcomes prob- abilities. Following Calónico and Smith (2017), I exploit the efficiency gain from combining the experimental treatment and control groups for estimating the treatment and missing outcome probability models. For the PSID-1 group, this means using the full experimental group to be the treatment group and the PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with the PSID-2 as the control group. These findings suggest that nonrandom assignment is predicted well by the covariates in the propensity score distributions. The same cannot be said for the estimated missing outcomes probabilities where panel (b) and (c) reveal a strong overlap problem. Moreover, we see that the treated units are less likely to be missing outcomes compared to the comparison groups. 167 Figure G.2: Kernel density plots for the estimated propensity score a) Experimental treatment and control b) Experimental treatment and PSID-1 c) Experimental treatment and PSID-2 Notes: Following Calónico and Smith (2017), I exploit the efficiency gains from combining the experimental treatment and control groups for estimating the propensity scores. For the PSID-1 group, this means using the full experimental group to be the treatment group and the PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with the PSID-2 as the control group. 168 Figure G.3: Kernel density plots for the estimated missing outcomes probability a) Experimental treatment and control b) Experimental treatment and PSID-1 c) Experimental treatment and PSID-2 Notes: Following Calónico and Smith (2017), I exploit the efficiency gains from combining the experimental treatment and control groups for estimating the missing outcome probability. For the PSID-1 group, this means using the full experimental group to be the treatment group and the PSID-1 as the control group. Similarly, to construct weights for the PSID-2 group, this means using the full experimental group along with the PSID-2 as the control group. 169 APPENDIX H FIGURES FOR CHAPTER 3 Figure H.1: Relative estimated bias in UQTE estimates at different quantiles of the 1979 earnings distribution a) PSID-1 control group b) PSID-2 control group Notes: This graph plots the bias in the unweighted, PS-weighted and doubly-weighted UQTE estimates relative to the true experimental estimates across different quantiles of the 1979 earnings distribution. Panel (a) plots the relative bias estimates using the PSID-1 comparison group and Panel (b) plots the same using the PSID-2 comparison group. The treatment and missing outcome propensity score models have been estimated as flexible logits and the samples used for constructing these estimates have been trimmed to ensure common support across the two groups. The treatment propensity score has been estimated using the full experimental sample along with either PSID-1 or PSID-2 comparison group. 
The UQTE estimates for τ < 0.46 are omitted from the graph since these are zero. 170 Figure H.2: Empirical distribution of estimated ATE for N=5000 Case 1: When conditional mean model is correct a) Both probability models are correct b) Misspecified propensity score model c) Misspecified missingness model d) Both probability models are misspecified Notes: The empirical distribution is obtained from 1000 simulation draws of sample size 5000. However, the effective sample sizes are much smaller. Since the average propensity of treatment is a 0.41 and average propensity of being observed as 0.38, the average treated sample is N1 = 5000 × 0.41 × 0.38 = 779 and average control sample is N0 = 5000× (1− 0.41)× 0.38 = 1, 121. The true ATE = 0.096. The graphs display the empirical distribution of the estimated ATE with correct mean specification under three different cases of misspecification of the probability models. For the fourth case, see the main text. The graphs communicate the theoretical findings of this paper which state that under correct specification of the conditional model (conditional mean for these simulations), unweighted and weighted estimators will all be consistent for the true average treatment effect. Hence, correct specification of the probabilities does not have any added bite here in terms of achieving consistency. 171 Figure H.2 (cont’d) Case 2: When the conditional mean model is misspecified a) Both probability models are correct b) Misspecified propensity score model c) Misspecified missingness model d) Both probability models are misspecified Notes: The empirical distribution is obtained from 1000 simulation draws of sample size 5000. However, the effective sample sizes are much smaller. Since the average propensity of treatment is a 0.41 and average propensity of being observed as 0.38, the average treated sample is N1 = 5000 × 0.41 × 0.38 = 779 and average control sample is N0 = 5000× (1− 0.41)× 0.38 = 1, 121. The true ATE = 0.096. The graphs display the empirical distribution of the estimated ATE with misspecified mean model under two different cases of misspecification of the probability models. For the other two, see the main text. In each of these graphs we can the doubly weighted estimator is consistent for the true ATE whereas the unweighted and PS-weighted are away from the truth. 172 Figure H.3: Estimated CQTE with true CQTE as a function of x1, N=5000 D) Both probability models are misspecified a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: This figure plots the estimated CQTE along with the true CQTE as a function of x1. The figure corresponds to the scenario when the conditional quantile functions for the treated and control groups are correctly specified but the two probability models are misspecified. Along with these two graphs, the figure also plots the function across the 1000 simulations (reps). The other three cases for when the propensity score or the missing data probability is allowed to be misspecified are not considered since under correct CQF specification, all these graphs look identical. For, N = 5000, the average treated sample is N1 = 5000 × 0.41 × 0.38 = 779 and average control sample is N0 = 5000 × (1 − 0.41) × 0.38 = 1, 121. 173 Figure H.4: Bias in estimated linear projection relative to true linear projection as a function of x1 using Angrist et al. (2006b) methodology, N=5000 A) Both probability models are correct a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: Angrist et al. 
(2006b) show that under the case of misspecification of the true CQF, the check function can still estimate a weighted linear projection to the true CQF. Since this particular case corresponds to misspecification of the CQF (where I estimate it to be linear), the solutions to problem 3.36 will consistently estimate the LP’s to the two CQFs under the problems of non-random assignment and missing outcomes. Therefore, one can characterize an LP to the true CQTE using these two objects. This figure plots the bias in the doubly-weighted, PS-weighted and unweighted linear projection of the true CQTE relative to the true population LP of CQTE. For, N = 5000, the average treated sample is N1 = 5000 × 0.41 × 0.38 = 779 and average control sample is N0 = 5000 × (1 − 0.41) × 0.38 = 1, 121. For a description of how these functions were estimated, see the simulation appendix. 174 Figure H.4 (cont’d) D) Both probability models are misspecified a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: Angrist et al. (2006b) show that under the case of misspecification of the true CQF, the check function can still estimate a weighted linear projection to the true CQF. Since this particular case corresponds to misspecification of all three components of the doubly weighted framework, the solutions to problem 3.36 will not consistently estimate the LP’s to the two CQFs under the problems of non-random assignment and missing outcomes. This figure plots the bias in the doubly-weighted, PS-weighted and unweighted linear projection of the true CQTE relative to the true population LP of CQTE. For, N = 5000, the average treated sample is N1 = 5000× 0.41× 0.38 = 779 and average control sample is N0 = 5000× (1− 0.41)× 0.38 = 1, 121. For a description of how these functions were estimated, see the simulation appendix. The unweighted estimator does not weight the observed data by anything. The PS-weighted estimator weights to correct only for non-random assignment and the doubly weighted estimator weights by both the treatment and missing outcomes propensity score models to deal with non-random assignment and missing outcome problems. 175 Figure H.5: Empirical distribution of estimated UQTE for N=5000 A) Both probability models are correct a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: The empirical distribution is obtained from 1000 simulation draws for N = 5000. Since the average propensity of treatment is 0.41 and average propensity of being observed is 0.38, this implies the average treated sample is N1 = 5000×0.41×0.38 = 779 and average control sample is N0 = 5000×(1−0.41)×0.38 = 1, 121. The unweighted estimator does not weight the observed data by anything. The PS-weighted estimator weights to correct only for non-random assignment and the doubly weighted estimator weights by both the treatment and missing outcomes propensity score models to deal with non-random assignment and missing outcome problems. 176 Figure H.5 (cont’d) B) Correct missing outcomes probability but misspecified propensity score model a) τ = 0.25 b) τ = 0.50 c) τ = 0.75 Notes: The empirical distribution is obtained from 1000 simulation draws for N = 5000. Since the average propensity of treatment is 0.41 and average propensity of being observed is 0.38, this implies the average treated sample is N1 = 5000×0.41×0.38 = 779 and average control sample is N0 = 5000×(1−0.41)×0.38 = 1, 121. The unweighted estimator does not weight the observed data by anything. 
Figure H.5 (cont'd)

B) Correct missing outcomes probability but misspecified propensity score model
a) τ = 0.25   b) τ = 0.50   c) τ = 0.75
Notes: See panel A.

C) Misspecified missing outcomes probability but correct propensity score model
a) τ = 0.25   b) τ = 0.50   c) τ = 0.75
Notes: See panel A.

D) Both probability models are misspecified
a) τ = 0.25   b) τ = 0.50   c) τ = 0.75
Notes: See panel A.

APPENDIX I

TABLES FOR CHAPTER 3

Table I.1: An illustration of the observed sample (✓ means observed, ? means missing)

 i     x     w     s     y
 1     ✓     1     0     ?
 2     ✓     1     1     y(1)
 3     ✓     0     1     y(0)
 4     ✓     0     0     ?
 ...   ...   ...   ...   ...
 N     ✓     1     1     y(1)
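The observed-data pattern in Table I.1 can be mimicked with a short simulation sketch. This is a hypothetical data-generating process for illustration only — the index coefficients below are placeholders, not the chapter's actual simulation design (only the true ATE of 0.096 is taken from the text).

```python
import numpy as np

# Hypothetical sketch of the structure in Table I.1: x is always observed,
# w is the (non-random) assignment, s flags whether the outcome is recorded,
# and y equals y(w) when s = 1 and is missing ('?') otherwise.
rng = np.random.default_rng(0)
N = 5000
x = rng.normal(size=N)
w = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))                     # assignment depends on x
s = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.4 * x + 0.3 * w))))   # missingness depends on (x, w)
y0 = x + rng.normal(size=N)
y1 = x + 0.096 + rng.normal(size=N)                                 # true ATE = 0.096
y = np.where(s == 1, np.where(w == 1, y1, y0), np.nan)              # '?' becomes NaN
```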
Table I.2: Different scenarios under ignorability and unconfoundedness

Columns: Situation; D(y(g)|x); P(s = 1|y(g), x, wg) = P(s = 1|x, wg)?; P(s = 1|x, wg) correct?; P(wg = 1|y(g), x) = P(wg = 1|x)?; P(wg = 1|x) correct?; Unweighted for D(y(g)|x)?; Weighted for D(y(g)|x)?; Weighted for 2.1?

Situations 1–9; D(y(g)|x) is correctly specified in situations 1–3 and misspecified in situations 5–9. Remaining cell entries (Yes/No/Either): No Yes Yes No Yes Yes Yes Yes; Either Either Either Either Either Yes Yes No; Either Yes No Yes Yes No Either Either; Either Either Either Either Either Either No Yes; Either Either No Yes No No No No; No No Yes No No No No No; No No Yes No No No Yes No; No No Yes No No.

a. Notice that if the missingness mechanism is not ignorable, or for that matter the assignment mechanism is not unconfounded, then nothing can be consistently estimated whether or not the other components of the framework are correctly specified. This can be seen in cases (1) and (3) in the table above. Situations (2) and (7) together form what is called robust estimation, as described in the sections above. Remember that under unconfoundedness and ignorability, D(y(g)|x) is the same as D(y(g)|x, wg, s).

Table I.3: Different scenarios under exogeneity of missingness and unconfoundedness

Columns: Situation; D(y(g)|x); P(s = 1|y(g), x, wg) = P(s = 1|x)?; P(s = 1|x) correct?; P(wg = 1|y(g), x) = P(wg = 1|x)?; P(wg = 1|x) correct?; Unweighted for D(y(g)|x)?; Weighted for D(y(g)|x)?; Weighted for 2.1?

Situations 1–9; D(y(g)|x) is correctly specified in situations 1–3 and misspecified in situations 4–9. Remaining cell entries (Yes/No/Either): Yes No Yes Yes Yes Yes Yes No Yes; Either Either Either Yes No No Yes Either Yes; Yes Either No Yes Yes Yes Yes Either No; Either Either Either Yes No Yes No Either Either; Yes No No No No No No No No; Yes No No No No No No No No; Yes No No Yes No No No No No.

a. Situations (1) and (4) combine to give the double robustness result, which says that either the conditional feature of interest needs to be correctly specified, or the treatment and missing probabilities both need to be correctly specified. Again, just as in the previous case, if the missingness mechanism is not exogenous or if the assignment mechanism is not unconfounded, then even correctly specifying either of these features will not consistently estimate the parameter of interest. This is illustrated in cases (2) and (3). If one looks at case (2) in the table above, under both these situations the unweighted estimator works to deliver a consistent estimator of $\theta_g^0$. In such a scenario, where both the unweighted and weighted estimators are consistent, how can we choose between them? The following table enumerates situations where not weighting is better than weighting.

Table I.4: When is unweighted more efficient than weighted, assuming ignorability and unconfoundedness and D(y(g)|x) correctly specified?

Situation | P(s = 1|x, wg) correct? | P(s = 1|y(g), x, wg) = P(s = 1|x)? | P(s = 1|x) correct? | P(wg = 1|x) correct? | GCIME holds? | Unweighted more efficient? | Weighted (estimated probabilities) more efficient?
1 | Either | No  | Doesn't apply | Either | Yes | Yes       | No
2 | Either | Yes | Either        | Either | Yes | Yes       | No
3 | Either | Either | Either     | Either | No  | Can't say | Can't say

I.0.1 Bias and root-mean squared error for ATE simulations

Table I.5: When the conditional mean model is correctly specified

A) Both probability models are correct

                          N = 1000                 N = 5000
Estimator                 BIAS       RMSE          BIAS       RMSE
Unweighted                -0.00082   0.02372       -0.00039   0.01074
PS-weighted               -0.00067   0.02370       -0.00037   0.01074
D-weighted (estimated)    -0.00066   0.02376       -0.00034   0.01075
D-weighted (known)        -0.00065   0.02374       -0.00034   0.01075

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the propensity score and the missingness probabilities to deal with the assignment and missing data problems. The two rows for the doubly weighted estimator report the bias and RMSE of the versions that use estimated and known probability weights, respectively. The efficiency results in section 3.5 dictate no asymptotic efficiency gains when the conditional model is correctly specified; in finite samples, however, one could obtain smaller or larger variance estimates. For N = 1000, the average treated sample is N1 = 1000 × 0.41 × 0.38 = 156 and the average control sample is N0 = 1000 × (1 − 0.41) × 0.38 = 224. For N = 5000, N1 = 5000 × 0.41 × 0.38 = 779 and N0 = 5000 × (1 − 0.41) × 0.38 = 1,121. The bias and RMSE are computed across 1000 simulation draws.
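The BIAS and RMSE entries in Tables I.5–I.10 follow the standard Monte Carlo definitions. A minimal sketch of the computation (illustrative code, not the chapter's; the true ATE of 0.096 is from the simulation design):

```python
import numpy as np

# Given `estimates`, a vector holding one ATE estimate per simulation draw
# (1000 draws here), bias and RMSE are measured against the true ATE.
def bias_rmse(estimates, truth=0.096):
    estimates = np.asarray(estimates)
    bias = estimates.mean() - truth
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    return bias, rmse
```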
Table I.5 (cont'd)

B) Correct missingness model but misspecified propensity score model

               N = 1000                N = 5000
Estimator      BIAS       RMSE         BIAS       RMSE
Unweighted     -0.00082   0.02372      -0.00039   0.01074
PS-weighted    -0.00069   0.02369      -0.00035   0.01074
D-weighted     -0.00081   0.02376      -0.00040   0.01076

Notes: See panel A; bias and RMSE are computed across 1000 simulation draws.

C) Misspecified missingness model but correct propensity score model

               N = 1000                N = 5000
Estimator      BIAS       RMSE         BIAS       RMSE
Unweighted     -0.00082   0.02372      -0.00039   0.01074
PS-weighted    -0.00067   0.02370      -0.00037   0.01074
D-weighted     -0.00067   0.02372      -0.00035   0.01075

Notes: See panel A; bias and RMSE are computed across 1000 simulation draws.

D) Both probability models are misspecified

               N = 1000                N = 5000
Estimator      BIAS       RMSE         BIAS       RMSE
Unweighted     -0.00082   0.02372      -0.00039   0.01074
PS-weighted    -0.00069   0.02369      -0.00035   0.01074
D-weighted     -0.00080   0.02373      -0.00039   0.01075

Notes: See panel A; bias and RMSE are computed across 1000 Monte Carlo repetitions.

Table I.6: Misspecified conditional mean model

A) Both probability models are correct

                          N = 1000                N = 5000
Estimator                 BIAS       RMSE         BIAS       RMSE
Unweighted                0.01087    0.03250      0.01052    0.01744
PS-weighted               0.00008    0.03038      -0.00058   0.01396
D-weighted (estimated)    0.00003    0.02979      -0.00064   0.01376
D-weighted (known)        -0.00002   0.02986      -0.00064   0.01375

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the treatment and missing outcomes probabilities to deal with the assignment and missing outcomes problems. The two rows for the doubly weighted estimator report the bias and RMSE of the versions that use estimated and known probability weights, respectively. For N = 1000, the average treated sample is N1 = 1000 × 0.41 × 0.38 = 156 and the average control sample is N0 = 1000 × (1 − 0.41) × 0.38 = 224. For N = 5000, N1 = 779 and N0 = 1,121. The bias and RMSE are computed across 1000 Monte Carlo repetitions.
Table I.6 (cont'd)

B) Correct missingness model but misspecified propensity score model

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS      RMSE
Unweighted     0.01087   0.03250      0.01052   0.01744
PS-weighted    0.00699   0.03117      0.00651   0.01532
D-weighted     0.00102   0.02984      0.00049   0.01378

Notes: See panel A; bias and RMSE are computed across 1000 simulation draws.

C) Misspecified missingness model but correct propensity score model

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS       RMSE
Unweighted     0.01087   0.03250      0.01052    0.01744
PS-weighted    0.00008   0.03038      -0.00058   0.01396
D-weighted     0.00001   0.02970      -0.00063   0.01371

Notes: See panel A; bias and RMSE are computed across 1000 simulation draws.

D) Both probability models are misspecified

               N = 1000                N = 5000
Estimator      BIAS       RMSE         BIAS       RMSE
Unweighted     0.01087    0.03250      0.01052    0.01744
PS-weighted    0.00699    0.03117      0.00651    0.01532
D-weighted     -0.00093   0.02992      -0.00150   0.01380

Notes: See panel A; bias and RMSE are computed across 1000 Monte Carlo repetitions.

I.0.2 Bias and root-mean squared error for UQTE simulations

Table I.7: A) Both probability models are correct

For τ = 0.25 (25th quantile)

                          N = 1000               N = 5000
Estimator                 BIAS      RMSE         BIAS      RMSE
Unweighted                -0.0424   0.0690       -0.0446   0.0512
PS-weighted               -0.0014   0.0554       -0.0022   0.0254
D-weighted (estimated)    0.0038    0.0549       0.0012    0.0255
D-weighted (known)        0.0046    0.0532       0.0012    0.0247

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the propensity score model and the missingness model to correct for non-random assignment and missing outcomes. For N = 1000, the average treated sample is N1 = 1000 × 0.41 × 0.38 = 156 and the average control sample is N0 = 1000 × (1 − 0.41) × 0.38 = 224. For N = 5000, N1 = 5000 × 0.41 × 0.38 = 779 and N0 = 5000 × (1 − 0.41) × 0.38 = 1,121. The bias and RMSE are computed across 1000 Monte Carlo repetitions.
For τ = 0.50 (50th quantile)

                          N = 1000               N = 5000
Estimator                 BIAS      RMSE         BIAS      RMSE
Unweighted                -0.1072   0.1543       -0.0998   0.1114
PS-weighted               -0.0206   0.1181       -0.0157   0.0522
D-weighted (estimated)    0.0007    0.1068       0.0044    0.0483
D-weighted (known)        0.0000    0.1028       0.0043    0.0462

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.

Table I.7 (cont'd)

For τ = 0.75 (75th quantile)

                          N = 1000               N = 5000
Estimator                 BIAS      RMSE         BIAS      RMSE
Unweighted                -0.2742   0.3803       -0.2687   0.2917
PS-weighted               -0.0899   0.3097       -0.0896   0.1550
D-weighted (estimated)    -0.0217   0.2523       -0.0147   0.1036
D-weighted (known)        -0.0210   0.2399       -0.0145   0.0983

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.

Table I.8: B) When missing data probability is misspecified and propensity score is correct

For τ = 0.25 (25th quantile)

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS      RMSE
Unweighted     -0.0424   0.0690       -0.0446   0.0512
PS-weighted    -0.0014   0.0554       -0.0022   0.0254
D-weighted     -0.0116   0.0557       -0.0150   0.0291

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the treatment propensity score and the missing outcomes probability model. For N = 1000, N1 = 156 and N0 = 224; for N = 5000, N1 = 779 and N0 = 1,121 (computed as in Table I.7). The bias and RMSE are computed across 1000 simulation draws.

Table I.8 (cont'd)

For τ = 0.50 (50th quantile)

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS      RMSE
Unweighted     -0.1072   0.1543       -0.0998   0.1114
PS-weighted    -0.0206   0.1181       -0.0157   0.0522
D-weighted     -0.0355   0.1119       -0.0319   0.0571

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.
Table I.8 (cont'd)

For τ = 0.75 (75th quantile)

               N = 1000               N = 5000
Estimator      BIAS      RMSE         BIAS      RMSE
Unweighted     -0.2742   0.3803       -0.2687   0.2917
PS-weighted    -0.0899   0.3097       -0.0896   0.1550
D-weighted     -0.0989   0.2648       -0.0896   0.1348

Notes: See the τ = 0.25 panel of Table I.8; bias and RMSE are computed across 1000 simulation draws.

Table I.9: C) When missing data probability is correct and propensity score is misspecified

For τ = 0.25 (25th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0014 (0.0554), -0.0239 (0.0352), -0.0227 (0.0598), -0.0022 (0.0254), 0.0265 (0.0614), 0.0243 (0.0348).

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the treatment propensity score and the missing outcomes probability model. For N = 1000, N1 = 156 and N0 = 224; for N = 5000, N1 = 779 and N0 = 1,121. The bias and RMSE are computed across 1000 simulation draws.

For τ = 0.50 (50th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0206 (0.1181), -0.0681 (0.1349), -0.0637 (0.0809), -0.0157 (0.0522), 0.0456 (0.1168), 0.0488 (0.0673).

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.

Table I.9 (cont'd)

For τ = 0.75 (75th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0899 (0.3097), -0.1978 (0.2346), -0.0896 (0.1550), -0.2005 (0.3611), 0.0691 (0.2894), 0.0709 (0.1377).

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.
Table I.10: D) Both probability models are misspecified

For τ = 0.25 (25th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0014 (0.0554), -0.0227 (0.0598), -0.0022 (0.0254), -0.0239 (0.0352), 0.0095 (0.0571), 0.0074 (0.0268).

Notes: The unweighted estimator does not weight the observed data. The PS-weighted estimator weights to correct only for non-random assignment, and the doubly weighted estimator weights by both the treatment propensity score and the missing outcomes probability model to correct for non-random assignment and missing outcomes. For N = 1000, the average treated sample is N1 = 1000 × 0.41 × 0.38 = 156 and the average control sample is N0 = 1000 × (1 − 0.41) × 0.38 = 224. For N = 5000, N1 = 779 and N0 = 1,121. The bias and RMSE are computed across 1000 Monte Carlo repetitions.

Table I.10 (cont'd)

For τ = 0.50 (50th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0206 (0.1181), -0.0681 (0.1349), -0.0157 (0.0522), -0.0637 (0.0809), 0.0070 (0.1108), 0.0136 (0.0503).

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.

For τ = 0.75 (75th quantile)
Estimators: Unweighted, PS-weighted, D-weighted, for N = 1000 and N = 5000. Reported BIAS (RMSE) values: -0.0899 (0.3097), -0.0131 (0.1202), -0.0896 (0.1550), -0.2005 (0.3611), -0.0216 (0.2806), -0.1978 (0.2346).

Notes: See the τ = 0.25 panel; bias and RMSE are computed across 1000 simulation draws.
I.0.3 Calonico and Smith Application

Table I.11: Proportion of missing earnings in the experimental sample

Earnings in 1979   Treated   Control   Total
Missing            196       210       406
Observed           600       585       1,185
Total              796       795       1,591

Table I.12: Proportion of missing data in the PSID samples

Earnings in 1979   PSID-1   PSID-2
Missing            81       22
Observed           648      182
Total              729      204

Table I.13: Unweighted and weighted pre-training earnings comparisons using NSW and PSID comparison groups

Pre-training estimates

                     Unadjusted                                            Adjusted
Comparison group     Unweighted        PS-weighted     D-weighted         Unweighted        PS-weighted     D-weighted
NSW (N=1,185)        -18 (123.45)      -9 (51.07)      1 (48.76)          -22 (124.70)      -10 (51.34)     -1 (48.97)
PSID-1 (N=1,016)     -2,534 (283.95)   -222 (213.57)   -255 (205.59)      -2,804 (281.49)   -199 (212.55)   -222 (205.45)
PSID-2 (N=720)       -2,080 (411.23)   -1,371 (331.41) -1,357 (317.41)    -2,181 (427.24)   -1,505 (359.98) -1,467 (342.16)

Bias using NSW control

PSID-1 (N=1,001)     -2,517 (279.38)   289 (256.93)    236 (247.18)       -2,760 (283.09)   334 (257.50)    287 (248.20)
PSID-2 (N=705)       -2,063 (416.53)   -1,249 (323.36) -1,255 (310.59)    -2,144 (435.74)   -1,306 (354.12) -1,297 (337.68)

Adjusted covariates (included in all adjusted specifications): pre-training earnings (1975), age, age², education, high school dropout, black, hispanic, and marital status.

Notes: This table reports unadjusted and adjusted pre-training earnings differences, where the first row reports the experimental estimates. The second and third rows report non-experimental estimates computed using the PSID-1 and PSID-2 comparison groups, respectively. The second panel reports bias estimates computed by combining the NSW control group with the PSID-1 and PSID-2 comparison groups, respectively. Both the pre-training estimates and the bias estimates should be compared to zero. Bootstrapped standard errors (in parentheses) have been constructed using 10,000 replications. All values are in 1982 dollars. The samples used for estimating the training and bias estimates with the PSID-1 and PSID-2 comparison groups have been trimmed to ensure common support in the distribution of weights for the NSW treatment and comparison groups. For more detail, see appendix G.

Table I.14: Estimation summary for ATE under different cases of misspecification

           CEF                        G(·)                         R(·)
Scenario   Model   Estimation         Model   Estimation           Model   Estimation
1          C       Φ(xθg)             C       Λ(xγ)                C       Λ(zγ)
2          C       Φ(xθg)             C       Λ(xγ)                M       Φ(z(1)γ(1))
3          C       Φ(xθg)             M       Φ(x(1)γ(1))          C       Λ(zγ)
4          C       Φ(xθg)             M       Φ(x(1)γ(1))          M       Φ(z(1)γ(1))
5          M       xθg                C       Λ(xγ)                C       Λ(zγ)
6          M       xθg                C       Λ(xγ)                M       Φ(z(1)γ(1))
7          M       xθg                M       Φ(x(1)γ(1))          C       Λ(zγ)
8          M       xθg                M       Φ(x(1)γ(1))          M       Φ(z(1)γ(1))

Notes: C and M correspond to whether the estimated model is correctly specified or misspecified. x and z both include an intercept. x(1) and z(1) are the subsets of x and z left after omitting x1. Therefore, the probability models are misspecified in both the functional form and the linear index. G(·) refers to the propensity score model and R(·) refers to the missing outcomes probability model.
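As a concrete illustration of the design in Table I.14's notes, the sketch below fits the correctly specified logit in the full index alongside the misspecified probit that omits x1 (wrong functional form and wrong index). The data-generating line is a placeholder, not the chapter's simulation; only the logit/probit-with-omitted-x1 contrast is taken from the table.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder draws; the true propensity is a logit in (1, x1, x2).
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=2000), rng.normal(size=2000)
w = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.8 * x1 - 0.5 * x2))))

X_full = sm.add_constant(np.column_stack([x1, x2]))  # intercept + x1 + x2
X_omit = sm.add_constant(x2)                         # drops x1 entirely
G_correct = sm.Logit(w, X_full).fit(disp=0).predict(X_full)   # Λ(xγ)
G_misspec = sm.Probit(w, X_omit).fit(disp=0).predict(X_omit)  # Φ(x(1)γ(1))
```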
Table I.15: Estimation summary for quantile effects under different cases of misspecification

           CQF                          G(·)                         R(·)
Scenario   Model   Estimation           Model   Estimation           Model   Estimation
4          C       exp(xθg(τ))          M       Φ(x(1)γ(1))          M       Φ(x(1)γ(1))
5          M       xθg(τ)               C       Λ(xγ)                C       Λ(zγ)
6          M       xθg(τ)               C       Λ(xγ)                M       Φ(x(1)γ(1))
7          M       xθg(τ)               M       Φ(x(1)γ(1))          C       Λ(zγ)
8          M       xθg(τ)               M       Φ(x(1)γ(1))          M       Φ(x(1)γ(1))

Notes: C and M denote whether the estimated model is correctly specified or misspecified. x and z both include an intercept. x(1) and z(1) are the subsets of x and z left after omitting x1. Therefore, the probability models are misspecified in both the functional form and the linear index. G(·) refers to the propensity score model and R(·) refers to the missing outcomes probability model.

Table I.16: Covariate means and p-values from the test of equality of two means, by treatment status

Covariates                           Treatment          Control            P(|T|>|t|)   PSID-1             P(|T|>|t|)   PSID-2             P(|T|>|t|)
Age in years                         33.37 (7.42)       33.64 (7.19)       0.46         36.73 (10.60)      0.00         34.41 (9.48)       0.11
Years of education                   10.30 (1.92)       10.27 (2.00)       0.72         11.32 (2.71)       0.00         10.55 (2.09)       0.07
Proportion of high school dropouts   0.70 (0.46)        0.69 (0.46)        0.73         0.45 (0.50)        0.00         0.59 (0.49)        0.00
Proportion married                   0.02 (0.15)        0.04 (0.20)        0.03         0.02 (0.13)        0.05         0.01 (0.10)        0.08
Proportion black                     0.84 (0.37)        0.82 (0.39)        0.29         0.66 (0.47)        0.00         0.87 (0.34)        0.13
Proportion hispanic                  0.12 (0.32)        0.13 (0.33)        0.59         0.02 (0.12)        0.00         0.02 (0.16)        0.00
Number of children in 1975           2.17 (1.30)        2.26 (1.32)        0.21         1.70 (1.75)        0.00         2.91 (1.73)        0.00
Real earnings in 1975                799.88 (1931.92)   811.19 (2041.32)   0.91         7446.15 (7515.59)  0.00         2069.65 (3474.10)  0.00
Observations                         796                795                             729                             204

Notes: Along with the covariate means and standard deviations (in parentheses), the table reports p-values from the test of equality of two means. Column 4 tests for differences between the NSW treatment and control groups; columns 6 and 8 report the same using the PSID-1 and PSID-2 comparison groups, respectively. Real earnings in 1975 are expressed in 1982 dollars.
Table I.17: Covariate means and p-values from the test of equality of two means for the observed and missing samples

Control group:
Covariates                           Missing             Observed            P(|T|>|t|)
Age                                  33.36 (7.30)        33.74 (7.15)        0.51
Years of education                   10.29 (1.93)        10.26 (2.03)        0.85
Proportion of high school dropouts   0.70 (0.46)         0.68 (0.47)         0.57
Proportion married                   0.05 (0.21)         0.04 (0.19)         0.61
Proportion black                     0.81 (0.39)         0.82 (0.39)         0.81
Proportion hispanic                  0.12 (0.33)         0.13 (0.33)         0.87
Number of children in 1975           2.33 (1.29)         2.23 (1.34)         0.34
Real earnings in 1975                621.54 (1,523.00)   879.28 (2,194.93)   0.12

Treatment group:
Covariates                           Missing             Observed            P(|T|>|t|)
Age                                  32.15 (7.39)        33.77 (7.40)        0.01
Years of education                   10.29 (2.05)        10.31 (1.88)        0.89
Proportion of high school dropouts   0.69 (0.46)         0.70 (0.46)         0.77
Proportion married                   0.03 (0.16)         0.02 (0.15)         0.75
Proportion black                     0.83 (0.38)         0.84 (0.37)         0.87
Proportion hispanic                  0.13 (0.33)         0.12 (0.32)         0.64
Number of children in 1975           2.14 (1.32)         2.19 (1.29)         0.69
Real earnings in 1975                610.77 (1,677.36)   861.65 (2,005.53)   0.11

PSID-1:
Covariates                           Missing             Observed            P(|T|>|t|)
Age                                  34.00 (10.50)       37.07 (10.57)       0.01
Years of education                   11.44 (2.17)        11.30 (2.77)        0.60
Proportion of high school dropouts   0.43 (0.50)         0.45 (0.50)         0.73
Proportion married                   0.00 (0.00)         0.02 (0.14)         0.00
Proportion black                     0.74 (0.44)         0.65 (0.48)         0.10
Proportion hispanic                  0.01 (0.11)         0.02 (0.12)         0.82
Number of children in 1975           1.54 (1.45)         1.71 (1.78)         0.33
Real earnings in 1975                6927.95 (7,330.74)  7510.92 (7,541.41)  0.50

PSID-2:
Covariates                           Missing             Observed            P(|T|>|t|)
Age                                  33.32 (10.81)       34.54 (9.34)        0.62
Years of education                   11.05 (1.73)        10.49 (2.13)        0.18
Proportion of high school dropouts   0.55 (0.51)         0.59 (0.49)         0.68
Proportion married                   0.00 (0.00)         0.01 (0.10)         0.16
Proportion black                     0.91 (0.29)         0.86 (0.35)         0.50
Proportion hispanic                  0.05 (0.21)         0.02 (0.15)         0.62
Number of children in 1975           2.41 (1.14)         2.97 (1.79)         0.05
Real earnings in 1975                896.56 (2,315.12)   2211.45 (3,567.50)  0.02

Observations: 204, 204, 795, 795, 729, 729, 796, 796.

Notes: Along with the covariate means and standard deviations (in parentheses), the table reports p-values from the test of equality of two means between the observed and missing samples. Real earnings in 1975 are expressed in 1982 dollars.

Table I.18: Unweighted and weighted earnings comparisons and estimated training effects using NSW and PSID comparison groups

Post-training earnings estimates

                     Unadjusted                                        Adjusted (first set)                              Adjusted (second set)
Comparison group     Unweighted      PS-weighted     D-weighted        Unweighted      PS-weighted      D-weighted       Unweighted      PS-weighted      D-weighted
NSW (N=1,185)        821 (307.22)    848 (304.04)    824 (304.61)      845 (303.60)    852 (302.94)     828 (303.53)     864 (303.47)    850 (302.96)     826 (303.58)
PSID-1 (N=1,016)     -799 (444.84)   827 (503.00)    803 (503.26)      298 (428.60)    909 (497.76)     907 (501.54)     335 (440.18)    905 (518.54)     904 (522.97)
PSID-2 (N=720)       -31 (713.88)    569 (1041.81)   566 (1027.12)     492 (664.46)    996 (953.80)     1,040 (961.74)   698 (784.28)    1,082 (1264.18)  1,049 (1217.46)

Bias estimates using NSW control

PSID-1 (N=1,001)     -1,620 (431.75) 169 (561.74)    156 (553.07)      -493 (427.93)   -40 (499.91)     -21 (501.44)     -568 (434.59)   -38 (504.19)     -21 (507.02)
PSID-2 (N=705)       -109 (663.80)   207 (962.85)    200 (954.61)      -853 (707.87)   -17 (1195.47)    -24 (1156.39)    -212 (1025.87)  -228 (1041.44)   -378 (759.75)

Adjusted covariates: the adjusted specifications include pre-training earnings (1975), age, age², education, high school dropout, black, hispanic, and marital status; the second adjusted set additionally includes number of children (1975).

Notes: This table reports unadjusted and adjusted post-training earnings differences between the NSW treatment group and three different comparison groups, namely the NSW control, PSID-1, and PSID-2. The first row reports experimental training estimates that combine the NSW treatment and control groups, whereas the second and third rows report non-experimental estimates computed using the PSID-1 and PSID-2 groups, respectively. Each of the non-experimental estimates should be compared to the experimental benchmark.
The second panel of the table reports bias estimates computed by combining the NSW control with the PSID-1 and PSID-2 comparison groups, respectively. These represent a second measure of bias, which should be compared to zero. Bootstrapped standard errors are given in parentheses and have been constructed using 10,000 replications. All values are in 1982 dollars. The samples used for estimating the training and bias estimates have been trimmed to ensure common support in the distribution of weights for the treatment and comparison groups. For more detail, see the application appendix.

Table I.19: Unconditional quantile treatment effect (UQTE) using PSID-1 comparison group

Quantile   Experimental        Unweighted           PS-weighted          D-weighted
0.1        0 (0)               0 (0)                0 (0)                0 (0)
0.2        0 (0)               0 (0)                0 (0)                0 (0)
0.3        0 (0)               0 (12.91)            0 (0)                0 (0)
0.4        0 (11.17)           -1124.61 (552.97)    0 (207.14)           0 (174.89)
0.5        993.52 (695.93)     -2227.26 (983.43)    2076.58 (851.09)     1847.04 (829.42)
0.6        2004.40 (1112.82)   -860.55 (964.97)     3602.76 (1299.08)    3535.85 (1284.64)
0.7        2129.93 (716.04)    428.01 (728.22)      3415.47 (988.24)     3340.84 (992.95)
0.8        1753.27 (372.37)    -190.60 (519.63)     2019.44 (984.59)     2019.44 (999.47)
0.9        1134.21 (449.86)    -1563.27 (952.85)    -385.45 (1059.43)    -385.45 (1056.09)

Notes: This table reports unweighted, PS-weighted, and double-weighted UQTE estimates. The estimates are reported at every 10th quantile of the 1979 earnings distribution. The experimental and PSID-1 estimates have been constructed using N = 1,185 and N = 1,016 observations, respectively. Bootstrapped standard errors are given in parentheses and have been constructed using 1,000 replications. All values are in 1982 dollars. The samples used for constructing these estimates have been trimmed to ensure common support across the treatment and comparison groups.

Table I.20: Unconditional quantile treatment effect (UQTE) using PSID-2 comparison group

Quantile   Experimental        Unweighted           PS-weighted          D-weighted
0.1        0 (0)               0 (0)                0 (0)                0 (0)
0.2        0 (0)               0 (0)                0 (10.07)            0 (10.07)
0.3        0 (0)               0 (111.74)           0 (136.31)           0 (129.77)
0.4        0 (13.25)           -795.71 (672.87)     0 (573.22)           0 (546.78)
0.5        993.52 (693.73)     -237.98 (1232.63)    378.98 (1312.93)     372.07 (1267.28)
0.6        2004.40 (1114.65)   193.77 (1426.40)     1480.47 (1647.31)    1294.77 (1659.69)
0.7        2129.93 (710.26)    1857.64 (943.38)     2616.22 (1217.80)    2599.73 (1209.60)
0.8        1753.27 (371.73)    1148.85 (1152.92)    2010.87 (1541.14)    1990.37 (1553.67)
0.9        1134.21 (452.08)    -237.08 (1888.06)    1089.10 (3321.56)    1089.10 (3246.78)

Notes: This table reports unweighted, PS-weighted, and double-weighted UQTE estimates. The estimates are reported at every 10th quantile of the 1979 earnings distribution. The experimental and PSID-2 estimates have been computed using N = 1,185 and N = 720 observations, respectively. Bootstrapped standard errors are given in parentheses and have been constructed using 1,000 replications. All values are in 1982 dollars.
The samples used for constructing these estimates have been trimmed to ensure common support across the treatment and comparison groups.

APPENDIX J

PROOFS FOR CHAPTER 3

Proof of Lemma 3.2.5

Proof. By the law of iterated expectations (LIE),
$$
\begin{aligned}
E\left[\frac{s}{r(x,w_g)}\cdot\frac{w_g}{p_g(x)}\cdot q\big(y(g),x,\theta_g\big)\right]
&= E\left[\frac{w_g}{r(x,w_g)\,p_g(x)}\cdot E\big(s \mid y(g),x,w_g\big)\cdot q\big(y(g),x,\theta_g\big)\right]\\
&= E\left[\frac{w_g}{r(x,w_g)\,p_g(x)}\cdot P\big(s=1\mid y(g),x,w_g\big)\cdot q\big(y(g),x,\theta_g\big)\right]\\
&= E\left[\frac{w_g}{r(x,w_g)\,p_g(x)}\cdot P\big(s=1\mid x,w_g\big)\cdot q\big(y(g),x,\theta_g\big)\right]\\
&= E\left[\frac{w_g}{r(x,w_g)\,p_g(x)}\cdot r(x,w_g)\cdot q\big(y(g),x,\theta_g\big)\right]
= E\left[\frac{w_g}{p_g(x)}\cdot q\big(y(g),x,\theta_g\big)\right],
\end{aligned}
$$
where the third equality follows from ignorability. Using another application of the LIE, rewrite the above expectation as
$$
\begin{aligned}
E\left[\frac{q\big(y(g),x,\theta_g\big)}{p_g(x)}\cdot E\big(w_g \mid y(g),x\big)\right]
&= E\left[\frac{q\big(y(g),x,\theta_g\big)}{p_g(x)}\cdot P\big(w_g=1\mid y(g),x\big)\right]\\
&= E\left[\frac{q\big(y(g),x,\theta_g\big)}{p_g(x)}\cdot P\big(w_g=1\mid x\big)\right]
= E\Big[q\big(y(g),x,\theta_g\big)\Big],
\end{aligned}
$$
where the second-to-last equality follows from unconfoundedness. Hence, $\theta_g^0$ solves the weighted population problem. ∎

Proof of Lemma 3.3.3

Proof. Consistency of $\hat\gamma$ and $\hat\delta$ follows directly after verifying the conditions of Theorem 2.1 in Newey and McFadden (1994). Condition 2.1(i), which requires a unique solution to the maximization problem, is satisfied using Assumption 3.3.3(1) and (4) and Lemma 2.2 of Newey and McFadden (1994). Condition 2.1(ii), compactness of the parameter space, holds by Assumption 3.3.3(1). Conditions 2.1(iii) and (iv) follow from Lemma 2.4 in Newey and McFadden (1994). ∎

Proof of Lemma 3.3.4

Proof. The proof of asymptotic normality follows from verifying the conditions of Theorem 3.1 in Newey and McFadden (1994), which is the basic asymptotic normality result for extremum estimators. I use the arguments laid out in Newey and McFadden (1994) to prove asymptotic normality of $\sqrt{N}(\hat\gamma - \gamma_0)$; normality for $\sqrt{N}(\hat\delta - \delta_0)$ follows in a similar manner. By Lemma 3.3.3, we have $\hat\gamma \xrightarrow{p} \gamma_0$. Theorem 3.1(i) and (ii) hold because of conditions 3.4(i) and (ii). Condition 3.1(iii) holds with $\Sigma = -E\big[\nabla_{\gamma\gamma'} \ln f(w_1|x,\gamma_0)\big]$ by the information matrix equality, $E\big[\nabla_{\gamma} \ln f(w_1|x,\gamma_0)\big] = 0$ (condition 3.4(iii)), existence of $E\big[\nabla_{\gamma\gamma'} \ln f(w_1|x,\gamma_0)\big]$ (condition 3.4(iii)), and the Lindeberg–Lévy central limit theorem. Condition 3.1(iv) follows from Lemma 2.4 in Newey and McFadden (1994), which requires compactness of $\Gamma$ and $\gamma_0$ being an interior point of $\Gamma$, applied with $a(z,\theta) = \nabla_{\gamma\gamma'} \ln f(w_1|x,\gamma)$ using conditions (ii) and (v). Condition 3.1(v) follows from non-singularity of $E\big[\nabla_{\gamma\gamma'} \ln f(w_1|x,\gamma_0)\big]$ using condition 3.4(iv). Asymptotic normality then follows from the conclusion of Theorem 3.1 in Newey and McFadden (1994). ∎
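For reference in the proofs that follow, the doubly weighted sample problem can be restated compactly (a sketch in the chapter's notation, not a new result; $q$ is the objective function, $r$ and $p_g$ the missingness and assignment probabilities):
$$
\hat\theta_g \;\in\; \operatorname*{arg\,min}_{\theta_g \in \Theta_g}\;
\frac{1}{N_g} \sum_{i=1}^{N} \frac{s_i\, w_{ig}}{r(x_i, w_{ig})\, p_g(x_i)}\;
q\big(y_i(g), x_i, \theta_g\big),
$$
with the parametric weights $R(x_i, w_{ig}, \hat\delta)$ and $G(x_i, \hat\gamma)$ (or $1 - G(x_i, \hat\gamma)$ for $g = 0$) replacing $r(\cdot)$ and $p_g(\cdot)$ in the feasible version studied in Section 3.5.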
Proof of Theorem 3.4.1

Proof. I have already shown that
$$
E\left[\frac{s}{r(x,w_g)}\cdot\frac{w_g}{p_g(x)}\cdot q\big(y(g),x,\theta_g\big)\right] = E\Big[q\big(y(g),x,\theta_g\big)\Big]
$$
for both $g = 0, 1$. Now, one needs to prove uniform convergence of the weighted sample objective function to its population expectation. Formally, I need to show
$$
\sup_{\theta_g\in\Theta_g}\left\| \frac{1}{N_g}\sum_{i=1}^{N} \frac{s_i\, w_{ig}}{r(x_i,w_{ig})\, p_g(x_i)}\, q\big(y_i(g),x_i,\theta_g\big) - E\left[\frac{s\, w_g}{r(x,w_g)\, p_g(x)}\, q\big(y(g),x,\theta_g\big)\right]\right\| \xrightarrow{p} 0. \tag{J.1}
$$
Then consider
$$
\left|\frac{s\, w_g}{r(x,w_g)\, p_g(x)}\, q\big(y(g),x,\theta_g\big)\right| \;\le\; \frac{\big|q\big(y(g),x,\theta_g\big)\big|}{\eta\cdot\kappa_g} \;\le\; \frac{b\big(y(g),x\big)}{\eta\cdot\kappa_g}. \tag{J.2}
$$
Inequality (J.2) holds due to part (3) of Assumptions 3.2.2 and 3.2.3. Now, $E\big[b(y(g),x)\big] < \infty$ by condition (3) of this theorem. Therefore, uniform convergence is established by Lemma 2.4 of Newey and McFadden (1994). Hence, $\hat\theta_g \xrightarrow{p} \theta_g^0$. Replacing the true probabilities $r(\cdot)$ and $p_g(\cdot)$ by their consistent estimates does not change the above result. ∎

Proof of Theorem 3.4.2

Proof. Following Newey and McFadden (1994), with minor modifications, I first show that $\sqrt{N}\,\|\hat\theta_g - \theta_g^0\| = O_p(1)$, i.e., that $\hat\theta_g$ is $\sqrt{N}$-consistent. A second-order Taylor approximation gives
$$
\begin{aligned}
Q_0(\theta_g) &= Q_0(\theta_g^0) + \nabla_{\theta_g} Q_0(\theta_g^0)'(\theta_g - \theta_g^0) + (\theta_g - \theta_g^0)' H_g (\theta_g - \theta_g^0)/2 + o\big(\|\theta_g - \theta_g^0\|^2\big)\\
&= Q_0(\theta_g^0) + (\theta_g - \theta_g^0)' H_g (\theta_g - \theta_g^0)/2 + o\big(\|\theta_g - \theta_g^0\|^2\big), \tag{J.3}
\end{aligned}
$$
where the second equality holds because $Q_0(\theta_g)$ has a local minimum at $\theta_g^0$, so the first derivative is zero there. Since $H_g$ is positive definite and nonsingular, there exists a constant $C > 0$ and a small enough neighborhood of $\theta_g^0$ such that
$$
(\theta_g - \theta_g^0)' H_g (\theta_g - \theta_g^0)/2 + o\big(\|\theta_g - \theta_g^0\|^2\big) \ge C\,\|\theta_g - \theta_g^0\|^2.
$$
Therefore, since $\hat\theta_g \xrightarrow{p} \theta_g^0$, with probability approaching one we can rewrite (J.3) as
$$
Q_0(\theta_g) \ge Q_0(\theta_g^0) + C\,\|\theta_g - \theta_g^0\|^2.
$$
Define
$$
R_N(\theta_g) = Q_N(\theta_g) - Q_N(\theta_g^0) - \big(Q_0(\theta_g) - Q_0(\theta_g^0)\big) - \nabla_{\theta_g} Q_N(\theta_g^0)'(\theta_g - \theta_g^0).
$$
Using Ossiander's entropy conditions given in 4.2(6) and 4.2(7), along with i.i.d. sampling as given in Assumption 2.4, one obtains stochastic equicontinuity from Theorem 4 and Theorem 5 (with $p = 2$) of Andrews (1994). Hence, for any sequence $\beta_N \to 0$,
$$
\sup_{\|\theta_g - \theta_g^0\| \le \beta_N} \frac{\sqrt{N}\,|R_N(\theta_g)|}{\|\theta_g - \theta_g^0\|\big(1 + \sqrt{N}\,\|\theta_g - \theta_g^0\|\big)} = o_p(1).
$$
In other words, with probability approaching one, for all $\theta_g$,
$$
\sqrt{N}\,|R_N(\theta_g)| \le \|\theta_g - \theta_g^0\|\Big(1 + \sqrt{N}\,\|\theta_g - \theta_g^0\|\Big)\, o_p(1). \tag{J.4}
$$
Choose $U_N$ so that $\hat\theta_g \in U_N$ with probability approaching one, so that (J.4) holds.
Again, since $\hat\theta_g$ is consistent for $\theta_g^0$, we can write
$$
\begin{aligned}
0 &\ge Q_N(\hat\theta_g) - Q_N(\theta_g^0) - o_p(N^{-1})\\
&= Q_0(\hat\theta_g) - Q_0(\theta_g^0) + \nabla_{\theta_g} Q_N(\theta_g^0)'(\hat\theta_g - \theta_g^0) + R_N(\hat\theta_g) - o_p(N^{-1})\\
&\ge C\,\|\hat\theta_g - \theta_g^0\|^2 - \|\nabla_{\theta_g} Q_N(\theta_g^0)\|\,\|\hat\theta_g - \theta_g^0\| - \|\hat\theta_g - \theta_g^0\|\Big(1 + \sqrt{N}\,\|\hat\theta_g - \theta_g^0\|\Big)\, o_p(N^{-1/2}) - o_p(N^{-1})\\
&\ge \big[C + o_p(1)\big]\,\|\hat\theta_g - \theta_g^0\|^2 - \big[O_p(N^{-1/2}) + o_p(N^{-1/2})\big]\,\|\hat\theta_g - \theta_g^0\| - o_p(N^{-1}). \tag{J.5}
\end{aligned}
$$
The simplification in the last line uses $\|\nabla_{\theta_g} Q_N(\theta_g^0)\| = O_p(N^{-1/2})$ and absorbs the term $\sqrt{N}\,\|\hat\theta_g - \theta_g^0\|^2\, o_p(N^{-1/2}) = \|\hat\theta_g - \theta_g^0\|^2\, o_p(1)$ into the leading quadratic term. We now complete the square with $x = \|\hat\theta_g - \theta_g^0\|$, $b = O_p(N^{-1/2})$, and $c = -o_p(N^{-1})$ to obtain
$$
\left(\|\hat\theta_g - \theta_g^0\| + \frac{O_p(N^{-1/2})}{2}\right)^2 \le o_p(N^{-1}) + O_p(N^{-1/2})\cdot O_p(N^{-1/2}).
$$
By the rules of the asymptotic order notation, $O_p(N^{-1/2})\cdot O_p(N^{-1/2}) = O_p(N^{-1})$.
Therefore, we obtain
$$
\left(\|\hat\theta_g - \theta_g^0\| + O_p(N^{-1/2})\right)^2 \le O_p(N^{-1}).
$$
Taking a square root on both sides,
$$
\Big|\,\|\hat\theta_g - \theta_g^0\| + O_p(N^{-1/2})\,\Big| \le O_p(N^{-1/2}). \tag{J.6}
$$
Now, by the triangle inequality,
$$
\|\hat\theta_g - \theta_g^0\| = \|\hat\theta_g - \theta_g^0\| + O_p(N^{-1/2}) - O_p(N^{-1/2}) \le \Big|\,\|\hat\theta_g - \theta_g^0\| + O_p(N^{-1/2})\,\Big| + \big|{-O_p(N^{-1/2})}\big| \le O_p(N^{-1/2})
$$
by equation (J.6). Hence, we have established that $\hat\theta_g$ is $\sqrt{N}$-consistent. Now let $\ddot\theta_g = \theta_g^0 - H_g^{-1}\nabla_{\theta_g} Q_N(\theta_g^0)$; then $\ddot\theta_g$ is $\sqrt{N}$-consistent almost by construction, since $\nabla_{\theta_g} Q_N(\theta_g^0)$ is $O_p(N^{-1/2})$. Now consider
$$
Q_N(\hat\theta_g) - Q_N(\theta_g^0) = Q_0(\hat\theta_g) - Q_0(\theta_g^0) + \nabla_{\theta_g} Q_N(\theta_g^0)'(\hat\theta_g - \theta_g^0) + R_N(\hat\theta_g) + o_p(N^{-1}).
$$
Using (J.3) gives
$$
Q_N(\hat\theta_g) - Q_N(\theta_g^0) = (\hat\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0)/2 + o\big(\|\hat\theta_g - \theta_g^0\|^2\big) + \nabla_{\theta_g} Q_N(\theta_g^0)'(\hat\theta_g - \theta_g^0) + R_N(\hat\theta_g) + o_p(N^{-1}).
$$
Therefore, using the fact that $\nabla_{\theta_g} Q_N(\theta_g^0) = -H_g(\ddot\theta_g - \theta_g^0)$, we get
$$
2\big[Q_N(\hat\theta_g) - Q_N(\theta_g^0)\big] = (\hat\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0) - 2(\ddot\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0) + o_p(N^{-1}).
$$
To see that the remaining terms are of order $o_p(N^{-1})$, observe that
$$
o\big(\|\hat\theta_g - \theta_g^0\|^2\big) + R_N(\hat\theta_g) + o_p(N^{-1}) = o\big(O_p(N^{-1/2})\cdot O_p(N^{-1/2})\big) + O_p(N^{-1/2})\cdot o_p(N^{-1/2}) + o_p(N^{-1}) = o_p(N^{-1}).
$$
In a similar manner, we can write
$$
2\big[Q_N(\ddot\theta_g) - Q_N(\theta_g^0)\big] = (\ddot\theta_g - \theta_g^0)' H_g (\ddot\theta_g - \theta_g^0) - 2(\ddot\theta_g - \theta_g^0)' H_g (\ddot\theta_g - \theta_g^0) + o_p(N^{-1}) = -(\ddot\theta_g - \theta_g^0)' H_g (\ddot\theta_g - \theta_g^0) + o_p(N^{-1}).
$$
Then
$$
2\big[Q_N(\hat\theta_g) - Q_N(\theta_g^0)\big] - 2\big[Q_N(\ddot\theta_g) - Q_N(\theta_g^0)\big] = (\hat\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0) - 2(\ddot\theta_g - \theta_g^0)' H_g (\hat\theta_g - \theta_g^0) + (\ddot\theta_g - \theta_g^0)' H_g (\ddot\theta_g - \theta_g^0) + o_p(N^{-1}),
$$
where $2\big[Q_N(\hat\theta_g) - Q_N(\theta_g^0)\big] - 2\big[Q_N(\ddot\theta_g) - Q_N(\theta_g^0)\big] \le o_p(N^{-1})$, and
$$
o_p(N^{-1}) \ge (\hat\theta_g - \ddot\theta_g)' H_g (\hat\theta_g - \ddot\theta_g) \ge C\,\|\hat\theta_g - \ddot\theta_g\|^2.
$$
Hence,
$$
\Big\|\sqrt{N}(\hat\theta_g - \theta_g^0) - \big({-H_g^{-1}\sqrt{N}\,\nabla_{\theta_g} Q_N(\theta_g^0)}\big)\Big\| = \sqrt{N}\,\|\hat\theta_g - \ddot\theta_g\| \xrightarrow{p} 0.
$$
Therefore, the conclusion follows from the fact that
$$
-H_g^{-1}\sqrt{N}\,\nabla_{\theta_g} Q_N(\theta_g^0) \xrightarrow{d} N\big(0,\, H_g^{-1}\Omega_g H_g^{-1}\big). \;\blacksquare
$$

Proof of Theorem 3.4.3

Proof.
Consider
$$
\Sigma_1 - \Omega_1 = E\big(l_i l_i'\big) - \Big\{ E\big(l_i b_i'\big)\big[E\big(b_i b_i'\big)\big]^{-1} E\big(b_i l_i'\big) + E\big(l_i d_i'\big)\big[E\big(d_i d_i'\big)\big]^{-1} E\big(d_i l_i'\big) \Big\}.
$$
Since each component matrix in the above expression is positive semi-definite, the sum of the two matrices is also positive semi-definite. The proof for the control group follows analogously. ∎

Proof of Theorem 3.5.4

Proof. I have already shown that the population problem based on
$$
E\left[\frac{s\, w_1}{R(x,w_1,\delta^*)\, G(x,\gamma^*)}\; q\big(y(1),x,\theta_1\big)\right]
$$
identifies the parameter of interest, $\theta_1^0$, under the strong identification condition given in 3.5.1. In order to prove consistency of $\hat\theta_1$ for $\theta_1^0$, we need to prove uniform convergence of the weighted sample objective function to its population expectation. Formally, we need to show
$$
\sup_{\theta_1\in\Theta_1}\left\| \frac{1}{N_1}\sum_{i=1}^{N} \frac{s_i\, w_{i1}}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\, q\big(y_i(1),x_i,\theta_1\big) - E\left[\frac{s\, w_1}{R(x,w_1,\delta^*)\, G(x,\gamma^*)}\, q\big(y(1),x,\theta_1\big)\right]\right\| \xrightarrow{p} 0.
$$
Replacing $r(x,w_1)$ and $p_1(x)$ in the proof of Theorem 3.4.1 by $R(x,w_1,\delta^*)$ and $G(x,\gamma^*)$ gives the desired result. Consistency of $\hat\theta_0$ for $\theta_0^0$ can be established analogously by replacing $w_1$, $G(x,\gamma^*)$, and $R(x,w_1,\delta^*)$ above with $w_0$, $1 - G(x,\gamma^*)$, and $R(x,w_0,\delta^*)$, respectively. ∎

Proof of Theorem 3.5.5

Proof. The proof of this theorem follows in the manner of Theorem 3.4.2, but where $H_g$ now denotes the non-singular Hessian, with weights given by $G(x,\gamma^*)$ and $R(x,w_g,\delta^*)$. Also, $\Omega_g$ now denotes the variance of the doubly weighted scores, $l_i$ and $k_i$, for the treatment and control group problems, respectively. ∎

Proof of Corollary 3.5.6

Proof. This proof follows from the proof of the above theorem, 3.5.5, and the asymptotic variance of the estimator that uses known weights, which is
$$
\mathrm{Avar}\,\sqrt{N}\big(\tilde\theta_g - \theta_g^0\big) = H_g^{-1}\,\Omega_g\, H_g^{-1},
$$
where $\Omega_1 = E\big(l_i l_i'\big)$ and $\Omega_0 = E\big(k_i k_i'\big)$. The result follows immediately. ∎

Proof of Theorem 3.5.7

Proof.
Using two applications of the LIE and invoking ignorability and unconfoundedness, I can rewrite
$$
E\left[\frac{s_i\, w_{i1}}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\; q\big(y_i(1),x_i,\theta_1^0\big)\right]
= E\left[\frac{r(x_i,w_{i1})\, p_1(x_i)}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\; E\big(q(y_i(1),x_i,\theta_1^0)\mid x_i\big)\right].
$$
Then
$$
H_1 = \nabla^2_{\theta_1}\, E\left[\frac{r(x_i,w_{i1})\, p_1(x_i)}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\; E\big(q(y_i(1),x_i,\theta_1^0)\mid x_i\big)\right]
= E\left[\frac{r(x_i,w_{i1})\, p_1(x_i)}{R(x_i,w_{i1},\delta^*)\, G(x_i,\gamma^*)}\; A(x_i,\theta_1^0)\right].
$$
Similarly, I use the LIE to express $\Omega_1$ as
$$
\begin{aligned}
\Omega_1 &= E\left[\frac{r(x_i,w_i)\, p(x_i)}{R^2(x_i,w_i,\delta^*)\, G^2(x_i,\gamma^*)}\; E\big(\nabla_{\theta_1} q(y_i(1),x_i,\theta_1^0)'\,\nabla_{\theta_1} q(y_i(1),x_i,\theta_1^0)\mid x_i\big)\right]\\
&= \sigma^2_{01}\cdot E\left[\frac{r(x_i,w_i)\, p(x_i)}{R^2(x_i,w_i,\delta^*)\, G^2(x_i,\gamma^*)}\; A(x_i,\theta_1^0)\right].
\end{aligned}
$$
For the unweighted estimator, the variance simplifies, and this happens precisely because of the GCIME. To see this, consider $H_1^u$; using the LIE,
$$
H_1^u = E\big[r(x_i,w_{i1})\, p_1(x_i)\, A(x_i,\theta_1^0)\big],
$$
and similarly we can rewrite $\Omega_1^u$ using the LIE as
$$
\Omega_1^u = \sigma^2_{01}\cdot E\big[r(x_i,w_i)\, p(x_i)\, A(x_i,\theta_1^0)\big].
$$
Therefore, the asymptotic variance simplifies to
$$
\mathrm{Avar}\,\sqrt{N}\big(\hat\theta_1^u - \theta_1^0\big) = \sigma^2_{01}\,\Big( E\big[r(x_i,w_i)\, p(x_i)\, A(x_i,\theta_1^0)\big] \Big)^{-1}.
$$
Let $B_i = r_i^{1/2}\, p_i^{1/2}\, A_i^{1/2}$ and $D_i = \dfrac{r_i^{1/2}\, p_i^{1/2}}{R_i\, G_i}\, A_i^{1/2}$. Then
$$
\frac{1}{\sigma^2_{01}}\,\mathrm{Avar}\,\sqrt{N}\big(\hat\theta_1 - \theta_1^0\big) = \big[E(B_i' D_i)\big]^{-1}\, E(D_i' D_i)\, \big[E(D_i' B_i)\big]^{-1},
\qquad
\frac{1}{\sigma^2_{01}}\,\mathrm{Avar}\,\sqrt{N}\big(\hat\theta_1^u - \theta_1^0\big) = \big[E(B_i' B_i)\big]^{-1}.
$$
For showing that the difference of the two variances is positive semi-definite, consider
$$
E(B_i' B_i) - E(B_i' D_i)\big[E(D_i' D_i)\big]^{-1} E(D_i' B_i),
$$
where the quantity is nothing but the variance of the residuals from the population regression of $B_i$ on $D_i$. Hence, the difference is positive semi-definite. The results for the control group can be proven analogously. ∎

Proof of Theorem F.2.1
Consider, that |QN ( ˆθg + aεN ) − QN (θ0 g)| g) − Q0( ˆθg + aεN ) + Q0(θ0 Then we know that, RN (θg + aεN ) + ∇θg QN (θ0 QN (θg + aεN ) − QN (θ0 g)(cid:48)(θg + aεN − θ0 g) = g) − Q0(θg + aεN ) + Q0(θ0 g) Using eq(J.7) with eq(J.8) we obtain, |QN ( ˆθg + aεN ) − QN (θ0 = |RN ( ˆθg + aεN ) + ∇θg QN (θ0 g) − Q0( ˆθg + aεN ) + Q0(θ0 g)| g)(cid:48)( ˆθg + aεN − θ0 g)| Then using Triangle and Cauchy-Schwartz inequality, (J.7) (J.8) (J.9) |QN ( ˆθg + aεN ) − QN (θ0 ≤ |RN ( ˆθg + aεN )| + (cid:13)(cid:13)(cid:13) g) − Q0( ˆθg + aεN ) + Q0(θ0 g)| (cid:13)(cid:13)(cid:13)∇θg QN (θ0 (cid:13)(cid:13)(cid:13)(cid:18) g)(cid:48)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˆθg + aεN − θ0 (cid:13)(cid:13)(cid:13)(cid:19) (cid:13)(cid:13)(cid:13)θg + aεN − θ0 1 + g g √ op(1/ N ) Now, using stochastic equicontinuity condition, √ N RN (θg) ≤(cid:13)(cid:13)(cid:13)θg + aεN − θ0 g Then, |QN ( ˆθg + aεN ) − QN (θ0 ≤(cid:13)(cid:13)(cid:13)θg + aεN − θ0 (cid:13)(cid:13)(cid:13)(cid:18) g)| g) − Q0( ˆθg + aεN ) + Q0(θ0 √ N (cid:13)(cid:13)(cid:13)θg + aεN − θ0 (cid:13)(cid:13)(cid:13)(cid:19) 1 + g g √ N ) op(1/ + Op(N−1/2) · Op(εN ) (cid:104) √ = Op(εN ) = op(ε2 N ) 1 + N Op(εN ) √ op(1/ √ N ) N ) + Op(εN / (cid:105) 213 Hence, |QN ( ˆθg + aεN ) − QN (θ0 g) − (Q0( ˆθg + aεN ) − Q0(θ0 ε2 N Since, Q0(θg) is twice differentiable in θ0 g, − a(cid:48)Hga (cid:105) g))| (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 2 ε2 N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:104) Q0( ˆθg + aεN ) − Q0(θ0 g) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:34) (cid:12)(cid:12)(cid:12)(cid:12) 1 ( ˆθg + εN a − θ0 (cid:12)(cid:12)(cid:12)(cid:12) + ( ˆθg − θ0 g)(cid:48)Hga (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 ε2 N εN ε2 N 2 = ≤ g)(cid:48)Hg( ˆθg + εN a − θ0 g) ( ˆθg − θ0 g)(cid:48)Hg( ˆθg − θ0 g) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 (cid:104) Then using J.10, J.11 and triangle inequality, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 QN ( ˆθg + aεN ) − QN (θ0 (cid:104) g) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1 QN ( ˆθg + aεN ) − QN (θ0 (cid:104) Q0( ˆθg + aεN ) − Q0(θ0 g) ε2 N ε2 N ≤ + ε2 N = op(1) (J.10) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − a(cid:48)Hga 2 (J.11) g + o (cid:13)(cid:13)(cid:13)2(cid:19)(cid:35) (cid:18)(cid:13)(cid:13)(cid:13) ˆθg + εN a − θ0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + op(1) = op(1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:105) − a(cid:48)Hga (cid:105) − a(cid:48)Hga (cid:105)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) g) − Q0( ˆθg + aεN ) + Q0(θ0 g) 2 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤op(1) + op(1) = op(1) It follows that, ˆHgjk + = (cid:35) QN ( ˆθg + ejεN + ekεN ) − QN ( ˆθg − ejεN + ekεN ) − QN ( ˆθg + ejεN − ekεN ) (cid:34) (cid:34) QN ( ˆθg − ejεN − ekεN ) p→(cid:104) (cid:104) 2(ej + ek)(cid:48)Hgjk(ej + ek) − (ej − ek)(cid:48)Hgjk(ej − ek) − (ek − ej)(cid:48)Hgjk(ek − ej) e(cid:48) jHgjkej + e(cid:48) = 2 = e(cid:48) kHgjkek − e(cid:48) iHgjkei − e(cid:48) /8 + e(cid:48) kHgjkek jHgjkek 4ε2 N 4ε2 N (cid:105) jHgjkek = Hgjk (cid:35) (cid:105) /8 214 Pooled slopes (cid:4) Proof. Let us assume that m(x, θ) = h(α + xβ + ηw1) is the chosen mean function for µ(x). 
Pooled slopes

Proof. Let us assume that $m(x, \theta) = h(\alpha + x\beta + \eta w_1)$ is the chosen mean function for $\mu(x)$. Then, in the presence of non-random sampling, we have the following first order conditions:

$$
\sum_{i=1}^{N} s_i \left( \frac{w_{i1}}{\hat R \, \hat G} + \frac{w_{i0}}{\hat R \, (1 - \hat G)} \right) \Big[ y_i - h\big(\hat\alpha + x_i\hat\beta + \hat\eta w_{i1}\big) \Big] = 0
$$
$$
\sum_{i=1}^{N} s_i \left( \frac{w_{i1}}{\hat R \, \hat G} + \frac{w_{i0}}{\hat R \, (1 - \hat G)} \right) x_i' \Big[ y_i - h\big(\hat\alpha + x_i\hat\beta + \hat\eta w_{i1}\big) \Big] = 0
$$
$$
\sum_{i=1}^{N} s_i \, \frac{w_{i1}}{\hat R \, \hat G} \Big[ y_i - h\big(\hat\alpha + x_i\hat\beta + \hat\eta w_{i1}\big) \Big] = 0
$$

where $\hat R = R(x, w, \hat\delta)$ and $\hat G = G(x, \hat\gamma)$. Ignoring the set of conditions corresponding to the slope parameter $\beta$, the population counterparts to the above FOCs are

$$
E\left[ s \left( \frac{w_1}{R \, G} + \frac{w_0}{R \, (1-G)} \right) \Big( y - h(\alpha^* + x\beta^* + \eta^* w_1) \Big) \right] = 0 \tag{J.12}
$$
$$
E\left[ \frac{s \, w_1}{R \, G} \Big( y - h(\alpha^* + x\beta^* + \eta^* w_1) \Big) \right] = 0 \tag{J.13}
$$

where $\alpha^*$, $\beta^*$, and $\eta^*$ are the probability limits of the QMLE estimators $\hat\alpha$, $\hat\beta$, and $\hat\eta$. Rearranging (J.12) and (J.13) gives us

$$
E\left[ s \left( \frac{w_1}{R G} + \frac{w_0}{R(1-G)} \right) y \right]
= E\left[ s \left( \frac{w_1}{R G} + \frac{w_0}{R(1-G)} \right) h(\alpha^* + x\beta^* + \eta^* w_1) \right] \tag{J.14}
$$
$$
E\left[ \frac{s \, w_1}{R G} \, y \right] = E\left[ \frac{s \, w_1}{R G} \, h(\alpha^* + x\beta^* + \eta^* w_1) \right] \tag{J.15}
$$

Now, $y = y(0) \, w_0 + y(1) \, w_1$, which implies that we can replace $y$ in the above two equations to obtain the LHS of (J.14) equal to

$$
E\left[ s \left( \frac{w_1 \, y(1)}{R G} + \frac{w_0 \, y(0)}{R(1-G)} \right) \right]
$$

By using iterated expectations, we can rewrite the above as

$$
E\left[ \frac{w_1}{G \, R} \, E\big(s \, y(1) \mid x, w_1\big) + \frac{w_0}{(1-G) \, R} \, E\big(s \, y(0) \mid x, w_0\big) \right]
$$

Due to ignorability of sample selection, we can split each conditional expectation into parts:

$$
E\left[ \frac{w_1}{G \, R} \, E(s \mid x, w_1) \, E\big(y(1) \mid x, w_1\big) + \frac{w_0}{(1-G) \, R} \, E(s \mid x, w_0) \, E\big(y(0) \mid x, w_0\big) \right]
$$

Note that $w_1 \cdot E(s \mid x, w_1) = w_1 \cdot R$ and, similarly, $w_0 \cdot E(s \mid x, w_0) = w_0 \cdot R$, while unconfoundedness gives $E\big(y(1) \mid x, w_1\big) = E\big(y(1) \mid x\big)$ and $E\big(y(0) \mid x, w_0\big) = E\big(y(0) \mid x\big)$. Therefore, we can simplify the above expression into

$$
E\left[ \frac{w_1}{G} \, E\big(y(1) \mid x\big) + \frac{w_0}{1-G} \, E\big(y(0) \mid x\big) \right]
$$

Another application of iterated expectations gives us

$$
E\left[ \frac{\mu_1(x)}{G} \, E(w_1 \mid x) + \frac{\mu_0(x)}{1-G} \, E(w_0 \mid x) \right]
= E\big[ \mu_1(x) + \mu_0(x) \big]
= E[y(1)] + E[y(0)]
$$

Manipulating the RHS of (J.14) using iterated expectations gives us

$$
E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \left\{ \frac{w_1}{G} \cdot \frac{1}{R} \, E(s \mid x, w_1) + \frac{w_0}{1-G} \cdot \frac{1}{R} \, E(s \mid x, w_0) \right\} \right]
= E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right] + E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_0}{1-G} \right]
$$

Therefore, combining the LHS and RHS gives the result

$$
E[y(1)] + E[y(0)]
= E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right] + E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_0}{1-G} \right] \tag{J.16}
$$
Now, consider the LHS of (J.15):

$$
E\left[ \frac{s \, w_1}{R \, G} \, y \right] = E\left[ \frac{s \, w_1}{R \, G} \, y(1) \right] = E[y(1)] \quad \text{by the LIE}
$$

Similarly, using the LIE, the RHS of (J.15) can be rewritten as

$$
E\left[ \frac{s \, w_1}{R \, G} \, h(\alpha^* + x\beta^* + \eta^* w_1) \right]
= E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \cdot \frac{1}{R} \, E(s \mid x, w_1) \right]
= E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right]
$$

Therefore, combining the LHS and RHS gives us

$$
E[y(1)] = E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right] \tag{J.17}
$$

Then using (J.17) along with (J.16) implies that

$$
E[y(0)] = E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_0}{1-G} \right] \tag{J.18}
$$

Consider

$$
E\big[ h(\alpha^* + x\beta^* + \eta^* w_1) \, w_1 \mid x \big]
= h(\alpha^* + x\beta^* + \eta^*) \cdot P(w_1 = 1 \mid x)
= h(\alpha^* + x\beta^* + \eta^*) \cdot G
$$

Therefore,

$$
E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_1}{G} \right] = E\big[ h(\alpha^* + x\beta^* + \eta^*) \big]
$$

Similarly, we can also show that

$$
E\left[ h(\alpha^* + x\beta^* + \eta^* w_1) \, \frac{w_0}{1-G} \right] = E\big[ h(\alpha^* + x\beta^*) \big]
$$

Hence, the pooled regression adjustment estimand can be written as

$$
\tau_{PRA} = E\big[ h(\alpha^* + x\beta^* + \eta^*) \big] - E\big[ h(\alpha^* + x\beta^*) \big]
$$

so a consistent QMLE pooled regression adjustment estimator can be obtained by replacing the population expectations with sample averages in the above expression, weighting by the appropriate probabilities where needed to recover the balance of the random sample, which gives us

$$
\hat\tau_{PRA} = \frac{1}{N} \sum_{i=1}^{N} h\big(\hat\alpha + x_i\hat\beta + \hat\eta\big) - \frac{1}{N} \sum_{i=1}^{N} h\big(\hat\alpha + x_i\hat\beta\big) \quad \square
$$

Separate slopes

Proof. Let us assume that $m_g(x, \theta_g) = h(\alpha_g + x\beta_g)$ is the chosen mean function for $\mu_g(x)$. Then the population FOCs are

$$
E\left[ \frac{s \, w_1}{R \, G} \Big( y - h(\alpha_1^* + x\beta_1^*) \Big) \right] = 0 \tag{J.19}
$$
$$
E\left[ \frac{s \, w_0}{R \, (1-G)} \Big( y - h(\alpha_0^* + x\beta_0^*) \Big) \right] = 0 \tag{J.20}
$$

where $\alpha_g^*$, $\beta_g^*$ are the probability limits of the QMLE estimators $\hat\alpha_g$, $\hat\beta_g$. Rearranging (J.19) and (J.20) just like in the pooled case gives us the following equalities:

$$
E\left[ \frac{s \, w_1}{R \, G} \, y \right] = E\left[ \frac{s \, w_1}{R \, G} \, h(\alpha_1^* + x\beta_1^*) \right] \tag{J.21}
$$
$$
E\left[ \frac{s \, w_0}{R \, (1-G)} \, y \right] = E\left[ \frac{s \, w_0}{R \, (1-G)} \, h(\alpha_0^* + x\beta_0^*) \right] \tag{J.22}
$$

Proceeding with the above two equations in the same way as in the pooled case gives us the results

$$
E[y(1)] = E\big[ h(\alpha_1^* + x\beta_1^*) \big], \qquad E[y(0)] = E\big[ h(\alpha_0^* + x\beta_0^*) \big]
$$

Therefore,

$$
\tau_{FRA} = E\big[ h(\alpha_1^* + x\beta_1^*) \big] - E\big[ h(\alpha_0^* + x\beta_0^*) \big]
$$

and so a consistent QMLE separate regression adjustment estimator can be obtained as

$$
\hat\tau_{FRA} = \frac{1}{N} \sum_{i=1}^{N} h\big(\hat\alpha_1 + x_i\hat\beta_1\big) - \frac{1}{N} \sum_{i=1}^{N} h\big(\hat\alpha_0 + x_i\hat\beta_0\big)
$$

Consistency of $\hat\tau_{PRA}$ for $\tau_{PRA}$ and of $\hat\tau_{FRA}$ for $\tau_{FRA}$ follows from the results on double weighting and from generalized linear model properties. Recall that the framework of this paper does not rely on correct specification of a conditional mean of the distribution: both cases are allowed, one in which the mean function is correctly specified but everything else about the distribution is misspecified, and one in which everything, including the mean, is misspecified. In both cases, results from quasi-maximum likelihood in the linear exponential family are instrumental in guaranteeing consistency of the pooled and separate slopes methods. $\square$
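To make the pooled and separate estimators concrete, here is a minimal sketch with a logistic mean $h(z) = 1/(1 + e^{-z})$ (appropriate for binary or fractional responses), assuming that the fitted selection and propensity probabilities `R_hat` and `G_hat` come from first-stage models estimated elsewhere; all function and variable names are illustrative, not the chapter's own code.

```python
import numpy as np
from scipy.optimize import minimize

def h(z):
    """Logistic mean function."""
    return 1.0 / (1.0 + np.exp(-z))

def weighted_bernoulli_qmle(y, Z, w):
    """Maximize the weighted Bernoulli quasi-log-likelihood (LEF family),
    whose first order conditions match the doubly weighted FOCs above."""
    def neg_qll(theta):
        m = np.clip(h(Z @ theta), 1e-10, 1 - 1e-10)
        return -np.sum(w * (y * np.log(m) + (1.0 - y) * np.log(1.0 - m)))
    return minimize(neg_qll, np.zeros(Z.shape[1]), method="BFGS").x

def doubly_weighted_ra(y, x, w1, s, R_hat, G_hat):
    """Return (tau_PRA_hat, tau_FRA_hat) under double weighting."""
    y = np.where(s == 1, y, 0.0)             # y gets zero weight when s = 0
    w0 = 1.0 - w1
    wt1 = s * w1 / (R_hat * G_hat)           # treated, selected
    wt0 = s * w0 / (R_hat * (1.0 - G_hat))   # control, selected
    N = len(y)
    # Pooled slopes: h(alpha + x beta + eta w1) on the combined weighted sample
    Zp = np.column_stack([np.ones(N), x, w1])
    tp = weighted_bernoulli_qmle(y, Zp, wt1 + wt0)
    alpha, beta, eta = tp[0], tp[1:-1], tp[-1]
    tau_pra = np.mean(h(alpha + x @ beta + eta) - h(alpha + x @ beta))
    # Separate slopes: h(alpha_g + x beta_g) within each assignment arm
    Zs = np.column_stack([np.ones(N), x])
    t1 = weighted_bernoulli_qmle(y, Zs, wt1)
    t0 = weighted_bernoulli_qmle(y, Zs, wt0)
    tau_fra = np.mean(h(Zs @ t1) - h(Zs @ t0))
    return tau_pra, tau_fra
```

Note that the final averages run over the full random sample of $x_i$, mirroring $\hat\tau_{PRA}$ and $\hat\tau_{FRA}$ above, while only selected observations carry weight in the QMLE steps.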
Asymptotic variance expression for ATE: Correct CEF

Proof. Assuming continuous differentiability of $m_g(x_i, \theta_g)$ on $\Theta_g$, a mean value expansion around $\theta_g^0$ gives

$$
\frac{1}{N}\sum_{i=1}^{N} m_g(x_i, \hat\theta_g)
= \frac{1}{N}\sum_{i=1}^{N} m_g(x_i, \theta_g^0)
+ \left[\frac{1}{N}\sum_{i=1}^{N} \nabla_{\theta_g} m_g(x_i, \ddot\theta_g)\right] \big(\hat\theta_g - \theta_g^0\big)
$$

where $\ddot\theta_g$ lies between $\hat\theta_g$ and $\theta_g^0$. Since $\hat\theta_g \xrightarrow{p} \theta_g^0$, so does $\ddot\theta_g$. Hence, using the weak law of large numbers, we obtain

$$
\frac{1}{\sqrt N}\sum_{i=1}^{N} m_g(x_i, \hat\theta_g)
= \frac{1}{\sqrt N}\sum_{i=1}^{N} m_g(x_i, \theta_g^0)
+ E\big[\nabla_{\theta_g} m_g(x_i, \theta_g^0)\big] \cdot \sqrt N \big(\hat\theta_g - \theta_g^0\big) + o_p(1)
$$

Adding and subtracting $\sqrt N \cdot E\big[m_g(x_i, \theta_g^0)\big]$ on both sides gives us

$$
\frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{m_g(x_i, \hat\theta_g) - E\big(m_g(x_i, \theta_g^0)\big)\Big\}
= \frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{m_g(x_i, \theta_g^0) - E\big(m_g(x_i, \theta_g^0)\big)\Big\}
+ E\big[\nabla_{\theta_g} m_g(x_i, \theta_g^0)\big] \cdot \sqrt N \big(\hat\theta_g - \theta_g^0\big) + o_p(1)
$$

Let $E\big[\nabla_{\theta_g} m_g(x_i, \theta_g^0)\big] \equiv G_g^0$. Then, using the asymptotic results from Section 3.5, where we posit that the conditional feature of interest is correctly specified, we have

$$
\sqrt N \big(\hat\theta_1 - \theta_1^0\big) = -H_1^{-1}\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} l_i\right) + o_p(1), \qquad
\sqrt N \big(\hat\theta_0 - \theta_0^0\big) = -H_0^{-1}\left(\frac{1}{\sqrt N}\sum_{i=1}^{N} k_i\right) + o_p(1)
$$

Therefore,

$$
\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big)
= \frac{1}{\sqrt N}\sum_{i=1}^{N}\Big(\big\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big\} - G_1^0 H_1^{-1} l_i + G_0^0 H_0^{-1} k_i\Big) + o_p(1)
$$

We may rewrite the above using the influence function representation as

$$
\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big) = \frac{1}{\sqrt N}\sum_{i=1}^{N} \psi(x_i) + o_p(1), \qquad \text{where } E\big[\psi(x_i)\big] = 0
$$

Then, provided that $E\big[\psi(x_i)\psi(x_i)'\big]$ exists,

$$
\mathrm{Avar}\Big[\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big)\Big]
= E\Big[\big(m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big)^2\Big]
+ G_1^0 \, V_1 \, G_1^{0\prime} + G_0^0 \, V_0 \, G_0^{0\prime}
$$

where $V_g$ denotes the asymptotic variance of $\sqrt N(\hat\theta_g - \theta_g^0)$. Note that the covariance term involving $l_i$ and $k_i$ is zero since the two denote scores for the treatment and control group problems. The covariance terms involving $\big(m_1(x_i,\theta_1^0) - m_0(x_i,\theta_0^0) - \tau_{ate}\big)$ and $l_i$, and involving $\big(m_1(x_i,\theta_1^0) - m_0(x_i,\theta_0^0) - \tau_{ate}\big)$ and $k_i$, will also be zero. This is because $\theta_g^0$ solves the conditional problem, which implies that $E\big[\nabla_{\theta_g} q(y_i(g), x_i, \theta_g^0)' \mid x_i\big] = 0$ (i.e., for $g = 1$, $E(l_i \mid x_i) = 0$ and, for $g = 0$, $E(k_i \mid x_i) = 0$). Then, using the LIE, those covariance terms can be shown to be zero.
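Given estimates of the pieces entering $\psi$, the asymptotic variance can be estimated by the sample second moment of the plugged-in influence function. A sketch under the correct-CEF representation above, with all inputs (fitted means, per-observation scores, $\hat G_g^0$, $\hat H_g^{-1}$) assumed precomputed and all names illustrative:

```python
import numpy as np

def avar_tau_hat(m1, m0, tau_hat, G1, H1_inv, l_scores, G0, H0_inv, k_scores):
    """Plug-in estimate of Avar(sqrt(N)(tau_hat - tau_ate)) via
    psi_i = {m1_i - m0_i - tau} - G1 H1^{-1} l_i + G0 H0^{-1} k_i.
    m1, m0 are length-N arrays of fitted means; l_scores, k_scores are
    (N, K) arrays of score rows; G1, G0 are length-K Jacobian vectors."""
    # Hessians are symmetric, so G H^{-1} l_i equals the dot product of
    # l_i with H^{-1} G.
    psi = (m1 - m0 - tau_hat
           - l_scores @ (H1_inv @ G1)
           + k_scores @ (H0_inv @ G0))
    return float(np.mean(psi ** 2))
```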
Misspecified mean model

In the case of a misspecified mean model, we still have

$$
\frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{m_g(x_i, \hat\theta_g) - E\big(m_g(x_i, \theta_g^0)\big)\Big\}
= \frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{m_g(x_i, \theta_g^0) - E\big(m_g(x_i, \theta_g^0)\big)\Big\}
+ E\big[\nabla_{\theta_g} m_g(x_i, \theta_g^0)\big] \cdot \sqrt N \big(\hat\theta_g - \theta_g^0\big) + o_p(1)
$$

However, now, using the results from Section 3.4,

$$
\sqrt N \big(\hat\theta_1 - \theta_1^0\big)
= -H_1^{-1} \frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{ l_i - E\big(l_i b_i'\big) \big[E\big(b_i b_i'\big)\big]^{-1} b_i - E\big(l_i d_i'\big) \big[E\big(d_i d_i'\big)\big]^{-1} d_i \Big\} + o_p(1)
\equiv -H_1^{-1} \frac{1}{\sqrt N}\sum_{i=1}^{N} u_{i1} + o_p(1)
$$
$$
\sqrt N \big(\hat\theta_0 - \theta_0^0\big)
= -H_0^{-1} \frac{1}{\sqrt N}\sum_{i=1}^{N} \Big\{ k_i - E\big(k_i b_i'\big) \big[E\big(b_i b_i'\big)\big]^{-1} b_i - E\big(k_i d_i'\big) \big[E\big(d_i d_i'\big)\big]^{-1} d_i \Big\} + o_p(1)
\equiv -H_0^{-1} \frac{1}{\sqrt N}\sum_{i=1}^{N} u_{i0} + o_p(1)
$$

Then,

$$
\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big)
= \frac{1}{\sqrt N}\sum_{i=1}^{N}\Big(\big\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big\} - G_1^0 H_1^{-1} u_{i1} + G_0^0 H_0^{-1} u_{i0}\Big) + o_p(1)
= \frac{1}{\sqrt N}\sum_{i=1}^{N} \psi(x_i) + o_p(1)
$$

Then,

$$
\mathrm{Avar}\Big[\sqrt N \big(\hat\tau_{ate} - \tau_{ate}\big)\Big]
= E\Big[\big(m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big)^2\Big]
+ G_1^0 \, V_1 \, G_1^{0\prime} + G_0^0 \, V_0 \, G_0^{0\prime}
$$
$$
\qquad - 2\, E\Big[\big\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big\}\, u_{i1}'\Big] H_1^{-1\prime} G_1^{0\prime}
+ 2\, E\Big[\big\{m_1(x_i, \theta_1^0) - m_0(x_i, \theta_0^0) - \tau_{ate}\big\}\, u_{i0}'\Big] H_0^{-1\prime} G_0^{0\prime} \quad \square
$$
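The only new ingredients relative to the correct-CEF case are the projected scores $u_{ig}$. Below is a sketch of their sample analogues, assuming `l`, `b`, and `d` stack the per-observation score rows from the outcome problem and the two first-stage (selection and propensity) problems; the names are illustrative.

```python
import numpy as np

def projected_scores(l, b, d):
    """Sample analogue of
    u_i = l_i - E(l b')[E(b b')]^{-1} b_i - E(l d')[E(d d')]^{-1} d_i.
    l, b, d are (N, K_l), (N, K_b), (N, K_d) arrays of score rows; the 1/N
    factors in the sample moments cancel within each projection matrix."""
    P_b = (l.T @ b) @ np.linalg.inv(b.T @ b)   # sample E(l b') [E(b b')]^{-1}
    P_d = (l.T @ d) @ np.linalg.inv(d.T @ d)   # sample E(l d') [E(d d')]^{-1}
    return l - b @ P_b.T - d @ P_d.T
```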