TRADE-OFFS IN NON-LINEAR MODELS AND ESTIMATION STRATEGIES

By Alyssa Helen Carlson

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics — Doctor of Philosophy

2019

ABSTRACT

This dissertation examines the assumptions presumed throughout the literature to establish valid estimation procedures for non-linear models. The following three chapters address issues of identification, consistent and efficient estimation, and the incorporation of heteroskedasticity and serial correlation for binary response models in cross-sectional and panel data settings.

Chapter 1: Parametric Identification of Multiplicative Exponential Heteroskedasticity

Multiplicative exponential heteroskedasticity is commonly seen in latent variable models such as Probit or Logit, where correctly modelling the heteroskedasticity is imperative for consistent parameter estimates. However, the literature appears to lack a formal proof of point identification for the parametric model. This chapter presents several examples showing that the conditions presumed throughout the literature are not sufficient for identification and, as a contribution, provides proofs of point identification for common specifications.

Chapter 2: Relaxing Conditional Independence in an Endogenous Binary Response Model

For binary response models, control function estimators are a popular approach to addressing endogeneity. But these estimators rely on a Control Function assumption that imposes Conditional Independence (CF-CI) to obtain identification. CF-CI places restrictions on the relationship between the latent error and the instruments that are unlikely to hold in an empirical context. In particular, the literature has noted that CF-CI imposes homoskedasticity with respect to the instruments.
This chapter identifies the consequences of CF-CI, provides examples that motivate relaxing CF-CI, and proposes a new consistent estimator under weaker assumptions than CF-CI. The proposed method is illustrated in an application estimating the effect of non-wife income on married women's labor supply.

Chapter 3: Behavior of Pooled and Joint Estimators in Probit Model with Random Coefficients and Serial Correlation

This chapter compares a pooled maximum likelihood estimator (PMLE) to a joint (full) maximum likelihood estimator (JMLE), the dominant estimation method for mixture models, for dealing with potential individual-specific heterogeneity and serial correlation in a binary response Probit Mixture model. The JMLE is more statistically efficient but computationally demanding, and implementation becomes more difficult if one tries to model the serial correlation over time. On the other hand, the PMLE is computationally simple and robust to arbitrary forms of serial correlation. Focusing on the Average Partial Effects, this chapter finds it imperative that the model allow the individual-specific heterogeneity to be potentially correlated with the covariates (not a standard specification in Mixture models). Moreover, the JMLE can produce quite satisfactory estimates that appear robust to serial correlation even under misspecification of the likelihood function. Results are illustrated in an application estimating the effects of different interventions on high-risk men's behavior, complementing the original study of Blattman, Jamison, and Sheridan (2017).

ACKNOWLEDGMENTS

First and foremost, I would like to thank the chair of my dissertation committee, Jeff Wooldridge, for all of his advice, encouragement, and helpful critiques. I would also like to thank Kyoo Il Kim, Joe Herriges, and Nicole Mason for serving on my committee and providing valuable feedback and assistance.
I also appreciate the comments of seminar participants at Michigan State University, the Econometrics Reading Group at MSU, Grand Valley State University, the 2018 MEA Conference, the 2018 and 2019 Annual Meetings of the Midwest Econometrics Group and the corresponding Women's Mentoring Workshops, and the 2018 International Association of Applied Econometrics Conference. I am especially grateful for the financial support I received from the Graduate School and the Department of Economics at Michigan State University, including the Goodman Fellowship, Summer Research Fellowship, and Dissertation Completion Fellowship. I also appreciate the support and advice that Lori Jean Nichols, Steven Haider, and Mike Conlin all gave me as I navigated the graduate program and job market. I am also grateful to my friends and colleagues at Michigan State for making my graduate experience so memorable. I am truly thankful for my endlessly supportive parents Lance and Chim Carlson, whose love and encouragement helped me at every step of my life. Finally, I am especially grateful to my partner, Thom, for picking up his life and moving across the country to start a new adventure with me (multiple times), as well as all the countless ways he has supported my endeavors over the years.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Introduction
Chapter 1  Parametric Identification of Multiplicative Exponential Heteroskedasticity
  1.1 Introduction
  1.2 Identification when there is no bijective transformation
  1.3 No identification when there is a bijective transformation
  1.4 Identification in a common specification
  1.5 Conclusion
Chapter 2  Relaxing Conditional Independence in an Endogenous Binary Response Model
  2.1 Introduction
  2.2 Background and Motivation
  2.3 Model Set Up
  2.4 General Control Function
    2.4.1 Identification
    2.4.2 Simulation: General Control Function in the Demand for Premium Cable
  2.5 Estimation and Interpretation
    2.5.1 Asymptotic Properties
    2.5.2 Average Structural Function
    2.5.3 Average Partial Effects
    2.5.4 Simulation: ASF Estimates for the Effect of Income on Homeownership
  2.6 Empirical Example
  2.7 Extension: Semi-Parametric Distribution Free Estimator
    2.7.1 Observational Equivalence and Identification
    2.7.2 Asymptotic Properties
    2.7.3 Simulation
  2.8 Conclusion
Chapter 3  Behavior of Pooled and Joint Estimators in Probit Model with Random Coefficients and Serial Correlation
  3.1 Introduction
  3.2 Model Set Up
  3.3 Estimation Methods
    3.3.1 Mixed Effects Probit
    3.3.2 Pooled Heteroskedastic Probit
  3.4 Average Partial Effects
  3.5 Simulation
    3.5.1 Computational Results
    3.5.2 Parameter Estimates
    3.5.3 Average Partial Effect Estimates
    3.5.4 ASF
  3.6 Application
  3.7 Discussion
    3.7.1 AR(2)
    3.7.2 No Random Effects
    3.7.3 Logit
  3.8 Conclusion
APPENDICES
  APPENDIX A Figures for Chapter 1
  APPENDIX B Proofs and Notation for Chapter 2
  APPENDIX C Simulation Details for Chapter 2
  APPENDIX D Figures for Chapter 2
  APPENDIX E Tables for Chapter 2
  APPENDIX F Figures for Chapter 3
  APPENDIX G Tables for Chapter 3
BIBLIOGRAPHY

LIST OF TABLES

Table E.1: Summary Statistics
Table E.2: Comparison of Logit Parameter Estimates
Table E.3: Comparison of Price Elasticity Estimates
Table E.4: Comparison of Summary Statistics
Table E.5: Comparison of Parameter Estimates
Table E.6: APE Results and Simulated Distribution (True APE = 0.6448)
Table E.7: Summary Statistics
Table E.8: Coefficient Estimates for Married Women's LFP
Table E.9: Wald Test Results
Table E.10: APE Estimates for Non-Wife Income effect on Wife's LFP
Table E.11: Logistic Distribution (h^1_o = v_{2i})
Table E.12: Uniform Distribution (h^1_o = v_{2i})
Table E.13: Student T Distribution (h^1_o = v_{2i})
Table E.14: Gaussian Mixture Distribution (h^1_o = v_{2i})
Table E.15: Logistic Distribution with Linear GCF (h^2_o)
Table E.16: Uniform Distribution with Linear GCF (h^2_o)
Table E.17: Student T Distribution with Linear GCF (h^2_o)
Table E.18: Gaussian Mixture Distribution with Linear GCF (h^2_o)
Table E.19: Logistic Distribution with Non-Parametric GCF (h^3_o)
Table E.20: Uniform Distribution with Non-Parametric GCF (h^3_o)
Table E.21: Student T Distribution with Non-Parametric GCF (h^3_o)
Table E.22: Gaussian Mixture Distribution with Non-Parametric GCF (h^3_o)
Table E.23: Heteroskedastic Logistic (h^1_o = v_{2i})
Table E.24: Heteroskedastic Logistic with Linear GCF (h^2_o)
Table E.25: Heteroskedastic Logistic with Non-Parametric GCF (h^3_o)
Table G.1: Estimation Times for DGP 1
Table G.2: Estimation Times for DGP 2
Table G.3: Estimation Times for DGP 3
Table G.4: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 1
Table G.5: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 2
Table G.6: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 3
Table G.7: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 1
Table G.8: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 2
Table G.9: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 3
Table G.10: Root Mean Square Error of β̂_{2σ} for Specification (2)
Table G.11: Bias and Std Deviation of Variance Component σ^2_2 for Specification (2)
Table G.12: Bias and Std Deviation (×10) of APE Estimates for DGP 1
Table G.13: Bias and Std Deviation (×10) of APE Estimates for DGP 2
Table G.14: Bias and Std Deviation (×10) of APE Estimates for DGP 3
Table G.15: Comparison of APE and PEA
Table G.16: Select Baseline Summary Statistics
Table G.17: Preliminary OLS Estimates
Table G.18: Scaled Probit Coefficient Estimates for Selling Drugs
Table G.19: Scaled Probit Coefficient Estimates for being Arrested
Table G.20: Scaled Probit Coefficient Estimates for Illicit Activity
Table G.21: ATE Estimates
Table G.22: Bias and Std Deviation of Scaled Coefficient Estimates under AR(2)
Table G.23: Bias and Std Deviation (×10) of APE Estimates under AR(2)
Table G.24: Failure Count under no Random Coefficients
Table G.25: Estimation Times under no Random Coefficients
Table G.26: Bias and Std Deviation (×10) of APE Estimates under no Random Coefficients
Table G.27: Variance Component σ^2_1 Estimates under no Random Coefficients
Table G.28: Variance Component σ^2_2 Estimates under no Random Coefficients
Table G.29: Rejection Rate of LR Test for Random Coefficients
Table G.30: Bias and Std Deviation of De-scaled ME Logit Estimates under a Conditional Logistic AR(1) Process
Table G.31: Bias and Std Deviation of De-scaled ME Logit Estimates under a Marginal Logistic AR(1) Process
Table G.32: Bias and Std Deviation of Scaled Coefficient Estimates under a Conditional Logistic AR(1) Process
Table G.33: Bias and Std Deviation of Scaled Coefficient Estimates under a Marginal Logistic AR(1) Process
Table G.34: Bias and Std Deviation (×10) of APE Estimates under a Conditional Logistic AR(1) Process
Table G.35: Bias and Std Deviation (×10) of APE Estimates under a Marginal Logistic AR(1) Process

LIST OF FIGURES

Figure A.1: Visual representation of bijective transformations
Figure A.2: Parameter estimates from two observationally equivalent models
Figure D.1: Effect of Heteroskedasticity on Parameter Estimate
Figure D.2: ASF for Income equal to $85,000
Figure D.3: ASF Estimates for Misspecified Models
Figure D.4: Consequence of CF-LI Assumption on ASF Estimates
Figure D.5: Comparison of ASF for Families with No Children
Figure D.6: Comparison of ASF for Families with Young Children Only
Figure D.7: Comparison of ASF for Families with Old Children Only
Figure D.8: Comparison of ASF for Families with Both Young and Old Children
Figure D.9: Logistic Distribution (h^1_o = v_{2i})
Figure D.10: Uniform Distribution (h^1_o = v_{2i})
Figure D.11: Student T Distribution (h^1_o = v_{2i})
Figure D.12: Gaussian Mixture Distribution (h^1_o = v_{2i})
Figure D.13: Logistic Distribution with Linear GCF (h^2_o)
Figure D.14: Uniform Distribution with Linear GCF (h^2_o)
Figure D.15: Student T Distribution with Linear GCF (h^2_o)
Figure D.16: Gaussian Mixture Distribution with Linear GCF (h^2_o)
Figure D.17: Logistic with Non-Parametric GCF (h^3_o)
Figure D.18: Uniform with Non-Parametric GCF (h^3_o)
Figure D.19: Student T with Non-Parametric GCF (h^3_o)
Figure D.20: Gaussian Mixture with Non-Parametric GCF (h^3_o)
Figure D.21: Heteroskedastic Logistic (h^1_o = v_{2i})
Figure D.22: Heteroskedastic Logistic with Linear GCF (h^2_o)
Figure D.23: Heteroskedastic Logistic with Non-Parametric GCF (h^3_o)
Figure F.1: Distribution of σ̂^2_1 for T=5 under DGP1
Figure F.2: Distribution of σ̂^2_1 for T=10 under DGP1
Figure F.3: Distribution of σ̂^2_1 for T=20 under DGP1
Figure F.4: Distribution of σ̂^2_1 for T=5 under DGP2
Figure F.5: Distribution of σ̂^2_1 for T=10 under DGP2
Figure F.6: Distribution of σ̂^2_1 for T=20 under DGP2
Figure F.7: Distribution of σ̂^2_1 for T=5 under DGP3
Figure F.8: Distribution of σ̂^2_1 for T=10 under DGP3
Figure F.9: Distribution of σ̂^2_1 for T=20 under DGP3
Figure F.10: ASF Estimates for T=5 under DGP1
Figure F.11: ASF Estimates for T=10 under DGP1
Figure F.12: ASF Estimates for T=20 under DGP1
Figure F.13: ASF Estimates for T=5 under DGP2
Figure F.14: ASF Estimates for T=10 under DGP2
Figure F.15: ASF Estimates for T=20 under DGP2
Figure F.16: ASF Estimates for T=5 under DGP3
Figure F.17: ASF Estimates for T=10 under DGP3
Figure F.18: ASF Estimates for T=20 under DGP3
Figure F.19: ATE for Selling Drugs
Figure F.20: ATE for Being Arrested
Figure F.21: ATE for Engaging in Illicit Activities
Figure F.22: Distribution of σ̂^2_1 for T=5 under AR(2)
Figure F.23: Distribution of σ̂^2_1 for T=10 under AR(2)
Figure F.24: Distribution of σ̂^2_1 for T=20 under AR(2)

Introduction

When the outcome has restricted support – non-negative, binary, discrete, etc. – non-linear models are used to better capture the underlying data generating process, which a linear model can at best approximate. A common example is a binary response model, where the threshold latent variable set-up is more reasonable than a linear approximation in which the predicted outcomes could fall outside the 0 and 1 bounds for probabilities. But unlike the linear regression, non-linear models are not as well understood when the standard assumptions fail to hold. This dissertation addresses three settings in which the standard assumptions fail to hold: heteroskedasticity, endogeneity, and a panel setting with random coefficients and serial correlation.
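The out-of-bounds problem with a linear approximation can be sketched numerically. The following is an illustrative simulation, not taken from the dissertation; all parameter values (the probit slope of 1.2, the covariate scale) are hypothetical choices:

```python
# Illustrative sketch (hypothetical parameter values): a linear probability
# model fit by OLS to binary data generated from a probit can produce fitted
# "probabilities" outside the [0, 1] bounds.
import numpy as np
from math import erf, sqrt

def norm_cdf(t):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 2.0, n)                        # covariate with wide support
p_true = np.array([norm_cdf(1.2 * xi) for xi in x])
y = (rng.random(n) < p_true).astype(float)         # binary outcome from a probit DGP

# OLS "linear probability model": y = a + b*x + error
X = np.column_stack([np.ones(n), x])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = a + b * x

outside = np.mean((fitted < 0) | (fitted > 1))
print(f"share of LPM fitted values outside [0, 1]: {outside:.2f}")
```

By construction the probit probabilities `p_true` stay inside [0, 1], while a non-trivial share of the OLS fitted values fall outside it on this wide-support covariate.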
In the first chapter, I address the identification of a multiplicative exponential heteroskedastic model. Although the results are presented in a general setting, introducing heteroskedasticity in a binary response model is usually done through multiplicative exponential heteroskedasticity. Unlike in the linear regression, heteroskedasticity – where the variance of the latent error depends on the covariates – does not just influence the calculation of the asymptotic variance (and consequently the standard error estimates); it changes the conditional mean function in the log-likelihood. This means that ignoring heteroskedasticity in a binary response model will result in inconsistent parameter estimates, not just inaccurate standard error estimates (see Figure D.1). Moreover, introducing a heteroskedastic specification can capture more flexibility in the conditional mean function, as seen in Chapters 2 and 3. In Chapter 2, I utilize an observational equivalence result from Khan (2013) that essentially implies flexibly specified heteroskedasticity allows for distributional misspecification in the latent error. In Chapter 3, a pooled Probit estimator allows for random coefficients through a heteroskedastic specification. Both of these examples provide further motivation for the utility of a heteroskedastic binary response model. But I find the assumptions in the literature are not sufficient to guarantee identification of a multiplicative exponential heteroskedastic model. I provide a proof of identification for a linear-in-parameters specification that will be utilized in the later chapters.

The next two chapters propose and compare estimation procedures for specific binary response settings. In both cases, the alternative estimators considered from the literature are built upon a set of fairly restrictive assumptions. I examine the assumptions underlying the estimators and ask if they are realistic empirically.
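The first of these points, that ignoring multiplicative exponential heteroskedasticity makes probit estimates inconsistent while modelling it restores consistency, can be illustrated with a short simulation. This is my own hypothetical sketch (the parameter values, sample size, and use of scipy's optimizer are assumptions, not the dissertation's code):

```python
# Hypothetical simulation sketch: latent error scaled by exp(delta * x).
# A probit that ignores the scaling fits the wrong conditional mean; a
# heteroskedastic probit that models it recovers the true parameters.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=n)
beta0, beta1, delta = 1.0, 1.0, 0.5
u = np.exp(delta * x) * rng.normal(size=n)   # multiplicative exponential heteroskedasticity
y = (beta0 + beta1 * x - u > 0).astype(float)

def negll(params, heteroskedastic):
    # negative log-likelihood of the (possibly heteroskedastic) probit
    if heteroskedastic:
        a, b, d = params
        index = (a + b * x) / np.exp(d * x)
    else:
        a, b = params
        index = a + b * x
    p = np.clip(norm.cdf(index), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit_hom = minimize(negll, x0=[0.0, 0.5], args=(False,), method="BFGS")
fit_het = minimize(negll, x0=[0.0, 0.5, 0.0], args=(True,), method="BFGS")

print("homoskedastic probit estimates:  ", fit_hom.x)
print("heteroskedastic probit estimates:", fit_het.x)  # roughly (1.0, 1.0, 0.5)
```

With the heteroskedastic likelihood correctly specified, the three parameters are recovered up to sampling error; the homoskedastic fit has no parameterization that matches the non-monotone conditional mean Φ((1 + x)/exp(0.5x)).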
In both cases, I find simple scenarios in which the underlying assumptions would be violated. In the second chapter, the conditional independence assumption for the control function estimators is violated when there is heteroskedasticity. In the third chapter, the joint maximum likelihood estimator is inconsistent in the presence of serial correlation. Given these limitations, an alternative estimation procedure is proposed in each case. This dissertation aims to supply empirical economists with the estimation tools they need to address these complex issues that commonly arise in binary response estimation.

Chapter 1

Parametric Identification of Multiplicative Exponential Heteroskedasticity

1.1 Introduction

Multiplicative exponential heteroskedasticity was first proposed by Harvey (1976) in the context of a linear conditional mean model. Estimation is undertaken in two stages, requiring first an argument that the conditional variance parameters are identified and then showing that a weighted least squares estimator identifies the conditional mean parameters. More recently, multiplicative exponential functions have been used to model heteroskedasticity in the latent errors of binary response models. However, in the cases of heteroskedastic Logit and Probit, the parameters in the conditional variance function are estimated concurrently with the coefficients of interest, requiring joint identification of the parameters. Standard textbooks such as Greene (2011) and Wooldridge (2010) state that these models are estimable under fairly standard conditions and are more flexible in the specification of the conditional mean function compared to standard Probit and Logit models, leading to widespread use in empirical work. However, the literature has yet to provide proofs of parametric identification.

To fill this gap in the literature, this chapter explores the issue of identification in models with exponential heteroskedasticity.
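Harvey's two-stage idea for the linear model can be sketched as follows. This is an illustrative reconstruction under assumed parameter values, not code from the chapter: the variance parameters are estimated from a regression of log squared OLS residuals on the variance covariate, and the mean parameters are then re-estimated by weighted least squares.

```python
# Illustrative two-stage sketch (hypothetical parameters) for a linear model
# with Var(e|z) = exp(2*delta*z).
import numpy as np

rng = np.random.default_rng(2)
n = 50000
x = rng.normal(size=n)
z = x                                        # variance covariate (a function of x here)
delta = 0.6
e = np.exp(delta * z) * rng.normal(size=n)   # multiplicative exponential heteroskedasticity
y = 2.0 + 1.5 * x + e

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# Stage 1: OLS residuals, then regress log squared residuals on z.
# log(e^2) = 2*delta*z + log(eps^2); the E[log eps^2] constant is absorbed
# by the intercept, so half the slope estimates delta.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b_ols
g = np.linalg.lstsq(Z, np.log(resid**2), rcond=None)[0]
delta_hat = g[1] / 2.0

# Stage 2: weighted least squares with weights 1/exp(2*delta_hat*z).
w = np.exp(-2.0 * delta_hat * z)
b_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))

print("delta_hat:", delta_hat)   # roughly 0.6
print("b_wls:", b_wls)           # roughly (2.0, 1.5)
```

The sequential structure is what distinguishes this linear case from the binary response case discussed next, where the variance and mean parameters must be identified jointly.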
Although the results may be applied to any model with a multiplicative exponential component, I will use the example of a heteroskedastic binary response model throughout the chapter to give the identification proofs some context. Consider the standard latent binary response model set-up:

Y = 1{Xβo − U > 0}

where U is heteroskedastic with conditional variance

Var(U|Z) = exp(2Zδo)

where Z may include functions of X. Presuming that there is no endogeneity in the usual sense, E(U|X, Z) = 0, and that the scaled latent error, U/exp(Zδo), is independent of (X, Z), the heteroskedastic binary response model has the following conditional probability distribution:

f(y|X, Z, θo) = Φ(Xβo/exp(Zδo))^y · [1 − Φ(Xβo/exp(Zδo))]^(1−y)    (1.1)

where Φ, the known cumulative distribution function of U/exp(Zδo), is monotonic and has support on the real line. If we assume a normal distribution, this is the individual likelihood for a heteroskedastic Probit model, and if we assume a logistic distribution, this is the individual likelihood for the heteroskedastic Logit model.

The following restates the identification definition from Newey and McFadden (1994) for MLE.¹

Definition 1.1.1 (Identification). Let f(y|X, Z, θo) be the conditional probability distribution of Y defined over the measures of X and Z with positive probability. If θ ≠ θo in the parameter space Θ implies P(f(y|X, Z, θ) ≠ f(y|X, Z, θo)) > 0 over the measures of X and Z, then θo is point identified.

¹This discussion can easily be extended to the cases of NLLS or GMM.

If one were to assume that E(X′X) is non-singular and βo is non-zero, then identification requires:

For (β, δ) ∈ Θ, if X(βo − exp(Z(δ − δo))β) = 0 w.p. 1, then (β, δ) = (βo, δo)    (1.2)

where Θ is the joint parameter space. The above statement captures the fundamental identification requirement for models with exponential heteroskedasticity.
This chapter aims to clarify when this statement holds and under what necessary or sufficient conditions. The simplest case of identification is when X and Z are not a bijective transformation of each other, in the sense that the variation in Z cannot be entirely explained by X or vice versa. One of the main contributions of this chapter is to provide a formal proof of identification in this scenario under the standard conditions of the literature. A sufficient condition for X and Z to not be a bijective transformation of one another is to impose an exclusion restriction (in either X or Z). An exclusion restriction would require that one of the random variables in the vector X is not included in the vector Z, or vice versa. By doing so, variation in one of the random vectors has been introduced that cannot be perfectly explained by the other random vector. But when one allows for an arbitrary relationship between X and Z, the standard conditions are no longer sufficient for identification.

To provide some intuition: when X and Z are bijective transformations of each other, showing identification is difficult due to the non-linear nature of the problem. As noted in Lewbel (forthcoming), non-linearity can allow for multiple solutions to the statement in (1.2). Specifically, if the relationship between X and Z allows Xβo to be equal to a scaling of Xβ by exp(Z(δ − δo)), then the model is not identified. This chapter will look at two ways the scaling of Xβ by exp(Z(δ − δo)) can be manipulated: through the joint support of (X, Z) and through the functional form of the heteroskedasticity, exp(Zδo). Section 3 discusses several counter-examples in which the conditions presupposed in the literature are not sufficient for identification.

The non-identification result in section three can be compared to the literature on identification in a binary response model. Identification in this setting has been well-studied in several papers by Manski (1985, 1988).
In the earlier paper, Manski looks at identification of a binary response model under a median restriction. This method allowed for arbitrary heteroskedasticity in the latent error but would, at most, identify the scaled parameters βo/||βo||. Simply put, to obtain identification in his framework, every β in the parameter space with β/||β|| ≠ βo/||βo|| must satisfy P(sgn(Xβ) ≠ sgn(Xβo)) > 0. Consequently, he provided non-identification results depending on the support of X. For instance, if Xβ had support bounded away from 0 (for all values of β in the parameter space), then βo is not identified. However, in our setting, the identification definition is based on the entire likelihood rather than just the sign of the linear index. So, to be clear, the non-identification results presented in this chapter are consequences of the highly non-linear specification of exponential multiplicative heteroskedasticity, as opposed to the limited information in the median restriction framework of Manski (1985).

Manski (1988) looks at identification of the scaled parameters βo in conjunction with the non-parametric identification of the conditional cumulative distribution of the latent error, F_{U|X}(·). Manski is able to show identification in the case of a known cumulative distribution function and statistical independence between U and X. But since there is heteroskedasticity in our setting, statistical independence does not hold. Manski also provides a non-identification result in the case of conditional mean independence and an unknown cumulative distribution function. Although in our setting the conditional distribution of the latent error is parametrically specified, the non-identification result would suggest that there should exist some conditional distribution specification in which identification does not hold. Therefore it is unsurprising that, even in this parametric setting, we can construct counter-examples in which identification is lost.
However, the non-identification results should not discourage the utilization of models with multiplicative heteroskedasticity in empirical work. The counter-examples provided are trivial in nature and are meant to highlight the non-existence of a general identification theorem for these models. As a helpful contribution, this chapter ends with a corollary that provides identification, under a bijective transformation relationship between the random vectors, for possibly the most commonly used specification.

1.2 Identification when there is no bijective transformation

Continuing with the example of a heteroskedastic binary response model described in equation (1.1), standard textbooks such as Greene (2011) and Wooldridge (2010) imply that the parameters are estimable² under the following conditions:

Condition 1. Z does not contain a constant.

Condition 2. E(X′X) is non-singular.

Condition 3. E(Z′Z) is non-singular.

²In this context, estimable is interpretively synonymous with point identified. However, neither text provides proofs of identification nor explicitly states that the models are point identified. Therefore the term "estimable" emphasizes the lack of rigorous treatment in the literature for identification in a parametric model.

Condition 1 implies the model is only identified up to scale. Alternatively, one could assume the normalization that one of the coefficients on X is equal to 1. Conditions 2 and 3 are needed in order to show that Xβo = Xβ and Zδo = Zδ imply βo = β and δo = δ, respectively. Additionally, although not commonly stated, identification requires

Condition 4. βo is non-zero.

Without this assumption, δo is not identified.³ This can easily be addressed by assuming a non-zero intercept as a location normalization. Under these assumptions, identification holds when X and Z are not bijective transformations of each other, as stated in the following theorem.

Theorem 1.2.1.
If Conditions 1-4 hold, and X and Z are not bijective transformations of each other, then the parameters βo and δo are point identified.

Before providing the proof, I will formally characterize ‘bijective transformation’.

Definition 1.2.1 (Bijective Transformation). X is a bijective transformation of Z (and equivalently Z is a bijective transformation of X) if there exists a bijective function f such that

X = f(Z), Z = f−1(X)

where f−1 denotes the inverse of f.

³ Suppose βo = 0; then Xβo is zero with probability 1, so, as long as β = 0, any δ ≠ δo satisfies X(βo − exp(Z(δ − δo))β) = 0 and therefore δo is not identified. This has fairly minor consequences since, in empirical work, researchers tend to be more interested in the coefficient parameters βo.

This definition can also be understood in terms of the support of X and Z. A bijective transformation would require that for every x in the support of X, there exists a unique z in the support of Z such that (x, z) occurs with positive probability in the joint support of (X, Z) and, for any z′ ≠ z in the support of Z, (x, z′) occurs with probability 0 in the joint support. Conversely, for every z in the support of Z there exists a unique x in the support of X such that (x, z) occurs with positive probability in the joint support of (X, Z) and, for any x′ ≠ x in the support of X, (x′, z) occurs with probability 0 in the joint support. This implies that the variation in X can be perfectly described by variation in Z. Figure A.1 visually shows what is implied by a bijective transformation and examples in which a bijective transformation does not hold.

Proof. Suppose there exists a (β, δ) ∈ Θ such that

X(βo − exp(Z(δ − δo))β) = 0    (1.3)

holds for almost all X and Z in their support. Since βo is non-zero and E(X′X) is non-singular, Xβo (and similarly Xβ) is non-zero with positive probability.
Consequently, Xβo/Xβ exists and is strictly positive with positive probability.⁴ Rearranging the equation above for a realization (x, z),

z(δo − δ) = ln(xβo/xβ)    (1.4)

where the realizations are in the following restricted support {(x, z) ∈ support(X, Z) : xβo and xβ are non-zero}. Since X and Z are not bijective transformations of each other, there must exist variation in either X or Z that cannot be explained by the other. When there is variation in Z not explained by X, there exists a realization of X for which more than one realization of Z occurs with positive probability in the joint support. This would allow for different realizations on the left hand side of the above equation while the right hand side is fixed at one possible realization. Since E(Z′Z) is non-singular, the above equation can only hold when δ = δo and consequently β = βo. Similar conclusions follow when there is variation in X not explained by Z. □

⁴ For equation (1.3) to hold, sign(Xβo) = sign(Xβ).

1.3 No identification when there is a bijective transformation

However, confining to the case where X and Z are not bijective transformations of one another is fairly restrictive. Return to the heteroskedastic binary response example where one is interested in modelling the mean of Y conditional on X. Let σ(X) denote the conditional standard deviation of the latent error, where it is reasonable to assume a double index model such that

E(Y|X) = Φ(Xβo/σ(X)) = Φ(Xβo/exp(Zδo))    (1.5)

where Z consists of bijective transformations of the elements in X.⁵ As mentioned before, to get around X and Z being bijective transformations, one could consider imposing an exclusion restriction. But this would require prior knowledge of which elements in X affect the conditional variance and which do not. Nevertheless, more generally, identification is not obtainable in the case of a bijective transformation under the previously stated conditions.
The following two counter-examples provide settings in which identification fails.

⁵ Klein and Vella (2009) discuss identification in the semi-parametric case where they use a re-indexing approach following Ichimura and Lee (1991).

Counter-example: binary support

Suppose X = (1, Z) where Z is a binary variable. Then the first part of statement (1.2) can be decomposed to

X(βo − exp(Z(δ − δo))β) =
    β1o − β1                              if Z = 0
    β1o + β2o − exp(δ − δo)(β1 + β2)      if Z = 1    (1.6)

The first part implies β1 = β1o. Plugging into the second part, equation (1.2) holds if β2 = exp(δo − δ)β2o − β1o(1 − exp(δo − δ)), which does not imply δ = δo or β2 = β2o. Even though Conditions 1-4 are satisfied, identification is lost because, under the binary support of Z, the parameters β2o and δo are inherently linked. Obviously, with binary support there is no possible way to separately identify a non-linear (the exponential component) effect from a linear effect. Therefore specifying an exponential multiplicative heteroskedastic model is naive and illogical in the binary support setting. In fact, it is not possible to discern any non-negative scale function as heteroskedasticity as opposed to the linear mean function, Xβo.⁶ Nevertheless, this concern needs to be addressed when determining conditions for identification.

⁶ For any two non-negative scale functions go(Z) and g(Z),

X(βo − (go(Z)/g(Z))β) =
    β1o − (go(0)/g(0))β1                      if Z = 0
    β1o + β2o − (go(1)/g(1))(β1 + β2)         if Z = 1

which implies β1 = (g(0)/go(0))β1o, and the second part holds as long as β2 = (g(1)/go(1))β2o + β1o((g(1)/go(1)) − (g(0)/go(0))), which does not imply g(Z) = go(Z) or β = βo. Consequently, any non-negative scale function cannot be identified as heteroskedasticity separately from a mean effect.

Counter-example: exponential transformation

Unlike the previous counter-example, which manipulates the support of (X, Z), this counter-example takes advantage of the functional form of the heteroskedasticity. Suppose X = (1, exp(Z)) and Z is univariate and continuous; then the first part of statement (1.2) becomes

X(βo − exp(Z(δ − δo))β) = β1o + exp(Z)β2o − exp(Z(δ − δo))β1 − exp(Z(1 + δ − δo))β2

If δ − δo = −1 and β1 = β2o = 0, then any values β1o = β2 make the above equation equal to 0 for all values of Z. Alternatively, if δ − δo = 1 and β1o = β2 = 0, then any values β1 = β2o also make the above equation equal to 0. This only holds for the exponential transformation because the heteroskedasticity is of exponential form. By imposing the same transformation in the mean term as in the heteroskedastic term, it becomes difficult to differentiate between the mean effect Xβo and the heteroskedastic effect exp(Zδo).

Non-identification in simulation

To illustrate the consequence of non-identification in estimation, the following simulation exercise uses the second counter-example to construct two observationally equivalent data generating processes for a heteroskedastic Probit model. Let Z ∼ N(0, 1) and X = exp(Z), then consider the following two data generating processes:

Y1 = 1{0 + 0.5X + U1 ≥ 0}, where U1 ∼ N(0, exp(4Z))    (1.7)
Y2 = 1{0.5 + U2 ≥ 0}, where U2 ∼ N(0, exp(2Z))    (1.8)

According to the analysis given above, these two models are observationally equivalent. The simulation randomly draws a sample of (X, Z) and then computes two different outcomes, Y1 and Y2, for the same independent variable sample. Then, using the hetprobit command in Stata, two estimations are performed, one using the outcomes Y1 from the first specification and the other using the outcomes Y2 from the second specification.
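The observational equivalence of (1.7) and (1.8) can be checked directly: both specifications imply the response probability Φ(0.5 exp(−Z)). A minimal stdlib-Python sketch of this check (the seed, sample size, and grid of z values are arbitrary choices, not part of the original design):

```python
import math
import random

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_spec1(z):
    # P(Y1 = 1 | Z = z) = Phi((0 + 0.5*exp(z)) / sd(U1)), sd(U1) = exp(2z)
    return norm_cdf(0.5 * math.exp(z) / math.exp(2.0 * z))

def p_spec2(z):
    # P(Y2 = 1 | Z = z) = Phi(0.5 / sd(U2)), sd(U2) = exp(z)
    return norm_cdf(0.5 / math.exp(z))

# Both reduce to Phi(0.5*exp(-z)) at every z
for z in [-2.0, -0.5, 0.0, 0.7, 1.5]:
    assert abs(p_spec1(z) - p_spec2(z)) < 1e-12

# Monte Carlo cross-check: outcome frequencies from the two latent models agree
random.seed(0)
n = 200_000
hits1 = hits2 = 0
for _ in range(n):
    z = random.gauss(0.0, 1.0)
    x = math.exp(z)
    u1 = random.gauss(0.0, math.exp(2.0 * z))  # sd(U1) = exp(2z), so Var = exp(4z)
    u2 = random.gauss(0.0, math.exp(z))        # sd(U2) = exp(z),  so Var = exp(2z)
    hits1 += (0.0 + 0.5 * x + u1 >= 0.0)
    hits2 += (0.5 + u2 >= 0.0)
print(hits1 / n, hits2 / n)  # two estimates of the same P(Y = 1)
```

No likelihood-based estimator can separate the two specifications, since they imply identical conditional distributions of the outcome.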
Figure A.2 shows the empirical distributions of the parameter estimates for a sample size of 1,000. This plainly demonstrates that the estimator cannot distinguish between the different parameter values that construct the two outcomes. Because the distributions of the estimates for Specification 1 and Specification 2 look identical, one could think there is no divergence in the parameter estimates within a sample between the two data generating processes, but looking at the difference of the two parameter estimates (Difference), there appears to be a trimodal distribution. The mass around 0 implies that the outcomes in the two data generating processes are similar enough that the estimation procedure calculates the same parameter values even though the data generating processes are formed using two different parameter values. The mass around -0.5 in the first figure and the mass around 0.5 in the second figure show that in some of the samples, the estimator correctly matches the ‘true parameter value’ to the data generating process. However, the remaining mode that occurs symmetrically across 0 shows that the estimator can also incorrectly match the parameter estimates to the alternate data generating process.

But again, this example is trivial in nature, and an empirical researcher may sidestep it by flexibly specifying the conditional mean with W = (1, 1/X) and a homoskedastic latent error. This specification is observationally equivalent and is identified. The concern is how one could generally show identification in the case of bijectively transformed variables in a way that excludes these types of counter-examples.

The two counter-examples show that Conditions 1-4 are not sufficient for identification. They manipulate the support of the random variables and the form of the heteroskedasticity to lose identification. To better understand why, it is best to re-examine equation (1.4). The left hand side is linear in Z while the right hand side is a logarithmic function of a ratio of X.
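For the exponential-transformation counter-example, this collapse of ln(xβo/xβ) into a linear function of Z can be verified directly. A small sketch using the parameter values implied by the two data generating processes above (Specification 1: βo = (0, 0.5), δo = 2; Specification 2: β = (0.5, 0), δ = 1, where δ indexes the standard deviation exp(Zδ)):

```python
import math

# Equation (1.4): z*(delta_o - delta) = ln(x*beta_o / x*beta), with X = (1, exp(Z)).
beta_o, delta_o = (0.0, 0.5), 2.0   # Specification 1
beta,   delta   = (0.5, 0.0), 1.0   # Specification 2

for z in [-1.5, -0.3, 0.0, 0.8, 2.0]:
    x = (1.0, math.exp(z))
    xb_o = x[0] * beta_o[0] + x[1] * beta_o[1]   # = 0.5*exp(z)
    xb   = x[0] * beta[0]   + x[1] * beta[1]     # = 0.5
    lhs = z * (delta_o - delta)                  # linear in z
    rhs = math.log(xb_o / xb)                    # the log ratio also equals z here
    assert abs(lhs - rhs) < 1e-12
```

Because the exponential transformation in X undoes the logarithm, the right hand side of (1.4) is exactly linear in Z, so the equation holds at parameter values other than the truth.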
If there is a defined relationship between X and Z such that the logarithmic function in X is equivalent to a linear function of Z, then identification does not hold. In the first example, the logarithmic function of X is necessarily linear because of the binary support. In the second example, the transformation undoes the logarithmic function, resulting in a linear function (for specific values of the parameters).

1.4 Identification in a common specification

The previous section provides justification as to why there is no general result on identification for the case of bijective transformations. The concern is that the most prevalent use of multiplicative exponential heteroskedastic models is when there is a bijective transformation between the random vectors. This would require showing identification prior to estimation for every variation of a specification. To provide some assistance on that front, the following shows identification in a general (although not completely general) and commonly used specification.

Example: polynomial transformations

Wanting to allow for a flexibly specified conditional variance function, one might consider polynomial functions as an approximation of the variance function.⁷ The following corollary obtains identification in this commonly used specification.

⁷ Khan (2013) shows that a heteroskedastic Probit model with a non-parametric conditional variance function is observationally equivalent to a model with a median restriction and no distributional assumptions on the latent error. This would motivate flexible specification of the variance function as a way to allow flexibility in the latent error distribution.

Corollary 1.4.1. If X = (1, X2), in which X2 is a univariate continuous random variable, and Z = (X2, X2^2, ..., X2^p), then under Conditions 1, 3, and 4, the parameters βo and δo are identified.

Proof.
Suppose there exists a (β, δ) ∈ Θ such that

X(βo − exp(Z(δ − δo))β) = 0    (1.9)

holds for almost all X and Z in their support. By Condition 4, one can rearrange equation (1.9) to

X2(δ1 − δ1o) + X2^2(δ2 − δ2o) + ... + X2^p(δp − δpo) = ln((β1o + X2β2o)/(β1 + X2β2))    (1.10)

Since X2 is continuous, taking the (p + 1)th derivative of both sides with respect to X2 (the left hand side, a polynomial of degree p, vanishes),

0 = (−1)^p p! [(β2o/(β1o + X2β2o))^(p+1) − (β2/(β1 + X2β2))^(p+1)]

which implies (β1o + X2β2o)/(β1 + X2β2) = β2o/β2. Plugging into equation (1.10), the right hand side becomes a constant. By Conditions 1 and 3, the equality cannot hold for any non-zero (δ − δo); thus δo is identified. Finally, since Condition 2 is inherently implied by the given specification, the identification of δo implies β = βo. □

Note that one of the most common specifications, X2 = Z, is a special case of this result. This result could easily be extended to the cases where X2 is not univariate and contains discrete random variables.

1.5 Conclusion

It has been widely accepted that a model with multiplicative exponential heteroskedasticity was estimable under Conditions 1 through 4 provided in Section 2. This chapter provides a proof of identification when the variables are not bijective transformations of one another. But in the more general case, I supply two examples in which those conditions are satisfied but point identification is not obtainable. Consequently, the conditions previously stated in the literature are not sufficient for distinguishing a linear effect from an exponential effect in all cases. To address many of the concerns arising from the lack of a general identification proof, this chapter also provides a proof of identification in a commonly used specification.

In the next chapter, the results here will be utilized to obtain identification for the proposed estimation of an endogenous binary response model.
The proposed approach relaxes assumptions that were standard in the literature but that I found to be too restrictive in most empirical settings. One of the motivations behind relaxing the assumptions was to allow for potential heteroskedasticity. Obtaining identification in this setting has two challenges: (1) relaxing assumptions that were previously used for identification, and (2) identification with multiplicative exponential heteroskedasticity. This chapter provides the foundation for overcoming the second challenge. Therefore the identification strategy in Chapter 2 emphasizes the importance and utility of the results in Chapter 1.

Chapter 2

Relaxing Conditional Independence in an Endogenous Binary Response Model

2.1 Introduction

In recent years, uncovering causal effects has become a cornerstone of economics research. The interest in causality, as opposed to mere correlation, allows for more plausible policy implications, counter-factual analysis, and the disentanglement of causal mechanisms. Endogeneity, correlation between the unobserved heterogeneity and covariates, is prevalent in economic settings and will bias parameter estimates, which will ultimately affect the causal interpretations. With more realistic assumptions than those provided in the literature, this chapter proposes a new control function estimator in a binary response setting to address endogeneity.

Binary responses, 0 or 1 outcomes, are common in economics research. For instance, employment, graduating from college, and purchasing decisions are all binary outcomes. In order to accurately uncover the true underlying mechanism in a binary response model, many researchers turn to the latent variable set up (sometimes referred to as a hurdle model), resulting in non-linear estimation. But treating endogeneity in a non-separable and non-linear setting is not as straightforward as using a “plug-in” instrumental variables estimator in a simple linear regression.
A series of papers (Smith and Blundell (1986), Rivers and Vuong (1988), Blundell and Powell (2004), and Rothe (2009)) have proposed using a control function method in constructing an estimator that appropriately addresses endogeneity. To gain identification, these papers place strong assumptions on the relationships between the latent unobserved errors and the instruments. Essentially, they impose an exclusion restriction such that the conditional distribution of the latent error cannot be a function of the instruments. These Control Function assumptions (CF-CI) are equivalent to assuming conditional independence between the latent error and the instruments and are unlikely to hold in an empirical setting.

For instance, in models of labor participation, one may be interested in uncovering the effect of non-wage income on the probability of employment. But there are concerns of endogeneity because one of the main sources of non-wage income is the partner’s wages, and labor force participation decisions are usually simultaneously determined within the household. CF-CI would require that shocks to (non-wage) household income affect labor participation decisions independent of any other included covariates, such as education, age, children in the household, or instruments such as husband’s education.

Another example, in the field of health economics, is evaluating the effect of drug rehabilitation treatment on subsequent substance abuse. There is endogeneity because the covariate of interest, the number of visits the client makes during the episode of treatment, is most likely correlated with unobserved characteristics of the client that determine the likelihood of successful treatment. For example, those who are more likely to relapse initially (longer drug use or less community support) are less likely to visit the center during the episode of treatment.
CF-CI would require that the unobserved characteristics of the client cannot have interactive effects with other included covariates such as age, income, or marital status.

In a more structural setting, suppose researchers are interested in understanding the welfare loss from government intervention into insurance markets, using variation in prices to estimate the demand and marginal cost of insurance. But observed prices are endogenously determined since they are likely correlated with unobserved characteristics. Using exogenous variation in prices, possibly through variation in administrative costs or changes in the competitive environment over markets, endogeneity may be addressed. But CF-CI imposes functional form restrictions on the utility function: that unobserved characteristics are additively separable from observed market, product, or individual characteristics.

This chapter proposes an alternative framework and control function estimator that relaxes this strong assumption. This generalization has been explored in other settings, such as the case of endogenous random coefficients for a linear model in Wooldridge (2005) and demand estimation where the unobserved product characteristics do not enter the utility function additively in Gandhi, Kim, and Petrin (2013). More generally, Kim and Petrin (2017) set up a framework for the “general control function,” permitting the unobserved heterogeneity to be a function of the instruments, in the case of additively separable triangular equation models. This chapter extends the general control function approach of Kim and Petrin (2017) to the case of binary response models to propose a new estimator that is valid under the failure of CF-CI.
One of the main contributions of this chapter is to clearly explain why CF-CI would not realistically hold in empirical settings and, given the likely failure of CF-CI, to apply the general control function approach of Wooldridge (2005), Gandhi, Kim, and Petrin (2013), and Kim and Petrin (2017) to a binary response setting. A simulation illustrates that, given the failure of CF-CI, the general control function approach, as opposed to alternative control function methods of the literature, is needed to accurately recover parameter estimates. Under the more general framework, CF-CI implies testable hypotheses for which standard variable addition or Wald tests can be used. In an empirical application on female labor supply, the CF-CI assumption is easily rejected.

This chapter also adds to the larger literature on control function approaches to triangular simultaneous equations for both additively separable¹ and non-separable models.² In the literature, CF-CI has been taken as a required assumption in order to employ a control function approach. In the discussion on identification, this chapter comments on how other control function methods in the literature obtain identification, explains why their approaches to identification can be restrictive, and proposes a simpler alternative. Consequently, this chapter provides an example where, under a reasonable setting, CF-CI with respect to the control variable need not hold to recover structural objects such as the Average Structural Function (ASF) or the Average Partial Effects (APE).

By focusing on the ASF and APE, this chapter contributes to the discussion on interpretation of non-linear models under the presence of endogeneity. Blundell and Powell (2003, 2004) introduced the ASF as a way to interpret binary response models when there is endogeneity.
They note that a conditional mean interpretation cannot capture the causal and structural effect that a model incorporating endogeneity should produce. More recently, Lewbel, Dong, and Yang (2012) propose using an Average Index Function (AIF) as a generally easier to identify alternative to the ASF. Lin and Wooldridge (2015) compare the two approaches and conclude that the ASF is a more appropriate function for interpretation and is able to capture the mechanisms of interest. This chapter further supports the conclusions of Lin and Wooldridge (2015): it is shown that under the more general framework of this chapter, only the proposed estimation procedure recovers the correct ASF. This chapter also shows that the alternative estimator from Rothe (2009), the Semi-parametric Maximum Likelihood (SML) estimator, actually produces estimates for the AIF, which is shown in simulation to be distinctly and interpretively different from the ASF.

¹ Although a latent variable binary response model is non-separable due to the indicator function, separability is imposed inside the indicator function. Consequently, results from the additively separable literature may still apply.

² Literature on additively separable triangular equation models includes Newey, Powell, and Vella (1999), Pinkse (2000), Su and Ullah (2008), Florens, Heckman, Meghir, and Vytlacil (2008), Ai and Chen (2003), Newey and Powell (2003), Newey (2013), Kim and Petrin (2017), and Hoderlein, Holzmann, and Meister (2017). Literature on non-separable triangular equation models includes Imbens and Newey (2009), Kasy (2011), Blundell and Matzkin (2014), Chen, Chernozhukov, Lee, and Newey (2014), Kasy (2014), and Hoderlein, Holzmann, Kasy, and Meister (2016).

The proposed estimator is presented in a parametric framework, but in some empirical contexts the distributional assumptions may be unrealistic. Therefore this chapter also provides a semi-parametric extension that proposes a new distribution free estimator.
Using the observational equivalence results of Khan (2013), the proposed sieve semi-parametric estimator is shown to be consistent under weaker assumptions than those found in the literature. Consequently, this chapter contributes to the literature on semi- and non-parametric estimation as a particular application of a semi-parametric two stage sieve estimator. Sieves (as opposed to kernel methods) are suggested in order to impose necessary shape restrictions on the general control function. Asymptotic results are derived using the works of Ai and Chen (2003), Chen, Linton, and Van Keilegom (2003), and Hahn, Liao, and Ridder (2018). A comprehensive simulation study shows that only the proposed estimator can produce accurate parameter and ASF estimates under the failure of CF-CI.

The remainder of this chapter is organized as follows. Section 2 provides motivation for relaxing CF-CI, specifically in the setting of binary response models, and reviews previous approaches and their potential shortcomings. Section 3 describes the set up of the model and introduces the general control function method of Kim and Petrin (2017) in the binary response setting. Empirical examples are provided to illustrate how CF-CI is unlikely to hold in many economic settings and how the proposed framework captures the potentially complex structure of endogeneity. Section 4 goes into more detail about the operation and interpretation of the general control function approach. Because CF-CI is used to show identification, the generalizations proposed in this chapter put into question whether identification still holds. The Conditional Mean Restriction from Kim and Petrin (2017), which places a shape restriction on the general control function, is used to show identification. This section also provides a simulation to illustrate the failure of estimators that require CF-CI when only the weaker CMR assumption holds.
Section 5 instructs on the implementation of the proposed estimator and derives the asymptotic properties such as consistency and asymptotic normality. Because the parameters of a binary choice model have no direct economic interpretation, this section also discusses the ASF and APE as structural objects of interest and how to recover them under the proposed framework. Section 6 illustrates the proposed estimator in an empirical application. Using 1991 CPS data, this chapter examines the effect of non-wage income on a married woman’s probability of labor force participation. CF-CI implies a testable hypothesis under the proposed framework, and a Wald test finds strong statistical evidence that the assumptions of previous estimators are violated. Although the parametric assumptions are likely to hold in the empirical application provided, there are many economic settings where the distributional assumptions are restrictive and unconvincing. The final section extends the framework to a distribution free setting using a semi-parametric estimator. This section provides the asymptotic properties of the semi-parametric estimator as well as a comprehensive simulation study comparing the proposed approach to other estimators in the literature.

2.2 Background and Motivation

Consider the latent variable triangular system where y1i is a binary response variable, zi = (z1i, z2i) is a 1 × (k1 + k2) vector of “non-endogenous” included and excluded instruments, y2i is a single continuous endogenous regressor, and xi is a 1 × k vector where each element is a function of (z1i, y2i) and includes a constant.

y1i = 1 if y∗1i ≥ 0, and y1i = 0 if y∗1i < 0
y∗1i = xiβo + u1i
y2i = m(zi)πo + v2i    (2.1)

The endogenous variable y2i can be decomposed into its conditional mean and the unobserved endogenous component, v2i.
Alternatively, one could consider a linear probability model, which has the advantages of being easy to estimate and easy to interpret, and for which dealing with endogeneity is relatively simple, or at the very least well studied. But linear probability models are restrictive and cannot be representative of the true underlying mechanisms. Their predicted probabilities can lie outside the [0, 1] bounds, which places limitations on the interpretation of the estimates. Therefore this chapter will focus on the latent variable setting.

In this framework, a series of papers, Smith and Blundell (1986), Rivers and Vuong (1988), Blundell and Powell (2004), and Rothe (2009), developed estimators to address endogeneity using a control function approach. The control function approach supposes that there is a particular function (or variable) that, when included as an additional covariate, is able to control for the endogeneity of the other regressors. For example, Rivers and Vuong (1988) show that if one were to assume that u1i and v2i are bivariate normal and independent of the instruments zi, then one can derive the following conditional distribution,

u1i|v2i, zi ∼ N((ρσ1/σ2)v2i, (1 − ρ²)σ1²)    (2.2)

where σ1 and σ2 are the standard deviations of u1i and v2i respectively and ρ is the correlation coefficient. This conditional distribution provides the foundation for the control function approach in this context. The latent equation can be rewritten as

y∗1i = xiβo + γov2i + ε1i    (2.3)

where ε1i = u1i − γov2i and γo = ρσ1/σ2. Notice that ε1i|v2i, zi ∼ N(0, (1 − ρ²)σ1²), which means there is no endogeneity between the regressors and the new latent error ε1i (i.e., E(ε1i|xi, v2i) = 0). Therefore the reduced form error v2i can be used as a control function; by including v2i as an additional covariate, one can control for the endogeneity in y2i and obtain consistent parameter estimates.
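The conditional moments in (2.2), and hence the control function coefficient γo = ρσ1/σ2 in (2.3), can be verified by simulation. A stdlib-Python sketch (the values of ρ, σ1, σ2, the seed, and the sample size are arbitrary choices for illustration):

```python
import math
import random

random.seed(1)
rho, sig1, sig2 = 0.6, 1.5, 2.0
gamma_o = rho * sig1 / sig2  # control function coefficient in (2.3)

# Draw (u1, v2) bivariate normal with correlation rho via two independent normals
n = 200_000
u1s, v2s = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    v2 = sig2 * z1                                        # v2 ~ N(0, sig2^2)
    u1 = sig1 * (rho * z1 + math.sqrt(1 - rho**2) * z2)   # corr(u1, v2) = rho
    u1s.append(u1)
    v2s.append(v2)

# Sample analogue of E(u1 | v2) = gamma_o * v2: slope = cov(u1, v2) / var(v2)
mean_u = sum(u1s) / n
mean_v = sum(v2s) / n
cov_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(u1s, v2s)) / n
var_v = sum((v - mean_v) ** 2 for v in v2s) / n
slope = cov_uv / var_v

# The residual variance matches (1 - rho^2)*sig1^2, the variance of eps1 in (2.3)
resid_var = sum((u - slope * v) ** 2 for u, v in zip(u1s, v2s)) / n
print(slope, gamma_o, resid_var, (1 - rho**2) * sig1**2)
```

The fitted slope recovers γo and the residual variance recovers (1 − ρ²)σ1², which is why including v2i as a regressor removes the endogeneity in this homoskedastic setting.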
In general, the control function approach constructs a function of the instruments and regressors that can act as a valid proxy for the source of the endogeneity. Of course, in practice, one does not observe v2i, so instead the residuals from a first stage estimation procedure that regresses the endogenous variable on the instruments can be used.

In order to relax the distributional assumptions in Rivers and Vuong (1988), Blundell and Powell (2004) and Rothe (2009) propose alternative semi-parametric estimators. But to obtain non-parametric identification of the distribution of the latent error, they make a rather strong assumption on the relationship between the unobserved errors, u1i and v2i, and the instruments, zi. This Control Function assumption imposes Conditional Independence (CF-CI). The CF-CI assumption requires that the instruments zi are independent of the latent error u1i after conditioning on the reduced form error v2i:

u1i|v2i, zi ∼ u1i|v2i    (2.4)

Note that CF-CI is implicit in the set up of Rivers and Vuong (1988). Interpretively, this means that any source of endogeneity must be fully captured through the control variate v2i, or, in terms of an exclusion restriction, the conditional CDF Fu1|v2,z(u1i|v2i, zi) is only a function of v2i (the instruments zi are excluded). This exclusion restriction must also hold for all moments, which, as will be shown shortly, may be hard to justify.

As a slight relaxation of CF-CI, Rothe (2009) also proposes a Linear Index (CF-LI) sufficiency assumption that, after conditioning on the first stage error and the linear index xiβo, the latent error is independent of the instruments:

u1i|v2i, zi ∼ u1i|v2i, xiβo    (2.5)

Now the instruments can be a part of the conditional distribution, but only through the linear index. The linear index restricts the relative direction and magnitudes of the regressors in the conditional distribution.
So, although it allows for a more relaxed relationship between the instruments and the unobserved heterogeneity, it is hard to justify in a general setting. In either case, these assumptions used to obtain identification may be too stringent in many empirical contexts.

To give a motivating parametric example, consider a slight variation of the Rivers and Vuong (1988) set up where u1i and v2i are still bivariate normal but are allowed to be heteroskedastic in the instruments; i.e.,

(u1i, v2i)′ | zi ∼ N( (0, 0)′, [ [σ1(zi)]²  ρ(zi)σ1(zi)σ2(zi) ; ρ(zi)σ1(zi)σ2(zi)  [σ2(zi)]² ] )    (2.6)

Heteroskedasticity is commonly found in empirical data, whether it is actually caused by variability in the latent error over the regressors or by heterogeneity in the slopes as in a random coefficients setting.³ Even in the linear regression, heteroskedasticity has been accepted as endemic in empirical settings and heteroskedasticity robust inference is always employed. Again, by the properties of the bivariate normal distribution, the following conditional distribution is derived:

u1i|v2i, zi ∼ N( ρ(zi)(σ1(zi)/σ2(zi))v2i, (1 − [ρ(zi)]²)[σ1(zi)]² )    (2.7)

This is a fairly small variation to the framework considered in Rivers and Vuong (1988), but ignoring heteroskedasticity can strongly bias parameter estimates. A simple Monte Carlo exercise, detailed in Figure D.1, illustrates the potential bias. Suppose equations (2.1) and (2.6) hold with a single excluded instrument and no included instruments such that ρ(zi) = 0.6, σ1(zi) = σ2(zi) = exp(0.25zi). Figure D.1 displays the empirical distribution of the estimators for β where the true value is equal to one. The Rivers and Vuong estimator that ignores heteroskedasticity is substantially biased, with estimates of β centered around 1.2. This illustrates that ignoring heteroskedasticity in the context of binary response models produces inconsistent parameter estimates.
Now let us compare the distribution in equation (2.7) to the CF-CI and CF-LI assumptions for the semi-parametric estimators. CF-CI clearly does not hold since the exclusion restriction fails: both the conditional mean and conditional variance depend on the instruments. For the CF-LI assumption to hold, the heteroskedastic functions σ1(·), σ2(·), and ρ(·) could only be functions of the linear index xiβo. This is quite restrictive and would not generally hold. Therefore the semi-parametric estimators do not apply to this simple parametric setting.

This causes some concern that the control function method may not be valid for this example. But the conditional distribution in equation (2.7) suggests that there should still be a control function approach to address endogeneity. If one were to include ρ(zi)(σ1(zi)/σ2(zi)) v2i as an additional covariate, then estimating a heteroskedastic Probit with the following conditional mean will produce consistent parameter estimates:

E(y1i|zi, v2i) = Φ( [xiβo + ρ(zi)(σ1(zi)/σ2(zi)) v2i] / sqrt( (1 − [ρ(zi)]²)[σ1(zi)]² ) )    (2.8)

Building a control function method from a more general conditional distribution of the latent error is the motivation and starting point for the proposed approach.

3This is similar to the set up in Kasy (2011) where he provides a counter-example to the control function approach proposed by Imbens and Newey (2009). In Imbens and Newey (2009), they propose using the control variable Vi = Fy2|z(y2i, zi) which would satisfy CF-CI when the heterogeneity is only one-dimensional, as pointed out in Kasy (2011). In his example, a linear random coefficient model is used to show the failure of CF-CI using the Imbens and Newey control variable. Note that the random coefficient model can be rewritten as a linear model with heteroskedasticity as suggested here.
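Equation (2.8) is easy to operationalize once ρ(·), σ1(·), and σ2(·) are specified. The sketch below uses illustrative functional forms (my own, not the chapter's) to evaluate the corrected response probability, and confirms that with ρ(·) = 0 it collapses to a heteroskedastic Probit with no control function term.

```python
import numpy as np
from scipy.stats import norm

def response_prob(x_beta, v2, z, rho, s1, s2):
    """Conditional mean E(y1|z, v2) from equation (2.8):
    heteroskedastic Probit with control term rho(z)*(s1(z)/s2(z))*v2."""
    mean_shift = rho(z) * (s1(z) / s2(z)) * v2
    scale = np.sqrt((1.0 - rho(z) ** 2) * s1(z) ** 2)
    return norm.cdf((x_beta + mean_shift) / scale)

# illustrative specifications (assumed, for demonstration only)
s1 = s2 = lambda z: np.exp(0.25 * z)
p_endog = response_prob(0.5, v2=1.0, z=0.2, rho=lambda z: 0.6, s1=s1, s2=s2)
p_exog  = response_prob(0.5, v2=1.0, z=0.2, rho=lambda z: 0.0, s1=s1, s2=s2)
# with rho = 0 the control term drops out and the scale is just s1(z)
assert np.isclose(p_exog, norm.cdf(0.5 / s1(0.2)))
```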
This example highlights that the assumptions used to obtain identification in the semi-parametric approaches can be fairly stringent and are not well understood in terms of their consequences. This chapter questions the necessity of the CF-CI assumption and considers its implications for estimation and interpretation. As an alternative to the semi-parametric estimators, I propose an estimator that builds upon the same control function technique but extends the model by relaxing the CF-CI assumption. In the previous heteroskedastic bivariate normal example, this means the instruments can be a part of the conditional variance and the conditional mean. But the CF-CI and CF-LI assumptions are not imposed superfluously; they are used to obtain identification of the conditional distribution of the latent error. In this chapter, I will first consider a parametric alternative to isolate the necessary conditions for identification. At the end of this chapter, I present a distribution-free extension that proposes a semi-parametric estimator with strictly weaker identification requirements compared to the estimators of Blundell and Powell (2004) and Rothe (2009).

In a related strand of literature on non-parametric triangular simultaneous equation models with additively separable unobserved heterogeneity, Kim and Petrin (2017) question the restrictive control function assumptions in the Non-Parametric Control Function (NP-CF) literature (Newey, Powell, and Vella (1999), Pinkse (2000), Su and Ullah (2008), and Florens, Heckman, Meghir, and Vytlacil (2008)). Because the control function method requires the additional control function assumption for identification, the Non-Parametric Instrumental Variables (NP-IV) approach, as in Ai and Chen (2003), Newey and Powell (2003), Hall and Horowitz (2005), and Newey (2013), appears to be a superior approach that only requires the weaker Conditional Mean Restriction (CMR).
Kim and Petrin (2017) show that a control function approach is still valid under the weaker CMR when a general control function is specified. This chapter extends their results to a binary response model under a latent variable framework and will use the CMR in showing identification.

Alternative estimators that do not require the CF-CI assumption in estimating endogenous binary response models are the special regressor estimator proposed in Lewbel (2000) and Dong and Lewbel (2015) and the maximum score and smoothed maximum score estimators in Hong and Tamer (2003) and Krief (2014). The special regressor estimator requires a regressor, independent of all unobserved heterogeneity, that has large support, and without this "special regressor" the procedure is invalid. Alternatively, Hong and Tamer (2003) and Krief (2014) extend the maximum score and smoothed maximum score methods of Manski (1985) and Horowitz (1992) to estimate the structural parameters βo in the linear index. For identification they require conditional median independence: Med(u1i|v2i, zi) = Med(u1i|v2i). This would allow for general forms of heteroskedasticity but, as seen in the heteroskedastic bivariate normal example, the conditional median independence assumption is still quite restrictive and would not necessarily hold. Moreover, the conditional median independence assumption does not identify the distribution of the latent error, and therefore these estimators cannot recover the Average Structural Function (ASF) and Average Partial Effects (APE).

The proposed framework that allows for relaxation of CF-CI and CF-LI in a parametric setting is introduced in the next section. The proposed general control function estimator directly follows from the conditional distribution of the latent error provided in the framework.

2.3 Model Set Up

Return to the set up described in equation (2.1). The distributional assumptions for u1i and v2i determine the consistent estimation procedure.
Although most of the assumptions in the literature are based on a specification of the joint distribution of u1i and v2i (see Rivers and Vuong (1988) and Petrin and Train (2010)), one merely needs to specify the conditional distribution to use the control function approach. For example, if one were to assume u1i|v2i, zi ∼ N(0, 1), so there is no endogeneity and no heteroskedasticity, then a standard Probit maximum likelihood estimation (MLE) procedure yields consistent estimates. On the other hand, if u1i|v2i, zi ∼ N(0, exp(2ziδ)) such that heteroskedasticity is present, then the standard Probit MLE procedure would be inconsistent but a Het-Probit MLE procedure, included in many statistical packages, would be consistent. If u1i|v2i, zi ∼ N(ρv2i, 1), similar to the setting in equation (2.2), then the two step CMLE developed by Smith and Blundell (1986) and Rivers and Vuong (1988) would be consistent and other methods that ignore the endogeneity would be inconsistent. More generally, if the CF-CI assumption holds such that u1i|v2i, zi ∼ u1i|v2i with some unknown distribution, Blundell and Powell (2004) (for the remainder of the chapter referred to as BP) and Rothe (2009) provide semi-parametric methods that estimate the parameters consistently.

As a first step in relaxing the CF-CI assumption, the following assumption proposes an alternative framework which assumes a more flexible conditional distribution of the latent error.4

Assumption 2.3.1. Consider the set up in equation (2.1), where {y1i, zi, y2i}, i = 1, ..., n, is iid. Assume the linear reduced form in the first stage is the true conditional mean, E(y2i|zi) = m(zi)πo, and the unobserved latent error has the following conditional distribution:

u1i | zi, v2i, y2i = u1i | zi, v2i ∼ N( h(v2i, zi)γo , exp(2 g(y2i, zi)δo) )

where zi = (z1i, z2i) and m(zi), h(v2i, zi), and g(y2i, zi) are known vectors and h(v2i, zi) is differentiable in v2i.
The first part of the assumption breaks up the endogenous variable into its conditional mean and what I will refer to as the control variate v2i. Note that by construction, the control variate is mean independent of the instruments. This assumption does not take a stand on the true data generating process of the endogenous variable.5 In the more general setting of non-separable triangular equation models, Imbens and Newey (2009) consider the case of a non-separable first stage

y2i = d(zi, ηi)    (2.9)

where zi are the instruments, ηi is unobserved heterogeneity independent of the instruments, and d(·,·) is the unknown and true data generating process in the first stage. In this setting they suggest using the conditional CDF, e2i = Fy2|z(y2i, zi), as the control variable. They show that their proposed control variable satisfies CF-CI and therefore the control function method recovers the parameters of the model. Assumption 2.3.1 does not require full independence between the control variable and the instruments and therefore can use the population residual v2i = y2i − E(y2i|zi) as a control variable with the knowledge that it does not satisfy CF-CI. I will discuss the differences between these two approaches after explaining the second part of the assumption.

The second part of Assumption 2.3.1 specifies the conditional distribution that allows for the violation of CF-CI.

4The normality assumption could be easily generalized to just a known distribution with CDF G(·). This allows for a Logit specification, which is also explored in one of the simulations.

5For now this does presume a linear reduced form, but when discussing the asymptotic properties of the estimator, if m(zi) is a sequence of basis functions so that the first stage acts as a non-parametric sieve regression, then this will not affect the asymptotic variance estimates.
Both the conditional mean and the conditional variance are functions of the instruments, so the exclusion restriction implied by CF-CI is violated. Under this assumption, the conditional mean of y1i is:

E(y1i|zi, y2i, v2i) = E(y1i|zi, v2i) = Φ( [xiβo + h(v2i, zi)γo] / exp(g(y2i, zi)δo) )    (2.10)

Note that there is a one-to-one mapping between y2i and v2i given the instruments zi. This implies the mean is preserved regardless of which term is included in the conditioning argument. This result should be unsurprising, as the conditional mean appears to be a heteroskedastic Probit model that adjusts for endogeneity using the control function approach, both of which have been discussed extensively in the literature. But in this case, the control function (h(v2i, zi)γo) is a function of both the control variate v2i and the instruments zi. This was first introduced in Wooldridge (2005), where he suggests using the following control function,

h(v2i, zi)γo = γ1o v2i + v2i zi γ2o    (2.11)

in a linear regression with random coefficients.6 Gandhi, Kim, and Petrin (2013) adopted a similar generalization for demand estimation and Kim and Petrin (2017) provide a general control function framework for the case of non-linear but additively separable triangular equation models. As in Kim and Petrin (2017), this generalization will be referred to as the "general control function," as opposed to a more traditional control function that upholds the exclusion restriction (not a function of the instruments) as in Rivers and Vuong (1988) and Petrin and Train (2010).

The proposed framework suggests a simple two step estimator. In the first step, the conditional mean of y2i is estimated to construct the control variate from the residuals (ˆv2i). In the second step, the residuals are plugged into the conditional mean in equation (2.10) and parameter estimates are obtained via maximum likelihood estimation. This will be discussed in more detail in Section 5.
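The two step estimator can be sketched as follows. This is a simplified illustration under my own choices of h(·) and g(·) (a single excluded instrument, h = (v2, v2·z2), g = (z2)), not the chapter's full procedure; asymptotic corrections for the first-stage estimation error are omitted.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def two_step_gcf_probit(y1, y2, z):
    """Two-step general control function estimator per equation (2.10).
    z has a constant in column 0; h and g specifications are illustrative."""
    # Step 1: first-stage OLS of y2 on z; residuals are the control variate
    pi_hat, *_ = np.linalg.lstsq(z, y2, rcond=None)
    v2_hat = y2 - z @ pi_hat

    z2 = z[:, 1]                                  # a single excluded instrument
    x = np.column_stack([np.ones_like(y2), y2])
    h = np.column_stack([v2_hat, v2_hat * z2])    # general control function terms
    g = z2[:, None]                               # heteroskedasticity terms

    kx, kh = x.shape[1], h.shape[1]
    def negll(theta):
        b, gam, dlt = theta[:kx], theta[kx:kx + kh], theta[kx + kh:]
        index = (x @ b + h @ gam) / np.exp(g @ dlt)   # equation (2.10)
        p = norm.cdf(index).clip(1e-10, 1 - 1e-10)
        return -(y1 * np.log(p) + (1 - y1) * np.log(1 - p)).sum()

    res = minimize(negll, np.zeros(kx + kh + g.shape[1]), method="BFGS")
    return res.x[:kx]   # structural index parameters

# simulated check under a DGP consistent with Assumption 2.3.1
rng = np.random.default_rng(1)
n = 5000
z = np.column_stack([np.ones(n), rng.normal(size=n)])
v2 = rng.normal(size=n)
y2 = z[:, 1] + v2
u1 = 0.5 * v2 + 0.3 * v2 * z[:, 1] + rng.normal(size=n) * np.exp(0.2 * z[:, 1])
y1 = (0.5 + y2 + u1 > 0).astype(float)
print(two_step_gcf_probit(y1, y2, z))   # should be near the true values (0.5, 1)
```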
How does the proposed approach differ from the setting considered in Imbens and Newey (2009)? In Imbens and Newey (2009), they attempt to flexibly model the true data generating process of the first stage as a possibly non-separable function of instruments and unobserved heterogeneity.7 They then construct a control variable, e2i = Fy2|z(y2i, zi), that they show satisfies the CF-CI assumption. But they require the instruments to be completely independent of any unobserved heterogeneity and allow for only a single source of unobserved heterogeneity in the first stage. In this chapter, I use a control variate v2i that is always obtainable and must satisfy conditional mean independence, by construction. Then, to make up for the relaxation of CF-CI, I flexibly model the relationship between the structural heterogeneity u1i, the control variable v2i, and the instruments zi using a general control function.

A major critique of the approach of Imbens and Newey (2009) is the caveat to their framework brought up in Kasy (2011), noting their method only allows for one source of heterogeneity (independent of the instruments) in the first stage. This would prohibit the simple example of random coefficients in the first stage

y2i = η1i + η2i zi    (2.12)

The approach in this chapter allows for this possibility since equation (2.12) can be rewritten in terms of a linear conditional mean with heteroskedasticity in the first stage error.

One may object to the linear in parameters and known distribution specifications in Assumption 2.3.1.

6He also discusses in that chapter the implementation of the control function approach in a binary response setting such as Probit. But in that example, he does not propose interaction with the instruments and instead only suggests including higher order moments of the reduced form error. So his analysis stops short of what is proposed in this chapter.

7The non-separable first stage needs to be monotonic in the unobserved heterogeneity.
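The rewriting of (2.12) is immediate: y2i = E(η1i) + E(η2i)zi + v2i with v2i = (η1i − Eη1i) + (η2i − Eη2i)zi, so Var(v2i|zi) is quadratic in zi. The short simulation below (the distributional choices are my own) verifies that the residual from the linear conditional mean is mean independent of the instrument but heteroskedastic in it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200000
z = rng.normal(size=n)
eta1 = 1.0 + rng.normal(size=n)           # random intercept, mean 1
eta2 = 0.5 + 0.5 * rng.normal(size=n)     # random slope, mean 0.5
y2 = eta1 + eta2 * z                      # first stage (2.12)

# linear conditional mean: E(y2|z) = 1 + 0.5 z, so v2 = y2 - (1 + 0.5 z)
v2 = y2 - (1.0 + 0.5 * z)

low, high = np.abs(z) < 0.5, np.abs(z) > 1.5
print(v2[low].mean(), v2[high].mean())    # both near 0: mean independence
print(v2[low].var(), v2[high].var())      # variance grows with |z|
```

Here Var(v2|z) = 1 + 0.25 z², exactly the heteroskedastic first stage that Assumption 2.3.1 accommodates.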
If these specifications are not true, this leads to the misspecification of equation (2.10) as the true conditional mean. The general control function and heteroskedastic function, h(v2i, zi)γo and g(y2i, zi)δo, respectively, are assumed to be linear in parameters, which facilitates the identification discussion because it allows for lower level conditions. Alternatively, one can consider any parametric specification, but then lower level conditions for identification would need to be derived to fit the specification. In the extension provided in Section 6, I allow both the general control function and the heteroskedastic function to be non-parametrically specified.

The distributional assumption is particularly pertinent in contrast to the estimators from BP and Rothe, which impose no distributional assumptions. Preserving the distributional assumption keeps the difficult discussions on identification and on interpreting the ASF and APE clear.8 But to appease any concerns, the extension provided in Section 7 proposes a semi-parametric estimator that is free of any distributional assumptions.

Up to this point I have only explained theoretically the consequences of CF-CI in terms of exclusion restrictions on the conditional distribution. But as a researcher with empirical data, how is one to determine whether CF-CI may fail to hold? To further motivate the generalization provided in Assumption 2.3.1, the following are two examples taken from applications in the literature where the empirical settings may suggest a violation of CF-CI.

Example 1: Demand for Premium Cable from Petrin and Train (2010)

This example is a simplified version of the application given in Petrin and Train (2010) (hereafter PT), who propose a control function approach for estimating structural models of demand. In their application, they use a multinomial logit in modelling consumers' choice of television reception.
To fit a binary response setting, consider the choice of selecting premium cable conditional on already selecting cable as the television reception. Let Uim be the marginal utility of individual i choosing premium cable in market m over the utility from not selecting premium cable (so the utility from the outside option is normalized to 0). Violation of CF-CI can be easily invoked by allowing for a utility that is not additively separable between the unobserved utility (u1im) and the observed utility, as in Gandhi, Kim, and Petrin (2013). Suppose the observed utility is

Uim = β1 pm + Σ_{g=2}^{5} β2g pm dgi + z11m β3 + z12i β4 + (1 + pm γ1 + z11m γ2 + z12i γ3) u1im    (2.13)

where the variables in z11m include the market and product characteristics and the variables in z12i include individual characteristics. The variables dgi are dummies of an index of 5 different income levels; this allows the price elasticity to be heterogeneous in income. The unobserved utility consists of two components, u1im = ξm + εim, where εim is iid logistic while ξm represents unobserved (to the researcher but not to the consumer or producer) attributes of the product. Consequently, ξm captures the component of the unobserved utility that is not independent from price. Note that this specification, like Gandhi, Kim, and Petrin (2013), allows for potential interactions between the observable covariates (price, market and product characteristics, and individual characteristics) and the unobserved attributes of the product.

This specification, previously discussed in Gandhi, Kim, and Petrin (2013), can be motivated using the example of unobserved advertisement.

8An added benefit is that the proposed estimator is much easier to implement compared to the semi-parametric approaches. For instance, estimates can be obtained using canned commands in Stata. Hopefully this will persuade empirical economists that implementing generalizations to previous estimators need not be computationally burdensome.
For instance, one would expect not only unobserved advertisement to affect utility (through ξm) but would also expect an interactive effect with product characteristics. For example, suppose premium cable is marketed with advertisement that emphasizes the number of channels provided. Then advertisement should contribute to the utility of consumption interactively with the number of premium channels actually provided.

Even if a researcher were to impose an additively separable form on the utility, it is still unlikely that a simple control function from a reduced form pricing equation can capture the true endogenous structure. Suppose the utility from purchasing premium is

Uim = β1 pm + Σ_{g=2}^{5} β2g pm dgi + z11m β3 + z12i β4 + u1im    (2.14)

where the unobserved utility is composed of two components, u1im = ξm + εim; εim is iid logistic while ξm represents unobserved attributes of the product. The probability consumer i chooses premium cable in market m is

Pim = P(Uim > 0 | pm, dgi, z11m, z12i, ξm)
    = exp(β1 pm + Σ_{g=2}^{5} β2g pm dgi + z11m β3 + z12i β4 + ξm) / [1 + exp(β1 pm + Σ_{g=2}^{5} β2g pm dgi + z11m β3 + z12i β4 + ξm)]    (2.15)

and let the expected demand from the perspective of the monopolist be E(Pim|pm, z11m, ξm). A monopolist will maximize expected profit with respect to price,

pm = arg max_p (p − MC(z2m, ωm)) E(Pim|p, z11m, ξm)    (2.16)

From the first order conditions, the optimal price satisfies

pm = pm / |e(z11m, ξm)| + MC(z2m, ωm)    (2.17)

where e(z11m, ξm) is the price elasticity of demand. It is evident that prices are not separable in ξm and the exogenous characteristics z11m, z2m. If one were to still use the control variable v2m = pm − E(pm|z11m, z2m), then the CF-CI assumption implies E(ξm|z11m, z2m, v2m) = E(ξm|v2m), which would generally not hold. Therefore the estimators based on the CF-CI assumption would not be valid in this setting.
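The step from (2.16) to (2.17) is the standard monopoly markup condition. Writing Q(p) = E(Pim|p, z11m, ξm) for expected demand, the first order condition rearranges as follows; this derivation is a reconstruction consistent with equations (2.16) and (2.17), not taken verbatim from the chapter.

```latex
% FOC of the monopolist's problem \max_p \; (p - MC)\,Q(p):
\begin{align*}
0 &= Q(p_m) + (p_m - MC)\,Q'(p_m)
    && \text{(differentiate (2.16))} \\
p_m - MC &= -\frac{Q(p_m)}{Q'(p_m)}
          = -\frac{p_m}{e(z_{11m},\xi_m)}
    && \text{(since } e = Q'(p_m)\,p_m/Q(p_m)\text{)} \\
p_m &= \frac{p_m}{|e(z_{11m},\xi_m)|} + MC(z_{2m},\omega_m)
    && \text{(demand slopes down, so } e < 0\text{)}
\end{align*}
```

which is equation (2.17). Because the elasticity of the logit demand in (2.15) depends on ξm and z11m jointly, the optimal price inherits their non-separable interaction.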
Kim and Petrin (2017) provide a similar example to motivate their general control function in a non-linear but additively separable setting.

Example 2: Home-ownership and Income from Rothe (2009)

This example is the application in Rothe (2009), where he considers the effect of income on home-ownership in Germany for low-educated middle age married men. The controls, z1i, include age and an indicator for the presence of children under the age of 16. The instruments, z2i, are the wife's education level and an indicator for the wife's employment status, which should only affect home-ownership through family income. Rothe relaxes CF-CI slightly by proposing the alternative CF-LI assumption. Recall that the CF-LI assumption requires the conditional distribution of the latent error to only be a function of the control variable v2i and the linear index xiβo = z1iβ1o + y2iβ2o. In this example, endogeneity can be explained by omitted variables, such as accessibility to loans via credit, that are correlated with income. Moreover, as the accessibility to loans via credit lowers (i.e., a low credit score), the effect of income and whether or not you have children becomes less important in the decision to purchase a home. So if credit score were observable, one would expect interactive effects between the linear index and credit score. Since v2i acts as a proxy for the omitted variable, the conditional mean of the latent error should include the interactive effect,

E(u1i|v2i, xiβo) = v2i γ1 + v2i (xiβo) γ2    (2.18)

Under this specification, the CF-LI assumption in Rothe is satisfied while CF-CI is violated. However, this places fairly strong restrictions on the coefficients of the interactive effects between the omitted variable (credit score) and the included regressors (age, presence of children, and income): they must be proportional to the linear index coefficients, βo.
Alternatively, the proposed estimation procedure would recognize the interactive relationship of these effects and could also allow the interactions to have effects not necessarily proportional to the index coefficients.

These two examples provide some economic motivation for relaxing the CF-CI assumption. However, the CF-CI assumption was used in the literature to gain identification. In equation (2.10), xi is a function of z1i and y2i, which both comprise the control function h(v2i, zi). Without any restrictions on the control function, the two effects may not be separately identifiable. Wooldridge (2005) notes that the exclusion of z2i from the structural equation allows for identification of the general control function considered in equation (2.11). I will use the more general CMR from Kim and Petrin (2017) to show identification of the general control function, which also helps to illustrate which general control functions are or are not identified.

2.4 General Control Function

The previous section set up the framework for using a general control function approach in a binary response model. In contrast to other control function methods, the general control function allows for the relaxation of the CF-CI assumption. This section is composed of two parts. The first part explains how identification can still be obtained under the CMR and how the CMR relates to the other control function assumptions in the literature. The second part is a short simulation to illustrate how the general control function aids in estimation when CF-CI does not hold but the true data structure satisfies the CMR. In this simulation I emulate the application in Petrin and Train (2010) concerning the demand for cable as the empirical context.
2.4.1 Identification

Recently, there has been growing interest in the question of identification for the control function approach in non-parametric non-separable triangular simultaneous equation models.9 However, the discussion usually starts with independence assumptions between the instruments and the unobservables. Then one searches for a control function that will satisfy strong identification assumptions such as CF-CI in BP or CF-CI and monotonicity in Imbens and Newey (2009). In the setting considered here, I allow for a more flexible relationship between the instruments and the unobserved heterogeneity and then allow for a general control function, h(v2i, zi)γo, that can address endogeneity in a flexible manner. Of course, since I am only concerned with a binary response model, there are gains to knowing the structure of the non-separability in the outcome equation.

The main concern for identification is separately identifying the mean effect xiβo and the control function h(v2i, zi)γo. Because both of these terms are perfectly determined by zi and v2i, without any additional assumptions on the construction of h(v2i, zi), perfect multicollinearity is possible such that the parameters βo and γo are not identified.10 When linearity of the control function is imposed, as in Assumption 2.3.1, identification requires that E((xi, h(v2i, zi))′(xi, h(v2i, zi))) is non-singular.11 However, that does not place clear restrictions on the composition of the control function. The following assumption provides lower level conditions in which identification is shown.

Assumption 2.4.1. Let πo ∈ Π and βo, γo, δo ∈ Θ where Π and Θ denote the respective parameter spaces.

(i) E(m(zi)′m(zi)) is non-singular.

(ii) E(xi′xi) is non-singular and the variance-covariance matrix of E(xi|zi) has full rank.

(iii) E(h(v2i, zi)′h(v2i, zi)) is non-singular.

(iv) (CMR) E(h(v2i, zi)|zi) = 0.

(v) g(y2i, zi) consists of polynomial functions of the elements in (xi, h(v2i, zi)), does not include a constant, and E(g(y2i, zi)′g(y2i, zi)) is non-singular.

(vi) (βo′, γo′) is a non-zero vector.

The first condition ensures identification of the first stage parameters. The next three conditions are used to show E((xi, h(v2i, zi))′(xi, h(v2i, zi))) is non-singular. The CMR is the more realistic identification assumption used in Kim and Petrin (2017). The last two conditions help in showing identification in the highly non-linear heteroskedastic Probit model. The following theorem states the identification result.

Theorem 2.4.1. In the set-up described by equation (2.1) and Assumption 2.3.1, if Assumption 2.4.1 holds then the parameters πo and (βo, γo, δo) are identified.

The proof of Theorem 2.4.1 is provided in the Appendix. The CMR approach to obtaining identification using a control function is adopted from Kim and Petrin (2017), where they show non-parametric identification following a control function approach in a triangular system with an additively separable error.12 The CMR can be interpreted as a way to distinguish between the endogeneity of y2i and the "non-endogeneity" of zi.

9Imbens and Newey (2009), Kasy (2011), Hahn and Ridder (2011), Blundell and Matzkin (2014), Chen, Chernozhukov, Lee, and Newey (2014), Torgovitsky (2015), and D'Haultfœuille and Février (2015).

10For example, if xi = (1, z1i, y2i) then a general control function of the form h(v2i, zi) = (z1i, z2i, v2i) creates perfect multicollinearity. Even when z1i is excluded from the general control function (so xi and h(v2i, zi) do not include the same terms), there is multicollinearity when y2i = π1 + z2iπ2 + v2i.

11Alternatively, if one were to assume that the control function and the heteroskedastic function were non-linear functions, then one can verify the rank conditions from Rothenberg (1971) for identification.
By the law of iterated expectations,

E(u1i|zi) = E( E(u1i|zi, v2i) | zi ) = E( h(v2i, zi)γo | zi ) = 0

The middle equality holds by the specification provided in Assumption 2.3.1 and the last equality holds by the CMR. As a result, the CMR only implies that zi is mean independent of u1i and does not require any stronger forms of independence. This is a fairly standard and weak exogeneity assumption on an instrument. In practice, if one is concerned that this restriction is violated, then the included instruments, z1i, should be treated as endogenous variables and the excluded instruments, z2i, should not be used as valid instruments.

To provide some intuition for the implications: because v2i is mean independent of zi, the CMR requires each element of h(v2i, zi) to include functions of v2i and to be conditionally demeaned. For instance, v2i² could not be an element of the control function, but v2i² − E(v2i²|zi) could be. In addition, no element can be a function of zi alone; the instruments can only enter as interactions with functions of v2i. Notice that in the examples provided in the previous section the general control functions satisfy the CMR. This prevents any issues of linear dependence between elements of xi and h(v2i, zi).

12Hahn and Ridder (2011) show that a "Conditional Mean Restriction" is insufficient for identifying the ASF in a general non-parametric non-separable model. However, I would like to be clear that the CMR they consider is E(y1i − Ψ(xi)|zi) = 0 where Ψ(xi) is the unknown ASF. This differs from the CMR considered here, which is on the latent error. Although the binary response model is non-separable, since the latent error is additively separable from the mean component xiβo within the indicator function, identification follows analogously from Kim and Petrin (2017).
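The conditional demeaning requirement can be implemented by projecting functions of the residual on the instruments. The sketch below (specifications are my own, and Ê(v2i²|zi) is approximated linearly in zi rather than fully non-parametrically) builds the term v2i² − Ê(v2i²|zi) and checks that, unlike v2i² alone, it is uncorrelated with the instrument.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100000
z = np.column_stack([np.ones(n), rng.normal(size=n)])
v2 = np.exp(0.3 * z[:, 1]) * rng.normal(size=n)   # heteroskedastic first-stage error

# invalid element: v2^2 alone (its conditional mean depends on z)
# valid element:   v2^2 - E_hat(v2^2 | z), conditionally demeaned
coef, *_ = np.linalg.lstsq(z, v2**2, rcond=None)  # linear approx of E(v2^2|z)
h_term = v2**2 - z @ coef

print(np.corrcoef(z[:, 1], v2**2)[0, 1])   # noticeably nonzero
print(np.corrcoef(z[:, 1], h_term)[0, 1])  # approximately zero
```

The demeaned term can then be interacted with functions of the instruments, exactly as the CMR permits.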
Wooldridge (2005) explains that identification holds given an exclusion restriction on the instruments z2i in the structural equation that creates variation in the control variate unexplained by xi.13 Consequently, the extra variation in the control variate needs to be used to identify the parameters in the general control function. This can be demonstrated explicitly under the assumptions stated above. Let (a′, b′)′ be a non-random vector such that

(xi, h(v2i, zi)) (a′, b′)′ = xi a + h(v2i, zi) b = 0

Taking the conditional expectation with respect to zi,

E(xi|zi) a + E(h(v2i, zi)|zi) b = E(xi|zi) a = 0

Because the variance-covariance matrix of E(xi|zi) is full rank, a is a zero vector and it follows that b is also a zero vector.

Now how does the CMR compare to CF-CI? In the heteroskedastic bivariate Probit example it is easy to see how CF-CI is violated while the CMR continues to hold. However, the CMR is not strictly weaker in the technical sense that CF-CI implies the CMR.14 However, given the earlier discussion, the CMR is more in line with our prior beliefs about what endogeneity is. Consider the following example in which CF-CI holds but the CMR does not. Let E(u1i|zi, v2i) = v2i + v2i² − σv² where σv² = Var(v2i), so that CF-CI holds. But suppose there is heteroskedasticity in the first stage such that E(v2i²|zi) ≠ σv², which implies

E(v2i + v2i² − σv² | zi) = E(v2i²|zi) − σv² ≠ 0

and the CMR does not hold. Interpretively, what does this mean?

13An alternative identification strategy is used in Escanciano, Jacho-Chávez, and Lewbel (2016) that does not require an exclusion restriction on the instruments z2i. But in their setting, identification is dependent on non-linearity in the reduced form and they still impose CF-CI as a control function assumption.

14I would like to thank David Kaplan for pointing this out.
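The counter-example can be checked numerically. Below (the binary instrument and the variance values are my own choices), the conditional mean of u1i given v2i is the same function of v2i for both values of the instrument, so the CF-CI mean restriction holds, yet E(u1i|zi) differs from zero because E(v2i²|zi) ≠ σv².

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400000
z = rng.choice([0.0, 1.0], size=n)          # binary instrument
sd = np.where(z == 1, 2.0, 1.0)             # first-stage heteroskedasticity
v2 = sd * rng.normal(size=n)

sigma_v2 = v2.var()                         # unconditional variance, about 2.5
u1_mean = v2 + v2**2 - sigma_v2             # E(u1|z, v2): a function of v2 only

# CMR fails: E(u1|z) = E(v2^2|z) - sigma_v2 is nonzero at each value of z
print(u1_mean[z == 0].mean())   # about 1 - 2.5 = -1.5
print(u1_mean[z == 1].mean())   # about 4 - 2.5 = +1.5
```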
It means that the specification of endogeneity is quadratic in the first stage residual while the quadratic term is in deviations from the unconditional variance, even when there is heteroskedasticity in the first stage. So for the CMR to fail, the endogeneity must depend on v2i² − σv² instead of v2i² − E(v2i²|zi), deviations from the conditional variance. Therefore I would argue that although the CMR is not strictly weaker than CF-CI, it better reflects how we perceive endogeneity and is much more plausible to hold in empirical settings.

By now, it seems that the relaxation of CF-CI, especially compared to the parametric approach of Rivers and Vuong (1988), is fairly straightforward. Putting aside heteroskedasticity for a moment, the difference is only between including the control variate ˆv2i as an additional covariate, as suggested in Rivers and Vuong (1988), or including terms such as (ˆv2i, ˆv2i² − Ê(v2i²|zi), zi ˆv2i), as in the general control function approach proposed in this chapter. One may wonder whether the relaxation of CF-CI to allow for a general control function is really necessary and whether it would have an impact empirically. The following simulation aims to show the importance of allowing for a general control function when it is called for. The results of the simulation suggest that there is a high cost to not specifying a general control function when CF-CI fails, but there is very little cost in allowing for a more flexible specification of the general control function when it is not truly present. The detrimental impact of presuming CF-CI when it does not hold is seen not only in the parameter estimates but in economic objects of interest such as the estimated choice probabilities and price elasticities.

2.4.2 Simulation: General Control Function in the Demand for Premium Cable

The data generating process will emulate the setting described in Example 1 above, which is a simplification of the application given in PT.
Recall that in this example I wish to estimate the demand for premium cable (conditional on already selecting cable as the television provider) but am concerned that price is endogenous and correlated with unobserved attributes. The latent utility function given in equation (2.13) is a function of product characteristics, such as the number of channels (z11m), and individual characteristics of the consumers (z12i), including income, a single-family-household indicator, a rent indicator, age, and age squared. Building on the example of advertisement and marketing (part of the unobserved product attributes), I interact ξm with product characteristics (number of channels) and individual characteristics (age). For simplicity, it is assumed that there is an exogenous cost shifter that acts as a valid instrument. As in PT, price is interacted with 5 income-level dummies to allow the price elasticity of premium cable to differ by income level. A discussion of the construction of the variables as well as a table of summary statistics is provided in Appendix C. As mentioned previously, it is important to note that the data generating process specifies a general control function that satisfies the CMR but does not satisfy CF-CI.

Table E.2 provides the parameter estimates for the different Logit specifications. As found in PT, without addressing any endogeneity (column (1)), there is actually a positive effect of price for the higher income groups (in this simulation only the highest income group) and a negative effect for the number of cable channels offered. Addressing endogeneity by including just the control variable (column (2)), as in PT, significantly strengthens the coefficient estimate on price and alters the coefficient estimate on number of channels to be positive.
This is because price is strongly correlated with the number of channels offered, and therefore addressing the endogeneity in price also affects the estimated coefficient on number of channels. But allowing for the general control function in columns (3) and (4), the parameter estimates are much closer to their true values. For instance, the number of channels becomes much less impactful once the unobserved attributes of the premium channels, such as advertisement, are controlled for. In addition, the income effects are slightly higher than they are in column (2).

The difference between columns (3) and (4) is that in column (3) the general control function is correctly specified by including interactions between the control variate and the relevant instruments (number of channels and age), while in column (4) the general control function includes additional terms that are not actually relevant, such as interactions between the control variate and household size and income. This explores the realistic situation in which researchers would not typically have prior knowledge of which terms to include in the general control function. The simulation results illustrate that there is very little loss of precision in the parameter estimates when one over-specifies the general control function (column (4)). This shows there is very little cost to allowing the flexibility in estimation even when the true form may be simpler.

But these parameter estimates provide little interpretative value. Usually of more interest are the choice probabilities, which in this binary context correspond to the ASF (the derivation of the ASF will be discussed in more detail in Section 4). Figure D.2 illustrates how the ASF varies over price for an additional 5 channels of premium cable, assuming the individual is 35 years old in a family of 3 with income equal to $85,000. Estimates from a linear probability model, OLS and 2SLS (in orange), are also included as a comparison.
OLS and Logit (dotted lines), which do not address endogeneity, result in an upward-sloping ASF, while the remaining estimators more realistically produce a downward-sloping ASF. The correctly-specified Logit (GCF) and over-specified Logit (Over) both follow the true ASF quite closely. Although Logit (CV) performs better compared to Logit or the linear specifications, there is still some cost to not allowing for a flexible control function.

The price elasticity of demand for premium cable is calculated as

Elasticity = E[ (∂E(y1i|z1i, p)/∂p) × p / E(y1i|z1i, p) ]   (2.19)

The linear probability models estimated by OLS or 2SLS produce a conditional mean, E(y1i|z1i, p), that is neither strictly positive nor bounded below 1, which results in imprecise and extreme elasticity estimates. Table E.3 presents the estimated price elasticities for the different estimation procedures. OLS and 2SLS unsurprisingly provide poor estimates, and Logit, which does not address endogeneity, greatly underestimates the price elasticity as inelastic. The Logit (CV) estimate is in a similar range to that produced in PT, but the specifications that allow for more flexibility, Logit (GCF) and Logit (Over), are much closer to the true value. Again, as seen in the parameter estimates, there is very little cost in terms of efficiency to including more terms in Logit (Over) when a simpler control function is the true specification.

Now that the general control function is shown to be consequential and the complications concerning identification have been addressed, consistency of the estimation procedure is ensured under standard regularity conditions. The next section discusses the estimation procedure in more detail and derives consistent estimates of the asymptotic variance. Since the parameters are usually of little interest in a latent variable model, the next section also discusses the formulation, identification, and estimation of the ASF and APE using the proposed estimation procedure.
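As a concrete illustration of equation (2.19), the sketch below computes the sample analogue of the elasticity for a logit choice probability with a hypothetical linear index; the coefficients and data are invented for illustration and are not the simulation design of this section.

```python
# Sample analogue of Elasticity = E[(dE(y|z,p)/dp) * p / E(y|z,p)] for a logit
# response; all names and parameter values here are illustrative assumptions.
import numpy as np

def logit_price_elasticity(X, p, beta, beta_p):
    """Average elasticity for P = Lambda(X @ beta + p * beta_p)."""
    index = X @ beta + p * beta_p
    P = 1.0 / (1.0 + np.exp(-index))
    dPdp = P * (1.0 - P) * beta_p        # logit derivative with respect to price
    return np.mean(dPdp * p / P)

rng = np.random.default_rng(2)
n = 10_000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
p = np.exp(0.2 * rng.standard_normal(n) + 1.0)   # strictly positive prices
elas = logit_price_elasticity(X, p, np.array([0.5, 0.3]), beta_p=-0.8)
print(round(elas, 2))                    # negative: demand slopes downward
```

Because the logit probability is strictly inside (0, 1), the ratio p/P is always well defined, unlike in the linear probability model criticized above.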
2.5 Estimation and Interpretation

The estimation procedure proposed in this chapter for the parametric model is a standard two-step estimator. In the first step, the conditional mean function E(y2i|zi) = m(zi)πo is estimated using standard LS regression techniques.15 The control variable is constructed from the reduced form residuals, v̂2i = y2i − m(zi)π̂, and used in the second step. In the second step, one maximizes the following likelihood

L(y1i, xi, zi; π̂, β, γ, δ) = Σ_{i=1}^{n} { y1i log Φ[ (xiβ + h(v̂2i, zi)γ) / exp(g(y2i, zi)δ) ] + (1 − y1i) log( 1 − Φ[ (xiβ + h(v̂2i, zi)γ) / exp(g(y2i, zi)δ) ] ) }   (2.20)

with respect to β, γ, and δ to obtain estimates of the parameters. In addition to relaxing assumptions in the literature, the proposed estimation procedure is quite simple to implement using commands from standard statistical packages.16 However, the estimated standard errors need to be adjusted to account for the variation from using the residual from the first stage as an approximation for the control variate. Asymptotic variance formulas that account for the multi-step approach are given in the next section, although a common alternative would be to bootstrap the standard errors.

15 Alternatively, one may consider a non-parametric first-stage regression to obtain estimates of the conditional mean function. Using sieves, asymptotic results would follow directly from Newey (1994), which differs from the asymptotic theory presented in this chapter. However, Ackerberg, Chen, and Hahn (2012) explain that the asymptotic variance estimator under the framework of the semi-parametric plug-in two-step estimator is numerically equivalent to the asymptotic variance estimator in the parametric framework as long as the parametric specification is flexible enough.

16 For example, the parameter estimates can be obtained using the reg and hetprobit commands in Stata.
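The two steps above, and the pairs bootstrap mentioned as an alternative to the analytic standard errors, can be sketched in a few lines. The sketch below is a simplified version of my own: the heteroskedastic scale is omitted (δ = 0), h(v̂2i, zi) = v̂2i, and the toy data generating process is invented for illustration, not the chapter's full specification.

```python
# Sketch of the two-step estimator: OLS first stage, probit second stage with
# the estimated control variate included.  Both steps are re-run inside each
# bootstrap resample so the standard errors reflect the generated-regressor
# variation.  Simplified illustration only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def two_step(y1, y2, Z):
    pi_hat, *_ = np.linalg.lstsq(Z, y2, rcond=None)   # step 1: first-stage OLS
    v_hat = y2 - Z @ pi_hat                           # control variate
    X = np.column_stack([np.ones(len(y1)), y2, v_hat])
    def nll(beta):                                    # step 2: probit MLE
        p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)
        return -np.sum(y1 * np.log(p) + (1 - y1) * np.log(1 - p))
    return minimize(nll, np.zeros(3), method="BFGS").x

rng = np.random.default_rng(3)
n = 4_000
z = rng.standard_normal(n)
v2 = rng.standard_normal(n)
y2 = 1.0 + 0.8 * z + v2                               # first-stage equation
y1 = (0.2 + 1.0 * y2 + 0.5 * v2 + rng.standard_normal(n) > 0).astype(float)
Z = np.column_stack([np.ones(n), z])

theta_hat = two_step(y1, y2, Z)                       # (const, y2, v-hat)
draws = [two_step(y1[idx], y2[idx], Z[idx])
         for idx in (rng.integers(0, n, n) for _ in range(50))]
se = np.std(draws, axis=0, ddof=1)                    # pairs-bootstrap SEs
print(np.round(theta_hat, 2), np.round(se, 3))
```

Resampling (y1i, y2i, zi) jointly and repeating both steps is what makes the bootstrap account for the first-stage estimation error that naive second-stage standard errors ignore.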
As for consistency and asymptotic normality, this is a simple application of MLE to a heteroskedastic Probit model with a generated regressor, for which asymptotics are well-established. The next subsection provides the asymptotic properties and the asymptotic variance derivation under a two-step M-estimation framework.

2.5.1 Asymptotic Properties

The two-step estimator can be written in a GMM framework by stacking the moment conditions,

E( M(y1i, y2i, zi; πo, βo, γo, δo) ) = E[ ( (y2i − m(zi)πo)m(zi)′ ; Si(πo, βo, γo, δo) ) ] = 0   (2.21)

where Si(π, β, γ, δ) = ∂L(y1i, xi, zi; π, β, γ, δ)/∂θ denotes the score,

Si(π, θ) = [ (y1i − Φi(π, θ)) φi(π, θ) / ( Φi(π, θ)(1 − Φi(π, θ)) exp(g(y2i − m(zi)π, zi)δ) ) ] × ( xi′ ; h(y2i − m(zi)π, zi)′ ; −(xiβ + h(y2i − m(zi)π, zi)γ) g(y2i − m(zi)π, zi)′ )

and Φi(·) and φi(·) are shorthand for the standard normal CDF and PDF evaluated at the index (xiβ + h(v2i, zi)γ)/exp(g(y2i, zi)δ). Note that estimation using the stacked moment conditions is equivalent to the two-step approach previously described. Although the GMM framework is useful for deriving the asymptotic variance of the estimator, it is suggested to use the two-step approach in implementation to avoid issues of slow convergence.

Let θ′ = (β′, γ′, δ′) and let Π and Θ denote the parameter spaces of π and θ respectively. Consistency follows from Theorem 2.6 of Newey and McFadden (1994).

Theorem 2.5.1. In the set-up described by equation (2.1) where Assumptions 2.3.1 and 2.4.1 hold, if πo ∈ Π and θo ∈ Θ, both of which are compact, then the GMM estimators that solve

(π̂, θ̂) = argmin_{(π,θ)∈Π×Θ} [ n⁻¹ Σ_{i=1}^{n} M(y1i, y2i, zi; π, θ) ]′ [ n⁻¹ Σ_{i=1}^{n} M(y1i, y2i, zi; π, θ) ] + op(1)   (2.22)

where M(y1i, y2i, zi; π, θ) are the stacked moment conditions in equation (2.21), are consistent: π̂ − πo = op(1) and θ̂ − θo = op(1).

The proof is provided in the appendix.
Asymptotic normality follows from Theorem 6.1 in Newey and McFadden (1994).

Theorem 2.5.2. In the set-up described by equation (2.1) where Assumptions 2.3.1 and 2.4.1 hold, if πo ∈ int(Π) and θo ∈ int(Θ), both of which are compact, then for (π̂, θ̂) that solves equation (2.22),

√n(θ̂ − θo) →d N(0, V) where V = G2θ⁻¹ E( Ξi(πo, θo) Ξi(πo, θo)′ ) G2θ⁻¹′   (2.23)

where Ξi(πo, θo) = Si(πo, θo) + G2π G1π⁻¹ (y2i − m(zi)πo)m(zi)′ and

( G1π  G1θ ; G2π  G2θ ) = E( ∇(π,θ) M(y1i, y2i, zi; πo, θo) )

is defined in detail in the appendix.

The proof is provided in the appendix. Note that the asymptotic variance takes into account the variation introduced from the first stage. A consistent estimator for the asymptotic variance is the method of moments estimator that replaces all the unknown parameters with their consistent estimates and uses sample averages in place of expectations.

Although this section provides consistency and √n-asymptotic normality for the second-stage parameter estimates, the parameters themselves bear very little interpretative value. The next two subsections discuss the derivation of the ASF and the APE and their importance for economic interpretation. These structural objects are magnitudes of effects that empirical researchers can use to discuss the effectiveness of a particular policy or the average probability of a successful outcome for an individual with a particular set of characteristics.

2.5.2 Average Structural Function

Researchers are often interested in using the data and model estimates to infer the average predicted probability of success at a particular point of the observed data. When there is no endogeneity, this quantity is easily described by the conditional mean, which in the case of binary response is equivalent to the propensity score.
As explained in BP, when endogeneity is present, the conditional mean is unable to capture the structural relationship between the endogenous variable and the outcome. In particular, most studies wish to uncover the effect of a structural intervention on the endogenous variable on the outcome, while the conditional mean can only capture a reduced form effect of changes in the instruments. For clarification, consider a simple linear structural equation.

yi = xiβo + ui   (2.24)

Without endogeneity, E(ui|xi) = 0 and the interpretation of the average outcome for a given observation xo is simply the conditional mean: xoβo. The corresponding partial effect would be the slope parameter βo. But when endogeneity is introduced, E(ui|xi) ≠ 0, the conditional mean is composed of two parts:

E(yi|xi = xo) = xoβo + E(ui|xi = xo)   (2.25)

The first component is the structural direct effect of xi while the second component is the endogenous indirect effect of xi due to the presence of endogeneity. For instance, consider the ubiquitous example of returns to education, where education is endogenous due to unobserved ability. Then the structural direct effect is the average wage for a particular education level (independent of ability) and the endogenous indirect effect is the contribution to wages of average ability at that education level. But BP argue that one should only be interested in the structural direct effect: if one were to consider a policy intervention on the level of education (i.e., mandatory schooling), there would be no changes in the distribution of ability, and therefore one would only want to capture the structural direct effect.

To derive the ASF, BP instruct that one should integrate over the unconditional distribution of the unobserved heterogeneity in the structural equation.
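The decomposition in equation (2.25) can be seen in a small simulation of an invented linear DGP in which xi and ui share a common factor, so that E(ui|xi) = 0.25 xi and the local conditional mean of yi exceeds the structural part xoβo.

```python
# Numeric sketch of E(y | x = x^o) = x^o * beta + E(u | x = x^o) from (2.25);
# the DGP (common factor c driving both x and u) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000
c = rng.standard_normal(n)               # common factor creating endogeneity
x = c + rng.standard_normal(n)
u = 0.5 * c + rng.standard_normal(n)     # implies E(u | x) = 0.25 * x
beta = 2.0
y = beta * x + u

x_o = 1.0
near = np.abs(x - x_o) < 0.05            # local average around x = x^o
print(round(y[near].mean(), 2), beta * x_o)   # conditional mean vs. x^o * beta
```

The conditional mean comes out close to 2.25 = xoβo + 0.25 xo rather than 2.0: the structural direct effect and the endogenous indirect effect are bundled together, which is exactly what the ASF is designed to strip out.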
If the structural equation (2.24) includes an intercept, then E(ui) = 0 and the ASF is xoβo, not equal to the conditional mean but still the same as in the case of no endogeneity.

Next, extend the analysis to the binary response model.

yi = 1{xiβo + ui > 0}   (2.26)

When there is independence between the latent error ui and the regressors xi, the conditional mean – equivalent to the propensity score – is

E(yi|xi = xo) = F−u(xoβo)   (2.27)

which calculates the probability of success for an individual with characteristics xo. Now consider the case when there is no longer independence between the latent error and the regressors, so the unconditional CDF is not equal to the conditional CDF; i.e., F−u(−u) ≠ F−u|x(−u; x), where F−u|x(·;·) is the conditional CDF in which the first argument is the point of evaluation and the second argument is the conditioning argument. One can understand the violation of independence either through the standard interpretation of endogeneity, E(ui|xi) ≠ 0, or possibly through endogeneity at higher moments, such as heteroskedasticity, Var(ui|xi) ≠ Var(ui). Then the propensity score is

E(yi|xi = xo) = F−u|x(xoβo; xo)   (2.28)

in which the first argument in F−u|x(xoβo; xo) is the point of evaluation, which corresponds to the structural direct effect, and the second argument is the conditioning argument, which corresponds to the endogenous indirect effect. As in the linear case, the conditional mean does not capture a structural interpretation. Therefore, to obtain the ASF, one can integrate over the unconditional distribution of the unobserved heterogeneity to obtain F−u(xoβo). Now the ASF only captures the structural direct effect of xi and is not clouded by the influence of endogeneity. However, in calculating the ASF, the unconditional distribution of ui is usually unknown, or at least not specified when estimating the structural parameters βo.
Wooldridge (2005) studies the ASF in more depth and provides a more rigorous investigation of its derivation. Using the same notation as above, the structural model of interest is E(yi|xi, ui) = µ1(xi, ui), where xi are observed covariates and ui is unobserved heterogeneity. Then the ASF is defined as ASF(xo) = Eu(µ1(xo, ui)), where the subscript u emphasizes that the expectation is taken with respect to the unconditional distribution of ui. Using Lemma 2.1 from Wooldridge (2005), which is essentially an application of the law of iterated expectations, the ASF can also be calculated from

ASF(xo) = Ew(µ2(xo, wi)),  µ2(xo, wi) = ∫_U µ1(xo, u) fu|w(u; wi) η(du)

where U is the support of ui and fu|w(·;·) is the conditional density of the unobserved heterogeneity ui given wi with respect to a σ-finite measure η(·). Essentially, one can use a conditioning argument wi to help identify the ASF. In many instances the conditioning argument wi will include components of the covariates xi, but it is important to note that the evaluation of the ASF requires the ability to distinguish between the point of evaluation xo and the conditioning argument wi.17 This will be important when I discuss the implications of the CF-LI assumption on the derivation and estimation of the ASF. To apply Lemma 2.1 from Wooldridge (2005), the following conditions must hold:

(i) (Ignorability) E(yi|xi, ui, wi) = E(yi|xi, ui)

(ii) (Conditional Independence) D(ui|xi, wi) = D(ui|wi)

Notice that conditional independence in this context is with respect to the conditioning argument wi, which has yet to be specified in our context. When the conditioning argument is simply the control variate, v2i, then it is in fact the same conditional independence assumption of BP. Therefore the CF-CI assumption of BP is also used to obtain identification of the ASF. But, as seen when showing identification of the parameters, is CF-CI really necessary?
Consider the conditioning argument as both the control variate v2i and the instruments zi. This easily satisfies the ignorability assumption

E(y1i|xi, u1i, v2i, zi) = E(y1i|xi, u1i)

given ignorability of the excluded instruments z2i, and automatically satisfies this version of conditional independence,

D(u1i|xi, v2i, zi) = D(u1i|v2i, zi)

since xi is composed of v2i and zi. Then under Assumption 2.3.1,

µ2(xo, (v2i, zi)) = ∫_ℝ 1{xoβo + u > 0} fu|v,z(u; v2i, zi) du = Eu(1{xoβo + u > 0}|v2i, zi) = Φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) )

and the ASF is

ASF(xo) = Ev2,z(µ2(xo, (v2i, zi))) = Ev2,z[ Φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) ]   (2.29)

where the expectation is taken with respect to the unconditional distribution of v2i and zi. A consistent method of moments estimator replaces the unknown parameter values with their consistent estimates, (π̂, β̂, γ̂, δ̂), and takes sample averages in place of the expectation. Therefore identification of the ASF is still possible without the CF-CI assumption of BP. Next, examine the derivation of the ASF in the BP framework where CF-CI is assumed.

17 Wooldridge (2005) considers the example of the heteroskedastic Probit model where, in equation (2.26), it is assumed ui is normally distributed with Var(ui|xi) = exp(2xiδ). Then the covariates xi are used as the conditioning argument (i.e., wi = xi) such that

ASF(xo) = Exi( ∫_ℝ 1{xoβo + u > 0} fu|x(u; xi) du ) = Exi( Φ( xoβo / exp(xiδ) ) )

where the expectation is taken with respect to the xi in the heteroskedastic function (part of the conditioning argument) and not with respect to the structural direct effect of xo. Therefore, even when the conditioning argument is the same as the covariates in the structural equation, it is necessary to be able to distinguish between the two when composing the ASF.
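The method of moments estimator of the ASF in equation (2.29) is a straightforward sample average. In the sketch below, h and g are reduced to single illustrative terms (h = γv̂2i and g = δzi), an assumption of the sketch rather than the chapter's general specification, and the simulated (v̂2i, zi) stand in for first-stage output.

```python
# Sample analogue of ASF(x^o) in (2.29): average the probit response over the
# empirical distribution of (v-hat, z), holding the evaluation point fixed.
import numpy as np
from scipy.stats import norm

def asf(x_o, beta, gamma, delta, v_hat, z):
    """n^-1 * sum_i Phi((x^o beta + gamma * v_hat_i) / exp(delta * z_i))."""
    index = x_o @ beta + gamma * v_hat
    return np.mean(norm.cdf(index / np.exp(delta * z)))

rng = np.random.default_rng(5)
n = 100_000
v_hat = rng.standard_normal(n)           # stand-in for first-stage residuals
z = rng.standard_normal(n)
val = asf(np.array([1.0, 0.5]), np.array([0.2, 1.0]), 0.5, 0.0, v_hat, z)
print(round(val, 2))   # with delta = 0, roughly Phi(0.7 / sqrt(1.25))
```

With δ = 0 and standard normal v̂2i, the closed form E[Φ(a + bV)] = Φ(a/√(1 + b²)) gives a check on the sample average.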
Let G−u1|v2(·; v2i) be the CDF of −u1i given (v2i, zi), and recall that CF-CI implies that zi is excluded from the conditional distribution function. Then the ASF is easily calculated as

ASF_CF-CI(xo) = Ev2( G−u1|v2(xoβo; v2i) )   (2.30)

where the expectation is taken with respect to v2i. Comparing equations (2.29) and (2.30) highlights the impact of CF-CI on interpretation. Since there is endogeneity, the effect of xo on the predicted probability of success can be broken down between the structural direct effect and an endogenous indirect effect. The allure of the CF-CI assumption is that it immediately distinguishes between the two effects in the conditional distribution function G−u1|v2(xoβo; v2i), where the first argument captures the structural direct effect and the second argument should entirely control for the endogenous indirect effect. But when CF-CI fails, and this structure of the conditional CDF is still presumed, the lines between the structural direct effect and the endogenous indirect effect become blurred. Consequently, in estimation, the ASF calculated when incorrectly imposing CF-CI will not be able to correctly average out the endogenous indirect effect.

In a more flexible framework, Rothe assumes CF-LI, which slightly relaxes CF-CI by allowing the conditional distribution to be a function of the instruments through the linear index xiβo. Recall that CF-LI means D(u1i|v2i, zi) = D(u1i|v2i, xiβo). Using results from Manski (1988), identification of βo and of G_CF-LI(xiβo, v2i) = F−u1|v2,xβo(xiβo; v2i, xiβo), the conditional CDF of −u1i evaluated at xiβo, can be obtained. As mentioned before, the CF-LI assumption is still a fairly strong restriction on the conditional distribution of u1i|v2i, zi. Compared to the specification in Assumption 2.3.1, it requires the control function and the heteroskedastic function to be constructed with the linear index and not as more flexible functions of the instruments.
But for now, consider the most optimistic case where the CF-LI assumption holds; then how does one calculate and estimate the ASF? Again applying the framework provided in Wooldridge (2005), the conditioning argument now includes the control variate and the linear index, wi = (v2i, xiβo):

µ2(xo, wi) = ∫_ℝ 1{xoβo + u > 0} fu|v,xβ(u; v2i, xiβo) du = Eu(1{xoβo + u > 0}|v2i, xiβo) = F−u1|v2,xβo(xoβo; v2i, xiβo)

Notice that the linear index appears twice as an argument: first at the point of evaluation for the conditional CDF, xoβo (the structural direct effect), and then as part of the conditioning argument, xiβo (the endogenous indirect effect). Applying Lemma 2.1 from Wooldridge (2005), the ASF when the true data generating process satisfies the CF-LI assumption is

ASF_CF-LI(xo) = Ev2,xβo( F−u1|v2,xβo(xoβo; v2i, xiβo) )   (2.31)

where the expectation is taken with respect to the joint distribution of the conditioning arguments (v2i, xiβo). The immediate issue is that the ASF cannot be written in terms of the identified function G_CF-LI(xiβo, v2i) that is estimated using the proposed SML estimator in Rothe. The identified function is the conditional CDF evaluated at and conditioned on the same linear index. Therefore one cannot distinguish between the direct structural effect and the indirect endogenous effect of the linear index. This reiterates the importance of being able to separately identify the conditioning argument from the point of evaluation for estimation. Rothe suggests using Ev2(G_CF-LI(xoβo, v2i)) as the ASF, but this only averages out the part of the endogenous indirect component due to v2i and does not average out any of the effect due to the linear index. Therefore, the ASF proposed by Rothe is equal to the true ASF only when the CF-CI assumption of BP holds.
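The gap between the true ASF in (2.31) and the object Ev2(G_CF-LI(xoβo, v2i)) can be made concrete with a small numerical example of my own. Suppose u1i|v2i, xiβo ∼ N(ρv2i, exp(2δ·xiβo)), so CF-LI holds; the true ASF averages the heteroskedastic scale over the population distribution of the linear index, while the AIF-style object freezes the scale at the evaluation point xoβo. The normal DGP and parameter values below are illustrative assumptions, not the chapter's simulation design.

```python
# ASF vs. the AIF-style average under CF-LI; illustrative DGP of my own.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 200_000
xb = rng.standard_normal(n)              # population draws of x_i * beta_o
v2 = rng.standard_normal(n)
rho, delta = 0.5, 0.4                    # endogeneity and heteroskedasticity
xb_o = 1.0                               # evaluation point x^o * beta_o

asf = np.mean(norm.cdf((xb_o + rho * v2) / np.exp(delta * xb)))   # averages scale
aif = np.mean(norm.cdf((xb_o + rho * v2) / np.exp(delta * xb_o))) # freezes scale
print(round(asf, 2), round(aif, 2))      # the two objects clearly differ
```

Only the argument of the heteroskedastic scale differs between the two lines, yet the averages differ noticeably, which is the identification failure described above.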
So although it may be tempting to consider the CF-LI assumption as a compromise that allows for flexibility in the relationship between the unobserved heterogeneity and the instruments, the true ASF is not identified under the CF-LI assumption. In fact, the ASF proposed by Rothe is estimating the AIF of Lewbel, Dong, and Yang (2012), who suggest using it as an alternative to the ASF since it is generally much easier to identify. They define the AIF as

AIF(xo) = E(1{xiβo + u1i > 0}|xiβo = xoβo) = Ev2( G_CF-LI(xoβo, v2i) )   (2.32)

Given the choices of propensity score, ASF, and AIF as possible ways to interpret the estimates of the model, how should one proceed? Lin and Wooldridge (2015) address this issue by comparing these functions, proposed in the context of binary response, in the linear regression case (as in equation (2.24)) to see if they uncover the direct structural component, xoβo. Earlier in this section, the propensity score was shown not to reflect the mechanisms researchers are interested in when endogeneity is present. Lin and Wooldridge (2015) also show this for the AIF, explaining that "the AIF suffers from essentially the same shortcomings as the propensity score because it is affected by correlation between the unobservables and the observed [endogenous explanatory variables]."

The next section discusses the derivation of the APEs. Again, the APEs should isolate the structural impact of varying a particular covariate and therefore should be derived from the ASF. Consequently, presuming CF-CI or CF-LI affects the derivations and interpretations of the APEs.

2.5.3 Average Partial Effects

Similar to the interpretation of βo in a linear regression (as in equation (2.24)), the APE should capture the causal (structural direct) effect of a regressor on the outcome variable.
In the binary response framework, the parameters are identified only up to scale and therefore provide very little for interpretation and are generally not comparable across different specifications. Alternatively, the APE provides a comparable statistic that can be used for interpretation.

Let xo_j and βjo denote the jth elements of xo and βo respectively. Then the partial effect of the jth element of xi is defined as the partial derivative of the ASF with respect to xo_j. Under the setting considered in Assumption 2.3.1, the partial effect is

∂ASF(xo)/∂xo_j = ∂Ev2,z[ Φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) ] / ∂xo_j
= Ev2,z[ ∂Φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) / ∂xo_j ]
= Ev2,z[ φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) βjo / exp(g(y2i, zi)δo) ]

To obtain the APE, one plugs in xi for xo and averages over the joint distribution of (xi, v2i, zi):

APE = E[ φ( (xiβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) βjo / exp(g(y2i, zi)δo) ]   (2.33)

To estimate, use sample averages in place of expectations and consistent estimates of the parameters. Notice that the derivative is taken only with respect to the structural direct component, the argument of the ASF, but after the derivative is taken, one averages over the joint distribution of (xi, v2i, zi) in both the structural direct effect and the endogenous indirect effect together.

How does using the AIF instead of the ASF affect the APE derivation under the CF-LI assumption? First, the correct APE under the CF-LI assumption is

APE_CF-LI = ∂ASF(xi)/∂xji = E( f−u|v2,xβo(xiβo; v2i, xiβo) βjo )   (2.34)

where f−u|v2,xβo(·; v2i, xiβo) is the conditional PDF.
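The sample analogue of the APE in equation (2.33) mirrors the ASF computation. The sketch below uses an illustrative single-regressor setup with h = γv̂2i and g = δzi; these simplifications, and the simulated data, are assumptions of the sketch, not the chapter's general forms.

```python
# Sample analogue of the APE in (2.33): average the probit density times the
# scaled coefficient over the joint empirical distribution of (x, v-hat, z).
import numpy as np
from scipy.stats import norm

def ape(X, beta, gamma, delta, v_hat, z, j):
    """n^-1 * sum_i phi(index_i) * beta_j / exp(delta * z_i)."""
    scale = np.exp(delta * z)
    index = (X @ beta + gamma * v_hat) / scale
    return np.mean(norm.pdf(index) * beta[j] / scale)

rng = np.random.default_rng(6)
n = 100_000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
v_hat = rng.standard_normal(n)
z = np.zeros(n)                          # homoskedastic special case
val = ape(X, np.array([0.0, 1.0]), 0.5, 0.0, v_hat, z, j=1)
print(round(val, 2))
```

In the homoskedastic special case above, the closed form E[φ(W)] = 1/√(2π(1 + σ²)) for W ∼ N(0, σ²) gives a check on the sample average.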
Since the averaging is over the point of evaluation and the conditioning argument, one may be hastily optimistic in thinking this is identified from the conditional CDF, G_CF-LI(xiβo, v2i) = F−u1|v2,xβo(xiβo; v2i, xiβo). However, the correct PDF cannot be derived from this function since

∂G_CF-LI(xiβo, v2i)/∂[xiβo] ≠ f−u|v2,xβo(xiβo; v2i, xiβo)

Consequently, the APEs in Rothe are also incorrectly calculated from the AIF:

APE_AIF = E( (∂G_CF-LI(xiβo, v2i)/∂[xiβo]) βjo )   (2.35)

This discussion has provided a theoretical argument for the differences between the AIF and ASF and why one should prefer the ASF. But one may wonder whether all of this matters in practice. Once one starts averaging over components, minute differences in calculations may be diminished in their impact. Perhaps the AIF used by Rothe does a "good enough" job of approximating the true ASF. This is investigated in a simulation study in the next section, where the CF-LI assumption holds true in the underlying data generating process and the ASF from the proposed estimator and the AIF from the SML estimator proposed by Rothe are calculated and compared.18 The simulation results suggest that when the CF-LI assumption holds, both the proposed method and the SML estimator from Rothe perform quite well in terms of parameter estimates. However, there is a stark difference in ASF estimates, consistent with the previous analysis. In the simulation, the poor ASF estimates using the SML procedure can be entirely attributed to the fact that under CF-LI, the SML can only recover the AIF, which can be starkly different from the ASF.

2.5.4 Simulation: ASF Estimates for the Effect of Income on Home-ownership

This simulation models the home-ownership and income application in Rothe (2009) as a contextual setting.
Rothe uses a sample of '981 married men aged 30 to 50 that are working full time and have completed at most the lowest secondary school track of the German education system' from the 2004 wave of the German Socio-Economic Panel. The outcome y1i is an indicator that takes on the value 1 if an individual owns their home and 0 if they are renting. The included instruments z1i are the individual's age (z11i) and an indicator for the presence of children younger than 16 (z12i). The endogenous variable of interest, y2i, is household income, and the excluded instruments are indicators for the wife's education level (intermediate z21i and advanced z22i) and an indicator for her employment status (z23i). A more detailed discussion of the data generating process and a table of summary statistics are presented in the Appendix.

In this simulation the CF-LI assumption holds in the underlying data generating process such that the distribution of u1i conditional on v2i and the exogenous regressors zi is

u1i|zi, v2i ∼ N( v2iγ1o + v2i(xiβo)γ2o, exp(2(xiβo)δo) )

Motivated by the explanation in Example 2, by allowing for an interactive effect in the conditional mean function, I am allowing for interactions between the omitted variable, credit score, and the linear index. Heteroskedasticity is also introduced, which allows for variability in the variance of the unobservables conditional on observables. Since D(u1i|v2i, zi) = D(u1i|v2i, xiβo), the CF-CI assumption is violated but the CF-LI assumption holds.

18 Rothe (2009) also considers the case that the CF-LI assumption holds but CF-CI does not hold in the simulation study. The second design introduces heteroskedasticity, as a function of the linear index xiβo, in the unobserved latent error u1i. However, only results on coefficient estimates, and not the ASF or APE estimates, are reported.
This simulation will examine in more detail the CF-LI assumption as a relaxation of the CF-CI assumption, investigating whether the discussion of the ASF, AIF, and APE holds true in practice. Given the analysis of the previous section, the Rothe SML estimator should be able to estimate the parameters βo well but be unable to correctly calculate the ASF and APE because it cannot distinguish between the two effects (structural direct and endogenous indirect) of the linear index, xiβo. Implementation of the SML and the proposed estimator are explained in more detail in the appendix.

Table E.5 reports the coefficient estimates for the simulated data as well as the estimates in the Rothe application as a comparison.19 All second-stage coefficient estimates are normalized such that the coefficient on Children in the Household is one. The simulated data is not an exact replica, but the estimates are in the same range, and the changes in the coefficient estimates as one starts to control for the endogeneity all move in the same direction. As expected, the estimates for Het-Probit (GCF) – the proposed estimator – and SML, columns (8) and (9), are quite similar and close to the true values in column (10).

Figures D.3 and D.4 show the ASF estimates for a 40-year-old with children under the age of 16 in the household as the ASF varies over the endogenous regressor, log(total income). In Figure D.3, the OLS and Probit estimators perform poorly since they do not address endogeneity at all. The 2SLS and Probit (CV) estimates are much closer to the true ASF but predict a slightly flatter ASF. Recall from the earlier discussion that even if the SML produces consistent parameter estimates, it incorrectly estimates the ASF and consequently the APE. This is because, under the CF-LI assumption, the SML does not correctly average out the distribution of the unobserved heterogeneity.
Figure D.4 reports the true ASF, the AIF (not a structural object of interest), and the estimated ASFs for the proposed Het-Probit (GCF) and the semi-parametric SML. The true ASF correctly averages over the distribution of the unobserved heterogeneity, while the wrong ASF only averages out the v2i components of the unobserved heterogeneity. As expected, the proposed estimator does a good job estimating the true ASF, while the SML estimator does a good job estimating the AIF. In this simulation I find that there can be a fairly stark difference between the ASF and AIF, which means the differences in the estimators are consequential. For instance, the AIF would predict the average probability of home-ownership for an individual with a log total income of 7.65 to be 0.595, while the ASF would predict an average probability of 0.461, a substantial difference. This further reiterates the discussion in Section 4; i.e., even under the CF-LI assumption, the SML estimator is not capturing the true ASF. The simulated distribution of the APE is reported in Table E.6. The true APE is 0.6448, so the proposed Het-Probit (GCF) estimator has the closest mean, whereas the mean of the SML estimates is the third closest, following the mean of the 2SLS estimates. But the difference between these estimators in interpretation is minimal: a 10% increase in total income results in either a 0.0626 increase in the probability of home ownership according to the Het-Probit (GCF) or a 0.0699 increase according to Rothe's SML. The estimators that suffer the most are the ones that do not address the issue of endogeneity at all, OLS and Probit, which are distinctly biased downwards. Therefore I find in this simulation study that, under the CF-LI assumption, parameter estimates are similar across the two estimators. But when looking at the ASF, the estimates diverge significantly.

Footnote 19: SML and Het-Probit are estimated using a Nelder-Mead simplex method in Matlab.
I show that this can be entirely accounted for by the fact that the SML estimator is actually estimating the AIF, which is not equal to the ASF (and should not be interpreted in the same way).

2.6 Empirical Example

To showcase the estimator in an empirical example, I examine married women's labor force participation using 1991 CPS data (see footnote 20). All tables and figures referenced in this section can be found in Appendix D. Table E.7 provides some summary statistics for the data set. The dependent variable is Employed (= 1 when the individual is in the labor force), where approximately 58% of married women in the sample participate in the labor force. The last two columns divide the sample over the binary outcome and report the summary statistics for the other observable characteristics. The structural outcome equation is

\text{Employed}_i = 1\{\beta_1 + \text{nwifeinc}_i\beta_2 + \text{educ}_i\beta_3 + \text{exper}_i\beta_4 + \text{kidslt6}_i\beta_5 + \text{kidsge6}_i\beta_6 + \text{nwifeinc}_i \times \text{kidslt6}_i\,\beta_7 + \text{nwifeinc}_i \times \text{kidsge6}_i\,\beta_8 + u_{1i} > 0\}   (2.36)

where the economic interest is in estimating the effect of non-wife income on the probability of being in the labor force. Since there is a trade-off between work and leisure, by relaxing the budget constraint such that an individual has other sources of income, one would expect the individual to be less likely to work. From the summary statistics, those not working tend to have higher non-wife income. But this cannot be interpreted as a causal effect, since there is concern that other sources of income are endogenously determined with the wife's labor force participation. In particular, the husband's employment, which partly determines the non-wife income, would probably be decided simultaneously with the wife's employment.

Footnote 20: The data are part of the supplementary material provided with the textbook "Econometric Analysis of Cross Section and Panel Data" by Jeffrey Wooldridge, and can be downloaded at https://mitpress.mit.edu/books/econometric-analysis-cross-section-and-panel-data
Utilizing the husband's education level as an instrument, the causal effect of non-wife income on the wife's labor force participation can be parsed out. Since education and the probability of working are generally correlated, the instrument is easily argued to be relevant. In fact, the F-statistic of significance for the first stage is quite large, as seen in Table E.8 (see footnote 21). Excludability of the instrument follows from the argument that the husband's education level should not directly affect the wife's choice of labor force participation except through the channels by which it affects the non-wife income. The other controls considered in this example are the wife's education level, experience, and dummy variables for whether or not they have kids younger than 6 and kids 6 and older.

Table E.8 reports the reduced form coefficient estimates, the second stage parameter estimates for several different specifications of a Probit model, and the SML estimator proposed by Rothe. For the second stage estimates the coefficient on Education is normalized to 1, since the model is only identified up to scale. This allows for comparisons across the different estimators. The second column specifies a standard Probit model, which assumes no endogeneity and homoskedasticity in the latent error. This is slightly relaxed in column (3), where heteroskedasticity is allowed. The specifications in columns (4)-(8) all address endogeneity in one form or another. The fourth column corresponds to the setting of Rivers and Vuong (1988), where endogeneity is addressed by only including the control variable as an additional covariate, maintaining the CF-CI assumption.

Footnote 21: The standard benchmark of 10 from Stock, Wright, and Yogo (2002) only applies to the relative bias of 2SLS under homoskedasticity. It is an open area of research to determine benchmarks for non-standard cases.
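Returning to the first-stage relevance check mentioned above: the F-statistic has the standard form built from restricted and unrestricted sums of squared residuals. The sketch below runs a first-stage regression on simulated stand-in data (the numbers are hypothetical, not the CPS sample) with a single exclusion restriction.

```python
import random

random.seed(1)

# Stand-in data: husband's education and non-wife income (hypothetical DGP).
n = 500
huseduc = [random.choice(range(8, 18)) for _ in range(n)]
nwifeinc = [5.0 + 1.2 * e + random.gauss(0.0, 4.0) for e in huseduc]

# First-stage OLS of nwifeinc on a constant and the excluded instrument.
xbar = sum(huseduc) / n
ybar = sum(nwifeinc) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(huseduc, nwifeinc))
sxx = sum((x - xbar) ** 2 for x in huseduc)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

ssr_u = sum((y - b0 - b1 * x) ** 2 for x, y in zip(huseduc, nwifeinc))
ssr_r = sum((y - ybar) ** 2 for y in nwifeinc)   # instrument excluded

# F-statistic for the single exclusion restriction (q = 1, k = 2).
F = (ssr_r - ssr_u) / (ssr_u / (n - 2))
```

With a strong instrument by construction, F comfortably exceeds the usual benchmark of 10, though, as noted in footnote 21, that benchmark is only formally justified for 2SLS under homoskedasticity.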
The next three columns are variations on the proposed estimator, all of which relax the CF-CI assumption by allowing for heteroskedasticity in the latent error and/or allowing for a general control function. The final column presents results using the SML estimator (from Rothe), which imposes no distributional assumptions but requires either the CF-CI or CF-LI assumption. I find that addressing endogeneity with only a control variable reduces the effect of non-wife income (in columns (4) and (5)). But then allowing for a general control function, where the control variate interacts with the children dummies, raises the effect of non-wife income and also switches the signs of the interactions with the children (although these are not statistically significant). When a general control function is used without allowing for heteroskedasticity, the bootstrapped standard errors increase substantially. This is due to very small (almost 0) coefficient estimates for education, which blow up the scaled parameter estimates. When heteroskedasticity is allowed, the coefficient estimate on education becomes statistically different from 0, which results in lower standard errors for the scaled parameter estimates. Finally, the SML estimates are found to be quite similar to the proposed estimator's results in column (6). This would suggest that the CF-LI assumption may in fact hold in this setting. When looking at the control function parameters, I see particularly large effects when interacting the reduced form error v̂2i with the children dummies (in specification (7)). Intuitively this makes sense, since one would imagine that the endogenous decision making process of who in the household should work (husband, wife, or both) depends a lot on the presence of children in the household. For instance, if there are very young children in the household, then the trade-off is not just between work and leisure but must also consider the cost of childcare if both parents enter the workforce.
Therefore it would make sense that there is a negative interactive effect, such that when one partner is working the other is less likely to work, in order to provide childcare. Since this chapter proposes a more flexible specification, Table E.9 provides Wald test results for different specifications. The first four columns test the null hypothesis that non-wife income is in fact exogenous. One of the benefits of the control variable approach is the variable addition test it supplies: one can test the null hypothesis of no endogeneity by testing whether all the coefficients in the control function are 0. Under all combinations of modelling assumptions (such as homoskedasticity/heteroskedasticity and control variable/general control function) I find strong evidence of endogeneity. However, this is conditional on the instrument being exogenous and, since the model is just identified, there is no way to test for exogeneity of the instrument. The remainder of the table tests the different components of the CF-CI assumption in alternative specifications. The middle two columns test the null hypothesis that the control variable is sufficient to capture the full impact of the endogenous part of y2i; in other words, they test the significance of the coefficients on the additional terms in the general control function. The null is rejected at the 10% level under homoskedasticity and at the 5% level under heteroskedasticity. This gives statistical evidence of the violation of CF-CI, through the general control function, in this empirical application. Finally, the last three columns test the null hypothesis of homoskedasticity (i.e., that all of the coefficients in the heteroskedastic function are 0). There is strong statistical evidence of the violation of CF-CI through the presence of heteroskedasticity, easily rejecting homoskedasticity at the 5% level.
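The Wald statistics behind these tests take the standard quadratic form W = θ̂′V̂⁻¹θ̂ for the block of coefficients under test, compared against a chi-squared critical value. A minimal sketch for a two-coefficient block follows; the numbers are made up for illustration and are not the estimates from Table E.9.

```python
# Hypothetical tested coefficients and their estimated covariance block.
theta = [0.42, -0.31]
V = [[0.040, 0.010],
     [0.010, 0.025]]

# Invert the 2x2 covariance block and form W = theta' V^{-1} theta.
det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
Vinv = [[ V[1][1] / det, -V[0][1] / det],
        [-V[1][0] / det,  V[0][0] / det]]
W = sum(theta[i] * Vinv[i][j] * theta[j]
        for i in range(2) for j in range(2))

# Under H0 the statistic is chi-squared with 2 degrees of freedom;
# the 5% critical value is about 5.99.
reject_at_5pct = W > 5.99
```

Testing the control-function coefficients this way is the variable addition test mentioned above; testing the variance-equation coefficients gives the homoskedasticity test in the last columns of the table.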
Given these results, the preferred specification is the Het-Probit (GCF): I reject homoskedasticity and reject the inclusion of only the control variable in favor of a general control function. To understand the consequences of the different specifications for the interpretation of the results, Table E.10 provides estimates of the APEs with respect to the endogenous variable, non-wife income, along with their bootstrapped standard errors. The most significant change in the estimates occurs when one starts to address the issue of endogeneity. In the linear models, the 2SLS APE estimates shrink to about 3/4 of the OLS estimates but are still statistically significant. In the non-linear models, a similar reduction in APE estimates is observed when controlling for endogeneity (to about 3/5). But in the models that relax CF-CI completely (Het-Probit (GCF)), the APE is no longer statistically significant and even switches its sign. Putting the APE estimates into an interpretive setting: if the non-wife income increases by $10,000 – a fairly substantial increase – then, according to the preferred specification, the likelihood of the wife working decreases by around 1.16 percentage points, a fairly negligible effect. As suggested in the discussion of the coefficient estimates, this small effect is most likely driven by heterogeneity over the presence of children in the household. Therefore, examining the ASF for different combinations of ages of children present in the household can be informative. Since the APEs average over the distribution of all the covariates, these differing effects tend to be washed out in the single statistic. Figures D.5-D.8 show the ASF with respect to non-wife income for a married woman with a high school education, 20 years of experience, and different combinations of ages of children present in the household. The Probit models have a fairly linear ASF, which explains why 2SLS gives fairly similar estimates of the APE.
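For reference, the APE of non-wife income in the interacted model is the sample average of the scaled standard normal density multiplied by the marginal coefficient on nwifeinc. The sketch below assumes, for simplicity, that nwifeinc does not enter the variance function, so the effect works only through the mean index; the fitted values and coefficients are hypothetical.

```python
import math

def phi(t):
    # Standard normal density.
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def ape_nwifeinc(data, beta2, beta7, beta8, index_fn, sd_fn):
    """Average partial effect of nwifeinc in the interacted probit model.
    index_fn(obs) returns the fitted mean index (including the control
    function); sd_fn(obs) returns the fitted standard deviation exp(g)."""
    total = 0.0
    for obs in data:
        # Marginal coefficient on nwifeinc, including child interactions.
        slope = beta2 + beta7 * obs["kidslt6"] + beta8 * obs["kidsge6"]
        total += phi(index_fn(obs) / sd_fn(obs)) * slope / sd_fn(obs)
    return total / len(data)

# Toy usage with made-up fitted values.
toy = [{"kidslt6": 1, "kidsge6": 0, "xb": 0.3, "sd": 1.2},
       {"kidslt6": 0, "kidsge6": 1, "xb": -0.5, "sd": 0.9}]
ape = ape_nwifeinc(toy, beta2=-0.1, beta7=0.05, beta8=0.02,
                   index_fn=lambda o: o["xb"], sd_fn=lambda o: o["sd"])
```

Because the slope term varies with the children dummies, averaging over the sample washes out the heterogeneous effects, which is exactly why the text turns to the ASF by child composition.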
The first thing to note is that the ASF using Probit estimates is much more negatively sloped in all the figures. This is because, without addressing the issue of endogeneity, I would expect to see this sort of substitution effect (i.e., if the husband works then the wife is less likely to work and vice versa, but these decisions are made simultaneously). Once endogeneity is controlled for in Probit (CV), the slope lessens. The much more striking revelation is with the proposed Het-Probit (GCF) and SML estimators: I see a positive slope when both younger and older children are present in the household and a fairly flat ASF when there are only older children. When there are no children or just very young children in the household, the ASF is much more negatively sloped. This goes to show that relaxing CF-CI and allowing for a much more flexible specification in the conditional distribution of the latent error makes an interpretive impact. In this setting, the normality assumption seems likely to hold, especially since the semi-parametric estimator (SML) produces fairly linear ASF estimates. But in other empirical applications, the normality assumption may be too restrictive and unconvincing. Consequently, neither the SML estimator nor the proposed Het-Probit (GCF) estimator is strictly weaker in its assumptions. The SML estimator imposes CF-CI (or CF-LI) by not allowing for general heteroskedasticity and a flexible general control function. But on the other hand, the proposed approach imposes distributional assumptions. This divergence between the proposed parametric estimator and the semi-parametric estimators offered in the literature leads to the following distribution free extension.

2.7 Extension: Semi-Parametric Distribution Free Estimator

In some empirical settings, imposing normality on the latent error may not be a reasonable assumption. Therefore this section offers an alternative semi-parametric estimator that does not depend on distributional assumptions.
The main result of this section is that, by allowing the heteroskedastic function and the general control function to be non-parametrically specified, the semi-parametric variation of the proposed Het-Probit (GCF) is actually a distribution free estimator. This section goes into detail as to why this is true, how to obtain non-parametric identification, and what the asymptotic properties of the semi-parametric estimator are. For an applied researcher, the results of this section imply that, as long as the heteroskedastic function and the general control function are flexibly specified (e.g., with sieve basis functions), the normality assumption that appears to be used in Assumption 2.3.1 is non-binding. How is this distribution free estimator possible? A recent paper, Khan (2013), has noted an observational equivalence result concerning binary response models: a heteroskedastic Probit model with a non-parametric heteroskedastic function is observationally equivalent to a "distribution free" model with only a conditional median restriction. The utility of this result is that one may use simple estimation procedures (such as a semi-parametric Het-Probit MLE) without making any strong distributional assumptions to obtain structural parameter estimates and possibly even identify and estimate choice probabilities and marginal effects. This section extends the result to the case of endogeneity in a flexible manner that allows for the relaxation of CF-CI. By introducing a general control function into the conditional median restriction, the observational equivalence holds under endogeneity and a simple estimation procedure is obtainable. Consequently, this section proposes a semi-parametric estimator based on assumptions that are more realistic than those of any other control function method in the literature for endogenous binary response models. Section 2.7.1 reviews the observational equivalence result in Khan (2013) and extends it to the case of endogeneity.
Since this framework considers non-parametric functions for both the general control function and the heteroskedastic function, identification will be shown under this more general scenario. Section 2.7.2 derives the asymptotic properties of the semi-parametric estimator. Using the results in Song (2016), the proofs of consistency and the rate of convergence only need to be slightly altered to allow for the semi-parametric general control function. Finally, this extension ends with a comprehensive simulation study. Over a variety of conditional distributions (some satisfying CF-CI and some not), the performance of the proposed semi-parametric Het-Probit (GCF) estimator is compared to the parametric Rivers and Vuong (1988) estimator and the SML of Rothe (2009). The simulation results suggest the proposed approach can handle a variety of alternative distributions while still allowing for the violation of CF-CI.

2.7.1 Observational Equivalence and Identification

Consider the following binary response setting without endogeneity,

y_i = 1\{x_i\beta_o + u_i \ge 0\}   (2.37)

where xi is a vector of covariates and ui is the unobserved heterogeneity. The following two assumptions restate the setting of the two observationally equivalent models in Khan (2013) (see footnote 22).

Assumption 2.7.1 (Conditional Median Restriction). In the set up described by equation (2.37):
(i) xi ∈ ℝ^k is assumed to have a density with respect to Lebesgue measure, which is positive on the set X ⊆ ℝ^k.
(ii) Let po(t, xi) denote P(−ui < t | xi), and assume
  (a) po(·, ·) is continuous on ℝ × X.
  (b) p′o(t, xi) = ∂po(t, xi)/∂t exists and is continuous and positive on all of ℝ for all xi ∈ X.
  (c) po(0, xi) = 1/2.
  (d) lim_{t→−∞} po(t, xi) = 0 and lim_{t→∞} po(t, xi) = 1.

Assumption 2.7.2 (Heteroskedastic Probit). In the set up described by equation (2.37):
(i) xi ∈ ℝ^k is assumed to have a density with respect to Lebesgue measure, which is positive on the set X ⊆ ℝ^k.
(ii) ui = σo(xi)ei, where σo(·) is continuous and positive on X a.s., and ei is independent of xi with any known (e.g. logistic, normal) distribution that has median 0 and a density function which is positive and continuous on the real line.

Theorem 2.1 of Khan (2013) states that under the above assumptions the two models are observationally equivalent. The equivalence between the two models is in terms of the choice probabilities P(y1i|xi). In other words, both models will generate the same choice probability functions and therefore cannot be distinguished from one another on that basis. This means that a researcher is able to use estimators developed under Assumption 2.7.2, such as a semi-parametric heteroskedastic Probit, while only imposing the weaker distributional assumptions of Assumption 2.7.1. This allows for easy estimation using canned commands in popular statistical programs such as Stata, Matlab or R, while still preserving the "distribution free" interpretation. Previously, endogeneity was understood as a non-zero conditional mean, but to fit the framework in Khan (2013), endogeneity will be characterized by a non-zero conditional median: Med(−ui|xi) ≠ 0. This would violate Assumptions 2.7.1(ii)(c) and 2.7.2(ii). Therefore the observational equivalence result from Khan (2013) is no longer applicable. Now consider the set up in equation (2.1) and define the non-zero conditional median as

h_o(v_{2i}, z_i) \equiv \operatorname{Med}(-u_{1i} \mid z_i, v_{2i}) = \operatorname{Med}(-u_{1i} \mid z_{1i}, y_{2i})   (2.38)

where the second equality holds because y2i is merely a function of zi and v2i. This function should look familiar, as it is the non-parametric version of the general control function introduced in Assumption 2.3.1. It captures the part of the unobserved latent error that is correlated with the endogenous variable.

Footnote 22: Assumption 2.7.1 corresponds to CM1 and CM2, and Assumption 2.7.2 corresponds to HP1 and HP2, in Khan (2013).
Again, I allow for the violation of CF-CI, since the conditional median is a function of all the conditioning arguments, including the instruments. Thus far, I have made no assumptions on what the function ho(·, ·) should be, and therefore the control function is completely general. Now I can incorporate the general control function into the model assumptions to allow for arbitrary endogeneity. The following assumptions are slight adjustments of Assumptions 2.7.1 and 2.7.2 that incorporate endogeneity.

Assumption 2.7.3 (General Conditional Median Restriction). In the set up described by equation (2.1):
(i) (v2i, zi) ∈ ℝ^{1+k1+k2} is assumed to have a density with respect to Lebesgue measure, which is positive on the set (V × Z) ⊆ ℝ^{1+k1+k2}.
(ii) Let po(t, v2i, zi) denote P(−u1i < t | v2i, zi), and assume
  (a) po(·, ·, ·) is continuous on ℝ × (V × Z).
  (b) p′o(t, v2i, zi) = ∂po(t, v2i, zi)/∂t exists and is continuous and positive on all of ℝ for all (v2i, zi) ∈ (V × Z).
  (c) po(ho(v2i, zi), v2i, zi) = 1/2, where ho(v2i, zi) is continuous for all (v2i, zi) ∈ (V × Z).
  (d) lim_{t→−∞} po(t, v2i, zi) = 0 and lim_{t→∞} po(t, v2i, zi) = 1.

Assumption 2.7.4 (Endogenous Heteroskedastic Probit). In the set up described by equation (2.1):
(i) (v2i, zi) ∈ ℝ^{1+k1+k2} is assumed to have a density with respect to Lebesgue measure, which is positive on the set (V × Z) ⊆ ℝ^{1+k1+k2}.
(ii) u1i = σo(v2i, zi)e1i + ho(v2i, zi), where σo(v2i, zi) is continuous and positive on (V × Z), ho(v2i, zi) is continuous for all (v2i, zi) ∈ (V × Z), and e1i is independent of (v2i, zi) with any known (e.g. logistic, normal) distribution that has median 0 and a density function which is positive and continuous on the real line.

Modifying the observational equivalence result to this setting is almost trivial. Instead of imposing a zero median restriction, it is acknowledged that the median is non-zero, and a general conditional median function ho(v2i, zi) is specified.
Theorem 2.7.1 states this result, with the proof provided in the appendix.

Theorem 2.7.1 (Observational Equivalence). Under Assumptions 2.7.3 and 2.7.4, the two models are observationally equivalent.

Extending the results in Khan (2013) to the case of endogenous regressors is not novel. In a working paper, Song (2016) uses a more traditional control function method to address endogeneity. He imposes an exclusion restriction on the conditional median function, the same as the conditional median independence assumption proposed in Krief (2014): Med(u1i|v2i, zi) = f(v2i). Although these assumptions are weaker than CF-CI, the exclusion restrictions that are imposed are restrictive and unnecessary. Utilizing Theorem 2.7.1, the following assumption is an alternative to Assumption 2.3.1 that allows for a non-parametric general control function and a non-parametric heteroskedastic function.

Assumption 2.7.5. Consider the set up in equation (2.1), where {y1i, zi, y2i}, i = 1, ..., n, is iid. In the first stage, the true conditional mean is E(y2i|zi) = mo(zi), and the unobserved latent error has the following conditional distribution:

u_{1i} \mid z_i, v_{2i}, y_{2i} = u_{1i} \mid z_i, v_{2i} \sim N\big( h_o(v_{2i}, z_i),\ \exp(2\, g_o(y_{2i}, z_i)) \big)

where zi = (z1i, z2i) and mo(zi), ho(v2i, zi), and go(y2i, zi) are unknown functions. Since the normal distribution used in this framework is symmetric, the conditional median is equal to the conditional mean. Therefore the remaining discussion returns to the conditional mean interpretation of endogeneity. With this slight variation on the parametric framework, Assumption 2.7.5 suggests a distribution free estimator comparable to the other semi-parametric estimators in the literature. To reiterate: in implementation, the normal distribution is used according to Assumption 2.7.5, but in interpretation there are no distributional strings attached, because of the observational equivalence result.
However, this generalization further complicates identification. First, the model is only identified up to scale, which can be solved with a normalization that sets the last coefficient in the linear index xiβo equal to 1: βko = 1. Similar to the previous literature, identification of the non-parametric heteroskedastic function is obtained by assuming that the last regressor, xki, conditional on all other random variables in the numerator, has a density function with respect to Lebesgue measure that is positive on ℝ, while all other terms in the numerator have bounded support (see footnote 23). Second, as in the parametric model, a general control function introduced without any restrictions will not be identified relative to the linear index xiβ, because the two rely on the same sources of variation. Using an analogous CMR, a shape restriction on the general control function ensures that there is variation unexplained by the linear index.

Footnote 23: This is essentially assumption RC2(i) in Khan (2013). As he notes in Remark 3.2, the bounded support condition can be relaxed to finite fourth moments. To illustrate how this condition is used in identification, consider the non-endogenous case where one would like to show identification of βo and σo from the choice probability Φ(xiβo / exp(σo(xi))). Suppose not: suppose there exist β ≠ βo and σ ≠ σo such that

\frac{x_i\beta}{\exp(\sigma(x_i))} = \frac{x_i\beta_o}{\exp(\sigma_o(x_i))} \quad \text{for all } x_i \in X.

But because xki, conditional on x−ki, has a density function with respect to Lebesgue measure that is positive on ℝ and x−ki is bounded, for any realization x*−k in the support there exists an x*k also in the support such that

\frac{x^*_{-k}\beta_{-k} + x^*_k}{\exp(\sigma(x^*))} > 0 \quad \text{and} \quad \frac{x^*_{-k}\beta_{-ko} + x^*_k}{\exp(\sigma_o(x^*))} < 0,

which is a contradiction.

Assumption 2.7.6. Let mo(·) ∈ M, ho(·) ∈ H, and go(·) ∈ G denote the function spaces and β−ko ∈ B denote the parameter space.
(i) E(x′i xi) is non-singular and E(xi|zi) is full rank.
(ii) (CMR) E(ho(v2i, zi) | zi) = 0.
(iii) The last component of xi, xki, is an included instrument whose coefficient is normalized to 1, such that xiβo = x−kiβ−ko + xki, and xki conditional on (x−ki, ho(v2i, zi)) has a density function with respect to Lebesgue measure that is positive on ℝ, while (x−ki, ho(v2i, zi)) has bounded support.

The first two parts are taken from Assumption 2.4.1. The last part imposes the scale normalization and is crucial in identifying the heteroskedastic function. There is no consensus in the literature on how to choose which regressor should have the scaled coefficient. Song (2016) uses the endogenous regressor, since it will be continuously distributed and more likely to satisfy the support requirements. But then no inference can be made on that structural parameter, whose value must be assumed to be non-zero. Therefore I suggest scaling on an instrument whose relevance is not in question and which has sufficient support. A quick remark on parts (ii) and (iii): in some simple scenarios, the CMR in part (ii) is sufficient for the support condition in part (iii), as long as xki conditional on x−ki has a density function with respect to Lebesgue measure that is positive on ℝ. For instance, consider the linear case where the general control function is of the form

h_o(v_{2i}, z_i) = \sum_{p \in P} b_p(z_i)\big( v_{2i}^p - E(v_{2i}^p \mid z_i) \big)   (2.39)

where bp : Z → ℝ and the set P consists of unique elements of the real line. Supposing P does not include 0, the CMR is satisfied. Also, for ease of understanding, consider the common case that x−ki = y2i, so there is only one included instrument, acting as the normalized covariate xki = z1i. Then, conditional on any realization h = ho(v2i, zi) and y2 = y2i, one cannot precisely determine the corresponding z1i. Therefore z1i conditional on (y2i, ho(v2i, zi)) has a density function with respect to Lebesgue measure that is positive on ℝ.
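The construction in equation (2.39) can be made concrete. If, purely for this sketch, v2i | zi ∼ N(0, 1) independently of zi, then E(v2i|zi) = 0, E(v2i²|zi) = 1, and E(v2i³|zi) = 0, so centered powers of v2i satisfy the CMR by construction; multiplying them by functions bp(zi) of the instruments yields CMR-compliant basis terms.

```python
import random

def cmr_basis(v2, z1):
    # Centered powers v2^p - E(v2^p | z) for the sketch case v2 | z ~ N(0, 1);
    # each term has conditional mean zero given z.
    centered = [v2, v2 ** 2 - 1.0, v2 ** 3]
    # b_p(z) taken to be 1 and z1 for illustration.
    return [b * f for b in (1.0, z1) for f in centered]

# Check the conditional mean restriction by simulation at a fixed z1.
random.seed(4)
z1 = 0.7
draws = [cmr_basis(random.gauss(0.0, 1.0), z1) for _ in range(100000)]
means = [sum(d[j] for d in draws) / len(draws) for j in range(6)]
```

When v2i is not independent of zi, the conditional moments E(v2i^p | zi) must themselves be estimated, but the demeaning logic is the same.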
The following theorem states the general identification result.

Theorem 2.7.2. In the set-up described by equation (2.1) and Assumption 2.7.5, if Assumption 2.7.6 holds, then (mo(·), βo, ho(·), go(·)) are identified.

The proof is given in the appendix. Alternatively, if one is concerned that Assumption 2.7.6(iii) is unlikely to hold, then the researcher may always turn to exclusion restrictions as a sufficient condition. If xki is assumed to be excluded from the control function, then, given the proper bounded and unbounded supports, part (iii) of Assumption 2.7.6 is easily satisfied. With identification, the proposed estimation procedure is quite simple and similar to the parametric version in Section 4. In the first stage, the conditional mean function E(y2i|zi) = mo(zi) is estimated using standard non-parametric regression techniques such as sieves or kernels. The control variable is constructed from the residuals v̂2i = y2i − m̂(zi) and plugged into the second step. In the second stage, one would use sieves to estimate the non-parametric components. In this case sieves are preferred over other non-parametric methods, since it is easier to impose the CMR on the general control function. Let {bl(v2i, zi), l = 1, 2, ..., Lhn} and {cl(y2i, zi), l = 1, 2, ..., Lgn} be sequences of basis functions such that the bl(v2i, zi) satisfy the CMR for all l = 1, 2, ..., Lhn. The sieve spaces are defined as

H_n = \Big\{ h : (V \times Z) \to \mathbb{R},\ h(v_{2i}, z_i) = \sum_{l=1}^{L_{hn}} b_l(v_{2i}, z_i)\gamma_l : E(b_l(v_{2i}, z_i) \mid z_i) = 0 \text{ and } \gamma_1, ..., \gamma_{L_{hn}} \in \mathbb{R} \Big\}   (2.40)

G_n = \Big\{ g : (V \times Z) \to \mathbb{R},\ g(y_{2i}, z_i) = \sum_{l=1}^{L_{gn}} c_l(y_{2i}, z_i)\delta_l : \delta_1, ..., \delta_{L_{gn}} \in \mathbb{R} \Big\}   (2.41)

Finally, one would maximize the following likelihood

\mathcal{L}(y_{1i}, x_i, z_i; \hat{m}, \beta, \gamma, \delta) = \sum_{i=1}^{n} \left[ y_{1i} \log \Phi\!\left( \frac{x_i\beta + h(\hat{v}_{2i}, z_i)}{\exp(g(y_{2i}, z_i))} \right) + (1 - y_{1i}) \log\left( 1 - \Phi\!\left( \frac{x_i\beta + h(\hat{v}_{2i}, z_i)}{\exp(g(y_{2i}, z_i))} \right) \right) \right]   (2.42)

with respect to β, h(·) ∈ Hn, and g(·) ∈ Gn.
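The two-step procedure just described can be sketched end to end. A toy data generating process, a fixed quadratic basis in the first stage, and a single sieve term in each of h and g are assumptions of this illustration; in practice the basis dimensions Lhn and Lgn grow with n, and the likelihood is maximized rather than merely evaluated.

```python
import math
import random

random.seed(2)

def Phi(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def solve3(A, b):
    # Gauss-Jordan elimination for a small 3x3 system (no pivoting; fine
    # for this well-conditioned illustration).
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        M[i] = [m / M[i][i] for m in M[i]]
        for j in range(3):
            if j != i:
                M[j] = [mj - M[j][i] * mi for mj, mi in zip(M[j], M[i])]
    return [M[i][3] for i in range(3)]

# Step 1: series regression of y2 on basis functions of z; the residuals
# are the estimated control variable v2hat.
z = [random.uniform(-2.0, 2.0) for _ in range(400)]
y2 = [math.sin(zi) + random.gauss(0.0, 0.3) for zi in z]   # toy first stage

basis = lambda zi: [1.0, zi, zi * zi]
X = [basis(zi) for zi in z]
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y2)) for i in range(3)]
coef = solve3(XtX, Xty)
v2hat = [yi - sum(c * b for c, b in zip(coef, basis(zi)))
         for zi, yi in zip(z, y2)]

# Step 2: evaluate the sieve log likelihood in (2.42) at a candidate
# parameter (toy binary outcome; a real implementation maximizes this).
y1 = [1 if math.sin(zi) + vi > 0 else 0 for zi, vi in zip(z, v2hat)]

def loglik(beta, gamma, delta):
    ll = 0.0
    for y1i, zi, y2i, vi in zip(y1, z, y2, v2hat):
        h = gamma * vi                 # one CMR-compliant sieve term
        g = delta * y2i                # one variance sieve term
        p = Phi((zi * beta + h) / math.exp(g))
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard the logs
        ll += y1i * math.log(p) + (1 - y1i) * math.log(1.0 - p)
    return ll

ll0 = loglik(beta=1.0, gamma=0.5, delta=0.1)
```

Using sieves in the second stage makes it straightforward to restrict attention to basis terms satisfying the CMR, which is the reason the text prefers them over kernels here.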
As in the parametric version, implementation is as simple as running the hetprobit command in Stata for the second stage. However, inference should reflect both the two step estimation process and the non-parametric specification. The next section provides the asymptotic results for the proposed distribution free estimator.

2.7.2 Asymptotic Properties

Song (2016) derives consistency and convergence rates of the semi-parametric Het-Probit (CF) estimator when the control function satisfies CF-CI and imposes the exclusion restriction (i.e., ho(v2i, zi) = ho(v2i)). Therefore the asymptotic results only need to be slightly augmented to allow for the general control function. Explanations of the notation are left to the appendix. The following assumption collects the remaining low level regularity conditions needed for consistency of the second stage parameters.

Assumption 2.7.7.
(i) Let (β−ko, ho(·), go(·)) ∈ (B × H × G) = Θ denote the joint parameter space. For any (β−k, h(·), g(·)) ∈ Θ,
  (a) β−k ∈ B ⊂ ℝ^{k−1}, where B is compact;
  (b) Φ( (xiβ + h(v2i, zi)) / exp(g(y2i, zi)) ) ∈ Λ^s_c((V × Z), w1) for s > 0 and w1 ≥ 0;
  (c) h(v, z) is continuously differentiable with respect to its first component, such that

  \sup_{v \in V} \sup_{z \in Z} \left| \frac{\partial h(v_2, z)}{\partial v_2} \right| < C < \infty.

(ii) ∫ (1 + ||(v, z)||²)^{w2} f_{v,z}(v, z) d(v, z) < ∞, where f_{v,z}(·) denotes the joint density function and w2 > w1 > 0.
(iii) For b^{Lhn}(v2i, zi) = (b1(v2i, zi), ..., b_{Lhn}(v2i, zi)) and c^{Lgn}(y2i, zi) = (c1(y2i, zi), ..., c_{Lgn}(y2i, zi)), the matrices E(b^{Lhn}(v2i, zi)′ b^{Lhn}(v2i, zi)) and E(c^{Lgn}(y2i, zi)′ c^{Lgn}(y2i, zi)) are non-singular for all n.

Part (i) collects conditions on the parameter and functional spaces. The second component constrains the predicted probability function to lie in a weighted Hölder ball with radius c, smoothness s, and weight function (1 + ||·||²)^{w1/2}, as defined in equation (B.12) in the appendix.
Since the parameter space is a weighted Hölder ball, there exists a projection mapping from standard sieve spaces constructed from power series, Fourier series, splines, or wavelets to the parameter space as n → ∞. The last component allows for a Taylor expansion around the control variate v2i, since it is estimated in the first stage. Part (ii) replaces any compactness conditions on (v2i, zi), and part (iii) ensures point identification of the sieve coefficients. The following theorem provides consistency of the proposed estimator. The proof is omitted, since the arguments are identical to those provided in Song (2016).

Theorem 2.7.3. In the set-up described by equation (2.1), where Assumptions 2.7.5, 2.7.6, and 2.7.7 hold, if the first stage estimator v̂2i satisfies

\sup_{(x_i, z_i) \in X \times Z} |\hat{v}_{2i} - v_{2i}| = O_p(\tau_v)

where τv = op(1), then the estimators that maximize the log likelihood in equation (2.20) are consistent:

\|\hat{\beta}_{-k} - \beta_{-ko}\| = o_p(1)

\left\| \Phi\!\left( \frac{x_i\beta_o + h_o(v_{2i}, z_i)}{\exp(g_o(y_{2i}, z_i))} \right) - \Phi\!\left( \frac{x_i\hat{\beta} + \hat{h}(\hat{v}_{2i}, z_i)}{\exp(\hat{g}(y_{2i}, z_i))} \right) \right\|_{\infty, w_1} = o_p(1)

Note that consistency of the first stage non-parametric estimator can be obtained using Proposition 3.6 of Chen (2007) under fairly standard and relaxed conditions (see footnote 24). This theorem provides consistency of both the parametric component of the second stage estimator and the predicted probability function. The latter is used to establish consistency of the APE estimates in the following corollary.

Corollary 2.7.1.
Under the conditions of Theorem 2.7.3, the APE estimator with respect to component j of xi is consistent:

| n^(−1) Σ_{i=1}^n (β̂j/exp(ĝ(y2i, zi))) φ((xi β̂ + ĥ(v̂2i, zi))/exp(ĝ(y2i, zi))) − E[ (βjo/exp(go(y2i, zi))) φ((xiβo + ho(v2i, zi))/exp(go(y2i, zi))) ] | = op(1)

24 The conditions for consistency of the first stage estimator include: Z ⊂ ℝ^(k1+k2) is a Cartesian product of compact intervals, E(y2i|zi) is bounded, and mo(zi) = E(y2i|zi) ∈ Λs where s > (k1 + k2)/2.

The next assumption collects the remaining low level conditions needed to derive the convergence rate of the parametric component of the second stage estimator.

Assumption 2.7.8.

(i) Any h(·) ∈ H satisfies h(v2i, zi) ∈ Λs_c((V × Z), w1) for some s > 0 and w1 ≥ 0.

(ii) Any g(·) ∈ G satisfies g(y2i, zi) ∈ Λs_c((V × Z), w1) for some s > 0 and w1 ≥ 0.

(iii) The smoothness exponent of the Holder space satisfies 2s ≥ 1 + k1 + k2.

(iv) In Assumption 2.7.7(ii), w2 > w1 + s.

This assumption places stronger smoothness assumptions as well as further restricting the tail behavior of the covariates. The convergence rates are derived for the one-step estimator where the control variate v2i is assumed to be known and not estimated in a first stage. Therefore the estimator is defined as

θ̃ ≡ (β̃−k, h̃, g̃) = arg max_{(β−k,h,g)∈B×Hn×Gn} Σ_{i=1}^n { y1i log Φ((xiβ + h(v2i, zi))/exp(g(y2i, zi))) + (1 − y1i) log[1 − Φ((xiβ + h(v2i, zi))/exp(g(y2i, zi)))] }   (2.43)

The following theorem states the rate results; again the proof is omitted since the arguments are almost identical to those given in Song (2016).25

Theorem 2.7.4.
In the set-up described by equation (2.1) under Assumptions 2.7.5, 2.7.6, 2.7.7, and 2.7.8, the estimator described in equation (2.43) satisfies

||β̃−k − β−ko|| = Op( √(max(Lhn, Lgn)/n) + Lhn^(−s/(1+k1+k2)) + Lgn^(−s/(1+k1+k2)) )

25 The only difference is that now the control function is a function of both v2i and the instruments zi. Derivation of the convergence rates still follows from Theorem 3.2 of Chen (2007), but now the approximation rate of the control function converges at a slightly different rate: ||ho − πn ho|| = Lhn^(−s/(1+k1+k2)).

The proposed estimator converges much more slowly than the parametric rate. This will affect the performance of the estimator, as seen in the simulation study given in the following section.

2.7.3 Simulation

This simulation study is a broad examination of the proposed Semi-Parametric Heteroskedastic Probit with a General Control Function (SP Het Probit (GCF)) in a variety of settings. There is one included and one excluded instrument drawn from the following joint distribution:

(z1i, z2i)′ ∼ N( (0, 0)′, diag(3, 1) )

The common data generating process is

y1i = 1 if y*1i ≥ 0, and y1i = 0 if y*1i < 0, where y*1i = y2iβo + z1i + u1i   (2.44)

y2i = π1o + π2o z1i + π3o z2i + v2i   (2.45)

where βo = 1 and πo = (−1/√2, −1/√6, 1/√2)′. The control variate v2i is drawn from a N(0, 1). This means that there is a strong first stage with an R² of approximately 0.50. The unobserved heterogeneity u1i is decomposed into the general control function and a mean zero random variable e1i that determines the conditional distribution of the latent error,

u1i = ho(v2i, zi) + √2 e1i   (2.46)

such that e1i and ho(v2i, zi) are standardized to have variance equal to one. This means Var(u1i) ≈ 3 and Var(y2iβo + z1i) ≈ 2.45.
The simulation experiment considers three different control functions that satisfy the CMR:

h1o(v2i, zi) = v2i

h2o(v2i, zi) = (z1i/3 + z2i/√3) v2i

h3o(v2i, zi) = (v2i/√2.5) (1 + z1i/(1 + (z1i/3 − 2z2i/3)²))

The coefficients in the linear control function, h2o(v2i, zi), are chosen so that a projection of h2o(v2i, zi) on just the control variate (v2i) only explains about 35% of the variation in h2o(v2i, zi). This means relaxing CF-CI should have meaningful consequences. The functional form of the non-linear control function, h3o(v2i, zi), is chosen so that a projection of h3o(v2i, zi) on (v2i, z1iv2i, z2iv2i) explains 90% of the variation, and therefore a linear approximation is very reasonable. Moreover, the decomposition of variance explained is split 50-50 between just the control variate (v2i) and the terms that are interacted with instruments (z1iv2i and z2iv2i). Again, this means relaxing CF-CI should have meaningful consequences.

This simulation also considers five different conditional distributions for the latent error:

e1_1i ∼ Logistic(0, 3/π²)

e2_1i ∼ Uniform(−√12/2, √12/2)

e3_1i ∼ T(0, 3)/√3

e4_1i ∼ 0.5 N(−0.8, 1 − 0.8²) + 0.5 N(0.8, 1 − 0.8²)

e5_1i ∼ Logistic(0, 3(z²1i/2 + 3y²2i/4)/π²)

Notice that a combination of the first control function with any of the first three conditional distributions does not violate CF-CI. Only when the conditional distribution of the latent error is a function of the instruments, either through the control function or heteroskedasticity (as in e5_1i), is CF-CI violated.

The simulation results will be presented in three segments. In the first segment, the data generating process satisfies CF-CI. This only allows for the first control function h1o(v2i, zi) and the first four conditional distributions (e1_1i, e2_1i, e3_1i, e4_1i).
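The data generating process in equations (2.44)-(2.46) can be simulated directly. Below is a minimal sketch under the simplest control function h1o(v2i, zi) = v2i and a standardized logistic latent error; the instrument variances (3 and 1) are inferred from the reported first-stage R² of 0.50 and Var(y2iβo + z1i) ≈ 2.45, so treat them as an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 100_000

# Instruments: one included (z1, variance 3) and one excluded (z2, variance 1).
z1 = rng.normal(0.0, np.sqrt(3.0), size=n)
z2 = rng.normal(0.0, 1.0, size=n)

# First stage (2.45): y2 = pi1 + pi2*z1 + pi3*z2 + v2 with v2 ~ N(0, 1).
pi_o = np.array([-1/np.sqrt(2), -1/np.sqrt(6), 1/np.sqrt(2)])
v2 = rng.normal(size=n)
y2 = pi_o[0] + pi_o[1]*z1 + pi_o[2]*z2 + v2

# Latent error (2.46) with h1(v2, z) = v2 and a unit-variance logistic e1
# (scale sqrt(3)/pi standardizes the logistic to variance one).
e1 = rng.logistic(0.0, np.sqrt(3.0)/np.pi, size=n)
u1 = v2 + np.sqrt(2.0)*e1

# Outcome equation (2.44) with beta_o = 1.
y1 = (y2*1.0 + z1 + u1 >= 0).astype(int)

# First-stage R^2 should come out near 0.50, and Var(u1) near 3.
r2 = 1.0 - v2.var()/y2.var()
```

This makes the variance bookkeeping in the text easy to check: the explained first-stage variance is (1/6)(3) + (1/2)(1) = 1 against a total of 2, giving R² = 0.50.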
The SML estimator is expected to perform as well, if not better, than the proposed method on all accounts (parameter estimates, ASF estimates, and APE estimates). The second segment considers the remaining two general control functions that do not satisfy CF-CI. Notice that h2o(v2i, zi) is linear in parameters and therefore can be estimated parametrically, but h3o(v2i, zi) is a non-linear function whose functional form will be treated as unknown and estimated non-parametrically. For the final segment, heteroskedasticity is introduced so that the non-parametric heteroskedastic function in the proposed estimator must capture both the misspecified distribution and the heteroskedasticity in the latent error.

In addition to the proposed semi-parametric estimator, the simulation experiment employs the two step control function Probit estimator of Rivers and Vuong (1988) (Probit (CF)) and the SML estimator of Rothe (2009) as comparisons. The SML estimator is implemented with a Gaussian kernel of order 1. Although asymptotically the SML estimator requires higher order kernels, Rothe finds that lower order kernels perform better in small samples. As suggested in Rothe (2009), bandwidths for the SML estimator are treated as additional parameters to be optimized over. All three estimators use the same first stage estimates for v2i: the residual from regressing y2i on zi.

Two issues with the proposed method arose during implementation. First, the proposed estimator is fairly sensitive to different starting values, but using 15 randomized starting values helps to avoid local maxima. Second, since the estimator incorporates two non-parametric functions that need to be approximated via sieves, the number of parameters increases quite quickly. To reduce the number of parameters, a reasonable restriction on the general control function, ho(v2i, zi) = ho(zi)v2i, is used.
For sample size n = 250, the polynomial series approximating ho(zi) and go(y2i, zi) both include only first order terms. For sample size n = 500, the polynomial series approximating ho(zi) includes only first order terms, while the polynomial series approximating go(y2i, zi) includes both first and second order terms. For sample size n = 1,000, both polynomial series include terms up to order 2. Alternatively, one can consider a penalization method to restrict the number of non-zero covariates. This extension is left to future research.

All tables and figures referenced in this section can be found in the Appendix. They report the bias, standard deviation (Std. Dev.), root mean squared error (RMSE), and the 25th, 50th, and 75th sample quantiles of the parameter estimates for 1,000 repetitions of the simulation. The following summarizes the results for the three cases: CF-CI holds, CF-CI is violated by a general control function, and CF-CI is violated by heteroskedasticity.

CF-CI holds

Tables E.11-E.14 report the simulation results for the estimates of βo when CF-CI holds. The results show all three estimators perform fairly well in terms of bias, even though the Probit (CF) estimator imposes a misspecified distribution (it assumes normality when it does not hold). The only exceptions are the proposed estimator under a Uniform and a Gaussian Mixture distribution for the latent error. This bias is stronger when n = 500 and n = 1,000, which is when there are a larger number of higher order terms in the non-parametric functions. This suggests that there may be gains to adding a penalized approach to control for a potentially large number of irrelevant terms.

Also, the proposed estimator is much less efficient than the alternative methods. The efficiency gain from using the SML estimator can be substantial. In the cases of the Uniform distribution and the Gaussian Mixture, the standard deviation can be reduced by half when using the SML estimator instead of the proposed approach.
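The parameter-count concern above is easy to quantify. The sketch below counts polynomial sieve terms up to a given total degree; the variable counts (two instruments for ho(zi), three arguments for go(y2i, zi)) follow the restriction ho(v2i, zi) = ho(zi)v2i described above, while the exact basis used in the simulations may differ.

```python
from itertools import combinations_with_replacement

def n_poly_terms(n_vars, order):
    """Number of polynomial terms (excluding the constant) in n_vars
    variables, with total degree from 1 up to `order`."""
    return sum(len(list(combinations_with_replacement(range(n_vars), d)))
               for d in range(1, order + 1))

# h(z1, z2) uses terms in the two instruments; g(y2, z1, z2) uses three
# variables, so the sieve parameter count grows quickly with the order.
growth = {order: n_poly_terms(2, order) + n_poly_terms(3, order)
          for order in (1, 2, 3)}
```

With both approximations at order 1 there are only 5 sieve coefficients; raising both to order 2 already gives 14, and order 3 gives 28, which motivates the penalization extension mentioned above.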
Nevertheless, all the estimators perform well in terms of ASF estimates in Figures D.9-D.12.

Violation of CF-CI with General Control Function

Now the data generating process includes general control functions that violate CF-CI by including the instruments. Recall that there are two possible general control functions: one that is linear in parameters (h2o), and consequently parametrically specified, and another that is non-linear and must be estimated non-parametrically. Tables E.15-E.18 report the βo estimate results when the control function is linear. The simulation results suggest the SML estimator has a fairly strong negative bias compared to the other two estimators; in some cases the 75th sample quantile falls below the true parameter value of 1. Surprisingly, the parametric Probit (CF) estimator does not have as strong a bias as the SML estimator, even though it also implicitly imposes CF-CI. But the proposed estimation approach still incurs a fairly large standard deviation due to the numerous parameters needed in estimation. Therefore both of the other approaches, Probit (CF) and SML, fare better in terms of RMSE. This places some doubt on the proposed approach and whether any realistic gains over previous approaches are even possible. But examining the ASF estimates clarifies the matter. The figures show that the proposed SP Het Probit (GCF) substantially outperforms the other estimators by better estimating the true ASF across all the different distributions.

Tables E.19-E.22 and Figures D.17-D.20 report the results when the control function is unknown and estimated non-parametrically. The conclusions stay the same, although there tends to be a larger bias (compared to the case of a linear general control function) for the proposed estimator. But this is because the form of the general control function is unknown and can only be approximated.
Comparing these results to the previous segment, distributional misspecification appears to have a lighter effect on the parameter estimates than violations of CF-CI. The Probit (CF) estimator performs quite well under distributional misspecification, while the Probit (CF) and SML estimators display stronger bias and poor ASF estimates when CF-CI does not hold. Moreover, the proposed estimator always has a larger RMSE compared to the other approaches, which suggests a consequential trade-off between the efficiency of the SML estimator and the smaller bias of the proposed approach.

Violation of CF-CI with Heteroskedasticity

The final segment only looks at the Logistic distribution for the latent error but introduces heteroskedasticity as a further violation of CF-CI. Simulation results are reported in Tables E.23-E.25. When the sample size is 250, only the first order polynomials are included in approximating the heteroskedasticity. Consequently, the proposed estimator performs poorly and on par (in terms of bias) with the SML estimator. But when higher order terms are allowed as the sample size increases, the bias of the proposed estimator diminishes significantly compared to the alternative methods. Examining the ASF estimates in Figures D.21-D.23, only the SP Het Probit (GCF) estimator follows the true ASF closely, while the other two estimators suffer even under the simplest control function h1o(v2i, zi) = v2i.

Overall, the proposed estimator correctly adapts to the scenario in which CF-CI is violated, while the alternative estimation methods are restricted by imposing the assumption. An additional benefit of the proposed method is that it is much simpler to implement and can be done using canned commands in common statistical packages. So for an applied researcher the proposed method is more general than alternative estimators and is easier to implement. However, this simulation also brings to light some weaknesses of the SP Het Probit (GCF) estimator.
First, this estimator is quite inefficient due to the large number of parameters it needs to estimate. Therefore the proposed procedure could benefit from a dimension reduction. Second, the proposed estimator is quite sensitive to starting values, and without prior knowledge of what the true parameter value should be, this may pose some challenge to implementation. Randomizing around scaled parameter estimates from the Probit (CF) estimator for starting values is a promising possibility, since, as this simulation study shows, the parametric estimator tends to perform fairly well even under violation of CF-CI and with a misspecified distribution.

2.8 Conclusion

This chapter presents a new control function approach to endogeneity in a binary response model that does not impose CF-CI. Applying a similar framework as Kim and Petrin (2017), this chapter uses a general control function method that allows the instruments to be part of the conditional distribution of the unobserved heterogeneity. The proposed estimator is consistent and asymptotically normal. In simulations, it is shown that the general control function method is necessary to obtain accurate parameter estimates under the weaker CMR setting. Moreover, structural objects of interest such as the ASF and APE can be recovered in the general framework presented in the chapter. Without CF-CI, other estimators in the literature are unable to correctly estimate the ASF and APE, resulting in inaccurate economic interpretations. In the empirical application, a Wald test uncovers strong statistical evidence for the violation of CF-CI, although there are fairly minimal differences in the economic interpretations produced by the different estimators.

The proposed estimator is introduced in a parametric framework, which may be unrealistic in some economic settings. Therefore a semi-parametric extension is provided that places no distributional assumptions on the unobserved heterogeneity.
Simulations show that when CF-CI is violated and the distribution of the latent error is misspecified, the proposed semi-parametric estimator consistently estimates the parameters and the ASF. But the simulations also uncover some drawbacks to the proposed semi-parametric approach. Due to the fairly large dimension of the parameter space, the proposed approach is quite inefficient relative to other estimators in the literature. An interesting avenue for further research is to develop a more efficient semi-parametric estimator that still allows for the relaxation of CF-CI.

The motivation for this chapter was to propose an estimation procedure built upon a model and assumptions that are much more reflective of what we would expect in empirical data. By creating a model that is more flexible and realistic, as well as an estimation procedure that is easy to implement, the proposed approach will be a useful addition to an economist's tool-kit of estimators. The next chapter approaches a different setting but with a similar purpose. In Chapter 3, joint work with Jeffrey Wooldridge and Ying Zhu, we consider a panel binary response model (large N, small T) in which the standard joint maximum likelihood approaches have simple and restrictive specifications for the individual heterogeneity and do not allow for serial correlation. Empirical data calls for more flexibility, so we propose an approach that can capture individual persistence through several mechanisms. First, we introduce individual heterogeneity in the levels and the slopes that is allowed to be potentially correlated with the covariates. Second, we allow for serial correlation in the latent error. The resulting estimator is a pooled correlated random effects heteroskedastic Probit in which identification again relies on the results provided in the first chapter.
Both of the proposed approaches in these two chapters will find their utility in empirical work as they push the frontiers of the literature on how to incorporate flexibility driven by the demands of data.

Chapter 3

Behavior of Pooled and Joint Estimators in Probit Model with Random Coefficients and Serial Correlation1

1This is joint work with Jeffrey Wooldridge and Ying Zhu.

3.1 Introduction

Multilevel data analysis is among the long standing statistical tools that leverage heterogeneity in the data. One of the most frequent occurrences in application is panel data, where the first level is time and the second level is individuals. Despite the broad framework provided by a multilevel setting, there is an absence of the time series analysis that appears in panel data settings from the multilevel literature. In particular, when observations are recorded over time we expect the data to display a strong amount of persistence. This persistence can arise from individual-specific heterogeneity or from serially correlated errors.

Economic theory can provide motivation as to why one would expect to see persistence in the data. In modelling demand, purchasing behavior can be traced back to a utility maximization problem where, if one allows for heterogeneous agents – in preferences or income effects – the estimating equation should allow for individual heterogeneity. In Wooldridge (2010), the individual-specific heterogeneity in a program evaluation framework is motivated by "the usual omitted ability story." The individual-specific heterogeneity controls for any individual characteristic – such as ability or motivation – that may be correlated with program participation. In the application of these examples, the individual heterogeneity is an unobserved random variable.

Even after allowing for individual-specific heterogeneity, one would expect a strong presence of serial correlation in the errors.
In the field of labor economics, outcomes such as employment, wages, and health outcomes are strongly persistent and exhibit clear signs of autocorrelation. Bertrand, Duflo, and Mullainathan (2004) survey the empirical literature on evaluating treatment effects using the Difference in Difference technique and find that out of 69 studies, only 5 explicitly address serial correlation. They also show that the consequences of not correcting for serial correlation can be severe for inference: by evaluating placebo interventions, they find that ignoring serial correlation can result in concluding an "effect" at the 5 percent level for up to 45 percent of the placebo interventions.

This chapter intends to further explore the effects of persistence – individual-specific heterogeneity and serial correlation – on popular estimation procedures in a binary response setting. We are interested in any robustness properties these estimation procedures may provide, either theoretically or in simulations.

The most common formulation of a model for a panel binary response, yit ∈ {0, 1}, is derived from the latent variable set up that allows for level individual heterogeneity,

yit = 1{ai + xitβ + εit > 0}   (3.1)

where xit is a vector of observed random variables – the covariates – and ai is an unobserved random variable – the individual heterogeneity. To begin, we will assume ai, xit, and εit are all independent from one another. If we make the following "random effects" assumption,

ai|xi1, ..., xiT ∼ N(α, σ²1)   (3.2)

then a Joint Maximum Likelihood Estimation (JMLE) procedure that integrates out the random effect ai is consistent. If we assume the idiosyncratic error εit is normally distributed, this results in the random effects Probit estimator. Alternatively, if we assume a logistic distribution, then the random effects Logit estimator is used. Estimation can be computationally more difficult given a logistic distribution since it does not mix well with any other distribution.
Consequently, this chapter will more heavily examine the Probit case, but most of the analysis can be extended to the Logit case as well.

However, the conditional independence assumption that is implicit in the random effects assumption in equation (3.2) can be quite stringent. In the linear panel data literature in econometrics, there are two popular modelling approaches to relax the conditional independence of ai and the x's. One is the Fixed Effects (FE) approach and the other is the Mundlak device. The FE approach runs a pooled OLS regression of the form (yit − ȳi) on (xit − x̄i). This allows for arbitrary correlation between the individual heterogeneity ai and the x's. Alternatively, one can model the correlated random effects using a Mundlak device. The Mundlak device allows ai to be correlated with the x's through time constant functions of the data. A common implementation is to use the time averages x̄i. Wooldridge (2018), Proposition 2.1, shows that running a pooled OLS regression of the form yit on xit, x̄i yields the same estimates for β as the FE approach. However, the FE approach does not allow us to estimate functions that involve the conditional mean of the heterogeneity, which might be of interest in certain applications (as we will see soon).

Extending the discussion to a non-linear setting, using the FE approach results in an incidental parameters problem and consequently serious biases in the coefficient estimates. Greene (2004) and Fernández-Val (2009) provide a more complete discussion of the FE approach for Probit. To avoid these issues, we propose applying the Mundlak device to the setting of equation (3.1) by assuming

ai = α + x̄iξa + u1i where u1i|xi1, ..., xiT ∼ N(0, σ²1).   (3.3)

We could use a JMLE procedure that integrates out the random effect u1i or, as in the linear case, we could consider a simpler Pooled Maximum Likelihood (PMLE) approach.
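The FE/Mundlak equivalence in the linear case is easy to verify numerically. The sketch below, with a hypothetical linear panel DGP (all parameter values are illustrative), shows that pooled OLS of yit on (1, xit, x̄i) reproduces the within (FE) slope exactly.

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 500, 4

# Panel with heterogeneity correlated with x through its time average.
x = rng.normal(size=(N, T))
a = 1.0 + 0.8 * x.mean(axis=1) + rng.normal(size=N)   # correlated intercept
y = a[:, None] + 2.0 * x + rng.normal(size=(N, T))    # true slope beta = 2

# Fixed effects (within) estimator: pooled OLS of demeaned y on demeaned x.
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (xd @ yd) / (xd @ xd)

# Mundlak device: pooled OLS of y on (1, x_it, xbar_i).
X = np.column_stack([np.ones(N * T), x.ravel(),
                     np.repeat(x.mean(axis=1), T)])
beta_mundlak = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]
```

The two slope estimates coincide to machine precision, which is the content of the proposition cited above: the Mundlak term x̄i absorbs exactly the between variation that the within transformation removes.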
The PMLE approach makes no assumptions on the joint distribution over the time observations but pools the likelihood over i and t.

So far, we have only introduced individual heterogeneity into the level, but there is little reason why the individual heterogeneity should be restricted to a level effect. While introducing random slopes is much more common in the linear regression literature (see Hall, Horowitz, et al. (2005), Swamy (1970), and Swamy and Tavlas (1995)), there have been fewer papers that attempt to account for the unobserved heterogeneity in slope parameters in a nonlinear model like Probit.2 One of the reasons has to do with the fact that joint estimation methods are so far the dominant approach. A JMLE approach, which requires obtaining the joint distribution of (yi1, ..., yiT) conditional on (xi1, ..., xiT), can be computationally difficult, and we may not even have enough assumptions to obtain the joint distribution. In any case, the JMLE will generally require more assumptions to consistently estimate the parameters. The benefit from the additional assumptions and computational burden is greater asymptotic efficiency. Extending the specification in equation (3.1), we will assume

yit = 1{ai + xitbi + εit > 0}   (3.4)

where now both ai and bi are unobserved random variables capturing the level and slope individual heterogeneity.

Before we dive into the more complicated joint methods, perhaps it would be wise to take a step back and ask the following question: What features of the model should we focus on? In evaluating policy interventions, the ultimate interest usually concerns the treatment effect.

2Hausman and Wise (1978) and Akin, Guilkey, and Sickles (1979) introduce random coefficients in the multinomial and ordered Probit models.
While the average treatment effect coincides with the slope coefficient in a linear model with only additive heterogeneity, in a nonlinear model like the one above, the average treatment effect is much more complex in its derivation. The concept of the Average Structural Function (ASF) was proposed simultaneously by Blundell and Powell (2004) and Wooldridge (2005), and the average treatment effect should be derived from the ASF. Using the notation in Wooldridge (2005), the conditional mean of y is defined as

E(y|x, q) = μ1(x, q)

where x are observed covariates – xit in the setup above – and q is unobserved heterogeneity – ai and bi in the setup above. Assuming the standard distributional assumptions in the Probit case (ε ∼ N(0, 1) and independent of all other random variables) and applying them to our model of interest,

μ1(xit, (ai, bi)) = Φ(ai + xitbi).   (3.5)

Then the ASF averages the above equation over the distribution of unobserved heterogeneity. The treatment effect is the difference of the ASF over the treated and not treated, but this can vary over the observed covariates. Unlike the linear model, the complexity of the non-linear model allows the treatment effect to be heterogeneous over the distribution of the covariates. Averaging over the distribution of the observed covariates then renders the average treatment effect.

It is useful to begin with a framework that unifies the discussion of treatment effects in models with unobserved heterogeneity. "Average treatment effect" is usually reserved for the case of binary treatment and "average partial effect" for the continuous analogue. In the continuous case, in lieu of taking a difference, the partial derivative of the ASF produces the partial effect. We will refer to the Average Partial Effect (APE) synonymously for the binary and continuous case. If our focus is on the APEs, then adopting the Mundlak device to model the unobserved heterogeneity in slope parameters would seem a sensible approach.
Combined with a pooled estimation method, the Mundlak approach treats the data as if it is one long cross section and computation is typically straightforward. Motivated by Mundlak, we return to the setting of equation (3.4) and assume

ai = α + x̄iξa + u1i where u1i|xi1, ..., xiT ∼ N(0, σ²1)   (3.6)

bi = β + x̄iξb + u2i where u2i|xi1, ..., xiT ∼ N(0, σ²2).   (3.7)

As previously mentioned, we will focus on two different estimation routes: JMLE and PMLE. The JMLE procedure derives the joint distribution of (yi1, ..., yiT) conditional on (xi1, ..., xiT) by integrating out the random effects (u1i, u2i). This integral cannot be solved in closed form and in estimation is approximated using numerical methods. This can cause computational issues, including failures due to non-convergence and long estimation times. Because it is a full MLE method, the JMLE produces efficient estimates of the parameters α, β, ξa, ξb, σ²1, and σ²2. However, it does assume εit is iid over i and t and (u1i, u2i) are bivariate normal, independent of εit and (xi1, ..., xiT). These assumptions could be relaxed theoretically, but trying to implement the more flexible models turns out to be computationally costly, so the current state of statistical software imposes these assumptions in implementation.

In the alternative pooled framework, it is computationally easy to relax the assumption that εit is independent over i and t. Since we are considering a panel setting, it is natural to expect serial correlation in the latent error. Although it is not as efficient as a joint procedure, the pooled approach is robust to serial correlation. The drawback to this approach is that it cannot separately identify the coefficients (α, β, ξa, ξb) from the scaling factor 1/√(1 + σ²1), but it is consistent in estimating the scaled parameters, θσ = θ/√(1 + σ²1), where θ represents any of the coefficients.

This leads to an interesting trade-off between the two estimation procedures.
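The scaling factor comes from the normal mixing identity E[Φ(z + u)] = Φ(z/√(1 + σ²)) for u ∼ N(0, σ²): the response probability the pooled probit fits has the scaled index, so only θ/√(1 + σ²1) is recoverable. The identity can be checked numerically (the index and variance values below are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def response_prob(index, sigma1, n_nodes=64):
    """P(y = 1 | x) = integral of Phi(index + u) against N(0, sigma1^2),
    computed by Gauss-Hermite quadrature (probabilists' weight)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    # hermegauss integrates against exp(-u^2/2); dividing the weights by
    # sqrt(2*pi) normalizes them to the standard normal density.
    return (weights / np.sqrt(2 * np.pi)) @ norm.cdf(index + sigma1 * nodes)

index, sigma1 = 0.7, 1.5
lhs = response_prob(index, sigma1)                 # mixture of probits
rhs = norm.cdf(index / np.sqrt(1 + sigma1**2))    # single probit, scaled index
```

Since lhs equals rhs for every index value, a pooled probit of yit on the Mundlak regressors consistently estimates the coefficients only up to the factor 1/√(1 + σ²1); for APEs, which depend on the scaled index, this is harmless.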
The JMLE can separately identify and estimate the variances of the random effects and should be more efficient, but it is not robust to serial correlation and may be computationally more demanding. On the other hand, the PMLE is robust to serial correlation but is less efficient and can only estimate the scaled coefficients. However, if we focus on the APEs, then the lack of identification does not pose any issue in the interpretation of the results. In this case, it is possible that precise estimates of individual coefficients have a much smaller impact on the estimates of the APEs.

We conduct extensive simulation experiments for the Probit model comparing the JMLE and PMLE. We look at both the continuous – with and without strong dependence – and binary treatment cases. The pooled approach performs as we expect: less efficient but consistent over different levels of serial correlation in the latent error. We do find some surprising trends in the coefficient estimates using the JMLE procedure. We find that even under no serial correlation, the coefficient estimates have a serious negative bias. The driving factor appears to be poor estimation of the variance components, σ1 and σ2, which tend to also have significant negative bias. But these biases seem to cancel each other out when examining the estimates of the scaled coefficients, even under the presence of serial correlation. Consequently, the APEs calculated from the JMLE estimates appear to have robustness properties with respect to serial correlation.

The remainder of this chapter is organized as follows. Section 2 presents the model set up and assumptions. Section 3 goes into more detail in deriving the two estimation procedures. In particular, we discuss how the JMLE procedure fits into the Generalized Linear Mixed Effects Model literature and how the pooled approach results in a heteroskedastic Probit estimator. Section 4 derives the average structural function and the corresponding APE.
We provide a more detailed discussion on how the heterogeneity is incorporated into the APEs. Section 5 presents the specifications for the simulation study and discusses the results. Section 6 uses the two estimators in an application to investigate whether our simulation results hold with empirical findings. Section 7 presents a short discussion on extending this analysis to the Logit case; there is no easy implementation of a pooled approach since no distribution mixes well with the logistic distribution, and the JMLE approach does not provide consistent estimates of the coefficients, which raises the question of which statistic is more robust: the APEs or the log-odds. Finally we conclude with a summary of our results and their implications.

3.2 Model Set Up

We are considering a binary response model in a panel setting with small T and large N, allowing for correlated random effects in the intercept as well as the coefficient of interest,

yit = 1{ai + x1it b1i + x2itβ2 + εit > 0}

ai = α + g1(x̄i)ξa + u1i

b1i = β1 + g2(x̄i)ξb + u2i.   (3.8)

Motivated by the Mundlak device, we will assume that the elements of g1(·) and g2(·) are known functions of the time averages x̄i. This could of course be generalized to known functions of all the time observations. The random intercept and coefficient are allowed to be correlated with all time observations of the x's, (xi1, ..., xiT), through the linear functions g1(x̄i)ξa and g2(x̄i)ξb. We will assume the following independence and distributional assumptions hold:

(u1i, u2i)′ | xi1, ..., xiT ∼ N( (0, 0)′, Σ ),  Σ = [ σ²1  σ12 ; σ12  σ²2 ].   (3.9)

This means that the random effects ai and b1i have a known distribution and are independent of the x's conditional on the time averages x̄i.
Generally, u_{1i} and u_{2i} are allowed to be drawn from a multivariate normal distribution with possible correlation; however, we found that, in the simulations, allowing for a general variance-covariance structure was quite computationally demanding. In the remainder of this chapter we will assume σ_{12} = 0, but the analysis can be easily extended to allow for the correlated case. Finally, the idiosyncratic error ε_{it} is assumed to be independent of all other random variables in the model and independent over i. In order to allow persistence in the outcome that is unrelated to the covariates, ε_{it} is serially correlated over t, following an AR(1) process,

$$\varepsilon_{it} = \rho \varepsilon_{it-1} + e_{it}. \qquad (3.10)$$

An AR(1) process does a fair job of modeling the persistence in outcomes that we see in empirical data. However, this could be extended even further by allowing for an AR(p) or even an ARMA(p, q) process.

3.3 Estimation Methods

Given the set-up above, we will consider two different estimation procedures: JMLE and PMLE. Section 15.8 of Wooldridge (2010) reviews these two estimation methods (as well as others) and their accompanying assumptions and implications with only additive individual heterogeneity. We extend this by introducing individual heterogeneity in the slope and by focusing on the implications of serial correlation.

3.3.1 Mixed Effects Probit

We will first look at the JMLE, referred to as the Mixed Effects (ME) Probit, which can be derived through two similar but distinct frameworks: one can be viewed as an extension of the Random Effects Probit (described in Wooldridge (2010), chapter 15.8) that allows for a random coefficient, and the other as a Generalized Linear Mixed Model (GLMM) with a Bernoulli distribution and a Probit link function.
Under the first framework, we may consider the set up described in equation (3.8), so that the marginal density of y_{it} conditional on the contemporaneous regressors and random coefficients is

$$f(y_{it}|x_{it}, a_i, b_i) = \Phi(a_i + x_{1it}b_{1i} + x_{2it}\beta_2)^{y_{it}} \left(1 - \Phi(a_i + x_{1it}b_{1i} + x_{2it}\beta_2)\right)^{1-y_{it}}. \qquad (3.11)$$

If one were to assume independence across t, the joint density (over time) would be the product of the marginal densities. By allowing for correlated random effects and integrating out the random effects (u_{1i}, u_{2i}) we obtain

$$f(y_{i1}, \dots, y_{iT}|x_{i1}, \dots, x_{iT}) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \prod_{t=1}^{T} \Phi\left(\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b + u_1 + x_{1it}u_2\right)^{y_{it}}$$
$$\times \left(1 - \Phi\left(\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b + u_1 + x_{1it}u_2\right)\right)^{1-y_{it}} \times \frac{1}{\sigma_1\sigma_2}\,\phi\!\left(\frac{u_1}{\sigma_1}\right)\phi\!\left(\frac{u_2}{\sigma_2}\right) du_1\, du_2 \qquad (3.12)$$

where σ_1 and σ_2 are the standard deviations of u_{1i} and u_{2i}, respectively. The integral has no closed-form solution but can be estimated using numerical methods.³ Taking the log of equation (3.12) gives the conditional log likelihood for each i. Maximizing the sum, over i, of the log likelihood with respect to the parameters α, β_1, β_2, σ_1, and σ_2 produces the JMLE estimator.

As for the second framework, let us consider the definition of a GLMM in chapter 4 of McCulloch and Neuhaus (2001) (with notation altered slightly),

$$Y_i (= (y_{i1}, \dots, y_{iT}))\,|\,u \sim \text{independent } f_{Y_i|u}(y_i|u)$$
$$h(E(Y_i|u)) = X_i B + Z_i u$$
$$u \sim f_U(u) \qquad (3.13)$$

where X and Z are considered fixed design matrices and u is the only random effect, such that XB is the fixed component and Zu is the random component. Note that in our set up, h(·) is the inverse of the standard normal CDF, which is why this estimator will be referred to as the Mixed Effects (ME) Probit estimator.

³The simulation uses a mean-variance adaptive Gauss-Hermite quadrature, but other procedures such as a Laplacian approximation could be used.
Then defining the components in equation (3.13) to match the set up described by equation (3.8) yields

$$X_i = \left(1_T,\; x_{1i},\; x_{2i},\; 1_T\, g_1(\bar{x}_i),\; x_{1i}\, g_2(\bar{x}_i)\right), \qquad B' = \left(\alpha,\; \beta_1,\; \beta_2',\; \xi_a',\; \xi_b'\right),$$
$$Z_i = \left(1_T,\; x_{1i}\right), \qquad u = \begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix}$$

where 1_j is a j × 1 vector of ones and x_{1i} and x_{2i} are the stacked time observations for individual i. The standard GLMM estimator then computes the log likelihood under the assumption of independence across t and integrates out the random effects u = (u_{1i}, u_{2i})'.

There are several concerns that should be addressed with this estimator. First, in practice, the second level equations (defining a_i and b_i) are often not given a flexible specification that allows for correlation with the regressors x_{1i} and x_{2i}. It is perhaps the assumption of "fixed design matrices" in the GLMM literature that leads to a general lack of concern for correlation between the random effect u and the design matrices X and Z. As our simulation study will show, not allowing for correlated random effects results in heavily biased parameter estimates.

Second, the JMLE can be quite computationally demanding. The discussion of the simulation study results will provide more detail but, to summarize, the ME Probit estimator is more likely to fail to converge and, when it does converge, takes much longer than the alternative estimator. Failures to converge are more frequent when the true data generating process does not have any random effects, so that the parameters are at the boundary of the identified set (σ₁² = σ₂² = 0). The slower speeds arise because the ME Probit estimator must numerically approximate several integrals.

Third, note that this estimator depends on independence across t. Since we are introducing serial correlation through an AR(1) process, we would expect the estimator to be inconsistent.
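To make the computational burden concrete, the double integral in equation (3.12) can be approximated by product Gauss-Hermite quadrature. The following Python sketch (illustrative only: the estimation in this chapter was done in STATA, this rule is non-adaptive unlike the mean-variance adaptive quadrature of footnote 3, and the choice g_1(·) = g_2(·) = x̄_{1i} is a placeholder) evaluates the joint likelihood for one individual:

```python
import numpy as np
from scipy.stats import norm

def joint_likelihood_i(y, x1, x2, theta, n_nodes=15):
    """Approximate the double integral in equation (3.12) for one individual
    by a product Gauss-Hermite rule (non-adaptive, for illustration only)."""
    alpha, b1, b2, xi_a, xi_b, s1, s2 = theta   # s1, s2 play the role of sigma_1, sigma_2
    g1 = g2 = x1.mean()                          # placeholder choice of g_1(.), g_2(.)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    lik = 0.0
    # change of variables u_k = sqrt(2)*sigma_k*z_k turns each N(0, sigma_k^2)
    # integral into a Gauss-Hermite sum; the 1/pi below collects both 1/sqrt(pi) factors
    for u1, w1 in zip(np.sqrt(2.0) * s1 * nodes, weights):
        for u2, w2 in zip(np.sqrt(2.0) * s2 * nodes, weights):
            idx = alpha + x1 * b1 + x2 * b2 + g1 * xi_a + x1 * g2 * xi_b + u1 + x1 * u2
            p = norm.cdf(idx)
            lik += w1 * w2 * np.prod(p ** y * (1.0 - p) ** (1 - y))
    return lik / np.pi
```

Summing the log of this quantity over i and maximizing gives the JMLE; with n_nodes quadrature points per dimension, each likelihood evaluation requires n_nodes² passes over every individual's data, which is one source of the computational cost discussed in the simulation results.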
In particular, let us re-examine the joint distribution of (y_{i1}, ..., y_{iT}) conditional on the x's when the latent errors are correlated over t. Suppose

$$(\varepsilon_{i1}, \dots, \varepsilon_{iT})' \sim N(0, \Sigma_\varepsilon)$$

where, under an AR(1) process with correlation coefficient ρ, Σ_ε has the form

$$\Sigma_\varepsilon = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \cdots & \rho^{T-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{T-1} & \rho^{T-2} & \cdots & \rho & 1 \end{pmatrix}. \qquad (3.14)$$

By the properties of conditional distributions,

$$f(y_{i1}, \dots, y_{iT}|x_{1i}, x_{2i}, a_i, b_i) = f(y_{iT}|y_{i1}, \dots, y_{iT-1}, x_{1i}, x_{2i}, a_i, b_i) \times f(y_{iT-1}|y_{i1}, \dots, y_{iT-2}, x_{1i}, x_{2i}, a_i, b_i) \times \cdots \times f(y_{i2}|y_{i1}, x_{1i}, x_{2i}, a_i, b_i) \times f(y_{i1}|x_{1i}, x_{2i}, a_i, b_i), \qquad (3.15)$$

and solving for the conditional means of y_{it} for t ≥ 2 yields

$$E(y_{it}|y_{i1}, \dots, y_{it-1}, x_{1i}, x_{2i}, a_i, b_i) = E\left(E(y_{it}|u_{it-1}, y_{i1}, \dots, y_{it-1}, x_{1i}, x_{2i}, a_i, b_i)\,\Big|\, y_{i1}, \dots, y_{it-1}, x_{1i}, x_{2i}, a_i, b_i\right)$$
$$= E\left(\Phi\!\left(\frac{a_i + x_{1it}b_{1i} + x_{2it}\beta_2 - \rho u_{it-1}}{\sqrt{1-\rho^2}}\right)\,\Big|\, y_{i1}, \dots, y_{it-1}, x_{1i}, x_{2i}, a_i, b_i\right)$$
$$= \left[\int_{a_i + x_{1it-1}b_{1i} + x_{2it-1}\beta_2}^{\infty} \Phi\!\left(\frac{a_i + x_{1it}b_{1i} + x_{2it}\beta_2 - \rho u}{\sqrt{1-\rho^2}}\right)\phi(u)\,du\right]^{y_{it-1}} \left[\int_{-\infty}^{a_i + x_{1it-1}b_{1i} + x_{2it-1}\beta_2} \Phi\!\left(\frac{a_i + x_{1it}b_{1i} + x_{2it}\beta_2 - \rho u}{\sqrt{1-\rho^2}}\right)\phi(u)\,du\right]^{1-y_{it-1}}$$
$$\equiv E(y_{it}|y_{it-1}, x_{1it}, x_{1it-1}, x_{2it}, x_{2it-1}, a_i, b_i).$$

In words, the mean of y_{it} conditional on all past observations and the random effects a_i and b_i is only a function of the data from the last time period and the random effects. However, due to the nonlinearity of the Probit model, the conditional mean depends on the past data in a complicated manner. It is the difference between E(y_{it}|y_{it-1}, x_{1it}, x_{1it-1}, x_{2it}, x_{2it-1}, a_i, b_i) and E(y_{it}|x_{1it}, x_{2it}, a_i, b_i) that suggests the ME Probit estimator is inconsistent under an AR(1) process. Of course, one could consider estimating using a joint likelihood based on an AR(1) model rather than assuming independence.
However, this would require correctly specifying the dependence structure as AR(1), and in empirical data a simple AR(1) process may not be able to fully capture the complex time dependencies. Moreover, Keane (1994) discusses the difficulty of doing so directly and instead proposes a simulated variation of a Method of Moments estimator. In this chapter we do not consider a simulated version of the JMLE method, since this is rarely done in practice.

3.3.2 Pooled Heteroskedastic Probit

The PMLE method is an alternative to a JMLE method and requires fewer assumptions. Unlike the ME Probit, the Pooled Heteroskedastic Probit does not depend on correct specification of the joint likelihood and therefore is consistent in the presence of serial correlation. Note that the pooled method produces a conditional mean similar to that of a heteroskedastic Probit. The heteroskedasticity in the pooled method is due to the heterogeneous slope coefficient, rather than the traditional interpretation of heteroskedasticity in the latent error. In deriving the Pooled Heteroskedastic Probit estimator, we apply the assumptions stated in section 2,

$$(u_{1i} + x_{1it}u_{2i} + \varepsilon_{it})\,|\,x_{1i}, x_{2i} \sim N(0,\; 1 + \sigma_1^2 + x_{1it}^2\sigma_2^2) \quad \text{and} \quad E(y_{it}|x_{1i}, x_{2i}) = \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1it}^2\sigma_2^2}}\right). \qquad (3.16)$$

Plugging the conditional mean into the Bernoulli density, taking logs, and pooling over i and t yields the following log likelihood:

$$\sum_{i=1}^{N}\sum_{t=1}^{T} \left[\, y_{it} \ln \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1it}^2\sigma_2^2}}\right) + (1 - y_{it}) \ln\!\left(1 - \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1it}^2\sigma_2^2}}\right)\right) \right]. \qquad (3.17)$$

A standard practice in statistical packages is to assume an exponential variance function, which ensures a strictly positive variance in estimation.
We can then rewrite equation (3.16) as

$$E(y_{it}|x_{1i}, x_{2i}) = \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\exp\!\left(\tfrac{1}{2}\ln(1 + \sigma_1^2 + x_{1it}^2\sigma_2^2)\right)}\right)$$
$$= \Phi\!\left(\frac{\left(\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b\right)\big/\sqrt{1 + \sigma_1^2}}{\exp\!\left(\tfrac{1}{2}\ln\!\left(1 + \tfrac{\sigma_2^2}{1+\sigma_1^2}x_{1it}^2\right)\right)}\right)$$
$$= \Phi\!\left(\frac{\alpha_\sigma + x_{1it}\beta_{1\sigma} + x_{2it}\beta_{2\sigma} + g_1(\bar{x}_i)\xi_{a\sigma} + x_{1it}g_2(\bar{x}_i)\xi_{b\sigma}}{\exp(v(x_{1it}))}\right) \qquad (3.18)$$

where θ_σ = θ/√(1 + σ₁²) is the scaled coefficient for θ = α, β₁, β₂, ξ_a, ξ_b. As a result, using a Pooled Heteroskedastic Probit approach does not allow us to separately identify (α, β₁, β₂, ξ_a, ξ_b) and √(1 + σ₁²). If we focus on the APEs (to be defined formally in the subsequent section), we only need estimates of the scaled coefficients. The function v(x_{1it}) does not include a constant, a necessary requirement for identification in a heteroskedastic Probit; any constant in v(x_{1it}) would be incorporated into the scaling factor. We can approximate v(x_{1it}) using a polynomial expansion:⁴

$$v(x_{1it}) = \frac{1}{2}\ln\!\left(1 + \frac{\sigma_2^2}{1+\sigma_1^2}x_{1it}^2\right) \approx \frac{1}{2}\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}\left(\frac{\sigma_2^2}{1+\sigma_1^2}x_{1it}^2\right)^{n} \qquad (3.19)$$

where we have used a Taylor expansion in the second expression.

Compared to the ME Probit estimator, the Pooled Heteroskedastic Probit is computationally simple, identified when there are no random effects, and consistent under serial correlation. The drawbacks are having to approximate the function v(x_{1it}) when using a preprogrammed command, a loss of efficiency relative to the JMLE, and not being able to separately identify the scaled parameters.

⁴One could maximize the log likelihood in equation (3.17) directly without approximating the heteroskedastic function v(x_{1it}). However, to allow for easy implementation using preprogrammed commands such as hetprobit in STATA, v(x_{1it}) can be well approximated by x_{1it}δ₁ + x²_{1it}δ₂ + x³_{1it}δ₃ + x⁴_{1it}δ₄.

3.4 Average Partial Effects

As discussed in the introduction, a more meaningful statistic in our model of interest is the Average Partial Effect (APE). This section discusses the identification and formulation of the ASF and APEs using the results of Wooldridge (2005). Identification is shown using his Lemma 2.2 and then the ASF is calculated using the results of his Lemma 2.1. We will then discuss the derivation and interpretation of the APEs and contrast them with the Partial Effect at the Average that is commonly computed in lieu of the APEs.

The set up explained in Section 2 can be seen as an extension of the Probit example given in Wooldridge (2005) that allows for a random coefficient. As a consequence of the Mundlak device, the observable random variables (w in his notation) that help identify the unobserved heterogeneity (q in his notation) are the time averages of the covariates, x̄_i. We start from the Average Structural Function (ASF) defined in Blundell and Powell (2004). The ASF defines the structural relationship between the expected outcome and the covariates, averaging out all the unobserved heterogeneity. To obtain identification of the ASF using Lemma 2.2, the following ignorability assumptions must be satisfied. Applied to the notation of our model, the first is an excludability assumption that requires

$$E(y_{it}|x_{it}, a_i, b_i, \bar{x}_i) = E(y_{it}|x_{it}, a_i, b_i), \qquad (3.20)$$

and the second is a selection on observables assumption that requires

$$D(a_i, b_i|\bar{x}_i, x_{it}) = D(a_i, b_i|\bar{x}_i), \qquad (3.21)$$

where D(·) denotes the distribution. Equation (3.9) satisfies the ignorability assumptions and therefore, following Lemma 2.2, the ASF is identified from μ₂(x_{it}, x̄_i) = E(y_{it}|x_{it}, x̄_i), where

$$\mu_2(x_{it}, \bar{x}_i) = E(y_{it}|x_{it}, \bar{x}_i) = E(1\{a_i + x_{1it}b_{1i} + x_{2it}\beta_2 + \varepsilon_{it} > 0\}|x_{it}, \bar{x}_i)$$
$$= E(1\{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b + u_{1i} + x_{1it}u_{2i} + \varepsilon_{it} > 0\}|x_{it}, \bar{x}_i)$$
$$= \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1it}^2\sigma_2^2}}\right). \qquad (3.22)$$

The ASF, E_{(a_i,b_i)}(μ₁(x_o, (a_i, b_i))), can then be calculated using Lemma 2.1 as

$$E_{(a_i,b_i)}(\mu_1(x_o, (a_i, b_i))) = E_{\bar{x}_i}(\mu_2(x_o, \bar{x}_i)) = E_{\bar{x}_i}\!\left[\Phi\!\left(\frac{\alpha + x_{1o}\beta_1 + x_{2o}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1o}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1o}^2\sigma_2^2}}\right)\right].$$

Next we take the partial derivative of the ASF with respect to x_{1o} (the variable of interest). Under typical regularity conditions that allow the derivative to pass through the integration, the partial effect with respect to x_{1o}, evaluated at the values x_{1o} and x_{2o}, takes the form

$$PE(x_{1o}, x_{2o}) = \frac{\partial E_{\bar{x}_i}(\mu_2(x_o, \bar{x}_i))}{\partial x_{1o}}$$
$$= E_{\bar{x}_i}\!\left[\phi\!\left(\frac{\alpha + x_{1o}\beta_1 + x_{2o}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1o}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1o}^2\sigma_2^2}}\right) \times \left(\frac{\beta_1 + g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1o}^2\sigma_2^2}} - \frac{x_{1o}\sigma_2^2\left(\alpha + x_{1o}\beta_1 + x_{2o}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1o}g_2(\bar{x}_i)\xi_b\right)}{(1 + \sigma_1^2 + x_{1o}^2\sigma_2^2)^{3/2}}\right)\right]. \qquad (3.23)$$

We then define APE = E(PE(x_{1it}, x_{2it})), where the inner expectation is with respect to x̄_i and the outer expectation is with respect to x_{1it} and x_{2it}. Since we will have estimates of σ₁² and σ₂² from the ME Probit estimator, the APE can be estimated by plugging in the parameter estimates and replacing the expectations with sample averages.

Alternatively, following the Pooled Heteroskedastic Probit estimator, in which we use an exponential function for the heteroskedasticity and only obtain estimates of the scaled parameters, the partial effect can also be written as

$$PE(x_{1o}, x_{2o}) = \frac{\partial E_{\bar{x}_i}(\mu_2(x_o, \bar{x}_i))}{\partial x_{1o}}$$
$$= E_{\bar{x}_i}\!\left[\phi\!\left(\frac{\alpha_\sigma + x_{1o}\beta_{1\sigma} + x_{2o}\beta_{2\sigma} + g_1(\bar{x}_i)\xi_{a\sigma} + x_{1o}g_2(\bar{x}_i)\xi_{b\sigma}}{\exp(v(x_{1o}))}\right) \times \left(\frac{\beta_{1\sigma} + g_2(\bar{x}_i)\xi_{b\sigma} - (\partial v(x_{1o})/\partial x_{1o})\left(\alpha_\sigma + x_{1o}\beta_{1\sigma} + x_{2o}\beta_{2\sigma} + g_1(\bar{x}_i)\xi_{a\sigma} + x_{1o}g_2(\bar{x}_i)\xi_{b\sigma}\right)}{\exp(v(x_{1o}))}\right)\right]. \qquad (3.24)$$
To estimate the above quantity, we replace the scaled coefficients and the heteroskedastic function v(x_{1o}) with the Pooled Heteroskedastic Probit estimates.

In this chapter, we advocate the APE calculated from the ASF as the statistic that most appropriately captures the effect of interest. However, the literature also places value on what we will refer to as the Partial Effect at the Average (PEA). In the linear case the APE and PEA are equivalent, whereas in the nonlinear case they can be quite different. The source of their difference follows from the basic principle that expectations cannot pass through nonlinear functions. In our model, the PEA is simply the partial derivative of E_{x̄_i}(μ₁(x_o, (E(a_i|x̄_i), E(b_i|x̄_i)))) where the unobserved heterogeneity is evaluated at its conditional mean:

$$PEA(x_{1o}, x_{2o}) = \frac{\partial E_{\bar{x}_i}(\mu_1(x_o, (E(a_i|\bar{x}_i), E(b_i|\bar{x}_i))))}{\partial x_{1o}}$$
$$= E_{\bar{x}_i}\!\left[\phi\!\left(\alpha + x_{1o}\beta_1 + x_{2o}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1o}g_2(\bar{x}_i)\xi_b\right) \times \left(\beta_1 + g_2(\bar{x}_i)\xi_b\right)\right]. \qquad (3.25)$$

Note that the PEA only incorporates the part of the heterogeneity that is correlated with the observables. In fact, there is no distinction between the PEA in our model, which allows for heterogeneity in the level and slope, and the APE in a model that assumes constant effects but lets the time averages enter the structural function as additional covariates. We argue that the APEs calculated from equations (3.23) and (3.24) truly capture the heterogeneous effect, while the PEA mutes the genuine impact of heterogeneity.

3.5 Simulation

In this section, we investigate the behaviour of the two estimators with simulated data. In particular, we are interested in the trade-off between the robustness of the Pooled Heteroskedastic Probit estimator under serial correlation and the ability of the ME Probit estimator to separately identify the variance components. We consider several variations on the model described in equation (3.8).

1.
The covariates are iid over i and t, drawn from the multivariate normal distribution

$$\begin{pmatrix} x_{1it} \\ x_{2it} \end{pmatrix} \sim N\left(\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & 0.3 \\ 0.3 & 1 \end{pmatrix}\right), \qquad (3.26)$$

and the random coefficients are generated as

$$a_i = -0.25 - 0.5\bar{x}_{1i} - 0.25\bar{x}_{1i}^2 - 0.1\bar{x}_{1i}\bar{x}_{2i} + u_{1i}$$
$$b_i = 1.25 - 0.5\bar{x}_{1i} - 0.25\bar{x}_{1i}^2 - 0.1\bar{x}_{1i}\bar{x}_{2i} + u_{2i} \qquad (3.27)$$

where the random effects u_{1i}, u_{2i} are generated from the multivariate normal distribution

$$\begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix} \sim N\left(0, \begin{pmatrix} 0.5 & 0 \\ 0 & 0.25 \end{pmatrix}\right). \qquad (3.28)$$

2. The variable of interest is iid over i but correlated over t through an AR(1) process,

$$x_{1i1} = a_{xi} + e_{1i1}$$
$$x_{1it} = 0.5a_{xi} + 0.5x_{1it-1} + e_{1it}, \quad t = 2, 3, \dots, T \qquad (3.29)$$

where a_{xi} ∼ iid N(1, 0.2) is the persistent individual effect, and e_{1i1} ∼ iid N(0, 0.2) and e_{1it} ∼ iid N(0, 1 − 0.5² − 0.2(0.5²)) are additional noise terms. Although x_{1it} is not independent over t, it is identically distributed. To induce correlation between the regressors, we let x_{2it} = 0.7 + 0.3x_{1it} + e_{2it}, with e_{2it} ∼ iid N(0, 1 − 0.3²). The correlated random coefficients are generated in the same way as in the first DGP, described in equations (3.27)-(3.28).

3. As in DGP 1, the covariates are generated iid over i and t and distributed as in equation (3.26). In this DGP, we are interested in whether using a simple specification for the correlated random coefficients, when the random coefficients are generated in a more flexible manner, would still result in the correlated random effects specification performing better than the random effects specification that does not allow for any correlation between the random coefficients and the covariates. The random coefficients are generated with the equations

$$a_i = -0.25 - 0.25\bar{x}_{1i}^3 - 0.15\bar{x}_{2i}^4 + u_{1i}$$
$$b_i = 1.25 - 0.25\bar{x}_{1i}^3 - 0.15\bar{x}_{2i}^4 + u_{2i} \qquad (3.30)$$

but in estimation we will only include polynomial functions of the time averages up to order 2.
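As a cross-check, DGP 1 can be simulated in a few lines. The Python sketch below is illustrative (the chapter's simulations were run in STATA); in particular, the value beta2 = 1.0 is a placeholder, since the true coefficient on x_{2it} is not reported in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dgp1(N=300, T=5, rho=0.4, beta2=1.0):
    """Simulate DGP 1: covariates from (3.26), correlated random coefficients
    from (3.27)-(3.28), and AR(1) latent errors from (3.10).
    beta2 is a placeholder; the true beta_2 is not reported in the excerpt."""
    x = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], size=(N, T))
    x1, x2 = x[..., 0], x[..., 1]
    x1bar, x2bar = x1.mean(axis=1), x2.mean(axis=1)
    u1 = rng.normal(0.0, np.sqrt(0.5), N)        # Var(u1) = 0.5
    u2 = rng.normal(0.0, np.sqrt(0.25), N)       # Var(u2) = 0.25, sigma_12 = 0
    a = -0.25 - 0.5 * x1bar - 0.25 * x1bar**2 - 0.1 * x1bar * x2bar + u1
    b = 1.25 - 0.5 * x1bar - 0.25 * x1bar**2 - 0.1 * x1bar * x2bar + u2
    eps = np.empty((N, T))                       # AR(1) with unit marginal variance
    eps[:, 0] = rng.normal(0.0, 1.0, N)
    for t in range(1, T):
        eps[:, t] = rho * eps[:, t - 1] + rng.normal(0.0, np.sqrt(1.0 - rho**2), N)
    y = (a[:, None] + b[:, None] * x1 + beta2 * x2 + eps > 0).astype(int)
    return y, x1, x2
```

Scaling the AR(1) innovations by √(1 − ρ²) keeps the marginal variance of ε_{it} at one for every ρ, so changing ρ changes only the persistence, not the scale, of the latent error.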
Finally, we vary the cases over the number of time periods observed (T = 5, 10, 20) and the level of serial correlation in the latent error ε_{it} (ρ = 0, 0.4, 0.8 in equation (3.10)). We expect the ME Probit estimator to be inconsistent under serial correlation, while the Pooled Heteroskedastic Probit estimator should be consistent under either serial correlation or independence. In addition, we would expect the ME Probit estimator to be more efficient than the pooled estimator under serial independence. We estimate two specifications for both the ME Probit and Pooled Heteroskedastic Probit estimators. The first specification incorrectly assumes that the random effects a_i and b_i are uncorrelated with the x's, while the second specification assumes that a_i and b_i are random effects correlated with the x's through their time averages.

3.5.1 Computational Results

In addition to providing results on estimation consistency, it seems prudent to report results on the computational ease of implementation. All estimation was performed in STATA⁵ using the commands meprobit and hetprobit. Since the JMLE requires numerically integrating out the random effects, we expect the ME Probit estimator to take longer. Tables G.1–G.3 present the average run times of the two estimators, with several notable features. Although the estimation times may seem short (5 seconds at most), this reflects the simple specification of only two covariates and the fairly small sample size of 300 individuals. As the specifications become more complex and the sample size increases, the time to compute will lengthen. Therefore we will focus on the relative speed of the two estimators.

⁵STATA is among the most popular software used by researchers in social science. Other software such as Matlab and R also have built-in commands that perform similar functions to those in STATA.
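Outside of a preprogrammed command, the pooled estimator is also straightforward to code directly. The sketch below maximizes the pooled log likelihood (3.17) in the exponential-variance form (3.18) using scipy; it is a hypothetical interface, not a reimplementation of hetprobit, and the caller supplies the mean regressors X and the variance regressors Z (e.g., the polynomial terms of footnote 4 approximating v(x_{1it}), with no constant in Z).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_pooled_het_probit(y, X, Z):
    """Pooled Heteroskedastic Probit: maximize the pooled log likelihood
    (3.17) in the exponential-variance form (3.18). X holds the mean
    regressors (including a constant); Z holds the variance regressors,
    which must not include a constant (see the identification discussion)."""
    k, m = X.shape[1], Z.shape[1]

    def negll(par):
        beta, delta = par[:k], par[k:]
        p = norm.cdf((X @ beta) / np.exp(Z @ delta))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)       # guard the logs
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(negll, np.zeros(k + m), method="BFGS")
    return res.x[:k], res.x[k:]                   # scaled coefficients, variance parameters
```

The returned mean coefficients are the scaled parameters θ_σ of equation (3.18). Because observations are pooled over t, valid standard errors would require a cluster-robust (sandwich) estimator, which this sketch omits.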
As expected, the ME Probit estimator always takes much longer than the Pooled Heteroskedastic Probit estimator, almost six times as long. In addition, the distribution of the ME Probit times is much more variable, with standard deviations as large as 9.03 (in DGP 1), so there appear to be some outliers skewing the distribution to the right. This confirms our initial expectation that the ME Probit estimator suffers computationally compared to the pooled method.

3.5.2 Parameter Estimates

The estimates from the ME Probit estimator and Pooled Heteroskedastic Probit estimator are not comparable "as is" because the Pooled Heteroskedastic Probit estimator can only estimate the scaled coefficients. Therefore we will first look at the "de-scaled" ME Probit coefficient estimates and then compare the two estimation procedures using the "scaled" coefficient estimates. Recall that we calculate the scaled ME Probit estimates by dividing the coefficient estimates by the estimate of the scale value, √(1 + σ̂₁²).

Tables G.4–G.6 present the bias and standard deviation (in parentheses) of the de-scaled ME Probit estimates. The first thing to note is that specification (1) performs quite poorly at any level of serial correlation. Since specification (1) incorrectly assumes that the random effects a_i and b_i are uncorrelated with the covariates, this emphasizes the importance of allowing for correlation between the random coefficients and the covariates. In this particular setting, not doing so would lead to the interpretation that the regressor x_{1it} has no strong predictive power for the outcome y_{it}. Turning to the correct specification that allows for correlated random coefficients, specification (2), in DGP 1 and 2 under no serial correlation (ρ = 0) the ME Probit appears to perform well with very little bias.
But in DGP 3, where specification (2) acts as an approximation to a more flexible correlated random coefficients structure, there appears to be a significant amount of bias, though of a lesser degree than not attempting to control for correlated random coefficients at all. As the level of serial correlation increases, we see an increasingly positive bias in the de-scaled coefficient estimates. This confirms our earlier discussion, where we argued that the JMLE procedure should be sensitive to correlation over the time dimension.

We do see quite a loss of efficiency when using specification (2) relative to specification (1). This is because the correlated random coefficient approach requires including several additional terms, polynomials of the time averages, that may be strongly collinear with one another. We would expect that as the distribution of the time averages becomes more concentrated (i.e., as the number of time observations increases), the terms in the correlated random coefficient specification become more collinear. So even though the standard deviations for specification (1) decrease as the number of time observations increases, the standard deviations for specification (2) increase. Finally, as the level of serial correlation increases, there is also an increase in the standard deviations. Therefore, using the JMLE procedure, introducing serial correlation results in increasing bias and increasing variance in the de-scaled parameter estimates.

To compare the two estimation procedures, however, we need to look at the scaled parameter estimates. Both the scaled Pooled Heteroskedastic Probit and ME Probit estimates are presented in Tables G.7–G.9. Again, and for both estimators, there is strong bias with specification (1), which does not allow the random coefficients to be correlated with the x's. Moreover, as we expected, the bias of the Pooled Heteroskedastic Probit estimator is unaffected by the level of serial correlation.
More surprising is that the bias of the scaled ME Probit estimates induced by serial correlation is diminished. In fact, the bias of the scaled ME Probit estimates appears to approach the level of bias observed for the Pooled Heteroskedastic Probit estimates. We also see that the ME Probit scaled coefficient estimates are slightly more efficient, even under serial correlation. Of course, one should expect a JMLE procedure to be more efficient than a PMLE procedure when the distribution is correctly specified. However, this need not be true when the joint likelihood is misspecified and the pooled likelihood is still correctly specified. The efficiency of the ME Probit estimator in this simulation should not be mistaken for a general result; however, since it appears to be fairly stable across all three DGPs, it may be worth investigating theoretically. There therefore seems to be a bias-efficiency trade-off in terms of the scaled parameter estimates. If one compares on the basis of Root Mean Squared Error, as in Table G.10, the ME Probit estimator appears superior to the Pooled Heteroskedastic Probit estimator under all of the different sampling scenarios.

Why are the ME Probit estimates of the scaled coefficients performing so well when the de-scaled coefficients are quite poor? The answer lies in the estimation of the scaling factor, in which the ME Probit is able to identify and estimate σ₁². Figures F.1–F.9 present the empirical distribution of σ̂₁². Recall the true variance is 0.5, so under no serial correlation the ME Probit estimator does a fair job of estimating the variance component. But as serial correlation increases, the distribution of the variance estimates moves to the right, suggesting an upward bias. Since we also see an upward bias in the de-scaled coefficient estimates, the biases cancel each other when calculating the scaled coefficients.
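The scaled coefficients behind this comparison are a simple transformation of the ME Probit output, θ_σ = θ/√(1 + σ₁²) from equation (3.18). A small sketch with illustrative (stylized, not simulation) numbers shows how upward biases in θ̂ and σ̂₁² can offset:

```python
import numpy as np

def descale_to_scaled(coefs, sigma1_sq):
    """Convert ME Probit coefficient estimates into the scaled coefficients
    theta_sigma = theta / sqrt(1 + sigma_1^2) of equation (3.18), the
    quantities comparable with Pooled Heteroskedastic Probit output and the
    inputs to the APE."""
    return np.asarray(coefs, dtype=float) / np.sqrt(1.0 + sigma1_sq)

# Stylized numbers: a 20% upward bias in the coefficient, paired with a
# variance estimate inflated so that 1 + 1.16 = 1.2^2 * (1 + 0.5), cancels
# exactly in the scaled coefficient.
truth = descale_to_scaled([1.0], 0.5)    # theta = 1.0, sigma_1^2 = 0.5
biased = descale_to_scaled([1.2], 1.16)  # both give 1/sqrt(1.5)
```

Both calls return approximately 0.8165, so the APE, which depends only on the scaled coefficients, is left essentially unchanged by this pattern of biases.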
Since the ME Probit assumes no serial correlation, it interprets the persistence in the latent error as part of the individual heterogeneity. Returning to the latent variable set up in equation (3.8), the unobserved error that the estimator is trying to parse is

$$(\text{unobserved error})_{it} = u_{1i} + x_{1it}u_{2i} + \varepsilon_{it}. \qquad (3.31)$$

The serial correlation in ε_{it} looks a lot like the persistence induced by the additive heterogeneity u_{1i}. Consequently, the estimate of the variance component is biased upward and less precise compared to the case of no serial correlation. An advantage of the JMLE procedure is that it is able to identify both of the variance components σ₁² and σ₂². However, given this analysis, one should be wary of the validity of the σ₁² estimates if there is concern about serial correlation. As for σ₂², the results reported in Table G.11 mirror the other coefficient estimates: the de-scaled estimates display increasing bias over the level of serial correlation, which cancels with the scaling such that the scaled estimates appear unbiased. Incorrectly assuming that there is no serial correlation would lead a researcher to incorrectly conclude that the distribution of the random coefficient is much more dispersed than it truly is. Nevertheless, the bias works in our favor if one is more interested in the APEs, which we typically are. As mentioned previously, the scaled coefficients are what determine the APE estimates. The results from the simulation thus far suggest that both estimation procedures will perform reasonably well, given that both estimate the scaled coefficients with fairly small bias (when they allow for correlated random effects). The question remains whether any efficiency gains will be observed when using a misspecified JMLE over a PMLE. Moreover, we saw that not allowing for correlated random effects (specification (1)) biases the parameter estimates.
Some may hope that the simple specification will still be able to capture an average effect. But from theory, we know that this is unlikely in a nonlinear model such as the Probit, because the average does not pass through nonlinear functions. Finally, we saw poor parameter estimates for DGP 3, where the correlated random coefficient structure was only approximated. If we are only interested in the APEs, can an approximation for the random coefficient be sufficient to capture the correlation structure with the covariates?

3.5.3 Average Partial Effect Estimates

Estimates of the Average Partial Effects with respect to x_{1it} are presented in Tables G.12–G.14. In line with the results on the coefficient estimates, not allowing the random effects to be correlated with the x's (specification (1)) results in significant bias across all three DGPs. When we allow the random effects to be correlated with the x's (specification (2)), the bias in the APE estimates shrinks considerably for both estimation procedures.

In DGP 1 and 2, the ME Probit estimates appear to have smaller bias and are more efficient than the Pooled Heteroskedastic Probit estimates. The bias in the de-scaled coefficients and the bias in the variance component of the scaling factor neutralize each other, resulting in very little bias in the ME Probit estimates of the APE at any level of serial correlation. In DGP 3, the Pooled Heteroskedastic Probit estimator tends to have slightly smaller bias but is less efficient than the ME Probit estimator. This suggests that there might be some bias-efficiency trade-off, but after a quick examination of the RMSE, the ME Probit estimator is preferred.

But what does this mean for a researcher working with empirical data? At first glance, one should trust the Pooled Heteroskedastic Probit estimates over the ME Probit estimates, since the specified likelihood is robust to arbitrary correlation over the time dimension.
But the simulation results suggest that under a simple correlation structure, such as AR(1), there may be robustness in the scaled parameter estimates and APEs from a JMLE procedure in which the joint likelihood is misspecified. However, it should also be emphasized that similar scaled coefficient and APE estimates between the ME Probit and Pooled Heteroskedastic Probit procedures should not trick a researcher into thinking that there is statistical evidence that the assumptions underlying the JMLE necessarily hold. Consequently, any interpretations based on the variance estimates, such as the amount of variation explained by the "random" and "fixed" components, should be taken with a hearty amount of skepticism.

Finally, to emphasize the importance of correctly understanding the unobserved heterogeneity and how to incorporate it into the descriptive statistics, we provide calculations of the true PEA as a comparison to the true APE over a single sample. Many researchers turn to the PEA as a simpler-to-calculate approximation of the APE. As explained earlier, the PEA plugs in the averages of the unobserved heterogeneity rather than integrating it out. Although quicker to compute, this does not truly reflect the data structure we believe is present. Table G.15 presents the results over all 3 DGPs and increasing time observations. Note that the true APE and PEA should not vary with any serial correlation in the latent error.

The PEA systematically over-estimates the effect in comparison to the APE. This can be explained by comparing the partial effect equations (3.23) and (3.25). First, the PEA does not incorporate the scaling factor 1/√(1 + σ₁² + x²_{1o}σ₂²). Because of the chain rule, the APE and PEA can each be broken into two similar terms. We refer to the first as the "Probit scaling," which consists of the standard normal PDF (φ(·)) evaluated at some point in relation to the covariates x_{1o} and x_{2o}.
This term ensures that the partial effect diminishes as the covariates x1o and x2o get large in absolute value. It is a consequence of the Probit functional form that bounds the average structural function between 0 and 1. The second term multiplied by the Probit scaling is the “latent” effect. This is the effect of the random variable of interest on the latent index. In a standard Probit with random coefficients, this is just the coefficient on the random variable of interest. By not incorporating the scaling factor, the PEA diminishes the Probit scaling and enlarges the latent effect. This means that the PEA will be shifted up (since β1 is positive) and flatter over the support of the x’s. Second, the latent effect in the PEA does not include the impact of the part of the heterogeneous effect that is uncorrelated with the x’s. This effect varies with the value of x1o and can either enlarge or diminish the latent effect. The main takeaway is that any patterns or significant biases caused by serial correlation in the latent error appear to be significantly muted when it comes to computing the APEs. A major contributing factor is the ability of the ME Probit estimator to somewhat preserve the consistency of the scaled coefficient estimates under specifications in which we would otherwise deem the procedure inconsistent. This calls for a theoretical investigation of the consequences of misspecifying the joint likelihood under serial correlation. 3.5.4 ASF Figures F.10 - F.15 provide ASF estimates for DGP 1 and 2 over the relevant values of x1i, fixing x2i at its mean (1). There is very little difference between the two estimation procedures, over the different levels of serial correlation or over the number of time observations.
This again reiterates that because the ME Probit estimator is able to estimate the scaled coefficients well, statistics such as the APE and ASF that depend only on the scaled coefficient estimates tend to be well estimated. This simulation study has uncovered a surprising number of results which we will summarize here. First, regardless of the estimation procedure, not specifying correlated random effects (i.e., not allowing the random coefficients to be correlated with the covariates) significantly affects the results and their interpretation. Second, the Pooled Heteroskedastic Probit estimator has performed quite well in terms of the scaled parameter estimates, the APE estimates, and the ASF estimates. It produces estimates with fairly low bias and is much quicker, running in 15.3% of the time ME Probit takes to run. However, one of the main drawbacks to the pooled approach is that it is unable to identify the variance components, and therefore some information is lost using this approach. Alternatively, the ME Probit estimator is able to identify and estimate the variance components but relies on specifying the whole joint distribution, which is generally assumed to be independent over time. We saw that there are biases in the de-scaled coefficient and variance component estimates under serial correlation. Therefore, even if the ME Probit can identify these parameters, interpretation should be taken lightly when one is concerned about the presence of serial correlation. But surprisingly these biases appear to counterbalance when calculating the scaled coefficient. This leads to good estimates of the APE and ASF under misspecification of the joint likelihood. Finally, there appear to be efficiency gains using the JMLE approach regardless of whether or not the joint likelihood is misspecified. This is somewhat surprising since there are no theoretical results that would suggest efficiency under misspecification.
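To make the counterbalancing mechanism concrete, here is a minimal numerical sketch (the specific numbers are purely illustrative and not taken from the simulation tables): if serial correlation inflates the de-scaled coefficient estimate and the estimated latent variance by a common factor in the latent scale, the scaled coefficient is left unchanged.

```python
import math

# Illustrative (hypothetical) values, not from the chapter's tables.
beta_true, sig1sq_true = 1.0, 1.0           # de-scaled coefficient, RE variance
scaled_true = beta_true / math.sqrt(1 + sig1sq_true)

# Suppose serial correlation biases the JMLE estimates by a common
# factor k on the latent scale: beta_hat = k*beta and the estimated
# total latent variance 1 + sig1sq_hat = k^2 * (1 + sig1sq_true).
k = 1.25
beta_hat = k * beta_true                     # biased de-scaled coefficient
sig1sq_hat = k**2 * (1 + sig1sq_true) - 1    # biased variance component (2.125)
scaled_hat = beta_hat / math.sqrt(1 + sig1sq_hat)

print(scaled_true, scaled_hat)  # both 0.7071...: the biases cancel in the ratio
```

The point is purely algebraic: any proportional distortion of the latent scale drops out of β/√(1 + σ₁²), which is why the scaled coefficients, APE, and ASF can survive a misspecified joint likelihood while the de-scaled coefficients and variance components do not.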
It should be reiterated that these apparent robustness results could be an artifact of the particular data generating processes considered, but we tried to provide a range of interesting DGPs to investigate them. Although beyond the scope of this chapter, there may be a theoretical result in terms of deriving the bias of the scaled coefficient for the JMLE under relatively simple dependency structures. The next section examines whether the results found here are consistent with real data. Given these results, we would expect the Pooled Heteroskedastic Probit and ME Probit approaches to provide similar scaled coefficient estimates and APE estimates even when we find evidence of serial correlation. 3.6 Application Our application utilizes data from Blattman, Jamison, and Sheridan (2017) (for the remainder of the chapter, referred to as BJS), who study the effect of Cognitive Behavioral Therapy (CBT) on criminal and violent behavior of men in Liberia. After identifying and approaching potential high risk men, the research team obtained 999 men who agreed to enter the sample. Then treatment was assigned randomly within blocks as described in their accompanying appendix. The three possible treatments were: CBT, cash, and both CBT and cash. CBT works to make the patient aware of their automatic negative or self-destructive thoughts so they may be better able to actively change their behavior. Supplying cash should reduce criminal behavior for budget constrained individuals. BJS provides a more thorough discussion of the mechanisms through which these interventions may change behavioral and economic outcomes. The data was collected as a series of 5 surveys. The initial survey provided baseline covariates on the men from the study and was taken prior to treatment. Table G.16 provides a summary of a selection of these variables.
The remaining four endline surveys were taken after 2 weeks, 5 weeks, 12 months and 13 months.6 6Because the surveys are taken unevenly over time, it would be difficult to conclude that the dependency in the latent error follows a simple AR(1) process as we used in the simulations. We believe that this makes any similarities found between the two procedures even more convincing that a robustness property may hold. One of the major differences between our analysis and the initial work done in BJS is that they average the first two surveys and the last two surveys to construct short-run outcomes and long-run outcomes and calculate the effects separately, while we treat it as a panel structure. By doing so, they are able to investigate heterogeneity in treatment effect over time, while we are more interested in heterogeneity that is correlated with the controls. Although many different types of outcomes were recorded and analyzed by BJS, we will look at only some of the antisocial behavior outcomes in more detail. They define antisocial behavior as “disruptive or harmful acts toward others, such as crime or aggression.” We will look at the binary outcomes of selling drugs, being arrested, and engaging in illicit activity.7 Over all the observations, each outcome occurred on average for around 10–13 percent of the population. The last four variables (antisocial behavior index, perseverance index, reward responsiveness and impulsiveness index) are combinations of survey responses that capture the individual's inclination toward a particular characteristic. All are standardized to mean 0 and variance 1. These will be the dimensions along which we will investigate a heterogeneous treatment effect using a correlated random effects approach.
BJS does investigate heterogeneous treatment in their appendix (Table E.7) but uses the endline survey responses to construct the outcome antisocial behavior index and only looks at heterogeneity correlated with the baseline antisocial behavior index and a baseline measure of self-control/patience. To motivate a heterogeneous treatment in the nonlinear ME Probit and Pooled Heteroskedastic Probit models, Table G.17 provides the OLS estimates of a linear probability CRE model. In this setting, a simple linear analysis should provide fairly good estimates of the treatment effects because of the random assignment of the treatment. 7In the survey the respondents are asked if each of these outcomes occurred within the last two weeks. Sells Drugs – All of the interventions decrease the probability of selling drugs, and the sum of the CBT and cash effects is comparable to the effect of both as a treatment. However, the cash intervention effect is not very large and is also not statistically different from 0. The other two treatments are statistically significant. We also see strong evidence of heterogeneity in treatment over the antisocial behavior index and some evidence of heterogeneity over the perseverance index (not statistically significant). The direction of the heterogeneity suggests that those who initially demonstrate antisocial behavior (i.e., one standard deviation away from the average level of antisocial behavior) tend to have a stronger treatment effect (i.e., the treatment effect of both CBT and cash changes from -0.0724 to -0.1432, almost doubling). This is consistent with expecting CBT to have decreasing returns over the level of antisocial behavior (i.e., those who already display low levels of antisocial behavior do not gain much from CBT whereas those with high levels of antisocial behavior can gain much more).
The treatments that include cash have more of an effect if the individual displayed poor perseverance (lower on the perseverance index). A possible explanation is that those with poor perseverance are more likely to have binding budget constraints compared to those with better perseverance. Arrested – None of the interventions show a statistically significant effect on the probability of arrest. On top of that, the cash intervention led to a slightly positive (but not statistically significant) effect, the opposite direction of what one would expect. Even so, there is statistically significant evidence of a heterogeneous effect for the treatment of both CBT and cash over the antisocial behavior index. It is important to note that being arrested is not just a measure of behavior but also a measure of the government's ability to enforce the law. The next outcome looks to isolate the effect on behavior. Illicit – All of the treatments are estimated to have the expected negative effect, and the treatments including CBT have significant effects. Interestingly, the marginal effect of providing cash in addition to therapy is minimal since the treatment effects are estimated to be about the same. The treatment effects are heterogeneous in antisocial behavior for the interventions that include CBT and slightly heterogeneous in perseverance for only both CBT and cash. In BJS, the characteristic of perseverance is sometimes referred to as grit, measured from the responses to seven questions on “the ability to press on in the face of difficulty” from the GRIT scale (Duckworth and Quinn (2009)). The positive direction of the heterogeneity in perseverance can be interpreted as follows: an individual with one standard deviation more perseverance than the average level has a treatment effect from both cash and CBT of -0.0227 compared to -0.0622 (a reduction of almost two-thirds).
This would suggest that perseverance may be a detriment in trying to change individuals’ behaviors and actions. It appears that overall, both CBT and cash produce stronger effects, which may indicate that cash is necessary to loosen the budget constraint such that an individual may change their behavior as influenced by the CBT. Moreover, most of the heterogeneity seems to be captured by the antisocial behavior or the perseverance indexes. Finally, this panel structure suggests the possibility of serial correlation. Following the suggestion in Wooldridge (2010), we test for serial correlation in the linear model by regressing first differences of yit on their lag and testing if the coefficient is equal to -0.5 (as implied by the case of no serial correlation). We are only able to reject the null hypothesis of no serial correlation for the outcome of physical fights (p-value = 0.7689). Motivated by the evidence of a treatment effect and possible heterogeneity in the treatment effect in a simpler linear model, Tables G.18–G.20 provide the parameter estimates for the different Probit specifications. First note that the estimates are scaled parameter estimates and therefore comparable between the different estimation methods. As usual, the reported standard errors for the PMLE are robust to arbitrary serial correlation. But we also report JMLE standard errors that are robust to arbitrary serial correlation. This is usually not done, since serial independence is assumed for consistent estimation. However, given the results of the earlier simulation, we observed fairly accurate scaled coefficient and APE estimates from the JMLE even under serial correlation. Therefore we treat the estimator as if it were a quasi-MLE, knowing the likelihood is misspecified, and adjust the inference accordingly. For all three outcomes, the coefficient estimates on the treatments are quite different between the JMLE and PMLE procedures.
For instance, the cash treatment is estimated to have a negative coefficient for the outcomes of selling drugs and being arrested when estimated using the JMLE procedure, but a positive coefficient in the PMLE approach. In the end, this may still have very little impact on the treatment effect estimates since they are also strongly determined by the heteroskedastic coefficients in the Pooled Heteroskedastic Probit model. An explanation for the stark differences could be that the pooled estimator is quite inefficient, with some standard error estimates approaching 4× the JMLE standard errors. Consequently, the JMLE coefficients are more frequently statistically different from 0, whereas the PMLE coefficients are almost never statistically significant. On the other hand, the efficiency of the JMLE also comes at a computational cost. The Pooled Heteroskedastic Probit estimator was always able to compute within a couple of seconds while the ME Probit estimator took as long as 4 hours to compute. This makes bootstrapping, the common procedure for obtaining standard errors for the ATE and ASF, impractical. Given the evidence in the simulation studies of Section 5, one should not readily trust the variance component estimates in the ME Probit model. However, it is interesting to note that for the outcomes of selling drugs and engaging in illicit activity, the estimator would suggest that the cash treatment and the both-CBT-and-cash treatment do not have a random effect at all. As for the CRE specifications, estimates of the coefficients on the interaction terms appear more similar in direction, magnitude, and efficiency across the two estimators. For the outcome of selling drugs, the most important dimensions of heterogeneity appear to be antisocial behavior and perseverance, especially for the treatments that include cash.
As for being arrested, there is little heterogeneity in the treatment of CBT, but the treatment of cash is heterogeneous in perseverance and reward. Unlike the results of the linear specification, these results suggest that those with more perseverance are less likely to be arrested after being given cash. The reward index compiles responses from eight survey questions to measure “whether [an individual is] motivated by immediate, typically emotional rewards.” The results indicate that an individual more motivated by rewards is less likely to respond well to cash treatments in reducing the probability of being arrested. This may be because, without any changes in their behavior prior to the treatment, they were then rewarded with cash, which provides positive reinforcement of their bad behavior. Similar to what we observe in the linear specification, treatment is heterogeneous with respect to antisocial behavior and perseverance for the outcome of engaging in illicit activity. In particular, the treatment of CBT is fairly heterogeneous in antisocial behavior and the treatment of both CBT and cash is heterogeneous in perseverance, in the same directions as the linear estimates. The implications for the ATE can be seen in Table G.21. We find that, in the OLS and ME Probit estimates, allowing for correlation between the unobserved heterogeneity and the covariates tends to lower the ATE with very little cost to efficiency. It is more of a mixed bag when we look at the Pooled Heteroskedastic Probit estimator. Again, this may be due to the inefficiency of the estimator. Consequently, we tend to see stronger similarities between the ME Probit and OLS estimates. This reiterates the robustness of APEs using ME Probit seen in the simulation study. In all outcomes, the strongest treatment among the three is both CBT and cash. For selling drugs, we find a statistically significant effect of both CBT and cash as well as therapy only.
Interpreting the ME Probit estimates, both CBT and cash reduce, on average, the probability of selling drugs in the future by 7.6 percentage points while the treatment of therapy only reduces the probability by 6.4 percentage points. The Pooled Heteroskedastic Probit estimates differ slightly, estimating a 4.9 percentage point decrease in probability for both CBT and cash. However, there appears to be a significant jump in the standard error, so the difference between the two estimates is not likely to be statistically significant. We find no statistically significant treatment effects for the outcome of being arrested at any conventional level of significance. For illicit activity we find fairly similar estimates between the ME Probit and Pooled Heteroskedastic Probit ATEs. An interesting result is that for some specifications and outcomes, we find the Pooled Heteroskedastic Probit estimator to be more efficient. This was not seen in our simulation results, where the ME Probit estimator always appeared more efficient (even under a misspecified log likelihood). Since we saw ample statistical significance in the correlated random effects, Figures F.19 - F.21 show the surface of the treatment effects over relevant values of two characteristics. As discussed previously, the most influential characteristics are antisocial behavior and perseverance, except in the case of being arrested, in which reward has more impact than perseverance. When looking at the outcome of selling drugs, the first thing to note is that at every combination of relevant characteristic values, there is a treatment that induces an effect in the desired direction. Moreover, this figure tells us that those with low levels of perseverance require the treatment of both therapy and cash whereas those with higher perseverance are better served by just receiving therapy. Finally, both estimation procedures (JMLE and PMLE) produce similar figures with inconsequential differences in interpretation.
Moving to the treatment effects for being arrested, we find that there are areas in which no treatment is able to produce an effect in the desired direction. For those who are relatively better behaved initially and are very responsive to rewards, none of the treatments produce desirable effects. On the other end of the spectrum, a therapy and cash treatment produces a strong effect (i.e., for those with antisocial behavior = 2 and reward = -2, the treatment of both therapy and cash reduces the probability of being arrested by approximately 25 to 30 percentage points). Although the broad conclusions are the same, there are small differences between the two estimators. Unlike the JMLE, the PMLE requires much lower values of reward and antisocial behavior to find cash to be the best treatment. Moreover, the findings of the JMLE show a slightly larger area in which none of the treatments produce desirable effects compared to the PMLE. The conclusions for the outcome of engaging in illicit activities are similar to those found in studying the outcome of selling drugs. Those with lower levels of perseverance require both therapy and cash while those with higher perseverance suffice with just therapy. The conclusion of either only therapy or only cash as the optimal treatment may be unexpected, as it suggests that there is actually a marginal detriment in providing cash when also providing therapy (or vice versa). A possible explanation for this conclusion is the limitations of the model specification. We have assumed that the heterogeneous treatment is linear in the individual's characteristics. Therefore the crossing from both therapy and cash to just therapy or just cash as the optimal treatment may be an unsubstantiated consequence of the marginally strong effect of therapy and cash over just therapy or just cash on the other end of the characteristic spectrum. A possible solution would be to allow for a much more flexible specification of the correlated random effect.
Instead of only specifying linear terms in the random effects, we could also include higher order terms to capture any nonlinear relationship. However, this will increase the dimension of the parameter space fairly quickly, providing grounds for utilizing high dimensional approaches. For instance, including second order terms for the four characteristics (10 terms) for the intercept and each of the treatments will increase the number of parameters by 40. One could extend the work of Wooldridge and Zhu (Forthcoming), who use a debiased estimator of an L1-penalized pooled Probit with correlated random effects (only in the intercept). Unfortunately, to our knowledge, there are no published commands in common statistical packages such as Stata or MATLAB that allow for either a penalized ME Probit (or any penalized ME Generalized Linear Model) estimator or a penalized Heteroskedastic Probit pooled estimator. 3.7 Discussion The results from the simulations and application leave some open ended questions that we wish to examine in more depth. First, we are concerned that the robustness of the JMLE when independence over time does not hold may be because we have introduced serial correlation in a fairly simplistic manner. We will examine DGP 1 under an AR(2) process. This introduces a much more complex model of serial correlation rather than merely more or less persistence. Second, we are concerned that many researchers are attracted to the JMLE because it is able to identify the variance parameters. As we showed in simulation, the estimates are strongly biased under the presence of serial correlation. But in our simulations we have always assumed the presence of random effects. Alternatively, in this simulation we consider what happens when the coefficients are in fact non-random. We also find that the presence of serial correlation can mislead one to believe there are random effects when there are none.
This further illustrates the caution that should be taken when interpreting the variance components from the ME Probit estimator. Finally, there is a growing interest in utilizing a Logit model as an alternative to a Probit model. Therefore we repeat DGP 1 but specify that the latent error is logistically distributed. The analogue of the ME Probit estimator is the ME Logit estimator, for which we employ the command melogit in Stata. As in the Probit case, the random components are still assumed to be normally distributed and integrated out numerically. However, this means there is no good analogue of the pooled approach with correct distributional assumptions since the logistic distribution does not mix well with the normal distribution. Consequently, this section will not focus on the comparison of the JMLE to the PMLE but rather on whether the JMLE is itself consistent under serial correlation in terms of the parameter, variance component, and partial effect estimates. 3.7.1 AR(2) Consider the following AR(2) process in the latent error: εit = 0.6 εit−1 − 0.3 εit−2 + eit (3.32) where eit ∼ N(0, 1 − 0.6² − 0.3²). This means that each error is positively correlated with the first lagged error and negatively correlated (conditional on the first lag) with the second lagged error. With a simple AR(1) process, serial correlation in the latent error appears similar to individual heterogeneity and, as we found in Section 5, does not bias the scaled coefficient or APE estimates. But with a more complex AR(2) process, the correlation over time cannot be as easily mistaken for individual heterogeneity. Table G.22 presents the scaled coefficient estimates. Again, we find no strong bias in the ME Probit estimates even though the joint likelihood is misspecified.
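As a quick check on the dependence structure implied by equation (3.32), the process can be simulated and its first two autocorrelations compared with the Yule-Walker values ρ(1) = 0.6/(1 − (−0.3)) ≈ 0.46 and ρ(2) = 0.6·ρ(1) − 0.3 ≈ −0.02 (a sketch; the seed and series length are arbitrary choices, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
phi1, phi2 = 0.6, -0.3
sigma_e = np.sqrt(1 - 0.6**2 - 0.3**2)  # innovation sd from equation (3.32)

# Simulate one long series (with burn-in) rather than a panel,
# which is enough to read off the autocorrelations.
n, burn = 200_000, 500
e = rng.normal(0.0, sigma_e, size=n + burn)
eps = np.zeros(n + burn)
for t in range(2, n + burn):
    eps[t] = phi1 * eps[t - 1] + phi2 * eps[t - 2] + e[t]
eps = eps[burn:]

def acf(x, k):
    """Sample autocorrelation at lag k."""
    x = x - x.mean()
    return (x[k:] * x[:-k]).sum() / (x * x).sum()

print(round(acf(eps, 1), 3))  # ~ 0.462 (= 0.6 / 1.3)
print(round(acf(eps, 2), 3))  # ~ -0.023: lag-2 dependence is weak
```

Note the distinction: the lag-2 partial autocorrelation is −0.3 (the conditional negative correlation described in the text), while the unconditional lag-2 autocorrelation is close to zero, which is precisely why this dependence cannot be mimicked by an individual-specific effect.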
We do see an increase in bias as the number of time observations increases, which might suggest that the JMLE starts to waver in its capability of addressing a more complex correlation structure as more observations are present. But this holds true for the Pooled Heteroskedastic Probit estimator as well. Turning to the APE estimates in Table G.23, both the ME Probit and the Pooled Heteroskedastic Probit estimates have low bias. We find that there are still fairly substantial efficiency gains from utilizing the JMLE even when the joint likelihood is misspecified. So even when introducing a more complex structure to the serial correlation, we find that the bias in estimating the variance component σ₁² fully captures the consequences of the serial correlation in the latent error. Figures F.22 - F.24 show the empirical distribution of the ME Probit estimates of σ₁². As one would expect, there is an upward bias since overall, the AR(2) process in equation (3.32) induces positive correlation among the time observations. Overall, this further illustrates a possible robustness to serial correlation in the scaled parameter and ASF/APE estimates under JMLE, which should be theoretically investigated in further studies. 3.7.2 No Random Effects Since the JMLE seems able to address serial correlation through the variance component σ₁², it would be interesting to observe what occurs when no random effects are actually present (σ₁² = σ₂² = 0). Tables G.24 and G.25 present the computational results. Since σ₁² = σ₂² = 0 is at the boundary of the valid parameter space, we would expect the ME Probit estimator to struggle. Table G.24 presents the number of convergence failures of the estimator prior to obtaining 1,000 successes. When there is no serial correlation, there are upwards of 700 failures for the ME Probit estimator, but as serial correlation is introduced, the failures reduce dramatically.
This is because the introduction of serial correlation allows for estimates of the variance components away from the boundary. Table G.25 reports the estimation times. Now we see a much stronger contrast between the Pooled Heteroskedastic Probit estimator and the ME Probit estimator, where the ME Probit can take up to 22 times longer. Instead of looking at the estimates of α and β (which follow the trends of all the previous simulations), we will simply note that the APEs, in Table G.26, are well estimated regardless of the estimation procedure used or whether correlated random effects are specified. Since there are no random coefficients, there cannot be correlation between the fixed parameters and the covariates. Consequently, specifying correlated random coefficients does not necessarily hurt the estimators in terms of bias, but it does result in a less efficient estimator as it calls for the inclusion of many irrelevant covariates. We will focus on the estimates of the variance components and whether or not standard LR tests are valid in detecting the presence of random coefficients under serial correlation. Tables G.27-G.28 present the averages and standard deviations of the predicted variance components from ME Probit under specifications (1) and (2). When there is no serial correlation, both the estimates of σ₁² and σ₂² are quite close to zero, which is what we would hope for when there are no random coefficients actually present. As the level of serial correlation increases, the variance component σ₂² remains low while the estimates of σ₁² are increasingly biased upwards. This means that serial correlation can be misinterpreted as individual heterogeneity. This leads us to caution any researcher that would like to make inference on the variance component estimates and use them interpretatively. One would hope that the LR test would be able to reject the model of random coefficients in favour of a more simple non-random coefficient Probit model.
Table G.29 reports the rejection rates at the 5% significance level. We find that under no serial correlation the test performs as expected, but as the serial correlation increases the rejection rates also increase. In fact, with a correlation coefficient of 0.8, we reject in 100% of the simulated samples. The earlier simulation results suggest that the ME Probit estimator is favourable, given that it is more efficient than the pooled approach and there appears to be little bias in the scaled coefficient and APE estimates under misspecification of the joint likelihood. But when there is in fact no random coefficient, we find the ME Probit estimator struggles computationally compared to the Pooled Heteroskedastic Probit estimator. The ME Probit estimator takes much longer to compute and fails to converge at all in many instances. Moreover, this simulation re-emphasizes the caution that should be taken when interpreting the variance components. 3.7.3 Logit This simulation utilizes a logistic distribution for the latent error compared to the normal distribution used in a Probit. Although it seems to be favoured particularly in applied work, the logistic distribution does not easily incorporate a mixed effects framework. The logistic distribution does not mix well with itself nor with the normal distribution. This raises two issues: 1. Assuming that the random coefficients are normally distributed, as is usually done in the Mixed Effects literature, there is no equivalent pooled approach. Specifically, the unobserved components u1i + x1i u2i + εit (3.33) are the sum of two normals and a logistic random variable, whose distribution is generally unknown. Consequently we cannot evaluate the conditional distribution to obtain the contemporaneous conditional mean of yit as we did in equation (3.16) when all the unobserved components are assumed to be normally distributed. 2.
It is unclear how to implement an AR(1) process for the logistic errors since the logistic distribution does not mix well with other logistically distributed random variables. How the AR(1) process is implemented greatly affects how we can approach the estimation problem. We consider two approaches. With the following autoregressive process of order 1, εit = ρ εit−1 + eεit (3.34), we can either aim to ensure that εit|εit−1 is logistically distributed or that the marginal distributions of all the time observations are identically logistically distributed. These two challenges have to be considered when constructing our simulation study. With respect to the first point, in our simulation we of course use the ME Logit estimator, but we also consider the Pooled Heteroskedastic Probit estimator, where imposing normality on the latent error is used as an approximation to what is usually an unknown latent distribution. In addressing the second point, we run the simulations on two different implementations of the serial correlation. In the first case, which we will refer to as a conditional logistic AR(1), we generate the process from the following: εi1 ∼ logistic(0, √3/π), eεit ∼ logistic(0, √(3(1 − ρ²))/π) for t = 2, ..., T (3.35). This means εit|εit−1 is logistically distributed, but since the logistic distribution does not mix with itself, the marginal distributions are not identically distributed over t (although they will all have the same standardized first two moments). The second case, which we will refer to as a marginal logistic AR(1), will be generated from the following distributions: εi1 ∼ logistic(0, √3/π), eεit = log( sin(ρUπ) / sin(ρ(1 − U)π) ) where U ∼ Uniform(0, 1), for t = 2, ..., T (3.36), as proposed in Sim (1993). Now the marginal distributions will be identically logistically distributed. We feel that this gives more credence to a pooled, although misspecified in distribution, approach.
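The conditional logistic AR(1) in (3.35) is straightforward to generate; a minimal sketch (arbitrary seed, ρ, and sample sizes) confirms the property noted above: the marginals for t > 1 are not logistic, but every time period retains mean 0 and variance 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
rho, n_draws, T = 0.8, 200_000, 5
s0 = np.sqrt(3.0) / np.pi                  # logistic scale giving variance 1
s_e = np.sqrt(3.0 * (1 - rho**2)) / np.pi  # innovation scale, variance 1 - rho^2

eps = np.empty((n_draws, T))
eps[:, 0] = rng.logistic(0.0, s0, size=n_draws)
for t in range(1, T):
    # eps_t | eps_{t-1} is logistic by construction (equation (3.35))
    eps[:, t] = rho * eps[:, t - 1] + rng.logistic(0.0, s_e, size=n_draws)

# The marginal distributions differ across t (a sum of logistics is not
# logistic), but the first two moments are preserved for every t.
print(np.round(eps.mean(axis=0), 2))  # all ~ 0.0
print(np.round(eps.var(axis=0), 2))   # all ~ 1.0
```

The variance recursion is immediate: Var(εit) = ρ²·1 + (1 − ρ²) = 1 by induction, since a logistic(0, s) variable has variance s²π²/3.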
Under no serial correlation (ρ = 0) both processes are identical and the ME Logit likelihood is correctly specified. Tables G.30 and G.31 report the de-scaled coefficient estimates. Similar to the Probit case, as the level of serial correlation increases, there is an increasingly positive bias for both AR(1) specifications. Tables G.32 and G.33 report the scaled coefficient estimates for the conditional logistic AR(1) process and the marginal logistic AR(1) process. The ME Logit parameter estimates are scaled by 1/√(π²/3 + σ̂1²), which should match (at least in terms of scaling) the Pooled Heteroskedastic Probit scaled coefficient estimates. As we saw in the Probit case, the bias of the ME Logit estimator is countered by bias in the variance component estimate, σ̂1², which results in unbiased scaled coefficient estimates. In fact, the ME Logit estimator is far superior to the Pooled Heteroskedastic Probit estimator in terms of bias and efficiency. Finally, the APE estimates are presented in Tables G.34 and G.35. As expected, since the marginal logistic AR(1) process produces identically distributed errors, the Pooled Heteroskedastic estimator performs better under it than under the conditional logistic AR(1) data generating process. But in either case the ME Logit estimator has lower bias and is much more efficient than the pooled approach, even though we know that the joint likelihood is misspecified. The results from this simulation suggest that the robustness of the JMLE under serial correlation is not limited to the normal distribution.

3.8 Conclusion

This study has been a comprehensive investigation into the behavior of PMLE and JMLE for panel random coefficient binary response models under serial correlation. After introducing the two estimators and the context in which they are usually implemented, we explored their potential in a diverse simulation study and sought to confirm our results with an application.
Consistent with our initial intuition, there are several points that need to be considered when implementing these estimators. First and foremost, specifying correlated random effects matters enormously whether one considers a PMLE or a JMLE approach. We saw this regardless of the data generating process (DGP1, DGP2, DGP3), the level of serial correlation (ρ = 0, 0.4, 0.8), the type of serial correlation (AR(1) vs AR(2)), or the distribution of the latent error (Probit vs Logit). Our biggest concern is that those who implement Mixed Effects approaches may be swayed by the language into thinking that they can easily model the heterogeneity with such a flexible framework without considering potential correlation with the covariates. Another expected result is that the pooled approach is much quicker to implement, which may matter more for much larger datasets with many more covariates. As we saw in our application, the difference ranged between seconds for the PMLE approach and hours for the JMLE approach. More intriguingly, we find quite a number of surprising results that should change some of the perceptions of JMLE and PMLE. JMLE estimates of the scaled coefficients, ASF, and APE appear to be robust to fairly simple specifications of serial correlation, even though the presence of serial correlation implies that the joint likelihood is misspecified. This is because the bias in the de-scaled parameter estimates is countered by a bias in the variance component estimate. Consequently, interpretation of the de-scaled parameter estimates and variance components is ill-advised when one is concerned there may be serial correlation in the latent error. In simulation, we repeatedly found the JMLE to be the more efficient estimator even under misspecification of the likelihood. There is no theoretical guarantee of this: a pooled estimator could well be more efficient than a misspecified JMLE.
In fact, we do see the PMLE become more efficient than the JMLE in the application, but we were unable to reproduce these results in simulations. Therefore the questions of efficiency, and of the settings in which the PMLE becomes more efficient than the JMLE, remain open and require further investigation. In our discussion of the case of no random effects, it was surprising to see the large number of failures to converge for the ME Probit under no serial correlation, and then the dramatic drop as the level of serial correlation increases. This adds to the computational advantage of the pooled approach. Although it does not exactly model the heterogeneity, it is much more likely to converge when there are no random effects, and at a much faster rate. Should we really care about differentiating between serial correlation and random effects? One could argue that they are both ways for an econometrician to model persistence in the data and that there is no particular reason to prefer one over the other. As we see in the simulation, this idea appears to be consistent with the robustness of the JMLE under serial correlation. But it would warn against strict interpretation of the variance components: in the end they capture the variability of the persistence over time, which is not necessarily equivalent to the true variance of the random effect.

APPENDICES

APPENDIX A

Figures for Chapter 1

Figure A.1: Visual representation of bijective transformations
(a) (b) (c) (d)
Each oval represents the support of either X or Z, the objects inside represent possible realizations in the support, and the lines connecting the realizations represent pairs of realizations that occur in the joint support with positive probability. From left to right: (a) shows a bijective transformation from X to Z and, equivalently, a bijective transformation from Z to X.
There is no variation in one of the random variables that cannot be perfectly described by the variation in the other. (b), (c), and (d) show examples where there is not a bijective transformation. In (b) there is extra variation in X that cannot be explained by Z, and in (c) there is extra variation in Z that cannot be explained by X. The case of an exclusion restriction is presented in (d). Imagine there is an element of Z that is excluded from X and that can take on 3 values. Then for every point in the support of X there are 3 possible realizations in Z that occur with positive probability.

Figure A.2: Parameter estimates from two observationally equivalent models
From top to bottom: intercept estimates where the true values are 0 (Specification 1) or 0.5 (Specification 2), coefficient estimates where the true values are 0.5 (Specification 1) or 0 (Specification 2), and the heteroskedastic coefficient estimates where the true values are 2 (Specification 1) or 1 (Specification 2). 2,000 simulations of sample size 1,000 using the Stata command hetprobit.

APPENDIX B

Proofs and Notation for Chapter 2

Identification

The following lemma is an extension of Corollary 1.4.1 given in Chapter 1 to allow for multivariate X and Z.

Lemma B.1. Let X and Z be vectors of random variables with continuous support. Suppose the following conditions hold:
(i) Z does not contain a constant.
(ii) E(X′X) is non-singular.
(iii) E(Z′Z) is non-singular.
(iv) βo is non-zero.
(v) Each element of Z is a polynomial function of an element of X, such that Zj = Xk^(pjk), where K is the dimension of X and pjk ∈ {1, 2, 3, ...} is the order of the polynomial on the kth term in X that composes the jth term in Z.
Then for all parameters (β, δ) ∈ Θ (the parameter space), if X(βo − exp(Z(δ − δo))β) = 0 with probability 1, then (β, δ) = (βo, δo).

Proof. Suppose there is a (β, δ) ∈ Θ such that X(βo − exp(Z(δ − δo))β) = 0 with probability 1.
Then, given condition (iv), I can rearrange this equality as

Z(δ − δo) = ln( Xβo / (Xβ) )   (B.2)

Let A denote the set of k for which some element of Z is a polynomial in Xk, and for each k ∈ A let p̃k denote the maximum polynomial order on Xk across the elements of Z. Then for each k ∈ A, take the partial derivative with respect to Xk, p̃k + 1 times:

0 = (−1)^(p̃k+1) [ (βko/(Xβo))^(p̃k+1) − (βk/(Xβ))^(p̃k+1) ]   (B.3)

which implies βko/(Xβo) = βk/(Xβ). There are two cases: either βk = βko = 0 for all k ∈ A, or there exists at least one k̂ ∈ A such that βk̂ ≠ 0 and βk̂o ≠ 0. In the first case, the problem reduces to the scenario in which X and Z are not functionally related, and Theorem 1.2.1 of Chapter 1 can be applied to obtain identification. In the second case, equation (B.3) implies βk̂o/βk̂ = Xβo/(Xβ), and plugging this into equation (B.2),

Z(δ − δo) = ln( βk̂o / βk̂ )   (B.4)

the right hand side of which is a constant. By conditions (i) and (iii), equation (B.4) can only hold if δ − δo = 0, and by condition (ii) this implies βo = β.

Proof of Theorem 2.4.1: Using Lemma B.1: parts (i)–(iii) of Assumption 2.4.1 ensure that

E( (xi, h(v2i, zi))′ (xi, h(v2i, zi)) )   (B.5)

is non-singular (shown in the paper). Part (iv) restricts how the heteroskedastic function may be specified, to avoid non-identification due to the non-linear setting, and corresponds to conditions (i), (iii), and (v) of Lemma B.1. Part (v) ensures identification of the heteroskedastic components and corresponds to condition (iv) of Lemma B.1. Applying Lemma B.1, identification follows.
Asymptotics for the Parametric Estimator

Proof of Theorem 2.5.1: Using Theorem 2.6 of Newey and McFadden (1994), since there is no weighting matrix, I merely need to show the following:
(i) E(M(y1i, y2i, zi; π, θ)) = 0 only if π = πo and θ = θo;
(ii) πo ∈ Π and θo ∈ Θ, both of which are compact;
(iii) E(M(y1i, y2i, zi; π, θ)) is continuous at each π ∈ Π and each θ ∈ Θ;
(iv) E(sup_(π,θ)∈Π×Θ ||M(y1i, y2i, zi; π, θ)||) < ∞.
Identification, part (i), holds under Assumption 2.4.1. Part (ii) is assumed. Part (iii) is evident given the linear LS and Probit specifications, and part (iv) is satisfied given the finite second moment conditions in Assumption 2.4.1; in more detail,

||M(y1i, y2i, zi; π, θ)||
≤ ||(y2i − m(zi)π)m(zi)|| + ||Si(π, β, γ, δ)||
≤ ||(y2i − m(zi)π)m(zi)|| + [ |(y1i − Φi(π, θ))φi(π, θ)| / (Φi(π, θ)(1 − Φi(π, θ)) exp(giδ)) ] × (||xi|| + ||hi(π)|| + ||xiβ + hi(π)γ|| ||gi||)
≤ ||(y2i − m(zi)π)m(zi)|| + [ max(λi(π, β, γ, δ), λi(π, −β, −γ, δ)) / exp(giδ) ] × (||xi|| + ||hi(π)|| + ||xiβ + hi(π)γ|| ||gi||)
≤ ||(y2i − m(zi)π)m(zi)|| + (C / exp(giδ)) ( 1 + |xiβ + hi(π)γ| / exp(giδ) ) × (||xi|| + ||hi(π)|| + ||xiβ + hi(π)γ|| ||gi||)

where λi(·) is the inverse Mills ratio, C is a finite constant, and, notationally, hi(π) = h(y2i − m(zi)π, zi) and gi = g(y2i, zi). Therefore E(sup_(π,θ)∈Π×Θ ||M(y1i, y2i, zi; π, θ)||) is finite as long as the second moments of m(zi), xi, hi(π), and gi are bounded (which is presumed under Assumption 2.4.1).

Proof of Theorem 2.5.2: Using Theorem 6.1 of Newey and McFadden (1994), I merely need to show the following:
(i) πo ∈ int(Π) and θo ∈ int(Θ), both of which are compact;
(ii) M(y1i, y2i, zi; π, θ) is continuously differentiable in a neighborhood of (πo, θo) with probability approaching one;
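The dominance argument above relies on the inverse Mills ratio growing at most linearly in its argument, λ(a) ≤ C(1 + |a|). A quick numerical check of this envelope (a sketch; taking C = 1, which suffices for the standard normal):

```python
import math


def phi(a):
    """Standard normal pdf."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def Phi(a):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def inv_mills(a):
    """Inverse Mills ratio lambda(a) = phi(a)/Phi(a)."""
    return phi(a) / Phi(a)


# lambda(a) <= 1 + |a| for all a, hence max(lambda(a), lambda(-a)) <= 1 + |a|,
# the linear envelope used to dominate the Probit score.
grid = [x / 100.0 for x in range(-600, 601)]
gap = max(inv_mills(a) - (1.0 + abs(a)) for a in grid)  # negative everywhere
```

The gap stays strictly negative on the grid, consistent with the bound invoked in the proof.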
(iii) E(M(y1i, y2i, zi; πo, θo)) = 0;
(iv) E(||M(y1i, y2i, zi; πo, θo)||²) is finite;
(v) E(sup_(π,θ)∈Π×Θ ||∇(π,θ) M(y1i, y2i, zi; π, θ)||) < ∞;
(vi) G′G is non-singular, where G = E(∇(π,θ) M(y1i, y2i, zi; πo, θo)).
Part (i) is assumed and part (ii) is evident given the linear LS and Probit specifications. Part (iii) holds by Assumption 2.3.1 (correct conditional mean specification in the first stage and Fisher consistency in the second stage). Part (iv) can be verified:

||M(y1i, y2i, zi; πo, θo)||² = ||(y2i − m(zi)πo)m(zi)||² + ||Si(πo, βo, γo, δo)||²
= ||(y2i − m(zi)πo)m(zi)||² + [ (y1i − Φi(πo, θo))² φi(πo, θo)² / (Φi(πo, θo)²(1 − Φi(πo, θo))² exp(2giδo)) ] × (||xi||² + ||hi(πo)||² + ||xiβo + hi(πo)γo||² ||gi||²)

Applying the law of iterated expectations,

E(||M(y1i, y2i, zi; πo, θo)||² | zi, y2i)
= ||(y2i − m(zi)πo)m(zi)||² + [ E((y1i − Φi(πo, θo))² | zi, y2i) φi(πo, θo)² / (Φi(πo, θo)²(1 − Φi(πo, θo))² exp(2giδo)) ] × (||xi||² + ||hi(πo)||² + ||xiβo + hi(πo)γo||² ||gi||²)
= ||(y2i − m(zi)πo)m(zi)||² + [ Φi(πo, θo)(1 − Φi(πo, θo)) φi(πo, θo)² / (Φi(πo, θo)²(1 − Φi(πo, θo))² exp(2giδo)) ] × (||xi||² + ||hi(πo)||² + ||xiβo + hi(πo)γo||² ||gi||²)
= ||(y2i − m(zi)πo)m(zi)||² + [ λi(πo, βo, γo, δo) λi(πo, −βo, −γo, δo) / exp(2giδo) ] × (||xi||² + ||hi(πo)||² + ||xiβo + hi(πo)γo||² ||gi||²)

Since λi(πo, βo, γo, δo)λi(πo, −βo, −γo, δo) is bounded, taking the expectation of the above equation shows that E(||M(y1i, y2i, zi; πo, θo)||²) is finite as long as the second moments of m(zi), xi, hi(π), and gi are bounded (which is presumed under Assumption 2.4.1). Part (v) follows from the boundedness of the first derivative of the inverse Mills ratio and the finite second moments of m(zi), xi, hi(π), and gi.
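The last substitution uses the algebraic identity λ(a)λ(−a) = φ(a)² / (Φ(a)(1 − Φ(a))), which holds because φ is even and Φ(−a) = 1 − Φ(a). A quick numerical check (a sketch using the standard library):

```python
import math


def phi(a):
    """Standard normal pdf."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def Phi(a):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def mills_product(a):
    """lambda(a) * lambda(-a), with lambda(a) = phi(a)/Phi(a)."""
    return (phi(a) / Phi(a)) * (phi(-a) / Phi(-a))


def phi_ratio(a):
    """phi(a)^2 / (Phi(a) * (1 - Phi(a)))."""
    return phi(a) ** 2 / (Phi(a) * (1.0 - Phi(a)))


grid = [x / 50.0 for x in range(-250, 251)]
max_diff = max(abs(mills_product(a) - phi_ratio(a)) for a in grid)
```

The two expressions agree to machine precision over the grid, and at a = 0 both equal 2/π.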
In showing (vi), let G = (Gπ, Gθ) where

Gπ = [ G1π ; G2π ] = [ E(m(zi)′m(zi)) ; E(Γi(πo, θo) ωi(πo, θo)′ m(zi)) ]
Gθ = [ G1θ ; G2θ ] = [ 0 ; −E(Δi(πo, θo) ωi(πo, θo)′ ωi(πo, θo)) ]

with

Γi(πo, θo) = (∂hi(πo)/∂v2) γo Δi(πo, θo)
Δi(πo, θo) = φi(πo, θo)² / (Φi(πo, θo)(1 − Φi(πo, θo)) exp(2giδo))
ωi(πo, θo) = ( xi, hi(πo), −(xiβo + hi(πo)γo)gi )

Then

G′G = [ Gπ′Gπ  Gπ′Gθ ; Gθ′Gπ  Gθ′Gθ ]

where

Gπ′Gπ = E(m(zi)′m(zi))E(m(zi)′m(zi)) + E(Γi(πo, θo)m(zi)′ωi(πo, θo))E(Γi(πo, θo)ωi(πo, θo)′m(zi))
Gπ′Gθ = −E(Γi(πo, θo)m(zi)′ωi(πo, θo))E(Δi(πo, θo)ωi(πo, θo)′ωi(πo, θo))
Gθ′Gθ = E(Δi(πo, θo)ωi(πo, θo)′ωi(πo, θo))E(Δi(πo, θo)ωi(πo, θo)′ωi(πo, θo))

which is non-singular by Assumption 2.4.1.

Identification and Asymptotics for the Semi-Parametric Estimator

Proof of Theorem 2.7.1: Write u1i = ho(zi, v2i) + εi where Med(εi|zi, v2i) = 0. Plugging into equation (2.1) and redefining x̃i = (xi, ho(zi, v2i)) and β̃o′ = (βo′, 1), one can apply Theorem 2.1 of Khan (2013) to

yi = 1{x̃i β̃o + εi ≥ 0}   (B.6)

and obtain the observational equivalence result.

Proof of Theorem 2.7.2: Identification of mo(·) in the first stage is immediate from part (i) of Assumption 2.7.6. For identification of the second stage parameters and functions, suppose there are β−k ∈ B, h(·) ∈ H, and g(·) ∈ G such that (β−k, h(·), g(·)) ≠ (β−ko, ho(·), go(·)) and

( x−kiβ−k + xki + h(v2i, zi) ) / exp(g(y2i, zi)) = ( x−kiβ−ko + xki + ho(v2i, zi) ) / exp(go(y2i, zi))   (B.7)

with probability 1. By Assumption 2.7.6 (iii), xki conditional on x−ki and h(v2i, zi) has a density with respect to the Lebesgue measure that is positive on ℝ for any h(·) ∈ H.
So for any realization of (x−ki, v2i, zi), there exists an xki such that

x−kiβ−ko + xki + ho(v2i, zi) > 0 and x−kiβ−k + xki + h(v2i, zi) < 0   (B.8)

and since the scaling by exp(g(y2i, zi)) is always positive for any g(·) ∈ G, this is a contradiction. Separate identification of βo and ho(v2i, zi) is maintained by part (ii) and the CMR (part (iv)) of Assumption 2.7.6.

Before providing the remaining proofs from Section 2.7.2, I briefly outline some notation. Let a = (a1, ..., ak) be a 1 × k vector of non-negative integers; then the |a|-th derivative of a function f : ℝ^k → ℝ is defined as

∇^a f(x) = ∂^|a| f(x) / (∂x1^a1 ⋯ ∂xk^ak)   (B.9)

where |a| = Σ_(i=1)^k ai. For any s > 0, let [s] denote the largest integer smaller than s. Define the s-th Hölder norm, ||·||Λs, as

||f||Λs = Σ_(|a|≤[s]) sup_(x∈X) |∇^a f(x)| + Σ_(|a|=[s]) sup_(x≠x̄) |∇^a f(x) − ∇^a f(x̄)| / ||x − x̄||^(s−[s])   (B.10)

where ||·|| denotes the Euclidean norm. Define a Hölder space with smoothness s as

Λ^s(X) = {f ∈ C^[s](X) : ||f||Λs < ∞}   (B.11)

where C^r(X) is the set of continuous functions on X that have continuous first r derivatives.
Define a weighted Hölder ball with radius c, smoothness s, and weight function (1 + ||·||²)^(−w/2) with w > 0,

Λ^s_c(X, w) = {f ∈ Λ^s(X) : ||f(·)(1 + ||·||²)^(−w/2)||Λs ≤ c < ∞}   (B.12)

Finally, define the following two norms:

||f(x)||2 = ( ∫_X f(x)² dFx )^(1/2)   (B.13)
||f(x)||∞,w = sup_(x∈X) |f(x)(1 + ||x||²)^(−w/2)|   (B.14)

Proof of Corollary 2.7.1: Write

ψ̂i = φ( (xiβ̂ + ĥ(v̂2i, zi)) / exp(ĝ(y2i, zi)) ) β̂j / exp(ĝ(y2i, zi))
ψi = φ( (xiβo + ho(v2i, zi)) / exp(go(y2i, zi)) ) βjo / exp(go(y2i, zi))

By the triangle inequality,

| n⁻¹ Σ_(i=1)^n ψ̂i − E(ψi) | ≤ | n⁻¹ Σ_(i=1)^n (ψ̂i − ψi) | + | n⁻¹ Σ_(i=1)^n ψi − E(ψi) |   (B.15)

The first term is op(1) by the results of Theorem 2.7.3 (consistency in the ||·||∞,w1 norm), and the second term is op(1) by the Weak Law of Large Numbers, noting that Var(ψi) is bounded.

APPENDIX C

Simulation Details for Chapter 2

General Control Function in the Demand for Premium Cable

First I construct the distribution of markets and operators, in which there is one operator in each market (but operators serve multiple markets). The market ID is assigned using a truncated (from 0 to 172) exponential distribution with mean 40, so that the higher the market ID (rounded up to the nearest integer), the smaller the market size.¹ I only allow for two operators, to mimic the competition between Time Warner and AT&T (assigned with equal probability to each market). The product characteristics, number of premium channels offered (z11m) and cost shifter (z2m), are the same within a market. The number of premium channels is drawn from a truncated (0 to 10) Normal(4.5, 6.25). The cost shifter is a function of the quality (number of channels) and the operator (efficient/inefficient operator):

costm = 10 + 5om + 2numchm + ε1m

where om ∈ {0, 1} is the operator in the market and ε1m ∼ Normal(0, 1) is a market level cost shock. The endogenous variable price is constructed as a function of the number of channels, cost, and unobserved quality (v2m):

pm = 7.5 + 2numchm + costm + v2m   (C.1)

where v2m is drawn from a Uniform(−8, 8). Then to construct the consumer characteristics (z12i), I draw e1, e2 from independent standard normal distributions; age and income

¹The market identifiers could be constructed any way; this was just so there was a good range in market size.
are constructed as follows:

agei = ⌈20 + 40Φ(0.4e1 − 0.1(e1² − 1) + √0.85 e2)⌉

d1i = 1{Φ(e1) < 0.196}
d2i = 1{0.196 ≤ Φ(e1) < 0.44}
d3i = 1{0.44 ≤ Φ(e1) < 0.685}
d4i = 1{0.685 ≤ Φ(e1) < 0.86}
d5i = 1{Φ(e1) ≥ 0.86}

incomei = Σ_(g=1)^5 dgi incg

where

inc1 ∼ Uniform(10, 25)
inc2 ∼ Uniform(25, 50)
inc3 ∼ Uniform(50, 75)
inc4 ∼ Uniform(75, 100)
inc5 ∼ 20 Exp(1) + 100

Consequently, age and income are positively correlated. The last consumer characteristic, household size, is constructed as the following function of age and income:

hhsi = ⌈exp(−0.75 + 0.0015 incomei + 0.03 agei + ε2i)⌉

where ε2i is drawn from a truncated (−1 to 1) Normal(0, 0.45). In violation of CF-CI, the conditional distribution of u1i is

u1i | z11m, z12i, z2m, v2m ∼ Normal(0.32v2m + 0.15v2m × numchm − 0.02v2m × agei, 1)   (C.2)

So there is no heteroskedasticity in the latent error, but the unobserved product attribute (advertisement) has an interactive effect with the number of channels (the addition of more channels matters more if it was advertised) and with age (younger consumers may be more susceptible to advertisement). Finally, the binary dependent variable y1i is calculated from

y1i = 1{ −9.8 − 0.14pm + 0.017pmd2i + 0.03pmd3i + 0.035pmd4i + 0.045pmd5i + 0.01numchm + 0.005incomei + 0.03hhsi + 0.005agei + 0.006agei² + u1i > 0 }   (C.3)

The summary statistics are presented in Table E.1. Similar to the real data in PT, conditional on choosing cable, about 1/3 of the sample selects premium cable.

ASF Estimates for the Effect of Income on Homeownership

In the construction of the exogenous variables, first z11i (age), z21i and z22i (education of wife) are determined (z11i independent of z21i and z22i; z21i and z22i are mutually exclusive), and then z12i (children in household) and z23i (wife working) are functions of the other exogenous variables to induce correlation.
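As an aside, the CF-CI violation built into equation (C.2) of the premium cable design can be verified by simulation: the latent error is homoskedastic, but its conditional mean depends on v2 interacted with the covariates, which a least squares regression of u1 on those interactions recovers. A minimal sketch assuming NumPy, with simplified stand-in draws for the covariates rather than the full design above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

numch = rng.integers(0, 11, size=n).astype(float)  # stand-in for channels offered
age = rng.uniform(20.0, 60.0, size=n)              # stand-in for consumer age
v2 = rng.uniform(-8.0, 8.0, size=n)                # unobserved quality, as in the text

# Equation (C.2): homoskedastic latent error whose conditional MEAN depends on
# v2 interacted with numch and age -- exactly the pattern CF-CI rules out.
u1 = rng.normal(0.32 * v2 + 0.15 * v2 * numch - 0.02 * v2 * age, 1.0)

# Least squares of u1 on (1, v2, v2*numch, v2*age) recovers (0.32, 0.15, -0.02).
X = np.column_stack([np.ones(n), v2, v2 * numch, v2 * age])
coef, *_ = np.linalg.lstsq(X, u1, rcond=None)
```

A control function approach that includes only v2 (no interactions) would therefore misspecify E(u1 | v2, z), which is what motivates the general control function.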
Since the sample consists of 981 married men aged 30 to 50, z11i is drawn from a Normal(41.8, 60) truncated below at 30 and above at 50. Let e be drawn from a Uniform(0, 1); then the education of the wife is determined by

z21i = 1 if 0.482 < e ≤ 0.897, and 0 otherwise
z22i = 1 if e > 0.897, and 0 otherwise

Since it seems reasonable for having young children in the household to be negatively correlated with age and with higher education, z12i is calculated as

z12i = 1{261.2 − 5z11i − 20z21i − 50z22i + ε12i > 0}   (C.4)

where ε12i is drawn from a Normal(0, 30). Since it seems reasonable that the probability of a wife working is negatively correlated with having young children in the household and with age, but positively correlated with higher education, z23i is calculated as

z23i = 1{17.6 − 0.3z11i − 5z12i + 3z21i + 10z22i + ε23i > 0}   (C.5)

where ε23i is drawn from a Normal(0, 5). The conditional mean of y2i is

y2i = 7.2 + 0.0117z11i + 0.0911z12i + 0.0642z21i + 0.1291z22i + 0.0911z23i + v2i

where v2i is drawn from a Normal(0, 0.088). The linear index is

xiβo = 3.8y2i + 0.09z11i + z12i   (C.6)

so the conditional distribution of u1i is only a function of the linear index and v2i,

u1i | v2i, z11i, z12i, z21i, z22i, z23i ∼ Normal( −2v2i − 2v2i xiβo, exp(2(0.01 xiβo)) )

and the binary dependent variable is calculated from

y1i = 1{−34 + xiβo + u1i > 0}   (C.7)

Table E.4 presents the summary statistics of the simulated data as well as the summary statistics from Rothe (2009) as a comparison. The SML estimator proposed in Rothe (2009) maximizes the following log likelihood:

β̂SML = arg max_β Σ_(i=1)^n [ y1i log( Ĝ(xiβ, v̂2i) ) + (1 − y1i) log( 1 − Ĝ(xiβ, v̂2i) ) ]   (C.8)

where Ĝ(xiβ, v̂2i) = Φ( Σ_(j=1)^n Kh([xjβ, v̂2j] − [xiβ, v̂2i]) y1j / Σ_(j=1)^n Kh([xjβ, v̂2j] − [xiβ, v̂2i]) ) and Kh(·) is a bivariate kernel based on bandwidth h and scaled by that bandwidth.
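The smoothing step inside the SML objective (C.8) is a Nadaraya–Watson estimate of P(y1 = 1 | xβ, v̂2), which is then passed through Φ. A sketch of that inner step with a Gaussian product kernel (the function names and the simplified leave-in weighting are ours):

```python
import numpy as np


def nw_prob(index, v2hat, y, h1, h2):
    """Nadaraya-Watson estimate of P(y=1 | index, v2hat) at each sample point,
    using a bivariate Gaussian product kernel with bandwidths (h1, h2)."""
    n = len(y)
    g = np.empty(n)
    for i in range(n):
        w = (np.exp(-0.5 * ((index - index[i]) / h1) ** 2)
             * np.exp(-0.5 * ((v2hat - v2hat[i]) / h2) ** 2))
        g[i] = (w @ y) / w.sum()  # weighted average of the 0/1 outcomes
    return g


def rule_of_thumb_h(x):
    """Bandwidth used in the text: h = 1.06 * sd(x) * n^(-1/5)."""
    return 1.06 * np.std(x) * len(x) ** (-0.2)
```

In (C.8) these fitted values are additionally transformed by Φ before entering the log likelihood; the sketch above only illustrates the kernel-weighted probability estimate itself.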
In order to eliminate the asymptotic bias, the SML estimator requires the use of higher order kernels. However, Rothe finds that lower order kernels tend to perform better in finite samples; therefore I use a first order Gaussian kernel. The normal CDF transformation ensures the estimates fall between 0 and 1 and imposes the correct distribution as a transformation; I find that this helps the parameter estimates. In determining optimal bandwidths, Rothe suggests maximizing the above likelihood with respect to both the parameters β and the bandwidths h. I find that this can result in a number of extreme outliers that corrupt the analysis. Therefore I equate the bandwidths to the optimal value (given the distribution is truly normal) as a function of the parameters, h = 1.06 √(Var([xiβ, v̂2i])) n^(−1/5). This is then plugged into the likelihood, which is maximized with respect to the parameters β. The proposed Het Probit (GCF) estimator maximizes the following likelihood:

(β̂, γ̂, δ̂)HetProbit(GCF) = arg max_(β,γ,δ) Σ_(i=1)^n [ y1i log( G(xi, v̂2i; β, γ, δ) ) + (1 − y1i) log( 1 − G(xi, v̂2i; β, γ, δ) ) ]   (C.9)

where G(xi, v̂2i; β, γ, δ) = Φ( (xiβ + v̂2iγ1 + v̂2i xiγ2) / exp(xiδ) ). I found that the estimates are sensitive to the starting values; therefore I used [1, 0.5, 0.75, 1.5, 2] × (βo, γo, δo) as the starting values and chose the estimates with the largest log likelihood.

APPENDIX D

Figures for Chapter 2

Figure D.1: Effect of Heteroskedasticity on Parameter Estimate
Shows the empirical distribution of estimates of β in the model y1i = 1{y2iβ + u1i > 0} and y2i = zi + v2i. The unobserved heterogeneity is generated from a heteroskedastic bivariate normal as in equation (2.6), where ρ(zi) = 0.6 and σ1(zi) = σ2(zi) = exp(0.25zi). The Homoskedastic estimator assumes the data generating process in equation (2.2), while the Heteroskedastic estimator correctly scales by the true conditional variance. Calculated from 1,000 simulations of sample size 1,000.
General Control Function in the Demand for Premium Cable

Figure D.2: ASF for Income equal to $85,000
ASF evaluated over different prices for an additional 5 channels of premium cable offered to a consumer who is 35 years old in a family of 3 with income equal to $85,000.

ASF Estimates for the Effect of Income on Homeownership

Figure D.3: ASF Estimates for Misspecified Models
ASF evaluated over different levels of log(total income) for a 40 year old with children under the age of 16 in the household.

Figure D.4: Consequence of CF-LI Assumption on ASF Estimates
ASF evaluated over different levels of log(total income) for a 40 year old with children under the age of 16 in the household.

Empirical Example

Figure D.5: Comparison of ASF for Families with No Children
1991 CPS data on married women's labor force participation. ASF evaluated over different levels of non-wife income for a family with no children in the household.

Figure D.6: Comparison of ASF for Families with Young Children Only
1991 CPS data on married women's labor force participation. ASF evaluated over different levels of non-wife income for a family with only young (under 6) children in the household.

Figure D.7: Comparison of ASF for Families with Old Children Only
1991 CPS data on married women's labor force participation. ASF evaluated over different levels of non-wife income for a family with only old (over 6) children in the household.

Figure D.8: Comparison of ASF for Families with Both Young and Old Children
1991 CPS data on married women's labor force participation. ASF evaluated over different levels of non-wife income for a family with both old (over 6) and young (under 6) children in the household.

Extension: Semi-Parametric Distribution Free Estimator

Figure D.9: Logistic Distribution (h1o = v2i)
1,000 simulations of sample size 1,000.

Figure D.10: Uniform Distribution (h1o = v2i)
1,000 simulations of sample size 1,000.
Figure D.11: Student T Distribution (h1o = v2i)
1,000 simulations of sample size 1,000.

Figure D.12: Gaussian Mixture Distribution (h1o = v2i)
1,000 simulations of sample size 1,000.

Figure D.13: Logistic Distribution with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.14: Uniform Distribution with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.15: Student T Distribution with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.16: Gaussian Mixture Distribution with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.17: Logistic with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

Figure D.18: Uniform with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

Figure D.19: Student T with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

Figure D.20: Gaussian Mixture with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

Figure D.21: Heteroskedastic Logistic (h1o = v2i)
1,000 simulations of sample size 1,000.

Figure D.22: Heteroskedastic Logistic with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.23: Heteroskedastic Logistic with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

APPENDIX E

Tables for Chapter 2

General Control Function in the Demand for Premium Cable

Table E.1: Summary Statistics

Variable                            Mean     Std. dev.
Premium Cable          (y1im)       0.329      0.470
Age                    (z12i)      40.598     11.513
Income (in thousands)  (z12i)      60.019     34.782
Family Size            (z12i)       2.596      1.387
Price                  (y2m)       40.691     11.594
Number of Channels     (z11m)       5.139      2.420
Cost                   (z2m)       22.912      6.086

1,000 simulations of sample size 7,677.
Table E.2: Comparison of Logit Parameter Estimates

Variables                  Logit (1)       Logit (CV) (2)  Logit (GCF) (3)  Logit (Over) (4)  TRUE
Price                      -0.032 (0.011)  -0.109 (0.018)  -0.141 (0.021)   -0.141 (0.021)    -0.14
Price × Income Group 2      0.015 (0.005)   0.015 (0.005)   0.017 (0.006)    0.017 (0.006)     0.017
Price × Income Group 3      0.026 (0.006)   0.026 (0.006)   0.030 (0.007)    0.030 (0.007)     0.03
Price × Income Group 4      0.031 (0.008)   0.031 (0.008)   0.035 (0.008)    0.035 (0.009)     0.035
Price × Income Group 5      0.040 (0.010)   0.040 (0.010)   0.045 (0.011)    0.045 (0.011)     0.045
Number of Channels         -0.231 (0.046)   0.094 (0.076)   0.011 (0.091)    0.011 (0.091)     0.01
Income                      0.002 (0.004)   0.002 (0.004)   0.005 (0.004)    0.005 (0.004)     0.005
Household Size              0.024 (0.038)   0.024 (0.038)   0.031 (0.044)    0.031 (0.044)     0.03
Age                         0.077 (0.114)   0.073 (0.115)   0.025 (0.147)    0.024 (0.147)     0.005
Age²                        0.004 (0.001)   0.004 (0.001)   0.006 (0.002)    0.006 (0.002)     0.006

1,000 simulations of sample size 7,677; standard deviations are given in parentheses. Logit (CV) only includes the control variable v2i to address the issue of endogeneity, Logit (GCF) uses the correct specification that allows for a general control function, and Logit (Over) over-specifies the control function by including terms that are not in the true specification.

Table E.3: Comparison of Price Elasticity Estimates

Estimator       Mean     Std. dev.
OLS              0.485    12.009
CF               0.082    55.989
Logit           -0.386     0.267
Logit (CV)      -2.536     0.489
Logit (GCF)     -3.348     0.571
Logit (Over)    -3.350     0.571
TRUE            -3.320

1,000 simulations of sample size 7,677.

ASF Estimates for the Effect of Income on Homeownership

Table E.4: Comparison of Summary Statistics

                                        Rothe                Simulated Data
Variable                          Mean     Std. dev.     Mean     Std. dev.
Homeowner              (y1)       0.599    0.490         0.608    0.488
log(total income)      (y2)       7.853    0.324         7.857    0.316
Age                    (z11)     40.613    5.374        40.633    5.330
Children in HH         (z12)      0.848    0.359         0.851    0.356
Education of Wife
  Intermediate Degree  (z21)      0.415    0.493         0.422    0.494
  High Degree          (z22)      0.103    0.304         0.111    0.314
Wife Working           (z23)      0.699    0.459         0.689    0.463

1,000 simulations of sample size 981.

Table E.5: Comparison of Parameter Estimates

Rothe:
Variables               RF (1)           Probit (2)       Probit (CV) (3)   SML (4)
log(Income)   (y2)                       2.1343 (0.5571)   4.7923 (1.5135)  3.8533 (1.3338)
Age           (z11)     0.0117 (0.0117)  0.2076 (0.0257)   0.0863 (0.0209)  0.0982 (0.0889)
Child         (z12)     0.0911 (0.0194)  1                 1                1
CV            (v̂2)                                        -3.0348 (1.3048)
Ed. of Wife
  Intm.       (z21)     0.0642 (0.0185)
  High        (z22)     0.1291 (0.0298)
Wife Emp      (z23)     0.0911 (0.0194)
R²                      0.1072

Simulated Data:
Variables               RF (5)           Probit (6)       Probit (CV) (7)   SML (8)          Het-Probit (GCF) (9)
log(Income)   (y2)                       1.3789 (0.0122)   4.1969 (0.0725)  3.9605 (0.0237)  3.9202 (0.0511)
Age           (z11)     0.0117 (0.0001)  0.1084 (0.0008)   0.0971 (0.0010)  0.0925 (0.0007)  0.0907 (0.0008)
Child         (z12)     0.0911 (0.0009)  1                 1                1                1
CV            (v̂2)                                        -2.6510 (0.0600)
Ed. of Wife
  Intm.       (z21)     0.0646 (0.0008)
  High        (z22)     0.1286 (0.0012)
Wife Emp      (z23)     0.0914 (0.0008)
R²                      0.1252

1,000 simulations of sample size 981. Standard errors (for Rothe) and standard deviations (for Simulated Data) are given in parentheses. RF reports the reduced form first stage estimates; Probit does not address endogeneity at all; Probit (CV) is the Rivers and Vuong (1988) estimator, a Probit model that includes only the control variable v̂2i as an additional covariate to address endogeneity; SML is the estimator proposed in Rothe (2009); Het-Probit (GCF) is the proposed heteroskedastic Probit with a flexible control function. Since coefficients are only identified up to scale, the coefficients in columns (2)-(4) and (6)-(9) are normalized so the coefficient on Children in HH is 1, allowing comparisons across the different specifications. True values of the coefficients on log(Income) and Age are 3.80 and 0.09, respectively.
Table E.6: APE Results and Simulated Distribution (True APE = 0.6448)

Specification         Mean     SD       10%      25%      50%      75%      90%
Het-Probit (GCF)      0.6260   0.0034   0.4839   0.5603   0.6350   0.7017   0.7528
SML (Sieve)           0.6996   0.0025   0.5976   0.6475   0.6972   0.7512   0.8035
Probit                0.2851   0.0015   0.2247   0.2546   0.2858   0.3170   0.3434
Probit (CV)           0.5802   0.0037   0.4274   0.5098   0.5922   0.6643   0.7184
Lin. Prob. (OLS)      0.3117   0.0016   0.2462   0.2795   0.3122   0.3451   0.3739
Lin. Prob. (2SLS)     0.6960   0.0062   0.4553   0.5620   0.6886   0.8213   0.9475

1,000 simulations of sample size 981.

Empirical Example

Table E.7: Summary Statistics

Variables                         Mean     Std. Dev.   Mean (If Employed=0)   Mean (If Employed=1)
Employed (y1)                     0.583    0.493
Non-Wife Inc ($1000) (y2)        30.269   27.212        34.771                 27.053
Education (z11)                  12.984    2.615        12.395                 13.405
Experience (z12)                 20.444   10.445        22.080                 19.274
Has kids (age<6) (z13)            0.279    0.449         0.324                  0.247
Has kids (age≥6) (z14)            0.308    0.462         0.259                  0.342
Husband's Education (z2)         13.148    2.977        12.811                 13.388
Observations                      5,634                  2,348                  3,286

1991 CPS data on married women's labor force participation.

Table E.8: Coefficient Estimates for Married Women's LFP Probit Het Probit Probit Het Probit (2) (3) -0.071 (0.011) -0.058 (0.016) -0.005 (0.014) 1 -0.168 (0.028) -2.782 (0.795) 1.102 (0.665) -0.071 (0.010) -0.034 (0.015) -0.010 (0.011) 1 -0.134 (0.036) -2.870 (0.813) 1.337 (0.614) (CV) (4) -0.024 (0.029) -0.069 (0.021) -0.008 (0.017) 1 -0.224 (0.052) -3.432 (1.015) 1.036 (0.770) (CV) (5) -0.011 (0.026) -0.045 (0.019) -0.017 (0.015) 1 -0.208 (0.047) -3.886 (0.996) 1.322 (0.774) Probit (GCF) (6) -0.073 (1.873) 0.010 (3.128) 0.068 (3.825) 1 -0.221 (2.946) -5.693 (363.525) -1.274 (484.991) Het Probit (GCF) (7) -0.061 (0.025) 0.030 (0.037) 0.065 (0.035) 1 -0.223 (0.062) -6.263 (1.816) -1.196 (1.142) SML (Sieve) (8) -0.072 (0.004) 0.031 (0.003) 0.059 (0.007) 1 -0.223 (0.011) -6.031 (0.259) -1.306 (0.117) Variables Non-wife Income RF (1) Non-wife Income × Has Kids (Age<6) × Has Kids (Age≥6) Non-wife Income Education Experience Has
Kids (Age<6) Has Kids (Age≥6) Husband's Education Husband's Education × Has Kids (Age<6) Husband's Education × Has Kids (Age≥6) 0.002 0.003 (2.42×10−4) (2.68×10−4) 1.09×10−4 (2.40×10−5) 1.10×10−4 (2.3×10−5) (4.54×10−4) 0.002 0.014 (0.004) 0.012 (0.002) 1991 CPS data on married women's labor force participation. Standard errors given in parentheses and calculated using 100 bootstrap replications. The F-test in the reduced form is a joint test of significance for the terms that include the instrument, husband's education.

Table E.8 (cont'd) RF Probit Het Probit Variables (1) (2) (3) Probit Het Probit Probit Het Probit (CV) (4) (CV) (5) (6) (GCF) (GCF) (7) SML (Sieve) (8) Control Function v̂2i v̂2i × Has Kids (Age<6) v̂2i × Has Kids (Age≥6) Heteroskedasticity Non-wife Income Education Experience F-Stat 45.304 -0.062 (0.035) -0.080 (0.031) -0.006 (4.006) -0.093 (5.647) -0.088 (7.727) -0.004 (0.001) 1.25×10−4 (1.92×10−4) 0.008 (0.006) -0.004 (0.001) 0.003 (0.005) 0.009 (0.006) -0.025 (0.032) -0.088 (0.056) -0.091 (0.045) -0.004 (0.001) 0.000 (0.001) 0.005 (0.006) 1991 CPS data on married women's labor force participation. Standard errors given in parentheses and calculated using 100 bootstrap replications. The F-test in the reduced form is a joint test of significance for the terms that include the instrument, husband's education.

Table E.9: Wald Test Results Null Hypothesis Alternative Hypothesis Non-Wife Income is Exogenous Non-Wife Income is Endogenous Control Variable Homoskedasticity General Control Heteroskedasticity Function Wald Statistic p-value 4.582 0.032 12.219 0.007 4.749 0.029 12.987 0.005 4.949 0.084 7.109 0.029 15.213 0.002 15.851 0.001 10.658 0.014 Additional Assumptions: Homoskedasticity Endogeneity (CV) Endogeneity (GCF) x x x x x x x x x 1991 CPS data on married women's labor force participation. Wald statistics calculated using bootstrapped standard errors.
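The Wald statistics above are built from a bootstrap estimate of the coefficient covariance matrix. A minimal sketch of that generic computation for a null of the form Rθ = 0; the arrays below are illustrative, not the dissertation's estimates:

```python
import numpy as np

def wald_statistic(theta_hat, boot_draws, R):
    """Wald test of H0: R @ theta = 0 using a bootstrapped covariance.

    theta_hat  : point estimates, shape (k,)
    boot_draws : bootstrap replicates of theta_hat, shape (B, k)
    R          : restriction matrix, shape (q, k)
    Asymptotically chi-squared with q degrees of freedom under H0.
    """
    V = np.cov(np.asarray(boot_draws), rowvar=False)   # (k, k) bootstrap covariance
    r = R @ np.asarray(theta_hat, dtype=float)
    return float(r @ np.linalg.solve(R @ V @ R.T, r))

# Illustrative: test whether the first of two coefficients is zero
theta = np.array([1.0, 0.0])
draws = np.array([[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]])
R = np.array([[1.0, 0.0]])
print(wald_statistic(theta, draws, R))
```

The p-values reported in Table E.9 would then come from the chi-squared distribution with degrees of freedom equal to the number of restrictions.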
Table E.10: APE Estimates for Non-Wife Income Effect on Wife's LFP

Estimator            APE        SE
OLS                 -0.00333    0.00023
2SLS                -0.00253    0.00095
Probit              -0.00266    0.00032
Het Probit          -0.00331    0.00020
Probit (CV)         -0.00155    0.00076
Het Probit (CV)     -0.00116    0.00088
Probit (GCF)        -0.00180    0.00093
Het Probit (GCF)     0.00027    0.00103

1991 CPS data on married women's labor force participation. Standard errors calculated using 100 bootstrap replications.

Extension: Semi-Parametric Distribution-Free Estimator

Table E.11: Logistic Distribution (h_o^1 = v2i)

N       Estimators            Bias      Std. Dev.  RMSE     25%      50%      75%
250     Probit (CF)          -0.0131    0.1717     0.1721   0.8875   0.9988   1.0947
        SML                  -0.0352    0.1778     0.1811   0.8595   0.9702   1.0829
        SP Het Probit (GCF)  -0.0183    0.2210     0.2216   0.8603   0.9873   1.1191
500     Probit (CF)          -0.0073    0.1192     0.1194   0.9172   0.9992   1.0758
        SML                  -0.0219    0.1236     0.1255   0.9024   0.9817   1.0661
        SP Het Probit (GCF)  -0.0124    0.1563     0.1567   0.8975   0.9995   1.0864
1,000   Probit (CF)           0.0007    0.0821     0.0820   0.9499   1.0022   1.0593
        SML                  -0.0075    0.0840     0.0843   0.9404   0.9962   1.0521
        SP Het Probit (GCF)   0.0004    0.1246     0.1245   0.9285   1.0052   1.0831

1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns.

Table E.12: Uniform Distribution (h_o^1 = v2i)

N       Estimators            Bias      Std. Dev.  RMSE     25%      50%      75%
250     Probit (CF)          -0.0210    0.1808     0.1819   0.8676   0.9862   1.1053
        SML                  -0.0402    0.1898     0.1939   0.8400   0.9663   1.0985
        SP Het Probit (GCF)  -0.0354    0.2598     0.2621   0.8240   0.9773   1.1336
500     Probit (CF)          -0.0056    0.1223     0.1224   0.9180   0.9991   1.0734
        SML                  -0.0124    0.1225     0.1230   0.9089   0.9907   1.0730
        SP Het Probit (GCF)  -0.0939    0.2908     0.3055   0.7547   0.9294   1.0814
1,000   Probit (CF)           0.0008    0.0856     0.0855   0.9484   1.0047   1.0613
        SML                  -0.0021    0.0848     0.0848   0.9444   1.0007   1.0559
        SP Het Probit (GCF)  -0.0996    0.2097     0.2320   0.7794   0.9149   1.0371

1,000 simulations.
Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 192 Table E.13: Student T Distribution (h1 o = v2i) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0123 -0.0190 -0.0136 -0.0083 -0.0122 0.0042 -0.0031 -0.0045 0.0100 0.1611 0.1571 0.1872 0.1087 0.1077 0.1254 0.0761 0.0735 0.1023 0.1614 0.1582 0.1876 0.1090 0.1083 0.1254 0.0761 0.0736 0.1027 0.8928 0.8853 0.8841 0.9202 0.9157 0.9161 0.9466 0.9475 0.9473 1.0019 0.9850 0.9917 0.9888 0.9875 1.0096 0.9960 0.9975 1.0168 1.0989 1.0902 1.1018 1.0709 1.0648 1.0915 1.0500 1.0454 1.0764 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.14: Gaussian Mixture Distribution (h1 o = v2i) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0135 -0.0373 -0.0203 -0.0130 -0.0248 -0.0608 -0.0021 -0.0085 -0.0440 0.1840 0.1972 0.2463 0.1201 0.1230 0.2383 0.0877 0.0887 0.1658 0.1845 0.2006 0.2470 0.1207 0.1254 0.2458 0.0877 0.0891 0.1715 0.8783 0.8345 0.8425 0.9066 0.8923 0.8090 0.9380 0.9321 0.8562 1.0023 0.9765 0.9995 0.9910 0.9794 0.9548 1.0011 0.9948 0.9715 1.1095 1.1025 1.1400 1.0759 1.0653 1.0908 1.0558 1.0488 1.0676 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 193 Table E.15: Logistic Distribution with Linear GCF (h2 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0119 -0.0710 -0.0127 -0.0064 -0.0602 -0.0059 -0.0013 -0.0517 0.0033 0.1533 0.1539 0.1985 0.1135 0.1044 0.1502 0.0765 0.0696 0.1120 0.1537 0.1694 0.1988 0.1136 0.1205 0.1503 0.0765 0.0867 0.1120 0.8913 0.8296 0.8643 0.9271 0.8737 0.9154 0.9509 0.9014 0.9293 0.9948 0.9213 0.9936 1.0038 0.9397 0.9995 0.9996 0.9509 1.0067 1.0912 1.0314 1.1169 1.0647 1.0111 1.0917 1.0500 0.9972 1.0820 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.16: Uniform Distribution with Linear GCF (h2 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0310 -0.0868 -0.0145 -0.0166 -0.0644 -0.0168 -0.0141 -0.0604 -0.0203 0.1794 0.1722 0.2111 0.1194 0.1114 0.2088 0.0836 0.0762 0.1715 0.1820 0.1927 0.2115 0.1205 0.1286 0.2094 0.0848 0.0972 0.1726 0.8609 0.8028 0.8513 0.9068 0.8672 0.8467 0.9311 0.8886 0.8724 0.9757 0.9128 0.9923 0.9889 0.9370 0.9911 0.9869 0.9413 0.9954 1.0863 1.0223 1.1239 1.0649 1.0094 1.1106 1.0437 0.9911 1.0915 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 194 Table E.17: Student T Distribution with Linear GCF (h2 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0033 -0.0563 -0.0060 -0.0034 -0.0529 0.0002 -0.0002 -0.0479 0.0086 0.1421 0.1309 0.1674 0.0992 0.0889 0.1208 0.0695 0.0636 0.0967 0.1421 0.1425 0.1674 0.0992 0.1034 0.1208 0.0695 0.0796 0.0970 0.9128 0.8626 0.8890 0.9317 0.8892 0.9257 0.9519 0.9083 0.9473 1.0040 0.9486 1.0017 0.9959 0.9460 1.0039 0.9985 0.9505 1.0118 1.0904 1.0311 1.1013 1.0652 1.0049 1.0789 1.0479 0.9957 1.0684 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.18: Gaussian Mixture Distribution with Linear GCF (h2 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0195 -0.0778 -0.0377 -0.0189 -0.0721 -0.0177 -0.0094 -0.0578 -0.0014 0.1773 0.1768 1.0300 0.1163 0.1076 0.1930 0.0822 0.0760 0.1396 0.1783 0.1931 1.0302 0.1178 0.1294 0.1937 0.0827 0.0954 0.1395 0.8781 0.8175 0.8759 0.9042 0.8592 0.8674 0.9351 0.8902 0.9048 0.9915 0.9289 1.0019 0.9863 0.9273 0.9967 0.9904 0.9393 1.0022 1.1023 1.0282 1.1342 1.0605 0.9984 1.1047 1.0469 0.9943 1.0942 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 195 Table E.19: Logistic Distribution with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0349 -0.0525 0.0117 0.0354 -0.0451 0.0453 0.0426 -0.0344 0.0010 0.1565 0.1739 0.2313 0.1106 0.1245 0.1537 0.0774 0.0843 0.1231 0.1603 0.1816 0.2315 0.1161 0.1324 0.1601 0.0883 0.0910 0.1230 0.9421 0.8363 0.9123 0.9657 0.8776 0.9581 0.9929 0.9101 0.9219 1.0387 0.9467 1.0244 1.0395 0.9594 1.0551 1.0465 0.9649 1.0074 1.1367 1.0589 1.1405 1.1084 1.0376 1.1472 1.0950 1.0245 1.0872 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.20: Uniform Distribution with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0270 -0.0710 -0.0159 0.0381 -0.0488 -0.0063 0.0424 -0.0403 -0.0606 0.1695 0.1966 0.2418 0.1110 0.1249 0.2364 0.0789 0.0870 0.1781 0.1716 0.2089 0.2422 0.1173 0.1341 0.2363 0.0895 0.0958 0.1881 0.9199 0.8151 0.8469 0.9612 0.8735 0.8518 0.9878 0.9030 0.8281 1.0313 0.9353 1.0006 1.0390 0.9541 1.0080 1.0442 0.9643 0.9525 1.1433 1.0599 1.1395 1.1127 1.0344 1.1458 1.0971 1.0211 1.0616 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 196 Table E.21: Student T Distribution with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0363 -0.0338 0.0241 0.0373 -0.0291 0.0577 0.0387 -0.0264 0.0078 0.1431 0.1559 0.1870 0.0990 0.1045 0.1205 0.0700 0.0740 0.1036 0.1475 0.1595 0.1885 0.1058 0.1085 0.1336 0.0800 0.0786 0.1038 0.9511 0.8733 0.9318 0.9747 0.9026 0.9807 0.9883 0.9244 0.9395 1.0409 0.9677 1.0314 1.0376 0.9692 1.0638 1.0408 0.9737 1.0132 1.1333 1.0657 1.1473 1.1035 1.0401 1.1440 1.0856 1.0243 1.0760 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.22: Gaussian Mixture Distribution with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0303 -0.0668 0.0056 0.0317 -0.0581 0.0106 0.0389 -0.0443 -0.0296 0.1708 0.1945 0.2327 0.1130 0.1247 0.2048 0.0828 0.0922 0.1561 0.1733 0.2056 0.2327 0.1173 0.1375 0.2050 0.0915 0.1023 0.1588 0.9220 0.8071 0.8857 0.9565 0.8525 0.9050 0.9825 0.8920 0.8595 1.0385 0.9462 1.0238 1.0307 0.9403 1.0290 1.0408 0.9565 0.9771 1.1441 1.0685 1.1484 1.1115 1.0303 1.1443 1.0979 1.0225 1.0797 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 197 Table E.23: Heteroskedastic Logistic (h1 o = v2i) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0687 -0.1041 0.1213 -0.0593 -0.0926 -0.0263 -0.0516 -0.0816 -0.0043 0.1798 0.1935 0.2858 0.1319 0.1460 0.1532 0.0854 0.0953 0.1160 0.1925 0.2197 0.3104 0.1446 0.1728 0.1554 0.0998 0.1254 0.1160 0.8222 0.7730 0.9453 0.8557 0.8136 0.8836 0.8951 0.8529 0.9203 0.9295 0.8895 1.1030 0.9456 0.9040 0.9712 0.9497 0.9231 0.9940 1.0513 1.0225 1.2574 1.0256 1.0116 1.0713 1.0034 0.9824 1.0705 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.24: Heteroskedastic Logistic with Linear GCF (h2 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0171 -0.0843 0.1245 -0.0148 -0.0609 -0.0019 -0.0105 -0.0557 0.0026 0.1649 0.1601 0.2358 0.1204 0.1163 0.1350 0.0816 0.0780 0.1025 0.1657 0.1809 0.2665 0.1213 0.1313 0.1349 0.0822 0.0958 0.1025 0.8795 0.8114 0.9792 0.9132 0.8621 0.9110 0.9380 0.8905 0.9343 0.9869 0.9116 1.1112 0.9870 0.9365 0.9996 0.9910 0.9449 1.0019 1.0861 1.0197 1.2494 1.0650 1.0193 1.0843 1.0399 0.9964 1.0628 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 198 Table E.25: Heteroskedastic Logistic with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0044 -0.0846 0.1307 0.0051 -0.0721 0.0208 0.0110 -0.0646 0.0042 0.1618 0.1823 0.2867 0.1182 0.1360 0.1264 0.0795 0.0906 0.1106 0.1618 0.2009 0.3149 0.1182 0.1539 0.1281 0.0803 0.1112 0.1106 0.8972 0.7910 0.9974 0.9308 0.8405 0.9343 0.9568 0.8753 0.9242 1.0054 0.8995 1.1174 1.0111 0.9251 1.0210 1.0131 0.9398 1.0001 1.1102 1.0314 1.2534 1.0815 1.0204 1.1102 1.0662 0.9999 1.0810 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 199 APPENDIX F Figures for Chapter 3 Simulation Figure F.1: Distribution of ˆσ2 1 for T=5 under DGP1 1,000 simulations of N=300. Increasing serial correlation as color lightens. 200 Figure F.2: Distribution of ˆσ2 1 for T=10 under DGP1 1,000 simulations of N=300. Increasing serial correlation as color lightens. 201 Figure F.3: Distribution of ˆσ2 1 for T=20 under DGP1 1,000 simulations of N=300. Increasing serial correlation as color lightens. 202 Figure F.4: Distribution of ˆσ2 1 for T=5 under DGP2 1,000 simulations of N=300. Increasing serial correlation as color lightens. 203 Figure F.5: Distribution of ˆσ2 1 for T=10 under DGP2 1,000 simulations of N=300. Increasing serial correlation as color lightens. 204 Figure F.6: Distribution of ˆσ2 1 for T=20 under DGP2 1,000 simulations of N=300. Increasing serial correlation as color lightens. 205 Figure F.7: Distribution of ˆσ2 1 for T=5 under DGP3 1,000 simulations of N=300. Increasing serial correlation as color lightens. 206 Figure F.8: Distribution of ˆσ2 1 for T=10 under DGP3 1,000 simulations of N=300. Increasing serial correlation as color lightens. 207 Figure F.9: Distribution of ˆσ2 1 for T=20 under DGP3 1,000 simulations of N=300. Increasing serial correlation as color lightens. 
Figure F.10: ASF Estimates for T=5 under DGP1
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.11: ASF Estimates for T=10 under DGP1
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.12: ASF Estimates for T=20 under DGP1
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.13: ASF Estimates for T=5 under DGP2
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.14: ASF Estimates for T=10 under DGP2
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.15: ASF Estimates for T=20 under DGP2
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.16: ASF Estimates for T=5 under DGP3
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.17: ASF Estimates for T=10 under DGP3
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.18: ASF Estimates for T=20 under DGP3
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Application

Figure F.19: ATE for Selling Drugs
Vertical axis is the negative ATE, so a higher value indicates a better treatment (one that more strongly reduces the probability of the antisocial behavior outcomes).
The transparent gray plane is flat at an ATE equal to 0, so any treatment effect above the plane is a desired outcome. The ATEs are calculated over a grid of the characteristics of interest valued between [0,1] (recall that the characteristics are standardized to mean 0 and standard deviation 1).

Figure F.20: ATE for Being Arrested
Vertical axis is the negative ATE, so a higher value indicates a better treatment (one that more strongly reduces the probability of the antisocial behavior outcomes). The transparent gray plane is flat at an ATE equal to 0, so any treatment effect above the plane is a desired outcome. The ATEs are calculated over a grid of the characteristics of interest valued between [0,1] (recall that the characteristics are standardized to mean 0 and standard deviation 1).

Figure F.21: ATE for Engaging in Illicit Activities
Vertical axis is the negative ATE, so a higher value indicates a better treatment (one that more strongly reduces the probability of the antisocial behavior outcomes). The transparent gray plane is flat at an ATE equal to 0, so any treatment effect above the plane is a desired outcome. The ATEs are calculated over a grid of the characteristics of interest valued between [0,1] (recall that the characteristics are standardized to mean 0 and standard deviation 1).

Discussion

Figure F.22: Distribution of σ̂₁² for T=5 under AR(2)
1,000 simulations of N=300.

Figure F.23: Distribution of σ̂₁² for T=10 under AR(2)
1,000 simulations of N=300.

Figure F.24: Distribution of σ̂₁² for T=20 under AR(2)
1,000 simulations of N=300.
223 APPENDIX G Tables for Chapter 3 224 Simulation Table G.1: Estimation Times for DGP 1 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) 0.696 (0.11) 0.613 (0.08) 0.676 (0.06) 0.656 (0.10) 0.688 (0.08) 0.873 (0.08) 1.581 (0.14) 2.429 (0.18) 4.277 (0.36) 1.845 (0.25) 2.755 (0.23) 4.837 (0.39) 0.708 (0.12) 0.619 (0.07) 0.681 (0.06) 0.668 (0.13) 0.691 (0.07) 0.879 (0.07) 1.617 (0.15) 2.412 (0.17) 4.251 (0.34) 1.852 (0.14) 2.776 (0.21) 4.755 (0.34) 0.678 (0.11) 0.624 (0.08) 0.687 (0.07) 0.651 (0.11) 0.699 (0.08) 0.882 (0.08) 1.950 (0.28) 2.621 (0.21) 4.471 (0.42) 2.407 (9.03) 2.982 (0.23) 4.692 (0.40) T 5 10 20 Average estimation time in seconds and standard deviations given in parenthesis. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 225 Table G.2: Estimation Times for DGP 2 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) 0.643 (0.13) 0.614 (0.09) 0.671 (0.09) 0.609 (0.09) 0.616 (0.07) 0.779 (0.08) 1.613 (0.17) 2.199 (0.23) 3.935 (0.36) 1.881 (0.86) 2.647 (0.25) 4.542 (0.41) 0.655 (0.15) 0.633 (0.10) 0.676 (0.09) 0.617 (0.09) 0.633 (0.07) 0.789 (0.08) 1.606 (0.19) 2.213 (0.23) 3.908 (0.34) 1.833 (0.58) 2.624 (0.24) 4.403 (0.36) 0.634 (0.13) 0.643 (0.11) 0.682 (0.09) 0.610 (0.12) 0.638 (0.08) 0.792 (0.08) 3.121 (3.38) 2.579 (0.31) 3.891 (0.34) 2.189 (0.72) 2.844 (0.67) 4.492 (0.43) T 5 10 20 Average estimation time in seconds and standard deviations given in parenthesis. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
226 Table G.3: Estimation Times for DGP 3 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) 0.169 (0.03) 0.385 (0.04) 0.411 (0.03) 0.208 (0.20) 0.416 (0.04) 0.523 (0.03) 1.025 (0.25) 1.692 (0.13) 2.970 (0.25) 1.053 (0.25) 1.831 (0.13) 3.108 (0.21) 0.306 (0.05) 0.390 (0.05) 0.416 (0.03) 0.302 (0.06) 0.430 (0.05) 0.527 (0.03) 1.142 (0.11) 1.680 (0.13) 2.928 (0.22) 1.104 (0.20) 1.860 (0.10) 3.043 (0.19) 0.445 (0.08) 0.397 (0.05) 0.428 (0.03) 0.405 (0.08) 0.430 (0.04) 0.539 (0.03) 1.610 (0.81) 1.882 (0.10) 2.954 (0.21) 1.371 (0.33) 1.988 (0.10) 3.174 (0.20) T 5 10 20 Average estimation time in seconds and standard deviations given in parenthesis. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 227 Table G.4: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 1 ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.8495 (0.114) -0.9902 (0.080) 0.0176 (0.078) -0.8789 (0.079) -0.8990 (0.056) 0.0062 (0.047) -0.8688 (0.059) -0.8687 (0.042) 0.0020 (0.030) -0.0003 (0.423) 0.0698 (0.436) 0.0073 (0.075) -0.0280 (0.715) 0.0439 (0.577) 0.0007 (0.046) 0.0067 (1.120) 0.0361 (0.891) 0.0004 (0.030) -1.0358 (0.135) -0.9320 (0.092) 0.1652 (0.094) -0.9679 (0.089) -0.8702 (0.058) 0.0795 (0.051) -0.9107 (0.065) -0.8556 (0.044) 0.0360 (0.033) -0.0485 (0.509) 0.2210 (0.454) 0.1439 (0.088) -0.0419 (0.747) 0.1347 (0.623) 0.0734 (0.051) -0.0910 (1.243) 0.1208 (0.926) 0.0345 (0.032) -1.8101 (0.261) -0.6902 (0.140) 0.7966 (0.189) -1.3745 (0.131) -0.7305 (0.076) 0.4223 (0.077) -1.1243 (0.084) -0.7811 (0.050) 0.2212 (0.042) -0.2168 (0.863) 0.9404 (0.666) 0.7320 (0.173) -0.1355 (1.094) 0.6394 (0.813) 0.4120 (0.076) -0.1522 (1.705) 0.3639 (1.102) 0.2194 (0.042) R=1,000, N=300. Standard Deviations are given in parenthesis. 
The true coeffi- cient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 228 Table G.5: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 2 ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.7149 (0.119) -1.2177 (0.100) 0.0103 (0.082) -0.8690 (0.086) -0.9803 (0.071) 0.0159 (0.049) -0.8924 (0.064) -0.9041 (0.047) 0.0062 (0.032) -0.0048 (0.285) 0.0516 (0.311) 0.0107 (0.079) -0.0263 (0.426) 0.0166 (0.375) 0.0058 (0.048) -0.0110 (0.700) 0.0354 (0.562) 0.0020 (0.032) -0.9152 (0.145) -1.1591 (0.113) 0.1668 (0.098) -0.9867 (0.101) -0.9423 (0.074) 0.0976 (0.054) -0.9481 (0.070) -0.8816 (0.052) 0.0492 (0.033) -0.0385 (0.336) 0.2198 (0.365) 0.1518 (0.093) -0.0122 (0.471) 0.1309 (0.421) 0.0845 (0.053) -0.0213 (0.762) 0.0945 (0.591) 0.0446 (0.033) -1.6801 (0.258) -0.9264 (0.173) 0.8092 (0.182) -1.4536 (0.149) -0.7891 (0.093) 0.4769 (0.086) -1.1919 (0.093) -0.7962 (0.060) 0.2514 (0.044) -0.2266 (0.573) 0.9672 (0.538) 0.7461 (0.165) -0.1076 (0.733) 0.6271 (0.562) 0.4544 (0.083) -0.0525 (1.027) 0.3712 (0.706) 0.2464 (0.043) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coeffi- cient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
229 Table G.6: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 3 ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.6468 (0.116) -0.6229 (0.099) -0.0332 (0.084) -0.5603 (0.077) -0.5215 (0.067) -0.0094 (0.051) -0.4860 (0.061) -0.4739 (0.050) -0.0004 (0.033) -0.2268 (0.451) -0.0972 (0.474) 0.0095 (0.080) -0.4977 (0.710) -0.4936 (0.640) 0.0013 (0.050) -0.5450 (1.182) -0.6094 (0.958) 0.0033 (0.033) -0.8121 (0.145) -0.5174 (0.122) 0.1248 (0.103) -0.6190 (0.084) -0.4707 (0.074) 0.0626 (0.054) -0.5140 (0.064) -0.4437 (0.052) 0.0333 (0.035) -0.3008 (0.523) 0.0199 (0.522) 0.1484 (0.093) -0.5812 (0.792) -0.4200 (0.680) 0.0705 (0.052) -0.6948 (1.260) -0.5018 (0.970) 0.0367 (0.035) -1.4530 (0.256) -0.0816 (0.204) 0.7717 (0.216) -0.9248 (0.127) -0.2153 (0.098) 0.4171 (0.085) -0.6467 (0.084) -0.3001 (0.061) 0.2138 (0.048) -0.6283 (0.894) 0.6856 (0.752) 0.7304 (0.193) -0.9415 (1.142) -0.1168 (0.846) 0.4133 (0.082) -0.7962 (1.669) -0.4328 (1.226) 0.2152 (0.047) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coeffi- cient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
230 Table G.7: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 1 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.5288 (0.090) -0.9152 (0.086) -0.0751 (0.085) -0.6001 (0.070) -0.8329 (0.055) -0.0388 (0.056) -0.6442 (0.054) -0.7695 (0.043) -0.0174 (0.040) 0.0055 (0.357) 0.0388 (0.392) 0.0069 (0.087) -0.0254 (0.602) 0.0218 (0.517) -0.0013 (0.055) -0.0124 (0.944) 0.0462 (0.805) 0.0010 (0.040) -0.6084 (0.073) -0.8288 (0.057) -0.0636 (0.054) -0.6609 (0.058) -0.7517 (0.042) -0.0453 (0.038) -0.6812 (0.046) -0.7189 (0.033) -0.0235 (0.027) -0.0009 (0.347) 0.0606 (0.354) 0.0091 (0.058) -0.0245 (0.588) 0.0422 (0.476) 0.0049 (0.039) 0.0050 (0.917) 0.0350 (0.732) 0.0046 (0.027) -0.5318 (0.092) -0.9164 (0.088) -0.0726 (0.082) -0.6043 (0.074) -0.8318 (0.056) -0.0351 (0.059) -0.6458 (0.059) -0.7706 (0.041) -0.0171 (0.040) -0.0060 (0.382) 0.0233 (0.375) 0.0065 (0.087) -0.0156 (0.585) 0.0188 (0.530) 0.0027 (0.058) -0.0692 (1.012) 0.0734 (0.842) 0.0005 (0.040) -0.6255 (0.074) -0.8156 (0.058) -0.0641 (0.053) -0.6678 (0.061) -0.7486 (0.041) -0.0434 (0.039) -0.6828 (0.050) -0.7192 (0.034) -0.0247 (0.028) -0.0116 (0.367) 0.0435 (0.323) 0.0119 (0.056) -0.0208 (0.573) 0.0424 (0.479) 0.0074 (0.040) -0.0661 (0.986) 0.0661 (0.734) 0.0035 (0.027) -0.5368 (0.096) -0.9140 (0.088) -0.0704 (0.084) -0.6040 (0.081) -0.8310 (0.056) -0.0344 (0.060) -0.6500 (0.064) -0.7697 (0.042) -0.0140 (0.043) -0.0223 (0.438) 0.0155 (0.401) 0.0069 (0.085) -0.0114 (0.675) 0.0620 (0.554) 0.0042 (0.060) -0.0684 (1.193) 0.0563 (0.884) 0.0055 (0.042) -0.6439 (0.082) -0.7904 (0.054) -0.0764 (0.057) -0.6730 (0.067) -0.7401 (0.041) -0.0483 (0.042) -0.6893 (0.055) -0.7157 (0.033) -0.0225 (0.030) -0.0194 (0.411) 0.0237 (0.309) 0.0090 (0.058) -0.0201 (0.634) 0.0792 (0.470) 0.0062 (0.043) -0.0672 (1.151) 0.0691 (0.743) 0.0075 (0.030) R=1,000, N=300. Standard Deviations are given in parenthesis. 
The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 231 Table G.8: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 2 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.2231 (0.235) -1.3919 (0.241) -0.1331 (0.140) -0.5102 (0.065) -0.9603 (0.094) -0.0786 (0.060) -0.5803 (0.055) -0.8798 (0.048) -0.0422 (0.042) 0.0042 (0.240) 0.0087 (0.288) -0.0015 (0.088) -0.0232 (0.362) 0.0025 (0.345) 0.0039 (0.056) -0.0156 (0.585) 0.0228 (0.515) 0.0010 (0.040) -0.5189 (0.074) -0.9973 (0.075) -0.0577 (0.060) -0.6169 (0.057) -0.8227 (0.051) -0.0705 (0.040) -0.6620 (0.047) -0.7584 (0.036) -0.0534 (0.028) -0.0034 (0.232) 0.0403 (0.253) 0.0075 (0.064) -0.0222 (0.349) 0.0174 (0.306) 0.0082 (0.040) -0.0094 (0.574) 0.0326 (0.459) 0.0049 (0.028) -0.2203 (0.232) -1.3901 (0.244) -0.1326 (0.143) -0.5080 (0.069) -0.9625 (0.091) -0.0845 (0.060) -0.5802 (0.055) -0.8787 (0.052) -0.0440 (0.041) 0.0007 (0.250) 0.0044 (0.303) -0.0014 (0.088) 0.0036 (0.373) 0.0160 (0.356) 0.0024 (0.056) -0.0190 (0.626) 0.0358 (0.529) 0.0002 (0.040) -0.5477 (0.075) -0.9627 (0.073) -0.0620 (0.060) -0.6304 (0.063) -0.8131 (0.049) -0.0753 (0.040) -0.6670 (0.050) -0.7526 (0.039) -0.0535 (0.028) -0.0040 (0.243) 0.0374 (0.263) 0.0124 (0.063) 0.0043 (0.359) 0.0337 (0.322) 0.0113 (0.041) -0.0100 (0.600) 0.0400 (0.467) 0.0073 (0.028) -0.2491 (0.241) -1.3667 (0.247) -0.1207 (0.140) -0.5110 (0.080) -0.9539 (0.090) -0.0796 (0.065) -0.5803 (0.060) -0.8777 (0.052) -0.0417 (0.042) -0.0224 (0.286) 0.0024 (0.304) 0.0059 (0.090) 0.0011 (0.442) 0.0319 (0.391) 0.0047 (0.061) 0.0037 (0.714) 0.0294 (0.550) 0.0039 (0.041) -0.5905 (0.084) -0.8890 (0.067) -0.0702 (0.063) -0.6458 (0.068) -0.7907 (0.046) -0.0791 (0.045) -0.6731 
(0.055) -0.7445 (0.037) -0.0549 (0.031) -0.0223 (0.272) 0.0357 (0.248) 0.0156 (0.062) -0.0008 (0.419) 0.0522 (0.321) 0.0149 (0.046) 0.0044 (0.681) 0.0536 (0.468) 0.0093 (0.031) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 232 Table G.9: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 3 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.2446 (0.090) -0.5314 (0.115) -0.2057 (0.079) -0.2998 (0.068) -0.4438 (0.075) -0.1064 (0.054) -0.3181 (0.053) -0.3982 (0.058) -0.0486 (0.038) -0.1737 (0.378) -0.0883 (0.420) -0.0024 (0.090) -0.3742 (0.592) -0.4378 (0.549) -0.0033 (0.056) -0.4102 (1.006) -0.5671 (0.879) 0.0019 (0.038) -0.4033 (0.071) -0.5951 (0.068) -0.1609 (0.054) -0.3981 (0.056) -0.4789 (0.052) -0.0802 (0.038) -0.3739 (0.048) -0.4110 (0.041) -0.0314 (0.028) -0.1846 (0.367) -0.0812 (0.384) 0.0068 (0.063) -0.4071 (0.581) -0.4021 (0.522) 0.0034 (0.040) -0.4487 (0.969) -0.4956 (0.785) 0.0067 (0.029) -0.2453 (0.090) -0.5305 (0.115) -0.2075 (0.076) -0.2962 (0.065) -0.4441 (0.078) -0.1043 (0.055) -0.3192 (0.054) -0.3959 (0.057) -0.0499 (0.039) -0.1748 (0.395) -0.1363 (0.406) -0.0058 (0.085) -0.3994 (0.627) -0.4221 (0.575) 0.0003 (0.055) -0.5041 (1.022) -0.5069 (0.858) 0.0010 (0.039) -0.4163 (0.073) -0.5920 (0.069) -0.1585 (0.053) -0.3975 (0.056) -0.4806 (0.054) -0.0805 (0.038) -0.3754 (0.048) -0.4088 (0.041) -0.0325 (0.029) -0.1907 (0.374) -0.1101 (0.369) 0.0081 (0.060) -0.4338 (0.608) -0.3842 (0.522) 0.0054 (0.040) -0.5452 (1.000) -0.4277 (0.769) 0.0056 (0.029) -0.2485 (0.096) -0.5291 (0.117) -0.2082 (0.076) -0.3017 (0.075) -0.4407 (0.077) -0.1018 (0.057) -0.3152 (0.059) -0.3979 (0.055) 
-0.0518 (0.041) -0.2025 (0.454) -0.1006 (0.434) -0.0032 (0.084) -0.4392 (0.690) -0.4056 (0.568) 0.0052 (0.060) -0.4641 (1.141) -0.5270 (0.964) 0.0004 (0.043) -0.4216 (0.079) -0.5910 (0.067) -0.1659 (0.053) -0.4079 (0.065) -0.4814 (0.052) -0.0781 (0.042) -0.3739 (0.054) -0.4081 (0.041) -0.0339 (0.032) -0.2142 (0.426) -0.1054 (0.347) 0.0023 (0.059) -0.4909 (0.664) -0.3591 (0.493) 0.0099 (0.045) -0.5001 (1.126) -0.4683 (0.829) 0.0047 (0.033) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 233 Table G.10: Root Mean Square Error of ˆβ2σ for Specification (2) ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP 0.1552 0.2677 0.6496 0.0832 0.1188 0.2662 0.1845 0.4935 1.0950 0.1286 0.2284 0.5370 0.0654 0.0937 0.2118 0.1540 0.4338 0.8625 0.1415 0.2808 0.7146 0.0920 0.1273 0.2811 0.1833 0.5091 0.9926 0.1062 0.2311 0.5432 0.0707 0.1049 0.2198 0.1483 0.4200 0.7738 0.1611 0.3109 0.7839 0.0926 0.1541 0.3029 0.1985 0.4876 1.2071 0.0962 0.2275 0.5572 0.0626 0.1056 0.2219 0.1317 0.3717 0.9059 T 5 10 20 5 10 20 5 10 20 DGP 1 DGP 2 DGP 3 R=1,000, N=300 234 Table G.11: Bias and Std Deviation of Variance Component σ2 2 for Specification (2) DGP 1 ρ = 0.4 ρ = 0.8 0.0633 (0.117) 0.0302 (0.069) 0.0100 (0.040) 0.4829 (0.305) 0.2278 (0.098) 0.1113 (0.060) ρ = 0 -0.0165 (0.105) -0.0064 (0.054) -0.0067 (0.040) DGP 2 ρ = 0.4 ρ = 0.8 0.1090 (0.149) 0.0632 (0.079) 0.0365 (0.048) 0.5318 (0.372) 0.3325 (0.140) 0.1913 (0.068) ρ = 0 0.0084 (0.281) -0.0506 (0.237) -0.1004 (0.228) DGP 3 ρ = 0.4 ρ = 0.8 0.0458 (0.354) -0.0079 (0.314) -0.0695 (0.258) 0.3359 (0.804) 0.2285 (0.536) 0.0139 (0.386) T 5 10 20 ρ = 0 0.0050 (0.095) -0.0023 (0.053) -0.0035 (0.042) R=1,000, N=300 235 Table G.12: Bias and Std Deviation (×10) of APE 
Estimates for DGP 1 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.6680 (0.123) -0.4028 (0.102) -0.2163 (0.083) 0.0027 (0.128) 0.0011 (0.098) -0.0012 (0.078) -0.3953 (0.123) -0.1537 (0.096) -0.0600 (0.075) 0.0025 (0.125) 0.0009 (0.096) -0.0004 (0.075) -0.6715 (0.133) -0.3999 (0.102) -0.2145 (0.083) -0.0037 (0.131) -0.0027 (0.096) 0.0002 (0.078) -0.3593 (0.126) -0.1408 (0.092) -0.0561 (0.077) -0.0043 (0.126) -0.0010 (0.092) 0.0004 (0.077) -0.6714 (0.132) -0.3983 (0.101) -0.2128 (0.084) -0.0082 (0.126) -0.0021 (0.093) 0.0011 (0.076) -0.2852 (0.115) -0.1127 (0.089) -0.0454 (0.073) -0.0055 (0.118) -0.0007 (0.090) 0.0028 (0.073) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0732. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 236 Table G.13: Bias and Std Deviation (×10) of APE Estimates for DGP 2 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -1.7912 (0.433) -0.7884 (0.132) -0.5333 (0.084) 0.0030 (0.148) -0.0013 (0.110) -0.0022 (0.082) -0.8927 (0.144) -0.3996 (0.111) -0.1712 (0.080) 0.0027 (0.143) -0.0012 (0.107) -0.0014 (0.080) -1.7813 (0.438) -0.7904 (0.136) -0.5307 (0.091) -0.0017 (0.149) -0.0019 (0.108) 0.0013 (0.088) -0.8139 (0.139) -0.3735 (0.105) -0.1624 (0.085) 0.0001 (0.143) 0.0003 (0.106) 0.0027 (0.086) -1.7429 (0.442) -0.7830 (0.138) -0.5338 (0.091) -0.0014 (0.143) -0.0016 (0.103) -0.0011 (0.083) -0.6664 (0.136) -0.3161 (0.101) -0.1436 (0.080) -0.0003 (0.134) -0.0006 (0.099) 0.0025 (0.079) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0731. 
Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 237 Table G.14: Bias and Std Deviation (×10) of APE Estimates for DGP 3 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.6549 (0.136) -0.3192 (0.110) -0.1452 (0.079) -0.0037 (0.122) -0.0010 (0.089) -0.0039 (0.071) -0.3342 (0.119) -0.1419 (0.089) -0.0610 (0.069) -0.0092 (0.119) -0.0040 (0.087) -0.0032 (0.068) -0.6575 (0.141) -0.3273 (0.109) -0.1425 (0.078) -0.0089 (0.127) -0.0053 (0.091) -0.0013 (0.070) -0.3137 (0.120) -0.1393 (0.088) -0.0559 (0.067) -0.0157 (0.121) -0.0082 (0.088) -0.0003 (0.0680) -0.6564 (0.143) -0.3266 (0.110) -0.1414 (0.078) -0.0037 (0.125) -0.0021 (0.088) -0.0010 (0.068) -0.2646 (0.111) -0.1181 (0.085) -0.0491 (0.065) -0.0139 (0.115) -0.0046 (0.084) -0.0003 (0.065) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is .1201. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 238 Table G.15: Comparison of APE and PEA T 5 10 20 DGP 1 DGP 2 DGP 3 APE PEA APE PEA APE PEA 0.0595 (0.050) 0.0640 (0.047) 0.0692 (0.045) 0.0719 (0.056) 0.0797 (0.050) 0.0879 (0.047) 0.0487 (0.055) 0.0531 (0.052) 0.0565 (0.052) 0.0572 (0.065) 0.0635 (0.061) 0.0682 (0.059) 0.1005 (0.076) 0.0986 (0.083) 0.0913 (0.086) 0.1147 (0.090) 0.1150 (0.099) 0.1071 (0.106) APE and PEA values using simulated data of sample size 1,000 and calculated using the true parameter values. Standard deviations of the distribution of the Partial effects given in parenthesis. 
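Table G.15 contrasts the average partial effect (APE), which averages the probit partial effect over the sample distribution of the covariates, with the partial effect at the average (PEA), which evaluates it at the covariate means. A minimal sketch of the distinction on hypothetical data, reusing the simulations' true coefficient values (the data-generating step here is illustrative, not the dissertation's DGP):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 1000
x1 = rng.normal(size=N)               # continuous covariate
x2 = rng.binomial(1, 0.5, size=N)     # binary covariate
alpha, b1, b2 = -0.25, 1.25, 1.0      # true values used in the simulations

# Probit partial effect of x1: derivative of Phi(index) w.r.t. x1 is b1*phi(index)
index = alpha + b1 * x1 + b2 * x2

ape = np.mean(b1 * norm.pdf(index))                              # average over units
pea = b1 * norm.pdf(alpha + b1 * x1.mean() + b2 * x2.mean())     # evaluate at the averages
```

Because phi(.) is nonlinear, the two generally differ; the gap widens as the index becomes more dispersed across units.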
Application

Table G.16: Select Baseline Summary Statistics

Baseline Covariate                    Therapy        Cash           Both           Control        Total
Age                                   25.16 (4.82)   25.69 (5.01)   25.45 (5.09)   25.37 (4.65)   25.41 (4.88)
Married or partnered                  0.152 (0.36)   0.133 (0.40)   0.198 (0.40)   0.143 (0.35)   0.155 (0.36)
Number of children <15 in household   2.031 (3.10)   2.058 (3.09)   2.528 (3.50)   1.865 (3.00)   2.100 (3.17)
Years of Schooling                    7.576 (3.38)   7.832 (3.29)   7.766 (3.27)   7.647 (3.06)   7.702 (3.25)
Has any disabilities                  0.063 (0.24)   0.062 (0.24)   0.061 (0.24)   0.095 (0.29)   0.071 (0.26)
Ex-Combatant                          0.375 (0.48)   0.394 (0.49)   0.315 (0.47)   0.389 (0.49)   0.370 (0.48)
Currently sleeping on street          0.228 (0.42)   0.252 (0.44)   0.244 (0.43)   0.258 (0.44)   0.246 (0.43)
Saving Stock (US$)                    33.92 (70.16)  28.16 (63.35)  41.11 (79.37)  27.37 (47.58)  32.21 (65.39)
Hrs/week, illicit activities          15.56 (32.54)  12.11 (23.74)  14.20 (26.08)  13.68 (27.12)  13.87 (27.60)
Hrs/week, agriculture                 0.565 (4.61)   0.128 (0.71)   0.197 (1.35)   0.604 (5.71)   0.385 (3.87)
Hrs/week, low-skill wage labor        18.44 (30.63)  17.88 (26.91)  19.09 (28.44)  18.97 (27.22)  18.59 (28.29)
Hrs/week, low-skill business          14.19 (28.68)  8.657 (20.92)  9.515 (20.80)  10.26 (22.41)  10.67 (23.55)
Hrs/week, illicit high skill work     1.813 (8.15)   1.795 (8.08)   0.947 (5.10)   1.054 (6.32)   1.406 (7.07)

899 individuals. BJS provides tests of balance for select baseline covariates in Table 1.

Table G.16 (cont'd)

Baseline Covariate                    Therapy         Cash            Both            Control         Total
Sells Drugs                           0.223 (0.42)    0.177 (0.38)    0.198 (0.40)    0.194 (0.40)    0.198 (0.40)
Uses marijuana daily                  0.464 (0.50)    0.465 (0.50)    0.416 (0.49)    0.484 (0.50)    0.459 (0.50)
Indicator for usually takes hard drugs 0.272 (0.45)   0.279 (0.45)    0.269 (0.44)    0.246 (0.43)    0.266 (0.44)
Uses hard drugs daily                 0.134 (0.34)    0.177 (0.38)    0.157 (0.36)    0.115 (0.32)    0.145 (0.35)
Committed theft, past 2 wks           0.576 (0.49)    0.540 (0.50)    0.533 (0.50)    0.579 (0.49)    0.558 (0.50)
Antisocial behavior index             -0.066 (0.962)  0.078 (1.090)   -0.036 (1.048)  0.035 (1.084)   0.005 (1.050)
Perseverance Index                    -0.026 (1.02)   -0.036 (1.12)   -0.009 (1.09)   0.008 (1.05)    -0.015 (1.07)
Reward Responsiveness                 -0.066 (1.05)   0.111 (1.08)    0.080 (1.07)    0.020 (1.03)    0.035 (1.06)
Impulsiveness Index                   -0.085 (1.05)   0.019 (1.09)    -0.004 (1.07)   0.109 (1.04)    0.013 (1.07)

899 individuals. BJS provides tests of balance for select baseline covariates in Table 1.

Table G.17: Preliminary OLS Estimates

                            Sells Drugs                           Arrested
                            Coeff     SE       Coeff     SE      Coeff     SE       Coeff     SE
Treatment
  Therapy                  -0.0558  (0.023)  -0.0559  (0.023)   -0.0010  (0.018)  -0.0003  (0.019)
  Cash                     -0.0087  (0.024)  -0.0068  (0.024)    0.0044  (0.019)   0.0125  (0.019)
  Both                     -0.0724  (0.022)  -0.0642  (0.023)   -0.0195  (0.019)  -0.0153  (0.019)
Interact with Therapy
  Bad Behavior             -0.0522  (0.024)  -0.0541  (0.024)   -0.0099  (0.021)  -0.0064  (0.021)
  Perseverance              0.0052  (0.021)   0.0007  (0.021)   -0.0181  (0.017)  -0.0199  (0.017)
  Reward                   -0.0032  (0.025)  -0.0065  (0.025)    0.0049  (0.016)   0.0007  (0.016)
  Impulsiveness            -0.0179  (0.021)  -0.0198  (0.021)   -0.0052  (0.018)  -0.0061  (0.019)
Interact with Cash
  Bad Behavior             -0.0280  (0.026)  -0.0201  (0.027)   -0.0021  (0.023)   0.0097  (0.022)
  Perseverance              0.0377  (0.021)   0.0387  (0.021)   -0.0238  (0.017)  -0.0173  (0.017)
  Reward                    0.0068  (0.021)   0.0032  (0.022)    0.0305  (0.017)   0.0251  (0.018)
  Impulsiveness             0.0292  (0.022)   0.0155  (0.023)   -0.0012  (0.020)  -0.0121  (0.020)
Interact with Both
  Bad Behavior             -0.0708  (0.023)  -0.0729  (0.024)   -0.0480  (0.023)  -0.0561  (0.023)
  Perseverance              0.0380  (0.019)   0.0260  (0.020)    0.0166  (0.017)   0.0167  (0.018)
  Reward                   -0.0080  (0.019)  -0.0145  (0.020)    0.0244  (0.017)   0.0187  (0.017)
  Impulsiveness             0.0116  (0.019)   0.0007  (0.020)    0.0038  (0.018)   0.0027  (0.019)
Includes Block FE           No                Yes                No                Yes
Number of Individuals       890               890                890               890
Number of Observations      3,312             3,312              3,312             3,312

Standard errors for both estimators are robust and clustered at the individual level. Treatment is randomly assigned within Blocks.
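The OLS estimates in Table G.17 use standard errors clustered at the individual level. A minimal sketch of the cluster-robust "sandwich" variance on hypothetical panel data (variable names and numbers are illustrative, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_id, T = 200, 4
ids = np.repeat(np.arange(n_id), T)
x = rng.normal(size=n_id * T)
# Individual effect plus idiosyncratic noise induces within-person correlation
u = rng.normal(size=n_id)[ids] + rng.normal(size=n_id * T)
y = 0.5 + 0.3 * x + u

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta

# Cluster-robust variance: sum score outer-products within each individual
XtX_inv = np.linalg.inv(X.T @ X)
meat = np.zeros((2, 2))
for g in range(n_id):
    s = X[ids == g].T @ e[ids == g]
    meat += np.outer(s, s)
V = XtX_inv @ meat @ XtX_inv
se = np.sqrt(np.diag(V))
```

Summing scores within individuals (rather than observation by observation) is what makes the variance robust to arbitrary serial correlation inside each cluster.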
Table G.17 (cont'd)

                            Illicit Activity
                            Coeff     SE       Coeff     SE
Treatment
  Therapy                  -0.0575  (0.021)  -0.0565  (0.022)
  Cash                     -0.0204  (0.021)  -0.0163  (0.021)
  Both                     -0.0622  (0.021)  -0.0570  (0.022)
Interact with Therapy
  Bad Behavior             -0.0642  (0.024)  -0.0678  (0.024)
  Perseverance              0.0094  (0.019)   0.0043  (0.019)
  Reward                   -0.0010  (0.023)  -0.0043  (0.023)
  Impulsiveness             0.0039  (0.020)   0.0064  (0.019)
Interact with Cash
  Bad Behavior             -0.0271  (0.023)  -0.0270  (0.024)
  Perseverance              0.0253  (0.018)   0.0253  (0.018)
  Reward                    0.0043  (0.018)  -0.0028  (0.019)
  Impulsiveness             0.0175  (0.019)   0.0091  (0.019)
Interact with Both
  Bad Behavior             -0.0567  (0.024)  -0.0671  (0.025)
  Perseverance              0.0495  (0.020)   0.0339  (0.020)
  Reward                   -0.0092  (0.018)  -0.0129  (0.019)
  Impulsiveness             0.0028  (0.019)  -0.0031  (0.019)
Includes Block FE           No                Yes
Number of Individuals       890               890
Number of Observations      3,328             3,328

Standard errors for both estimators are robust and clustered at the individual level. Treatment is randomly assigned within Blocks.

Table G.18: Scaled Probit Coefficient Estimates for Selling Drugs ME Probit Pooled Probit RE CRE RE CRE Coeff SE Coeff SE Coeff SE Coeff SE Treatment Therapy Cash Both Therapy × Antisocial Perseverance Reward Impulsiveness Cash × Antisocial Perseverance Reward Impulsiveness Both × Antisocial Perseverance Reward Impulsiveness Var. Components Therapy Cash Both Intercept Het.
Coefficients Therapy Cash Both -0.760 -0.086 -0.392 (0.24) (0.12) (0.12) -0.740 -0.080 -0.337 (0.26) (0.13) (0.13) -0.679 0.303 -0.513 (0.55) (0.21) (0.55) -0.233 0.494 0.104 -0.155 -0.140 0.127 -0.176 -0.146 0.213 0.106 0.114 -0.307 0.237 -0.055 0.081 1.550 0.000 0.000 0.786 (0.16) (0.16) (0.18) (0.16) (0.12) (0.12) (0.12) (0.12) (0.13) (0.13) (0.12) (0.12) (0.89) (0.00) (0.00) (0.18) 1.621 0.000 0.000 0.838 (0.92) (0.00) (0.00) (0.19) -0.198 -0.041 0.046 -0.081 -0.223 0.191 0.014 0.060 -0.279 0.229 -0.073 0.074 (0.49) (0.20) (0.45) (0.13) (0.13) (0.15) (0.13) (0.10) (0.10) (0.09) (0.08) (0.11) (0.11) (0.10) (0.09) 0.256 -0.546 0.078 (0.33) (0.33) (0.37) -0.046 -0.819 -0.431 (0.37) (0.37) (0.48) Time (in Seconds) 11369.869 11973.093 1.029 2.208 3,312 total observations for 890 individuals, in which dummies for the number of time observations for each individual are included to address the unbalanced panel. Standard errors for both estimators are robust and clustered at the individual level. ME Probit coefficient estimates are scaled by 1/√(1 + σ²ₐ) and standard errors are calculated using the delta method. Table G.19: Scaled Probit Coefficient Estimates for being Arrested ME Probit Pooled Probit RE CRE RE CRE Coeff SE Coeff SE Coeff SE Coeff SE Treatment Therapy Cash Both Therapy × Antisocial Perseverance Reward Impulsiveness Cash × Antisocial Perseverance Reward Impulsiveness Both × Antisocial Perseverance Reward Impulsiveness Var. Components Therapy Cash Both Intercept Het.
Coefficients Therapy Cash Both -0.020 -0.031 -0.305 (0.13) (0.15) (0.17) -0.019 -0.098 -0.222 (0.14) (0.16) (0.17) -0.013 0.397 -1.185 (0.32) (0.21) (0.87) -0.060 0.335 -0.730 -0.055 -0.099 0.027 -0.039 -0.046 -0.182 0.226 -0.021 -0.308 0.072 0.187 0.040 0.068 0.190 0.334 0.149 (0.10) (0.10) (0.10) (0.10) (0.11) (0.11) (0.11) (0.10) (0.13) (0.11) (0.11) (0.11) (0.21) (0.25) (0.27) (0.15) 0.062 0.177 0.522 0.156 (0.20) (0.26) (0.31) (0.14) -0.028 -0.097 0.032 -0.045 -0.088 -0.109 0.180 -0.008 -0.286 0.072 0.229 0.052 (0.47) (0.28) (0.87) (0.12) (0.12) (0.10) (0.10) (0.10) (0.10) (0.10) (0.09) (0.16) (0.14) (0.16) (0.13) 0.006 -0.455 0.646 (0.28) (0.30) (0.37) 0.047 -0.403 0.431 (0.39) (0.36) (0.46) Time (in Seconds) 818.475 827.293 4.746 5.988 3,302 total observations for 880 individuals, in which dummies for the number of time observations for each individual are included to address the unbalanced panel. Standard errors for both estimators are robust and clustered at the individual level. ME Probit coefficient estimates are scaled by 1/√(1 + σ²ₐ) and standard errors are calculated using the delta method. Table G.20: Scaled Probit Coefficient Estimates for Illicit Activity ME Probit Pooled Probit RE CRE RE CRE Coeff SE Coeff SE Coeff SE Coeff SE Treatment Therapy Cash Both Therapy × Antisocial Perseverance Reward Impulsiveness Cash × Antisocial Perseverance Reward Impulsiveness Both × Antisocial Perseverance Reward Impulsiveness Var. Components Therapy Cash Both Intercept Het.
Coefficients Therapy Cash Both -0.972 -0.101 -0.704 (0.31) (0.12) (0.27) -0.877 -0.095 -0.538 (0.30) (0.12) (0.25) -2.003 -0.498 -1.320 (1.65) (0.46) (1.72) -1.242 -0.356 -0.267 -0.309 -0.025 0.137 0.002 -0.118 0.105 0.121 0.007 -0.157 0.382 -0.087 -0.017 1.761 0.000 0.568 0.461 (0.17) (0.17) (0.18) (0.16) (0.11) (0.11) (0.11) (0.11) (0.14) (0.15) (0.13) (0.13) (0.89) (0.00) (0.55) (0.17) 1.988 0.000 1.018 0.422 (0.98) (0.00) (0.66) (0.15) -0.288 -0.147 0.153 -0.028 -0.051 0.084 0.117 0.035 -0.167 0.342 -0.081 -0.013 (1.36) (0.55) (0.90) (0.19) (0.23) (0.27) (0.20) (0.17) (0.13) (0.14) (0.12) (0.15) (0.17) (0.12) (0.13) 0.830 0.327 0.557 (0.54) (0.30) (0.74) 0.542 0.205 -0.075 (0.58) (0.37) (0.71) Time (in Seconds) 4642.346 6735.654 1.063 1.872 3,320 total observations for 882 individuals, in which dummies for the number of time observations for each individual are included to address the unbalanced panel. Standard errors for both estimators are robust and clustered at the individual level. ME Probit coefficient estimates are scaled by 1/√(1 + σ²ₐ) and standard errors are calculated using the delta method.
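The table notes state that the ME Probit coefficients are rescaled by 1/√(1 + σ²ₐ), with standard errors obtained by the delta method. A sketch of that calculation for a single coefficient, using purely illustrative estimates and an illustrative covariance matrix (none of these numbers come from the tables):

```python
import numpy as np

# Hypothetical estimates of one coefficient and the variance component sigma2_a,
# with their joint estimated covariance (illustrative numbers only)
beta_hat, sigma2_hat = -1.2, 0.8
V = np.array([[0.040, 0.005],
              [0.005, 0.090]])

scale = 1.0 / np.sqrt(1.0 + sigma2_hat)
theta = beta_hat * scale                 # scaled coefficient beta / sqrt(1 + sigma2_a)

# Delta method: gradient of g(beta, s) = beta * (1 + s)^(-1/2)
grad = np.array([scale, -0.5 * beta_hat * (1.0 + sigma2_hat) ** -1.5])
se_theta = np.sqrt(grad @ V @ grad)
```

The second gradient entry captures how uncertainty in the estimated variance component propagates into the scaled coefficient's standard error.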
Table G.21: ATE Estimates

                    Therapy Only         Cash Only            Both Cash and Therapy
                    Coeff      SE        Coeff      SE        Coeff      SE
Sells Drugs
  OLS (1)          -0.0569   (0.023)   -0.0074   (0.024)   -0.0704   (0.023)
  OLS (2)          -0.0573   (0.023)   -0.0095   (0.024)   -0.0747   (0.023)
  ME (1)           -0.0621   (0.033)   -0.0162   (0.023)   -0.0653   (0.021)
  ME (2)           -0.0648   (0.033)   -0.0207   (0.023)   -0.0764   (0.021)
  PHP (1)          -0.0638   (0.019)   -0.0202   (0.020)   -0.0686   (0.019)
  PHP (2)          -0.0642   (0.031)    0.0065   (0.059)   -0.0495   (0.055)
Arrested
  OLS (1)          -0.0006   (0.018)    0.0082   (0.019)   -0.0187   (0.019)
  OLS (2)          -0.0005   (0.018)    0.0059   (0.019)   -0.0201   (0.019)
  ME (1)            0.0011   (0.024)    0.0077   (0.025)   -0.0140   (0.027)
  ME (2)            0.0021   (0.025)    0.0072   (0.024)   -0.0163   (0.026)
  PHP (1)          -0.0012   (0.018)   -0.0035   (0.019)   -0.0363   (0.020)
  PHP (2)           0.0002   (0.019)   -0.0001   (0.020)   -0.0298   (0.020)
Illicit Activity
  OLS (1)          -0.0563   (0.021)   -0.0180   (0.021)   -0.0600   (0.021)
  OLS (2)          -0.0587   (0.022)   -0.0211   (0.021)   -0.0649   (0.022)
  ME (1)           -0.0536   (0.034)   -0.0174   (0.020)   -0.0549   (0.029)
  ME (2)           -0.0597   (0.034)   -0.0219   (0.021)   -0.0639   (0.029)
  PHP (1)          -0.0238   (0.034)    0.0049   (0.027)   -0.0246   (0.034)
  PHP (2)          -0.0473   (0.026)   -0.0112   (0.021)   -0.0533   (0.018)

Specification (1) assumes a RE structure such that the random treatment effects are not heterogeneous in terms of the individual characteristics. Specification (2) implements a flexible CRE specification that allows the treatment effects to be heterogeneous in individual characteristics.
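For the probit-based estimators, an ATE like those in Table G.21 is the average difference in fitted probabilities with the treatment dummy switched on and off for every individual. A minimal sketch for a single treatment indicator (the coefficients and data below are hypothetical, not the application's estimates):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N = 500
z = rng.normal(size=N)          # an individual characteristic

# Hypothetical scaled probit coefficients: intercept, treatment, covariate
a, tau, g = -0.4, -0.35, 0.25

# ATE: average the counterfactual probability difference over the sample
p1 = norm.cdf(a + tau + g * z)  # everyone treated
p0 = norm.cdf(a + g * z)        # no one treated
ate = np.mean(p1 - p0)
```

Averaging the difference, rather than differencing at average covariates, keeps the estimate comparable to the OLS (linear probability) ATEs reported alongside it.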
247 Discussion Table G.22: Bias and Std Deviation of Scaled Coefficient Estimates under AR(2) PHP MEP T (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.5520 (0.091) -0.9095 (0.087) -0.0542 (0.083) -0.6284 (0.074) -0.8229 (0.055) -0.0115 (0.059) -0.6678 (0.060) -0.7616 (0.043) 0.0057 (0.043) -0.0159 (0.388) 0.0398 (0.397) 0.0284 (0.088) -0.0580 (0.592) 0.0510 (0.531) 0.0297 (0.058) -0.0247 (1.031) 0.1005 (0.821) 0.0255 (0.043) -0.6439 (0.076) -0.8112 (0.057) -0.0441 (0.056) -0.6898 (0.061) -0.7401 (0.041) -0.0221 (0.039) -0.7055 (0.053) -0.7095 (0.034) -0.0016 (0.030) -0.0193 (0.375) 0.0659 (0.343) 0.0358 (0.059) -0.0647 (0.573) 0.0695 (0.484) 0.0321 (0.040) -0.0063 (1.002) 0.0918 (0.718) 0.0285 (0.030) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 248 Table G.23: Bias and Std Deviation (×10) of APE Estimates under AR(2) T 5 10 20 PHP MEP (1) (2) (1) (2) -0.6657 (0.1284) -0.3966 (0.0987) -0.2096 (0.0820) 0.0024 (0.1274) 0.0044 (0.0936) 0.0059 (0.0768) -0.3645 (0.1238) -0.1409 (0.0898) -0.0510 (0.0743) 0.0042 (0.1228) 0.0058 (0.0914) 0.0072 (0.0750) R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0735. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
Table G.24: Failure Count under no Random Coefficients ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) MEP (2) (1) 5 10 20 0 0 0 6 0 0 525 559 652 448 567 721 0 0 0 4 0 0 6 0 4 8 4 4 0 0 0 7 0 0 0 0 0 1 0 0 Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 249 Table G.25: Estimation Times under no Random Coefficients ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) 0.301 (0.06) 0.538 (0.10) 0.605 (0.07) 0.464 (0.13) 0.695 (0.14) 0.845 (0.15) 6.826 (3.90) 7.991 (4.69) 13.688 (8.29) 10.056 (5.06) 11.219 (6.23) 18.954 (10.54) 0.389 (0.08) 0.533 (0.08) 0.618 (0.08) 0.525 (0.15) 0.686 (0.14) 0.863 (0.16) 3.681 (2.41) 5.078 (3.39) 8.532 (6.13) 5.442 (3.15) 7.038 (4.22) 11.741 (7.63) 0.478 (0.08) 0.532 (0.08) 0.629 (0.08) 0.613 (0.26) 0.692 (0.13) 0.878 (0.16) 4.269 (1.78) 7.455 (3.18) 9.070 (6.56) 5.592 (2.00) 9.362 (3.78) 10.434 (8.16) T 5 10 20 Average estimation time in seconds and standard deviations given in parenthesis. Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
250 Table G.26: Bias and Std Deviation (×10) of APE Estimates under no Random Coefficients ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.0023 (0.084) 0.0016 (0.058) -0.0035 (0.040) -0.0005 (0.092) 0.0006 (0.060) -0.0036 (0.040) -0.0001 (0.081) 0.0031 (0.057) -0.0027 (0.040) 0.0016 (0.088) 0.0029 (0.059) -0.0025 (0.041) 0.0005 (0.093) -0.0005 (0.059) 0.0018 (0.044) 0.0052 (0.099) -0.0005 (0.063) 0.0022 (0.045) 0.0013 (0.089) 0.0005 (0.059) 0.0022 (0.044) 0.0057 (0.095) 0.0013 (0.061) 0.0028 (0.044) 0.0045 (0.094) 0.0017 (0.0672) -0.0008 (0.0438) 0.0044 (0.098) 0.0025 (0.068) -0.0005 (0.044) 0.0018 (0.087) 0.0008 (0.065) -0.0008 (0.043) 0.0024 (0.090) 0.0018 (0.066) -0.0004 (0.044) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.1511. Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 251 Table G.27: Variance Component σ2 1 Estimates under no Random Coefficients T 5 10 20 ρ = 0 ρ = 0.4 ρ = 0.8 (1) (2) (1) (2) (1) (2) 0.0556 (0.064) 0.0232 (0.026) 0.0117 (0.012) 0.0463 (0.059) 0.0186 (0.024) 0.0092 (0.011) 0.2944 (0.154) 0.1456 (0.060) 0.0688 (0.026) 0.2771 (0.153) 0.1366 (0.060) 0.0642 (0.026) 2.1492 (0.916) 0.9861 (0.205) 0.4779 (0.077) 2.1260 (0.944) 0.9650 (0.203) 0.4667 (0.076) Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
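Specification (2) throughout these tables lets the random effects ai and bi correlate with the covariates through the covariates' time averages, the Chamberlain-Mundlak device. A minimal sketch of how that design matrix is built (hypothetical panel data):

```python
import numpy as np

rng = np.random.default_rng(3)
n_id, T = 100, 5
ids = np.repeat(np.arange(n_id), T)
x = rng.normal(size=(n_id * T, 2))

# Mundlak device: append each individual's time averages xbar_i as extra
# regressors, so the random effects may be correlated with x through them
xbar = np.zeros_like(x)
for g in np.unique(ids):
    xbar[ids == g] = x[ids == g].mean(axis=0)

X_spec2 = np.column_stack([np.ones(n_id * T), x, xbar])
```

Specification (1) simply omits the `xbar` columns, which is what makes it misspecified whenever the heterogeneity is in fact correlated with the covariates.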
Table G.28: Variance Component σ2 2 Estimates under no Random Coefficients T 5 10 20 ρ = 0 ρ = 0.4 ρ = 0.8 (1) (2) (1) (2) (1) (2) 0.0164 (0.041) 0.0106 (0.024) 0.0065 (0.013) 0.0107 (0.033) 0.0076 (0.021) 0.0047 (0.011) 0.0383 (0.063) 0.0245 (0.037) 0.0134 (0.019) 0.0268 (0.055) 0.0178 (0.033) 0.0102 (0.017) 0.0808 (0.188) 0.0337 (0.072) 0.0324 (0.039) 0.0524 (0.160) 0.0260 (0.063) 0.0282 (0.035) Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 252 Table G.29: Rejection Rate of LR Test for Random Coefficients ρ = 0 ρ = 0.4 (1) (2) (1) (2) ρ = 0.8 (1) (2) 0.054 0.047 0.055 0.033 0.026 0.027 0.723 0.858 0.881 0.638 0.799 0.841 1 1 1 1 1 1 T 5 10 20 True value of σ2 1 is 0.5. Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
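The rejection rates in Table G.29 come from a likelihood-ratio test for the random-coefficient variance. A sketch of the test statistic with illustrative maximized log-likelihoods (the numbers are hypothetical); note that because a variance is restricted to its boundary under the null, the usual chi-squared reference distribution is conservative:

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods: restricted model imposes sigma2_1 = 0,
# unrestricted model estimates it freely (illustrative values only)
ll_restricted, ll_unrestricted = -1482.7, -1474.1

lr = 2.0 * (ll_unrestricted - ll_restricted)
p_value = chi2.sf(lr, df=1)              # one restricted parameter
reject = lr > chi2.ppf(0.95, df=1)       # conservative, given the boundary issue
```

A refinement under these assumptions would use the 50:50 mixture of chi-squared(0) and chi-squared(1) critical values for the boundary case.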
253 Table G.30: Bias and Std Deviation of De-scaled ME Logit Estimate under a Conditional Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.7919 (0.129) -1.0646 (0.100) -0.0034 (0.090) -0.8592 (0.089) -0.9314 (0.066) 0.0034 (0.061) -0.8660 (0.067) -0.8784 (0.049) 0.0020 (0.040) -0.0614 (0.537) 0.0599 (0.523) 0.0030 (0.091) -0.0628 (0.822) 0.0007 (0.688) 0.0030 (0.061) 0.0009 (1.397) 0.0217 (1.045) 0.0019 (0.039) -1.0504 (0.164) -0.9749 (0.113) 0.1738 (0.102) -0.9699 (0.108) -0.8845 (0.070) 0.0867 (0.065) -0.9119 (0.077) -0.8596 (0.051) 0.0441 (0.042) -0.0641 (0.701) 0.2782 (0.586) 0.1705 (0.099) 0.0248 (1.013) 0.1203 (0.689) 0.0854 (0.065) -0.0367 (1.486) 0.0820 (1.094) 0.0437 (0.042) -1.9744 (0.310) -0.6782 (0.159) 0.8856 (0.192) -1.4747 (0.173) -0.7167 (0.087) 0.5012 (0.093) -1.1741 (0.113) -0.7661 (0.061) 0.2643 (0.054) -0.2474 (1.202) 1.1248 (0.895) 0.8505 (0.187) -0.1786 (1.599) 0.6357 (0.980) 0.4960 (0.093) -0.1120 (2.326) 0.3829 (1.324) 0.2633 (0.054) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
Conditional logistic AR(1) is generated according to equation (3.35) 254 Table G.31: Bias and Std Deviation of De-scaled ME Logit Estimate under a Marginal Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.7962 (0.129) -1.0646 (0.098) 0.0057 (0.086) -0.8561 (0.095) -0.9296 (0.063) 0.0017 (0.059) -0.8632 (0.070) -0.8780 (0.050) 0.0004 (0.039) -0.0541 (0.538) 0.0399 (0.508) 0.0119 (0.087) -0.0393 (0.818) 0.0449 (0.659) 0.0010 (0.059) 0.0045 (1.369) 0.0646 (1.054) 0.0003 (0.040) -1.0212 (0.167) -0.9827 (0.104) 0.1510 (0.106) -0.9424 (0.106) -0.8944 (0.069) 0.0633 (0.061) -0.8933 (0.079) -0.8695 (0.052) 0.0234 (0.039) -0.0769 (0.690) 0.2104 (0.578) 0.1480 (0.106) -0.1014 (0.995) 0.1266 (0.750) 0.0618 (0.061) -0.0981 (1.510) 0.0737 (1.093) 0.0232 (0.040) -1.8377 (0.276) -0.7226 (0.153) 0.7812 (0.177) -1.3900 (0.168) -0.7464 (0.088) 0.4205 (0.086) -1.0903 (0.109) -0.7932 (0.057) 0.1897 (0.049) -0.2540 (1.259) 0.9752 (0.818) 0.7508 (0.171) -0.1505 (1.688) 0.5626 (0.961) 0.4153 (0.085) -0.0510 (2.344) 0.3125 (1.285) 0.1891 (0.049) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
Marginal logistic AR(1) is generated according to equation (3.36) 255 Table G.32: Bias and Std Deviation of Scaled Coefficient Estimates under a Conditional Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEL PHP MEL PHP MEL T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.3808 (0.078) -0.5884 (0.081) 0.0035 (0.074) -0.4309 (0.058) -0.5172 (0.045) 0.0270 (0.050) -0.4536 (0.043) -0.4716 (0.033) 0.0331 (0.032) -0.0389 (0.305) 0.0664 (0.323) 0.0333 (0.074) -0.0447 (0.460) 0.0429 (0.408) 0.0386 (0.051) -0.0053 (0.777) 0.0398 (0.603) 0.0371 (0.032) -0.3876 (0.060) -0.5505 (0.049) -0.0197 (0.044) -0.4267 (0.043) -0.4826 (0.033) -0.0114 (0.030) -0.4367 (0.034) -0.4539 (0.025) -0.0062 (0.020) -0.0321 (0.277) 0.0321 (0.270) 0.0026 (0.047) -0.0321 (0.423) 0.0018 (0.353) 0.0029 (0.031) 0.0003 (0.720) 0.0125 (0.538) 0.0021 (0.020) -0.3824 (0.080) -0.5873 (0.081) 0.0024 (0.074) -0.4268 (0.064) -0.5162 (0.046) 0.0264 (0.050) -0.4502 (0.045) -0.4721 (0.034) 0.0333 (0.031) -0.0172 (0.340) 0.0750 (0.317) 0.0374 (0.072) 0.0117 (0.531) 0.0557 (0.389) 0.0384 (0.050) -0.0185 (0.803) 0.0525 (0.620) 0.0373 (0.032) -0.4311 (0.064) -0.5239 (0.048) -0.0081 (0.042) -0.4410 (0.049) -0.4715 (0.032) -0.0064 (0.029) -0.4409 (0.037) -0.4508 (0.025) -0.0020 (0.020) -0.0135 (0.318) 0.0512 (0.265) 0.0176 (0.043) 0.0198 (0.489) 0.0190 (0.332) 0.0101 (0.030) -0.0146 (0.741) 0.0221 (0.546) 0.0066 (0.020) -0.3783 (0.086) -0.5858 (0.074) 0.0002 (0.077) -0.4238 (0.071) -0.5210 (0.044) 0.0264 (0.049) -0.4511 (0.055) -0.4737 (0.036) 0.0328 (0.034) -0.0183 (0.383) 0.0583 (0.339) 0.0382 (0.073) -0.0292 (0.629) 0.0459 (0.458) 0.0404 (0.049) -0.0260 (1.034) 0.0681 (0.654) 0.0375 (0.033) -0.4683 (0.072) -0.4887 (0.042) -0.0073 (0.044) -0.4696 (0.059) -0.4572 (0.030) 0.0069 (0.031) -0.4619 (0.047) -0.4415 (0.025) 0.0104 (0.023) -0.0176 (0.352) 0.0528 (0.257) 0.0282 (0.046) -0.0269 (0.579) 0.0414 (0.356) 0.0283 (0.032) -0.0254 (0.985) 0.0484 (0.559) 0.0209 (0.023) 
R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. Conditional logistic AR(1) is generated according to equation (3.35) 256 Table G.33: Bias and Std Deviation of Scaled Coefficient Estimates under a Marginal Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEL PHP MEL PHP MEL T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ -0.3855 (0.078) -0.5856 (0.079) 0.0083 (0.072) -0.4298 (0.060) -0.5160 (0.045) 0.0271 (0.048) -0.4515 (0.044) -0.4715 (0.033) -0.0365 (0.304) 0.0654 (0.307) 0.0428 (0.073) -0.0279 (0.457) 0.0626 (0.397) 0.0369 (0.048) -0.0001 (0.760) 0.0620 (0.612) -0.3901 (0.059) -0.5503 (0.048) -0.0149 (0.041) -0.4249 (0.046) -0.4818 (0.031) -0.0125 (0.029) -0.4351 (0.034) -0.4538 (0.026) -0.0280 (0.277) 0.0226 (0.262) 0.0077 (0.044) -0.0199 (0.421) 0.0243 (0.339) 0.0016 (0.029) 0.0024 (0.705) 0.0338 (0.542) -0.3738 (0.084) -0.5854 (0.078) -0.0064 (0.075) -0.4183 (0.059) -0.5190 (0.047) 0.0187 (0.046) -0.4428 (0.046) -0.4763 (0.034) -0.0261 (0.338) 0.0458 (0.310) 0.0292 (0.073) -0.0521 (0.521) 0.0652 (0.426) 0.0294 (0.046) -0.0584 (0.818) 0.0514 (0.617) -0.4223 (0.067) -0.5263 (0.045) -0.0147 (0.043) -0.4311 (0.047) -0.4752 (0.032) -0.0147 (0.027) -0.4336 (0.038) -0.4550 (0.025) -0.0211 (0.314) 0.0244 (0.263) 0.0102 (0.046) -0.0414 (0.483) 0.0245 (0.363) 0.0005 (0.028) -0.0458 (0.754) 0.0197 (0.545) -0.3639 (0.087) -0.5894 (0.079) -0.0123 (0.074) -0.4062 (0.069) -0.5265 (0.045) 0.0066 (0.049) -0.4230 (0.054) -0.4820 (0.033) -0.0273 (0.406) 0.0331 (0.325) 0.0219 (0.069) -0.0410 (0.655) 0.0626 (0.438) 0.0193 (0.047) -0.0117 (1.054) 0.0491 (0.640) -0.4549 (0.069) -0.4948 (0.042) -0.0157 (0.043) -0.4524 (0.058) -0.4638 (0.031) -0.0105 (0.029) 
-0.4350 (0.046) -0.4501 (0.024) -0.0262 (0.380) 0.0333 (0.246) 0.0175 (0.043) -0.0209 (0.623) 0.0274 (0.353) 0.0088 (0.030) -0.0006 (1.004) 0.0270 (0.550) β2σ 0.0323 (0.032) 0.0364 (0.032) -0.0040 (0.022) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. Marginal logistic AR(1) is generated according to equation (3.36) -0.0073 (0.019) -0.0136 (0.021) 0.0072 (0.033) 0.0009 (0.020) 0.0242 (0.031) 0.0284 (0.030) -0.0106 (0.019) -0.0022 (0.019) 0.0117 (0.033) 257 Table G.34: Bias and Std Deviation (×10) of APE Estimates under a Conditional Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.4665 (0.155) -0.2824 (0.104) -0.1498 (0.080) 0.1036 (0.148) 0.0487 (0.101) 0.0272 (0.075) -0.2912 (0.144) -0.1052 (0.101) -0.0322 (0.078) 0.0033 (0.142) 0.0003 (0.099) 0.0017 (0.077) -0.4681 (0.156) -0.2836 (0.104) -0.1508 (0.080) 0.0985 (0.138) 0.0498 (0.098) 0.0241 (0.076) -0.2173 (0.139) -0.0755 (0.098) -0.0258 (0.077) -0.0030 (0.132) 0.0017 (0.097) 0.0000 (0.077) -0.4643 (0.139) -0.2919 (0.100) -0.1520 (0.084) 0.1040 (0.125) 0.0438 (0.089) 0.0230 (0.074) -0.1036 (0.117) -0.0353 (0.088) -0.0096 (0.076) 0.0051 (0.116) -0.0029 (0.088) 0.0000 (0.075) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0585. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
Conditional logistic AR(1) is generated according to equation (3.35) 258 Table G.35: Bias and Std Deviation (×10) of APE Estimates under a Marginal Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.4652 (0.151) -0.2827 (0.098) -0.1498 (0.081) 0.0975 (0.140) 0.0498 (0.099) 0.0264 (0.077) -0.2982 (0.144) -0.1051 (0.099) -0.0314 (0.077) -0.0028 (0.136) -0.0002 (0.098) 0.0023 (0.077) -0.4634 (0.148) -0.2878 (0.105) -0.1551 (0.082) 0.0947 (0.135) 0.0404 (0.100) 0.0196 (0.076) -0.2191 (0.132) -0.0839 (0.098) -0.0326 (0.078) -0.0053 (0.130) -0.0052 (0.097) -0.0055 (0.077) -0.4708 (0.150) -0.2913 (0.105) -0.1629 (0.078) 0.0896 (0.126) 0.0365 (0.094) 0.0096 (0.073) -0.1200 (0.118) -0.0442 (0.090) -0.0213 (0.072) -0.0057 (0.117) -0.0109 (0.090) -0.0109 (0.071) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0582. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. Marginal logistic AR(1) is generated according to equation (3.36) 259 BIBLIOGRAPHY 260 BIBLIOGRAPHY Ackerberg, D., X. Chen, and J. Hahn (2012): ‘A practical asymptotic variance estimator for two-step semiparametric estimators,’ Review of Economics and Statistics, 94(2), 481–498. Ai, C., and X. Chen (2003): ‘Efficient estimation of models with conditional moment restric- tions containing unknown functions,’ Econometrica, 71(6), 1795–1843. Akin, J. S., D. K. Guilkey, and R. Sickles (1979): ‘A random coefficient probit model with an application to a study of migration,’ Journal of Econometrics, 11(2), 233 – 246. Bertrand, M., E. Duflo, and S. Mullainathan (2004): ‘How Much Should We Trust Differences-In-Differences Estimates?*,’ The Quarterly Journal of Economics, 119(1), 249– 275. Blattman, C., J. 
C. Jamison, and M. Sheridan (2017): ‘Reducing Crime and Violence: Experimental Evidence from Cognitive Behavioral Therapy in Liberia,’ American Economic Review, 107(4), 1165–1206.

Blundell, R., and R. L. Matzkin (2014): ‘Control functions in nonseparable simultaneous equations models,’ Quantitative Economics, 5(2), 271–295.

Blundell, R., and J. L. Powell (2003): ‘Endogeneity in nonparametric and semiparametric regression models,’ Econometric Society Monographs, 36, 312–357.

Blundell, R. W., and J. L. Powell (2004): ‘Endogeneity in Semiparametric Binary Response Models,’ The Review of Economic Studies, 71(3), 655–679.

Chen, X. (2007): ‘Large sample sieve estimation of semi-nonparametric models,’ Handbook of Econometrics, 6, 5549–5632.

Chen, X., V. Chernozhukov, S. Lee, and W. K. Newey (2014): ‘Local identification of nonparametric and semiparametric models,’ Econometrica, 82(2), 785–809.

Chen, X., O. Linton, and I. Van Keilegom (2003): ‘Estimation of semiparametric models when the criterion function is not smooth,’ Econometrica, 71(5), 1591–1608.

D’Haultfœuille, X., and P. Février (2015): ‘Identification of nonseparable triangular models with discrete instruments,’ Econometrica, 83(3), 1199–1210.

Dong, Y., and A. Lewbel (2015): ‘A simple estimator for binary choice models with endogenous regressors,’ Econometric Reviews, 34(1-2), 82–105.

Duckworth, A. L., and P. D. Quinn (2009): ‘Development and validation of the Short Grit Scale (GRIT–S),’ Journal of Personality Assessment, 91(2), 166–174.

Escanciano, J. C., D. Jacho-Chávez, and A. Lewbel (2016): ‘Identification and estimation of semiparametric two-step models,’ Quantitative Economics, 7(2), 561–589.

Fernández-Val, I. (2009): ‘Fixed effects estimation of structural parameters and marginal effects in panel probit models,’ Journal of Econometrics, 150(1), 71–85.

Florens, J.-P., J. J. Heckman, C. Meghir, and E.
Vytlacil (2008): ‘Identification of treatment effects using control functions in models with continuous, endogenous treatment and heterogeneous effects,’ Econometrica, 76(5), 1191–1206.

Gandhi, A., K. I. Kim, and A. Petrin (2013): ‘Identification and Estimation in Discrete Choice Demand Models when Endogenous Variables Interact with the Error.’

Greene, W. (2004): ‘The behaviour of the maximum likelihood estimator of limited dependent variable models in the presence of fixed effects,’ The Econometrics Journal, 7(1), 98–119.

Greene, W. (2011): Econometric Analysis. Pearson Education.

Hahn, J., Z. Liao, and G. Ridder (2018): ‘Nonparametric two-step sieve M estimation and inference,’ Econometric Theory, pp. 1–44.

Hahn, J., and G. Ridder (2011): ‘Conditional moment restrictions and triangular simultaneous equations,’ Review of Economics and Statistics, 93(2), 683–689.

Hall, P., and J. L. Horowitz (2005): ‘Nonparametric methods for inference in the presence of instrumental variables,’ The Annals of Statistics, 33(6), 2904–2929.

Harvey, A. C. (1976): ‘Estimating regression models with multiplicative heteroscedasticity,’ Econometrica, 44(3), 461–465.

Hausman, J. A., and D. A. Wise (1978): ‘A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences,’ Econometrica, 46(2), 403–426.

Hoderlein, S., H. Holzmann, M. Kasy, and A. Meister (2016): ‘Erratum: Instrumental Variables with Unrestricted Heterogeneity and Continuous Treatment,’ The Review of Economic Studies, forthcoming.

Hoderlein, S., H. Holzmann, and A. Meister (2017): ‘The triangular model with random coefficients,’ Journal of Econometrics, 201(1), 144–169.

Hong, H., and E. Tamer (2003): ‘Endogenous binary choice model with median restrictions,’ Economics Letters, 80(2), 219–225.

Horowitz, J. L. (1992): ‘A smoothed maximum score estimator for the binary response model,’ Econometrica: Journal of the Econometric Society, pp.
505–531.

Ichimura, H., and L.-F. Lee (1991): ‘Semiparametric least squares estimation of multiple index models: single equation estimation,’ in Nonparametric and Semiparametric Methods in Econometrics and Statistics: Proceedings of the Fifth International Symposium in Economic Theory and Econometrics, pp. 3–49. Cambridge University Press.

Imbens, G. W., and W. K. Newey (2009): ‘Identification and estimation of triangular simultaneous equations models without additivity,’ Econometrica, 77(5), 1481–1512.

Kasy, M. (2011): ‘Identification in triangular systems using control functions,’ Econometric Theory, 27(3), 663–671.

(2014): ‘Instrumental variables with unrestricted heterogeneity and continuous treatment,’ The Review of Economic Studies, 81(4), 1614–1636.

Keane, M. P. (1994): ‘A Computationally Practical Simulation Estimator for Panel Data,’ Econometrica, 62(1), 95–116.

Khan, S. (2013): ‘Distribution free estimation of heteroskedastic binary response models using Probit/Logit criterion functions,’ Journal of Econometrics, 172(1), 168–182.

Kim, K. I., and A. Petrin (2017): ‘A New Control Function Approach for Non-Parametric Regressions with Endogenous Variables.’

Klein, R., and F. Vella (2009): ‘A semiparametric model for binary response and continuous outcomes under index heteroscedasticity,’ Journal of Applied Econometrics, 24(5), 735–762.

Krief, J. M. (2014): ‘An integrated kernel-weighted smoothed maximum score estimator for the partially linear binary response model,’ Econometric Theory, 30(3), 647–675.

Lewbel, A. (2000): ‘Semiparametric qualitative response model estimation with unknown heteroscedasticity or instrumental variables,’ Journal of Econometrics, 97(1), 145–177.

(forthcoming): ‘The identification zoo: meanings of identification in econometrics,’ Journal of Economic Literature.

Lewbel, A., Y. Dong, and T. T.
Yang (2012): ‘Comparing features of convenient estimators for binary choice models with endogenous regressors,’ Canadian Journal of Economics/Revue canadienne d’économique, 45(3), 809–829.

Lin, W., and J. M. Wooldridge (2015): ‘On different approaches to obtaining partial effects in binary response models with endogenous regressors,’ Economics Letters, 134, 58–61.

Manski, C. F. (1985): ‘Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator,’ Journal of Econometrics, 27(3), 313–333.

Manski, C. F. (1988): ‘Identification of Binary Response Models,’ Journal of the American Statistical Association, 83(403), 729–738.

McCulloch, C. E., and J. M. Neuhaus (2001): Generalized Linear Mixed Models. Wiley Online Library.

Newey, W. K. (1994): ‘The asymptotic variance of semiparametric estimators,’ Econometrica: Journal of the Econometric Society, pp. 1349–1382.

(2013): ‘Nonparametric instrumental variables estimation,’ American Economic Review, 103(3), 550–56.

Newey, W. K., and D. McFadden (1994): ‘Chapter 36: Large sample estimation and hypothesis testing,’ Handbook of Econometrics, 4, 2111–2245.

Newey, W. K., and J. L. Powell (2003): ‘Instrumental variable estimation of nonparametric models,’ Econometrica, 71(5), 1565–1578.

Newey, W. K., J. L. Powell, and F. Vella (1999): ‘Nonparametric estimation of triangular simultaneous equations models,’ Econometrica, 67(3), 565–603.

Petrin, A., and K. Train (2010): ‘A Control Function Approach to Endogeneity in Consumer Choice Models,’ Journal of Marketing Research, 47(1), 3–13.

Pinkse, J. (2000): ‘Nonparametric Two-Step Regression Estimation When Regressors and Error Are Dependent,’ The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 28(2), 289–300.

Rivers, D., and Q. H. Vuong (1988): ‘Limited information estimators and exogeneity tests for simultaneous probit models,’ Journal of Econometrics, 39(3), 347–366.

Rothe, C.
(2009): ‘Semiparametric estimation of binary response models with endogenous regressors,’ Journal of Econometrics, 153(1), 51–64.

Rothenberg, T. J. (1971): ‘Identification in parametric models,’ Econometrica: Journal of the Econometric Society, pp. 577–591.

Sim, C. H. (1993): ‘First-Order Autoregressive Logistic Processes,’ Journal of Applied Probability, 30(2), 467–470.

Smith, R. J., and R. W. Blundell (1986): ‘An exogeneity test for a simultaneous equation Tobit model with an application to labor supply,’ Econometrica: Journal of the Econometric Society, pp. 679–685.

Song, W. (2016): ‘A Semiparametric Estimator for Binary Response Models with Endogenous Regressors.’

Stock, J. H., J. H. Wright, and M. Yogo (2002): ‘A survey of weak instruments and weak identification in generalized method of moments,’ Journal of Business & Economic Statistics, 20(4), 518–529.

Su, L., and A. Ullah (2008): ‘Local polynomial estimation of nonparametric simultaneous equations models,’ Journal of Econometrics, 144(1), 193–218.

Swamy, P., and G. S. Tavlas (1995): ‘Random Coefficient Models: Theory and Applications,’ Journal of Economic Surveys, 9(2), 165.

Swamy, P. A. V. B. (1970): ‘Efficient Inference in a Random Coefficient Regression Model,’ Econometrica, 38(2), 311–323.

Torgovitsky, A. (2015): ‘Identification of nonseparable models using instruments with small support,’ Econometrica, 83(3), 1185–1197.

Wooldridge, J. (2010): Econometric Analysis of Cross Section and Panel Data. MIT Press.

Wooldridge, J. M. (2005): ‘Unobserved heterogeneity and estimation of average partial effects,’ Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, pp. 27–55.

(2018): ‘Correlated Random Effects Models with Unbalanced Panels,’ Journal of Econometrics.

Wooldridge, J. M., and Y.
Zhu (forthcoming): ‘Inference in Approximately Sparse Correlated Random Effects Probit Models,’ Journal of Business and Economic Statistics.