ESSAYS ON DISCRETE MULTIVALUED TREATMENTS WITH ENDOGENEITY AND HETEROGENEOUS COUNTERFACTUAL ERRORS

By

Ibrahim Kekec

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2021

ABSTRACT

This dissertation is composed of three chapters, each of which studies discrete multivalued treatments with endogeneity and heterogeneous counterfactual errors.

The first chapter extends the investigation of average treatment effects (ATEs) in extensively studied binary treatments to discrete multivalued treatments with both endogeneity and heterogeneous counterfactual errors and explores the behavior of control function (CF) and instrumental variables (IV) methods in this framework. Specifically, I offer identification strategies for the ATEs, suggest a consistent estimator for the ATEs, show the asymptotic properties of the CF parameter estimates, and derive a score test to draw inferences about the ATEs and other parameters of interest. Moreover, using a Monte Carlo simulation analysis, I compare the CF method with the widely used IV method in terms of asymptotic efficiency, asymptotic unbiasedness, and consistency. Simulation results suggest that the CF method can be asymptotically up to 12% more efficient than the IV method, and the asymptotic bias in the IV parameter estimates can be as high as 43%. However, when misspecification is introduced, the simulation results favor the IV method. For the empirical illustration, I apply ordinary least squares (OLS), CF, IV, and nonparametric bound analysis to the estimation of how limited English proficiency (LEP) influences the wages of Hispanic workers in the USA. The data come from the 1% Public Use Microdata Series of the 1990 US Census. Utilizing age at arrival as an instrumental variable, both the OLS and CF methods indicate that LEP on average imposes a statistically significant wage penalty (up to 79% in some CF estimates) on the Hispanic community in the USA. The IV method mostly produces insignificant results, and the nonparametric bound analysis provides uninformative lower bounds.

The second chapter incorporates a structure of correlated random coefficients (CRCs) into the framework introduced in the first chapter. In this new setting with CRCs, however, the conventional IV method is suspected to be inconsistent for the ATEs. In this chapter, I propose a consistent CF estimation procedure for the ATEs and show the asymptotic properties of the CF parameter estimates. In addition, my Monte Carlo simulation analysis suggests that, in the absence of misspecification, the CF method is asymptotically unbiased and consistent (but not necessarily more efficient), whereas the IV method is generally asymptotically biased and inconsistent. In the presence of misspecification, the simulation results show that both the CF and IV methods deliver biased estimates (more so for the CF estimates). With regard to efficiency, the simulation findings show that neither method clearly outperforms the other.

In the third chapter, I take the treatment model from the first chapter to a specific linear high dimensional sparse setting where the high dimensional variables are irrelevant in treatment choice given the instruments and appear only in the outcome equation.
Using a detailed simulation analysis, I examine the finite sample properties, model selection features, and prediction capabilities of several machine learning (ML) methods and of the CF method from the first chapter. To estimate the parameters of interest, I use four different ML methods: LASSO; the post partial-out LASSO of Belloni et al. (2012); the post double selection LASSO of Belloni, Chernozhukov, and Hansen (2014a); and the double/debiased ML LASSO of Chernozhukov et al. (2018). The most important simulation result is that, in the presence of enough extra predictive variables that are ignorable in treatment selection and come from a set of high dimensional predictors of the outcome, the more complicated LASSO-based methods yield efficiency gains in the ATE estimates over the simpler CF method, although the LASSO-based methods and the CF method perform more or less the same as far as finite sample bias is concerned. As far as model selection goes, the simulations show that the double/debiased ML LASSO both selects the largest number of candidate variables and correctly selects the largest number of variables with truly nonzero impact on the outcome. As to prediction, the simulation results suggest that LASSO has the best prediction features.

Copyright by
IBRAHIM KEKEC
2021

To my family, for always supporting me.

ACKNOWLEDGMENTS

I would like to thank my advisor, Jeffrey M. Wooldridge, for his guidance during my doctoral journey and all the suggestions he offered on previous versions of my dissertation. I also thank my committee members, Kyoo il Kim, Peter Schmidt, and Chih-Li Sung, for their comments. My special thanks go to Steven Haider, Todd Elder, and Soren Anderson for their useful advice; Jay Feight, Lori Jean Nichols, and Margaret Lynch for their administrative help; and Dean Olson and the High Performance Computing Center at the Michigan State University Institute for Cyber-Enabled Research for technical support. I also would like to convey my sincere appreciation to Margie Tieslau for sowing the love of econometrics in me and my deep gratitude to Joan G. Staniswalis for encouraging me when I really needed it. My dear friends Yi Li and Fei Jia have my deep thanks for lending me their helping hands during the ups and downs of being a doctoral student. Lastly and most importantly, I extend my most profound gratitude to my family for always supporting me unconditionally no matter what. I am indebted to my sister for urging me to start this journey and always being there for me. I am forever thankful to my parents for their never-ending love that has shaped me into the person that I am now. To their sacrifices, I dedicate this milestone in my life. I could not have done it without your love and help.

TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1  IDENTIFICATION, ESTIMATION, AND INFERENCE FOR MULTIVALUED ENDOGENOUS TREATMENT EFFECT MODELS: A CONTROL FUNCTION APPROACH
  1.1 Introduction
  1.2 The Model
    1.2.1 The Model with η_{g,j} = η_j: A Special Case
  1.3 Identification
  1.4 Estimation
    1.4.1 IV Estimation
    1.4.2 CF Estimation
    1.4.3 The Model with η_{g,j} = η_j: Estimation
  1.5 Asymptotic Normality Results
    1.5.1 Method of Moments Framework
    1.5.2 The Model with η_{g,j} = η_j: Asymptotics
  1.6 Hypothesis Testing
    1.6.1 The Model with η_{g,j} = η_j: Hypothesis Testing
  1.7 Simulations
    1.7.1 Data Generating Process
    1.7.2 Simulation Results
      1.7.2.1 Asymptotic Efficiency Outcomes
      1.7.2.2 Asymptotic Unbiasedness and Consistency Outcomes
  1.8 Empirical Application
    1.8.1 Background on the Economics of Language Skills
    1.8.2 Data
    1.8.3 Regression Results
  1.9 Conclusion

CHAPTER 2  ESTIMATION AND INFERENCE FOR MULTIVALUED ENDOGENOUS TREATMENT EFFECT MODELS WITH CORRELATED RANDOM COEFFICIENTS
  2.1 Introduction
  2.2 The Model
  2.3 Estimation
    2.3.1 IV Estimation
    2.3.2 CF Estimation
  2.4 Asymptotic Normality Results
    2.4.1 Method of Moments Framework
  2.5 Simulations
    2.5.1 Data Generating Process
    2.5.2 Simulation Results
      2.5.2.1 Asymptotic Efficiency Outcomes
      2.5.2.2 Asymptotic Unbiasedness and Consistency Outcomes
  2.6 Conclusion

CHAPTER 3  ESTIMATION FOR MULTIVALUED ENDOGENOUS TREATMENT EFFECT MODELS USING HIGH DIMENSIONAL METHODS: A SIMULATION STUDY
  3.1 Introduction
  3.2 The Model
  3.3 Estimation
    3.3.1 LASSO Estimation
    3.3.2 Post Partial-out LASSO Estimation
    3.3.3 Post Double Selection LASSO Estimation
    3.3.4 Double/Debiased ML LASSO Estimation
  3.4 Simulations
    3.4.1 Data Generating Process
    3.4.2 Simulation Results
      3.4.2.1 Bias and Efficiency Outcomes
      3.4.2.2 Prediction and Model Selection Outcomes
  3.5 Conclusion

APPENDICES
  APPENDIX A  APPENDIX FOR CHAPTER 1
  APPENDIX B  APPENDIX FOR CHAPTER 2
  APPENDIX C  APPENDIX FOR CHAPTER 3

BIBLIOGRAPHY

LIST OF TABLES

Table A.1: Model without Correlated Random Coefficients but with Asymmetric Instrument, N=1000, and I=10000
Table A.2: Model without Correlated Random Coefficients but with Asymmetric Instrument, N=2000, and I=10000
Table A.3: Model without Correlated Random Coefficients but with Asymmetric Instrument, N=5000, and I=10000
Table A.4: Model without Correlated Random Coefficients but with Asymmetric Instrument, N=10000, and I=10000
Table A.5: Model without Correlated Random Coefficients but with Symmetric Instrument, N=1000, and I=10000
Table A.6: Model without Correlated Random Coefficients but with Symmetric Instrument, N=2000, and I=10000
Table A.7: Model without Correlated Random Coefficients but with Symmetric Instrument, N=5000, and I=10000
Table A.8: Model without Correlated Random Coefficients but with Symmetric Instrument, N=10000, and I=10000
Table A.9: Model without Correlated Random Coefficients but with Asymmetric Instrument, η_{g,j} = η_j, N=1000, and I=10000
Table A.10: Model without Correlated Random Coefficients but with Asymmetric Instrument, η_{g,j} = η_j, N=2000, and I=10000
Table A.11: Model without Correlated Random Coefficients but with Asymmetric Instrument, η_{g,j} = η_j, N=5000, and I=10000
Table A.12: Model without Correlated Random Coefficients but with Asymmetric Instrument, η_{g,j} = η_j, N=10000, and I=10000
Table A.13: Model without Correlated Random Coefficients but with Symmetric Instrument, η_{g,j} = η_j, N=1000, and I=10000
Table A.14: Model without Correlated Random Coefficients but with Symmetric Instrument, η_{g,j} = η_j, N=2000, and I=10000
Table A.15: Model without Correlated Random Coefficients but with Symmetric Instrument, η_{g,j} = η_j, N=5000, and I=10000
Table A.16: Model without Correlated Random Coefficients but with Symmetric Instrument, η_{g,j} = η_j, N=10000, and I=10000
Table A.17: Model without Correlated Random Coefficients but with Misspecification and Asymmetric Instrument, N=1000, and I=10000
Table A.18: Model without Correlated Random Coefficients but with Misspecification and Asymmetric Instrument, N=2000, and I=10000
Table A.19: Model without Correlated Random Coefficients but with Misspecification and Asymmetric Instrument, N=5000, and I=10000
Table A.20: Model without Correlated Random Coefficients but with Misspecification and Asymmetric Instrument, N=10000, and I=10000
Table A.21: Variables Description and Summary Statistics, N=38779
Table A.22: English Proficiency, Earnings, and Other Characteristics
Table A.23: Multinomial Logit Regressions of English Proficiency, N=38779
Table A.24: Multinomial Logit Regressions of English Proficiency (Continued), N=38779
Table A.25: Average Treatment Effects (ATEs) of English Proficiency on Log Hourly Wages, N=38779
Table A.26: Average Treatment Effects (ATEs) of English Proficiency on Log Hourly Wages (Continued), N=38779
Table A.27: Average Treatment Effects (ATEs) of English Proficiency on Log Hourly Wages with Bounds, N=38779
Table A.28: Average Treatment Effects (ATEs) of English Proficiency on Log Hourly Wages with Bounds, Only Male, N=25568
Table A.29: Average Treatment Effects (ATEs) of English Proficiency on Log Hourly Wages with Bounds, Only Female, N=13211
Table A.30: Average Treatment Effects (ATEs) of English Proficiency on Log Hourly Wages with Bounds, Only Operators, N=9622
Table A.31: Average Treatment Effects (ATEs) of English Proficiency on Log Hourly Wages with Bounds, Only Repair, N=6209
Table A.32: Average Treatment Effects (ATEs) of English Proficiency on Log Hourly Wages with Bounds, Only Service, N=5417
Table B.1: Model with Correlated Random Coefficients and Asymmetric Instrument, N=1000, and I=10000
Table B.2: Model with Correlated Random Coefficients and Asymmetric Instrument, N=2000, and I=10000
Table B.3: Model with Correlated Random Coefficients and Asymmetric Instrument, N=5000, and I=10000
Table B.4: Model with Correlated Random Coefficients and Asymmetric Instrument, N=10000, and I=10000
Table B.5: Model with Correlated Random Coefficients and Symmetric Instrument, N=1000, and I=10000
Table B.6: Model with Correlated Random Coefficients and Symmetric Instrument, N=2000, and I=10000
Table B.7: Model with Correlated Random Coefficients and Symmetric Instrument, N=5000, and I=10000
Table B.8: Model with Correlated Random Coefficients and Symmetric Instrument, N=10000, and I=10000
Table B.9: Model with Correlated Random Coefficients, Misspecification, Asymmetric Instrument, N=1000, and I=10000
Table B.10: Model with Correlated Random Coefficients, Misspecification, Asymmetric Instrument, N=2000, and I=10000
Table B.11: Model with Correlated Random Coefficients, Misspecification, Asymmetric Instrument, N=5000, and I=10000
Table B.12: Model with Correlated Random Coefficients, Misspecification, Asymmetric Instrument, N=10000, and I=10000
Table B.13: Model with Correlated Random Coefficients, Misspecification, Symmetric Instrument, N=1000, and I=10000
Table B.14: Model with Correlated Random Coefficients, Misspecification, Symmetric Instrument, N=2000, and I=10000
Table B.15: Model with Correlated Random Coefficients, Misspecification, Symmetric Instrument, N=5000, and I=10000
Table B.16: Model with Correlated Random Coefficients, Misspecification, Symmetric Instrument, N=10000, and I=10000
Table C.1: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 4, p = 2100, p_0 = 27, f = 15, N=1000, Correlated h, and I=1000
Table C.2: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 4, p = 2100, p_0 = 27, f = 15, N=1250, Correlated h, and I=1000
Table C.3: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 4, p = 2100, p_0 = 27, f = 15, N=1500, Correlated h, and I=1000
Table C.4: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 4, p = 2100, p_0 = 27, f = 15, N=2000, Correlated h, and I=1000
Table C.5: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 1, p = 2100, p_0 = 18, f = 15, N=1000, Correlated h, and I=1000
Table C.6: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 1, p = 2100, p_0 = 18, f = 15, N=1250, Correlated h, and I=1000
Table C.7: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 1, p = 2100, p_0 = 18, f = 15, N=1500, Correlated h, and I=1000
Table C.8: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 1, p = 2100, p_0 = 18, f = 15, N=2000, Correlated h, and I=1000
Table C.9: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 4, p = 2100, p_0 = 27, f = 15, N=1000, Uncorrelated h, and I=1000
Table C.10: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 4, p = 2100, p_0 = 27, f = 15, N=1250, Uncorrelated h, and I=1000
Table C.11: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 4, p = 2100, p_0 = 27, f = 15, N=1500, Uncorrelated h, and I=1000
Table C.12: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 4, p = 2100, p_0 = 27, f = 15, N=2000, Uncorrelated h, and I=1000
Table C.13: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 1, p = 2100, p_0 = 18, f = 15, N=1000, Uncorrelated h, and I=1000
Table C.14: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 1, p = 2100, p_0 = 18, f = 15, N=1250, Uncorrelated h, and I=1000
Table C.15: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 1, p = 2100, p_0 = 18, f = 15, N=1500, Uncorrelated h, and I=1000
Table C.16: High Dimensional Sparse Model with Heterogeneous Counterfactual Errors, l_{h1} = 1, p = 2100, p_0 = 18, f = 15, N=2000, Uncorrelated h, and I=1000

CHAPTER 1

IDENTIFICATION, ESTIMATION, AND INFERENCE FOR MULTIVALUED ENDOGENOUS TREATMENT EFFECT MODELS: A CONTROL FUNCTION APPROACH

1.1 Introduction

In the economics literature, a great deal of interest lies in the estimation of average treatment effects (ATEs), since they give economists a way to evaluate the effects of government programs and policies (such as school voucher programs), the effects of personal choices (such as higher education attendance), and the effects of many other institutional or personal decisions. To quantify this interest: in 2020 so far, even in the midst of the Covid-19 pandemic, nine articles related to ATEs have been published in the journal Econometrica alone.

The economics literature on ATEs is fairly large. For many economists, the gold standard for estimating ATEs is the use of randomized control trials or natural experiments; however, such experiments are rare, mostly due to the financial constraints or ethical problems (e.g., assigning some students to the control group against their will while investigating the impact of a tutoring program on students' academic achievement) associated with an experiment. For these (and other similar) reasons, economists often use observational data and employ estimation methods suited to observational data. With observational data, there are three major methods in the ATE estimation arsenal of an economist: methods employing ignorability, regression discontinuity designs, and instrumental variables (IV) and control function (CF) methods.

Rosenbaum and Rubin (1983) first used the ignorability assumption in the context of ATEs; the assumption states the independence of the treatment variable and the counterfactual outcomes given observed control variables. With the help of this assumption, the classical methods to estimate ATEs are regression adjustment, propensity score weighting, and matching. For those interested in further information on these methods, both Wooldridge (2010) and Cameron and Trivedi (2005) provide decent coverage.
In certain cases where a discontinuity in policy assignment is the direct result of some ad hoc institutional regulation or rule, outcome differences between those who are treated and those who are not can indeed be ascribed to treatment statuses. Regression discontinuity designs make it possible to estimate ATEs at the discontinuity point in these very cases. For more information on this method, see Hahn, Todd, and van der Klaauw (2001); van der Klaauw (2002); and Imbens and Lemieux (2008).

When the treatment status is truly endogenous in ATE models, the usual ignorability assumption fails. In such a case, the conventional methods to estimate ATEs, such as regression adjustment or propensity score weighting, prove ineffective. Among the alternatives, the IV method is one of the most commonly used, if not the most commonly used, estimation methods when treatment choice is endogenous. Angrist, Imbens, and Rubin (1996); Heckman (1997); and Wooldridge (1997, 2003) are all useful for reference purposes and discuss the estimation of ATEs using IV in detail. The CF method, in which the estimating equation is augmented with functions of all relevant observed covariates (including the endogenous treatment variable), is an alternative in such an endogenous setting as well. Historically, the CF method dates back as early as the 1970s. Heckman (1979) and Lee (1978) can be considered two examples of the early works where the idea behind the CF method was used, although they did not use the very words "control function" while referring to their estimation methods. Specifically, Heckman (1979) used the idea to take care of sample selection bias, while the switching regression model of Lee (1978), where union membership changes the wage equations for workers, employed the idea to examine the relationship between labor unions and wage rates. For review purposes, Vella and Verbeek (1999) give a good discussion of the CF method in a setting where a binary treatment variable is endogenous and analyze the relationship between the CF and IV methods. In addition, Wooldridge (2015) provides a very comprehensive view of CF methods in both linear and nonlinear models with endogenous explanatory variables and argues that CF methods are, in a complementary fashion, alternatives to traditional IV methods. Furthermore, ATE models are part of the program evaluation literature, so for an overview of standard methods and advances in the theoretical aspects of program evaluation, see Imbens and Wooldridge (2009) and Abadie and Cattaneo (2018).

When the treatment status takes on more than two values, economists step beyond binary treatment effects into the realm of multivalued treatment effects, where the number of treatments can be more than two but finite. Seminal works in causal inference with multivalued treatments were developed by Rubin (1978) and Robins (1989a, 1989b). Building on these works, Angrist and Imbens (1995) advanced causal inference with multivalued treatments by both going beyond binary treatments and utilizing the concept of counterfactuals in their IV analysis of the effect of schooling on earnings. Since these pioneering works, academic interest in ATEs has continued without any reduction in intensity. Recent survey articles include, but are not limited to, Heckman and Vytlacil (2007a, ch. 70; 2007b, ch. 71), Imbens and Wooldridge (2009), and Linden et al. (2016), the last specifically for multivalued treatments under ignorability.
In the ATE literature, the latent choice model for treatment statuses is a discrete choice model, a topic on which the literature has reached maturity. Besides survey articles (see, for example, Amemiya, 1981), there are several textbooks and book chapters devoted to discrete choice analysis (a.k.a. qualitative response models). To name a few, see Daganzo (1979) specifically on multinomial probit models, Ben-Akiva and Lerman (1985), Maddala (1986), McFadden (1984, ch. 24), Amemiya (1985, ch. 9), Maddala and Flores-Lagunes (2001, ch. 17), Cameron and Trivedi (2005, chs. 14 and 15), and Wooldridge (2010, chs. 15 and 16).

One of the first studies to utilize discrete choice models with multiple choices is Dubin and McFadden (1984), who studied the demand for electricity by residences. To estimate the electricity demand model and its price and income elasticities, they included appliance dummy variables that follow a discrete choice model. Since economic theory suggests that these dummy variables are endogenous, Dubin and McFadden estimated the demand for electricity by IV and CF methods, and in their final analysis they obtained elasticities of similar magnitude from both estimation methods.

Another paper on the subject comes from Bourguignon, Fournier, and Gurgand (2007). In their survey paper, they presented the set of available methods when the multinomial logit model (MNL) underlies the discrete choice model; the three approaches in their paper were those studied by Lee (1983), Dubin and McFadden (1984), and Dahl (2002). After showing the advantages and disadvantages of these approaches, they concluded that Dubin and McFadden's model and Dahl's model were more efficient than the other specifications. In addition, Bourguignon, Fournier, and Gurgand (2007) expanded the set of methods at researchers' disposal by allowing correlations between different choices.

In a binary treatment effect model with an endogenous treatment variable, if the counterfactual errors are heterogeneous in the sense that they depend on the treatment status, then conventional IV estimation in general leads to inconsistent estimates, including of the ATEs. For example, Angrist (1991, p. 15) delineates the problems of estimating ATEs by the IV method while outlining the functional form restrictions under which the method provides consistent estimates of ATEs. He shows that the IV method can lead to asymptotically biased and inconsistent estimates of ATEs when the error term in the outcome equation interacts with the endogenous binary treatment. Similarly, Heckman and Robb (1985b, p. 196) portray when the IV method is appropriate for consistent estimation of ATEs in binary treatments and state that the standard IV method does not estimate ATEs consistently in a framework where the error term of the estimating equation contains components interacting with the endogenous binary treatment, but the CF method does. Wooldridge (1997, p. 131; 2003, p. 191) also mentions this issue clearly. Since heterogeneous counterfactual errors result in composite error terms including interactions with the endogenous treatment, one can deduce that consistent estimation of ATEs in multivalued treatments by the IV method hinges on whether the counterfactual errors are homogeneous.
In the discrete ATE literature, most of the attention has been devoted to binary treatment models with endogeneity, which leaves an untapped area of research in discrete multivalued treatments with endogeneity. This chapter extends the investigation of ATEs in binary treatments to discrete multivalued treatments with both endogeneity and heterogeneous counterfactual errors and comparatively explores the behavior of the CF and IV methods in this framework, which, to the best of my knowledge, has not been examined and constitutes my main contribution to the literature. Specifically, in this chapter, I offer identification strategies for the ATEs, suggest a consistent CF estimator for the ATEs, show the asymptotic properties of the CF parameter estimates, and derive a score test to draw inferences about the ATEs and other parameters of interest in a discrete multivalued treatment model with endogeneity and heterogeneous counterfactual errors. I follow the latent choice model setup laid out by Dubin and McFadden (1984) for the endogenous treatment variable. Under this setup, the endogenous treatment variable follows a multinomial logit reduced form. This key observation enables me to calculate expectations of the heterogeneous counterfactual errors conditional on all the exogenous variables and the endogenous treatment variable, which plays a critical role in deriving the estimating equation for the CF method. I argue that the CF method can be more efficient than the IV method when the counterfactual errors are homogeneous and that the IV parameter estimates can suffer from considerable biases when the counterfactual errors are heterogeneous. However, when misspecification is introduced, my findings slightly favor the IV method.

The rest of this chapter is organized as follows. In section 1.2, I introduce the model. In section 1.3, I discuss identification strategies for the model. In section 1.4, I derive the estimating equations for both the CF and IV methods and propose procedures to estimate the parameters of interest and the ATEs for both methods. In section 1.5, I show the asymptotic properties of the CF estimates, propose a consistent estimator for the asymptotic variance matrix of the CF estimates, and show how a GMM framework can be set up for the main problem. In section 1.6, I suggest a score test to draw inferences about the ATEs and the parameters of interest. In section 1.7, I share some simulation results. In section 1.8, I apply ordinary least squares (OLS), CF, IV, and nonparametric bound analysis to estimate the impact of English proficiency on the wages of Hispanic workers in the USA. In section 1.9, I conclude. And, in appendix A, I share the derivations, simulation tables, and empirical analysis tables that are omitted from the main body of this chapter.

1.2 The Model

Consider the following model:

    y_g = α_g + xβ_g + u_g
    w_g^* = zγ_g + a_g,                                         (1.1)

where y_g is the gth counterfactual outcome variable, α_g is the scalar intercept in the counterfactual outcome equation for y_g, x ≡ (x_1, x_2, ..., x_l) is the 1 × l vector of exogenous variables in y_g, β_g is the l × 1 vector of slope coefficients in y_g, u_g is the counterfactual error in y_g, w_g^* is the latent treatment variable that determines the choice of treatment status among the G + 1 alternative treatment statuses,
z ≡ (z_1, z_2, ..., z_k) is the 1 × k vector of instruments (including a constant term) in the choice equation for w_g^*, γ_g is the k × 1 vector of parameters in w_g^*, and a_g is a scalar error term in w_g^* that is independently and identically distributed (i.i.d.) Gumbel with location parameter µ = 0 and scale parameter β = 1, for g = 0, 1, ..., G.

Let w ∈ {0, 1, ..., G} be the observed discrete multivalued endogenous treatment variable whose values are determined by the w_g^* for g = 0, 1, ..., G. One common interpretation of w_g^* is as the utility or satisfaction obtained from treatment status g. Let the treatment statuses of w be exhaustive and mutually exclusive. Define the binary treatment status indicators d_g = 1[w = g] for g = 0, 1, ..., G, so that d_g equals one if the treatment status equals g and zero otherwise. This, coupled with the mutual exclusivity of the treatment statuses, implies that Σ_{g=0}^{G} d_g = 1. Define the 1 × (G+1) vector of treatment statuses d ≡ (d_0, d_1, ..., d_G), and let y be the observed outcome. Then I can write

    y = d_0 y_0 + d_1 y_1 + ⋯ + d_G y_G,                        (1.2)

where y_g is the gth counterfactual outcome for g = 0, 1, ..., G.

Having described the discrete multivalued endogenous treatment model above, I now make a series of assumptions that complete the model and that are used in estimation. First, I assume that rational economic agents choose the treatment status from which they receive the most satisfaction out of all possible treatment statuses. That is,

• Assumption 1.1 (A.1.1): One chooses treatment status g, i.e., w = g, if and only if w_g^* ≥ w_j^* for all j ≠ g, for g, j = 0, 1, ..., G.

Second, I assume that identification of the model in (1.1) and (1.2) is aided by the exclusion of some (at least one) variables in the set of instruments z from the set of exogenous variables in x. This exclusion restriction is encouraged so that estimation and identification are more convincing and reliable, even though nonlinearity in estimation suffices for identification, especially when the exogenous variables in z vary enough in the sample. In the literature, it is common for the set of exogenous variables in x to be a proper subset of the set of instruments z; that is, z includes all the variables in x and has at least one additional variable that is not in x. The idea is that all of the characteristics influential in the outcome are also expected to be critical in determining the treatment choice. For a concrete example on this point, see Vella and Verbeek (1999, p. 473).

• Assumption 1.2 (A.1.2): Identification of the model described by (1.1) and (1.2) is strengthened by the exclusion of at least one variable in z from the set of variables in x.

As shown by McFadden (1973), under the model in (1.1) and (1.2), the assumptions made so far allow the treatment variable w to follow a multinomial logit model with the choice probabilities

    P(w = g | x, z) = P(w = g | z) = exp(zγ_g) / Σ_{r=0}^{G} exp(zγ_r),        (1.3)

for g = 0, 1, ..., G.
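Computationally, the choice probabilities in (1.3) are the softmax of the alternative-specific indices zγ_g. The following is a minimal Python sketch of (1.3); the function name, array shapes, and the γ_0 = 0 normalization convention are illustrative assumptions rather than anything specified in this chapter:

```python
import numpy as np

def mnl_probabilities(Z, Gamma):
    """Multinomial logit choice probabilities, as in (1.3).

    Z     : (N, k) array of instruments z_i (first column a constant).
    Gamma : (k, G+1) array whose g-th column is gamma_g.
    Returns an (N, G+1) array whose (i, g) entry is P(w_i = g | z_i).
    """
    V = Z @ Gamma                       # latent indices z_i * gamma_g
    V -= V.max(axis=1, keepdims=True)   # stabilize exp() against overflow
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

# Illustrative use: N = 5 agents, k = 2 instruments, G + 1 = 3 statuses.
rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(5), rng.normal(size=5)])
Gamma = np.column_stack([np.zeros(2),               # gamma_0 normalized to zero,
                         rng.normal(size=(2, 2))])  # the usual MNL convention
print(mnl_probabilities(Z, Gamma))                  # each row sums to one
```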
The next assumption is essential to the CF estimation described in section 1.4, since, coupled with the multinomial logit specification of the treatment variable w, it enables me to form control function terms that account for the endogeneity in w.

• Assumption 1.3 (A.1.3): E(u_g | x, z, a) = E(u_g | a) = Σ_{j=0}^{G} η_{g,j} a_j − Σ_{j=0}^{G} η_{g,j} E(a_j), where u_g is the counterfactual error in y_g, x is the 1 × l vector of exogenous variables in y_g, z is the 1 × k vector of instruments (including a constant term) in the choice equation for w_g^*, a ≡ (a_0, a_1, ..., a_G) is the 1 × (G+1) vector of i.i.d. Gumbel distributed errors a_j with location parameter µ = 0 and scale parameter β = 1 in w_j^*, η_{g,j} is a scalar multiple of the correlation coefficient between u_g and a_j, and E(a_j) ≈ 0.5772 is Euler's constant, for j, g = 0, 1, ..., G.

Bourguignon, Fournier, and Gurgand (2007) refer to A.1.3 as Dubin and McFadden's linearity assumption, since the conditional expectation of the counterfactual error u_g given all the Gumbel distributed errors a is linear in a for g = 0, 1, ..., G. A.1.3 also implies that, conditional on a, x and z are redundant for the conditional expectation of u_g; in other words, u_g is mean independent of x and z conditional on a. Under all assumptions from A.1.1 through A.1.3, the model in (1.1) and (1.2) can be consistently estimated by the CF method. In section 1.4, I propose a consistent estimator for the ATEs in this discrete multivalued endogenous treatment model.

1.2.1 The Model with η_{g,j} = η_j: A Special Case

This special case follows the initial model, so the basic setup and variables in (1.1) and (1.2) are the same. However, I now modify the initial model by assuming that η_{g,j} = η_j for g = 0, 1, ..., G, which is the main difference in this model. The assumptions in this new model are essentially the same as those in the initial model, except for A.1.3, and are as follows:

• Assumption 1.1′ (A.1.1′): Same as A.1.1.

• Assumption 1.2′ (A.1.2′): Same as A.1.2.

• Assumption 1.3′ (A.1.3′): E(u_g | x, z, a) = E(u_g | a) = Σ_{j=0}^{G} η_j a_j − Σ_{j=0}^{G} η_j E(a_j), where u_g is the counterfactual error in y_g, x is the 1 × l vector of exogenous variables in y_g, z is the 1 × k vector of instruments (including a constant term) in the choice equation for w_g^*, a is the 1 × (G+1) vector of i.i.d. Gumbel distributed errors a_j with location parameter µ = 0 and scale parameter β = 1 in w_j^*, η_j is the fixed scalar multiple of the correlation coefficient between u_g and a_j, and E(a_j) ≈ 0.5772 is Euler's constant, for j, g = 0, 1, ..., G.

The last assumption, A.1.3′, now incorporates the special condition η_{g,j} = η_j, which means that the correlation coefficient between u_g and a_j does not change as the treatment status g changes but is fixed at η_j. Because of this special condition, u_g = u for g = 0, 1, ..., G, which practically means that the counterfactual errors u_g are homogeneous across treatment statuses.

1.3 Identification

In plain words, identification in economics is the concept of uniquely determining some population parameters in an econometric model from what can be observed in the data. Unlike natural scientists, economists cannot control the conditions under which social phenomena occur or change, and the data are produced by an unknown process. Hence, an economist needs to formulate a model, with its assumptions and restrictions, in order to simplify the complexity of the real world and understand the population of interest. If a model is identified, then only a single set of parameter values agrees with the data under the true model, and other parameter values must point to different data. The origins of econometric identification go as far back as the late 1920s.
Stock and Trebbi (2003) argue that Wright (1928) was the pioneer of econometric identification because he was the first to solve an identification problem in econometrics. Apart from this study, Working (1925, 1927), Haavelmo (1943), Koopmans and Reiersøl (1950), Hurwicz (1950), Koopmans, Rubin, and Leipnik (1950), Wald (1950), and Fisher (1966) provide the earliest examples of identification research in econometrics. For a complete review of the early research in econometric identification and its formalization, see Duo (1993, ch. 4).

Since those early days, the econometric identification literature has grown considerably and become complicated, but it can be roughly divided into two strands: point identification and set identification (a.k.a. partial identification). Conventionally, when an economist says a parameter is identified, she means that it is point identified, mainly because research in point identification started earlier than research in set identification. From a terminological point of view, Koopmans and Reiersøl (1950), Hurwicz (1950), Fisher (1966), and Rothenberg (1971) give the first formal definitions of point identification; other, more recent definitions are available in Hsiao (1983), Bekker and Wansbeek (2001), and Matzkin (2007). In line with Lewbel (2018), let us say that θ is the population parameter to be identified, φ is everything known about the population that data can offer, and M is a model with a set of assumptions and restrictions on the set of possible values φ can take on. Then singling out θ from φ given the model M establishes the point identification of θ, which is an introductory and informal way of thinking about point identification. In a similar but more informal fashion, Duo (1993, p. 95) describes the process of point identification as "... essentializing the conditions under which a certain set of values of structural parameters could be uniquely determined from the data among all the permissible sets embodied in mathematically complete theoretical model ...".

As for set identification, in an intuitively simple way, Lewbel (2018, p. 65) describes it as "... the analysis of situations where φ provides some information about parameter θ, but not enough information to actually point identify θ ...". From an informal standpoint, the true parameter θ_o is set identified if some other possible parameter values have the same chance of creating the data in hand as does the true parameter value; terminologically, these other parameter values are called observationally equivalent to θ_o and form the identified set together with θ_o. If the identified set is composed only of θ_o, then point identification of θ_o is the same as its set identification. Frisch (1934), Reiersøl (1941), and Marschak and Andrews (1944) are among the first works on set identification, and Manski (1990, 1995, 2003) analyzes the subject in great detail over the years. More recent definitions are available in Matzkin (2007) and Chesher and Rosen (2017), and set identification is reviewed in Tamer (2010). Some researchers favor set identification over point identification because, in their view, economic theory does not supply econometric models with enough information for the point identification of model parameters, leading to sophisticated tricks to obtain point identification. One downside of set identified models, however, is that estimation and inference for set identified parameters are considerably more complicated than for point identified parameters.
In two-stage models, the main identification strategy is generally based on either some exclusion restrictions or functional form assumptions. Applied economists also use both exclusion restrictions and functional form assumptions at the same time in order to aid and strengthen identification in their models. In a two-stage model, an exclusion restriction means the inclusion of an explanatory variable in the first stage equation that is excluded from the second stage equation. There can also be exogenous variables that do not appear in the first stage equation but are included only in the second stage equation, which is expected to produce even more variation in the model for its parameters to be identified. Economists who favor exclusion restrictions argue that identification based on functional form assumptions (a.k.a. identification by functional form) is fragile, especially in empirical settings where the data do not have enough variation in key variables. Since identification by functional form relies heavily on model assumptions that might not be true in reality, it can greatly suffer from distributional misspecification, and exclusion restrictions can be required for identification. For example, Keane (1992), Reilly (1996), Montmarquette, Viennot-Briot, and Dagenais (2007), and Shen (2013) all use some exclusion restrictions as their main identification strategy. For Olsen (1980) and Little (1985), the use of nonlinearities in identification is even unappealing, because identification by functional form is not only weak but also causes high standard errors and unreliable estimates. On the other hand, economists who favor functional form assumptions argue that identification based on exclusion restrictions is hard to achieve, especially in empirical settings where finding a true instrument that affects only the selection equation and does not appear in the equation of interest is not very realistic. For instance, Heckman (1978), Heckman and Robb (1985a), Mendelsohn (1985), Schaffner (2002), and Lewbel (2012) all employ nonlinearities in exogenous variables to aid identification. In their Monte Carlo simulations, Leung and Yu (1996) find evidence supporting the claim that two-step models are reliable in the absence of exclusion restrictions, given enough variation in one of the exogenous variables in the data. Wilde (2000) provides evidence that parameter identification does not require exclusion restrictions in a system of equations, given that each equation has at least one explanatory variable with enough variation; under the assumption of a multivariate normal distribution, his multiple equation probit model is identified without exclusion restrictions. Similarly, Escanciano et al. (2016) provide results supporting the view that identification by functional form is neither fragile nor unreliable in a large class of two-stage models. In certain models, some economists even show that identification of two-stage models is possible without exclusion restrictions; see, for instance, the binary choice model of Dong (2010). It is also common for economists to use both exclusion restrictions and functional form assumptions to aid identification in their models; see Mocan and Tekin (2003) and Appelt (2015) for further comments.
In my main model, given by (1.1), (1.2), A.1.1, A.1.2, and A.1.3, the identification argument is based on both the exclusion restriction(s) in A.1.2 and the nonlinearity that describes the relationship between the set of instruments z and the treatment variable w, which follows a multinomial logit model in z. Because of the arguments stressing that it is hard to achieve identification without imposing an exclusion restriction, I rely not only on the nonlinearity but also on the exclusion restriction(s) in my model, so as to come up with a stronger and improved identification argument that appeases a wide range of economists.

1.4 Estimation

In multivalued treatment cases, the ATEs depend on the choice of a base treatment group out of all possible treatment groups. Upon the determination of the base treatment group, the ATEs can be defined as the expectation of the gain from the treatment received relative to the base treatment group. Note that there are G ATEs in my analysis, since there exist G + 1 treatments. Let g = 0 be the base group in my analysis, and denote by ATE_{g,0} the expected gain from treatment g relative to the base treatment. In my model, under A.1.1 through A.1.3 and the law of iterated expectations, the ATEs take the following form:

    ATE_{g,0} = E(y_g − y_0) = E(α_g + xβ_g + u_g − (α_0 + xβ_0 + u_0))
              = (α_g − α_0) + E(x)(β_g − β_0),                  (1.4)

where the third equality uses E(u_g) = 0 for g = 1, 2, ..., G. Then a consistent estimator of ATE_{g,0} is

    ATÊ_{g,0} = (α̂_g − α̂_0) + x̄(β̂_g − β̂_0),                    (1.5)

where α̂_g, α̂_0, β̂_g, β̂_0, and x̄ = N⁻¹ Σ_{i=1}^{N} x_i are, respectively, consistent estimates of α_g, α_0, β_g, β_0, and E(x). Note that when E(x) = 0, ATE_{g,0} simplifies to

    ATE_{g,0} = α_g − α_0.                                      (1.6)

Then a consistent estimate of ATE_{g,0} in (1.6) is

    ATÊ_{g,0} = α̂_g − α̂_0,                                      (1.7)

which is simply the difference between the estimates α̂_g and α̂_0 for g = 1, 2, ..., G.

1.4.1 IV Estimation

Consider the observed outcome:

    y = d_0 y_0 + d_1 y_1 + ⋯ + d_G y_G
      = Σ_{j=0}^{G} d_j α_j + Σ_{j=0}^{G} d_j x β_j + u⁰,        (1.8)

where u⁰ ≡ d_0 u_0 + d_1 u_1 + ⋯ + d_G u_G. Applying the IV method to (1.8) requires instruments for the binary treatment indicators d_j, since corr(d_j, u⁰) is expected to be nonzero for j = 0, 1, ..., G. One way to obtain instruments for d_j is to model the treatment variable w as a discrete multinomial logit model and then use the predicted probabilities from this model as instruments for d_j for j = 0, 1, ..., G. Hence, one can prescribe the following three-stage procedure to estimate the ATEs:

Procedure 1.1

1. Estimate the predicted probabilities, Λ̂_{ji} = exp(z_i γ̂_j)/Σ_{r=0}^{G} exp(z_i γ̂_r), from an MNL of w_i on z_i for j = 0, 1, ..., G and i = 1, 2, ..., N.

2. Estimate the parameters in (1.8) by the IV method using the instruments (Λ̂_{ji}, Λ̂_{ji} x_i) for (d_{ji}, d_{ji} x_i), j = 0, 1, ..., G and i = 1, 2, ..., N, as sketched below.

3. Plug the parameter estimates from step 2 and the sample average of x into (1.5) to estimate the ATEs.
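To make the mechanics of Procedure 1.1 concrete, here is a minimal Python sketch of step 2 for the just-identified case, in which the instrument matrix has exactly the dimension of the regressor matrix and the IV estimator reduces to the closed form (Z′W)⁻¹Z′y. The first-stage probabilities are taken as given, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def iv_step2(y, D, X, Lambda_hat):
    """Just-identified IV for (1.8): instruments (L_j, L_j x) for (d_j, d_j x).

    y          : (N,)     observed outcome.
    D          : (N, G+1) binary treatment indicators d_0, ..., d_G.
    X          : (N, l)   exogenous variables x (no constant).
    Lambda_hat : (N, G+1) MNL predicted probabilities from step 1.
    Returns the stacked coefficients (alpha_0..alpha_G, then beta_0..beta_G).
    """
    Gp1 = D.shape[1]
    # Regressor matrix W = [d_0, ..., d_G, d_0 x, ..., d_G x].
    W = np.hstack([D] + [D[:, [j]] * X for j in range(Gp1)])
    # Instrument matrix with the same pattern, probabilities replacing dummies.
    Zm = np.hstack([Lambda_hat] + [Lambda_hat[:, [j]] * X for j in range(Gp1)])
    return np.linalg.solve(Zm.T @ W, Zm.T @ y)   # (Z'W)^{-1} Z'y
```

Step 3 then combines the stacked α̂_j and β̂_j with x̄ as in (1.5).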
An important remark to make here is that the IV estimator described in Procedure 1.1 is not a conventional IV estimator. It is in fact the optimal (asymptotic variance minimizing) IV estimator with optimal instruments (the predicted probabilities from the MNL of w on z) under homoskedasticity. Whether the error term u⁰ in (1.8) is homoskedastic is another question; however, if it is, the optimal IV estimator in Procedure 1.1 is asymptotically efficient over a parametric family of conditional probabilities for w. For this reason, I prefer the IV estimator described in Procedure 1.1 over other IV estimators using different variations of instruments that are functions of z. For an example of an optimal IV estimator in an endogenous dummy variable model, see Newey (1993, p. 430). For further discussion of optimal IV estimators, see Newey and McFadden (1994, section 5.4).

Procedures similar to Procedure 1.1 in models with a binary endogenous treatment are popular among empirical economists, partly because of the straightforward implementation of the IV method. For example, Robinson (1989) used a two-stage procedure in which, at the first stage, he obtained a set of predicted probabilities from a probit specification for the endogenous variable, i.e., union choice, and then, at the second stage, used them as an instrument for union choice in his model to estimate union wage differentials. Similarly, Puhani and Weber (2007) calculated a set of predicted probabilities, again from a probit specification for the endogenous variable, i.e., age of school entry, at the first stage and used them at the second stage as an instrument for the age of school entry in their model to explore how the age of school entry influences educational outcomes. Examples in multivalued cases are also numerous; see, for example, Ettner (1995, 1996); Norton and Staiger (1994); and Sloan, Picone, Taylor Jr., and Chou (2001).

As pointed out by Chesher and Rosen (2013) and Lewbel, Dong, and Yang (2012), although this IV approach is simple to apply and is popular in empirical work with a binary endogenous treatment, it is naive and mostly results in inconsistent IV estimates (and thereby inconsistent ATE estimates in my analysis), since the instruments used in Procedure 1.1 are expected to be correlated with u⁰. This IV method, whose instruments are all nonlinear functions of z, would most likely yield inconsistent estimates because d_j is in u⁰ and is a function of z for j = 0, 1, ..., G.

Note that (1.8) must be reformulated so that one can estimate it using canned software packages such as STATA. This reformulation is needed because I include all of the binary treatment indicators d_g in (1.2), and the canned packages also include an intercept in the first stage of the IV estimation, although some, e.g., STATA, allow for the exclusion of a constant term in the second stage regression. Hence, in practice, the IV estimation of (1.8) suffers from perfect multicollinearity and is not possible.

To fix this practical problem, let us drop one of the binary treatment indicator variables, say d_G (it could be any other one), from (1.8), and then add a constant term and the variables x to (1.8). Then (1.8) can be equivalently written as

    y = (Σ_{j=0}^{G−1} d_j α̃_j + α̃_G) + (Σ_{j=0}^{G−1} d_j x β̃_j + x β̃_G) + ũ⁰,   (1.9)

where α_g = α̃_g + α̃_G and β_g = β̃_g + β̃_G for g = 0, 1, ..., G − 1, α_G = α̃_G, and β_G = β̃_G. Under this reformulation, ATE_{g,0} for g = 1, 2, ..., G − 1 is

    ATE_{g,0} = (α̃_g − α̃_0) + E(x)(β̃_g − β̃_0)                   (1.10)

and for g = G

    ATE_{G,0} = (−α̃_0) + E(x)(−β̃_0).                            (1.11)

Therefore, consistent estimates of ATE_{g,0} for g = 1, 2, ..., G − 1 and of ATE_{G,0} under this reformulation are

    ATÊ_{g,0} = (α̃̂_g − α̃̂_0) + x̄(β̃̂_g − β̃̂_0)                     (1.12)

and

    ATÊ_{G,0} = (−α̃̂_0) + x̄(−β̃̂_0),                              (1.13)

where α̃̂_g, α̃̂_0, β̃̂_g, and β̃̂_0 are, respectively, the consistent estimates of α̃_g, α̃_0, β̃_g, and β̃_0 from Procedure 1.1 applied to (1.9), and x̄ is a consistent estimate of E(x), just as in (1.5).
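Recovering the ATE estimates in (1.12) and (1.13) from the coefficients of the reformulated model (1.9) is simple arithmetic; a small sketch under the same illustrative conventions as above:

```python
import numpy as np

def ates_from_reformulation(alpha_t, Beta_t, x_bar):
    """ATE_{g,0} estimates from the reformulated model (1.9).

    alpha_t : (G,)   estimates of tilde-alpha_0, ..., tilde-alpha_{G-1}.
    Beta_t  : (G, l) rows are estimates of tilde-beta_0, ..., tilde-beta_{G-1}.
    x_bar   : (l,)   sample average of x.
    Returns (ATE_{1,0}, ..., ATE_{G,0}).
    """
    ates = [(alpha_t[g] - alpha_t[0]) + x_bar @ (Beta_t[g] - Beta_t[0])
            for g in range(1, len(alpha_t))]           # (1.12), g = 1..G-1
    ates.append(-alpha_t[0] + x_bar @ (-Beta_t[0]))    # (1.13), g = G
    return np.array(ates)
```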
1.4.2 CF Estimation

CF estimation is more involved than the IV estimation in subsection 1.4.1, because I first need to derive the estimating equation of the CF method. This estimating equation is based on the expectation of the observed outcome y conditional on the observed variables (d, x, z), E(y | d, x, z). To prevent equation clutter, I collect all the derivations in appendix A. With that said, (A.7) in appendix A gives the estimating equation of the CF method, because I can always write

    y = Σ_{j=0}^{G} d_j α_j + Σ_{k=0}^{G} d_k x β_k + Σ_{g=0}^{G} [−η_{g,g} d_g log(Λ_g)]
        + Σ_{g≠0} d_g η_{g,0} M_0 + Σ_{g≠1} d_g η_{g,1} M_1 + ⋯ + Σ_{g≠G} d_g η_{g,G} M_G + ε,   (1.14)

where E(ε | d, x, z) = 0, Λ_j = exp(zγ_j)/Σ_{r=0}^{G} exp(zγ_r), and M_j = Λ_j log(Λ_j)/(1 − Λ_j) for j = 0, 1, ..., G. So I can prescribe the following three-stage estimation procedure to estimate the ATEs:

Procedure 1.2

1. Same as in Procedure 1.1.

2. Run the regression of y_i on d_{0i}, d_{1i}, ..., d_{Gi}; d_{0i}x_i, d_{1i}x_i, ..., d_{Gi}x_i; −d_{0i}log(Λ̂_{0i}), −d_{1i}log(Λ̂_{1i}), ..., −d_{Gi}log(Λ̂_{Gi}); d_{1i}M̂_{0i}, d_{2i}M̂_{0i}, ..., d_{Gi}M̂_{0i}; d_{0i}M̂_{1i}, d_{2i}M̂_{1i}, d_{3i}M̂_{1i}, ..., d_{Gi}M̂_{1i}; ...; d_{0i}M̂_{Gi}, d_{1i}M̂_{Gi}, ..., d_{G−1,i}M̂_{Gi}.

3. Same as in Procedure 1.1,

where Λ̂_{gi} = exp(z_i γ̂_g)/Σ_{r=0}^{G} exp(z_i γ̂_r) and M̂_{gi} = Λ̂_{gi} log(Λ̂_{gi})/(1 − Λ̂_{gi}) for g = 0, 1, ..., G and i = 1, 2, ..., N.

Unlike the estimates from the IV method, the CF method's estimates are theoretically robust to heterogeneity in u_g, at least from a consistency standpoint. Under A.1.1 through A.1.3, the CF method yields consistent estimates because the addition of the control function terms renders d_g exogenous in (1.14) for g = 0, 1, ..., G.

1.4.3 The Model with η_{g,j} = η_j: Estimation

The ATEs and their estimates are calculated in the same way as in the initial model, and the main estimation approaches are again the IV and CF methods. Consider the observed outcome:

    y = d_0 y_0 + d_1 y_1 + ⋯ + d_G y_G
      = Σ_{j=0}^{G} d_j α_j + Σ_{j=0}^{G} d_j x β_j + u⁰,        (1.15)

where u⁰ = d_0 u_0 + d_1 u_1 + ⋯ + d_G u_G. However, by A.1.3′, u_g = u for g = 0, 1, ..., G (homogeneous counterfactual errors), and therefore u⁰ = u. Since (1.8) and (1.15) are practically the same, IV estimation (Procedure 1.1) can be used in this special model. Moreover, I expect that the IV approach under the condition η_{g,j} = η_j would yield consistent estimates, because the error term u⁰ in (1.15) does not depend on the binary treatment status indicators d_g but only on u_g = u. For practical purposes, the reformulation of (1.15) proceeds just like the reformulation of (1.8).

As before, the CF estimating equation is based on the expectation of the observed outcome y conditional on the observed variables (d, x, z), E(y | d, x, z); for the derivations, again refer to appendix A. In this special model with η_{g,j} = η_j, the estimating equation of the CF method simplifies. As seen in (A.10) in appendix A, I can write the estimating equation of the CF method as follows:

    y = Σ_{g=0}^{G} d_g α_g + Σ_{g=0}^{G} d_g x β_g + Σ_{g=0}^{G} η_g r_g + ε,   (1.16)

where r_g = (1 − d_g)M_g − d_g log(Λ_g), M_g = Λ_g log(Λ_g)/(1 − Λ_g), Λ_g = exp(zγ_g)/Σ_{r=0}^{G} exp(zγ_r), and E(ε | d, x, z) = 0 for g = 0, 1, ..., G. Then I can prescribe the following three-stage estimation procedure to estimate the ATEs:

Procedure 1.2′

1. Estimate the predicted probabilities Λ̂_{gi} from an MNL of w_i on z_i and then obtain r̂_{gi}.

2. Run the regression of y_i on d_{0i}, d_{1i}, ..., d_{Gi}; d_{0i}x_i, d_{1i}x_i, ..., d_{Gi}x_i; r̂_{0i}, r̂_{1i}, ..., r̂_{Gi}.

3. Same as in Procedure 1.1,

where r̂_{gi} = (1 − d_{gi})M̂_{gi} − d_{gi}log(Λ̂_{gi}), M̂_{gi} = Λ̂_{gi}log(Λ̂_{gi})/(1 − Λ̂_{gi}), and Λ̂_{gi} = exp(z_i γ̂_g)/Σ_{r=0}^{G} exp(z_i γ̂_r) for g = 0, 1, ..., G and i = 1, 2, ..., N. Under A.1.1′, A.1.2′, and A.1.3′, the CF method yields consistent estimates because the addition of the control function terms r_g renders d_g exogenous in (1.16) for g = 0, 1, ..., G; a computational sketch of these control function terms follows below.
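The control function regressors added in the second stages of Procedures 1.2 and 1.2′ are deterministic functions of the treatment indicators and the first-stage probabilities. A minimal sketch, assuming the first-stage MNL has already been estimated (names and shapes illustrative):

```python
import numpy as np

def cf_terms(D, Lambda_hat):
    """Control function regressors for the CF second stage.

    D          : (N, G+1) binary treatment indicators.
    Lambda_hat : (N, G+1) first-stage MNL predicted probabilities.
    Returns
      own   : (N, G+1) columns -d_g log(Lambda_g), the own terms in (1.14);
      cross : (N, G+1, G+1) entries d_g M_j for g != j (zeroed when g == j);
      r     : (N, G+1) columns r_g = (1 - d_g) M_g - d_g log(Lambda_g),
              the terms of Procedure 1.2' under eta_{g,j} = eta_j, eq. (1.16).
    """
    logL = np.log(Lambda_hat)
    M = Lambda_hat * logL / (1.0 - Lambda_hat)   # M_g = L_g log(L_g)/(1 - L_g)
    own = -D * logL
    cross = D[:, :, None] * M[:, None, :]        # [i, g, j] holds d_gi * M_ji
    for g in range(D.shape[1]):
        cross[:, g, g] = 0.0                     # own term enters via `own`
    r = (1.0 - D) * M - D * logL                 # special case, eq. (1.16)
    return own, cross, r
```

In the heterogeneous case, the second-stage design stacks the columns of own and the off-diagonal slices of cross (each multiplying its own η_{g,j}); in the special case, only the G + 1 columns of r are needed.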
1.5 Asymptotic Normality Results
The CF method is indeed a two-step M-estimator that solves the problem
\min_{\theta \in \Theta} \sum_{i=1}^{N} (y_i - m(X_i, v(d_i, z_i, \hat{\gamma}), \theta))^2 / 2,   (1.17)
where γ̂ = (γ̂_0′, γ̂_1′, . . . , γ̂_G′)′ is the (G + 1)k × 1 vector of √N-consistent and asymptotically normal first-stage conditional MLE (CMLE) estimates from the MNL of w_i on z_i for i = 1, 2, . . . , N. However, the first-stage estimates do not have to be consistent as long as they converge in probability, i.e., γ̂ →p γ* where γ* ∈ Γ ⊂ R^{(G+1)k}. The CMLE solves the problem
\max_{\gamma \in \Gamma} \sum_{i=1}^{N} l_i(\gamma),   (1.18)
where γ = (γ_0′, γ_1′, . . . , γ_G′)′ is the (G + 1)k × 1 vector of parameters, and l_i(γ) ≡ log(f(w_i|z_i; γ)), i.e., the conditional log likelihood for observation i, is given below:
\log(f(w_i|z_i, \gamma)) = \sum_{j=0}^{G} 1[w_i = j] \log\left( \exp(z_i \gamma_j) / \sum_{r=0}^{G} \exp(z_i \gamma_r) \right).   (1.19)
To establish that the first-stage MLE estimates are √N-consistent and asymptotically normal, I rely on Proposition 7.6 in Hayashi (2000), Theorem 13.1 in Wooldridge (2010), Proposition 7.9 in Hayashi (2000), and Theorem 13.2 in Wooldridge (2010). Theorem 1.1 (Th.1.1) below combines Proposition 7.6 and Theorem 13.1 and establishes the consistency of the CMLE without a compact parameter space.
• Theorem 1.1 (Th.1.1): Let {(w_i, z_i): i = 1, 2, . . .} be a random sample with z_i ∈ Z ⊂ R^k and w_i ∈ W ⊂ R. Let Γ ⊂ R^{(G+1)k} be the parameter set, and denote the parametric model for the conditional density, p(·|z), as {f(·|z; γ): z ∈ Z, γ ∈ Γ}. Let l: W × Z × Γ → R be a real-valued function. Assume that (a) f(·|z; γ) is a true density function with respect to the measure μ(dw) for all z and γ, so that ∫_W f(w|z)μ(dw) = 1, ∀z ∈ Z, holds; (b) for some γ_o ∈ Γ, p_o(·|z) = f(·|z; γ_o), ∀z ∈ Z, and the true parameter vector γ_o is the unique solution to max_{γ∈Γ} E[l_i(γ)]; (c) γ_o is an element of the interior of a convex parameter space Γ; (d) for each γ ∈ Γ, l(·, γ) is a Borel measurable function on W × Z; (e) for each (w, z) ∈ W × Z, l(w, z, ·) is concave in γ; and (f) |l(w, z, γ)| ≤ b(w, z), ∀γ ∈ Γ, where b(·, ·) is a nonnegative function on W × Z such that E[b(w, z)] < ∞. Then there exists a solution to the problem in (1.18), the CMLE γ̂, and γ̂ →p γ_o.
In appendix A, I verify the conditions stated in Th.1.1. For a generic consistency proof of extremum estimators without compactness, see Theorem 2.7 in Newey and McFadden (1994, p. 2133). Next, I state Theorem 1.2 (Th.1.2) below, which combines Proposition 7.9 and Theorem 13.2 and establishes the asymptotic normality of the CMLE.
• Theorem 1.2 (Th.1.2): Let the definitions and conditions of Th.1.1 hold, and define B_o^F ≡ Var[∇_γ l_i(γ_o)].
Furthermore, assume that (a) γ_o is an element of the interior of the parameter space Γ, i.e., γ_o ∈ int(Γ); (b) for each (w, z) ∈ W × Z, l(w, z, ·) is twice continuously differentiable on int(Γ); (c) E[s_i^F(γ_o)] = 0 and −E[H_i^F(γ_o)] = Var[s_i^F(γ_o)], where s_i^F(γ) ≡ ∇_γ l_i(γ)′ and H_i^F(γ) ≡ ∇_γ[∇_γ l_i(γ)′]; (d) the elements of ∇_γ[∇_γ l(w, z, γ)′] are bounded in absolute value by a function b(w, z), ∀γ ∈ Γ, where b(·, ·) is a nonnegative function on W × Z such that E[b(w, z)] < ∞; and (e) A_o^F ≡ −E(∇_γ[∇_γ l_i(γ_o)′]) is positive definite. Then
\sqrt{N}(\hat{\gamma} - \gamma_o) \xrightarrow{d} \mathrm{Normal}(0, (A_o^F)^{-1} B_o^F (A_o^F)^{-1}).   (1.20)
Explicitly, the score of the log likelihood for observation i is
s_i^F(\gamma) \equiv \nabla_\gamma l_i(\gamma)' = \left( \frac{\partial l_i}{\partial \gamma_0}(\gamma), \frac{\partial l_i}{\partial \gamma_1}(\gamma), \ldots, \frac{\partial l_i}{\partial \gamma_G}(\gamma) \right)',   (1.21)
which is a (G + 1)k × 1 vector of partial derivatives of l_i(γ) with respect to the parameters in γ. The Hessian, H_i^F(γ) ≡ ∇_γ[∇_γ l_i(γ)′], for observation i is the (G + 1)k × (G + 1)k matrix of second partial derivatives of l_i(γ) with respect to the parameters in γ. Thus, using the definitions in Th.1.2, A_o^F ≡ −E[H_i^F(γ_o)] and B_o^F ≡ Var[s_i^F(γ_o)]. In appendix A, I show that E[s_i^F(γ_o)] = 0 and A_o^F = B_o^F, which are used to reduce the variance expression in (1.20) to the one in (1.22) below:
\sqrt{N}(\hat{\gamma} - \gamma_o) \xrightarrow{d} \mathrm{Normal}(0, (A_o^F)^{-1}).   (1.22)
See appendix A for the verification of the conditions stated in Th.1.2, and see Theorem 3.1 in Newey and McFadden (1994, p. 2143) for a generic proof of asymptotic normality of extremum estimators. Now I can move to the second stage of the CF method, which is basically OLS with generated regressors. To establish that the second-stage estimates are √N-consistent and asymptotically normal, I use Theorem 1.3 (Th.1.3) and Theorem 1.4 (Th.1.4), respectively. Th.1.3 below is based on Theorem 4.3 in Wooldridge (1994, p. 2653) and establishes the consistency of the CF method with a compact parameter space.
• Theorem 1.3 (Th.1.3): Let w = (y, X, v) be a random vector with w ∈ W ⊂ R^{M+1} and M = (l + G + 2)(G + 1). Let Θ ⊂ R^M and Γ ⊂ R^{(G+1)k} be the parameter sets. Let q(w, θ; γ): W × Θ × Γ → R be a real-valued function. Let γ̂ be an estimator from a preliminary estimation. Assume that (a) γ̂ →p γ* for some γ* ∈ Γ; (b) for a given γ* ∈ Γ, the true parameter vector θ_o is the unique solution to min_{θ∈Θ} E[q_i(θ; γ*)]; (c) the parameter space Θ × Γ is compact; (d) for each (θ, γ) ∈ Θ × Γ, q(·, θ; γ) is a Borel measurable function on W; (e) for each w ∈ W, q(w, ·; ·) is a continuous function on Θ × Γ; and (f) E[|q(w_i, θ; γ)|] < ∞ ∀(θ, γ) ∈ Θ × Γ. Then there exists a solution to the problem in (1.17), the CF estimator θ̂, and θ̂ →p θ_o.
Compared to Th.1.1, Th.1.3 replaces the convexity assumption on the parameter space with compactness and the concavity of the objective function with continuity. In appendix A, I verify the conditions stated in Th.1.3. In addition, see Wooldridge (1994, p. 2730) for a generic consistency proof of two-step M-estimators with compactness. Before I move to the asymptotic normality result, I introduce some notation. From (1.17), we can see that q(w, θ; γ) for observation i in Th.1.3 is
q_i(\theta; \gamma) \equiv q(w_i, \theta; \gamma) \equiv (y_i - m(X_i, v(d_i, z_i, \gamma), \theta))^2 / 2,   (1.23)
where m_i(v_i(γ), θ) ≡ m(X_i, v(d_i, z_i, γ), θ) ≡ X_iδ + v_iλ, θ = (δ′, λ′)′ is the M × 1 vector of parameters, X_i is the 1 × (l + 1)(G + 1) vector of regressors in (1.17), and v_i is the 1 × (G + 1)(G + 1) vector of generated regressors in (1.17).
More explicitly,
X_i = ( d_{0i}, · · · , d_{Gi}, d_{0i}x_i, · · · , d_{Gi}x_i ),
v_i = ( −d_{0i}log(Λ_{0i}), · · · , −d_{Gi}log(Λ_{Gi}), d_{1i}M_{0i}, d_{2i}M_{0i}, · · · , d_{Gi}M_{0i}, d_{0i}M_{1i}, d_{2i}M_{1i}, d_{3i}M_{1i}, · · · , d_{Gi}M_{1i}, · · · , d_{0i}M_{Gi}, d_{1i}M_{Gi}, · · · , d_{G−1,i}M_{Gi} ),   (1.24)
where Λ_{gi} = exp(z_iγ_g)/Σ_{r=0}^{G} exp(z_iγ_r) and M_{gi} = Λ_{gi}log(Λ_{gi})/(1 − Λ_{gi}) for g = 0, 1, . . . , G. As one would expect, expressions such as Λ̂_{gi} and M̂_{gi} are just consistent estimates of Λ_{gi} and M_{gi} with γ̂_g replacing γ_g in Λ_{gi} and M_{gi}. Now I state Theorem 1.4 (Th.1.4), which is based on Theorem 4.4 in Wooldridge (1994, p. 2655) and establishes the asymptotic normality of the CF method with a compact parameter space.
• Theorem 1.4 (Th.1.4): Let the definitions and conditions of Th.1.3 hold. Furthermore, assume that (a) θ_o ∈ int(Θ) and γ* ∈ int(Γ); (b) √N(γ̂ − γ*) is bounded in probability, i.e., √N(γ̂ − γ*) = O_p(1); (c) for each (w, γ) ∈ W × Γ, q(w, ·; γ) is twice continuously differentiable on int(Θ); (d) for each θ ∈ Θ, s(·, θ; ·) ≡ ∇_θ q(·, θ; ·)′ is continuously differentiable on int(Γ); (e) for each (θ, γ) ∈ Θ × Γ, H(·, θ; γ) ≡ ∇_θ s(·, θ; γ) is a Borel measurable function on W; (f) for each w ∈ W, H(w, ·; ·) is continuous on Θ × Γ; (g) E[‖H(w_i, θ; γ)‖] < ∞ ∀(θ, γ) ∈ Θ × Γ; (h) A_o ≡ E[H(w_i, θ_o; γ*)] is positive definite; (i) for each (θ, γ) ∈ Θ × Γ, ∇_γ s(·, θ; γ) is a Borel measurable function on W; (j) for each w ∈ W, ∇_γ s(w, ·; ·) is continuous on Θ × Γ; (k) E[‖∇_γ s(w_i, θ; γ)‖] < ∞ ∀(θ, γ) ∈ Θ × Γ; and (l) E[s_i(θ_o; γ*)] = 0, E[(A_*^F)^{-1} s_i^F(γ*)] = 0, and E[(A_*^F)^{-1} s_i^F(γ*) s_i(θ_o; γ*)′] = 0. Then
\sqrt{N}(\hat{\theta} - \theta_o) \xrightarrow{d} \mathrm{Normal}(0, A_o^{-1} D_o A_o^{-1}),   (1.25)
where D_o = B_o + F_oT_o + T_o′F_o′ + F_oR_*F_o′, s_i(θ_o; γ*) ≡ ∇_θ q(w_i, θ_o; γ*)′, A_o ≡ E[∇_θ s_i(θ_o; γ*)] ≡ E[H_i(θ_o; γ*)], B_o ≡ E[s_i(θ_o; γ*)s_i′(θ_o; γ*)], F_o ≡ E[∇_γ s_i(θ_o; γ*)], T_o ≡ E[r_i(γ*)s_i′(θ_o; γ*)], R_* ≡ E[r_i(γ*)r_i′(γ*)], r_i(γ*) = (A_*^F)^{-1}s_i^F(γ*), and A_*^F ≡ −E(∇_γ[∇_γ l_i(γ*)′]). For the derivation of the asymptotic variance of √N(θ̂ − θ_o), refer to subchapter 12.4 in Wooldridge (2010) or subsections 4.3 and 4.4 in Wooldridge (1994).
See appendix A for the verification of the conditions stated in Th.1.4, and see Wooldridge (1994, p. 2730) for a generic asymptotic normality proof of two-step M-estimators with compactness. In appendix A, I also derive the closed forms of the population matrices A_o, B_o, F_o, and R_* and show that E[r_i(γ*)] = 0, E[s_i(θ_o; γ*)] = 0, and T_o ≡ E[r_i(γ*)s_i′(θ_o; γ*)] = 0. Since T_o = 0, D_o in the asymptotic variance of √N(θ̂ − θ_o) in (1.25) simplifies to B_o + F_oR_*F_o′.
Let's construct the following estimators for A_o, B_o, F_o, and R_*:
\hat{A} = N^{-1} \sum_{i=1}^{N} H_i(\hat{\theta}; \hat{\gamma}),   (1.26)
\hat{B} = N^{-1} \sum_{i=1}^{N} s_i(\hat{\theta}; \hat{\gamma}) s_i'(\hat{\theta}; \hat{\gamma}),   (1.27)
\hat{F} = N^{-1} \sum_{i=1}^{N} \nabla_\gamma s_i(\hat{\theta}; \hat{\gamma}),   (1.28)
\hat{R} = N^{-1} \sum_{i=1}^{N} r_i(\hat{\gamma}) r_i'(\hat{\gamma}).   (1.29)
Define D̂ ≡ B̂ + F̂R̂F̂′. Then, using the analogy principle and Lemma 1 in appendix A, a consistent estimator for Avar√N(θ̂ − θ_o) is \widehat{Avar}√N(θ̂ − θ_o) = Â^{-1}D̂Â^{-1}. The asymptotic standard errors of the CF estimates can be obtained from the matrix \widehat{Avar}(θ̂) = Â^{-1}D̂Â^{-1}/N. A small sketch of this variance calculation is given below.
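As an illustration of how (1.26) through (1.29) combine, here is a small numpy sketch of the corrected variance matrix, assuming the per-observation objects (Hessians, scores, their γ-gradients, and the first-stage terms r_i(γ̂)) have already been computed; all names are mine, not the chapter's.

import numpy as np

def cf_standard_errors(H, s, grad_gamma_s, r):
    """H: N x M x M Hessians H_i; s: N x M scores s_i(theta_hat; gamma_hat);
    grad_gamma_s: N x M x K gradients of s_i in gamma; r: N x K terms r_i."""
    N = s.shape[0]
    A = H.mean(axis=0)                   # A_hat, (1.26)
    B = s.T @ s / N                      # B_hat, (1.27)
    F = grad_gamma_s.mean(axis=0)        # F_hat, (1.28)
    R = r.T @ r / N                      # R_hat, (1.29)
    D = B + F @ R @ F.T                  # D_hat, using T_o = 0
    Ainv = np.linalg.inv(A)
    avar = Ainv @ D @ Ainv / N           # Avar_hat(theta_hat)
    return np.sqrt(np.diag(avar))        # asymptotic standard errors

The F @ R @ F.T correction term is exactly the piece that plain second-stage OLS standard errors omit, which is why ignoring the generated-regressor step understates or misstates the sampling variability.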
1.5.1 Method of Moments Framework
Newey (1984) showed that two-step estimators can be interpreted as members of the class of generalized method of moments (GMM) estimators for the purpose of obtaining the asymptotic variance matrix. This perspective not only provides consistent parameter estimates but also yields consistent standard errors of the parameter estimates. By jointly estimating all parameters in one step, consistent standard errors are obtained without deriving the asymptotic variance matrix of a two-step estimator. For this reason, GMM estimation gives an alternative way to get consistent standard errors of the parameters in the CF regression. GMM estimation reduces to method of moments estimation when the number of moments is exactly equal to the number of parameters to be estimated, so I technically utilize the method of moments (MoM) in my analysis.
In the first stage of the CF method, γ̂ is the CMLE estimator solving
\sum_{i=1}^{N} s_i^F(\hat{\gamma}) = 0,   (1.30)
where s_i^F(γ) = ∇_γ[ Σ_{j=0}^{G} 1[w_i = j] log( exp(z_iγ_j)/Σ_{r=0}^{G} exp(z_iγ_r) ) ]′. In the second stage, θ̂ is an OLS estimator solving
\sum_{i=1}^{N} s_i(\hat{\theta}; \hat{\gamma}) = 0,   (1.31)
where s_i(θ̂; γ̂) = ∇_θ[(y_i − m(X_i, v(d_i, z_i, γ̂), θ̂))²/2]′. Newey (1984) proposes stacking the summands in these first-order conditions into the unified function
g(\theta, \gamma) = ( s^F(\gamma)', s(\theta; \gamma)' )'   (1.32)
and then applying MoM with the moment conditions E[g(θ, γ)] = 0 to obtain consistent estimates of θ and γ and a valid asymptotic variance matrix of θ̂ and γ̂.
1.5.2 The Model with η_{g,j} = η_j : Asymptotics
The asymptotic properties of the CF estimates can again be obtained using the theorems stated for the initial model. For consistency and asymptotic normality of the CMLE in the first stage, I still use Th.1.1 and Th.1.2. Th.1.3 and Th.1.4 remain almost the same as before, except that now M = (l + 2)(G + 1) + 1 and v_i is the 1 × (G + 1) vector of generated regressors; more explicitly, v_i = (r_{0i}, r_{1i}, . . . , r_{Gi}). Using Th.1.3 and Th.1.4 with these changes, the asymptotic standard errors of the CF estimates in this special case can still be obtained from the matrix \widehat{Avar}(θ̂) = Â^{-1}D̂Â^{-1}/N, where Â and D̂ are defined as in the initial model. A MoM estimation in this case can be applied as described in subsection 1.5.1.
1.6 Hypothesis Testing
Economists are often interested in whether a particular research question holds, e.g., does limited English proficiency have an effect on the earnings of the immigrant worker population in the U.S.? To answer questions of this sort, economists set up a hypothesis test with a null hypothesis H_0, a statement often against the idea one would like to accept; an alternative hypothesis H_1, a statement one would like to show evidence for; and a test statistic whose distribution can be calculated under H_0. In this section, I devise a hypothesis testing framework with hypotheses expressed as a set of restrictions on model parameters and construct a test statistic based on the score test (a.k.a. Lagrange multiplier test) of Rao (1948).
In my framework, I could also utilize the Wald statistic. However, I choose the score statistic because the Wald statistic generally suffers from a lack of invariance to how nonlinear restrictions are formulated and can thereby yield different hypothesis test results. In addition, note that, due to the heterogeneous error terms in my estimating equations, the generalized information matrix equality (GIME) fails (i.e., the expected value of the outer product of the score function is not equal to a constant multiple of the expected value of the Hessian function). And, since GIME fails, the quasi-likelihood ratio statistic does not work in my framework.
For more on the drawbacks of the Wald statistic and the quasi-likelihood ratio statistic, refer to section 7.4 of Hayashi (2000) and section 12.6 of Wooldridge (2010).
Following the notation and conditions used in subsection 4.6 of Wooldridge (1994), consider the following null hypothesis:
H_0: c(\theta_o) = 0,   (1.33)
where c(θ) is a Q × 1 vector function of the M × 1 vector θ, and some constraints may be linear while others are nonlinear. The constrained CF estimator θ̃ solves the problem
\min_{\theta \in \Theta} \sum_{i=1}^{N} (y_i - m(X_i, v(d_i, z_i, \hat{\gamma}), \theta))^2 / 2 \quad \text{s.t.} \quad c(\theta) = 0,   (1.34)
where γ̂ is as in (1.17). In the assumption below, I state some conditions on c(θ) and on a mapping defined by the restrictions in H_0. Even though it is not necessary to know the explicit form of this mapping, it helps establish the first-order representation of θ̃.
• Assumption 1.4 (A.1.4): Assume that (a) Q ≤ M; (b) c(·) is continuously differentiable on int(Θ); (c) θ_o ∈ int(Θ) under H_0; (d) C(θ) ≡ ∇_θ c(θ) is the Q × M gradient of c(θ) with rank Q, and C(θ_o) is bounded in probability; (e) there exists a twice continuously differentiable mapping d: R^{M−Q} → R^M with θ_o = d(α_o) under H_0, where α_o is an (M − Q) × 1 vector in the interior of its compact parameter space A ⊂ R^{M−Q} under H_0; and (f) D(α) ≡ ∇_α d(α) is the M × (M − Q) gradient of d(α) with rank M − Q at α = α_o, and D(α_o) is bounded in probability.
White (1994, p. 138) gives a famous example of where expressing the restrictions in H_0 as θ = d(α) arises in econometrics: in simultaneous systems of equations with overidentifying restrictions, θ represents the parameters of the reduced form, α corresponds to the structural parameters, and d determines the relation between them. Hence, we can intuitively think of the mapping d in a similar way. Furthermore, note that the estimator of α_o, α̃, solves the problem
\min_{\alpha \in A} \sum_{i=1}^{N} (y_i - m(X_i, v(d_i, z_i, \hat{\gamma}), d(\alpha)))^2 / 2,   (1.35)
and the constrained CF estimator θ̃ is equal to θ̃ ≡ d(α̃). Now I state Theorem 1.5 (Th.1.5), which offers the score test and the score statistic, the Lagrange multiplier (LM) statistic. Th.1.5 is similar to the LM statistic in subsection 4.6 of Wooldridge (1994, p. 2668); however, into the theorem below I also incorporate the adjustment that takes into consideration the estimation of the nuisance parameter γ* in (1.18).
• Theorem 1.5 (Th.1.5): Let the definitions and conditions of Th.1.2, Th.1.4, and A.1.4 hold. Then under H_0: c(θ_o) = 0, (a)
\left( \sum_{i=1}^{N} s_i(\tilde{\theta}; \hat{\gamma}) \right)' A_o^{-1} C_o' \left[ C_o A_o^{-1} D_o A_o^{-1} C_o' \right]^{-1} C_o A_o^{-1} \left( \sum_{i=1}^{N} s_i(\tilde{\theta}; \hat{\gamma}) \right) / N \xrightarrow{d} \chi^2_Q
and (b) the LM statistic
LM_N \equiv \left( \sum_{i=1}^{N} s_i(\tilde{\theta}; \hat{\gamma}) \right)' \tilde{A}^{-1} \tilde{C}' \left[ \tilde{C} \tilde{A}^{-1} \tilde{D} \tilde{A}^{-1} \tilde{C}' \right]^{-1} \tilde{C} \tilde{A}^{-1} \left( \sum_{i=1}^{N} s_i(\tilde{\theta}; \hat{\gamma}) \right) / N
is asymptotically χ²_Q, where s_i(θ̃; γ̂) is as in Th.1.4 but evaluated at (θ̃, γ̂); A_o and D_o are as in Th.1.4; C_o ≡ C(θ_o) with C(θ) as in A.1.4; Ã = N^{-1} Σ_{i=1}^{N} H_i(θ̃; γ̂) with H_i as in Th.1.4; C̃ = C(θ̃); and D̃ = B̃ + F̃R̂F̃′ with B̃ = N^{-1} Σ_{i=1}^{N} s_i(θ̃; γ̂)s_i′(θ̃; γ̂), F̃ = N^{-1} Σ_{i=1}^{N} ∇_γ s_i(θ̃; γ̂), and R̂ = N^{-1} Σ_{i=1}^{N} r_i(γ̂)r_i′(γ̂) as in Th.1.4.
See appendix A for the proof of Th.1.5.
1.6.1 The Model with η_{g,j} = η_j : Hypothesis Testing
After CF estimation when η_{g,j} = η_j, hypothesis testing can be conducted in the same way as in the initial model: Th.1.5 still holds. As noted before, nothing changes in terms of Th.1.1 and Th.1.2. As for Th.1.3 and Th.1.4, now M = (l + 2)(G + 1) + 1, and v_i is the 1 × (G + 1) vector of generated regressors; more explicitly, v_i = (r_{0i}, r_{1i}, . . . , r_{Gi}), where r_{gi} is as in subsection 1.4.3. Using Th.1.3, Th.1.4, and Th.1.5 with these changes, LM_N is still asymptotically χ²_Q. A small sketch of assembling LM_N appears below.
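Before turning to the simulations, here is a small sketch of how LM_N in Th.1.5 can be computed once the restricted estimates and the matrices above are in hand. The inputs are assumed precomputed and the names are mine, not the chapter's.

import numpy as np

def lm_statistic(s, A, D, C):
    """s: N x M scores s_i(theta_tilde; gamma_hat); A: M x M (A_tilde);
    D: M x M (D_tilde = B_tilde + F_tilde R_hat F_tilde'); C: Q x M (C_tilde)."""
    N = s.shape[0]
    sbar = s.sum(axis=0)                   # sum of restricted scores
    Ainv = np.linalg.inv(A)
    mid = C @ Ainv @ D @ Ainv @ C.T        # Q x Q middle matrix
    lm = sbar @ Ainv @ C.T @ np.linalg.solve(mid, C @ Ainv @ sbar) / N
    return lm                              # compare to a chi2(Q) critical value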
, rGi ), where rgi is as in subsection 1.4.3. Using Th.1.3, Th.1.4, and Th.1.5 with these changes, LMN is still asymptotically χ2Q . 1.7 Simulations In this section, I present some simulation results that place CF and IV methods in section 1.4 side by side and note differences and similarities in terms of their asymptotic performances, specifically asymptotic efficiency, asymptotic unbiasedness, and consistency. The setups for the main model in section 1.2 and for the special case model in subsection 1.2.1 are similar to each other; however, each setup varies a little as I change the distribution of instrument in the latent variable equation and assume ηg,j = ηj for g, j = 0, 1, 2. For 29 the sake of computational simplicity, I adopt a scheme in which there is only one covariate in the counterfactual outcome equation, i.e., x = x and only one instrument in the latent variable equation, i.e., z = z. In addition, the treatment variable w takes on only three values, and each treatment group comprises at least about 30 percent for each simulation setting. Lastly, I introduce misspecification into the main model by ignoring an instrument in the latent variable equation and examine its consequences. 1.7.1 Data Generating Process In my simulation analysis, I used five different data generating processes (DGPs); one for the main model in section 1.2 with asymmetric instrument, one for the main model with symmetric instrument, one for the special case model in subsection 1.2.1 with asymmetric instrument, one for the special case model with symmetric instrument, and one for the main model with asymmetric instrument and misspecification. The setup for the DGP of the main model with asymmetric instrument is as follows: w ∈ {0, 1, 2} , dg = 1[w = g], g ∈ {0, 1, 2} , ag ∼ Gumbel(0, 1), g ∈ {0, 1, 2} , γ0 = 1, γ1 = 5, and, γ2 = 9, l0 = 1, l1 = 5, and, l2 = 3, z = z ∼ χ2 (2) − 2, wg∗ = lg + γg z + ag , g ∈ {0, 1, 2} , 30 w = g if f wg∗ ≥ wj∗ , ∀j 6= g and g, j ∈ {0, 1, 2} , eg ∼ N (0, 4), g ∈ {0, 1, 2} , η0,0 = 0.05, η0,1 = 0.10, and, η0,2 = 0.15, η1,0 = 3.05, η1,1 = 3.10, and, η1,2 = 3.15, η2,0 = 6.05, η2,1 = 6.10, and, η2,2 = 6.15, P2 P2 ug = j=0 ηg,j aj + [− j=0 ηg,j E(aj )] + eg , g ∈ {0, 1, 2} , x = x ∼ N (0, 1), α0 = 1, α1 = 2, and, α2 = 3, β0 = 6, β1 = 7, and, β2 = 8, yg = αg + xβg + ug , g ∈ {0, 1, 2} , and y = d0 y0 + d1 y1 + d2 y2 . For the main model with symmetric instrument, the DGP setup is very similar to the one above. However, I make the following modifications: l0 = 1, l1 = 5.2, and, l2 = 2, z = z ∼ N (0, 4), η1,0 = 0.55, η1,1 = 0.60, and η1,2 = 0.65. The setup for the DGP of the special case model with asymmetric instrument is as follows: 31 w ∈ {0, 1, 2} , dg = 1[w = g], g ∈ {0, 1, 2} , ag ∼ Gumbel(0, 1), g ∈ {0, 1, 2} , γ0 = 1, γ1 = 5, and, γ2 = 9, l0 = 1, l1 = 5, and, l2 = 3, z = z ∼ χ2 (2) − 2, wg∗ = lg + γg z + ag , g ∈ {0, 1, 2} , w = g if f wg∗ ≥ wj∗ , ∀j 6= g and g, j ∈ {0, 1, 2} , e ∼ N (0, 4), η0 = 0.05, η1 = 3.05, and, η2 = 6.05, + [− 2j=0 ηj E(aj )] + e, g ∈ {0, 1, 2} , P2 P ug = u = j=0 ηj aj x = x ∼ N (0, 1), α0 = 1, α1 = 2, and, α2 = 3, β0 = 6, β1 = 7, and, β2 = 8, yg = αg + xβg + u, g ∈ {0, 1, 2} , and y = d0 y0 + d1 y1 + d2 y2 . For the special case model with with symmetric instrument, the DGP setup is very similar to the one with asymmetric instrument above. However, I make the following modifications: 32 l0 = 1, l1 = 5.2, and, l2 = 2, z = z ∼ N (0, 4), η0 = 0.05, η1 = 0.55, and η2 = 6.05. 
Lastly, for the main model with an asymmetric instrument and misspecification in the latent variable equation, the DGP setup is very similar to the one without misspecification. However, I introduce an additional instrument into the latent variable equation and omit it from the first-stage MNL regression of the treatment variable on the instruments. Hence, I make the following modifications:
z = (z_1, z_2)′, z_1 ∼ χ²(2) − 2, z_2 ∼ χ²(2) − 2,
w*_g = l_g + γ_g z_1 + ϑ_g z_2 + a_g, g ∈ {0, 1, 2},
ϑ_0 = γ_1, ϑ_1 = γ_2, and ϑ_2 = γ_0,
where z_1 and z_2 are scalar instruments in the choice equation for w*_g, and ϑ_g is a scalar parameter associated with z_2 in w*_g for g ∈ {0, 1, 2}. (A code sketch of this modification is given at the end of this discussion.) Note that omitting, say, z_2 from w*_g is practically like putting it into the i.i.d. Gumbel distributed error term a_g, which creates the new error term a′_g = ϑ_g z_2 + a_g in the latent treatment equation. However, this new error term is not i.i.d. and is very likely not Gumbel distributed anymore, which violates the model condition that the error term in the latent treatment equation is i.i.d. Gumbel distributed. Once this model condition is violated, w does not follow an MNL distribution, and the CF terms used in (1.14) or (1.16) are not correct, so that E(ϵ|d, x, z) ≠ 0 in (1.14) and E(ε|d, x, z) ≠ 0 in (1.16). In short, it is this model condition violation resulting from misspecification that creates the large biases in CF estimates demonstrated in the simulation results later.
This is interesting in its own right because, for several two-stage estimators, similar misspecifications (i.e., excluding relevant instruments) of the first-stage equation (e.g., the choice/selection equation) would not cause much bias in the second-stage (e.g., outcome-stage) parameter estimates. In this regard, the most prominent estimator is two-stage least squares with excluded instruments; see McKenzie and McAleer (1994, p. 446) for an example. In addition, in a unified fashion, Pagan (1986) explores when two-stage estimators are consistent and when they are not. In proposition 3.2 of that paper, he shows that two-stage estimators based on a misspecified first-stage equation produce consistent second-stage parameter estimates when the excluded (and included) instruments in the first stage are not correlated with the included variables in the second stage. Even though this is the case in my simulation study, my simulation results below still suggest that CF estimates have large biases under misspecification, which makes my results even more interesting.
Regarding the DGP setups, note that γ_g and l_g both play a role in establishing the percentage of each treatment group in the simulations for g = 0, 1, 2. The γ_g's being sufficiently distant from each other is also critical for obtaining strong first-stage estimates; with γ_g's very close to each other, one can easily run into identification problems in the first-stage estimation. Having the η_{g,j}'s far away from each other, especially η_{g,j}'s from different treatment statuses, is another critical point for creating endogeneity in the main model for g, j = 0, 1, 2. As η_{g,j}'s from different treatment statuses get closer and closer to each other, we essentially approach the assumption that η_{g,j} = η_j, and the endogeneity issue in the main model is attenuated or perhaps nearly resolved, as pointed out in subsection 1.4.3.
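As promised above, here is a sketch of the misspecified-design modification to the earlier DGP, again my own code: z_2 shifts the latent utilities but is deliberately left out of the first-stage MNL.

import numpy as np

def latent_choice_misspecified(n, rng):
    gamma = np.array([1.0, 5.0, 9.0])
    vartheta = np.array([5.0, 9.0, 1.0])    # (gamma_1, gamma_2, gamma_0)
    l = np.array([1.0, 5.0, 3.0])
    z1 = rng.chisquare(2, n) - 2.0
    z2 = rng.chisquare(2, n) - 2.0          # omitted at the MNL stage
    a = rng.gumbel(0.0, 1.0, (n, 3))
    utilities = l + gamma * z1[:, None] + vartheta * z2[:, None] + a
    w = np.argmax(utilities, axis=1)
    return w, z1, z2                         # the first stage sees only z1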
The usage of an instrument z whose distribution is either asymmetric or symmetric in the latent variable equation is also important because, in the endogenous binary treatment case, Wooldridge (2008, p. 106; 2010, p. 947) argues that the IV method can be consistent for ATEs when the instrument is symmetrically distributed around zero. Hence, I include simulation schemes that take this possibility into consideration.
1.7.2 Simulation Results
I present my simulation results in two parts: first, asymptotic efficiency outcomes and, second, asymptotic unbiasedness and consistency outcomes. The simulation results reported in Tables A.1 through A.20 focus on comparing the CF method with the IV method in terms of asymptotic efficiency, asymptotic unbiasedness, and consistency. In Tables A.1 through A.20, I report the Monte Carlo (M.C.) estimates for α_g and ATE_{h,0}; the bias in the M.C. estimates for the ATEs; analytical standard errors for α_g and ATE_{h,0} in the CF method and uncorrected standard errors for α_g and ATE_{h,0} in the IV method (except in Tables A.17 through A.20); bootstrapped standard errors (BS. SEs) and Monte Carlo standard deviations (M.C. SDs) for α_g and ATE_{h,0}; and BS. SEs and M.C. SDs for the standard errors of α_g and ATE_{h,0}, for g = 0, 1, 2 and h = 1, 2. The analytical standard errors for α_g and ATE_{h,0} in the CF method are calculated as in section 1.5. The uncorrected standard errors for α_g and ATE_{h,0} in the IV method are obtained directly by following Procedure 1.1. And, when η_{g,j} = η_j for j, g = 0, 1, 2, I do not even need to correct the standard errors for α_g and ATE_{h,0} in the IV method; see appendix A for a more detailed explanation. In the simulations, I use the sample sizes n = 1000, n = 2000, n = 5000, and n = 10000 for each DGP setup, with the number of M.C. and BS. iterations always equal to 10000. As for notation, in Tables A.1-A.20, α̂_g is the parameter estimate for α_g, \widehat{ate}_{h,0} is the estimate for ATE_{h,0}, and bias(\widehat{ate}_{h,0}) is the bias in the estimate for ATE_{h,0} for g = 0, 1, 2 and h = 1, 2. Furthermore, se(α̂_g) is the standard error of the parameter estimate for α_g, and se(\widehat{ate}_{h,0}) is the standard error of the estimate for ATE_{h,0} for g = 0, 1, 2 and h = 1, 2. Since these tables would require a considerable amount of space in the main body of the chapter, I place all simulation tables of this chapter in appendix A. At this point, it is also important to recall the true values for α_g and ATE_{h,0} for g = 0, 1, 2 and h = 1, 2, since I refer to them often throughout this section. The true values are, respectively,
α_0 = 1, α_1 = 2, and α_2 = 3, ATE_{1,0} = 1, and ATE_{2,0} = 2.
1.7.2.1 Asymptotic Efficiency Outcomes
First, using the simulation results from Tables A.1 through A.4, I can check how close the standard error estimator proposed in section 1.5 comes to the simulated results. In Table A.1, the analytical standard errors of the CF parameter estimates are fairly close to the BS. SEs and especially to the M.C. SDs of the CF parameter estimates of both α_g and ATE_{h,0} for g = 0, 1, 2 and h = 1, 2, except for the analytical standard errors of the CF parameter estimates for α_1 and ATE_{1,0} in Tables A.1 through A.4 and those for α_2 and ATE_{2,0} in Tables A.5 through A.8. As the sample size increases, the analytical standard errors get even closer to the BS. SEs and M.C. SDs, and all shrink considerably in magnitude. For example, moving from a sample size of 1000 to 10000, the analytical standard error of α̂_0 decreases by 73% from .2749 to .0745, the BS. SE by 63% from .2106 to .0790, and the M.C. SD by 70% from .2543 to .0755.
A very similar pattern can be observed in Tables A.5 through A.16 as I change the distribution of the instrument z and/or assume η_{g,j} = η_j for g, j = 0, 1, 2. Hence, the analytical standard errors proposed in section 1.5 seem to work well and to approximate closely the standard deviations of most CF parameter estimates. As a side note, the uncorrected standard errors of the IV parameter estimates in Tables A.1 through A.8, where η_{g,j} ≠ η_j, are also not far from the BS. SEs and M.C. SDs of the IV estimates of both α_g and ATE_{h,0}. With increased sample size, the uncorrected standard errors shrink in magnitude and get closer to the BS. SEs and M.C. SDs of the IV estimates.
Second, from an efficiency standpoint, let's first consider the models with no misspecification. Here it is wiser to focus on Tables A.9 through A.16, since the IV method is inconsistent when η_{g,j} ≠ η_j, as indicated in Tables A.1 through A.8 by the large biases in α_g and ATE_{h,0} for g = 0, 1, 2 and h = 1, 2. In Tables A.9 through A.16, however, neither the CF method nor the IV method has biased parameter estimates. In Table A.9, the simulation results show that the analytical standard errors, BS. SEs, and M.C. SDs of the CF estimates are lower than the uncorrected standard errors, BS. SEs, and M.C. SDs of the IV estimates, respectively. Similarly, the BS. SEs and M.C. SDs of the standard errors of the CF estimates are also lower than those of the IV estimates. For instance, in Table A.9, the BS. SE of the CF parameter estimate α̂_0 is 4% lower than that of the IV estimate, and the M.C. SD of the CF parameter estimate is 5% lower. Furthermore, again in Table A.9, the BS. SE of the standard error of the CF parameter estimate α̂_0 is 12% lower than that of the IV estimate, and the M.C. SD of the standard error of the CF parameter estimate is 11% lower. Much the same pattern persists in Tables A.10 through A.16 as I switch the distribution of the instrument z from asymmetric to symmetric and/or increase the sample size. As a result, when there is no misspecification, the simulation results demonstrate that the CF method performs better than the IV method from the perspective of efficiency: the results suggest that the CF method estimates the parameters of interest, the ATEs, and their standard errors more precisely than does the IV method.
Now let's look at the models with misspecification in Tables A.17 through A.20. When the sample size is 1000, the simulation results regarding efficiency are very suggestive: the IV method has sharper estimates than the CF method for all parameters of interest, the ATEs, and (almost all of) their standard errors. For example, in Table A.17, the BS. SE of the IV parameter estimate α̂_1 is around 48% lower than that of the CF estimate, and the M.C. SD of the IV parameter estimate is about 66% lower. Moreover, again in Table A.17, the BS. SE of the standard error of the IV estimate \widehat{ate}_{1,0} is about 29% lower than that of the CF estimate, and the M.C. SD of the standard error of the IV estimate is around 55% lower. As the sample size goes up in Tables A.18 through A.20, the same pattern in favor of the IV method is still observed, and the BS. SEs and M.C. SDs of all estimates and of the standard errors of all estimates get smaller. For instance, the BS. SE (M.C. SD) of the IV parameter estimate α̂_2 goes down from 1.4534 (1.1045) in Table A.17 to .3257 (.3336) in Table A.20.
Consequently, when there is misspecification, the simulation results show that the IV method outperforms the CF method in terms of efficiency: the results suggest that the IV method estimates the parameters of interest, the ATEs, and their standard errors more precisely than does the CF method, which is just the opposite of the findings when there is no misspecification.
1.7.2.2 Asymptotic Unbiasedness and Consistency Outcomes
In a M.C. simulation, the average of the parameter estimates over a given number of iterations (conventionally 10000) is the M.C. simulation estimate of the expected value of those estimates. Therefore, if I repeat the M.C. simulation with a fixed number of iterations but increasing sample size and the M.C. estimates get closer and closer to the true parameter values as the sample size increases, this is suggestive of the asymptotic unbiasedness of the estimator in question. If, in addition, the standard errors of the parameter estimates get smaller and smaller on top of the estimates approaching the true parameter values, this is indicative of the consistency of the estimator in question.
First, in Tables A.1 through A.4, the simulation results show that the M.C. simulation estimates of both α_g and ATE_{h,0} from the CF method are very close to the true values, whereas the ones from the IV method are not close at all for g = 0, 1, 2 and h = 1, 2. For example, in Table A.1, the M.C. simulation estimates from the IV method for α_0, α_1, and α_2 are, respectively, 1.3394 (about 34% higher than the true value), 1.9078 (around 5% lower than the true value), and 2.9018 (around 3% lower than the true value) and are all off the true values, causing severe biases in the ATE estimates (about 43% too low in the estimated ATE_{1,0} and 22% too low in the estimated ATE_{2,0}). On the other hand, the M.C. simulation estimates of both α_g and ATE_{h,0} from the CF method in Table A.1 are not off the true values at all, with almost no biases. As the sample size increases from Table A.1 to Table A.4, the M.C. simulation estimates from the IV method do not improve on the biases; however, their BS. SEs and M.C. SDs get closer to zero, just like those from the CF method. A very similar pattern can also be seen in Tables A.5 through A.8 as I switch the distribution of the instrument z from asymmetric to symmetric and/or increase the sample size. As a result, the simulation results indicate that the CF method is asymptotically unbiased and consistent while the IV method is asymptotically biased and inconsistent, which supports the conjecture I made in subsection 1.4.1.
Second, in Tables A.9 through A.12, where η_{g,j} = η_j for g, j = 0, 1, 2, the simulation results demonstrate that the M.C. simulation estimates of α_g and ATE_{h,0} from both the CF method and the IV method are very close to the true values for g = 0, 1, 2 and h = 1, 2. For example, in Table A.9, the M.C. simulation estimates from both methods for α_0, α_1, and α_2 are all accurate up to two decimal places, with almost no biases in the ATE estimates. As the sample size increases from Table A.9 to Table A.12, the M.C. simulation estimates from both methods retain their accuracy with almost no bias, and their BS. SEs and M.C. SDs shrink in size and get closer and closer to zero. The same pattern is also seen in Tables A.13 through A.16 as I switch the distribution of the instrument z from asymmetric to symmetric and/or increase the sample size.
In conclusion, the simulation results demonstrate that, when η_{g,j} = η_j, both the CF method and the IV method are asymptotically unbiased and consistent, which corroborates another conjecture I made in subsection 1.4.3.
Under misspecification, the simulation results in Tables A.17 through A.20 indicate that the M.C. simulation estimates of α_g and ATE_{h,0} from both the CF method and the IV method are off the true values for g = 0, 1, 2 and h = 1, 2. For instance, in Table A.17, the M.C. simulation estimates from the CF method for α_0, α_1, and α_2 are, respectively, .9310 (about 7% lower than the true value), 1.5886 (around 21% lower than the true value), and 4.4840 (around 50% higher than the true value) and are all off the true values, causing severe biases in the ATE estimates (about 36% too low in the estimated ATE_{1,0} and 78% too high in the estimated ATE_{2,0}). As indicated earlier, the M.C. simulation estimates from the IV method are also not close to the true values. However, the biases in its estimates are lower than those in the estimates of the CF method, except in the estimates for α_0 and ATE_{1,0}. For example, in Table A.17, the M.C. simulation estimates from the IV method for α_1, α_2, and ATE_{2,0} are, respectively, 1.6833 (about 16% lower than the true value), 3.0697 (around 2% higher than the true value), and 1.7553 (around 12% lower than the true value), but these biases are all smaller than their counterparts from the CF method. As the sample size increases from Table A.17 to Table A.20, the M.C. simulation estimates from both the IV and CF methods do not improve on the biases; however, their BS. SEs and M.C. SDs get smaller. As a result, the simulation results indicate that, under misspecification, the CF method is asymptotically more biased than the IV method, and both methods are inconsistent.
1.8 Empirical Application
In this section, I illustrate the role of limited English proficiency (LEP) in determining the wages of Hispanic workers in the USA. To this end, I revisit the 1% Public Use Microdata Series (PUMS) of the 1990 U.S. Census and utilize a subsample constructed from the data used by Gonzalez (2005). My aim is simply to apply OLS, CF, IV, and nonparametric bound analysis to the estimation of how LEP influences the wages of Hispanic workers in the USA and to compare their performances, not to offer a detailed or decisive evaluation of the factors that explain the wages of Hispanic workers in the USA.
1.8.1 Background on the Economics of Language Skills
As a result of the growth of international immigrant flows into host destination countries after the Second World War, researchers started investigating immigrant behavior (e.g., how integrated immigrants were with the members of their host society and what factors were crucial to shortening immigrants' integration period). In studying immigrant adjustment, among other things, economists came to regard language abilities as a part of human capital (for example, see Carliner (1981) and Grenier (1984)) and were interested in questions such as: Would investing in language acquisition result in higher wages for immigrants? If so, how proficient should immigrants be in the destination language to really benefit from higher wages? As in many other host destination countries where English is the official language of communication, Hispanic workers' deficiencies in English can lead to negative consequences, particularly lower earnings, in the U.S. labor market.
Previous studies showed that Hispanic workers in the USA earn less than non-Hispanic Whites due to low levels of schooling (see Gwartney and Long, 1978), discrimination (see Reimers, 1983), general assimilation problems for immigrants born outside the USA (see Borjas, 1985), and LEP (see McManus et al., 1983).
In general, analyses of earnings are rooted in the human capital earnings function, and the earnings regressions derived from this function include the natural logarithm of earnings as the dependent variable and covariates such as education, experience (and its square), marital status, ethnicity, duration in the destination, and destination-language proficiency. Early studies from the 1980s used OLS to explore the impact of LEP on earnings in the USA. McManus et al. (1983) found that LEP curtails the usual wage increases associated with schooling and experience and provided evidence that Hispanic male workers in skill occupations, where wages are generally the highest, earn less than their domestic counterparts due to LEP. Reimers (1983) found that Puerto Ricans who do not speak and understand English well face wage penalties of 20%. Grenier (1984) found that the effect of LEP is a 14.6% decrease in wages, suggesting that language attributes greatly explain the mean wage differential between Hispanics and non-Hispanics. Kossoudji (1988) concluded that Hispanics suffer from the economic cost of LEP more than do Asians at every skill level, with earnings losses reaching up to 66% for sales workers and a within-occupational wage gap of 18.3% even in the service industry. Tainer (1988) found that each unit improvement in the English speaking abilities of Hispanics leads to a 17.4% rise in their annual earnings. Chiswick and Miller (2002) indicated that foreigners who are born in non-English-speaking countries but are fluent in English earn 14.4% more than those who are not fluent in English. Lewis (2011) found a wage penalty of 17% for immigrants who cannot speak English very well.
Other studies around the globe report similar findings for other countries: immigrants benefit from destination-language proficiency in terms of higher wages. Chiswick and Miller (1985) found that, in Australia, immigrants with poor English skills have earnings 4.7% lower than those with good English skills. Carliner (1981) showed that, in Montreal, Canada, immigrants who can speak neither English nor French earn 36.3% less than those who speak English monolingually. Dustmann (1994) pointed out that, in western Germany, male immigrants who speak German well or very well have earnings 6.9% higher than those who speak German badly or not at all, a considerable improvement in the earnings position of male immigrants proficient in German. Chiswick and Repetto (2000) reported that, in Israel, immigrant men who speak only Hebrew earn 20.8% more than those who cannot speak Hebrew at all. Lastly, Leslie and Lindley (2001) found that, in the UK, male immigrants who are fluent in English enjoy a wage premium of 16.9% over those with poor English skills. One weakness of these early studies is that they do not offer a solution to ability bias.
It is probable that workers with higher innate ability are more likely to earn higher wages, to invest in language capital, and to speak English (or another foreign language) more proficiently than workers with lower innate ability, which indicates that there may be some correlation between English skills and unobserved innate ability. Since unobserved innate ability is expected to affect both language skills and earnings, the OLS coefficient estimates on language skills might be biased. Concern over this weakness of OLS naturally led many researchers to utilize alternative estimation methods such as IV when analyzing the effect of language proficiency on earnings. Chiswick and Miller (1995) found that, in the USA, the effect of language fluency on immigrant earnings goes up from 16.9% to 57.1% when IV is used in place of OLS. Chiswick (1998) showed that, in Israel, immigrants who speak Hebrew on a daily basis as their only or primary language earn 35.08% (11%) more than those who do not when the IV (OLS) method is adopted. Shields and Price (2002) noted that, among male immigrants in the UK, fluency in English raises the mean occupational wage by 16.5% (8.9%) when the IV (OLS) method is used for estimation. Dustmann and van Soest (2002) provided evidence that, in Germany, male immigrants with good or very good German speaking abilities have a wage premium of 14% (5%) over those with intermediate or poor German speaking abilities when the IV (OLS) method is utilized. Budra and Swedberg (2012) pointed out that, in Spain, the influence of Spanish fluency on the earnings of male immigrants is about 27% based on a benchmark IV method, whereas the OLS estimate is only 4.8%. Finally, Di Paolo and Raymond (2012) studied immigrants in Catalonia, Spain, and their IV (OLS) estimates indicated that the monthly earnings of individuals who speak and write Catalan are about 18% (7.5%) higher than those of individuals who cannot.
Looking at the results of these studies, one can argue that representative premiums in immigrant earnings from fluency in the host destination language mostly lie between 5% and 20% based on OLS and between 20% and 50% based on IV. Hence, there is a considerable difference between the OLS and IV estimates of the earnings premium to host-destination-language proficiency: the OLS estimates are usually smaller than the IV estimates and can be considered underestimates of the true values. The observation that IV estimates are usually larger than OLS estimates is actually fairly common in the literature, and one possible explanation for this phenomenon is the dominance of downward measurement error bias (e.g., misclassification errors coming from self-reported language proficiency data) over upward unobserved heterogeneity bias (e.g., innate ability common to both host-destination-language abilities and immigrant earnings). For more on this, see Dustmann and van Soest (2002), Dustmann and Fabbri (2003), Bleakley and Chin (2004), and Yao and van Ours (2015). Although the majority of studies probing the impact of language abilities on immigrant earnings use OLS and IV approaches, some researchers have employed different methods to shed more light on the topic and to provide a new perspective with the help of recent developments in econometrics.
For instance, Dustmann and van Soest (2001) estimated the earnings equation (a random effects panel data model) of male immigrants in Germany by maximum likelihood, together with the equation for speaking proficiency (a random effects ordered probit model), and found that speaking fluency among male immigrants in Germany can result in a wage increase of 7.3 percentage points. Berman, Lang, and Siniver (2003) used the first-difference estimator in a fixed effects panel data framework to examine the effect of language acquisition on earnings and found that the earnings of male immigrants from the Soviet Union to Israel in 1994 who speak Hebrew very well are 23% higher than the earnings of those who cannot speak Hebrew at all. Dustmann and Fabbri (2003) combined a matching estimator based on the propensity score of being proficient in English with an IV estimator, and their results showed that proficiency in English leads to 35.6% higher incomes. In the spirit of Manski and Pepper (2000), Gonzalez (2005) utilized nonparametric bounds to analyze the effects of LEP on wages for Hispanic workers in the USA and concluded that, on average, the wage premium from developing English proficiency from "not at all" to "very well" is 39% under both monotone treatment response and monotone treatment selection. Chiswick, Lee, and Miller (2005) used a first-difference inertia model to analyze the earnings growth of adult male immigrants in Australia and found that male immigrants with proficiency in English have 23.9% higher income growth than those who are not proficient in English. Lastly, Aldashev, Gernandt, and Thomsen (2009) employed augmented OLS to examine the effects of better language skills on earnings for foreigners in West Germany by jointly modeling self-selection in participation and employment decisions (a double hurdle model) and self-selection in economic sector and occupation decisions (a bivariate probit model). They pointed out that, among high-skilled foreigners, the wage premium from speaking mainly German rather than their mother tongue is 26.9%. Overall, estimation methods alternative to OLS and IV also yield wage premiums, going up to 39%, for immigrants proficient in the host destination language. In conclusion, it is well accepted in the economics literature that improvement in host-destination-language abilities is associated with increases in earnings among adult male immigrants.
1.8.2 Data
The data source for my empirical analysis is the 1% Public Use Microdata Series (PUMS) of the 1990 U.S. Census, and I utilize a subsample constructed from the data used by Gonzalez (2005). There are 13 variables and 82250 observations in her original dataset. However, I keep only observations for individuals who reported a Hispanic first ancestry, were between 16 and 64 years old in 1989, earned at least the minimum hourly wage in 1990, and worked at least 48 weeks and at least 40 hours per week in 1989. Since year of entry into the USA is recorded in intervals (e.g., between 1950 and 1959) for Hispanic immigrants, and age at arrival in the USA is computed as age in 1989 minus the number of years since the upper bound of the year-of-entry interval, some observations have negative values for age at arrival in the USA. After dropping these observations from the dataset of Gonzalez (2005), the sample for my analysis has only 38779 observations. A hypothetical sketch of this sample construction is given below.
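The following pandas sketch mirrors the sample restrictions just described. The column names are illustrative placeholders, not the actual PUMS or Gonzalez (2005) variable names, and the age-at-arrival computation reflects my reading of the text.

import numpy as np
import pandas as pd

def build_sample(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical column names throughout
    df = df[df["hispanic_first_ancestry"] == 1]
    df = df[df["age_1989"].between(16, 64)]
    df = df[(df["weeks_worked_1989"] >= 48) & (df["hours_per_week_1989"] >= 40)]
    hourly = df["wage_salary_income_1989"] / (
        df["weeks_worked_1989"] * df["hours_per_week_1989"])
    df = df.assign(log_hourly_wage=np.log(hourly))
    df = df[df["log_hourly_wage"] >= 1.335]      # 1990 minimum wage in logs
    # age at arrival = age in 1989 minus years since the entry-interval upper
    # bound; US-born respondents have no entry year (NaN) and are kept
    arrival = df["age_1989"] - (1989 - df["entry_year_upper"])
    df = df.assign(age_at_arrival=arrival)
    return df[(arrival >= 0) | arrival.isna()]   # drop impossible negatives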
As in many applications using the Mincer earnings equation, my outcome variable is the natural log of hourly wage, where hourly wage equals wage or salary income in 1989 divided by the product of weeks worked last year and usual hours worked per week last year. In 1990, the minimum hourly wage was 1.335 in natural logs (about $3.80). I drop the observations for individuals who earned less than the minimum hourly wage in 1990 or worked less than 48 weeks or less than 40 hours per week in 1989 because these individuals are not regular workers. There could be unobserved or observed factors that are specific to these irregular workers, affect their earnings, and cannot be controlled for; this would confound my estimates, and dropping the observations for irregular workers eliminates that possibility.
The treatment data in Gonzalez (2005) are based on the answers to the survey question on "ability to speak English". Hence, it is a self-reported ordinal variable that might suffer from measurement error and, therefore, misclassification of English proficiency. Taking this deficiency of the treatment variable into account, I collapsed the original five-category treatment variable (i.e., speaks only English at home; speaks English very well; well; does not speak English well; not at all) into a new three-category treatment variable: (1-not well) does not speak English well or at all; (2-well) speaks English very well or well; and (3-very well) speaks only English at home. The purpose behind this simple recategorization of English proficiency is to reduce the possible measurement error problem in the treatment variable; see Espenshade and Fu (1997), Dustmann and van Soest (2001), and Bleakley and Chin (2004) for more on the benefits of combining language categories. The original employment status variable in Gonzalez (2005) had six categories: civilian employed, at work; civilian employed, with a job but not at work; unemployed; armed forces, at work; armed forces, with a job but not at work; and not in the labor force. In my analysis, I regrouped them into three categories: employed, unemployed, and not in the labor force.
As indicated in the previous subsection, the endogeneity problem in the wage equation (i.e., the possibility that workers with higher innate ability earn more and speak English better than workers with lower innate ability) has led several economists to use the IV method for investigating the impact of English proficiency on immigrant earnings. One commonly used instrument for English proficiency in the literature is immigrants' age at arrival; see, for example, Bleakley and Chin (2004), Bleakley and Chin (2010), Miranda and Zhu (2013), and Yao and van Ours (2015). This choice of instrument is driven by scientific studies on language acquisition, which provide evidence that young people's (e.g., children's) capacity to learn and use languages is generally higher than older people's (e.g., adults'). In line with this idea, Akresh and Akresh (2011) showed that, for children of Hispanic immigrants, each additional year spent in the USA leads to higher scores on passage comprehension, applied problems, and letter-word identification tests. On the other hand, age at arrival might not be a perfect instrument for English proficiency because it is argued that immigrants who come to a host destination country at a younger age might find it less costly to economically assimilate and acculturate to the host country.
For example, Gonzalez (2003) noted that Mexican and Latin American immigrants who arrive at younger ages in the USA earn higher wages as a result of completing more years of schooling. Moreover, the immigrant assimilation model of Eckstein and Weiss (2004) supports the idea that age at arrival might affect earnings through channels other than language: the model implies that immigrants learn more about the host country labor market as they spend more time in the host country, become better at deploying their human capital, and earn more. See Schaafsma and Sweetman (2001) and Borjas (1985, 1995) for more on the relationship between immigrant earnings and age at immigration and on the economic assimilation of immigrants. Despite some concerns over the quality of age at arrival as an instrument for language fluency among immigrants, it is a fairly established and commonly used instrument in the literature, and I believe the wide variety of control variables in my empirical analysis helps reduce these concerns to a minimum from a statistical point of view.
Specifically, in my analysis, I create a variable based on age at arrival in the USA that takes four possible values: 0 (US born), 1 (arrived as a child, 0 to 11 years old), 2 (arrived as a teenager, 12 to 17 years old), and 3 (arrived as an adult, 18 or older). This new variable, ordered by construction, is excluded from the second-stage earnings equation and used only in the first-stage language proficiency equation. By running a multinomial logistic regression in this first stage, I obtain the predicted probabilities of being in one of the three English speaking categories (i.e., very well, well, and not well), and these predicted probabilities are used as instruments for English proficiency in the second-stage earnings equation.
There are also several control variables I use in my regressions. Previous studies show that the effect of LEP may change as individuals' characteristics (e.g., their profession, education, etc.) vary. One such control variable is education, which takes values ranging from 0 to 20 in my dataset; Chiswick and Miller (2003) indicated that male immigrants in Canada with more years of schooling gain relatively more in earnings from being proficient in English than those with fewer years of schooling. Hence, it is possible that the effect of LEP on the earnings of Hispanic immigrants in the USA differs depending on their education level. I also create dummy variables for region of birth based on ancestry codes from the 1% PUMS, 1990 US Census. In total, there are seven such dummies: US born, Spanish, Mexican, Central American, South American, Puerto Rican, and Cuban Hispanics. I categorize all occupations into seven groups (i.e., managerial, technical, service, repair, operators, agriculture, and military) and create a dummy variable for each except agriculture. The occupation variables may yield interesting results because Berman, Lang, and Siniver (2003) found that Hebrew fluency has a larger effect on wages in skilled occupations than in unskilled occupations. Potential experience in years is another control variable; McManus et al. (1983) suggested that English deficiency imposes wage penalties on the earnings of Hispanic men in the USA, with the penalties increasing in potential experience. Gender enters my regressions as a dummy variable for female immigrants.
Lastly, I divide workers into four groups based on worker class codes from the 1% PUMS, 1990 US Census (i.e., employee of a private for-profit company, employee of a private nonprofit organization, government employee, and others, mostly self-employed) and create a dummy variable for each except others, mostly self-employed. For a compact version of the variable descriptions used in my analysis and summary statistics, see Table A.21 in appendix A.
Before moving to the regression results, I look at some of the characteristics of Hispanics in the sample and present, from a descriptive point of view, some interesting observations gathered from Table A.22, available in appendix A. The biggest treatment group is Hispanic immigrants who speak English well, which comprises about 61% of the whole sample. The other two treatment groups (i.e., those who do not speak English well and those who speak English very well) each make up about 20% of the sample. The average of log hourly wages is 2.18 and is more or less the same across all English proficiency levels, with a standard deviation of 0.5. In a descriptive sense, wages are increasing in English proficiency: Hispanics who have a better command of spoken English on average earn higher wages. Hispanics who speak English better have more schooling: average years of education increase from 7.8 among Hispanics with deficient spoken English to 12.79 among Hispanics with high proficiency in spoken English. Experience, in contrast, declines with English proficiency: Hispanics who have mastered spoken English tend to have less potential experience, falling from 23.35 years for those who do not speak English well to 15.92 years for those who speak English very well. In line with my expectations, the percentage of female Hispanics goes up from 27.73% among those who do not speak English well to 37.79% among those who speak English very well. Hispanics who speak English better are more likely to work in high-skilled occupations: the percentage of Hispanics working in managerial positions (as operators) increases (decreases) from 3.61% (41.34%) for treatment 1 to 23.69% (16.86%) for treatment 3. The percentage of Mexicans goes down from 62.23% among those who do not speak English well to 55.57% among those who speak English very well. Hispanics with a better mastery of spoken English tend to suffer less from unemployment: the unemployment rate among Hispanics who do not speak English well is 1.2 percentage points higher than among those who speak English very well. Finally, age is slightly decreasing in English proficiency: younger Hispanics are slightly more inclined to speak English better.
1.8.3 Regression Results
First, I present the first-stage regression findings for English speaking proficiency among Hispanics. Second, I move to the estimates from the second-stage regression for earnings among Hispanics. Lastly, I share the results from the nonparametric bound analysis. Before discussing these results, however, I must acknowledge some limitations of my analysis. One is that there may be selection bias in my estimates because of the non-randomly selected sample, which considers only those who earn some income. Another limitation is that, just as language proficiency is expected to produce increases in earnings, earning more (e.g., being able to save some money for language courses) could in turn cause improvements in language skills, which opens the door to a simultaneity problem.
Yet another limitation is that self-reported language measures might suffer from measurement errors that come from either individual respondents (i.e., exaggerating their language skills for personal reasons) or interviewers (i.e., misclassifying respondents' language ability), which can create a degree of subjectivity in my English proficiency variable. And lastly, as pointed out previously, the validity of immigrants' age at arrival as an instrument for English fluency has been questioned, especially in immigrant assimilation models. Overall, readers need to keep these limitations of my analysis in mind as I present my results.

Tables A.23 and A.24 available in appendix A report the estimated parameters on Hispanic workers' arrival age in the USA from a multinomial logit regression of English proficiency for several different models/specifications, along with their likelihood ratio chi-square test statistics of goodness of fit. In all the regressions, the outcome is a discrete variable of English proficiency that has three categories: not well, well, and very well, with not well being the base outcome. The arrival age, the critical predictor and instrumental variable included in these first stage regressions but excluded from the second stage regressions, takes four values: 0 for US born, 1 for arrived as a child (0 to 11 years old), 2 for arrived as a teenager (12 to 17 years old), and 3 for arrived as an adult (18 or older). As to the models/specifications, they all use the same multinomial logit model but with different sets of predictors. Models 1 and 2 have only arrival age included. Model 3 controls for both arrival age and education. Model 4 contains arrival age, education, and gender. Model 3a (4a) has exactly the same variables as Model 3 (4) except that arrival age is not included. Model 5 controls for arrival age, education, gender, and occupation. Model 6 includes arrival age, education, gender, occupation, and ancestry. Model 7 contains arrival age, education, gender, occupation, ancestry, and employment status. Model 8 controls for arrival age, education, gender, occupation, ancestry, employment status, and worker class, which is the full specification regression. Lastly, Models 5a, 6a, 7a, and 8a have exactly the same variables as Models 5, 6, 7, and 8, respectively, except that arrival age is not included in them. We can think of Model 8 as a language proficiency equation and of its predictors as the determinants of language proficiency. In all models, I use the whole sample with 38779 observations.

Even though I have shared only the estimated parameters on arrival age, I can make the others available upon request. However, to mention a few of the other parameter estimates (almost all of them statistically significant) from the full specification regression (Model 8), one more year of schooling causes the odds ratio for speaking English very well relative to not well to increase by about 34%, holding all other variables constant. In the USA, those Hispanic immigrants working in high skill occupations such as managerial positions and in the government sector have a significantly higher probability of speaking English better. The gender impact is negative, against my expectations, but statistically insignificant for females: being a female is associated with a little over a 5% decrease in the odds ratio for speaking English very well relative to not well, ceteris paribus.
However, this might simply be because female Hispanic immigrants have lower incentives to learn English. In the USA, Hispanic immigrants from Mexico and Puerto Rico are more likely to speak English more fluently. This supports the idea that immigrants can acquire speaking fluency through exposure to the host destination language via their relatives and friends already working and living in the USA.

In IV method, it is necessary that instruments be significantly correlated with the endogenous variable, English proficiency in my models. To check the quality of instruments, there are a few empirical ways: significance of the instruments in the model for the endogenous variable, the instruments' contribution to the explanatory power of the endogenous variable model, and the Sargan overidentification test; see, for instance, Bound, Jaeger, and Baker (1995) for more on instrument checks. I specifically use the first two ways to assess the quality of the instrument, age at arrival, for fluency in English. In all the models in Tables A.23 and A.24, the estimates on age at arrival are statistically significant at the 1% level and fairly stable for all English speaking categories, which is a positive indicator for the instrument. Furthermore, when I look at the χ2 statistics, it is obvious that, in all models, excluding age at arrival from the English proficiency equation results in a significant reduction in the explanatory power of the multinomial logistic regressions. Comparing Model 8 to 8a in Table A.24, I see that adding age at arrival into the English proficiency equation increases the explanatory power of the English proficiency model by about 9.4%. The improvements in explanatory power resulting from the inclusion of age at arrival are much higher in the other regressions in Tables A.23 and A.24. Hence, the instrument age at arrival passes the quality check in terms of its correlation with English proficiency.
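A hedged sketch of this second relevance check, a likelihood ratio test comparing the multinomial logit with and without age at arrival (Model 8 versus Model 8a in spirit), is below. The synthetic data and variable names are illustrative, not my estimation sample.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
arrival_age = rng.integers(0, 4, n)
educ = rng.integers(0, 21, n)
latent = -0.8 * arrival_age + 0.15 * educ + rng.gumbel(size=n)
ep = np.digitize(latent, np.quantile(latent, [1/3, 2/3]))

full = sm.MNLogit(ep, sm.add_constant(np.column_stack([arrival_age, educ]))).fit(disp=False)
restricted = sm.MNLogit(ep, sm.add_constant(educ)).fit(disp=False)   # drops arrival age

lr = 2 * (full.llf - restricted.llf)                 # LR statistic
df = full.params.size - restricted.params.size       # restrictions across all outcome equations
print(f"LR chi2({df}) = {lr:.1f}, p-value = {stats.chi2.sf(lr, df):.4g}")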
Now let's consider the ATE results in Tables A.25 and A.26 available in appendix A, which report the estimated ATEs of English proficiency on log hourly wages among Hispanic immigrants in the USA and their standard errors in parentheses; the dependent variable in all the second stage regressions is log hourly wages. I use three different estimation methods (i.e., CF, IV, and OLS) and several different specifications in order to compare the performance of these estimation methods. CF is the control function estimation with the control function terms in Procedure 1.2. IV is the instrumental variables estimation in Procedure 1.1. Since English proficiency has three levels (i.e., not well, well, and very well), I create binary English proficiency indicators for each level and label them ep1, ep2, and ep3, respectively. The ep2 − ep1 and ep3 − ep1 in Tables A.25 and A.26 denote the differences between the corresponding estimates on the binary English proficiency indicators and are simply the estimated ATEs, owing to the use of demeaned control variables in the models. In IV regressions, I employ the predicted probabilities from the first stage regressions (e.g., those probabilities from Model 1 in Table A.23) as instruments for the binary endogenous English proficiency indicators (e.g., instruments in the IV regression under Model 1 in Table A.25). As for the control variables in the models, Model 1 has no exogenous variables controlled for. Model 2 introduces only potential experience, which is entirely excluded from the first stage regressions. Model 3 controls for both potential experience and education. Model 4 includes potential experience, education, and gender. Model 5 controls for potential experience, education, gender, and occupation. Model 6 contains potential experience, education, gender, occupation, and ancestry. Model 7 includes potential experience, education, gender, occupation, ancestry, and employment status. Lastly, Model 8 controls for potential experience, education, gender, occupation, ancestry, employment status, and worker class. The standard errors in CF regressions come from the analytical formula in Th.1.4, and the standard errors in IV regressions are bootstrapped. Model 8 in Table A.26 is the full specification model and, therefore, can be thought of as an earnings equation with its predictors as the determinants of earnings. As in the first stage, in all models I use the whole sample with 38779 observations. To avoid clutter in Tables A.25 and A.26 due to the large number of control variables used in the models, I have shared only the estimated ATEs in these tables. However, I can make the full results available upon request.

By discussing the results of the regressions in Tables A.25 and A.26 that relate English proficiency to earnings of immigrant Hispanics in the USA, my goal is to be able to say something about whether deficiency in speaking English negatively influences earnings for immigrant Hispanic workers in the USA, that is, whether the average Hispanic immigrant who improves his or her English speaking skills ends up with a higher wage than he or she would have earned without that improvement. After controlling for experience and education variables in Model 3, CF estimates for ATEs start getting smaller in size and significant. Especially after adding occupation, ancestry, and employment status variables into the models one by one, CF estimates for ATEs shrink greatly in magnitude, and the CF estimates for ep3 − ep1 (the leap from speaking English not well to very well) all become statistically significant, which indicates the significant contribution of these control variables to the earnings models. OLS estimates are also very stable and statistically significant in all models, and shrink in magnitude as I control for more and more determinants of earnings. For example, CF estimates from Model 8 in Table A.26 reveal that the wage increase from speaking English not well to well is around 30%, and 79% from speaking English not well to very well. The same wage premiums from OLS are about 12% and 22%, respectively. On the other hand, IV estimates do not perform well in these earnings models of Tables A.25 and A.26. IV captures a positive effect of one treatment (the leap from speaking English not well to well) on earnings but a negative effect of the other treatment (the leap from speaking English not well to very well) on earnings, which is a conflicting result. The estimated ATEs from IV are also all insignificant in Models 6 through 8, which I favor for the sake of a more complete earnings model because Models 6 through 8 include extra control variables frequently used in the literature. In Tables A.27 through A.32 available in appendix A, I also add the results from nonparametric bound analysis in the sense of Manski and Pepper (2000).
Assuming both monotone treatment response (MTR) and monotone treatment selection (MTS), this nonparametric bound analysis provides identification regions for the ATEs. When drawing conclusions about the returns to English proficiency on log hourly wages among Hispanic immigrants in the USA, the MTR assumption means that wages among Hispanic immigrants in the USA increase as a function of English proficiency levels, ceteris paribus. On the other hand, the MTS assumption states that Hispanic immigrants with higher levels of English proficiency have weakly higher mean wage functions than do those with lower levels of English proficiency. As shown in Manski and Pepper (2000) and Gonzalez (2005), combining the MTR and MTS assumptions produces tighter identification regions with smaller upper bounds on the returns to schooling and English proficiency. Hence, I follow their approach and construct the upper and lower bounds of the estimated ATEs of English proficiency on log hourly wages among Hispanic immigrants in the USA using the combined MTR and MTS assumptions. The MTR+MTS bounds in Tables A.27 through A.32 are calculated based on the inequalities (21) in Manski and Pepper (2000). In these tables, I essentially report the estimated ATEs of English proficiency on log hourly wages among Hispanic immigrants in the USA, their standard errors in parentheses (for CF, IV, and OLS estimates only), the estimated nonparametric bounds for the ATEs, and their 95% confidence intervals in brackets. As in Tables A.25 and A.26, the dependent variable is log hourly wages. I use four different estimation methods (i.e., CF, IV, OLS, and nonparametric bounds) and the full specification model (i.e., Model 8 in Table A.26) in order to compare the performance of these estimation methods. CF and IV estimations are as described in Procedure 1.2 and Procedure 1.1, respectively. As in Table A.26, the ep2 − ep1 and ep3 − ep1 in Tables A.27 through A.32 are simply the estimated ATEs, owing to the use of demeaned control variables in the models. Again, in IV regressions, I employ the predicted probabilities from the first stage regressions (e.g., those probabilities from Model 8 in Table A.24) as instruments for the binary endogenous English proficiency indicators (e.g., instruments in the IV regression under Model 8 in Table A.26). As for the control variables in CF, IV, and OLS, I control for potential experience, education, gender, occupation, ancestry, employment status, and worker class. The standard errors in CF regressions come from the analytical formula in Th.1.4, and the standard errors in IV regressions are bootstrapped. The 95% confidence intervals are also bootstrapped. In Table A.27, I use the whole sample with 38779 observations. In Tables A.28, A.29, A.30, A.31, and A.32, I turn to the subsamples of males, females, operators, repair workers, and service employees with 25568, 13211, 9622, 6209, and 5417 observations, respectively. For the sake of tidiness in Tables A.27 through A.32, I have chosen to share only the estimated ATEs in these tables. However, I can make the full results available upon request.
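For readers who want the mechanics, here is a minimal Python sketch of the bound computation, written from my reading of the MTS+MTR result in Manski and Pepper (2000, Proposition 2): for each proficiency level t, E[y(t)] is bounded by re-weighting the observed conditional means, and the ATE upper bound is the upper bound for t minus the lower bound for the base level, with the ATE lower bound equal to zero under MTR. The function name and the synthetic data are hypothetical.

import numpy as np

def mtr_mts_bounds(y, w, levels):
    """Bounds on E[y(t)] for each treatment level t (levels in increasing order)."""
    p = np.array([np.mean(w == t) for t in levels])      # P(w = t)
    m = np.array([y[w == t].mean() for t in levels])     # E(y | w = t)
    lb, ub = [], []
    for i, _ in enumerate(levels):
        lb.append(np.sum(m[:i] * p[:i]) + m[i] * p[i:].sum())
        ub.append(m[i] * p[:i + 1].sum() + np.sum(m[i + 1:] * p[i + 1:]))
    return np.array(lb), np.array(ub)

rng = np.random.default_rng(0)
n = 5000
w = rng.integers(0, 3, n)                    # proficiency level 0, 1, 2 (illustrative)
y = 2.0 + 0.2 * w + rng.normal(0, 0.5, n)    # log hourly wage (illustrative)

lb, ub = mtr_mts_bounds(y, w, levels=[0, 1, 2])
for t in (1, 2):
    print(f"ATE({t},0) in [0, {ub[t] - lb[0]:.3f}]")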
In Table A.27, the CF estimate for ep2 − ep1 (.30 but statistically insignificant) is within the estimated nonparametric bounds, but the CF estimate for ep3 − ep1 (.79 and statistically significant) is well over the bounds. OLS estimates are all statistically significant and well within the estimated nonparametric bounds. However, the lower bound estimates are all zero and therefore not informative, in the sense that they do not narrow the expected lower bounds, which are positive according to human capital theory. Specifically, the estimated bounds for ep2 − ep1 are 0 and .32, and those for ep3 − ep1 are 0 and .43. These bounds in Table A.27 nevertheless suggest that the wage premiums from speaking English better are positive. IV estimates do not perform well: They are all statistically insignificant in Table A.27.

In Tables A.28 through A.32, I investigate how the penalties imposed by LEP (or the gains resulting from improvements in English proficiency) may vary across gender and occupation. In Tables A.28 and A.29, the bounds on ATEs are relatively similar for men and women, with a slightly higher upper bound for ep3 − ep1 among men. However, this is not the case for CF and OLS estimates: They reveal that the wage premiums due to improvements in English proficiency are considerably lower among women. For example, CF estimates in Tables A.28 and A.29 show that the wage increase from speaking English not well to well (not well to very well) among men is over 50% (84%) more than that among women. The same applies to the OLS findings as well. Even though the upper bounds on ep3 − ep1 are smaller than the CF point estimates, the empirical results suggest that CF estimates (and OLS estimates) detect the wage inequality between men and women.

In Tables A.30 through A.32, I look at different occupations (i.e., operators, repair workers, and service employees) and explore whether heterogeneity in their use of language may lead to differences in wage premiums from speaking English better among Hispanic immigrants in the USA. All the results from CF, OLS, and the nonparametric bound analysis imply that the wage gains due to improvements in English proficiency vary greatly across occupations. For instance, the highest wage increases from speaking English not well to very well are observed in managerial and repair occupations (137% and 114% by CF estimation, respectively), whereas the lowest wage premium values from speaking English not well to very well are attained in service occupations (34% by nonparametric upper bounds). The IV estimates are all statistically insignificant.

Overall, only CF, OLS, and nonparametric bound analysis produce statistically significant results in line with the literature. CF estimates for ATEs are well above the OLS estimates and generally outside the nonparametric bounds, which indicates that OLS estimates might be biased downwards and that CF method overestimates, especially when the sizes of the treatment groups are imbalanced. However, as noted before, earlier studies provide substantial evidence for this overestimation: Estimation methods that explore the effect of English proficiency on earnings and control for endogeneity and measurement error produce estimates greater in magnitude than does OLS. In addition, nonparametric bound analysis has its own disadvantage: Its lower bound estimates for ATEs are always zero. Therefore, the CF method I propose offers an attractive alternative estimation method to the ATE literature and outperforms IV in this empirical application.

1.9 Conclusion

In this chapter, I introduce an econometric model with a discrete multivalued endogenous treatment variable and show how to consistently estimate ATEs by a three step CF estimation procedure in such a model.
In addition, I show that the asymptotic distribution of the CF estimates is normal and that the CF estimates are √N-consistent. I propose a consistent estimator for the asymptotic variance matrix of CF estimates, which takes into consideration the nonlinear first stage estimation. Using GMM, I also indicate how one can consistently estimate ATEs and obtain valid standard errors for the parameters of interest. I offer a hypothesis testing framework with hypotheses expressed as a set of restrictions on model parameters and construct a Lagrange multiplier statistic which is asymptotically χ2.

I also demonstrate how CF method can be applied to one special case: the model with fixed correlation between counterfactual error terms and latent model errors. As expected, the asymptotic distribution of the CF estimates still follows a normal distribution, and the CF estimates are still √N-consistent in this special case. A consistent estimator for the asymptotic variance matrix of CF estimates in this case still follows the conventional sandwich form, and the GMM solution follows the same structure proposed for the more general case.

In my simulation analysis, I compare CF method with IV method. The simulation results suggest that, under no misspecification, CF method is asymptotically unbiased and consistent and can be more efficient than IV method, whereas IV method is generally asymptotically biased and inconsistent. Therefore, the simulation results also indicate that, without misspecification, CF method consistently estimates ATEs while IV method often cannot. On the other hand, when the correlation between counterfactual error terms and latent model errors is constant, i.e., when counterfactual errors are homogeneous, the simulation results indicate that, under no misspecification, both CF method and IV method are asymptotically unbiased and consistent, with CF method being slightly more efficient. After introducing some misspecification, the simulation results show that IV method outperforms CF method in terms of efficiency and that both CF and IV methods have biased estimates. However, biases in IV estimates are generally lower than those in the estimates of CF method.

In my empirical application, I illustrate the role of limited English proficiency (LEP) in determining wages of Hispanic workers in the USA. Utilizing age at arrival as an instrumental variable, OLS, CF, and nonparametric bound analysis all indicate that LEP on average imposes a statistically significant wage penalty on immigrant Hispanic workers in the USA. In line with the existing literature, CF estimates are greater in magnitude than the OLS estimates, and nonparametric bound analysis provides uninformative lower bounds. IV mostly produces insignificant results or results that run against expectations.

In the future, it may be worth further researching CF method and its large sample properties in a discrete multivalued endogenous treatment model with a nonlinear second stage (outcome) equation. We can theoretically examine how the nonlinearity embedded in the first stage (choice) equation can improve identification in a discrete multivalued endogenous treatment model with weak instruments. Furthermore, under the current model, we can systematically compare CF method to IV method in terms of their asymptotic efficiency when counterfactual errors are homogeneous and can explore how to estimate local average treatment effects and quantile effects.
Lastly, we can theoretically study the large sample properties of CF and IV methods in a high dimensional (i.e., settings where the sample size is less than the number of parameters to be estimated) discrete multivalued endogenous treatment model.

CHAPTER 2

ESTIMATION AND INFERENCE FOR MULTIVALUED ENDOGENOUS TREATMENT EFFECT MODELS WITH CORRELATED RANDOM COEFFICIENTS

2.1 Introduction

When the parameter of interest in an economic model changes in a population, economists turn to random coefficient (RC) models. For example, the effect of tutoring on exam performance may be quite different across students: some students with decent exam preparation under their belt can greatly benefit from tutoring, and for some students who are totally lost in class, tutoring can really be a waste of time and energy. The regression models capturing this idea of RCs date back as early as the 1950s. In his econometrics textbook, Klein (1953, p. 216) points out the inability of linear regression equations (especially those with a limited number of covariates) using cross sectional data to decipher the differences among people in their responses to outcomes, and he then suggests the usage of RC models to take these differences truly into account. However, RC models came into mainstream economics with the work of Zellner (1969) on the aggregation problem in models with random coefficients; see Swamy and Tavlas (2001) for a broad summary of random coefficient models. Swamy (1970) offers a consistent and asymptotically efficient estimator for the mean of RCs in a panel data setting and applies the theoretical findings to the analysis of annual gross investment of firms. Again using panel data, Swamy and Mehta (1977) estimate a demand model for liquid assets in which the effect of time deposits, demand deposits and savings, and loan association shares varies by both time and state in the USA (in addition to showing the asymptotic properties of estimators for the mean of RCs and their variance-covariance matrix). In nonlinear settings, Bjorklund and Moffitt (1987) develop an RC self-selection model for the effect of activities such as education, training, and unions on wages and apply their model to the government manpower training program in Sweden to show the heterogeneity in wage gains from the program. Akin, Guilkey, and Sickles (1979) extend the ordered response probit model to the RC probit model, where the coefficients are allowed to be random in the latent variable equation, and examine family moving decisions using the Panel Study of Income Dynamics survey data. As explained by Heckman and Robb (1985b, p. 173), RC models also have links to switching regression models, with which union/nonunion wage gaps are estimated, as surveyed in Lewis (1983), or college/high school wage differentials are explored (among other things), as in Willis and Rosen (1979).

RC models have been used intensely in human capital research, giving informative insight into the understanding of, for example, the relationship between economic earnings and schooling and how this relationship varies across individuals. Becker and Chiswick (1966), Chiswick and Mincer (1972), and Chiswick (1974, ch. 3), for instance, are influential studies on earnings functions relating personal earnings to schooling and other employment variables. Traditionally, it is generally assumed in these studies that the economic return changes across people but is independent of (or uncorrelated with) the level of schooling.
Further examples come from Becker (1967) and Mincer (1974, chs. 2 and 3), who use models in line with this assumption.

Correlated random coefficient (CRC) models come into play on occasions where researchers would like to allow for at least some correlation between an RC of interest and the variable of interest through some unobservables. As Wooldridge (2015, p. 430) mentions, CRC models can also allow for both heterogeneous treatment effects and self-selection into treatment. Heckman and Vytlacil (1998) bring the usage of CRC models to economists and specifically mention that CRC models, compared to RC models, can be more consistent with empirical observations and economic theory by relaxing the assumption of no correlation between the variable of interest and its rate of return; see Rosen (1977, pp. 14 and 17) for a summary of the relationship among schooling, ability, and earnings and the problems associated with extracting the marginal effect of schooling on earnings. They develop the classic RC model as in Becker and Chiswick (1966) and turn it into a CRC model in which some unmeasured ability/motivation factors (and observed characteristics) can influence the return to schooling and can also be correlated with the level of schooling, creating the correlation between an RC of interest and the variable of interest. Heckman and Vytlacil (1998) use a two-step estimator of the average return to schooling in their wage equation, whereas Card (2001), who summarizes different methods of causal modeling of the return to education, inclusive of RC models, and Meghir and Palme (2001) prefer using the instrumental variables (IV) method.

Contrary to the popularity of estimating the ATE by IV method with continuous treatment effects (see, for example, Moffitt (1999) and Wooldridge (2003) for arguments on the grounds of robustness), Wooldridge (2008) explains a case where IV estimation can produce inconsistent estimates in the framework of a CRC model with an endogenous binary treatment. This is because, in a CRC model, the binary treatment variable and the unobserved heterogeneity factors in the random coefficients are allowed to be correlated, and the binary treatment variable interacts with the unobserved heterogeneity factors. In a CRC model with multiple endogenous treatments, Wooldridge (2003, p. 191) points out that the consistency condition of conventional IV does not hold when treatment variables are discrete. In the presence of heterogeneity and endogenous treatment, Heckman and Li (2004) also mention that traditional IV method fails to deliver the true average treatment effect (ATE) of schooling on earnings and employ a semi-parametric method for estimating treatment effects. Hence, in a CRC model with discrete endogenous multivalued treatments, conventional IV method is generally expected to be inconsistent for ATEs because of the existence of CRCs (if there are also heterogeneous counterfactual errors in the structural equation as in Chapter 1, then the inconsistency of IV can get even worse). Control function (CF) method naturally comes up as an alternative estimation method to IV. For instance, after pinpointing the drawbacks of ordinary least squares (OLS) and IV methods, Gebel and Pfeiffer (2007) estimate average returns to education in the West German labor market using CF method in a CRC setting.
In the continuous treatment case, Amann and Klein (2012) use CF method to estimate the ATE of tenure on hourly wages in Germany within a CRC framework, allowing heterogeneous returns to tenure across individuals and feedback between these returns and the tenure decision. Unfortunately, in discrete treatment cases, CF method has received little to no attention under the framework of CRC models.

In this chapter, I extend my work from Chapter 1 to the CRC framework. I focus on estimating ATEs in a discrete multivalued endogenous treatment model with CRCs and heterogeneous counterfactual errors and investigate the behavior of both CF and IV methods comparatively in this setting. This has not been studied, to the best of my knowledge, and is my main contribution to the literature. Specifically, in this chapter, I suggest a consistent CF estimator for the ATEs and show the asymptotic properties of CF parameter estimates in a discrete multivalued endogenous treatment model with CRCs and heterogeneous counterfactual errors. Using a simulation analysis, I also show that, without misspecification, IV method is generally asymptotically biased and inconsistent to a great degree whereas CF method is not. However, when misspecification is introduced, my simulation findings suggest that IV method performs better than CF method when it comes to unbiasedness. As for efficiency (with or without misspecification), the findings from the simulations show that neither IV method nor CF method is necessarily more efficient than the other.

The rest of this chapter is organized as follows. In section 2.2, I introduce the model. In section 2.3, I derive the estimating equations for both CF and IV methods and propose procedures to estimate the parameters of interest and the ATEs for both methods. In section 2.4, I show the asymptotic properties of CF estimates, propose a consistent estimator for the asymptotic variance matrix of CF estimates, and show how a GMM framework can be set up for the main problem. In section 2.5, I share some simulation results. In section 2.6, I conclude. And, in appendix B, I share the derivations and simulation tables that are omitted from the main body of this chapter.

2.2 The Model

Consider the following model with CRCs:

y_g = m_g + x b_g + u_g
w_g^* = z γ_g + a_g,    (2.1)

where y_g is the g-th counterfactual outcome variable, m_g is the scalar random coefficient in the counterfactual outcome equation for y_g, x ≡ (x_1, x_2, ..., x_l) is the 1 × l vector of exogenous variables in y_g, b_g is the l × 1 vector of slope random coefficients in y_g, u_g is the counterfactual error in y_g, w_g^* is the latent treatment variable that determines the choice of treatment status among G + 1 alternative treatment statuses, z ≡ (z_1, z_2, ..., z_k) is the 1 × k vector of instruments that includes a constant term in the choice equation for w_g^*, γ_g is the k × 1 vector of parameters in w_g^*, and a_g is the scalar error term that is independently and identically Gumbel distributed (i.i.d.) with location parameter μ = 0 and scale parameter β = 1 in w_g^* for g = 0, 1, ..., G. Note that (2.1) is almost the same as the counterfactual outcome equation together with the choice equation from Chapter 1; however, here I incorporate CRCs into the model.

As in Chapter 1, let w ∈ {0, 1, ..., G} be the observed discrete multivalued endogenous treatment variable whose values are determined by w_g^* for g = 0, 1, ..., G. One common interpretation of w_g^* is to think of it as the utility or satisfaction obtained from treatment status g.
Let the treatment statuses of w be exhaustive and mutually exclusive. Define binary treatment status indicators d_g = 1[w = g] for g = 0, 1, ..., G. So the binary treatment status indicator d_g is equal to one if the treatment status is equal to g and zero otherwise. This, coupled with the mutual exclusivity of treatment statuses, implies that Σ_{g=0}^{G} d_g = 1. Define the 1 × (G + 1) vector of treatment statuses d ≡ (d_0, d_1, ..., d_G).

Let y be the observed outcome. Then, I can write

y = d_0 y_0 + d_1 y_1 + · · · + d_G y_G,    (2.2)

where y_g is the g-th counterfactual outcome for g = 0, 1, ..., G.

Having described the discrete multivalued endogenous treatment model with CRCs above, I now make a series of assumptions that complete the model and are used in estimation. Notice that some of these assumptions are the same as the ones used in Chapter 1. First, I assume that rational economic agents choose the treatment status from which they receive the most satisfaction out of all possible treatment statuses. That is,

• Assumption 2.1 (A.2.1): One chooses treatment status g, i.e., w = g, if and only if w_g^* ≥ w_j^* ∀j ≠ g for g, j = 0, 1, ..., G.

Second, I assume that identification of the model in (2.1) and (2.2) is aided by the exclusion of some (at least one) variables in the set of instruments z from the set of exogenous variables in x. This exclusion restriction is encouraged for the estimation and identification to be more convincing and reliable, even though nonlinearity in estimation suffices for identification, especially when the exogenous variables in z vary enough in the sample. The set of exogenous variables in x can all be included in the set of instruments z.

• Assumption 2.2 (A.2.2): Identification of the model described by (2.1) and (2.2) is strengthened by exclusion of at least one variable in z from the set of variables in x.

As in Chapter 1, the identification argument is based on both the exclusion restriction(s) and the nonlinearity that describes the conditional probability of treatment status g as a function of the set of instruments z. In addition, as shown by McFadden (1973), under the model in (2.1) and (2.2) the assumptions made so far allow the treatment variable w to follow a multinomial logit model with choice probabilities given as follows:

P(w = g | x, z) = P(w = g | z) = exp(z γ_g) / Σ_{r=0}^{G} exp(z γ_r),    (2.3)

for g = 0, 1, ..., G.
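McFadden's result in (2.3) is easy to verify numerically. The following small Python check, not part of the dissertation and using purely illustrative parameter values, draws i.i.d. Gumbel(0, 1) errors, applies the utility-maximizing rule in A.2.1, and compares the empirical choice frequencies with the multinomial logit probabilities.

import numpy as np

rng = np.random.default_rng(0)
n, G = 200_000, 2
z = 0.5                                  # a fixed value of the (scalar) instrument
gamma = np.array([0.0, 0.4, -0.3])       # illustrative gamma_g for g = 0, 1, 2

a = rng.gumbel(loc=0.0, scale=1.0, size=(n, G + 1))
w_star = z * gamma + a                   # latent utilities w*_g = z*gamma_g + a_g
w = w_star.argmax(axis=1)                # A.2.1: choose the status with highest utility

logit = np.exp(z * gamma) / np.exp(z * gamma).sum()   # formula (2.3)
freq = np.bincount(w, minlength=G + 1) / n
print("MNL probabilities:", logit.round(3))
print("simulated shares :", freq.round(3))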
The next assumption is essential to the CF estimation, which I describe in section 2.3, since this assumption coupled with the multinomial logit specification of the treatment variable w will play a role in creating the CF terms that account for the endogeneity in w.

• Assumption 2.3 (A.2.3): E(u_g | x, z, a) = E(u_g | a) = Σ_{j=0}^{G} η_{g,j} a_j + [−Σ_{j=0}^{G} η_{g,j} E(a_j)], where u_g is the counterfactual error in y_g, x is the 1 × l vector of exogenous variables in y_g, z is the 1 × k vector of instruments that includes a constant term in the choice equation for w_g^*, a ≡ (a_0, a_1, ..., a_G) is the 1 × (G + 1) vector of i.i.d. Gumbel distributed errors a_j with location parameter μ = 0 and scale parameter β = 1 in w_j^*, η_{g,j} is a scalar multiple of the correlation coefficient between u_g and a_j, and E(a_j) = 0.5772 is Euler's constant for j, g = 0, 1, ..., G.

Bourguignon, Fournier, and Gurgand (2007) refer to A.2.3 as Dubin and McFadden's linearity assumption, since the conditional expectation of the counterfactual error u_g given all Gumbel distributed errors a is linear in a for g = 0, 1, ..., G. A.2.3 also implies that, conditional on a, x and z are redundant for the conditional expectation of u_g. In other words, u_g is mean independent of x and z conditional on a.

The next assumption puts exogeneity and linearity restrictions on the structure of the RCs in (2.1).

• Assumption 2.4 (A.2.4): m_g = ψ_og + x ψ_g and b_g = κ_og + Γ_g x′ + v_g are the random coefficients in y_g, where ψ_og is the scalar parameter in the random scalar coefficient m_g, x is the 1 × l vector of exogenous variables in y_g, ψ_g is the l × 1 vector of slope parameters in m_g, κ_og is the l × 1 vector of parameters in the random vector of slope coefficients b_g, Γ_g is the l × l matrix of slope parameters in b_g, and v_g is the l × 1 vector of error terms with E(v_g | x, z) = 0 in b_g for g = 0, 1, ..., G.

A.2.4 implies that E(m_g | x, z) = E(m_g | x) and E(b_g | x, z) = E(b_g | x) for g = 0, 1, ..., G. That is, both the scalar random coefficient m_g in y_g and the random vector of slope coefficients b_g in y_g are mean independent of z conditional on x. In addition, for convenience in estimation, the RCs m_g and b_g are assumed to be linear in x. The conditional expectation E(v_g | x, z) = 0 might look a little restrictive; however, it still allows for possible correlation between the treatment variable w and the random vector of slope coefficients b_g, especially in higher moments. The following assumption indeed establishes the existence of that arbitrary correlation through the error terms in w_g^* but restricts its form, enabling the model to incorporate CRCs.

• Assumption 2.5 (A.2.5): E(v_g | x, z, a) = E(v_g | a) = P a′, where v_g is the l × 1 vector of error terms with E(v_g | x, z) = 0 in b_g, a is the 1 × (G + 1) vector of i.i.d. Gumbel distributed errors a_g with location parameter μ = 0 and scale parameter β = 1 in w_g^*, x is the 1 × l vector of exogenous variables in y_g, z is the 1 × k vector of instruments that includes a constant term in the choice equation for w_g^*, and P is the l × (G + 1) matrix of constant parameters for g = 0, 1, ..., G.

By A.2.5, I impose that, conditional on a, x and z are redundant for the conditional expectation of v_g. In other words, v_g is mean independent of x and z conditional on a. Again, for convenience in estimation, the conditional expectation of v_g is assumed to be linear in a. In essence, A.2.5 is similar to A.2.3: just as A.2.3 formulates the endogeneity in w, A.2.5 expresses the correlation between w and b_g. In this regard, the constant parameters in P can be interpreted as scalar multiples of the correlation coefficients between v_g and a_j for j, g = 0, 1, ..., G. Just like A.2.3, A.2.5 is also central to the CF method in deriving the CF estimating equation in section 2.3.

Under all assumptions from A.2.1 through A.2.5, the model in (2.1) and (2.2) can be consistently estimated by CF method. In section 2.3, I propose a consistent estimator for the ATEs in this discrete multivalued endogenous treatment model with CRCs.

2.3 Estimation

The main object of interest is again the estimation of ATEs in the discrete multivalued endogenous treatment model with CRCs and heterogeneous counterfactual errors that is described by (2.1) and (2.2) under the assumptions A.2.1 through A.2.5. Following the definitions from Chapter 1, denote by ATE_{g,0} the expected gain from treatment g with respect to the base treatment g = 0 for g = 1, ..., G.
In my model, under A.2.1 through A.2.4 and the law of iterated expectations, I can write

ATE_{g,0} = E(y_g − y_0)
          = E(m_g + x b_g + u_g − (m_0 + x b_0 + u_0))
          = (ψ_og − ψ_o0) + E(x)[(ψ_g + κ_og) − (ψ_0 + κ_o0)] + E(x ⊗ x) vec(Γ_g − Γ_0),    (2.4)

where vec(·) is the column vectorization operator and vec(ABC) = (C′ ⊗ A) vec(B) for conformable matrices A, B, and C. Note that the last equality above uses E(v_g | x, z) = 0 for g = 1, ..., G.

Then, using the analogy principle of Manski (1988, ch. 1), a consistent estimator of ATE_{g,0} is

ATEˆ_{g,0} = (ψ̂_og − ψ̂_o0) + x̄[(ψ_g + κ_og)ˆ − (ψ_0 + κ_o0)ˆ] + (x ⊗ x)‾ vec(Γ̂_g − Γ̂_0),    (2.5)

where ψ̂_og, ψ̂_o0, (ψ_g + κ_og)ˆ, (ψ_0 + κ_o0)ˆ, Γ̂_g, Γ̂_0, x̄ = N^{−1} Σ_{i=1}^{N} x_i, and (x ⊗ x)‾ = N^{−1} Σ_{i=1}^{N} x_i ⊗ x_i are respectively consistent estimates of ψ_og, ψ_o0, (ψ_g + κ_og), (ψ_0 + κ_o0), Γ_g, Γ_0, E(x), and E(x ⊗ x) for g = 1, ..., G. Here, it is important to reemphasize that by (ψ_g + κ_og)ˆ, I mean a consistent estimate of the sum (ψ_g + κ_og), not the sum of consistent estimates of ψ_g and κ_og.

Note that when there is only one exogenous variable (i.e., x = x) in y_g with E(x) = 0 and E(x²) = 1, ATE_{g,0} simplifies to

ATE_{g,0} = (ψ_og − ψ_o0) + (Γ_g − Γ_0).    (2.6)

Then, a consistent estimate of ATE_{g,0} in (2.6) is

ATEˆ_{g,0} = (ψ̂_og − ψ̂_o0) + (Γ̂_g − Γ̂_0),    (2.7)

where ψ̂_og, ψ̂_o0, Γ̂_g, and Γ̂_0 are defined as in (2.5) for g = 1, 2, ..., G. It is rather important to state this simplification here because I use this version of ATE_{g,0} in (2.6), instead of the one in (2.5), in my simulation analysis later in this chapter.

2.3.1 IV Estimation

Consider the observed outcome

y = d_0 y_0 + d_1 y_1 + ... + d_G y_G
  = Σ_{j=0}^{G} ψ_oj d_j + Σ_{j=0}^{G} d_j x (ψ_j + κ_oj) + Σ_{j=0}^{G} d_j (x ⊗ x) vec Γ_j + ε,    (2.8)

where ε = Σ_{j=0}^{G} d_j x v_j + Σ_{j=0}^{G} d_j u_j. Applying IV method to (2.8) requires instruments for the binary treatment indicators d_j and x, since both corr(d_j, ε) and corr(x_k, ε) are expected to be nonzero, where x_k ∈ x and x ≡ (x_1, x_2, ..., x_l) for k = 1, 2, ..., l and j = 0, 1, ..., G. One might think of z as instruments for x. As to d_j, one can model the treatment variable w as a discrete multinomial logit model and then use the predicted probabilities from this model as instruments for d_j, j = 0, 1, ..., G. Hence, one can prescribe the following three-stage procedure to estimate ATEs:

Procedure 2.1

1. Estimate the predicted probabilities, Λ̂_ji = exp(z_i γ̂_j) / Σ_{r=0}^{G} exp(z_i γ̂_r), from a MNL of w_i on z_i for j = 0, 1, ..., G and i = 1, 2, ..., N.

2. Estimate the parameters in (2.8) by IV method using instruments (Λ̂_ji, Λ̂_ji z_i, Λ̂_ji (z_i ⊗ z_i)) for (d_ji, d_ji x_i, d_ji (x_i ⊗ x_i)), j = 0, 1, ..., G and i = 1, 2, ..., N.

3. Plug the parameter estimates from step 2 and the sample averages of x and x ⊗ x into (2.5), and estimate the ATEs.
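To make the mechanics concrete, here is a stylized Python sketch of Procedure 2.1 with one covariate x and one instrument z, so that the IV system is just identified and the IV estimate is (Z′X)^{−1} Z′y. The DGP is deliberately simplified (a single counterfactual error common to all treatments and no CRC term, so the procedure's arithmetic rather than its inconsistency under CRCs is on display), and all parameter values are illustrative, not taken from the chapter.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(0, 2, n)
x = rng.normal(0, 1, n)
a = rng.gumbel(size=(n, 3))
w = (np.array([0.0, 1.0, 2.0]) * z[:, None] + a).argmax(axis=1)   # choice equation
d = np.eye(3)[w]                                                  # d_g = 1[w = g]
u = (a - 0.5772) @ np.array([0.3, 0.6, 0.9]) + rng.normal(size=n) # endogenous error
y = d @ [1, 2, 3] + (d * x[:, None]) @ [4, 5, 6] \
    + (d * x[:, None]**2) @ [4, 5, 6] + u                         # outcome as in (2.8)

# Stage 1: MNL of w on z; fitted probabilities Lambda_g
L = sm.MNLogit(w, sm.add_constant(z)).fit(disp=False).predict()

# Stage 2: just-identified IV for (d_j, d_j x, d_j x^2)
X = np.hstack([d, d * x[:, None], d * x[:, None]**2])
Z = np.hstack([L, L * z[:, None], L * z[:, None]**2])
beta = np.linalg.solve(Z.T @ X, Z.T @ y)

# Stage 3: with E(x) = 0 and E(x^2) = 1, use (2.7)
psi, Gam = beta[:3], beta[6:]
print("ATE_1,0 =", psi[1] - psi[0] + Gam[1] - Gam[0])   # true value 2 in this DGP
print("ATE_2,0 =", psi[2] - psi[0] + Gam[2] - Gam[0])   # true value 4 in this DGP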
Procedures similar to Procedure 2.1 are not uncommon in economic applications; see, for example, Puhani and Weber (2007) and Sloan, Picone, Taylor Jr., and Chou (2001). Therefore, some economists may use conventional IV method in a discrete multivalued endogenous treatment model with CRCs and heterogeneous counterfactual errors. However, conventional IV estimation of (2.8) is likely not to produce consistent parameter and ATE estimates, as pointed out by Wooldridge (2003, p. 191 and 2008, p. 98) and Card (2001, p. 1819). Conventional IV method generally fails because it is contaminated by the two different sets of interaction terms in the error term ε of (2.8): d_j x v_j and d_j u_j for j = 0, 1, ..., G. The instruments used by conventional IV method (all nonlinear functions of z), as in Procedure 2.1, are expected to be correlated with ε through these interaction terms, which are correlated with z through d_j for j = 0, 1, ..., G. Also note that the IV estimator described in Procedure 2.1 would be the optimal IV estimator if ε were homoskedastic, following the discussion in Chapter 1.

Just as in Chapter 1, to estimate (2.8) by IV method in canned software packages, one needs to reformulate it. To this end, let's drop one of the binary treatment indicator variables, say d_G, from (2.8), and then add a constant term, x, and x ⊗ x into (2.8). Then, (2.8) can be equivalently written as

y = Σ_{j=0}^{G−1} ψ̃_oj d_j + ψ̃_oG + Σ_{j=0}^{G−1} d_j x (ψ_j + κ_oj)~ + x (ψ_G + κ_oG)~ + Σ_{j=0}^{G−1} d_j (x ⊗ x) vec Γ̃_j + (x ⊗ x) vec Γ̃_G + ε̃,    (2.9)

where ψ_oj = ψ̃_oj + ψ̃_oG, ψ_j + κ_oj = (ψ_j + κ_oj)~ + (ψ_G + κ_oG)~, vec Γ_j = vec Γ̃_j + vec Γ̃_G, ψ_oG = ψ̃_oG, ψ_G + κ_oG = (ψ_G + κ_oG)~, and vec Γ_G = vec Γ̃_G for j = 0, 1, ..., G − 1. Under this reformulation, the ATEs are as follows:

ATE_{g,0} = (ψ̃_og − ψ̃_o0) + E(x)[(ψ_g + κ_og)~ − (ψ_0 + κ_o0)~] + E(x ⊗ x) vec(Γ̃_g − Γ̃_0),    (2.10)

for g = 1, 2, ..., G − 1 and

ATE_{G,0} = (−ψ̃_o0) + E(x)(−(ψ_0 + κ_o0)~) + E(x ⊗ x) vec(−Γ̃_0)    (2.11)

for g = G. Therefore, consistent estimates of ATE_{g,0} for g = 1, 2, ..., G − 1 and of ATE_{G,0} under this reformulation are, respectively,

ATEˆ_{g,0} = (ψ̃ˆ_og − ψ̃ˆ_o0) + x̄[(ψ_g + κ_og)~ˆ − (ψ_0 + κ_o0)~ˆ] + (x ⊗ x)‾ vec(Γ̃ˆ_g − Γ̃ˆ_0),    (2.12)

and

ATEˆ_{G,0} = (−ψ̃ˆ_o0) + x̄(−(ψ_0 + κ_o0)~ˆ) + (x ⊗ x)‾ vec(−Γ̃ˆ_0),    (2.13)

where ψ̃ˆ_og, ψ̃ˆ_o0, (ψ_g + κ_og)~ˆ, (ψ_0 + κ_o0)~ˆ, Γ̃ˆ_g, and Γ̃ˆ_0 are respectively the consistent estimates of ψ̃_og, ψ̃_o0, (ψ_g + κ_og)~, (ψ_0 + κ_o0)~, Γ̃_g, and Γ̃_0 from Procedure 2.1 applied to (2.9), and x̄ and (x ⊗ x)‾ are consistent estimates of E(x) and E(x ⊗ x) as defined in (2.5).

2.3.2 CF Estimation

To get rid of the endogeneity complications conventional IV faces, as mentioned in subsection 2.3.1, CF estimation needs to find a closed form expression for E(ε | d, x, z) and then add that expression as a control variable (a.k.a. control function terms) back into (2.8). Hence, compared to IV method, CF method is almost always more complex. To prevent equation clutter, I leave the derivation of the closed form expression for E(ε | d, x, z) (and of the estimating equation of CF method) to appendix B.

Having said that, (B.10) in appendix B gives me the estimating equation of CF method because I can always write

y = Σ_{j=0}^{G} d_j ψ_oj + Σ_{j=0}^{G} d_j x (ψ_j + κ_oj) + Σ_{j=0}^{G} d_j (x ⊗ x) vec Γ_j
    − Σ_{j=0}^{G} η_{j,j} d_j log(Λ_j) + (Σ_{j≠0} d_j η_{j,0}) M_0 + (Σ_{j≠1} d_j η_{j,1}) M_1 + · · · + (Σ_{j≠G} d_j η_{j,G}) M_G
    + Σ_{k=1,h=0}^{l,G} p_{k,h} [Σ_{j=0}^{G} d_j x_k E(a_h | d_j = 1, x, z)] + ξ,    (2.14)

where E(ξ | d, x, z) = 0, Λ_j = exp(z γ_j)/Σ_{r=0}^{G} exp(z γ_r), M_j = Λ_j log(Λ_j)/(1 − Λ_j), E(a_h | d_j = 1, x, z) = −log(Λ_h) + E(a_h) if h = j, E(a_h | d_j = 1, x, z) = Λ_h log(Λ_h)/(1 − Λ_h) + E(a_h) = M_h + E(a_h) if h ≠ j, and E(a_h) = 0.5772 for h, j = 0, 1, ..., G. So I can prescribe the following three-stage procedure to estimate ATEs:
Procedure 2.2

1. Same as in Procedure 2.1.

2. Run the regression of y_i on d_0i, d_1i, ..., d_Gi; d_0i x_i, d_1i x_i, ..., d_Gi x_i; d_0i (x_i ⊗ x_i), d_1i (x_i ⊗ x_i), ..., d_Gi (x_i ⊗ x_i); −d_0i log(Λ̂_0i), −d_1i log(Λ̂_1i), ..., −d_Gi log(Λ̂_Gi); the cross terms d_ji M̂_hi for all j ≠ h with j, h = 0, 1, ..., G; and the CRC terms Σ_{j=0}^{G} d_ji x_ki Ê(a_h | d_j = 1, x, z)_i for k = 1, 2, ..., l and h = 0, 1, ..., G.

3. Same as in Procedure 2.1, where Λ̂_gi = exp(z_i γ̂_g)/Σ_{r=0}^{G} exp(z_i γ̂_r), Ê(a_g | d_j = 1, x, z)_i = −log(Λ̂_gi) + E(a_g) if g = j, Ê(a_g | d_j = 1, x, z)_i = M̂_gi + E(a_g) if g ≠ j, M̂_gi = Λ̂_gi log(Λ̂_gi)/(1 − Λ̂_gi), and E(a_g) = 0.5772 for g, j = 0, 1, ..., G and i = 1, 2, ..., N.

Notice that, unlike IV method, CF method is robust to the two different sources of unobserved heterogeneity in the model described by (2.1), (2.2), and the assumptions A.2.1 through A.2.5: one coming from the counterfactual errors u_g and the other from v_g. Even though the treatment variable w is endogenous and correlated with the random vector of slope coefficients b_g for g = 0, 1, ..., G, under A.2.1 through A.2.5 CF method yields consistent estimates. Owing to this advantage of CF method over IV method while using the very same instruments z in CRC (and many other) models, some economists might consider CF method a generalized form of IV method; see, for example, Card (2001, p. 1819) on this.
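A compact Python sketch of Procedure 2.2 in the same stylized setting as the IV sketch above (l = 1 covariate, G + 1 = 3 treatments), now with a CRC error v correlated with the Gumbel errors, is below. The control function columns follow (2.14) as I reconstruct it; the Gumbel errors are centered for convenience, the DGP and numbers are illustrative, and this is a sketch rather than the dissertation's actual implementation.

import numpy as np
import statsmodels.api as sm

EULER = 0.5772
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(0, 2, n)
x = rng.normal(0, 1, n)
a = rng.gumbel(size=(n, 3))
w = (np.array([0.0, 1.0, 2.0]) * z[:, None] + a).argmax(axis=1)
d = np.eye(3)[w]
u = (a - EULER) @ np.array([0.3, 0.6, 0.9]) + rng.normal(size=n)
v = (a - EULER) @ np.array([0.2, 0.2, 0.2]) + rng.normal(size=n)   # CRC error, common across g
y = d @ [1, 2, 3] + (d * x[:, None]) @ [4, 5, 6] \
    + (d * x[:, None]**2) @ [4, 5, 6] + x * v + u

L = sm.MNLogit(w, sm.add_constant(z)).fit(disp=False).predict()    # stage 1
M = L * np.log(L) / (1 - L)                                        # M_g terms

# E(a_g | d_j = 1, z): own term -log(Lambda_g) + Euler, cross term M_g + Euler
Ea = np.where(np.arange(3) == w[:, None], -np.log(L) + EULER, M + EULER)

cf = np.hstack([-d * np.log(L),                       # -d_j log(Lambda_j)
                np.stack([d[:, j] * M[:, h] for j in range(3)
                          for h in range(3) if h != j], axis=1),
                x[:, None] * Ea])                     # CRC control functions
X = np.hstack([d, d * x[:, None], d * x[:, None]**2, cf])
beta = sm.OLS(y, X).fit().params                      # stage 2
psi, Gam = beta[:3], beta[6:9]
print("ATE_1,0 =", psi[1] - psi[0] + Gam[1] - Gam[0])  # true value 2 in this DGP
print("ATE_2,0 =", psi[2] - psi[0] + Gam[2] - Gam[0])  # true value 4 in this DGP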
2.4 Asymptotic Normality Results

The asymptotic theory behind CF method is not much different from the one developed in Chapter 1 because the estimating equation of CF method in (2.14) is still a two step M-estimator with some additional generated regressors. As a result, this two step M-estimator again solves the problem

min_{θ∈Θ} Σ_{i=1}^{N} (y_i − m(X_i, v(d_i, x_i, z_i, γ̂), θ))²/2,    (2.15)

where γ̂ = (γ̂_0′, γ̂_1′, ..., γ̂_G′)′ is the (G + 1)k × 1 vector of √N-consistent and asymptotically normal first stage conditional MLE (CMLE) estimates from the MNL of w_i on z_i for i = 1, 2, ..., N. Technically, the first stage estimates do not have to be consistent as long as they converge in probability to some limit, i.e., γ̂ →p γ* where γ* ∈ Γ ⊂ R^{(G+1)k}. However, they are consistent in this setting. The CMLE solves the problem

max_{γ∈Γ} Σ_{i=1}^{N} l_i(γ),    (2.16)

where γ = (γ_0′, γ_1′, ..., γ_G′)′ is the (G + 1)k × 1 vector of parameters, and l_i(γ) ≡ log(f(w_i | z_i; γ)) (i.e., the conditional log likelihood for observation i) is given below:

log(f(w_i | z_i, γ)) = Σ_{j=0}^{G} 1[w_i = j] log(exp(z_i γ_j)/Σ_{r=0}^{G} exp(z_i γ_r)).    (2.17)

This CMLE is exactly the same as the one used in Chapter 1. Therefore, to establish that these first stage MLE estimates are √N-consistent, I rely on Theorem 1.1 (Th.1.1) from Chapter 1, which is restated below for readers' convenience and establishes the consistency of the CMLE without compactness.

• Theorem 2.1 (Th.2.1): Let {(w_i, z_i) : i = 1, 2, ...} be a random sample with z_i ∈ Z ⊂ R^k, w_i ∈ W ⊂ R. Let Γ ⊂ R^{(G+1)k} be the parameter set, and denote the parametric model for the conditional density, p(·|z), as {f(·|z; γ) : z ∈ Z, γ ∈ Γ}. Let l : W × Z × Γ → R be a real-valued function. Assume that (a) f(·|z; γ) is a true density function with respect to the measure μ(dw) for all z and γ, so that ∫_W f(w|z)μ(dw) = 1 ∀z ∈ Z holds; (b) for some γ_o ∈ Γ, p_o(·|z) = f(·|z; γ_o) ∀z ∈ Z, and the true parameter vector γ_o is the unique solution to max_{γ∈Γ} E[l_i(γ)]; (c) γ_o is an element of the interior of a convex parameter space Γ; (d) for each γ ∈ Γ, l(·, γ) is a Borel measurable function on W × Z; (e) for each (w, z) ∈ W × Z, l(w, z, ·) is concave in γ; and (f) |l(w, z, γ)| ≤ b(w, z) ∀γ ∈ Γ, where b(·, ·) is a nonnegative function on W × Z such that E[b(w, z)] < ∞. Then there exists a solution to the problem in (2.16), the CMLE γ̂, and γ̂ →p γ_o.

For the verification of the conditions stated in Th.2.1, see appendix A. I also refer readers to Theorem 2.7 in Newey and McFadden (1994, p. 2133) for a generic consistency proof of extremum estimators without compactness. To establish that these first stage MLE estimates are asymptotically normal, I use Theorem 1.2 (Th.1.2) from Chapter 1, which is restated below for readers' convenience.

• Theorem 2.2 (Th.2.2): Let the definitions and conditions of Th.2.1 hold, and define B_o^F ≡ Var[∇_γ′ l_i(γ_o)]. Furthermore, assume that (a) γ_o is an element of the interior of the parameter space Γ, i.e., γ_o ∈ int(Γ); (b) for each (w, z) ∈ W × Z, l(w, z, ·) is twice continuously differentiable on int(Γ); (c) E[s_i^F(γ_o)] = 0 and −E[H_i^F(γ_o)] = Var[s_i^F(γ_o)], where s_i^F(γ) ≡ ∇_γ′ l_i(γ) and H_i^F(γ) ≡ ∇_γ[∇_γ′ l_i(γ)]; (d) the elements of ∇_γ[∇_γ′ l(w, z, γ)] are bounded in absolute value by a function b(w, z) ∀γ ∈ Γ, where b(·, ·) is a nonnegative function on W × Z such that E[b(w, z)] < ∞; and (e) A_o^F ≡ −E(∇_γ[∇_γ′ l_i(γ_o)]) is positive definite. Then

√N(γ̂ − γ_o) →d Normal(0, (A_o^F)^{−1} B_o^F (A_o^F)^{−1}).    (2.18)

Explicitly, the score of the log likelihood for observation i is

s_i^F(γ) ≡ ∇_γ′ l_i(γ) = (∂l_i/∂γ_0 (γ), ∂l_i/∂γ_1 (γ), ..., ∂l_i/∂γ_G (γ))′,    (2.19)

which is a (G + 1)k × 1 vector of partial derivatives of l_i(γ) with respect to the parameters in γ. The Hessian, H_i^F(γ) ≡ ∇_γ[∇_γ′ l_i(γ)], for observation i is the (G + 1)k × (G + 1)k matrix of second partial derivatives of l_i(γ) with respect to the parameters in γ. Thus, using the definitions in Th.2.2, A_o^F ≡ −E[H_i^F(γ_o)] and B_o^F ≡ Var[s_i^F(γ_o)]. In addition, E[s_i^F(γ_o)] = 0 and A_o^F = B_o^F, which are used to simplify the variance expression in (2.18); see appendix A for how to show these equalities. Using E[s_i^F(γ_o)] = 0 and A_o^F = B_o^F, I can rewrite (2.18) as

√N(γ̂ − γ_o) →d Normal(0, (A_o^F)^{−1}).    (2.20)

For the verification of the conditions in Th.2.2, see appendix A. Theorem 3.1 in Newey and McFadden (1994, p. 2143) is also helpful for a generic proof of asymptotic normality of extremum estimators. The second stage of CF method is technically OLS with generated regressors, as in Chapter 1. For this reason, to establish that the second stage estimates are √N-consistent, I use a modified version of Theorem 1.3 (Th.1.3) from Chapter 1, which establishes the consistency of CF method with a compact parameter space.

• Theorem 2.3 (Th.2.3): Let w = (y, X, v) be a random vector with w ∈ W ⊂ R^{M+1} and M = (l(l + 1)/2 + 2l + G + 2)(G + 1). Let Θ ⊂ R^M and Γ ⊂ R^{(G+1)k} be the parameter sets. Let q(w, θ, γ) : W × Θ × Γ → R be a real-valued function. Let γ̂ be an estimator from a preliminary estimation.
Assume that (a) γ̂ →p γ* for some γ* ∈ Γ; (b) for a given γ* ∈ Γ, the true parameter vector θ_o is the unique solution to min_{θ∈Θ} E[q_i(θ; γ*)]; (c) the parameter space Θ × Γ is compact; (d) for each (θ, γ) ∈ Θ × Γ, q(·, θ, γ) is a Borel measurable function on W; (e) for each w ∈ W, q(w, ·, ·) is a continuous function on Θ × Γ; and (f) E[|q(w_i, θ; γ)|] < ∞ ∀(θ, γ) ∈ Θ × Γ. Then there exists a solution to the problem in (2.15), the CF estimator θ̂, and θ̂ →p θ_o.

For the verification of the conditions stated in Th.2.3, one can follow the steps taken in appendix A. In addition, readers can benefit from Wooldridge (1994, p. 2730) for a generic consistency proof of two-step M-estimators with compactness. Before I move to the asymptotic normality result, I have to introduce some notation. From the problem in (2.15), we can see that q(w, θ, γ) in Th.2.3 is as follows:

q_i(θ, γ) ≡ q(w_i, θ, γ) ≡ (y_i − m(X_i, v(d_i, x_i, z_i, γ), θ))²/2,    (2.21)

where m_i(v_i(γ), θ) ≡ m(X_i, v(d_i, x_i, z_i, γ), θ) ≡ X_i δ + v_i λ is a real-valued scalar function, θ = (δ′, λ′)′ is the M × 1 vector of parameters, X_i is the 1 × (l(l + 1)/2 + l + 1)(G + 1) vector of regressors in (2.15), and v_i is the 1 × (l + G + 1)(G + 1) vector of generated regressors in (2.15). More explicitly,

X_i = (d_0i, ..., d_Gi, d_0i x_i, ..., d_Gi x_i, d_0i (x_i ⊗ x_i), ..., d_Gi (x_i ⊗ x_i))

v_i = (−d_0i log(Λ_0i), ..., −d_Gi log(Λ_Gi), {d_ji M_hi : j ≠ h; j, h = 0, 1, ..., G}, {Σ_{j=0}^{G} d_ji x_ki E(a_h | d_j = 1, x, z)_i : k = 1, ..., l; h = 0, 1, ..., G}),    (2.22)

where Λ_gi = exp(z_i γ_g)/Σ_{r=0}^{G} exp(z_i γ_r), E(a_g | d_j = 1, x, z)_i = −log(Λ_gi) + E(a_g) if g = j, E(a_g | d_j = 1, x, z)_i = M_gi + E(a_g) if g ≠ j, M_gi = Λ_gi log(Λ_gi)/(1 − Λ_gi), and E(a_g) = 0.5772 for g, j = 0, 1, ..., G and i = 1, 2, ..., N. As one might expect, expressions such as Λ̂_gi, M̂_gi, and Ê(a_g | d_j = 1, x, z)_i are just consistent estimates of Λ_gi, M_gi, and E(a_g | d_j = 1, x, z)_i, with γ̂_g replacing γ_g in each. Now, I will state the theorem, a modified version (in the sense that the dimensions of X and v are different) of Theorem 1.4 (Th.1.4) from Chapter 1, which establishes the asymptotic normality of CF method with a compact parameter space.

• Theorem 2.4 (Th.2.4): Let the definitions and conditions of Th.2.3 hold. Furthermore, assume that (a) θ_o ∈ int(Θ) and γ* ∈ int(Γ); (b) √N(γ̂ − γ*) is bounded in probability, i.e., √N(γ̂ − γ*) = O_p(1); (c) for each (w, γ) ∈ W × Γ, q(w, ·; γ) is twice continuously differentiable on int(Θ); (d) for each θ ∈ Θ, s(·, θ; ·) ≡ ∇_θ′ q(·, θ; ·) is continuously differentiable on int(Γ); (e) for each (θ, γ) ∈ Θ × Γ, H(·, θ; γ) ≡ ∇_θ s(·, θ; γ) is a Borel measurable function on W; (f) for each w ∈ W, H(w, ·; ·) is continuous on Θ × Γ; (g) E[‖H(w_i, θ; γ)‖] < ∞ ∀(θ, γ) ∈ Θ × Γ; (h) A_o ≡ E[H(w_i, θ_o; γ*)] is positive definite; (i) for each (θ, γ) ∈ Θ × Γ, ∇_γ s(·, θ; γ) is a Borel measurable function on W; (j) for each w ∈ W, ∇_γ s(w, ·; ·) is continuous on Θ × Γ; (k) E[‖∇_γ s(w_i, θ; γ)‖] < ∞ ∀(θ, γ) ∈ Θ × Γ; (l) E[s_i(θ_o; γ*)] = 0, E[(A_o^F)^{−1} s_i^F(γ*)] = 0, and E[(A_o^F)^{−1} s_i^F(γ*) s_i′(θ_o; γ*)] = 0.
Then

√N(θ̂ − θ_o) →d Normal(0, (A_o)^{−1} D_o (A_o)^{−1}),    (2.23)

where D_o = B_o + F_o T_o + T_o′ F_o′ + F_o R* F_o′, s_i(θ_o; γ*) ≡ ∇_θ′ q(w_i, θ_o; γ*), A_o ≡ E[∇_θ s_i(θ_o; γ*)] ≡ E[H_i(θ_o; γ*)], B_o ≡ E[s_i(θ_o; γ*) s_i′(θ_o; γ*)], F_o ≡ E[∇_γ s_i(w, θ_o; γ*)], T_o ≡ E[r_i(γ*) s_i′(θ_o; γ*)], R* ≡ E[r_i(γ*) r_i′(γ*)], r_i(γ*) = (A_o^F)^{−1} s_i^F(γ*), and A_o^F ≡ −E(∇_γ[∇_γ′ l_i(γ*)]).

For the derivation of the asymptotic variance of √N(θ̂ − θ_o), refer to subchapter 12.4 in Wooldridge (2010) or subsections 4.3 and 4.4 in Wooldridge (1994). For the verification of the conditions stated in Th.2.4, one can again follow the steps taken in appendix A. Readers can also benefit from Wooldridge (1994, p. 2730) for a generic asymptotic normality proof of two-step M-estimators with compactness. In addition, refer to appendix A for the derivation of the closed forms of the population matrices A_o, B_o, F_o, and R* and for seeing that E[r_i(γ*)] = 0, E[s_i(θ_o; γ*)] = 0, and T_o ≡ E[r_i(γ*) s_i′(θ_o; γ*)] = 0. Since T_o = 0, D_o in the asymptotic variance of √N(θ̂ − θ_o) in (2.23) actually simplifies to B_o + F_o R* F_o′.

Let's construct the following estimators for A_o, B_o, F_o, and R*, as in Chapter 1:

Â = N^{−1} Σ_{i=1}^{N} H_i(θ̂; γ̂),    (2.24)

B̂ = N^{−1} Σ_{i=1}^{N} s_i(θ̂; γ̂) s_i′(θ̂; γ̂),    (2.25)

F̂ = N^{−1} Σ_{i=1}^{N} ∇_γ s_i(θ̂; γ̂), and    (2.26)

R̂ = N^{−1} Σ_{i=1}^{N} r_i(γ̂) r_i′(γ̂).    (2.27)

Define D̂ ≡ B̂ + F̂ R̂ F̂′. Then, using the analogy principle and Lemma 1 in Chapter 1, a consistent estimator for Avar √N(θ̂ − θ_o) is Avarˆ √N(θ̂ − θ_o) = (Â)^{−1} D̂ (Â)^{−1}. The asymptotic standard errors of CF estimates can be obtained from the matrix Avarˆ(θ̂) = (Â)^{−1} D̂ (Â)^{−1}/N as usual or be bootstrapped.

2.4.1 Method of Moments Framework

Following the results from Newey (1984), Newey and McFadden (1994, pp. 2132 and 2148), or Heckman, Tobias, and Vytlacil (2003), two-step estimators can be regarded as members of the family of generalized method of moments (GMM) estimators, and the asymptotic theory for these estimators can be derived by stacking moment conditions. GMM estimators take away the burden of deriving the asymptotic variance matrix of a two-step estimator and thus provide an alternative way of conducting inference in the CF regression as well. Since the number of moment conditions is the same as the number of parameters to be estimated in my analysis, I technically use the method of moments (MoM). As in Chapter 1, in the first stage of CF method, γ̂ is the CMLE estimator solving

Σ_{i=1}^{N} s_i^F(γ̂) = 0,    (2.28)

where s_i^F(γ) = ∇_γ′ [Σ_{j=0}^{G} 1[w_i = j] log(exp(z_i γ_j)/Σ_{r=0}^{G} exp(z_i γ_r))]. In the second stage, θ̂ is an OLS estimator solving

Σ_{i=1}^{N} s_i(θ̂; γ̂) = 0,    (2.29)

where s_i(θ̂; γ̂) = ∇_θ′ [y_i − m(X_i, v(d_i, x_i, z_i, γ̂), θ̂)]²/2. Newey (1984) proposes stacking the summands in these first order conditions into the unified function

g(θ, γ) = (s^F(γ)′, s(θ; γ)′)′    (2.30)

and then applying MoM using the moment conditions E[g(θ, γ)] = 0 to obtain consistent estimates for θ and γ, and a valid asymptotic variance matrix for θ̂ and γ̂. Using the GMM results in the appendix of Heckman, Tobias, and Vytlacil (2003), one can also derive the asymptotic distribution theory for the ATE estimators.
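The sandwich construction in (2.24) through (2.27) reduces to a few lines of linear algebra once the per-observation pieces are available. The schematic Python sketch below shows that arithmetic; the inputs (A, per-observation scores s, the cross-derivative block F, and first-stage influence terms r) are placeholders that a full implementation would compute from the model, and the function name cf_sandwich is hypothetical.

import numpy as np

def cf_sandwich(A, s, F, r):
    """A: M x M Hessian estimate; s: N x M second-stage scores;
    F: M x (G+1)k cross-derivative estimate; r: N x (G+1)k first-stage influences."""
    N = s.shape[0]
    B = s.T @ s / N                    # (2.25)
    R = r.T @ r / N                    # (2.27)
    D = B + F @ R @ F.T                # T_o = 0, so D = B + F R F'
    Ainv = np.linalg.inv(A)
    return Ainv @ D @ Ainv / N         # estimated Avar(theta_hat)

# Toy usage with random placeholders, just to show the shapes involved
rng = np.random.default_rng(0)
N, M, K = 500, 4, 2
s, r = rng.normal(size=(N, M)), rng.normal(size=(N, K))
A, F = np.eye(M), rng.normal(size=(M, K))
se = np.sqrt(np.diag(cf_sandwich(A, s, F, r)))   # asymptotic standard errors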
2.5 Simulations

In this section, I share simulation results that compare and contrast the estimation methods (i.e., the CF and IV methods in section 2.3) and note what is different and similar in their asymptotic performances, specifically asymptotic efficiency, asymptotic unbiasedness, and consistency. I change the simulation setup for the model in section 2.2 as I change the distribution of the instrument in the latent variable equation or introduce misspecification into the model by ignoring an instrument in the latent variable equation. For the sake of computational simplicity, I adopt a scheme in which there is only one covariate in the counterfactual outcome equation (i.e., x = x) and only one instrument in the latent variable equation (i.e., z = z). When examining the consequences of misspecification, though, there are two instruments determining the latent treatment variable. Lastly, the treatment variable w takes on only three values, and each treatment group comprises at least about 30 percent of the sample in each simulation setting.

2.5.1 Data Generating Process

In my simulation analysis, I used four different data generating processes (DGPs): one for the model in section 2.2 with an asymmetric instrument, one for the model with a symmetric instrument, one for the model with an asymmetric instrument and misspecification, and one for the model with a symmetric instrument and misspecification. The setup for the DGP of the model in section 2.2 with an asymmetric instrument is as follows:

w ∈ {0, 1, 2}, dg = 1[w = g], g ∈ {0, 1, 2},
ag ∼ Gumbel(0, 1), g ∈ {0, 1, 2},
γ0 = 1, γ1 = 5, and γ2 = 9,
l0 = 1, l1 = 5, and l2 = 3,
z = z ∼ χ²(2) − 2,
wg∗ = lg + γg z + ag, g ∈ {0, 1, 2},
w = g iff wg∗ ≥ wj∗ ∀j ≠ g and g, j ∈ {0, 1, 2},
eg ∼ N(0, 1), g ∈ {0, 1, 2},
η0,0 = 0.05, η0,1 = 0.10, and η0,2 = 0.15,
η1,0 = 4.05, η1,1 = 4.10, and η1,2 = 4.15,
η2,0 = 8.05, η2,1 = 8.10, and η2,2 = 8.15,
ug = Σ_{j=0}^2 ηg,j aj − Σ_{j=0}^2 ηg,j E(aj) + eg, g ∈ {0, 1, 2},
x = x ∼ N(0, 1),
ψo0 = 1, ψo1 = 2, and ψo2 = 3,
ψ0 = 4, ψ1 = 5, and ψ2 = 6,
mg = ψog + xψg, g ∈ {0, 1, 2},
κo0 = 1, κo1 = 4, and κo2 = 7,
Γ0 = 4, Γ1 = 5, and Γ2 = 6,
p0 = 1, p1 = 2, and p2 = 3,
evg ∼ N(0, 1), g ∈ {0, 1, 2},
vg = Σ_{j=0}^2 pj aj + evg, g ∈ {0, 1, 2},
bg = κog + Γg x + vg, g ∈ {0, 1, 2},
yg = mg + xbg + ug, g ∈ {0, 1, 2}, and
y = d0 y0 + d1 y1 + d2 y2.

For the model with a symmetric instrument, the DGP setup is almost exactly the same as the one above. However, I modify both the location parameters and the distribution of the instrument appearing in the latent variable equation as follows:

l0 = 1, l1 = 5.2, and l2 = 2,
z = z ∼ N(0, 2).

For the model with an asymmetric instrument and misspecification in the latent variable equation, the DGP setup is very similar to the one without misspecification. However, I introduce an additional instrument in the latent variable equation and omit it from the first stage MNL regression of the treatment variable on the instruments, thus creating misspecification. In line with this, I make the following modifications to the DGP:

z = (z1, z2)′, z1 ∼ χ²(2) − 2, z2 ∼ χ²(2) − 2,
wg∗ = lg + γg z1 + ϑg z2 + ag, g ∈ {0, 1, 2},
ϑ0 = γ1, ϑ1 = γ2, and ϑ2 = γ0,

where z1 and z2 are scalar instruments in the choice equation for wg∗ (i.e., the latent variable equation), and ϑg is a scalar parameter associated with z2 in wg∗ for g = 0, 1, 2.
Lastly, for the model with a symmetric instrument and misspecification in the latent variable equation, the DGP setup is again very similar to the one without misspecification. I make the following modifications to its DGP:

z = (z1, z2)′, z1 ∼ N(0, 2), z2 ∼ N(0, 2),
l0 = 4.5, l1 = 4, and l2 = 2,
wg∗ = lg + γg z1 + ϑg z2 + ag, g ∈ {0, 1, 2},
ϑ0 = γ1, ϑ1 = γ2, and ϑ2 = γ0,

where z1, z2, and ϑg are defined just as in the model with an asymmetric instrument and misspecification in the latent variable equation.

As in Chapter 1, both γg and lg play a role in determining the percentage of each treatment group in the simulations for g = 0, 1, 2. Keeping the γg's sufficiently far apart from each other is also critical to obtain strong first stage estimates and to ward off identification problems in the first stage estimation. To strengthen the endogeneity in the model, it is likewise critical to keep the ηg,j's separated from each other across treatment statuses for g, j = 0, 1, 2. Following Wooldridge (2008, p. 106; 2010, p. 947), I also vary the distribution of the instrument z in order to see if its distribution can influence IV estimates of the ATEs in terms of consistency. A code sketch of the baseline DGP appears below.
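For concreteness, the baseline DGP above (asymmetric instrument, no misspecification) can be simulated with a few lines of numpy; this is a minimal sketch under the stated parameter values, with the function name and return layout my own.

    import numpy as np

    def simulate_dgp(n, rng):
        """One draw of the baseline DGP: asymmetric instrument, no misspecification."""
        gam = np.array([1.0, 5.0, 9.0])             # gamma_g
        loc = np.array([1.0, 5.0, 3.0])             # l_g
        eta = np.array([[0.05, 0.10, 0.15],
                        [4.05, 4.10, 4.15],
                        [8.05, 8.10, 8.15]])        # eta_{g,j}
        psi_o = np.array([1.0, 2.0, 3.0])           # psi_og
        psi = np.array([4.0, 5.0, 6.0])             # psi_g
        kap_o = np.array([1.0, 4.0, 7.0])           # kappa_og
        Gam = np.array([4.0, 5.0, 6.0])             # Gamma_g
        pc = np.array([1.0, 2.0, 3.0])              # p_g

        z = rng.chisquare(2, n) - 2                 # asymmetric instrument
        a = rng.gumbel(0.0, 1.0, (n, 3))            # Gumbel choice errors a_g
        w_star = loc + gam * z[:, None] + a         # latent utilities w*_g
        w = w_star.argmax(axis=1)                   # chosen treatment status
        d = np.eye(3)[w]                            # indicators d_g = 1[w = g]

        x = rng.standard_normal(n)
        e = rng.standard_normal((n, 3))
        u = (a - np.euler_gamma) @ eta.T + e        # u_g, demeaned via E(a_j) = 0.5772
        ev = rng.standard_normal((n, 3))
        v = (a @ pc)[:, None] + ev                  # v_g = sum_j p_j a_j + ev_g
        m = psi_o + x[:, None] * psi                # m_g = psi_og + x psi_g
        b = kap_o + Gam * x[:, None] + v            # b_g = kappa_og + Gamma_g x + v_g
        y_pot = m + x[:, None] * b + u              # y_g = m_g + x b_g + u_g
        y = (d * y_pot).sum(axis=1)                 # observed outcome
        return y, w, d, x, z

    # usage: y, w, d, x, z = simulate_dgp(1000, np.random.default_rng(0))

The symmetric and misspecified designs only swap the draw of z, adjust the location parameters lg, and, for misspecification, add the second instrument z2 to wg∗ while the first stage MNL omits it.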
2.5.2 Simulation Results

I present my simulation results in two parts: first, asymptotic efficiency outcomes and, second, asymptotic unbiasedness and consistency outcomes. The simulation results reported in Tables B.1 through B.16 aim to compare the CF method with the IV method in terms of asymptotic efficiency, asymptotic unbiasedness, and consistency. The first eight tables belong to models without misspecification, whereas the last eight tables contain results coming out of models with misspecification. In Tables B.1 through B.16, I report the Monte Carlo (M.C.) estimates for ψog, Γg, and AT Eh,0; bias in the M.C. estimate for ATEs; bootstrapped standard errors (BS. SEs) and Monte Carlo standard deviations (M.C. SDs) for ψog, Γg, and AT Eh,0; and BS. SEs and M.C. SDs for the standard errors of ψog, Γg, and AT Eh,0 for g = 0, 1, 2 and h = 1, 2. In simulations, I use different sample sizes (i.e., n = 1000, n = 2000, n = 5000, and n = 10000) for each DGP setup with the number of M.C. and BS. iterations always equal to 10000. I also use some trimming to remove outliers from my simulation analysis. As for the notation, in Tables B.1-B.16, ψ̂og is the parameter estimate for ψog, Γ̂g is the parameter estimate for Γg, atêh,0 is the estimate for AT Eh,0, and bias(atêh,0) is the bias in the estimate for AT Eh,0 for g = 0, 1, 2 and h = 1, 2. Furthermore, se(ψ̂og) is the standard error of the parameter estimate for ψog, se(Γ̂g) is the standard error of the parameter estimate for Γg, and se(atêh,0) is the standard error of the estimate for AT Eh,0 for g = 0, 1, 2 and h = 1, 2. Since these tables would require a considerable amount of space in the main body of the chapter, I place all the simulation tables of this chapter in appendix B. At this point, it is also important to remember the true values of ψog, Γg, and AT Eh,0 for g = 0, 1, 2 and h = 1, 2 since I often refer to them throughout this section. As pointed out in section 2.3, since x ∼ N(0, 1), the true values are respectively as follows:

ψo0 = 1, ψo1 = 2, and ψo2 = 3,
Γ0 = 4, Γ1 = 5, and Γ2 = 6,
AT E1,0 = 2, and AT E2,0 = 4.

2.5.2.1 Asymptotic Efficiency Outcomes

From an efficiency standpoint, let's first consider the models with no misspecification. In Table B.1, the simulation results show that the BS. SEs and M.C. SDs of the CF estimates for ψog (Γg) are almost always higher (lower) than the BS. SEs and M.C. SDs of the counterpart IV estimates, respectively. Similarly, the BS. SEs and M.C. SDs of the standard errors of the CF estimates for ψog (Γg) are also almost always higher (lower) than those of the IV estimates. Furthermore, the BS. SEs and M.C. SDs of the CF estimates for AT Eg,0 are always higher than the BS. SEs and M.C. SDs of the counterpart IV estimates. For instance, in Table B.1, the BS. SE of the CF parameter estimate for ψo1 (Γ1) is about 22% higher (22% lower) than that of the IV estimate, and the M.C. SD of the CF parameter estimate is 59% higher (32% lower). Again in Table B.1, the BS. SE of the standard error of the CF parameter estimate for ψo1 (Γ1) is about 57% higher (43% lower) than that of the IV estimate, and the M.C. SD of the standard error of the CF parameter estimate is 135% higher (57% lower). Furthermore, the BS. SE of the CF parameter estimate for AT E1,0 is about 33% higher than that of the IV estimate, and the M.C. SD of the CF parameter estimate is 67% higher. A very similar pattern, but with higher precision in the estimates, is observed in Tables B.2 through B.8 as I switch the distribution of the instrument z from asymmetric to symmetric and/or increase the sample size. As a result, when there is no misspecification, the simulation results demonstrate that neither the CF method nor the IV method performs uniformly better from the perspective of efficiency: the results suggest that the CF method estimates Γg and their standard errors more precisely than the IV method does; however, the IV method estimates ψog, their standard errors, and the ATEs more precisely than the CF method does for g = 0, 1, 2.

Now let's take a look at the models with misspecification in Tables B.9 through B.16. In Table B.9, the simulation results still provide evidence that the BS. SEs and M.C. SDs of the CF estimates for ψog (Γg) and AT Eg,0 are almost always higher (lower) than the BS. SEs and M.C. SDs of the counterpart IV estimates, respectively. When it comes to the BS. SEs and M.C. SDs of the standard errors of the estimates for Γg, the simulation results seem to favor the CF method: it has sharper estimates than the IV method. The dot plots of the estimated parameters suggest that an abundance of outliers in the IV estimates plays a role in this observation, especially when the sample size is relatively small (i.e., N = 1000, 2000) and even after some trimming. As for the BS. SEs and M.C. SDs of the standard errors of the estimates for ψog, the simulation results are mixed. To give a few instances, in Table B.9, the BS. SE of the CF parameter estimate for ψo1 (AT E1,0) is around 88% (51%) higher than that of the IV estimate, and the M.C. SD of the CF parameter estimate is 130% (126%) higher. On the other hand, the BS. SE (the BS. SE of the standard error) of the CF parameter estimate for Γ2 is 22% (77%) lower than that of the IV estimate, and the M.C. SD (the M.C. SD of the standard error) of the CF parameter estimate is 67% (97%) lower. A very similar pattern, but with higher precision in the estimates, is observed in Tables B.10 through B.16 as I switch the distribution of the instrument z from asymmetric to symmetric and/or increase the sample size. For example, the M.C. SD of the CF parameter estimate for ψo0 (AT E2,0) is .3526 (1.9578) in Table B.13, which is around 72% (57%) lower than 1.2723 (4.5059), the corresponding CF figure in Table B.9.
As a result, when there is misspecification, the simulation results show that the efficiency findings resemble those without misspecification, and neither method has a definite efficiency advantage over the other. The results indicate that the CF method estimates Γg and their standard errors more precisely than the IV method does, whereas the IV method estimates ψog and the ATEs more precisely than the CF method does for g = 0, 1, 2.

2.5.2.2 Asymptotic Unbiasedness and Consistency Outcomes

Using the asymptotic unbiasedness and consistency ideas from Chapter 1, let's first take a look at the results with no misspecification (i.e., those in Tables B.1 through B.8). In the absence of misspecification, the simulation results show that the M.C. simulation estimates from the CF method for both ψog and AT Eh,0 are very close to the true values (even when the sample is relatively small), whereas those from the IV method are not close at all for g = 0, 1, 2 and h = 1, 2. For example, in Table B.1, the M.C. simulation estimates from the IV method for ψo0, ψo1, and ψo2 are respectively 1.4465 (about 45% higher than the true value), 1.8806 (around 6% lower than the true value), and 2.8735 (around 4% lower than the true value) and are all off the true values, which causes severe biases in the ATE estimates (about 28% lower in the estimated AT E1,0 and 14% lower in the estimated AT E2,0) even though the M.C. simulation estimates from the IV (and CF) method for Γ0, Γ1, and Γ2 are very close to the true values. On the other hand, the M.C. simulation estimates from the CF method for both ψog and AT Eh,0 in Table B.1 sit right on the true values with almost no biases. As the sample size increases from 1000 in Table B.1 to 10000 in Table B.4, the M.C. simulation estimates from the IV method do not improve on the biases; however, their BS. SEs and M.C. SDs get closer to zero, just as those from the CF method do. A very similar pattern can also be seen in Tables B.5 through B.8 as I switch the distribution of the instrument z from asymmetric to symmetric and/or increase the sample size. As a result, the simulation results indicate that, in the absence of misspecification, the CF method is asymptotically unbiased and consistent while the IV method is asymptotically biased and inconsistent (except for Γg, g = 0, 1, 2), which supports the conjecture I made in subsection 2.3.1.

Here, I should also note that having unbiased and consistent estimates from the IV method for Γg is in contrast to my expectations in subsection 2.3.1, since these are the coefficient estimates associated with the interaction terms dg(x²) for g = 0, 1, 2. One reason behind this result might be that the amount of exogenous variation coming from x² overpowers the endogeneity embedded in dg(x²), at least in my simulation analysis.

In the presence of misspecification, the simulation results in Tables B.9 through B.16 indicate that the M.C. simulation estimates for ψog and AT Eh,0 from both the CF method and the IV method are off the true values for g = 0, 1, 2 and h = 1, 2. For instance, in Table B.9, the M.C. simulation estimates from the CF method for ψo0, ψo1, and ψo2 are respectively .9461 (about 5% lower than the true value), 1.5961 (around 20% lower than the true value), and 4.9293 (around 64% higher than the true value) and are all off the true values, which leads to drastic biases in the ATE estimates (about 19% lower in the estimated AT E1,0 and 50% higher in the estimated AT E2,0) even though the M.C.
simulation estimates from both the IV and CF methods for Γ0, Γ1, and Γ2 are again very close to the true values, as in the case of no misspecification. As noted before, the M.C. simulation estimates from the IV method are also not close to the true values. However, there is a difference: the biases in the estimates from the IV method are lower than those in the estimates from the CF method, except in the estimates for ψo0 and AT E1,0. For example, in Table B.9, the M.C. simulation estimates from the IV method for ψo2 and AT E2,0 are respectively 3.0860 (around 3% higher than the true value) and 3.6704 (around 8% lower than the true value), and these biases are all smaller than their counterparts from the CF method, 64% and 50% respectively. As the sample size increases from Table B.10 to Table B.12, the M.C. simulation estimates from both the IV and CF methods do not improve on the biases; however, their BS. SEs and M.C. SDs get smaller. A very similar pattern can also be seen in Tables B.13 through B.16 as I switch the distribution of the instrument z from asymmetric to symmetric and/or increase the sample size. In fact, the simulation results in Tables B.13 through B.16 suggest that the IV method performs even better in terms of unbiasedness there, with larger biases only in the estimates for ψo0. As a result, the simulation results indicate that, under misspecification, the CF method is asymptotically more biased than the IV method, and both methods are inconsistent.

2.6 Conclusion

In this chapter, I introduce an econometric model with a discrete multivalued endogenous treatment variable and CRCs and show how to consistently estimate ATEs by a three-step CF estimation procedure in such a model, where the endogeneity problem is further exacerbated compared to the model without CRCs in Chapter 1. Moreover, I state that, based on the theorems developed in Chapter 1 and restated in this chapter, the asymptotic distribution of the CF estimates is normal, and the CF estimates are √N-consistent. Following the analogy principle as in Chapter 1, I propose a consistent estimator for the asymptotic variance matrix of the CF estimates that takes into account the nonlinear first stage estimation. For those who prefer not to go through multistep estimation, I also show how one can consistently estimate ATEs and obtain valid standard errors for the parameters of interest by using GMM.

In my simulation analysis, I compare the CF method and the IV method under various specifications with and without misspecification. In the absence of misspecification, the simulation results suggest that the CF method is asymptotically unbiased and consistent (but not necessarily more efficient because, for some parameters, the IV method provides sharper estimates than the CF method), whereas the IV method is generally asymptotically biased and inconsistent, which is more pronounced when the instrument is asymmetrically distributed. Therefore, the simulation results indicate that, without misspecification, the CF method can consistently estimate ATEs, while the IV method cannot. In the presence of misspecification, the simulation results show that both the CF and IV methods have biased estimates. However, the biases in the IV estimates are generally lower than those in the CF estimates, which is more obvious when the instrument is symmetrically distributed. With regard to efficiency, the findings from the simulations are mixed in the sense that neither method clearly outperforms the other.
In addition, especially in the presence of misspecification, the simulation results indicate that the IV method estimates standard errors less precisely (i.e., the standard errors of its standard-error estimates are larger) when the sample size is relatively small.

All of the research ideas mentioned in the conclusion of Chapter 1 can be explored in a discrete multivalued endogenous treatment model with CRCs as well. Beyond those ideas, it can be worth the time and effort to investigate ways to extend the model in this chapter to the framework of panel data models. We can also develop tests that measure the existence of CRCs and the degree of endogeneity attached to them, as in Heckman, Schmierer, and Urzua (2010). Furthermore, one can examine the possibility of devising a consistent IV method, i.e., a correction function approach as in Wooldridge (2008), and its large sample properties for the model presented in this chapter.

CHAPTER 3
ESTIMATION FOR MULTIVALUED ENDOGENOUS TREATMENT EFFECT MODELS USING HIGH DIMENSIONAL METHODS: A SIMULATION STUDY

3.1 Introduction

After the expansion of internet usage in the early 2000s, digitization accelerated and penetrated all facets of society and science, including the field of economics. As described in Athey and Luca (2019), many technology companies (e.g., Google, Apple, Facebook, Amazon, and Microsoft) have been increasingly employing economists in order to address problems such as individualized marketing and promotions, optimal pricing, auction platform design, and intervention effects. The power of digitization combined with its absorption by technology companies (and now rapidly by other traditional companies as well) also leads to new opportunities for collaborations with academics (e.g., the Golub Capital Social Impact Lab at Stanford Graduate School of Business). As a result, in recent years, economists have been making use of big data, which has driven the rising popularity of machine learning (ML) techniques in economics and attempts to improve upon existing econometric approaches by incorporating ML ideas.

The survey article of Donaldson and Storeygard (2016) summarizes several examples of how ML methods are applied in development economics. Mullainathan and Spiess (2017) provide a brief overview of the business-oriented prediction and classification problems (e.g., house valuation, industry classification, and hiring decisions) in which ML methods are used. There has also been considerable use of ML methods in policy determination and assignment in economics; see Athey (2018) for a review. Several econometric theory results for ML methods with regard to estimation and inference have been established in research areas such as treatment effects, structural models of consumer choice, panel data models, and model selection. For instance, in randomized experiments, Chernozhukov et al. (2020) advance generic estimation and inference results for heterogeneous treatment effects that are valid under various ML techniques such as boosted trees, ensemble methods, neural networks, random forests, and regularized methods. Athey, Tibshirani, and Wager (2019) develop generalized random forests that allow the estimation of any population quantities identified by moment conditions, as in the generalized method of moments (GMM). Farrell, Liang, and Misra (2021) derive new rates of convergence for deep neural networks and use these rates to develop semiparametric inference on parameters of interest.
Iskhakov, Rust, and Schjerning (2020) look at how ML can further contribute to structural econometrics. The application of ML methods in panel data models has received some attention from economists; see, for example, Belloni et al. (2016); Kock (2016); Chernozhukov, Wuthrich, and Zhu (2019); Semenova et al. (2021); and Athey et al. (2021). For more on the economics research areas in which economists can benefit from tools originating in the ML literature, see Athey and Imbens (2019).

In recent years, there have been developments in high dimensional regularized models applied to economics. As one of the earliest works in high dimensional regularized models, Belloni and Chernozhukov (2011a) present concepts related to high dimensional sparse econometric models and their estimation using ℓ1-regularized and post ℓ1-regularized ML methods, particularly the least absolute shrinkage and selection operator (LASSO), in linear and nonparametric settings. Technically, the ideas used in this paper go all the way back to the 2009 version of Belloni and Chernozhukov (2013) available on arXiv.org. In these papers, high dimensionality means that the number of parameters to be estimated in an econometric model exceeds the sample size, and sparsity means that the number of covariates with nonzero coefficients is in reality smaller than the sample size and unknown. In a pioneering follow-up work, Belloni, Chernozhukov, and Hansen (2011a) share inferential (and estimation) results for the linear instrumental variables (IV) model with many instruments and for partially linear models in a high dimensional sparse setting. In another influential paper, Belloni and Chernozhukov (2011b) develop regularized quantile regression in high dimensional sparse models, show the consistency of their estimator for this model, and evaluate its performance by simulation. Later, Belloni, Chernozhukov, and Kato (2019) also establish new inference methods for the parameters from ℓ1-penalized quantile regression in high dimensional sparse settings. For an early overview of estimation techniques and inference in high dimensional models with a focus on model selection and LASSO methods, see Belloni, Chernozhukov, and Hansen (2014b) and Chernozhukov, Hansen, and Spindler (2015a).

In the last decade, there has also been some notable work on the causal and debiased estimation of linear high dimensional models. For example, Belloni et al. (2012) employ LASSO to select instruments, show the asymptotic properties of the linear IV estimator based on the post LASSO method, which dampens the shrinkage bias associated with LASSO, and use partialled-out variables for inference when instruments are weak. Belloni, Chernozhukov, and Wei (2016) extend the results of this paper to generalized linear high dimensional regression models under the regular sparsity assumption. Belloni, Chernozhukov, and Hansen (2014a) introduce a new estimation method called post double selection and provide uniformly valid post selection inference for treatment effects in sparse high dimensional models. This method is robust to imperfect selection of the controls and allows for non-Gaussian and heteroscedastic disturbances too. Jointly using Neyman-orthogonal scores and crossfitting, Chernozhukov et al. (2018) propose a residual-on-residual regression method to remove the biases associated with regularized ML methods from causal parameters of interest and to construct valid confidence intervals for these parameters.
For more on debiased estimation and valid post-estimation inference results and applications, see Belloni et al. (2017); Chernozhukov et al. (2017); Chernozhukov, Newey, and Singh (2021); and Athey, Imbens, and Wager (2018). In the last couple of years, there has also been some interest in high dimensional moment-based regularized models and high dimensional models with measurement error. For more on these, see, for example, Belloni et al. (2018a); Belloni et al. (2018b); Caner and Kock (2019); Bach et al. (2020); Belloni, Chernozhukov, and Kaul (2017); Chernozhukov, Wuthrich, and Zhu (2018); Belloni, Kaul, and Rosenbaum (2019); and Chernozhukov et al. (2020).

In this chapter, I extend my work from Chapter 1 to a high dimensional sparse model in a particular setting. Imagine that there exists an extra set of high dimensional variables. A low dimensional (and unknown) subset of these variables has an impact on the outcome (so some of these variables are relevant in the outcome equation); however, all of these high dimensional variables are totally ignorable (or redundant) for the decision to undertake the treatment given some instruments in the selection equation, which is not unheard of in experimental intervention studies. As has long been known (Cochran, 1957), the addition of distinctly relevant variables to a regression often results in more precise estimation and better prediction. For this reason, the inclusion of the high dimensional variables in estimation can potentially improve the efficiency and predictive power of the model in Chapter 1, without jeopardizing its consistency results of interest, when the high dimensional variables are orthogonal to (or uncorrelated with) the variables of interest already included. However, the high dimensionality of these extra variables requires estimation methods that can select, among the extra set of high dimensional variables, those really influencing the outcome of interest (a capability especially needed if the sample size is also small) and that produce reliable inference results. Through a simulation analysis, this chapter aims to guide economists on whether (and which) LASSO-based inference and variable selection methods perform better than the control function (CF) estimator from Chapter 1 for discrete multivalued endogenous treatments in the linear scalar-outcome high dimensional sparse model with heterogeneous counterfactual errors just described. I also allow for non-Gaussian disturbances in my particular setting.

To address endogeneity, I again use the CF approach from Chapter 1: first, a multinomial logit model for the treatment decision, in a setting that is not high dimensional, is estimated by maximum likelihood to construct CF variables; second, these CF variables are added to the outcome equation in the high dimensional setting, which is then estimated by different ML methods. For the parameter estimation in the outcome equation with high dimensional variables, I specifically use four different ML methods: LASSO; post partial-out LASSO of Belloni et al. (2012); post double selection LASSO of Belloni, Chernozhukov, and Hansen (2014a); and double/debiased ML LASSO of Chernozhukov et al. (2018). In my simulations, I also include the CF method from Chapter 1 as a benchmark with the aim of comparing the finite sample performances of these four LASSO-based ML methods to the CF's.
In the comparison, I employ measures such as the bias of ATE estimates, the standard deviation of ATE estimates, mean absolute prediction error, root mean square error (RMSE), the mean number of correctly selected covariates, and the mean size of the selected set of covariates. To the best of my knowledge, this chapter is the first simulation-based comparative analysis of the LASSO-based methods above in a discrete multivalued endogenous treatment model in a linear high dimensional sparse setting with heterogeneous counterfactual errors. The main simulation finding in this setting is that, on top of being on par with the CF method on finite sample bias grounds, the LASSO-based methods can surpass the efficiency performance of the CF method in ATE estimation if there exist enough extra predictive variables that are ignorable in treatment selection among a set of high dimensional predictors of the outcome.

The rest of this chapter is organized as follows. In section 3.2, I introduce the model. In section 3.3, I summarize the ML methods and the procedures to estimate the parameters of interest. In section 3.4, I share some simulation results. In section 3.5, I conclude. And, in appendix C, I share the simulation tables that are kept out of the main body of this chapter.

3.2 The Model

Consider the model from Chapter 1 augmented by the presence of high dimensional variables:

yg = αg + xβg + hδg + ug
wg∗ = zγg + ag,   (3.1)

where yg is the gth counterfactual outcome variable, αg is the scalar intercept in the counterfactual outcome equation for yg, x ≡ (x1, x2, …, xl) is the 1 × l vector of exogenous variables in yg, βg is the l × 1 vector of slope coefficients associated with x in yg, h ≡ (h1, h2, …, hlh) is the 1 × lh vector of high dimensional exogenous variables in yg, δg is the lh × 1 vector of slope coefficients associated with h in yg, ug is the counterfactual error in yg, wg∗ is the latent treatment variable that determines the choice of treatment status among G + 1 alternative treatment statuses, z ≡ (z1, z2, …, zk) is the 1 × k vector of instruments (including a constant term) in the choice equation for wg∗, γg is the k × 1 vector of parameters in wg∗, and ag is the scalar error term in wg∗ that is independently and identically (i.i.d.) Gumbel distributed with location parameter µ = 0 and scale parameter β = 1, for g = 0, 1, …, G. Here, high dimensionality means that lh is bigger than the sample size N (i.e., lh > N) available in estimation.

As in Chapter 1, let w ∈ {0, 1, …, G} be the observed discrete multivalued endogenous treatment variable whose values are determined by wg∗, which can be regarded as the utility or satisfaction obtained from treatment status g for g = 0, 1, …, G. Let the treatment statuses of w be exhaustive and mutually exclusive. Define binary treatment status indicators dg = 1[w = g] for g = 0, 1, …, G. The binary treatment status indicator dg is thus equal to one if the treatment status is equal to g and zero otherwise. This, coupled with the mutual exclusivity of treatment statuses, implies that Σ_{g=0}^G dg = 1. Define the 1 × (G + 1) vector of treatment statuses d ≡ (d0, d1, …, dG). Let y be the observed outcome. Then, I can write

y = d0 y0 + d1 y1 + · · · + dG yG,   (3.2)

where yg is the gth counterfactual outcome for g = 0, 1, …, G.
Having described the discrete multivalued endogenous treatment model with heterogeneous counterfactual errors and high dimensional variables, I will now list a series of assumptions, some of which have also been made in Chapter 1, to further develop the model. First, I assume that rational economic agents choose the treatment status from which they receive the most satisfaction out of all possible treatment statuses. That is,

• Assumption 3.1 (A.3.1): One chooses treatment status g, i.e., w = g, if and only if wg∗ ≥ wj∗ ∀j ≠ g for g, j = 0, 1, …, G.

Second, I assume that identification of the model in (3.1) and (3.2) is aided by the exclusion of some (at least one) variables in the set of instruments z from the set of exogenous variables in x. The exogenous variables in x can all be included in the set of instruments z.

• Assumption 3.2 (A.3.2): Identification of the model described by (3.1) and (3.2) is strengthened by the exclusion of at least one variable in z from the set of variables in x.

In economics, it is often the case that z includes all the variables in x. However, this is not an absolute necessity in my model as long as the exclusion restriction above is satisfied. Third, I make an assumption about the irrelevance of h given the set of instruments z in the gth choice equation for g = 0, 1, …, G.

• Assumption 3.3 (A.3.3): D(wg∗ | z, h) = D(wg∗ | z), where D(· | ·) denotes a conditional distribution. That is, conditional on z, wg∗ (and hence w) is independent of h.

Conceptually, having variables, such as the ones in h, that appear only in the outcome equation is standard in a randomized controlled trial (RCT) because treatment selection (or assignment) is totally random. For this reason, in RCTs, the propensity score for treatment does not depend on any variables, whereas the outcome of interest can depend on some variables. Apart from RCTs, here is a framework where A.3.3 is reasonable. Suppose an economist has an experimental intervention with exogenous instruments. In line with Vella and Verbeek (1999, p. 474) and Heckman (1990, p. 314), an experimental intervention with exogenous instruments means that, using the exogenous instruments, unobservables that play a role in the treatment decision and that are correlated with the outcome of interest have been taken into consideration. Further assume that this economist can create such reliable and strong instruments z via surveys that these instruments include all the observed information necessary to make treatment decisions. In this case, it is plausible that the treatment variable w out of this intervention is related only to the instruments z and is conditionally independent of other outcome-related (and possibly high dimensional) observables h once z is controlled for.

It is also important to distinguish between the set of variables in x and the set of high dimensional variables h in the counterfactual outcome equation yg. x can be thought of as micro-level structural characteristics, and h as macro-level characteristics. To make the difference between x and h clearer, consider the following example from development economics, similar to that in Danquah et al. (2021). Imagine that one studies how influential the household gender wage gap is on women's empowerment, using survey data with a limited number of observations (but with a dense/high dimensional set of variables created from survey answers and outside data sources).
In such a study, the dependent variable can be the share of household assets owned by women. The treatment variable can be the gender wage gap at the household level, grouped into low, medium, and high levels of relative difference between average male and female adult household members' earnings. x can be exogenous covariates, such as household and family characteristics, that affect the share of household assets owned by women. On the other hand, h can be high dimensional regressors, such as occupation, sector, access to certain infrastructure and services, location, and social norms, that impact the share of household assets owned by women.

As a result, the model in (3.1) and (3.2), together with the assumptions made so far, still allows the treatment variable w to follow a multinomial logit model with choice probabilities given as follows:

P(w = g | x, h, z) = P(w = g | z) = exp(zγg) / Σ_{r=0}^G exp(zγr),   (3.3)

for g = 0, 1, …, G. See McFadden (1973) for a similar result without the high dimensional variables h. The next assumption is Dubin and McFadden's linearity assumption combined with the redundancy of x, h, and z for the expectation of ug conditional on a.

• Assumption 3.4 (A.3.4): E(ug | x, h, z, a) = E(ug | a) = Σ_{j=0}^G ηg,j aj + [−Σ_{j=0}^G ηg,j E(aj)], where ug is the counterfactual error in yg, x is the 1 × l vector of exogenous variables in yg, h is the 1 × lh vector of high dimensional exogenous variables in yg, z is the 1 × k vector of instruments (including a constant term) in the choice equation for wg∗, a ≡ (a0, a1, …, aG) is the 1 × (G + 1) vector of i.i.d. Gumbel distributed errors aj with location parameter µ = 0 and scale parameter β = 1 in wj∗, ηg,j is a scalar multiple of the correlation coefficient between ug and aj, and E(aj) = 0.5772 is Euler's constant for j, g = 0, 1, …, G.

Using all the assumptions from A.3.1 through A.3.4, I create CF terms as derived in Chapter 1 to deal with the endogeneity in w. In section 3.3, I will introduce the ML methods I employ to estimate the parameters of interest in (3.1).

3.3 Estimation

To eliminate the complications of endogeneity in w, I again rely on the CF approach presented in Chapter 1. Referring back to appendix A, it is straightforward to derive the estimating equation for the parameters of interest in (3.1): just add h as another conditioning variable in the equations in that section. Then, the baseline estimating equation for the regressions augmented by the CF terms in this section is as below:

y = Σ_{j=0}^G dj αj + Σ_{k=0}^G dk xβk + Σ_{m=0}^G dm hδm + Σ_{g=0}^G [−ηg,g dg log(Λg)] + Σ_{g≠0} dg ηg,0 M0 + Σ_{g≠1} dg ηg,1 M1 + · · · + Σ_{g≠G} dg ηg,G MG + e,   (3.4)

where E(e | d, x, h, z) = 0, Λj = exp(zγj) / Σ_{r=0}^G exp(zγr), and Mj = Λj log(Λj)/(1 − Λj) for j = 0, 1, …, G. Note that, because of A.3.3, the high dimensional variables h are in fact not needed for consistently estimating the partial effects αj (and hence the ATEs if E(x) = E(h) = 0) for j = 0, 1, …, G. Therefore, the CF estimator from Chapter 1 applied to (3.4) without the terms Σ_{m=0}^G dm hδm still purges the endogeneity of the model and is still valid for producing consistent partial effect estimates for αj. A sketch of how the CF terms in (3.4) can be built from a first stage multinomial logit fit is given below.
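As an illustration of the first stage behind (3.4), here is a minimal sketch, assuming the statsmodels MNLogit implementation, of fitting the multinomial logit in (3.3) and assembling the CF terms −dg log(Λ̂g) and dg M̂j; the function name and array layout are my own choices, not part of the chapter's formal development.

    import numpy as np
    import statsmodels.api as sm

    def control_function_terms(w, Z):
        """First stage MNL of w on Z, then the CF terms used in (3.4).

        w : (N,) treatment status in {0, 1, ..., G}
        Z : (N, k) instrument matrix whose first column is a constant
        """
        G1 = int(w.max()) + 1                        # G + 1 treatment statuses
        mnl = sm.MNLogit(w, Z).fit(disp=0)           # conditional MLE of gamma
        Lam = np.asarray(mnl.predict(Z))             # (N, G+1) fitted Lambda_g
        d = np.eye(G1)[w.astype(int)]                # indicators d_g = 1[w = g]
        M = Lam * np.log(Lam) / (1.0 - Lam)          # M_g = Lam_g log(Lam_g)/(1 - Lam_g)

        cf = [-d[:, [g]] * np.log(Lam[:, [g]]) for g in range(G1)]   # own terms
        for j in range(G1):                          # cross terms d_g * M_j, g != j
            for g in range(G1):
                if g != j:
                    cf.append(d[:, [g]] * M[:, [j]])
        return d, np.hstack(cf)                      # (N, G+1), (N, (G+1)^2)

The second stage then regresses y on d, its interactions with x (and with h, via one of the LASSO procedures below), and these (G + 1)² CF columns.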
Before moving forward with the ML methods and their procedures, let me make the following assumption regarding the sparsity in (3.4).

• Assumption 3.5 (A.3.5): A random sample {wi, zi, yi, xi, hi} of N observations (i.e., i = 1, 2, …, N) is available for the treatment choice and outcome equations. The number of parameters in both γg and βg is much smaller than the sample size N (i.e., k << N and l << N). Suppose hδg = (h1δg1 + h2δg2), where h1 is the 1 × lh1 vector of exogenous variables in h associated with nonzero slope coefficients δg1, and h2 is the 1 × lh2 vector of high dimensional (i.e., lh2 > N) exogenous variables in h associated with zero slope coefficients δg2. It is not known which variables in h belong to h1 or h2. However, the number of nonzero slope coefficients in δg is much smaller than the sample size N (i.e., lh1 << N, the sparsity condition), such that the number of coefficients to be estimated in (3.4), say p0, is in reality much smaller than the sample size (i.e., p0 = (l + lh1 + G + 2)(G + 1) << N).

A.3.5 implies that, given the assumptions made earlier, the multinomial logit model of w on z can be consistently estimated using standard asymptotic theory. On the other hand, since the variables associated with nonzero coefficients in h are not known, there are potentially p = (l + lh + G + 2)(G + 1) variables in (3.4). And this number p, already greater than N as assumed by A.3.5, can grow even further as the number of treatment statuses increases. If one wants to take advantage of the variables in h in estimation, then the high dimensionality in (3.4) creates a big estimation problem that cannot be handled by low dimensional methods such as the CF estimator from Chapter 1 and that pushes researchers toward estimation methods designed for high dimensional and sparse settings.

Due to the sparsity condition in A.3.5 and the linearity of (3.4) in its parameters, the LASSO estimation method and its debiased versions can be used to estimate the parameters of interest in (3.4), e.g., the partial effects αj for j = 0, 1, …, G. As suggested just above, one can benefit from the existence of the variables in h using LASSO-based methods. Owing to the relevance of the high dimensional variables in h to the outcome variable y and the independence of the treatment decision from h, estimating (3.4) by LASSO-based methods can improve the efficiency and predictive power of the model over the CF estimator from Chapter 1 without endangering consistency. For example, in low dimensional settings with randomly assigned treatment, Imbens and Rubin (2015), Lin (2013), and Negi and Wooldridge (2021) all point out efficiency gains from adding extra variables into a regression that are sufficiently predictive of the outcome. Considering all this, with strong instruments z and h1 highly predictive of the outcome, it can be worthwhile from efficiency and prediction perspectives (and hopefully biaswise as well) to estimate (3.4) by (at least one of) the LASSO-based methods, which are more complicated and more time-consuming than the CF estimator from Chapter 1. Now, I can describe the four LASSO-based methods I will use to estimate the parameters of interest in (3.4) under all the assumptions from A.3.1 through A.3.5.

3.3.1 LASSO Estimation

Developed by Tibshirani (1996), LASSO is a regularized regression method similar to ridge regression, which economists are familiar with and use when they want to decrease the severity of multicollinearity among covariates at the expense of shrinking their coefficient estimates via a penalty on coefficient sizes. Thus, LASSO and its variants are sometimes also called shrinkage or penalized regression methods.
In my model given by (3.1) and (3.2), together with all the assumptions from A.3.1 through A.3.5, the LASSO estimator solves the following problem:

min_{θ∈Θ} { Σ_{i=1}^N (yi − m(Xi, v(di, zi, γ̂), θ))²/2 + λ Σ_{j=1}^p |θj| },   (3.5)

where m(Xi, v(di, zi, γ̂), θ) ≡ Xiϑ + vi(γ̂)ϕ is a real-valued scalar function, γ̂ = (γ̂′0, γ̂′1, …, γ̂′G)′ is the (G + 1)k × 1 vector of √N-consistent and asymptotically normal first stage conditional maximum likelihood estimates from the multinomial logit (MNL) regression of wi on zi for i = 1, 2, …, N, θ = (θ1, θ2, …, θp)′ = (ϑ′, ϕ′)′ is the p × 1 vector of parameters, | · | is the absolute value operator, Σ_{j=1}^p |θj| is the ℓ1 norm (a.k.a. the ℓ1 LASSO penalty), λ is the Lagrange multiplier (a.k.a. the tuning parameter that determines the strength of the penalty), Xi is the 1 × (lh + l + 1)(G + 1) vector of regressors, and vi(γ̂) is the 1 × (G + 1)(G + 1) vector of generated regressors (in fact, all CF terms). More explicitly,

Xi = ( d0i, d1i, …, dGi, d0i xi, d1i xi, …, dGi xi, d0i hi, d1i hi, …, dGi hi )

vi(γ̂) = ( −d0i log(Λ̂0i), …, −dGi log(Λ̂Gi), d1i M̂0i, d2i M̂0i, …, dGi M̂0i, d0i M̂1i, d2i M̂1i, d3i M̂1i, …, dGi M̂1i, …, d0i M̂Gi, d1i M̂Gi, …, dG−1,i M̂Gi ),   (3.6)

where Λ̂gi = exp(zi γ̂g) / Σ_{r=0}^G exp(zi γ̂r) and M̂gi = Λ̂gi log(Λ̂gi)/(1 − Λ̂gi) for g = 0, 1, …, G and i = 1, 2, …, N.

Notice that the LASSO estimator minimizes (3.5) only with respect to θ, not with respect to θ and λ together. Indeed, the minimization is done for a given value of λ, so the tuning parameter needs to be chosen before the minimization problem is solved. One of the most common approaches to choosing λ is cross validation, which typically runs LASSO (on the training set, some portion of the sample) using candidate λ's from a grid of values and selects the value of λ to be used in the minimization problem by minimizing, say, the prediction error on the validation set (the remaining portion of the sample). Also note that, as λ gets larger, the ℓ1 LASSO penalty gets heavier (meaning more bias in the coefficients) and the selected model gets sparser (i.e., more coefficients are set exactly equal to zero). To learn more about shrinkage methods and cross validation, see subchapters 3.4 and 7.10, respectively, in Hastie, Tibshirani, and Friedman (2009).

Another method to choose λ comes from the penalty level formula (12) in Belloni, Chernozhukov, and Wang (2011, p. 795). The formula is given below:

λ = c √N Φ−1(1 − α/2p),   (3.7)

where c = 1.1, N is the sample size, Φ−1 is the inverse standard normal cumulative distribution function, α = .05, and p is the number of potential variables in the regression, which is equal to lh in my case. This method is especially designed for LASSO-based inference methods. It does not estimate the model several times for other values of λ from a grid of values, which makes it much faster than cross validation. Compared to cross validation, LASSO-based inference methods using (3.7) also tend to better select the variables that have a true nonzero impact on the outcome. For these reasons, I prefer using (3.7) over cross validation in this chapter.
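For a quick numerical sense of (3.7), the snippet below evaluates the penalty level with scipy; the plugged-in values N = 1000 and p = 695 anticipate the simulation design in section 3.4.

    import numpy as np
    from scipy.stats import norm

    def bcw_penalty(N, p, c=1.1, alpha=0.05):
        """Penalty level lambda = c * sqrt(N) * Phi^{-1}(1 - alpha/(2p)) from (3.7)."""
        return c * np.sqrt(N) * norm.ppf(1 - alpha / (2 * p))

    print(bcw_penalty(N=1000, p=695))   # roughly 138 at these values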
Now, I can prescribe the following LASSO procedure to estimate the partial effects αj in the high dimensional and sparse setting for j = 0, 1, …, G:

Procedure 3.1
1. Estimate the predicted probabilities, Λ̂ji = exp(zi γ̂j) / Σ_{r=0}^G exp(zi γ̂r), from a MNL regression of wi on zi for j = 0, 1, …, G and i = 1, 2, …, N.
2. Run the LASSO regression of yi on Xi and vi(γ̂), with only the dji hi's subject to selection.
3. Obtain the parameter estimates of the dj's from step 2, where Xi and vi(γ̂) are defined as in (3.6) for j = 0, 1, …, G and i = 1, 2, …, N.

Due to the existence of generated regressors in the model, the standard errors and confidence intervals associated with the parameter estimates from LASSO are unfortunately not valid. For this reason, I rely on Monte Carlo simulation to draw inference, which is also true for all the other LASSO-based estimations below.

3.3.2 Post Partial-out LASSO Estimation

The main disadvantage of LASSO is that its parameter estimates are biased towards zero after regularization by the ℓ1 penalty. Therefore, for inference purposes, researchers need to handle the regularization bias. To this end, several LASSO-based methods have been proposed by economists, and one of them, based on Belloni et al. (2012), is called post partial-out LASSO. The "post" part simply means the application of linear regression to the model selected by LASSO and alleviates the biases in the LASSO parameter estimates due to regularization. The "partial-out" part deals with endogenous covariates and the inclusion of irrelevant variables (or the exclusion of relevant variables). Belloni et al. (2012) imply that the score equations from the linear regression of a partialled-out outcome on partialled-out covariates are immune to endogeneity and selection mistakes. Here, the partialled-out outcome is the residual after running a linear regression of the outcome on the LASSO-selected covariates, and similarly the partialled-out covariate is the residual after running a linear regression of that covariate on the other LASSO-selected covariates. Because of this partialling-out procedure, post partial-out LASSO can be called post residual-on-residual LASSO, with reference to conventional partialling-out estimators.

Adapting closely from Algorithm 1 in Chernozhukov, Hansen, and Spindler (2015b), the post partial-out LASSO procedure I can prescribe to estimate the partial effects in the high dimensional and sparse setting is as follows (a code sketch follows the procedure):

Procedure 3.2
1. Same as in Procedure 3.1.
2. Run a LASSO regression of dji on d0i xi, d1i xi, …, dGi xi, d0i hi, d1i hi, …, dGi hi, and vi(γ̂), where only the dji hi's are subject to selection, and the dji xi's and vi(γ̂) are forced to be included in the selected controls. Denote the selected controls by s^d_ji for j = 0, 1, …, G.
3. Run a regression of dji on s^d_ji. Let d̂ji be the residuals from this regression for j = 0, 1, …, G.
4. Let d̂i = (d̂0i, d̂1i, …, d̂Gi) be the collection of all the residuals from step 3.
5. Run a LASSO regression of yi on d0i xi, d1i xi, …, dGi xi, d0i hi, d1i hi, …, dGi hi, and vi(γ̂), where only the dji hi's are subject to selection, and the dji xi's and vi(γ̂) are forced to be included in the selected controls. Denote the selected controls by s^y_i.
6. Run a regression of yi on s^y_i. Let ŷi be the residuals from this regression.
7. Run a regression of ŷi on d̂i.
8. Obtain the parameter estimates of d̂ from step 7, where vi(γ̂) is defined as in (3.6) for i = 1, 2, …, N.
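A minimal sketch of the residual-on-residual core of Procedure 3.2 (steps 2 through 8), assuming scikit-learn, is given below. Two simplifications are mine alone: the helper lasso_select appends the forced-in controls after selecting over the high dimensional block, and scikit-learn's alpha parameter scales the least squares objective by 1/N, so λ from (3.7) would need rescaling before being passed in.

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    def lasso_select(target, H, forced, lam):
        """Columns of H picked by LASSO at penalty lam, plus forced-in controls."""
        fit = Lasso(alpha=lam, max_iter=10000).fit(H, target)
        return np.hstack([forced, H[:, fit.coef_ != 0]])

    def post_partial_out(y, d, H, forced, lam):
        """Residual-on-residual regression: partial controls out of each d_j and y.

        y      : (N,)     outcome
        d      : (N, G+1) treatment indicators
        H      : (N, p_h) high dimensional candidate controls (the d_j*h columns)
        forced : (N, f)   always-included controls (d_j*x interactions, CF terms)
        """
        d_res = np.empty_like(d, dtype=float)
        for j in range(d.shape[1]):                     # steps 2-4
            S = lasso_select(d[:, j], H, forced, lam)
            d_res[:, j] = d[:, j] - LinearRegression().fit(S, d[:, j]).predict(S)
        S_y = lasso_select(y, H, forced, lam)           # step 5
        y_res = y - LinearRegression().fit(S_y, y).predict(S_y)   # step 6
        return LinearRegression(fit_intercept=False).fit(d_res, y_res).coef_  # 7-8

The returned coefficients on the residualized d̂j's are the debiased partial effect estimates of interest; their standard errors are still simulated, per the inference caveat above.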
3.3.3 Post Double Selection LASSO Estimation

Another method to lessen the regularization bias on parameters estimated by LASSO is post double selection LASSO, given by Belloni, Chernozhukov, and Hansen (2014a). Technically, post double selection LASSO is a simplified version of post partial-out LASSO without the partialling out of the outcome variable and the other covariates. However, it is still robust to imperfect selection of the covariates, on top of reducing the bias in parameter estimates. The "post" part means the same as in post partial-out LASSO. The "double selection" part, on the other hand, means that covariates are selected for predicting both the variables of interest and the outcome. This double selection of the covariates to be included in the final estimating equation is done with the purpose of strengthening the validity of the inference results by adding all the important and relevant variables that are correlated with the variables of interest and the outcome.

In line with the estimation steps laid out in Belloni, Chernozhukov, and Hansen (2014a, p. 610), the post double selection LASSO procedure I can prescribe to estimate the partial effects in the high dimensional and sparse setting is as follows (a code sketch follows the procedure):

Procedure 3.3
1. Same as in Procedure 3.1.
2. Same as in Procedure 3.2.
3. Run a LASSO regression of yi on d0i xi, d1i xi, …, dGi xi, d0i hi, d1i hi, …, dGi hi, and vi(γ̂), where only the dji hi's are subject to selection, and the dji xi's and vi(γ̂) are forced to be included in the selected controls. Denote the selected controls by s^y_i.
4. Let si ≡ (∪_{j=0}^G s^d_ji) ∪ s^y_i be the union of all the selected controls from steps 2 and 3.
5. Run a regression of yi on d0i, d1i, …, dGi, and si.
6. Obtain the parameter estimates of the dj's from step 5, where vi(γ̂) is defined as in (3.6) for j = 0, 1, …, G and i = 1, 2, …, N.
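The union step that distinguishes Procedure 3.3 can be sketched as follows, again assuming scikit-learn; for brevity the sketch selects over the high dimensional block H alone (rather than forcing the low dimensional controls into each selection run), tracks selections by column index, and finishes with a single OLS, all of which are my illustrative choices.

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    def selected_columns(target, H, lam):
        """Indices of columns of H picked by a LASSO of target on H at penalty lam."""
        fit = Lasso(alpha=lam, max_iter=10000).fit(H, target)
        return set(np.flatnonzero(fit.coef_))

    def post_double_selection(y, d, H, forced, lam):
        """Steps 2-6 of Procedure 3.3: union the selections, then one OLS."""
        union = selected_columns(y, H, lam)                  # step 3
        for j in range(d.shape[1]):                          # step 2
            union |= selected_columns(d[:, j], H, lam)
        controls = np.hstack([forced, H[:, sorted(union)]])  # step 4
        X = np.hstack([d, controls])                         # step 5
        ols = LinearRegression(fit_intercept=False).fit(X, y)
        return ols.coef_[:d.shape[1]]                        # step 6: coefs on d_j

Because the dj's sum to one across treatment statuses, the final OLS is run without an intercept to avoid collinearity with the implied constant.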
3.3.4 Double/Debiased ML LASSO Estimation

Lastly, I briefly go over double/debiased ML LASSO, defined in great detail with its theoretical properties in Chernozhukov et al. (2018). Double/debiased ML LASSO is the most complicated of all the LASSO estimation methods described in this section and can be seen as the crossfit version of post partial-out LASSO. To attenuate the effects of regularization bias and overfitting bias on the parameters of interest, double/debiased ML LASSO uses two important techniques: Neyman-orthogonal moments/scores and crossfitting. For this reason, this estimation method gets the name "double/debiased ML." Regularization bias is handled by the orthogonalization obtained from both partialling out and crossfitting, and crossfitting plays a major role in overcoming the bias induced by overfitting.

Chernozhukov et al. (2018) define two different double/debiased ML approaches. In remark 3.1 of that article, the authors recommend their second approach, DML2, over DML1 in many problems. For this reason, following definition 3.2 of double/debiased ML in Chernozhukov et al. (2018, p. C23), the double/debiased ML LASSO procedure I can prescribe to estimate the partial effects in the high dimensional and sparse setting is as follows (a code sketch follows the procedure):

Procedure 3.4
1. Same as in Procedure 3.1.
2. Divide the sample of N observations into K equal-sized partitions randomly.
3. Let Ik be the set of observations in partition k and IkC ≡ {1, 2, …, N} \ Ik for k = 1, 2, …, K.
4. Run a LASSO regression of dji on d0i xi, d1i xi, …, dGi xi, d0i hi, d1i hi, …, dGi hi, and vi(γ̂), where only the dji hi's are subject to selection, and the dji xi's and vi(γ̂) are forced to be included in the selected controls. Denote the selected controls by s^{d,k}_ji for j = 0, 1, …, G, i ∈ IkC, and k = 1, 2, …, K.
5. Run a regression of dji on s^{d,k}_ji, and let the parameter estimates from this regression be ς̂^{d,k}_j for j = 0, 1, …, G, i ∈ IkC, and k = 1, 2, …, K.
6. Construct the residuals d̂ji = dji − s^{d,k}_ji ς̂^{d,k}_j for j = 0, 1, …, G, i ∈ Ik, and k = 1, 2, …, K.
7. Let d̂i = (d̂0i, d̂1i, …, d̂Gi) be the collection of all the residuals from step 6 for i ∈ Ik and k = 1, 2, …, K.
8. Run a LASSO regression of yi on d0i xi, d1i xi, …, dGi xi, d0i hi, d1i hi, …, dGi hi, and vi(γ̂), where only the dji hi's are subject to selection, and the dji xi's and vi(γ̂) are forced to be included in the selected controls. Denote the selected controls by s^{y,k}_i for i ∈ IkC and k = 1, 2, …, K.
9. Run a regression of yi on s^{y,k}_i, and let the parameter estimates from this regression be ς̂^{y,k} for i ∈ IkC and k = 1, 2, …, K.
10. Construct the residuals ŷi = yi − s^{y,k}_i ς̂^{y,k} for i ∈ Ik and k = 1, 2, …, K.
11. Run a regression of ŷi on d̂i for i = 1, 2, …, N.
12. Obtain the parameter estimates of d̂ from step 11, where vi(γ̂) is defined as in (3.6).
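The crossfitting in steps 2 through 11 can be sketched as follows, assuming scikit-learn; to keep it short, I fold the LASSO selection and the post OLS into one nuisance fit per fold and let one helper serve both the dji and yi residualizations. The helper names, the fold count K = 5, and the pooled final OLS are my own choices within the DML2 recipe.

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.model_selection import KFold

    def fold_fit_predict(target, W, train, test, lam):
        """LASSO selection on the training fold, post OLS, held-out prediction."""
        sel = np.flatnonzero(Lasso(alpha=lam, max_iter=10000)
                             .fit(W[train], target[train]).coef_)
        S = W[:, sel] if sel.size else np.ones((W.shape[0], 1))
        ols = LinearRegression().fit(S[train], target[train])
        return ols.predict(S[test])

    def double_debiased_ml(y, d, W, lam, K=5, seed=0):
        """DML2-style crossfit: out-of-fold residuals for y and each d_j, then OLS."""
        N, G1 = d.shape
        y_res, d_res = np.empty(N), np.empty((N, G1))
        for train, test in KFold(K, shuffle=True, random_state=seed).split(W):
            y_res[test] = y[test] - fold_fit_predict(y, W, train, test, lam)   # 8-10
            for j in range(G1):                                                # 4-7
                d_res[test, j] = d[test, j] - fold_fit_predict(d[:, j], W, train, test, lam)
        return LinearRegression(fit_intercept=False).fit(d_res, y_res).coef_   # 11-12

Here W stacks all the second stage controls (the dji xi interactions, the dji hi columns, and vi(γ̂)); unlike the formal procedure, this sketch lets LASSO select over all of W rather than forcing in the low dimensional controls.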
In a fashion similar to section 1.4 of Chapter 1, ATEs in this chapter take the following form:

AT Eg,0 = E(yg − y0) = E(αg + xβg + hδg + ug − (α0 + xβ0 + hδ0 + u0)) = (αg − α0) + E(x)(βg − β0) + E(h)(δg − δ0),   (3.8)

where the third equality uses E(ug) = 0 for g = 1, 2, …, G. Then, a consistent estimator of AT Eg,0 is

ÂTEg,0 = (α̂g − α̂0) + x̄(β̂g − β̂0) + h̄(δ̂g − δ̂0),   (3.9)

where α̂g, α̂0, β̂g, β̂0, δ̂g, δ̂0, x̄ = N−1 Σ_{i=1}^N xi, and h̄ = N−1 Σ_{i=1}^N hi are respectively consistent estimates of αg, α0, βg, β0, δg, δ0, E(x), and E(h).

One might think that, once the partial effects in the high dimensional and sparse setting are estimated using one of Procedures 3.1−3.4, ATEs can easily be obtained by plugging the parameter estimates α̂g, α̂0, β̂g, β̂0, δ̂g, and δ̂0 into (3.9). However, this is not true, because Procedures 3.1−3.4 all provide debiased (and consistent) estimates only for αg for g = 0, 1, …, G. The estimates of both βg and δg also need to be debiased before using them in (3.9), in order to overcome the regularization biases on them owing to estimating high dimensional parameters by ML methods and to remove the overfitting bias on them resulting from using all observations while estimating the high dimensional parameters and, afterwards, the parameters of interest. Orthogonalization (done either by partialling out the effect of the high dimensional variables from the variables of interest to obtain orthogonalized variables of interest, or by doubly selecting high dimensional control variables that are useful for predicting the variables of interest and the outcome variable, or by both partialling out and double selection) is used to get rid of the regularization bias, and sample splitting is used to overcome the overfitting bias. Since x is low dimensional, the construction of the Neyman-orthogonal scores associated with x and sample splitting can be done; see, e.g., Example 2.1 in Chernozhukov et al. (2018). However, when it comes to estimating δg in a debiased way, there is a caveat: it is high dimensional. And to the best of my knowledge, there has been no research that provides results regarding consistent estimation of the high dimensional parameters δg. Reasonably, when consistent estimation of δg is not possible, one cannot make inference about these high dimensional parameters either. On the other hand, inference results on low dimensional parameters do exist; see, e.g., Belloni, Chernozhukov, and Hansen (2014a, Theorem 2) and Chernozhukov et al. (2018, Theorem 3.1 and Corollary 3.1).

After all this discussion, there are two cases in which ATEs can still be consistently estimated. First, when E(h) = 0 (or when both E(x) = 0 and E(h) = 0, which is used in my simulations), AT Eg,0 simplifies to AT Eg,0 = (αg − α0) + E(x)(βg − β0) (or to AT Eg,0 = (αg − α0) when both E(x) = 0 and E(h) = 0), with one of its consistent estimates given by ÂTEg,0 = (α̂g − α̂0) + x̄(β̂g − β̂0) (or ÂTEg,0 = (α̂g − α̂0) when both E(x) = 0 and E(h) = 0) for g = 1, 2, …, G. Second, when δg is constant across all treatment statuses (e.g., δg = δ for g = 0, 1, …, G), then again AT Eg,0 = (αg − α0) + E(x)(βg − β0), and the ÂTEg,0 defined above can be used for consistent ATE estimates. Note that consistent estimation of ATEs is possible in these two cases because the ATEs do not depend on the high dimensional parameters δg.

3.4 Simulations

In this section, I report Monte Carlo simulation results that compare and contrast mainly the finite sample performances of the estimation methods from section 3.3 (i.e., LASSO, post partial-out LASSO, post double selection LASSO, and double/debiased ML LASSO) and the CF method from Chapter 1. With respect to the measures of performance comparison, I make use of the bias of ATE estimates, the standard deviation of ATE estimates, mean absolute prediction error (MAPE), root mean square error (RMSE), the mean number of correctly selected variables (CSVs), and the mean size of the selected set of variables (SVs). My simulation results come from the model introduced in section 3.2. However, I change the data generating process slightly for this model as I change the correlation structure among the high dimensional variables in h and the sparsity condition (i.e., the number of variables associated with nonzero parameters in h). For the sake of computational simplicity, I adopt a scheme in which there is only one instrument in the latent variable equation (i.e., z = z), there is only one exogenous low dimensional (and distinct from z) variable in the counterfactual outcome equation (i.e., x = x), and the treatment variable w takes on only three values, as always. In all simulations, the treatment is evenly spread among the treatment statuses (i.e., each treatment status has at least about 30 percent of the sample size).

3.4.1 Data Generating Process

In my simulation analysis, I used four different data generating processes (DGPs): one for the model in section 3.2 with correlated variables in h = (h1, h2, …, h695) and moderate sparsity (i.e., lh1 = 4), one for the model with correlated variables in h and high sparsity (i.e., lh1 = 1), one for the model with uncorrelated variables in h and moderate sparsity, and one for the model with uncorrelated variables in h and high sparsity.
The setup for the DGP of the model in section 3.3 with correlated variables in h and moderate sparsity (i.e., the relevant sub-vector $h_1$ has 4 variables: $h_1$, $h_2$, $h_3$, and $h_4$) is as follows:
$$w \in \{0, 1, 2\}, \quad d_g = 1[w = g], \; g \in \{0, 1, 2\},$$
$$a_g \sim \text{Gumbel}(0, 1), \; g \in \{0, 1, 2\},$$
$$\gamma_0 = 1, \; \gamma_1 = 5, \; \gamma_2 = 9,$$
$$l_0 = 1, \; l_1 = 4.5, \; l_2 = 1.5,$$
$$z = z \sim N(0, 2),$$
$$w^*_g = l_g + \gamma_g z + a_g, \; g \in \{0, 1, 2\},$$
$$w = g \iff w^*_g \ge w^*_j, \; \forall j \ne g, \; g, j \in \{0, 1, 2\},$$
$$e_g \sim N(0, 1), \; g \in \{0, 1, 2\},$$
$$\eta_{0,0} = 0.05, \; \eta_{0,1} = 0.10, \; \eta_{0,2} = 0.15,$$
$$\eta_{1,0} = 4.05, \; \eta_{1,1} = 4.10, \; \eta_{1,2} = 4.15,$$
$$\eta_{2,0} = 8.05, \; \eta_{2,1} = 8.10, \; \eta_{2,2} = 8.15,$$
$$u_g = \sum_{j=0}^{2}\eta_{g,j}a_j - \sum_{j=0}^{2}\eta_{g,j}E(a_j) + e_g, \; g \in \{0, 1, 2\},$$
$$x = x \sim N(0, 1),$$
$$h \sim N(0, \Sigma) \text{ with elements } \Sigma_{r,c} = (0.5)^{|r-c|}, \; r, c \in \{1, 2, \ldots, 695\},$$
$$\alpha_0 = 1, \; \alpha_1 = 2, \; \alpha_2 = 3,$$
$$\beta_0 = 6, \; \beta_1 = 7, \; \beta_2 = 8,$$
$$\delta_{0,1} = 1, \; \delta_{1,1} = 2, \; \delta_{2,1} = 3,$$
$$\delta_{0,2} = 2, \; \delta_{1,2} = 3, \; \delta_{2,2} = 4,$$
$$\delta_{0,3} = 3, \; \delta_{1,3} = 4, \; \delta_{2,3} = 5,$$
$$\delta_{0,4} = 4, \; \delta_{1,4} = 5, \; \delta_{2,4} = 6,$$
$$y_g = \alpha_g + x\beta_g + h_1\delta_{g,1} + h_2\delta_{g,2} + h_3\delta_{g,3} + h_4\delta_{g,4} + u_g, \; g \in \{0, 1, 2\},$$
$$y = d_0 y_0 + d_1 y_1 + d_2 y_2.$$

For the model with correlated variables in h and high sparsity, I reduce the number of variables in h associated with nonzero parameters from four to one (i.e., only $h_1$ has a nonzero coefficient). This reduction in the dimension of the relevant sub-vector sets some of the parameters above equal to zero and removes some of the variables above from $y_g$. Specifically, $\delta_{g,l}$ is zero for $l = 2, 3, 4$ and $g = 0, 1, 2$, and $h_2$, $h_3$, and $h_4$ drop out of $y_g$. In other words, the DGP of the model with correlated variables in h and high sparsity is almost exactly the same as the DGP above, with the following modifications:
$$\delta_{0,2} = 0, \; \delta_{1,2} = 0, \; \delta_{2,2} = 0,$$
$$\delta_{0,3} = 0, \; \delta_{1,3} = 0, \; \delta_{2,3} = 0,$$
$$\delta_{0,4} = 0, \; \delta_{1,4} = 0, \; \delta_{2,4} = 0,$$
$$y_g = \alpha_g + x\beta_g + h_1\delta_{g,1} + u_g, \; g \in \{0, 1, 2\}.$$

For the model with uncorrelated variables in h and moderate sparsity, the DGP setup is almost exactly the same as the one with correlated variables in h and moderate sparsity. The only difference is that the variance-covariance matrix $\Sigma$ now becomes a 695$\times$695 identity matrix, which is also the only difference between the DGP of the model with uncorrelated variables in h and high sparsity and that of the model with correlated variables in h and high sparsity. (A minimal script that simulates this DGP is sketched below.)
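The following sketch simulates the moderate-sparsity, correlated-h DGP above. It is a reading of the setup, not the dissertation's own code: I interpret $N(0,2)$ for z as variance 2, and I center the Gumbel shocks at the Euler-Mascheroni constant, their $\text{Gumbel}(0,1)$ mean.

```python
# Minimal sketch of the moderate-sparsity, correlated-h DGP
# (G + 1 = 3 treatments, 695 candidate controls, 4 with nonzero effects).
import numpy as np

rng = np.random.default_rng(0)
N, p_h = 1000, 695
gamma = np.array([1.0, 5.0, 9.0])          # instrument coefficients gamma_g
l = np.array([1.0, 4.5, 1.5])              # latent intercepts l_g
eta = np.array([[0.05, 0.10, 0.15],
                [4.05, 4.10, 4.15],
                [8.05, 8.10, 8.15]])       # eta[g, j]
alpha = np.array([1.0, 2.0, 3.0])
beta = np.array([6.0, 7.0, 8.0])
delta = np.array([[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]], dtype=float)

z = rng.normal(0.0, np.sqrt(2.0), N)       # N(0, 2) read as variance 2
a = rng.gumbel(0.0, 1.0, (N, 3))           # Gumbel(0, 1) taste shocks a_g
w = np.argmax(l + z[:, None] * gamma + a, axis=1)  # latent-utility maximizer
d = np.eye(3)[w]                           # treatment dummies d_g = 1[w = g]

euler = 0.5772156649015329                 # E(a_j) for Gumbel(0, 1)
u = (a - euler) @ eta.T + rng.normal(0.0, 1.0, (N, 3))  # u[:, g]

x = rng.normal(0.0, 1.0, N)
idx = np.arange(p_h)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])      # Sigma_{r,c}
h = rng.multivariate_normal(np.zeros(p_h), Sigma, N)

y_pot = alpha + np.outer(x, beta) + h[:, :4] @ delta.T + u  # y_pot[:, g]
y = y_pot[np.arange(N), w]                 # observed outcome
```

Setting `delta[:, 1:] = 0` reproduces the high-sparsity variant, and replacing `Sigma` with the identity matrix reproduces the two uncorrelated-h variants.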
3.4.2 Simulation Results

I report my simulation results in Tables C.1 through C.16 in appendix C. My objective in these simulation examples is to compare the finite sample performances of several estimation methods used in linear high dimensional sparse settings and of the CF method from Chapter 1 (as a benchmark) in the presence of a discrete multivalued endogenous treatment and heterogeneous counterfactual errors. My expectation is that, from an efficiency perspective (hopefully biaswise and predictionwise too), it is worth running at least one of the LASSO-based methods with the high dimensional variables in h even though, due to the high dimensional setting, these methods are more involved and more time-consuming than my simpler estimator from Chapter 1. This expectation is rooted in the facts that some of the variables in h are predictive of the outcome and that h is irrelevant to the process of treatment choice. I present my simulation results in two parts: first, bias and efficiency outcomes and, second, prediction and model selection outcomes.

In Tables C.1 through C.16, I report Monte Carlo estimates for $ATE_{h,0}$, the bias in the Monte Carlo estimates of the ATEs, Monte Carlo standard deviations (SDs) for $ATE_{h,0}$, the mean number of SVs in estimation, the mean number of CSVs in estimation, MAPE, and RMSE for each of the estimation methods described earlier (specifically, LASSO; post partial-out LASSO, PO; post double selection LASSO, DS; double/debiased ML LASSO, XPO; and the CF method from Chapter 1) for h = 1, 2. In the simulations, I use the different small sample sizes n = 1000, n = 1250, n = 1500, and n = 2000 for each DGP setup. For each Monte Carlo experiment, the number of iterations is always 1000, owing to the time-costliness of estimating high dimensional models. I also use some trimming to remove outliers from my simulation analysis.

Since the dimension $l_h$ of the variables in h that can potentially have an effect on the outcome (though the researcher does not know for sure which of these potential variables really have a nonzero effect) is always 695, p (the number of variables included in the second stage estimating equations, except for the CF method from Chapter 1, which always has 15 variables in estimation) is always 2100. On the other hand, $p_0$ (the number of variables that would be included in the second stage estimating equations of the LASSO-based inference methods of section 3.3 if the researcher knew exactly which variables have a nonzero effect on the outcome) changes depending on $l_{h_1}$, the number of variables in h with a nonzero effect on the outcome: when $l_{h_1} = 4$ (1) in the model, $p_0 = 27$ (18). Since I rely on a CF approach to deal with the endogeneity in the model across all methods, I also know that there are 15 variables (all the binary treatment indicators, x interacted with these binary treatment indicators, and the CF terms: exactly the variables that the CF method from Chapter 1 uses) that must be included in the second stage estimating equations. For this reason, the number of variables forced into each second stage estimation (denoted by f) is 15, which in practice leaves the methods to correctly select only 12 (3) variables when $l_{h_1} = 4$ (1), although there always exist 2085 variables to really select from. Therefore, when calculating the total number of correctly selected variables reported in Tables C.1-C.16, I consider only these 12 (3) variables when $l_{h_1} = 4$ (1), whereas, when calculating the total number of selected variables in estimation, I consider all the potential variables including f, 2100 in total.

As for the notation in Tables C.1-C.16, $\widehat{ate}_{h,0}$ is the estimate of $ATE_{h,0}$, and $bias(\widehat{ate}_{h,0})$ is the bias in the estimate of $ATE_{h,0}$ for h = 1, 2. Moreover, # of SV and # of CSV stand for the number of selected variables (inclusive of f, although those are not really selected) in the second stage and the number of correctly selected variables (not inclusive of f) in the second stage, respectively. MAPE and RMSE denote the mean absolute prediction error in the second stage and the root mean square error in the second stage, respectively. As always, since these tables would require a considerable amount of space in the main body of the chapter, I place all simulation tables of this chapter in appendix C. At this point, it is also important to remember the true values of $ATE_{h,0}$ for h = 1, 2, since I refer to them often throughout this section: $ATE_{1,0} = 1$ and $ATE_{2,0} = 2$.
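For concreteness, a small helper computing these performance measures from Monte Carlo output is sketched below. The exact definitions of MAPE and RMSE as in-sample second stage prediction errors are my assumptions here, not taken from the text, and the function names are mine.

```python
# Minimal sketch of the reported performance measures, under my assumed
# definitions of MAPE and RMSE as second stage prediction errors.
import numpy as np

def mc_summary(ate_draws, ate_true):
    """ate_draws: (R,) Monte Carlo ATE estimates; ate_true: scalar."""
    est = ate_draws.mean()
    return {"ate_hat": est,
            "bias": est - ate_true,
            "sd": ate_draws.std(ddof=1)}

def prediction_errors(y, y_pred):
    """Second stage prediction errors for one Monte Carlo replication."""
    resid = y - y_pred
    return {"mape": np.abs(resid).mean(),
            "rmse": np.sqrt((resid ** 2).mean())}
```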
Besides this, the set of variables to be correctly selected when $l_{h_1} = 1$ is $oracle_1 = (d_g h_1)$ for $g = 0, 1, 2$; when $l_{h_1} = 4$, this list grows to $oracle_4 = (d_g h_m)$ for $g = 0, 1, 2$ and $m = 1, 2, 3, 4$.

3.4.2.1 Bias and Efficiency Outcomes

First, let's consider the results in the two benchmark cases: the one with correlated variables in h and $l_{h_1}$ equal to 4, and the sparser one with exactly the same configuration but $l_{h_1}$ equal to 1. Starting with the first benchmark case (correlated variables in h and $l_{h_1} = 4$) in Table C.1, the Monte Carlo simulation results show only small estimation biases for $ATE_{h,0}$ across all estimation methods for h = 1, 2, and no estimation method has a specific advantage over the others in terms of bias. This observation is in line with my expectation, since h is totally ignorable in the process of treatment choice and the CF terms take care of the endogeneity in all methods. For example, in Table C.1, the simulation estimates from the XPO method for $ATE_{1,0}$ and $ATE_{2,0}$ are respectively 1.0574 (only about 5.7% higher than the true value) and 1.9784 (only around 1.1% lower than the true value). Similarly, the same simulation estimates from the benchmark CF method are respectively .9958 (a mere .4% lower than the true value) and 1.9666 (only about 1.7% lower than the true value). Continuing with the first benchmark case in Table C.1, the simulation results indicate that PO is the most efficient method, with the lowest Monte Carlo standard deviations (SDs), and that the CF and XPO methods are the least efficient. $ATE_{2,0}$ estimates from the LASSO-based inference methods (except XPO) are statistically significant, whereas $ATE_{1,0}$ estimates across all estimation methods are not. For instance, in Table C.1, the DS parameter estimate for $ATE_{1,0}$ ($ATE_{2,0}$), which is 1.0467 (1.9936), has an SD of .8215 (1.0332). Again in line with my expectation, the simulation findings indicate that the CF method produces less precise ATE estimates than do several LASSO-based inference methods that select from h. To give an example, the SD of the PO (LASSO) parameter estimate for $ATE_{1,0}$ is about 34% (29%) lower than that of the corresponding CF parameter estimate. As the sample size increases from 1000 in Table C.1 to 2000 in Table C.4, SDs go down across all methods; however, there is no particular improvement in the parameter biases, which remain small. Even the $ATE_{1,0}$ estimates (except those coming from the CF method) become statistically significant when N = 2000. The other patterns observed in Table C.1 are also seen in Tables C.2 through C.4, and all the patterns in Tables C.1 through C.4 outlined in this paragraph constitute the patterns of the first benchmark case.

Second, as I increase the sparsity by moving from $l_{h_1} = 4$ to $l_{h_1} = 1$ in Tables C.5 through C.8, I step into the second benchmark behaviors of the methods. In terms of parameter biases, the simulation findings do not change: all the estimation methods still have small finite sample biases in the ATE estimates. As for efficiency, the SDs of the ATE estimates from the CF method go down compared to the first benchmark results, and all estimation methods (except XPO) start producing results at more or less the same statistical significance levels for each ATE estimate. In a way, the SDs of the ATE estimates converge to each other.
I expect this convergence in SDs; it is mainly due to running regressions with fewer variables in h that are really predictive of the outcome, which implies that one should not expect much efficiency gain from utilizing LASSO-based inference methods over the CF method when there is only one extra predictive variable in h. To give an example, the SD of the CF parameter estimate for $ATE_{1,0}$ goes down from 1.2182 in Table C.1 to .8194 in Table C.5, which is not much different from the SD of the LASSO (PO) parameter estimate for $ATE_{1,0}$, .8211 (.8169). On top of these observations, the patterns from the first benchmark case are also traced when $l_{h_1} = 1$: PO is still the most efficient method (though only slightly, owing to the convergence in SDs), the CF and XPO methods are still the least efficient (with only trifling margins this time), all $ATE_{2,0}$ estimates (except those from XPO) are statistically significant, $ATE_{1,0}$ estimates become statistically significant only when N = 2000, and SDs decrease across all methods as N gets bigger, with no change in the ATE biases, which remain small. All these observations in Tables C.5 through C.8 form the patterns of the second benchmark case.

When I alter the correlation structure of the variables in h from high correlation to no correlation, the simulation results of course reflect this change, as seen in Tables C.9 through C.16. First, let's sum up the findings in Tables C.9 through C.12, where the variables in h are uncorrelated and $l_{h_1} = 4$. As before, the simulations indicate that all the estimation methods have small biases in the ATE estimates when N is small. As for efficiency, the simulation results show that, compared to the first benchmark case, there is a very slight increase in the SDs of the ATE estimates from the LASSO-based methods and a decrease in the SDs of the ATE estimates from the CF method. These changes in the SDs run counter to the fact that the degree of multicollinearity among uncorrelated covariates in a regression is lower than that among correlated covariates, and that less multicollinearity among the regression covariates means more precision in the parameter estimates. Beyond this, the patterns of the first benchmark case still apply in this case with uncorrelated variables in h and $l_{h_1} = 4$. This is especially true when it comes to the efficiency gains of the LASSO-based methods over the CF method. For instance, the SD of the DS parameter estimate $\widehat{ate}_{1,0}$ ($\widehat{ate}_{2,0}$) is about 23% (10%) lower than that of the CF estimate. When Tables C.13 through C.16 are considered, with uncorrelated variables in h and $l_{h_1} = 1$, the simulation findings again reveal small ATE biases resulting from the small sample sizes. With regard to efficiency, compared to the case with uncorrelated variables in h and $l_{h_1} = 4$, there is a drop in the SDs across all estimation methods, which was not as apparent in the comparison of the models with correlated variables in h but varying $l_{h_1}$. My explanation is that the number of variables included in the regressions of Tables C.13 through C.16 is smaller than that of Tables C.9 through C.12 (as can be seen from the decreased number of SVs), which results in fewer parameters to be estimated and smaller SDs of the ATE estimates. For example, the SD of the LASSO (XPO) parameter estimate $\widehat{ate}_{1,0}$ falls by about 12% (6%), from .9350 (.9276) in Table C.9 to .8221 (.8753) in Table C.13.
Compared to the second benchmark case, the simulations in Tables C.13 through C.16 provide more or less similar results: the patterns observed in the second benchmark case are seen in Tables C.13 through C.16 as well. In particular, the SDs of the ATE estimates are close to each other, so the efficiency gains from the LASSO-based models are rather limited.

3.4.2.2 Prediction and Model Selection Outcomes

As in the bias and efficiency outcomes, the two benchmark cases are the same: the one with correlated variables in h and $l_{h_1}$ equal to 4 (the first benchmark) and the sparser one with exactly the same configuration but $l_{h_1}$ equal to 1 (the second benchmark). Let's start by interpreting the simulation results from the first benchmark. In Table C.1, the simulation outcomes suggest that XPO, followed by DS and PO together, has the largest numbers of both SVs and CSVs, and that LASSO selects the fewest variables. Note that since I force all the variables to be included in the CF method, there is actually nothing to select there; for this reason, the numbers of SVs and CSVs are always left blank for the CF method. With regard to prediction, LASSO has the lowest prediction errors, and the CF method comes second after LASSO. XPO, designed for inference, is the worst predictor of the outcome variable according to the simulations. For example, in Table C.1, the number of SVs from XPO (LASSO) is around 26.4 (25.9), and the number of CSVs from XPO (DS and PO) is about 11.3 (10.9). In addition, still in Table C.1, the MAPE and RMSE of LASSO (the CF method) are about 7.89 (11.53) and 12.41 (15.74), but the same figures from XPO are more or less 13.38 and 17.84. One of the most striking features of the simulation results is that the DS and PO methods are almost the same: exactly equal numbers of SVs and CSVs, and almost equal MAPE and RMSE values. As the second stage sample size increases from 1000 in Table C.1 to 2000 in Table C.4, similar prediction and model selection patterns continue, just with more SVs and CSVs. The LASSO-based methods predict better with larger sample sizes (lower MAPE and RMSE values); on the contrary, the prediction figures of the CF method slightly worsen (higher MAPE and RMSE values). All these patterns in Tables C.1 through C.4 constitute the patterns of the first benchmark case in terms of prediction and model selection.

Based on the simulation outcomes in Tables C.5 through C.8, it can be claimed that increased sparsity ($l_{h_1}$ equal to 1 rather than 4) results in smaller numbers of SVs and CSVs and that, compared to the first benchmark case, prediction improves in all methods. For instance, the total numbers of SVs and CSVs (the MAPE and RMSE values) from PO in Table C.4 are about 26.78 and 11.77 (13.29 and 17.74); in contrast, those from the PO method in Table C.8 are only around 16.88 and 1.88 (10.39 and 14.36). The other prediction and model selection patterns from the first benchmark case are also seen in Tables C.5 through C.8, which collectively form the second benchmark behaviors of the estimators with reference to prediction and model selection.

When the variables in h are uncorrelated, as in Tables C.9 through C.12, rather than correlated, the most conspicuous observations from the simulations are that, compared to the first benchmark case, fewer variables are selected (and correctly selected) by all methods and that prediction gets better across all models except LASSO.
To give an example, the total numbers of SVs and CSVs (the MAPE and RMSE values) from DS in Table C.3 are about 26.49 and 11.49 (13.29 and 17.74); in contrast, those from DS in Table C.11 are only around 24.39 and 9.39 (11.92 and 16.11). My explanation for the smaller number of variables selected is that the methods tend not to choose variables which could have been selected merely because of strong correlation with other variables but which in reality have zero effect on the outcome of interest. As far as the better predictions in Tables C.9 through C.12 compared to the first benchmark case are concerned, I think this is a natural consequence of estimating sparser models. The other prediction and model selection patterns resemble those from the first benchmark case. In Tables C.13 through C.16, with $l_{h_1}$ set equal to 1 in addition to having uncorrelated variables in h, compared to the previous case with uncorrelated variables in h and $l_{h_1}$ equal to 4, the simulation outcomes show even smaller numbers of SVs and CSVs across all methods and better prediction across all methods. However, compared to the second benchmark case, the simulations indicate that both the model selection and the prediction figures are extremely close to each other, which indirectly implies that how sparse the model is might be more influential than the correlation structure of the variables in h as far as model selection and prediction are concerned. The other prediction and model selection patterns are very much like the ones from the first benchmark case.

3.5 Conclusion

In this chapter, I build on my work from Chapter 1 and take the econometric model with a discrete multivalued endogenous treatment variable and heterogeneous counterfactual errors to a linear high dimensional sparse setting, where the number of parameters to be estimated far exceeds the available sample size but the number of variables with nonzero effect on the outcome is smaller than the sample size. Using the CF approach adopted from Chapter 1 to handle the problem of endogeneity in the model, I summarize four LASSO-based methods, coupled with detailed procedures, to estimate partial effects and ATEs. Three of the LASSO-based methods are XPO of Chernozhukov et al. (2018); DS of Belloni, Chernozhukov, and Hansen (2014a); and PO of Belloni et al. (2012), all of which were developed to provide better inference results. The other LASSO-based method is simply LASSO itself, which is designed to predict better. To estimate the ATEs, I also use the CF method from Chapter 1; however, this method mainly functions as a benchmark in my Monte Carlo simulation analysis.

Using this detailed simulation analysis (the first simulation-based comparative analysis of the LASSO methods mentioned above in my setting), I compare and contrast the finite sample properties, model selection features, and prediction capabilities of the LASSO-based models for discrete multivalued endogenous treatments in a linear scalar outcome high dimensional sparse model with heterogeneous counterfactual errors. In my simulation analysis, I specifically make use of measures such as the bias of estimates, the standard deviation of estimates, MAPE, RMSE, the mean number of CSVs, and the mean number of SVs.
Overall, the simulation evidence suggests that none of the methods suffers from large biases in small samples and that all methods can reliably estimate the ATEs in the presence of high dimensional potential variables and under the threat of endogeneity when the finite sample size is 2000. Increased sample sizes, how sparse the model truly is, and less correlation among the potential high dimensional variables all seem to have an impact on efficiency but not on finite sample bias. The most important simulation result is that, in the presence of enough extra predictive variables that are ignorable in treatment selection and come from a set of high dimensional predictors of the outcome, the more complicated LASSO-based methods deliver efficiency gains in ATE estimates (more obvious in models with moderate sparsity) over the simpler CF method, although both the LASSO-based methods and the CF method perform more or less the same as far as finite sample bias is concerned. Among the LASSO-based methods, the simulations indicate that PO is often the most efficient method to use. As far as model selection goes, the simulations show that XPO, followed by both DS and PO, both selects the largest number of potential variables to be used in estimation and correctly selects the largest number of variables with true nonzero impact on the outcome. As to prediction, the simulation results suggest that LASSO, followed by CF, has the best prediction features, with the lowest MAPE and RMSE numbers among all the methods compared, and that XPO has the least favorable prediction capabilities. An increase in sample size results in more variables being included in estimation and being correctly selected in all methods; on the contrary, increases in sparsity and in the correlatedness of the potential variables cause just the opposite in all methods. Bigger sample sizes and high sparsity also lead to better prediction, especially in the LASSO-based methods. Moreover, the strength of correlation among the potential variables is not as influential as sample size and model sparsity in model selection and prediction. Lastly, the simulations reveal the convergence of DS and PO in terms of their model selection and prediction results.

Concerning further research ideas that can flower out of this chapter, all of the research ideas mentioned in the conclusion of Chapter 1 can of course be explored in the current high dimensional sparse setting, too. Apart from these, it would also be very interesting and exciting to theoretically examine the asymptotic properties of the XPO, DS, and PO methods in high dimensional sparse models that use generated regressors, and the impact of these generated regressors on the asymptotic variance-covariance matrices of these estimators.

APPENDICES

APPENDIX A

APPENDIX FOR CHAPTER 1

A.1 Chapter 1: Derivations in CF Method

To find $E(y|d, x, z)$, note that I can write
$$\begin{aligned} E(y|d, x, z) &= E(d_0 y_0 + d_1 y_1 + \cdots + d_G y_G|d, x, z) \\ &= d_0 E(y_0|d, x, z) + d_1 E(y_1|d, x, z) + \cdots + d_G E(y_G|d, x, z) \\ &= d_0 E(\alpha_0 + x\beta_0 + u_0|d, x, z) + d_1 E(\alpha_1 + x\beta_1 + u_1|d, x, z) + \cdots + d_G E(\alpha_G + x\beta_G + u_G|d, x, z) \\ &= \sum_{j=0}^{G} d_j\alpha_j + \sum_{k=0}^{G} d_k x\beta_k + \sum_{g=0}^{G} d_g E(u_g|d, x, z). \end{aligned} \tag{A.1}$$
Next, I need to derive $E(u_g|d, x, z)$. Under A.1.3 and the law of iterated expectations,
$$\begin{aligned} E(u_g|d, x, z) &= E(E(u_g|d, x, z, a)|d, x, z) = E(E(u_g|x, z, a)|d, x, z) \\ &= E\Big[\sum_{j=0}^{G}\eta_{g,j}(a_j - E(a_j))\Big|d, x, z\Big] = \sum_{j=0}^{G}\eta_{g,j}E[(a_j - E(a_j))|d, x, z]. \end{aligned} \tag{A.2}$$
In the equations above, I use the fact that d is completely determined by z and a together (refer to section 1.2, where the main model is described, to see this); hence the expectation conditional on (d, x, z, a) reduces to one conditional on (x, z, a) only.

Then, using A.1.1, the mutual exclusivity of the binary treatment indicators, and the fact that, for $g = 0, 1, \ldots, G$,
$$E(u_g|d, x, z) = d_0 E(u_g|d_0 = 1, x, z) + d_1 E(u_g|d_1 = 1, x, z) + \cdots + d_G E(u_g|d_G = 1, x, z), \tag{A.3}$$
I can write $E(y|d, x, z)$ as follows:
$$E(y|d, x, z) = \sum_{j=0}^{G} d_j\alpha_j + \sum_{k=0}^{G} d_k x\beta_k + \sum_{g=0}^{G}\sum_{j=0}^{G} d_g\eta_{g,j}E[(a_j - E(a_j))|d_g = 1, x, z]. \tag{A.4}$$
To complete the derivation of the conditional expectation $E(y|d, x, z)$, all I need is a closed form expression for $\sum_{g=0}^{G}\sum_{j=0}^{G} d_g\eta_{g,j}E[(a_j - E(a_j))|d_g = 1, x, z]$. To do so, I utilize the work of Dubin and McFadden (1984). In their paper, they used the following result:
$$E(a_j - E(a_j)|d_g = 1, x, z) = \begin{cases} -\log(\Lambda_j), & j = g, \\[4pt] \dfrac{\Lambda_j\log(\Lambda_j)}{1 - \Lambda_j}, & j \ne g, \end{cases} \tag{A.5}$$
where $\Lambda_g = \exp(z\gamma_g)/\sum_{r=0}^{G}\exp(z\gamma_r)$, i.e., the MNL response probability, for $g, j = 0, 1, \ldots, G$.

Using the above result, expanding the double sum term by term, and collecting the coefficient of each $M_h$, I have
$$\begin{aligned} \sum_{g=0}^{G}\sum_{j=0}^{G} d_g\eta_{g,j}E[(a_j - E(a_j))|d_g = 1, x, z] &= \sum_{g=0}^{G} d_g\Big(\sum_{j=0}^{G}\eta_{g,j}E(a_j - E(a_j)|d_g = 1, x, z)\Big) \\ &= \sum_{g=0}^{G} d_g\Big(-\eta_{g,g}\log(\Lambda_g) + \sum_{h\ne g}\eta_{g,h}\frac{\Lambda_h\log(\Lambda_h)}{1 - \Lambda_h}\Big) \\ &= \sum_{g=0}^{G} d_g\Big(-\eta_{g,g}\log(\Lambda_g) + \sum_{h\ne g}\eta_{g,h}M_h\Big) \\ &= \sum_{g=0}^{G}\big[-\eta_{g,g}d_g\log(\Lambda_g)\big] + \sum_{g\ne 0} d_g\eta_{g,0}M_0 + \sum_{g\ne 1} d_g\eta_{g,1}M_1 + \cdots + \sum_{g\ne G} d_g\eta_{g,G}M_G, \end{aligned} \tag{A.6}$$
where $M_g = \Lambda_g\log(\Lambda_g)/(1 - \Lambda_g)$ for $g = 0, 1, \ldots, G$. Finally, by combining (A.4) and (A.6), I can write the expectation of the observed outcome y conditional on the observed variables (d, x, z) as follows:
$$E(y|d, x, z) = \sum_{j=0}^{G} d_j\alpha_j + \sum_{k=0}^{G} d_k x\beta_k + \sum_{g=0}^{G}\big[-\eta_{g,g}d_g\log(\Lambda_g)\big] + \sum_{g\ne 0} d_g\eta_{g,0}M_0 + \sum_{g\ne 1} d_g\eta_{g,1}M_1 + \cdots + \sum_{g\ne G} d_g\eta_{g,G}M_G, \tag{A.7}$$
where $\Lambda_g$ and $M_g$ are as in (A.5) and (A.6) for $g = 0, 1, \ldots, G$.
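As an illustration (my own specialization, not part of the original derivation), setting G = 1 in (A.7) gives the familiar binary-treatment case:
$$E(y|d_0, d_1, x, z) = d_0\alpha_0 + d_1\alpha_1 + d_0 x\beta_0 + d_1 x\beta_1 - \eta_{0,0}d_0\log(\Lambda_0) - \eta_{1,1}d_1\log(\Lambda_1) + d_1\eta_{1,0}M_0 + d_0\eta_{0,1}M_1,$$
so the correction for a treated unit ($d_1 = 1$) is $-\eta_{1,1}\log(\Lambda_1) + \eta_{1,0}M_0$, a two-term selection correction built from the logit probabilities.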
A.2 Chapter 1: Derivations in CF Method - A Special Case

Upon the additional assumption that $\eta_{g,j} = \eta_j$ and using the results from the initial model, I can write the conditional expectation $E(y|d, x, z)$ as follows:
$$\begin{aligned} E(y|d, x, z) &= \sum_{l=0}^{G} d_l\alpha_l + \sum_{k=0}^{G} d_k x\beta_k + \sum_{g=0}^{G}\sum_{j=0}^{G} d_g\eta_{g,j}E[(a_j - E(a_j))|d_g = 1, x, z] \\ &= \sum_{l=0}^{G} d_l\alpha_l + \sum_{k=0}^{G} d_k x\beta_k + \sum_{g=0}^{G} d_g\Big(\sum_{j=0}^{G}\eta_j E(a_j - E(a_j)|d_g = 1, x, z)\Big) \\ &= \sum_{l=0}^{G} d_l\alpha_l + \sum_{k=0}^{G} d_k x\beta_k + \sum_{g=0}^{G} d_g\Big(-\eta_g\log(\Lambda_g) + \sum_{h\ne g}\eta_h\frac{\Lambda_h\log(\Lambda_h)}{1 - \Lambda_h}\Big) \\ &= \sum_{l=0}^{G} d_l\alpha_l + \sum_{k=0}^{G} d_k x\beta_k + \sum_{g=0}^{G} d_g\Big(-\eta_g\log(\Lambda_g) + \sum_{h\ne g}\eta_h M_h\Big), \end{aligned} \tag{A.8}$$
where $\Lambda_g$ and $M_g$ are as in (A.5) and (A.6) for $g = 0, 1, \ldots, G$.

Note that I can further simplify the last sum of terms in (A.8). Expanding the double sum, collecting the coefficient of each $M_g$, and using $\sum_{g\ne h}d_g = 1 - d_h$,
$$\begin{aligned} \sum_{g=0}^{G} d_g\Big[-\eta_g\log(\Lambda_g) + \sum_{h\ne g}\eta_h M_h\Big] &= \sum_{g=0}^{G}\big[-\eta_g d_g\log(\Lambda_g)\big] + \sum_{g\ne 0} d_g\eta_0 M_0 + \sum_{g\ne 1} d_g\eta_1 M_1 + \cdots + \sum_{g\ne G} d_g\eta_G M_G \\ &= \sum_{g=0}^{G}\big[-\eta_g d_g\log(\Lambda_g)\big] + (1 - d_0)\eta_0 M_0 + (1 - d_1)\eta_1 M_1 + \cdots + (1 - d_G)\eta_G M_G \\ &= \sum_{g=0}^{G}\big[(1 - d_g)\eta_g M_g - \eta_g d_g\log(\Lambda_g)\big] = \sum_{g=0}^{G}\eta_g\big[(1 - d_g)M_g - d_g\log(\Lambda_g)\big]. \end{aligned} \tag{A.9}$$
Thus, by combining (A.8) and (A.9), I can write the expectation of the observed outcome y conditional on the observed variables (d, x, z) as follows:
$$E(y|d, x, z) = \sum_{l=0}^{G} d_l\alpha_l + \sum_{k=0}^{G} d_k x\beta_k + \sum_{g=0}^{G}\eta_g\big[(1 - d_g)M_g - d_g\log(\Lambda_g)\big], \tag{A.10}$$
where $\Lambda_g$ and $M_g$ are as in (A.5) and (A.6) for $g = 0, 1, \ldots, G$. (A short numerical sketch of how the control function terms in (A.10) can be computed is given below.)
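The sketch below builds the control function regressors implied by (A.10) from fitted first stage MNL probabilities. The helper name `dm_control_terms` is mine; `Lam` is assumed to come from whatever multinomial logit routine is used in practice.

```python
# Minimal sketch: Dubin-McFadden control function terms for the special
# case (A.10), v_g = (1 - d_g) * M_g - d_g * log(Lambda_g).
import numpy as np

def dm_control_terms(Lam, d):
    """Lam: (N, G+1) estimated MNL probabilities Lambda_g(z_i, gamma_hat);
    d: (N, G+1) treatment dummies d_g. Returns the (N, G+1) CF regressors."""
    M = Lam * np.log(Lam) / (1.0 - Lam)        # M_g = Lam log(Lam)/(1 - Lam)
    return (1.0 - d) * M - d * np.log(Lam)

# Example with made-up probabilities for three treatment states:
Lam = np.array([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]])
d = np.array([[0, 1, 0], [1, 0, 0]])
v = dm_control_terms(Lam, d)   # append these columns to the second stage
```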
A.3 Chapter 1: Derivations in Asymptotic Normality Results of CF Estimates

Note that the score of the first stage multinomial logit log likelihood $l_i(\gamma) = \sum_{j=0}^{G}1[w_i = j]\log\big(\exp(z_i\gamma_j)/\sum_{r=0}^{G}\exp(z_i\gamma_r)\big)$, stacked over $\gamma_0, \gamma_1, \ldots, \gamma_G$, is
$$s^F_i(\gamma) \equiv \begin{pmatrix} \partial l_i(\gamma)/\partial\gamma_0 \\ \partial l_i(\gamma)/\partial\gamma_1 \\ \vdots \\ \partial l_i(\gamma)/\partial\gamma_G \end{pmatrix} = \begin{pmatrix} \big[\sum_{j\ne 0}1[w_i = j](-\Lambda_{0i})z'_i\big] + 1[w_i = 0](z'_i - \Lambda_{0i}z'_i) \\ \big[\sum_{j\ne 1}1[w_i = j](-\Lambda_{1i})z'_i\big] + 1[w_i = 1](z'_i - \Lambda_{1i}z'_i) \\ \vdots \\ \big[\sum_{j\ne G}1[w_i = j](-\Lambda_{Gi})z'_i\big] + 1[w_i = G](z'_i - \Lambda_{Gi}z'_i) \end{pmatrix},$$
where $\Lambda_{ji} = \exp(z_i\gamma_j)/\sum_{r=0}^{G}\exp(z_i\gamma_r)$ for $j = 0, 1, \ldots, G$ and $i = 1, 2, \ldots, N$. From here, since $\sum_{j=0}^{G}1[w_i = j] = 1$, I can further simplify $s^F_i(\gamma)$ as
$$s^F_i(\gamma) = \begin{pmatrix} 1[w_i = 0] - \Lambda_{0i} \\ 1[w_i = 1] - \Lambda_{1i} \\ \vdots \\ 1[w_i = G] - \Lambda_{Gi} \end{pmatrix}\otimes z'_i. \tag{A.11}$$
Then, using the law of iterated expectations and the fact that $w_i$ follows a multinomial logit reduced form under $\gamma = \gamma_o$, I have
$$E[s^F_i(\gamma_o)] = E\big(E[s^F_i(\gamma_o)|z_i]\big) = E\Bigg(\begin{pmatrix} E(1[w_i = 0]|z_i) - \Lambda_{0i} \\ \vdots \\ E(1[w_i = G]|z_i) - \Lambda_{Gi} \end{pmatrix}\otimes z'_i\Bigg) = 0. \tag{A.12}$$
Having shown that $E[s^F_i(\gamma_o)] = 0$, I will now show $A^F_o = B^F_o$. Since $E[s^F_i(\gamma_o)] = 0$, $B^F_o \equiv Var[s^F_i(\gamma_o)] = E[s^F_i(\gamma_o)s^F_i(\gamma_o)']$. Using (A.11) and the definition of the binary treatment indicators $d_g$ for $g = 0, 1, \ldots, G$, I have the symmetric $(G+1)k\times(G+1)k$ matrix
$$s^F_i(\gamma)s^F_i(\gamma)' = \begin{pmatrix} (d_{0i} - \Lambda_{0i})^2 z'_i z_i & (d_{0i} - \Lambda_{0i})(d_{1i} - \Lambda_{1i})z'_i z_i & \cdots & (d_{0i} - \Lambda_{0i})(d_{Gi} - \Lambda_{Gi})z'_i z_i \\ (d_{1i} - \Lambda_{1i})(d_{0i} - \Lambda_{0i})z'_i z_i & (d_{1i} - \Lambda_{1i})^2 z'_i z_i & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ (d_{Gi} - \Lambda_{Gi})(d_{0i} - \Lambda_{0i})z'_i z_i & \cdots & \cdots & (d_{Gi} - \Lambda_{Gi})^2 z'_i z_i \end{pmatrix}, \tag{A.13}$$
where $d_{gi} = 1[w_i = g]$ for $g = 0, 1, \ldots, G$.

Now consider $H^F_i(\gamma) \equiv \nabla_\gamma s^F_i(\gamma)$, which simplifies to
$$\nabla_\gamma s^F_i(\gamma) = \begin{pmatrix} -\Lambda_{0i} + (\Lambda_{0i})^2 & \Lambda_{0i}\Lambda_{1i} & \cdots & \Lambda_{0i}\Lambda_{Gi} \\ \Lambda_{1i}\Lambda_{0i} & -\Lambda_{1i} + (\Lambda_{1i})^2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \Lambda_{Gi}\Lambda_{0i} & \cdots & \cdots & -\Lambda_{Gi} + (\Lambda_{Gi})^2 \end{pmatrix}\otimes z'_i z_i, \tag{A.14}$$
where $\Lambda_{ji} = \exp(z_i\gamma_j)/\sum_{r=0}^{G}\exp(z_i\gamma_r)$ for $j = 0, 1, \ldots, G$ and $i = 1, 2, \ldots, N$. As can be seen, $\nabla_\gamma s^F_i(\gamma)$ is a symmetric $(G+1)k\times(G+1)k$ matrix. Note that, using the law of iterated expectations and the mutual exclusivity of the binary treatment indicators, under $\gamma = \gamma_o$ I can write
$$E[(d_{gi} - \Lambda_{gi})^2 z'_i z_i|z_i] = E[(d_{gi} - 2d_{gi}\Lambda_{gi} + (\Lambda_{gi})^2)z'_i z_i|z_i] = [\Lambda_{gi} - 2(\Lambda_{gi})^2 + (\Lambda_{gi})^2]z'_i z_i = [\Lambda_{gi} - (\Lambda_{gi})^2]z'_i z_i \tag{A.15}$$
for $g = 0, 1, \ldots, G$, and
$$E[(d_{gi} - \Lambda_{gi})(d_{hi} - \Lambda_{hi})z'_i z_i|z_i] = E[(-d_{hi}\Lambda_{gi} - d_{gi}\Lambda_{hi} + \Lambda_{gi}\Lambda_{hi})z'_i z_i|z_i] = -\Lambda_{hi}\Lambda_{gi}z'_i z_i, \tag{A.16}$$
where $d_{gi} = 1[w_i = g]$ and $\Lambda_{ji} = \exp(z_i\gamma_j)/\sum_{r=0}^{G}\exp(z_i\gamma_r)$ for all $h \ne g$ and $i = 1, 2, \ldots, N$. Using the results in (A.15) and (A.16), (A.13) and (A.14) together imply that $-E[H^F_i(\gamma_o)|z_i] = E[s^F_i(\gamma_o)s^F_i(\gamma_o)'|z_i]$, which is the conditional information matrix equality (CIME). By taking the expectation of the CIME with respect to $z_i$ and using the law of iterated expectations, I obtain the unconditional information matrix equality (UIME), i.e., $-E[H^F_i(\gamma_o)] = E[s^F_i(\gamma_o)s^F_i(\gamma_o)']$. Hence, $A^F_o = B^F_o$. (A small numerical check of these equalities is sketched at the end of this subsection.)

Now I will show the influence function representation of the CMLE estimates $\hat\gamma$. Assuming all the assumptions in Th.1.2 hold and using a mean value expansion of the first order condition $\sum_{i=1}^{N}s^F_i(\hat\gamma) = 0$, I can write
$$\sum_{i=1}^{N}s^F_i(\hat\gamma) = \sum_{i=1}^{N}s^F_i(\gamma_o) + \Big(\sum_{i=1}^{N}H^F_i(\ddot\gamma)\Big)(\hat\gamma - \gamma_o), \tag{A.17}$$
where $\ddot\gamma$ is a $(G+1)k\times 1$ vector of parameter values between $\hat\gamma$ and $\gamma_o$ in $\Gamma\subset\mathbb{R}^{(G+1)k}$. (A.17) and $\sum_{i=1}^{N}s^F_i(\hat\gamma) = 0$ together imply that
$$0 = N^{-1/2}\sum_{i=1}^{N}s^F_i(\gamma_o) + \Big(N^{-1}\sum_{i=1}^{N}H^F_i(\ddot\gamma)\Big)\sqrt{N}(\hat\gamma - \gamma_o). \tag{A.18}$$
Now let's state a variant of Lemma 12.1 in Wooldridge (2010), which follows from Newey and McFadden (1994).

• Lemma 1.1 (Lm.1.1): Suppose that $\hat\gamma\xrightarrow{p}\gamma_o$, and assume that $r_i(\gamma)$ satisfies the same assumptions as $l_i(\gamma)$ in Th.1.1. Then
$$N^{-1}\sum_{i=1}^{N}r_i(\hat\gamma)\xrightarrow{p}E[r_i(\gamma_o)]. \tag{A.19}$$
Assuming $A^F_o \equiv -E[H^F_i(\gamma_o)]$ exists and is nonsingular, $N^{-1}\sum_{i=1}^{N}H^F_i(\ddot\gamma)$ is nonsingular with probability approaching 1, and $\big(N^{-1}\sum_{i=1}^{N}H^F_i(\ddot\gamma)\big)^{-1}\xrightarrow{p}(-A^F_o)^{-1}$. Since $E[s^F_i(\gamma_o)] = 0$ and the $s^F_i(\gamma_o)$ are iid random vectors, I can rewrite (A.18) as
$$\sqrt{N}(\hat\gamma - \gamma_o) = (-A^F_o)^{-1}\Big[-N^{-1/2}\sum_{i=1}^{N}s^F_i(\gamma_o)\Big] + o_p(1), \tag{A.20}$$
where $o_p(1)$ (little oh p one) means that if $y_n$, a sequence of random variables, is $o_p(1)$, then $y_n\xrightarrow{p}0$. From (A.20), I obtain the influence function representation of the CMLE estimates as follows:
$$\sqrt{N}(\hat\gamma - \gamma_o) = N^{-1/2}\sum_{i=1}^{N}r_i(\gamma_o) + o_p(1), \tag{A.21}$$
where $r_i(\gamma_o) \equiv (A^F_o)^{-1}s^F_i(\gamma_o)$.

To show $E[r_i(\gamma^*)] = 0$, we need to remember that $E[s^F_i(\gamma^*)] = 0$ from the first stage CMLE estimation. Then,
$$E[r_i(\gamma^*)] = E[(A^F_*)^{-1}s^F_i(\gamma^*)] = (A^F_*)^{-1}E[s^F_i(\gamma^*)] = 0. \tag{A.22}$$
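The following simulation is my own quick numerical sanity check, not part of the original text: it verifies that the MNL score (A.11) has (approximately) mean zero and that $-E[H^F_i] \approx E[s^F_i s^{F\prime}_i]$, as (A.12) through (A.16) claim.

```python
# Numerical check of the mean-zero score and the information matrix
# equality for the multinomial logit first stage.
import numpy as np

rng = np.random.default_rng(1)
N, k, G1 = 50_000, 2, 3                     # k regressors, G + 1 = 3 choices
gamma = rng.normal(size=(G1, k))            # a "true" gamma_o
Z = np.column_stack([np.ones(N), rng.normal(size=N)])
U = Z @ gamma.T                             # z_i gamma_j
Lam = np.exp(U) / np.exp(U).sum(1, keepdims=True)
w = (rng.uniform(size=(N, 1)) > Lam.cumsum(1)).sum(1)  # draw w_i ~ MNL(Lam_i)
D = np.eye(G1)[w]

S = ((D - Lam)[:, :, None] * Z[:, None, :]).reshape(N, G1 * k)  # (A.11)
print(np.abs(S.mean(0)).max())              # close to 0, cf. (A.12)

B = S.T @ S / N                             # sample analogue of E[s s']
A = np.zeros((G1 * k, G1 * k))              # sample analogue of -E[H], (A.14)
for Li, zi in zip(Lam, Z):
    A += np.kron(np.diag(Li) - np.outer(Li, Li), np.outer(zi, zi))
A /= N
print(np.abs(A - B).max())                  # small: information matrix equality
```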
Now, to show $E[s_i(\theta_o;\gamma^*)] = 0$, we assume the model for the expectation of y conditional on (X, v) is correctly specified. Then,
$$\begin{aligned} E[s_i(\theta_o;\gamma^*)] &= E\big(E[s_i(\theta_o;\gamma^*)|d, x, z]\big) = E\big(E[-\nabla'_\theta m_i(\theta_o;\gamma^*)(y_i - m_i(\theta_o;\gamma^*))|d, x, z]\big) \\ &= E\big(-\nabla'_\theta m_i(\theta_o;\gamma^*)(E[y_i|d, x, z] - m_i(\theta_o;\gamma^*))\big) = 0, \end{aligned} \tag{A.23}$$
where $m_i(\theta_o;\gamma^*) = m(X_i, v(d_i, x_i, z_i, \gamma^*), \theta_o)$ as in (1.17) and $E[y_i|d, x, z] = m_i(\theta_o;\gamma^*)$, i.e., the conditional mean of $y_i$ is correctly specified.

Now I will derive the closed forms of the expressions appearing in the asymptotic variance matrix of the CF estimates. I start with the score function, which is
$$s_i(\theta_o;\gamma^*) = -\nabla'_\theta m_i(\theta_o;\gamma^*)(y_i - m_i(\theta_o;\gamma^*)) = -\begin{pmatrix} X'_i \\ v'_i \end{pmatrix}(y_i - X_i\delta_o - v_i\lambda_o). \tag{A.24}$$
Next, I continue with the expected values of the Hessian, of the outer product of the score, and of the outer product of the influence function:
$$\begin{aligned} A_o &= E[\nabla_\theta(-\nabla'_\theta m_i(\theta_o;\gamma^*))(y_i - m_i(\theta_o;\gamma^*)) + (-\nabla'_\theta m_i(\theta_o;\gamma^*))\nabla_\theta(y_i - m_i(\theta_o;\gamma^*))] \\ &= E\{\nabla_\theta(-\nabla'_\theta m_i(\theta_o;\gamma^*))(E[y_i|d, x, z] - m_i(\theta_o;\gamma^*))\} + E\{E[(\nabla'_\theta m_i(\theta_o;\gamma^*))\nabla_\theta m_i(\theta_o;\gamma^*)|d, x, z]\} \\ &= E[\nabla'_\theta m_i(\theta_o;\gamma^*)\nabla_\theta m_i(\theta_o;\gamma^*)] = E\Big[\begin{pmatrix} X'_i \\ v'_i \end{pmatrix}\begin{pmatrix} X_i & v_i \end{pmatrix}\Big], \end{aligned} \tag{A.25}$$
$$B_o = E\Big[\begin{pmatrix} X'_i \\ v'_i \end{pmatrix}\begin{pmatrix} X_i & v_i \end{pmatrix}(y_i - X_i\delta_o - v_i\lambda_o)^2\Big], \tag{A.26}$$
$$R_* = E[(A^F_*)^{-1}s^F_i(\gamma^*)s^F_i(\gamma^*)'(A^F_*)^{-1}] = \{E[s^F_i(\gamma^*)s^F_i(\gamma^*)']\}^{-1} = \Bigg\{E\Bigg[\begin{pmatrix} (d_{0i} - \Lambda^*_{0i})z'_i \\ (d_{1i} - \Lambda^*_{1i})z'_i \\ \vdots \\ (d_{Gi} - \Lambda^*_{Gi})z'_i \end{pmatrix}\big((d_{0i} - \Lambda^*_{0i})z_i \;\; (d_{1i} - \Lambda^*_{1i})z_i \;\; \cdots \;\; (d_{Gi} - \Lambda^*_{Gi})z_i\big)\Bigg]\Bigg\}^{-1}, \tag{A.27}$$
where $\Lambda^*_{gi} = \exp(z_i\gamma^*_g)/\sum_{r=0}^{G}\exp(z_i\gamma^*_r)$.

Next, I continue with the expected value of the expression that represents the effect of the sampling error in $\gamma$ on the second stage, and with the expected value of the product of the score from the second stage estimation and the influence function:
$$\begin{aligned} F_o &= E[-\nabla_\gamma(\nabla'_\theta m_i(\theta_o;\gamma^*))(y_i - m_i(\theta_o;\gamma^*)) + \nabla'_\theta m_i(\theta_o;\gamma^*)\nabla_\gamma m_i(\theta_o;\gamma^*)] \\ &= E\{-\nabla_\gamma(\nabla'_\theta m_i(\theta_o;\gamma^*))(E[y_i|d, x, z] - m_i(\theta_o;\gamma^*))\} + E\{E[\nabla'_\theta m_i(\theta_o;\gamma^*)\nabla_{v'}m_i(\theta_o;\gamma^*)\nabla_\gamma v'_i(\gamma^*)|d, x, z]\} \\ &= E[\nabla'_\theta m_i(\theta_o;\gamma^*)\nabla_{v'}m_i(\theta_o;\gamma^*)\nabla_\gamma v'_i(\gamma^*)] = E\Big[\begin{pmatrix} X'_i \\ v'_i \end{pmatrix}\lambda'_o\nabla_\gamma v'_i(\gamma^*)\Big] \end{aligned} \tag{A.28}$$
and
$$\begin{aligned} T_o &= E[r_i(\gamma^*)s'_i(\theta_o;\gamma^*)] = E[(A^F_*)^{-1}s^F_i(\gamma^*)\{-\nabla_\theta m_i(\theta_o;\gamma^*)(y_i - m_i(\theta_o;\gamma^*))\}] \\ &= E\big((A^F_*)^{-1}s^F_i(\gamma^*)\{-\nabla_\theta m_i(\theta_o;\gamma^*)(E[y_i|d, x, z] - m_i(\theta_o;\gamma^*))\}\big) = 0. \end{aligned} \tag{A.29}$$
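As a compact illustration of how these pieces are used in practice, the sketch below assembles an estimated asymptotic variance. The sandwich form $A^{-1}(B + FRF')A^{-1}$ is my assumption, following the standard two-step M-estimation result (cf. Wooldridge 2010, Ch. 12) together with $T_o = 0$ from (A.29); the function name and argument layout are mine.

```python
# Minimal sketch (assuming the standard two-step sandwich with T_o = 0):
# Avar(theta_hat) approx A^{-1}(B + F R F')A^{-1}/N.
import numpy as np

def cf_avar(H2, resid, F_hat, R_hat):
    """H2: (N, M) second stage regressors (X_i, v_i); resid: (N,) second
    stage residuals; F_hat: (M, (G+1)k) estimate of F_o; R_hat: of R_*."""
    N = H2.shape[0]
    A = H2.T @ H2 / N                                         # (A.25)
    Hw = H2 * resid[:, None]
    B = Hw.T @ Hw / N                                         # (A.26)
    Ainv = np.linalg.inv(A)
    return Ainv @ (B + F_hat @ R_hat @ F_hat.T) @ Ainv / N    # T_o = 0, (A.29)
```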
A.4 Chapter 1: Verification of the Conditions in Theorems 1.1-1.4

Conditions of Theorem 1.1. Condition (a) holds because, under the model in (1.1) and (1.2), A.1.1 and A.1.2 allow the discrete treatment variable w to follow a multinomial logit model with choice probabilities $P(w = g|z;\gamma) = \exp(z\gamma_g)/\sum_{r=0}^{G}\exp(z\gamma_r)$ for $g = 0, 1, \ldots, G$. Hence, $\int_{\mathcal{W}}f(w|z)\mu(dw) = \sum_{g=0}^{G}P(w = g|z;\gamma) = 1$, $\forall z\in\mathcal{Z}$. For condition (b), since I implicitly assume the parametric model $f(\cdot|z;\gamma)$ for the conditional density $p(\cdot|z)$ is correctly specified, then, for some $\gamma_o\in\Gamma$, $p_o(\cdot|z) = f(\cdot|z;\gamma_o)$, $\forall z\in\mathcal{Z}$. Furthermore, assuming $\int_{\mathcal{W}}p_o(w|z)\mu(dw) \ge \int_{\mathcal{W}}f(w|z)\mu(dw)$, the Kullback-Leibler information criterion in Wooldridge (2010, p. 523) implies that $E[l_i(\gamma_o)|z_i] \ge E[l_i(\gamma)|z_i]$, $\forall\gamma\in\Gamma$. Consequently, $\gamma_o$ is a solution to $\max_{\gamma\in\Gamma}E[l_i(\gamma)]$, and if $\gamma_o$ is identified, then it is the unique solution to the maximization problem. As for condition (c), without any restrictions, $\Gamma$ can be an open ball with center $0\in\mathbb{R}^{(G+1)k}$ and a very large radius. Since open balls are convex, $\Gamma$ is an open and convex parameter space. Then $\gamma_o\in\Gamma$ is in the interior of $\Gamma$, since $\Gamma$ is an open ball. For condition (d), $l(\cdot,\gamma) = \sum_{j=0}^{G}1[w = j]\log\big(\exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r)\big)$, which is simply a sum of products of an indicator function and a logarithmic function. Therefore, if for each $\gamma\in\Gamma$ the logarithmic function is Borel measurable on $\mathcal{Z}$, then $l(\cdot,\gamma)$ is a Borel measurable function on $\mathcal{W}\times\mathcal{Z}$. Since, for each $\gamma\in\Gamma$, $\exp(z\gamma_j)$ is an exponential function of $z\gamma_j$ and $z\gamma_j$ is linear in z, $\exp(z\gamma_j)$ is continuous at z. As z is an arbitrary element of $\mathcal{Z}$, $\exp(z\gamma_j)$ is continuous on $\mathcal{Z}$ for $j = 0, 1, \ldots, G$. Then, by Theorem 15.6 of Bartle (1964), the multinomial logistic function $\exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r)$ is continuous on $\mathcal{Z}$. As logarithmic functions are continuous, by Theorem 15.8 of Bartle (1964), $\log\big(\exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r)\big)$ is continuous on $\mathcal{Z}$. Hence, by Theorem 13.2 of Billingsley (1995), for each $\gamma\in\Gamma$, $\log\big(\exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r)\big)$ is Borel measurable on $\mathcal{Z}$, and thereby $l(\cdot,\gamma)$ is a Borel measurable function on $\mathcal{W}\times\mathcal{Z}$.

As for condition (e), it suffices to show that, for each $(w, z)\in\mathcal{W}\times\mathcal{Z}$, the Hessian matrix $H^F(\gamma) \equiv \nabla_\gamma s^F(\gamma)$ is negative semidefinite, or that its negative is positive semidefinite; see Theorem 21.5 and Chapter 21 in Simon and Blume (1994) for more on concavity. Note that $-\nabla_\gamma s^F(\gamma) = -\Lambda\otimes z'z$ is the Kronecker product of $-\Lambda$ and $z'z$, where $\Lambda$ is a $(G+1)\times(G+1)$ symmetric matrix with $-\Lambda_i + \Lambda^2_i$'s on its diagonal and $\Lambda_m\Lambda_n$'s off the diagonal for $i, m, n = 0, 1, \ldots, G$ and $m\ne n$. $-\Lambda$ is positive semidefinite because, by Equation 30 of Searle (1982), the quadratic form $x(-\Lambda)x' = \sum_{i=0}^{G}(\Lambda_i - \Lambda^2_i)x^2_i - 2\sum_{i=0}^{G-1}\sum_{j>i}x_i x_j\Lambda_i\Lambda_j$ for all nonzero row vectors $x'\in\mathbb{R}^{G+1}$. After some algebraic manipulation and using $\sum_{i=0}^{G}\Lambda_i = 1$, $x(-\Lambda)x' = \sum_{i=0}^{G}\Lambda_i\big(\sum_{j\ne i}\Lambda_j\big)x^2_i - 2\sum_{i=0}^{G-1}\sum_{j>i}x_i x_j\Lambda_i\Lambda_j = \sum_{i=0}^{G-1}\sum_{j>i}\{\Lambda_i\Lambda_j(x^2_i + x^2_j) - 2x_i x_j\Lambda_i\Lambda_j\}$, where the summand is the square of $(\sqrt{\Lambda_i\Lambda_j}\,x_i - \sqrt{\Lambda_i\Lambda_j}\,x_j)$. Therefore, $x(-\Lambda)x' \ge 0$, and $-\Lambda$ is positive semidefinite as claimed. $z'z$ is also positive semidefinite because $x(z'z)x' = (zx')'zx'$, which equals the square of $zx'$. By Theorem 23.17 of Simon and Blume (1994), all the eigenvalues of both $-\Lambda$ and $z'z$ are nonnegative. Then, by Proposition 2.45 of Dhrymes (2013), all the eigenvalues of $-\nabla_\gamma s^F(\gamma)$ are nonnegative, too. Hence, by Theorem 23.17 of Simon and Blume (1994) again, $-\nabla_\gamma s^F(\gamma)$ is positive semidefinite, and thereby, for each $(w, z)\in\mathcal{W}\times\mathcal{Z}$, $l(w, z, \cdot)$ is concave in $\gamma$.

As to condition (f), consider the first, second, and mixed partial derivatives of $f(v) \equiv \log\big(\exp(v_j)/\sum_{r=0}^{G}\exp(v_r)\big)$:
$$\frac{\partial f(v)}{\partial v_j} = 1 - \frac{\exp(v_j)}{\textstyle\sum}, \quad \frac{\partial f(v)}{\partial v_r} = -\frac{\exp(v_r)}{\textstyle\sum}, \quad \frac{\partial^2 f(v)}{\partial v^2_j} = \frac{\exp(v_j)}{\textstyle\sum}\Big(\frac{\exp(v_j)}{\textstyle\sum} - 1\Big),$$
$$\frac{\partial^2 f(v)}{\partial v^2_r} = \frac{\exp(v_r)}{\textstyle\sum}\Big(\frac{\exp(v_r)}{\textstyle\sum} - 1\Big), \quad \frac{\partial^2 f(v)}{\partial v_r\partial v_j} = \frac{\partial^2 f(v)}{\partial v_j\partial v_r} = \frac{\exp(v_j)}{\textstyle\sum}\frac{\exp(v_r)}{\textstyle\sum},$$
where $\sum = \sum_{r=0}^{G}\exp(v_r)$, $j\ne r$, $j = 0, 1, \ldots, G$, and $v = (v_0, v_1, \ldots, v_G)'\in\mathbb{R}^{G+1}$.
Note that all these derivatives are less than one in absolute value. By Taylor's Theorem for functions from $\mathbb{R}^{G+1}$ to $\mathbb{R}$ (see Theorem 20.16 of Bartle (1964) for more on this), $f(v)$ can be expanded about $0\in\mathbb{R}^{G+1}$ such that
$$f(v) = \log\Big(\frac{1}{G+1}\Big) + \sum_{j=0}^{G}\frac{\partial f(0)}{\partial v_j}v_j + \frac{1}{2!}\sum_{j=0}^{G}\frac{\partial^2 f(\bar v)}{\partial v^2_j}v^2_j + \sum_{j=0}^{G-1}\sum_{r>j}\frac{\partial^2 f(\bar v)}{\partial v_j\partial v_r}v_j v_r,$$
where $\bar v$ is a point on the line segment between v and zero. Then, by the triangle inequality, $|f(v)| \le |\log(1/(G+1))| + \sum_{j=0}^{G}|\partial f(0)/\partial v_j||v_j| + (1/2!)\sum_{j=0}^{G}|\partial^2 f(\bar v)/\partial v^2_j||v_j|^2 + \sum_{j=0}^{G-1}\sum_{r>j}|\partial^2 f(\bar v)/\partial v_j\partial v_r||v_j v_r|$. Since all the derivatives are less than one in absolute value, $|f(v)| \le |\log(1/(G+1))| + \sum_{j=0}^{G}|v_j| + (1/2!)\sum_{j=0}^{G}|v_j|^2 + \sum_{j=0}^{G-1}\sum_{r>j}|v_j v_r|$, $\forall v\in\mathbb{R}^{G+1}$. Now consider $|l(w, z, \gamma)| \le \sum_{j=0}^{G}|1[w = j]||\log(\exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r))|$ by the triangle inequality. Since $|1[w = j]| \le 1$, $|l(w, z, \gamma)| \le \sum_{j=0}^{G}|\log(\exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r))|$. Using the upper bound for $|f(v)|$ right above, the Cauchy-Schwarz inequality, and the triangle inequality, $|l(w, z, \gamma)| \le (G+1)[|\log(1/(G+1))| + \sum_{j=0}^{G}\|z'\|\|\gamma_j\| + (1/2!)\sum_{j=0}^{G}\|z'\|^2\|\gamma_j\|^2 + \sum_{j=0}^{G-1}\sum_{r>j}\{\sum_{i=1}^{k}\sum_{h=1}^{k}|z_i z_h||\gamma_{ji}||\gamma_{rh}|\}]$, where $\|\cdot\|$ is the Euclidean norm. Since $\Gamma$ is an open ball with a fixed (but potentially very large) radius, there exists an $m \equiv (m'_0, m'_1, \ldots, m'_G)'\in\Gamma^c\subset\mathbb{R}^{(G+1)k}$, where $\Gamma^c$ is the complement of $\Gamma$, $m_j\in\mathbb{R}^k$, and $\|m_j\| < \infty$ for $j = 0, 1, \ldots, G$, such that $\|\gamma_j\| \le \|m_j\|$ for each j. Furthermore, assuming $E|z_i z_h| < \infty$ for $i, h = 1, 2, \ldots, k$ implies that $E\|z'\|^2 < \infty$ and $E\|z'\| < \infty$ by Jensen's inequality. Then, setting $b(w, z) \equiv (G+1)[|\log(1/(G+1))| + \sum_{j=0}^{G}\|z'\|\|m_j\| + (1/2!)\sum_{j=0}^{G}\|z'\|^2\|m_j\|^2 + \sum_{j=0}^{G-1}\sum_{r>j}\{\sum_{i=1}^{k}\sum_{h=1}^{k}|z_i z_h||m_{ji}||m_{rh}|\}]$, we have $|l(w, z, \gamma)| \le b(w, z)$, $\forall\gamma\in\Gamma$, where $b(\cdot,\cdot)$ is a nonnegative function on $\mathcal{W}\times\mathcal{Z}$ with $E[b(w, z)] < \infty$.

Conditions of Theorem 1.2. Condition (a) holds because of condition (c) of Th.1.1. For condition (b), since, for each $(w, z)\in\mathcal{W}\times\mathcal{Z}$, $\exp(z\gamma_j)$ is an exponential function of $z\gamma_j$ and $z\gamma_j$ is linear in $\gamma_j$, $\exp(z\gamma_j)$ is differentiable at $\gamma\in int(\Gamma)$. As $\gamma$ is arbitrary, $\exp(z\gamma_j)$ is differentiable on $int(\Gamma)$ for $j = 0, 1, \ldots, G$. Then, by Theorem 20.8 of Bartle (1964), the multinomial logistic function $\exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r)$ is differentiable on $int(\Gamma)$. As logarithmic functions are differentiable, by Theorem 20.9 of Bartle (1964), $\log(\exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r))$ is differentiable on $int(\Gamma)$. But $l(w, z, \cdot)$ is simply a sum of products of an indicator function and this logarithmic function; therefore, for each $(w, z)\in\mathcal{W}\times\mathcal{Z}$, $l(w, z, \cdot)$ is differentiable on $int(\Gamma)$. The transpose of the first derivative of $l(w, z, \cdot)$ is $s^F(\gamma)$, with elements $(1[w = j] - \Lambda_j)z_h$, where $\Lambda_j = \exp(z\gamma_j)/\sum_{r=0}^{G}\exp(z\gamma_r)$ for $j = 0, 1, \ldots, G$ and $h = 1, 2, \ldots, k$. But $\Lambda_j$ is differentiable and $1[w = j]$ is an indicator function; hence, for each $(w, z)\in\mathcal{W}\times\mathcal{Z}$, each element of $s^F(\gamma)$ is differentiable on $int(\Gamma)$. For this reason, for each $(w, z)\in\mathcal{W}\times\mathcal{Z}$, $l(w, z, \cdot)$ is twice differentiable on $int(\Gamma)$. The second derivative of $l(w, z, \cdot)$ is $H^F(\gamma)$, with elements $(-\Lambda_j + (\Lambda_j)^2)z_i z_h$ and $\Lambda_m\Lambda_n z_i z_h$ for $j, m, n = 0, 1, \ldots, G$, $m\ne n$, and $i, h = 1, 2, \ldots, k$. Following the method used in checking condition (d) of Th.1.1, it is obvious that, for each $(w, z)\in\mathcal{W}\times\mathcal{Z}$, $\Lambda_j$ and $(\Lambda_j)^2$ are continuous on $int(\Gamma)$. Consequently, each element of $H^F(\gamma)$, and thereby $H^F(\gamma)$ itself, is continuous by Theorem 15.6 of Bartle (1964).
For this reason, for each $(w, z)\in\mathcal{W}\times\mathcal{Z}$, $l(w, z, \cdot)$ is twice continuously differentiable on $int(\Gamma)$. Condition (c) holds as shown in the previous subsection. As for condition (d), $\nabla_\gamma[\nabla'_\gamma l(w, z, \gamma)] = \Lambda\otimes z'z$, where $\Lambda$ is a $(G+1)\times(G+1)$ symmetric matrix with $-\Lambda_i + \Lambda^2_i$'s on its diagonal and $\Lambda_m\Lambda_n$'s off the diagonal for $i, m, n = 0, 1, \ldots, G$ and $m\ne n$. By Exercise 6(a) of Laub (2005, p. 149), $\|\Lambda\otimes z'z\| = \|\Lambda\|\|z'z\|$, where $\|\cdot\|$ is the square root of the sum of the squares of the matrix elements. Since each element of $\Lambda$ is less than one in absolute value, $\|\Lambda\otimes z'z\| < (G+1)\|z'z\|$. Furthermore, assuming $E(z_i z_h)^2 < \infty$ for $i, h = 1, 2, \ldots, k$ implies that $E\|z'z\| < \infty$ by Jensen's inequality. Then, setting $b(w, z) \equiv (G+1)\|z'z\|$, the elements of $\nabla_\gamma[\nabla'_\gamma l(w, z, \gamma)]$ are bounded in absolute value by $b(w, z)$, $\forall\gamma\in\Gamma$, where $b(\cdot,\cdot)$ is a nonnegative function on $\mathcal{W}\times\mathcal{Z}$ such that $E[b(w, z)] < \infty$. As to condition (e), note that by condition (c), $A^F_o = E(s^F_i(\gamma_o)s^F_i(\gamma_o)')$. Then, for all nonzero row vectors $x'\in\mathbb{R}^{(G+1)k}$, the quadratic form $xA^F_o x' = E(xs^F_i(\gamma_o)s^F_i(\gamma_o)'x') = E[\sum_{j=0}^{G}\sum_{h=1}^{k}\{(1[w = j] - \Lambda_j)z_h x_{h,j}\}^2] = \sum_{j=0}^{G}\sum_{h=1}^{k}E[(1[w = j] - \Lambda_j)^2 z^2_h]x^2_{h,j}$. Since there is at least one $x_{h,j}\ne 0$, $xA^F_o x' > 0$, assuming that $E[(1[w = j] - \Lambda_j)^2 z^2_h] > 0$ for at least one j and h corresponding to that $x_{h,j}$. Hence, $A^F_o$ is positive definite.

Conditions of Theorem 1.3. Condition (a) clearly holds due to Th.1.1, where $\hat\gamma\xrightarrow{p}\gamma_o$ and $\gamma_o\in int(\Gamma)$. As for condition (b), note that $q(w_i, \theta;\gamma^*) \equiv (y_i - m(X_i, v(d_i, z_i, \gamma^*), \theta))^2/2 = [y_i - (X_i\delta + v_i\lambda)]^2/2 = [y_i - h_i\theta]^2/2$, where $h_i = (X_i, v_i)$ and $\theta = (\delta', \lambda')'$. Assuming the conditional expectation of $y_i$ is correctly specified, i.e., $E(y_i|d, x, z) = h_i\theta_o$, the conditional mean identification principle of Hayashi (2000, p. 462-3) suggests that $\min_{\theta\in\Theta}E[q_i(\theta;\gamma^*)]$ occurs uniquely at $\theta_o$ if $h_i\theta\ne h_i\theta_o$ for all $\theta\ne\theta_o$. However, this condition is satisfied if and only if $h'_i h_i$ is nonsingular. To see why, let $h'_i h_i$ be nonsingular and assume that $\exists\theta$ such that $\theta\ne\theta_o$ but $h_i\theta = h_i\theta_o$. Then $h'_i h_i\theta = h'_i h_i\theta_o$ and $(h'_i h_i)^{-1}h'_i h_i\theta = (h'_i h_i)^{-1}h'_i h_i\theta_o$, which imply $\theta = \theta_o$. But this is a contradiction, so $h_i\theta\ne h_i\theta_o$ for all $\theta\ne\theta_o$ if $h'_i h_i$ is nonsingular. As a result, for any given $\gamma^*\in\Gamma$, the true parameter vector $\theta_o$ is the unique solution to $\min_{\theta\in\Theta}E[q_i(\theta;\gamma^*)]$. For condition (c), let $\Theta$ be a closed ball with center $0\in\mathbb{R}^M$ and a very large radius $r_\theta > 0$, i.e., $\Theta \equiv B(0, r_\theta) = \{\eta\in\mathbb{R}^M : \|\eta\| \le r_\theta\}\subset\mathbb{R}^M$, and let $\Gamma$ be another closed ball with center $0\in\mathbb{R}^{(G+1)k}$ and a very large radius $r_\gamma > 0$, i.e., $\Gamma \equiv B(0, r_\gamma) = \{\zeta\in\mathbb{R}^{(G+1)k} : \|\zeta\| \le r_\gamma\}\subset\mathbb{R}^{(G+1)k}$. Since closed balls include all their boundary points, and the distance between any two points in a closed ball cannot exceed twice its radius, both $\Theta$ and $\Gamma$ are compact parameter spaces by Theorem 9.3 of Bartle (1964). Note that $\Theta\times\Gamma$ is the Cartesian product of the two compact spaces $\Theta$ and $\Gamma$; then, by Theorem 4.2.17 of Dixmier (1984), the parameter space $\Theta\times\Gamma$ is compact. As to condition (d), $q(w, \theta;\gamma) = [y - (X\delta + v\lambda)]^2/2$, which is simply the square of a linear function in w. For each $\gamma\in\Gamma$, X is Borel measurable because each element of X is either a simple binary random variable or the product of a binary random variable and x, and thereby measurable; see the third paragraph of Billingsley (1995, p. 182) and Theorem 3.33 of Davidson (1994) for more on this.
In addition, for each $\gamma\in\Gamma$, v is Borel measurable because each element of v is either the product of a binary random variable and $\log(\Lambda_g)$ or the product of a binary random variable and $M_g$, where $\Lambda_g = \exp(z\gamma_g)/\sum_{r=0}^{G}\exp(z\gamma_r)$ and $M_g = \Lambda_g\log(\Lambda_g)/(1 - \Lambda_g)$ for $g = 0, 1, \ldots, G$. By using the arguments in condition (d) of Th.1.1, it is clear that $\log(\Lambda_g)$ and $M_g$ are both measurable, so v is as well. Since, for each $(\theta, \gamma)\in\Theta\times\Gamma$, $y - (X\delta + v\lambda)$ is a linear function of w, $y - (X\delta + v\lambda)$ is continuous at w. As w is an arbitrary element of $\mathcal{W}$, $y - (X\delta + v\lambda)$ is continuous on $\mathcal{W}$. As polynomial functions are continuous, by Theorems 15.6 and 15.8 of Bartle (1964), $[y - (X\delta + v\lambda)]^2/2$ is continuous on $\mathcal{W}$. Hence, by Theorem 13.2 of Billingsley (1995), for each $(\theta, \gamma)\in\Theta\times\Gamma$, $q(\cdot, \theta;\gamma)$ is a Borel measurable function on $\mathcal{W}$. For condition (e), define $t \equiv X\delta + v\lambda$, so $q(w, \cdot;\cdot) = [y - t]^2/2$. Note that, for each $w\in\mathcal{W}$, $q(w, \cdot;\cdot)$ is a quadratic function of t, which is linear in $\theta$. Since quadratic functions are continuous and t is linear in $\theta$, $q(w, \cdot;\cdot)$ is continuous at $\theta$. As $\theta$ is an arbitrary element of $\Theta$, for each $w\in\mathcal{W}$, $q(w, \cdot;\cdot)$ is continuous on $\Theta\times\Gamma$. As for condition (f), note that $q(w_i, \theta;\gamma) = [y_i - (X_i\delta + v_i\lambda)]^2/2 = [y_i - h_i\theta]^2/2$, and $y_i = h_i\theta_o + \varepsilon_i$, where $E(\varepsilon_i|d, x, z) = 0$. Then, $[y_i - h_i\theta]^2/2 = [\varepsilon_i + h_i(\theta_o - \theta)]^2/2 = [\varepsilon^2_i + 2\varepsilon_i h_i(\theta_o - \theta) + (\theta_o - \theta)'h'_i h_i(\theta_o - \theta)]/2$. Since $|q(w_i, \theta;\gamma)| = q(w_i, \theta;\gamma)$ and $E(\varepsilon_i|d, x, z) = 0$, $E[|q(w_i, \theta;\gamma)|] = \{E(\varepsilon^2_i) + (\theta_o - \theta)'E(h'_i h_i)(\theta_o - \theta)\}/2$. Assuming $E(h'_i h_i)$ is positive definite and $E(\varepsilon^2_i)$ is finite, $E[|q(w_i, \theta;\gamma)|] < \infty$, $\forall(\theta, \gamma)\in\Theta\times\Gamma$.

Conditions of Theorem 1.4. Condition (a) holds if $\theta_o$ is not a boundary point of $\Theta$ and $\gamma^*$ is not a boundary point of $\Gamma$. By Th.1.3, the parameter space $\Theta\times\Gamma$ is compact, so there is a possibility that $\theta_o$ and $\gamma^*$ are boundary points. Assuming $\Theta$ ($\Gamma$) is a closed ball with center $0\in\mathbb{R}^M$ ($0\in\mathbb{R}^{(G+1)k}$) and a very large radius $r_\theta > 0$ ($r_\gamma > 0$), this possibility can be very low, though. Without further evidence, I accept as true that $\theta_o\in int(\Theta)$ and $\gamma^*\in int(\Gamma)$. As for condition (b), $\sqrt{N}(\hat\gamma - \gamma_o)\xrightarrow{d}Normal(0, (A^F_o)^{-1}B^F_o(A^F_o)^{-1})$ by Th.1.2. Then, by setting $\gamma^* = \gamma_o$ and by Lemma 4.5 of White (2001), $\sqrt{N}(\hat\gamma - \gamma^*) = O_p(1)$. For condition (c), since, for each $(w, \gamma)\in\mathcal{W}\times\Gamma$, $[y - (X\delta + v\lambda)]^2/2$ is a quadratic function of $y - (X\delta + v\lambda)$ and $y - (X\delta + v\lambda)$ is linear in $\theta = (\delta', \lambda')'$, $[y - (X\delta + v\lambda)]^2/2$ is differentiable at $\theta\in int(\Theta)$. As $\theta$ is arbitrary, $q(w, \cdot;\gamma) = [y - (X\delta + v\lambda)]^2/2$ is differentiable on $int(\Theta)$. The transpose of the first derivative of $q(w, \cdot;\gamma)$ is $s(w, \cdot;\gamma) \equiv \nabla'_\theta q(w, \cdot;\gamma) = -(X\;\; v)'(y - X\delta - v\lambda)$. But $(y - X\delta - v\lambda)$ is differentiable and $-(X\;\; v)'$ is constant in $\theta$. Hence, for each $(w, \gamma)\in\mathcal{W}\times\Gamma$, each element of $s(w, \cdot;\gamma)$ is differentiable on $int(\Theta)$, and thereby $q(w, \cdot;\gamma)$ is twice differentiable on $int(\Theta)$. The second derivative of $q(w, \cdot;\gamma)$ is $H(w, \cdot;\gamma) \equiv \nabla_\theta s(w, \cdot;\gamma) = (X_i\;\; v_i)'(X_i\;\; v_i)$, which is constant in $\theta$. Consequently, it is obvious that, for each $(w, \gamma)\in\mathcal{W}\times\Gamma$, $H(w, \cdot;\gamma)$ is continuous on $int(\Theta)$, and thereby $q(w, \cdot;\gamma)$ is twice continuously differentiable on $int(\Theta)$. As to condition (d), note that, for each $\theta\in\Theta$, $s(\cdot, \theta;\cdot) \equiv \nabla'_\theta q(\cdot, \theta;\cdot) = -(X\;\; v)'(y - X\delta - v\lambda)$. Only the variables in v depend on $\gamma$, so as long as v is differentiable on $int(\Gamma)$, $s(\cdot, \theta;\cdot)$ is differentiable on $int(\Gamma)$ by Theorem 20.8 of Bartle (1964).
v has two types of variables: $-d_g\log(\Lambda_g)$ and $d_h M_g$, where $d_h$ is a binary random variable, $\Lambda_g = \exp(z\gamma_g)/\sum_{r=0}^{G}\exp(z\gamma_r)$, and $M_g = \Lambda_g\log(\Lambda_g)/(1 - \Lambda_g)$ for $h, g = 0, 1, \ldots, G$ and $h\ne g$. From condition (b) of Th.1.2, we know that both $\Lambda_g$ and $\log(\Lambda_g)$ are differentiable on $int(\Gamma)$ (note that z is taken as given). Then, by Theorem 20.8 of Bartle (1964), $M_g$ is also differentiable on $int(\Gamma)$. Since each element of v is either $-d_g\log(\Lambda_g)$ or $d_h M_g$, for each $\theta\in\Theta$, v, and thereby $s(\cdot, \theta;\cdot)$, is differentiable on $int(\Gamma)$. The derivative (gradient) of $s(\cdot, \theta;\cdot)$ with respect to $\gamma$ is
$$\nabla_\gamma s(\cdot, \theta;\cdot) = -\begin{pmatrix} 0 \\ \nabla_\gamma v' \end{pmatrix}(y - X\delta - v\lambda) + \begin{pmatrix} X' \\ v' \end{pmatrix}\lambda'(\nabla_\gamma v'),$$
where 0 is the $(l+1)(G+1)\times(G+1)k$ zero matrix and $\nabla_\gamma v'$ is the $(G+1)^2\times(G+1)k$ gradient. $\nabla_\gamma v'$ is composed of four types of variables: $-d_g(1 - \Lambda_g)z$, $d_g\Lambda_h z$, $\{d_g[\log(\Lambda_h) + 1 - \Lambda_h]\Lambda_h z\}/(1 - \Lambda_h)$, and $\{d_g[-M_h - \Lambda_h]\Lambda_i z\}/(1 - \Lambda_h)$, for $h, g, i = 0, 1, \ldots, G$ and $h\ne g, i$. Using the arguments from condition (d) of Th.1.1, we know that both $\Lambda_g$ and $\log(\Lambda_g)$ are continuous on $int(\Gamma)$. By Theorems 15.6 and 15.8 of Bartle (1964) and the continuity of logarithmic functions, $M_g$ is also continuous on $int(\Gamma)$. Using the same theorems and the $d_g$'s being binary random variables, each element of both $\nabla_\gamma v'$ and v is continuous. Therefore, it is clear that $\nabla_\gamma s(\cdot, \theta;\cdot)$ is continuous on $int(\Gamma)$. As a result, for each $\theta\in\Theta$, $s(\cdot, \theta;\cdot) \equiv \nabla'_\theta q(\cdot, \theta;\cdot)$ is continuously differentiable on $int(\Gamma)$. As for condition (e), note that, for each $(\theta, \gamma)\in\Theta\times\Gamma$, $H(\cdot, \theta;\gamma) = (X\;\; v)'(X\;\; v)$. From condition (d) of Th.1.3, we know that both X and v are Borel measurable functions on $\mathcal{W}$. Since each element of $H(\cdot, \theta;\gamma)$ is one of the cross products of X and v, each element of $H(\cdot, \theta;\gamma)$ is measurable by Theorem 3.33 of Davidson (1994). Hence, for each $(\theta, \gamma)\in\Theta\times\Gamma$, $H(\cdot, \theta;\gamma)$ is a Borel measurable function on $\mathcal{W}$. For condition (f), from condition (d) of Th.1.4, it is clear that v is continuous on $\Gamma$. In addition, X is constant in $\theta$ and $\gamma$, so X is continuous. But each element of $H(w, \cdot;\cdot)$ is one of the cross products of X and v, so each element is continuous by Theorem 15.6 of Bartle (1964). For this reason, for each $w\in\mathcal{W}$, $H(w, \cdot;\cdot)$ is continuous on $\Theta\times\Gamma$. As to condition (g), $\|H(w_i, \theta;\gamma)\|^2 = \sum_{j=1}^{(l+1)(G+1)}\sum_{h=1}^{(l+1)(G+1)}(X_{ji}X_{hi})^2 + \sum_{m=1}^{(G+1)^2}\sum_{n=1}^{(G+1)^2}(v_{mi}v_{ni})^2 + 2\sum_{j=1}^{(l+1)(G+1)}\sum_{n=1}^{(G+1)^2}(X_{ji}v_{ni})^2$, where $\|\cdot\|$ is the Euclidean norm, $X_{ji}$ is the $j$th element of $X_i$, and $v_{mi}$ is the $m$th element of $v_i$. Hence, $E[\|H(w_i, \theta;\gamma)\|^2] = \sum_{j=1}^{(l+1)(G+1)}\sum_{h=1}^{(l+1)(G+1)}E(X_{ji}X_{hi})^2 + \sum_{m=1}^{(G+1)^2}\sum_{n=1}^{(G+1)^2}E(v_{mi}v_{ni})^2 + 2\sum_{j=1}^{(l+1)(G+1)}\sum_{n=1}^{(G+1)^2}E(X_{ji}v_{ni})^2$. But $X_i$ contains only binary random variables and their products with the $x_{ti}$'s. Therefore, assuming $E(x_{ti}x_{ri})^2 < \infty$ for $t, r = 1, 2, \ldots, l$, $E(X_{ji}X_{hi})^2 < \infty$ for $j, h = 1, 2, \ldots, (l+1)(G+1)$. In addition, note that $v_i$ contains only the products of binary random variables with $\log(\Lambda_{hi})$ and $M_{hi}$ for $h = 0, 1, \ldots, G$ and $i = 1, 2, \ldots, N$. But, from condition (f) of Th.1.1, for each $g\in\{0, 1, \ldots, G\}$, $|\log(\Lambda_{gi})|^2 \le [|\log(1/(G+1))| + \sum_{j=0}^{G}\|z'_i\|\|\gamma_j\| + (1/2!)\sum_{j=0}^{G}\|z'_i\|^2\|\gamma_j\|^2 + \sum_{j=0}^{G-1}\sum_{r>j}\{\sum_{t=1}^{k}\sum_{h=1}^{k}|z_{ti}z_{hi}||\gamma_{jt}||\gamma_{rh}|\}]^2$. Therefore, assuming $E(z^4_{ti}z^4_{hi}) < \infty$, $E(z^4_{fi}z^2_{ti}z^2_{hi}) < \infty$, $E[z^2_{ti}z^2_{hi}(\sum_{f=1}^{k}z^2_{fi})^{1/2}|z_{li}z_{mi}|] < \infty$, $E[z^2_{ti}z^2_{hi}z^2_{fi}|z_{li}z_{mi}|] < \infty$, and $E[z_{si}|z_{ti}z_{hi}|(\sum_{f=1}^{k}z^2_{fi})^{1/2}|z_{li}z_{mi}|] < \infty$ for $f, h, l, m, s, t = 1, 2, \ldots, k$, we have $E(\log(\Lambda_{gi}))^4 < \infty$.
Furthermore, $M_{hi} = [\Lambda_{hi}\log(\Lambda_{hi})]/(1 - \Lambda_{hi})$, and the first derivative of $M_{hi}$ with respect to $\Lambda_{hi}$ is $\partial M_{hi}/\partial\Lambda_{hi} = (1 - \{\Lambda_{hi} - \log(\Lambda_{hi})\})/(1 - \Lambda_{hi})^2 < 0$, since $0 < \Lambda_{hi} < 1$. Note that $\lim_{\Lambda_{hi}\to 0}M_{hi} = 0$ and $\lim_{\Lambda_{hi}\to 1}M_{hi} = -1$ by L'Hopital's rule, so $|M_{hi}| < 1$, which implies $E(M_{hi})^4 < \infty$ and $E(M_{hi}M_{gi})^2 < \infty$, $\forall h, g\in\{0, 1, \ldots, G\}$. Since binary random variables are mutually exclusive, it follows that $E(v_{mi}v_{ni})^2 < \infty$ for $m, n = 1, 2, \ldots, (G+1)^2$. In addition, $|x_{si}\log(\Lambda_{gi})|^2 \le x^2_{si}[|\log(1/(G+1))| + \sum_{j=0}^{G}\|z'_i\|\|\gamma_j\| + (1/2!)\sum_{j=0}^{G}\|z'_i\|^2\|\gamma_j\|^2 + \sum_{j=0}^{G-1}\sum_{r>j}\{\sum_{t=1}^{k}\sum_{h=1}^{k}|z_{ti}z_{hi}||\gamma_{jt}||\gamma_{rh}|\}]^2$ for $s = 1, 2, \ldots, l$ and $g = 0, 1, \ldots, G$. Therefore, assuming $E[x_{si}z_{ti}z_{hi}]^2 < \infty$, $E[x^2_{si}(\sum_{f=1}^{k}z^2_{fi})^{1/2}|z_{ti}z_{hi}|] < \infty$, and $E[x^2_{si}z^2_{fi}|z_{ti}z_{hi}|] < \infty$ for $f, h, t = 1, 2, \ldots, k$, $E(X_{ji}v_{ni})^2 < \infty$ for $j = 1, 2, \ldots, (l+1)(G+1)$, $n = 1, 2, \ldots, (G+1)^2$, and $i = 1, 2, \ldots, N$. Since $E(X_{ji}X_{hi})^2 < \infty$, $E(v_{mi}v_{ni})^2 < \infty$, and $E(X_{ji}v_{ni})^2 < \infty$, $E[\|H(w_i, \theta;\gamma)\|] < \infty$, $\forall(\theta, \gamma)\in\Theta\times\Gamma$ by Jensen's inequality.

As for condition (h), $A_o \equiv E[H(w_i, \theta_o;\gamma^*)] = E[h'_i h_i]$, where $h_i = (X_i, v_i)$. Then, for all nonzero row vectors $u'\in\mathbb{R}^M$, the quadratic form $uA_o u' = E(uh'_i h_i u') = E[\sum_{j=1}^{M}\{h_{ji}u_j\}^2] = \sum_{j=1}^{M}E[\{h_{ji}u_j\}^2] = \sum_{j=1}^{M}E[h^2_{ji}]u^2_j$. Since there is at least one $u_j\ne 0$, $uA_o u' > 0$, assuming that $E[h^2_{ji}] > 0$ for that $j$th variable in $h_i$. Hence, $A_o$ is positive definite. For condition (i),
$$\nabla_\gamma s(w, \theta;\gamma) = -\begin{pmatrix} 0 \\ \nabla_\gamma v' \end{pmatrix}(y - X\delta - v\lambda) + \begin{pmatrix} X' \\ v' \end{pmatrix}\lambda'(\nabla_\gamma v'),$$
and $\nabla_\gamma v'$ is composed of four types of variables: $-d_g(1 - \Lambda_g)z$, $d_g\Lambda_h z$, $\{d_g[\log(\Lambda_h) + 1 - \Lambda_h]\Lambda_h z\}/(1 - \Lambda_h)$, and $\{d_g[-M_h - \Lambda_h]\Lambda_i z\}/(1 - \Lambda_h)$, for $h, g, i = 0, 1, \ldots, G$ and $g, i\ne h$, as in condition (d). Using the arguments from condition (d) of Th.1.1, we know that, for each $(\theta, \gamma)\in\Theta\times\Gamma$, $\Lambda_g$, $\log(\Lambda_g)$, and $M_g$ are all continuous on $\mathcal{W}$. Using Theorems 15.6 and 15.8 of Bartle (1964) and the $d_g$'s being binary random variables, each element of both $\nabla_\gamma v'$ and v is continuous on $\mathcal{W}$. Therefore, it is clear that, for each $(\theta, \gamma)\in\Theta\times\Gamma$, $\nabla_\gamma s(\cdot, \theta;\gamma)$ is continuous on $\mathcal{W}$. Then, by Theorem 13.2 of Billingsley (1995), for each $(\theta, \gamma)\in\Theta\times\Gamma$, $\nabla_\gamma s(\cdot, \theta;\gamma)$ is a Borel measurable function on $\mathcal{W}$. Condition (j) holds because of the very same arguments utilized in condition (i): we just need to replace "for each $(\theta, \gamma)\in\Theta\times\Gamma$" by "for each $w\in\mathcal{W}$" and "on $\mathcal{W}$" by "on $\Theta\times\Gamma$." As for condition (k), consider the Euclidean norms of both $(X\;\; v)'$ and $\nabla_\gamma v'$. First, $\|(X\;\; v)'\|^2 = \sum_{g=0}^{G}d^2_g + \sum_{g=0}^{G}\sum_{t=1}^{l}d^2_g x^2_t + \sum_{g=0}^{G}d^2_g\log^2(\Lambda_g) + \sum_{h=0}^{G}\sum_{g\ne h}d^2_g M^2_h < (G+1) + (G+1)\sum_{t=1}^{l}x^2_t + \sum_{g=0}^{G}\log^2(\Lambda_g) + G\sum_{h=0}^{G}M^2_h < (G+1)^2 + (G+1)\sum_{t=1}^{l}x^2_t + \sum_{g=0}^{G}\log^2(\Lambda_g)$, where the first inequality is due to the $d_g$'s being binary, and the second inequality is due to $|M_h| < 1$. Before bounding $\|\nabla_\gamma v'\|^2$, define $f(\Lambda_h) \equiv [\log(\Lambda_h) + 1 - \Lambda_h]\Lambda_h/(1 - \Lambda_h)$, so $\partial^2 f(\Lambda_h)/\partial\Lambda^2_h = [1/\Lambda_h - \Lambda_h + 2\log(\Lambda_h)]/(1 - \Lambda_h)^3 > 0$, where $\partial[1/\Lambda_h - \Lambda_h + 2\log(\Lambda_h)]/\partial\Lambda_h = -(\Lambda_h - 1)^2/\Lambda^2_h < 0$ and $\lim_{\Lambda_h\to 1}[1/\Lambda_h - \Lambda_h + 2\log(\Lambda_h)] = 0$. Moreover, note that, by L'Hopital's rule, $\lim_{\Lambda_h\to 1}f(\Lambda_h) = 0$ and $\lim_{\Lambda_h\to 0}f(\Lambda_h) = 0$. Since $f(\Lambda_h)$ is convex on (0, 1) and $f(1/2)\approx -.1932$, there must exist a real number $\Lambda_{h_o}\in(0, 1)$ at which $f(\Lambda_h)$ achieves its finite minimum $c\in\mathbb{R}$, which implies $|f(\Lambda_h)| \le |c|$. Similarly, define $g(\Lambda_h) \equiv (-M_h - \Lambda_h)/(1 - \Lambda_h) = -\Lambda_h[\log(\Lambda_h) + 1 - \Lambda_h]/(1 - \Lambda_h)^2$. Hence, $\partial g(\Lambda_h)/\partial\Lambda_h = [\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2]/(\Lambda_h - 1)^3$.
To show $\partial g(\Lambda_h)/\partial\Lambda_h > 0$, consider $\partial[\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2]/\partial\Lambda_h = \log(\Lambda_h) + 1/\Lambda_h - 1$ and $\partial^2[\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2]/\partial\Lambda^2_h = (1/\Lambda_h)(1 - 1/\Lambda_h) < 0$, which implies that $\partial[\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2]/\partial\Lambda_h$ is strictly decreasing. But $\lim_{\Lambda_h\to 1}\partial[\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2]/\partial\Lambda_h = 0$, so $\partial[\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2]/\partial\Lambda_h > 0$, and thereby $[\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2]$ is strictly increasing. But $\lim_{\Lambda_h\to 1}[\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2] = 0$, so $[\Lambda_h\log(\Lambda_h) + \log(\Lambda_h) - 2\Lambda_h + 2] < 0$. This last strict inequality suggests that $\partial g(\Lambda_h)/\partial\Lambda_h > 0$, and $g(\Lambda_h)$ is strictly increasing. But, by L'Hopital's rule, $\lim_{\Lambda_h\to 1}g(\Lambda_h) = 1/2$ and $\lim_{\Lambda_h\to 0}g(\Lambda_h) = 0$. As a result, $|g(\Lambda_h)| < 1/2$. Now consider
$$\begin{aligned} \|\nabla_\gamma v'\|^2 &= \sum_{g=0}^{G}\sum_{t=1}^{k}d^2_g(1 - \Lambda_g)^2 z^2_t + \sum_{h=0}^{G}\sum_{g\ne h}\sum_{t=1}^{k}d^2_g\Lambda^2_h z^2_t + \sum_{h=0}^{G}\sum_{g\ne h}\sum_{t=1}^{k}d^2_g\{[\log(\Lambda_h) + 1 - \Lambda_h]\Lambda_h/(1 - \Lambda_h)\}^2 z^2_t \\ &\quad + \sum_{h=0}^{G}\sum_{g\ne h}\sum_{i\ne h}\sum_{t=1}^{k}d^2_g[(-M_h - \Lambda_h)/(1 - \Lambda_h)]^2\Lambda^2_i z^2_t \\ &< (G+1)^2\sum_{t=1}^{k}z^2_t + c^2(G+1)G\sum_{t=1}^{k}z^2_t + \frac{1}{4}(G+1)G^2\sum_{t=1}^{k}z^2_t = (G+1)\Big[(G+1) + c^2 G + \frac{G^2}{4}\Big]\sum_{t=1}^{k}z^2_t, \end{aligned}$$
where the inequality is due to the facts that $d_g$ is binary, $|\Lambda_g| < 1$, $|f(\Lambda_h)| \le |c|$, and $|g(\Lambda_h)| < 1/2$. Now consider $\|\nabla_\gamma s(w_i, \theta;\gamma)\| \le \|(0\;\; \nabla_\gamma v')'\||y - X\delta - v\lambda| + \|(X\;\; v)'\|\|\lambda'\|\|\nabla_\gamma v'\|$ by the triangle inequality and Definition 7.45 and Example 7.46 of Laub (2005). By using the bounds above for $\|(X\;\; v)'\|^2$ and $\|\nabla_\gamma v'\|^2$, $\|\nabla_\gamma s(w_i, \theta;\gamma)\|$