ESSAYS ON AVERAGE TREATMENT EFFECTS

By

Myounggin Keay

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Economics

2012

ABSTRACT

ESSAYS ON AVERAGE TREATMENT EFFECTS

By Myounggin Keay

This dissertation consists of three essays on estimating average treatment effects (ATE) under the counterfactual framework.

In Chapter 1, I compare the performance of single-step and two-step estimators of the ATE in a linear model when treatment assignment depends on unobservables. Recent advances in computing technology have enabled the extensive use of single-step estimators, such as limited information maximum likelihood (LIML), instead of 2SLS. In this study I make clear that there are two kinds of single-step estimators of the ATE. The LIML-type estimator uses the control function method, on which the two-step method is also based, whereas the FIML-type estimator directly uses the joint distribution of the underlying errors or endogenous variables. I find that the relative asymptotic efficiency of the two-step Heckit and the single-step LIML cannot be ranked in general. However, the relative efficiency of single-step LIML with respect to two-step Heckit decreases as the sample size increases, implying that even if the asymptotic variances are the same, single-step LIML is less efficient in finite samples. The FIML estimator, on the other hand, tends to have very small finite sample variances, but it is less robust to misspecification. Newey-type series estimators are also considered for correcting misspecification of the error distributions, but it turns out that the cost is greater than the benefit. In cases with many weak instruments, the advantage of LIML in terms of median bias is not as strong as in linear models.
Chapter 2 explores the ATE estimator based on Terza (2009)'s Nonlinear Full Endogenous Treatment (NFES) model, where a count dependent variable and a binary treatment variable are present. When the true conditional mean function takes the exponential form, the Heckit-type linear method, while it can be a good approximation, is inconsistent for the true ATE since it is derived under the assumption of a linear conditional mean. The asymptotic distributions of the nonlinear estimators have additional terms in the asymptotic variance whose magnitudes depend on the population coefficients. Due to their presence, the asymptotic variances of the nonlinear estimators can be either larger or smaller than those of their linear counterparts, depending on the values of the coefficients. It turns out that they tend to have small variances when the variance of the conditional ATE is small, and Monte Carlo experiments show that they are fairly robust to various distributional misspecifications. In summary, nonlinear ATE estimators are robust and consistent, with small variance when the treatment effects are not substantially different across individuals. An application to Botswana fertility is given, where the treatment is seven years of education and the dependent variable is fertility.

Chapter 3 presents a method for estimating the ATE when the dependent variable is a count variable and the coefficients on the covariates are random variables correlated with the binary treatment variable. The identifying assumptions are given, and the estimating equation based on them is derived. Simulations show that, in large samples, the estimator usually has smaller biases and larger variances than the linear methods. An application to Botswana fertility is given with the same variables as in Chapter 2.

Copyright by MYOUNGGIN KEAY 2012

To Jiyoung and Un-hyeng

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to my advisor Jeff Wooldridge.
Without his support and guidance, this dissertation would never have come into existence. I also extend my appreciation to Todd Elder, Peter Schmidt, Tim Vogelsang and Hira Koul. Their comments and insights have helped me a great deal whenever I was stuck on obstacles. I also thank Otávio Bartalotti, Guojun Chen, Cheol Keun Cho, Do Won Kwak, Jin Young Lee, Shengwu Shang, Valentin Verdier, Yali Wang and all my other colleagues at Michigan State University.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1  Alternative Estimators of Average Treatment Effect under Misspecification and Weak Instrumental Variables
  1.1 Introduction
  1.2 Model
  1.3 Estimation
    1.3.1 Two-step Heckit and Single-Step Quasi-LIML
    1.3.2 Series Estimator
    1.3.3 Quasi-FIML
  1.4 Asymptotic Variances
  1.5 Simulation Design
  1.6 Simulation Results
    1.6.1 Comparison between LIV and Heckit1
    1.6.2 Comparison between Heckit2 and QLIML
    1.6.3 Comparison between Heckit2 and Series estimator
    1.6.4 QFIML
    1.6.5 Comparison between LIV and Heckit1 under weak IV
    1.6.6 Comparison between LIV and L-LIML and between Heckit2 and QLIML under weak IV
  1.7 Conclusion

Chapter 2  Estimating Average Treatment Effect by Nonlinear Endogenous Switching Regression with an Application in Botswana Fertility
  2.1 Introduction
  2.2 Model
    2.2.1 Nonlinear Models
    2.2.2 Linear Models
  2.3 Estimation
    2.3.1 Full Information Maximum Likelihood
    2.3.2 Quasi-Maximum Likelihood Estimator
    2.3.3 Nonlinear Least Squares Estimator
    2.3.4 Weighted Nonlinear Least Squares Estimator
  2.4 Asymptotic Distributions
  2.5 Monte Carlo Simulation
    2.5.1 Data Generating Processes
    2.5.2 Main Results
    2.5.3 Some Other Results
  2.6 Empirical Application
  2.7 Conclusion

Chapter 3  A Two-Regime CRC Model with Nonnegative Dependent Variable
  3.1 Introduction
  3.2 Previous Literature
  3.3 Various Models of CRC
    3.3.1 Continuous Endogenous Variable
    3.3.2 Binary Endogenous Variable
  3.4 Nonlinear Two-Regime CRC Model
    3.4.1 Model
    3.4.2 Identification of ATE
  3.5 Specification Test
    3.5.1 Tests for Endogeneity
    3.5.2 Model Selection Test
  3.6 Monte Carlo Simulation
    3.6.1 Data Generating Processes
    3.6.2 Simulation Results
  3.7 An application: the effect of elementary school education on fertility in Botswana
  3.8 Conclusion

APPENDICES
  Appendix A  Simulation Results for Series Estimator
  Appendix B  Simulation Results for All Other Estimators
  Appendix C  Other Tables in Chapter 1
  Appendix D  Figures in Chapter 1
  Appendix E  Proofs in Chapter 2
    E.1 Proof of Proposition 1
    E.2 Derivation of Estimating Equation for NFES Model
    E.3 Derivation of Conditional Variance for WNLS Estimator
  Appendix F  Tables in Chapter 2
  Appendix G  Figures in Chapter 2
  Appendix H  Tables in Chapter 3

References

LIST OF TABLES

Table 1.1   Relationship among Estimators
Table 1.2   Distributional assumptions
Table A.1   Simulation results for series estimators with ρ = 0.0, ξ: normal
Table A.2   Results for ρ = 0.0, ξ: t(5)
Table A.3   Results for ρ = 0.0, ξ: χ²(5)
Table A.4   Results for ρ = 0.4, ξ: normal
Table A.5   Results for ρ = 0.4, ξ: t(5)
Table A.6   Results for ρ = 0.4, ξ: χ²(5)
Table A.7   Results for ρ = 0.5, ξ: normal
Table A.8   Results for ρ = 0.5, ξ: t(5)
Table A.9   Results for ρ = 0.5, ξ: χ²(5)
Table A.10  Results for ρ = 0.6, ξ: normal
Table A.11  Results for ρ = 0.6, ξ: t(5)
Table A.12  Results for ρ = 0.6, ξ: χ²(5)
Table B.1   Simulation Results: Strong IV and ρ = 0.0
Table B.2   Results for Strong IV and ρ = 0.4
Table B.3   Results for Strong IV and ρ = 0.5
Table B.4   Results for Strong IV and ρ = 0.6
Table B.5   Results for Weak IV and ρ = 0.0
Table B.6   Results for Weak IV and ρ = 0.4
Table B.7   Results for Weak IV and ρ = 0.5
Table B.8   Results for Weak IV and ρ = 0.6
Table C.1   Relative Efficiencies (Ratio of MSEs)
Table C.2   Ratio of estimations with Avar_QLIML < Avar_Two-step
Table F.1   Simulation Results for ρ = 0.4
Table F.2   Simulation Results for ρ = 0.5
Table F.3   Simulation Results for ρ = 0.6
Table F.4   Simulation Results for FIML estimator
Table F.5   Simulation Results for DGP 1
Table F.6   Simulation Results for DGP 2
Table F.7   Simulation Results for DGP 3
Table F.8   Simulation Results for DGP 4
Table F.9   Variables Description
Table F.10  Descriptive Statistics
Table F.11  Regression Results: selection equation
Table F.12  Regression Results: dependent variable children
Table F.13  Regression Results: dependent variable ced
Table F.14  Average Treatment Effects
Table H.1   Simulation Results for ρ = 0.3
Table H.2   Simulation Results for ρ = 0.5
Table H.3   Regression Results for Linear and NFES models
Table H.4   Regression Results for LTCRC and NTCRC models

LIST OF FIGURES

Figure 3.1  Analogy Structure
Figure D.1  Sample Cumulative Density Functions under Strong IV with ρ = 0.6 and obs = 1000
Figure D.2  Sample Cumulative Density Functions under Weak IV with ρ = 0.6 and obs = 1000
Figure G.1  Selected Monte Carlo Simulation Results for DGP 0
Figure G.2  Selected Monte Carlo Simulation Results for DGP 1
Figure G.3  Selected Monte Carlo Simulation Results for DGP 2
Figure G.4  Selected Monte Carlo Simulation Results for DGP 3

Chapter 1

Alternative Estimators of Average Treatment Effect under Misspecification and Weak Instrumental Variables

1.1 Introduction

Although single-step estimators such as LIML have long been recognized as a good alternative to the traditional 2SLS (Anderson et al., 1982), the computational burden has prevented them from being widely used in applied research. Rather, researchers have preferred two-step methods or, for some nonlinear models, have sought appropriate two-step estimation procedures in order to circumvent the computational difficulty (Greene, 1998; Maddala, 1986). Nevertheless, recent findings about LIML, along with advances in computing technology, have reinvigorated researchers' interest in these single-step estimators. Since the controversial Angrist and Krueger (1991), we are now more familiar with the properties and advantages of LIML estimators, especially when there are many weak instrumental variables (Bekker, 1994; Staiger and Stock, 1997; Flores-Lagunes, 2007).
Extending our attention to the broad class of single-step estimators to which LIML belongs, we see that single-step estimators have further advantages beyond the many weak IV case. First, skipping the first stage removes a source of sampling error. Second, as in bivariate probit models, there are cases where no proper consistent two-step method exists (Wooldridge, 2010), while a single-step estimator usually exists under appropriate distributional assumptions.

One of the leading examples of a single-step estimator is LIML in linear simultaneous equations models. In LIML, after the reduced form transformation, the structural equation and the linear projection of the endogenous explanatory variable are jointly estimated by maximum likelihood. Although 2SLS and LIML give different estimates, they can be interpreted as two different operational versions of the same control function (CF) method: LIML is identical to the single-step CF method, while 2SLS is identical to the two-step CF method in linear models. If a CF method is available for a particular nonlinear model, then the single-step and two-step CF estimating procedures can be understood as a generalized LIML and 2SLS, respectively, in such nonlinear models (Wooldridge, 2007). In line with this idea, LIML can be viewed as a joint ML estimation method based on the CF approach; this view provides greater generality, since LIML can then be applied to nonlinear models as well as the usual linear simultaneous equations models.

In linear simultaneous equations models LIML and FIML are mechanically the same; the only difference is that FIML is the term used when all the equations in the system are under consideration, whereas LIML is used for a single equation in the system (Hayashi, 2000). In the context of endogenous switching regression, FIML indicates an estimation method that directly models the endogenous variables without resorting to control functions (Maddala, 1986).
If the selection equation is viewed as a structural equation, such a method can be called FIML because the absence of the selection variable in the other structural equations makes the reduced form transformation unnecessary. There are therefore essentially two kinds of single-step estimators at hand: those with and those without a control function in the estimating equation. In this article, the former will be called LIML and the latter FIML.

Even though LIML is fully robust in linear models, one shortcoming of single-step estimators in nonlinear models is that they usually require strong distributional assumptions for constructing the likelihood function. They can be inefficient, or even inconsistent, unless the distributions are correctly specified. To address this problem, one can also consider distribution-free methods such as nonparametric or semiparametric approaches.

The objective of this article is to propose these various estimators and to investigate their performance in estimating the average treatment effect (ATE) under counterfactual settings. One of the most common situations requiring two-step estimation is the case where an explanatory variable is endogenous. In such a situation, a nonlinearity can easily be introduced by assuming a binary endogenous explanatory variable. Therefore the discussion starts from estimating the ATE in endogenous switching regression models with linear structural equations.¹ As will become clearer later, the partial effect of a binary endogenous variable is a special case of the ATE under some parameter restrictions. Although Angrist (2001) argues that the linear model is sufficiently good for partial effect estimation, this article attempts to justify the use of nonlinear models, which makes the whole argument more meaningful.

Section 2 describes the model under consideration.
There the relation between the partial effect of a binary endogenous variable and the ATE under the counterfactual setting will be clarified. Section 3 provides a detailed description of the four estimators, i.e., two-step Heckit, single-step (Q)LIML, single-step (Q)FIML and series estimators. Section 4 derives the asymptotic distributions of the two-step and single-step estimators with the same CF estimating equation. As it turns out, the asymptotic variance of one estimator is not always greater or smaller than that of the other; the asymptotic analysis therefore does not provide a guideline as to which estimator is more efficient. Section 5 briefly describes the data generating process used for the simulation, and Section 6 discusses the results. The focus of the analysis is the comparison between single-step and two-step estimators, mainly in terms of consistency and efficiency.

¹ The case of a nonlinear structural model is the topic of the second chapter of this dissertation.

1.2 Model

As briefly discussed in the introduction, the main objective of this paper is to compare estimators of the partial effects of binary variables, or of ATEs, depending on the model specification. Suppose that we are given a model with a binary endogenous variable. Although the partial effect can easily be estimated given a set of proper instrumental variables, the main question here is how to use the fact that the underlying endogenous variable is binary. While it is perfectly legitimate to use the usual IV procedure without paying attention to the binary nature of the variable, I attempt to make use of this information in order to obtain a better estimator. Therefore the discussion starts from the generic problem of estimating the partial effect of an endogenous explanatory variable. As will be clarified below, this issue gains generality by extending it to the topic of the ATE.
Below it will be shown that the ATE and the partial effect of a binary endogenous explanatory variable share a common estimating equation, although there are slight differences between them. Stated more precisely, it will be shown that estimating the partial effect of a binary variable is a special case of estimating the ATE. Consider the model

y = µ + τw + βx + u                                    (1.1)
w = zδ + ξ

In the above equations, w is the binary endogenous explanatory variable, τ is its partial effect, x is a vector of exogenous covariates, and z is the set of all exogenous variables, containing x, which are uncorrelated with u. In order for z to be a proper IV, it must hold that δ ≠ 0. There is no problem in estimating τ by the usual IV procedure. If the endogeneity of w is interpreted as the existence of unobserved variables c included in the error, then the partial effect τ can be written as E(y|x, c, w = 1) − E(y|x, c, w = 0). Now we want to estimate it better by somehow using the additional information, i.e., the binary nature of w. One natural possibility is to use a binary choice model for the reduced form equation instead of the linear probability model.

A clear account of the ATE is warranted at this point, before further discussion. Suppose a counterfactual setting with a binary choice variable: there exist two regimes for each unit of observation. The response variable in regime one is observed only when w = 1, while that in regime zero is observed only when w = 0. The exogenous variables x are always observed. The value of w is affected by the values of the response variables in each regime, which creates correlation between the regime-specific response variables and w. Thus the binary variable w becomes endogenous, and a model of this kind is sometimes called an endogenous switching regression. A classic example of such a model is found in job training program analysis.
In such a model the binary variable w denotes whether an individual takes the program or not, and the response variables y1 and y0 denote the wages with and without participation in the program (for a recent survey of program evaluation, see Imbens and Wooldridge (2009)). It is not hard to imagine the two regimes for each particular individual, even though the two response variables are never observed at the same time in the real world; in line with this idea, the model is called counterfactual (Rubin, 1974). For a particular individual, the difference between y1 and y0 is called the treatment effect. The expectation of the treatment effects over the whole population is called the average treatment effect, i.e., E(y1 − y0). Thus it is obvious that the ATE is not in general the same as the partial effect of the binary treatment variable, E(y|x, c, w = 1) − E(y|x, c, w = 0). An estimable equation for the ATE is derived below. Since there are two regimes,

y0 = µ0 + xβ0 + u0                                     (1.2)
y1 = µ1 + xβ1 + u1

In the above equations, the relation between x and y is assumed to be linear. For now, let us add the assumption that β0 = β1 and u0 = u1, which implies that the treatment has an intercept-shifting effect only. In other words, the treatment shifts the response variable by exactly the same magnitude for every individual in the population. In addition, let E(x) = 0. This is without loss of generality, because the intercept can always be adjusted so that E(x) = 0 holds. With these restrictions, it is easy to see that the ATE can be expressed as

E(y1 − y0) = µ1 − µ0

In equations (1.2), y1 denotes the response variable when w = 1 and y0 the response variable when w = 0; the observable response variable is y1 when w = 1 and y0 when w = 0. If we denote the observed response variable by y, then

y = y0 + (y1 − y0)w

Therefore, using the above identity, the two equations in (1.2) can be combined into a single equation.
y = µ0 + (µ1 − µ0)w + βx + u                           (1.3)

Thus the partial effect of w becomes the ATE under the restrictions above. Notice that equation (1.3) is the same as equation (1.1). Remembering that the restrictions β0 = β1 and u0 = u1 were imposed on equation (1.2) to obtain equation (1.3), we can easily see that equation (1.1) is just a special case of these models with counterfactual settings. Although we started from the special case of a binary endogenous explanatory variable, the rest of this article works in the broader setting of ATEs.

Returning to equation (1.2), let us consider how to identify the ATE in a model without any of these restrictions, using the fact that the treatment is binary. To do so, there has to be at least one IV for w. Rewriting equation (1.2) with the reduced form selection equation,

y0 = µ0 + xβ0 + u0
y1 = µ1 + xβ1 + u1                                     (1.4)
w = 1[zδ + ξ > 0]

It is easy to see that once we drop one of the regimes in the above equations, the model simply becomes the well-known Heckman correction model (Heckman, 1976). Therefore we can apply the Heckman correction twice, once for each regime, in order to consistently estimate the coefficients in each regime. Of course, all the assumptions of the Heckman correction model are also required here: z is mean independent of ug for g = 0, 1; E(ug|ξ) is linear in ξ; and, although not necessary for identification itself, w follows a probit model so that the inverse Mills ratio can be used. The estimating equation can therefore be written as

y = µ0 + (µ1 − µ0)w + xβ0 + wx(β1 − β0) + u0 + (u1 − u0)w
  = µ0 + τw + xβ0 + wx(β1 − β0) + ρ1 wλ(zδ) + ρ0 (1 − w)λ(−zδ) + e0 + (e1 − e0)w,   (1.5)

where λ(·) is the inverse Mills ratio and τ denotes the ATE.
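As a concrete illustration, the sketch below simulates a DGP of the form (1.4) (all coefficient values, sample sizes and variable names are illustrative assumptions, not taken from the text), shows that the naive difference in group means is contaminated by selection on ξ, and then applies the two-step procedure implied by equation (1.5): a probit first stage for δ, followed by OLS on w, x, wx and the two inverse Mills ratio terms.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 50_000

# Illustrative DGP of the form (1.4): mu0 = 0, mu1 = 1, beta0 = beta1 = 1,
# so the true ATE is mu1 - mu0 = 1.  E(u_g | xi) = 0.5 * xi creates endogeneity.
x = rng.normal(size=n)
z2 = rng.normal(size=n)                  # excluded instrument
xi = rng.normal(size=n)                  # selection error, N(0, 1)
u0 = 0.5 * xi + rng.normal(size=n)
u1 = 0.5 * xi + rng.normal(size=n)
w = (0.5 * x + 1.0 * z2 + xi > 0).astype(float)
y = (1.0 + x + u1) * w + (0.0 + x + u0) * (1 - w)   # y = y0 + (y1 - y0) w

# Naive group-mean difference: biased upward because w depends on xi.
ate_naive = y[w == 1].mean() - y[w == 0].mean()

# Step 1: probit of w on z = (1, x, z2) to estimate delta.
Z = np.column_stack([np.ones(n), x, z2])
def neg_probit_ll(d):
    v = Z @ d
    return -(w * norm.logcdf(v) + (1 - w) * norm.logcdf(-v)).sum()
idx = Z @ minimize(neg_probit_ll, np.zeros(3), method="BFGS").x

# Step 2: OLS of y on (1, w, x, wx, w*lambda(z d), (1-w)*lambda(-z d)).
mills = lambda v: np.exp(norm.logpdf(v) - norm.logcdf(v))  # stable phi/Phi
X2 = np.column_stack([np.ones(n), w, x, w * x,
                      w * mills(idx), (1 - w) * mills(-idx)])
tau_hat = np.linalg.lstsq(X2, y, rcond=None)[0][1]  # coefficient on w
print(ate_naive, tau_hat)
```

Since E(x) = 0 in this design, the coefficient on w is the ATE; the naive difference comes out substantially above the true value of 1, while the two-step estimate is close to it.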
The estimating procedure is basically the same as in the Heckman correction model: in the first stage, estimate the selection equation by probit, then substitute the estimate δ̂ for δ and run the second stage regression by ordinary least squares. If one believes that β0 = β1 and u0 = u1, then the estimating equation simply becomes

y = µ0 + τw + xβ0 + ρ( wλ(zδ) − (1 − w)λ(−zδ) ) + e0.  (1.6)

Running the two-stage regression of equation (1.6) is an alternative to the usual IV regression with a linear probability model for selection. Nevertheless, the following discussion will be based on equation (1.5), which has greater generality, rather than on (1.6).

1.3 Estimation

1.3.1 Two-step Heckit and Single-Step Quasi-LIML

The previous section established the relationship between the partial effect of a dummy endogenous variable and the ATE. In what follows, equation (1.5) is used as the basic estimating equation for most of the discussion. In equation (1.5) the errors eg are the differences between the structural errors ug and the correction terms for each regime. Three important assumptions were needed to write the model as in equation (1.5). The first two are the conditional mean independence of z and ug, and the linear conditional expectation between the errors of the two equations, i.e., E(ug|z, ξ) = E(ug|ξ) = ρg ξ. A sufficient condition for linearity is trivariate normality of (u0, u1, ξ); under that assumption the coefficient ρg is the covariance between ug and ξ. However, linear conditional expectation is enough for writing equation (1.5); the distribution of ug itself can be flexible and does not have to follow any particular distribution as long as it is continuous. The third assumption is that ξ follows a normal distribution.
This assumption is more important than the former ones, since it warrants the use of the inverse Mills ratio in the structural equation as well as the use of probit for the first stage estimation. Without it, the above representation is invalid, particularly the inverse Mills ratio terms; in that case one can consider semiparametric estimation methods instead of using the inverse Mills ratio as the correction term. The first stage can then also be estimated by a semiparametric binary choice model, as will be clarified later. The following propositions and corollary summarize the above discussion.

Proposition 1.1. Given equation (1.4),

E(y1|z, w = 1) = µ1 + xβ1 + [1 / Pr(w = 1|z)] ∫ ∫_{−zδ}^{∞} u1 p(u1, ξ|z) dξ du1

Proposition 1.2. Given equation (1.4), if E(u1|z, ξ) = E(u1|ξ) = ρξ, then

E(y1|z, w = 1) = µ1 + xβ1 + ρ ∫_{−zδ}^{∞} ξ p(ξ|ξ > −zδ) dξ

Corollary 1.1. If ξ ∼ N(0, 1) in addition to the assumptions of Proposition 1.2, then

E(y1|z, w = 1) = µ1 + xβ1 + ρλ(zδ),

where λ(·) is the inverse Mills ratio. Moreover, if u1 and ξ are bivariate normal, then ρ = cov(u1, ξ)/var(ξ).

One can use the two-step Heckit approach discussed above to estimate the ATE via equation (1.5). Easy to implement as it is, it could be inefficient once the first stage sampling error is accounted for. If we are willing to make some further assumptions, there exists a single-step ML estimation method using a likelihood function for eg, which can be constructed as follows. Denote the composite error by e = e0 + (e1 − e0)w, assume that it follows a standard normal distribution, and assume that the selection mechanism is probit. Then a joint density function of y and w can be written as

p(y, w|z) = p(y|w, z) · p(w|z)
          = [2π · var(e|w, z)]^(−1/2)
            · exp( −{y − µ0 − τw − xβ0 − wx(β1 − β0) − ρ1 wλ(zδ) − ρ0 (1 − w)λ(−zδ)}² / [2 · var(e|w, z)] )
            · [Φ(zδ)]^w · [1 − Φ(zδ)]^(1−w),          (1.7)
where the inverse Mills ratio is compactly denoted by λ(·). The conditional variance of each eg is given by Johnson and Kotz (1972) as

var(e0|w = 0, z) = σ0² − σ0ξ² λ(−zδ){λ(−zδ) − zδ}
var(e1|w = 1, z) = σ1² − σ1ξ² λ(zδ){λ(zδ) + zδ},

where σg² ≡ var(ug) and σgξ ≡ cov(ug, ξ). All the parameters are estimable by ML estimation of equation (1.7). One advantage of this approach is that there is no first stage sampling error to account for, which greatly simplifies the computation of the standard errors.

Let us now consider the normality assumption on e, one of the two assumptions used in equation (1.7). Our knowledge about e is in fact very limited, even though normality of e was assumed in order to construct a likelihood function. The composite error e was obtained by subtracting the Heckman correction terms from u. Although we can maintain that u is normal by the usual central limit theorem argument, it is very hard to maintain that it remains normal after being purged of the elements causing the endogeneity. Beyond that, one can surely suspect heteroskedasticity, because the correction terms are functions of the explanatory variables, and hence so is e. However, even when the likelihood is not correctly specified, we can expect at least a consistent estimator as long as the conditional expectation of y is correctly specified (Gourieroux et al., 1984). Ignoring the conditional variances, let us then simply use a standard normal distribution for p(y|w, z). Taking the natural log, equation (1.7) can be written as

ℓ(y, w|z) = −{y − µ0 − τw − xβ0 − wx(β1 − β0) − ρ1 wλ(zδ) − ρ0 (1 − w)λ(−zδ)}² / 2
            + w · ln[Φ(zδ)] + (1 − w) · ln[1 − Φ(zδ)]  (1.8)

Now suppose that the assumptions in Proposition 1.2 and Corollary 1.1 are all satisfied. Then the true parameter values maximize the expectation of the log-likelihood function in equation (1.8).
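The sample analogue of (1.8) can be maximized numerically over all parameters at once. Below is a minimal sketch under an assumed illustrative DGP of the form (1.4) with true ATE equal to 1; the probit warm start for the selection coefficients is purely a numerical device (the estimator itself is single-step), and all names and tuning values are assumptions of this sketch rather than the dissertation's actual implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 40_000

# Illustrative DGP consistent with (1.4); true ATE = mu1 - mu0 = 1.
x = rng.normal(size=n)
z2 = rng.normal(size=n)
xi = rng.normal(size=n)
u0 = 0.5 * xi + rng.normal(size=n)
u1 = 0.5 * xi + rng.normal(size=n)
w = (0.5 * x + 1.0 * z2 + xi > 0).astype(float)
y = (1.0 + x + u1) * w + (x + u0) * (1 - w)

Z = np.column_stack([np.ones(n), x, z2])

def mills(v):
    # numerically stable inverse Mills ratio phi(v)/Phi(v)
    return np.exp(norm.logpdf(v) - norm.logcdf(v))

def neg_qliml(theta):
    # theta = (mu0, tau, beta0, beta1 - beta0, rho1, rho0, delta)
    mu0, tau, b0, db, r1, r0 = theta[:6]
    idx = Z @ theta[6:]
    mean = (mu0 + tau * w + b0 * x + db * w * x
            + r1 * w * mills(idx) + r0 * (1 - w) * mills(-idx))
    ll = (-0.5 * (y - mean) ** 2
          + w * norm.logcdf(idx) + (1 - w) * norm.logcdf(-idx))
    return -ll.sum()

# Warm-start the selection coefficients with a probit fit.
def neg_probit(d):
    v = Z @ d
    return -(w * norm.logcdf(v) + (1 - w) * norm.logcdf(-v)).sum()
d0 = minimize(neg_probit, np.zeros(3), method="BFGS").x

res = minimize(neg_qliml, np.concatenate([np.zeros(6), d0]), method="BFGS")
tau_qliml = res.x[1]
print(round(tau_qliml, 2))
```

Given δ, the mean function is linear in the remaining six parameters, so the joint optimization is well behaved once the selection part starts near its probit estimate.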
If we label the true parameter values with an "o" subscript, then δ_o will surely maximize the expectation of the probit log-likelihood function. Also, under the assumption that E(y | w, z) = \mu_{0o} + \tau_o w + x\beta_{0o} + wx(\beta_{1o} - \beta_{0o}) + \rho_{1o} w\lambda(z\delta_o) - \rho_{2o}(1 - w)\lambda(-z\delta_o), the true parameters (\mu_{0o}, \tau_o, \beta_{go}, \rho_{go}, \delta_o) will maximize the expectation of the first term in (1.8). The latter statement is true because the normal density belongs to the linear exponential family. Since (\mu_{0o}, \tau_o, \beta_{go}, \rho_{go}, \delta_o) maximize the first term in (1.8) and δ_o does the same for the second and third terms, the whole set of parameters maximizes Eℓ(y, w | z), which guarantees consistency. The above argument makes it clear that consistency depends on the selection error being normally distributed, since normality enables us to write the correction terms as the inverse Mills ratio, which is part of the conditional expectation function of y. In other words, although e does not have to follow the standard normal distribution, the selection error has to be normally distributed in order to ensure consistency.

Table 1.1: Relationship among Estimators

            single-step CF    two-step CF
linear      LIML              2SLS
nonlinear   (Q)LIML           Heckit

A general form of the correction terms without normality but with the linearity assumption on ξ is shown in Proposition 1.2. Also, if the conditional expectation E(u_g | ξ) is a nonlinear function of ξ containing higher-degree terms, then some additional correction terms might be needed. As will be discussed in the following section, one can consider using nonparametric or semiparametric methods in such cases where it is hard to determine the validity of those assumptions. The estimation method using the above log-likelihood function (1.8) will be called quasi-LIML or QLIML. The qualifier quasi- is used here because the correct distribution of e is unknown.
It is LIML since the likelihood function is constructed from the joint density function of the response and endogenous explanatory variables, as in the linear simultaneous equations case. The only difference between the QLIML used here and the one in the linear model is that the former uses the joint density of a normal and a Bernoulli while the latter uses the multivariate normal.

1.3.2 Series Estimator

According to Proposition 1.2 and Corollary 1.1, the correction term takes the form of the inverse Mills ratio when E(u_g | ξ) is linear in ξ and ξ is normally distributed. If ξ is not, the correction term no longer takes the form of the inverse Mills ratio. If E(u_g | ξ) is a nonlinear function of ξ, then the correction term will have additional terms beyond the leading inverse Mills ratio. Although it is still possible to derive the appropriate formulae for the correction terms under each distributional assumption, we usually do not have any knowledge about the population precisely when the failure of the assumptions is suspected; in such cases one can instead use distribution-free methods. Rather than the fully parametric approach, a semiparametric method can also be applied for the correction terms. Here I follow the methods proposed by Newey, Powell, and Walker (1990), Newey (1994) and Newey (2009). The discussions in those papers are mainly about the semiparametric estimation of sample selection models where the β's in equation (1.2) are identified. They are semiparametric because the linear index zγ is always used as the argument of the control function. The estimation is carried out in two steps. In the first stage, the selection equation is estimated by nonparametric binary choice models as in Powell, Stock, and Stoker (1989), Ichimura (1993) and Klein and Spady (1993). I use the Klein-Spady estimator in this paper since it is the most efficient one among those. Given a first-stage estimate, a series estimator can be constructed in the second stage as below.
y = \mu_0 + \tau w + x\beta_0 + wx(\beta_1 - \beta_0) + w \sum_{p=1}^{P} \rho_p \psi(z\delta)^p + (1 - w) \sum_{p=1}^{P} \eta_p \psi(-z\delta)^p + \epsilon_0 + w(\epsilon_1 - \epsilon_0)   (1.9)

The errors appearing here are not the same as e_g in Heckit or in QLIML. They are essentially the sums of the series terms from P + 1 to infinity. The function ψ is a monotone transformation which makes the estimate less sensitive to outliers (Newey, 1994). For ψ, Newey (2009) proposes three monotone functions: the identity, the inverse Mills ratio and the CDF of the standard normal distribution. In the Heckman correction method the correction term in the observed subpopulation is the inverse Mills ratio λ(zγ) under the usual assumptions, whereas that in the other, unobserved subpopulation is −λ(−zγ). Then the unconditional expectation of the correction term over the whole population is equal to zero; otherwise the unconditional expectation of the error in the structural equation would not be zero, rendering the intercept estimator inconsistent. This condition is not automatically satisfied, as it is in the original Heckman correction model, when we apply the series estimator with powers of degree two or more or with a function ψ(·) other than the inverse Mills ratio. Therefore an adjustment that makes the expectation of the correction terms over the whole population zero is required for consistent estimation of the intercept and the ATE; it can be done by subtracting the sample means of the correction terms.

1.3.3 Quasi-FIML

Before considering this estimator, it should be emphasized that it is not in the class of control function approaches like the previous three estimators. The basic motivation for using a control function approach is that we suspect unobserved variables hidden in the error which are correlated with one of the regressors in the structural equation. Since they are unobserved, by definition, the distribution they follow is unknown. Indeed we have not put any distributional assumptions on u_g or e_g in the two-step Heckman model.
Even though we used the normality assumption for e_g in QLIML, we do not claim that it is truly normal, as the label QLIML indicates. Therefore putting a distributional assumption directly on the structural error u_g is not in line with our view of this problem. However, if the normality assumption happens to be true, then a FIML estimator that exploits the joint density function becomes a very attractive alternative with a correctly specified likelihood. If it is not correctly specified, then the FIML with a wrong distributional assumption on the errors becomes a quasi-FIML or QFIML that can still be used for estimating the ATE. Below is a description of FIML. Suppose the joint distribution of u_g and ξ is multivariate normal:

(u_0, u_1, \xi)' \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_0^2 & \sigma_{01} & \sigma_{0\xi} \\ \sigma_{01} & \sigma_1^2 & \sigma_{1\xi} \\ \sigma_{0\xi} & \sigma_{1\xi} & 1 \end{pmatrix} \right)

The variance of the selection error is set to one since the coefficients in the probit model can only be estimated up to a scale factor. Under this assumption, we want to derive the joint distribution of y and w conditional on the exogenous variables z. We can easily transform u into y since the transformation is linear. The transformation of ξ into w can be handled by the equation below.

f(u, w | z) = \left( \int_{-z\delta}^{\infty} g(u_1, \xi) \, d\xi \right)^w \cdot \left( \int_{-\infty}^{-z\delta} h(u_0, \xi) \, d\xi \right)^{1-w}   (1.10)

The functions g and h are the joint density functions of (u_1, ξ) and (u_0, ξ) conditional on z. For example, g can be obtained by integrating the irrelevant u_0 out of the joint distribution of all three variables. Here we can see that the trivariate normal assumption is a sufficient condition for our purpose; the above likelihood function shows that the pairwise bivariate distribution assumption for each (u_g, ξ) is all that is necessary, since we are hardly interested in estimating σ_{01}. The integrals on the right-hand side of the above equation are needed in order to get the marginal distributions of u_g on the events w = 1 and w = 0 respectively.
It can be shown that the above density function can be written as

f(u, w | z) = \left[ \Phi\left( \frac{z\delta + (\sigma_{1\xi}/\sigma_1^2) u_1}{\sqrt{1 - (\sigma_{1\xi}/\sigma_1)^2}} \right) \cdot \frac{1}{\sigma_1}\phi\left(\frac{u_1}{\sigma_1}\right) \right]^w \cdot \left[ \left\{ 1 - \Phi\left( \frac{z\delta + (\sigma_{0\xi}/\sigma_0^2) u_0}{\sqrt{1 - (\sigma_{0\xi}/\sigma_0)^2}} \right) \right\} \cdot \frac{1}{\sigma_0}\phi\left(\frac{u_0}{\sigma_0}\right) \right]^{1-w}   (1.11)

where Φ(·) and φ(·) denote the CDF and PDF of the standard normal distribution respectively. By using (1.2) the above joint density can easily be transformed into f(y, w | z). The estimator using the likelihood function given above is called FIML (Maddala, 1986). In linear simultaneous equations models FIML basically uses the joint distribution of all endogenous variables conditional on the exogenous ones; the only difference in this nonlinear setting is that the joint distribution of the endogenous variables is constructed without a reduced-form transformation. Since there is no guarantee that the errors follow a trivariate normal distribution, it should properly be called "quasi"-FIML. The ATE is estimated simply as the difference of the estimated intercepts in equation (1.2). If the true distribution is trivariate normal, then this estimator becomes FIML and is consistent and efficient. However, if that is not the case, then it is neither consistent nor efficient. Therefore FIML is less robust than Heckit or QLIML under various violations of the distributional assumptions. In sum, there are four available estimators for the ATE using instrumental variables in the counterfactual setting. These estimators were constructed under different distributional assumptions, which are summarized below. NA indicates that the underlying error term is not in the model. Though not shown in the table, it should be noted that Heckit and QLIML additionally require the linear conditional expectation condition between u and ξ.

Table 1.2: Distributional assumptions

         u       ξ        e
QFIML    normal  normal   NA
QLIML    none    normal   none
Heckit   none    normal   none
Series   none    none     NA
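The regime-wise density in (1.11) can be checked numerically: for each regime it is the marginal density of u_g weighted by the conditional selection probability, so summing over w and integrating over u must give total mass one, and the w = 1 branch must integrate to Pr(w = 1 | z) = Φ(zδ). A sketch, assuming var(ξ) = 1 and writing sig_xi for cov(u_g, ξ):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def qfiml_density(u, w, zd, sig, sig_xi):
    """One regime of the joint density f(u_g, w | z) in (1.11), under
    bivariate normality of (u_g, xi) with var(xi) = 1. zd is the probit
    index z*delta; sig = sd(u_g); sig_xi = cov(u_g, xi)."""
    a = (zd + (sig_xi / sig**2) * u) / np.sqrt(1 - (sig_xi / sig)**2)
    base = norm.pdf(u / sig) / sig                 # marginal density of u_g
    return norm.cdf(a) * base if w == 1 else (1 - norm.cdf(a)) * base
```

Since Φ(a) + {1 − Φ(a)} = 1 pointwise, the two branches recombine into the marginal density of u_g, which is the law-of-total-probability check used in the test below.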
1.4 Asymptotic Variances

In this section the asymptotic distributions of Heckit and QLIML, which are applications of the two-step and single-step CF methods respectively, are given. The asymptotic distribution of QFIML is straightforward, and that of the series estimator in the following discussion is derived by essentially the same method as for the two-step Heckit. In what follows a common structural log-likelihood function for both single-step and two-step methods will be used. The standard normal distribution is used as the quasi-likelihood of the structural equation for QLIML estimation. Heckit uses OLS in the second step; computing the least squares estimate is equivalent to using the standard normal distribution for e. Therefore one can simply use a common structural log-likelihood function to derive the asymptotic distribution of the two-step Heckit as well as the single-step QLIML. To keep the discussion as simple as possible, the parameters θ_1 and θ_2 are assumed to be scalars. Let us first consider the asymptotic distribution of QLIML. Let q^1(y | w, z; θ_1, θ_2) be the log-likelihood or objective function for the structural equation and q^2(w | z; θ_2) that for the reduced-form equation. In this particular endogenous switching regression setup, θ_1 = (τ, β')' and θ_2 = δ, but again the discussion below treats them as scalars for the sake of simplicity. Since the two sets of parameters are estimated together, the log-likelihood function for QLIML is expressed in the additive form q^1(y | w, z; θ_1, θ_2) + q^2(w | z; θ_2). Let q^i_j ≡ ∂q^i/∂θ_j and q^i_{jk} ≡ ∂²q^i/∂θ_j∂θ_k. Then

B_0 = \begin{pmatrix} E(q^1_1)^2 & E(q^1_1 q^1_2) \\ E(q^1_1 q^1_2) & E(q^1_2)^2 + E(q^2_2)^2 \end{pmatrix}, \qquad A_0 = \begin{pmatrix} E(q^1_{11}) & E(q^1_{12}) \\ E(q^1_{21}) & E(q^1_{22}) + E(q^2_{22}) \end{pmatrix}

and \sqrt{n}(\theta_n - \theta_0) \to_d N(0, A_0^{-1} B_0 A_0^{-1}). The asymptotic variance of θ_1 is the first diagonal element of this sandwich form, as below.
Avar \sqrt{n}(\theta_{1n} - \theta_{10}) = \frac{ E(q^1_1)^2 K^2 - E(q^1_{12})\{ 2E(q^1_1 q^1_2)K - E(q^1_{12})[E(q^1_2)^2 + E(q^2_2)^2] \} }{ \{ E(q^1_{11})K - [E(q^1_{12})]^2 \}^2 },

where K = E(q^1_{22}) + E(q^2_{22}) and θ_{1n} denotes the QLIML estimator of θ_1. A two-step approach maximizes q^1(y | w, z; θ_1, \tilde{θ}_2) given the first-stage estimate \tilde{θ}_2. Then it can be shown, using the result in Wooldridge (2010, Chapter 12), that

Avar \sqrt{n}(\tilde{\theta}_{1n} - \theta_{10}) = \frac{ E(q^1_1)^2 [E(q^2_{22})]^2 + [E(q^1_{12})]^2 E(q^2_2)^2 }{ [E(q^1_{11})]^2 [E(q^2_{22})]^2 },

where \tilde{θ}_{1n} is the second-stage estimate of θ_{10}. The only case where the two asymptotic variances coincide is when E(q^1_{12}) = 0, in which case one can estimate equation by equation without any interaction between θ_1 and θ_2 through the control functions. Since our model generally contains the interaction terms, the presence of a nonzero E(q^1_{12}) should be understood as the main cause differentiating the asymptotic variances of the single-step and two-step methods. The relative magnitudes of the two asymptotic variances cannot be determined in general. However, holding the other terms fixed, a higher value of E(q^1_{12}) makes the asymptotic variance of the single-step estimator decrease, since its denominator and numerator are polynomials in E(q^1_{12}) of degree four and two respectively. Thus for a higher value of E(q^1_{12}), the denominator dominates the numerator, yielding a smaller asymptotic variance. On the other hand the asymptotic variance of the two-step estimator increases smoothly with E(q^1_{12}). However, if E(q^1_{12}) is close to zero, then the fraction can be subject to an abrupt change even for a small change in E(q^1_{12}), making the comparison more difficult.

1.5 Simulation Design

Monte Carlo simulations were carried out in order to compare the finite sample performances of the four estimators under consideration.
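The two-regime switching design described next can be sketched in Python for the baseline case of independent standard normal errors (the design also uses t(5) and χ²(5) errors with nonzero correlations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.uniform(-10, 10, n)           # covariate
z = rng.binomial(1, 0.5, n)           # binary instrument
u0, u1, xi = rng.standard_normal((3, n))  # uncorrelated normal errors

y1 = 3 + 0.10 * x + u1                # treated regime
y0 = 2 + 0.15 * x + u0                # untreated regime
w = (1.4 - 0.05 * x - 0.3 * z + xi >= 0).astype(float)
y = y0 + w * (y1 - y0)                # observed outcome

ate = (y1 - y0).mean()                # true ATE is the intercept gap, 1
```

Since E(x) = 0, the sample average of y1 − y0 should be close to the intercept difference of one.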
The data generating process is

x \sim uniform[-10, 10]
z \sim Binomial(1, 1/2)
y_1 = 3 + 0.1x + u_1
y_0 = 2 + 0.15x + u_0
w = 1[1.4 - 0.05x - 0.3z + \xi \ge 0],

where u_g and ξ can follow the normal, centered t(5) or centered χ²(5) distribution with variances all normalized to unity.² Their correlations are set to 0.0, 0.4, 0.5 and 0.6. The ATE, the difference of the intercepts for the two regimes, is equal to one. To see the effect of weak instrumental variables, a set of 10 randomly created binary variables was used. Because the nonlinear selection function reduces the information in the domain through the indicator function, the effect of weak IV is not very perceptible unless a large number of weak IVs is put in the equation. To measure the quality of the IVs as a whole, the concentration parameter is used, i.e. E(δ'Q'Qδ)/L for y* = 1.4 − 0.05x − 0.3z + ξ, where the i-th row of Q is (1, x_i, z_i) and L is the number of instrumental variables. Then for sample size n = 100, the concentration parameter for y* is 10.575 when one relevant IV is used, and 1.0575 when ten irrelevant IVs are added. Vella and Verbeek (1999) already compared the performances of IV and CF under various error distributions, but what they essentially compared were two-step CF methods with and without using the binary nature of the treatment variable in the present context. And they investigated only what they called the restricted CF, which is the single-regime model such as equation (1.6). The contribution of this paper is that (1) both two-regime and one-regime models are considered, (2) both single-step and two-step estimation methods are investigated, and (3) misspecifications of both the selection and structural errors are allowed for.

² First and second moments alone do not uniquely determine the joint distribution of nonnormal random variables. The Stata code for this simulation will be provided upon request.

1.6 Simulation Results

As mentioned in the previous sections, we focus on four estimators, i.e.
Heckit, QLIML, Series and QFIML, whose estimable equations are (1.5), (1.8), (1.9) and (1.11) respectively. In addition to those two-regime estimations, we also discuss the one-regime procedures, i.e. the IV estimator and Heckit with a single regime, which use (1.1) and (1.6). In order to distinguish the Heckits in the two settings, they will be referred to as Heckit1 and Heckit2 for the one-regime and two-regime models respectively. Heckit1 is essentially the same as Heckit2 except that Heckit1 imposes the restriction that the partial effects of the covariates are equal across the two regimes. Also, the IV estimator using equation (1.1) will be called the linear IV estimator or LIV, to avoid confusion with the instrumental variables themselves. Thus the difference between LIV and Heckit1 is that the selection equation for LIV is the linear probability model while for Heckit1 it is probit. Of course one could consider optimal IV estimation using probit, but it will not be considered here. LIV is a two-step estimation method; when the same model is estimated by single-step LIML, it will be written as L-LIML in order to distinguish it from the QLIML above. Thus, for one-regime estimation, the focus is on the comparison between the linear and nonlinear control function methods, i.e. LIV and Heckit1. For two-regime estimation, the focus is on the comparisons among Heckit2, QLIML, Series, and QFIML. We discuss strong IV results in sections 6.1 through 6.4. Sections 6.5 and 6.6 are devoted to weak IV.

1.6.1 Comparison between LIV and Heckit1

To begin with, let us investigate the behavior of LIV and Heckit1 with strong IV under various misspecifications. Although Vella and Verbeek (1999) dealt with this issue, the discussion here is more extensive. The results are summarized in Tables B.1 through B.4. For each simulation session, the following summary statistics are provided: Monte Carlo mean, standard deviation, root mean squared error (RMSE), and median.
The simulation was run for different correlation values, 0.0, 0.4, 0.5 and 0.6, which makes it easy to find patterns in the behavior of the estimators, if any. Since there are three possible error specifications for both the structural and selection errors, we have nine combinations to consider. However, it turned out that the heavier-tailed cases generated by the t(5) distribution are very similar to the normal cases, and their results are omitted from the tables.³ The horizontal rows are for the three possible selection error specifications and the vertical columns are for the structural error. Let us first compare LIV and Heckit1 in terms of bias. One can see that the one-regime models are not particularly bad even though the true data generating process has two regimes. The two estimators also converge well to the true parameter value, which is 1, in sufficiently large samples. Heckit1 is valid under the three assumptions made clear in Proposition 1.2 and Corollary 1.1. Although the exogenous variable z is independent in the simulation design, the other two assumptions are designed to be violated except in the normal-normal case, where Heckit1 is truly legitimate. The conditional expectation will not be linear unless both errors follow the same distribution. In particular, if one of them is skewed while the other is symmetric, then the conditional expectation can be nonlinear. Given independent z, the other two assumptions fail unless the error combination is normal-normal. If those assumptions are not satisfied, then the control functions in the form of the inverse Mills ratio will be misspecified, ultimately causing inconsistency. The results in Tables B.1-B.4 clearly show this: regardless of correlation, when the selection errors are skewed, the finite sample biases of Heckit1 are greater than those of LIV. Such large biases are found only when the selection errors are skewed.

³ The results for t(5) will be provided upon request.
Moreover, they persist even when the sample size is large. The results also show that misspecified structural errors are not as consequential as misspecified selection errors. Since large finite sample biases are common among the cases with skewed selection errors, it appears that they are caused not so much by nonlinearity of E(u|ξ) as by nonnormality of ξ. Nevertheless the results are also affected by nonlinearity of E(u|ξ).⁴ One can also see the behavior of the estimates under various correlation values; given symmetric selection errors, as the correlation becomes larger, the magnitudes of the Heckit1 biases relative to those of LIV decrease. However, under skewed selection errors, the Heckit1 biases are larger than those of LIV regardless of the correlation. To summarize, LIV has smaller bias under skewed selection errors while Heckit1 does under symmetric ones; LIV is robust under misspecification. As for efficiency, Tables B.1-B.4 show that Heckit1 has a smaller Monte Carlo standard deviation than LIV, irrespective of the error specification. Also, in terms of RMSE, except for some cases where the selection errors are skewed, the RMSE of Heckit1 is smaller than that of LIV overall. Therefore we have good reason to use nonlinear models such as Heckit1 unless a nonsymmetric selection error is suspected. IV estimators do not necessarily have finite moments, which is reflected in the considerable magnitudes of the standard errors under small sample sizes. Therefore, as suggested by Angrist et al. (1999), it is very useful to examine the median or MAD as well as moment-based statistics such as the mean and standard deviation.

⁴ Even though I did not include the results in this article, I also created an error generating process where each error is normal but the conditional expectation is not linear, in order to see the pure effect of nonlinearity. The results show that nonlinearity of E(u|ξ) causes larger finite sample bias for Heckit1.
In terms of mean, Heckit1 has smaller bias than LIV except in the cases with skewed selection errors. In terms of median, on the contrary, the tendency is just the opposite: in most cases the median biases of LIV are smaller than those of Heckit1, and in particular they are smaller in all the cases with skewed selection errors. Nevertheless, Heckit1 is more efficient than LIV in terms of MAD, which agrees with the results for the standard deviation.

1.6.2 Comparison between Heckit2 and QLIML

Let us now compare the performances of Heckit2 and QLIML. The two estimators use the same estimating equation; the difference is that Heckit2 is estimated by a two-step procedure while QLIML is estimated in a single step. It is well known that in just-identified linear models 2SLS and LIML give numerically identical estimates (Anderson et al., 1982). If IV or 2SLS is understood as a special case of the control function method, then the comparison between Heckit2 and QLIML is essentially a comparison between 2SLS and LIML with a nonlinear selection function. However, as discussed in previous sections, unlike in linear models, when the selection equation is nonlinear it is impossible to determine in general whether the asymptotic variance of one estimator is the same as or greater than that of the other. Therefore it is not very meaningful to discuss efficiency simply by comparing the Monte Carlo standard deviations of the two estimators at a few sample sizes; one would very likely see a similar tendency in finite samples whenever the asymptotic variance of one estimator is substantially larger or smaller than that of the other. Rather, it is more interesting to find out how the relative efficiencies, i.e. the ratios of MSEs, behave across different sample sizes.
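The relative efficiency just described, the ratio of MSEs against a benchmark estimator, can be computed directly from the Monte Carlo draws. A small helper sketch (the function name is illustrative):

```python
import numpy as np

def relative_efficiency(draws_a, draws_b, true_val):
    """MSE of estimator A divided by the MSE of benchmark estimator B
    (here the Heckit), both computed over Monte Carlo replications.
    A value above one means the benchmark has the smaller MSE."""
    mse = lambda d: np.mean((np.asarray(d, dtype=float) - true_val) ** 2)
    return mse(draws_a) / mse(draws_b)
```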
For consistency of these two estimators, it is essential that the conditional expectation of y given x and w be correctly specified (Gourieroux et al., 1984). The question is basically whether the conditional expectation given x and w can be correctly specified using the inverse Mills ratio, once it is taken for granted that the conditional expectation given x is linear. This in turn rests on the assumption that the selection equation is probit. If the true model is not probit, then both Heckit2 and QLIML may be inconsistent. Although the two estimators might well be expected to be inconsistent under skewed selection errors, it turned out that they converge well to the true parameter value even under such misspecifications. Therefore misspecification of the selection error does not seem to be very critical. The simulation results for Heckit2 and QLIML are also in Tables B.1-B.4, and Figure D.1 displays the cumulative distribution functions for the case of ρ = 0.6 and n = 1000. According to the results, there are some cases where the Monte Carlo means do not converge to the true value monotonically. This can be explained in two ways. First, theoretically, in the definitions of any kind of convergence, including convergence in probability, the behavior of the early terms of the sequence is unimportant; nor is monotonicity ever required in the weak law of large numbers or the uniform weak law of large numbers. The second reason is practically more relevant. Since IV estimators do not necessarily have finite moments, they can produce many estimates of large magnitude, to which the numerical values of the mean or standard deviation can be very sensitive. Thus it should not be a major concern that the sequence of means does not converge monotonically to the true value. Let us now discuss the efficiency of the two estimators. The Monte Carlo standard deviation of QLIML is very large compared to that of Heckit2, all the more so under small sample sizes.
In addition, although QLIML is very inefficient under small sample sizes, it catches up with Heckit2 quickly. Table C.1 shows the relative efficiencies of selected pairs of estimators, including QLIML with respect to Heckit2. The relative efficiencies were computed as the MSE of each relevant estimator divided by the MSE of Heckit1 or Heckit2; thus a value greater than one implies that the MSE of Heckit1 or Heckit2 is relatively smaller. It can be seen from Table C.1 that the relative efficiencies decrease as the sample size increases. In other words, the single-step QLIML is less efficient particularly under small sample sizes. On the contrary, the estimated asymptotic variances in actual estimation give misleading information: in many cases the estimated asymptotic variances of the single-step QLIML are smaller than those of the two-step Heckit2. Table C.2 tabulates the proportion of estimations, out of 1000 repetitions, with a higher estimated asymptotic variance for the two-step Heckit2.⁵ According to Table C.2, roughly 50-60% of the estimations produce misleading information about the true sampling distribution; overall the single-step QLIML underestimates its asymptotic variance. As far as the median biases of Heckit2 and quasi-LIML are concerned, it is hard to find evidence in favor of either estimator; the magnitudes of their median biases are more or less the same. Like the standard deviation, the MAD also shows that QLIML is more dispersed than Heckit2.

⁵ This simulation was designed to generate the same pseudo-random numbers across the different estimators, enabling a direct comparison among them.

1.6.3 Comparison between Heckit2 and the Series estimator

A series estimator is expected to solve the following three major problems: allowing for nonlinear E(u|ξ), consistent estimation of the index in nonprobit settings, and thereby correction of the misspecified control function. There are various forms of series estimator, but here the three leading ones suggested by Newey (2009) were used.
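To make the recentering adjustment of Section 1.3.2 concrete, the second-stage correction terms of equation (1.9) can be built as power series in ψ of the estimated index, with each interacted column demeaned so the corrections average to zero over the whole sample. This is one way to implement the adjustment described there; ψ defaults to the standard normal CDF, one of Newey's (2009) three choices:

```python
import numpy as np
from scipy.stats import norm

def series_corrections(w, index, P, psi=norm.cdf):
    """Correction-term columns for the series regression (1.9):
    w * psi(index)^p and (1 - w) * psi(-index)^p for p = 1, ..., P,
    each demeaned so the corrections average to zero over the whole
    sample, as required for consistent estimation of the intercepts
    and the ATE. `index` is the first-stage index estimate z*delta_hat."""
    cols = []
    for p in range(1, P + 1):
        c1 = w * psi(index) ** p            # treated-regime terms
        c0 = (1 - w) * psi(-index) ** p     # untreated-regime terms
        cols.append(c1 - c1.mean())
        cols.append(c0 - c0.mean())
    return np.column_stack(cols)
```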
According to the results presented in Tables A.1-A.12, the estimators using the inverse Mills ratio have the smallest MSE. As for the degree of the power series of correction terms, the estimator that includes only the terms of degree one is the best in terms of MSE; a larger number of control function terms interacted with w creates more severe multicollinearity. Therefore the only difference between the best Series estimator and Heckit2 is that the former uses the Klein-Spady semiparametric estimator instead of probit in the first stage; they are alike in using the inverse Mills ratio of degree one. According to Proposition 1.2, under linearity of E(u|ξ) the control or correction terms can be expressed as a single-term function of the index. Corollary 1.1 implies that normality of ξ lets us write the correction terms as inverse Mills ratios. Without linearity of E(u|ξ), the correction terms should be expressed not as a single term but as a power series in some function of the index. However, Tables A.1-A.12 show that such methods create multicollinearity through the interaction with w, implying that the series estimators are not good alternatives for the cases where E(u|ξ) is nonlinear. Unfortunately, among the three motivations for using series estimators, the only one that the Newey-type estimators can address is the second, the consistent estimation of the index. The Series estimator is expected to do better under skewed selection errors, but there is no particular evidence that it gives smaller mean or median biases than Heckit2; in summary, the Klein-Spady semiparametric estimator combined with the inverse Mills ratio does not do better than the usual Heckit2 in terms of bias. There is a tendency for the MSE of the series estimator to be greater than that of Heckit2.
However, under small sample sizes, there are some cases where the MSE of the series estimator is smaller than that of Heckit2. Such cases are found more often as the correlation becomes higher.

1.6.4 QFIML

Heckit2, QLIML and the Series estimators are easy to compare with each other since they are all based on the same estimating equation. By contrast, QFIML is different in that the structural and selection errors are modeled directly without resorting to the control function method. QFIML uses the trivariate normal assumption; therefore if the specification is true then its asymptotic variance simplifies by the information equality and QFIML becomes FIML. On the other hand, QFIML fails to be consistent if the error distribution is misspecified. Tables B.1-B.4 show this point: when the structural error is nonnormal, the sequence of Monte Carlo means passes through the true parameter value and converges elsewhere. Under a normal structural error, QFIML converges to the true value relatively well. In other words, although a misspecified selection error does not have a strong effect, a misspecified structural error does cause inconsistency. Interestingly, the inconsistency is not as serious under zero correlation. As for efficiency, QFIML has the smallest variance among the four estimators. This fact is also reflected, and even exaggerated, in the asymptotic variance approximations; the estimated asymptotic variances are so small that the 95% coverage rates are abnormally lower than those of the other estimators. The coverage rate converges to zero as the sample size grows. This is true even when the errors are correctly specified. Therefore the QFIML estimator is not a desirable choice for testing unless bootstrap standard errors are used. To summarize, QFIML fails to be consistent under misspecified structural errors and has extremely small coverage rates.
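The 95% coverage rates discussed above are computed in the usual Monte Carlo way: the fraction of replications whose normal-based confidence interval contains the true parameter. A small sketch (the function name is illustrative):

```python
import numpy as np

def coverage_rate(estimates, ses, true_val, z=1.96):
    """Fraction of Monte Carlo replications whose interval
    [est - z*se, est + z*se] contains the true parameter value."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(ses, dtype=float)
    covered = (est - z * se <= true_val) & (true_val <= est + z * se)
    return covered.mean()
```

An estimator that underestimates its asymptotic variance, as QFIML does here, produces intervals that are too short and hence coverage well below the nominal 95%.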
1.6.5 Comparison between LIV and Heckit1 under weak IV

All the results for weak IV are presented in Tables B.5-B.8. The cumulative distribution functions for the case of ρ = 0.6 with n = 1000 are shown in Figure D.2. From the tables, it can be seen that the behaviors of LIV and Heckit1 are not very different from those under strong IV except in terms of bias. Under both strong and weak IV, if the selection error is skewed, then the linear model has smaller bias. However, under weak IV, even if the selection error is symmetric, the nonlinear model no longer has any noticeable advantage over the linear one, as it did under strong IV. Thus the linear model is more robust in the weak IV case. Now let us discuss the results on RMSE. First, under symmetric selection errors, the nonlinear model still has an advantage over the linear model in terms of RMSE, because the nonlinear model is more efficient in terms of standard deviation. Second, under skewed selection errors, the efficiency of the nonlinear model is not strong enough to give it an advantage over the linear one in terms of RMSE.

1.6.6 Comparison between LIV and L-LIML and between Heckit2 and QLIML under weak IV

As already discussed, we have LIV and L-LIML for the linear models and, as their nonlinear counterparts, Heckit2 and QLIML. Unless the correlation is zero, the median biases of L-LIML are smaller than those of LIV under all circumstances, regardless of sample size and error specification. On the other hand, the relationship between the median biases of Heckit2 and QLIML is not as simple as in the linear models. In particular, when ρ = 0.4, the median biases of QLIML are greater than those of Heckit2 under all circumstances, and this remains true in many cases when ρ = 0.5. Nevertheless there is a tendency for the QLIML median bias to become smaller than that of Heckit2 as the correlation becomes larger.
1.7 Conclusion

In finite samples, especially when the sample size is small, QLIML is relatively less efficient than the two-step Heckit2, suggesting that there is no efficiency gain from running a single-step procedure that mimics the two-step one. QFIML is more efficient than Heckit2; however, it is not robust to distributional misspecification. In terms of efficiency, Heckit2 is a middle ground between the two single-step estimators. Nevertheless, Heckit2 is preferable since it is not only easy to compute but also robust to arbitrary misspecifications. Newey-type series estimators were expected to do relatively better under misspecification, but turned out not to be good alternatives. Under weak many instruments, QLIML shows better performance in terms of median bias only when the degree of endogeneity is strong. Under weak endogeneity, there is no clear evidence of such an advantage.

Chapter 2

Estimating Average Treatment Effect by Nonlinear Endogenous Switching Regression with an Application in Botswana Fertility

2.1 Introduction

In order to estimate the treatment effect of a binary variable on a count dependent variable, Terza (1998, 2009) proposed nonlinear models that take the limited dependent variable into account. As an alternative to those fully nonlinear models, a traditional linear regression model with a probit treatment equation can also be used. Although it seems more sensible to apply the nonlinear models given a count outcome variable, the previous literature has not clearly stated the advantages and disadvantages of using nonlinear outcome models rather than simply applying linear methods to estimate the treatment effects. While the linear models implemented by Heckman's (1978) method are already well understood, a large part of the statistical properties of Terza's nonlinear approaches is still unknown.
The goal of this study is to explore the properties of the nonlinear approaches to estimating the treatment effects and to give guidance that might be useful in empirical analyses. Terza (1998) considers a model where the binary treatment variable shifts the intercept inside the exponential conditional mean function and provides estimating equations that can be implemented using observable variables. In later work, Terza (2008, 2009) extends the earlier model by incorporating the counterfactual framework, in which the treatment status puts the individual in a different regime. Following the terminology used in Terza (2009), the former model will be called throughout this paper the “Nonlinear Endogenous Treatment Model” (NET), and the latter the “Nonlinear Full Endogenous Switching Model” (NFES). As will be shown in subsequent sections, the NFES model is an extended version of NET in the sense that an appropriate restriction on the coefficients, along with a fairly weak assumption, readily makes NFES and NET equivalent. While NFES is relatively new, NET has acquired wide popularity among empirical economists.
For the last decade it has been applied to examine the effect of a founder CEO as incumbent on acquisition activity (Fahlenbrach, 2009), the effect of credit constraints on floating net aquaculture adoption in Indonesia (Miyata and Sawada, 2007), the effect of firms' voluntary pollution reduction programs on pollution (Innes and Sam, 2008; Sam, 2010), the effect of duplicate coverage on the demand for health care in Germany (Vargas and Elhewaihi, 2007), the effect of illicit drug use on emergency room utilization (McGeary and French, 2000), the effect of physician advice on alcohol consumption (Kenkel and Terza, 2001), the effect of insurance on the demand for health care (Koç, 2005), the effect of higher education on smoking (Miranda and Bratti, 2006), the effect of socio-economic factors on completed fertility (Miranda, 2003), the effect of Mexican families' migration to the US on women's domestic power (Parrado, Flippen and McQuiston, 2005; Parrado and Flippen, 2005), the effect of health maintenance organization plans on health care expenditure in the private sector (Shin and Moon, 2007) and the fertility differences between married and cohabiting couples (Zhang and Song, 2007), to name a few. Since most of the studies enumerated above use the NET model to measure the effect of binary variables, the validity of their conclusions may be put into question unless the single-regime restrictions are correct. One important exception is Koç (2005), which estimates two different structural equations, one for each value of the treatment variable. However, it mainly focuses on the equation in each regime and does not pay full attention to comparing the values of the dependent variables, which would lead to an ATE analysis.
Although the first papers proposing the ATE estimator based on the NFES model are Terza (2008, 2009), they only propose the possibility of such a methodology within a unifying framework with other nonlinear models, without fully discussing the properties of the ATE estimator compared to traditional approaches. This study will show that the ATE estimators based on the NFES model can have higher efficiency and smaller finite sample biases only under certain circumstances. The rest of the paper is organized as follows. Section 2 introduces various switching regression models such as NFES, NET, LFES and LET and discusses how the ATE can be identified in each model. Section 3 characterizes the asymptotic biases when the method being used does not reflect the true population. Section 4 describes various estimation methods for the NFES model. Section 5 is devoted to Monte Carlo simulations and discusses the results and implications. In Section 6, the proposed approach is applied to a real data set to estimate the ATE, and Section 7 presents the concluding remarks.

2.2 Model

In what follows, the term nonlinear is reserved exclusively for describing the nature of the dependent variable of the structural equation. In this count dependent variable setting, nonlinear models use a linear index transformed by the exponential function as their conditional expectation function. The linear models, on the other hand, are constructed as if the dependent variable were continuous.

2.2.1 Nonlinear Models

The “Nonlinear Endogenous Treatment Model” (NET) first proposed by Terza (1998) is as follows:

\[
E[y \mid x, w, \epsilon] = \exp(\alpha + x\beta + \gamma w + \epsilon), \qquad w = 1[z\delta + v > 0],
\]

where x is a 1 × K vector of covariates, w is the binary treatment variable and ϵ is unobserved heterogeneity. The vector of covariates x and the vector of exogenous variables z are assumed to be independent of the structural and selection errors. Usually x is a subset of z. The value of the treatment variable, i.e.
either one or zero, is determined by a binary choice model such as probit. The treatment equation says that the value of w is determined by the exogenous variables z and the selection error v: when their sum zδ + v is greater than zero, w equals one, and zero otherwise. If w were determined purely randomly, as in a randomized experiment, it would be independent of the unobserved heterogeneity ϵ and the regression would be simple and straightforward. However, when w is correlated with the unobserved heterogeneity, a usual estimation that does not control for the correlated error might suffer from an endogeneity problem in the estimation of γ. For example, when the number of children a woman has at the time of observation is the dependent variable y, it will be determined by her age, marital status and so on, which constitute the covariates x. The dependent variable will also be affected by the education status w, which is either zero or one depending on whether she has any education. Since the education status is determined by an individual's utility maximization, the factors that affect w might also affect y, creating endogeneity. Terza (1998) suggests an estimating equation in the form of a conditional mean function with a correction term that is conditioned only on observables. The above model, however, is restrictive in that it supposes a constant semi-elasticity of the dependent variable with respect to the treatment across all individuals in the population. This is related to the fact that the coefficients on the covariates and the unobserved heterogeneity are invariant under different treatment statuses. The model that extends the above one is proposed by Koç (2005) and Zhang and Song (2007) as below:

\[
E[y_g \mid x, w, \epsilon_1, \epsilon_0] = E[y_g \mid x, \epsilon_g] = \exp(\alpha_g + x\beta_g + \epsilon_g), \quad g = 0, 1, \tag{2.1}
\]
\[
w = 1[z\delta + v > 0],
\]

where different coefficients on the covariates and different unobserved heterogeneity are allowed depending on the treatment status.
In other words, the treatment status puts an individual in a different regime; if w = 1, she is in regime 1 with outcome y1, and similarly for the other regime. Presumably each individual has her own y0 and y1 for each treatment status, but one of them is not observed. The way to recover the unobserved counterfactual will be discussed later for estimation, but for the time being let us focus on the population model itself. If those two outcome variables were known, yi1 − yi0 would be the individual treatment effect. Since it might differ from person to person, we might want to know the averaged individual treatment effect E(yi1 − yi0), the so-called Average Treatment Effect (ATE). Incidentally, the individual semi-elasticity can be computed as (yi1 − yi0)/yi0, which need not be constant across individuals either. This extended Terza model will be called throughout this paper the “Nonlinear Full Endogenous Switching Model” (NFES). The quantity of interest will then be the ATE, which captures the causal effect of the treatment. Returning to (2.1), the first equality in the upper equation says that the conditional expectation of the dependent variable for each regime depends neither on the switching variable w nor on the unobservable for the other regime. The exclusion of w is particularly important; once the covariates and the unobservables ϵg are controlled for, knowledge of the realized regime does not provide any additional information about the conditional expectation of the dependent variables. In other words, the equality assumes the ignorability (Rubin, 1978) or unconfoundedness (Imbens, 2005) of w conditional on the covariates and unobservables. Although the treatment equation in (2.1) is expressed by a binary choice model, it is also possible to use the linear probability model, which is essentially a linear projection of w on z.
However, in the present model, the fact that the endogenous variable is binary is not neglected, so that an appropriate binary choice model is used. The implication is that the treatment equation can be viewed as a structural equation.¹ The equation that describes the regime switching mechanism can be modeled by any binary choice model, but here let us assume it is governed by a probit model for the sake of simplicity. The robustness of this assumption will also be discussed later. Now let the errors in the outcome and treatment equations be denoted by ϵ and v and follow a trivariate normal distribution as below:

\[
\begin{pmatrix} \epsilon_0 \\ \epsilon_1 \\ v \end{pmatrix}
\sim N\!\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_0^2 & \cdot & \rho_0\sigma_0 \\
\cdot & \sigma_1^2 & \rho_1\sigma_1 \\
\rho_0\sigma_0 & \rho_1\sigma_1 & 1
\end{pmatrix} \right)
\]

¹ In this respect, the one-step estimator that simultaneously maximizes the outcome and treatment equations is called the Full Information Maximum Likelihood estimator.

This assumption is sufficient for each error to follow a normal distribution. If there is no correlation between ϵ and v, the regime switching is entirely random. Unless the covariances equal zero, the regime choice will be determined partly by each individual's idiosyncrasies, which creates correlation between w and ϵ. The Heckman correction can be used to solve this endogeneity problem in the linear model where the dependent variable is continuous; the difference between the Heckman-corrected linear model and the current one is that the latter allows for a noncontinuous outcome distribution with an exponential CEF, while Heckit presupposes a continuous structural error whose conditional expectation is expressed as a linear function of v. Nevertheless, the basic situation is more or less the same. Under the above assumption, the ATE can be identified as below (Terza, 2009).
\[
ATE = E[y_1 - y_0] = E\big(E[y_1 \mid x] - E[y_0 \mid x]\big)
= E\Big[\exp(\alpha_1 + \sigma_1^2/2 + x\beta_1) - \exp(\alpha_0 + \sigma_0^2/2 + x\beta_0)\Big] \tag{2.2}
\]

Thus an estimate can be computed by the sample analogue method, i.e.,

\[
\widehat{ATE} = N^{-1}\sum_{i=1}^{N}\Big[\exp(\hat\alpha_1 + \hat\sigma_1^2/2 + x_i\hat\beta_1) - \exp(\hat\alpha_0 + \hat\sigma_0^2/2 + x_i\hat\beta_0)\Big].
\]

As will be shown in equation (2.4), the composite intercepts αg + σg²/2 are identified without separately identifying αg and σg²; the sample analogue above plugs in estimates of these composite intercepts. Incidentally, the Average Treatment Effect on the Treated (see p. 906, Wooldridge, 2010) is computed by

\[
\widehat{ATT} = \Big(\sum_{i=1}^{N} w_i\Big)^{-1}\sum_{i=1}^{N} w_i\Big[\exp(\hat\alpha_1 + \hat\sigma_1^2/2 + x_i\hat\beta_1) - \exp(\hat\alpha_0 + \hat\sigma_0^2/2 + x_i\hat\beta_0)\Big].
\]

The NFES model discussed so far nests the NET model shown at the very beginning of this section. By imposing the restrictions β0 = β1 and ϵ0 = ϵ1, the two outcome equations in NFES can be combined and written as

\[
E[y \mid x, w, \epsilon] = \exp\big(\alpha_0 + (\alpha_1 - \alpha_0)w + x\beta + \epsilon\big),
\]

where y = y0 + w(y1 − y0). The NET model, although claimed to be a switching regression in Terza (1998), does not clearly incorporate two distinct regimes; the regime changes according to the value of the binary variable, but switching is expressed only by shifting the intercept term inside the exponential function. In a linear model, this is similar to the case where the coefficients on the covariates are identical across the two regimes except for the intercept. Thus it is recommended to run the NFES model first; the NET restriction is preferable only if a test fails to reject the hypothesis β1 = β0. In the NET model, the parameter of interest is usually the coefficient on w, i.e., α1 − α0, whose interpretation is the semi-elasticity of y with respect to the treatment variable. This is distinct from the ATE, which is what we are interested in in many cases; the ATE must be computed as in equation (2.2).
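As a concrete illustration of the sample-analogue formulas above, the following sketch (my own, not from the dissertation; all parameter values are made up) computes the ATE and ATT estimates given estimated composite intercepts a_g = α_g + σ_g²/2 and slopes b_g:

```python
import numpy as np

# Sample-analogue ATE and ATT as in equation (2.2) and the ATT formula:
# average the conditional treatment effect over everyone (ATE) or over the
# treated only (ATT). a1, a0 denote the composite intercepts; all numeric
# values below are hypothetical, purely for illustration.
def ate_att(x, w, a1, b1, a0, b0):
    diff = np.exp(a1 + x @ b1) - np.exp(a0 + x @ b0)  # conditional ATE at each x_i
    return diff.mean(), diff[w == 1].mean()           # (ATE-hat, ATT-hat)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(500, 2))
w = rng.binomial(1, 0.5, size=500)
b = np.array([0.1, 0.2])
ate, att = ate_att(x, w, a1=0.6, b1=b, a0=0.1, b0=b)
```

With b1 = b0 the conditional effect (exp(0.6) − exp(0.1))·exp(xb) is positive for every observation, so both averages are positive; with identical parameters in both regimes, both estimates are exactly zero.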
2.2.2 Linear Models

Angrist (2001, 2010) and Angrist and Pischke (2009) have pointed out that in many cases a linear model may be sufficiently good for estimating the marginal effect in a model with a binary dependent variable. Angrist and Pischke (2009) also maintain the validity of such an approach even for general limited dependent variable models, on the grounds that the linear coefficient provides the linear projection coefficient, which might be very close to the actual causal effect. In line with that approach, the above endogenous switching model can be expressed in linear form as below, despite the nonlinear nature of the count dependent variable:

\[
y_g = \mu_g + x\beta_g + u_g, \quad g = 0, 1, \tag{2.3}
\]
\[
w = 1[z\delta + v > 0].
\]

Let the explanatory variables be demeaned; then the ATE is E[y1 − y0] = μ1 − μ0. We call this model the “Linear Full Endogenous Switching Model” (LFES), the linear counterpart of NFES. Just as the NFES model nests NET, LFES nests the “Linear Endogenous Treatment Model” (LET) under the restriction that β1 = β0 and u1 = u0, whereby the coefficient on w becomes the ATE, constant across all individuals. The treatment equation is modeled as a probit as usual. When the true model is such that the outcome variable is nonnegative, the outcome equations in the LFES model, i.e., the equations whose dependent variable is yg, cannot be viewed as the error form of a conditional expectation. Rather, each is the linear projection of yg on the covariates, and therefore E(yg) = μg since all the covariates are demeaned. The ATE is then the difference between the two intercepts. The problem is that there is no known strategy for identifying those intercepts when the true model is exponential. For example, one can try using the Heckman correction method (Heckman, 1978) for the outcome equation.
However, since the minimal condition is that E[u|x, v] = E[u|v] = ρv (Olsen, 1980), and the model does not satisfy the first equality under the exponential conditional mean, the LFES estimator with Heckman correction does not identify the true ATE. Although u and x are orthogonal by linear projection, they are not mean independent under the exponential conditional mean assumption. Another alternative linear approach is 2SLS, which does not explicitly model the two distinct regimes. As was shown by Angrist et al. (1996), the coefficient on w in the regression without any other covariates identifies the Local Average Treatment Effect (LATE). With covariates, the coefficient on w identifies a weighted average of the LATEs for each covariate cell (Hirano et al., 2000; Mealli et al., 2004). Although LATE will generally be different from ATE, Angrist and Pischke (2009) defend its usefulness on the grounds that it does not require any distributional assumptions and that its estimates end up being very similar to the ATE estimated by nonlinear modeling. In later sections, we will assess the validity of this claim through simulations. Instead of applying the 2SLS method, one can exploit the fact that the endogenous treatment variable is binary, so a probit model can be used for the first stage. Let us call this the “Linear Endogenous Treatment” (LET) model; it is obviously a special case of the LFES model with the restriction that β1 = β0 and u1 = u0 in equation (2.3).

2.3 Estimation

Various estimation methods for the NFES model are presented in this section, based on the estimating equations in Terza (1998).

2.3.1 Full Information Maximum Likelihood

When a binary choice model is used for regime switching, the treatment equation can be regarded as another structural equation. Specifically, we assume that the treatment variable follows a probit model and that the outcome variables follow a Poisson distribution.
Let the one-step ML estimation method under these assumptions be called Full Information Maximum Likelihood (FIML). One advantage of FIML is that it is the most efficient estimator, achieving the Cramér-Rao bound when the likelihood function is correctly specified. Also, unlike the other estimators discussed below, it estimates all the parameters separately in a single step. (In the other estimators, the structural intercept and the covariance between ϵ and v are not separately identified or, even if they are, further steps are required.) Despite these advantages, it is computationally burdensome; it might not numerically converge to any meaningful solution for a particular data set, and even when it does, it usually takes so long that bootstrapping becomes awkward. When numerical integration by Gauss-Hermite quadrature is needed, the approximation error can be substantial depending on the population. The likelihood of the FIML estimator is given below, again under the assumption that ϵ and v follow a trivariate normal distribution.

Proposition 2.1. The joint density function of y and w conditional on the exogenous variables is

\[
f(y, w \mid z) = \left[\frac{1}{\sqrt{\pi}}\int_{\mathbb{R}} f\big(y_1 \mid z, w=1, \sqrt{2}\sigma_1\zeta_1\big)\,\Phi^*\!\big(\sqrt{2}\sigma_1\zeta_1\big)\exp(-\zeta_1^2)\,d\zeta_1\right]^{w}
\cdot \left[\frac{1}{\sqrt{\pi}}\int_{\mathbb{R}} f\big(y_0 \mid z, w=0, \sqrt{2}\sigma_0\zeta_0\big)\Big(1-\Phi^*\!\big(\sqrt{2}\sigma_0\zeta_0\big)\Big)\exp(-\zeta_0^2)\,d\zeta_0\right]^{1-w},
\]

where

\[
\Phi^*(\epsilon_g) = \Phi\!\left[\frac{z\delta + (\rho_g/\sigma_g)\,\epsilon_g}{\sqrt{1-\rho_g^2}}\right].
\]

Proof. The basic proof is given in Terza (1998), and its extension to the NFES model is given in Appendix A.

The function f(·) denotes the conditional distribution used; it is usually either Poisson or NegBinII. The above equation is a function of ζ, which is integrated out; the integration is computed by the Gauss-Hermite quadrature method in the actual estimation.

2.3.2 Quasi-Maximum Likelihood Estimator

Unless the distributional assumptions used in FIML are correct, the FIML estimator might not be consistent; this is the cost of FIML in exchange for efficiency.
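Before turning to the QMLE in detail, the Gauss-Hermite evaluation of the Proposition 2.1 likelihood can be sketched as follows (my own illustration under the Poisson and trivariate-normal assumptions; ρ_g = corr(ϵ_g, v), and every numeric parameter value in the test is hypothetical):

```python
import numpy as np
from math import erf, sqrt, lgamma

# One observation's likelihood contribution f(y, w | z), approximated by
# Gauss-Hermite quadrature after the change of variables eps_g = sqrt(2)*sigma_g*zeta.
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))

def poisson_pmf(y, mu):
    return np.exp(y * np.log(mu) - mu - lgamma(y + 1))

def lik_contribution(y, w, x, z, theta, n_nodes=32):
    a1, b1, s1, r1, a0, b0, s0, r0, d = theta
    nodes, wts = np.polynomial.hermite.hermgauss(n_nodes)  # weight exp(-t^2)
    a, b, s, r = (a1, b1, s1, r1) if w == 1 else (a0, b0, s0, r0)
    eps = sqrt(2.0) * s * nodes                  # quadrature nodes in eps-space
    mu = np.exp(a + b * x + eps)                 # Poisson mean given eps
    pstar = Phi((z * d + (r / s) * eps) / sqrt(1.0 - r**2))  # P(w = 1 | z, eps)
    sel = pstar if w == 1 else 1.0 - pstar
    return float(np.sum(wts * poisson_pmf(y, mu) * sel) / sqrt(np.pi))
```

A useful sanity check is that summing the contribution over all y and both values of w returns (approximately) one, since the quadrature weights sum to √π.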
Alternatively, the Quasi-Maximum Likelihood Estimator (QMLE) trades efficiency for robustness by using the weaker condition that only the conditional expectation function (CEF) is correctly specified. As long as the likelihood used is in the class of the linear exponential family and the CEF is correctly specified, the estimator is consistent even if the whole likelihood function is misspecified (Gourieroux, Monfort and Trognon, 1984). Given the model in equation (2.1), a natural way to estimate might be to run QMLE or Nonlinear Least Squares (NLS) using E(yg|x, ϵg). However, this does not give an estimable equation because ϵg is unobserved; the unobserved variable needs to be integrated out of the conditioning set of that CEF. Using the fact that ϵ and v are correlated, one can construct E(y|z, v):

\[
E(y_g \mid z, v) = \exp\!\Big(\alpha_g + \tfrac{1}{2}\sigma_g^2(1-\rho_g^2) + x\beta_g + \rho_g\sigma_g v\Big).
\]

Conditional on z, v determines the value of w. Since (z, w) generates a coarser σ-field than (z, v), by the law of iterated expectations,

\[
E(y_g \mid z, w) = \exp\!\Big(\alpha_g + \tfrac{1}{2}\sigma_g^2(1-\rho_g^2) + x\beta_g\Big)\, E\big[\exp(\rho_g\sigma_g v) \mid z, w\big].
\]

Thus E(y|z, w) can be expressed using only the observable variables z and w. The estimating equation is then obtained as

\[
E(y \mid z, w) = w \cdot \exp\!\Big(\alpha_1 + \frac{\sigma_1^2}{2} + x\beta_1\Big)\frac{\Phi(z\delta + \rho_1\sigma_1)}{\Phi(z\delta)}
+ (1-w)\cdot \exp\!\Big(\alpha_0 + \frac{\sigma_0^2}{2} + x\beta_0\Big)\frac{\Phi\big(-(z\delta + \rho_0\sigma_0)\big)}{\Phi(-z\delta)}, \tag{2.4}
\]

where the composite intercepts and βg are identified. As was already seen in equation (2.2), these parameters are sufficient for identifying the ATE. A detailed derivation of the above estimating equation can be found in Appendix B. One can run a QML estimation using the above CEF. A distributional assumption on y is needed, as in FIML; the difference is that FIML models yg as following a certain distribution with E(yg|z, ϵg) as the CEF, whereas QMLE does so with E(y|z, w).
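The conditional mean (2.4) is simple to compute once the first-stage probit index is available. The sketch below (my own, not the dissertation's code) writes it with a_g standing for the composite intercept α_g + σ_g²/2, rs_g for ρ_gσ_g, and zd for the probit index zδ; all test values are illustrative:

```python
import numpy as np
from math import erf, sqrt

# The two-regime conditional mean E(y | z, w) of equation (2.4): each regime's
# exponential mean is multiplied by a probit-based correction ratio.
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))

def cef(w, x, zd, a1, b1, a0, b0, rs1, rs0):
    m1 = np.exp(a1 + x * b1) * Phi(zd + rs1) / Phi(zd)       # regime 1 mean
    m0 = np.exp(a0 + x * b0) * Phi(-(zd + rs0)) / Phi(-zd)   # regime 0 mean
    return w * m1 + (1 - w) * m0
```

With rs_g = 0 (no endogeneity) the correction ratios equal one and the mean collapses to exp(a_g + xb_g); with rs_1 > 0 the treated-regime mean is scaled up, reflecting positive selection on the unobservable.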
The integration does not appear in the Poisson likelihood based on E(y|z, w), because the unobservable has already been eliminated and the correction term plays that role instead. Both FIML and QMLE rely on a correctly specified conditional mean for consistent estimation of the parameters. However, the conditional mean in QMLE, i.e., E(y|z, w), is expressed entirely in observable variables, which makes the QMLE likelihood simpler than that of FIML. One can run a QMLE by using a conditional distribution with the mean E(y|z, w). Specifically, a two-step method can be employed in which the first-stage probit estimates are substituted into the correction terms. It does not, however, have to be carried out sequentially in two steps; the parameters can be estimated by a single-step procedure in which all the parameters necessary for the ATE are separately identified (Wooldridge, 2011). However, as Hellström and Nordström (2008) and Chapter 1 have shown, the single-step ML method for estimating the ATE in the linear endogenous switching model is relatively less efficient in finite samples; it will be examined in the sequel whether that is still the case in this nonlinear model with a count dependent variable.

2.3.3 Nonlinear Least Squares Estimator

The above QML method uses a likelihood in the linear exponential family, based on the condition that the conditional mean function is correctly specified. Given a correctly specified conditional mean function, it is also possible to use the Nonlinear Least Squares (NLS) method, which can also be viewed as a method of moments estimator. Let us write the equation in additive form with the correctly specified conditional expectation as

\[
y = E[y \mid z, w] + e,
\]

where E(e|z, w) = 0 by definition. The true parameter values minimize E[y − E(y|z, w)]², and the estimation can be performed by the sample analogue. Since the conditional mean contains the correction terms, they must be estimated through the first-stage probit.
Under some regularity conditions this approach gives a consistent estimator; see Wooldridge (2010, Chapter 12) for a detailed discussion. This NLS estimator can also be interpreted as a method of moments estimator. From the minimization problem, the true parameters satisfy the first order condition E(∇θE(y|z, w) · e) = 0, which can serve as a moment condition with the instrument vector ∇θE(y|z, w). By the fact that E(e|z, w) = 0, along with the law of iterated expectations, it is easily shown that any function of the instruments is also a valid instrument. Therefore we have infinitely many possible instruments that can serve to increase the asymptotic efficiency of the parameter estimators.

2.3.4 Weighted Nonlinear Least Squares Estimator

We have seen above that NLS with a correctly specified conditional mean gives a consistent estimator under some regularity conditions. It also became clear that the NLS estimator can be viewed as a method of moments estimator, from which one can construct infinitely many valid instruments and moment conditions. In order to avoid poor finite sample performance due to overidentifying restrictions (Altonji and Segal, 1996; Ziliak, 1997), one can resort to the optimal IV approach (see Wooldridge (2010, Chapter 8)). In this case, this can be done by dividing the instrument by the conditional variance; the estimator using E(∇θE(y|z, w) · var[y|z, w]⁻¹ · e) = 0 as the moment condition is more efficient than the NLS of the previous section. One can draw almost the same conclusion using the weighted nonlinear least squares (WNLS) approach, in which both sides of the error-form equation are divided by the square root of an arbitrary function v(z, w, γ) with nuisance parameter γ. The first order condition then becomes E(∇θE(y|z, w) · v(z, w, γ)⁻¹ · e) = 0. This WNLS estimator is consistent under a correctly specified conditional mean and an identification condition; see Wooldridge (2010, p. 411).
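The claim that any function of the instruments remains a valid instrument can be checked numerically. The toy sketch below is my own construction: the CEF and coefficients are hypothetical, and the treatment is made exogenous purely to keep the example short; the point is only that each sample moment E[g(z, w)e] is near zero for arbitrary g:

```python
import numpy as np

# With e = y - E[y | z, w] and E[e | z, w] = 0, the law of iterated expectations
# gives E[g(z, w) e] = 0 for any function g of (z, w).
rng = np.random.default_rng(3)
n = 200_000
z = rng.binomial(1, 0.5, n)
w = rng.binomial(1, 0.5, n)                 # exogenous w for this toy example
mu = np.exp(0.2 + 0.3 * w + 0.1 * z)        # hypothetical CEF m(z, w)
y = rng.poisson(mu)
e = y - mu                                  # E[e | z, w] = 0 by construction

# Each sample moment should be close to zero regardless of the choice of g.
moments = [np.mean(g * e) for g in (z, w, z * w, np.exp(z + w))]
```

This is exactly what makes the optimal IV and WNLS weightings possible: among the infinitely many valid instruments, one can search for the most efficient combination.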
Note that the difference between the moment condition for the optimal IV approach and the first order condition in WNLS is whether the original instrument is divided by var[y|z, w] or by an arbitrary function v(z, w, γ). Under the condition σ²v(z, w, γ) = var[y|z, w] for some constant σ², the generalized information matrix equality (GIME) holds and the WNLS estimator becomes efficient; if σ² = 1, this essentially restates the efficiency result from the optimal IV. What if σ²v(z, w, γ) ≠ var[y|z, w]? Then the inference has to be made robust because GIME fails. That said, the consistency result still holds under the identification condition and a correctly specified conditional mean. Under the assumption that y|z, w, ϵ follows a Poisson distribution, Terza (1998) derived the correctly specified conditional variance var[y|z, w]. It should also be noted that if the Poisson assumption fails, the conditional variance is misspecified and robust inference is called for. Estimation is carried out in three steps. The correction term is estimated in the first step, and the structural parameters are estimated in the second step, from which the conditional variance is estimated. The third step estimates the structural parameters again, using the conditional variance estimated in the earlier step. Terza (1998) proposed two approaches to estimating the conditional variance. Of these, the regression-based method will be used in this paper since it is computationally easier to implement. The derivation for the NFES model under the assumption that y|z, w, ϵ follows a Poisson distribution is given in Appendix C; it turns out to be

\[
\mathrm{var}[y \mid z, w] = w\,\delta_1\Big(\delta_1\big(\exp(\sigma_1^2)L_{1,2} - L_1^2\big) + L_1\Big)
+ (1-w)\,\delta_0\Big(\delta_0\big(\exp(\sigma_0^2)L_{0,2} - L_0^2\big) + L_0\Big), \tag{2.5}
\]

where δg = exp(αg + σg²/2 + xβg), L₁,₂ = Φ(zδ + 2ρ1σ1)/Φ(zδ), L₁ = Φ(zδ + ρ1σ1)/Φ(zδ), L₀,₂ = Φ(−zδ − 2ρ0σ0)/Φ(−zδ), and L₀ = Φ(−zδ − ρ0σ0)/Φ(−zδ).
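The conditional variance (2.5) can be coded directly from these definitions; the sketch below is an assumed implementation (mine, with illustrative numeric inputs in the test):

```python
import numpy as np
from math import erf, sqrt

# Conditional variance (2.5) used to weight the WNLS moment condition; zd is the
# probit index z*delta, and (a_g, b_g, s_g, r_g) are each regime's intercept,
# slope, sigma_g and rho_g.
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))

def cond_var(w, x, zd, a1, b1, s1, r1, a0, b0, s0, r0):
    d1 = np.exp(a1 + s1**2 / 2 + x * b1)     # delta_1
    d0 = np.exp(a0 + s0**2 / 2 + x * b0)     # delta_0
    L12 = Phi(zd + 2 * r1 * s1) / Phi(zd)
    L1 = Phi(zd + r1 * s1) / Phi(zd)
    L02 = Phi(-zd - 2 * r0 * s0) / Phi(-zd)
    L0 = Phi(-zd - r0 * s0) / Phi(-zd)
    v1 = d1 * (d1 * (np.exp(s1**2) * L12 - L1**2) + L1)
    v0 = d0 * (d0 * (np.exp(s0**2) * L02 - L0**2) + L0)
    return w * v1 + (1 - w) * v0
```

When ρ_g = 0 the ratios collapse to one and the expression reduces to the unconditional Poisson-lognormal variance δ_g(δ_g(exp(σ_g²) − 1) + 1); by conditional Jensen's inequality the variance always exceeds the conditional mean δ_g L_g, i.e., the model is overdispersed relative to a plain Poisson.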
The regression-based method estimates the σg² values that are then used to compute the conditional variance for WNLS. If the Poisson assumption is true, GIME is applicable; otherwise robust inference is needed. In either case, the parameter estimators are consistent under the regularity conditions given above.

2.4 Asymptotic Distributions

The asymptotic distribution of the FIML estimator is straightforward. Given the likelihood function in Proposition 2.1, the score and Hessian are constructed as usual. If the multivariate normal assumption is correct, and hence the likelihood function as well, the asymptotic variance simplifies. The disadvantage of FIML is that the parameter estimators are no longer consistent when the likelihood function is misspecified. Now consider the WNLS estimator. The objective function is (y − E[y|z, w])²/(2·var[y|z, w]), where E[y|z, w] and var[y|z, w] are taken from equations (2.4) and (2.5). Ignoring the first-stage error, the asymptotic distribution under the condition var(y|z, w) = v(z, w, γ) can be written as (see Wooldridge, 1997)

\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N\Big(0, \big[E\,h(z, w, y, \theta)\big]^{-1}\Big),
\]

where

\[
E[h(z, w, y, \theta)] = E\left[\frac{\nabla_\theta m(z, w, \theta)\,\nabla_\theta m(z, w, \theta)'}{v(z, w, \gamma)}\right] \tag{2.6}
\]

and m(z, w, θ) = E[y|z, w]. Now consider the asymptotic distribution of the PQMLE. The likelihood function is constructed using the Poisson distribution with the conditional mean in equation (2.4). Then the asymptotic distribution is

\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N\Big(0,\; E[h(y \mid z, w, \theta_0)]^{-1}\,E\big[s(y \mid z, w, \theta_0)s(y \mid z, w, \theta_0)'\big]\,E[h(y \mid z, w, \theta_0)]^{-1}\Big),
\]

where

\[
E\big[s(y \mid z, w, \theta_0)s(y \mid z, w, \theta_0)'\big]
= E\left[\frac{\nabla_\theta m(z, w, \theta)\,(y_i - m(z, w, \theta))}{qvar(y_i)} \cdot \frac{(y_i - m(z, w, \theta))\,\nabla_\theta m(z, w, \theta)'}{qvar(y_i)}\right],
\]
\[
E[h(y \mid z, w, \theta_0)] = -E\left[\frac{\nabla_\theta m(z, w, \theta)\,\nabla_\theta m(z, w, \theta)'}{qvar(y_i)}\right].
\]

The denominator qvar is the variance implied by the distribution function used in the QML.
For WNLS, the denominator of the expected Hessian was the conditional variance of y, whereas qvar, the denominator of the expected Hessian and score for PQML, is the variance implied by the distribution used for the quasi-likelihood, i.e., the conditional mean in the case of the Poisson QMLE. The asymptotic variance of the PQMLE can be simplified under the condition

\[
\mathrm{Var}[y \mid z, w] = \sigma^2 \cdot qvar. \tag{2.7}
\]

This condition says that the true conditional variance is proportional to the variance implied by the quasi-likelihood. The Generalized Conditional Information Matrix Equality (GCIME) holds under this condition, which gives

\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N\Big(0,\; -\sigma^2\big[E\,h(y \mid z, w, \theta)\big]^{-1}\Big).
\]

By plugging (2.7) into (2.6), it is obvious that the two asymptotic distributions for WNLS and PQML are equivalent. Without condition (2.7), however, PQML might be less efficient than WNLS. Of course, this conclusion holds only insofar as the first-stage estimation error is ignored. Now consider our model with the estimating equation (2.4). Although the dependent variable yg conditional on x and ϵg follows the Poisson distribution with mean E(yg|x, ϵg) = exp(αg + xβg + ϵg), it does not necessarily follow that yg conditional on z and w follows a Poisson distribution with the mean given in (2.4). To see this point, the mean and variance conditional on z and w are

\[
E[y \mid z, w] = w\,\delta_1 L_1 + (1-w)\,\delta_0 L_0,
\]
\[
\mathrm{var}[y \mid z, w] = w\,\delta_1\Big(\delta_1\big(\exp(\sigma_1^2)L_{1,2} - L_1^2\big) + L_1\Big)
+ (1-w)\,\delta_0\Big(\delta_0\big(\exp(\sigma_0^2)L_{0,2} - L_0^2\big) + L_0\Big).
\]

They are clearly neither equal nor proportional by a constant. Therefore the condition for GCIME is not satisfied, and the PQML is asymptotically less efficient than the WNLS. Nevertheless, the first-stage estimation error is ignored in the asymptotic distribution above, so the small-sample behavior can be different.
The asymptotic distribution of the estimators of the structural parameters that accounts for the first-stage error is straightforward, with additional terms in the score functions. This adjustment makes it impossible to use GCIME and creates sandwich-form variance matrices for both WNLS and PQMLE. The discussion so far has concerned the structural coefficients inside the exponential function. When our quantity of interest is the ATE, which is a nonlinear function of the structural parameters, the asymptotic approximation of the variance matrix can be obtained by the delta method. Recall that the ATE is estimated as in equation (2.2). Terza (2009) derived the following result.

Proposition 2.2. Along with the regularity conditions for the Uniform Weak Law of Large Numbers and asymptotic normality, suppose that the nonlinear functions g1(x, θ) and g0(x, θ) are continuous and differentiable at θ0 and that their first derivatives with respect to θ satisfy all the conditions of the Uniform Weak Law of Large Numbers for the objective function. Let the expectations of their derivatives with respect to θ be denoted by G1 and G0. Then the asymptotic distribution of the ATE estimator in equation (2.2) is

\[
\sqrt{N}\big(\widehat{ATE} - ATE\big) \xrightarrow{d} N(0, V),
\]

where

\[
V = E[T^2] + (G_1 - G_0)\,A_0^{-1} B_0 A_0^{-1}\,(G_1 - G_0)'
\]

and

\[
g_1(x, \theta_0) \equiv \exp(\alpha_1 + \sigma_1^2/2 + x\beta_1), \qquad g_0(x, \theta_0) \equiv \exp(\alpha_0 + \sigma_0^2/2 + x\beta_0),
\]
\[
T \equiv \big(g_1(x, \theta_0) - g_0(x, \theta_0)\big) - \big(E[g_1(x, \theta_0)] - E[g_0(x, \theta_0)]\big).
\]

Proof. See Terza (2009).

In the above, T is the demeaned “ATE conditional on x”, a population property that does not depend on the particular estimator being used. Incidentally, if the structural parameters are estimated by a two-step method, the terms B0 and si(θ0) can easily be adjusted using the results for two-step M-estimators (see Wooldridge, 2010, Chapter 12). Under this setting, if the two exponential terms g1(x, θ0) and g0(x, θ0) are substantially different across the values of the covariates x, then the variance of the conditional ATE will also be large.
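The first piece of V in Proposition 2.2 has a simple sample analogue. The sketch below (my own; A0 and B0 would come from the chosen estimator's sandwich matrices and are not computed here) estimates E[T²] and the gradients G_g with respect to the composite intercept and slopes:

```python
import numpy as np

# Sample analogue of E[T^2], the variance of the demeaned conditional ATE,
# plus the gradients G_g of E[g_g(x, theta)] w.r.t. (composite intercept, slopes).
def ate_variance_pieces(x, a1, b1, a0, b0):
    g1 = np.exp(a1 + x @ b1)
    g0 = np.exp(a0 + x @ b0)
    T = (g1 - g0) - (g1.mean() - g0.mean())          # demeaned conditional ATE
    ET2 = np.mean(T**2)
    # d/da E[g_g] = E[g_g]; d/db E[g_g] = E[g_g * x]
    G1 = np.concatenate(([g1.mean()], (g1[:, None] * x).mean(axis=0)))
    G0 = np.concatenate(([g0.mean()], (g0[:, None] * x).mean(axis=0)))
    return ET2, G1, G0
```

When the two regimes share the same conditional mean, T vanishes and this first piece of V is exactly zero, which matches the extreme case discussed in the text.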
Therefore one can have a large variance of the ATE estimator to the extent that the values of β₁ and β₀ are different. A lesson from this argument is that the ATE estimator by NFES might become less accurate when there is substantial inequality of the treatment effect from person to person. On the other hand, there also exist sets of parameters with which the variance of the ATE estimator by NFES can be arbitrarily small. The extreme case is when the conditional means for the two regimes are identical, where the variance of T as well as the covariance term becomes zero. The above discussion shows that the consistent NFES method can have larger or smaller variance depending on the coefficient values. When the conditional means of the two regimes are similar, one can expect the ATE estimator by NFES to be very efficient, but when they are very different, it might have larger variance.

2.5 Monte Carlo Simulation

As already discussed in earlier sections, the NFES and LFES models are thought to be consistent for the ATE. Therefore the purpose of the Monte Carlo simulation is to compare the linear and nonlinear models for estimating the ATE. Incidentally, other alternative estimators, including the LET model estimated by 2SLS and Heckit and the NET model that has acquired wide popularity, need to be examined in terms of consistency. Specifically, the main objectives of the simulation study are: (1) to present the possibility that NFES estimators can be more efficient than LFES under small variance of T, (2) to find out how severe the bias in the one-regime model is compared to that in the two-regime model, and (3) to check how robust the NFES model is to various distributional misspecifications. In addition, it might also be interesting (4) to clarify the advantage of one-step estimation over two-step estimation in nonlinear models, (5) to compare the Poisson QMLE and WNLS in terms of efficiency, and (6) to find out the advantage of the FIML estimator.
2.5.1 Data Generating Processes

For each simulation session, the number of replications is 1000, and the sample sizes are 1000, 3000 and 5000. There are five Data Generating Processes (DGP). For all the DGPs the following setup is commonly used:

x ∼ uniform[−5, 5]
z ∼ Binomial(1, 1/2)
w = 1[0.15 − 0.05x − 0.3z + v ≥ 0]
E(y₁|x, ϵ₁) = exp(β₁₀ + β₁₁x + ϵ₁)
E(y₀|x, ϵ₀) = exp(0.1 + β₀₁x + ϵ₀),

where the errors follow a trivariate normal distribution. The treatment equation is designed so that the numbers of observations in each regime are approximately the same. The population ATE is set equal to one. For the errors, the covariance between ϵ₁ and ϵ₀ is set equal to 0.5 and the variances of all the errors are set equal to one. What matters is the covariance between ϵ_g and v, for which 0.4, 0.5 and 0.6 were used. Below are descriptions of the data generating processes used for each session.

DGP 0: β₁₀ = 0.564, β₁₁ = 0.01, β₀₁ = 0.1, y_g ∼ Poisson(E[y_g|x, ϵ_g]), (ϵ_g, v) ∼ trivariate normal

This is the ideal case for the NFES estimators. Since the data generating process is nonlinear, the value of the population ATE cannot be computed in closed form; rather, the claimed true value is computed numerically. Using Stata, I searched for the value of the intercept in regime one such that the difference between one and the numerically computed Σ(y₁ − y₀)/N is less than 0.001 with a million observations from the above data generating process. The value of the intercept in regime one, i.e. 0.564, was found in this way. The dependent variables y₁ and y₀ follow the Poisson distribution for a given conditional mean. The other DGPs for misspecified distributions are slight modifications of this basic [DGP 0]. There are three distributional assumptions in the above data generating process: 1) the dependent variable follows the Poisson distribution, 2) the unobservables in the conditional mean function follow the normal distribution, and 3) the selection error follows the standard normal, i.e.
the treatment is probit. It might not be very surprising if the NFES estimator performs well under [DGP 0], on which the model is based. To be fully reliable and useful, the nonlinear models have to show their validity under misspecified distributions as well. This is particularly important because the reliance of nonlinear models on strong distributional assumptions has been a major source of criticism in favor of simpler linear models (Angrist and Pischke, 2010). There have been some papers addressing this issue: Romeu and Vera-Hernández (2005) use a flexible functional form for the conditional probability of counts by using polynomial Poisson expansions. Masuhara (2007) models the joint distribution of the errors by Hermite polynomials. Choi and Min (2009) use Johnson's S_U-normal distribution, which is shown to outperform the normal model in terms of consistency. And Deb and Trivedi (2004) use a latent factor structure and simulated likelihood methods to deal with nonnormality. Although those methods provide very nice alternatives to the models based on the normal assumption, it will be shown below that the nonlinear model with normality still performs better than the linear IV methods in many cases. To check the robustness of NFES, I designed some other data generating processes such that the above distributional assumptions are violated one by one.

DGP 1: β₁₀ = 0.564, β₁₁ = 0.01, β₀₁ = 0.1, y_g ∼ uniform[0, 2E[y_g|x, ϵ_g]], (ϵ_g, v) ∼ trivariate normal

In [DGP 1], a discrete uniform distribution is employed in place of the Poisson. It is set such that the left and right end points of the support are zero and two times the conditional mean.

DGP 2: β₁₀ = 0.564, β₁₁ = 0.01, β₀₁ = 0.1, y_g ∼ Poisson(E[y_g|x, ϵ_g]), ϵ_g ∼ bivariate normal, v ∼ (χ²(5) − 5)/√10

In [DGP 2], a skewed distribution (χ²(5) − 5)/√10 was used for the selection error in place of the standard normal.² Its mean and standard deviation are the same as those of the standard normal.
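A quick numerical check (an illustrative numpy sketch, not the dissertation's Stata code) confirms that (χ²(5) − 5)/√10 matches the standard normal's first two moments while being right-skewed; its theoretical skewness is √(8/5) ≈ 1.26.

```python
import numpy as np

# Moments of the skewed selection error used in [DGP 2]: chi-square(5),
# centered at its mean 5 and scaled by its standard deviation sqrt(10).
rng = np.random.default_rng(3)
n = 200_000
v = (rng.chisquare(5, n) - 5) / np.sqrt(10)

mean, var, skew = v.mean(), v.var(), np.mean(v ** 3)  # ~0, ~1, ~1.26
```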
This probit assumption is important since it is this very assumption that makes it possible to write the correction terms as fractions of two normal cumulative distribution functions. Without it, the conditional mean E[y|x, z] is misspecified.

² A correlated trivariate distribution in which two of the errors follow the standard normal and the other follows a centered χ² distribution could be properly modeled by copula density functions. Nevertheless, the purpose of the present simulation is not to find a multivariate distribution with parameters that uniquely determine a specific multivariate distribution. In the DGPs used above, the random numbers were generated by first creating many standard normal random numbers as "basis" variables. The dependence structure is then created by using some common "basis" variables such that the linear combination gives the intended correlation structure. The errors in [DGP 2] and [DGP 3] are generated in this way. Stata codes will be provided upon request.

DGP 3: β₁₀ = 0.516, β₁₁ = 0.01, β₀₁ = 0.1, y_g ∼ Poisson(E[y_g|x, ϵ_g]), ϵ_g ∼ (χ²(20) − 20)/√40, v ∼ normal

In [DGP 3], the case where the structural errors follow (χ²(20) − 20)/√40 instead of the standard normal is considered. When the structural errors are not normally distributed as in [DGP 0], this directly changes the distribution of the dependent variables and thus the population ATE. To ensure that the ATE is equal to one, the constant term in regime one is changed to 0.516. Normality of ϵ_g is used in the process of removing ϵ_g from the conditioning set, so a violation of this assumption makes the composite intercept misspecified. Only under this assumption could we obtain the nice result that the composite intercept in the estimating equation E[y_g|x, w] and that in E[y_g|x] are identical. Therefore a violation of normality will make the ATE estimator inconsistent. The degrees of freedom in the chi-square were chosen so that the comparison between the various estimators is easy and meaningful to the extent that they are shown to be consistent, converging to the true value. It was found that the results were more sensitive to distributional misspecification of the structural errors than to that of the selection error. Also, when the skewness is as high as that of the distribution used for the selection error, the PQML estimator in many cases fails to converge to reasonable estimates. One can see from the results below that this choice of degrees of freedom is indeed good for comparison purposes without too many extreme outcomes.

In [DGP 0], the variance of the individual treatment effect, i.e. of T ≡ g₁(x, θ₀) − g₀(x, θ₀) − (E[g₁(x, θ₀)] − E[g₀(x, θ₀)]), was approximately 18.9. The asymptotic distribution in the earlier section says that the variance of the ATE estimator is an increasing function of E[T²]. [DGP 4] is designed to show this.

DGP 4: β₁₀ = 0.865, β₁₁ = 0.001, β₀₁ = 0.35, y_g ∼ Poisson(E[y_g|x, ϵ_g]), (ϵ_g, v) ∼ trivariate normal

In the above process, E[T²] is approximately 52.5. In order to create a higher variance of the individual treatment effect, the vertical distance between the two conditional mean functions at a fixed x has been adjusted without changing the true ATE value.

2.5.2 Main Results

The results for [DGP 0] are tabulated for all the discussed estimators except FIML in Tables E.1-E.3 for each correlation. The FIML results are separately given in Table E.4. For the other DGPs, only 2SLS, LFES (Heckit), NFES (2PQMLE) and NFES (WNLS), which are the focus of our discussion, are tabulated in Tables E.5-E.7. First, compare LFES with NFES under various estimation methods with small T. Tables E.1-E.3 present the [DGP 0] results with E[T²] = 18.9, which are relevant for this purpose. For any correlation, it can be seen that the NFES estimators are extremely inefficient and their RMSEs are very large compared to LFES under the sample size 1000. This is predicted by the asymptotic distribution; the variance of T in the nonlinear estimator is still prevalent in small samples.
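The magnitudes quoted above (roughly 18.9 under [DGP 0] and 52.5 under [DGP 4]) can be checked by brute force. The following numpy sketch is my own rendering of the two parameterizations, not the dissertation's Stata code; it draws the potential outcomes and computes the Monte Carlo variance of y₁ − y₀.

```python
import numpy as np

# Monte Carlo variance of the individual treatment effect y1 - y0 under the
# [DGP 0] and [DGP 4] parameter values, with cov(eps1, eps0) = 0.5 and unit
# error variances as in the common setup.
rng = np.random.default_rng(5)
n = 1_000_000
x = rng.uniform(-5, 5, n)
S = np.array([[1.0, 0.5], [0.5, 1.0]])
eps1, eps0 = rng.multivariate_normal(np.zeros(2), S, n).T

def var_te(b10, b11, b01):
    y1 = rng.poisson(np.exp(b10 + b11 * x + eps1))
    y0 = rng.poisson(np.exp(0.1 + b01 * x + eps0))
    return np.var(y1 - y0)

v_dgp0 = var_te(0.564, 0.01, 0.1)    # roughly 18.9
v_dgp4 = var_te(0.865, 0.001, 0.35)  # roughly 52.5
```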
However, such a constant term rapidly disappears as the sample size increases. Also, the results in terms of median and mean absolute deviation (MAD) are not too bad; the medians of NFES are closer to one than those of the linear models. In addition, although the Monte Carlo standard deviations of NFES were very large, the degrees of dispersion measured by MAD are smaller than those of the linear models, from which one can conclude that the large RMSE was caused by some outlying estimates with extreme values. Therefore the measures that are less affected by extreme values give more favorable figures for NFES. At any rate, such extreme estimates disappear as the sample sizes grow larger. With n = 5000, the mean approaches the true parameter value sufficiently closely, and the Monte Carlo standard deviations are smaller than those of the linear models, let alone the median and MAD. In sum, ruling out seemingly absurd estimates, when the variance of T is small the NFES model does better than the linear models in terms of both consistency and efficiency, particularly as the sample size grows larger. On the contrary, the results under E[T²] = 52.5 in Table E.8 look very different. Even under a large sample size, the NFES estimators are less efficient than LFES. As seen in Proposition 2.2, the asymptotic variance of the NFES estimators can be arbitrarily large as the variance of T diverges to infinity. However, the results in Tables E.1-E.3 imply that there can be cases where NFES becomes more efficient than the linear estimators when the individual treatment effect is not substantially different from person to person. Incidentally, it has been claimed that the LATE and the ATE obtained by linear IV and nonlinear methods respectively might not be substantially different (Angrist and Evans, 1998; Angrist, 2001; Angrist and Pischke, 2009). Nevertheless, the simulation results indicate that their finite sample distributions and the estimates can be very different from each other.
This issue will also be revisited in the next section. Second, we have seen above that the NFES estimator is a very useful alternative to LFES when the variance of T is small. However, it must further be shown that the performance of the NFES estimators is not very sensitive to the assumptions on which they are based. If the said nice properties of NFES are valid only under the ideal assumptions, its usefulness will be much limited. The results of the additional simulations under the wrong distributions mentioned in the previous subsection are tabulated in Tables E.5-E.7. In those tables, only PQML, WNLS, and the linear estimators that are still thought to be consistent were considered. To begin with, see the case where the discrete uniform distribution was used in place of the Poisson, i.e., for given E[y_g|x, ϵ_g] the support is [0, 2E[y_g|x, ϵ_g]]. Since the distributional misspecification of the dependent variable does not affect the consistency results for PQML and WNLS, the point of interest should be the efficiency of NFES compared to the linear models. The results in Table E.5 show that the performances of the linear and nonlinear models are not very different from those under the correct distribution. Regardless of ρ, the NFES model is better than the linear models in terms of efficiency as well as consistency. As in the correct distribution case, despite some extreme values of the estimates in small samples, the performance of NFES measured by median and MAD is still better than that of the linear models. Ruling out the seemingly unacceptable estimates, it can be said that the NFES model is still more efficient than the linear models even in small samples. Therefore NFES is robust to the distributional assumption on the dependent variable. Now consider the case where the true treatment equation is not probit. To that end, a skewed random variable using χ²(5) was used. This is important because the correction terms that we used for PQML and WNLS were derived under the probit assumption.
When it is violated, the conditional mean function becomes misspecified and they lose the consistency property. However, the results in Table E.6 show that their behavior is not very different from that under the correct distribution. Although the performances of NFES in terms of mean and variance are worse than those of the linear estimators in small samples, they are better in terms of median and MAD. In larger samples they are a lot better by all criteria. Although a departure from probit misspecifies the conditional mean, the impact is almost negligible. Lastly, consider the cases where the distributions of the unobservables are skewed. The advantage of the normal distribution assumption was that it leads to the same intercept both in estimating equation (2.7) and in equation (2.2), and thus the estimates from the switching regression could directly be used for ATE estimation. Under misspecification, therefore, the switching regression estimates do not give correct information for the ATE, which leads to inconsistency. The results for misspecified unobservables are given in Table E.7, where the Monte Carlo mean of the NFES model is farther from the true value than in the other cases, as expected. However, it should be noted that it is not only the nonlinear model but also the linear models that are under strain with misspecified errors. The Monte Carlo means for both linear and nonlinear models are farther away from the true value and their deviation is also noticeable. In other words, the skewed error moves the means away from the true value and makes the estimates less accurate by increasing the variance. However, the NFES model tends to have smaller RMSE in larger samples, mainly due to its advantage in efficiency. The efficiency gain is particularly notable when it is measured by MAD; in larger samples the MAD value is merely about half that of the linear models. The WNLS is even more efficient than the PQMLE.
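The pattern appealed to throughout this discussion — RMSE inflated by a handful of extreme replications while median and MAD stay well-behaved — is easy to reproduce with made-up numbers (purely illustrative, not the chapter's simulation output):

```python
import numpy as np

# 1000 "replications" centered on the true value 1, with five extreme
# outliers of the kind that blow up the RMSE but barely move median or MAD.
rng = np.random.default_rng(6)
est = rng.normal(1.0, 0.2, 1000)
est[:5] = [40.0, -35.0, 55.0, -60.0, 80.0]

rmse = np.sqrt(np.mean((est - 1.0) ** 2))
mad = np.median(np.abs(est - np.median(est)))
```

The RMSE is pushed far above the bulk dispersion while the MAD remains near 0.67 times the core standard deviation.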
Third, from the results in Tables E.1-E.3 and E.5-E.6, the performances of the 2SLS and NFES estimators are more or less the same. As the correlation gets higher, their finite sample biases grow bigger. However, 2SLS diverges faster than NFES, which implies that the NFES estimators are not only robust to distributional misspecification of y and v, but also suffer less from a higher degree of endogeneity than 2SLS does. The only case of concern is when the structural error ϵ is misspecified, where both 2SLS and NFES diverge even faster than in the previous cases. However, unlike the above cases, NFES is affected more than 2SLS under higher correlation. Nevertheless, due to the advantage in efficiency, the RMSE of NFES in larger samples is still smaller for correlations up to 0.5. Even at 0.6, the RMSE of WNLS is still smaller. In sum, NFES performs better than 2SLS unless the structural errors are both misspecified and highly correlated with the selection error at the same time. To better understand the simulation results, the sampling distributions of some selected estimators and DGPs are shown graphically in the figures. Figure G.1 displays the results under the ideal case [DGP 0] where the normality conditions hold. Among the many estimators and sample sizes, only 2SLS, NET(WNLS) and NFES(WNLS) with sample size 5000 for the different correlation values are shown. The blue line juxtaposed with the histogram is a nonparametric PDF estimate generated by the Epanechnikov kernel with an optimal half-width. It can be seen from the figure that the NET estimator³ behaves poorly in terms of both consistency and efficiency, as anticipated in earlier sections. Even with 5000 observations, its sampling mean is very different from those of the other estimators. Under multivariate normality, NFES is clearly consistent and efficient in large samples.
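The density overlay described for Figure G.1 can be sketched as follows (my own numpy rendering over stand-in draws, not the figure's actual data; the rule-of-thumb bandwidth constant 2.34·σ·n^(−1/5) is the standard optimal choice for the Epanechnikov kernel and is my assumption for the "optimal half-width").

```python
import numpy as np

# Epanechnikov kernel density estimate: K(u) = 0.75*(1 - u^2) on |u| <= 1.
rng = np.random.default_rng(7)
draws = rng.normal(1.0, 0.3, 1000)            # stand-in for sampling draws

h = 2.34 * draws.std() * len(draws) ** (-0.2)  # rule-of-thumb bandwidth
grid = np.linspace(draws.min(), draws.max(), 200)
u = (grid[:, None] - draws[None, :]) / h
dens = np.mean(np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0), axis=1) / h

# Riemann-sum check that the estimated density integrates to about one
integral = np.sum(dens) * (grid[1] - grid[0])
```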
For the misspecified cases [DGP 1] and [DGP 2], it is clearly noticeable in Figure G.2 and Figure G.3 that the NFES estimator is consistent and more efficient than 2SLS at large sample sizes. Figure G.4 depicts the sampling distribution with the skewed structural error [DGP 3]. Although it creates bigger finite sample biases for those two estimators, the efficiency loss of 2SLS is more pronounced than in the other misspecified cases. An important lesson from these figures is that the violation of the distributional assumptions on which the nonlinear models rely can also disadvantage 2SLS in ways that are not explicitly seen in traditional asymptotic analysis.

³ The estimators based on the NET model will simply be called NET estimators; similarly for NFES estimators.

2.5.3 Some Other Results

In addition to the main results, here are some interesting results that are off the main topic. First, examine how large the biases of the NET estimators are when the true model is NFES. It can be seen in Tables E.1-E.3 that neither by mean nor by median do the NET estimators converge to the true value. It seems that the asymptotic biases of the 1PQML and WNLS estimators are considerable; it is clear that a single-regime estimation is worse than the simple linear models when the true model has two regimes. Moreover, the seemingly large difference in the Monte Carlo means of 2PQMLE and NLS reflects that those two estimators are not even estimating the same parameter. Gourieroux, Monfort and Trognon (1984) show that the true parameters are consistently estimated when the conditional mean is correctly specified. Under the same condition, the NLS estimator does the same, since it is the sample analogue of appropriately defined moment conditions. By this argument, a correctly specified conditional mean guarantees the convergence of those two estimators to the same quantity. In other words, a substantial difference even at large sample sizes is indicative of conditional mean misspecification.
Therefore any parameter restriction in M-estimation should be used with caution. Second, it can be seen in Tables E.1-E.3 that, among the nonlinear estimators, the one-step Quasi-LIML, here NFES(1PQML) and NET(1PQML), is less efficient than the two-step methods, here NFES(2PQML) and NET(2PQML), in small samples. Such a result was already reported in Chapter 1 for linear models where the dependent variable in the structural equation was continuous; the present simulation shows that the finding is still valid when the dependent variable is a nonnegative count variable. The results support the claim that the two-step estimation procedure gives a more efficient estimator than the one-step. Third, compare 2PQML and WNLS for the NFES model in Tables E.1-E.3. As mentioned above, the asymptotic distributions of the QMLE and WNLS are very similar. Under the condition var(u|x, w) = σ²·qvar, their asymptotic distributions are the same. The simulation results show that the performances of those two estimators are indeed very similar. Nevertheless, it is also clearly noticeable that WNLS is slightly more efficient than 2PQML, which may be due to the fact that the GCIME does not hold for the 2PQML estimator. Fourth, according to the FIML results tabulated in Table E.4, the FIML estimator is considerably more efficient than the others. Nevertheless, it does not seem to converge well to the true parameter value, which may result from the quadrature approximation error. To make this point clearer, another set of simulations using two different numbers of abscissas, i.e. 8 and 16, was run; more abscissas clearly help the mean approach the true value. Incidentally, the variances are not affected by the number of abscissas. One can obtain a more accurate FIML estimator by increasing the number of abscissas, but it would take a considerable amount of time in actual applications.
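The role of the abscissas can be illustrated with a toy Gauss-Hermite computation (a numpy sketch of my own, not the dissertation's FIML code): approximating E[exp(ε)] for ε ∼ N(0,1), whose exact value is e^(1/2), with 8 versus 16 nodes.

```python
import numpy as np

# Gauss-Hermite quadrature for an expectation over a standard normal:
# E[f(eps)] = (1/sqrt(pi)) * sum_i w_i f(sqrt(2) * t_i),
# where (t_i, w_i) are the nodes and weights for the weight e^{-t^2}.
def gh_expectation(n_points):
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    return np.sum(weights * np.exp(np.sqrt(2) * nodes)) / np.sqrt(np.pi)

approx8, approx16 = gh_expectation(8), gh_expectation(16)
truth = np.exp(0.5)
```

Doubling the number of abscissas shrinks the approximation error by many orders of magnitude, at the cost of evaluating the integrand at twice as many points inside every likelihood evaluation.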
One other feature that makes FIML less attractive is that PQML and WNLS quickly catch up with FIML in terms of efficiency as the sample size grows. Although FIML is the most efficient among all the estimators, its RMSE for n = 5000 is larger than those of PQML and WNLS due to the inaccuracy. On the other hand, under small sample sizes the linear estimators do as well as FIML. Using FIML becomes even less attractive when the standard error has to be found by bootstrapping; it may take too much time to complete a single session of bootstrap. There are other disadvantages too; since FIML relies heavily on the distributional assumptions, any violation of them will render the estimator inconsistent. In sum, FIML does not have any clear advantage over PQML and WNLS.

2.6 Empirical Application

Primary education may increase human capital and lifetime wages and thereby increase the opportunity cost of having a child (Becker and Barro, 1988; Barro and Becker, 1989), and it may help reduce the child mortality rate and hence let mothers have fewer children to reach a desired family size (Lam and Duryea, 1999; Schultz, 1994a,b). In addition, enhanced literacy can help mothers use contraceptive methods more effectively (Rosenzweig and Schultz, 1985, 1989). Based on this theoretical background, we are interested in how much primary education reduces the number of children in Botswana. The sign of the effect is certainly presumed to be negative. Moreover, those who received primary education may have better health information for their children, which may reduce child mortality. Then we can also expect that the difference between ceb and children might be smaller for those who received primary education. In that case the treatment effect would be greater when ceb is used as the dependent variable. The data used in this empirical analysis are from Wooldridge (2010, Chapter 21).
The variable descriptions and descriptive statistics are given in Tables E.9 and E.10. There was a huge increase in the enrollment rate in Botswana during the 1970s. The female enrollment rate in the early 1970s was roughly 60% and kept increasing for the whole decade until it reached nearly 100% in 1980 (UNESCO, 2011). Due to that increase in enrollment, in 1989, the year this data set was collected, more than half the total female population had at least seven years of primary education. Thus this data set captures an ideal time point at which the control and treatment groups were of comparable size. The dependent variables under analysis are children (number of living children), ceb (number of children ever born) and mort (number of dead children), and the covariates are age, agesq, evermarr (ever married), urban (living in urban area), electric (has electricity), tv (has a TV) and radio (has a radio). The variable of interest, i.e. the treatment variable, is educ7 (finished primary education) and the instrumental variable is frsthalf (born in the first half of the year). The correlation between educ7 and frsthalf is -.106. We are interested in the effect of a woman's primary education on the number of children she has ever borne (ceb) and on the number of living children (children). Although we try all the linear and nonlinear methods for estimating the ATE of education on fertility, the nonlinear estimators are expected to perform better for two reasons: First, the outcome variable is a typical count variable taking small natural numbers, and thus modeling the conditional mean as an exponential function is well justified. Second, the ATE conditional on covariates might not be substantially different across individuals; in other words, we assume neither a substantial difference in causal effects across age groups nor any particular time trend. Table E.12 presents regression results for various models and estimation methods with children as the dependent variable.
In what follows, the regime with primary education will be called regime one, with subscript 1, and the regime without it regime zero, with subscript 0. In Table E.12 the first four columns present the estimation results for the linear models. The ATE estimate of LFES(Heckit) is −1.552 but not statistically significant. Although LET(Heckit) and 2SLS differ only in the first stage regression, the LET(Heckit) estimate is almost twice as large as the 2SLS estimate. The LFES(Heckit), LET(Heckit) and 2SLS give ATE estimates with much larger magnitudes than OLS does, which might be evidence of endogeneity. It is, however, very hard to draw any meaningful conclusion just from the linear regression results: the only consistent estimator, LFES(Heckit), fails to give a significant result, and the other estimators, whose estimates are significant, do not seem to agree with one another. The next three columns present the results of the NET estimators. We already know that the NET model does not identify the true ATE unless the single-regime restriction is true. Indeed the ATE_NET estimates are substantially smaller than those from the other estimators. It was also pointed out in Section 3 that the estimators do not even agree with each other under a wrong restriction, which is well demonstrated here; the magnitudes of the PQMLE and NLS estimates are very different and they seem to converge to different values. The results show that the ATE_NET estimate by PQMLE is close to zero and not significant. Although only NLS gives an estimate weakly significant at the 10% level, the magnitude is relatively smaller than those of the linear models; it estimates that primary education reduces the number of children by no more than 0.68. Also, for the semi-elasticity, the PQML estimate does not give any evidence of the effectiveness of primary education. Again, only the NLS estimate is weakly significant, reporting roughly a 30% decrease in living children.
These results seem to mimic the behavior shown in the simulations of the last section, and they can be evidence that the NET model is inappropriate. The last three (double) columns in Table E.12 list the results of the NFES estimators. The NFES estimates report that primary education reduces the number of children by 0.8 (PQML) or 1.2 (NLS). It is worth mentioning that the standard errors of the NFES estimates are a lot smaller than those of the other estimators, due to which all three NFES estimates are significant at the 1% level. What is particularly interesting is the fact that the NFES estimates support the validity of the 2SLS estimate by providing similar values. LFES(Heckit), being consistent under the probit selection assumption, can have higher finite sample bias than 2SLS when the assumption is violated, as shown in Table E.5. Now, can we say with greater certainty that the ATE estimates cannot be substantially different from the LATE by 2SLS, as was claimed by Angrist and Evans (1998) and Angrist and Pischke (2009) for the bivariate probit case? As already seen in the simulation results, while similar under weak endogeneity, they start to diverge as the degree of endogeneity becomes higher. Therefore the fact that the 2SLS and NFES estimates are similar would be indicative of relatively weak correlation between the selection error v and the structural error ϵ, rather than of the validity of the above claim. The estimated regime one (with primary education) averages Σchildren₁/N for the three estimators are 1.264 (NLS), 1.499 (2PQML), 1.482 (1PQML), and those of regime zero (without primary education) Σchildren₀/N are 2.488 (NLS), 2.340 (2PQML), 2.312 (1PQML); from those values one can compute the semi-elasticities, i.e. -0.49 (NLS), -0.36 (2PQML), and -0.36 (1PQML). All those estimates are greater in absolute value than the ones from the NET model. From these, it becomes more obvious that the NET estimators give information that looks very different from what is provided by the other estimators.
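The semi-elasticities just reported follow directly from the regime averages as (Σchildren₁/N − Σchildren₀/N)/(Σchildren₀/N). A trivial check, with the numbers copied from the text:

```python
# Regime averages (with, without primary education) as reported in the text
pairs = {"NLS": (1.264, 2.488), "2PQML": (1.499, 2.340), "1PQML": (1.482, 2.312)}

# Semi-elasticity: proportional change in mean children due to the treatment
semi = {k: round((m1 - m0) / m0, 2) for k, (m1, m0) in pairs.items()}
# gives -0.49 (NLS), -0.36 (2PQML), -0.36 (1PQML)
```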
Lastly, we can directly test the restriction imposed by the NET model. One may use the Wald test of H₀: β₁ = β₀. The p-values for 2PQMLE and NLS are 0.000 and that of 1PQMLE is 0.001, implying that there actually exist two regimes.⁴ All the above results unequivocally show that the NET model is not an appropriate model for describing this data set. We can also test for endogeneity by checking the covariance between v and ϵ_g. Setting aside the NET model, all the two-regime estimators show that the regime one covariance is significantly positive, whereas the one in regime zero, slightly negative, is not statistically different from zero. Overall the use of the two-regime endogenous switching model is well justified.

⁴ Since 2PQMLE and NLS use a two-step procedure, the asymptotic variance approximation has to account for the first stage error. One of the advantages of the single-step 1PQMLE is that such first stage error is not present and the inference is straightforward. Although there are slight differences in the p-values, such trivial differences are not thought to be of any practical importance.

The above discussion was about the treatment effect on the number of living children, which reveals the difference in the desired number of children for each education group. Another interesting aspect is the treatment effect on child mortality: educated mothers are expected to have fewer dead children (Lam and Duryea, 1999; Schultz, 1994a,b). Thus the treatment effect might be negative. Since a direct estimation yields extreme outlying estimates for the NFES model⁵, an alternative indirect estimation procedure is used. To that end, another regression with the dependent variable ceb is run and presented in Table E.13. Then the expected child death at regime g for each individual is computed as mort_ig = ceb_ig − children_ig.
The treatment effect for each observation is then computed as mort_i1 − mort_i0, and the average over the whole population becomes the ATE on mortality presented in Table E.14. For example, the estimated regime one averages Σceb_i1/N are 1.402 (NLS), 1.697 (2PQML) and 1.683 (1PQML), whereas those of regime zero Σceb_i0/N are 2.890 (NLS), 2.680 (2PQML) and 2.657 (1PQML). The differences in the estimated regime zero averages, i.e. Σceb_i0/N − Σchildren_i0/N, are 0.401 (NLS), 0.340 (2PQML) and 0.345 (1PQML), implying that mothers without primary education lose on average 0.4 children. For those with primary education the quantity Σceb_i1/N − Σchildren_i1/N takes the values 0.138 (NLS), 0.198 (2PQML), and 0.201 (1PQML). The implication is that primary education reduces child mortality roughly by 0.14 to 0.26 per mother. Although the ATEs on mortality for the nonlinear estimators are not significant, they are still more efficient than the linear ones, with the exception of LET(Heckit); the bootstrap standard deviations of the nonlinear estimators are smaller than those of the linear ones.

⁵ The 2SLS estimator gives an insignificant -.002 (.208). Unlike the general results, the NFES estimators do not give a reasonable estimate; the two-step QML estimate is 25.821 with bootstrap standard deviation 784296.5 over 50 replications. It turns out that the predicted values of the dependent variable in regime one, i.e. mort_i1, are very volatile; about 10% of the observations have more than 10 predicted child deaths and about 1% have more than 100. Strangely, this phenomenon does not occur in regime zero.

2.7 Conclusion

The main contribution of this study is to clarify the asymptotic distribution of the ATE estimator based on the NFES model. Unlike other structural parameters, the ATE estimates are computed by a nonlinear function of the parameter estimates.
The estimation error therefore comes both from the error in parameter estimation and from the computation of the ATE from the parameter estimates. The asymptotic distribution reveals that the two factors can be written in an additively separable form together with a covariance term. The theory predicts that the efficiency of the nonlinear ATE estimator cannot be taken for granted, as in many other nonlinear cases. However, when the nonlinear estimators are in fact more efficient, simulations show not only that they tend to have smaller finite-sample bias, but also that their performance under a misspecified model compares well with the linear IV or LFES models. The application gives an example in which this nonlinear methodology can be used successfully. A nonlinear method is expected to perform better when the variance of the conditional ATE is not substantial, as in the Botswana fertility example.

Chapter 3

A Two-Regime CRC Model with Nonnegative Dependent Variable

3.1 Introduction

Correlated Random Coefficient (CRC) models, first introduced by Heckman and Vytlacil (1998), provide an alternative way to model individual heterogeneity. While the main focus since their inception has been on clarifying the conditions for identifying the parameters of interest in the presence of CRCs (Wooldridge 1997, 2003, 2005; Card 2001), more recent work has turned to extensions in which the support of the dependent variable is of a limited nature. Wooldridge (2007) suggested a method for estimating Average Partial Effects (APE) when a count dependent variable is present along with random coefficients correlated with the variable of interest. The cases of interest in this article are very similar, except that a binary variable within a counterfactual causal model is taken as the starting point.
Specifically, the goal of this paper is to provide an estimation method for Average Treatment Effects (ATE) when the dependent variable is a count variable and all the covariates have CRCs that are correlated with the binary treatment variable. This paper is organized as follows. In Section 2, the previous discussion of CRC models is reviewed. Section 3 discusses various CRC models proposed so far in the context of ATE. Section 4 is the core of this article, where the ATE estimation method for the Nonlinear Two-regime CRC model is provided. Section 5 discusses two specification tests: the test for endogeneity is designed to detect correlatedness of the random coefficients, while the model selection test chooses between the models with and without random coefficients. The method in this article depends on some distributional assumptions, and the sensitivity of estimation performance to those assumptions is examined by Monte Carlo simulations in Section 6. Section 7 is an empirical application of the method, in which the performance of competing estimators is compared. Section 8 concludes.

3.2 Previous Literature

Random coefficients typically arise from a model in which an unobserved variable interacts with one or more observed variables. This calls for different approaches because of the random nature of the coefficients on the observed variables. The randomness of the coefficients makes the conventional partial effect random, and the quantities of interest are usually the means of the random coefficients. In other words, we would like to estimate not the partial effect but the average partial effect. If a regressor is mean independent of the unobserved variables, then its APE can be consistently estimated by OLS (see Amemiya, 1985). Let the unobserved heterogeneity be denoted q and the 1 × K vector of exogenous variables x.
Then, in the case where E[q|x] ≠ 0, even if E[q|x] can be appropriately modeled as a function of x, the APE of a particular variable in x cannot be consistently estimated as long as that function contains x with degree one. In such cases, q needs to be expressed through proxy variables that do not contain x in order to consistently estimate the APE of a variable in x.

Now consider a CRC model where E(q|x) ≠ 0 and proxy variables are not available. In a CRC model, "no correlation" means mean independence, not the usual orthogonality:

y = α0 + (α1 + q1)x + u.

Assume for now that the regressor of interest is exogenous, i.e. orthogonal to the structural error. The model then has the regressor itself inside the composite error q1 x + u. Unlike in the mean-independent case, this composite error does not vanish under the conditional expectation operation, and it renders the regressor effectively endogenous. Heckman and Vytlacil (1998) and Wooldridge (2003) have shown that, under some assumptions, the instrumental variables method can be used to consistently estimate α1. In sum, when the random coefficients are uncorrelated, OLS gives a consistent estimator under mean independence of q conditional on the regressor. For CRC models, where mean independence is violated, even a regressor that is independent of the structural error u in the population effectively becomes endogenous through the random part of the coefficient. In what follows, we assume that the regressor is independent of the structural error for the sake of simplicity. Restrictive as this independence assumption might seem, the solutions for the CRC case usually carry over to the nonindependent cases as well (see Wooldridge, 2010). Note that, given a particular regressor x, the coefficients on the other regressors can be both random and correlated with x.
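The effective endogeneity created by a correlated random coefficient can be seen in a small simulation. This is a hypothetical sketch (all numbers are illustrative, not from the text): x is independent of the structural error u, yet OLS is biased for α1 because the composite error q1 x + u covaries with x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# y = a0 + (a1 + q1) * x + u with E[q1 | x] != 0 (correlated random coefficient)
x = rng.normal(2.0, 1.0, n)
q1 = 0.5 * (x - x.mean()) + rng.normal(0.0, 1.0, n)  # q1 correlated with x
u = rng.normal(0.0, 1.0, n)                          # structural error, independent of x
y = 1.0 + (2.0 + q1) * x + u                         # true mean coefficient a1 = 2

# OLS of y on x: the composite error q1*x + u does not vanish conditionally,
# so the slope estimate is biased away from a1 = 2 (here toward roughly 3)
X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Here cov(q1·x, x) > 0 by construction, so the OLS slope converges to a value above the mean coefficient 2, illustrating why IV or control function corrections are needed.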
The first case to consider is the model where the CRC with respect to a specific regressor appears only on that regressor, i.e. the coefficients on all other covariates are assumed to be constants. The problem with this case is that the usual IV conditions are inadequate: even if mean independence between the IV and the structural error, and between the IV and the CRC, is established, orthogonality between the IV and the composite error is not guaranteed. In order for the IV to be effective, it has to satisfy the additional condition that E[q1 x|z] is constant (Wooldridge, 2003).[1] Although this condition is not very strong, its validity may be called into question when x is not continuous (Card, 2001; Wooldridge, 2005). For cases where the conditions for IV are not met, there is another method that uses control function approaches under stronger assumptions. Heckman (1976) provides a solution for a binary regressor, whereas Garen (1984) does so for a continuous one. Consider the equation

y1 = η1 + z1 δ1 + a1 y2 + y2 z1 γ1 + u1,     (3.1)

where z1 is a 1 × L1 vector of exogenous variables and δ1 and γ1 are L1 × 1 coefficient vectors. The convention here is that Greek letters denote constants and Roman letters denote random variables. Regressors may or may not have random coefficients, which in turn may or may not be correlated with the structural error. The variables z1 that have constant coefficients and no correlation with u1 will be called exogenous variables throughout this article. Any variable that is correlated with the structural error, or that at least has a correlated random coefficient, will be called an endogenous variable; for instance, in the above equation the endogenous variable y2 not only is correlated with u1 but also has a CRC. One can also enrich the model by introducing interaction terms between exogenous and endogenous variables, such as y2 z1 in equation (3.1).
Let us impose some restrictions here: suppose that y2 is a binary endogenous variable and that the coefficient γ1 on the interaction term is constant. The model then simply becomes the Linear Full Endogenous Switching Regression, or LFES, model (Keay, 2011), with two regimes between which an individual can switch. In this model all the coefficients in each regime are nonrandom. For this LFES model it is also possible to introduce CRCs on the exogenous variables in each equation (Wooldridge, 2007). Equivalently, one can construct such a two-regime CRC model, i.e. a two-regime regression model where the covariates in each equation have random coefficients correlated with y2, by directly introducing CRCs not only on the endogenous variable but also on the exogenous variables and the interaction. This gives an equivalence result: the two-regime CRC model can be constructed by putting CRCs on all the regressors in the equation, including the interaction terms.

[1] Although this condition is stated for IV estimation in CRC models, it is still needed for a random coefficient that is uncorrelated with the regressor. Even if E[q1 x] = 0, it does not follow that E[q1 x|z] is constant. Again, "no correlation" in the CRC context is about mean independence, not orthogonality.

The discussion so far has concerned models with a continuous structural error and hence a continuous dependent variable. In this paper I provide a model where the dependent variable takes nonnegative values or, more specifically, natural numbers. Such cases arise very often in practice, for example when the dependent variable is commuting frequency (Terza, 1998) or the number of children (Keay, 2011). In addition to the nonlinearity in the first-stage binary choice, another nonlinearity in the dependent variable of the structural equation must be taken care of. Although Terza (2009) has already considered such a nonlinear model with two regimes without CRCs, its extension with CRCs on all exogenous variables has not been proposed so far.
In the following, the former model will be called the Nonlinear Full Endogenous Switching Regression, or NFES, model and the latter the Nonlinear Two-regime CRC, or NTCRC, model. The linear version of the two-regime CRC model was already proposed by Wooldridge (2007) and will be called the Linear Two-regime CRC, or LTCRC, model. In what follows, the Poisson distribution is used to model the nonnegative response in the structural model.

[Figure 3.1: Analogy Structure. A diagram in which the linear model with interaction leads, via a CRC on y, to the LFES model and, via CRCs on all variables (equivalently, on z as well), to the LTCRC model; analogously, the nonlinear model with interaction leads, via a CRC on y, to the NFES model and, via CRCs on all variables, to the NTCRC model.]

A natural ensuing question is whether all the results for the linear model remain valid to the same extent. The model itself does not undergo much change, although some quantities of interest will not be identified, as discussed in the following sections. The problem with the NFES approach is that the intercept term inside the exponential function in the structural equation is not identified by itself; instead, Terza (2009) estimates the average treatment effect for the nonlinear model, and that approach is also employed for the two-regime CRC model in this paper. As already mentioned, a CRC on the binary endogenous variable produces a two-regime switching regression model. The objective of this paper is to set out an NTCRC model by once again bringing CRCs into the two-regime nonlinear switching model, in which the average treatment effect is identified. Incidentally, this is equivalent to the model where all the variables, including the interaction, have CRCs. The above discussion is summarized in the analogy scheme in Figure 3.1.

3.3 Various Models of CRC

3.3.1 Continuous Endogenous Variable

Let the variable with which the random coefficients are correlated be denoted w.
By the unobserved heterogeneity argument, the variable w is usually regarded as endogenous, which is also the case here. The CRC can be either only on w or on all the other variables as well. In the population regression model, the variable w and all the other covariates can be interacted, which creates many possible combinations: CRC only on w with no interaction, CRC only on w with interaction, CRC on all variables with no interaction, and finally CRC on all variables with interaction. Consider the regression model

y = xβ + τw + γq + qxδ + u,

where x is a 1 × K vector of exogenous variables and β and δ are K × 1 vectors. The variables q and w are assumed to be scalars. The former is an unobserved variable that is correlated with the latter but interacts with x. One characteristic of this model is that w is correlated with the random coefficients on x, i.e. β + qδ, and at the same time is an endogenous variable in the traditional sense because it is correlated with the composite error γq + u. If an interaction term between x and w is included in this model, and that interaction term is in turn interacted with the unobserved variable q, then, given that w is binary, the resulting model is the two-regime CRC model discussed below:

y = x(β + δ0 q) + w(τ + δ1 q) + wx(κ + δ2 q) + γq + u.

A random coefficient model with an endogenous variable correlated with the coefficients on an exogenous variable is created in this manner. This model turns out to be a very good description of education-fertility behavior, as discussed in later sections. In addition, the model is very general in that many previous models can be treated as special cases. When w is an endogenous binary variable, the restriction δ0 = δ2 = 0 reduces the model to an ordinary two-regime switching model, and the additional restriction δ1 = κ = 0 reduces it to the Endogenous Treatment model of Terza (1998, 2009).
There are many other possibilities that are nonetheless special cases of the model given above. Let us focus on a continuous w here, deferring discussion of the binary case to subsequent sections. With a count dependent variable, suppose the conditional expectation can be modeled with an exponential link function. This approach explicitly treats the dependent variable of the structural model as a count variable that does not take negative values. As the simplest case, consider a model with a CRC only on w and no interaction:

E(y1|z, w, a1, r1) = exp(xδ1 + a1 w + r1)
w = zδ2 + v2,

where x and z are 1 × L1 and 1 × L vectors of exogenous variables with x ⊂ z, independent of all the random variables generated in the model. The coefficient vectors δ1 and δ2 are L1 × 1 and L × 1 respectively. Also assume that

a1 = α1 + d1
d1 = ψv2 + e_d,   v2 ⊥ e_d
r1 = θv2 + e_r,   v2 ⊥ e_r.

In the first equation, α1 = E[a1] and d1 = a1 − α1. The second and third equations are basically the linear projections of each random variable on v2. However, we need the stronger condition that the orthogonal decompositions are not only uncorrelated but independent in order to facilitate identification. The regressor w suffers from endogeneity unless ψ = θ = 0. Along with these assumptions, we place a multivariate normal assumption on the random variables (d1, r1, v2). Under these assumptions, Wooldridge (2007) derived the conditional mean function

E(y1|z, v2) = exp( xδ1 + α1 w + ψv2 w + θv2 + (σ_r² + 2σ_dr w + σ_d² w²)/2 ).

Although this equation is estimable once v2 is estimated in a first-stage regression, the semi-elasticity of w, i.e. α1, is not identified because of the correction term: its linear part in w means the overall coefficient on w is α1 + σ_dr, so α1 cannot be separated out. This remains the case even if w is exogenous, where ψ = 0 and θ = 0. In order to identify α1, the random coefficient on w would have to be uncorrelated not only with w but also with r1.
Although the semi-elasticity is not identified, one can still identify the APE of w given z, w and v2. From the above equation, the partial effect of w is

∂E(y1|z, w, v2)/∂w = exp( xδ1 + α1 w + ψv2 w + θv2 + (σ_r² + 2σ_dr w + σ_d² w²)/2 ) · (α1 + σ_dr + ψv2 + σ_d² w).

By averaging over the whole population, the APE of w is identified. Wooldridge (2007) also discusses the extension in which the coefficients on the other covariates are correlated with w. The results are basically the same: the APE is identified although the semi-elasticity is not.

3.3.2 Binary Endogenous Variable

Recall Figure 3.1. We are already familiar with the models with a CRC only on y, which are equivalent to the two-regime endogenous switching regression (see Maddala, 1986). For the linear and nonlinear cases one can use the correction methods proposed by Heckman (1978) and Terza (1998) respectively. As an extension in the linear model, the case with CRCs not only on the endogenous variable but on all the other variables, i.e. the Linear Two-regime CRC (hereafter LTCRC) model, was explored by Wooldridge (2007). Consider the regression model

y1 = xd1 + a1 y2 + xy2 g1 + u,

where x is a 1 × K vector of exogenous covariates and d1 and g1 are K × 1 vectors of CRCs. We also have a vector z of all exogenous variables such that x ⊂ z. As in Chapter 2, x is assumed to be demeaned without loss of generality. Wooldridge (2007) shows that, under the assumptions

E(a1|z, v2) = α1 + φ1 v2
E(d1|z, v2) = δ1 + ψ1 v2
E(g1|z, v2) = ξ1 + ω1 v2,

the estimating equation is

E(y1|z, y2) = xδ1 + α1 y2 + y2 xξ1 + ρ1 h2(y2, zδ2) + φ1 h2(y2, zδ2)y2 + h2(y2, zδ2)xψ1 + h2(y2, zδ2)y2 xω1,

where h2(y2, zδ2) = y2 λ(zδ2) − (1 − y2)λ(−zδ2) and λ(·) is the inverse Mills ratio. From this model and estimating equation one can identify the ATE of y2 on y1, i.e. α1.
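The correction function h2 above is the usual probit generalized residual. A minimal sketch, using only the standard library (the inverse Mills ratio is built from the error function; this is illustrative code, not the dissertation's):

```python
import math

def pdf(t):
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def cdf(t):
    """Standard normal CDF Phi(t), via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def inv_mills(t):
    """Inverse Mills ratio lambda(t) = phi(t) / Phi(t)."""
    return pdf(t) / cdf(t)

def h2(y2, index):
    """Generalized probit residual h2(y2, z*delta2) used as the control
    function: y2 * lambda(z*delta2) - (1 - y2) * lambda(-z*delta2)."""
    return y2 * inv_mills(index) - (1 - y2) * inv_mills(-index)
```

In a two-step procedure, `index` would be the fitted probit index from the first stage, and h2 (and its interactions with y2 and x) would then enter the second-stage regression as additional regressors.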
In the next section, the main contribution of this article, the Nonlinear Two-regime CRC (NTCRC) model and its estimation method, is presented.

3.4 Nonlinear Two-Regime CRC Model

3.4.1 Model

In what follows we allow the variable of interest w to be binary, which calls for an analysis of the ATE under the counterfactual framework. In this setting, the first-stage binary selection equation is assumed to follow a probit model, which makes the estimation more efficient. For continuous w, the average of the semi-elasticity of w, i.e. α1, was not identified, while the APE was. Analogously, although the average semi-elasticity of w is not identified here, the ATE will be. Consider the model, in terms of conditional expectation functions,

E(y0|z, w, a0, b0, a1, b1) = E(y0|x, a0, b0) = exp(a0 + xb0)
E(y1|z, w, a0, b0, a1, b1) = E(y1|x, a1, b1) = exp(a1 + xb1)     (3.2)
w = 1[zγ + v > 0],

where the a_g are scalar errors, x is a 1 × K vector of covariates, and the b_g are K × 1 random coefficient vectors. Here z is the vector of all available exogenous variables, with x ⊂ z. For the random coefficients, our quantities of interest are the means of the random variables. In what follows we use the notation var(x) ≡ σ²(x) and cov(x, y) ≡ σ(x, y), and the regime subscript is suppressed whenever it is obvious. The assumptions for the analysis are as follows.

Assumptions

1. Unconfoundedness: As already stated in equation (3.2), w is redundant conditional on a and b, which summarize all the information about the determined regime, i.e. E(y1|z, w, a0, b0, a1, b1) = E(y1|x, a1, b1), and likewise for regime 0. Although the main focus is on w, this equation also imposes the same condition on the unobserved heterogeneity of the other regime.

2. Multivariate Normality: For each regime in equation (3.2), write

a = α + e
b = β + d,

where α and β are the means of a and b.
The vector (e0, d0, e1, d1, v), of dimension (2K + 3) × 1, follows a multivariate normal distribution, i.e. (e0, d0, e1, d1, v)′ ∼ N(0, V), where V is the appropriate variance-covariance matrix. Write the linear projections of e and d on v as

e = σ(e, v)v + ϵ
d = σ(d, v)v + δ

(recall that var(v) = 1 under the probit normalization, so the projection coefficients are simply the covariances). Note that b, d, δ and σ(d, v) are K × 1 vectors; specifically, σ(d, v) = [σ(d1, v), · · · , σ(dK, v)]′, where bj = βj + dj is the j-th element of b. Since v and ϵ, δ are orthogonal, they are independent under multivariate normality. Another implication of this assumption is that the selection equation follows a probit model.

3. Conditional Independence: v and ϵ + xδ are independent conditional on z, w, i.e. v ⊥ (ϵ + xδ) | z, w.

Under these assumptions, one can obtain the following lemma.

Lemma 3.1. Under Assumptions 2 and 3,

E[exp(e1 + xd1)|z, w = 1]
= exp[ ( σ²(e1) + Σ_{j=1}^K σ²(d1j)x_j² + 2 Σ_{j=1}^K σ(e1, d1j)x_j + Σ_{j=1}^K Σ_{r≠j} σ(d1j, d1r)x_j x_r ) / 2 ]
× Φ( zγ + σ(e1, v) + Σ_{j=1}^K σ(d1j, v)x_j ) / Φ(zγ)

for regime 1, and

E[exp(e0 + xd0)|z, w = 0]
= exp[ ( σ²(e0) + Σ_{j=1}^K σ²(d0j)x_j² + 2 Σ_{j=1}^K σ(e0, d0j)x_j + Σ_{j=1}^K Σ_{r≠j} σ(d0j, d0r)x_j x_r ) / 2 ]
× Φ( −zγ − σ(e0, v) − Σ_{j=1}^K σ(d0j, v)x_j ) / Φ(−zγ)

for regime 0.

Proof. Consider regime 1 only; for notational simplicity the regime subscripts are suppressed, and the derivation for regime 0 is almost identical. Note that

E[exp(e + xd)|z, w = 1] = E[ exp( σ(e, v)v + ϵ + xσ(d, v)v + xδ ) | z, w = 1 ]
= E[ exp( (σ(e, v) + xσ(d, v))v ) exp(ϵ + xδ) | z, w = 1 ]
= E[ exp( (σ(e, v) + xσ(d, v))v ) | z, w = 1 ] × E[exp(ϵ + xδ)|z, w = 1]
= E[ exp( (σ(e, v) + xσ(d, v))v ) | z, w = 1 ] × E[exp(ϵ + xδ)|z]
= exp( (σ(e, v) + xσ(d, v))²/2 ) × [ Φ( zγ + σ(e, v) + xσ(d, v) ) / Φ(zγ) ] × E[exp(ϵ + xδ)|z].

For the second term on the RHS,

( σ(e, v) + xσ(d, v) )² = ( σ(e, v) + Σ_{j=1}^K σ(dj, v)x_j )²
= σ²(e, v) + 2 Σ_{j=1}^K σ(e, v)σ(dj, v)x_j + Σ_{j=1}^K σ²(dj, v)x_j² + Σ_{j=1}^K Σ_{r≠j} σ(dj, v)σ(dr, v)x_j x_r.
Now consider the third term on the RHS. From the normality of ϵ + xδ, guaranteed by the multivariate normal assumption,

E[exp(ϵ + xδ)|z] = exp( var[ϵ + xδ | x] / 2 ),

and

var[ϵ + xδ | x] = σ²(ϵ) + σ²(xδ) + 2σ(ϵ, xδ)
= σ²(ϵ) + Σ_{j=1}^K σ²(δj)x_j² + Σ_{j=1}^K Σ_{r≠j} σ(δj, δr)x_j x_r + 2 Σ_{j=1}^K σ(ϵ, δj)x_j.

Collecting these terms gives the stated result.

Remark. Terza (1998) directly used the multivariate normal property to solve the conditional expectation; that is, E[exp(ϵ)|z, v] = exp( ρσv + σ²(1 − ρ²)/2 ) was derived under the multivariate normal assumption between ϵ and v. Although that approach is also applicable here, it does not give a convenient expression for estimating the average treatment effect. In the derivation above I instead used the linear projections of e and d on v, which yield an estimating equation whose coefficients can be used directly to find the ATE. As will be discussed below, the structural parameters such as β are not independently identified, but the coefficients on x and x² in the estimating equation derived from the linear projection coincide with the corresponding coefficients of the equation for the ATE.

Now let us derive an estimating equation for the model in (3.2). By Assumptions 1 and 2,

E[y|z, w, a0, b0, a1, b1] = (1 − w)E[y0|z, w, a0, b0, a1, b1] + wE[y1|z, w, a0, b0, a1, b1]
= (1 − w)E[y0|x, a0, b0] + wE[y1|x, a1, b1]
= (1 − w) exp( α0 + e0 + x(β0 + d0) ) + w exp( α1 + e1 + x(β1 + d1) ),

so that

E[y|z, w] = (1 − w) exp(α0 + xβ0)E[exp(e0 + xd0)|z, w = 0] + w exp(α1 + xβ1)E[exp(e1 + xd1)|z, w = 1]

by Lemma 3.1.
Writing this out in full,

E[y|z, w] = w · exp[ ( 2α1 + σ²(e1) + Σ_{j=1}^K σ²(d1j)x_j² + 2 Σ_{j=1}^K [β1j + σ(e1, d1j)]x_j + Σ_{j=1}^K Σ_{r≠j} σ(d1j, d1r)x_j x_r ) / 2 ] × Φ( zγ + σ(e1, v) + Σ_{j=1}^K σ(d1j, v)x_j ) / Φ(zγ)
+ (1 − w) exp[ ( 2α0 + σ²(e0) + Σ_{j=1}^K σ²(d0j)x_j² + 2 Σ_{j=1}^K [β0j + σ(e0, d0j)]x_j + Σ_{j=1}^K Σ_{r≠j} σ(d0j, d0r)x_j x_r ) / 2 ] × Φ( −zγ − σ(e0, v) − Σ_{j=1}^K σ(d0j, v)x_j ) / Φ(−zγ).     (3.3)

This equation identifies only σ²(dj), σ(e, v) and σ(dj, v) separately; the average semi-elasticity β is not identified because of the nuisance parameters σ(e, dj). The equation can be estimated in two steps using the probit model for the selection equation: given the estimated index zγ, it can be estimated either by NLS or by quasi-ML under the Poisson distribution. Alternatively, the parameters can be estimated simultaneously by a single-step method. See Chapter 2 for a detailed discussion of the estimation procedures and their asymptotic distributions. Since the structural parameters α and β are not separately identified, our interest should lie in the average treatment effect.

3.4.2 Identification of ATE

Note that the ATE is E[y1 − y0]. One of the easiest ways to identify it is through the law of iterated expectations, i.e. E[y1 − y0] = E{E[y1 − y0|x]}, as long as the expectation conditional on x can be derived. To that end, the following lemma is derived.

Lemma 3.2. Under the assumptions given above,

E[exp(e + xd)|x] = exp[ ( σ²(e) + Σ_{j=1}^K σ²(dj)x_j² + 2 Σ_{j=1}^K σ(e, dj)x_j + Σ_{j=1}^K Σ_{r≠j} σ(dj, dr)x_j x_r ) / 2 ].

Proof. Note that

E(e + xd|x) = E(e|x) + E(xd|x) = 0
var(e + xd|x) = var(e|x) + var(xd|x) + 2cov(e, xd|x) = σ²(e) + Σ_{j=1}^K σ²(dj)x_j² + Σ_{j=1}^K Σ_{r≠j} σ(dj, dr)x_j x_r + 2 Σ_{j=1}^K σ(e, dj)x_j.

For any fixed value of x, e + xd is normally distributed, since it is a linear combination of multivariate normal random variables, as stated in Assumption 2.
Since the mean and variance of this normal random variable are already obtained, the mean of its log-normal transform follows trivially.

The ATE conditional on x is

E[y1 − y0|x] = exp(α1 + xβ1)E[exp(e1 + xd1)|x] − exp(α0 + xβ0)E[exp(e0 + xd0)|x].

Thus, from the above lemma,

E[y1 − y0|x]
= exp[ ( 2α1 + σ²(e1) + Σ_{j=1}^K σ²(d1j)x_j² + 2 Σ_{j=1}^K [β1j + σ(e1, d1j)]x_j + Σ_{j=1}^K Σ_{r≠j} σ(d1j, d1r)x_j x_r ) / 2 ]
− exp[ ( 2α0 + σ²(e0) + Σ_{j=1}^K σ²(d0j)x_j² + 2 Σ_{j=1}^K [β0j + σ(e0, d0j)]x_j + Σ_{j=1}^K Σ_{r≠j} σ(d0j, d0r)x_j x_r ) / 2 ].     (3.4)

Note the similarity of this equation to the estimating equation (3.3): except for the correction functions, the expressions inside the exponential functions are identical, which makes the identification of the ATE possible. The estimator is the sample average of equation (3.4) over the values of x. The fact that each parameter is not separately identified poses no problem for ATE identification. The asymptotic distribution of this ATE estimator is essentially the same as that presented in Terza (2008).

3.5 Specification Tests

3.5.1 Tests for Endogeneity

In the previous section we derived the estimating equation (3.3) and the ATE conditional on covariates (3.4). No restrictions were imposed in these equations, and estimation of the unrestricted model, in which the number of identifiable composite parameters is no less than (K + 4)(K + 1)/2, might cause difficulty in numerical optimization. One can apply the Lagrange Multiplier (LM) test for the presence of correlated random coefficients. Another test, the Variable Addition Test (VAT), is also available; it is asymptotically equivalent to the LM test but easier to apply. This test constructs a conditional mean by adding appropriately defined variables so that the score of the resulting likelihood function under the restriction is the same as the one used in the LM test (see Wooldridge, 2011).
The actual test is performed as a Wald test of the significance of the coefficients on the added variables. Revisit the estimating equation (3.3). Inside the exponential function we have each covariate, the square of each covariate, and their cross products. Let the vector of these functions of the covariates be denoted x̃ (to distinguish it from the original covariate vector x), and rewrite the estimating equation for regime 1 as

E[y|z, w = 1] = exp(x̃θ1) × Φ(zγ + θ2 + xθ3) / Φ(zγ),     (3.5)

where θ2 = σ(e1, v) and θ3 = σ(d1, v). As already mentioned, θ2 is a scalar, θ3 is a K × 1 vector, and θ1 is a (K + 2)(K + 1)/2 × 1 vector. The estimating equation for regime 0 is the same except for the minus signs inside Φ(·) in the correction function. In order to perform the VAT, we need to construct a conditional mean function whose score is identical to that of equation (3.5) under the restriction. Let the restriction, or equivalently the null hypothesis, be H0: θ2 = θ3 = 0; in other words, there is no correlation between the selection error v and any of the other errors in the structural equations. An LM test can then be applied that does not require estimation of the complicated unrestricted model. The VAT is a device that facilitates this LM test through the auxiliary regression for regime 1,

exp( x̃θ1 + θ2 λ(zγ) + θ3 λ(zγ)x ),     (3.6)

where λ(·) is the inverse Mills ratio. The auxiliary regression for regime 0 has −zγ in place of zγ. It is easily verified that the likelihood and score functions derived from (3.6) under the restriction are the same as those from (3.5). Therefore the score test of the significance of the coefficients in the above auxiliary regression is equivalent to the LM test on (3.5), and the Wald test on (3.6) gives a simple and asymptotically equivalent way to test the null without estimating the unrestricted model. In the actual test, λ(zγ) has to be estimated through the first-stage regression.
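Constructing the VAT added variables is mechanical. The following is a minimal sketch (function and variable names are mine, not the text's): it builds λ(zγ̂) and λ(zγ̂)·x, flipping the sign of the index for regime-0 observations as in the regime-0 auxiliary regression; one would then append these columns to the exponential regression (3.6) and run a Wald test on their coefficients.

```python
import math
import numpy as np

def inv_mills(t):
    """Inverse Mills ratio lambda(t) = phi(t) / Phi(t), elementwise."""
    t = np.atleast_1d(np.asarray(t, float))
    phi = np.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + np.array([math.erf(s / math.sqrt(2.0)) for s in t]))
    return phi / Phi

def vat_regressors(X, w, index):
    """Added variables for the VAT: lambda(z*gamma) and lambda(z*gamma)*x.
    Regime-1 rows (w = 1) use the index z*gamma directly; regime-0 rows
    use -z*gamma, matching the regime-0 auxiliary regression."""
    X = np.asarray(X, float)
    s = np.where(np.asarray(w) == 1, 1.0, -1.0)   # sign flip for regime 0
    lam = inv_mills(s * np.asarray(index, float))
    return np.column_stack([lam, lam[:, None] * X])
```

In practice `index` would be the fitted probit index zγ̂ from the first stage, and the Wald test would be carried out on the K + 1 added columns jointly.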
3.5.2 Model Selection Test

What we considered above is whether the given random coefficients are correlated with the selection error. Even when that null hypothesis is not rejected, it does not warrant falling back to the Nonlinear Full Endogenous Switching (NFES) regression of Terza (1998). In the current model, the presence of CRCs is assumed in both a and b in equation (3.2), and as a natural consequence one also assumes multivariate normality between those errors and v. In contrast, the coefficient b is assumed constant in NFES, and the multivariate normal joint error distribution is assumed only between a and v. It should be emphasized that the VAT tests for endogeneity under the assumption that b is random, so a failure to reject the null hypothesis does not imply acceptance of NFES. A practical consequence of this observation is that one has to include the squares and cross products of the covariates, which are not included in NFES, even when there turns out to be no correlation among the random coefficients. How, then, can one test against the NFES specification, i.e. perform a test in which the NFES model is the restricted (null) model? The NFES model assumes that the coefficient b in equation (3.2) is constant, so that d = 0 and hence σ²(dj) = σ(·, dj) = 0. From equation (3.3), the null hypothesis is therefore the zero restriction on the coefficients of the square and cross-product terms inside the exponential functions and on the coefficients on the xj inside the correction functions.

3.6 Monte Carlo Simulation

The purpose of the simulation is twofold. First, it compares the performance of this nonlinear ATE estimator with the linear method proposed by Wooldridge (2007), showing the advantage of using the nonlinear model for estimating the ATE. Second, it examines the impact of violations of the assumptions used to derive the estimating equation in the previous section.
3.6.1 Data Generating Processes

The derivation of the estimating equations depends on the three assumptions listed above. In this section we examine the performance of the ATE estimator under violations of the first and second assumptions only. In order to check robustness to those assumptions, a series of simulations is run using the following data generating processes (DGPs):

x ∼ Uniform[15, 50]
z ∼ Binomial(1, 1/2)
w = 1[1.4 − 0.05x − 0.3z + v ≥ 0]
E(y1|x, z, e1, d1) = exp( 0.15 + e1 + (0.02 + d1)x )
E(y0|x, z, e0, d0) = exp( 0.1 + e0 + (0.0059 + d0)x ),

where y1 and y0 follow Poisson distributions with the means specified above. This DGP was designed to make the population ATE approximately equal to unity. There are three sessions of simulations. DGP 1 is the ideal case, where the errors follow a multivariate normal distribution and v ⊥ ϵ + xδ. The joint distribution of the errors for DGP 1 is

(e0*, d0*, e1*, d1*, v)′ ∼ N(0, C),   C =
[ 1  ρ  0  0  ρ ]
[ ρ  1  0  0  ρ ]
[ 0  0  1  ρ  ρ ]
[ 0  0  ρ  1  ρ ]
[ ρ  ρ  ρ  ρ  1 ].

This variance-covariance matrix is in fact a correlation matrix, since the variances are all equal to unity. For the simulation, e0 = e0*/100, e1 = e1*/100, d0 = d0*/100 and d1 = d1*/100 were used, and the correlation values ρ = 0.3 and ρ = 0.5 were used. This is the case where the above estimator should perform best, since the actual DGP conforms to the assumptions on which the estimating equation is based. DGP 2 generates a process in which the linear projection errors ϵ and δ are uncorrelated with, but dependent on, v: for a given v, ϵ is designed to be either v or −v, with probability one-half each.
The scattergram then looks like the letter X: the covariance of the two variables is zero, yet they are strongly dependent on each other. Another important assumption for Lemma 3.1 is that v follows a normal distribution, which is what introduced the normal cumulative distribution function Φ(·) into the equation. To check the robustness of the estimator to that assumption, DGP 3 uses the χ²(5) distribution to create skewed errors. The variance-covariance matrix looks just the same as in the above equation, but we drop the multivariate normal assumption; all the marginal distributions of the errors follow an adjusted χ² distribution with the predetermined correlation.²

²There are infinitely many such joint distributions, and the multivariate χ² distribution, which is not used here, is just one of them. It is not used because (1) it is hard to generate in Stata, and (2) what matters is simply the failure of normality for each marginal distribution and its impact on the estimation. In DGP 3, I generated the χ² distributions from many underlying standard normal draws; the correlations were created by sharing a common standard normal component. The Stata code will be provided upon request.

3.6.2 Simulation Results

Tables F.1 and F.2 list the simulation results for ρ = 0.3 and 0.5, respectively. In each table, the columns labeled I, II, and III list the results for DGPs 1, 2, and 3, for both the linear and nonlinear estimators. Let us first consider the performance of the nonlinear estimator. In both Table F.1 and Table F.2, column I (DGP 1) shows that the ATE estimator performs well, as expected. The results show a nontrivial number of outlying values at sample size 1000: the numerical optimization often does not work well in smaller samples and produces many dubious estimates.
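The "letter X" construction in DGP 2 is easy to verify numerically. Below is a minimal Python sketch (an illustration, not the original Stata code): draw v, flip a fair coin, and set ε = ±v. The sample correlation is near zero even though ε is a deterministic function of v and the coin flip.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
v = rng.normal(size=n)
sign = rng.choice([-1.0, 1.0], size=n)  # fair coin
eps = sign * v                          # eps = v or -v with prob 1/2 each

corr = np.corrcoef(v, eps)[0, 1]                 # near 0: uncorrelated...
dep = np.corrcoef(np.abs(v), np.abs(eps))[0, 1]  # equals 1: ...but |eps| = |v|
```

A scatter of `eps` against `v` traces the two diagonals of an X, so any linear projection of ε on v has zero slope while the dependence remains perfect.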
In Tables F.1 and F.2, outlying values larger than the maximum or smaller than the minimum value observed at sample size 3000 were discarded from the generated estimates; roughly 8 to 10% of the draws are removed in this way. The full data are kept, however, for the median and MAD calculations, which are not sensitive to extreme values. There is no discernible difference between ρ = 0.3 and ρ = 0.5.

In both tables, the column for DGP 2 shows no clear sign of an adverse impact from the violation of the independence assumption. We can therefore conclude that violating the independence assumption is practically negligible. The last column is for DGP 3, where the multivariate normal assumption is violated. In this case, it is clearly visible that the estimator suffers from larger variance and bias even with 5000 observations. This happens because Lemma 3.1 uses the fact that E[exp(u)] = exp(σ²/2) under normality; if normality fails, the whole estimating equation is misspecified, causing inconsistency.

A comparison between the linear and nonlinear estimators shows that the nonlinear estimator has larger variance, particularly when the sample size is small. However, it is also clear that the linear estimators show no sign of consistency: the nonlinear estimator keeps approaching the true value of one as the sample size grows from 3000 to 5000, whereas the linear estimator shows no convergence. Another interesting aspect is that the skewed error distribution affects not only the nonlinear estimator but also the linear one. This is particularly true in Table F.2; in column III, the linear estimator actually moves away from the true value of one. In sum, although it has larger variance, the nonlinear estimator is better than the linear one in terms of consistency, and it is preferable to use the nonlinear ATE estimator when the sample size is sufficiently large.
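The outlier handling described above can be sketched as follows (a Python illustration with hypothetical replication draws; the dissertation used Stata): discard draws outside the range observed at sample size 3000 before computing the mean and standard deviation, but compute the median and MAD on the untrimmed draws.

```python
import numpy as np

def summarize(draws, lo, hi):
    """Trimmed mean/s.d. plus robust median/MAD.

    lo, hi: the minimum and maximum estimates observed at sample
    size 3000, used as trimming bounds for the mean and s.d."""
    draws = np.asarray(draws, dtype=float)
    kept = draws[(draws >= lo) & (draws <= hi)]
    median = np.median(draws)                # full data: robust to outliers
    mad = np.median(np.abs(draws - median))  # median absolute deviation
    return {"mean": kept.mean(), "sd": kept.std(ddof=1),
            "median": median, "mad": mad,
            "share_trimmed": 1 - kept.size / draws.size}

# Toy check: two wild outliers barely move the median and MAD,
# while the mean and s.d. are computed on the trimmed draws only.
stats = summarize([0.9, 1.0, 1.1, 1.05, 50.0, -40.0], lo=0.0, hi=2.0)
```

Here the median (1.025) and MAD (0.1) are essentially unaffected by the two extreme draws, which is exactly why the full data can be kept for those statistics.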
3.7 An Application: The Effect of Elementary School Education on Fertility in Botswana

In this section, the ATE estimator derived above is applied to the Botswana fertility data set from Wooldridge (2010, Chapter 21). Detailed descriptions of the data set were already given in Chapter 2 and are omitted here. Before the regression results are presented, some justification is warranted for why the data can be analyzed in the CRC model framework. Consider the regression equation below:

children = β0 + b1·age + β2·agesq + u

For the time being, let us focus on the motivation for CRC modeling and ignore other covariates that might also be present in the structural equation. The interpretation of b1 is the marginal number of births per year of age, which is determined by the opportunity cost of childbearing (Becker and Barro, 1988; Barro and Becker, 1989). Since education increases human capital, higher education would lead to a lower b1. For now, assume β2 is constant. Then b1 is a decreasing function of years of education, and the above equation becomes

children = β0 + b1(educ)·age + β2·agesq + u.

It is assumed in the above model that agesq does not have a CRC; in other words, all the heterogeneity in the effect of age is assumed to be absorbed into age only. This assumption lets us dispense with age⁴, which could cause trouble in the exponential regression. Now suppose educ is continuous. Then for each value of educ there are uncountably many regression equations with constant coefficients. Let us divide the support of educ at the threshold of 7 years, i.e., the duration of primary education in Botswana, to obtain the model below:

children0 = β00 + b01(educ)·age + β02·agesq + u0
children1 = β10 + b11(educ)·age + β12·agesq + u1

The upper equation is defined on {ω | educ(ω) < 7}, and the lower one on the complement set. Next suppose educ is not available but only educ7 is. The information reduction is done by the following rule.
educ7 = 1 if educ ≥ 7
educ7 = 0 if educ < 7,

where educ = zδ + v. Then the above model can be written as

children0 = β00 + b01(v)·age + β02·agesq + u0
children1 = β10 + b11(v)·age + β12·agesq + u1
educ7 = 1[zδ + v > 0].

Considering the nonnegativity of children,

E[children0 | age, v, u0] = exp(β00 + b01(v)·age + β02·agesq + u0)
E[children1 | age, v, u1] = exp(β10 + b11(v)·age + β12·agesq + u1)
educ7 = 1[zδ + v > 0].

Negative correlations between the bg1 and v are expected; in other words, the σdgv's in equation (3.3) might be negative. Higher education can reduce fertility either by increasing the opportunity cost of childbearing or by providing people with more information on, say, contraception. Although omitted for simplicity so far, other variables such as evermarr, urban, and electric, which might have CRCs too, will also be included in the regressions that follow. Therefore all the variables but agesq have CRCs, and their squares and cross products are included in the estimating equations.

Before running the regression using the NTCRC model, it is a good idea to test for endogeneity by the VAT. The test statistic for both regimes jointly, which follows χ²(10), is 46.76 with p-value .0000; the test statistics for Regime 1 and Regime 0, which follow χ²(5), are 26.13 (p-value .0001) and 20.64 (p-value .0009). Thus the null hypothesis of no endogeneity is rejected, and we run the regression with the full model.

The regression results for the LFES and NFES models are summarized in Tables F.3 and F.4. The first and second columns in Table F.3 show the results from OLS and 2SLS; the next columns are those for LFES estimated by Heckman's correction method. The NFES model is estimated using both Poisson quasi-ML and NLS. The results for the NTCRC model are provided separately in Table F.4. Although both NLS and Poisson QMLE are available for NTCRC as well, NLS does not give a reasonable value; its estimated ATE is −5.28, and thus only the QMLE results are listed.
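Poisson QMLE with an exponential conditional mean, the workhorse behind the NFES and NTCRC estimates above, can be sketched in a few lines of Python (a generic illustration on simulated data, not the dissertation's Stata estimation). The score is X′(y − exp(Xβ)) and the Newton step uses the Hessian −X′ diag(exp(Xβ)) X:

```python
import numpy as np

def poisson_qmle(y, X, tol=1e-10, max_iter=100):
    """Newton-Raphson for the Poisson quasi-log-likelihood
    sum_i [ y_i * x_i'b - exp(x_i'b) ]; consistent for b whenever
    E[y|x] = exp(x'b), even if y is not truly Poisson."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        hess = X.T @ (X * mu[:, None])  # negative of the actual Hessian
        step = np.linalg.solve(hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated check with a known exponential mean exp(0.5 + 0.2 x).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=5000)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 0.2 * x))
beta_hat = poisson_qmle(y, X)  # close to (0.5, 0.2)
```

The robustness property in the docstring is the reason QMLE is preferred here: only the conditional mean, not the full Poisson distribution, needs to be correctly specified.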
All the standard errors in Tables F.3 and F.4 are bootstrap standard errors based on 500 replications. Among them, the bootstrap standard error for the ATE in the NTCRC model was particularly large. Since the maximum number of children in the data set is 13, it is highly unlikely that the absolute value of the ATE is greater than 13; by that reasoning, all replications with an estimated ATE exceeding 13 were discarded to obtain the bootstrap standard error presented in the table. In Table F.4, each basic covariate is numerically labeled, such as age (1), evermarr (2), and so forth. Using those numbers, cov(1) = cov(δ1, v), cov(2) = cov(δ2, v), and so on.

Let us now examine the estimation results for the NTCRC model. As expected from the theory, there are indeed negative, though insignificant, correlations between the selection error and the coefficients of age for both regimes. The estimated ATE is −1.020, which is very close to the estimates from the NFES model in Table F.3. Turning to the LTCRC estimates in Table F.4, the ATE estimate is substantially larger in magnitude and highly significant, which is indicative of the inconsistency of that estimator. Remember that the LTCRC and NTCRC estimating equations are based on different specifications of the same conditional expectation, so the two estimators cannot both be consistent. As long as we are convinced that the effect of education on fertility is negative, the lesson is that the inconsistency of LTCRC can be substantial, which supports the usefulness of the NTCRC model provided in this article.

Finally, one can perform the model selection test discussed in Section 3.5.2. The null hypothesis that this model is in fact NFES is that the coefficients of all the bilateral products, as well as the covariances between δ and v, are equal to zero.
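This zero-restriction test has the standard Wald form W = θ̂′ V̂⁻¹ θ̂, which is χ²(q) under the null, where θ̂ collects the estimated coefficients on the bilateral products and the δ–v covariances and V̂ is their estimated covariance matrix. A generic sketch (illustrative numbers only, not the Botswana estimates):

```python
import numpy as np

def wald_test(theta_hat, V_hat):
    """Wald statistic for H0: theta = 0; W ~ chi2(q) under H0."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    W = float(theta_hat @ np.linalg.solve(V_hat, theta_hat))
    return W, theta_hat.size

# Illustrative values: three restricted coefficients and a
# (hypothetical) diagonal covariance matrix of the estimates.
theta = np.array([0.4, -0.2, 0.1])
V = np.diag([0.01, 0.02, 0.05])
W, df = wald_test(theta, V)
reject = W > 7.815  # chi2(3) critical value at the 5% level
```

In practice V̂ would come from the bootstrap described above, and q would be 10 per regime (20 jointly), matching the degrees of freedom reported next.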
The Wald statistics for this null hypothesis for Regime 1, Regime 0, and both regimes jointly are 106.74 (df = 10), 339.63 (df = 10), and 602.32 (df = 20), with p-values all equal to .0000, which justifies the NTCRC specification.

3.8 Conclusion

We have considered the endogenous switching regression with a count dependent variable, where random coefficients are correlated with the switching variable. The count dependent variable requires nonlinear modeling through an exponential conditional mean function. Although it is impossible to identify the mean of each CRC, we have seen that the ATE, which may be of more interest, can be identified. Simulations show that this ATE estimator based on the nonlinear two-regime CRC model performs well in large samples.

APPENDICES

Appendix A
Simulation Results for Series Estimator

Note to Tables A.1-12: Each table presents the series estimator results for a specified correlation and selection error. Simulations were run for ρ = .0, .4, .5, and .6, and for each correlation three types of selection error were considered. Within a table, the columns show the monotone functions used for the binary choice index, i.e., the identity, the inverse Mills ratio, and the normal CDF; the subcolumns p = 1, 2, and 3 specify the degree of the polynomial in the monotone function. The rows show the distribution used for u at each sample size.

Table A.1: Simulation results for series estimators with ρ = 0.0, ξ: normal
[Numeric entries omitted: the column layout was garbled in extraction.]
Table A.2: Results for ρ = 0.0, ξ: t(5)
Table A.3: Results for ρ = 0.0, ξ: χ²(5)
Table A.4: Results for ρ = 0.4, ξ: normal
Table A.5: Results for ρ = 0.4, ξ: t(5)
Table A.6: Results for ρ = 0.4, ξ: χ²(5)
Table A.7: Results for ρ = 0.5, ξ: normal
Table A.8: Results for ρ = 0.5, ξ: t(5)
Table A.9: Results for ρ = 0.5, ξ: χ²(5)
Table A.10: Results for ρ = 0.6, ξ: normal
Table A.11: Results for ρ = 0.6, ξ: t(5)
Table A.12: Results for ρ = 0.6, ξ: χ²(5)
[Numeric entries of Tables A.1-A.12 omitted: the column layout was garbled in extraction. Each table reports the mean, s.d., and RMSE for u ∼ normal, t(5), and χ²(5) at sample sizes 100, 500, and 1000, under the identity, IMR, and normal CDF monotone functions with polynomial degrees p = 1, 2, 3.]

Appendix B
Simulation Results for All Other Estimators

Note to Tables B.1-18: Each table presents the results for all estimators other than the series estimator for a specified correlation and IV. Simulations were run for ρ = .0, .4, .5, and .6. Within a table, each column specifies an estimator categorized under the different distributions assumed for u, and each row shows the distribution used for the selection error ξ at each sample size. M, S, R, and Md denote the mean, Monte Carlo standard deviation, RMSE, and median, respectively.

Table B.1: Simulation Results: Strong IV and ρ = 0.0
Table B.2: Results for Strong IV and ρ = 0.4
[Numeric entries of Tables B.1 and B.2 omitted: the column layout was garbled in extraction. The columns are IV, Heckit1, Heckit2, QLIML, and QFIML under u ∼ normal and u ∼ χ²(5); the rows report M, S, R, and Md for ξ ∼ normal and ξ ∼ χ²(5) at sample sizes 100, 500, and 1000.]
S R Md M S R Md M S R Md IV Heckit1 1.259 1.361 24.316 9.359 24.318 9.366 1.203 1.218 0.816 0.895 1.742 1.583 1.751 1.586 1.051 1.060 0.951 0.967 0.690 0.663 0.691 0.663 1.032 1.018 5.185 2.227 71.770 4.895 71.892 5.046 1.229 1.432 0.687 1.502 6.310 1.580 6.318 1.658 1.102 1.393 1.046 1.367 0.639 0.616 0.640 0.717 1.090 1.333 u: normal Heckit2 QLIML QFIML 1.195 -1.082 1.294 8.948 25.431 1.307 8.950 25.516 1.340 1.194 1.136 1.280 0.881 0.626 1.245 1.353 3.221 0.548 1.358 3.243 0.600 1.080 1.031 1.237 0.960 0.900 1.059 0.648 0.928 0.619 0.649 0.933 0.621 1.028 0.980 1.088 1.156 -1.879 1.325 5.504 24.664 1.381 5.506 24.832 1.418 1.113 0.969 1.255 0.854 0.740 1.249 1.430 2.620 0.529 1.437 2.633 0.585 1.022 1.008 1.195 0.971 0.902 1.139 0.646 1.067 0.539 0.647 1.071 0.557 0.996 0.944 1.114 117 u: χ2 (5) IV Heckit1 Heckit2 QLIML QFIML 0.320 1.171 0.847 -1.217 1.205 13.649 4.706 6.192 23.647 1.444 13.666 4.709 6.194 23.751 1.458 1.202 1.250 1.211 1.192 1.208 0.745 0.936 0.881 0.671 0.925 3.649 1.729 1.310 2.687 0.822 3.658 1.730 1.315 2.707 0.825 1.000 1.026 1.007 1.031 1.048 0.930 0.943 0.918 0.816 0.602 0.815 0.678 0.695 1.557 0.827 0.818 0.680 0.700 1.568 0.918 0.975 0.998 0.978 1.011 0.211 1.384 2.491 1.365 -0.404 1.175 43.334 8.577 9.176 25.317 1.555 43.336 8.705 9.183 25.356 1.565 1.239 1.566 1.196 1.046 1.105 0.918 1.525 0.899 0.667 0.805 2.085 1.318 1.405 3.649 0.866 2.086 1.419 1.408 3.664 0.887 1.129 1.409 1.062 0.996 0.757 1.021 1.323 0.937 0.761 0.482 0.631 0.601 0.671 1.858 0.842 0.631 0.683 0.674 1.874 0.988 1.058 1.317 0.997 0.960 0.075 Table B.3: Results for Strong IV and ρ = 0.5 obs 100 500 ξ: normal 1000 100 500 ξ: χ2 (5) 1000 M S R Md M S R Md M S R Md M S R Md M S R Md M S R Md IV Heckit1 0.979 1.357 24.720 9.161 24.720 9.168 1.251 1.233 0.775 0.869 1.864 1.589 1.878 1.595 1.052 1.075 0.935 0.952 0.715 0.670 0.718 0.671 1.030 1.025 4.799 2.193 59.319 4.692 59.440 4.841 1.265 1.483 0.648 1.464 7.227 1.549 7.236 1.617 1.093 1.390 1.031 1.342 0.644 0.591 
0.645 0.683 1.098 1.324 u: normal Heckit2 QLIML QFIML 1.158 -1.361 1.374 8.891 28.617 1.273 8.892 28.714 1.327 1.254 1.128 1.365 0.853 0.621 1.280 1.413 2.870 0.512 1.421 2.895 0.583 1.087 1.024 1.243 0.945 0.877 1.154 0.663 0.900 0.393 0.665 0.909 0.422 1.028 0.975 1.117 1.128 -0.612 1.395 5.354 33.681 1.303 5.355 33.720 1.362 1.165 1.153 1.306 0.819 0.909 1.290 1.501 3.030 0.490 1.512 3.031 0.570 1.025 1.009 1.232 0.950 0.850 1.190 0.651 1.291 0.388 0.653 1.300 0.432 0.992 0.966 1.121 118 u: χ2 (5) IV Heckit1 Heckit2 QLIML QFIML 0.695 1.498 0.608 -0.633 1.361 37.964 9.773 15.743 23.151 1.361 37.965 9.786 15.748 23.208 1.408 1.214 1.269 1.162 1.312 1.274 0.681 0.901 0.829 0.567 0.998 3.592 2.078 2.201 3.691 0.664 3.606 2.081 2.207 3.717 0.664 1.016 1.062 1.012 0.999 0.652 0.864 0.921 0.877 0.808 0.707 1.030 0.704 0.778 1.624 0.469 1.039 0.708 0.787 1.635 0.553 1.011 1.019 0.993 1.004 0.518 0.647 1.756 0.866 -1.120 1.169 23.916 6.539 8.393 24.404 1.459 23.918 6.583 8.394 24.496 1.468 1.215 1.569 1.227 1.263 1.017 0.853 1.459 0.875 0.826 0.769 3.762 1.598 1.431 4.111 0.799 3.765 1.663 1.436 4.115 0.832 1.053 1.334 0.949 0.982 0.396 0.968 1.317 0.898 0.835 0.509 0.759 0.695 0.747 1.346 0.628 0.760 0.764 0.754 1.356 0.797 1.044 1.310 0.982 0.962 0.221 Table B.4: Results for Strong IV and ρ = 0.6 obs 100 500 ξ: normal 1000 100 500 ξ: χ2 (5) 1000 M S R Md M S R Md M S R Md M S R Md M S R Md M S R Md IV Heckit1 0.646 1.340 25.625 8.788 25.628 8.794 1.325 1.329 0.736 0.844 1.988 1.600 2.005 1.608 1.068 1.073 0.920 0.938 0.749 0.686 0.754 0.689 1.027 1.021 4.452 2.216 48.623 4.660 48.746 4.816 1.324 1.550 0.628 1.435 7.991 1.548 8.000 1.608 1.085 1.357 1.014 1.319 0.652 0.566 0.652 0.649 1.088 1.326 u: normal Heckit2 QLIML QFIML 1.101 -0.581 1.407 8.717 20.334 1.251 8.717 20.395 1.316 1.292 1.166 1.368 0.827 0.595 1.295 1.475 2.543 0.476 1.485 2.575 0.560 1.065 1.028 1.267 0.931 0.897 1.137 0.688 1.090 0.360 0.691 1.095 0.385 1.016 0.988 1.084 1.145 -0.533 1.484 4.814 
26.714 1.288 4.817 26.758 1.375 1.199 1.295 1.383 0.791 0.899 1.323 1.562 2.946 0.463 1.576 2.948 0.565 1.018 1.036 1.255 0.930 0.827 1.189 0.659 1.155 0.389 0.663 1.168 0.433 0.987 0.937 1.100 119 IV Heckit1 1.536 1.481 31.611 10.474 31.615 10.485 1.235 1.251 0.819 0.918 2.794 1.623 2.800 1.625 1.017 1.068 0.894 0.941 0.903 0.740 0.909 0.742 1.021 1.032 -2.554 2.671 99.600 9.893 99.663 10.033 1.346 1.790 0.893 1.501 1.455 0.999 1.459 1.118 1.104 1.489 1.017 1.435 0.708 0.606 0.708 0.746 1.116 1.431 u: χ2 (5) Heckit2 QLIML QFIML 0.884 0.024 1.361 6.833 20.848 1.361 6.834 20.871 1.983 1.217 1.244 1.274 0.799 0.573 0.998 1.607 4.106 0.664 1.620 4.128 0.440 1.017 1.044 0.652 0.841 0.844 0.555 0.902 1.573 0.767 0.916 1.580 0.786 0.972 0.991 0.506 0.619 0.710 1.243 17.531 27.404 1.626 17.535 27.406 1.644 1.232 1.231 1.030 0.740 0.653 0.769 1.624 4.086 0.753 1.644 4.101 0.620 0.945 0.942 0.476 0.887 0.747 0.504 0.744 1.709 0.575 0.752 1.728 0.576 0.978 0.894 0.335 Table B.5: Results for Weak IV and ρ = 0.0 obs 100 M S R Md 500 M ξ: S normal R Md 1000 M S R Md 100 M S R Md 500 M ξ: S 2 (5) χ R Md 1000 M S R Md IV Heckit1 0.980 10.114 0.638 172.074 0.638 172.315 0.976 0.981 0.995 0.857 0.545 6.247 0.545 6.249 0.990 0.979 0.976 0.923 0.453 1.501 0.453 1.503 0.967 0.939 1.035 -10.261 0.613 265.733 0.614 265.971 1.031 1.047 1.013 0.934 0.547 8.995 0.547 8.995 0.996 0.954 1.046 1.113 0.452 5.328 0.454 5.329 1.037 1.065 u: normal Heckit2 QLIML QFIML 0.983 0.691 0.963 0.636 32.420 1.096 0.637 32.421 1.096 1.002 1.021 0.964 0.994 1.253 0.992 0.537 6.483 0.735 0.537 6.488 0.735 0.993 0.974 0.956 0.973 0.906 0.963 0.450 2.831 0.569 0.451 2.832 0.571 0.973 0.972 0.968 1.069 0.005 0.995 0.614 24.815 1.094 0.618 24.835 1.094 1.068 0.946 0.976 1.150 1.232 0.987 0.548 5.896 0.707 0.568 5.900 0.707 1.128 0.958 0.992 1.231 1.156 1.012 0.460 3.455 0.553 0.515 3.459 0.553 1.196 1.109 1.004 120 IV Heckit1 1.018 -1.436 0.640 93.693 0.640 93.724 0.999 1.048 1.017 2.092 0.524 77.712 0.524 
77.720 1.005 1.016 1.014 0.762 0.434 6.072 0.434 6.077 1.001 0.995 0.978 -3.914 0.630 223.518 0.631 223.572 0.991 0.842 1.033 0.716 0.537 13.047 0.538 13.050 1.009 1.006 1.031 1.052 0.457 5.238 0.458 5.238 1.029 1.032 u: χ2 (5) Heckit2 1.014 0.630 0.630 0.984 1.018 0.518 0.518 1.004 1.010 0.429 0.429 1.005 1.009 0.623 0.624 1.012 1.164 0.535 0.559 1.143 1.222 0.453 0.505 1.202 QLIML QFIML 2.490 0.998 24.057 1.007 24.103 1.007 0.853 0.961 0.988 0.981 5.271 0.582 5.271 0.583 1.190 0.993 1.053 1.057 2.668 0.803 2.669 0.805 1.038 1.008 0.948 0.910 30.667 0.983 30.667 0.987 1.211 0.874 1.076 0.938 6.180 0.664 6.180 0.667 1.101 1.034 0.791 0.728 3.327 0.825 3.333 0.869 0.874 0.748 Table B.6: Results for Weak IV and ρ = 0.4 obs 100 M S R Md 500 M ξ: S normal R Md 1000 M S R Md 100 M S R Md 500 M ξ: S 2 (5) χ R Md 1000 M S R Md IV Heckit1 1.586 2.836 0.633 35.243 0.863 35.291 1.578 1.438 1.400 0.037 0.526 24.103 0.661 24.123 1.407 1.138 1.268 1.130 0.433 6.115 0.509 6.117 1.282 1.014 1.587 2.641 0.594 34.063 0.835 34.103 1.588 1.471 1.432 1.638 0.530 23.301 0.684 23.310 1.417 1.135 1.310 0.994 0.428 8.085 0.529 8.085 1.310 1.092 u: normal Heckit2 QLIML QFIML 1.588 -3.068 1.586 0.606 20.161 1.022 0.844 20.568 1.178 1.587 0.224 1.575 1.398 -0.585 1.330 0.521 5.314 0.677 0.655 5.546 0.753 1.404 0.250 1.307 1.264 -0.057 1.206 0.428 2.832 0.527 0.503 3.023 0.566 1.274 0.566 1.167 1.617 -3.805 1.549 0.579 23.632 1.033 0.846 24.116 1.170 1.615 0.115 1.491 1.540 -0.501 1.398 0.532 5.135 0.675 0.758 5.350 0.783 1.524 0.315 1.321 1.460 0.279 1.230 0.426 2.615 0.501 0.627 2.713 0.551 1.444 0.578 1.152 121 IV Heckit1 1.536 -2.592 0.597 240.707 0.802 240.734 1.508 1.283 1.383 0.670 0.515 9.581 0.642 9.586 1.401 1.116 1.277 1.006 0.440 3.613 0.520 3.613 1.287 1.019 1.586 1.557 0.633 51.806 0.863 51.809 1.578 1.350 1.399 -0.065 0.528 25.444 0.662 25.467 1.433 1.117 1.295 0.700 0.435 9.323 0.525 9.328 1.320 1.063 u: χ2 (5) Heckit2 1.539 0.592 0.800 1.523 1.387 0.513 0.643 1.407 1.276 
0.436 0.516 1.280 1.610 0.638 0.883 1.612 1.508 0.529 0.733 1.519 1.455 0.435 0.629 1.453 QLIML QFIML -2.606 1.433 25.049 0.941 25.307 1.035 -0.266 1.391 -1.625 1.192 6.592 0.778 7.095 0.801 -0.122 1.394 -0.315 0.899 3.314 0.767 3.565 0.774 0.286 1.103 -1.096 1.523 40.577 1.066 40.631 1.187 0.331 1.510 -0.406 0.957 7.720 0.871 7.847 0.872 -0.074 1.102 -0.084 0.635 3.917 0.793 4.065 0.873 0.400 0.483 Table B.7: Results for Weak IV and ρ = 0.5 obs 100 M S R Md 500 M ξ: S normal R Md 1000 M S R Md 100 M S R Md 500 M ξ: S 2 (5) χ R Md 1000 M S R Md IV Heckit1 1.729 3.130 0.592 62.997 0.939 63.033 1.720 1.555 1.497 0.508 0.514 10.178 0.715 10.190 1.498 1.152 1.335 0.838 0.428 3.494 0.544 3.498 1.353 1.009 1.726 3.822 0.573 169.952 0.925 169.976 1.731 1.506 1.501 4.126 0.506 77.381 0.712 77.444 1.494 1.124 1.366 1.141 0.410 4.313 0.550 4.315 1.369 1.063 u: normal Heckit2 QLIML QFIML 1.734 -3.201 1.721 0.588 21.170 0.998 0.941 21.583 1.231 1.741 0.232 1.623 1.495 -0.280 1.388 0.510 5.184 0.649 0.710 5.339 0.756 1.494 0.463 1.333 1.332 0.350 1.229 0.423 2.425 0.492 0.537 2.510 0.543 1.350 0.683 1.165 1.754 -2.987 1.692 0.566 27.520 1.022 0.942 27.807 1.234 1.753 0.771 1.623 1.608 0.403 1.417 0.493 5.296 0.652 0.782 5.329 0.774 1.598 0.572 1.332 1.509 0.702 1.244 0.402 2.793 0.472 0.648 2.809 0.532 1.499 0.707 1.172 122 IV Heckit1 1.686 -2.573 0.600 182.319 0.911 182.354 1.704 1.564 1.488 0.541 0.520 7.570 0.713 7.584 1.505 1.132 1.349 0.900 0.427 5.431 0.552 5.432 1.354 1.038 1.747 1.056 0.593 17.854 0.954 17.854 1.740 1.615 1.492 0.421 0.545 8.400 0.734 8.420 1.515 1.099 1.354 1.039 0.417 2.638 0.547 2.639 1.364 1.075 u: χ2 (5) Heckit2 1.685 0.589 0.904 1.713 1.496 0.518 0.717 1.515 1.358 0.423 0.554 1.361 1.771 0.592 0.972 1.747 1.599 0.546 0.811 1.615 1.500 0.415 0.650 1.501 QLIML QFIML -0.896 1.595 28.523 0.930 28.586 1.104 0.460 1.571 -0.625 1.209 8.002 0.771 8.166 0.799 0.151 1.350 0.042 0.894 3.894 0.708 4.010 0.716 0.425 0.463 1.409 1.639 37.053 1.009 37.055 1.194 
0.973 1.594 0.790 0.972 8.326 0.828 8.328 0.829 0.501 0.860 0.905 0.714 4.161 0.726 4.162 0.780 0.567 0.336 Table B.8: Results for Weak IV and ρ = 0.6 obs 100 M S R Md 500 M ξ: S normal R Md 1000 M S R Md 100 M S R Md 500 M ξ: S 2 (5) χ R Md 1000 M S R Md IV Heckit1 1.870 2.304 0.571 13.195 1.041 13.259 1.879 1.702 1.590 1.115 0.498 6.704 0.773 6.705 1.602 1.155 1.401 0.892 0.421 2.709 0.581 2.711 1.423 1.008 1.890 0.524 0.557 120.541 1.050 120.542 1.894 1.771 1.593 1.180 0.501 12.399 0.777 12.401 1.612 1.144 1.431 0.902 0.404 1.596 0.591 1.599 1.441 1.066 u: normal Heckit2 QLIML QFIML 1.837 -0.322 1.848 0.612 16.511 0.969 1.037 16.563 1.288 1.826 0.885 1.716 1.589 0.222 1.421 0.495 5.268 0.625 0.769 5.325 0.754 1.590 0.724 1.388 1.397 0.529 1.231 0.416 2.149 0.458 0.575 2.200 0.513 1.420 0.776 1.136 1.910 1.180 1.804 0.545 15.472 0.942 1.060 15.473 1.238 1.902 1.495 1.724 1.696 1.464 1.472 0.492 4.972 0.636 0.852 4.994 0.792 1.703 1.079 1.357 1.570 0.895 1.244 0.389 2.999 0.422 0.690 3.000 0.487 1.566 0.797 1.152 123 IV Heckit1 1.836 1.918 0.617 44.461 1.039 44.470 1.837 1.665 1.618 0.601 0.513 21.976 0.803 21.979 1.621 1.231 1.441 1.013 0.431 4.355 0.617 4.355 1.448 1.057 1.922 2.613 0.617 86.358 1.110 86.373 1.936 1.728 1.610 0.280 0.491 14.177 0.783 14.196 1.612 1.118 1.437 0.645 0.426 3.557 0.610 3.574 1.472 1.031 u: χ2 (5) Heckit2 1.837 0.612 1.037 1.826 1.628 0.511 0.810 1.633 1.452 0.429 0.623 1.461 1.963 0.618 1.145 1.993 1.805 0.574 0.989 1.813 1.645 0.422 0.771 1.652 QLIML QFIML 1.730 1.695 27.031 0.948 27.041 1.176 0.965 1.635 1.247 1.251 7.572 0.750 7.576 0.791 0.939 1.279 0.990 0.979 3.972 0.668 3.972 0.668 0.718 0.587 1.934 1.755 37.442 1.112 37.454 1.344 2.310 1.700 3.106 1.000 8.526 0.871 8.783 0.871 1.150 0.624 1.035 0.674 5.245 0.683 5.246 0.756 0.564 0.371 Appendix C Other Tables in Chapter 1 124 Table C.1: Relative Efficiencies (Ratio of MSEs) ρ 0 ξ normal χ2 (5) 0.4 normal χ2 (5) 0.5 normal χ2 (5) 0.6 normal χ2 (5) Heckit1 vs LIV obs. 
u: normal u: χ2 (5) 100 6.03 18.69 500 7.01 8.51 1000 4.43 3.26 100 13.66 16.15 500 8.36 6.69 1000 8.36 7.92 100 8.13 14.70 500 5.70 4.24 1000 2.07 5.02 100 20.34 7.62 500 3.36 6.77 1000 2.74 7.72 100 10.43 2.17 500 4.15 2.83 1000 1.87 4.31 100 39.64 8.52 500 4.02 8.20 1000 3.96 3.23 100 5.47 9.33 500 3.01 6.49 1000 2.51 2.98 100 30.86 2.44 500 3.50 6.22 1000 3.11 5.28 Heckit2 vs QLIML Heckit2 vs Series Heckit2 vs QFIML 2 (5) u: normal u: χ2 (5) u: normal u: χ2 (5) u: normal u: χ 3.01 2.78 0.04 0.06 6.75 324.48 1.94 2.09 0.23 0.52 0.75 4.89 1.27 2.26 0.59 1.89 1.06 1.26 1.13 0.63 0.05 0.07 8.23 298.49 0.97 2.95 0.19 0.88 0.85 0.95 1.04 2.14 0.65 2.44 0.66 0.81 0.44 4.44 0.02 0.06 6.74 8.42 1.75 2.16 0.20 0.49 1.22 4.47 4.93 1.84 0.92 1.72 1.09 1.44 1.10 0.46 0.07 0.03 202.95 24.78 1.10 3.00 0.17 0.40 14.52 2.16 1.77 1.28 0.74 2.15 0.80 0.86 0.43 0.14 0.02 0.01 7.27 15.05 1.73 0.90 0.17 0.09 1.39 3.00 4.57 1.56 0.40 0.49 1.14 2.15 1.05 0.52 0.06 0.03 150.73 13.20 1.14 1.67 0.14 0.34 20.02 5.13 1.72 2.40 0.44 1.12 0.89 0.99 0.43 0.66 0.02 0.08 8.49 9.09 1.72 2.12 0.14 0.07 1.56 2.97 3.92 0.82 0.31 0.74 1.20 1.50 0.90 0.38 0.08 0.01 102.44 98.67 1.21 2.33 0.13 0.14 24.76 1.70 1.57 1.60 0.43 0.59 1.01 0.90 125 Table C.2: Ratio of estimations with AvarQLIM L < AvarT wo−step ρ ξ normal 0.0 ξ: t(5) χ2 (5) normal 0.4 ξ: t(5) χ2 (5) normal 0.5 ξ: t(5) χ2 (5) normal 0.6 ξ: t(5) χ2 (5) obs. 
ρ     ξ        obs    u: normal   u: t(5)   u: χ2(5)
0.0   normal   100    0.54        0.56      0.53
0.0   normal   500    0.54        0.52      0.53
0.0   normal   1000   0.52        0.52      0.54
0.0   t(5)     100    0.53        0.52      0.53
0.0   t(5)     500    0.57        0.55      0.56
0.0   t(5)     1000   0.56        0.55      0.53
0.0   χ2(5)    100    0.55        0.56      0.56
0.0   χ2(5)    500    0.58        0.58      0.57
0.0   χ2(5)    1000   0.55        0.56      0.54
0.4   normal   100    0.48        0.48      0.51
0.4   normal   500    0.53        0.53      0.56
0.4   normal   1000   0.53        0.54      0.54
0.4   t(5)     100    0.53        0.53      0.51
0.4   t(5)     500    0.52        0.54      0.51
0.4   t(5)     1000   0.53        0.54      0.58
0.4   χ2(5)    100    0.54        0.52      0.55
0.4   χ2(5)    500    0.56        0.58      0.59
0.4   χ2(5)    1000   0.55        0.58      0.55
0.5   normal   100    0.48        0.48      0.51
0.5   normal   500    0.53        0.55      0.54
0.5   normal   1000   0.54        0.55      0.56
0.5   t(5)     100    0.52        0.52      0.51
0.5   t(5)     500    0.57        0.54      0.54
0.5   t(5)     1000   0.56        0.56      0.54
0.5   χ2(5)    100    0.53        0.56      0.51
0.5   χ2(5)    500    0.58        0.58      0.56
0.5   χ2(5)    1000   0.56        0.58      0.61
0.6   normal   100    0.48        0.48      0.54
0.6   normal   500    0.54        0.55      0.54
0.6   normal   1000   0.56        0.56      0.57
0.6   t(5)     100    0.52        0.54      0.53
0.6   t(5)     500    0.56        0.54      0.56
0.6   t(5)     1000   0.57        0.57      0.58
0.6   χ2(5)    100    0.53        0.55      0.55
0.6   χ2(5)    500    0.59        0.59      0.57
0.6   χ2(5)    1000   0.60        0.60      0.59

Appendix D

Figures in Chapter 1

Figure D.1: Sample Cumulative Density Functions under Strong IV with ρ = 0.6 and obs = 1000. Note: The upper row is for normal ξ and the lower row for chi-squared ξ. The left column is for normal u and the right column for chi-squared u. The dotted, short-dashed and long-dashed lines represent the FIML, LIML and Heckit estimators respectively. [figure not reproduced in this text extraction]

Figure D.2: Sample Cumulative Density Functions under Weak IV with ρ = 0.6 and obs = 1000. Note: Graphs are arranged in the same way as in Figure D.1. [figure not reproduced in this text extraction]

Appendix E

Proofs in Chapter 2

E.1 Proof of Proposition 1

Although the basic proof is presented in Terza (1998), the following generalizes it to the two-regime setting.

Lemma 1.
\[
\int_{-z\gamma}^{\infty} g(v\mid\epsilon)\,dv
= F\!\left[\frac{z\gamma+(\sigma_{1a}/\sigma_1^2)\,\epsilon}{\sqrt{1-(\sigma_{1a}/\sigma_1)^2}}\right]
\]

Proof. First imagine the joint density function of v and ϵ. Then the expression is the probability of v ∈ [−zγ, ∞), i.e. d = 1, conditional on ϵ.
A linear projection can be written as
\[
v = \frac{\sigma_{1a}}{\sigma_1^2}\,\epsilon + e, \qquad e \sim N\!\left(0,\; 1-\left(\frac{\sigma_{1a}}{\sigma_1}\right)^{2}\right)
\]
Then
\begin{align*}
w &= 1\!\left[z\gamma + (\sigma_{1a}/\sigma_1^2)\epsilon + e > 0\right] \\
  &= 1\!\left[e > -z\gamma - (\sigma_{1a}/\sigma_1^2)\epsilon\right] \\
  &= 1\!\left[\frac{e}{\sqrt{1-(\sigma_{1a}/\sigma_1)^2}} > \frac{-z\gamma-(\sigma_{1a}/\sigma_1^2)\epsilon}{\sqrt{1-(\sigma_{1a}/\sigma_1)^2}}\right]
\end{align*}
Therefore
\[
P(w=1\mid\epsilon) = \int_{-z\gamma}^{\infty} g(v\mid\epsilon)\,dv
= F\!\left[\frac{z\gamma+(\sigma_{1a}/\sigma_1^2)\epsilon}{\sqrt{1-(\sigma_{1a}/\sigma_1)^2}}\right] \equiv \Phi^{*}(\epsilon)
\]
as was to be shown.

Proposition. The joint density function of y and w conditional on the exogenous variables is
\[
f(y,w\mid z) =
\left[\frac{1}{\sqrt{\pi}}\int_{\mathbb{R}} f\big(y_1\mid z, w=1, \sqrt{2}\sigma_1\zeta_1\big)\,\Phi^{*}\!\big(\sqrt{2}\sigma_1\zeta_1\big)\exp(-\zeta_1^2)\,d\zeta_1\right]^{w}
\cdot\left[\frac{1}{\sqrt{\pi}}\int_{\mathbb{R}} f\big(y_0\mid z, w=0, \sqrt{2}\sigma_0\zeta_0\big)\Big(1-\Phi^{*}\!\big(\sqrt{2}\sigma_0\zeta_0\big)\Big)\exp(-\zeta_0^2)\,d\zeta_0\right]^{1-w}
\]

Proof. Note that
\begin{align*}
f(\epsilon_1, w=1) &= \int_{-z\gamma}^{\infty} f(\epsilon_1,v)\,dv, &
f(\epsilon_1, w=0) &= \int_{-\infty}^{-z\gamma} f(\epsilon_1,v)\,dv, \\
f(\epsilon_0, w=1) &= \int_{-z\gamma}^{\infty} f(\epsilon_0,v)\,dv, &
f(\epsilon_0, w=0) &= \int_{-\infty}^{-z\gamma} f(\epsilon_0,v)\,dv.
\end{align*}
Let f(· | z) ≡ g(·). Then
\begin{align*}
\int_{\mathbb{R}} g(y_0\mid w=0,\epsilon_0)\left(\int_{-\infty}^{-z\gamma} g(\epsilon_0,v)\,dv\right)d\epsilon_0
&= \int_{\mathbb{R}} g(y_0\mid w=0,\epsilon_0)\,g(\epsilon_0, w=0)\,d\epsilon_0
= \int_{\mathbb{R}} g(y_0, w=0, \epsilon_0)\,d\epsilon_0
= g(y_0, w=0) \\
\int_{\mathbb{R}} g(y_1\mid w=1,\epsilon_1)\left(\int_{-z\gamma}^{\infty} g(\epsilon_1,v)\,dv\right)d\epsilon_1
&= \int_{\mathbb{R}} g(y_1\mid w=1,\epsilon_1)\,g(\epsilon_1, w=1)\,d\epsilon_1
= \int_{\mathbb{R}} g(y_1, w=1, \epsilon_1)\,d\epsilon_1
= g(y_1, w=1)
\end{align*}
Thus we have
\begin{align*}
g(y,w) &\equiv g(y_1,w=1)^{w}\cdot g(y_0,w=0)^{1-w} \\
&= \left(\int_{\mathbb{R}} g(y_1\mid w=1,\epsilon_1)\left(\int_{-z\gamma}^{\infty} g(\epsilon_1,v)\,dv\right)d\epsilon_1\right)^{w}
\times\left(\int_{\mathbb{R}} g(y_0\mid w=0,\epsilon_0)\left(\int_{-\infty}^{-z\gamma} g(\epsilon_0,v)\,dv\right)d\epsilon_0\right)^{1-w}
\end{align*}
Recovering the original notation,
\begin{align*}
f(y,w\mid z)
&= \left[\int_{\mathbb{R}} f(y_1\mid z,w=1,\epsilon_1)\left(\int_{-z\gamma}^{\infty} f(\epsilon_1,v\mid z)\,dv\right)d\epsilon_1\right]^{w}
\left[\int_{\mathbb{R}} f(y_0\mid z,w=0,\epsilon_0)\left(\int_{-\infty}^{-z\gamma} f(\epsilon_0,v\mid z)\,dv\right)d\epsilon_0\right]^{1-w} \\
&= \left[\int_{\mathbb{R}} f(y_1\mid z,w=1,\epsilon_1)f(\epsilon_1\mid z)\left(\int_{-z\gamma}^{\infty} f(v\mid\epsilon_1,z)\,dv\right)d\epsilon_1\right]^{w}
\left[\int_{\mathbb{R}} f(y_0\mid z,w=0,\epsilon_0)f(\epsilon_0\mid z)\left(\int_{-\infty}^{-z\gamma} f(v\mid\epsilon_0,z)\,dv\right)d\epsilon_0\right]^{1-w} \\
&= \left[\int_{\mathbb{R}} f(y_1\mid z,w=1,\epsilon_1)f(\epsilon_1\mid z)\,\Phi^{*}(\epsilon_1)\,d\epsilon_1\right]^{w}
\left[\int_{\mathbb{R}} f(y_0\mid z,w=0,\epsilon_0)f(\epsilon_0\mid z)\big(1-\Phi^{*}(\epsilon_0)\big)\,d\epsilon_0\right]^{1-w}
\end{align*}
Let
\[
\zeta \equiv \frac{\epsilon}{\sqrt{2}\,\sigma}, \qquad \epsilon\sim N(0,\sigma^2) \tag{E.1}
\]
Then
\[
f(y,w\mid z) =
\left[\frac{1}{\sqrt{\pi}}\int_{\mathbb{R}} f\big(y_1\mid z,w=1,\sqrt{2}\sigma_1\zeta_1\big)\,\Phi^{*}\!\big(\sqrt{2}\sigma_1\zeta_1\big)\exp(-\zeta_1^2)\,d\zeta_1\right]^{w}
\times\left[\frac{1}{\sqrt{\pi}}\int_{\mathbb{R}} f\big(y_0\mid z,w=0,\sqrt{2}\sigma_0\zeta_0\big)\Big(1-\Phi^{*}\!\big(\sqrt{2}\sigma_0\zeta_0\big)\Big)\exp(-\zeta_0^2)\,d\zeta_0\right]^{1-w}
\]
as was to be shown.

E.2 Derivation of the Estimating Equation for the NFES Model

In what follows the regime subscripts are omitted for simplicity. For each regime,
\[
E(y\mid z,v) = \exp(\alpha + x\beta)\,E[\exp(\epsilon)\mid z,v]
= \exp(\alpha + x\beta)\exp\!\left(\rho\sigma v + \tfrac{1}{2}\sigma^2(1-\rho^2)\right)
= \exp\!\left(\alpha + \tfrac{1}{2}\sigma^2(1-\rho^2) + x\beta + \rho\sigma v\right)
\]
The second equality follows from the result in Appendix A of Terza (1998). Note that for regime 1,
\begin{align*}
E[\exp(\rho\sigma v)\mid v>-z\delta]
&= \int_{-z\delta}^{\infty}\exp(\rho\sigma v)\,p(v\mid v>-z\delta)\,dv
= \int_{-z\delta}^{\infty}\exp(\rho\sigma v)\,\frac{p(v)}{P(v>-z\delta)}\,dv \\
&= \frac{1}{\Phi(z\delta)}\int_{-z\delta}^{\infty}\exp(\rho\sigma v)\,\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{v^2}{2}\right)dv
= \frac{1}{\Phi(z\delta)}\int_{-z\delta}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\!\left(\frac{2\rho\sigma v-v^2}{2}\right)dv \\
&= \frac{1}{\Phi(z\delta)}\exp\!\left(\frac{(\rho\sigma)^2}{2}\right)\int_{-z\delta}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{(v-\rho\sigma)^2}{2}\right)dv \\
&= \frac{1}{\Phi(z\delta)}\exp\!\left(\frac{(\rho\sigma)^2}{2}\right)\int_{-(z\delta+\rho\sigma)}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{v^2}{2}\right)dv
= \exp\!\left(\frac{(\rho\sigma)^2}{2}\right)\frac{\Phi(z\delta+\rho\sigma)}{\Phi(z\delta)}
\end{align*}
By the same reasoning it can be shown that for regime 0,
\[
E[\exp(\rho\sigma v)\mid v<-z\delta] = \exp\!\left(\frac{(\rho\sigma)^2}{2}\right)\frac{\Phi(-(z\delta+\rho\sigma))}{\Phi(-z\delta)}
\]
Since G(z,v) ⊂ G(z,w), by the law of iterated expectations,
\begin{align*}
E(y_1\mid z,w) &= \exp\!\left(\alpha_1 + \tfrac{1}{2}\sigma_1^2(1-\rho_1^2) + x\beta_1\right)E[\exp(\rho_1\sigma_1 v)\mid z,w] \\
E(y_1\mid z,w=1) &= \exp\!\left(\alpha_1 + \tfrac{1}{2}\sigma_1^2(1-\rho_1^2) + x\beta_1\right)E[\exp(\rho_1\sigma_1 v)\mid v>-z\delta]
= \exp\!\left(\alpha_1 + \frac{\sigma_1^2}{2} + x\beta_1\right)\frac{\Phi(z\delta+\rho_1\sigma_1)}{\Phi(z\delta)}
\end{align*}
Similarly,
\[
E(y_0\mid z,w=0) = \exp\!\left(\alpha_0 + \tfrac{1}{2}\sigma_0^2(1-\rho_0^2) + x\beta_0\right)E[\exp(\rho_0\sigma_0 v)\mid v<-z\delta]
= \exp\!\left(\alpha_0 + \frac{\sigma_0^2}{2} + x\beta_0\right)\frac{\Phi(-(z\delta+\rho_0\sigma_0))}{\Phi(-z\delta)}
\]
Therefore the estimating equation is
\[
E(y\mid z,w) = w\cdot\exp\!\left(\alpha_1+\frac{\sigma_1^2}{2}+x\beta_1\right)\frac{\Phi(z\delta+\rho_1\sigma_1)}{\Phi(z\delta)}
+ (1-w)\cdot\exp\!\left(\alpha_0+\frac{\sigma_0^2}{2}+x\beta_0\right)\frac{\Phi(-(z\delta+\rho_0\sigma_0))}{\Phi(-z\delta)},
\]
where y = wy_1 + (1 − w)y_0.

E.3 Derivation of the Conditional Variance for the WNLS Estimator

It was shown in Appendix A of Terza (1998) that
\begin{align*}
E[\exp(c_g)\mid v] &= \exp\!\left(\rho_g\sigma_g v + \tfrac{1}{2}\sigma_g^2(1-\rho_g^2)\right) \\
\operatorname{var}[\exp(c_g)\mid v] &= \exp\!\left(2\rho_g\sigma_g v + 2\sigma_g^2(1-\rho_g^2)\right) - \exp\!\left(2\rho_g\sigma_g v + \sigma_g^2(1-\rho_g^2)\right)
\end{align*}
Let all the expectations below be conditional on z. Then
\[
E[\exp(c_g)^2\mid w] = E\big[E[\exp(c_g)^2\mid v]\,\big|\,w\big]
= E\big[\operatorname{var}[\exp(c_g)\mid v]\,\big|\,w\big] + E\big[E[\exp(c_g)\mid v]^2\,\big|\,w\big]
\]
Thus we need to find the expressions for the two terms on the RHS:
\begin{align*}
E\big[\operatorname{var}(\exp(c_g)\mid v)\,\big|\,w\big]
&= E\Big[\exp\!\big(2\rho_g\sigma_g v + 2\sigma_g^2(1-\rho_g^2)\big)-\exp\!\big(2\rho_g\sigma_g v + \sigma_g^2(1-\rho_g^2)\big)\,\Big|\,w\Big] \\
&= \Big(\exp\!\big[2\sigma_g^2(1-\rho_g^2)\big]-\exp\!\big[\sigma_g^2(1-\rho_g^2)\big]\Big)\cdot E\big[\exp(2\rho_g\sigma_g v)\,\big|\,w\big] \tag{E.2}
\end{align*}
Also
\[
E\big[E(\exp(c_g)\mid v)^2\,\big|\,w\big]
= E\Big[\exp\!\big(2\rho_g\sigma_g v + \sigma_g^2(1-\rho_g^2)\big)\,\Big|\,w\Big]
= \exp\!\big[\sigma_g^2(1-\rho_g^2)\big]\cdot E\big[\exp(2\rho_g\sigma_g v)\mid w\big]
\]
Therefore
\[
E[\exp(c_g)^2\mid w] = \exp\!\big[2\sigma_g^2(1-\rho_g^2)\big]\cdot E\big[\exp(2\rho_g\sigma_g v)\mid w\big]
\]
By the way,
\[
E[\exp(c_g)\mid w] = E\big[E(\exp(c_g)\mid v)\,\big|\,w\big]
= E\Big[\exp\!\big(\rho_g\sigma_g v + \tfrac{1}{2}\sigma_g^2(1-\rho_g^2)\big)\,\Big|\,w\Big]
= \exp\!\big[\tfrac{1}{2}\sigma_g^2(1-\rho_g^2)\big]\cdot E\big[\exp(\rho_g\sigma_g v)\mid w\big]
\]
Therefore
\begin{align*}
\operatorname{var}[\exp(c_g)\mid w] &= E[\exp(c_g)^2\mid w] - E[\exp(c_g)\mid w]^2 \\
&= \exp\!\big[2\sigma_g^2(1-\rho_g^2)\big]\cdot E\big[\exp(2\rho_g\sigma_g v)\mid w\big]
- \exp\!\big[\sigma_g^2(1-\rho_g^2)\big]\cdot E\big[\exp(\rho_g\sigma_g v)\mid w\big]^2
\end{align*}
From the results in Appendix A,
\begin{align*}
\operatorname{var}[\exp(\epsilon_1)\mid w=1] &= \exp(\sigma_1^2)\left\{\exp(\sigma_1^2)\frac{\Phi(z\delta+2\rho_1\sigma_1)}{\Phi(z\delta)}-\left(\frac{\Phi(z\delta+\rho_1\sigma_1)}{\Phi(z\delta)}\right)^{2}\right\} \\
\operatorname{var}[\exp(\epsilon_0)\mid w=0] &= \exp(\sigma_0^2)\left\{\exp(\sigma_0^2)\frac{\Phi(-z\delta-2\rho_0\sigma_0)}{\Phi(-z\delta)}-\left(\frac{\Phi(-z\delta-\rho_0\sigma_0)}{\Phi(-z\delta)}\right)^{2}\right\}
\end{align*}
Also
\begin{align*}
\operatorname{var}\big[E[y\mid z,w,\epsilon_1,\epsilon_0]\,\big|\,z,w\big]
&= \operatorname{var}\big[w\exp(\alpha_1+x\beta_1+\epsilon_1)+(1-w)\exp(\alpha_0+x\beta_0+\epsilon_0)\,\big|\,z,w\big] \\
&= w\exp(\alpha_1+x\beta_1)^2\operatorname{var}[\exp(\epsilon_1)\mid z,w]
+ (1-w)\exp(\alpha_0+x\beta_0)^2\operatorname{var}[\exp(\epsilon_0)\mid z,w] \\
&\qquad + 2w(1-w)\exp(\alpha_1+x\beta_1)\exp(\alpha_0+x\beta_0)\operatorname{cov}[\exp(\epsilon_1),\exp(\epsilon_0)\mid z,w] \\
&= w\exp(\alpha_1+x\beta_1)^2\operatorname{var}[\exp(\epsilon_1)\mid z,w=1]
+ (1-w)\exp(\alpha_0+x\beta_0)^2\operatorname{var}[\exp(\epsilon_0)\mid z,w=0] \\
&= w\exp\!\big(\alpha_1+\sigma_1^2/2+x\beta_1\big)^2\left\{\exp(\sigma_1^2)\frac{\Phi(z\delta+2\rho_1\sigma_1)}{\Phi(z\delta)}-\left(\frac{\Phi(z\delta+\rho_1\sigma_1)}{\Phi(z\delta)}\right)^2\right\} \\
&\qquad + (1-w)\exp\!\big(\alpha_0+\sigma_0^2/2+x\beta_0\big)^2\left\{\exp(\sigma_0^2)\frac{\Phi(-z\delta-2\rho_0\sigma_0)}{\Phi(-z\delta)}-\left(\frac{\Phi(-z\delta-\rho_0\sigma_0)}{\Phi(-z\delta)}\right)^2\right\}
\end{align*}
Let
\[
\delta_g = \exp\!\big(\alpha_g+\sigma_g^2/2+x\beta_g\big),\quad
L_{1,2} = \frac{\Phi(z\delta+2\rho_1\sigma_1)}{\Phi(z\delta)},\quad
L_{1} = \frac{\Phi(z\delta+\rho_1\sigma_1)}{\Phi(z\delta)},\quad
L_{0,2} = \frac{\Phi(-z\delta-2\rho_0\sigma_0)}{\Phi(-z\delta)},\quad
L_{0} = \frac{\Phi(-z\delta-\rho_0\sigma_0)}{\Phi(-z\delta)}.
\]
Then the last equation can be simply written as
\[
\operatorname{var}\big[E[y\mid z,w,\epsilon_1,\epsilon_0]\,\big|\,z,w\big]
= w\delta_1^2\big(\exp(\sigma_1^2)L_{1,2}-L_1^2\big) + (1-w)\delta_0^2\big(\exp(\sigma_0^2)L_{0,2}-L_0^2\big)
\]
And
\begin{align*}
\operatorname{var}[y\mid z,w,\epsilon_1,\epsilon_0] &= \operatorname{var}[wy_1+(1-w)y_0\mid z,w,\epsilon_1,\epsilon_0] \\
&= w\operatorname{var}[y_1\mid z,w,\epsilon_1,\epsilon_0] + (1-w)\operatorname{var}[y_0\mid z,w,\epsilon_1,\epsilon_0]
+ 2w(1-w)\operatorname{cov}[y_1,y_0\mid z,w,\epsilon_1,\epsilon_0] \\
&= wE[y_1\mid z,w,\epsilon_1] + (1-w)E[y_0\mid z,w,\epsilon_0]
\end{align*}
Taking E[· | z, w] on both sides,
\[
E\big[\operatorname{var}[y\mid z,w,\epsilon_1,\epsilon_0]\,\big|\,z,w\big]
= wE[y_1\mid z,w=1] + (1-w)E[y_0\mid z,w=0]
= w\delta_1 L_1 + (1-w)\delta_0 L_0
\]
Finally,
\begin{align*}
\operatorname{var}[y\mid z,w] &= \operatorname{var}\big[E[y\mid z,w,\epsilon_1,\epsilon_0]\,\big|\,z,w\big] + E\big[\operatorname{var}[y\mid z,w,\epsilon_1,\epsilon_0]\,\big|\,z,w\big] \\
&= w\delta_1\Big(\delta_1\big(\exp(\sigma_1^2)L_{1,2}-L_1^2\big)+L_1\Big) + (1-w)\delta_0\Big(\delta_0\big(\exp(\sigma_0^2)L_{0,2}-L_0^2\big)+L_0\Big)
\end{align*}

Appendix F

Tables in Chapter 2

Table F.1: Simulation Results for ρ = 0.4

                     Linear Models                     NET (1-Regime Nonlinear)
                2SLS    LET(Hkt)  LFES(Hkt)  1PQML       2PQML      NLS         WNLS
n=1000  mean         0.726   0.742   0.747   -18813768   -1037.586  -6220.699   7.289E+13
        mc. st. dev  2.463   2.448   2.404   378300000   30871.585  186544.680  2.055E+15
        RMSE         2.478   2.462   2.417   378800000   30889.051  186648.410  2.056E+15
        median       0.825   0.845   0.855   -91.134     -0.296     0.747       1.177
        MAD          1.453   1.446   1.444   94.588      3.387      2.257       1.757
n=3000  mean         0.804   0.820   0.818   -2389.450   -4.619     -2.516      -0.347
        mc. st. dev  1.255   1.254   1.256   19778.908   16.242     17.923      18.548
        RMSE         1.270   1.267   1.269   19922.838   17.187     18.264      18.597
        median       0.814   0.804   0.809   -187.915    -0.239     0.616       0.270
        MAD          0.809   0.798   0.806   188.648     1.881      1.407       1.208
n=5000  mean         0.892   0.904   0.904   -904.755    -1.712     -0.857      -0.340
        mc. st. dev  0.980   0.982   0.982   5179.628    6.637      10.024      3.833
        RMSE         0.986   0.987   0.986   5258.226    7.170      10.194      4.061
        median       0.907   0.913   0.906   -220.859    0.033      0.495       0.065
        MAD          0.647   0.630   0.628   192.470     1.388      1.034       0.981

Note: The true ATE = 1. The number of Monte Carlo repetitions is 1000.
LET(Hkt): LET model estimated by two-step Heckit; 1PQML: one-step Poisson quasi-maximum likelihood estimator; RMSE = root mean squared error; MAD = mean absolute deviation.

Table F.1 (cont'd)

                     NFES (2-Regime Nonlinear)
                1PQML    2PQML    NLS      WNLS
n=1000  mean         -0.528   -0.084   -3.555    -0.100
        mc. st. dev  17.753   14.587   105.187   18.089
        RMSE         17.818   14.628   105.286   18.122
        median       0.909    0.984    1.096     0.971
        MAD          1.132    1.160    1.198     1.138
n=3000  mean         0.856    0.885    0.879     0.926
        mc. st. dev  1.487    1.322    1.444     1.298
        RMSE         1.494    1.327    1.449     1.300
        median       0.934    0.937    1.004     0.958
        MAD          0.679    0.674    0.706     0.649
n=5000  mean         0.983    0.980    0.973     0.996
        mc. st. dev  0.933    0.935    0.987     0.912
        RMSE         0.933    0.935    0.988     0.912
        median       1.027    1.022    1.001     1.044
        MAD          0.539    0.540    0.594     0.528

Table F.2: Simulation Results for ρ = 0.5

                     Linear Models                     NET (1-Regime Nonlinear)
                2SLS    LET(Hkt)  LFES(Hkt)  1PQML        2PQML     NLS       WNLS
n=1000  mean         0.620   0.634   0.635   -2.097E+11   -107.190  -70.525   1.E+07
        mc. st. dev  2.512   2.517   2.522   6.902E+12    1678.496  1008.897  3.E+08
        RMSE         2.541   2.544   2.549   6.905E+12    1681.979  1011.429  3.E+08
        median       0.784   0.820   0.789   -90.775      -0.391    0.779     0.878
        MAD          1.467   1.457   1.461   93.716       3.289     2.149     1.821
n=3000  mean         0.748   0.763   0.761   -4252.613    -7.376    -3.185    -1.604
        mc. st. dev  1.254   1.255   1.256   56506.153    37.336    18.650    17.555
        RMSE         1.279   1.277   1.279   56666.026    38.264    19.113    17.747
        median       0.735   0.770   0.772   -151.285     -0.702    0.292     0.028
        MAD          0.809   0.803   0.803   133.619      2.212     1.452     1.330
n=5000  mean         0.835   0.847   0.847   -622.799     -2.529    -1.083    0.221
        mc. st. dev  0.986   0.988   0.988   2763.448     8.182     10.301    26.549
        RMSE         0.999   1.000   1.000   2832.979     8.910     10.510    26.561
        median       0.881   0.888   0.888   -191.688     -0.260    0.517     -0.121
        MAD          0.638   0.640   0.631   131.812      1.558     1.084     1.112

Table F.2 (cont’d) n=1000 n=3000 n=5000 mean mc. st. dev RMSE median MAD mean mc. st.
dev RMSE median MAD NFES (2 Regime Nonlinear) 1PQML 2 PQML NLS WNLS -4.675 -4.115 -19.700 -1.291 101.905 79.327 383.617 28.446 102.063 79.491 384.175 28.538 0.942 1.002 1.141 1.010 1.071 1.128 1.189 1.121 0.811 0.820 0.777 0.852 1.364 1.337 1.494 1.320 1.377 1.349 1.510 1.328 0.932 0.936 0.950 0.954 0.651 0.646 0.721 0.625 0.935 0.940 0.937 0.954 0.952 0.946 1.004 0.920 0.954 0.948 1.006 0.921 1.016 1.017 1.017 1.029 0.550 0.546 0.559 0.536 140 Table F.3: Simulation Results for rho = 0.6 n=1000 n=3000 n=5000 mean mc. st. dev RMSE median MAD mean mc. st. dev RMSE median MAD mean mc. st. dev RMSE median MAD Linear Models 2SLS LET(Hkt) LFES(Hkt) 0.589 0.610 0.615 2.444 2.424 2.417 2.478 2.455 2.447 0.743 0.725 0.761 1.500 1.500 1.483 0.690 0.703 0.703 1.269 1.269 1.267 1.306 1.304 1.301 0.724 0.723 0.727 0.830 0.843 0.846 0.777 0.787 0.788 1.012 1.015 1.015 1.036 1.037 1.037 0.830 0.835 0.855 0.655 0.658 0.656 141 NET (1 1PQML -3178.000 10050.000 10050.000 -73.208 76.071 -1579.442 17617.457 17688.204 -134.971 112.208 -411.791 1144.600 1216.761 -158.018 104.547 Regime Nonlinear) 2PQML NLS WNLS -83.600 -430.3381 2933.343 828.368 7701.3678 80317.211 832.677 7713.4375 80370.723 -0.760 0.50252408 0.818 3.636 2.3827269 1.897 -9.535 -5.6418819 326.938 43.824 29.77737 9798.644 45.073 30.509119 9804.063 -1.165 0.33116622 -0.149 2.613 1.4962865 1.456 -3.600 -1.1125125 -1.055 8.520 7.3739959 5.262 9.683 7.6706274 5.649 -0.705 0.53132847 -0.315 1.930 1.0969538 1.192 Table F.3 (cont’d) n=1000 n=3000 n=5000 mean mc. st. dev RMSE median MAD mean mc. st. dev RMSE median MAD mean mc. st. 
(cont'd)

                         NFES (2-Regime Nonlinear)
  n      statistic      1PQML      2PQML       NLS          WNLS
  1000   mean           -2.296     -362.847    -4313.352    -143.539
         mc. st. dev    39.603     11411.335   126080.480   4209.908
         RMSE           39.740     11417.134   126154.270   4212.388
         median         0.983      0.998       1.170        1.188
         MAD            1.113      1.165       1.176        1.150
  3000   mean           0.717      0.725       0.670        0.782
         mc. st. dev    1.650      1.595       1.842        1.517
         RMSE           1.674      1.618       1.871        1.533
         median         0.945      0.943       0.989        0.974
         MAD            0.685      0.688       0.706        0.664
  5000   mean           0.901      0.907       0.893        0.923
         mc. st. dev    0.976      0.973       1.054        0.954
         RMSE           0.981      0.978       1.059        0.957
         median         0.975      0.986       0.993        1.007
         MAD            0.556      0.547       0.596        0.539

Table F.4: Simulation Results for FIML estimator

  correlation   # of abscissas   n      mean    mc. st. dev   RMSE    median   MAD
  0.4           8                1000   2.143   1.342         1.763   2.014    0.737
  0.4           8                3000   1.888   1.048         1.374   1.880    0.605
  0.4           8                5000   1.837   0.916         1.241   1.822    0.507
  0.4           16               1000   1.336   1.525         1.561   1.430    0.725
  0.4           16               3000   1.348   0.975         1.035   1.394    0.488
  0.4           16               5000   1.460   0.967         1.071   1.471    0.486
  0.5           8                1000   2.365   1.325         1.902   2.246    0.726
  0.5           8                3000   2.234   1.121         1.667   2.141    0.547
  0.5           8                5000   2.215   0.903         1.514   2.195    0.552
  0.5           16               1000   1.395   1.524         1.574   1.523    0.747
  0.5           16               3000   1.557   1.032         1.173   1.586    0.513
  0.5           16               5000   1.650   0.853         1.073   1.641    0.403
  0.6           8                1000   2.511   1.630         2.222   2.376    0.749
  0.6           8                3000   2.302   1.074         1.687   2.271    0.550
  0.6           8                5000   2.254   0.920         1.556   2.197    0.511
  0.6           16               1000   1.422   1.606         1.661   1.593    0.843
  0.6           16               3000   1.670   0.904         1.125   1.633    0.462
  0.6           16               5000   1.656   0.906         1.118   1.617    0.544

Table F.5: Simulation Results for DGP 1
(2SLS and LFES(Hkt) are the linear-model estimators; 2PQML and WNLS are the NFES estimators)

  estimator    ρ     n      mean         mc. st. dev   RMSE         median   MAD
  2SLS         0.4   1000   0.622        3.195         3.218        0.858    1.581
  2SLS         0.4   3000   0.892        1.439         1.444        0.995    0.953
  2SLS         0.4   5000   0.891        1.104         1.110        0.894    0.728
  2SLS         0.5   1000   0.509        3.336         3.371        0.759    1.614
  2SLS         0.5   3000   0.827        1.463         1.473        0.921    0.933
  2SLS         0.5   5000   0.821        1.131         1.145        0.853    0.737
  2SLS         0.6   1000   0.402        3.387         3.440        0.849    1.620
  2SLS         0.6   3000   0.771        1.503         1.520        0.852    0.958
  2SLS         0.6   5000   0.768        1.149         1.172        0.800    0.766
  LFES(Hkt)    0.4   1000   0.644        3.200         3.220        0.864    1.539
  LFES(Hkt)    0.4   3000   0.904        1.436         1.439        0.979    0.955
  LFES(Hkt)    0.4   5000   0.904        1.101         1.105        0.916    0.714
  LFES(Hkt)    0.5   1000   0.540        3.309         3.340        0.826    1.590
  LFES(Hkt)    0.5   3000   0.839        1.464         1.472        0.931    0.930
  LFES(Hkt)    0.5   5000   0.834        1.128         1.141        0.860    0.734
  LFES(Hkt)    0.6   1000   0.428        3.364         3.412        0.864    1.614
  LFES(Hkt)    0.6   3000   0.784        1.500         1.516        0.874    0.959
  LFES(Hkt)    0.6   5000   0.779        1.147         1.168        0.811    0.768
  2PQML        0.4   1000   149.774      11454.700     11455.667    0.979    1.280
  2PQML        0.4   3000   0.946        1.604         1.605        1.024    0.778
  2PQML        0.4   5000   0.994        1.014         1.014        1.023    0.585
  2PQML        0.5   1000   -6.616E+05   2.069E+07     2.070E+07    1.030    1.248
  2PQML        0.5   3000   0.922        1.497         1.499        1.045    0.727
  2PQML        0.5   5000   0.960        1.026         1.027        1.016    0.582
  2PQML        0.6   1000   -9.205       173.719       174.019      1.101    1.244
  2PQML        0.6   3000   0.796        1.888         1.899        0.983    0.722
  2PQML        0.6   5000   0.950        1.009         1.011        0.992    0.541
  WNLS         0.4   1000   584.489      18081.155     18090.568    1.027    1.246
  WNLS         0.4   3000   0.983        1.521         1.521        1.065    0.774
  WNLS         0.4   5000   1.027        0.980         0.980        1.028    0.564
  WNLS         0.5   1000   -1.324E+15   4.103E+16     4.106E+16    1.032    1.182
  WNLS         0.5   3000   0.971        1.427         1.428        1.067    0.712
  WNLS         0.5   5000   0.997        0.979         0.979        1.037    0.559
  WNLS         0.6   1000   -12.746      301.270       301.584      1.012    1.213
  WNLS         0.6   3000   0.959        1.465         1.466        1.121    0.689
  WNLS         0.6   5000   1.013        0.945         0.945        1.070    0.550

Table F.6: Simulation Results for DGP 2
(2SLS and LFES(Hkt) are the linear-model estimators; 2PQML and WNLS are the NFES estimators)

  estimator    ρ     n      mean         mc. st. dev   RMSE         median   MAD
  2SLS         0.4   1000   0.791        2.329         2.338        1.062    1.278
  2SLS         0.4   3000   0.918        1.260         1.263        0.930    0.816
  2SLS         0.4   5000   0.857        0.956         0.966        0.898    0.639
  2SLS         0.5   1000   0.523        2.581         2.625        0.824    1.447
  2SLS         0.5   3000   0.905        1.279         1.283        1.013    0.821
  2SLS         0.5   5000   0.810        0.942         0.961        0.847    0.610
  2SLS         0.6   1000   0.540        2.549         2.590        0.884    1.321
  2SLS         0.6   3000   0.822        1.275         1.287        0.904    0.841
  2SLS         0.6   5000   0.739        0.953         0.988        0.750    0.629
  LFES(Hkt)    0.4   1000   0.503        2.788         2.832        0.823    1.517
  LFES(Hkt)    0.4   3000   0.619        1.520         1.567        0.653    0.999
  LFES(Hkt)    0.4   5000   0.553        1.148         1.232        0.580    0.778
  LFES(Hkt)    0.5   1000   0.100        3.067         3.197        0.449    1.786
  LFES(Hkt)    0.5   3000   0.512        1.559         1.634        0.621    0.988
  LFES(Hkt)    0.5   5000   0.411        1.148         1.291        0.466    0.731
  LFES(Hkt)    0.6   1000   0.035        3.185         3.328        0.423    1.646
  LFES(Hkt)    0.6   3000   0.345        1.569         1.700        0.475    1.024
  LFES(Hkt)    0.6   5000   0.253        1.173         1.391        0.269    0.789
  2PQML        0.4   1000   -9.268       321.882       322.046      1.085    0.965
  2PQML        0.4   3000   1.085        1.212         1.215        1.016    0.606
  2PQML        0.4   5000   0.998        0.827         0.827        0.974    0.484
  2PQML        0.5   1000   0.422        6.241         6.268        0.975    0.944
  2PQML        0.5   3000   1.145        1.129         1.139        1.122    0.603
  2PQML        0.5   5000   1.038        0.744         0.745        1.029    0.446
  2PQML        0.6   1000   -40.132      1267.348      1268.015     1.069    0.849
  2PQML        0.6   3000   1.137        1.120         1.128        1.153    0.541
  2PQML        0.6   5000   1.069        0.719         0.722        1.080    0.426
  WNLS         0.4   1000   -3.302       135.961       136.029      1.121    0.926
  WNLS         0.4   3000   1.130        1.157         1.164        1.041    0.597
  WNLS         0.4   5000   1.015        0.829         0.829        0.987    0.483
  WNLS         0.5   1000   0.578        13.511        13.517       1.203    0.913
  WNLS         0.5   3000   1.177        1.093         1.107        1.148    0.570
  WNLS         0.5   5000   1.048        0.726         0.727        1.035    0.427
  WNLS         0.6   1000   -2.731       77.542        77.632       1.064    0.871
  WNLS         0.6   3000   1.163        1.068         1.080        1.173    0.549
  WNLS         0.6   5000   1.075        0.712         0.716        1.077    0.424

Table F.7: Simulation Results for DGP 3
(2SLS and LFES(Hkt) are the linear-model estimators; 2PQML and WNLS are the NFES estimators)

  estimator    ρ     n      mean         mc. st. dev   RMSE         median   MAD
  2SLS         0.4   1000   0.364        6.723         6.753        0.676    2.681
  2SLS         0.4   3000   0.750        3.039         3.049        0.816    1.599
  2SLS         0.4   5000   0.713        2.198         2.217        0.761    1.239
  2SLS         0.5   1000   0.537        6.434         6.451        0.745    2.611
  2SLS         0.5   3000   0.419        3.132         3.185        0.605    1.719
  2SLS         0.5   5000   0.702        2.253         2.273        0.787    1.297
  2SLS         0.6   1000   0.278        6.323         6.365        0.744    2.634
  2SLS         0.6   3000   0.575        3.067         3.096        0.768    1.586
  2SLS         0.6   5000   0.638        2.278         2.307        0.673    1.371
  LFES(Hkt)    0.4   1000   0.361        6.285         6.317        0.646    2.697
  LFES(Hkt)    0.4   3000   0.749        3.059         3.070        0.839    1.622
  LFES(Hkt)    0.4   5000   0.713        2.219         2.237        0.769    1.260
  LFES(Hkt)    0.5   1000   0.492        6.505         6.525        0.735    2.651
  LFES(Hkt)    0.5   3000   0.405        3.145         3.201        0.589    1.704
  LFES(Hkt)    0.5   5000   0.673        2.304         2.327        0.731    1.338
  LFES(Hkt)    0.6   1000   0.231        6.510         6.556        0.715    2.705
  LFES(Hkt)    0.6   3000   0.541        3.122         3.156        0.729    1.612
  LFES(Hkt)    0.6   5000   0.599        2.337         2.371        0.610    1.411
  2PQML        0.4   1000   -1097.543    34992.925     35010.165    1.241    1.836
  2PQML        0.4   3000   1.861        14.839        14.864       1.158    0.989
  2PQML        0.4   5000   1.331        1.959         1.987        1.105    0.769
  2PQML        0.5   1000   6.409E+05    1.969E+07     1.970E+07    1.425    1.571
  2PQML        0.5   3000   1.420        2.629         2.663        1.273    0.888
  2PQML        0.5   5000   1.566        2.160         2.233        1.328    0.746
  2PQML        0.6   1000   169.761      4842.930      4845.870     1.466    1.402
  2PQML        0.6   3000   2.485        24.906        24.950       1.506    0.830
  2PQML        0.6   5000   1.703        2.836         2.921        1.473    0.646
  WNLS         0.4   1000   34.431       874.882       875.520      1.235    1.826
  WNLS         0.4   3000   1.370        3.718         3.736        1.222    0.953
  WNLS         0.4   5000   1.367        1.693         1.733        1.201    0.772
  WNLS         0.5   1000   2.076E+05    5.937E+06     5.940E+06    1.409    1.356
  WNLS         0.5   3000   1.497        2.370         2.422        1.335    0.852
  WNLS         0.5   5000   1.591        1.563         1.671        1.419    0.789
  WNLS         0.6   1000   4308.951     131368.520    131439.140   1.605    1.328
  WNLS         0.6   3000   1.755        4.314         4.379        1.526    0.805
  WNLS         0.6   5000   1.630        2.129         2.221        1.416    0.612

Table F.8: Simulation Results for DGP 4
(2SLS and LFES(Hkt) are the linear-model estimators; 2PQML and WNLS are the NFES estimators)

  estimator    ρ     n      mean         mc. st. dev   RMSE         median   MAD
  2SLS         0.4   1000   0.646        3.689         3.706        0.819    2.099
  2SLS         0.4   3000   0.725        1.881         1.901        0.721    1.217
  2SLS         0.4   5000   0.732        1.389         1.415        0.762    0.917
  2SLS         0.5   1000   0.505        3.577         3.611        0.751    2.017
  2SLS         0.5   3000   0.663        1.847         1.878        0.653    1.232
  2SLS         0.5   5000   0.655        1.394         1.436        0.685    0.895
  2SLS         0.6   1000   0.321        4.308         4.361        0.692    2.039
  2SLS         0.6   3000   0.567        1.822         1.873        0.611    1.215
  2SLS         0.6   5000   0.595        1.380         1.438        0.624    0.885
  LFES(Hkt)    0.4   1000   0.888        3.854         3.855        0.988    2.069
  LFES(Hkt)    0.4   3000   0.910        1.834         1.836        0.877    1.187
  LFES(Hkt)    0.4   5000   0.914        1.354         1.356        0.935    0.841
  LFES(Hkt)    0.5   1000   0.724        3.512         3.523        0.925    1.979
  LFES(Hkt)    0.5   3000   0.833        1.806         1.814        0.821    1.230
  LFES(Hkt)    0.5   5000   0.831        1.366         1.376        0.874    0.834
  LFES(Hkt)    0.6   1000   0.584        3.733         3.756        0.849    1.954
  LFES(Hkt)    0.6   3000   0.729        1.794         1.814        0.758    1.164
  LFES(Hkt)    0.6   5000   0.759        1.353         1.374        0.798    0.852
  2PQML        0.4   1000   -18.308      205.225       206.131      1.213    1.764
  2PQML        0.4   3000   0.864        1.998         2.003        0.947    1.058
  2PQML        0.4   5000   0.935        1.439         1.440        1.064    0.770
  2PQML        0.5   1000   -8.922       93.284        93.810       1.164    1.468
  2PQML        0.5   3000   0.911        1.713         1.715        0.919    0.903
  2PQML        0.5   5000   0.850        1.393         1.401        0.909    0.835
  2PQML        0.6   1000   -7.682       64.476        65.058       1.483    1.463
  2PQML        0.6   3000   0.862        1.816         1.821        0.971    0.912
  2PQML        0.6   5000   0.858        1.480         1.487        1.087    0.822
  WNLS         0.4   1000   -156.597     1979.583      1985.846     1.026    2.137
  WNLS         0.4   3000   0.535        3.068         3.103        0.900    1.232
  WNLS         0.4   5000   0.730        1.937         1.955        1.078    1.004
  WNLS         0.5   1000   -13.194      107.502       108.435      1.360    1.609
  WNLS         0.5   3000   0.641        2.531         2.557        0.957    1.059
  WNLS         0.5   5000   0.671        1.792         1.822        0.896    1.113
  WNLS         0.6   1000   -12.635      117.247       118.038      1.342    1.403
  WNLS         0.6   3000   0.920        1.853         1.854        1.114    0.869
  WNLS         0.6   5000   0.850        1.547         1.554        1.223    0.690

Table F.9: Variables Description

  children   number of living children
  ceb        children ever born
  mort       number of dead children
  educ7      = 1 if educ ≥ 7
  age        age in years
  agesq      age²
  evermarr   = 1 if ever married
  urban      = 1 if live in urban area
  electric   = 1 if has electricity
  tv         = 1 if has tv
  radio      = 1 if has radio
  frsthalf   = 1 if mnthborn ≤ 6

Table F.10: Descriptive Statistics

  Variable    Mean      Std. Dev.   Min   Max
  children    2.267     2.222       0     13
  ceb         2.441     2.406       0     13
  mort        .173      .511        0     7
  educ7       .555      .496        0     1
  age         27.405    8.685       15    49
  agesq       826.460   526.923     225   2401
  evermarr    .476      .499        0     1
  urban       .516      .499        0     1
  electric    .140      .347        0     1
  tv          .092      .290        0     1
  radio       .701      .457        0     1
  frsthalf    .540      .498        0     1

Table F.11: Regression Results: selection equation

  Variable       probit              1 Regime 1PQML      2 Regime 1PQML
  age            -0.014 (0.018)      -0.013 (0.019)      -0.007 (0.018)
  agesq          -0.001** (0.000)    -0.001** (0.000)    -0.001*** (0.000)
  evermarr       -0.306*** (0.049)   -0.306*** (0.049)   -0.301*** (0.050)
  urban          0.257*** (0.043)    0.256*** (0.044)    0.258*** (0.044)
  electric       0.412*** (0.079)    0.413*** (0.079)    0.418*** (0.073)
  tv             0.828*** (0.111)    0.831*** (0.111)    0.811*** (0.089)
  radio          0.492*** (0.045)    0.491*** (0.046)    0.490*** (0.048)
  frsthalf       -0.215*** (0.042)   -0.211*** (0.044)   -0.231*** (0.042)
  constant       0.271*** (0.032)    0.269*** (0.032)    0.278*** (0.031)
  L-likelihood   -2371.668
  Note: The second and third columns report the results from single-step estimation. All the figures in parentheses are bootstrap standard errors. *: significant at 10%, **: 5%, ***: 1%.

Table F.12: Regression Results: dependent variable children

  Linear Models
  Variable      OLS                 2SLS                LET(Hkt)            LFES(Hkt) R1        LFES(Hkt) R0
  ATE (educ7)   -0.398*** (0.046)   -1.185* (0.691)     -2.232*** (0.432)   -1.552 (0.979)
  age           0.272*** (0.019)    0.262*** (0.021)    0.249*** (0.021)    0.251*** (0.030)    0.384*** (0.049)
  agesq         -0.002*** (0.000)   -0.002*** (0.000)   -0.002*** (0.000)   -0.003 (0.001)      -0.003 (0.001)
  evermarr      0.694*** (0.054)    0.610*** (0.096)    0.499*** (0.080)    0.194** (0.098)     0.930*** (0.198)
  urban         -0.246*** (0.047)   -0.178** (0.078)    -0.088 (0.066)      0.206** (0.088)     -0.478*** (0.172)
  electric      -0.337*** (0.074)   -0.233** (0.114)    -0.094 (0.098)      0.197 (0.126)       -0.512 (0.347)
  tv            -0.330*** (0.085)   -0.155 (0.182)      0.078 (0.124)       0.467*** (0.142)    -0.563 (0.698)
  radio         0.027 (0.053)       0.153 (0.126)       0.322*** (0.099)    0.620*** (0.129)    -0.052 (0.294)
  constant      -3.540*** (0.035)   -2.880*** (0.385)   3.508*** (0.246)    1.878** (0.943)
  cov(ε, v)                                             1.108*** (0.257)    2.550*** (0.454)    -0.524 (0.905)
  R²            0.586               0.563               0.589               0.595
  σ             1.431               1.471               1.427               1.417

  NET (1 Regime)
  Variable      1PQML               2PQML               NLS
  ATE           -0.103 (0.363)      -0.120 (0.324)      -0.681* (0.394)
                -0.046 (0.158)      -0.053 (0.142)      -0.303* (0.172)
  age           0.340*** (0.009)    0.340*** (0.009)    0.265*** (0.012)
  agesq         -0.004*** (0.000)   -0.004*** (0.000)   -0.003*** (0.000)
  evermarr      0.326*** (0.030)    0.325*** (0.029)    0.291*** (0.033)
  urban         -0.101*** (0.024)   -0.101*** (0.023)   -0.085*** (0.027)
  electric      -0.162*** (0.044)   -0.161*** (0.042)   -0.135*** (0.051)
  tv            -0.203*** (0.061)   -0.201*** (0.058)   -0.108* (0.070)
  radio         -0.015 (0.032)      -0.014 (0.030)      0.035 (0.037)
  constant      -5.514*** (0.085)   -5.507*** (0.077)   -4.059*** (0.240)
  cov(ε, v)     -0.060 (0.094)      -0.056 (0.085)      0.099 (0.104)
  L-likelihood  -1088.243           1283.413            8597.940

  NFES (2 Regime)
                1PQML                                    2PQML                                    NLS
  ATE           -0.830*** (0.318)                        -0.841*** (0.322)                        -1.224*** (0.375)
  Variable      R1                  R0                   R1                  R0                   R1                  R0
  age           0.412*** (0.019)    0.294*** (0.015)     0.410*** (0.018)    0.293*** (0.015)     0.333*** (0.022)    0.239*** (0.017)
  agesq         -0.006*** (0.000)   -0.003*** (0.000)    -0.005*** (0.000)   -0.003*** (0.000)    -0.004*** (0.000)   -0.003*** (0.000)
  evermarr      0.268*** (0.045)    0.312*** (0.042)     0.270*** (0.044)    0.309*** (0.042)     0.219*** (0.050)    0.293*** (0.045)
  urban         0.006 (0.041)       -0.129*** (0.034)    0.003 (0.040)       -0.127*** (0.033)    0.031 (0.045)       -0.111*** (0.036)
  electric      -0.070 (0.057)      -0.155* (0.079)      -0.074 (0.057)      -0.150* (0.077)      -0.028 (0.065)      -0.142* (0.085)
  tv            -0.060 (0.079)      -0.121 (0.155)       -0.063 (0.078)      -0.111 (0.153)       0.075 (0.089)       -0.062 (0.174)
  radio         0.067 (0.061)       0.000 (0.046)        0.063 (0.060)       0.004 (0.045)        0.169** (0.069)     0.021 (0.049)
  constant      -6.843*** (0.101)   -4.728*** (0.153)    -6.813*** (0.098)   -4.697*** (0.146)    -5.671*** (0.092)   -3.677*** (0.169)
  cov(ε, v)     0.390*** (0.178)    -0.081 (0.169)       0.373** (0.169)     -0.062 (0.159)       0.656*** (0.202)    -0.009 (0.178)
  L-likelihood  -1067.756                                1303.732                                 8521.373

  Note: All the covariates are demeaned. All the figures in parentheses are bootstrap standard errors.
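The Monte Carlo summary statistics reported throughout these tables (mean, mc. st. dev, RMSE, median, MAD, and, in the Chapter 3 tables, IR) can be computed from the vector of replication-by-replication ATE estimates. The sketch below is illustrative only: it assumes RMSE and MAD are taken around the population ATE of 1 (as stated in the figure notes) and that IR is the interdecile range; the exact definitions used in the dissertation may differ.

```python
import numpy as np

def mc_summary(estimates, true_ate=1.0):
    """Summary statistics for a vector of Monte Carlo ATE estimates.

    Assumptions (not taken from the original text): RMSE is computed
    around the true ATE, MAD is the median absolute deviation from the
    true ATE, and IR is the interdecile (90th minus 10th percentile) range.
    """
    est = np.asarray(estimates, dtype=float)
    return {
        "mean": est.mean(),
        "mc. st. dev": est.std(ddof=1),  # spread across replications
        "RMSE": np.sqrt(np.mean((est - true_ate) ** 2)),
        "median": np.median(est),
        "MAD": np.median(np.abs(est - true_ate)),
        "IR": np.percentile(est, 90) - np.percentile(est, 10),
    }

# Stand-in for 1,000 replications of an estimator centered at the true ATE of 1
rng = np.random.default_rng(0)
draws = rng.normal(loc=1.0, scale=0.5, size=1000)
stats = mc_summary(draws)
```

Because the mean and RMSE square deviations, a handful of extreme replications (as in several NFES cells above) can dominate them, while the median and MAD remain stable; that is why the tables report both sets of statistics.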
Table F.13: Regression Results: dependent variable ceb

  Linear Models
  Variable      OLS                 2SLS                LET(Hkt)            LFES(Hkt) R1        LFES(Hkt) R0
  ATE (educ7)   -0.462*** (0.048)   -1.187 (0.725)      -2.496*** (0.483)   -2.118* (1.084)
  age           0.269*** (0.021)    0.260*** (0.023)    0.243*** (0.023)    0.251*** (0.033)    0.388*** (0.053)
  agesq         -0.002*** (0.000)   -0.002*** (0.000)   -0.002*** (0.000)   -0.003*** (0.001)   -0.003*** (0.001)
  evermarr      0.734*** (0.058)    0.657*** (0.101)    0.517*** (0.087)    0.174 (0.107)       0.936*** (0.221)
  urban         -0.248*** (0.049)   -0.186** (0.082)    -0.073 (0.072)      0.233** (0.097)     -0.419** (0.190)
  electric      -0.389*** (0.078)   -0.293** (0.119)    -0.119 (0.105)      0.229* (0.138)      -0.507 (0.374)
  tv            -0.389*** (0.089)   -0.228 (0.190)      0.063 (0.136)       0.537*** (0.153)    -0.426 (0.769)
  radio         0.001 (0.057)       0.118 (0.134)       0.329*** (0.109)    0.733*** (0.142)    -0.048 (0.325)
  constant      2.698*** (0.037)    3.101*** (0.405)    3.828*** (0.276)    2.308** (1.046)
  cov(ε, v)                                             1.229*** (0.287)    2.944*** (0.493)    -0.296 (1.003)
  R²            0.605               0.588               0.608               0.615
  σ             1.513               1.545               1.509               1.495

  NET (1 Regime)
  Variable      1PQML               2PQML               NLS
  ATE           -0.179 (0.373)      -0.189 (0.343)      -0.897** (0.419)
                -0.074 (0.152)      -0.078 (0.141)      -0.372** (0.172)
  age           0.339*** (0.009)    0.339*** (0.009)    0.264*** (0.012)
  agesq         -0.004*** (0.000)   -0.004*** (0.000)   -0.003*** (0.000)
  evermarr      0.321*** (0.029)    0.321*** (0.029)    0.283*** (0.034)
  urban         -0.093*** (0.023)   -0.093*** (0.023)   -0.071*** (0.027)
  electric      -0.172*** (0.043)   -0.171*** (0.042)   -0.140*** (0.050)
  tv            -0.215*** (0.079)   -0.214*** (0.057)   -0.106 (0.070)
  radio         -0.023 (0.114)      -0.023 (0.030)      0.033 (0.037)
  constant      0.518*** (0.083)    0.520*** (0.077)    0.783*** (0.104)
  cov(ε, v)     -0.052 (0.091)      -0.049 (0.084)      0.131 (0.103)
  L-likelihood  -84.416             2287.245            -9545.770

  NFES (2 Regime)
                1PQML                                    2PQML                                    NLS
  ATE           -0.974* (0.591)                          -0.983** (0.400)                         -1.488*** (0.511)
  Variable      R1                  R0                   R1                  R0                   R1                  R0
  age           0.405*** (0.018)    0.293*** (0.015)     0.404*** (0.018)    0.292*** (0.014)     0.318*** (0.021)    0.239*** (0.018)
  agesq         -0.005*** (0.000)   -0.003*** (0.000)    -0.005*** (0.000)   -0.003*** (0.000)    -0.004*** (0.000)   -0.003*** (0.000)
  evermarr      0.287*** (0.047)    0.291*** (0.044)     0.289*** (0.045)    0.289*** (0.043)     0.240*** (0.052)    0.273*** (0.047)
  urban         -0.016 (0.041)      -0.101*** (0.035)    -0.018 (0.040)      -0.099*** (0.033)    0.006 (0.045)       -0.081** (0.037)
  electric      -0.103* (0.060)     -0.139* (0.080)      -0.106* (0.057)     -0.136* (0.076)      -0.056 (0.067)      -0.128 (0.085)
  tv            -0.106 (0.083)      -0.069 (0.158)       -0.109 (0.081)      -0.061 (0.151)       0.033 (0.096)       0.007 (0.174)
  radio         0.053 (0.063)       0.001 (0.051)        0.051 (0.061)       0.004 (0.046)        0.165** (0.073)     0.025 (0.053)
  constant      0.164 (0.111)       0.645*** (0.179)     0.171 (0.105)       0.656*** (0.154)     0.098 (0.102)       0.809*** (0.190)
  cov(ε, v)     0.300 (0.183)       0.011 (0.179)        0.289* (0.171)      0.023 (0.159)        0.579*** (0.212)    0.091 (0.185)
  L-likelihood  -66.489                                  2305.122                                 -9480.499

  Note: All the covariates are demeaned. All the figures in parentheses are bootstrap standard errors.

Table F.14: Average Treatment Effects

                           children            ceb                 mort
  2SLS           E(y1)     1.741               1.914               0.173
                 E(y0)     2.926               3.101               0.175
                 ATE       -1.185* (0.691)     -1.187 (0.725)      -0.002 (0.220)
  LET(Heckit)    E(y1)     1.276               1.332               0.056
                 E(y0)     3.508               3.828               0.32
                 ATE       -2.232*** (0.432)   -2.496*** (0.483)   -0.264** (0.123)
  LFES(Heckit)   E(y1)     0.326               0.19                -0.136
                 E(y0)     1.878               2.308               0.43
                 ATE       -1.552 (0.979)      -2.118* (1.084)     -0.566* (0.291)
  1PQML          E(y1)     1.482               1.683               0.201
                 E(y0)     2.312               2.657               0.345
                 ATE       -0.83** (0.337)     -0.974** (0.438)    -0.144 (0.152)
  2PQML          E(y1)     1.499               1.697               0.198
                 E(y0)     2.34                2.68                0.34
                 ATE       -0.841*** (0.310)   -0.983*** (0.373)   -0.142 (0.123)
  NLS            E(y1)     1.264               1.402               0.138
                 E(y0)     2.488               2.89                0.402
                 ATE       -1.224*** (0.357)   -1.488*** (0.454)   -0.264 (0.166)
  Note: All the figures in parentheses are bootstrap standard errors. The three nonlinear estimators at the bottom are all from the NFES model.

Appendix G

Figures in Chapter 2

Figure G.1: Selected Monte Carlo Simulation Results for DGP 0.
Note: Each column represents the sampling distribution of the 2SLS, NET (WNLS) and NFES (WNLS) estimators, from left to right. Each row represents that of ρ = .4, .5 and .6, from top to bottom. The sample sizes are 5000 for all. Among the five grid lines, the middle one represents the population ATE, i.e. 1.

Figure G.2: Selected Monte Carlo Simulation Results for DGP 1.
Note: Each column represents the sampling distribution of the 2SLS and NFES (WNLS) estimators, from left to right. Each row represents that of ρ = .4, .5 and .6, from top to bottom. The sample sizes are 5000 for all. Among the five grid lines, the middle one represents the population ATE, i.e. 1.

Figure G.3: Selected Monte Carlo Simulation Results for DGP 2.

Figure G.4: Selected Monte Carlo Simulation Results for DGP 3.

Appendix H

Tables in Chapter 3

Table H.1: Simulation Results for ρ = 0.3

                       DGP I                    DGP II                   DGP III
                linear      nonlinear    linear      nonlinear    linear      nonlinear
  n=1000
  mean          0.5640      2.192*       1.2664      1.584*       1.2505      5.366*
  mc. st. dev   1.4758      3.703*       1.7732      2.326*       2.4280      15.545*
  RMSE          1.5389      3.890*       1.7931      2.399*       2.4409      16.146*
  median        0.5930      1.031        1.1954      0.961        1.0098      1.041
  MAD           0.9266      1.023        1.0931      0.939        1.2917      1.322
  IR            3.5610      7.286        4.3419      5.947        5.0139      15.939
  n=3000
  mean          0.5781      1.361        1.3657      1.233        1.2210      2.430
  mc. st. dev   0.7665      1.613        0.9766      1.423        1.1970      5.627
  RMSE          0.8750      1.653        1.0429      1.442        1.2172      5.806
  median        0.5971      1.064        1.3409      0.913        1.1089      1.074
  MAD           0.5106      0.547        0.6304      0.488        0.7878      0.809
  IR            1.9470      2.398        2.4461      2.403        3.1001      5.169
  n=5000
  mean          0.5356      1.158        1.3692      1.035        1.2210      1.606
  mc. st. dev   0.6042      0.807        0.7265      0.766        0.9759      2.919
  RMSE          0.7620      0.822        0.8150      0.767        1.0006      2.981
  median        0.5220      1.032        1.3442      0.890        1.1648      1.046
  MAD           0.4223      0.394        0.5115      0.367        0.6017      0.589
  IR            1.5592      1.750        1.8633      1.705        2.3029      3.032
  Note: The Monte Carlo standard deviation is substantially large for the sample size of 1000. The * indicates that outlying values greater than the maximum or smaller than the minimum values for the sample size of 3000 were discarded from the generated data.

Table H.2: Simulation Results for ρ = 0.5

                       DGP I                    DGP II                   DGP III
                linear      nonlinear    linear      nonlinear    linear      nonlinear
  n=1000
  mean          0.7653      1.358*       1.3439      1.634*       1.6385      3.457*
  mc. st. dev   1.4605      1.843*       1.7058      2.590*       2.3073      9.524*
  RMSE          1.4792      1.877*       1.7401      2.667*       2.3940      9.836*
  median        0.7267      0.984        1.3196      0.931        1.4693      1.045
  MAD           0.9558      0.927        1.0801      0.990        1.3670      1.200
  IR            3.6600      5.074        4.3459      6.482        5.5827      11.164
  n=3000
  mean          0.7548      1.147        1.3890      1.182        1.5936      1.937
  mc. st. dev   0.7953      1.044        0.9522      1.421        1.3847      4.738
  RMSE          0.8323      1.055        1.0286      1.432        1.5066      4.830
  median        0.7494      0.975        1.3690      0.872        1.4946      1.028
  MAD           0.5424      0.478        0.6060      0.505        0.8624      0.764
  IR            2.0694      2.085        2.4681      2.331        3.2039      3.935
  n=5000
  mean          0.7155      1.139        1.3440      1.024        1.7004      1.418
  mc. st. dev   0.6110      0.749        0.7306      0.788        1.0403      1.757
  RMSE          0.6740      0.762        0.8076      0.789        1.2541      1.806
  median        0.7179      1.031        1.3386      0.865        1.6007      1.006
  MAD           0.4144      0.413        0.5071      0.388        0.6847      0.574
  IR            1.5459      1.594        1.9085      1.657        2.5199      2.697

Table H.3: Regression Results for Linear and NFES models

  Variable      OLS                 2SLS                LFES R1             LFES R0
  ATE (educ7)   -0.419*** (0.049)   -1.152* (0.596)     -1.168** (0.528)
  age           0.273*** (0.017)    0.263*** (0.019)    0.271*** (0.020)    0.431*** (0.043)
  agesq         -0.002*** (0.000)   -0.002*** (0.000)   -0.004*** (0.000)   -0.004*** (0.001)
  evermarr      0.685*** (0.052)    0.612*** (0.080)    0.135* (0.078)      1.088*** (0.138)
  urban         -0.259*** (0.046)   -0.172** (0.084)    0.187*** (0.064)    -0.596*** (0.120)
  electric      -0.468*** (0.066)   -0.287* (0.161)     0.117* (0.093)      -0.781*** (0.224)
  constant      -3.514*** (0.245)   -2.815*** (0.620)   0.085 (0.268)       1.253*** (0.478)
  cov(ε, v)                                             2.202*** (0.304)    0.714 (1.398)
  R²            0.585               0.564               0.593
  σ             1.433               1.469               1.418

  Variable      NFES (PQML) R1      NFES (PQML) R0      NFES (NLS) R1       NFES (NLS) R0
  ATE (educ7)   -0.956*** (0.278)                       -1.103*** (0.315)
  age           0.419*** (0.020)    0.300*** (0.016)    0.339*** (0.022)    0.250*** (0.018)
  agesq         -0.006*** (0.000)   -0.004*** (0.000)   -0.005*** (0.000)   -0.003*** (0.000)
  evermarr      0.212*** (0.047)    0.331*** (0.043)    0.174*** (0.055)    0.321*** (0.046)
  urban         0.063 (0.051)       -0.152*** (0.040)   0.103* (0.059)      -0.141*** (0.044)
  electric      0.010 (0.079)       -0.242** (0.113)    0.105 (0.090)       -0.244* (0.127)
  constant      -0.100 (0.097)      0.397** (0.154)     -5.668*** (0.327)   -4.077*** (0.469)
  cov(ε, v)     0.963*** (0.295)    -1.484** (0.605)    -0.204 (0.192)      -0.213 (0.218)
  L-likelihood  1299.311                                8534.8903
  Note: All the standard errors presented above are bootstrap standard errors. *, ** and *** indicate significance at the 10%, 5% and 1% levels respectively.

Table H.4: Regression Results for LTCRC and NTCRC models

                          LTCRC                                     NTCRC
  Variable           R1                  R0                    R1                  R0
  ATE (educ7)        13.376*** (2.347)                         -1.020 (1.633)
  age (1)            0.104*** (0.030)    0.343*** (0.040)      0.405*** (0.040)    0.106 (0.088)
  agesq              -0.003** (0.001)    0.009*** (0.002)      0.005*** (0.001)    -0.001 (0.001)
  evermarr (2)       -0.047 (0.164)      1.377*** (0.299)      2.180*** (0.844)    0.690** (0.321)
  urban (3)          0.132 (0.155)       -0.178 (0.287)        -0.265 (0.382)      0.826*** (0.319)
  electric (4)       0.396** (0.175)     0.686 (0.919)         -0.675** (0.335)    -0.480 (2.471)
  age*evermarr                                                 -0.097** (0.041)    -0.009 (0.008)
  age*urban                                                    0.013 (0.025)       0.012 (0.008)
  age*electric                                                 0.009 (0.010)       -0.010 (0.054)
  evermarr*urban                                               0.060 (0.427)       0.132** (0.066)
  evermarr*electric                                            0.479* (0.272)      0.202 (0.178)
  urban*electric                                               0.295 (0.287)       0.088 (0.248)
  constant           0.999*** (0.345)    1.440*** (0.502)      -1.666 (1.842)      7.010*** (0.749)
  cov(ε, v)          1.208*** (0.460)    -0.677 (0.553)        2.438** (1.068)     1.353 (2.435)
  cov(1)             0.022 (0.033)       0.135*** (0.043)      -0.108 (0.074)      -0.041 (0.121)
  cov(2)             0.675** (0.287)     0.621** (0.313)       0.290 (0.414)       6.954** (3.374)
  cov(3)             -0.085 (0.263)      0.431 (0.279)         -0.312* (0.182)     -0.225 (2.127)
  cov(4)             -0.869 (0.583)      0.953* (0.563)        -0.094 (0.680)      -0.255 (1.240)
  L-likelihood       1341.602
  R²                 0.6036
  σ                  1.4026
  Note: The bootstrap standard error for the NTCRC ATE estimator is calculated by excluding all replications with ATE values over 13.
cov(1) is the covariance between v and the coefficient of variable (1), i.e. age; the variable number for each basic covariate is indicated in the variable list. cov(2) and the others are defined similarly.

REFERENCES

Altonji, J.G., Segal, L.M. (1996), "Small Sample Bias in GMM Estimation of Covariance Structures," Journal of Business and Economic Statistics 14, 353-366.

Amemiya, T. (1985), Advanced Econometrics. Cambridge: Harvard University Press.

Anderson, T.W., Kunitomo, N., Sawa, T. (1982), "Evaluation of the Distribution Function of the Limited Information Maximum Likelihood Estimator," Econometrica 50, 1009-1027.

Angrist, J.D. (2001), "Estimation of Limited Dependent Variable Models with Dummy Endogenous Regressors: Simple Strategies for Empirical Practice," Journal of Business and Economic Statistics 19(1), 2-16.

Angrist, J.D. (2010), "The Credibility Revolution in Empirical Economics: How Better Research Design is Taking the Con out of Econometrics," Journal of Economic Perspectives 24(2), 3-30.

Angrist, J.D., Evans, W.N. (1998), "Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size," American Economic Review 88, 450-477.

Angrist, J.D., Imbens, G., Krueger, A.B. (1999), "Jackknife Instrumental Variables Estimation," Journal of Applied Econometrics 14, 57-67.

Angrist, J.D., Krueger, A.B. (1991), "Does Compulsory School Attendance Affect Schooling and Earnings?," Quarterly Journal of Economics 106, 979-1014.

Angrist, J.D., Pischke, J.S. (2009), Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press.

Barro, R., Becker, G. (1989), "Fertility Choice in a Model of Economic Growth," Econometrica 57(2), 481-501.

Becker, G., Barro, R. (1988), "A Reformulation of the Economic Theory of Fertility," Quarterly Journal of Economics 103(1), 1-25.

Bekker, P.A. (1994), "Alternative Approximations to the Distributions of Instrumental Variable Estimators," Econometrica 62, 657-681.
Bratti, M., Miranda, A. (2010), "Non-pecuniary returns to higher education: the effect on smoking intensity in the UK," Health Economics.

Card, D. (2001), "Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems," Econometrica 69, 1127-1160.

Choi, P., Min, I. (2009), "Estimating endogenous switching regression model with a flexible parametric distribution function: application to Korean housing demand," Applied Economics 41(23), 3045-3055.

Fahlenbrach, R. (2009), "Founder-CEOs, Investment Decisions, and Stock Market Performance," Journal of Financial and Quantitative Analysis 44(2), 439-466.

Flores-Lagunes, A. (2007), "Finite Sample Evidence of IV Estimators under Weak Instruments," Journal of Applied Econometrics 22, 677-694.

Garen, J. (1984), "The Returns to Schooling: A Selectivity Bias Approach with a Continuous Choice Variable," Econometrica 52, 1199-1218.

Gourieroux, C., Monfort, A., Trognon, A. (1984), "Pseudo-Maximum Likelihood Methods: Theory," Econometrica 52, 681-700.

Greene, W.H. (1998), "Gender Economics Courses in Liberal Arts Colleges: Further Results," The Journal of Economic Education 29, 291-300.

Hayashi, F. (2000), Econometrics. Princeton, NJ: Princeton University Press.

Heckman, J.J. (1976), "The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator for Such Models," Annals of Economic and Social Measurement 5, 475-492.

Heckman, J.J., Vytlacil, E. (1998), "Instrumental Variables Methods for the Correlated Random Coefficient Model," Journal of Human Resources 33, 974-987.

Hellström, J., Nordström, J. (2008), "A count data model with endogenous household specific censoring: the number of nights to stay," Empirical Economics 35, 179-192.

Hirano, K., Imbens, G., Rubin, D., Zhou, X.H. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics 1(1), 69-88.

Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association 81, 945-960.

Ichimura, H. (1993), "Semiparametric least squares (SLS) and weighted SLS estimation of single-index models," Journal of Econometrics 58, 71-120.

Imbens, G. (2005), "Semiparametric Estimation of Average Treatment Effects under Exogeneity: A Review," Review of Economics and Statistics.

Imbens, G., Wooldridge, J. (2009), "Recent Developments in the Econometrics of Program Evaluation," Journal of Economic Literature 47(1), 5-86.

Innes, R., Sam, A. (2008), "Voluntary Pollution Reductions and the Enforcement of Environmental Law: An Empirical Study of the 33/50 Program," Journal of Law and Economics 51(2), 271-296.

Johnson, N.L., Kotz, S. (1972), Distributions in Statistics: Continuous Multivariate Distributions. New York: Wiley.

Kenkel, D., Terza, J. (2001), "The Effect of Physician Advice on Alcohol Consumption: Count Regression with an Endogenous Treatment Effect," Journal of Applied Econometrics 16(2), 165-184.

Klein, R.W., Spady, R.H. (1993), "An Efficient Semiparametric Estimator for Discrete Choice Models," Econometrica 61, 387-421.

Koç, C. (2005), "Health-Specific Moral Hazard Effects," Southern Economic Journal 72(1), 98-118.

Lam, D., Duryea, S. (1999), "Effects of schooling on fertility, labor supply and investments in children, with evidence from Brazil," Journal of Human Resources 34(1), 160-192.

Maddala, G.S. (1986), Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.

Masuhara, H. (2008), "Semi-nonparametric count data estimation with an endogenous binary variable," Economics Bulletin 3(42), 1-13.

McGeary, K., French, M. (2000), "Illicit Drug Use and Emergency Room Utilization," Health Services Research 35(1), 153-169.

Mealli, F., Imbens, G., Ferro, S., Biggeri, A. (2004), "Analyzing a Randomized Trial on Breast Self-Examination with Noncompliance and Missing Outcomes," Biostatistics 5(2), 207-222.
Miranda, A. (2003), "Socio-economic characteristics, completed fertility, and the transition from low to high order parities in Mexico," University of Warwick.

Miyata, S., Sawada, Y. (2007), "Learning, Risk, and Credit in Households' New Technology Investments: The Case of Aquaculture in Rural Indonesia."

Newey, W.K. (1994), "The Asymptotic Variance of Semiparametric Estimators," Econometrica 62, 1349-1382.

Newey, W.K. (2009), "Two-step series estimation of sample selection models," Econometrics Journal 12 (Supplement 1), S217-S229.

Newey, W.K., Powell, J.L., Walker, J.R. (1990), "Semiparametric Estimation of Selection Models: Some Empirical Results," American Economic Review 80, 324-328.

Olsen, R.J. (1980), "A Least-Squares Correction for Selectivity Bias," Econometrica 48, 1815-1820.

Parrado, E., Flippen, C. (2005), "Migration and Gender among Mexican Women," American Sociological Review 70(4), 606-632.

Parrado, E., Flippen, C., McQuiston, C. (2005), "Migration and Relationship Power among Mexican Women," Demography 42(2), 347-372.

Deb, P., Trivedi, P. (2004), "Specification and Simulated Likelihood Estimation of a Non-normal Treatment-Outcome Model with Selection: Application to Health Care Utilization," Department of Economics, Indiana University.

Powell, J.L., Stock, J.H., Stoker, T.M. (1989), "Semiparametric Estimation of Index Coefficients," Econometrica 57, 1403-1430.

Romeu, A., Vera-Hernández, M. (2005), "Counts with an endogenous binary regressor: a series expansion approach," Econometrics Journal 8, 1-22.

Rosenzweig, M., Schultz, T. (1985), "The demand and supply of births and its life-cycle consequences," American Economic Review 75(5), 992-1015.

Rosenzweig, M., Schultz, T. (1989), "Schooling, information, and nonmarket productivity: contraceptive use and its effectiveness," International Economic Review 30(2), 457-477.

Rubin, D. (1974), "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies," Journal of Educational Psychology 66, 688-701.

Rubin, D. (1978), "Bayesian Inference for Causal Effects," Annals of Statistics 6, 34-58.

Sam, A. (2010), "Impact of government-sponsored pollution prevention practices on environmental compliance and enforcement: evidence from a sample of US manufacturing facilities," Journal of Regulatory Economics 37, 266-286.

Schultz, T. (1994a), "Human capital, family planning, and their effects on population growth," American Economic Review 84(2), 255-260.

Schultz, T. (1994b), "Studying the impact of household economic and community variables on child mortality," Population and Development Review 10(0), 215-235.

Shin, J., Moon, S. (2007), "Do HMO plans reduce health care expenditure in the private sector?," Economic Inquiry 45(1), 82-99.

Staiger, D., Stock, J. (1997), "Instrumental Variables Regression with Weak Instruments," Econometrica 65, 557-586.

Terza, J. (1998), "Estimating Count Models with Endogenous Switching: Sample Selection and Endogenous Treatment Effects," Journal of Econometrics 84, 129-154.

Terza, J. (1999), "Estimating Endogenous Treatment Effects in Retrospective Data Analysis," Value in Health 2(6), 429-434.

Terza, J. (2008), "Parametric Regression and Health Policy Analysis: Estimation and Inference in the Presence of Endogeneity," Department of Economics, University of Florida.

Terza, J. (2009), "Parametric nonlinear regression with endogenous switching," Econometric Reviews 28(6), 555-580.

Terza, J., Bradford, D., Dismuke, C. (2008), "The Use of Linear Instrumental Variables Methods in Health Services Research and Health Economics: A Cautionary Note," Health Services Research 43(3).

UNESCO (2011), http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=3674 (retrieved on November 23, 2011).

Vargas, M., Elhewaihi, M. (2007), "What is the impact of duplicate coverage on the demand for health care in Germany?"

Vella, F., Verbeek, M. (1999), "Estimating and Interpreting Models with Endogenous Treatment Effects," Journal of Business and Economic Statistics 17, 473-478.

Wooldridge, J. (1997), "Quasi-Likelihood Methods for Count Data," in Handbook of Applied Econometrics, Volume 2, Pesaran, M.H., Schmidt, P. (eds.). Oxford: Blackwell, 352-406.

Wooldridge, J. (1997), "On Two Stage Least Squares Estimation of the Average Treatment Effect in a Random Coefficient Model," Economics Letters 56, 129-133.

Wooldridge, J. (2003), "Further Results on Instrumental Variables Estimation of Average Treatment Effect in the Correlated Random Coefficient Model," Economics Letters 79, 185-191.

Wooldridge, J. (2005), "Fixed Effects and Related Estimators for Correlated Random Coefficient and Treatment Effect Panel Data Models," Review of Economics and Statistics 87, 385-390.

Wooldridge, J. (2007), "Control Function and Related Methods," What's New in Econometrics?, National Bureau of Economic Research.

Wooldridge, J. (2010), Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

Wooldridge, J. (2011), "Quasi-Maximum Likelihood Estimation and Testing for Nonlinear Models with Endogenous Explanatory Variables," Department of Economics, Michigan State University.

Zhang, J., Song, X. (2007), "Fertility Differences between Married and Cohabiting Couples: A Switching Regression Analysis," discussion paper series, Institute for the Study of Labor.

Ziliak, J.P. (1997), "Efficient Estimation with Panel Data When Instruments Are Predetermined: An Empirical Comparison of Moment-Condition Estimators," Journal of Business and Economic Statistics 15, 419-431.