THREE ESSAYS IN APPLIED ECONOMETRICS By Monthien Satimanon A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics - Doctor of Philosophy 2013 ABSTRACT THREE ESSAYS IN APPLIED ECONOMETRICS By Monthien Satimanon The three essays are self-contained and are the combination of applied and empirical econometrics. They are “Comparisons of Approaches in Measuring Willingness to Pay for Environmental Services”, “Comparisons of Approaches in Measuring Causes of Wage Inequality”, and “Estimation of Binary Response Model with Endogeneity and Heteroskedasticity.” The first essay proposes a comparison of both parametric and semiparametric estimation of willingness to pay (WTP) for environmental services. In order to solve for problem of inconsistency of estimation since heteroskedasticity, several conventional and new methods are used in the analysis. The methods are Probit (Probit), Heteroskedasticy Probit (HP), Turnbull (T), Watnabe (2010), Ahn (2000), and sieve semiparametric estimator (S). The comparison includes the estimated parameters as well as the estimated standard errors since the WTP is derived from these parameters. Monte Carlo simulations have been used to compare finite sample properties of each estimating methods. The empirical application comes from a study of the demand for payment for environmental services, water quality preservation, in eastern Costa Rica. By Monte Carlo Simulation, we found out that neglecting heteroskedasticity could lead to over estimation of WTP by almost 100 percent. In the empirical study, we found out that WTP for water conservation program in villages in eastern Costa Rica is about 2400 Colones that is about two times of the current monthly water cost program. This estimate is consistent with previous studies in the water conservation program. The second essay proposes a comparison of both parametric and semiparametric estimation of causes of income equality. In quantile regression setting, this paper analyzes the determinants of wage inequality with endogenous categorical regressors. The framework of Lee (2007) has been extended to cover the case where control function comes from generalized residuals of ordered probit models. We found out that the proposed method yields not only consistent but also efficient estimated parameters when there is a present of both endogeneity and heteroskedasticity while the conventional estimators led to overestimated parameters. In addition, we use all the proposed to estimate the return to education on wage using the data from Current Population Survery (CPS). We found out that returns to education are not monotonically increase throughout the wage distribution. The third essay analyzes the binary response model that encounters the problems of endogeneity and heteroskedasticity that lead to inconsistent estimated parameters. Our model allows both heteroskedasticity in structural and reduced formed equation. To handle both problems, we employ control function estimation with sieve estimator. The first step generates control function with flexible form of multiplicative exponential heteroskedasticity. The second step employs the use of maximum likelihood method to find the average partial effects with the adjusted standard errors based on Ackerberg et.al. (2012). Monte Carlo simulations are conducted to study the performance of this new estimator comparing to standard binary choice models. Then, the estimator is applied to estimation of female labor supply. For my parents, Montri Satimanon and Ream Chairuengsri, and my wife Thasanee Satimanon iv ACKNOWLEDGMENTS I would like to express my deepest appreciation to my advisor, Professor Jeffrey Wooldridge for his valuable advice, guidance, suggestion, and support throughout every stage of completing this dissertation. My work would not have been possible without his comment and support. I am grateful for the assistance and lively discussion with Professor Timothy (Tim) Volgelsang. I wish to thank Professor Todd Elder and Professor Robert Myers for their encouragement and his fruitful discussions. I would like to express my deepest gratitude for Thammasat University for financial support throughout my academic years. I am also indebted to Center for Statistical Training and Consulting, Michigan State University that have not only offered life-changing experience but also support for my dissertation. I would like to thank the Economics department staff for their help throughout my study here, especially Ms. Jennifer Carducci and Ms. Lori Jean Nichols. Finally yet importantly, my deepest blessing and gratitude goes to my family. I would never be who I am today without their encouragement and unconditional love from my parents. In addition, I would like to thank my parent-in-law for their support. Last but not least, I would like to thank my wife, Thasanee (my little Yui), who shared with me every moment of this dissertation throughout the thick and thin. I would not be this better self without her encouragement, love and support. Thank you very much, N’Yui. All ambiguities, deficiencies, and mistakes in this dissertation are solely my own responsibility. v TABLE OF CONTENTS LIST OF TABLES viii CHAPTER 1: COMPARISION OF APPROACHES TO ESTIMATING DEMAND FOR PAYMENT FOR ENVIRONMENTAL SERVICES 1 1.1. Introduction 1 1.2. Binary Response Model and Estimation Methods 3 1.2.1 Probit 4 1.2.2 Heteroskedasticity Probit (Hetprob) 5 1.2.3 Sieve estimator, Probit model with distribution-free heteroskedasticity (Sieve) 5 1.2.4 Using the survival function to estimate the willingness to pay 8 1.2.4.1 Turnbull Estimator 8 1.2.4.2 An (2000) 10 1.2.4.3 Watanabe (2010) 11 1.3 Monte Carlo Simulation for Single Bound Dichotomous Choice Model 14 1.4 Data and Estimating Results 21 1.4.1 Data 21 1.4.2 Estimation Results 24 1.5 Conclusion and further study 28 REFERENCES 30 CHAPTER 2: COMPARISION OF APPROACHES TO ESTIMATING DEMAND FOR PAYMENT FOR ENVIRONMENTAL SERVICES 33 2.1 Introduction 33 2.2 Great Lake income distribution and determinants of inequality 37 2.3 Models and Estimation Methods 42 2.3.1 Quantile Regresssion 43 2.3.2 Endogenous Quantile Treatment Effect 44 2.3.2.1 Abadie Imbens Angrist (2002) AAI 45 2.3.2.2 Unconditional quantile treatment effect Firpo (2007) and Frolich and Melly (2008) 48 2.3.3 Sieve Estimator and Control Function 51 2.4 Monte Carlo Simulation 54 2.4.1 Data generating process 55 2.4.2 Bias, standard deviation of Estimator, and Power 56 2.4.2.1 One endogenous variable model 57 2.4.2.2 Two endogenous variables model 60 2.4.2.3 Sensitivity analysis for the control function estimator to heteroskedasticity 66 2.4.3 Test size result 71 vi 2.5 Data and Estimating Results 2.6 Concluding Remarks and Further Study REFERENCES 75 84 86 CHAPTER 3: BINARY RESPONSE MODEL WITH CONTINUOUS ENDOGENOUS VARIABLE AND HETEROSKEDASTICITY 90 3.1 Introduction 90 3.2 Model and Estimation Method 94 3.3 Monte Carlo Simulation 98 3.3.1 Data generating process (DGP) 98 3.3.1.1 The data generating process contains only exogenous variables 99 3.3.1.2 The data generating process contains only endogeneity 100 3.3.1.3 Data generating process with endogeneity and heteroskedasticity in reduced form 102 3.3.1.4 Data generating process with endogeneity and heteroskedasticity in both equations 103 3.4 Empirical Study: Application to female labor supply 104 3.5 Conclusion 107 APPENDICES 124 APPENDIX A: Asymptotic Variance for the two-step estimators 125 APPENDIX B: Asymptotic Variance for the APEs 136 REFERENCES 141 vii LIST OF TABLES Table 1.1 Basic willingness to pay model 17 Table 1.2 Heteroskedastic Probit willingness to pay model 18 Table 1.3 Heteroskedastic Probit with square exponenetial variance willingness to pay model 22 Table 1.4 Communities in the study 23 Table 1.5 Data description and descriptive statistics (N=1141) 26 Table 1.6 Estimated coefficients 27 Table 2.1 Monte Carlo Simulation for Mean without endogeneity 58 Table 2.2 Monte Carlo Simulation for Mean with 0.1 degree of correlation 58 Table 2.3 Monte Carlo Simulation for Mean with 0.5 degree of correlation 59 Table 2.4 Monte Carlo Simulation for Mean with 0.9 degree of correlation 60 Table 2.5 Monte Carlo Simulation for Mean with 0 degree of correlation 63 Table 2.6 Monte Carlo Simulation for Mean with 0.1 degree of correlation 64 Table 2.7 Monte Carlo Simulation for Mean with 0.5 degree of correlation 65 Table 2.8 Monte Carlo Simulation for Mean with 0.9 degree of correlation 66 Table 2.9 Monte Carlo Simulation for the case of Heteroskedasticity and Endogeneity with 0.1 degree of correlation 68 Table 2.10 Monte Carlo Simulation for the case of Heteroskedasticity and Endogeneity with 0.5 degree of correlation 69 Table 2.11 Monte Carlo Simulation for the case of Heteroskedasticity and Endogeneity with 0.9 degree of correlation 70 Table 2.12 Test size based on model with two ordered endogenous variables 72 Table 2.13 Test size based one binary endogenous variable 73 viii Table 2.14 Test size with 1000 observations and variance of (9, 1) 74 Table 2.15 Percentiles, cut-off level of nominal household income ($) 76 Table 2.16 Summary statistics for the year 2001-2009 76 Table 2.17 Estimate result assuming education attainments are conditionally exogenous (n = 349,825) 80 Table 2.18 Estimate result assuming education attainments are endogenous (n = 349,825) 81 Table 3.1 The correct model is Probit with 1000 observations 109 Table 3.2 The correct model is Probit with 3000 observations 110 Table 3.3 The correct model is IVprobit with 1000 observations and  21  0.5,  22  0.5 111 Table 3.4 The correct model is IVprobit with 3000 observations and 21  0.5, 22  0.5 112 Table 3.5 The correct model is IVprobit with 1000 observations and  21  0.1,  22  0.1 113 Table 3.6 The correct model is IVprobit with 3000 observations and  21  0.1,  22  0.1 114 Table 3.7 The correct model is IVprobhet with 1000 observations and 21  0.5, 22  0.5 115 Table 3.8 The correct model is IVprobhet with 3000 observations and 21  0.5, 22  0.5 116 Table 3.9 The correct model is IVprobhet with 1000 observations and  21  0.1,  22  0.1 117 Table 3.10 The correct model is IVprobhet with 3000 observations and  21  0.1,  22  0.1 118 Table 3.11 The correct model is HetIVprobhet with 1000 observations and 21  0.5, 22  0.5 119 The correct model is HetIVprobhet with 3000 observations and 21  0.5, 22  0.5 120 Table 3.12 ix Table 3.13 Data description 121 Table 3.14 Estimation of Reduced form equation 121 Table 3.15 Empirical Result of APEs with analytical standard errors 122 Table 3.16 Empirical Result of APEs with analytical standard errors and Bootstrap standard errors 123 x CHAPTER 1: COMPARISION OF APPROACHES TO ESTIMATING DEMAND FOR PAYMENT FOR ENVIRONMENTAL SERVICES 1.1. Introduction This paper proposes a comparison of both parametric and semiparametric estimation of willingness to pay (WTP) for environmental services. Payment for environmental services (PES) is an approach that uses economic incentives either provided by public or private sector to protect natural resources. PES programs range from classical soil and water conservation to the new areas such as drinking and farming water supply and carbon sequestration. Hence, PES programs have been of recent interest globally and have led to an increasing number of empirical studies. Two important questions for PES studies are what determine the willingness to pay (WTP) or demand for PES? and what determines participation in PES programs by payment recipients?. Both of these questions have been answered by estimating the dichotomous choice (binary choice) models by using standard Probit or Logit estimation. The standard procedure of this contingent valuation can be found in the work of Haneman (1984) and Haab and McConnell (2003). In this binary response valuation models, WTP usually refers to conditional mean that is derived from estimated parameters under given underlying distributional assumption. The problem with this set up is that the welfare evaluations will crucially depend on these specific distributions. Unlike the linear probability model, the consistencies of estimated parameters depend on the underlying distribution as well as the conditional variance of the estimated model. In this context, semiparametric estimation provides an interesting alternative since it allows flexible functional form for conditional variance. 1 Semiparametric methods have been used in estimation of binary choice model for a long period of time, as summarized in Li and Racine (2008). In most theoretical studies, the semiparametric models have been compared with parametric binary choice model by simulation. Horowitz (1992) found that semiparametric models will be more robust when the estimated model contains heteroskedasticity. Klein and Spady (1993) and Li (1996) also found strong support for the semiparametric model. In empirical application of semiparametric methods to welfare measurement in binary choice model, Chen and Randall (1997), Creel and Loomis(1997), An (2000), Cooper (2002), and Huang et.al (2009) found out that the semipametric results are robust and can be used as a complementary procedure along with the parametric estimation. In addition, it can be used to check whether the parametric model encounters any inconsistency problems because of underlying distribution, unobserved heterogeneity, and heteroskedasticity. The comparing methods of estimation for the binary choice models are Probit (Probit), Heteroskedasticy Probit (Hetprob), Turnbull, Watnabe (2010), An (2000), and sieve semiparametric estimator (Sieve). The sieve method assumes exponential heteroskedasticity of the normal distribution with flexible functional form. Hence, comparison includes the estimated parameters as well as the estimated variances since the WTP derives from these parameters. The data used for the comparison of welfare measures comes from a study of the demand for payment for environmental services (PES) in eastern Costa Rica. The data set come from the extended surveys of Ortega-Pacheco et.al. (2009). The respondents are asked to vote “Yes” or “No” in the response to additional payment for the people who live in the upstream and mountainous area to preserve the quality as well as quality of 2 water sources that will be used in the lower area. The bid value has been provided in standard referendum contingent valuation. The goods here are clearly defined since the people who live along the downstream self-financed their existing water supply and already pay the water fees monthly. With the new estimation methods and extended data from previous study, the results show that the choice of methods and models can influence the mean willingness to pay. 1.2 Binary Response Model and Estimation Methods The estimated model in this study is specified as exponential willingness to pay function as in Haab and McConnell (2003). The exponential willingness to pay with linear combination of attributes and additional stochastic term is WTP  ezi i i (1.1) where i is a stochastic error with mean zero and unknown variance  2 .The probability that individual ‘i’ will answer “yes” for the corresponding offered bid ti will be determined whether the random willingness to pay is greater than the offered bid. P( yi 1 zi )  P(WTP  ti ) i (1.2) = P(exp(zi  i )  ti ) = P(i  ln(ti )  zi ) By assuming that the unknown standard errors is  , equation can be standardized to ln(ti ) zi   P( yi  1 zi )  P i    = P(i   ln(ti )   * zi )     3 (1.3) The conventional process in estimating the parameters of the model in (1.3) is to specifying the distribution of the error terms. In most of the studies,  i are independently and identically distributed (IID) with mean zero and variance 1. Then, either the underlying distribution of normal and/or logistic will be assumed as in the case of Probit and Logit estimation. For comparison in this study, only the basic Probit will be used. 1.2.1 Probit In Probit model, the probability of “yes” will be model in term of latent variable that yi  1 if yi *   * zi   ln(ti )  i  0 and probability of “no” will be defined as yi  0 if yi *   * zi   ln(ti )  i  0 . Or, binary response model is in the form of index function yi  1 * zi   ln(ti )  i  0.Also, for the errors term, it is assumed to be i ~ N (0,  2 ) and i ~ N (0,1) . Hence the distribution is assumed to be as followed. P( yi 1 zi )  ( * zi   ln(ti )) (1.4) where (x ) is the cumulative standard normal distribution. Then, the parameters can be estimated up to a scale as well as the marginal effects. In order to estimate this model, the maximum likelihood estimation will be used. Defining a new 1 (m  1) parametric vector B   *,   where ( m  1) is the dimension of covariates including constant terms, and define the data vector X i  zi , ti  the likelihood function will be n L( B yi , X i )   i {( X i ' B)yi 1  ( X i ' B)1 yi } (1.5) Then, the familiar log likelihood function is n ln L( B yi , X i )  i { yi ln ( X i ' B)  (1  yi ) ln[1  ( X i ' B)]} 4 (1.6) 1.2.2 Heteroskedasticity Probit (Hetprob) The simple estimation will be modified with the unobserved heterogeneity by incorporating the heteroskedasticity into standard Probit model. The variance  2 will be varying as a function of attributes. The variance will be a multiplicative function of zi as followed i  exp(zi ) (1.7) Substituting this variance into equation (1.3) yields multiplicative heteroskedastic probit model.  i ln(ti ) zi  P( yi  1 zi )  P  exp(z )  exp(z )  exp(z )   i i i   = P(i *   * ln(ti )   * *zi ) (1.8) Defining a new 1 (m  1) parametric vector B*  { * *,  *} where m  1 is the dimension of covariates including constant terms. Then, the log likelihood function will become n ln L( B yi , X i )  i yi ln ( X i ' B*)  (1  yi ) ln[1  ( X i ' B*)] (1.9) The result of estimation from equation (1.6) and (1.9) will be useful in positing whether our estimated model contain heteroskedasticity or not. Furthermore, the other assumptions that can be relaxed is functional specification of i . 1.2.3 Sieve estimator, Probit model with distribution-free heteroskedasticity (Sieve) 5 Sieve estimation refers to one class of semiparametric estimation that solves the problem of infinite dimensional parameter. The sieve method employs the optimization routine that tries to optimize the criterion function over finite approximated parameter spaces (sieves). The sieve method, in the simplest form, might be similar to how we choose the bandwidth and numbers in plotting the histogram. As pointed out by Chen (2007), the method of sieves is very flexible in estimating complicated semiparametric models with (or without) endogeneity and latent heterogeneity. It can easily incorporate prior information and constraints, and it can simultaneously estimate the parametric and nonparametric parts, typically with optimal convergence rates for both parts. Khan (2005) proposed a estimation method that is a further expansion of Horowitz (1992) method. The important assumption is the conditional median restriction to ensure the identification of estimated parameters  . med (i X i )  0 (1.10) and symmetric distribution of the error terms the local nonlinear least squares estimator for yi  1 * zi   ln(ti )  i  0 is defined as 2  X i ' B *  1 n  ˆ B  arg min   yi   h    B1 n i 1 n     (1.11) where hn is a sequence of positive numbers such that hn  0 as n   . This estimator will yield the estimated B with one of the estimated element to be normalized to 1 as usual for semiparametric estimation. Blevins and Khan (2011) provides the procedure to estimation equation (1.11), they suggested the use of probit criterion function for the sieve nonlinear least squares. The criterion function is 6 1 n i ( B*, l )     yi   X i ' B *  exp(l ( zi ))2 n i where l ( zi ) is finite dimensional scaling parameter and (1.12) is a finite vector of parameters. Then, they introduce a finite-dimensional approximation of l ( zi ) using a linear-in- parameters sieve estimator as in Chen (2007). They define the estimator as followed. Let b0 j( z ) denotes a sequence of known basis function and b n ( zi )  b0 j ( zi ),..., b0 ( zi ) i n for some integer  n . Then, the function expl ( zi ) will be estimated by the following sieve estimator exp b n ( zi )' n      where n is a vector of constants. Let  n  (, n )  An where An is the sieve space. The estimator can be defined by 1 n ˆ  n  arg min   yi   X i ' B * g n ( zi )2 An n i 1 (1.13) The choice of g n ( zi ) is arbitrary and can be any possible series such as power and polynomial series, and spline. In this study, we estimate the g n ( zi ) by exponential function that contains the power series of ( zi ) as a domain. Chen (2007) proved that the estimated parameters from sieve estimation will be asymptotically normal and consistent when the estimated number of sieve parameters grows with respect to number of observations. However, in this paper we are interested in the estimation of willingness to pay so we have to apply further step in estimation. From estimation of equation (1.12), we can get the estimation of g n ( zi ) , and then we will plug this one in the probit estimation of equation (1.3). The main reason that we proceed in two step estimation is that we can apply the results from Ackerberg et.al. (2012) in order to estimate the 7 asymptotic variance by using parametric approximation since it requires less computation power to get the variance of the estimate of B * and willingness to pay. Moreover, the first step estimation of equation (1.12) can be conduct by using Sieve-M estimator that comprises of both non-linear least squares as in Blevins and Khan (2011) and the use of standard maximum likelihood with flexible function Chen (2007). They should yield the same asymptotic result; however, their finite sample properties will be examined through Monte Carlo study among the nested models. Then, the usual delta method or Krinsky and Robb will easily compute the willingness to pay and its variance. 1.2.4 Using the survival function to estimate the willingness to pay Consider a PES project as in our study, villagers are asked to pay a given bid for either there will be conservation or not, hence the willingness to pay, WTP , may be i viewed as a survival time in a survival analysis. By giving a particular bid across individual, we can get a one to one response between the binary choice models as in 1.2.1-1.2.3 to the survival model. In this context we can define the general failure-time distribution as (ti zi )  P(WTP  ti zi ) , given fixed ti , it is the simple model of survival i probability S (ti zi ) 1  F (ti zi ) . In this study, three estimators will be used to illustrated and compared with the given binary choice models. They are Turnbull estimator, An (2000) estimator, and Watanabe (2010) estimator. 1.2.4.1 Turnbull estimator Turnbull estimator is a distribution-free estimator that relies on asymptotic properties. When there are large number of observations and the offered bid increases, 8 the proportion of “no” responses will be higher and the survival function is decreasing in bid. That is, the survival function supposed to mimic the real survival function when observations are large and it assumed to decrease monotonically. The steps in which Turnbull estimation are carried are as followed: - For each bid indexed t j , j  1, ... , m calculate the proportions, F j , of “no” responses. For example, if the first bid is 10 and has been given to 100 villagers and 10 villagers answered “no”. This proportion, F1 , will be 0.1. If the second bid is 20 and has been given to 100 villagers and 20 villagers answered “no”. This proportion, F2 , will be 0.2. - Calculate this proportion for all of the bids level j . - If F j 1  F j , there is no need for adjustment. - If F j 1  F j , there is need to pool bid j and j  1 together and then calculate the proportions of “no” with the observations from both of the bid categories. For example, if the second bid is 20 and has been given to 100 villagers and 20 villagers answered “no”. This proportion, F2 , will be 0.2. In addition, if third bid is 30 and the has been given to 100 villagers and 15 villagers answered “no”. This proportion, F3 , will be 0.15. Then, we have to merge second and third bid to one category and calculate the new F2 that is equal to (20 + 15) /(100 +100) = 0.175 - Continue this process of merging to allow for monotonicity in survival function. 9 - Set Fm1  1 , or there is no one who will pay after the last bid, survival function = 0. - Calculating probability distribution for each bid level as follows: f j  F j 1  F j . - The expected lower bound of willingness to pay will be  j 0 t j * f j m In this study the expected lower bound and upper bound as in Haab and McConnnell (1997) will be calculated to find the mid-point which will be compared with other estimators. 1.2.4.2 An (2000) He introduced a semiparametric model that has the same property as Cox Proportional hazard model with discrete time and unobserved heterogeneity as in Jenkins (1995). Willingness to pay is represented by the link function as follows WTP  f (zi ,i )   i i (1.14) where f can be generalized by generic link function  that is assumed to be continuous and differentiable with (0)  0 and limWTP m (WTP )   .i is assumed to follow i i a Type-I extreme value distribution. This lead to the conditional survivor function as same as Cox proportional hazard model of the form S (WTP zi )  P(WTP  ti zi )  exp((WTP )ezi ) i i i (1.15) Then, by pursuing the double bound dichotomous contingent valuation model and unobserved heterogeneity with unit-mean Gamma density, he formulated the same log 10 likelihood function as in Jenkins (1995) that is called discrete choice proportional hazard model with shared frailty. ln L( B yi, zi )   n i log({1  exp( zi ' B*)ln exp( l ) /  }  {1  exp( zi ' B*)l 1exp( l ) /  } n i j ) (1.16) However, in this study with single bound dichotomous choice model, it is not possible to assume shared frailty since there is only one bid or failed time per one individual. In addition, there is no data grouping mechanism. Hence, the likelihood function in equation (1.16) is not suitable for this study. Stripping off both shared frailty and data grouping mechanism, the log-likelihood in (1.16) will become the same as standard Cox proportional hazard model in survival analysis with the mean willingness to pay equal to area under the curve of survival function. m E(WTP zi )   (exp(( i )ezi ))d i 0 (1.17) where  i is defined as in equation (1.14). Following approximation as in An (2000)’s equation (1.17), it will be m E(WTP zi )   exp(( i )ezi )[t j  t j 1] i j 1 (1.18) 1.2.4.3 Watanabe (2010) The paper presented the nonparametric model to find mean willingness to pay with modeling it as the survival function as in the case of Turnbull and An(2000). The survival function is defined as 11 S (ti )  P(WTP  ti ) i (1.19) And the support of willingness to pay need to be in the range of offering bid [0, tm ] . Given the observed “yes” and “no” answer in standard single bound dichotomous choice model, the probability of answering “yes” will be assumed to follow Bernoulli random variable with the conditional probability of P( yi 1ti )  P(WTP  ti ) and expected value i of E( yi 1ti )  S (ti ) . Then he assume the distribution of the the bid ti follows - A bid ti follows a continuous distribution f (ti ) over the range of [0, tm ] , and f (ti )  0 in the range. - The range of support of bid and willingness to pay are equal to [0, tm ] . Under these two assumptions the expected willingness to pay of single bound dichotomous choice willingness to pay condition on observe attributes will be equal to t E(WTP zi )   m P(WTP  ti zi )dt i i 0 t   m E ( yi ti , zi )dt 0  y   E  i ti , zi   f (t )   i  (1.20) In order to estimate equation (1.20), the following assumption is required, ( yi , ti , zi ) is independent and identically distributed. Then, the consisted estimator of the ˆ ˆ minimum mean-squared errors linear approximation of equation (1.20) is zi  where  12 are estimated coefficients from running linear regression yi on intercepts and f ( ti ) attributes zi . These three estimators share similar characteristics in order to get better precision of estimating mean willingness to pay. First, the support of willingness to pay has to lies within the support of bid distribution. Second, number of distinct bids (ideally bid should continuously distributed as in case of survival time) matters in achieving the less interpolation errors of willingness to pay. Though, An and Watanabe claims that their estimator are non/semiparametric; however, they need to assume certain form of distribution either on unobserved heterogeneity or bid distribution. Lastly, for nonparametric estimators, the need for independent and identical assumption is required for estimation in Turnbull and Watanabe. To conclude this section, there are certain insights that might be gained from comparing these six methods of estimation. The probit and heteroskedastic probit models are computationally simple and should be more efficient if the underlying distributions are correctly specified. On the other hands, the four nonparametric and semiparametric models in this paper are not nested with each corresponding probit and heteroskedastic probit, but heuristic comparison can be made as in Beluzzo (2004). Results of Probit, Hetprob, and Sieve can be compared to see whether the underlying normal distribution is a valid assumption or not. In addition, results from Probit, Hetprob and Sieve can be compared to see whether the there is a problem of heteroskedasticity in the data generating process or not. 13 1.3 Monte Carlo Simulation for Single Bound Dichotomous Choice Model Monte Carlo simulations were carried out to compare the finite sample properties of six estimators with respect to varying conditional variances. The sample size is 500 and the number of simulation trials is 1000 at each simulation. First we consider the case where data generating process followed Watanabe (2010). That is true willingness to pay (WTP) follow the linear exponential function as WTP  exp(1  2 z1  i ) , where i z1 is considered to be individual attribute, i is random errors term and 1 and  2 are parameters. The data generating process are as followed: Case 1: 1  3 ,  2  0.5 , z1 ~ N (0,0.64) , and i ~ N (0,0.64) , the bids are set up following process as in Watanabe (2010). There are 20 equal bid levels ranging from 0 to 400 assigned randomly to 500 observations with 25 observations facing the same bid level. In case 1), the sieve estimator and heteroskedasticity probit model use the same variance adjustment term that is exponential of z1 . First, when the standard probit specification is true for estimating the WTP, the result are as expected, the probit estimator performs best not only in terms of bias but also variance of estimation. The three estimators under survival function analysis yield similar results. They under predict the mean WTP by almost 50 percent. However, their variances are quite low compared to the dichotomous choice models. The lower variance can be attributed to assuming the smoothness of survival function that make the extreme value less affect the mean WTP. In addition, for each underlying survival function, they are multiplied with corresponding bid points that further lead to lower variances of estimated mean willingness to pay. For the case of heteroskedasticity probit and Sieve 14 estimator, the misspecification in conditional variance lead to bias in prediction of mean WTP; nevertheless, the bias is not as severe as the case of survival function. Comparing Sieve and Hetprob estimators, both of them yield quite similar result in terms of mean WTP, variances, and RMSEs. Out of 1000 replications, the likelihood ratio test between heteroskedasticity probit and probit shows that the overspecified model is not that better fit than the latter one at 95 percent confidence. That is, only in 58 trials that the probit model is rejected with respect to the heteroskedasticity probit model. To conclude, the standard methods yield the best prediction for willingness to pay when there is no underlying conditional variance. Case 2: True WTP follows WTP  exp(1  2 z1  3z2  i ) , i  2  0.5 , 3  0.5 , z1 ~ N (0,0.64) , z2 ~ N (0,0.64) , 1  3 , and i ~ N (0, exp(0.1z1  0.1z2 )) . There are 20 equal bid levels ranging from 0 to 400 assigned randomly to 500 observations with 25 observations facing the same bid level. In this setup, heteroskedastic probit estimator (Hetprob) should yield the best approximation of the mean WTP. From Table 2, it overestimates the WTP by 5 percent with lowest variance among dichotomous choice models. On the other hands, there are two sieve estimators here, Hetrprob1 and Sieve1. They use the variance formula as in Blevins and Khan (2011) that over specified the conditional variance. Hetprob1 solutions come from maximum likelihood estimation at the first step while Sieve solutions apply nonlinear least squares first. Both of them provide the overestimated mean WTP and higher variance than three survival function models. It might be the case that the extreme values from the data generating process drive the result of mean WTP estimation. 15 Sieve1 yields similar result to probit estimator while Hetprob1, though having same explanatory variables for the conditional variance as Sieve1, yields similar results to Hetprob. It is important to note that Sieve1 estimator provide highest variance than Hetrprob1 given the highest maximum WTP estimated is 221 that is 100 percent higher than all estimates from other methods. We suspect that nonlinear least squares estimation of over specified model might be more biased and less efficient than maximum likelihood. Comparing across the nested model of Probit, Hetprob and Hetprob1 by using likelihood ratio test, there are only 125 trials that reject Probit against correctly specified Hetprob. In addition, there are only 213 trials that reject correctly specified Hetprob against over-specified Hetprob1. Hence, adding three more explanatory terms to conditional variance to Hetprob1 compared to Hetprob does not change the likelihood ratio between the two methods that much. This might help in explaining why there is more increase in bias and variance of estimation from using nonlinear least squares Sieve compared to using maximum likelihood Sieve. It is surprising that in this case Turnbull estimator yields the best result with lowest bias and RMSE among the all estimators. Moreover, An and Watanabe estimators still underestimate the mean willingness to pay by 25-30 percent with low variances. 16 Table 1.1 Basic willingness to pay model True Model mean willingness to pay = 47.746 Turnbull Watanabe An Probit Hetprob Sieve Mean willingness to pay 28.920 19.990 23.970 48.008 37.292 36.048 Bias 18.556 27.486 23.506 -0.532 10.184 11.428 Variance 10.910 11.029 8.209 60.203 40.590 40.918 Root mean squared errors(RMSE) 18.848 27.686 23.680 7.777 12.013 13.096 Maximum WTP 41.400 32.000 44.085 77.511 164.072 57.258 Minimum WTP 19.800 10.400 19.812 18.569 12.557 12.870 Frequency of rejection 58 Note: Monte Carlo Bias and Root mean square errors are measure with respect to true model willingness to pay. 17 Table 1.2 Heteroskedastic Probit willingness to pay model True Model mean willingness to pay = 37.773 Turnbull Watanabe An Probit Hetprob Hetprob1 Sieve1 Mean willingness to pay 38.395 30.178 27.682 52.035 39.223 42.193 53.327 Bias 0.622 -7.595 -10.091 14.262 1.450 4.420 15.554 Variance 23.394 18.050 11.958 98.292 57.737 82.785 403.976 Root mean squared errors(RMSE) 4.877 8.702 10.667 17.369 7.736 10.115 25.415 Maximum WTP 55.927 46.400 48.711 89.358 66.803 80.388 221.391 Minimum WTP 19.800 18.400 21.107 22.335 12.557 14.122 12.896 Frequency of rejection 125 213 Note: 1) Monte Carlo Bias and Root mean square errors are measure with respect to true model willingness to pay. 2) Hetprob represents heteroskedasticity model with correctly specified errors Hetprob1 represents heteroskedasticity model with overspecified standard errors as in Blevins and Khan (2011) with maximum likelihood estimation. 3) Sieve represents Blevins and Khan (2011) non-linear least squares estimation. 18 Case 3: True WTP follows WTP  exp(1  2 z1  3z2  i ) , 1  3 , i  2  0.5 , 3  0.5 , z1 ~ N (0,0.64) , z2 ~ N (0,0.64) , and 2 2 i ~ N (0, exp(0.1z1  0.2 z1z2  0.1z2 )) . There are 20 equal bid levels ranging from 0 to 400 assigned randomly to 500 observations with 25 observations facing the same bid level. In Table 1.3, only the Sieve1 and Hetprob1 estimator assumes the correct form of conditional variance, where the normal Sieve2 and Hetprob2 estimator assumes the conditional variances as in Blevins and Khan (2011). As expected the misspecification of conditional various affect the estimated mean WTP for all of the dichotomous choice models. The Probit and Hetprob1 models those are under specified the conditional variance lead to higher level of WTP than expected by 20 percent or more. Although, the Sieve1 and Hetprob1 estimator should yield lower bias; however, their RMSEs are still higher than the comparable Turnbull model but lower than An and Watanabe. Comparing across nested models, only Probit without conditional variance leads to considerable biased estimator that overshoots willingness to pay by 18 percent. At least by inserting conditional variance terms either correctly specified (Sieve1 and Hetprob1), over specified (Hetprob2 and Sieve2) helps in reducing the bias of estimated mean WTP. Using nonlinear least squares is precise with low variance when only the conditional variances are correctly specified. Considering the survival function models, all of them underestimate mean WTP with lower variances and Turnbull estimators yields the least biased result. 19 Comparing across the nested model of Probit, Hetprob, Hetprob1, and Hetprob2 by using likelihood ratio test, there are only 124 trials that reject Probit against under specified Hetprob. There are only 177 trials that reject under-specified Hetprob against correctly specified Hetprob1 and there are 135 that reject Hetprob1 in favors of Hetprob2. This comparison points out that given unknown form of conditional variances, it is better to flexibly model them rather than ignore them. Adding conditional variance terms will help not only reduced bias but also make estimated results more efficient. Nevertheless, over fitting the conditional variances tend to increase the variances compared to under fitting model. To conclude, comparing across six estimators yield interesting results, first, the survival function models give less variance than all of the dichotomous choice models that might come as a result of smoothing and interpolation. Out of three survival functions, Turnbull estimator midpoint might be the best among three estimators. However, in these setups, survival function models always yield inconsistent results that mean WTP will not converge to the true one even in large sample. We suspect that the result from Table 1.2) where Turnbull estimator outperforms all other estimators might comes out of data generating process. On the contrary, the dichotomous choice models perform better in term of less bias when conditional variances are prevalent either misspecified or not. For sieve estimator with maximum likelihood that assumes flexible conditional variance function, it performs best comparing to the rest of the models. However, since it shares the same feature as the other two dichotomous choice models, it always yields higher variance than survival function models. Hence, sieve estimator provides the middle ground in which we can compromise between assuming underlying 20 distribution with flexible conditional variance and give less bias willingness to pay when form of conditional variances is unknown. 1.4 Data and Estimating Results 1.4.1 Data The data in this study came from eastern Costa Rica. The research sites contain not only the two communities as in Ortega-Pacheco et.al. (2009) but also four communities within the region that were recently surveyed as presented in Table 4. The communities’ local water supply is too polluted for drinking water usage due to heavy use of chemical substances in nearby pineapple and banana plantation. Their drinking water supply comes from aqueducts that pipe in water from the forested upper reaches of their watershed. The communities have local water boards that oversee the construction and maintenance of these water systems and levy monthly fees for water. However, changes in land use in the upper reaches of the watersheds threaten the quality of the communities’ drinking water. To protect their water, the communities are considering PES programs to keep land uses from changing in the upper watershed. The surveys assess local resident’s willingness to pay to finance these PES programs. The payment vehicle is a monthly surcharge on their water bill. There are 1179 completed interviews from the surveys. 21 Table 1.3 Heteroskedastic Probit with square exponenetial variance willingness to pay model True Model mean willingness to pay = 44.986 Turnbull Watanabe An Probit Hetprob Hetprob1 Hetprob2 Sieve1 Sieve2 Mean willingness to pay 37.552 29.390 26.541 52.253 39.138 41.885 42.1929 43.341 51.120 Bias -7.434 -15.596 -18.445 7.267 -5.848 -3.101 -2.793 -1.645 6.134 Variance 20.129 17.015 11.853 92.106 53.508 69.425 82.785 111.529 342.966 Root mean squared errors(RMSE) 8.683 16.133 18.764 12.038 9.365 8.891 9.518 10.688 19.509 Maximum WTP 53.787 44.800 54.703 90.071 66.606 87.964 80.388 131.567 149.185 Minimum WTP 24.051 18.400 20.391 15.155 14.495 17.343 14.122 20.212 12.896 Frequency of rejection 124 177 135 Note: 1) Monte Carlo Bias and Root mean square errors are measure with respect to true model willingness to pay. 2) Hetprob represents heteroskedasticcity model with under specified conditional variance as in case 2 with only and . Hetprob1 represents heteroskedasticity model with correctly specified conditional variance. Hetprob2 represents heteroskedasticity model with overspecified standard errors as in Blevins and Khan (2011) with maximum likelihood estimation. 3) Sieve1 represents non-linear least squares estimation with correctly specified conditional variance. Sieve 2 represents Blevins and Khan (2011) non-linear least squares estimation with overspecified conditional variance 22 Table 1.4 Communities in the study Community Herediana CairoFrancia Interviews 164 397 Florida Alegría Milano Iberia 248 131 136 103 The dependent variable is the binary choice variable of voting “Yes” or “No” for the program for a particular fee (cost) in addition to the current water bill. The independent variables are the fee (cost) of the program, female dummy, age, high school dummy, household income, and other household characteristics. In the Table 5, summary statistics of variables used in estimation are presented along with their description. In total, there are 1141 observations to be used after using respondent with reported income. The respondents are asked to Vote “Yes” or no for the proposed increase in the monthly water fee. From, the observations about 66 percent of people voted “Yes”. This variable will be the dependent variable in the estimated model. The bid value for each respondent will range from 400 to 2400 Colones, this represents the additional water fee that each respondent has to pay for the PES program. This additional fee is a direct payment to people who manage land upstream. The recipient of the fee payment will in return conserve the resources in the surrounding catchments. This will ensure the preservation of both water quality and quantity. The other dependent variables are female which indicates the sex of respondents, age of respondents, number of household members under age 18, and education level of respondent. Average monthly income is 142,364 Colones that is slightly higher than national average household income of 140,000 Colones and slightly lower than household incomes in urbanized and metropolitan areas of Costa Rica’s Central Valley (Ortega-Pachego et al, 2009). 23 1.4.2 Estimation Results The methods presented in Section 2 were estimated using Vote “yes” as dependent variable. The set of other covariates are monthly cost, income, female, age, number of member less than 18, and education. Table 6 gives coefficient estimate obtained from Probit, Heteroskedastic Probit, sieve estimator as well as mean willingness to pay. For the An’s method and Turnbull, only the mean willingness to pay will be presented since the parameters estimated from An’s method is clearly not comparable to the previous four methods and Turnbull estimator does not use covariates in calculating willingness to pay. Overall, the key variables in the model are significant and yield expected signs. The additional monthly cost has a negative impact on the probability of voting “Yes” in all four estimation methods. If the sex of respondent is female, then it will have led to lower probability of voting yes to the PES. Furthermore, the age of respondents and number of household member under age 18 both have negative effects on the probability of voting for the program. For the education variable, if the respondent has high school degree, it might lead to higher chance of voting. The monthly income also has a positive effect on probability of voting "yes". Regarding the estimated willingness to pay, as expected, the mean willingness to pay from survival function models, Turnbull, Watanabe, and An yield lower estimates compares to the Probit, Heteroskedastic Probit, and Sieve as presented in Table 1.6. The estimated results from dichotomous choice models exhibit similar results to the previous studies where WTP for water conservation is approximately two times the current cost of monthly water usage. The estimated mean willingness to pay from the dichotomous 24 choice models is roughly 28 percent higher than survival models. On the other hands, the differences between Probit, Heteroskedastic Probit and Sieve estimators are only about 5 percent. This is similar to the case 2 in Monte Carlo simulation where estimation from the dichotomous choice models yields willingness to pay that is higher than survival functions model. That is, at least including flexible conditional variances or parametric conditional variances will increase the precision of estimating willingness to pay. In addition, including unobserved heterogeneity by conditional variances might be complement to modeling it as in An (2000) where there is a need for double bound dichotomous question that might suffer from framing issue. However, including the conditional variance come at a cost of increasing the variance of estimation since extreme value can significantly affect Sieve estimator as observed in simulation study. Our findings have some implications for the use of discrete choice and survival functions in contingent valuation study. First, modeling unobserved heterogeneity is important either by modeling conditional variance in the single bound questions. Also, if the questionnaire can incorporate double bound question there is possibility of using survival function model with censored discrete time as in An (2000). In addition, future developing of the survival model that includes cluster and heterogeneity in form of spatial location might be of interest. Secondly, each proposed estimator contains information that helps in better approximation of willingness to pay and is rather complement than substitute. Given that underlying willingness distribution is unknown along with actual differences between bid design and distribution, the survival function model at least can provide the lower bound of mean willingness to pay while dichotomous choice can capture the effect of extreme value. 25 Table 1.5 Data description and descriptive statistics (N=1141) Description Response to " Would you vote for or against the program if you would have to pay [cost] Colones more on your water bill (yes or for = 1, no or against = 0) Measurement Mean Std. Dev. Min Max 1 Or 0 0.659 0.474 0 1 Monthly cost of program (on top of current water bill) from the vote question. Defined in the preamble to the vote question and varied across respondents Colones 1243.087 758.098 400 2400 A dummy for sex of the respondent (female = 1 male = 0) 1 Or 0 0.725 0.447 0 1 43.176 15.113 18 93 1.515 1.418 0 9 0.120 0.326 0 1 Age of respondents Number of household member under age 18 A dummy for schooling (high school or more = 1 otherwise = 0) Monthly household income 1 Or 0 Colones 142364.4 141374.8 7000 2000000 26 Table 1.6 Estimated coefficients Dependent variable =1 if respondent vote "Yes", 0 if the vote is "No". Monthly Cost Female Age Household members Education Income Intercept Probit HP -0.0004*** (-8.29) -0.2420*** (-2.53) -0.0150*** (-4.85) -0.0608** (-1.95) 0.0955 (0.65) 2.56E-06*** (4.95) 1.5679*** (7.02) -0.0007*** (-3.10) -0.2250 (-1.41) -0.0203*** (-3.19) -0.0363 (-0.33) 0.7335 (0.94) 6.60E-06*** (2.10) 1.9731*** (3.58) Sieve -0.0003*** (-2.36) -0.1298 (-0.56) -0.2490*** (-2.63) -0.0860*** (-2.53) 4.31E-06*** (3.37) 1.3856** (1.82) Watanabe An 1630.99 1407.28 -0.3552*** (-8.42) -198.4140*** (-2.82) -13.3100*** (-5.73) -45.5040** (-1.81) 121.4693 (-1.22) 0.0011*** (4.53) 2643.5690*** (17.58) Mean Willingness to Pay 2340.69 2492.7 2426.02 1592.29 Note: 1)*** is significant at 99%, ** is significant at 95%, and * is significant a 90% confidence interval 2) The number in parenthesis is t-statistic 3) Sieve estimator assumed parameter of education to be normalized to 1. 4) For Watanabe estimator, the value of dependent variable is 2400 times dependent variable. 27 Turnbull Lastly, for the use of Watanabe‘s method in calculating willingness to pay, the method relies on the crucial that the bid construction is uniformly distributed. This is assumption is certainly violated in this empirical application and other empirical papers where the bid designed is setting up as a discrete cut-off point, or possesses the discrete distribution structure. Therefore, our use of Watanabe’s estimator might be only a special case. It leads to underestimation of WTP since we offered the bid at a lower end more than the upper end of bid design. 1.5. Conclusion and further study This study has presented a comparison of approaches of estimation for the willingness to pay in the contingent valuation set up. The standard exponential willingness to pay model has been estimated by Probit, heteroskedastitc probit, and semiparametric estimators. In addition, at the time of study, this paper is the first one that applied sieve estimator to estimate willingness to pay. The estimation results come from the contingent valuation study of payment for environmental services in Costa Rica. The referendum of the study is asking respondent to vote “Yes” or “No” to an additional monthly water fee to pay for conservation of water resources by the group of people who live upstream. Regarding Monte Carlo simulation, the difference between survival function models and dichotomous choice models show the significant effect of conditional variance on WTP. It is possible that the low WTP form survival function models might come as a result of assuming monotonicity and the estimator heavily filtered out the effect of extreme value. We might need to further explore this issue and the use of better 28 semiparametric estimators that can be more flexible. Also, the use of quantile regression might be of interest. The empirical results from the dichotomous choice models yield similar results as the previous study in term of WTP; however, the estimation from the survival function models give significantly lower estimation of WTP when there is relaxation of underlying distribution assumption. On the other hand, if the model allows only a flexible functional form of conditional variance(Sieve), the estimated WTP is slightly higher by 5 percent compared to the standard Probit model. Nevertheless, there is still more work to be done within this area of research. Given, single bound dichotomous choice model, the current sieve estimation model still has no canned package that empirical researchers can easily use. For double bound dichotomous choice model, controlling the unobserved heterogeneity by spatial correlation of known and unknown form might be interesting. 29 REFERENCES 30 REFERENCES Ackerberg, Daniel, Xiaohong Chen, and Jinyong Hahn. “A Practical Asymptotic Variance Estimator for Two-step Semiparametric Estimators.” Review of Economics and Statistics 94, no. 2 (2012): 481–498. An, Mark Yuying. “A Semiparametric Distribution for Willingness to Pay and Statistical Inference with Dichotomous Choice Contingent Valuation Data.” American Journal of Agricultural Economics 82, no. 3 (2000): 487–500. Belluzzo Jr, Walter. “Semiparametric Approaches to Welfare Evaluations in Binary Response Models.” Journal of Business & Economic Statistics 22, no. 3 (2004): 322–330. Blevins, Jason R., and Shakeeb Khan. Distribution-Free Estimation of Heteroskedastic Binary Response Models in Stata. mimeo, 2011. Boyle, Kevin J. “Contingent Valuation in Practice.” In A Primer on Nonmarket Valuation, 111–169. Springer, 2003. Chen, Heng Z., and Alan Randall. “Semi-nonparametric Estimation of Binary Response Models with an Application to Natural Resource Valuation.” Journal of Econometrics 76, no. 1 (1997): 323–340. Chen, Xiaohong. “Large Sample Sieve Estimation of Semi-nonparametric Models.” Handbook of Econometrics 6 (2007): 5549–5632. Cooper, Joseph C. “Flexible Functional Form Estimation of Willingness to Pay Using Dichotomous Choice Data.” Journal of Environmental Economics and Management 43, no. 2 (2002): 267–279. Costa, María A. Máñez, and Manfred Zeller. “Calculating Incentives for Watershed Protection. A Case Study in Guatemala.” In Valuation and Conservation of Biodiversity, 297–314. Springer, 2005. Creel, Michael, and John Loomis. “Semi-nonparametric Distribution-free Dichotomous Choice Contingent Valuation.” Journal of Environmental Economics and Management 32, no. 3 (1997): 341–358. Haab, T. C „and KE McConnell. 2003. Valuing Environmental and Natural Resources: The Econometrics of Nonmarket Valuation. Edward Elgar Publishing, Massachusetts . Haab, Timothy C., and Kenneth E. McConnell. “Referendum Models and Negative Willingness to Pay: Alternative Solutions.” Journal of Environmental Economics and Management 32, no. 2 (1997): 251–270. 31 Hanemann, W. Michael. “Welfare Evaluations in Contingent Valuation Experiments with Discrete Responses.” American Journal of Agricultural Economics 66, no. 3 (1984): 332–341. Horowitz, Joel L. “A Smoothed Maximum Score Estimator for the Binary Response Model.” Econometrica: Journal of the Econometric Society (1992): 505–531. Huang, Ju-Chin, Douglas W. Nychka, and V. Kerry Smith. “Semi-parametric Discrete Choice Measures of Willingness to Pay.” Economics Letters 101, no. 1 (2008): 91–94. Jenkins, Stephen P. “Easy Estimation Methods for Discrete‐time Duration Models.” Oxford Bulletin of Economics and Statistics 57, no. 1 (1995): 129–136. Khan, Shakeeb. “Distribution Free Estimation of Heteroskedastic Binary Response Models Using Probit/Logit Criterion Functions.” Journal of Econometrics (2012). Klein, Roger W., and Richard H. Spady. “An Efficient Semiparametric Estimator for Binary Response Models.” Econometrica: Journal of the Econometric Society (1993): 387–421. Li, Chuan-Zhong. “Semiparametric Estimation of the Binary Choice Model for Contingent Valuation.” Land Economics (1996): 462–473. Li, Qi, and Jeffrey Scott Racine. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007. Ortega-Pacheco, Daniel V., Frank Lupi, and MichAel D. Kaplowitz. “Payment for Environmental Services: Estimating Demand Within a Tropical Watershed.” Journal of Natural Resources Policy Research 1, no. 2 (2009): 189–202. 32 CHAPTER 2: COMPARISON OF APPROACHES TO MEASURING THE CAUSES OF INCOME INEQUALITY 2.1 Introduction This paper proposes a comparison of both parametric and semiparametric estimation of causes of income equality. In the United States of America, income inequality had followed the Kuznets’ hypothesis of an inverse-U shape over the developmental process since the Great depression until the early 1950s. That is, the inequality rising with industrialization and then declining, as more and more workers join the high-productivity sectors of the economy (Kuznets 1955). There was a remarkable decrease in relative gap between high-income Americans and low-income American. From about 1950 until the early 1970s, this narrowing gap stayed constant (Ballard and Menchik 2008). However, since the late 1970s, the income distribution has followed a Ushaped pattern. Piketty and Saez (2003) argue that it is just a remake of the previous Kuznets’ curve relationship between income inequality and income. A new industrial revolution or wave of development had taken place in service and digital industries, thereby leading to increasing inequality. Inequality will decline again at some point in time as more and more workers benefit from innovations and market mechanism in which it will shift the worker from industrial sector to service sector. That is, income can be more equalized when labor can leap the benefit from education, new technology, and innovation. However, since the early 1980s, there is no sign of reducing inequality. In United States, the share of top 10 percentile income bracket has risen from 32.87 percent in 1980 to 45.60 percent in 2008 (Saez 2008). 33 Despite abundant literature on the income distribution at the national and international level, there has been relatively little attention to the causes of income inequality in the regional as well as state level. In addition, most of the inequality literature in the United States and developing countries has focused on average treatment effect of education and fringe benefit provided by government as determinants of income inequality. However, most of the analysis of the causes of income inequality has employed the conditional mean estimation in either cross-section or panel data setup that ignores the possibility of various effects of education or government policies on income distribution. It has been well recognized that the resulting estimates of effects of education on the conditional mean of income are not necessary indicative of size and nature of the return to education on the upper and lower tail of income distribution Abadie, Angrist and Imbens (2002). In addition, the partial effects of government policies such as Medicaid, Medicare, and food stamp on income fall under the same context. Quantile regression offers a complementary mode of analysis and gives a more complete picture of covariate effects by estimating the conditional quantile functions. Furthermore, in the recent development literature, it has been pointed out that there exists the endogeneity issue regarding the causality of income and education attainment. Hence, the estimating results of treatment effect might be inconsistent. Taking advantages of the newly developed quantile regression with control function, this study compares the result from conventional quantile regression to the results of this new estimation method. 34 Semiparametric methods have been used in estimation of quantile regression for quite some time, as summarized in Koenker (2005). The main different between the two methods is either to assume parametric or nonparametric conditional quantile function. In most studies, the semiparametric models have been compared with parametric quantile regression model by simulation. Koenker (2005) point out that semiparametric model will be more robust when the parametric specifications fail and data analysis must require flexible conditional quantile function. Frolich and Melly (2008) have categorized the estimation of quantile treatment effect into four different cases. There are conditional and unconditional treatment effects and whether the selection is “on observables” or “on unobservables”. Selection on observables is referred to the case of exogenous treatment choice and selection on unobservable is referred to the case of endogenous treatment choice. If the quantities of interest are conditional quantile treatment effects with exogenous regressors, the parametric method as in Koenker (2005) can be used. However, if the conditional treatment is binary and endogenous, the method suggested by AAI may be used. This method contains the semiparametric element in the estimation of instrumental variables in a reduced form equation. AAI found out that the semipametric results are robust and can be used as a complementary procedure along with the parametric estimation. Firpo (2007) developed semiparametric estimation for the quantile treatment effect that is unconditional. This method consists of two steps estimation that consists of nonparametric estimation of the propensity score and computation of the difference between the solutions of two separate minimization problems. Frolich and Melly (2008) (FM) developed the binary instrumental variable method for unconditional 35 quantile treatment effects that reaches the semiparametric efficiency lower bound. Lee (2007) considers conditional endogenous treatment effect with the use of control function rather than IV estimation. This method is easier to compute than the IV methods described above and can be extend to cover more flexible estimation since it is a special case of sieve estimation. In this paper, we try to apply and extend the control function method as in Lee (2007) to cover the case of discrete endogenous variables that are ordinal. We approximate the controlled function by using concept of generalized residuals with flexible functional forms. Then, viewing these residuals as an approximation of control function, we use these residuals in the estimation of quantile treatment effects. Our findings reveal a way to improve the robustness of estimation results and provide a case study for more complete picture of the covariate effects, when the endogenous regressors are ordered outcome. In addition, this method can be used to check whether the parametric model encounters any inconsistency problems because of endogeneity. The methods that will be in this paper for Monte Carlo simulation of quantile regression are Koenker (K), Abadie, Angrist, and Imbens (AAI), Firpo (F), Frolich and Melly (FM) and control function with sieve semiparametric estimator (S). The comparison includes the estimated treatment effects as well as the simulated standard errors. However, the first four methods are not applicable to the case where there are ordered binary regressors. Therefore, in this study, only the control function with sieve estimator can correct for such endogenity problem. In section two, I provide the background of Great Lakes state regarding income distribution within the Great Lake Region and USA from 2000-2009 given there are two 36 recessions within this period: Dot com meltdown of 2001 and financial crisis of 2008. This can help in choosing the independent variables to use in comparison of both parametric and semiparametric models. In section three, I present detail of each methodology. Monte Carlo Simulation will be discussed in section four while section five presents data empirical results, and section six provides concluding remarks. 2.2 Great Lake income distribution and determinants of inequality The Great Lake states comprises of Michigan, Illinois, Indiana, Ohio and Wisconsin that is based on Bureau of Economic Analysis (BEA) regions in 2009. These states share certain economic characteristics as well as had been most severely hit by financial crisis of 2008. In 2009, real gross domestic product of the whole region fall by 3.4 percent. At the bottom of the region is Michigan with 5.2 percent reduction followed by 3.6 percent in Indiana, 3.4 percent in Illinois, 2.7 percent in Ohio and 2.1 percent in Wisconsin. Moreover, the real per capita GDP of the Great Lakes are the second lowest in the country at the value of 38,856 dollars. Among these states, Michigan has the lowest real per capita GDP of 34,157 dollars. Despite the facts that financial meltdown and housing price bubble lead to the national wide reduction in real GDP in 2009. Great Lakes states suffered from the decline of manufacturing goods sector since 2005. On average, this industry has been accounted for about 20 percent of this regional GDP. In 2009, the durable-goods manufacturing (i.e. automobile), contributed to more than 2 percentage points to the decline in real GDP in Michigan and Indiana, and more than 1 percentage point in Ohio and Wisconsin. That is, these states are facing contraction of their main industry. 37 On the income distribution side, by using the Current Population Survey data (CPS), this region share similar story particularly regarding the change in income of the top 10 percentile and 50 percentile(median). In Michigan, for the household at 50 percentile, real income grew by only 3.4 percent over the period of 1976-2006. While the top 10 per centile real income grew by 31.6 percent over the same time. In Ohio, the situation is quite similar; the top 10-percentile income grew by 37.2 percent while the median group income grew by 18.3 percent. In Illinois, the top 10-percentile income grew by 36.5 percent and the median income grew only 10.1 percent. Certainly, the worsening income distribution across the region makes leaping the benefit of innovation to become more crucial if they want to reduce such inequality. In summary, over the past 30 years, the income growth rates of these states have been lower than the national average as well as exhibit the pattern of income distribution that is worse than the national level. Given that and combined with population of these five states that is approximately 50 million, the causes of inequality in this region is well worth studied since there are numerous literature that points out to the adverse effects income inequality. Conventionally, the main explanation for household income inequality has been driven by an increase in gap of labor-market earnings or wage. The neoclassical explanation is that there has been a sharp increase in the demand for highly skilled labor due to globalization, innovation, and changing in demand based on Engle curve. Following agricultural product and food, the income elasticities of demand for manufacturing product, both durable and non-durable, have been declined. These led to changes in corporate-governance procedures. The wage gaps between those with more 38 education and those with less education and experience have increased greatly given the shift in consumer demand and need to minimize the cost of operation. Other explanations include the decline in the relative strength of labor unions either in public and private setup, the decrease in the real value of the minimum wage, and the increase in immigration of low-skilled workers. These explanations are well understood and certainly affect people more at the bottom of income distribution. For discussion of these trends, see Levy and Murnane (1992), Bound and Johnson (1992), Saez and Piketty (2003), Autor et.al. (2008), and Bakija and Heim (2009). Empirical results of these studies come from finding the average relationship between indexes of income inequality to the interested regressors. However, the staggering fact is that in 2007 the incomes share of the richest first percentile reached a staggering 18.3%. The last time America was such an unequal place was in 1929, when the equivalent figure was 18.4% (Economist 2011). Also, the income (excluding capital gains) of the richest one percentile is approximately 3 times of the richest 10 percent while including the capital gain the results is 5 times (Saez 2008). Applying the neoclassical growth theory that the main hypotheses for the different in income will tell a story that the group of top 1 percent is three times more skilled, educated, and productive than the top 10 percent might seems questionable. One way to explain this phenomenon might be looking at the Endogenous Growth Theory (Acemoglu 2008). In the age of innovation where growth have been highly associated with investment in human capital and endogenous creation of new products and technology, the real returns to labor with lower skilled than the frontier will be reduced. While only labor at the highest possible frontier or with diversified skill and 39 capital holders will leap more benefit out of the growth. In order to capture this phenomena, quantile regression might give a better picture than standard average treatment effects model. Why should we worry about income inequality? There are two important economic concerns. First line of thought opting from the possibility of social fairness and conflicts. There are the effects of income inequality on the mortality in US. The papers by Kaplan et.al (1996) and Kawachi et.al. (1997) found the positive correlation between income inequality and mortality. Moreover, there are several studies pointed out that region with high income inequality are more prone to adverse effects of natural disaster than the other. Kahn (2005) found that area with higher income inequality measured by Gini coefficient suffer more deaths and damage in the wake of natural disasters. Anbarci et.al. (2005) discussed how the number of fatalities from earthquake positively response to income inequality. Shaughnessy et.al. (2010) provided the evidences of effects of Hurricane Katrina on income inequality and vice versa. In the neoclassical economic idea, the quote of “That (inequality) it is not a big concern if the rich are getting richer so long as the poor are doing well too.”(Economist 2011) is still relevant. However, in recent, the incorporation of political economy and endogenous growth model, Acemoglu (2008), Rajan (2010), and Ritchie (2010) pointed out the adverse effect of income inequality on the prospect of economic growth via political policy and innovation process of the economy. Rajan (2010) reckoned that technological progress increased the relative demand for skilled workers. This led to a widening gap in wages between them and the lower40 skilled workforce. He argued that this growing gap lays the ground for the housing credit boom that precipitated the financial crisis. The US government put on the two state enterprises, Fannie Mae and Freddie Mac, to lend more to poorer people as instruments of public policy. Subprime mortgages rose from less than 4% in 2000 to a peak of around 15% in 2008. This credit boom led to an enormous housing bubble and the worst financial crisis since great depression. According to this, he argued that well-intentioned political responses to the rise in inequality might lead to devastating side effects. On the innovation and technological development part, Ritchie (2010) argues that country with high level of natural resources, distributional alliances of political party and ruling elites, education systems that has political priorities rather than economic and technology priorities, and high income inequality will lead to low levels of technical intellectual capital. That is, it might be suitable to explain lower level of higher education attainment by in the U.S. For example, percentage of bachelor's degrees awarded in mathematics and science of USA in 2006 is lowest among the OECD average, even lower than Mexico (http://nces.ed.gov). Also, if we look at the latest U.S. Census Bureau (2010), out of 226,793 observations of people with the age over 18, only 17.7 percent got there bachelor degree, and only 9.3 percent attained the degree higher than bachelor. Following the argument in Acemoglu (2008) and Murray (2008), when income is not normally distributed and more skewed to the right (evidence of high income inequality), it is harder for household with average income to attain college not to mention higher education. In addition, if the students inherited skill is normally distributed, given such income structure, cost, and benefit of college and higher degree, the rate of attainment for higher education will be lowered. In turn, this will lead to lower prospect of growth since, 41 at the current stage of economic development, innovation and technological adoption relies heavily on human capital. Given that, inequality might be detrimental to the overall economy; this study will use new estimation method to test whether the root cause of wage inequality is still education attainment. In addition, from simulation study, we want to be certain that the estimated return to education is no biased. Finally, for empirical study, we will analyze the return to education for each level of them simultaneously in order to see which level of education can contribute the most to wage. 2.3 Models and Estimation Methods The estimated model in this study is specified as system of equations as followed: yi  Q( Di , xi , ui ) (2.1) Di  G( zi , xi , vi ) (2.2) Where Q(.) is quantile function; yi is continuous outcome; Di are individual ordered treatment effects in form of dummy variables that represent each ordered outcome; Di ; xi , zi are exogenous covariates; ui and vi are possibly related unobservable; and, G(.) is known or unknown function. For example, if there is only one endogenous binary treatment Di  D1i . If there are two ordered endogenous treatment effects Di  ( D1i , D2i ) , D1i  (0,1) as well as D2i  (0,1) and Di  (0,1,2) in equation (2.2). 42 When treatment is exogenous and conditional upon given covariates, then the use of standard quantile regression is appropriate. That is, equation (2.2) is not relevant, the standard Koenker method as in section 2.3.1 can be used. 2.3.1 Quantile Regression Koenker’s Quantile regression (K) exhibits a more complete picture of relationship between yi and xi at the different points in the conditional distribution of yi . ˆ ˆ The q th quartile estimator of  q and  q minimizes over  q and  q on the objective function ˆ ˆ (  q , lq )  arg min ,  q ( yi  Dilq  xi ' q) (2.3) Where q lies between 0 and 1 and q  u * {q  1(u  0)} as in Koenker (2005). Then, ˆ ˆ  lq and  q represent the choices of quantile that will estimate for the different value of  and  . For example, if q  0.9 , much more weight will be put at the income at 90 percentile and when q  0.5 , the estimated result is the same as least absolute deviation estimators. Let X  ( Di , xi ' ) and  q  (lq , q' )' , the asymptotic distribution of estimator defined in (2.3) is given by   ˆ n ( q   q )  N (0, J q 1 J q 1) q 43 (2.4) where J q  E[ f y x ( X '  q ) * XX ' ] and q  q(1  q)E[ XX ' ] . The term  q is estimated by q(1  q)n1[ XX ' ]. J q has been estimated by kernel method as in Koenker (2005) ˆ  Yi  X '  q  1 ˆ JT   k  h XX '   nhn n   (2.5) In addition, he pointed out the advantages of QR as followed. First, it is not sensitive to outlier and will be more robust than ordinary least squares (OLS) when the dependent variable does not contain influential observations.. That is, in the case of studying income distribution which is not normally distributed and skewed to right, it is certainly better to use the method that is in sensitive to change in extreme high or low wage income. Secondly and most important for our study, QR provides the estimators for the impact of covariate on the full distribution of income at any particular percentile of distribution, not just the conditional mean. 2.3.2 Endogenous quantile treatment effect Currently, several methods can deal with only one binary endogenous treatment quantile effects and use to compare to the case of interest when we aggregate the ordered endogenous effects to only binary effects. 44 2.3.2.1 Abadie, Angrist, and Imbens (2002) (AAI) If the treatment is endogenous or self-selected as in case of education attainment or job training program, the traditional quantile regression will be biased and the use of instrumental variable (IV) might be used as suggested by AAI with the following framework and assumptions. Adopting the set up as in AAI and Frolich and Melly (2010). yi is a scalar outcome variable as in (2.1). However, Di is a binary treatment indicator, and zi is binary instrument. In the case of return to education, yi is wage, Di indicates college attainment, and zi might be father, mother, or spouse education. Then, the value of y1 corresponds to the outcome of the education Di  1 and y0 corresponds to the outcome of the education when Di  0 . In addition, we can write D1 as an individual education attainment when zi  1 while D0 indicates education attainment would be if zi  0 . The underlying assumptions are as followed. For almost all values of xi : (i) Independence: ( y1, y0 , D1, D0 ) is jointly independent of zi given xi . (ii) Nontrivial assignment: P( zi  1 xi )  (0,1) . (iii) First-stage: E[ D1 xi ]  E[ D0 xi ] . (iv) Monotonicity: P[ D1  1  D0  0 xi ]  1 . (v) Model for potential outcomes Qq ( yi xi , D, D1  D0 )   q D  xi ' q (2.6) 45 where Qq refers to the q-th conditional quantile of yi . Given assumptions (i)-(v), AAI show that the conditional quantile treatment effect for the compilers (i.e. observations with Di  1  Di  0 can be estimated by weighted quantile regression: ˆ IV ˆ IV (  q ,  q )  arg min , WiAAI * q ( yi  Di ' q  xi ' q) (2.7) Di (1  zi ) (1  Di ) zi Wi AAI  1   1  P( zi  1 xi ) P( zi  1 xi ) This is a two-step estimator in which the P( zi  1 xi ) is need to be estimated first, in this paper the local logit estimator has been used as in Frolic and Melly (2007). Moreover, in order to avoid the problem of non-convex optimization problem, AAI suggest the use of positive weights Wi AAI   E[W AAI yi , Di , xi ] (2.8) Equation (2.8) will be estimated by linear regression and if some of these estimated weights are negative in the finite samples, they will be set to zero. Then, the asymptotic distribution of the AAI estimator is given by   ˆ IV n ( q   q )  N (0, I q 1q I q 1) , (2.9) where I q  E[ f y x , D  D ( xi q ) * xi ' xi D1  D0 ] * P( D1  D0 ) i i 1 0 with  WiAAI mq ( xi , yi )  H ( xi )( zi  P( zi  1 xi )) and 46 and q  E( ' ) mq ( xi , yi )  (q  1( yi  xi '  q  0)) X and  Di (1  zi ) (1  Di ) zi    x ]. H ( xi )  E[mq ( xi , yi )  i  (1  P( z  1 x ))2 P( z  1 x )2   i i i i   I q will be estimated by kernel estimation that uses Epanechnikov kernel as suggested by Abadie et.al. (2002) 1 ˆ Iq  nhn  y  x '  IV i ˆq ˆ AAI  * k  i Wi   hn     xi xi '   (2.10) ˆ ˆ Where Wi AAI  are estimates of the projected weights. H ( xi ) is estimated by regressing  Di (1  zi ) (1  Di ) zi   on x . Lastly, IV  0)) x   ˆ ( q  1( yi  xi '  q  i i ˆ ˆ  (1  P( z  1 x ))2 ( P( z  1 x ))2   i i i i   ˆ ˆ ˆ ˆ IV   WiAAI (q  1( yi  xi '  q  0)) X  H ( xl )( Z  P( Z  1 xi )) (2.11) ˆ ˆˆ q  E( ' ) (2.12) As stated before, AAI method can only be applied to endogenous treatment effects where the Di is only binary. That is, the model can estimate the case where observation of interest attains certain level of education, high school, or not. So, it is not possible to use the Di , that are ordered outcome as endogenous regressors. 47 2.3.2.2 Unconditional quantile treatment effect Firpo (2007) and Frolich and Melly (2008) The unconditional treatment effect for quantile can be defined as y y  q  Qq1  Qq 0 (2.13) The distinct feature between the conditional and unconditional treatment effects is that the unconditional effect, by definition, will not change with respect to the different set of covariates xi . Also, unconditional effects can be estimated consistently at the n rate without any parametric restrictions. That is, these estimators will be entirely nonparametric, and the assumption (i) will not be needed. Also, in estimating this nonparametric model, it is needed to assume that the support of the covariates xi is the same independently of the treatment. For almost all values of xi , 0  P( Di  1 xi )  1 (2.14) However, the unconditional method still needs the inclusion of covariates xi for various reasons. First, xi are needed to make the identification plausible. Secondly, including xi will improve efficiency. Following Frolich and Melly (2008), it is better to explain the endogenous treatment with a binary instrumental variable zi first. Given assumption(iv), the estimator for  q is as followed: ˆ ˆ ( IV , IV )  arg min , WiFM * q ( yi    Di ) q 48 (2.15) zi  P( zi  1 xi ) WiFM  * ( 2 Di  1) P( zi  1 xi )(1  P( zi  1 xi )) WiFM need to be estimated first as same as in the case of Wi AAI (2.16) . Also, the optimization in (2.15) will face the same non-convex problem as in equation (2.7). Therefore, the alternative weight has to be used. That is, WiFM   E[W FM yi , Di , xi ] (2.17) Firpo (2007) and Frolich and Melly (2008) use assumption (ii) and (2.14) together to identify unconditional treatment effect. The estimator of Firpo (2008) is a special case of (2.15), when Di is used to be its own instrument or zi  Di . The weighting estimator and weight for the estimate of  q are as followed: ˆ ˆ ( ,  q )  arg min , WiF * q ( yi    Di ) (2.18) Di (1  Di ) . WiF   P( Di  1 xi ) 1  P( Di  1 xi ) (2.19) Then, the process to estimate the weight function will be employed as same as in the case of WiFM  and Wi AAI  . Firpo (2007) and Frolich and Melly (2008) provides the asymptotic distribution of the estimated treatment effects as followed. From equation (2.18), Firpo (2007) states that ˆ  q distributes as 49 ˆ n (  q   q )  N (0, v) (2.20) y1 y1  F  Y Di 1, xi (Qq )(1  FY Di 1, x (Qq ))  v E  2 (Q y1 )  P( Di  1 xi ) f y1 q     1 y0 y0  F  Y Di 0, x (Qq )(1  FY Di 0, x (Qq ))   E  2 (Q y 0 )  1  P( Di  1 xi ) f y0 q     1  E[{1( X )  0 ( xi )}2 ] , q  FY D  d , x (QYd ) i d . QY 1 and QY 0 have been estimated by    ˆ ˆq where d ( X )  q q Yd ) fYd (Qq Yd ˆ and  . The densities fYd (Qq ) are estimated by kernel estimators with Epanecnikov kernel function and Silverman bandwidth choice. 1 ˆ ˆ Yd fYd (Qq )  nhn  y  QYd  q  ˆ F  k i  Di d Wi  h    n   (2.21) yd FY D  d , X (Qq ) is estimated by local logit estimator. In addition, the density will i be estimated by kernel regression as in the case of the exogenous treatment. These two methods also suffer the same setback as in the case of AAI, where binary instrument will be required for only on binary endogenous variable. 50 In addition, there is other IV method suggested by Chernokuzhov and Hansen (2005), (2006), and (2008) that does not require binary instrument; however, in this study, we do not employ their method for comparison since it is more computationally intensive than the other proposed methods. In addition, it can deal with the case where there are more than one endogenous variable but we need to fully understanding and define the relationship between those endogenous variables. If not, it will entail bias problem similar to the case of instrumental variable regression. 2.3.3 Sieve estimator and control function Sieve estimation refers to one class of semiparametric estimation that solves the problem of infinite dimensional parameter. The sieve method employs the optimization routine that tries to optimize the criterion function over finite approximated parameter spaces (sieves). The sieve method, in the simplest form, might be similar to how we choose the bandwidth and numbers in plotting the histogram. As pointed out by Chen (2007), the method of sieves is very flexible in estimating complicated semiparametric models with (or without) endogeneity and latent heterogeneity. It can easily incorporate prior information and constraints, and it can simultaneously estimate the parametric and nonparametric parts, typically with optimal convergence rates for both parts. The main reason that this paper employs the sieve estimator is that it can simplify semiparametric inference for the quantile treatment effects. So far, the four methods of estimation employs at least certain degree of semiparametric estimation for their respective variances with relatively complicated formulation and computationally 51 intensive. In addition, the estimated variance varies with not only choice of kernel function but also bandwidth selection. Following the results in Ackerberg et.al. (2012), it has established the numerical equivalence between two estimators of asymptotic variance for two-step semiparametric estimators when the first-step nonparametric estimation is implemented. That is, in the first stage, the sieve M-estimator (Sieve maximum likelihood, Sieve minimum distance, series estimator) will be applied to the model of interest, and then in the second stage, estimation can be set up as if the problem is completely parametric for the purpose of inference on treatment effects. Secondly, in this method, the endogeneity will be treated as omitted variable case as in equation (2.1) and (2.2) since wage will certainly depend on unobserved personal characteristics that linked to education attainment. In addition, we can allow more than one endogenous variables that can be either categorical or ordered that is drawback for previous methods of estimation. In addition, even the instrumental variables that allow more than one endogenous variables as in Chernokuzhov and Hansen (2005, 2008) cannot cover the case of endogenous categorical and ordered regressors. The paper corrects for endogeneity by adopting the control function approach, as in case of Lee (2007) but there is a different in first stage and second stage estimation. We newly propose that the first step is to construction of estimated generalized ˆ residuals gri a ordered-probit of Di on ( xi , zi ) . This step is approximation of control function that help in solving the endogenous problem. The idea of generalized residual for the group or ordered outcome can be found in Chesher and Irish (1987) and Gourieroux et.al. (1987). The generalized residuals is as equation (2.22) 52 L 1  (cl 1  zi  )   (cl  zi  ) gri   1[ Di  l  1] (cl  zi  )  (cl 1  zi  ) l 1 (2.22) cl is the estimated cut points or threshold parameters, for example, if there are three values of Di , Di  0,1,2 there will be Dli  ( D1i , D2i ) , l  1,2 , then there will be two cut points, c1 and c2 given that cl 10   and cl 13   . After getting the ˆ estimated gri from equation (2.22), it will be plug in to the (2.3) as an additional independent variables. The second step of estimation is partially linear quantile regression of Yi on Z i , Di and gri . However, it is difficult to assume the stochastic relationship between gri and  i (the errors terms from the main equation), so we use the ˆ sieve estimator for putting the linear combination of power series of gri as regressors in the equation of ˆ ˆ (  q , lq )  arg min , ˆ ˆ  t( Dli , zi , gri )q ( yi  Dlilq  zi ' q   n f ( gri )) (2.23) ˆ where t is the trimming function that is set to equal to 1 and gri and its power function even interacting the residuals with other regressors. In addition, we can use other form of control functions such as fourier, spline or Taylor series approximation in the second stage of quantile regression. Moreover, practitioner can assume other form of distribution in the first stage given that there is the correct form for generalized residuals. Then, the variance of the treatment effects can be estimated by either bootstrap or finding the solution of inverse quantile functions analytically. In this paper, we choose the bootstrap method for inference since the inverse function will mainly depend on the 53 control function that we choose. Moreover, this leads complicated analytical formula and computationally intensive procedure. The reason that bootstrap estimator for variance is valid that we proceed in two step estimation as in Ackerberg et.al. (2012). Also, estimating the asymptotic variance by using parametric approximation and bootstrap requires less restrict assumptions in order to get establish consistency and asymptotic normality as in the case of Lee (2007). To conclude this section, there are certain insights that might be gained from comparing these five methods of estimation. The conditional treatment effect models are computationally simple and should be unbiased if there is no underlying endogeneity. On the other hands, the four semiparametric models in this paper have each own advantages and heuristic comparison can be made to see differences in treatment effects across income distribution. In addition, results from unconditional treatment effects, it might be helpful for policy makers and applied economists since they capture the effects in the entire population rather than a large number of effects for different covariate combinations. 2.4. Monte Carlo Simulation This section presents the Monte Carlo simulation result for comparing finite sample properties of sieve control function estimator with the methods described in section 2.3 given that we have not developed its asymptotic property. There are three focuses in the simulation: the changes in root mean squared errors of the parameters with varying degree of endogeneity, robustness to heteroskedasticity for quantile endogenous treatment effects, and size of the test for statistical inference given varying of 54 endogeneity. In addition, we report the mean of estimated coefficients as well as bias and standard deviation. 2.4.1 Data generating process The data generating process (DGP) is given by yi   ( )1 D1i   ( )2 D2i  1x1i  2 x2i  ui , i  1,2,..., n Di *  1x1i   2 x2i   3z1i   4 z2i  vi , i  1,2,..., n Di  (0,1,2) For the sake of simplicity, we assume that the exogenous variables have homogenous effects throughout the entire distribution. However, for endogenous covariates we can follow Melly (2007) and Lee (2007) that assume homogenous marginal effects in their Monte Carlo Simulations or follow Chernokuzhov et.al. (2005) that allows the marginal effects to be monotonically increases with respect to quantile. The DGP for each variable is given by X1i ~  2 ( 2) , X 2i ~ N (0,1) , z1i ~  2 (1) , z2i ~ N (0,1)  1 (ui , vi ) ~ N 0,        1   55 For one endogenous variable Di *  0.5 * X1i  0.5 * X 2i  0.5 * z1i  1* z2i  vi , i  1,2,..., n , yi  1  2 *  ( )D1i  1* X1i  2 * X 2i  ui , i  1,2,..., n Di  (0,1) For two endogenous variables Di *  0.5 * X1i  0.5 * X 2i  0.5 * z1i  1* z2i  vi , i  1,2,..., n , yi  1  1*  ( )D1i  1.5 *  ( )D2i  1* X1i  2 * X 2i  ui , i  1,2,..., n The  ( ) is the  th percentile value of normal cumulative density function for example at .5 quantile the value of  (0.5) is 0.5. The endogeneity problem comes from the correlation between errors term from (ui , vi ) . We assume that there are only three categories of ordered endogenous variables. Simulation results are based on 1000 replications with number of observations of 10,000 since we are interest in all part of income distribution especially to capture the treatment effects at the tails of distribution. 2.4.2 Bias, standard deviation of Estimator, and Power In this section, we present results from Monte Carlo simulation with varying degree of endogeneity in two main cases. The first case, we compare the model where 56 there is only one binary endogenous variable. The second, we compare the model where there are two ordered endogenous variables. In the first case, we compare the estimate by the method proposed in third section. The second case control function method is compared with standard quantile regression, two-stage least squares with strong instruments, and regression with predicted probability from Probit as instrumental variables. 2.4.2.1 One endogenous variable model In this section, we compare the finite sample properties of sieve estimator with the well-established estimator of the quantile treatment effects that can only take care of one endogenous variable. When there is no endogeneity, as presented in Table 2.1, standard quantile regression yields the lowest bias and variance for the coefficient estimated. However, Firpo yields least precise estimate as expected by unconditional quantile treatment effects. For, the unconditional quantile regression, the bias is higher along with variance. Moreover, mean rejection for FM is closest to the 0.05 level that we set up. By introducing mild degree of endogeneity in table 2.2, the methods that does not take in to account, endogeneity tends to slightly overestimate the quantile treatment effect at the median. The Sieve estimator still yield less bias with higher variance compared to the standard quantile regression. The FM and AAI methods do slightly overestimate the parameter but come out with higher variance than the rest. 57 Table 2.1 Monte Carlo Simulation for Mean without endogeneity True  ( ) is 1 Correlation between errors is 0 Firpo 0.913 Frolich and Melly 1.004 Abadie Imbens and Angrist 0.995 Mean of  ( ) Parameters Bias Qreg 0.999 Control 0.997 Control Median 0.999 -0.001 -0.003 -0.001 -0.087 0.004 -0.005 SD 0.030 0.080 0.031 0.058 0.085 0.077 Root Mean Squared Errors 0.030 0.080 0.031 0.105 0.085 0.077 Mean Rejection 0.038 0.051 0.034 0.310 0.05 0.054  = 0.5 Table 2.2 Monte Carlo Simulation for Mean with 0.1 degree of correlation True  ( ) is 1 Correlation between errors is 0.1  = 0.5 Mean of  ( ) Parameters Bias Qreg Control 1.113 0.998 Control Median 1.114 Firpo 1.030 Frolich and Melly 1.021 Abadie Imbens and Angrist 1.005 0.113 -0.002 0.114 0.030 0.021 0.005 SD 0.030 0.080 0.030 0.059 0.085 0.077 Root Mean Squared Errors 0.118 0.080 0.118 0.067 0.088 0.773 Mean Rejection 0.038 0.053 0.974 0.073 0.061 0.049 58 Table 2.3 Monte Carlo Simulation for Mean with 0.5 degree of correlation True  ( ) is 1 Correlation between errors is 0.5 Frolich and Melly 1.080 Abadie Imbens and Angrist 1.067 0.499 0.080 0.067 0.029 0.058 0.086 0.069 0.079 0.569 0.502 0.118 0.097 0.046 1.000 1.000 0.149 0.163 Qreg 1.568 Control 0.992 Control Median 1.568 0.568 -0.008 0.568 SD 0.029 0.078 Root Mean Squared Errors 0.569 Mean Rejection 1.000  = 0.5 Mean of  ( ) Parameters Bias Firpo 1.499 The results change dramatically when we introduce higher degree of endogeneity as in Table 2.3 and 2.4. Koneker method and Firpo methods provide higher bias of over than 50 percent to almost 100 percent when correlation among errors is 0.9. The Sieve estimator still yield less bias with low variance when compare to those two, at correlation of 0.9, sieve estimator produces moderately underestimated coefficient. On the other hands, FM and AAI slightly overestimate the coefficient with higher variance. In addition, the mean rejections of almost all of the methods except sieve, FM, and AAI jump to 1 implying that those methods might be unreliable. Comparing these methods with current number of observations and number of replication, in order to gain good approximate of root mean square errors, has been conducted only at median since it is time consuming for FM and AAI method to converge at the both tail of distribution. In addition, for these two methods, the estimation of the whole quantile process, as pointed out by Melly (2006), is nearly impossible and 59 the results confirm bias does go away even with lower number of replication. That is, from our experience, control function along with standard quantile regression can be computed faster than the previous two methods especially at the tail of distribution. Lastly, putting the control function with the generalized residuals from conditional median yields infinitesimal different result from the standard quantile regression. Only, when the correlation is 0.9, there is a clear difference. However, from Table 2.2 to Table 2.4, they shows that using such generalized residuals from conditional median cannot solve the endogeneity problem. Table 2.4 Monte Carlo Simulation for Mean with 0.9 degree of correlation True  ( ) is 1 Correlation between errors is 0.9 Firpo 1.972 Frolich and Melly 1.135 Abadie Imbens and Angrist 1.228 Qreg 2.008 Control 0.948 Control Median 2.004 1.008 -0.052 1.004 0.972 0.135 0.228 SD 0.025 0.071 0.042 0.054 0.087 0.052 Root Mean Squared Errors 1.008 0.088 1.004 0.974 0.158 0.234 Mean Rejection 1.000 0.106 1.000 1.000 0.367 0.996  = 0.5 Mean of  ( ) Parameters Bias 2.4.2.2 Two endogenous variables model There are no direct comparable methods that can estimate the endogenous quantile treatment effects with ordered endogenous variables. Assuming the errors to is 60 normally distributed, we can use Monte Carlo simulation to compare the property of control function estimator to the standard two-stage least squares method that can take care of two endogenous regressors properly. We compare sieve estimator with quantile regression, two-staged least squares with strong instruments and two-staged least square with predicted probabilities as instrument. In addition, as same as in section 2.4.2.1, we present the result with varying degree of endogeneity. When there is no endogeneity, the quantile regression, sieve, and 2SLS with probabilities yield the best estimate of coefficients as expected, however, putting the power series or probabilities to control for endogeneity yields higher variance of estimation. However, the drawback of standard instrumental variable (Z) persists both in the case where there is more than one endogenous variable. For simulation and estimation, the relationship between these two variables has to be established. Assuming that both of these variables are not related is plausible but not valid in almost all of the case. Hence, even with no endogeneity, the results from third column in Table 5 yield bias and high variance since we do not have information on the relationship such that we can estimate the model with full information maximum likelihood or GMM. When we introduce mild degree of endogeneity, the Koenker estimation yield overestimated results for both coefficients with low level of mean rejection. The sieve and 2SLS with probabilities produce less biased results with similar level of mean rejection implying more precise and robust results. Moreover, introducing the corrected terms, series of generalized residuals or predicted probabilities always increase the Monte Carlo standard deviations. At 0.5 level of correlation, neglecting endogeneity, leads to 61 almost 100 percent increase in median treatment effects as well as still accepting that the estimated effects is statistically different from zero. On the contrary, the sieve and 2SLS with probabilities are really closed and reliable. For standard IV results, the bias is not getting better as well as standard deviation. We might conclude that dealing with more than one endogenous variable is quite problematic in terms of not only finding the strong instrumental variables but also modeling relationship between them. In table 2.8, we present the result where the correlation is 0.9, sieve estimation yield more biased results compared to two-stage least squares with predicted probabilities though the Monte Carlo variance is lower. Moreover, for the other two methods, the results are certainly biased. However, one caution is in the case of two-stage least squares with instrumental variables, this method yield the highest bias among the entire estimators with high variances. That is, we certainly will fail to reject that the estimated coefficient will be different from zero and led to the worse estimators of all the methods. To conclude, sieve estimator, though might not be the best estimator to control for endogeneity, at least can reduce the bias as well as give a valid estimate for quantile treatment effect when there is one or more endogenous variables. That is, without calculating analytical asymptotic theory for this sieve estimator, the proposed procedures constitute a probable approximation for estimating quantiles treatment effects and for making reliable inference for this two-step estimation. 62 Table 2.5 Monte Carlo Simulation for Mean with 0 degree of correlation True 1( ) is 0.5 Correlation between errors is 0  = 0.5 Qreg Control 2SLS with Z 2SLS with Prob Mean of 1( ) 0.499 0.499 -0.772 0.501 Parameters Bias -0.001 -0.001 -1.272 0.001 SD 0.034 0.049 36.110 0.095 Root Mean Squared Errors 0.034 0.049 36.132 0.095 Mean rejection 0.044 0.049 0.017 0.051 True  = 0.5  2 ( ) is 0.75 Correlation between errors is 0 Mean of  2 ( ) Parameters Bias Qreg 0.749 Control 0.750 2SLS with Z 1.203 2SLS with Prob 0.748 -0.001 0.000 0.453 -0.002 SD 0.037 0.071 13.424 0.052 Root Mean Squared Errors 0.037 0.071 13.432 0.052 Mean rejection 0.040 0.055 0.017 0.047 63 Table 2.6 Monte Carlo Simulation for Mean with 0.1 degree of correlation True 1( ) is 0.5 Correlation between errors is 0.1  = 0.5 Qreg Control 2SLS with Z 2SLS with Prob Mean of 1( ) 0.582 0.499 -1.267 0.501 Parameters Bias 0.082 -0.001 -1.767 0.001 SD 0.034 0.049 36.110 0.095 Root Mean Squared Errors 0.089 0.049 41.239 0.095 Mean rejection 0.671 0.048 0.010 0.052 SD 0.362 0.072 15.018 0.052 Root Mean Squared Errors 0.392 0.072 15.031 0.052 Mean rejection 0.984 0.042 0.009 0.048 True  2 ( ) is 0.75 Correlation between errors is 0.1  = 0.5 Qreg Control 2SLS with Z 2SLS with Prob Mean of  2 ( ) 0.900 0.750 1.374 0.748 Parameters Bias 0.150 0.000 0.624 -0.002 64 Table 2.7 Monte Carlo Simulation for Mean with 0.5 degree of correlation True 1( ) is 0.5 Correlation between errors is 0.5  = 0.5 Qreg Control 2SLS with Z 2SLS with Prob Mean of 1( ) 0.918 0.494 2.996 0.502 Parameters Bias 0.418 -0.006 2.496 0.002 SD 0.324 0.046 53.678 0.095 Root Mean Squared Errors 0.529 0.047 41.239 0.095 Mean rejection 1.000 0.052 0.013 0.052 SD 0.338 0.072 22.547 0.052 Root Mean Squared Errors 0.825 0.072 22.571 0.052 Mean rejection 1.000 0.053 0.012 0.046 True  2 ( ) is 0.75 Correlation between errors is 0.5  = 0.5 Qreg Control 2SLS with Z 2SLS with Prob Mean of  2 ( ) 1.503 0.741 -0.281 0.749 Parameters Bias 0.753 -0.009 -1.031 -0.001 65 Table 2.8 Monte Carlo Simulation for Mean with 0.9 degree of correlation True 1( ) is 0.5 Correlation between errors is 0.9  = 0.5 Qreg Control 2SLS with Z 2SLS with Prob Mean of 1( ) 1.244 0.469 -0.569 0.502 Parameters Bias 0.744 -0.032 -1.069 0.002 SD 0.027 0.038 76.284 0.095 Root Mean Squared Errors 0.744 0.049 41.239 0.095 Mean rejection 1.000 0.133 0.009 0.046 SD 0.029 0.062 29.145 0.052 Root Mean Squared Errors 1.345 0.086 29.147 0.052 Mean rejection 1.000 0.174 0.009 0.044 True  2 ( ) is 0.75 Correlation between errors is 0.9  = 0.5 Qreg Control 2SLS with Z 2SLS with Prob Mean of  2 ( ) 2.095 0.690 1.046 0.749 Parameters Bias 1.345 -0.060 0.296 -0.001 2.4.2.3 Sensitivity analysis for the control function estimator to heteroskedasticity Unlike normal regression, the estimated parameters from quantile regression can be effected by heteroskedasticity. Suppose the true data generating process depicts linear quantile regression where there are only exogenous variables ( X1, X 2 ) and there is only 66 one exogenous variable ( X1) that enter in to both conditional quantile equation as well as multiplicative heteroskedasticity, then we expect that estimated coefficient of X1 will vary with the value of quantile. Moreover, even if we add one additional binary endogenous variable ( D1i ) to estimation, we suspect that heteroskedasticity will further lead to bias estimation of both coefficients of ( X1, D1i ) . However, control function might provide flexible way to estimate the model by interacting generalized residuals with X1 as more control variables. The data generating process come from part 2.4.1, we focus on the case where there are one binary endogenous variable as the following D1i *  0.5 * X1i  0.5 * X 2i  0.5 * z1i  1* z2i  vi , i  1,2,..., n y1i  1  1( )D1i   2 ( ) * X1i  2 * X 2i  wi , i  1,2,..., n where wi  (0.1  0.5 X1i ) * ui The only difference from section 2.4.1 is the new DGP contains heteroskedasticity problem in exogenous independent variable and might exacerbate the biases of both endogenous treatment effect and coefficient of X1i . The simulation contains the same 1,000 replications with 10,000 observations. Then, we compute Monte Carlo mean, standard deviation, and root mean squares error as before. The result in Table 2.9 shows that even with mild endogeneity, correlation at 0.1, control function without interaction term can let to a biased result of about 10 percent for the estimated parameters of binary endogenous variable. However, for the estimation of 67 coefficient of X1i , the bias is not that severe. Only at the 90 percentile where the bias is large for the case of control function. Table 2.9 Monte Carlo Simulation for the case of Heteroskedasticity and Endogeneity with 0.1 degree of correlation True 1( ) is 0.2 Correlation between errors is 0.1 Control with Frolich and Abadie Imbens and Control Interaction Melly Angrist  = 0.1 Mean of 1( ) Parameters Bias 0.252 0.212 0.200 0.444 0.229 0.052 0.012 0.000 0.244 0.029 SD 0.018 0.061 0.035 0.099 0.031 Root Mean Squared Errors 0.055 0.062 0.035 0.264 0.042 Mean rejection .811 0.051 0.039 0.691 0.152  = 0.1 Qreg Mean of  2 ( ) Parameters Bias Qreg True  2 ( ) is 0.36 Correlation between errors is 0.1 Control with Frolich and Abadie Imbens and Control Interaction Melly Angrist 0.366 0.362 0.360 0.386 0.006 0.002 0.000 0.026 SD 0.018 0.019 0.033 0.041 Root Mean Squared Errors 0.019 0.019 0.033 0.049 Mean rejection 0.039 0.020 0.048 0.093 68 Table 2.10 Monte Carlo Simulation for the case of Heteroskedasticity and Endogeneity with 0.5 degree of correlation True 1( ) is 1 Correlation between errors is 0.5 Qreg Control Control with Interaction Frolich and Melly Abadie Imbens and Angrist 1.259 1.029 0.995 1.069 1.145 0.259 0.029 -0.005 0.069 0.145 SD 0.017 0.044 0.027 0.096 0.029 Root Mean Squared Errors 0.260 0.052 0.028 0.118 0.148 Mean rejection 1.000 0.094 0.032  = 0.5 Qreg 1.037 0.999 1.001 1.115 0.037 -0.001 0.001 0.115 SD 0.013 0.013 0.024 0.036 Root Mean Squared Errors 0.039 0.013 0.024 0.121 Mean rejection 0.719 0.019 0.045 0.891  = 0.5 Mean of 1( ) Parameters Bias Mean of  2 ( ) Parameters Bias 0.121 0.999 ` True  2 ( ) is 1 Correlation between errors is 0.5 Control with Frolich and Abadie Imbens and Control Interaction Melly Angrist 69 Table 2.11 Monte Carlo Simulation for the case of Heteroskedasticity and Endogeneity with 0.9 degree of correlation  = 0.9 Mean of 1( ) Parameters Bias Qreg True 1( ) is 1.8 Correlation between errors is 0.9 Control with Frolich and Abadie Imbens Control Interaction Melly and Angrist 2.057 1.759 1.798 1.304 1.981 0.257 -0.041 -0.002 -0.497 0.181 SD 0.019 0.049 0.032 0.216 0.028 Root Mean Squared Errors 0.257 0.064 0.032 0.541 0.183 Mean rejection 1.000 0.121 0.038 0.668 1.000  = 0.9 Qreg Mean of  2 ( ) Parameters Bias True  2 ( ) is 1.64 Correlation between errors is 0.9 Control with Frolich and Abadie Imbens Control Interaction Melly and Angrist 1.656 1.559 1.496 1.711 0.016 -0.081 -0.144 0.071 SD 0.016 0.018 0.028 0.030 Root Mean Squared Errors 0.023 0.083 0.147 0.077 Mean rejection 0.122 0.996 0.996 0.697 In sum, from simulation results, control function can correct the bias problem from the heteroskedasticity and endogeneity to a certain degree. When, the 70 heterskedasticity is multiplicative, putting interaction term between generalized residuals and independent variables might help. In addition, from sieve estimator perspective, increasing the number of interaction or creating better sieve approximation function with more terms is necessary when the number of observations is large. 2.4.3 Test size result We present the test size results from three methods to correct for endogeneity problems. The first one is control function estimation of median regression by using generalized residuals as control. The second one is two-stage least squares with 2 instrumental variables and the last one is two-stage least squares with predicted probabilities from ordered probit as instrumental variables. The test size has been calculated based on 1000 replications for both the data generating process with two endogenous variables and one endogenous variable. Generally, in order to adopt control functions of ordered outcome method for quantile regression, we worry about its finitesample size property comparing to standard mean regression whether it should report lower p-value than the true size, so we reject the null hypothesis more often than we should when there is no endogeneity or low degree of endogeneity. For our example with given DGP, we want to simulate the size of a 0.05 level test of endogeneity based on Hausman-type statistics. The null hypothesis for the first method is H 0 :  0 against H 0 :  0 where  is the coefficient of the control function. Under the null, we have use the naïve estimator for standard errors of  . For the other two instrumental variables method we perform the test based on standard Hausman test. 71 Then, we obtain their respective p  value and compare it to true size. The results are in the following table. When   0 , or there is no endogeneity, the actual test size is 0.065 for control function, it is reasonably close to the nominal size of 0.05. However, it is a little higher than the cases where we use two-stage least squares. That is, we fail the null hypothesis a little more often than the other two methods. However, out of 1,000 replications, the difference is only 15, it make us become less concern about the validity of employing the control function. Moreover, when there is slight degree of endogeneity at   0.1, the size of the test is larger, we almost certainly reject the null hypothesis of no endogeneity in case of control function and two-stage least squares with predicted probabilities. However, with the case of standard instrumental variables, it does not reject the hypothesis of no endogeneity as often as other two methods; only about 600 out of 1,000 that it rejects. When we increase the correlation (  ) to 0.5 and 0.9, all of the tests perform well; they always reject the case of no endogeneity. Table 2.12 Test size based on model with two ordered endogenous variables Test Size Control with generalized residuals Ivregress with Instrumental Variables Ivregress with predicted probabilites Two ordered endogenous variables  = 0.1  = 0.5  = 0.9 0.065 0.951 1 1 =0 0.047 0.597 1 1 0.058 0.978 1 1 72 Table 2.13 Test size based one binary endogenous variable Test Size Control with generalized residuals Control with median residuals Ivregress with Instumental Variables Ivregress with predicted probabilites Probit binary endogenous variable  = 0.1  = 0.5  = 0.9 0.061 0.848 1 1 =0 0.510 0.511 0.481 0.379 0.054 0.573 1 1 0.06 0.948 1 1 In table 2.13, we compare size of test as in table 2.12, but rather than having two ordered-endogenous variables, we lump the two-ordered outcome to be only one binary outcome and re-estimate with the same three methods. We find out that the test performs quite well at zero degree of endogeneity and rejects the null hypothesis more often at even slight degree of endogeneity. Only at   0.1, the control function rejects null hypothesis less often comparing to the case where we have two-ordered outcome. Then, when  increases to 0.5, all of the tests completely reject the hypothesis of no endogeneity. Moreover, rather than using conditional expectation to calculate generalized residuals from the probit as in the first column, we calculate the predicted residuals from conditional median and plug it in as independent variable in the second stage quantile regression. However, the conditional median residuals is not correct way to detect and 73 correct endogeneity since it does not reject the null hypothesis of endogeneity for almost all level of  . Thus, we should use conditional expected residuals rather than median. Table 2.14 Test size with 1000 observations and variance of (9, 1) Test Size Two ordered endogenous variables =0  = 0.1  = 0.5  = 0.9 Control with generalized residuals Ivregress with Z 0.054 0.213 1 1 0.047 0.091 0.958 1 Ivregress with predicted probabilites Test Size 0.059 0.227 1 1 Control with generalized residuals Control with median residuals Ivregress with Instumental Variables Ivregress with predicted probabilites 0.061 0.151 0.998 1 0.508 0.494 0.481 0.494 0.058 0.199 0.919 1 0.059 0.199 1 1 Probit binary endogenous variable =0  = 0.1  = 0.5  = 0.9 Next, we consider the case where the variance of (ui , vi ) change to (9, 1). Also, we will reduce number of observation to 1,000. Increasing the variance from 1 to 9 will certainly reduce the imply R-squared of the data generating process. For example, when 74 the variance is 1, without considering the variation in endogenous terms, the implied Rsquared equals to (4+4)/(4+4+1) = 0.89 or 89 percent. When we change it to 9, the implied R-squared is (4+4)/(4+4+9) = 0.47 or 47 percent that is closed to normal cross sectional data R-squared. In table 3, we represent the test sized based on two and one endogenous variable with 1000 observations and variance of (9, 1). To conclude, from the test size simulations, adopting control functions for the model of quantile regression at median with ordered endogenous variables might be a valid way to control for endogeneity and perform statistical inference in order to check whether we really face endogeneity problem. 2.5. Data and Estimating Results In this paper, we apply the methods described in section 3 and 4 to estimating the causes of income inequality in USA and Great Lake States. The data is Currently Population Survey (CPS) from the period of 2001 to 2009. During this period, two shocks potentially affect household income in the top quantile. They are the dot-com crisis of the 2001 and the Financial Crisis of 2008. The measurement of log wage of the person with highest education will be used as dependent variable while household characteristics, education, union coverage, and age are used as dependent variables in finding quantile treatment effect. In the case that there are two same level of highest education in the household, we will use person with highest wage income. 75 Table 2.15 Percentiles, cut-off level of nominal household income ($) 10th 25th 50th 75th 2000 10344 20720 40551 70646 2001 10572 21521 42024 73000 2002 10632 21500 42125 74900 2003 10580 21384 42381 75000 2004 10500 21620 43160 76803 2005 10890 22108 44097 78000 2006 11250 23010 46001 81000 2007 12000 24600 48020 85028 2008 12143 25000 50000 88294 2009 12157 25000 50000 89133 Source: Current Population Survey (CPS) 90th 108487 112040 114504 114626 118662 121012 126838 133726 136435 138774 As shown in Table 2.15, the income different between the top 10 percent and the bottom 10 percent is approximately 10 times. This trend has persisted over the past 10 years. Only in 2001 and 2008, is the period where the income of the top 10 percent stagnated because of the recession. On the other hands, for the household at the median of income distribution, the difference is about four times compared to the bottom 10 percent. Without considering the people at the top 1 percent as in Saez (2008), there is a certain evidence of income inequality. Table 2.16 Summary statistics for the year 2001-2009 2001 55482 0.201 0.296 0.116 0.015 Average Household income Percentage of Household with high school Percentage of Household with College Percentage of Household with higher than college Percentage of Household in Manufacturing Sector Percentage of Household in Management and Financial Sector 0.115 Number of observations (Household) 49633 Source: Current Population Survey (CPS) 76 2005 60432 0.192 0.321 0.132 0.014 2009 68409 0.192 0.332 0.149 0.013 0.079 76447 0.083 76185 Table 2.16 contains statistics of some key variables that will be used in estimation. The average income shows a steady growth despite two recessions during the period of sample. On the education attainment, household with the high school education refers to the case where the most educated person in the household achieve high school degree where the household with college refers to most educated person in the household holds bachelor degree. It is clear that for the past ten years, the household with high school degree from the survey stays at about 20 percent of total population. There is a growth of household with college degree from 29 percent to 33 percent and household with higher than college degree from 11 to 14 percent. The linear quantile effects model of income determination with the interested explanatory as discussed in section 2 will be as follows: ln(wage)  0  1highschool   2college  3Mcollege   4 Age  5uncov  6 white  7 Manu  8MaFi  9Greatlake  ui (2.25) where ln(wage) is the log wage of most educated person in the household highschool = 1 if household member of highest education got high school degree = 0 otherwise. college = 1 if household member of highest education got bachelor degree = 0 otherwise Mcollege = 1 household member of highest education got high degree than bachelor = 0 otherwise 77 Age is age of person in the household with highest education. uncov = 1 if household member of highest education is under union coverage = 0 otherwise white = 1 if household member of highest education is white = 0 otherwise Manu = 1 if household member of highest education worked in Manufacturing sector last year = 0 otherwise MaFi = 1 if household member of highest education worked in Management and Financial sector last year = 0 other wise. Greatlake = 1 if the household lives in the Great Lake States First, we assume that education attainment is exogenous at first in order to provide a quick picture of what are the determinants of income or return to education. The linear model will be estimated by qreg command in STATA® as well as regression with robust standard errors. We pool the data from CPS survey from year 2001 to 2009 to be one data set. We choose the household with positive wage income that has more than 2 members. Then, we keep only the case where the second most educated member is either wife, husband, partner, or parents. The results are in the following Table 17. At a first glance, not taking in to account the possibility of endogeneity in education attainment, the effects of education attainment on the wage income are significant across all the quantile as well as from standard regression. The results show that higher level of education always lead to higher wage income no matter what point of 78 income distribution. In addition, all of the coefficients are significant at 1 percent. These result are more in line with classical economic theory, the relative differences in wage can be explain by the education. Moreover, at quantile 90 and 99, the return to education seems to be highest especially for the people with degree higher than college. One way of looking at this high value of Mcollege is that it might be possible to look through one of the most popular graduate level program, Master in Business Administration (MBA). According to http://www.businessweek.com/interactive_reports/roi_rankings.html, the average salary of the newly graduate top 20 business school is about 100,000 dollars for the graduate of 2008 class. 79 Table 2.17 Estimate result assuming education attainments are conditionally exogenous (n = 349,825) Estimation OLS Quantile10 Quantile50 Quantile90 Quantile99 highschool 0.2303*** (0.0038) 0.4957*** college (0.0032) 0.8623*** Mcollege (0.0045) 0.1145*** Age (0.0010) 2 Age -0.0013*** (0.00001) 0.0042** uncov (0.0017) 0.1197*** white (0.0036) -0.01669*** Manu (0.0140) 0.3918*** MaFi (0.0037) Greatlake -0.0323*** (0.0038) 0.2578*** (0.0082) 0.5500*** (0.0071) 0.8990*** (0.0092) 0.1757*** (0.0015) -0.0021*** (0.00001) 0.01351*** (0.0041) 0.1484*** (0.0078) -0.3526*** (0.024) 0.4131*** (0.0083) 0.0823*** (0.0086) 0.2341*** (0.0033) 0.4619*** (0.0028) 0.7610*** (0.0037) 0.0976*** (0.0006) -0.0017*** (0.00006) -0.007*** (0.0016) 0.1081*** (0.0031) -0.0910*** (.0010) 0.3336*** (0.0033) 0.0336*** (0.0034) 0.1923*** (0.0058) 0.4685*** (0.0050) 0.9687*** (0.0066) 0.0725*** (0.0011) -0.0007*** (0.00002) -0.0190*** (0.0029) 0.1253*** (0.0055) 0.0083*** (0.0167) 0.4040*** (.0058) -0.014** (0.0061) 0.2485*** (0.0224) 0.9190*** (0.0189) 1.247*** (0.0251) 0.0612*** (0.0041) -0.0006*** (0.00004) -0.0234** (0.0112) 0.1110*** (0.0211) 0.1540** (0.0646) 0.4010*** (0.0224) -0.0448* (0.0233) Notes: (i) The standard errors for the coefficients are in parenthesis (ii) *, **, and *** denote significance at the 10,5, and 1% level respectively For the case of age, it forms a quadratic relationship with the wage that is wage is growing with age at a decreasing rate for all of the quantile distribution. On the union coverage, it might be positively relates to wage at the lower end of income spectrum but not for the quantile 50, 90 and 99. 80 Table 2.18 Estimate result assuming education attainments are endogenous (n = 349,825) Estimation 2SLS Quantile10 Quantile50 Quantile90 Quantile99 highschool 0.0142 (0.0117) 0.1528*** college (0.0163) 0.0747** Mcollege (0.0268) 0.1298*** Age (0.0011) 2 Age -0.0014*** (0.00001) 0.0069** uncov (0.0017) 0.1148*** white (0.0036) -0.1134*** Manu (0.0140) 0.5006*** MaFi (0.0058) Greatlake 0.0286*** (0.0038) -0.0794*** (0.0235) 0.0133 (0.3275) -0.2109*** (0.0490) 0.1968*** (0.0020) -0.0023*** (0.00002) 0.0176*** (0.0035) 0.1435*** (0.0081) -0.2613*** (0.0278) 0.5707*** (0.0102) 0.0783*** (0.0078) 0.0499*** (0.0100) 0.1628*** (0.0150) 0.0906*** (0.0256) 0.1100*** (0.0001) -0.0012*** (0.000001) -0.0056*** (0.0012) 0.1047*** (0.0038) -0.0426*** (.0011) 0.4244*** (0.0051) 0.0311*** (0.0030) 0.0853*** (0.0056) 0.2534*** (0.0175) 0.2553*** (0.0200) 0.0807*** (0.0004) -0.0008*** (0.000005) -0.0169*** (0.0012) 0.1223*** (0.0015) 0.0463*** (.0011) 0.4772*** (0.0086) -0.0141** (0.0078) 0.5910*** (0.0670) 1.1502*** (0.1204) 0.3381** (0.2024) 0.0576*** (0.0042) -0.0005*** (0.00004) -0.0136*** (0.0112) 0.0624*** (0.0211) 0.2022*** (0.0868) 0.5287*** (0.0567) -0.0353*** (0.0107) Notes: (i) The standard errors for the coefficients are in parenthesis (ii) Standard errors come from bootstrap with 200 replications. (iii) *, **, and *** denote significance at the 10, 5, and 1% level respectively Being white yield the same positive and significant effect on wage throughout regression and quantile regression. As expected for the decline in industrial sector, working in manufacturing gives less wage at a lower part of income while at a higher income distribution, the coefficients is still positive. In addition working in financial 81 service sector always yield higher wage. Lastly, working in great lake state is on average associate with lower income; however, it is not true for quantile 10 and 50. Secondly, when we treat the education attainments as endogenous ordered variables and we should, the estimated results are starkly different. The first column from Table 2.18 describes the case where we correct for endogeneity by first calculating the generalized residuals and then use the standard regression. By looking at the average effects of education attainment, we found out that return to education is not as high as in Table 2.17. Getting high school yields positive return to wage but not significant at 10 quantile. Moreover, earning college degree leads to higher return to wage comparing to earning more than high school. This result is the clear when we focus on the 10 quantile and 50 quantile; the return to education of bachelor degree is higher than both high school and even graduate degree. At 90 quantile, it is the place where the wage is increasing, as the level of education attainment is higher, but only at this quantile is where the return to education of graduate becomes higher than college. Then, at 99 quantile the return to education of graduate degree is highest but is still lower than the bachelor degree. That is, the return to education is not monotonically increases with the level of education attainment. By correcting for endogeneity, it reveals that returns of education on wages do not always increase throughout the income distribution. Specifically, attaining high school though positively related to higher wage than not getting high school degree, the coefficients are quite small throughout the wage distribution compared to other education degrees. Next attaining college degree leads to highest impact on wage. We might be able to say that bachelor degree is a must-have for 82 higher wage. The low coefficients of high school education even at 90 and 99 quantile point out that high school degree is not sufficient anymore. In addition, for the higher than college degree (graduate degree), the impact on wage is still positive but incrementally less than the college degree. The lower return of graduate degree might be a good indicator of why only 10 percent of observations from the survey pursued graduate degree. One explanation for such lower return to graduate degree is Neoclassical Growth theory; the reduction for the return of graduate degree might be related to skill-set demanded in the current world economy. Globalization makes outsourcing of the graduate skill-set possible. That is, not only the US corporation can conduct foreign direct investment abroad to lower the cost of low-skill labor(high school) but also firm can lower the cost of high-skill labor, too. Wan (2008) pointed out that the cost of hiring US engineer to design computer chip is approximately three times higher than cost of hiring Chinese engineer and two times higher than hiring Korean engineer of the same caliber. In addition, Endogenous Growth theory predicts the economy where only people at the high end of human capital spectrum and capital holder will leap more benefit from economy. That is, at the 99 quantile, education attainment alone might not be enough in explaining the different in wage. People with graduate degree earn higher wage than people without high school education. However, it might not be true if we compare to the bachelor degree. 83 2.6 Concluding Remarks and Further Study Unlike previous studies on income distribution, this empirical study examines the relationship between determinants of income throughout entire distribution by using quantile regression. Although, various methods of quantile treatment effects that are robust to endogeneity have been examined, but none of them has taken into account the ordered endogenous regressors. We propose the use of control function with sieve estimator to correct for the endogeneity problem. The Monte Carlo simulation shows that our proposed method is usable and works well against other method. Correcting for endogeneity reveals interesting results, the different levels of education attainment significantly increases wage. However, ignoring endogeneity leads to overestimate of the return to education. In addition, wages are not monontonically increasing with education throughout wage distribution. At the both low and high wage spectrum, it proves out that bachelor degree is not enough for the current state of the economy. Moreover, focusing on graduate degree attainment might not be the key in improving wage, their coefficients are increasing throughout the income distribution but it can be lower than bachelor degree attainment even at the 99 quantile. Moreover, only at the 90 quantile is where the returns to graduate degree is slightly higher than bachelor degree. Regarding policy implications, on the surface, it might be suitable to say that government should promote higher education attainment; however, there are issues that are more concerned. At first, higher education is not cheap and given the current state of income inequality and economy, only the people at the higher end of the spectrum will be able to leap the benefit. In addition, some might argue the definition of education whether 84 the university should provide knowledge that can be practically used and related to the economy or being holistic. Also, there is a differences opinion regarding the education system, in the East Asian countries, it is a belief that judgementalism and incentive system are an important elements in leading the student to study science and technology as well as pursuing higher education than bachelor degree Wan (2008). Future research into this issue might be beneficial by either conducting Monte Carlo simulation or empirical study of endogenous effect of education on income (not only wage) inequality of people at the higher end of income spectrum (higher than 90 percent). In addition, the choices of control function, sieve estimator, as well as instrumental variable do really affect the estimated results. Further sensitivity analysis and asymptotic theory are of interest in order to ensure the robustness of estimator. 85 REFERENCES 86 REFERENCES Abadie, Alberto, Joshua Angrist, and Guido Imbens. “Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings.” Econometrica 70, no. 1 (2002): 91–117. Acemoglu, Daron. Introduction to Modern Economic Growth. Princeton University Press, 2008. Anbarci, Nejat, Monica Escaleras, and Charles A. Register. “Earthquake Fatalities: The Interaction of Nature and Political Economy.” Journal of Public Economics 89, no. 9 (2005): 1907–1933. Autor, David H., Lawrence F. Katz, and Melissa S. Kearney. “Trends in US Wage Inequality: Revising the Revisionists.” The Review of Economics and Statistics 90, no. 2 (2008): 300–323. Bakija, Jon, Adam Cole, and Bradley T. Heim. “Jobs and Income Growth of Top Earners and the Causes of Changing Income Inequality: Evidence from US Tax Return Data.” Williamstown: Williams College (2010) Mimeo. Ballard, Charles L., and Paul L. Menchik. “Informing the Debate” (2008) Mimeo. Bound, John, and George Johnson. “Changes in the Structure of Wages in the 1980’s: An Evaluation of Alternative Explanations.” The American Economic Review (1992): 371– 392. Chen, Xiaohong. “Large Sample Sieve Estimation of Semi-nonparametric Models.” Handbook of Econometrics 6 (2007): 5549–5632. Chernozhukov, Victor, and Christian Hansen. “An IV Model of Quantile Treatment Effects.” Econometrica 73, no. 1 (2005): 245–261. ———. “Instrumental Quantile Regression Inference for Structural and Treatment Effect Models.” Journal of Econometrics 132, no. 2 (2006): 491–525. ———. “Instrumental Variable Quantile Regression: A Robust Inference Approach.” Journal of Econometrics 142, no. 1 (2008): 379–398. Firpo, Sergio. “Efficient Semiparametric Estimation of Quantile Treatment Effects.” Econometrica 75, no. 1 (2007): 259–276. Frölich, Markus. “Nonparametric IV Estimation of Local Average Treatment Effects with Covariates.” Journal of Econometrics 139, no. 1 (2007): 35–75. 87 ———. “Propensity Score Matching Without Conditional Independence Assumption—with an Application to the Gender Wage Gap in the United Kingdom.” The Econometrics Journal 10, no. 2 (2007): 359–407. Frölich, Markus, and Blaise Melly. “Estimation of Quantile Treatment Effects with Stata.” Stata Journal 10, no. 3 (2010): 423. ———. “Unconditional Quantile Treatment Effects Under Endogeneity.” Journal of Business & Economic Statistics no. just-accepted (2013). Kahn, Matthew E. “The Death Toll from Natural Disasters: The Role of Income, Geography, and Institutions.” Review of Economics and Statistics 87, no. 2 (2005): 271–284. Kahn, Robert S., Paul H. Wise, Bruce P. Kennedy, and Ichiro Kawachi. “State Income Inequality, Household Income, and Maternal Mental and Physical Health: Cross Sectional National Survey.” BMJ: British Medical Journal 321, no. 7272 (2000): 1311. Kawachi, Ichiro, Bruce P. Kennedy, Kimberly Lochner, and Deborah Prothrow-Stith. “Social Capital, Income Inequality, and Mortality.” American Journal of Public Health 87, no. 9 (1997): 1491–1498. Koenker, Roger. Quantile Regression. Cambridge university press, 2005. Kuznets, Simon. “Economic Growth and Income Inequality.” The American Economic Review 45, no. 1 (1955): 1–28. Levy, Frank, and Richard J. Murnane. “US Earnings Levels and Earnings Inequality: A Review of Recent Trends and Proposed Explanations.” Journal of Economic Literature 30, no. 3 (1992): 1333–1381. Murray, Charles. Real Education: Four Simple Truths for Bringing America’s Schools Back to Reality. Random House Digital, Inc., 2009. Piketty, Thomas, and Emmanuel Saez. “Income Inequality in the United States, 1913–1998.” The Quarterly Journal of Economics 118, no. 1 (2003): 1–41. Rajan, Raghuram G. Fault Lines: How Hidden Fractures Still Threaten the World Economy (New in Paper). Princeton University Press, 2011. Ritchie, Bryan K. Systemic Vulnerability and Sustainable Economic Growth: Skills and Upgrading in Southeast Asia. Edward Elgar Publishing, 2010. Saez, Emmanuel. “Striking It Richer: The Evolution of Top Incomes in the United States (update with 2007 Estimates)” (2009). 88 Shaughnessy, Timothy M., Mary L. White, and Michael D. Brendler. “The Income Distribution Effect of Natural Disasters: An Analysis of Hurricane Katrina.” Journal of Regional Analysis and Policy 40, no. 1 (2010). 89 CHAPTER 3: BINARY RESPONSE MODEL WITH CONTINUOUS ENDOGENOUS VARIABLE AND HETEROSKEDASTICITY 3.1 Introduction This paper describes a simple estimator for the binary choice model with continuous endogenous regressors and heteroskedasticity in both the structural and reduced form equations. As pointed out in Wooldridge (2010), either endogeneity or heteroskedasticity can lead to the problem of inconsistency in the estimation of both binary choice model coefficient as well as average partial effects (APEs). Many approaches have been proposed to handle the endogeneity in the binary responses model including full information maximum likelihood, limited information maximum likelihood, special regressor, instrumental variables, and control functions. Applying instrumental variables to a Linear Probability Model (LPM) is the easiest approach to implement and the estimated coefficients are equal to (approximate) APEs. There are two main drawbacks for using LPM. First, by construction, it ignores the bounded nature of the binary response. That is, it does not guarantee the fitted values – which are supposed to be probabilities – lie with the interval of zero and one. Secondly, its functional form may not be general enough to well approximate all response probabilities, thereby leading to poor estimates of APEs, especially for subgroups in the population. Full information maximum likelihood is the most common estimation method to deal with either endogeneity or heteroskedasticity in the binary response model. The popular econometrics package Stata® has commands to deal with each problem separately (called 90 “ivprobit” and “hetprob,” respectively.) But we may want to allow both features in the same time, and for flexibility reason, we might want to allow heteroskedasticity in both the structural and reduced form equations. One can use full maximum likelihood, but that is computationally intenstive. Instead, a two-step control function approach is attractive. Recently, several studies include flexible form of hetreoskedastitcity into binary response model without taking into account endogeneity. Khan (2013) proves the equivalence results of the maximum score estimator of Manski (1985) and Horowitz (1992). The proposed estimator used sieve nonlinear least squares for estimating the distribution free form of heteroskedasiticity with the Probit criterion function. Chesher (2009) and Holderlin (2009) used the index restriction along the line of quantile regression to estimate the binary model with heteroskedasticity in the form of unobserved heterogeneity in the structural equation. Their methods are complicated and required semiparametric estimation of conditional distribution function. Moreover, in Chesher (2009), he focused on the identification of ratios of derivatives of the index while Holderlin (2009) focused on median interpretation of the result. None of these papers allows consistent estimation of the average partial effect, which is a focus in this paper. There are group of papers that can deal with both endogeneity and heteroskedasticity problems. They are using “special regressor” proposed by Lewbel (2000) and applied to binary choice model in Dong et.al. (2012). The special regressor must satisfy three assumptions. It is conditionally independent of the error, additive with the model error in the binary response model, and continuous with the large supports. In practice, these assumptions essentially require that the special regressor is independent of other observable explanatory variables, as well as the error. Plus, the special regressor approach can deal with only the case of heteroskedasticity in the special regressor only. In addition, their marginal effects come from Average Index Function 91 as in Lewbel et.al. (2012). This function is easy to compute but is not the same an average partial effect. In addition, it appears sensitive to outliers as well choice of bandwidth and kernel to approximate the index function. The last method is control function. The general approach is standard and can be found in textbooks such as Wooldridge (2010) and Greene (2008). The method was first used by Heckman (1976) to correct for endogeneity in a linear model with a binary endogenous explanatory variable. For the simple binary response model and continuous endogenous explanatory variables, it was first proposed by Rivers and Vuong (1988) and Blundell and Smith (1989). For the semiparametric estimation, the example is Blundell and Powell (2004) and Klein and Vella (2008). The previous literature on control functions has assumed that, when applied to standard model such as probit and Tobit, or semiparametric index functions, that heteroskedasticity is either absent or of a very restrictive form. In this paper, we adopt the use of control function approach to address endogeneity and hetersoskedasticity problem since it can incorporate both heteroskedasticity in reduced form equation and structural equation. The advantage of the control function is that for the first stage estimation, we can specify flexible functional form of heteroskedasticity; for example, exponential heteroskedasticity. Then, we can apply the asymptotic property of sieve estimator from Ackerberg et.al. (2012). It leads to valid statistical inference for second-stage parameters. For the second step of estimator, we assumed also the heteroskedasticity in the probit model as in Wooldridge (2005). Hence, we can easily estimate APEs and find their standard errors by mean value expansion theorem or bootstrapping. 92 This paper does not cover analytical and asymptotic theory of the control function with sieve estimator as in Chen (2007) and Ackerberg et.al. (2012). From the perspective of regularity conditions, the problem here is standard, so there should be no surprises when it comes to the asymptotic theory. Instead, we conduct Monte Carlo simulations to investigate how the proposed estimation works when the true data generating process contains only endogeneity, endogeneity with heteroskedasticity in reduced form equation, and endogeneity with heteroskedasticity in both equations. The performance is measured by Monte Carlo bias and root mean square errors of the estimated APES, as well as mean of standard errors from the simulation. The simulation results show the considerable benefit of modeling heteroskedasticity in either the first stage estimation or the second stage estimation. Without incorporating heteroskedascity, the linear model of endogenous regressor yields biased and insignificant results. In addition, the standard two-step estimation leads to upward-biased APEs. For, the empirical study, we apply the proposed method to the model of female labor supply decision making, that is binary probit model. The only endogenous variable will be non-wife income as in study of Mroz (1987) The rest of the paper is organized as followed. Section 2 describes the model and APEs. Section 3 contains the results from Monte Carlo Simulation that compares control function with the LPM, probit, hetprob, and ivprobit from Stata®. Section 4 demonstrates the application to female labor supply. And, Section 6 is conclusion. 93 3.2 Model and Estimation Method In this paper, I consider the binary response model where one of the explanatory variables is correlated with the error term in the latent variable model. * y1i  1{ y1i  0} (3.1) * where the latent variable y1i is assumed to be in the linear in parameter model of the form * y1i  z1iβ1  y2i  1i (3.2) y2i  ziβ 2  ε2i (3.3) z11i  z1i  zi , z 2i  z1i  z1i (3.4) From structural equation (3.1) and (3.2), the binary response model contains continuous endogenous variable y2i and exogenous variables z1i . Then, equation (3.3) is the reduced form equation of y2i . The subset of exogenous variable in (3.4) is used in the identification of model with heteroskedasticity as well as endogeneity in structural equation. Furthermore, I assume that the error terms in (3.2) and (3.3) contain multiplicative heteroskedasticity of the form 2 2 1i | z11i ~ N (0, h1 ( z11i1)) and  2i | zi ~ N (0, h2 (zi 2 )) To deal with heteroskedasticity in the reduced form equation, I divide equation (3.2) and (3.3) by their respective standard deviation h1( z11i1) and h2 ( zi 2 ) . We can write equation (3.2) and (3.3) as follows: ~*  ~   ~   y1i z1i 1 y2i ~ i 1 (3.5) 94 ~  ~   y2i z1i 2 ~2i (3.6) where the tilde refers to transformed data. In order to take care of endogeneity problem, we must make a strong assumptions that ~ ~ ~ governs the relationship between the standardized errors 1i and  2i .For identification, 1i and ~ ~ ~  2i is conditional independent of z i . Next, we assume that (1i ,  2i ) has zero mean, bivariate normal distribution with the following variance covariance structure, ~   0  1 1   1i    ~  ~ N   ,    0  1  2i    1  (3.7) ~ ~ We can write the linear projection of 1i on  2i ~ ~ 1i  1 2i  e1 (3.8) ~ ~ ~ where 1 is correlation between 1i and  2i . Given that  2i is normally distributed, then in this case y2i conditional on z i has features of a normal random variable. This model cannot solve * the endogeneity problem when y1i appears in the reduced form equation (3.3). As pointed out by Wooldridge (2010). The model is applicable when y2i poses the endogeneity problem that ~ comes from omitted variables or measurement error. The normalization of variance of  2i is necessary in estimating the parameters in (3.5) up to scale and obtaining average partial effects (APEs). Then, I employ the control function approach similar to Rivers and Vuong (1988), it includes extra regressions in the structural equation (3.5) such that the remaining variation in ~ ~ will not be correlated with  and  . Under joint normality assumption in (3.7), we can y2i 1i 1i rewrite (3.8) as 95 ~ ~ h1(z11i1)1i   (z11i ) 2i  h1(z11i1)  e1i (3.9) where  (z11i )  1h1(z11i1) . Also, because or assumption in (3.8), e1 is normally distributed with 2 2 E(e1i )  0 and variance h1 ( z11i1)(1  1 ) . Therefore, we can now get the structural equation with control function as ~ ~*  ~   ~   (z )  h (z  )  e y1i z1i 1 y2i 11i 2i 1 11i 1 1i (3.10) 2 2 h1(z11i1)  e1i | zi , ~2i ,  2i ~ N (0, h1 ( z11i1)(1  1 )) y ~ (3.11) Using equation (3.9) to rewrite the structural equation in term of normal distribution, we get 2 2 ~ P( y1i  1 | zi , ~2i ,  2i )  [( ~1i 1  ~2i   ( z11i ) 2i / h1 ( z11i1)(1  1 ) ] y ~ z y (3.12) If there is no endogeneity of y2i or ~2i then estimation of (3.12) will be the same as standard y ~ heteroskedasticity probit model. If we do observe  2i , we can estimate each respective parameter ~ with the maximum likelihood estimation of (3.12). However,  2i is not observed, we can replace ~ ~ ˆ it with the consistent estimator of  2i ,  2i . Hence, we suggest the following procedure. ~ ˆ Step 1. Obtain the standardized residual  2i from the maximum likelihood estimation of y2i on z i . Also, we do not have to specifically identify the specific form of h2 ( zi 2 ) . We propose the use exponential function that can allow for the interaction and power series that we can flexibly increase their terms with respect to the number of observations. 96 Step 2. Apply maximum likelihood estimation of equation (3.12). However, we have to specify 2 the functional form of h1 ( z11i1) . I suggest the use of exponential standard deviation as same as in hetprob command in Stata®. This two-step estimator is easy to implement; however, we need to adjust the secondstage standard errors by taking into account the first-stage estimation. This procedure is common as in the chapter 12 of Wooldridge (2010). We proceed in two-step estimation as in Ackerberg et.al. (2012) such that the corrected standard errors of ( 1,  ) are asymptotically equivalent to that of Murphy and Topel (1985). I show the adjustment procedure in the appendix A. In order to obtain the APEs, we adopt the average structural function (ASF) defined by Blundell and Powell (2004), and is similar to Wooldridge (2005). From (3.12), ASF is defined as    z11  y2   ( z11i ) 2i  ASF ( z1, y2 )  E 2i , z11i [  ] 2 ( z  )(1   2 )   h1 11i 1 1    (3.13) By assumption that  2i is normally distributed, a consistent estimator of the average structural function is 1 N   ˆ ˆ ˆ ˆ N  z11  y2   ( z11i ) 2i  i 1  2  ˆ ˆ2  h1 ( z11i1)(1  1 )    (3.14) For the continuous covariate in z i and y2i , their APEs are partial differentiate of (3.13) with respect to each parameter and averaging out with respect to both  2i and z11i as in (3.15). 97 The APEs for binary exogenous variable is the differences of ASF in (3.14) when the exogenous variable is set to one and zero. 1 N      z   y   ( z )   ˆ ˆ 2 ˆ 11i ˆ2i iN 1   1 12   /    ˆ )(1   2 )   ˆ   h1 ( z11i1 1     (3.15) Then, we calculate the APEs standard error based on their corrected two-step estimation. The standard errors of the APEs will be derived the procedure as in question 12.17 in Wooldridge (2010). The detail is in the appendix B. 3.3 Monte Carlo Simulation The purpose of this section is to compare the proposed estimators’ APEs, (HetIVhetprob), with standard estimators: regression(OLS), two-stage least squares(2SLS), Probit, instrumental variables probit(IVprob), heteroskedasticity probit(Hetprob), and instrumental variable probit with heteroskedasticity(IVhetprob). These estimators are evaluated under correct model specification under different scenarios. There are four main scenarios. Firstly, all of the dependent variables are exogenous. Secondly, there is only one endogenous variable in the probit model. Thirdly, there will be one endogenous variable in the probit model and heteroskedasticity in the first stage estimation. Lastly, there is only one endogenous variable in the probit model and heteroskedasticity in both first and second stage estimation. 98 3.3.1 Data generating process (DGP) 3.3.1.1 The data generating process contains only exogenous variables The number of observation and the number of iteration are set to be 1000 and 1000, respectively. The data generating process (DGP) for the first scenario is given by * y1i  10  1 y2i  11x1i  12 x2i  13 x3i  1i , i  1,2,..., N (3.16) y2i   20   21z1i   22 z2i   2i , i  1,2,..., N (3.17) The parameters are setting as follows: 10  2, 1  3, 11  0.5, 12  0.5, 13  2, 20  1, 21  1, 22  1 The DGP for each variable is given by x1i ~ 3  u(0,1), x2i ~ N (0,1), x3i ~  2 (1), z1i ~ N (0,1), z2i ~ N (0,1) The error terms are randomly drawn from normal distribution with mean 1 and covariance 0.   0  1 1   1i   . Hence, probit and regression should be the best estimator. The true     ~ N   0 ,       2i    1 1  APEs are obtained from plugging in the correct parameters to the APEs formula, then averaging the estimates out across them sample and simulate 1000 times with respective number of observations, 1,000 and 3,000. That is, it is equal to 99 1 s  j 1 (2  3 y2ij  0.5x1ij  0.5x2ij  2 x3ij ) *  s (3.18) where  represents each true respective parameters. The results are in table 3.1 and 3.2. The tables report the means of the APEs over 1000 replications, the standard deviations (SD) of Monte Carlo simulation, and the mean of the adjusted standard errors (SE). When there is no heteroskedasticity and endogeneity, all of the estimation methods work really well. All of them have very low bias and variance that leads to low root mean squares errors (RMSE). In addition, the relative values of SD and SE are close for all of the estimators, this indicates that the adjusted standard errors method, by mean value expansion theorem, work well when the data generating process is not complicated. Moreover, by over fitting the model, the standard deviations of the nonlinear models are slightly higher than the linear one. Therefore, the simulations under the first condition suggest that nonlinear and linear model are comparable and can be use alongside with each other when there are no endogeneity and heteroskedasticity problem. 3.3.1.2 The data generating process contains only endogeneity To examine whether the proposed estimators work under endogeneity, the data generating process incorporate moderate endogeneity through the covariance matrix of errors   0  1 1   1i   and 1  0.5 . Then, the linear projection of 1i on  2i is  ~ N   , terms     0     1 1 2i    1i  1 2i  e1 . The instrument’s predictive power is 0.5 that is strong instrument by setting 20  1, 21  0.5, 22  0.5 . The rest of the parameters and data generating process is the same 100 as in section 3.3.1.1. The true value of APEs for each variable is generated from the mean of the APEs from Monte Carlo simulations as same as in 3.3.1.1. For example, the true APE of y2 is approximated from simulations by first computing the derivative of (3.14) with respect to y2 , and then taking the average across the distribution of 1i : APE  2  1 N  2  3  y2i  0.5  x1i  0.5  x2i  2  x3i  0.5   2i  ei       N 1  0.25  i 1  (3.19) These approximated true values of the APEs with respect to both endogenous and exogenous variables are the benchmark for comparing the estimated APEs across the models. Table 3.3 and 3.4 report simulation results when there are 1000 and 3000 observations. As expected with strong instrument, the nonlinear two-step models that takes into account endogeneity provide less biased results than the linear model (2SLS) and the other models that do not take into account endogeneity. However, 2SLS yields less bias than either normal probit or heteroskedastic probit. On the volatility of each estimator, the linear and one-step model provides less standard deviation than the two-step model. Overall, the root mean squares errors of the two-step models with endogeneity are still less than the one-step model. In Table 3.5 and 3.6, they represent the results when the instruments are weak. With the two-stage least squares and other one-step estimators, they are biased upward while the two-step estimators with endogenetiy are biased downward. Comparatively, all of them yield similar root mean square errors. Interestingly, by using IVhetprob and HetIVhetprob to estimate the only endogenous probit model, the Monte Carlo standard deviations are not much different from the correctly specified Ivprobit results. 101 3.3.1.3 Data generating process with endogeneity and heteroskedasticity in reduced form In this scenario, all other parameters are the same as in 3.3.1.2, except, the heteroskedasticity in the first stage estimation will be in the form of multiplicative heteroskedasticity with exponential functional form. It changes the first stage estimation error term to  2i | zi ~ N (0, exp((2  1 z 2i )) and the covariance matrix of the errors term change to   0  1 1   1i  ~  where  2i is  2i / exp(1  0.5  z 2i ) . Hence, in the first stage estimation,  ~  ~ N   ,    0  1  2i    1  if we suspect that the error terms contain heteroskedaticity, we will estimate it first and then standardized such errors and use it in the second stage estimation. While holding other parameters to be the same as before, table 3.7 and 3.8 present very interesting result. With strong instruments of 0.5 predictive powers, the last two estimators IVhetprob and HetIVhetprob that take into account heteroskedasticity outperform the other four estimators not only in term of less bias but also comparable standard errors. In addition, the linear model performs poorly; it heavily under estimate APEs of the endogenous variable while over estimates the rest of the APEs for exogenous variables. Interestingly, regarding the bias, it is even worse than the nonlinear models that do not take into account endogeneity and heteroskedasticity problem. With weak instruments of 0.1 predictive powers, the IVhetprob, and HetIVhetprob lead to more positive biased of the APEs for the endogenous variables. However, the bias is still less than the other nonlinear models. On the contrary, the two-stage least squares suffered the most from weak instrument. It not only yields biased APEs for endogenous variables, but also the sign is in correct with high standard deviations and standard errors. This certainly will lead to 102 rejection of the significant result when applying the linear model to the nonlinear setup. More importantly, even the observations increases from 1000 to 3000, the bias does not go away. The increase in number of observations slightly reduces only standard deviation and standard errors. 3.3.1.4 Data generating process with endogeneity and heteroskedasticity in both equations For the last condition, the first stage estimation contains heteroskedasticity as in 3.3.1.3, then, the errors term in the second stage estimation is the same as in equation 3.10. Then, h1(z11i1) defines as multiplicative heteroskedasticity with exponential of the form exp(1 x2i ) while the other parameter for data generating process are the same as in 3.3.1.3. The derivations of APEs as well as adjusted standard errors are in the appendix. From Table 3.11, the linear estimator still yields biased result for all estimated APEs. For the APEs of endogenous explanatory variable, the estimators those do not take in to account the heteroskedasticity in the structural equation yields downward biased result while the standard heteroskedastic probit leads to upward bias. Even though, the proposed HetIVhetprob leads to upward bias result for the APEs of endogenous variables, it provides some advantages over other estimation methods that do not model heteroskedasticity, the APEs for x2 of linear model, probit, and Ivprobit not only are biased but also negative rather than being positive. One last important note is that the APE of endogenous explanatory variable from probit estimation has lower bias than the HetIVhetprob; however, its APEs of other exogenous variables are more bias than HetIVhetprob. 103 Lastly, the mean rejection rate of the APEs from the proposed method is certainly lower than both linear instrumental variables model and the heteroskedastic model as seen in table 13.12. Using linear model leads to mean rejection rate of 1 while using the heteroskedastic probit model alone lead to mean rejection rate of 0.967. That is, out of 1,000 simulations, we are almost certainly rejecting the estimated APEs from both of the models. The mean rejection rate of HetIVhetprob is about 0.14 or 14 percent given its complicated variance structure and higher simulated variance than the other methods. 3.4. Empirical Study: Application to female labor supply In this section, we apply the above procedure in estimation of female labor force participation that is binary choice model. The female decision to join labor force may be affected by endogeneity as suggested by Mroz (1987). In addition, it might contain the heteroskedasticity problem, too. This makes the response probability and the APEs depend on the form of variance and need to be incorporated into estimation of female decision model. In the static decision making model with continuous endogenous variable, the estimation focus on non-wife income as a sole endogenous variable. The non-wife income might be related to certain unobserved individual female characteristics as well as her husband characteristics. For heteroskedasticity, in order to be comparable to the Monte Carlo simulation, the heteroskedasticity in reduce form equation has exponential form with husband experience and education exp( 21  husexper   21  huseduc) . In the second stage estimation, only female experience and its square are in 104 the exponential heteroskedastic term exp(δ11  exper  δ12  exper 2 ) . Later on, the flexible form of the heteroskedasticity can be introduced in both stage of estimation. Then, the analytical standard errors from the delta method will be hard to derive and become too cumbersome as pointed out in Wooldridge (2005). The standard errors of APEs will come from bootstrapping. The data set for the application comes from Wooldridge (2012) book; it contains information for female labor participation from Current Population Survey (CPS) in 1991. There are 5634 observations. Table 3.13 provides the summary of the 5634 observations. In the reduced form equation, the husband education and experience are instrumental variable for nonwife income, where both the husband experience and experience enters into the exponential heteroskedasticity. Table 3.14 provides the result from reduced form equation estimation. The high value of t-statistics suggests that both husband education and experience is strong instrument. In addition, the estimation of the conditional variance terms are also significant. The specification of structural equation is inlf *  10  11  educi  12  kidlt 6i  13kidge6i  14  experi i  15  experi2    nwifeinc i  1i (3.20) The reduced form equation is nwifeinc i   21  huseduc   21  husexper   2i Then, the structural equation is inlf i  1[10  11  educi  12  kidlt 6i  13kidge6i  14  experi 105 (3.21)  15  experi2    nwifeinc i  1i  0] (3.22) This specification is similar to example in Wooldridge (2010) and (2012) textbook; however, I drop the age out since it is highly correlated with experience; the correlation coefficient is 0.9682. At first, in order calculate the analytical standard errors, only the wife experience and experience square enter into the exponential heteroskedasticity term. Table 3.15 and 3.16 report the estimated APEs. The results for estimated APEs are interesting and provide a comparable result to Monte Carlo simulation. From table 3.15, ignoring endogeneity will lead to overestimation of APEs for non-wife income; however, when the standard instrumental variable regression is used, the partial effects will be much smaller. In contrast, if researcher believe that instrumental variable probit needed to be used in order to control for both engogenity and nonlinearity; the partial effects of -0.00328 will be really close to regression of 0.00331. This might lead to the conclusion that estimating partial effects from linear model is usable and comparable to what seems to be right underlying model of Ivprobit. However, the result from heteroskedastic probit model lead to the starkingly different with APEs of -0.00696, it is almost two times higher than the result from regression and Ivprobit and about four times higher than two stage least squares (2SLS). Then, from the significant of the heteroskedastic term in both first stage and second stage estimation, table 3.16 provides the APEs of non-wife income. They shows that considering the heteroskedasticity in the first stage is not enough, the APEs from IVhetrprob is similar to twostage least squares but it is not significant. Then, by simply modeling the heteroskedasticity in the second stage, the APEs turn out to be -0.00459 for the basic Hetivhetprob model that has 106 wives’ experience and square of experience in the conditional variance terms. This result is similar to simulation when the HetIVhetprob is correctly specified in table 3.11 and 3.12. That is, the conditional variance both in first stage and in second stage influences the estimated APEs not only in terms of their values but also in terms of standard errors of endogenous variable. Also, when we allow flexible functional form of predicted errors and predicted errors square to be in the conditional variance function, the results are in column 3 and 4 in table 3.16, we can see that the APEs for non-wife income turns out to be -0.0054 and -0.0049. These APES are higher than the other estimated APEs from models that do not take into account heteroskedasticity. On the other exogenous variables, the estimated APEs from heteroskedasticity (HetIVHetprob) model show a different story, especially the APEs for education and having kid in the age greater than 6. The APEs of education are more negatively affected the probability of labor supply in such model than the rest. In addition, APEs of having kid in the age greater than six become positive and significant while the other model show only positive but not significant result. These results are consistent with the Monte Carlo simulation study. 3.5 Conclusion This paper proposes a two-step estimation method for the binary model with heteroskedasticity and continuous endogenous variable. The method employed is the used of control function that allow the flexible functional form of the conditional variance in the first stage estimation that still yield a valid second stage variance estimation and adjustment. The Monte Carlo simulations study is conducted to test whether the proposed estimation is better or not under different data generating process that start from the simple binary model. The binary 107 model with endogeneity is correct, and the binary model with heteroskededaticity and endogeneity is correct. The results elucidate that not taking into account all of the either endogeneity or heteroksedasticity in both steps of estimation might lead to biased result. Most importantly, using the linear probability model to estimate APEs is not always the best or considered just fine approximation in various situations. The application to female labor supply from CPS data yields similar conclusion to Monte Carlo simulation. Estimating the APEs by using the linear model yield lower APEs than the model that taken into account heteroskedasticity. In addition, the estimation of the model with heteroskedasticity yields APEs that is statistically significant compared to the models that do not put the heteroskedasticity in the estimation. 108 Table 3.1 The correct model is Probit True Value APE N=1000 OLS Probit Hetprob Ivprobit Mean y2 (0.09764) SD SE RMSE 0.0977 0.0034 0.0031 0.0034 0.0976 0.0091 0.0092 0.0091 0.0989 0.0094 0.0097 0.0095 0.0966 0.0092 0.0105 0.0093 0.0966 0.0092 0.0096 0.0093 0.0978 0.0095 0.0100 0.0095 Mean x1 (0.01627) SD SE RMSE 0.0163 0.0093 0.0093 0.0093 0.0162 0.0044 0.0045 0.0044 0.0164 0.0046 0.0046 0.0046 0.0160 0.0044 0.0044 0.0044 0.0161 0.0044 0.0052 0.0045 0.0162 0.0046 0.0048 0.0046 Mean x2 (0.01627) SD SE RMSE 0.0161 0.0093 0.0094 0.0093 0.0162 0.0045 0.0045 0.0045 0.0164 0.0048 0.0047 0.0048 0.0160 0.0045 0.0046 0.0045 0.0160 0.0045 0.0050 0.0045 0.0162 0.0048 0.0048 0.0048 Mean x3 (0.0651) SD SE RMSE 0.0456 0.0070 0.0067 0.0207 0.0651 0.0075 0.0076 0.0075 0.0659 0.0077 0.0079 0.0077 0.0643 0.0075 0.0076 0.0076 0.0643 0.0076 0.0079 0.0076 0.0651 0.0078 0.0083 0.0078 109 IVprobhet HetIVprobhet Table 3.2 The correct model is Probit True Value OLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet Mean y2 (0.09762) SD SE RMSE 0.0977 0.0020 0.0018 0.0020 0.0976 0.0050 0.0051 0.0050 0.0978 0.0050 0.0051 0.0050 0.0972 0.0050 0.0047 0.0050 0.0972 0.0050 0.0052 0.0050 0.0975 0.0051 0.0052 0.0051 Mean x1 (0.01627) SD SE RMSE 0.0165 0.0053 0.0054 0.0053 0.0164 0.0024 0.0026 0.0024 0.0164 0.0024 0.0026 0.0024 0.0163 0.0024 0.0025 0.0024 0.0163 0.0024 0.0026 0.0024 0.0164 0.0024 0.0026 0.0024 Mean x2 (0.01627) SD SE RMSE 0.0164 0.0053 0.0054 0.0053 0.0163 0.0025 0.0026 0.0025 0.0163 0.0026 0.0026 0.0026 0.0162 0.0025 0.0025 0.0025 0.0162 0.0025 0.0028 0.0025 0.0162 0.0026 0.0028 0.0026 Mean x3 (0.0651) SD SE RMSE 0.0458 0.0040 0.0038 0.0197 0.0652 0.0042 0.0042 0.0042 0.0654 0.0042 0.0043 0.0042 0.0650 0.0042 0.0041 0.0042 0.0650 0.0042 0.0043 0.0042 0.0651 0.0042 0.0044 0.0042 APE N=3000 110 Table 3.3 The correct model is Ivprobit with 21  0.5, 22  0.5 True Value APE N=1000 Mean y2 (0.09158) SD SE RMSE 2SLS Probit Hetprob Ivprobit IVprobhe t HetIVprobhet 0.1062 0.0070 0.0059 0.0162 0.1114 0.0097 0.0098 0.0221 0.1126 0.0103 0.0104 0.0234 0.0928 0.0124 0.0109 0.0125 0.0929 0.0124 0.0113 0.0125 0.0935 0.0129 0.0126 0.0130 Mean x1 (0.01526) SD SE RMSE 0.0180 0.0086 0.0084 0.0090 0.0178 0.0051 0.0048 0.0057 0.0180 0.0053 0.0049 0.0059 0.0156 0.0045 0.0043 0.0045 0.0156 0.0045 0.0045 0.0045 0.0158 0.0046 0.0063 0.0047 Mean x2 (0.01526) SD SE RMSE 0.0174 0.0083 0.0084 0.0086 0.0174 0.0050 0.0048 0.0051 0.0177 0.0054 0.0051 0.0055 0.0153 0.0044 0.0045 0.0045 0.0153 0.0044 0.0038 0.0045 0.0154 0.0047 0.0080 0.0048 Mean x3 (0.06106) SD SE RMSE 0.0359 0.0055 0.0060 0.0257 0.0706 0.0090 0.0091 0.0131 0.0713 0.0094 0.0094 0.0139 0.0620 0.0096 0.0097 0.0096 0.0620 0.0096 0.0107 0.0096 0.0624 0.0099 0.0115 0.0100 111 Table 3.4 The correct model is Ivprobit with  21  0.5,  22  0.5 True Value APE N=3000 Mean y2 (0.0919) SD SE RMSE 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet 0.1065 0.0041 0.0034 0.0151 0.1121 0.0055 0.0055 0.0210 0.1124 0.0055 0.0056 0.0212 0.0941 0.0071 0.0061 0.0074 0.0941 0.0071 0.0071 0.0074 0.0943 0.0071 0.0071 0.0075 Mean x1 (0.01531) SD SE RMSE 0.0176 0.0050 0.0049 0.0055 0.0177 0.0027 0.0028 0.0036 0.0177 0.0027 0.0028 0.0036 0.0157 0.0024 0.0025 0.0024 0.0157 0.0024 0.0037 0.0024 0.0157 0.0024 0.0036 0.0024 Mean x2 (0.01531) SD SE RMSE 0.0178 0.0049 0.0049 0.0055 0.0178 0.0028 0.0028 0.0037 0.0178 0.0029 0.0029 0.0039 0.0157 0.0025 0.0025 0.0026 0.0157 0.0025 0.0040 0.0026 0.0158 0.0026 0.0041 0.0026 Mean (0.06126) SD SE RMSE 0.0360 0.0031 0.0034 0.0254 0.0709 0.0052 0.0051 0.0110 0.0711 0.0053 0.0052 0.0111 0.0628 0.0054 0.0054 0.0057 0.0628 0.0054 0.0062 0.0057 0.0629 0.0055 0.0063 0.0057 112 Table 3.5 The correct model is Ivprobit with  21  0.1,  22  0.1 True Value APE N=1000 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet Mean y2 (0.0705) SD SE RMSE 0.0797 0.0479 0.0458 0.0488 0.0918 0.0107 0.0111 0.0238 0.1124 0.0055 0.0056 0.0422 0.0607 0.0293 0.0259 0.0309 0.0612 0.0297 0.0258 0.0311 0.0610 0.0302 0.0271 0.0317 Mean x1 (0.01176) SD SE RMSE 0.0130 0.0063 0.0065 0.0064 0.0132 0.0043 0.0041 0.0045 0.0177 0.0027 0.0028 0.0065 0.0095 0.0044 0.0052 0.0049 0.0096 0.0044 0.0052 0.0049 0.0096 0.0045 0.0053 0.0050 Mean x2 (0.01176) SD SE RMSE 0.0132 0.0067 0.0065 0.0069 0.0132 0.0042 0.0041 0.0044 0.0178 0.0029 0.0029 0.0067 0.0096 0.0044 0.0052 0.0049 0.0096 0.0045 0.0051 0.0049 0.0095 0.0049 0.0055 0.0054 Mean x3 (0.04702) SD SE RMSE 0.0178 0.0034 0.0046 0.0294 0.0536 0.0109 0.0106 0.0127 0.0711 0.0053 0.0052 0.0246 0.0390 0.0149 0.0185 0.0169 0.0392 0.0150 0.0183 0.0169 0.0391 0.0157 0.0190 0.0176 113 Table 3.6 The correct model is Ivprobit with  21  0.1,  22  0.1 True Value APE N=3000 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet Mean y2 (0.0713) SD SE RMSE 0.0798 0.0260 0.0258 0.0274 0.0924 0.0063 0.0062 0.0220 0.0927 0.0066 0.0065 0.0223 0.0673 0.0228 0.0112 0.0231 0.0676 0.0227 0.0113 0.0230 0.0675 0.0230 0.0115 0.0233 Mean x1 (0.0119) SD SE RMSE 0.0133 0.0037 0.0037 0.0040 0.0133 0.0025 0.0024 0.0029 0.0133 0.0025 0.0024 0.0029 0.0110 0.0030 0.0031 0.0031 0.0110 0.0030 0.0031 0.0031 0.0110 0.0031 0.0032 0.0032 Mean x2 (0.0119) SD SE RMSE 0.0134 0.0038 0.0037 0.0041 0.0133 0.0024 0.0024 0.0027 0.0133 0.0025 0.0025 0.0029 0.0109 0.0029 0.0031 0.0030 0.0109 0.0029 0.0031 0.0030 0.0109 0.0030 0.0033 0.0031 Mean x3 (0.04756) SD SE RMSE 0.0179 0.0019 0.0026 0.0297 0.0533 0.0059 0.0059 0.0082 0.0534 0.0061 0.0060 0.0085 0.0438 0.0098 0.0108 0.0105 0.0439 0.0098 0.0108 0.0104 0.0439 0.0100 0.0109 0.0106 114 Table 3.7 The correct model is Ivprobhet with 21  0.5, 22  0.5 True Value APE N=1000 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet Mean y2 (0.0893) SD SE RMSE 0.0188 0.0249 0.0177 0.0748 0.1109 0.0097 0.0100 0.0237 0.1122 0.0101 0.0106 0.0250 0.1065 0.0129 0.0117 0.0215 0.0896 0.0203 0.0225 0.0203 0.0900 0.0210 0.0220 0.0211 Mean x1 (0.01488) SD SE RMSE 0.0172 0.0125 0.0124 0.0127 0.0175 0.0049 0.0047 0.0055 0.0177 0.0050 0.0047 0.0057 0.0172 0.0048 0.0046 0.0053 0.0150 0.0049 0.0060 0.0049 0.0150 0.0050 0.0060 0.0050 Mean x2 (0.01488) SD SE RMSE 0.0175 0.0124 0.0124 0.0127 0.0172 0.0046 0.0047 0.0051 0.0174 0.0048 0.0048 0.0054 0.0168 0.0046 0.0046 0.0050 0.0147 0.0046 0.0062 0.0046 0.0147 0.0048 0.0060 0.0048 Mean x3 (0.05953) SD SE RMSE 0.0451 0.0067 0.0088 0.0159 0.0698 0.0080 0.0080 0.0131 0.0706 0.0083 0.0083 0.0139 0.0686 0.0082 0.0085 0.0122 0.0592 0.0120 0.0186 0.0120 0.0595 0.0124 0.0178 0.0124 115 Table 3.8 The correct model is Ivprobhet with 21  0.5, 22  0.5 True Value APE N=3000 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet Mean y2 (0.0896) SD SE RMSE 0.0164 0.0141 0.0101 0.0745 0.1108 0.0056 0.0056 0.0220 0.1111 0.0056 0.0056 0.0222 0.1075 0.0070 0.0062 0.0192 0.0911 0.0123 0.0115 0.0124 0.0913 0.0124 0.0117 0.0126 Mean x1 (0.01492) SD SE RMSE 0.0176 0.0073 0.0071 0.0078 0.0173 0.0027 0.0027 0.0036 0.0173 0.0028 0.0027 0.0037 0.0172 0.0027 0.0027 0.0035 0.0152 0.0028 0.0037 0.0028 0.0152 0.0028 0.0037 0.0028 Mean x2 (0.01492) SD SE RMSE 0.0168 0.0071 0.0071 0.0074 0.0171 0.0027 0.0027 0.0035 0.0172 0.0027 0.0027 0.0035 0.0170 0.0027 0.0027 0.0034 0.0150 0.0028 0.0035 0.0028 0.0150 0.0029 0.0036 0.0029 Mean x3 (0.05971) SD SE RMSE 0.0453 0.0039 0.0050 0.0149 0.0698 0.0046 0.0045 0.0111 0.0699 0.0047 0.0045 0.0112 0.0693 0.0047 0.0046 0.0107 0.0607 0.0072 0.0102 0.0072 0.0607 0.0072 0.0103 0.0073 116 Table 3.9 The correct model is Ivprobhet with  21  0.1,  22  0.1 True Value APE N=1000 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet Mean y2 (0.08054) SD SE RMSE -0.0841 0.6349 0.6699 0.6559 0.0982 0.0092 0.0095 0.0199 0.0993 0.0094 0.0100 0.0210 0.0784 0.0286 0.0495 0.0286 0.0798 0.0175 0.0159 0.0176 0.0801 0.0181 0.0168 0.0181 Mean x1 (0.01342) SD SE RMSE 0.0131 0.0687 0.0815 0.0687 0.0155 0.0045 0.0044 0.0049 0.0156 0.0046 0.0045 0.0051 0.0117 0.0051 0.0064 0.0054 0.0132 0.0044 0.0048 0.0044 0.0132 0.0045 0.0049 0.0045 Mean x2 (0.01342) SD SE RMSE 0.0161 0.0589 0.0724 0.0589 0.0155 0.0044 0.0044 0.0048 0.0157 0.0047 0.0046 0.0052 0.0117 0.0050 0.0065 0.0053 0.0132 0.0044 0.0048 0.0044 0.0133 0.0046 0.0050 0.0046 Mean x3 (0.05369) SD SE RMSE 0.0415 0.0383 0.0505 0.0402 0.0624 0.0076 0.0077 0.0116 0.0632 0.0077 0.0080 0.0122 0.0474 0.0159 0.0214 0.0171 0.0526 0.0111 0.0125 0.0111 0.0528 0.0113 0.0131 0.0114 117 Table 3.10 The correct model is Ivprobhet with  21  0.1,  22  0.1 True Value APE N=3000 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet Mean y2 (0.0809) SD SE RMSE -0.2341 0.5626 0.4575 0.6448 0.0986 0.0051 0.0053 0.0184 0.0988 0.0051 0.0053 0.0186 0.0869 0.0227 0.0374 0.0235 0.0822 0.0108 0.0083 0.0109 0.0823 0.0109 0.0084 0.0110 Mean x1 (0.01348) SD SE RMSE 0.0170 0.0412 0.0404 0.0413 0.0156 0.0026 0.0025 0.0034 0.0157 0.0026 0.0026 0.0034 0.0123 0.0038 0.0042 0.0039 0.0137 0.0026 0.0028 0.0026 0.0137 0.0026 0.0028 0.0027 Mean x2 (0.01348) SD SE RMSE 0.0130 0.0472 0.0512 0.0472 0.0156 0.0027 0.0025 0.0034 0.0157 0.0028 0.0026 0.0035 0.0123 0.0039 0.0042 0.0040 0.0137 0.0027 0.0028 0.0027 0.0138 0.0028 0.0028 0.0028 Mean x3 (0.05393) SD SE RMSE 0.0416 0.0317 0.0329 0.0340 0.0631 0.0043 0.0043 0.0101 0.0632 0.0043 0.0043 0.0102 0.0498 0.0132 0.0144 0.0138 0.0549 0.0067 0.0070 0.0067 0.0550 0.0067 0.0070 0.0068 118 Table 3.11 The correct model is HetIVprobhet with 21  0.5, 22  0.5 True Value APE N=1000 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet Mean y2 (0.12227) SD SE RMSE 0.0220 0.0241 0.0177 0.1035 0.1103 0.0065 0.0064 0.0140 0.1661 0.0210 0.0208 0.0482 0.1035 0.0128 0.0108 0.0230 0.0896 0.0203 0.0225 0.0388 0.1389 0.0324 0.0645 0.0362 Mean x1 (0.02045) SD SE RMSE 0.0168 0.0120 0.0124 0.0126 0.0167 0.0067 0.0066 0.0077 0.0263 0.0070 0.0067 0.0091 0.0165 0.0066 0.0065 0.0077 0.0150 0.0049 0.0060 0.0073 0.0230 0.0071 0.0135 0.0075 Mean x2 (0.02045) SD SE RMSE -0.0058 0.0126 0.0124 0.0291 -0.0101 0.0086 0.0062 0.0317 0.0133 0.0080 0.0071 0.0107 -0.0099 0.0085 0.0062 0.0315 0.0147 0.0046 0.0062 0.0074 0.0114 0.0074 0.0111 0.0117 Mean x3 (0.08186) SD SE RMSE 0.0452 0.0069 0.0088 0.0373 0.0659 0.0090 0.0078 0.0183 0.1053 0.0158 0.0151 0.0283 0.0653 0.0090 0.0079 0.0188 0.0592 0.0120 0.0186 0.0256 0.0913 0.0198 0.0465 0.0220 119 Table 3.12 The correct model is HetIVprobhet with 21  0.5, 22  0.5 True Value APE N=3000 2SLS Probit Hetprob Ivprobit IVprobhet HetIVprobhet 0.0201 0.0136 0.0101 0.1038 0.1105 0.0038 0.0036 0.0130 0.1604 0.0106 0.0105 0.0389 0.1045 0.0069 0.0061 0.0198 0.0881 0.0108 0.0068 0.0365 0.1386 0.0181 0.0255 0.0239 1 0.917 0.947 0.814 .881 0.140 0.0174 0.0072 0.0071 0.0079 0.0168 0.0039 0.0038 0.0054 0.0256 0.0037 0.0035 0.0063 0.0167 0.0039 0.0038 0.0054 0.0152 0.0036 0.0085 0.0065 0.0232 0.0038 0.0060 0.0047 Mean x2 (0.0205) SD SE RMSE -0.0067 0.0075 0.0071 0.0282 -0.0105 0.0051 0.0036 0.0314 0.0140 0.0043 0.0035 0.0078 0.0104 0.0051 0.0036 0.0313 -0.0095 0.0047 0.0132 0.0304 0.0125 0.0043 0.0044 0.0091 Mean x3 (0.082) SD SE RMSE 0.0456 0.0039 0.0050 0.0367 0.0660 0.0052 0.0044 0.0168 0.1025 0.0081 0.0077 0.0220 0.0658 0.0052 0.0045 0.0170 0.0591 0.0062 0.0117 0.0237 0.0922 0.0109 0.0203 0.0149 Mean y2 (0.123) SD SE RMSE Mean Rejection Rate Mean x1 (0.0205) SD SE RMSE 120 Table 3.13 Data description Definition Mean Standard deviations 1 if wife is in labor force non-wife income, $1000s wife's year of schooling 1 if have child age <6 years 1 if have child age >=6 years experience(age - education - 6) 0.583 30.269 12.984 0.279 0.308 20.444 0.493 27.212 2.165 0.449 0.462 10.445 huseduc exper squared husband's year of schooling 527.042 13.147 468.289 2.977 husexper husband age - huseduc-6 23.305 11.761 Variable inlf nwifeinc educ kidlt 6 kidge6 exper exper 2 Table 3.14 Estimation of Reduced form Equation nwifeinc Mean Equation husexper huseduc constant Variance Equation huseduc husexper constant Coefficient SE z P>z 0.360 3.454 -23.500 0.032 0.121 1.952 11.340 28.520 -12.040 0 0 0 0.101 0.016 1.464 0.003 0.001 0.057 30.270 16.400 25.620 0 0 0 121 Table 3.15 Empirical Result of APEs with analytical standard errors. OLS inlf 2SLS inlf Probit inlf VARIABLES nwifeinc -0.00331*** -0.00195** [0.00023] [0.00095] educ 0.0353*** 0.0318*** [0.00257] [0.00356] exper 0.00245 0.00077 0.00261 0.00289 expersq -0.00021*** -0.00018*** [0.00006] [-0.00006] kidlt6 -0.171*** -0.175*** [0.01794] [0.01860] kidge6 0.01660 0.01330 [0.0166] [0.01690] 122 Ivprobit inlf -0.00328*** -0.00696*** -0.00316*** [0.00023] [0.00115] [0.00084] 0.03607*** 0.07641*** 0.03579*** [0.00261] [0.01165] [0.00324] -0.00611*** -0.00875*** -0.00615*** 0.00077 0.00170 0.00255 -0.17416*** -0.14538*** -0.17413*** [0.01827] [0.03763] [0.01827] 0.01467 0.01310 0.01473 [0.01852] [0.04340] [0.00550] Observations 5,634 5,634 5,634 *, **, ***: significant at 10%, 5% and 1% level respectively. Adjusted analytical standard errors are in the [ ]. Hetprob inlf 5,634 5,634 Table 3.16 Empirical Result of APEs with analytical standard errors and Bootstrap standard errors IVprobhet inlf HetIVprobhet1 inlf HetIVprobhet2 HetIVprobhet3 inlf inlf VARIABLES nwifeinc -0.00205 -0.00459* -0.00534*** [0.00210] [0.00285] (0.001614) (0.001509) educ 0.03372*** 0.07002*** 0.07183*** [0.00442] [0.01434] (0.012299) (0.012102) exper -0.00653 -0.00924*** -0.00927*** [0.00528] [0.00287] (0.001761) (0.001819) kidlt6 -0.17357*** -0.19280** -0.19262*** [0.03848] [0.05996] (0.026883) (0.001819) kidge6 0.01546 0.06580** 0.06123** [0.01123] [0.02353] (0.038130) (0.037438) Observations 5,634 5,634 5,634 *, **, ***: significant at 10%, 5% and 1% level respectively. -0.00490*** ( 0.001913) 0.08265*** (0.013432) -0.01048*** (0.002094) -0.17163*** (0.025359) 0.07777** (0.034927) 5,634 Adjusted analytical standard errors are in the [ ]. Bootstrap standard errors are in the ( ). 123 APPENDICES 124 APPENDICES APPENDIX A: Asymptotic Variance for the two-step estimators In order to obtain the standard errors for estimated APEs, first we need to compute the adjusted standard errors of the estimated parameters. Our estimator is a common two-step Mestimator, where the first stage estimation is maximum likelihood estimation. Then, the estimated standardized residuals are plugged in to the maximum likelihood estimation(MLE) in the second stage. The adjustment of asymptotic variance is well known and explained in textbooks as Greene (2008) and Wooldridge (2010) with minor differences in either using outer product of the score to approximate the information matrix or using the information matrix directly. In this paper, it follows the step as in Murphy and Topel (1985) for the adjustment. That is, it uses the outer product approximation to the information matrix. The standardized residuals come from the first stage MLE that solves ˆ ˆ (2 ,  2 ) n ˆ    arg max   i (  2 ,  2 ) . i 1  2 , 2 (A.1) ~ ˆ In the second step, the estimated standardized residuals  2i are plugged in the other MLE estimation. Hence, the first stage problem is n ˆ ˆ ˆ ˆ ˆ ˆ ( 1, , ,1)    arg max  li ( 1, ,  , 1; ) , 1, ,1 i 1 (A.2) and this objective function depends on the estimated parameters from the first stage. Let ( 0 ,  0 ) be the true value of the estimated parameters. Then, the following score, outer product of the score and Hessian matrix terms are  i ( )   i ( ) , and   li ( ,  )  li ( ,  ) .  (A.3) 125 Outer products of the score are O 1 n    p lim  l ( 0 ,  0 )li ( 0 ,  0 )' , n i 1 i n  (A.4) O 1 n    p lim  l ( ,  ) ( )' , i 1 i 0 0 i 0 n  n (A.5) O 1 n    p lim   i ( 0 )li ( 0 ,  0 )' , i 1 n  n (A.6) O 1 n    p lim   i ( 0 ) i ( 0 )' . i 1 n  n (A.7) Hessian matrices are H 2 1 n  li ( 0 ,  0 ) ,   p lim  n n i 1   ' (A.8) H 2 1 n  li ( 0 ,  0 ) ,   p lim  n n i 1   ' (A.9) H  2 1 n  li ( 0 ,  0 ) ,   p lim  n n i 1   ' (A.10) H  2 1 n   i ( 0 ) .   p lim  n n i 1   ' (A.11) Then, let assume that the sample scores are normally distributed asymptotically with the following mean zero and variance covariance matrix:     O 1 n  li ( 0 ,  0 )  d   N 0,      n   O  i 1  i ( 0 )    O   . O   126 (A.12) From the first step maximum likelihood estimation of normal distribution with exponential heteroskedasticity, 0 the mean value expansion theorem around 0 and 1 n  1 n  ˆ  i ( )     i ( 0 )  H n (ˆ   0 )  o p (1) . n n i 1 i 1 That is the ˆ n (   0 ) same as the asymptotic result under ˆ gives (A.13) standard regularity n 1 1  H  i 1 i ( 0 )  o p (1) . n condition (A.14) Then, applying the same mean value expansion to the second-step estimation of its first-order ˆ conditions around  0 and  gives the required result for variance adjustment: 0  1 in1li (ˆ , ˆ ) , n 1 n i 1li (0 ,  0 )  H n ˆ ˆ n (   0 )  H n (   0 )  o p (1) . (A.15) Then, n n  ( ,  )  H * H 1 1 ˆ  )  1 H n (  li 0 0   n  i ( 0 )  o p (1) 0 n i 1 i 1 (A.16) and 1 1 1 ˆ n (   0 )  H ( I , H * H  ) n   n  li ( 0 ,  0 )   o p (1) . i 1    ( 0 )    i  (A.17) Hence, d ˆ n (   0 )   N 0,V  , (A.18) where 127 V 1 1  O  H ( I , H H  )  O  O  I  1H O   H     1 H ,    1 1 1 1 1 1  H O  H  H O H  H  H H O  O H H  H . (A.19)         If the expected value outer product of the score is equal to the information matrix then, V 1 1 1 1 1 1  H  H  H H H  H H O  O H H  H .     (A.20) This is equivalent to equation (3.34) in Murphy and Topel (1985) and comparable to equation 12.34 in Wooldridge (2010). ˆ In order to estimate the adjusted standard variances of each parameters V , each elements of the above equation need to be estimated. 1 1) H is the asymptotic variance of naïve estimator of the second-step estimation without taken into account the first stage estimation and can be estimated by n  2l ( ,  ) 1 n l ( ,  )  l ( ,  ) ' ˆ ˆ ˆ ˆ ˆ ˆ 1   1 i ˆ H    '  n  i *  i  .   n   i 1 i 1 (A.21) 2) H and H  are the derivative of the score from second-step estimation with respect to first stage parameter and its transpose, respectively. They, can be estimated by ˆ ˆ ˆ ˆ ˆ ˆ ' 1 n  2li ( ,  ) 1 n li ( ,  )  li ( ,  )  ˆ  . H      *    n   ' n    i 1 i 1 (A.22) 1 ˆ 3) H  is the asymptotic variance of first stage estimation parameters n (   0 ) . 1 ˆ H    n n  2 ( ) i ˆ    ' i 1 (A.23) 4) O and O are the multiplicative terms of the score in the first stage and second stage and its transpose. They can be estimated by 128 n  ( )  l ( ,  )  ˆ ˆ ˆ ˆ 1 O  i *  i   .   n   i 1 (A.24) This estimated adjusted variance might not be robust when the underlying distribution is not correctly specified. However, Hardin (2002) showed that the robust version of Murphy Topel estimation. This paper doses not adopt such formula since it will not be equivalent to result of Ackerberg et.al. (2012). To obtain score and Hessian from first-step estimation, the log likelihood in matrix form is as follows   N N 1 2 ln(2 )  ln(h2 ( z 2 ))  ( y  z 2 )' ( y2  z 2 ) . 2 ( z ) 2 2 2 2h2 2 (A.25) Hence, the score function can be represented by gradient vector   ( )     ( )    2   0 ,  0   ( )     2   (A.26)  ( ) 1 z' y  z' z 2  .  2 ( z )  2 h2 2 (A.27) For  ( ) , assume that the heteroskedasticity term is multiplicative with exponential function  2 exp(2z 2 ) ,  ( )  2   Z ' ( 2 * z' ) ( y2  z 2 )' ( y2  z 2 ) . exp(2z 2 ) From the gradient matrix, the Hessian matrix is 129 (A.28) H ( )   2 ( )    ( )   2  2 '    2     ( )   2  2 '   2 ( )    2  2 '  ,  2 ( )   2  2 '    2 ( )  2  2 '  1 - z'z , exp(2z 2 ) (A.30)  2 ( )  2  2 '  ( 2 * z' )z' y  z ' z 2  , exp(2 z 2 ) (A.31)  2 ( )  2  2 '  ( 2 * z' )z' y  z' z 2 ' , exp(2z 2 ) (A.32)  2 ( )  2  2 '  ( 4 * z' z) ( y2  z 2 )' ( y2  z 2 ) . exp(2z 2 ) (A.33) (A.29) The information matrix is the negative of the expectation of the Hessian I ( )  E[ H ( )z] . ˆ Then, the estimation of Avar(   0 ) ˆ 1 H    2 ( ) i ˆ n  1   '   2 2  2 n ˆ i 1    i ( )   2  2 '  ˆ  2 i ( )    2  2 '   ˆ  2 i ( )   2  2 '   In the second step, the estimation of 1 . (A.34) θ  ( 1' , 1' ,  )   /( 1   2 )' comes from the ~ ~ ˆ ˆ heteroskedastic probit of y1i on ( z1i , y2i ,  2i ) , where  2i is the standardized error from the 2 first stage estimation and h1 ( z11i1) is the heteroskedastic variance. Assuming that 2 h1 ( z11i1)  exp(2z11i1) . The log-likelihood function for each individual observation i is 130   z1i 1   y2i  ~  ˆ ˆ )  y1i log  li ( 1,  ,  , 1;    2i   exp(z  ) 1   2  11i 1       z1i 1   y2i  ~  . ˆ  (1  y1i ) log1     2i    exp(z  ) 1   2   11i 1    (A.35) The score function and gradient vector is  ˆ li ( ,  ) ' ˆ li ( ,  ) / 1   ˆ ˆ  li ( ,  )  li ( ,  ) /     0. ˆ  li ( ,  ) /      ˆ li ( ,  ) / 1  (A.36)   z1i 1   y2i   ˆ Define A     2i  , Wi  (z1i , y2i ) , and   ( 1' , )' then  exp(z  ) 1   2  11i 1   ˆ li ( ,  ) / 1  ˆ li ( ,  ) /   ˆ li ( ,  ) /    ( A)z1i '  y2i  Φ( A) Φ( A)(1  Φ( A)) exp(z11i1) 1   2  ( A) y2i '  y2i  Φ( A) Φ( A)(1  Φ( A)) exp(z11i1) 1   2 ˆ  ( A) 2i ' y2i  Φ( A) ˆ li ( ,  ) / 1   Φ( A)(1  Φ( A)) , (A.37) , (A.38) , (A.39)  ( A)z11i ' w i y2i  Φ( A) Φ( A)(1  Φ( A)) exp(z11i1) 1   2 . (A.40) 1 The inverse of Hessian H for the heteroskedastic probit is a mess as pointed out in Wooldridge (2010). Hence, we adopt the result from his example 13.1,  EHi (0 ) zi  is a 4x4 131 partitioned matrix, where zi contains both endogenous, exogenous, and instrument variables, that have simple form. ˆ Using product rules, taking derivatives of li ( ,  ) / 1 with respect to 1 yields ˆ  2li ( ,  ) 1 1'   ( A)z1i ' Φ( A)(1  Φ( A)) exp(z11i1) 1   2 * [ y2i  Φ( A)] 1  z1i ' ( A)    Φ( A)(1  Φ( A)) exp(z  ) 1   2 11i 1  [ y2i  Φ( A)] *  1     . (A.41) Evaluating the above expression at  0 and take the expectation, the second term will be equal to zero, then first diagonal element, [1,1], of the  E H i ( 0 ) zi   { ( A)}2 z1i ' z1i   Φ( A)(1  Φ( A)) exp(z11i1 ) 1   2    2 . (A.42) The [1,2] element is { ( A)}2 z1i ' y2i   Φ( A)(1  Φ( A)) exp(z11i1 ) 1   2    2 . (A.43) The [1,3] element is ˆ { ( A)}2 z1i '  2i Φ( A)(1  Φ( A)) exp(z11i1) 1   2 . (A.44) 132 The [1,4] element is  { ( A)}2 z1i ' w iz11i   Φ( A)(1  Φ( A)) exp(z11i1) 1   2    2 . (A.45) The [2,1] element is { ( A)}2 y2i' z1i   Φ( A)(1  Φ( A)) exp(z11i1) 1   2    2 . (A.46) . (A.47) The [2,2] element is { ( A)}2 y2i' y2i   Φ( A)(1  Φ( A)) exp(z11i1) 1   2    2 The [2,3] element is ˆ { ( A)}2 y2i' 2i Φ( A)(1  Φ( A)) exp(z11i1) 1   2 . (A.48) The [2,4] element is  { ( A)}2 y2i'wi z11i   Φ( A)(1  Φ( A)) exp(z11i1 ) 1   2    2 . (A.49) The [3,1] element is ˆ { ( A)}2  2i' z1i Φ( A)(1  Φ( A)) exp(z11i1) 1   2 . (A.50) The [3,2] element is 133 ˆ { ( A)}2  2i' y2i Φ( A)(1  Φ( A)) exp(z11i1) 1   2 . (A.51) The [3,3] element is ˆ ˆ { ( A)}2  2i'  2i . Φ( A)(1  Φ( A)) (A.52) The [3,4] element is  ˆ { ( A)}2  2i'wi z11i Φ( A)(1  Φ( A)) exp(z11i1) 1   2 . (A.53) The [4,1] element is  { ( A)}2 ( w i z11i )' z1i   Φ( A)(1  Φ( A)) exp(z11i1) 1   2    2 . (A.54) . (A.55) The [4,2] element is  { ( A)}2 ( w i z11i )' y2i   Φ( A)(1  Φ( A)) exp(z11i1) 1   2    2 The [4,3] element is  ˆ { ( A)}2 ( w i z11i )'  2i Φ( A)(1  Φ( A)) exp(z11i1 ) 1   2 . (A.56) 134 The [4,4] element is  { ( A)}2 ( w i z11i )' w i z11i   Φ( A)(1  Φ( A)) exp(z11i1) 1   2    2 . (A.57) Finally, the last element need for the two-step adjustment is ˆ li ( ,  )  ˆ l ( ,  ) /  2   i , ˆ li ( ,  ) /  2  ˆ li ( ,  )  2   ( A)zi '[ y2i  Φ( A)] , Φ( A)(1  Φ( A)) exp(zi 2 ) (A.59) ˆ li ( ,  )  2   ( A)( y2 - zi  2 )zi '[ y2i  Φ( A)] . Φ( A)(1  Φ( A)) exp(zi 2 ) (A.60) Plugging in the estimated coefficient of (A.58) ˆ ˆ ˆ  ,  , and ˆ1 yields the estimated variance V . 135 Appnedix B: Asymptotic Variance for the APEs B.1 Standard errors for the continuous explanatory variables. ˆ ˆ ˆ ˆ First, we need to obtain asymptotic variance of N (η - η) , where η  g( z1, y2 ,(  , )) is partial derivative of ˆ ASF  f ( z1, y2 ) with respect to z1 , y2 and η  g( z1, y2 ,(  , )) is partial derivative of ASF  f ( z1, y2 ) with respect to z1 , y2 . For example the APEs for continuous y2 is 1 ˆ ηy   2 N A  . 2 i 1 ( A) * y n (A.61) Then, we can insert the interesting value of y2 or further average out y2i . Moreover, this set up leads to the dilemma as stated in Wooldridge (2010) chapter 15, if either y2i , subset of z1i , or ~ ˆ  2i enters into the heteroskedastic function, the sign and magnitude of APEs might not be the same as estimated parameters. For η , its value depends on where the value of z1i , y2i ,  2i evaluates, so in order to get single number for each variable, we average them over z11i and  2i . 1 ˆ ηy  2 N n 1 A  . 2 i 1  N i 1 ( A) * y  n (A.62) Using problem 12.17 and 15.15 in Wooldridge (2010), by applying mean value expansion theorem, n ˆ     1 n ˆ , ))  1  g( z1, y2 ,(  ˆ  g( z1, y2 ,(  , )) G N *  ˆ   0   o p (1) ,   N N 0  i 1 i 1 (A.63) where G  E[ g( z1, y2 ,(  , ))  g( z1, y2 ,(  , ))] . For example, assumed that y2i does not enter into the variance, the Jacobian of g( z1, y2 ,(  , )) of y2 or APE of y2 with respect to second-step parameter θ is 136  A    ( A) * y2  ,  g( z1 , y2 ,(  , ))    (A.64)  A    ( A) *  y2  ,  β g( z1 , y2 ,(  , ))   1 β1 (A.65)  A    ( A) * y2   ,  g( z1 , y2 ,(  , ))   (A.66)  A    ( A) *  y2   .  g( z1 , y2 ,(  , ))  1 1 (A.67) The Jacobian of APE of y2 with respect to first-step parameter γ is  A    ( A) *  y2   ,  β g( z1 , y2 ,(  , ))  2 β2 (A.68)  A    ( A) *  y2   .  g( z1 , y2 ,(  , ))  2  2 For (A.69) ˆ    0   N *      , it is just a stack of the both equation ˆ 0  1 1 1 ˆ n (   0 )  H ( I , H * H  ) n 1 1 ˆ n (   0 )  H n   n  li ( 0 ,  0 )   o p(1) and i 1    ( 0 )    i  i 1 i ( 0 )  o p(1) . n Hence, we can write, 137 ˆ    0  1 n  N *       n i 1ei ( 0 ,  0 )  o p(1) .  ˆ 0 (A.70) Thus, ˆ ˆ ˆ C  Avar N (η - η)  Var{ g( z1 , y2 ,(  , )) - η - Gei ( 0 ,  0 )} (A.71) A consistent estimator for the asymptotic variance of APEs is N ˆ 1 ˆˆ ˆˆ ˆ ˆ C  {[ gi ( z1, y2 ,( ˆ ,ˆ )) - η - Gei ( 0 ,  0 )}{ gi ( z1, y2 ,( ˆ ,ˆ )) - η - Gei ( 0 ,  0 )]}' N i 1 (A.72) By construction, the conditional expected values of the score function of both first and second ˆ ˆ stage estimation equals to zero, then, g( z1, y2 ,(  , )) will not be correlated with Gei ( 0 ,  0 ) , which means ˆ     0  ˆ ˆ ˆ  ˆ C  Var ( gi )  G  Avar N *      G'   ˆ 0    (A.73) Hence, asymptotic variance of APEs depends on variance of itself plus positive definite matrix of the adjustment term. For standard error for any particular APE, it will be the square root of corresponding diagonal element of the above equation, divided by N. B.2 Standard errors for the binary exogenous explanatory variables. Assumed, that there is one element of z1 is binary exogenous variables z1 with coefficient  . Hence, the APE of z1 can be written as h  APE (z1)  Ez , [( A)  ( A  z1i )] 11i 2i For the estimate of APE (z1) ,it is 138 (A.74) 1 N ˆ  [( A)  ( A  z1i ]) N i 1 (A.75) Then, applying mean value expansion theorem as we did before in (A.63) ˆ     1 N 1 n ˆ ˆ h( z1 , y2 ,(  , ))    h( z1, y2 ,(  , )) H N *  ˆ   0   o p (1)   N N 0  i 1 i 1 (A.76) where H  E[ h( z1, y2 ,(  , ))  h( z1, y2 ,(  , ))] ˆ ˆ If z1 does not enter into the variance, the Jacobian of h( z1, y2 ,(  , ) or APE of z1 with respect to second-step parameter  is  h( z1 , y2 ,(  , ))  ˆ [( A)  ( A  z1i )] ,   β h( z1 , y2 ,(  , ))  1 (A.77) ˆ ( A)  ( A  z1i ) , β1 (A.78)  h( z1 , y2 ,(  , ))  ˆ ( A)  ( A  z1i ) ,  (A.79)  h( z1 , y2 ,(  , ))  1 ˆ ( A)  ( A  z1i ) . 1 (A.80) The Jacobian of APE of z1 with respect to first-step parameter γ is  β h( z1 , y2 ,(  , ))  2 ˆ ( A)  ( A  z1i ) , β2 (A.81)  h( z1 , y2 ,(  , ))  2 ˆ ( A)  ( A  z1i ) .  2 (A.82) Following the same development as previous section, the asymptotic variance of APEs is ˆ ˆ ˆ D  Avar N ( - )  Var{h( z1, y2 ,(  , )) - - Hei ( 0 ,  0 )} A consistent estimator for the asymptotic variance of APEs is 139 (A.83) N ˆ 1 ˆ ˆ ˆˆ ˆˆ C  {[hi ( z1, y2 ,( ˆ ,ˆ )) - - Hei ( 0 ,  0 )}{hi ( z1, y2 ,( ˆ ,ˆ )) - - Hei ( 0 ,  0 )]}' N i 1 (A.84) Therefore, we can write the asymptotic variance as ˆ     0  ˆ ˆ ˆ  ˆ D  Var ( hi )  H  Avar N *       H' ˆ 0      (A.85) The standard errors of each respective APE will be the square root of corresponding diagonal element of the above equation, divided by N . If continuous exogenous variable enters into conditional variance, APEs will be more complicated and similar to equation (15.76) in Wooldridge (2010). 140 REFERENCES 141 REFERENCES Ackerberg, Daniel, Xiaohong Chen, and Jinyong Hahn. “A Practical Asymptotic Variance Estimator for Two-step Semiparametric Estimators.” Review of Economics and Statistics 94, no. 2 (2012): 481–498. Angrist, Joshua D., and William N. Evans. Children and Their Parents’ Labor Supply: Evidence from Exogenous Variation in Family Size. National Bureau of Economic Research, 1996. Averett, Susan L., and Julie L. Hotchkiss. “Female Labor Supply with a Discontinuous, Nonconvex Budget Constraint: Incorporation of a Part-time/full-time Differential.” Review of Economics and Statistics 79, no. 3 (1997): 461–470. Wage Blundell, Richard, and James L. Powell. “Endogeneity in Nonparametric and Semiparametric Regression Models.” ECONOMETRIC SOCIETY MONOGRAPHS 36 (2003): 312–357. Blundell, Richard W., and James L. Powell. “Endogeneity in Semiparametric Binary Response Models.” The Review of Economic Studies 71, no. 3 (2004): 655–679. Blundell, Richard W., and Richard J. Smith. “Estimation in a Class of Simultaneous Equation Limited Dependent Variable Models.” The Review of Economic Studies 56, no. 1 (1989): 37–57. Card, David. “Estimating the Return to Schooling: Progress on Some Persistent Econometric Problems.” Econometrica 69, no. 5 (2001): 1127–1160. Chen, Xiaohong. “Large Sample Sieve Estimation of Semi-nonparametric Models.” Handbook of Econometrics 6 (2007): 5549–5632. Chesher, Andrew. “Instrumental Variable Models for Discrete Outcomes.” Econometrica 78, no. 2 (2010): 575–601. Cruces, Guillermo, and Sebastián Galiani. “Fertility and Female Labor Supply in Latin America: New Causal Evidence.” Labour Economics 14, no. 3 (2007): 565–573. Dong, Yingying, and Arthur Lewbel. “A Simple Estimator for Binary Choice Models with Endogenous Regressors.” Econometrics Reviews, Forthcoming (2012). Florens, Jean-Pierre, James J. Heckman, Costas Meghir, and Edward Vytlacil. “Identification of Treatment Effects Using Control Functions in Models with Continuous, Endogenous Treatment and Heterogeneous Effects.” Econometrica 76, no. 5 (2008): 1191–1206. Gourieroux, Christian, Alain Monfort, and Alain Trognon. “Pseudo Maximum Likelihood Methods: Theory.” Econometrica: Journal of the Econometric Society (1984): 681–700. 142 Greene, William H. “Econometric Analysis, 5th.” Ed.. Upper Saddle River, NJ (2003). Heckman, James J. “The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models.” In Annals of Economic and Social Measurement, Volume 5, Number 4, 475–492. NBER, 1976. Heckman, James J., and Edward Vytlacil. “Structural Equations, Treatment Effects, and Econometric Policy Evaluation1.” Econometrica 73, no. 3 (2005): 669–738. Heckman, James, and Edward Vytlacil. “Instrumental Variables Methods for the Correlated Random Coefficient Model: Estimating the Average Rate of Return to Schooling When the Return Is Correlated with Schooling.” Journal of Human Resources (1998): 974–987. ———. “Instrumental Variables Methods for the Correlated Random Coefficient Model: Estimating the Average Rate of Return to Schooling When the Return Is Correlated with Schooling.” Journal of Human Resources (1998): 974–987. Hoderlein, Stefan. Endogenous Semiparametric Binary Choice Models with Heteroscedasticity. cemmap working paper, 2009. Honoré, Bo E., and Arthur Lewbel. “Semiparametric Binary Choice Panel Data Models Without Strictly Exogeneous Regressors.” Econometrica 70, no. 5 (2002): 2053–2063. Horowitz, Joel L. “A Smoothed Maximum Score Estimator for the Binary Response Model.” Econometrica: Journal of the Econometric Society (1992): 505–531. Imbens, Guido, and Jeffrey Wooldridge. “Control Function and Related Methods.” What’s New in Econometrics (2007). Khan, Shakeeb. “Distribution Free Estimation of Heteroskedastic Binary Response Models Using Probit/Logit Criterion Functions.” Journal of Econometrics 172, no. 1 (2013): 168–182. Klein, Roger, Chan Shen, and Francis Vella. “Triangular Semiparametric Models Featuring Two Dependent Endogenous Binary Outcomes.” Unpublished Manuscript (2010). Lewbel, A., and Y. Dong. “T. Yang (2012),‘ Comparing Features of Convenient Estimators for Binary Choice Models With Endogenous Regressors,’ Forthcoming.” Canadian Journal of Economics. Lewbel, Arthur. “Semiparametric Latent Variable Model Estimation with Endogenous or Mismeasured Regressors.” Econometrica (1998): 105–121. ———. “Semiparametric Qualitative Response Model Estimation with Unknown 143 Heteroscedasticity or Instrumental Variables.” Journal of Econometrics 97, no. 1 (2000): 145–177. Manski, Charles F. “Maximum Score Estimation of the Stochastic Utility Model of Choice.” Journal of Econometrics 3, no. 3 (1975): 205–228. Mroz, Thomas A. “The Sensitivity of an Empirical Model of Married Women’s Hours of Work to Economic and Statistical Assumptions.” Econometrica (1987): 765–799. Murtazashvili, Irina, and Jeffrey M. Wooldridge. “Fixed Effects Instrumental Variables Estimation in Correlated Random Coefficient Panel Data Models.” Journal of Econometrics 142, no. 1 (2008): 539–552. Newey, Whitney K., and James L. Powell. “Instrumental Variable Estimation of Nonparametric Models.” Econometrica 71, no. 5 (2003): 1565–1578. Newey, Whitney K., James L. Powell, and Francis Vella. “Nonp arametric Estimation of Triangular Simultaneous Equations Models.” Econometrica 67, no. 3 (1999): 565–603. Papke, Leslie E., and Jeffrey M. Wooldridge. “Panel Data Methods for Fractional Response Variables with an Application to Test Pass Rates.” Journal of Econometrics 145, no. 1 (2008): 121–133. Petrin, Amil, and Kenneth Train. “A Control Function Approach to Endogeneity in Consumer Choice Models.” Journal of Marketing Research 47, no. 1 (2010): 3–13. Rivers, Douglas, and Quang H. Vuong. “Limited Information Estimators and Exogeneity Tests for Simultaneous Probit Models.” Journal of Econometrics 39, no. 3 (1988): 347–366. Staiger, Douglas, and James H. Stock. “Instrumental Variables Regression with Weak Instruments.” Econometrica 65, no. 3 (1997): 557–586. Vella, Francis, and Marno Verbeek. “Two-step Estimation of Panel Data Models with Censored Endogenous Variables and Selection Bias.” Journal of Econometrics 90, no. 2 (1999): 239–263. Wooldridge, Jeffrey M. Econometric Analysis Cross Section Panel. 2nd MIT press, 2010. Wooldridge, Jeffrey M. Introductory Econometrics: a Modern Approach. Cengage Learning, 2012. Wooldridge, Jeffrey M. “Unobserved Heterogeneity and Estimation of Average Partial Effects.” Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg (2005): 27–55. 144