NEW ESTIMATION METHODS FOR PANEL DATA MODELS

By

Valentin Verdier

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics - Doctor of Philosophy

2014

ABSTRACT

NEW ESTIMATION METHODS FOR PANEL DATA MODELS

By Valentin Verdier

This dissertation is composed of three chapters that develop new estimation methods for several models of panel data. The first and third chapters are mainly concerned with understanding and approximating the structure of optimal instruments: for estimating dynamic panel data models with cross-sectional dependence in the case of the first chapter, and non-linear panel data models with strictly exogenous covariates in the case of the third chapter. The second chapter is concerned with additional restrictions that can be used to estimate non-linear dynamic panel data models.

The first chapter considers the estimation of dynamic panel data models when data are suspected to exhibit cross-sectional dependence. A new estimator is defined that uses cross-sectional dependence for efficiency while being robust to misspecification of the form of the cross-sectional dependence. I show that using cross-sectional dependence for estimation is important to obtain an estimator that is more accurate than existing estimators. This new estimator also uses nuisance parameters parsimoniously, so that it exhibits good small sample properties even when the number of available moment conditions is large. As an empirical application, I estimate the effect of attending private school on student achievement using a value-added model.

The second chapter considers the instrumental variable estimation of non-linear models of panel data with multiplicative unobserved effects where instrumental variables are predetermined as opposed to strictly exogenous. Existing estimators for these models suffer from a weak instrumental variable problem, which can cause them to be too inaccurate to be reliable.
In this chapter I present additional sets of restrictions that can be used for more precise estimation. Monte Carlo simulations show that using these additional moment conditions improves the precision of the estimators significantly and hence should facilitate the use of these models.

In the third chapter I study the efficiency of the Poisson fixed effects estimator. The Poisson fixed effects estimator is a conditional maximum likelihood estimator and as such is consistent under specific distributional assumptions. It has also been shown to be consistent under significantly weaker restrictions on the conditional mean function only. I show that the Poisson fixed effects estimator is asymptotically efficient in the class of estimators that are consistent under restrictions on the conditional mean function, as long as the assumptions of equal conditional mean and variance and zero conditional serial correlation are satisfied. I then define another estimator that is optimal under more general conditions. I use Monte Carlo simulations to investigate the small-sample performance of this new estimator compared to the Poisson fixed effects estimator.

ACKNOWLEDGEMENTS

I particularly thank Jeffrey Wooldridge, who served as the chair of my dissertation committee. His teaching throughout my studies at Michigan State University shaped this dissertation and my current research. I also thank my other committee members, Peter Schmidt, Timothy Vogelsang and Robert Myers, whose help at various stages of my research had a large positive impact on the quality of my work. I also thank graduate students in the department of economics and the department of agricultural economics for helpful conversations. Finally I thank Margaret Lynch and Lori Jean Nichols of the administrative staff of the department of economics, whose continuous support over five years helped a lot in completing this degree.

TABLE OF CONTENTS

LIST OF TABLES
. . . . . . . . . . . . vii

CHAPTER 1  ESTIMATION OF DYNAMIC PANEL DATA MODELS WITH CROSS-SECTIONAL DEPENDENCE . . . 1
  1.1 Introduction . . . 1
  1.2 Dynamic Panel Data Models with Cross-Sectional Dependence . . . 3
    1.2.1 The Model . . . 3
    1.2.2 Consistent Estimation . . . 4
  1.3 Efficient Estimation under Clustering . . . 5
    1.3.1 Special Case of Independent Disturbances and T = 2 . . . 6
    1.3.2 General Case . . . 8
    1.3.3 Comparison to Existing Estimators . . . 14
  1.4 Models with Covariates . . . 15
  1.5 Monte Carlo Simulations . . . 18
  1.6 Application: Estimation of Persistence in Student Achievement . . . 36
  1.7 Conclusion . . . 44

CHAPTER 2  ESTIMATION OF UNOBSERVED EFFECTS PANEL DATA MODELS UNDER SEQUENTIAL EXOGENEITY . . . 45
  2.1 Introduction . . . 45
  2.2 Model and Assumptions . . . 47
  2.3 Estimation without Additional Assumptions . . . 49
  2.4 Additional Assumptions . . . 51
    2.4.1 Estimation with Stationary Instruments . . . 51
      2.4.1.1 Example of the Linear Feedback Model . . . 51
      2.4.1.2 Time Demeaned Instruments . . . 53
    2.4.2 Serially Uncorrelated Transitory Shocks . . . 54
  2.5 Monte Carlo Evidence . . . 56
  2.6 Average Partial Effects . . . 62
  2.7 Conclusion . . . 65

CHAPTER 3  EFFICIENCY OF THE POISSON FIXED EFFECTS ESTIMATOR . . . 66
  3.1 Introduction . . . 66
  3.2 The Model and Estimators . . . 66
    3.2.1 Asymptotically Efficient Estimation . . . 67
    3.2.2 Conditions for Efficiency of the Poisson FE estimator . . . 68
    3.2.3 An Alternative Estimator . . . 69
  3.3 Monte Carlo Simulations Study . . . 70

APPENDICES . . . 75
  APPENDIX A  ESTIMATION OF DYNAMIC PANEL DATA MODELS WITH CROSS-SECTIONAL DEPENDENCE . . . 76
  APPENDIX B  ESTIMATION OF UNOBSERVED EFFECTS PANEL DATA MODELS UNDER SEQUENTIAL EXOGENEITY . . . 81
  APPENDIX C  EFFICIENCY OF THE POISSON FIXED EFFECTS ESTIMATOR . . . 84

BIBLIOGRAPHY . . .
91

LIST OF TABLES

Table 1.1  Number of replications where all estimators converged (out of 1,000) . . . 23
Table 1.2  Bias and RMSE, ρ = .8, equi-correlation within clusters . . . 24
Table 1.3  Bias and RMSE, ρ = .8, no correlation within clusters . . . 25
Table 1.4  Bias and RMSE, ρ = .8, heteroscedasticity and correlation within clusters . . . 26
Table 1.5  Inference, ρ = .8, equi-correlation within clusters . . . 27
Table 1.6  Inference, ρ = .8, no correlation within clusters . . . 28
Table 1.7  Inference, ρ = .8, heteroscedasticity and correlation within clusters . . . 29
Table 1.8  Bias and RMSE, ρ = .5, equi-correlation within clusters . . . 30
Table 1.9  Bias and RMSE, ρ = .5, no correlation within clusters . . . 31
Table 1.10  Bias and RMSE, ρ = .5, heteroscedasticity and correlation within clusters . . . 32
Table 1.11  Inference, ρ = .5, equi-correlation within clusters . . . 33
Table 1.12  Inference, ρ = .5, no correlation within clusters . . . 34
Table 1.13  Inference, ρ = .5, heteroscedasticity and correlation within clusters . . . 35
Table 1.14  Averages and standard deviations of scores per subject and per grade . . . 42
Table 1.15  Effects of Attending Private Schools on Student Achievement . . . 43
Table 2.1  Bias and RMSE for estimating γ, T = 4 . . . 59
Table 2.2  Bias and RMSE for estimating γ, T = 8 . . . 60
Table 2.3  Ratio of standard errors over standard deviations of estimators of γ, T = 4 . . .
61
Table 2.4  Ratio of standard errors over standard deviations of estimators of γ, T = 8 . . . 62
Table 2.5  Coverage of 95% confidence intervals for γ, T = 4 . . . 63
Table 2.6  Coverage of 95% confidence intervals for γ, T = 8 . . . 64
Table 3.1  N = 100: Bias, standard deviation and root mean squared error . . . 72
Table 3.2  N = 500: Bias, standard deviation and root mean squared error . . . 73
Table 3.3  N = 1000: Bias, standard deviation and root mean squared error . . . 74

CHAPTER 1

ESTIMATION OF DYNAMIC PANEL DATA MODELS WITH CROSS-SECTIONAL DEPENDENCE

1.1 Introduction

In some econometric studies of panel data, researchers might want to account for the presence of feedback between the dependent variable and explanatory variables, i.e. for current values of the dependent variable to affect future values of the explanatory variables, or even for both dependent and independent variables to be jointly determined. The simplest example of such models is the dynamic panel data model, where lagged values of the dependent variable are used as covariates. In such cases, explanatory variables cannot be treated as strictly exogenous. In virtually all panel data applications, researchers also want to control for unobserved heterogeneity that affects the dependent variable but might also be correlated with the covariates. The presence of both non-strictly exogenous covariates and unobserved heterogeneity in panel data models causes many estimation methods to be invalid (see for instance Wooldridge (2010)). In the context of cross-sectionally independent data, a valid estimator for dynamic panel data models that relies on first differencing and instrumental variables was defined in early work by Anderson and Hsiao (1981).
Additionally, an asymptotically efficient estimator is found in Arellano and Bond (1991).1 In the rest of the paper, we refer to this estimator as the AB estimator. These estimators often suffer from having a large variance because the instrumental variables that they use are weak.2 In addition, inference for the AB estimator is often unsatisfactory when the number of time periods in the data set is relatively large because of problems due to using many moment conditions, as studied in Alvarez and Arellano (2003) or Windmeijer (2005) for the case of cross-sectional independence. In this paper, we consider the estimation of panel data models with covariates that are not strictly exogenous when data also exhibit cross-sectional dependence. We will define a new estimator that is more efficient than the AB estimator and for which inference is significantly better in small samples.

1 The Arellano and Bond estimator is asymptotically efficient in the class of estimators using linear functions of the instruments.

2 To address this problem, papers such as Ahn and Schmidt (1995), Arellano and Bover (1995), and Blundell and Bond (1998) considered using additional assumptions for estimation, such as homoscedasticity, no serial correlation of the transitory shocks, or restrictions on initial conditions. Another approach to obtaining efficiency gains from additional assumptions can be found in the literature on First Difference Quasi-Maximum Likelihood estimation, as in Hsiao et al. (2002) for instance, which relies on assumptions of homoscedasticity and no serial correlation. We do not consider these estimators here since we are interested in estimators that are consistent under the sole assumption of mean independence of the transitory shocks, without any other assumption holding.
The main reason why our estimator is more efficient than previous estimators that were defined for data with cross-sectional independence is that it makes use of cross-sectional dependence to obtain stronger instruments. In order to obtain an estimator with not only good properties in terms of point estimation, but also good properties for inference, we use an auxiliary model for optimal instruments. Optimal instruments are instruments that, once interacted with the corresponding moment functions, provide an optimal set of exactly identifying moment conditions, so that the resulting estimator achieves the asymptotic efficiency bound for estimating the unknown parameters from the assumption of mean independence of the transitory shocks. Optimal instruments for estimating dynamic panel data models without cross-sectional dependence are found in Chamberlain (1992a), and they can be generalized to the case of cross-sectional dependence. In this paper, we propose auxiliary assumptions sufficient to model optimal instruments for panel data models with covariates that are not strictly exogenous and cross-sectional dependence. The advantage of such an approach is that it provides a systematic way of weighting many moment conditions while making use of few nuisance parameters. As a result, our estimator exhibits good small sample properties and inference while being robust to misspecification of our model of optimal instruments. Arellano (2003) and Alvarez and Arellano (2004) have previously considered modeling optimal instruments for dynamic panel data models in the special case of cross-sectional independence. We show that cross-sectional dependence can be particularly useful to obtain more accurate estimators. Previous work on dynamic panel data models that has considered cross-sectional dependence has not made use of this dependence to obtain stronger instruments.
Mutl (2006), for instance, studied a GMM estimator based on the same moment conditions as in Anderson and Hsiao (1981) or Arellano and Bond (1991) and only used an optimal weighting matrix based on a specific model of spatial dependence. Elhorst (2005) and Su and Yang (2013) generalized maximum likelihood estimators as in Hsiao et al. (2002) to the case of cross-sectional dependence, but these estimators are not robust to heteroscedasticity, serial correlation of the transitory shocks, or misspecification of the cross-sectional dependence.

In Section 1.2, we present the simplest example of the models we consider, the dynamic panel data model without covariates for data with cross-sectional dependence. In Section 1.3, we define our estimator and compare it to existing estimators. In Section 1.4, we generalize our estimator to general models with non-strictly exogenous covariates. In Section 1.5, we present Monte Carlo evidence that the efficiency gains from using cross-sectional dependence for estimation can be significant and that the estimator we propose has superior small sample properties compared to existing estimators. In Section 1.6, we apply our estimator to the estimation of the effect of attending private school on student achievement, using a value-added model and taking into account the possibility that student achievements are correlated within schools.

1.2 Dynamic Panel Data Models with Cross-Sectional Dependence

1.2.1 The Model

Throughout the paper we will consider large n, fixed T asymptotics.3 Consider first the model for any observation i from a sample of n observations and any time period t from a fixed number T of time periods:

yit = ρ0 yit−1 + ci + uit,   t = 1, ..., T        (1.2.1)

E(uit | Yt−1) = 0,   t = 1, ..., T        (1.2.2)

where Yt = [Y1t, ..., Ynt] and Yit = [yi0, ..., yit] are random vectors that stack values of yit across time and observations, and ci are time constant unobserved effects, also called unobserved heterogeneity.
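As a concrete illustration, the data generating process in (1.2.1) with cluster-correlated unobserved heterogeneity can be simulated in a few lines. This is a minimal sketch, not part of the chapter's formal development: the cluster structure, the parameter values, and the stationary initialization of yi0 are illustrative assumptions.

```python
import random

def simulate_panel(G=200, ng=5, T=4, rho=0.5, tau=0.7, seed=0):
    """Simulate y_it = rho * y_it-1 + c_i + u_it as in (1.2.1), where the
    heterogeneity c_i shares a common component within each of G clusters
    of ng observations. Returns a list of (cluster_id, [y_i0, ..., y_iT])."""
    rng = random.Random(seed)
    data = []
    for g in range(G):
        cg = rng.gauss(0.0, 1.0)  # cluster-level component of c_i
        for _ in range(ng):
            # c_i correlated within clusters, unit variance overall
            ci = tau * cg + (1.0 - tau ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
            # one convenient choice: initialize near the stationary mean c_i / (1 - rho)
            y = [ci / (1.0 - rho) + rng.gauss(0.0, 1.0)]
            for _ in range(T):
                y.append(rho * y[-1] + ci + rng.gauss(0.0, 1.0))
            data.append((g, y))
    return data
```

Designs of this type (with varying within-cluster correlation of ci and uit) are what the Monte Carlo exercises of Section 1.5 vary.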
We also assume that ρ0 ≠ 1, so that ρ0 is identified from differenced equations as seen in the next subsection.

In the case where there is no cross-sectional dependence, (1.2.1) and (1.2.2) correspond to the linear dynamic model for panel data as presented in Arellano and Bond (1991) for instance. When there is cross-sectional dependence, (1.2.1) and (1.2.2) impose the restriction that cross-sectional dependence does not cause Yt−1 to be endogenous. For instance, if contemporaneous spatial lags were omitted variables in (1.2.1), then (1.2.2) would be violated.

Some papers, such as Cizek et al. (2011), Elhorst (2005), Su and Yang (2013) and Baltagi et al. (2014), have considered models with both dynamic effects and contemporaneous spatial lag effects. Since estimators for such models rely on correct specification of the form of cross-sectional dependence, we do not consider them here and concentrate on models where cross-sectional dependence of some unknown form is present in the residuals.4 Lagged values of the dependent variable of neighboring observations could also be included in the model as covariates to control for dynamic cross-sectional effects. We will discuss models with covariates in Section 1.4.

The objective of the next section is to characterize estimators for ρ0 that are consistent when (1.2.1) and (1.2.2) hold under general conditions on the form of cross-sectional dependence in ci and uit.

3 Using a parsimonious number of nuisance parameters seems to grant the estimator we propose good properties with relatively large numbers of time periods, but a formal derivation of results under large N, large T asymptotics is left for future research.

4 It is also important to note that, with cross-sectional dependence, it is not likely for E(uit | Yit−1) = 0 to hold without (1.2.2) holding. If (1.2.2) is not satisfied, it is likely that both estimators for cross-sectionally independent data, such as the Arellano and Bond estimator, and the alternative estimator proposed in this chapter will be inconsistent. For instance, suppose for simplicity that n = 2 and E(u1t | Yt−1) = α + β1 y1t−1 + β2 y2t−1 ≠ 0, so that β1 ≠ 0 or β2 ≠ 0. Then E(u1t | Y1t−1) = α + β1 y1t−1 + β2 E(y2t−1 | Y1t−1), and it is likely that E(y2t−1 | Y1t−1) will be a function of y10, ..., y1t−2 in addition to y1t−1, so that, in general, α + β1 y1t−1 ≠ −β2 E(y2t−1 | Y1t−1) and E(u1t | y1t−1) ≠ 0.

1.2.2 Consistent Estimation

The presence of unobserved heterogeneity rules out estimating ρ0 by a regression. Because (1.2.1) and (1.2.2) form a dynamic model, fixed effects estimation is also ruled out because explanatory variables are not strictly exogenous. To estimate ρ0, we will consider a first difference transformation. All of the derivations in this paper can be generalized to other transformations, such as the forward filtering transformation presented in Arellano and Bover (1995) for instance, which can be useful in the case of unbalanced panels. Define:

mit(ρ) = ∆yit − ρ∆yit−1,   ∀t = 2, ..., T        (1.2.3)

where ∆ is the first difference operator. Therefore mit(ρ0) = uit − uit−1, and (1.2.1) and (1.2.2) imply:

E(mit(ρ0) | Yt−2) = 0,   ∀t = 2, ..., T        (1.2.4)

Define mi(ρ) = [mit(ρ)]t=2,...,T to be the column vector with mit+1(ρ) as its tth element. Sometimes we will also shorten notation by writing mi = mi(ρ0), mit = mit(ρ0) and ∆Y−1,i = [∆yit−1]t=2,...,T. Define:

Zi = [Zi2, ..., ZiT]        (1.2.5)

to be a matrix containing instruments for each time period, so that Zit is a function of Yt−2 and therefore E(Zit mit(ρ0)) = 0 and E(Zi mi(ρ0)) = ∑_{t=2}^{T} E(Zit mit(ρ0)) = 0.5 Define Ξ to be some weighting matrix.
Define an estimator ρ̂ of ρ0 as:

ρ̂ = argminρ ( ∑_{i=1}^{n} Zi mi(ρ) )′ Ξ ( ∑_{i=1}^{n} Zi mi(ρ) )        (1.2.6)

Consider first the case where cross-sectional dependence is captured by a large group of clusters with fixed numbers of observations, so that observations within a cluster might be related but observations across clusters are independent. Standard results on asymptotic properties of GMM estimators with clustering, found in White (2001) for instance, imply that ρ̂ will be consistent for ρ0 and asymptotically normal as the number of clusters grows unboundedly, under standard regularity conditions. For more general forms of cross-sectional dependence, Conley (1999), Jenish and Prucha (2009) and Jenish and Prucha (2012) consider different sets of regularity conditions that guarantee that ρ̂ is consistent and asymptotically normal as long as E(Zi mi(ρ0)) = 0. In this paper, we will assume that either set of regularity conditions holds, so that the probability limits D = plim( (1/n) ∑_{i=1}^{n} Zi ∆Y−1,i ) and ϒ = plim (1/n) ∑_{i=1}^{n} ∑_{j=1}^{n} Zi mi mj′ Zj′ exist and are finite, D′ΞD ≠ 0, ρ̂ →p ρ0 and, as n → ∞:6

√n (ρ̂ − ρ0) →d N(0, V)        (1.2.7)

V = (D′ΞD)^{−1} D′ΞϒΞD (D′ΞD)^{−1}        (1.2.8)

In the next sections we consider efficient feasible GMM estimation, where the matrix of instruments Zi and an estimator of the weighting matrix Ξ are chosen so that the resulting estimator of ρ0 is efficient under some auxiliary assumptions. It is important to note that all of the estimators we propose will be asymptotically equivalent to estimators of the type defined by (1.2.6), so that they will be consistent as long as (1.2.1) and (1.2.2) hold, independently of whether the auxiliary models we specify are true or not.

5 Note that we need to assume ρ0 ≠ 1 for E(Zi mi(ρ)) = 0 to hold for ρ = ρ0 only, since if ρ0 = 1 then E(Zi mi(ρ)) = 0 ∀ρ.
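To make (1.2.6) concrete, consider the simplest choice of a single instrument per period, Zit = yit−2. The pooled sample moment is then linear in ρ, the weighting matrix Ξ is irrelevant (there is one moment condition for one parameter), and the minimizer has a closed form. The sketch below is illustrative only; the data layout, a list of (cluster, [yi0, ..., yiT]) pairs, is an assumption of this example rather than notation from the chapter.

```python
def gmm_rho(data, T):
    """Estimate rho from the pooled moment condition
    E[ y_it-2 * (dy_it - rho * dy_it-1) ] = 0 for t = 2, ..., T,
    i.e. a single-instrument (Anderson and Hsiao type) special case of
    (1.2.6). Setting the sample moment to zero gives a closed-form ratio."""
    num = den = 0.0
    for _, y in data:
        for t in range(2, T + 1):
            z = y[t - 2]                      # instrument: second lag of y
            num += z * (y[t] - y[t - 1])      # z * dy_t
            den += z * (y[t - 1] - y[t - 2])  # z * dy_t-1
    return num / den
```

With noise-free data generated from (1.2.1) (uit = 0 for all i, t), mit(ρ0) is identically zero and the ratio recovers ρ0 exactly; with noisy data the estimator is consistent but, as discussed above, often imprecise because the instrument is weak.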
1.3 Efficient Estimation under Clustering

In this section, we consider an auxiliary model for deriving optimal instruments that assumes that every observation belongs to one of a large number of clusters. Observations are treated as correlated within clusters but independent across clusters. While clustering only represents a specific form of cross-sectional dependence, it might be a good approximation for more general forms of dependence in many applications. In addition, the method outlined in this section for the special case of clustering can easily be extended to other forms of cross-sectional dependence. Therefore we restrict our attention in this paper to auxiliary models that make use of the clustering assumption.

For simplicity we will consider in this section the case where each observation belongs to the same cluster across all time periods, but the results in this section can be generalized to clusters changing over time, as shown in Section 1.6.

Previous work that estimated dynamic models of panel data with clustered sampling generally used estimators developed for i.i.d. data, such as the ones found in Anderson and Hsiao (1981), Arellano and Bond (1991), or Ahn and Schmidt (1995), and adjusted inference by using clustered standard errors. Such an analysis can be found for instance in de Brauw and Giles (2008), where farming households are treated as clustered by village, or Andrabi et al. (2011), where students are clustered by school.7 Topalova and Khandelwal (2010) and Balasubramanian and Sivadasan (2010) consider the case where firms are clustered by industry.

6 Note that in the case of clustering we consider {ng}g=1,...,G to be a set of fixed values, where ng denotes the number of observations in cluster g and G the number of clusters. Then √n-asymptotic normality and √G-asymptotic normality are equivalent since n/max{ng} ≤ G ≤ n/min{ng}.
In this section, we show that there is much to gain in terms of efficiency by using a different estimator that takes into account correlation within clusters but is robust to misspecification of the form of this correlation. We will consider the case where the data are composed of a large number of clusters indexed by g = 1, ..., G, each with a fixed number of observations denoted ng, so that asymptotics will be performed for G → ∞. In the first subsection, we present the special case of two time periods, since in this case the problem reduces to estimating ρ0 from only one differenced equation using instrumental variables.

1.3.1 Special Case of Independent Disturbances and T = 2

For this simple special case, we derive an efficient estimator for the case where {uit}i=1,...,n, t=1,2 are independent both cross-sectionally and across time, where T = 2, and where we have conditional homoscedasticity so that:

Var(uit | Yt−1) = σu²,   ∀t = 1, 2        (1.3.1)

When T = 2, there is only one differenced equation that can be used for estimation:

∆yi2 = ρ0 ∆yi1 + ∆ui2        (1.3.2)

for which the available instruments are Y0. Under the assumption of independence of disturbances and homoscedasticity, ∆ui2 is also cross-sectionally independent and homoscedastic, so the optimal instrument for the differenced equation is the best prediction of ∆yi1 based on all the available instruments, i.e. E(∆yi1 | Y0). To find E(∆yi1 | Y0), note that under (1.2.1) and (1.2.2), yi1 = ρ0 yi0 + ci + ui1, so that E(∆yi1 | Y0) = (ρ0 − 1)yi0 + E(ci | Y0). Therefore the quality of the prediction of ∆yi1 based on the instruments will depend on the quality of the prediction of ci based on the instruments.

7 We will show in Section 1.6, however, that the clustering used in Andrabi et al. (2011) is not appropriate for obtaining robust standard errors, due to observations moving across clusters during the period of observation. We will show robust standard errors that take this factor into consideration.
In many applications, it is very likely that agents that belong to the same cluster will have levels of unobserved heterogeneity that are related. For instance, farmers that live in the same village might farm plots with similar soil quality or develop similar farming practices over time. Firms that operate in the same industry might also face similar constraints, such as regulation or access to a skilled labor force. Similarly, households that live in the same district might have been selected based on common characteristics such as wealth, income, family status or values. Therefore, in many applications, we can expect that using information from other observations in the same cluster, in addition to one's own previous outcomes, can provide a better predictor of one's level of unobserved heterogeneity.

For this simple case, we could derive an optimal predictor for ci by using the assumption that for any observation i belonging to cluster g we have:

ci = cg + ei        (1.3.3)

where {cg}g=1,...,G forms a sequence of i.i.d. random variables and {ei}i=1,...,n is an i.i.d. sequence of zero-mean random variables, with ei being mean independent of {yj0}j≠i conditional on yi0. Then for any observation i in cluster g we have E(ci | Y0) = E(cg | Y0) + E(ei | yi0). To obtain a parsimonious model for the optimal instruments, we can postulate that conditional expectations are linear and that each observation within a cluster contributes in the same way to the prediction of cg. Then for any observation in cluster g, E(ci | Y0) = α0 + β0 (1/ng) ∑_{j∈g} yj0 + γ0 yi0, where ng denotes the number of observations in cluster g. Therefore the optimal instrument for (1.3.2) for an observation in cluster g is zi = (ρ0 − 1)yi0 + α0 + β0 (1/ng) ∑_{j∈g} yj0 + γ0 yi0.
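The construction of a feasible counterpart to this instrument, described next, can be sketched in a few lines: given a preliminary consistent estimate ρ̈, regress yi1 − ρ̈yi0 on an intercept, the cluster mean of initial outcomes and yi0 to estimate (α0, β0, γ0), form the fitted instrument ẑi, and apply instrumental variables to the differenced equation (1.3.2). The pure-Python sketch below is illustrative only; its function names and data layout are assumptions of this example, and it uses a single cross-section of the regression rather than the pooled version described in the text.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def feasible_optimal_iv(y0, y1, y2, cluster, rho_pre):
    """Feasible optimal-instrument IV for the T = 2 case: estimate
    E(c_i | Y_0) by OLS of y_i1 - rho_pre * y_i0 on (1, cluster mean of y_j0,
    y_i0), build z_i = (rho_pre - 1) * y_i0 + fitted value, then run IV on
    the differenced equation (1.3.2). rho_pre is any preliminary estimate."""
    n = len(y0)
    sums, counts = {}, {}
    for g, v in zip(cluster, y0):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    ybar = [sums[g] / counts[g] for g in cluster]  # within-cluster means of y_j0
    X = [[1.0, ybar[i], y0[i]] for i in range(n)]
    r = [y1[i] - rho_pre * y0[i] for i in range(n)]
    # normal equations X'X beta = X'r for (alpha, beta, gamma)
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(3)] for a in range(3)]
    Xtr = [sum(X[i][a] * r[i] for i in range(n)) for a in range(3)]
    alpha, beta, gamma = solve(XtX, Xtr)
    z = [(rho_pre - 1.0) * y0[i] + alpha + beta * ybar[i] + gamma * y0[i] for i in range(n)]
    num = sum(z[i] * (y2[i] - y1[i]) for i in range(n))  # z * dy_2
    den = sum(z[i] * (y1[i] - y0[i]) for i in range(n))  # z * dy_1
    return num / den
```

Note the robustness property discussed in the text: because ẑi is a function of Y0 only, the IV step is consistent for ρ0 even when ρ_pre or the auxiliary cluster model is wrong; the auxiliary model only affects efficiency.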
A feasible version of this optimal instrument can be obtained from a preliminary consistent estimator of ρ0, denote it ρ̈, since consistent estimators for α0, β0, γ0 can then be obtained from a pooled regression of yit − ρ̈yit−1 on an intercept, (1/ng) ∑_{j∈g} yj0 and yi0. Using the information contained in past outcomes for other observations in the cluster will presumably yield a much better predictor of ci, and hence a much better instrument, which can lead to sizable gains in efficiency. Even though we derived this efficient estimator by using very strong auxiliary assumptions, it is consistent as long as (1.2.1) and (1.2.2) hold, and one can use inference that is robust to all of our auxiliary assumptions being violated, as shown in the next sub-section.

1.3.2 General Case

In this sub-section, we consider efficient estimation with T being any fixed integer equal to or greater than two and disturbances being potentially correlated within clusters. Here we will generalize the idea developed in the previous sub-section of using other observations from a cluster to predict one's level of unobserved heterogeneity. We will start with the same auxiliary assumption of clustering as in the previous subsection:

Auxiliary Assumption 1: Clusters of observations are independent and identically distributed.

With Auxiliary Assumption 1, we can derive the optimal estimator for ρ0 by generalizing the work on optimal instruments for cross-sectionally independent data in Chamberlain (1992a) to the case of cluster sampling. In this section we will index observations by cluster, so that for any i, gi denotes the cluster to which observation i belongs and jg denotes the jth observation of cluster g, so that for any observation i in g there is j such that jg = i, and {{xjg}j=1,...,ng}g=1,...,G = {xi}i=1,...,n for any sequence of variables {xi}i=1,...,n.
Consider stacking all observations by cluster and define mt^g(ρ) = [m1g,t(ρ), ..., mng g,t(ρ)]′, m^g(ρ) = [m2^g(ρ)′, ..., mT^g(ρ)′]′, mt^g = mt^g(ρ0) and m^g = m^g(ρ0). Similarly, define ut^g = [u1g,t, ..., ung g,t]′, u^g = [u1^g′, ..., uT^g′]′, c^g = [c1g, ..., cng g]′, yt^g = [y1g,t, ..., yng g,t]′, Yt^g = [y0^g′, ..., yt^g′]′, and ∆Y−1^g = [∆y1^g′, ..., ∆yT−1^g′]′. Appendix A.1.1 shows that the optimal estimator for ρ0 is defined by:

∑_{g=1}^{G} Zopt^g m^g(ρ̂opt) = 0        (1.3.4)

where Zopt^g = L^g′ (Φ^g)^{−1/2}, where Φ^g = [Cov(mt^g, ms^g | Y^g_{max{t,s}−2})]_{t,s=2,...,T}, (Φ^g)^{−1/2} is the upper diagonal matrix such that (Φ^g)^{−1/2}′ (Φ^g)^{−1/2} = (Φ^g)^{−1}, L^g = [Lt^g]_{t=2,...,T} and Lt^g = E((Φt^g)^{−1/2} ∆Y−1^g | Y^g_{t−2}), where (Φt^g)^{−1/2} is the (t − 1)th ng × ng(T − 1) matrix composing (Φ^g)^{−1/2}.

One could estimate these optimal instruments non-parametrically by using series of instruments that include lagged values of the dependent variable for an observation, but also lagged values of the dependent variable for neighboring observations. A similar estimator has been studied for the case of cross-sectionally independent data in Donald et al. (2009) for static models and Hahn (1997) for dynamic models. However, such an approach would not be practical here, since there are too many possible terms to consider as instruments. Also, it would involve using many nuisance parameters, which can cause poor small sample properties for the estimator, as is discussed later. Instead, we propose two additional auxiliary assumptions that will allow us to model optimal instruments and drastically reduce the number of nuisance parameters needed. The resulting estimator will be consistent as long as (1.2.1) and (1.2.2) hold, and efficient when these auxiliary assumptions are satisfied. Because the estimator we propose makes use of few nuisance parameters, it will have good small sample properties even when the auxiliary assumptions do not hold, as evidenced in Section 1.5.
The second auxiliary assumption we use is conditional homoscedasticity together with zero conditional serial correlation and conditional equi-correlation within clusters:

Auxiliary Assumption 2a: For any i, j ∈ g and t, s = 1, ..., T with t ≥ s:

Cov(u_it, u_js | c^g, Y_{t−1}^g) = σ_u²      if i = j, t = s
                                 = τ_u σ_u²   if i ≠ j, t = s
                                 = 0          otherwise

Under Auxiliary Assumption 2a, Appendix A.1.2 shows that the optimal instrument for m^g, Z_opt^g, is now a linear function of {E(Δy_{t−1}^g | Y_{t−s}^g)}_{t=2,...,T, s=2,...,t}. This corresponds to the intuition developed in the previous section, where we found that, for the special case T = 2, the optimal instruments were simply E(Δy_1^g | Y_0^g). From (1.2.1) and (1.2.2):

E(Δy_{t−1}^g | Y_{t−s}^g) = (ρ_0 − 1) ρ_0^{s−1} y_{t−s}^g + ∑_{r=0}^{s−2} ρ_0^r E(c^g | Y_{t−s}^g)    (1.3.5)
                          = (ρ_0 − 1) ρ_0^{s−1} y_{t−s}^g + ((1 − ρ_0^{s−1})/(1 − ρ_0)) E(c^g | Y_{t−s}^g)    (1.3.6)

Under Auxiliary Assumption 1:

E(Δy_{t−1}^g | Y_{t−s}^g) = (ρ_0 − 1) ρ_0^{s−1} y_{t−s}^g + ((1 − ρ_0^{s−1})/(1 − ρ_0)) E(c^g | Y_{t−s}^g)    (1.3.7)

Therefore, in order to obtain a model for the optimal instruments, one needs to make additional assumptions so that there exists a parametric model for the mean of unobserved heterogeneity conditional on lagged values of the dependent variable.
In order to keep the number of nuisance parameters low, it is useful to assume that unobserved heterogeneity follows a simple cluster correlation structure:

Corr(c_i, c_j) = τ_c    if i ≠ j, i, j ∈ g    (1.3.8)
              = 0       otherwise              (1.3.9)

We also assume that the disturbances {u_t^g}_{t=1,...,T} are independent of unobserved heterogeneity, that both have a joint normal distribution, and that the initial values of the dependent variable are in the stationary state associated with (1.2.1), i.e.:

y_0^g = c^g/(1 − ρ_0) + ũ_0^g    (1.3.10)

where ũ_0^g is independent of c^g and {u_t^g}_{t=1,...,T}, follows a normal distribution with zero mean and variance σ_u²/(1 − ρ_0²), and has a within-cluster correlation of τ_u. (Footnote 8: The auxiliary assumption of stationary initial conditions can easily be generalized, at the expense of introducing three additional nuisance parameters, by assuming: y_0^g = α + β c^g + ũ_0^g, with ũ_0^g | c^g ∼ N(0, Σ̃_0^g), Var(ũ_i0) = σ̃_0, and Corr(ũ_i0, ũ_j0) = τ_u if i ≠ j but g_i = g_j.)

Let the variance-covariance matrix of u_t^g for t = 1, ..., T be denoted by Σ_u^g:

            [ 1    τ_u  ⋯   τ_u
Σ_u^g = σ_u²  τ_u  1         ⋮
              ⋮         ⋱   τ_u
              τ_u  ⋯   τ_u  1  ]    (1.3.11)

Let the variance-covariance matrix of c^g be denoted by Σ_c^g:

            [ 1    τ_c  ⋯   τ_c
Σ_c^g = σ_c²  τ_c  1         ⋮
              ⋮         ⋱   τ_c
              τ_c  ⋯   τ_c  1  ]    (1.3.12)

The last auxiliary assumption of our model for optimal instruments is:

Auxiliary Assumption 3a: Suppose that for any cluster g = 1, ..., G:

[ c^g  ]       ( [ μ_c ι_{n_g}              ]   [ Σ_c^g               (1/(1−ρ_0)) Σ_c^g                           0
[ y_0^g ]  ∼ N(  [ (1/(1−ρ_0)) μ_c ι_{n_g}  ] ,   (1/(1−ρ_0)) Σ_c^g   (1/(1−ρ_0))² Σ_c^g + (1/(1−ρ_0²)) Σ_u^g     0
[ u^g  ]       ( [ 0                        ]     0                   0                                           I_T ⊗ Σ_u^g ] )    (1.3.13)

where Σ_c^g and Σ_u^g have been defined previously and ι_{n_g} is a column vector of ones of dimension n_g × 1. Note that E(c^g | Y_t^g) = E(c^g | y_0^g, c^g + u_1^g, ..., c^g + u_t^g).
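Under joint normality, E(c^g | Y_t^g) follows from the standard partitioned-normal formula E(x_1 | x_2) = μ_1 + V_12 V_22^{−1}(x_2 − μ_2). A generic numpy sketch of that formula (the mean vector, covariance matrix and observed values below are illustrative, not the chapter's):

```python
import numpy as np

def mvn_conditional_mean(mu, V, idx_c, idx_obs, obs):
    """E(x[idx_c] | x[idx_obs] = obs) for x ~ N(mu, V), via the partitioned-normal formula."""
    V_co = V[np.ix_(idx_c, idx_obs)]           # cross covariance block
    V_oo = V[np.ix_(idx_obs, idx_obs)]         # covariance of the conditioning block
    return mu[idx_c] + V_co @ np.linalg.solve(V_oo, obs - mu[idx_obs])

# Toy check on a trivariate normal
mu = np.array([1.0, 0.0, 0.0])
V = np.array([[1.0, 0.5, 0.3],
              [0.5, 2.0, 0.2],
              [0.3, 0.2, 1.5]])
m = mvn_conditional_mean(mu, V, idx_c=np.array([0]), idx_obs=np.array([1, 2]),
                         obs=np.array([0.7, -0.2]))
print(m)
```

In the chapter's setting the roles of x_1 and x_2 are played by c^g and (y_0^g, c^g + u_1^g, ..., c^g + u_t^g), with mean and covariance built from the blocks of (1.3.13).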
Define V^g as:

V^g = [ Σ_c^g               (1/(1−ρ_0)) Σ_c^g                           0
        (1/(1−ρ_0)) Σ_c^g   (1/(1−ρ_0))² Σ_c^g + (1/(1−ρ_0²)) Σ_u^g     0
        0                   0                                           I_T ⊗ Σ_u^g ]    (1.3.14)

Under Auxiliary Assumption 3a:

[ c^{g′}, y_0^{g′}, (c^g + u_1^g)′, ..., (c^g + u_T^g)′ ]′ ∼ N(μ^g, A^g V^g A^{g′})    (1.3.15)

where μ^g = A^g [ μ_c ι_{n_g}′, (1/(1−ρ_0)) μ_c ι_{n_g}′, 0′ ]′ and A^g is the deterministic matrix of ones and zeros such that A^g [c^{g′}, y_0^{g′}, u^{g′}]′ = [c^{g′}, y_0^{g′}, (c^g + u_1^g)′, ..., (c^g + u_T^g)′]′.

Therefore, using the properties of the multivariate normal distribution, E(c^g | Y_t^g) can be obtained as a linear function of y_0^g, c^g + u_1^g, ..., c^g + u_t^g with coefficients given by the elements of V^g. The exact form of E(c^g | Y_t^g) under Auxiliary Assumptions 1, 2a and 3a is given in Appendix A.1.3.

Only five nuisance parameters compose V^g, and they can be consistently estimated if a consistent preliminary estimator of ρ_0 is available, denote it ρ̈. Let r_it(ρ) = y_it − ρ y_it−1. Consistent estimators for the nuisance parameters in V^g are:

σ̂_u² = (1/2) (1/(T−1)) (1/n) ∑_{t=2}^T ∑_{i=1}^n m_it(ρ̈)²

τ̂_u = (1/σ̂_u²) (1/2) (1/(T−1)) (1/n) ∑_{t=2}^T ∑_{i=1}^n (1/(n_{g_i} − 1)) ∑_{j=1}^n 1[i ≠ j, g_i = g_j] m_it(ρ̈) m_jt(ρ̈)

σ̂_c² = (1/(T(T−1))) (1/n) ∑_{t=1}^T ∑_{s=1}^T ∑_{i=1}^n 1[t ≠ s] r_it(ρ̈) r_is(ρ̈) − μ̂_c²

μ̂_c = (1/T) (1/n) ∑_{t=1}^T ∑_{i=1}^n r_it(ρ̈)

τ̂_c = (1/σ̂_c²) ( (1/(T(T−1))) (1/n) ∑_{t=1}^T ∑_{s=1}^T ∑_{i=1}^n (1/(n_{g_i} − 1)) ∑_{j=1}^n 1[t ≠ s, g_i = g_j, i ≠ j] r_it(ρ̈) r_js(ρ̈) − μ̂_c² )

Let Φ̂^g be the consistent estimator of the variance-covariance matrix Φ^g = Var(m^g(ρ_0)) obtained by plugging σ̂_u and τ̂_u into the formula derived in Appendix A.1.2. Let Φ̂^{g,−1/2} be the upper diagonal matrix such that Φ̂^{g,−1/2′} Φ̂^{g,−1/2} = Φ̂^{g,−1}. Denote by Φ̂_t^{g,−1/2} the t-th n_g × n_g(T−1) matrix composing Φ̂^{g,−1/2}. Let μ̂_t^{gc} be a consistent estimator of E(c^g | Y_t^g) from the formula given in Appendix A.1.3.
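The five nuisance-parameter estimators above are simple averages of products of residuals. A numpy sketch on hypothetical simulated data with balanced clusters, where for simplicity the preliminary estimate ρ̈ is set to the true ρ (the parameter values are made up; in the simulated design σ_u² = 1, τ_u = 0, μ_c = 1, σ_c² = 2 and τ_c = 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_g, T = 200, 3, 5
rho = 0.5  # plays the role of both rho_0 and the preliminary estimate

# Heterogeneity: mean 1, variance 2, within-cluster correlation 0.5; i.i.d. N(0,1) shocks
c = 1.0 + rng.normal(size=(G, 1)) + rng.normal(size=(G, n_g))
y = np.empty((G, n_g, T + 1))
y[:, :, 0] = c / (1 - rho)
for t in range(1, T + 1):
    y[:, :, t] = rho * y[:, :, t - 1] + c + rng.normal(size=(G, n_g))

n = G * n_g
r = y[:, :, 1:] - rho * y[:, :, :-1]   # r_it = c_i + u_it, t = 1..T
m = r[:, :, 1:] - r[:, :, :-1]         # m_it = u_it - u_it-1, t = 2..T

sigma2_u = (m ** 2).sum() / (2 * (T - 1) * n)
# within-cluster, same-period cross products of m over i != j
cross_u = ((m.sum(axis=1) ** 2 - (m ** 2).sum(axis=1)) / (n_g - 1)).sum()
tau_u = cross_u / (2 * (T - 1) * n) / sigma2_u

mu_c = r.mean()
# own cross products over t != s
own = (r.sum(axis=2) ** 2 - (r ** 2).sum(axis=2)).sum()
sigma2_c = own / (T * (T - 1) * n) - mu_c ** 2
# i != j and t != s cross products, by inclusion-exclusion within each cluster
tot = r.sum(axis=(1, 2)) ** 2
same_i = (r.sum(axis=2) ** 2).sum(axis=1)
same_t = (r.sum(axis=1) ** 2).sum(axis=1)
both = (r ** 2).sum(axis=(1, 2))
cross_c = ((tot - same_i - same_t + both) / (n_g - 1)).sum()
tau_c = (cross_c / (T * (T - 1) * n) - mu_c ** 2) / sigma2_c

print(sigma2_u, tau_u, mu_c, sigma2_c, tau_c)
```

Each estimate should land near its true value in this design, since every term averages products whose expectation is the corresponding moment (e.g. E[m_it²] = 2σ_u² and E[r_it r_is] = σ_c² + μ_c² for t ≠ s).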
A consistent estimator of the optimal instrument for m^g(ρ) under (1.2.1), (1.2.2) and Auxiliary Assumptions 1, 2a and 3a is:

Ẑ_opt^{g′} = [Φ̂_1^{g,−1/2}, ..., Φ̂_{T−1}^{g,−1/2}] ×

    [ (ρ̈ − 1) y_0^g + μ̂_0^{gc}                                          0    ⋯    0
      ⋮                                                                        ⋱    ⋮
      ρ̈^{T−2}(ρ̈ − 1) y_0^g + ((1 − ρ̈^{T−1})/(1 − ρ̈)) μ̂_0^{gc}        ⋯         (ρ̈ − 1) y_{T−2}^g + μ̂_{T−2}^{gc} ]    (1.3.16)

and the estimator obtained from using this instrument matrix is defined by:

∑_{g=1}^G Ẑ_opt^{g′} m^g(ρ̂) = 0    (1.3.17)

so that:

ρ̂ = (∑_{g=1}^G Ẑ_opt^{g′} Δy^g) / (∑_{g=1}^G Ẑ_opt^{g′} Δy_{−1}^g)    (1.3.18)
   = ρ_0 + (∑_{g=1}^G Ẑ_opt^{g′} Δu^g) / (∑_{g=1}^G Ẑ_opt^{g′} Δy_{−1}^g)    (1.3.19)

where Δy^g = [Δy_2^{g′}, ..., Δy_T^{g′}]′, Δy_{−1}^g = [Δy_1^{g′}, ..., Δy_{T−1}^{g′}]′ and Δu^g = [Δu_2^{g′}, ..., Δu_T^{g′}]′.

Let Z̈_opt^g be the random vector defined as in (1.3.16) but where ρ̈, σ̂_u², σ̂_c², τ̂_u, τ̂_c, μ̂_c are replaced by plim(ρ̈), plim(σ̂_u²), plim(σ̂_c²), plim(τ̂_u), plim(τ̂_c), plim(μ̂_c). When (1.2.1), (1.2.2) and Auxiliary Assumption 1 hold, ρ̂ is asymptotically normal:

√G (ρ̂ − ρ_0) →d N(0, V_ρ)    (1.3.20)

V_ρ = E(Z̈_opt^{g′} Δy_{−1}^g)^{−2} Var(Z̈_opt^{g′} Δu^g)    (1.3.21)

Standard errors for ρ̂ that are consistent as long as (1.2.1), (1.2.2) and Auxiliary Assumption 1 hold are given by:

s.e. = ( (∑_{g=1}^G Ẑ_opt^{g′} Δy_{−1}^g)^{−2} ∑_{g=1}^G (Ẑ_opt^{g′} m^g(ρ̂))² )^{1/2}    (1.3.22)

The estimator defined by (1.3.17) is consistent and asymptotically normal even when Auxiliary Assumption 1 of cluster sampling is not satisfied, as long as some regularity conditions on the strength of cross-sectional dependence hold. As in Section 1.2.2, cross-sectional dependence has to be weak enough so that asymptotic theorems can be applied:

(1/G) ∑_{g=1}^G Z̈_opt^{g′} Δu^g →p 0    (1.3.23)

(1/G) ∑_{g=1}^G Z̈_opt^{g′} Δy_{−1}^g →p a    (1.3.24)

(1/√G) ∑_{g=1}^G Z̈_opt^{g′} Δu^g →d N(0, v)    (1.3.25)

where a = plim((1/G) ∑_{g=1}^G Z̈_opt^{g′} Δy_{−1}^g) ≠ 0 and v = plim((1/G) (∑_{g=1}^G Z̈_opt^{g′} Δu^g)²). In this case:

√G (ρ̂ − ρ_0) →d N(0, a^{−2} v)    (1.3.26)

a can simply be estimated by (1/G) ∑_{g=1}^G Ẑ_opt^{g′} Δy_{−1}^g, and non-parametric estimators of plim((1/G)(∑_{g=1}^G Z̈_opt^{g′} Δu^g)²), as well as statistical tests under general forms of spatial dependence, are available and have been discussed in Conley (1999), Bester et al. (2011b), Kim and Sun (2011) and Bester et al. (2011a).

In situations where available preliminary estimators might have poor small sample properties, one can also use an iterated version of the feasible optimal estimator. Denote by Ẑ_opt^g(ρ) the value of the estimated optimal instruments for a preliminary estimator (previously denoted ρ̈) evaluated at ρ. The iterated optimal estimator is defined by:

∑_{g=1}^G Ẑ_opt^g(ρ̂_iter)′ m^g(ρ̂_iter) = 0    (1.3.27)

This estimator has the same √G-asymptotic properties as the two-step estimator defined by (1.3.17), but its small sample properties will not depend on the small sample properties of a preliminary estimator.

1.3.3 Comparison to Existing Estimators

The estimator defined by (1.3.17) can be rewritten as the ρ̂* that satisfies the equation:

∑_{g=1}^G w^g*(η̂) Z^{g′} m^g(ρ̂*) = 0    (1.3.28)

where η̂ = [σ̂_u², τ̂_u, σ̂_c², μ̂_c, τ̂_c] and Z^g is the matrix containing all valid instruments for m^g:

Z^g = [ I_{n_g} ⊗ Y_0^g    0                  ⋯    0
        0                  I_{n_g} ⊗ Y_1^g    ⋯    0
        ⋮                                     ⋱    ⋮
        0                  ⋯             0         I_{n_g} ⊗ Y_{T−2}^g ]    (1.3.29)

and w^g*(·) is the row vector function such that w^g*(η̂) Z^{g′} = Ẑ_opt^{g′}. The Arellano and Bond estimator can also be written as exactly identified, from:

∑_{g=1}^G ŵ_AB^g Z^{g′} m^g(ρ̂_AB) = 0    (1.3.30)

where:

ŵ_AB^g = (∑_{i=1}^n ΔY_{−1,i}′ Z_i) (∑_{i=1}^n Z_i′ m_i(ρ̃) m_i(ρ̃)′ Z_i)^{−1} S^g    (1.3.31)

where ρ̃ is a preliminary consistent estimator and S^g is the matrix of zeros and ones such that S^g Z^{g′} m^g(ρ) = ∑_{i∈g} Z_i′ m_i(ρ), with:

Z_i = [ Y_i0    0       ⋯    0
        0       Y_i1    ⋯    0
        ⋮               ⋱    ⋮
        0       ⋯    0       Y_iT−2 ]    (1.3.32)

In the presence of cross-sectional dependence, our estimator is likely to perform better than the Arellano and Bond estimator even when some of Auxiliary Assumptions 1, 2a and 3a are violated, because our estimator gives non-zero weights to moment conditions obtained from using instruments from neighboring observations. As discussed in previous sections, these instruments may have significant predictive power for the covariates in the differenced equations, so these additional moment conditions can improve the accuracy of the estimator. In addition, our estimator relies on the estimation of only five nuisance parameters to compute weights for all n_g² × T × (T−1)/2 moment conditions available per cluster, whereas the Arellano and Bond estimator relies on the estimation of T × (T−1)/2 weights. When T is relatively large, estimating that many nuisance parameters causes the Arellano and Bond estimator to suffer from poor small sample properties in terms of bias, precision and inference, as studied in the context of cross-sectional independence in Alvarez and Arellano (2003) and Windmeijer (2005). Because our estimator makes use of few nuisance parameters, it will have good properties in finite samples even when T is relatively large. A formal derivation of the asymptotic properties of our estimator when both n and T grow unboundedly is left for future research. As a result of both giving non-zero weights to useful moment conditions and using nuisance parameters parsimoniously, the Monte Carlo simulations presented in Section 1.5 show that our estimator has significantly better small sample properties than the Arellano and Bond estimator in terms of efficiency and quality of inference, particularly in cases with cross-sectional dependence but also without cross-sectional dependence.
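The iterated estimator in (1.3.27) alternates between updating the estimated instruments at the current value of ρ and re-solving the exactly identified moment condition. A stylized, hypothetical sketch with T = 3, cross-sectionally independent data and a single toy instrument whose form depends on the trial value of ρ (this is a toy version, not the chapter's full instrument construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho0 = 20000, 0.5

# AR(1) panel with stationary start
c = rng.normal(size=n)
y0 = c / (1 - rho0) + rng.normal(size=n) / np.sqrt(1 - rho0 ** 2)
y1 = rho0 * y0 + c + rng.normal(size=n)
y2 = rho0 * y1 + c + rng.normal(size=n)
dy1, dy2 = y1 - y0, y2 - y1

def instrument(rho):
    # toy "estimated instrument": linear prediction of dy1 from y0 at a trial rho
    return (rho - 1) * y0

def solve_given(z):
    # exactly identified IV for the differenced equation dy2 = rho * dy1 + du2
    return (z * dy2).sum() / (z * dy1).sum()

rho = 0.0  # crude starting value
for _ in range(50):
    rho_new = solve_given(instrument(rho))
    if abs(rho_new - rho) < 1e-12:
        break
    rho = rho_new
print(rho)
```

In this toy case the instrument is just a scalar multiple of y_0, so the exactly identified estimate is invariant to rescaling and the iteration settles after one update; with the chapter's instruments, ρ̈ enters non-linearly through μ̂_t^{gc} and Φ̂^{g,−1/2}, so several iterations may be needed.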
So-called system GMM estimators, presented in Ahn and Schmidt (1995), Arellano and Bover (1995), and Blundell and Bond (1998), are similar to the Arellano and Bond estimator but use additional moment conditions based on additional assumptions of homoscedasticity, no serial correlation, or stationary initial conditions. Since our estimator is based only on the mean independence of transitory shocks conditional on past outcomes, it is more robust than the estimators presented in Ahn and Schmidt (1995), Arellano and Bover (1995) or Blundell and Bond (1998).

1.4 Models with Covariates

Auxiliary assumptions similar to those of the previous section can be used to model optimal instruments for models with covariates. In this section we consider a model that allows for some of the covariates to be strictly exogenous (w_it) and some of the covariates to be sequentially exogenous or contemporaneously endogenous (x_it):

y_it = x_it β_0 + w_it γ_0 + c_i + u_it    t = 1, ..., T    (1.4.1)

E(u_it | Z_t^i, W) = 0    (1.4.2)

where W = [W_1′, ..., W_n′]′, W_i = [w_i1, ..., w_iT], Z_t^i = [z_i1, ..., z_it], and for every random variable x_it^(j) in x_it, either x_it^(j) or x_it−1^(j) is in z_it. (Footnote 9: Note that a special case of this model is the dynamic model we considered in the previous section, where x_it = y_it−1, x_it = z_it, and w_it γ_0 = 0. In most applications, even if x_it includes covariates other than lagged values of the dependent variable, it is expected that y_it−1 will be included in x_it in order to identify the effect of x_it on y_it separately from the dynamic effects in y_it and x_it.) x_it^(j) is said to be sequentially exogenous if it is in z_it. If only x_it−1^(j) is in z_it, x_it^(j) is said to be contemporaneously endogenous. Such a model specification is flexible enough to allow for complex interactions between unobserved factors and covariates of interest; an example is given in Section 1.6. The estimation method presented in this section can be generalized to the case where neither x_it nor x_it−1 is part of z_it but where some other instruments are available, which is also treated in the example given in Section 1.6. As a notational matter, we generalize the notation from the previous section by denoting by x^g the vector [x_{1_g}, ..., x_{(n_g)_g}]′ for any sequence of variables {x_i}_{i=1,...,n}.
A consistent estimator of β_0, γ_0 is obtained from the differenced equation:

Δy_it = Δx_it β_0 + Δw_it γ_0 + Δu_it    t = 2, ..., T    (1.4.3)

E(Δu_it | Z_{t−1}^i, W) = 0    (1.4.4)

To model optimal instruments for estimating β_0 and γ_0 from (1.4.3) and (1.4.4), we make use of the same auxiliary assumption of clustering, i.e. we maintain Auxiliary Assumption 1. We also generalize Auxiliary Assumption 2a so that homoscedasticity and serial correlation are specified conditional on the relevant instruments:

Auxiliary Assumption 2b: For any i, j ∈ g and t, s = 1, ..., T with t ≥ s:

Cov(u_it, u_js | c^g, Z_t^g, W^g) = σ_u²      if i = j, t = s
                                  = τ_u σ_u²   if i ≠ j, t = s
                                  = 0          if t > s

As in the previous section, this assumption guarantees that the optimal instruments will be known linear functions of {E(Δx_it | Z_s^g, W^g)}_{t,s=1,...,T, s≤t−1} and W_i (up to the unknown nuisance parameters σ_u² and τ_u). Therefore, we need to generalize Auxiliary Assumption 3a to obtain a parsimonious model for {E(Δx_it | Z_s^g, W^g)}_{t,s=1,...,T, s≤t−1}. To do so, we can model z_t^g as a VAR process conditional on W^g:

Auxiliary Assumption 3b: Suppose that for any observation i = 1, ..., n:

z_it = Γ z_it−1 + w_it η + d_i + v_it    (1.4.5)

and:

[ d^g   ]         ( [ μ_d(W^g)     ]   [ Σ_d       Σ_dz_0    0
[ z_0^g ] | W^g ∼ N( [ μ_z_0(W^g) ] ,    Σ_dz_0′   Σ_z_0     0
[ v^g   ]         ( [ 0            ]     0         0         Σ_v ] )    (1.4.6)

where v^g = [v_1^{g′}, ..., v_T^{g′}]′. In particular applications, one will impose auxiliary restrictions on μ_d(·), μ_z_0(·), Σ_d, Σ_dz_0, Σ_z_0, Σ_v so that they can be estimated with few enough nuisance parameters.
Auxiliary Assumption 3b implies:

E(z_it | Z_{t−s}^g, W^g) = Γ^s z_it−s + ∑_{r=0}^{s−1} Γ^r (w_it−r η + E(d_i | Z_{t−s}^g, W^g))    (1.4.7)

and:

E(d_i | Z_t^g, W^g) = E(d_i | z_0^g, W^g, d^g + v_1^g, ..., d^g + v_t^g)    (1.4.8)

which can be derived from Auxiliary Assumption 3b as was done in the previous section. For any covariate x_it^(j), either x_it^(j) is in z_it or x_it−1^(j) is in z_it; therefore Auxiliary Assumption 3b yields a model for E(x_it | Z_s^g, W^g) for all s ≤ t, and hence for E(Δx_it | Z_s^g, W^g) for all s ≤ t − 1, as a function of the nuisance parameters in Auxiliary Assumption 3b. Therefore, under (1.4.1), (1.4.2) and Auxiliary Assumptions 1, 2b and 3b, one can find a parametric model for the optimal instruments for estimating β_0 and γ_0. A feasible version of these instruments can be obtained from a preliminary estimator of (β_0, η_0) as in the previous section. One can also use an iterated version of this feasible estimator in order to obtain an estimator with better performance in small samples.

1.5 Monte Carlo Simulations

In this section, we study the small sample properties of the estimator we propose using Monte Carlo simulations. Consider the following simple data generating process for a model with cluster correlation and without covariates:

n_g ∼ Poisson(α) + 1
c^g ∼ F_c
y_0^g | c^g ∼ Normal(μ_0(c^g), Σ_0(c^g))
y_t^g | c^g, y_{t−1}^g, ..., y_0^g ∼ Normal(c^g + ρ y_{t−1}^g, Σ_u(c^g, y_{t−1}^g, ..., y_0^g))

We compare the properties of three estimators of ρ: the estimator defined in Arellano and Bond (1991), which we call the AB estimator; the estimator defined by (1.3.27), which we denote by Estimator 1; and the estimator defined by (1.3.27) but with the estimated within-cluster correlations replaced by zero, which we denote by Estimator 2. As a benchmark for comparison, we also show the results from using an unfeasible optimal estimator (UO), which is optimal in the class of estimators that use linear functions of the instruments.
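The data generating process above can be sketched in a few lines of numpy; the version below follows the Scenario-1-style design with equi-correlated normal heterogeneity and shocks (α, ρ, T and the number of clusters are illustrative values, not the chapter's full grid):

```python
import numpy as np

rng = np.random.default_rng(3)

def equicorr(n, tau):
    """n x n matrix with ones on the diagonal and tau off the diagonal."""
    return tau * np.ones((n, n)) + (1.0 - tau) * np.eye(n)

def simulate_cluster(alpha, rho, T):
    """One cluster; returns a (T+1) x n_g array of outcomes."""
    n_g = rng.poisson(alpha) + 1
    L = np.linalg.cholesky(equicorr(n_g, 0.5))
    c = L @ rng.normal(size=n_g)                      # Fc = Normal(0, equicorr(0.5))
    y = np.empty((T + 1, n_g))
    # stationary initial condition: mean c/(1-rho), variance Sigma_u/(1-rho^2)
    y[0] = c / (1 - rho) + (L @ rng.normal(size=n_g)) / np.sqrt(1 - rho ** 2)
    for t in range(1, T + 1):
        y[t] = c + rho * y[t - 1] + L @ rng.normal(size=n_g)   # Sigma_u = equicorr(0.5)
    return y

clusters = [simulate_cluster(alpha=4, rho=0.8, T=5) for _ in range(100)]
print(len(clusters), clusters[0].shape)
```

Drawing the cluster shocks as L times an i.i.d. standard normal vector, with L the Cholesky factor of the equi-correlation matrix, is a standard way to impose the within-cluster correlation of 0.5.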
This estimator weights optimally all available moment conditions that use linear instruments, using the true unobserved optimal weights, so that it is defined by:

∑_{g=1}^G w^g Z^{g′} m^g(ρ̂_UO) = 0    (1.5.1)

w^g = Δ^{g′} (W^g)^{−1}    (1.5.2)

Δ^g = E(Z^{g′} ∂m^g/∂ρ)    (1.5.3)

W^g = E(Z^{g′} m^g m^{g′} Z^g)    (1.5.4)

When Auxiliary Assumptions 1, 2a and 3a hold, the UO estimator is the same as the estimator defined by (1.3.4) and is efficient in the class of estimators using any function of the instruments. When these assumptions hold, Estimator 1 and the unfeasible optimal estimator are also asymptotically equivalent, so that, in small samples, the difference in their performance is due to the extra noise in Estimator 1 from estimating the nuisance parameters needed. (Footnote 10: In most of the scenarios we simulate, transitory shocks are homoscedastic and serially uncorrelated and the dependent variable is stationary, so that the additional moment conditions presented in Arellano and Bover (1995), Ahn and Schmidt (1995) or Blundell and Bond (1998) hold. We do not present estimators that use these moment conditions, however, since we are interested in studying the properties of estimators that are robust to these moment conditions being false.) When Auxiliary Assumptions 2a or 3a are violated, the unfeasible optimal estimator is asymptotically more efficient than Estimator 1. Estimator 1 is asymptotically more efficient than the AB estimator or Estimator 2 when there exists cross-sectional dependence and Auxiliary Assumptions 1, 2a and 3a hold. When Auxiliary Assumptions 1, 2a and 3a hold and there is no cross-sectional dependence, the AB estimator, Estimator 1 and Estimator 2 have the same asymptotic variance.
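The weighting in (1.5.2)-(1.5.4) can be illustrated with sample analogues in a stylized over-identified problem: cross-sectionally independent data with T = 3, instrument y_0 for the t = 2 differenced equation and (y_0, y_1) for t = 3, and the weight matrix evaluated at the true ρ, mirroring the "unfeasible" benchmark (all values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho0 = 20000, 0.5
c = rng.normal(size=n)
y0 = c / (1 - rho0) + rng.normal(size=n) / np.sqrt(1 - rho0 ** 2)
y1 = rho0 * y0 + c + rng.normal(size=n)
y2 = rho0 * y1 + c + rng.normal(size=n)
y3 = rho0 * y2 + c + rng.normal(size=n)

# Three moments: y0*(dy2 - rho*dy1), y0*(dy3 - rho*dy2), y1*(dy3 - rho*dy2)
dy = np.column_stack([y2 - y1, y3 - y2])
dylag = np.column_stack([y1 - y0, y2 - y1])
Zm_y = np.column_stack([y0 * dy[:, 0], y0 * dy[:, 1], y1 * dy[:, 1]])
Zm_x = np.column_stack([y0 * dylag[:, 0], y0 * dylag[:, 1], y1 * dylag[:, 1]])

m0 = Zm_y - rho0 * Zm_x                 # moment contributions at the true rho
Delta = Zm_x.mean(axis=0)               # -E(Z' dm/drho), up to a sign that cancels
W = (m0[:, :, None] * m0[:, None, :]).mean(axis=0)
w = np.linalg.solve(W, Delta)           # optimal linear combination of the moments
rho_hat = (w @ Zm_y.sum(axis=0)) / (w @ Zm_x.sum(axis=0))
print(rho_hat)
```

Solving the single weighted moment condition w′∑(Zm_y − ρ Zm_x) = 0 for ρ gives the ratio in the last line, the exactly identified representation of the optimally weighted GMM estimator.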
When Auxiliary Assumptions 2a or 3a are violated and there is no cross-sectional dependence, the AB estimator has a smaller asymptotic variance than Estimators 1 and 2, but, in finite samples, Estimator 1 or Estimator 2 might still have better properties than the AB estimator because they make use of fewer nuisance parameters. When Auxiliary Assumptions 2a or 3a are violated and there is cross-sectional dependence, which of the AB estimator and Estimator 1 has the smaller asymptotic variance depends on the data generating process, but we expect Estimator 1 to perform better since, by making use of instruments from other observations in the cluster, it should use a weighted sum of moment conditions that is closer to optimal than the sum used for the AB estimator.

For inference for the AB estimator, we consider GMM robust clustered standard errors, with and without the finite sample correction proposed by Windmeijer (2005). For inference for Estimators 1 and 2, we use the standard errors defined in (1.3.22), which only require (1.2.1), (1.2.2) and Auxiliary Assumption 1 to hold in order to be consistent.

We study the small sample properties of the estimators in three different scenarios: within-cluster equi-correlation, cross-sectional independence, and general within-cluster correlation with unobserved heterogeneity that does not have a normal distribution. Scenarios 1 and 2 correspond to Auxiliary Assumptions 1, 2a and 3a holding; in Scenario 1 there is cross-sectional dependence and in Scenario 2 there is none. Scenario 3 corresponds to only Auxiliary Assumption 1 holding.

More precisely, let R(τ) denote the n_g × n_g equi-correlation matrix with ones on the diagonal and τ everywhere off the diagonal. Scenario 1 uses the following parameterization:

F_c = Normal(0, R(0.5))
Σ_u(c^g, y_{t−1}^g, ..., y_0^g) = R(0.5)
μ_0(c^g) = c^g/(1 − ρ_0)
Σ_0(c^g) = (1/(1 − ρ_0²)) Σ_u(c^g, y_{t−1}^g, ..., y_0^g)

Scenario 2 uses:

F_c = Normal(0, I_{n_g})
Σ_u(c^g, y_{t−1}^g, ..., y_0^g) = I_{n_g}
μ_0(c^g) = c^g/(1 − ρ_0)
Σ_0(c^g) = (1/(1 − ρ_0²)) Σ_u(c^g, y_{t−1}^g, ..., y_0^g)

And Scenario 3 uses:

F_c = LogNormal(0, R(0.5))
Σ_u(c^g, y_{t−1}^g, ..., y_0^g) = the n_g × n_g matrix with (k, l) entry equal to u_{k_g,t−1}² if k = l and to 0.5 u_{k_g,t−1} u_{l_g,t−1} if k ≠ l
μ_0(c^g) = c^g/(1 − ρ_0)
Σ_0(c^g) = (1/(1 − ρ_0²)) R(0.5)

All Monte Carlo results were obtained using 1,000 replications. Because Estimators 1 and 2 are iterated versions of our estimator, we present results from simulations conditional on Estimators 1 and 2 converging. Table 1.1 shows the number of replications in which all estimators converged, which represents all or almost all draws except when T = 5, G = 100 and ρ = 0.8; in that case Estimator 1 or Estimator 2 did not converge in 15%-22% of the replications, depending on the scenario. In particular applications, convergence of the iterated Estimators 1 and 2 will depend on the particular numerical algorithm chosen and on properties of the data. For instance, in the application presented in Section 1.6, convergence was achieved in just a few iterations even though T = 3. Tables 1.2, 1.3 and 1.4 show the results for the four estimators considered in terms of bias, standard deviation and root mean squared error for a value of ρ of 0.8. Table 1.2 shows results for the case where there is equi-correlation within clusters (Scenario 1), Table 1.3 the case where there is no cross-sectional correlation (Scenario 2) and Table 1.4 the case where there is heteroscedasticity and cross-sectional correlation (Scenario 3).
The first conclusion from these three tables is that Estimators 1 and 2 exhibit virtually no bias compared to the AB estimator. Estimator 1 also has significantly smaller standard deviations when there is cross-sectional correlation (Scenarios 1 and 3). Both of these features of our estimator result in significantly smaller values of the mean squared error. The smaller standard deviations of our estimator are due to the use of instruments from other observations in the cluster, which are relevant in the presence of cross-sectional dependence. The low bias is attributable to our estimators using very few nuisance parameters compared to the AB estimator. The improvement of Estimator 1 over the AB estimator is particularly striking when T is large and G is small, which is when the AB estimator uses the most nuisance parameters relative to the sample size. When there is no within-cluster correlation (Scenario 2), Estimators 1 and 2 have standard deviations only slightly lower than the AB estimator, so that the decrease in RMSE of Estimators 1 and 2 compared to the AB estimator is mostly due to the elimination of the bias. In Scenario 3, where the unfeasible optimal estimator is asymptotically more efficient than Estimator 1, Estimator 1 performs very closely to the unfeasible optimal estimator, which shows that the approximation of the optimal weighted sum of moment conditions used by Estimator 1 is good in this case.

Tables 1.5, 1.6 and 1.7 show results in terms of bias in standard errors (captured by the ratio of the mean of the standard errors over the standard deviation of the estimators), coverage of 95% confidence intervals, and average length of 95% confidence intervals. All three tables show that standard errors for the AB estimator without the Windmeijer correction are seriously downward biased, particularly when T is large, resulting in very low coverage of 95% confidence intervals (as low as 48%).
The Windmeijer correction yields unbiased standard errors for the AB estimator, but the resulting confidence intervals still have low coverage because of the bias in the AB estimator of ρ. The standard errors for Estimators 1 and 2 are unbiased, and the resulting confidence intervals have the correct coverage of 95%. Because our estimators have smaller standard deviations than the AB estimator, the average length of their 95% confidence intervals is also smaller than that of the AB estimator, so that our estimators have confidence intervals that are both tighter and have the correct coverage.

Tables 1.8-1.13 show the same results for ρ_0 = 0.5. Estimators 1 and 2 show similar improvements over the AB estimator, but slightly less markedly since, with this lower level of persistence, the instruments used by the AB estimator are not as weak as when ρ_0 = 0.8, so that there is less to gain compared to the unfeasible optimal estimator.

Table 1.1: Number of replications where all estimators converged (out of 1,000)

                        ρ = 0.8                              ρ = 0.5
             Scenario 1  Scenario 2  Scenario 3   Scenario 1  Scenario 2  Scenario 3
T=5,  G=100      802         854         781         1000         999        1000
      G=200      906         935         867         1000        1000        1000
      G=400      977         976         942         1000        1000        1000
T=10, G=100      998         992         989         1000         999        1000
      G=200     1000         999        1000         1000        1000        1000
      G=400     1000        1000        1000         1000        1000        1000
T=15, G=100     1000         999        1000         1000         999        1000
      G=200     1000        1000        1000         1000        1000        1000
      G=400      995        1000        1000         1000        1000        1000

Table 1.2: Bias and RMSE, ρ = .8, equi-correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.031  −0.158  −0.037  −0.045
              sd      0.121   0.176   0.137   0.191
              rmse    0.125   0.236   0.142   0.196
T=5,  G=200   bias   −0.018  −0.087  −0.018  −0.025
              sd      0.089   0.127   0.093   0.134
              rmse    0.091   0.154   0.095   0.136
T=5,  G=400   bias   −0.001  −0.033  −0.002   0.000
              sd      0.064   0.092   0.065   0.097
              rmse    0.064   0.098   0.065   0.097
T=10, G=100   bias    0.001  −0.060   0.001   0.001
              sd      0.047   0.067   0.048   0.068
              rmse    0.047   0.090   0.048   0.068
T=10, G=200   bias   −0.001  −0.033  −0.001  −0.003
              sd      0.034   0.048   0.034   0.047
              rmse    0.034   0.058   0.034   0.047
T=10, G=400   bias    0.001  −0.016   0.001   0.000
              sd      0.024   0.035   0.024   0.034
              rmse    0.024   0.038   0.024   0.034
T=15, G=100   bias   −0.001  −0.041  −0.000  −0.003
              sd      0.028   0.038   0.028   0.037
              rmse    0.028   0.056   0.028   0.037
T=15, G=200   bias    0.000  −0.022   0.000  −0.001
              sd      0.018   0.027   0.018   0.026
              rmse    0.018   0.034   0.018   0.026
T=15, G=400   bias    0.000  −0.010   0.000   0.000
              sd      0.013   0.019   0.013   0.018
              rmse    0.013   0.021   0.013   0.018

Table 1.3: Bias and RMSE, ρ = .8, no correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.028  −0.097  −0.030  −0.034
              sd      0.121   0.119   0.135   0.127
              rmse    0.124   0.153   0.138   0.131
T=5,  G=200   bias   −0.013  −0.046  −0.013  −0.014
              sd      0.093   0.091   0.097   0.094
              rmse    0.093   0.102   0.098   0.095
T=5,  G=400   bias   −0.001  −0.020  −0.001  −0.002
              sd      0.063   0.063   0.066   0.065
              rmse    0.063   0.066   0.066   0.065
T=10, G=100   bias    0.000  −0.033   0.000  −0.000
              sd      0.046   0.050   0.047   0.047
              rmse    0.046   0.060   0.047   0.047
T=10, G=200   bias   −0.001  −0.018  −0.001  −0.002
              sd      0.033   0.035   0.033   0.033
              rmse    0.033   0.039   0.033   0.033
T=10, G=400   bias    0.001  −0.008   0.001   0.001
              sd      0.024   0.025   0.024   0.024
              rmse    0.024   0.026   0.024   0.024
T=15, G=100   bias   −0.001  −0.023  −0.001  −0.001
              sd      0.028   0.031   0.028   0.028
              rmse    0.028   0.038   0.028   0.028
T=15, G=200   bias    0.000  −0.011   0.000   0.000
              sd      0.018   0.020   0.018   0.018
              rmse    0.018   0.023   0.018   0.018
T=15, G=400   bias    0.000  −0.005   0.000   0.000
              sd      0.013   0.013   0.013   0.013
              rmse    0.013   0.014   0.013   0.013

Table 1.4: Bias and RMSE, ρ = .8, heteroscedasticity and correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.027  −0.226  −0.065  −0.049
              sd      0.183   0.218   0.271   0.367
              rmse    0.185   0.314   0.279   0.370
T=5,  G=200   bias   −0.025  −0.140  −0.027  −0.033
              sd      0.125   0.166   0.135   0.203
              rmse    0.128   0.217   0.137   0.206
T=5,  G=400   bias   −0.007  −0.069  −0.008  −0.010
              sd      0.084   0.121   0.085   0.131
              rmse    0.085   0.139   0.086   0.131
T=10, G=100   bias    0.001  −0.075  −0.000   0.000
              sd      0.055   0.075   0.056   0.074
              rmse    0.055   0.106   0.056   0.074
T=10, G=200   bias    0.000  −0.042   0.000  −0.000
              sd      0.039   0.055   0.039   0.054
              rmse    0.039   0.069   0.039   0.054
T=10, G=400   bias    0.001  −0.019   0.001   0.001
              sd      0.027   0.039   0.027   0.038
              rmse    0.027   0.043   0.027   0.038
T=15, G=100   bias   −0.001  −0.046  −0.000  −0.002
              sd      0.031   0.041   0.030   0.040
              rmse    0.031   0.062   0.030   0.040
T=15, G=200   bias    0.001  −0.024   0.001   0.000
              sd      0.020   0.029   0.020   0.028
              rmse    0.020   0.037   0.020   0.028
T=15, G=400   bias    0.001  −0.012   0.000   0.000
              sd      0.015   0.021   0.014   0.020
              rmse    0.015   0.024   0.014   0.020

Table 1.5: Inference, ρ = .8, equi-correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.121  0.788  1.051  1.016  1.002
              coverage  0.969  0.763  0.895  0.964  0.959
              length    0.539  0.550  0.749  0.552  0.760
T=5,  G=200   ratio     1.063  0.834  1.045  1.024  0.978
              coverage  0.956  0.796  0.917  0.951  0.956
              length    0.375  0.417  0.525  0.376  0.518
T=5,  G=400   ratio     1.073  0.863  1.029  1.042  0.977
              coverage  0.976  0.888  0.939  0.969  0.954
              length    0.268  0.313  0.376  0.268  0.372
T=10, G=100   ratio     0.987  0.661  1.004  0.962  0.943
              coverage  0.948  0.653  0.873  0.952  0.949
              length    0.184  0.176  0.270  0.184  0.254
T=10, G=200   ratio     0.983  0.734  1.003  0.975  0.961
              coverage  0.949  0.760  0.905  0.949  0.944
              length    0.130  0.139  0.194  0.130  0.179
T=10, G=400   ratio     0.974  0.766  0.975  0.968  0.943
              coverage  0.953  0.835  0.922  0.952  0.939
              length    0.092  0.104  0.134  0.091  0.127
T=15, G=100   ratio     0.941  0.561  1.015  0.939  0.983
              coverage  0.941  0.480  0.815  0.943  0.948
              length    0.105  0.084  0.140  0.105  0.144
T=15, G=200   ratio     1.025  0.688  1.053  1.022  1.006
              coverage  0.961  0.696  0.916  0.958  0.951
              length    0.074  0.072  0.110  0.074  0.102
T=15, G=400   ratio     1.012  0.763  1.024  1.013  1.006
              coverage  0.952  0.823  0.925  0.957  0.944
              length    0.052  0.057  0.078  0.052  0.072

Table 1.6: Inference, ρ = .8, no correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.140  1.042  1.142  1.013  1.060
              coverage  0.978  0.883  0.923  0.959  0.960
              length    0.548  0.491  0.558  0.542  0.533
T=5,  G=200   ratio     1.038  1.000  1.053  0.984  1.004
              coverage  0.961  0.909  0.932  0.954  0.948
              length    0.379  0.359  0.383  0.377  0.372
T=5,  G=400   ratio     1.082  1.052  1.073  1.040  1.041
              coverage  0.975  0.947  0.951  0.968  0.965
              length    0.269  0.262  0.271  0.268  0.267
T=10, G=100   ratio     1.006  0.820  1.007  0.974  0.975
              coverage  0.958  0.836  0.910  0.951  0.952
              length    0.185  0.163  0.204  0.182  0.182
T=10, G=200   ratio     0.990  0.890  0.971  0.981  0.979
              coverage  0.950  0.884  0.918  0.951  0.949
              length    0.130  0.122  0.142  0.129  0.129
T=10, G=400   ratio     0.976  0.920  0.975  0.964  0.965
              coverage  0.952  0.914  0.932  0.950  0.950
              length    0.092  0.089  0.097  0.091  0.091
T=15, G=100   ratio     0.949  0.686  0.959  0.935  0.937
              coverage  0.948  0.730  0.866  0.943  0.944
              length    0.106  0.084  0.105  0.105  0.105
T=15, G=200   ratio     1.027  0.846  1.021  1.022  1.020
              coverage  0.962  0.860  0.925  0.957  0.958
              length    0.074  0.066  0.082  0.074  0.074
T=15, G=400   ratio     1.011  0.935  1.033  1.008  1.009
              coverage  0.952  0.902  0.945  0.955  0.956
              length    0.052  0.050  0.058  0.052  0.052

Table 1.7: Inference, ρ = .8, heteroscedasticity and correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.027  0.773  1.088  0.834  0.812
              coverage  0.963  0.666  0.872  0.949  0.940
              length    0.747  0.669  0.943  0.897  1.181
T=5,  G=200   ratio     1.028  0.801  1.033  0.943  0.854
              coverage  0.945  0.736  0.884  0.955  0.934
              length    0.507  0.523  0.680  0.501  0.684
T=5,  G=400   ratio     1.078  0.845  1.027  1.036  0.919
              coverage  0.969  0.831  0.901  0.960  0.947
              length    0.358  0.402  0.491  0.348  0.474
T=10, G=100   ratio     0.973  0.652  1.009  0.937  0.954
              coverage  0.948  0.608  0.857  0.942  0.954
              length    0.214  0.193  0.301  0.208  0.280
T=10, G=200   ratio     0.968  0.711  0.982  0.951  0.936
              coverage  0.934  0.729  0.899  0.932  0.929
              length    0.150  0.155  0.218  0.146  0.199
T=10, G=400   ratio     0.980  0.772  0.988  0.968  0.942
              coverage  0.953  0.829  0.926  0.947  0.943
              length    0.106  0.117  0.152  0.103  0.141
T=15, G=100   ratio     0.955  0.537  1.002  0.939  0.965
              coverage  0.940  0.478  0.800  0.940  0.951
              length    0.117  0.088  0.149  0.113  0.152
T=15, G=200   ratio     1.016  0.681  1.028  1.006  0.996
              coverage  0.964  0.700  0.897  0.952  0.947
              length    0.082  0.077  0.118  0.079  0.109
T=15, G=400   ratio     0.977  0.746  1.004  0.988  0.978
              coverage  0.945  0.810  0.914  0.943  0.945
              length    0.058  0.061  0.085  0.056  0.077

Table 1.8: Bias and RMSE, ρ = .5, equi-correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.001  −0.022   0.000   0.002
              sd      0.063   0.087   0.063   0.085
              rmse    0.063   0.090   0.063   0.085
T=5,  G=200   bias   −0.002  −0.015  −0.002  −0.004
              sd      0.045   0.064   0.045   0.064
              rmse    0.045   0.066   0.045   0.064
T=5,  G=400   bias   −0.000  −0.006   0.000   0.000
              sd      0.031   0.044   0.031   0.044
              rmse    0.031   0.045   0.031   0.044
T=10, G=100   bias    0.001  −0.015   0.001   0.002
              sd      0.029   0.041   0.029   0.040
              rmse    0.029   0.044   0.029   0.040
T=10, G=200   bias   −0.001  −0.008  −0.000  −0.001
              sd      0.021   0.029   0.021   0.028
              rmse    0.021   0.030   0.021   0.028
T=10, G=400   bias    0.000  −0.004   0.000  −0.000
              sd      0.014   0.021   0.014   0.020
              rmse    0.014   0.021   0.014   0.020
T=15, G=100   bias    0.000  −0.014   0.000  −0.001
              sd      0.020   0.028   0.020   0.027
              rmse    0.020   0.031   0.020   0.027
T=15, G=200   bias    0.000  −0.007   0.000  −0.000
              sd      0.013   0.019   0.013   0.018
              rmse    0.013   0.020   0.013   0.018
T=15, G=400   bias    0.000  −0.003   0.000   0.000
              sd      0.010   0.014   0.010   0.013
              rmse    0.010   0.014   0.010   0.013

Table 1.9: Bias and RMSE, ρ = .5, no correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.001  −0.013  −0.000  −0.001
              sd      0.063   0.063   0.062   0.063
              rmse    0.063   0.065   0.062   0.063
T=5,  G=200   bias   −0.002  −0.008  −0.002  −0.002
              sd      0.045   0.045   0.045   0.045
              rmse    0.045   0.046   0.045   0.045
T=5,  G=400   bias   −0.000  −0.003   0.000  −0.000
              sd      0.031   0.031   0.031   0.031
              rmse    0.031   0.031   0.031   0.031
T=10, G=100   bias    0.001  −0.008   0.001   0.001
              sd      0.029   0.031   0.029   0.029
              rmse    0.029   0.032   0.029   0.029
T=10, G=200   bias   −0.001  −0.005  −0.000  −0.001
              sd      0.021   0.022   0.021   0.021
              rmse    0.021   0.022   0.021   0.021
T=10, G=400   bias    0.000  −0.002   0.000   0.000
              sd      0.014   0.015   0.014   0.014
              rmse    0.014   0.015   0.014   0.014
T=15, G=100   bias    0.000  −0.007   0.000   0.000
              sd      0.020   0.022   0.020   0.020
              rmse    0.020   0.023   0.020   0.020
T=15, G=200   bias    0.000  −0.003   0.000   0.000
              sd      0.013   0.014   0.013   0.013
              rmse    0.013   0.015   0.013   0.013
T=15, G=400   bias    0.000  −0.002   0.000   0.000
              sd      0.010   0.010   0.010   0.010
              rmse    0.010   0.010   0.010   0.010

Table 1.10: Bias and RMSE, ρ = .5, heteroscedasticity and correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias    0.001  −0.035   0.001   0.003
              sd      0.079   0.109   0.076   0.103
              rmse    0.079   0.114   0.076   0.103
T=5,  G=200   bias   −0.003  −0.022  −0.002  −0.004
              sd      0.058   0.080   0.055   0.078
              rmse    0.058   0.083   0.055   0.078
T=5,  G=400   bias   −0.001  −0.012  −0.001  −0.002
              sd      0.039   0.057   0.038   0.054
              rmse    0.039   0.058   0.038   0.054
T=10, G=100   bias    0.001  −0.017   0.001   0.002
              sd      0.033   0.045   0.032   0.043
              rmse    0.033   0.048   0.032   0.043
T=10, G=200   bias    0.000  −0.009   0.000   0.000
              sd      0.023   0.032   0.023   0.031
              rmse    0.023   0.034   0.023   0.031
T=10, G=400   bias    0.001  −0.004   0.001   0.000
              sd      0.016   0.023   0.015   0.022
              rmse    0.016   0.023   0.016   0.022
T=15, G=100   bias    0.000  −0.015   0.000  −0.001
              sd      0.022   0.030   0.021   0.028
              rmse    0.022   0.033   0.021   0.028
T=15, G=200   bias    0.001  −0.006   0.001   0.001
              sd      0.014   0.021   0.014   0.020
              rmse    0.014   0.022   0.014   0.020
T=15, G=400   bias    0.000  −0.004   0.000   0.000
              sd      0.011   0.015   0.010   0.014
              rmse    0.011   0.015   0.010   0.014

Table 1.11: Inference, ρ = .5, equi-correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.009  0.818  1.034  1.004  1.014
              coverage  0.959  0.875  0.948  0.957  0.951
              length    0.251  0.282  0.362  0.250  0.344
T=5,  G=200   ratio     0.985  0.805  0.968  0.978  0.956
              coverage  0.945  0.875  0.943  0.944  0.942
              length    0.175  0.204  0.249  0.175  0.242
T=5,  G=400   ratio     1.030  0.851  1.000  1.024  0.993
              coverage  0.961  0.907  0.950  0.960  0.950
              length    0.124  0.148  0.175  0.124  0.172
T=10, G=100   ratio     0.969  0.682  0.986  0.969  0.974
              coverage  0.931  0.795  0.933  0.932  0.939
              length    0.112  0.111  0.165  0.112  0.153
T=10, G=200   ratio     0.973  0.746  0.977  0.975  0.987
              coverage  0.944  0.841  0.935  0.942  0.948
              length    0.079  0.086  0.117  0.079  0.110
T=10, G=400   ratio     0.997  0.790  0.979  0.997  0.969
              coverage  0.952  0.873  0.946  0.950  0.946
              length    0.056  0.064  0.081  0.056  0.077
T=15, G=100   ratio     0.964  0.561  1.002  0.961  0.983
              coverage  0.945  0.666  0.925  0.943  0.950
              length    0.076  0.062  0.104  0.076  0.104
T=15, G=200   ratio     1.016  0.696  1.052  1.020  1.018
              coverage  0.952  0.795  0.951  0.955  0.950
              length    0.053  0.053  0.079  0.053  0.074
T=15, G=400   ratio     0.993  0.765  1.008  0.994  0.996
              coverage  0.949  0.862  0.939  0.951  0.942
              length    0.038  0.041  0.056  0.038  0.052

Table 1.12: Inference, ρ = .5, no correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.015  0.960  1.022  1.007  1.004
              coverage  0.960  0.928  0.947  0.952  0.953
              length    0.252  0.242  0.264  0.249  0.249
T=5,  G=200   ratio     0.990  0.960  0.990  0.978  0.980
              coverage  0.944  0.936  0.938  0.943  0.943
              length    0.176  0.172  0.180  0.175  0.175
T=5,  G=400   ratio     1.032  1.010  1.027  1.026  1.026
              coverage  0.960  0.954  0.955  0.962  0.963
              length    0.125  0.123  0.126  0.124  0.124
T=10, G=100   ratio     0.976  0.828  0.985  0.968  0.968
              coverage  0.938  0.880  0.942  0.928  0.932
              length    0.113  0.102  0.125  0.111  0.111
T=10, G=200   ratio     0.973  0.885  0.948  0.977  0.975
              coverage  0.945  0.917  0.927  0.943  0.942
              length    0.079  0.075  0.087  0.079  0.079
T=10, G=400   ratio     0.999  0.950  0.991  0.994  0.996
              coverage  0.952  0.938  0.952  0.951  0.951
              length    0.056  0.054  0.059  0.056  0.056
T=15, G=100   ratio     0.970  0.701  0.970  0.961  0.962
              coverage  0.952  0.811  0.931  0.941  0.942
              length    0.076  0.061  0.077  0.076  0.076
T=15, G=200   ratio     1.018  0.852  1.017  1.020  1.018
              coverage  0.956  0.906  0.948  0.955  0.954
              length    0.054  0.048  0.059  0.053  0.053
T=15, G=400   ratio     0.996  0.930  1.012  0.995  0.995
              coverage  0.950  0.926  0.955  0.949  0.950
              length    0.038  0.036  0.041  0.038  0.038

Table 1.13: Inference, ρ = .5, heteroscedasticity and correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio
1.019 0.798 1.025 1.016 0.992 coverage 0.955 0.857 0.935 0.959 0.951 length 0.320 0.345 0.448 0.306 0.407 ratio 0.983 0.809 0.977 0.994 0.943 coverage 0.948 0.864 0.936 0.953 0.944 length 0.223 0.255 0.313 0.215 0.290 ratio 1.018 0.828 0.985 1.020 0.982 coverage 0.961 0.883 0.928 0.962 0.946 length 0.157 0.185 0.220 0.152 0.207 ratio 0.956 0.680 0.983 0.957 0.968 coverage 0.938 0.795 0.929 0.930 0.949 length 0.126 0.120 0.179 0.122 0.164 ratio 0.962 0.739 0.971 0.961 0.971 coverage 0.937 0.840 0.931 0.935 0.942 length 0.088 0.094 0.128 0.086 0.117 ratio 0.997 0.793 0.984 0.995 0.969 coverage 0.952 0.877 0.945 0.953 0.942 length 0.062 0.070 0.089 0.061 0.083 ratio 0.957 0.542 0.988 0.956 0.974 coverage 0.945 0.650 0.915 0.939 0.944 length 0.082 0.064 0.109 0.080 0.108 ratio 1.022 0.678 0.999 1.026 0.988 coverage 0.961 0.807 0.935 0.960 0.940 length 0.058 0.055 0.083 0.056 0.077 ratio 0.971 0.758 1.002 0.980 0.983 coverage 0.946 0.858 0.939 0.950 0.942 length 0.041 0.043 0.059 0.040 0.055 T=10 G=100 G=200 G=400 T=15 G=100 G=200 G=400 35 1.6 Application: Estimation of Persistence in Student Achievement In this section, we are interested in estimating the effect of attending private schools on student achievement in the province of Punjab, Pakistan. In a non-experimental framework, estimating the causal effects of some factors on student achievement requires accounting for factors that affected student achievements in previous time periods since these factors might affect students’ learning ability in the future but also be correlated across time. A model for studying the effect of some factor x on student achievement y can be written as in Andrabi et al. (2011) in its summary of the work in Todd and Wolpin (2003): t−1 yit = ∑ t−1 α j xit− j + j=0 ∑ θ j µit− j (1.6.1) j=0 where µt are unobserved shocks to student achievement. 
If one assumes that both {α_j}_{j=1,...,T} and {θ_j}_{j=1,...,T} form geometric series, so that α_j = ρ α_{j−1} and θ_j = ρ θ_{j−1}, we can write:

y_{it} = α x_{it} + ρ y_{it−1} + μ_{it}   (1.6.2)

where θ_0 is normalized to one. In order to account for the possibility that students have unobserved characteristics that affect their ability to learn and are related to other educational inputs in x_{it}, we can decompose μ_{it} into time constant unobserved factors (also called unobserved heterogeneity) and transitory shocks:

μ_{it} = c_i + u_{it}   (1.6.3)

where c_i is arbitrarily related to x_i = [x_{i1}, ..., x_{iT}] and x_{it} is either strictly exogenous:

E(u_{it} | X_i, Y_{it−1}) = 0   (1.6.4)

or sequentially exogenous:

E(u_{it} | X_{it}, Y_{it−1}) = 0   (1.6.5)

or contemporaneously endogenous:

E(u_{it} | X_{it−1}, Y_{it−1}) = 0   (1.6.6)

where X_i = [x_{i1}, ..., x_{iT}] and X_{it−1} = [x_{i1}, ..., x_{it−1}]. In this section we use the data analyzed in Andrabi et al. (2011) to estimate the effect of attending private school on student achievement in three districts of Punjab in Pakistan, so that the input of interest is attendance of a private school. The other covariates included are wealth and variables indicating whether each parent lives with the student. We treat all of these inputs as contemporaneously endogenous, since they likely follow dynamic processes with unobserved transitory shocks that are correlated with shocks to student achievement. For instance, an unobserved and unexpected increase in income might result in a student enrolling in private school while also benefiting from better study conditions at home, so that Cov(u_{it}, privateschool_{it}) ≠ 0, while it is still possible that Cov(u_{it}, privateschool_{it−1}) = 0. Transitory shocks are likely correlated within schools, since school- or class-level unobserved shocks, such as changes in infrastructure, staff or teachers, affect all students within a school or class.
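The step from (1.6.1) to (1.6.2) is the usual quasi-differencing argument; under the geometric-coefficient assumption the lagged terms cancel:

```latex
y_{it} - \rho y_{it-1}
  = \sum_{j=0}^{t-1} \alpha_j x_{it-j} - \sum_{j=0}^{t-2} \rho\,\alpha_j x_{it-1-j}
  + \sum_{j=0}^{t-1} \theta_j \mu_{it-j} - \sum_{j=0}^{t-2} \rho\,\theta_j \mu_{it-1-j}
  = \alpha_0 x_{it} + \theta_0 \mu_{it}
```

since ρα_j = α_{j+1} and ρθ_j = θ_{j+1} make every term but the j = 0 terms cancel after reindexing; with θ_0 = 1 and α_0 written as α this is exactly (1.6.2).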
The data set we use collected between 0 and 25 students per school in each year, with most schools represented by fewer than 10 students, which is too few to estimate time-varying school fixed effects accurately. Instead, we prefer treating u_{it} as cross-sectionally correlated within schools. Unobserved heterogeneity is also likely correlated across students within schools, since students might attend specific schools based on unobserved characteristics, such as residential location, socio-economic characteristics or past achievement, that relate to their performance. As described in the rest of the paper, using this cross-sectional correlation for estimation can result in significant efficiency gains.

For any subject j among English, Urdu and Mathematics (denoted E, U, M), denote by y^j_{it} the grade obtained by student i in year t and subject j, and denote by x_{it} the variable indicating whether student i attended a private school in year t. Let w_{it} be the vector containing the other predetermined explanatory variables for student i at time period t. Denote by u^j_{it} transitory shocks to achievement in subject j and by ε^j_{it} measurement error in that achievement. Also denote by g_{it} the school attended by student i in year t. We assume clustering, so that, in a given year, transitory shocks are independent across schools. We consider a model with measurement error and contemporaneously endogenous covariates. As in Andrabi et al. (2011), we assume that measurement errors are independent across subjects. We can write such a model as:

y^j_{it} = d^j_t + α^j_0 x_{it} + ρ^j_0 y^j_{it−1} + w_{it} β^j_0 + c^j_i + u^j_{it} + ε^j_{it} − ρ^j_0 ε^j_{it−1}   (1.6.7)

E(u^j_{it} | X_{t−1}, Y_{t−1}, W_{t−1}, g_{t−1}) = 0   (1.6.8)

E(ε^j_{it} | X_t, Y^{−j}_t, W_{t−1}, g_{t−1}) = 0   (1.6.9)

where Y_t = {Y^k_t}_{k=E,U,M}, Y^{−j}_t = {Y^k_t}_{k≠j} and d^j_t are time specific intercepts. The first difference with the model used in Andrabi et al. (2011) is that we use predetermined instruments instead of sequentially exogenous instruments. This is a more suitable assumption since, as explained previously, the covariates used in this model are likely jointly determined with achievement. The second difference is that we include as potential instruments the lagged values of the covariates for all observations instead of only X_{it−1}, Y^{−j}_{it−1}, W_{it−1}. As pointed out in Section 1.2, since u^j_{it} is correlated cross-sectionally, it is unlikely that E(u^j_{it} | X_{t−1}, Y_{it−1}, W_{it−1}) = 0 holds without E(u^j_{it} | X_{t−1}, Y_{t−1}, W_{t−1}) = 0 also holding. It could be interesting to introduce peer effects into the model, but we do not consider them here for simplicity and for comparability with the results in Andrabi et al. (2011).

In this application, clusters (school membership) are not time constant and, as pointed out previously, not strictly or sequentially exogenous. Therefore it is possible that:

E(u_{it} | g_{it}, X_{t−1}, Y_{t−1}, W_{t−1}) ≠ 0   (1.6.10)

even though:

E(u_{it} | g_{it−1}, X_{t−1}, Y_{t−1}, W_{t−1}) = 0   (1.6.11)

Hence we can use as instruments the lagged achievements of students from schools in which a student was previously enrolled, but not from the school in which the student is currently enrolled. There are three time periods t = 0, 1, 2 available in the data set, so the only transformed equation for each subject that can be used for estimation is:

Δy^j_{i2} = δ^j_0 + α^j_0 Δx_{i2} + ρ^j_0 Δy^j_{i1} + Δw_{i2} β^j_0 + Δu^j_{i2} + Δε^j_{i2} − ρ^j_0 Δε^j_{i1}   (1.6.12)

E(Δu^j_{i2} | X_0, Y^{−j}_0, W_0, g_0) = 0   (1.6.13)

E(Δε^j_{i2} | X_0, Y^{−j}_0, W_0, g_0) = 0   (1.6.14)

E(Δε^j_{i1} | X_0, Y^{−j}_0, W_0, g_0) = 0   (1.6.15)

where δ^j_0 = d^j_2 − d^j_1. Let φ^j = [δ^j, α^j, ρ^j, β^j]′, m^j_i(φ^j) = Δy^j_{i2} − (δ^j + α^j Δx_{i2} + ρ^j Δy^j_{i1} + Δw_{i2} β^j) and m^j_i = m^j_i(φ^j_0).
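With the transformed moment function m^j_i(φ^j) linear in the parameters and instruments dated at t = 0, the two-step GMM recipe used throughout this section can be sketched as follows. This is a generic sketch, not the exact code behind the reported estimates; the variable names and the simulated design in the test are ours.

```python
import numpy as np

def two_step_gmm(Z, X, y):
    """Two-step linear GMM for moments E[Z_i (y_i - X_i' phi)] = 0.

    Z : (n, q) matrix of instruments (one row per observation)
    X : (n, p) matrix of regressors, e.g. [1, dx_i2, dy_i1, dw_i2]
    y : (n,)   transformed outcome, e.g. dy_i2
    Returns the two-step GMM estimate of phi.
    """
    n = Z.shape[0]
    # First step: weight matrix (Z'Z/n)^{-1}, i.e. a 2SLS-type estimator.
    W1 = np.linalg.inv(Z.T @ Z / n)
    ZX, Zy = Z.T @ X / n, Z.T @ y / n
    phi1 = np.linalg.solve(ZX.T @ W1 @ ZX, ZX.T @ W1 @ Zy)
    # Second step: weight by the inverse of the estimated moment variance,
    # computed from first-step residuals.
    u = y - X @ phi1
    S = (Z * u[:, None]).T @ (Z * u[:, None]) / n
    W2 = np.linalg.inv(S)
    return np.linalg.solve(ZX.T @ W2 @ ZX, ZX.T @ W2 @ Zy)
```

The second-step weighting is what makes the estimator efficient for the given set of unconditional moments; the Arellano and Bond estimator defined next follows this template with its particular choice of Z.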
The Arellano and Bond estimator for this model is defined by:

φ̂^j_{AB} = argmin_{φ^j} ( Σ^n_{i=1} Z^{jAB}_i m^j_i(φ^j) )′ ( Σ^n_{i=1} Z^{jAB}_i m^j_i(φ̃^j) m^j_i(φ̃^j)′ Z^{jAB′}_i )^{−1} ( Σ^n_{i=1} Z^{jAB}_i m^j_i(φ^j) )   (1.6.16)

where φ̃^j is a preliminary estimator of φ^j_0 and Z^{jAB}_i = [1, y^{−j}_{i0}, x_{i0}, w_{i0}]′.

This estimator is inefficient because it ignores cross-sectional dependence.11 Using the previous results in this paper, we can specify auxiliary assumptions so that an estimator can be derived which is consistent as long as the identifying assumptions defined above hold and efficient if the auxiliary assumptions also hold. The first auxiliary assumption we can use is conditional homoscedasticity and cluster equi-correlation. For j = U, E, M:

Cov(u^j_{it}, u^j_{ls} | Y^{−j}_0, X_0, W_0, g_0) = σ²_{u_j} if i = l, t = s   (1.6.17)
 = τ_j σ²_{u_j} if i ≠ l, g_{it} = g_{ls}, t = s   (1.6.18)
 = 0 otherwise   (1.6.19)

and:

Cov(u^j_{it}, ε^k_{ls} | Y^{−j}_0, X_0, W_0, g_0) = 0 ∀ i, l, j, k, t, s   (1.6.20)

Cov(ε^j_{it}, ε^k_{ls} | Y^{−j}_0, X_0, W_0, g_0) = σ²_{ε_j} if i = l, j = k, t = s   (1.6.21)
 = 0 otherwise   (1.6.22)

Under this assumption:

Cov(m^j_i, m^j_l | Y^{−j}_0, X_0, W_0, g_0) = 2σ²_{u_j} + 2σ²_{ε_j}(1 + ρ + ρ²) if i = l   (1.6.23)
 = τ_j σ²_{u_j} ( 1[g_{i1} = g_{l1}] + 1[g_{i2} = g_{l2}] ) if i ≠ l   (1.6.24)

Under the previous auxiliary assumption, the optimal instruments for m^j_i(φ^j) will be linear functions of E(Δy^j_{i1} | Y^{−j}_0, X_0, W_0, g_0), E(Δx_{i2} | Y^{−j}_0, X_0, W_0, g_0) and E(Δw_{i2} | Y^{−j}_0, X_0, W_0, g_0). Since we have T = 2, so that there is only one transformed equation available for estimation, we can use the simple second auxiliary

11 Without measurement error, it would also be possible to use the correlation of transitory shocks across subjects to obtain an efficient joint estimator of {φ^j}_{j=U,E,M}. However, because of measurement error, the sets of instruments across subjects are non-overlapping, so that optimal instruments cannot be derived.
Since there is no restriction on the parameters across equations, optimal weighting of the moment conditions across subjects or minimum distance methods cannot be used either.

assumption:

E(Δy^j_{i1} | Y^{−j}_0, X_0, W_0, g_0) = a^j_0 + Σ_{k≠j} a^j_{01k} y^k_{i0} + Σ_{k≠j} a^j_{02k} (1/#g_{i0}) Σ_{l∈g_{i0}} y^k_{l0} + a^j_{03} x_{i0} + a^j_{04} w_{i0} + a^j_{05} (1/#g_{i0}) Σ_{l∈g_{i0}} w_{l0}   (1.6.25)

E(Δz_{i2} | Y^{−j}_0, X_0, W_0, g_0) = b^{zj}_0 + Σ_{k≠j} b^{zj}_{01k} y^k_{i0} + Σ_{k≠j} b^{zj}_{02k} (1/#g_{i0}) Σ_{l∈g_{i0}} y^k_{l0} + b^{zj}_{03} x_{i0} + b^{zj}_{04} w_{i0} + b^{zj}_{05} (1/#g_{i0}) Σ_{l∈g_{i0}} w_{l0}   (1.6.26)

where z ∈ {x, w}, and consistently estimate these unknown parameters by OLS regression.

Define Ê^{Δy^j}_i, Ê^{Δx}_i and Ê^{Δw}_i to be the estimated conditional expectations defined by (1.6.25) and (1.6.26). Define:

D^j_i = [ Ê^{Δy^j}_i, Ê^{Δx}_i, Ê^{Δw}_i ]′   (1.6.27)

and D^j = [D^j_1, ..., D^j_n]. Define Σ̂^j(φ^j) = [ Ĉov(m^j_i(φ^j), m^j_l(φ^j)) ]_{i,l=1,...,n} and:

m^j(φ^j) = [m^j_1(φ^j), ..., m^j_n(φ^j)]′   (1.6.28)

The efficient estimator of φ^j_0 under the auxiliary assumptions is φ̂^j_{opt} defined by:

D^j Σ̂^j(φ̂^j_{opt})^{−1} m^j(φ̂^j_{opt}) = 0   (1.6.29)

Let M^j_i = [1, Δx_{i2}, Δy^j_{i1}, Δw_{i2}]′ = −( ∂m^j_i(φ^j)/∂φ^j )′. Both the Arellano and Bond estimator and our optimal estimator can be written as:

Σ^n_{i=1} Z^j_i m^j_i(φ̂^j) = 0   (1.6.30)

where for the Arellano and Bond estimator:

Z^j_i = ( Σ^n_{l=1} M^j_l Z^{jAB′}_l ) Θ̂^{jAB−1} Z^{jAB}_i   (1.6.31)

with:

Θ̂^{jAB} = Σ^n_{i=1} Z^{jAB}_i m^j_i(φ̃^j) m^j_i(φ̃^j)′ Z^{jAB′}_i   (1.6.32)

For our optimal estimator, Z^j_i is the ith column of D^j Σ̂^j(φ̂^j_{opt})^{−1}. Under the assumption that transitory shocks are independent across schools, φ̂^j is consistent for φ^j_0 and asymptotically normal.
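The first-stage projections (1.6.25)-(1.6.26) are ordinary least squares regressions of each transformed variable on own baseline scores, covariates, and their school-level averages. A minimal sketch, with scalar y0, x0 and w0 for readability (the paper's regressors are vectors):

```python
import numpy as np

def cluster_means(v, g):
    """Within-cluster average of v, returned for each observation."""
    out = np.empty_like(v, dtype=float)
    for c in np.unique(g):
        idx = g == c
        out[idx] = v[idx].mean()
    return out

def first_stage_fitted(dep, y0, x0, w0, g0):
    """OLS projection of a transformed variable (e.g. the dy_i1 of (1.6.25))
    on [1, y0, school-mean(y0), x0, w0, school-mean(w0)]; returns the fitted
    values, i.e. the estimated conditional expectations used to build D_i."""
    R = np.column_stack([
        np.ones_like(dep), y0, cluster_means(y0, g0),
        x0, w0, cluster_means(w0, g0),
    ])
    coef, *_ = np.linalg.lstsq(R, dep, rcond=None)
    return R @ coef
```

Stacking the fitted values for Δy^j_{i1}, Δx_{i2} and Δw_{i2} gives the D^j_i of (1.6.27).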
The asymptotic variance-covariance matrix of both estimators is12:

AVar(φ̂^j) = A^{−1} B A^{−1}′   (1.6.33)

A = plim (1/n) Σ^n_{i=1} Z^j_i M^j_i′   (1.6.34)

B = plim (1/n) Σ^n_{i=1} Σ^n_{l=1} Z^j_i m^j_i m^j_l Z^j_l′   (1.6.35)

 = plim (1/n) Σ^n_{i=1} Σ^n_{l=1} 1[{g_{it} = g_{ls}}_{t,s=1,2}] Z^j_i m^j_i m^j_l Z^j_l′   (1.6.36)

which can be estimated consistently since there is a small number of observations in each school.

Students' achievement in each subject was measured by the results obtained on a test administered by the authors of Andrabi et al. (2011) and graded using Item Response Theory, so that scores can be compared across students and years; the standard deviation of scores in the first year (third grade) is one. Table 1.14 shows the averages and standard deviations of scores by subject and grade. Table 1.15 reports the estimated degree of persistence and the estimated effect of attending private school on performance for the three subjects considered, together with the associated standard errors and 95% confidence intervals. As in Andrabi et al. (2011), we find significant persistence in scores except for Mathematics. We estimate smaller effects of attending private school than Andrabi et al. (2011), which can be attributed to their treating the covariates as sequentially exogenous instead of contemporaneously endogenous, while it is likely that unobserved factors simultaneously affect performance and school attendance, as explained previously. The optimal estimator presented in this section yields smaller standard errors than the Arellano and Bond estimator, both for estimating persistence in student achievement and for estimating the effect of attending private school, with particularly large gains for the latter.

12 Note that clustering standard errors by the first school attended, which is used in Andrabi et al.
(2011), is not justified, since transitory shocks should be correlated within the school that a child currently attends, and not necessarily only across students who attended the same school in the first time period.

Table 1.14: Averages and standard deviations of scores per subject and per grade

            English           Math              Urdu
            Average   s.d.    Average   s.d.    Average   s.d.
Grade 3     0         1       0         1       0         1
Grade 4     0.18      1.04    0.18      1.11    0.24      1.10
Grade 5     0.68      0.89    0.81      1.04    0.82      0.94

Table 1.15: Effects of Attending Private Schools on Student Achievement

                      Optimal Estimator           Arellano and Bond Estimator   Andrabi et al. 2010
Persistence
  English             0.31 (0.14) [0.04, 0.58]    0.34 (0.17) [0.01, 0.67]      0.19 (0.10) [-0.01, 0.39]
  Urdu                0.30 (0.12) [0.06, 0.54]    0.53 (0.18) [0.18, 0.88]      0.35 (0.11) [0.13, 0.57]
  Math                0.04 (0.11) [-0.18, 0.26]   0.23 (0.14) [-0.04, 0.50]     0.12 (0.12) [-0.12, 0.36]
Private School
  English             0.44 (0.38) [-0.30, 1.18]   0.40 (0.55) [-0.68, 1.48]     1.15 (0.39) [0.39, 1.91]
  Urdu                0.89 (0.41) [0.09, 1.69]    0.81 (0.59) [-0.35, 1.97]     0.90 (0.48) [-0.04, 1.84]
  Math                0.30 (0.31) [-0.31, 0.91]   0.43 (0.54) [-0.63, 1.49]     0.46 (0.50) [-0.52, 1.44]

Numbers in parentheses are standard errors and intervals are 95% confidence intervals. Standard errors and confidence intervals for Andrabi et al. 2010 do not take into account changes in school attendance across time. Covariates are treated as sequentially exogenous instead of contemporaneously endogenous in Andrabi et al. 2010.

1.7 Conclusion

We have presented an estimation method that uses cross-sectional dependence to improve the accuracy with which dynamic models of panel data are estimated, while making use of few nuisance parameters and remaining robust to misspecification of the form of the cross-sectional dependence. This method can be generalized to models with covariates that are strictly exogenous, sequentially exogenous or contemporaneously endogenous.
Monte Carlo simulations and an application to the estimation of a value-added model show that, when there is cross-sectional dependence, this method dominates existing estimators in terms of accuracy and quality of inference. Extensions of this work that are the subject of ongoing research include the generalization of the results in this paper to non-linear panel data models, the use of forms of cross-sectional dependence other than clustering in the auxiliary restrictions, and the asymptotic properties of our estimator with large numbers of time periods and of observations within clusters.

CHAPTER 2

ESTIMATION OF UNOBSERVED EFFECTS PANEL DATA MODELS UNDER SEQUENTIAL EXOGENEITY

2.1 Introduction

Time constant unobserved effects are now routinely introduced in models of panel data to address endogeneity issues that are due to time constant unobserved variables. A first group of estimators for such models uses iterated conditioning by specifying an auxiliary model for the unobserved effects conditional on the covariates; such models are commonly called Correlated Random Effects (CRE) models. A second group of estimators implements instrumental variable estimation methods on transformed data, as long as some specific functional form assumptions can be made. In the case of linear models with strictly exogenous covariates, CRE estimators were first proposed in Mundlak (1978) and Chamberlain (1982). When covariates are strictly exogenous, Wooldridge (2010) contains many examples of the generalization of this approach to non-linear models of panel data. For linear models with strictly exogenous covariates, the instrumental variable estimators are the well-known Fixed Effects estimator and the First Difference estimator. These estimators have also been generalized to non-linear models that are linear in random coefficients in Chamberlain (1992b).
The estimators mentioned above cannot be used in applications where covariates or, more generally, instruments are only sequentially exogenous. Sequentially exogenous instruments arise when transitory shocks to the dependent variable are independent of past and current values of the instruments but affect future values of the instruments. This scenario is particularly plausible in a dynamic optimization framework. The simplest example is when lagged values of the dependent variable are used as covariates and instruments. Dynamic models are frequently used to analyze panel data; a review of linear dynamic models for panel data can be found in Bond (2002). Sometimes instruments other than lagged values of the dependent variable can be sequentially exogenous. This is the case, for instance, in Clerides et al. (1998), which investigates the causal effect of exporting on firm efficiency while recognizing that shocks to firm efficiency will affect current and future exports as well. Another well-known example is Blundell et al. (1995), which investigates the causal effect of Research and Development spending on innovation, knowing that current success in filing patents will affect future Research and Development spending. CRE models encounter strong limitations when instruments are sequentially exogenous. Dynamic CRE models can be used for special cases where lagged dependent variables are included in the list of covariates but all other covariates are strictly exogenous. Such models and the corresponding estimation methods are discussed in Wooldridge (2005); one application can be found in Browning et al. (2010). In more general cases, however, one would need a very large set of auxiliary assumptions in order to use a CRE model to analyze panel data with sequential exogeneity.
Instrumental variable estimation can be used for panel data models with sequential exogeneity as long as there exists a transformation of the data to which the method of moments can be applied, which makes it a much more flexible approach. For models with additive unobserved effects, such transformations of the data are presented in Arellano and Bond (1991). Chamberlain (1992a) and Wooldridge (1997) discuss transformations of the data for models with multiplicative unobserved effects. Once the transformed equations are obtained, these papers advocate a two-step GMM estimation of the unknown parameters using the transformed equations at each time period and the corresponding available sets of instruments. These estimators are efficient given the set of unconditional moment conditions that are used, but they are known to suffer from a weak instrumental variable problem that can hinder their use in practice. In this paper, we consider using additional assumptions to derive useful additional moment conditions and hence obtain a more precise estimator. The additional moment conditions we present are generalized versions of the additional moment conditions for the linear dynamic model with additive unobserved effects presented in Arellano and Bover (1995), Ahn and Schmidt (1995) and Blundell and Bond (1998). Windmeijer (2000) considered some of the additional moment conditions we present here, namely zero serial correlation of the transitory shocks, for a special case of the group of models we define. However, the chosen set of assumptions in Windmeijer (2000) is actually too weak to support the moment conditions that are used for estimation. Hence it seems useful to present these moment conditions here as part of a unifying framework. In Section 2.2 we present the model and assumptions we use. In Section 2.3 we discuss the estimator that is currently used.
In Section 2.4 we present additional sets of restrictions that can be used for estimation when instruments are stationary or when transitory shocks are serially uncorrelated. In Section 2.5 we show, using Monte Carlo simulations, that these proposals to address the weak instrumental variable problem of these models result in significant improvements in accuracy, and hence effectively mitigate the weak instrumental variable problem. In Section 2.6 we show how to estimate and perform inference on measures of interest of the effect of covariates on the mean of the dependent variable.

2.2 Model and Assumptions

The models we consider are such that, for each observation i of a random sample of large size n and each time period t of a fixed number of time periods T, we can specify:

E(y_{it} | x^t_i, u^t_i, z_{it}) = h_0(x_{it}, β_0) + h_1(x_{it}, β_0) u_{it}   (2.2.1)

E(u_{it} | z_{it}) = E(u_{it+1} | z_{it}) ∀ t ≤ T − 1   (2.2.2)

for known functions h_0, h_1. In this model x_{it} are observed covariates and u_{it} captures the effect of unobserved covariates. x^t_i = {x_{is}}_{s=1,...,t} contains all values of the covariates up to the current time period, and similarly we denote u^t_i = {u_{is}}_{s=1,...,t}. z_{it} are observed instruments that do not belong to the mean equation for y_{it} once we condition on the observed and unobserved covariates x_{it}, u_{it}. We consider cases where z_{i1} ⊆ z_{i2} ⊆ ... ⊆ z_{iT}, so that we have sequential exogeneity, also called predetermined instruments. (2.2.1) is specified in terms of a conditional expectation instead of simply in terms of y_{it} in order to allow for dependent variables with discrete supports, as we will see later in this section. (2.2.2) requires that at each time period the effects of unobserved covariates have the same mean conditional on the instruments as the effects of unobserved covariates at future time periods. Hence it requires that the source of endogeneity of the instruments be time constant.
For simplicity we will consider the case where y_{it}, h_0(·,·), h_1(·,·) and c_i are scalars, but all the results can be generalized to systems of equations if needed. Dynamic linear models with additive heterogeneity are a special case of the group of models described by (2.2.1) and (2.2.2), with x_{it} = y_{it−1}, z_{it} = [y_{i0}, ..., y_{it−1}], h_0(x, β) = βx, h_1(·,·) = 1:

y_{it} = β_0 y_{it−1} + c_i + ν_{it}   (2.2.3)

E(c_i + ν_{it} | y_{it−1}, ..., y_{i0}) = E(c_i | y_{it−1}, ..., y_{i0})   (2.2.4)

Here we wrote u_{it} = c_i + ν_{it}. Traditionally, unobserved effects have been explicitly decomposed into a time constant part, sometimes called unobserved heterogeneity, and a transitory part. In this paper we keep the more general notation of (2.2.1) and (2.2.2) for more flexibility.

Other special cases of the models we consider have been used to model count dependent variables, such as the linear feedback model presented in Blundell et al. (2002):

E(y_{it} | y_{it−1}, ..., y_{i0}, x_{it}, ..., x_{i1}, c_i) = γ_0 y_{it−1} + exp(x_{it} θ_0) c_i ∀ t = 1, ..., T   (2.2.5)

In this case we simply have u_{it} = c_i. Our specification also includes models for count dependent variables where covariates cannot be used as instruments, i.e. x_{it} ∉ z_{it}, but where enough instruments are available to identify the parameters of the model. An example where the available instruments are lagged covariates is presented in Windmeijer (2000)1:

y_{it} = exp(x_{it} β_0) c_i ν_{it}   (2.2.6)

E(c_i ν_{it} | x_{it−1}, ..., x_{i0}) = E(c_i | x_{it−1}, ..., x_{i0})   (2.2.7)

Multiplicative unobserved effects models were first used to analyze count data, and a description of the state of the literature on dynamic models of count data with unobserved heterogeneity is given in Windmeijer (2008). However, the class of models we consider in this paper is well suited to the analysis of any data that require the specification of a non-linear response function, such as binary, fractional, ordered, non-negative or corner solution response data.
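As one concrete data-generating process consistent with the linear feedback model (2.2.5), achievement counts can be simulated with Poisson innovations. The Poisson choice is our assumption for illustration only; the model itself restricts nothing beyond the conditional mean.

```python
import numpy as np

def simulate_lfm(n, T, gamma, theta, seed=0):
    """Simulate data whose conditional mean matches the linear feedback
    model (2.2.5): E(y_it | past, c_i) = gamma*y_it-1 + exp(x_it*theta)*c_i.
    Innovations are Poisson, one of many processes with this mean."""
    rng = np.random.default_rng(seed)
    c = rng.gamma(1.0, 1.0, size=n)        # unobserved heterogeneity c_i
    x = rng.normal(size=(n, T))            # scalar covariate per period
    y = np.zeros((n, T), dtype=int)
    y[:, 0] = rng.poisson(np.exp(x[:, 0] * theta) * c)
    for t in range(1, T):
        mean = gamma * y[:, t - 1] + np.exp(x[:, t] * theta) * c
        y[:, t] = rng.poisson(mean)        # Poisson(m) has mean exactly m
    return y, x, c
```

Averaging y_it − γ y_{it−1} − exp(x_it θ) c_i over a large simulated sample gives approximately zero, confirming the conditional mean restriction.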
For a binary dependent variable, for instance, a dynamic probit-type model with sequentially exogenous explanatory variables could be specified as:

E(y_{it} | y_{it−1}, ..., y_{i0}, c_i, x_{it}, ..., x_{i1}) = c_i Φ(γ_0 y_{it−1} + x_{it} θ_0)   (2.2.8)

Here c_i/2 is the conditional probability of being in state y_{it} = 1 when y_{it−1} = 0 and x_{it} = 0, and it also captures a time constant unobserved propensity to be in state y_{it} = 1. It is also important to note that the generality of our chosen specification allows us to use models where some of the explanatory variables are endogenous but where instruments are available:

E(y_{it} | y_{it−1}, ..., y_{i0}, x_{it}, ..., x_{i1}, u_{it}, ..., u_{i1}, z_{it}) = Φ(γ_0 y_{it−1} + x_{it} θ_0) u_{it}   (2.2.9)

E(u_{it} | z_{it}) = E(u_{it+1} | z_{it})   (2.2.10)

where u_{it} is a random variable between zero and one that captures the effect on the mean of y_{it} of unobserved explanatory variables which are not independent of x_{it} but have the same mean conditional on the instruments as the corresponding effects in future time periods.

1 We present slightly different assumptions here than Windmeijer (2000), since Windmeijer (2000) goes to great lengths to avoid making assumptions on conditional means and only considers assumptions of zero correlation, but does so at the expense of making two mistakes. One is that assuming that x_{it−1} is uncorrelated with ν_{it} does not imply E(x_{it−1} c_i (ν_{it+1} − ν_{it})) = 0, as is claimed in Windmeijer (2000). The other will be mentioned in Section 2.4.2.

2.3 Estimation without Additional Assumptions

Following the argument made in Chamberlain (1992a), the model described by (2.2.1)-(2.2.2) is statistically indistinguishable from:

E( (y_{iT} − h_0(x_{iT}, β_0)) / h_1(x_{iT}, β_0) | z_{iT} ) = E(u_{iT} | z_{iT})   (2.3.1)

E( Δ[ (y_{it} − h_0(x_{it}, β_0)) / h_1(x_{it}, β_0) ] | z_{it−1} ) = 0 ∀ t = 2, ..., T   (2.3.2)

where Δ denotes the difference operator. Since E(u_{iT} | z_{iT}) is unknown and unrestricted, (2.3.1) does not contribute to the estimation of β_0.
Therefore we can restrict our attention to estimating β_0 from (2.3.2). For notation we write:

ρ_t(w_i, β) ≡ Δ[ (y_{it} − h_0(x_{it}, β)) / h_1(x_{it}, β) ] ∀ t = 2, ..., T   (2.3.3)

where w_i ≡ {y_{it}, x_{it}}_{t=1,...,T}. So the conditional moment restrictions available for estimation are:

E(ρ_t(w_i, β_0) | z_{it−1}) = 0 ∀ t = 2, ..., T   (2.3.4)

Chamberlain (1992a) has shown that an optimal estimator would be β̂_{opt} solving:

Σ^n_{i=1} Σ^T_{t=2} D̃′_{it} Σ̃^{−1}_{it} ρ̃_t(w_i, z_i, β̂_{opt}) = 0   (2.3.5)

Such an estimator would achieve the asymptotic information bound for estimating β_0 from these conditional moment restrictions, which is J = E( Σ^T_{t=2} D̃′_{it} Σ̃^{−1}_{it} D̃_{it} ), where D̃_{it} ≡ E( ∂ρ̃_t(w_i, z_i, β_0)/∂β | z_{it−1} ), Σ̃_{it} ≡ Var( ρ̃_t(w_i, z_i, β_0) | z_{it−1} ), and ρ̃_t(·) is defined by:

ρ̃_T(w_i, z_i, β) = ρ_T(w_i, β)   (2.3.6)

ρ̃_t(w_i, z_i, β) = ρ_t(w_i, β) − Γ_{it,t+1} ρ̃_{t+1}(w_i, z_i, β) − ... − Γ_{it,T} ρ̃_T(w_i, z_i, β) ∀ t = 2, ..., T − 1   (2.3.7)

where z_i = {z_{it−1}}_{t=2,...,T}, Γ_{it,s} ≡ Cov(ρ_{it}, ρ̃_{is} | z_{is−1}) Var(ρ̃_{is} | z_{is−1})^{−1} ∀ s > t, with ρ_{it} = ρ_{it}(β_0), ρ_{it}(β) = ρ_t(w_i, β), ρ̃_{it} = ρ̃_{it}(β_0) and ρ̃_{it}(β) = ρ̃_t(w_i, z_i, β). The intuition behind this result is that the asymptotic information bound from all the sequential conditional moment restrictions is the sum of the information bounds for each conditional moment restriction once these restrictions have been orthogonalized.

Unfortunately, the optimal estimator from equation (2.3.5) is usually not feasible without additional assumptions, since D̃_{it}, Σ̃_{it} and ρ̃_{it} are not observed, i.e. they are not known functions of the data and of β_0. One could think about approximating such moment conditions arbitrarily well, as suggested in Chamberlain (1992a) or partially studied in Hahn (1997), but this introduces several new problems and is therefore left for future research. Under the conditional moment restrictions given in (2.3.4), any function of z_{it−1} can be used as instruments for ρ_{it}(β) to estimate β_0.
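For a concrete instance of the transformation (2.3.3), take the multiplicative specification (2.2.6), i.e. h_0 = 0 and h_1(x, β) = exp(xβ). The quasi-differenced residual then has a closed form; the sketch below is ours, with scalar covariates for readability.

```python
import numpy as np

def rho_t(y, x, beta, t):
    """Quasi-differenced residual rho_t(w_i, beta) of (2.3.3) for the
    multiplicative model h0 = 0, h1(x, b) = exp(x*b) (as in (2.2.6)):
        rho_t = y_t * exp(-x_t*b) - y_{t-1} * exp(-x_{t-1}*b).
    y, x : (n, T) arrays; t is the 1-based time index, 2 <= t <= T."""
    return (y[:, t - 1] * np.exp(-x[:, t - 1] * beta)
            - y[:, t - 2] * np.exp(-x[:, t - 2] * beta))
```

When y_it = exp(x_it β) c_i exactly (no transitory noise), the unobserved effect cancels and ρ_t is identically zero, which is precisely why E(ρ_t | z_{it−1}) = 0 holds at the true β.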
Windmeijer (2008), for instance, recommends the use of all available lags of the instruments in levels, which in our notation is just z_{it-1}. The estimator commonly used to estimate \beta_0 from the model given by (2.2.1) and (2.2.2) is therefore:

\hat\beta = \arg\min_\beta \left(\sum_i Z_i'\rho_i(\beta)\right)' \left(\sum_i Z_i'\rho_i(\tilde\beta)\rho_i(\tilde\beta)' Z_i\right)^{-1} \sum_i Z_i'\rho_i(\beta)   (2.3.8)

where \tilde\beta is a preliminary \sqrt{n}-consistent estimator of \beta_0, \rho_i(\beta) = [\rho_{it}(\beta)]_{t=2,\dots,T}, and

Z_i = \begin{pmatrix} z_{i1} & 0 & \dots & 0 \\ 0 & z_{i2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \dots & 0 & z_{iT-1} \end{pmatrix}   (2.3.9)

The asymptotic variance of \sqrt{n}(\hat\beta - \beta_0) is:

Avar = \left(E\left(Z_i'\frac{\partial\rho_i(\beta_0)}{\partial\beta}\right)' E(Z_i'\rho_i(\beta_0)\rho_i(\beta_0)'Z_i)^{-1} E\left(Z_i'\frac{\partial\rho_i(\beta_0)}{\partial\beta}\right)\right)^{-1}   (2.3.10)

It is shown in Appendix B.1 that this asymptotic variance is equal to:

Avar = \left(E\left(\sum_{t=2}^T \tilde{D}_{it}'\tilde\Sigma_{it}^{-1}\tilde{D}_{it}\right) - E\left(\sum_{t=2}^T e_{it}'e_{it}\right)\right)^{-1}   (2.3.11)

where e_{it} is the error term from the linear projection of \tilde{D}_{it}'\tilde\Sigma_{it}^{-1/2} on \tilde{z}_{it-1} = \{z_{is}\Gamma_{s+1,t}\tilde\Sigma_{it}^{1/2}\}_{s=1,\dots,t-1}, so that \hat\beta can be seen as the estimator resulting from a linear approximation of the optimal moment conditions.

This estimator is often quite imprecise. For dynamic linear models, the weak instrumental variable problem affecting the estimator described in this section has been documented in Arellano and Bover (1995), Ahn and Schmidt (1995) and Blundell and Bond (1998). The additional moment conditions proposed in these papers to address this issue are the ones we generalize to our more general setup in the next section. Blundell et al. (2002) have documented the weak instrumental variable problem for estimating the linear feedback model for count data, which we will use as an example in the next section. The additional assumptions used by Blundell et al. (2002) to alleviate the weak instrumental variable problem are quite unrealistic compared to the additional sets of assumptions we present in the next section.
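The two-step estimator (2.3.8) can be sketched as follows; `rho` and the instrument matrices `Z_i` are user-supplied, and all names are illustrative rather than taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the two-step GMM estimator (2.3.8).  `rho(beta)` must return
# an n x m array of moment functions rho_i(beta); `Z` is a length-n list
# of m x L instrument matrices Z_i (block-diagonal in the lags of z).
def two_step_gmm(rho, Z, beta_init):
    n, L = len(Z), Z[0].shape[1]

    def moments(beta):
        r = rho(beta)
        return np.array([Z[i].T @ r[i] for i in range(n)])  # n x L

    def objective(beta, W):
        m = moments(beta).mean(axis=0)
        return m @ W @ m

    # step 1: identity weighting matrix gives a sqrt(n)-consistent beta~
    b1 = minimize(objective, beta_init, args=(np.eye(L),)).x
    # step 2: efficient weighting matrix built from the step-1 moments
    g = moments(b1)
    W = np.linalg.pinv(g.T @ g / n)
    return minimize(objective, b1, args=(W,)).x
```

The first step with the identity weighting matrix plays the role of the preliminary \tilde\beta in (2.3.8); the second step reuses it to build the weighting matrix (\sum_i Z_i'\rho_i(\tilde\beta)\rho_i(\tilde\beta)'Z_i)^{-1}.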
In addition, Monte Carlo simulations show that using the additional assumptions presented in this paper achieves efficiency gains similar to those obtained in Blundell et al. (2002) when both sets of additional assumptions hold.

2.4 Additional Assumptions

2.4.1 Estimation with Stationary Instruments

For the models described by (2.2.1), it is possible in some applications that a subset of the instruments, denoted z^{stat}_{it}, has a time-constant covariance with the unobserved effects and a time-constant mean, so that in addition to (2.2.2) we can assume:

E(z^{stat}_{it}) = \mu_{z^{stat}}   (2.4.1)
Cov(z^{stat}_{it}, u_{it}) = \gamma   (2.4.2)

(2.4.1) and (2.4.2) imply that E(z^{stat}_{it} u_{it}) is time constant as well, since E(z^{stat}_{it} u_{it}) = \gamma + \mu_{z^{stat}} E(u_{it}), and E(u_{it}) is time constant by the law of iterated expectations together with E(u_{i1}|z_{i1}) = \dots = E(u_{iT}|z_{i1}), which is implied by (2.2.2). This in turn implies that E(z^{stat}_{it} u_{it}) = E(z^{stat}_{it-1} u_{it}), since E(z^{stat}_{it-1} u_{it-1}) = E(z^{stat}_{it-1} u_{it}) from (2.2.2). Let K^{stat} be the dimension of z^{stat}_{it}. We can then use for estimation the (T-1) \times K^{stat} additional moment conditions^2:

E\left((z^{stat}_{it} - z^{stat}_{it-1})' \frac{y_{it} - h_0(x_{it}, \beta_0)}{h_1(x_{it}, \beta_0)}\right) = 0 \quad \forall t = 2, \dots, T   (2.4.3)

2.4.1.1 Example of the Linear Feedback Model

An example of a model where such additional moment conditions can be used, but have not been exploited in previous studies, is the linear feedback model (LFM) presented in Blundell et al. (2002).
For |\gamma_0| < 1:

E(y_{it} \,|\, y_{it-1}, \dots, y_{i0}, x_{it}, \dots, x_{i1}, c_i) = \gamma_0 y_{it-1} + c_i \mu(x_{it}, \theta_0)   (2.4.4)

^2 Note that the moment conditions E((z^{stat}_{it-s} - z^{stat}_{it-s-1}) u_{it}) = 0 for s \geq 1 do not constitute useful additional moment conditions, since they are implied by the moment conditions E((z^{stat}_{it-s} - z^{stat}_{it-s-1}) u_{it-s}) = 0 and E(z^{stat}_{it-s}(u_{it-\tau} - u_{it-\tau-1})) = 0 \;\forall \tau = 0, \dots, s-1. Indeed:

(z^{stat}_{it-1} - z^{stat}_{it-2}) u_{it} = z^{stat}_{it-1} u_{it} - z^{stat}_{it-2}\Delta u_{it} - z^{stat}_{it-2} u_{it-1}
= z^{stat}_{it-1} u_{it} - z^{stat}_{it-2}\Delta u_{it} + \Delta z^{stat}_{it-1} u_{it-1} - z^{stat}_{it-1} u_{it-1}
= z^{stat}_{it-1}\Delta u_{it} - z^{stat}_{it-2}\Delta u_{it} + \Delta z^{stat}_{it-1} u_{it-1}

and by iteration, (z^{stat}_{it-s} - z^{stat}_{it-s-1}) u_{it} is a function of (z^{stat}_{it-s} - z^{stat}_{it-s-1}) u_{it-s} and \{z^{stat}_{it-s}(u_{it-\tau} - u_{it-\tau-1}), \;\forall \tau = 0, \dots, s-1\}.

For estimation we can use the sequence of conditional moment conditions corresponding to the conditional moment conditions (2.3.2) considered in the previous section:

E\left(\frac{y_{it} - \gamma_0 y_{it-1}}{\mu(x_{it}, \theta_0)} - \frac{y_{it-1} - \gamma_0 y_{it-2}}{\mu(x_{it-1}, \theta_0)} \,\middle|\, y_{it-2}, \dots, y_{i0}, x_{it-1}, \dots, x_{i1}\right) = 0 \quad \forall t = 2, \dots, T   (2.4.5)

so for this specific model we have u_{it} = c_i. Blundell et al. (2002) also assume that x_{it} is strictly stationary conditional on c_i.^3 This implies that E(\mu(x_{it}, \theta_0)|c_i) = g_1(c_i) for some arbitrary function g_1(\cdot). Consider the difference equation given by:

y_{it} = \gamma_0 y_{it-1} + \mu(x_{it}, \theta_0) c_i \varepsilon_{it}   (2.4.6)

where \varepsilon_{it} = \frac{y_{it} - \gamma_0 y_{it-1}}{c_i \mu(x_{it}, \theta_0)}. The associated stationary process is defined by:

s_{it} = \sum_{s=0}^{\infty} \gamma_0^s c_i \mu(x_{it-s}, \theta_0) \varepsilon_{it-s}   (2.4.7)

Then E(s_{it}|c_i) = \frac{c_i g_1(c_i)}{1 - \gamma_0}, since E(\varepsilon_{it} \,|\, c_i, x_{it}, x_{it-1}, \dots) = E\left(\frac{y_{it} - \gamma_0 y_{it-1}}{c_i \mu(x_{it}, \theta_0)} \,\middle|\, c_i, x_{it}, x_{it-1}, \dots\right) = 1. So if we simply assume that the deviation of y_{i0} from s_{i0} has mean zero conditional on c_i, we have E(y_{i0}|c_i) = \frac{c_i g_1(c_i)}{1 - \gamma_0}, so that E(y_{it}|c_i) = \frac{c_i g_1(c_i)}{1 - \gamma_0} \;\forall t = 1, \dots, T.
This assumption is the generalization of the restriction on initial conditions made in Blundell and Bond (1998) for dynamic linear models with additive unobserved effects. It results in the additional over-identifying moment conditions:

E\left((y_{it-1} - y_{it-2}) \frac{y_{it} - \gamma_0 y_{it-1}}{\mu(x_{it}, \theta_0)}\right) = E((y_{it-1} - y_{it-2}) c_i) \quad \forall t = 2, \dots, T   (2.4.8)
= 0 \quad \forall t = 2, \dots, T   (2.4.9)

Since for this specific model these conditions would not be plausible without the stationarity of x_{it}, we can also add the moment conditions:

E\left((x_{it} - x_{it-1}) \frac{y_{it} - \gamma_0 y_{it-1}}{\mu(x_{it}, \theta_0)}\right) = 0 \quad t = 2, \dots, T-1   (2.4.10)

Monte Carlo simulations show that these extra moment conditions improve the efficiency of the estimators significantly, even though they rely on assumptions that are more realistic than the assumptions imposed in Blundell et al. (2002).

^3 They do so in a different attempt to mitigate the weak IV problem of FE estimators for the LFM. Blundell et al. (2002) propose a so-called pre-sample mean estimator, which attempts to control for unobserved heterogeneity by using the average of observations on the dependent variable for many periods before the rest of the sample started as a proxy for time-constant unobserved heterogeneity. However, this estimator suffers from two severe drawbacks which make it unusable in practice: it supposes that one has many observations on the dependent variable before the start of the rest of the sample, and, most importantly, the assumptions under which the pre-sample average is a good proxy for unobserved heterogeneity are highly unrealistic; in particular, it supposes that the covariates x_{it} have a mean proportional to c_i and restricts \mu(\cdot) to be the linear-index exponential function.

2.4.1.2 Time-Demeaned Instruments

In some applications it might not be plausible to assume that some of the instruments are mean stationary.
However, additional moment conditions similar to (2.4.3) can be obtained after time demeaning of the instruments if E((z^{stat}_{it} - E(z^{stat}_{it})) u_{it}) = Cov(z^{stat}_{it}, u_{it}) is time constant. In this section we consider the conditions necessary for this to be true when z^{stat}_{it} is not itself mean stationary.

From (2.2.2) alone, Cov(z^{stat}_{it}, u_{it}) = Cov(z^{stat}_{it}, u_{iT}), since (2.2.2) implies E(z^{stat}_{it} u_{it}) = E(z^{stat}_{it} u_{iT}) and E(u_{it}) = E(u_{iT}). Cov(z^{stat}_{it}, u_{iT}) will be time constant if \forall t, s \leq T-1:

Cov(z^{stat}_{it}, u_{iT}) - Cov(z^{stat}_{is}, u_{iT}) = 0   (2.4.11)
Cov(z^{stat}_{it} - z^{stat}_{is}, u_{iT}) = 0   (2.4.12)

Hence Cov(z^{stat}_{it}, u_{iT}) will be time constant if the change in z^{stat}_{it} over time is uncorrelated with the unobserved effects at the last time period. To provide more intuition regarding what such an assumption means, we can consider the unobserved heterogeneity decomposition of the unobserved effects and write u_{iT} = c_i \nu_{iT}, with c_i and \nu_{iT} such that E(u_{iT}|z_{it}) = E(c_i|z_{it}). Therefore:

Cov(z^{stat}_{it} - z^{stat}_{is}, u_{iT}) = Cov(z^{stat}_{it} - z^{stat}_{is}, c_i)   (2.4.13)

So for Cov(z^{stat}_{it}, u_{iT}) to be time constant, we need the change in z^{stat}_{it} over time to be uncorrelated with the time-constant part of the unobserved effects. This will be satisfied, for instance, if z^{stat}_{it} is composed of a deterministic time component f_t, a time-constant component d_i that is arbitrarily correlated with c_i, and a time-varying component \varepsilon_{it} that is uncorrelated with c_i:

z^{stat}_{it} = f_t + d_i + \varepsilon_{it}   (2.4.14)

Indeed, in this case:

Cov(z^{stat}_{it} - z^{stat}_{is}, u_{iT}) = Cov(z^{stat}_{it} - z^{stat}_{is}, c_i)   (2.4.15)
= Cov(\varepsilon_{it} - \varepsilon_{is}, c_i)   (2.4.16)
= 0   (2.4.17)

As long as Cov(z^{stat}_{it}, u_{iT}) is time constant, we can use for estimation the following additional moment conditions:

E\left((\tilde{z}^{stat}_{it} - \tilde{z}^{stat}_{it-1})' \frac{y_{it} - h_0(x_{it}, \beta_0)}{h_1(x_{it}, \beta_0)}\right) = 0 \quad \forall t = 2, \dots, T   (2.4.18)

where \tilde{z}^{stat}_{it} = z^{stat}_{it} - E(z^{stat}_{it}).
For estimation, E(z^{stat}_{it}) can simply be replaced by the sample average of z^{stat}_{it}. The asymptotic variance of the estimator of \beta_0 will not be affected by this preliminary estimation, following for instance the results in Newey and McFadden (1994). A simple informal test of whether the change in z^{stat}_{it} over time is uncorrelated with the time-constant part of the unobserved effects is to regress the change in z^{stat}_{it} on time-period dummies and as many time-constant explanatory variables as are available, and test the joint significance of the time-constant covariates in the regression.

2.4.2 Serially Uncorrelated Transitory Shocks

In some applications it might be unlikely that instruments or functions of the instruments have a time-constant covariance with the unobserved effects. For instance, consider the case of the linear feedback model where only time-period dummy variables are used as covariates, so that x_{it} = D_t. Then E(y_{it}|y_{it-1}, \dots, y_{i0}, c_i) = \gamma_0 y_{it-1} + \mu_t c_i, where \mu_t is a deterministic constant that depends on t. Even if we assume that y_{i0} does not deviate from the stationary process s_{i0} = \sum_{s=0}^{\infty}\gamma_0^s c_i \mu_{-s}\varepsilon_{i-s}, we have E(y_{i1} - y_{i0}|c_i) = \sum_{s=0}^{\infty}\gamma_0^s c_i(\mu_{-s+1} - \mu_{-s}), so that in general y_{it} - y_{it-1} will be correlated with c_i, and therefore y_{it-1} - y_{it-2} cannot be used as an instrument for the equation in levels, even if it is time demeaned. However, in such cases other additional restrictions might be available in the form of restrictions on the variance-covariance matrix of u_i = [u_{i1}, \dots, u_{iT}]'. It is sometimes plausible to assume that the only source of serial correlation in the unobserved effects is the time-constant unobserved effects, so that Cov(u_{it}, u_{is}) = Cov(u_{iq}, u_{ir}) \;\forall s < t, q < r.^4
In general, such restrictions imply T(T-1)/2 - 1 additional over-identifying moment restrictions, which can be written as:

E\left(\frac{y_{it} - h_0(x_{it}, \beta_0)}{h_1(x_{it}, \beta_0)} \cdot \frac{y_{is} - h_0(x_{is}, \beta_0)}{h_1(x_{is}, \beta_0)}\right) = \tau_0 \quad \forall t, s = 1, \dots, T, \; s < t   (2.4.19)

where \tau_0 is an additional parameter, appended to \beta_0, defined by \tau_0 = Cov(u_{it}, u_{is}) + E(u_{it})E(u_{is}) \;\forall t \neq s, which does not depend on t or s since E(u_{it}) is constant by (2.2.2). This is, however, not true in the case of dynamic models, since then some of these moment conditions are already implied by (2.2.1) and (2.2.2). For dynamic models, u_{it} = (y_{it} - h_0(x_{it}, \beta_0))/h_1(x_{it}, \beta_0) and y_{it-1}, x_{it} \in z_{it}, so u_{it-1} \in z_{it}. Hence (2.2.1) and (2.2.2) imply E(u_{it} u_{is}) = E(u_{it} u_{ir}) \;\forall t < s, r, so that Cov(u_{it}, u_{is}) = Cov(u_{it}, u_{ir}) \;\forall t < s, r. Therefore, assuming Cov(u_{it}, u_{is}) = Cov(u_{iq}, u_{ir}) \;\forall t < s, q < r in the case of dynamic models will only imply the additional T-2 over-identifying restrictions:

E\left(\frac{y_{iT} - h_0(x_{iT}, \beta_0)}{h_1(x_{iT}, \beta_0)} \cdot \frac{y_{iT-s} - h_0(x_{iT-s}, \beta_0)}{h_1(x_{iT-s}, \beta_0)}\right) = \tau_0 \quad \forall s = 1, \dots, T-1   (2.4.20)

These moment restrictions are the generalization of the additional moment conditions derived for linear dynamic models in Ahn and Schmidt (1995). In the case of dynamic models, an assumption such as Cov(u_{it}, u_{is}) = Cov(u_{iq}, u_{ir}) \;\forall t \neq s, q \neq r can be very plausible, since it is possible to argue that modeling the dynamics will also account for all serial correlation in the unobserved effects other than the serial correlation due to the time-constant part of the unobserved effects.

^4 One could also consider weaker additional restrictions of the type E(u_{it} u_{it-s}) = \tau(s), so that serial correlation in the unobserved effects depends only on the number of lags s separating these unobserved effects and not on the chosen time period t. We do not consider this possibility in this paper for simplicity; it would be straightforward to modify the derivations of this section to cover this case.
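As a quick numerical illustration of why the time-constant decomposition delivers these restrictions (a hypothetical simulation, not part of the original argument): if u_{it} = c_i \nu_{it} with \nu_{it} i.i.d. with mean one and independent of c_i, then every pairwise product moment E(u_{it} u_{is}), t \neq s, equals E(c_i^2).

```python
import numpy as np

# Illustrative check: with u_it = c_i * nu_it, nu_it i.i.d. with mean
# one and independent of c_i, E(u_it u_is) = E(c_i^2) for all t != s,
# which is the constancy used in restrictions of the form (2.4.20).
rng = np.random.default_rng(0)
n, T = 100_000, 4
c = np.exp(rng.normal(0.0, 0.3, (n, 1)))   # time-constant effect
nu = rng.gamma(2.0, 0.5, (n, T))           # i.i.d. shocks, mean 1
u = c * nu
pairs = [(u[:, t] * u[:, s]).mean() for t in range(T) for s in range(t)]
# all pairwise product means are (approximately) equal to E(c_i^2)
```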
For instance, the linear feedback model presented in (2.4.4) implies such additional moment restrictions, even though they have not been exploited for estimation in previous studies. Indeed, from (2.4.4)^5:

E\left(\frac{y_{iT} - \gamma_0 y_{iT-1}}{\mu(x_{iT}, \theta_0)} \cdot \frac{y_{iT-s} - \gamma_0 y_{iT-s-1}}{\mu(x_{iT-s}, \theta_0)}\right) = E(c_i^2) \quad s = 1, \dots, T-1   (2.4.21)

Windmeijer (2000) has also derived similar moment conditions for the model presented in (2.2.6) and (2.2.7), but under a set of assumptions that is too weak: Windmeijer (2000) only assumes that c_i^2 is uncorrelated with \varepsilon_{it} and that \varepsilon_{it} is uncorrelated with \varepsilon_{is} for t \neq s, which does not imply E(c_i^2\varepsilon_{it}\varepsilon_{is}) = E(c_i^2)E(\varepsilon_{it})E(\varepsilon_{is}) = E(c_i^2). Hence it seems that a specification of models in terms of conditional expectations and unobserved effects, as in (2.2.1) and (2.2.2), is more straightforward than the specification of the model in terms of zero correlation found in Windmeijer (2000).

^5 I was made aware after finishing a draft of this paper that, in unpublished work, Kitazawa (2007) also considers similar moment conditions for the LFM. Note, however, that the LFM is only one of the special cases covered by the group of models we defined.

2.5 Monte Carlo Evidence

To study the small-sample performance of the estimators we present in this paper, we consider estimating the linear feedback model presented in Blundell et al. (2002):

y_{it} \sim Poisson(\gamma y_{it-1} + \exp(\beta x_{it} + \eta_i)) \quad \forall t = 1, \dots, T   (2.5.1)
x_{it} = \rho x_{it-1} + \tau \eta_i + \varepsilon_{it}   (2.5.2)
x_{i0} = \frac{\tau}{1-\rho}\eta_i + \xi_i   (2.5.3)
y_{i0} \sim Poisson\left(\frac{\exp(\beta x_{i0} + \eta_i)}{1-\gamma}\right)   (2.5.4)
\eta_i \sim N(0, \sigma_\eta^2)   (2.5.5)
\varepsilon_{it} \sim N(0, \sigma_\varepsilon^2)   (2.5.6)
\xi_i \sim N\left(0, \frac{\sigma_\varepsilon^2}{1-\rho^2}\right)   (2.5.7)

The only difference with the data generating process of Blundell et al. (2002) is that we do not obtain y_{i0} as the last of fifty draws starting at y_{i,-49} \sim Poisson(\exp(\beta x_{i,-49} + \eta_i)), but instead impose E(y_{i0}|c_i) = c_i E(\exp(\beta x_{i0}))/(1-\gamma). Since we restrict our attention to \gamma < 1, the two data generating processes are very similar, though not exactly equivalent.
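A minimal sketch of this data generating process (assuming the parameter values reported with the tables; the function and variable names are mine, not from the text):

```python
import numpy as np

# Simulate one Monte Carlo sample from the DGP (2.5.1)-(2.5.7).
# Default parameter values follow the tables: gamma = 0.5, beta = 0.5,
# rho = 0.5, tau = 0.1, sig2_eta = 0.5, sig2_eps = 0.5.
def simulate_lfm(n, T, gamma=0.5, beta=0.5, rho=0.5, tau=0.1,
                 sig2_eta=0.5, sig2_eps=0.5, rng=None):
    rng = np.random.default_rng(rng)
    eta = rng.normal(0.0, np.sqrt(sig2_eta), n)               # (2.5.5)
    x = np.empty((n, T + 1))
    x[:, 0] = tau / (1 - rho) * eta + rng.normal(             # (2.5.3), (2.5.7)
        0.0, np.sqrt(sig2_eps / (1 - rho ** 2)), n)
    for t in range(1, T + 1):                                 # (2.5.2), (2.5.6)
        x[:, t] = (rho * x[:, t - 1] + tau * eta
                   + rng.normal(0.0, np.sqrt(sig2_eps), n))
    y = np.empty((n, T + 1))
    y[:, 0] = rng.poisson(np.exp(beta * x[:, 0] + eta) / (1 - gamma))  # (2.5.4)
    for t in range(1, T + 1):                                 # (2.5.1)
        y[:, t] = rng.poisson(gamma * y[:, t - 1]
                              + np.exp(beta * x[:, t] + eta))
    return y, x

y, x = simulate_lfm(n=1000, T=4, rng=0)
```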
With this model, we will consider using for estimation the sequence of moment conditions:

E\left(z_{it}'\left(\frac{y_{it} - \gamma_0 y_{it-1}}{h_1(x_{it}, \beta_0)} - \frac{y_{it-1} - \gamma_0 y_{it-2}}{h_1(x_{it-1}, \beta_0)}\right)\right) = 0 \quad t = 2, \dots, T   (2.5.8)

where z_{it} = (y_{it-2}, \dots, y_{i0}, x_{it-1}, \dots, x_{i1}) or z_{it} = (y_{it-2}, x_{it-1}). The additional conditions that arise from the restriction imposed on the initial conditions are:

E\left((y_{it-1} - y_{it-2})\frac{y_{it} - \gamma_0 y_{it-1}}{h_1(x_{it}, \beta_0)}\right) = 0 \quad t = 2, \dots, T   (2.5.9)
E\left((x_{it} - x_{it-1})\frac{y_{it} - \gamma_0 y_{it-1}}{h_1(x_{it}, \beta_0)}\right) = 0 \quad t = 2, \dots, T   (2.5.10)

The additional conditions that arise from serial uncorrelation of the transitory shocks are:

E\left(\frac{y_{iT} - \gamma_0 y_{iT-1}}{h_1(x_{iT}, \beta_0)}\left(\frac{y_{it} - \gamma_0 y_{it-1}}{h_1(x_{it}, \beta_0)} - \frac{y_{it-1} - \gamma_0 y_{it-2}}{h_1(x_{it-1}, \beta_0)}\right)\right) = 0 \quad t = 2, \dots, T-1   (2.5.11)

We will consider four groups of estimators: using no additional moment conditions; using the additional moment conditions from the restrictions on the initial conditions; using the additional moment conditions from serially uncorrelated transitory shocks; and using both sets of additional moment conditions. Within each group we will consider the GMM estimator that uses all available lags of the instruments for the conditional moment conditions and the GMM estimator that uses only one lag of the instruments. For each estimator we will also consider the two-step GMM estimator with the identity matrix as initial weighting matrix and the iterated GMM estimator, a multiple-step GMM estimator that takes as many steps as are needed for the estimates to converge.^6 We will therefore be considering a total of sixteen estimators.

Table 2.1 and Table 2.2 report the bias and root mean squared error (RMSE) of the estimators of \gamma.^7 Table 2.3 and Table 2.4 report the ratio of the mean of the standard errors of the estimators of \gamma to the standard deviations of these estimators; these tables therefore capture the bias in the estimators of the variance of the estimators of \gamma.
Table 2.5 and Table 2.6 report the coverage rates of the 95% confidence intervals constructed from the estimators of \gamma and their associated standard errors. All results are from 1,000 replications.

The first conclusion from Table 2.1 and Table 2.2 is that using the additional moment conditions presented in the previous section results in large efficiency gains, with very sizable decreases in both bias and standard deviations. This gain is especially noticeable when either set of additional moment conditions is used compared to not using any; the addition of a second set of additional moment conditions yields a more modest gain in efficiency. Bias is almost always smaller when using only one lag of the instruments instead of all available lags. When all available lags of the instruments are used, iterated GMM seems to perform better than two-step GMM.

Table 2.3 and Table 2.4 show a severe downward bias in the standard errors for small n and large T when all available lags of the instruments are used. This problem is alleviated by using iterated GMM, particularly when T is large; however, even with iterated GMM the standard deviations can be significantly underestimated. This bias in the standard errors is due to the use of many over-identifying moment conditions. The same problem of downward-biased standard errors has been studied for the special case of linear models in Windmeijer (2005) and for models of count data in Windmeijer (2008). However, these two papers concentrate on the bias originating from using a preliminary estimator to compute the optimal weighting matrix, whereas we see that using iterated GMM instead of two-step GMM helps but does not completely solve the problem of downward-biased standard errors. Asymptotic analysis under many moment conditions, performed in separate work in progress, seems to indicate that most of the bias comes from the correlation between the gradient of the moment functions and the moment functions themselves; this result has been presented in a more general setting in Newey and Windmeijer (2009). Bootstrapped standard errors might also be a solution.

Table 2.5 and Table 2.6 show the effect of both the downward-biased standard errors and the bias in the estimators of \gamma on inference. For small n or large T, the coverage of the confidence intervals is significantly lower than the 95% confidence level, particularly when all available lags of the instruments are used. This problem is alleviated by using iterated GMM but not completely solved. Corrected standard errors would help construct better confidence intervals, as would bias correction, particularly in the case where no additional moment conditions are available. As with the correction of the standard errors, bias correction could be based on higher-order asymptotic analysis.

The first conclusion of this section is that using additional restrictions of stationarity of the instruments or serial uncorrelation of the transitory shocks can make a big difference in terms of the precision of the point estimates.

^6 We do not present the results of iterated GMM estimation for n = 100 because, for this small sample size and with the convergence criterion we used for the other sample sizes, the iterated GMM algorithm failed to converge in fewer than 400 iterations in 25% of the simulation draws when T = 4 and 50% of the simulation draws when T = 8. Conditional on the iterated GMM algorithm converging for n = 100, using iterated GMM instead of two-step GMM seemed to provide some efficiency gain and significantly better inference when many moment conditions are used, in a similar way as for the larger sample sizes.

^7 We only show results for the estimation of \gamma here, but results for the estimation of \beta exhibit similar patterns.
It does not, however, solve the problem of inference, which was already present with previous estimators and is due to the poor properties of GMM standard errors in cases where many over-identifying conditions are used. Using iterated GMM can improve the quality of inference compared to two-step GMM, especially when T is relatively large, without solving the problem completely. The results presented in this section also suggest that using only one lag of the instruments can result in much better inference, especially when T is relatively large. Previous studies of instrumental variable estimation of models similar to the ones we consider in this paper, such as Arellano and Bond (1991) or Windmeijer (2008), recommended the use of all lags of the instruments in (2.5.8). However, the Monte Carlo evidence we presented indicates that using only one lag of the instruments causes only a modest loss in accuracy, especially when additional moment conditions are available, while resulting in significantly lower bias and significantly better inference compared to using all available lags of the instruments.
Table 2.1: Bias and RMSE for estimating γ, T = 4

                                         N = 100          N = 500          N = 1000         N = 2000
                                         Bias    RMSE     Bias    RMSE     Bias    RMSE     Bias    RMSE
Two-step GMM
  no additional conditions   All Lags   -0.214   0.314   -0.070   0.117   -0.039   0.076   -0.022   0.053
                             One Lag    -0.186   0.320   -0.059   0.137   -0.031   0.088   -0.019   0.065
  initial conditions         All Lags   -0.036   0.167   -0.008   0.068    0.001   0.046   -0.001   0.033
                             One Lag    -0.006   0.142   -0.001   0.067    0.003   0.047   -0.000   0.034
  serial uncorrelation       All Lags   -0.108   0.196   -0.034   0.072   -0.015   0.046   -0.008   0.033
                             One Lag    -0.070   0.168   -0.022   0.073   -0.007   0.049   -0.005   0.036
  both sets of conditions    All Lags   -0.029   0.166   -0.007   0.070   -0.001   0.048   -0.002   0.033
                             One Lag    -0.006   0.135   -0.003   0.064    0.002   0.043   -0.001   0.031
Iterated GMM
  no additional conditions   All Lags                    -0.063   0.105   -0.036   0.074   -0.021   0.051
                             One Lag                     -0.060   0.132   -0.029   0.088   -0.018   0.063
  initial conditions         All Lags                    -0.001   0.062    0.004   0.046   -0.001   0.032
                             One Lag                      0.003   0.065    0.004   0.047   -0.000   0.034
  serial uncorrelation       All Lags                    -0.023   0.058   -0.010   0.043   -0.007   0.031
                             One Lag                     -0.017   0.061   -0.006   0.047   -0.004   0.034
  both sets of conditions    All Lags                    -0.010   0.086   -0.001   0.042   -0.002   0.029
                             One Lag                     -0.003   0.061   -0.000   0.043   -0.002   0.030

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Table 2.2: Bias and RMSE for estimating γ, T = 8

                                         N = 100          N = 500          N = 1000         N = 2000
                                         Bias    RMSE     Bias    RMSE     Bias    RMSE     Bias    RMSE
Two-step GMM
  no additional conditions   All Lags   -0.244   0.319   -0.070   0.092   -0.037   0.050   -0.018   0.028
                             One Lag    -0.126   0.183   -0.033   0.061   -0.019   0.041   -0.010   0.028
  initial conditions         All Lags   -0.122   0.204   -0.030   0.061   -0.014   0.032   -0.005   0.017
                             One Lag    -0.013   0.105   -0.002   0.041   -0.001   0.029   -0.000   0.020
  serial uncorrelation       All Lags   -0.191   0.267   -0.052   0.075   -0.025   0.039   -0.011   0.020
                             One Lag    -0.068   0.128   -0.017   0.041   -0.010   0.027   -0.005   0.018
  both sets of conditions    All Lags   -0.107   0.192   -0.026   0.060   -0.012   0.032   -0.005   0.017
                             One Lag    -0.002   0.101   -0.002   0.038   -0.002   0.026   -0.001   0.018
Iterated GMM
  no additional conditions   All Lags                    -0.044   0.058   -0.026   0.038   -0.015   0.024
                             One Lag                     -0.030   0.057   -0.017   0.039   -0.009   0.027
  initial conditions         All Lags                    -0.009   0.035   -0.005   0.023   -0.002   0.015
                             One Lag                     -0.001   0.041   -0.001   0.029   -0.000   0.020
  serial uncorrelation       All Lags                    -0.019   0.034   -0.011   0.023   -0.007   0.015
                             One Lag                     -0.011   0.036   -0.008   0.026   -0.004   0.018
  both sets of conditions    All Lags                    -0.010   0.039   -0.006   0.022   -0.003   0.014
                             One Lag                     -0.004   0.037   -0.003   0.025   -0.002   0.018

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Table 2.3: Ratio of standard errors over standard deviations of estimators of γ, T = 4

                                        N = 100   N = 500   N = 1000   N = 2000
Two-step GMM
  no additional conditions   All Lags    0.612     0.929     1.033      1.029
                             One Lag     0.772     0.883     1.025      0.967
  initial conditions         All Lags    0.428     0.763     0.870      0.921
                             One Lag     0.556     0.826     0.908      0.945
  serial uncorrelation       All Lags    0.562     0.900     0.990      0.975
                             One Lag     0.718     0.921     0.980      0.958
  both sets of conditions    All Lags    0.368     0.724     0.844      0.887
                             One Lag     0.520     0.794     0.914      0.939
Iterated GMM
  no additional conditions   All Lags              0.971     0.999      1.039
                             One Lag               0.902     0.990      0.986
  initial conditions         All Lags              0.758     0.827      0.922
                             One Lag               0.809     0.899      0.939
  serial uncorrelation       All Lags              0.976     0.971      0.997
                             One Lag               1.022     0.983      0.976
  both sets of conditions    All Lags              0.496     0.829      0.907
                             One Lag               0.774     0.877      0.943

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Table 2.4: Ratio of standard errors over standard deviations of estimators of γ, T = 8

                                        N = 100   N = 500   N = 1000   N = 2000
Two-step GMM
  no additional conditions   All Lags    0.146     0.540     0.734      0.914
                             One Lag     0.599     0.898     0.948      0.948
  initial conditions         All Lags    0.081     0.403     0.615      0.837
                             One Lag     0.353     0.736     0.834      0.899
  serial uncorrelation       All Lags    0.099     0.415     0.591      0.832
                             One Lag     0.435     0.823     0.907      0.955
  both sets of conditions    All Lags    0.055     0.354     0.555      0.803
                             One Lag     0.290     0.696     0.822      0.899
Iterated GMM
  no additional conditions   All Lags              0.749     0.854      0.942
                             One Lag               0.915     0.954      0.963
  initial conditions         All Lags              0.561     0.727      0.849
                             One Lag               0.713     0.814      0.891
  serial uncorrelation       All Lags              0.695     0.803      0.923
                             One Lag               0.855     0.924      0.976
  both sets of conditions    All Lags              0.440     0.683      0.848
                             One Lag               0.684     0.823      0.897

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

2.6 Average Partial Effects

With multiplicative heterogeneity models, average partial effects (APE) are very simple to compute. Average partial effects are defined by:

APE_{f_w} = E_{f_w}\left(\frac{\partial y}{\partial x}\right)   (2.6.1)

where f_w is some distribution over the domain of w, which represents all the information observed for one observation, and \partial y/\partial x denotes the change in y caused by a small change in x.^8

^8 Here we use the notation for partial derivatives, but in the case of a discrete change \Delta_x in x we could use the counterfactual notation and write \Delta_y(x) = y|(x + \Delta_x) - y|x interchangeably.
Table 2.5: Coverage of 95% confidence intervals for γ, T = 4

                                        N = 100   N = 500   N = 1000   N = 2000
Two-step GMM
  no additional conditions   All Lags    0.628     0.835     0.895      0.924
                             One Lag     0.788     0.875     0.930      0.929
  initial conditions         All Lags    0.614     0.855     0.903      0.923
                             One Lag     0.717     0.887     0.927      0.934
  serial uncorrelation       All Lags    0.628     0.847     0.910      0.921
                             One Lag     0.774     0.894     0.930      0.940
  both sets of conditions    All Lags    0.538     0.832     0.906      0.918
                             One Lag     0.699     0.868     0.919      0.936
Iterated GMM
  no additional conditions   All Lags              0.849     0.893      0.931
                             One Lag               0.879     0.914      0.930
  initial conditions         All Lags              0.842     0.891      0.921
                             One Lag               0.874     0.918      0.926
  serial uncorrelation       All Lags              0.884     0.918      0.933
                             One Lag               0.911     0.934      0.951
  both sets of conditions    All Lags              0.808     0.891      0.929
                             One Lag               0.863     0.902      0.933

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Eliminating the subscripts, (2.2.1) can be written as:

y = h_0(x, \beta_0) + h_1(x, \beta_0) u   (2.6.2)

Therefore:

APE_{f_w} = E_{f_w}\left(\frac{\partial h_0(x, \beta_0)}{\partial x} + \frac{\partial h_1(x, \beta_0)}{\partial x} u\right)   (2.6.3)
= E_{f_w}\left(\frac{\partial h_0(x, \beta_0)}{\partial x} + \frac{\partial h_1(x, \beta_0)}{\partial x} \cdot \frac{y - h_0(x, \beta_0)}{h_1(x, \beta_0)}\right)   (2.6.4)

For notational simplicity, in this section we will consider h_0(\cdot, \cdot) = 0, but this will not affect any of the results.
Table 2.6: Coverage of 95% confidence intervals for γ, T = 8

                                        N = 100   N = 500   N = 1000   N = 2000
Two-step GMM
  no additional conditions   All Lags    0.098     0.434     0.649      0.821
                             One Lag     0.590     0.862     0.898      0.913
  initial conditions         All Lags    0.091     0.534     0.757      0.875
                             One Lag     0.522     0.846     0.902      0.923
  serial uncorrelation       All Lags    0.086     0.410     0.637      0.820
                             One Lag     0.573     0.853     0.895      0.926
  both sets of conditions    All Lags    0.066     0.501     0.724      0.863
                             One Lag     0.457     0.832     0.902      0.924
Iterated GMM
  no additional conditions   All Lags              0.620     0.753      0.862
                             One Lag               0.875     0.904      0.920
  initial conditions         All Lags              0.729     0.834      0.895
                             One Lag               0.836     0.888      0.920
  serial uncorrelation       All Lags              0.731     0.822      0.897
                             One Lag               0.876     0.911      0.933
  both sets of conditions    All Lags              0.697     0.820      0.892
                             One Lag               0.823     0.892      0.923

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Many applications are interested in the average effect across an observed subset of the population, denoted A. This corresponds to using f_w = f(w|A), so that APE_A = E(\partial y/\partial x \,|\, A) = E\left(\frac{\partial h_1(x, \beta_0)}{\partial x}\frac{y}{h_1(x, \beta_0)} \,\middle|\, A\right). For instance, we could be interested in the average effect of x on y across the entire population in some given time period t, APE_t = E\left(\frac{\partial h_1(x_{it}, \beta_0)}{\partial x}\frac{y_{it}}{h_1(x_{it}, \beta_0)}\right). Or, in the case of a binary explanatory variable x^1 with x = (x^1, x^{-1}), we could be interested in the average treatment effect on the treated at a given time period:

ATET_t = E(y(1, x^{-1}_{it}) - y(0, x^{-1}_{it}) \,|\, x^1_{it} = 1)   (2.6.5)
= E\left(\left(h_1((1, x^{-1}_{it}), \beta_0) - h_1((0, x^{-1}_{it}), \beta_0)\right) \frac{y_{it}}{h_1(x_{it}, \beta_0)} \,\middle|\, x^1_{it} = 1\right)   (2.6.6)

Estimation and inference are straightforward in this case once a consistent estimator \hat\beta of \beta_0 is defined. Since E\left(1(i \in A)\left(\frac{\partial h_1(x_{it}, \beta_0)}{\partial x}\frac{y_{it}}{h_1(x_{it}, \beta_0)} - APE_{At}\right)\right) = 0, where 1(\cdot)
is the indicator function, we can simply add this moment condition to the moment conditions used to estimate \beta_0 and obtain an additional estimator of APE_{At}, as well as an estimator of the asymptotic variance of \widehat{APE}_{At} and of the covariance between \widehat{APE}_{At} and \hat\beta, where \hat\beta denotes the estimator of \beta_0 we will be using. Since we are adding one moment condition for one new parameter, the estimator \hat\beta will not be affected by the estimation of the average partial effects. In addition, the GMM estimator of the APE is given by:

\widehat{APE}_{At} = \frac{1}{n_A}\sum_{i=1}^n 1(i \in A) \frac{\partial h_1(x_{it}, \hat\beta)}{\partial x}\frac{y_{it}}{h_1(x_{it}, \hat\beta)}   (2.6.7)

where n_A = \sum_{i=1}^n 1(i \in A). If we think that the APE should be equal across time periods, we can impose this restriction in the GMM estimation by adding the moment restrictions \{E(1(i \in A)(\frac{\partial h_1(x_{it}, \beta_0)}{\partial x}\frac{y_{it}}{h_1(x_{it}, \beta_0)} - APE_A)) = 0\}_{t=1,\dots,T}, which might affect the estimation of \beta_0, or we can estimate the average partial effects for each time period and combine them using minimum distance estimation, which will not affect the estimation of \beta_0.

In other situations, if f_w can be consistently estimated by f_w(\hat\eta), where \hat\eta is a vector of estimators of nuisance parameters \eta_0, then:

\widehat{APE}_{f_w} = E_{f_w(\hat\eta)}\left(\frac{\partial h_1(x, \hat\beta)}{\partial x}\frac{y}{h_1(x, \hat\beta)}\right)   (2.6.8)

is consistent for APE_{f_w}. If (\hat\beta, \hat\eta) are jointly asymptotically normal and a consistent estimator of their asymptotic variance-covariance matrix is available, then inference can be performed using the delta method.

2.7 Conclusion

These results hopefully provide useful new options for researchers who wish to use non-linear panel data models with unobserved effects in applications where only sequential exogeneity is available. The problem of weak instrumental variables seems to be mitigated significantly by the use of additional moment conditions originating from additional restrictions of stationarity of the instruments or serial uncorrelation of the transitory shocks.
Monte Carlo evidence also seems to suggest that it is preferable to use only one or a few lags of the instruments rather than all available lags, since this results in much better inference at the expense of only small losses in efficiency. Two directions are available to obtain estimators with better inference: one consists of studying the higher-order properties of the GMM estimator with many over-identifying restrictions; the other consists of finding good exactly identifying moment conditions. Both of these approaches are left for future research.

CHAPTER 3

EFFICIENCY OF THE POISSON FIXED EFFECTS ESTIMATOR

3.1 Introduction

A commonly used estimator for models of count panel data with multiplicative heterogeneity and strictly exogenous explanatory variables is the Poisson fixed effects (PFE) estimator introduced by Hausman et al. (1984). This estimator is a conditional maximum likelihood estimator which takes advantage of the assumptions of a Poisson distribution and independent draws over time to derive a conditional distribution of the dependent variable that does not depend on the distribution of unobserved heterogeneity. In many applications, these distributional assumptions are likely to be violated. Wooldridge (1999) showed that the PFE estimator is consistent as long as the conditional mean function is correctly specified, independently of whether the rest of the assumptions of the PFE model hold. In this paper I show that, as long as the conditional mean of the dependent variable is equal to its conditional variance and the conditional serial correlation of the dependent variable is zero, the PFE estimator is also asymptotically efficient in the class of estimators that are consistent under restrictions on the conditional mean function. I then define another estimator that is asymptotically efficient in the same class of estimators under more general conditions.
In Section 3.2, I present the model considered in this paper and study the asymptotically efficient estimator for this model. I show under which conditions the PFE estimator is asymptotically efficient and propose an alternative estimator that is asymptotically efficient under more general conditions. In Section 3.3, I use Monte Carlo simulations to investigate the small sample properties of the PFE estimator and of this new estimator.

3.2 The Model and Estimators

As in Wooldridge (1999), we consider panel data models that specify a conditional mean function with strictly exogenous explanatory variables and multiplicative heterogeneity:

\[ E(y_{it} | c_i, x_i) = c_i \mu(x_{it}, \beta_0) \quad \forall\, i = 1, \dots, n,\ t = 1, \dots, T \tag{3.2.1} \]

where $i$ indexes cross-sectional observations, $t$ indexes time, and $x_i = \{x_{i1}, \dots, x_{iT}\}$. This model is also a special case of the random coefficients model presented in Section 4 of Chamberlain (1992b). Throughout this paper we consider the case of i.i.d. cross-sectional draws and large $n$, fixed $T$ asymptotics. Denote $\mu_{it}(\beta) = \mu(x_{it}, \beta)$ and $\mu_{it} = \mu_{it}(\beta_0)$.

Wooldridge (1999) showed that the parameters in this model can be estimated from the conditional moment conditions:

\[ E(\rho_{it}(\beta_0) | x_i) = 0 \quad \forall\, t = 1, \dots, T \tag{3.2.2} \]

where $\rho_{it}(\beta) = y_{it} - \mu(x_{it}, \beta) \frac{\sum_{s=1}^T y_{is}}{\sum_{s=1}^T \mu(x_{is}, \beta)}$. Under (3.2.1), for any deterministic functions $g_t(.)$, the following unconditional moment conditions will hold:

\[ E(g_t(x_{i1}, \dots, x_{iT}) \rho_{it}(\beta_0)) = 0 \quad \forall\, t = 1, \dots, T \tag{3.2.3} \]

Therefore, under standard regularity conditions, any estimator $\hat{\beta}$ of $\beta_0$ defined by:

\[ \sum_{i=1}^n \sum_{t=1}^T g_t(x_{i1}, \dots, x_{iT}) \rho_{it}(\hat{\beta}) = 0 \tag{3.2.4} \]

will be consistent for $\beta_0$ and asymptotically normal. All of the estimators considered in this paper can be written as (3.2.4), so that they are consistent as long as (3.2.1) holds, independently of what other assumptions are considered to study efficiency.
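As a quick numerical illustration of (3.2.2), the sketch below (an illustrative Python example with assumed parameter values, not part of the original text) simulates data satisfying (3.2.1) and checks that the sample average of $\rho_{it}$ is close to zero:

```python
import numpy as np

def rho(y, mu):
    """rho_it(beta) = y_it - mu_it * (sum_s y_is) / (sum_s mu_is),
    computed unit by unit; y is (n, T), mu is a length-T vector held fixed here."""
    return y - mu * (y.sum(axis=1, keepdims=True) / mu.sum())

rng = np.random.default_rng(0)
n, T = 200_000, 4
mu = np.array([0.5, 1.0, 1.5, 2.0])      # mu(x_it, beta_0), fixed across i for simplicity
c = rng.exponential(1.0, size=(n, 1))    # multiplicative heterogeneity c_i
y = rng.poisson(c * mu)                  # any distribution with mean c_i * mu_it works
print(rho(y, mu).mean(axis=0))           # each entry close to 0
```

The moment condition holds regardless of the distribution of $c_i$, which is why it identifies $\beta_0$ without modeling the heterogeneity.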
3.2.1 Asymptotically Efficient Estimation

The conditional moment conditions written in (3.2.2) can be rewritten in the form:

\[ E(\rho_i(\beta_0) | x_i) = 0 \tag{3.2.5} \]

where $\rho_i(\beta) = [\rho_{i1}(\beta), \dots, \rho_{iT}(\beta)]'$. Similarly as in Chamberlain (1987), an optimal estimator for $\beta_0$ from (3.2.5) can be postulated to be $\hat{\beta}_{opt}$ from:

\[ \sum_{i=1}^n D_i' \Sigma_i^{+} \rho_i(\hat{\beta}_{opt}) = 0 \tag{3.2.6} \]

where $D_i = E(\frac{\partial \rho_i}{\partial \beta}(\beta_0) | x_i)$, $\Sigma_i = Var(\rho_i(\beta_0) | x_i)$ and $\Sigma_i^{+}$ is some generalized inverse of $\Sigma_i$.¹ If $\hat{\beta}_{opt}$ is indeed optimal, $D_i' \Sigma_i^{+}$ can be called the optimal instruments for the vector of moment functions $\rho_i(\beta)$.

¹ That $\hat{\beta}_{opt}$ is optimal for estimating $\beta_0$ from (3.2.5) has to be proven since Chamberlain (1987) considers cases where $Var(\rho_i(\beta_0) | x_i)$ is non-singular a.s., but in our case $Var(\rho_i(\beta_0) | x_i)$ can be shown to be non-invertible.

Under standard regularity conditions:

\[ \sqrt{n}(\hat{\beta}_{opt} - \beta_0) \xrightarrow{d} N(0, V_{opt}) \]
\[ V_{opt} = E(D_i' \Sigma_i^{+} D_i)^{-1} E(D_i' \Sigma_i^{+} \Sigma_i \Sigma_i^{+} D_i) E(D_i' \Sigma_i^{+} D_i)^{-1} = E(D_i' \Sigma_i^{+} D_i)^{-1} \]

Appendix C.1 shows that, when a specific generalized inverse of $\Sigma_i$ denoted $\Sigma_i^{-}$ is used, $\hat{\beta}_{opt}$ is asymptotically efficient for estimating $\beta_0$ from (3.2.1), by showing that $V_{opt}$ is equal to the inverse of the asymptotic information bound for estimating $\beta_0$ from (3.2.1) derived in Chamberlain (1992b).²

3.2.2 Conditions for Efficiency of the Poisson FE estimator

As shown in Wooldridge (1999), the Poisson fixed effects estimator, $\hat{\beta}_{PFE}$, is defined by:

\[ \sum_{i=1}^n \Big( \frac{\partial p_i(\hat{\beta}_{PFE})}{\partial \beta} \Big)' W_i(\hat{\beta}_{PFE})^{-1} \rho_i(\hat{\beta}_{PFE}) = 0 \tag{3.2.7} \]

where $p_i(\beta) = [p_{i1}(\beta), \dots, p_{iT}(\beta)]'$, $p_{it}(\beta) = \frac{\mu_{it}(\beta)}{\sum_{s=1}^T \mu_{is}(\beta)}$, $W_i(\beta) = diag(p_i(\beta))$, and $diag(a)$ is the diagonal matrix with $a$ for diagonal.
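For the common exponential mean $\mu(x_{it}, \beta) = \exp(x_{it}\beta)$, the first-order condition (3.2.7) reduces (see the computation in Appendix C.2) to $\sum_i \sum_t (x_{it} - \bar{x}_i(\beta)) \rho_{it}(\beta) = 0$, where $\bar{x}_i(\beta)$ is the $\mu$-weighted time average of $x_{it}$. A minimal Python sketch (illustrative only: scalar $\beta$, simulated data, bisection in place of a full solver) solves this equation directly:

```python
import numpy as np

def pfe_score(beta, y, x):
    """Sample moment from (3.2.7) specialized to mu = exp(x * beta), scalar beta:
    sum_i sum_t (x_it - xbar_i(beta)) * rho_it(beta), where xbar_i(beta) is the
    mu-weighted time average of x_it and rho_it is as in Section 3.2."""
    mu = np.exp(beta * x)
    xbar = (x * mu).sum(1, keepdims=True) / mu.sum(1, keepdims=True)
    rho = y - mu * y.sum(1, keepdims=True) / mu.sum(1, keepdims=True)
    return ((x - xbar) * rho).sum()

# simulated data satisfying (3.2.1) with beta_0 = 1 (assumed, illustrative DGP)
rng = np.random.default_rng(2)
n, T, beta0 = 2000, 5, 1.0
x = rng.uniform(-1, 1, size=(n, T))
c = np.exp(rng.uniform(-1, 1, size=(n, 1)))
y = rng.poisson(c * np.exp(beta0 * x))

# solve pfe_score(beta) = 0 by bisection (the score is decreasing in beta)
lo, hi = 0.0, 2.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if pfe_score(mid, y, x) > 0 else (lo, mid)
beta_pfe = 0.5 * (lo + hi)
```

Note that the estimating equation never involves the $c_i$, which is the practical appeal of the PFE moment function.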
Under standard regularity conditions, $\hat{\beta}_{PFE}$ is asymptotically equivalent to $\tilde{\beta}_{PFE}$ defined by:

\[ \sum_{i=1}^n \Big( \frac{\partial p_i(\beta_0)}{\partial \beta} \Big)' W_i(\beta_0)^{-1} \rho_i(\tilde{\beta}_{PFE}) = 0 \tag{3.2.8} \]

We show in Appendix C.2 that $D_i' \Sigma_i^{-} = -\big( \frac{\partial p_i(\beta_0)}{\partial \beta} \big)' W_i(\beta_0)^{-1}$ if (3.2.1) holds as well as:

\[ Var(y_{it} | c_i, x_i) = c_i \mu_{it} \tag{3.2.9} \]
\[ Cov(y_{it}, y_{it-s} | c_i, x_i) = 0 \quad \forall\, s = 1, \dots, t \tag{3.2.10} \]

Therefore under these additional conditions, the PFE estimator uses the optimal instruments for $\rho_i(\beta)$ in order to estimate $\beta_0$ and is asymptotically efficient in the class of estimators that are consistent under (3.2.1).

² A corollary of that result is that $\hat{\beta}_{opt}$ is indeed also optimal for estimating $\beta_0$ from (3.2.5), since (3.2.1) implies (3.2.5). Therefore, the optimal estimator from (3.2.5) corresponds to the optimal estimator from (3.2.1), i.e. no information is lost for estimating $\beta_0$ from transforming (3.2.1) to (3.2.5).

3.2.3 An Alternative Estimator

In this section we derive an optimal estimator for cases where one thinks there might be overdispersion, so that instead of (3.2.9) we have:

\[ Var(y_{it} | c_i, x_i) = c_i \mu_{it} + \theta c_i^2 \mu_{it}^2 \tag{3.2.11} \]

where $\theta$ is an unknown parameter, and serial correlation, so that instead of (3.2.10) we have:

\[ Cov(y_{it}, y_{it-s} | c_i, x_i) = \gamma c_i^2 \mu_{it} \mu_{it-1} \quad \text{for } s = 1 \tag{3.2.12} \]
\[ Cov(y_{it}, y_{it-s} | c_i, x_i) = 0 \quad \text{for } s > 1 \tag{3.2.13} \]

where $\gamma$ is an unknown parameter. Note that (3.2.9) and (3.2.10) are a special case of assumptions (3.2.11) and (3.2.12), since both sets of assumptions are the same with $\theta = 0$ and $\gamma = 0$. Appendix C.3 shows that, as long as $T \geq 3$, consistent estimators of $\theta$ and $\gamma$ can be obtained under the assumptions (3.2.1), (3.2.11) and (3.2.12); denote these estimators $\hat{\theta}$ and $\hat{\gamma}$.

As seen in Section 3.2.1, the optimal instruments for $\rho_i(\beta)$ are $D_i' \Sigma_i^{-}$ where:

\[ D_i = -E(c_i | x_i) \Big( \sum_{t=1}^T \mu_{it} \Big) \Big[ \frac{\partial p_{it}}{\partial \beta} \Big]_{t=1,\dots,T} \tag{3.2.14} \]

and $\Sigma_i^{-}$ is a specific generalized inverse of $\Sigma_i = Var(\rho_i | x_i)$. Without (3.2.9) and (3.2.10), $D_i' \Sigma_i^{-}$ does depend on $E(c_i | x_i)$ and $Var(c_i | x_i)$.
Therefore with assumptions (3.2.11) and (3.2.12) instead of (3.2.9) and (3.2.10), the conditional mean and variance of the unobserved heterogeneity term $c_i$ are needed to compute the optimal instruments. One can model these as known functions $h_1$ and $h_2$ of a vector of unknown nuisance parameters $\eta$:

\[ E(c_i | x_i) = h_1(x_i, \eta) \tag{3.2.15} \]
\[ Var(c_i | x_i) = h_2(x_i, \eta) \tag{3.2.16} \]

and estimate $\eta$ consistently since under (3.2.1), (3.2.11), (3.2.12), (3.2.15) and (3.2.16):

\[ E(y_{it} | x_i) = h_1(x_i, \eta) \mu_{it} \tag{3.2.17} \]
\[ E(y_{it}^2 | x_i) = h_1(x_i, \eta) \mu_{it} + (\theta + 1)(h_2(x_i, \eta) + h_1(x_i, \eta)^2) \mu_{it}^2 \tag{3.2.18} \]

Therefore a consistent estimator of $\eta$ under (3.2.1), (3.2.11), (3.2.12), (3.2.15) and (3.2.16) can be obtained by pooled non-linear regression from (3.2.17) and (3.2.18) with $\mu_{it}$ replaced by $\mu_{it}(\ddot{\beta})$ and $\theta$ replaced by $\hat{\theta}$, where $\ddot{\beta}$ is a preliminary consistent estimator of $\beta_0$. Denote by $\hat{\eta}$ the resulting estimator of $\eta$.

The alternative estimator to Poisson fixed effects we propose in this paper is $\hat{\beta}_{alt}$ defined by:

\[ \sum_{i=1}^n \hat{D}_i' \hat{\Sigma}_i^{-} \rho_i(\hat{\beta}_{alt}) = 0 \tag{3.2.19} \]

where

\[ \hat{D}_i = h_1(x_i, \hat{\eta}) \Big( \sum_{t=1}^T \ddot{\mu}_{it} \Big) \Big[ \frac{\partial p_{it}}{\partial \beta}(\ddot{\beta}) \Big]_{t=1,\dots,T} \tag{3.2.20} \]

and:

\[ \hat{\Sigma}_i^{-} = \hat{\Sigma}_{y,i}^{-1} - \hat{\Sigma}_{y,i}^{-1} \ddot{\mu}_i (\ddot{\mu}_i' \hat{\Sigma}_{y,i}^{-1} \ddot{\mu}_i)^{-1} \ddot{\mu}_i' \hat{\Sigma}_{y,i}^{-1} \tag{3.2.21} \]

where $\ddot{\mu}_{it} = \mu_{it}(\ddot{\beta})$, $\ddot{\mu}_i = \mu_i(\ddot{\beta}) = [\mu_{i1}(\ddot{\beta}), \dots, \mu_{iT}(\ddot{\beta})]'$ and the $(t, s)$th element of $\hat{\Sigma}_{y,i}$ is:

\[ \widehat{Cov}(y_{it}, y_{is} | x_i) = 1[t = s]\, h_1(x_i, \hat{\eta}) \ddot{\mu}_{it} + \ddot{\mu}_{it} \ddot{\mu}_{is} \Big( 1[|t - s| \leq 1]\, \hat{\theta}^{\,1-|t-s|} \hat{\gamma}^{\,|t-s|} (h_2(x_i, \hat{\eta}) + h_1(x_i, \hat{\eta})^2) + h_2(x_i, \hat{\eta}) \Big) \]

where $\ddot{\beta}$ can simply be defined to be the Poisson fixed effects estimator. This estimator uses optimal instruments for $\rho_i(\beta)$ and is asymptotically efficient in the class of estimators of $\beta_0$ that are consistent under (3.2.1) as long as (3.2.11), (3.2.12), (3.2.15) and (3.2.16) hold.
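The matrix in (3.2.21) can be checked numerically to be a generalized inverse of $\Sigma_i = (I - P_i)\Sigma_{y,i}(I - P_i)'$, where $P_i = \mu_i \iota' / \sum_t \mu_{it}$ (see Appendix C.1). A short Python sketch with an arbitrary positive definite stand-in for $\Sigma_{y,i}$ and illustrative $\mu_i$ values:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 4
mu = rng.uniform(0.5, 2.0, size=T)            # mu_i(beta) for one unit (illustrative)
A = rng.normal(size=(T, T))
Sigma_y = A @ A.T + np.eye(T)                 # arbitrary p.d. stand-in for Var(y_i | x_i)
Sy_inv = np.linalg.inv(Sigma_y)

# generalized inverse from (3.2.21)
Sig_minus = Sy_inv - Sy_inv @ np.outer(mu, mu) @ Sy_inv / (mu @ Sy_inv @ mu)

# Sigma_i = (I - P_i) Sigma_y (I - P_i)' with P_i = mu 1' / sum_t mu_t (Appendix C.1)
P = np.outer(mu, np.ones(T)) / mu.sum()
M = np.eye(T) - P
Sigma = M @ Sigma_y @ M.T

print(np.allclose(Sigma @ Sig_minus @ Sigma, Sigma),
      np.allclose(Sig_minus @ Sigma @ Sig_minus, Sig_minus))
```

Both checks hold for any positive definite $\Sigma_{y,i}$, which mirrors the algebra of Appendix C.1: the key facts are $\Sigma_i^{-}\mu_i = 0$ and $P_i\mu_i = \mu_i$.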
Appendix C.4 shows that when (3.2.9) and (3.2.10) hold, so that (3.2.11) and (3.2.12) hold with $\theta = 0$, $\gamma = 0$, and $\hat{\beta}_{PFE}$ is asymptotically efficient, $\hat{\beta}_{alt}$ and $\hat{\beta}_{PFE}$ are asymptotically equivalent, independently of whether (3.2.15) and (3.2.16) hold. Therefore, the estimator $\hat{\beta}_{alt}$ is indeed efficient under more general conditions than the Poisson fixed effects estimator.

3.3 Monte Carlo Simulations Study

To compare the small sample performance of the Poisson fixed effects estimator and the alternative estimator defined by (3.2.19), we use both estimators to estimate $\beta_0$ from the data generating process:

\[ x_{it} \overset{i.i.d.}{\sim} Uniform(-a, a) \]
\[ c_i | x_i \sim F_c(x_i) \]
\[ e_{it} \overset{i.i.d.}{\sim} Uniform(a_e, b_e) \]
\[ u_{it} = \delta \exp(e_{it-1}) + \exp(e_{it}) \]
\[ y_{it} \sim Poisson(c_i \exp(\beta_0 x_{it}) u_{it}) \]

where $Uniform(a, b)$ denotes the uniform distribution over the interval $(a, b)$ and $Poisson(\mu)$ denotes the Poisson distribution with mean $\mu$. We set $\beta_0 = 1$. We also set $a_e$, $b_e$ and $\delta$ so that $E(\exp(e_{it})) = \frac{1}{1+\delta}$ and $Var(\exp(e_{it})) = \frac{\theta}{1+\delta^2}$ (so that $E(u_{it} | c_i, x_i) = 1$ and $Var(u_{it} | c_i, x_i) = \theta$) and $Cov(u_{it}, u_{it-1} | c_i, x_i) = \delta Var(\exp(e_{it})) = \gamma$. Therefore (3.2.1), (3.2.11) and (3.2.12) are satisfied.

Tables 3.1, 3.2 and 3.3 show the performance of both estimators and of the unfeasible optimal estimator using $D_i' \Sigma_i^{-}$ as instruments for $\rho_i(\beta)$ from Monte Carlo simulations. The results shown are measures of bias, standard deviation, and root MSE for sample sizes $N = 100$, $N = 500$ and $N = 1000$, with ten time periods and for the cases where $\{\theta = 0, \gamma = 0\}$ and $\{\theta = 1, \gamma = 0.5\}$. $F_c(c_i)$ is given by $c_i = \exp(\lambda \bar{x}_i + Uniform(-a, a))$ where $\bar{x}_i = \frac{1}{T} \sum_{t=1}^T x_{it}$. We show results for $\lambda = 0$ and $\lambda = 1$. In both cases, as $a$ increases, the variances of $c_i$ and $c_i^2$ increase. We show results for $a = 1, 1.5, 2$. The model for the conditional mean and variance of $c_i$ that we use is:

\[ h_1(x_i, \eta) = \eta_1 \]
\[ h_2(x_i, \eta) = \eta_2 \]

Therefore this model corresponds to the true data generating process when $\lambda = 0$ but not when $\lambda = 1$.
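The data generating process above can be sketched in Python for the simplest configuration, $\theta = 0$, $\gamma = 0$ (so $\delta = 0$ and $e_{it}$ degenerate at zero, giving $u_{it} = 1$) and $\lambda = 0$; the general case additionally requires calibrating $a_e$, $b_e$ and $\delta$ to the stated moment conditions. All specific values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, a, beta0 = 50_000, 5, 1.0, 1.0

x = rng.uniform(-a, a, size=(n, T))               # x_it ~ Uniform(-a, a)
c = np.exp(rng.uniform(-a, a, size=(n, 1)))       # c_i = exp(Uniform(-a, a)), lambda = 0
u = np.ones((n, T))                               # theta = 0, gamma = 0  =>  u_it = 1
y = rng.poisson(c * np.exp(beta0 * x) * u)        # y_it ~ Poisson(c_i exp(beta0 x_it) u_it)

# sanity check: E(y_it exp(-x_it)) = E(c_i) = (e - 1/e) / 2 when a = 1
print((y * np.exp(-x)).mean())                    # approx 1.175
```

Replacing `u` with draws built from $\delta \exp(e_{it-1}) + \exp(e_{it})$ produces the overdispersed, serially correlated designs in the tables.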
When $\{\theta = 0, \gamma = 0\}$, both the Poisson fixed effects estimator and our alternative estimator are asymptotically efficient in the class of estimators consistent under (3.2.1), independently of the distribution of $c_i$. When $\{\theta = 1, \gamma = 0.5\}$ and $\lambda = 0$, the Poisson fixed effects estimator is not asymptotically efficient while our alternative estimator is. When $\{\theta = 1, \gamma = 0.5\}$ and $\lambda = 1$, neither the Poisson fixed effects estimator nor our alternative estimator is efficient.

The results in Tables 3.1, 3.2 and 3.3 show that significant gains in efficiency can be achieved by using the unfeasible optimal instruments, but that in small samples the additional noise originating from estimating $\eta_1$, $\eta_2$, $\theta$ and $\gamma$ to compute a feasible estimator can overpower this gain in efficiency and result in the alternative estimator defined by (3.2.19) being significantly less accurate than the Poisson fixed effects estimator. A solution to this problem could be to derive a data-based criterion that captures the trade-off between asymptotic efficiency and finite sample noise from nuisance parameters and helps decide between different models of optimal instruments. This is left for future research.

The conclusion of this section is that the Poisson fixed effects estimator performs well with small sample sizes compared to an alternative estimator that is asymptotically efficient under more general conditions. However, with large enough sample sizes, significant gains in efficiency can be obtained from using a more general model of optimal instruments.
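The nuisance parameter estimators $\hat{\theta}$ and $\hat{\gamma}$ used by the feasible alternative estimator are simple method-of-moments ratios (derived in Appendix C.3). An illustrative Python sketch, plugging in the true $\mu_{it}$ in place of the preliminary estimates $\mu_{it}(\ddot{\beta})$ and using data generated with $\theta = 0$, $\gamma = 0$ (all simulation settings are assumptions for illustration):

```python
import numpy as np

def theta_gamma_hat(y, mu):
    """Ratio estimators from Appendix C.3 (with mu_it treated as known here)."""
    m0 = ((y**2 - y) / mu**2).mean()                               # E[(y^2 - y)/mu^2]
    m1 = (y[:, 1:] * y[:, :-1] / (mu[:, 1:] * mu[:, :-1])).mean()  # lag-1 cross moment
    m2 = (y[:, 2:] * y[:, :-2] / (mu[:, 2:] * mu[:, :-2])).mean()  # lag-2 cross moment
    return m0 / m2 - 1.0, m1 / m2 - 1.0

# data with theta = 0, gamma = 0 (u_it = 1): both estimators should be near zero
rng = np.random.default_rng(4)
n, T = 100_000, 5
x = rng.uniform(-1, 1, size=(n, T))
c = np.exp(rng.uniform(-1, 1, size=(n, 1)))
mu = np.exp(x)                       # mu_it at beta_0 = 1
y = rng.poisson(c * mu)
theta_hat, gamma_hat = theta_gamma_hat(y, mu)
```

The lag-2 cross moment in the denominator isolates $E(c_i^2)$, which is why $T \geq 3$ is required.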
Table 3.1: N = 100: Bias, standard deviation and root mean squared error

                       c ~ exp(Uniform(-1,1))      c ~ exp(Uniform(-1.5,1.5))    c ~ exp(Uniform(-2,2))
                       Bias     sd      rmse       Bias     sd      rmse         Bias     sd      rmse

θ = 0, γ = 0, λ = 0
Poisson FE             0.002    0.057   0.057      0.002    0.034   0.034        0.001    0.022   0.022
Feasible alternative   0.002    0.057   0.057      0.004    0.064   0.065        0.001    0.077   0.077
Unfeasible optimal     0.002    0.057   0.057      0.002    0.034   0.034        0.001    0.022   0.022

θ = 0, γ = 0, λ = 1
Poisson FE             0.001    0.055   0.055     −0.001    0.033   0.033        0.001    0.021   0.021
Feasible alternative   0.001    0.055   0.055     −0.002    0.051   0.051       −0.004    0.078   0.078
Unfeasible optimal     0.001    0.055   0.055     −0.001    0.033   0.033        0.001    0.021   0.021

θ = 1, γ = .5, λ = 0
Poisson FE             0.005    0.086   0.086      0.004    0.066   0.067        0.000    0.063   0.063
Feasible alternative   0.004    0.077   0.078     −0.003    0.102   0.102       −0.002    0.102   0.102
Unfeasible optimal     0.005    0.076   0.077      0.003    0.054   0.054        0.001    0.043   0.043

θ = 1, γ = .5, λ = 1
Poisson FE             0.007    0.088   0.089      0.003    0.069   0.069       −0.001    0.067   0.067
Feasible alternative   0.008    0.083   0.083      0.001    0.088   0.088       −0.004    0.128   0.128
Unfeasible optimal     0.007    0.078   0.079      0.004    0.053   0.053       −0.002    0.041   0.041

Table 3.2: N = 500: Bias, standard deviation and root mean squared error

                       c ~ exp(Uniform(-1,1))      c ~ exp(Uniform(-1.5,1.5))    c ~ exp(Uniform(-2,2))
                       Bias     sd      rmse       Bias     sd      rmse         Bias     sd      rmse

θ = 0, γ = 0, λ = 0
Poisson FE            −0.000    0.025   0.025     −0.000    0.015   0.015       −0.000    0.010   0.010
Feasible alternative  −0.000    0.025   0.025     −0.000    0.015   0.015       −0.000    0.012   0.012
Unfeasible optimal    −0.000    0.025   0.025     −0.000    0.015   0.015       −0.000    0.010   0.010

θ = 0, γ = 0, λ = 1
Poisson FE            −0.001    0.024   0.024      0.000    0.015   0.015        0.000    0.010   0.010
Feasible alternative  −0.001    0.024   0.024      0.000    0.015   0.015        0.000    0.012   0.012
Unfeasible optimal    −0.001    0.024   0.024      0.000    0.015   0.015        0.000    0.010   0.010

θ = 1, γ = .5, λ = 0
Poisson FE            −0.002    0.040   0.040     −0.001    0.031   0.031       −0.001    0.028   0.028
Feasible alternative  −0.003    0.036   0.035     −0.002    0.030   0.030       −0.002    0.023   0.023
Unfeasible optimal    −0.003    0.036   0.035     −0.001    0.025   0.025       −0.001    0.019   0.019

θ = 1, γ = .5, λ = 1
Poisson FE            −0.002    0.040   0.040     −0.001    0.032   0.032       −0.000    0.030   0.030
Feasible alternative  −0.003    0.035   0.035     −0.001    0.024   0.024       −0.002    0.030   0.030
Unfeasible optimal    −0.003    0.035   0.035     −0.001    0.024   0.024       −0.000    0.019   0.019

Table 3.3: N = 1000: Bias, standard deviation and root mean squared error

                       c ~ exp(Uniform(-1,1))      c ~ exp(Uniform(-1.5,1.5))    c ~ exp(Uniform(-2,2))
                       Bias     sd      rmse       Bias     sd      rmse         Bias     sd      rmse

θ = 0, γ = 0, λ = 0
Poisson FE            −0.001    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007
Feasible alternative  −0.001    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007
Unfeasible optimal    −0.001    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007

θ = 0, γ = 0, λ = 1
Poisson FE            −0.000    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007
Feasible alternative  −0.000    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007
Unfeasible optimal    −0.000    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007

θ = 1, γ = .5, λ = 0
Poisson FE            −0.000    0.028   0.028      0.000    0.022   0.022        0.001    0.020   0.020
Feasible alternative   0.000    0.025   0.025      0.001    0.017   0.017        0.000    0.014   0.014
Unfeasible optimal     0.000    0.024   0.024      0.001    0.017   0.017       −0.000    0.013   0.013

θ = 1, γ = .5, λ = 1
Poisson FE            −0.001    0.028   0.028      0.001    0.022   0.022       −0.000    0.021   0.021
Feasible alternative  −0.001    0.024   0.024      0.000    0.017   0.017        0.000    0.022   0.022
Unfeasible optimal    −0.001    0.024   0.024      0.000    0.016   0.016       −0.000    0.013   0.013

APPENDICES

APPENDIX A

ESTIMATION OF DYNAMIC PANEL DATA MODELS WITH CROSS-SECTIONAL DEPENDENCE

A.1 Efficient Estimation with Clustering

A.1.1 Unfeasible Optimal Instruments

Consider any GMM estimator of $\rho_0$ defined as in (1.2.6) for some set of valid instruments $\{Z_i\}_{i=1,\dots,n}$ of dimension $r \times (T-1)$, which can be rewritten as:

\[ \hat{\rho} = \arg\min_\rho \Big( \sum_{g=1}^G Z^{g\prime} m^g(\rho) \Big)' \Xi \Big( \sum_{g=1}^G Z^{g\prime} m^g(\rho) \Big) \tag{A.1.1} \]

where $Z^g = [Z_2^{g\prime}, \dots, Z_T^{g\prime}]'$ and $Z_t^g = [Z_{i_1^g t}, \dots, Z_{i_{n_g}^g t}]$.
From White (2001), $\hat{\rho}$ is consistent for $\rho_0$ and:

\[ \sqrt{G}(\hat{\rho} - \rho_0) \xrightarrow{d} N(0, (D'\Xi D)^{-1} D'\Xi \Upsilon \Xi D (D'\Xi D)^{-1}) \tag{A.1.2} \]

with $D = \operatorname{plim}(\frac{1}{G}\sum_{g=1}^G Z^{g\prime}\Delta Y_{-1}^g)$ and $\Upsilon = \operatorname{plim}(\frac{1}{G}\sum_{g=1}^G Z^{g\prime} m^g m^{g\prime} Z^g)$.

$\Xi = \Upsilon^{-1}$ is the optimal weighting matrix for that estimator, and with such a weighting matrix:

\[ \sqrt{G}(\hat{\rho} - \rho_0) \xrightarrow{d} N(0, (D'\Upsilon^{-1}D)^{-1}) \tag{A.1.3} \]

Therefore in this section we will show that the asymptotic variance of $\hat{\rho}_{opt}$ defined by (1.3.4) is smaller than $(D'\Upsilon^{-1}D)^{-1}$ for any set of valid matrices of instruments $\{Z_i\}_{i=1,\dots,n}$, as long as (1.2.1), (1.2.2) and Auxiliary Assumption 1 are satisfied.

Since $(\Phi^g)^{-1/2}$ is upper triangular with its element in row $j$, column $i$ being a function of $Y_{\max\{i,j\}-1}^g$, for any valid set of instruments $\{Z^g\}_{g=1,\dots,G}$ we have:

\[ E(Z^{g\prime}(\Phi^g)^{-1/2} m^g) = 0 \tag{A.1.4} \]

because the $j$th $r \times n_g$ component of $Z^{g\prime}(\Phi^g)^{-1/2}$ is a function of $Y_{j-1}^g$. In addition, we have:

\[ Var(Z^{g\prime}(\Phi^g)^{-1/2} m^g) = E(Z^{g\prime}(\Phi^g)^{-1/2} \Phi^g (\Phi^g)^{-1/2\prime} Z^g) = E(Z^{g\prime} Z^g) \tag{A.1.5, A.1.6} \]

because the $j$th $r \times n_g$ component of $Z^{g\prime}(\Phi^g)^{-1/2}$ is a function of $Y_{j-1}^g$ and

\[ \Phi^g = [E(m_t^g m_s^{g\prime} | Y_{\max\{t,s\}-2}^g)]_{t=2,\dots,T}^{s=2,\dots,T} \tag{A.1.7} \]

Note that since $E((Z^{g\prime} m^g m^{g\prime} Z^g) - Var(Z^{g\prime} m^g)) = 0$, then

\[ \operatorname{plim} \frac{1}{G} \sum_{g=1}^G (Z^{g\prime} m^g m^{g\prime} Z^g) = \operatorname{plim} \frac{1}{G} \sum_{g=1}^G Var(Z^{g\prime} m^g) \tag{A.1.8} \]

Define:

\[ \Delta \tilde{Y}_{-1}^g = (\Phi^g)^{-1/2} \Delta Y_{-1}^g \tag{A.1.9} \]

Define $\Delta \tilde{Y}_{-1,t}^g$ the $t$th block of $n_g$ rows of $\Delta \tilde{Y}_{-1}^g$. Define:

\[ L_t^g = E(\Delta \tilde{Y}_{-1,t}^g | Y_{t-2}^g) \tag{A.1.10} \]
\[ Z_{opt}^g = L^{g\prime} (\Phi^g)^{-1/2} \tag{A.1.11} \]

Define $L^g = [L_2^{g\prime}, \dots, L_T^{g\prime}]'$, and define $D_{opt} = \operatorname{plim}(\frac{1}{G}\sum_{g=1}^G Z_{opt}^g \Delta Y_{-1}^g)$ and $\Upsilon_{opt} = \operatorname{plim}(\frac{1}{G}\sum_{g=1}^G Z_{opt}^g m^g m^{g\prime} Z_{opt}^{g\prime})$. We have $E(Z_{opt}^g \Delta Y_{-1}^g - L^{g\prime} L^g) = 0$ because the $j$th $r \times n_g$ component of $Z^{g\prime}(\Phi^g)^{-1/2}$ is a function of $Y_{j-1}^g$ and $L_t^g = E(\Delta \tilde{Y}_{-1,t}^g | Y_{t-2}^g)$. Therefore:

\[ D_{opt} = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z_{opt}^g \Delta Y_{-1}^g \Big) = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G L^{g\prime} L^g \Big) \tag{A.1.12, A.1.13} \]

Since $Var(Z^{g\prime}(\Phi^g)^{-1/2} m^g) = E(Z^{g\prime} Z^g)$, we have in particular: $Var(L^{g\prime}(\Phi^g)^{-1/2} m^g) = E(L^{g\prime} L^g)$.
Therefore:

\[ \Upsilon_{opt} = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z_{opt}^g m^g m^{g\prime} Z_{opt}^{g\prime} \Big) = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G L^{g\prime} L^g \Big) = D_{opt} \tag{A.1.14–A.1.16} \]

so that $(D_{opt}' \Upsilon_{opt}^{-1} D_{opt})^{-1} = D_{opt}^{-1}$.

Therefore the estimator $\hat{\rho}_{opt}$ defined by:

\[ \hat{\rho}_{opt} = \arg\min_\rho \Big( \sum_{g=1}^G Z_{opt}^g m^g(\rho) \Big)' \Big( \sum_{g=1}^G Z_{opt}^g m^g(\rho) \Big) \tag{A.1.17} \]

is consistent for $\rho_0$ and $\sqrt{G}$-asymptotically normal with asymptotic variance:

\[ V_{opt} = D_{opt}^{-1} \tag{A.1.18} \]

We can show that this variance-covariance matrix is smaller than $(D'\Upsilon^{-1}D)^{-1}$ no matter what set of instruments $\{Z^g\}_{g=1,\dots,G}$ is used. Denote $\Delta$ the difference between $D_{opt}$ and $D'\Upsilon^{-1}D$:

\[ \Delta = D_{opt} - D'\Upsilon^{-1}D = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z_{opt}^g \Phi^g Z_{opt}^{g\prime} \Big) - D'\Upsilon^{-1}D \tag{A.1.19–A.1.21} \]

Since $(\Phi^g)^{-1/2\prime}(\Phi^g)^{-1/2} = (\Phi^g)^{-1}$ we also have $\Phi^g = \Phi^{g1/2\prime}\Phi^{g1/2}$ where $\Phi^{g1/2}$ is upper triangular and is composed of $n_g \times n_g$ matrices such that the $(j, k)$th matrix for $k > j$ is a function of $Y_{k-1}^g$. Therefore:

\[ E\Big( Z^{g\prime} \Big( \frac{\partial m^g(\rho_0)}{\partial \rho} - \Phi^g Z_{opt}^{g\prime} \Big) \Big) = 0 \tag{A.1.22} \]

since the $j$th $r \times n_g$ component of $Z^{g\prime}$ is a function of $Y_{j-1}^g$ and $(\Phi^g)_t^{1/2} L_t^g = E((\Phi^g)_t^{1/2} \Delta \tilde{Y}_{-1,t}^g | Y_{t-2}^g)$, where $(\Phi^g)_t^{1/2}$ is the $(t-1)$th $n_g \times n_g(T-1)$ matrix composing $(\Phi^g)^{1/2}$. In addition we have

\[ E(Z^{g\prime}(m^g(\rho_0) m^g(\rho_0)' - \Phi^g) Z^g) = 0 \tag{A.1.23} \]

We can then apply the WLLN to show that:

\[ D = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z^{g\prime} \Phi^g Z_{opt}^{g\prime} \Big) \tag{A.1.24} \]
\[ \Upsilon = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z^{g\prime} \Phi^g Z^g \Big) \tag{A.1.25} \]

Define $D_n = \frac{1}{G} \sum_{g=1}^G Z^{g\prime} \Phi^g Z_{opt}^{g\prime}$, $\Upsilon_n = \frac{1}{G} \sum_{g=1}^G Z^{g\prime} \Phi^g Z^g$ and $D_{opt,n} = \frac{1}{G} \sum_{g=1}^G Z_{opt}^g \Phi^g Z_{opt}^{g\prime}$. Define $\bar{Z}_{opt} = [Z_{opt}^{1\prime}, \dots, Z_{opt}^{G\prime}]'$, $\bar{Z} = [Z^{1\prime}, \dots, Z^{G\prime}]'$ and $S = diag(\{\Phi^g\}_{g=1,\dots,G})$. Then:

\[ D_{opt,n} - D_n' \Upsilon_n^{-1} D_n = \frac{1}{G}\big( \bar{Z}_{opt}' S \bar{Z}_{opt} - \bar{Z}_{opt}' S \bar{Z} (\bar{Z}' S \bar{Z})^{-1} \bar{Z}' S \bar{Z}_{opt} \big) = \frac{1}{G} \bar{Z}_{opt}' S^{1/2\prime} \big( I - S^{1/2} \bar{Z} (\bar{Z}' S \bar{Z})^{-1} \bar{Z}' S^{1/2\prime} \big) S^{1/2} \bar{Z}_{opt} \tag{A.1.26, A.1.27} \]

Therefore $D_{opt,n} - D_n' \Upsilon_n^{-1} D_n$ is positive semi-definite for any value of $n$, and $D_{opt} - D'\Upsilon^{-1}D$ is positive semi-definite by the continuous mapping theorem, so that $(D'\Upsilon^{-1}D)^{-1} - V_{opt}$ is positive semi-definite. A similar result was found in Chamberlain (1992a) for the case of cross-sectional independence.
A.1.2 Efficient Estimation with Auxiliary Assumptions

Under Auxiliary Assumptions 1-2a, the variance-covariance matrix of $u_t^g$ is:

\[ \Sigma_u^g = \sigma_u^2 \begin{pmatrix} 1 & & & \\ \tau_u & 1 & & \\ \vdots & & \ddots & \\ \tau_u & \dots & \tau_u & 1 \end{pmatrix} \tag{A.1.28} \]

and we have

\[ E(m_t^g m_s^{g\prime} | Y_{\max\{t,s\}-2}^g) = 2\Sigma_u^g \quad \text{if } t = s \tag{A.1.29} \]
\[ = -\Sigma_u^g \quad \text{if } |t - s| = 1 \tag{A.1.30} \]
\[ = 0 \quad \text{if } |t - s| \geq 2 \tag{A.1.31} \]

Therefore:

\[ \Phi^g = J^g (I_T \otimes \Sigma_u^g) J^{g\prime} \tag{A.1.32} \]

where:

\[ J^g = \begin{pmatrix} -1 & 0 & \dots & 0 & 1 & 0 & \dots & 0 \\ 0 & -1 & 0 & \dots & 0 & 1 & \dots & 0 \\ & & & \dots & & & & \\ 0 & \dots & 0 & -1 & \dots & 0 & 1 \end{pmatrix} \tag{A.1.33} \]

is the deterministic differencing matrix such that $J^g u^g = m^g$.

Therefore $L_t^g = \Phi^{g-1/2} E(\frac{\partial m^g}{\partial \rho} | Y_{t-2}^g)$ and

\[ Z_{opt}^g = [E(\Delta Y_{-1}^g | Y_0^g)', \dots, E(\Delta Y_{-1}^g | Y_{T-2}^g)'] \Psi^g \Phi^{g-1/2} \tag{A.1.34} \]

where:

\[ \Psi^g = \begin{pmatrix} \Phi_1^{g-1/2} & 0 & \dots & 0 \\ 0 & \Phi_2^{g-1/2} & & \\ \vdots & & \ddots & \\ 0 & \dots & & \Phi_{T-1}^{g-1/2} \end{pmatrix} \tag{A.1.35} \]

where $\Phi_j^{g-1/2}$ is the $j$th $n_g \times n_g(T-1)$ matrix composing $(\Phi^g)^{-1/2}$.

A.1.3 Conditional Expectation of Unobserved Heterogeneity under Clustering

Under Auxiliary Assumptions 1, 2a, 3a we have:

\[ \begin{pmatrix} c^g \\ y_0^g \\ c^g + u_1^g \\ \vdots \\ c^g + u_T^g \end{pmatrix} \sim N(\mu^g, A^g V^g A^{g\prime}) \tag{A.1.36} \]

Therefore, using the properties of the multivariate normal distribution, we have:

\[ E(c^g | y_0^g, c^g + u_1^g, \dots, c^g + u_T^g) = \mu_c \iota + (A^g V^g A^{g\prime})_{12} \big( (A^g V^g A^{g\prime})_{22} \big)^{-1} \Bigg( \begin{pmatrix} y_0^g \\ c^g + u_1^g \\ \vdots \\ c^g + u_T^g \end{pmatrix} - \begin{pmatrix} \frac{\mu_c}{1 - \rho_0} \iota_{n_g} \\ \mu_c \iota_{T \times n_g} \end{pmatrix} \Bigg) \tag{A.1.37} \]

where $(A^g V^g A^{g\prime})_{12} = Cov(c^g, [y_0^{g\prime}, (c^g + u_1^g)', \dots, (c^g + u_T^g)']')$ and $(A^g V^g A^{g\prime})_{22} = Var([y_0^{g\prime}, (c^g + u_1^g)', \dots, (c^g + u_T^g)']')$, and both matrices are components of $A^g V^g A^{g\prime}$.

$E(c^g | y_0^g, c^g + u_1^g, \dots, c^g + u_t^g)$ can be obtained in a similar fashion by considering only the first $((t+2)n_g) \times ((t+2)n_g)$ block of $A^g V^g A^{g\prime}$.

APPENDIX B

ESTIMATION OF UNOBSERVED EFFECTS PANEL DATA MODELS UNDER SEQUENTIAL EXOGENEITY

B.1 GMM Estimation and Efficiency Bound

Define $\Sigma_i = [Cov(\rho_{it}, \rho_{is} | z_i^{\max(t-1,s-1)})]_{t=1,\dots,T}^{s=1,\dots,T}$. Define $A_i$ and $\tilde{\Sigma}_i^{-1}$ to be the terms of the LDL decomposition of $\Sigma_i^{-1}$: $\Sigma_i^{-1} = A_i' \tilde{\Sigma}_i^{-1} A_i$ where $\tilde{\Sigma}_i$ is diagonal and $A_i$ is upper-triangular with only ones on the diagonal. We can show that $A_i = [1(s \geq t)(-1)^{1(s=t)} \Gamma_i^{t,s}]_{t=2,\dots,T}^{s=2,\dots,T}$, where $1(.)$ is the indicator function, and $\tilde{\Sigma}_i = diag(\{Var(\tilde{\rho}_{it} | z_i^{t-1})\}_{t=2,\dots,T})$, so that:

\[ J = E\Big( \sum_{t=2}^T \tilde{D}_{it}' \tilde{\Sigma}_{it}^{-1} \tilde{D}_{it} \Big) = E\Big( \ddot{E}\Big( A_i \frac{\partial \rho_i}{\partial \beta} \Big| z_i \Big)' \tilde{\Sigma}_i^{-1} \ddot{E}\Big( A_i \frac{\partial \rho_i}{\partial \beta} \Big| z_i \Big) \Big) \tag{B.1.1, B.1.2} \]

where $\ddot{E}([g_t]_{t=2,\dots,T} | [x_1, \dots, x_{T-1}])$ is a matrix operator that returns $E(g_t | x_{t-1})$ as its $(t-1)$th row, $t = 2, \dots, T$, where the $g_t$ are row random vectors.

Using standard results of GMM estimation we can write:

\[ \sqrt{n}(\hat{\beta}_{Lin} - \beta_0) = W^{-1} \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \Big)' \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i \Big)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i + o_p(1) \tag{B.1.3} \]
\[ W = \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \Big)' \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i \Big)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \tag{B.1.4} \]

where $\rho_i = \rho_i(\beta_0)$ and $\frac{\partial \rho_i}{\partial \beta} = \frac{\partial \rho_i(\beta_0)}{\partial \beta}$.

Applying the WLLN, $\frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} = O_p(1)$. Also $\frac{1}{n} \sum_{i=1}^n Z_i' (\rho_i \rho_i' - \Sigma_i) Z_i = o_p(1)$ since $E(Z_i'(\rho_i \rho_i' - \Sigma_i) Z_i) = 0$ from how $Z_i$ and $\Sigma_i$ were defined. Using the CLT, $\frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i = O_p(1)$. Using Slutsky's theorem, assuming $\operatorname{plim} \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i$ is p.d., we have

\[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i \Big)^{-1} - \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i \Big)^{-1} = o_p(1) \tag{B.1.5} \]

So $W = V + o_p(1)$ where $V = ( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} )' ( \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i )^{-1} \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta}$, and using Slutsky's theorem again, assuming $\operatorname{plim} W$ is finite and p.d., $W^{-1} = V^{-1} + o_p(1)$.
Therefore we can rewrite:

\[ \sqrt{n}(\hat{\beta}_{Lin} - \beta_0) = V^{-1} \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \Big)' \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i \Big)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i + o_p(1) \tag{B.1.6} \]
\[ V = \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \Big)' \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i \Big)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \tag{B.1.7} \]

In addition:

\[ n(\hat{\beta}_{Lin} - \beta_0)(\hat{\beta}_{Lin} - \beta_0)' = V^{-1} + o_p(1) \tag{B.1.8} \]

Since:

\[ \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i \Big( \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i \Big)' = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n Z_i' \rho_i \rho_j' Z_j \tag{B.1.9} \]
\[ = \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i + o_p(1) \tag{B.1.10} \]
\[ = \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i + o_p(1) \tag{B.1.11} \]

where the second equality follows from random sampling and the WLLN.

We can rewrite $V$ as:

\[ V = \frac{\partial \ddot{\rho}}{\partial \beta}' \ddot{Z} (\ddot{Z}' \ddot{Z})^{-1} \ddot{Z}' \frac{\partial \ddot{\rho}}{\partial \beta} \tag{B.1.12} \]

where $\frac{\partial \ddot{\rho}}{\partial \beta} = [\frac{\partial \ddot{\rho}_1}{\partial \beta}', \dots, \frac{\partial \ddot{\rho}_n}{\partial \beta}']'$, $\frac{\partial \ddot{\rho}_i}{\partial \beta} = \tilde{\Sigma}_i^{-1/2} A_i \frac{\partial \rho_i}{\partial \beta}$, $\ddot{Z} = [\ddot{Z}_1', \dots, \ddot{Z}_n']'$, $\ddot{Z}_i = Z_i' A_i^{-1} \tilde{\Sigma}_i^{1/2}$.

Consider the matrix linear projection of $\frac{\partial \ddot{\rho}_i}{\partial \beta}$ on $\ddot{Z}_i$, $LP(\frac{\partial \ddot{\rho}_i}{\partial \beta} | \ddot{Z}_i) = \ddot{Z}_i C$, where $C$ is a $\dim(Z_i) \times \dim(\beta)$ deterministic matrix defined by the moment conditions:

\[ E\Big( \ddot{Z}_i' \Big( \frac{\partial \ddot{\rho}_i}{\partial \beta} - \ddot{Z}_i C \Big) \Big) = 0 \tag{B.1.13} \]

It is a standard result that as long as $E(\ddot{Z}_i' \ddot{Z}_i)$ is finite and p.d. and $E(\ddot{Z}_i' \frac{\partial \ddot{\rho}_i}{\partial \beta})$ exists, this linear projection is consistently estimated by:

\[ \widehat{LP}\Big( \frac{\partial \ddot{\rho}_i}{\partial \beta} \Big| \ddot{Z}_i \Big) = \ddot{Z}_i (\ddot{Z}' \ddot{Z})^{-1} \ddot{Z}' \frac{\partial \ddot{\rho}}{\partial \beta} = LP\Big( \frac{\partial \ddot{\rho}_i}{\partial \beta} \Big| \ddot{Z}_i \Big) + o_p(1) \tag{B.1.14, B.1.15} \]
In addition, the matrix linear projection of ∂ ρ¨ on Z¨ i is the same as the matrix linear projection of ∂β ¨ ∂ ρ¨ |Zi ) on Z¨ i defined by: E( ∂β ¨ ¨ ∂ ρi |Zi ) − Z¨ C)) = 0 E(Z¨ i (E( i ∂β (B.1.21) Since the t th vector of Z¨ i , Z¨ it , is a function of Zit since zit contains zi1 , ..., zit−1 . Therefore: ¨ ¨ ¨ ∂ ρi |Zi )|Z¨ i )) + o p (1) ¨ ∂ ρi |Zi )|Z¨ i ) LP(E( V = E(LP(E( ∂β ∂β (B.1.22) So using the standard results on linear projection: ¨ V = E(E( ∂ ρ¨ i ¨ ¨ ∂ ρi |Zi )) − E(e ei ) + o p (1) |Zi ) E( i ∂β ∂β T T t=2 t=2 = E( ∑ D˜ t Σ˜ t D˜ t ) − E( ∑ eit eit ) + o p (1) where ei = E( ∂ ρ¨ i ∂β (B.1.24) ¨ ∂ ρ¨ i |Zi )|Z¨ i ) and eit = E( ∂ ρ¨ it |Zit ) − LP(E( ∂ ρ¨ it |Zit )|Z¨ it ). |Zi ) − LP(E( ∂β ∂β 83 (B.1.23) ∂β APPENDIX C EFFICIENCY OF THE POISSON FIXED EFFECTS ESTIMATOR C.1 Efficient Estimation under Conditional Mean Restrictions Chamberlain (1992b), page 581, showed that the asymptotic information bound for estimating β0 from (3.2.1) is: −1 = E(h ∂ µ (Σ−1 − Σ−1 µ (µ Σ−1 µ )−1 µ Σ−1 )∂ µ h ) V0,β i i i i y,i i y,i y,i i i y,i i (C.1.1) 0 ∂µ where hi = E(ci |xi ), xi = {xi1 , ..., xiT } ∂ µi = [∂ µi1 , ..., ∂ µiT ], ∂ µit = ∂ βit , Σy,i = Var(yi |xi ), yi = [yi1 , ..., yiT ], µi = [µi1 , ..., µiT ]. ρi = ρi (β0 ) can be rewritten as: ρi = (I − Pi )yi where   µi1   Pi = T  ∑t=1 µit  µiT 1 (C.1.2)  ... µi1    ...   ... µiT (C.1.3) Therefore:     µi1 ... µi1   ∂ µi1 ... ∂ µi1  T ∂µ     ∑t=1  it   − )yi |xi ) Di = E(( T ... ...     T 2 ( µ ) ∑ ∑t=1 µit    t=1 it  ∂ µiT ... ∂ µiT µiT ... µiT     ∂ µ ... ∂ µ µ ... µ i1  i1 i1   i1 T ∂µ   ∑t=1    1 it  −  )µi = hi ( T ... ...     T ∑t=1 µit   (∑t=1 µit )2   ∂ µiT ... ∂ µiT µiT ... 
µiT 1 ∑T ∂ µit = hi (∂ µi − t=1 µi ) T µ ∑t=1 it Note that: −1 −1 −1 −1 µi (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) = 0 (C.1.4) Therefore: −1 −1 −1 −1 −1 −1 −1 −1 −1 Di (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Di = hi ∂ µi (Σy,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )∂ µi hi 84 (C.1.5) −1 −1 −1 −1 Therefore the only thing left to show is that (Σ−1 y,i −Σy,i µi (µi Σy,i µi ) µi Σy,i ) is a generalized inverse of Σi . Σi = (I − Pi )Σy,i (I − Pi ) = Σy,i − Pi Σy,i − Σy,i Pi + Pi Σy,i Pi Note that: Pi µi = µi (C.1.6) −1 −1 −1 −1 Pi (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) = 0 (C.1.7) and: Therefore: −1 −1 −1 −1 −1 −1 −1 −1 −1 (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Σi = (Σy,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Σyi − 0 −1 −1 −1 −1 − (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Σy,i Pi + 0 = I − Pi −1 −1 −1 −1 Let Mati = (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ). Therefore: −1 −1 −1 −1 Mati Σi Mati = (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) − 0 −1 −1 −1 −1 = (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) Note that: Pi Pi = Pi (C.1.8) Therefore: −1 −1 −1 −1 Σi (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Σi = Σi − Σi Pi = Σi − Σy,i Pi + Pi Σy,i Pi + Σy,i Pi − Pi Σy,i Pi = Σi −1 −1 −1 −1 So (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) is indeed a generalized inverse of Σi . Therefore: −1 = E(D Σ− D ) = V −1 V0,β opt i i i 85 (C.1.9) −1 −1 −1 −1 −1 where Σ− i = (Σy,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ). −1 −1 −1 −1 In addition, we can characterize (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) in an alternative way that will be −1 − Σ−1 µ (µ Σ−1 µ )−1 µ Σ−1 ). We have shown that: useful for future results. Denote Bi = (Σy,i i y,i y,i i i y,i i Bi Σi = I − Pi (C.1.10) Σi Bi = I − Pi (C.1.11) XΣi X = X (C.1.12) Σi XΣi = Σi (C.1.13) Since Bi and Σi are symmetric: Bi is the unique matrix X that satisfies: XΣi = I − Pi (C.1.14) Σi X = I − Pi (C.1.15) We have already shown that X = Bi satisfies all of these requirements. 
This solution is unique since, for any $X$, $Y$ satisfying these requirements¹:

\[ X = X \Sigma_i X = X(I - P_i) = X \Sigma_i Y = (I - P_i') Y = Y \Sigma_i Y = Y \tag{C.1.16} \]

¹ This proof of uniqueness is identical to the proof of uniqueness for the Moore-Penrose pseudo inverse found in Penrose (1955).

C.2 Efficient Estimation under the Poisson FE Assumptions

Under (3.2.9) and (3.2.10) we have $\Sigma_{y,i} = h_i \, diag(\mu_i) + v_i \mu_i \mu_i'$, where $v_i = Var(c_i | x_i)$. Therefore:

\[ \Sigma_i = Var(\rho_i | x_i) = (I - P_i)(h_i \, diag(\mu_i) + v_i \mu_i \mu_i')(I - P_i)' = (I - P_i) h_i \, diag(\mu_i) (I - P_i)' \]

where the last equality follows from $\mu_{it} \mu_{is} - \mu_{is} p_{it} \sum_{r=1}^T \mu_{ir} = 0$. Define:

\[ X_i = h_i^{-1} \Big( diag\Big( \frac{1}{\mu_i} \Big) - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) \tag{C.2.1} \]

where $J$ is the $T \times T$ matrix of ones and, by an abuse of notation, $(\frac{1}{\mu_i}) = [\frac{1}{\mu_{i1}}, \dots, \frac{1}{\mu_{iT}}]$.

Note that:

\[ J P_i = J \qquad \text{and} \qquad diag\Big( \frac{1}{\mu_i} \Big) P_i = \frac{1}{\sum_{t=1}^T \mu_{it}} J \tag{C.2.2} \]

Therefore:

\[ \Sigma_i X_i = (I - P_i)\, diag(\mu_i)\, (I - P_i)' \Big( diag\Big( \frac{1}{\mu_i} \Big) - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) = (I - P_i)(I - P_i) = I - P_i \]

Since both $\Sigma_i$ and $X_i$ are symmetric:

\[ X_i \Sigma_i = I - P_i' \tag{C.2.3} \]

So in order to show that $\Sigma_i^{-} = X_i$, there only remains to show that $X_i$ satisfies (C.1.12) and (C.1.13). For (C.1.12):

\[ X_i \Sigma_i X_i = X_i (I - P_i) = X_i - h_i^{-1} \Big( \frac{1}{\sum_{t=1}^T \mu_{it}} J - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) = X_i \]

For (C.1.13):

\[ \Sigma_i X_i \Sigma_i = (I - P_i) \Sigma_i = \Sigma_i - P_i (I - P_i) h_i \, diag(\mu_i) (I - P_i)' = \Sigma_i \]

Therefore we have shown that in this case:

\[ \Sigma_i^{-} = h_i^{-1} \Big( diag\Big( \frac{1}{\mu_i} \Big) - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) \tag{C.2.4} \]

Therefore:

\[ D_i' \Sigma_i^{-} = \frac{h_i}{h_i} \partial\mu_i' \Big( diag\Big( \frac{1}{\mu_i} \Big) - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) = \Big( \frac{\partial\mu_i}{\mu_i} \Big)' - \frac{\sum_{t=1}^T \partial\mu_{it}}{\sum_{t=1}^T \mu_{it}} j' \]

where $j = [1, \dots, 1]'$ and, by an abuse of notation, $(\frac{\partial\mu_i}{\mu_i}) = [\frac{\partial\mu_{i1}}{\mu_{i1}}, \dots, \frac{\partial\mu_{iT}}{\mu_{iT}}]$.
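The closed form (C.2.4) can be verified numerically against the general form $B_i$ from Appendix C.1: with $\Sigma_{y,i} = h_i\,diag(\mu_i) + v_i \mu_i \mu_i'$, the two expressions coincide, as the uniqueness argument above implies. A short Python check with illustrative (assumed) values of $h_i$, $v_i$ and $\mu_i$:

```python
import numpy as np

T, h, v = 5, 1.3, 0.7                       # illustrative h_i = E(c|x), v_i = Var(c|x)
rng = np.random.default_rng(5)
mu = rng.uniform(0.5, 2.0, size=T)          # mu_i(beta_0) for one unit

# general form B_i from Appendix C.1, built from Sigma_{y,i} = h diag(mu) + v mu mu'
Sy = h * np.diag(mu) + v * np.outer(mu, mu)
Syi = np.linalg.inv(Sy)
B = Syi - Syi @ np.outer(mu, mu) @ Syi / (mu @ Syi @ mu)

# closed form (C.2.4): Sigma_i^- = h^{-1} (diag(1/mu) - J / sum(mu)), J a matrix of ones
X = (np.diag(1.0 / mu) - np.ones((T, T)) / mu.sum()) / h

print(np.allclose(B, X))  # True
```

The agreement holds for any $v_i$, which is the algebraic reason the PFE instruments do not require knowledge of $Var(c_i | x_i)$ under (3.2.9) and (3.2.10).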
Note that:

\[ \frac{\partial p_{it}}{\partial \beta} = \frac{\partial\mu_{it}}{\sum_{s=1}^T \mu_{is}} - \mu_{it} \frac{\sum_{s=1}^T \partial\mu_{is}}{(\sum_{s=1}^T \mu_{is})^2} \tag{C.2.5} \]

so that:

\[ \frac{1}{p_{it}} \frac{\partial p_{it}}{\partial \beta} = \frac{\partial\mu_{it}}{\mu_{it}} - \frac{\sum_{s=1}^T \partial\mu_{is}}{\sum_{s=1}^T \mu_{is}} \tag{C.2.6} \]

Therefore:

\[ \Big( \frac{\partial p_i(\beta_0)}{\partial \beta} \Big)' W_i(\beta_0)^{-1} = \Big( \frac{\partial\mu_i}{\mu_i} \Big)' - \frac{\sum_{t=1}^T \partial\mu_{it}}{\sum_{t=1}^T \mu_{it}} j' \tag{C.2.7} \]

Hence we have shown that under (3.2.9) and (3.2.10):

\[ D_i' \Sigma_i^{-} = \Big( \frac{\partial p_i(\beta_0)}{\partial \beta} \Big)' W_i(\beta_0)^{-1} \tag{C.2.8} \]

C.3 Consistent Estimation of θ and γ

Under (3.2.1), (3.2.11) and (3.2.12):

\[ E(y_{it}^2 | c_i, x_i) = c_i \mu_{it} + (\theta + 1) c_i^2 \mu_{it}^2 \]
\[ E(y_{it} y_{it-s} | c_i, x_i) = (\gamma + 1) c_i^2 \mu_{it} \mu_{it-s} \quad \text{for } s = 1 \]
\[ E(y_{it} y_{it-s} | c_i, x_i) = c_i^2 \mu_{it} \mu_{it-s} \quad \text{for } s > 1 \]

Therefore:

\[ \theta = \frac{E\big( \frac{y_{it}^2 - y_{it}}{\mu_{it}^2} \big)}{E\big( \frac{y_{it} y_{it-2}}{\mu_{it} \mu_{it-2}} \big)} - 1 \tag{C.3.1} \]

and:

\[ \gamma = \frac{E\big( \frac{y_{it} y_{it-1}}{\mu_{it} \mu_{it-1}} \big)}{E\big( \frac{y_{it} y_{it-2}}{\mu_{it} \mu_{it-2}} \big)} - 1 \tag{C.3.2} \]

Therefore a consistent estimator for $\theta$ under the assumptions (3.2.1), (3.2.11) and (3.2.12) is:

\[ \hat{\theta} = \frac{\frac{1}{nT} \sum_{i=1}^n \sum_{t=1}^T \frac{y_{it}^2 - y_{it}}{\ddot{\mu}_{it}^2}}{\frac{1}{n(T-2)} \sum_{i=1}^n \sum_{t=3}^T \frac{y_{it} y_{it-2}}{\ddot{\mu}_{it} \ddot{\mu}_{it-2}}} - 1 \tag{C.3.3} \]

A consistent estimator of $\gamma$ is:

\[ \hat{\gamma} = \frac{\frac{1}{n(T-1)} \sum_{i=1}^n \sum_{t=2}^T \frac{y_{it} y_{it-1}}{\ddot{\mu}_{it} \ddot{\mu}_{it-1}}}{\frac{1}{n(T-2)} \sum_{i=1}^n \sum_{t=3}^T \frac{y_{it} y_{it-2}}{\ddot{\mu}_{it} \ddot{\mu}_{it-2}}} - 1 \tag{C.3.4} \]

C.4 Asymptotic equivalence of Poisson fixed effects and our alternative estimator when θ = 0, γ = 0

When (3.2.9) and (3.2.10) hold, so that (3.2.11) and (3.2.12) hold with $\theta = 0$ and $\gamma = 0$, $\hat{\theta} \xrightarrow{p} 0$ and $\hat{\gamma} \xrightarrow{p} 0$ independently of whether (3.2.15) and (3.2.16) hold. Therefore:

\[ \hat{D}_i \xrightarrow{p} h_1(x_i, \eta_p) \Big( \sum_{t=1}^T \mu_{it} \Big) \Big[ \frac{\partial p_{it}}{\partial \beta} \Big]_{t=1,\dots,T} \]
\[ \hat{\Sigma}_i \xrightarrow{p} [1(t = s) - p_{it}]_{t=1,\dots,T}^{s=1,\dots,T} \big[ 1[t = s] h_1(x_i, \eta_p) \mu_{it} + h_2(x_i, \eta_p) \mu_{it} \mu_{is} \big]_{t=1,\dots,T}^{s=1,\dots,T} \big( [1(t = s) - p_{it}]_{t=1,\dots,T}^{s=1,\dots,T} \big)' \]

where $\eta_p = \operatorname{plim}(\hat{\eta})$. The derivation of Appendix C.2, replacing $E(c_i | x_i)$ by $h_1(x_i, \eta_p)$ and $Var(c_i | x_i)$ by $h_2(x_i, \eta_p)$, shows that:

\[ \operatorname{plim}(\hat{D}_i)' \operatorname{plim}(\hat{\Sigma}_i)^{-} = \Big( \frac{\partial p_i(\beta_0)}{\partial \beta} \Big)' W_i(\beta_0)^{-1} = D_i' \Sigma_i^{-} \]

Therefore when (3.2.9) and (3.2.10) hold, $\hat{\beta}_{PFE}$ and $\hat{\beta}_{alt}$ are asymptotically equivalent.

BIBLIOGRAPHY

Ahn, S. C. and Schmidt, P. (1995). Efficient estimation of models for dynamic panel data. Journal of Econometrics, 68(1):5–27.
Alvarez, J. and Arellano, M. (2003). The time series and cross-section asymptotics of dynamic panel data estimators. Econometrica, 71(4):1121–1159.

Alvarez, J. and Arellano, M. (2004). Robust likelihood estimation of dynamic panel data models. CEMFI Working Paper 0421.

Anderson, T. W. and Hsiao, C. (1981). Estimation of dynamic models with error components. Journal of the American Statistical Association, 76(375):598–606.

Andrabi, T., Das, J., Ijaz Khwaja, A., and Zajonc, T. (2011). Do value-added estimates add value? Accounting for learning dynamics. American Economic Journal: Applied Economics, 3(3):29–54.

Arellano, M. (2003). Modelling optimal instrumental variables for dynamic panel data models. CEMFI Working Paper.

Arellano, M. and Bond, S. (1991). Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. The Review of Economic Studies, 58(2):277–297.

Arellano, M. and Bover, O. (1995). Another look at the instrumental variable estimation of error-components models. Journal of Econometrics, 68(1):29–51.

Balasubramanian, N. and Sivadasan, J. (2010). What happens when firms patent? New evidence from U.S. economic census data. Review of Economics and Statistics, 93(1):126–146.

Baltagi, B. H., Fingleton, B., and Pirotte, A. (2014). Estimating and forecasting with a dynamic spatial panel data model. Oxford Bulletin of Economics and Statistics, 76(1):112–138.

Bester, A. C., Conley, T. G., and Hansen, C. B. (2011a). Inference with dependent data using cluster covariance estimators. Journal of Econometrics, 165(2):137–151.

Bester, A. C., Conley, T. G., Hansen, C. B., and Vogelsang, T. J. (2011b). Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators. Working Paper.

Blundell, R. and Bond, S. (1998). Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics, 87(1):115–143.

Blundell, R., Griffith, R., and Van Reenen, J. (1995).
Dynamic count data models of technological innovation. The Economic Journal, 105(429):333–344. Blundell, R., Griffith, R., and Windmeijer, F. (2002). Individual effects and dynamics in count data models. Journal of Econometrics, 108(1):113–131. Bond, S. R. (2002). Dynamic panel data models: a guide to micro data methods and practice. Portuguese Economic Journal, 1(2):141–162. Browning, M., Ejraes, M., and Alvarez, J. (2010). Modelling income processes with lots of heterogeneity. The Review of Economic Studies, 77(4):1353–1381. 92 Chamberlain, G. (1982). Multivariate regression models for panel data. Journal of Econometrics, 18(1):5– 46. Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334. Chamberlain, G. (1992a). Comment: Sequential moment restrictions in panel data. Journal of Business & Economic Statistics, 10(1):20–26. Chamberlain, G. (1992b). Efficiency bounds for semiparametric regression. Econometrica, 60(3):567–596. Cizek, P., Jacobs, J. P., Ligthart, J. E., and Vrijburg, H. (2011). GMM estimation of fixed effects dynamic panel data models with spatial lag and spatial errors. Discussion Paper 2011-134, Tilburg University, Center for Economic Research. Clerides, S. K., Lach, S., and Tybout, J. R. (1998). Is learning by exporting important? micro-dynamic evidence from colombia, mexico, and morocco. The Quarterly Journal of Economics, 113(3):903–947. Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1):1– 45. de Brauw, A. and Giles, J. (2008). Migrant labor markets and the welfare of rural households in the developing world: Evidence from china. 2008 Annual Meeting, July 27-29, 2008, Orlando, Florida 6085, American Agricultural Economics Association. Donald, S. G., Imbens, G. W., and Newey, W. K. (2009). Choosing instrumental variables in conditional moment restriction models. Journal of Econometrics, 152(1):28–36. Elhorst, P. J. 
(2005). Unconditional maximum likelihood estimation of linear and log-linear dynamic models for spatial panels. Geographical Analysis, 37(1):85–106. Hahn, J. (1997). Efficient estimation of panel data models with sequential moment restrictions. Journal of Econometrics, 79(1):1–21. Hausman, J., Hall, B. H., and Griliches, Z. (1984). Econometric models for count data with an application to the patents-r & d relationship. Econometrica, 52(4):909–938. Hsiao, C., Pesaran, H. M., and Tahmiscioglu, K. A. (2002). Maximum likelihood estimation of fixed effects dynamic panel data models covering short time periods. Journal of Econometrics, 109(1):107–150. Jenish, N. and Prucha, I. R. (2009). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics, 150(1):86–98. Jenish, N. and Prucha, I. R. (2012). On spatial processes and asymptotic inference under near-epoch dependence. Journal of Econometrics, 170(1):178–190. Kim, M. S. and Sun, Y. (2011). Spatial heteroskedasticity and autocorrelation consistent estimation of covariance matrix. Journal of Econometrics, 160(2):349–371. Kitazawa, Y. (2007). Some additional moment conditions for a dynamic count panel data model. Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46(1):69–85. 93 Mutl, J. (2006). Dynamic panel data models with spatially correlated disturbances. University of Maryland Theses and Dissertations. Newey, W. K. and McFadden, D. (1994). Chapter 36 large sample estimation and hypothesis testing. In Robert F. Engle and Daniel L. McFadden, editor, Handbook of Econometrics, volume Volume 4, pages 2111–2245. Elsevier. Newey, W. K. and Windmeijer, F. (2009). Generalized method of moments with many weak moment conditions. Econometrica, 77(3):687–719. Penrose, R. (1955). A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51(03):406–413. Su, L. and Yang, Z. (2013). 
QML estimation of dynamic panel data models with spatial errors. Research Collection School of Economics (Open Access). Todd, P. E. and Wolpin, K. I. (2003). On the specification and estimation of the production function for cognitive achievement. The Economic Journal, 113(485):F3–F33. Topalova, P. and Khandelwal, A. (2010). Trade liberalization and firm productivity: The case of india. Review of Economics and Statistics, 93(3):995–1009. White, H. (2001). Asymptotic theory for econometricians. Academic Press, San Diego. Windmeijer, F. (2000). Moment conditions for fixed effects count data models with endogenous regressors. Economics Letters, 68(1):21–24. Windmeijer, F. (2005). A finite sample correction for the variance of linear efficient two-step GMM estimators. Journal of Econometrics, 126(1):25–51. Windmeijer, F. (2008). GMM for panel data count models. In Matyas, L. and Sevestre, P., editors, The Econometrics of Panel Data, number 46 in Advanced Studies in Theoretical and Applied Econometrics, pages 603–624. Springer Berlin Heidelberg. Wooldridge, J. M. (1997). Multiplicative panel data models without the strict exogeneity assumption. Econometric Theory, 13(5):667–678. Wooldridge, J. M. (1999). Distribution-free estimation of some nonlinear panel data models. Journal of Econometrics, 90(1):77–97. Wooldridge, J. M. (2005). Simple solutions to the initial conditions problem in dynamic, nonlinear panel data models with unobserved heterogeneity. Journal of Applied Econometrics, 20(1):39–54. Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. The MIT Press, second edition edition. 94