ESSAYS IN SPATIAL PANEL DATA ECONOMETRICS

By

Steven Wu-Chaves

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics—Doctor of Philosophy

2024

ABSTRACT

Chapter 1: Robust inference in short linear panels with fixed effects with endogenous covariates in a spatial setting

In this chapter, I propose a simple way to obtain robust standard errors in linear panels in a spatial context with endogenous covariates, where the number of time periods is small relative to the cross-sectional dimension. The method is based on applying a Spatial HAC to an average of moment conditions across time to obtain a covariance estimator that is robust to both spatial and serial correlation (HACSC). I also present a control function (CF) alternative for estimating the parameters and extend the HACSC estimator to this case, where the standard errors require an adjustment to account for the sampling variability induced by the first-stage estimation. In addition, I derive the Fixed Effects-Random Effects equivalence under a Correlated Random Effects framework in the presence of a spatial lag of the dependent variable to obtain a fully robust Hausman-type test using the HACSC estimator. I run a Monte Carlo experiment and show that the HACSC estimator is robust to strong patterns of serial and spatial correlation. I also find that whenever the CF assumptions hold, the CF approach is more efficient than Two-Stage Least Squares. Finally, I estimate the effect of school district spending on the performance of fourth-grade students in Michigan, allowing for spillovers across districts. I find that the expenditure of neighboring districts has a positive and non-negligible impact on test passing rates.
Chapter 2: Estimation of models with spatial panels and missing observations in the covariates

Missing data problems are more serious in spatial models with spillover effects, as the efficiency loss induced by estimators that use only the complete cases is larger. In this paper, I present a GMM estimator that uses the information in both the complete and incomplete observations for models with spatial spillover effects and missing data on the potentially endogenous variables, with the aim of obtaining efficiency gains. I also derive the Fixed Effects and Random Effects equivalence for spatial panels with missing data and develop an alternative GMM estimator in this Correlated Random Effects framework. The Monte Carlo simulations show significant efficiency gains of the GMM estimator compared to estimators that only use the complete cases.

Chapter 3: Estimation of models with multiple fixed effects and endogenous variables: a correlated random effects approach

The inclusion of multiple individual heterogeneities and time effects, more commonly referred to as "fixed effects," is common practice in panel data. A common approach to deal with these is to estimate the model using the fixed effects estimator by applying the within transformation, which has the disadvantage of removing all variables that are constant across one of the dimensions of the data set. An alternative method is the correlated random effects approach using the Mundlak device, which restricts the dependence between the heterogeneities and the covariates in a particular way. In this paper, I show that the fixed effects estimates can be recovered using the Mundlak approach in models with three sets of heterogeneities and in the presence of endogenous variables. Furthermore, I prove that this equivalence can be obtained using two different sets of covariates.

Copyright by STEVEN WU-CHAVES 2024

To my parents.
ACKNOWLEDGEMENTS

First, I am deeply thankful to my parents for their continued support. I am also grateful to Lucia, without whose encouragement and help I would not have been able to complete this journey. To the rest of my family and furry friends, thank you for the love and support that helped me in every step of this process.

Second, I would like to extend a sincere thank you to my committee chair, Jeffrey Wooldridge, for all his advice, guidance and support. I also want to thank Kyoo il Kim, Tim Vogelsang and Guo Chen for serving on my committee and providing excellent feedback and support. I also want to recognize Lisa Cook, Richard Baillie, Susan Zhu, Carl Davidson, Leslie Papke, Hugo Freeman, Soren Anderson, Scott Imberman, Todd Elder and Steven Haider for their help and the opportunities they provided me during the program. I am also grateful to Jay Feight and Lori Jean Nichols for their assistance as I navigated the program.

Finally, there are several classmates I also want to thank. First and foremost, to Minkyu Kim for his friendship, help and support over these years. To Salem Rogers, Raghav Rakesh, Chia-Hung Kuo, Benjamin Miller and Soo Jeong Lee, thank you for all the moments and conversations we shared together.

TABLE OF CONTENTS

CHAPTER 1 ROBUST INFERENCE IN SHORT LINEAR PANELS WITH FIXED EFFECTS WITH ENDOGENOUS COVARIATES IN A SPATIAL SETTING . . . . 1

CHAPTER 2 ESTIMATION OF MODELS WITH SPATIAL PANELS AND MISSING OBSERVATIONS IN THE COVARIATES . . . . 45

CHAPTER 3 ESTIMATION OF MODELS WITH MULTIPLE FIXED EFFECTS AND ENDOGENOUS VARIABLES: A CORRELATED RANDOM EFFECTS APPROACH . . . . 70

BIBLIOGRAPHY . . . . 81

APPENDIX A ADDITIONAL ASSUMPTIONS AND DEFINITIONS FOR CHAPTER 1 . . . .
87

APPENDIX B PROOFS FOR CHAPTER 1 . . . . 90

APPENDIX C DERIVATION OF THE COVARIANCE MATRIX FOR THE CONTROL FUNCTION APPROACH . . . . 104

APPENDIX D TABLES FOR CHAPTER 1 . . . . 108

APPENDIX E FIGURES FOR CHAPTER 1 . . . . 111

APPENDIX F PROOFS FOR CHAPTER 2 . . . . 112

APPENDIX G TABLES FOR CHAPTER 2 . . . . 114

APPENDIX H FIGURES FOR CHAPTER 2 . . . . 117

APPENDIX I PROOFS FOR CHAPTER 3 . . . . 119

CHAPTER 1

ROBUST INFERENCE IN SHORT LINEAR PANELS WITH FIXED EFFECTS WITH ENDOGENOUS COVARIATES IN A SPATIAL SETTING

1.1 Introduction

The assumption of independent data is widespread in empirical economics since it simplifies many estimation methods. However, in fields such as international trade, urban economics, public policy, and network analysis, this assumption might not hold, since the outcome variable of one unit may be affected by other units' actions, which leads to (spatially) dependent data. Furthermore, many of the tools used to develop the asymptotic theory behind popular econometric methods, such as the Central Limit Theorem and the Law of Large Numbers, often rely on independent and identically distributed (i.i.d.) data. This facilitates both estimation and inference, but if the assumption is violated, inference becomes more difficult even if the parameters are estimated consistently. Additionally, the growing availability of data sets has increased the popularity of panel methods in recent years, as they make it possible to incorporate time effects and to estimate richer models.
Nevertheless, panel methods also introduce complications because the presence of unobserved heterogeneity can generate inconsistency in both the parameters and the standard errors if it is not properly handled. When combining spatially dependent observations and panel data, inference becomes more challenging since the error term can be both serially and spatially correlated. To address the spatial correlation, the literature has usually resorted to assuming and modeling a particular structure for the error term, as was once common with time series data. However, since the seminal work of White (1980), the common practice in time series is to use standard errors that are robust to general forms of heteroskedasticity and autocorrelation (HAC). This procedure has been extended to the spatial framework (SHAC) by Conley (1999) and Kelejian and Prucha (2007) in a cross-sectional setting. However, to the best of my knowledge and surprisingly enough, it has not been extended to the panel case where the time dimension is fixed and the number of units of observation goes to infinity, even in the linear case.¹ Admittedly, there are many cases in which the time dimension is also large; however, there are also instances where the number of observations across time is considerably smaller than the cross-sectional dimension. Ignoring the serial correlation in that case can still bias the standard errors, even if the associated covariance matrix is robust to spatial correlation. Indeed, some of the estimators that have been proposed in the literature and implemented in software packages make the standard errors robust to only one of these dimensions. For example, in Stata, a very popular statistical analysis package, one of the few routines for panel data in a spatial context corrects the standard errors for spatial correlation but assumes serial independence of the error terms.
The main purpose of this paper is to propose a simple way to obtain standard errors in a linear panel that are robust to heteroskedasticity and to both spatial and serial correlation (HACSC), without imposing any structure on the time dimension, in a Fixed Effects framework with endogenous covariates. I also extend this procedure to the control function approach, where the computation of standard errors is more difficult due to the presence of a generated regressor.

HAC estimators have been used extensively in the time series literature since they avoid having to model the error term structurally, which can lead to inconsistency if that process is misspecified. Newey and West (1987) were the first to extend White's estimator to allow for general forms of heteroskedasticity and autocorrelation. In the panel case, Arellano (1987) introduced panel clustered standard errors, which are robust to heteroskedasticity and autocorrelation but require observations in different clusters to be uncorrelated.

In spatial panels, multiple authors have made important contributions to the field, extending many of the methods developed in the time series literature. For example, Driscoll and Kraay (1998) showed how to deal with spatially dependent panel data in a GMM context by averaging the moment conditions over the cross-sectional dimension, indexed by 𝑁. Their approach relies on holding 𝑁 fixed and letting the time dimension 𝑇 → ∞. Vogelsang (2012) develops asymptotic theory for linear spatial panels with fixed effects in a fixed-b framework by averaging HAC estimators and by computing the HAC for averages as in Driscoll and Kraay (1998). In this case, the asymptotics again rely on 𝑇 → ∞, allowing 𝑁 to remain fixed or to grow.

¹ Perhaps one of the reasons is that econometricians assume that it is obvious what to do, but many methods make strong assumptions in the time dimension, like serial independence.
In a similar context, Kim and Sun (2013) proposed a bivariate-kernel HACSC estimator, which requires both the cross-sectional and time dimensions to go to infinity. Bester et al. (2011) suggested a cluster covariance matrix that is applicable when the data are dependent, in the context of time series, spatial and panel data. More recently, Müller and Watson (2022a) introduced a new methodology to construct confidence intervals based on population principal components, with the property that the resulting interval has a coverage probability of 95% for a set of spatial patterns in a cross-sectional setting. Müller and Watson (2022b) extended this framework to spatial panels to cover estimation techniques like difference-in-differences setups.

At the cross-sectional level, Conley (1999) was the first to develop a Spatial HAC (SHAC) estimator in a GMM context. His approach is based on the assumption that the data generating process is spatially stationary. When working with dependent data and allowing 𝑁 → ∞, it is common to assume some sort of weak dependence mechanism, analogous to the time series literature, so that the influence of one observation on other units diminishes as the distance between them increases. In this case, Conley assumes that the data are spatially 𝛼-mixing. Bester et al. (2016) provide a fixed-b analysis of Conley's SHAC estimator. Kelejian and Prucha (2007) relax the spatial stationarity assumption and model the spatial dependence in terms of a weighting matrix, arguing that having different numbers of neighbors, as is common in empirical work, violates that assumption. In this respect, the notion of assigning weights to different units based on their distance to a particular point has been used in many fields. For example, in urban economics, McMillen (1996) used locally weighted regressions to estimate the value of land in Chicago, where each observation is given a specific weight based on its distance to the central business district.
In the same spirit, the geographically weighted regression in the geography literature uses a very similar concept to model the idea that there might be spatial variability in models involving geo-referenced data (Wheeler & Tiefelsdorf, 2005). It is important to note that Kelejian and Prucha's SHAC estimator is based on consistent estimates of the error terms, but they do not provide any parameter estimation framework. Kim and Sun (2011) generalize this estimator to allow for general linear and nonlinear models using moment conditions. Conley and Molinari (2007) performed a Monte Carlo study comparing the performance of multiple covariance estimators with dependent data in the context of locations measured with error, and they concluded that nonparametric estimators work better than parametric ones such as GMM and maximum likelihood estimators.

In this paper, I follow Driscoll and Kraay's approach, but instead of averaging the moment conditions over the cross-sectional dimension, I average the moment conditions over time, construct a GMM estimator, and then apply Kelejian and Prucha's SHAC to the corresponding residuals. By doing this, I avoid imposing any assumptions on the serial correlation and hence construct a covariance estimator that is robust to both serial and spatial correlation.

Beyond testing the statistical significance of the effect of a covariate on the response variable, robust inference is also important when trying to choose the correct specification of a model. More specifically, the correlated random effects (CRE) approach has been very popular in recent years because it is a simple way to test between Random Effects (RE) and Fixed Effects (FE) specifications and it allows the inclusion of time-constant variables, as noted by Joshi and Wooldridge (2019).
Furthermore, we can obtain the FE coefficients of the time-varying variables by including their time averages on the right-hand side of the equation in a Pooled OLS or RE regression, a result attributed to Mundlak (1978). Debarsy (2012) was the first to extend the Mundlak approach to the spatial setting. More recently, Li and Yang (2020) showed that when the model includes a structurally modeled error term (which involves maximum likelihood estimation), the equivalence holds conditional on the parameter associated with the error term; however, the equivalence breaks unconditionally, i.e., when this parameter has to be estimated jointly with the rest of the parameters. In this paper, I show that the result holds in a specific setting, namely, when the model does not include a structurally modeled error term.

One of the additional advantages of not imposing a particular spatial structure on the error term is that some estimation methods become readily available, such as Two-Stage Least Squares (2SLS) or a Control Function (CF) approach (Blundell & Powell, 2003), whenever the researcher suspects an endogenous variable is in the model. In fact, adding a spatial lag of the response variable as a covariate yields the spatial autoregressive model (SAR), a very popular model in this literature. However, Kelejian and Prucha (1998) showed that this term induces an endogeneity problem, which is why the researcher has to resort to an Instrumental Variables (IV) procedure. In terms of the estimation of parameters, both 2SLS and the CF approach require the availability of instruments; however, one important difference is that the latter imposes additional assumptions and is therefore less robust than 2SLS. On the other hand, if the assumptions hold, the CF allows one to deal with the endogeneity in a more parsimonious way when multiple functions² of the endogenous variable appear on the right-hand side of the equation, and it is probably more efficient (Wooldridge, 2010).
Note that this parsimony is relevant in the spatial case since it is common to include spillover effects in the models, and therefore the likelihood of having multiple functions of a variable increases in this context. In a spatial setup, Basile (2009) and Basile et al. (2014) extended the CF to additive nonparametric models. In terms of inference, Basile et al. (2014) recommend using the bootstrap to obtain confidence intervals, a practice that is common even in the i.i.d. case. However, as pointed out by Künsch (1989), the independence assumption plays a critical role in the validity of the bootstrap, so besides the computational cost, in a spatial context this is not a trivial procedure due to the dependence between observations. Intuitively, if we just randomly resample the data in a time series setting at each bootstrap repetition, the serial correlation structure would be lost, and a similar issue occurs in the spatial case. This is why different bootstrap methods have been proposed in the time series literature (see Politis and White (2004) for a brief overview); nevertheless, their extension to the spatial case is not straightforward due to the absence of a natural ordering of the observations. Given this, it might be desirable to have a closed-form formula for the covariance matrix when the empirical researcher works with parametric linear models with panel data in a spatial context. This paper tries to fill this gap in the literature by adjusting the HACSC estimator to the CF setting. The adjustment is necessary because, in addition to dealing with the spatial and serial correlation, it is necessary to take into account the sampling error induced by the first-stage estimation.

² A well-known result in the literature is that 2SLS and the CF give the same numerical coefficients if only one function of the endogenous variable is in the model. This carries over to the spatial case under the settings outlined at the beginning of the paragraph.
The rest of the paper is organized as follows. Section 1.2 discusses the model and the assumptions used to obtain the estimator of the covariance matrix. Section 1.3 presents the HACSC estimator and its asymptotic properties. Section 1.4 derives the FE and RE equivalence using the correlated random effects approach in a spatial context. Section 1.5 presents an additional application of the HACSC estimator in a Feasible GLS context. Section 1.6 presents the control function approach and a discussion of the additional assumptions imposed in this context. Section 1.7 contains a set of Monte Carlo experiments, and Section 1.8 shows an empirical application of the HACSC estimator using data from the Michigan education system. Section 1.9 concludes.

1.2 Model

1.2.1 Estimation of the parameters

Consider the following model:³

$y_{it} = x_{1it}\beta_1 + x_{2it}\beta_2 + W_i X_{1t}\gamma_1 + W_i X_{2t}\gamma_2 + \lambda W_i y_t + c_i + u_{it} = x_{it}\beta + W_i X_t\gamma + \lambda W_i y_t + c_i + u_{it}, \quad i = 1,\ldots,N, \; t = 1,\ldots,T \qquad (1.1)$

where $y_{it}$ is the dependent variable, $x_{1it}$ is a $1 \times (k_1+1)$ vector of exogenous explanatory variables (including an intercept), and $x_{2it}$ is a $1 \times k_2$ vector of endogenous variables. The sense in which $x_{1it}$ is exogenous will be clarified below. $W_i$ is the $i$-th row of the $N \times N$ time-invariant weighting matrix $W$, whose diagonal elements are zero; $X_{1t}$ and $X_{2t}$ are the $N \times k_1$ and $N \times k_2$ matrices of exogenous and endogenous covariates, respectively, for all observations at time $t$; $y_t$ is the vector of dependent variables at time $t$; $c_i$ is the individual heterogeneity; and $u_{it}$ is the idiosyncratic error. Hence $\beta$, $\gamma$ and $\lambda$ are the parameters of interest, of dimension $(k+1) \times 1$, $k \times 1$ and $1 \times 1$, respectively. Throughout the rest of the paper, I assume that $N \to \infty$ while $T$ remains fixed.

³ The model includes a spatial lag of the dependent variable on the right-hand side for the sake of generality and because this is a widespread practice in the spatial literature. Nevertheless, it is important to emphasize that its inclusion precludes the interpretation of (1.1) as a conditional mean function and also complicates the interpretation of the coefficients. As such, in some sections of the paper this variable will be omitted.

We assume that there exists a set of instruments $z_{2it}$ for $x_{2it}$ of dimension $l \ge k_2$ (so that $W_i Z_{2t}$ are the instruments for $W_i X_{2t}$). As previously shown by Kelejian and Prucha (1998), the inclusion of a spatial lag of the dependent variable on the right-hand side also induces an endogeneity issue for which we need instruments. Kelejian et al. (2004) and Lee (2003) determined that the optimal set of instruments for this variable is a sequence of the form $W^j X_t$, for $j = 1,\ldots,s$, $s \in \mathbb{N}$ (in this case, we would only include higher-power spatial lags of $X_{1t}$). If we let $w_{rit} \equiv W_i X_{rt}$, $r = 1, 2$, and $\mathfrak{Z}_{2it} \equiv W_i Z_{2t}$, $A_{it} \equiv (x_{1it}\; x_{2it}\; w_{1it}\; w_{2it}\; W_i y_t)$ and $\theta \equiv (\beta_1'\; \beta_2'\; \gamma_1'\; \gamma_2'\; \lambda)'$, then the model can be written more compactly as:

$y_{it} = A_{it}\theta + c_i + u_{it} \qquad (1.2)$

Since we are not assuming a particular structure for the error term, we can estimate the parameters of (1.2) with the Fixed Effects 2SLS estimator. To do so, we apply the within transformation to all the variables: let $\ddot{y}_{it} = y_{it} - \bar{y}_i$, where $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}$, and similarly for the independent variables and the instruments. Then we can apply Pooled 2SLS to the transformed model

$\ddot{y}_{it} = \ddot{A}_{it}\theta + \ddot{u}_{it} \qquad (1.3)$

using the instruments $\ddot{Z}_{it} = (\ddot{x}_{1it}\; \ddot{w}_{1it}\; \ddot{z}_{2it}\; \ddot{\mathfrak{Z}}_{2it}\; \ddot{w}^2_{1it}\; \ddot{w}^3_{1it}\, \ldots\, \ddot{w}^s_{1it})$, where $\ddot{w}^j_{1it}$ denotes the within-transformed $W_i^j X_{1t}$. Note that all the individual unobserved effects have been removed. To obtain consistent parameters, we need the following orthogonality condition:

$E(\ddot{Z}_{it}' \ddot{u}_{it}) = E[g_{it}(Z_{it}, \theta)] = 0, \quad t = 1, \ldots,$
$T \qquad (1.4)$

which is implied by the stronger strict exogeneity condition:

$E(u_{it} \mid Z) = E(u_{it} \mid Z, W) = 0$

where $Z$ is the $NT \times [(s+1)k_1 + 2l + 1]$ matrix of exogenous variables for all cross-sectional units and all time periods. We note that in this spatial setting, this condition is stronger than in the non-spatial case because here we condition the expected value of $u_{it}$ on the exogenous variables of all other units, not only on unit $i$'s (see Wooldridge (2010), p. 301 for more details). The function $g_{it}(Z_{it}, \theta)$ is of dimension $(s+1)k_1 + 2l + 1 = r$; hence, for each $i$ there are $T \times r$ moment conditions. Under this framework, we could use many more moment conditions, because the strict exogeneity assumption implies orthogonality conditions for each pair of time periods and cross-sectional units [i.e., $E(\ddot{Z}_{it}' \ddot{u}_{js}) = 0$, $i, j = 1,\ldots,N$ and $t, s = 1,\ldots,T$]; however, we will only use the conditions implied by the FE estimator. Using a similar idea as Driscoll and Kraay (1998), for each observation $i$ we can average these moment conditions over time:⁴

$g_i(Z_i, \theta) = \frac{1}{T} \sum_{t=1}^{T} g_{it}(Z_{it}, \theta) \qquad (1.5)$

From this, one can construct a GMM estimator, defined as follows:

$\hat{\theta} = \arg\min_{\theta \in \Theta} \left[ \frac{1}{N} \sum_{i=1}^{N} g_i(Z_i, \theta) \right]' \hat{\Omega} \left[ \frac{1}{N} \sum_{i=1}^{N} g_i(Z_i, \theta) \right] \qquad (1.6)$

where $\hat{\Omega}$ is an $r \times r$ positive definite, symmetric weighting matrix. Admittedly, as noted above, we could estimate $\theta$ by running Pooled 2SLS on (1.3); however, the GMM framework allows for more generality. For instance, averaging the moment conditions over time for each observation can be done in setups other than fixed effects. Furthermore, this averaging might not be the most efficient approach, but obtaining the optimal GMM in a two-step procedure might provide some efficiency gains with respect to Pooled 2SLS.
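To make the two building blocks just described concrete, here is a minimal numpy sketch of the within transformation and of the time-averaged moments $g_i$ in (1.5). The function names and the stacked array layout (unit-major ordering) are my own illustrative choices, not the paper's notation.

```python
import numpy as np

def within_transform(v, n, t):
    """Within transformation: v_it - (1/T) sum_t v_it for each unit i.

    v is an (n*t, k) array stacked so that rows i*t : i*t + t hold
    unit i's observations for periods 1..t.
    """
    v = np.asarray(v, dtype=float).reshape(n, t, -1)
    return (v - v.mean(axis=1, keepdims=True)).reshape(n * t, -1)

def time_averaged_moments(z, resid, n, t):
    """g_i = (1/T) sum_t z_it' * u_it: one r-vector per cross-sectional unit."""
    z = np.asarray(z, dtype=float).reshape(n, t, -1)       # (n, t, r)
    u = np.asarray(resid, dtype=float).reshape(n, t, 1)    # (n, t, 1)
    return (z * u).mean(axis=1)                            # (n, r)
```

For example, with $N = 2$, $T = 2$ and the series $(1, 3)$ and $(2, 6)$, the within transformation returns $(-1, 1)$ and $(-2, 2)$; the `time_averaged_moments` output is then the input to the GMM objective in (1.6).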
1.2.2 Assumptions

The consistency and asymptotic normality of this estimator can be obtained from a Uniform Law of Large Numbers (ULLN) and a Central Limit Theorem (CLT) derived by Jenish and Prucha (2009) for nonstationary random fields on a possibly uneven lattice. Before stating their assumptions, we need some definitions. Let $D \subset \mathbb{R}^d$, $d \ge 1$, be an uneven lattice, and let $\rho(i,j) = \max_{1 \le k \le d} |j_k - i_k|$ and $|i| = \max_{1 \le k \le d} |i_k|$, where $i_k$ denotes the $k$-th component of $i$, be a metric and a norm on $\mathbb{R}^d$, respectively. The minimum distance between two subsets $E, F$ of $D$ is defined as $\rho(E,F) = \inf\{\rho(i,j) : i \in E \text{ and } j \in F\}$, and let $|E|$ denote the cardinality of a subset $E \subseteq D$. Other definitions used throughout this section can be found in the Appendix.

⁴ Note, however, that Driscoll and Kraay's case is based on holding $N$ fixed and letting $T \to \infty$, and they average across $i$ for each $t$.

We now state the assumptions required to obtain the consistency and asymptotic normality of $\hat{\theta}$. We note that the $N$ subscript on the random fields and scalars in the assumptions explicitly indicates that the ULLN and CLT can accommodate triangular arrays, which are common in the spatial literature and particularly in Cliff-Ord-type models. However, for notational simplicity, it will be suppressed in many sections for the remainder of the paper.

Assumption 1 The lattice $D \subset \mathbb{R}^d$, $d \ge 1$, is infinite countable and there exists a distance $\rho_0$ such that $\rho(i,j) \ge \rho_0$ for all $i, j \in D$. Without loss of generality, suppose that $\rho_0 > 1$.

Assumption 1 provides the necessary structure on the lattice. Note that the existence of the distance is essential in order to obtain nonparametric estimators of the covariance matrix, and it is analogous to the time difference between observations in the time series literature. Furthermore, it is possible that the distance observed by the researcher between two observations $i$ and $j$, $\rho^*(i,j)$, is measured with error.
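The max-coordinate metric $\rho(i,j)$ just defined can be computed directly from the units' coordinates; a small numpy sketch (function name and array layout are my own):

```python
import numpy as np

def chebyshev_dist(locs):
    """Pairwise rho(i, j) = max_k |j_k - i_k| for locations in R^d.

    locs: (n, d) array of coordinates; returns an (n, n) distance matrix.
    """
    diff = locs[:, None, :] - locs[None, :, :]   # (n, n, d) coordinate gaps
    return np.abs(diff).max(axis=2)              # max over the d coordinates
```

For example, for locations $(0,0)$, $(1,3)$ and $(2,1)$, the pairwise distances are $3$, $2$ and $2$.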
Note that the existence and availability of this distance measure is not trivial, even in the leading case of a geographical region. As shown in Figure 1.1, there are instances in which using the linear distance between many pairs of points in a territory would not represent the real burden of traveling from one location to another (e.g., driving), while there are other cases in which this measure would be appropriate (e.g., pollution).

Figure 1.1 Points in an irregular geographic region.

Now we state conditions related to the $g_i(\cdot)$ functions and $Z_{i,N}$, where $Z_{i,N}$ represents an $\alpha$-mixing random field with $i \in D$. At this point, it is important to note that since we are working with panel data and use time averages for estimation purposes, the random field considered in the assumptions is the one constructed from the time averages for each observation.

Assumption 2 (Uniform $L_2$ integrability) There is an array of positive real constants $\{c_{i,N}\}$ such that

$\lim_{k \to \infty} \sup_N \sup_{i \in D_N} E\big[ |Z_{i,N}/c_{i,N}|^2 \, \mathbf{1}(|Z_{i,N}/c_{i,N}| > k) \big] = 0$

where $\mathbf{1}(\cdot)$ denotes an indicator function. Note that Assumption 2 allows for the possibility of asymptotically unbounded second moments; however, for the remainder of the paper we will focus on the case of bounded moments, in which case we can set $c_{i,N} = 1$ for all $i$. The next assumption puts some restrictions on the $\alpha$-mixing coefficients of the random field.

Assumption 3 ($\alpha$-mixing) Let $\bar{Q}^{(k)}_{i,N} := Q_{|Z_{i,N}/c_{i,N}| \mathbf{1}(|Z_{i,N}/c_{i,N}| > k)}$ denote the upper-tail quantile function of $|Z_{i,N}/c_{i,N}| \mathbf{1}(|Z_{i,N}/c_{i,N}| > k)$, and recall that $\alpha_{\mathrm{inv}}(u)$ is the inverse function of $\bar{\alpha}_{1,1}(m)$ as in the definition specified in the Appendix. The $\alpha$-mixing coefficients satisfy:

1. $\lim_{k \to \infty} \sup_N \sup_{i \in D_N} \int_0^1 \alpha^d_{\mathrm{inv}}(u) \big[ \bar{Q}^{(k)}_{i,N}(u) \big]^2 \, du = 0$.
Under Assumptions 2 and 3.2 with 𝑘 = ℎ = 1 and letting {𝐷 𝑁 } be a sequence of finite subsets of 𝐷 that satisfies Assumption 1 such that |𝐷 𝑁 | → ∞ as 𝑁 → ∞, a direct application of Theorem 3 in Nazgul and Prucha (2009) leads to the conclusion that 1 |𝐷 𝑁 | ∑︁ 𝑖∈𝐷 𝑍𝑖,𝑁 − E(𝑍𝑖,𝑁 ) 𝑝 → 0 Note that one could relax Assumption 2 to 𝐿1 uniform integrability for the theorem to hold, nevertheless, the below CLT requires 𝐿2 uniform integrability. In order to apply this pointwise WLLN to the 𝑔𝑖 (·, 𝜃) functions, we assume that these satisfy the regularity conditions specified in Assumption A.1 presented in the Appendix. Given the fact that any measurable function of an 𝛼-mixing process is 𝛼-mixing, the 𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃) also satisfy a pointwise WLLN, i.e. 1 |𝐷 𝑁 | ∑︁ 𝑖∈𝐷 𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃) − E[𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃)] 𝑝 → 0 (1.7) With this Weak Law of Large Numbers, in order for the above GMM estimator to be consistent, we need an Uniform LLN for which we need the additional regularity conditions on the 𝑔𝑖 (·, ·) functions stated in Assumption A.2. Under these assumptions, we have the following proposition, which is a special case of Theorem 2 in Nazgul and Prucha (2009). Proposition 1. Let {𝐷 𝑁 } be a sequence of finite subsets of 𝐷 that satisfies Assumption 1 such that (cid:205)𝑖∈𝐷 𝑁 space and consider a sequence of real valued functions {𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃) : 𝑖 ∈ 𝐷 𝑁 , 𝑁 ∈ N} satisfying |𝐷 𝑁 | → ∞ as 𝑁 → ∞ and let 𝑄 𝑁 (𝜃) = 1 |𝐷 𝑁 | 𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃). Suppose (Θ, 𝜈) is a compact metric Assumption 2 and that for all 𝜃 in Θ, these functions satisfy the WLLN in (1.7). Then |𝑄 𝑁 (𝜃) − E[𝑄 𝑁 (𝜃)] | 𝑝 → 0 sup 𝜃∈Θ 11 With these tools at hand, define the following functions: 𝑄 𝑁 (𝜃) ≡ (cid:35) ′ (cid:34) 𝑔𝑖 (𝑍𝑖, 𝜃) ˆΩ (cid:34) 1 𝑁 𝑁 ∑︁ 𝑖=1 (cid:35) 𝑔𝑖 (𝑍𝑖, 𝜃) 1 𝑁 𝑁 ∑︁ 𝑖=1 𝑄(𝜃0) ≡ E[𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃0)]′ Ω0 E[𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃0)] And suppose that ˆΩ 𝑝 → Ω0, where Ω0 is a positive definite matrix. 
Recalling that $E[g_i(Z_i, \theta)] = 0$ only when $\theta = \theta_0$, the true population value, the following proposition summarizes the conditions under which the GMM estimator is consistent:

Proposition 2. Suppose that all the conditions of Proposition 1 hold. Additionally, assume that (i) $g_i(Z_i, \cdot)$ is continuous for all $\theta \in \Theta$, (ii) $\hat{\Omega} \xrightarrow{p} \Omega_0$, an $r \times r$ positive definite matrix, and (iii) $\theta_0$ is the only vector for which the moment condition in (1.4) holds. Then $Q_N(\hat{\theta})$ converges uniformly to $Q(\theta_0)$ and $\hat{\theta} \xrightarrow{p} \theta_0$, the unique minimizer of $Q(\theta)$.

Note that since $\frac{1}{N} \sum_{i=1}^{N} g_i(Z_i, \theta)$ satisfies the ULLN of Proposition 1 and $\hat{\Omega} \xrightarrow{p} \Omega_0$, the proof of this proposition follows from Theorem 4.1.1 in Amemiya (1985). To obtain the asymptotic distribution of $\hat{\theta}$, we assume the following condition, which guarantees that the sum is not dominated by any single term.

Assumption 4 Define $\tilde{\sigma}^2_N = \mathrm{Var}(S_N)$ with $S_N = \sum_{i \in D_N} Z_{i,N}$. Then the following condition is satisfied:

$\liminf_{N \to \infty} |D_N|^{-1} \tilde{\sigma}^2_N > 0$

Under this assumption, Theorem 1 in Jenish and Prucha (2009) ensures the asymptotic normality of the random variables $Z_i$.

Proposition 3. Let $\{D_N\}$ be a sequence of finite subsets of $D$ satisfying Assumption 1 such that $|D_N| \to \infty$ as $N \to \infty$, and let $\{Z_i : i \in D_N, N \in \mathbb{N}\}$ be a sequence of zero-mean real-valued random variables satisfying Assumptions 2 and 4. Furthermore, assume that the random field is $\alpha$-mixing, satisfying Assumption 3. Then,

$\tilde{\sigma}^{-1}_N S_N \xrightarrow{d} N(0, 1)$

Once again, the previous proposition applies directly to the underlying random fields; however, we need a result for the $g_i(Z_{i,N}, \theta)$ functions.
Assuming that the latter satisfy the standard regularity conditions of Assumption A.3, the first order conditions for the GMM estimator are
\[
\left[ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(Z_i, \hat{\theta}) \right]' \hat{\Omega} \left[ \frac{1}{N} \sum_{i=1}^{N} g_i(Z_i, \hat{\theta}) \right] = 0 \tag{1.8}
\]
Taking a mean value expansion of the last term around $\theta_0$ yields the following expression:
\[
g_i(\hat{\theta}) = g_i(\theta_0) + \nabla_\theta g_i(\tilde{\theta})(\hat{\theta} - \theta_0) + \text{remainder} \tag{1.9}
\]
for $\tilde{\theta}$ between $\hat{\theta}$ and $\theta_0$ element-wise, where I suppressed the dependence of $g_i$ on $Z_i$ for notational simplicity. Replacing (1.9) in (1.8) yields:
\[
\sqrt{N}(\hat{\theta} - \theta_0) = - \left\{ \left[ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(\hat{\theta}) \right]' \hat{\Omega} \left[ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(\tilde{\theta}) \right] \right\}^{-1} \left[ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(\hat{\theta}) \right]' \hat{\Omega} \left[ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} g_i(\theta_0) \right] + \text{remainder} \tag{1.10}
\]
Noting again that the $\nabla_\theta g_i(\theta)$ preserve the mixing conditions, then
\[
\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(\hat{\theta}) \xrightarrow{p} \mathrm{E}[\nabla_\theta g_i(\theta_0)]
\]
by the WLLN above. Since $g_i(\theta)$ is continuously differentiable, by Slutsky's Theorem the first term of (1.10) converges in probability to
\[
\left\{ \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0 \, \mathrm{E}[\nabla_\theta g_i(\theta_0)] \right\}^{-1} \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0
\]
Furthermore, by the CLT,
\[
\frac{1}{\sqrt{N}} \sum_{i=1}^{N} g_i(\theta_0) = O_p(1)
\]
Therefore, taking the probability limit of (1.10), we obtain
\[
\sqrt{N}(\hat{\theta} - \theta_0) = - \left\{ \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0 \, \mathrm{E}[\nabla_\theta g_i(\theta_0)] \right\}^{-1} \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0 \left[ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} g_i(\theta_0) \right] + o_p(1) \xrightarrow{d} \mathcal{N}(0, C'\Sigma C) \tag{1.11}
\]
where
\[
C' = \left\{ \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0 \, \mathrm{E}[\nabla_\theta g_i(\theta_0)] \right\}^{-1} \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0
\]
and
\[
\Sigma = \mathrm{E}[g_i(\theta_0) g_i(\theta_0)'] = \mathrm{Var}[g_i(\theta_0)] \tag{1.12}
\]
Note that for the cases considered in this paper, $C$ is just a matrix of data, so we do not need to estimate it. On the other hand, we need an estimator of the variance of the moment conditions, which we present in the next section. From an empirical implementation point of view, it is important to note that this GMM framework includes the simple estimators mentioned at the beginning of the section as special cases.
For example, in the case of $A_{it}$ containing only exogenous variables, the GMM reduces to the same solution as estimating (1.3) with Pooled OLS. If $A_{it}$ has some endogenous variables, as in model (1.1), and assuming that we have a set of instruments $Z_{it}$, then the Fixed Effects 2SLS estimator can be obtained from the GMM estimator by setting $\hat{\Omega} = \ddot{Z}'\ddot{Z}$, where $Z$ is the stacked $NT \times r$ matrix of instruments. Furthermore, we would need the well known matrices of these estimators to be of full column rank and to converge in probability to non-singular finite matrices.

Another empirical consideration is the specification of the weighting matrix $W$, since in the model the dependence of the outcome variable on other observations is generated by this matrix. In practice, there exist different ways to specify $W$. For example, one could assign weights as the inverse of the distance between two observations and set the weights to zero beyond a threshold, or use a $k$-neighbors scheme. When dealing with geographic units, one could assign an equal weight to all the units $j$ that share a border with unit $i$ (rook type) or to all units that share an edge or a vertex (queen type), as in Figure 1.2, or even assign an equal weight to all other units in the sample (see LeSage and Pace (2009) for a discussion on weighting matrices).

Figure 1.2 Rook and queen type weighting schemes. On the rook type scheme, if $W$ is row normalized, only units 8, 12, 14 and 18 will receive a weight of $\frac{1}{4}$ in row 13. Analogously, if a queen type scheme is used, units 7, 8, 9, 12, 14, 17, 18 and 19 will have a weight of $\frac{1}{8}$ in row 13 of $W$.

Nonetheless, some of these specifications might violate the assumptions stated in this section. In particular, recall that we are working with an $\alpha$-mixing random field, which implies that the dependence between the observations decays as they grow farther apart.
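As a concrete illustration of the rook and queen schemes just described, the following sketch builds a row-normalized contiguity matrix for a regular grid with units numbered row by row, as in Figure 1.2. The function name `contiguity_matrix` and its interface are illustrative, not part of the paper.

```python
import numpy as np

def contiguity_matrix(nrows, ncols, scheme="rook", row_normalize=True):
    """Rook- or queen-type spatial weighting matrix W for a regular
    nrows x ncols grid, units numbered row by row (illustrative sketch)."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for i in range(n):
        r, c = divmod(i, ncols)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue  # diagonal of W is zero
                # rook: shared border only (no diagonals); queen: border or vertex
                if scheme == "rook" and dr != 0 and dc != 0:
                    continue
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    W[i, rr * ncols + cc] = 1.0
    if row_normalize:
        W = W / W.sum(axis=1, keepdims=True)
    return W
```

On the 5 × 5 grid of Figure 1.2, row 13 of the rook matrix puts weight 1/4 on units 8, 12, 14 and 18, while the queen matrix puts weight 1/8 on the eight surrounding units, matching the figure.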
In this respect, it is clear that assigning an equal weight to all other observations violates this assumption. In a similar fashion, a $k$-neighbors pattern might not satisfy the $\alpha$-mixing condition in cases where there are isolated units (e.g. a unit located alone on an island). Note that these restrictions on $W$ also apply in cases where the distance measure is of an economic nature or derived from a network perspective (e.g. degree of centrality).

1.3 The HACSC estimator

To obtain robust standard errors, recall that because for each observation we took the time average of its corresponding moment conditions, we are essentially working with a cross sectional problem. The idea is therefore to apply Kelejian and Prucha's (2007) estimator of the covariance matrix in this context, for which we need consistent estimates of the error terms. Analogous to the time series literature, their estimator requires a kernel function $K(\cdot)$, which provides weights to the covariance terms entering the sums. In principle, only the covariances of observations that are close relative to some distance measure will receive a positive weight, while observations that are far away will receive a weight of zero. In other words, this function operationalizes the weak dependence assumption between observations at the level of the error terms. Note however that this kernel provides weights along the cross sectional dimension and not across time. To fix ideas, the researcher will need to choose a distance $\rho_b$ such that $\rho_b \to \infty$ as $N \to \infty$, which plays the role of the truncation lag in a time series context. The next assumption imposes additional restrictions on the kernel function.

Assumption 5 The kernel $K : \mathbb{R} \to [-1, 1]$ satisfies the following conditions:

1. $K(0) = 1$
2. $K(x) = K(-x)$
3. $K(x) = 0$ for $x > 1$
4.
$|K(x) - 1| \leq c_K |x|^{\alpha_K}$ for $|x| \leq 1$, for some $\alpha_K \geq 1$ and $0 < c_K < \infty$.

As pointed out by Kelejian and Prucha (2007), Assumption 5 is satisfied by many kernels, such as the rectangular kernel, the Bartlett kernel and the triangular kernel, among others. The next assumption imposes some structure on the error terms.

Assumption 6 The $N \times 1$ vector of errors is generated as follows:
\[
u = R\varepsilon \tag{1.13}
\]
where $\varepsilon$ is an $N \times 1$ vector of i.i.d. random variables with mean 0, variance 1 and $\mathrm{E}[|\varepsilon|^q] < \infty$ for $q \geq 4$, and $R$ is an $N \times N$ non-singular unknown matrix whose row and column sums are uniformly bounded.

In light of Assumption 6, recall that although theoretically we are working with a cross sectional problem because we took the time average of the moment conditions, the underlying structure of the data is a panel. In this sense, (1.13) can also be seen as an average, so for each $i$ we have:
\[
u_i = (u_{i1}, u_{i2}, \ldots, u_{iT})' \tag{1.14}
\]
where the $t$-th row of $u_i$ is
\[
u_{i,t} = \sum_{s=1}^{t} R_{i,s} \varepsilon_s
\]
and $R_{i,s}$ is the $i$-th row of $R_s$, a matrix with similar properties to the $R$ defined above, at time $s$. This implies that in each time period, the disturbances will depend on other units' disturbances, past own values of the disturbances, and past values of other units' disturbances. In other words, this structure allows for spatial correlation, serial correlation, "spatial serial" correlation and heteroskedasticity. Nevertheless, the uniform boundedness condition on $R$ guarantees that the correlation between units is restricted along the cross sectional dimension, analogous to the time series case. Given the distance $\rho_b$, we can denote by $v_i$ the number of pseudo-neighbors of $i$:
\[
v_i = \sum_{j=1}^{N} \mathbf{1}[\rho^*(i,j) \leq \rho_b]
\]
and let $v = \max_i v_i$. In words, $v_i$ denotes the number of units $j$ that are at a distance less than $\rho_b$ from unit $i$.
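The conditions of Assumption 5 and the pseudo-neighbor count $v_i$ can be sketched concretely. The Bartlett (triangular) kernel below satisfies all four conditions with $\alpha_K = c_K = 1$; the helper names are illustrative, not from the paper.

```python
import numpy as np

def bartlett(x):
    """Bartlett kernel: K(0)=1, symmetric, zero for |x| > 1,
    and |K(x) - 1| = |x|, so Assumption 5.4 holds with alpha_K = c_K = 1."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x <= 1.0, 1.0 - x, 0.0)

def pseudo_neighbors(dist, rho_b):
    """v_i of the text: for each i, the number of units j with
    rho*(i, j) <= rho_b, given an (N, N) distance matrix."""
    return (np.asarray(dist) <= rho_b).sum(axis=1)
```

In the HACSC sums, `bartlett(rho_star / rho_b)` down-weights pairs smoothly with distance and truncates exactly at $\rho_b$.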
The following assumption is related to $v$.

Assumption 7 The random variable $v$ satisfies the following conditions:

1. $\mathrm{E}(v^2) = o_p(N^{2\tau})$, where $\tau \leq \frac{1}{2} \cdot \frac{q-2}{q-1}$ and $q$ is defined in Assumption 6.
2. $\sum_{j=1}^{N} |\sigma_{ij}| \rho(i,j)^{\alpha_S} \leq c_S$, for $\alpha_S \geq 1$ and $0 < c_S < \infty$, where $\sigma_{ij}$ is the $(i,j)$-th element of $\Sigma$ (defined below).

Assumption 7 plays a role in limiting the degree of correlation between units, as well as in ensuring that the estimator of the covariance matrix is consistent given that we are using residuals instead of errors to estimate it. Assumptions 8 and 9 provide an identification condition and bound the measurement error of the distance, respectively.

Assumption 8 The matrix of exogenous variables, $\ddot{Z}$, has full column rank and its elements are uniformly bounded in absolute value by the finite constant $0 < c_Z < \infty$. For a fixed and finite $T$, the matrices:

1. $\lim_{N \to \infty} (NT)^{-1} \ddot{Z}'\ddot{Z} = Q_{ZZ}$
2. $\lim_{N \to \infty} (NT)^{-1} \ddot{Z}'RR'\ddot{Z} = Q_{ZRRZ}$
3. $\operatorname{plim}_{N \to \infty} (NT)^{-1} \ddot{Z}'\ddot{Z} = Q_{ZZ}$

are finite and non-singular. Furthermore, the matrix $\operatorname{plim}_{N \to \infty} (NT)^{-1} \ddot{Z}'\ddot{A} = Q_{ZA}$ has full column rank $2k$. Similarly, the diagonal elements of $W$ are zero and all of its elements are uniformly bounded by a finite constant $0 < c_W < \infty$.

Assumption 9 The distance measure used by the empirical researcher, $\rho^*(\cdot,\cdot)$, is potentially measured with error, i.e.
\[
\rho^*(i,j) = \rho(i,j) + e_{ij} \geq 0
\]
where $e_{ij} = e_{ji}$ denotes the measurement errors, which are bounded in absolute value by the finite constant $0 < c_e < \infty$. Furthermore, $\{e_{ij}\}$ is independent of $\{\varepsilon_i\}$.

We need an additional assumption to account for the fact that we are using residuals instead of the actual error terms. This condition is provided in Assumption A.4 and should be satisfied by most $N^{1/2}$-consistent estimators. An extensive discussion of this and the previous assumptions is provided by Kelejian and Prucha (2007).
Note that given equations (1.4) and (1.5) and the matrix $\Sigma$ specified in (1.12), we have the following:
\[
\mathrm{E}[g_i(\theta_0) g_i(\theta_0)'] = \mathrm{E}\left[ \ddot{Z}_i' \ddot{u}_i \ddot{u}_i' \ddot{Z}_i \right] \tag{1.15}
\]
Because all the analysis is conditional on $Z$ and $W$, applying the Law of Iterated Expectations to (1.15) together with Assumption 6 gives $\mathrm{E}(uu') = RR' = \Sigma$, where $u$ is the $N \times 1$ vector of stacked error terms. In practical terms, and recalling that $g_i(\cdot,\cdot)$ was defined as an average over time, we can estimate (1.15) by replacing the error terms with their residual counterparts and the expected value with an average, applying the WLLN. Therefore, for the proposed estimator $\hat{\Sigma}$, its $(r,s)$-th element can be obtained as follows:
\[
\hat{\Sigma}_{rs} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \ddot{Z}_{it,r} \ddot{Z}_{jl,s} \, \hat{\ddot{u}}_{it} \hat{\ddot{u}}_{jl} \, K\left[ \frac{\rho^*(i,j)}{\rho_b} \right] \tag{1.16}
\]
where $\ddot{Z}_{it,r}$ is the value of covariate $r$ for observation $i$ at time $t$, while its population counterpart is given by the following expression:
\[
\Sigma_{rs} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \ddot{Z}_{it,r} \ddot{Z}_{jl,s} \, \sigma_{it,jl} \tag{1.17}
\]
The following proposition establishes the consistency of $\hat{\Sigma}$.

Proposition 4. Consider the model in (1.1) and Assumptions 4–9. Suppose that the $(r,s)$-th elements of $\Sigma$ and $\hat{\Sigma}$ are given by (1.17) and (1.16) respectively. Then $\hat{\Sigma} \xrightarrow{p} \Sigma$.

Given that we have assumed $T$ fixed from the beginning, the proof of this proposition is virtually the same as in Kelejian and Prucha (2007).
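Because the kernel weight in (1.16) depends only on the pair $(i, j)$ and not on $(t, l)$, the quadruple sum factorizes into time sums per unit followed by a kernel-weighted cross-sectional quadratic form. A minimal sketch of the estimator under this factorization follows; the function name `hacsc_sigma` and its interface are illustrative, not part of the paper.

```python
import numpy as np

def hacsc_sigma(Zdot, udot, dist, rho_b, kernel):
    """Sketch of the spatial HAC estimator Sigma-hat of (1.16).

    Zdot : (N, T, k) array of demeaned instruments/covariates
    udot : (N, T) array of demeaned residuals
    dist : (N, N) matrix of (possibly mismeasured) distances rho*(i, j)

    Within-unit covariances across time are never down-weighted
    (kernel(0) = 1); only the spatial dimension is kernel-weighted."""
    N, T, k = Zdot.shape
    # s[i, r] = sum_t Zdot[i, t, r] * udot[i, t]; the quadruple sum in
    # (1.16) then becomes s' K s because K depends only on (i, j)
    s = np.einsum("itk,it->ik", Zdot, udot)
    Kw = kernel(dist / rho_b)          # (N, N) spatial weights
    return s.T @ Kw @ s / (N * T)
```

The factorization reduces the cost from $O(N^2 T^2 k^2)$ brute-force summation to two matrix products.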
Note that we can re-write (1.16) as follows:
\[
\hat{\Sigma}_{rs} = \frac{1}{NT} \left\{ \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \ddot{Z}_{it,r} \ddot{Z}_{il,s} \, \hat{\ddot{u}}_{it} \hat{\ddot{u}}_{il} \cdot K[0] + \sum_{i=1}^{N} \sum_{j \neq i} \sum_{t=1}^{T} \sum_{l=1}^{T} \ddot{Z}_{it,r} \ddot{Z}_{jl,s} \, \hat{\ddot{u}}_{it} \hat{\ddot{u}}_{jl} \, K\left[ \frac{\rho^*(i,j)}{\rho_b} \right] \right\} \tag{1.18}
\]
The first term of (1.18) makes it clear that there are no restrictions imposed on the serial correlation for a particular observation, as those terms are not down-weighted.

1.4 Correlated Random Effects

A direct application of the HACSC proposed in the previous section arises in the Correlated Random Effects (CRE) context. One of the most popular methods applied in a panel setting is the fixed effects estimator, since it allows the unobserved heterogeneity $c_i$ to be arbitrarily correlated with the explanatory variables in the model. On the other side of the spectrum, the random effects approach imposes no correlation between $c_i$ and the independent variables. A typical task that the empirical researcher must face is to choose between these two specifications, for which the literature has suggested multiple approaches. One of these is the CRE framework, which imposes restrictions on the distribution of the individual heterogeneity conditional on the regressors (Wooldridge, 2010). One option is to follow Mundlak's (1978) suggestion, which assumes that $c_i$ can be modeled as a linear function of the averages of the time varying independent variables. More specifically, consider the following model:
\[
y_{it} = x_{it}\beta + c_i + u_{it}, \quad i = 1 \ldots N, \; t = 1 \ldots T \tag{1.19}
\]
Assuming that the $x_i$'s are time varying, Mundlak considered the following specification:
\[
c_i = \eta + \bar{x}_i \delta + e_i \tag{1.20}
\]
where $e_i$ is uncorrelated with $\bar{x}_i$ by assumption. Replacing (1.20) in (1.19) yields:
\[
y_{it} = x_{it}\beta + \bar{x}_i \delta + e_i + u_{it}, \quad i = 1 \ldots N, \; t = 1 \ldots T \tag{1.21}
\]
Mundlak (1978) showed that estimating $\beta$ in (1.21) by pooled OLS (POLS) or random effects yields the same $\beta$ as estimating (1.19) by fixed effects.
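The Mundlak equivalence is an exact algebraic result and can be verified numerically in a simplified scalar setting. The sketch below simulates data with heterogeneity correlated with $\bar{x}_i$ and checks that pooled OLS of $y$ on $(1, x_{it}, \bar{x}_i)$ returns exactly the within (fixed effects) coefficient; all names are illustrative.

```python
import numpy as np

# Numerical check of Mundlak (1978): pooled OLS of y on (x_it, xbar_i)
# yields the same beta as the fixed effects (within) estimator.
rng = np.random.default_rng(42)
N, T = 200, 5
x = rng.normal(size=(N, T))
c = 0.8 * x.mean(axis=1) + rng.normal(size=N)       # c_i correlated with xbar_i
y = 2.0 + 0.7 * x + c[:, None] + rng.normal(size=(N, T))

# Fixed effects: within transformation, then pooled OLS
xdot = (x - x.mean(axis=1, keepdims=True)).ravel()
ydot = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (xdot @ ydot) / (xdot @ xdot)

# Mundlak regression: y on constant, x_it, xbar_i
xbar = np.repeat(x.mean(axis=1), T)
X = np.column_stack([np.ones(N * T), x.ravel(), xbar])
beta_mundlak = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]

assert np.isclose(beta_fe, beta_mundlak)
```

The equality holds to machine precision, not just asymptotically, because by Frisch–Waugh the residual of $x_{it}$ on $(1, \bar{x}_i)$ in the pooled sample is exactly the within-demeaned $\ddot{x}_{it}$.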
In addition, we can perform a Hausman-type test using this equation by testing $\delta = 0$ to determine the suitability of one estimator versus the other. It turns out that this FE-RE equivalence carries over to the spatial setting in a particular case, namely a model such as (1.1), i.e. with no autoregressive process in the error term $u$ (Li and Yang (2020) showed that the equivalence breaks if we model the error term structurally). Furthermore, the result carries over to the case of endogenous variables, which is a common issue in empirical work. More concretely, consider the model in (1.1); using the same notation, the Fixed Effects Two Stage Least Squares (FE2SLS) coefficients can be obtained by running Pooled 2SLS on the following equation:
\[
\ddot{y}_{it} = \ddot{x}_{1it}\beta_1 + \ddot{x}_{2it}\beta_2 + \ddot{w}_{1it}\gamma_1 + \ddot{w}_{2it}\gamma_2 + \rho W_i \ddot{y}_t \tag{1.22}
\]
using the instrumental variables $(\ddot{z}_{2it} \;\; \ddot{\mathfrak{z}}_{2it} \;\; \ddot{w}^2_{1it} \;\; \ddot{w}^3_{1it} \ldots \ddot{w}^s_{1it})$, $s \in \mathbb{N}$. Then, it can be shown that running Pooled 2SLS on:
\[
\begin{aligned}
y_{it} - \eta \bar{y}_i ={}& (x_{1it} - \eta \bar{x}_{1i})\beta_1 + (x_{2it} - \eta \bar{x}_{2i})\beta_2 + (w_{1it} - \eta \bar{w}_{1i})\gamma_1 + (w_{2it} - \eta \bar{w}_{2i})\gamma_2 + \rho W_i (y_t - \eta \bar{y}) \\
& + (1-\eta)\bar{x}_{1i}\delta_1 + (1-\eta)\bar{z}_{2i}\delta_2 + (1-\eta)\bar{w}_{1i}\lambda_1 + (1-\eta)\bar{\mathfrak{z}}_{2i}\lambda_2 + (1-\eta)\sum_{j=2}^{s} \bar{w}^j_{1i}\zeta_j
\end{aligned} \tag{1.23}
\]
using the instruments
\[
\left[ (z_{2it} - \eta \bar{z}_{2i}) \;\; (\mathfrak{z}_{2it} - \eta \bar{\mathfrak{z}}_{2i}) \;\; (w^2_{1it} - \eta \bar{w}^2_{1i}) \ldots (w^s_{1it} - \eta \bar{w}^s_{1i}) \;\; (1-\eta)\bar{\mathfrak{z}}_{2i} \;\; (1-\eta)W_i \bar{Z}_2 \;\; (1-\eta)\bar{w}^2_{1i} \ldots (1-\eta)\bar{w}^s_{1i} \right]
\]
yields the same $(\beta_1 \; \beta_2 \; \gamma_1 \; \gamma_2 \; \rho)$ as in (1.22), where
\[
\eta = 1 - \left[ \sigma^2_u / (\sigma^2_u + T\sigma^2_c) \right]^{1/2}
\]
is assumed to be known. The following proposition summarizes this result.

Proposition 5. Suppose $\tilde{\Gamma} = (\tilde{\beta}_1 \; \tilde{\beta}_2 \; \tilde{\gamma}_1 \; \tilde{\gamma}_2 \; \tilde{\rho})$ is the coefficient vector obtained by running Pooled 2SLS on equation (1.23). Then $\tilde{\Gamma} = \hat{\Gamma}_{FE2SLS}$, the coefficient vector obtained by running Pooled 2SLS on equation (1.22).

The proof of this proposition can be found in the Appendix.
Note that we have included the time averages of the instruments in (1.23), but this might introduce some distortions in the sense that the dimension of the $z$'s might be larger than the original dimension of the $x_2$'s. In practice, this will impact the degrees of freedom employed in the hypothesis test used to choose between FE and RE. Although this might not matter when the cross sectional dimension is large, in small samples it could have a significant impact on the statistical significance of the coefficients.

It is important to note that this FE-RE equivalence is an algebraic result and, as it turns out, one can obtain the FE coefficients $(\beta \; \gamma \; \rho)$ of (1.23) by replacing the averages of the instruments with the time averages of the predicted values from a regression of the endogenous variables on all of the exogenous variables, i.e.
\[
\begin{aligned}
y_{it} - \eta \bar{y}_i ={}& (x_{1it} - \eta \bar{x}_{1i})\beta_1 + (\hat{x}_{2it} - \eta \hat{\bar{x}}_{2i})\beta_2 + (w_{1it} - \eta \bar{w}_{1i})\gamma_1 + (\hat{w}_{2it} - \eta \hat{\bar{w}}_{2i})\gamma_2 + \rho W_i (\hat{y}_t - \eta \hat{\bar{y}}) \\
& + (1-\eta)\bar{x}_{1i}\delta_1 + (1-\eta)\hat{\bar{x}}_{2i}\delta_2 + (1-\eta)\bar{w}_{1i}\lambda_1 + (1-\eta)\hat{\bar{w}}_{2i}\lambda_2 + (1-\eta)W_i \hat{\bar{y}}\,\zeta_1
\end{aligned} \tag{1.24}
\]
This will "correct" the degrees of freedom issue mentioned above, at the expense of making the asymptotic theory harder, since we have to take into account that we are using the predicted values instead of the original instrument averages. Proposition 6 summarizes this result and is proved in the Appendix.

Proposition 6. Suppose $\check{\Gamma} = (\check{\beta}_1 \; \check{\beta}_2 \; \check{\gamma}_1 \; \check{\gamma}_2 \; \check{\rho})$ is the coefficient vector obtained by running Pooled OLS on equation (1.24), where the hats represent the linear projections of the endogenous variables on the exogenous covariates. Then $\check{\Gamma} = \hat{\Gamma}_{FE2SLS}$, the coefficient vector obtained by running Pooled 2SLS on equation (1.22).

Once the researcher estimates the coefficients of (1.23) or (1.24), the next natural step is to test the hypothesis $\Xi = (\delta \; \lambda \; \zeta) = 0$ [here $\zeta$ denotes either $(\zeta_2 \ldots \zeta_s)$ in (1.23) or $\zeta_1$ in (1.24)] to decide between the FE and RE specifications.
Even if model (1.1) does not impose an explicit functional form on the error term, the $u_{it}$ could still be serially or spatially correlated; therefore, we can use the HACSC estimator proposed in Section 1.3 to conduct a fully robust Hausman-type test in a simple way. Specifically, one would compute the Wald statistic as $\mathcal{W} = (R\hat{\Xi})'(R\hat{\Sigma}R')^{-1}(R\hat{\Xi})$, where $R$ encodes the set of restrictions on the coefficients, $\hat{\Xi}$ is the full set of estimated coefficients and $\hat{\Sigma}$ is the estimated HACSC robust covariance matrix.

1.5 Feasible GLS

As previously stated, and analogous to the time series literature, it is common practice in empirical work to assume a particular structure for the error term in a spatial context. In particular, consider the following model:
\[
\begin{aligned}
y_t &= X_t \beta + v_t \\
v_t &= \rho W v_t + \varepsilon_t \\
\varepsilon_t &= c + u_t
\end{aligned} \tag{1.25}
\]
where $y_t$ is an $N \times 1$ vector, $X_t$ is an $N \times k$ matrix of covariates, $c$ denotes the vector of individual heterogeneity and $u_t$ is a vector of idiosyncratic errors at time $t$. In this model, $X_t$ may contain spatial lags of the independent variables. In what follows, the conditioning of all the analysis on both $X_t$ and $W$ is implicit. By stacking the equations by time period, the model can be rewritten as follows:
\[
\begin{aligned}
y &= X\beta + v \\
v &= (\mathrm{I}_T \otimes \rho W)v + \varepsilon \\
\varepsilon &= (e_T \otimes \mathrm{I}_N)c + u
\end{aligned} \tag{1.26}
\]
where $e_T$ represents a $T \times 1$ vector of ones. At this point, the researcher needs to make an assumption about the orthogonality condition between the independent variables and the composite error term, and more specifically the vector $c$. A typical choice is to assume that all the explanatory variables $X$ are exogenous with respect to both vectors $c$ and $u$, with each element of these being i.i.d. with zero mean and finite variances $\sigma^2_c$ and $\sigma^2_u$ respectively, and with both vectors independent of each other. Note that this working assumption is stronger than the one required for the consistency of the fixed effects estimator described in the previous sections (as in the rest of the paper, I assume that $T$ is fixed and $N \to \infty$).
Given these assumptions, from (1.25) we can write $\mathrm{E}(v_t v_t')$ as follows:
\[
\mathrm{E}(v_t v_t') = (\sigma^2_c + \sigma^2_u)(\mathrm{I}_N - \rho W)^{-1}(\mathrm{I}_N - \rho W')^{-1} \tag{1.27}
\]
Or, using the stacked version in (1.26) instead, we can write $\mathrm{E}(\varepsilon\varepsilon') = \Omega_\varepsilon$ in the following way:
\[
\Omega_\varepsilon = \sigma^2_c (J_T \otimes \mathrm{I}_N) + \sigma^2_u \mathrm{I}_{NT} \tag{1.28}
\]
where $J_T = e_T e_T'$. Therefore it follows that
\[
\mathrm{E}(vv') = \left[ \mathrm{I}_T \otimes (\mathrm{I}_N - \rho W)^{-1} \right] \left[ \sigma^2_c (J_T \otimes \mathrm{I}_N) + \sigma^2_u \mathrm{I}_{NT} \right] \left[ \mathrm{I}_T \otimes (\mathrm{I}_N - \rho W)^{-1} \right]' \tag{1.29}
\]
Note that the middle of this matrix has a classic random effects structure. In order to compute this covariance matrix, it is assumed that the matrix $(\mathrm{I}_N - \rho W)$ is invertible and that $|\rho| < 1$, just as in the previous sections. Following the time series case, and to facilitate the computation of the middle of (1.29), note that
\[
\Omega_\varepsilon = \sigma^2_u Q_0 + \sigma^2_1 Q_1 \tag{1.30}
\]
where $Q_0 = \left( \mathrm{I}_T - \frac{J_T}{T} \right) \otimes \mathrm{I}_N$, $Q_1 = \frac{J_T}{T} \otimes \mathrm{I}_N$ and $\sigma^2_1 = \sigma^2_u + T\sigma^2_c$. Noting that $Q_0$ and $Q_1$ are idempotent and symmetric, that $Q_0 + Q_1 = \mathrm{I}_{NT}$ and that $Q_0 Q_1 = 0_{NT}$, it follows that $\Omega^{-1}_\varepsilon = \sigma^{-2}_u Q_0 + \sigma^{-2}_1 Q_1$ and $\Omega^{-1/2}_\varepsilon = \sigma^{-1}_u Q_0 + \sigma^{-1}_1 Q_1$.

In short, if the researcher is willing to impose that the covariates are orthogonal to the individual heterogeneity vector $c$ and that the error term in (1.26) follows a spatial AR(1) process, then the matrix $\mathrm{E}(vv')$ will have a particular form that depends on only three parameters. Knowing this, one can obtain an estimator that is potentially more efficient than the FE estimator. More specifically, the researcher can exploit the structure of the error term in (1.26) to remove the spatial correlation by performing a spatial Cochrane-Orcutt type transformation. Let
\[
\begin{aligned}
y^* &= y - (\mathrm{I}_T \otimes \rho W)y \\
X^* &= X - (\mathrm{I}_T \otimes \rho W)X \\
v^* &= v - (\mathrm{I}_T \otimes \rho W)v
\end{aligned}
\]
Therefore, the transformed model is
\[
y^* = X^*\beta + v^* \tag{1.31}
\]
Note that $v^* = \varepsilon$, so that (1.31) contains a classical composite error term. Given the structure of $\varepsilon$, we can perform a second transformation by multiplying (1.31) by $\Omega^{-1/2}_\varepsilon$ to obtain
\[
\check{y} = \check{X}\beta + \check{\varepsilon} \tag{1.32}
\]
where $\check{y} = \Omega^{-1/2}_\varepsilon y^*$ and similarly for the rest of the terms.
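The error-components algebra behind (1.30) — that $Q_0$ and $Q_1$ are idempotent, symmetric, orthogonal and sum to the identity, so that $\Omega^{-1/2}_\varepsilon$ has the stated closed form — can be verified numerically on a small example. The sketch below uses arbitrary illustrative values of $\sigma^2_u$ and $\sigma^2_c$.

```python
import numpy as np

# Verify the decomposition Omega_eps = s2_u*Q0 + s2_1*Q1 and the closed
# form of Omega_eps^{-1/2} for a small panel (N units, T periods,
# stacked by time period as in the text).
N, T = 4, 3
s2_u, s2_c = 1.3, 0.6              # illustrative variance components
s2_1 = s2_u + T * s2_c
J = np.ones((T, T))
Q0 = np.kron(np.eye(T) - J / T, np.eye(N))
Q1 = np.kron(J / T, np.eye(N))

Omega = s2_u * Q0 + s2_1 * Q1                       # equals (1.28)
Omega_mhalf = Q0 / np.sqrt(s2_u) + Q1 / np.sqrt(s2_1)

assert np.allclose(Q0 @ Q0, Q0) and np.allclose(Q1 @ Q1, Q1)  # idempotent
assert np.allclose(Q0 @ Q1, 0)                                 # orthogonal
assert np.allclose(Q0 + Q1, np.eye(N * T))                     # sum to I
# the whitening in (1.33): Omega^{-1/2} Omega Omega^{-1/2} = I_NT
assert np.allclose(Omega_mhalf @ Omega @ Omega_mhalf, np.eye(N * T))
```

The same check confirms that (1.30) reproduces (1.28): $\sigma^2_u Q_0 + \sigma^2_1 Q_1 = \sigma^2_c (J_T \otimes \mathrm{I}_N) + \sigma^2_u \mathrm{I}_{NT}$.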
Note that
\[
\mathrm{E}(\check{\varepsilon}\check{\varepsilon}') = \Omega^{-1/2}_\varepsilon \mathrm{E}(\varepsilon\varepsilon') \Omega^{-1/2}_\varepsilon = (\sigma^{-1}_u Q_0 + \sigma^{-1}_1 Q_1)(\sigma^2_u Q_0 + \sigma^2_1 Q_1)(\sigma^{-1}_u Q_0 + \sigma^{-1}_1 Q_1) = Q_0 + Q_1 = \mathrm{I}_{NT} \tag{1.33}
\]
Hence (1.32) can be estimated by Pooled OLS to obtain a GLS-type estimator, denoted by $\hat{\beta}_{GLS}$, that delivers efficiency gains. If all the relevant matrices are well behaved as $N \to \infty$ and non-singular, Kapoor et al. (2007) showed that
\[
(NT)^{1/2}\left( \hat{\beta}_{GLS} - \beta \right) \xrightarrow{d} \mathcal{N}(0, \Psi) \quad \text{as } N \to \infty \tag{1.34}
\]
where $\Psi = \left( \sigma^{-2}_u M^0_{XX} + \sigma^{-2}_1 M^1_{XX} \right)^{-1}$ and $M^j_{XX} = \lim_{N \to \infty} \frac{1}{NT} X^{*\prime} Q_j X^*$ for $j = 0, 1$. The previous analysis requires knowledge of $\sigma^2_c$, $\sigma^2_u$ and $\rho$, and is therefore not feasible. Kapoor et al. (2007) proposed generalized moments estimators of these parameters and showed that if $\hat{\beta}_{FGLS}$ is the Pooled OLS estimator of (1.32) using any consistent estimators $\hat{\sigma}^2_c$, $\hat{\sigma}^2_u$ and $\hat{\rho}$ instead of $\sigma^2_c$, $\sigma^2_u$ and $\rho$, then
\[
(NT)^{1/2}\left( \hat{\beta}_{GLS} - \hat{\beta}_{FGLS} \right) \xrightarrow{p} 0 \quad \text{and} \quad \hat{\Psi} - \Psi \xrightarrow{p} 0 \tag{1.35}
\]
where $\hat{\Psi} = \left( \frac{1}{NT} \hat{X}^{*\prime} \hat{\Omega}^{-1}_\varepsilon \hat{X}^* \right)^{-1}$, provided that the working assumptions used to derive (1.34) hold. Note that the hats over the components of $\hat{\Psi}$ denote the dependence of these terms on $\hat{\sigma}^2_c$, $\hat{\sigma}^2_u$ and $\hat{\rho}$.

The validity of the previous covariance matrix $\Psi$ rests on the working assumptions that the error term $v$ follows a spatial AR(1) and that the conditions imposed on each element of $c$ and $u$ hold. However, from an empirical perspective it is always possible that the structure of $\Omega_\varepsilon$ does not have the RE form, due for example to the presence of heteroskedasticity or serial correlation in $u_i$. It is important to stress that even if $\Omega_\varepsilon$ does not have the same structure as in (1.28), $\hat{\beta}_{FGLS}$ remains consistent, provided that the strict exogeneity condition (more formally, $\mathrm{E}[X \otimes c] = 0$ and $\mathrm{E}[X \otimes u] = 0$) and the corresponding rank condition continue to hold.
Nevertheless, if the researcher is unsure about the assumptions on the vector of individual heterogeneity $c$ or the idiosyncratic errors $u$ made in this section, it is wise to perform robust inference. In these instances, the HACSC estimator presented in this paper can be used to achieve this purpose. More specifically, consider the residuals $\hat{\check{\varepsilon}}_t = \check{y}_t - \check{X}_t \hat{\beta}_{FGLS}$, $t = 1 \ldots T$, where $\hat{\beta}_{FGLS}$ is obtained by estimating (1.32). In this context, the $(r,s)$-th element of the middle of the robust covariance matrix is
\[
\hat{\Sigma}_{rs} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \check{X}_{it,r} \check{X}_{jl,s} \, \hat{\check{\varepsilon}}_{it} \hat{\check{\varepsilon}}_{jl} \, K\left[ \frac{\rho^*(i,j)}{\rho_b} \right] \tag{1.36}
\]
and the fully robust covariance matrix is:
\[
\check{\Psi} = \left( \check{X}'\check{X} \right)^{-1} \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \hat{\check{\varepsilon}}_{it} \hat{\check{\varepsilon}}_{jl} \, \check{X}_{it}' \check{X}_{jl} \, K\left[ \frac{\rho^*(i,j)}{\rho_b} \right] \right\} \left( \check{X}'\check{X} \right)^{-1} \tag{1.37}
\]
where $\check{X}_{it}$ is the $1 \times k$ vector of covariates at time $t$ for observation $i$. Note that the computation of $\check{\Psi}$ requires the use of the transformed variables and not the original ones, which is consistent with the estimating equation (1.32). As in the previous sections, the kernel function $K(\cdot)$ provides weights so that the (possible) spatial correlation decreases for observations that are far apart according to the distance measure $\rho(\cdot,\cdot)$. Naturally, $\check{\Psi}$ will be valid whether or not the RE structure of $\Omega_\varepsilon$ holds, and it will be robust to arbitrary serial and spatial correlation, as well as heteroskedasticity.

Throughout this section we have assumed that all the explanatory variables are uncorrelated with the error term $u$. If some elements of $X$ are endogenous (i.e. $\mathrm{E}[x_{it}'u_{it}] \neq 0$) and the researcher has instruments $Z$ available, then the extension to an IV procedure is straightforward, as discussed in Mutl and Pfaffermayr (2010) and Baltagi and Liu (2011).
The estimation approach would be to apply Pooled 2SLS to the estimating equation (1.32) using instruments $\check{Z}$, where the check denotes the same transformations made earlier in the section. In this instance, the computation of the covariance matrix using the HACSC estimator would look like (1.16), but the researcher would need to use the transformed variables of this section instead.

1.6 Alternative estimation: a Control Function approach

It is well known that Instrumental Variables estimation procedures such as 2SLS deliver consistent estimates of the parameters at the expense of losing precision relative to OLS, as pointed out by Cameron and Trivedi (2005). In such instances, if the researcher is willing to impose additional assumptions, she can resort to the control function approach (Blundell & Powell, 2003), which can deliver estimates that are (potentially) more efficient, as will be shown in the simulations. To this end, consider the following estimating equation:
\[
y_{it} = f(X_{1t}, X_{2t}, W) + c_i + u_{it} \tag{1.38}
\]
where $f(\cdot)$ is a known function, $\mathrm{E}(X_{1t}'u_{it}) = 0$ and $\mathrm{E}(X_{2t}'u_{it}) \neq 0$. In practice, $f(\cdot)$ will almost certainly contain linear functions of $X_{1t}$ and $X_{2t}$ as well as spatial spillovers of these variables, but it can also include nonlinear terms in the endogenous variables such as interactions with $X_{1t}$, squared functions and so on. Now, to analyze the CF approach, consider equation (1.39), which is a special case of (1.38) and is very similar to (1.1) but without the spatial lag of the dependent variable on the right hand side,⁵ which allows us to interpret it as a conditional mean function; for simplicity we assume that there is only one element in $x_{2it}$:
\[
y_{it} = x_{1it}\beta_1 + x_{2it}\beta_2 + W_i X_{1t}\gamma_1 + W_i X_{2t}\gamma_2 + c_i + u_{it} = x_{it}\beta + W_i X_t \gamma + c_i + u_{it}, \quad i = 1 \ldots N, \; t = 1 \ldots T \tag{1.39}
\]
where the definitions are the same as in Section 1.
By applying the within transformation, we obtain the estimating equation:
\[
\ddot{y}_{it} = \ddot{x}_{it}\beta + W_i \ddot{X}_t \gamma + \ddot{u}_{it} \tag{1.40}
\]
As with the 2SLS case, and using obvious notation, this approach also requires the availability of a set of instruments $\ddot{Z}_{it} = (\ddot{x}_{1it} \;\; \ddot{w}_{1it} \;\; \ddot{z}_{2it} \;\; \ddot{\mathfrak{z}}_{2it})$. The first two assumptions of the Control Function (CF) approach are the same as for 2SLS, namely $\mathrm{E}(\ddot{Z}_{it}'\ddot{u}_{jt}) = 0$ for $i, j = 1 \ldots N$ and $t = 1 \ldots T$, and the identification condition $\mathrm{rank}[\mathrm{E}(\ddot{Z}'\ddot{A})] = 2k - 1$. The first stage of the estimation involves the reduced form of the endogenous variable on the instruments and obtaining the disturbances $\ddot{v}_{2it}$, i.e.
\[
\ddot{v}_{2it} = \ddot{x}_{2it} - \ddot{Z}_{it}\psi \tag{1.41}
\]
where $\mathrm{E}(\ddot{Z}_{it}'\ddot{v}_{it}) = 0$. Given that $\mathrm{E}(\ddot{Z}_{it}'\ddot{u}_{it}) = 0$, note that $\ddot{x}_{2it}$ and $\ddot{w}_{2it}$ are endogenous if and only if $\ddot{u}_{it}$ is correlated with $\ddot{v}_{2it}$ and $W_i \ddot{v}_{2t}$. At this point we state the additional assumption required by the CF approach:
\[
\mathrm{E}(\ddot{u}_{it} | Z, X_2, W) = \mathrm{E}(\ddot{u}_{it} | Z, \ddot{v}_2, W) = \mathrm{E}(\ddot{u}_{it} | \ddot{v}_2, W) = \mu_1 \ddot{v}_{2it} + \mu_2 W_i \ddot{v}_{2t} \tag{1.42}
\]
This equation has two strong implicit restrictions. First, the second equality holds under independence of $Z$ and $(\ddot{u}, \ddot{v}_2, W)$; second, we are assuming a conditional expectation of $\ddot{u}_{it}$ that is linear in the parameters.

⁵ It is certainly possible to use the control function approach with the spatial lag of the dependent variable as a covariate.
Given this, we can write
\[
\ddot{u}_{it} = \mu_1 \ddot{v}_{2it} + \mu_2 W_i \ddot{v}_{2t} + \ddot{e}_{it} \tag{1.43}
\]
Replacing (1.43) in (1.40) yields:
\[
\ddot{y}_{it} = \ddot{x}_{it}\beta + W_i \ddot{X}_t \gamma + \mu_1 \ddot{v}_{2it} + \mu_2 W_i \ddot{v}_{2t} + \ddot{e}_{it} \tag{1.44}
\]
Stacking again all the explanatory variables into a matrix $A$ and the coefficients into a vector $\theta$ yields:
\[
\ddot{y}_{it} = \ddot{a}_{it}\theta + \ddot{e}_{it} \tag{1.45}
\]
The error term in (1.45) is uncorrelated with the rest of the variables in the equation (including $\ddot{x}_{2it}$ and $\ddot{w}_{2it}$), so the parameters can be consistently estimated using Pooled OLS by replacing the disturbances with the computed residuals from the first stage. Therefore, the estimating equation for the main model becomes:
\[
\ddot{y}_{it} = \hat{\ddot{a}}_{it}\theta + \ddot{e}_{it} \tag{1.46}
\]
where the hat denotes that we are using generated regressors. Two important observations follow from equation (1.44). First, by including both $\ddot{v}_{2it}$ and $W_i \ddot{v}_{2t}$, the parameters obtained from this estimation will be numerically the same as 2SLS.⁶ Second, if $\mu_2 = 0$, then it is enough to include only $\ddot{v}_{2it}$ in the estimating equation to get consistent estimates of $\theta$, and in this scenario they will differ from 2SLS. Furthermore, it is precisely by excluding $W_i \ddot{v}_{2t}$ from the estimation that the CF would probably be more efficient than 2SLS in this case, as it would be using the information contained in this restriction.

The CF has some additional advantages over 2SLS. One, the inclusion of the generated regressors in (1.44) allows the researcher to perform a Hausman-type test of whether the suspected variables are endogenous, a test that can be made robust to heteroskedasticity and to spatial and serial correlation using the estimator proposed below.

⁶ In this sense, we do not get any efficiency gains compared to 2SLS by including both terms.
Second, the CF can handle nonlinear functions of the endogenous variables in a parsimonious way: for example, in model (1.39), $x_{2it}$ could enter through interactions with other exogenous variables or even squared terms, in which case the CF only requires including $\hat{\ddot{v}}_{2it}$ in the final estimating equation, whereas 2SLS would need a reduced form equation for each additional function of the endogenous variable. If such nonlinear functions of the endogenous variable are indeed present in the main model, the CF can be made more flexible by including terms such as $\hat{\ddot{v}}^2_{2it}$ (but again, this is not necessary, as the inclusion of $\hat{\ddot{v}}_{2it}$ already "controls" for this endogeneity), at the cost of having to adapt the standard errors to account for these new generated regressors.

From this point onward, one has to decide how to deal with the error term. One option is to impose some structure on it and apply a Feasible GLS procedure in order to obtain further efficiency gains. Note that this is possible because in (1.42) we conditioned on the whole set of exogenous variables and the weighting matrix. However, this would not be possible if we slightly modify the model. So far we have assumed that the model also contains spatial spillovers of the endogenous variable $\ddot{x}_{2it}$, but suppose that for some theoretical reason the model does not include $W_i \ddot{X}_{2t}$. In this case we could relax (1.42) to
\[
\mathrm{E}(\ddot{u}_{it} | Z_{it}, x_{2it}) = \mathrm{E}(\ddot{u}_{it} | Z_{it}, \ddot{v}_{2it}) = \mathrm{E}(\ddot{u}_{it} | \ddot{v}_{2it}) = \mu_1 \ddot{v}_{2it} \tag{1.47}
\]
Note that we are now conditioning only on the unit's own control function. In this instance one could still estimate the transformed model by Pooled OLS; however, it would preclude applying a Feasible GLS procedure, because the strict spatial exogeneity assumption would be violated, since it would involve the weighting matrix $W$ and the error terms of other observations.
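The numerical identity between the CF estimator (with the first-stage residual included) and 2SLS can be checked in a stripped-down cross-sectional setting with one endogenous regressor, one instrument and no spatial terms. This is a sketch under simplified assumptions, not the paper's full panel model; all names and parameter values are illustrative.

```python
import numpy as np

# CF vs 2SLS: including the first-stage residual vhat in the
# second-stage OLS reproduces the 2SLS slope exactly.
rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=n)
v = rng.normal(size=n)
x = 0.9 * z + v                       # reduced form
u = 0.5 * v + rng.normal(size=n)      # endogeneity runs through v
y = 1.0 + 0.7 * x + u

Z1 = np.column_stack([np.ones(n), z])
psi = np.linalg.lstsq(Z1, x, rcond=None)[0]
vhat = x - Z1 @ psi                   # first-stage residuals
xhat = Z1 @ psi                       # first-stage fitted values

# CF second stage: OLS of y on (1, x, vhat); coefficient on x
Xcf = np.column_stack([np.ones(n), x, vhat])
beta_cf = np.linalg.lstsq(Xcf, y, rcond=None)[0][1]

# 2SLS: OLS of y on (1, xhat); coefficient on xhat
beta_2sls = np.linalg.lstsq(np.column_stack([np.ones(n), xhat]), y,
                            rcond=None)[0][1]

assert np.isclose(beta_cf, beta_2sls)
```

The equality is exact because $x = \hat{x} + \hat{v}$ with $\hat{x} \perp \hat{v}$ in sample, so the regression on $(1, x, \hat{v})$ is a reparametrization of the regression on $(1, \hat{x}, \hat{v})$.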
Alternatively, the researcher can treat the error term non-parametrically and apply the HACSC estimator proposed in this paper to obtain robust standard errors. Nevertheless, in this case there is an additional layer of complication on top of the spatio-temporal correlation and the heteroskedasticity: by including $\hat{\ddot{v}}_{2it}$ in the estimating equation, we now have a generated regressor and therefore the covariance matrix of the parameters needs to be adjusted to take into account the sampling error induced by the first-stage estimation (i.e. we are getting estimates of $\psi$). Although Basile et al. (2014) recommend a bootstrap to obtain the standard errors in a CF setup, resampling spatially dependent data is not a trivial matter, so having a formula is useful in practice. In this setup, the fully robust covariance matrix is

$B^{-1} M B^{-1}$  (1.48)

where

$B = \mathrm{E}\left(\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} \ddot{a}_{it}' \ddot{a}_{it}\right)$

$M = \mathrm{Var}\left[\sum_{i}^{N}\sum_{t}^{T} m_{it}\right]$, with $m_{it} = (\ddot{z}_{it}\psi)'(\ddot{e}_{it} + \ddot{v}_{it}\theta) - G \cdot r_{it}(\psi)\theta$

$G = \mathrm{E}\left[\sum_{i}^{N}\sum_{t}^{T} (\ddot{z}_{it}\psi)' \ddot{z}_{it}\right]$

$r_{it}(\psi) = \left(\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} \ddot{z}_{it}' \ddot{z}_{it}\right)^{-1}\left[(NT)^{-\frac{1}{2}} \ddot{z}_{it}' \ddot{v}_{it}\right]$

The derivation of (1.48) can be found in the Appendix. To estimate it, we can replace the population quantities by their sample analogues, so that

$\hat{B} = \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} \hat{\ddot{a}}_{it}' \hat{\ddot{a}}_{it}$

$\hat{m}_{it} = (\ddot{z}_{it}\hat{\psi})'(\hat{\ddot{e}}_{it} + \hat{\ddot{v}}_{it}\hat{\theta}) - \hat{G} \cdot \hat{r}_{it}(\hat{\psi})\hat{\theta}$

With these quantities calculated, the $(r,s)$-th element of $M$ can be estimated as

$\hat{M}_{rs} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T}\sum_{l=1}^{T} \hat{m}_{it,r}\, \hat{m}_{jl,s}\, K\left[\frac{\rho^*(i,j)}{\rho_b}\right]$

Note that (1.48) also has a sandwich-type form, very similar to the HACSC estimator presented earlier.
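The kernel-weighted middle matrix $\hat{M}$ can be sketched in a few lines. The function names and array layout are illustrative assumptions; the corrected moments $m_{it}$ (already including the first-stage adjustment term) are taken as given:

```python
import numpy as np

def bartlett(d, rho_b):
    """Bartlett kernel: weight 1 - d/rho_b when d < rho_b, zero otherwise."""
    return np.maximum(1.0 - d / rho_b, 0.0)

def hacsc_middle(m, dist, rho_b):
    """Sketch of the middle matrix M of the sandwich form.

    m    : (N, T, k) array of corrected moments m_it
    dist : (N, N) matrix of pairwise distances rho*(i, j)
    rho_b: bandwidth beyond which pairs receive zero weight
    """
    N, T, k = m.shape
    # Sum moments over time first: serial correlation is left unrestricted,
    # since all t, l combinations for a pair (i, j) get the same weight.
    m_bar = m.sum(axis=1)                    # (N, k)
    K = bartlett(dist, rho_b)                # (N, N) kernel weights
    # M_rs = (1/NT) * sum_{i,j} K[i,j] * m_bar[i,r] * m_bar[j,s]
    return (m_bar.T @ K @ m_bar) / (N * T)
```

Because the kernel weight depends only on the pair $(i,j)$, the quadruple sum collapses into a quadratic form in the time-summed moments, which keeps the computation at $O(N^2)$ rather than $O(N^2T^2)$.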
Similarly, the kernel function is also used to operationalize the weak spatial dependence assumption; however, in this case the terms it multiplies ($m_{it}$ instead of $\ddot{z}_{it}' \ddot{u}_{it}$) have a different structure, to take into account the first-stage sampling error.

1.7 Simulations

1.7.1 Design

To test the performance of the HACSC estimator and its CF version, I performed a Monte Carlo study. In this experiment, the units of observation live on a squared regular grid of 20 × 20 and the distance between two adjacent individuals is normalized to one. To evaluate the performance of the estimator, consider the following data generating process:

$y_{it} = \beta_0 + x_{1it}\beta_1 + x_{2it}\beta_2 + x_{1it}x_{2it}\beta_3 + c_i + u_{it}$
$x_{1it} = \delta_0 + \delta_1 z_{1it} + v_{it}$
$c_i = [(I - \rho W)^{-1}]_i C$
$u_{it} = \alpha v_{it} + e_{it}$
$e_t = (I - \rho W)^{-1} a_t$
$a_{it} = \psi a_{i,t-1} + \varepsilon_{it}$
$\mathrm{E}(x_{1it}u_{it}) \neq 0, \quad \mathrm{E}(x_{2it}c_i) \neq 0, \quad \mathrm{E}(z_{1it}c_i) \neq 0$

where $[\beta_0\ \beta_1\ \beta_2\ \beta_3]' = [2\ 0.7\ 0.6\ 0.3]'$ and $\varepsilon_{it}$ and $C$ are independent and identically distributed normal random variables, independent of each other. $z_{1it}$ is an instrument for $x_{1it}$ and $x_{2it}$ is exogenous with respect to the error term $u_{it}$; they follow normal and gamma distributions, respectively. Note that there is an interaction term between the endogenous and exogenous variables, for which we have a readily available instrument, $z_{1it}x_{2it}$. In this setup, the error term $u_{it}$ satisfies the CF assumption, given that it depends linearly on the error term from the reduced-form equation, $v_{it}$. The error terms $e$ and $a$ follow spatial and temporal AR(1) processes, respectively. The strength of the spatial correlation is governed by the parameter $\rho$, while the persistence of the serial correlation is governed by $\psi$. Note also that the individual heterogeneity follows a spatial AR(1) model; however, since I apply the within transformation for the estimation, its DGP will not affect the results.
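A minimal sketch of one draw from this DGP follows; the constants that the text does not pin down ($\delta_0$, $\delta_1$, $\alpha$, and the gamma shape/scale) are set to illustrative values, as noted in the comments:

```python
import numpy as np

def simulate_dgp(W, T=5, rho=0.3, psi=0.7, alpha=0.5,
                 beta=(2.0, 0.7, 0.6, 0.3), seed=0):
    """One Monte Carlo draw from the DGP above (sketch). beta is fixed as in
    the paper; delta0, delta1, alpha and the gamma parameters are assumed."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    S = np.linalg.inv(np.eye(N) - rho * W)    # spatial AR(1) filter

    z1 = rng.normal(size=(N, T))              # instrument for x1
    x2 = rng.gamma(2.0, 1.0, size=(N, T))     # exogenous gamma covariate
    v = rng.normal(size=(N, T))               # reduced-form error
    x1 = 1.0 + 1.0 * z1 + v                   # delta0 = delta1 = 1 (assumed)

    # a_it: temporal AR(1); e_t = (I - rho*W)^{-1} a_t: spatial AR(1)
    a = np.zeros((N, T))
    a_prev = rng.normal(size=N) / np.sqrt(1 - psi**2)  # stationary start
    for t in range(T):
        a_prev = psi * a_prev + rng.normal(size=N)
        a[:, t] = a_prev
    e = S @ a
    u = alpha * v + e                         # x1 endogenous through v

    c = S @ rng.normal(size=N)                # spatially correlated c_i
    b0, b1, b2, b3 = beta
    y = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2 + c[:, None] + u
    return y, x1, x2, z1
```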
For the weighting matrix $W$, I used a rook-type weighting scheme, so that each observation has between two and four pseudo-neighbors, each with equal weight. $W$ is row-normalized to ensure that $(I - \rho W)$ is invertible. I estimated the model using both FE 2SLS and the CF approach with $N = 400$ and $T = 5$, using 1,000 replications. I am interested in comparing the estimates of the coefficients by the two methods to see if there are efficiency gains from using the CF approach. Furthermore, I also want to evaluate the performance of four different estimators of the covariance matrix: the HACSC proposed in this paper, a SHAC assuming no serial correlation, the cluster-robust one, and the "regular" one without any adjustment. In the case of the CF approach, I compare the standard errors presented in Section 1.6, which account for the first stage, and a HACSC that ignores the two-step procedure.

I conducted a simulation for every combination of $\rho \in \{0, 0.3, 0.7\}$ and $\psi \in \{0, 0.3, 0.7\}$. I used the Bartlett kernel to perform the analysis, contrary to Kelejian and Prucha (2007), who used the Parzen kernel. An important parameter in this experiment is the threshold distance $\rho_b$ at which the kernel assigns a zero weight to units that are more than $\rho_b$ apart. Following the recommendation of the authors mentioned above, I set $\rho_b = \lfloor N^{1/4} \rfloor$, i.e. the integer part of $N^{1/4}$. At each replication, I draw a new set of covariates, which is then held fixed across the combinations of the $\rho$ and $\psi$ parameters.

1.7.2 Results

This section describes the results of the simulations using two metrics for the estimated coefficients: the mean and the corresponding standard deviation across the 1,000 replications for different values of $\rho$ and $\psi$.
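The rook-contiguity weighting matrix and the bandwidth rule just described can be sketched as follows (a minimal construction; the paper's own implementation is not shown):

```python
import numpy as np

def rook_weights(side=20):
    """Row-normalized rook-contiguity matrix on a side x side regular grid:
    each cell's neighbors are the cells directly above, below, left, right."""
    N = side * side
    W = np.zeros((N, N))
    for i in range(side):
        for j in range(side):
            k = i * side + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < side and 0 <= nj < side:
                    W[k, ni * side + nj] = 1.0
    # Row-normalize so each pseudo-neighbor receives an equal weight.
    return W / W.sum(axis=1, keepdims=True)

W = rook_weights(20)
N = W.shape[0]
rho_b = int(N ** 0.25)   # bandwidth: integer part of N^(1/4)
```

Corner units end up with two neighbors, edge units with three, and interior units with four, matching the design; for N = 400 the bandwidth rule gives ρ_b = 4.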
Table 1.1 presents the outcomes of this experiment and shows that both estimators provide unbiased estimates of the parameters, in the sense that the average of the estimated coefficients is centered around the true values for any combination of $\rho$ and $\psi$. This is expected, since in this exercise the CF assumption is true.

However, when analyzing the standard deviations, the CF consistently shows a lower value than 2SLS (e.g. 0.049 against 0.084 for $\beta_3$ when $\rho = \psi = 0.3$). Figure 1.3 exemplifies this finding: the distribution of the estimated parameters is tighter around the true value for the CF estimates than for those of 2SLS. Therefore, whenever the CF assumption holds, this estimator appears to be more efficient, which can be explained by the fact that we are using additional information when performing the estimation. Interestingly, these efficiency gains are more evident for $\beta_1$ and $\beta_3$, the coefficients associated with the endogenous variables, whereas for the coefficient of the exogenous covariate, $\beta_2$, the differences between the standard deviations of the two estimators are more modest across all pairs of $\rho$ and $\psi$.

Table 1.1 Average estimated coefficients and standard deviations across the 1,000 replications using a rook-type weighting matrix, N=400 and T=5.
ρ    ψ     β1 CF          β1 2SLS        β2 CF          β2 2SLS        β3 CF          β3 2SLS
0.0  0.0   0.704 (0.196)  0.698 (0.294)  0.606 (0.269)  0.604 (0.283)  0.298 (0.049)  0.300 (0.089)
0.0  0.3   0.696 (0.188)  0.692 (0.269)  0.595 (0.254)  0.593 (0.272)  0.300 (0.047)  0.301 (0.080)
0.0  0.7   0.703 (0.180)  0.705 (0.266)  0.600 (0.250)  0.601 (0.265)  0.300 (0.045)  0.299 (0.079)
0.3  0.0   0.690 (0.207)  0.683 (0.301)  0.589 (0.275)  0.584 (0.299)  0.303 (0.052)  0.305 (0.090)
0.3  0.3   0.706 (0.191)  0.718 (0.281)  0.603 (0.276)  0.608 (0.294)  0.299 (0.049)  0.294 (0.084)
0.3  0.7   0.704 (0.189)  0.697 (0.288)  0.599 (0.254)  0.595 (0.282)  0.299 (0.048)  0.301 (0.085)
0.7  0.0   0.695 (0.236)  0.698 (0.330)  0.573 (0.352)  0.575 (0.371)  0.302 (0.057)  0.301 (0.095)
0.7  0.3   0.691 (0.231)  0.693 (0.334)  0.580 (0.335)  0.581 (0.360)  0.303 (0.054)  0.301 (0.095)
0.7  0.7   0.704 (0.221)  0.693 (0.309)  0.603 (0.317)  0.600 (0.336)  0.300 (0.054)  0.303 (0.089)

To analyze the performance of the HACSC estimator, I use two metrics. First, I take the average of the variances7 estimated for each coefficient for each pair of $\rho$ and $\psi$ across the 1,000 replications and compare it with the "true value", computed as the variance of the set of estimated coefficients for each pair of $\rho$ and $\psi$ across the 1,000 replications. Tables D.1-D.3 present this comparison. The first thing to note in the case of the CF is that both estimated variances, with and without the first-stage correction, are very close to the true value, so at first glance, using this metric, the correction does not seem to make an impact.

7 I used the estimated variances instead of the standard errors because the nonlinearity of the square root function could affect the results.

For the 2SLS estimator, the differences are more substantial. The HACSC estimator is consistently closer to the true value across all pairs of $\rho$ and $\psi$ compared to the SHAC that imposes no serial correlation and the non-robust one. In general, the variance estimated with the HACSC is on average larger than the one computed with these two alternatives.
Admittedly, in this case the cluster-robust variances are also very close to the true value. Overall, these results suggest that making the standard errors robust to spatial correlation at the expense of imposing no serial correlation can result in unreliable inference. Furthermore, as shown in Figure E.1, the HACSC estimator provides standard errors that are, on average, properly centered around the true value.

Figure 1.3 Distribution of coefficients estimated by 2SLS and the Control Function approach for $\rho = 0.3$ and $\psi = 0.7$ using a rook-type weighting matrix.

As a second method to analyze the HACSC in this setup, I tested the null hypothesis $H_0: \beta_3 = 0.3$ at a 5% significance level using a t-test over the 1,000 replications, using the standard errors computed with the different estimators, and obtained the rejection probabilities. Using this metric, an estimator performs better if the rejection probability is closer to 5%. Table 1.2 presents the results of this exercise.8 For the case of the CF approach, the rejection probabilities using the adjustment are slightly closer to 5% than those of the estimator that ignores the first stage, so in this sense the adjustment seems important to obtain more reliable inference if the researcher uses the CF approach. On the other hand, if we use 2SLS to estimate the coefficients, the HACSC rejection probabilities are closer to 5% than those of the SHAC and the non-robust standard errors, which over-reject the null hypothesis. Using this metric, the cluster-robust standard errors seem to perform just as well as the HACSC estimator. Overall, the results suggest that the HACSC estimator, both in the case of 2SLS and in that of the CF approach with the correction, provides more reliable inference than the existing SHAC.
Table 1.2 Rejection probabilities for the null hypothesis $H_0: \beta_3 = 0.3$ at a 5% significance level using a t-test over the 1,000 replications with a rook-type weighting matrix, N=400, T=5.

ρ    ψ     CF      CF_no1   HACSC   SHAC    Cluster   Non-Robust
0.0  0.0   0.050   0.060    0.067   0.088   0.058     0.082
0.0  0.3   0.046   0.060    0.054   0.072   0.046     0.068
0.0  0.7   0.045   0.061    0.050   0.075   0.043     0.072
0.3  0.0   0.045   0.061    0.068   0.096   0.058     0.091
0.3  0.3   0.050   0.064    0.047   0.074   0.040     0.067
0.3  0.7   0.051   0.062    0.068   0.085   0.058     0.080
0.7  0.0   0.050   0.072    0.057   0.077   0.048     0.066
0.7  0.3   0.041   0.057    0.066   0.095   0.065     0.090
0.7  0.7   0.044   0.060    0.056   0.076   0.041     0.073

CF is the HACSC estimator using the first-stage correction and CF_no1 refers to the HACSC estimator ignoring the first-stage estimation in a CF approach.

1.8 Empirical application

To test the performance of the HACSC estimator with real-world data, I revisit the problem of analyzing the effect of spending on the educational outcomes of fourth graders in Michigan studied by Papke and Wooldridge (2008), using district-level data from 1993 to 2001.9 In short, Michigan changed the way schools were funded in 1994, going from a property-tax-based system to a statewide system, which was made possible through an increase in the sales tax and lottery profits.

8 Tables D.4 and D.5 show the results for $H_0: \beta_1 = 0.7$ and $H_0: \beta_2 = 0.6$, respectively.
9 I want to thank Dr. Papke and Dr. Wooldridge for kindly sharing their data set.

To measure the effect of spending on the academic achievement of students, the authors used as the dependent variable the fraction of fourth-graders who passed the math test (math4$_{it}$) of the Michigan Education Assessment Program (MEAP), given that the definition of this subject and the way it is evaluated have remained relatively constant over time. In addition to the current level of spending on a student, the authors also allow for the possibility that the level of spending in the previous three years plays a role in the test scores.
This is indeed a sensible choice, given that one could argue that the previous years of education lay the foundations of the students' learning process. The model also includes the proportion of students eligible for the free and reduced-price lunch program (lunch$_{it}$), the district enrollment (enroll$_{it}$) and time dummies. More details about the full model can be found in Papke (2005). Borrowing their notation, the estimated model is:

math4$_{it}$ = $\theta_t$ + $\beta_1$ log(avgrexp$_{it}$) + $\beta_2$ lunch$_{it}$ + $\beta_3$ log(enroll$_{it}$) + $c_{i1}$ + $u_{it}$  (1.49)

where avgrexp$_{it}$ denotes the simple average of real spending over the current and previous three years. It is important to note that, in addition to the linear probability model (LPM), Papke and Wooldridge (2008) also estimate the model with other nonlinear estimators, but because they find that the LPM is a good approximation to the nonlinear estimates, and since this paper focuses on linear models, I compare the results only with their LPM results.

In order to replicate their results and use the HACSC estimator, we need a distance measure between the school districts. As mentioned in previous sections, this is not a trivial matter when working with geographical units, but in this case I use the geographic distance between the centroids of each district.10 However, there have been changes in the school districts since 2001, which is why I could only use 98.6% of the original sample used by Papke and Wooldridge (2008). The main reason for this is that some districts have merged with others; in these cases, I used the data point of the district that absorbed the one that disappeared. Table 1.3 compares the summary statistics from the original and new data sets, and the t-tests show that there are no statistically significant differences between them.

10 Roughly speaking, a centroid can be interpreted as the center of mass of a geometry.
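Pairwise centroid distances can be obtained, for example, with the haversine (great-circle) formula; this is a common approximation, sketched here with illustrative names, and is not necessarily the exact computation used for the application:

```python
import numpy as np

def distance_matrix_km(lat, lon):
    """Great-circle distances (km) between centroids via the haversine
    formula; lat/lon are arrays in degrees. Centroid coordinates would come
    from the district shapefiles (assumed available)."""
    lat, lon = np.radians(lat), np.radians(lon)
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(h))  # Earth radius ~ 6371 km
```

The resulting symmetric matrix can be passed directly to the kernel as the $\rho^*(i,j)$ distances.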
Table 1.3 Sample means (standard deviations) of the original and new data sets and corresponding t-tests (p-values).

                                             1995                                          2001
                                 Original      New           t-test         Original      New           t-test
Pass rate on fourth-grade
math test                        0.62 (0.13)   0.62 (0.13)   -0.30 (0.76)   0.76 (0.12)   0.76 (0.13)   -0.43 (0.67)
Real expenditure per pupil
(2001$)                          6329 (986)    6317 (978)    0.20 (0.85)    7161 (933)    7147 (916)    0.25 (0.80)
Real foundation grant (2001$)    5962 (1031)   5959 (1035)   0.05 (0.96)    6348 (689)    6347 (692)    0.03 (0.98)
Fraction eligible for free
and reduced lunch                0.28 (0.15)   0.28 (0.15)   0.27 (0.79)    0.31 (0.17)   0.30 (0.17)   0.34 (0.73)
Enrollment                       3076 (8156)   3099 (8210)   -0.04 (0.97)   3078 (7293)   3103 (7341)   -0.05 (0.96)
Number of observations           501           494           -              501           494           -

As a first step, I assume that all the explanatory variables are exogenous with respect to the error term $u_{it}$ and apply the fixed effects estimator, sometimes referred to as "two-way fixed effects" because of the inclusion of the year dummies. Table 1.4 shows the estimates using the new data set alongside the ones reported by Papke and Wooldridge (2008). The coefficient associated with the average real expenditure is virtually the same, whereas the coefficients of lunch and enrollment switch sign relative to the original estimates. Nevertheless, their magnitudes are small and none of them are statistically significant in either estimation.

Table 1.4 Estimates assuming that all the explanatory variables are exogenous.
                  Original results   New results
                  Coefficient        Coefficient
log(avgrexp)      0.372 (0.071)      0.377 (0.071)
lunch             -0.042 (0.073)     0.029 (0.064)
log(enroll)       0.002 (0.049)      -0.020 (0.048)
Observations      501                493

Standard errors for new results with different bandwidth values (km)
                  ρb=1    ρb=100   ρb=200   ρb=300   ρb=400   ρb=500   ρb=600
log(avgrexp)      0.070   0.072    0.067    0.066    0.066    0.063    0.058
lunch             0.064   0.077    0.079    0.072    0.061    0.060    0.061
log(enroll)       0.048   0.045    0.033    0.028    0.026    0.023    0.022

Table 1.4 also shows the standard errors computed with the HACSC estimator using different bandwidth values. As expected, and because the minimum distance between any two school districts in the data set is 1.05 kilometers, when the bandwidth is 1 kilometer the HACSC estimator is effectively treating the observations as if they have no effect on their neighbors (i.e. no spatial correlation), and consequently the standard errors are very similar to the ones computed using an estimator that is robust only to heteroskedasticity and serial correlation. Interestingly, as the bandwidth increases, the standard error of each coefficient behaves differently: for the average spending, it first increases and then decreases; for enrollment, it decreases monotonically; whereas for lunch, there is no evident pattern. Note that this exercise shows that even if the covariance matrix is robust to heteroskedasticity, spatial and serial correlation, this does not mean that the standard errors will necessarily be larger.

One issue with the estimates previously discussed is that the spending of a school district might be endogenous, mainly because a school district might adjust its current spending if it suspects that the (bad) performance of a cohort throughout the year will be reflected in the pass rates of the MEAP test (Papke & Wooldridge, 2008).
Fortunately, the change in the way school districts are funded brought with it a natural instrument: in the 1993/1994 school year, each district started to receive a per-student "foundation grant" based on its initial funding in 1994, which sought to increase the spending per student to a baseline level and had the effect of reducing the differences in spending between the districts across the state of Michigan by 2001 (see Figure 1.4). The details of why this is a suitable instrument are discussed in Papke and Wooldridge (2008), but in broad terms, the identification assumption is that the idiosyncratic error term has a smooth relationship with both the dependent variable and the initial funding, while the foundation grant depended on the initial funding in a non-smooth way [see Table 1 in Papke and Wooldridge (2008)].

As a result of this concern, Papke and Wooldridge (2008) augmented the model by also including the real spending from 1994 interacted with the time dummies, along with the time averages of lunch and enrollment, using as instruments the foundation grant interacted with the year binary variables.

Figure 1.4 Average real expenditure per student across the Michigan school districts in 1995 and 2001.

The new estimated model using instrumental variables is then

math4$_{it}$ = $\theta_t$ + $\beta_1$ log(avgrexp$_{it}$) + $\beta_2$ lunch$_{it}$ + $\beta_3$ log(enroll$_{it}$) + $\beta_{4t}$ log(rexppp$_{i,1994}$) + $\xi_1 \overline{\text{lunch}}_i$ + $\xi_2 \overline{\log(\text{enroll})}_i$ + $v_{it1}$  (1.50)

Note that because we have a single endogenous variable, using Two Stage Least Squares (2SLS) would in this case be numerically the same as estimating the model with the control function approach, and because of this, I used the latter. Table 1.5 shows the estimates from this model and, once again, the coefficients obtained using the new sample are very similar to the ones computed using the original data set.
In particular, the coefficient on spending is considerably larger than the OLS estimate, which can be explained in the context of the local average treatment effect literature, or by the fact that district authorities can decide to increase spending whenever they think the cohort might underperform (Papke & Wooldridge, 2008).

Table 1.5 Estimates assuming that the spending variable is endogenous.

                  Original results   New results
                  Coefficient        Coefficient
log(avgrexp)      0.546 (0.211)      0.555 (0.208)
lunch             -0.062 (0.075)     0.008 (0.067)
log(enroll)       0.046 (0.067)      0.023 (0.066)
v                 -0.421 (0.232)     -0.476 (0.236)
Observations      501                493

Standard errors for new results with different bandwidth values (km)
                  ρb=1    ρb=100   ρb=200   ρb=300   ρb=400   ρb=500   ρb=600
log(avgrexp)      0.221   0.265    0.292    0.253    0.221    0.202    0.187
lunch             0.066   0.077    0.083    0.079    0.070    0.068    0.067
log(enroll)       0.069   0.075    0.079    0.071    0.065    0.058    0.054
v                 0.250   0.349    0.411    0.383    0.365    0.357    0.353

Contrary to the case where all the independent variables were treated as exogenous, the standard errors computed using the HACSC estimator when the bandwidth parameter is set to 1 kilometer are somewhat different from the ones computed using an estimator that is only robust to serial correlation and heteroskedasticity, which is expected because the latter does not take into account the first-stage estimation. Once again, these results show that the standard errors can be larger or smaller depending on the value selected for the bandwidth.

So far I have assumed that there is only spatial correlation in the error term. However, in this scenario there could be spatial spillovers from neighboring units affecting student performance on the math test. Figure 1.4 not only shows that the average real expenditure per student increased between 1995 and 2001 in all the school districts, but it also shows its spatial distribution.
Note that there are districts whose surrounding neighbors have a very similar level of spending; for example, in 1995 the Detroit region shows multiple school districts with higher levels of expenditure compared to the rest of the state. Similarly, in Figure 1.5 the Upper Peninsula shows several neighboring school districts with higher passing rates than the rest of the region.

Figure 1.5 Pass rates on the fourth-grade math test across the Michigan school districts in 1995 and 2001.

Multiple reasons could be behind this pattern. For instance, it could be the case that parents of underperforming students identify school districts that are increasing spending and, throughout the year, move to one of these districts to help their children improve their grades. From the labor side, school districts might need to increase expenditure on teachers' salaries to avoid losing them to other school districts within a reasonable commuting distance. All in all, it seems important to control for spillover effects of expenditure from neighbors, so I augment the models previously estimated with this additional variable,11 and Table 1.6 shows the estimates of this regression assuming that all the independent variables are exogenous with respect to the error term. Note that the coefficient on the average expenditure has decreased significantly, so that an increase of approximately 10% in spending now leads to an increase in the pass rate of about 2.8 percentage points. On the other hand, if the neighboring school districts of unit $i$ increase their expenditure by around 10%, the pass rate in $i$ is expected to improve by around 3.2 percentage points, a larger effect than that of own spending. To address the endogeneity issue, I also augmented model (1.50) with the spending spillover variable using the control function approach,12 and the results are shown in Table 1.7.
11 For this estimation, I used a rook-type weighting matrix.
12 I used $W \cdot$ log(found) to instrument for $W \cdot$ log(avgrexp).

Table 1.6 OLS with extension

                      Coefficient (st. error)
log(avgrexp)          0.281 (0.076)
lunch                 0.030 (0.063)
log(enroll)           -0.008 (0.047)
W · log(avgrexp)      0.324 (0.090)
Number of districts   493

Standard errors with different bandwidth values (km)
                      ρb=1    ρb=100   ρb=200   ρb=300   ρb=400   ρb=500   ρb=600
log(avgrexp)          0.076   0.077    0.071    0.067    0.065    0.061    0.056
lunch                 0.063   0.077    0.082    0.076    0.066    0.064    0.064
log(enroll)           0.047   0.044    0.035    0.030    0.028    0.025    0.024
W · log(avgrexp)      0.088   0.076    0.071    0.057    0.049    0.047    0.047

Once again, in this case the effect of own expenditure is larger than in the exogenous case, but smaller than the original estimate. The spillover effect is significantly reduced, to a marginal increase of around 0.7 percentage points in the pass rate from an increase in the spending of surrounding school districts, and moreover, the coefficient is not statistically significant.

Overall, the difference in the magnitude of the coefficients obtained for the spending of neighboring units makes it difficult to interpret the effect of this variable. However, in both cases it was positive, which supports the hypothesis that parents may move to school districts where the spending per student is higher. Of course, one cannot rule out the possibility that larger spending by neighboring school districts attracts better teachers to the area who are willing to commute; however, more detailed data may be needed to separate these effects.

Regarding the standard errors, most of the results show a pattern: if the bandwidth parameter is too small, they tend to be smaller than those computed with larger values, but at some point they become smaller again. This phenomenon has been documented in the time series literature: for example, Müller (2014) argues that when the bandwidth is too small, the estimate of the covariance matrix is downward biased.
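A toy time-series illustration of this bandwidth sensitivity, assuming a simple AR(1) series rather than the paper's spatial setting: with a very small bandwidth, the Bartlett estimate of the long-run variance is severely downward biased, while giving every autocovariance a weight of one drives the estimate to exactly zero for demeaned residuals, since it then equals the squared sum of the residuals divided by the sample size.

```python
import numpy as np

def bartlett_lrv(e, b):
    """Bartlett (Newey-West) long-run variance estimate with bandwidth b."""
    e = e - e.mean()                 # in-sample residuals average zero
    n = len(e)
    out = e @ e / n                  # lag-0 autocovariance
    for j in range(1, min(b, n - 1) + 1):
        w = 1.0 - j / (b + 1)        # Bartlett weight for lag j
        out += 2 * w * (e[j:] @ e[:-j]) / n
    return out

def lrv_equal_weights(e):
    """All autocovariances weighted one: equals (sum of residuals)^2 / n,
    which is exactly zero once the residuals are demeaned."""
    e = e - e.mean()
    return e.sum() ** 2 / len(e)
```

For an AR(1) with coefficient 0.7 and unit innovation variance, the true long-run variance is 1/(1-0.7)^2 ≈ 11.1, so a bandwidth of 2 grossly understates it, while the equal-weight estimate collapses to zero.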
In the same line, Kiefer and Vogelsang (2005) show that for an AR(1) process, if the bandwidth is too small, the estimator of the long-run variance is biased, whereas if every observation is given a weight of one in the estimation of the covariance matrix, the estimates will tend to zero because the in-sample residuals average zero, which is precisely what is observed in this example as the bandwidth increases beyond some point.

Table 1.7 IV extension

                      Coefficient (st. error)
log(avgrexp)          0.408 (0.231)
lunch                 0.016 (0.067)
log(enroll)           -0.001 (0.067)
W · log(avgrexp)      0.071 (0.057)
v                     -0.249 (0.254)
Number of districts   493

Standard errors for new results with different bandwidth values (km)
                      ρb=1    ρb=100   ρb=200   ρb=300   ρb=400   ρb=500   ρb=600
log(avgrexp)          0.234   0.317    0.361    0.310    0.262    0.234    0.219
lunch                 0.066   0.078    0.087    0.082    0.074    0.072    0.071
log(enroll)           0.068   0.080    0.088    0.079    0.069    0.062    0.058
W · log(avgrexp)      0.056   0.076    0.083    0.077    0.070    0.067    0.065
v                     0.260   0.379    0.435    0.385    0.346    0.328    0.318

1.9 Conclusion

In this paper, I present a simple way to obtain standard errors that are robust to heteroskedasticity and to both serial and spatial correlation in short panels with fixed effects and endogenous covariates. This is important because, to the best of my knowledge, the current SHAC estimators do not explicitly allow for serial correlation in this context (admittedly, the literature does not ignore this issue when $T \to \infty$). The estimator relies on averaging the moment conditions for a single individual across time, which allows one to treat the estimation like a cross-sectional problem without imposing any restrictions on the serial correlation of the residuals. This will help empirical researchers obtain more reliable standard errors in fields such as urban economics or international trade.
The proposed HACSC estimator can be directly applied in a Correlated Random Effects framework to obtain a fully robust Hausman-type test, which can help empirical researchers choose between Fixed Effects and Random Effects specifications. In this paper I also showed that the Mundlak equivalence holds in a particular spatial setting, which allows one to obtain the Fixed Effects coefficients of the time-varying covariates in a Random Effects context. Similarly, the HACSC estimator can be used in an RE estimation procedure whenever the researcher suspects that the structure imposed on the spatial error term might be misspecified.

I also presented a control function approach and the assumptions required to estimate the parameters of the model. Although even in the i.i.d. case it is standard practice to use the bootstrap to obtain the standard errors with this approach, in a spatial setting this is not a trivial procedure, given the dependence between observations. For this reason, I also extended the HACSC estimator to this setup, which requires an adjustment of the covariance matrix to take into account the sampling error of the first-stage estimation.

The Monte Carlo experiment showed that the HACSC estimator works well in the presence of strong or moderate serial and spatial correlation compared to other methods used in the literature, in terms of obtaining unbiased standard errors. As expected, the estimator also shows higher variance than such estimators, especially in settings with low spatial and/or serial correlation. The simulations also showed that if the CF assumptions hold, we can obtain efficiency gains compared to 2SLS.

An avenue for future research is to extend the Monte Carlo experiments in different directions. First, it would be interesting to use different weighting schemes for the weighting matrix $W$, based on distance or a $k$-nearest-neighbor scheme in an irregular lattice, as well as different kernel functions.
Analogous to the time series literature, the threshold for the distance bandwidth almost certainly plays an important role in the finite sample behavior of the estimator, so implementing a data-driven procedure to choose it is also a possibility to explore, particularly when the spatial correlation is strong.

CHAPTER 2

ESTIMATION OF MODELS WITH SPATIAL PANELS AND MISSING OBSERVATIONS IN THE COVARIATES

2.1 Introduction

In recent years, the amount and variety of data available for economic research has increased substantially. Many fields in economics have benefited from this, including areas that focus on spatially related issues such as development, trade, geography and urban economics. Unfortunately, a common issue that empirical researchers have to deal with is missing data, a problem that can arise in multiple ways and which often calls for different methods. One of these is the use of the "complete cases" only; in other words, observations where either the response variable or one of the covariates is missing are dropped from the analysis. The consequences of this will depend on the assumptions and the process that generates the missing data, but regardless of these, discarding observations results in a loss of information.

This problem is more serious in a spatial context, where it is common to include spillover effects from "neighboring" units (i.e. spatial lags) in the model. For example, if we are working with county-level data and the nature of the dependence between the units is a function of the geographical distance between them, the researcher might include the effects of surrounding counties as an additional explanatory variable using a weighting matrix $W$.
However, in this setup, if a unit $i$ is a "neighbor" of $l$ counties and $i$ has a missing data point, a researcher using the complete cases only might need to drop not only observation $i$ but all of its $l$ neighbors as well; therefore, the loss of information in the spatial case is potentially more severe. Furthermore, if we have a panel data set, the problem is aggravated because the missing data can affect both dimensions. This is in fact a common problem in empirical work, because the reason for the missing data could be that the units of observation (e.g. countries) have time series of different lengths (i.e. an unbalanced panel). Given this, a method to impute data would be useful for empirical work, so that the efficiency loss induced by the missing data is mitigated relative to using only the complete cases. This work tries to fill this gap by proposing a new GMM estimation procedure.

The problem of missing data has been a known issue in economic research for a long time. One of the approaches that empirical researchers use to deal with it is dropping the incomplete cases, which induces an efficiency loss, as mentioned previously. In this respect, Kelejian and Prucha (2010) present conditions in a spatial setting under which the missing data can be ignored asymptotically, based on the relative sample sizes of the complete and incomplete observations. They also describe the case where the missing data cannot be ignored and make inference more difficult. In practice, there are alternatives other than just using the complete cases: for example, one might try to complete the sample first and then estimate the model using this "completed" data set. One of the methods documented in the spatial literature was introduced by Lesage and Pace (2004), who used the Expectation-Maximization algorithm to predict the missing values of the dependent variable in the context of real estate housing prices.
In the spatial context, one could also generate the spatial lags using the available data only, in which case the researcher has two options. First, a common practice is to replace the missing data with zeros (Kelejian & Prucha, 2010); nevertheless, this technique does not seem sensible, as having missing data is a very different problem and replacing these data points with zeros will almost certainly lead to biased estimates. The second approach involves constructing the spatial lags using only the available "neighbors", but doing this could generate a misspecification of the weighting matrix and thus probably yield inconsistent estimates, as pointed out by Wang and Lee (2013). More concretely, if a unit $i$ has four "neighbors", each of which has the same weight, and the weighting matrix $W$ is row-normalized, then in theory each pseudo-neighbor should have a weight of $1/4$. However, if the data for one of the pseudo-neighbors is missing, then one would assign a weight of $1/3$ to each available unit, thus misspecifying $W$. In non-spatial settings, the literature has proposed multiple ways to deal with missing data. For instance, Dagenais (1973) proposed a generalized least squares estimator in which the missing variables are approximated using observed covariates. In a similar spirit and in the context of linear models, Gourieroux and Monfort (1981) present a maximum likelihood procedure in which the missing variables are explained by the observed ones. More recently, Dardanoni et al. (2011) suggest a framework with an augmented model to reduce the bias induced by replacing the missing observations with imputed values. Abrevaya and Donald (2017) introduced a GMM framework for linear models in which they exploit moment conditions on the missing observations on the regressors to obtain an estimator that they claim to be more efficient than the other estimators previously mentioned, such as Dagenais'.
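The row-renormalization problem described above (dropping a missing pseudo-neighbor and reweighting the survivors from $1/4$ to $1/3$) can be illustrated with a small numerical sketch; the five-unit configuration and covariate values below are hypothetical:

```python
import numpy as np

# Hypothetical 5-unit example: unit 0 has four equally weighted
# neighbors (units 1-4), so the row-normalized weights are all 1/4.
W_true = np.zeros((5, 5))
W_true[0, 1:] = 0.25

# If unit 4's data are missing and the spatial lag is rebuilt from the
# three available neighbors, each survivor is reweighted to 1/3.
available = np.array([True, True, True, True, False])
w_row = np.where(available[1:], 1.0, 0.0)       # raw weights of units 1-4
W_missp = np.zeros((5, 5))
W_missp[0, 1:] = w_row / w_row.sum()            # renormalize over survivors

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])         # covariate values
true_lag = W_true[0] @ x                        # (1+2+3+4)/4 = 2.5
missp_lag = W_missp[0] @ x                      # (1+2+3)/3 = 2.0
print(true_lag, missp_lag)
```

The rebuilt spatial lag systematically differs from the true one whenever the missing neighbor's covariate differs from the mean of the observed neighbors, which is the sense in which $W$ becomes misspecified.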
Rai (2021) considered the panel data case of their estimator and found that it is more efficient than the fixed effects and correlated random effects (using the Mundlak device) estimators that use only the complete cases. Rai (2023) extended this approach to the case of missing dependent variables and endogenous explanatory covariates, which is useful in cases where the researcher needs to combine data sets from different sources. Going back to a spatial context, Wang and Lee (2013) also suggest three estimation procedures in the context of a missing dependent variable only, in the cross-sectional case. They propose a GMM estimator based on linear moment conditions, a nonlinear least squares estimator and a two-stage least squares estimator with imputation, and compare the asymptotic properties of the three estimators. Wang and Lee (2013) extend the previous estimators to the case of spatial autoregressive panels, using a random effects framework as a baseline and then generalizing it by presenting the spatial Mundlak approach. Note that their work also focuses on the case of a missing dependent variable only. It is important to note that in the non-spatial case, the Mundlak approach falls within the correlated random effects context, a middle ground between the random effects (RE) and fixed effects (FE) estimators. In the first case, the researcher must assume that there is no correlation between the explanatory variables in the model and the individual heterogeneity, whereas in the second case this assumption is relaxed and these terms are allowed to be correlated. In this respect, Mundlak (1978) argues that the RE version is a misspecification of the FE model, as it does not take into account the correlation between the heterogeneity and the regressors. To solve this problem, he proposed an auxiliary equation where the heterogeneity is modeled as a function of the time averages of the independent variables.
By doing this, he shows that if we add these time averages to the main equation and estimate the model by RE, we obtain the same numerical coefficients as if we estimate the model by FE. This equivalence carries over to the unbalanced panel case if we only use the complete cases, as shown by Wooldridge (2019) and by Joshi and Wooldridge (2019) for the case with endogenous covariates. Debarsy (2012) was the first to extend the Mundlak approach to a spatial setting, for a researcher working with a Spatial Durbin Model1 (SDM). Nevertheless, this work does not show the aforementioned equivalence between the RE and FE specifications. Li and Yang (2020) demonstrated that when the error term is modeled structurally2, a very common practice in the spatial literature, the equivalence holds conditional on the value of the parameter(s) associated with the error term; otherwise the FE and RE estimators will generally yield different estimates. In addition, Wu-Chaves (2024) shows that when the error term is not modeled structurally, the equivalence holds if the model is estimated by ordinary least squares (OLS) or two-stage least squares (2SLS). One limitation of the work just described is that it focuses on the case where the data are complete. In this paper, I show that in the case of an unbalanced spatial panel, the CRE equivalence also holds if the researcher uses the complete cases only to estimate the model. In this chapter, I extend the work of Abrevaya and Donald (2017) and Rai (2021) to the case of spatial panels with spillover effects. The rest of the paper is organized as follows. Section 2.2 presents the model. Sections 2.3 and 2.4 state the assumptions and show the construction of the GMM estimator, respectively. Section 2.5 shows the equivalence between FE and RE. Section 2.6 provides Monte-Carlo evidence on the performance of the GMM estimator. Section 2.7 illustrates an empirical application of the estimator and Section 2.8 concludes.
1A SDM includes spatial lags of both the dependent variable and the independent variables on the right-hand side of the equation. 2Note that modeling the error term structurally usually involves MLE or GMM estimation.

2.2 Model

Consider the following model:

$y_{it} = x_{1it}\beta_1 + x_{2it}\beta_2 + W_i X_{1t}\gamma_1 + W_i X_{2t}\gamma_2 + c_i + u_{it} = x_{it}\beta + w_{it}\gamma + c_i + u_{it}, \quad i = 1, \ldots, N, \; t = 1, \ldots, T$ (2.1)

where $y_{it}$ is the response variable, $x_{1it}$ is a $1 \times (k_1 + 1)$ set of exogenous variables that includes an intercept, $x_{2it}$ is a $1 \times k_2$ vector of endogenous covariates (with $k_1 + k_2 = k$), $w_{it} = (w_{1it} \; w_{2it}) = (W_i X_{1t} \; W_i X_{2t})$ with $W_i$ being the $i$-th row of an exogenous, non-random, time-invariant $N \times N$ weighting matrix, $X_{1t}$ and $X_{2t}$ are the $N \times k_1$ and $N \times k_2$ matrices of exogenous and endogenous covariates, respectively, for all observations at time $t$, $c_i$ is the individual heterogeneity and $u_{it}$ is the idiosyncratic error term. In this type of model, the terms $W_i X_{1t}$ and $W_i X_{2t}$ are known as spatial lags and they capture the effect of neighboring units on unit $i$'s outcome3. $(\beta \; \gamma)$ are the parameters of interest and they are of dimension $(k+1) \times 1$ and $k \times 1$, respectively4. In this paper, I treat the error term nonparametrically, so that it might be serially and spatially correlated, but I do not impose any particular structure on it. The sense in which $x_{1it}$ is exogenous is that it is uncorrelated with the error term $u_{it}$ (i.e. $\mathrm{E}(x_{1it}'u_{it}) = 0$). Analogously, the endogeneity of $x_{2it}$ arises from the fact that $\mathrm{E}(x_{2it}'u_{it}) \neq 0$. I also assume that the asymptotics refer to the case where $N \to \infty$ while $T$ remains fixed. Since $x_{2it}$ is endogenous, we need a set of external instruments $z_{2it}$ of dimension $1 \times l$ ($l \geq k_2$) that satisfy the usual requirements of relevance and exogeneity with respect to the error term $u_{it}$, that is, $\mathrm{E}(z_{2it}'u_{it}) = 0$. Naturally, $\mathfrak{Z}_{2it} = W_i Z_{2t}$ can be used as the instrument for $W_i X_{2t}$. For ease of notation, let $a_{it} = (x_{1it} \; x_{2it} \; w_{1it} \; w_{2it})$ and $z_{it} = (x_{1it} \; z_{2it} \; w_{1it} \; \mathfrak{Z}_{2it})$.
Under these assumptions, the set of first stage equations is:

$x_{1it} = x_{1it}\pi_{11} + z_{2it}\pi_{12} + w_{1it}\pi_{13} + \mathfrak{Z}_{2it}\pi_{14} + r_{1it}$
$x_{2it} = x_{1it}\pi_{21} + z_{2it}\pi_{22} + w_{1it}\pi_{23} + \mathfrak{Z}_{2it}\pi_{24} + r_{2it}$
$w_{1it} = x_{1it}\pi_{31} + z_{2it}\pi_{32} + w_{1it}\pi_{33} + \mathfrak{Z}_{2it}\pi_{34} + r_{3it}$
$w_{2it} = x_{1it}\pi_{41} + z_{2it}\pi_{42} + w_{1it}\pi_{43} + \mathfrak{Z}_{2it}\pi_{44} + r_{4it}$ (2.2)

where $\pi_{j1}$, $\pi_{j2}$, $\pi_{j3}$ and $\pi_{j4}$ are vectors of dimensions $(k_1+1) \times (k_1+1)$, $l \times k_1$, $k_1 \times k_1$ and $l \times k_1$, respectively, for $j = 1, 2, 3, 4$. Of course, the relevant equations of (2.2) are the second and fourth lines, as $(x_{1it} \; w_{1it})$ act as their own instruments. Given this, (2.1) and (2.2) can be written more compactly as:

$y_{it} = a_{it}\theta + c_i + u_{it}$ (2.3)
$a_{it} = z_{it}\pi + r_{it}$ (2.4)

where $\theta = (\beta \; \gamma)$. By definition, $\mathrm{E}(z_{it}'r_{it}) = 0$, and because the instruments are relevant, it follows that $\pi^0 \neq 0$. Note that, other than the exogeneity with respect to $z_{it}$, no other assumptions have been imposed on the error terms in (2.3) and (2.4).

3It is common to also include a spatial lag of the outcome variable; however, by doing so the interpretation of the model as a conditional mean function is lost. For this reason, I omit this term in the paper. 4From a modeling perspective, it is not necessary to include all $k$ variables in the spatial lag, so the dimension of $\gamma$ could be smaller.
Furthermore, $y_{it}$ can be expressed in terms of $z_{it}$ as follows:

$y_{it} = z_{it}\pi\theta + c_i + u_{it} + r_{it}\theta = z_{it}\pi\theta + c_i + v_{it}$ (2.5)

The parameters of this model can be consistently estimated by applying Fixed Effects Two Stage Least Squares (FE2SLS) or, equivalently, by applying Pooled 2SLS (P2SLS) to

$\ddot{y}_{it} = \ddot{a}_{it}\theta + \ddot{u}_{it}$ (2.6)

using the instruments $\ddot{z}_{it}$, where $\ddot{y}_{it} = y_{it} - \bar{y}_i$, $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}$ and similar definitions apply to the other variables, provided that the corresponding rank conditions of the relevant matrices hold and

$\mathrm{E}(\ddot{z}_{it}'\ddot{u}_{it}) = 0$ (2.7)

The latter is implied by the following condition:

$\mathrm{E}(u_{it}|Z, C) = 0$ (2.8)

which is a strict exogeneity assumption, where $Z$ is the entire matrix of exogenous variables and $C$ is the whole vector of individual heterogeneities. Note that (2.8) is a stronger condition than the classical strict exogeneity assumption because in this case the idiosyncratic error term at time $t$ is not only uncorrelated with the exogenous variables at any time period, but is also uncorrelated with the covariates of other units, due to the nature of the spatial panel data set and in particular to the presence of the spatial lags. It is also important to emphasize that in this setup, the individual heterogeneity $c_i$ is allowed to be arbitrarily correlated with the elements of $z_{it}$ or the endogenous covariates.

2.3 Missing data mechanism

Before formalizing the missing data scheme, consider the consequences of missing observations in a model with spatial spillovers compared to a situation without such effects. As previously mentioned, a typical strategy when empirical researchers have missing data is to estimate the parameters using only the observations that have a full set of observed variables and to discard the units that are incomplete.
If we have a sample of 49 individuals living in a regular grid, as shown in Figure 2.1, there are no spillover effects, and the researcher has no data on unit 25, then the loss of information is relatively small (around 2% of the sample). On the other hand, if the model contains spillover effects from neighboring units and we are using a queen-type weighting scheme5, then if unit 25 is missing and the researcher decides to use only the complete observations, she would have to disregard unit 25's neighbors too (shown in gray in Figure 2.1) and end up losing almost 20% of the original sample. 5Under this weighting mechanism, a neighbor is a unit that shares an edge or a vertex.

Figure 2.1 Regular grid with unit 25 missing and its neighbors shown in gray.

The previous example shows that missing observations can result in a severe decrease in efficiency when estimating the parameters in a model with spillover effects. To formalize the missing mechanism, let

$s_{it} = 1$ if $x_{2it}$ is observed for unit $i$ at time $t$, and $s_{it} = 0$ otherwise (2.9)

and let $S_t$ be the $N \times N$ diagonal matrix with diagonal elements $s_{it}$. Note that $s_{it}$ indicates that the researcher observes either the full set of endogenous variables or none at all. Furthermore, I am also assuming that the response variable and the exogenous variables $z_{it}$ are always fully observed for all individuals in all time periods. A common practice in empirical work is to ignore the missing data from neighbors in the spatial lag, so implicitly the missing neighbors are being assigned a weight of 0 (Kelejian & Prucha, 2010). This being the case, $W S_t X_t$ would be enough to select units with available self-information but with possibly incomplete data on their spatial lag. However, to select only the complete cases in the spatial lag, a new variable needs to be defined.
To this end, for each $i$ and its $J$ neighbors, let

$\tilde{s}_{it} = s_{it} \cdot \prod_{j=1}^{J} s_{jt}$ (2.10)

so that $\tilde{s}_{it} = 1$ only when the full set of endogenous variables is observed for unit $i$ and its corresponding neighbors. Then define $\tilde{S}_t$ as the diagonal matrix with diagonal elements $\tilde{s}_{it}$, so that $W \tilde{S}_t X_t$ will select only the fully complete cases. As previously mentioned, the researcher can consistently estimate the parameters using FE2SLS with the complete cases if (2.8) holds, at the expense of losing efficiency. More concretely, the estimator can be defined as follows:

$\hat{\theta}_{CFE2SLS} = \left[ \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it} \right) \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it} \right)^{-1} \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{z}_{it}'\ddot{a}_{it} \right) \right]^{-1} \cdot \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it} \right) \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it} \right)^{-1} \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{z}_{it}'\ddot{y}_{it} \right)$ (2.11)

where $\ddot{y}_{it} = y_{it} - \bar{y}_i$, $\bar{y}_i = \frac{1}{T_i}\sum_{q=1}^{T} \tilde{s}_{iq} y_{iq}$ and $T_i = \sum_{q=1}^{T} \tilde{s}_{iq}$, and the rest of the variables are similarly defined. Note that $T_i$ is a random variable, as it is a function of the selection6. In words, for each unit $i$ the time average is computed using only the periods where the observation has a full set of observed variables. To develop an alternative estimator, consider again the within transformation of the variables similar to the one in (2.11), that is, where the averages are computed using only the complete cases. Furthermore, define:

$\dot{x}_{2it} = x_{2it} - \frac{1}{T - T_i}\sum_{t=1}^{T} (1 - \tilde{s}_{it})x_{2it}$ (2.12)

that is, $\dot{x}_{2it}$ is a within transformation where the average is computed using the incomplete cases only, and similar definitions apply to other variables. 6Note that I am implicitly assuming that $\Pr(T_i = 0) = 0$ for all $i$, so that $\hat{\theta}_{CFE2SLS}$ is well defined.
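The complete-case selection variable (2.10) and the grid example of Figure 2.1 can be reproduced with a short sketch; the 7 x 7 queen-contiguity grid below matches the figure, with unit 25 (index 24 in zero-based indexing) missing:

```python
import numpy as np

n_side = 7
N = n_side * n_side

def queen_neighbors(i, n=7):
    """Indices of queen-contiguity neighbors (shared edge or vertex)."""
    r, c = divmod(i, n)
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n:
                out.append(rr * n + cc)
    return out

# s_i = 1 if unit i's own data are observed; unit 25 (index 24) is missing.
s = np.ones(N, dtype=int)
s[24] = 0

# s_tilde_i = s_i * prod_j s_j over i's neighbors, as in (2.10).
s_tilde = np.array([s[i] * s[queen_neighbors(i)].prod() for i in range(N)])

print(N - s_tilde.sum())        # units lost: unit 25 plus its 8 neighbors
print((N - s_tilde.sum()) / N)  # roughly 0.18, almost 20% of the sample
```

Dropping a single interior unit thus removes nine of the 49 observations from the complete-case sample, which is the efficiency loss the text describes.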
Regarding the transformation in (2.12), it is important to point out that it is possible that only one time period for a particular unit $i$ is incomplete, in which case the within transformation (2.12) will remove that observation, as the "average" is taken over a single time period. In such cases, these units are uninformative and are essentially removed from the estimation, and therefore will not help to provide efficiency gains. Note that by applying the within transformation from (2.11) to the main model, the term $c_i$ disappears. The resulting estimating equations are

$\ddot{y}_{it} = \ddot{a}_{it}\theta + \ddot{u}_{it}$ (2.13)
$\ddot{a}_{it} = \ddot{z}_{it}\pi + \ddot{r}_{it}$ (2.14)

By replacing (2.14) in (2.13), we obtain an expression for $y_{it}$ in terms of the always observed variables:

$\ddot{y}_{it} = (\ddot{z}_{it}\pi + \ddot{r}_{it})\theta + \ddot{u}_{it} = \ddot{z}_{it}\pi\theta + \ddot{v}_{it}$ (2.15)

where $\ddot{v}_{it} = \ddot{u}_{it} + \ddot{r}_{it}\theta$. In order to obtain efficiency gains and still get a consistent estimator using fixed effects, consider the following assumption:

Assumption 1
i) $\mathrm{E}(\tilde{s}_{it}\ddot{z}_{it}'\ddot{u}_{it}) = 0$
ii) $\mathrm{E}(\tilde{s}_{it}\ddot{z}_{it}'\ddot{r}_{it}) = 0$
iii) $\mathrm{E}[(1 - \tilde{s}_{it})\dot{z}_{it}'\dot{v}_{it}] = 0$

Part i) of Assumption 1 imposes that $\theta$ is the same in both the complete and incomplete cases, and it is also necessary for the complete cases estimator. The second and third parts of Assumption 1 are the basis of the potential efficiency gains that can be achieved with the proposed estimator. Specifically, ii) states that $\pi$ is the same in both the observed and unobserved samples, whereas iii) (along with i) amounts to saying that the model and the imputation method are the same for the complete and incomplete observations.
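The two demeaning schemes, and the degenerate single-incomplete-period case noted above, can be checked on a toy unit; the values below are hypothetical:

```python
import numpy as np

T = 5
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x_{2it} for one unit, t = 1..5
s_tilde = np.array([1, 1, 1, 1, 0])        # only period 5 is incomplete

Ti = s_tilde.sum()                          # number of complete periods

# Complete-case demeaning, as in (2.11): average over periods with s_tilde = 1.
x2_ddot = x2 - (s_tilde * x2).sum() / Ti

# Incomplete-case demeaning, as in (2.12): average over periods with s_tilde = 0.
x2_dot = x2 - ((1 - s_tilde) * x2).sum() / (T - Ti)

print(x2_dot[4])   # the single incomplete period is wiped out (value 0)
```

With exactly one incomplete period, the "incomplete-case average" equals that period's own value, so the transformed observation is identically zero and the unit contributes nothing to the incomplete-case moments.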
Similarly to the case without missing units, the conditions in Assumption 1 are implied by the following zero conditional mean assumptions:

Assumption 2 i) $\mathrm{E}(u_{it}|Z, S, C) = 0$ and ii) $\mathrm{E}(r_{it}|Z, S, C) = 0$

These are strict exogeneity conditions analogous to the non-spatial case. Note that these are weaker than a missing at random (MAR) mechanism, in which the missingness is allowed to depend on the always observed data (Little & Rubin, 2019). More formally, and borrowing notation from Rai (2023), in this context the data would be considered MAR if $s_{it} \perp (y_{it}, x_{1it}, w_{1it})|Z$ or, equivalently, $s_{it} \perp (u_{it}, r_{1it}, r_{3it})|Z$, where $r_{1it}$ and $r_{3it}$ are the errors related to $x_{1it}$ and $w_{1it}$, respectively, in the first stage. A sense in which Assumption 2 is weaker than MAR is that under the former, the condition would still hold if the selection is a function of $Z$, provided that $\mathrm{E}(r_{it}|Z) = 0$. Both of these assumptions are weaker than the missing completely at random (MCAR) mechanism, where the probability of missingness is independent of the rest of the variables, i.e. $s_{it} \perp (y_{it}, x_{1it}, w_{1it}, z_{it})$.

2.4 GMM estimation

Using equations (2.13), (2.14), (2.15) and Assumption 1, we can create a vector of moment conditions to perform GMM estimation. Let7

$g_{it}(\theta, \pi) = \begin{pmatrix} \tilde{s}_{it}\ddot{z}_{it}'\ddot{u}_{it} \\ \tilde{s}_{it}\ddot{z}_{it}' \otimes \ddot{r}_{it}' \\ (1 - \tilde{s}_{it})\dot{z}_{it}'\dot{v}_{it} \end{pmatrix} = \begin{pmatrix} \tilde{s}_{it}\ddot{z}_{it}'(\ddot{y}_{it} - \ddot{a}_{it}\theta) \\ \tilde{s}_{it}\ddot{z}_{it}' \otimes (\ddot{a}_{it} - \ddot{z}_{it}\pi)' \\ (1 - \tilde{s}_{it})\dot{z}_{it}'(\dot{y}_{it} - \dot{z}_{it}\pi\theta) \end{pmatrix} = \begin{pmatrix} g_{1it}(\theta, \pi) \\ g_{2it}(\theta, \pi) \\ g_{3it}(\theta, \pi) \end{pmatrix}$ (2.16)

Since I am assuming that Assumption 2 holds, it follows that $\mathrm{E}[g(\theta^0, \pi^0)] = 0$, where $(\theta^0, \pi^0)$ is the vector of true population parameters. Note that $g_{1it}(\cdot)$ and $g_{2it}(\cdot)$ use the complete cases, while the $g_{3it}(\cdot)$ moment condition uses the incomplete cases.
Furthermore, $g_{it}(\cdot)$ provides $[2(k_2 + l) + 1][2k + 3]$ moment conditions, while there are $2(2k + 1)(k_2 + l + 1)$ parameters to estimate, which leaves $2(2l + k_2 - k_1) + 1$ overidentifying restrictions. Once again, it is important to note that the potential efficiency gains from the proposed estimator come from imposing that $\pi$ is the same among the observed and unobserved units and that the model and imputation method are the same among those same groups. In order to obtain an efficient GMM estimator, we need to construct an optimal weighting matrix. Let:

$V \equiv \mathrm{E}[g(\theta, \pi)g(\theta, \pi)'] = \mathrm{E}\begin{pmatrix} V_{11} & V_{12} & 0 \\ V_{12}' & V_{22} & 0 \\ 0 & 0 & V_{33} \end{pmatrix}$ (2.17)

where

$V_{11} = \tilde{s}_{it}\ddot{u}_{it}^2\ddot{z}_{it}'\ddot{z}_{it}$
$V_{12} = \tilde{s}_{it}\ddot{u}_{it}\ddot{z}_{it}'(\ddot{z}_{it} \otimes \ddot{r}_{it})$
$V_{22} = \tilde{s}_{it}(\ddot{z}_{it}' \otimes \ddot{r}_{it}')(\ddot{z}_{it} \otimes \ddot{r}_{it})$
$V_{33} = (1 - \tilde{s}_{it})\dot{v}_{it}^2\dot{z}_{it}'\dot{z}_{it}$ (2.18)

In this setup, the sample GMM objective function is:

$\bar{g}(\theta, \pi)'\hat{\Omega}\bar{g}(\theta, \pi)$ (2.19)

where $\bar{g}(\theta, \pi) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} g_{it}(\theta, \pi)$, $\Omega$ is a square, non-random, symmetric and positive semi-definite matrix of order $[2(k_2 + l) + 1][2k + 3]$, and $\hat{\Omega}$ is a consistent estimator of $\Omega$. To obtain $\hat{\Omega}$, we can replace the expectations with sample averages in (2.17) and (2.18), and we can get consistent estimates of $\ddot{u}_{it}$, $\ddot{r}_{it}$ and $\dot{v}_{it}$ by applying GMM to $g_{1it}(\cdot)$ only, $g_{2it}(\cdot)$ only and $g_{3it}(\cdot)$ only, respectively. It is worth pointing out that allowing $\pi$ to differ across $g_{2it}(\cdot)$ and $g_{3it}(\cdot)$ makes these moment functions redundant in the estimation of $\theta$, as pointed out by Ahn and Schmidt (1995) and Rai (2023). The proposed estimator in this paper minimizes (2.19) with respect to $(\theta, \pi)$ using $\hat{\Omega} = \hat{V}^{-1}$ and will be denoted $(\hat{\theta}, \hat{\pi})$. 7Formally, $g_{it}(\cdot)$ is also a function of $(\ddot{y}, \ddot{a}, \ddot{z}, \tilde{S})$, but for notational simplicity I suppress these arguments.
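The two-step logic behind (2.19), a preliminary estimate to obtain residuals, then re-estimation with the optimal weight $\hat{V}^{-1}$, can be illustrated in a much simpler setting than the paper's spatial panel; the sketch below is a generic cross-sectional overidentified linear IV model with hypothetical simulated data, not the proposed estimator itself:

```python
import numpy as np

# Illustrative simulated model: one endogenous regressor x, two instruments z.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=(n, 2))
e = rng.normal(size=n)
x = z @ np.array([1.0, 0.5]) + e
u = 0.6 * e + rng.normal(size=n)          # x is endogenous: corr(x, u) != 0
y = 2.0 * x + u
X, Z = x[:, None], z

# Step 1: preliminary GMM with weight (Z'Z/n)^{-1} (i.e. 2SLS) for residuals.
W1 = np.linalg.inv(Z.T @ Z / n)
theta1 = np.linalg.solve(X.T @ Z @ W1 @ Z.T @ X, X.T @ Z @ W1 @ Z.T @ y)
res = y - X @ theta1

# Step 2: optimal weight V_hat^{-1}, with V_hat = (1/n) sum res_i^2 z_i' z_i.
V_hat = (Z * (res**2)[:, None]).T @ Z / n
W2 = np.linalg.inv(V_hat)
theta2 = np.linalg.solve(X.T @ Z @ W2 @ Z.T @ X, X.T @ Z @ W2 @ Z.T @ y)
print(theta2)   # close to the true coefficient 2.0
```

In the paper's estimator, the same recipe is applied to the stacked moments (2.16), with the block structure of (2.17) replacing the single heteroskedasticity-robust block used here.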
Before stating the asymptotic normality result, we need to present the other component of the covariance matrix of the GMM estimator. First, let

$D \equiv \mathrm{E}[\nabla g(\theta^0, \pi^0)] = \mathrm{E}\begin{pmatrix} D_{11} & 0 \\ 0 & D_{22} \\ D_{31} & D_{32} \end{pmatrix}$ (2.20)

where $\nabla g(\theta^0, \pi^0)$ denotes the matrix of derivatives of $g(\theta, \pi)$ with respect to $[\theta', \mathrm{vec}(\pi)']'$ evaluated at the true population parameters, and where

$D_{11} = -\tilde{s}_{it}\ddot{z}_{it}'\ddot{a}_{it}$
$D_{22} = -\tilde{s}_{it}(\ddot{z}_{it}'\ddot{z}_{it} \otimes e_1, \ldots, \ddot{z}_{it}'\ddot{z}_{it} \otimes e_{2k+1})$
$D_{31} = -(1 - \tilde{s}_{it})\dot{z}_{it}'\dot{z}_{it}\pi^0$
$D_{32} = -(1 - \tilde{s}_{it})\,\theta^{0\prime} \otimes \dot{z}_{it}'\dot{z}_{it}$ (2.21)

In this case, $e_j$ denotes a row vector of zeros of dimension $2k + 1$ with the $j$-th element equal to one. In order to obtain identification, I assume that $\mathrm{rank}(D) = 2(2k + 1)(k_2 + l + 1)$; since $D$ is of dimension $[2(k_2 + l) + 1][2k + 3] \times 2(2k + 1)(k_2 + l + 1)$, this means it has full column rank. Now, given the spatial panel structure considered in this paper, we need to impose some regularity conditions and assumptions on the variables of the model. In particular, I will assume that the conditions specified in Jenish and Prucha (2009) for non-stationary random fields are satisfied; however, in this paper I focus only on those that are most relevant for the empirical researcher.

Assumption 3 The lattice where the units are located is countably infinite and there is a distance measure $\rho(\cdot, \cdot)$ and a distance $\rho_0 > 0$ available to the researcher such that $\rho(i, j) \geq \rho_0$ for any pair of observations $i$ and $j$.

Assumption 4 The random field is $\alpha$-mixing, satisfying the properties outlined by Jenish and Prucha (2009). In practical terms, this means that the degree of dependence between the observations decays as the distance between them increases8.
From a modeling perspective, this implies that the weights specified in $W$, the weighting matrix capturing the spillover effects, for any two observations $i$ and $j$ have to decrease as $\rho(i, j) \to \infty$. Note that this assumption also applies to the selection random variables $s_{it}$, so that the missingness of one unit will not affect the availability of observations that are at a large distance from it. The following assumption is related to the error terms.

Assumption 5 At each time period $t$, the $N \times 1$ vectors of errors are generated as:

$u_t = F_t\varepsilon_t, \quad r_t = M_t\eta_t$ (2.22)

where $\varepsilon_t$ and $\eta_t$ are $N \times 1$ vectors of i.i.d. random variables with mean 0 and unit variance, independent of each other, with $\mathrm{E}(|\varepsilon|^q) < \infty$ and $\mathrm{E}(|\eta|^q) < \infty$ for some $q \geq 4$, and $F_t$ and $M_t$ are $N \times N$ non-singular unknown matrices whose row and column sums are uniformly bounded. This assumption allows for many structures of spatial correlation between the error terms without imposing any restrictions on the time dimension. Assumption 6 states that all the relevant matrices are well behaved.

Assumption 6 The matrix of exogenous variables, $\ddot{z}$, has full column rank and its elements are uniformly bounded in absolute value by a finite constant $0 < c_Z < \infty$. For a fixed and finite $T$, the matrices:

1. $\lim_{N \to \infty} (NT)^{-1}\ddot{z}'\ddot{z} = Q_{zz}$
2. $\lim_{N \to \infty} (NT)^{-1}\ddot{z}'RR'\ddot{z} = Q_{zRRz}$
3. $\mathrm{plim}_{N \to \infty} (NT)^{-1}\ddot{z}'\ddot{z} = Q_{zz}$

are finite and non-singular9. Furthermore, the matrix $\mathrm{plim}_{N \to \infty} (NT)^{-1}\ddot{z}'\ddot{a} = Q_{za}$ has full column rank $2k + 1$. Similarly, the diagonal elements of $W$ are zero and all of its elements are uniformly bounded by a finite constant $0 < c_W < \infty$.

8Recall that we are working with $N \to \infty$ and fixed $T$ asymptotics, so there is no need to impose a weak dependence restriction on the time dimension.

Having stated these conditions, the asymptotic normality result is summarized in the following proposition.

Proposition 1.
Under Assumptions 2-6,

$\sqrt{NT}\left[ \left(\hat{\theta}', \mathrm{vec}(\hat{\pi})'\right)' - \left(\theta^{0\prime}, \mathrm{vec}(\pi^0)'\right)' \right] \xrightarrow{d} N\left[ 0, \left(D'V^{-1}D\right)^{-1} \right]$

Furthermore,

$NT\,\bar{g}(\hat{\theta}, \hat{\pi})'\hat{V}^{-1}\bar{g}(\hat{\theta}, \hat{\pi}) \xrightarrow{d} \chi^2_{2(2l + k_2 - k_1) + 1}$

This result follows directly from the Uniform Law of Large Numbers and the Central Limit Theorem derived by Jenish and Prucha (2009), and therefore I omit the proof. This chi-square statistic is useful to determine whether the overidentifying restrictions (i.e. the moment conditions in Assumption 1 evaluated at the true population parameters) hold. More specifically, this test can help to determine whether the mechanism that generated the missing observations is responsible for a violation of Assumption 1; however, it might not be useful in determining whether the model is misspecified (Rai, 2023). 9Formally, these conditions should also hold for the variables with incomplete observations.

2.5 Correlated Random Effects

2.5.1 The Mundlak Device

When working with panel data, researchers usually have to decide between two main estimators, Random Effects (RE) and Fixed Effects (FE). The former provides efficiency gains over the latter, while the latter is more robust to violations of one of the main assumptions of the RE estimator, namely that the exogenous variables are uncorrelated with the individual heterogeneity, since the FE approach leaves this relationship unrestricted. Mundlak (1978) proposed a middle ground between these by restricting the relationship to a particular functional form, which falls under the Correlated Random Effects (CRE) approach. Consider the following standard linear model without spatial effects:

$y_{it} = x_{it}\beta + c_i + u_{it}$ (2.23)

Mundlak's approach is to model the individual effects $c_i$ as a linear function of the time averages of the covariates:

$c_i = \bar{x}_i\delta + h_i$ (2.24)

where $h_i$ is uncorrelated with $\bar{x}_i$.
By replacing (2.24) in (2.23) we obtain:

$y_{it} = x_{it}\beta + \bar{x}_i\delta + h_i + u_{it} = x_{it}\beta + \bar{x}_i\delta + r_{it}$ (2.25)

It turns out that if (2.25) is estimated either by POLS or RE, the estimate of $\beta$ will be numerically the same as if the FE estimator is used in (2.23), a result attributed to Mundlak (1978). This equivalence has been extended to other contexts: Wooldridge (2019) proved it for the case of unbalanced panels, and Joshi and Wooldridge (2019) showed it for models with unbalanced panels and endogenous variables. In the spatial context, Debarsy (2012) was the first to introduce the Mundlak device, while Li and Yang (2020) discuss some conditions under which the equivalence holds. Wang and Lee (2013) discuss how to implement the Mundlak device in spatial panels with missing data on the dependent variable; however, they do not show the equivalence. In this paper, I show that the equivalence holds for models with missing observations on the endogenous covariates in a spatial panel. To this end, consider again the model (2.1):

$y_{it} = x_{1it}\beta_1 + x_{2it}\beta_2 + W_i X_{1t}\gamma_1 + W_i X_{2t}\gamma_2 + c_i + u_{it} = a_{it}\theta + c_i + u_{it}, \quad i = 1, \ldots, N, \; t = 1, \ldots, T$ (2.1)

where the same definitions and conditions described earlier still apply, including the availability of a set of instruments $(z_{2it} \; \mathfrak{Z}_{2it})$. Consider also the same selection variables and, in particular, the complete cases selection variable $\tilde{s}_{it}$ as defined in (2.10). In this context, the Mundlak approach involves modeling the heterogeneity as a function of the time averages of all the exogenous variables $z_{it} = (x_{1it} \; z_{2it} \; w_{1it} \; \mathfrak{Z}_{2it})$ in (2.1):

$c_i = \bar{z}_i\delta + \eta_i$ (2.26)

and multiplying the equation by the complete cases selection variable to obtain:

$\tilde{s}_{it}y_{it} = \tilde{s}_{it}a_{it}\theta + \tilde{s}_{it}\bar{z}_i\delta + \tilde{s}_{it}\tilde{u}_{it}$ (2.27)

where $\tilde{u}_{it} = u_{it} + \eta_i$. Then we can recover the FE estimates of $\theta$ by applying Pooled 2SLS to (2.27) using the instruments $\tilde{s}_{it}(z_{2it} \; \mathfrak{Z}_{2it})$. This result is summarized in the following proposition:

Proposition 2.
Suppose $\tilde{\theta}$ is the estimate of $\theta$ obtained by applying Pooled 2SLS to equation (2.27). Then $\tilde{\theta} = \hat{\theta}_{CFE2SLS}$, the estimator defined in (2.11).

The proof of Proposition 2 can be found in the Appendix. One of the advantages of the Mundlak device over the FE estimator is that it makes it possible to estimate the effects of variables that do not vary over time. Note, however, that as with the FE estimator, this approach uses only the complete cases, so the researcher can obtain efficiency gains with a GMM estimator that uses the information contained in the incomplete cases. The following subsection describes this procedure.

2.5.2 A GMM approach to CRE with missing data

Instead of applying the within transformation to recover the FE estimates of $\theta$, in this section we construct moment conditions using the Mundlak approach, for which I will use equations (2.3), (2.4) and (2.5). As a first step, we model $c_i$ in (2.3) as:

$c_i = \bar{x}_{1i}\tilde{\theta}_1 + \bar{z}_{2i}\tilde{\theta}_2 + \bar{w}_{1i}\tilde{\theta}_3 + \bar{\mathfrak{Z}}_{2i}\tilde{\theta}_4 + \eta_i = \bar{z}_i\tilde{\theta} + \eta_i$ (2.28)

where the bar over the variables denotes the time average taken only over the observations where $\tilde{s}_{it} = 1$. Here we impose the following condition:

Assumption 7 $\mathrm{E}(\eta_i|Z, S) = 0$

Plugging (2.28) into (2.3) yields:

$y_{it} = a_{it}\theta + \bar{z}_i\tilde{\theta} + \tilde{u}_{it}$ (2.29)

where $\tilde{u}_{it} = u_{it} + \eta_i$. Since the main model has been augmented with this additional set of variables, the first stage equations need to be adjusted to include these exogenous variables. In particular, letting $\acute{z}_{it} = (z_{it} \; \bar{z}_i)$, we now have:

$a_{it} = z_{it}\tilde{\pi}_1^0 + \bar{z}_i\tilde{\pi}_2^0 + \tilde{r}_{it} = \acute{z}_{it}\tilde{\pi}^0 + \tilde{r}_{it}$ (2.30)

where $\mathrm{E}(\acute{z}_{it}'\tilde{r}_{it}) = 0$ holds by definition. Finally, we replace (2.30) in (2.29) to obtain a reduced form for $y_{it}$ in terms of the always observed variables $\acute{z}_{it}$:

$y_{it} = z_{it}\tilde{\pi}_1^0\theta + \bar{z}_i(\tilde{\pi}_2^0\theta + \tilde{\theta}) + \tilde{v}_{it} = z_{it}\mu_1 + \bar{z}_i\mu_2 + \tilde{v}_{it} = \acute{z}_{it}\mu + \tilde{v}_{it}$ (2.31)

where $\tilde{v}_{it} = \tilde{u}_{it} + \tilde{r}_{it}\theta$. Note that, as a consequence of Assumption 7 and $\mathrm{E}(\acute{z}_{it}'\tilde{r}_{it}) = 0$, we have $\mathrm{E}(\acute{z}_{it}'\tilde{v}_{it}) = 0$.
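As a quick numerical illustration of the FE-CRE (Mundlak) equivalence underlying this approach, consider its simplest version: a balanced, fully observed, non-spatial panel with an exogenous regressor, where pooled OLS with the time average added reproduces the within (FE) estimate exactly. The simulated design is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 200, 5
c = rng.normal(size=N)                       # individual heterogeneity
x = rng.normal(size=(N, T)) + c[:, None]     # regressor correlated with c
y = 1.0 + 0.5 * x + c[:, None] + rng.normal(size=(N, T))

# Fixed effects: OLS on within-demeaned data.
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (xd @ yd) / (xd @ xd)

# Mundlak: pooled OLS of y on [1, x, x_bar]; the coefficient on x is beta_fe.
xbar = np.repeat(x.mean(axis=1), T)
Xm = np.column_stack([np.ones(N * T), x.ravel(), xbar])
beta_m = np.linalg.lstsq(Xm, y.ravel(), rcond=None)[0]

print(beta_fe, beta_m[1])   # numerically identical
```

The two estimates agree to machine precision; Proposition 2 extends this kind of equivalence to the spatial, endogenous, complete-cases setting via Pooled 2SLS.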
If we let $\check{\theta} = (\theta' \; \tilde{\theta}')'$, from here we can construct the vector of moment conditions as follows:

$\tilde{g}_{it}(\check{\theta}, \tilde{\pi}) = \begin{pmatrix} \tilde{s}_{it}\acute{z}_{it}'\tilde{u}_{it} \\ \tilde{s}_{it}\acute{z}_{it}' \otimes \tilde{r}_{it}' \\ (1 - \tilde{s}_{it})\acute{z}_{it}'\tilde{v}_{it} \end{pmatrix} = \begin{pmatrix} \tilde{s}_{it}\acute{z}_{it}'(y_{it} - a_{it}\theta - \bar{z}_i\tilde{\theta}) \\ \tilde{s}_{it}\acute{z}_{it}' \otimes (a_{it} - \acute{z}_{it}\tilde{\pi})' \\ (1 - \tilde{s}_{it})\acute{z}_{it}'(y_{it} - \acute{z}_{it}\mu) \end{pmatrix} = \begin{pmatrix} g_{1it}(\check{\theta}, \tilde{\pi}) \\ g_{2it}(\check{\theta}, \tilde{\pi}) \\ g_{3it}(\check{\theta}, \tilde{\pi}) \end{pmatrix}$ (2.32)

From this point, the estimation proceeds as in the previous section, but now we have

$\tilde{V} \equiv \mathrm{E}[\tilde{g}(\check{\theta}, \tilde{\pi})\tilde{g}(\check{\theta}, \tilde{\pi})']$ and $\tilde{D} \equiv \mathrm{E}[\nabla\tilde{g}(\check{\theta}^0, \tilde{\pi}^0)]$ (2.33)

Once again, the efficiency gains from this estimator come from the second and third moment conditions in (2.32), obtained by imposing the same coefficients on both the complete and incomplete sub-populations.

2.6 Simulations

2.6.1 Data generating process

To analyze the performance of the GMM estimator proposed in this paper, I ran a Monte-Carlo study in which I compared it to the complete cases (CC) estimator, the dummy variable method (DVM) and the estimator that uses the data set without any missing observations. Note that although the DVM has been shown to deliver biased results (Jones, 1996), in some simulation studies its performance has been somewhat acceptable, as in Rai (2023). To this end, the benchmark data generating process is as follows:

$y_{it} = \beta_0 + x_{1it}\beta_1 + W_i X_{1t}\beta_2 + x_{2it}\beta_3 + W_i X_{2t}\beta_4 + c_i + u_{it}$ (2.34)

where $(x_{1it} \; x_{2it})$ are scalars; the latter is potentially missing and is endogenous, so that it is correlated with the idiosyncratic error term $u_{it}$. I also generate a variable $z_{2it}$ that serves as an instrument for $x_{2it}$. Naturally, $W_i Z_{2t}$ serves as an instrument for $W_i X_{2t}$. The individual heterogeneity is correlated with $(x_{1it} \; z_{2it})$. The observations live in a regular square grid and the weighting matrix that captures the spillover effects follows a rook-type scheme.
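A minimal Python sketch of a design in the spirit of (2.34) and the rook-contiguity scheme just described is given below; the specific forms of the heterogeneity, the endogeneity of $x_2$ and the selection mechanism are illustrative assumptions, not necessarily the exact ones used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(42)
n_side, T = 30, 5
N = n_side * n_side                 # 900 units on a regular square grid

# Rook-contiguity weighting matrix (shared edges only), row-normalized.
W = np.zeros((N, N))
for i in range(N):
    r, c_ = divmod(i, n_side)
    for rr, cc in ((r - 1, c_), (r + 1, c_), (r, c_ - 1), (r, c_ + 1)):
        if 0 <= rr < n_side and 0 <= cc < n_side:
            W[i, rr * n_side + cc] = 1.0
W /= W.sum(axis=1, keepdims=True)

b0, b1, b2, b3, b4 = 2.0, 1.5, 0.7, 1.2, 0.4
x1 = rng.normal(size=(N, T))
z2 = rng.normal(size=(N, T))        # instrument for x2
u = rng.normal(size=(N, T))
# Heterogeneity correlated with (x1, z2); the 0.3 loading is an assumption.
ci = rng.normal(size=(N, 1)) + 0.3 * (x1.mean(1, keepdims=True)
                                      + z2.mean(1, keepdims=True))
# x2 endogenous: it loads on u, so corr(x2, u) != 0 (illustrative design).
x2 = 0.8 * z2 + 0.5 * u + rng.normal(size=(N, T))
y = b0 + x1 * b1 + (W @ x1) * b2 + x2 * b3 + (W @ x2) * b4 + ci + u

# MCAR selection on the unit's own x2, with p = 0.85 as in the first design.
s = rng.binomial(1, 0.85, size=(N, T))
print(s.mean())                     # share of units with own data observed
```

Row-normalizing $W$ makes each spatial lag an average over the 2-4 rook neighbors, and the complete-case indicator of (2.10) can then be built from `s` and `W` exactly as in Section 2.3.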
The variables $x_{1it}$, $z_{2it}$ and $u_{it}$ follow a standard normal distribution and are independent of each other. The population parameter values are $\beta_0 = 2$, $\beta_1 = 1.5$, $\beta_2 = 0.7$, $\beta_3 = 1.2$ and $\beta_4 = 0.4$. The sample size was $N = 900$, $T = 5$, and the number of Monte-Carlo repetitions was 1000 for each scenario described below. To incorporate the missing data, I used three different mechanisms. In the first one, the data are missing completely at random (MCAR), for which the selection variable follows a binomial distribution with parameter $p = 0.85$. Under this scheme, the average proportion of observations across the 1000 simulated data sets with a complete "own" set of data was 85%, as expected. However, the percentage of units with a complete information set (i.e. both the "own" and the neighbors' information is non-missing) dropped to 53%. In the second design, the data are missing at random (MAR), so that the selection variable is allowed to depend on the always observed variables. In this instance, I allowed the missingness to depend on $x_{1it}$ and designed it so that around 85% of the observations had their own $x_{2it}$ available. In this case, the average proportion of complete cases across the 1000 repetitions was around 51%. As an extension of the first design, the data are again MCAR but the error term follows a spatial autoregressive process of order one (SAR) with parameter $\rho = 0.4$; I repeated this design with a smaller sample size, $N = 400$ and $T = 5$, for a total of 2000 observations. Finally, in the third experiment I allow the data to be MAR again, but in this case the missingness also depends on the individual heterogeneity.

2.6.2 Results

The simulations showed that the proposed GMM behaves well in finite samples and consistently across the different designs.
For example, Table 2.1 shows the average bias, standard deviation and root mean squared error for 𝛽3 and 𝛽4, the coefficients associated with the endogenous and potentially missing variables 𝑥2𝑖𝑡 and 𝑊𝑖 𝑋2𝑡, for the case when the data is MCAR across the 1000 repetitions. The proposed GMM has an average bias just as small as the estimator that uses the complete data, showing that it is indeed a consistent estimator.

Table 2.1 Average bias, standard deviation and root mean squared error for 𝛽3 and 𝛽4 across the 1000 repetitions when the data is MCAR.

                            𝛽3                              𝛽4
                  Bias      S.D.     RMSE       Bias      S.D.     RMSE
Whole data        0.0004    0.0247   0.0247     0.0010    0.0480   0.0480
Complete cases   -0.0016    0.0364   0.0364     0.0013    0.0713   0.0713
Proposed GMM      0.0004    0.0308   0.0308     0.0008    0.0604   0.0603
Dummy variable    0.9800    0.1304   0.9886     0.3211    0.1865   0.3713

More importantly, the standard deviation of the estimated coefficients for the proposed GMM is smaller than that of the estimator that uses the complete cases only, showing that it provides some efficiency gains relative to it. As expected, it is not as efficient as the estimator that uses the whole data set, since that one uses the full set of available information, whereas the proposed estimator might lose some information (e.g., cases where there is only one incomplete time period and therefore the unit becomes uninformative). This is illustrated in Figure 2.2, where the estimator that uses the whole data set has tighter distributions around the true population values, followed by the proposed GMM estimator and, finally, the complete cases estimator, which has more dispersed distributions. The simulations also show that the DVM estimator is inconsistent, as the average bias for the parameters associated with the endogenous variables is substantial, although this appears to be limited to these covariates, as the 𝛽1 and 𝛽2 coefficients seem to be well behaved. The simulations show a very small loss in efficiency when the data is MAR compared to the MCAR case.
Similarly, this loss is also small when the data is MCAR but the error term follows a SAR(1) process, as in the latter case the standard deviations are slightly larger relative to the first two scenarios. Nevertheless, the proposed GMM estimator again shows itself to be more efficient than the complete cases estimator. Of course, if the researcher is confident that the error term follows a SAR(1), she might be able to exploit efficiency gains using alternative estimators that use this information, such as maximum likelihood, at the risk of misspecifying the structure of the data generating process. As expected, when the sample size is smaller the distributions of the coefficients show greater dispersion, but the proposed GMM estimator continues to be well behaved under this scenario, with a small bias and a standard deviation that is smaller than that of the complete cases estimator. Finally, when the missingness is also allowed to depend on the individual heterogeneity, there are no substantial differences in the results: the proposed GMM seems to be consistent and its root mean squared error lies between those of the estimator that uses the whole data set and the complete cases one. This result is somewhat expected, as the within transformation removes the individual heterogeneity from the estimating equations.

Figure 2.2 Distribution of estimated coefficients across the 1,000 Monte-Carlo repetitions when the data is MCAR.

2.7 Empirical Application

In this section, I revisit the problem of analyzing the impact of different variables on crime in the state of North Carolina at the county level between 1981 and 1987. This problem was studied by Cornwell and Trumbull (1994) and by B. Baltagi (2006), who modeled the crime rate as a function of a set of covariates that included deterrent variables and returns to legal opportunities. However, as pointed out by B.
Baltagi (2006), most of the fixed effects estimates presented by Cornwell and Trumbull (1994) turned out to be statistically insignificant; therefore, in this paper I present a simplified version of the model that focuses on the deterrent variables. The original data set used in their estimation contained 90 counties (North Carolina has a total of 100 counties, but their data only contained information for 90) and seven time periods, for a total of 630 observations. Note that their data has no missing observations; therefore, for the purpose of this illustration, the missing variables will be generated artificially so that around 5% of the observations have one of their variables missing. To this end, consider the following model:

crime𝑖𝑡 = 𝛽0 + 𝛽1arrest𝑖𝑡 + 𝛽2conviction𝑖𝑡 + 𝛽3prison𝑖𝑡 + 𝛽4police𝑖𝑡 + 𝛽5avgsent𝑖𝑡 + 𝛽6dens𝑖𝑡 + 𝑐𝑖 + 𝑢𝑖𝑡    (2.35)

where crime is the crime rate (crimes committed per person), arrest is the “probability” of arrest (number of arrests per crime), conviction is the proportion of convictions to arrests, prison is the ratio of sentences that result in jail time to the total number of convictions, police is police per capita, avgsent is the average sentence in days and dens is the number of people living in the county per square mile. Cornwell and Trumbull (1994) argued that both the arrest and police variables are endogenous, for which they proposed two external instrumental variables (IV): tax revenue per capita is the IV for the police covariate, as we would expect higher tax revenues to be associated with larger police forces.
On the other hand, the ratio of crimes that involve face-to-face contact to those that do not (denoted by mix) is the IV for arrests, the rationale being that when a crime is committed in person, identification of the perpetrator is facilitated. Columns 1 and 2 of Table 2.2 show the results of estimating model (2.35) on the complete data set by FE2SLS and by the proposed GMM (PGMM) estimator, respectively. (All the specifications include time dummies. Admittedly, the results from the missing data case depend on which observations are missing; therefore I estimated the model 200 times and at each iteration a different set of observations was missing. The table shows the average estimated coefficients across the 200 repetitions, and the “standard errors” presented for the PGMM columns are the sample standard deviations of the computed coefficients across the iterations.) The first thing to note is that all the coefficients are similar in magnitude and most of them have the expected sign. Indeed, deterrent variables such as arrest, conviction and prison have a negative effect on the crime rate. On the other hand, police and density have a positive impact on the dependent variable, which is expected, as the latter increases the likelihood of offenders finding victims. Moreover, as pointed out by B. Baltagi (2006), there might be simultaneity involved in the relationship between the crime rate, arrests and police. Note, however, that none of the estimates is statistically significant. From a spatial perspective, one could argue that criminal activity in some areas might affect the surrounding counties. For example, if the big cities of North Carolina are more affluent and more densely populated than the rural areas, it could be expected that the former have larger crime rates or arrests. Figures 2.3a and 2.3b show these variables plotted over the counties at the beginning and end of the period of analysis.
They reflect that indeed counties with high (low) proportions of arrests are neighbors of other counties with higher (lower) “probabilities” of arrest. To capture this, I augment model (2.35) by including the spatial lag of the variable arrest. The results are shown in columns 3 and 4 of Table 2.2.

Table 2.2 Results from the estimation (standard errors in parentheses)

              (1)         (2)         (3)         (4)
              FE2SLS      PGMM        FE2SLS      PGMM
Arrest       -0.0202     -0.0224     -0.0182     -0.0175
             (0.0128)    (0.0242)    (0.0169)    (0.0162)
Police        3.7286      4.1822      4.0688      3.9161
             (1.7727)    (2.145)     (4.0708)    (2.6127)
Conviction   -0.0019     -0.0023     -0.0206     -0.0021
             (0.0009)    (0.0013)    (0.0169)    (0.0015)
Prison       -0.0012     -0.0023     -0.0020     -0.0018
             (0.0045)    (0.0049)    (0.0021)    (0.0054)
Sentence      0.0002      0.0004     -0.0012      0.0004
             (0.0002)    (0.0002)    (0.0072)    (0.0003)
Density       0.0039      0.0011      0.0002      0.0006
             (0.0049)    (0.0027)    (0.0004)    (0.0028)
W × Arrest    -           -           0.0046     -0.0151
                                     (0.0063)    (0.0533)

After adding this covariate, none of the estimates for the other variables changes significantly and they remain statistically insignificant at the usual confidence levels. The sign of the coefficient for the spatial lag of arrests is positive for the FE2SLS estimator but negative for the PGMM. One could argue that the expected sign of this variable would be positive because, if the number of arrests in counties neighboring county 𝑖 increases, criminals might move their activities to county 𝑖. However, the empirical evidence does not support this theory, as both estimators find a statistically insignificant coefficient, which coincides with the findings of Cornwell and Trumbull (1994).

Figure 2.3 Maps of the “probability” of arrest (a) and crime rates (b) at the beginning and end of the period of study.

2.8 Conclusion

Missing data is a more serious problem in spatial models with spillover effects because the loss of information is greater if the researcher decides to use only the complete cases.
This paper presented a simple way to exploit the information in incomplete observations in spatial panel data models with potentially missing endogenous explanatory variables. The estimator is presented in a GMM framework that imposes the same coefficients on the complete and incomplete subsamples to obtain a more efficient estimator relative to the fixed effects estimator that only uses the complete cases. An alternative to the FE estimator in panel data is the correlated random effects approach, which restricts the relationship between the unobserved heterogeneity and the explanatory variables. In particular, by using Mundlak’s device the researcher can recover the same numerical FE coefficients for the time-varying variables and also estimate the effects of time-invariant covariates. In this paper, I show that this equivalence carries over to the missing data case with endogenous explanatory variables. In addition to this equivalence, I also present a potentially more efficient GMM estimator that exploits the information in the incomplete cases using the additional restrictions of the Mundlak approach. The simulations show that the proposed GMM estimator behaves well in finite samples, with an average bias very close to that of the estimator that uses the whole set of non-missing data; more importantly, it consistently had a smaller standard deviation across the Monte Carlo study compared to the estimator that uses only the complete cases, which shows that the GMM indeed provides some efficiency gains.

CHAPTER 3
ESTIMATION OF MODELS WITH MULTIPLE FIXED EFFECTS AND ENDOGENOUS VARIABLES: A CORRELATED RANDOM EFFECTS APPROACH

3.1 Introduction

Gravity-type models have been widely used in a variety of economic fields to analyze the flows of goods or services between multiple regions or entities.
The international trade literature has a long tradition of using this type of model to quantify the relationship between bilateral trade flows and other variables such as trade costs and economic integration agreements (Baier et al., 2014), although its use to estimate these relationships can be documented back to 1885 (Kabir et al., 2017). Studies in this area that use the gravity equation include Flach and Unger (2022), Anderson and Van Wincoop (2003) and B. H. Baltagi et al. (2003), but the list of papers is extensive. Furthermore, gravity-type models have also been used to explain migration flows (Beine et al., 2015) and international financial asset outflows (Okawa & Van Wincoop, 2012). Kabir et al. (2017) provide an excellent overview of other areas where the gravity equation has been applied. However, for the remainder of the paper I will focus on the international trade case. The main idea behind gravity models is that the bilateral economic relationship between two entities is proportional to their economic size (e.g., a country’s GDP is often used in the trade literature (Matyas, 1997)) and negatively correlated with their economic or geographical distance. Intuitively this idea is appealing and analogous to Newton’s Universal Law of Gravitation; however, it was recognized that the inclusion of covariates such as policy variables (e.g., border taxes) lacked theoretical justification (Anderson, 1979). Anderson (1979) made a seminal contribution in this direction by presenting a model of commodities differentiated by country of origin and deriving a gravity equation from it. Other papers that also presented theoretical foundations for these models include Krugman (1980), Bergstrand (1985), Eaton and Kortum (2002) and Chaney (2018). Gravity-type models are at least double indexed: in the cross sectional case, one index corresponds to the originating country and the other to the destination country.
If time series data is available, then one of the indices identifies the time dimension instead of the originating country; if the researcher is using panel data, then a third index can be added to the model to identify each of the components previously mentioned. More details about the formulation of a gravity model can be found in Matyas (1997). Although more details will be provided later in the paper, each of these dimensions will have a corresponding term (unobserved heterogeneities, latent variables or “fixed effects”) in the model that captures its effect on the response variable. Depending on the assumptions imposed on these terms, the estimation approach can vary between a random effects (RE) procedure and a fixed effects (FE) estimator (Matyas, 1997). An excellent overview of both the RE and FE approaches with multi-dimensional panels can be found in Matyas (2017, Chapters 1 and 2). As previously mentioned, one of the main differences between the FE and RE estimators is the restriction on the relationship between the explanatory variables and the unobserved heterogeneities that is imposed to achieve consistency. In particular, the FE estimator allows for arbitrary correlation between the latent variables and the covariates, while the RE estimator assumes zero correlation between the covariates and each of the fixed effects. However, the literature has proposed a middle ground between these approaches: the correlated random effects (CRE). For instance, in the one-way panel case, Mundlak (1978) suggested modeling the individual heterogeneity as a linear function of the time averages of the right-hand-side variables, using this auxiliary equation in the main model and estimating the parameters with Pooled Ordinary Least Squares (POLS). By following these steps, he showed that the researcher can obtain the same numerical estimates as the FE estimator for the time-varying covariates.
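Mundlak’s equivalence is easy to verify numerically in the one-way panel. The sketch below (with a made-up data generating process of my own) checks that POLS with the time averages added reproduces the within estimate exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 6
c = rng.standard_normal(N)                       # individual heterogeneity
x = c[:, None] + rng.standard_normal((N, T))     # covariate correlated with c
y = 1.0 + 0.5 * x + c[:, None] + rng.standard_normal((N, T))

# Fixed effects (within) estimator: demean within each unit
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (xd @ yd) / (xd @ xd)

# Mundlak device: POLS of y on [1, x, time average of x]
xbar = np.repeat(x.mean(axis=1), T)              # each unit's mean, repeated T times
X = np.column_stack([np.ones(N * T), x.ravel(), xbar])
beta_mundlak = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]
```

The equality is algebraic (a Frisch-Waugh argument: residualizing x on its unit means leaves exactly the within deviations), so it holds up to floating-point error for any draw of the data.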
It is important to note that this result is an algebraic equivalence that does not depend on the statistical properties of the estimators nor on the conditions assumed to obtain a consistent estimator laid out earlier. This equivalence between the Mundlak device and the FE estimator has been extended to other contexts. For example, Wooldridge (2021) showed it for the case of a two-way panel, Debarsy (2012) was the first to propose it for spatial panels, Joshi and Wooldridge (2019) demonstrated it for the case of unbalanced panels and Yang (2022) proved it for models with multiple fixed effects. It is important to note that Yang (2022) does not allow for correlation between the covariates and the idiosyncratic error term. In this paper, I extend the result by relaxing this assumption and show that the FE estimates can be recovered using two different sets of variables to model the fixed effects. The rest of the paper is organized as follows. Section 3.2 presents the model and its assumptions. Section 3.3 shows how to consistently estimate the model, Section 3.4 introduces the equivalence between the FE and the CRE approaches and Section 3.5 concludes.

3.2 Model

To motivate the use of a FE or RE approach, consider the following linear model with additive heterogeneities, which is common in gravity-type models:

𝑦𝑖𝑗𝑡 = 𝑥1𝑖𝑗𝑡 𝛽1 + 𝑥2𝑖𝑗𝑡 𝛽2 + 𝛼𝑖 + 𝜙𝑗 + 𝛾𝑡 + 𝑢𝑖𝑗𝑡 = 𝑥𝑖𝑗𝑡 𝛽 + 𝑒𝑖𝑗𝑡,   𝑖 = 1, . . . , 𝑁1,  𝑗 = 1, . . . , 𝑁2,  𝑡 = 1, . . . , 𝑇    (3.1)

where 𝑦𝑖𝑗𝑡 is the dependent variable and 𝑥𝑖𝑗𝑡 is a vector of 𝐾 explanatory variables, including a constant. I decompose the error term 𝑒𝑖𝑗𝑡 into four components: 𝛼𝑖 is the individual specific heterogeneity along one of the dimensions of the data (e.g., exporter “fixed effect”), 𝜙𝑗 is the heterogeneity along the other dimension (e.g., importer “fixed effect”), 𝛾𝑡 is the time specific effect and 𝑢𝑖𝑗𝑡 is the idiosyncratic error term.
I divide 𝑥𝑖𝑗𝑡 into two subsets: 𝑥1𝑖𝑗𝑡 are 𝐾1 exogenous variables in the sense that E(𝑥′1𝑖𝑗𝑡𝑢𝑖𝑗𝑡) = 0, and 𝑥2𝑖𝑗𝑡 are 𝐾2 endogenous variables so that E(𝑥′2𝑖𝑗𝑡𝑢𝑖𝑗𝑡) ≠ 0. In light of the endogenous 𝑥2𝑖𝑗𝑡, to obtain consistent estimates of 𝛽 we could construct Hausman-Taylor type instrumental variables; however, I will assume that we have 𝐿 (with 𝐿 ≥ 𝐾2) external instrumental variables available, denoted by 𝑧2𝑖𝑗𝑡, that satisfy the usual relevance [E(𝑧′2𝑖𝑗𝑡𝑥2𝑖𝑗𝑡) ≠ 0] and exogeneity [E(𝑧′2𝑖𝑗𝑡𝑢𝑖𝑗𝑡) = 0] conditions, and let the set of exogenous variables be 𝑧𝑖𝑗𝑡 = (𝑥1𝑖𝑗𝑡 𝑧2𝑖𝑗𝑡). In this paper, I do not consider formal asymptotic analysis, nor do I focus on whether the individual heterogeneities and time effects are parameters to be estimated or should be treated as random variables, since the equivalence derived below using the Mundlak approach is an algebraic result. However, at least one of the indices should go to infinity to obtain a consistent estimator of 𝛽, conditional on not treating the heterogeneities or time effects associated with that index as parameters to be estimated, to avoid the incidental parameters problem. Matyas (2017) has a nice review of the asymptotic properties of fixed effects and random effects estimators for the different cases that can arise in empirical work. Throughout the paper I assume that the data is ordered such that the 𝑖 index is the slowest to change, then 𝑗, and 𝑡 is the fastest. I also assume that all the relevant matrices have full column rank and are therefore invertible. I also maintain the following exogeneity assumption:

E(𝑢𝑖𝑗𝑡 | 𝑧111, 𝑧112, . . . , 𝑧𝑁1𝑁2𝑇, 𝛼𝑖, 𝜙𝑗, 𝛾𝑡) = 0    (3.2)

This is an extension to the three dimensional panel of the strict exogeneity assumption found in the one-way panel data literature.
Note that the equivalence that will be presented in Section 3.4 does not depend on any of the assumptions stated so far; it is an algebraic equivalence that is unrelated to the statistical properties of the estimators presented in the next section.

3.3 Estimation

The estimation approach for the parameters in equation (3.1) will depend on the variables of interest and the assumptions the researcher is willing to make. Suppose the exogenous variables 𝑧2𝑖𝑗𝑡 are uncorrelated with all the individual heterogeneities (𝛼𝑖, 𝜙𝑗 and 𝛾𝑡) and the following conditions are met:

1. The heterogeneities are pairwise uncorrelated.
2. E(𝛼𝑖) = E(𝜙𝑗) = E(𝛾𝑡) = 0.

Furthermore, if we assume

E(𝛼𝑖𝛼𝑖′) = 𝜎²𝛼 if 𝑖 = 𝑖′ and 0 otherwise,
E(𝜙𝑗𝜙𝑗′) = 𝜎²𝜙 if 𝑗 = 𝑗′ and 0 otherwise,
E(𝛾𝑡𝛾𝑡′) = 𝜎²𝛾 if 𝑡 = 𝑡′ and 0 otherwise,

then the structure of the covariance matrix is given by

E(𝑒𝑖𝑗𝑡𝑒𝑖′𝑗′𝑡′) = E[(𝛼𝑖 + 𝜙𝑗 + 𝛾𝑡 + 𝑢𝑖𝑗𝑡)(𝛼𝑖′ + 𝜙𝑗′ + 𝛾𝑡′ + 𝑢𝑖′𝑗′𝑡′)]
  = 𝜎²𝛼                     if 𝑖 = 𝑖′, 𝑗 ≠ 𝑗′, 𝑡 ≠ 𝑡′
  = 𝜎²𝜙                     if 𝑖 ≠ 𝑖′, 𝑗 = 𝑗′, 𝑡 ≠ 𝑡′
  = 𝜎²𝛾                     if 𝑖 ≠ 𝑖′, 𝑗 ≠ 𝑗′, 𝑡 = 𝑡′
  = 𝜎²𝛼 + 𝜎²𝜙               if 𝑖 = 𝑖′, 𝑗 = 𝑗′, 𝑡 ≠ 𝑡′
  = 𝜎²𝛼 + 𝜎²𝛾               if 𝑖 = 𝑖′, 𝑗 ≠ 𝑗′, 𝑡 = 𝑡′
  = 𝜎²𝜙 + 𝜎²𝛾               if 𝑖 ≠ 𝑖′, 𝑗 = 𝑗′, 𝑡 = 𝑡′
  = 𝜎²𝛼 + 𝜎²𝜙 + 𝜎²𝛾 + 𝜎²𝑢   if 𝑖 = 𝑖′, 𝑗 = 𝑗′, 𝑡 = 𝑡′

which translates into the following matrix:

Ω = E(𝑒𝑒′) = 𝜎²𝛼(I𝑁1 ⊗ 𝐽𝑁2𝑇) + 𝜎²𝜙(𝐽𝑁1 ⊗ I𝑁2 ⊗ 𝐽𝑇) + 𝜎²𝛾(𝐽𝑁1𝑁2 ⊗ I𝑇) + 𝜎²𝑢 I𝑁1𝑁2𝑇    (3.3)

where ⊗ represents the Kronecker product and I and 𝐽 denote an identity matrix and a square matrix of ones, respectively, of the size given by their subscript. We can transform the data to obtain an efficient estimator that exploits this information. Indeed, the RE estimator presented in Matyas (2017) can be obtained by applying Pooled Two Stage Least Squares (P2SLS) to the following equation:

Ω^(−1/2)𝑦 = Ω^(−1/2)𝑋𝛽 + Ω^(−1/2)𝑒    (3.4)

using the instruments Ω^(−1/2)𝑧2, where the absence of subscripts indicates that the data has been stacked.
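The Kronecker structure of (3.3) can be assembled directly. The short sketch below (an illustration of mine, not code from the paper) builds Ω for arbitrary dimensions, using the stacking order assumed in the text (𝑖 slowest, then 𝑗, then 𝑡):

```python
import numpy as np

def omega(N1, N2, T, s_a, s_p, s_g, s_u):
    """Covariance matrix (3.3); data stacked with i slowest, then j, then t.
    s_a, s_p, s_g, s_u are the variances of alpha, phi, gamma and u."""
    J = lambda k: np.ones((k, k))          # square matrix of ones
    return (s_a * np.kron(np.eye(N1), J(N2 * T))
            + s_p * np.kron(J(N1), np.kron(np.eye(N2), J(T)))
            + s_g * np.kron(J(N1 * N2), np.eye(T))
            + s_u * np.eye(N1 * N2 * T))
```

Each term contributes its variance to every pair of observations sharing the corresponding index, which reproduces the seven cases listed above.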
Denote the estimated coefficient from this estimation as ˆ𝛽𝑅𝐸2𝑆𝐿𝑆. A few observations are in order. First, the assumptions stated above related to the second moments of the individual heterogeneities are not necessary to obtain a consistent estimator of the parameters. These conditions only determine the specific structure of the matrix Ω, which in turn is used to perform the GLS-type transformation of the data to obtain efficiency gains; the consistency of the estimator hinges on other assumptions. On the other hand, the second moment conditions are important to obtain this particular structure of the covariance matrix, and if they do not hold, inference can be misleading. For this reason, researchers should use a robust covariance matrix to obtain the associated standard errors. Sometimes imposing zero correlation between the exogenous variables and the heterogeneities might be an unrealistic restriction. In these instances, a FE approach is also available, and it has the advantage of leaving the relationship between the exogenous variables and the heterogeneities unrestricted. One way to obtain the FE2SLS estimator is to include dummy variables to account for the different heterogeneities (see Wooldridge (2021) for a description of the Two-Way Fixed Effects estimator). Alternatively, we can apply a transformation to the data to end up with an estimating equation that does not contain the “fixed effects”. To this end, we define the following notation. Let

¯𝑦𝑖·· = (1/𝑁2𝑇) Σ𝑗 Σ𝑡 𝑦𝑖𝑗𝑡    (3.5)

and

¯𝑦·𝑗· = (1/𝑁1𝑇) Σ𝑖 Σ𝑡 𝑦𝑖𝑗𝑡    (3.6)

be the unit specific averages over the remaining dimensions for variable 𝑦. Also define

¯𝑦··𝑡 = (1/𝑁1𝑁2) Σ𝑖 Σ𝑗 𝑦𝑖𝑗𝑡    (3.7)

the cross sectional average for each 𝑡. Let

¯𝑦··· = (1/𝑁1𝑁2𝑇) Σ𝑖 Σ𝑗 Σ𝑡 𝑦𝑖𝑗𝑡    (3.8)

be the overall average.
Note that

¯𝑦··· = (1/𝑁1) Σ𝑖 ¯𝑦𝑖·· = (1/𝑁2) Σ𝑗 ¯𝑦·𝑗· = (1/𝑇) Σ𝑡 ¯𝑦··𝑡

Finally, transform and denote the original data as follows:

ÿ𝑖𝑗𝑡 = 𝑦𝑖𝑗𝑡 − ¯𝑦𝑖·· − ¯𝑦·𝑗· − ¯𝑦··𝑡 + 2¯𝑦···    (3.9)

and other variables can be constructed similarly. This transformation gives rise to the within estimator and removes the heterogeneities and time effects. It was first introduced by Matyas (1997) and its extension to a model with endogenous variables is straightforward. Indeed, the FE estimator of 𝛽, denoted as ˆ𝛽𝐹𝐸2𝑆𝐿𝑆, can be obtained by applying P2SLS to:

ÿ𝑖𝑗𝑡 = ẍ𝑖𝑗𝑡 𝛽 + ë𝑖𝑗𝑡 = ẍ𝑖𝑗𝑡 𝛽 + ü𝑖𝑗𝑡    (3.10)

using the instruments z̈2𝑖𝑗𝑡. A few comments are in order related to the within estimator. First, this is not the only transformation that removes the individual heterogeneities from the estimating equation. As pointed out by Balazsi et al. (2018), the following operation would also remove the heterogeneities and time effects in equation (3.1):

ẏ𝑖𝑗𝑡 = 𝑦𝑖𝑗𝑡 − ¯𝑦𝑖𝑗· − ¯𝑦··𝑡 + ¯𝑦···    (3.11)

The transformation needed to remove the “fixed effects” will vary depending on the structure of the heterogeneities. A second and perhaps more important point related to this operation is that some of the coefficients might not be identifiable, as the associated variables will also be removed by the operation. In particular, variables that are invariant across one of the dimensions will also be removed by the transformation. From an empirical point of view, this is not a trivial issue: for example, in the trade literature and gravity models it is common to include the GDP of the exporter or importer region or some policy variables as covariates, which will be invariant along at least one of the cross sectional dimensions and thus eliminated.
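The three-way within transform (3.9) can be written in a few lines, and one can check directly that it annihilates any additive combination 𝛼𝑖 + 𝜙𝑗 + 𝛾𝑡 (a small check of mine on made-up arrays):

```python
import numpy as np

def within(y):
    """Three-way within transform (3.9); y has shape (N1, N2, T)."""
    return (y - y.mean(axis=(1, 2), keepdims=True)   # ybar_{i..}
              - y.mean(axis=(0, 2), keepdims=True)   # ybar_{.j.}
              - y.mean(axis=(0, 1), keepdims=True)   # ybar_{..t}
              + 2 * y.mean())                        # + 2 * ybar_{...}

rng = np.random.default_rng(0)
# A pure "fixed effects" array: alpha_i + phi_j + gamma_t, broadcast to (4, 5, 6)
effects = (rng.standard_normal((4, 1, 1))
           + rng.standard_normal((1, 5, 1))
           + rng.standard_normal((1, 1, 6)))
```

Applying `within` to `effects` gives a zero array, and adding `effects` to any data array leaves its within transform unchanged, which is exactly why (3.10) is free of the heterogeneities.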
This problem and the efficiency gains give more appeal to the RE estimator over the FE estimator, at the cost of imposing additional assumptions. It is essential to stress that the equivalence presented in the next section is an algebraic result and is not related to other statistical properties of the estimators such as consistency.

3.4 Correlated Random Effects

As noted in the previous section, the FE and RE estimators rely on opposing assumptions about the relationship between the exogenous variables and the individual and time effects. On the one hand, the RE estimator assumes that there is no correlation between the exogenous variables and these unobserved effects, while the FE estimator places no restrictions in this sense. As a result, the usual bias-variance trade-off arises between the two estimators from the imposition and plausibility of this condition. In one-way panels, the literature has proposed a middle ground in which the dependence between the unobserved heterogeneity and the covariates is not zero but is restricted in a specific way. In particular, Mundlak (1978) proposed modeling the individual heterogeneity as a linear function of the time averages of the covariates. Chamberlain (1982) provided a more flexible approach in which the heterogeneity is linearly projected onto the space of the whole history of explanatory variables. One drawback of the latter is that the number of coefficients to be estimated grows linearly as the sample size grows, which can be a greater issue in higher dimensional panels. An interesting fact about the Mundlak device is that, by adding the time averages to the estimating equation, one can recover the same FE estimates whether the equation is estimated by RE or POLS. This result has been extended to the two-way panel: Wooldridge (2021) proves that the two-way FE estimates can be recovered by applying POLS to the main equation with the time and cross sectional averages added as regressors, while B. H.
Baltagi (2023) demonstrates that the GLS-type transformation and POLS are equivalent in this sense. In addition, Yang (2022) extends this equivalence to three-way panels and presents conditions under which a weighted variable addition test is equivalent to the Hausman specification test. More concretely, the linear projections of the individual and time effects in the three-way panel using the Mundlak approach are given by:

L(𝛼𝑖 | 𝑧111, 𝑧112, . . . , 𝑧𝑁1𝑁2𝑇) = ¯𝑧𝑖··𝛿1
L(𝜙𝑗 | 𝑧111, 𝑧112, . . . , 𝑧𝑁1𝑁2𝑇) = ¯𝑧·𝑗·𝛿2
L(𝛾𝑡 | 𝑧111, 𝑧112, . . . , 𝑧𝑁1𝑁2𝑇) = ¯𝑧··𝑡𝛿3    (3.12)

where L(·) denotes the linear projection operator. One aspect these papers have in common is that they show the result for the case in which the explanatory variables are exogenous with respect to the idiosyncratic error term. In this paper, I show that the equivalence carries over when there are endogenous variables on the right hand side of the equation, which can be useful as this situation often arises in empirical work. Once again, it is important to stress that this is an algebraic result that is unrelated to the consistency of the estimators. To fix ideas, it is useful to first re-write the RE transformation from (3.4) in scalar form, which yields the following:

𝜎𝑢 Ω^(−1/2) 𝑦𝑖𝑗𝑡 = ˜𝑦𝑖𝑗𝑡 = 𝑦𝑖𝑗𝑡 − ˜𝜃1 ¯𝑦𝑖·· − ˜𝜃2 ¯𝑦·𝑗· − ˜𝜃3 ¯𝑦··𝑡 + ˜𝜃4 ¯𝑦···    (3.13)

where

˜𝜃1 = 1 − √𝜃1,   ˜𝜃2 = 1 − √𝜃2,   ˜𝜃3 = 1 − √𝜃3,   ˜𝜃4 = 2 − √𝜃1 − √𝜃2 − √𝜃3 + √𝜃4

with

𝜃1 = 𝜎²𝑢 / (𝑁2𝑇𝜎²𝛼 + 𝜎²𝑢),   𝜃2 = 𝜎²𝑢 / (𝑁1𝑇𝜎²𝜙 + 𝜎²𝑢),   𝜃3 = 𝜎²𝑢 / (𝑁1𝑁2𝜎²𝛾 + 𝜎²𝑢),
𝜃4 = 𝜎²𝑢 / (𝑁2𝑇𝜎²𝛼 + 𝑁1𝑇𝜎²𝜙 + 𝑁1𝑁2𝜎²𝛾 + 𝜎²𝑢)

and where we can transform the rest of the variables in a similar way. Therefore, the RE2SLS estimator can once again be obtained by applying Pooled 2SLS to

˜𝑦𝑖𝑗𝑡 = ˜𝑥𝑖𝑗𝑡 𝛽 + ˜𝑒𝑖𝑗𝑡    (3.14)

using the instrumental variables ˜𝑧2𝑖𝑗𝑡. Note that the Pooled 2SLS estimator is a special case of (3.14) obtained by setting ˜𝜃𝑠 = 0 for 𝑠 = 1, 2, 3, 4.
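That (3.13) really is the scalar form of 𝜎𝑢 Ω^(−1/2) can be checked numerically on a small balanced panel. The sketch below is my own verification, with made-up variance components: it builds Ω from averaging operators, takes its inverse square root by eigendecomposition and compares it with the scalar transformation written as a matrix:

```python
import numpy as np

N1, N2, T = 3, 4, 5
s_a, s_p, s_g, s_u = 1.5, 0.8, 0.6, 1.0   # made-up variance components
n = N1 * N2 * T

def Jbar(k):
    """k x k averaging matrix (ones matrix divided by k)."""
    return np.ones((k, k)) / k

# Averaging operators for the four kinds of means (i slowest, then j, then t)
P1 = np.kron(np.eye(N1), Jbar(N2 * T))                  # -> ybar_{i..}
P2 = np.kron(Jbar(N1), np.kron(np.eye(N2), Jbar(T)))    # -> ybar_{.j.}
P3 = np.kron(Jbar(N1 * N2), np.eye(T))                  # -> ybar_{..t}
P0 = Jbar(n)                                            # -> ybar_{...}

# Omega from (3.3), rewritten in terms of the averaging operators
Omega = (s_u * np.eye(n) + N2 * T * s_a * P1
         + N1 * T * s_p * P2 + N1 * N2 * s_g * P3)

# Matrix inverse square root via eigendecomposition
lam, V = np.linalg.eigh(Omega)
Om_isqrt = V @ np.diag(lam ** -0.5) @ V.T

# Scalar-form transformation (3.13) written as a matrix
th = [s_u / (N2 * T * s_a + s_u),
      s_u / (N1 * T * s_p + s_u),
      s_u / (N1 * N2 * s_g + s_u),
      s_u / (N2 * T * s_a + N1 * T * s_p + N1 * N2 * s_g + s_u)]
r1, r2, r3, r4 = (np.sqrt(t) for t in th)
Trans = (np.eye(n) - (1 - r1) * P1 - (1 - r2) * P2 - (1 - r3) * P3
         + (2 - r1 - r2 - r3 + r4) * P0)
```

The two matrices agree because the averaging operators are commuting projections, so Ω has the four eigenvalues 𝜎²𝑢/𝜃𝑠 on mutually orthogonal subspaces.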
To obtain the CRE 2SLS estimator using the Mundlak device, we can apply P2SLS to the following equation:

˜𝑦𝑖𝑗𝑡 = ˜𝑥𝑖𝑗𝑡 𝛽 + ˜¯𝑥1𝑖𝑗𝑡 𝜋 + ˜¯𝑧2𝑖𝑗𝑡 𝛿    (3.15)

using instruments ˜𝑧2𝑖𝑗𝑡, where ˜¯𝑥1𝑖𝑗𝑡 = ( ˜¯𝑥1𝑖·· ˜¯𝑥1·𝑗· ˜¯𝑥1··𝑡) and ˜¯𝑧2𝑖𝑗𝑡 = ( ˜¯𝑧2𝑖·· ˜¯𝑧2·𝑗· ˜¯𝑧2··𝑡). Two observations are in order related to (3.15). First, note that ˜¯𝑥1𝑖·· = (1 − ˜𝜃1) ¯𝑥1𝑖··, ˜¯𝑥1·𝑗· = (1 − ˜𝜃2) ¯𝑥1·𝑗·, ˜¯𝑥1··𝑡 = (1 − ˜𝜃3) ¯𝑥1··𝑡, and similarly for ˜¯𝑧2, so that the averages of the transformed variables do not depend on parameters associated with the other dimensions’ averages. Second, note that we only need to include the averages of the exogenous variables ˜¯𝑥1𝑖𝑗𝑡 and ˜¯𝑧2𝑖𝑗𝑡 and not those of the endogenous variables ˜¯𝑥2𝑖𝑗𝑡. By doing so, the 𝛽 recovered from this estimation, denoted as ˆ𝛽𝑀1, will be numerically the same as ˆ𝛽𝐹𝐸2𝑆𝐿𝑆. This result is summarized in Proposition 1.

Proposition 1. Suppose that all the relevant matrices have full column rank. Let ˆ𝛽𝐹𝐸2𝑆𝐿𝑆 be the coefficient obtained by estimating equation (3.10) by Pooled 2SLS and ˆ𝛽𝑀1 be the coefficient computed from applying Pooled 2SLS to equation (3.15). Then ˆ𝛽𝐹𝐸2𝑆𝐿𝑆 = ˆ𝛽𝑀1.

The proof of this proposition can be found in the Appendix. This result is useful because it allows the researcher to perform a Hausman-type test using a variable addition test. Specifically, the researcher can analyze the significance of the coefficients associated with the averages to decide between a FE and a RE specification. A discussion of this procedure can be found in Joshi and Wooldridge (2019). As Matyas (2017) notes, there can be many causes of endogeneity in three-way panels, which might require at least as many instruments for each of these sources. In order to obtain the CRE equivalence, the researcher has to include all of their averages in the estimating equation, which can consume an important number of degrees of freedom and can be costly in finite samples when conducting inference.
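The flavor of Proposition 1 is easy to confirm numerically. For brevity, the sketch below (my own illustration, not taken from the paper) checks the analogous one-way equivalence with a made-up DGP: FE2SLS on within-demeaned data coincides with pooled 2SLS that adds the unit averages of the exogenous variable and the instrument as controls; in the three-way case the time averages are replaced by the three sets of averages.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 300, 5
c = rng.standard_normal(N)                        # heterogeneity
x1 = c[:, None] + rng.standard_normal((N, T))     # exogenous covariate
z2 = c[:, None] + rng.standard_normal((N, T))     # external instrument
u = rng.standard_normal((N, T))
x2 = 0.8 * z2 + 0.5 * u + rng.standard_normal((N, T))   # endogenous covariate
y = 1.0 + 0.6 * x1 + 1.2 * x2 + c[:, None] + u

def tsls(y, X, Z):
    """2SLS: project X on Z, then regress y on the fitted values."""
    Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(Xhat, y, rcond=None)[0]

dm = lambda a: (a - a.mean(axis=1, keepdims=True)).ravel()   # within transform
avg = lambda a: np.repeat(a.mean(axis=1), T)                 # unit averages

# FE2SLS: within-demean everything, instrument x2 with z2
bFE = tsls(dm(y), np.column_stack([dm(x1), dm(x2)]),
           np.column_stack([dm(x1), dm(z2)]))

# CRE/Mundlak 2SLS: pooled, with averages of x1 and z2 (not x2) added
W = np.column_stack([np.ones(N * T), avg(x1), avg(z2)])
X = np.column_stack([x1.ravel(), x2.ravel(), W])
Z = np.column_stack([x1.ravel(), z2.ravel(), W])
bCRE = tsls(y.ravel(), X, Z)
```

Partialling the averages out of the instruments leaves exactly the within deviations, which is why no average of the endogenous 𝑥2 is needed.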
Fortunately, we can recover the FE estimates by adding a different set of variables. If we let ˆ𝑥2𝑖𝑗𝑡 denote the first stage predicted values of the endogenous variables, then applying Pooled 2SLS to

˜𝑦𝑖𝑗𝑡 = ˜𝑥1𝑖𝑗𝑡 𝛽1 + ˆ˜𝑥2𝑖𝑗𝑡 𝛽2 + ˜¯𝑥1𝑖𝑗𝑡 𝜋1 + ˆ˜¯𝑥2𝑖𝑗𝑡 𝜋2 = ˆ˜𝑥𝑖𝑗𝑡 𝛽 + ˆ˜¯𝑥𝑖𝑗𝑡 𝜋    (3.16)

using the instruments ( ˜𝑧2𝑖𝑗𝑡 ˜¯𝑧2𝑖𝑗𝑡) will also yield the same 𝛽 as FE2SLS. Proposition 2 formally states the equivalence.

Proposition 2. Suppose that all the relevant matrices have full column rank. Let ˆ𝛽𝐹𝐸2𝑆𝐿𝑆 be the coefficient obtained by estimating equation (3.10) by Pooled 2SLS and ˆ𝛽𝑀2 be the coefficient computed from applying Pooled 2SLS to equation (3.16). Then ˆ𝛽𝐹𝐸2𝑆𝐿𝑆 = ˆ𝛽𝑀2.

As noted previously, the advantage of using this set of variables instead of the instruments is that it preserves degrees of freedom when we have more than one instrument for each endogenous covariate. An important feature of the CRE approach using the Mundlak device is that it allows the researcher to estimate the effect of variables that are constant across one of the dimensions of the panel, something that cannot be done using the within estimator, as its transformation wipes out any variable of this nature. In fact, Wooldridge (2021) proves in the two-way panel that adding additional variables that only vary across one of the dimensions will not change the FE estimates, a result that most likely carries over to the three-way panel. This result makes intuitive sense, as the within estimator is supposed to remove these variables, but it also shows that adding the averages (either of the exogenous variables or of the predicted values as in Proposition 2) is enough to control for the individual and time effects.

3.5 Conclusion

In this paper, I establish the algebraic equivalence between FE2SLS and RE2SLS in three-way panels with additive unobserved heterogeneities in the presence of endogenous variables using the Mundlak device.
Namely, including either the averages of the exogenous variables or the means of the predicted values of the explanatory covariates across all the different dimensions is enough to control for the unobserved heterogeneities and to recover the FE2SLS estimates. The first approach has the disadvantage that if there are multiple instruments available, the degrees of freedom can be reduced considerably, an issue that is more severe in finite samples. The use of Mundlak's device also allows us to relax the assumption of no correlation between the covariates and the unobserved heterogeneities, which allows the researcher to obtain more robust estimates of the coefficients. Furthermore, this result offers researchers a flexible and easy-to-implement way to choose between an FE and an RE specification. In particular, Yang (2022) shows that a modified variable addition test on the averages is equivalent to the Hausman-type test, with the additional advantage that the former can be made robust to heteroskedasticity and serial correlation. One limitation of the result shown in this paper is that the algebraic equivalence is likely to break down under other structures of heterogeneity. For example, Yang (2022) argues that if the cross-sectional heterogeneities are time varying, then the result no longer holds in the case of exogenous variables, a result that most likely carries over in the presence of endogenous covariates. However, future research could extend the result to more general models of unobserved heterogeneity.

BIBLIOGRAPHY

Abrevaya, J., & Donald, S. G. (2017). A GMM approach for dealing with missing data on regressors. The Review of Economics and Statistics, 99(4), 657–662.
Ahn, S. C., & Schmidt, P. (1995). Efficient estimation of models for dynamic panel data. Journal of Econometrics, 68(1), 5–27.
Amemiya, T. (1985). Advanced econometrics. Harvard University Press.
Anderson, J. (1979). A theoretical foundation for the gravity equation.
The American Economic Review, 69(1), 106–116.
Anderson, J., & Van Wincoop, E. (2003). Gravity with gravitas: A solution to the border puzzle. The American Economic Review, 93(1), 170–192.
Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics, 49(4), 431–434.
Baier, S. L., Bergstrand, J., & Feng, M. (2014). Economic integration agreements and the margins of international trade. Journal of International Economics, 93(2), 339–350.
Balazsi, L., Matyas, L., & Wansbeek, T. (2018). The estimation of multidimensional fixed effects panel data models. Econometric Reviews, 37(3), 212–227.
Baltagi, B. (2006). Estimating an economic model of crime using panel data from North Carolina. Journal of Applied Econometrics, 21, 543–547.
Baltagi, B., & Liu, L. (2011). Instrumental variable estimation of a spatial autoregressive panel model with random effects. Economics Letters, 111, 135–137.
Baltagi, B. H. (2023). The two-way Mundlak estimator (Working Paper No. 256). Center for Policy Research.
Baltagi, B. H., Egger, P., & Pfaffermayr, M. (2003). A generalized design for bilateral trade flow models. Economics Letters, 80(3), 391–397.
Basile, R. (2009). Productivity polarization across regions in Europe: The role of nonlinearities and spatial dependence. International Regional Science Review, 92–115.
Basile, R., Durban, M., Minguez, R., Montero, J. M., & Mur, J. (2014). Modeling regional economic dynamics: Spatial dependence, spatial heterogeneity and nonlinearities. Journal of Economic Dynamics and Control, 229–245.
Beine, M., Bertoli, S., & Fernandez-Huertas-Moraga, J. (2015). A practitioners' guide to gravity models of international migration. The World Economy, 39, 496–512.
Bergstrand, J. H. (1985). The gravity equation in international trade: Some microeconomic foundations and empirical evidence. The Review of Economics and Statistics, 67(3), 474–481.
Bester, C.
A., Conley, T., Hansen, C., & Vogelsang, T. (2016). Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators. Econometric Theory, 32(1), 154–186.
Bester, C. A., Conley, T. G., & Hansen, C. B. (2011). Inference with dependent data using cluster covariance estimators. Journal of Econometrics, 165(2), 137–151.
Blundell, R., & Powell, J. (2003). Endogeneity in nonparametric and semiparametric regression models. Econometric Society Monographs, 36, 312–357.
Cameron, C., & Trivedi, P. (2005). Microeconometrics: Methods and applications. Cambridge University Press.
Chamberlain, G. (1982). Multivariate regression models for panel data. Journal of Econometrics, 18(1), 5–46.
Chaney, T. (2018). The gravity equation in international trade: An explanation. Journal of Political Economy, 126(1), 150–177.
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45.
Conley, T. G., & Molinari, F. (2007). Spatial correlation robust inference with errors in location or distance. Journal of Econometrics, 140, 76–96.
Cornwell, C., & Trumbull, W. N. (1994). Estimating the economic model of crime with panel data. The Review of Economics and Statistics, 76(2), 360–366.
Dagenais, M. (1973). The use of incomplete observations in multiple regression analysis. Journal of Econometrics, 1(4), 317–328.
Dardanoni, V., Modica, S., & Peracchi, F. (2011). Regression with imputed covariates: A generalized missing-indicator approach. Journal of Econometrics, 162(2), 362–368.
Debarsy, N. (2012). The Mundlak approach in the spatial Durbin panel data model. Spatial Economic Analysis, 7(1), 109–131.
Driscoll, J. C., & Kraay, A. C. (1998). Consistent covariance matrix estimation with spatially dependent panel data. The Review of Economics and Statistics, 80(4), 549–560.
Eaton, J., & Kortum, S. (2002). Technology, geography, and trade. Econometrica, 70(5), 1741–1779.
Flach, L., & Unger, F. (2022).
Quality and gravity in international trade. Journal of International Economics, 137, 103578.
Gourieroux, C., & Monfort, A. (1981). On the problem of missing data in linear models. The Review of Economic Studies, 48(4), 579–586.
Greene, W. (2007). Econometric analysis (7th ed.). Prentice Hall.
Jones, M. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433), 222–230.
Joshi, R., & Wooldridge, J. M. (2019). Correlated random effects models with endogenous explanatory variables and unbalanced panels. Annals of Economics and Statistics, (134).
Kabir, M., Salim, R., & Al-Mawali, N. (2017). The gravity model and trade flows: Recent developments in econometric modeling and empirical evidence. Economic Analysis and Policy, 56, 60–71.
Kapoor, M., Kelejian, H., & Prucha, I. R. (2007). Panel data models with spatially correlated error components. Journal of Econometrics, 140(1), 97–130.
Kelejian, H., & Prucha, I. (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. The Journal of Real Estate Finance and Economics, 17(1), 99–121.
Kelejian, H., & Prucha, I. (2007). HAC estimation in a spatial framework. Journal of Econometrics, 140(1), 131–154.
Kelejian, H., & Prucha, I. (2010). Spatial models with spatially lagged dependent variables and incomplete data. Journal of Geographical Systems, 12, 241–257.
Kelejian, H., Prucha, I. R., & Yuzefovich, Y. (2004). Instrumental variable estimation of a spatial autoregressive model with autoregressive disturbances: Large and small sample results. In Spatial and spatiotemporal econometrics. Emerald Group Publishing Limited.
Kiefer, N. M., & Vogelsang, T. J. (2005). A new asymptotic theory for heteroskedasticity-autocorrelation robust tests. Econometric Theory, 21(6), 1130–1164.
Kim, M. S., & Sun, Y. (2011).
Spatial heteroskedasticity and autocorrelation consistent estimation of covariance matrix. Journal of Econometrics, 160, 346–371.
Kim, M. S., & Sun, Y. (2013). Heteroskedasticity and spatiotemporal dependence robust inference for linear panel models with fixed effects. Journal of Econometrics, 177, 85–108.
Krugman, P. (1980). Scale economies, product differentiation, and the pattern of trade. American Economic Review, 70(5), 950–959.
Künsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics, 17(3), 1217–1241.
Lee, L.-F. (2003). Best spatial two-stage least squares estimators for a spatial autoregressive model with autoregressive disturbances. Econometric Reviews, 22(4), 307–335.
LeSage, J., & Pace, R. K. (2004). Models for spatially dependent missing data. Journal of Real Estate Finance and Economics, 29(2), 233–254.
LeSage, J., & Pace, R. K. (2009). Introduction to spatial econometrics. CRC Press.
Li, L., & Yang, Z. (2020). Spatial dynamic panel data models with correlated random effects. Journal of Econometrics.
Little, R., & Rubin, D. (2019). Statistical analysis with missing data (3rd ed.). Wiley.
Matyas, L. (1997). Proper econometric specification of the gravity model. The World Economy, 20, 363–368.
Matyas, L. (Ed.). (2017). The econometrics of multi-dimensional panels: Theory and applications. Springer.
McMillen, D. P. (1996). One hundred fifty years of land values in Chicago: A nonparametric approach. Journal of Urban Economics, 40, 100–124.
Müller, U. K. (2014). HAC corrections for strongly autocorrelated time series. Journal of Business and Economic Statistics, 32(3), 311–321.
Müller, U. K., & Watson, M. W. (2022a). Spatial correlation robust inference. Econometrica, 90(6), 2901–2935.
Müller, U. K., & Watson, M. W. (2022b). Spatial correlation robust inference in linear regression and panel models. Journal of Business and Economic Statistics, 1–15.
Mundlak, Y. (1978).
On the pooling of time series and cross section data. Econometrica, 46(1), 69–85.
Mutl, J., & Pfaffermayr, M. (2010). The Hausman test in a Cliff and Ord panel model. Econometrics Journal, 10, 1–30.
Jenish, N., & Prucha, I. R. (2009). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics, 150, 86–98.
Newey, W. K., & West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703–708.
Okawa, Y., & Van Wincoop, E. (2012). Gravity in international finance. Journal of International Economics, 87(2), 205–215.
Papke, L. E. (2005). The effects of spending on test pass rates: Evidence from Michigan. Journal of Public Economics, 89(5-6), 821–839.
Papke, L. E., & Wooldridge, J. M. (2008). Panel data methods for fractional response variables with an application to test pass rates. Journal of Econometrics, 145(1-2), 121–133.
Politis, D. N., & White, H. (2004). Automatic block-length selection for the dependent bootstrap. Econometric Reviews, 23(1), 53–70.
Rai, B. (2021). Efficient estimation with missing values in cross section and panel data [Doctoral dissertation, Michigan State University].
Rai, B. (2023). Efficient estimation with missing data and endogeneity. Econometric Reviews, 42(2), 220–239.
Vogelsang, T. (2012). Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear panel models with fixed-effects. Journal of Econometrics, 166, 303–319.
Wang, W., & Lee, L.-F. (2013). Estimation of spatial autoregressive models with randomly missing data in the dependent variable. The Econometrics Journal, 16, 73–102.
Wang, W., & Lee, L.-F. (2013). Estimation of spatial panel data models with randomly missing data in the dependent variable. Regional Science and Urban Economics, 43, 521–538.
Wheeler, D., & Tiefelsdorf, M. (2005).
Multicollinearity and correlation among local regression coefficients in geographically weighted regression. Journal of Geographical Systems, 7(2), 161–187.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.). MIT Press.
Wooldridge, J. M. (2019). Correlated random effects models with unbalanced panels. Journal of Econometrics, 211, 137–150.
Wooldridge, J. M. (2021). Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. SSRN Electronic Journal.
Wu-Chaves, S. (2024). Essays in spatial panel data econometrics [Unpublished doctoral dissertation]. Michigan State University.
Yang, Y. (2022). A correlated random effects approach to the estimation of models with multiple fixed effects. Economics Letters, 213, 110408.

APPENDIX A

ADDITIONAL ASSUMPTIONS AND DEFINITIONS FOR CHAPTER 1

Assumption 1. The functions $g_i(\cdot, \theta)$ satisfy the following conditions:

1. $g_i(\cdot, \theta)$ are Borel measurable on $\mathcal{Z}$, the $\sigma$-algebra generated by $Z$, for all $\theta \in \Theta$.
2. $\sup_N \sup_{i \in D_N} E[|g_i(Z_{i,N}, \theta)|^{2+\eta}] < \infty$ for all $\theta \in \Theta$ and some $\eta > 0$.

Assumption 2. The $g(\cdot, \cdot)$ satisfy the following conditions:

1. For some $p \geq 1$,
$$\limsup_{N \to \infty} \frac{1}{|D_N|} \sum_{i \in D_N} E\big[ d_{i,N}^{p}\, \mathbf{1}(d_{i,N}^{p} > k) \big] \to 0 \quad \text{as } k \to \infty,$$
where $d_{i,N} = \sup_{\theta \in \Theta} |g_{i,N}(Z_{i,N}, \theta)|$.
2. $g_i(Z_{i,N}, \theta)$ are $L_0$ stochastically equicontinuous.

Assumption 3. The true parameter $\theta_0$ and the $g_i(\cdot, \cdot)$ satisfy the following conditions:

1. $\theta_0 \in \operatorname{int}(\Theta)$.
2. $g_i(Z_i, \cdot)$ is continuously differentiable on the interior of $\Theta$.
3. $|\nabla_\theta g_i(Z_i, \theta)| < \infty$, where $\nabla_\theta$ denotes the gradient of $g_i(Z_i, \theta)$ with respect to the parameter vector $\theta$.
4. $\nabla_\theta g_i(Z_i, \theta)$ is Borel measurable, $E[\nabla_\theta g_i(Z_i, \theta)]$ exists, and $\operatorname{rank}\{E[\nabla_\theta g_i(A_i, \theta)]\} = P$, where $P = \dim(\theta_0)$.
5. $E[|g_i(Z_i, \theta_0)|^{2+\epsilon}] < \infty$ for some $\epsilon > 0$.
Assumption 4. There exist finite-dimensional vectors $m_i$ and $\Delta$ such that $\hat{u}_i - u_i = m_i\Delta$ with
$$\frac{1}{N}\sum_{i=1}^{N}\|z_i\|^2 = O_p(1) \quad \text{and} \quad N^{1/2}\|\Delta\| = O_p(1).$$

Definitions

$\alpha$-mixing for random fields. Let $D_N$ be a subset of $D$. For $U \subseteq D_N$ and $V \subseteq D_N$, let $\sigma_N(U) = \sigma(X_{i,N} : i \in U)$ and $\alpha_N(U, V) = \alpha(\sigma_N(U), \sigma_N(V))$. The $\alpha$-mixing coefficients for the random field $\{X_{i,N} : i \in D_N, N \in \mathbb{N}\}$ are then defined as
$$\alpha_{k,l,N}(r) = \sup\{\alpha_N(U, V) : |U| \le k,\ |V| \le l,\ \rho(U, V) \ge r\}$$
for $k, l, r, N \in \mathbb{N}$. Define also $\bar{\alpha}_{k,l}(r) = \sup_N \alpha_{k,l,N}(r)$.

Upper tail quantile function. Let $X$ be a random variable. The upper quantile function $Q_X : (0,1) \to [0,\infty)$ is defined as
$$Q_X(u) = \inf\{t : P(X > t) \le u\}.$$

"Inverse" function of the mixing coefficients. For the non-increasing sequence of mixing coefficients $\{\bar{\alpha}_{1,1}(m)\}_{m=1}^{\infty}$, set $\bar{\alpha}_{1,1}(0) = 1$ and define its "inverse" function $\alpha_{\mathrm{inv}} : (0,1) \to \mathbb{N} \cup \{0\}$ as
$$\alpha_{\mathrm{inv}}(u) = \max\{m \ge 0 : \bar{\alpha}_{1,1}(m) > u\}.$$

Stochastic equicontinuity. The array of random functions $\{f_{i,N}(Z_{i,N}, \theta) : i \in D_N, N \ge 1\}$ is:

1. $L_0$ stochastically equicontinuous on $\Theta$ iff for every $\varepsilon > 0$,
$$\limsup_{N\to\infty}\frac{1}{|D_N|}\sum_{i\in D_N} P\Big[\sup_{\theta'\in\Theta}\sup_{\theta\in B(\theta',\delta)}|f_{i,N}(Z_{i,N},\theta) - f_{i,N}(Z_{i,N},\theta')| > \varepsilon\Big] \to 0 \quad \text{as } \delta \to 0;$$
2. $L_p$ stochastically equicontinuous, $p > 0$, on $\Theta$ iff
$$\limsup_{N\to\infty}\frac{1}{|D_N|}\sum_{i\in D_N} E\Big[\sup_{\theta'\in\Theta}\sup_{\theta\in B(\theta',\delta)}|f_{i,N}(Z_{i,N},\theta) - f_{i,N}(Z_{i,N},\theta')|^p\Big] \to 0 \quad \text{as } \delta \to 0;$$
3. a.s. stochastically equicontinuous on $\Theta$ iff
$$\limsup_{N\to\infty}\frac{1}{|D_N|}\sum_{i\in D_N}\sup_{\theta'\in\Theta}\sup_{\theta\in B(\theta',\delta)}|f_{i,N}(Z_{i,N},\theta) - f_{i,N}(Z_{i,N},\theta')| \to 0 \quad \text{a.s. as } \delta \to 0.$$

APPENDIX B

PROOFS FOR CHAPTER 1

Proof of Proposition 5

For notational simplicity, we assume that $W_i y_t$ is included in $x_{2it}$, with $x_{it} = [x_{1it}\ x_{2it}]$, where $x_2$ collects the $k_2 + 1$ endogenous variables, and $z_{it} = [x_{1it}\ z_{2it}\ w^2_{1it} \dots w^s_{1it}]$, where $z_2$ is a vector of $L_2$ instruments for $x_2$, with $L_2 \ge k_2$, and similarly for the spatial variables (note, however, that $W_i y_t$ is not in $W_i X_t$).
Therefore, the problem is to apply Pooled 2SLS to the following equation:
$$y_{it} - \eta\bar{y}_i = (x_{it} - \eta\bar{x}_i)\beta + W_i(X_t - \eta\bar{X})\gamma + (1-\eta)\bar{z}_i\delta + (1-\eta)W_i\bar{Z}\lambda = (x_{it} - \eta\bar{x}_i)\beta + (w_{it} - \eta\bar{w}_i)\gamma + (1-\eta)\bar{z}_i\delta + (1-\eta)\bar{\mathfrak{Z}}_i\lambda$$
using IVs $[(z_{it} - \eta\bar{z}_i)\ (\mathfrak{Z}_{it} - \eta\bar{\mathfrak{Z}}_i)\ (1-\eta)\bar{z}_{2i}\ (1-\eta)\bar{\mathfrak{Z}}_i]$. We first orthogonalize the IVs: we run $z_{it} - \eta\bar{z}_i = (1-\eta)\bar{z}_i\epsilon_1 + (1-\eta)\bar{\mathfrak{Z}}_i\epsilon_2 + \text{residual}$ and obtain the residuals $r_{it}$, and $\mathfrak{Z}_{it} - \eta\bar{\mathfrak{Z}}_i = (1-\eta)\bar{z}_i\epsilon_3 + (1-\eta)\bar{\mathfrak{Z}}_i\epsilon_4 + \text{residual}$ and obtain the residuals $s_{it}$. To do so, we use the Frisch–Waugh–Lovell (FWL) theorem sequentially.

1.a) Regress $z_{it} - \eta\bar{z}_i$ on $(1-\eta)\bar{z}_i$. The coefficient will be:
$$\tilde{\epsilon}_1 = \Big[\sum_{i=1}^{N}\sum_{t=1}^{T}(1-\eta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i=1}^{N}\sum_{t=1}^{T}(1-\eta)\bar{z}_i'(z_{it} - \eta\bar{z}_i)\Big] = \Big[\sum_{i=1}^{N}T(1-\eta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i=1}^{N}\big(T(1-\eta)\bar{z}_i'\bar{z}_i - T(1-\eta)\eta\,\bar{z}_i'\bar{z}_i\big)\Big] = \Big[\sum_{i=1}^{N}T(1-\eta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i=1}^{N}T(1-\eta)^2\bar{z}_i'\bar{z}_i\Big] = I_{L}.$$
Therefore the residuals will be $v_{it} = z_{it} - \bar{z}_i$.

1.b) Regress $(1-\eta)\bar{\mathfrak{Z}}_i$ on $(1-\eta)\bar{z}_i$. In this case the coefficient and the residuals depend only on the index $i$; call the latter $f_i$.

1.c) Regress $v_{it}$ on $f_i$ to obtain $\epsilon_2$. The coefficient will be:
$$\epsilon_2 = \Big[\sum_i\sum_t f_i'f_i\Big]^{-1}\Big[\sum_i\sum_t f_i'v_{it}\Big] = \Big[\sum_i\sum_t f_i'f_i\Big]^{-1}\Big[\sum_i f_i'\sum_t (z_{it} - \bar{z}_i)\Big] = 0_{L},$$
where we used the fact that the deviations from the mean sum to zero for each $i$. This implies that $\epsilon_1 = I_L$ and therefore $r_{it} = z_{it} - \bar{z}_i$. Using very similar steps, it can be shown that if we run $\mathfrak{Z}_{it} - \eta\bar{\mathfrak{Z}}_i = (1-\eta)\bar{z}_i\epsilon_3 + (1-\eta)\bar{\mathfrak{Z}}_i\epsilon_4 + \text{residual}$, then $\epsilon_3 = 0_L$ and $\epsilon_4 = I_L$, so the residuals of this regression will be $s_{it} = \mathfrak{Z}_{it} - \bar{\mathfrak{Z}}_i$.
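Two elementary facts drive these FWL computations: regressing a quasi-demeaned variable on its scaled unit average yields an identity coefficient with the fully demeaned variable as residual, and unit averages are orthogonal to within-unit deviations, so the quasi-demeaning parameter drops out of cross products with demeaned variables. Both are exact in any balanced panel and easy to confirm numerically; the sketch below uses illustrative names, not objects from the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, eta = 50, 4, 0.6
z = rng.normal(size=(N, T))
x = rng.normal(size=(N, T))
zbar = np.repeat(z.mean(1), T)        # unit averages expanded to N*T rows
xbar = np.repeat(x.mean(1), T)
zf, xf = z.ravel(), x.ravel()

# Regress z_it - eta*zbar_i on (1 - eta)*zbar_i: the coefficient is 1 and
# the residual is the fully demeaned z_it - zbar_i
reg = (1 - eta) * zbar
coef = (reg @ (zf - eta * zbar)) / (reg @ reg)
resid = zf - eta * zbar - reg * coef
print(np.isclose(coef, 1.0))                        # True
print(np.allclose(resid, zf - zbar))                # True

# Cross products with a demeaned variable: the eta part drops out because
# unit averages are orthogonal to within-unit deviations
print(np.isclose((xf - eta * xbar) @ (zf - zbar),
                 (xf - xbar) @ (zf - zbar)))        # True
```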
Since we have orthogonalized the instrumental variables with respect to $(1-\eta)\bar{z}_i$ and $(1-\eta)\bar{\mathfrak{Z}}_i$, we now have to apply Pooled 2SLS to the following equation:
$$y_{it} - \eta\bar{y}_i = (x_{it} - \eta\bar{x}_i)\beta + (w_{it} - \eta\bar{w}_i)\gamma$$
using IVs $[(z_{it} - \bar{z}_i)\ (\mathfrak{Z}_{it} - \bar{\mathfrak{Z}}_i)]$. We now define the following notation: $\ddot{z}_{it} = z_{it} - \bar{z}_i$, $\ddot{\mathfrak{Z}}_{it} = \mathfrak{Z}_{it} - \bar{\mathfrak{Z}}_i$, $\hat{z}_{it} = [\ddot{z}_{it}\ \ddot{\mathfrak{Z}}_{it}]$, $\tilde{y}_{it} = y_{it} - \eta\bar{y}_i$, $\tilde{x}_{it} = [(x_{it} - \eta\bar{x}_i)\ (w_{it} - \eta\bar{w}_i)]$, $\hat{y}_{it} = y_{it} - \bar{y}_i$, and $\hat{x}_{it} = [(x_{it} - \bar{x}_i)\ (w_{it} - \bar{w}_i)]$. Then the $\Gamma = (\beta\ \gamma)$ from the previous problem can be obtained as:
$$\hat{\Gamma}_{2SLS} = \Big[\Big(\sum_i\sum_t \tilde{x}_{it}'\hat{z}_{it}\Big)\Big(\sum_i\sum_t \hat{z}_{it}'\hat{z}_{it}\Big)^{-1}\Big(\sum_i\sum_t \hat{z}_{it}'\tilde{x}_{it}\Big)\Big]^{-1}\Big(\sum_i\sum_t \tilde{x}_{it}'\hat{z}_{it}\Big)\Big(\sum_i\sum_t \hat{z}_{it}'\hat{z}_{it}\Big)^{-1}\Big(\sum_i\sum_t \hat{z}_{it}'\tilde{y}_{it}\Big) \tag{B.1}$$

The first term in square brackets can be rewritten as follows (the third term of that inverse matrix can also be written in a similar way):
$$\sum_i\sum_t \tilde{x}_{it}'\hat{z}_{it} = \sum_i\sum_t \begin{bmatrix} (x_{it} - \eta\bar{x}_i)'\ddot{z}_{it} & (x_{it} - \eta\bar{x}_i)'\ddot{\mathfrak{Z}}_{it} \\ (w_{it} - \eta\bar{w}_i)'\ddot{z}_{it} & (w_{it} - \eta\bar{w}_i)'\ddot{\mathfrak{Z}}_{it} \end{bmatrix} \tag{B.2}$$

We focus on the (1,1) term, but the following algebraic manipulation holds for the rest of the terms in the matrix and for the second term in (B.1):
$$\sum_i\sum_t (x_{it} - \eta\bar{x}_i)'\ddot{z}_{it} = \sum_i\sum_t x_{it}'\ddot{z}_{it} - \eta\sum_i \bar{x}_i'\sum_t(z_{it} - \bar{z}_i) = \sum_i\sum_t x_{it}'\ddot{z}_{it} = \sum_i\sum_t x_{it}'\ddot{z}_{it} - \sum_i \bar{x}_i'\sum_t(z_{it} - \bar{z}_i) = \sum_i\sum_t (x_{it} - \bar{x}_i)'\ddot{z}_{it},$$
where in the second and fourth steps we used the fact that the deviations from the mean over $t$ sum to zero for every $i$.
Therefore, (B.2) can be rewritten as:
$$\sum_i\sum_t \begin{bmatrix} (x_{it} - \eta\bar{x}_i)'\ddot{z}_{it} & (x_{it} - \eta\bar{x}_i)'\ddot{\mathfrak{Z}}_{it} \\ (w_{it} - \eta\bar{w}_i)'\ddot{z}_{it} & (w_{it} - \eta\bar{w}_i)'\ddot{\mathfrak{Z}}_{it} \end{bmatrix} = \sum_i\sum_t \begin{bmatrix} (x_{it} - \bar{x}_i)'\ddot{z}_{it} & (x_{it} - \bar{x}_i)'\ddot{\mathfrak{Z}}_{it} \\ (w_{it} - \bar{w}_i)'\ddot{z}_{it} & (w_{it} - \bar{w}_i)'\ddot{\mathfrak{Z}}_{it} \end{bmatrix} = \sum_i\sum_t \hat{x}_{it}'\hat{z}_{it}.$$
Similarly, $\sum_i\sum_t \hat{z}_{it}'\tilde{y}_{it} = \sum_i\sum_t \hat{z}_{it}'\hat{y}_{it}$. Therefore,
$$\hat{\Gamma}_{2SLS} = \Big[\Big(\sum_i\sum_t \hat{x}_{it}'\hat{z}_{it}\Big)\Big(\sum_i\sum_t \hat{z}_{it}'\hat{z}_{it}\Big)^{-1}\Big(\sum_i\sum_t \hat{z}_{it}'\hat{x}_{it}\Big)\Big]^{-1}\Big(\sum_i\sum_t \hat{x}_{it}'\hat{z}_{it}\Big)\Big(\sum_i\sum_t \hat{z}_{it}'\hat{z}_{it}\Big)^{-1}\Big(\sum_i\sum_t \hat{z}_{it}'\hat{y}_{it}\Big) = \hat{\Gamma}_{FE2SLS}.$$

Proof of Proposition 6

For notational simplicity and without loss of generality, I will omit $W_i y_t$ in the proof. This term can be treated as an additional endogenous variable included in $x_{2it}$ with its respective instruments $[w^2_{1it} \dots w^s_{1it}]$. Let $x_{it} = (x_{1it}\ x_{2it})$, where $x_{1it}$ is a $1\times k_1$ vector of exogenous variables and $x_{2it}$ is a $1\times k_2$ vector of endogenous covariates. Similarly, $X_t = (X_{1t}\ X_{2t})$, $z_{it} = (x_{1it}\ z_{2it})$, $\bar{z}_i = (\bar{x}_{1i}\ \bar{z}_{2i})$, $Z_t = (X_{1t}\ Z_{2t})$, $\bar{Z} = (\bar{X}_1\ \bar{Z}_2)$, $\mathfrak{Z}_{2it} = W_i Z_{2t}$, and $\bar{\mathfrak{Z}}_{2i} = W_i\bar{Z}_2$. Finally, denote $\hat{x}_{it} = (x_{1it}\ \hat{x}_{2it})$, $\hat{\bar{x}}_i = (\bar{x}_{1i}\ \hat{\bar{x}}_{2i})$, $\hat{\bar{X}} = (\bar{X}_1\ \hat{\bar{X}}_2)$, where the hats denote the linear projections of $x_2$ on $(x_1\ z_2)$ and their spatial lags.
In a spatial setting, $(\beta\ \gamma)_{FE2SLS}$ can be obtained by applying Pooled 2SLS to
$$y_{it} - \bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (x_{2it} - \bar{x}_{2i})\beta_2 + W_i(X_{1t} - \bar{X}_1)\gamma_1 + W_i(X_{2t} - \bar{X}_2)\gamma_2 + (u_{it} - \bar{u}_i)$$
using IVs $[(z_{2it} - \bar{z}_{2i})\ W_i(Z_{2t} - \bar{Z}_2)]$. We want to show that applying Pooled 2SLS to
$$y_{it} - \theta\bar{y}_i = (x_{1it} - \theta\bar{x}_{1i})\beta_1 + (x_{2it} - \theta\bar{x}_{2i})\beta_2 + W_i(X_{1t} - \theta\bar{X}_1)\gamma_1 + W_i(X_{2t} - \theta\bar{X}_2)\gamma_2 + (1-\theta)\bar{x}_{1i}\delta_1 + (1-\theta)\bar{x}_{2i}\delta_2 + (1-\theta)W_i\bar{X}_1\lambda_1 + (1-\theta)W_i\bar{X}_2\lambda_2 + u_{it}$$
using IVs $[(z_{2it} - \theta\bar{z}_{2i})\ W_i(Z_{2t} - \theta\bar{Z}_2)\ (1-\theta)\bar{z}_{2i}\ (1-\theta)W_i\bar{Z}_2]$ yields the same $(\beta\ \gamma)$. To prove the result, I will follow these steps:

1. Orthogonalize the instrumental variables and $[(x_{1it} - \theta\bar{x}_{1i})\ (w_{1it} - \theta\bar{w}_{1i})]$ with respect to $[(1-\theta)\bar{x}_{1i}\ (1-\theta)\bar{w}_{1i}]$.
2. Orthogonalize with respect to $[(1-\theta)\bar{z}_{2i}\ (1-\theta)\bar{\mathfrak{Z}}_{2i}]$ in the first-stage equation.
3. Show that we get the same predicted values using the orthogonalized variables and the original ones.
4. Use the Frisch–Waugh–Lovell (FWL) theorem to show the equivalence.

So the model is:
$$y_{it} - \theta\bar{y}_i = (x_{1it} - \theta\bar{x}_{1i})\beta_1 + (x_{2it} - \theta\bar{x}_{2i})\beta_2 + (w_{1it} - \theta\bar{w}_{1i})\gamma_1 + (w_{2it} - \theta\bar{w}_{2i})\gamma_2 + (1-\theta)\bar{x}_{1i}\delta_1 + (1-\theta)\bar{x}_{2i}\delta_2 + (1-\theta)\bar{w}_{1i}\lambda_1 + (1-\theta)\bar{w}_{2i}\lambda_2 + u_{it}$$
using IVs $[(z_{2it} - \theta\bar{z}_{2i})\ (\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i})\ (1-\theta)\bar{z}_{2i}\ (1-\theta)\bar{\mathfrak{Z}}_{2i}]$.

Step 1

a. Regress $z_{2it} - \theta\bar{z}_{2i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. The residuals will be $z_{2it} - \theta\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2 = l_{it}$. Applying the FWL theorem: regressing $(1-\theta)\bar{x}_{1i}$ on $(1-\theta)\bar{w}_{1i}$, the coefficient will be
$$\hat{\mu}_1 = \Big[\sum_{i=1}^{N}\sum_{t=1}^{T}(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i=1}^{N}\sum_{t=1}^{T}(1-\theta)^2\bar{w}_{1i}'\bar{x}_{1i}\Big] = \Big[\sum_{i=1}^{N}(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i=1}^{N}(1-\theta)^2\bar{w}_{1i}'\bar{x}_{1i}\Big].$$
The residuals will be $(1-\theta)\bar{x}_{1i} - (1-\theta)\bar{w}_{1i}\hat{\mu}_1 = s_i$. Now we regress $z_{2it} - \theta\bar{z}_{2i}$ on $(1-\theta)\bar{w}_{1i}$.
The coefficient will be:
$$\hat{\mu}_2 = \Big[\sum_{i}\sum_{t}(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i}\sum_{t}(1-\theta)\bar{w}_{1i}'(z_{2it} - \theta\bar{z}_{2i})\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i}(1-\theta)\bar{w}_{1i}'\{T(\bar{z}_{2i} - \theta\bar{z}_{2i})\}\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i}T(1-\theta)^2\bar{w}_{1i}'\bar{z}_{2i}\Big].$$
The residuals will be $z_{2it} - \theta\bar{z}_{2i} - (1-\theta)\bar{w}_{1i}\hat{\mu}_2 = g_{it}$. Finally, we run $g_{it}$ on $s_i$. The coefficient will be:
$$\hat{\eta}_1 = \Big[\sum_{i}\sum_{t}s_i's_i\Big]^{-1}\Big[\sum_{i}\sum_{t}s_i'g_{it}\Big] = \Big[\sum_{i}T s_i's_i\Big]^{-1}\Big[\sum_{i}T s_i'\bar{g}_i\Big] = \Big[\sum_{i}s_i's_i\Big]^{-1}\Big[\sum_{i}(1-\theta)s_i'(\bar{z}_{2i} - \bar{w}_{1i}\hat{\mu}_2)\Big].$$
Using similar steps, $\hat{\eta}_2$ will be:
$$\hat{\eta}_2 = \Big[\sum_{i}s_i^{*\prime}s_i^{*}\Big]^{-1}\Big[\sum_{i}(1-\theta)s_i^{*\prime}(\bar{z}_{2i} - \bar{x}_{1i}\hat{\mu}_2^{*})\Big],$$
where $\hat{\mu}_2^{*}$ is the coefficient from regressing $z_{2it} - \theta\bar{z}_{2i}$ on $(1-\theta)\bar{x}_{1i}$ and $s_i^{*}$ are the residuals from regressing $(1-\theta)\bar{w}_{1i}$ on $(1-\theta)\bar{x}_{1i}$.

b. Regress $(\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i})$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. The residuals will be $(\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i}) - (1-\theta)\bar{x}_{1i}\hat{\eta}_3 - (1-\theta)\bar{w}_{1i}\hat{\eta}_4 = m_{it}$.

c. Regress $(1-\theta)\bar{z}_{2i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. The residuals are $(1-\theta)\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_5 - (1-\theta)\bar{w}_{1i}\hat{\eta}_6 = v_i$, which depend only on the $i$ subscript. Applying the FWL theorem: regressing $(1-\theta)\bar{z}_{2i}$ on $(1-\theta)\bar{w}_{1i}$ yields $\hat{\mu}_2$, the same as in step 1a, and the residuals are a function of $i$ only, say $f_i = (1-\theta)(\bar{z}_{2i} - \bar{w}_{1i}\hat{\mu}_2)$. Finally, run $f_i$ on $s_i$; the coefficient will be:
$$\hat{\eta}_5 = \Big[\sum_{i}T s_i's_i\Big]^{-1}\Big[\sum_{i}T s_i'f_i\Big] = \Big[\sum_{i}s_i's_i\Big]^{-1}\Big[\sum_{i}s_i'(1-\theta)(\bar{z}_{2i} - \bar{w}_{1i}\hat{\mu}_2)\Big] = \hat{\eta}_1.$$
The same coefficient as above.
Following similar steps, it can be shown that
$$\hat{\eta}_6 = \hat{\eta}_2 = \Big[\sum_{i}s_i^{*\prime}s_i^{*}\Big]^{-1}\Big[\sum_{i}s_i^{*\prime}(1-\theta)(\bar{z}_{2i} - \bar{x}_{1i}\hat{\mu}_2^{*})\Big],$$
where $\hat{\mu}_2^{*}$ is defined in step 1a. Therefore,
$$v_i = (1-\theta)\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_5 - (1-\theta)\bar{w}_{1i}\hat{\eta}_6 = (1-\theta)\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2.$$

d. Regress $(1-\theta)\bar{\mathfrak{Z}}_{2i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. The residuals depend only on $i$; denote them by $r_i$. If $(1-\theta)\bar{\mathfrak{Z}}_{2i} = (1-\theta)\bar{x}_{1i}\hat{\eta}_7 + (1-\theta)\bar{w}_{1i}\hat{\eta}_8 + r_i$, it can be shown using arguments similar to the previous step that $\hat{\eta}_7 = \hat{\eta}_3$ and $\hat{\eta}_8 = \hat{\eta}_4$.

e. Regress $x_{1it} - \theta\bar{x}_{1i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. We can apply the FWL theorem to get the coefficients:

i. First regress $x_{1it} - \theta\bar{x}_{1i}$ on $(1-\theta)\bar{x}_{1i}$. The coefficient is:
$$\Big[\sum_{i}\sum_{t}(1-\theta)^2\bar{x}_{1i}'\bar{x}_{1i}\Big]^{-1}\Big[\sum_{i}\sum_{t}(1-\theta)\bar{x}_{1i}'(x_{1it} - \theta\bar{x}_{1i})\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{x}_{1i}'\bar{x}_{1i}\Big]^{-1}\Big[\sum_{i}(1-\theta)\bar{x}_{1i}'\,T(\bar{x}_{1i} - \theta\bar{x}_{1i})\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{x}_{1i}'\bar{x}_{1i}\Big]^{-1}\Big[\sum_{i}T(1-\theta)^2\bar{x}_{1i}'\bar{x}_{1i}\Big] = I_{k_1},$$
where $I_{k_1}$ denotes an identity matrix of size $k_1$. Therefore, the residuals will be $x_{1it} - \bar{x}_{1i}$.

ii. Now regress $(1-\theta)\bar{w}_{1i}$ on $(1-\theta)\bar{x}_{1i}$. The coefficients and residuals depend only on $i$; denote the latter by $d_i$.

iii. Finally regress $x_{1it} - \bar{x}_{1i}$ on $d_i$. The coefficient will be:
$$\Big[\sum_{i}\sum_{t}d_i'd_i\Big]^{-1}\Big[\sum_{i}\sum_{t}d_i'(x_{1it} - \bar{x}_{1i})\Big] = \Big[\sum_{i}\sum_{t}d_i'd_i\Big]^{-1}\Big[\sum_{i}d_i'\sum_{t}(x_{1it} - \bar{x}_{1i})\Big] = 0_{k_1},$$
where we used the fact that $\sum_t (x_{1it} - \bar{x}_{1i}) = 0$. Therefore $x_{1it} - \theta\bar{x}_{1i} = (1-\theta)\bar{x}_{1i}I_{k_1} + (1-\theta)\bar{w}_{1i}0_{k_1} + (x_{1it} - \bar{x}_{1i})$ and the residuals will be $x_{1it} - \bar{x}_{1i}$.

f. Regress $w_{1it} - \theta\bar{w}_{1i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$.
Applying the FWL theorem in a similar way to the previous step, we get the relationship $w_{1it} - \theta\bar{w}_{1i} = (1-\theta)\bar{x}_{1i}0_{k_1} + (1-\theta)\bar{w}_{1i}I_{k_1} + (w_{1it} - \bar{w}_{1i})$, and the residuals will be $w_{1it} - \bar{w}_{1i}$.

Therefore, after orthogonalizing, we can apply Pooled 2SLS to:
$$y_{it} - \theta\bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (x_{2it} - \theta\bar{x}_{2i})\beta_2 + (w_{1it} - \bar{w}_{1i})\gamma_1 + (w_{2it} - \theta\bar{w}_{2i})\gamma_2 + (1-\theta)\bar{x}_{2i}\delta_2 + (1-\theta)\bar{w}_{2i}\lambda_2 + u_{it}$$
using IVs $[l_{it}\ m_{it}\ v_i\ r_i]$.

Step 2

In this step we orthogonalize with respect to $v_i$ and $r_i$ in the first-stage equation. Note that these are the residuals from the previous step associated with $(1-\theta)\bar{z}_{2i}$ and $(1-\theta)\bar{\mathfrak{Z}}_{2i}$ respectively, the instrumental variables.

a. $l_{it} = v_i\zeta_1 + r_i\zeta_2 + \varepsilon_1$.

i. Regress $l_{it}$ on $v_i$. The coefficient will be:
$$\tilde{\eta}_1 = \Big[\sum_{i}\sum_{t}v_i'v_i\Big]^{-1}\Big[\sum_{i}\sum_{t}v_i'l_{it}\Big] = \Big[\sum_{i}T v_i'v_i\Big]^{-1}\Big[\sum_{i}T v_i'\bar{l}_i\Big].$$
Note that $l_{it} = z_{2it} - \theta\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2$, therefore
$$\bar{l}_i = \frac{1}{T}\sum_{t}\big[z_{2it} - \theta\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2\big] = (1-\theta)\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2 = (1-\theta)(\bar{z}_{2i} - \bar{x}_{1i}\hat{\eta}_1 - \bar{w}_{1i}\hat{\eta}_2) = v_i.$$
Therefore $\tilde{\eta}_1 = I_{L_2}$, since $\hat{\eta}_1 = \hat{\eta}_5$ and $\hat{\eta}_2 = \hat{\eta}_6$. The residuals are $z_{2it} - \bar{z}_{2i}$.

ii. Regress $r_i$ on $v_i$. In this case, both the coefficient and the residuals depend only on $i$; call the latter $h_i$.

iii. Regress $z_{2it} - \bar{z}_{2i}$ on $h_i$. The coefficient is:
$$\Big[\sum_{i}\sum_{t}h_i'h_i\Big]^{-1}\Big[\sum_{i}h_i'\sum_{t}(z_{2it} - \bar{z}_{2i})\Big] = 0_{L_2},$$
because the deviations from the mean sum to zero. Therefore $l_{it} = v_i I_{L_2} + r_i 0_{L_2} + \varepsilon_1$ and the residuals will be $z_{2it} - \bar{z}_{2i}$.

b. $m_{it} = v_i\pi_1 + r_i\pi_2 + \varepsilon_2$.

i. Regress $m_{it}$ on $r_i$. The coefficient will be, after some algebra, $\tilde{\pi}_2 = \big[\sum_i r_i'r_i\big]^{-1}\big[\sum_i r_i'\bar{m}_i\big]$.
Noting that
$$\bar{m}_i = \frac{1}{T}\sum_{t}\big[(\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i}) - (1-\theta)\bar{x}_{1i}\hat{\eta}_3 - (1-\theta)\bar{w}_{1i}\hat{\eta}_4\big] = (1-\theta)(\bar{\mathfrak{Z}}_{2i} - \bar{x}_{1i}\hat{\eta}_3 - \bar{w}_{1i}\hat{\eta}_4) = (1-\theta)(\bar{\mathfrak{Z}}_{2i} - \bar{x}_{1i}\hat{\eta}_7 - \bar{w}_{1i}\hat{\eta}_8) = r_i,$$
we conclude that $\tilde{\pi}_2 = I_{L_2}$ and the residuals are $\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i}$.

ii. Regress $v_i$ on $r_i$. The coefficient will be denoted by $\tilde{\pi}_1 = \big[\sum_i r_i'r_i\big]^{-1}\big[\sum_i r_i'v_i\big]$, and the residuals will depend on $i$; call them $\tilde{h}_i$.

iii. Regress $\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i}$ on $\tilde{h}_i$. Using again the fact that $\sum_t (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i}) = 0$, we conclude that $\pi_1 = 0_{L_2}$, which implies that $\tilde{\pi}_2 = \pi_2 = I_{L_2}$, and therefore the residuals will be $\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i}$.

In the original first stage we have:
$$x_{2it} - \theta\bar{x}_{2i} = (x_{1it} - \theta\bar{x}_{1i})\phi_1 + (w_{1it} - \theta\bar{w}_{1i})\phi_2 + (z_{2it} - \theta\bar{z}_{2i})\phi_3 + (\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i})\phi_4 + (1-\theta)\bar{x}_{1i}\rho_1 + (1-\theta)\bar{w}_{1i}\rho_2 + (1-\theta)\bar{z}_{2i}\rho_3 + (1-\theta)\bar{\mathfrak{Z}}_{2i}\rho_4 + \varepsilon^{FS}.$$
After orthogonalizing with respect to $[(1-\theta)\bar{x}_{1i}\ (1-\theta)\bar{w}_{1i}\ (1-\theta)\bar{z}_{2i}\ (1-\theta)\bar{\mathfrak{Z}}_{2i}]$, to get $\Phi = (\phi_1\ \phi_2\ \phi_3\ \phi_4)$ we have to regress $x_{2it} - \theta\bar{x}_{2i}$ on $[(x_{1it} - \bar{x}_{1i})\ (w_{1it} - \bar{w}_{1i})\ (z_{2it} - \bar{z}_{2i})\ (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i})]$. We note that if $z_{it} = [x_{1it}\ w_{1it}\ z_{2it}\ \mathfrak{Z}_{2it}]$, then the coefficient of $x_{2it} - \theta\bar{x}_{2i}$ on $z_{it} - \bar{z}_i$ is
$$\check{\Phi} = \Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(z_{it} - \bar{z}_i)\Big]^{-1}\Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(x_{2it} - \theta\bar{x}_{2i})\Big] = \Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(z_{it} - \bar{z}_i)\Big]^{-1}\Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'x_{2it} - \Big\{\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'\theta\bar{x}_{2i}\Big\}\Big] = \Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(z_{it} - \bar{z}_i)\Big]^{-1}\Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(x_{2it} - \bar{x}_{2i})\Big],$$
where we used the fact that the terms in curly brackets are zero. Therefore, $\Phi$ can also be obtained by regressing $x_{2it} - \bar{x}_{2i}$ on $[(x_{1it} - \bar{x}_{1i})\ (w_{1it} - \bar{w}_{1i})\ (z_{2it} - \bar{z}_{2i})\ (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i})]$.
Step 3

In this step we show that $\widehat{x_{2it} - \theta\bar{x}_{2i}} = \widetilde{x_{2it} - \theta\bar{x}_{2i}}$, where
$$\widehat{x_{2it} - \theta\bar{x}_{2i}} = (x_{1it} - \theta\bar{x}_{1i})\hat{\phi}_1 + (w_{1it} - \theta\bar{w}_{1i})\hat{\phi}_2 + (z_{2it} - \theta\bar{z}_{2i})\hat{\phi}_3 + (\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i})\hat{\phi}_4 + (1-\theta)\bar{x}_{1i}\hat{\rho}_1 + (1-\theta)\bar{w}_{1i}\hat{\rho}_2 + (1-\theta)\bar{z}_{2i}\hat{\rho}_3 + (1-\theta)\bar{\mathfrak{Z}}_{2i}\hat{\rho}_4,$$
$$\widetilde{x_{2it} - \theta\bar{x}_{2i}} = (x_{1it} - \bar{x}_{1i})\tilde{\phi}_1 + (w_{1it} - \bar{w}_{1i})\tilde{\phi}_2 + (z_{2it} - \bar{z}_{2i})\tilde{\phi}_3 + (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i})\tilde{\phi}_4 + (1-\theta)\bar{x}_{1i}\tilde{\rho}_1 + (1-\theta)\bar{w}_{1i}\tilde{\rho}_2 + (1-\theta)\bar{z}_{2i}\tilde{\rho}_3 + (1-\theta)\bar{\mathfrak{Z}}_{2i}\tilde{\rho}_4.$$
First we note that $\hat{\phi}_j = \tilde{\phi}_j$ for $j = 1,2,3,4$, because in the second equation the respective explanatory variables are orthogonalized with respect to the terms involving the time averages of the independent variables. Given this fact, and after some algebra, we have that $\widehat{x_{2it} - \theta\bar{x}_{2i}} = \widetilde{x_{2it} - \theta\bar{x}_{2i}}$ if $\hat{\phi}_j + \hat{\rho}_j = \tilde{\rho}_j$ for $j = 1,2,3,4$.

To show that the previous equality holds, we start with $\widetilde{x_{2it} - \theta\bar{x}_{2i}}$. Since $z_{it} = [x_{1it}\ w_{1it}\ z_{2it}\ \mathfrak{Z}_{2it}]$ as above, we have $z_{it} - \bar{z}_i = [(x_{1it} - \bar{x}_{1i})\ (w_{1it} - \bar{w}_{1i})\ (z_{2it} - \bar{z}_{2i})\ (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i})]$, $\tilde{\rho} = (\tilde{\rho}_1'\ \tilde{\rho}_2'\ \tilde{\rho}_3'\ \tilde{\rho}_4')'$, and $\tilde{\phi} = (\tilde{\phi}_1'\ \tilde{\phi}_2'\ \tilde{\phi}_3'\ \tilde{\phi}_4')'$; therefore, $\widetilde{x_{2it} - \theta\bar{x}_{2i}} = (z_{it} - \bar{z}_i)\hat{\phi} + (1-\theta)\bar{z}_i\tilde{\rho}$. Greene (2007) shows that, given $\hat{\phi}$, one can get $\tilde{\rho}$ as:
$$\tilde{\rho} = \Big[\sum_{i}\sum_{t}(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i}\sum_{t}(1-\theta)\bar{z}_i'\{x_{2it} - \theta\bar{x}_{2i} - (z_{it} - \bar{z}_i)\hat{\phi}\}\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i}(1-\theta)\bar{z}_i'\Big(\sum_{t}(x_{2it} - \theta\bar{x}_{2i}) - \Big\{\sum_{t}(z_{it} - \bar{z}_i)\Big\}\hat{\phi}\Big)\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i}T(1-\theta)^2\bar{z}_i'\bar{x}_{2i}\Big],$$
where we used the fact that $\sum_t (z_{it} - \bar{z}_i) = 0$ in the second line. We turn now to $\widehat{x_{2it} - \theta\bar{x}_{2i}}$.
With similar definitions as above, given $\hat{\phi}$, we get $\hat{\rho}$ as:
\begin{align*}
\hat{\rho} &= \Big[\sum_i\sum_t (1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i\sum_t (1-\theta)\bar{z}_i'\big\{(x_{2it} - \theta\bar{x}_{2i}) - (z_{it}-\theta\bar{z}_i)\hat{\phi}\big\}\Big] \\
&= \Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i (1-\theta)\bar{z}_i'\Big(\sum_t (x_{2it} - \theta\bar{x}_{2i}) - \sum_t (z_{it}-\theta\bar{z}_i)\hat{\phi}\Big)\Big] \\
&= \Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{x}_{2i}\Big] - \Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i (1-\theta)\bar{z}_i'(T\bar{z}_i - T\theta\bar{z}_i)\hat{\phi}\Big] \\
&= \tilde{\rho} - \Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]\hat{\phi} \\
&= \tilde{\rho} - \mathbb{I}_{2(k_1+l)+1}\hat{\phi} = \tilde{\rho} - \hat{\phi}
\end{align*}
Therefore, $\tilde{\rho} = \hat{\rho} + \hat{\phi}$ and hence $\widehat{x_{2it} - \theta\bar{x}_{2i}} = \widetilde{x_{2it} - \theta\bar{x}_{2i}}$. In a similar way, and using obvious notation, it can be shown that $\widehat{w_{2it} - \theta\bar{w}_{2i}} = \widetilde{w_{2it} - \theta\bar{w}_{2i}}$.

Step 4

Given the previous step, the problem becomes:
\[
y_{it} - \theta\bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (x_{2it} - \theta\bar{x}_{2i})\beta_2 + (w_{1it} - \bar{w}_{1i})\gamma_1 + (w_{2it} - \theta\bar{w}_{2i})\gamma_2 + (1-\theta)\bar{x}_{2i}\delta_2 + (1-\theta)\bar{w}_{2i}\lambda_2 + u_{it}
\]
using the IVs $[(z_{2it} - \bar{z}_{2i})\;\; (\mathfrak{z}_{2it} - \bar{\mathfrak{z}}_{2i})\;\; (1-\theta)\bar{z}_{2i}\;\; (1-\theta)\bar{\mathfrak{z}}_{2i}]$. At this point, however, it is important to note that although we have orthogonalized with respect to $(1-\theta)[\bar{x}_{1i}\;\bar{w}_{1i}]$, we still have to include these terms in the first-stage equation to obtain the predicted values of the endogenous variables. Given this, the second-stage equation is:
\[
y_{it} - \theta\bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (\hat{x}_{2it} - \theta\hat{\bar{x}}_{2i})\beta_2 + (w_{1it} - \bar{w}_{1i})\gamma_1 + (\hat{w}_{2it} - \theta\hat{\bar{w}}_{2i})\gamma_2 + (1-\theta)\hat{\bar{x}}_{2i}\delta_2 + (1-\theta)\hat{\bar{w}}_{2i}\lambda_2,
\]
where the hats denote the first-stage projections on the instrumental variables. To obtain $(\beta\;\gamma)$, we orthogonalize with respect to $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$.

a. $(x_{1it} - \bar{x}_{1i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$.

i. $(x_{1it} - \bar{x}_{1i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$.
The coefficient will be:
\[
\Big[\sum_i\sum_t (1-\theta)^2\hat{\bar{x}}_{2i}'\hat{\bar{x}}_{2i}\Big]^{-1}\Big[\sum_i (1-\theta)\hat{\bar{x}}_{2i}'\sum_t (x_{1it} - \bar{x}_{1i})\Big] = 0_{k_2},
\]
where we used that the sums of deviations from the mean are zero for all $i$; the residuals will be $x_{1it} - \bar{x}_{1i}$.

ii. $(1-\theta)\hat{\bar{w}}_{2i}$ on $(1-\theta)\hat{\bar{x}}_{2i}$. In this case the coefficients and the residuals will depend only on $i$; call the residuals $\tilde{u}_i$.

iii. $x_{1it} - \bar{x}_{1i}$ on $\tilde{u}_i$. By an argument similar to point i just above, the coefficient is $0_{k_2}$, and so $x_{1it} - \bar{x}_{1i}$ is orthogonal to both variables.

b. $(w_{1it} - \bar{w}_{1i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$. Using a similar argument as in a. above, $w_{1it} - \bar{w}_{1i}$ is orthogonal to both variables.

c. $(\hat{x}_{2it} - \theta\hat{\bar{x}}_{2i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$.

i. $(1-\theta)\hat{\bar{w}}_{2i}$ on $(1-\theta)\hat{\bar{x}}_{2i}$. The coefficient and residuals depend only on $i$; call the residuals $\check{u}_i$.

ii. $(\hat{x}_{2it} - \theta\hat{\bar{x}}_{2i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$. By arguments very similar to previous steps, one can show that the coefficient is $\mathbb{I}_{k_2}$ and the residuals will be $(\hat{x}_{2it} - \hat{\bar{x}}_{2i})$.

iii. $(\hat{x}_{2it} - \hat{\bar{x}}_{2i})$ on $\check{u}_i$. By analogous arguments as above, the coefficient of this regression will be $0_{k_2}$. Therefore, the residuals of this regression will be $(\hat{x}_{2it} - \hat{\bar{x}}_{2i})$.

d. $\hat{w}_{2it} - \theta\hat{\bar{w}}_{2i}$ on $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$. Using similar ideas as in c. above, the residuals of this regression are $\hat{w}_{2it} - \hat{\bar{w}}_{2i}$.
Therefore, to find $(\beta_1\;\beta_2\;\gamma_1\;\gamma_2)$, we run
\[
y_{it} - \theta\bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (\hat{x}_{2it} - \hat{\bar{x}}_{2i})\beta_2 + (w_{1it} - \bar{w}_{1i})\gamma_1 + (\hat{w}_{2it} - \hat{\bar{w}}_{2i})\gamma_2.
\]
If we collect all the covariates of this regression into a vector $\hat{x}_{it} - \hat{\bar{x}}_i$ (where $x_{1it}$ and $w_{1it}$ are their own projections), then:
\begin{align*}
(\beta\;\gamma) &= \Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(\hat{x}_{it}-\hat{\bar{x}}_i)\Big]^{-1}\Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(y_{it}-\theta\bar{y}_i)\Big] \\
&= \Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(\hat{x}_{it}-\hat{\bar{x}}_i)\Big]^{-1}\Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'y_{it} - \Big\{\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'\theta\bar{y}_i\Big\}\Big] \\
&= \Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(\hat{x}_{it}-\hat{\bar{x}}_i)\Big]^{-1}\Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(y_{it}-\bar{y}_i)\Big],
\end{align*}
where we use again the fact that the term in curly brackets in the second line is zero. Therefore, $(\beta\;\gamma)$ can be obtained by regressing $y_{it} - \bar{y}_i$ on $\big[(x_{1it}-\bar{x}_{1i})\;\; \widehat{(x_{2it}-\bar{x}_{2i})}\;\; (w_{1it}-\bar{w}_{1i})\;\; \widehat{(w_{2it}-\bar{w}_{2i})}\big]$, which is exactly the same problem that the Fixed Effects 2SLS estimator solves.

APPENDIX C

DERIVATION OF THE COVARIANCE MATRIX FOR THE CONTROL FUNCTION APPROACH

Consider the estimating equation in (1.46):
\[
\ddot{y}_{it} = \hat{\ddot{a}}_{it}\theta + \ddot{e}_{it},
\]
where we can write $\ddot{a}_{it} = \ddot{z}_{it}\psi + \ddot{v}_{it}$.
Because every element in $\hat{\ddot{a}}_{it}$ is exogenous with respect to the error term $\ddot{e}_{it}$, we can write:
\begin{align*}
\hat{\theta} &= \Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\ddot{y}_{it}\Big] \\
&= \Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'(\ddot{a}_{it}\theta + \ddot{e}_{it})\Big] \\
&= \Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'(\ddot{a}_{it}\theta + \hat{\ddot{a}}_{it}\theta - \hat{\ddot{a}}_{it}\theta + \ddot{e}_{it})\Big] \\
\Longrightarrow \sqrt{NT}(\hat{\theta} - \theta) &= \Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\Big(\underbrace{(\ddot{a}_{it} - \hat{\ddot{a}}_{it})\theta}_{\text{Part 2}} + \underbrace{\ddot{e}_{it}}_{\text{Part 1}}\Big)\Big]
\end{align*}
Note that, because $\hat{\psi} \xrightarrow{p} \psi$, the first matrix on the right-hand side will converge in probability to $\mathrm{E}\big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{a}_{it}'\ddot{a}_{it}\big) = B$.

Consider now Part 1:
\begin{align*}
(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\ddot{e}_{it} &= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{e}_{it} \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi} + \ddot{z}_{it}\psi - \ddot{z}_{it}\psi)'\ddot{e}_{it} \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{e}_{it} + \frac{1}{NT}\sum_i^N\sum_t^T \big[\ddot{z}_{it}\sqrt{NT}(\hat{\psi}-\psi)\big]'\ddot{e}_{it} \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{e}_{it} + \underbrace{\sqrt{NT}(\hat{\psi}-\psi)'}_{O_p(1)}\underbrace{\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{e}_{it}}_{o_p(1)}
\end{align*}
Because $O_p(1)\cdot o_p(1) = o_p(1)$, Part 1 converges to $\big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{e}_{it}\big]$.
Now consider Part 2:
\begin{align*}
(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'(\ddot{a}_{it} - \hat{\ddot{a}}_{it})\theta &= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\big[\ddot{a}_{it} - \ddot{z}_{it}\hat{\psi}\big]\theta \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\big[\ddot{a}_{it} - \ddot{z}_{it}\psi + \ddot{z}_{it}\psi - \ddot{z}_{it}\hat{\psi}\big]\theta \\
&= \underbrace{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{v}_{it}\theta}_{\text{Part 2.1}} - \underbrace{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{z}_{it}(\hat{\psi}-\psi)\theta}_{\text{Part 2.2}}
\end{align*}
Starting with Part 2.1:
\begin{align*}
(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{v}_{it}\theta &= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi + \ddot{z}_{it}\hat{\psi} - \ddot{z}_{it}\psi)'\ddot{v}_{it}\theta \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{v}_{it}\theta + (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \big[\ddot{z}_{it}(\hat{\psi}-\psi)\big]'\ddot{v}_{it}\theta \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{v}_{it}\theta + \underbrace{\sqrt{NT}(\hat{\psi}-\psi)'}_{O_p(1)}\underbrace{\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\theta}_{\xrightarrow{p}\;\mathrm{E}(\ddot{z}_{it}'\ddot{v}_{it})\theta\,=\,0}
\end{align*}
So in the last line we have $O_p(1)\cdot o_p(1) = o_p(1)$, and therefore the last term will vanish as $N \to \infty$; only $(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{v}_{it}\theta$ will remain.
Using similar algebra, it can be shown that Part 2.2 will converge to
\[
-\Big[\frac{1}{NT}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{z}_{it}\Big]\sqrt{NT}(\hat{\psi}-\psi)\theta.
\]
Note that
\[
\hat{\psi} - \psi = \Big(\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\Big)
\;\Longrightarrow\;
\sqrt{NT}(\hat{\psi}-\psi) = \Big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\Big].
\]
Putting everything together we have
\begin{align*}
&\Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big\{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\big[(\ddot{a}_{it}-\hat{\ddot{a}}_{it})\theta + \ddot{e}_{it}\big]\Big\} \\
&\quad = B^{-1}\Big\{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\big[\ddot{e}_{it} + \ddot{v}_{it}\theta - \ddot{z}_{it}(\hat{\psi}-\psi)\theta\big]\Big\} + o_p(1) \\
&\quad = B^{-1}\Big\{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \big\{(\ddot{z}_{it}\psi)'(\ddot{e}_{it} + \ddot{v}_{it}\theta)\big\} - \Big[\frac{1}{NT}\sum_i^N\sum_t^T \big\{(\ddot{z}_{it}\psi)'\ddot{z}_{it}\big\}\Big]\sqrt{NT}(\hat{\psi}-\psi)\theta\Big\} + o_p(1),
\end{align*}
where $\sqrt{NT}(\hat{\psi}-\psi) = \big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\big)^{-1}\big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\big]$. Let
\[
G = \mathrm{E}\Big[\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{z}_{it}\Big]
\quad\text{and}\quad
r_{it}(\psi) = \Big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\Big].
\]
Then we can write
\[
\sqrt{NT}(\hat{\theta}-\theta) = B^{-1}\Big\{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \big[(\ddot{z}_{it}\psi)'(\ddot{e}_{it}+\ddot{v}_{it}\theta) - G\cdot r_{it}(\psi)\theta\big] + o_p(1)\Big\},
\]
and therefore, by the Central Limit Theorem,
\[
\sqrt{NT}(\hat{\theta}-\theta) \overset{a}{\sim} N\big\{0,\; B^{-1}MB^{-1}\big\},
\]
where $M = \mathrm{Var}\big[\sum_i^N\sum_t^T \big((\ddot{z}_{it}\psi)'(\ddot{e}_{it}+\ddot{v}_{it}\theta) - G\cdot r_{it}(\psi)\theta\big)\big] = \mathrm{Var}\big[\sum_i^N\sum_t^T m_{it}\big]$.
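The structure of the corrected scores and the resulting sandwich variance can be sketched numerically. The sketch below is deliberately simplified and is not the chapter's implementation: it treats each row as a single cross-sectional observation, takes one endogenous regressor so that the score is scalar-valued, passes the first-stage coefficient vector `psi` in directly, and uses a Bartlett kernel in Euclidean distance as a stand-in for the kernel weighting; the variable names and bandwidth are illustrative.

```python
import numpy as np

def cf_variance(z, e, v, psi, theta, coords, rho_b):
    """Sandwich variance sketch for a control-function estimator with one
    endogenous regressor: scores m = (z psi)(e + v theta) - (G'r) theta,
    spatially weighted by a Bartlett kernel K(d / rho_b) = max(1 - d/rho_b, 0)."""
    n = z.shape[0]
    zp = z @ psi                                # fitted endogenous regressor
    G = z.T @ zp / n                            # one entry per instrument
    A = np.linalg.inv(z.T @ z / n)
    r = (z * v[:, None]) @ A                    # per-observation first-stage pieces
    m = zp * (e + v * theta) - (r @ G) * theta  # corrected scores, shape (n,)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    K = np.clip(1.0 - d / rho_b, 0.0, None)     # kernel weights, K(0) = 1
    M = m @ K @ m / n                           # kernel-weighted middle term
    B = np.mean(zp ** 2)
    return M / (n * B ** 2)                     # B^{-1} M B^{-1} / n

rng = np.random.default_rng(1)
n, kz = 60, 2
z = rng.normal(size=(n, kz))
e, v = rng.normal(size=n), rng.normal(size=n)
psi = np.array([0.8, -0.4])
coords = rng.uniform(0.0, 100.0, size=(n, 2))

# with theta = 0 the first-stage correction drops out, and a tiny bandwidth
# keeps only the i = j kernel terms, giving a heteroskedasticity-only form
out = cf_variance(z, e, v, psi, 0.0, coords, rho_b=1e-9)
m0 = (z @ psi) * e
assert np.isclose(out, (m0 @ m0 / n) / (n * np.mean((z @ psi) ** 2) ** 2))
```

The two limiting checks in the comments are the useful ones: setting $\theta = 0$ kills the first-stage correction term, and shrinking the bandwidth collapses the kernel sum to its own-observation terms.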
$B$ can be estimated with
\[
\hat{B} = \frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}.
\]
To estimate $M$, let
\[
\hat{m}_{it} = (\ddot{z}_{it}\hat{\psi})'(\hat{\ddot{e}}_{it} + \hat{\ddot{v}}_{it}\hat{\theta}) - \hat{G}\cdot\hat{r}_{it}(\hat{\psi})\hat{\theta},
\]
where $\hat{\ddot{e}}_{it}$ are the residuals from the second stage, $\hat{\ddot{v}}_{it}$ are the residuals from the first stage (note that $v_{it}$ is a vector), $\hat{G} = \frac{1}{NT}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{z}_{it}$, and $\hat{r}(\hat{\psi}) = \big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\big)^{-1}\big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \ddot{z}_{it}'\hat{\ddot{v}}_{it}\big]$. With these quantities defined, the $(r,s)$-th element of $M$ can be estimated as
\[
\hat{M}_{rs} = \frac{1}{NT}\sum_{i=1}^N\sum_{j=1}^N\sum_{t=1}^T\sum_{l=1}^T \hat{m}_{it,r}\,\hat{m}_{jl,s}\,K\Big[\frac{\rho^*(i,j)}{\rho_b}\Big],
\]
where once again the kernel function $K(\cdot)$ operationalizes the weak spatial dependence assumption.

APPENDIX D

TABLES FOR CHAPTER 1

Additional Simulation Results

Table D.1 Average of the estimated variance of $\beta_1$ over the 1,000 replications using a rook-type weighting matrix, N = 400, T = 5.

 ρ    ψ      CF     CF_no1  True value (CF)   HACSC   SHAC   Cluster  Non-Robust  True value (2SLS)
0.0  0.0    0.041   0.037   0.0386            0.082   0.068   0.086    0.069       0.0866
0.0  0.3    0.037   0.034   0.0352            0.076   0.062   0.078    0.063       0.0726
0.0  0.7    0.035   0.032   0.0323            0.071   0.058   0.073    0.059       0.0706
0.3  0.0    0.043   0.039   0.0428            0.087   0.071   0.089    0.072       0.0906
0.3  0.3    0.040   0.036   0.0364            0.079   0.065   0.082    0.066       0.079
0.3  0.7    0.037   0.033   0.0359            0.074   0.061   0.076    0.062       0.0829
0.7  0.0    0.062   0.055   0.0558            0.111   0.091   0.115    0.092       0.1092
0.7  0.3    0.057   0.051   0.0535            0.103   0.085   0.106    0.085       0.1118
0.7  0.7    0.054   0.048   0.0488            0.096   0.079   0.099    0.080       0.0954

∗True value computed as the variance of $\beta_1$ across the 1,000 replications. All the numbers were multiplied by 100 for readability.

Table D.2 Average of the estimated variance of $\beta_2$ over the 1,000 replications using a rook-type weighting matrix, N = 400, T = 5.
 ρ    ψ      CF     CF_no1  True value (CF)   HACSC   SHAC   Cluster  Non-Robust  True value (2SLS)
0.0  0.0    0.069   0.066   0.0724            0.080   0.066   0.083    0.067       0.0803
0.0  0.3    0.063   0.060   0.0644            0.074   0.061   0.076    0.061       0.0738
0.0  0.7    0.060   0.057   0.0623            0.069   0.056   0.071    0.057       0.0704
0.3  0.0    0.074   0.071   0.0756            0.086   0.070   0.089    0.071       0.0894
0.3  0.3    0.068   0.065   0.0761            0.078   0.065   0.081    0.065       0.0862
0.3  0.7    0.063   0.060   0.0646            0.073   0.061   0.076    0.061       0.0793
0.7  0.0    0.119   0.114   0.1237            0.131   0.108   0.136    0.109       0.1376
0.7  0.3    0.108   0.104   0.1125            0.120   0.099   0.125    0.100       0.1297
0.7  0.7    0.101   0.097   0.1004            0.113   0.093   0.116    0.094       0.113

∗True value computed as the variance of $\beta_2$ across the 1,000 replications. All the numbers were multiplied by 100 for readability.

Table D.3 Average of the estimated variance of $\beta_3$ over the 1,000 replications using a rook-type weighting matrix, N = 400, T = 5.

 ρ    ψ      CF     CF_no1  True value (CF)   HACSC   SHAC   Cluster  Non-Robust  True value (2SLS)
0.0  0.0    0.275   0.242   0.24              0.742   0.615   0.772    0.620       0.79
0.0  0.3    0.252   0.220   0.22              0.680   0.558   0.700    0.563       0.64
0.0  0.7    0.239   0.206   0.21              0.631   0.522   0.654    0.526       0.63
0.3  0.0    0.291   0.252   0.27              0.769   0.636   0.796    0.640       0.81
0.3  0.3    0.271   0.232   0.24              0.705   0.581   0.730    0.587       0.71
0.3  0.7    0.254   0.215   0.23              0.660   0.545   0.681    0.549       0.72
0.7  0.0    0.377   0.314   0.33              0.917   0.758   0.949    0.763       0.90
0.7  0.3    0.350   0.290   0.29              0.860   0.709   0.884    0.712       0.91
0.7  0.7    0.329   0.270   0.29              0.794   0.657   0.824    0.662       0.79

∗True value computed as the variance of $\beta_3$ across the 1,000 replications. All the numbers were multiplied by 100 for readability. CF_no1 refers to the HACSC estimator ignoring the first stage estimation using a CF approach.

Table D.4 Rejection probabilities for the null hypothesis $H_0: \beta_1 = 0.7$ at the 5% significance level using a t-test over the 1,000 replications with a rook-type weighting matrix, N = 400, T = 5.
 ρ    ψ      CF     CF_no1   HACSC    SHAC   Cluster  Non-Robust
0.0  0.0    0.050   0.062    0.068   0.088   0.055    0.081
0.0  0.3    0.047   0.057    0.056   0.072   0.052    0.076
0.0  0.7    0.043   0.054    0.053   0.068   0.046    0.068
0.3  0.0    0.054   0.066    0.068   0.089   0.058    0.088
0.3  0.3    0.048   0.062    0.057   0.073   0.042    0.072
0.3  0.7    0.045   0.067    0.059   0.083   0.060    0.083
0.7  0.0    0.049   0.067    0.065   0.079   0.057    0.079
0.7  0.3    0.047   0.062    0.071   0.093   0.064    0.095
0.7  0.7    0.051   0.064    0.050   0.070   0.040    0.070

Table D.5 Rejection probabilities for the null hypothesis $H_0: \beta_2 = 0.6$ at the 5% significance level using a t-test over the 1,000 replications with a rook-type weighting matrix, N = 400, T = 5.

 ρ    ψ      CF     CF_no1   HACSC    SHAC   Cluster  Non-Robust
0.0  0.0    0.076   0.067    0.069   0.096   0.061    0.093
0.0  0.3    0.078   0.064    0.077   0.101   0.066    0.102
0.0  0.7    0.079   0.072    0.079   0.104   0.074    0.105
0.3  0.0    0.082   0.072    0.089   0.108   0.076    0.105
0.3  0.3    0.082   0.075    0.079   0.106   0.076    0.104
0.3  0.7    0.081   0.069    0.089   0.119   0.078    0.108
0.7  0.0    0.071   0.068    0.075   0.099   0.073    0.099
0.7  0.3    0.077   0.066    0.092   0.111   0.080    0.110
0.7  0.7    0.069   0.064    0.063   0.080   0.053    0.077

APPENDIX E

FIGURES FOR CHAPTER 1

Figure E.1 Distribution of the computed variances of $\hat{\beta}_3$ obtained for the case with $e$ following a spatial AR(1) process ($\rho = 0.7$) and $a$ following an AR(1) ($\psi = 0.3$), $N = 400$, $T = 5$. ∗True value computed as the variance of $\beta_3$ across the 1,000 replications. [Figure: six histogram panels, labeled Control Function with correction, CF ignoring first stage, HACSC, SHAC, Clustered, and Regular, each shown against the true value.]

APPENDIX F

PROOFS FOR CHAPTER 2

Proof of Proposition 2

The problem is to apply P2SLS to
\[
\tilde{s}_{it}y_{it} = \tilde{s}_{it}a_{it}\theta + \tilde{s}_{it}\bar{z}_i\delta
\]
using the instruments $z_{it} = (x_{1it}\;z_{2it}\;w_{1it}\;\mathfrak{z}_{2it})$ when $\tilde{s}_{it} = 1$, where $a_{it} = (x_{1it}\;x_{2it}\;w_{1it}\;w_{2it})$ and $\bar{z}_i = (\bar{x}_{1i}\;\bar{z}_{2i}\;\bar{w}_{1i}\;\bar{\mathfrak{z}}_{2i})$. Note that the averages are taken over the periods for which $\tilde{s}_{it} = 1$. The first step is to orthogonalize the instrumental variables with respect to $\bar{z}_i$. I start by regressing $z_{2it}$ on $\bar{z}_{2i}$.
The associated coefficient will be:
\begin{align*}
\hat{\gamma}_1 &= \Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}\bar{z}_{2i}'\bar{z}_{2i}\Big]^{-1}\Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}\bar{z}_{2i}'z_{2it}\Big] \\
&= \Big[\sum_{i=1}^N \bar{z}_{2i}'\bar{z}_{2i}T_i\Big]^{-1}\Big[\sum_{i=1}^N \bar{z}_{2i}'\sum_{t=1}^T \tilde{s}_{it}z_{2it}\Big] \\
&= \Big[\sum_{i=1}^N \bar{z}_{2i}'\bar{z}_{2i}T_i\Big]^{-1}\Big[\sum_{i=1}^N \bar{z}_{2i}'\bar{z}_{2i}T_i\Big] = \mathbb{I}_l,
\end{align*}
since $\sum_t \tilde{s}_{it}z_{2it} = T_i\bar{z}_{2i}$, where $T_i = \sum_t \tilde{s}_{it}$. Therefore, the residuals from this regression will be $z_{2it} - \bar{z}_{2i} = \ddot{z}_{2it}$.

Now we regress each of the remaining elements of $\bar{z}_i$, i.e. $\bar{x}_{1i}$, $\bar{w}_{1i}$ and $\bar{\mathfrak{z}}_{2i}$, on $\bar{z}_{2i}$. Each set of residuals from these regressions will depend only on the $i$ index, so denote them respectively by $f_i^{x_1}$, $f_i^{w_1}$ and $f_i^{\mathfrak{z}_2}$, and stack them into a vector $f_i$. Now we regress $\ddot{z}_{2it}$ on $f_i$, and the associated coefficient will be:
\[
\hat{\gamma}_2 = \Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}f_i'f_i\Big]^{-1}\Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}f_i'\ddot{z}_{2it}\Big] = \Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}f_i'f_i\Big]^{-1}\Big[\sum_{i=1}^N f_i'\sum_{t=1}^T \tilde{s}_{it}\ddot{z}_{2it}\Big] = 0_{2k_1+l},
\]
where in the last step I used the fact that the sum of deviations from the mean is equal to zero over the periods for which $\tilde{s}_{it} = 1$. Therefore, after this orthogonalization of $z_{2it}$ with respect to $\bar{z}_i$, the residuals are $\ddot{z}_{2it}$. Following very similar steps, it can be shown that after orthogonalizing the remaining elements of $z_{it}$ with respect to $\bar{z}_i$, the set of residuals will be $\ddot{z}_{it} = (\ddot{x}_{1it}\;\ddot{z}_{2it}\;\ddot{w}_{1it}\;\ddot{\mathfrak{z}}_{2it})$. The problem then becomes to apply Pooled 2SLS to $\tilde{s}_{it}y_{it} = \tilde{s}_{it}a_{it}\theta$ using the instruments $\ddot{z}_{it}$.
The associated coefficient will be
\[
\tilde{\theta} = \Big[\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'a_{it}\Big)\Big]^{-1}\cdot\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'y_{it}\Big).
\]
Now focusing on the first element of the square-bracket matrix, and noting that the following algebraic manipulation holds for the remaining terms in the above expression that do not contain demeaned variables, we have:
\begin{align*}
\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it} &= \sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it} - \sum_{i=1}\bar{a}_i'\sum_{t=1}\tilde{s}_{it}(z_{it}-\bar{z}_i) \\
&= \sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it} - \sum_{i=1}\sum_{t=1}\tilde{s}_{it}\bar{a}_i'\ddot{z}_{it} \\
&= \sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it},
\end{align*}
where at the end of the first line I used again the fact that the sum of deviations from the mean is zero over the cases for which $\tilde{s}_{it} = 1$.
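This selected-sample demeaning identity is what drives the equivalence, and it can be confirmed in a small simulation for the simplest exogenous-regressor (OLS rather than 2SLS) version of the problem: on the complete cases, pooled OLS with unit averages taken over the observed periods reproduces the fixed-effects estimator exactly. All quantities below are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, k = 60, 6, 2
c = rng.normal(size=(N, 1, 1))                    # unobserved heterogeneity
x = rng.normal(size=(N, T, k)) + c                # covariates correlated with c
y = x @ np.array([1.0, -0.5]) + c[:, :, 0] + rng.normal(size=(N, T))
s = rng.random(size=(N, T)) < 0.7                 # selection indicator s_it
s[:, 0] = True                                    # keep every unit at least once

# FE on the complete cases: demean using averages over observed periods only
Ti = s.sum(axis=1, keepdims=True)
xbar = (x * s[:, :, None]).sum(axis=1, keepdims=True) / Ti[:, :, None]
ybar = (y * s).sum(axis=1, keepdims=True) / Ti
beta_fe = np.linalg.lstsq((x - xbar)[s], (y - ybar)[s], rcond=None)[0]

# CRE/Mundlak: pooled OLS on the complete cases with the same averages added
X = np.column_stack([x[s], np.broadcast_to(xbar, x.shape)[s], np.ones(s.sum())])
beta_cre = np.linalg.lstsq(X, y[s], rcond=None)[0][:k]

assert np.allclose(beta_fe, beta_cre)             # numerically identical
```

The equality is algebraic, not asymptotic: it holds in every draw because the selected-sample deviations from the unit averages sum to zero, exactly as in the derivation above.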
Therefore we have:
\begin{align*}
\tilde{\theta} &= \Big[\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'a_{it}\Big)\Big]^{-1}\cdot\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'y_{it}\Big) \\
&= \Big[\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{a}_{it}\Big)\Big]^{-1}\cdot\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{y}_{it}\Big) = \hat{\theta}_{CFE2SLS}
\end{align*}

APPENDIX G

TABLES FOR CHAPTER 2

Table G.1 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the data is MCAR.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data           0.0002    0.0167   0.0167      -0.0021   0.0325   0.0326
Complete cases       0.0005    0.0251   0.0251      -0.0023   0.0510   0.0510
Proposed GMM         0.0009    0.0213   0.0213      -0.0025   0.0422   0.0422
Dummy variable       0.0014    0.0346   0.0346      -0.0012   0.0669   0.0669

Table G.2 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the data is MAR.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data           0.0001    0.0164   0.0164      -0.0012   0.0321   0.0321
Complete cases       0.0010    0.0260   0.0260      -0.0020   0.0654   0.0654
Proposed GMM         0.0007    0.0214   0.0214      -0.0019   0.0520   0.0520
Dummy variable       0.0001    0.0395   0.0395      -0.0032   0.0894   0.0894

Table G.3 Average bias, standard deviation and root mean squared error for $\beta_3$ and $\beta_4$ across the 1000 repetitions when the data is MAR.

                    β3: Bias    S.D.     RMSE      β4: Bias    S.D.     RMSE
Whole data          -0.0009    0.0244   0.0244       0.0004   0.0496   0.0496
Complete cases      -0.0010    0.0396   0.0396       0.0014   0.0772   0.0771
Proposed GMM        -0.0000    0.0318   0.0318       0.0028   0.0630   0.0630
Dummy variable       1.1034    0.1455   1.1130       0.3698   0.2049   0.4227

Table G.4 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the data is MCAR and the error term follows a SAR(1) and $N = 900$.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data           0.0000    0.0182   0.0182       0.0000   0.0382   0.0382
Complete cases       0.0001    0.0278   0.0278      -0.0008   0.0560   0.0560
Proposed GMM         0.0001    0.0225   0.0225      -0.0013   0.0467   0.0467
Dummy variable      -0.0000    0.0375   0.0375      -0.0008   0.0760   0.0759

Table G.5 Average bias, standard deviation and root mean squared error for $\beta_3$ and $\beta_4$ across the 1000 repetitions when the data is MCAR and the error term follows a SAR(1) and $N = 900$.

                    β3: Bias    S.D.     RMSE      β4: Bias    S.D.     RMSE
Whole data          -0.0012    0.0260   0.0260      -0.0006   0.0565   0.0565
Complete cases      -0.0018    0.0402   0.0402      -0.0006   0.0821   0.0821
Proposed GMM        -0.0002    0.0330   0.0330       0.0013   0.0674   0.0674
Dummy variable       0.9796    0.1366   0.9891       0.3237   0.1942   0.3774

Table G.6 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the data is MCAR and the error term follows a SAR(1) and $N = 400$.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data           0.0001    0.0278   0.0278       0.0006   0.0582   0.0582
Complete cases       0.0007    0.0417   0.0417      -0.0005   0.0838   0.0838
Proposed GMM         0.0009    0.0342   0.0342       0.0007   0.0713   0.0713
Dummy variable       0.0006    0.0538   0.0537       0.0030   0.1123   0.1123

Table G.7 Average bias, standard deviation and root mean squared error for $\beta_3$ and $\beta_4$ across the 1000 repetitions when the data is MCAR and the error term follows a SAR(1) and $N = 400$.

                    β3: Bias    S.D.     RMSE      β4: Bias    S.D.     RMSE
Whole data          -0.0037    0.0387   0.0389      -0.0036   0.0820   0.0820
Complete cases      -0.0037    0.0598   0.0599      -0.0062   0.1267   0.1268
Proposed GMM        -0.0009    0.0485   0.0484      -0.0043   0.1035   0.1035
Dummy variable       0.9430    0.1944   0.9628       0.3006   0.2817   0.4118

Table G.8 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the missingness depends on $x_1$ and $c_i$.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data          -0.0005    0.0252   0.0252       0.0025   0.0479   0.0479
Complete cases      -0.0013    0.0367   0.0367       0.0059   0.0801   0.0803
Proposed GMM        -0.0012    0.0303   0.0303       0.0030   0.0656   0.0656
Dummy variable      -0.0005    0.0526   0.0526      -0.0044   0.1105   0.1106

Table G.9 Average bias, standard deviation and root mean squared error for $\beta_3$ and $\beta_4$ across the 1000 repetitions when the missingness depends on $x_1$ and $c_i$.

                    β3: Bias    S.D.     RMSE      β4: Bias    S.D.     RMSE
Whole data          -0.0005    0.0338   0.0338       0.0032   0.0691   0.0691
Complete cases      -0.0020    0.0506   0.0507       0.0066   0.1035   0.1037
Proposed GMM        -0.0006    0.0409   0.0409       0.0059   0.0862   0.0864
Dummy variable       1.0765    0.2408   1.1031       0.3505   0.2845   0.4514

APPENDIX H

FIGURES FOR CHAPTER 2

Figure H.1 Distribution of estimated coefficients across the 1000 Monte-Carlo repetitions when the data is MAR. [Figure: panels by estimator (All, CC, GMM) and coefficient ($\beta_1$ through $\beta_4$), each histogram shown against the true value.]

Figure H.2 Distribution of estimated coefficients across the 1000 Monte-Carlo repetitions when the data is MCAR and the error term follows a SAR(1) process with $N = 900$. [Figure: same panel layout.]

Figure H.3 Distribution of estimated coefficients across the 1000 Monte-Carlo repetitions when the data is MCAR and the error term follows a SAR(1) process with $N = 400$. [Figure: same panel layout.]
APPENDIX I

PROOFS FOR CHAPTER 3

Proof of Proposition 1

Because $x_{ijt} = (x_{1ijt}\;x_{2ijt})$, where $x_{2ijt}$ is endogenous, we need to add the instrumental variables. The CRE IV estimator in this case applies Pooled Two-Stage Least Squares (P2SLS) to the following equation:
\[
\tilde{y}_{ijt} = \tilde{x}_{ijt}\beta + \tilde{\bar{x}}_{1ijt}\gamma + \tilde{\bar{z}}_{2ijt}\delta + \text{error},
\]
using the IVs $\tilde{z}_{2ijt}$, where $\tilde{y}_{ijt} = y_{ijt} - \tilde{\theta}_1\bar{y}_{i\cdot\cdot} - \tilde{\theta}_2\bar{y}_{\cdot j\cdot} - \tilde{\theta}_3\bar{y}_{\cdot\cdot t} - \tilde{\theta}_4\bar{y}_{\cdot\cdot\cdot}$, and similarly for each variable. First note that $\tilde{\bar{x}}_{1i\cdot\cdot} = (1-\tilde{\theta}_1)\bar{x}_{1i\cdot\cdot}$, $\tilde{\bar{x}}_{1\cdot j\cdot} = (1-\tilde{\theta}_2)\bar{x}_{1\cdot j\cdot}$, $\tilde{\bar{x}}_{1\cdot\cdot t} = (1-\tilde{\theta}_3)\bar{x}_{1\cdot\cdot t}$ and $\tilde{\bar{x}}_{1\cdot\cdot\cdot} = (1-\tilde{\theta}_4)\bar{x}_{1\cdot\cdot\cdot}$.

As a first step, we orthogonalize the exogenous and instrumental variables $(\tilde{x}_{1ijt}\;\tilde{z}_{2ijt})$ with respect to $\tilde{\bar{z}}_{ijt} = (\tilde{\bar{x}}_{1i\cdot\cdot}\;\tilde{\bar{x}}_{1\cdot j\cdot}\;\tilde{\bar{x}}_{1\cdot\cdot t}\;\tilde{\bar{z}}_{2i\cdot\cdot}\;\tilde{\bar{z}}_{2\cdot j\cdot}\;\tilde{\bar{z}}_{2\cdot\cdot t})$.

a. $\tilde{x}_{1ijt}$ on $(1\;\tilde{\bar{z}}_{ijt})$, where the 1 represents the constant term. Applying the Frisch-Waugh-Lovell (FWL) theorem, to obtain the correct residuals this is equivalent to regressing $\tilde{x}_{1ijt} - \bar{x}_{1\cdot\cdot\cdot}$ on $\tilde{\bar{z}}_{ijt} - \bar{z}_{\cdot\cdot\cdot}$ (i.e., with no constant term). The coefficient associated with this regression will have the typical form $(X'X)^{-1}(X'y)$. Consider the first matrix, which has the following structure:
\[
\begin{pmatrix}
\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot}) & \sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot}) & \cdots \\
\sum_i\sum_j\sum_t (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot}) & \sum_i\sum_j\sum_t (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot}) & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}^{-1}
\]
Each off-diagonal term that is a cross product of different indices (e.g. $\bar{x}_{1i\cdot\cdot}$ and $\bar{x}_{1\cdot j\cdot}$) can be treated as follows:
\[
\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot}) = T\cdot\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]'\Big[\sum_j (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big] = 0.
\]
Therefore, each pair of these regressors is orthogonal to each other in sample. For those pairs of independent variables that have a common index (e.g. regressing $\tilde{x}_{1ijt}$ on $\bar{x}_{1i\cdot\cdot}$ and $\bar{z}_{2i\cdot\cdot}$), using the fact that each variable has been centered around its overall mean and applying the FWL theorem, it can be shown that after partialling out the variable that is not associated with the dependent variable (in this case $\bar{z}_{2i\cdot\cdot}$), we recover the same coefficient as if we ran $\tilde{x}_{1ijt}$ on $\bar{x}_{1i\cdot\cdot}$ directly.

Now I show that the coefficient associated with each element of $\bar{x}_1$ in this regression is an identity matrix of size $k_1$, and $0$ for the elements of $\bar{z}_2$. For example, the parameter vector associated with $\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdot\cdot\cdot}$ is:
\begin{align*}
\hat{\pi}_{\bar{x}_{1i\cdot\cdot}} &= \Big[\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(x_{1ijt}-\bar{x}_{1\cdot\cdot\cdot})\Big] \\
&= \Big[N_2 T\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'\sum_j\sum_t (x_{1ijt}-\bar{x}_{1\cdot\cdot\cdot})\Big] \\
&= \Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'\frac{1}{N_2 T}\sum_j\sum_t (x_{1ijt}-\bar{x}_{1\cdot\cdot\cdot})\Big] \\
&= \Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big] = \mathbb{I}_{k_1}.
\end{align*}
Therefore, each explanatory variable associated with the averages of $x_1$ will have a coefficient vector equal to an identity matrix.
On the other hand, it can be shown that the coefficients associated with the $\tilde{\bar{z}}_2$ variables are $0$, using the fact that we obtain sums of vectors in deviation from their overall mean.

b. $\tilde{z}_{2ijt}$ on $(1\;\tilde{\bar{z}}_{ijt})$. Using very similar arguments as in the previous step, the coefficients associated with the $\tilde{\bar{z}}_2$ variables will be identity matrices of size $l$, and the ones associated with $\tilde{\bar{x}}_1$ will be $0$.

Given this, after partialling out $\tilde{\bar{z}}_{ijt}$, the associated residuals will be:
\begin{align*}
\ddot{x}_{1ijt} &= x_{1ijt} - \bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdot\cdot t} + 2\bar{x}_{1\cdot\cdot\cdot} \\
\ddot{z}_{2ijt} &= z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}
\end{align*}
The problem has now become to apply P2SLS to $\tilde{y}_{ijt} = \tilde{x}_{ijt}\beta$ using the IVs $\ddot{z}_{2ijt} = z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}$ (collected, together with $\ddot{x}_{1ijt}$, in $\ddot{z}_{ijt}$). From this, note that
\[
\hat{\beta}_{2SLS} = \Big[\Big(\sum_i\sum_j\sum_t \tilde{x}_{ijt}'\ddot{z}_{ijt}\Big)\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{z}_{ijt}\Big)^{-1}\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\tilde{x}_{ijt}\Big)\Big]^{-1}\cdot\Big[\Big(\sum_i\sum_j\sum_t \tilde{x}_{ijt}'\ddot{z}_{ijt}\Big)\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{z}_{ijt}\Big)^{-1}\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\tilde{y}_{ijt}\Big)\Big].
\]
Note that
\begin{align*}
\sum_i\sum_j\sum_t \tilde{x}_{ijt}'\ddot{z}_{ijt} &= \sum_i\sum_j\sum_t x_{ijt}'\ddot{z}_{ijt} - \sum_i\sum_j\sum_t \tilde{\theta}_1\bar{x}_{i\cdot\cdot}'(z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}) \\
&\quad - \sum_i\sum_j\sum_t \tilde{\theta}_2\bar{x}_{\cdot j\cdot}'(z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}) \\
&\quad - \sum_i\sum_j\sum_t \tilde{\theta}_3\bar{x}_{\cdot\cdot t}'(z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}) \\
&\quad - \sum_i\sum_j\sum_t \tilde{\theta}_4\bar{x}_{\cdot\cdot\cdot}'(z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot})
\end{align*}
Focusing on the first of the correction terms,
\begin{align*}
\sum_i\sum_j\sum_t \tilde{\theta}_1\bar{x}_{i\cdot\cdot}'(z_{ijt} - \bar{z}_{i\cdot\cdot} - \bar{z}_{\cdot j\cdot} - \bar{z}_{\cdot\cdot t} + 2\bar{z}_{\cdot\cdot\cdot}) &= \sum_i \tilde{\theta}_1\bar{x}_{i\cdot\cdot}'\sum_j\sum_t (z_{ijt} - \bar{z}_{i\cdot\cdot} - \bar{z}_{\cdot j\cdot} - \bar{z}_{\cdot\cdot t} + 2\bar{z}_{\cdot\cdot\cdot}) \\
&= \sum_i \tilde{\theta}_1\bar{x}_{i\cdot\cdot}'(N_2T\bar{z}_{i\cdot\cdot} - N_2T\bar{z}_{i\cdot\cdot} - N_2T\bar{z}_{\cdot\cdot\cdot} - N_2T\bar{z}_{\cdot\cdot\cdot} + 2N_2T\bar{z}_{\cdot\cdot\cdot}) = 0,
\end{align*}
where we used the fact that the terms in parentheses add up to zero. Using similar arguments for the rest of the expression, we can easily show that
\[
\sum_i\sum_j\sum_t \tilde{x}_{ijt}'\ddot{z}_{ijt} = \sum_i\sum_j\sum_t \ddot{x}_{ijt}'\ddot{z}_{ijt}.
\]
And applying the same logic to the rest of the terms of $\hat{\beta}_{2SLS}$, it follows that
\[
\hat{\beta}_{2SLS} = \Big[\Big(\sum_i\sum_j\sum_t \ddot{x}_{ijt}'\ddot{z}_{ijt}\Big)\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{z}_{ijt}\Big)^{-1}\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{x}_{ijt}\Big)\Big]^{-1}\cdot\Big[\Big(\sum_i\sum_j\sum_t \ddot{x}_{ijt}'\ddot{z}_{ijt}\Big)\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{z}_{ijt}\Big)^{-1}\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{y}_{ijt}\Big)\Big] = \hat{\beta}_{FE2SLS}.
\]

Proof of Proposition 2

For notational simplicity, I prove the P2SLS case; similar ideas apply to a GLS-type transformation. We want to show that applying P2SLS to
\[
y_{ijt} = x_{1ijt}\beta_1 + x_{2ijt}\beta_2 + \bar{x}_{1ijt}\gamma_1 + \bar{x}_{2ijt}\gamma_2 = x_{ijt}\beta + \bar{x}_{ijt}\gamma
\]
using the IVs $(z_{2ijt}\;\bar{z}_{ijt})$, where $\bar{x}_{ijt} = (\bar{x}_{i\cdot\cdot}\;\bar{x}_{\cdot j\cdot}\;\bar{x}_{\cdot\cdot t})$ (and similarly for the other variables), yields the same $\beta$ as $\hat{\beta}_{FE2SLS}$. To show the result, I follow these steps:

1. Orthogonalize the IVs and the exogenous variables $(x_{1ijt}\;z_{2ijt})$ with respect to $\bar{x}_{1ijt}$.
2. Orthogonalize with respect to $\bar{z}_{2ijt}$ in the first-stage equation.
3. Use the FWL theorem to show the equivalence.

Step 1

a.
Regress $z_{2ijt}$ on $\bar{x}_{1ijt} = (1\;\bar{x}_{1i\cdot\cdot}\;\bar{x}_{1\cdot j\cdot}\;\bar{x}_{1\cdot\cdot t})$. Equivalently, applying the FWL theorem, to obtain the correct residuals we can regress $z_{2ijt} - \bar{z}_{2\cdot\cdot\cdot}$ on $\big[(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\;\;(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\;\;(\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})\big]$. The residuals from this regression will be
\[
m_{ijt} = z_{2ijt} - \bar{z}_{2\cdot\cdot\cdot} - (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\eta_1 - (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\eta_2 - (\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})\eta_3.
\]
First note that the regressors are orthogonal in sample. For example:
\[
\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot}) = T\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]'\Big[\sum_j (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big] = 0,
\]
since we are subtracting the overall mean from both sums of vectors. Therefore, we can find each $\eta_s$ by regressing the dependent variable on each regressor individually, and therefore:
\begin{align*}
\hat{\eta}_1 &= \Big[\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(z_{2ijt}-\bar{z}_{2\cdot\cdot\cdot})\Big] \\
\hat{\eta}_2 &= \Big[\sum_i\sum_j\sum_t (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i\sum_j\sum_t (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(z_{2ijt}-\bar{z}_{2\cdot\cdot\cdot})\Big] \\
\hat{\eta}_3 &= \Big[\sum_i\sum_j\sum_t (\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i\sum_j\sum_t (\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})'(z_{2ijt}-\bar{z}_{2\cdot\cdot\cdot})\Big]
\end{align*}
Note that each of the coefficients can be rewritten; for example:
\[
\hat{\eta}_1 = \Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{z}_{2i\cdot\cdot}-\bar{z}_{2\cdot\cdot\cdot})\Big].
\]

b. $\bar{z}_{2ijt}$ on $\bar{x}_{1ijt} = (1\;\bar{x}_{1i\cdot\cdot}\;\bar{x}_{1\cdot j\cdot}\;\bar{x}_{1\cdot\cdot t})$, where $\bar{z}_{2ijt} = (\bar{z}_{2i\cdot\cdot}\;\bar{z}_{2\cdot j\cdot}\;\bar{z}_{2\cdot\cdot t})$. Similarly to the previous case, we can regress $(\bar{z}_{2ijt} - \bar{z}_{2\cdot\cdot\cdot})$ on $\big[(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\;\;(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\;\;(\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})\big]$. Because the regressors are orthogonal in sample, we can again obtain the coefficients by running individual regressions.

a) Consider the regression of $\bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot\cdot\cdot}$ on $\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdot\cdot\cdot}$.
The coefficient will be
$$\hat{\eta}_4 = \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdots}) \right]$$
$$= \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdots}) \right] = \hat{\eta}_1$$
Similarly, we can show that the coefficients of $\bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots}$ on $\bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdots}$ ($\hat{\eta}_5$) and of $\bar{z}_{2\cdot\cdot t} - \bar{z}_{2\cdots}$ on $\bar{x}_{1\cdot\cdot t} - \bar{x}_{1\cdots}$ ($\hat{\eta}_6$) will be equal to $\hat{\eta}_2$ and $\hat{\eta}_3$ respectively.
b) Consider now the case of "cross terms", i.e. regressing an average over one dimension on a variable averaged over a different dimension. For example, if we regress $\bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots}$ on $\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}$, the coefficient, say $\zeta_1$, will be:
$$\hat{\zeta}_1 = \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots}) \right]$$
$$= \left[ N_2 T \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ T \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})' \sum_j (\bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots}) \right] = 0$$
since the deviations from the overall mean add up to 0. Similarly, we can show that all the coefficients from the "cross terms" are 0. Therefore, the residuals from this stage will be
$$v_i = \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdots} - (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})\hat{\eta}_4$$
$$v_j = \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots} - (\bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdots})\hat{\eta}_5$$
$$v_t = \bar{z}_{2\cdot\cdot t} - \bar{z}_{2\cdots} - (\bar{x}_{1\cdot\cdot t} - \bar{x}_{1\cdots})\hat{\eta}_6$$
c. Regress $x_{1ijt} - \bar{x}_{1\cdots}$ on $\bar{x}_{1ijt} - \bar{x}_{1\cdots}$.
The residuals from this regression will be
$$l_{ijt} = x_{1ijt} - \bar{x}_{1\cdots} - (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})\hat{\varepsilon}_1 - (\bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdots})\hat{\varepsilon}_2 - (\bar{x}_{1\cdot\cdot t} - \bar{x}_{1\cdots})\hat{\varepsilon}_3$$
Note that
$$\hat{\varepsilon}_1 = \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(x_{1ijt} - \bar{x}_{1\cdots}) \right]$$
$$= \left[ N_2 T \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})' \sum_j \sum_t (x_{1ijt} - \bar{x}_{1\cdots}) \right]$$
$$= \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})' \frac{1}{N_2 T} \sum_j \sum_t (x_{1ijt} - \bar{x}_{1\cdots}) \right]$$
$$= \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right] = \mathrm{I}_{k_1}$$
and similarly we can show that $\hat{\varepsilon}_2$ and $\hat{\varepsilon}_3$ are also identity matrices. Therefore, $l_{ijt} = x_{1ijt} - \bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdot\cdot t} + 2\bar{x}_{1\cdots} = \ddot{x}_{1ijt}$. After this orthogonalization, the problem becomes to apply P2SLS to
$$y_{ijt} = \ddot{x}_{1ijt}\beta_1 + x_{2ijt}\beta_2 + \bar{x}_{2ijt}\gamma_2$$
using IVs $(m_{ijt}\ v_i\ v_j\ v_t)$.

Step 2

Now I partial out $v_i$, $v_j$, $v_t$ in the first stage equation; these are the residuals associated with $\bar{z}_{2i\cdot\cdot}$, $\bar{z}_{2\cdot j\cdot}$, $\bar{z}_{2\cdot\cdot t}$ respectively. Based on their definitions, because $\hat{\eta}_1 = \hat{\eta}_4$, $\hat{\eta}_2 = \hat{\eta}_5$ and $\hat{\eta}_3 = \hat{\eta}_6$, and following a procedure similar to Step 1.a, it can be shown that $v_i$, $v_j$ and $v_t$ are orthogonal to each other in sample.
1. Regress $m_{ijt}$ on $(1\ v_i\ v_j\ v_t)$. If we write the fitted regression as $m_{ijt} = v_i\tilde{\eta}_1 + v_j\tilde{\eta}_2 + v_t\tilde{\eta}_3$ plus a residual, then
$$\tilde{\eta}_1 = \left[ \sum_i \sum_j \sum_t v_i'v_i \right]^{-1} \left[ \sum_i \sum_j \sum_t v_i'm_{ijt} \right] = \left[ \sum_i v_i'v_i \right]^{-1} \left[ \sum_i v_i'\bar{m}_{i\cdot\cdot} \right]$$
Note that $\bar{m}_{i\cdot\cdot} = (\bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdots}) - (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})\hat{\eta}_1$ and, because $\hat{\eta}_1 = \hat{\eta}_4$, we have $\bar{m}_{i\cdot\cdot} = v_i$, so it follows that $\tilde{\eta}_1 = \mathrm{I}_l$; analogous arguments apply to $\tilde{\eta}_2$ and $\tilde{\eta}_3$. Therefore, the residuals from this regression are $\ddot{z}_{2ijt}$, where the definition of $\ddot{z}_{2ijt}$ is analogous to that of $\ddot{x}_{1ijt}$.
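The identity-coefficient calculations in Step 1.c, and the resulting triple-demeaned residual, can be verified numerically. The following is my own minimal sketch, not part of the dissertation, using a scalar $x_{1ijt}$ in a balanced three-way panel generated with NumPy:

```python
# Numerical check of Step 1.c (illustrative sketch): regressing
# x1_ijt - xbar on the three demeaned group averages yields unit
# coefficients, and the residual is the triple-demeaned x1_ijt.
import numpy as np

rng = np.random.default_rng(0)
N1, N2, T = 6, 5, 4
x1 = rng.normal(size=(N1, N2, T))           # scalar x_{1ijt}

xb_i = x1.mean(axis=(1, 2), keepdims=True)  # xbar_{1i..}
xb_j = x1.mean(axis=(0, 2), keepdims=True)  # xbar_{1.j.}
xb_t = x1.mean(axis=(0, 1), keepdims=True)  # xbar_{1..t}
xb = x1.mean()                              # xbar_{1...}

# Regress (x1 - xbar) on the three demeaned averages (all mean zero,
# so no constant is needed)
y = (x1 - xb).ravel()
X = np.column_stack([
    np.broadcast_to(xb_i - xb, x1.shape).ravel(),
    np.broadcast_to(xb_j - xb, x1.shape).ravel(),
    np.broadcast_to(xb_t - xb, x1.shape).ravel(),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Triple-demeaned ("within") transformation, as in the text
x_ddot = (x1 - xb_i - xb_j - xb_t + 2 * xb).ravel()

assert np.allclose(coef, 1.0)      # scalar analogue of eps-hat = identity
assert np.allclose(resid, x_ddot)  # residual equals the within transform
```

The same check confirms in passing that the three regressors are mutually orthogonal in sample, which is why `lstsq` reproduces the one-regressor-at-a-time coefficients.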
Originally, the first stage was
$$x_{2ijt} = x_{1ijt}\phi_1 + z_{2ijt}\phi_2 + \bar{x}_{1ijt}\rho_1 + \bar{z}_{2ijt}\rho_2$$
Since we have partialled out $\bar{x}_{1ijt}$ and $\bar{z}_{2ijt}$, to get $\phi_1$ and $\phi_2$ we regress $x_{2ijt}$ on $\ddot{z}_{ijt} = [\ddot{x}_{1ijt}\ \ddot{z}_{2ijt}]$. For $\phi = [\phi_1\ \phi_2]$, we have
$$\hat{\phi} = \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}\ddot{z}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}x_{2ijt} \right]$$
$$= \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}\ddot{z}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}x_{2ijt} - \sum_i \sum_j \sum_t \ddot{z}'_{ijt}(\bar{x}_{2i\cdot\cdot} + \bar{x}_{2\cdot j\cdot} + \bar{x}_{2\cdot\cdot t} - 2\bar{x}_{2\cdots}) \right]$$
$$= \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}\ddot{z}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}\ddot{x}_{2ijt} \right]$$
where the second line uses the fact that the deviations of $\ddot{z}_{ijt}$ sum to zero over each index, so the subtracted term equals 0. Therefore, $\hat{\phi}$ can also be obtained by regressing $\ddot{x}_{2ijt}$ on $\ddot{z}_{ijt}$.

Step 3

Now the problem becomes to apply P2SLS to
$$y_{ijt} = \ddot{x}_{1ijt}\beta_1 + x_{2ijt}\beta_2 + \bar{x}_{2ijt}\delta_2$$
using IVs $[\ddot{z}_{2ijt}\ \bar{z}_{2ijt}]$. The second stage of the problem is to apply POLS to
$$y_{ijt} = \ddot{x}_{1ijt}\beta_1 + \hat{x}_{2ijt}\beta_2 + \hat{\bar{x}}_{2ijt}\delta_2$$
To get $\beta$, I orthogonalize with respect to $\hat{\bar{x}}_{2ijt}$:
1. Regress $\ddot{x}_{1ijt}$ on $\hat{\bar{x}}_{2ijt} = (1\ \hat{\bar{x}}_{2i\cdot\cdot}\ \hat{\bar{x}}_{2\cdot j\cdot}\ \hat{\bar{x}}_{2\cdot\cdot t})$. Using the fact that the explanatory variables are averages over different dimensions and that the deviations of $\ddot{x}_{1ijt}$ sum to zero over each index, it can be shown that the vector of coefficients is equal to $0$.
2. Regress $\hat{x}_{2ijt}$ on $\hat{\bar{x}}_{2ijt}$, or equivalently $\hat{x}_{2ijt} - \hat{\bar{x}}_{2\cdots}$ on $\hat{\bar{x}}_{2ijt} - \hat{\bar{x}}_{2\cdots}$. Using arguments similar to Step 1.c, it can be shown that the associated coefficient in this regression will be $\mathrm{I}_{k_2}$ and the residuals will be
$$\hat{\ddot{x}}_{2ijt} = \hat{x}_{2ijt} - \hat{\bar{x}}_{2i\cdot\cdot} - \hat{\bar{x}}_{2\cdot j\cdot} - \hat{\bar{x}}_{2\cdot\cdot t} + 2\hat{\bar{x}}_{2\cdots}$$
Therefore, to find $\beta$, we run POLS on $y_{ijt} = \ddot{x}_{1ijt}\beta_1 + \hat{\ddot{x}}_{2ijt}\beta_2$.
Letting $\hat{\ddot{x}}_{ijt} = (\ddot{x}_{1ijt}\ \hat{\ddot{x}}_{2ijt})$,
$$\hat{\beta} = \left[ \sum_i \sum_j \sum_t \hat{\ddot{x}}'_{ijt}\hat{\ddot{x}}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \hat{\ddot{x}}'_{ijt}y_{ijt} \right]$$
Using a similar argument as in Step 2 for $x_{2ijt}$, it can be shown that
$$\hat{\beta} = \left[ \sum_i \sum_j \sum_t \hat{\ddot{x}}'_{ijt}\hat{\ddot{x}}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \hat{\ddot{x}}'_{ijt}\ddot{y}_{ijt} \right] = \hat{\beta}_{FE2SLS}$$
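The equivalence just derived can be illustrated numerically. The sketch below is my own and not part of the dissertation; the simulated design, variable names, and parameter values are illustrative assumptions. It builds a balanced three-way panel with one exogenous regressor, one endogenous regressor, and one instrument, then compares just-identified FE2SLS on triple-demeaned data with pooled 2SLS that adds the Mundlak averages as controls and instruments:

```python
# Numerical illustration (not from the dissertation): pooled 2SLS with
# Mundlak averages reproduces FE2SLS in a balanced three-way panel.
import numpy as np

rng = np.random.default_rng(1)
N1, N2, T = 8, 7, 5
shape = (N1, N2, T)

a_i = rng.normal(size=(N1, 1, 1))     # i-heterogeneity
a_j = rng.normal(size=(1, N2, 1))     # j-heterogeneity
a_t = rng.normal(size=(1, 1, T))      # time effect
c = a_i + a_j + a_t

u = rng.normal(size=shape)
x1 = rng.normal(size=shape) + c                  # exogenous regressor
z2 = rng.normal(size=shape) + c                  # instrument
x2 = z2 + 0.5 * u + c + rng.normal(size=shape)   # endogenous regressor
y = 1.0 * x1 + 2.0 * x2 + c + u

def triple_demean(a):
    return (a - a.mean(axis=(1, 2), keepdims=True)
              - a.mean(axis=(0, 2), keepdims=True)
              - a.mean(axis=(0, 1), keepdims=True) + 2 * a.mean())

def means(a):
    # the three Mundlak averages, broadcast to the full panel
    return [np.broadcast_to(a.mean(axis=ax, keepdims=True), shape).ravel()
            for ax in [(1, 2), (0, 2), (0, 1)]]

# FE2SLS: just-identified 2SLS on triple-demeaned data
Xd = np.column_stack([triple_demean(x1).ravel(), triple_demean(x2).ravel()])
Zd = np.column_stack([triple_demean(x1).ravel(), triple_demean(z2).ravel()])
b_fe = np.linalg.solve(Zd.T @ Xd, Zd.T @ triple_demean(y).ravel())

# Pooled 2SLS: averages of x1 and x2 as controls, averages of z2 as IVs
ones = np.ones(x1.size)
W = np.column_stack([x1.ravel(), x2.ravel(), ones, *means(x1), *means(x2)])
Z = np.column_stack([x1.ravel(), z2.ravel(), ones, *means(x1), *means(z2)])
b_cre = np.linalg.solve(Z.T @ W, Z.T @ y.ravel())

assert np.allclose(b_fe, b_cre[:2])  # beta1, beta2 coincide across methods
```

The key mechanics match the proof: the triple-demeaned instrument is a linear combination of the columns of $Z$, and it is orthogonal in sample to the constant and to every average, so the CRE moment conditions collapse to the FE2SLS normal equations.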