ESSAYS IN SPATIAL PANEL DATA ECONOMETRICS

By

Steven Wu-Chaves

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics—Doctor of Philosophy

2024

ABSTRACT

Chapter 1: Robust inference in short linear panels with fixed effects with endogenous covariates in a spatial setting

In this chapter, I propose a simple way to obtain robust standard errors in linear panels in a spatial context with endogenous covariates, where the number of time periods is small relative to the cross-sectional dimension. The method is based on applying a Spatial HAC to an average of moment conditions across time to obtain a covariance estimator that is robust to both spatial and serial correlation (HACSC). I also present a control function (CF) alternative for estimating the parameters and extend the HACSC estimator to this case, where the standard errors require an adjustment to account for the sampling variability induced by the first-stage estimation. In addition, I derive the Fixed Effects-Random Effects equivalence under a Correlated Random Effects framework in the presence of a spatial lag of the dependent variable to obtain a fully robust Hausman-type test using the HACSC estimator. I run a Monte Carlo experiment and show that the HACSC estimator is robust to strong patterns of serial and spatial correlation. I also find that whenever the CF assumptions hold, the CF approach is more efficient than Two-Stage Least Squares. Finally, I estimate the effect of school district spending on the performance of fourth-grade students in Michigan, allowing for spillovers across districts. I find that the expenditure of neighboring districts has a positive and non-negligible impact on test passing rates.
Chapter 2: Estimation of models with spatial panels and missing observations in the covariates

Missing data problems are more serious in spatial models with spillover effects, as the efficiency loss induced by estimators that use only the complete cases is larger. In this paper, I present a GMM estimator that uses the information in both the complete and incomplete observations for models with spatial spillover effects and missing data on the potentially endogenous variables, with the aim of obtaining efficiency gains. I also derive the Fixed Effects and Random Effects equivalence for spatial panels with missing data and develop an alternative GMM estimator in this Correlated Random Effects framework. The Monte Carlo simulations show significant efficiency gains of the GMM estimator compared to estimators that only use the complete cases.

Chapter 3: Estimation of models with multiple fixed effects and endogenous variables: a correlated random effects approach

The inclusion of multiple individual heterogeneities and time effects, more commonly referred to as "fixed effects," is common practice in panel data. A common approach to deal with these is to estimate the model using the fixed effects estimator by applying the within transformation, which has the disadvantage of removing all variables that are constant across one of the dimensions of the data set. An alternative method is the correlated random effects approach using the Mundlak device, which restricts the dependence between the heterogeneities and the covariates in a particular way. In this paper, I show that the fixed effects estimates can be recovered using the Mundlak approach in models with three sets of heterogeneities and in the presence of endogenous variables. Furthermore, I prove that this equivalence can be obtained using two different sets of covariates.

Copyright by STEVEN WU-CHAVES 2024

To my parents.
ACKNOWLEDGEMENTS

First, I am deeply thankful to my parents for their continued support. I am also grateful to Lucia, without whose encouragement and help I would not have been able to complete this journey. To the rest of my family and furry friends, thank you for the love and support that helped me in every step of this process.

Second, I would like to extend a sincere thank you to my committee chair, Jeffrey Wooldridge, for all his advice, guidance and support. I also want to thank Kyoo il Kim, Tim Vogelsang and Guo Chen for serving on my committee and providing excellent feedback and support. I also want to recognize Lisa Cook, Richard Baillie, Susan Zhu, Carl Davidson, Leslie Papke, Hugo Freeman, Soren Anderson, Scott Imberman, Todd Elder and Steven Haider for their help and the opportunities they provided me during the program. I am also grateful to Jay Feight and Lori Jean Nichols for their assistance as I navigated the program.

Finally, there are several classmates I also want to thank. First and foremost, to Minkyu Kim for his friendship, help and support over these years. To Salem Rogers, Raghav Rakesh, Chia-Hung Kuo, Benjamin Miller and Soo Jeong Lee, thank you for all the moments and conversations we shared together.

TABLE OF CONTENTS

CHAPTER 1 ROBUST INFERENCE IN SHORT LINEAR PANELS WITH FIXED EFFECTS WITH ENDOGENOUS COVARIATES IN A SPATIAL SETTING . . . . 1

CHAPTER 2 ESTIMATION OF MODELS WITH SPATIAL PANELS AND MISSING OBSERVATIONS IN THE COVARIATES . . . . 45

CHAPTER 3 ESTIMATION OF MODELS WITH MULTIPLE FIXED EFFECTS AND ENDOGENOUS VARIABLES: A CORRELATED RANDOM EFFECTS APPROACH . . . . 70

BIBLIOGRAPHY . . . . 81

APPENDIX A ADDITIONAL ASSUMPTIONS AND DEFINITIONS FOR CHAPTER 1 . . . .
87

APPENDIX B PROOFS FOR CHAPTER 1 . . . . 90

APPENDIX C DERIVATION OF THE COVARIANCE MATRIX FOR THE CONTROL FUNCTION APPROACH . . . . 104

APPENDIX D TABLES FOR CHAPTER 1 . . . . 108

APPENDIX E FIGURES FOR CHAPTER 1 . . . . 111

APPENDIX F PROOFS FOR CHAPTER 2 . . . . 112

APPENDIX G TABLES FOR CHAPTER 2 . . . . 114

APPENDIX H FIGURES FOR CHAPTER 2 . . . . 117

APPENDIX I PROOFS FOR CHAPTER 3 . . . . 119

CHAPTER 1

ROBUST INFERENCE IN SHORT LINEAR PANELS WITH FIXED EFFECTS WITH ENDOGENOUS COVARIATES IN A SPATIAL SETTING

1.1 Introduction

The assumption of independent data is widespread in empirical economics since it simplifies many estimation methods. However, in fields such as international trade, urban economics, public policy, and network analysis, this assumption might not hold, since the outcome variable of one unit may be affected by other units' actions, which leads to (spatially) dependent data. Furthermore, many of the tools used to develop the asymptotic theory behind popular econometric methods, such as the Central Limit Theorem and the Law of Large Numbers, often rely on independent and identically distributed (i.i.d.) data. This facilitates both estimation and inference, but if the assumption is violated, inference becomes more difficult even if the parameters are estimated consistently. Additionally, the growing availability of data sets has increased the popularity of panel methods in recent years, as they make it possible to incorporate time effects and to estimate richer models.
Nevertheless, panel methods also introduce complications because the presence of unobserved heterogeneity can generate inconsistency in both the parameters and the standard errors if it is not properly handled. When combining spatially dependent observations and panel data, inference becomes more challenging since the error term can be both serially and spatially correlated. To address the spatial correlation, the literature has usually resorted to assuming and modeling a particular structure for the error term, as was once common with time series data. However, since the seminal work of White (1980), the common practice in time series is to use standard errors that are robust to general forms of heteroskedasticity and autocorrelation (HAC). This procedure has been extended to the spatial framework (SHAC) by Conley (1999) and Kelejian and Prucha (2007) in a cross-sectional setting. However, to the best of my knowledge and surprisingly enough, it has not been extended to the panel case where the time dimension is fixed and the number of units of observation goes to infinity, even in the linear case.¹ Admittedly, there are many cases in which the time dimension is also large; however, there are also instances where the number of observations across time is considerably smaller than the cross-sectional dimension. Ignoring the serial correlation in that case can still bias the standard errors, even if the associated covariance matrix is robust to spatial correlation. Indeed, some of the estimators that have been proposed in the literature and implemented in software packages make the standard errors robust to only one of these dimensions. For example, in Stata, a very popular statistical analysis package, one of the few routines for panel data in a spatial context corrects the standard errors for spatial correlation but assumes serial independence of the error terms.
The main purpose of this paper is to propose a simple way to obtain standard errors in a linear panel that are robust to heteroskedasticity and to both spatial and serial correlation (HACSC), without imposing any structure on the time dimension, in a Fixed Effects framework with endogenous covariates. I also extend this procedure to the control function approach, where the computation of standard errors is more difficult due to the presence of a generated regressor.

HAC estimators have been used extensively in the time series literature since they avoid having to model the error term structurally, which can lead to inconsistency if that process is misspecified. Newey and West (1987) were the first to extend White's estimator to allow for general forms of heteroskedasticity and autocorrelation. In the panel case, Arellano (1987) introduced panel clustered standard errors, which are robust to heteroskedasticity and autocorrelation but require observations in different clusters to be uncorrelated.

In spatial panels, multiple authors have made important contributions to the field, extending many of the methods developed in the time series literature. For example, Driscoll and Kraay (1998) showed how to deal with spatially dependent panel data in a GMM context by averaging the moment conditions over the cross-sectional dimension, indexed by 𝑁. Their approach relies on holding 𝑁 fixed and letting the time dimension 𝑇 → ∞. Vogelsang (2012) develops asymptotic theory for linear spatial panels with fixed effects in a fixed-b framework by averaging HAC estimators and by computing the HAC for averages as in Driscoll and Kraay (1998). In this case, the asymptotics again rely on 𝑇 → ∞, allowing 𝑁 to remain fixed or to grow.

¹ Perhaps one of the reasons is that econometricians assume that it is obvious what to do, but many methods make strong assumptions in the time dimension, like serial independence.
In a similar context, Kim and Sun (2013) proposed a bivariate-kernel HACSC estimator, which requires both the cross-sectional and time dimensions to go to infinity. Bester et al. (2011) suggested a cluster covariance matrix that is applicable when the data are dependent, in the context of time series, spatial and panel data. More recently, Müller and Watson (2022a) introduced a new methodology to construct confidence intervals based on population principal components, with the property that the resulting interval has a coverage probability of 95% for a set of spatial patterns in a cross-sectional setting. Müller and Watson (2022b) extended this framework to spatial panels to cover estimation techniques like difference-in-differences setups.

At the cross-sectional level, Conley (1999) was the first to develop a Spatial HAC (SHAC) estimator in a GMM context. His approach is based on the assumption that the data generating process is spatially stationary. When working with dependent data and allowing 𝑁 → ∞, it is common to assume some sort of weak dependence mechanism, analogous to the time series literature, so that the influence of one observation on other units diminishes as the distance between them increases. In this case, Conley assumes that the data are spatially 𝛼-mixing. Bester et al. (2016) provide a fixed-b analysis of Conley's SHAC estimator. Kelejian and Prucha (2007) relax the spatial stationarity assumption and model the spatial dependence in terms of a weighting matrix, arguing that having different numbers of neighbors, as is common in empirical work, violates that assumption. In this respect, the notion of assigning weights to different units based on their distance to a particular point has been used in many fields. For example, in urban economics, McMillen (1996) used locally weighted regressions to estimate the value of land in Chicago, where each observation is given a specific weight based on its distance to the central business district.
In the same spirit, the geographically weighted regression in the geography literature uses a very similar concept to model the idea that there might be spatial variability in models involving geo-referenced data (Wheeler & Tiefelsdorf, 2005). It is important to note that Kelejian and Prucha's SHAC estimator is based on consistent estimates of the error terms, but they do not provide any parameter estimation framework. Kim and Sun (2011) generalize this estimator to allow for general linear and nonlinear models using moment conditions. Conley and Molinari (2007) performed a Monte Carlo study comparing the performance of multiple covariance estimators with dependent data in the context of locations measured with error, and they concluded that nonparametric estimators work better than parametric ones such as GMM and maximum likelihood estimators.

In this paper, I follow Driscoll and Kraay's approach, but instead of averaging the moment conditions over the cross-sectional dimension, I average the moment conditions over time, construct a GMM estimator, and then apply Kelejian and Prucha's SHAC to the corresponding residuals. By doing this, I avoid imposing any assumptions on the serial correlation and hence construct a covariance estimator that is robust to both serial and spatial correlation.

Beyond testing the statistical significance of the effect of a covariate on the response variable, robust inference is also important when trying to choose the correct specification of a model. More specifically, the correlated random effects (CRE) approach has been very popular in recent years because it is a simple way to test between Random Effects (RE) and Fixed Effects (FE) specifications and it allows the inclusion of time-constant variables, as noted by Joshi and Wooldridge (2019).
Furthermore, we can obtain the FE coefficients of the time-varying variables by including their time averages on the right-hand side of the equation in a Pooled OLS or RE regression, a result attributed to Mundlak (1978). Debarsy (2012) was the first to extend the Mundlak approach to the spatial setting. More recently, Li and Yang (2020) showed that when the model includes a structurally modeled error term (which involves maximum likelihood estimation), the equivalence holds conditional on the parameter associated with the error term; however, the equivalence breaks unconditionally, i.e., when this parameter has to be estimated jointly with the rest of the parameters. In this paper, I show that the result holds in a specific setting, namely, when the model does not include a structurally modeled error term.

One of the additional advantages of not imposing a particular spatial structure on the error term is that some estimation methods become readily available, such as Two-Stage Least Squares (2SLS) or a Control Function (CF) approach (Blundell & Powell, 2003), whenever the researcher suspects an endogenous variable is in the model. In fact, adding a spatial lag of the response variable as a covariate yields the spatial autoregressive model (SAR), a very popular model in this literature. However, Kelejian and Prucha (1998) showed that this term induces an endogeneity problem, which is why the researcher has to resort to an Instrumental Variables (IV) procedure. In terms of the estimation of parameters, both 2SLS and the CF approach require the availability of instruments; however, one important difference is that the latter imposes additional assumptions and is therefore less robust than 2SLS. On the other hand, if the assumptions hold, the CF allows one to deal with the endogeneity in a more parsimonious way when multiple functions² of the endogenous variable appear on the right-hand side of the equation, and it is probably more efficient (Wooldridge, 2010).
Note that this parsimony is relevant in the spatial case since it is common to include spillover effects in the models, and therefore the likelihood of having multiple functions of a variable increases in this context. In a spatial setup, Basile (2009) and Basile et al. (2014) extended the CF to additive nonparametric models. In terms of inference, Basile et al. (2014) recommend using the bootstrap to obtain confidence intervals, a practice that is common even in the i.i.d. case. However, as pointed out by Künsch (1989), the independence assumption plays a critical role in the validity of the bootstrap, so besides the computational cost, in a spatial context this is not a trivial procedure due to the dependence between observations. Intuitively, if we just randomly resample the data in a time series setting at each bootstrap repetition, the serial correlation structure would be lost, and a similar issue occurs in the spatial case. This is why different bootstrap methods have been proposed in the time series literature (see Politis and White (2004) for a brief overview); nevertheless, their extension to the spatial case is not straightforward due to the absence of a natural ordering of the observations. Given this, it might be desirable to have a closed-form formula for the covariance matrix when the empirical researcher works with parametric linear models with panel data in a spatial context. This paper tries to fill this gap in the literature by adjusting the HACSC estimator to the CF setting. The adjustment is necessary because, in addition to dealing with the spatial and serial correlation, it is necessary to take into account the sampling error induced by the first-stage estimation.

² A well-known result in the literature is that 2SLS and the CF give the same numerical coefficients if only one function of the endogenous variable is in the model. This carries over to the spatial case under the settings outlined at the beginning of the paragraph.
The rest of the paper is organized as follows. Section 1.2 discusses the model and the assumptions used to obtain the estimator of the covariance matrix. Section 1.3 presents the HACSC estimator and its asymptotic properties. Section 1.4 derives the FE and RE equivalence using the correlated random effects approach in a spatial context. Section 1.5 presents an additional application of the HACSC estimator in a Feasible GLS context. Section 1.6 presents the control function approach and a discussion of the additional assumptions imposed in this context. Section 1.7 contains a set of Monte Carlo experiments, and Section 1.8 shows an empirical application of the HACSC estimator using data from the Michigan education system. Section 1.9 concludes.

1.2 Model

1.2.1 Estimation of the parameters

Consider the following model:³

$y_{it} = x_{1it}\beta_1 + x_{2it}\beta_2 + W_i X_{1t}\gamma_1 + W_i X_{2t}\gamma_2 + \lambda W_i y_t + c_i + u_{it} = x_{it}\beta + W_i X_t\gamma + \lambda W_i y_t + c_i + u_{it}, \quad i = 1,\ldots,N, \; t = 1,\ldots,T \qquad (1.1)$

where $y_{it}$ is the dependent variable, $x_{1it}$ is a $1 \times (k_1+1)$ vector of exogenous explanatory variables (including an intercept), and $x_{2it}$ is a $1 \times k_2$ vector of endogenous variables. The sense in which $x_{1it}$ is exogenous will be clarified below. $W_i$ is the $i$-th row of the $N \times N$ time-invariant weighting matrix $W$, whose diagonal elements are zero; $X_{1t}$ and $X_{2t}$ are the $N \times k_1$ and $N \times k_2$ matrices of exogenous and endogenous covariates, respectively, for all observations at time $t$; $y_t$ is the vector of dependent variables at time $t$; $c_i$ is the individual heterogeneity; and $u_{it}$ is the idiosyncratic error. Hence $\beta$, $\gamma$ and $\lambda$ are the parameters of interest, of dimension $(k+1) \times 1$, $k \times 1$ and $1 \times 1$, respectively. Throughout the rest of the paper, I assume that $N \to \infty$ while $T$ remains fixed.

³ The model includes a spatial lag of the dependent variable on the right-hand side for the sake of generality and because this is a widespread practice in the spatial literature. Nevertheless, it is important to emphasize that its inclusion precludes the interpretation of (1.1) as a conditional mean function and also complicates the interpretation of the coefficients. As such, in some sections of the paper this variable will be omitted.

We assume that there exists a set of instruments $z_{2it}$ for $x_{2it}$ of dimension $l \ge k_2$ (so that $W_i Z_{2t}$ are the instruments for $W_i X_{2t}$). As previously shown by Kelejian and Prucha (1998), the inclusion of a spatial lag of the dependent variable on the right-hand side also induces an endogeneity issue for which we need instruments. Kelejian et al. (2004) and Lee (2003) determined that the optimal set of instruments for this variable is a sequence of the form $W^j X_t$, for $j = 1,\ldots,s$, $s \in \mathbb{N}$ (in this case, we would only include higher-power spatial lags of $X_{1t}$). If we let $w_{rit} \equiv W_i X_{rt}$, $r = 1, 2$, and $\mathfrak{Z}_{2it} \equiv W_i Z_{2t}$, $A_{it} \equiv (x_{1it}\; x_{2it}\; w_{1it}\; w_{2it}\; W_i y_t)$ and $\theta \equiv (\beta_1'\; \beta_2'\; \gamma_1'\; \gamma_2'\; \lambda)'$, then the model can be written more compactly as:

$y_{it} = A_{it}\theta + c_i + u_{it} \qquad (1.2)$

Since we are not assuming a particular structure for the error term, we can estimate the parameters of (1.2) with the Fixed Effects 2SLS estimator. To do so, we apply the within transformation to all the variables: let $\ddot{y}_{it} = y_{it} - \bar{y}_i$, where $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}$, and similarly for the independent variables and the instruments. Then we can apply Pooled 2SLS to the transformed model

$\ddot{y}_{it} = \ddot{A}_{it}\theta + \ddot{u}_{it} \qquad (1.3)$

using the instruments $\ddot{Z}_{it} = (\ddot{x}_{1it}\; \ddot{w}_{1it}\; \ddot{z}_{2it}\; \ddot{\mathfrak{Z}}_{2it}\; \ddot{w}^2_{1it}\; \ddot{w}^3_{1it}\, \ldots\, \ddot{w}^s_{1it})$, where $\ddot{w}^j_{1it}$ denotes the within-transformed $W_i^j X_{1t}$. Note that all the individual unobserved effects have been removed. To obtain consistent parameters, we need the following orthogonality condition:

$E(\ddot{Z}_{it}' \ddot{u}_{it}) = E[g_{it}(Z_{it}, \theta)] = 0, \quad t = 1, \ldots,$
$T \qquad (1.4)$

which is implied by the stronger strict exogeneity condition:

$E(u_{it} \mid Z) = E(u_{it} \mid Z, W) = 0$

where $Z$ is the $NT \times [(s+1)k_1 + 2l + 1]$ matrix of exogenous variables for all cross-sectional units and all time periods. We note that in this spatial setting, this condition is stronger than in the non-spatial case because here we condition the expected value of $u_{it}$ on the exogenous variables of all other units, not only on unit $i$'s (see Wooldridge (2010), p. 301 for more details). The function $g_{it}(Z_{it}, \theta)$ is of dimension $(s+1)k_1 + 2l + 1 = r$; hence, for each $i$ there are $T \times r$ moment conditions. Under this framework, we could use many more moment conditions, because the strict exogeneity assumption implies orthogonality conditions for each pair of time periods and cross-sectional units [i.e., $E(\ddot{Z}_{it}' \ddot{u}_{js}) = 0$, $i, j = 1,\ldots,N$ and $t, s = 1,\ldots,T$]; however, we will only use the conditions implied by the FE estimator. Using a similar idea as Driscoll and Kraay (1998), for each observation $i$ we can average these moment conditions over time:⁴

$g_i(Z_i, \theta) = \frac{1}{T} \sum_{t=1}^{T} g_{it}(Z_{it}, \theta) \qquad (1.5)$

From this, one can construct a GMM estimator, defined as follows:

$\hat{\theta} = \arg\min_{\theta \in \Theta} \left[ \frac{1}{N} \sum_{i=1}^{N} g_i(Z_i, \theta) \right]' \hat{\Omega} \left[ \frac{1}{N} \sum_{i=1}^{N} g_i(Z_i, \theta) \right] \qquad (1.6)$

where $\hat{\Omega}$ is an $r \times r$ positive definite, symmetric weighting matrix. Admittedly, as noted above, we could estimate $\theta$ by running Pooled 2SLS on (1.3); however, the GMM framework allows for more generality. For instance, averaging the moment conditions over time for each observation can be done in setups other than fixed effects. Furthermore, this averaging might not be the most efficient approach, but obtaining the optimal GMM in a two-step procedure might provide some efficiency gains with respect to Pooled 2SLS.
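To make the two building blocks just described concrete, here is a minimal numpy sketch of the within transformation and of the time-averaged moments $g_i$ in (1.5). The function names and the stacked array layout (unit-major ordering) are my own illustrative choices, not the paper's notation.

```python
import numpy as np

def within_transform(v, n, t):
    """Within transformation: v_it - (1/T) sum_t v_it for each unit i.

    v is an (n*t, k) array stacked so that rows i*t : i*t + t hold
    unit i's observations for periods 1..t.
    """
    v = np.asarray(v, dtype=float).reshape(n, t, -1)
    return (v - v.mean(axis=1, keepdims=True)).reshape(n * t, -1)

def time_averaged_moments(z, resid, n, t):
    """g_i = (1/T) sum_t z_it' * u_it: one r-vector per cross-sectional unit."""
    z = np.asarray(z, dtype=float).reshape(n, t, -1)       # (n, t, r)
    u = np.asarray(resid, dtype=float).reshape(n, t, 1)    # (n, t, 1)
    return (z * u).mean(axis=1)                            # (n, r)
```

For example, with $N = 2$, $T = 2$ and the series $(1, 3)$ and $(2, 6)$, the within transformation returns $(-1, 1)$ and $(-2, 2)$; the `time_averaged_moments` output is then the input to the GMM objective in (1.6).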
1.2.2 Assumptions

The consistency and asymptotic normality of this estimator can be obtained from a Uniform Law of Large Numbers (ULLN) and a Central Limit Theorem (CLT) derived by Jenish and Prucha (2009) for nonstationary random fields on a possibly uneven lattice. Before stating their assumptions, we need some definitions. Let $D \subset \mathbb{R}^d$, $d \ge 1$, be an uneven lattice, and let $\rho(i,j) = \max_{1 \le k \le d} |j_k - i_k|$ and $|i| = \max_{1 \le k \le d} |i_k|$, where $i_k$ denotes the $k$-th component of $i$, be a metric and a norm on $\mathbb{R}^d$, respectively. The minimum distance between two subsets $E, F$ of $D$ is defined as $\rho(E,F) = \inf\{\rho(i,j) : i \in E \text{ and } j \in F\}$, and let $|E|$ denote the cardinality of a subset $E \subseteq D$. Other definitions used throughout this section can be found in the Appendix.

⁴ Note, however, that Driscoll and Kraay's case is based on holding $N$ fixed and letting $T \to \infty$, and they average across $i$ for each $t$.

We now state the assumptions required to obtain the consistency and asymptotic normality of $\hat{\theta}$. We note that the $N$ subscript on the random fields and scalars in the assumptions explicitly indicates that the ULLN and CLT can accommodate triangular arrays, which are common in the spatial literature and particularly in Cliff-Ord-type models. However, for notational simplicity, it will be suppressed in many sections for the remainder of the paper.

Assumption 1 The lattice $D \subset \mathbb{R}^d$, $d \ge 1$, is infinite countable and there exists a distance $\rho_0$ such that $\rho(i,j) \ge \rho_0$ for all $i, j \in D$. Without loss of generality, suppose that $\rho_0 > 1$.

Assumption 1 provides the necessary structure on the lattice. Note that the existence of the distance is essential in order to obtain nonparametric estimators of the covariance matrix, and it is analogous to the time difference between observations in the time series literature. Furthermore, it is possible that the distance observed by the researcher between two observations $i$ and $j$, $\rho^*(i,j)$, is measured with error.
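The max-coordinate metric $\rho(i,j)$ just defined can be computed directly from the units' coordinates; a small numpy sketch (function name and array layout are my own):

```python
import numpy as np

def chebyshev_dist(locs):
    """Pairwise rho(i, j) = max_k |j_k - i_k| for locations in R^d.

    locs: (n, d) array of coordinates; returns an (n, n) distance matrix.
    """
    diff = locs[:, None, :] - locs[None, :, :]   # (n, n, d) coordinate gaps
    return np.abs(diff).max(axis=2)              # max over the d coordinates
```

For example, for locations $(0,0)$, $(1,3)$ and $(2,1)$, the pairwise distances are $3$, $2$ and $2$.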
Note that the existence and availability of this distance measure is not trivial, even in the leading case of a geographical region. As shown in Figure 1.1, there are instances in which using the linear distance between many pairs of points in a territory would not represent the real burden of traveling from one location to another (e.g., driving), while there are other cases in which this measure would be appropriate (e.g., pollution).

Figure 1.1 Points in an irregular geographic region.

Now we state conditions related to the $g_i(\cdot)$ functions and $Z_{i,N}$, where $Z_{i,N}$ represents an $\alpha$-mixing random field with $i \in D$. At this point, it is important to note that since we are working with panel data and use time averages for estimation purposes, the random field considered in the assumptions is the one constructed from the time averages for each observation.

Assumption 2 (Uniform $L_2$ integrability) There is an array of positive real constants $\{c_{i,N}\}$ such that

$\lim_{k \to \infty} \sup_N \sup_{i \in D_N} E\big[ |Z_{i,N}/c_{i,N}|^2 \, \mathbf{1}(|Z_{i,N}/c_{i,N}| > k) \big] = 0$

where $\mathbf{1}(\cdot)$ denotes an indicator function. Note that Assumption 2 allows for the possibility of asymptotically unbounded second moments; however, for the remainder of the paper we will focus on the case of bounded moments, in which case we can set $c_{i,N} = 1$ for all $i$. The next assumption puts some restrictions on the $\alpha$-mixing coefficients of the random field.

Assumption 3 ($\alpha$-mixing) Let $\bar{Q}^{(k)}_{i,N} := Q_{|Z_{i,N}/c_{i,N}| \mathbf{1}(|Z_{i,N}/c_{i,N}| > k)}$ denote the upper-tail quantile function of $|Z_{i,N}/c_{i,N}| \mathbf{1}(|Z_{i,N}/c_{i,N}| > k)$, and recall that $\alpha_{\mathrm{inv}}(u)$ is the inverse function of $\bar{\alpha}_{1,1}(m)$ as in the definition specified in the Appendix. The $\alpha$-mixing coefficients satisfy:

1. $\lim_{k \to \infty} \sup_N \sup_{i \in D_N} \int_0^1 \alpha^d_{\mathrm{inv}}(u) \big[ \bar{Q}^{(k)}_{i,N}(u) \big]^2 \, du = 0$.
Under Assumptions 2 and 3.2 with 𝑘 = ℎ = 1 and letting {𝐷 𝑁 } be a sequence of finite subsets of 𝐷 that satisfies Assumption 1 such that |𝐷 𝑁 | → ∞ as 𝑁 → ∞, a direct application of Theorem 3 in Nazgul and Prucha (2009) leads to the conclusion that 1 |𝐷 𝑁 | ∑︁ 𝑖∈𝐷 𝑍𝑖,𝑁 − E(𝑍𝑖,𝑁 ) 𝑝 → 0 Note that one could relax Assumption 2 to 𝐿1 uniform integrability for the theorem to hold, nevertheless, the below CLT requires 𝐿2 uniform integrability. In order to apply this pointwise WLLN to the 𝑔𝑖 (·, 𝜃) functions, we assume that these satisfy the regularity conditions specified in Assumption A.1 presented in the Appendix. Given the fact that any measurable function of an 𝛼-mixing process is 𝛼-mixing, the 𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃) also satisfy a pointwise WLLN, i.e. 1 |𝐷 𝑁 | ∑︁ 𝑖∈𝐷 𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃) − E[𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃)] 𝑝 → 0 (1.7) With this Weak Law of Large Numbers, in order for the above GMM estimator to be consistent, we need an Uniform LLN for which we need the additional regularity conditions on the 𝑔𝑖 (·, ·) functions stated in Assumption A.2. Under these assumptions, we have the following proposition, which is a special case of Theorem 2 in Nazgul and Prucha (2009). Proposition 1. Let {𝐷 𝑁 } be a sequence of finite subsets of 𝐷 that satisfies Assumption 1 such that (cid:205)𝑖∈𝐷 𝑁 space and consider a sequence of real valued functions {𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃) : 𝑖 ∈ 𝐷 𝑁 , 𝑁 ∈ N} satisfying |𝐷 𝑁 | → ∞ as 𝑁 → ∞ and let 𝑄 𝑁 (𝜃) = 1 |𝐷 𝑁 | 𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃). Suppose (Θ, 𝜈) is a compact metric Assumption 2 and that for all 𝜃 in Θ, these functions satisfy the WLLN in (1.7). Then |𝑄 𝑁 (𝜃) − E[𝑄 𝑁 (𝜃)] | 𝑝 → 0 sup 𝜃∈Θ 11 With these tools at hand, define the following functions: 𝑄 𝑁 (𝜃) ≡ (cid:35) ′ (cid:34) 𝑔𝑖 (𝑍𝑖, 𝜃) ˆΩ (cid:34) 1 𝑁 𝑁 ∑︁ 𝑖=1 (cid:35) 𝑔𝑖 (𝑍𝑖, 𝜃) 1 𝑁 𝑁 ∑︁ 𝑖=1 𝑄(𝜃0) ≡ E[𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃0)]′ Ω0 E[𝑔𝑖 (𝑍𝑖,𝑁 , 𝜃0)] And suppose that ˆΩ 𝑝 → Ω0, where Ω0 is a positive definite matrix. 
Recalling that $E[g_i(Z_i, \theta)] = 0$ only when $\theta = \theta_0$, the true population value, the following proposition summarizes the conditions under which the GMM estimator is consistent:

Proposition 2. Suppose that all the conditions of Proposition 1 hold. Additionally, assume that (i) $g_i(Z_i, \cdot)$ is continuous for all $\theta \in \Theta$, (ii) $\hat{\Omega} \xrightarrow{p} \Omega_0$, an $r \times r$ positive definite matrix, and (iii) $\theta_0$ is the only vector for which the moment condition in (1.4) holds. Then $Q_N(\hat{\theta})$ converges uniformly to $Q(\theta_0)$ and $\hat{\theta} \xrightarrow{p} \theta_0$, the unique minimizer of $Q(\theta)$.

Note that since $\frac{1}{N} \sum_{i=1}^{N} g_i(Z_i, \theta)$ satisfies the ULLN of Proposition 1 and $\hat{\Omega} \xrightarrow{p} \Omega_0$, the proof of this proposition follows from Theorem 4.1.1 in Amemiya (1985). To obtain the asymptotic distribution of $\hat{\theta}$, we assume the following condition, which guarantees that the sum is not dominated by any single term.

Assumption 4 Define $\tilde{\sigma}^2_N = \mathrm{Var}(S_N)$ with $S_N = \sum_{i \in D_N} Z_{i,N}$. Then the following condition is satisfied:

$\liminf_{N \to \infty} |D_N|^{-1} \tilde{\sigma}^2_N > 0$

Under this assumption, Theorem 1 in Jenish and Prucha (2009) ensures the asymptotic normality of the random variables $Z_i$.

Proposition 3. Let $\{D_N\}$ be a sequence of finite subsets of $D$ satisfying Assumption 1 such that $|D_N| \to \infty$ as $N \to \infty$, and let $\{Z_i : i \in D_N, N \in \mathbb{N}\}$ be a sequence of zero-mean real-valued random variables satisfying Assumptions 2 and 4. Furthermore, assume that the random field is $\alpha$-mixing, satisfying Assumption 3. Then,

$\tilde{\sigma}^{-1}_N S_N \xrightarrow{d} N(0, 1)$

Once again, the previous proposition applies directly to the underlying random fields; however, we need a result for the $g_i(Z_{i,N}, \theta)$ functions.
Assuming that the latter satisfy the standard regularity conditions of Assumption A.3, the first order conditions for the GMM estimator are
\[
\left[ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(Z_i, \hat{\theta}) \right]' \hat{\Omega} \left[ \frac{1}{N} \sum_{i=1}^{N} g_i(Z_i, \hat{\theta}) \right] = 0 \tag{1.8}
\]
Taking a mean value expansion of the last term around $\theta_0$ yields the following expression:
\[
g_i(\hat{\theta}) = g_i(\theta_0) + \nabla_\theta g_i(\tilde{\theta})(\hat{\theta} - \theta_0) + \text{remainder} \tag{1.9}
\]
for $\tilde{\theta}$ between $\hat{\theta}$ and $\theta_0$ element-wise, where I suppressed the dependence of $g_i$ on $Z_i$ for notational simplicity. Replacing (1.9) in (1.8) yields:
\[
\sqrt{N}(\hat{\theta} - \theta_0) = - \left\{ \left[ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(\hat{\theta}) \right]' \hat{\Omega} \left[ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(\tilde{\theta}) \right] \right\}^{-1} \left[ \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(\hat{\theta}) \right]' \hat{\Omega} \left[ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} g_i(\theta_0) \right] + \text{remainder} \tag{1.10}
\]
Noting again that the $\nabla_\theta g_i(\theta)$ preserve the mixing conditions, then
\[
\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta g_i(\hat{\theta}) \xrightarrow{p} \mathrm{E}[\nabla_\theta g_i(\theta_0)]
\]
by the WLLN above. Since $g_i(\theta)$ is continuously differentiable, by Slutsky's Theorem the first term of (1.10) converges in probability to
\[
\left\{ \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0 \, \mathrm{E}[\nabla_\theta g_i(\theta_0)] \right\}^{-1} \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0
\]
Furthermore, by the CLT,
\[
\frac{1}{\sqrt{N}} \sum_{i=1}^{N} g_i(\theta_0) = O_p(1)
\]
Therefore, taking the probability limit of (1.10), we obtain
\[
\sqrt{N}(\hat{\theta} - \theta_0) = - \left\{ \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0 \, \mathrm{E}[\nabla_\theta g_i(\theta_0)] \right\}^{-1} \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0 \left[ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} g_i(\theta_0) \right] + o_p(1) \xrightarrow{d} \mathcal{N}(0, C'\Sigma C) \tag{1.11}
\]
where
\[
C' = \left\{ \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0 \, \mathrm{E}[\nabla_\theta g_i(\theta_0)] \right\}^{-1} \mathrm{E}[\nabla_\theta g_i(\theta_0)]' \, \Omega_0
\]
and
\[
\Sigma = \mathrm{E}[g_i(\theta_0) g_i(\theta_0)'] = \mathrm{Var}[g_i(\theta_0)] \tag{1.12}
\]
Note that for the cases considered in this paper, $C$ is just a matrix of data, so we do not need to estimate it. On the other hand, we need an estimator of the variance of the moment conditions, which we present in the next section. From an empirical implementation point of view, it is important to note that this GMM framework includes the simple estimators mentioned at the beginning of the section as special cases.
For example, in the case of $A_{it}$ containing only exogenous variables, the GMM reduces to the same solution as estimating (1.3) with Pooled OLS. If $A_{it}$ has some endogenous variables, as in model (1.1), and assuming that we have a set of instruments $Z_{it}$, then the Fixed Effects 2SLS estimator can be obtained from the GMM estimator by setting $\hat{\Omega} = \ddot{Z}'\ddot{Z}$, where $Z$ is the stacked $NT \times r$ matrix of instruments. Furthermore, we would need the well known matrices of these estimators to be of full column rank and to converge in probability to non-singular finite matrices.

Another empirical consideration is the specification of the weighting matrix $W$, since in the model the dependence of the outcome variable on other observations is generated by this matrix. In practice, there exist different ways to specify $W$. For example, one could assign weights as the inverse of the distance between two observations and set the weights to zero beyond a threshold, or use a $k$-neighbors scheme. When dealing with geographic units, one could assign an equal weight to all the units $j$ that share a border with unit $i$ (rook type) or to all units that share an edge or a vertex (queen type), as in Figure 1.2, or even assign an equal weight to all other units in the sample (see LeSage and Pace (2009) for a discussion on weighting matrices).

Figure 1.2 Rook and queen type weighting schemes. On the rook type scheme, if $W$ is row normalized, only units 8, 12, 14 and 18 will receive a weight of $\frac{1}{4}$ in row 13. Analogously, if a queen type scheme is used, units 7, 8, 9, 12, 14, 17, 18 and 19 will have a weight of $\frac{1}{8}$ in row 13 of $W$.

Nonetheless, some of these specifications might violate the assumptions stated in this section. In particular, recall that we are working with an $\alpha$-mixing random field, which implies that the dependence between the observations decays as they grow farther apart.
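As a concrete illustration of the rook and queen schemes just described, the following sketch builds a row-normalized contiguity matrix for a regular grid with units numbered row by row, as in Figure 1.2. The function name `contiguity_matrix` and its interface are illustrative, not part of the paper.

```python
import numpy as np

def contiguity_matrix(nrows, ncols, scheme="rook", row_normalize=True):
    """Rook- or queen-type spatial weighting matrix W for a regular
    nrows x ncols grid, units numbered row by row (illustrative sketch)."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for i in range(n):
        r, c = divmod(i, ncols)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue  # diagonal of W is zero
                # rook: shared border only (no diagonals); queen: border or vertex
                if scheme == "rook" and dr != 0 and dc != 0:
                    continue
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    W[i, rr * ncols + cc] = 1.0
    if row_normalize:
        W = W / W.sum(axis=1, keepdims=True)
    return W
```

On the 5 × 5 grid of Figure 1.2, row 13 of the rook matrix puts weight 1/4 on units 8, 12, 14 and 18, while the queen matrix puts weight 1/8 on the eight surrounding units, matching the figure.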
In this respect, it is clear that assigning an equal weight to all other observations violates this assumption. In a similar fashion, a $k$-neighbors pattern might not satisfy the $\alpha$-mixing condition in cases where there are isolated units (e.g. a unit located alone on an island). Note that these restrictions on $W$ also apply in cases where the distance measure is of an economic nature or derived from a network perspective (e.g. degree of centrality).

1.3 The HACSC estimator

To obtain robust standard errors, recall that because for each observation we took the time average of its corresponding moment conditions, we are essentially working with a cross sectional problem. The idea is therefore to apply Kelejian and Prucha's (2007) estimator of the covariance matrix in this context, for which we need consistent estimates of the error terms. Analogous to the time series literature, their estimator requires a kernel function $K(\cdot)$, which provides weights to the covariance terms entering the sums. In principle, only the covariances of observations that are close relative to some distance measure will receive a positive weight, while observations that are far away will receive a weight of zero. In other words, this function operationalizes the weak dependence assumption between observations at the level of the error terms. Note however that this kernel provides weights along the cross sectional dimension and not across time. To fix ideas, the researcher will need to choose a distance $\rho_b$ such that $\rho_b \to \infty$ as $N \to \infty$, which plays the role of the truncation lag in a time series context. The next assumption imposes additional restrictions on the kernel function.

Assumption 5 The kernel $K : \mathbb{R} \to [-1, 1]$ satisfies the following conditions:

1. $K(0) = 1$
2. $K(x) = K(-x)$
3. $K(x) = 0$ for $x > 1$
4.
$|K(x) - 1| \leq c_K |x|^{\alpha_K}$ for $|x| \leq 1$, for some $\alpha_K \geq 1$ and $0 < c_K < \infty$.

As pointed out by Kelejian and Prucha (2007), Assumption 5 is satisfied by many kernels, such as the rectangular kernel, the Bartlett kernel and the triangular kernel, among others. The next assumption imposes some structure on the error terms.

Assumption 6 The $N \times 1$ vector of errors is generated as follows:
\[
u = R\varepsilon \tag{1.13}
\]
where $\varepsilon$ is an $N \times 1$ vector of i.i.d. random variables with mean 0, variance 1 and $\mathrm{E}[|\varepsilon|^q] < \infty$ for $q \geq 4$, and $R$ is an $N \times N$ non-singular unknown matrix whose row and column sums are uniformly bounded.

In light of Assumption 6, recall that although theoretically we are working with a cross sectional problem because we took the time average of the moment conditions, the underlying structure of the data is a panel. In this sense, (1.13) can also be seen as an average, so for each $i$ we have:
\[
u_i = (u_{i1}, u_{i2}, \ldots, u_{iT})' \tag{1.14}
\]
where the $t$-th row of $u_i$ is
\[
u_{i,t} = \sum_{s=1}^{t} R_{i,s} \varepsilon_s
\]
and $R_{i,s}$ is the $i$-th row of $R_s$, a matrix with similar properties to the $R$ defined above, at time $s$. This implies that in each time period, the disturbances will depend on other units' disturbances, past own values of the disturbances, and past values of other units' disturbances. In other words, this structure allows for spatial correlation, serial correlation, "spatial serial" correlation and heteroskedasticity. Nevertheless, the uniform boundedness condition on $R$ guarantees that the correlation between units is restricted along the cross sectional dimension, analogous to the time series case. Given the distance $\rho_b$, we can denote by $v_i$ the number of pseudo-neighbors of $i$:
\[
v_i = \sum_{j=1}^{N} \mathbf{1}[\rho^*(i,j) \leq \rho_b]
\]
and let $v = \max_i v_i$. In words, $v_i$ denotes the number of units $j$ that are at a distance less than $\rho_b$ from unit $i$.
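The conditions of Assumption 5 and the pseudo-neighbor count $v_i$ can be sketched concretely. The Bartlett (triangular) kernel below satisfies all four conditions with $\alpha_K = c_K = 1$; the helper names are illustrative, not from the paper.

```python
import numpy as np

def bartlett(x):
    """Bartlett kernel: K(0)=1, symmetric, zero for |x| > 1,
    and |K(x) - 1| = |x|, so Assumption 5.4 holds with alpha_K = c_K = 1."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x <= 1.0, 1.0 - x, 0.0)

def pseudo_neighbors(dist, rho_b):
    """v_i of the text: for each i, the number of units j with
    rho*(i, j) <= rho_b, given an (N, N) distance matrix."""
    return (np.asarray(dist) <= rho_b).sum(axis=1)
```

In the HACSC sums, `bartlett(rho_star / rho_b)` down-weights pairs smoothly with distance and truncates exactly at $\rho_b$.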
The following assumption is related to $v$.

Assumption 7 The random variable $v$ satisfies the following conditions:

1. $\mathrm{E}(v^2) = o_p(N^{2\tau})$, where $\tau \leq \frac{1}{2} \cdot \frac{q-2}{q-1}$ and $q$ is defined in Assumption 6.
2. $\sum_{j=1}^{N} |\sigma_{ij}| \rho(i,j)^{\alpha_S} \leq c_S$, for $\alpha_S \geq 1$ and $0 < c_S < \infty$, where $\sigma_{ij}$ is the $(i,j)$-th element of $\Sigma$ (defined below).

Assumption 7 plays a role in limiting the degree of correlation between units, as well as in ensuring that the estimator of the covariance matrix is consistent given that we are using residuals instead of errors to estimate it. Assumptions 8 and 9 provide an identification condition and bound the measurement error of the distance, respectively.

Assumption 8 The matrix of exogenous variables, $\ddot{Z}$, has full column rank and its elements are uniformly bounded in absolute value by the finite constant $0 < c_Z < \infty$. For a fixed and finite $T$, the matrices:

1. $\lim_{N \to \infty} (NT)^{-1} \ddot{Z}'\ddot{Z} = Q_{ZZ}$
2. $\lim_{N \to \infty} (NT)^{-1} \ddot{Z}'RR'\ddot{Z} = Q_{ZRRZ}$
3. $\operatorname{plim}_{N \to \infty} (NT)^{-1} \ddot{Z}'\ddot{Z} = Q_{ZZ}$

are finite and non-singular. Furthermore, the matrix $\operatorname{plim}_{N \to \infty} (NT)^{-1} \ddot{Z}'\ddot{A} = Q_{ZA}$ has full column rank $2k$. Similarly, the diagonal elements of $W$ are zero and all of its elements are uniformly bounded by a finite constant $0 < c_W < \infty$.

Assumption 9 The distance measure used by the empirical researcher, $\rho^*(\cdot,\cdot)$, is potentially measured with error, i.e.
\[
\rho^*(i,j) = \rho(i,j) + e_{ij} \geq 0
\]
where $e_{ij} = e_{ji}$ denotes the measurement errors, which are bounded in absolute value by the finite constant $0 < c_e < \infty$. Furthermore, $\{e_{ij}\}$ is independent of $\{\varepsilon_i\}$.

We need an additional assumption to account for the fact that we are using residuals instead of the actual error terms. This condition is provided in Assumption A.4 and should be satisfied by most $N^{1/2}$-consistent estimators. An extensive discussion of this and the previous assumptions is provided by Kelejian and Prucha (2007).
Note that given equations (1.4) and (1.5) and the matrix $\Sigma$ specified in (1.12), we have the following:
\[
\mathrm{E}[g_i(\theta_0) g_i(\theta_0)'] = \mathrm{E}\left[ \ddot{Z}_i' \ddot{u}_i \ddot{u}_i' \ddot{Z}_i \right] \tag{1.15}
\]
Because all the analysis is conditional on $Z$ and $W$, applying the Law of Iterated Expectations to (1.15) together with Assumption 6 gives $\mathrm{E}(uu') = RR' = \Sigma$, where $u$ is the $N \times 1$ vector of stacked error terms. In practical terms, and recalling that $g_i(\cdot,\cdot)$ was defined as an average over time, we can estimate (1.15) by replacing the error terms with their residual counterparts and the expected value with an average, applying the WLLN. Therefore, for the proposed estimator $\hat{\Sigma}$, its $(r,s)$-th element can be obtained as follows:
\[
\hat{\Sigma}_{rs} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \ddot{Z}_{it,r} \ddot{Z}_{jl,s} \, \hat{\ddot{u}}_{it} \hat{\ddot{u}}_{jl} \, K\left[ \frac{\rho^*(i,j)}{\rho_b} \right] \tag{1.16}
\]
where $\ddot{Z}_{it,r}$ is the value of covariate $r$ for observation $i$ at time $t$, while its population counterpart is given by the following expression:
\[
\Sigma_{rs} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \ddot{Z}_{it,r} \ddot{Z}_{jl,s} \, \sigma_{it,jl} \tag{1.17}
\]
The following proposition establishes the consistency of $\hat{\Sigma}$.

Proposition 4. Consider the model in (1.1) and Assumptions 4–9. Suppose that the $(r,s)$-th elements of $\Sigma$ and $\hat{\Sigma}$ are given by (1.17) and (1.16) respectively. Then $\hat{\Sigma} \xrightarrow{p} \Sigma$.

Given that we have assumed $T$ fixed from the beginning, the proof of this proposition is virtually the same as in Kelejian and Prucha (2007).
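Because the kernel weight in (1.16) depends only on the pair $(i, j)$ and not on $(t, l)$, the quadruple sum factorizes into time sums per unit followed by a kernel-weighted cross-sectional quadratic form. A minimal sketch of the estimator under this factorization follows; the function name `hacsc_sigma` and its interface are illustrative, not part of the paper.

```python
import numpy as np

def hacsc_sigma(Zdot, udot, dist, rho_b, kernel):
    """Sketch of the spatial HAC estimator Sigma-hat of (1.16).

    Zdot : (N, T, k) array of demeaned instruments/covariates
    udot : (N, T) array of demeaned residuals
    dist : (N, N) matrix of (possibly mismeasured) distances rho*(i, j)

    Within-unit covariances across time are never down-weighted
    (kernel(0) = 1); only the spatial dimension is kernel-weighted."""
    N, T, k = Zdot.shape
    # s[i, r] = sum_t Zdot[i, t, r] * udot[i, t]; the quadruple sum in
    # (1.16) then becomes s' K s because K depends only on (i, j)
    s = np.einsum("itk,it->ik", Zdot, udot)
    Kw = kernel(dist / rho_b)          # (N, N) spatial weights
    return s.T @ Kw @ s / (N * T)
```

The factorization reduces the cost from $O(N^2 T^2 k^2)$ brute-force summation to two matrix products.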
Note that we can re-write (1.16) as follows:
\[
\hat{\Sigma}_{rs} = \frac{1}{NT} \left\{ \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \ddot{Z}_{it,r} \ddot{Z}_{il,s} \, \hat{\ddot{u}}_{it} \hat{\ddot{u}}_{il} \cdot K[0] + \sum_{i=1}^{N} \sum_{j \neq i} \sum_{t=1}^{T} \sum_{l=1}^{T} \ddot{Z}_{it,r} \ddot{Z}_{jl,s} \, \hat{\ddot{u}}_{it} \hat{\ddot{u}}_{jl} \, K\left[ \frac{\rho^*(i,j)}{\rho_b} \right] \right\} \tag{1.18}
\]
The first term of (1.18) makes it clear that there are no restrictions imposed on the serial correlation for a particular observation, as those terms are not down-weighted.

1.4 Correlated Random Effects

A direct application of the HACSC proposed in the previous section arises in the Correlated Random Effects (CRE) context. One of the most popular methods applied in a panel setting is the fixed effects estimator, since it allows the unobserved heterogeneity $c_i$ to be arbitrarily correlated with the explanatory variables in the model. On the other side of the spectrum, the random effects approach imposes no correlation between $c_i$ and the independent variables. A typical task that the empirical researcher must face is to choose between these two specifications, for which the literature has suggested multiple approaches. One of these is the CRE framework, which imposes restrictions on the distribution of the individual heterogeneity conditional on the regressors (Wooldridge, 2010). One option is to follow Mundlak's (1978) suggestion, which assumes that $c_i$ can be modeled as a linear function of the averages of the time varying independent variables. More specifically, consider the following model:
\[
y_{it} = x_{it}\beta + c_i + u_{it}, \quad i = 1 \ldots N, \; t = 1 \ldots T \tag{1.19}
\]
Assuming that the $x_i$'s are time varying, Mundlak considered the following specification:
\[
c_i = \eta + \bar{x}_i \delta + e_i \tag{1.20}
\]
where $e_i$ is uncorrelated with $\bar{x}_i$ by assumption. Replacing (1.20) in (1.19) yields:
\[
y_{it} = x_{it}\beta + \bar{x}_i \delta + e_i + u_{it}, \quad i = 1 \ldots N, \; t = 1 \ldots T \tag{1.21}
\]
Mundlak (1978) showed that estimating $\beta$ in (1.21) by pooled OLS (POLS) or random effects yields the same $\beta$ as estimating (1.19) by fixed effects.
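The Mundlak equivalence is an exact algebraic result and can be verified numerically in a simplified scalar setting. The sketch below simulates data with heterogeneity correlated with $\bar{x}_i$ and checks that pooled OLS of $y$ on $(1, x_{it}, \bar{x}_i)$ returns exactly the within (fixed effects) coefficient; all names are illustrative.

```python
import numpy as np

# Numerical check of Mundlak (1978): pooled OLS of y on (x_it, xbar_i)
# yields the same beta as the fixed effects (within) estimator.
rng = np.random.default_rng(42)
N, T = 200, 5
x = rng.normal(size=(N, T))
c = 0.8 * x.mean(axis=1) + rng.normal(size=N)       # c_i correlated with xbar_i
y = 2.0 + 0.7 * x + c[:, None] + rng.normal(size=(N, T))

# Fixed effects: within transformation, then pooled OLS
xdot = (x - x.mean(axis=1, keepdims=True)).ravel()
ydot = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (xdot @ ydot) / (xdot @ xdot)

# Mundlak regression: y on constant, x_it, xbar_i
xbar = np.repeat(x.mean(axis=1), T)
X = np.column_stack([np.ones(N * T), x.ravel(), xbar])
beta_mundlak = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]

assert np.isclose(beta_fe, beta_mundlak)
```

The equality holds to machine precision, not just asymptotically, because by Frisch–Waugh the residual of $x_{it}$ on $(1, \bar{x}_i)$ in the pooled sample is exactly the within-demeaned $\ddot{x}_{it}$.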
In addition, we can perform a Hausman-type test using this equation by testing $\delta = 0$ to determine the suitability of one estimator versus the other. It turns out that this FE-RE equivalence carries over to the spatial setting in a particular case, namely a model such as (1.1), i.e. with no autoregressive process in the error term $u$ (Li and Yang (2020) showed that the equivalence breaks if we model the error term structurally). Furthermore, the result carries over to the case of endogenous variables, which is a common issue in empirical work. More concretely, consider the model in (1.1); using the same notation, the Fixed Effects Two Stage Least Squares (FE2SLS) coefficients can be obtained by running Pooled 2SLS on the following equation:
\[
\ddot{y}_{it} = \ddot{x}_{1it}\beta_1 + \ddot{x}_{2it}\beta_2 + \ddot{w}_{1it}\gamma_1 + \ddot{w}_{2it}\gamma_2 + \rho W_i \ddot{y}_t \tag{1.22}
\]
using the instrumental variables $(\ddot{z}_{2it} \;\; \ddot{\mathfrak{z}}_{2it} \;\; \ddot{w}^2_{1it} \;\; \ddot{w}^3_{1it} \ldots \ddot{w}^s_{1it})$, $s \in \mathbb{N}$. Then, it can be shown that running Pooled 2SLS on:
\[
\begin{aligned}
y_{it} - \eta \bar{y}_i ={}& (x_{1it} - \eta \bar{x}_{1i})\beta_1 + (x_{2it} - \eta \bar{x}_{2i})\beta_2 + (w_{1it} - \eta \bar{w}_{1i})\gamma_1 + (w_{2it} - \eta \bar{w}_{2i})\gamma_2 + \rho W_i (y_t - \eta \bar{y}) \\
& + (1-\eta)\bar{x}_{1i}\delta_1 + (1-\eta)\bar{z}_{2i}\delta_2 + (1-\eta)\bar{w}_{1i}\lambda_1 + (1-\eta)\bar{\mathfrak{z}}_{2i}\lambda_2 + (1-\eta)\sum_{j=2}^{s} \bar{w}^j_{1i}\zeta_j
\end{aligned} \tag{1.23}
\]
using the instruments
\[
\left[ (z_{2it} - \eta \bar{z}_{2i}) \;\; (\mathfrak{z}_{2it} - \eta \bar{\mathfrak{z}}_{2i}) \;\; (w^2_{1it} - \eta \bar{w}^2_{1i}) \ldots (w^s_{1it} - \eta \bar{w}^s_{1i}) \;\; (1-\eta)\bar{\mathfrak{z}}_{2i} \;\; (1-\eta)W_i \bar{Z}_2 \;\; (1-\eta)\bar{w}^2_{1i} \ldots (1-\eta)\bar{w}^s_{1i} \right]
\]
yields the same $(\beta_1 \; \beta_2 \; \gamma_1 \; \gamma_2 \; \rho)$ as in (1.22), where
\[
\eta = 1 - \left[ \sigma^2_u / (\sigma^2_u + T\sigma^2_c) \right]^{1/2}
\]
is assumed to be known. The following proposition summarizes this result.

Proposition 5. Suppose $\tilde{\Gamma} = (\tilde{\beta}_1 \; \tilde{\beta}_2 \; \tilde{\gamma}_1 \; \tilde{\gamma}_2 \; \tilde{\rho})$ is the coefficient vector obtained by running Pooled 2SLS on equation (1.23). Then $\tilde{\Gamma} = \hat{\Gamma}_{FE2SLS}$, the coefficient vector obtained by running Pooled 2SLS on equation (1.22).

The proof of this proposition can be found in the Appendix.
Note that we have included the time averages of the instruments in (1.23), but this might introduce some distortions in the sense that the dimension of the $z$'s might be larger than the original dimension of the $x_2$'s. In practice, this will impact the degrees of freedom employed in the hypothesis test used to choose between FE and RE. Although this might not matter when the cross sectional dimension is large, in small samples it could have a significant impact on the statistical significance of the coefficients.

It is important to note that this FE-RE equivalence is an algebraic result and, as it turns out, one can obtain the FE coefficients $(\beta \; \gamma \; \rho)$ of (1.23) by replacing the averages of the instruments with the time averages of the predicted values from a regression of the endogenous variables on all of the exogenous variables, i.e.
\[
\begin{aligned}
y_{it} - \eta \bar{y}_i ={}& (x_{1it} - \eta \bar{x}_{1i})\beta_1 + (\hat{x}_{2it} - \eta \hat{\bar{x}}_{2i})\beta_2 + (w_{1it} - \eta \bar{w}_{1i})\gamma_1 + (\hat{w}_{2it} - \eta \hat{\bar{w}}_{2i})\gamma_2 + \rho W_i (\hat{y}_t - \eta \hat{\bar{y}}) \\
& + (1-\eta)\bar{x}_{1i}\delta_1 + (1-\eta)\hat{\bar{x}}_{2i}\delta_2 + (1-\eta)\bar{w}_{1i}\lambda_1 + (1-\eta)\hat{\bar{w}}_{2i}\lambda_2 + (1-\eta)W_i \hat{\bar{y}}\,\zeta_1
\end{aligned} \tag{1.24}
\]
This will "correct" the degrees of freedom issue mentioned above, at the expense of making the asymptotic theory harder, since we have to take into account that we are using the predicted values instead of the original instrument averages. Proposition 6 summarizes this result and is proved in the Appendix.

Proposition 6. Suppose $\check{\Gamma} = (\check{\beta}_1 \; \check{\beta}_2 \; \check{\gamma}_1 \; \check{\gamma}_2 \; \check{\rho})$ is the coefficient vector obtained by running Pooled OLS on equation (1.24), where the hats represent the linear projections of the endogenous variables on the exogenous covariates. Then $\check{\Gamma} = \hat{\Gamma}_{FE2SLS}$, the coefficient vector obtained by running Pooled 2SLS on equation (1.22).

Once the researcher estimates the coefficients of (1.23) or (1.24), the next natural step is to test the hypothesis $\Xi = (\delta \; \lambda \; \zeta) = 0$ [here $\zeta$ denotes either $(\zeta_2 \ldots \zeta_s)$ in (1.23) or $\zeta_1$ in (1.24)] to decide between the FE and RE specifications.
Even if model (1.1) does not impose an explicit functional form on the error term, the $u_{it}$ could still be serially or spatially correlated; therefore, we can use the HACSC estimator proposed in Section 1.3 to conduct a fully robust Hausman-type test in a simple way. Specifically, one would compute the Wald statistic as $\mathcal{W} = (R\hat{\Xi})'(R\hat{\Sigma}R')^{-1}(R\hat{\Xi})$, where $R$ encodes the set of restrictions on the coefficients, $\hat{\Xi}$ is the full set of estimated coefficients and $\hat{\Sigma}$ is the estimated HACSC robust covariance matrix.

1.5 Feasible GLS

As previously stated, and analogous to the time series literature, it is common practice in empirical work to assume a particular structure for the error term in a spatial context. In particular, consider the following model:
\[
\begin{aligned}
y_t &= X_t \beta + v_t \\
v_t &= \rho W v_t + \varepsilon_t \\
\varepsilon_t &= c + u_t
\end{aligned} \tag{1.25}
\]
where $y_t$ is an $N \times 1$ vector, $X_t$ is an $N \times k$ matrix of covariates, $c$ denotes the vector of individual heterogeneity and $u_t$ is a vector of idiosyncratic errors at time $t$. In this model, $X_t$ may contain spatial lags of the independent variables. In what follows, the conditioning of all the analysis on both $X_t$ and $W$ is implicit. By stacking the equations by time period, the model can be rewritten as follows:
\[
\begin{aligned}
y &= X\beta + v \\
v &= (\mathrm{I}_T \otimes \rho W)v + \varepsilon \\
\varepsilon &= (e_T \otimes \mathrm{I}_N)c + u
\end{aligned} \tag{1.26}
\]
where $e_T$ represents a $T \times 1$ vector of ones. At this point, the researcher needs to make an assumption about the orthogonality condition between the independent variables and the composite error term, and more specifically the vector $c$. A typical choice is to assume that all the explanatory variables $X$ are exogenous with respect to both vectors $c$ and $u$, with each element of these being i.i.d. with zero mean and finite variances $\sigma^2_c$ and $\sigma^2_u$ respectively, and with both vectors independent of each other. Note that this working assumption is stronger than the one required for the consistency of the fixed effects estimator described in the previous sections (as in the rest of the paper, I assume that $T$ is fixed and $N \to \infty$).
Given these assumptions, from (1.25) we can write $\mathrm{E}(v_t v_t')$ as follows:
\[
\mathrm{E}(v_t v_t') = (\sigma^2_c + \sigma^2_u)(\mathrm{I}_N - \rho W)^{-1}(\mathrm{I}_N - \rho W')^{-1} \tag{1.27}
\]
Or, using the stacked version in (1.26) instead, we can write $\mathrm{E}(\varepsilon\varepsilon') = \Omega_\varepsilon$ in the following way:
\[
\Omega_\varepsilon = \sigma^2_c (J_T \otimes \mathrm{I}_N) + \sigma^2_u \mathrm{I}_{NT} \tag{1.28}
\]
where $J_T = e_T e_T'$. Therefore it follows that
\[
\mathrm{E}(vv') = \left[ \mathrm{I}_T \otimes (\mathrm{I}_N - \rho W)^{-1} \right] \left[ \sigma^2_c (J_T \otimes \mathrm{I}_N) + \sigma^2_u \mathrm{I}_{NT} \right] \left[ \mathrm{I}_T \otimes (\mathrm{I}_N - \rho W)^{-1} \right]' \tag{1.29}
\]
Note that the middle of this matrix has a classic random effects structure. In order to compute this covariance matrix, it is assumed that the matrix $(\mathrm{I}_N - \rho W)$ is invertible and that $|\rho| < 1$, just as in the previous sections. Following the time series case, and to facilitate the computation of the middle of (1.29), note that
\[
\Omega_\varepsilon = \sigma^2_u Q_0 + \sigma^2_1 Q_1 \tag{1.30}
\]
where $Q_0 = \left( \mathrm{I}_T - \frac{J_T}{T} \right) \otimes \mathrm{I}_N$, $Q_1 = \frac{J_T}{T} \otimes \mathrm{I}_N$ and $\sigma^2_1 = \sigma^2_u + T\sigma^2_c$. Noting that $Q_0$ and $Q_1$ are idempotent and symmetric, that $Q_0 + Q_1 = \mathrm{I}_{NT}$ and that $Q_0 Q_1 = 0_{NT}$, it follows that $\Omega^{-1}_\varepsilon = \sigma^{-2}_u Q_0 + \sigma^{-2}_1 Q_1$ and $\Omega^{-1/2}_\varepsilon = \sigma^{-1}_u Q_0 + \sigma^{-1}_1 Q_1$.

In short, if the researcher is willing to impose that the covariates are orthogonal to the individual heterogeneity vector $c$ and that the error term in (1.26) follows a spatial AR(1) process, then the matrix $\mathrm{E}(vv')$ will have a particular form that depends on only three parameters. Knowing this, one can obtain an estimator that is potentially more efficient than the FE estimator. More specifically, the researcher can exploit the structure of the error term in (1.26) to remove the spatial correlation by performing a spatial Cochrane-Orcutt type transformation. Let
\[
\begin{aligned}
y^* &= y - (\mathrm{I}_T \otimes \rho W)y \\
X^* &= X - (\mathrm{I}_T \otimes \rho W)X \\
v^* &= v - (\mathrm{I}_T \otimes \rho W)v
\end{aligned}
\]
Therefore, the transformed model is
\[
y^* = X^*\beta + v^* \tag{1.31}
\]
Note that $v^* = \varepsilon$, so that (1.31) contains a classical composite error term. Given the structure of $\varepsilon$, we can perform a second transformation by multiplying (1.31) by $\Omega^{-1/2}_\varepsilon$ to obtain
\[
\check{y} = \check{X}\beta + \check{\varepsilon} \tag{1.32}
\]
where $\check{y} = \Omega^{-1/2}_\varepsilon y^*$ and similarly for the rest of the terms.
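The error-components algebra behind (1.30) — that $Q_0$ and $Q_1$ are idempotent, symmetric, orthogonal and sum to the identity, so that $\Omega^{-1/2}_\varepsilon$ has the stated closed form — can be verified numerically on a small example. The sketch below uses arbitrary illustrative values of $\sigma^2_u$ and $\sigma^2_c$.

```python
import numpy as np

# Verify the decomposition Omega_eps = s2_u*Q0 + s2_1*Q1 and the closed
# form of Omega_eps^{-1/2} for a small panel (N units, T periods,
# stacked by time period as in the text).
N, T = 4, 3
s2_u, s2_c = 1.3, 0.6              # illustrative variance components
s2_1 = s2_u + T * s2_c
J = np.ones((T, T))
Q0 = np.kron(np.eye(T) - J / T, np.eye(N))
Q1 = np.kron(J / T, np.eye(N))

Omega = s2_u * Q0 + s2_1 * Q1                       # equals (1.28)
Omega_mhalf = Q0 / np.sqrt(s2_u) + Q1 / np.sqrt(s2_1)

assert np.allclose(Q0 @ Q0, Q0) and np.allclose(Q1 @ Q1, Q1)  # idempotent
assert np.allclose(Q0 @ Q1, 0)                                 # orthogonal
assert np.allclose(Q0 + Q1, np.eye(N * T))                     # sum to I
# the whitening in (1.33): Omega^{-1/2} Omega Omega^{-1/2} = I_NT
assert np.allclose(Omega_mhalf @ Omega @ Omega_mhalf, np.eye(N * T))
```

The same check confirms that (1.30) reproduces (1.28): $\sigma^2_u Q_0 + \sigma^2_1 Q_1 = \sigma^2_c (J_T \otimes \mathrm{I}_N) + \sigma^2_u \mathrm{I}_{NT}$.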
Note that
\[
\mathrm{E}(\check{\varepsilon}\check{\varepsilon}') = \Omega^{-1/2}_\varepsilon \mathrm{E}(\varepsilon\varepsilon') \Omega^{-1/2}_\varepsilon = (\sigma^{-1}_u Q_0 + \sigma^{-1}_1 Q_1)(\sigma^2_u Q_0 + \sigma^2_1 Q_1)(\sigma^{-1}_u Q_0 + \sigma^{-1}_1 Q_1) = Q_0 + Q_1 = \mathrm{I}_{NT} \tag{1.33}
\]
Hence (1.32) can be estimated by Pooled OLS to obtain a GLS-type estimator, denoted by $\hat{\beta}_{GLS}$, that delivers efficiency gains. If all the relevant matrices are well behaved as $N \to \infty$ and non-singular, Kapoor et al. (2007) showed that
\[
(NT)^{1/2}\left( \hat{\beta}_{GLS} - \beta \right) \xrightarrow{d} \mathcal{N}(0, \Psi) \quad \text{as } N \to \infty \tag{1.34}
\]
where $\Psi = \left( \sigma^{-2}_u M^0_{XX} + \sigma^{-2}_1 M^1_{XX} \right)^{-1}$ and $M^j_{XX} = \lim_{N \to \infty} \frac{1}{NT} X^{*\prime} Q_j X^*$ for $j = 0, 1$. The previous analysis requires knowledge of $\sigma^2_c$, $\sigma^2_u$ and $\rho$, and is therefore not feasible. Kapoor et al. (2007) proposed generalized moments estimators of these parameters and showed that if $\hat{\beta}_{FGLS}$ is the Pooled OLS estimator of (1.32) using any consistent estimators $\hat{\sigma}^2_c$, $\hat{\sigma}^2_u$ and $\hat{\rho}$ instead of $\sigma^2_c$, $\sigma^2_u$ and $\rho$, then
\[
(NT)^{1/2}\left( \hat{\beta}_{GLS} - \hat{\beta}_{FGLS} \right) \xrightarrow{p} 0 \quad \text{and} \quad \hat{\Psi} - \Psi \xrightarrow{p} 0 \tag{1.35}
\]
where $\hat{\Psi} = \left( \frac{1}{NT} \hat{X}^{*\prime} \hat{\Omega}^{-1}_\varepsilon \hat{X}^* \right)^{-1}$, provided that the working assumptions used to derive (1.34) hold. Note that the hats over the components of $\hat{\Psi}$ denote the dependence of these terms on $\hat{\sigma}^2_c$, $\hat{\sigma}^2_u$ and $\hat{\rho}$.

The validity of the previous covariance matrix $\Psi$ rests on the working assumptions that the error term $v$ follows a spatial AR(1) and that the conditions imposed on each element of $c$ and $u$ hold. However, from an empirical perspective it is always possible that the structure of $\Omega_\varepsilon$ does not have the RE form, due for example to the presence of heteroskedasticity or serial correlation in $u_i$. It is important to stress that even if $\Omega_\varepsilon$ does not have the same structure as in (1.28), $\hat{\beta}_{FGLS}$ remains consistent, provided that the strict exogeneity condition (more formally, $\mathrm{E}[X \otimes c] = 0$ and $\mathrm{E}[X \otimes u] = 0$) and the corresponding rank condition continue to hold.
Nevertheless, if the researcher is unsure about the assumptions on the vector of individual heterogeneity $c$ or the idiosyncratic errors $u$ made in this section, it is wise to perform robust inference. In these instances, the HACSC estimator presented in this paper can be used to achieve this purpose. More specifically, consider the residuals $\hat{\check{\varepsilon}}_t = \check{y}_t - \check{X}_t \hat{\beta}_{FGLS}$, $t = 1 \ldots T$, where $\hat{\beta}_{FGLS}$ is obtained by estimating (1.32). In this context, the $(r,s)$-th element of the middle of the robust covariance matrix is
\[
\hat{\Sigma}_{rs} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \check{X}_{it,r} \check{X}_{jl,s} \, \hat{\check{\varepsilon}}_{it} \hat{\check{\varepsilon}}_{jl} \, K\left[ \frac{\rho^*(i,j)}{\rho_b} \right] \tag{1.36}
\]
and the fully robust covariance matrix is:
\[
\check{\Psi} = \left( \check{X}'\check{X} \right)^{-1} \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{l=1}^{T} \hat{\check{\varepsilon}}_{it} \hat{\check{\varepsilon}}_{jl} \, \check{X}_{it}' \check{X}_{jl} \, K\left[ \frac{\rho^*(i,j)}{\rho_b} \right] \right\} \left( \check{X}'\check{X} \right)^{-1} \tag{1.37}
\]
where $\check{X}_{it}$ is the $1 \times k$ vector of covariates at time $t$ for observation $i$. Note that the computation of $\check{\Psi}$ requires the use of the transformed variables and not the original ones, which is consistent with the estimating equation (1.32). As in the previous sections, the kernel function $K(\cdot)$ provides weights so that the (possible) spatial correlation decreases for observations that are far apart according to the distance measure $\rho(\cdot,\cdot)$. Naturally, $\check{\Psi}$ will be valid whether or not the RE structure of $\Omega_\varepsilon$ holds, and it will be robust to arbitrary serial and spatial correlation, as well as heteroskedasticity.

Throughout this section we have assumed that all the explanatory variables are uncorrelated with the error term $u$. If some elements of $X$ are endogenous (i.e. $\mathrm{E}[x_{it}'u_{it}] \neq 0$) and the researcher has instruments $Z$ available, then the extension to an IV procedure is straightforward, as discussed in Mutl and Pfaffermayr (2010) and Baltagi and Liu (2011).
The estimation approach would be to apply Pooled 2SLS to the estimating equation (1.32) using instruments $\check{Z}$, where the check denotes the same transformations made earlier in the section. In this instance, the computation of the covariance matrix using the HACSC estimator would look like (1.16), but the researcher would need to use the transformed variables of this section instead.

1.6 Alternative estimation: a Control Function approach

It is well known that Instrumental Variables estimation procedures such as 2SLS deliver consistent estimates of the parameters at the expense of losing precision relative to OLS, as pointed out by Cameron and Trivedi (2005). In such instances, if the researcher is willing to impose additional assumptions, she can resort to the control function approach (Blundell & Powell, 2003), which can deliver estimates that are (potentially) more efficient, as will be shown in the simulations. To this end, consider the following estimating equation:
\[
y_{it} = f(X_{1t}, X_{2t}, W) + c_i + u_{it} \tag{1.38}
\]
where $f(\cdot)$ is a known function, $\mathrm{E}(X_{1t}'u_{it}) = 0$ and $\mathrm{E}(X_{2t}'u_{it}) \neq 0$. In practice, $f(\cdot)$ will almost certainly contain linear functions of $X_{1t}$ and $X_{2t}$ as well as spatial spillovers of these variables, but it can also include nonlinear terms in the endogenous variables such as interactions with $X_{1t}$, squared functions and so on. Now, to analyze the CF approach, consider equation (1.39), which is a special case of (1.38) and is very similar to (1.1) but without the spatial lag of the dependent variable on the right hand side,⁵ which allows us to interpret it as a conditional mean function; for simplicity we assume that there is only one element in $x_{2it}$:
\[
y_{it} = x_{1it}\beta_1 + x_{2it}\beta_2 + W_i X_{1t}\gamma_1 + W_i X_{2t}\gamma_2 + c_i + u_{it} = x_{it}\beta + W_i X_t \gamma + c_i + u_{it}, \quad i = 1 \ldots N, \; t = 1 \ldots T \tag{1.39}
\]
where the definitions are the same as in Section 1.
By applying the within transformation, we obtain the estimating equation:
\[
\ddot{y}_{it} = \ddot{x}_{it}\beta + W_i \ddot{X}_t \gamma + \ddot{u}_{it} \tag{1.40}
\]
As with the 2SLS case, and using obvious notation, this approach also requires the availability of a set of instruments $\ddot{Z}_{it} = (\ddot{x}_{1it} \;\; \ddot{w}_{1it} \;\; \ddot{z}_{2it} \;\; \ddot{\mathfrak{z}}_{2it})$. The first two assumptions of the Control Function (CF) approach are the same as for 2SLS, namely $\mathrm{E}(\ddot{Z}_{it}'\ddot{u}_{jt}) = 0$ for $i, j = 1 \ldots N$ and $t = 1 \ldots T$, and the identification condition $\mathrm{rank}[\mathrm{E}(\ddot{Z}'\ddot{A})] = 2k - 1$. The first stage of the estimation involves the reduced form of the endogenous variable on the instruments and obtaining the disturbances $\ddot{v}_{2it}$, i.e.
\[
\ddot{v}_{2it} = \ddot{x}_{2it} - \ddot{Z}_{it}\psi \tag{1.41}
\]
where $\mathrm{E}(\ddot{Z}_{it}'\ddot{v}_{it}) = 0$. Given that $\mathrm{E}(\ddot{Z}_{it}'\ddot{u}_{it}) = 0$, note that $\ddot{x}_{2it}$ and $\ddot{w}_{2it}$ are endogenous if and only if $\ddot{u}_{it}$ is correlated with $\ddot{v}_{2it}$ and $W_i \ddot{v}_{2t}$. At this point we state the additional assumption required by the CF approach:
\[
\mathrm{E}(\ddot{u}_{it} | Z, X_2, W) = \mathrm{E}(\ddot{u}_{it} | Z, \ddot{v}_2, W) = \mathrm{E}(\ddot{u}_{it} | \ddot{v}_2, W) = \mu_1 \ddot{v}_{2it} + \mu_2 W_i \ddot{v}_{2t} \tag{1.42}
\]
This equation has two strong implicit restrictions. First, the second equality holds under independence of $Z$ and $(\ddot{u}, \ddot{v}_2, W)$; second, we are assuming a conditional expectation of $\ddot{u}_{it}$ that is linear in the parameters.

⁵ It is certainly possible to use the control function approach with the spatial lag of the dependent variable as a covariate.
Given this, we can write
\[
\ddot{u}_{it} = \mu_1 \ddot{v}_{2it} + \mu_2 W_i \ddot{v}_{2t} + \ddot{e}_{it} \tag{1.43}
\]
Replacing (1.43) in (1.40) yields:
\[
\ddot{y}_{it} = \ddot{x}_{it}\beta + W_i \ddot{X}_t \gamma + \mu_1 \ddot{v}_{2it} + \mu_2 W_i \ddot{v}_{2t} + \ddot{e}_{it} \tag{1.44}
\]
Stacking again all the explanatory variables into a matrix $A$ and the coefficients into a vector $\theta$ yields:
\[
\ddot{y}_{it} = \ddot{a}_{it}\theta + \ddot{e}_{it} \tag{1.45}
\]
The error term in (1.45) is uncorrelated with the rest of the variables in the equation (including $\ddot{x}_{2it}$ and $\ddot{w}_{2it}$), so the parameters can be consistently estimated using Pooled OLS by replacing the disturbances with the computed residuals from the first stage. Therefore, the estimating equation for the main model becomes:
\[
\ddot{y}_{it} = \hat{\ddot{a}}_{it}\theta + \ddot{e}_{it} \tag{1.46}
\]
where the hat denotes that we are using generated regressors. Two important observations follow from equation (1.44). First, by including both $\ddot{v}_{2it}$ and $W_i \ddot{v}_{2t}$, the parameters obtained from this estimation will be numerically the same as 2SLS.⁶ Second, if $\mu_2 = 0$, then it is enough to include only $\ddot{v}_{2it}$ in the estimating equation to get consistent estimates of $\theta$, and in this scenario they will differ from 2SLS. Furthermore, it is precisely by excluding $W_i \ddot{v}_{2t}$ from the estimation that the CF would probably be more efficient than 2SLS in this case, as it would be using the information contained in this restriction.

The CF has some additional advantages over 2SLS. One, the inclusion of the generated regressors in (1.44) allows the researcher to perform a Hausman-type test of whether the suspected variables are endogenous, a test that can be made robust to heteroskedasticity and to spatial and serial correlation using the estimator proposed below.

⁶ In this sense, we do not get any efficiency gains compared to 2SLS by including both terms.
Second, the CF can handle nonlinear functions of the endogenous variables in a parsimonious way: for example, in model (1.39), $x_{2it}$ could enter through interactions with other exogenous variables or even squared terms, in which case the CF only requires including $\hat{\ddot{v}}_{2it}$ in the final estimating equation, whereas 2SLS would need a reduced form equation for each additional function of the endogenous variable. If such nonlinear functions of the endogenous variable are indeed present in the main model, the CF can be made more flexible by including terms such as $\hat{\ddot{v}}^2_{2it}$ (but again, this is not necessary, as the inclusion of $\hat{\ddot{v}}_{2it}$ already "controls" for this endogeneity), at the cost of having to adapt the standard errors to account for these new generated regressors.

From this point onward, one has to decide how to deal with the error term. One option is to impose some structure on it and apply a Feasible GLS procedure in order to obtain further efficiency gains. Note that this is possible because in (1.42) we conditioned on the whole set of exogenous variables and the weighting matrix. However, this would not be possible if we slightly modify the model. So far we have assumed that the model also contains spatial spillovers of the endogenous variable $\ddot{x}_{2it}$, but suppose that for some theoretical reason the model does not include $W_i \ddot{X}_{2t}$. In this case we could relax (1.42) to
\[
\mathrm{E}(\ddot{u}_{it} | Z_{it}, x_{2it}) = \mathrm{E}(\ddot{u}_{it} | Z_{it}, \ddot{v}_{2it}) = \mathrm{E}(\ddot{u}_{it} | \ddot{v}_{2it}) = \mu_1 \ddot{v}_{2it} \tag{1.47}
\]
Note that we are now conditioning only on the unit's own control function. In this instance one could still estimate the transformed model by Pooled OLS; however, it would preclude applying a Feasible GLS procedure, because the strict spatial exogeneity assumption would be violated, since it would involve the weighting matrix $W$ and the error terms of other observations.
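The numerical identity between the CF estimator (with the first-stage residual included) and 2SLS can be checked in a stripped-down cross-sectional setting with one endogenous regressor, one instrument and no spatial terms. This is a sketch under simplified assumptions, not the paper's full panel model; all names and parameter values are illustrative.

```python
import numpy as np

# CF vs 2SLS: including the first-stage residual vhat in the
# second-stage OLS reproduces the 2SLS slope exactly.
rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=n)
v = rng.normal(size=n)
x = 0.9 * z + v                       # reduced form
u = 0.5 * v + rng.normal(size=n)      # endogeneity runs through v
y = 1.0 + 0.7 * x + u

Z1 = np.column_stack([np.ones(n), z])
psi = np.linalg.lstsq(Z1, x, rcond=None)[0]
vhat = x - Z1 @ psi                   # first-stage residuals
xhat = Z1 @ psi                       # first-stage fitted values

# CF second stage: OLS of y on (1, x, vhat); coefficient on x
Xcf = np.column_stack([np.ones(n), x, vhat])
beta_cf = np.linalg.lstsq(Xcf, y, rcond=None)[0][1]

# 2SLS: OLS of y on (1, xhat); coefficient on xhat
beta_2sls = np.linalg.lstsq(np.column_stack([np.ones(n), xhat]), y,
                            rcond=None)[0][1]

assert np.isclose(beta_cf, beta_2sls)
```

The equality is exact because $x = \hat{x} + \hat{v}$ with $\hat{x} \perp \hat{v}$ in sample, so the regression on $(1, x, \hat{v})$ is a reparametrization of the regression on $(1, \hat{x}, \hat{v})$.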
Alternatively, the researcher can treat the error term non-parametrically and apply the HACSC estimator proposed in this paper to obtain robust standard errors. Nevertheless, in this case there is an additional layer of complication on top of the spatio-temporal correlation and the heteroskedasticity: by including $\hat{\ddot{v}}_{2it}$ in the estimating equation, we now have a generated regressor and therefore the covariance matrix of the parameters needs to be adjusted to take into account the sampling error induced by the first-stage estimation (i.e. we are getting estimates of $\psi$). Although Basile et al. (2014) recommend a bootstrap to obtain the standard errors in a CF setup, resampling spatially dependent data is not a trivial matter, so having a formula is useful in practice. In this setup, the fully robust covariance matrix is

$B^{-1} M B^{-1}$  (1.48)

where

$B = \mathrm{E}\left(\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} \ddot{a}_{it}' \ddot{a}_{it}\right)$

$M = \mathrm{Var}\left[\sum_{i}^{N}\sum_{t}^{T} m_{it}\right]$, with $m_{it} = (\ddot{z}_{it}\psi)'(\ddot{e}_{it} + \ddot{v}_{it}\theta) - G \cdot r_{it}(\psi)\theta$

$G = \mathrm{E}\left[\sum_{i}^{N}\sum_{t}^{T} (\ddot{z}_{it}\psi)' \ddot{z}_{it}\right]$

$r_{it}(\psi) = \left(\frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} \ddot{z}_{it}' \ddot{z}_{it}\right)^{-1}\left[(NT)^{-\frac{1}{2}} \ddot{z}_{it}' \ddot{v}_{it}\right]$

The derivation of (1.48) can be found in the Appendix. To estimate it, we can replace the population quantities by their sample analogues, so that

$\hat{B} = \frac{1}{NT}\sum_{i}^{N}\sum_{t}^{T} \hat{\ddot{a}}_{it}' \hat{\ddot{a}}_{it}$

$\hat{m}_{it} = (\ddot{z}_{it}\hat{\psi})'(\hat{\ddot{e}}_{it} + \hat{\ddot{v}}_{it}\hat{\theta}) - \hat{G} \cdot \hat{r}_{it}(\hat{\psi})\hat{\theta}$

With these quantities calculated, the $(r,s)$-th element of $M$ can be estimated as

$\hat{M}_{rs} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T}\sum_{l=1}^{T} \hat{m}_{it,r}\, \hat{m}_{jl,s}\, K\left[\frac{\rho^*(i,j)}{\rho_b}\right]$

Note that (1.48) also has a sandwich-type form, very similar to the HACSC estimator presented earlier.
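The kernel-weighted middle matrix $\hat{M}$ can be sketched in a few lines. The function names and array layout are illustrative assumptions; the corrected moments $m_{it}$ (already including the first-stage adjustment term) are taken as given:

```python
import numpy as np

def bartlett(d, rho_b):
    """Bartlett kernel: weight 1 - d/rho_b when d < rho_b, zero otherwise."""
    return np.maximum(1.0 - d / rho_b, 0.0)

def hacsc_middle(m, dist, rho_b):
    """Sketch of the middle matrix M of the sandwich form.

    m    : (N, T, k) array of corrected moments m_it
    dist : (N, N) matrix of pairwise distances rho*(i, j)
    rho_b: bandwidth beyond which pairs receive zero weight
    """
    N, T, k = m.shape
    # Sum moments over time first: serial correlation is left unrestricted,
    # since all t, l combinations for a pair (i, j) get the same weight.
    m_bar = m.sum(axis=1)                    # (N, k)
    K = bartlett(dist, rho_b)                # (N, N) kernel weights
    # M_rs = (1/NT) * sum_{i,j} K[i,j] * m_bar[i,r] * m_bar[j,s]
    return (m_bar.T @ K @ m_bar) / (N * T)
```

Because the kernel weight depends only on the pair $(i,j)$, the quadruple sum collapses into a quadratic form in the time-summed moments, which keeps the computation at $O(N^2)$ rather than $O(N^2T^2)$.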
Similarly, the kernel function is also used to operationalize the weak spatial dependence assumption; however, in this case the terms it multiplies ($m_{it}$ instead of $\ddot{z}_{it}' \ddot{u}_{it}$) have a different structure, to take into account the first-stage sampling error.

1.7 Simulations

1.7.1 Design

To test the performance of the HACSC estimator and its CF version, I performed a Monte Carlo study. In this experiment, the units of observation live on a squared regular grid of 20 × 20 and the distance between two adjacent individuals is normalized to one. To evaluate the performance of the estimator, consider the following data generating process:

$y_{it} = \beta_0 + x_{1it}\beta_1 + x_{2it}\beta_2 + x_{1it}x_{2it}\beta_3 + c_i + u_{it}$
$x_{1it} = \delta_0 + \delta_1 z_{1it} + v_{it}$
$c_i = [(I - \rho W)^{-1}]_i C$
$u_{it} = \alpha v_{it} + e_{it}$
$e_t = (I - \rho W)^{-1} a_t$
$a_{it} = \psi a_{i,t-1} + \varepsilon_{it}$
$\mathrm{E}(x_{1it}u_{it}) \neq 0, \quad \mathrm{E}(x_{2it}c_i) \neq 0, \quad \mathrm{E}(z_{1it}c_i) \neq 0$

where $[\beta_0\ \beta_1\ \beta_2\ \beta_3]' = [2\ 0.7\ 0.6\ 0.3]'$ and $\varepsilon_{it}$ and $C$ are independent and identically distributed normal random variables, independent of each other. $z_{1it}$ is an instrument for $x_{1it}$ and $x_{2it}$ is exogenous with respect to the error term $u_{it}$; they follow normal and gamma distributions, respectively. Note that there is an interaction term between the endogenous and exogenous variables, for which we have a readily available instrument, $z_{1it}x_{2it}$. In this setup, the error term $u_{it}$ satisfies the CF assumption, given that it depends linearly on the error term from the reduced-form equation, $v_{it}$. The error terms $e$ and $a$ follow spatial and temporal AR(1) processes, respectively. The strength of the spatial correlation is governed by the parameter $\rho$, while the persistence of the serial correlation is governed by $\psi$. Note also that the individual heterogeneity follows a spatial AR(1) model; however, since I apply the within transformation for the estimation, its DGP will not affect the results.
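A minimal sketch of one draw from this DGP follows; the constants that the text does not pin down ($\delta_0$, $\delta_1$, $\alpha$, and the gamma shape/scale) are set to illustrative values, as noted in the comments:

```python
import numpy as np

def simulate_dgp(W, T=5, rho=0.3, psi=0.7, alpha=0.5,
                 beta=(2.0, 0.7, 0.6, 0.3), seed=0):
    """One Monte Carlo draw from the DGP above (sketch). beta is fixed as in
    the paper; delta0, delta1, alpha and the gamma parameters are assumed."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    S = np.linalg.inv(np.eye(N) - rho * W)    # spatial AR(1) filter

    z1 = rng.normal(size=(N, T))              # instrument for x1
    x2 = rng.gamma(2.0, 1.0, size=(N, T))     # exogenous gamma covariate
    v = rng.normal(size=(N, T))               # reduced-form error
    x1 = 1.0 + 1.0 * z1 + v                   # delta0 = delta1 = 1 (assumed)

    # a_it: temporal AR(1); e_t = (I - rho*W)^{-1} a_t: spatial AR(1)
    a = np.zeros((N, T))
    a_prev = rng.normal(size=N) / np.sqrt(1 - psi**2)  # stationary start
    for t in range(T):
        a_prev = psi * a_prev + rng.normal(size=N)
        a[:, t] = a_prev
    e = S @ a
    u = alpha * v + e                         # x1 endogenous through v

    c = S @ rng.normal(size=N)                # spatially correlated c_i
    b0, b1, b2, b3 = beta
    y = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2 + c[:, None] + u
    return y, x1, x2, z1
```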
For the weighting matrix $W$, I used a rook-type weighting scheme, so that each observation has between two and four pseudo-neighbors, each with equal weight. $W$ is row-normalized to ensure that $(I - \rho W)$ is invertible. I estimated the model using both FE 2SLS and the CF approach with $N = 400$ and $T = 5$, using 1,000 replications. I am interested in comparing the estimates of the coefficients by the two methods to see if there are efficiency gains from using the CF approach. Furthermore, I also want to evaluate the performance of four different estimators of the covariance matrix: the HACSC proposed in this paper, a SHAC assuming no serial correlation, the cluster-robust one, and the "regular" one without any adjustment. In the case of the CF approach, I compare the standard errors presented in Section 1.6, which account for the first stage, and a HACSC that ignores the two-step procedure.

I conducted a simulation for every combination of $\rho \in \{0, 0.3, 0.7\}$ and $\psi \in \{0, 0.3, 0.7\}$. I used the Bartlett kernel to perform the analysis, contrary to Kelejian and Prucha (2007), who used the Parzen kernel. An important parameter in this experiment is the threshold distance $\rho_b$ at which the kernel assigns a zero weight to units that are more than $\rho_b$ apart. Following the recommendation of the authors mentioned above, I set $\rho_b = \lfloor N^{1/4} \rfloor$, i.e. the integer part of $N^{1/4}$. At each replication, I draw a new set of covariates, which is then held fixed across the combinations of the $\rho$ and $\psi$ parameters.

1.7.2 Results

This section describes the results of the simulations using two metrics for the estimated coefficients: the mean and the corresponding standard deviation across the 1,000 replications for different values of $\rho$ and $\psi$.
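The rook-contiguity weighting matrix and the bandwidth rule just described can be sketched as follows (a minimal construction; the paper's own implementation is not shown):

```python
import numpy as np

def rook_weights(side=20):
    """Row-normalized rook-contiguity matrix on a side x side regular grid:
    each cell's neighbors are the cells directly above, below, left, right."""
    N = side * side
    W = np.zeros((N, N))
    for i in range(side):
        for j in range(side):
            k = i * side + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < side and 0 <= nj < side:
                    W[k, ni * side + nj] = 1.0
    # Row-normalize so each pseudo-neighbor receives an equal weight.
    return W / W.sum(axis=1, keepdims=True)

W = rook_weights(20)
N = W.shape[0]
rho_b = int(N ** 0.25)   # bandwidth: integer part of N^(1/4)
```

Corner units end up with two neighbors, edge units with three, and interior units with four, matching the design; for N = 400 the bandwidth rule gives ρ_b = 4.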
Table 1.1 presents the outcomes of this experiment and shows that both estimators provide unbiased estimates of the parameters, in the sense that the average of the estimated coefficients is centered around the true values for any combination of $\rho$ and $\psi$. This is expected, since in this exercise the CF assumption is true.

However, when analyzing the standard deviations, the CF consistently shows a lower value than 2SLS (e.g. 0.049 against 0.084 for $\beta_3$ when $\rho = \psi = 0.3$). Figure 1.3 exemplifies this finding: the distribution of the estimated parameters is tighter around the true value for the CF estimates than for those of 2SLS. Therefore, whenever the CF assumption holds, this estimator appears to be more efficient, which can be explained by the fact that we are using additional information when performing the estimation. Interestingly, these efficiency gains are more evident for $\beta_1$ and $\beta_3$, the coefficients associated with the endogenous variables, whereas for the coefficient of the exogenous covariate, $\beta_2$, the differences between the standard deviations of the two estimators are more modest across all pairs of $\rho$ and $\psi$.

Table 1.1 Average estimated coefficients and standard deviations across the 1,000 replications using a rook-type weighting matrix, N=400 and T=5.
ρ    ψ     β1 CF          β1 2SLS        β2 CF          β2 2SLS        β3 CF          β3 2SLS
0.0  0.0   0.704 (0.196)  0.698 (0.294)  0.606 (0.269)  0.604 (0.283)  0.298 (0.049)  0.300 (0.089)
0.0  0.3   0.696 (0.188)  0.692 (0.269)  0.595 (0.254)  0.593 (0.272)  0.300 (0.047)  0.301 (0.080)
0.0  0.7   0.703 (0.180)  0.705 (0.266)  0.600 (0.250)  0.601 (0.265)  0.300 (0.045)  0.299 (0.079)
0.3  0.0   0.690 (0.207)  0.683 (0.301)  0.589 (0.275)  0.584 (0.299)  0.303 (0.052)  0.305 (0.090)
0.3  0.3   0.706 (0.191)  0.718 (0.281)  0.603 (0.276)  0.608 (0.294)  0.299 (0.049)  0.294 (0.084)
0.3  0.7   0.704 (0.189)  0.697 (0.288)  0.599 (0.254)  0.595 (0.282)  0.299 (0.048)  0.301 (0.085)
0.7  0.0   0.695 (0.236)  0.698 (0.330)  0.573 (0.352)  0.575 (0.371)  0.302 (0.057)  0.301 (0.095)
0.7  0.3   0.691 (0.231)  0.693 (0.334)  0.580 (0.335)  0.581 (0.360)  0.303 (0.054)  0.301 (0.095)
0.7  0.7   0.704 (0.221)  0.693 (0.309)  0.603 (0.317)  0.600 (0.336)  0.300 (0.054)  0.303 (0.089)

To analyze the performance of the HACSC estimator, I use two metrics. First, I take the average of the variances7 estimated for each coefficient for each pair of $\rho$ and $\psi$ across the 1,000 replications and compare it with the "true value", computed as the variance of the set of estimated coefficients for each pair of $\rho$ and $\psi$ across the 1,000 replications. Tables D.1-D.3 present this comparison. The first thing to note in the case of the CF is that both estimated variances, with and without the first-stage correction, are very close to the true value, so at first glance, using this metric, the correction does not seem to make an impact.

7 I used the estimated variances instead of the standard errors because the nonlinearity of the square root function could affect the results.

For the 2SLS estimator, the differences are more substantial. The HACSC estimator is consistently closer to the true value across all pairs of $\rho$ and $\psi$ compared to the SHAC that imposes no serial correlation and the non-robust one. In general, the variance estimated with the HACSC is on average larger than the one computed with these two alternatives.
Admittedly, in this case the cluster-robust variances are also very close to the true value. Overall, these results suggest that making the standard errors robust to spatial correlation at the expense of imposing no serial correlation can result in unreliable inference. Furthermore, as shown in Figure E.1, the HACSC estimator provides standard errors that are, on average, properly centered around the true value.

Figure 1.3 Distribution of coefficients estimated by 2SLS and the Control Function approach for $\rho = 0.3$ and $\psi = 0.7$ using a rook-type weighting matrix.

As a second method to analyze the HACSC in this setup, I tested the null hypothesis $H_0: \beta_3 = 0.3$ at a 5% significance level using a t-test over the 1,000 replications, using the standard errors computed with the different estimators, and obtained the rejection probabilities. Using this metric, an estimator performs better if the rejection probability is closer to 5%. Table 1.2 presents the results of this exercise.8 For the case of the CF approach, the rejection probabilities using the adjustment are slightly closer to 5% than those of the estimator that ignores the first stage, so in this sense the adjustment seems important to obtain more reliable inference if the researcher uses the CF approach. On the other hand, if we use 2SLS to estimate the coefficients, the HACSC rejection probabilities are closer to 5% than those of the SHAC and the non-robust standard errors, which over-reject the null hypothesis. Using this metric, the cluster-robust standard errors seem to perform just as well as the HACSC estimator. Overall, the results suggest that the HACSC estimator, both in the case of 2SLS and in that of the CF approach with the correction, provides more reliable inference than the existing SHAC.
Table 1.2 Rejection probabilities for the null hypothesis $H_0: \beta_3 = 0.3$ at a 5% significance level using a t-test over the 1,000 replications with a rook-type weighting matrix, N=400, T=5.

ρ    ψ     CF      CF_no1   HACSC   SHAC    Cluster   Non-Robust
0.0  0.0   0.050   0.060    0.067   0.088   0.058     0.082
0.0  0.3   0.046   0.060    0.054   0.072   0.046     0.068
0.0  0.7   0.045   0.061    0.050   0.075   0.043     0.072
0.3  0.0   0.045   0.061    0.068   0.096   0.058     0.091
0.3  0.3   0.050   0.064    0.047   0.074   0.040     0.067
0.3  0.7   0.051   0.062    0.068   0.085   0.058     0.080
0.7  0.0   0.050   0.072    0.057   0.077   0.048     0.066
0.7  0.3   0.041   0.057    0.066   0.095   0.065     0.090
0.7  0.7   0.044   0.060    0.056   0.076   0.041     0.073

CF is the HACSC estimator using the first-stage correction and CF_no1 refers to the HACSC estimator ignoring the first-stage estimation in a CF approach.

1.8 Empirical application

To test the performance of the HACSC estimator with real-world data, I revisit the problem of analyzing the effect of spending on the educational outcomes of fourth graders in Michigan studied by Papke and Wooldridge (2008), using district-level data from 1993 to 2001.9 In short, Michigan changed the way schools were funded in 1994, going from a property-tax-based system to a statewide system, which was made possible through an increase in the sales tax and lottery profits.

8 Tables D.4 and D.5 show the results for $H_0: \beta_1 = 0.7$ and $H_0: \beta_2 = 0.6$, respectively.
9 I want to thank Dr. Papke and Dr. Wooldridge for kindly sharing their data set.

To measure the effect of spending on the academic achievement of students, the authors used as the dependent variable the fraction of fourth-graders who passed the math test (math4$_{it}$) of the Michigan Education Assessment Program (MEAP), given that the definition of this subject and the way it is evaluated have remained relatively constant over time. In addition to the current level of spending on a student, the authors also allow for the possibility that the level of spending in the previous three years plays a role in the test scores.
This is indeed a sensible choice, given that one could argue that the previous years of education lay the foundations of the students' learning process. The model also includes the proportion of students eligible for the free and reduced-price lunch program (lunch$_{it}$), the district enrollment (enroll$_{it}$) and time dummies. More details about the full model can be found in Papke (2005). Borrowing their notation, the estimated model is:

math4$_{it}$ = $\theta_t$ + $\beta_1$ log(avgrexp$_{it}$) + $\beta_2$ lunch$_{it}$ + $\beta_3$ log(enroll$_{it}$) + $c_{i1}$ + $u_{it}$  (1.49)

where avgrexp$_{it}$ denotes the simple average of real spending over the current and previous three years. It is important to note that, in addition to the linear probability model (LPM), Papke and Wooldridge (2008) also estimate the model with other nonlinear estimators, but because they find that the LPM is a good approximation to the nonlinear estimates, and since this paper focuses on linear models, I compare the results only with their LPM results.

In order to replicate their results and use the HACSC estimator, we need a distance measure between the school districts. As mentioned in previous sections, this is not a trivial matter when working with geographical units, but in this case I use the geographic distance between the centroids of each district.10 However, there have been changes in the school districts since 2001, which is why I could only use 98.6% of the original sample used by Papke and Wooldridge (2008). The main reason for this is that some districts have merged with others; in these cases, I used the data point of the district that absorbed the one that disappeared. Table 1.3 compares the summary statistics from the original and new data sets, and the t-tests show that there are no statistically significant differences between them.

10 Roughly speaking, a centroid can be interpreted as the center of mass of a geometry.
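Pairwise centroid distances can be obtained, for example, with the haversine (great-circle) formula; this is a common approximation, sketched here with illustrative names, and is not necessarily the exact computation used for the application:

```python
import numpy as np

def distance_matrix_km(lat, lon):
    """Great-circle distances (km) between centroids via the haversine
    formula; lat/lon are arrays in degrees. Centroid coordinates would come
    from the district shapefiles (assumed available)."""
    lat, lon = np.radians(lat), np.radians(lon)
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(h))  # Earth radius ~ 6371 km
```

The resulting symmetric matrix can be passed directly to the kernel as the $\rho^*(i,j)$ distances.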
Table 1.3 Sample means (standard deviations) of the original and new data sets and corresponding t-tests (p-values).

                                             1995                                          2001
                                 Original      New           t-test         Original      New           t-test
Pass rate on fourth-grade
math test                        0.62 (0.13)   0.62 (0.13)   -0.30 (0.76)   0.76 (0.12)   0.76 (0.13)   -0.43 (0.67)
Real expenditure per pupil
(2001$)                          6329 (986)    6317 (978)    0.20 (0.85)    7161 (933)    7147 (916)    0.25 (0.80)
Real foundation grant (2001$)    5962 (1031)   5959 (1035)   0.05 (0.96)    6348 (689)    6347 (692)    0.03 (0.98)
Fraction eligible for free
and reduced lunch                0.28 (0.15)   0.28 (0.15)   0.27 (0.79)    0.31 (0.17)   0.30 (0.17)   0.34 (0.73)
Enrollment                       3076 (8156)   3099 (8210)   -0.04 (0.97)   3078 (7293)   3103 (7341)   -0.05 (0.96)
Number of observations           501           494           -              501           494           -

As a first step, I assume that all the explanatory variables are exogenous with respect to the error term $u_{it}$ and apply the fixed effects estimator, sometimes referred to as "two-way fixed effects" because of the inclusion of the year dummies. Table 1.4 shows the estimates using the new data set alongside the ones reported by Papke and Wooldridge (2008). The coefficient associated with the average real expenditure is virtually the same, whereas the coefficients of lunch and enrollment switch sign relative to the original estimates. Nevertheless, their magnitudes are small and none of them are statistically significant in either estimation.

Table 1.4 Estimates assuming that all the explanatory variables are exogenous.
                  Original results   New results
                  Coefficient        Coefficient
log(avgrexp)      0.372 (0.071)      0.377 (0.071)
lunch             -0.042 (0.073)     0.029 (0.064)
log(enroll)       0.002 (0.049)      -0.020 (0.048)
Observations      501                493

Standard errors for new results with different bandwidth values (km)
                  ρb=1    ρb=100   ρb=200   ρb=300   ρb=400   ρb=500   ρb=600
log(avgrexp)      0.070   0.072    0.067    0.066    0.066    0.063    0.058
lunch             0.064   0.077    0.079    0.072    0.061    0.060    0.061
log(enroll)       0.048   0.045    0.033    0.028    0.026    0.023    0.022

Table 1.4 also shows the standard errors computed with the HACSC estimator using different bandwidth values. As expected, and because the minimum distance between any two school districts in the data set is 1.05 kilometers, when the bandwidth is 1 kilometer the HACSC estimator is effectively treating the observations as if they have no effect on their neighbors (i.e. no spatial correlation), and consequently the standard errors are very similar to the ones computed using an estimator that is robust only to heteroskedasticity and serial correlation. Interestingly, as the bandwidth increases, the standard error of each coefficient behaves differently: for the average spending, it first increases and then decreases; for enrollment, it decreases monotonically; whereas for lunch, there is no evident pattern. Note that this exercise shows that even if the covariance matrix is robust to heteroskedasticity, spatial and serial correlation, this does not mean that the standard errors will necessarily be larger.

One issue with the estimates previously discussed is that the spending of a school district might be endogenous, mainly because a school district might adjust its current spending if it suspects that the (bad) performance of a cohort throughout the year will be reflected in the pass rates of the MEAP test (Papke & Wooldridge, 2008).
Fortunately, the change in the way school districts are funded brought with it a natural instrument: in the 1993/1994 school year, each district started to receive a per-student "foundation grant" based on its initial funding in 1994, which sought to increase the spending per student to a baseline level and had the effect of reducing the differences in spending between the districts across the state of Michigan by 2001 (see Figure 1.4). The details of why this is a suitable instrument are discussed in Papke and Wooldridge (2008), but in broad terms, the identification assumption is that the idiosyncratic error term has a smooth relationship with both the dependent variable and the initial funding, while the foundation grant depended on the initial funding in a non-smooth way [see Table 1 in Papke and Wooldridge (2008)].

As a result of this concern, Papke and Wooldridge (2008) augmented the model by also including the real spending from 1994 interacted with the time dummies, along with the time averages of lunch and enrollment, using as instruments the foundation grant interacted with the year binary variables.

Figure 1.4 Average real expenditure per student across the Michigan school districts in 1995 and 2001.

The new estimated model using instrumental variables is then

math4$_{it}$ = $\theta_t$ + $\beta_1$ log(avgrexp$_{it}$) + $\beta_2$ lunch$_{it}$ + $\beta_3$ log(enroll$_{it}$) + $\beta_{4t}$ log(rexppp$_{i,1994}$) + $\xi_1 \overline{\text{lunch}}_i$ + $\xi_2 \overline{\log(\text{enroll})}_i$ + $v_{it1}$  (1.50)

Note that because we have a single endogenous variable, using Two Stage Least Squares (2SLS) would in this case be numerically the same as estimating the model with the control function approach, and because of this, I used the latter. Table 1.5 shows the estimates from this model and, once again, the coefficients obtained using the new sample are very similar to the ones computed using the original data set.
In particular, the coefficient on spending is considerably larger than the OLS estimate, which can be explained in the context of the local average treatment effect literature, or by the fact that district authorities can decide to increase spending whenever they think the cohort might underperform (Papke & Wooldridge, 2008).

Table 1.5 Estimates assuming that the spending variable is endogenous.

                  Original results   New results
                  Coefficient        Coefficient
log(avgrexp)      0.546 (0.211)      0.555 (0.208)
lunch             -0.062 (0.075)     0.008 (0.067)
log(enroll)       0.046 (0.067)      0.023 (0.066)
v                 -0.421 (0.232)     -0.476 (0.236)
Observations      501                493

Standard errors for new results with different bandwidth values (km)
                  ρb=1    ρb=100   ρb=200   ρb=300   ρb=400   ρb=500   ρb=600
log(avgrexp)      0.221   0.265    0.292    0.253    0.221    0.202    0.187
lunch             0.066   0.077    0.083    0.079    0.070    0.068    0.067
log(enroll)       0.069   0.075    0.079    0.071    0.065    0.058    0.054
v                 0.250   0.349    0.411    0.383    0.365    0.357    0.353

Contrary to the case where all the independent variables were treated as exogenous, the standard errors computed using the HACSC estimator when the bandwidth parameter is set to 1 kilometer are somewhat different from the ones computed using an estimator that is only robust to serial correlation and heteroskedasticity, which is expected because the latter does not take into account the first-stage estimation. Once again, these results show that the standard errors can be larger or smaller depending on the value selected for the bandwidth.

So far I have assumed that there is only spatial correlation in the error term. However, in this scenario there could be spatial spillovers from neighboring units affecting student performance on the math test. Figure 1.4 not only shows that the average real expenditure per student increased between 1995 and 2001 in all the school districts, but it also shows its spatial distribution.
Note that there are districts whose surrounding neighbors have a very similar level of spending; for example, in 1995 the Detroit region shows multiple school districts with higher levels of expenditure compared to the rest of the state. Similarly, in Figure 1.5 the Upper Peninsula shows several neighboring school districts with higher passing rates than the rest of the region.

Figure 1.5 Pass rates on the fourth-grade math test across the Michigan school districts in 1995 and 2001.

Multiple reasons could be behind this pattern. For instance, it could be the case that parents of underperforming students identify school districts that are increasing spending and, throughout the year, move to one of these districts to help their children improve their grades. From the labor side, school districts might need to increase expenditure on teachers' salaries to avoid losing them to other school districts within a reasonable commuting distance. All in all, it seems important to control for spillover effects of expenditure from neighbors, so I augment the models previously estimated with this additional variable,11 and Table 1.6 shows the estimates of this regression assuming that all the independent variables are exogenous with respect to the error term. Note that the coefficient on the average expenditure has decreased significantly, so that an increase of approximately 10% in spending now leads to an increase in the pass rate of about 2.8 percentage points. On the other hand, if the neighboring school districts of unit $i$ increase their expenditure by around 10%, the pass rate in $i$ is expected to improve by around 3.2 percentage points, a larger effect than that of own spending. To address the endogeneity issue, I also augmented model (1.50) with the spending spillover variable using the control function approach,12 and the results are shown in Table 1.7.
11 For this estimation, I used a rook-type weighting matrix.
12 I used $W \cdot$ log(found) to instrument for $W \cdot$ log(avgrexp).

Table 1.6 OLS with extension

                      Coefficient (st. error)
log(avgrexp)          0.281 (0.076)
lunch                 0.030 (0.063)
log(enroll)           -0.008 (0.047)
W · log(avgrexp)      0.324 (0.090)
Number of districts   493

Standard errors with different bandwidth values (km)
                      ρb=1    ρb=100   ρb=200   ρb=300   ρb=400   ρb=500   ρb=600
log(avgrexp)          0.076   0.077    0.071    0.067    0.065    0.061    0.056
lunch                 0.063   0.077    0.082    0.076    0.066    0.064    0.064
log(enroll)           0.047   0.044    0.035    0.030    0.028    0.025    0.024
W · log(avgrexp)      0.088   0.076    0.071    0.057    0.049    0.047    0.047

Once again, in this case the effect of own expenditure is larger than in the exogenous case, but smaller than the original estimate. The spillover effect is significantly reduced, to a marginal increase of around 0.7 percentage points in the pass rate from an increase in the spending of surrounding school districts, and moreover, the coefficient is not statistically significant.

Overall, the difference in the magnitude of the coefficients obtained for the spending of neighboring units makes it difficult to interpret the effect of this variable. However, in both cases it was positive, which supports the hypothesis that parents may move to school districts where the spending per student is higher. Of course, one cannot rule out the possibility that larger spending by neighboring school districts attracts better teachers to the area who are willing to commute; however, more detailed data may be needed to separate these effects.

Regarding the standard errors, most of the results show a pattern: if the bandwidth parameter is too small, they tend to be smaller than those computed with larger values, but at some point they become smaller again. This phenomenon has been documented in the time series literature: for example, Müller (2014) argues that when the bandwidth is too small, the estimate of the covariance matrix is downward biased.
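A toy time-series illustration of this bandwidth sensitivity, assuming a simple AR(1) series rather than the paper's spatial setting: with a very small bandwidth, the Bartlett estimate of the long-run variance is severely downward biased, while giving every autocovariance a weight of one drives the estimate to exactly zero for demeaned residuals, since it then equals the squared sum of the residuals divided by the sample size.

```python
import numpy as np

def bartlett_lrv(e, b):
    """Bartlett (Newey-West) long-run variance estimate with bandwidth b."""
    e = e - e.mean()                 # in-sample residuals average zero
    n = len(e)
    out = e @ e / n                  # lag-0 autocovariance
    for j in range(1, min(b, n - 1) + 1):
        w = 1.0 - j / (b + 1)        # Bartlett weight for lag j
        out += 2 * w * (e[j:] @ e[:-j]) / n
    return out

def lrv_equal_weights(e):
    """All autocovariances weighted one: equals (sum of residuals)^2 / n,
    which is exactly zero once the residuals are demeaned."""
    e = e - e.mean()
    return e.sum() ** 2 / len(e)
```

For an AR(1) with coefficient 0.7 and unit innovation variance, the true long-run variance is 1/(1-0.7)^2 ≈ 11.1, so a bandwidth of 2 grossly understates it, while the equal-weight estimate collapses to zero.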
In the same line, Kiefer and Vogelsang (2005) show that for an AR(1) process, if the bandwidth is too small, the estimator of the long-run variance is biased, whereas if every observation is given a weight of one in the estimation of the covariance matrix, the estimates will tend to zero because the in-sample residuals average zero, which is precisely what is observed in this example as the bandwidth increases beyond some point.

Table 1.7 IV extension

                      Coefficient (st. error)
log(avgrexp)          0.408 (0.231)
lunch                 0.016 (0.067)
log(enroll)           -0.001 (0.067)
W · log(avgrexp)      0.071 (0.057)
v                     -0.249 (0.254)
Number of districts   493

Standard errors for new results with different bandwidth values (km)
                      ρb=1    ρb=100   ρb=200   ρb=300   ρb=400   ρb=500   ρb=600
log(avgrexp)          0.234   0.317    0.361    0.310    0.262    0.234    0.219
lunch                 0.066   0.078    0.087    0.082    0.074    0.072    0.071
log(enroll)           0.068   0.080    0.088    0.079    0.069    0.062    0.058
W · log(avgrexp)      0.056   0.076    0.083    0.077    0.070    0.067    0.065
v                     0.260   0.379    0.435    0.385    0.346    0.328    0.318

1.9 Conclusion

In this paper, I present a simple way to obtain standard errors that are robust to heteroskedasticity and to both serial and spatial correlation in short panels with fixed effects and endogenous covariates. This is important because, to the best of my knowledge, the current SHAC estimators do not explicitly allow for serial correlation in this context (admittedly, the literature does not ignore this issue when $T \to \infty$). The estimator relies on averaging the moment conditions for a single individual across time, which allows one to treat the estimation like a cross-sectional problem without imposing any restrictions on the serial correlation of the residuals. This will help empirical researchers obtain more reliable standard errors in fields such as urban economics or international trade.
The proposed HACSC estimator can be directly applied in a Correlated Random Effects framework to obtain a fully robust Hausman-type test, which can help empirical researchers choose between Fixed Effects and Random Effects specifications. In this paper I also showed that the Mundlak equivalence holds in a particular spatial setting, which allows one to obtain the Fixed Effects coefficients of the time-varying covariates in a Random Effects context. Similarly, the HACSC estimator can be used in an RE estimation procedure whenever the researcher suspects that the structure imposed on the spatial error term might be misspecified.

I also presented a control function approach and the assumptions required to estimate the parameters of the model. Although even in the i.i.d. case it is standard practice to use the bootstrap to obtain the standard errors with this approach, in a spatial setting this is not a trivial procedure, given the dependence between observations. For this reason, I also extended the HACSC estimator to this setup, which requires an adjustment of the covariance matrix to take into account the sampling error of the first-stage estimation.

The Monte Carlo experiment showed that the HACSC estimator works well in the presence of strong or moderate serial and spatial correlation compared to other methods used in the literature, in terms of obtaining unbiased standard errors. As expected, the estimator also shows higher variance than such estimators, especially in settings with low spatial and/or serial correlation. The simulations also showed that if the CF assumptions hold, we can obtain efficiency gains compared to 2SLS.

An avenue for future research is to extend the Monte Carlo experiments in different directions. First, it would be interesting to use different weighting schemes for the weighting matrix $W$, based on distance or a $k$-nearest-neighbor scheme in an irregular lattice, as well as different kernel functions.
Analogous to the time series literature, the threshold for the distance bandwidth almost certainly plays an important role in the finite sample behavior of the estimator, so implementing a data-driven procedure to choose it is also a possibility to explore, particularly when the spatial correlation is strong.

CHAPTER 2

ESTIMATION OF MODELS WITH SPATIAL PANELS AND MISSING OBSERVATIONS IN THE COVARIATES

2.1 Introduction

In recent years, the amount and variety of data available for economic research has increased substantially. Many fields in economics have benefited from this, including areas that focus on spatially related issues such as development, trade, geography and urban economics. Unfortunately, a common issue that empirical researchers have to deal with is missing data, a problem that can arise in multiple ways and which often calls for different methods. One of these is the use of the "complete cases" only; in other words, observations where either the response variable or one of the covariates is missing are dropped from the analysis. The consequences of this will depend on the assumptions and the process that generates the missing data, but regardless of these, discarding observations results in a loss of information.

This problem is more serious in a spatial context, where it is common to include spillover effects from "neighboring" units (i.e. spatial lags) in the model. For example, if we are working with county-level data and the nature of the dependence between the units is a function of the geographical distance between them, the researcher might include the effects of surrounding counties as an additional explanatory variable using a weighting matrix $W$.
However, in this setup, if a unit $i$ is a "neighbor" of $l$ counties and $i$ has a missing data point, a researcher using the complete cases only might need to drop not only observation $i$ but all of its $l$ neighbors as well; therefore, the loss of information in the spatial case is potentially more severe. Furthermore, if we have a panel data set, the problem is aggravated because the missing data can affect both dimensions. This is in fact a common problem in empirical work, because the reason for the missing data could be that the units of observation (e.g. countries) have time series of different lengths (i.e. an unbalanced panel). Given this, a method to impute data would be useful for empirical work, so that the efficiency loss induced by the missing data is mitigated relative to using only the complete cases. This work tries to fill this gap by proposing a new GMM estimation procedure.

The problem of missing data has been a known issue in economic research for a long time. One of the approaches that empirical researchers use to deal with it is dropping the incomplete cases, which induces an efficiency loss, as mentioned previously. In this respect, Kelejian and Prucha (2010) present conditions in a spatial setting under which the missing data can be ignored asymptotically, based on the relative sample sizes of the complete and incomplete observations. They also describe the case where the missing data cannot be ignored and make inference more difficult. In practice, there are alternatives other than just using the complete cases: for example, one might try to complete the sample first and then estimate the model using this "completed" data set. One of the methods documented in the spatial literature was introduced by Lesage and Pace (2004), who used the Expectation-Maximization algorithm to predict the missing values of the dependent variable in the context of real estate housing prices.
In the spatial context, one could also generate the spatial lags using the available data only, in which case the researcher has two options. First, a common practice is to replace the missing data with zeros (Kelejian & Prucha, 2010); nevertheless, this technique does not seem sensible, as having missing data is a very different problem and replacing these data points with zeros will almost certainly lead to biased estimates. The second approach involves constructing the spatial lags using only the available "neighbors", but doing this could generate a misspecification of the weighting matrix and thus probably yield inconsistent estimates, as pointed out by Wang and Lee (2013). More concretely, if a unit $i$ has four "neighbors", each of which has the same weight, and the weighting matrix $W$ is row-normalized, then in theory each pseudo-neighbor should have a weight of $1/4$. However, if the data for one of the pseudo-neighbors is missing, then one would assign a weight of $1/3$ to each available unit, thus misspecifying $W$. In non-spatial settings, the literature has proposed multiple ways to deal with missing data. For instance, Dagenais (1973) proposed a generalized least squares estimator in which the missing variables are approximated using observed covariates. In a similar spirit and in the context of linear models, Gourieroux and Monfort (1981) present a maximum likelihood procedure in which the missing variables are explained by the observed ones. More recently, Dardanoni et al. (2011) suggest a framework with an augmented model to reduce the bias induced by replacing the missing observations with imputed values. Abrevaya and Donald (2017) introduced a GMM framework for linear models in which they exploit moment conditions on the missing observations on the regressors to obtain an estimator that they claim to be more efficient than the other estimators previously mentioned, such as Dagenais'.
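The row-renormalization problem described above (dropping a missing pseudo-neighbor and reweighting the survivors from $1/4$ to $1/3$) can be illustrated with a small numerical sketch; the five-unit configuration and covariate values below are hypothetical:

```python
import numpy as np

# Hypothetical 5-unit example: unit 0 has four equally weighted
# neighbors (units 1-4), so the row-normalized weights are all 1/4.
W_true = np.zeros((5, 5))
W_true[0, 1:] = 0.25

# If unit 4's data are missing and the spatial lag is rebuilt from the
# three available neighbors, each survivor is reweighted to 1/3.
available = np.array([True, True, True, True, False])
w_row = np.where(available[1:], 1.0, 0.0)       # raw weights of units 1-4
W_missp = np.zeros((5, 5))
W_missp[0, 1:] = w_row / w_row.sum()            # renormalize over survivors

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])         # covariate values
true_lag = W_true[0] @ x                        # (1+2+3+4)/4 = 2.5
missp_lag = W_missp[0] @ x                      # (1+2+3)/3 = 2.0
print(true_lag, missp_lag)
```

The rebuilt spatial lag systematically differs from the true one whenever the missing neighbor's covariate differs from the mean of the observed neighbors, which is the sense in which $W$ becomes misspecified.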
Rai (2021) considered the panel data case of their estimator and found that it is more efficient than the fixed effects and correlated random effects (using the Mundlak device) estimators that use only the complete cases. Rai (2023) extended this approach to the case of missing dependent variables and endogenous explanatory covariates, which is useful in cases where the researcher needs to combine data sets from different sources. Going back to a spatial context, Wang and Lee (2013) also suggest three estimation procedures in the context of a missing dependent variable only, in the cross-sectional case. They propose a GMM estimator based on linear moment conditions, a nonlinear least squares estimator and a two-stage least squares estimator with imputation, and compare the asymptotic properties of the three estimators. Wang and Lee (2013) extend the previous estimators to the case of spatial autoregressive panels, using a random effects framework as a baseline and then generalizing it by presenting the spatial Mundlak approach. Note that their work also focuses on the case of a missing dependent variable only. It is important to note that in the non-spatial case, the Mundlak approach falls within the correlated random effects context, a middle ground between the random effects (RE) and fixed effects (FE) estimators. In the first case, the researcher must assume that there is no correlation between the explanatory variables in the model and the individual heterogeneity, whereas in the second case this assumption is relaxed and these terms are allowed to be correlated. In this respect, Mundlak (1978) argues that the RE version is a misspecification of the FE model, as it does not take into account the correlation between the heterogeneity and the regressors. To solve this problem, he proposed an auxiliary equation where the heterogeneity is modeled as a function of the time averages of the independent variables.
By doing this, he shows that if we add these time averages to the main equation and estimate the model by RE, we obtain the same numerical coefficients as if we estimate the model by FE. This equivalence carries over to the unbalanced panel case if we only use the complete cases, as shown by Wooldridge (2019) and by Joshi and Wooldridge (2019) for the case with endogenous covariates. Debarsy (2012) was the first to extend the Mundlak approach to a spatial setting, for a researcher working with a Spatial Durbin Model1 (SDM). Nevertheless, this work does not show the aforementioned equivalence between the RE and FE specifications. Li and Yang (2020) demonstrated that when the error term is modeled structurally2, a very common practice in the spatial literature, the equivalence holds conditional on the value of the parameter(s) associated with the error term; otherwise the FE and RE estimators will generally yield different estimates. In addition, Wu-Chaves (2024) shows that when the error term is not modeled structurally, the equivalence holds if the model is estimated by ordinary least squares (OLS) or two-stage least squares (2SLS). One limitation of the work just described is that it focuses on the case where the data are complete. In this paper, I show that in the case of an unbalanced spatial panel, the CRE equivalence also holds if the researcher uses the complete cases only to estimate the model. In this chapter, I extend the work of Abrevaya and Donald (2017) and Rai (2021) to the case of spatial panels with spillover effects. The rest of the paper is organized as follows. Section 2.2 presents the model. Sections 2.3 and 2.4 state the assumptions and show the construction of the GMM estimator, respectively. Section 2.5 shows the equivalence between FE and RE. Section 2.6 provides Monte-Carlo evidence on the performance of the GMM estimator. Section 2.7 illustrates an empirical application of the estimator and Section 2.8 concludes.
1A SDM includes spatial lags of both the dependent variable and the independent variables on the right-hand side of the equation. 2Note that modeling the error term structurally usually involves MLE or GMM estimation.

2.2 Model

Consider the following model:

$y_{it} = x_{1it}\beta_1 + x_{2it}\beta_2 + W_i X_{1t}\gamma_1 + W_i X_{2t}\gamma_2 + c_i + u_{it} = x_{it}\beta + w_{it}\gamma + c_i + u_{it}, \quad i = 1, \ldots, N, \; t = 1, \ldots, T$ (2.1)

where $y_{it}$ is the response variable, $x_{1it}$ is a $1 \times (k_1 + 1)$ set of exogenous variables that includes an intercept, $x_{2it}$ is a $1 \times k_2$ vector of endogenous covariates (with $k_1 + k_2 = k$), $w_{it} = (w_{1it} \; w_{2it}) = (W_i X_{1t} \; W_i X_{2t})$ with $W_i$ being the $i$-th row of an exogenous, non-random, time-invariant $N \times N$ weighting matrix, $X_{1t}$ and $X_{2t}$ are the $N \times k_1$ and $N \times k_2$ matrices of exogenous and endogenous covariates, respectively, for all observations at time $t$, $c_i$ is the individual heterogeneity and $u_{it}$ is the idiosyncratic error term. In this type of model, the terms $W_i X_{1t}$ and $W_i X_{2t}$ are known as spatial lags and they capture the effect of neighboring units on unit $i$'s outcome3. $(\beta \; \gamma)$ are the parameters of interest and they are of dimension $(k+1) \times 1$ and $k \times 1$, respectively4. In this paper, I treat the error term nonparametrically, so that it might be serially and spatially correlated, but I do not impose any particular structure on it. The sense in which $x_{1it}$ is exogenous is that it is uncorrelated with the error term $u_{it}$ (i.e. $\mathrm{E}(x_{1it}'u_{it}) = 0$). Analogously, the endogeneity of $x_{2it}$ arises from the fact that $\mathrm{E}(x_{2it}'u_{it}) \neq 0$. I also assume that the asymptotics refer to the case where $N \to \infty$ while $T$ remains fixed. Since $x_{2it}$ is endogenous, we need a set of external instruments $z_{2it}$ of dimension $1 \times l$ ($l \geq k_2$) that satisfy the usual requirements of relevance and exogeneity with respect to the error term $u_{it}$, that is, $\mathrm{E}(z_{2it}'u_{it}) = 0$. Naturally, $\mathfrak{Z}_{2it} = W_i Z_{2t}$ can be used as the instrument for $W_i X_{2t}$. For ease of notation, let $a_{it} = (x_{1it} \; x_{2it} \; w_{1it} \; w_{2it})$ and $z_{it} = (x_{1it} \; z_{2it} \; w_{1it} \; \mathfrak{Z}_{2it})$.
Under these assumptions, the set of first stage equations is:

$x_{1it} = x_{1it}\pi_{11} + z_{2it}\pi_{12} + w_{1it}\pi_{13} + \mathfrak{Z}_{2it}\pi_{14} + r_{1it}$
$x_{2it} = x_{1it}\pi_{21} + z_{2it}\pi_{22} + w_{1it}\pi_{23} + \mathfrak{Z}_{2it}\pi_{24} + r_{2it}$
$w_{1it} = x_{1it}\pi_{31} + z_{2it}\pi_{32} + w_{1it}\pi_{33} + \mathfrak{Z}_{2it}\pi_{34} + r_{3it}$
$w_{2it} = x_{1it}\pi_{41} + z_{2it}\pi_{42} + w_{1it}\pi_{43} + \mathfrak{Z}_{2it}\pi_{44} + r_{4it}$ (2.2)

where $\pi_{j1}$, $\pi_{j2}$, $\pi_{j3}$ and $\pi_{j4}$ are vectors of dimensions $(k_1+1) \times (k_1+1)$, $l \times k_1$, $k_1 \times k_1$ and $l \times k_1$, respectively, for $j = 1, 2, 3, 4$. Of course, the relevant equations of (2.2) are the second and fourth lines, as $(x_{1it} \; w_{1it})$ act as their own instruments. Given this, (2.1) and (2.2) can be written more compactly as:

$y_{it} = a_{it}\theta + c_i + u_{it}$ (2.3)
$a_{it} = z_{it}\pi + r_{it}$ (2.4)

where $\theta = (\beta \; \gamma)$. By definition, $\mathrm{E}(z_{it}'r_{it}) = 0$, and because the instruments are relevant, it follows that $\pi^0 \neq 0$. Note that, other than the exogeneity with respect to $z_{it}$, no other assumptions have been imposed on the error terms in (2.3) and (2.4).

3It is common to also include a spatial lag of the outcome variable; however, by doing so the interpretation of the model as a conditional mean function is lost. For this reason, I omit this term in the paper. 4From a modeling perspective, it is not necessary to include all $k$ variables in the spatial lag, so the dimension of $\gamma$ could be smaller.
Furthermore, $y_{it}$ can be expressed in terms of $z_{it}$ as follows:

$y_{it} = z_{it}\pi\theta + c_i + u_{it} + r_{it}\theta = z_{it}\pi\theta + c_i + v_{it}$ (2.5)

The parameters of this model can be consistently estimated by applying Fixed Effects Two Stage Least Squares (FE2SLS) or, equivalently, by applying Pooled 2SLS (P2SLS) to

$\ddot{y}_{it} = \ddot{a}_{it}\theta + \ddot{u}_{it}$ (2.6)

using the instruments $\ddot{z}_{it}$, where $\ddot{y}_{it} = y_{it} - \bar{y}_i$, $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}$ and similar definitions apply to the other variables, provided that the corresponding rank conditions of the relevant matrices hold and

$\mathrm{E}(\ddot{z}_{it}'\ddot{u}_{it}) = 0$ (2.7)

The latter is implied by the following condition:

$\mathrm{E}(u_{it}|Z, C) = 0$ (2.8)

which is a strict exogeneity assumption, where $Z$ is the entire matrix of exogenous variables and $C$ is the whole vector of individual heterogeneities. Note that (2.8) is a stronger condition than the classical strict exogeneity assumption because in this case the idiosyncratic error term at time $t$ is not only uncorrelated with the exogenous variables at any time period, but is also uncorrelated with the covariates of other units, due to the nature of the spatial panel data set and in particular to the presence of the spatial lags. It is also important to emphasize that in this setup, the individual heterogeneity $c_i$ is allowed to be arbitrarily correlated with the elements of $z_{it}$ or the endogenous covariates.

2.3 Missing data mechanism

Before formalizing the missing data scheme, consider the consequences of missing observations in a model with spatial spillovers compared to a situation without such effects. As previously mentioned, a typical strategy when empirical researchers have missing data is to estimate the parameters using only the observations that have a full set of observed variables and to discard the units that are incomplete.
If we have a sample of 49 individuals living in a regular grid, as shown in Figure 2.1, there are no spillover effects, and the researcher has no data on unit 25, then the loss of information is relatively small (around 2% of the sample). On the other hand, if the model contains spillover effects from neighboring units and we are using a queen-type weighting scheme5, then if unit 25 is missing and the researcher decides to use only the complete observations, she would have to disregard unit 25's neighbors too (shown in gray in Figure 2.1) and end up losing almost 20% of the original sample. 5Under this weighting mechanism, a neighbor is a unit that shares an edge or a vertex.

Figure 2.1 Regular grid with unit 25 missing and its neighbors shown in gray.

The previous example shows that missing observations can result in a severe decrease in efficiency when estimating the parameters in a model with spillover effects. To formalize the missing mechanism, let

$s_{it} = 1$ if $x_{2it}$ is observed for unit $i$ at time $t$, and $s_{it} = 0$ otherwise (2.9)

and let $S_t$ be the $N \times N$ diagonal matrix with diagonal elements $s_{it}$. Note that $s_{it}$ indicates that the researcher observes either the full set of endogenous variables or none at all. Furthermore, I am also assuming that the response variable and the exogenous variables $z_{it}$ are always fully observed for all individuals in all time periods. A common practice in empirical work is to ignore the missing data from neighbors in the spatial lag, so implicitly the missing neighbors are being assigned a weight of 0 (Kelejian & Prucha, 2010). This being the case, $W S_t X_t$ would be enough to select units with available self-information but with possibly incomplete data on their spatial lag. However, to select only the complete cases in the spatial lag, a new variable needs to be defined.
To this end, for each $i$ and its $J$ neighbors, let

$\tilde{s}_{it} = s_{it} \cdot \prod_{j=1}^{J} s_{jt}$ (2.10)

so that $\tilde{s}_{it} = 1$ only when the full set of endogenous variables is observed for unit $i$ and its corresponding neighbors. Then define $\tilde{S}_t$ as the diagonal matrix with diagonal elements $\tilde{s}_{it}$, so that $W \tilde{S}_t X_t$ will select only the fully complete cases. As previously mentioned, the researcher can consistently estimate the parameters using FE2SLS with the complete cases if (2.8) holds, at the expense of losing efficiency. More concretely, the estimator can be defined as follows:

$\hat{\theta}_{CFE2SLS} = \left[ \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it} \right) \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it} \right)^{-1} \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{z}_{it}'\ddot{a}_{it} \right) \right]^{-1} \cdot \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it} \right) \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it} \right)^{-1} \left( \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{s}_{it}\ddot{z}_{it}'\ddot{y}_{it} \right)$ (2.11)

where $\ddot{y}_{it} = y_{it} - \bar{y}_i$, $\bar{y}_i = \frac{1}{T_i}\sum_{q=1}^{T} \tilde{s}_{iq} y_{iq}$ and $T_i = \sum_{q=1}^{T} \tilde{s}_{iq}$, and the rest of the variables are similarly defined. Note that $T_i$ is a random variable, as it is a function of the selection6. In words, for each unit $i$ the time average is computed using only the periods where the observation has a full set of observed variables. To develop an alternative estimator, consider again the within transformation of the variables similar to the one in (2.11), that is, where the averages are computed using only the complete cases. Furthermore, define:

$\dot{x}_{2it} = x_{2it} - \frac{1}{T - T_i}\sum_{t=1}^{T} (1 - \tilde{s}_{it})x_{2it}$ (2.12)

that is, $\dot{x}_{2it}$ is a within transformation where the average is computed using the incomplete cases only, and similar definitions apply to other variables. 6Note that I am implicitly assuming that $\Pr(T_i = 0) = 0$ for all $i$, so that $\hat{\theta}_{CFE2SLS}$ is well defined.
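The complete-case selection variable (2.10) and the grid example of Figure 2.1 can be reproduced with a short sketch; the 7 x 7 queen-contiguity grid below matches the figure, with unit 25 (index 24 in zero-based indexing) missing:

```python
import numpy as np

n_side = 7
N = n_side * n_side

def queen_neighbors(i, n=7):
    """Indices of queen-contiguity neighbors (shared edge or vertex)."""
    r, c = divmod(i, n)
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n:
                out.append(rr * n + cc)
    return out

# s_i = 1 if unit i's own data are observed; unit 25 (index 24) is missing.
s = np.ones(N, dtype=int)
s[24] = 0

# s_tilde_i = s_i * prod_j s_j over i's neighbors, as in (2.10).
s_tilde = np.array([s[i] * s[queen_neighbors(i)].prod() for i in range(N)])

print(N - s_tilde.sum())        # units lost: unit 25 plus its 8 neighbors
print((N - s_tilde.sum()) / N)  # roughly 0.18, almost 20% of the sample
```

Dropping a single interior unit thus removes nine of the 49 observations from the complete-case sample, which is the efficiency loss the text describes.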
Regarding the transformation in (2.12), it is important to point out that it is possible that only one time period for a particular unit $i$ is incomplete, in which case the within transformation (2.12) will remove that observation, as the "average" is taken over a single time period. In such cases, these units are uninformative and are essentially removed from the estimation, and therefore will not help to provide efficiency gains. Note that by applying the within transformation from (2.11) to the main model, the term $c_i$ disappears. The resulting estimating equations are

$\ddot{y}_{it} = \ddot{a}_{it}\theta + \ddot{u}_{it}$ (2.13)
$\ddot{a}_{it} = \ddot{z}_{it}\pi + \ddot{r}_{it}$ (2.14)

By replacing (2.14) in (2.13), we obtain an expression for $y_{it}$ in terms of the always observed variables:

$\ddot{y}_{it} = (\ddot{z}_{it}\pi + \ddot{r}_{it})\theta + \ddot{u}_{it} = \ddot{z}_{it}\pi\theta + \ddot{v}_{it}$ (2.15)

where $\ddot{v}_{it} = \ddot{u}_{it} + \ddot{r}_{it}\theta$. In order to obtain efficiency gains and still get a consistent estimator using fixed effects, consider the following assumption:

Assumption 1
i) $\mathrm{E}(\tilde{s}_{it}\ddot{z}_{it}'\ddot{u}_{it}) = 0$
ii) $\mathrm{E}(\tilde{s}_{it}\ddot{z}_{it}'\ddot{r}_{it}) = 0$
iii) $\mathrm{E}[(1 - \tilde{s}_{it})\dot{z}_{it}'\dot{v}_{it}] = 0$

Part i) of Assumption 1 imposes that $\theta$ is the same in both the complete and incomplete cases, and it is also necessary for the complete cases estimator. The second and third parts of Assumption 1 are the basis of the potential efficiency gains that can be achieved with the proposed estimator. Specifically, ii) states that $\pi$ is the same in both the observed and unobserved samples, whereas iii) (along with i) amounts to saying that the model and the imputation method are the same for the complete and incomplete observations.
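The two demeaning schemes, and the degenerate single-incomplete-period case noted above, can be checked on a toy unit; the values below are hypothetical:

```python
import numpy as np

T = 5
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x_{2it} for one unit, t = 1..5
s_tilde = np.array([1, 1, 1, 1, 0])        # only period 5 is incomplete

Ti = s_tilde.sum()                          # number of complete periods

# Complete-case demeaning, as in (2.11): average over periods with s_tilde = 1.
x2_ddot = x2 - (s_tilde * x2).sum() / Ti

# Incomplete-case demeaning, as in (2.12): average over periods with s_tilde = 0.
x2_dot = x2 - ((1 - s_tilde) * x2).sum() / (T - Ti)

print(x2_dot[4])   # the single incomplete period is wiped out (value 0)
```

With exactly one incomplete period, the "incomplete-case average" equals that period's own value, so the transformed observation is identically zero and the unit contributes nothing to the incomplete-case moments.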
Similarly to the case without missing units, the conditions in Assumption 1 are implied by the following zero conditional mean assumptions:

Assumption 2 i) $\mathrm{E}(u_{it}|Z, S, C) = 0$ and ii) $\mathrm{E}(r_{it}|Z, S, C) = 0$

These are strict exogeneity conditions analogous to the non-spatial case. Note that these are weaker than a missing at random (MAR) mechanism, in which the missingness is allowed to depend on the always observed data (Little & Rubin, 2019). More formally, and borrowing notation from Rai (2023), in this context the data would be considered MAR if $s_{it} \perp (y_{it}, x_{1it}, w_{1it})|Z$ or, equivalently, $s_{it} \perp (u_{it}, r_{1it}, r_{3it})|Z$, where $r_{1it}$ and $r_{3it}$ are the errors related to $x_{1it}$ and $w_{1it}$, respectively, in the first stage. A sense in which Assumption 2 is weaker than MAR is that under the former, the condition would still hold if the selection is a function of $Z$, provided that $\mathrm{E}(r_{it}|Z) = 0$. Both of these assumptions are weaker than the missing completely at random (MCAR) mechanism, where the probability of missingness is independent of the rest of the variables, i.e. $s_{it} \perp (y_{it}, x_{1it}, w_{1it}, z_{it})$.

2.4 GMM estimation

Using equations (2.13), (2.14), (2.15) and Assumption 1, we can create a vector of moment conditions to perform GMM estimation. Let7

$g_{it}(\theta, \pi) = \begin{pmatrix} \tilde{s}_{it}\ddot{z}_{it}'\ddot{u}_{it} \\ \tilde{s}_{it}\ddot{z}_{it}' \otimes \ddot{r}_{it}' \\ (1 - \tilde{s}_{it})\dot{z}_{it}'\dot{v}_{it} \end{pmatrix} = \begin{pmatrix} \tilde{s}_{it}\ddot{z}_{it}'(\ddot{y}_{it} - \ddot{a}_{it}\theta) \\ \tilde{s}_{it}\ddot{z}_{it}' \otimes (\ddot{a}_{it} - \ddot{z}_{it}\pi)' \\ (1 - \tilde{s}_{it})\dot{z}_{it}'(\dot{y}_{it} - \dot{z}_{it}\pi\theta) \end{pmatrix} = \begin{pmatrix} g_{1it}(\theta, \pi) \\ g_{2it}(\theta, \pi) \\ g_{3it}(\theta, \pi) \end{pmatrix}$ (2.16)

Since I am assuming that Assumption 2 holds, it follows that $\mathrm{E}[g(\theta^0, \pi^0)] = 0$, where $(\theta^0, \pi^0)$ is the vector of true population parameters. Note that $g_{1it}(\cdot)$ and $g_{2it}(\cdot)$ use the complete cases, while the $g_{3it}(\cdot)$ moment condition uses the incomplete cases.
Furthermore, $g_{it}(\cdot)$ provides $[2(k_2 + l) + 1][2k + 3]$ moment conditions, while there are $2(2k + 1)(k_2 + l + 1)$ parameters to estimate, which leaves $2(2l + k_2 - k_1) + 1$ overidentifying restrictions. Once again, it is important to note that the potential efficiency gains from the proposed estimator come from imposing that $\pi$ is the same among the observed and unobserved units and that the model and imputation method are the same among those same groups. In order to obtain an efficient GMM estimator, we need to construct an optimal weighting matrix. Let:

$V \equiv \mathrm{E}[g(\theta, \pi)g(\theta, \pi)'] = \mathrm{E}\begin{pmatrix} V_{11} & V_{12} & 0 \\ V_{12}' & V_{22} & 0 \\ 0 & 0 & V_{33} \end{pmatrix}$ (2.17)

where

$V_{11} = \tilde{s}_{it}\ddot{u}_{it}^2\ddot{z}_{it}'\ddot{z}_{it}$
$V_{12} = \tilde{s}_{it}\ddot{u}_{it}\ddot{z}_{it}'(\ddot{z}_{it} \otimes \ddot{r}_{it})$
$V_{22} = \tilde{s}_{it}(\ddot{z}_{it}' \otimes \ddot{r}_{it}')(\ddot{z}_{it} \otimes \ddot{r}_{it})$
$V_{33} = (1 - \tilde{s}_{it})\dot{v}_{it}^2\dot{z}_{it}'\dot{z}_{it}$ (2.18)

In this setup, the sample GMM objective function is:

$\bar{g}(\theta, \pi)'\hat{\Omega}\bar{g}(\theta, \pi)$ (2.19)

where $\bar{g}(\theta, \pi) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} g_{it}(\theta, \pi)$, $\Omega$ is a square, non-random, symmetric and positive semi-definite matrix of order $[2(k_2 + l) + 1][2k + 3]$, and $\hat{\Omega}$ is a consistent estimator of $\Omega$. To obtain $\hat{\Omega}$, we can replace the expectations with sample averages in (2.17) and (2.18), and we can get consistent estimates of $\ddot{u}_{it}$, $\ddot{r}_{it}$ and $\dot{v}_{it}$ by applying GMM to $g_{1it}(\cdot)$ only, $g_{2it}(\cdot)$ only and $g_{3it}(\cdot)$ only, respectively. It is worth pointing out that allowing $\pi$ to differ across $g_{2it}(\cdot)$ and $g_{3it}(\cdot)$ makes these moment functions redundant in the estimation of $\theta$, as pointed out by Ahn and Schmidt (1995) and Rai (2023). The proposed estimator in this paper minimizes (2.19) with respect to $(\theta, \pi)$ using $\hat{\Omega} = \hat{V}^{-1}$ and will be denoted $(\hat{\theta}, \hat{\pi})$. 7Formally, $g_{it}(\cdot)$ is also a function of $(\ddot{y}, \ddot{a}, \ddot{z}, \tilde{S})$, but for notational simplicity I suppress these arguments.
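The two-step logic behind (2.19), a preliminary estimate to obtain residuals, then re-estimation with the optimal weight $\hat{V}^{-1}$, can be illustrated in a much simpler setting than the paper's spatial panel; the sketch below is a generic cross-sectional overidentified linear IV model with hypothetical simulated data, not the proposed estimator itself:

```python
import numpy as np

# Illustrative simulated model: one endogenous regressor x, two instruments z.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=(n, 2))
e = rng.normal(size=n)
x = z @ np.array([1.0, 0.5]) + e
u = 0.6 * e + rng.normal(size=n)          # x is endogenous: corr(x, u) != 0
y = 2.0 * x + u
X, Z = x[:, None], z

# Step 1: preliminary GMM with weight (Z'Z/n)^{-1} (i.e. 2SLS) for residuals.
W1 = np.linalg.inv(Z.T @ Z / n)
theta1 = np.linalg.solve(X.T @ Z @ W1 @ Z.T @ X, X.T @ Z @ W1 @ Z.T @ y)
res = y - X @ theta1

# Step 2: optimal weight V_hat^{-1}, with V_hat = (1/n) sum res_i^2 z_i' z_i.
V_hat = (Z * (res**2)[:, None]).T @ Z / n
W2 = np.linalg.inv(V_hat)
theta2 = np.linalg.solve(X.T @ Z @ W2 @ Z.T @ X, X.T @ Z @ W2 @ Z.T @ y)
print(theta2)   # close to the true coefficient 2.0
```

In the paper's estimator, the same recipe is applied to the stacked moments (2.16), with the block structure of (2.17) replacing the single heteroskedasticity-robust block used here.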
Before stating the asymptotic normality result, we need to present the other component of the covariance matrix of the GMM estimator. First, let

$D \equiv \mathrm{E}[\nabla g(\theta^0, \pi^0)] = \mathrm{E}\begin{pmatrix} D_{11} & 0 \\ 0 & D_{22} \\ D_{31} & D_{32} \end{pmatrix}$ (2.20)

where $\nabla g(\theta^0, \pi^0)$ denotes the matrix of derivatives of $g(\theta, \pi)$ with respect to $[\theta', \mathrm{vec}(\pi)']'$ evaluated at the true population parameters, and where

$D_{11} = -\tilde{s}_{it}\ddot{z}_{it}'\ddot{a}_{it}$
$D_{22} = -\tilde{s}_{it}(\ddot{z}_{it}'\ddot{z}_{it} \otimes e_1, \ldots, \ddot{z}_{it}'\ddot{z}_{it} \otimes e_{2k+1})$
$D_{31} = -(1 - \tilde{s}_{it})\dot{z}_{it}'\dot{z}_{it}\pi^0$
$D_{32} = -(1 - \tilde{s}_{it})\,\theta^{0\prime} \otimes \dot{z}_{it}'\dot{z}_{it}$ (2.21)

In this case, $e_j$ denotes a row vector of zeros of dimension $2k + 1$ with the $j$-th element equal to one. In order to obtain identification, I assume that $\mathrm{rank}(D) = 2(2k + 1)(k_2 + l + 1)$; since $D$ is of dimension $[2(k_2 + l) + 1][2k + 3] \times 2(2k + 1)(k_2 + l + 1)$, this means it has full column rank. Now, given the spatial panel structure considered in this paper, we need to impose some regularity conditions and assumptions on the variables of the model. In particular, I will assume that the conditions specified in Jenish and Prucha (2009) for non-stationary random fields are satisfied; however, in this paper I focus only on those that are most relevant for the empirical researcher.

Assumption 3 The lattice where the units are located is countably infinite and there is a distance measure $\rho(\cdot, \cdot)$ and a distance $\rho_0 > 0$ available to the researcher such that $\rho(i, j) \geq \rho_0$ for any pair of observations $i$ and $j$.

Assumption 4 The random field is $\alpha$-mixing, satisfying the properties outlined by Jenish and Prucha (2009). In practical terms, this means that the degree of dependence between the observations decays as the distance between them increases8.
From a modeling perspective, this implies that the weights specified in $W$, the weighting matrix capturing the spillover effects, for any two observations $i$ and $j$ have to decrease as $\rho(i, j) \to \infty$. Note that this assumption also applies to the selection random variables $s_{it}$, so that the missingness of one unit will not affect the availability of observations that are at a large distance from it. The following assumption is related to the error terms.

Assumption 5 At each time period $t$, the $N \times 1$ vectors of errors are generated as:

$u_t = F_t\varepsilon_t, \quad r_t = M_t\eta_t$ (2.22)

where $\varepsilon_t$ and $\eta_t$ are $N \times 1$ vectors of i.i.d. random variables with mean 0 and unit variance, independent of each other, with $\mathrm{E}(|\varepsilon|^q) < \infty$ and $\mathrm{E}(|\eta|^q) < \infty$ for some $q \geq 4$, and $F_t$ and $M_t$ are $N \times N$ non-singular unknown matrices whose row and column sums are uniformly bounded. This assumption allows for many structures of spatial correlation between the error terms without imposing any restrictions on the time dimension. Assumption 6 states that all the relevant matrices are well behaved.

Assumption 6 The matrix of exogenous variables, $\ddot{z}$, has full column rank and its elements are uniformly bounded in absolute value by a finite constant $0 < c_Z < \infty$. For a fixed and finite $T$, the matrices:

1. $\lim_{N \to \infty} (NT)^{-1}\ddot{z}'\ddot{z} = Q_{zz}$
2. $\lim_{N \to \infty} (NT)^{-1}\ddot{z}'RR'\ddot{z} = Q_{zRRz}$
3. $\mathrm{plim}_{N \to \infty} (NT)^{-1}\ddot{z}'\ddot{z} = Q_{zz}$

are finite and non-singular9. Furthermore, the matrix $\mathrm{plim}_{N \to \infty} (NT)^{-1}\ddot{z}'\ddot{a} = Q_{za}$ has full column rank $2k + 1$. Similarly, the diagonal elements of $W$ are zero and all of its elements are uniformly bounded by a finite constant $0 < c_W < \infty$.

8Recall that we are working with $N \to \infty$ and fixed $T$ asymptotics, so there is no need to impose a weak dependence restriction on the time dimension.

Having stated these conditions, the asymptotic normality result is summarized in the following proposition.

Proposition 1.
Under Assumptions 2-6,

$\sqrt{NT}\left[ \left(\hat{\theta}', \mathrm{vec}(\hat{\pi})'\right)' - \left(\theta^{0\prime}, \mathrm{vec}(\pi^0)'\right)' \right] \xrightarrow{d} N\left[ 0, \left(D'V^{-1}D\right)^{-1} \right]$

Furthermore,

$NT\,\bar{g}(\hat{\theta}, \hat{\pi})'\hat{V}^{-1}\bar{g}(\hat{\theta}, \hat{\pi}) \xrightarrow{d} \chi^2_{2(2l + k_2 - k_1) + 1}$

This result follows directly from the Uniform Law of Large Numbers and the Central Limit Theorem derived by Jenish and Prucha (2009), and therefore I omit the proof. This chi-square statistic is useful to determine whether the overidentifying restrictions (i.e. the moment conditions in Assumption 1 evaluated at the true population parameters) hold. More specifically, this test can help to determine whether the mechanism that generated the missing observations is responsible for a violation of Assumption 1; however, it might not be useful in determining whether the model is misspecified (Rai, 2023). 9Formally, these conditions should also hold for the variables with incomplete observations.

2.5 Correlated Random Effects

2.5.1 The Mundlak Device

When working with panel data, researchers usually have to decide between two main estimators, Random Effects (RE) and Fixed Effects (FE). The former provides efficiency gains over the latter, while the latter is more robust to violations of one of the main assumptions of the RE estimator, namely that the exogenous variables are uncorrelated with the individual heterogeneity, since the FE approach leaves this relationship unrestricted. Mundlak (1978) proposed a middle ground between these by restricting the relationship to a particular functional form, which falls under the Correlated Random Effects (CRE) approach. Consider the following standard linear model without spatial effects:

$y_{it} = x_{it}\beta + c_i + u_{it}$ (2.23)

Mundlak's approach is to model the individual effects $c_i$ as a linear function of the time averages of the covariates:

$c_i = \bar{x}_i\delta + h_i$ (2.24)

where $h_i$ is uncorrelated with $\bar{x}_i$.
By replacing (2.24) in (2.23) we obtain:

$y_{it} = x_{it}\beta + \bar{x}_i\delta + h_i + u_{it} = x_{it}\beta + \bar{x}_i\delta + r_{it}$ (2.25)

It turns out that if (2.25) is estimated either by POLS or RE, the estimate of $\beta$ will be numerically the same as if the FE estimator is used in (2.23), a result attributed to Mundlak (1978). This equivalence has been extended to other contexts: Wooldridge (2019) proved it for the case of unbalanced panels, and Joshi and Wooldridge (2019) showed it for models with unbalanced panels and endogenous variables. In the spatial context, Debarsy (2012) was the first to introduce the Mundlak device, while Li and Yang (2020) discuss some conditions under which the equivalence holds. Wang and Lee (2013) discuss how to implement the Mundlak device in spatial panels with missing data on the dependent variable; however, they do not show the equivalence. In this paper, I show that the equivalence holds for models with missing observations on the endogenous covariates in a spatial panel. To this end, consider again the model (2.1):

$y_{it} = x_{1it}\beta_1 + x_{2it}\beta_2 + W_i X_{1t}\gamma_1 + W_i X_{2t}\gamma_2 + c_i + u_{it} = a_{it}\theta + c_i + u_{it}, \quad i = 1, \ldots, N, \; t = 1, \ldots, T$ (2.1)

where the same definitions and conditions described earlier still apply, including the availability of a set of instruments $(z_{2it} \; \mathfrak{Z}_{2it})$. Consider also the same selection variables and, in particular, the complete cases selection variable $\tilde{s}_{it}$ as defined in (2.10). In this context, the Mundlak approach involves modeling the heterogeneity as a function of the time averages of all the exogenous variables $z_{it} = (x_{1it} \; z_{2it} \; w_{1it} \; \mathfrak{Z}_{2it})$ in (2.1):

$c_i = \bar{z}_i\delta + \eta_i$ (2.26)

and multiplying the equation by the complete cases selection variable to obtain:

$\tilde{s}_{it}y_{it} = \tilde{s}_{it}a_{it}\theta + \tilde{s}_{it}\bar{z}_i\delta + \tilde{s}_{it}\tilde{u}_{it}$ (2.27)

where $\tilde{u}_{it} = u_{it} + \eta_i$. Then we can recover the FE estimates of $\theta$ by applying Pooled 2SLS to (2.27) using the instruments $\tilde{s}_{it}(z_{2it} \; \mathfrak{Z}_{2it})$. This result is summarized in the following proposition:

Proposition 2.
Suppose $\tilde{\theta}$ is the estimate of $\theta$ obtained by applying Pooled 2SLS to equation (2.27). Then $\tilde{\theta} = \hat{\theta}_{CFE2SLS}$, the estimator defined in (2.11).

The proof of Proposition 2 can be found in the Appendix. One of the advantages of the Mundlak device over the FE estimator is that it makes it possible to estimate the effects of variables that do not vary over time. Note, however, that as with the FE estimator, this approach uses only the complete cases, so the researcher can obtain efficiency gains with a GMM estimator that uses the information contained in the incomplete cases. The following subsection describes this procedure.

2.5.2 A GMM approach to CRE with missing data

Instead of applying the within transformation to recover the FE estimates of $\theta$, in this section we construct moment conditions using the Mundlak approach, for which I will use equations (2.3), (2.4) and (2.5). As a first step, we model $c_i$ in (2.3) as:

$c_i = \bar{x}_{1i}\tilde{\theta}_1 + \bar{z}_{2i}\tilde{\theta}_2 + \bar{w}_{1i}\tilde{\theta}_3 + \bar{\mathfrak{Z}}_{2i}\tilde{\theta}_4 + \eta_i = \bar{z}_i\tilde{\theta} + \eta_i$ (2.28)

where the bar over the variables denotes the time average taken only over the observations where $\tilde{s}_{it} = 1$. Here we impose the following condition:

Assumption 7 $\mathrm{E}(\eta_i|Z, S) = 0$

Plugging (2.28) into (2.3) yields:

$y_{it} = a_{it}\theta + \bar{z}_i\tilde{\theta} + \tilde{u}_{it}$ (2.29)

where $\tilde{u}_{it} = u_{it} + \eta_i$. Since the main model has been augmented with this additional set of variables, the first stage equations need to be adjusted to include these exogenous variables. In particular, letting $\acute{z}_{it} = (z_{it} \; \bar{z}_i)$, we now have:

$a_{it} = z_{it}\tilde{\pi}_1^0 + \bar{z}_i\tilde{\pi}_2^0 + \tilde{r}_{it} = \acute{z}_{it}\tilde{\pi}^0 + \tilde{r}_{it}$ (2.30)

where $\mathrm{E}(\acute{z}_{it}'\tilde{r}_{it}) = 0$ holds by definition. Finally, we replace (2.30) in (2.29) to obtain a reduced form for $y_{it}$ in terms of the always observed variables $\acute{z}_{it}$:

$y_{it} = z_{it}\tilde{\pi}_1^0\theta + \bar{z}_i(\tilde{\pi}_2^0\theta + \tilde{\theta}) + \tilde{v}_{it} = z_{it}\mu_1 + \bar{z}_i\mu_2 + \tilde{v}_{it} = \acute{z}_{it}\mu + \tilde{v}_{it}$ (2.31)

where $\tilde{v}_{it} = \tilde{u}_{it} + \tilde{r}_{it}\theta$. Note that, as a consequence of Assumption 7 and $\mathrm{E}(\acute{z}_{it}'\tilde{r}_{it}) = 0$, we have $\mathrm{E}(\acute{z}_{it}'\tilde{v}_{it}) = 0$.
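As a quick numerical illustration of the FE-CRE (Mundlak) equivalence underlying this approach, consider its simplest version: a balanced, fully observed, non-spatial panel with an exogenous regressor, where pooled OLS with the time average added reproduces the within (FE) estimate exactly. The simulated design is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 200, 5
c = rng.normal(size=N)                       # individual heterogeneity
x = rng.normal(size=(N, T)) + c[:, None]     # regressor correlated with c
y = 1.0 + 0.5 * x + c[:, None] + rng.normal(size=(N, T))

# Fixed effects: OLS on within-demeaned data.
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (xd @ yd) / (xd @ xd)

# Mundlak: pooled OLS of y on [1, x, x_bar]; the coefficient on x is beta_fe.
xbar = np.repeat(x.mean(axis=1), T)
Xm = np.column_stack([np.ones(N * T), x.ravel(), xbar])
beta_m = np.linalg.lstsq(Xm, y.ravel(), rcond=None)[0]

print(beta_fe, beta_m[1])   # numerically identical
```

The two estimates agree to machine precision; Proposition 2 extends this kind of equivalence to the spatial, endogenous, complete-cases setting via Pooled 2SLS.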
If we let $\check{\theta} = (\theta' \; \tilde{\theta}')'$, from here we can construct the vector of moment conditions as follows:

$\tilde{g}_{it}(\check{\theta}, \tilde{\pi}) = \begin{pmatrix} \tilde{s}_{it}\acute{z}_{it}'\tilde{u}_{it} \\ \tilde{s}_{it}\acute{z}_{it}' \otimes \tilde{r}_{it}' \\ (1 - \tilde{s}_{it})\acute{z}_{it}'\tilde{v}_{it} \end{pmatrix} = \begin{pmatrix} \tilde{s}_{it}\acute{z}_{it}'(y_{it} - a_{it}\theta - \bar{z}_i\tilde{\theta}) \\ \tilde{s}_{it}\acute{z}_{it}' \otimes (a_{it} - \acute{z}_{it}\tilde{\pi})' \\ (1 - \tilde{s}_{it})\acute{z}_{it}'(y_{it} - \acute{z}_{it}\mu) \end{pmatrix} = \begin{pmatrix} g_{1it}(\check{\theta}, \tilde{\pi}) \\ g_{2it}(\check{\theta}, \tilde{\pi}) \\ g_{3it}(\check{\theta}, \tilde{\pi}) \end{pmatrix}$ (2.32)

From this point, the estimation proceeds as in the previous section, but now we have

$\tilde{V} \equiv \mathrm{E}[\tilde{g}(\check{\theta}, \tilde{\pi})\tilde{g}(\check{\theta}, \tilde{\pi})']$ and $\tilde{D} \equiv \mathrm{E}[\nabla\tilde{g}(\check{\theta}^0, \tilde{\pi}^0)]$ (2.33)

Once again, the efficiency gains from this estimator come from the second and third moment conditions in (2.32), obtained by imposing the same coefficients on both the complete and incomplete sub-populations.

2.6 Simulations

2.6.1 Data generating process

To analyze the performance of the GMM estimator proposed in this paper, I ran a Monte-Carlo study in which I compared it to the complete cases (CC) estimator, the dummy variable method (DVM) and the estimator that uses the data set without any missing observations. Note that although the DVM has been shown to deliver biased results (Jones, 1996), in some simulation studies its performance has been somewhat acceptable, as in Rai (2023). To this end, the benchmark data generating process is as follows:

$y_{it} = \beta_0 + x_{1it}\beta_1 + W_i X_{1t}\beta_2 + x_{2it}\beta_3 + W_i X_{2t}\beta_4 + c_i + u_{it}$ (2.34)

where $(x_{1it} \; x_{2it})$ are scalars; the latter is potentially missing and is endogenous, so that it is correlated with the idiosyncratic error term $u_{it}$. I also generate a variable $z_{2it}$ that serves as an instrument for $x_{2it}$. Naturally, $W_i Z_{2t}$ serves as an instrument for $W_i X_{2t}$. The individual heterogeneity is correlated with $(x_{1it} \; z_{2it})$. The observations live in a regular square grid and the weighting matrix that captures the spillover effects follows a rook-type scheme.
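A minimal Python sketch of a design in the spirit of (2.34) and the rook-contiguity scheme just described is given below; the specific forms of the heterogeneity, the endogeneity of $x_2$ and the selection mechanism are illustrative assumptions, not necessarily the exact ones used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(42)
n_side, T = 30, 5
N = n_side * n_side                 # 900 units on a regular square grid

# Rook-contiguity weighting matrix (shared edges only), row-normalized.
W = np.zeros((N, N))
for i in range(N):
    r, c_ = divmod(i, n_side)
    for rr, cc in ((r - 1, c_), (r + 1, c_), (r, c_ - 1), (r, c_ + 1)):
        if 0 <= rr < n_side and 0 <= cc < n_side:
            W[i, rr * n_side + cc] = 1.0
W /= W.sum(axis=1, keepdims=True)

b0, b1, b2, b3, b4 = 2.0, 1.5, 0.7, 1.2, 0.4
x1 = rng.normal(size=(N, T))
z2 = rng.normal(size=(N, T))        # instrument for x2
u = rng.normal(size=(N, T))
# Heterogeneity correlated with (x1, z2); the 0.3 loading is an assumption.
ci = rng.normal(size=(N, 1)) + 0.3 * (x1.mean(1, keepdims=True)
                                      + z2.mean(1, keepdims=True))
# x2 endogenous: it loads on u, so corr(x2, u) != 0 (illustrative design).
x2 = 0.8 * z2 + 0.5 * u + rng.normal(size=(N, T))
y = b0 + x1 * b1 + (W @ x1) * b2 + x2 * b3 + (W @ x2) * b4 + ci + u

# MCAR selection on the unit's own x2, with p = 0.85 as in the first design.
s = rng.binomial(1, 0.85, size=(N, T))
print(s.mean())                     # share of units with own data observed
```

Row-normalizing $W$ makes each spatial lag an average over the 2-4 rook neighbors, and the complete-case indicator of (2.10) can then be built from `s` and `W` exactly as in Section 2.3.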
The variables $x_{1it}$, $z_{2it}$ and $u_{it}$ follow a standard normal distribution and are independent of each other. The population parameter values are $\beta_0 = 2$, $\beta_1 = 1.5$, $\beta_2 = 0.7$, $\beta_3 = 1.2$ and $\beta_4 = 0.4$. The sample size was $N = 900$, $T = 5$, and the number of Monte-Carlo repetitions was 1000 for each scenario described below. To incorporate the missing data, I used three different mechanisms. In the first one, the data are missing completely at random (MCAR), for which the selection variable follows a binomial distribution with parameter $p = 0.85$. Under this scheme, the average proportion of observations across the 1000 simulated data sets with a complete "own" set of data was 85%, as expected. However, the percentage of units with a complete information set (i.e. both the "own" and the neighbors' information is non-missing) dropped to 53%. In the second design, the data are missing at random (MAR), so that the selection variable is allowed to depend on the always observed variables. In this instance, I allowed the missingness to depend on $x_{1it}$ and designed it so that around 85% of the observations had their own $x_{2it}$ available. In this case, the average proportion of complete cases across the 1000 repetitions was around 51%. As an extension of the first design, the data are again MCAR but the error term follows a spatial autoregressive process of order one (SAR) with parameter $\rho = 0.4$; I repeated this design with a smaller sample size, $N = 400$ and $T = 5$, for a total of 2000 observations. Finally, in the third experiment I allow the data to be MAR again, but in this case the missingness also depends on the individual heterogeneity.

2.6.2 Results

The simulations showed that the proposed GMM behaves well in finite samples and consistently across the different designs.
For example, Table 2.1 shows the average bias, standard deviation and root mean squared error for 𝛽3 and 𝛽4, the coefficients associated with the endogenous and potentially missing variables 𝑥2𝑖𝑡 and 𝑊𝑖 𝑋2𝑡, for the case when the data is MCAR across the 1000 repetitions. The proposed GMM has an average bias just as small as the estimator that uses the complete data, showing that it is indeed a consistent estimator.

Table 2.1 Average bias, standard deviation and root mean squared error for 𝛽3 and 𝛽4 across the 1000 repetitions when the data is MCAR.

                            𝛽3                              𝛽4
                  Bias      S.D.     RMSE       Bias      S.D.     RMSE
Whole data        0.0004    0.0247   0.0247     0.0010    0.0480   0.0480
Complete cases   -0.0016    0.0364   0.0364     0.0013    0.0713   0.0713
Proposed GMM      0.0004    0.0308   0.0308     0.0008    0.0604   0.0603
Dummy variable    0.9800    0.1304   0.9886     0.3211    0.1865   0.3713

More importantly, the standard deviation of the estimated coefficients for the proposed GMM is smaller than that of the estimator that uses the complete cases only, showing that it provides some efficiency gains relative to it. As expected, it is not as efficient as the estimator that uses the whole data set, since that one uses the full set of available information, whereas the proposed estimator might lose some information (e.g., cases where there is only one incomplete time period and therefore the unit becomes uninformative). This is illustrated in Figure 2.2, where the estimator that uses the whole data set has tighter distributions around the true population values, followed by the proposed GMM estimator and, finally, the complete cases estimator, which has more dispersed distributions. The simulations also show that the DVM estimator is inconsistent, as the average bias for the parameters associated with the endogenous variables is substantial, although this appears to be limited to these covariates, as the 𝛽1 and 𝛽2 coefficients seem to be well behaved. The simulations show a very small loss in efficiency when the data is MAR compared to the MCAR case.
Similarly, this loss is also small when the data is MCAR but the error term follows a SAR(1) process, as in the latter case the standard deviations are slightly larger relative to the first two scenarios. Nevertheless, the proposed GMM estimator again shows itself to be more efficient than the complete cases estimator. Of course, if the researcher is confident that the error term follows a SAR(1), she might be able to exploit efficiency gains using alternative estimators that use this information, such as maximum likelihood, at the risk of misspecifying the structure of the data generating process. As expected, when the sample size is smaller the distributions of the coefficients show greater dispersion, but the proposed GMM estimator continues to be well behaved under this scenario, with a small bias and a standard deviation that is smaller than that of the complete cases estimator. Finally, when the missingness is also allowed to depend on the individual heterogeneity, there are no substantial differences in the results: the proposed GMM seems to be consistent and its root mean squared error lies between those of the estimator that uses the whole data set and the complete cases one. This result is somewhat expected, as the within transformation removes the individual heterogeneity from the estimating equations.

Figure 2.2 Distribution of estimated coefficients across the 1,000 Monte-Carlo repetitions when the data is MCAR.

2.7 Empirical Application

In this section, I revisit the problem of analyzing the impact of different variables on crime in the state of North Carolina at the county level between 1981 and 1987. This problem was studied by Cornwell and Trumbull (1994) and by B. Baltagi (2006), who modeled the crime rate as a function of a set of covariates that included deterrent variables and returns to legal opportunities. However, as pointed out by B.
Baltagi (2006), most of the fixed effects estimates presented by Cornwell and Trumbull (1994) turned out to be statistically insignificant; therefore, in this paper I present a simplified version of the model that focuses on the deterrent variables. The original data set used in their estimation contained 90 counties (North Carolina has a total of 100 counties, but their data only contained information for 90) and seven time periods, for a total of 630 observations. Note that their data has no missing observations; therefore, for the purpose of this illustration, the missing variables will be generated artificially so that around 5% of the observations have one of their variables missing. To this end, consider the following model:

crime𝑖𝑡 = 𝛽0 + 𝛽1arrest𝑖𝑡 + 𝛽2conviction𝑖𝑡 + 𝛽3prison𝑖𝑡 + 𝛽4police𝑖𝑡 + 𝛽5avgsent𝑖𝑡 + 𝛽6dens𝑖𝑡 + 𝑐𝑖 + 𝑢𝑖𝑡    (2.35)

where crime is the crime rate (crimes committed per person), arrest is the “probability” of arrest (number of arrests per crime), conviction is the proportion of convictions to arrests, prison is the ratio of sentences that result in jail time to the total number of convictions, police is police per capita, avgsent is the average sentence in days and dens is the number of people living in the county per square mile. Cornwell and Trumbull (1994) argued that both the arrest and police variables are endogenous, for which they proposed two external instrumental variables (IV): tax revenue per capita is the IV for the police covariate, as we would expect higher tax revenues to be associated with larger police forces.
On the other hand, the ratio of crimes that involve face-to-face contact to those that do not (denoted by mix) is the IV for arrests, the rationale being that when a crime is committed in person, identification of the perpetrator is facilitated. Columns 1 and 2 of Table 2.2 show the results of estimating model (2.35) on the complete data set by FE2SLS and by the proposed GMM (PGMM) estimator, respectively. (All the specifications include time dummies. Admittedly, the results from the missing data case depend on which observations are missing; therefore I estimated the model 200 times and at each iteration a different set of observations was missing. The table shows the average estimated coefficients across the 200 repetitions, and the “standard errors” presented for the PGMM columns are the sample standard deviations of the computed coefficients across the iterations.) The first thing to note is that all the coefficients are similar in magnitude and most of them have the expected sign. Indeed, deterrent variables such as arrest, conviction and prison have a negative effect on the crime rate. On the other hand, police and density have a positive impact on the dependent variable, which is expected, as the latter increases the likelihood of offenders finding victims. Moreover, as pointed out by B. Baltagi (2006), there might be simultaneity involved in the relationship between the crime rate, arrests and police. Note, however, that none of the estimates is statistically significant. From a spatial perspective, one could argue that criminal activity in some areas might affect the surrounding counties. For example, if the big cities of North Carolina are more affluent and more densely populated than the rural areas, it could be expected that the former have larger crime rates or arrests. Figures 2.3a and 2.3b show these variables plotted over the counties at the beginning and end of the period of analysis.
They reflect that indeed counties with high (low) proportions of arrests are neighbors of other counties with higher (lower) “probabilities” of arrest. To capture this, I augment model (2.35) by including the spatial lag of the variable arrest. The results are shown in columns 3 and 4 of Table 2.2.

Table 2.2 Results from the estimation (standard errors in parentheses)

              (1)         (2)         (3)         (4)
              FE2SLS      PGMM        FE2SLS      PGMM
Arrest       -0.0202     -0.0224     -0.0182     -0.0175
             (0.0128)    (0.0242)    (0.0169)    (0.0162)
Police        3.7286      4.1822      4.0688      3.9161
             (1.7727)    (2.145)     (4.0708)    (2.6127)
Conviction   -0.0019     -0.0023     -0.0206     -0.0021
             (0.0009)    (0.0013)    (0.0169)    (0.0015)
Prison       -0.0012     -0.0023     -0.0020     -0.0018
             (0.0045)    (0.0049)    (0.0021)    (0.0054)
Sentence      0.0002      0.0004     -0.0012      0.0004
             (0.0002)    (0.0002)    (0.0072)    (0.0003)
Density       0.0039      0.0011      0.0002      0.0006
             (0.0049)    (0.0027)    (0.0004)    (0.0028)
W × Arrest    -           -           0.0046     -0.0151
                                     (0.0063)    (0.0533)

After adding this covariate, none of the estimates for the other variables changes significantly and they remain statistically insignificant at the usual confidence levels. The sign of the coefficient for the spatial lag of arrests is positive for the FE2SLS estimator but negative for the PGMM. One could argue that the expected sign of this variable would be positive because, if the number of arrests in counties neighboring county 𝑖 increases, criminals might move their activities to county 𝑖. However, the empirical evidence does not support this theory, as both estimators find a statistically insignificant coefficient, which coincides with the findings of Cornwell and Trumbull (1994).

Figure 2.3 Maps of the “probability” of arrest (a) and crime rates (b) at the beginning and end of the period of study.

2.8 Conclusion

Missing data is a more serious problem in spatial models with spillover effects because the loss of information is greater if the researcher decides to use only the complete cases.
This paper presented a simple way to exploit the information in incomplete observations in spatial panel data models with potentially missing endogenous explanatory variables. The estimator is presented in a GMM framework that imposes the same coefficients on the complete and incomplete subsamples to obtain a more efficient estimator relative to the fixed effects estimator that only uses the complete cases. An alternative to the FE estimator in panel data is the correlated random effects approach, which restricts the relationship between the unobserved heterogeneity and the explanatory variables. In particular, by using Mundlak’s device the researcher can recover the same numerical FE coefficients for the time-varying variables and also estimate the effects of time-invariant covariates. In this paper, I show that this equivalence carries over to the missing data case with endogenous explanatory variables. In addition to this equivalence, I also present a potentially more efficient GMM estimator that exploits the information in the incomplete cases using the additional restrictions of the Mundlak approach. The simulations show that the proposed GMM estimator behaves well in finite samples, with an average bias very close to that of the estimator that uses the whole set of non-missing data; more importantly, it consistently had a smaller standard deviation across the Monte Carlo study compared to the estimator that uses only the complete cases, which shows that the GMM indeed provides some efficiency gains.

CHAPTER 3
ESTIMATION OF MODELS WITH MULTIPLE FIXED EFFECTS AND ENDOGENOUS VARIABLES: A CORRELATED RANDOM EFFECTS APPROACH

3.1 Introduction

Gravity-type models have been widely used in a variety of economic fields to analyze the flows of goods or services between multiple regions or entities.
The international trade literature has a long tradition of using this type of model to quantify the relationship between bilateral trade flows and other variables such as trade costs and economic integration agreements (Baier et al., 2014), although its use to estimate these relationships can be documented back to 1885 (Kabir et al., 2017). Studies in this area that use the gravity equation include Flach and Unger (2022), Anderson and Van Wincoop (2003) and B. H. Baltagi et al. (2003), but the list of papers is extensive. Furthermore, gravity-type models have also been used to explain migration flows (Beine et al., 2015) and international financial asset outflows (Okawa & Van Wincoop, 2012). Kabir et al. (2017) provide an excellent overview of other areas where the gravity equation has been applied. However, for the remainder of the paper I will focus on the international trade case. The main idea behind gravity models is that the bilateral economic relationship between two entities is proportional to their economic size (e.g., a country’s GDP is often used in the trade literature (Matyas, 1997)) and negatively correlated with their economic or geographical distance. Intuitively this idea is appealing and analogous to Newton’s Universal Law of Gravitation; however, it was recognized that the inclusion of covariates such as policy variables (e.g., border taxes) lacked theoretical justification (Anderson, 1979). Anderson (1979) made a seminal contribution in this direction by presenting a model of commodities differentiated by country of origin and deriving a gravity equation from it. Other papers that also presented theoretical foundations for these models include Krugman (1980), Bergstrand (1985), Eaton and Kortum (2002) and Chaney (2018). Gravity-type models are at least double indexed: in the cross sectional case, one index corresponds to the originating country and the other to the destination country.
If time series data is available, then one of the indices identifies the time dimension instead of the originating country; if the researcher is using panel data, then a third index can be added to the model to identify each of the components previously mentioned. More details about the formulation of a gravity model can be found in Matyas (1997). Although more details will be provided later in the paper, each of these dimensions will have a corresponding term (unobserved heterogeneities, latent variables or “fixed effects”) in the model that captures its effect on the response variable. Depending on the assumptions imposed on these terms, the estimation approach can vary between a random effects (RE) procedure and a fixed effects (FE) estimator (Matyas, 1997). An excellent overview of both the RE and FE approaches with multi-dimensional panels can be found in Matyas (2017, Chapters 1 and 2). As previously mentioned, one of the main differences between the FE and RE estimators is the restriction on the relationship between the explanatory variables and the unobserved heterogeneities that is imposed to achieve consistency. In particular, the FE estimator allows for arbitrary correlation between the latent variables and the covariates, while the RE estimator assumes zero correlation between the covariates and each of the fixed effects. However, the literature has proposed a middle ground between these approaches: the correlated random effects (CRE). For instance, in the one-way panel case, Mundlak (1978) suggested modeling the individual heterogeneity as a linear function of the time averages of the right-hand-side variables, using this auxiliary equation in the main model and estimating the parameters with Pooled Ordinary Least Squares (POLS). By following these steps, he showed that the researcher can obtain the same numerical estimates as the FE estimator for the time-varying covariates.
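Mundlak’s equivalence is easy to verify numerically in the one-way panel. The sketch below (with a made-up data generating process of my own) checks that POLS with the time averages added reproduces the within estimate exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 6
c = rng.standard_normal(N)                       # individual heterogeneity
x = c[:, None] + rng.standard_normal((N, T))     # covariate correlated with c
y = 1.0 + 0.5 * x + c[:, None] + rng.standard_normal((N, T))

# Fixed effects (within) estimator: demean within each unit
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (xd @ yd) / (xd @ xd)

# Mundlak device: POLS of y on [1, x, time average of x]
xbar = np.repeat(x.mean(axis=1), T)              # each unit's mean, repeated T times
X = np.column_stack([np.ones(N * T), x.ravel(), xbar])
beta_mundlak = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]
```

The equality is algebraic (a Frisch-Waugh argument: residualizing x on its unit means leaves exactly the within deviations), so it holds up to floating-point error for any draw of the data.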
It is important to note that this result is an algebraic equivalence that does not depend on the statistical properties of the estimators nor on the conditions assumed to obtain a consistent estimator laid out earlier. This equivalence between the Mundlak device and the FE estimator has been extended to other contexts. For example, Wooldridge (2021) showed it for the case of a two-way panel, Debarsy (2012) was the first to propose it for spatial panels, Joshi and Wooldridge (2019) demonstrated it for the case of unbalanced panels and Yang (2022) proved it for models with multiple fixed effects. It is important to note that Yang (2022) does not allow for correlation between the covariates and the idiosyncratic error term. In this paper, I extend the result by relaxing this assumption and show that the FE estimates can be recovered using two different sets of variables to model the fixed effects. The rest of the paper is organized as follows. Section 3.2 presents the model and its assumptions. Section 3.3 shows how to consistently estimate the model, Section 3.4 introduces the equivalence between the FE and the CRE approaches and Section 3.5 concludes.

3.2 Model

To motivate the use of a FE or RE approach, consider the following linear model with additive heterogeneities, which is common in gravity-type models:

𝑦𝑖𝑗𝑡 = 𝑥1𝑖𝑗𝑡 𝛽1 + 𝑥2𝑖𝑗𝑡 𝛽2 + 𝛼𝑖 + 𝜙𝑗 + 𝛾𝑡 + 𝑢𝑖𝑗𝑡 = 𝑥𝑖𝑗𝑡 𝛽 + 𝑒𝑖𝑗𝑡,   𝑖 = 1, . . . , 𝑁1,  𝑗 = 1, . . . , 𝑁2,  𝑡 = 1, . . . , 𝑇    (3.1)

where 𝑦𝑖𝑗𝑡 is the dependent variable and 𝑥𝑖𝑗𝑡 is a vector of 𝐾 explanatory variables, including a constant. I decompose the error term 𝑒𝑖𝑗𝑡 into four components: 𝛼𝑖 is the individual specific heterogeneity along one of the dimensions of the data (e.g., exporter “fixed effect”), 𝜙𝑗 is the heterogeneity along the other dimension (e.g., importer “fixed effect”), 𝛾𝑡 is the time specific effect and 𝑢𝑖𝑗𝑡 is the idiosyncratic error term.
I divide 𝑥𝑖𝑗𝑡 into two subsets: 𝑥1𝑖𝑗𝑡 are 𝐾1 exogenous variables in the sense that E(𝑥′1𝑖𝑗𝑡𝑢𝑖𝑗𝑡) = 0, and 𝑥2𝑖𝑗𝑡 are 𝐾2 endogenous variables so that E(𝑥′2𝑖𝑗𝑡𝑢𝑖𝑗𝑡) ≠ 0. In light of the endogenous 𝑥2𝑖𝑗𝑡, to obtain consistent estimates of 𝛽 we could construct Hausman-Taylor type instrumental variables; however, I will assume that we have 𝐿 (with 𝐿 ≥ 𝐾2) external instrumental variables available, denoted by 𝑧2𝑖𝑗𝑡, that satisfy the usual relevance [E(𝑧′2𝑖𝑗𝑡𝑥2𝑖𝑗𝑡) ≠ 0] and exogeneity [E(𝑧′2𝑖𝑗𝑡𝑢𝑖𝑗𝑡) = 0] conditions, and let the set of exogenous variables be 𝑧𝑖𝑗𝑡 = (𝑥1𝑖𝑗𝑡 𝑧2𝑖𝑗𝑡). In this paper, I do not consider formal asymptotic analysis, nor do I focus on whether the individual heterogeneities and time effects are parameters to be estimated or should be treated as random variables, since the equivalence derived below using the Mundlak approach is an algebraic result. However, at least one of the indices should go to infinity to obtain a consistent estimator of 𝛽, conditional on not treating the heterogeneities or time effects associated with that index as parameters to be estimated, to avoid the incidental parameters problem. Matyas (2017) has a nice review of the asymptotic properties of fixed effects and random effects estimators for the different cases that can arise in empirical work. Throughout the paper I assume that the data is ordered such that the 𝑖 index is the slowest to change, then 𝑗, and 𝑡 is the fastest. I also assume that all the relevant matrices have full column rank and are therefore invertible. I also maintain the following exogeneity assumption:

E(𝑢𝑖𝑗𝑡 | 𝑧111, 𝑧112, . . . , 𝑧𝑁1𝑁2𝑇, 𝛼𝑖, 𝜙𝑗, 𝛾𝑡) = 0    (3.2)

This is an extension to the three dimensional panel of the strict exogeneity assumption found in the one-way panel data literature.
Note that the equivalence that will be presented in Section 3.4 does not depend on any of the assumptions stated so far; it is an algebraic equivalence that is unrelated to the statistical properties of the estimators presented in the next section.

3.3 Estimation

The estimation approach for the parameters in equation (3.1) will depend on the variables of interest and the assumptions the researcher is willing to make. Suppose the exogenous variables 𝑧2𝑖𝑗𝑡 are uncorrelated with all the individual heterogeneities (𝛼𝑖, 𝜙𝑗 and 𝛾𝑡) and the following conditions are met:

1. The heterogeneities are pairwise uncorrelated.
2. E(𝛼𝑖) = E(𝜙𝑗) = E(𝛾𝑡) = 0.

Furthermore, if we assume

E(𝛼𝑖𝛼𝑖′) = 𝜎²𝛼 if 𝑖 = 𝑖′ and 0 otherwise,
E(𝜙𝑗𝜙𝑗′) = 𝜎²𝜙 if 𝑗 = 𝑗′ and 0 otherwise,
E(𝛾𝑡𝛾𝑡′) = 𝜎²𝛾 if 𝑡 = 𝑡′ and 0 otherwise,

then the structure of the covariance matrix is given by

E(𝑒𝑖𝑗𝑡𝑒𝑖′𝑗′𝑡′) = E[(𝛼𝑖 + 𝜙𝑗 + 𝛾𝑡 + 𝑢𝑖𝑗𝑡)(𝛼𝑖′ + 𝜙𝑗′ + 𝛾𝑡′ + 𝑢𝑖′𝑗′𝑡′)]
  = 𝜎²𝛼                     if 𝑖 = 𝑖′, 𝑗 ≠ 𝑗′, 𝑡 ≠ 𝑡′
  = 𝜎²𝜙                     if 𝑖 ≠ 𝑖′, 𝑗 = 𝑗′, 𝑡 ≠ 𝑡′
  = 𝜎²𝛾                     if 𝑖 ≠ 𝑖′, 𝑗 ≠ 𝑗′, 𝑡 = 𝑡′
  = 𝜎²𝛼 + 𝜎²𝜙               if 𝑖 = 𝑖′, 𝑗 = 𝑗′, 𝑡 ≠ 𝑡′
  = 𝜎²𝛼 + 𝜎²𝛾               if 𝑖 = 𝑖′, 𝑗 ≠ 𝑗′, 𝑡 = 𝑡′
  = 𝜎²𝜙 + 𝜎²𝛾               if 𝑖 ≠ 𝑖′, 𝑗 = 𝑗′, 𝑡 = 𝑡′
  = 𝜎²𝛼 + 𝜎²𝜙 + 𝜎²𝛾 + 𝜎²𝑢   if 𝑖 = 𝑖′, 𝑗 = 𝑗′, 𝑡 = 𝑡′

which translates into the following matrix:

Ω = E(𝑒𝑒′) = 𝜎²𝛼(I𝑁1 ⊗ 𝐽𝑁2𝑇) + 𝜎²𝜙(𝐽𝑁1 ⊗ I𝑁2 ⊗ 𝐽𝑇) + 𝜎²𝛾(𝐽𝑁1𝑁2 ⊗ I𝑇) + 𝜎²𝑢 I𝑁1𝑁2𝑇    (3.3)

where ⊗ represents the Kronecker product and I and 𝐽 denote an identity matrix and a square matrix of ones, respectively, of the size given by their subscript. We can transform the data to obtain an efficient estimator that exploits this information. Indeed, the RE estimator presented in Matyas (2017) can be obtained by applying Pooled Two Stage Least Squares (P2SLS) to the following equation:

Ω^(−1/2)𝑦 = Ω^(−1/2)𝑋𝛽 + Ω^(−1/2)𝑒    (3.4)

using the instruments Ω^(−1/2)𝑧2, where the absence of subscripts indicates that the data has been stacked.
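The Kronecker structure of (3.3) can be assembled directly. The short sketch below (an illustration of mine, not code from the paper) builds Ω for arbitrary dimensions, using the stacking order assumed in the text (𝑖 slowest, then 𝑗, then 𝑡):

```python
import numpy as np

def omega(N1, N2, T, s_a, s_p, s_g, s_u):
    """Covariance matrix (3.3); data stacked with i slowest, then j, then t.
    s_a, s_p, s_g, s_u are the variances of alpha, phi, gamma and u."""
    J = lambda k: np.ones((k, k))          # square matrix of ones
    return (s_a * np.kron(np.eye(N1), J(N2 * T))
            + s_p * np.kron(J(N1), np.kron(np.eye(N2), J(T)))
            + s_g * np.kron(J(N1 * N2), np.eye(T))
            + s_u * np.eye(N1 * N2 * T))
```

Each term contributes its variance to every pair of observations sharing the corresponding index, which reproduces the seven cases listed above.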
Denote the estimated coefficient from this estimation as ˆ𝛽𝑅𝐸2𝑆𝐿𝑆. A few observations are in order. First, the assumptions stated above related to the second moments of the individual heterogeneities are not necessary to obtain a consistent estimator of the parameters. These conditions only determine the specific structure of the matrix Ω, which in turn is used to perform the GLS-type transformation of the data to obtain efficiency gains; the consistency of the estimator hinges on other assumptions. On the other hand, the second moment conditions are important to obtain this particular structure of the covariance matrix, and if they do not hold, inference can be misleading. For this reason, researchers should use a robust covariance matrix to obtain the associated standard errors. Sometimes imposing zero correlation between the exogenous variables and the heterogeneities might be an unrealistic restriction. In these instances, a FE approach is also available, and it has the advantage of leaving the relationship between the exogenous variables and the heterogeneities unrestricted. One way to obtain the FE2SLS estimator is to include dummy variables to account for the different heterogeneities (see Wooldridge (2021) for a description of the Two-Way Fixed Effects estimator). Alternatively, we can apply a transformation to the data to end up with an estimating equation that does not contain the “fixed effects”. To this end, we define the following notation. Let

¯𝑦𝑖·· = (1/𝑁2𝑇) Σ𝑗 Σ𝑡 𝑦𝑖𝑗𝑡    (3.5)

and

¯𝑦·𝑗· = (1/𝑁1𝑇) Σ𝑖 Σ𝑡 𝑦𝑖𝑗𝑡    (3.6)

be the unit specific averages over the remaining dimensions for variable 𝑦. Also define

¯𝑦··𝑡 = (1/𝑁1𝑁2) Σ𝑖 Σ𝑗 𝑦𝑖𝑗𝑡    (3.7)

the cross sectional average for each 𝑡. Let

¯𝑦··· = (1/𝑁1𝑁2𝑇) Σ𝑖 Σ𝑗 Σ𝑡 𝑦𝑖𝑗𝑡    (3.8)

be the overall average.
Note that

¯𝑦··· = (1/𝑁1) Σ𝑖 ¯𝑦𝑖·· = (1/𝑁2) Σ𝑗 ¯𝑦·𝑗· = (1/𝑇) Σ𝑡 ¯𝑦··𝑡

Finally, transform and denote the original data as follows:

ÿ𝑖𝑗𝑡 = 𝑦𝑖𝑗𝑡 − ¯𝑦𝑖·· − ¯𝑦·𝑗· − ¯𝑦··𝑡 + 2¯𝑦···    (3.9)

and other variables can be constructed similarly. This transformation gives rise to the within estimator and removes the heterogeneities and time effects. It was first introduced by Matyas (1997) and its extension to a model with endogenous variables is straightforward. Indeed, the FE estimator of 𝛽, denoted as ˆ𝛽𝐹𝐸2𝑆𝐿𝑆, can be obtained by applying P2SLS to:

ÿ𝑖𝑗𝑡 = ẍ𝑖𝑗𝑡 𝛽 + ë𝑖𝑗𝑡 = ẍ𝑖𝑗𝑡 𝛽 + ü𝑖𝑗𝑡    (3.10)

using the instruments z̈2𝑖𝑗𝑡. A few comments are in order related to the within estimator. First, this is not the only transformation that removes the individual heterogeneities from the estimating equation. As pointed out by Balazsi et al. (2018), the following operation would also remove the heterogeneities and time effects in equation (3.1):

ẏ𝑖𝑗𝑡 = 𝑦𝑖𝑗𝑡 − ¯𝑦𝑖𝑗· − ¯𝑦··𝑡 + ¯𝑦···    (3.11)

The transformation needed to remove the “fixed effects” will vary depending on the structure of the heterogeneities. A second and perhaps more important point related to this operation is that some of the coefficients might not be identifiable, as the associated variables will also be removed by the operation. In particular, variables that are invariant across one of the dimensions will also be removed by the transformation. From an empirical point of view, this is not a trivial issue: for example, in the trade literature and gravity models it is common to include the GDP of the exporter or importer region or some policy variables as covariates, which will be invariant along at least one of the cross sectional dimensions and thus eliminated.
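The three-way within transform (3.9) can be written in a few lines, and one can check directly that it annihilates any additive combination 𝛼𝑖 + 𝜙𝑗 + 𝛾𝑡 (a small check of mine on made-up arrays):

```python
import numpy as np

def within(y):
    """Three-way within transform (3.9); y has shape (N1, N2, T)."""
    return (y - y.mean(axis=(1, 2), keepdims=True)   # ybar_{i..}
              - y.mean(axis=(0, 2), keepdims=True)   # ybar_{.j.}
              - y.mean(axis=(0, 1), keepdims=True)   # ybar_{..t}
              + 2 * y.mean())                        # + 2 * ybar_{...}

rng = np.random.default_rng(0)
# A pure "fixed effects" array: alpha_i + phi_j + gamma_t, broadcast to (4, 5, 6)
effects = (rng.standard_normal((4, 1, 1))
           + rng.standard_normal((1, 5, 1))
           + rng.standard_normal((1, 1, 6)))
```

Applying `within` to `effects` gives a zero array, and adding `effects` to any data array leaves its within transform unchanged, which is exactly why (3.10) is free of the heterogeneities.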
This problem and the efficiency gains give more appeal to the RE estimator over the FE estimator, at the cost of imposing additional assumptions. It is essential to stress that the equivalence presented in the next section is an algebraic result and is not related to other statistical properties of the estimators such as consistency.

3.4 Correlated Random Effects

As noted in the previous section, the FE and RE estimators rely on opposing assumptions about the relationship between the exogenous variables and the individual and time effects. On the one hand, the RE estimator assumes that there is no correlation between the exogenous variables and these unobserved effects, while the FE estimator places no restrictions in this sense. As a result, the usual bias-variance trade-off arises between the two estimators from the imposition and plausibility of this condition. In one-way panels, the literature has proposed a middle ground in which the dependence between the unobserved heterogeneity and the covariates is not zero but is restricted in a specific way. In particular, Mundlak (1978) proposed modeling the individual heterogeneity as a linear function of the time averages of the covariates. Chamberlain (1982) provided a more flexible approach in which the heterogeneity is linearly projected onto the space of the whole history of explanatory variables. One drawback of the latter is that the number of coefficients to be estimated grows linearly as the sample size grows, which can be a greater issue in higher dimensional panels. An interesting fact about the Mundlak device is that, by adding the time averages to the estimating equation, one can recover the same FE estimates whether the equation is estimated by RE or POLS. This result has been extended to the two-way panel: Wooldridge (2021) proves that the two-way FE estimates can be recovered by applying POLS to the main equation with the time and cross sectional averages added as regressors, while B. H.
Baltagi (2023) demonstrates that the GLS-type transformation and POLS are equivalent in this sense. In addition, Yang (2022) extends this equivalence to three-way panels and presents conditions under which a weighted variable addition test is equivalent to the Hausman specification test. More concretely, the linear projections of the individual and time effects in the three-way panel using the Mundlak approach are given by:

L(𝛼𝑖 | 𝑧111, 𝑧112, . . . , 𝑧𝑁1𝑁2𝑇) = ¯𝑧𝑖··𝛿1
L(𝜙𝑗 | 𝑧111, 𝑧112, . . . , 𝑧𝑁1𝑁2𝑇) = ¯𝑧·𝑗·𝛿2
L(𝛾𝑡 | 𝑧111, 𝑧112, . . . , 𝑧𝑁1𝑁2𝑇) = ¯𝑧··𝑡𝛿3    (3.12)

where L(·) denotes the linear projection operator. One aspect these papers have in common is that they show the result for the case in which the explanatory variables are exogenous with respect to the idiosyncratic error term. In this paper, I show that the equivalence carries over when there are endogenous variables on the right hand side of the equation, which can be useful as this situation often arises in empirical work. Once again, it is important to stress that this is an algebraic result that is unrelated to the consistency of the estimators. To fix ideas, it is useful to first re-write the RE transformation from (3.4) in scalar form, which yields the following:

𝜎𝑢 Ω^(−1/2) 𝑦𝑖𝑗𝑡 = ˜𝑦𝑖𝑗𝑡 = 𝑦𝑖𝑗𝑡 − ˜𝜃1 ¯𝑦𝑖·· − ˜𝜃2 ¯𝑦·𝑗· − ˜𝜃3 ¯𝑦··𝑡 + ˜𝜃4 ¯𝑦···    (3.13)

where

˜𝜃1 = 1 − √𝜃1,   ˜𝜃2 = 1 − √𝜃2,   ˜𝜃3 = 1 − √𝜃3,   ˜𝜃4 = 2 − √𝜃1 − √𝜃2 − √𝜃3 + √𝜃4

with

𝜃1 = 𝜎²𝑢 / (𝑁2𝑇𝜎²𝛼 + 𝜎²𝑢),   𝜃2 = 𝜎²𝑢 / (𝑁1𝑇𝜎²𝜙 + 𝜎²𝑢),   𝜃3 = 𝜎²𝑢 / (𝑁1𝑁2𝜎²𝛾 + 𝜎²𝑢),
𝜃4 = 𝜎²𝑢 / (𝑁2𝑇𝜎²𝛼 + 𝑁1𝑇𝜎²𝜙 + 𝑁1𝑁2𝜎²𝛾 + 𝜎²𝑢)

and where we can transform the rest of the variables in a similar way. Therefore, the RE2SLS estimator can once again be obtained by applying Pooled 2SLS to

˜𝑦𝑖𝑗𝑡 = ˜𝑥𝑖𝑗𝑡 𝛽 + ˜𝑒𝑖𝑗𝑡    (3.14)

using the instrumental variables ˜𝑧2𝑖𝑗𝑡. Note that the Pooled 2SLS estimator is a special case of (3.14) obtained by setting ˜𝜃𝑠 = 0 for 𝑠 = 1, 2, 3, 4.
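That (3.13) really is the scalar form of 𝜎𝑢 Ω^(−1/2) can be checked numerically on a small balanced panel. The sketch below is my own verification, with made-up variance components: it builds Ω from averaging operators, takes its inverse square root by eigendecomposition and compares it with the scalar transformation written as a matrix:

```python
import numpy as np

N1, N2, T = 3, 4, 5
s_a, s_p, s_g, s_u = 1.5, 0.8, 0.6, 1.0   # made-up variance components
n = N1 * N2 * T

def Jbar(k):
    """k x k averaging matrix (ones matrix divided by k)."""
    return np.ones((k, k)) / k

# Averaging operators for the four kinds of means (i slowest, then j, then t)
P1 = np.kron(np.eye(N1), Jbar(N2 * T))                  # -> ybar_{i..}
P2 = np.kron(Jbar(N1), np.kron(np.eye(N2), Jbar(T)))    # -> ybar_{.j.}
P3 = np.kron(Jbar(N1 * N2), np.eye(T))                  # -> ybar_{..t}
P0 = Jbar(n)                                            # -> ybar_{...}

# Omega from (3.3), rewritten in terms of the averaging operators
Omega = (s_u * np.eye(n) + N2 * T * s_a * P1
         + N1 * T * s_p * P2 + N1 * N2 * s_g * P3)

# Matrix inverse square root via eigendecomposition
lam, V = np.linalg.eigh(Omega)
Om_isqrt = V @ np.diag(lam ** -0.5) @ V.T

# Scalar-form transformation (3.13) written as a matrix
th = [s_u / (N2 * T * s_a + s_u),
      s_u / (N1 * T * s_p + s_u),
      s_u / (N1 * N2 * s_g + s_u),
      s_u / (N2 * T * s_a + N1 * T * s_p + N1 * N2 * s_g + s_u)]
r1, r2, r3, r4 = (np.sqrt(t) for t in th)
Trans = (np.eye(n) - (1 - r1) * P1 - (1 - r2) * P2 - (1 - r3) * P3
         + (2 - r1 - r2 - r3 + r4) * P0)
```

The two matrices agree because the averaging operators are commuting projections, so Ω has the four eigenvalues 𝜎²𝑢/𝜃𝑠 on mutually orthogonal subspaces.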
To obtain the CRE 2SLS estimator using the Mundlak device, we can apply P2SLS to the following equation:

˜𝑦𝑖𝑗𝑡 = ˜𝑥𝑖𝑗𝑡 𝛽 + ˜¯𝑥1𝑖𝑗𝑡 𝜋 + ˜¯𝑧2𝑖𝑗𝑡 𝛿    (3.15)

using instruments ˜𝑧2𝑖𝑗𝑡, where ˜¯𝑥1𝑖𝑗𝑡 = ( ˜¯𝑥1𝑖·· ˜¯𝑥1·𝑗· ˜¯𝑥1··𝑡) and ˜¯𝑧2𝑖𝑗𝑡 = ( ˜¯𝑧2𝑖·· ˜¯𝑧2·𝑗· ˜¯𝑧2··𝑡). Two observations are in order related to (3.15). First, note that ˜¯𝑥1𝑖·· = (1 − ˜𝜃1) ¯𝑥1𝑖··, ˜¯𝑥1·𝑗· = (1 − ˜𝜃2) ¯𝑥1·𝑗·, ˜¯𝑥1··𝑡 = (1 − ˜𝜃3) ¯𝑥1··𝑡, and similarly for ˜¯𝑧2, so that the averages of the transformed variables do not depend on parameters associated with the other dimensions’ averages. Second, note that we only need to include the averages of the exogenous variables ˜¯𝑥1𝑖𝑗𝑡 and ˜¯𝑧2𝑖𝑗𝑡 and not those of the endogenous variables ˜¯𝑥2𝑖𝑗𝑡. By doing so, the 𝛽 recovered from this estimation, denoted as ˆ𝛽𝑀1, will be numerically the same as ˆ𝛽𝐹𝐸2𝑆𝐿𝑆. This result is summarized in Proposition 1.

Proposition 1. Suppose that all the relevant matrices have full column rank. Let ˆ𝛽𝐹𝐸2𝑆𝐿𝑆 be the coefficient obtained by estimating equation (3.10) by Pooled 2SLS and ˆ𝛽𝑀1 be the coefficient computed from applying Pooled 2SLS to equation (3.15). Then ˆ𝛽𝐹𝐸2𝑆𝐿𝑆 = ˆ𝛽𝑀1.

The proof of this proposition can be found in the Appendix. This result is useful because it allows the researcher to perform a Hausman-type test using a variable addition test. Specifically, the researcher can analyze the significance of the coefficients associated with the averages to decide between a FE and a RE specification. A discussion of this procedure can be found in Joshi and Wooldridge (2019). As Matyas (2017) notes, there can be many causes of endogeneity in three-way panels, which might require at least as many instruments for each of these sources. In order to obtain the CRE equivalence, the researcher has to include all of their averages in the estimating equation, which can consume an important number of degrees of freedom and can be costly in finite samples when conducting inference.
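The flavor of Proposition 1 is easy to confirm numerically. For brevity, the sketch below (my own illustration, not taken from the paper) checks the analogous one-way equivalence with a made-up DGP: FE2SLS on within-demeaned data coincides with pooled 2SLS that adds the unit averages of the exogenous variable and the instrument as controls; in the three-way case the time averages are replaced by the three sets of averages.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 300, 5
c = rng.standard_normal(N)                        # heterogeneity
x1 = c[:, None] + rng.standard_normal((N, T))     # exogenous covariate
z2 = c[:, None] + rng.standard_normal((N, T))     # external instrument
u = rng.standard_normal((N, T))
x2 = 0.8 * z2 + 0.5 * u + rng.standard_normal((N, T))   # endogenous covariate
y = 1.0 + 0.6 * x1 + 1.2 * x2 + c[:, None] + u

def tsls(y, X, Z):
    """2SLS: project X on Z, then regress y on the fitted values."""
    Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(Xhat, y, rcond=None)[0]

dm = lambda a: (a - a.mean(axis=1, keepdims=True)).ravel()   # within transform
avg = lambda a: np.repeat(a.mean(axis=1), T)                 # unit averages

# FE2SLS: within-demean everything, instrument x2 with z2
bFE = tsls(dm(y), np.column_stack([dm(x1), dm(x2)]),
           np.column_stack([dm(x1), dm(z2)]))

# CRE/Mundlak 2SLS: pooled, with averages of x1 and z2 (not x2) added
W = np.column_stack([np.ones(N * T), avg(x1), avg(z2)])
X = np.column_stack([x1.ravel(), x2.ravel(), W])
Z = np.column_stack([x1.ravel(), z2.ravel(), W])
bCRE = tsls(y.ravel(), X, Z)
```

Partialling the averages out of the instruments leaves exactly the within deviations, which is why no average of the endogenous 𝑥2 is needed.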
Fortunately, we can recover the FE estimates by adding a different set of variables. If we let ˆ𝑥2𝑖𝑗𝑡 denote the first stage predicted values of the endogenous variables, then applying Pooled 2SLS to

˜𝑦𝑖𝑗𝑡 = ˜𝑥1𝑖𝑗𝑡 𝛽1 + ˆ˜𝑥2𝑖𝑗𝑡 𝛽2 + ˜¯𝑥1𝑖𝑗𝑡 𝜋1 + ˆ˜¯𝑥2𝑖𝑗𝑡 𝜋2 = ˆ˜𝑥𝑖𝑗𝑡 𝛽 + ˆ˜¯𝑥𝑖𝑗𝑡 𝜋    (3.16)

using the instruments ( ˜𝑧2𝑖𝑗𝑡 ˜¯𝑧2𝑖𝑗𝑡) will also yield the same 𝛽 as FE2SLS. Proposition 2 formally states the equivalence.

Proposition 2. Suppose that all the relevant matrices have full column rank. Let ˆ𝛽𝐹𝐸2𝑆𝐿𝑆 be the coefficient obtained by estimating equation (3.10) by Pooled 2SLS and ˆ𝛽𝑀2 be the coefficient computed from applying Pooled 2SLS to equation (3.16). Then ˆ𝛽𝐹𝐸2𝑆𝐿𝑆 = ˆ𝛽𝑀2.

As noted previously, the advantage of using this set of variables instead of the instruments is that it preserves degrees of freedom when we have more than one instrument for each endogenous covariate. An important feature of the CRE approach using the Mundlak device is that it allows the researcher to estimate the effect of variables that are constant across one of the dimensions of the panel, something that cannot be done using the within estimator, as its transformation wipes out any variable of this nature. In fact, Wooldridge (2021) proves in the two-way panel that adding additional variables that only vary across one of the dimensions will not change the FE estimates, a result that most likely carries over to the three-way panel. This result makes intuitive sense, as the within estimator is supposed to remove these variables, but it also shows that adding the averages (either of the exogenous variables or of the predicted values as in Proposition 2) is enough to control for the individual and time effects.

3.5 Conclusion

In this paper, I establish the algebraic equivalence between FE2SLS and RE2SLS in three-way panels with additive unobserved heterogeneities in the presence of endogenous variables using the Mundlak device.
Namely, including either the averages of the exogenous variables or the means of the predicted values of the explanatory covariates across all the different dimensions is enough to control for the unobserved heterogeneities and to recover the FE2SLS estimates. The first approach has the disadvantage that if there are multiple instruments available, the degrees of freedom can be reduced considerably, an issue that is more severe in finite samples. The use of Mundlak's device also allows us to relax the assumption of no correlation between the covariates and the unobserved heterogeneities, which allows the researcher to obtain more robust estimates of the coefficients. Furthermore, this result offers researchers a flexible and easy-to-implement way to choose between an FE and an RE specification. In particular, Yang (2022) shows that a modified variable addition test on the averages is equivalent to the Hausman-type test, with the additional advantage that the former can be made robust to heteroskedasticity and serial correlation. One limitation of the result shown in this paper is that the algebraic equivalence is likely to break down under other structures of heterogeneity. For example, Yang (2022) argues that if the cross-sectional heterogeneities are time varying, then the result no longer holds in the case of exogenous variables, a result that most likely carries over in the presence of endogenous covariates. However, future research could extend the result to more general models of unobserved heterogeneity.

BIBLIOGRAPHY

Abrevaya, J., & Donald, S. G. (2017). A GMM approach for dealing with missing data on regressors. The Review of Economics and Statistics, 99(4), 657–662.
Ahn, S. C., & Schmidt, P. (1995). Efficient estimation of models for dynamic panel data. Journal of Econometrics, 68(1), 5–27.
Amemiya, T. (1985). Advanced econometrics. Harvard University Press.
Anderson, J. (1979). A theoretical foundation for the gravity equation.
The American Economic Review, 69(1), 106–116.
Anderson, J., & Van Wincoop, E. (2003). Gravity with gravitas: A solution to the border puzzle. The American Economic Review, 93(1), 170–192.
Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics, 49(4), 431–434.
Baier, S. L., Bergstrand, J., & Feng, M. (2014). Economic integration agreements and the margins of international trade. Journal of International Economics, 93(2), 339–350.
Balazsi, L., Matyas, L., & Wansbeek, T. (2018). The estimation of multidimensional fixed effects panel data models. Econometric Reviews, 37(3), 212–227.
Baltagi, B. (2006). Estimating an economic model of crime using panel data from North Carolina. Journal of Applied Econometrics, 21, 543–547.
Baltagi, B., & Liu, L. (2011). Instrumental variable estimation of a spatial autoregressive panel model with random effects. Economics Letters, 111, 135–137.
Baltagi, B. H. (2023). The two-way Mundlak estimator (Working Paper No. 256). Center for Policy Research.
Baltagi, B. H., Egger, P., & Pfaffermayr, M. (2003). A generalized design for bilateral trade flow models. Economics Letters, 80(3), 391–397.
Basile, R. (2009). Productivity polarization across regions in Europe: The role of nonlinearities and spatial dependence. International Regional Science Review, 92–115.
Basile, R., Durban, M., Minguez, R., Montero, J. M., & Mur, J. (2014). Modeling regional economic dynamics: Spatial dependence, spatial heterogeneity and nonlinearities. Journal of Economic Dynamics and Control, 229–245.
Beine, M., Bertoli, S., & Fernandez-Huertas-Moraga, J. (2015). A practitioners' guide to gravity models of international migration. The World Economy, 39, 496–512.
Bergstrand, J. H. (1985). The gravity equation in international trade: Some microeconomic foundations and empirical evidence. The Review of Economics and Statistics, 67(3), 474–481.
Bester, C.
A., Conley, T., Hansen, C., & Vogelsang, T. (2016). Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators. Econometric Theory, 32(1), 154–186.
Bester, C. A., Conley, T. G., & Hansen, C. B. (2011). Inference with dependent data using cluster covariance estimators. Journal of Econometrics, 165(2), 137–151.
Blundell, R., & Powell, J. (2003). Endogeneity in nonparametric and semiparametric regression models. Econometric Society Monographs, 36, 312–357.
Cameron, C., & Trivedi, P. (2005). Microeconometrics: Methods and applications. Cambridge University Press.
Chamberlain, G. (1982). Multivariate regression models for panel data. Journal of Econometrics, 18(1), 5–46.
Chaney, T. (2018). The gravity equation in international trade: An explanation. Journal of Political Economy, 126(1), 150–177.
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45.
Conley, T. G., & Molinari, F. (2007). Spatial correlation robust inference with errors in location or distance. Journal of Econometrics, 140, 76–96.
Cornwell, C., & Trumbull, W. N. (1994). Estimating the economic model of crime with panel data. The Review of Economics and Statistics, 76(2), 360–366.
Dagenais, M. (1973). The use of incomplete observations in multiple regression analysis. Journal of Econometrics, 1(4), 317–328.
Dardanoni, V., Modica, S., & Peracchi, F. (2011). Regression with imputed covariates: A generalized missing-indicator approach. Journal of Econometrics, 162(2), 362–368.
Debarsy, N. (2012). The Mundlak approach in the spatial Durbin panel data model. Spatial Economic Analysis, 7(1), 109–131.
Driscoll, J. C., & Kraay, A. C. (1998). Consistent covariance matrix estimation with spatially dependent panel data. The Review of Economics and Statistics, 80(4), 549–560.
Eaton, J., & Kortum, S. (2002). Technology, geography, and trade. Econometrica, 70(5), 1741–1779.
Flach, L., & Unger, F. (2022).
Quality and gravity in international trade. Journal of International Economics, 137, 103578.
Gourieroux, C., & Monfort, A. (1981). On the problem of missing data in linear models. The Review of Economic Studies, 48(4), 579–586.
Greene, W. (2007). Econometric analysis (7th ed.). Prentice Hall.
Jones, M. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433), 222–230.
Joshi, R., & Wooldridge, J. M. (2019). Correlated random effects models with endogenous explanatory variables and unbalanced panels. Annals of Economics and Statistics, (134).
Kabir, M., Salim, R., & Al-Mawali, N. (2017). The gravity model and trade flows: Recent developments in econometric modeling and empirical evidence. Economic Analysis and Policy, 56, 60–71.
Kapoor, M., Kelejian, H., & Prucha, I. R. (2007). Panel data models with spatially correlated error components. Journal of Econometrics, 140(1), 97–130.
Kelejian, H., & Prucha, I. (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. The Journal of Real Estate Finance and Economics, 17(1), 99–121.
Kelejian, H., & Prucha, I. (2007). HAC estimation in a spatial framework. Journal of Econometrics, 140(1), 131–154.
Kelejian, H., & Prucha, I. (2010). Spatial models with spatially lagged dependent variables and incomplete data. Journal of Geographical Systems, 12, 241–257.
Kelejian, H., Prucha, I. R., & Yuzefovich, Y. (2004). Instrumental variable estimation of a spatial autoregressive model with autoregressive disturbances: Large and small sample results. In Spatial and spatiotemporal econometrics. Emerald Group Publishing Limited.
Kiefer, N. M., & Vogelsang, T. J. (2005). A new asymptotic theory for heteroskedasticity-autocorrelation robust tests. Econometric Theory, 21(6), 1130–1164.
Kim, M. S., & Sun, Y. (2011).
Spatial heteroskedasticity and autocorrelation consistent estimation of covariance matrix. Journal of Econometrics, 160, 346–371.
Kim, M. S., & Sun, Y. (2013). Heteroskedasticity and spatiotemporal dependence robust inference for linear panel models with fixed effects. Journal of Econometrics, 177, 85–108.
Krugman, P. (1980). Scale economies, product differentiation, and the pattern of trade. American Economic Review, 70(5), 950–959.
Künsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics, 17(3), 1217–1241.
Lee, L.-F. (2003). Best spatial two-stage least squares estimators for a spatial autoregressive model with autoregressive disturbances. Econometric Reviews, 22(4), 307–335.
LeSage, J., & Pace, R. K. (2004). Models for spatially dependent missing data. Journal of Real Estate Finance and Economics, 29(2), 233–254.
LeSage, J., & Pace, R. K. (2009). Introduction to spatial econometrics. CRC Press.
Li, L., & Yang, Z. (2020). Spatial dynamic panel data models with correlated random effects. Journal of Econometrics.
Little, R., & Rubin, D. (2019). Statistical analysis with missing data (3rd ed.). Wiley.
Matyas, L. (1997). Proper econometric specification of the gravity model. The World Economy, 20, 363–368.
Matyas, L. (Ed.). (2017). The econometrics of multi-dimensional panels: Theory and applications. Springer.
McMillen, D. P. (1996). One hundred fifty years of land values in Chicago: A nonparametric approach. Journal of Urban Economics, 40, 100–124.
Müller, U. K. (2014). HAC corrections for strongly autocorrelated time series. Journal of Business and Economic Statistics, 32(3), 311–321.
Müller, U. K., & Watson, M. W. (2022a). Spatial correlation robust inference. Econometrica, 90(6), 2901–2935.
Müller, U. K., & Watson, M. W. (2022b). Spatial correlation robust inference in linear regression and panel models. Journal of Business and Economic Statistics, 1–15.
Mundlak, Y. (1978).
On the pooling of time series and cross section data. Econometrica, 46(1), 69–85.
Mutl, J., & Pfaffermayr, M. (2010). The Hausman test in a Cliff and Ord panel model. Econometrics Journal, 10, 1–30.
Jenish, N., & Prucha, I. R. (2009). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics, 150, 86–98.
Newey, W. K., & West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703–708.
Okawa, Y., & Van Wincoop, E. (2012). Gravity in international finance. Journal of International Economics, 87(2), 205–215.
Papke, L. E. (2005). The effects of spending on test pass rates: Evidence from Michigan. Journal of Public Economics, 89(5-6), 821–839.
Papke, L. E., & Wooldridge, J. M. (2008). Panel data methods for fractional response variables with an application to test pass rates. Journal of Econometrics, 145(1-2), 121–133.
Politis, D. N., & White, H. (2004). Automatic block-length selection for the dependent bootstrap. Econometric Reviews, 23(1), 53–70.
Rai, B. (2021). Efficient estimation with missing values in cross section and panel data [Doctoral dissertation, Michigan State University].
Rai, B. (2023). Efficient estimation with missing data and endogeneity. Econometric Reviews, 42(2), 220–239.
Vogelsang, T. (2012). Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear panel models with fixed-effects. Journal of Econometrics, 166, 303–319.
Wang, W., & Lee, L.-F. (2013). Estimation of spatial autoregressive models with randomly missing data in the dependent variable. The Econometrics Journal, 16, 73–102.
Wang, W., & Lee, L.-F. (2013). Estimation of spatial panel data models with randomly missing data in the dependent variable. Regional Science and Urban Economics, 43, 521–538.
Wheeler, D., & Tiefelsdorf, M. (2005).
Multicollinearity and correlation among local regression coefficients in geographically weighted regression. Journal of Geographical Systems, 7(2), 161–187.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.). MIT Press.
Wooldridge, J. M. (2019). Correlated random effects models with unbalanced panels. Journal of Econometrics, 211, 137–150.
Wooldridge, J. M. (2021). Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. SSRN Electronic Journal.
Wu-Chaves, S. (2024). Essays in spatial panel data econometrics [Unpublished doctoral dissertation]. Michigan State University.
Yang, Y. (2022). A correlated random effects approach to the estimation of models with multiple fixed effects. Economics Letters, 213, 110408.

APPENDIX A

ADDITIONAL ASSUMPTIONS AND DEFINITIONS FOR CHAPTER 1

Assumption 1. The functions $g_i(\cdot, \theta)$ satisfy the following conditions:

1. $g_i(\cdot, \theta)$ are Borel measurable on $\mathcal{Z}$, the $\sigma$-algebra generated by $Z$, for all $\theta \in \Theta$.
2. $\sup_N \sup_{i \in D_N} E[|g_i(Z_{i,N}, \theta)|^{2+\eta}] < \infty$ for all $\theta \in \Theta$ and some $\eta > 0$.

Assumption 2. The $g(\cdot, \cdot)$ satisfy the following conditions:

1. For some $p \geq 1$,
$$\limsup_{N \to \infty} \frac{1}{|D_N|} \sum_{i \in D_N} E\big[ d_{i,N}^{p}\, \mathbf{1}(d_{i,N}^{p} > k) \big] \to 0 \quad \text{as } k \to \infty,$$
where $d_{i,N} = \sup_{\theta \in \Theta} |g_{i,N}(Z_{i,N}, \theta)|$.
2. $g_i(Z_{i,N}, \theta)$ are $L_0$ stochastically equicontinuous.

Assumption 3. The true parameter $\theta_0$ and the $g_i(\cdot, \cdot)$ satisfy the following conditions:

1. $\theta_0 \in \operatorname{int}(\Theta)$.
2. $g_i(Z_i, \cdot)$ is continuously differentiable on the interior of $\Theta$.
3. $|\nabla_\theta g_i(Z_i, \theta)| < \infty$, where $\nabla_\theta$ denotes the gradient of $g_i(Z_i, \theta)$ with respect to the parameter vector $\theta$.
4. $\nabla_\theta g_i(Z_i, \theta)$ is Borel measurable, $E[\nabla_\theta g_i(Z_i, \theta)]$ exists, and $\operatorname{rank}\{E[\nabla_\theta g_i(A_i, \theta)]\} = P$, where $P = \dim(\theta_0)$.
5. $E[|g_i(Z_i, \theta_0)|^{2+\epsilon}] < \infty$ for some $\epsilon > 0$.
Assumption 4. There exist finite-dimensional vectors $m_i$ and $\Delta$ such that $\hat{u}_i - u_i = m_i\Delta$ with
$$\frac{1}{N}\sum_{i=1}^{N}\|z_i\|^2 = O_p(1) \quad \text{and} \quad N^{1/2}\|\Delta\| = O_p(1).$$

Definitions

$\alpha$-mixing for random fields. Let $D_N$ be a subset of $D$. For $U \subseteq D_N$ and $V \subseteq D_N$, let $\sigma_N(U) = \sigma(X_{i,N} : i \in U)$ and $\alpha_N(U, V) = \alpha(\sigma_N(U), \sigma_N(V))$. The $\alpha$-mixing coefficients for the random field $\{X_{i,N} : i \in D_N, N \in \mathbb{N}\}$ are then defined as
$$\alpha_{k,l,N}(r) = \sup\{\alpha_N(U, V) : |U| \le k,\ |V| \le l,\ \rho(U, V) \ge r\}$$
for $k, l, r, N \in \mathbb{N}$. Define also $\bar{\alpha}_{k,l}(r) = \sup_N \alpha_{k,l,N}(r)$.

Upper tail quantile function. Let $X$ be a random variable. The upper quantile function $Q_X : (0,1) \to [0,\infty)$ is defined as
$$Q_X(u) = \inf\{t : P(X > t) \le u\}.$$

"Inverse" function of the mixing coefficients. For the non-increasing sequence of mixing coefficients $\{\bar{\alpha}_{1,1}(m)\}_{m=1}^{\infty}$, set $\bar{\alpha}_{1,1}(0) = 1$ and define its "inverse" function $\alpha_{\mathrm{inv}} : (0,1) \to \mathbb{N} \cup \{0\}$ as
$$\alpha_{\mathrm{inv}}(u) = \max\{m \ge 0 : \bar{\alpha}_{1,1}(m) > u\}.$$

Stochastic equicontinuity. The array of random functions $\{f_{i,N}(Z_{i,N}, \theta) : i \in D_N, N \ge 1\}$ is:

1. $L_0$ stochastically equicontinuous on $\Theta$ iff for every $\varepsilon > 0$,
$$\limsup_{N\to\infty}\frac{1}{|D_N|}\sum_{i\in D_N} P\Big[\sup_{\theta'\in\Theta}\sup_{\theta\in B(\theta',\delta)}|f_{i,N}(Z_{i,N},\theta) - f_{i,N}(Z_{i,N},\theta')| > \varepsilon\Big] \to 0 \quad \text{as } \delta \to 0;$$
2. $L_p$ stochastically equicontinuous, $p > 0$, on $\Theta$ iff
$$\limsup_{N\to\infty}\frac{1}{|D_N|}\sum_{i\in D_N} E\Big[\sup_{\theta'\in\Theta}\sup_{\theta\in B(\theta',\delta)}|f_{i,N}(Z_{i,N},\theta) - f_{i,N}(Z_{i,N},\theta')|^p\Big] \to 0 \quad \text{as } \delta \to 0;$$
3. a.s. stochastically equicontinuous on $\Theta$ iff
$$\limsup_{N\to\infty}\frac{1}{|D_N|}\sum_{i\in D_N}\sup_{\theta'\in\Theta}\sup_{\theta\in B(\theta',\delta)}|f_{i,N}(Z_{i,N},\theta) - f_{i,N}(Z_{i,N},\theta')| \to 0 \quad \text{a.s. as } \delta \to 0.$$

APPENDIX B

PROOFS FOR CHAPTER 1

Proof of Proposition 5

For notational simplicity, we assume that $W_i y_t$ is included in $x_{2it}$, with $x_{it} = [x_{1it}\ x_{2it}]$, where $x_2$ collects the $k_2 + 1$ endogenous variables, and $z_{it} = [x_{1it}\ z_{2it}\ w^2_{1it} \dots w^s_{1it}]$, where $z_2$ is a vector of $L_2$ instruments for $x_2$, with $L_2 \ge k_2$, and similarly for the spatial variables (note, however, that $W_i y_t$ is not in $W_i X_t$).
Therefore, the problem is to apply Pooled 2SLS to the following equation:
$$y_{it} - \eta\bar{y}_i = (x_{it} - \eta\bar{x}_i)\beta + W_i(X_t - \eta\bar{X})\gamma + (1-\eta)\bar{z}_i\delta + (1-\eta)W_i\bar{Z}\lambda = (x_{it} - \eta\bar{x}_i)\beta + (w_{it} - \eta\bar{w}_i)\gamma + (1-\eta)\bar{z}_i\delta + (1-\eta)\bar{\mathfrak{Z}}_i\lambda$$
using IVs $[(z_{it} - \eta\bar{z}_i)\ (\mathfrak{Z}_{it} - \eta\bar{\mathfrak{Z}}_i)\ (1-\eta)\bar{z}_{2i}\ (1-\eta)\bar{\mathfrak{Z}}_i]$. We first orthogonalize the IVs: we run $z_{it} - \eta\bar{z}_i = (1-\eta)\bar{z}_i\epsilon_1 + (1-\eta)\bar{\mathfrak{Z}}_i\epsilon_2 + \text{residual}$ and obtain the residuals $r_{it}$, and $\mathfrak{Z}_{it} - \eta\bar{\mathfrak{Z}}_i = (1-\eta)\bar{z}_i\epsilon_3 + (1-\eta)\bar{\mathfrak{Z}}_i\epsilon_4 + \text{residual}$ and obtain the residuals $s_{it}$. To do so, we use the Frisch–Waugh–Lovell (FWL) theorem sequentially.

1.a) Regress $z_{it} - \eta\bar{z}_i$ on $(1-\eta)\bar{z}_i$. The coefficient will be:
$$\tilde{\epsilon}_1 = \Big[\sum_{i=1}^{N}\sum_{t=1}^{T}(1-\eta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i=1}^{N}\sum_{t=1}^{T}(1-\eta)\bar{z}_i'(z_{it} - \eta\bar{z}_i)\Big] = \Big[\sum_{i=1}^{N}T(1-\eta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i=1}^{N}\big(T(1-\eta)\bar{z}_i'\bar{z}_i - T(1-\eta)\eta\,\bar{z}_i'\bar{z}_i\big)\Big] = \Big[\sum_{i=1}^{N}T(1-\eta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i=1}^{N}T(1-\eta)^2\bar{z}_i'\bar{z}_i\Big] = I_{L}.$$
Therefore the residuals will be $v_{it} = z_{it} - \bar{z}_i$.

1.b) Regress $(1-\eta)\bar{\mathfrak{Z}}_i$ on $(1-\eta)\bar{z}_i$. In this case the coefficient and the residuals depend only on the index $i$; call the latter $f_i$.

1.c) Regress $v_{it}$ on $f_i$ to obtain $\epsilon_2$. The coefficient will be:
$$\epsilon_2 = \Big[\sum_i\sum_t f_i'f_i\Big]^{-1}\Big[\sum_i\sum_t f_i'v_{it}\Big] = \Big[\sum_i\sum_t f_i'f_i\Big]^{-1}\Big[\sum_i f_i'\sum_t (z_{it} - \bar{z}_i)\Big] = 0_{L},$$
where we used the fact that the deviations from the mean sum to zero for each $i$. This implies that $\epsilon_1 = I_L$ and therefore $r_{it} = z_{it} - \bar{z}_i$. Using very similar steps, it can be shown that if we run $\mathfrak{Z}_{it} - \eta\bar{\mathfrak{Z}}_i = (1-\eta)\bar{z}_i\epsilon_3 + (1-\eta)\bar{\mathfrak{Z}}_i\epsilon_4 + \text{residual}$, then $\epsilon_3 = 0_L$ and $\epsilon_4 = I_L$, so the residuals of this regression will be $s_{it} = \mathfrak{Z}_{it} - \bar{\mathfrak{Z}}_i$.
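Two elementary facts drive these FWL computations: regressing a quasi-demeaned variable on its scaled unit average yields an identity coefficient with the fully demeaned variable as residual, and unit averages are orthogonal to within-unit deviations, so the quasi-demeaning parameter drops out of cross products with demeaned variables. Both are exact in any balanced panel and easy to confirm numerically; the sketch below uses illustrative names, not objects from the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, eta = 50, 4, 0.6
z = rng.normal(size=(N, T))
x = rng.normal(size=(N, T))
zbar = np.repeat(z.mean(1), T)        # unit averages expanded to N*T rows
xbar = np.repeat(x.mean(1), T)
zf, xf = z.ravel(), x.ravel()

# Regress z_it - eta*zbar_i on (1 - eta)*zbar_i: the coefficient is 1 and
# the residual is the fully demeaned z_it - zbar_i
reg = (1 - eta) * zbar
coef = (reg @ (zf - eta * zbar)) / (reg @ reg)
resid = zf - eta * zbar - reg * coef
print(np.isclose(coef, 1.0))                        # True
print(np.allclose(resid, zf - zbar))                # True

# Cross products with a demeaned variable: the eta part drops out because
# unit averages are orthogonal to within-unit deviations
print(np.isclose((xf - eta * xbar) @ (zf - zbar),
                 (xf - xbar) @ (zf - zbar)))        # True
```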
Since we have orthogonalized the instrumental variables with respect to $(1-\eta)\bar{z}_i$ and $(1-\eta)\bar{\mathfrak{Z}}_i$, we now have to apply Pooled 2SLS to the following equation:
$$y_{it} - \eta\bar{y}_i = (x_{it} - \eta\bar{x}_i)\beta + (w_{it} - \eta\bar{w}_i)\gamma$$
using IVs $[(z_{it} - \bar{z}_i)\ (\mathfrak{Z}_{it} - \bar{\mathfrak{Z}}_i)]$. We now define the following notation: $\ddot{z}_{it} = z_{it} - \bar{z}_i$, $\ddot{\mathfrak{Z}}_{it} = \mathfrak{Z}_{it} - \bar{\mathfrak{Z}}_i$, $\hat{z}_{it} = [\ddot{z}_{it}\ \ddot{\mathfrak{Z}}_{it}]$, $\tilde{y}_{it} = y_{it} - \eta\bar{y}_i$, $\tilde{x}_{it} = [(x_{it} - \eta\bar{x}_i)\ (w_{it} - \eta\bar{w}_i)]$, $\hat{y}_{it} = y_{it} - \bar{y}_i$, and $\hat{x}_{it} = [(x_{it} - \bar{x}_i)\ (w_{it} - \bar{w}_i)]$. Then the $\Gamma = (\beta\ \gamma)$ from the previous problem can be obtained as:
$$\hat{\Gamma}_{2SLS} = \Big[\Big(\sum_i\sum_t \tilde{x}_{it}'\hat{z}_{it}\Big)\Big(\sum_i\sum_t \hat{z}_{it}'\hat{z}_{it}\Big)^{-1}\Big(\sum_i\sum_t \hat{z}_{it}'\tilde{x}_{it}\Big)\Big]^{-1}\Big(\sum_i\sum_t \tilde{x}_{it}'\hat{z}_{it}\Big)\Big(\sum_i\sum_t \hat{z}_{it}'\hat{z}_{it}\Big)^{-1}\Big(\sum_i\sum_t \hat{z}_{it}'\tilde{y}_{it}\Big) \tag{B.1}$$

The first term in square brackets can be rewritten as follows (the third term of that inverse matrix can also be written in a similar way):
$$\sum_i\sum_t \tilde{x}_{it}'\hat{z}_{it} = \sum_i\sum_t \begin{bmatrix} (x_{it} - \eta\bar{x}_i)'\ddot{z}_{it} & (x_{it} - \eta\bar{x}_i)'\ddot{\mathfrak{Z}}_{it} \\ (w_{it} - \eta\bar{w}_i)'\ddot{z}_{it} & (w_{it} - \eta\bar{w}_i)'\ddot{\mathfrak{Z}}_{it} \end{bmatrix} \tag{B.2}$$

We focus on the (1,1) term, but the following algebraic manipulation holds for the rest of the terms in the matrix and for the second term in (B.1):
$$\sum_i\sum_t (x_{it} - \eta\bar{x}_i)'\ddot{z}_{it} = \sum_i\sum_t x_{it}'\ddot{z}_{it} - \eta\sum_i \bar{x}_i'\sum_t(z_{it} - \bar{z}_i) = \sum_i\sum_t x_{it}'\ddot{z}_{it} = \sum_i\sum_t x_{it}'\ddot{z}_{it} - \sum_i \bar{x}_i'\sum_t(z_{it} - \bar{z}_i) = \sum_i\sum_t (x_{it} - \bar{x}_i)'\ddot{z}_{it},$$
where in the second and fourth steps we used the fact that the deviations from the mean over $t$ sum to zero for every $i$.
Therefore, (B.2) can be rewritten as:
$$\sum_i\sum_t \begin{bmatrix} (x_{it} - \eta\bar{x}_i)'\ddot{z}_{it} & (x_{it} - \eta\bar{x}_i)'\ddot{\mathfrak{Z}}_{it} \\ (w_{it} - \eta\bar{w}_i)'\ddot{z}_{it} & (w_{it} - \eta\bar{w}_i)'\ddot{\mathfrak{Z}}_{it} \end{bmatrix} = \sum_i\sum_t \begin{bmatrix} (x_{it} - \bar{x}_i)'\ddot{z}_{it} & (x_{it} - \bar{x}_i)'\ddot{\mathfrak{Z}}_{it} \\ (w_{it} - \bar{w}_i)'\ddot{z}_{it} & (w_{it} - \bar{w}_i)'\ddot{\mathfrak{Z}}_{it} \end{bmatrix} = \sum_i\sum_t \hat{x}_{it}'\hat{z}_{it}.$$
Similarly, $\sum_i\sum_t \hat{z}_{it}'\tilde{y}_{it} = \sum_i\sum_t \hat{z}_{it}'\hat{y}_{it}$. Therefore,
$$\hat{\Gamma}_{2SLS} = \Big[\Big(\sum_i\sum_t \hat{x}_{it}'\hat{z}_{it}\Big)\Big(\sum_i\sum_t \hat{z}_{it}'\hat{z}_{it}\Big)^{-1}\Big(\sum_i\sum_t \hat{z}_{it}'\hat{x}_{it}\Big)\Big]^{-1}\Big(\sum_i\sum_t \hat{x}_{it}'\hat{z}_{it}\Big)\Big(\sum_i\sum_t \hat{z}_{it}'\hat{z}_{it}\Big)^{-1}\Big(\sum_i\sum_t \hat{z}_{it}'\hat{y}_{it}\Big) = \hat{\Gamma}_{FE2SLS}.$$

Proof of Proposition 6

For notational simplicity and without loss of generality, I will omit $W_i y_t$ in the proof. This term can be treated as an additional endogenous variable included in $x_{2it}$ with its respective instruments $[w^2_{1it} \dots w^s_{1it}]$. Let $x_{it} = (x_{1it}\ x_{2it})$, where $x_{1it}$ is a $1\times k_1$ vector of exogenous variables and $x_{2it}$ is a $1\times k_2$ vector of endogenous covariates. Similarly, $X_t = (X_{1t}\ X_{2t})$, $z_{it} = (x_{1it}\ z_{2it})$, $\bar{z}_i = (\bar{x}_{1i}\ \bar{z}_{2i})$, $Z_t = (X_{1t}\ Z_{2t})$, $\bar{Z} = (\bar{X}_1\ \bar{Z}_2)$, $\mathfrak{Z}_{2it} = W_i Z_{2t}$, and $\bar{\mathfrak{Z}}_{2i} = W_i\bar{Z}_2$. Finally, denote $\hat{x}_{it} = (x_{1it}\ \hat{x}_{2it})$, $\hat{\bar{x}}_i = (\bar{x}_{1i}\ \hat{\bar{x}}_{2i})$, $\hat{\bar{X}} = (\bar{X}_1\ \hat{\bar{X}}_2)$, where the hats denote the linear projections of $x_2$ on $(x_1\ z_2)$ and their spatial lags.
In a spatial setting, $(\beta\ \gamma)_{FE2SLS}$ can be obtained by applying Pooled 2SLS to
$$y_{it} - \bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (x_{2it} - \bar{x}_{2i})\beta_2 + W_i(X_{1t} - \bar{X}_1)\gamma_1 + W_i(X_{2t} - \bar{X}_2)\gamma_2 + (u_{it} - \bar{u}_i)$$
using IVs $[(z_{2it} - \bar{z}_{2i})\ W_i(Z_{2t} - \bar{Z}_2)]$. We want to show that applying Pooled 2SLS to
$$y_{it} - \theta\bar{y}_i = (x_{1it} - \theta\bar{x}_{1i})\beta_1 + (x_{2it} - \theta\bar{x}_{2i})\beta_2 + W_i(X_{1t} - \theta\bar{X}_1)\gamma_1 + W_i(X_{2t} - \theta\bar{X}_2)\gamma_2 + (1-\theta)\bar{x}_{1i}\delta_1 + (1-\theta)\bar{x}_{2i}\delta_2 + (1-\theta)W_i\bar{X}_1\lambda_1 + (1-\theta)W_i\bar{X}_2\lambda_2 + u_{it}$$
using IVs $[(z_{2it} - \theta\bar{z}_{2i})\ W_i(Z_{2t} - \theta\bar{Z}_2)\ (1-\theta)\bar{z}_{2i}\ (1-\theta)W_i\bar{Z}_2]$ yields the same $(\beta\ \gamma)$. To prove the result, I will follow these steps:

1. Orthogonalize the instrumental variables and $[(x_{1it} - \theta\bar{x}_{1i})\ (w_{1it} - \theta\bar{w}_{1i})]$ with respect to $[(1-\theta)\bar{x}_{1i}\ (1-\theta)\bar{w}_{1i}]$.
2. Orthogonalize with respect to $[(1-\theta)\bar{z}_{2i}\ (1-\theta)\bar{\mathfrak{Z}}_{2i}]$ in the first-stage equation.
3. Show that we get the same predicted values using the orthogonalized variables and the original ones.
4. Use the Frisch–Waugh–Lovell (FWL) theorem to show the equivalence.

So the model is:
$$y_{it} - \theta\bar{y}_i = (x_{1it} - \theta\bar{x}_{1i})\beta_1 + (x_{2it} - \theta\bar{x}_{2i})\beta_2 + (w_{1it} - \theta\bar{w}_{1i})\gamma_1 + (w_{2it} - \theta\bar{w}_{2i})\gamma_2 + (1-\theta)\bar{x}_{1i}\delta_1 + (1-\theta)\bar{x}_{2i}\delta_2 + (1-\theta)\bar{w}_{1i}\lambda_1 + (1-\theta)\bar{w}_{2i}\lambda_2 + u_{it}$$
using IVs $[(z_{2it} - \theta\bar{z}_{2i})\ (\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i})\ (1-\theta)\bar{z}_{2i}\ (1-\theta)\bar{\mathfrak{Z}}_{2i}]$.

Step 1

a. Regress $z_{2it} - \theta\bar{z}_{2i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. The residuals will be $z_{2it} - \theta\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2 = l_{it}$. Applying the FWL theorem: regressing $(1-\theta)\bar{x}_{1i}$ on $(1-\theta)\bar{w}_{1i}$, the coefficient will be
$$\hat{\mu}_1 = \Big[\sum_{i=1}^{N}\sum_{t=1}^{T}(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i=1}^{N}\sum_{t=1}^{T}(1-\theta)^2\bar{w}_{1i}'\bar{x}_{1i}\Big] = \Big[\sum_{i=1}^{N}(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i=1}^{N}(1-\theta)^2\bar{w}_{1i}'\bar{x}_{1i}\Big].$$
The residuals will be $(1-\theta)\bar{x}_{1i} - (1-\theta)\bar{w}_{1i}\hat{\mu}_1 = s_i$. Now we regress $z_{2it} - \theta\bar{z}_{2i}$ on $(1-\theta)\bar{w}_{1i}$.
The coefficient will be:
$$\hat{\mu}_2 = \Big[\sum_{i}\sum_{t}(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i}\sum_{t}(1-\theta)\bar{w}_{1i}'(z_{2it} - \theta\bar{z}_{2i})\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i}(1-\theta)\bar{w}_{1i}'\{T(\bar{z}_{2i} - \theta\bar{z}_{2i})\}\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{w}_{1i}'\bar{w}_{1i}\Big]^{-1}\Big[\sum_{i}T(1-\theta)^2\bar{w}_{1i}'\bar{z}_{2i}\Big].$$
The residuals will be $z_{2it} - \theta\bar{z}_{2i} - (1-\theta)\bar{w}_{1i}\hat{\mu}_2 = g_{it}$. Finally, we run $g_{it}$ on $s_i$. The coefficient will be:
$$\hat{\eta}_1 = \Big[\sum_{i}\sum_{t}s_i's_i\Big]^{-1}\Big[\sum_{i}\sum_{t}s_i'g_{it}\Big] = \Big[\sum_{i}T s_i's_i\Big]^{-1}\Big[\sum_{i}T s_i'\bar{g}_i\Big] = \Big[\sum_{i}s_i's_i\Big]^{-1}\Big[\sum_{i}(1-\theta)s_i'(\bar{z}_{2i} - \bar{w}_{1i}\hat{\mu}_2)\Big].$$
Using similar steps, $\hat{\eta}_2$ will be:
$$\hat{\eta}_2 = \Big[\sum_{i}s_i^{*\prime}s_i^{*}\Big]^{-1}\Big[\sum_{i}(1-\theta)s_i^{*\prime}(\bar{z}_{2i} - \bar{x}_{1i}\hat{\mu}_2^{*})\Big],$$
where $\hat{\mu}_2^{*}$ is the coefficient from regressing $z_{2it} - \theta\bar{z}_{2i}$ on $(1-\theta)\bar{x}_{1i}$ and $s_i^{*}$ are the residuals from regressing $(1-\theta)\bar{w}_{1i}$ on $(1-\theta)\bar{x}_{1i}$.

b. Regress $(\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i})$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. The residuals will be $(\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i}) - (1-\theta)\bar{x}_{1i}\hat{\eta}_3 - (1-\theta)\bar{w}_{1i}\hat{\eta}_4 = m_{it}$.

c. Regress $(1-\theta)\bar{z}_{2i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. The residuals are $(1-\theta)\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_5 - (1-\theta)\bar{w}_{1i}\hat{\eta}_6 = v_i$, which depend only on the $i$ subscript. Applying the FWL theorem: regressing $(1-\theta)\bar{z}_{2i}$ on $(1-\theta)\bar{w}_{1i}$ yields $\hat{\mu}_2$, the same as in step 1a, and the residuals are a function of $i$ only, say $f_i = (1-\theta)(\bar{z}_{2i} - \bar{w}_{1i}\hat{\mu}_2)$. Finally, run $f_i$ on $s_i$; the coefficient will be:
$$\hat{\eta}_5 = \Big[\sum_{i}T s_i's_i\Big]^{-1}\Big[\sum_{i}T s_i'f_i\Big] = \Big[\sum_{i}s_i's_i\Big]^{-1}\Big[\sum_{i}s_i'(1-\theta)(\bar{z}_{2i} - \bar{w}_{1i}\hat{\mu}_2)\Big] = \hat{\eta}_1.$$
The same coefficient as above.
Following similar steps, it can be shown that
$$\hat{\eta}_6 = \hat{\eta}_2 = \Big[\sum_{i}s_i^{*\prime}s_i^{*}\Big]^{-1}\Big[\sum_{i}s_i^{*\prime}(1-\theta)(\bar{z}_{2i} - \bar{x}_{1i}\hat{\mu}_2^{*})\Big],$$
where $\hat{\mu}_2^{*}$ is defined in step 1a. Therefore,
$$v_i = (1-\theta)\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_5 - (1-\theta)\bar{w}_{1i}\hat{\eta}_6 = (1-\theta)\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2.$$

d. Regress $(1-\theta)\bar{\mathfrak{Z}}_{2i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. The residuals depend only on $i$; denote them by $r_i$. If $(1-\theta)\bar{\mathfrak{Z}}_{2i} = (1-\theta)\bar{x}_{1i}\hat{\eta}_7 + (1-\theta)\bar{w}_{1i}\hat{\eta}_8 + r_i$, it can be shown using arguments similar to the previous step that $\hat{\eta}_7 = \hat{\eta}_3$ and $\hat{\eta}_8 = \hat{\eta}_4$.

e. Regress $x_{1it} - \theta\bar{x}_{1i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$. We can apply the FWL theorem to get the coefficients:

i. First regress $x_{1it} - \theta\bar{x}_{1i}$ on $(1-\theta)\bar{x}_{1i}$. The coefficient is:
$$\Big[\sum_{i}\sum_{t}(1-\theta)^2\bar{x}_{1i}'\bar{x}_{1i}\Big]^{-1}\Big[\sum_{i}\sum_{t}(1-\theta)\bar{x}_{1i}'(x_{1it} - \theta\bar{x}_{1i})\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{x}_{1i}'\bar{x}_{1i}\Big]^{-1}\Big[\sum_{i}(1-\theta)\bar{x}_{1i}'\,T(\bar{x}_{1i} - \theta\bar{x}_{1i})\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{x}_{1i}'\bar{x}_{1i}\Big]^{-1}\Big[\sum_{i}T(1-\theta)^2\bar{x}_{1i}'\bar{x}_{1i}\Big] = I_{k_1},$$
where $I_{k_1}$ denotes an identity matrix of size $k_1$. Therefore, the residuals will be $x_{1it} - \bar{x}_{1i}$.

ii. Now regress $(1-\theta)\bar{w}_{1i}$ on $(1-\theta)\bar{x}_{1i}$. The coefficients and residuals depend only on $i$; denote the latter by $d_i$.

iii. Finally regress $x_{1it} - \bar{x}_{1i}$ on $d_i$. The coefficient will be:
$$\Big[\sum_{i}\sum_{t}d_i'd_i\Big]^{-1}\Big[\sum_{i}\sum_{t}d_i'(x_{1it} - \bar{x}_{1i})\Big] = \Big[\sum_{i}\sum_{t}d_i'd_i\Big]^{-1}\Big[\sum_{i}d_i'\sum_{t}(x_{1it} - \bar{x}_{1i})\Big] = 0_{k_1},$$
where we used the fact that $\sum_t (x_{1it} - \bar{x}_{1i}) = 0$. Therefore $x_{1it} - \theta\bar{x}_{1i} = (1-\theta)\bar{x}_{1i}I_{k_1} + (1-\theta)\bar{w}_{1i}0_{k_1} + (x_{1it} - \bar{x}_{1i})$ and the residuals will be $x_{1it} - \bar{x}_{1i}$.

f. Regress $w_{1it} - \theta\bar{w}_{1i}$ on $(1-\theta)\bar{x}_{1i}$ and $(1-\theta)\bar{w}_{1i}$.
Applying the FWL theorem in a similar way to the previous step, we get the relationship $w_{1it} - \theta\bar{w}_{1i} = (1-\theta)\bar{x}_{1i}0_{k_1} + (1-\theta)\bar{w}_{1i}I_{k_1} + (w_{1it} - \bar{w}_{1i})$, and the residuals will be $w_{1it} - \bar{w}_{1i}$.

Therefore, after orthogonalizing, we can apply Pooled 2SLS to:
$$y_{it} - \theta\bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (x_{2it} - \theta\bar{x}_{2i})\beta_2 + (w_{1it} - \bar{w}_{1i})\gamma_1 + (w_{2it} - \theta\bar{w}_{2i})\gamma_2 + (1-\theta)\bar{x}_{2i}\delta_2 + (1-\theta)\bar{w}_{2i}\lambda_2 + u_{it}$$
using IVs $[l_{it}\ m_{it}\ v_i\ r_i]$.

Step 2

In this step we orthogonalize with respect to $v_i$ and $r_i$ in the first-stage equation. Note that these are the residuals from the previous step associated with $(1-\theta)\bar{z}_{2i}$ and $(1-\theta)\bar{\mathfrak{Z}}_{2i}$ respectively, the instrumental variables.

a. $l_{it} = v_i\zeta_1 + r_i\zeta_2 + \varepsilon_1$.

i. Regress $l_{it}$ on $v_i$. The coefficient will be:
$$\tilde{\eta}_1 = \Big[\sum_{i}\sum_{t}v_i'v_i\Big]^{-1}\Big[\sum_{i}\sum_{t}v_i'l_{it}\Big] = \Big[\sum_{i}T v_i'v_i\Big]^{-1}\Big[\sum_{i}T v_i'\bar{l}_i\Big].$$
Note that $l_{it} = z_{2it} - \theta\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2$, therefore
$$\bar{l}_i = \frac{1}{T}\sum_{t}\big[z_{2it} - \theta\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2\big] = (1-\theta)\bar{z}_{2i} - (1-\theta)\bar{x}_{1i}\hat{\eta}_1 - (1-\theta)\bar{w}_{1i}\hat{\eta}_2 = (1-\theta)(\bar{z}_{2i} - \bar{x}_{1i}\hat{\eta}_1 - \bar{w}_{1i}\hat{\eta}_2) = v_i.$$
Therefore $\tilde{\eta}_1 = I_{L_2}$, since $\hat{\eta}_1 = \hat{\eta}_5$ and $\hat{\eta}_2 = \hat{\eta}_6$. The residuals are $z_{2it} - \bar{z}_{2i}$.

ii. Regress $r_i$ on $v_i$. In this case, both the coefficient and the residuals depend only on $i$; call the latter $h_i$.

iii. Regress $z_{2it} - \bar{z}_{2i}$ on $h_i$. The coefficient is:
$$\Big[\sum_{i}\sum_{t}h_i'h_i\Big]^{-1}\Big[\sum_{i}h_i'\sum_{t}(z_{2it} - \bar{z}_{2i})\Big] = 0_{L_2},$$
because the deviations from the mean sum to zero. Therefore $l_{it} = v_i I_{L_2} + r_i 0_{L_2} + \varepsilon_1$ and the residuals will be $z_{2it} - \bar{z}_{2i}$.

b. $m_{it} = v_i\pi_1 + r_i\pi_2 + \varepsilon_2$.

i. Regress $m_{it}$ on $r_i$. The coefficient will be, after some algebra, $\tilde{\pi}_2 = \big[\sum_i r_i'r_i\big]^{-1}\big[\sum_i r_i'\bar{m}_i\big]$.
Noting that
$$\bar{m}_i = \frac{1}{T}\sum_{t}\big[(\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i}) - (1-\theta)\bar{x}_{1i}\hat{\eta}_3 - (1-\theta)\bar{w}_{1i}\hat{\eta}_4\big] = (1-\theta)(\bar{\mathfrak{Z}}_{2i} - \bar{x}_{1i}\hat{\eta}_3 - \bar{w}_{1i}\hat{\eta}_4) = (1-\theta)(\bar{\mathfrak{Z}}_{2i} - \bar{x}_{1i}\hat{\eta}_7 - \bar{w}_{1i}\hat{\eta}_8) = r_i,$$
we conclude that $\tilde{\pi}_2 = I_{L_2}$ and the residuals are $\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i}$.

ii. Regress $v_i$ on $r_i$. The coefficient will be denoted by $\tilde{\pi}_1 = \big[\sum_i r_i'r_i\big]^{-1}\big[\sum_i r_i'v_i\big]$, and the residuals will depend on $i$; call them $\tilde{h}_i$.

iii. Regress $\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i}$ on $\tilde{h}_i$. Using again the fact that $\sum_t (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i}) = 0$, we conclude that $\pi_1 = 0_{L_2}$, which implies that $\tilde{\pi}_2 = \pi_2 = I_{L_2}$, and therefore the residuals will be $\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i}$.

In the original first stage we have:
$$x_{2it} - \theta\bar{x}_{2i} = (x_{1it} - \theta\bar{x}_{1i})\phi_1 + (w_{1it} - \theta\bar{w}_{1i})\phi_2 + (z_{2it} - \theta\bar{z}_{2i})\phi_3 + (\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i})\phi_4 + (1-\theta)\bar{x}_{1i}\rho_1 + (1-\theta)\bar{w}_{1i}\rho_2 + (1-\theta)\bar{z}_{2i}\rho_3 + (1-\theta)\bar{\mathfrak{Z}}_{2i}\rho_4 + \varepsilon^{FS}.$$
After orthogonalizing with respect to $[(1-\theta)\bar{x}_{1i}\ (1-\theta)\bar{w}_{1i}\ (1-\theta)\bar{z}_{2i}\ (1-\theta)\bar{\mathfrak{Z}}_{2i}]$, to get $\Phi = (\phi_1\ \phi_2\ \phi_3\ \phi_4)$ we have to regress $x_{2it} - \theta\bar{x}_{2i}$ on $[(x_{1it} - \bar{x}_{1i})\ (w_{1it} - \bar{w}_{1i})\ (z_{2it} - \bar{z}_{2i})\ (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i})]$. We note that if $z_{it} = [x_{1it}\ w_{1it}\ z_{2it}\ \mathfrak{Z}_{2it}]$, then the coefficient of $x_{2it} - \theta\bar{x}_{2i}$ on $z_{it} - \bar{z}_i$ is
$$\check{\Phi} = \Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(z_{it} - \bar{z}_i)\Big]^{-1}\Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(x_{2it} - \theta\bar{x}_{2i})\Big] = \Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(z_{it} - \bar{z}_i)\Big]^{-1}\Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'x_{2it} - \Big\{\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'\theta\bar{x}_{2i}\Big\}\Big] = \Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(z_{it} - \bar{z}_i)\Big]^{-1}\Big[\sum_{i}\sum_{t}(z_{it} - \bar{z}_i)'(x_{2it} - \bar{x}_{2i})\Big],$$
where we used the fact that the terms in curly brackets are zero. Therefore, $\Phi$ can also be obtained by regressing $x_{2it} - \bar{x}_{2i}$ on $[(x_{1it} - \bar{x}_{1i})\ (w_{1it} - \bar{w}_{1i})\ (z_{2it} - \bar{z}_{2i})\ (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i})]$.
Step 3

In this step we show that $\widehat{x_{2it} - \theta\bar{x}_{2i}} = \widetilde{x_{2it} - \theta\bar{x}_{2i}}$, where
$$\widehat{x_{2it} - \theta\bar{x}_{2i}} = (x_{1it} - \theta\bar{x}_{1i})\hat{\phi}_1 + (w_{1it} - \theta\bar{w}_{1i})\hat{\phi}_2 + (z_{2it} - \theta\bar{z}_{2i})\hat{\phi}_3 + (\mathfrak{Z}_{2it} - \theta\bar{\mathfrak{Z}}_{2i})\hat{\phi}_4 + (1-\theta)\bar{x}_{1i}\hat{\rho}_1 + (1-\theta)\bar{w}_{1i}\hat{\rho}_2 + (1-\theta)\bar{z}_{2i}\hat{\rho}_3 + (1-\theta)\bar{\mathfrak{Z}}_{2i}\hat{\rho}_4,$$
$$\widetilde{x_{2it} - \theta\bar{x}_{2i}} = (x_{1it} - \bar{x}_{1i})\tilde{\phi}_1 + (w_{1it} - \bar{w}_{1i})\tilde{\phi}_2 + (z_{2it} - \bar{z}_{2i})\tilde{\phi}_3 + (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i})\tilde{\phi}_4 + (1-\theta)\bar{x}_{1i}\tilde{\rho}_1 + (1-\theta)\bar{w}_{1i}\tilde{\rho}_2 + (1-\theta)\bar{z}_{2i}\tilde{\rho}_3 + (1-\theta)\bar{\mathfrak{Z}}_{2i}\tilde{\rho}_4.$$
First we note that $\hat{\phi}_j = \tilde{\phi}_j$ for $j = 1,2,3,4$, because in the second equation the respective explanatory variables are orthogonalized with respect to the terms involving the time averages of the independent variables. Given this fact, and after some algebra, we have that $\widehat{x_{2it} - \theta\bar{x}_{2i}} = \widetilde{x_{2it} - \theta\bar{x}_{2i}}$ if $\hat{\phi}_j + \hat{\rho}_j = \tilde{\rho}_j$ for $j = 1,2,3,4$.

To show that the previous equality holds, we start with $\widetilde{x_{2it} - \theta\bar{x}_{2i}}$. Since $z_{it} = [x_{1it}\ w_{1it}\ z_{2it}\ \mathfrak{Z}_{2it}]$ as above, we have $z_{it} - \bar{z}_i = [(x_{1it} - \bar{x}_{1i})\ (w_{1it} - \bar{w}_{1i})\ (z_{2it} - \bar{z}_{2i})\ (\mathfrak{Z}_{2it} - \bar{\mathfrak{Z}}_{2i})]$, $\tilde{\rho} = (\tilde{\rho}_1'\ \tilde{\rho}_2'\ \tilde{\rho}_3'\ \tilde{\rho}_4')'$, and $\tilde{\phi} = (\tilde{\phi}_1'\ \tilde{\phi}_2'\ \tilde{\phi}_3'\ \tilde{\phi}_4')'$; therefore, $\widetilde{x_{2it} - \theta\bar{x}_{2i}} = (z_{it} - \bar{z}_i)\hat{\phi} + (1-\theta)\bar{z}_i\tilde{\rho}$. Greene (2007) shows that, given $\hat{\phi}$, one can get $\tilde{\rho}$ as:
$$\tilde{\rho} = \Big[\sum_{i}\sum_{t}(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i}\sum_{t}(1-\theta)\bar{z}_i'\{x_{2it} - \theta\bar{x}_{2i} - (z_{it} - \bar{z}_i)\hat{\phi}\}\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i}(1-\theta)\bar{z}_i'\Big(\sum_{t}(x_{2it} - \theta\bar{x}_{2i}) - \Big\{\sum_{t}(z_{it} - \bar{z}_i)\Big\}\hat{\phi}\Big)\Big] = \Big[\sum_{i}T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_{i}T(1-\theta)^2\bar{z}_i'\bar{x}_{2i}\Big],$$
where we used the fact that $\sum_t (z_{it} - \bar{z}_i) = 0$ in the second line. We turn now to $\widehat{x_{2it} - \theta\bar{x}_{2i}}$.
With similar definitions as above, given $\hat{\phi}$, we get $\hat{\rho}$ as:
\begin{align*}
\hat{\rho} &= \Big[\sum_i\sum_t (1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i\sum_t (1-\theta)\bar{z}_i'\big\{(x_{2it} - \theta\bar{x}_{2i}) - (z_{it}-\theta\bar{z}_i)\hat{\phi}\big\}\Big] \\
&= \Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i (1-\theta)\bar{z}_i'\Big(\sum_t (x_{2it} - \theta\bar{x}_{2i}) - \sum_t (z_{it}-\theta\bar{z}_i)\hat{\phi}\Big)\Big] \\
&= \Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{x}_{2i}\Big] - \Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i (1-\theta)\bar{z}_i'(T\bar{z}_i - T\theta\bar{z}_i)\hat{\phi}\Big] \\
&= \tilde{\rho} - \Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]^{-1}\Big[\sum_i T(1-\theta)^2\bar{z}_i'\bar{z}_i\Big]\hat{\phi} \\
&= \tilde{\rho} - \mathbb{I}_{2(k_1+l)+1}\hat{\phi} = \tilde{\rho} - \hat{\phi}
\end{align*}
Therefore, $\tilde{\rho} = \hat{\rho} + \hat{\phi}$ and hence $\widehat{x_{2it} - \theta\bar{x}_{2i}} = \widetilde{x_{2it} - \theta\bar{x}_{2i}}$. In a similar way, and using obvious notation, it can be shown that $\widehat{w_{2it} - \theta\bar{w}_{2i}} = \widetilde{w_{2it} - \theta\bar{w}_{2i}}$.

Step 4

Given the previous step, the problem becomes:
\[
y_{it} - \theta\bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (x_{2it} - \theta\bar{x}_{2i})\beta_2 + (w_{1it} - \bar{w}_{1i})\gamma_1 + (w_{2it} - \theta\bar{w}_{2i})\gamma_2 + (1-\theta)\bar{x}_{2i}\delta_2 + (1-\theta)\bar{w}_{2i}\lambda_2 + u_{it}
\]
using the IVs $[(z_{2it} - \bar{z}_{2i})\;\; (\mathfrak{z}_{2it} - \bar{\mathfrak{z}}_{2i})\;\; (1-\theta)\bar{z}_{2i}\;\; (1-\theta)\bar{\mathfrak{z}}_{2i}]$. At this point, however, it is important to note that although we have orthogonalized with respect to $(1-\theta)[\bar{x}_{1i}\;\bar{w}_{1i}]$, we still have to include these terms in the first-stage equation to obtain the predicted values of the endogenous variables. Given this, the second-stage equation is:
\[
y_{it} - \theta\bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (\hat{x}_{2it} - \theta\hat{\bar{x}}_{2i})\beta_2 + (w_{1it} - \bar{w}_{1i})\gamma_1 + (\hat{w}_{2it} - \theta\hat{\bar{w}}_{2i})\gamma_2 + (1-\theta)\hat{\bar{x}}_{2i}\delta_2 + (1-\theta)\hat{\bar{w}}_{2i}\lambda_2,
\]
where the hats denote the first-stage projections on the instrumental variables. To obtain $(\beta\;\gamma)$, we orthogonalize with respect to $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$.

a. $(x_{1it} - \bar{x}_{1i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$.

i. $(x_{1it} - \bar{x}_{1i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$.
The coefficient will be:
\[
\Big[\sum_i\sum_t (1-\theta)^2\hat{\bar{x}}_{2i}'\hat{\bar{x}}_{2i}\Big]^{-1}\Big[\sum_i (1-\theta)\hat{\bar{x}}_{2i}'\sum_t (x_{1it} - \bar{x}_{1i})\Big] = 0_{k_2},
\]
where we used that the sums of deviations from the mean are zero for all $i$; the residuals will be $x_{1it} - \bar{x}_{1i}$.

ii. $(1-\theta)\hat{\bar{w}}_{2i}$ on $(1-\theta)\hat{\bar{x}}_{2i}$. In this case the coefficients and the residuals will depend only on $i$; call the residuals $\tilde{u}_i$.

iii. $x_{1it} - \bar{x}_{1i}$ on $\tilde{u}_i$. By an argument similar to point i just above, the coefficient is $0_{k_2}$, and so $x_{1it} - \bar{x}_{1i}$ is orthogonal to both variables.

b. $(w_{1it} - \bar{w}_{1i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$. Using a similar argument as in a. above, $w_{1it} - \bar{w}_{1i}$ is orthogonal to both variables.

c. $(\hat{x}_{2it} - \theta\hat{\bar{x}}_{2i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$.

i. $(1-\theta)\hat{\bar{w}}_{2i}$ on $(1-\theta)\hat{\bar{x}}_{2i}$. The coefficient and residuals depend only on $i$; call the residuals $\check{u}_i$.

ii. $(\hat{x}_{2it} - \theta\hat{\bar{x}}_{2i})$ on $(1-\theta)\hat{\bar{x}}_{2i}$. By arguments very similar to previous steps, one can show that the coefficient is $\mathbb{I}_{k_2}$ and the residuals will be $(\hat{x}_{2it} - \hat{\bar{x}}_{2i})$.

iii. $(\hat{x}_{2it} - \hat{\bar{x}}_{2i})$ on $\check{u}_i$. By analogous arguments as above, the coefficient of this regression will be $0_{k_2}$. Therefore, the residuals of this regression will be $(\hat{x}_{2it} - \hat{\bar{x}}_{2i})$.

d. $\hat{w}_{2it} - \theta\hat{\bar{w}}_{2i}$ on $(1-\theta)\hat{\bar{x}}_{2i}$ and $(1-\theta)\hat{\bar{w}}_{2i}$. Using similar ideas as in c. above, the residuals of this regression are $\hat{w}_{2it} - \hat{\bar{w}}_{2i}$.
Therefore, to find $(\beta_1\;\beta_2\;\gamma_1\;\gamma_2)$, we run
\[
y_{it} - \theta\bar{y}_i = (x_{1it} - \bar{x}_{1i})\beta_1 + (\hat{x}_{2it} - \hat{\bar{x}}_{2i})\beta_2 + (w_{1it} - \bar{w}_{1i})\gamma_1 + (\hat{w}_{2it} - \hat{\bar{w}}_{2i})\gamma_2.
\]
If we collect all the covariates of this regression into a vector $\hat{x}_{it} - \hat{\bar{x}}_i$ (where $x_{1it}$ and $w_{1it}$ are their own projections), then:
\begin{align*}
(\beta\;\gamma) &= \Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(\hat{x}_{it}-\hat{\bar{x}}_i)\Big]^{-1}\Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(y_{it}-\theta\bar{y}_i)\Big] \\
&= \Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(\hat{x}_{it}-\hat{\bar{x}}_i)\Big]^{-1}\Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'y_{it} - \Big\{\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'\theta\bar{y}_i\Big\}\Big] \\
&= \Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(\hat{x}_{it}-\hat{\bar{x}}_i)\Big]^{-1}\Big[\sum_i\sum_t (\hat{x}_{it}-\hat{\bar{x}}_i)'(y_{it}-\bar{y}_i)\Big],
\end{align*}
where we use again the fact that the term in curly brackets in the second line is zero. Therefore, $(\beta\;\gamma)$ can be obtained by regressing $y_{it} - \bar{y}_i$ on $\big[(x_{1it}-\bar{x}_{1i})\;\; \widehat{(x_{2it}-\bar{x}_{2i})}\;\; (w_{1it}-\bar{w}_{1i})\;\; \widehat{(w_{2it}-\bar{w}_{2i})}\big]$, which is exactly the same problem that the Fixed Effects 2SLS estimator solves.

APPENDIX C

DERIVATION OF THE COVARIANCE MATRIX FOR THE CONTROL FUNCTION APPROACH

Consider the estimating equation in (1.46):
\[
\ddot{y}_{it} = \hat{\ddot{a}}_{it}\theta + \ddot{e}_{it},
\]
where we can write $\ddot{a}_{it} = \ddot{z}_{it}\psi + \ddot{v}_{it}$.
Because every element in $\hat{\ddot{a}}_{it}$ is exogenous with respect to the error term $\ddot{e}_{it}$, we can write:
\begin{align*}
\hat{\theta} &= \Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\ddot{y}_{it}\Big] \\
&= \Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'(\ddot{a}_{it}\theta + \ddot{e}_{it})\Big] \\
&= \Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'(\ddot{a}_{it}\theta + \hat{\ddot{a}}_{it}\theta - \hat{\ddot{a}}_{it}\theta + \ddot{e}_{it})\Big] \\
\Longrightarrow \sqrt{NT}(\hat{\theta} - \theta) &= \Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\Big(\underbrace{(\ddot{a}_{it} - \hat{\ddot{a}}_{it})\theta}_{\text{Part 2}} + \underbrace{\ddot{e}_{it}}_{\text{Part 1}}\Big)\Big]
\end{align*}
Note that, because $\hat{\psi} \xrightarrow{p} \psi$, the first matrix on the right-hand side will converge in probability to $\mathrm{E}\big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{a}_{it}'\ddot{a}_{it}\big) = B$.

Consider now Part 1:
\begin{align*}
(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\ddot{e}_{it} &= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{e}_{it} \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi} + \ddot{z}_{it}\psi - \ddot{z}_{it}\psi)'\ddot{e}_{it} \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{e}_{it} + \frac{1}{NT}\sum_i^N\sum_t^T \big[\ddot{z}_{it}\sqrt{NT}(\hat{\psi}-\psi)\big]'\ddot{e}_{it} \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{e}_{it} + \underbrace{\sqrt{NT}(\hat{\psi}-\psi)'}_{O_p(1)}\underbrace{\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{e}_{it}}_{o_p(1)}
\end{align*}
Because $O_p(1)\cdot o_p(1) = o_p(1)$, Part 1 converges to $\big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{e}_{it}\big]$.
Now consider Part 2:
\begin{align*}
(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'(\ddot{a}_{it} - \hat{\ddot{a}}_{it})\theta &= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\big[\ddot{a}_{it} - \ddot{z}_{it}\hat{\psi}\big]\theta \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\big[\ddot{a}_{it} - \ddot{z}_{it}\psi + \ddot{z}_{it}\psi - \ddot{z}_{it}\hat{\psi}\big]\theta \\
&= \underbrace{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{v}_{it}\theta}_{\text{Part 2.1}} - \underbrace{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{z}_{it}(\hat{\psi}-\psi)\theta}_{\text{Part 2.2}}
\end{align*}
Starting with Part 2.1:
\begin{align*}
(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{v}_{it}\theta &= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi + \ddot{z}_{it}\hat{\psi} - \ddot{z}_{it}\psi)'\ddot{v}_{it}\theta \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{v}_{it}\theta + (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \big[\ddot{z}_{it}(\hat{\psi}-\psi)\big]'\ddot{v}_{it}\theta \\
&= (NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{v}_{it}\theta + \underbrace{\sqrt{NT}(\hat{\psi}-\psi)'}_{O_p(1)}\underbrace{\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\theta}_{\xrightarrow{p}\;\mathrm{E}(\ddot{z}_{it}'\ddot{v}_{it})\theta\,=\,0}
\end{align*}
So in the last line we have $O_p(1)\cdot o_p(1) = o_p(1)$, and therefore the last term will vanish as $N \to \infty$; only $(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{v}_{it}\theta$ will remain.
Using similar algebra, it can be shown that Part 2.2 will converge to
\[
-\Big[\frac{1}{NT}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{z}_{it}\Big]\sqrt{NT}(\hat{\psi}-\psi)\theta.
\]
Note that
\[
\hat{\psi} - \psi = \Big(\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\Big)
\;\Longrightarrow\;
\sqrt{NT}(\hat{\psi}-\psi) = \Big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\Big].
\]
Putting everything together we have
\begin{align*}
&\Big[\frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}\Big]^{-1}\Big\{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\big[(\ddot{a}_{it}-\hat{\ddot{a}}_{it})\theta + \ddot{e}_{it}\big]\Big\} \\
&\quad = B^{-1}\Big\{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\big[\ddot{e}_{it} + \ddot{v}_{it}\theta - \ddot{z}_{it}(\hat{\psi}-\psi)\theta\big]\Big\} + o_p(1) \\
&\quad = B^{-1}\Big\{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \big\{(\ddot{z}_{it}\psi)'(\ddot{e}_{it} + \ddot{v}_{it}\theta)\big\} - \Big[\frac{1}{NT}\sum_i^N\sum_t^T \big\{(\ddot{z}_{it}\psi)'\ddot{z}_{it}\big\}\Big]\sqrt{NT}(\hat{\psi}-\psi)\theta\Big\} + o_p(1),
\end{align*}
where $\sqrt{NT}(\hat{\psi}-\psi) = \big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\big)^{-1}\big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\big]$. Let
\[
G = \mathrm{E}\Big[\sum_i^N\sum_t^T (\ddot{z}_{it}\psi)'\ddot{z}_{it}\Big]
\quad\text{and}\quad
r_{it}(\psi) = \Big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{v}_{it}\Big].
\]
Then we can write
\[
\sqrt{NT}(\hat{\theta}-\theta) = B^{-1}\Big\{(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \big[(\ddot{z}_{it}\psi)'(\ddot{e}_{it}+\ddot{v}_{it}\theta) - G\cdot r_{it}(\psi)\theta\big] + o_p(1)\Big\},
\]
and therefore, by the Central Limit Theorem,
\[
\sqrt{NT}(\hat{\theta}-\theta) \overset{a}{\sim} N\big\{0,\; B^{-1}MB^{-1}\big\},
\]
where $M = \mathrm{Var}\big[\sum_i^N\sum_t^T \big((\ddot{z}_{it}\psi)'(\ddot{e}_{it}+\ddot{v}_{it}\theta) - G\cdot r_{it}(\psi)\theta\big)\big] = \mathrm{Var}\big[\sum_i^N\sum_t^T m_{it}\big]$.
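The structure of the corrected scores and the resulting sandwich variance can be sketched numerically. The sketch below is deliberately simplified and is not the chapter's implementation: it treats each row as a single cross-sectional observation, takes one endogenous regressor so that the score is scalar-valued, passes the first-stage coefficient vector `psi` in directly, and uses a Bartlett kernel in Euclidean distance as a stand-in for the kernel weighting; the variable names and bandwidth are illustrative.

```python
import numpy as np

def cf_variance(z, e, v, psi, theta, coords, rho_b):
    """Sandwich variance sketch for a control-function estimator with one
    endogenous regressor: scores m = (z psi)(e + v theta) - (G'r) theta,
    spatially weighted by a Bartlett kernel K(d / rho_b) = max(1 - d/rho_b, 0)."""
    n = z.shape[0]
    zp = z @ psi                                # fitted endogenous regressor
    G = z.T @ zp / n                            # one entry per instrument
    A = np.linalg.inv(z.T @ z / n)
    r = (z * v[:, None]) @ A                    # per-observation first-stage pieces
    m = zp * (e + v * theta) - (r @ G) * theta  # corrected scores, shape (n,)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    K = np.clip(1.0 - d / rho_b, 0.0, None)     # kernel weights, K(0) = 1
    M = m @ K @ m / n                           # kernel-weighted middle term
    B = np.mean(zp ** 2)
    return M / (n * B ** 2)                     # B^{-1} M B^{-1} / n

rng = np.random.default_rng(1)
n, kz = 60, 2
z = rng.normal(size=(n, kz))
e, v = rng.normal(size=n), rng.normal(size=n)
psi = np.array([0.8, -0.4])
coords = rng.uniform(0.0, 100.0, size=(n, 2))

# with theta = 0 the first-stage correction drops out, and a tiny bandwidth
# keeps only the i = j kernel terms, giving a heteroskedasticity-only form
out = cf_variance(z, e, v, psi, 0.0, coords, rho_b=1e-9)
m0 = (z @ psi) * e
assert np.isclose(out, (m0 @ m0 / n) / (n * np.mean((z @ psi) ** 2) ** 2))
```

The two limiting checks in the comments are the useful ones: setting $\theta = 0$ kills the first-stage correction term, and shrinking the bandwidth collapses the kernel sum to its own-observation terms.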
$B$ can be estimated with
\[
\hat{B} = \frac{1}{NT}\sum_i^N\sum_t^T \hat{\ddot{a}}_{it}'\hat{\ddot{a}}_{it}.
\]
To estimate $M$, let
\[
\hat{m}_{it} = (\ddot{z}_{it}\hat{\psi})'(\hat{\ddot{e}}_{it} + \hat{\ddot{v}}_{it}\hat{\theta}) - \hat{G}\cdot\hat{r}_{it}(\hat{\psi})\hat{\theta},
\]
where $\hat{\ddot{e}}_{it}$ are the residuals from the second stage, $\hat{\ddot{v}}_{it}$ are the residuals from the first stage (note that $v_{it}$ is a vector), $\hat{G} = \frac{1}{NT}\sum_i^N\sum_t^T (\ddot{z}_{it}\hat{\psi})'\ddot{z}_{it}$, and $\hat{r}(\hat{\psi}) = \big(\frac{1}{NT}\sum_i^N\sum_t^T \ddot{z}_{it}'\ddot{z}_{it}\big)^{-1}\big[(NT)^{-\frac{1}{2}}\sum_i^N\sum_t^T \ddot{z}_{it}'\hat{\ddot{v}}_{it}\big]$. With these quantities defined, the $(r,s)$-th element of $M$ can be estimated as
\[
\hat{M}_{rs} = \frac{1}{NT}\sum_{i=1}^N\sum_{j=1}^N\sum_{t=1}^T\sum_{l=1}^T \hat{m}_{it,r}\,\hat{m}_{jl,s}\,K\Big[\frac{\rho^*(i,j)}{\rho_b}\Big],
\]
where once again the kernel function $K(\cdot)$ operationalizes the weak spatial dependence assumption.

APPENDIX D

TABLES FOR CHAPTER 1

Additional Simulation Results

Table D.1 Average of the estimated variance of $\beta_1$ over the 1,000 replications using a rook-type weighting matrix, N = 400, T = 5.

 ρ    ψ      CF     CF_no1  True value (CF)   HACSC   SHAC   Cluster  Non-Robust  True value (2SLS)
0.0  0.0    0.041   0.037   0.0386            0.082   0.068   0.086    0.069       0.0866
0.0  0.3    0.037   0.034   0.0352            0.076   0.062   0.078    0.063       0.0726
0.0  0.7    0.035   0.032   0.0323            0.071   0.058   0.073    0.059       0.0706
0.3  0.0    0.043   0.039   0.0428            0.087   0.071   0.089    0.072       0.0906
0.3  0.3    0.040   0.036   0.0364            0.079   0.065   0.082    0.066       0.079
0.3  0.7    0.037   0.033   0.0359            0.074   0.061   0.076    0.062       0.0829
0.7  0.0    0.062   0.055   0.0558            0.111   0.091   0.115    0.092       0.1092
0.7  0.3    0.057   0.051   0.0535            0.103   0.085   0.106    0.085       0.1118
0.7  0.7    0.054   0.048   0.0488            0.096   0.079   0.099    0.080       0.0954

∗True value computed as the variance of $\beta_1$ across the 1,000 replications. All the numbers were multiplied by 100 for readability.

Table D.2 Average of the estimated variance of $\beta_2$ over the 1,000 replications using a rook-type weighting matrix, N = 400, T = 5.
 ρ    ψ      CF     CF_no1  True value (CF)   HACSC   SHAC   Cluster  Non-Robust  True value (2SLS)
0.0  0.0    0.069   0.066   0.0724            0.080   0.066   0.083    0.067       0.0803
0.0  0.3    0.063   0.060   0.0644            0.074   0.061   0.076    0.061       0.0738
0.0  0.7    0.060   0.057   0.0623            0.069   0.056   0.071    0.057       0.0704
0.3  0.0    0.074   0.071   0.0756            0.086   0.070   0.089    0.071       0.0894
0.3  0.3    0.068   0.065   0.0761            0.078   0.065   0.081    0.065       0.0862
0.3  0.7    0.063   0.060   0.0646            0.073   0.061   0.076    0.061       0.0793
0.7  0.0    0.119   0.114   0.1237            0.131   0.108   0.136    0.109       0.1376
0.7  0.3    0.108   0.104   0.1125            0.120   0.099   0.125    0.100       0.1297
0.7  0.7    0.101   0.097   0.1004            0.113   0.093   0.116    0.094       0.113

∗True value computed as the variance of $\beta_2$ across the 1,000 replications. All the numbers were multiplied by 100 for readability.

Table D.3 Average of the estimated variance of $\beta_3$ over the 1,000 replications using a rook-type weighting matrix, N = 400, T = 5.

 ρ    ψ      CF     CF_no1  True value (CF)   HACSC   SHAC   Cluster  Non-Robust  True value (2SLS)
0.0  0.0    0.275   0.242   0.24              0.742   0.615   0.772    0.620       0.79
0.0  0.3    0.252   0.220   0.22              0.680   0.558   0.700    0.563       0.64
0.0  0.7    0.239   0.206   0.21              0.631   0.522   0.654    0.526       0.63
0.3  0.0    0.291   0.252   0.27              0.769   0.636   0.796    0.640       0.81
0.3  0.3    0.271   0.232   0.24              0.705   0.581   0.730    0.587       0.71
0.3  0.7    0.254   0.215   0.23              0.660   0.545   0.681    0.549       0.72
0.7  0.0    0.377   0.314   0.33              0.917   0.758   0.949    0.763       0.90
0.7  0.3    0.350   0.290   0.29              0.860   0.709   0.884    0.712       0.91
0.7  0.7    0.329   0.270   0.29              0.794   0.657   0.824    0.662       0.79

∗True value computed as the variance of $\beta_3$ across the 1,000 replications. All the numbers were multiplied by 100 for readability. CF_no1 refers to the HACSC estimator ignoring the first stage estimation using a CF approach.

Table D.4 Rejection probabilities for the null hypothesis $H_0: \beta_1 = 0.7$ at the 5% significance level using a t-test over the 1,000 replications with a rook-type weighting matrix, N = 400, T = 5.
 ρ    ψ      CF     CF_no1   HACSC    SHAC   Cluster  Non-Robust
0.0  0.0    0.050   0.062    0.068   0.088   0.055    0.081
0.0  0.3    0.047   0.057    0.056   0.072   0.052    0.076
0.0  0.7    0.043   0.054    0.053   0.068   0.046    0.068
0.3  0.0    0.054   0.066    0.068   0.089   0.058    0.088
0.3  0.3    0.048   0.062    0.057   0.073   0.042    0.072
0.3  0.7    0.045   0.067    0.059   0.083   0.060    0.083
0.7  0.0    0.049   0.067    0.065   0.079   0.057    0.079
0.7  0.3    0.047   0.062    0.071   0.093   0.064    0.095
0.7  0.7    0.051   0.064    0.050   0.070   0.040    0.070

Table D.5 Rejection probabilities for the null hypothesis $H_0: \beta_2 = 0.6$ at the 5% significance level using a t-test over the 1,000 replications with a rook-type weighting matrix, N = 400, T = 5.

 ρ    ψ      CF     CF_no1   HACSC    SHAC   Cluster  Non-Robust
0.0  0.0    0.076   0.067    0.069   0.096   0.061    0.093
0.0  0.3    0.078   0.064    0.077   0.101   0.066    0.102
0.0  0.7    0.079   0.072    0.079   0.104   0.074    0.105
0.3  0.0    0.082   0.072    0.089   0.108   0.076    0.105
0.3  0.3    0.082   0.075    0.079   0.106   0.076    0.104
0.3  0.7    0.081   0.069    0.089   0.119   0.078    0.108
0.7  0.0    0.071   0.068    0.075   0.099   0.073    0.099
0.7  0.3    0.077   0.066    0.092   0.111   0.080    0.110
0.7  0.7    0.069   0.064    0.063   0.080   0.053    0.077

APPENDIX E

FIGURES FOR CHAPTER 1

Figure E.1 Distribution of the computed variances of $\hat{\beta}_3$ obtained for the case with $e$ following a spatial AR(1) process ($\rho = 0.7$) and $a$ following an AR(1) ($\psi = 0.3$), $N = 400$, $T = 5$. ∗True value computed as the variance of $\beta_3$ across the 1,000 replications. [Figure: six histogram panels, labeled Control Function with correction, CF ignoring first stage, HACSC, SHAC, Clustered, and Regular, each shown against the true value.]

APPENDIX F

PROOFS FOR CHAPTER 2

Proof of Proposition 2

The problem is to apply P2SLS to
\[
\tilde{s}_{it}y_{it} = \tilde{s}_{it}a_{it}\theta + \tilde{s}_{it}\bar{z}_i\delta
\]
using the instruments $z_{it} = (x_{1it}\;z_{2it}\;w_{1it}\;\mathfrak{z}_{2it})$ when $\tilde{s}_{it} = 1$, where $a_{it} = (x_{1it}\;x_{2it}\;w_{1it}\;w_{2it})$ and $\bar{z}_i = (\bar{x}_{1i}\;\bar{z}_{2i}\;\bar{w}_{1i}\;\bar{\mathfrak{z}}_{2i})$. Note that the averages are taken over the periods for which $\tilde{s}_{it} = 1$. The first step is to orthogonalize the instrumental variables with respect to $\bar{z}_i$. I start by regressing $z_{2it}$ on $\bar{z}_{2i}$.
The associated coefficient will be:
\begin{align*}
\hat{\gamma}_1 &= \Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}\bar{z}_{2i}'\bar{z}_{2i}\Big]^{-1}\Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}\bar{z}_{2i}'z_{2it}\Big] \\
&= \Big[\sum_{i=1}^N \bar{z}_{2i}'\bar{z}_{2i}T_i\Big]^{-1}\Big[\sum_{i=1}^N \bar{z}_{2i}'\sum_{t=1}^T \tilde{s}_{it}z_{2it}\Big] \\
&= \Big[\sum_{i=1}^N \bar{z}_{2i}'\bar{z}_{2i}T_i\Big]^{-1}\Big[\sum_{i=1}^N \bar{z}_{2i}'\bar{z}_{2i}T_i\Big] = \mathbb{I}_l,
\end{align*}
since $\sum_t \tilde{s}_{it}z_{2it} = T_i\bar{z}_{2i}$, where $T_i = \sum_t \tilde{s}_{it}$. Therefore, the residuals from this regression will be $z_{2it} - \bar{z}_{2i} = \ddot{z}_{2it}$.

Now we regress each of the remaining elements of $\bar{z}_i$, i.e. $\bar{x}_{1i}$, $\bar{w}_{1i}$ and $\bar{\mathfrak{z}}_{2i}$, on $\bar{z}_{2i}$. Each set of residuals from these regressions will depend only on the $i$ index, so denote them respectively by $f_i^{x_1}$, $f_i^{w_1}$ and $f_i^{\mathfrak{z}_2}$, and stack them into a vector $f_i$. Now we regress $\ddot{z}_{2it}$ on $f_i$, and the associated coefficient will be:
\[
\hat{\gamma}_2 = \Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}f_i'f_i\Big]^{-1}\Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}f_i'\ddot{z}_{2it}\Big] = \Big[\sum_{i=1}^N\sum_{t=1}^T \tilde{s}_{it}f_i'f_i\Big]^{-1}\Big[\sum_{i=1}^N f_i'\sum_{t=1}^T \tilde{s}_{it}\ddot{z}_{2it}\Big] = 0_{2k_1+l},
\]
where in the last step I used the fact that the sum of deviations from the mean is equal to zero over the periods for which $\tilde{s}_{it} = 1$. Therefore, after this orthogonalization of $z_{2it}$ with respect to $\bar{z}_i$, the residuals are $\ddot{z}_{2it}$. Following very similar steps, it can be shown that after orthogonalizing the remaining elements of $z_{it}$ with respect to $\bar{z}_i$, the set of residuals will be $\ddot{z}_{it} = (\ddot{x}_{1it}\;\ddot{z}_{2it}\;\ddot{w}_{1it}\;\ddot{\mathfrak{z}}_{2it})$. The problem then becomes to apply Pooled 2SLS to $\tilde{s}_{it}y_{it} = \tilde{s}_{it}a_{it}\theta$ using the instruments $\ddot{z}_{it}$.
The associated coefficient will be
\[
\tilde{\theta} = \Big[\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'a_{it}\Big)\Big]^{-1}\cdot\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'y_{it}\Big).
\]
Now focusing on the first element of the square-bracket matrix, and noting that the following algebraic manipulation holds for the remaining terms in the above expression that do not contain demeaned variables, we have:
\begin{align*}
\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it} &= \sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it} - \sum_{i=1}\bar{a}_i'\sum_{t=1}\tilde{s}_{it}(z_{it}-\bar{z}_i) \\
&= \sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it} - \sum_{i=1}\sum_{t=1}\tilde{s}_{it}\bar{a}_i'\ddot{z}_{it} \\
&= \sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it},
\end{align*}
where at the end of the first line I used again the fact that the sum of deviations from the mean is zero over the cases for which $\tilde{s}_{it} = 1$.
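This selected-sample demeaning identity is what drives the equivalence, and it can be confirmed in a small simulation for the simplest exogenous-regressor (OLS rather than 2SLS) version of the problem: on the complete cases, pooled OLS with unit averages taken over the observed periods reproduces the fixed-effects estimator exactly. All quantities below are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, k = 60, 6, 2
c = rng.normal(size=(N, 1, 1))                    # unobserved heterogeneity
x = rng.normal(size=(N, T, k)) + c                # covariates correlated with c
y = x @ np.array([1.0, -0.5]) + c[:, :, 0] + rng.normal(size=(N, T))
s = rng.random(size=(N, T)) < 0.7                 # selection indicator s_it
s[:, 0] = True                                    # keep every unit at least once

# FE on the complete cases: demean using averages over observed periods only
Ti = s.sum(axis=1, keepdims=True)
xbar = (x * s[:, :, None]).sum(axis=1, keepdims=True) / Ti[:, :, None]
ybar = (y * s).sum(axis=1, keepdims=True) / Ti
beta_fe = np.linalg.lstsq((x - xbar)[s], (y - ybar)[s], rcond=None)[0]

# CRE/Mundlak: pooled OLS on the complete cases with the same averages added
X = np.column_stack([x[s], np.broadcast_to(xbar, x.shape)[s], np.ones(s.sum())])
beta_cre = np.linalg.lstsq(X, y[s], rcond=None)[0][:k]

assert np.allclose(beta_fe, beta_cre)             # numerically identical
```

The equality is algebraic, not asymptotic: it holds in every draw because the selected-sample deviations from the unit averages sum to zero, exactly as in the derivation above.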
Therefore we have:
\begin{align*}
\tilde{\theta} &= \Big[\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'a_{it}\Big)\Big]^{-1}\cdot\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}a_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'y_{it}\Big) \\
&= \Big[\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{a}_{it}\Big)\Big]^{-1}\cdot\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{a}_{it}'\ddot{z}_{it}\Big)\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{z}_{it}\Big)^{-1}\Big(\sum_{i=1}\sum_{t=1}\tilde{s}_{it}\ddot{z}_{it}'\ddot{y}_{it}\Big) = \hat{\theta}_{CFE2SLS}
\end{align*}

APPENDIX G

TABLES FOR CHAPTER 2

Table G.1 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the data is MCAR.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data           0.0002    0.0167   0.0167      -0.0021   0.0325   0.0326
Complete cases       0.0005    0.0251   0.0251      -0.0023   0.0510   0.0510
Proposed GMM         0.0009    0.0213   0.0213      -0.0025   0.0422   0.0422
Dummy variable       0.0014    0.0346   0.0346      -0.0012   0.0669   0.0669

Table G.2 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the data is MAR.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data           0.0001    0.0164   0.0164      -0.0012   0.0321   0.0321
Complete cases       0.0010    0.0260   0.0260      -0.0020   0.0654   0.0654
Proposed GMM         0.0007    0.0214   0.0214      -0.0019   0.0520   0.0520
Dummy variable       0.0001    0.0395   0.0395      -0.0032   0.0894   0.0894

Table G.3 Average bias, standard deviation and root mean squared error for $\beta_3$ and $\beta_4$ across the 1000 repetitions when the data is MAR.

                    β3: Bias    S.D.     RMSE      β4: Bias    S.D.     RMSE
Whole data          -0.0009    0.0244   0.0244       0.0004   0.0496   0.0496
Complete cases      -0.0010    0.0396   0.0396       0.0014   0.0772   0.0771
Proposed GMM        -0.0000    0.0318   0.0318       0.0028   0.0630   0.0630
Dummy variable       1.1034    0.1455   1.1130       0.3698   0.2049   0.4227

Table G.4 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the data is MCAR and the error term follows a SAR(1) and $N = 900$.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data           0.0000    0.0182   0.0182       0.0000   0.0382   0.0382
Complete cases       0.0001    0.0278   0.0278      -0.0008   0.0560   0.0560
Proposed GMM         0.0001    0.0225   0.0225      -0.0013   0.0467   0.0467
Dummy variable      -0.0000    0.0375   0.0375      -0.0008   0.0760   0.0759

Table G.5 Average bias, standard deviation and root mean squared error for $\beta_3$ and $\beta_4$ across the 1000 repetitions when the data is MCAR and the error term follows a SAR(1) and $N = 900$.

                    β3: Bias    S.D.     RMSE      β4: Bias    S.D.     RMSE
Whole data          -0.0012    0.0260   0.0260      -0.0006   0.0565   0.0565
Complete cases      -0.0018    0.0402   0.0402      -0.0006   0.0821   0.0821
Proposed GMM        -0.0002    0.0330   0.0330       0.0013   0.0674   0.0674
Dummy variable       0.9796    0.1366   0.9891       0.3237   0.1942   0.3774

Table G.6 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the data is MCAR and the error term follows a SAR(1) and $N = 400$.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data           0.0001    0.0278   0.0278       0.0006   0.0582   0.0582
Complete cases       0.0007    0.0417   0.0417      -0.0005   0.0838   0.0838
Proposed GMM         0.0009    0.0342   0.0342       0.0007   0.0713   0.0713
Dummy variable       0.0006    0.0538   0.0537       0.0030   0.1123   0.1123

Table G.7 Average bias, standard deviation and root mean squared error for $\beta_3$ and $\beta_4$ across the 1000 repetitions when the data is MCAR and the error term follows a SAR(1) and $N = 400$.

                    β3: Bias    S.D.     RMSE      β4: Bias    S.D.     RMSE
Whole data          -0.0037    0.0387   0.0389      -0.0036   0.0820   0.0820
Complete cases      -0.0037    0.0598   0.0599      -0.0062   0.1267   0.1268
Proposed GMM        -0.0009    0.0485   0.0484      -0.0043   0.1035   0.1035
Dummy variable       0.9430    0.1944   0.9628       0.3006   0.2817   0.4118

Table G.8 Average bias, standard deviation and root mean squared error for $\beta_1$ and $\beta_2$ across the 1000 repetitions when the missingness depends on $x_1$ and $c_i$.

                    β1: Bias    S.D.     RMSE      β2: Bias    S.D.     RMSE
Whole data          -0.0005    0.0252   0.0252       0.0025   0.0479   0.0479
Complete cases      -0.0013    0.0367   0.0367       0.0059   0.0801   0.0803
Proposed GMM        -0.0012    0.0303   0.0303       0.0030   0.0656   0.0656
Dummy variable      -0.0005    0.0526   0.0526      -0.0044   0.1105   0.1106

Table G.9 Average bias, standard deviation and root mean squared error for $\beta_3$ and $\beta_4$ across the 1000 repetitions when the missingness depends on $x_1$ and $c_i$.

                    β3: Bias    S.D.     RMSE      β4: Bias    S.D.     RMSE
Whole data          -0.0005    0.0338   0.0338       0.0032   0.0691   0.0691
Complete cases      -0.0020    0.0506   0.0507       0.0066   0.1035   0.1037
Proposed GMM        -0.0006    0.0409   0.0409       0.0059   0.0862   0.0864
Dummy variable       1.0765    0.2408   1.1031       0.3505   0.2845   0.4514

APPENDIX H

FIGURES FOR CHAPTER 2

Figure H.1 Distribution of estimated coefficients across the 1000 Monte-Carlo repetitions when the data is MAR. [Figure: panels by estimator (All, CC, GMM) and coefficient ($\beta_1$ through $\beta_4$), each histogram shown against the true value.]

Figure H.2 Distribution of estimated coefficients across the 1000 Monte-Carlo repetitions when the data is MCAR and the error term follows a SAR(1) process with $N = 900$. [Figure: same panel layout.]

Figure H.3 Distribution of estimated coefficients across the 1000 Monte-Carlo repetitions when the data is MCAR and the error term follows a SAR(1) process with $N = 400$. [Figure: same panel layout.]
APPENDIX I

PROOFS FOR CHAPTER 3

Proof of Proposition 1

Because $x_{ijt} = (x_{1ijt}\;x_{2ijt})$, where $x_{2ijt}$ is endogenous, we need to add the instrumental variables. The CRE IV estimator in this case applies Pooled Two-Stage Least Squares (P2SLS) to the following equation:
\[
\tilde{y}_{ijt} = \tilde{x}_{ijt}\beta + \tilde{\bar{x}}_{1ijt}\gamma + \tilde{\bar{z}}_{2ijt}\delta + \text{error},
\]
using the IVs $\tilde{z}_{2ijt}$, where $\tilde{y}_{ijt} = y_{ijt} - \tilde{\theta}_1\bar{y}_{i\cdot\cdot} - \tilde{\theta}_2\bar{y}_{\cdot j\cdot} - \tilde{\theta}_3\bar{y}_{\cdot\cdot t} - \tilde{\theta}_4\bar{y}_{\cdot\cdot\cdot}$, and similarly for each variable. First note that $\tilde{\bar{x}}_{1i\cdot\cdot} = (1-\tilde{\theta}_1)\bar{x}_{1i\cdot\cdot}$, $\tilde{\bar{x}}_{1\cdot j\cdot} = (1-\tilde{\theta}_2)\bar{x}_{1\cdot j\cdot}$, $\tilde{\bar{x}}_{1\cdot\cdot t} = (1-\tilde{\theta}_3)\bar{x}_{1\cdot\cdot t}$ and $\tilde{\bar{x}}_{1\cdot\cdot\cdot} = (1-\tilde{\theta}_4)\bar{x}_{1\cdot\cdot\cdot}$.

As a first step, we orthogonalize the exogenous and instrumental variables $(\tilde{x}_{1ijt}\;\tilde{z}_{2ijt})$ with respect to $\tilde{\bar{z}}_{ijt} = (\tilde{\bar{x}}_{1i\cdot\cdot}\;\tilde{\bar{x}}_{1\cdot j\cdot}\;\tilde{\bar{x}}_{1\cdot\cdot t}\;\tilde{\bar{z}}_{2i\cdot\cdot}\;\tilde{\bar{z}}_{2\cdot j\cdot}\;\tilde{\bar{z}}_{2\cdot\cdot t})$.

a. $\tilde{x}_{1ijt}$ on $(1\;\tilde{\bar{z}}_{ijt})$, where the 1 represents the constant term. Applying the Frisch-Waugh-Lovell (FWL) theorem, to obtain the correct residuals this is equivalent to regressing $\tilde{x}_{1ijt} - \bar{x}_{1\cdot\cdot\cdot}$ on $\tilde{\bar{z}}_{ijt} - \bar{z}_{\cdot\cdot\cdot}$ (i.e., with no constant term). The coefficient associated with this regression will have the typical form $(X'X)^{-1}(X'y)$. Consider the first matrix, which has the following structure:
\[
\begin{pmatrix}
\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot}) & \sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot}) & \cdots \\
\sum_i\sum_j\sum_t (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot}) & \sum_i\sum_j\sum_t (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot}) & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}^{-1}
\]
Each off-diagonal term that is a cross product of different indices (e.g. $\bar{x}_{1i\cdot\cdot}$ and $\bar{x}_{1\cdot j\cdot}$) can be treated as follows:
\[
\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot}) = T\cdot\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]'\Big[\sum_j (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big] = 0.
\]
Therefore, each pair of these regressors is orthogonal to each other in sample. For those pairs of independent variables that have a common index (e.g. regressing $\tilde{x}_{1ijt}$ on $\bar{x}_{1i\cdot\cdot}$ and $\bar{z}_{2i\cdot\cdot}$), using the fact that each variable has been centered around its overall mean and applying the FWL theorem, it can be shown that after partialling out the variable that is not associated with the dependent variable (in this case $\bar{z}_{2i\cdot\cdot}$), we recover the same coefficient as if we ran $\tilde{x}_{1ijt}$ on $\bar{x}_{1i\cdot\cdot}$ directly.

Now I show that the coefficient associated with each element of $\bar{x}_1$ in this regression is an identity matrix of size $k_1$, and $0$ for the elements of $\bar{z}_2$. For example, the parameter vector associated with $\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdot\cdot\cdot}$ is:
\begin{align*}
\hat{\pi}_{\bar{x}_{1i\cdot\cdot}} &= \Big[\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(x_{1ijt}-\bar{x}_{1\cdot\cdot\cdot})\Big] \\
&= \Big[N_2 T\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'\sum_j\sum_t (x_{1ijt}-\bar{x}_{1\cdot\cdot\cdot})\Big] \\
&= \Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'\frac{1}{N_2 T}\sum_j\sum_t (x_{1ijt}-\bar{x}_{1\cdot\cdot\cdot})\Big] \\
&= \Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big] = \mathbb{I}_{k_1}.
\end{align*}
Therefore, each explanatory variable associated with the averages of $x_1$ will have a coefficient vector equal to an identity matrix.
On the other hand, it can be shown that the coefficients associated with the $\tilde{\bar{z}}_2$ variables are $0$, using the fact that we obtain sums of vectors in deviation from their overall mean.

b. $\tilde{z}_{2ijt}$ on $(1\;\tilde{\bar{z}}_{ijt})$. Using very similar arguments as in the previous step, the coefficients associated with the $\tilde{\bar{z}}_2$ variables will be identity matrices of size $l$, and the ones associated with $\tilde{\bar{x}}_1$ will be $0$.

Given this, after partialling out $\tilde{\bar{z}}_{ijt}$, the associated residuals will be:
\begin{align*}
\ddot{x}_{1ijt} &= x_{1ijt} - \bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdot\cdot t} + 2\bar{x}_{1\cdot\cdot\cdot} \\
\ddot{z}_{2ijt} &= z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}
\end{align*}
The problem has now become to apply P2SLS to $\tilde{y}_{ijt} = \tilde{x}_{ijt}\beta$ using the IVs $\ddot{z}_{2ijt} = z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}$ (collected, together with $\ddot{x}_{1ijt}$, in $\ddot{z}_{ijt}$). From this, note that
\[
\hat{\beta}_{2SLS} = \Big[\Big(\sum_i\sum_j\sum_t \tilde{x}_{ijt}'\ddot{z}_{ijt}\Big)\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{z}_{ijt}\Big)^{-1}\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\tilde{x}_{ijt}\Big)\Big]^{-1}\cdot\Big[\Big(\sum_i\sum_j\sum_t \tilde{x}_{ijt}'\ddot{z}_{ijt}\Big)\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{z}_{ijt}\Big)^{-1}\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\tilde{y}_{ijt}\Big)\Big].
\]
Note that
\begin{align*}
\sum_i\sum_j\sum_t \tilde{x}_{ijt}'\ddot{z}_{ijt} &= \sum_i\sum_j\sum_t x_{ijt}'\ddot{z}_{ijt} - \sum_i\sum_j\sum_t \tilde{\theta}_1\bar{x}_{i\cdot\cdot}'(z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}) \\
&\quad - \sum_i\sum_j\sum_t \tilde{\theta}_2\bar{x}_{\cdot j\cdot}'(z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}) \\
&\quad - \sum_i\sum_j\sum_t \tilde{\theta}_3\bar{x}_{\cdot\cdot t}'(z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot}) \\
&\quad - \sum_i\sum_j\sum_t \tilde{\theta}_4\bar{x}_{\cdot\cdot\cdot}'(z_{2ijt} - \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdot\cdot t} + 2\bar{z}_{2\cdot\cdot\cdot})
\end{align*}
Focusing on the first of the correction terms,
\begin{align*}
\sum_i\sum_j\sum_t \tilde{\theta}_1\bar{x}_{i\cdot\cdot}'(z_{ijt} - \bar{z}_{i\cdot\cdot} - \bar{z}_{\cdot j\cdot} - \bar{z}_{\cdot\cdot t} + 2\bar{z}_{\cdot\cdot\cdot}) &= \sum_i \tilde{\theta}_1\bar{x}_{i\cdot\cdot}'\sum_j\sum_t (z_{ijt} - \bar{z}_{i\cdot\cdot} - \bar{z}_{\cdot j\cdot} - \bar{z}_{\cdot\cdot t} + 2\bar{z}_{\cdot\cdot\cdot}) \\
&= \sum_i \tilde{\theta}_1\bar{x}_{i\cdot\cdot}'(N_2T\bar{z}_{i\cdot\cdot} - N_2T\bar{z}_{i\cdot\cdot} - N_2T\bar{z}_{\cdot\cdot\cdot} - N_2T\bar{z}_{\cdot\cdot\cdot} + 2N_2T\bar{z}_{\cdot\cdot\cdot}) = 0,
\end{align*}
where we used the fact that the terms in parentheses add up to zero. Using similar arguments for the rest of the expression, we can easily show that
\[
\sum_i\sum_j\sum_t \tilde{x}_{ijt}'\ddot{z}_{ijt} = \sum_i\sum_j\sum_t \ddot{x}_{ijt}'\ddot{z}_{ijt}.
\]
And applying the same logic to the rest of the terms of $\hat{\beta}_{2SLS}$, it follows that
\[
\hat{\beta}_{2SLS} = \Big[\Big(\sum_i\sum_j\sum_t \ddot{x}_{ijt}'\ddot{z}_{ijt}\Big)\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{z}_{ijt}\Big)^{-1}\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{x}_{ijt}\Big)\Big]^{-1}\cdot\Big[\Big(\sum_i\sum_j\sum_t \ddot{x}_{ijt}'\ddot{z}_{ijt}\Big)\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{z}_{ijt}\Big)^{-1}\Big(\sum_i\sum_j\sum_t \ddot{z}_{ijt}'\ddot{y}_{ijt}\Big)\Big] = \hat{\beta}_{FE2SLS}.
\]

Proof of Proposition 2

For notational simplicity, I prove the P2SLS case; similar ideas apply to a GLS-type transformation. We want to show that applying P2SLS to
\[
y_{ijt} = x_{1ijt}\beta_1 + x_{2ijt}\beta_2 + \bar{x}_{1ijt}\gamma_1 + \bar{x}_{2ijt}\gamma_2 = x_{ijt}\beta + \bar{x}_{ijt}\gamma
\]
using the IVs $(z_{2ijt}\;\bar{z}_{ijt})$, where $\bar{x}_{ijt} = (\bar{x}_{i\cdot\cdot}\;\bar{x}_{\cdot j\cdot}\;\bar{x}_{\cdot\cdot t})$ (and similarly for the other variables), yields the same $\beta$ as $\hat{\beta}_{FE2SLS}$. To show the result, I follow these steps:

1. Orthogonalize the IVs and the exogenous variables $(x_{1ijt}\;z_{2ijt})$ with respect to $\bar{x}_{1ijt}$.
2. Orthogonalize with respect to $\bar{z}_{2ijt}$ in the first-stage equation.
3. Use the FWL theorem to show the equivalence.

Step 1

a.
Regress $z_{2ijt}$ on $\bar{x}_{1ijt} = (1\;\bar{x}_{1i\cdot\cdot}\;\bar{x}_{1\cdot j\cdot}\;\bar{x}_{1\cdot\cdot t})$. Equivalently, applying the FWL theorem, to obtain the correct residuals we can regress $z_{2ijt} - \bar{z}_{2\cdot\cdot\cdot}$ on $\big[(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\;\;(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\;\;(\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})\big]$. The residuals from this regression will be
\[
m_{ijt} = z_{2ijt} - \bar{z}_{2\cdot\cdot\cdot} - (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\eta_1 - (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\eta_2 - (\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})\eta_3.
\]
First note that the regressors are orthogonal in sample. For example:
\[
\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot}) = T\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]'\Big[\sum_j (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big] = 0,
\]
since we are subtracting the overall mean from both sums of vectors. Therefore, we can find each $\eta_s$ by regressing the dependent variable on each regressor individually, and therefore:
\begin{align*}
\hat{\eta}_1 &= \Big[\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i\sum_j\sum_t (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(z_{2ijt}-\bar{z}_{2\cdot\cdot\cdot})\Big] \\
\hat{\eta}_2 &= \Big[\sum_i\sum_j\sum_t (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i\sum_j\sum_t (\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(z_{2ijt}-\bar{z}_{2\cdot\cdot\cdot})\Big] \\
\hat{\eta}_3 &= \Big[\sum_i\sum_j\sum_t (\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i\sum_j\sum_t (\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})'(z_{2ijt}-\bar{z}_{2\cdot\cdot\cdot})\Big]
\end{align*}
Note that each of the coefficients can be rewritten; for example:
\[
\hat{\eta}_1 = \Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\Big]^{-1}\Big[\sum_i (\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})'(\bar{z}_{2i\cdot\cdot}-\bar{z}_{2\cdot\cdot\cdot})\Big].
\]

b. $\bar{z}_{2ijt}$ on $\bar{x}_{1ijt} = (1\;\bar{x}_{1i\cdot\cdot}\;\bar{x}_{1\cdot j\cdot}\;\bar{x}_{1\cdot\cdot t})$, where $\bar{z}_{2ijt} = (\bar{z}_{2i\cdot\cdot}\;\bar{z}_{2\cdot j\cdot}\;\bar{z}_{2\cdot\cdot t})$. Similarly to the previous case, we can regress $(\bar{z}_{2ijt} - \bar{z}_{2\cdot\cdot\cdot})$ on $\big[(\bar{x}_{1i\cdot\cdot}-\bar{x}_{1\cdot\cdot\cdot})\;\;(\bar{x}_{1\cdot j\cdot}-\bar{x}_{1\cdot\cdot\cdot})\;\;(\bar{x}_{1\cdot\cdot t}-\bar{x}_{1\cdot\cdot\cdot})\big]$. Because the regressors are orthogonal in sample, we can again obtain the coefficients by running individual regressions.

a) Consider the regression of $\bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdot\cdot\cdot}$ on $\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdot\cdot\cdot}$.
The coefficient will be
$$\hat{\eta}_4 = \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdots}) \right]$$
$$= \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdots}) \right] = \hat{\eta}_1$$
Similarly, we can show that the coefficients of $\bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots}$ on $\bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdots}$ ($\hat{\eta}_5$) and of $\bar{z}_{2\cdot\cdot t} - \bar{z}_{2\cdots}$ on $\bar{x}_{1\cdot\cdot t} - \bar{x}_{1\cdots}$ ($\hat{\eta}_6$) will be equal to $\hat{\eta}_2$ and $\hat{\eta}_3$ respectively.
b) Consider now the case of "cross terms", i.e. regressing an average over one dimension on a variable averaged over a different dimension. For example, if we regress $\bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots}$ on $\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}$, the coefficient, say $\zeta_1$, will be:
$$\hat{\zeta}_1 = \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots}) \right]$$
$$= \left[ N_2 T \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ T \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})' \sum_j (\bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots}) \right] = 0$$
since the deviations from the overall mean add up to 0. Similarly, we can show that all the coefficients from the "cross terms" are 0. Therefore, the residuals from this stage will be
$$v_i = \bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdots} - (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})\hat{\eta}_4$$
$$v_j = \bar{z}_{2\cdot j\cdot} - \bar{z}_{2\cdots} - (\bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdots})\hat{\eta}_5$$
$$v_t = \bar{z}_{2\cdot\cdot t} - \bar{z}_{2\cdots} - (\bar{x}_{1\cdot\cdot t} - \bar{x}_{1\cdots})\hat{\eta}_6$$
c. Regress $x_{1ijt} - \bar{x}_{1\cdots}$ on $\bar{x}_{1ijt} - \bar{x}_{1\cdots}$.
The residuals from this regression will be
$$l_{ijt} = x_{1ijt} - \bar{x}_{1\cdots} - (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})\hat{\varepsilon}_1 - (\bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdots})\hat{\varepsilon}_2 - (\bar{x}_{1\cdot\cdot t} - \bar{x}_{1\cdots})\hat{\varepsilon}_3$$
Note that
$$\hat{\varepsilon}_1 = \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i \sum_j \sum_t (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(x_{1ijt} - \bar{x}_{1\cdots}) \right]$$
$$= \left[ N_2 T \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})' \sum_j \sum_t (x_{1ijt} - \bar{x}_{1\cdots}) \right]$$
$$= \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})' \frac{1}{N_2 T} \sum_j \sum_t (x_{1ijt} - \bar{x}_{1\cdots}) \right]$$
$$= \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right]^{-1} \left[ \sum_i (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})'(\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots}) \right] = \mathrm{I}_{k_1}$$
and similarly we can show that $\hat{\varepsilon}_2$ and $\hat{\varepsilon}_3$ are also identity matrices. Therefore, $l_{ijt} = x_{1ijt} - \bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdot j\cdot} - \bar{x}_{1\cdot\cdot t} + 2\bar{x}_{1\cdots} = \ddot{x}_{1ijt}$. After this orthogonalization, the problem becomes to apply P2SLS to
$$y_{ijt} = \ddot{x}_{1ijt}\beta_1 + x_{2ijt}\beta_2 + \bar{x}_{2ijt}\gamma_2$$
using IVs $(m_{ijt}\ v_i\ v_j\ v_t)$.

Step 2

Now I partial out $v_i$, $v_j$, $v_t$ in the first stage equation; these are the residuals associated with $\bar{z}_{2i\cdot\cdot}$, $\bar{z}_{2\cdot j\cdot}$, $\bar{z}_{2\cdot\cdot t}$ respectively. Based on their definitions, because $\hat{\eta}_1 = \hat{\eta}_4$, $\hat{\eta}_2 = \hat{\eta}_5$ and $\hat{\eta}_3 = \hat{\eta}_6$, and following a procedure similar to Step 1.a, it can be shown that $v_i$, $v_j$ and $v_t$ are orthogonal to each other in sample.
1. Regress $m_{ijt}$ on $(1\ v_i\ v_j\ v_t)$. If we write the fitted regression as $m_{ijt} = v_i\tilde{\eta}_1 + v_j\tilde{\eta}_2 + v_t\tilde{\eta}_3$ plus a residual, then
$$\tilde{\eta}_1 = \left[ \sum_i \sum_j \sum_t v_i'v_i \right]^{-1} \left[ \sum_i \sum_j \sum_t v_i'm_{ijt} \right] = \left[ \sum_i v_i'v_i \right]^{-1} \left[ \sum_i v_i'\bar{m}_{i\cdot\cdot} \right]$$
Note that $\bar{m}_{i\cdot\cdot} = (\bar{z}_{2i\cdot\cdot} - \bar{z}_{2\cdots}) - (\bar{x}_{1i\cdot\cdot} - \bar{x}_{1\cdots})\hat{\eta}_1$ and, because $\hat{\eta}_1 = \hat{\eta}_4$, we have $\bar{m}_{i\cdot\cdot} = v_i$, so it follows that $\tilde{\eta}_1 = \mathrm{I}_l$; analogous arguments apply to $\tilde{\eta}_2$ and $\tilde{\eta}_3$. Therefore, the residuals from this regression are $\ddot{z}_{2ijt}$, where the definition of $\ddot{z}_{2ijt}$ is analogous to that of $\ddot{x}_{1ijt}$.
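The identity-coefficient calculations in Step 1.c, and the resulting triple-demeaned residual, can be verified numerically. The following is my own minimal sketch, not part of the dissertation, using a scalar $x_{1ijt}$ in a balanced three-way panel generated with NumPy:

```python
# Numerical check of Step 1.c (illustrative sketch): regressing
# x1_ijt - xbar on the three demeaned group averages yields unit
# coefficients, and the residual is the triple-demeaned x1_ijt.
import numpy as np

rng = np.random.default_rng(0)
N1, N2, T = 6, 5, 4
x1 = rng.normal(size=(N1, N2, T))           # scalar x_{1ijt}

xb_i = x1.mean(axis=(1, 2), keepdims=True)  # xbar_{1i..}
xb_j = x1.mean(axis=(0, 2), keepdims=True)  # xbar_{1.j.}
xb_t = x1.mean(axis=(0, 1), keepdims=True)  # xbar_{1..t}
xb = x1.mean()                              # xbar_{1...}

# Regress (x1 - xbar) on the three demeaned averages (all mean zero,
# so no constant is needed)
y = (x1 - xb).ravel()
X = np.column_stack([
    np.broadcast_to(xb_i - xb, x1.shape).ravel(),
    np.broadcast_to(xb_j - xb, x1.shape).ravel(),
    np.broadcast_to(xb_t - xb, x1.shape).ravel(),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Triple-demeaned ("within") transformation, as in the text
x_ddot = (x1 - xb_i - xb_j - xb_t + 2 * xb).ravel()

assert np.allclose(coef, 1.0)      # scalar analogue of eps-hat = identity
assert np.allclose(resid, x_ddot)  # residual equals the within transform
```

The same check confirms in passing that the three regressors are mutually orthogonal in sample, which is why `lstsq` reproduces the one-regressor-at-a-time coefficients.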
Originally, the first stage was
$$x_{2ijt} = x_{1ijt}\phi_1 + z_{2ijt}\phi_2 + \bar{x}_{1ijt}\rho_1 + \bar{z}_{2ijt}\rho_2$$
Since we have partialled out $\bar{x}_{1ijt}$ and $\bar{z}_{2ijt}$, to get $\phi_1$ and $\phi_2$ we regress $x_{2ijt}$ on $\ddot{z}_{ijt} = [\ddot{x}_{1ijt}\ \ddot{z}_{2ijt}]$. For $\phi = [\phi_1\ \phi_2]$, we have
$$\hat{\phi} = \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}\ddot{z}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}x_{2ijt} \right]$$
$$= \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}\ddot{z}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}x_{2ijt} - \sum_i \sum_j \sum_t \ddot{z}'_{ijt}(\bar{x}_{2i\cdot\cdot} + \bar{x}_{2\cdot j\cdot} + \bar{x}_{2\cdot\cdot t} - 2\bar{x}_{2\cdots}) \right]$$
$$= \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}\ddot{z}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \ddot{z}'_{ijt}\ddot{x}_{2ijt} \right]$$
where the second line uses the fact that the deviations of $\ddot{z}_{ijt}$ sum to zero over each index, so the subtracted term equals 0. Therefore, $\hat{\phi}$ can also be obtained by regressing $\ddot{x}_{2ijt}$ on $\ddot{z}_{ijt}$.

Step 3

Now the problem becomes to apply P2SLS to
$$y_{ijt} = \ddot{x}_{1ijt}\beta_1 + x_{2ijt}\beta_2 + \bar{x}_{2ijt}\delta_2$$
using IVs $[\ddot{z}_{2ijt}\ \bar{z}_{2ijt}]$. The second stage of the problem is to apply POLS to
$$y_{ijt} = \ddot{x}_{1ijt}\beta_1 + \hat{x}_{2ijt}\beta_2 + \hat{\bar{x}}_{2ijt}\delta_2$$
To get $\beta$, I orthogonalize with respect to $\hat{\bar{x}}_{2ijt}$:
1. Regress $\ddot{x}_{1ijt}$ on $\hat{\bar{x}}_{2ijt} = (1\ \hat{\bar{x}}_{2i\cdot\cdot}\ \hat{\bar{x}}_{2\cdot j\cdot}\ \hat{\bar{x}}_{2\cdot\cdot t})$. Using the fact that the explanatory variables are averages over different dimensions and that the deviations of $\ddot{x}_{1ijt}$ sum to zero over each index, it can be shown that the vector of coefficients is equal to $0$.
2. Regress $\hat{x}_{2ijt}$ on $\hat{\bar{x}}_{2ijt}$, or equivalently $\hat{x}_{2ijt} - \hat{\bar{x}}_{2\cdots}$ on $\hat{\bar{x}}_{2ijt} - \hat{\bar{x}}_{2\cdots}$. Using arguments similar to Step 1.c, it can be shown that the associated coefficient in this regression will be $\mathrm{I}_{k_2}$ and the residuals will be
$$\hat{\ddot{x}}_{2ijt} = \hat{x}_{2ijt} - \hat{\bar{x}}_{2i\cdot\cdot} - \hat{\bar{x}}_{2\cdot j\cdot} - \hat{\bar{x}}_{2\cdot\cdot t} + 2\hat{\bar{x}}_{2\cdots}$$
Therefore, to find $\beta$, we run POLS on $y_{ijt} = \ddot{x}_{1ijt}\beta_1 + \hat{\ddot{x}}_{2ijt}\beta_2$.
Letting $\hat{\ddot{x}}_{ijt} = (\ddot{x}_{1ijt}\ \hat{\ddot{x}}_{2ijt})$,
$$\hat{\beta} = \left[ \sum_i \sum_j \sum_t \hat{\ddot{x}}'_{ijt}\hat{\ddot{x}}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \hat{\ddot{x}}'_{ijt}y_{ijt} \right]$$
Using a similar argument as in Step 2 for $x_{2ijt}$, it can be shown that
$$\hat{\beta} = \left[ \sum_i \sum_j \sum_t \hat{\ddot{x}}'_{ijt}\hat{\ddot{x}}_{ijt} \right]^{-1} \left[ \sum_i \sum_j \sum_t \hat{\ddot{x}}'_{ijt}\ddot{y}_{ijt} \right] = \hat{\beta}_{FE2SLS}$$
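The equivalence just derived can be illustrated numerically. The sketch below is my own and not part of the dissertation; the simulated design, variable names, and parameter values are illustrative assumptions. It builds a balanced three-way panel with one exogenous regressor, one endogenous regressor, and one instrument, then compares just-identified FE2SLS on triple-demeaned data with pooled 2SLS that adds the Mundlak averages as controls and instruments:

```python
# Numerical illustration (not from the dissertation): pooled 2SLS with
# Mundlak averages reproduces FE2SLS in a balanced three-way panel.
import numpy as np

rng = np.random.default_rng(1)
N1, N2, T = 8, 7, 5
shape = (N1, N2, T)

a_i = rng.normal(size=(N1, 1, 1))     # i-heterogeneity
a_j = rng.normal(size=(1, N2, 1))     # j-heterogeneity
a_t = rng.normal(size=(1, 1, T))      # time effect
c = a_i + a_j + a_t

u = rng.normal(size=shape)
x1 = rng.normal(size=shape) + c                  # exogenous regressor
z2 = rng.normal(size=shape) + c                  # instrument
x2 = z2 + 0.5 * u + c + rng.normal(size=shape)   # endogenous regressor
y = 1.0 * x1 + 2.0 * x2 + c + u

def triple_demean(a):
    return (a - a.mean(axis=(1, 2), keepdims=True)
              - a.mean(axis=(0, 2), keepdims=True)
              - a.mean(axis=(0, 1), keepdims=True) + 2 * a.mean())

def means(a):
    # the three Mundlak averages, broadcast to the full panel
    return [np.broadcast_to(a.mean(axis=ax, keepdims=True), shape).ravel()
            for ax in [(1, 2), (0, 2), (0, 1)]]

# FE2SLS: just-identified 2SLS on triple-demeaned data
Xd = np.column_stack([triple_demean(x1).ravel(), triple_demean(x2).ravel()])
Zd = np.column_stack([triple_demean(x1).ravel(), triple_demean(z2).ravel()])
b_fe = np.linalg.solve(Zd.T @ Xd, Zd.T @ triple_demean(y).ravel())

# Pooled 2SLS: averages of x1 and x2 as controls, averages of z2 as IVs
ones = np.ones(x1.size)
W = np.column_stack([x1.ravel(), x2.ravel(), ones, *means(x1), *means(x2)])
Z = np.column_stack([x1.ravel(), z2.ravel(), ones, *means(x1), *means(z2)])
b_cre = np.linalg.solve(Z.T @ W, Z.T @ y.ravel())

assert np.allclose(b_fe, b_cre[:2])  # beta1, beta2 coincide across methods
```

The key mechanics match the proof: the triple-demeaned instrument is a linear combination of the columns of $Z$, and it is orthogonal in sample to the constant and to every average, so the CRE moment conditions collapse to the FE2SLS normal equations.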