NEW ESTIMATION METHODS FOR PANEL DATA MODELS

By

Valentin Verdier

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics - Doctor of Philosophy

2014

ABSTRACT

NEW ESTIMATION METHODS FOR PANEL DATA MODELS

By Valentin Verdier

This dissertation is composed of three chapters that develop new estimation methods for several models of panel data. The first and third chapters are mainly concerned with understanding and approximating the structure of optimal instruments: for estimating dynamic panel data models with cross-sectional dependence in the case of the first chapter, and non-linear panel data models with strictly exogenous covariates in the case of the third chapter. The second chapter is concerned with additional restrictions that can be used to estimate non-linear dynamic panel data models.

The first chapter considers the estimation of dynamic panel data models when data are suspected to exhibit cross-sectional dependence. A new estimator is defined that uses cross-sectional dependence for efficiency while being robust to misspecification of the form of the cross-sectional dependence. I show that using cross-sectional dependence for estimation is important to obtain an estimator that is more accurate than existing estimators. This new estimator also uses nuisance parameters parsimoniously, so that it exhibits good small sample properties even when the number of available moment conditions is large. As an empirical application, I estimate the effect of attending private school on student achievement using a value-added model.

The second chapter considers the instrumental variable estimation of non-linear models of panel data with multiplicative unobserved effects where instrumental variables are predetermined as opposed to strictly exogenous. Existing estimators for these models suffer from a weak instrumental variable problem, which can cause them to be too inaccurate to be reliable.
In this chapter I present additional sets of restrictions that can be used for more precise estimation. Monte Carlo simulations show that using these additional moment conditions improves the precision of the estimators significantly and hence should facilitate the use of these models.

In the third chapter I study the efficiency of the Poisson fixed effects estimator. The Poisson fixed effects estimator is a conditional maximum likelihood estimator and as such is consistent under specific distributional assumptions. It has also been shown to be consistent under significantly weaker restrictions on the conditional mean function only. I show that the Poisson fixed effects estimator is asymptotically efficient in the class of estimators that are consistent under restrictions on the conditional mean function, as long as the assumptions of equal conditional mean and variance and zero conditional serial correlation are satisfied. I then define another estimator that is optimal under more general conditions. I use Monte Carlo simulations to investigate the small-sample performance of this new estimator compared to the Poisson fixed effects estimator.

ACKNOWLEDGEMENTS

I particularly thank Jeffrey Wooldridge, who served as the chair of my dissertation committee. His teaching throughout my studies at Michigan State University shaped this dissertation and my current research. I also thank my other committee members, Peter Schmidt, Timothy Vogelsang and Robert Myers, whose help at various stages of my research had a large positive impact on the quality of my work. I also thank graduate students in the department of economics and the department of agricultural economics for helpful conversations. Finally I thank Margaret Lynch and Lori Jean Nichols of the administrative staff of the department of economics, whose continuous support over five years helped a lot in completing this degree.

TABLE OF CONTENTS

LIST OF TABLES
. . . . . . . . . . . . vii

CHAPTER 1  ESTIMATION OF DYNAMIC PANEL DATA MODELS WITH CROSS-SECTIONAL DEPENDENCE . . . 1
  1.1 Introduction . . . 1
  1.2 Dynamic Panel Data Models with Cross-Sectional Dependence . . . 3
    1.2.1 The Model . . . 3
    1.2.2 Consistent Estimation . . . 4
  1.3 Efficient Estimation under Clustering . . . 5
    1.3.1 Special Case of Independent Disturbances and T = 2 . . . 6
    1.3.2 General Case . . . 8
    1.3.3 Comparison to Existing Estimators . . . 14
  1.4 Models with Covariates . . . 15
  1.5 Monte Carlo Simulations . . . 18
  1.6 Application: Estimation of Persistence in Student Achievement . . . 36
  1.7 Conclusion . . . 44

CHAPTER 2  ESTIMATION OF UNOBSERVED EFFECTS PANEL DATA MODELS UNDER SEQUENTIAL EXOGENEITY . . . 45
  2.1 Introduction . . . 45
  2.2 Model and Assumptions . . . 47
  2.3 Estimation without Additional Assumptions . . . 49
  2.4 Additional Assumptions . . . 51
    2.4.1 Estimation with Stationary Instruments . . . 51
      2.4.1.1 Example of the Linear Feedback Model . . . 51
      2.4.1.2 Time Demeaned Instruments . . . 53
    2.4.2 Serially Uncorrelated Transitory Shocks . . . 54
  2.5 Monte Carlo Evidence . . . 56
  2.6 Average Partial Effects . . . 62
  2.7 Conclusion . . . 65

CHAPTER 3  EFFICIENCY OF THE POISSON FIXED EFFECTS ESTIMATOR . . . 66
  3.1 Introduction . . . 66
  3.2 The Model and Estimators . . . 66
    3.2.1 Asymptotically Efficient Estimation . . . 67
    3.2.2 Conditions for Efficiency of the Poisson FE estimator . . . 68
    3.2.3 An Alternative Estimator . . . 69
  3.3 Monte Carlo Simulations Study . . . 70

APPENDICES . . . 75
  APPENDIX A  ESTIMATION OF DYNAMIC PANEL DATA MODELS WITH CROSS-SECTIONAL DEPENDENCE . . . 76
  APPENDIX B  ESTIMATION OF UNOBSERVED EFFECTS PANEL DATA MODELS UNDER SEQUENTIAL EXOGENEITY . . . 81
  APPENDIX C  EFFICIENCY OF THE POISSON FIXED EFFECTS ESTIMATOR . . . 84

BIBLIOGRAPHY . . .
91

LIST OF TABLES

Table 1.1  Number of replications where all estimators converged (out of 1,000) . . . 23
Table 1.2  Bias and RMSE, ρ = .8, equi-correlation within clusters . . . 24
Table 1.3  Bias and RMSE, ρ = .8, no correlation within clusters . . . 25
Table 1.4  Bias and RMSE, ρ = .8, heteroscedasticity and correlation within clusters . . . 26
Table 1.5  Inference, ρ = .8, equi-correlation within clusters . . . 27
Table 1.6  Inference, ρ = .8, no correlation within clusters . . . 28
Table 1.7  Inference, ρ = .8, heteroscedasticity and correlation within clusters . . . 29
Table 1.8  Bias and RMSE, ρ = .5, equi-correlation within clusters . . . 30
Table 1.9  Bias and RMSE, ρ = .5, no correlation within clusters . . . 31
Table 1.10  Bias and RMSE, ρ = .5, heteroscedasticity and correlation within clusters . . . 32
Table 1.11  Inference, ρ = .5, equi-correlation within clusters . . . 33
Table 1.12  Inference, ρ = .5, no correlation within clusters . . . 34
Table 1.13  Inference, ρ = .5, heteroscedasticity and correlation within clusters . . . 35
Table 1.14  Averages and standard deviations of scores per subject and per grade . . . 42
Table 1.15  Effects of Attending Private Schools on Student Achievement . . . 43
Table 2.1  Bias and RMSE for estimating γ, T = 4 . . . 59
Table 2.2  Bias and RMSE for estimating γ, T = 8 . . . 60
Table 2.3  Ratio of standard errors over standard deviations of estimators of γ, T = 4 . . .
61
Table 2.4  Ratio of standard errors over standard deviations of estimators of γ, T = 8 . . . 62
Table 2.5  Coverage of 95% confidence intervals for γ, T = 4 . . . 63
Table 2.6  Coverage of 95% confidence intervals for γ, T = 8 . . . 64
Table 3.1  N = 100: Bias, standard deviation and root mean squared error . . . 72
Table 3.2  N = 500: Bias, standard deviation and root mean squared error . . . 73
Table 3.3  N = 1000: Bias, standard deviation and root mean squared error . . . 74

CHAPTER 1

ESTIMATION OF DYNAMIC PANEL DATA MODELS WITH CROSS-SECTIONAL DEPENDENCE

1.1 Introduction

In some econometric studies of panel data, researchers might want to account for the presence of feedback between the dependent variable and explanatory variables, i.e. for current values of the dependent variable to affect future values of the explanatory variables, or even for both dependent and independent variables to be jointly determined. The simplest example of such models is the dynamic panel data model, where lagged values of the dependent variable are used as covariates. In such cases, explanatory variables cannot be treated as strictly exogenous. In virtually all panel data applications, researchers also want to control for unobserved heterogeneity that affects the dependent variable but might also be correlated with the covariates. The presence of both non-strictly exogenous covariates and unobserved heterogeneity in panel data models causes many estimation methods to be invalid (see for instance Wooldridge (2010)). In the context of cross-sectionally independent data, a valid estimator for dynamic panel data models that relies on first differencing and instrumental variables was defined in early work by Anderson and Hsiao (1981).
Additionally, an asymptotically efficient estimator is found in Arellano and Bond (1991).1 In the rest of the paper, we refer to this estimator as the AB estimator. These estimators often suffer from having a large variance because the instrumental variables that they use are weak.2 In addition, inference for the AB estimator is often unsatisfactory when the number of time periods in the data set is relatively large because of problems due to using many moment conditions, as studied in Alvarez and Arellano (2003) or Windmeijer (2005) for the case of cross-sectional independence. In this paper, we consider the estimation of panel data models with covariates that are not strictly exogenous when data also exhibit cross-sectional dependence. We will define a new estimator that is more efficient than the AB estimator and for which inference is significantly better in small samples.

1 The Arellano and Bond estimator is asymptotically efficient in the class of estimators using linear functions of the instruments.

2 To address this problem, papers such as Ahn and Schmidt (1995), Arellano and Bover (1995), and Blundell and Bond (1998) considered using additional assumptions for estimation, such as homoscedasticity, no serial correlation of the transitory shocks, or restrictions on initial conditions. Another approach to obtaining efficiency gains from additional assumptions can be found in the literature on First Difference Quasi-Maximum Likelihood estimation, as in Hsiao et al. (2002) for instance, which relies on assumptions of homoscedasticity and no serial correlation. We do not consider these estimators here since we are interested in estimators that are consistent under the sole assumption of mean independence of the transitory shocks, without any other assumption holding.
The main reason why our estimator is more efficient than previous estimators that were defined for data with cross-sectional independence is that it makes use of cross-sectional dependence to obtain stronger instruments. In order to obtain an estimator with not only good properties in terms of point estimation, but also good properties for inference, we use an auxiliary model for optimal instruments. Optimal instruments are instruments that, once interacted with the corresponding moment functions, provide an optimal set of exactly identifying moment conditions, so that the resulting estimator achieves the asymptotic efficiency bound for estimating the unknown parameters from the assumption of mean independence of the transitory shocks. Optimal instruments for estimating dynamic panel data models without cross-sectional dependence are found in Chamberlain (1992a), and they can be generalized to the case of cross-sectional dependence. In this paper, we propose auxiliary assumptions sufficient to model optimal instruments for panel data models with covariates that are not strictly exogenous and cross-sectional dependence. The advantage of such an approach is that it provides a systematic way of weighting many moment conditions while making use of few nuisance parameters. As a result, our estimator exhibits good small sample properties and inference while being robust to misspecification of our model of optimal instruments. Arellano (2003) and Alvarez and Arellano (2004) have previously considered modeling optimal instruments for dynamic panel data models in the special case of cross-sectional independence. We show that cross-sectional dependence can be particularly useful to obtain more accurate estimators. Previous work on dynamic panel data models that has considered cross-sectional dependence has not made use of this dependence to obtain stronger instruments.
Mutl (2006), for instance, studied a GMM estimator based on the same moment conditions as in Anderson and Hsiao (1981) or Arellano and Bond (1991) and only used an optimal weighting matrix based on a specific model of spatial dependence. Elhorst (2005) and Su and Yang (2013) generalized maximum likelihood estimators as in Hsiao et al. (2002) to the case of cross-sectional dependence, but these estimators are not robust to heteroscedasticity, serial correlation of the transitory shocks, or misspecification of the cross-sectional dependence.

In Section 1.2, we present the simplest example of the models we consider, the dynamic panel data model without covariates for data with cross-sectional dependence. In Section 1.3, we define our estimator and compare it to existing estimators. In Section 1.4, we generalize our estimator to general models with non-strictly exogenous covariates. In Section 1.5, we present Monte Carlo evidence that the efficiency gains from using cross-sectional dependence for estimation can be significant and that the estimator we propose has superior small sample properties compared to existing estimators. In Section 1.6, we apply our estimator to the estimation of the effect of attending private school on student achievement, using a value-added model and taking into account the possibility that student achievements are correlated within schools.

1.2 Dynamic Panel Data Models with Cross-Sectional Dependence

1.2.1 The Model

Throughout the paper we will consider large n, fixed T asymptotics.3 Consider first the model for any observation i from a sample of n observations and any time period t from a fixed number T of time periods:

yit = ρ0 yit−1 + ci + uit,   t = 1, ..., T        (1.2.1)

E(uit | Yt−1) = 0,   t = 1, ..., T        (1.2.2)

where Yt = [Y1t, ..., Ynt] and Yit = [yi0, ..., yit] are random vectors that stack values of yit across time and observations, and ci are time constant unobserved effects, also called unobserved heterogeneity.
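As a concrete illustration, the data generating process in (1.2.1) with cluster-correlated unobserved heterogeneity can be simulated in a few lines. This is a minimal sketch, not part of the chapter's formal development: the cluster structure, the parameter values, and the stationary initialization of yi0 are illustrative assumptions.

```python
import random

def simulate_panel(G=200, ng=5, T=4, rho=0.5, tau=0.7, seed=0):
    """Simulate y_it = rho * y_it-1 + c_i + u_it as in (1.2.1), where the
    heterogeneity c_i shares a common component within each of G clusters
    of ng observations. Returns a list of (cluster_id, [y_i0, ..., y_iT])."""
    rng = random.Random(seed)
    data = []
    for g in range(G):
        cg = rng.gauss(0.0, 1.0)  # cluster-level component of c_i
        for _ in range(ng):
            # c_i correlated within clusters, unit variance overall
            ci = tau * cg + (1.0 - tau ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
            # one convenient choice: initialize near the stationary mean c_i / (1 - rho)
            y = [ci / (1.0 - rho) + rng.gauss(0.0, 1.0)]
            for _ in range(T):
                y.append(rho * y[-1] + ci + rng.gauss(0.0, 1.0))
            data.append((g, y))
    return data
```

Designs of this type (with varying within-cluster correlation of ci and uit) are what the Monte Carlo exercises of Section 1.5 vary.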
We also assume that ρ0 ≠ 1, so that ρ0 is identified from differenced equations as seen in the next subsection.

In the case where there is no cross-sectional dependence, (1.2.1) and (1.2.2) correspond to the linear dynamic model for panel data as presented in Arellano and Bond (1991) for instance. When there is cross-sectional dependence, (1.2.1) and (1.2.2) impose the restriction that cross-sectional dependence does not cause Yt−1 to be endogenous. For instance, if contemporaneous spatial lags were omitted variables in (1.2.1), then (1.2.2) would be violated.

Some papers, such as Cizek et al. (2011), Elhorst (2005), Su and Yang (2013) and Baltagi et al. (2014), have considered models with both dynamic effects and contemporaneous spatial lag effects. Since estimators for such models rely on correct specification of the form of cross-sectional dependence, we do not consider them here and concentrate on models where cross-sectional dependence of some unknown form is present in the residuals.4 Lagged values of the dependent variable of neighboring observations could also be included in the model as covariates to control for dynamic cross-sectional effects. We will discuss models with covariates in Section 1.4.

The objective of the next section is to characterize estimators for ρ0 that are consistent when (1.2.1) and (1.2.2) hold under general conditions on the form of cross-sectional dependence in ci and uit.

3 Using a parsimonious number of nuisance parameters seems to grant the estimator we propose good properties with relatively large numbers of time periods, but a formal derivation of results under large N, large T asymptotics is left for future research.

4 It is also important to note that, with cross-sectional dependence, it is not likely for E(uit | Yit−1) = 0 to hold without (1.2.2) holding. If (1.2.2) is not satisfied, it is likely that both estimators for cross-sectionally independent data, such as the Arellano and Bond estimator, and the alternative estimator proposed in this chapter will be inconsistent. For instance, suppose for simplicity that n = 2 and E(u1t | Yt−1) = α + β1 y1t−1 + β2 y2t−1 ≠ 0, so that β1 ≠ 0 or β2 ≠ 0. Then E(u1t | Y1t−1) = α + β1 y1t−1 + β2 E(y2t−1 | Y1t−1), and it is likely that E(y2t−1 | Y1t−1) will be a function of y10, ..., y1t−2 in addition to y1t−1, so that, in general, α + β1 y1t−1 ≠ −β2 E(y2t−1 | Y1t−1) and E(u1t | y1t−1) ≠ 0.

1.2.2 Consistent Estimation

The presence of unobserved heterogeneity rules out estimating ρ0 by a regression. Because (1.2.1) and (1.2.2) form a dynamic model, fixed effects estimation is also ruled out because explanatory variables are not strictly exogenous. To estimate ρ0, we will consider a first difference transformation. All of the derivations in this paper can be generalized to other transformations, such as the forward filtering transformation presented in Arellano and Bover (1995) for instance, which can be useful in the case of unbalanced panels. Define:

mit(ρ) = ∆yit − ρ∆yit−1,   ∀t = 2, ..., T        (1.2.3)

where ∆ is the first difference operator. Therefore mit(ρ0) = uit − uit−1, and (1.2.1) and (1.2.2) imply:

E(mit(ρ0) | Yt−2) = 0,   ∀t = 2, ..., T        (1.2.4)

Define mi(ρ) = [mit(ρ)]t=2,...,T to be the column vector with mit+1(ρ) as its tth element. Sometimes we will also shorten notation by writing mi = mi(ρ0), mit = mit(ρ0) and ∆Y−1,i = [∆yit−1]t=2,...,T. Define:

Zi = [Zi2, ..., ZiT]        (1.2.5)

to be a matrix containing instruments for each time period, so that Zit is a function of Yt−2 and therefore E(Zit mit(ρ0)) = 0 and E(Zi mi(ρ0)) = ∑_{t=2}^{T} E(Zit mit(ρ0)) = 0.5 Define Ξ to be some weighting matrix.
Define an estimator ρ̂ of ρ0 as:

ρ̂ = argminρ ( ∑_{i=1}^{n} Zi mi(ρ) )′ Ξ ( ∑_{i=1}^{n} Zi mi(ρ) )        (1.2.6)

Consider first the case where cross-sectional dependence is captured by a large group of clusters with fixed numbers of observations, so that observations within a cluster might be related but observations across clusters are independent. Standard results on asymptotic properties of GMM estimators with clustering, found in White (2001) for instance, imply that ρ̂ will be consistent for ρ0 and asymptotically normal as the number of clusters grows unboundedly, under standard regularity conditions. For more general forms of cross-sectional dependence, Conley (1999), Jenish and Prucha (2009) and Jenish and Prucha (2012) consider different sets of regularity conditions that guarantee that ρ̂ is consistent and asymptotically normal as long as E(Zi mi(ρ0)) = 0. In this paper, we will assume that either set of regularity conditions holds, so that the probability limits D = plim( (1/n) ∑_{i=1}^{n} Zi ∆Y−1,i ) and ϒ = plim (1/n) ∑_{i=1}^{n} ∑_{j=1}^{n} Zi mi mj′ Zj′ exist and are finite, D′ΞD ≠ 0, ρ̂ →p ρ0 and, as n → ∞:6

√n (ρ̂ − ρ0) →d N(0, V)        (1.2.7)

V = (D′ΞD)^{−1} D′ΞϒΞD (D′ΞD)^{−1}        (1.2.8)

In the next sections we consider efficient feasible GMM estimation, where the matrix of instruments Zi and an estimator of the weighting matrix Ξ are chosen so that the resulting estimator of ρ0 is efficient under some auxiliary assumptions. It is important to note that all of the estimators we propose will be asymptotically equivalent to estimators of the type defined by (1.2.6), so that they will be consistent as long as (1.2.1) and (1.2.2) hold, independently of whether the auxiliary models we specify are true or not.

5 Note that we need to assume ρ0 ≠ 1 for E(Zi mi(ρ)) = 0 to hold for ρ = ρ0 only, since if ρ0 = 1 then E(Zi mi(ρ)) = 0 ∀ρ.
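To make (1.2.6) concrete, consider the simplest choice of a single instrument per period, Zit = yit−2. The pooled sample moment is then linear in ρ, the weighting matrix Ξ is irrelevant (there is one moment condition for one parameter), and the minimizer has a closed form. The sketch below is illustrative only; the data layout, a list of (cluster, [yi0, ..., yiT]) pairs, is an assumption of this example rather than notation from the chapter.

```python
def gmm_rho(data, T):
    """Estimate rho from the pooled moment condition
    E[ y_it-2 * (dy_it - rho * dy_it-1) ] = 0 for t = 2, ..., T,
    i.e. a single-instrument (Anderson and Hsiao type) special case of
    (1.2.6). Setting the sample moment to zero gives a closed-form ratio."""
    num = den = 0.0
    for _, y in data:
        for t in range(2, T + 1):
            z = y[t - 2]                      # instrument: second lag of y
            num += z * (y[t] - y[t - 1])      # z * dy_t
            den += z * (y[t - 1] - y[t - 2])  # z * dy_t-1
    return num / den
```

With noise-free data generated from (1.2.1) (uit = 0 for all i, t), mit(ρ0) is identically zero and the ratio recovers ρ0 exactly; with noisy data the estimator is consistent but, as discussed above, often imprecise because the instrument is weak.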
1.3 Efficient Estimation under Clustering

In this section, we consider an auxiliary model for deriving optimal instruments that assumes that every observation belongs to one of a large number of clusters. Observations are treated as correlated within clusters but independent across clusters. While clustering only represents a specific form of cross-sectional dependence, it might be a good approximation for more general forms of dependence in many applications. In addition, the method outlined in this section for the special case of clustering can easily be extended to other forms of cross-sectional dependence. Therefore we restrict our attention in this paper to auxiliary models that make use of the clustering assumption.

For simplicity we will consider in this section the case where each observation belongs to the same cluster across all time periods, but the results in this section can be generalized to clusters changing over time, as shown in Section 1.6.

Previous work that estimated dynamic models of panel data with clustered sampling generally used estimators developed for i.i.d. data, such as the ones found in Anderson and Hsiao (1981), Arellano and Bond (1991), or Ahn and Schmidt (1995), and adjusted inference by using clustered standard errors. Such an analysis can be found for instance in de Brauw and Giles (2008), where farming households are treated as clustered by village, or Andrabi et al. (2011), where students are clustered by school.7 Topalova and Khandelwal (2010) and Balasubramanian and Sivadasan (2010) consider the case where firms are clustered by industry.

6 Note that in the case of clustering we consider {ng}g=1,...,G to be a set of fixed values, where ng denotes the number of observations in cluster g and G the number of clusters. Then √n-asymptotic normality and √G-asymptotic normality are equivalent since n/max{ng} ≤ G ≤ n/min{ng}.
In this section, we show that there is much to gain in terms of efficiency by using a different estimator that takes into account correlation within clusters but is robust to misspecification of the form of this correlation. We will consider the case where the data are composed of a large number of clusters indexed by g = 1, ..., G, each with a fixed number of observations denoted ng, so that asymptotics will be performed for G → ∞. In the first subsection, we present the special case of two time periods, since in this case the problem reduces to estimating ρ0 from only one differenced equation using instrumental variables.

1.3.1 Special Case of Independent Disturbances and T = 2

For this simple special case, we derive an efficient estimator for the case where {uit}i=1,...,n, t=1,2 are independent both cross-sectionally and across time, where T = 2, and where we have conditional homoscedasticity so that:

Var(uit | Yt−1) = σu²,   ∀t = 1, 2        (1.3.1)

When T = 2, there is only one differenced equation that can be used for estimation:

∆yi2 = ρ0 ∆yi1 + ∆ui2        (1.3.2)

for which the available instruments are Y0. Under the assumption of independence of disturbances and homoscedasticity, ∆ui2 is also cross-sectionally independent and homoscedastic, so the optimal instrument for the differenced equation is the best prediction of ∆yi1 based on all the available instruments, i.e. E(∆yi1 | Y0). To find E(∆yi1 | Y0), note that under (1.2.1) and (1.2.2), yi1 = ρ0 yi0 + ci + ui1, so that E(∆yi1 | Y0) = (ρ0 − 1)yi0 + E(ci | Y0). Therefore the quality of the prediction of ∆yi1 based on the instruments will depend on the quality of the prediction of ci based on the instruments.

7 We will show in Section 1.6, however, that the clustering used in Andrabi et al. (2011) is not appropriate for obtaining robust standard errors, due to observations moving across clusters during the period of observation. We will show robust standard errors that take this factor into consideration.
In many applications, it is very likely that agents that belong to the same cluster will have levels of unobserved heterogeneity that are related. For instance, farmers that live in the same village might farm plots with similar soil quality or develop similar farming practices over time. Firms that operate in the same industry might also face similar constraints, such as regulation or access to a skilled labor force. Similarly, households that live in the same district might have been selected based on common characteristics such as wealth, income, family status or values. Therefore, in many applications, we can expect that using information from other observations in the same cluster, in addition to one's own previous outcomes, can provide a better predictor of one's level of unobserved heterogeneity.

For this simple case, we could derive an optimal predictor for ci by using the assumption that for any observation i belonging to cluster g we have:

ci = cg + ei        (1.3.3)

where {cg}g=1,...,G forms a sequence of i.i.d. random variables and {ei}i=1,...,n is an i.i.d. sequence of zero-mean random variables, with ei being mean independent of {yj0}j≠i conditional on yi0. Then for any observation i in cluster g we have E(ci | Y0) = E(cg | Y0) + E(ei | yi0). To obtain a parsimonious model for the optimal instruments, we can postulate that conditional expectations are linear and that each observation within a cluster contributes in the same way to the prediction of cg. Then for any observation in cluster g, E(ci | Y0) = α0 + β0 (1/ng) ∑_{j∈g} yj0 + γ0 yi0, where ng denotes the number of observations in cluster g. Therefore the optimal instrument for (1.3.2) for an observation in cluster g is zi = (ρ0 − 1)yi0 + α0 + β0 (1/ng) ∑_{j∈g} yj0 + γ0 yi0.
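The construction of a feasible counterpart to this instrument, described next, can be sketched in a few lines: given a preliminary consistent estimate ρ̈, regress yi1 − ρ̈yi0 on an intercept, the cluster mean of initial outcomes and yi0 to estimate (α0, β0, γ0), form the fitted instrument ẑi, and apply instrumental variables to the differenced equation (1.3.2). The pure-Python sketch below is illustrative only; its function names and data layout are assumptions of this example, and it uses a single cross-section of the regression rather than the pooled version described in the text.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def feasible_optimal_iv(y0, y1, y2, cluster, rho_pre):
    """Feasible optimal-instrument IV for the T = 2 case: estimate
    E(c_i | Y_0) by OLS of y_i1 - rho_pre * y_i0 on (1, cluster mean of y_j0,
    y_i0), build z_i = (rho_pre - 1) * y_i0 + fitted value, then run IV on
    the differenced equation (1.3.2). rho_pre is any preliminary estimate."""
    n = len(y0)
    sums, counts = {}, {}
    for g, v in zip(cluster, y0):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    ybar = [sums[g] / counts[g] for g in cluster]  # within-cluster means of y_j0
    X = [[1.0, ybar[i], y0[i]] for i in range(n)]
    r = [y1[i] - rho_pre * y0[i] for i in range(n)]
    # normal equations X'X beta = X'r for (alpha, beta, gamma)
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(3)] for a in range(3)]
    Xtr = [sum(X[i][a] * r[i] for i in range(n)) for a in range(3)]
    alpha, beta, gamma = solve(XtX, Xtr)
    z = [(rho_pre - 1.0) * y0[i] + alpha + beta * ybar[i] + gamma * y0[i] for i in range(n)]
    num = sum(z[i] * (y2[i] - y1[i]) for i in range(n))  # z * dy_2
    den = sum(z[i] * (y1[i] - y0[i]) for i in range(n))  # z * dy_1
    return num / den
```

Note the robustness property discussed in the text: because ẑi is a function of Y0 only, the IV step is consistent for ρ0 even when ρ_pre or the auxiliary cluster model is wrong; the auxiliary model only affects efficiency.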
A feasible version of this optimal instrument can be obtained from a preliminary consistent estimator of ρ0, denote it ρ̈, since consistent estimators for α0, β0, γ0 can then be obtained from a pooled regression of yit − ρ̈yit−1 on an intercept, (1/ng) ∑_{j∈g} yj0 and yi0. Using the information contained in past outcomes for other observations in the cluster will presumably yield a much better predictor of ci, and hence a much better instrument, which can lead to sizable gains in efficiency. Even though we derived this efficient estimator by using very strong auxiliary assumptions, it is consistent as long as (1.2.1) and (1.2.2) hold, and one can use inference that is robust to all of our auxiliary assumptions being violated, as shown in the next sub-section.

1.3.2 General Case

In this sub-section, we consider efficient estimation with T being any fixed integer equal to or greater than two and disturbances being potentially correlated within clusters. Here we will generalize the idea developed in the previous sub-section of using other observations from a cluster to predict one's level of unobserved heterogeneity. We will start with the same auxiliary assumption of clustering as in the previous subsection:

Auxiliary Assumption 1: Clusters of observations are independent and identically distributed.

With Auxiliary Assumption 1, we can derive the optimal estimator for ρ0 by generalizing the work on optimal instruments for cross-sectionally independent data in Chamberlain (1992a) to the case of cluster sampling. In this section we will index observations by cluster, so that for any i, gi denotes the cluster to which observation i belongs and jg denotes the jth observation of cluster g, so that for any observation i in g there is j such that jg = i, and {{xjg}j=1,...,ng}g=1,...,G = {xi}i=1,...,n for any sequence of variables {xi}i=1,...,n.
Consider stacking all observations by cluster and define mt^g(ρ) = [m1g,t(ρ), ..., mng g,t(ρ)]′, m^g(ρ) = [m2^g(ρ)′, ..., mT^g(ρ)′]′, mt^g = mt^g(ρ0) and m^g = m^g(ρ0). Similarly, define ut^g = [u1g,t, ..., ung g,t]′, u^g = [u1^g′, ..., uT^g′]′, c^g = [c1g, ..., cng g]′, yt^g = [y1g,t, ..., yng g,t]′, Yt^g = [y0^g′, ..., yt^g′]′, and ∆Y−1^g = [∆y1^g′, ..., ∆yT−1^g′]′. Appendix A.1.1 shows that the optimal estimator for ρ0 is defined by:

∑_{g=1}^{G} Zopt^g m^g(ρ̂opt) = 0        (1.3.4)

where Zopt^g = L^g′ (Φ^g)^{−1/2}, where Φ^g = [Cov(mt^g, ms^g | Y^g_{max{t,s}−2})]_{t,s=2,...,T}, (Φ^g)^{−1/2} is the upper diagonal matrix such that (Φ^g)^{−1/2}′ (Φ^g)^{−1/2} = (Φ^g)^{−1}, L^g = [Lt^g]_{t=2,...,T} and Lt^g = E((Φt^g)^{−1/2} ∆Y−1^g | Y^g_{t−2}), where (Φt^g)^{−1/2} is the (t − 1)th ng × ng(T − 1) matrix composing (Φ^g)^{−1/2}.

One could estimate these optimal instruments non-parametrically by using series of instruments that include lagged values of the dependent variable for an observation, but also lagged values of the dependent variable for neighboring observations. A similar estimator has been studied for the case of cross-sectionally independent data in Donald et al. (2009) for static models and Hahn (1997) for dynamic models. However, such an approach would not be practical here, since there are too many possible terms to consider as instruments. Also, it would involve using many nuisance parameters, which can cause poor small sample properties for the estimator, as is discussed later. Instead, we propose two additional auxiliary assumptions that will allow us to model optimal instruments and drastically reduce the number of nuisance parameters needed. The resulting estimator will be consistent as long as (1.2.1) and (1.2.2) hold, and efficient when these auxiliary assumptions are satisfied. Because the estimator we propose makes use of few nuisance parameters, it will have good small sample properties even when the auxiliary assumptions do not hold, as evidenced in Section 1.5.
The second auxiliary assumption we use is conditional homoscedasticity together with zero conditional serial correlation and conditional equi-correlation within clusters:

Auxiliary Assumption 2a: For any i, j ∈ g and t, s = 1, ..., T with t ≥ s:

Cov(u_it, u_js | c^g, Y_{t−1}^g) = σ_u²      if i = j, t = s
                                 = τ_u σ_u²   if i ≠ j, t = s
                                 = 0          otherwise

Under Auxiliary Assumption 2a, Appendix A.1.2 shows that the optimal instrument for m^g, Z_opt^g, is now a linear function of {E(Δy_{t−1}^g | Y_{t−s}^g)}_{t=2,...,T, s=2,...,t}. This corresponds to the intuition developed in the previous section, where we found that, for the special case T = 2, the optimal instruments were simply E(Δy_1^g | Y_0^g). From (1.2.1) and (1.2.2):

E(Δy_{t−1}^g | Y_{t−s}^g) = (ρ_0 − 1) ρ_0^{s−1} y_{t−s}^g + ∑_{r=0}^{s−2} ρ_0^r E(c^g | Y_{t−s}^g)    (1.3.5)
                          = (ρ_0 − 1) ρ_0^{s−1} y_{t−s}^g + ((1 − ρ_0^{s−1})/(1 − ρ_0)) E(c^g | Y_{t−s}^g)    (1.3.6)

Under Auxiliary Assumption 1:

E(Δy_{t−1}^g | Y_{t−s}^g) = (ρ_0 − 1) ρ_0^{s−1} y_{t−s}^g + ((1 − ρ_0^{s−1})/(1 − ρ_0)) E(c^g | Y_{t−s}^g)    (1.3.7)

Therefore, in order to obtain a model for the optimal instruments, one needs to make additional assumptions so that there exists a parametric model for the mean of unobserved heterogeneity conditional on lagged values of the dependent variable.
In order to keep the number of nuisance parameters low, it is useful to assume that unobserved heterogeneity follows a simple cluster correlation structure:

Corr(c_i, c_j) = τ_c    if i ≠ j, i, j ∈ g    (1.3.8)
              = 0       otherwise              (1.3.9)

We also assume that the disturbances {u_t^g}_{t=1,...,T} are independent of unobserved heterogeneity, that both have a joint normal distribution, and that the initial values of the dependent variable are in the stationary state associated with (1.2.1), i.e.:

y_0^g = c^g/(1 − ρ_0) + ũ_0^g    (1.3.10)

where ũ_0^g is independent of c^g and {u_t^g}_{t=1,...,T}, follows a normal distribution with zero mean and variance σ_u²/(1 − ρ_0²), and has a within-cluster correlation of τ_u. (Footnote 8: The auxiliary assumption of stationary initial conditions can easily be generalized, at the expense of introducing three additional nuisance parameters, by assuming: y_0^g = α + β c^g + ũ_0^g, with ũ_0^g | c^g ∼ N(0, Σ̃_0^g), Var(ũ_i0) = σ̃_0, and Corr(ũ_i0, ũ_j0) = τ_u if i ≠ j but g_i = g_j.)

Let the variance-covariance matrix of u_t^g for t = 1, ..., T be denoted by Σ_u^g:

            [ 1    τ_u  ⋯   τ_u
Σ_u^g = σ_u²  τ_u  1         ⋮
              ⋮         ⋱   τ_u
              τ_u  ⋯   τ_u  1  ]    (1.3.11)

Let the variance-covariance matrix of c^g be denoted by Σ_c^g:

            [ 1    τ_c  ⋯   τ_c
Σ_c^g = σ_c²  τ_c  1         ⋮
              ⋮         ⋱   τ_c
              τ_c  ⋯   τ_c  1  ]    (1.3.12)

The last auxiliary assumption of our model for optimal instruments is:

Auxiliary Assumption 3a: Suppose that for any cluster g = 1, ..., G:

[ c^g  ]       ( [ μ_c ι_{n_g}              ]   [ Σ_c^g               (1/(1−ρ_0)) Σ_c^g                           0
[ y_0^g ]  ∼ N(  [ (1/(1−ρ_0)) μ_c ι_{n_g}  ] ,   (1/(1−ρ_0)) Σ_c^g   (1/(1−ρ_0))² Σ_c^g + (1/(1−ρ_0²)) Σ_u^g     0
[ u^g  ]       ( [ 0                        ]     0                   0                                           I_T ⊗ Σ_u^g ] )    (1.3.13)

where Σ_c^g and Σ_u^g have been defined previously and ι_{n_g} is a column vector of ones of dimension n_g × 1. Note that E(c^g | Y_t^g) = E(c^g | y_0^g, c^g + u_1^g, ..., c^g + u_t^g).
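Under joint normality, E(c^g | Y_t^g) follows from the standard partitioned-normal formula E(x_1 | x_2) = μ_1 + V_12 V_22^{−1}(x_2 − μ_2). A generic numpy sketch of that formula (the mean vector, covariance matrix and observed values below are illustrative, not the chapter's):

```python
import numpy as np

def mvn_conditional_mean(mu, V, idx_c, idx_obs, obs):
    """E(x[idx_c] | x[idx_obs] = obs) for x ~ N(mu, V), via the partitioned-normal formula."""
    V_co = V[np.ix_(idx_c, idx_obs)]           # cross covariance block
    V_oo = V[np.ix_(idx_obs, idx_obs)]         # covariance of the conditioning block
    return mu[idx_c] + V_co @ np.linalg.solve(V_oo, obs - mu[idx_obs])

# Toy check on a trivariate normal
mu = np.array([1.0, 0.0, 0.0])
V = np.array([[1.0, 0.5, 0.3],
              [0.5, 2.0, 0.2],
              [0.3, 0.2, 1.5]])
m = mvn_conditional_mean(mu, V, idx_c=np.array([0]), idx_obs=np.array([1, 2]),
                         obs=np.array([0.7, -0.2]))
print(m)
```

In the chapter's setting the roles of x_1 and x_2 are played by c^g and (y_0^g, c^g + u_1^g, ..., c^g + u_t^g), with mean and covariance built from the blocks of (1.3.13).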
Define V^g as:

V^g = [ Σ_c^g               (1/(1−ρ_0)) Σ_c^g                           0
        (1/(1−ρ_0)) Σ_c^g   (1/(1−ρ_0))² Σ_c^g + (1/(1−ρ_0²)) Σ_u^g     0
        0                   0                                           I_T ⊗ Σ_u^g ]    (1.3.14)

Under Auxiliary Assumption 3a:

[ c^{g′}, y_0^{g′}, (c^g + u_1^g)′, ..., (c^g + u_T^g)′ ]′ ∼ N(μ^g, A^g V^g A^{g′})    (1.3.15)

where μ^g = A^g [ μ_c ι_{n_g}′, (1/(1−ρ_0)) μ_c ι_{n_g}′, 0′ ]′ and A^g is the deterministic matrix of ones and zeros such that A^g [c^{g′}, y_0^{g′}, u^{g′}]′ = [c^{g′}, y_0^{g′}, (c^g + u_1^g)′, ..., (c^g + u_T^g)′]′.

Therefore, using the properties of the multivariate normal distribution, E(c^g | Y_t^g) can be obtained as a linear function of y_0^g, c^g + u_1^g, ..., c^g + u_t^g with coefficients given by the elements of V^g. The exact form of E(c^g | Y_t^g) under Auxiliary Assumptions 1, 2a and 3a is given in Appendix A.1.3.

Only five nuisance parameters compose V^g, and they can be consistently estimated if a consistent preliminary estimator of ρ_0 is available, denote it ρ̈. Let r_it(ρ) = y_it − ρ y_it−1. Consistent estimators for the nuisance parameters in V^g are:

σ̂_u² = (1/2) (1/(T−1)) (1/n) ∑_{t=2}^T ∑_{i=1}^n m_it(ρ̈)²

τ̂_u = (1/σ̂_u²) (1/2) (1/(T−1)) (1/n) ∑_{t=2}^T ∑_{i=1}^n (1/(n_{g_i} − 1)) ∑_{j=1}^n 1[i ≠ j, g_i = g_j] m_it(ρ̈) m_jt(ρ̈)

σ̂_c² = (1/(T(T−1))) (1/n) ∑_{t=1}^T ∑_{s=1}^T ∑_{i=1}^n 1[t ≠ s] r_it(ρ̈) r_is(ρ̈) − μ̂_c²

μ̂_c = (1/T) (1/n) ∑_{t=1}^T ∑_{i=1}^n r_it(ρ̈)

τ̂_c = (1/σ̂_c²) ( (1/(T(T−1))) (1/n) ∑_{t=1}^T ∑_{s=1}^T ∑_{i=1}^n (1/(n_{g_i} − 1)) ∑_{j=1}^n 1[t ≠ s, g_i = g_j, i ≠ j] r_it(ρ̈) r_js(ρ̈) − μ̂_c² )

Let Φ̂^g be the consistent estimator of the variance-covariance matrix Φ^g = Var(m^g(ρ_0)) obtained by plugging σ̂_u and τ̂_u into the formula derived in Appendix A.1.2. Let Φ̂^{g,−1/2} be the upper diagonal matrix such that Φ̂^{g,−1/2′} Φ̂^{g,−1/2} = Φ̂^{g,−1}. Denote by Φ̂_t^{g,−1/2} the t-th n_g × n_g(T−1) matrix composing Φ̂^{g,−1/2}. Let μ̂_t^{gc} be a consistent estimator of E(c^g | Y_t^g) from the formula given in Appendix A.1.3.
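The five nuisance-parameter estimators above are simple averages of products of residuals. A numpy sketch on hypothetical simulated data with balanced clusters, where for simplicity the preliminary estimate ρ̈ is set to the true ρ (the parameter values are made up; in the simulated design σ_u² = 1, τ_u = 0, μ_c = 1, σ_c² = 2 and τ_c = 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_g, T = 200, 3, 5
rho = 0.5  # plays the role of both rho_0 and the preliminary estimate

# Heterogeneity: mean 1, variance 2, within-cluster correlation 0.5; i.i.d. N(0,1) shocks
c = 1.0 + rng.normal(size=(G, 1)) + rng.normal(size=(G, n_g))
y = np.empty((G, n_g, T + 1))
y[:, :, 0] = c / (1 - rho)
for t in range(1, T + 1):
    y[:, :, t] = rho * y[:, :, t - 1] + c + rng.normal(size=(G, n_g))

n = G * n_g
r = y[:, :, 1:] - rho * y[:, :, :-1]   # r_it = c_i + u_it, t = 1..T
m = r[:, :, 1:] - r[:, :, :-1]         # m_it = u_it - u_it-1, t = 2..T

sigma2_u = (m ** 2).sum() / (2 * (T - 1) * n)
# within-cluster, same-period cross products of m over i != j
cross_u = ((m.sum(axis=1) ** 2 - (m ** 2).sum(axis=1)) / (n_g - 1)).sum()
tau_u = cross_u / (2 * (T - 1) * n) / sigma2_u

mu_c = r.mean()
# own cross products over t != s
own = (r.sum(axis=2) ** 2 - (r ** 2).sum(axis=2)).sum()
sigma2_c = own / (T * (T - 1) * n) - mu_c ** 2
# i != j and t != s cross products, by inclusion-exclusion within each cluster
tot = r.sum(axis=(1, 2)) ** 2
same_i = (r.sum(axis=2) ** 2).sum(axis=1)
same_t = (r.sum(axis=1) ** 2).sum(axis=1)
both = (r ** 2).sum(axis=(1, 2))
cross_c = ((tot - same_i - same_t + both) / (n_g - 1)).sum()
tau_c = (cross_c / (T * (T - 1) * n) - mu_c ** 2) / sigma2_c

print(sigma2_u, tau_u, mu_c, sigma2_c, tau_c)
```

Each estimate should land near its true value in this design, since every term averages products whose expectation is the corresponding moment (e.g. E[m_it²] = 2σ_u² and E[r_it r_is] = σ_c² + μ_c² for t ≠ s).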
A consistent estimator of the optimal instrument for m^g(ρ) under (1.2.1), (1.2.2) and Auxiliary Assumptions 1, 2a and 3a is:

Ẑ_opt^{g′} = [Φ̂_1^{g,−1/2}, ..., Φ̂_{T−1}^{g,−1/2}] ×

    [ (ρ̈ − 1) y_0^g + μ̂_0^{gc}                                          0    ⋯    0
      ⋮                                                                        ⋱    ⋮
      ρ̈^{T−2}(ρ̈ − 1) y_0^g + ((1 − ρ̈^{T−1})/(1 − ρ̈)) μ̂_0^{gc}        ⋯         (ρ̈ − 1) y_{T−2}^g + μ̂_{T−2}^{gc} ]    (1.3.16)

and the estimator obtained from using this instrument matrix is defined by:

∑_{g=1}^G Ẑ_opt^{g′} m^g(ρ̂) = 0    (1.3.17)

so that:

ρ̂ = (∑_{g=1}^G Ẑ_opt^{g′} Δy^g) / (∑_{g=1}^G Ẑ_opt^{g′} Δy_{−1}^g)    (1.3.18)
   = ρ_0 + (∑_{g=1}^G Ẑ_opt^{g′} Δu^g) / (∑_{g=1}^G Ẑ_opt^{g′} Δy_{−1}^g)    (1.3.19)

where Δy^g = [Δy_2^{g′}, ..., Δy_T^{g′}]′, Δy_{−1}^g = [Δy_1^{g′}, ..., Δy_{T−1}^{g′}]′ and Δu^g = [Δu_2^{g′}, ..., Δu_T^{g′}]′.

Let Z̈_opt^g be the random vector defined as in (1.3.16) but where ρ̈, σ̂_u², σ̂_c², τ̂_u, τ̂_c, μ̂_c are replaced by plim(ρ̈), plim(σ̂_u²), plim(σ̂_c²), plim(τ̂_u), plim(τ̂_c), plim(μ̂_c). When (1.2.1), (1.2.2) and Auxiliary Assumption 1 hold, ρ̂ is asymptotically normal:

√G (ρ̂ − ρ_0) →d N(0, V_ρ)    (1.3.20)

V_ρ = E(Z̈_opt^{g′} Δy_{−1}^g)^{−2} Var(Z̈_opt^{g′} Δu^g)    (1.3.21)

Standard errors for ρ̂ that are consistent as long as (1.2.1), (1.2.2) and Auxiliary Assumption 1 hold are given by:

s.e. = ( (∑_{g=1}^G Ẑ_opt^{g′} Δy_{−1}^g)^{−2} ∑_{g=1}^G (Ẑ_opt^{g′} m^g(ρ̂))² )^{1/2}    (1.3.22)

The estimator defined by (1.3.17) is consistent and asymptotically normal even when Auxiliary Assumption 1 of cluster sampling is not satisfied, as long as some regularity conditions on the strength of cross-sectional dependence hold. As in Section 1.2.2, cross-sectional dependence has to be weak enough so that asymptotic theorems can be applied:

(1/G) ∑_{g=1}^G Z̈_opt^{g′} Δu^g →p 0    (1.3.23)

(1/G) ∑_{g=1}^G Z̈_opt^{g′} Δy_{−1}^g →p a    (1.3.24)

(1/√G) ∑_{g=1}^G Z̈_opt^{g′} Δu^g →d N(0, v)    (1.3.25)

where a = plim((1/G) ∑_{g=1}^G Z̈_opt^{g′} Δy_{−1}^g) ≠ 0 and v = plim((1/G) (∑_{g=1}^G Z̈_opt^{g′} Δu^g)²). In this case:

√G (ρ̂ − ρ_0) →d N(0, a^{−2} v)    (1.3.26)

a can simply be estimated by (1/G) ∑_{g=1}^G Ẑ_opt^{g′} Δy_{−1}^g, and non-parametric estimators of plim((1/G)(∑_{g=1}^G Z̈_opt^{g′} Δu^g)²), as well as statistical tests under general forms of spatial dependence, are available and have been discussed in Conley (1999), Bester et al. (2011b), Kim and Sun (2011) and Bester et al. (2011a).

In situations where available preliminary estimators might have poor small sample properties, one can also use an iterated version of the feasible optimal estimator. Denote by Ẑ_opt^g(ρ) the value of the estimated optimal instruments for a preliminary estimator (previously denoted ρ̈) evaluated at ρ. The iterated optimal estimator is defined by:

∑_{g=1}^G Ẑ_opt^g(ρ̂_iter)′ m^g(ρ̂_iter) = 0    (1.3.27)

This estimator has the same √G-asymptotic properties as the two-step estimator defined by (1.3.17), but its small sample properties will not depend on the small sample properties of a preliminary estimator.

1.3.3 Comparison to Existing Estimators

The estimator defined by (1.3.17) can be rewritten as the ρ̂* that satisfies the equation:

∑_{g=1}^G w^g*(η̂) Z^{g′} m^g(ρ̂*) = 0    (1.3.28)

where η̂ = [σ̂_u², τ̂_u, σ̂_c², μ̂_c, τ̂_c] and Z^g is the matrix containing all valid instruments for m^g:

Z^g = [ I_{n_g} ⊗ Y_0^g    0                  ⋯    0
        0                  I_{n_g} ⊗ Y_1^g    ⋯    0
        ⋮                                     ⋱    ⋮
        0                  ⋯             0         I_{n_g} ⊗ Y_{T−2}^g ]    (1.3.29)

and w^g*(·) is the row vector function such that w^g*(η̂) Z^{g′} = Ẑ_opt^{g′}. The Arellano and Bond estimator can also be written as exactly identified, from:

∑_{g=1}^G ŵ_AB^g Z^{g′} m^g(ρ̂_AB) = 0    (1.3.30)

where:

ŵ_AB^g = (∑_{i=1}^n ΔY_{−1,i}′ Z_i) (∑_{i=1}^n Z_i′ m_i(ρ̃) m_i(ρ̃)′ Z_i)^{−1} S^g    (1.3.31)

where ρ̃ is a preliminary consistent estimator and S^g is the matrix of zeros and ones such that S^g Z^{g′} m^g(ρ) = ∑_{i∈g} Z_i′ m_i(ρ), with:

Z_i = [ Y_i0    0       ⋯    0
        0       Y_i1    ⋯    0
        ⋮               ⋱    ⋮
        0       ⋯    0       Y_iT−2 ]    (1.3.32)

In the presence of cross-sectional dependence, our estimator is likely to perform better than the Arellano and Bond estimator even when some of Auxiliary Assumptions 1, 2a and 3a are violated, because our estimator gives non-zero weights to moment conditions obtained from using instruments from neighboring observations. As discussed in previous sections, these instruments may have significant predictive power for the covariates in the differenced equations, so these additional moment conditions can improve the accuracy of the estimator. In addition, our estimator relies on the estimation of only five nuisance parameters to compute weights for all n_g² × T × (T−1)/2 moment conditions available per cluster, whereas the Arellano and Bond estimator relies on the estimation of T × (T−1)/2 weights. When T is relatively large, estimating that many nuisance parameters causes the Arellano and Bond estimator to suffer from poor small sample properties in terms of bias, precision and inference, as studied in the context of cross-sectional independence in Alvarez and Arellano (2003) and Windmeijer (2005). Because our estimator makes use of few nuisance parameters, it will have good properties in finite samples even when T is relatively large. A formal derivation of the asymptotic properties of our estimator when both n and T grow unboundedly is left for future research. As a result of both giving non-zero weights to useful moment conditions and using nuisance parameters parsimoniously, the Monte Carlo simulations presented in Section 1.5 show that our estimator has significantly better small sample properties than the Arellano and Bond estimator in terms of efficiency and quality of inference, particularly in cases with cross-sectional dependence but also without cross-sectional dependence.
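The iterated estimator in (1.3.27) alternates between updating the estimated instruments at the current value of ρ and re-solving the exactly identified moment condition. A stylized, hypothetical sketch with T = 3, cross-sectionally independent data and a single toy instrument whose form depends on the trial value of ρ (this is a toy version, not the chapter's full instrument construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho0 = 20000, 0.5

# AR(1) panel with stationary start
c = rng.normal(size=n)
y0 = c / (1 - rho0) + rng.normal(size=n) / np.sqrt(1 - rho0 ** 2)
y1 = rho0 * y0 + c + rng.normal(size=n)
y2 = rho0 * y1 + c + rng.normal(size=n)
dy1, dy2 = y1 - y0, y2 - y1

def instrument(rho):
    # toy "estimated instrument": linear prediction of dy1 from y0 at a trial rho
    return (rho - 1) * y0

def solve_given(z):
    # exactly identified IV for the differenced equation dy2 = rho * dy1 + du2
    return (z * dy2).sum() / (z * dy1).sum()

rho = 0.0  # crude starting value
for _ in range(50):
    rho_new = solve_given(instrument(rho))
    if abs(rho_new - rho) < 1e-12:
        break
    rho = rho_new
print(rho)
```

In this toy case the instrument is just a scalar multiple of y_0, so the exactly identified estimate is invariant to rescaling and the iteration settles after one update; with the chapter's instruments, ρ̈ enters non-linearly through μ̂_t^{gc} and Φ̂^{g,−1/2}, so several iterations may be needed.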
So-called system GMM estimators, presented in Ahn and Schmidt (1995), Arellano and Bover (1995), and Blundell and Bond (1998), are similar to the Arellano and Bond estimator but use additional moment conditions based on additional assumptions of homoscedasticity, no serial correlation, or stationary initial conditions. Since our estimator is based only on the mean independence of transitory shocks conditional on past outcomes, it is more robust than the estimators presented in Ahn and Schmidt (1995), Arellano and Bover (1995) or Blundell and Bond (1998).

1.4 Models with Covariates

Auxiliary assumptions similar to those of the previous section can be used to model optimal instruments for models with covariates. In this section we consider a model that allows for some of the covariates to be strictly exogenous (w_it) and some of the covariates to be sequentially exogenous or contemporaneously endogenous (x_it):

y_it = x_it β_0 + w_it γ_0 + c_i + u_it    t = 1, ..., T    (1.4.1)

E(u_it | Z_t^i, W) = 0    (1.4.2)

where W = [W_1′, ..., W_n′]′, W_i = [w_i1, ..., w_iT], Z_t^i = [z_i1, ..., z_it], and for every random variable x_it^(j) in x_it, either x_it^(j) or x_it−1^(j) is in z_it. (Footnote 9: Note that a special case of this model is the dynamic model we considered in the previous section, where x_it = y_it−1, x_it = z_it, and w_it γ_0 = 0. In most applications, even if x_it includes covariates other than lagged values of the dependent variable, it is expected that y_it−1 will be included in x_it in order to identify the effect of x_it on y_it separately from the dynamic effects in y_it and x_it.) x_it^(j) is said to be sequentially exogenous if it is in z_it. If only x_it−1^(j) is in z_it, x_it^(j) is said to be contemporaneously endogenous. Such a model specification is flexible enough to allow for complex interactions between unobserved factors and covariates of interest; an example is given in Section 1.6. The estimation method presented in this section can be generalized to the case where neither x_it nor x_it−1 is part of z_it but where some other instruments are available, which is also treated in the example given in Section 1.6. As a notational matter, we generalize the notation from the previous section by denoting by x^g the vector [x_{1_g}, ..., x_{(n_g)_g}]′ for any sequence of variables {x_i}_{i=1,...,n}.
A consistent estimator of β_0, γ_0 is obtained from the differenced equation:

Δy_it = Δx_it β_0 + Δw_it γ_0 + Δu_it    t = 2, ..., T    (1.4.3)

E(Δu_it | Z_{t−1}^i, W) = 0    (1.4.4)

To model optimal instruments for estimating β_0 and γ_0 from (1.4.3) and (1.4.4), we make use of the same auxiliary assumption of clustering, i.e. we maintain Auxiliary Assumption 1. We also generalize Auxiliary Assumption 2a so that homoscedasticity and serial correlation are specified conditional on the relevant instruments:

Auxiliary Assumption 2b: For any i, j ∈ g and t, s = 1, ..., T with t ≥ s:

Cov(u_it, u_js | c^g, Z_t^g, W^g) = σ_u²      if i = j, t = s
                                  = τ_u σ_u²   if i ≠ j, t = s
                                  = 0          if t > s

As in the previous section, this assumption guarantees that the optimal instruments will be known linear functions of {E(Δx_it | Z_s^g, W^g)}_{t,s=1,...,T, s≤t−1} and W_i (up to the unknown nuisance parameters σ_u² and τ_u). Therefore, we need to generalize Auxiliary Assumption 3a to obtain a parsimonious model for {E(Δx_it | Z_s^g, W^g)}_{t,s=1,...,T, s≤t−1}. To do so, we can model z_t^g as a VAR process conditional on W^g:

Auxiliary Assumption 3b: Suppose that for any observation i = 1, ..., n:

z_it = Γ z_it−1 + w_it η + d_i + v_it    (1.4.5)

and:

[ d^g   ]         ( [ μ_d(W^g)     ]   [ Σ_d       Σ_dz_0    0
[ z_0^g ] | W^g ∼ N( [ μ_z_0(W^g) ] ,    Σ_dz_0′   Σ_z_0     0
[ v^g   ]         ( [ 0            ]     0         0         Σ_v ] )    (1.4.6)

where v^g = [v_1^{g′}, ..., v_T^{g′}]′. In particular applications, one will impose auxiliary restrictions on μ_d(·), μ_z_0(·), Σ_d, Σ_dz_0, Σ_z_0, Σ_v so that they can be estimated with few enough nuisance parameters.
Auxiliary Assumption 3b implies:

E(z_it | Z_{t−s}^g, W^g) = Γ^s z_it−s + ∑_{r=0}^{s−1} Γ^r (w_it−r η + E(d_i | Z_{t−s}^g, W^g))    (1.4.7)

and:

E(d_i | Z_t^g, W^g) = E(d_i | z_0^g, W^g, d^g + v_1^g, ..., d^g + v_t^g)    (1.4.8)

which can be derived from Auxiliary Assumption 3b as was done in the previous section. For any covariate x_it^(j), either x_it^(j) is in z_it or x_it−1^(j) is in z_it; therefore Auxiliary Assumption 3b yields a model for E(x_it | Z_s^g, W^g) for all s ≤ t, and hence for E(Δx_it | Z_s^g, W^g) for all s ≤ t − 1, as a function of the nuisance parameters in Auxiliary Assumption 3b. Therefore, under (1.4.1), (1.4.2) and Auxiliary Assumptions 1, 2b and 3b, one can find a parametric model for the optimal instruments for estimating β_0 and γ_0. A feasible version of these instruments can be obtained from a preliminary estimator of (β_0, η_0) as in the previous section. One can also use an iterated version of this feasible estimator in order to obtain an estimator with better performance in small samples.

1.5 Monte Carlo Simulations

In this section, we study the small sample properties of the estimator we propose using Monte Carlo simulations. Consider the following simple data generating process for a model with cluster correlation and without covariates:

n_g ∼ Poisson(α) + 1
c^g ∼ F_c
y_0^g | c^g ∼ Normal(μ_0(c^g), Σ_0(c^g))
y_t^g | c^g, y_{t−1}^g, ..., y_0^g ∼ Normal(c^g + ρ y_{t−1}^g, Σ_u(c^g, y_{t−1}^g, ..., y_0^g))

We compare the properties of three estimators of ρ: the estimator defined in Arellano and Bond (1991), which we call the AB estimator; the estimator defined by (1.3.27), which we denote by Estimator 1; and the estimator defined by (1.3.27) but with the estimated within-cluster correlations replaced by zero, which we denote by Estimator 2. As a benchmark for comparison, we also show the results from using an unfeasible optimal estimator (UO), which is optimal in the class of estimators that use linear functions of the instruments.
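The data generating process above can be sketched in a few lines of numpy; the version below follows the Scenario-1-style design with equi-correlated normal heterogeneity and shocks (α, ρ, T and the number of clusters are illustrative values, not the chapter's full grid):

```python
import numpy as np

rng = np.random.default_rng(3)

def equicorr(n, tau):
    """n x n matrix with ones on the diagonal and tau off the diagonal."""
    return tau * np.ones((n, n)) + (1.0 - tau) * np.eye(n)

def simulate_cluster(alpha, rho, T):
    """One cluster; returns a (T+1) x n_g array of outcomes."""
    n_g = rng.poisson(alpha) + 1
    L = np.linalg.cholesky(equicorr(n_g, 0.5))
    c = L @ rng.normal(size=n_g)                      # Fc = Normal(0, equicorr(0.5))
    y = np.empty((T + 1, n_g))
    # stationary initial condition: mean c/(1-rho), variance Sigma_u/(1-rho^2)
    y[0] = c / (1 - rho) + (L @ rng.normal(size=n_g)) / np.sqrt(1 - rho ** 2)
    for t in range(1, T + 1):
        y[t] = c + rho * y[t - 1] + L @ rng.normal(size=n_g)   # Sigma_u = equicorr(0.5)
    return y

clusters = [simulate_cluster(alpha=4, rho=0.8, T=5) for _ in range(100)]
print(len(clusters), clusters[0].shape)
```

Drawing the cluster shocks as L times an i.i.d. standard normal vector, with L the Cholesky factor of the equi-correlation matrix, is a standard way to impose the within-cluster correlation of 0.5.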
This estimator weights optimally all available moment conditions that use linear instruments, using the true unobserved optimal weights, so that it is defined by:

∑_{g=1}^G w^g Z^{g′} m^g(ρ̂_UO) = 0    (1.5.1)

w^g = Δ^{g′} (W^g)^{−1}    (1.5.2)

Δ^g = E(Z^{g′} ∂m^g/∂ρ)    (1.5.3)

W^g = E(Z^{g′} m^g m^{g′} Z^g)    (1.5.4)

When Auxiliary Assumptions 1, 2a and 3a hold, the UO estimator is the same as the estimator defined by (1.3.4) and is efficient in the class of estimators using any function of the instruments. When these assumptions hold, Estimator 1 and the unfeasible optimal estimator are also asymptotically equivalent, so that, in small samples, the difference in their performance is due to the extra noise in Estimator 1 from estimating the nuisance parameters needed. (Footnote 10: In most of the scenarios we simulate, transitory shocks are homoscedastic and serially uncorrelated and the dependent variable is stationary, so that the additional moment conditions presented in Arellano and Bover (1995), Ahn and Schmidt (1995) or Blundell and Bond (1998) hold. We do not present estimators that use these moment conditions, however, since we are interested in studying the properties of estimators that are robust to these moment conditions being false.) When Auxiliary Assumptions 2a or 3a are violated, the unfeasible optimal estimator is asymptotically more efficient than Estimator 1. Estimator 1 is asymptotically more efficient than the AB estimator or Estimator 2 when there exists cross-sectional dependence and Auxiliary Assumptions 1, 2a and 3a hold. When Auxiliary Assumptions 1, 2a and 3a hold and there is no cross-sectional dependence, the AB estimator, Estimator 1 and Estimator 2 have the same asymptotic variance.
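The weighting in (1.5.2)-(1.5.4) can be illustrated with sample analogues in a stylized over-identified problem: cross-sectionally independent data with T = 3, instrument y_0 for the t = 2 differenced equation and (y_0, y_1) for t = 3, and the weight matrix evaluated at the true ρ, mirroring the "unfeasible" benchmark (all values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho0 = 20000, 0.5
c = rng.normal(size=n)
y0 = c / (1 - rho0) + rng.normal(size=n) / np.sqrt(1 - rho0 ** 2)
y1 = rho0 * y0 + c + rng.normal(size=n)
y2 = rho0 * y1 + c + rng.normal(size=n)
y3 = rho0 * y2 + c + rng.normal(size=n)

# Three moments: y0*(dy2 - rho*dy1), y0*(dy3 - rho*dy2), y1*(dy3 - rho*dy2)
dy = np.column_stack([y2 - y1, y3 - y2])
dylag = np.column_stack([y1 - y0, y2 - y1])
Zm_y = np.column_stack([y0 * dy[:, 0], y0 * dy[:, 1], y1 * dy[:, 1]])
Zm_x = np.column_stack([y0 * dylag[:, 0], y0 * dylag[:, 1], y1 * dylag[:, 1]])

m0 = Zm_y - rho0 * Zm_x                 # moment contributions at the true rho
Delta = Zm_x.mean(axis=0)               # -E(Z' dm/drho), up to a sign that cancels
W = (m0[:, :, None] * m0[:, None, :]).mean(axis=0)
w = np.linalg.solve(W, Delta)           # optimal linear combination of the moments
rho_hat = (w @ Zm_y.sum(axis=0)) / (w @ Zm_x.sum(axis=0))
print(rho_hat)
```

Solving the single weighted moment condition w′∑(Zm_y − ρ Zm_x) = 0 for ρ gives the ratio in the last line, the exactly identified representation of the optimally weighted GMM estimator.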
When Auxiliary Assumptions 2a or 3a are violated and there is no cross-sectional dependence, the AB estimator has a smaller asymptotic variance than Estimators 1 and 2, but, in finite samples, Estimator 1 or Estimator 2 might still have better properties than the AB estimator because they make use of fewer nuisance parameters. When Auxiliary Assumptions 2a or 3a are violated and there is cross-sectional dependence, which of the AB estimator and Estimator 1 has the smaller asymptotic variance depends on the data generating process, but we expect Estimator 1 to perform better since, by making use of instruments from other observations in the cluster, it should use a weighted sum of moment conditions that is closer to optimal than the sum used for the AB estimator.

For inference for the AB estimator, we consider GMM robust clustered standard errors, with and without the finite sample correction proposed by Windmeijer (2005). For inference for Estimators 1 and 2, we use the standard errors defined in (1.3.22), which only require (1.2.1), (1.2.2) and Auxiliary Assumption 1 to hold in order to be consistent.

We study the small sample properties of the estimators in three different scenarios: within-cluster equi-correlation, cross-sectional independence, and general within-cluster correlation with unobserved heterogeneity that does not have a normal distribution. Scenarios 1 and 2 correspond to Auxiliary Assumptions 1, 2a and 3a holding; in Scenario 1 there is cross-sectional dependence and in Scenario 2 there is none. Scenario 3 corresponds to only Auxiliary Assumption 1 holding.

More precisely, let R(τ) denote the n_g × n_g equi-correlation matrix with ones on the diagonal and τ everywhere off the diagonal. Scenario 1 uses the following parameterization:

F_c = Normal(0, R(0.5))
Σ_u(c^g, y_{t−1}^g, ..., y_0^g) = R(0.5)
μ_0(c^g) = c^g/(1 − ρ_0)
Σ_0(c^g) = (1/(1 − ρ_0²)) Σ_u(c^g, y_{t−1}^g, ..., y_0^g)

Scenario 2 uses:

F_c = Normal(0, I_{n_g})
Σ_u(c^g, y_{t−1}^g, ..., y_0^g) = I_{n_g}
μ_0(c^g) = c^g/(1 − ρ_0)
Σ_0(c^g) = (1/(1 − ρ_0²)) Σ_u(c^g, y_{t−1}^g, ..., y_0^g)

And Scenario 3 uses:

F_c = LogNormal(0, R(0.5))
Σ_u(c^g, y_{t−1}^g, ..., y_0^g) = the n_g × n_g matrix with (k, l) entry equal to u_{k_g,t−1}² if k = l and to 0.5 u_{k_g,t−1} u_{l_g,t−1} if k ≠ l
μ_0(c^g) = c^g/(1 − ρ_0)
Σ_0(c^g) = (1/(1 − ρ_0²)) R(0.5)

All Monte Carlo results were obtained using 1,000 replications. Because Estimators 1 and 2 are iterated versions of our estimator, we present results from simulations conditional on Estimators 1 and 2 converging. Table 1.1 shows the number of replications in which all estimators converged, which represents all or almost all draws except when T = 5, G = 100 and ρ = 0.8; in that case Estimator 1 or Estimator 2 did not converge in 15%-22% of the replications, depending on the scenario. In particular applications, convergence of the iterated Estimators 1 and 2 will depend on the particular numerical algorithm chosen and on properties of the data. For instance, in the application presented in Section 1.6, convergence was achieved in just a few iterations even though T = 3. Tables 1.2, 1.3 and 1.4 show the results for the four estimators considered in terms of bias, standard deviation and root mean squared error for a value of ρ of 0.8. Table 1.2 shows results for the case where there is equi-correlation within clusters (Scenario 1), Table 1.3 the case where there is no cross-sectional correlation (Scenario 2) and Table 1.4 the case where there is heteroscedasticity and cross-sectional correlation (Scenario 3).
The first conclusion from these three tables is that Estimators 1 and 2 exhibit virtually no bias compared to the AB estimator. Estimator 1 also has significantly smaller standard deviations when there is cross-sectional correlation (Scenarios 1 and 3). Both of these features of our estimator result in significantly smaller values of the mean squared error. The smaller standard deviations of our estimator are due to the use of instruments from other observations in the cluster, which are relevant in the presence of cross-sectional dependence. The low bias is attributable to our estimators using very few nuisance parameters compared to the AB estimator. The improvement of Estimator 1 over the AB estimator is particularly striking when T is large and G is small, which is when the AB estimator uses the most nuisance parameters relative to the sample size. When there is no within-cluster correlation (Scenario 2), Estimators 1 and 2 have standard deviations only slightly lower than the AB estimator, so that the decrease in RMSE of Estimators 1 and 2 compared to the AB estimator is mostly due to the elimination of the bias. In Scenario 3, where the unfeasible optimal estimator is asymptotically more efficient than Estimator 1, Estimator 1 performs very closely to the unfeasible optimal estimator, which shows that the approximation of the optimal weighted sum of moment conditions used by Estimator 1 is good in this case.

Tables 1.5, 1.6 and 1.7 show results in terms of bias in standard errors (captured by the ratio of the mean of the standard errors over the standard deviation of the estimators), coverage of 95% confidence intervals, and average length of 95% confidence intervals. All three tables show that standard errors for the AB estimator without the Windmeijer correction are seriously downward biased, particularly when T is large, resulting in very low coverage of 95% confidence intervals (as low as 48%).
The Windmeijer correction yields unbiased standard errors for the AB estimator, but the resulting confidence intervals still have low coverage because of the bias in the AB estimator of ρ. The standard errors for Estimators 1 and 2 are unbiased, and the resulting confidence intervals have the correct coverage of 95%. Because our estimators have smaller standard deviations than the AB estimator, the average length of their 95% confidence intervals is also smaller than that of the AB estimator, so that our estimators have confidence intervals that are both tighter and have the correct coverage.

Tables 1.8-1.13 show the same results for ρ_0 = 0.5. Estimators 1 and 2 show similar improvements over the AB estimator, but slightly less markedly since, with this lower level of persistence, the instruments used by the AB estimator are not as weak as when ρ_0 = 0.8, so that there is less to gain compared to the unfeasible optimal estimator.

Table 1.1: Number of replications where all estimators converged (out of 1,000)

                        ρ = 0.8                              ρ = 0.5
             Scenario 1  Scenario 2  Scenario 3   Scenario 1  Scenario 2  Scenario 3
T=5,  G=100      802         854         781         1000         999        1000
      G=200      906         935         867         1000        1000        1000
      G=400      977         976         942         1000        1000        1000
T=10, G=100      998         992         989         1000         999        1000
      G=200     1000         999        1000         1000        1000        1000
      G=400     1000        1000        1000         1000        1000        1000
T=15, G=100     1000         999        1000         1000         999        1000
      G=200     1000        1000        1000         1000        1000        1000
      G=400      995        1000        1000         1000        1000        1000

Table 1.2: Bias and RMSE, ρ = .8, equi-correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.031  −0.158  −0.037  −0.045
              sd      0.121   0.176   0.137   0.191
              rmse    0.125   0.236   0.142   0.196
T=5,  G=200   bias   −0.018  −0.087  −0.018  −0.025
              sd      0.089   0.127   0.093   0.134
              rmse    0.091   0.154   0.095   0.136
T=5,  G=400   bias   −0.001  −0.033  −0.002   0.000
              sd      0.064   0.092   0.065   0.097
              rmse    0.064   0.098   0.065   0.097
T=10, G=100   bias    0.001  −0.060   0.001   0.001
              sd      0.047   0.067   0.048   0.068
              rmse    0.047   0.090   0.048   0.068
T=10, G=200   bias   −0.001  −0.033  −0.001  −0.003
              sd      0.034   0.048   0.034   0.047
              rmse    0.034   0.058   0.034   0.047
T=10, G=400   bias    0.001  −0.016   0.001   0.000
              sd      0.024   0.035   0.024   0.034
              rmse    0.024   0.038   0.024   0.034
T=15, G=100   bias   −0.001  −0.041  −0.000  −0.003
              sd      0.028   0.038   0.028   0.037
              rmse    0.028   0.056   0.028   0.037
T=15, G=200   bias    0.000  −0.022   0.000  −0.001
              sd      0.018   0.027   0.018   0.026
              rmse    0.018   0.034   0.018   0.026
T=15, G=400   bias    0.000  −0.010   0.000   0.000
              sd      0.013   0.019   0.013   0.018
              rmse    0.013   0.021   0.013   0.018

Table 1.3: Bias and RMSE, ρ = .8, no correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.028  −0.097  −0.030  −0.034
              sd      0.121   0.119   0.135   0.127
              rmse    0.124   0.153   0.138   0.131
T=5,  G=200   bias   −0.013  −0.046  −0.013  −0.014
              sd      0.093   0.091   0.097   0.094
              rmse    0.093   0.102   0.098   0.095
T=5,  G=400   bias   −0.001  −0.020  −0.001  −0.002
              sd      0.063   0.063   0.066   0.065
              rmse    0.063   0.066   0.066   0.065
T=10, G=100   bias    0.000  −0.033   0.000  −0.000
              sd      0.046   0.050   0.047   0.047
              rmse    0.046   0.060   0.047   0.047
T=10, G=200   bias   −0.001  −0.018  −0.001  −0.002
              sd      0.033   0.035   0.033   0.033
              rmse    0.033   0.039   0.033   0.033
T=10, G=400   bias    0.001  −0.008   0.001   0.001
              sd      0.024   0.025   0.024   0.024
              rmse    0.024   0.026   0.024   0.024
T=15, G=100   bias   −0.001  −0.023  −0.001  −0.001
              sd      0.028   0.031   0.028   0.028
              rmse    0.028   0.038   0.028   0.028
T=15, G=200   bias    0.000  −0.011   0.000   0.000
              sd      0.018   0.020   0.018   0.018
              rmse    0.018   0.023   0.018   0.018
T=15, G=400   bias    0.000  −0.005   0.000   0.000
              sd      0.013   0.013   0.013   0.013
              rmse    0.013   0.014   0.013   0.013

Table 1.4: Bias and RMSE, ρ = .8, heteroscedasticity and correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.027  −0.226  −0.065  −0.049
              sd      0.183   0.218   0.271   0.367
              rmse    0.185   0.314   0.279   0.370
T=5,  G=200   bias   −0.025  −0.140  −0.027  −0.033
              sd      0.125   0.166   0.135   0.203
              rmse    0.128   0.217   0.137   0.206
T=5,  G=400   bias   −0.007  −0.069  −0.008  −0.010
              sd      0.084   0.121   0.085   0.131
              rmse    0.085   0.139   0.086   0.131
T=10, G=100   bias    0.001  −0.075  −0.000   0.000
              sd      0.055   0.075   0.056   0.074
              rmse    0.055   0.106   0.056   0.074
T=10, G=200   bias    0.000  −0.042   0.000  −0.000
              sd      0.039   0.055   0.039   0.054
              rmse    0.039   0.069   0.039   0.054
T=10, G=400   bias    0.001  −0.019   0.001   0.001
              sd      0.027   0.039   0.027   0.038
              rmse    0.027   0.043   0.027   0.038
T=15, G=100   bias   −0.001  −0.046  −0.000  −0.002
              sd      0.031   0.041   0.030   0.040
              rmse    0.031   0.062   0.030   0.040
T=15, G=200   bias    0.001  −0.024   0.001   0.000
              sd      0.020   0.029   0.020   0.028
              rmse    0.020   0.037   0.020   0.028
T=15, G=400   bias    0.001  −0.012   0.000   0.000
              sd      0.015   0.021   0.014   0.020
              rmse    0.015   0.024   0.014   0.020

Table 1.5: Inference, ρ = .8, equi-correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.121  0.788  1.051  1.016  1.002
              coverage  0.969  0.763  0.895  0.964  0.959
              length    0.539  0.550  0.749  0.552  0.760
T=5,  G=200   ratio     1.063  0.834  1.045  1.024  0.978
              coverage  0.956  0.796  0.917  0.951  0.956
              length    0.375  0.417  0.525  0.376  0.518
T=5,  G=400   ratio     1.073  0.863  1.029  1.042  0.977
              coverage  0.976  0.888  0.939  0.969  0.954
              length    0.268  0.313  0.376  0.268  0.372
T=10, G=100   ratio     0.987  0.661  1.004  0.962  0.943
              coverage  0.948  0.653  0.873  0.952  0.949
              length    0.184  0.176  0.270  0.184  0.254
T=10, G=200   ratio     0.983  0.734  1.003  0.975  0.961
              coverage  0.949  0.760  0.905  0.949  0.944
              length    0.130  0.139  0.194  0.130  0.179
T=10, G=400   ratio     0.974  0.766  0.975  0.968  0.943
              coverage  0.953  0.835  0.922  0.952  0.939
              length    0.092  0.104  0.134  0.091  0.127
T=15, G=100   ratio     0.941  0.561  1.015  0.939  0.983
              coverage  0.941  0.480  0.815  0.943  0.948
              length    0.105  0.084  0.140  0.105  0.144
T=15, G=200   ratio     1.025  0.688  1.053  1.022  1.006
              coverage  0.961  0.696  0.916  0.958  0.951
              length    0.074  0.072  0.110  0.074  0.102
T=15, G=400   ratio     1.012  0.763  1.024  1.013  1.006
              coverage  0.952  0.823  0.925  0.957  0.944
              length    0.052  0.057  0.078  0.052  0.072

Table 1.6: Inference, ρ = .8, no correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.140  1.042  1.142  1.013  1.060
              coverage  0.978  0.883  0.923  0.959  0.960
              length    0.548  0.491  0.558  0.542  0.533
T=5,  G=200   ratio     1.038  1.000  1.053  0.984  1.004
              coverage  0.961  0.909  0.932  0.954  0.948
              length    0.379  0.359  0.383  0.377  0.372
T=5,  G=400   ratio     1.082  1.052  1.073  1.040  1.041
              coverage  0.975  0.947  0.951  0.968  0.965
              length    0.269  0.262  0.271  0.268  0.267
T=10, G=100   ratio     1.006  0.820  1.007  0.974  0.975
              coverage  0.958  0.836  0.910  0.951  0.952
              length    0.185  0.163  0.204  0.182  0.182
T=10, G=200   ratio     0.990  0.890  0.971  0.981  0.979
              coverage  0.950  0.884  0.918  0.951  0.949
              length    0.130  0.122  0.142  0.129  0.129
T=10, G=400   ratio     0.976  0.920  0.975  0.964  0.965
              coverage  0.952  0.914  0.932  0.950  0.950
              length    0.092  0.089  0.097  0.091  0.091
T=15, G=100   ratio     0.949  0.686  0.959  0.935  0.937
              coverage  0.948  0.730  0.866  0.943  0.944
              length    0.106  0.084  0.105  0.105  0.105
T=15, G=200   ratio     1.027  0.846  1.021  1.022  1.020
              coverage  0.962  0.860  0.925  0.957  0.958
              length    0.074  0.066  0.082  0.074  0.074
T=15, G=400   ratio     1.011  0.935  1.033  1.008  1.009
              coverage  0.952  0.902  0.945  0.955  0.956
              length    0.052  0.050  0.058  0.052  0.052

Table 1.7: Inference, ρ = .8, heteroscedasticity and correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.027  0.773  1.088  0.834  0.812
              coverage  0.963  0.666  0.872  0.949  0.940
              length    0.747  0.669  0.943  0.897  1.181
T=5,  G=200   ratio     1.028  0.801  1.033  0.943  0.854
              coverage  0.945  0.736  0.884  0.955  0.934
              length    0.507  0.523  0.680  0.501  0.684
T=5,  G=400   ratio     1.078  0.845  1.027  1.036  0.919
              coverage  0.969  0.831  0.901  0.960  0.947
              length    0.358  0.402  0.491  0.348  0.474
T=10, G=100   ratio     0.973  0.652  1.009  0.937  0.954
              coverage  0.948  0.608  0.857  0.942  0.954
              length    0.214  0.193  0.301  0.208  0.280
T=10, G=200   ratio     0.968  0.711  0.982  0.951  0.936
              coverage  0.934  0.729  0.899  0.932  0.929
              length    0.150  0.155  0.218  0.146  0.199
T=10, G=400   ratio     0.980  0.772  0.988  0.968  0.942
              coverage  0.953  0.829  0.926  0.947  0.943
              length    0.106  0.117  0.152  0.103  0.141
T=15, G=100   ratio     0.955  0.537  1.002  0.939  0.965
              coverage  0.940  0.478  0.800  0.940  0.951
              length    0.117  0.088  0.149  0.113  0.152
T=15, G=200   ratio     1.016  0.681  1.028  1.006  0.996
              coverage  0.964  0.700  0.897  0.952  0.947
              length    0.082  0.077  0.118  0.079  0.109
T=15, G=400   ratio     0.977  0.746  1.004  0.988  0.978
              coverage  0.945  0.810  0.914  0.943  0.945
              length    0.058  0.061  0.085  0.056  0.077

Table 1.8: Bias and RMSE, ρ = .5, equi-correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.001  −0.022   0.000   0.002
              sd      0.063   0.087   0.063   0.085
              rmse    0.063   0.090   0.063   0.085
T=5,  G=200   bias   −0.002  −0.015  −0.002  −0.004
              sd      0.045   0.064   0.045   0.064
              rmse    0.045   0.066   0.045   0.064
T=5,  G=400   bias   −0.000  −0.006   0.000   0.000
              sd      0.031   0.044   0.031   0.044
              rmse    0.031   0.045   0.031   0.044
T=10, G=100   bias    0.001  −0.015   0.001   0.002
              sd      0.029   0.041   0.029   0.040
              rmse    0.029   0.044   0.029   0.040
T=10, G=200   bias   −0.001  −0.008  −0.000  −0.001
              sd      0.021   0.029   0.021   0.028
              rmse    0.021   0.030   0.021   0.028
T=10, G=400   bias    0.000  −0.004   0.000  −0.000
              sd      0.014   0.021   0.014   0.020
              rmse    0.014   0.021   0.014   0.020
T=15, G=100   bias    0.000  −0.014   0.000  −0.001
              sd      0.020   0.028   0.020   0.027
              rmse    0.020   0.031   0.020   0.027
T=15, G=200   bias    0.000  −0.007   0.000  −0.000
              sd      0.013   0.019   0.013   0.018
              rmse    0.013   0.020   0.013   0.018
T=15, G=400   bias    0.000  −0.003   0.000   0.000
              sd      0.010   0.014   0.010   0.013
              rmse    0.010   0.014   0.010   0.013

Table 1.9: Bias and RMSE, ρ = .5, no correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias   −0.001  −0.013  −0.000  −0.001
              sd      0.063   0.063   0.062   0.063
              rmse    0.063   0.065   0.062   0.063
T=5,  G=200   bias   −0.002  −0.008  −0.002  −0.002
              sd      0.045   0.045   0.045   0.045
              rmse    0.045   0.046   0.045   0.045
T=5,  G=400   bias   −0.000  −0.003   0.000  −0.000
              sd      0.031   0.031   0.031   0.031
              rmse    0.031   0.031   0.031   0.031
T=10, G=100   bias    0.001  −0.008   0.001   0.001
              sd      0.029   0.031   0.029   0.029
              rmse    0.029   0.032   0.029   0.029
T=10, G=200   bias   −0.001  −0.005  −0.000  −0.001
              sd      0.021   0.022   0.021   0.021
              rmse    0.021   0.022   0.021   0.021
T=10, G=400   bias    0.000  −0.002   0.000   0.000
              sd      0.014   0.015   0.014   0.014
              rmse    0.014   0.015   0.014   0.014
T=15, G=100   bias    0.000  −0.007   0.000   0.000
              sd      0.020   0.022   0.020   0.020
              rmse    0.020   0.023   0.020   0.020
T=15, G=200   bias    0.000  −0.003   0.000   0.000
              sd      0.013   0.014   0.013   0.013
              rmse    0.013   0.015   0.013   0.013
T=15, G=400   bias    0.000  −0.002   0.000   0.000
              sd      0.010   0.010   0.010   0.010
              rmse    0.010   0.010   0.010   0.010

Table 1.10: Bias and RMSE, ρ = .5, heteroscedasticity and correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, Estimator 1, Estimator 2)

T=5,  G=100   bias    0.001  −0.035   0.001   0.003
              sd      0.079   0.109   0.076   0.103
              rmse    0.079   0.114   0.076   0.103
T=5,  G=200   bias   −0.003  −0.022  −0.002  −0.004
              sd      0.058   0.080   0.055   0.078
              rmse    0.058   0.083   0.055   0.078
T=5,  G=400   bias   −0.001  −0.012  −0.001  −0.002
              sd      0.039   0.057   0.038   0.054
              rmse    0.039   0.058   0.038   0.054
T=10, G=100   bias    0.001  −0.017   0.001   0.002
              sd      0.033   0.045   0.032   0.043
              rmse    0.033   0.048   0.032   0.043
T=10, G=200   bias    0.000  −0.009   0.000   0.000
              sd      0.023   0.032   0.023   0.031
              rmse    0.023   0.034   0.023   0.031
T=10, G=400   bias    0.001  −0.004   0.001   0.000
              sd      0.016   0.023   0.015   0.022
              rmse    0.016   0.023   0.016   0.022
T=15, G=100   bias    0.000  −0.015   0.000  −0.001
              sd      0.022   0.030   0.021   0.028
              rmse    0.022   0.033   0.021   0.028
T=15, G=200   bias    0.001  −0.006   0.001   0.001
              sd      0.014   0.021   0.014   0.020
              rmse    0.014   0.022   0.014   0.020
T=15, G=400   bias    0.000  −0.004   0.000   0.000
              sd      0.011   0.015   0.010   0.014
              rmse    0.011   0.015   0.010   0.014

Table 1.11: Inference, ρ = .5, equi-correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.009  0.818  1.034  1.004  1.014
              coverage  0.959  0.875  0.948  0.957  0.951
              length    0.251  0.282  0.362  0.250  0.344
T=5,  G=200   ratio     0.985  0.805  0.968  0.978  0.956
              coverage  0.945  0.875  0.943  0.944  0.942
              length    0.175  0.204  0.249  0.175  0.242
T=5,  G=400   ratio     1.030  0.851  1.000  1.024  0.993
              coverage  0.961  0.907  0.950  0.960  0.950
              length    0.124  0.148  0.175  0.124  0.172
T=10, G=100   ratio     0.969  0.682  0.986  0.969  0.974
              coverage  0.931  0.795  0.933  0.932  0.939
              length    0.112  0.111  0.165  0.112  0.153
T=10, G=200   ratio     0.973  0.746  0.977  0.975  0.987
              coverage  0.944  0.841  0.935  0.942  0.948
              length    0.079  0.086  0.117  0.079  0.110
T=10, G=400   ratio     0.997  0.790  0.979  0.997  0.969
              coverage  0.952  0.873  0.946  0.950  0.946
              length    0.056  0.064  0.081  0.056  0.077
T=15, G=100   ratio     0.964  0.561  1.002  0.961  0.983
              coverage  0.945  0.666  0.925  0.943  0.950
              length    0.076  0.062  0.104  0.076  0.104
T=15, G=200   ratio     1.016  0.696  1.052  1.020  1.018
              coverage  0.952  0.795  0.951  0.955  0.950
              length    0.053  0.053  0.079  0.053  0.074
T=15, G=400   ratio     0.993  0.765  1.008  0.994  0.996
              coverage  0.949  0.862  0.939  0.951  0.942
              length    0.038  0.041  0.056  0.038  0.052

Table 1.12: Inference, ρ = .5, no correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio     1.015  0.960  1.022  1.007  1.004
              coverage  0.960  0.928  0.947  0.952  0.953
              length    0.252  0.242  0.264  0.249  0.249
T=5,  G=200   ratio     0.990  0.960  0.990  0.978  0.980
              coverage  0.944  0.936  0.938  0.943  0.943
              length    0.176  0.172  0.180  0.175  0.175
T=5,  G=400   ratio     1.032  1.010  1.027  1.026  1.026
              coverage  0.960  0.954  0.955  0.962  0.963
              length    0.125  0.123  0.126  0.124  0.124
T=10, G=100   ratio     0.976  0.828  0.985  0.968  0.968
              coverage  0.938  0.880  0.942  0.928  0.932
              length    0.113  0.102  0.125  0.111  0.111
T=10, G=200   ratio     0.973  0.885  0.948  0.977  0.975
              coverage  0.945  0.917  0.927  0.943  0.942
              length    0.079  0.075  0.087  0.079  0.079
T=10, G=400   ratio     0.999  0.950  0.991  0.994  0.996
              coverage  0.952  0.938  0.952  0.951  0.951
              length    0.056  0.054  0.059  0.056  0.056
T=15, G=100   ratio     0.970  0.701  0.970  0.961  0.962
              coverage  0.952  0.811  0.931  0.941  0.942
              length    0.076  0.061  0.077  0.076  0.076
T=15, G=200   ratio     1.018  0.852  1.017  1.020  1.018
              coverage  0.956  0.906  0.948  0.955  0.954
              length    0.054  0.048  0.059  0.053  0.053
T=15, G=400   ratio     0.996  0.930  1.012  0.995  0.995
              coverage  0.950  0.926  0.955  0.949  0.950
              length    0.038  0.036  0.041  0.038  0.038

Table 1.13: Inference, ρ = .5, heteroscedasticity and correlation within clusters
(Columns: Unfeasible Optimal Estimator, Arellano and Bond Estimator, AB with Windmeijer correction, Estimator 1, Estimator 2)

T=5,  G=100   ratio
1.019 0.798 1.025 1.016 0.992 coverage 0.955 0.857 0.935 0.959 0.951 length 0.320 0.345 0.448 0.306 0.407 ratio 0.983 0.809 0.977 0.994 0.943 coverage 0.948 0.864 0.936 0.953 0.944 length 0.223 0.255 0.313 0.215 0.290 ratio 1.018 0.828 0.985 1.020 0.982 coverage 0.961 0.883 0.928 0.962 0.946 length 0.157 0.185 0.220 0.152 0.207 ratio 0.956 0.680 0.983 0.957 0.968 coverage 0.938 0.795 0.929 0.930 0.949 length 0.126 0.120 0.179 0.122 0.164 ratio 0.962 0.739 0.971 0.961 0.971 coverage 0.937 0.840 0.931 0.935 0.942 length 0.088 0.094 0.128 0.086 0.117 ratio 0.997 0.793 0.984 0.995 0.969 coverage 0.952 0.877 0.945 0.953 0.942 length 0.062 0.070 0.089 0.061 0.083 ratio 0.957 0.542 0.988 0.956 0.974 coverage 0.945 0.650 0.915 0.939 0.944 length 0.082 0.064 0.109 0.080 0.108 ratio 1.022 0.678 0.999 1.026 0.988 coverage 0.961 0.807 0.935 0.960 0.940 length 0.058 0.055 0.083 0.056 0.077 ratio 0.971 0.758 1.002 0.980 0.983 coverage 0.946 0.858 0.939 0.950 0.942 length 0.041 0.043 0.059 0.040 0.055 T=10 G=100 G=200 G=400 T=15 G=100 G=200 G=400 35 1.6 Application: Estimation of Persistence in Student Achievement In this section, we are interested in estimating the effect of attending private schools on student achievement in the province of Punjab, Pakistan. In a non-experimental framework, estimating the causal effects of some factors on student achievement requires accounting for factors that affected student achievements in previous time periods since these factors might affect students’ learning ability in the future but also be correlated across time. A model for studying the effect of some factor x on student achievement y can be written as in Andrabi et al. (2011) in its summary of the work in Todd and Wolpin (2003): t−1 yit = ∑ t−1 α j xit− j + j=0 ∑ θ j µit− j (1.6.1) j=0 where µt are unobserved shocks to student achievement. 
If one assumes that both {α_j}_{j=1,...,T} and {θ_j}_{j=1,...,T} form geometric series, so that α_j = ρ α_{j−1} and θ_j = ρ θ_{j−1}, we can write:

y_{it} = α x_{it} + ρ y_{it−1} + μ_{it}   (1.6.2)

where θ_0 is normalized to one. In order to account for the possibility that students have unobserved characteristics that affect their ability to learn and are related to other educational inputs in x_{it}, we can decompose μ_{it} into time constant unobserved factors (also called unobserved heterogeneity) and transitory shocks:

μ_{it} = c_i + u_{it}   (1.6.3)

where c_i is arbitrarily related to x_i = [x_{i1}, ..., x_{iT}] and x_{it} is either strictly exogenous:

E(u_{it} | X_i, Y_{it−1}) = 0   (1.6.4)

or sequentially exogenous:

E(u_{it} | X_{it}, Y_{it−1}) = 0   (1.6.5)

or contemporaneously endogenous:

E(u_{it} | X_{it−1}, Y_{it−1}) = 0   (1.6.6)

where X_i = [x_{i1}, ..., x_{iT}] and X_{it−1} = [x_{i1}, ..., x_{it−1}]. In this section we use the data analyzed in Andrabi et al. (2011) to estimate the effect of attending private school on student achievement in three districts of Punjab in Pakistan, so that the input of interest is attendance of a private school. The other covariates included are wealth and variables indicating whether each parent lives with the student. We treat all of these inputs as contemporaneously endogenous, since they likely follow dynamic processes with unobserved transitory shocks that are correlated with shocks to student achievement. For instance, an unobserved and unexpected increase in income might result in a student enrolling in private school while also benefiting from better study conditions at home, so that Cov(u_{it}, privateschool_{it}) ≠ 0, while it is still possible that Cov(u_{it}, privateschool_{it−1}) = 0. Transitory shocks are likely correlated within schools, since school- or class-level unobserved shocks, such as changes in infrastructure, staff or teachers, affect all students within a school or class.
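The step from (1.6.1) to (1.6.2) is the usual quasi-differencing argument; under the geometric-coefficient assumption the lagged terms cancel:

```latex
y_{it} - \rho y_{it-1}
  = \sum_{j=0}^{t-1} \alpha_j x_{it-j} - \sum_{j=0}^{t-2} \rho\,\alpha_j x_{it-1-j}
  + \sum_{j=0}^{t-1} \theta_j \mu_{it-j} - \sum_{j=0}^{t-2} \rho\,\theta_j \mu_{it-1-j}
  = \alpha_0 x_{it} + \theta_0 \mu_{it}
```

since ρα_j = α_{j+1} and ρθ_j = θ_{j+1} make every term but the j = 0 terms cancel after reindexing; with θ_0 = 1 and α_0 written as α this is exactly (1.6.2).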
The data set we use collected between 0 and 25 students per school in each year, with most schools represented by fewer than 10 students, which is too few to estimate time-varying school fixed effects accurately. Instead, we prefer treating u_{it} as cross-sectionally correlated within schools. Unobserved heterogeneity is also likely correlated across students within schools, since students might attend specific schools based on unobserved characteristics, such as residential location, socio-economic characteristics or past achievement, that relate to their performance. As described in the rest of the paper, using this cross-sectional correlation for estimation can result in significant efficiency gains.

For any subject j among English, Urdu and Mathematics (denoted E, U, M), denote by y^j_{it} the grade obtained by student i in year t and subject j, and denote by x_{it} the variable indicating whether student i attended a private school in year t. Let w_{it} be the vector containing the other predetermined explanatory variables for student i at time period t. Denote by u^j_{it} transitory shocks to achievement in subject j and by ε^j_{it} measurement error in that achievement. Also denote by g_{it} the school attended by student i in year t. We assume clustering, so that, in a given year, transitory shocks are independent across schools. We consider a model with measurement error and contemporaneously endogenous covariates. As in Andrabi et al. (2011), we assume that measurement errors are independent across subjects. We can write such a model as:

y^j_{it} = d^j_t + α^j_0 x_{it} + ρ^j_0 y^j_{it−1} + w_{it} β^j_0 + c^j_i + u^j_{it} + ε^j_{it} − ρ^j_0 ε^j_{it−1}   (1.6.7)

E(u^j_{it} | X_{t−1}, Y_{t−1}, W_{t−1}, g_{t−1}) = 0   (1.6.8)

E(ε^j_{it} | X_t, Y^{−j}_t, W_{t−1}, g_{t−1}) = 0   (1.6.9)

where Y_t = {Y^k_t}_{k=E,U,M}, Y^{−j}_t = {Y^k_t}_{k≠j} and d^j_t are time specific intercepts. The first difference with the model used in Andrabi et al. (2011) is that we use predetermined instruments instead of sequentially exogenous instruments. This is a more suitable assumption since, as explained previously, the covariates used in this model are likely jointly determined with achievement. The second difference is that we include as potential instruments the lagged values of the covariates for all observations instead of only X_{it−1}, Y^{−j}_{it−1}, W_{it−1}. As pointed out in Section 1.2, since u^j_{it} is correlated cross-sectionally, it is unlikely that E(u^j_{it} | X_{t−1}, Y_{it−1}, W_{it−1}) = 0 holds without E(u^j_{it} | X_{t−1}, Y_{t−1}, W_{t−1}) = 0 also holding. It could be interesting to introduce peer effects into the model, but we do not consider them here for simplicity and for comparability with the results in Andrabi et al. (2011).

In this application, clusters (school membership) are not time constant and, as pointed out previously, not strictly or sequentially exogenous. Therefore it is possible that:

E(u_{it} | g_{it}, X_{t−1}, Y_{t−1}, W_{t−1}) ≠ 0   (1.6.10)

even though:

E(u_{it} | g_{it−1}, X_{t−1}, Y_{t−1}, W_{t−1}) = 0   (1.6.11)

Hence we can use as instruments the lagged achievements of students from schools in which a student was previously enrolled, but not from the school in which the student is currently enrolled. There are three time periods t = 0, 1, 2 available in the data set, so the only transformed equation for each subject that can be used for estimation is:

Δy^j_{i2} = δ^j_0 + α^j_0 Δx_{i2} + ρ^j_0 Δy^j_{i1} + Δw_{i2} β^j_0 + Δu^j_{i2} + Δε^j_{i2} − ρ^j_0 Δε^j_{i1}   (1.6.12)

E(Δu^j_{i2} | X_0, Y^{−j}_0, W_0, g_0) = 0   (1.6.13)

E(Δε^j_{i2} | X_0, Y^{−j}_0, W_0, g_0) = 0   (1.6.14)

E(Δε^j_{i1} | X_0, Y^{−j}_0, W_0, g_0) = 0   (1.6.15)

where δ^j_0 = d^j_2 − d^j_1. Let φ^j = [δ^j, α^j, ρ^j, β^j]′, m^j_i(φ^j) = Δy^j_{i2} − (δ^j + α^j Δx_{i2} + ρ^j Δy^j_{i1} + Δw_{i2} β^j) and m^j_i = m^j_i(φ^j_0).
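With the transformed moment function m^j_i(φ^j) linear in the parameters and instruments dated at t = 0, the two-step GMM recipe used throughout this section can be sketched as follows. This is a generic sketch, not the exact code behind the reported estimates; the variable names and the simulated design in the test are ours.

```python
import numpy as np

def two_step_gmm(Z, X, y):
    """Two-step linear GMM for moments E[Z_i (y_i - X_i' phi)] = 0.

    Z : (n, q) matrix of instruments (one row per observation)
    X : (n, p) matrix of regressors, e.g. [1, dx_i2, dy_i1, dw_i2]
    y : (n,)   transformed outcome, e.g. dy_i2
    Returns the two-step GMM estimate of phi.
    """
    n = Z.shape[0]
    # First step: weight matrix (Z'Z/n)^{-1}, i.e. a 2SLS-type estimator.
    W1 = np.linalg.inv(Z.T @ Z / n)
    ZX, Zy = Z.T @ X / n, Z.T @ y / n
    phi1 = np.linalg.solve(ZX.T @ W1 @ ZX, ZX.T @ W1 @ Zy)
    # Second step: weight by the inverse of the estimated moment variance,
    # computed from first-step residuals.
    u = y - X @ phi1
    S = (Z * u[:, None]).T @ (Z * u[:, None]) / n
    W2 = np.linalg.inv(S)
    return np.linalg.solve(ZX.T @ W2 @ ZX, ZX.T @ W2 @ Zy)
```

The second-step weighting is what makes the estimator efficient for the given set of unconditional moments; the Arellano and Bond estimator defined next follows this template with its particular choice of Z.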
The Arellano and Bond estimator for this model is defined by:

φ̂^j_{AB} = argmin_{φ^j} ( Σ^n_{i=1} Z^{jAB}_i m^j_i(φ^j) )′ ( Σ^n_{i=1} Z^{jAB}_i m^j_i(φ̃^j) m^j_i(φ̃^j)′ Z^{jAB′}_i )^{−1} ( Σ^n_{i=1} Z^{jAB}_i m^j_i(φ^j) )   (1.6.16)

where φ̃^j is a preliminary estimator of φ^j_0 and Z^{jAB}_i = [1, y^{−j}_{i0}, x_{i0}, w_{i0}]′.

This estimator is inefficient because it ignores cross-sectional dependence.11 Using the previous results in this paper, we can specify auxiliary assumptions so that an estimator can be derived which is consistent as long as the identifying assumptions defined above hold and efficient if the auxiliary assumptions also hold. The first auxiliary assumption we can use is conditional homoscedasticity and cluster equi-correlation. For j = U, E, M:

Cov(u^j_{it}, u^j_{ls} | Y^{−j}_0, X_0, W_0, g_0) = σ²_{u_j} if i = l, t = s   (1.6.17)
 = τ_j σ²_{u_j} if i ≠ l, g_{it} = g_{ls}, t = s   (1.6.18)
 = 0 otherwise   (1.6.19)

and:

Cov(u^j_{it}, ε^k_{ls} | Y^{−j}_0, X_0, W_0, g_0) = 0 ∀ i, l, j, k, t, s   (1.6.20)

Cov(ε^j_{it}, ε^k_{ls} | Y^{−j}_0, X_0, W_0, g_0) = σ²_{ε_j} if i = l, j = k, t = s   (1.6.21)
 = 0 otherwise   (1.6.22)

Under this assumption:

Cov(m^j_i, m^j_l | Y^{−j}_0, X_0, W_0, g_0) = 2σ²_{u_j} + 2σ²_{ε_j}(1 + ρ + ρ²) if i = l   (1.6.23)
 = τ_j σ²_{u_j} ( 1[g_{i1} = g_{l1}] + 1[g_{i2} = g_{l2}] ) if i ≠ l   (1.6.24)

Under the previous auxiliary assumption, the optimal instruments for m^j_i(φ^j) will be linear functions of E(Δy^j_{i1} | Y^{−j}_0, X_0, W_0, g_0), E(Δx_{i2} | Y^{−j}_0, X_0, W_0, g_0) and E(Δw_{i2} | Y^{−j}_0, X_0, W_0, g_0). Since we have T = 2, so that there is only one transformed equation available for estimation, we can use the simple second auxiliary

11 Without measurement error, it would also be possible to use the correlation of transitory shocks across subjects to obtain an efficient joint estimator of {φ^j}_{j=U,E,M}. However, because of measurement error, the sets of instruments across subjects are non-overlapping, so that optimal instruments cannot be derived.
Since there is no restriction on the parameters across equations, optimal weighting of the moment conditions across subjects or minimum distance methods cannot be used either.

assumption:

E(Δy^j_{i1} | Y^{−j}_0, X_0, W_0, g_0) = a^j_0 + Σ_{k≠j} a^j_{01k} y^k_{i0} + Σ_{k≠j} a^j_{02k} (1/#g_{i0}) Σ_{l∈g_{i0}} y^k_{l0} + a^j_{03} x_{i0} + a^j_{04} w_{i0} + a^j_{05} (1/#g_{i0}) Σ_{l∈g_{i0}} w_{l0}   (1.6.25)

E(Δz_{i2} | Y^{−j}_0, X_0, W_0, g_0) = b^{zj}_0 + Σ_{k≠j} b^{zj}_{01k} y^k_{i0} + Σ_{k≠j} b^{zj}_{02k} (1/#g_{i0}) Σ_{l∈g_{i0}} y^k_{l0} + b^{zj}_{03} x_{i0} + b^{zj}_{04} w_{i0} + b^{zj}_{05} (1/#g_{i0}) Σ_{l∈g_{i0}} w_{l0}   (1.6.26)

where z ∈ {x, w}, and consistently estimate these unknown parameters by OLS regression.

Define Ê^{Δy^j}_i, Ê^{Δx}_i and Ê^{Δw}_i to be the estimated conditional expectations defined by (1.6.25) and (1.6.26). Define:

D^j_i = [ Ê^{Δy^j}_i, Ê^{Δx}_i, Ê^{Δw}_i ]′   (1.6.27)

and D^j = [D^j_1, ..., D^j_n]. Define Σ̂^j(φ^j) = [ Ĉov(m^j_i(φ^j), m^j_l(φ^j)) ]_{i,l=1,...,n} and:

m^j(φ^j) = [m^j_1(φ^j), ..., m^j_n(φ^j)]′   (1.6.28)

The efficient estimator of φ^j_0 under the auxiliary assumptions is φ̂^j_{opt} defined by:

D^j Σ̂^j(φ̂^j_{opt})^{−1} m^j(φ̂^j_{opt}) = 0   (1.6.29)

Let M^j_i = [1, Δx_{i2}, Δy^j_{i1}, Δw_{i2}]′ = −( ∂m^j_i(φ^j)/∂φ^j )′. Both the Arellano and Bond estimator and our optimal estimator can be written as:

Σ^n_{i=1} Z^j_i m^j_i(φ̂^j) = 0   (1.6.30)

where for the Arellano and Bond estimator:

Z^j_i = ( Σ^n_{l=1} M^j_l Z^{jAB′}_l ) Θ̂^{jAB−1} Z^{jAB}_i   (1.6.31)

with:

Θ̂^{jAB} = Σ^n_{i=1} Z^{jAB}_i m^j_i(φ̃^j) m^j_i(φ̃^j)′ Z^{jAB′}_i   (1.6.32)

For our optimal estimator, Z^j_i is the ith column of D^j Σ̂^j(φ̂^j_{opt})^{−1}. Under the assumption that transitory shocks are independent across schools, φ̂^j is consistent for φ^j_0 and asymptotically normal.
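The first-stage projections (1.6.25)-(1.6.26) are ordinary least squares regressions of each transformed variable on own baseline scores, covariates, and their school-level averages. A minimal sketch, with scalar y0, x0 and w0 for readability (the paper's regressors are vectors):

```python
import numpy as np

def cluster_means(v, g):
    """Within-cluster average of v, returned for each observation."""
    out = np.empty_like(v, dtype=float)
    for c in np.unique(g):
        idx = g == c
        out[idx] = v[idx].mean()
    return out

def first_stage_fitted(dep, y0, x0, w0, g0):
    """OLS projection of a transformed variable (e.g. the dy_i1 of (1.6.25))
    on [1, y0, school-mean(y0), x0, w0, school-mean(w0)]; returns the fitted
    values, i.e. the estimated conditional expectations used to build D_i."""
    R = np.column_stack([
        np.ones_like(dep), y0, cluster_means(y0, g0),
        x0, w0, cluster_means(w0, g0),
    ])
    coef, *_ = np.linalg.lstsq(R, dep, rcond=None)
    return R @ coef
```

Stacking the fitted values for Δy^j_{i1}, Δx_{i2} and Δw_{i2} gives the D^j_i of (1.6.27).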
The asymptotic variance-covariance matrix of both estimators is12:

AVar(φ̂^j) = A^{−1} B A^{−1}′   (1.6.33)

A = plim (1/n) Σ^n_{i=1} Z^j_i M^j_i′   (1.6.34)

B = plim (1/n) Σ^n_{i=1} Σ^n_{l=1} Z^j_i m^j_i m^j_l Z^j_l′   (1.6.35)

 = plim (1/n) Σ^n_{i=1} Σ^n_{l=1} 1[{g_{it} = g_{ls}}_{t,s=1,2}] Z^j_i m^j_i m^j_l Z^j_l′   (1.6.36)

which can be estimated consistently since there is a small number of observations in each school.

Students' achievement in each subject was measured by the results obtained on a test administered by the authors of Andrabi et al. (2011) and graded using Item Response Theory, so that scores can be compared across students and years; the standard deviation of scores in the first year (third grade) is one. Table 1.14 shows the averages and standard deviations of scores by subject and grade. Table 1.15 reports the estimated degree of persistence and the estimated effect of attending private school on performance for the three subjects considered, together with the associated standard errors and 95% confidence intervals. As in Andrabi et al. (2011), we find significant persistence in scores except for Mathematics. We estimate smaller effects of attending private school than Andrabi et al. (2011), which can be attributed to their treating the covariates as sequentially exogenous instead of contemporaneously endogenous, while it is likely that unobserved factors simultaneously affect performance and school attendance, as explained previously. The optimal estimator presented in this section yields smaller standard errors than the Arellano and Bond estimator, both for estimating persistence in student achievement and for estimating the effect of attending private school, with particularly large gains for the latter.

12 Note that clustering standard errors by the first school attended, which is used in Andrabi et al.
(2011), is not justified, since transitory shocks should be correlated within the school that a child currently attends, and not necessarily only across students who attended the same school in the first time period.

Table 1.14: Averages and standard deviations of scores per subject and per grade

            English           Math              Urdu
            Average   s.d.    Average   s.d.    Average   s.d.
Grade 3     0         1       0         1       0         1
Grade 4     0.18      1.04    0.18      1.11    0.24      1.10
Grade 5     0.68      0.89    0.81      1.04    0.82      0.94

Table 1.15: Effects of Attending Private Schools on Student Achievement

                      Optimal Estimator           Arellano and Bond Estimator   Andrabi et al. 2010
Persistence
  English             0.31 (0.14) [0.04, 0.58]    0.34 (0.17) [0.01, 0.67]      0.19 (0.10) [-0.01, 0.39]
  Urdu                0.30 (0.12) [0.06, 0.54]    0.53 (0.18) [0.18, 0.88]      0.35 (0.11) [0.13, 0.57]
  Math                0.04 (0.11) [-0.18, 0.26]   0.23 (0.14) [-0.04, 0.50]     0.12 (0.12) [-0.12, 0.36]
Private School
  English             0.44 (0.38) [-0.30, 1.18]   0.40 (0.55) [-0.68, 1.48]     1.15 (0.39) [0.39, 1.91]
  Urdu                0.89 (0.41) [0.09, 1.69]    0.81 (0.59) [-0.35, 1.97]     0.90 (0.48) [-0.04, 1.84]
  Math                0.30 (0.31) [-0.31, 0.91]   0.43 (0.54) [-0.63, 1.49]     0.46 (0.50) [-0.52, 1.44]

Numbers in parentheses are standard errors and intervals are 95% confidence intervals. Standard errors and confidence intervals for Andrabi et al. 2010 do not take into account changes in school attendance across time. Covariates are treated as sequentially exogenous instead of contemporaneously endogenous in Andrabi et al. 2010.

1.7 Conclusion

We have presented an estimation method that uses cross-sectional dependence to improve the accuracy with which dynamic models of panel data are estimated, while making use of few nuisance parameters and remaining robust to misspecification of the form of the cross-sectional dependence. This method can be generalized to models with covariates that are strictly exogenous, sequentially exogenous or contemporaneously endogenous.
Monte Carlo simulations and an application to the estimation of a value-added model show that, when there is cross-sectional dependence, this method dominates existing estimators in terms of accuracy and quality of inference. Extensions of this work that are the subject of ongoing research include the generalization of the results in this paper to non-linear panel data models, the use of forms of cross-sectional dependence other than clustering in the auxiliary restrictions, and the asymptotic properties of our estimator with large numbers of time periods and of observations within clusters.

CHAPTER 2

ESTIMATION OF UNOBSERVED EFFECTS PANEL DATA MODELS UNDER SEQUENTIAL EXOGENEITY

2.1 Introduction

Time constant unobserved effects are now routinely introduced in models of panel data to address endogeneity issues that are due to time constant unobserved variables. A first group of estimators for such models uses iterated conditioning by specifying an auxiliary model for the unobserved effects conditional on the covariates; such models are commonly called Correlated Random Effects (CRE) models. A second group of estimators implements instrumental variable estimation methods on transformed data, as long as some specific functional form assumptions can be made. In the case of linear models with strictly exogenous covariates, CRE estimators were first proposed in Mundlak (1978) and Chamberlain (1982). When covariates are strictly exogenous, Wooldridge (2010) contains many examples of the generalization of this approach to non-linear models of panel data. For linear models with strictly exogenous covariates, the instrumental variable estimators are the well-known Fixed Effects estimator and the First Difference estimator. These estimators have also been generalized to non-linear models that are linear in random coefficients in Chamberlain (1992b).
The estimators mentioned above cannot be used in applications where covariates or, more generally, instruments are only sequentially exogenous. Sequentially exogenous instruments arise when transitory shocks to the dependent variable are independent of past and current values of the instruments but affect future values of the instruments. This scenario is particularly plausible in a dynamic optimization framework. The simplest example is when lagged values of the dependent variable are used as covariates and instruments. Dynamic models are frequently used to analyze panel data; a review of linear dynamic models for panel data can be found in Bond (2002). Sometimes instruments other than lagged values of the dependent variable can be sequentially exogenous. This is the case, for instance, in Clerides et al. (1998), which investigates the causal effect of exporting on firm efficiency while recognizing that shocks to firm efficiency will affect current and future exports as well. Another well-known example is Blundell et al. (1995), which investigates the causal effect of Research and Development spending on innovation, knowing that current success in filing patents will affect future Research and Development spending. CRE models encounter strong limitations when instruments are sequentially exogenous. Dynamic CRE models can be used for special cases where lagged dependent variables are included in the list of covariates but all other covariates are strictly exogenous. Such models and the corresponding estimation methods are discussed in Wooldridge (2005); one application can be found in Browning et al. (2010). In more general cases, however, one would need a very large set of auxiliary assumptions in order to use a CRE model to analyze panel data with sequential exogeneity.
Instrumental variable estimation can be used for panel data models with sequential exogeneity as long as there exists a transformation of the data to which the method of moments can be applied, which makes it a much more flexible approach. For models with additive unobserved effects, such transformations of the data are presented in Arellano and Bond (1991). Chamberlain (1992a) and Wooldridge (1997) discuss transformations of the data for models with multiplicative unobserved effects. Once the transformed equations are obtained, these papers advocate a two-step GMM estimation of the unknown parameters using the transformed equations at each time period and the corresponding available sets of instruments. These estimators are efficient given the set of unconditional moment conditions that are used, but they are known to suffer from a weak instrumental variable problem that can hinder their use in practice. In this paper, we consider using additional assumptions to derive useful additional moment conditions and hence obtain a more precise estimator. The additional moment conditions we present are generalized versions of the additional moment conditions for the linear dynamic model with additive unobserved effects presented in Arellano and Bover (1995), Ahn and Schmidt (1995) and Blundell and Bond (1998). Windmeijer (2000) considered some of the additional moment conditions we present here, namely zero serial correlation of the transitory shocks, for a special case of the group of models we define. However, the chosen set of assumptions in Windmeijer (2000) is actually too weak to support the moment conditions that are used for estimation. Hence it seems useful to present these moment conditions here as part of a unifying framework. In Section 2.2 we present the model and assumptions we use. In Section 2.3 we discuss the estimator that is currently used.
In Section 2.4 we present additional sets of restrictions that can be used for estimation when instruments are stationary or when transitory shocks are serially uncorrelated. In Section 2.5 we show, using Monte Carlo simulations, that these proposals to address the weak instrumental variable problem of these models result in significant improvements in accuracy, and hence effectively mitigate the weak instrumental variable problem. In Section 2.6 we show how to estimate and perform inference on measures of interest of the effect of covariates on the mean of the dependent variable.

2.2 Model and Assumptions

The models we consider are such that, for each observation i of a random sample of large size n and each time period t of a fixed number of time periods T, we can specify:

E(y_{it} | x^t_i, u^t_i, z_{it}) = h_0(x_{it}, β_0) + h_1(x_{it}, β_0) u_{it}   (2.2.1)

E(u_{it} | z_{it}) = E(u_{it+1} | z_{it}) ∀ t ≤ T − 1   (2.2.2)

for known functions h_0, h_1. In this model x_{it} are observed covariates and u_{it} captures the effect of unobserved covariates. x^t_i = {x_{is}}_{s=1,...,t} contains all values of the covariates up to the current time period, and similarly we denote u^t_i = {u_{is}}_{s=1,...,t}. z_{it} are observed instruments that do not belong to the mean equation for y_{it} once we condition on the observed and unobserved covariates x_{it}, u_{it}. We consider cases where z_{i1} ⊆ z_{i2} ⊆ ... ⊆ z_{iT}, so that we have sequential exogeneity, also called predetermined instruments. (2.2.1) is specified in terms of a conditional expectation instead of simply in terms of y_{it} in order to allow for dependent variables with discrete supports, as we will see later in this section. (2.2.2) requires that at each time period the effects of unobserved covariates have the same mean conditional on the instruments as the effects of unobserved covariates at future time periods. Hence it requires that the source of endogeneity of the instruments be time constant.
For simplicity we will consider the case where y_{it}, h_0(·,·), h_1(·,·) and c_i are scalars, but all the results can be generalized to systems of equations if needed. Dynamic linear models with additive heterogeneity are a special case of the group of models described by (2.2.1) and (2.2.2), with x_{it} = y_{it−1}, z_{it} = [y_{i0}, ..., y_{it−1}], h_0(x, β) = βx, h_1(·,·) = 1:

y_{it} = β_0 y_{it−1} + c_i + ν_{it}   (2.2.3)

E(c_i + ν_{it} | y_{it−1}, ..., y_{i0}) = E(c_i | y_{it−1}, ..., y_{i0})   (2.2.4)

Here we wrote u_{it} = c_i + ν_{it}. Traditionally, unobserved effects have been explicitly decomposed into a time constant part, sometimes called unobserved heterogeneity, and a transitory part. In this paper we keep the more general notation of (2.2.1) and (2.2.2) for more flexibility.

Other special cases of the models we consider have been used to model count dependent variables, such as the linear feedback model presented in Blundell et al. (2002):

E(y_{it} | y_{it−1}, ..., y_{i0}, x_{it}, ..., x_{i1}, c_i) = γ_0 y_{it−1} + exp(x_{it} θ_0) c_i ∀ t = 1, ..., T   (2.2.5)

In this case we simply have u_{it} = c_i. Our specification also includes models for count dependent variables where covariates cannot be used as instruments, i.e. x_{it} ∉ z_{it}, but where enough instruments are available to identify the parameters of the model. An example where the available instruments are lagged covariates is presented in Windmeijer (2000)1:

y_{it} = exp(x_{it} β_0) c_i ν_{it}   (2.2.6)

E(c_i ν_{it} | x_{it−1}, ..., x_{i0}) = E(c_i | x_{it−1}, ..., x_{i0})   (2.2.7)

Multiplicative unobserved effects models were first used to analyze count data, and a description of the state of the literature on dynamic models of count data with unobserved heterogeneity is given in Windmeijer (2008). However, the class of models we consider in this paper is well suited to the analysis of any data that require the specification of a non-linear response function, such as binary, fractional, ordered, non-negative or corner solution response data.
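As one concrete data-generating process consistent with the linear feedback model (2.2.5), achievement counts can be simulated with Poisson innovations. The Poisson choice is our assumption for illustration only; the model itself restricts nothing beyond the conditional mean.

```python
import numpy as np

def simulate_lfm(n, T, gamma, theta, seed=0):
    """Simulate data whose conditional mean matches the linear feedback
    model (2.2.5): E(y_it | past, c_i) = gamma*y_it-1 + exp(x_it*theta)*c_i.
    Innovations are Poisson, one of many processes with this mean."""
    rng = np.random.default_rng(seed)
    c = rng.gamma(1.0, 1.0, size=n)        # unobserved heterogeneity c_i
    x = rng.normal(size=(n, T))            # scalar covariate per period
    y = np.zeros((n, T), dtype=int)
    y[:, 0] = rng.poisson(np.exp(x[:, 0] * theta) * c)
    for t in range(1, T):
        mean = gamma * y[:, t - 1] + np.exp(x[:, t] * theta) * c
        y[:, t] = rng.poisson(mean)        # Poisson(m) has mean exactly m
    return y, x, c
```

Averaging y_it − γ y_{it−1} − exp(x_it θ) c_i over a large simulated sample gives approximately zero, confirming the conditional mean restriction.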
For a binary dependent variable, for instance, a dynamic probit-type model with sequentially exogenous explanatory variables could be specified as:

E(y_{it} | y_{it−1}, ..., y_{i0}, c_i, x_{it}, ..., x_{i1}) = c_i Φ(γ_0 y_{it−1} + x_{it} θ_0)   (2.2.8)

Here c_i/2 is the conditional probability of being in state y_{it} = 1 when y_{it−1} = 0 and x_{it} = 0, and it also captures a time constant unobserved propensity to be in state y_{it} = 1. It is also important to note that the generality of our chosen specification allows us to use models where some of the explanatory variables are endogenous but where instruments are available:

E(y_{it} | y_{it−1}, ..., y_{i0}, x_{it}, ..., x_{i1}, u_{it}, ..., u_{i1}, z_{it}) = Φ(γ_0 y_{it−1} + x_{it} θ_0) u_{it}   (2.2.9)

E(u_{it} | z_{it}) = E(u_{it+1} | z_{it})   (2.2.10)

where u_{it} is a random variable between zero and one that captures the effect on the mean of y_{it} of unobserved explanatory variables which are not independent of x_{it} but have the same mean conditional on the instruments as the corresponding effects in future time periods.

1 We present slightly different assumptions here than Windmeijer (2000), since Windmeijer (2000) goes to great lengths to avoid making assumptions on conditional means and only considers assumptions of zero correlation, but does so at the expense of making two mistakes. One is that assuming that x_{it−1} is uncorrelated with ν_{it} does not imply E(x_{it−1} c_i (ν_{it+1} − ν_{it})) = 0, as is claimed in Windmeijer (2000). The other will be mentioned in Section 2.4.2.

2.3 Estimation without Additional Assumptions

Following the argument made in Chamberlain (1992a), the model described by (2.2.1)-(2.2.2) is statistically indistinguishable from:

E( (y_{iT} − h_0(x_{iT}, β_0)) / h_1(x_{iT}, β_0) | z_{iT} ) = E(u_{iT} | z_{iT})   (2.3.1)

E( Δ[ (y_{it} − h_0(x_{it}, β_0)) / h_1(x_{it}, β_0) ] | z_{it−1} ) = 0 ∀ t = 2, ..., T   (2.3.2)

where Δ denotes the difference operator. Since E(u_{iT} | z_{iT}) is unknown and unrestricted, (2.3.1) does not contribute to the estimation of β_0.
Therefore we can restrict our attention to estimating β_0 from (2.3.2). For notation we write:

ρ_t(w_i, β) ≡ Δ[ (y_{it} − h_0(x_{it}, β)) / h_1(x_{it}, β) ] ∀ t = 2, ..., T   (2.3.3)

where w_i ≡ {y_{it}, x_{it}}_{t=1,...,T}. So the conditional moment restrictions available for estimation are:

E(ρ_t(w_i, β_0) | z_{it−1}) = 0 ∀ t = 2, ..., T   (2.3.4)

Chamberlain (1992a) has shown that an optimal estimator would be β̂_{opt} solving:

Σ^n_{i=1} Σ^T_{t=2} D̃′_{it} Σ̃^{−1}_{it} ρ̃_t(w_i, z_i, β̂_{opt}) = 0   (2.3.5)

Such an estimator would achieve the asymptotic information bound for estimating β_0 from these conditional moment restrictions, which is J = E( Σ^T_{t=2} D̃′_{it} Σ̃^{−1}_{it} D̃_{it} ), where D̃_{it} ≡ E( ∂ρ̃_t(w_i, z_i, β_0)/∂β | z_{it−1} ), Σ̃_{it} ≡ Var( ρ̃_t(w_i, z_i, β_0) | z_{it−1} ), and ρ̃_t(·) is defined by:

ρ̃_T(w_i, z_i, β) = ρ_T(w_i, β)   (2.3.6)

ρ̃_t(w_i, z_i, β) = ρ_t(w_i, β) − Γ_{it,t+1} ρ̃_{t+1}(w_i, z_i, β) − ... − Γ_{it,T} ρ̃_T(w_i, z_i, β) ∀ t = 2, ..., T − 1   (2.3.7)

where z_i = {z_{it−1}}_{t=2,...,T}, Γ_{it,s} ≡ Cov(ρ_{it}, ρ̃_{is} | z_{is−1}) Var(ρ̃_{is} | z_{is−1})^{−1} ∀ s > t, with ρ_{it} = ρ_{it}(β_0), ρ_{it}(β) = ρ_t(w_i, β), ρ̃_{it} = ρ̃_{it}(β_0) and ρ̃_{it}(β) = ρ̃_t(w_i, z_i, β). The intuition behind this result is that the asymptotic information bound from all the sequential conditional moment restrictions is the sum of the information bounds for each conditional moment restriction once these restrictions have been orthogonalized.

Unfortunately, the optimal estimator from equation (2.3.5) is usually not feasible without additional assumptions, since D̃_{it}, Σ̃_{it} and ρ̃_{it} are not observed, i.e. they are not known functions of the data and of β_0. One could think about approximating such moment conditions arbitrarily well, as suggested in Chamberlain (1992a) or partially studied in Hahn (1997), but this introduces several new problems and is therefore left for future research. Under the conditional moment restrictions given in (2.3.4), any function of z_{it−1} can be used as instruments for ρ_{it}(β) to estimate β_0.
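For a concrete instance of the transformation (2.3.3), take the multiplicative specification (2.2.6), i.e. h_0 = 0 and h_1(x, β) = exp(xβ). The quasi-differenced residual then has a closed form; the sketch below is ours, with scalar covariates for readability.

```python
import numpy as np

def rho_t(y, x, beta, t):
    """Quasi-differenced residual rho_t(w_i, beta) of (2.3.3) for the
    multiplicative model h0 = 0, h1(x, b) = exp(x*b) (as in (2.2.6)):
        rho_t = y_t * exp(-x_t*b) - y_{t-1} * exp(-x_{t-1}*b).
    y, x : (n, T) arrays; t is the 1-based time index, 2 <= t <= T."""
    return (y[:, t - 1] * np.exp(-x[:, t - 1] * beta)
            - y[:, t - 2] * np.exp(-x[:, t - 2] * beta))
```

When y_it = exp(x_it β) c_i exactly (no transitory noise), the unobserved effect cancels and ρ_t is identically zero, which is precisely why E(ρ_t | z_{it−1}) = 0 holds at the true β.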
Windmeijer (2008), for instance, recommends the use of all available lags of the instruments in levels, which in our notation is just z_{it-1}. The estimator commonly used to estimate \beta_0 from the model given by (2.2.1) and (2.2.2) is therefore:

\hat\beta = \arg\min_\beta \left(\sum_i Z_i'\rho_i(\beta)\right)' \left(\sum_i Z_i'\rho_i(\tilde\beta)\rho_i(\tilde\beta)' Z_i\right)^{-1} \sum_i Z_i'\rho_i(\beta)   (2.3.8)

where \tilde\beta is a preliminary \sqrt{n}-consistent estimator of \beta_0, \rho_i(\beta) = [\rho_{it}(\beta)]_{t=2,\dots,T}, and

Z_i = \begin{pmatrix} z_{i1} & 0 & \dots & 0 \\ 0 & z_{i2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \dots & 0 & z_{iT-1} \end{pmatrix}   (2.3.9)

The asymptotic variance of \sqrt{n}(\hat\beta - \beta_0) is:

Avar = \left(E\left(Z_i'\frac{\partial\rho_i(\beta_0)}{\partial\beta}\right)' E(Z_i'\rho_i(\beta_0)\rho_i(\beta_0)'Z_i)^{-1} E\left(Z_i'\frac{\partial\rho_i(\beta_0)}{\partial\beta}\right)\right)^{-1}   (2.3.10)

It is shown in Appendix B.1 that this asymptotic variance is equal to:

Avar = \left(E\left(\sum_{t=2}^T \tilde{D}_{it}'\tilde\Sigma_{it}^{-1}\tilde{D}_{it}\right) - E\left(\sum_{t=2}^T e_{it}'e_{it}\right)\right)^{-1}   (2.3.11)

where e_{it} is the error term from the linear projection of \tilde{D}_{it}'\tilde\Sigma_{it}^{-1/2} on \tilde{z}_{it-1} = \{z_{is}\Gamma_{s+1,t}\tilde\Sigma_{it}^{1/2}\}_{s=1,\dots,t-1}, so that \hat\beta can be seen as the estimator resulting from a linear approximation of the optimal moment conditions.

This estimator is often quite imprecise. For dynamic linear models, the weak instrumental variable problem affecting the estimator described in this section has been documented in Arellano and Bover (1995), Ahn and Schmidt (1995) and Blundell and Bond (1998). The additional moment conditions proposed in these papers to address this issue are the ones we generalize to our more general setup in the next section. Blundell et al. (2002) have documented the weak instrumental variable problem for estimating the linear feedback model for count data, which we will use as an example in the next section. The additional assumptions used by Blundell et al. (2002) to alleviate the weak instrumental variable problem are quite unrealistic compared to the additional sets of assumptions we present in the next section.
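The two-step estimator (2.3.8) can be sketched as follows; `rho` and the instrument matrices `Z_i` are user-supplied, and all names are illustrative rather than taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the two-step GMM estimator (2.3.8).  `rho(beta)` must return
# an n x m array of moment functions rho_i(beta); `Z` is a length-n list
# of m x L instrument matrices Z_i (block-diagonal in the lags of z).
def two_step_gmm(rho, Z, beta_init):
    n, L = len(Z), Z[0].shape[1]

    def moments(beta):
        r = rho(beta)
        return np.array([Z[i].T @ r[i] for i in range(n)])  # n x L

    def objective(beta, W):
        m = moments(beta).mean(axis=0)
        return m @ W @ m

    # step 1: identity weighting matrix gives a sqrt(n)-consistent beta~
    b1 = minimize(objective, beta_init, args=(np.eye(L),)).x
    # step 2: efficient weighting matrix built from the step-1 moments
    g = moments(b1)
    W = np.linalg.pinv(g.T @ g / n)
    return minimize(objective, b1, args=(W,)).x
```

The first step with the identity weighting matrix plays the role of the preliminary \tilde\beta in (2.3.8); the second step reuses it to build the weighting matrix (\sum_i Z_i'\rho_i(\tilde\beta)\rho_i(\tilde\beta)'Z_i)^{-1}.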
In addition, Monte Carlo simulations show that using the additional assumptions presented in this paper achieves efficiency gains similar to those obtained in Blundell et al. (2002) when both sets of additional assumptions hold.

2.4 Additional Assumptions

2.4.1 Estimation with Stationary Instruments

For the models described by (2.2.1), it is possible in some applications that a subset of the instruments, denoted z^{stat}_{it}, has a time-constant covariance with the unobserved effects and a time-constant mean, so that in addition to (2.2.2) we can assume:

E(z^{stat}_{it}) = \mu_{z^{stat}}   (2.4.1)
Cov(z^{stat}_{it}, u_{it}) = \gamma   (2.4.2)

(2.4.1) and (2.4.2) imply that E(z^{stat}_{it} u_{it}) is time constant as well, since E(z^{stat}_{it} u_{it}) = \gamma + \mu_{z^{stat}} E(u_{it}), and E(u_{it}) is time constant by the law of iterated expectations together with E(u_{i1}|z_{i1}) = \dots = E(u_{iT}|z_{i1}), which is implied by (2.2.2). This in turn implies that E(z^{stat}_{it} u_{it}) = E(z^{stat}_{it-1} u_{it}), since E(z^{stat}_{it-1} u_{it-1}) = E(z^{stat}_{it-1} u_{it}) from (2.2.2). Let K^{stat} be the dimension of z^{stat}_{it}. We can then use for estimation the (T-1) \times K^{stat} additional moment conditions^2:

E\left((z^{stat}_{it} - z^{stat}_{it-1})' \frac{y_{it} - h_0(x_{it}, \beta_0)}{h_1(x_{it}, \beta_0)}\right) = 0 \quad \forall t = 2, \dots, T   (2.4.3)

2.4.1.1 Example of the Linear Feedback Model

An example of a model where such additional moment conditions can be used, but have not been exploited in previous studies, is the linear feedback model (LFM) presented in Blundell et al. (2002).
For |\gamma_0| < 1:

E(y_{it} \,|\, y_{it-1}, \dots, y_{i0}, x_{it}, \dots, x_{i1}, c_i) = \gamma_0 y_{it-1} + c_i \mu(x_{it}, \theta_0)   (2.4.4)

^2 Note that the moment conditions E((z^{stat}_{it-s} - z^{stat}_{it-s-1}) u_{it}) = 0 for s \geq 1 do not constitute useful additional moment conditions, since they are implied by the moment conditions E((z^{stat}_{it-s} - z^{stat}_{it-s-1}) u_{it-s}) = 0 and E(z^{stat}_{it-s}(u_{it-\tau} - u_{it-\tau-1})) = 0 \;\forall \tau = 0, \dots, s-1. Indeed:

(z^{stat}_{it-1} - z^{stat}_{it-2}) u_{it} = z^{stat}_{it-1} u_{it} - z^{stat}_{it-2}\Delta u_{it} - z^{stat}_{it-2} u_{it-1}
= z^{stat}_{it-1} u_{it} - z^{stat}_{it-2}\Delta u_{it} + \Delta z^{stat}_{it-1} u_{it-1} - z^{stat}_{it-1} u_{it-1}
= z^{stat}_{it-1}\Delta u_{it} - z^{stat}_{it-2}\Delta u_{it} + \Delta z^{stat}_{it-1} u_{it-1}

and by iteration, (z^{stat}_{it-s} - z^{stat}_{it-s-1}) u_{it} is a function of (z^{stat}_{it-s} - z^{stat}_{it-s-1}) u_{it-s} and \{z^{stat}_{it-s}(u_{it-\tau} - u_{it-\tau-1}), \;\forall \tau = 0, \dots, s-1\}.

For estimation we can use the sequence of conditional moment conditions corresponding to the conditional moment conditions (2.3.2) considered in the previous section:

E\left(\frac{y_{it} - \gamma_0 y_{it-1}}{\mu(x_{it}, \theta_0)} - \frac{y_{it-1} - \gamma_0 y_{it-2}}{\mu(x_{it-1}, \theta_0)} \,\middle|\, y_{it-2}, \dots, y_{i0}, x_{it-1}, \dots, x_{i1}\right) = 0 \quad \forall t = 2, \dots, T   (2.4.5)

so for this specific model we have u_{it} = c_i. Blundell et al. (2002) also assume that x_{it} is strictly stationary conditional on c_i.^3 This implies that E(\mu(x_{it}, \theta_0)|c_i) = g_1(c_i) for some arbitrary function g_1(\cdot). Consider the difference equation given by:

y_{it} = \gamma_0 y_{it-1} + \mu(x_{it}, \theta_0) c_i \varepsilon_{it}   (2.4.6)

where \varepsilon_{it} = \frac{y_{it} - \gamma_0 y_{it-1}}{c_i \mu(x_{it}, \theta_0)}. The associated stationary process is defined by:

s_{it} = \sum_{s=0}^{\infty} \gamma_0^s c_i \mu(x_{it-s}, \theta_0) \varepsilon_{it-s}   (2.4.7)

Then E(s_{it}|c_i) = \frac{c_i g_1(c_i)}{1 - \gamma_0}, since E(\varepsilon_{it} \,|\, c_i, x_{it}, x_{it-1}, \dots) = E\left(\frac{y_{it} - \gamma_0 y_{it-1}}{c_i \mu(x_{it}, \theta_0)} \,\middle|\, c_i, x_{it}, x_{it-1}, \dots\right) = 1. So if we simply assume that the deviation of y_{i0} from s_{i0} has mean zero conditional on c_i, we have E(y_{i0}|c_i) = \frac{c_i g_1(c_i)}{1 - \gamma_0}, so that E(y_{it}|c_i) = \frac{c_i g_1(c_i)}{1 - \gamma_0} \;\forall t = 1, \dots, T.
This assumption is the generalization of the restriction on initial conditions made in Blundell and Bond (1998) for dynamic linear models with additive unobserved effects. It results in the additional over-identifying moment conditions:

E\left((y_{it-1} - y_{it-2}) \frac{y_{it} - \gamma_0 y_{it-1}}{\mu(x_{it}, \theta_0)}\right) = E((y_{it-1} - y_{it-2}) c_i) \quad \forall t = 2, \dots, T   (2.4.8)
= 0 \quad \forall t = 2, \dots, T   (2.4.9)

Since for this specific model these conditions would not be plausible without the stationarity of x_{it}, we can also add the moment conditions:

E\left((x_{it} - x_{it-1}) \frac{y_{it} - \gamma_0 y_{it-1}}{\mu(x_{it}, \theta_0)}\right) = 0 \quad t = 2, \dots, T-1   (2.4.10)

Monte Carlo simulations show that these extra moment conditions improve the efficiency of the estimators significantly, even though they rely on assumptions that are more realistic than the assumptions imposed in Blundell et al. (2002).

^3 They do so in a different attempt to mitigate the weak IV problem of FE estimators for the LFM. Blundell et al. (2002) propose a so-called pre-sample mean estimator, which attempts to control for unobserved heterogeneity by using the average of observations on the dependent variable for many periods before the rest of the sample started as a proxy for time-constant unobserved heterogeneity. However, this estimator suffers from two severe drawbacks which make it unusable in practice: it supposes that one has many observations on the dependent variable before the start of the rest of the sample, and, most importantly, the assumptions under which the pre-sample average is a good proxy for unobserved heterogeneity are highly unrealistic; in particular, it supposes that the covariates x_{it} have a mean proportional to c_i and restricts \mu(\cdot) to be the linear-index exponential function.

2.4.1.2 Time-Demeaned Instruments

In some applications it might not be plausible to assume that some of the instruments are mean stationary.
However, additional moment conditions similar to (2.4.3) can be obtained after time demeaning of the instruments if E((z^{stat}_{it} - E(z^{stat}_{it})) u_{it}) = Cov(z^{stat}_{it}, u_{it}) is time constant. In this section we consider the conditions necessary for this to be true when z^{stat}_{it} is not itself mean stationary.

From (2.2.2) alone, Cov(z^{stat}_{it}, u_{it}) = Cov(z^{stat}_{it}, u_{iT}), since (2.2.2) implies E(z^{stat}_{it} u_{it}) = E(z^{stat}_{it} u_{iT}) and E(u_{it}) = E(u_{iT}). Cov(z^{stat}_{it}, u_{iT}) will be time constant if \forall t, s \leq T-1:

Cov(z^{stat}_{it}, u_{iT}) - Cov(z^{stat}_{is}, u_{iT}) = 0   (2.4.11)
Cov(z^{stat}_{it} - z^{stat}_{is}, u_{iT}) = 0   (2.4.12)

Hence Cov(z^{stat}_{it}, u_{iT}) will be time constant if the change in z^{stat}_{it} over time is uncorrelated with the unobserved effects at the last time period. To provide more intuition regarding what such an assumption means, we can consider the unobserved heterogeneity decomposition of the unobserved effects and write u_{iT} = c_i \nu_{iT}, with c_i and \nu_{iT} such that E(u_{iT}|z_{it}) = E(c_i|z_{it}). Therefore:

Cov(z^{stat}_{it} - z^{stat}_{is}, u_{iT}) = Cov(z^{stat}_{it} - z^{stat}_{is}, c_i)   (2.4.13)

So for Cov(z^{stat}_{it}, u_{iT}) to be time constant, we need the change in z^{stat}_{it} over time to be uncorrelated with the time-constant part of the unobserved effects. This will be satisfied, for instance, if z^{stat}_{it} is composed of a deterministic time component f_t, a time-constant component d_i that is arbitrarily correlated with c_i, and a time-varying component \varepsilon_{it} that is uncorrelated with c_i:

z^{stat}_{it} = f_t + d_i + \varepsilon_{it}   (2.4.14)

Indeed, in this case:

Cov(z^{stat}_{it} - z^{stat}_{is}, u_{iT}) = Cov(z^{stat}_{it} - z^{stat}_{is}, c_i)   (2.4.15)
= Cov(\varepsilon_{it} - \varepsilon_{is}, c_i)   (2.4.16)
= 0   (2.4.17)

As long as Cov(z^{stat}_{it}, u_{iT}) is time constant, we can use for estimation the following additional moment conditions:

E\left((\tilde{z}^{stat}_{it} - \tilde{z}^{stat}_{it-1})' \frac{y_{it} - h_0(x_{it}, \beta_0)}{h_1(x_{it}, \beta_0)}\right) = 0 \quad \forall t = 2, \dots, T   (2.4.18)

where \tilde{z}^{stat}_{it} = z^{stat}_{it} - E(z^{stat}_{it}).
For estimation, E(z^{stat}_{it}) can simply be replaced by the sample average of z^{stat}_{it}. The asymptotic variance of the estimator of \beta_0 will not be affected by this preliminary estimation, following for instance the results in Newey and McFadden (1994). A simple informal test of whether the change in z^{stat}_{it} over time is uncorrelated with the time-constant part of the unobserved effects is to regress the change in z^{stat}_{it} on time-period dummies and as many time-constant explanatory variables as are available, and test the joint significance of the time-constant covariates in the regression.

2.4.2 Serially Uncorrelated Transitory Shocks

In some applications it might be unlikely that instruments or functions of the instruments have a time-constant covariance with the unobserved effects. For instance, consider the case of the linear feedback model where only time-period dummy variables are used as covariates, so that x_{it} = D_t. Then E(y_{it}|y_{it-1}, \dots, y_{i0}, c_i) = \gamma_0 y_{it-1} + \mu_t c_i, where \mu_t is a deterministic constant that depends on t. Even if we assume that y_{i0} does not deviate from the stationary process s_{i0} = \sum_{s=0}^{\infty}\gamma_0^s c_i \mu_{-s}\varepsilon_{i-s}, we have E(y_{i1} - y_{i0}|c_i) = \sum_{s=0}^{\infty}\gamma_0^s c_i(\mu_{-s+1} - \mu_{-s}), so that in general y_{it} - y_{it-1} will be correlated with c_i, and therefore y_{it-1} - y_{it-2} cannot be used as an instrument for the equation in levels, even if it is time demeaned. However, in such cases other additional restrictions might be available in the form of restrictions on the variance-covariance matrix of u_i = [u_{i1}, \dots, u_{iT}]'. It is sometimes plausible to assume that the only source of serial correlation in the unobserved effects is the time-constant unobserved effects, so that Cov(u_{it}, u_{is}) = Cov(u_{iq}, u_{ir}) \;\forall s < t, q < r.^4
In general, such restrictions imply T(T-1)/2 - 1 additional over-identifying moment restrictions, which can be written as:

E\left(\frac{y_{it} - h_0(x_{it}, \beta_0)}{h_1(x_{it}, \beta_0)} \cdot \frac{y_{is} - h_0(x_{is}, \beta_0)}{h_1(x_{is}, \beta_0)}\right) = \tau_0 \quad \forall t, s = 1, \dots, T, \; s < t   (2.4.19)

where \tau_0 is an additional parameter, appended to \beta_0, defined by \tau_0 = Cov(u_{it}, u_{is}) + E(u_{it})E(u_{is}) \;\forall t \neq s, which does not depend on t or s since E(u_{it}) is constant by (2.2.2). This is, however, not true in the case of dynamic models, since then some of these moment conditions are already implied by (2.2.1) and (2.2.2). For dynamic models, u_{it} = (y_{it} - h_0(x_{it}, \beta_0))/h_1(x_{it}, \beta_0) and y_{it-1}, x_{it} \in z_{it}, so u_{it-1} \in z_{it}. Hence (2.2.1) and (2.2.2) imply E(u_{it} u_{is}) = E(u_{it} u_{ir}) \;\forall t < s, r, so that Cov(u_{it}, u_{is}) = Cov(u_{it}, u_{ir}) \;\forall t < s, r. Therefore, assuming Cov(u_{it}, u_{is}) = Cov(u_{iq}, u_{ir}) \;\forall t < s, q < r in the case of dynamic models will only imply the additional T-2 over-identifying restrictions:

E\left(\frac{y_{iT} - h_0(x_{iT}, \beta_0)}{h_1(x_{iT}, \beta_0)} \cdot \frac{y_{iT-s} - h_0(x_{iT-s}, \beta_0)}{h_1(x_{iT-s}, \beta_0)}\right) = \tau_0 \quad \forall s = 1, \dots, T-1   (2.4.20)

These moment restrictions are the generalization of the additional moment conditions derived for linear dynamic models in Ahn and Schmidt (1995). In the case of dynamic models, an assumption such as Cov(u_{it}, u_{is}) = Cov(u_{iq}, u_{ir}) \;\forall t \neq s, q \neq r can be very plausible, since it is possible to argue that modeling the dynamics will also account for all serial correlation in the unobserved effects other than the serial correlation due to the time-constant part of the unobserved effects.

^4 One could also consider weaker additional restrictions of the type E(u_{it} u_{it-s}) = \tau(s), so that serial correlation in the unobserved effects depends only on the number of lags s separating these unobserved effects and not on the chosen time period t. We do not consider this possibility in this paper for simplicity; it would be straightforward to modify the derivations of this section to cover this case.
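As a quick numerical illustration of why the time-constant decomposition delivers these restrictions (a hypothetical simulation, not part of the original argument): if u_{it} = c_i \nu_{it} with \nu_{it} i.i.d. with mean one and independent of c_i, then every pairwise product moment E(u_{it} u_{is}), t \neq s, equals E(c_i^2).

```python
import numpy as np

# Illustrative check: with u_it = c_i * nu_it, nu_it i.i.d. with mean
# one and independent of c_i, E(u_it u_is) = E(c_i^2) for all t != s,
# which is the constancy used in restrictions of the form (2.4.20).
rng = np.random.default_rng(0)
n, T = 100_000, 4
c = np.exp(rng.normal(0.0, 0.3, (n, 1)))   # time-constant effect
nu = rng.gamma(2.0, 0.5, (n, T))           # i.i.d. shocks, mean 1
u = c * nu
pairs = [(u[:, t] * u[:, s]).mean() for t in range(T) for s in range(t)]
# all pairwise product means are (approximately) equal to E(c_i^2)
```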
For instance, the linear feedback model presented in (2.4.4) implies such additional moment restrictions, even though they have not been exploited for estimation in previous studies. Indeed, from (2.4.4)^5:

E\left(\frac{y_{iT} - \gamma_0 y_{iT-1}}{\mu(x_{iT}, \theta_0)} \cdot \frac{y_{iT-s} - \gamma_0 y_{iT-s-1}}{\mu(x_{iT-s}, \theta_0)}\right) = E(c_i^2) \quad s = 1, \dots, T-1   (2.4.21)

Windmeijer (2000) has also derived similar moment conditions for the model presented in (2.2.6) and (2.2.7), but under a set of assumptions that is too weak: Windmeijer (2000) only assumes that c_i^2 is uncorrelated with \varepsilon_{it} and that \varepsilon_{it} is uncorrelated with \varepsilon_{is} for t \neq s, which does not imply E(c_i^2\varepsilon_{it}\varepsilon_{is}) = E(c_i^2)E(\varepsilon_{it})E(\varepsilon_{is}) = E(c_i^2). Hence it seems that a specification of models in terms of conditional expectations and unobserved effects, as in (2.2.1) and (2.2.2), is more straightforward than the specification of the model in terms of zero correlation found in Windmeijer (2000).

^5 I was made aware after finishing a draft of this paper that, in unpublished work, Kitazawa (2007) also considers similar moment conditions for the LFM. Note, however, that the LFM is only one of the special cases covered by the group of models we defined.

2.5 Monte Carlo Evidence

To study the small-sample performance of the estimators we present in this paper, we consider estimating the linear feedback model presented in Blundell et al. (2002):

y_{it} \sim Poisson(\gamma y_{it-1} + \exp(\beta x_{it} + \eta_i)) \quad \forall t = 1, \dots, T   (2.5.1)
x_{it} = \rho x_{it-1} + \tau \eta_i + \varepsilon_{it}   (2.5.2)
x_{i0} = \frac{\tau}{1-\rho}\eta_i + \xi_i   (2.5.3)
y_{i0} \sim Poisson\left(\frac{\exp(\beta x_{i0} + \eta_i)}{1-\gamma}\right)   (2.5.4)
\eta_i \sim N(0, \sigma_\eta^2)   (2.5.5)
\varepsilon_{it} \sim N(0, \sigma_\varepsilon^2)   (2.5.6)
\xi_i \sim N\left(0, \frac{\sigma_\varepsilon^2}{1-\rho^2}\right)   (2.5.7)

The only difference with the data generating process of Blundell et al. (2002) is that we do not obtain y_{i0} as the last of fifty draws starting at y_{i,-49} \sim Poisson(\exp(\beta x_{i,-49} + \eta_i)), but instead impose E(y_{i0}|c_i) = c_i E(\exp(\beta x_{i0}))/(1-\gamma). Since we restrict our attention to \gamma < 1, the two data generating processes are very similar, though not exactly equivalent.
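A minimal sketch of this data generating process (assuming the parameter values reported with the tables; the function and variable names are mine, not from the text):

```python
import numpy as np

# Simulate one Monte Carlo sample from the DGP (2.5.1)-(2.5.7).
# Default parameter values follow the tables: gamma = 0.5, beta = 0.5,
# rho = 0.5, tau = 0.1, sig2_eta = 0.5, sig2_eps = 0.5.
def simulate_lfm(n, T, gamma=0.5, beta=0.5, rho=0.5, tau=0.1,
                 sig2_eta=0.5, sig2_eps=0.5, rng=None):
    rng = np.random.default_rng(rng)
    eta = rng.normal(0.0, np.sqrt(sig2_eta), n)               # (2.5.5)
    x = np.empty((n, T + 1))
    x[:, 0] = tau / (1 - rho) * eta + rng.normal(             # (2.5.3), (2.5.7)
        0.0, np.sqrt(sig2_eps / (1 - rho ** 2)), n)
    for t in range(1, T + 1):                                 # (2.5.2), (2.5.6)
        x[:, t] = (rho * x[:, t - 1] + tau * eta
                   + rng.normal(0.0, np.sqrt(sig2_eps), n))
    y = np.empty((n, T + 1))
    y[:, 0] = rng.poisson(np.exp(beta * x[:, 0] + eta) / (1 - gamma))  # (2.5.4)
    for t in range(1, T + 1):                                 # (2.5.1)
        y[:, t] = rng.poisson(gamma * y[:, t - 1]
                              + np.exp(beta * x[:, t] + eta))
    return y, x

y, x = simulate_lfm(n=1000, T=4, rng=0)
```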
With this model, we will consider using for estimation the sequence of moment conditions:

E\left(z_{it}'\left(\frac{y_{it} - \gamma_0 y_{it-1}}{h_1(x_{it}, \beta_0)} - \frac{y_{it-1} - \gamma_0 y_{it-2}}{h_1(x_{it-1}, \beta_0)}\right)\right) = 0 \quad t = 2, \dots, T   (2.5.8)

where z_{it} = (y_{it-2}, \dots, y_{i0}, x_{it-1}, \dots, x_{i1}) or z_{it} = (y_{it-2}, x_{it-1}). The additional conditions that arise from the restriction imposed on the initial conditions are:

E\left((y_{it-1} - y_{it-2})\frac{y_{it} - \gamma_0 y_{it-1}}{h_1(x_{it}, \beta_0)}\right) = 0 \quad t = 2, \dots, T   (2.5.9)
E\left((x_{it} - x_{it-1})\frac{y_{it} - \gamma_0 y_{it-1}}{h_1(x_{it}, \beta_0)}\right) = 0 \quad t = 2, \dots, T   (2.5.10)

The additional conditions that arise from serial uncorrelation of the transitory shocks are:

E\left(\frac{y_{iT} - \gamma_0 y_{iT-1}}{h_1(x_{iT}, \beta_0)}\left(\frac{y_{it} - \gamma_0 y_{it-1}}{h_1(x_{it}, \beta_0)} - \frac{y_{it-1} - \gamma_0 y_{it-2}}{h_1(x_{it-1}, \beta_0)}\right)\right) = 0 \quad t = 2, \dots, T-1   (2.5.11)

We will consider four groups of estimators: using no additional moment conditions; using the additional moment conditions from the restrictions on the initial conditions; using the additional moment conditions from serially uncorrelated transitory shocks; and using both sets of additional moment conditions. Within each group we will consider the GMM estimator that uses all available lags of the instruments for the conditional moment conditions and the GMM estimator that uses only one lag of the instruments. For each estimator we will also consider the two-step GMM estimator with the identity matrix as initial weighting matrix and the iterated GMM estimator, a multiple-step GMM estimator that takes as many steps as are needed for the estimates to converge.^6 We will therefore be considering a total of sixteen estimators.

Table 2.1 and Table 2.2 report the bias and root mean squared error (RMSE) of the estimators of \gamma.^7 Table 2.3 and Table 2.4 report the ratio of the mean of the standard errors of the estimators of \gamma to the standard deviations of these estimators; these tables therefore capture the bias in the estimators of the variance of the estimators of \gamma.
Table 2.5 and Table 2.6 report the coverage rates of the 95% confidence intervals constructed from the estimators of \gamma and their associated standard errors. All results are from 1,000 replications.

The first conclusion from Table 2.1 and Table 2.2 is that using the additional moment conditions presented in the previous section results in large efficiency gains, with very sizable decreases in both bias and standard deviations. This gain is especially noticeable when either set of additional moment conditions is used compared to not using any; the addition of a second set of additional moment conditions yields a more modest gain in efficiency. Bias is almost always smaller when using only one lag of the instruments instead of all available lags. When all available lags of the instruments are used, iterated GMM seems to perform better than two-step GMM.

Table 2.3 and Table 2.4 show a severe downward bias in the standard errors for small n and large T when all available lags of the instruments are used. This problem is alleviated by using iterated GMM, particularly when T is large; however, even with iterated GMM the standard deviations can be significantly underestimated. This bias in the standard errors is due to the use of many over-identifying moment conditions. The same problem of downward-biased standard errors has been studied for the special case of linear models in Windmeijer (2005) and for models of count data in Windmeijer (2008). However, these two papers concentrate on the bias originating from using a preliminary estimator to compute the optimal weighting matrix, whereas we see that using iterated GMM instead of two-step GMM helps but does not completely solve the problem of downward-biased standard errors. Asymptotic analysis under many moment conditions, performed in separate work in progress, seems to indicate that most of the bias comes from the correlation between the gradient of the moment functions and the moment functions themselves; this result has been presented in a more general setting in Newey and Windmeijer (2009). Bootstrapped standard errors might also be a solution.

Table 2.5 and Table 2.6 show the effect of both the downward-biased standard errors and the bias in the estimators of \gamma on inference. For small n or large T, the coverage of the confidence intervals is significantly lower than the 95% confidence level, particularly when all available lags of the instruments are used. This problem is alleviated by using iterated GMM but not completely solved. Corrected standard errors would help construct better confidence intervals, as would bias correction, particularly in the case where no additional moment conditions are available. As with the correction of the standard errors, bias correction could be based on higher-order asymptotic analysis.

The first conclusion of this section is that using additional restrictions of stationarity of the instruments or serial uncorrelation of the transitory shocks can make a big difference in terms of the precision of the point estimates.

^6 We do not present the results of iterated GMM estimation for n = 100 because, for this small sample size and with the convergence criterion we used for the other sample sizes, the iterated GMM algorithm failed to converge in fewer than 400 iterations in 25% of the simulation draws when T = 4 and 50% of the simulation draws when T = 8. Conditional on the iterated GMM algorithm converging for n = 100, using iterated GMM instead of two-step GMM seemed to provide some efficiency gain and significantly better inference when many moment conditions are used, in a similar way as for the larger sample sizes.

^7 We only show results for the estimation of \gamma here, but results for the estimation of \beta exhibit similar patterns.
It does not, however, solve the problem of inference, which was already present with previous estimators and is due to the poor properties of GMM standard errors in cases where many over-identifying conditions are used. Using iterated GMM can improve the quality of inference compared to two-step GMM, especially when T is relatively large, without solving the problem completely. The results presented in this section also suggest that using only one lag of the instruments can result in much better inference, especially when T is relatively large. Previous studies of instrumental variable estimation of models similar to the ones we consider in this paper, such as Arellano and Bond (1991) or Windmeijer (2008), recommended the use of all lags of the instruments in (2.5.8). However, the Monte Carlo evidence we presented indicates that using only one lag of the instruments causes only a modest loss in accuracy, especially when additional moment conditions are available, while resulting in significantly lower bias and significantly better inference compared to using all available lags of the instruments.
Table 2.1: Bias and RMSE for estimating γ, T = 4

                                         N = 100          N = 500          N = 1000         N = 2000
                                         Bias    RMSE     Bias    RMSE     Bias    RMSE     Bias    RMSE
Two-step GMM
  no additional conditions   All Lags   -0.214   0.314   -0.070   0.117   -0.039   0.076   -0.022   0.053
                             One Lag    -0.186   0.320   -0.059   0.137   -0.031   0.088   -0.019   0.065
  initial conditions         All Lags   -0.036   0.167   -0.008   0.068    0.001   0.046   -0.001   0.033
                             One Lag    -0.006   0.142   -0.001   0.067    0.003   0.047   -0.000   0.034
  serial uncorrelation       All Lags   -0.108   0.196   -0.034   0.072   -0.015   0.046   -0.008   0.033
                             One Lag    -0.070   0.168   -0.022   0.073   -0.007   0.049   -0.005   0.036
  both sets of conditions    All Lags   -0.029   0.166   -0.007   0.070   -0.001   0.048   -0.002   0.033
                             One Lag    -0.006   0.135   -0.003   0.064    0.002   0.043   -0.001   0.031
Iterated GMM
  no additional conditions   All Lags                    -0.063   0.105   -0.036   0.074   -0.021   0.051
                             One Lag                     -0.060   0.132   -0.029   0.088   -0.018   0.063
  initial conditions         All Lags                    -0.001   0.062    0.004   0.046   -0.001   0.032
                             One Lag                      0.003   0.065    0.004   0.047   -0.000   0.034
  serial uncorrelation       All Lags                    -0.023   0.058   -0.010   0.043   -0.007   0.031
                             One Lag                     -0.017   0.061   -0.006   0.047   -0.004   0.034
  both sets of conditions    All Lags                    -0.010   0.086   -0.001   0.042   -0.002   0.029
                             One Lag                     -0.003   0.061   -0.000   0.043   -0.002   0.030

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Table 2.2: Bias and RMSE for estimating γ, T = 8

                                         N = 100          N = 500          N = 1000         N = 2000
                                         Bias    RMSE     Bias    RMSE     Bias    RMSE     Bias    RMSE
Two-step GMM
  no additional conditions   All Lags   -0.244   0.319   -0.070   0.092   -0.037   0.050   -0.018   0.028
                             One Lag    -0.126   0.183   -0.033   0.061   -0.019   0.041   -0.010   0.028
  initial conditions         All Lags   -0.122   0.204   -0.030   0.061   -0.014   0.032   -0.005   0.017
                             One Lag    -0.013   0.105   -0.002   0.041   -0.001   0.029   -0.000   0.020
  serial uncorrelation       All Lags   -0.191   0.267   -0.052   0.075   -0.025   0.039   -0.011   0.020
                             One Lag    -0.068   0.128   -0.017   0.041   -0.010   0.027   -0.005   0.018
  both sets of conditions    All Lags   -0.107   0.192   -0.026   0.060   -0.012   0.032   -0.005   0.017
                             One Lag    -0.002   0.101   -0.002   0.038   -0.002   0.026   -0.001   0.018
Iterated GMM
  no additional conditions   All Lags                    -0.044   0.058   -0.026   0.038   -0.015   0.024
                             One Lag                     -0.030   0.057   -0.017   0.039   -0.009   0.027
  initial conditions         All Lags                    -0.009   0.035   -0.005   0.023   -0.002   0.015
                             One Lag                     -0.001   0.041   -0.001   0.029   -0.000   0.020
  serial uncorrelation       All Lags                    -0.019   0.034   -0.011   0.023   -0.007   0.015
                             One Lag                     -0.011   0.036   -0.008   0.026   -0.004   0.018
  both sets of conditions    All Lags                    -0.010   0.039   -0.006   0.022   -0.003   0.014
                             One Lag                     -0.004   0.037   -0.003   0.025   -0.002   0.018

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Table 2.3: Ratio of standard errors over standard deviations of estimators of γ, T = 4

                                        N = 100   N = 500   N = 1000   N = 2000
Two-step GMM
  no additional conditions   All Lags    0.612     0.929     1.033      1.029
                             One Lag     0.772     0.883     1.025      0.967
  initial conditions         All Lags    0.428     0.763     0.870      0.921
                             One Lag     0.556     0.826     0.908      0.945
  serial uncorrelation       All Lags    0.562     0.900     0.990      0.975
                             One Lag     0.718     0.921     0.980      0.958
  both sets of conditions    All Lags    0.368     0.724     0.844      0.887
                             One Lag     0.520     0.794     0.914      0.939
Iterated GMM
  no additional conditions   All Lags              0.971     0.999      1.039
                             One Lag               0.902     0.990      0.986
  initial conditions         All Lags              0.758     0.827      0.922
                             One Lag               0.809     0.899      0.939
  serial uncorrelation       All Lags              0.976     0.971      0.997
                             One Lag               1.022     0.983      0.976
  both sets of conditions    All Lags              0.496     0.829      0.907
                             One Lag               0.774     0.877      0.943

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Table 2.4: Ratio of standard errors over standard deviations of estimators of γ, T = 8

                                        N = 100   N = 500   N = 1000   N = 2000
Two-step GMM
  no additional conditions   All Lags    0.146     0.540     0.734      0.914
                             One Lag     0.599     0.898     0.948      0.948
  initial conditions         All Lags    0.081     0.403     0.615      0.837
                             One Lag     0.353     0.736     0.834      0.899
  serial uncorrelation       All Lags    0.099     0.415     0.591      0.832
                             One Lag     0.435     0.823     0.907      0.955
  both sets of conditions    All Lags    0.055     0.354     0.555      0.803
                             One Lag     0.290     0.696     0.822      0.899
Iterated GMM
  no additional conditions   All Lags              0.749     0.854      0.942
                             One Lag               0.915     0.954      0.963
  initial conditions         All Lags              0.561     0.727      0.849
                             One Lag               0.713     0.814      0.891
  serial uncorrelation       All Lags              0.695     0.803      0.923
                             One Lag               0.855     0.924      0.976
  both sets of conditions    All Lags              0.440     0.683      0.848
                             One Lag               0.684     0.823      0.897

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

2.6 Average Partial Effects

With multiplicative heterogeneity models, average partial effects (APE) are very simple to compute. Average partial effects are defined by:

APE_{f_w} = E_{f_w}\left(\frac{\partial y}{\partial x}\right)   (2.6.1)

where f_w is some distribution over the domain of w, which represents all the information observed for one observation, and \partial y/\partial x denotes the change in y caused by a small change in x.^8

^8 Here we use the notation for partial derivatives, but in the case of a discrete change \Delta_x in x we could use the counterfactual notation and write \Delta_y(x) = y|(x + \Delta_x) - y|x interchangeably.
Table 2.5: Coverage of 95% confidence intervals for γ, T = 4

                                        N = 100   N = 500   N = 1000   N = 2000
Two-step GMM
  no additional conditions   All Lags    0.628     0.835     0.895      0.924
                             One Lag     0.788     0.875     0.930      0.929
  initial conditions         All Lags    0.614     0.855     0.903      0.923
                             One Lag     0.717     0.887     0.927      0.934
  serial uncorrelation       All Lags    0.628     0.847     0.910      0.921
                             One Lag     0.774     0.894     0.930      0.940
  both sets of conditions    All Lags    0.538     0.832     0.906      0.918
                             One Lag     0.699     0.868     0.919      0.936
Iterated GMM
  no additional conditions   All Lags              0.849     0.893      0.931
                             One Lag               0.879     0.914      0.930
  initial conditions         All Lags              0.842     0.891      0.921
                             One Lag               0.874     0.918      0.926
  serial uncorrelation       All Lags              0.884     0.918      0.933
                             One Lag               0.911     0.934      0.951
  both sets of conditions    All Lags              0.808     0.891      0.929
                             One Lag               0.863     0.902      0.933

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Eliminating the subscripts, (2.2.1) can be written as:

y = h_0(x, \beta_0) + h_1(x, \beta_0) u   (2.6.2)

Therefore:

APE_{f_w} = E_{f_w}\left(\frac{\partial h_0(x, \beta_0)}{\partial x} + \frac{\partial h_1(x, \beta_0)}{\partial x} u\right)   (2.6.3)
= E_{f_w}\left(\frac{\partial h_0(x, \beta_0)}{\partial x} + \frac{\partial h_1(x, \beta_0)}{\partial x} \cdot \frac{y - h_0(x, \beta_0)}{h_1(x, \beta_0)}\right)   (2.6.4)

For notational simplicity, in this section we will consider h_0(\cdot, \cdot) = 0, but this will not affect any of the results.
Table 2.6: Coverage of 95% confidence intervals for γ, T = 8

                                        N = 100   N = 500   N = 1000   N = 2000
Two-step GMM
  no additional conditions   All Lags    0.098     0.434     0.649      0.821
                             One Lag     0.590     0.862     0.898      0.913
  initial conditions         All Lags    0.091     0.534     0.757      0.875
                             One Lag     0.522     0.846     0.902      0.923
  serial uncorrelation       All Lags    0.086     0.410     0.637      0.820
                             One Lag     0.573     0.853     0.895      0.926
  both sets of conditions    All Lags    0.066     0.501     0.724      0.863
                             One Lag     0.457     0.832     0.902      0.924
Iterated GMM
  no additional conditions   All Lags              0.620     0.753      0.862
                             One Lag               0.875     0.904      0.920
  initial conditions         All Lags              0.729     0.834      0.895
                             One Lag               0.836     0.888      0.920
  serial uncorrelation       All Lags              0.731     0.822      0.897
                             One Lag               0.876     0.911      0.933
  both sets of conditions    All Lags              0.697     0.820      0.892
                             One Lag               0.823     0.892      0.923

The values of the parameters used for the simulations are: γ = 0.5, β = 0.5, ρ = 0.5, τ = 0.1, σ²_η = 0.5, σ²_ε = 0.5.

Many applications are interested in the average effect across an observed subset of the population, denoted A. This corresponds to using f_w = f(w|A), so that APE_A = E(\partial y/\partial x \,|\, A) = E\left(\frac{\partial h_1(x, \beta_0)}{\partial x}\frac{y}{h_1(x, \beta_0)} \,\middle|\, A\right). For instance, we could be interested in the average effect of x on y across the entire population in some given time period t, APE_t = E\left(\frac{\partial h_1(x_{it}, \beta_0)}{\partial x}\frac{y_{it}}{h_1(x_{it}, \beta_0)}\right). Or, in the case of a binary explanatory variable x^1 with x = (x^1, x^{-1}), we could be interested in the average treatment effect on the treated at a given time period:

ATET_t = E(y(1, x^{-1}_{it}) - y(0, x^{-1}_{it}) \,|\, x^1_{it} = 1)   (2.6.5)
= E\left(\left(h_1((1, x^{-1}_{it}), \beta_0) - h_1((0, x^{-1}_{it}), \beta_0)\right) \frac{y_{it}}{h_1(x_{it}, \beta_0)} \,\middle|\, x^1_{it} = 1\right)   (2.6.6)

Estimation and inference are straightforward in this case once a consistent estimator \hat\beta of \beta_0 is defined. Since E\left(1(i \in A)\left(\frac{\partial h_1(x_{it}, \beta_0)}{\partial x}\frac{y_{it}}{h_1(x_{it}, \beta_0)} - APE_{At}\right)\right) = 0, where 1(\cdot)
is the indicator function, we can simply add this moment condition to the moment conditions used to estimate \beta_0 and obtain an additional estimator of APE_{At}, as well as an estimator of the asymptotic variance of \widehat{APE}_{At} and of the covariance between \widehat{APE}_{At} and \hat\beta, where \hat\beta denotes the estimator of \beta_0 we will be using. Since we are adding one moment condition for one new parameter, the estimator \hat\beta will not be affected by the estimation of the average partial effects. In addition, the GMM estimator of the APE is given by:

\widehat{APE}_{At} = \frac{1}{n_A}\sum_{i=1}^n 1(i \in A) \frac{\partial h_1(x_{it}, \hat\beta)}{\partial x}\frac{y_{it}}{h_1(x_{it}, \hat\beta)}   (2.6.7)

where n_A = \sum_{i=1}^n 1(i \in A). If we think that the APE should be equal across time periods, we can impose this restriction in the GMM estimation by adding the moment restrictions \{E(1(i \in A)(\frac{\partial h_1(x_{it}, \beta_0)}{\partial x}\frac{y_{it}}{h_1(x_{it}, \beta_0)} - APE_A)) = 0\}_{t=1,\dots,T}, which might affect the estimation of \beta_0, or we can estimate the average partial effects for each time period and combine them using minimum distance estimation, which will not affect the estimation of \beta_0.

In other situations, if f_w can be consistently estimated by f_w(\hat\eta), where \hat\eta is a vector of estimators of nuisance parameters \eta_0, then:

\widehat{APE}_{f_w} = E_{f_w(\hat\eta)}\left(\frac{\partial h_1(x, \hat\beta)}{\partial x}\frac{y}{h_1(x, \hat\beta)}\right)   (2.6.8)

is consistent for APE_{f_w}. If (\hat\beta, \hat\eta) are jointly asymptotically normal and a consistent estimator of their asymptotic variance-covariance matrix is available, then inference can be performed using the delta method.

2.7 Conclusion

These results hopefully provide useful new options for researchers who wish to use non-linear panel data models with unobserved effects in applications where only sequential exogeneity is available. The problem of weak instrumental variables seems to be mitigated significantly by the use of additional moment conditions originating from additional restrictions of stationarity of the instruments or serial uncorrelation of the transitory shocks.
Monte Carlo evidence also seems to suggest that it is preferable to use only one or a few lags of the instruments rather than all available lags, since this results in much better inference at the expense of only small losses in efficiency. Two directions are available to obtain estimators with better inference: one consists of studying the higher-order properties of the GMM estimator with many over-identifying restrictions; the other consists of finding good exactly identifying moment conditions. Both of these approaches are left for future research.

CHAPTER 3

EFFICIENCY OF THE POISSON FIXED EFFECTS ESTIMATOR

3.1 Introduction

A commonly used estimator for models of count panel data with multiplicative heterogeneity and strictly exogenous explanatory variables is the Poisson fixed effects (PFE) estimator introduced by Hausman et al. (1984). This estimator is a conditional maximum likelihood estimator which takes advantage of the assumptions of a Poisson distribution and independent draws over time to derive a conditional distribution of the dependent variable that does not depend on the distribution of unobserved heterogeneity. In many applications, these distributional assumptions are likely to be violated. Wooldridge (1999) showed that the PFE estimator is consistent as long as the conditional mean function is correctly specified, independently of whether the rest of the assumptions of the PFE model hold. In this paper I show that, as long as the conditional mean of the dependent variable is equal to its conditional variance and the conditional serial correlation of the dependent variable is zero, the PFE estimator is also asymptotically efficient in the class of estimators that are consistent under restrictions on the conditional mean function. I then define another estimator that is asymptotically efficient in the same class of estimators under more general conditions.
In Section 3.2, I present the model considered in this paper and study the asymptotically efficient estimator for this model. I show under which conditions the PFE estimator is asymptotically efficient and propose an alternative estimator that is asymptotically efficient under more general conditions. In Section 3.3, I use Monte Carlo simulations to investigate the small sample properties of the PFE estimator and of this new estimator.

3.2 The Model and Estimators

As in Wooldridge (1999), we consider panel data models that specify a conditional mean function with strictly exogenous explanatory variables and multiplicative heterogeneity:

\[ E(y_{it} | c_i, x_i) = c_i \mu(x_{it}, \beta_0) \quad \forall\, i = 1, \dots, n,\ t = 1, \dots, T \tag{3.2.1} \]

where $i$ indexes cross-sectional observations, $t$ indexes time, and $x_i = \{x_{i1}, \dots, x_{iT}\}$. This model is also a special case of the random coefficients model presented in Section 4 of Chamberlain (1992b). Throughout this paper we consider the case of i.i.d. cross-sectional draws and large $n$, fixed $T$ asymptotics. Denote $\mu_{it}(\beta) = \mu(x_{it}, \beta)$ and $\mu_{it} = \mu_{it}(\beta_0)$.

Wooldridge (1999) showed that the parameters in this model can be estimated from the conditional moment conditions:

\[ E(\rho_{it}(\beta_0) | x_i) = 0 \quad \forall\, t = 1, \dots, T \tag{3.2.2} \]

where $\rho_{it}(\beta) = y_{it} - \mu(x_{it}, \beta) \frac{\sum_{s=1}^T y_{is}}{\sum_{s=1}^T \mu(x_{is}, \beta)}$. Under (3.2.1), for any deterministic functions $g_t(.)$, the following unconditional moment conditions will hold:

\[ E(g_t(x_{i1}, \dots, x_{iT}) \rho_{it}(\beta_0)) = 0 \quad \forall\, t = 1, \dots, T \tag{3.2.3} \]

Therefore, under standard regularity conditions, any estimator $\hat{\beta}$ of $\beta_0$ defined by:

\[ \sum_{i=1}^n \sum_{t=1}^T g_t(x_{i1}, \dots, x_{iT}) \rho_{it}(\hat{\beta}) = 0 \tag{3.2.4} \]

will be consistent for $\beta_0$ and asymptotically normal. All of the estimators considered in this paper can be written as (3.2.4), so that they are consistent as long as (3.2.1) holds, independently of what other assumptions are considered to study efficiency.
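As a quick numerical illustration of (3.2.2), the sketch below (an illustrative Python example with assumed parameter values, not part of the original text) simulates data satisfying (3.2.1) and checks that the sample average of $\rho_{it}$ is close to zero:

```python
import numpy as np

def rho(y, mu):
    """rho_it(beta) = y_it - mu_it * (sum_s y_is) / (sum_s mu_is),
    computed unit by unit; y is (n, T), mu is a length-T vector held fixed here."""
    return y - mu * (y.sum(axis=1, keepdims=True) / mu.sum())

rng = np.random.default_rng(0)
n, T = 200_000, 4
mu = np.array([0.5, 1.0, 1.5, 2.0])      # mu(x_it, beta_0), fixed across i for simplicity
c = rng.exponential(1.0, size=(n, 1))    # multiplicative heterogeneity c_i
y = rng.poisson(c * mu)                  # any distribution with mean c_i * mu_it works
print(rho(y, mu).mean(axis=0))           # each entry close to 0
```

The moment condition holds regardless of the distribution of $c_i$, which is why it identifies $\beta_0$ without modeling the heterogeneity.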
3.2.1 Asymptotically Efficient Estimation

The conditional moment conditions written in (3.2.2) can be rewritten in the form:

\[ E(\rho_i(\beta_0) | x_i) = 0 \tag{3.2.5} \]

where $\rho_i(\beta) = [\rho_{i1}(\beta), \dots, \rho_{iT}(\beta)]'$. Similarly as in Chamberlain (1987), an optimal estimator for $\beta_0$ from (3.2.5) can be postulated to be $\hat{\beta}_{opt}$ from:

\[ \sum_{i=1}^n D_i' \Sigma_i^{+} \rho_i(\hat{\beta}_{opt}) = 0 \tag{3.2.6} \]

where $D_i = E(\frac{\partial \rho_i}{\partial \beta}(\beta_0) | x_i)$, $\Sigma_i = Var(\rho_i(\beta_0) | x_i)$ and $\Sigma_i^{+}$ is some generalized inverse of $\Sigma_i$.¹ If $\hat{\beta}_{opt}$ is indeed optimal, $D_i' \Sigma_i^{+}$ can be called the optimal instruments for the vector of moment functions $\rho_i(\beta)$.

¹ That $\hat{\beta}_{opt}$ is optimal for estimating $\beta_0$ from (3.2.5) has to be proven since Chamberlain (1987) considers cases where $Var(\rho_i(\beta_0) | x_i)$ is non-singular a.s., but in our case $Var(\rho_i(\beta_0) | x_i)$ can be shown to be non-invertible.

Under standard regularity conditions:

\[ \sqrt{n}(\hat{\beta}_{opt} - \beta_0) \xrightarrow{d} N(0, V_{opt}) \]
\[ V_{opt} = E(D_i' \Sigma_i^{+} D_i)^{-1} E(D_i' \Sigma_i^{+} \Sigma_i \Sigma_i^{+} D_i) E(D_i' \Sigma_i^{+} D_i)^{-1} = E(D_i' \Sigma_i^{+} D_i)^{-1} \]

Appendix C.1 shows that, when a specific generalized inverse of $\Sigma_i$ denoted $\Sigma_i^{-}$ is used, $\hat{\beta}_{opt}$ is asymptotically efficient for estimating $\beta_0$ from (3.2.1), by showing that $V_{opt}$ is equal to the inverse of the asymptotic information bound for estimating $\beta_0$ from (3.2.1) derived in Chamberlain (1992b).²

3.2.2 Conditions for Efficiency of the Poisson FE estimator

As shown in Wooldridge (1999), the Poisson fixed effects estimator, $\hat{\beta}_{PFE}$, is defined by:

\[ \sum_{i=1}^n \Big( \frac{\partial p_i(\hat{\beta}_{PFE})}{\partial \beta} \Big)' W_i(\hat{\beta}_{PFE})^{-1} \rho_i(\hat{\beta}_{PFE}) = 0 \tag{3.2.7} \]

where $p_i(\beta) = [p_{i1}(\beta), \dots, p_{iT}(\beta)]'$, $p_{it}(\beta) = \frac{\mu_{it}(\beta)}{\sum_{s=1}^T \mu_{is}(\beta)}$, $W_i(\beta) = diag(p_i(\beta))$, and $diag(a)$ is the diagonal matrix with $a$ for diagonal.
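For the common exponential mean $\mu(x_{it}, \beta) = \exp(x_{it}\beta)$, the first-order condition (3.2.7) reduces (see the computation in Appendix C.2) to $\sum_i \sum_t (x_{it} - \bar{x}_i(\beta)) \rho_{it}(\beta) = 0$, where $\bar{x}_i(\beta)$ is the $\mu$-weighted time average of $x_{it}$. A minimal Python sketch (illustrative only: scalar $\beta$, simulated data, bisection in place of a full solver) solves this equation directly:

```python
import numpy as np

def pfe_score(beta, y, x):
    """Sample moment from (3.2.7) specialized to mu = exp(x * beta), scalar beta:
    sum_i sum_t (x_it - xbar_i(beta)) * rho_it(beta), where xbar_i(beta) is the
    mu-weighted time average of x_it and rho_it is as in Section 3.2."""
    mu = np.exp(beta * x)
    xbar = (x * mu).sum(1, keepdims=True) / mu.sum(1, keepdims=True)
    rho = y - mu * y.sum(1, keepdims=True) / mu.sum(1, keepdims=True)
    return ((x - xbar) * rho).sum()

# simulated data satisfying (3.2.1) with beta_0 = 1 (assumed, illustrative DGP)
rng = np.random.default_rng(2)
n, T, beta0 = 2000, 5, 1.0
x = rng.uniform(-1, 1, size=(n, T))
c = np.exp(rng.uniform(-1, 1, size=(n, 1)))
y = rng.poisson(c * np.exp(beta0 * x))

# solve pfe_score(beta) = 0 by bisection (the score is decreasing in beta)
lo, hi = 0.0, 2.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if pfe_score(mid, y, x) > 0 else (lo, mid)
beta_pfe = 0.5 * (lo + hi)
```

Note that the estimating equation never involves the $c_i$, which is the practical appeal of the PFE moment function.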
Under standard regularity conditions, $\hat{\beta}_{PFE}$ is asymptotically equivalent to $\tilde{\beta}_{PFE}$ defined by:

\[ \sum_{i=1}^n \Big( \frac{\partial p_i(\beta_0)}{\partial \beta} \Big)' W_i(\beta_0)^{-1} \rho_i(\tilde{\beta}_{PFE}) = 0 \tag{3.2.8} \]

We show in Appendix C.2 that $D_i' \Sigma_i^{-} = -\big( \frac{\partial p_i(\beta_0)}{\partial \beta} \big)' W_i(\beta_0)^{-1}$ if (3.2.1) holds as well as:

\[ Var(y_{it} | c_i, x_i) = c_i \mu_{it} \tag{3.2.9} \]
\[ Cov(y_{it}, y_{it-s} | c_i, x_i) = 0 \quad \forall\, s = 1, \dots, t \tag{3.2.10} \]

Therefore under these additional conditions, the PFE estimator uses the optimal instruments for $\rho_i(\beta)$ in order to estimate $\beta_0$ and is asymptotically efficient in the class of estimators that are consistent under (3.2.1).

² A corollary of that result is that $\hat{\beta}_{opt}$ is indeed also optimal for estimating $\beta_0$ from (3.2.5), since (3.2.1) implies (3.2.5). Therefore, the optimal estimator from (3.2.5) corresponds to the optimal estimator from (3.2.1), i.e. no information is lost for estimating $\beta_0$ from transforming (3.2.1) to (3.2.5).

3.2.3 An Alternative Estimator

In this section we derive an optimal estimator for cases where one thinks there might be overdispersion, so that instead of (3.2.9) we have:

\[ Var(y_{it} | c_i, x_i) = c_i \mu_{it} + \theta c_i^2 \mu_{it}^2 \tag{3.2.11} \]

where $\theta$ is an unknown parameter, and serial correlation, so that instead of (3.2.10) we have:

\[ Cov(y_{it}, y_{it-s} | c_i, x_i) = \gamma c_i^2 \mu_{it} \mu_{it-1} \quad \text{for } s = 1 \tag{3.2.12} \]
\[ Cov(y_{it}, y_{it-s} | c_i, x_i) = 0 \quad \text{for } s > 1 \tag{3.2.13} \]

where $\gamma$ is an unknown parameter. Note that (3.2.9) and (3.2.10) are a special case of assumptions (3.2.11) and (3.2.12), since both sets of assumptions are the same with $\theta = 0$ and $\gamma = 0$. Appendix C.3 shows that, as long as $T \geq 3$, consistent estimators of $\theta$ and $\gamma$ can be obtained under the assumptions (3.2.1), (3.2.11) and (3.2.12); denote these estimators $\hat{\theta}$ and $\hat{\gamma}$.

As seen in Section 3.2.1, the optimal instruments for $\rho_i(\beta)$ are $D_i' \Sigma_i^{-}$ where:

\[ D_i = -E(c_i | x_i) \Big( \sum_{t=1}^T \mu_{it} \Big) \Big[ \frac{\partial p_{it}}{\partial \beta} \Big]_{t=1,\dots,T} \tag{3.2.14} \]

and $\Sigma_i^{-}$ is a specific generalized inverse of $\Sigma_i = Var(\rho_i | x_i)$. Without (3.2.9) and (3.2.10), $D_i' \Sigma_i^{-}$ does depend on $E(c_i | x_i)$ and $Var(c_i | x_i)$.
Therefore with assumptions (3.2.11) and (3.2.12) instead of (3.2.9) and (3.2.10), the conditional mean and variance of the unobserved heterogeneity term $c_i$ are needed to compute the optimal instruments. One can model these as known functions $h_1$ and $h_2$ of a vector of unknown nuisance parameters $\eta$:

\[ E(c_i | x_i) = h_1(x_i, \eta) \tag{3.2.15} \]
\[ Var(c_i | x_i) = h_2(x_i, \eta) \tag{3.2.16} \]

and estimate $\eta$ consistently since under (3.2.1), (3.2.11), (3.2.12), (3.2.15) and (3.2.16):

\[ E(y_{it} | x_i) = h_1(x_i, \eta) \mu_{it} \tag{3.2.17} \]
\[ E(y_{it}^2 | x_i) = h_1(x_i, \eta) \mu_{it} + (\theta + 1)(h_2(x_i, \eta) + h_1(x_i, \eta)^2) \mu_{it}^2 \tag{3.2.18} \]

Therefore a consistent estimator of $\eta$ under (3.2.1), (3.2.11), (3.2.12), (3.2.15) and (3.2.16) can be obtained by pooled non-linear regression from (3.2.17) and (3.2.18) with $\mu_{it}$ replaced by $\mu_{it}(\ddot{\beta})$ and $\theta$ replaced by $\hat{\theta}$, where $\ddot{\beta}$ is a preliminary consistent estimator of $\beta_0$. Denote by $\hat{\eta}$ the resulting estimator of $\eta$.

The alternative estimator to Poisson fixed effects we propose in this paper is $\hat{\beta}_{alt}$ defined by:

\[ \sum_{i=1}^n \hat{D}_i' \hat{\Sigma}_i^{-} \rho_i(\hat{\beta}_{alt}) = 0 \tag{3.2.19} \]

where

\[ \hat{D}_i = h_1(x_i, \hat{\eta}) \Big( \sum_{t=1}^T \ddot{\mu}_{it} \Big) \Big[ \frac{\partial p_{it}}{\partial \beta}(\ddot{\beta}) \Big]_{t=1,\dots,T} \tag{3.2.20} \]

and:

\[ \hat{\Sigma}_i^{-} = \hat{\Sigma}_{y,i}^{-1} - \hat{\Sigma}_{y,i}^{-1} \ddot{\mu}_i (\ddot{\mu}_i' \hat{\Sigma}_{y,i}^{-1} \ddot{\mu}_i)^{-1} \ddot{\mu}_i' \hat{\Sigma}_{y,i}^{-1} \tag{3.2.21} \]

where $\ddot{\mu}_{it} = \mu_{it}(\ddot{\beta})$, $\ddot{\mu}_i = \mu_i(\ddot{\beta}) = [\mu_{i1}(\ddot{\beta}), \dots, \mu_{iT}(\ddot{\beta})]'$ and the $(t, s)$th element of $\hat{\Sigma}_{y,i}$ is:

\[ \widehat{Cov}(y_{it}, y_{is} | x_i) = 1[t = s]\, h_1(x_i, \hat{\eta}) \ddot{\mu}_{it} + \ddot{\mu}_{it} \ddot{\mu}_{is} \Big( 1[|t - s| \leq 1]\, \hat{\theta}^{\,1-|t-s|} \hat{\gamma}^{\,|t-s|} (h_2(x_i, \hat{\eta}) + h_1(x_i, \hat{\eta})^2) + h_2(x_i, \hat{\eta}) \Big) \]

where $\ddot{\beta}$ can simply be defined to be the Poisson fixed effects estimator. This estimator uses optimal instruments for $\rho_i(\beta)$ and is asymptotically efficient in the class of estimators of $\beta_0$ that are consistent under (3.2.1) as long as (3.2.11), (3.2.12), (3.2.15) and (3.2.16) hold.
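The matrix in (3.2.21) can be checked numerically to be a generalized inverse of $\Sigma_i = (I - P_i)\Sigma_{y,i}(I - P_i)'$, where $P_i = \mu_i \iota' / \sum_t \mu_{it}$ (see Appendix C.1). A short Python sketch with an arbitrary positive definite stand-in for $\Sigma_{y,i}$ and illustrative $\mu_i$ values:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 4
mu = rng.uniform(0.5, 2.0, size=T)            # mu_i(beta) for one unit (illustrative)
A = rng.normal(size=(T, T))
Sigma_y = A @ A.T + np.eye(T)                 # arbitrary p.d. stand-in for Var(y_i | x_i)
Sy_inv = np.linalg.inv(Sigma_y)

# generalized inverse from (3.2.21)
Sig_minus = Sy_inv - Sy_inv @ np.outer(mu, mu) @ Sy_inv / (mu @ Sy_inv @ mu)

# Sigma_i = (I - P_i) Sigma_y (I - P_i)' with P_i = mu 1' / sum_t mu_t (Appendix C.1)
P = np.outer(mu, np.ones(T)) / mu.sum()
M = np.eye(T) - P
Sigma = M @ Sigma_y @ M.T

print(np.allclose(Sigma @ Sig_minus @ Sigma, Sigma),
      np.allclose(Sig_minus @ Sigma @ Sig_minus, Sig_minus))
```

Both checks hold for any positive definite $\Sigma_{y,i}$, which mirrors the algebra of Appendix C.1: the key facts are $\Sigma_i^{-}\mu_i = 0$ and $P_i\mu_i = \mu_i$.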
Appendix C.4 shows that when (3.2.9) and (3.2.10) hold, so that (3.2.11) and (3.2.12) hold with $\theta = 0$, $\gamma = 0$, and $\hat{\beta}_{PFE}$ is asymptotically efficient, $\hat{\beta}_{alt}$ and $\hat{\beta}_{PFE}$ are asymptotically equivalent, independently of whether (3.2.15) and (3.2.16) hold. Therefore, the estimator $\hat{\beta}_{alt}$ is indeed efficient under more general conditions than the Poisson fixed effects estimator.

3.3 Monte Carlo Simulations Study

To compare the small sample performance of the Poisson fixed effects estimator and the alternative estimator defined by (3.2.19), we use both estimators to estimate $\beta_0$ from the data generating process:

\[ x_{it} \overset{i.i.d.}{\sim} Uniform(-a, a) \]
\[ c_i | x_i \sim F_c(x_i) \]
\[ e_{it} \overset{i.i.d.}{\sim} Uniform(a_e, b_e) \]
\[ u_{it} = \delta \exp(e_{it-1}) + \exp(e_{it}) \]
\[ y_{it} \sim Poisson(c_i \exp(\beta_0 x_{it}) u_{it}) \]

where $Uniform(a, b)$ denotes the uniform distribution over the interval $(a, b)$ and $Poisson(\mu)$ denotes the Poisson distribution with mean $\mu$. We set $\beta_0 = 1$. We also set $a_e$, $b_e$ and $\delta$ so that $E(\exp(e_{it})) = \frac{1}{1+\delta}$ and $Var(\exp(e_{it})) = \frac{\theta}{1+\delta^2}$ (so that $E(u_{it} | c_i, x_i) = 1$ and $Var(u_{it} | c_i, x_i) = \theta$) and $Cov(u_{it}, u_{it-1} | c_i, x_i) = \delta Var(\exp(e_{it})) = \gamma$. Therefore (3.2.1), (3.2.11) and (3.2.12) are satisfied.

Tables 3.1, 3.2 and 3.3 show the performance of both estimators and of the unfeasible optimal estimator using $D_i' \Sigma_i^{-}$ as instruments for $\rho_i(\beta)$ from Monte Carlo simulations. The results shown are measures of bias, standard deviation, and root MSE for sample sizes $N = 100$, $N = 500$ and $N = 1000$, with ten time periods and for the cases where $\{\theta = 0, \gamma = 0\}$ and $\{\theta = 1, \gamma = 0.5\}$. $F_c(c_i)$ is given by $c_i = \exp(\lambda \bar{x}_i + Uniform(-a, a))$ where $\bar{x}_i = \frac{1}{T} \sum_{t=1}^T x_{it}$. We show results for $\lambda = 0$ and $\lambda = 1$. In both cases, as $a$ increases, the variances of $c_i$ and $c_i^2$ increase. We show results for $a = 1, 1.5, 2$. The model for the conditional mean and variance of $c_i$ that we use is:

\[ h_1(x_i, \eta) = \eta_1 \]
\[ h_2(x_i, \eta) = \eta_2 \]

Therefore this model corresponds to the true data generating process when $\lambda = 0$ but not when $\lambda = 1$.
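The data generating process above can be sketched in Python for the simplest configuration, $\theta = 0$, $\gamma = 0$ (so $\delta = 0$ and $e_{it}$ degenerate at zero, giving $u_{it} = 1$) and $\lambda = 0$; the general case additionally requires calibrating $a_e$, $b_e$ and $\delta$ to the stated moment conditions. All specific values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, a, beta0 = 50_000, 5, 1.0, 1.0

x = rng.uniform(-a, a, size=(n, T))               # x_it ~ Uniform(-a, a)
c = np.exp(rng.uniform(-a, a, size=(n, 1)))       # c_i = exp(Uniform(-a, a)), lambda = 0
u = np.ones((n, T))                               # theta = 0, gamma = 0  =>  u_it = 1
y = rng.poisson(c * np.exp(beta0 * x) * u)        # y_it ~ Poisson(c_i exp(beta0 x_it) u_it)

# sanity check: E(y_it exp(-x_it)) = E(c_i) = (e - 1/e) / 2 when a = 1
print((y * np.exp(-x)).mean())                    # approx 1.175
```

Replacing `u` with draws built from $\delta \exp(e_{it-1}) + \exp(e_{it})$ produces the overdispersed, serially correlated designs in the tables.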
When $\{\theta = 0, \gamma = 0\}$, both the Poisson fixed effects estimator and our alternative estimator are asymptotically efficient in the class of estimators consistent under (3.2.1), independently of the distribution of $c_i$. When $\{\theta = 1, \gamma = 0.5\}$ and $\lambda = 0$, the Poisson fixed effects estimator is not asymptotically efficient while our alternative estimator is. When $\{\theta = 1, \gamma = 0.5\}$ and $\lambda = 1$, neither the Poisson fixed effects estimator nor our alternative estimator is efficient.

The results in Tables 3.1, 3.2 and 3.3 show that significant gains in efficiency can be achieved by using the unfeasible optimal instruments, but that in small samples the additional noise originating from estimating $\eta_1$, $\eta_2$, $\theta$ and $\gamma$ to compute a feasible estimator can overpower this gain in efficiency and result in the alternative estimator defined by (3.2.19) being significantly less accurate than the Poisson fixed effects estimator. A solution to this problem could be to derive a data-based criterion that captures the trade-off between asymptotic efficiency and finite sample noise from nuisance parameters and helps decide between different models of optimal instruments. This is left for future research.

The conclusion of this section is that the Poisson fixed effects estimator performs well with small sample sizes compared to an alternative estimator that is asymptotically efficient under more general conditions. However, with large enough sample sizes, significant gains in efficiency can be obtained from using a more general model of optimal instruments.
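The nuisance parameter estimators $\hat{\theta}$ and $\hat{\gamma}$ used by the feasible alternative estimator are simple method-of-moments ratios (derived in Appendix C.3). An illustrative Python sketch, plugging in the true $\mu_{it}$ in place of the preliminary estimates $\mu_{it}(\ddot{\beta})$ and using data generated with $\theta = 0$, $\gamma = 0$ (all simulation settings are assumptions for illustration):

```python
import numpy as np

def theta_gamma_hat(y, mu):
    """Ratio estimators from Appendix C.3 (with mu_it treated as known here)."""
    m0 = ((y**2 - y) / mu**2).mean()                               # E[(y^2 - y)/mu^2]
    m1 = (y[:, 1:] * y[:, :-1] / (mu[:, 1:] * mu[:, :-1])).mean()  # lag-1 cross moment
    m2 = (y[:, 2:] * y[:, :-2] / (mu[:, 2:] * mu[:, :-2])).mean()  # lag-2 cross moment
    return m0 / m2 - 1.0, m1 / m2 - 1.0

# data with theta = 0, gamma = 0 (u_it = 1): both estimators should be near zero
rng = np.random.default_rng(4)
n, T = 100_000, 5
x = rng.uniform(-1, 1, size=(n, T))
c = np.exp(rng.uniform(-1, 1, size=(n, 1)))
mu = np.exp(x)                       # mu_it at beta_0 = 1
y = rng.poisson(c * mu)
theta_hat, gamma_hat = theta_gamma_hat(y, mu)
```

The lag-2 cross moment in the denominator isolates $E(c_i^2)$, which is why $T \geq 3$ is required.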
Table 3.1: N = 100: Bias, standard deviation and root mean squared error

                       c ~ exp(Uniform(-1,1))      c ~ exp(Uniform(-1.5,1.5))    c ~ exp(Uniform(-2,2))
                       Bias     sd      rmse       Bias     sd      rmse         Bias     sd      rmse

θ = 0, γ = 0, λ = 0
Poisson FE             0.002    0.057   0.057      0.002    0.034   0.034        0.001    0.022   0.022
Feasible alternative   0.002    0.057   0.057      0.004    0.064   0.065        0.001    0.077   0.077
Unfeasible optimal     0.002    0.057   0.057      0.002    0.034   0.034        0.001    0.022   0.022

θ = 0, γ = 0, λ = 1
Poisson FE             0.001    0.055   0.055     −0.001    0.033   0.033        0.001    0.021   0.021
Feasible alternative   0.001    0.055   0.055     −0.002    0.051   0.051       −0.004    0.078   0.078
Unfeasible optimal     0.001    0.055   0.055     −0.001    0.033   0.033        0.001    0.021   0.021

θ = 1, γ = .5, λ = 0
Poisson FE             0.005    0.086   0.086      0.004    0.066   0.067        0.000    0.063   0.063
Feasible alternative   0.004    0.077   0.078     −0.003    0.102   0.102       −0.002    0.102   0.102
Unfeasible optimal     0.005    0.076   0.077      0.003    0.054   0.054        0.001    0.043   0.043

θ = 1, γ = .5, λ = 1
Poisson FE             0.007    0.088   0.089      0.003    0.069   0.069       −0.001    0.067   0.067
Feasible alternative   0.008    0.083   0.083      0.001    0.088   0.088       −0.004    0.128   0.128
Unfeasible optimal     0.007    0.078   0.079      0.004    0.053   0.053       −0.002    0.041   0.041

Table 3.2: N = 500: Bias, standard deviation and root mean squared error

                       c ~ exp(Uniform(-1,1))      c ~ exp(Uniform(-1.5,1.5))    c ~ exp(Uniform(-2,2))
                       Bias     sd      rmse       Bias     sd      rmse         Bias     sd      rmse

θ = 0, γ = 0, λ = 0
Poisson FE            −0.000    0.025   0.025     −0.000    0.015   0.015       −0.000    0.010   0.010
Feasible alternative  −0.000    0.025   0.025     −0.000    0.015   0.015       −0.000    0.012   0.012
Unfeasible optimal    −0.000    0.025   0.025     −0.000    0.015   0.015       −0.000    0.010   0.010

θ = 0, γ = 0, λ = 1
Poisson FE            −0.001    0.024   0.024      0.000    0.015   0.015        0.000    0.010   0.010
Feasible alternative  −0.001    0.024   0.024      0.000    0.015   0.015        0.000    0.012   0.012
Unfeasible optimal    −0.001    0.024   0.024      0.000    0.015   0.015        0.000    0.010   0.010

θ = 1, γ = .5, λ = 0
Poisson FE            −0.002    0.040   0.040     −0.001    0.031   0.031       −0.001    0.028   0.028
Feasible alternative  −0.003    0.036   0.035     −0.002    0.030   0.030       −0.002    0.023   0.023
Unfeasible optimal    −0.003    0.036   0.035     −0.001    0.025   0.025       −0.001    0.019   0.019

θ = 1, γ = .5, λ = 1
Poisson FE            −0.002    0.040   0.040     −0.001    0.032   0.032       −0.000    0.030   0.030
Feasible alternative  −0.003    0.035   0.035     −0.001    0.024   0.024       −0.002    0.030   0.030
Unfeasible optimal    −0.003    0.035   0.035     −0.001    0.024   0.024       −0.000    0.019   0.019

Table 3.3: N = 1000: Bias, standard deviation and root mean squared error

                       c ~ exp(Uniform(-1,1))      c ~ exp(Uniform(-1.5,1.5))    c ~ exp(Uniform(-2,2))
                       Bias     sd      rmse       Bias     sd      rmse         Bias     sd      rmse

θ = 0, γ = 0, λ = 0
Poisson FE            −0.001    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007
Feasible alternative  −0.001    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007
Unfeasible optimal    −0.001    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007

θ = 0, γ = 0, λ = 1
Poisson FE            −0.000    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007
Feasible alternative  −0.000    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007
Unfeasible optimal    −0.000    0.017   0.017     −0.001    0.011   0.011       −0.000    0.007   0.007

θ = 1, γ = .5, λ = 0
Poisson FE            −0.000    0.028   0.028      0.000    0.022   0.022        0.001    0.020   0.020
Feasible alternative   0.000    0.025   0.025      0.001    0.017   0.017        0.000    0.014   0.014
Unfeasible optimal     0.000    0.024   0.024      0.001    0.017   0.017       −0.000    0.013   0.013

θ = 1, γ = .5, λ = 1
Poisson FE            −0.001    0.028   0.028      0.001    0.022   0.022       −0.000    0.021   0.021
Feasible alternative  −0.001    0.024   0.024      0.000    0.017   0.017        0.000    0.022   0.022
Unfeasible optimal    −0.001    0.024   0.024      0.000    0.016   0.016       −0.000    0.013   0.013

APPENDICES

APPENDIX A

ESTIMATION OF DYNAMIC PANEL DATA MODELS WITH CROSS-SECTIONAL DEPENDENCE

A.1 Efficient Estimation with Clustering

A.1.1 Unfeasible Optimal Instruments

Consider any GMM estimator of $\rho_0$ defined as in (1.2.6) for some set of valid instruments $\{Z_i\}_{i=1,\dots,n}$ of dimension $r \times (T-1)$, which can be rewritten as:

\[ \hat{\rho} = \arg\min_\rho \Big( \sum_{g=1}^G Z^{g\prime} m^g(\rho) \Big)' \Xi \Big( \sum_{g=1}^G Z^{g\prime} m^g(\rho) \Big) \tag{A.1.1} \]

where $Z^g = [Z_2^{g\prime}, \dots, Z_T^{g\prime}]'$ and $Z_t^g = [Z_{i_1^g t}, \dots, Z_{i_{n_g}^g t}]$.
From White (2001), $\hat{\rho}$ is consistent for $\rho_0$ and:

\[ \sqrt{G}(\hat{\rho} - \rho_0) \xrightarrow{d} N(0, (D'\Xi D)^{-1} D'\Xi \Upsilon \Xi D (D'\Xi D)^{-1}) \tag{A.1.2} \]

with $D = \operatorname{plim}(\frac{1}{G}\sum_{g=1}^G Z^{g\prime}\Delta Y_{-1}^g)$ and $\Upsilon = \operatorname{plim}(\frac{1}{G}\sum_{g=1}^G Z^{g\prime} m^g m^{g\prime} Z^g)$.

$\Xi = \Upsilon^{-1}$ is the optimal weighting matrix for that estimator, and with such a weighting matrix:

\[ \sqrt{G}(\hat{\rho} - \rho_0) \xrightarrow{d} N(0, (D'\Upsilon^{-1}D)^{-1}) \tag{A.1.3} \]

Therefore in this section we will show that the asymptotic variance of $\hat{\rho}_{opt}$ defined by (1.3.4) is smaller than $(D'\Upsilon^{-1}D)^{-1}$ for any set of valid matrices of instruments $\{Z_i\}_{i=1,\dots,n}$, as long as (1.2.1), (1.2.2) and Auxiliary Assumption 1 are satisfied.

Since $(\Phi^g)^{-1/2}$ is upper triangular with its element in row $j$, column $i$ being a function of $Y_{\max\{i,j\}-1}^g$, for any valid set of instruments $\{Z^g\}_{g=1,\dots,G}$ we have:

\[ E(Z^{g\prime}(\Phi^g)^{-1/2} m^g) = 0 \tag{A.1.4} \]

because the $j$th $r \times n_g$ component of $Z^{g\prime}(\Phi^g)^{-1/2}$ is a function of $Y_{j-1}^g$. In addition, we have:

\[ Var(Z^{g\prime}(\Phi^g)^{-1/2} m^g) = E(Z^{g\prime}(\Phi^g)^{-1/2} \Phi^g (\Phi^g)^{-1/2\prime} Z^g) = E(Z^{g\prime} Z^g) \tag{A.1.5, A.1.6} \]

because the $j$th $r \times n_g$ component of $Z^{g\prime}(\Phi^g)^{-1/2}$ is a function of $Y_{j-1}^g$ and

\[ \Phi^g = [E(m_t^g m_s^{g\prime} | Y_{\max\{t,s\}-2}^g)]_{t=2,\dots,T}^{s=2,\dots,T} \tag{A.1.7} \]

Note that since $E((Z^{g\prime} m^g m^{g\prime} Z^g) - Var(Z^{g\prime} m^g)) = 0$, then

\[ \operatorname{plim} \frac{1}{G} \sum_{g=1}^G (Z^{g\prime} m^g m^{g\prime} Z^g) = \operatorname{plim} \frac{1}{G} \sum_{g=1}^G Var(Z^{g\prime} m^g) \tag{A.1.8} \]

Define:

\[ \Delta \tilde{Y}_{-1}^g = (\Phi^g)^{-1/2} \Delta Y_{-1}^g \tag{A.1.9} \]

Define $\Delta \tilde{Y}_{-1,t}^g$ the $t$th block of $n_g$ rows of $\Delta \tilde{Y}_{-1}^g$. Define:

\[ L_t^g = E(\Delta \tilde{Y}_{-1,t}^g | Y_{t-2}^g) \tag{A.1.10} \]
\[ Z_{opt}^g = L^{g\prime} (\Phi^g)^{-1/2} \tag{A.1.11} \]

Define $L^g = [L_2^{g\prime}, \dots, L_T^{g\prime}]'$, and define $D_{opt} = \operatorname{plim}(\frac{1}{G}\sum_{g=1}^G Z_{opt}^g \Delta Y_{-1}^g)$ and $\Upsilon_{opt} = \operatorname{plim}(\frac{1}{G}\sum_{g=1}^G Z_{opt}^g m^g m^{g\prime} Z_{opt}^{g\prime})$. We have $E(Z_{opt}^g \Delta Y_{-1}^g - L^{g\prime} L^g) = 0$ because the $j$th $r \times n_g$ component of $Z^{g\prime}(\Phi^g)^{-1/2}$ is a function of $Y_{j-1}^g$ and $L_t^g = E(\Delta \tilde{Y}_{-1,t}^g | Y_{t-2}^g)$. Therefore:

\[ D_{opt} = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z_{opt}^g \Delta Y_{-1}^g \Big) = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G L^{g\prime} L^g \Big) \tag{A.1.12, A.1.13} \]

Since $Var(Z^{g\prime}(\Phi^g)^{-1/2} m^g) = E(Z^{g\prime} Z^g)$, we have in particular: $Var(L^{g\prime}(\Phi^g)^{-1/2} m^g) = E(L^{g\prime} L^g)$.
Therefore:

\[ \Upsilon_{opt} = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z_{opt}^g m^g m^{g\prime} Z_{opt}^{g\prime} \Big) = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G L^{g\prime} L^g \Big) = D_{opt} \tag{A.1.14–A.1.16} \]

so that $(D_{opt}' \Upsilon_{opt}^{-1} D_{opt})^{-1} = D_{opt}^{-1}$.

Therefore the estimator $\hat{\rho}_{opt}$ defined by:

\[ \hat{\rho}_{opt} = \arg\min_\rho \Big( \sum_{g=1}^G Z_{opt}^g m^g(\rho) \Big)' \Big( \sum_{g=1}^G Z_{opt}^g m^g(\rho) \Big) \tag{A.1.17} \]

is consistent for $\rho_0$ and $\sqrt{G}$-asymptotically normal with asymptotic variance:

\[ V_{opt} = D_{opt}^{-1} \tag{A.1.18} \]

We can show that this variance-covariance matrix is smaller than $(D'\Upsilon^{-1}D)^{-1}$ no matter what set of instruments $\{Z^g\}_{g=1,\dots,G}$ is used. Denote $\Delta$ the difference between $D_{opt}$ and $D'\Upsilon^{-1}D$:

\[ \Delta = D_{opt} - D'\Upsilon^{-1}D = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z_{opt}^g \Phi^g Z_{opt}^{g\prime} \Big) - D'\Upsilon^{-1}D \tag{A.1.19–A.1.21} \]

Since $(\Phi^g)^{-1/2\prime}(\Phi^g)^{-1/2} = (\Phi^g)^{-1}$ we also have $\Phi^g = \Phi^{g1/2\prime}\Phi^{g1/2}$ where $\Phi^{g1/2}$ is upper triangular and is composed of $n_g \times n_g$ matrices such that the $(j, k)$th matrix for $k > j$ is a function of $Y_{k-1}^g$. Therefore:

\[ E\Big( Z^{g\prime} \Big( \frac{\partial m^g(\rho_0)}{\partial \rho} - \Phi^g Z_{opt}^{g\prime} \Big) \Big) = 0 \tag{A.1.22} \]

since the $j$th $r \times n_g$ component of $Z^{g\prime}$ is a function of $Y_{j-1}^g$ and $(\Phi^g)_t^{1/2} L_t^g = E((\Phi^g)_t^{1/2} \Delta \tilde{Y}_{-1,t}^g | Y_{t-2}^g)$, where $(\Phi^g)_t^{1/2}$ is the $(t-1)$th $n_g \times n_g(T-1)$ matrix composing $(\Phi^g)^{1/2}$. In addition we have

\[ E(Z^{g\prime}(m^g(\rho_0) m^g(\rho_0)' - \Phi^g) Z^g) = 0 \tag{A.1.23} \]

We can then apply the WLLN to show that:

\[ D = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z^{g\prime} \Phi^g Z_{opt}^{g\prime} \Big) \tag{A.1.24} \]
\[ \Upsilon = \operatorname{plim}\Big( \frac{1}{G} \sum_{g=1}^G Z^{g\prime} \Phi^g Z^g \Big) \tag{A.1.25} \]

Define $D_n = \frac{1}{G} \sum_{g=1}^G Z^{g\prime} \Phi^g Z_{opt}^{g\prime}$, $\Upsilon_n = \frac{1}{G} \sum_{g=1}^G Z^{g\prime} \Phi^g Z^g$ and $D_{opt,n} = \frac{1}{G} \sum_{g=1}^G Z_{opt}^g \Phi^g Z_{opt}^{g\prime}$. Define $\bar{Z}_{opt} = [Z_{opt}^{1\prime}, \dots, Z_{opt}^{G\prime}]'$, $\bar{Z} = [Z^{1\prime}, \dots, Z^{G\prime}]'$ and $S = diag(\{\Phi^g\}_{g=1,\dots,G})$. Then:

\[ D_{opt,n} - D_n' \Upsilon_n^{-1} D_n = \frac{1}{G}\big( \bar{Z}_{opt}' S \bar{Z}_{opt} - \bar{Z}_{opt}' S \bar{Z} (\bar{Z}' S \bar{Z})^{-1} \bar{Z}' S \bar{Z}_{opt} \big) = \frac{1}{G} \bar{Z}_{opt}' S^{1/2\prime} \big( I - S^{1/2} \bar{Z} (\bar{Z}' S \bar{Z})^{-1} \bar{Z}' S^{1/2\prime} \big) S^{1/2} \bar{Z}_{opt} \tag{A.1.26, A.1.27} \]

Therefore $D_{opt,n} - D_n' \Upsilon_n^{-1} D_n$ is positive semi-definite for any value of $n$, and $D_{opt} - D'\Upsilon^{-1}D$ is positive semi-definite by the continuous mapping theorem, so that $(D'\Upsilon^{-1}D)^{-1} - V_{opt}$ is positive semi-definite. A similar result was found in Chamberlain (1992a) for the case of cross-sectional independence.
A.1.2 Efficient Estimation with Auxiliary Assumptions

Under Auxiliary Assumptions 1-2a, the variance-covariance matrix of $u_t^g$ is:

\[ \Sigma_u^g = \sigma_u^2 \begin{pmatrix} 1 & & & \\ \tau_u & 1 & & \\ \vdots & & \ddots & \\ \tau_u & \dots & \tau_u & 1 \end{pmatrix} \tag{A.1.28} \]

and we have

\[ E(m_t^g m_s^{g\prime} | Y_{\max\{t,s\}-2}^g) = 2\Sigma_u^g \quad \text{if } t = s \tag{A.1.29} \]
\[ = -\Sigma_u^g \quad \text{if } |t - s| = 1 \tag{A.1.30} \]
\[ = 0 \quad \text{if } |t - s| \geq 2 \tag{A.1.31} \]

Therefore:

\[ \Phi^g = J^g (I_T \otimes \Sigma_u^g) J^{g\prime} \tag{A.1.32} \]

where:

\[ J^g = \begin{pmatrix} -1 & 0 & \dots & 0 & 1 & 0 & \dots & 0 \\ 0 & -1 & 0 & \dots & 0 & 1 & \dots & 0 \\ & & & \dots & & & & \\ 0 & \dots & 0 & -1 & \dots & 0 & 1 \end{pmatrix} \tag{A.1.33} \]

is the deterministic differencing matrix such that $J^g u^g = m^g$.

Therefore $L_t^g = \Phi^{g-1/2} E(\frac{\partial m^g}{\partial \rho} | Y_{t-2}^g)$ and

\[ Z_{opt}^g = [E(\Delta Y_{-1}^g | Y_0^g)', \dots, E(\Delta Y_{-1}^g | Y_{T-2}^g)'] \Psi^g \Phi^{g-1/2} \tag{A.1.34} \]

where:

\[ \Psi^g = \begin{pmatrix} \Phi_1^{g-1/2} & 0 & \dots & 0 \\ 0 & \Phi_2^{g-1/2} & & \\ \vdots & & \ddots & \\ 0 & \dots & & \Phi_{T-1}^{g-1/2} \end{pmatrix} \tag{A.1.35} \]

where $\Phi_j^{g-1/2}$ is the $j$th $n_g \times n_g(T-1)$ matrix composing $(\Phi^g)^{-1/2}$.

A.1.3 Conditional Expectation of Unobserved Heterogeneity under Clustering

Under Auxiliary Assumptions 1, 2a, 3a we have:

\[ \begin{pmatrix} c^g \\ y_0^g \\ c^g + u_1^g \\ \vdots \\ c^g + u_T^g \end{pmatrix} \sim N(\mu^g, A^g V^g A^{g\prime}) \tag{A.1.36} \]

Therefore, using the properties of the multivariate normal distribution, we have:

\[ E(c^g | y_0^g, c^g + u_1^g, \dots, c^g + u_T^g) = \mu_c \iota + (A^g V^g A^{g\prime})_{12} \big( (A^g V^g A^{g\prime})_{22} \big)^{-1} \Bigg( \begin{pmatrix} y_0^g \\ c^g + u_1^g \\ \vdots \\ c^g + u_T^g \end{pmatrix} - \begin{pmatrix} \frac{\mu_c}{1 - \rho_0} \iota_{n_g} \\ \mu_c \iota_{T \times n_g} \end{pmatrix} \Bigg) \tag{A.1.37} \]

where $(A^g V^g A^{g\prime})_{12} = Cov(c^g, [y_0^{g\prime}, (c^g + u_1^g)', \dots, (c^g + u_T^g)']')$ and $(A^g V^g A^{g\prime})_{22} = Var([y_0^{g\prime}, (c^g + u_1^g)', \dots, (c^g + u_T^g)']')$, and both matrices are components of $A^g V^g A^{g\prime}$.

$E(c^g | y_0^g, c^g + u_1^g, \dots, c^g + u_t^g)$ can be obtained in a similar fashion by considering only the first $((t+2)n_g) \times ((t+2)n_g)$ block of $A^g V^g A^{g\prime}$.

APPENDIX B

ESTIMATION OF UNOBSERVED EFFECTS PANEL DATA MODELS UNDER SEQUENTIAL EXOGENEITY

B.1 GMM Estimation and Efficiency Bound

Define $\Sigma_i = [Cov(\rho_{it}, \rho_{is} | z_i^{\max(t-1,s-1)})]_{t=1,\dots,T}^{s=1,\dots,T}$. Define $A_i$ and $\tilde{\Sigma}_i^{-1}$ to be the terms of the LDL decomposition of $\Sigma_i^{-1}$: $\Sigma_i^{-1} = A_i' \tilde{\Sigma}_i^{-1} A_i$ where $\tilde{\Sigma}_i$ is diagonal and $A_i$ is upper-triangular with only ones on the diagonal. We can show that $A_i = [1(s \geq t)(-1)^{1(s=t)} \Gamma_i^{t,s}]_{t=2,\dots,T}^{s=2,\dots,T}$, where $1(.)$ is the indicator function, and $\tilde{\Sigma}_i = diag(\{Var(\tilde{\rho}_{it} | z_i^{t-1})\}_{t=2,\dots,T})$, so that:

\[ J = E\Big( \sum_{t=2}^T \tilde{D}_{it}' \tilde{\Sigma}_{it}^{-1} \tilde{D}_{it} \Big) = E\Big( \ddot{E}\Big( A_i \frac{\partial \rho_i}{\partial \beta} \Big| z_i \Big)' \tilde{\Sigma}_i^{-1} \ddot{E}\Big( A_i \frac{\partial \rho_i}{\partial \beta} \Big| z_i \Big) \Big) \tag{B.1.1, B.1.2} \]

where $\ddot{E}([g_t]_{t=2,\dots,T} | [x_1, \dots, x_{T-1}])$ is a matrix operator that returns $E(g_t | x_{t-1})$ as its $(t-1)$th row, $t = 2, \dots, T$, where the $g_t$ are row random vectors.

Using standard results of GMM estimation we can write:

\[ \sqrt{n}(\hat{\beta}_{Lin} - \beta_0) = W^{-1} \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \Big)' \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i \Big)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i + o_p(1) \tag{B.1.3} \]
\[ W = \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \Big)' \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i \Big)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \tag{B.1.4} \]

where $\rho_i = \rho_i(\beta_0)$ and $\frac{\partial \rho_i}{\partial \beta} = \frac{\partial \rho_i(\beta_0)}{\partial \beta}$.

Applying the WLLN, $\frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} = O_p(1)$. Also $\frac{1}{n} \sum_{i=1}^n Z_i' (\rho_i \rho_i' - \Sigma_i) Z_i = o_p(1)$ since $E(Z_i'(\rho_i \rho_i' - \Sigma_i) Z_i) = 0$ from how $Z_i$ and $\Sigma_i$ were defined. Using the CLT, $\frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i = O_p(1)$. Using Slutsky's theorem, assuming $\operatorname{plim} \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i$ is p.d., we have

\[ \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i \Big)^{-1} - \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i \Big)^{-1} = o_p(1) \tag{B.1.5} \]

So $W = V + o_p(1)$ where $V = ( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} )' ( \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i )^{-1} \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta}$, and using Slutsky's theorem again, assuming $\operatorname{plim} W$ is finite and p.d., $W^{-1} = V^{-1} + o_p(1)$.
Therefore we can rewrite:

\[ \sqrt{n}(\hat{\beta}_{Lin} - \beta_0) = V^{-1} \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \Big)' \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i \Big)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i + o_p(1) \tag{B.1.6} \]
\[ V = \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \Big)' \Big( \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i \Big)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i' \frac{\partial \rho_i}{\partial \beta} \tag{B.1.7} \]

In addition:

\[ n(\hat{\beta}_{Lin} - \beta_0)(\hat{\beta}_{Lin} - \beta_0)' = V^{-1} + o_p(1) \tag{B.1.8} \]

Since:

\[ \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i \Big( \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i' \rho_i \Big)' = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n Z_i' \rho_i \rho_j' Z_j \tag{B.1.9} \]
\[ = \frac{1}{n} \sum_{i=1}^n Z_i' \rho_i \rho_i' Z_i + o_p(1) \tag{B.1.10} \]
\[ = \frac{1}{n} \sum_{i=1}^n Z_i' \Sigma_i Z_i + o_p(1) \tag{B.1.11} \]

where the second equality follows from random sampling and the WLLN.

We can rewrite $V$ as:

\[ V = \frac{\partial \ddot{\rho}}{\partial \beta}' \ddot{Z} (\ddot{Z}' \ddot{Z})^{-1} \ddot{Z}' \frac{\partial \ddot{\rho}}{\partial \beta} \tag{B.1.12} \]

where $\frac{\partial \ddot{\rho}}{\partial \beta} = [\frac{\partial \ddot{\rho}_1}{\partial \beta}', \dots, \frac{\partial \ddot{\rho}_n}{\partial \beta}']'$, $\frac{\partial \ddot{\rho}_i}{\partial \beta} = \tilde{\Sigma}_i^{-1/2} A_i \frac{\partial \rho_i}{\partial \beta}$, $\ddot{Z} = [\ddot{Z}_1', \dots, \ddot{Z}_n']'$, $\ddot{Z}_i = Z_i' A_i^{-1} \tilde{\Sigma}_i^{1/2}$.

Consider the matrix linear projection of $\frac{\partial \ddot{\rho}_i}{\partial \beta}$ on $\ddot{Z}_i$, $LP(\frac{\partial \ddot{\rho}_i}{\partial \beta} | \ddot{Z}_i) = \ddot{Z}_i C$, where $C$ is a $\dim(Z_i) \times \dim(\beta)$ deterministic matrix defined by the moment conditions:

\[ E\Big( \ddot{Z}_i' \Big( \frac{\partial \ddot{\rho}_i}{\partial \beta} - \ddot{Z}_i C \Big) \Big) = 0 \tag{B.1.13} \]

It is a standard result that as long as $E(\ddot{Z}_i' \ddot{Z}_i)$ is finite and p.d. and $E(\ddot{Z}_i' \frac{\partial \ddot{\rho}_i}{\partial \beta})$ exists, this linear projection is consistently estimated by:

\[ \widehat{LP}\Big( \frac{\partial \ddot{\rho}_i}{\partial \beta} \Big| \ddot{Z}_i \Big) = \ddot{Z}_i (\ddot{Z}' \ddot{Z})^{-1} \ddot{Z}' \frac{\partial \ddot{\rho}}{\partial \beta} = LP\Big( \frac{\partial \ddot{\rho}_i}{\partial \beta} \Big| \ddot{Z}_i \Big) + o_p(1) \tag{B.1.14, B.1.15} \]
In addition, the matrix linear projection of ∂ ρ¨ on Z¨ i is the same as the matrix linear projection of ∂β ¨ ∂ ρ¨ |Zi ) on Z¨ i defined by: E( ∂β ¨ ¨ ∂ ρi |Zi ) − Z¨ C)) = 0 E(Z¨ i (E( i ∂β (B.1.21) Since the t th vector of Z¨ i , Z¨ it , is a function of Zit since zit contains zi1 , ..., zit−1 . Therefore: ¨ ¨ ¨ ∂ ρi |Zi )|Z¨ i )) + o p (1) ¨ ∂ ρi |Zi )|Z¨ i ) LP(E( V = E(LP(E( ∂β ∂β (B.1.22) So using the standard results on linear projection: ¨ V = E(E( ∂ ρ¨ i ¨ ¨ ∂ ρi |Zi )) − E(e ei ) + o p (1) |Zi ) E( i ∂β ∂β T T t=2 t=2 = E( ∑ D˜ t Σ˜ t D˜ t ) − E( ∑ eit eit ) + o p (1) where ei = E( ∂ ρ¨ i ∂β (B.1.24) ¨ ∂ ρ¨ i |Zi )|Z¨ i ) and eit = E( ∂ ρ¨ it |Zit ) − LP(E( ∂ ρ¨ it |Zit )|Z¨ it ). |Zi ) − LP(E( ∂β ∂β 83 (B.1.23) ∂β APPENDIX C EFFICIENCY OF THE POISSON FIXED EFFECTS ESTIMATOR C.1 Efficient Estimation under Conditional Mean Restrictions Chamberlain (1992b), page 581, showed that the asymptotic information bound for estimating β0 from (3.2.1) is: −1 = E(h ∂ µ (Σ−1 − Σ−1 µ (µ Σ−1 µ )−1 µ Σ−1 )∂ µ h ) V0,β i i i i y,i i y,i y,i i i y,i i (C.1.1) 0 ∂µ where hi = E(ci |xi ), xi = {xi1 , ..., xiT } ∂ µi = [∂ µi1 , ..., ∂ µiT ], ∂ µit = ∂ βit , Σy,i = Var(yi |xi ), yi = [yi1 , ..., yiT ], µi = [µi1 , ..., µiT ]. ρi = ρi (β0 ) can be rewritten as: ρi = (I − Pi )yi where   µi1   Pi = T  ∑t=1 µit  µiT 1 (C.1.2)  ... µi1    ...   ... µiT (C.1.3) Therefore:     µi1 ... µi1   ∂ µi1 ... ∂ µi1  T ∂µ     ∑t=1  it   − )yi |xi ) Di = E(( T ... ...     T 2 ( µ ) ∑ ∑t=1 µit    t=1 it  ∂ µiT ... ∂ µiT µiT ... µiT     ∂ µ ... ∂ µ µ ... µ i1  i1 i1   i1 T ∂µ   ∑t=1    1 it  −  )µi = hi ( T ... ...     T ∑t=1 µit   (∑t=1 µit )2   ∂ µiT ... ∂ µiT µiT ... 
µiT 1 ∑T ∂ µit = hi (∂ µi − t=1 µi ) T µ ∑t=1 it Note that: −1 −1 −1 −1 µi (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) = 0 (C.1.4) Therefore: −1 −1 −1 −1 −1 −1 −1 −1 −1 Di (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Di = hi ∂ µi (Σy,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )∂ µi hi 84 (C.1.5) −1 −1 −1 −1 Therefore the only thing left to show is that (Σ−1 y,i −Σy,i µi (µi Σy,i µi ) µi Σy,i ) is a generalized inverse of Σi . Σi = (I − Pi )Σy,i (I − Pi ) = Σy,i − Pi Σy,i − Σy,i Pi + Pi Σy,i Pi Note that: Pi µi = µi (C.1.6) −1 −1 −1 −1 Pi (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) = 0 (C.1.7) and: Therefore: −1 −1 −1 −1 −1 −1 −1 −1 −1 (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Σi = (Σy,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Σyi − 0 −1 −1 −1 −1 − (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Σy,i Pi + 0 = I − Pi −1 −1 −1 −1 Let Mati = (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ). Therefore: −1 −1 −1 −1 Mati Σi Mati = (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) − 0 −1 −1 −1 −1 = (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) Note that: Pi Pi = Pi (C.1.8) Therefore: −1 −1 −1 −1 Σi (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i )Σi = Σi − Σi Pi = Σi − Σy,i Pi + Pi Σy,i Pi + Σy,i Pi − Pi Σy,i Pi = Σi −1 −1 −1 −1 So (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) is indeed a generalized inverse of Σi . Therefore: −1 = E(D Σ− D ) = V −1 V0,β opt i i i 85 (C.1.9) −1 −1 −1 −1 −1 where Σ− i = (Σy,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ). −1 −1 −1 −1 In addition, we can characterize (Σ−1 y,i − Σy,i µi (µi Σy,i µi ) µi Σy,i ) in an alternative way that will be −1 − Σ−1 µ (µ Σ−1 µ )−1 µ Σ−1 ). We have shown that: useful for future results. Denote Bi = (Σy,i i y,i y,i i i y,i i Bi Σi = I − Pi (C.1.10) Σi Bi = I − Pi (C.1.11) XΣi X = X (C.1.12) Σi XΣi = Σi (C.1.13) Since Bi and Σi are symmetric: Bi is the unique matrix X that satisfies: XΣi = I − Pi (C.1.14) Σi X = I − Pi (C.1.15) We have already shown that X = Bi satisfies all of these requirements. 
This solution is unique since, for any $X$, $Y$ satisfying these requirements¹:

\[ X = X \Sigma_i X = X(I - P_i) = X \Sigma_i Y = (I - P_i') Y = Y \Sigma_i Y = Y \tag{C.1.16} \]

¹ This proof of uniqueness is identical to the proof of uniqueness for the Moore-Penrose pseudo inverse found in Penrose (1955).

C.2 Efficient Estimation under the Poisson FE Assumptions

Under (3.2.9) and (3.2.10) we have $\Sigma_{y,i} = h_i \, diag(\mu_i) + v_i \mu_i \mu_i'$, where $v_i = Var(c_i | x_i)$. Therefore:

\[ \Sigma_i = Var(\rho_i | x_i) = (I - P_i)(h_i \, diag(\mu_i) + v_i \mu_i \mu_i')(I - P_i)' = (I - P_i) h_i \, diag(\mu_i) (I - P_i)' \]

where the last equality follows from $\mu_{it} \mu_{is} - \mu_{is} p_{it} \sum_{r=1}^T \mu_{ir} = 0$. Define:

\[ X_i = h_i^{-1} \Big( diag\Big( \frac{1}{\mu_i} \Big) - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) \tag{C.2.1} \]

where $J$ is the $T \times T$ matrix of ones and, by an abuse of notation, $(\frac{1}{\mu_i}) = [\frac{1}{\mu_{i1}}, \dots, \frac{1}{\mu_{iT}}]$.

Note that:

\[ J P_i = J \qquad \text{and} \qquad diag\Big( \frac{1}{\mu_i} \Big) P_i = \frac{1}{\sum_{t=1}^T \mu_{it}} J \tag{C.2.2} \]

Therefore:

\[ \Sigma_i X_i = (I - P_i)\, diag(\mu_i)\, (I - P_i)' \Big( diag\Big( \frac{1}{\mu_i} \Big) - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) = (I - P_i)(I - P_i) = I - P_i \]

Since both $\Sigma_i$ and $X_i$ are symmetric:

\[ X_i \Sigma_i = I - P_i' \tag{C.2.3} \]

So in order to show that $\Sigma_i^{-} = X_i$, there only remains to show that $X_i$ satisfies (C.1.12) and (C.1.13). For (C.1.12):

\[ X_i \Sigma_i X_i = X_i (I - P_i) = X_i - h_i^{-1} \Big( \frac{1}{\sum_{t=1}^T \mu_{it}} J - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) = X_i \]

For (C.1.13):

\[ \Sigma_i X_i \Sigma_i = (I - P_i) \Sigma_i = \Sigma_i - P_i (I - P_i) h_i \, diag(\mu_i) (I - P_i)' = \Sigma_i \]

Therefore we have shown that in this case:

\[ \Sigma_i^{-} = h_i^{-1} \Big( diag\Big( \frac{1}{\mu_i} \Big) - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) \tag{C.2.4} \]

Therefore:

\[ D_i' \Sigma_i^{-} = \frac{h_i}{h_i} \partial\mu_i' \Big( diag\Big( \frac{1}{\mu_i} \Big) - \frac{1}{\sum_{t=1}^T \mu_{it}} J \Big) = \Big( \frac{\partial\mu_i}{\mu_i} \Big)' - \frac{\sum_{t=1}^T \partial\mu_{it}}{\sum_{t=1}^T \mu_{it}} j' \]

where $j = [1, \dots, 1]'$ and, by an abuse of notation, $(\frac{\partial\mu_i}{\mu_i}) = [\frac{\partial\mu_{i1}}{\mu_{i1}}, \dots, \frac{\partial\mu_{iT}}{\mu_{iT}}]$.
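The closed form (C.2.4) can be verified numerically against the general form $B_i$ from Appendix C.1: with $\Sigma_{y,i} = h_i\,diag(\mu_i) + v_i \mu_i \mu_i'$, the two expressions coincide, as the uniqueness argument above implies. A short Python check with illustrative (assumed) values of $h_i$, $v_i$ and $\mu_i$:

```python
import numpy as np

T, h, v = 5, 1.3, 0.7                       # illustrative h_i = E(c|x), v_i = Var(c|x)
rng = np.random.default_rng(5)
mu = rng.uniform(0.5, 2.0, size=T)          # mu_i(beta_0) for one unit

# general form B_i from Appendix C.1, built from Sigma_{y,i} = h diag(mu) + v mu mu'
Sy = h * np.diag(mu) + v * np.outer(mu, mu)
Syi = np.linalg.inv(Sy)
B = Syi - Syi @ np.outer(mu, mu) @ Syi / (mu @ Syi @ mu)

# closed form (C.2.4): Sigma_i^- = h^{-1} (diag(1/mu) - J / sum(mu)), J a matrix of ones
X = (np.diag(1.0 / mu) - np.ones((T, T)) / mu.sum()) / h

print(np.allclose(B, X))  # True
```

The agreement holds for any $v_i$, which is the algebraic reason the PFE instruments do not require knowledge of $Var(c_i | x_i)$ under (3.2.9) and (3.2.10).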
Note that:

\[ \frac{\partial p_{it}}{\partial \beta} = \frac{\partial\mu_{it}}{\sum_{s=1}^T \mu_{is}} - \mu_{it} \frac{\sum_{s=1}^T \partial\mu_{is}}{(\sum_{s=1}^T \mu_{is})^2} \tag{C.2.5} \]

so that:

\[ \frac{1}{p_{it}} \frac{\partial p_{it}}{\partial \beta} = \frac{\partial\mu_{it}}{\mu_{it}} - \frac{\sum_{s=1}^T \partial\mu_{is}}{\sum_{s=1}^T \mu_{is}} \tag{C.2.6} \]

Therefore:

\[ \Big( \frac{\partial p_i(\beta_0)}{\partial \beta} \Big)' W_i(\beta_0)^{-1} = \Big( \frac{\partial\mu_i}{\mu_i} \Big)' - \frac{\sum_{t=1}^T \partial\mu_{it}}{\sum_{t=1}^T \mu_{it}} j' \tag{C.2.7} \]

Hence we have shown that under (3.2.9) and (3.2.10):

\[ D_i' \Sigma_i^{-} = \Big( \frac{\partial p_i(\beta_0)}{\partial \beta} \Big)' W_i(\beta_0)^{-1} \tag{C.2.8} \]

C.3 Consistent Estimation of θ and γ

Under (3.2.1), (3.2.11) and (3.2.12):

\[ E(y_{it}^2 | c_i, x_i) = c_i \mu_{it} + (\theta + 1) c_i^2 \mu_{it}^2 \]
\[ E(y_{it} y_{it-s} | c_i, x_i) = (\gamma + 1) c_i^2 \mu_{it} \mu_{it-s} \quad \text{for } s = 1 \]
\[ E(y_{it} y_{it-s} | c_i, x_i) = c_i^2 \mu_{it} \mu_{it-s} \quad \text{for } s > 1 \]

Therefore:

\[ \theta = \frac{E\big( \frac{y_{it}^2 - y_{it}}{\mu_{it}^2} \big)}{E\big( \frac{y_{it} y_{it-2}}{\mu_{it} \mu_{it-2}} \big)} - 1 \tag{C.3.1} \]

and:

\[ \gamma = \frac{E\big( \frac{y_{it} y_{it-1}}{\mu_{it} \mu_{it-1}} \big)}{E\big( \frac{y_{it} y_{it-2}}{\mu_{it} \mu_{it-2}} \big)} - 1 \tag{C.3.2} \]

Therefore a consistent estimator for $\theta$ under the assumptions (3.2.1), (3.2.11) and (3.2.12) is:

\[ \hat{\theta} = \frac{\frac{1}{nT} \sum_{i=1}^n \sum_{t=1}^T \frac{y_{it}^2 - y_{it}}{\ddot{\mu}_{it}^2}}{\frac{1}{n(T-2)} \sum_{i=1}^n \sum_{t=3}^T \frac{y_{it} y_{it-2}}{\ddot{\mu}_{it} \ddot{\mu}_{it-2}}} - 1 \tag{C.3.3} \]

A consistent estimator of $\gamma$ is:

\[ \hat{\gamma} = \frac{\frac{1}{n(T-1)} \sum_{i=1}^n \sum_{t=2}^T \frac{y_{it} y_{it-1}}{\ddot{\mu}_{it} \ddot{\mu}_{it-1}}}{\frac{1}{n(T-2)} \sum_{i=1}^n \sum_{t=3}^T \frac{y_{it} y_{it-2}}{\ddot{\mu}_{it} \ddot{\mu}_{it-2}}} - 1 \tag{C.3.4} \]

C.4 Asymptotic equivalence of Poisson fixed effects and our alternative estimator when θ = 0, γ = 0

When (3.2.9) and (3.2.10) hold, so that (3.2.11) and (3.2.12) hold with $\theta = 0$ and $\gamma = 0$, $\hat{\theta} \xrightarrow{p} 0$ and $\hat{\gamma} \xrightarrow{p} 0$ independently of whether (3.2.15) and (3.2.16) hold. Therefore:

\[ \hat{D}_i \xrightarrow{p} h_1(x_i, \eta_p) \Big( \sum_{t=1}^T \mu_{it} \Big) \Big[ \frac{\partial p_{it}}{\partial \beta} \Big]_{t=1,\dots,T} \]
\[ \hat{\Sigma}_i \xrightarrow{p} [1(t = s) - p_{it}]_{t=1,\dots,T}^{s=1,\dots,T} \big[ 1[t = s] h_1(x_i, \eta_p) \mu_{it} + h_2(x_i, \eta_p) \mu_{it} \mu_{is} \big]_{t=1,\dots,T}^{s=1,\dots,T} \big( [1(t = s) - p_{it}]_{t=1,\dots,T}^{s=1,\dots,T} \big)' \]

where $\eta_p = \operatorname{plim}(\hat{\eta})$. The derivation of Appendix C.2, replacing $E(c_i | x_i)$ by $h_1(x_i, \eta_p)$ and $Var(c_i | x_i)$ by $h_2(x_i, \eta_p)$, shows that:

\[ \operatorname{plim}(\hat{D}_i)' \operatorname{plim}(\hat{\Sigma}_i)^{-} = \Big( \frac{\partial p_i(\beta_0)}{\partial \beta} \Big)' W_i(\beta_0)^{-1} = D_i' \Sigma_i^{-} \]

Therefore when (3.2.9) and (3.2.10) hold, $\hat{\beta}_{PFE}$ and $\hat{\beta}_{alt}$ are asymptotically equivalent.

BIBLIOGRAPHY

Ahn, S. C. and Schmidt, P. (1995). Efficient estimation of models for dynamic panel data. Journal of Econometrics, 68(1):5–27.
Alvarez, J. and Arellano, M. (2003). The time series and cross-section asymptotics of dynamic panel data estimators. Econometrica, 71(4):1121–1159.

Alvarez, J. and Arellano, M. (2004). Robust likelihood estimation of dynamic panel data models. CEMFI Working Paper 0421.

Anderson, T. W. and Hsiao, C. (1981). Estimation of dynamic models with error components. Journal of the American Statistical Association, 76(375):598–606.

Andrabi, T., Das, J., Ijaz Khwaja, A., and Zajonc, T. (2011). Do value-added estimates add value? Accounting for learning dynamics. American Economic Journal: Applied Economics, 3(3):29–54.

Arellano, M. (2003). Modelling optimal instrumental variables for dynamic panel data models. CEMFI Working Paper.

Arellano, M. and Bond, S. (1991). Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. The Review of Economic Studies, 58(2):277–297.

Arellano, M. and Bover, O. (1995). Another look at the instrumental variable estimation of error-components models. Journal of Econometrics, 68(1):29–51.

Balasubramanian, N. and Sivadasan, J. (2010). What happens when firms patent? New evidence from U.S. economic census data. Review of Economics and Statistics, 93(1):126–146.

Baltagi, B. H., Fingleton, B., and Pirotte, A. (2014). Estimating and forecasting with a dynamic spatial panel data model. Oxford Bulletin of Economics and Statistics, 76(1):112–138.

Bester, A. C., Conley, T. G., and Hansen, C. B. (2011a). Inference with dependent data using cluster covariance estimators. Journal of Econometrics, 165(2):137–151.

Bester, A. C., Conley, T. G., Hansen, C. B., and Vogelsang, T. J. (2011b). Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators. Working Paper.

Blundell, R. and Bond, S. (1998). Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics, 87(1):115–143.

Blundell, R., Griffith, R., and Van Reenen, J. (1995).
Dynamic count data models of technological innovation. The Economic Journal, 105(429):333–344. Blundell, R., Griffith, R., and Windmeijer, F. (2002). Individual effects and dynamics in count data models. Journal of Econometrics, 108(1):113–131. Bond, S. R. (2002). Dynamic panel data models: a guide to micro data methods and practice. Portuguese Economic Journal, 1(2):141–162. Browning, M., Ejraes, M., and Alvarez, J. (2010). Modelling income processes with lots of heterogeneity. The Review of Economic Studies, 77(4):1353–1381. 92 Chamberlain, G. (1982). Multivariate regression models for panel data. Journal of Econometrics, 18(1):5– 46. Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334. Chamberlain, G. (1992a). Comment: Sequential moment restrictions in panel data. Journal of Business & Economic Statistics, 10(1):20–26. Chamberlain, G. (1992b). Efficiency bounds for semiparametric regression. Econometrica, 60(3):567–596. Cizek, P., Jacobs, J. P., Ligthart, J. E., and Vrijburg, H. (2011). GMM estimation of fixed effects dynamic panel data models with spatial lag and spatial errors. Discussion Paper 2011-134, Tilburg University, Center for Economic Research. Clerides, S. K., Lach, S., and Tybout, J. R. (1998). Is learning by exporting important? micro-dynamic evidence from colombia, mexico, and morocco. The Quarterly Journal of Economics, 113(3):903–947. Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1):1– 45. de Brauw, A. and Giles, J. (2008). Migrant labor markets and the welfare of rural households in the developing world: Evidence from china. 2008 Annual Meeting, July 27-29, 2008, Orlando, Florida 6085, American Agricultural Economics Association. Donald, S. G., Imbens, G. W., and Newey, W. K. (2009). Choosing instrumental variables in conditional moment restriction models. Journal of Econometrics, 152(1):28–36. Elhorst, P. J. 
(2005). Unconditional maximum likelihood estimation of linear and log-linear dynamic models for spatial panels. Geographical Analysis, 37(1):85–106. Hahn, J. (1997). Efficient estimation of panel data models with sequential moment restrictions. Journal of Econometrics, 79(1):1–21. Hausman, J., Hall, B. H., and Griliches, Z. (1984). Econometric models for count data with an application to the patents-r & d relationship. Econometrica, 52(4):909–938. Hsiao, C., Pesaran, H. M., and Tahmiscioglu, K. A. (2002). Maximum likelihood estimation of fixed effects dynamic panel data models covering short time periods. Journal of Econometrics, 109(1):107–150. Jenish, N. and Prucha, I. R. (2009). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics, 150(1):86–98. Jenish, N. and Prucha, I. R. (2012). On spatial processes and asymptotic inference under near-epoch dependence. Journal of Econometrics, 170(1):178–190. Kim, M. S. and Sun, Y. (2011). Spatial heteroskedasticity and autocorrelation consistent estimation of covariance matrix. Journal of Econometrics, 160(2):349–371. Kitazawa, Y. (2007). Some additional moment conditions for a dynamic count panel data model. Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46(1):69–85. 93 Mutl, J. (2006). Dynamic panel data models with spatially correlated disturbances. University of Maryland Theses and Dissertations. Newey, W. K. and McFadden, D. (1994). Chapter 36 large sample estimation and hypothesis testing. In Robert F. Engle and Daniel L. McFadden, editor, Handbook of Econometrics, volume Volume 4, pages 2111–2245. Elsevier. Newey, W. K. and Windmeijer, F. (2009). Generalized method of moments with many weak moment conditions. Econometrica, 77(3):687–719. Penrose, R. (1955). A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51(03):406–413. Su, L. and Yang, Z. (2013). 
QML estimation of dynamic panel data models with spatial errors. Research Collection School of Economics (Open Access). Todd, P. E. and Wolpin, K. I. (2003). On the specification and estimation of the production function for cognitive achievement. The Economic Journal, 113(485):F3–F33. Topalova, P. and Khandelwal, A. (2010). Trade liberalization and firm productivity: The case of india. Review of Economics and Statistics, 93(3):995–1009. White, H. (2001). Asymptotic theory for econometricians. Academic Press, San Diego. Windmeijer, F. (2000). Moment conditions for fixed effects count data models with endogenous regressors. Economics Letters, 68(1):21–24. Windmeijer, F. (2005). A finite sample correction for the variance of linear efficient two-step GMM estimators. Journal of Econometrics, 126(1):25–51. Windmeijer, F. (2008). GMM for panel data count models. In Matyas, L. and Sevestre, P., editors, The Econometrics of Panel Data, number 46 in Advanced Studies in Theoretical and Applied Econometrics, pages 603–624. Springer Berlin Heidelberg. Wooldridge, J. M. (1997). Multiplicative panel data models without the strict exogeneity assumption. Econometric Theory, 13(5):667–678. Wooldridge, J. M. (1999). Distribution-free estimation of some nonlinear panel data models. Journal of Econometrics, 90(1):77–97. Wooldridge, J. M. (2005). Simple solutions to the initial conditions problem in dynamic, nonlinear panel data models with unobserved heterogeneity. Journal of Applied Econometrics, 20(1):39–54. Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. The MIT Press, second edition edition. 94