LINEAR AND NONLINEAR ESTIMATION WITH SPATIAL DATA

By

Cuicui Lu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics – Doctor of Philosophy

2013

ABSTRACT

LINEAR AND NONLINEAR ESTIMATION WITH SPATIAL DATA

By Cuicui Lu

In some economic situations, observations are cross-sectionally correlated. One example of cross-sectional correlation is spatial correlation, in which the correlation comes from the spatial closeness of different individuals. Spillover effects, externalities, network effects, and so on are common causes of spatial correlation. For example, a supply shock in one region will result in production shocks in nearby regions. This type of correlation reflects the correlations among individuals' unobservables.

In Chapter 1, I study a linear regression model with a spatially correlated error term. Most of the current econometrics literature assumes cluster sampling (independence between different clusters) in the population; however, this assumption is easily violated. I study the case in which spatial correlation exists between every pair of observations, without assuming independent clusters. Generalized least squares (GLS) can be applied in the cross-sectional dimension, but it is hard to account for all pairwise correlations in a large sample of spatial data, because computing the huge error covariance matrix generally requires large amounts of computer memory. Instead, I use a pseudo generalized least squares (PGLS) approach: a GLS procedure that uses a "tapered" error covariance matrix. Data can be divided into groups according to natural geographic areas; only correlations within groups are accounted for, while correlations between groups are ignored. Since within-group correlations account for most of the correlation among observations, the resulting PGLS estimator loses little efficiency compared to GLS. The PGLS estimator is consistent, asymptotically normal, and computationally easier than GLS. A spatial heteroskedasticity and autocorrelation consistent (HAC) covariance estimator for PGLS, robust to both heteroskedasticity and spatial correlation, is provided. Monte Carlo simulations show that PGLS becomes more efficient than ordinary least squares (OLS) as spatial correlation increases.

Chapter 2 studies nonlinear estimation with spatial data. Generalized estimating equations (GEE) are applied to cross-sectional data with spatial correlation in nonlinear models. I use a partial quasi-maximum likelihood estimator (PQMLE) in the first step and the GEE approach in the second step. Under regularity conditions and assumptions, the asymptotic distribution of the two-step estimator is derived in the framework of M-estimation. I use a probit model for binary data with a latent spatial error and a Poisson model for count data with a multiplicative spatial error to demonstrate the GEE procedures. As the spatial correlation in the underlying error term increases, the correlation in the dependent variable also increases. Monte Carlo simulations compare the efficiency of the PQMLE and GEE. The results show that correctly modeling the structure of the working correlation matrix is important in nonlinear models, in contrast to the linear model. In addition, as spatial correlation increases, more efficient estimation can be obtained by the GEE approach.

Chapter 3 studies conditions for the numerical equality of the OLS, GLS, and Amemiya-Cragg estimators.
Conditions under which the ordinary least squares (OLS) and generalized least squares (GLS) estimators are equal are well known. This chapter extends these results in two ways. First, it gives conditions under which GLS based on one assumed error variance matrix equals GLS based on a different assumed variance matrix. Second, it gives conditions under which GLS equals the GMM estimator of Amemiya (1983) and Cragg (1983).

To my mother and father

ACKNOWLEDGMENTS

I was born in a beautiful town on the east coast of China. As I grew up, I witnessed the dramatic improvement in the lives of the people of this town, as well as the rapid growth of the Chinese economy. Thus I became very interested in the comparatively new "market economy" in China. I realized in high school that my dream was to become an economist, and after college I came to Michigan State University for my graduate study. Behind me the whole time were my kind and diligent parents, who show their deep love for me and have faith in me as always. I am very grateful to my parents for supporting my education both financially and emotionally, especially during graduate school.

At Michigan State University, I have had the great fortune to be taught by Jeffrey Wooldridge, Peter Schmidt, Timothy Vogelsang, and Young Chae Lim. First and foremost, I thank my adviser, Jeffrey Wooldridge, for years of professional guidance. He gave me numerous suggestions and comments on my research and taught me how to do independent research. Peter Schmidt taught me how to do rigorous proofs. I learned very interesting time series material from Tim Vogelsang, and I gained a deeper understanding of spatial statistics from Young Chae Lim. I would like to thank all of them for guiding me during my doctoral study. I thank the economics department for giving me the opportunity to be a teaching assistant. I also thank Jeff, Tim, and the University for providing conference funding. My fellow classmates and the friends I have made are treasures to me. I appreciate their help in both research and life during my study. I am glad I am able to explore econometrics, which is an exciting area worthy of effort. I am on my way to my dream.

TABLE OF CONTENTS

LIST OF TABLES

Chapter 1 Pseudo Generalized Least Squares Regression Estimation with Spatial Data
1.1 Introduction
1.2 A Linear Regression Model
1.2.1 Spatial Error Model
1.2.2 Positive-Definiteness of Spatial Covariance Matrix
1.3 Estimation
1.3.1 Ordinary Least Squares and Generalized Least Squares
1.3.2 Pseudo Generalized Least Squares
1.3.3 Discussion of the Strict Exogeneity Condition
1.4 Consistent Variance-Covariance Estimators
1.4.1 Spatial HAC Estimator
1.4.2 Estimation of the Spatial Correlation Parameter
1.5 Alternative Estimation Approach: Quasi-Maximum Likelihood Estimator
1.6 A Monte Carlo Simulation Study
1.6.1 Data Generating Process
1.6.1.1 Case I
1.6.1.2 Case II
1.6.1.3 Case III
1.6.1.4 Robust Standard Errors
1.6.2 Monte Carlo Simulation Results
1.7 Conclusions

Chapter 2 Estimation of Nonlinear Models in a Quasi-Maximum Likelihood Framework with Spatial Data
2.1 Introduction
2.2 M-estimation
2.2.1 M-estimation
2.2.2 Two-step Estimation
2.3 Two-step Estimator: QMLE and GEE
2.4 Asymptotic Distribution for PQMLE and GEE
2.5 Variance-Covariance Matrix Estimator
2.5.1 Parametric Variance-Covariance Estimator
2.5.2 Nonparametric Variance-Covariance Estimator
2.6 Two Examples: Spatial Probit Model and Poisson Regression Model
2.6.1 Example 1. A Probit Model with Spatial Correlation in the Latent Error
2.6.2 Example 2. A Poisson Model with a Multiplicative Spatial Error
2.7 Monte Carlo Simulations
2.7.1 Sampling Space
2.7.2 Spatial Probit Data
2.7.3 Spatial Count Data
2.8 Conclusions

Chapter 3 Conditions for the Numerical Equality of the OLS, GLS and Amemiya-Cragg Estimators
3.1 Introduction
3.2 Equivalence of OLS and GLS
3.3 Equivalence of Two Different GLS Estimators
3.4 Equivalence of GLS and the Amemiya-Cragg Estimator

Chapter 4 Proofs of Propositions and Theorems
4.1 Proofs of Propositions in Chapter 1
4.2 Proofs of Propositions in Chapter 2
4.3 Proofs of Theorems in Chapter 3

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1 Case I: Mean Parameters and Standard Deviation
Table 1.2 Case I: Robust Standard Errors
Table 1.3 Case I: Estimated Error Variance Parameters
Table 1.4 Case II: Mean Parameters and Standard Deviation
Table 1.5 Case II: Robust Standard Errors
Table 1.6 Case II: Estimated Error Variance Parameters
Table 1.7 Case III: Mean Parameters and Standard Deviation
Table 1.8 Case III: Estimated Error Variance Parameters
Table 2.1 Binary Data N=400, L=4
Table 2.2 Binary Data N=400, L=16
Table 2.3 Correlations in Simulated Count Data
Table 2.4 Count Data N=1600, L=4 and 16

Chapter 1 Pseudo Generalized Least Squares Regression Estimation with Spatial Data

1.1 Introduction

In some economic situations, observations are cross-sectionally correlated. One instance of cross-sectional correlation is spatial correlation, in which the correlations come from the spatial closeness of different individuals. Spillover effects, externalities, network effects, and so on are common causes of spatial correlation. For example, a supply shock in one region will result in production shocks in nearby regions. This type of correlation reflects the correlations among individuals' unobservables. Similar to time series correlation, which depends on the "distance" in time (the time elapsed between two time points), spatial correlation depends on the "distance" in space. In economic studies, spatial data are commonly observed in the two-dimensional Euclidean space $\mathbb{R}^2$; each observation resides at a specific location on the Earth's surface. We would expect that the larger the distance between two individuals, the smaller their correlation. One legitimate distance measure is physical distance, but in economics a distance measure has a wider meaning. Conley (1999) uses a metric of "economic distance" to measure cross-sectional dependence; for example, transportation costs and whether two countries belong to a free trade area are economic distances. Thus spatial correlation based on spatial distance is more complicated than time series dependence. Researchers emphasize the importance of correct distance measurement, and measurement errors are often considered in the estimation, for example in Conley (1999) and Kelejian and Prucha (2007). Here we do not discuss the measurement error problem, and we assume that distances are correctly measured.

Clustered data have a structure very similar to spatial data. The difference is that data in one cluster are correlated through a common cluster effect, and the correlations do not depend on distances. Due to the complicated correlation structure among observations, accounting for all pairwise correlations in the estimation with spatial data is very difficult when the sample size is large: with a sample size of 100, there are already 4950 pairwise correlations to account for. One can still ignore the pairwise correlations in estimation and obtain consistent estimators, but estimation efficiency can be improved by using more information. Spatial data are commonly collected from different geographic areas, such as unemployment rates in different states, foreign investment in different cities of a developing country, or trade volumes between different US states and Canadian provinces. Thus there is usually a natural division of the data into groups. Since observations far away from each other have small or no correlations, we can account only for the correlations within the same groups; by doing so, we actually account for most of the total correlation. Therefore, in this paper, I propose a pseudo generalized least squares estimator (PGLS).
It can account for most correlations and is more efficient than the usual OLS estimator, while the computational burden is reduced by dividing the data into groups and using only the information within groups.

In spatial statistics, "increasing domain" and "fixed domain" asymptotics are commonly used (Cressie 1993). Under increasing domain asymptotics, the sampling space increases without bound, while the minimum distance between locations is bounded below by a positive constant. Under fixed domain asymptotics, the sampling space is fixed and bounded, and the sampling locations become increasingly dense within this region. Zhang (2004) shows that some parameters cannot be consistently estimated by maximum likelihood estimation (MLE) under fixed domain asymptotics. Considering the properties of economic data, which can be obtained by sampling in a large Euclidean space, we focus on increasing domain asymptotics in this paper. That is, given that the number of observations in each group is fixed, the number of groups increases with the sample size as the sampling space gets larger.

In this chapter, Section 1.2 presents a linear regression model with possible spatial correlation in the error term. Section 1.3 presents the OLS, GLS and PGLS estimation methods. Section 1.4 discusses how to estimate the spatial correlation parameter and provides consistent covariance estimators. Section 1.5 explores a quasi-MLE method for estimating the linear regression model. In Section 1.6, Monte Carlo simulation results show the advantages of the PGLS estimation procedure. Section 1.7 concludes.

1.2 A Linear Regression Model

Let $S$ be the space in which the population resides, and let $s_i$, $i = 1, 2, \ldots$ represent a location in $S$. Let $d_{ij}$ be the distance between locations $s_i \in S$ and $s_j \in S$. The space can be one-dimensional (like time series), two-dimensional (a Euclidean plane), or multidimensional. Let $(\mathbf{x}_i, y_i)$ denote the data point sampled at location $s_i$, and let $u_i$ denote the underlying unobservable at $s_i$. Consider a linear regression model:

$y_i = \mathbf{x}_i \boldsymbol{\beta} + u_i, \quad i = 1, 2, \ldots, N,$  (1.1)

where $\mathbf{x}_i$ is a $1 \times K$ vector of regressors with first element $x_{i1} = 1$, and $\boldsymbol{\beta} \equiv (\beta_1, \beta_2, \ldots, \beta_K)'$ is a $K \times 1$ unknown parameter vector. In matrix form, the above equation reads

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u},$  (1.2)

where $\mathbf{y} = (y_1, y_2, \ldots, y_N)'$, $\mathbf{X}$ is an $N \times K$ matrix with $i$th row equal to $\mathbf{x}_i$, and $\mathbf{u} = (u_1, u_2, \ldots, u_N)'$.

Generally we do not know the specific form of the spatial correlation. In this paper, we focus on spatial correlation in the error term. The error term exhibits spatial correlation in the sense that

$\mathrm{Var}(\mathbf{u}|\mathbf{X}, D) = \boldsymbol{\Omega}(D, \lambda),$  (1.3)

where $D = \{d_{ij}\}$, $i, j = 1, 2, \ldots, N$, contains all pairwise distances between observations, and $\lambda$ is a vector of variance-covariance parameters. That is, the spatial correlations between different observations are commonly assumed to depend on the pairwise distances and some other fixed parameters. Note that under random sampling, $\boldsymbol{\Omega}$ would be a diagonal matrix, with all off-diagonal elements equal to zero. For simplicity, in what follows I often drop the conditioning on the explanatory variables and location indicators.

1.2.1 Spatial Error Model

The spatial error model is one of the common models based on underlying random fields, and it has several variants. Instead of writing the specific error covariance matrix as a function of the distance matrix $D$, we can use a spatial weight matrix $\mathbf{W}$, defined as a function of $D$. Let $w_{ij}$ be the $ij$th element of $\mathbf{W}$.
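As a concrete illustration of how a weight matrix can be built from pairwise distances, here is a minimal sketch (not part of the original text; the function name and the row-normalized inverse-distance design are illustrative assumptions, one of several common conventions):

```python
import numpy as np

def inverse_distance_weights(D, cutoff=None):
    """Build a row-normalized inverse-distance spatial weight matrix W from a
    symmetric pairwise-distance matrix D with zeros on the diagonal."""
    N = D.shape[0]
    W = np.zeros((N, N))
    off_diag = ~np.eye(N, dtype=bool)
    W[off_diag] = 1.0 / D[off_diag]          # w_ij = 1/d_ij for i != j
    if cutoff is not None:
        W[D > cutoff] = 0.0                  # drop pairs beyond the cutoff distance
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0          # guard against empty rows
    return W / row_sums                      # each row sums to one
```

Row normalization is only one choice; contiguity indicators or unnormalized inverse distances would fit the same framework.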
Let $\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)'$ be a vector of spatial white noise with mean zero and constant variance. The spatial autoregressive (SAR) error model is the most widely used in spatial econometrics (Anselin, 1988; Anselin et al., 2004). Its error term is modeled as a simultaneous autoregressive random field,

$u_i = \rho \sum_{j=1}^{N} w_{ij} u_j + \varepsilon_i.$  (1.4)

In matrix form, the above equation reads

$\mathbf{u} = \rho \mathbf{W}\mathbf{u} + \boldsymbol{\varepsilon},$  (1.5)

$\boldsymbol{\Omega}(\lambda) = \sigma^2 (\mathbf{I} - \rho\mathbf{W})^{-1} (\mathbf{I} - \rho\mathbf{W}')^{-1}.$  (1.6)

In the spatial moving average (SMA) error model, the error term is modeled as a moving average random field,

$u_i = \lambda \sum_{j=1}^{N} w_{ij} \varepsilon_j.$  (1.7)

Equivalently,

$\mathbf{u} = \lambda \mathbf{W}\boldsymbol{\varepsilon},$  (1.8)

$\boldsymbol{\Omega}(\lambda) = \lambda^2 \sigma^2 \mathbf{W}\mathbf{W}'.$  (1.9)

And in the spatial autoregressive moving average (SARMA) error model,

$u_i = \rho \sum_{j=1}^{N} w_{ij}^{(1)} u_j + \lambda \sum_{j=1}^{N} w_{ij}^{(2)} \varepsilon_j,$  (1.10)

which can be written in matrix notation as

$\mathbf{u} = \rho \mathbf{W}^{(1)}\mathbf{u} + \lambda \mathbf{W}^{(2)}\boldsymbol{\varepsilon},$  (1.11)

$\boldsymbol{\Omega}(\lambda) = \lambda^2 \sigma^2 (\mathbf{I} - \rho\mathbf{W}^{(1)})^{-1} \mathbf{W}^{(2)}\mathbf{W}^{(2)\prime} (\mathbf{I} - \rho\mathbf{W}^{(1)\prime})^{-1}.$  (1.12)

As an alternative to starting with specific models of spatial correlation, we can directly specify the variances and covariances of the error term. For example, the variance can be a constant, and the covariance can be a function $c(\cdot)$ of the variance, the distance between two locations, and an unknown parameter to be estimated. In matrix form,

$\boldsymbol{\Omega}(\lambda)_{ii} = \sigma_i^2, \qquad \boldsymbol{\Omega}(\lambda)_{ij} = \sigma_i^2 \, c(d_{ij}, \rho).$

1.2.2 Positive-Definiteness of Spatial Covariance Matrix

The error covariance matrix $\boldsymbol{\Omega}$ must be positive definite, so for each specified function $c(\cdot)$ one needs to check whether $\boldsymbol{\Omega}$ is positive definite. Let $h \in \mathbb{R}$ temporarily denote the distance, to distinguish it from the derivative sign. According to Christakos (1984), $\boldsymbol{\Omega}$ is positive definite as long as the function $c(\cdot)$ satisfies the following conditions:

i. At $h = 0$, $dc(h)/dh < 0$.
ii. $\lim_{h \to \infty} c(h) = 0$.
iii. $d^2 c(h)/dh^2 \geq 0$.

Derivatives at $h = 0$ are one-sided, taken from the right.

1.3 Estimation

1.3.1 Ordinary Least Squares and Generalized Least Squares

First, consider two estimators. Conditioning on the explanatory variables and location indicators, the OLS estimator is

$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y},$  (1.13)

and the GLS estimator is

$\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1} (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}).$  (1.14)

Suppose $\boldsymbol{\Omega}$ depends on fixed pairwise distances and a parameter vector $\lambda$, so that we can write $\boldsymbol{\Omega}$ as a function of $\lambda$, $\boldsymbol{\Omega}(\lambda)$. Once we find an estimator for $\lambda$, say $\hat{\lambda}$, we can obtain the estimated covariance matrix $\hat{\boldsymbol{\Omega}} \equiv \boldsymbol{\Omega}(\hat{\lambda})$. The feasible GLS (FGLS) estimator is

$\hat{\boldsymbol{\beta}}_{FGLS} = (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1} (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}).$  (1.15)

The following regularity conditions are assumed for deriving the asymptotics.

Assumption 1. The pairwise distances $d_{ij} < \infty$ between locations $s_i$ and $s_j$ are bounded below by $d_0 > 0$, $i = 1, 2, \ldots$

Assumption 2. $\{(\mathbf{x}_i, u_i)\}$ is a mixing sequence on the sampling space, with mixing coefficient $\alpha$ of size $-r/(2(r-1))$, $r \geq 2$, or $\phi$ of size $-r/(r-2)$, $r > 2$.

Assumption 3. (a) $E(\mathbf{x}_i' u_i) = 0$, $i = 1, 2, \ldots$; (b) $E\|\mathbf{x}_i' u_i\|^r < \Delta < \infty$ for all $i$.

Assumption 4. (a) $\mathbf{A}_1^N \equiv E\big(N^{-1}\sum_{i=1}^{N} \mathbf{x}_i'\mathbf{x}_i\big)$ is uniformly positive definite and has full rank $K$; (b) $\mathrm{plim}\, N^{-1}\sum_{i=1}^{N} \mathbf{x}_i'\mathbf{x}_i = \mathbf{A}_1$ as $N \to \infty$.

Assumption 5. (a) $\mathbf{B}_1^N \equiv \mathrm{Var}\big(N^{-1/2}\sum_{i=1}^{N} \mathbf{x}_i' u_i\big) = E\big(N^{-1}\mathbf{X}'\boldsymbol{\Omega}_N\mathbf{X}\big)$ is uniformly positive definite, where $\boldsymbol{\Omega}_N$ is a positive-definite symmetric covariance matrix; (b) $\mathrm{plim}\,\mathbf{B}_1^N = \mathbf{B}_1$ as $N \to \infty$.

Assumption 6. (a) $E(\mathbf{u}|\mathbf{X}) = 0$. This is the so-called strict exogeneity assumption: every $u_i$ is uncorrelated with any function of $\mathbf{X}$. This condition implies Assumption 3(a) and $E\big(N^{-1}\mathbf{X}'\boldsymbol{\Omega}_N^{-1}\mathbf{u}\big) = 0$.
Assumption 7. (a) $\mathbf{B}_2^N \equiv \mathrm{Var}\big(N^{-1/2}\mathbf{X}'\boldsymbol{\Omega}_N^{-1}\mathbf{u}\big) = E\big(N^{-1}\mathbf{X}'\boldsymbol{\Omega}_N^{-1}\mathbf{u}\mathbf{u}'\boldsymbol{\Omega}_N^{-1}\mathbf{X}\big) = E\big(N^{-1}\mathbf{X}'\boldsymbol{\Omega}_N^{-1}\mathbf{X}\big)$; (b) $\mathbf{B}_2^N$ is uniformly positive definite and has full rank $K$; (c) $\mathrm{plim}\,\mathbf{B}_2^N = \mathbf{B}_2$ as $N \to \infty$.

Proposition 1. Under Assumptions 1, 2, 3, 4, and 5, the OLS estimator $\hat{\boldsymbol{\beta}}_{OLS}$ is consistent and $\sqrt{N}(\hat{\boldsymbol{\beta}}_{OLS} - \boldsymbol{\beta}) \to_d N(0, \mathbf{A}_1^{-1}\mathbf{B}_1\mathbf{A}_1^{-1})$. See proof in Chapter 4.

Proposition 2. Under Assumptions 1, 2, 5, 6, and 7, the FGLS estimator $\hat{\boldsymbol{\beta}}_{FGLS}$ is consistent and $\sqrt{N}(\hat{\boldsymbol{\beta}}_{FGLS} - \boldsymbol{\beta}) \to_d N(0, \mathbf{B}_2^{-1})$. See proof in Chapter 4.

OLS is easier to compute than GLS, but it is less efficient. GLS is efficient but can be computationally hard when the sample size is very large, since it involves inverting a large covariance matrix. The following section demonstrates another way to obtain an efficient estimator.

1.3.2 Pseudo Generalized Least Squares

A lot of spatial data are collected together with geographic location information, such as firms in a county or schools in a school district. We can treat the firms in a county or the schools in a school district as a group when we work with empirical data. Spatial correlation can generally exist between any pair of individuals, whether or not they are in the same group. The correlations within a group are much easier to deal with than those across groups, and they should represent most of the correlation among individuals in the sample. PGLS is based only on within-group correlations: the estimation is still weighted by the error covariance matrix, but the matrix is tapered, in the sense that correlations are set to zero whenever two individuals are not in the same group, regardless of whether their true values are zero. The asymptotics rely on the mechanism that the size of each group is fixed while the number of groups increases with the total number of observations.

If we have a large sample, it is hard to account for all pairwise correlations. The reason is that there is insufficient information to estimate the $N \times N$ covariance matrix directly from the data; even asymptotics do not help, since the number of covariances grows with $N^2$, whereas the sample size only grows with $N$ (Anselin 1999). But with fixed group size, it is possible to account for the pairwise correlations within groups. Therefore, it is more realistic to use only the information within a group while ignoring cross-group correlations. Since within-group correlations account for most of the correlation of the observations in a group, it is possible to obtain an estimator that is quite close to the FGLS estimator; this is the pseudo generalized least squares (PGLS) estimator. For panel data, a group is a cross-sectional unit, and within a group observations may be serially correlated. Panel data analysis assumes random sampling in the cross-sectional dimension; in this paper, however, we discuss a situation in which there is only one pure cross section. We estimate the parameters as if there were no cross-group correlations, even though observations in different groups are correlated. This is not the most efficient estimator, but the procedure yields a consistent and almost efficient estimator.

For notational convenience, we write the linear regression model for a group in the population as

$\mathbf{y}_g = \mathbf{X}_g\boldsymbol{\beta} + \mathbf{u}_g,$  (1.16)

where $g$ denotes the $g$th group, $g = 1, 2, \ldots$. The number of observations in group $g$ is $L_g$; $\mathbf{y}_g$ and $\mathbf{u}_g$ are $L_g \times 1$ vectors, and $\mathbf{X}_g$ is $L_g \times K$. Now we can state the assumptions using the group notation.

Assumption P1. (a) The number of observations $L_g$ in each group is fixed, and $L_g/G$ is a small number.
For simplicity, assume there is the same number of observations in each group, so the group size is fixed at $L = N/G$. (b) The number of groups and the sample size both increase as the sampling domain increases.

Assumption P2. Let $d_{gh} < \infty$ denote the pairwise distance between groups $g$ and $h$. $d_{gh}$ is bounded below by $d^* > 0$, and $d_{gh} \in D^*$, where $D^*$ is the space containing all pairwise group distances. [Footnote: One way to define the groupwise distance is to use the smallest distance between two observations belonging to the two different groups. Another is to use the average of all pairwise distances between observations belonging to the different groups. Which definition to use could depend on empirical needs; the importance of how researchers define the groupwise distance is unknown, and we leave this problem for future discussion.]

Assumption P3 (Assumption 2). $\{(\mathbf{x}_i, u_i)\}$ is a mixing sequence on the sampling space, with mixing coefficient $\alpha$ of size $-r/(2(r-1))$, $r \geq 2$, or $\phi$ of size $-r/(r-2)$, $r > 2$.

Assumption P4. Let $\boldsymbol{\Lambda}_g$ be the $L \times L$ within-group covariance matrix for group $g$, and let $\boldsymbol{\Lambda}_N$ be the $N \times N$ matrix that contains only within-group variances and covariances; in other words, $\boldsymbol{\Lambda}_g$ is the $g$th diagonal block of $\boldsymbol{\Lambda}_N$. Let $\boldsymbol{\Omega}_N$ be the true covariance matrix of the sample; note that $\boldsymbol{\Lambda}_N$ is part of $\boldsymbol{\Omega}_N$. Define an $N \times N$ tapering matrix $\boldsymbol{\Gamma}$: if observations $i$ and $j$ are in the same group, the $ij$th entry $\Gamma(i,j)$ equals one; if they are in different groups, $\Gamma(i,j)$ equals zero. Thus $\boldsymbol{\Lambda}_N = \boldsymbol{\Omega}_N \cdot \boldsymbol{\Gamma}$ element by element.

Assumption P5. (a) $E(\mathbf{u}_g|\mathbf{X}_g) = 0$. This implies that for any positive definite $L \times L$ matrix $\boldsymbol{\Lambda}_g$, $E(\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g) = 0$ and $\mathrm{plim}\, G^{-1}\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g = 0$. Note that when the number of groups equals one, Assumption P5 becomes $E(\mathbf{u}|\mathbf{X}) = 0$, the strict exogeneity assumption for GLS. (b) $E\|\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g\|^r$ exists for all $g$.

Assumption P6. (a)

$\mathbf{S}_G \equiv \mathrm{Var}\Big(G^{-1/2}\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g\Big) = E\Big(G^{-1}\sum_{g=1}^{G}\sum_{h=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g\mathbf{u}_h'\boldsymbol{\Lambda}_h^{-1}\mathbf{X}_h\Big) = E\Big(G^{-1}\sum_{g=1}^{G}\sum_{h=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\boldsymbol{\Omega}_{gh}\boldsymbol{\Lambda}_h^{-1}\mathbf{X}_h\Big),$

where $\boldsymbol{\Omega}_{gh}$ is the $gh$th block of $\boldsymbol{\Omega}_N$; (b) $\mathrm{plim}\,\mathbf{S}_G = \mathbf{S}$ as $G \to \infty$.

Assumption P7. (a) $\mathbf{Q}_g \equiv \mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{X}_g$ is uniformly positive definite and has full rank $K$; (b) $\mathrm{plim}_{G\to\infty}\, G^{-1}\sum_{g=1}^{G}\mathbf{Q}_g = \mathbf{Q}$ and $\mathrm{rank}(\mathbf{Q}) = K$. This condition implies that $\mathrm{plim}\, G^{-1}\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{X}_g$ exists and has full rank.

The PGLS estimator is

$\hat{\boldsymbol{\beta}}_{PGLS} = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{y}_g\Big).$  (1.17)

If we can obtain a consistent estimator of the spatial correlation parameter $\lambda$, we can get a feasible PGLS estimator. Let $\hat{\boldsymbol{\Lambda}}_g = \boldsymbol{\Lambda}_g(\hat{\lambda})$; note that $\lambda$ should be the same as in GLS if the assumptions are true.

$\hat{\boldsymbol{\beta}}_{FPGLS} = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{y}_g\Big).$  (1.18)

Proposition 3. Under Assumptions P1-P7, $\hat{\boldsymbol{\beta}}_{FPGLS}$ is consistent and asymptotically normal: $\sqrt{G}(\hat{\boldsymbol{\beta}}_{FPGLS} - \boldsymbol{\beta}) \to_d N(0, \mathbf{Q}^{-1}\mathbf{S}\mathbf{Q}^{-1})$. See proof in Chapter 4.

Under Assumption P6, a variance-covariance estimator for FPGLS is

$\widehat{\mathrm{AVar}}(\hat{\boldsymbol{\beta}}_{FPGLS}) = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\sum_{h=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\boldsymbol{\Omega}_{gh}(\hat{\lambda})\hat{\boldsymbol{\Lambda}}_h^{-1}\mathbf{X}_h\Big)\Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}.$  (1.19)

In concise form, the above expression reads

$\widehat{\mathrm{AVar}}(\hat{\boldsymbol{\beta}}_{FPGLS}) = (\mathbf{X}'\hat{\boldsymbol{\Lambda}}^{-1}\mathbf{X})^{-1}(\mathbf{X}'\hat{\boldsymbol{\Lambda}}^{-1}\hat{\boldsymbol{\Omega}}\hat{\boldsymbol{\Lambda}}^{-1}\mathbf{X})(\mathbf{X}'\hat{\boldsymbol{\Lambda}}^{-1}\mathbf{X})^{-1}.$  (1.20)

Usually the specific form of the true covariance matrix $\boldsymbol{\Omega}$ is not known, so equation (1.19) cannot be computed.
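To make the computation in (1.17)-(1.18) concrete, here is a minimal sketch of feasible PGLS, assuming the data have already been split into groups and a first-stage $\hat{\lambda}$ has been used to build each working within-group covariance $\hat{\Lambda}_g$ (the function name and list-based interface are illustrative):

```python
import numpy as np

def feasible_pgls(X_groups, y_groups, Lambda_groups):
    """Feasible PGLS, eq. (1.18): weight within groups by the inverse of the
    estimated L x L covariance Lambda_g, ignoring cross-group correlation.
    X_groups, y_groups, Lambda_groups are lists indexed by group g = 1,...,G."""
    K = X_groups[0].shape[1]
    XtWX = np.zeros((K, K))
    XtWy = np.zeros(K)
    for Xg, yg, Lg in zip(X_groups, y_groups, Lambda_groups):
        Lg_inv = np.linalg.inv(Lg)       # cheap: Lg is only L x L
        XtWX += Xg.T @ Lg_inv @ Xg
        XtWy += Xg.T @ Lg_inv @ yg
    return np.linalg.solve(XtWX, XtWy)
```

Because each inversion involves only an $L \times L$ block, the cost is $G \cdot O(L^3)$ rather than the $O(N^3)$ needed to invert the full $N \times N$ matrix in GLS, which is the computational point of the tapering.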
1.3.3 Discussion of the Strict Exogeneity Condition

PGLS depends on Assumption P5(a), $E(\mathbf{u}_g|\mathbf{X}_g) = 0$. This assumption means that the error term is uncorrelated with any of the explanatory variables within the group, but it can be correlated with explanatory variables outside the group. This assumption is less likely to hold if we divide observations into arbitrary groups. The full strict exogeneity condition $E(\mathbf{u}|\mathbf{X}) = 0$, which involves the errors and all explanatory variables, is therefore the more natural requirement.

There are cases in which the strict exogeneity condition is violated. One case is the spatial lag model, in which the strict exogeneity assumption is necessarily false. The spatial lag model can be written as

$\mathbf{Y} = \mathbf{A}\mathbf{Y} + \mathbf{X}\mathbf{B} + \mathbf{u},$  (1.21)

where $\mathbf{u}$ and the explanatory variables must be correlated with each other. In this case even OLS is inconsistent. This model can be estimated in its reduced form by quasi-maximum likelihood; see, for example, Lee (2004). Another case arises when the explanatory variables are spatially correlated: if a shock to one region is correlated with explanatory variables in other regions, strict exogeneity can fail.

1.4 Consistent Variance-Covariance Estimators

1.4.1 Spatial HAC Estimator

Failure to account for spatial dependence may make standard errors computed by standard techniques inconsistent, so standard errors robust to spatial dependence are very important for hypothesis testing and statistical inference. In the time series literature, Newey and West (1994) use a Bartlett kernel to consistently estimate the covariance matrix of a time series process; this is the heteroskedasticity and autocorrelation consistent (HAC) variance-covariance estimator. Driscoll and Kraay (1998) provide a nonparametric covariance matrix estimation technique that yields standard error estimates robust to very general forms of spatial and temporal dependence as the time dimension becomes large. Conley (1999) uses a Bartlett window to estimate the variance matrix robust to cross-sectional correlation, so a rectangular window of correlations is used in that estimation. Kelejian and Prucha (2007) suggest a spatial HAC estimator. In this section, I suggest HAC variance-covariance estimators for PGLS and provide a proof that the HAC estimator for PGLS is consistent.

Define a kernel function $k(d_{ij})$ that depends on the pairwise distance $d_{ij}$. Let $d^*$ be a cutoff point such that if $d_{ij} < d^*$, $\mathrm{Cov}(u_i, u_j)$ is estimated and used; otherwise $\mathrm{Cov}(u_i, u_j)$ is treated as zero. $k(d_{ij})$ decreases as $d_{ij}$ increases. Researchers can use different kernel functions. In this paper I use a Bartlett kernel, but I do not constrain the correlations to a rectangular window as in Conley (1999); instead, the kernel function could be called a Bartlett circle. That is, taking each observation as the center of a circle with radius $d^*$, its covariances with all observations inside the circle are estimated.

The Bartlett circle kernel weighting function is

$k(d_{ij}) = \begin{cases} 1 - d_{ij}/d^* & d_{ij} \leq d^* \\ 0 & d_{ij} > d^* \end{cases}$  (1.22)

A consistent variance-covariance estimator for OLS that is robust to heteroskedasticity and spatial correlation is

$\widehat{\mathrm{Avar}}_{rob}(\hat{\boldsymbol{\beta}}_{OLS}) = \Big(\sum_{i=1}^{N}\mathbf{x}_i'\mathbf{x}_i\Big)^{-1}\Big(\sum_{i=1}^{N}\sum_{j=1}^{N}k(d_{ij})\hat{\mathbf{v}}_i\hat{\mathbf{v}}_j'\Big)\Big(\sum_{i=1}^{N}\mathbf{x}_i'\mathbf{x}_i\Big)^{-1},$  (1.23)

where $\hat{\mathbf{v}}_i = \mathbf{x}_i'\hat{u}_i$ and $\hat{u}_i$ is the OLS residual.
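A minimal sketch of the estimator in (1.23), assuming the full distance matrix fits in memory; the vectorized form uses the identity $\sum_i\sum_j k(d_{ij})\hat{\mathbf{v}}_i\hat{\mathbf{v}}_j' = \mathbf{V}'\mathbf{K}\mathbf{V}$, where $\mathbf{V}$ stacks the rows $\hat{\mathbf{v}}_i = \mathbf{x}_i'\hat{u}_i$ (the function name is illustrative):

```python
import numpy as np

def ols_spatial_hac(X, u_hat, D, d_star):
    """Spatial HAC variance for OLS, eq. (1.23), with the Bartlett circle
    kernel k(d) = max(0, 1 - d/d_star). X is N x K, u_hat holds the OLS
    residuals, and D is the N x N distance matrix (D[i, i] = 0, so each
    own product v_i v_i' receives full weight k(0) = 1)."""
    K_mat = np.clip(1.0 - D / d_star, 0.0, None)   # kernel weights k(d_ij)
    V = X * u_hat[:, None]                         # row i equals x_i * u_i
    meat = V.T @ K_mat @ V                         # sum_ij k(d_ij) v_i v_j'
    bread = np.linalg.inv(X.T @ X)
    return bread @ meat @ bread
```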
A consistent variance-covariance estimator for FGLS that is robust to misspecification of the variance-covariance structure is

$\widehat{\mathrm{Avar}}_{rob}(\hat{\boldsymbol{\beta}}_{FGLS}) = (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1}\,\mathbf{X}'[\mathbf{K}(D)\cdot\check{\mathbf{v}}\check{\mathbf{v}}']\mathbf{X}\,(\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1},$  (1.24)

where $\hat{\boldsymbol{\Omega}} \equiv \boldsymbol{\Omega}(\hat{\lambda})$, $\check{\mathbf{v}} = \boldsymbol{\Omega}(\hat{\lambda})^{-1}\check{\mathbf{u}}$, and $\check{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{FGLS}$. $\mathbf{K}(D)$ is an $N \times N$ kernel matrix, and $\mathbf{K}(D)\cdot\check{\mathbf{v}}\check{\mathbf{v}}'$ denotes the element-by-element product of $\mathbf{K}(D)$ and $\check{\mathbf{v}}\check{\mathbf{v}}'$. The $ij$th element of $\mathbf{K}(D)$ is $k(d_{ij})$, the same as given above, and $\check{\mathbf{v}}\check{\mathbf{v}}'$ is a matrix of products of pairwise (weighted) residuals. Because of possible misspecification of the variance-covariance structure $\boldsymbol{\Omega}$, the spatial correlation may not be fully accounted for, or may be treated incorrectly; some spatial correlation may therefore remain, and we use the FGLS residuals to get a robust estimator.

In the PGLS estimation, correlations among observations within the same group are used and those across groups are not. Intuitively, then, we want the HAC estimator to account only for the correlations across different groups, so the kernel weights should be put only on the correlations among observations in different groups.

Let $\tilde{\mathbf{v}}_g = \mathbf{X}_g'\boldsymbol{\Lambda}_g(\hat{\lambda})^{-1}\tilde{\mathbf{u}}_g$ and $\tilde{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{FPGLS}$. Define the kernel function for FPGLS:

$p(d_{gh}) = \begin{cases} 1 - d_{gh}/d^{**} & d_{gh} \leq d^{**} \\ 0 & d_{gh} > d^{**} \end{cases}$  (1.25)

where $d_{gh}$ is the distance between groups $g$ and $h$. Let $\mathbf{P}$ be the $G \times G$ matrix with $gh$th element $P_{gh} = p(d_{gh})$, $g, h = 1, 2, \ldots, G$. Let $\mathbf{O}$ be a square matrix of ones whose dimension equals $L = N/G$, the number of individuals within each group. Then the full $N \times N$ kernel matrix is $\mathbf{P} \otimes \mathbf{O}$. The HAC estimator robust to groupwise spatial correlation and misspecification is

$\widehat{\mathrm{Avar}}_{rob}(\hat{\boldsymbol{\beta}}_{FPGLS}) = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\sum_{h=1}^{G}p(d_{gh})\tilde{\mathbf{v}}_g\tilde{\mathbf{v}}_h'\Big)\Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}.$  (1.26)

Following the proof of Theorem 2 in Kelejian and Prucha (2007), we can complete the proof of consistency of this estimator. Kelejian and Prucha (2007) prove consistency of the spatial HAC estimator for the linear regression model in the OLS setting, which is the special case of PGLS with group size equal to one; the HAC estimator for PGLS thus extends theirs.

Proposition 4. $G\cdot\widehat{\mathrm{Avar}}_{rob}(\hat{\boldsymbol{\beta}}_{FPGLS}) - \mathbf{Q}^{-1}\mathbf{S}\mathbf{Q}^{-1} \to_p 0$. See proof in Chapter 4.

1.4.2 Estimation of the Spatial Correlation Parameter

A consistent estimator for $\rho$ can be obtained by a minimum distance estimator,

$\hat{\rho} = \arg\min_{\rho} \sum_{i=1}^{N}\sum_{j\neq i}[\hat{u}_i\hat{u}_j - \Omega_{ij}(d_{ij},\rho)]^2.$

A common structure for $\Omega_{ij}$ in spatial statistics is the exponential form, $\Omega_{ij} = \sigma^2\exp(-d_{ij}/\rho)$. An estimator for $\sigma^2$ is

$\hat{\sigma}^2 = N^{-1}\sum_{i=1}^{N}\hat{u}_i^2.$

1.5 Alternative Estimation Approach: Quasi-Maximum Likelihood Estimator

The quasi-maximum likelihood estimator (QMLE) using the estimated covariance matrix derived from the OLS residuals is the same as PGLS. Suppose the same assumptions hold as before: $E(\mathbf{y}_g|\mathbf{X}_g) = \mathbf{X}_g\boldsymbol{\beta}$ and $\mathrm{Var}(\mathbf{u}_g|\mathbf{X}_g) = \sigma^2\boldsymbol{\Lambda}_g(\rho)$. We can write the quasi-log-likelihood for group $g$ using a multivariate normal distribution as

$l_g(\boldsymbol{\beta},\rho,\sigma^2) = \log f_g(\mathbf{y}_g|\mathbf{X}_g;\boldsymbol{\beta},\lambda) = -\frac{L}{2}\log(2\pi) - \frac{L}{2}\log\sigma^2 - \frac{1}{2}\log|\boldsymbol{\Lambda}_g(\rho)| - \frac{1}{2\sigma^2}(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta}).$  (1.27)

The QMLE estimators $\hat{\boldsymbol{\beta}}_{QMLE}$, $\hat{\rho}_{QMLE}$ and $\hat{\sigma}^2_{QMLE}$ jointly solve

$\max_{\theta\in\Theta} L_G(\boldsymbol{\beta},\rho,\sigma^2) = \sum_{g=1}^{G}l_g(\boldsymbol{\beta},\lambda) = -\frac{GL}{2}\log(2\pi) - \frac{GL}{2}\log\sigma^2 - \frac{1}{2}\sum_{g=1}^{G}\log|\boldsymbol{\Lambda}_g(\rho)| - \frac{1}{2\sigma^2}\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta}).$  (1.28)

Since the first term of the log likelihood is a constant, we can drop it from the maximization problem. The log likelihood becomes

$L_G(\boldsymbol{\beta},\rho,\sigma^2) = -\frac{GL}{2}\log\sigma^2 - \frac{1}{2}\sum_{g=1}^{G}\log|\boldsymbol{\Lambda}_g(\rho)| - \frac{1}{2\sigma^2}\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta}).$  (1.29-1.30)
For a given $\rho$, maximizing the log likelihood gives

$\hat{\boldsymbol{\beta}}_{QMLE} = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g(\rho)^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g(\rho)^{-1}\mathbf{y}_g\Big)$  (1.31)

and

$\hat{\sigma}^2_{QMLE} = \frac{1}{N}\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE}).$  (1.32)

Plugging $\hat{\boldsymbol{\beta}}_{QMLE}$ and $\hat{\sigma}^2_{QMLE}$ into the log likelihood yields a concentrated likelihood. Maximizing the concentrated log likelihood $L(\rho)$ gives $\hat{\rho}_{QMLE}$, and in turn $\hat{\boldsymbol{\beta}}_{QMLE}$ and $\hat{\sigma}^2_{QMLE}$:

$L(\rho) = -\frac{1}{2}\log\Big[\frac{1}{N}\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})\Big] - \frac{1}{2}\sum_{g=1}^{G}\log|\boldsymbol{\Lambda}_g(\rho)| - \frac{1}{2}$
$= \frac{1}{2}[\log(N) - 1] - \frac{1}{2}\log\Big[\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})\Big] - \frac{1}{2}\sum_{g=1}^{G}\log|\boldsymbol{\Lambda}_g(\rho)|.$  (1.33)

Note that plugging $\hat{\rho}$ into $\hat{\boldsymbol{\beta}}_{QMLE}$ gives $\hat{\boldsymbol{\beta}}_{FPGLS}$.

We can also get a HAC estimator for $\hat{\boldsymbol{\beta}}_{QMLE}$. The Hessian for each group $g$ is $\mathbf{H}_g = \begin{pmatrix} \mathbf{H}_{g11} & \mathbf{H}_{g12} \\ \mathbf{H}_{g21} & \mathbf{H}_{g22} \end{pmatrix}$. Let $\mathbf{a}_g = -E(\mathbf{H}_g|\mathbf{X}_g, D_g)$; then

$\mathbf{a}_g = \begin{pmatrix} \mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}(\lambda)\mathbf{X}_g & 0 \\ 0 & \frac{1}{2}\nabla_\lambda\boldsymbol{\Lambda}_g(\lambda)'[\boldsymbol{\Lambda}_g^{-1}(\lambda)\otimes\boldsymbol{\Lambda}_g^{-1}(\lambda)]\nabla_\lambda\boldsymbol{\Lambda}_g(\lambda) \end{pmatrix}.$

Define $\mathbf{H} \equiv \frac{1}{G}\sum_{g=1}^{G}\mathbf{H}_g$ and $\mathbf{A} \equiv E[-\mathbf{H}] = \frac{1}{G}\sum_{g=1}^{G}E(-\mathbf{H}_g|\mathbf{X}_g, D_g) = \frac{1}{G}\sum_{g=1}^{G}\mathbf{a}_g$, with the estimator

$\hat{\mathbf{A}} \equiv \frac{1}{G}\sum_{g=1}^{G}\mathbf{a}_g(\hat{\boldsymbol{\beta}},\hat{\lambda}) = \frac{1}{G}\sum_{g=1}^{G}\begin{pmatrix} \mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}(\hat{\lambda})\mathbf{X}_g & 0 \\ 0 & \frac{1}{2}\nabla_\lambda\boldsymbol{\Lambda}_g(\hat{\lambda})'[\boldsymbol{\Lambda}_g^{-1}(\hat{\lambda})\otimes\boldsymbol{\Lambda}_g^{-1}(\hat{\lambda})]\nabla_\lambda\boldsymbol{\Lambda}_g(\hat{\lambda}) \end{pmatrix}.$

Note that $\hat{\mathbf{A}}$ does not involve $\boldsymbol{\beta}$. Let

$\mathbf{B}_G \equiv \mathrm{Var}\Big(\frac{1}{\sqrt{G}}\sum_{g=1}^{G}\mathbf{s}_g\Big) = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G}E(\mathbf{s}_g\mathbf{s}_h'),$

and, with $\hat{\mathbf{s}}_g = \mathbf{s}_g(\hat{\boldsymbol{\beta}},\hat{\lambda})$,

$\hat{\mathbf{B}}_{HAC} = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G}K[\mathrm{dist}(g,h)]\,\hat{\mathbf{s}}_g\hat{\mathbf{s}}_h'.$

Thus the HAC estimator of the asymptotic variance of $\hat{\theta}$ is $\frac{1}{G}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}_{HAC}\hat{\mathbf{A}}^{-1}$.

Proposition 5. Under certain conditions, $\hat{\boldsymbol{\beta}}_{QMLE}$ is consistent and has an asymptotic normal distribution. See proof in Chapter 4.

1.6 A Monte Carlo Simulation Study

In this section, we provide Monte Carlo simulation results which show that the PGLS estimator performs reasonably well in finite samples. In the simulations, we study the properties of the PGLS estimator compared with the OLS and GLS estimators, and we also provide robust standard errors for OLS and PGLS.

1.6.1 Data Generating Process

The total number of observations is $N$, which takes the values 400 and 1600 separately. The data are generated on a $\sqrt{N}\times\sqrt{N}$ lattice, with locations represented by the coordinates $(r, s)$, $r, s = 1, 2, \ldots, \sqrt{N}$. The distance $d_{ij}$ between locations $i$ and $j$ is the Euclidean distance: if $A = (a_1, a_2)$ and $B = (b_1, b_2)$ are two points on the lattice, their Euclidean distance is $\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2}$. Other distance measures may also be considered, such as a time series distance (a special case of Euclidean distance in which the locations lie on a straight line), the Manhattan distance $|a_1-b_1| + |a_2-b_2|$, and the maximum coordinate-wise distance $\max(|a_1-b_1|, |a_2-b_2|)$. For simplicity, only the Euclidean distance is used in this paper. We consider three cases of data-generating processes with different correlation structures.

1.6.1.1 Case I

In the first case, the exponential form of spatial correlation is used. We generate data as follows:

(1) $y_i = \alpha + x_i\beta + u_i$, $\alpha = 1$, $\beta = 1$.

(2) Specify the covariance matrix $\boldsymbol{\Omega}$ for the error term:
standard normal random numbers and Ω is the same as in (2) and (3) but with ρ = 1 without loss of generality (w.l.o.g.). The simulation results can be found in Table 1.1, 1.2, and 1.3. In this estimation process, the spatial correlation parameter is estimated by a minimum distance estimator ρ = min ˆ 1.6.1.2 dij [ˆi uj − σ 2 exp(− ρ )]2 for i = j. uˆ Case II In the second case, the spatial correlation is a function of the inverse of distance. The data generating process is as follows: (1) yi = α + xi β + ui , α = 1, β = 1. (2) Specify the covariance matrix Ω for the error term: Ωij = σ 2 dρ , σ 2 = 1, ρ = 0, 0.2, 0.4, 0.6. In this case, the corresponding pairwise corij relation for the error term when the pairwise distance is one is equal to ρ. (3) Generate the error term: u = Ω1/2 , where is a vector of i.i.d. standard normal random numbers. ui is the i th element in the vector u. Ω1/2 is obtained by Cholesky decomposition. (4) xi is a spatially correlated variable. The vector x = Ω1/2 ξ, where ξ is a vector of i.i.d. 22 standard normal random numbers and Ω is the same as in (2) and (3) but ρ = 1 w.l.o.g.. The simulation results can be found in Table 1.4 and 1.5, 1.6. In this estimation process, the spatial correlation parameter is estimated as the average of ui uj /ˆ 2 for pairwise distance ˆˆ σ dij = 1, which means, ρ is estimated only using the pairs of observations whose distances are one. 1.6.1.3 Case III For completeness, another data generating process which uses a spatial weight matrix that is popular in spatial econometrics is as follows: (1) yi = α + xi β + ui , α = 1, β = 1. (2) There are N/L independent districts, where L is the number of observations within a district. Specify the spatial weight matrix W for the error term: wij = 1/ (L − 1) if i = j and i and j are in the same district. wij = 0 if i = j or i and j are in different districts. In this case, the corresponding pairwise correlation for the error term within the district is equal to ρ/ (L − 1). (3) Generate the error term: u = ρWu + , σ 2 = 1, ρ = 0, 0.2, 0.4, 0.6 , where is a vector of i.i.d. standard normal random numbers. ui is the i th element in the vector u. (4) xi is a standard normal variable N (0, 1). The simulation results can be found in Table 1.7 and 1.8. In this estimation process, the spatial correlation parameter is estimated in the following way: ρ = (Wˆ ) (Wˆ ) ˆ u u 23 −1 (Wˆ ) u. u ˆ (1.34) 1.6.1.4 Robust Standard Errors The simulations also provide robust standard errors for the OLS and PGLS estimators since these estimators ignore spatial correlations to some extent. We do not discuss the optimal kernal variance estimators and we only provide feasible variance estimators for the estimation of robust standard errors. For the OLS estimator, we use the formula in equation (1.23). √ The cutoff point d∗ is set as 3 N . When N = 400, d∗ = 7.368. Thus, for observation i, its covariances with any observation that is within 7.4 distance units are accounted for, similar to the case when N = 1600. For the PGLS estimators, we use the formula in equation (1.26). The kernel function in the simulation is specified as in equation (1.25). First, the groupwise distance dgh is specified as the distance between centers of the location. Let the cutoff point √ d∗∗ be 3 N . If dgh ≤ d∗∗ , the covariances of the observations in group g and h will be used. To keep it simple, we put the same weight 1 − dgh /d∗∗ on each of those covariances. 
Like the lags in time series, how to choose the cutoff point needs consideration of economics of the problems. 1.6.2 Monte Carlo Simulation Results For Case I, Table 1.1 shows that as ρ increases, the standard deviation of the OLS estimator increases, which means OLS becomes less efficient as spatial correlation increases in the sample. PGLS performance becomes better when compared to OLS as ρ increases. Using a group size equal to four, which is very small, we can gain back most of the efficiency. As we increase the group size, the feasible PGLS gains more efficiency. When we use a group size equal to 16, the PGLS estimator is almost as efficient as the GLS estimator. Table 1.2 provides robust standard errors for OLS, PGLS with group size 4 and 16. The 24 spatial correlation parameter is calculated using a minimum distance estimator. Simulations show that this estimator is biased when the spatial correlation is high. However, even if the spatial correlation parameter is not consistently estimated, the feasible PGLS estimator can still gain efficiency back. This is because the estimated covariance (even though not consistent) captures some structure of the true covariance. Case II and Case I have very similar results regarding PGLS behavior. But the estimation of spatial correlation in Case II is less biased and more efficient than in Case I. In Case III, independent groups are generated, thus GLS is PGLS in this case. The estimation of spatial correlation as in equation 1.34 is an OLS estimator using the residuals. As in Table 1.7, the estimator for ρ is biased upwards, though the GLS estimator still behaves well. 25 Table 1.1: Case I: Mean Parameters and Standard Deviation N=400, ρ 0.1 0.5 1 2 5 T=2000, β=1 ˆ ˆ ˆ ˆ βols s.d.(βols ) βgls s.d.(βgls ) 1.000 0.050 1.000 0.050 1.000 0.056 1.000 0.054 1.000 0.068 0.999 0.051 1.002 0.081 1.000 0.040 1.002 0.088 1.000 0.028 N=1600, T=2000, β = 1 ˆ4 βpgls 1.000 1.000 1.000 1.001 1.001 ˆ4 s.d.(βpgls ) 0.050 0.055 0.057 0.051 0.041 ˆ ˆ ˆ ˆ ˆ4 ˆ4 ρ βols s.d.(βols ) βgls s.d.(βgls ) βpgls s.d.(βpgls ) 0.1 0.999 0.025 0.999 0.025 0.999 0.025 0.5 0.999 0.028 0.999 0.027 0.999 0.027 1 0.999 0.034 0.999 0.025 0.999 0.028 2 1.000 0.043 0.999 0.020 0.999 0.025 5 0.999 0.049 0.999 0.013 0.999 0.018 s.d.() means standard deviation in the simulation. ˆ4 ˆ16 βpgls and βpgls are the PGLS estimators using group size equal to ˆ16 βpgls 1.000 1.000 1.000 1.000 1.002 ˆ16 s.d.(βpgls ) 0.050 0.054 0.053 0.044 0.033 ˆ16 βpgls 0.999 0.999 0.999 0.999 0.999 ˆ16 s.d.(βpgls ) 0.025 0.027 0.026 0.022 0.015 4 and 16 separately. Table 1.2: Case I: Robust Standard Errors N=400, ˆ ρ s.e.∗ (βols ) 0.1 0.038 0.5 0.038 1 0.037 2 0.035 5 0.030 N=1600, T=2000 ˆ r.s.e.(βols ) 0.047 0.050 0.058 0.066 0.068 T=2000 ˆ4 r.s.e.(βpgls ) 0.047 0.049 0.051 0.052 0.048 ˆ16 r.s.e.(βpgls ) 0.047 0.050 0.051 0.048 0.041 ˆ ˆ ˆ4 ˆ16 ρ s.e.∗ (βols ) r.s.e.(βols ) r.s.e.(βpgls ) r.s.e.(βpgls ) 0.1 0.017 0.024 0.024 0.024 0.5 0.017 0.026 0.025 0.025 1 0.017 0.032 0.027 0.027 2 0.017 0.038 0.027 0.026 5 0.015 0.042 0.026 0.023 ∗ (β ) is the average usual OLS standard error. s.e. ˆols r.s.e.() is the average spatial correlation robust standard error. 26 Table 1.3: Case I: Estimated Error Variance Parameters N=400, ρ 0.1 0.5 1 2 5 ρ ˆ 0.134 0.480 0.919 1.554 2.297 N=1600, T=2000, σ 2 = 1 s.d.(ˆ) ρ 0.125 0.079 0.166 0.357 0.563 σ2 ˆ 0.995 0.992 0.980 0.943 0.825 s.d. σ 2 ˆ 0.072 0.074 0.095 0.150 0.248 T=2000, σ 2 = 1 ρ ρ ˆ s.d.(ˆ) ρ σ2 ˆ s.d. 
σ 2 ˆ 0.1 0.121 0.102 0.999 0.035 0.5 0.495 0.040 0.998 0.037 1 0.979 0.097 0.995 0.048 2 1.855 0.310 0.984 0.084 5 3.511 0.824 0.930 0.175 ρ is the average of estimates for ρ. ˆ σ 2 is the average of estimates for σ 2 . ˆ Table 1.4: Case II: Mean Parameters and Standard Deviation N=400, ρ 0 0.2 0.4 0.6 T=2000, β=1 ˆ ˆ ˆ ˆ βols s.d.(βols ) βgls s.d.(βgls ) 1.000 0.051 1.002 0.051 1.000 0.057 0.999 0.051 1.000 0.062 0.999 0.044 1.001 0.068 1.000 0.038 N=1600, T=2000, β = 1 ˆ4 βpgls 1.000 1.000 1.000 1.000 ˆ4 s.d.(βpgls ) 0.051 0.053 0.050 0.042 ˆ ˆ ˆ ˆ ˆ4 ˆ4 ρ βols s.d.(βols ) βgls s.d.(βgls ) βpgls s.d.(βpgls ) 0 0.999 0.025 0.999 0.087 0.999 0.025 0.2 0.999 0.029 0.999 0.025 0.999 0.027 0.4 0.999 0.033 0.999 0.022 0.999 0.025 0.6 0.999 0.037 1.002 0.018 1.000 0.021 s.d.() means standard deviation in the simulation. ˆ4 ˆ16 βpgls and βpgls are the PGLS estimators using group size equal to 27 ˆ16 βpgls 1.000 1.000 1.000 1.000 ˆ16 s.d.(βpgls ) 0.053 0.052 0.047 0.035 ˆ16 βpgls 0.999 0.999 0.999 1.000 ˆ16 s.d.(βpgls ) 0.025 0.026 0.023 0.017 4 and 16 separately. Table 1.5: Case II: Robust Standard Errors N=400, T=2000 ˆ ρ s.e.∗ (βols ) 0 0.037 0.2 0.036 0.4 0.035 0.6 0.034 N=1600, ˆ r.s.e.(βols ) 0.047 0.050 0.053 0.056 T=2000 ˆ4 r.s.e.(βpgls ) 0.067 0.068 0.067 0.062 ˆ16 r.s.e.(βpgls ) 0.095 0.093 0.086 0.069 ˆ ˆ ˆ4 ˆ16 ρ s.e.∗ (βols ) r.s.e.(βols ) r.s.e.(βpgls ) r.s.e.(βpgls ) 0 0.017 0.024 0.037 0.050 0.2 0.017 0.027 0.038 0.051 0.4 0.017 0.030 0.037 0.047 0.6 0.016 0.032 0.035 0.038 ˆ s.e.∗ (βols ) is the average usual OLS standard error. r.s.e.() is the average spatial correlation robust standard error. Table 1.6: Case II: Estimated Error Variance Parameters N=400, ρ 0 0.2 0.4 0.6 T=2000, σ 2 = 1 ρ ˆ -0.003 0.171 0.352 0.541 N=1600, s.d.(ˆ) ρ σ2 ˆ s.d. σ 2 ˆ 0.036 0.995 0.072 0.048 0.968 0.078 0.055 0.943 0.098 0.051 0.919 0.127 2=1 T=2000, σ ρ ρ ˆ s.d.(ˆ) ρ σ2 ˆ s.d. σ 2 ˆ 0 -0.001 0.018 0.999 0.035 0.2 0.186 0.027 0.985 0.041 0.4 0.377 0.032 0.972 0.057 0.6 0.573 0.029 0.959 0.077 ρ is the average of estimates for ρ. ˆ σ 2 is the average of estimates for σ 2 . ˆ 28 Table 1.7: Case III: Mean Parameters and Standard Deviation Case III ρ 0 0.2 0.4 0.6 N=400, T=2000, ˆ ˆ βols s.d.(βols ) 1.000 0.051 0.999 0.053 1.000 0.057 1.001 0.072 N=1600, T=2000, β=1 ˆ ˆ βgls s.d.(βgls ) 0.999 0.051 1.001 0.052 1.000 0.050 1.000 0.048 β=1 ˆ ˆ ˆ ˆ ρ βols s.d.(βols ) βgls s.d.(βgls ) 0 1.000 0.026 1.000 0.026 0.2 1.001 0.026 1.001 0.026 0.4 1.000 0.029 1.001 0.025 0.6 1.001 0.036 1.002 0.024 s.d.() means standard deviation in the simulation. ˆ ˆ βols and βf gls are the average of OLS and FGLS estimators. Table 1.8: Case III: Estimated Error Variance Parameters N=400, ρ 0 0.2 0.4 0.6 ρ ˆ −0.018 0.353 0.648 0.852 N=1600, T=2000, σ 2 = 1 s.d.(ˆ) ρ 0.128 0.092 0.055 0.024 σ2 ˆ 0.992 0.982 0.948 0.896 s.d. σ 2 ˆ 0.070 0.071 0.072 0.071 T=2000, σ 2 = 1 ρ ρ ˆ s.d.(ˆ) ρ σ2 ˆ s.d. σ 2 ˆ 0 -0.004 0.062 0.998 0.036 0.2 0.365 0.044 0.998 0.036 0.4 0.656 0.026 0.995 0.036 0.6 0.856 0.015 0.985 0.036 ρ is the average of estimates for ρ. ˆ σ 2 is the average of estimates for σ 2 . ˆ 29 1.7 Conclusions In a linear regression model with spatial data, we can use a weighted least squares to improve efficiency, even if the weight might be misspecified. As long as the misspecified structure can capture some properties of the true variance structure of the spatial data, efficiency can be improved, which is the idea of ”pseudo GLS”. 
Chapter 2 Estimation of Nonlinear Models in a Quasi-Maximum Likelihood Framework with Spatial Data

2.1 Introduction

Many empirical economic and social studies involve discrete data that exhibit spatial correlation due to the geographic locations of the individuals or agents of interest. For instance, the number of patents a firm receives is correlated with the numbers received by nearby firms; this may be due to a technology spillover effect or to a common local policy aimed at encouraging new technology. Another example is the neighborhood effect: there is a causal effect between an individual's decision whether to own stocks and the average stock market participation of the individual's community (Brown, Smith, & Weisbenner 2008). The first example involves count data and the second a binary response; both involve discrete data. Nonlinear models are suitable for studying discrete variables. Unfortunately, there is not much literature on the estimation of nonlinear models with discrete spatial data. Because of the spatial correlation, the discrete variables are not independent, and both the nonlinearity and the correlation make estimation difficult.

Maximum likelihood estimation (MLE) is a widely used method for estimating nonlinear models. To use MLE, one needs to specify the joint distribution of the spatial random variables, which includes correctly specifying the marginal and conditional distributions. However, given a spatial data set, the dependence structure is generally unknown, and if the joint distribution of the variables is misspecified, MLE is generally not consistent. Another estimation method is quasi-maximum likelihood estimation (QMLE). Using a density that belongs to a linear exponential family (LEF), QMLE is consistent if we correctly specify the conditional mean, even with other features of the density misspecified. In the panel data case, pooled (partial) QMLE, which ignores serial correlation, is consistent under some regularity conditions (Wooldridge 2010). In their working paper, Wang, Iglesias and Wooldridge (2012) use a bivariate probit partial MLE to improve estimation efficiency in a spatial probit model. Using their approach, we would need to correctly specify the marginal distribution of the binary response variable conditional on the covariates and distance measures. [Footnote: A sample of spatial data is collected with a set of geographic locations. Spatial dependence is usually characterized by distances between observations; a distance measure is how one defines these distances, and physical distance or economic distance are two options. Information about agents' physical locations is commonly imprecise, e.g., only the zip code is known. Conley and Molinari (2007) deal with the inference problem when there are distance errors. In this chapter I assume there are no measurement errors in pairwise distances.] Since the bivariate marginal distribution of a spatial multivariate normal distribution is bivariate normal, one can derive the bivariate normal distribution under some distributional assumptions. There are two problems with the approach of Wang, Iglesias and Wooldridge (2012). First, the computation is already hard for a bivariate distribution, and a multivariate marginal distribution of higher dimension is even more computationally demanding. Second, it also requires correct specification of the marginal bivariate normal distribution to obtain consistency.
In fact, we can have less restrictive distributional assumptions than those required in bivariate partial MLE. Suppose we only specify the mean and the working variances and covariances2 . Using QMLE in the LEF, we can get consistent estimators. Even if the the variances and covariances are not correctly specified, we can still consistently estimate the mean parameters as well as the average partial effects, which are more interesting. In the literature, the QMLE and GEE approach is used in panel data models to get more efficient estimators (Gourierous, Monfort, & Trongnon 1984). In this paper, I will demonstrate how to use the QMLE and GEE approach in a spatial data setting to get a consistent and more efficient estimator. Generalized least squares (GLS) can be used to improve the estimation efficiency in a linear regression model even if the variance covariance structure is misspecified. Similarly, generalized estimating equations (GEE) or weighted multivariate nonlinear least squares (WMNLS) are used in nonlinear panel data models and system of equations to obtain more efficient conditional mean parameters. Generally we expect that GEE can give more efficient estimators compared to PMLE, which uses only the marginal density to get the consistent estimators. To use the QMLE in the spatial data setting, I first give a series of assumptions, based on which M-estimators are consistent for the spatial processes. To derive the asymptotics for the M-estimators we have to use a uniform law of large numbers (ULLN) and a central limit theorem (CLT). These limit theorems are the fundamental building blocks for the asymptotic theory of nonlinear spatial M-estimators, e.g. maximum likelihood estimators (MLE) and generalized method of moments estimators (GMM) (Jenish and Prucha, 2009). 2 The true variance covariance matrix is generally unknown. By specifying a working variance covariance matrix, one can capture some of the correlation structure between observations. 33 Conley (1999) makes an important contribution towards developing an asymptotic theory of GMM estimators for spatial processes. He utilizes Bolthausen’s (1982) CLT for stationary random fields. Jenish and Prucha (2009) provide a ULLN and a CLT for spatial data including nonstationary spatial processes. Using theorems in Jenish and Prucha (2009), one can analyze more interesting economic phenomena. For example, real estate prices usually shoot up as one moves from the periphery to the center of a big city. While I will not discuss trending processes in this paper, Cressie (1993) provides numerous examples of trending spatial processes. In Section 2, the M-estimator framework under the spatial data context is established. A series of assumptions are given based on Jenish and Prucha (2009) under which M-estimators are consistent and have an asymptotic normal distribution. In Section 3, I propose a twostep GEE estimator in a QMLE framework. In Section 4, the asymptotic distributions for QMLE and GEE for spatial data are derived. In Section 5, consistent variance covariance estimators are provided for the nonlinear estimators for spatial data. In Section 6, we look in detail at a Probit model with spatial correlation in the error term of the latent variable and a count data model with a multiplicative spatial error term. Section 7 contains Monte Carlo simulation results which compare efficiency of different estimation methods for the two nonlinear models explored in the previous section. Section 8 concludes. Section 10 is the appendix. 
2.2 M-estimation In this section, I will examine the M-estimation framework of nonlinear models with spatial data. Unlike linear models, a very important feature of nonlinear models is that estimators 34 cannot be obtained in a closed form, which requires new tools for asymptotic analysis: we need uniform law of large numbers (ULLN) and a central limit theorem (CLT). The GEE procedure is a two-step M-estimation method within the QMLE framework. The Mestimator with spatial data is proved to be consistent and asymptotically normal under certain assumptions. 2.2.1 M-estimation Assume that spatial processes reside on a regular lattice3 D in a Euclidean space, Rd , d ≥ 1.4 Let s denote a location in D. Suppose we have a sample of N observations. Let DN ⊆ D contains the location information for this sample. Let si denote the location of the observation i, i = 1, 2, ..., N. Let dij be the pairwise distance between location si and location sj . That is, dij is the pairwise distance between observation i and observation j. I first give some regularity conditions for the spatial processes I am studying and then I give a general framework for M-estimators with spatial data. Following Jenish and Prucha (2009) Definition 1, I adopt the follwing definitions of mixing conditions for underlying random field. For U ⊆ DN and V ⊆ DN , define σalgebras σ (U ) = σ (xi ; i ∈ U ) and σ (V ) = σ (xi ; i ∈ V ). |U | and |V | denote the cardinality of U and V . The two commonly used mixing conditions are α-mixing and φ-mixing which are introduced separately by Rosenblatt and Ibragimov. The α-mixing and φ-mixing conditions are: αN (U, V ) = sup (|P (u∩v) − P (u) P (v)| , u ∈ U, v ∈ V ) , 3A lattice is a collection of spatial sites (locations) supplemented with neighborhood information (Cressie 1993, p. 383). 4 A two-dimensional Euclidean space is called the Cartesian plane. Spatial processes can also reside in a higher dimension of space, Rn , n > 2. 35 and φN (U, V ) = sup (|P (u|v) − P (u)| , u ∈ U, v ∈ V, P (u) > 0) . Define a metric ρ (i, j) = max1≤l≤d |il | , where il denotes the l-th component of i. The mixing conditions for the underlying random fields are defined as follows: αk,l,N (r) = sup (αN (U, V ) , |U | ≤ k, |V | ≤ l, ρ (U, V ) ≥ r), φk,l,N (r) = sup (φN (U, V ) , |U | ≤ k, |V | ≤ l, ρ (U, V ) ≥ r) , with k, l, r, N natural num¯ bers. Further let αk,l,N (r) = supαk,l,N (r) and φk,l,N (r) = supφk,l,N (r) . ¯ N N Let {wN } = {(xi , yi )} , i = 1, 2, ..., N . (xi , yi ) is the observation obtained at location si . xi is a row vector of independent variables and yi is a scalar dependent variable. θ ∈ Θ is a P × 1 parameter vector, and θ0 is the true parameter value. Let θ be a general notation for the parameter vector. An objective function QN depends on a sample of realizations of variables wN , location information DN , parameter θ and the sample size N. An M-estimator of θ0 is given by minimizing the objective function QN as follows, ˆ θ N = arg min QN (wN , DN ; θ) . θ∈Θ (2.1) In particular, I will address the case when QN (wN , DN ; θ) can be expressed as a sample average. An example of this type of M-estimator is partial (pooled) maximum likelihood (PMLE) estimator. The objective function can be written as QN (wN , DN ; θ) = 1 N N qi (wi , DN ; θ) , (2.2) i=1 where qi (wi , DN ; θ) is some real valued function defined on Θ. wi is the observed data obtained at location si . DN contains the location information of wi and other observations. 
Suppose that in a parametric model the conditional mean is correctly specified: $E(y_i \mid x_i, D_N) = m_i(x_i, D_N; \theta_0)$, along with an LEF density $f_i(y_i \mid x_i; \theta)$. For example, in a nonlinear regression model the objective function for the nonlinear weighted least squares (NWLS) estimator is $\frac{1}{N}\sum_{i=1}^{N} [y_i - m_i(x_i;\theta)]^2 / v_i$, where $v_i$ is the variance of the error term (usually based on the LEF density), while for partial maximum likelihood estimation the objective function is $\frac{1}{N}\sum_{i=1}^{N} \log f_i(y_i \mid x_i, D_N; \theta)$, with $E(y_i \mid x_i, D_N) = m_i(x_i, D_N; \theta_0)$.

For the M-estimator to be consistent, we need a uniform law of large numbers (ULLN). To obtain the ULLN, we need the following assumptions.

Assumption 1: The pairwise distances $d_{ij}$ are finite and bounded below by some $\varepsilon > 0$. I employ increasing domain asymptotics; that is, the sample size grows as the sampling region expands.

Assumption 2: $x_i$, $y_i$ are uniformly bounded random variables.

Assumption 3: $\Theta$ is a compact subset of $\mathbb{R}^P$.

Assumption 4 (Pointwise convergence): For each $\theta \in \Theta$, $Q_N(\theta) - \bar{Q}_N(\theta) = o_p(1)$, where $\bar{Q}_N(\theta) = E(Q_N(\theta)) = \frac{1}{N}\sum_{i=1}^{N} E(q_i(\theta))$, and $\lim_{N\to\infty} \bar{Q}_N(\theta) = \ddot{Q}(\theta)$.

Assumption 5: $Q_N(\theta)$ is stochastically equicontinuous. Let $(\Theta, \nu)$ be a metric space and let $B(\theta', \delta)$ be the open ball $\{\theta \in \Theta : \nu(\theta, \theta') < \delta\}$. $Q_N(\theta)$ is stochastically equicontinuous in the sense that, for every $\varepsilon > 0$,
\[
\limsup_{N\to\infty} P\Big( \sup_{\theta,\theta' \in \Theta,\ \nu(\theta,\theta') < \delta} \big| Q_N(\theta) - Q_N(\theta') \big| > \varepsilon \Big) \to 0 \quad \text{as } \delta \to 0.
\]

Assumption 6: $\bar{Q}_N(\theta)$ is also stochastically equicontinuous on $\Theta$; the definition parallels the statement above for $Q_N(\theta)$.

Assumption 7: $\ddot{Q}$ attains a unique global minimum at $\theta_0 \in \Theta$.

Assumption 8: There is no perfect multicollinearity in $x_i$. For exponential and logistic regression functions, the objective function depends on the covariates only through a linear index, so multicollinearity must be ruled out in order to identify the model.

Assumption 4 provides a pointwise law of large numbers. Assumptions 7 and 8 provide the identification conditions that ensure the minimization problem has a unique solution.

Proposition 6 Under Assumptions 1-8, the M-estimator in (2.1)-(2.2) is consistent; that is, $\hat{\theta}_N \to_{a.s.} \theta_0$ as $N \to \infty$. See proof in Chapter 4.

Following the above proposition, we can apply the ULLN to different nonlinear estimators. The M-estimator above can also be expressed using groups of observations (group notation): we divide the observations into groups according to geographic properties or other economic or social relationships. The objective function is then
\[
Q_G(w_G, D_G; \theta) = \frac{1}{G} \sum_{g=1}^{G} q_g(w_g, D_G; \theta). \tag{2.3}
\]
Let $q_g(\theta) \equiv q_g(w_g, D_G; \theta)$, a real-valued function of the $g$th group of observations, $g = 1, 2, \ldots, G$; $w_g$ contains the observations of the $g$th group, and $D_G$ represents the lattice with group information in addition to locations. $Q_G$ denotes the objective function when the total number of groups is $G$, although the total number of observations is still $N$. Note that when the group size equals one, the group notation reduces to the individual notation. I use the group notation in the rest of this paper. Assumptions 1-8 in group notation are essentially the same as in individual notation (equations (2.1) and (2.2)), except that the subscript $i$ is replaced by $g$.
Let $L$ be the number of observations in each group, so that the total number of groups is $G = N/L$. The groupwise M-estimator is
\[
\hat{\theta}_G = \arg\min_{\theta \in \Theta} Q_G(w_G, D_G; \theta). \tag{2.4}
\]

Proposition 7 Following Proposition 6, the groupwise M-estimator in (2.3)-(2.4) is consistent; that is, $\hat{\theta}_G \to_{a.s.} \theta_0$ as $G \to \infty$. See proof in Chapter 4.

2.2.2 Two-step Estimation

In some situations we have a preliminary estimator. For example, a partial QMLE yields a preliminary consistent estimator, after which an estimated working covariance matrix can be used as a weight to obtain a more efficient estimator. A two-step M-estimator $\hat{\theta}_G$ of $\theta_0$ solves
\[
\hat{\theta}_G = \arg\min_{\theta \in \Theta} Q_G(w_G, D_G; \theta, \hat{\gamma}), \qquad
Q_G(w_G, D_G; \theta, \hat{\gamma}) = \frac{1}{G} \sum_{g=1}^{G} q_g(w_g, D_G; \theta, \hat{\gamma}), \tag{2.5}
\]
where $\hat{\gamma}$ is a preliminary estimator based on the sample $\{w_g : g = 1, 2, \ldots, G\}$, which exhibits spatial correlation, and $\operatorname{plim} \hat{\gamma} = \gamma^*$ for some element $\gamma^*$ of the parameter space $\Gamma$. I discuss a specific example of the two-step M-estimator, the PMLE-GEE method, in the next section.

Assumption 9: $\bar{Q}_G(\theta, \gamma^*)$ attains a unique global minimum at $\theta_0 \in \Theta$, where $\bar{Q}_G(\theta, \gamma^*) = \frac{1}{G}\sum_{g=1}^{G} E[q_g(w_g, D_G; \theta, \gamma^*)]$; that is, $\bar{Q}_G(\theta_0, \gamma^*) < \bar{Q}_G(\theta, \gamma^*)$ for all $\theta \in \Theta$, $\theta \neq \theta_0$.

Assumption 9 provides the necessary identification condition for the two-step M-estimator. If $q_g(w_g; \theta, \gamma)$ is stochastically equicontinuous over $\Theta \times \Gamma$, then a ULLN applies. Along with identification, this result can be shown to imply consistency of $\hat{\theta}_G$ for $\theta_0$; the consistency argument is the same as in Proposition 6.

2.3 Two-step Estimator: QMLE and GEE

Due to the complexity of the joint distribution of spatial random processes, econometricians have developed a variety of ways to reduce the computational burden. One way is to specify the partial conditional distribution and maximize the sum of the log likelihoods over the individual observations. The parameters can be consistently estimated if the partial log likelihood function satisfies the assumptions for consistency of M-estimation; a consistent variance estimator must then be provided for valid inference.⁵ Moreover, one can divide the data into groups and specify the marginal distribution of each group to obtain a more efficient estimator (Wang, Iglesias, & Wooldridge 2012). However, this approach requires correctly specified marginal distributions, and as the group size grows, the joint distribution of each group of variables becomes more and more difficult to compute.

⁵Ignoring dependence in the estimation of the parameters results in incorrect inference if the variances are computed as though the observations were independent. Dependence should be accounted for to the extent that it is ignored in estimation.

In this section, I propose a two-step estimator in a QMLE framework. The first step is a pooled QMLE procedure and the second step is a GEE procedure. Only the correct conditional mean and a density in the LEF need to be specified. The partial (pooled) QMLE method requires correctly specified conditional mean functions, $E(y_i \mid x_i) = m_i(x_i; \theta_0)$, $i = 1, 2, \ldots, N$, along with an LEF density $f_i(y_i \mid x_i; \theta)$. One then maximizes the sum of the individual log likelihoods, ignoring any spatial dependence. Note that the true log likelihood cannot be written as the sum of the individual likelihoods; PQMLE is an approximation to the true MLE. Under certain conditions, however, PQMLE delivers consistent estimators with computational ease. The partial quasi-log likelihood is
\[
L_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log f_i(y_i \mid x_i, D_N; \theta). \tag{2.6}
\]
The partial QMLE is found by solving the score equation
\[
S_N(\check{\theta}) = \sum_{i=1}^{N} s_i(\check{\theta}) = 0. \tag{2.7}
\]
One characterization of the QMLE in the LEF is that the individual score has the form
\[
s_i(\theta) = \nabla_\theta m_i(x_i, D_N; \theta)'\, [y_i - m_i(x_i, D_N; \theta)] / v(m_i(x_i, D_N; \theta)), \tag{2.8}
\]
where $\nabla_\theta m_i(x_i, D_N; \theta)$ is the $1 \times P$ gradient of the mean function and $v(m_i(x_i, D_N; \theta))$ is the variance function associated with the chosen LEF density. For the Bernoulli distribution, $v(m_i) = m_i(1 - m_i)$, and for the Poisson distribution, $v(m_i) = m_i$. If $E(y_i \mid x_i, D_N) = m_i(x_i, D_N; \theta_0)$, then $E(s_i(\theta_0) \mid x_i, D_N) = 0$, which implies Fisher consistency.

The above yields a consistent estimator by Proposition 6. However, this estimator is unlikely to be the most efficient among the estimators based on the same distributional assumptions, because it ignores the spatial correlations between observations. If we can use some or all of the pairwise correlation information, we can potentially improve estimation efficiency. A common way to use the pairwise information is to divide observations into groups and use the spatial correlations within groups while ignoring the correlations between groups. In empirical studies there often exist natural groupings of the data, e.g., technology spillover effects within a given state. If we know the groupwise distribution, but not the full joint distribution of the data, we can obtain a consistent estimator using only the groupwise information.

Let $g$ index the groups, $g = 1, 2, \ldots, G$. The group size is the number of observations divided by the number of groups, and it can range from $1$ to $N$. There are two extreme cases. When the group size is 1, the resulting estimator is the usual partial QMLE, which ignores all pairwise correlations. When the group size is $N$, all pairwise information is used. If the group size lies strictly between one and $N$, the estimation is still a partial QMLE: by "partial," I mean that I do not use full information, only the information within groups.

Suppose we divide the data into $G$ groups with the same number of observations in each group. Let $L = N/G$, which is fixed; the number of groups and the sample size both increase as the sampling domain expands. That is, we obtain more observations by increasing the space from which the sample is drawn. For group $g$, $X_g$ is an $L \times K$ matrix and $y_g$ is an $L \times 1$ vector, $g = 1, 2, \ldots, G$; let $X$ denote the $N \times K$ covariate matrix. The assumptions can be restated in group notation; since they are essentially the same apart from notation, I do not restate them here.

Assumption 10: The conditional mean is correctly specified for each group: $E(y_g \mid X_g) = m_g(X_g, D_G; \theta_0) \equiv m_g$, $g = 1, 2, \ldots, G$. I write $m_g$ for short and $m_g(\cdot)$ to emphasize particular arguments. When $G = N$, this reduces to $E(y_i \mid x_i) = m_i(x_i, D_N; \theta_0)$.

The groupwise QMLE estimator is given by setting the score function to zero,
\[
\sum_{g=1}^{G} s_g(\hat{\theta}) = 0, \tag{2.9}
\]
where $s_g(\theta)$ is the score function for each group, with the form
\[
s_g(\theta) = \nabla_\theta m_g'\, W_g^{-1} (y_g - m_g), \tag{2.10}
\]
where $\nabla_\theta m_g$ is the $L \times P$ gradient of the group mean function and $W_g$ is the LEF variance-covariance matrix for group $g$.
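As an illustration of (2.7)-(2.8), the following sketch computes a pooled Poisson QMLE on simulated data (the design is an assumed placeholder, not the chapter's simulation design). For the Poisson LEF, $\nabla_\theta m_i = m_i x_i$ and $v(m_i) = m_i$, so the score collapses to $x_i'(y_i - m_i)$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Illustrative data with exponential conditional mean (dependence omitted here)
N = 1000
X = np.column_stack([np.ones(N), rng.uniform(0.0, 1.0, N)])
theta_true = np.array([1.0, -1.0])
y = rng.poisson(np.exp(X @ theta_true))

def neg_loglik(theta):
    # Pooled Poisson quasi-log likelihood (2.6), up to the log(y!) constant
    return -np.mean(y * (X @ theta) - np.exp(X @ theta))

def neg_score(theta):
    # LEF score (2.8): for Poisson, s_i = x_i' (y_i - m_i)
    m = np.exp(X @ theta)
    return -(X * (y - m)[:, None]).mean(axis=0)

theta_check = minimize(neg_loglik, x0=np.zeros(2), jac=neg_score, method="BFGS").x
print("theta_check (pooled QMLE):", theta_check)
```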
Notice that $W_g$ is not a diagonal matrix containing only the individual variances; it also contains the covariances between pairs of individuals. It is because of this property that we can improve efficiency through a so-called generalized estimating equations (GEE) approach. The GEE approach was first extended to correlated data by Zeger and Liang (1986). In the spatial data context, I propose the generalized estimating equations for the mean parameters,
\[
\sum_{g=1}^{G} \nabla_\theta m_g'\, W_g^{-1} (y_g - m_g) = 0. \tag{2.11}
\]
To use GEE, we need a consistent estimator of $W_g$, which depends on the pairwise distances and a spatial dependence parameter. Suppose $\hat{W}_g$ is a consistent estimator of $W_g$. GEE is a "pseudo" weighted multivariate nonlinear least squares (WMNLS) in that it uses only the groupwise information. The GEE estimator is
\[
\hat{\theta}_{GEE} = \arg\min_{\theta} \sum_{g=1}^{G} (y_g - m_g)'\, \hat{W}_g^{-1} (y_g - m_g). \tag{2.12}
\]
As a special case, the pooled QMLE coincides with the nonlinear weighted least squares (NWLS) estimator
\[
\tilde{\theta}_{NWLS} = \arg\min_{\theta} \sum_{i=1}^{N} [y_i - m_i(x_i, D_N; \theta)]^2 / v(m_i(x_i, D_N; \theta)). \tag{2.13}
\]

The following demonstrates how to find a consistent estimator of $W_g$. Write $W_g = V_g(X_g; \theta)^{1/2}\, R_g(\rho, D_G)\, V_g(X_g; \theta)^{1/2}$. The diagonal elements of $W_g$ are the variances of the dependent variables implied by the LEF density; the off-diagonal elements are covariances that depend on the spatial parameter and the distances. Specifically,
\[
V_g(X_g, D_G; \theta) = \operatorname{diag}(v_1, v_2, \ldots, v_L), \tag{2.14}
\]
where the $l$th diagonal element is $v_l = \operatorname{Var}(y_{gl} \mid X_{gl})$ in group $g$, $y_{gl}$ is the $l$th element of $y_g$, and $X_{gl}$ is the $l$th row of $X_g$.

Let $R_N(\rho, D_N)$ be the $N \times N$ correlation matrix for the whole sample, and let $R_g(\rho, D_G)$ be the $L \times L$ correlation matrix for group $g$. A common assumption for the $ij$th element of $R_N(\rho, D_N)$ is that $R_{ij} = 1 - \gamma(d_{ij}, h)$, where
\[
\gamma(d_{ij}, h) = \begin{cases} c + b\,[1 - \exp(-d_{ij}/\rho)], & d_{ij} \neq 0, \\ 0, & \text{otherwise}, \end{cases} \tag{2.15}
\]
and the vector of spatial parameters is $h = (c, b, \rho)'$, with $c \geq 0$, $b \geq 0$, $\rho \geq 0$, and $c + b \leq 2$.⁶ Setting $c = 0$ and $b = 1$ gives
\[
R_{ij} = \begin{cases} \exp(-d_{ij}/\rho), & d_{ij} \neq 0, \\ 1, & \text{otherwise}. \end{cases} \tag{2.16}
\]
Another example would be
\[
R_{ij} = \begin{cases} \rho/d_{ij}, & d_{ij} \neq 0, \\ 1, & \text{otherwise}. \end{cases} \tag{2.17}
\]
Although the above specifications do not exhaust the possibilities, they at least show how to parameterize the spatial correlation, and they provide a basis for testing for spatial correlation.

⁶See Cressie (1993), p. 61, for more examples.

Let $\check{\theta}$ be the partial QMLE estimator, and let $\check{u}_i = y_i - m_i(x_i; \check{\theta})$, $i = 1, 2, \ldots, N$, be the QMLE residuals. Let $\check{v}_i = v(m_i(x_i, D_N; \check{\theta}))$ be the fitted variance of individual $i$ associated with the chosen LEF density, and let $\hat{r}_i = \check{u}_i/\sqrt{\check{v}_i}$ be the standardized residuals, with $\hat{r} = (\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_N)'$. Then $\hat{r}\hat{r}'$ is the sample correlation matrix. We can use the method in Prentice (1988) to find a consistent estimator of $\rho$. Let $e$ be the vector containing the $N(N-1)/2$ distinct elements of the lower (or upper) triangle of $\hat{r}\hat{r}'$, excluding the diagonal, and let $z$ be the vector of the corresponding elements of $R$. We obtain the parameter estimator by solving
\[
\hat{\rho} = \arg\min_{\rho} (e - z)'\, \Xi^{-1} (e - z), \tag{2.18}
\]
where $\Xi$ is a working weighting matrix for $e$, the sample correlation vector.
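The following sketch illustrates (2.18) under two simplifying assumptions: the exponential working correlation (2.16) and $\Xi = I$, so the criterion reduces to least squares on the within-sample residual products. The locations, true $\rho$, and "standardized residuals" are simulated placeholders:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# Locations on a small lattice; true correlation is exponential, as in (2.16)
coords = np.array([(r, s) for r in range(10) for s in range(10)], dtype=float)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
rho_true = 1.5
R = np.where(d > 0, np.exp(-d / rho_true), 1.0)

# One correlated draw standing in for the standardized residuals r_hat
r_hat = np.linalg.cholesky(R + 1e-10 * np.eye(len(R))) @ rng.standard_normal(len(R))

iu = np.triu_indices(len(R), k=1)
e = (r_hat[:, None] * r_hat[None, :])[iu]   # vector of residual products
dist = d[iu]

def objective(rho):
    # Criterion (2.18) with identity weighting (Xi = I)
    z = np.exp(-dist / rho)
    return np.sum((e - z) ** 2)

rho_hat = minimize_scalar(objective, bounds=(0.01, 10.0), method="bounded").x
print("rho_hat:", rho_hat)
```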
In the simplest case, $\Xi$ is a diagonal matrix with $\xi_{21}, \xi_{31}, \ldots, \xi_{N1}, \xi_{32}, \xi_{42}, \ldots, \xi_{N2}, \ldots, \xi_{N,N-1}$ on the diagonal, the corresponding variances of the elements of $e$. If the variance-covariance matrix $W$ is correctly specified, a model-based consistent variance estimator of $\hat{\theta}$ is
\[
\Big( \sum_{g=1}^{G} \nabla\hat{m}_g'\, \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1},
\]
where $\nabla\hat{m}_g = \nabla m_g(X_g; \hat{\theta}_{GEE})$. As an alternative, the variance estimator of $\hat{\theta}$ that is robust to misspecification of the variance-covariance matrix is
\[
\Big( \sum_{g=1}^{G} \nabla\hat{m}_g'\, \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}
\Big( \sum_{g=1}^{G}\sum_{h=1}^{G} k(g,h)\, \nabla\hat{m}_g'\, \hat{W}_g^{-1} \hat{u}_g \hat{u}_h'\, \hat{W}_h^{-1} \nabla\hat{m}_h \Big)
\Big( \sum_{g=1}^{G} \nabla\hat{m}_g'\, \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}, \tag{2.19}
\]
where $k(g,h)$ is a kernel function depending on the distance between groups $g$ and $h$.

The GEE approach is summarized as follows. First, find the partial QMLE estimator of the mean parameters and obtain the PQMLE residuals. Second, use the first-step estimator to construct a fitted variance-covariance matrix according to the LEF density, and estimate the spatial correlation parameter using the standardized residuals from the first step. After obtaining the working matrix, carry out a weighted multivariate nonlinear least squares procedure; this gives the GEE estimator. Finally, compute a consistent variance estimator of the mean parameters that is robust to heteroskedasticity and spatial correlation. From these estimates we can also obtain the average partial effects (APEs).

2.4 Asymptotic Distribution for PQMLE and GEE

A central limit theorem is needed to derive the asymptotic distributions of the M-estimators. Bolthausen (1982) provides a central limit theorem (CLT) for strictly stationary processes. In economic applications, however, many spatial processes are nonstationary in the sense that they are heterogeneous: the joint distribution of the dependent variables varies with location. There are also spatial processes with asymptotically unbounded moments. In this paper, however, I discuss only uniformly bounded random variables.

Taking derivatives of the objective function with respect to the parameter yields the scores for each group. Let $s_g(\theta)$ denote the $P \times 1$ score vector for $q_g(\theta)$. The group score $s_g(\theta)$ and Hessian $H_g(\theta)$ have the following forms:
\[
s_g(\theta) = \nabla_\theta m_g(\theta)'\, W_g^{-1} [y_g - m_g(\theta)], \tag{2.20}
\]
with $E[s_g(\theta_0) \mid x_g, D_G] = 0$. By the assumption of a correctly specified conditional mean, this condition implies Fisher consistency of the QMLE for the linear exponential family. Meanwhile,
\[
H_g(\theta) = \partial s_g(\theta)/\partial\theta
= -\nabla_\theta m_g'\, W_g^{-1} \nabla_\theta m_g + \partial^2_\theta m_g\, W_g^{-1} (y_g - m_g), \tag{2.21}
\]
where $\nabla_\theta m_g \equiv \partial m_g/\partial\theta$ is the $L \times P$ gradient of the group mean function, $\partial^2_\theta m_g \equiv \partial^2 m_g/\partial\theta\,\partial\theta'$ is the Jacobian of that gradient, and $W_g$ is the LEF variance-covariance matrix for group $g$. Taking the expected value of the Hessian over the distribution of $w$ gives
\[
E[H_g(\theta_0)] = E\{E[H_g(\theta_0) \mid x_g, D_G]\}
= E\big\{ -\nabla m_g' W_g^{-1} \nabla m_g + \partial^2_\theta m_g\, W_g^{-1}\, [E(y_g \mid x_g, D_G) - m_g] \big\}
= E\big( -\nabla m_g' W_g^{-1} \nabla m_g \big). \tag{2.22}
\]
Recall that in the spatial data context the generalized estimating equations for the mean parameters are
\[
\frac{1}{G}\sum_{g=1}^{G} s_g(\hat{\theta}) = 0, \tag{2.23}
\]
\[
\frac{1}{G}\sum_{g=1}^{G} \nabla m_g(\hat{\theta})'\, W_g^{-1} [y_g - m_g(\hat{\theta})] = 0. \tag{2.24}
\]
Because each score is a function of the spatial process, the scores are correlated with each other. The score function for the whole sample is $S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\hat{\theta}) = 0$, and it can be expanded about $\theta_0$ in a mean-value expansion:
\[
S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0) + \Big[\frac{1}{G}\sum_{g=1}^{G} H_g(\ddot{\theta})\Big](\hat{\theta} - \theta_0), \tag{2.25}
\]
where $\ddot{\theta} \in \Theta$ lies between $\hat{\theta}$ and $\theta_0$, and $H_g(\ddot{\theta})$ is the $P \times P$ Hessian of the objective function $q_g(\theta)$. Hence
\[
\sqrt{G}(\hat{\theta} - \theta_0) = -\Big[\frac{1}{G}\sum_{g=1}^{G} H_g(\ddot{\theta})\Big]^{-1} \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0). \tag{2.26}
\]

Assumption 11 (Uniform $L_{2+\delta}$ integrability): The elements of the scores are uniformly bounded and have limiting expectation equal to zero; that is,
\[
\lim_{k\to\infty} \sup_g E\big[ |s_{gl}|^{2+\delta}\, \mathbf{1}(|s_{gl}| > k) \big] = 0, \tag{2.27}
\]
where $\mathbf{1}(\cdot)$ is the indicator function, $s_g$ is the group score matrix, $s_{gl}$ is an element of the score matrix, and $k$ is a constant.

Assumption 12: The second moment of the score function is positive and uniformly bounded:
\[
0 < \lim_{G\to\infty} \operatorname{Var}\Big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \Big) < \infty. \tag{2.28}
\]

Proposition 8 Under Assumptions 1-12, $\sqrt{G}(\hat{\theta} - \theta_0) \Rightarrow N(0, A_0^{-1}B_0A_0^{-1})$, where $A_0 = \lim_{G\to\infty} -\frac{1}{G}\sum_{g=1}^{G} E[H_g(\theta_0)]$ and $B_0 = \lim_{G\to\infty} \operatorname{Var}\big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \big)$. See proof in Chapter 4.

$E[H_g(\theta_0)]$ was given in equation (2.22). $\operatorname{Var}\big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \big)$ is given by
\[
\operatorname{Var}\Big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \Big)
= \operatorname{Var}\Big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} \nabla m_g(\theta_0)'\, W_g^{-1} u_g \Big)
= \frac{1}{G}\sum_{g=1}^{G} E\big[ \nabla m_g(\theta_0)' W_g^{-1} u_g u_g' W_g^{-1} \nabla m_g(\theta_0) \big]
+ \frac{1}{G}\sum_{g=1}^{G}\sum_{h\neq g} E\big[ \nabla m_g(\theta_0)' W_g^{-1} u_g u_h' W_h^{-1} \nabla m_h(\theta_0) \big]. \tag{2.29}
\]

If we have a first-step estimator $\hat{\gamma}$ with $\operatorname{plim}\hat{\gamma} = \gamma^*$, then by a linear expansion the score can be written as
\[
S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0, \hat{\gamma}) + o_p(1)
= \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*) + F_0(\hat{\gamma} - \gamma^*) + o_p(1), \tag{2.30}
\]
where $s_g(\theta, \hat{\gamma}) = \nabla m_g(\theta)'\, \hat{W}_g^{-1} [y_g - m_g(\theta)]$, $\hat{W}_g^{-1}$ is a function of $\hat{\gamma}$, and $F_0$ is a $P \times J$ matrix, $J$ being the dimension of $\gamma$: $F_0 = \lim_{G\to\infty} \frac{1}{G}\sum_{g=1}^{G} E[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)]$. Note that $m_g(\theta)$ and $\nabla m_g(\theta)$ do not depend on $\gamma$; differentiation with respect to $\gamma$ affects only $W_g^{-1}$, so $\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)$ is a linear combination of the elements of $y_g - m_g(\theta_0)$. Since $E[y_g - m_g \mid w, D] = 0$, we have $E[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*) \mid w, D] = 0$, and by the law of iterated expectations $E[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)] = 0$. Thus $F_0 = 0$. We can then write the score function as
\[
S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*) + o_p(1), \tag{2.31}
\]
and
\[
S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*) + \Big[\frac{1}{G}\sum_{g=1}^{G} H_g(\ddot{\theta}, \gamma^*)\Big](\hat{\theta} - \theta_0), \tag{2.32}
\]
where $\ddot{\theta} \in \Theta$ lies between $\hat{\theta}$ and $\theta_0$, and $H_g(\ddot{\theta}, \gamma^*)$ is the $P \times P$ Hessian of $q_g(\theta)$. Hence
\[
\sqrt{G}(\hat{\theta} - \theta_0) = -\Big[\frac{1}{G}\sum_{g=1}^{G} H_g(\ddot{\theta}, \gamma^*)\Big]^{-1} \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*). \tag{2.33}
\]
Thus the first-step estimation does not affect the asymptotic distribution of the second step: $\gamma^*$ acts as a fixed value in the second-step score function.

Proposition 9 Under Assumptions 1-12, $\sqrt{G}(\hat{\theta} - \theta_0) \Rightarrow N(0, A_0^{-1}B_0A_0^{-1})$, where $A_0 = \lim_{G\to\infty} -\frac{1}{G}\sum_{g=1}^{G} E[H_g(\theta_0, \gamma^*)]$ and $B_0 = \lim_{G\to\infty} \operatorname{Var}\big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*) \big)$. See proof in Chapter 4.

2.5 Variance-Covariance Matrix Estimator

2.5.1 Parametric Variance-Covariance Estimator

Assumption 13: $\operatorname{Var}(u_g) = \operatorname{Var}(u_g \mid x_g, D_G) = W_g$, and $\operatorname{Cov}(u_g, u_h) = \operatorname{Cov}(u_g, u_h \mid x_g, x_h, D_G) = C_{gh}$. Under Assumption 13,
\[
\operatorname{Var}\Big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \Big)
= \frac{1}{G}\sum_{g=1}^{G} E\big[ \nabla m_g(\theta_0)' W_g^{-1} \nabla m_g(\theta_0) \big]
+ \frac{1}{G}\sum_{g=1}^{G}\sum_{h\neq g} E\big[ \nabla m_g(\theta_0)' W_g^{-1} C_{gh} W_h^{-1} \nabla m_h(\theta_0) \big]. \tag{2.34}
\]
Under Assumption 13, the asymptotic variance of $\hat{\theta}_G$ can be estimated by
\[
\widehat{\operatorname{Avar}}_1(\hat{\theta}) = \frac{1}{G}\hat{A}^{-1}\hat{B}_1\hat{A}^{-1}
= \Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}
\Big( \sum_{g}\sum_{h} \nabla\hat{m}_g' \hat{W}_g^{-1} \hat{C}_{gh} \hat{W}_h^{-1} \nabla\hat{m}_h \Big)
\Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}, \tag{2.35}
\]
where
\[
\hat{A} = \frac{1}{G}\sum_{g=1}^{G} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g, \qquad
\hat{B}_1 = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G} \nabla\hat{m}_g' \hat{W}_g^{-1} \hat{C}_{gh} \hat{W}_h^{-1} \nabla\hat{m}_h.
\]

2.5.2 Nonparametric Variance-Covariance Estimator

Assumption 13 does not always hold, and most of the time its validity cannot easily be verified. Since $\operatorname{Avar}\sqrt{G}(\hat{\theta}_G - \theta_0) = A_0^{-1}B_0A_0^{-1}$, we can estimate it consistently by $\hat{A}^{-1}\hat{B}_2\hat{A}^{-1}$; the asymptotic standard errors are obtained from the matrix $\widehat{\operatorname{Avar}}_2(\hat{\theta}_G) = \hat{A}^{-1}\hat{B}_2\hat{A}^{-1}/G$. A robust nonparametric variance-covariance estimator is
\[
\widehat{\operatorname{Avar}}_2(\hat{\theta}) = \frac{1}{G}\hat{A}^{-1}\hat{B}_2\hat{A}^{-1}
= \Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}
\Big( \sum_{g}\sum_{h} k(d_{gh})\, \nabla\hat{m}_g' \hat{W}_g^{-1} \hat{u}_g \hat{u}_h' \hat{W}_h^{-1} \nabla\hat{m}_h \Big)
\Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}, \tag{2.36}
\]
where
\[
\hat{B}_2 = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G} k(d_{gh})\, \nabla\hat{m}_g' \hat{W}_g^{-1} \hat{u}_g \hat{u}_h' \hat{W}_h^{-1} \nabla\hat{m}_h.
\]

Proposition 10 Under Assumptions 1-13, $\widehat{\operatorname{Avar}}_2(\hat{\theta}_G)_{robust} \to \frac{1}{G}A_0^{-1}B_0A_0^{-1}$. See proof in Chapter 4.

Although we do not need to adjust the two-step QMLE estimator in this paper, it is worth mentioning that in many cases $F_0 \neq 0$, and the asymptotic variance must then be adjusted:
\[
\sqrt{G}(\hat{\theta} - \theta_0) = A_0^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^{G} \big[-g_g(\theta_0; \gamma^*)\big] + o_p(1),
\]
where $g_g(\theta_0; \gamma^*) \equiv s_g(\theta_0, \gamma^*) + F_0 r_g(\gamma^*)$. Let
\[
D_0 \equiv \lim_{G\to\infty} \frac{1}{G} E\Big[ \sum_{g=1}^{G} g_g(\theta_0; \gamma^*) \sum_{h=1}^{G} g_h(\theta_0; \gamma^*)' \Big];
\]
then the asymptotic distribution of $\hat{\theta}$ can be written as $\sqrt{G}(\hat{\theta} - \theta_0) \Rightarrow_d N(0, A_0^{-1}D_0A_0^{-1})$. A robust estimator after the adjustment is
\[
\widehat{\operatorname{Avar}}_3(\hat{\theta}) = \frac{1}{G}\hat{A}^{-1}\hat{D}\hat{A}^{-1}
= \Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}
\Big( \sum_{g}\sum_{h} k(d_{gh}) \big[ \hat{s}_g\hat{s}_h' + \hat{s}_g\hat{r}_h'\hat{F}' + \hat{F}\hat{r}_g\hat{s}_h' + \hat{F}\hat{r}_g\hat{r}_h'\hat{F}' \big] \Big)
\Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}, \tag{2.37}
\]
where
\[
\hat{D} = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G} k(d_{gh}) \big[ \hat{s}_g\hat{s}_h' + \hat{s}_g\hat{r}_h'\hat{F}' + \hat{F}\hat{r}_g\hat{s}_h' + \hat{F}\hat{r}_g\hat{r}_h'\hat{F}' \big]. \tag{2.38}
\]
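As a computational summary, here is a sketch of the robust estimator (2.36) as a reusable function. The Bartlett kernel and the input format (lists of group-level gradients, working matrices, and residuals, plus a between-group distance matrix) are assumptions of the sketch, not prescriptions:

```python
import numpy as np

def robust_avar(Dm, W, u, dist, bandwidth):
    """Nonparametric sandwich estimator, as in (2.36).

    Dm   : list of L x P group mean gradients, evaluated at theta_hat
    W    : list of L x L working covariance matrices
    u    : list of length-L group residual vectors
    dist : G x G matrix of between-group distances
    """
    G = len(Dm)
    Winv = [np.linalg.inv(Wg) for Wg in W]
    A = sum(Dm[g].T @ Winv[g] @ Dm[g] for g in range(G)) / G
    B = np.zeros_like(A)
    for g in range(G):
        for h in range(G):
            k = max(0.0, 1.0 - dist[g, h] / bandwidth)  # Bartlett kernel (an assumed choice)
            if k > 0.0:
                sg = Dm[g].T @ Winv[g] @ u[g]
                sh = Dm[h].T @ Winv[h] @ u[h]
                B += k * np.outer(sg, sh)
    B /= G
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv / G  # Avar_2(theta_hat)
```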
2.6 Two Examples: Spatial Probit Model and Poisson Regression Model

The setup of nonlinear models with spatial data can be tricky: the spatially correlated term must be incorporated in an appropriate way, and this varies across models. In this section, I use two nonlinear models to demonstrate how to incorporate the spatially correlated term and apply the GEE procedure. The first example is a Probit model, and the second is a count data model.

2.6.1 Example 1. A Probit Model with Spatial Correlation in the Latent Error

The Probit model is a popular binary response model. The dependent variable has a conditional Bernoulli distribution and takes the values zero and one, indicating whether or not a certain event has occurred; for example, $y = 1$ if a firm adopts a new technology, and $y = 0$ otherwise. The value of the latent variable $y^*$ determines the outcome of $y$. Assume the Probit model
\[
y_i = \mathbf{1}[y_i^* \geq 0], \tag{2.39}
\]
\[
y_i^* = x_i\beta + e_i. \tag{2.40}
\]
The latent error $e$ has a standard multivariate normal distribution, but the covariances depend on the pairwise distances $D_N$, which differs from the usual multivariate normal specification:⁷
\[
\operatorname{corr}(e_i, e_j) = f(d_{ij}, \rho), \tag{2.41}
\]
where $f(\cdot)$ is a function increasing in $\rho$ and decreasing in $d_{ij}$. We do not observe $y_i^*$; we observe only $y_i$.

⁷A multivariate normal distribution usually specifies the mean vector and correlation matrix; the correlations do not depend on the pairwise distance between two variables.

Let $\Phi(\cdot)$ be the standard normal cumulative distribution function (CDF), and $\phi(\cdot)$ the standard normal probability density function (PDF). Assume that the mean function $m_i(x_i; \beta) \equiv E(y_i \mid x_i, D_N) = \Phi(x_i\beta)$ is correctly specified. Because of the nonlinearity and the non-observability of the latent variable $y_i^*$, $\operatorname{Cov}(y_i, y_j \mid x_i, x_j, D_N)$ is hard to recover without more information on the multivariate distribution. To proceed with GEE, we need to specify a working matrix, which is possibly misspecified.

The partial QMLE delivers a consistent first-step estimator of the mean parameters, as in Proposition 6. Using the Bernoulli density, the log likelihood for each observation is
\[
l_i(\beta) = y_i \log\Phi(x_i\beta) + (1 - y_i)\log[1 - \Phi(x_i\beta)]. \tag{2.42}
\]
The partial QMLE solves
\[
\check{\beta} = \arg\max_{\beta} L_N(\beta), \qquad
L_N(\beta) = \sum_{i=1}^{N} l_i(\beta) = \sum_{i=1}^{N} \big\{ y_i\log\Phi(x_i\beta) + (1 - y_i)\log[1 - \Phi(x_i\beta)] \big\}. \tag{2.43}
\]
The score of the likelihood for each individual is
\[
s_i(\beta) \equiv \frac{\phi(x_i\beta)\, x_i'\, [y_i - \Phi(x_i\beta)]}{\Phi(x_i\beta)[1 - \Phi(x_i\beta)]}. \tag{2.44}
\]
The expected Hessian⁸ for each observation is
\[
E(H_i \mid x_i, D_N) = -\frac{\phi^2(x_i\beta)\, x_i'x_i}{\Phi(x_i\beta)[1 - \Phi(x_i\beta)]}. \tag{2.45}
\]

⁸The unexpected Hessian can also be derived. Because the score has the special form containing $y_i - \Phi(x_i\beta)$, taking expectations conditional on $x_i$ and $D_N$ yields a cleaner expression.

Let $A_N$ be the sum of the negative expected Hessians, $A_N = \sum_{i=1}^{N} \frac{\phi^2(x_i\beta)\,x_i'x_i}{\Phi(x_i\beta)[1-\Phi(x_i\beta)]}$, and let $A_0 = -E(H_i)$ over the distribution of $x$. Let $\check{u}_i = y_i - \Phi(x_i\check{\beta})$, $i = 1, 2, \ldots, N$, be the residuals. At this stage, a robust variance-covariance estimator for $\check{\beta}$ can be computed as
\[
\widehat{\operatorname{Var}}(\check{\beta})
= \Big( \sum_{i=1}^{N} \frac{\phi^2(x_i\check{\beta})\,x_i'x_i}{\Phi(x_i\check{\beta})[1-\Phi(x_i\check{\beta})]} \Big)^{-1}
\Big( \sum_{i=1}^{N}\sum_{j=1}^{N} k(d_{ij})\, \frac{\phi(x_i\check{\beta})\,\phi(x_j\check{\beta})\, x_i'\check{u}_i\check{u}_j x_j}{\Phi(x_i\check{\beta})[1-\Phi(x_i\check{\beta})]\,\Phi(x_j\check{\beta})[1-\Phi(x_j\check{\beta})]} \Big)
\Big( \sum_{i=1}^{N} \frac{\phi^2(x_i\check{\beta})\,x_i'x_i}{\Phi(x_i\check{\beta})[1-\Phi(x_i\check{\beta})]} \Big)^{-1}, \tag{2.46}
\]
where $k(d_{ij})$ is a kernel weight function that depends on the pairwise distances. This partial QMLE and its robust variance-covariance estimator provide a legitimate way to estimate the spatial Probit model. The next question is how the two-step GEE estimator works.

The second step is to construct the weighting matrix using the first-step estimates and residuals. As the data can be divided into groups, the working matrix serves as the weight for a specific group; if the group size is two, the working matrix is a $2 \times 2$ matrix. We write the working variance-covariance matrix as $\hat{W}_g = \hat{V}_g^{1/2}\hat{R}_g(\hat{\rho}, D_G)\hat{V}_g^{1/2}$. An estimator of the working variance matrix for each group is
\[
\hat{V}_g = \operatorname{diag}(\check{v}_1, \check{v}_2, \ldots, \check{v}_L), \tag{2.47}
\]
where
\[
\check{v}_l = \Phi(x_l\check{\beta})[1 - \Phi(x_l\check{\beta})], \quad l = 1, \ldots, L. \tag{2.48}
\]
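A minimal sketch of the first-step Probit PQMLE (2.42)-(2.43), together with the fitted variances (2.48) and standardized residuals, follows. The data generating process here is an independent-error placeholder, since the first step ignores dependence in any case:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)

# Illustrative Probit data (independent errors; only the DGP changes in the spatial case)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(1.0, 1.0, N)])
beta_true = np.array([-1.0, 1.0])
y = (X @ beta_true + rng.standard_normal(N) >= 0).astype(float)

def neg_loglik(beta):
    # Bernoulli quasi-log likelihood, as in (2.42)-(2.43)
    p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_check = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
p_hat = norm.cdf(X @ beta_check)
u_check = y - p_hat                   # first-step residuals
v_check = p_hat * (1 - p_hat)         # fitted variances, as in (2.48)
r_check = u_check / np.sqrt(v_check)  # standardized residuals
print("beta_check:", beta_check)
```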
Next we find an estimator of the working correlation matrix for $y_i - \Phi(x_i\beta)$. Suppose the structure of the true correlation matrix $R$ is $R_{ij} = C_{ij}(d_{ij}, \lambda)$, where $C_{ij}(d_{ij}, \lambda)$ is a function increasing in $\lambda$ and decreasing in $d_{ij}$. Note that $\lambda$ is the spatial correlation parameter of the dependent variables, while $\rho$ is that of the latent error; these two parameters are generally different. Let $r_i = \check{u}_i/\sqrt{\check{v}_i}$, $i = 1, 2, \ldots, N$, be the standardized residuals; $\hat{C}_{ij}$ equals the sample correlation of $\check{u}_i/\sqrt{\check{v}_i}$ and $\check{u}_j/\sqrt{\check{v}_j}$. Let $\hat{R} \equiv R(D_G, \hat{\lambda})$, and let $\hat{R}_g$ stand for the correlation matrix $R_g(D_g, \hat{\lambda})$ of the $g$th group. The function $C_{ij}(d_{ij}, \lambda)$ is unknown, but we can choose a correlation function to approximate it and use that in estimation; for example, $C(d_{ij}, \lambda) = \lambda/d_{ij}$ or $\exp(-d_{ij}/\lambda)$. Using only the correlations within groups, an estimator of $\lambda$ is
\[
\hat{\lambda} = \arg\min_{\lambda} \sum_{g=1}^{G}\sum_{i=1}^{L}\sum_{j>i}^{L} \big[ \hat{r}_i\hat{r}_j - C_{ij}(d_{ij}, \lambda) \big]^2
\]
for $i, j$ in the same group $g$.

The second-step GEE estimator of $\beta$ is
\[
\hat{\beta} = \arg\min_{\beta} \sum_{g=1}^{G} [y_g - \Phi(X_g\beta)]'\, \hat{W}_g^{-1}\, [y_g - \Phi(X_g\beta)]. \tag{2.49}
\]
If one believes the working correlation matrix is correctly specified, the non-robust variance estimator of $\hat{\beta}$ is
\[
\widehat{\operatorname{Var}}_1(\hat{\beta}) = \Big( \sum_{g=1}^{G} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}, \tag{2.50}
\]
where $\hat{D}_g = \partial\Phi(X_g\hat{\beta})/\partial\beta' = \phi(X_g\hat{\beta})X_g$ (elementwise). $\hat{\beta}$ is consistent even under a misspecified spatial correlation structure. The variance estimator that is robust to misspecification of the spatial correlation is
\[
\widehat{\operatorname{Var}}_R(\hat{\beta}) = \Big( \sum_{g=1}^{G} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}
\Big( \sum_{g=1}^{G}\sum_{h=1}^{G} k(d_{gh})\, \hat{D}_g' \hat{W}_g^{-1} \hat{u}_g \hat{u}_h' \hat{W}_h^{-1} \hat{D}_h \Big)
\Big( \sum_{g=1}^{G} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}, \tag{2.51}
\]
where $k(d_{gh})$ is a kernel function depending on the distances between groups. An alternative approach is to specify the full multivariate normal distribution of the latent error and estimate the spatial correlation parameter of the latent error within an MLE framework; see, for example, Wang, Iglesias and Wooldridge (2012).
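A sketch of the second step (2.49) for the Probit case follows. It takes the first-step estimate and $\hat{\lambda}$ as given (the stand-in values below are assumptions), and it assumes the working correlation $\lambda/d_{ij}$ with unit within-group distances. Note that the weights are built from the first-step estimate and held fixed, which is the two-step design:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)

# Placeholder inputs: G groups of size L, plus stand-ins for the first-step
# estimate beta_check and the estimated correlation parameter lambda_hat
G, L = 100, 4
X = np.column_stack([np.ones(G * L), rng.normal(1.0, 1.0, G * L)])
y = (X @ np.array([-1.0, 1.0]) + rng.standard_normal(G * L) >= 0).astype(float)
beta_check, lam_hat = np.array([-1.0, 1.0]), 0.1
d_within = np.ones((L, L))  # assumed unit distances within a group

def working_W(Xg):
    # W_g = V_g^(1/2) R_g V_g^(1/2) with R_ij = lambda / d_ij off the diagonal
    p = norm.cdf(Xg @ beta_check)
    s = np.sqrt(p * (1 - p))
    R = np.where(np.eye(L, dtype=bool), 1.0, lam_hat / d_within)
    return s[:, None] * R * s[None, :]

def gee_obj(beta):
    # GEE criterion (2.49): sum of weighted quadratic forms over groups
    total = 0.0
    for g in range(G):
        Xg, yg = X[g * L:(g + 1) * L], y[g * L:(g + 1) * L]
        r = yg - norm.cdf(Xg @ beta)
        total += r @ np.linalg.solve(working_W(Xg), r)
    return total

beta_gee = minimize(gee_obj, x0=beta_check, method="Nelder-Mead").x
print("beta_GEE:", beta_gee)
```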
2.6.2 Example 2. A Poisson Model with a Multiplicative Spatial Error

A count variable takes on nonnegative integer values. Many variables that we would like to explain in terms of covariates come as counts, such as the number of times someone is arrested in a given year or the number of patents applied for by a firm in a year. Count data may also have an upper bound: for the number of children in a family who are high school graduates, the upper bound is the number of children in the family (Wooldridge 2010). A count data model is usually characterized by a density in the LEF together with a population mean. Let us use a specific example to demonstrate a count data model with a spatial error term. The conditional Poisson density is
\[
f(y \mid x, D) = \exp(-\mu)\,\mu^y/y!, \tag{2.52}
\]
where $y! = 1\cdot 2\cdots(y-1)\cdot y$, $0! = 1$, and $\mu$ is the conditional mean of $y$. For a given sample, specify the conditional mean in exponential form:
\[
E(y_i \mid x_i, D_N) = \exp(x_i\beta). \tag{2.53}
\]
Assume that the conditional mean function is correctly specified, and model the spatial correlation in the conditional mean through
\[
E(y_i \mid x_i, v_i, D_N) = v_i\exp(x_i\beta_0), \tag{2.54}
\]
where $v_i$ is the multiplicative spatial error term; let $v = (v_1, v_2, \ldots, v_N)$. The count data model can be written in conditional mean form,
\[
y_i = v_i\exp(x_i\beta_0) + \delta_i, \qquad E(\delta_i \mid x_i, v_i, D_N) = 0. \tag{2.55}
\]
A count data model with a multiplicative spatial error can be characterized by the following assumptions: (1) $\{(x_i, v_i, \delta_i)\}$ is a mixing sequence on the sampling space, with mixing coefficient $\alpha$ or $\phi$; (2) $E(y_i \mid x_i, v_i, D_N) = v_i\exp(x_i\beta_0)$; (3) $y_i$ and $y_j$ are independent conditional on $x_i, x_j, v_i, v_j, D_N$, for $i \neq j$; (4) $v_i$ is independent of $x_i$, with $E(v_i) = 1$, $\operatorname{Var}(v_i) = \tau^2$, and $\operatorname{Cov}(v_i, v_j) = \tau^2\cdot c(d_{ij})$, where $c(d_{ij})$ is the spatial correlation depending on the distance between observations $i$ and $j$.

Note that we specify only the conditional mean, not the full distribution. Even if the data do not follow the Poisson distribution, the quasi-MLE approach using the Poisson density gives a consistent estimator; indeed, $y$ need not even be a count. Under the above assumptions we can integrate out $v_i$ using the law of iterated expectations:
\[
E(y_i \mid x_i, D_N) = E[E(y_i \mid x_i, v_i, D_N) \mid x_i, D_N]
= \exp(x_i\beta_0)\,E(v_i \mid x_i, D_N) = \exp(x_i\beta_0)\,E(v_i) = \exp(x_i\beta_0). \tag{2.56}
\]
We can also calculate the variances and covariances of $y$ conditional on $x$:
\[
\operatorname{Var}(y_i \mid x_i, D_N)
= E[\operatorname{Var}(y_i \mid x_i, v_i, D_N) \mid x_i, D_N] + \operatorname{Var}[E(y_i \mid x_i, v_i, D_N) \mid x_i, D_N]
= \exp(x_i\beta_0) + \exp(2x_i\beta_0)\,\tau^2, \tag{2.57}
\]
and, since $\operatorname{Cov}(y_i, y_j \mid x_i, x_j, v_i, v_j, D_N) = 0$ by assumption (3),
\[
\operatorname{Cov}(y_i, y_j \mid x_i, x_j, D_N)
= \operatorname{Cov}\big( v_i\exp(x_i\beta_0),\ v_j\exp(x_j\beta_0) \mid x_i, x_j, D_N \big)
= \exp(x_i\beta_0)\exp(x_j\beta_0)\,\tau^2\,c(d_{ij}). \tag{2.58}
\]
The model can also be expressed as
\[
E(y_i \mid x_i, e_i, D_N) = \exp(x_i\beta_0 + e_i) = \exp(e_i)\exp(x_i\beta_0), \tag{2.59}
\]
so that $v_i = \exp(e_i)$. We can model the spatial correlation between $e_i$ and $e_j$, $i, j = 1, 2, \ldots, N$; let $e = (e_1, e_2, \ldots, e_N)$. In this paper only positive correlations are considered, for convenience; negative correlations⁹ are possible, and the method can be extended to them. For convenience, I assume $e$ follows a multivariate normal distribution $N(\mu, \Omega)$, and I use this in the simulations later; then $v$ follows a multivariate lognormal distribution. One could use other multivariate distributions for $e$, with $v$ following the corresponding transformed distribution, or specify a multivariate distribution for $v$ directly, as long as the first two moment assumptions on $v_i$ are satisfied.

⁹Bloom, Schankerman and Van Reenen's NBER working paper (2012) identifies a negative product-market rivalry effect.

Although $y$ does not follow a multivariate Poisson distribution, the derivation above shows that $\beta_0$ can be consistently estimated by partial Poisson maximum likelihood estimation. The density used for $y_i$ is
\[
f_i(y_i \mid x_i, D_N) = \exp[-\exp(x_i\beta)]\,[\exp(x_i\beta)]^{y_i}/y_i!, \tag{2.60}
\]
where $y_i! = 1\cdot 2\cdots(y_i-1)\cdot y_i$ and $0! = 1$. The log likelihood for each observation is
\[
l_i(\beta) = -\exp(x_i\beta) + y_i x_i\beta - \log(y_i!), \tag{2.61}
\]
the score is
\[
s_i(\beta) \equiv \partial l_i(\beta)/\partial\beta' = x_i'[y_i - \exp(x_i\beta)] = x_i'u_i, \tag{2.62}
\]
and the Hessian is
\[
H_i(\beta) = -\exp(x_i\beta)\,x_i'x_i. \tag{2.63}
\]
Let $A_i \equiv -E(H_i(\beta) \mid x_i, D_N) = \exp(x_i\beta)x_i'x_i$, let $A_0 \equiv E[A_i]$ over the distribution of $x$ and the spatial space $D$, and let $B_0 \equiv E[s_i(\beta_0)s_i(\beta_0)'] = E[u_i^2 x_i'x_i]$. The partial QMLE gives a consistent estimator of the mean parameters, and solves
\[
\check{\beta} = \arg\max_{\beta} L(\beta) = \sum_{i=1}^{N} l_i(\beta) = \sum_{i=1}^{N} y_i x_i\beta - \sum_{i=1}^{N}\exp(x_i\beta) - \sum_{i=1}^{N}\log(y_i!). \tag{2.64}
\]
A robust partial QMLE variance-covariance estimator is
\[
\Big( \sum_{i=1}^{N} \exp(x_i\check{\beta})\,x_i'x_i \Big)^{-1}
\Big( \sum_{i=1}^{N}\sum_{j=1}^{N} k(d_{ij})\, x_i'\hat{u}_i\hat{u}_j x_j \Big)
\Big( \sum_{i=1}^{N} \exp(x_i\check{\beta})\,x_i'x_i \Big)^{-1}, \tag{2.65}
\]
where $k(d_{ij})$ is a kernel function depending on the distance between observations $i$ and $j$.

The partial QMLE makes no use of the pairwise correlations; if we can use them, we may improve efficiency. The GEE approach is
\[
\hat{\beta} = \arg\min_{\beta} \sum_{g=1}^{G} [y_g - \exp(X_g\beta)]'\, \hat{W}_g^{-1}\, [y_g - \exp(X_g\beta)], \tag{2.66}
\]
where $\hat{W}_g$ is an estimated weighting matrix.
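Before constructing the weighting matrix, it is easy to confirm the moment formulas (2.56)-(2.57) by simulation. The sketch below draws a lognormal $v$ with $E(v) = 1$ at a single covariate value; all numerical values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Monte Carlo check of (2.56)-(2.57) at one x value:
# v lognormal with E(v) = 1 and Var(v) = tau2; y | v is Poisson(v * exp(x @ beta))
x = np.array([1.0, 0.5])
beta = np.array([1.0, -1.0])
mu = np.exp(x @ beta)

sigma2 = 1.0
tau2 = np.exp(sigma2) - 1.0  # Var(v) for a lognormal with E(v) = 1
R = 1_000_000
v = np.exp(rng.normal(-sigma2 / 2, np.sqrt(sigma2), R))
y = rng.poisson(v * mu)

print("mean:", y.mean(), "vs theory:", mu)
print("var :", y.var(), "vs theory:", mu + mu**2 * tau2)
```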
Again, as in the Probit model, $\hat{W}_g = \hat{V}_g^{1/2}\hat{R}_g\hat{V}_g^{1/2}$, where $\hat{V}_g$ is a diagonal matrix with the estimated variances on the diagonal and $\hat{R}_g$ is the estimated working correlation matrix. The most efficient weighting matrix is the true covariance matrix of $y - \exp(X\beta)$.

There are two pivotal unknown parameters in the estimation, $\tau^2$ and $\rho$. We can use the partial Poisson QMLE residuals to estimate them. Let $\check{\beta}$ be the partial Poisson regression estimator, and let $\check{u}_i^2 = [y_i - \exp(x_i\check{\beta})]^2$ be the squared residual. By (2.57), $\tau^2$ can be estimated as the coefficient $\hat{\tau}^2$ from regressing $\check{u}_i^2 - \exp(x_i\check{\beta})$ on $\exp(2x_i\check{\beta})$. The estimation of $\rho$ is a little more complicated. First, we do not know how $\sigma_{ij}$ depends on $\rho$ and $d_{ij}$; if we use the wrong structure, we will probably obtain a wrong estimator of $\rho$. Suppose the correlation structure is $c(d_{ij}) = \rho/d_{ij}$; then an estimator of $\rho$ is the coefficient from regressing $\check{u}_i\check{u}_j/[\exp(x_i\check{\beta})\exp(x_j\check{\beta})\,\hat{\tau}^2]$ on $1/d_{ij}$. However, this estimator sometimes suffers from negative products of the Poisson regression residuals, and the estimated parameter $\hat{\rho}$ is biased downward. $\hat{W}_g$ is obtained by plugging $\hat{\tau}^2$ and $\hat{\rho}$ back into the variance-covariance matrix. Minimizing (2.66) then yields the GEE estimator.

If Assumption 13 holds, a non-robust variance estimator for GEE is
\[
\widehat{\operatorname{Var}}(\hat{\beta}) = \Big( \sum_{g=1}^{G} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}, \tag{2.67}
\]
where $\hat{D}_g = \partial\exp(X_g\beta)/\partial\beta' = \exp(X_g\hat{\beta})X_g$ (elementwise). $\hat{\beta}$ remains consistent even under a misspecified spatial correlation structure, i.e., when Assumption 13 does not hold. The variance estimator that is robust to misspecification of the spatial correlation is
\[
\widehat{\operatorname{Var}}_R(\hat{\beta}) = \Big( \sum_{g} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}
\Big( \sum_{g}\sum_{h} k(d_{gh})\, \hat{D}_g' \hat{W}_g^{-1} \hat{u}_g \hat{u}_h' \hat{W}_h^{-1} \hat{D}_h \Big)
\Big( \sum_{g} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}, \tag{2.68}
\]
where $k(d_{gh})$ is a kernel function depending on the distances between groups; the distance can be taken as the smallest distance between two observations belonging to different groups.

2.7 Monte Carlo Simulations

In this section, I use Monte Carlo simulations to investigate the efficiency gain of the GEE approach over the pooled QMLE. The simulation mechanism is as follows; a code sketch of these steps appears after the list.

(1) Each individual resides on an intersection of a square lattice, so the pairwise distances can be calculated from the coordinates of the observations.

(2) Generate the pairwise correlations from the pairwise distances. The simulations cover two cases of pairwise correlation: in one, each observation is correlated with all other observations; in the other, only observations in the same group are correlated with each other, i.e., groupwise independence is assumed. The groups are illustrated in the graph in the next subsection.

(3) After generating the correlated data, the first step is to compute the pooled QMLE estimator. The second step divides the observations into groups and uses only within-group information to estimate the correlation parameters. In the case of groupwise dependence, although every pair of observations is correlated, we do not use the pairwise correlations between observations in different groups; under groupwise independence, this restriction is natural. After estimating the spatial correlation parameters, the mean parameters are re-estimated by the GEE procedure.
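The following sketch implements the steps above in miniature for the count model, including the auxiliary estimators of $\tau^2$ and $\rho$ described in Section 2.6.2. Two simplifications are assumed: the true $\beta$ is used in place of the first-step estimate, and nonpositive log arguments are simply dropped (the text notes this estimator can be biased). The covariance $\rho/d_{ij}$ is taken to be positive definite for these parameter values, as in the chapter's design:

```python
import numpy as np

rng = np.random.default_rng(5)

# Step (1): 20 x 20 lattice locations and pairwise Euclidean distances
coords = np.array([(r, s) for r in range(1, 21) for s in range(1, 21)], dtype=float)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
n = len(coords)

# Step (2): correlated errors with Cov(e_i, e_j) = rho / d_ij
rho, sigma2 = 0.3, 1.0
Omega = np.where(d > 0, rho / d, sigma2)
e = rng.multivariate_normal(np.full(n, -sigma2 / 2), Omega, check_valid="ignore")
v = np.exp(e)  # lognormal, E(v) = 1

X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, n)])
beta = np.array([1.0, -1.0])
y = rng.poisson(v * np.exp(X @ beta))

# Step (3), abbreviated: use the true beta in place of the first-step estimate
m = np.exp(X @ beta)
u = y - m

# tau^2: regression of (u^2 - m) on m^2 through the origin
tau2_hat = np.sum((u**2 - m) * m**2) / np.sum(m**4)

# rho: average of log(u_i u_j / (m_i m_j) + 1) over distance-one pairs
i, j = np.where(np.triu(d == 1.0))
prod = u[i] * u[j] / (m[i] * m[j]) + 1.0
rho_hat = np.mean(np.log(prod[prod > 0]))
print("tau2_hat:", tau2_hat, "rho_hat:", rho_hat)
```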
2.7.1 Sampling Space

Graph 1 illustrates the case in which the sample size is 400: I create a $20 \times 20$ square lattice, and each observation resides on an intersection of the lattice. The locations of the data are $\{(r, s): r, s = 1, 2, \ldots, 20\}$. The distance $d_{ij}$ between locations $i$ and $j$ is Euclidean distance: if $A = (a_i, a_j)$ and $B = (b_i, b_j)$ are two points on the lattice, their distance is $d_{ij} = \sqrt{(a_i - b_i)^2 + (a_j - b_j)^2}$. The spatial correlation is then based on a given parameter $\rho$ and $d_{ij}$; the assumed correlation of the spatially correlated error term is $\rho/d_{ij}$, $i, j = 1, 2, \ldots, N$. For the Probit model this is the correlation of the latent error, and for the count data model it is the correlation of the multiplicative error. The data are divided into groups of 4 or 16. One could use more observations per group, say 25 or 100, but this increases the computational burden, and 25 or 100 is too large relative to the sample size of 400. In Graph 1, the upper left corner of the lattice illustrates a group size of four; the lower left corner illustrates a group size of 16; and the lower right part illustrates a group size of 100. The idea of the two-step method is to use a small number of pairwise correlations and conveniently obtain more efficient estimators; thus I use 4 or 16 observations per group. Note that, for convenience, I make the pairwise distances the same across groups.

2.7.2 Spatial Probit Data

For the Probit model, we cannot easily find the variances and covariances of the dependent variables conditional on the covariates. The correlated latent errors result in correlated binary response variables; however, the correlations in the latent error usually do not reflect the exact correlations in the binary dependent variables. The correlations are much smaller in the binary response variables because of the nonlinear transformation. In the following simulation, the sample size is 400 and the number of replications is 500. Consider the following data generating process:

1. $x_i = [1, x_{i1}]$, $x_{i1} \sim N(1, 1)$; $\beta = [-1, 1]'$.
2. $y_i^* = x_i\beta + e_i$, $e \sim \mathrm{MVN}(0, \Omega)$, $\operatorname{Var}(e_i) = 1$, $\operatorname{Cov}(e_i, e_j) = \rho/d_{ij}$; $\rho = 0, 0.1, 0.2, 0.3, 0.4$.
3. $y_i = 1$ if $y_i^* \geq 0$; $y_i = 0$ if $y_i^* < 0$.
4. Use $\operatorname{Corr}(y_i, y_j \mid x_i, x_j, D_N) = \lambda/d_{ij}$ as the working correlation of the dependent variables. This form is arbitrary; based on the available information it is not likely to be true, and $\lambda$ is unknown.

The simulation results are in Tables 2.1-2.2. $\hat{\lambda}$ is the estimator of $\lambda$, computed by the minimum distance criterion
\[
\min_{\lambda} \sum_{g=1}^{G}\sum_{i=1}^{L}\sum_{j\neq i} \left[ \frac{\check{u}_i\check{u}_j}{\sqrt{\Phi(x_i\check{\beta})[1-\Phi(x_i\check{\beta})]}\ \sqrt{\Phi(x_j\check{\beta})[1-\Phi(x_j\check{\beta})]}} - \frac{\lambda}{d_{ij}} \right]^2
\]
for $i, j$ in the same group $g$, where $L$ is the number of individuals in a group. Because $\lambda$ is unknown, we cannot calculate the bias of $\hat{\lambda}$. $\check{\beta}$ is the first-step partial QMLE estimator of $\beta$. $\hat{\beta}_1$ is the two-step weighted nonlinear least squares (WNLS) estimator that uses only the variances as weights in the second step. $\hat{\beta}_4$ is the two-step GEE estimator that uses a $4 \times 4$ variance-covariance matrix as the weight for each group, and $\hat{\beta}_{16}$ is the two-step GEE estimator that uses a $16 \times 16$ variance-covariance matrix for each group. Note that when $L = 4$, $\hat{\beta}_{16}$ is computed using the $\hat{\lambda}$ obtained with group size 4, and similarly for $\hat{\beta}_4$ when $L = 16$. The following two tables show two cases of the simulation: (1) N=400, L=4; (2) N=400, L=16.
Table 2.1: Binary Data, N=400, L=4 (T=500, β=1)

            ρ=0              ρ=0.1            ρ=0.2            ρ=0.3            ρ=0.4
            est.     s.d.    est.     s.d.    est.     s.d.    est.     s.d.    est.     s.d.
  λ̂        0.0032   0.0419  0.0281   0.0501  0.0616   0.0575  0.0903   0.0558  0.1281   0.0745
  β̌        1.0162   0.1111  1.0139   0.1166  1.0241   0.1103  1.0242   0.1138  1.0444   0.1128
  β̂1       1.0162   0.1011  1.0139   0.1166  1.0241   0.1103  1.0233   0.1147  1.0434   0.1148
  β̂4       1.0155   0.1144  1.0145   0.1165  1.0243   0.1111  1.0253   0.1142  1.0484   0.1185
  β̂16      1.0169   0.1118  1.1053   0.1172  1.0261   0.1136  1.0251   0.1157  1.0515   0.1162

Table 2.2: Binary Data, N=400, L=16 (T=500, β=1)

            ρ=0              ρ=0.1            ρ=0.2            ρ=0.3            ρ=0.4
            est.     s.d.    est.     s.d.    est.     s.d.    est.     s.d.    est.     s.d.
  λ̂       −0.0033   0.0267  0.0238   0.0338  0.0504   0.0407  0.0822   0.0539  0.1100   0.0601
  β̌        1.0042   0.1082  1.0165   0.1174  1.0295   0.1118  1.0336   0.1121  1.0483   0.1277
  β̂1       1.0043   0.1081  1.0165   0.1174  1.0280   0.1154  1.0336   0.1121  1.0483   0.1277
  β̂4       1.0037   0.1103  1.0164   0.1170  1.0300   0.1120  1.0346   0.1135  1.0494   0.1283
  β̂16      1.0050   0.1081  1.0166   0.1171  1.0303   0.1125  1.0366   0.1135  1.0516   0.1298

From the above results, $\hat{\lambda}$ is much smaller than $\rho$, which implies that a high spatial correlation in the latent error term may cause only a tiny correlation in the binary response variables. The PQMLE $\check{\beta}$ delivers almost the same efficiency as the other three two-step estimators. Since the correlations between the binary dependent variables are low, a two-step estimator may not be necessary. One could use other data generating processes that produce highly correlated binary dependent variables to examine the two-step method.

2.7.3 Spatial Count Data

In the count data case, the variances and covariances of the count dependent variable can be written in closed form. The spatial correlation in the underlying spatial error term is then transformed into spatial correlation in the response variable: knowing the correlations in the spatial error term, we know the correlations in the count dependent variable. This avoids the situation in which we cannot compare the estimated spatial correlation parameter with the true parameter. Recall the conditional mean assumption for the count data, $E(y_i \mid x_i, v_i, D_N) = \exp(x_i\beta + e_i) = v_i\exp(x_i\beta_0)$. Consider the following spatial count data generating process:

1. $e_i$, $i = 1, 2, \ldots, N$, follows a multivariate normal distribution with $E(e_i) = -\frac{1}{2}$ and $\operatorname{Var}(e_i) = \sigma^2 = 1$; $\operatorname{Cov}(e_i, e_j) = \rho/d_{ij}$, with $\rho = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6$. Therefore $v_i$ follows a multivariate lognormal distribution with $E(v_i) = 1$, $\operatorname{Var}(v_i) = e - 1$ and $\operatorname{Cov}(v_i, v_j) = \exp(\rho/d_{ij}) - 1$.
2. $x_i = [1, x_{i1}]$, $x_{i1} \sim \mathrm{Uniform}(0, 1)$.
3. $\beta = [1, -1]'$.
4. $m_i = \exp(x_i\beta + e_i)$.
The correlations are calculated separately for pairs of yi s that are 1, 2, 5 apart from each other. As pairwise distances increase, the correlations decrease. There is almost no correlation between two observations when distance is 5. Moreover, the spatial parameter ρ can be as high as 0.6, and the sample correlation of y can be as high as more than 0.3. 72 Table 2.3: Correlations in Simulated Count Data T=2000 dij 1 Nij 760 ρ ρy,d=1 0 -0.001 0.1 0.043 0.2 0.089 0.3 0.139 0.4 0.194 0.5 0.250 0.312 0.6 N=400 N=1600 2 5 1 2 5 720 1688 3170 3040 8028 ρy,d=2 ρy,d=5 ρy,d=1 ρy,d=2 ρy,d=5 -0.001 −0.000 0.000 0.000 -0.000 0.022 0.009 0.043 0.021 0.008 0.043 0.017 0.089 0.043 0.015 0.068 0.026 0.139 0.064 0.023 0.091 0.036 0.194 0.088 0.032 0.115 0.046 0.253 0.112 0.040 0.139 0.056 0.315 0.136 0.048 As long as we can find consistent estimators for τ 2 and ρ, we can do the second step estimation. As in 2.57 and 2.58, the conditional variances and covariances of count dependent variable can be written in a closed form according to which the spatial parameters can be estimated. Using the information above, τ 2 can be estimated by τ 2 as the coefficient by ˆ ˇ ˇ ˆ regressing u2 − exp xi β on exp 2xi β . Obviously τ 2 does not depend on distances. For ˇi 1 simplicity ρ is estimated by N 0 N i=1 N j=i log[ ui uj ˇ ˇ ˇ ˇ exp(xi β ) exp xj β + 1] for those pairs whose pairwise distance is one, where N0 is the number of pairs whose distance is one. Table 2.4 shows the simulation results of the case in which: N=1600, L=4 and L=16. ˇ ˆ ˆ ˆ Replication is 2000. β is the one-step QMLE estimator; β1 , β4 and β16 are three two-step ˆ estimators which uses different variance and covariance matrices of the count variable . β1 ˆ only uses the estimated variances as weights; β4 uses the covariance matrix with group size ˆ equal to four, and β16 uses the covariance matrix with group size equal to sixteen. We ˆ ˆ ˆ can see that all the three two-step estimators β1 , β4 and β16 are more efficient than the ˇ ˆ ˆ one-step estimator β. The two GEE estimators β4 and β16 has more improvement when ρ grows larger. When sample size increases, efficiency gains. ρ and τ 2 are both slightly biased ˆ ˆ downwards. 73 Table 2.4: Count Data N=1600, L=4 and 16 σ 2 = 1, N=1600 ρ 0 0.1 0.2 0.3 0.4 0.5 ρ ˆ estimate −0.002 bias −0.002 s.d. (0.043) estimate 0.095 bias −0.005 s.d. (0.048) estimate 0.191 bias -0.008 s.d. (0.055) estimate 0.287 bias −0.013 s.d. 0.063 estimate 0.384 bias −0.016 s.d. (0.070) estimate 0.480 bias −0.020 s.d. (0.076) β = −1 L=4 σ2 ˆ 0.984 0.016 (0.156) 0.980 −0.020 (0.158) 0.970 -0.030 (0.160) 0.969 -0.031 (0.161) 0.967 -0.033 (0.172) 0.974 −0.026 (0.196) ˇ β -1.000 0.000 (0.139) -1.000 0.000 (0.137) −1.001 -0.001 (0.131) -0.998 0.002 (0.138) −1.000 0.000 (0.136) -0.998 0.002 (0.134) ˆ ˆ ˆ β1 β4 β16 -0.998 -0.998 -0.998 0.002 0.002 0.002 (0.135) (0.135) (0.135) -0.998 -0.998 -0.998 0.002 0.002 0.002 (0.134) (0.134) (0.134) -0.998 -0.998 -0.999 0.002 0.002 0.001 (0.128) (0.126) (0.126) -0.996 -0.995 -0.997 0.004 0.005 0.003 (0.135) (0.132) (0.131) -0.998 -0.998 -0.999 0.002 0.002 0.001 (0.134) (0.128) (0.126) -0.997 -0.996 -0.998 0.003 0.004 0.002 (0.130) (0.120) (0.118) 74 L=16 2.8 Conclusions The spatial correlation in nonlinear models makes it more difficult to get efficient estimators. For a Probit model, because of the nonlinearity, it is hard to find the real form of spatial correlations between two dependent variables. Thus the estimated variance covariance matrix is likely to be misspecified. 
In the spatial count data model, since we can actually write down the variances and covariances of the dependent variables under certain assumptions, we can use multivariate nonlinear weighted least squares (MNWLS) to improve efficiency; accounting for spatial correlation improves efficiency in the count data example. Further study will focus on how to obtain a better estimator of the spatial correlation parameter, how to obtain a good approximation of the correlation structure, and how to incorporate spatial correlation into other nonlinear models, such as multi-category choice, fractional response, and two-part models.

Chapter 3

Conditions for the Numerical Equality of the OLS, GLS and Amemiya-Cragg Estimators

3.1 Introduction

Conditions under which the ordinary least squares (OLS) and generalized least squares (GLS) estimators are equal are well known. This paper extends these results in two ways. First, we give conditions under which GLS based on one assumed error variance matrix equals GLS based on a different assumed variance matrix. Second, we give conditions under which GLS equals the GMM estimator of Amemiya (1983) and Cragg (1983).

3.2 Equivalence of OLS and GLS

Consider the linear regression model
\[
y = X\beta + \varepsilon, \tag{3.1}
\]
where $y$ is $T \times 1$ and $X$ is $T \times K$. Let $\Sigma$ be a $T \times T$, positive definite, assumed or estimated variance matrix of $\varepsilon$. We consider the ordinary least squares (OLS) estimator $\hat{\beta}$ and the generalized least squares (GLS) estimator $\tilde{\beta}$ defined as follows:
\[
\hat{\beta} = (X'X)^{-1}X'y, \qquad \tilde{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y. \tag{3.2}
\]
The question is under what conditions on $X$ and $\Sigma$ it is the case that $\hat{\beta} = \tilde{\beta}$. Since these are conditions for numerical equality of the estimators, they do not depend on assumptions about $\varepsilon$; in particular, they do not depend on whether $\Sigma$ does or does not actually equal the variance matrix of $\varepsilon$. We assume only that $\Sigma$, $X'X$ and $X'\Sigma^{-1}X$ are positive definite.

This is an old and classic problem. The basic equivalence results were first given by Zyskind (1962, 1967), Rao (1967) and Kruskal (1968), and were summarized in textbook fashion by Amemiya (1985). Since then the number of equivalent conditions for equality of the two estimators has grown: the survey by Puntanen and Styan (1989) lists 20 such conditions. See also Baltagi (1989) and McAleer (1992) for more discussion and applications of these results.

A related but different question is under what conditions on $\Sigma$ we have OLS = GLS for all $X$. McElroy (1967) showed that, if $X$ contains an intercept, a necessary and sufficient condition is that $\Sigma$ have the "equicorrelated" form (all diagonal elements equal, and all off-diagonal elements equal). Balestra (1970) extended this result to the case in which a subset of the regressors must have a certain form and then OLS = GLS for all possible values of the remaining regressors. We do not seek to extend these results in this paper.

Because of the numerous different equivalent sets of conditions for the equality of OLS and GLS, we focus on the conditions given in Theorem 6.1.1 of Amemiya (1985, p. 182). The following theorem is a minor extension of that result.

THEOREM 1. Suppose that $\Sigma$, $X'X$ and $X'\Sigma^{-1}X$ are positive definite. Then the following statements are equivalent.
(A) $(X'X)^{-1}X'\Sigma X(X'X)^{-1} = (X'\Sigma^{-1}X)^{-1}$

(B) $\Sigma X = XB$ for some nonsingular $B$

(C) $(X'X)^{-1}X' = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}$

(D) $X = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Sigma$

(E) $Z'\Sigma X = 0$ for any $Z$ such that $Z'X = 0$

(F) $\Sigma = X\Gamma X' + Q\Theta Q' + c^2 I$ for some $\Gamma$, some $\Theta$, some $c$, and some $Q$ such that $Q'X = 0$

(F') $\Sigma = X\Psi X' + R\Phi R'$ for some $\Psi$, some $\Phi$, and some $R$ such that $R'X = 0$

Conditions (A)-(F) are, apart from a few changes in notation, as given by Amemiya (p. 182); condition (F') is new. Amemiya shows the equivalence of (A)-(E) and refers to Rao (though not to the correct paper by Rao) for a proof of the equivalence of (F). It is trivial that (F) implies (E), but the proof in Rao (1967, p. 364) that (E) implies (F) is not transparent (or even complete; it just says that "it is easy to verify . . ."). In Chapter 4 we show that (F') is equivalent to (F) and provide a straightforward proof that (D) implies (F').

3.3 Equivalence of Two Different GLS Estimators

We now consider two different GLS estimators:
\[
\tilde{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y, \qquad \ddot{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y. \tag{3.3}
\]
We will provide conditions (on $X$, $\Sigma$ and $\Omega$) such that these two estimators are equal. (Obviously the results of the previous section are the special case corresponding to $\Omega = I$.) The context we have in mind is that $\Sigma$ is the correct variance matrix of $\varepsilon$ and $\Omega$ is an incorrect or approximate variance matrix; in this case $\ddot{\beta}$ could be considered a pseudo-GLS estimate. However, our conditions are just conditions for equality of the two estimators, and so which (if either) is based on the correct variance matrix of $\varepsilon$ is irrelevant. Note also that any of the conditions given below must still hold if we reverse the roles of $\Sigma$ and $\Omega$, which is why there are two versions of results (A), (B), etc.

The phrase "pseudo-GLS" has been used in the literature with a different meaning, namely, GLS using the original error variance matrix but after some of the regressors have been partialled out; see, e.g., Fiebig, Bartels and Krämer (1996) or Gross and Puntanen (2000). This section does not apply to that topic, because we assume that both $\Sigma$ and $\Omega$ are nonsingular. Mathew (1983) gives conditions for the two different GLS estimators to be equal for all $X$ in a certain class; this is the same sort of question that was addressed for equality of OLS and GLS by McElroy (1967) and Balestra (1970). Our theorem below is different because it applies for a given $X$.

THEOREM 2. Suppose that $\Sigma$, $\Omega$, $X'X$, $X'\Sigma^{-1}X$ and $X'\Omega^{-1}X$ are positive definite. Then the following statements are equivalent.

(A1) $(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\Omega\Sigma^{-1}X(X'\Sigma^{-1}X)^{-1} = (X'\Omega^{-1}X)^{-1}$

(A2) $(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\Sigma\Omega^{-1}X(X'\Omega^{-1}X)^{-1} = (X'\Sigma^{-1}X)^{-1}$

(B1) $\Omega\Sigma^{-1}X = XB$ for some nonsingular $B$

(B2) $\Sigma\Omega^{-1}X = XB$ for some nonsingular $B$

(C) $(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}$

(D1) $\Sigma^{-1/2}X = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Sigma^{-1/2}\Omega\Sigma^{-1/2}$

(D2) $\Omega^{-1/2}X = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Omega^{-1/2}\Sigma\Omega^{-1/2}$

(E1) $Z'\Omega\Sigma^{-1}X = 0$ for any $Z$ such that $Z'X = 0$

(E2) $Z'\Sigma\Omega^{-1}X = 0$ for any $Z$ such that $Z'X = 0$

(F1) $\Omega = X\Gamma X' + \Sigma Q\Theta Q'\Sigma + c^2\Sigma$ for some $\Gamma$, some $\Theta$, some $c$, and some $Q$ such that $Q'X = 0$

(F2) $\Sigma = X\Gamma X' + \Omega Q\Theta Q'\Omega + c^2\Omega$ for some $\Gamma$, some $\Theta$, some $c$, and some $Q$ such that $Q'X = 0$

(F1') $\Omega = X\Psi X' + \Sigma R\Phi R'\Sigma$ for some $\Psi$, some $\Phi$, and some $R$ such that $R'X = 0$

(F2') $\Sigma = X\Psi X' + \Omega R\Phi R'\Omega$ for some $\Psi$, some $\Phi$, and some $R$ such that $R'X = 0$

The proof of these results is given in Chapter 4. Our proof is essentially an exercise in translation of Amemiya's conditions. An alternative that we do not pursue would be to use the results of Gourieroux and Monfort (1980).
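These algebraic conditions are easy to verify numerically. The sketch below (arbitrary random matrices; all values illustrative) checks condition (F) of Theorem 1 with $\Theta = 0$, and condition (F2) of Theorem 2 with $\Theta = 0$, $c = 1$, which is the random-coefficients case of Example 1 below:

```python
import numpy as np

rng = np.random.default_rng(7)

T, K = 50, 3
X = rng.standard_normal((T, K))
y = rng.standard_normal(T)
M = rng.standard_normal((K, K))
Gamma = M @ M.T  # any positive semidefinite Gamma

def gls(S):
    Si = np.linalg.inv(S)
    return np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

# Theorem 1 (F): Sigma = X Gamma X' + c^2 I  =>  GLS = OLS
Sigma1 = X @ Gamma @ X.T + 4.0 * np.eye(T)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(gls(Sigma1), beta_ols))  # True

# Theorem 2 (F2): Sigma = X Gamma X' + Omega  =>  GLS(Sigma) = GLS(Omega)
A = rng.standard_normal((T, T))
Omega = A @ A.T + np.eye(T)  # arbitrary positive definite Omega
Sigma2 = X @ Gamma @ X.T + Omega
print(np.allclose(gls(Sigma2), gls(Omega)))  # True
```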
The results in Theorem 2 have some applications that are similar to existing applications of Theorem 1 in the literature.

Example 1. Random coefficients. Consider the random coefficient model
\[
y = X\beta^* + u, \quad \beta^* = \beta + v, \qquad \text{so } y = X\beta + \varepsilon \text{ where } \varepsilon = u + Xv. \tag{3.4}
\]
Suppose that $u$ and $v$ are uncorrelated and $V(v) = \Gamma$. If $V(u) = \sigma^2 I$, then $V(\varepsilon) \equiv \Sigma = X\Gamma X' + \sigma^2 I$. Therefore condition (F) of Theorem 1 applies and GLS = OLS, as has been pointed out by Rao (1967), Amemiya (1985) and Baltagi (1989), among others. Now suppose instead that $V(u) = \Omega$. Then $\Sigma = X\Gamma X' + \Omega$, so condition (F2) of Theorem 2 holds, and GLS based on $\Sigma$ equals GLS based on $\Omega$, which, so far as we know, is a previously unknown result.

Example 2. SUR with the same regressors in every equation. Consider a set of $G$ seemingly unrelated regression equations, with $T$ observations per equation and a common regressor matrix $X$ of dimension $T \times K$. Using standard notation, write the stacked system of equations as
\[
y_* = X_*\beta_* + \varepsilon_*, \tag{3.5}
\]
where $y_*$ is $GT \times 1$, $X_* = (I_G \otimes X)$ is $GT \times KG$, etc. Suppose that $V(\varepsilon_*) = \Sigma \otimes I_T \equiv \Sigma_*$. It is well known that GLS using $\Sigma_*$ is the same as OLS, and Baltagi (1989) proves this using condition (B) of Theorem 1: $\Sigma_* X_* = X_* B$ where $B = \Sigma \otimes I_K$. Now suppose that $\Omega_* = \Omega \otimes I_T$ is another possible variance matrix. Then GLS based on $\Sigma_*$ is the same as GLS based on $\Omega_*$. This is probably obvious, since both must equal OLS, but it can also be proved using Theorem 2. Define $\Psi = \Omega\Sigma^{-1}$ and $\Psi_* = \Psi \otimes I_T = \Omega_*\Sigma_*^{-1}$. Then
\[
\Omega_*\Sigma_*^{-1}X_* = \Psi_* X_* = \Psi \otimes X = X_* B, \tag{3.6}
\]
where $B = \Psi \otimes I_K$, so condition (B1) of Theorem 2 holds and the two GLS estimators are the same.

3.4 Equivalence of GLS and the Amemiya-Cragg Estimator

We now consider the two estimators
\[
\tilde{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y \quad\text{and}\quad
\check{\beta} = [X'H(H'\Sigma H)^{-1}H'X]^{-1}X'H(H'\Sigma H)^{-1}H'y. \tag{3.7}
\]
The first of these is the GLS estimator, while the second is the estimator of Amemiya (1983) and Cragg (1983). We will provide conditions (on $X$, $H$ and $\Sigma$) such that these two estimators are equal. If $\Sigma$ is the correct error variance matrix, then the GLS estimator is efficient relative to the Amemiya-Cragg estimator, and the Amemiya-Cragg estimator is efficient relative to OLS if $X$ is contained in $H$. However, again, our results are purely algebraic results for the numerical equivalence of the two estimators.

THEOREM 3. Suppose that $\Sigma$, $X'\Sigma^{-1}X$ and $H'\Sigma H$ are positive definite and that $H'X$ has full column rank. Then the following statements are equivalent.

(A) $X'H(H'\Sigma H)^{-1}H'X = X'\Sigma^{-1}X$

(B) $\Sigma^{-1}X = HB$ for some $B$ [or, $X = \Sigma HB$ for some $B$]

(C) $[X'H(H'\Sigma H)^{-1}H'X]^{-1}X'H(H'\Sigma H)^{-1}H' = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}$

(D) $X = SA$ for some $A$, where the columns of $S$ are the eigenvectors corresponding to the nonzero eigenvalues of $\Sigma H(H'\Sigma H)^{-1}H'$

(E) $Z'\Sigma^{-1}X = 0$ for any $Z$ such that $Z'H = 0$ [or, $Z'X = 0$ for any $Z$ such that $Z'\Sigma H = 0$]

(F') $\Sigma = X\Gamma X' + Q\Theta Q'$ for some nonsingular $\Gamma$, some $\Theta$, some $Q$ and some $B$ (of dimension $L \times K$, where $H$ is $T \times L$) such that $X'HB$ is nonsingular and $Q'HB = 0$

The proofs of these results are given in Chapter 4.

Chapter 4

Proofs of Propositions and Theorems

4.1 Proofs of Propositions in Chapter 1

This section provides proofs of the propositions in Chapter 1.

Proof of Proposition 1

(1) Consistency.
\[
\hat{\beta}_{OLS} = (X'X)^{-1}X'Y
= \beta + \Big( N^{-1}\sum_{i=1}^{N} x_i'x_i \Big)^{-1}\Big( N^{-1}\sum_{i=1}^{N} x_i'u_i \Big)
= \beta + \Big[ \operatorname{plim}\Big( N^{-1}\sum_{i=1}^{N} x_i'x_i \Big)^{-1} + o_p(1) \Big]\Big[ \operatorname{plim}\Big( N^{-1}\sum_{i=1}^{N} x_i'u_i \Big) + o_p(1) \Big].
\]
Under Assumption 2, since $\{(x_i, u_i)\}$ is a mixing sequence, $x_i'u_i$ and $x_i'x_i$ are also mixing.
Chapter 4

Proofs of Propositions and Theorems

4.1 Proofs of Propositions in Chapter 1

This section provides proofs of the propositions in Chapter 1.

Proof of Proposition 1

(1) Consistency.

\[
\hat{\beta}_{OLS} = (X'X)^{-1}X'Y = \beta + \Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big)^{-1}\Big(N^{-1}\sum_{i=1}^N x_i'u_i\Big) = \beta + \Big[\operatorname{plim}\Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big)^{-1} + o_p(1)\Big]\Big[\operatorname{plim}\Big(N^{-1}\sum_{i=1}^N x_i'u_i\Big) + o_p(1)\Big].
\]

Under Assumption 2, since $\{(x_i, u_i)\}$ is a mixing sequence, $x_i'u_i$ and $x_i'x_i$ are also mixing. By the law of large numbers for mixing sequences (Arbia 2006, page 70),

\[
N^{-1}\sum_{i=1}^N x_i'u_i \xrightarrow{p} E\Big(N^{-1}\sum_{i=1}^N x_i'u_i\Big), \qquad N^{-1}\sum_{i=1}^N x_i'x_i \xrightarrow{p} E\Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big).
\]

Under Assumptions 3 and 4,

\[
E\Big(N^{-1}\sum_{i=1}^N x_i'u_i\Big) = N^{-1}\sum_{i=1}^N E(x_i'u_i) = 0, \qquad E\Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big) \to A_1,
\]

and therefore $\operatorname{plim}\big(N^{-1}\sum_{i} x_i'x_i\big)^{-1} = A_1^{-1}$. Thus $\hat{\beta}_{OLS} \xrightarrow{p} \beta + A_1^{-1}\cdot 0 = \beta$.

(2) Asymptotic normality.

\[
\sqrt{N}\big(\hat{\beta}_{OLS} - \beta\big) = \Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big)^{-1}\Big(N^{-1/2}\sum_{i=1}^N x_i'u_i\Big).
\]

Under Assumption 3, $E\big(N^{-1}\sum_i x_i'u_i\big) = 0$. Under Assumption 5, $\operatorname{Var}\big(N^{-1/2}\sum_i x_i'u_i\big) = B_N \to B$. Let

\[
Z_N \equiv \Big[\operatorname{Var}\Big(N^{-1/2}\sum_{i=1}^N x_i'u_i\Big)\Big]^{-1/2}\Big(N^{-1/2}\sum_{i=1}^N x_i'u_i\Big).
\]

Since $x_i'u_i$ is mixing, by the central limit theorem of Wooldridge and White (1988) for mixing sequences, $Z_N \xrightarrow{d} N(0, I)$. Thus $N^{-1/2}\sum_{i=1}^N x_i'u_i \xrightarrow{d} N(0, B)$, and therefore

\[
\sqrt{N}\big(\hat{\beta}_{OLS} - \beta\big) \xrightarrow{d} N\big(0, A_1^{-1}BA_1^{-1}\big).
\]

Proof of Proposition 2

(1) Consistency. Since $\hat{\lambda} \xrightarrow{p} \lambda$, the estimator $\hat{\Omega}_N \equiv \Omega_N(\hat{\lambda})$ is consistent, and thus $N^{-1}X'\hat{\Omega}_N^{-1}X - N^{-1}X'\Omega_N^{-1}X \xrightarrow{p} 0$ and $N^{-1}X'\hat{\Omega}_N^{-1}u - N^{-1}X'\Omega_N^{-1}u \xrightarrow{p} 0$. Then

\[
\hat{\beta}_{FGLS} = (X'\hat{\Omega}_N^{-1}X)^{-1}X'\hat{\Omega}_N^{-1}Y = \beta + \big(N^{-1}X'\hat{\Omega}_N^{-1}X\big)^{-1}\big(N^{-1}X'\hat{\Omega}_N^{-1}u\big) = \beta + \big[\operatorname{plim}\big(N^{-1}X'\Omega_N^{-1}X\big)^{-1} + o_p(1)\big]\big[\operatorname{plim}\big(N^{-1}X'\Omega_N^{-1}u\big) + o_p(1)\big].
\]

For any symmetric positive definite matrix $\Omega_N$,

\[
N^{-1}X'\Omega_N^{-1}X \xrightarrow{p} E\big(N^{-1}X'\Omega_N^{-1}X\big) \to B_2, \qquad \operatorname{plim}\big(N^{-1}X'\Omega_N^{-1}X\big)^{-1} = B_2^{-1}.
\]

Under Assumption 6, $N^{-1}X'\Omega_N^{-1}u \xrightarrow{p} E\big(N^{-1}X'\Omega_N^{-1}u\big) = 0$. Thus $\hat{\beta}_{FGLS} = \beta + \big[B_2^{-1} + o_p(1)\big]\cdot o_p(1) \xrightarrow{p} \beta$.

(2) Asymptotic normality.

\[
\sqrt{N}\big(\hat{\beta}_{FGLS} - \beta\big) = \big(N^{-1}X'\hat{\Omega}_N^{-1}X\big)^{-1}\big(N^{-1/2}X'\hat{\Omega}_N^{-1}u\big) = \big(N^{-1}X'\Omega_N^{-1}X\big)^{-1}\big(N^{-1/2}X'\Omega_N^{-1}u\big) + o_p(1).
\]

Under Assumption 6, $E\big(N^{-1}X'\Omega_N^{-1}u\big) = 0$. Under Assumption 7,

\[
\operatorname{Var}\big(N^{-1/2}X'\Omega_N^{-1}u\big) = E\big(N^{-1}X'\Omega_N^{-1}X\big) \to B_2.
\]

Let $P_N$ denote the standardized version of $N^{-1/2}X'\Omega_N^{-1}u$; by the central limit theorem of Wooldridge and White (1988) for mixing sequences, $P_N \xrightarrow{d} N(0, I)$. Thus $N^{-1/2}X'\Omega_N^{-1}u \xrightarrow{d} N(0, B_2)$, and

\[
\sqrt{N}\big(\hat{\beta}_{FGLS} - \beta\big) \xrightarrow{d} N\big(0, B_2^{-1}B_2B_2^{-1}\big) = N\big(0, B_2^{-1}\big).
\]
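To make the sandwich asymptotics in Propositions 1 and 2 concrete, here is a small Monte Carlo sketch for the OLS case. It is only illustrative: it uses independent draws, so the estimated $B$ contains no spatial cross-terms, whereas the propositions cover mixing spatial data. The sample size, seed, and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, reps = 400, 2, 2000
beta = np.array([1.0, -0.5])
tstats = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((N, K))
    u = rng.standard_normal(N)
    y = X @ beta + u
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    A1 = X.T @ X / N                                    # estimates A_1
    Bh = (X * e[:, None]).T @ (X * e[:, None]) / N      # estimates B (no cross-terms here)
    V = np.linalg.inv(A1) @ Bh @ np.linalg.inv(A1) / N  # A_1^-1 B A_1^-1 / N
    tstats[r] = (b[0] - beta[0]) / np.sqrt(V[0, 0])
print(np.mean(np.abs(tstats) > 1.96))                   # rejection rate close to 0.05
```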
Proof of Proposition 3

To prove that $\hat{\beta}_{FPGLS}$ is consistent and asymptotically normal, we first show that $\hat{\beta}_{PGLS}$ is consistent and asymptotically normal, and then show that $\hat{\beta}_{FPGLS}$ and $\hat{\beta}_{PGLS}$ are asymptotically equivalent.

(1) Consistency.

\[
\hat{\beta}_{PGLS} = \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}y_g\Big) = \beta + \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g\Big).
\]

Under Assumptions P3-P5, since $\{(x_i, u_i)\}$ is a mixing sequence, $X_g'\Lambda_g^{-1}X_g$ and $X_g'\Lambda_g^{-1}u_g$ are also mixing sequences. By the law of large numbers for mixing sequences (Theorem 3.57 in White),

\[
G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g \xrightarrow{p} G^{-1}\sum_{g=1}^G Q_g \to Q, \qquad Q_g \equiv E\big(X_g'\Lambda_g^{-1}X_g\big),
\]
\[
G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g \xrightarrow{p} E\big(X_g'\Lambda_g^{-1}u_g\big) = 0.
\]

Thus $\hat{\beta}_{PGLS} = \beta + \big[Q^{-1} + o_p(1)\big]\cdot o_p(1) \xrightarrow{p} \beta$.

(2) Asymptotic normality.

\[
\sqrt{G}\big(\hat{\beta}_{PGLS} - \beta\big) = \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g\Big).
\]

By the central limit theorem for mixing sequences (Wooldridge and White 1988),

\[
G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g \xrightarrow{d} N(0, S), \qquad \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1} = Q^{-1} + o_p(1).
\]

Thus $\sqrt{G}\big(\hat{\beta}_{PGLS} - \beta\big) \xrightarrow{d} N\big(0, Q^{-1}SQ^{-1}\big)$.

(3) Asymptotic equivalence of $\hat{\beta}_{FPGLS}$ and $\hat{\beta}_{PGLS}$. Let $\hat{\Lambda}_g \equiv \Lambda_g(\hat{\lambda})$ and write the two estimators as

\[
\hat{\beta}_{FPGLS} = \Big(\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g\Big)^{-1}\Big(\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}y_g\Big), \qquad \hat{\beta}_{PGLS} = \Big(\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(\sum_{g=1}^G X_g'\Lambda_g^{-1}y_g\Big).
\]

To prove asymptotic equivalence we need the following two conditions to hold; the procedure follows Schmidt (1971):

\[
G^{-1}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g - G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g = o_p(1),
\]
\[
G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g - G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g = o_p(1).
\]

Lemma 1. $G^{-1}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g - G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g = o_p(1)$.

Proof: Since $\hat{\lambda} \xrightarrow{p} \lambda$ and $\Lambda_g^{-1}$ is a continuous function of $\lambda$, we have $\hat{\Lambda}_g^{-1} \xrightarrow{p} \Lambda_g^{-1}$. Because $G^{-1}\sum_{g}\big(X_g'\hat{\Lambda}_g^{-1}X_g - Q_g\big) = o_p(1)$ and $G^{-1}\sum_{g}\big(X_g'\Lambda_g^{-1}X_g - Q_g\big) = o_p(1)$, subtracting the two displays gives the result.

Lemma 2. $G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g - G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g = o_p(1)$.

By an argument similar to Lemma 1, $G^{-1}\sum_{g} X_g'\hat{\Lambda}_g^{-1}u_g \xrightarrow{p} 0$. Moreover, writing $\hat{\Lambda}_g^{-1} = \Lambda_g^{-1} + o_p(1)$,

\[
\hat{S}_G \equiv \operatorname{Var}\Big(G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g\Big) = G^{-1}\sum_{g=1}^G\sum_{h=1}^G E\big(X_g'\hat{\Lambda}_g^{-1}u_gu_h'\hat{\Lambda}_h^{-1}X_h\big),
\]

and each term expands as

\[
E\big(X_g'\hat{\Lambda}_g^{-1}u_gu_h'\hat{\Lambda}_h^{-1}X_h\big) = E\big(X_g'\Lambda_g^{-1}u_gu_h'\Lambda_h^{-1}X_h\big) + o(1),
\]

so $\lim_{G\to\infty}\hat{S}_G = \lim_{G\to\infty}S_G = S$. By the central limit theorem for mixing sequences in Wooldridge and White (1988),

\[
G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g \xrightarrow{d} N(0, S),
\]

which is the same limit distribution as that of $G^{-1/2}\sum_{g} X_g'\Lambda_g^{-1}u_g$. Thus the second condition holds.

Finally, we can write

\[
\sqrt{G}\big(\hat{\beta}_{FPGLS} - \beta\big) = \Big(G^{-1}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g\Big) = \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g\Big) + o_p(1) = \sqrt{G}\big(\hat{\beta}_{PGLS} - \beta\big) + o_p(1),
\]

so $\sqrt{G}\big(\hat{\beta}_{FPGLS} - \hat{\beta}_{PGLS}\big) = o_p(1)$. Therefore $\hat{\beta}_{FPGLS}$ and $\hat{\beta}_{PGLS}$ are asymptotically equivalent.
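A minimal sketch of the PGLS estimator from Proposition 3 follows, assuming group-level data are already split into lists of arrays. The group sizes, working covariances, and equicorrelated structure in the usage example are hypothetical choices for illustration, not the dissertation's simulation design.

```python
import numpy as np

def pgls(X_groups, y_groups, Lambda_groups):
    # Pseudo-GLS: correlation is modeled inside each group via Lambda_g
    # and ignored across groups (the "tapered" covariance).
    K = X_groups[0].shape[1]
    XtLX, XtLy = np.zeros((K, K)), np.zeros(K)
    for Xg, yg, Lg in zip(X_groups, y_groups, Lambda_groups):
        LiX = np.linalg.solve(Lg, Xg)          # Lambda_g^{-1} X_g
        XtLX += Xg.T @ LiX
        XtLy += LiX.T @ yg
    return np.linalg.solve(XtLX, XtLy)

# Hypothetical usage with three groups of unequal size and
# equicorrelated within-group working covariances:
rng = np.random.default_rng(5)
Xs = [rng.standard_normal((n, 2)) for n in (8, 10, 12)]
ys = [X @ np.array([1.0, -0.5]) + rng.standard_normal(X.shape[0]) for X in Xs]
Ls = [np.eye(X.shape[0]) + 0.3 * np.ones((X.shape[0], X.shape[0])) for X in Xs]
print(pgls(Xs, ys, Ls))
```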
Proof of Proposition 4

As stated in Section 1.4, the HAC estimator that is robust to groupwise spatial correlation and misspecification, as in equation (1.26), is

\[
\widehat{\operatorname{Avar}}_{rob}\big(\hat{\beta}_{FPGLS}\big) = \Big(\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g\Big)^{-1}\Big(\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\hat{\Lambda}_g^{-1}\tilde{u}_g\tilde{u}_h'\hat{\Lambda}_h^{-1}X_h\Big)\Big(\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g\Big)^{-1}.
\]

Define

\[
\hat{S} \equiv G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\hat{\Lambda}_g^{-1}\tilde{u}_g\tilde{u}_h'\hat{\Lambda}_h^{-1}X_h = G^{-1}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}\tilde{u}_g\tilde{u}_g'\hat{\Lambda}_g^{-1}X_g + G^{-1}\sum_{g=1}^G\sum_{h\neq g} d_{gh}^p\, X_g'\hat{\Lambda}_g^{-1}\tilde{u}_g\tilde{u}_h'\hat{\Lambda}_h^{-1}X_h,
\]

and define the following expressions:

\[
S^p = G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\Lambda_g^{-1}u_gu_h'\Lambda_h^{-1}X_h, \tag{4.1}
\]
\[
S_0^p = E\Big(G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\Lambda_g^{-1}u_gu_h'\Lambda_h^{-1}X_h\Big), \tag{4.2}
\]
\[
S_0 = E\Big(G^{-1}\sum_{g=1}^G\sum_{h=1}^G X_g'\Lambda_g^{-1}u_gu_h'\Lambda_h^{-1}X_h\Big). \tag{4.3}
\]

Then

\[
\big\|\hat{S} - S\big\| \le \big\|\hat{S} - S^p\big\| + \big\|S^p - S_0^p\big\| + \big\|S_0^p - S_0\big\| + \big\|S_0 - S\big\|. \tag{4.4}
\]

To prove Proposition 4 we need to show that each of the four terms on the right-hand side is $o_p(1)$. Since $\operatorname{plim}\hat{\lambda} = \lambda$, $\hat{\Lambda}_g \xrightarrow{p} \Lambda_g$ as $G \to \infty$. By Assumption P6(b), $\|S_0 - S\| = o_p(1)$. Rewrite the other three terms as follows:

\[
\hat{S} - S^p = G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\Lambda_g(\lambda)^{-1}\big(\tilde{u}_g\tilde{u}_h' - u_gu_h'\big)\Lambda_h(\lambda)^{-1}X_h + o_p(1), \tag{4.5}
\]
\[
S^p - S_0^p = G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\Big[X_g'\Lambda_g(\lambda)^{-1}u_gu_h'\Lambda_h(\lambda)^{-1}X_h - E\big(X_g'\Lambda_g(\lambda)^{-1}u_gu_h'\Lambda_h(\lambda)^{-1}X_h\big)\Big], \tag{4.6}
\]
\[
S_0^p - S_0 = E\Big(G^{-1}\sum_{g=1}^G\sum_{h=1}^G \big(d_{gh}^p - 1\big)X_g'\Lambda_g(\lambda)^{-1}u_gu_h'\Lambda_h(\lambda)^{-1}X_h\Big). \tag{4.7}
\]

First, we prove that the first term is $o_p(1)$. Write

\[
\tilde{u}_g\tilde{u}_h' = u_gu_h' - u_g\big(\hat{\beta}_{FPGLS} - \beta\big)'X_h' - X_g\big(\hat{\beta}_{FPGLS} - \beta\big)u_h' + X_g\big(\hat{\beta}_{FPGLS} - \beta\big)\big(\hat{\beta}_{FPGLS} - \beta\big)'X_h'. \tag{4.8}
\]

It suffices to show that the averages of the last three terms converge in probability to zero. The average of the vec of the second term can be written as $G^{-1}\sum_{g}\sum_{h}(X_h \otimes u_g)\big(\hat{\beta}_{FPGLS} - \beta\big)$, which is $o_p(1)$ because $\operatorname{plim}\big(\hat{\beta}_{FPGLS} - \beta\big) = 0$ and $G^{-1}\sum_{g}\sum_{h} X_h \otimes u_g \xrightarrow{p} 0$. The third term is the transpose of the second. The average of the last term can be written as $G^{-1}\sum_{g}\sum_{h}(X_h \otimes X_g)\operatorname{vec}\big[(\hat{\beta}_{FPGLS} - \beta)(\hat{\beta}_{FPGLS} - \beta)'\big]$, and $\operatorname{vec}\big[(\hat{\beta}_{FPGLS} - \beta)(\hat{\beta}_{FPGLS} - \beta)'\big] = o_p(1)$. Assuming $X_h$ and $X_g$ have finite second moments, each element of $G^{-1}\sum_{g}\sum_{h}(X_h \otimes X_g)$ is $O_p(1)$, and since $O_p(1)\cdot o_p(1) = o_p(1)$, the last term is $o_p(1)$. Thus $\tilde{u}_g\tilde{u}_h' - u_gu_h' = o_p(1)$, and, further assuming each element of $\Lambda_h$ is finite, each element of the first term is $o_p(1)$. The remaining terms can be shown to be $o_p(1)$ by similar arguments.
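A minimal code sketch of the robust variance estimator in Proposition 4 is given below. The Bartlett-type taper over between-group centroid distances is an assumed choice standing in for the weights $d_{gh}^p$; the dissertation's actual kernel and distance measure may differ, and all inputs are hypothetical.

```python
import numpy as np

def pgls_hac_avar(X_groups, u_groups, Lambda_groups, centroids, bandwidth):
    # Sandwich Avar(beta_FPGLS) with kernel weights over group pairs,
    # mirroring the formula above: bread^-1 * S * bread^-1.
    K = X_groups[0].shape[1]
    G = len(X_groups)
    Z = np.zeros((G, K))                 # Z_g = X_g' Lambda_g^-1 u_g (residual scores)
    bread = np.zeros((K, K))             # sum_g X_g' Lambda_g^-1 X_g
    for g, (Xg, ug, Lg) in enumerate(zip(X_groups, u_groups, Lambda_groups)):
        LiX = np.linalg.solve(Lg, Xg)
        Z[g] = LiX.T @ ug
        bread += Xg.T @ LiX
    S = np.zeros((K, K))
    for g in range(G):
        for h in range(G):
            d = np.linalg.norm(centroids[g] - centroids[h])
            w = max(0.0, 1.0 - d / bandwidth)   # Bartlett taper (assumed kernel)
            S += w * np.outer(Z[g], Z[h])
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ S @ bread_inv
```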
Proof of Proposition 5

By Newey (1991), a sufficient condition is that $L_G$ is stochastically equicontinuous. A mean value expansion gives

\[
L_G = \sum_{g=1}^G l_g(\beta, \lambda) = \sum_{g=1}^G l_g(\beta_0, \lambda_0) + \sum_{g=1}^G s_g(\ddot{\beta}, \ddot{\lambda})'\binom{\beta - \beta_0}{\lambda - \lambda_0},
\]

where $\ddot{\beta}$ lies between $\beta_0$ and $\beta$, and $\ddot{\lambda}$ lies between $\lambda_0$ and $\lambda$. Then

\[
\sup_\theta\big|L_G(\theta) - L_G(\theta_0)\big| = \sup_\theta\Big|\sum_{g=1}^G s_g(\ddot{\beta}, \ddot{\lambda})'\binom{\beta - \beta_0}{\lambda - \lambda_0}\Big| \le \sup_\theta\Big\|\sum_{g=1}^G s_g(\ddot{\beta}, \ddot{\lambda})\Big\|\cdot\Big\|\binom{\beta - \beta_0}{\lambda - \lambda_0}\Big\| < \infty.
\]

Thus $L_G(\theta)$ is stochastically equicontinuous, and the PQMLE is consistent.

The $(K + p) \times 1$ score of the log likelihood of group $g$ is

\[
s_g(\beta, \lambda) \equiv \binom{\nabla_\beta l_g(\beta, \lambda)}{\nabla_\lambda l_g(\beta, \lambda)},
\]

where

\[
\nabla_\beta l_g(\beta, \lambda) = X_g'\Lambda_g^{-1}(\lambda)\big(y_g - X_g\beta\big)
\]

and

\[
\nabla_\lambda l_g(\beta, \lambda) = -\tfrac{1}{2}\nabla_\lambda\Lambda_g(\lambda)'\operatorname{vec}\big(\Lambda_g^{-1}(\lambda)\big) + \tfrac{1}{2}\nabla_\lambda\Lambda_g(\lambda)'\operatorname{vec}\big(\Lambda_g^{-1}(\lambda)u_gu_g'\Lambda_g^{-1}(\lambda)\big) = \tfrac{1}{2}\nabla_\lambda\Lambda_g(\lambda)'\big[\Lambda_g^{-1}(\lambda)\otimes\Lambda_g^{-1}(\lambda)\big]\operatorname{vec}\big(u_gu_g' - \Lambda_g(\lambda)\big).
\]

Setting $\sum_{g=1}^G s_g(\beta, \lambda) = 0$ yields

\[
\hat{\beta}_{PQMLE} = \Big(\sum_{g=1}^G X_g'\Lambda_g^{-1}(\hat{\lambda})X_g\Big)^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}(\hat{\lambda})y_g,
\]

while $\hat{\lambda}_{PQMLE}$ may or may not have a closed functional form.

The groupwise QMLE is asymptotically normally distributed. From $S_G(\hat{\theta}) = \sum_{g=1}^G s_g(\hat{\theta}) = 0$ and a mean value expansion about $\theta_0$,

\[
\sqrt{G}\big(\hat{\theta} - \theta_0\big) = \Big(-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta})\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0)\Big).
\]

By the ergodic theorem for mixing fields, $-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta}) \xrightarrow{p} A$, where $A = -E[H_g(\theta_0)]$. By the central limit theorem in Jenish and Prucha (2007),

\[
G^{-1/2}\sum_{g=1}^G s_g(\theta_0) \xrightarrow{d} N(0, B), \qquad B = \lim_{G\to\infty}\operatorname{Var}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0)\Big).
\]

Thus $\sqrt{G}\big(\hat{\theta} - \theta_0\big) \xrightarrow{d} N\big(0, A^{-1}BA^{-1}\big)$.

4.2 Proofs of Propositions in Chapter 2

This section provides proofs of the propositions in Chapter 2.

Proof of Proposition 6

The proof follows Theorem 2 in Jenish and Prucha (2009). A sufficient condition for consistency is that the objective function satisfies a uniform law of large numbers (ULLN); to prove the ULLN, we need a pointwise law of large numbers (LLN). Assumption 4 delivers a pointwise LLN, so $Q_N(\theta) - \bar{Q}_N(\theta) \to 0$ a.s., where $\bar{Q}_N(\theta) = E[Q_N(\theta)]$. Since $Q_N(\theta)$ is the sample average of the $q_i$, it is also stochastically equicontinuous. Then $\sup_\theta\big|Q_N(\theta) - \bar{Q}_N(\theta)\big| \to 0$ a.s. as $N \to \infty$, which is the uniform law of large numbers.

Proof of Proposition 7

The proof is similar to that of Proposition 6. Instead of writing the objective function as a sum over individuals, we write it as a sum over groups. The pointwise LLN becomes $Q_G(\theta) - \bar{Q}_G(\theta) \to 0$ a.s., where $\bar{Q}_G(\theta) = E[Q_G(\theta)]$. Together with the stochastic equicontinuity condition, $\sup_\theta\big|Q_G(\theta) - \bar{Q}_G(\theta)\big| \to 0$ a.s. as $G \to \infty$.

Proof of Proposition 8

From $S_G(\check{\theta}) = \sum_{g=1}^G s_g(\check{\theta}) = 0$ and a mean value expansion,

\[
\sqrt{G}\big(\check{\theta} - \theta_0\big) = \Big(-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta})\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0)\Big).
\]

By the uniform law of large numbers for mixing fields, $-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta}) \xrightarrow{p} A_0$, where $A_0 = -E[H_g(\theta_0)]$. By the central limit theorem in Jenish and Prucha (2009),

\[
G^{-1/2}\sum_{g=1}^G s_g(\theta_0) \xrightarrow{d} N(0, B_0), \qquad B_0 = \lim_{G\to\infty}\operatorname{Var}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0)\Big).
\]

Thus $\sqrt{G}\big(\check{\theta} - \theta_0\big) \xrightarrow{d} N\big(0, A_0^{-1}B_0A_0^{-1}\big)$.

Proof of Proposition 9

A mean value expansion gives

\[
\sqrt{G}\big(\hat{\theta}_G - \theta_0\big) = \Big(-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta}; \hat{\gamma})\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0; \hat{\gamma})\Big) = A_0^{-1}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0; \hat{\gamma})\Big) + o_p(1),
\]

where $A_0 = \lim_{G\to\infty}-G^{-1}\sum_{g=1}^G E[H_g(\theta_0, \gamma^*)]$ by the law of large numbers for mixing sequences. Expanding in $\gamma$,

\[
G^{-1/2}\sum_{g=1}^G s_g(\theta_0; \hat{\gamma}) = G^{-1/2}\sum_{g=1}^G s_g(\theta_0; \gamma^*) + F_0\sqrt{G}\big(\hat{\gamma} - \gamma^*\big) + o_p(1),
\]

where $F_0$ is the $P \times J$ matrix $F_0 = \lim_{G\to\infty}G^{-1}\sum_{g=1}^G E\big[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)\big]$. Assume $\sqrt{G}(\hat{\gamma} - \gamma^*)$ can be written in the following form:

\[
\sqrt{G}\big(\hat{\gamma} - \gamma^*\big) = G^{-1/2}\sum_{g=1}^G r_g(\gamma^*) + o_p(1),
\]

where $r_g(\gamma^*)$ is a $J \times 1$ vector with $E[r_g(\gamma^*)] = 0$. Then

\[
\sqrt{G}\big(\hat{\theta} - \theta_0\big) = A_0^{-1}\Big(G^{-1/2}\sum_{g=1}^G e_g(\theta_0; \gamma^*)\Big) + o_p(1), \qquad e_g(\theta_0; \gamma^*) \equiv s_g(\theta_0; \gamma^*) + F_0\,r_g(\gamma^*),
\]

and defining

\[
D_0 \equiv \lim_{G\to\infty}G^{-1}\sum_{g=1}^G\sum_{h=1}^G E\big[e_g(\theta_0; \gamma^*)e_h(\theta_0; \gamma^*)'\big],
\]

we obtain $\operatorname{Avar}\sqrt{G}\big(\hat{\theta} - \theta_0\big) = A_0^{-1}D_0A_0^{-1}$.

In the case of QMLE, $m_g(\theta)$ and $\nabla_\theta m_g(\theta)$ do not depend on $\gamma$; when we take derivatives with respect to $\gamma$, only $W_g^{-1}$ is affected, so $\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)$ is a linear combination of the elements of $y_g - m_g(\theta_0)$. Since $E(y_g - m_g \mid w, D) = 0$, we have $E\big[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*) \mid w, D\big] = 0$, and by the law of iterated expectations $E\big[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)\big] = 0$. Thus $F_0 = 0$ and

\[
\operatorname{Avar}\sqrt{G}\big(\hat{\theta} - \theta_0\big) = A_0^{-1}B_0A_0^{-1}, \qquad B_0 = \lim_{G\to\infty}\operatorname{Var}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0, \gamma^*)\Big).
\]

Proof of Proposition 10

Let $\hat{m}_g = m_g(\hat{\theta})$ and $\nabla\hat{m}_g = \nabla_\theta m_g(\hat{\theta})$. The proof consists of two parts: (1) $\hat{A} \xrightarrow{p} A_0$; (2) $\hat{B}_2 \xrightarrow{p} B_0$.

Part (1):

\[
\hat{A} = G^{-1}\sum_{g=1}^G \nabla\hat{m}_g'\hat{W}_g^{-1}\nabla\hat{m}_g \xrightarrow{p} \lim_{G\to\infty}G^{-1}\sum_{g=1}^G E\big[\nabla m_g'W_g^{-1}\nabla m_g\big] = A_0.
\]

Part (2): We have

\[
\hat{B}_2 = G^{-1}\sum_{g=1}^G\sum_{h=1}^G k(d_{gh})\,\nabla\hat{m}_g'\hat{W}_g^{-1}\hat{u}_g\hat{u}_h'\hat{W}_h^{-1}\nabla\hat{m}_h
\]

and

\[
B_0 = \lim_{G\to\infty}G^{-1}\sum_{g=1}^G E\big[\nabla m_g'W_g^{-1}u_gu_g'W_g^{-1}\nabla m_g\big] + \lim_{G\to\infty}G^{-1}\sum_{g=1}^G\sum_{h\neq g}E\big[\nabla m_g'W_g^{-1}u_gu_h'W_h^{-1}\nabla m_h\big].
\]

Define $Z_g = \nabla m_g'W_g^{-1}u_g$ and $\hat{Z}_g = \nabla\hat{m}_g'\hat{W}_g^{-1}\hat{u}_g$, so that, for a given $G$,

\[
\hat{B}_2 = G^{-1}\sum_{g=1}^G\hat{Z}_g\hat{Z}_g' + G^{-1}\sum_{g=1}^G\sum_{h\neq g}k(d_{gh})\hat{Z}_g\hat{Z}_h',
\]

and define

\[
B_0^k = G^{-1}\sum_{g=1}^G E\big(Z_gZ_g'\big) + G^{-1}\sum_{g=1}^G\sum_{h\neq g}k(d_{gh})E\big(Z_gZ_h'\big), \qquad B^k = G^{-1}\sum_{g=1}^G Z_gZ_g' + G^{-1}\sum_{g=1}^G\sum_{h\neq g}k(d_{gh})Z_gZ_h'.
\]

Then

\[
\big\|\hat{B}_2 - B_0\big\| \le \big\|\hat{B}_2 - B^k\big\| + \big\|B^k - B_0^k\big\| + \big\|B_0^k - B_0\big\| \le \big\|\hat{B}_2 - B^k\big\| + G^{-1}\Big\|\sum_{g=1}^G\big(Z_gZ_g' - E(Z_gZ_g')\big)\Big\| + G^{-1}\Big\|\sum_{g=1}^G\sum_{h\neq g}k(d_{gh})\big(Z_gZ_h' - E(Z_gZ_h')\big)\Big\| + G^{-1}\Big\|\sum_{g=1}^G\sum_{h\neq g}\big(k(d_{gh}) - 1\big)E\big(Z_gZ_h'\big)\Big\|. \tag{4.9}
\]

Each term on the right-hand side can then be shown to go to zero. Defining $p_{gh} = Z_gZ_h' - E(Z_gZ_h')$ and applying a mean value expansion to $\hat{Z}_g\hat{Z}_h' - Z_gZ_h'$ makes the first term vanish as well, which completes the proof.
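The sandwich form $A_0^{-1}B_0A_0^{-1}$ that runs through Propositions 8-10 can be computed mechanically from per-group scores and Hessians. The sketch below is a generic illustration under assumed inputs (`scores`, `hessians`, and `pair_weights` are hypothetical arrays that a GEE/PQMLE routine would produce); with identity pair weights it reduces to the usual cluster-robust form, while Proposition 10's estimator corresponds to weights $k(d_{gh})$.

```python
import numpy as np

def sandwich_avar(scores, hessians, pair_weights=None):
    # scores: (G, P) array of s_g(theta_hat); hessians: (G, P, P) array of H_g(theta_hat).
    # Groups are NOT assumed independent: B sums score cross-products over group pairs.
    G = scores.shape[0]
    A = -hessians.mean(axis=0)                 # -G^-1 sum_g H_g, estimating A_0
    if pair_weights is None:
        pair_weights = np.eye(G)               # own-group products only (cluster style)
    B = scores.T @ pair_weights @ scores / G   # G^-1 sum_g sum_h w_gh s_g s_h'
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv                     # A_0^-1 B_0 A_0^-1
```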
4.3 Proofs of Theorems in Chapter 3

This section provides proofs of the theorems in Chapter 3.

Proof of Theorem 1

Amemiya (1985, pp. 182-183) showed the equivalence of conditions (A), (B), (C), (D) and (E). It is trivial that (F) implies (E) and that (F') implies (F). We still need to establish that one of (A)-(E) implies (F').

Proof that (D) implies (F'): Condition (D) says that $X = F_1A$ for some nonsingular $A$, where the eigenvectors of $\Sigma$ are $F = [F_1, F_2]$ and the eigenvalues form the diagonal matrix

\[
\Lambda = \begin{pmatrix}\Lambda_1 & 0\\ 0 & \Lambda_2\end{pmatrix}.
\]

We can choose $F$ such that $F'F = I$, and therefore $F_1'F_2 = 0$. Then the spectral representation of $\Sigma$ gives $\Sigma = F\Lambda F' = F_1\Lambda_1F_1' + F_2\Lambda_2F_2'$. Since $F_1 = XA^{-1}$,

\[
\Sigma = X\big(A^{-1}\Lambda_1A'^{-1}\big)X' + F_2\Lambda_2F_2'.
\]

Finally, $X'F_2 = A'F_1'F_2 = 0$, so (F') holds.

Because the result is somewhat counterintuitive, we also give a proof that (F) implies (F').

Proof that (F) implies (F'): Suppose that (F) holds, so $\Sigma = X\Gamma X' + Q\Theta Q' + c^2I$ with $Q'X = 0$. Define $P_X = X(X'X)^{-1}X'$ and $M_X = I - P_X$. Since $Q'X = 0$, the columns of $Q$ lie in the null space of $X'$, so $Q = M_XB$ for some $B$. Also, there exists a matrix $A$, of dimension $T \times (T - K)$, such that $M_X = AA'$, $A'A = I_{T-K}$ and $A'X = 0$. (See, e.g., Theil (1971, pp. 203-209).) So we can write

\[
\Sigma = X\Gamma X' + Q\Theta Q' + c^2(P_X + M_X) = X\big[\Gamma + c^2(X'X)^{-1}\big]X' + M_XB\Theta B'M_X + c^2M_X = X\big[\Gamma + c^2(X'X)^{-1}\big]X' + A\big[c^2I + A'B\Theta B'A\big]A'.
\]

Since $A'X = 0$, condition (F') holds.
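The following snippet numerically confirms the classical content of condition (F): when $\Sigma = X\Gamma X' + Q\Theta Q' + c^2I$ with $Q'X = 0$, OLS and GLS coincide. It is a sketch added here for illustration; the dimensions, seed, and the value $c^2 = 2$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
T, K = 50, 3
X = rng.standard_normal((T, K))

# Q with Q'X = 0: project random columns off the column space of X.
M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
Q = M @ rng.standard_normal((T, 4))

G1 = rng.standard_normal((K, K)); Gamma = G1 @ G1.T
G2 = rng.standard_normal((4, 4)); Theta = G2 @ G2.T
Sigma = X @ Gamma @ X.T + Q @ Theta @ Q.T + 2.0 * np.eye(T)   # condition (F), c^2 = 2

y = rng.standard_normal(T)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
SiX = np.linalg.solve(Sigma, X)
b_gls = np.linalg.solve(X.T @ SiX, SiX.T @ y)
print(np.max(np.abs(b_ols - b_gls)))   # ~1e-13: OLS = GLS under (F)
```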
Proof of Theorem 2

The model is $y = X\beta + \varepsilon$, and $\tilde{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$ is GLS assuming that the variance matrix of $\varepsilon$ is $\Sigma$. Now define $X_* = \Omega^{-1/2}X$, $y_* = \Omega^{-1/2}y$ and $\Sigma_* = \Omega^{-1/2}\Sigma\Omega^{-1/2}$ (so $\Sigma_*^{-1} = \Omega^{1/2}\Sigma^{-1}\Omega^{1/2}$). Then $\ddot{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ is OLS of $y_*$ on $X_*$, and $\tilde{\beta} = (X_*'\Sigma_*^{-1}X_*)^{-1}X_*'\Sigma_*^{-1}y_*$ is GLS of $y_*$ on $X_*$ using the error variance matrix $\Sigma_*$. So Amemiya's results apply if we simply replace his $X$ by $X_*$ and his $\Sigma$ by $\Sigma_*$.

(A) $(X_*'X_*)^{-1}X_*'\Sigma_*X_*(X_*'X_*)^{-1} = (X_*'\Sigma_*^{-1}X_*)^{-1}$ becomes

\[
\big(X'\Omega^{-1}X\big)^{-1}X'\Omega^{-1}\Sigma\Omega^{-1}X\big(X'\Omega^{-1}X\big)^{-1} = \big(X'\Sigma^{-1}X\big)^{-1},
\]

which is condition (A2) of Theorem 2.

(B) $\Sigma_*X_* = X_*B$ for some nonsingular $B$ becomes $\Omega^{-1/2}\Sigma\Omega^{-1}X = \Omega^{-1/2}XB$, i.e. $\Sigma\Omega^{-1}X = XB$ for some nonsingular $B$, which is condition (B2) of Theorem 2.

(C) $(X_*'X_*)^{-1}X_*' = (X_*'\Sigma_*^{-1}X_*)^{-1}X_*'\Sigma_*^{-1}$ becomes, after postmultiplying both sides by $\Omega^{-1/2}$,

\[
\big(X'\Omega^{-1}X\big)^{-1}X'\Omega^{-1} = \big(X'\Sigma^{-1}X\big)^{-1}X'\Sigma^{-1},
\]

which is condition (C) of Theorem 2.

(D) $X_* = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Sigma_*$: this is $\Omega^{-1/2}X = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Omega^{-1/2}\Sigma\Omega^{-1/2}$, which is condition (D2) of Theorem 2.

(E) $Z'\Sigma_*X_* = 0$ for any $Z$ such that $Z'X_* = 0$: this is $Z'\Omega^{-1/2}\Sigma\Omega^{-1}X = 0$ for any $Z$ such that $Z'\Omega^{-1/2}X = 0$. Defining $Z_* = \Omega^{-1/2}Z$, the condition becomes $Z_*'\Sigma\Omega^{-1}X = 0$ for any $Z_*$ such that $Z_*'X = 0$, which is condition (E2) of Theorem 2.

(F) $\Sigma_* = X_*\Gamma X_*' + Q\Theta Q' + c^2I$ with $Q'X_* = 0$: substituting $\Sigma_* = \Omega^{-1/2}\Sigma\Omega^{-1/2}$ and premultiplying and postmultiplying by $\Omega^{1/2}$ gives

\[
\Sigma = X\Gamma X' + \Omega^{1/2}Q\Theta Q'\Omega^{1/2} + c^2\Omega \quad \text{with } Q'\Omega^{-1/2}X = 0.
\]

Now let $Q_* = \Omega^{-1/2}Q$. Then we obtain $\Sigma = X\Gamma X' + \Omega Q_*\Theta Q_*'\Omega + c^2\Omega$ with $Q_*'X = 0$, which is condition (F2) of Theorem 2.

(F') The proof is essentially the same as for condition (F).

Proof of Theorem 3

Proof that (A) and (C) are equivalent: Define

\[
\Delta = \big[X'H(H'\Sigma H)^{-1}H'X\big]^{-1}X'H(H'\Sigma H)^{-1}H' - (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}.
\]

Then

\[
\Delta\Sigma\Delta' = \big[X'H(H'\Sigma H)^{-1}H'X\big]^{-1} - (X'\Sigma^{-1}X)^{-1}.
\]

So $\Delta = 0$ (which is condition (C)) is equivalent to $\big[X'H(H'\Sigma H)^{-1}H'X\big]^{-1} = (X'\Sigma^{-1}X)^{-1}$, which is condition (A).

Proof that (A) and (B) are equivalent:

\[
X'\Sigma^{-1}X - X'H(H'\Sigma H)^{-1}H'X = X'\Sigma^{-1/2}\big[I - P_{\Sigma^{1/2}H}\big]\Sigma^{-1/2}X,
\]

where $P_{\Sigma^{1/2}H}$ is the orthogonal projection onto the column space of $\Sigma^{1/2}H$. This equals zero if and only if $\Sigma^{-1/2}X$ is in the column space of $\Sigma^{1/2}H$, that is, $\Sigma^{-1/2}X = \Sigma^{1/2}HB$ for some $B$, or $\Sigma^{-1}X = HB$, or $X = \Sigma HB$, which is condition (B).

Proof that (B) and (E) are equivalent: The proof that (B) implies (E) is trivial. To show that (E) implies (B), suppose that $Z'H = 0$. Then $Z = M_HS$, where $M_H = I - P_H = I - H(H'H)^{-1}H'$. Then (E) says that $Z'\Sigma^{-1}X = 0$, or $S'M_H\Sigma^{-1}X = 0$. This is true for any $S$, so it must be true that $M_H\Sigma^{-1}X = 0$, that is, $\Sigma^{-1}X$ is in the column space of $H$, or $\Sigma^{-1}X = HB$ for some $B$. This is condition (B).

Proof that (B) and (D) are equivalent: Note that

\[
\big[\Sigma H(H'\Sigma H)^{-1}H'\big]\cdot\Sigma H = \Sigma H\cdot I.
\]

That is, the columns of $\Sigma H$ are eigenvectors of $\Sigma H(H'\Sigma H)^{-1}H'$, and the corresponding eigenvalues equal one. So, if (B) holds, $X = \Sigma HB$ and $X$ is a linear combination of these eigenvectors, so (D) holds. Conversely, if (D) holds, then $X = (\Sigma H)A$ and (B) holds.

Proof that (F') implies (B): Suppose (F') holds, so that $\Sigma = X\Gamma X' + Q\Theta Q'$ and $Q$ satisfies $Q'HB = 0$ for some $B$ such that $X'HB$ is nonsingular. Then, for that $B$,

\[
\Sigma HB = X\Gamma X'HB + Q\Theta Q'HB = X\Gamma X'HB,
\]

and so

\[
X = \Sigma H\big[B(X'HB)^{-1}\Gamma^{-1}\big].
\]

So condition (B) holds.

Proof that (B) implies (F'): Suppose (B) holds, so that $X = \Sigma HB$. Let $\Gamma = (B'H'\Sigma HB)^{-1}$, where the inverse must exist because $X$ has rank $K$, so $HB$ must have rank $K$. Then $\Sigma = X\Gamma X' + C$, where

\[
C = \Sigma - \Sigma HB(B'H'\Sigma HB)^{-1}B'H'\Sigma = \Sigma^{1/2}\big[I - P_{\Sigma^{1/2}HB}\big]\Sigma^{1/2} = QQ', \qquad Q = \Sigma^{1/2}\big[I - P_{\Sigma^{1/2}HB}\big].
\]

Then

\[
Q'HB = \big[I - P_{\Sigma^{1/2}HB}\big]\Sigma^{1/2}HB = 0.
\]

So (F') holds.
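The construction in the proof that (B) implies (F') can also be checked numerically. The sketch below (with arbitrary dimensions and draws, added here purely as an illustration) builds $\Gamma$ and $Q$ exactly as in the proof and verifies that $\Sigma = X\Gamma X' + QQ'$ and $Q'HB = 0$.

```python
import numpy as np

rng = np.random.default_rng(3)
T, L, K = 40, 6, 3
H = rng.standard_normal((T, L))
A = rng.standard_normal((T, T))
Sigma = A @ A.T + T * np.eye(T)
B = rng.standard_normal((L, K))
X = Sigma @ H @ B                                   # condition (B)

w, V = np.linalg.eigh(Sigma)
S12 = V @ np.diag(np.sqrt(w)) @ V.T                 # Sigma^{1/2}
M = S12 @ H @ B
P = M @ np.linalg.solve(M.T @ M, M.T)               # projection onto col(Sigma^{1/2} H B)
Q = S12 @ (np.eye(T) - P)
Gamma = np.linalg.inv(B.T @ H.T @ Sigma @ H @ B)

print(np.max(np.abs(Sigma - (X @ Gamma @ X.T + Q @ Q.T))))  # ~ 0
print(np.max(np.abs(Q.T @ H @ B)))                           # ~ 0
```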
BIBLIOGRAPHY

[1] Amemiya, T. (1983), "Partially Generalized Least Squares and Two-Stage Least Squares Estimators," Journal of Econometrics, 23, 275-283.

[2] Amemiya, T. (1985), Advanced Econometrics, Harvard University Press, Cambridge, MA.

[3] Anderson, T.W. (1971), The Statistical Analysis of Time Series, John Wiley and Sons, New York.

[4] Anselin, L. (2010), "Thirty Years of Spatial Econometrics," Papers in Regional Science, 89, 3-25.

[5] Arbia, G. (2006), Spatial Econometrics: Statistical Foundations and Applications to Regional Convergence, Springer.

[6] Baltagi, B.H. (1989), "Applications of a Necessary and Sufficient Condition for OLS to be BLUE," Statistics and Probability Letters, 8, 457-461.

[7] Balestra, P. (1970), "On the Efficiency of Ordinary Least Squares in Regression Analysis," Journal of the American Statistical Association, 65, 1330-1337.

[8] Bester, A.C., Conley, T.G., Hansen, C.B. and Vogelsang, T.J. (2008), "Fixed-b Asymptotics for Spatially Dependent Robust Nonparametric Covariance Matrix Estimators," working paper.

[9] Bloom, N., Schankerman, M. and Van Reenen, J. (2012), "Identifying Technology Spillovers and Product Market Rivalry," NBER working paper.

[10] Bollerslev, T. and Wooldridge, J.M. (1992), "Quasi-Maximum Likelihood Estimation and Inference in Dynamic Models with Time-Varying Covariances," Econometric Reviews, 11, 143-172.

[11] Bolthausen, E. (1982), "On the Central Limit Theorem for Stationary Mixing Random Fields," The Annals of Probability, 10, 1047-1050.

[12] Brown, J.R., Ivkovic, Z., Smith, P.A. and Weisbenner, S. (2008), "Neighbors Matter: Causal Community Effects and Stock Market Participation," The Journal of Finance, 63, 1509-1531.

[13] Christakos, G. (1987), "On the Problem of Permissible Covariance and Variogram Models," Water Resources Research, 20, 251-265.

[14] Cragg, J. (1983), "More Efficient Estimation in the Presence of Heteroscedasticity of Unknown Form," Econometrica, 51, 751-763.

[15] Conley, T.G. (1999), "GMM Estimation with Cross Sectional Dependence," Journal of Econometrics, 92, 1-45.

[16] Cressie, N.A.C. (1993), Statistics for Spatial Data, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York.

[17] Du, J. and Zhang, H. (2007), "Covariance Tapering in Spatial Statistics," working paper.

[18] Dubin, R.A. (1988), "Estimation of Regression Coefficients in the Presence of Spatially Autocorrelated Error Terms," The Review of Economics and Statistics, 70, 466-474.

[19] Fiebig, D.G., Bartels, R. and Krämer, W. (1996), "The Frisch-Waugh Theorem and Generalized Least Squares," Econometric Reviews, 15, 431-443.

[20] Gneiting, T. (2002), "Compactly Supported Correlation Functions," Journal of Multivariate Analysis, 83, 493-508.

[21] Gourieroux, C. and Monfort, A. (1980), "Sufficient Linear Structures: Econometric Applications," Econometrica, 48, 1083-1097.

[22] Gourieroux, C., Monfort, A. and Trognon, A. (1984), "Pseudo Maximum Likelihood Methods: Applications to Poisson Models," Econometrica, 52, 701-720.

[23] Gross, J. and Puntanen, S. (2000), "Remark on Pseudo-Generalized Least Squares," Econometric Reviews, 19, 131-134.

[24] Kelejian, H.H. and Prucha, I.R. (2007), "HAC Estimation in a Spatial Framework," Journal of Econometrics, 140, 131-154.

[25] Kruskal, W. (1968), "When Are Gauss-Markov and Least Squares Estimators Identical? A Coordinate-Free Approach," Annals of Mathematical Statistics, 39, 70-75.

[26] Jenish, N. and Prucha, I.R. (2009), "Central Limit Theorems and Uniform Laws of Large Numbers for Arrays of Random Fields," Journal of Econometrics, 150, 86-98.

[27] Kaufman, C., Schervish, M. and Nychka, D. (2008), "Covariance Tapering for Likelihood-Based Estimation in Large Spatial Datasets," Journal of the American Statistical Association, 103, 1545-1555.

[28] Lee, L.F. (2004), "Asymptotic Distributions of Quasi-Maximum Likelihood Estimators for Spatial Autoregressive Models," Econometrica, 72, 1899-1925.

[29] Matthew, T. (1983), "Linear Estimation with an Incorrect Dispersion Matrix in Linear Models with a Common Linear Part," Journal of the American Statistical Association, 78, 468-471.

[30] McAleer, M. (1992), "Efficient Estimation: The Rao-Zyskind Condition, Kruskal's Theorem and Ordinary Least Squares," Economic Record, 68, 65-72.

[31] McElroy, F.W. (1967), "A Necessary and Sufficient Condition that Ordinary Least Squares Estimators Be Best Linear Unbiased," Journal of the American Statistical Association, 63, 1302-1304.
[32] Newey, W.K. and West, K.D. (1987), "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703-708.

[33] Prentice, R.L. (1988), "Correlated Binary Regression with Covariates Specific to Each Binary Observation," Biometrics, 44, 1033-1048.

[34] Puntanen, S. and Styan, G.P.H. (1989), "The Equality of the Ordinary Least Squares Estimator and the Best Linear Unbiased Estimator," The American Statistician, 43, 153-161.

[35] Rao, C.R. (1967), "Least Squares Theory Using an Estimated Dispersion Matrix and Its Application to Measurement of Signals," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 355-372, University of California Press, Berkeley, CA.

[36] Theil, H. (1971), Principles of Econometrics, John Wiley and Sons, New York.

[37] Wang, H., Iglesias, E.M. and Wooldridge, J.M. (2012), "Partial Maximum Likelihood Estimation of Spatial Probit Models," Journal of Econometrics.

[38] Wooldridge, J.M. (2011), Econometric Analysis of Cross Section and Panel Data, MIT Press, Cambridge, MA.

[39] Zeger, S.L., Liang, K.Y. and Albert, P.S. (1988), "Models for Longitudinal Data: A Generalized Estimating Equation Approach," Biometrics, 44, 1049-1060.

[40] Zyskind, G. (1962), "On Conditions for Equality of Best and Simple Linear Least Squares Estimators (abstract)," Annals of Mathematical Statistics, 33, 1502-1503.

[41] Zyskind, G. (1967), "On Canonical Forms, Non-Negative Covariance Matrices and Best and Simple Least Squares Linear Estimators in Linear Models," Annals of Mathematical Statistics, 38, 1092-1109.