LINEAR AND NONLINEAR ESTIMATION WITH SPATIAL DATA

By

Cuicui Lu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics – Doctor of Philosophy

2013

ABSTRACT

LINEAR AND NONLINEAR ESTIMATION WITH SPATIAL DATA

By Cuicui Lu

In some economic situations, observations are cross-sectionally correlated. One example of cross-sectional correlation is spatial correlation, in which the correlation comes from the spatial closeness of different individuals. Spillover effects, externalities, network effects, and so on are common causes of spatial correlation. For example, a supply shock in one region will result in production shocks in nearby regions. This type of correlation reflects the correlations among individuals' unobservables.

In Chapter 1, I study a linear regression model with a spatially correlated error term. Most of the current econometrics literature assumes cluster sampling (independence between different clusters) in the population; however, this assumption is easily violated. I study the case in which spatial correlation exists between every pair of observations, without assuming independent clusters. Generalized least squares (GLS) can be applied in the cross-sectional dimension, but it is hard to account for all pairwise correlations in a large sample of spatial data, because computing the huge error covariance matrix generally requires large amounts of computer memory. Instead, I use a pseudo generalized least squares (PGLS) approach: a GLS procedure that uses a "tapered" error covariance matrix. Data can be divided into groups according to natural geographic areas; only correlations within groups are accounted for, while correlations between groups are ignored. Since within-group correlations account for most of the correlation among observations, the resulting PGLS estimator loses little efficiency compared to GLS. The PGLS estimator is consistent, asymptotically normal, and computationally easier than GLS. A spatial heteroskedasticity and autocorrelation consistent (HAC) covariance estimator for PGLS, robust to both heteroskedasticity and spatial correlation, is provided. Monte Carlo simulations show that PGLS becomes more efficient than ordinary least squares (OLS) as spatial correlation increases.

Chapter 2 studies nonlinear estimation with spatial data. Generalized estimating equations (GEE) are applied to cross-sectional data with spatial correlation in nonlinear models. I use a partial quasi-maximum likelihood estimator (PQMLE) in the first step and the GEE approach in the second step. Under regularity conditions and assumptions, the asymptotic distribution of the two-step estimator is derived in the framework of M-estimation. I use a probit model for binary data with a latent spatial error and a Poisson model for count data with a multiplicative spatial error to demonstrate the GEE procedures. As the spatial correlation in the underlying error term increases, the correlation in the dependent variable also increases. Monte Carlo simulations compare the efficiency of the PQMLE and GEE. The results show that correctly modeling the structure of the working correlation matrix is important in nonlinear models, in contrast to the linear model. In addition, as spatial correlation increases, more efficient estimation can be obtained by the GEE approach.

Chapter 3 studies conditions for the numerical equality of the OLS, GLS, and Amemiya-Cragg estimators.
Conditions under which the ordinary least squares (OLS) and generalized least squares (GLS) estimators are equal are well known. This chapter extends these results in two ways. First, it gives conditions under which GLS based on one assumed error variance matrix equals GLS based on a different assumed variance matrix. Second, it gives conditions under which GLS equals the GMM estimator of Amemiya (1983) and Cragg (1983).

To my mother and father

ACKNOWLEDGMENTS

I was born in a beautiful town on the east coast of China. As I grew up, I witnessed the dramatic improvement in the lives of the people of this town, as well as the rapid growth of the Chinese economy. Thus I became very interested in the comparatively new "market economy" in China. I realized in high school that my dream was to become an economist, and after college I came to Michigan State University for my graduate study. Behind me the whole time were my kind and diligent parents, who show their deep love for me and have faith in me as always. I am very grateful to my parents for supporting my education both financially and emotionally, especially during graduate school.

At Michigan State University, I have had the great fortune to be taught by Jeffrey Wooldridge, Peter Schmidt, Timothy Vogelsang, and Young Chae Lim. First and foremost, I thank my adviser, Jeffrey Wooldridge, for years of professional guidance. He gave me numerous suggestions and comments on my research and taught me how to do independent research. Peter Schmidt taught me how to do rigorous proofs. I learned very interesting time series material from Tim Vogelsang, and I gained a deeper understanding of spatial statistics from Young Chae Lim. I would like to thank all of them for guiding me during my doctoral study. I thank the economics department for giving me the opportunity to be a teaching assistant. I also thank Jeff, Tim, and the University for providing conference funding. My fellow classmates and the friends I have made are treasures to me. I appreciate their help in both research and life during my study. I am glad I am able to explore econometrics, which is an exciting area worthy of effort. I am on my way to my dream.

TABLE OF CONTENTS

LIST OF TABLES

Chapter 1 Pseudo Generalized Least Squares Regression Estimation with Spatial Data
1.1 Introduction
1.2 A Linear Regression Model
1.2.1 Spatial Error Model
1.2.2 Positive-Definiteness of Spatial Covariance Matrix
1.3 Estimation
1.3.1 Ordinary Least Squares and Generalized Least Squares
1.3.2 Pseudo Generalized Least Squares
1.3.3 Discussion of the Strict Exogeneity Condition
1.4 Consistent Variance-Covariance Estimators
1.4.1 Spatial HAC Estimator
1.4.2 Estimation of the Spatial Correlation Parameter
1.5 Alternative Estimation Approach: Quasi-Maximum Likelihood Estimator
1.6 A Monte Carlo Simulation Study
1.6.1 Data Generating Process
1.6.1.1 Case I
1.6.1.2 Case II
1.6.1.3 Case III
1.6.1.4 Robust Standard Errors
1.6.2 Monte Carlo Simulation Results
1.7 Conclusions

Chapter 2 Estimation of Nonlinear Models in a Quasi-Maximum Likelihood Framework with Spatial Data
2.1 Introduction
2.2 M-estimation
2.2.1 M-estimation
2.2.2 Two-step Estimation
2.3 Two-step Estimator: QMLE and GEE
2.4 Asymptotic Distribution for PQMLE and GEE
2.5 Variance-Covariance Matrix Estimator
2.5.1 Parametric Variance-Covariance Estimator
2.5.2 Nonparametric Variance-Covariance Estimator
2.6 Two Examples: Spatial Probit Model and Poisson Regression Model
2.6.1 Example 1. A Probit Model with Spatial Correlation in the Latent Error
2.6.2 Example 2. A Poisson Model with a Multiplicative Spatial Error
2.7 Monte Carlo Simulations
2.7.1 Sampling Space
2.7.2 Spatial Probit Data
2.7.3 Spatial Count Data
2.8 Conclusions

Chapter 3 Conditions for the Numerical Equality of the OLS, GLS and Amemiya-Cragg Estimators
3.1 Introduction
3.2 Equivalence of OLS and GLS
3.3 Equivalence of Two Different GLS Estimators
3.4 Equivalence of GLS and the Amemiya-Cragg Estimator

Chapter 4 Proofs of Propositions and Theorems
4.1 Proofs of Propositions in Chapter 1
4.2 Proofs of Propositions in Chapter 2
4.3 Proofs of Theorems in Chapter 3

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1 Case I: Mean Parameters and Standard Deviation
Table 1.2 Case I: Robust Standard Errors
Table 1.3 Case I: Estimated Error Variance Parameters
Table 1.4 Case II: Mean Parameters and Standard Deviation
Table 1.5 Case II: Robust Standard Errors
Table 1.6 Case II: Estimated Error Variance Parameters
Table 1.7 Case III: Mean Parameters and Standard Deviation
Table 1.8 Case III: Estimated Error Variance Parameters
Table 2.1 Binary Data N=400, L=4
Table 2.2 Binary Data N=400, L=16
Table 2.3 Correlations in Simulated Count Data
Table 2.4 Count Data N=1600, L=4 and 16

Chapter 1 Pseudo Generalized Least Squares Regression Estimation with Spatial Data

1.1 Introduction

In some economic situations, observations are cross-sectionally correlated. One instance of cross-sectional correlation is spatial correlation, in which the correlations come from the spatial closeness of different individuals. Spillover effects, externalities, network effects, and so on are common causes of spatial correlation. For example, a supply shock in one region will result in production shocks in nearby regions. This type of correlation reflects the correlations among individuals' unobservables. Similar to time series correlation, which depends on the "distance" in time (the time elapsed between two time points), spatial correlation depends on the "distance" in space. In economic studies, spatial data are commonly observed in the two-dimensional Euclidean space $\mathbb{R}^2$; each observation resides at a specific location on the Earth's surface. We would expect that the larger the distance between two individuals, the smaller their correlation. One legitimate distance measure is physical distance, but in economics a distance measure has a wider meaning. Conley (1999) uses a metric of "economic distance" to measure cross-sectional dependence; for example, transportation costs and whether two countries belong to a free trade area are economic distances. Thus spatial correlation based on spatial distance is more complicated than time series dependence. Researchers emphasize the importance of correct distance measurement, and measurement errors are often considered in the estimation, for example in Conley (1999) and Kelejian and Prucha (2007). Here we do not discuss the measurement error problem, and we assume that distances are correctly measured.

Clustered data have a structure very similar to spatial data. The difference is that data in one cluster are correlated through a common cluster effect, and the correlations do not depend on distances. Due to the complicated correlation structure among observations, accounting for all pairwise correlations in the estimation with spatial data is very difficult when the sample size is large: with a sample size of 100, there are already 4950 pairwise correlations to account for. One can still ignore the pairwise correlations in estimation and obtain consistent estimators, but estimation efficiency can be improved by using more information. Spatial data are commonly collected from different geographic areas, such as unemployment rates in different states, foreign investment in different cities of a developing country, or trade volumes between different US states and Canadian provinces. Thus there is usually a natural division of the data into groups. Since observations far away from each other have small or no correlations, we can account only for the correlations within the same groups; by doing so, we actually account for most of the total correlation. Therefore, in this paper, I propose a pseudo generalized least squares estimator (PGLS).
It can account for most correlations and is more efficient than the usual OLS estimator, while the computational burden is reduced by dividing the data into groups and using only the information within groups.

In spatial statistics, "increasing domain" and "fixed domain" asymptotics are commonly used (Cressie 1993). Under increasing domain asymptotics, the sampling space increases without bound, while the minimum distance between locations is bounded below by a positive constant. Under fixed domain asymptotics, the sampling space is fixed and bounded, and the sampling locations become increasingly dense within this region. Zhang (2004) shows that some parameters cannot be consistently estimated by maximum likelihood estimation (MLE) under fixed domain asymptotics. Considering the properties of economic data, which can be obtained by sampling in a large Euclidean space, we focus on increasing domain asymptotics in this paper. That is, given that the number of observations in each group is fixed, the number of groups increases with the sample size as the sampling space gets larger.

In this chapter, Section 1.2 presents a linear regression model with possible spatial correlation in the error term. Section 1.3 presents the OLS, GLS and PGLS estimation methods. Section 1.4 discusses how to estimate the spatial correlation parameter and provides consistent covariance estimators. Section 1.5 explores a quasi-MLE method for estimating the linear regression model. In Section 1.6, Monte Carlo simulation results show the advantages of the PGLS estimation procedure. Section 1.7 concludes.

1.2 A Linear Regression Model

Let $S$ be the space in which the population resides, and let $s_i$, $i = 1, 2, \ldots$ represent a location in $S$. Let $d_{ij}$ be the distance between locations $s_i \in S$ and $s_j \in S$. The space can be one-dimensional (like time series), two-dimensional (a Euclidean plane), or multidimensional. Let $(\mathbf{x}_i, y_i)$ denote the data point sampled at location $s_i$, and let $u_i$ denote the underlying unobservable at $s_i$. Consider a linear regression model:

$y_i = \mathbf{x}_i \boldsymbol{\beta} + u_i, \quad i = 1, 2, \ldots, N,$  (1.1)

where $\mathbf{x}_i$ is a $1 \times K$ vector of regressors with first element $x_{i1} = 1$, and $\boldsymbol{\beta} \equiv (\beta_1, \beta_2, \ldots, \beta_K)'$ is a $K \times 1$ unknown parameter vector. In matrix form, the above equation reads

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u},$  (1.2)

where $\mathbf{y} = (y_1, y_2, \ldots, y_N)'$, $\mathbf{X}$ is an $N \times K$ matrix with $i$th row equal to $\mathbf{x}_i$, and $\mathbf{u} = (u_1, u_2, \ldots, u_N)'$.

Generally we do not know the specific form of the spatial correlation. In this paper, we focus on spatial correlation in the error term. The error term exhibits spatial correlation in the sense that

$\mathrm{Var}(\mathbf{u}|\mathbf{X}, D) = \boldsymbol{\Omega}(D, \lambda),$  (1.3)

where $D = \{d_{ij}\}$, $i, j = 1, 2, \ldots, N$, contains all pairwise distances between observations, and $\lambda$ is a vector of variance-covariance parameters. That is, the spatial correlations between different observations are commonly assumed to depend on the pairwise distances and some other fixed parameters. Note that under random sampling, $\boldsymbol{\Omega}$ would be a diagonal matrix, with all off-diagonal elements equal to zero. For simplicity, in what follows I often drop the conditioning on the explanatory variables and location indicators.

1.2.1 Spatial Error Model

The spatial error model is one of the common models based on underlying random fields, and it has several variants. Instead of writing the specific error covariance matrix as a function of the distance matrix $D$, we can use a spatial weight matrix $\mathbf{W}$, defined as a function of $D$. Let $w_{ij}$ be the $ij$th element of $\mathbf{W}$.
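As a concrete illustration of how a weight matrix can be built from pairwise distances, here is a minimal sketch (not part of the original text; the function name and the row-normalized inverse-distance design are illustrative assumptions, one of several common conventions):

```python
import numpy as np

def inverse_distance_weights(D, cutoff=None):
    """Build a row-normalized inverse-distance spatial weight matrix W from a
    symmetric pairwise-distance matrix D with zeros on the diagonal."""
    N = D.shape[0]
    W = np.zeros((N, N))
    off_diag = ~np.eye(N, dtype=bool)
    W[off_diag] = 1.0 / D[off_diag]          # w_ij = 1/d_ij for i != j
    if cutoff is not None:
        W[D > cutoff] = 0.0                  # drop pairs beyond the cutoff distance
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0          # guard against empty rows
    return W / row_sums                      # each row sums to one
```

Row normalization is only one choice; contiguity indicators or unnormalized inverse distances would fit the same framework.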
Let $\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)'$ be a vector of spatial white noise with mean zero and constant variance. The spatial autoregressive (SAR) error model is the most widely used in spatial econometrics (Anselin, 1988; Anselin et al., 2004). Its error term is modeled as a simultaneous autoregressive random field,

$u_i = \rho \sum_{j=1}^{N} w_{ij} u_j + \varepsilon_i.$  (1.4)

In matrix form, the above equation reads

$\mathbf{u} = \rho \mathbf{W}\mathbf{u} + \boldsymbol{\varepsilon},$  (1.5)

$\boldsymbol{\Omega}(\lambda) = \sigma^2 (\mathbf{I} - \rho\mathbf{W})^{-1} (\mathbf{I} - \rho\mathbf{W}')^{-1}.$  (1.6)

In the spatial moving average (SMA) error model, the error term is modeled as a moving average random field,

$u_i = \lambda \sum_{j=1}^{N} w_{ij} \varepsilon_j.$  (1.7)

Equivalently,

$\mathbf{u} = \lambda \mathbf{W}\boldsymbol{\varepsilon},$  (1.8)

$\boldsymbol{\Omega}(\lambda) = \lambda^2 \sigma^2 \mathbf{W}\mathbf{W}'.$  (1.9)

And in the spatial autoregressive moving average (SARMA) error model,

$u_i = \rho \sum_{j=1}^{N} w_{ij}^{(1)} u_j + \lambda \sum_{j=1}^{N} w_{ij}^{(2)} \varepsilon_j,$  (1.10)

which can be written in matrix notation as

$\mathbf{u} = \rho \mathbf{W}^{(1)}\mathbf{u} + \lambda \mathbf{W}^{(2)}\boldsymbol{\varepsilon},$  (1.11)

$\boldsymbol{\Omega}(\lambda) = \lambda^2 \sigma^2 (\mathbf{I} - \rho\mathbf{W}^{(1)})^{-1} \mathbf{W}^{(2)}\mathbf{W}^{(2)\prime} (\mathbf{I} - \rho\mathbf{W}^{(1)\prime})^{-1}.$  (1.12)

As an alternative to starting with specific models of spatial correlation, we can directly specify the variances and covariances of the error term. For example, the variance can be a constant, and the covariance can be a function $c(\cdot)$ of the variance, the distance between two locations, and an unknown parameter to be estimated. In matrix form,

$\boldsymbol{\Omega}(\lambda)_{ii} = \sigma_i^2, \qquad \boldsymbol{\Omega}(\lambda)_{ij} = \sigma_i^2 \, c(d_{ij}, \rho).$

1.2.2 Positive-Definiteness of Spatial Covariance Matrix

The error covariance matrix $\boldsymbol{\Omega}$ must be positive definite, so for each specified function $c(\cdot)$ one needs to check whether $\boldsymbol{\Omega}$ is positive definite. Let $h \in \mathbb{R}$ temporarily denote the distance, to distinguish it from the derivative sign. According to Christakos (1984), $\boldsymbol{\Omega}$ is positive definite as long as the function $c(\cdot)$ satisfies the following conditions:

i. At $h = 0$, $dc(h)/dh < 0$.
ii. $\lim_{h \to \infty} c(h) = 0$.
iii. $d^2 c(h)/dh^2 \geq 0$.

Derivatives at $h = 0$ are one-sided, taken from the right.

1.3 Estimation

1.3.1 Ordinary Least Squares and Generalized Least Squares

First, consider two estimators. Conditioning on the explanatory variables and location indicators, the OLS estimator is

$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y},$  (1.13)

and the GLS estimator is

$\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1} (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}).$  (1.14)

Suppose $\boldsymbol{\Omega}$ depends on fixed pairwise distances and a parameter vector $\lambda$, so that we can write $\boldsymbol{\Omega}$ as a function of $\lambda$, $\boldsymbol{\Omega}(\lambda)$. Once we find an estimator for $\lambda$, say $\hat{\lambda}$, we can obtain the estimated covariance matrix $\hat{\boldsymbol{\Omega}} \equiv \boldsymbol{\Omega}(\hat{\lambda})$. The feasible GLS (FGLS) estimator is

$\hat{\boldsymbol{\beta}}_{FGLS} = (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1} (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}).$  (1.15)

The following regularity conditions are assumed for deriving the asymptotics.

Assumption 1. The pairwise distances $d_{ij} < \infty$ between locations $s_i$ and $s_j$ are bounded below by $d_0 > 0$, $i = 1, 2, \ldots$

Assumption 2. $\{(\mathbf{x}_i, u_i)\}$ is a mixing sequence on the sampling space, with mixing coefficient $\alpha$ of size $-r/(2(r-1))$, $r \geq 2$, or $\phi$ of size $-r/(r-2)$, $r > 2$.

Assumption 3. (a) $E(\mathbf{x}_i' u_i) = 0$, $i = 1, 2, \ldots$; (b) $E\|\mathbf{x}_i' u_i\|^r < \Delta < \infty$ for all $i$.

Assumption 4. (a) $\mathbf{A}_1^N \equiv E\big(N^{-1}\sum_{i=1}^{N} \mathbf{x}_i'\mathbf{x}_i\big)$ is uniformly positive definite and has full rank $K$; (b) $\mathrm{plim}\, N^{-1}\sum_{i=1}^{N} \mathbf{x}_i'\mathbf{x}_i = \mathbf{A}_1$ as $N \to \infty$.

Assumption 5. (a) $\mathbf{B}_1^N \equiv \mathrm{Var}\big(N^{-1/2}\sum_{i=1}^{N} \mathbf{x}_i' u_i\big) = E\big(N^{-1}\mathbf{X}'\boldsymbol{\Omega}_N\mathbf{X}\big)$ is uniformly positive definite, where $\boldsymbol{\Omega}_N$ is a positive-definite symmetric covariance matrix; (b) $\mathrm{plim}\,\mathbf{B}_1^N = \mathbf{B}_1$ as $N \to \infty$.

Assumption 6. (a) $E(\mathbf{u}|\mathbf{X}) = 0$. This is the so-called strict exogeneity assumption: every $u_i$ is uncorrelated with any function of $\mathbf{X}$. This condition implies Assumption 3(a) and $E\big(N^{-1}\mathbf{X}'\boldsymbol{\Omega}_N^{-1}\mathbf{u}\big) = 0$.
Assumption 7. (a) $\mathbf{B}_2^N \equiv \mathrm{Var}\big(N^{-1/2}\mathbf{X}'\boldsymbol{\Omega}_N^{-1}\mathbf{u}\big) = E\big(N^{-1}\mathbf{X}'\boldsymbol{\Omega}_N^{-1}\mathbf{u}\mathbf{u}'\boldsymbol{\Omega}_N^{-1}\mathbf{X}\big) = E\big(N^{-1}\mathbf{X}'\boldsymbol{\Omega}_N^{-1}\mathbf{X}\big)$; (b) $\mathbf{B}_2^N$ is uniformly positive definite and has full rank $K$; (c) $\mathrm{plim}\,\mathbf{B}_2^N = \mathbf{B}_2$ as $N \to \infty$.

Proposition 1. Under Assumptions 1, 2, 3, 4, and 5, the OLS estimator $\hat{\boldsymbol{\beta}}_{OLS}$ is consistent and $\sqrt{N}(\hat{\boldsymbol{\beta}}_{OLS} - \boldsymbol{\beta}) \to_d N(0, \mathbf{A}_1^{-1}\mathbf{B}_1\mathbf{A}_1^{-1})$. See proof in Chapter 4.

Proposition 2. Under Assumptions 1, 2, 5, 6, and 7, the FGLS estimator $\hat{\boldsymbol{\beta}}_{FGLS}$ is consistent and $\sqrt{N}(\hat{\boldsymbol{\beta}}_{FGLS} - \boldsymbol{\beta}) \to_d N(0, \mathbf{B}_2^{-1})$. See proof in Chapter 4.

OLS is easier to compute than GLS, but it is less efficient. GLS is efficient but can be computationally hard when the sample size is very large, since it involves inverting a large covariance matrix. The following section demonstrates another way to obtain an efficient estimator.

1.3.2 Pseudo Generalized Least Squares

A lot of spatial data are collected together with geographic location information, such as firms in a county or schools in a school district. We can treat the firms in a county or the schools in a school district as a group when we work with empirical data. Spatial correlation can generally exist between any pair of individuals, whether or not they are in the same group. The correlations within a group are much easier to deal with than those across groups, and they should represent most of the correlation among individuals in the sample. PGLS is based only on within-group correlations: the estimation is still weighted by the error covariance matrix, but the matrix is tapered, in the sense that correlations are set to zero whenever two individuals are not in the same group, regardless of whether their true values are zero. The asymptotics rely on the mechanism that the size of each group is fixed while the number of groups increases with the total number of observations.

If we have a large sample, it is hard to account for all pairwise correlations. The reason is that there is insufficient information to estimate the $N \times N$ covariance matrix directly from the data; even asymptotics do not help, since the number of covariances grows with $N^2$, whereas the sample size only grows with $N$ (Anselin 1999). But with fixed group size, it is possible to account for the pairwise correlations within groups. Therefore, it is more realistic to use only the information within a group while ignoring cross-group correlations. Since within-group correlations account for most of the correlation of the observations in a group, it is possible to obtain an estimator that is quite close to the FGLS estimator; this is the pseudo generalized least squares (PGLS) estimator. For panel data, a group is a cross-sectional unit, and within a group observations may be serially correlated. Panel data analysis assumes random sampling in the cross-sectional dimension; in this paper, however, we discuss a situation in which there is only one pure cross section. We estimate the parameters as if there were no cross-group correlations, even though observations in different groups are correlated. This is not the most efficient estimator, but the procedure yields a consistent and almost efficient estimator.

For notational convenience, we write the linear regression model for a group in the population as

$\mathbf{y}_g = \mathbf{X}_g\boldsymbol{\beta} + \mathbf{u}_g,$  (1.16)

where $g$ denotes the $g$th group, $g = 1, 2, \ldots$. The number of observations in group $g$ is $L_g$; $\mathbf{y}_g$ and $\mathbf{u}_g$ are $L_g \times 1$ vectors, and $\mathbf{X}_g$ is $L_g \times K$. Now we can state the assumptions using the group notation.

Assumption P1. (a) The number of observations $L_g$ in each group is fixed, and $L_g/G$ is a small number.
For simplicity, assume there is the same number of observations in each group, so the group size is fixed at $L = N/G$. (b) The number of groups and the sample size both increase as the sampling domain increases.

Assumption P2. Let $d_{gh} < \infty$ denote the pairwise distance between groups $g$ and $h$. $d_{gh}$ is bounded below by $d^* > 0$, and $d_{gh} \in D^*$, where $D^*$ is the space containing all pairwise group distances. [Footnote: One way to define the groupwise distance is to use the smallest distance between two observations belonging to the two different groups. Another is to use the average of all pairwise distances between observations belonging to the different groups. Which definition to use could depend on empirical needs; the importance of how researchers define the groupwise distance is unknown, and we leave this problem for future discussion.]

Assumption P3 (Assumption 2). $\{(\mathbf{x}_i, u_i)\}$ is a mixing sequence on the sampling space, with mixing coefficient $\alpha$ of size $-r/(2(r-1))$, $r \geq 2$, or $\phi$ of size $-r/(r-2)$, $r > 2$.

Assumption P4. Let $\boldsymbol{\Lambda}_g$ be the $L \times L$ within-group covariance matrix for group $g$, and let $\boldsymbol{\Lambda}_N$ be the $N \times N$ matrix that contains only within-group variances and covariances; in other words, $\boldsymbol{\Lambda}_g$ is the $g$th diagonal block of $\boldsymbol{\Lambda}_N$. Let $\boldsymbol{\Omega}_N$ be the true covariance matrix of the sample; note that $\boldsymbol{\Lambda}_N$ is part of $\boldsymbol{\Omega}_N$. Define an $N \times N$ tapering matrix $\boldsymbol{\Gamma}$: if observations $i$ and $j$ are in the same group, the $ij$th entry $\Gamma(i,j)$ equals one; if they are in different groups, $\Gamma(i,j)$ equals zero. Thus $\boldsymbol{\Lambda}_N = \boldsymbol{\Omega}_N \cdot \boldsymbol{\Gamma}$ element by element.

Assumption P5. (a) $E(\mathbf{u}_g|\mathbf{X}_g) = 0$. This implies that for any positive definite $L \times L$ matrix $\boldsymbol{\Lambda}_g$, $E(\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g) = 0$ and $\mathrm{plim}\, G^{-1}\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g = 0$. Note that when the number of groups equals one, Assumption P5 becomes $E(\mathbf{u}|\mathbf{X}) = 0$, the strict exogeneity assumption for GLS. (b) $E\|\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g\|^r$ exists for all $g$.

Assumption P6. (a)

$\mathbf{S}_G \equiv \mathrm{Var}\Big(G^{-1/2}\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g\Big) = E\Big(G^{-1}\sum_{g=1}^{G}\sum_{h=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{u}_g\mathbf{u}_h'\boldsymbol{\Lambda}_h^{-1}\mathbf{X}_h\Big) = E\Big(G^{-1}\sum_{g=1}^{G}\sum_{h=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\boldsymbol{\Omega}_{gh}\boldsymbol{\Lambda}_h^{-1}\mathbf{X}_h\Big),$

where $\boldsymbol{\Omega}_{gh}$ is the $gh$th block of $\boldsymbol{\Omega}_N$; (b) $\mathrm{plim}\,\mathbf{S}_G = \mathbf{S}$ as $G \to \infty$.

Assumption P7. (a) $\mathbf{Q}_g \equiv \mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{X}_g$ is uniformly positive definite and has full rank $K$; (b) $\mathrm{plim}_{G\to\infty}\, G^{-1}\sum_{g=1}^{G}\mathbf{Q}_g = \mathbf{Q}$ and $\mathrm{rank}(\mathbf{Q}) = K$. This condition implies that $\mathrm{plim}\, G^{-1}\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{X}_g$ exists and has full rank.

The PGLS estimator is

$\hat{\boldsymbol{\beta}}_{PGLS} = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}\mathbf{y}_g\Big).$  (1.17)

If we can obtain a consistent estimator of the spatial correlation parameter $\lambda$, we can get a feasible PGLS estimator. Let $\hat{\boldsymbol{\Lambda}}_g = \boldsymbol{\Lambda}_g(\hat{\lambda})$; note that $\lambda$ should be the same as in GLS if the assumptions are true.

$\hat{\boldsymbol{\beta}}_{FPGLS} = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{y}_g\Big).$  (1.18)

Proposition 3. Under Assumptions P1-P7, $\hat{\boldsymbol{\beta}}_{FPGLS}$ is consistent and asymptotically normal: $\sqrt{G}(\hat{\boldsymbol{\beta}}_{FPGLS} - \boldsymbol{\beta}) \to_d N(0, \mathbf{Q}^{-1}\mathbf{S}\mathbf{Q}^{-1})$. See proof in Chapter 4.

Under Assumption P6, a variance-covariance estimator for FPGLS is

$\widehat{\mathrm{AVar}}(\hat{\boldsymbol{\beta}}_{FPGLS}) = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\sum_{h=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\boldsymbol{\Omega}_{gh}(\hat{\lambda})\hat{\boldsymbol{\Lambda}}_h^{-1}\mathbf{X}_h\Big)\Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}.$  (1.19)

In concise form, the above expression reads

$\widehat{\mathrm{AVar}}(\hat{\boldsymbol{\beta}}_{FPGLS}) = (\mathbf{X}'\hat{\boldsymbol{\Lambda}}^{-1}\mathbf{X})^{-1}(\mathbf{X}'\hat{\boldsymbol{\Lambda}}^{-1}\hat{\boldsymbol{\Omega}}\hat{\boldsymbol{\Lambda}}^{-1}\mathbf{X})(\mathbf{X}'\hat{\boldsymbol{\Lambda}}^{-1}\mathbf{X})^{-1}.$  (1.20)

Usually the specific form of the true covariance matrix $\boldsymbol{\Omega}$ is not known, so equation (1.19) cannot be computed.
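To make the computation in (1.17)-(1.18) concrete, here is a minimal sketch of feasible PGLS, assuming the data have already been split into groups and a first-stage $\hat{\lambda}$ has been used to build each working within-group covariance $\hat{\Lambda}_g$ (the function name and list-based interface are illustrative):

```python
import numpy as np

def feasible_pgls(X_groups, y_groups, Lambda_groups):
    """Feasible PGLS, eq. (1.18): weight within groups by the inverse of the
    estimated L x L covariance Lambda_g, ignoring cross-group correlation.
    X_groups, y_groups, Lambda_groups are lists indexed by group g = 1,...,G."""
    K = X_groups[0].shape[1]
    XtWX = np.zeros((K, K))
    XtWy = np.zeros(K)
    for Xg, yg, Lg in zip(X_groups, y_groups, Lambda_groups):
        Lg_inv = np.linalg.inv(Lg)       # cheap: Lg is only L x L
        XtWX += Xg.T @ Lg_inv @ Xg
        XtWy += Xg.T @ Lg_inv @ yg
    return np.linalg.solve(XtWX, XtWy)
```

Because each inversion involves only an $L \times L$ block, the cost is $G \cdot O(L^3)$ rather than the $O(N^3)$ needed to invert the full $N \times N$ matrix in GLS, which is the computational point of the tapering.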
1.3.3 Discussion of the Strict Exogeneity Condition

PGLS depends on Assumption P5(a), $E(\mathbf{u}_g|\mathbf{X}_g) = 0$. This assumption means that the error term is uncorrelated with any of the explanatory variables within the group, but it can be correlated with explanatory variables outside the group. This assumption is less likely to hold if we divide observations into arbitrary groups. The full strict exogeneity condition $E(\mathbf{u}|\mathbf{X}) = 0$, which involves the errors and all explanatory variables, is therefore the more natural requirement.

There are cases in which the strict exogeneity condition is violated. One case is the spatial lag model, in which the strict exogeneity assumption is necessarily false. The spatial lag model can be written as

$\mathbf{Y} = \mathbf{A}\mathbf{Y} + \mathbf{X}\mathbf{B} + \mathbf{u},$  (1.21)

where $\mathbf{u}$ and the explanatory variables must be correlated with each other. In this case even OLS is inconsistent. This model can be estimated in its reduced form by quasi-maximum likelihood; see, for example, Lee (2004). Another case arises when the explanatory variables are spatially correlated: if a shock to one region is correlated with explanatory variables in other regions, strict exogeneity can fail.

1.4 Consistent Variance-Covariance Estimators

1.4.1 Spatial HAC Estimator

Failure to account for spatial dependence may make standard errors computed by standard techniques inconsistent, so standard errors robust to spatial dependence are very important for hypothesis testing and statistical inference. In the time series literature, Newey and West (1994) use a Bartlett kernel to consistently estimate the covariance matrix of a time series process; this is the heteroskedasticity and autocorrelation consistent (HAC) variance-covariance estimator. Driscoll and Kraay (1998) provide a nonparametric covariance matrix estimation technique that yields standard error estimates robust to very general forms of spatial and temporal dependence as the time dimension becomes large. Conley (1999) uses a Bartlett window to estimate the variance matrix robust to cross-sectional correlation, so a rectangular window of correlations is used in that estimation. Kelejian and Prucha (2007) suggest a spatial HAC estimator. In this section, I suggest HAC variance-covariance estimators for PGLS and provide a proof that the HAC estimator for PGLS is consistent.

Define a kernel function $k(d_{ij})$ that depends on the pairwise distance $d_{ij}$. Let $d^*$ be a cutoff point such that if $d_{ij} < d^*$, $\mathrm{Cov}(u_i, u_j)$ is estimated and used; otherwise $\mathrm{Cov}(u_i, u_j)$ is treated as zero. $k(d_{ij})$ decreases as $d_{ij}$ increases. Researchers can use different kernel functions. In this paper I use a Bartlett kernel, but I do not constrain the correlations to a rectangular window as in Conley (1999); instead, the kernel function could be called a Bartlett circle. That is, taking each observation as the center of a circle with radius $d^*$, its covariances with all observations inside the circle are estimated.

The Bartlett circle kernel weighting function is

$k(d_{ij}) = \begin{cases} 1 - d_{ij}/d^* & d_{ij} \leq d^* \\ 0 & d_{ij} > d^* \end{cases}$  (1.22)

A consistent variance-covariance estimator for OLS that is robust to heteroskedasticity and spatial correlation is

$\widehat{\mathrm{Avar}}_{rob}(\hat{\boldsymbol{\beta}}_{OLS}) = \Big(\sum_{i=1}^{N}\mathbf{x}_i'\mathbf{x}_i\Big)^{-1}\Big(\sum_{i=1}^{N}\sum_{j=1}^{N}k(d_{ij})\hat{\mathbf{v}}_i\hat{\mathbf{v}}_j'\Big)\Big(\sum_{i=1}^{N}\mathbf{x}_i'\mathbf{x}_i\Big)^{-1},$  (1.23)

where $\hat{\mathbf{v}}_i = \mathbf{x}_i'\hat{u}_i$ and $\hat{u}_i$ is the OLS residual.
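A minimal sketch of the estimator in (1.23), assuming the full distance matrix fits in memory; the vectorized form uses the identity $\sum_i\sum_j k(d_{ij})\hat{\mathbf{v}}_i\hat{\mathbf{v}}_j' = \mathbf{V}'\mathbf{K}\mathbf{V}$, where $\mathbf{V}$ stacks the rows $\hat{\mathbf{v}}_i = \mathbf{x}_i'\hat{u}_i$ (the function name is illustrative):

```python
import numpy as np

def ols_spatial_hac(X, u_hat, D, d_star):
    """Spatial HAC variance for OLS, eq. (1.23), with the Bartlett circle
    kernel k(d) = max(0, 1 - d/d_star). X is N x K, u_hat holds the OLS
    residuals, and D is the N x N distance matrix (D[i, i] = 0, so each
    own product v_i v_i' receives full weight k(0) = 1)."""
    K_mat = np.clip(1.0 - D / d_star, 0.0, None)   # kernel weights k(d_ij)
    V = X * u_hat[:, None]                         # row i equals x_i * u_i
    meat = V.T @ K_mat @ V                         # sum_ij k(d_ij) v_i v_j'
    bread = np.linalg.inv(X.T @ X)
    return bread @ meat @ bread
```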
A consistent variance-covariance estimator for FGLS that is robust to misspecification of the variance-covariance structure is

$\widehat{\mathrm{Avar}}_{rob}(\hat{\boldsymbol{\beta}}_{FGLS}) = (\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1}\,\mathbf{X}'[\mathbf{K}(D)\cdot\check{\mathbf{v}}\check{\mathbf{v}}']\mathbf{X}\,(\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X})^{-1},$  (1.24)

where $\hat{\boldsymbol{\Omega}} \equiv \boldsymbol{\Omega}(\hat{\lambda})$, $\check{\mathbf{v}} = \boldsymbol{\Omega}(\hat{\lambda})^{-1}\check{\mathbf{u}}$, and $\check{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{FGLS}$. $\mathbf{K}(D)$ is an $N \times N$ kernel matrix, and $\mathbf{K}(D)\cdot\check{\mathbf{v}}\check{\mathbf{v}}'$ denotes the element-by-element product of $\mathbf{K}(D)$ and $\check{\mathbf{v}}\check{\mathbf{v}}'$. The $ij$th element of $\mathbf{K}(D)$ is $k(d_{ij})$, the same as given above, and $\check{\mathbf{v}}\check{\mathbf{v}}'$ is a matrix of products of pairwise (weighted) residuals. Because of possible misspecification of the variance-covariance structure $\boldsymbol{\Omega}$, the spatial correlation may not be fully accounted for, or may be treated incorrectly; some spatial correlation may therefore remain, and we use the FGLS residuals to get a robust estimator.

In the PGLS estimation, correlations among observations within the same group are used and those across groups are not. Intuitively, then, we want the HAC estimator to account only for the correlations across different groups, so the kernel weights should be put only on the correlations among observations in different groups.

Let $\tilde{\mathbf{v}}_g = \mathbf{X}_g'\boldsymbol{\Lambda}_g(\hat{\lambda})^{-1}\tilde{\mathbf{u}}_g$ and $\tilde{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{FPGLS}$. Define the kernel function for FPGLS:

$p(d_{gh}) = \begin{cases} 1 - d_{gh}/d^{**} & d_{gh} \leq d^{**} \\ 0 & d_{gh} > d^{**} \end{cases}$  (1.25)

where $d_{gh}$ is the distance between groups $g$ and $h$. Let $\mathbf{P}$ be the $G \times G$ matrix with $gh$th element $P_{gh} = p(d_{gh})$, $g, h = 1, 2, \ldots, G$. Let $\mathbf{O}$ be a square matrix of ones whose dimension equals $L = N/G$, the number of individuals within each group. Then the full $N \times N$ kernel matrix is $\mathbf{P} \otimes \mathbf{O}$. The HAC estimator robust to groupwise spatial correlation and misspecification is

$\widehat{\mathrm{Avar}}_{rob}(\hat{\boldsymbol{\beta}}_{FPGLS}) = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\sum_{h=1}^{G}p(d_{gh})\tilde{\mathbf{v}}_g\tilde{\mathbf{v}}_h'\Big)\Big(\sum_{g=1}^{G}\mathbf{X}_g'\hat{\boldsymbol{\Lambda}}_g^{-1}\mathbf{X}_g\Big)^{-1}.$  (1.26)

Following the proof of Theorem 2 in Kelejian and Prucha (2007), we can complete the proof of consistency of this estimator. Kelejian and Prucha (2007) prove consistency of the spatial HAC estimator for the linear regression model in the OLS setting, which is the special case of PGLS with group size equal to one; the HAC estimator for PGLS thus extends theirs.

Proposition 4. $G\cdot\widehat{\mathrm{Avar}}_{rob}(\hat{\boldsymbol{\beta}}_{FPGLS}) - \mathbf{Q}^{-1}\mathbf{S}\mathbf{Q}^{-1} \to_p 0$. See proof in Chapter 4.

1.4.2 Estimation of the Spatial Correlation Parameter

A consistent estimator for $\rho$ can be obtained by a minimum distance estimator,

$\hat{\rho} = \arg\min_{\rho} \sum_{i=1}^{N}\sum_{j\neq i}[\hat{u}_i\hat{u}_j - \Omega_{ij}(d_{ij},\rho)]^2.$

A common structure for $\Omega_{ij}$ in spatial statistics is the exponential form, $\Omega_{ij} = \sigma^2\exp(-d_{ij}/\rho)$. An estimator for $\sigma^2$ is

$\hat{\sigma}^2 = N^{-1}\sum_{i=1}^{N}\hat{u}_i^2.$

1.5 Alternative Estimation Approach: Quasi-Maximum Likelihood Estimator

The quasi-maximum likelihood estimator (QMLE) using the estimated covariance matrix derived from the OLS residuals is the same as PGLS. Suppose the same assumptions hold as before: $E(\mathbf{y}_g|\mathbf{X}_g) = \mathbf{X}_g\boldsymbol{\beta}$ and $\mathrm{Var}(\mathbf{u}_g|\mathbf{X}_g) = \sigma^2\boldsymbol{\Lambda}_g(\rho)$. We can write the quasi-log-likelihood for group $g$ using a multivariate normal distribution as

$l_g(\boldsymbol{\beta},\rho,\sigma^2) = \log f_g(\mathbf{y}_g|\mathbf{X}_g;\boldsymbol{\beta},\lambda) = -\frac{L}{2}\log(2\pi) - \frac{L}{2}\log\sigma^2 - \frac{1}{2}\log|\boldsymbol{\Lambda}_g(\rho)| - \frac{1}{2\sigma^2}(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta}).$  (1.27)

The QMLE estimators $\hat{\boldsymbol{\beta}}_{QMLE}$, $\hat{\rho}_{QMLE}$ and $\hat{\sigma}^2_{QMLE}$ jointly solve

$\max_{\theta\in\Theta} L_G(\boldsymbol{\beta},\rho,\sigma^2) = \sum_{g=1}^{G}l_g(\boldsymbol{\beta},\lambda) = -\frac{GL}{2}\log(2\pi) - \frac{GL}{2}\log\sigma^2 - \frac{1}{2}\sum_{g=1}^{G}\log|\boldsymbol{\Lambda}_g(\rho)| - \frac{1}{2\sigma^2}\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta}).$  (1.28)

Since the first term of the log likelihood is a constant, we can drop it from the maximization problem. The log likelihood becomes

$L_G(\boldsymbol{\beta},\rho,\sigma^2) = -\frac{GL}{2}\log\sigma^2 - \frac{1}{2}\sum_{g=1}^{G}\log|\boldsymbol{\Lambda}_g(\rho)| - \frac{1}{2\sigma^2}\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta}).$  (1.29-1.30)
For a given $\rho$, maximizing the log likelihood gives

$\hat{\boldsymbol{\beta}}_{QMLE} = \Big(\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g(\rho)^{-1}\mathbf{X}_g\Big)^{-1}\Big(\sum_{g=1}^{G}\mathbf{X}_g'\boldsymbol{\Lambda}_g(\rho)^{-1}\mathbf{y}_g\Big)$  (1.31)

and

$\hat{\sigma}^2_{QMLE} = \frac{1}{N}\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE}).$  (1.32)

Plugging $\hat{\boldsymbol{\beta}}_{QMLE}$ and $\hat{\sigma}^2_{QMLE}$ into the log likelihood yields a concentrated likelihood. Maximizing the concentrated log likelihood $L(\rho)$ gives $\hat{\rho}_{QMLE}$, and in turn $\hat{\boldsymbol{\beta}}_{QMLE}$ and $\hat{\sigma}^2_{QMLE}$:

$L(\rho) = -\frac{1}{2}\log\Big[\frac{1}{N}\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})\Big] - \frac{1}{2}\sum_{g=1}^{G}\log|\boldsymbol{\Lambda}_g(\rho)| - \frac{1}{2}$
$= \frac{1}{2}[\log(N) - 1] - \frac{1}{2}\log\Big[\sum_{g=1}^{G}(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})'\boldsymbol{\Lambda}_g^{-1}(\rho)(\mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}_{QMLE})\Big] - \frac{1}{2}\sum_{g=1}^{G}\log|\boldsymbol{\Lambda}_g(\rho)|.$  (1.33)

Note that plugging $\hat{\rho}$ into $\hat{\boldsymbol{\beta}}_{QMLE}$ gives $\hat{\boldsymbol{\beta}}_{FPGLS}$.

We can also get a HAC estimator for $\hat{\boldsymbol{\beta}}_{QMLE}$. The Hessian for each group $g$ is $\mathbf{H}_g = \begin{pmatrix} \mathbf{H}_{g11} & \mathbf{H}_{g12} \\ \mathbf{H}_{g21} & \mathbf{H}_{g22} \end{pmatrix}$. Let $\mathbf{a}_g = -E(\mathbf{H}_g|\mathbf{X}_g, D_g)$; then

$\mathbf{a}_g = \begin{pmatrix} \mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}(\lambda)\mathbf{X}_g & 0 \\ 0 & \frac{1}{2}\nabla_\lambda\boldsymbol{\Lambda}_g(\lambda)'[\boldsymbol{\Lambda}_g^{-1}(\lambda)\otimes\boldsymbol{\Lambda}_g^{-1}(\lambda)]\nabla_\lambda\boldsymbol{\Lambda}_g(\lambda) \end{pmatrix}.$

Define $\mathbf{H} \equiv \frac{1}{G}\sum_{g=1}^{G}\mathbf{H}_g$ and $\mathbf{A} \equiv E[-\mathbf{H}] = \frac{1}{G}\sum_{g=1}^{G}E(-\mathbf{H}_g|\mathbf{X}_g, D_g) = \frac{1}{G}\sum_{g=1}^{G}\mathbf{a}_g$, with the estimator

$\hat{\mathbf{A}} \equiv \frac{1}{G}\sum_{g=1}^{G}\mathbf{a}_g(\hat{\boldsymbol{\beta}},\hat{\lambda}) = \frac{1}{G}\sum_{g=1}^{G}\begin{pmatrix} \mathbf{X}_g'\boldsymbol{\Lambda}_g^{-1}(\hat{\lambda})\mathbf{X}_g & 0 \\ 0 & \frac{1}{2}\nabla_\lambda\boldsymbol{\Lambda}_g(\hat{\lambda})'[\boldsymbol{\Lambda}_g^{-1}(\hat{\lambda})\otimes\boldsymbol{\Lambda}_g^{-1}(\hat{\lambda})]\nabla_\lambda\boldsymbol{\Lambda}_g(\hat{\lambda}) \end{pmatrix}.$

Note that $\hat{\mathbf{A}}$ does not involve $\boldsymbol{\beta}$. Let

$\mathbf{B}_G \equiv \mathrm{Var}\Big(\frac{1}{\sqrt{G}}\sum_{g=1}^{G}\mathbf{s}_g\Big) = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G}E(\mathbf{s}_g\mathbf{s}_h'),$

and, with $\hat{\mathbf{s}}_g = \mathbf{s}_g(\hat{\boldsymbol{\beta}},\hat{\lambda})$,

$\hat{\mathbf{B}}_{HAC} = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G}K[\mathrm{dist}(g,h)]\,\hat{\mathbf{s}}_g\hat{\mathbf{s}}_h'.$

Thus the HAC estimator of the asymptotic variance of $\hat{\theta}$ is $\frac{1}{G}\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}_{HAC}\hat{\mathbf{A}}^{-1}$.

Proposition 5. Under certain conditions, $\hat{\boldsymbol{\beta}}_{QMLE}$ is consistent and has an asymptotic normal distribution. See proof in Chapter 4.

1.6 A Monte Carlo Simulation Study

In this section, we provide Monte Carlo simulation results which show that the PGLS estimator performs reasonably well in finite samples. In the simulations, we study the properties of the PGLS estimator compared with the OLS and GLS estimators, and we also provide robust standard errors for OLS and PGLS.

1.6.1 Data Generating Process

The total number of observations is $N$, which takes the values 400 and 1600 separately. The data are generated on a $\sqrt{N}\times\sqrt{N}$ lattice, with locations represented by the coordinates $(r, s)$, $r, s = 1, 2, \ldots, \sqrt{N}$. The distance $d_{ij}$ between locations $i$ and $j$ is the Euclidean distance: if $A = (a_1, a_2)$ and $B = (b_1, b_2)$ are two points on the lattice, their Euclidean distance is $\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2}$. Other distance measures may also be considered, such as a time series distance (a special case of Euclidean distance in which the locations lie on a straight line), the Manhattan distance $|a_1-b_1| + |a_2-b_2|$, and the maximum coordinate-wise distance $\max(|a_1-b_1|, |a_2-b_2|)$. For simplicity, only the Euclidean distance is used in this paper. We consider three cases of data-generating processes with different correlation structures.

1.6.1.1 Case I

In the first case, the exponential form of spatial correlation is used. We generate data as follows:

(1) $y_i = \alpha + x_i\beta + u_i$, $\alpha = 1$, $\beta = 1$.

(2) Specify the covariance matrix $\boldsymbol{\Omega}$ for the error term:
standard normal random numbers and Ω is the same as in (2) and (3) but with ρ = 1 without loss of generality (w.l.o.g.). The simulation results can be found in Table 1.1, 1.2, and 1.3. In this estimation process, the spatial correlation parameter is estimated by a minimum distance estimator ρ = min ˆ 1.6.1.2 dij [ˆi uj − σ 2 exp(− ρ )]2 for i = j. uˆ Case II In the second case, the spatial correlation is a function of the inverse of distance. The data generating process is as follows: (1) yi = α + xi β + ui , α = 1, β = 1. (2) Specify the covariance matrix Ω for the error term: Ωij = σ 2 dρ , σ 2 = 1, ρ = 0, 0.2, 0.4, 0.6. In this case, the corresponding pairwise corij relation for the error term when the pairwise distance is one is equal to ρ. (3) Generate the error term: u = Ω1/2 , where is a vector of i.i.d. standard normal random numbers. ui is the i th element in the vector u. Ω1/2 is obtained by Cholesky decomposition. (4) xi is a spatially correlated variable. The vector x = Ω1/2 ξ, where ξ is a vector of i.i.d. 22 standard normal random numbers and Ω is the same as in (2) and (3) but ρ = 1 w.l.o.g.. The simulation results can be found in Table 1.4 and 1.5, 1.6. In this estimation process, the spatial correlation parameter is estimated as the average of ui uj /ˆ 2 for pairwise distance ˆˆ σ dij = 1, which means, ρ is estimated only using the pairs of observations whose distances are one. 1.6.1.3 Case III For completeness, another data generating process which uses a spatial weight matrix that is popular in spatial econometrics is as follows: (1) yi = α + xi β + ui , α = 1, β = 1. (2) There are N/L independent districts, where L is the number of observations within a district. Specify the spatial weight matrix W for the error term: wij = 1/ (L − 1) if i = j and i and j are in the same district. wij = 0 if i = j or i and j are in different districts. In this case, the corresponding pairwise correlation for the error term within the district is equal to ρ/ (L − 1). (3) Generate the error term: u = ρWu + , σ 2 = 1, ρ = 0, 0.2, 0.4, 0.6 , where is a vector of i.i.d. standard normal random numbers. ui is the i th element in the vector u. (4) xi is a standard normal variable N (0, 1). The simulation results can be found in Table 1.7 and 1.8. In this estimation process, the spatial correlation parameter is estimated in the following way: ρ = (Wˆ ) (Wˆ ) ˆ u u 23 −1 (Wˆ ) u. u ˆ (1.34) 1.6.1.4 Robust Standard Errors The simulations also provide robust standard errors for the OLS and PGLS estimators since these estimators ignore spatial correlations to some extent. We do not discuss the optimal kernal variance estimators and we only provide feasible variance estimators for the estimation of robust standard errors. For the OLS estimator, we use the formula in equation (1.23). √ The cutoff point d∗ is set as 3 N . When N = 400, d∗ = 7.368. Thus, for observation i, its covariances with any observation that is within 7.4 distance units are accounted for, similar to the case when N = 1600. For the PGLS estimators, we use the formula in equation (1.26). The kernel function in the simulation is specified as in equation (1.25). First, the groupwise distance dgh is specified as the distance between centers of the location. Let the cutoff point √ d∗∗ be 3 N . If dgh ≤ d∗∗ , the covariances of the observations in group g and h will be used. To keep it simple, we put the same weight 1 − dgh /d∗∗ on each of those covariances. 
Like the lags in time series, how to choose the cutoff point needs consideration of economics of the problems. 1.6.2 Monte Carlo Simulation Results For Case I, Table 1.1 shows that as ρ increases, the standard deviation of the OLS estimator increases, which means OLS becomes less efficient as spatial correlation increases in the sample. PGLS performance becomes better when compared to OLS as ρ increases. Using a group size equal to four, which is very small, we can gain back most of the efficiency. As we increase the group size, the feasible PGLS gains more efficiency. When we use a group size equal to 16, the PGLS estimator is almost as efficient as the GLS estimator. Table 1.2 provides robust standard errors for OLS, PGLS with group size 4 and 16. The 24 spatial correlation parameter is calculated using a minimum distance estimator. Simulations show that this estimator is biased when the spatial correlation is high. However, even if the spatial correlation parameter is not consistently estimated, the feasible PGLS estimator can still gain efficiency back. This is because the estimated covariance (even though not consistent) captures some structure of the true covariance. Case II and Case I have very similar results regarding PGLS behavior. But the estimation of spatial correlation in Case II is less biased and more efficient than in Case I. In Case III, independent groups are generated, thus GLS is PGLS in this case. The estimation of spatial correlation as in equation 1.34 is an OLS estimator using the residuals. As in Table 1.7, the estimator for ρ is biased upwards, though the GLS estimator still behaves well. 25 Table 1.1: Case I: Mean Parameters and Standard Deviation N=400, ρ 0.1 0.5 1 2 5 T=2000, β=1 ˆ ˆ ˆ ˆ βols s.d.(βols ) βgls s.d.(βgls ) 1.000 0.050 1.000 0.050 1.000 0.056 1.000 0.054 1.000 0.068 0.999 0.051 1.002 0.081 1.000 0.040 1.002 0.088 1.000 0.028 N=1600, T=2000, β = 1 ˆ4 βpgls 1.000 1.000 1.000 1.001 1.001 ˆ4 s.d.(βpgls ) 0.050 0.055 0.057 0.051 0.041 ˆ ˆ ˆ ˆ ˆ4 ˆ4 ρ βols s.d.(βols ) βgls s.d.(βgls ) βpgls s.d.(βpgls ) 0.1 0.999 0.025 0.999 0.025 0.999 0.025 0.5 0.999 0.028 0.999 0.027 0.999 0.027 1 0.999 0.034 0.999 0.025 0.999 0.028 2 1.000 0.043 0.999 0.020 0.999 0.025 5 0.999 0.049 0.999 0.013 0.999 0.018 s.d.() means standard deviation in the simulation. ˆ4 ˆ16 βpgls and βpgls are the PGLS estimators using group size equal to ˆ16 βpgls 1.000 1.000 1.000 1.000 1.002 ˆ16 s.d.(βpgls ) 0.050 0.054 0.053 0.044 0.033 ˆ16 βpgls 0.999 0.999 0.999 0.999 0.999 ˆ16 s.d.(βpgls ) 0.025 0.027 0.026 0.022 0.015 4 and 16 separately. Table 1.2: Case I: Robust Standard Errors N=400, ˆ ρ s.e.∗ (βols ) 0.1 0.038 0.5 0.038 1 0.037 2 0.035 5 0.030 N=1600, T=2000 ˆ r.s.e.(βols ) 0.047 0.050 0.058 0.066 0.068 T=2000 ˆ4 r.s.e.(βpgls ) 0.047 0.049 0.051 0.052 0.048 ˆ16 r.s.e.(βpgls ) 0.047 0.050 0.051 0.048 0.041 ˆ ˆ ˆ4 ˆ16 ρ s.e.∗ (βols ) r.s.e.(βols ) r.s.e.(βpgls ) r.s.e.(βpgls ) 0.1 0.017 0.024 0.024 0.024 0.5 0.017 0.026 0.025 0.025 1 0.017 0.032 0.027 0.027 2 0.017 0.038 0.027 0.026 5 0.015 0.042 0.026 0.023 ∗ (β ) is the average usual OLS standard error. s.e. ˆols r.s.e.() is the average spatial correlation robust standard error. 26 Table 1.3: Case I: Estimated Error Variance Parameters N=400, ρ 0.1 0.5 1 2 5 ρ ˆ 0.134 0.480 0.919 1.554 2.297 N=1600, T=2000, σ 2 = 1 s.d.(ˆ) ρ 0.125 0.079 0.166 0.357 0.563 σ2 ˆ 0.995 0.992 0.980 0.943 0.825 s.d. σ 2 ˆ 0.072 0.074 0.095 0.150 0.248 T=2000, σ 2 = 1 ρ ρ ˆ s.d.(ˆ) ρ σ2 ˆ s.d. 
σ 2 ˆ 0.1 0.121 0.102 0.999 0.035 0.5 0.495 0.040 0.998 0.037 1 0.979 0.097 0.995 0.048 2 1.855 0.310 0.984 0.084 5 3.511 0.824 0.930 0.175 ρ is the average of estimates for ρ. ˆ σ 2 is the average of estimates for σ 2 . ˆ Table 1.4: Case II: Mean Parameters and Standard Deviation N=400, ρ 0 0.2 0.4 0.6 T=2000, β=1 ˆ ˆ ˆ ˆ βols s.d.(βols ) βgls s.d.(βgls ) 1.000 0.051 1.002 0.051 1.000 0.057 0.999 0.051 1.000 0.062 0.999 0.044 1.001 0.068 1.000 0.038 N=1600, T=2000, β = 1 ˆ4 βpgls 1.000 1.000 1.000 1.000 ˆ4 s.d.(βpgls ) 0.051 0.053 0.050 0.042 ˆ ˆ ˆ ˆ ˆ4 ˆ4 ρ βols s.d.(βols ) βgls s.d.(βgls ) βpgls s.d.(βpgls ) 0 0.999 0.025 0.999 0.087 0.999 0.025 0.2 0.999 0.029 0.999 0.025 0.999 0.027 0.4 0.999 0.033 0.999 0.022 0.999 0.025 0.6 0.999 0.037 1.002 0.018 1.000 0.021 s.d.() means standard deviation in the simulation. ˆ4 ˆ16 βpgls and βpgls are the PGLS estimators using group size equal to 27 ˆ16 βpgls 1.000 1.000 1.000 1.000 ˆ16 s.d.(βpgls ) 0.053 0.052 0.047 0.035 ˆ16 βpgls 0.999 0.999 0.999 1.000 ˆ16 s.d.(βpgls ) 0.025 0.026 0.023 0.017 4 and 16 separately. Table 1.5: Case II: Robust Standard Errors N=400, T=2000 ˆ ρ s.e.∗ (βols ) 0 0.037 0.2 0.036 0.4 0.035 0.6 0.034 N=1600, ˆ r.s.e.(βols ) 0.047 0.050 0.053 0.056 T=2000 ˆ4 r.s.e.(βpgls ) 0.067 0.068 0.067 0.062 ˆ16 r.s.e.(βpgls ) 0.095 0.093 0.086 0.069 ˆ ˆ ˆ4 ˆ16 ρ s.e.∗ (βols ) r.s.e.(βols ) r.s.e.(βpgls ) r.s.e.(βpgls ) 0 0.017 0.024 0.037 0.050 0.2 0.017 0.027 0.038 0.051 0.4 0.017 0.030 0.037 0.047 0.6 0.016 0.032 0.035 0.038 ˆ s.e.∗ (βols ) is the average usual OLS standard error. r.s.e.() is the average spatial correlation robust standard error. Table 1.6: Case II: Estimated Error Variance Parameters N=400, ρ 0 0.2 0.4 0.6 T=2000, σ 2 = 1 ρ ˆ -0.003 0.171 0.352 0.541 N=1600, s.d.(ˆ) ρ σ2 ˆ s.d. σ 2 ˆ 0.036 0.995 0.072 0.048 0.968 0.078 0.055 0.943 0.098 0.051 0.919 0.127 2=1 T=2000, σ ρ ρ ˆ s.d.(ˆ) ρ σ2 ˆ s.d. σ 2 ˆ 0 -0.001 0.018 0.999 0.035 0.2 0.186 0.027 0.985 0.041 0.4 0.377 0.032 0.972 0.057 0.6 0.573 0.029 0.959 0.077 ρ is the average of estimates for ρ. ˆ σ 2 is the average of estimates for σ 2 . ˆ 28 Table 1.7: Case III: Mean Parameters and Standard Deviation Case III ρ 0 0.2 0.4 0.6 N=400, T=2000, ˆ ˆ βols s.d.(βols ) 1.000 0.051 0.999 0.053 1.000 0.057 1.001 0.072 N=1600, T=2000, β=1 ˆ ˆ βgls s.d.(βgls ) 0.999 0.051 1.001 0.052 1.000 0.050 1.000 0.048 β=1 ˆ ˆ ˆ ˆ ρ βols s.d.(βols ) βgls s.d.(βgls ) 0 1.000 0.026 1.000 0.026 0.2 1.001 0.026 1.001 0.026 0.4 1.000 0.029 1.001 0.025 0.6 1.001 0.036 1.002 0.024 s.d.() means standard deviation in the simulation. ˆ ˆ βols and βf gls are the average of OLS and FGLS estimators. Table 1.8: Case III: Estimated Error Variance Parameters N=400, ρ 0 0.2 0.4 0.6 ρ ˆ −0.018 0.353 0.648 0.852 N=1600, T=2000, σ 2 = 1 s.d.(ˆ) ρ 0.128 0.092 0.055 0.024 σ2 ˆ 0.992 0.982 0.948 0.896 s.d. σ 2 ˆ 0.070 0.071 0.072 0.071 T=2000, σ 2 = 1 ρ ρ ˆ s.d.(ˆ) ρ σ2 ˆ s.d. σ 2 ˆ 0 -0.004 0.062 0.998 0.036 0.2 0.365 0.044 0.998 0.036 0.4 0.656 0.026 0.995 0.036 0.6 0.856 0.015 0.985 0.036 ρ is the average of estimates for ρ. ˆ σ 2 is the average of estimates for σ 2 . ˆ 29 1.7 Conclusions In a linear regression model with spatial data, we can use a weighted least squares to improve efficiency, even if the weight might be misspecified. As long as the misspecified structure can capture some properties of the true variance structure of the spatial data, efficiency can be improved, which is the idea of ”pseudo GLS”. 
Chapter 2 Estimation of Nonlinear Models in a Quasi-Maximum Likelihood Framework with Spatial Data

2.1 Introduction

Many empirical economic and social studies involve discrete data that exhibit spatial correlation due to the geographic locations of the individuals or agents of interest. For instance, the number of patents a firm receives is correlated with the numbers received by nearby firms; this may be due to a technology spillover effect or to a common local policy aimed at encouraging new technology. Another example is the neighborhood effect: there is a causal effect between an individual's decision whether to own stocks and the average stock market participation of the individual's community (Brown, Smith, & Weisbenner 2008). The first example involves count data and the second a binary response; both involve discrete data. Nonlinear models are suitable for studying discrete variables. Unfortunately, there is not much literature on the estimation of nonlinear models with discrete spatial data. Because of the spatial correlation, the discrete variables are not independent, and both the nonlinearity and the correlation make estimation difficult.

Maximum likelihood estimation (MLE) is a widely used method for estimating nonlinear models. To use MLE, one needs to specify the joint distribution of the spatial random variables, which includes correctly specifying the marginal and conditional distributions. However, given a spatial data set, the dependence structure is generally unknown, and if the joint distribution of the variables is misspecified, MLE is generally not consistent. Another estimation method is quasi-maximum likelihood estimation (QMLE). Using a density that belongs to a linear exponential family (LEF), QMLE is consistent if we correctly specify the conditional mean, even with other features of the density misspecified. In the panel data case, pooled (partial) QMLE, which ignores serial correlation, is consistent under some regularity conditions (Wooldridge 2010). In their working paper, Wang, Iglesias and Wooldridge (2012) use a bivariate probit partial MLE to improve estimation efficiency in a spatial probit model. Using their approach, we would need to correctly specify the marginal distribution of the binary response variable conditional on the covariates and distance measures. [Footnote: A sample of spatial data is collected with a set of geographic locations. Spatial dependence is usually characterized by distances between observations; a distance measure is how one defines these distances, and physical distance or economic distance are two options. Information about agents' physical locations is commonly imprecise, e.g., only the zip code is known. Conley and Molinari (2007) deal with the inference problem when there are distance errors. In this chapter I assume there are no measurement errors in pairwise distances.] Since the bivariate marginal distribution of a spatial multivariate normal distribution is bivariate normal, one can derive the bivariate normal distribution under some distributional assumptions. There are two problems with the approach of Wang, Iglesias and Wooldridge (2012). First, the computation is already hard for a bivariate distribution, and a multivariate marginal distribution of higher dimension is even more computationally demanding. Second, it also requires correct specification of the marginal bivariate normal distribution to obtain consistency.
In fact, we can have less restrictive distributional assumptions than those required in bivariate partial MLE. Suppose we only specify the mean and the working variances and covariances2 . Using QMLE in the LEF, we can get consistent estimators. Even if the the variances and covariances are not correctly specified, we can still consistently estimate the mean parameters as well as the average partial effects, which are more interesting. In the literature, the QMLE and GEE approach is used in panel data models to get more efficient estimators (Gourierous, Monfort, & Trongnon 1984). In this paper, I will demonstrate how to use the QMLE and GEE approach in a spatial data setting to get a consistent and more efficient estimator. Generalized least squares (GLS) can be used to improve the estimation efficiency in a linear regression model even if the variance covariance structure is misspecified. Similarly, generalized estimating equations (GEE) or weighted multivariate nonlinear least squares (WMNLS) are used in nonlinear panel data models and system of equations to obtain more efficient conditional mean parameters. Generally we expect that GEE can give more efficient estimators compared to PMLE, which uses only the marginal density to get the consistent estimators. To use the QMLE in the spatial data setting, I first give a series of assumptions, based on which M-estimators are consistent for the spatial processes. To derive the asymptotics for the M-estimators we have to use a uniform law of large numbers (ULLN) and a central limit theorem (CLT). These limit theorems are the fundamental building blocks for the asymptotic theory of nonlinear spatial M-estimators, e.g. maximum likelihood estimators (MLE) and generalized method of moments estimators (GMM) (Jenish and Prucha, 2009). 2 The true variance covariance matrix is generally unknown. By specifying a working variance covariance matrix, one can capture some of the correlation structure between observations. 33 Conley (1999) makes an important contribution towards developing an asymptotic theory of GMM estimators for spatial processes. He utilizes Bolthausen’s (1982) CLT for stationary random fields. Jenish and Prucha (2009) provide a ULLN and a CLT for spatial data including nonstationary spatial processes. Using theorems in Jenish and Prucha (2009), one can analyze more interesting economic phenomena. For example, real estate prices usually shoot up as one moves from the periphery to the center of a big city. While I will not discuss trending processes in this paper, Cressie (1993) provides numerous examples of trending spatial processes. In Section 2, the M-estimator framework under the spatial data context is established. A series of assumptions are given based on Jenish and Prucha (2009) under which M-estimators are consistent and have an asymptotic normal distribution. In Section 3, I propose a twostep GEE estimator in a QMLE framework. In Section 4, the asymptotic distributions for QMLE and GEE for spatial data are derived. In Section 5, consistent variance covariance estimators are provided for the nonlinear estimators for spatial data. In Section 6, we look in detail at a Probit model with spatial correlation in the error term of the latent variable and a count data model with a multiplicative spatial error term. Section 7 contains Monte Carlo simulation results which compare efficiency of different estimation methods for the two nonlinear models explored in the previous section. Section 8 concludes. Section 10 is the appendix. 
2.2 M-estimation In this section, I will examine the M-estimation framework of nonlinear models with spatial data. Unlike linear models, a very important feature of nonlinear models is that estimators 34 cannot be obtained in a closed form, which requires new tools for asymptotic analysis: we need uniform law of large numbers (ULLN) and a central limit theorem (CLT). The GEE procedure is a two-step M-estimation method within the QMLE framework. The Mestimator with spatial data is proved to be consistent and asymptotically normal under certain assumptions. 2.2.1 M-estimation Assume that spatial processes reside on a regular lattice3 D in a Euclidean space, Rd , d ≥ 1.4 Let s denote a location in D. Suppose we have a sample of N observations. Let DN ⊆ D contains the location information for this sample. Let si denote the location of the observation i, i = 1, 2, ..., N. Let dij be the pairwise distance between location si and location sj . That is, dij is the pairwise distance between observation i and observation j. I first give some regularity conditions for the spatial processes I am studying and then I give a general framework for M-estimators with spatial data. Following Jenish and Prucha (2009) Definition 1, I adopt the follwing definitions of mixing conditions for underlying random field. For U ⊆ DN and V ⊆ DN , define σalgebras σ (U ) = σ (xi ; i ∈ U ) and σ (V ) = σ (xi ; i ∈ V ). |U | and |V | denote the cardinality of U and V . The two commonly used mixing conditions are α-mixing and φ-mixing which are introduced separately by Rosenblatt and Ibragimov. The α-mixing and φ-mixing conditions are: αN (U, V ) = sup (|P (u∩v) − P (u) P (v)| , u ∈ U, v ∈ V ) , 3A lattice is a collection of spatial sites (locations) supplemented with neighborhood information (Cressie 1993, p. 383). 4 A two-dimensional Euclidean space is called the Cartesian plane. Spatial processes can also reside in a higher dimension of space, Rn , n > 2. 35 and φN (U, V ) = sup (|P (u|v) − P (u)| , u ∈ U, v ∈ V, P (u) > 0) . Define a metric ρ (i, j) = max1≤l≤d |il | , where il denotes the l-th component of i. The mixing conditions for the underlying random fields are defined as follows: αk,l,N (r) = sup (αN (U, V ) , |U | ≤ k, |V | ≤ l, ρ (U, V ) ≥ r), φk,l,N (r) = sup (φN (U, V ) , |U | ≤ k, |V | ≤ l, ρ (U, V ) ≥ r) , with k, l, r, N natural num¯ bers. Further let αk,l,N (r) = supαk,l,N (r) and φk,l,N (r) = supφk,l,N (r) . ¯ N N Let {wN } = {(xi , yi )} , i = 1, 2, ..., N . (xi , yi ) is the observation obtained at location si . xi is a row vector of independent variables and yi is a scalar dependent variable. θ ∈ Θ is a P × 1 parameter vector, and θ0 is the true parameter value. Let θ be a general notation for the parameter vector. An objective function QN depends on a sample of realizations of variables wN , location information DN , parameter θ and the sample size N. An M-estimator of θ0 is given by minimizing the objective function QN as follows, ˆ θ N = arg min QN (wN , DN ; θ) . θ∈Θ (2.1) In particular, I will address the case when QN (wN , DN ; θ) can be expressed as a sample average. An example of this type of M-estimator is partial (pooled) maximum likelihood (PMLE) estimator. The objective function can be written as QN (wN , DN ; θ) = 1 N N qi (wi , DN ; θ) , (2.2) i=1 where qi (wi , DN ; θ) is some real valued function defined on Θ. wi is the observed data obtained at location si . DN contains the location information of wi and other observations. 
Suppose that in a parametric model the conditional mean is correctly specified: $E(y_i \mid x_i, D_N) = m_i(x_i, D_N; \theta_0)$, along with an LEF density $f_i(y_i \mid x_i; \theta)$. For example, in a nonlinear regression model the objective function for the nonlinear weighted least squares (NWLS) estimator is $\frac{1}{N}\sum_{i=1}^{N} [y_i - m_i(x_i;\theta)]^2 / v_i$, where $v_i$ is the variance of the error term (usually based on the LEF density), while for partial maximum likelihood estimation the objective function is $\frac{1}{N}\sum_{i=1}^{N} \log f_i(y_i \mid x_i, D_N; \theta)$, with $E(y_i \mid x_i, D_N) = m_i(x_i, D_N; \theta_0)$.

For the M-estimator to be consistent, we need a uniform law of large numbers (ULLN). To obtain the ULLN, we need the following assumptions.

Assumption 1: The pairwise distances $d_{ij}$ are finite and bounded below by some $\varepsilon > 0$. I employ increasing domain asymptotics; that is, the sample size grows as the sampling region expands.

Assumption 2: $x_i$, $y_i$ are uniformly bounded random variables.

Assumption 3: $\Theta$ is a compact subset of $\mathbb{R}^P$.

Assumption 4 (Pointwise convergence): For each $\theta \in \Theta$, $Q_N(\theta) - \bar{Q}_N(\theta) = o_p(1)$, where $\bar{Q}_N(\theta) = E(Q_N(\theta)) = \frac{1}{N}\sum_{i=1}^{N} E(q_i(\theta))$, and $\lim_{N\to\infty} \bar{Q}_N(\theta) = \ddot{Q}(\theta)$.

Assumption 5: $Q_N(\theta)$ is stochastically equicontinuous. Let $(\Theta, \nu)$ be a metric space and let $B(\theta', \delta)$ be the open ball $\{\theta \in \Theta : \nu(\theta, \theta') < \delta\}$. $Q_N(\theta)$ is stochastically equicontinuous in the sense that, for every $\varepsilon > 0$,
\[
\limsup_{N\to\infty} P\Big( \sup_{\theta,\theta' \in \Theta,\ \nu(\theta,\theta') < \delta} \big| Q_N(\theta) - Q_N(\theta') \big| > \varepsilon \Big) \to 0 \quad \text{as } \delta \to 0.
\]

Assumption 6: $\bar{Q}_N(\theta)$ is also stochastically equicontinuous on $\Theta$; the definition parallels the statement above for $Q_N(\theta)$.

Assumption 7: $\ddot{Q}$ attains a unique global minimum at $\theta_0 \in \Theta$.

Assumption 8: There is no perfect multicollinearity in $x_i$. For exponential and logistic regression functions, the objective function depends on the covariates only through a linear index, so multicollinearity must be ruled out in order to identify the model.

Assumption 4 provides a pointwise law of large numbers. Assumptions 7 and 8 provide the identification conditions that ensure the minimization problem has a unique solution.

Proposition 6 Under Assumptions 1-8, the M-estimator in (2.1)-(2.2) is consistent; that is, $\hat{\theta}_N \to_{a.s.} \theta_0$ as $N \to \infty$. See proof in Chapter 4.

Following the above proposition, we can apply the ULLN to different nonlinear estimators. The M-estimator above can also be expressed using groups of observations (group notation): we divide the observations into groups according to geographic properties or other economic or social relationships. The objective function is then
\[
Q_G(w_G, D_G; \theta) = \frac{1}{G} \sum_{g=1}^{G} q_g(w_g, D_G; \theta). \tag{2.3}
\]
Let $q_g(\theta) \equiv q_g(w_g, D_G; \theta)$, a real-valued function of the $g$th group of observations, $g = 1, 2, \ldots, G$; $w_g$ contains the observations of the $g$th group, and $D_G$ represents the lattice with group information in addition to locations. $Q_G$ denotes the objective function when the total number of groups is $G$, although the total number of observations is still $N$. Note that when the group size equals one, the group notation reduces to the individual notation. I use the group notation in the rest of this paper. Assumptions 1-8 in group notation are essentially the same as in individual notation (equations (2.1) and (2.2)), except that the subscript $i$ is replaced by $g$.
Let $L$ be the number of observations in each group, so that the total number of groups is $G = N/L$. The groupwise M-estimator is
\[
\hat{\theta}_G = \arg\min_{\theta \in \Theta} Q_G(w_G, D_G; \theta). \tag{2.4}
\]

Proposition 7 Following Proposition 6, the groupwise M-estimator in (2.3)-(2.4) is consistent; that is, $\hat{\theta}_G \to_{a.s.} \theta_0$ as $G \to \infty$. See proof in Chapter 4.

2.2.2 Two-step Estimation

In some situations we have a preliminary estimator. For example, a partial QMLE yields a preliminary consistent estimator, after which an estimated working covariance matrix can be used as a weight to obtain a more efficient estimator. A two-step M-estimator $\hat{\theta}_G$ of $\theta_0$ solves
\[
\hat{\theta}_G = \arg\min_{\theta \in \Theta} Q_G(w_G, D_G; \theta, \hat{\gamma}), \qquad
Q_G(w_G, D_G; \theta, \hat{\gamma}) = \frac{1}{G} \sum_{g=1}^{G} q_g(w_g, D_G; \theta, \hat{\gamma}), \tag{2.5}
\]
where $\hat{\gamma}$ is a preliminary estimator based on the sample $\{w_g : g = 1, 2, \ldots, G\}$, which exhibits spatial correlation, and $\operatorname{plim} \hat{\gamma} = \gamma^*$ for some element $\gamma^*$ of the parameter space $\Gamma$. I discuss a specific example of the two-step M-estimator, the PMLE-GEE method, in the next section.

Assumption 9: $\bar{Q}_G(\theta, \gamma^*)$ attains a unique global minimum at $\theta_0 \in \Theta$, where $\bar{Q}_G(\theta, \gamma^*) = \frac{1}{G}\sum_{g=1}^{G} E[q_g(w_g, D_G; \theta, \gamma^*)]$; that is, $\bar{Q}_G(\theta_0, \gamma^*) < \bar{Q}_G(\theta, \gamma^*)$ for all $\theta \in \Theta$, $\theta \neq \theta_0$.

Assumption 9 provides the necessary identification condition for the two-step M-estimator. If $q_g(w_g; \theta, \gamma)$ is stochastically equicontinuous over $\Theta \times \Gamma$, then a ULLN applies. Along with identification, this result can be shown to imply consistency of $\hat{\theta}_G$ for $\theta_0$; the consistency argument is the same as in Proposition 6.

2.3 Two-step Estimator: QMLE and GEE

Due to the complexity of the joint distribution of spatial random processes, econometricians have developed a variety of ways to reduce the computational burden. One way is to specify the partial conditional distribution and maximize the sum of the log likelihoods over the individual observations. The parameters can be consistently estimated if the partial log likelihood function satisfies the assumptions for consistency of M-estimation; a consistent variance estimator must then be provided for valid inference.⁵ Moreover, one can divide the data into groups and specify the marginal distribution of each group to obtain a more efficient estimator (Wang, Iglesias, & Wooldridge 2012). However, this approach requires correctly specified marginal distributions, and as the group size grows, the joint distribution of each group of variables becomes more and more difficult to compute.

⁵Ignoring dependence in the estimation of the parameters results in incorrect inference if the variances are computed as though the observations were independent. Dependence should be accounted for to the extent that it is ignored in estimation.

In this section, I propose a two-step estimator in a QMLE framework. The first step is a pooled QMLE procedure and the second step is a GEE procedure. Only the correct conditional mean and a density in the LEF need to be specified. The partial (pooled) QMLE method requires correctly specified conditional mean functions, $E(y_i \mid x_i) = m_i(x_i; \theta_0)$, $i = 1, 2, \ldots, N$, along with an LEF density $f_i(y_i \mid x_i; \theta)$. One then maximizes the sum of the individual log likelihoods, ignoring any spatial dependence. Note that the true log likelihood cannot be written as the sum of the individual likelihoods; PQMLE is an approximation to the true MLE. Under certain conditions, however, PQMLE delivers consistent estimators with computational ease. The partial quasi-log likelihood is
\[
L_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log f_i(y_i \mid x_i, D_N; \theta). \tag{2.6}
\]
The partial QMLE is found by solving the score equation
\[
S_N(\check{\theta}) = \sum_{i=1}^{N} s_i(\check{\theta}) = 0. \tag{2.7}
\]
One characterization of the QMLE in the LEF is that the individual score has the form
\[
s_i(\theta) = \nabla_\theta m_i(x_i, D_N; \theta)'\, [y_i - m_i(x_i, D_N; \theta)] / v(m_i(x_i, D_N; \theta)), \tag{2.8}
\]
where $\nabla_\theta m_i(x_i, D_N; \theta)$ is the $1 \times P$ gradient of the mean function and $v(m_i(x_i, D_N; \theta))$ is the variance function associated with the chosen LEF density. For the Bernoulli distribution, $v(m_i) = m_i(1 - m_i)$, and for the Poisson distribution, $v(m_i) = m_i$. If $E(y_i \mid x_i, D_N) = m_i(x_i, D_N; \theta_0)$, then $E(s_i(\theta_0) \mid x_i, D_N) = 0$, which implies Fisher consistency.

The above yields a consistent estimator by Proposition 6. However, this estimator is unlikely to be the most efficient among the estimators based on the same distributional assumptions, because it ignores the spatial correlations between observations. If we can use some or all of the pairwise correlation information, we can potentially improve estimation efficiency. A common way to use the pairwise information is to divide observations into groups and use the spatial correlations within groups while ignoring the correlations between groups. In empirical studies there often exist natural groupings of the data, e.g., technology spillover effects within a given state. If we know the groupwise distribution, but not the full joint distribution of the data, we can obtain a consistent estimator using only the groupwise information.

Let $g$ index the groups, $g = 1, 2, \ldots, G$. The group size is the number of observations divided by the number of groups, and it can range from $1$ to $N$. There are two extreme cases. When the group size is 1, the resulting estimator is the usual partial QMLE, which ignores all pairwise correlations. When the group size is $N$, all pairwise information is used. If the group size lies strictly between one and $N$, the estimation is still a partial QMLE: by "partial," I mean that I do not use full information, only the information within groups.

Suppose we divide the data into $G$ groups with the same number of observations in each group. Let $L = N/G$, which is fixed; the number of groups and the sample size both increase as the sampling domain expands. That is, we obtain more observations by increasing the space from which the sample is drawn. For group $g$, $X_g$ is an $L \times K$ matrix and $y_g$ is an $L \times 1$ vector, $g = 1, 2, \ldots, G$; let $X$ denote the $N \times K$ covariate matrix. The assumptions can be restated in group notation; since they are essentially the same apart from notation, I do not restate them here.

Assumption 10: The conditional mean is correctly specified for each group: $E(y_g \mid X_g) = m_g(X_g, D_G; \theta_0) \equiv m_g$, $g = 1, 2, \ldots, G$. I write $m_g$ for short and $m_g(\cdot)$ to emphasize particular arguments. When $G = N$, this reduces to $E(y_i \mid x_i) = m_i(x_i, D_N; \theta_0)$.

The groupwise QMLE estimator is given by setting the score function to zero,
\[
\sum_{g=1}^{G} s_g(\hat{\theta}) = 0, \tag{2.9}
\]
where $s_g(\theta)$ is the score function for each group, with the form
\[
s_g(\theta) = \nabla_\theta m_g'\, W_g^{-1} (y_g - m_g), \tag{2.10}
\]
where $\nabla_\theta m_g$ is the $L \times P$ gradient of the group mean function and $W_g$ is the LEF variance-covariance matrix for group $g$.
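As an illustration of (2.7)-(2.8), the following sketch computes a pooled Poisson QMLE on simulated data (the design is an assumed placeholder, not the chapter's simulation design). For the Poisson LEF, $\nabla_\theta m_i = m_i x_i$ and $v(m_i) = m_i$, so the score collapses to $x_i'(y_i - m_i)$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Illustrative data with exponential conditional mean (dependence omitted here)
N = 1000
X = np.column_stack([np.ones(N), rng.uniform(0.0, 1.0, N)])
theta_true = np.array([1.0, -1.0])
y = rng.poisson(np.exp(X @ theta_true))

def neg_loglik(theta):
    # Pooled Poisson quasi-log likelihood (2.6), up to the log(y!) constant
    return -np.mean(y * (X @ theta) - np.exp(X @ theta))

def neg_score(theta):
    # LEF score (2.8): for Poisson, s_i = x_i' (y_i - m_i)
    m = np.exp(X @ theta)
    return -(X * (y - m)[:, None]).mean(axis=0)

theta_check = minimize(neg_loglik, x0=np.zeros(2), jac=neg_score, method="BFGS").x
print("theta_check (pooled QMLE):", theta_check)
```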
Notice that $W_g$ is not a diagonal matrix containing only the individual variances; it also contains the covariances between pairs of individuals. It is because of this property that we can improve efficiency through a so-called generalized estimating equations (GEE) approach. The GEE approach was first extended to correlated data by Zeger and Liang (1986). In the spatial data context, I propose the generalized estimating equations for the mean parameters,
\[
\sum_{g=1}^{G} \nabla_\theta m_g'\, W_g^{-1} (y_g - m_g) = 0. \tag{2.11}
\]
To use GEE, we need a consistent estimator of $W_g$, which depends on the pairwise distances and a spatial dependence parameter. Suppose $\hat{W}_g$ is a consistent estimator of $W_g$. GEE is a "pseudo" weighted multivariate nonlinear least squares (WMNLS) in that it uses only the groupwise information. The GEE estimator is
\[
\hat{\theta}_{GEE} = \arg\min_{\theta} \sum_{g=1}^{G} (y_g - m_g)'\, \hat{W}_g^{-1} (y_g - m_g). \tag{2.12}
\]
As a special case, the pooled QMLE coincides with the nonlinear weighted least squares (NWLS) estimator
\[
\tilde{\theta}_{NWLS} = \arg\min_{\theta} \sum_{i=1}^{N} [y_i - m_i(x_i, D_N; \theta)]^2 / v(m_i(x_i, D_N; \theta)). \tag{2.13}
\]

The following demonstrates how to find a consistent estimator of $W_g$. Write $W_g = V_g(X_g; \theta)^{1/2}\, R_g(\rho, D_G)\, V_g(X_g; \theta)^{1/2}$. The diagonal elements of $W_g$ are the variances of the dependent variables implied by the LEF density; the off-diagonal elements are covariances that depend on the spatial parameter and the distances. Specifically,
\[
V_g(X_g, D_G; \theta) = \operatorname{diag}(v_1, v_2, \ldots, v_L), \tag{2.14}
\]
where the $l$th diagonal element is $v_l = \operatorname{Var}(y_{gl} \mid X_{gl})$ in group $g$, $y_{gl}$ is the $l$th element of $y_g$, and $X_{gl}$ is the $l$th row of $X_g$.

Let $R_N(\rho, D_N)$ be the $N \times N$ correlation matrix for the whole sample, and let $R_g(\rho, D_G)$ be the $L \times L$ correlation matrix for group $g$. A common assumption for the $ij$th element of $R_N(\rho, D_N)$ is that $R_{ij} = 1 - \gamma(d_{ij}, h)$, where
\[
\gamma(d_{ij}, h) = \begin{cases} c + b\,[1 - \exp(-d_{ij}/\rho)], & d_{ij} \neq 0, \\ 0, & \text{otherwise}, \end{cases} \tag{2.15}
\]
and the vector of spatial parameters is $h = (c, b, \rho)'$, with $c \geq 0$, $b \geq 0$, $\rho \geq 0$, and $c + b \leq 2$.⁶ Setting $c = 0$ and $b = 1$ gives
\[
R_{ij} = \begin{cases} \exp(-d_{ij}/\rho), & d_{ij} \neq 0, \\ 1, & \text{otherwise}. \end{cases} \tag{2.16}
\]
Another example would be
\[
R_{ij} = \begin{cases} \rho/d_{ij}, & d_{ij} \neq 0, \\ 1, & \text{otherwise}. \end{cases} \tag{2.17}
\]
Although the above specifications do not exhaust the possibilities, they at least show how to parameterize the spatial correlation, and they provide a basis for testing for spatial correlation.

⁶See Cressie (1993), p. 61, for more examples.

Let $\check{\theta}$ be the partial QMLE estimator, and let $\check{u}_i = y_i - m_i(x_i; \check{\theta})$, $i = 1, 2, \ldots, N$, be the QMLE residuals. Let $\check{v}_i = v(m_i(x_i, D_N; \check{\theta}))$ be the fitted variance of individual $i$ associated with the chosen LEF density, and let $\hat{r}_i = \check{u}_i/\sqrt{\check{v}_i}$ be the standardized residuals, with $\hat{r} = (\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_N)'$. Then $\hat{r}\hat{r}'$ is the sample correlation matrix. We can use the method in Prentice (1988) to find a consistent estimator of $\rho$. Let $e$ be the vector containing the $N(N-1)/2$ distinct elements of the lower (or upper) triangle of $\hat{r}\hat{r}'$, excluding the diagonal, and let $z$ be the vector of the corresponding elements of $R$. We obtain the parameter estimator by solving
\[
\hat{\rho} = \arg\min_{\rho} (e - z)'\, \Xi^{-1} (e - z), \tag{2.18}
\]
where $\Xi$ is a working weighting matrix for $e$, the sample correlation vector.
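The following sketch illustrates (2.18) under two simplifying assumptions: the exponential working correlation (2.16) and $\Xi = I$, so the criterion reduces to least squares on the within-sample residual products. The locations, true $\rho$, and "standardized residuals" are simulated placeholders:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# Locations on a small lattice; true correlation is exponential, as in (2.16)
coords = np.array([(r, s) for r in range(10) for s in range(10)], dtype=float)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
rho_true = 1.5
R = np.where(d > 0, np.exp(-d / rho_true), 1.0)

# One correlated draw standing in for the standardized residuals r_hat
r_hat = np.linalg.cholesky(R + 1e-10 * np.eye(len(R))) @ rng.standard_normal(len(R))

iu = np.triu_indices(len(R), k=1)
e = (r_hat[:, None] * r_hat[None, :])[iu]   # vector of residual products
dist = d[iu]

def objective(rho):
    # Criterion (2.18) with identity weighting (Xi = I)
    z = np.exp(-dist / rho)
    return np.sum((e - z) ** 2)

rho_hat = minimize_scalar(objective, bounds=(0.01, 10.0), method="bounded").x
print("rho_hat:", rho_hat)
```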
In the simplest case, $\Xi$ is a diagonal matrix with $\xi_{21}, \xi_{31}, \ldots, \xi_{N1}, \xi_{32}, \xi_{42}, \ldots, \xi_{N2}, \ldots, \xi_{N,N-1}$ on the diagonal, the corresponding variances of the elements of $e$. If the variance-covariance matrix $W$ is correctly specified, a model-based consistent variance estimator of $\hat{\theta}$ is
\[
\Big( \sum_{g=1}^{G} \nabla\hat{m}_g'\, \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1},
\]
where $\nabla\hat{m}_g = \nabla m_g(X_g; \hat{\theta}_{GEE})$. As an alternative, the variance estimator of $\hat{\theta}$ that is robust to misspecification of the variance-covariance matrix is
\[
\Big( \sum_{g=1}^{G} \nabla\hat{m}_g'\, \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}
\Big( \sum_{g=1}^{G}\sum_{h=1}^{G} k(g,h)\, \nabla\hat{m}_g'\, \hat{W}_g^{-1} \hat{u}_g \hat{u}_h'\, \hat{W}_h^{-1} \nabla\hat{m}_h \Big)
\Big( \sum_{g=1}^{G} \nabla\hat{m}_g'\, \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}, \tag{2.19}
\]
where $k(g,h)$ is a kernel function depending on the distance between groups $g$ and $h$.

The GEE approach is summarized as follows. First, find the partial QMLE estimator of the mean parameters and obtain the PQMLE residuals. Second, use the first-step estimator to construct a fitted variance-covariance matrix according to the LEF density, and estimate the spatial correlation parameter using the standardized residuals from the first step. After obtaining the working matrix, carry out a weighted multivariate nonlinear least squares procedure; this gives the GEE estimator. Finally, compute a consistent variance estimator of the mean parameters that is robust to heteroskedasticity and spatial correlation. From these estimates we can also obtain the average partial effects (APEs).

2.4 Asymptotic Distribution for PQMLE and GEE

A central limit theorem is needed to derive the asymptotic distributions of the M-estimators. Bolthausen (1982) provides a central limit theorem (CLT) for strictly stationary processes. In economic applications, however, many spatial processes are nonstationary in the sense that they are heterogeneous: the joint distribution of the dependent variables varies with location. There are also spatial processes with asymptotically unbounded moments. In this paper, however, I discuss only uniformly bounded random variables.

Taking derivatives of the objective function with respect to the parameter yields the scores for each group. Let $s_g(\theta)$ denote the $P \times 1$ score vector for $q_g(\theta)$. The group score $s_g(\theta)$ and Hessian $H_g(\theta)$ have the following forms:
\[
s_g(\theta) = \nabla_\theta m_g(\theta)'\, W_g^{-1} [y_g - m_g(\theta)], \tag{2.20}
\]
with $E[s_g(\theta_0) \mid x_g, D_G] = 0$. By the assumption of a correctly specified conditional mean, this condition implies Fisher consistency of the QMLE for the linear exponential family. Meanwhile,
\[
H_g(\theta) = \partial s_g(\theta)/\partial\theta
= -\nabla_\theta m_g'\, W_g^{-1} \nabla_\theta m_g + \partial^2_\theta m_g\, W_g^{-1} (y_g - m_g), \tag{2.21}
\]
where $\nabla_\theta m_g \equiv \partial m_g/\partial\theta$ is the $L \times P$ gradient of the group mean function, $\partial^2_\theta m_g \equiv \partial^2 m_g/\partial\theta\,\partial\theta'$ is the Jacobian of that gradient, and $W_g$ is the LEF variance-covariance matrix for group $g$. Taking the expected value of the Hessian over the distribution of $w$ gives
\[
E[H_g(\theta_0)] = E\{E[H_g(\theta_0) \mid x_g, D_G]\}
= E\big\{ -\nabla m_g' W_g^{-1} \nabla m_g + \partial^2_\theta m_g\, W_g^{-1}\, [E(y_g \mid x_g, D_G) - m_g] \big\}
= E\big( -\nabla m_g' W_g^{-1} \nabla m_g \big). \tag{2.22}
\]
Recall that in the spatial data context the generalized estimating equations for the mean parameters are
\[
\frac{1}{G}\sum_{g=1}^{G} s_g(\hat{\theta}) = 0, \tag{2.23}
\]
\[
\frac{1}{G}\sum_{g=1}^{G} \nabla m_g(\hat{\theta})'\, W_g^{-1} [y_g - m_g(\hat{\theta})] = 0. \tag{2.24}
\]
Because each score is a function of the spatial process, the scores are correlated with each other. The score function for the whole sample is $S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\hat{\theta}) = 0$, and it can be expanded about $\theta_0$ in a mean-value expansion:
\[
S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0) + \Big[\frac{1}{G}\sum_{g=1}^{G} H_g(\ddot{\theta})\Big](\hat{\theta} - \theta_0), \tag{2.25}
\]
where $\ddot{\theta} \in \Theta$ lies between $\hat{\theta}$ and $\theta_0$, and $H_g(\ddot{\theta})$ is the $P \times P$ Hessian of the objective function $q_g(\theta)$. Hence
\[
\sqrt{G}(\hat{\theta} - \theta_0) = -\Big[\frac{1}{G}\sum_{g=1}^{G} H_g(\ddot{\theta})\Big]^{-1} \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0). \tag{2.26}
\]

Assumption 11 (Uniform $L_{2+\delta}$ integrability): The elements of the scores are uniformly bounded and have limiting expectation equal to zero; that is,
\[
\lim_{k\to\infty} \sup_g E\big[ |s_{gl}|^{2+\delta}\, \mathbf{1}(|s_{gl}| > k) \big] = 0, \tag{2.27}
\]
where $\mathbf{1}(\cdot)$ is the indicator function, $s_g$ is the group score matrix, $s_{gl}$ is an element of the score matrix, and $k$ is a constant.

Assumption 12: The second moment of the score function is positive and uniformly bounded:
\[
0 < \lim_{G\to\infty} \operatorname{Var}\Big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \Big) < \infty. \tag{2.28}
\]

Proposition 8 Under Assumptions 1-12, $\sqrt{G}(\hat{\theta} - \theta_0) \Rightarrow N(0, A_0^{-1}B_0A_0^{-1})$, where $A_0 = \lim_{G\to\infty} -\frac{1}{G}\sum_{g=1}^{G} E[H_g(\theta_0)]$ and $B_0 = \lim_{G\to\infty} \operatorname{Var}\big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \big)$. See proof in Chapter 4.

$E[H_g(\theta_0)]$ was given in equation (2.22). $\operatorname{Var}\big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \big)$ is given by
\[
\operatorname{Var}\Big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \Big)
= \operatorname{Var}\Big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} \nabla m_g(\theta_0)'\, W_g^{-1} u_g \Big)
= \frac{1}{G}\sum_{g=1}^{G} E\big[ \nabla m_g(\theta_0)' W_g^{-1} u_g u_g' W_g^{-1} \nabla m_g(\theta_0) \big]
+ \frac{1}{G}\sum_{g=1}^{G}\sum_{h\neq g} E\big[ \nabla m_g(\theta_0)' W_g^{-1} u_g u_h' W_h^{-1} \nabla m_h(\theta_0) \big]. \tag{2.29}
\]

If we have a first-step estimator $\hat{\gamma}$ with $\operatorname{plim}\hat{\gamma} = \gamma^*$, then by a linear expansion the score can be written as
\[
S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0, \hat{\gamma}) + o_p(1)
= \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*) + F_0(\hat{\gamma} - \gamma^*) + o_p(1), \tag{2.30}
\]
where $s_g(\theta, \hat{\gamma}) = \nabla m_g(\theta)'\, \hat{W}_g^{-1} [y_g - m_g(\theta)]$, $\hat{W}_g^{-1}$ is a function of $\hat{\gamma}$, and $F_0$ is a $P \times J$ matrix, $J$ being the dimension of $\gamma$: $F_0 = \lim_{G\to\infty} \frac{1}{G}\sum_{g=1}^{G} E[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)]$. Note that $m_g(\theta)$ and $\nabla m_g(\theta)$ do not depend on $\gamma$; differentiation with respect to $\gamma$ affects only $W_g^{-1}$, so $\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)$ is a linear combination of the elements of $y_g - m_g(\theta_0)$. Since $E[y_g - m_g \mid w, D] = 0$, we have $E[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*) \mid w, D] = 0$, and by the law of iterated expectations $E[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)] = 0$. Thus $F_0 = 0$. We can then write the score function as
\[
S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*) + o_p(1), \tag{2.31}
\]
and
\[
S_G(\hat{\theta}) = \frac{1}{G}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*) + \Big[\frac{1}{G}\sum_{g=1}^{G} H_g(\ddot{\theta}, \gamma^*)\Big](\hat{\theta} - \theta_0), \tag{2.32}
\]
where $\ddot{\theta} \in \Theta$ lies between $\hat{\theta}$ and $\theta_0$, and $H_g(\ddot{\theta}, \gamma^*)$ is the $P \times P$ Hessian of $q_g(\theta)$. Hence
\[
\sqrt{G}(\hat{\theta} - \theta_0) = -\Big[\frac{1}{G}\sum_{g=1}^{G} H_g(\ddot{\theta}, \gamma^*)\Big]^{-1} \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*). \tag{2.33}
\]
Thus the first-step estimation does not affect the asymptotic distribution of the second step: $\gamma^*$ acts as a fixed value in the second-step score function.

Proposition 9 Under Assumptions 1-12, $\sqrt{G}(\hat{\theta} - \theta_0) \Rightarrow N(0, A_0^{-1}B_0A_0^{-1})$, where $A_0 = \lim_{G\to\infty} -\frac{1}{G}\sum_{g=1}^{G} E[H_g(\theta_0, \gamma^*)]$ and $B_0 = \lim_{G\to\infty} \operatorname{Var}\big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0, \gamma^*) \big)$. See proof in Chapter 4.

2.5 Variance-Covariance Matrix Estimator

2.5.1 Parametric Variance-Covariance Estimator

Assumption 13: $\operatorname{Var}(u_g) = \operatorname{Var}(u_g \mid x_g, D_G) = W_g$, and $\operatorname{Cov}(u_g, u_h) = \operatorname{Cov}(u_g, u_h \mid x_g, x_h, D_G) = C_{gh}$. Under Assumption 13,
\[
\operatorname{Var}\Big( \frac{1}{\sqrt{G}}\sum_{g=1}^{G} s_g(\theta_0) \Big)
= \frac{1}{G}\sum_{g=1}^{G} E\big[ \nabla m_g(\theta_0)' W_g^{-1} \nabla m_g(\theta_0) \big]
+ \frac{1}{G}\sum_{g=1}^{G}\sum_{h\neq g} E\big[ \nabla m_g(\theta_0)' W_g^{-1} C_{gh} W_h^{-1} \nabla m_h(\theta_0) \big]. \tag{2.34}
\]
Under Assumption 13, the asymptotic variance of $\hat{\theta}_G$ can be estimated by
\[
\widehat{\operatorname{Avar}}_1(\hat{\theta}) = \frac{1}{G}\hat{A}^{-1}\hat{B}_1\hat{A}^{-1}
= \Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}
\Big( \sum_{g}\sum_{h} \nabla\hat{m}_g' \hat{W}_g^{-1} \hat{C}_{gh} \hat{W}_h^{-1} \nabla\hat{m}_h \Big)
\Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}, \tag{2.35}
\]
where
\[
\hat{A} = \frac{1}{G}\sum_{g=1}^{G} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g, \qquad
\hat{B}_1 = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G} \nabla\hat{m}_g' \hat{W}_g^{-1} \hat{C}_{gh} \hat{W}_h^{-1} \nabla\hat{m}_h.
\]

2.5.2 Nonparametric Variance-Covariance Estimator

Assumption 13 does not always hold, and most of the time its validity cannot easily be verified. Since $\operatorname{Avar}\sqrt{G}(\hat{\theta}_G - \theta_0) = A_0^{-1}B_0A_0^{-1}$, we can estimate it consistently by $\hat{A}^{-1}\hat{B}_2\hat{A}^{-1}$; the asymptotic standard errors are obtained from the matrix $\widehat{\operatorname{Avar}}_2(\hat{\theta}_G) = \hat{A}^{-1}\hat{B}_2\hat{A}^{-1}/G$. A robust nonparametric variance-covariance estimator is
\[
\widehat{\operatorname{Avar}}_2(\hat{\theta}) = \frac{1}{G}\hat{A}^{-1}\hat{B}_2\hat{A}^{-1}
= \Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}
\Big( \sum_{g}\sum_{h} k(d_{gh})\, \nabla\hat{m}_g' \hat{W}_g^{-1} \hat{u}_g \hat{u}_h' \hat{W}_h^{-1} \nabla\hat{m}_h \Big)
\Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}, \tag{2.36}
\]
where
\[
\hat{B}_2 = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G} k(d_{gh})\, \nabla\hat{m}_g' \hat{W}_g^{-1} \hat{u}_g \hat{u}_h' \hat{W}_h^{-1} \nabla\hat{m}_h.
\]

Proposition 10 Under Assumptions 1-13, $\widehat{\operatorname{Avar}}_2(\hat{\theta}_G)_{robust} \to \frac{1}{G}A_0^{-1}B_0A_0^{-1}$. See proof in Chapter 4.

Although we do not need to adjust the two-step QMLE estimator in this paper, it is worth mentioning that in many cases $F_0 \neq 0$, and the asymptotic variance must then be adjusted:
\[
\sqrt{G}(\hat{\theta} - \theta_0) = A_0^{-1}\frac{1}{\sqrt{G}}\sum_{g=1}^{G} \big[-g_g(\theta_0; \gamma^*)\big] + o_p(1),
\]
where $g_g(\theta_0; \gamma^*) \equiv s_g(\theta_0, \gamma^*) + F_0 r_g(\gamma^*)$. Let
\[
D_0 \equiv \lim_{G\to\infty} \frac{1}{G} E\Big[ \sum_{g=1}^{G} g_g(\theta_0; \gamma^*) \sum_{h=1}^{G} g_h(\theta_0; \gamma^*)' \Big];
\]
then the asymptotic distribution of $\hat{\theta}$ can be written as $\sqrt{G}(\hat{\theta} - \theta_0) \Rightarrow_d N(0, A_0^{-1}D_0A_0^{-1})$. A robust estimator after the adjustment is
\[
\widehat{\operatorname{Avar}}_3(\hat{\theta}) = \frac{1}{G}\hat{A}^{-1}\hat{D}\hat{A}^{-1}
= \Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}
\Big( \sum_{g}\sum_{h} k(d_{gh}) \big[ \hat{s}_g\hat{s}_h' + \hat{s}_g\hat{r}_h'\hat{F}' + \hat{F}\hat{r}_g\hat{s}_h' + \hat{F}\hat{r}_g\hat{r}_h'\hat{F}' \big] \Big)
\Big( \sum_{g} \nabla\hat{m}_g' \hat{W}_g^{-1} \nabla\hat{m}_g \Big)^{-1}, \tag{2.37}
\]
where
\[
\hat{D} = \frac{1}{G}\sum_{g=1}^{G}\sum_{h=1}^{G} k(d_{gh}) \big[ \hat{s}_g\hat{s}_h' + \hat{s}_g\hat{r}_h'\hat{F}' + \hat{F}\hat{r}_g\hat{s}_h' + \hat{F}\hat{r}_g\hat{r}_h'\hat{F}' \big]. \tag{2.38}
\]
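As a computational summary, here is a sketch of the robust estimator (2.36) as a reusable function. The Bartlett kernel and the input format (lists of group-level gradients, working matrices, and residuals, plus a between-group distance matrix) are assumptions of the sketch, not prescriptions:

```python
import numpy as np

def robust_avar(Dm, W, u, dist, bandwidth):
    """Nonparametric sandwich estimator, as in (2.36).

    Dm   : list of L x P group mean gradients, evaluated at theta_hat
    W    : list of L x L working covariance matrices
    u    : list of length-L group residual vectors
    dist : G x G matrix of between-group distances
    """
    G = len(Dm)
    Winv = [np.linalg.inv(Wg) for Wg in W]
    A = sum(Dm[g].T @ Winv[g] @ Dm[g] for g in range(G)) / G
    B = np.zeros_like(A)
    for g in range(G):
        for h in range(G):
            k = max(0.0, 1.0 - dist[g, h] / bandwidth)  # Bartlett kernel (an assumed choice)
            if k > 0.0:
                sg = Dm[g].T @ Winv[g] @ u[g]
                sh = Dm[h].T @ Winv[h] @ u[h]
                B += k * np.outer(sg, sh)
    B /= G
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv / G  # Avar_2(theta_hat)
```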
2.6 Two Examples: Spatial Probit Model and Poisson Regression Model

The setup of nonlinear models with spatial data can be tricky: the spatially correlated term must be incorporated in an appropriate way, and this varies across models. In this section, I use two nonlinear models to demonstrate how to incorporate the spatially correlated term and apply the GEE procedure. The first example is a Probit model, and the second is a count data model.

2.6.1 Example 1. A Probit Model with Spatial Correlation in the Latent Error

The Probit model is a popular binary response model. The dependent variable has a conditional Bernoulli distribution and takes the values zero and one, indicating whether or not a certain event has occurred; for example, $y = 1$ if a firm adopts a new technology, and $y = 0$ otherwise. The value of the latent variable $y^*$ determines the outcome of $y$. Assume the Probit model
\[
y_i = \mathbf{1}[y_i^* \geq 0], \tag{2.39}
\]
\[
y_i^* = x_i\beta + e_i. \tag{2.40}
\]
The latent error $e$ has a standard multivariate normal distribution, but the covariances depend on the pairwise distances $D_N$, which differs from the usual multivariate normal specification:⁷
\[
\operatorname{corr}(e_i, e_j) = f(d_{ij}, \rho), \tag{2.41}
\]
where $f(\cdot)$ is a function increasing in $\rho$ and decreasing in $d_{ij}$. We do not observe $y_i^*$; we observe only $y_i$.

⁷A multivariate normal distribution usually specifies the mean vector and correlation matrix; the correlations do not depend on the pairwise distance between two variables.

Let $\Phi(\cdot)$ be the standard normal cumulative distribution function (CDF), and $\phi(\cdot)$ the standard normal probability density function (PDF). Assume that the mean function $m_i(x_i; \beta) \equiv E(y_i \mid x_i, D_N) = \Phi(x_i\beta)$ is correctly specified. Because of the nonlinearity and the non-observability of the latent variable $y_i^*$, $\operatorname{Cov}(y_i, y_j \mid x_i, x_j, D_N)$ is hard to recover without more information on the multivariate distribution. To proceed with GEE, we need to specify a working matrix, which is possibly misspecified.

The partial QMLE delivers a consistent first-step estimator of the mean parameters, as in Proposition 6. Using the Bernoulli density, the log likelihood for each observation is
\[
l_i(\beta) = y_i \log\Phi(x_i\beta) + (1 - y_i)\log[1 - \Phi(x_i\beta)]. \tag{2.42}
\]
The partial QMLE solves
\[
\check{\beta} = \arg\max_{\beta} L_N(\beta), \qquad
L_N(\beta) = \sum_{i=1}^{N} l_i(\beta) = \sum_{i=1}^{N} \big\{ y_i\log\Phi(x_i\beta) + (1 - y_i)\log[1 - \Phi(x_i\beta)] \big\}. \tag{2.43}
\]
The score of the likelihood for each individual is
\[
s_i(\beta) \equiv \frac{\phi(x_i\beta)\, x_i'\, [y_i - \Phi(x_i\beta)]}{\Phi(x_i\beta)[1 - \Phi(x_i\beta)]}. \tag{2.44}
\]
The expected Hessian⁸ for each observation is
\[
E(H_i \mid x_i, D_N) = -\frac{\phi^2(x_i\beta)\, x_i'x_i}{\Phi(x_i\beta)[1 - \Phi(x_i\beta)]}. \tag{2.45}
\]

⁸The unexpected Hessian can also be derived. Because the score has the special form containing $y_i - \Phi(x_i\beta)$, taking expectations conditional on $x_i$ and $D_N$ yields a cleaner expression.

Let $A_N$ be the sum of the negative expected Hessians, $A_N = \sum_{i=1}^{N} \frac{\phi^2(x_i\beta)\,x_i'x_i}{\Phi(x_i\beta)[1-\Phi(x_i\beta)]}$, and let $A_0 = -E(H_i)$ over the distribution of $x$. Let $\check{u}_i = y_i - \Phi(x_i\check{\beta})$, $i = 1, 2, \ldots, N$, be the residuals. At this stage, a robust variance-covariance estimator for $\check{\beta}$ can be computed as
\[
\widehat{\operatorname{Var}}(\check{\beta})
= \Big( \sum_{i=1}^{N} \frac{\phi^2(x_i\check{\beta})\,x_i'x_i}{\Phi(x_i\check{\beta})[1-\Phi(x_i\check{\beta})]} \Big)^{-1}
\Big( \sum_{i=1}^{N}\sum_{j=1}^{N} k(d_{ij})\, \frac{\phi(x_i\check{\beta})\,\phi(x_j\check{\beta})\, x_i'\check{u}_i\check{u}_j x_j}{\Phi(x_i\check{\beta})[1-\Phi(x_i\check{\beta})]\,\Phi(x_j\check{\beta})[1-\Phi(x_j\check{\beta})]} \Big)
\Big( \sum_{i=1}^{N} \frac{\phi^2(x_i\check{\beta})\,x_i'x_i}{\Phi(x_i\check{\beta})[1-\Phi(x_i\check{\beta})]} \Big)^{-1}, \tag{2.46}
\]
where $k(d_{ij})$ is a kernel weight function that depends on the pairwise distances. This partial QMLE and its robust variance-covariance estimator provide a legitimate way to estimate the spatial Probit model. The next question is how the two-step GEE estimator works.

The second step is to construct the weighting matrix using the first-step estimates and residuals. As the data can be divided into groups, the working matrix serves as the weight for a specific group; if the group size is two, the working matrix is a $2 \times 2$ matrix. We write the working variance-covariance matrix as $\hat{W}_g = \hat{V}_g^{1/2}\hat{R}_g(\hat{\rho}, D_G)\hat{V}_g^{1/2}$. An estimator of the working variance matrix for each group is
\[
\hat{V}_g = \operatorname{diag}(\check{v}_1, \check{v}_2, \ldots, \check{v}_L), \tag{2.47}
\]
where
\[
\check{v}_l = \Phi(x_l\check{\beta})[1 - \Phi(x_l\check{\beta})], \quad l = 1, \ldots, L. \tag{2.48}
\]
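A minimal sketch of the first-step Probit PQMLE (2.42)-(2.43), together with the fitted variances (2.48) and standardized residuals, follows. The data generating process here is an independent-error placeholder, since the first step ignores dependence in any case:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)

# Illustrative Probit data (independent errors; only the DGP changes in the spatial case)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(1.0, 1.0, N)])
beta_true = np.array([-1.0, 1.0])
y = (X @ beta_true + rng.standard_normal(N) >= 0).astype(float)

def neg_loglik(beta):
    # Bernoulli quasi-log likelihood, as in (2.42)-(2.43)
    p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_check = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
p_hat = norm.cdf(X @ beta_check)
u_check = y - p_hat                   # first-step residuals
v_check = p_hat * (1 - p_hat)         # fitted variances, as in (2.48)
r_check = u_check / np.sqrt(v_check)  # standardized residuals
print("beta_check:", beta_check)
```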
Next we find an estimator of the working correlation matrix for $y_i - \Phi(x_i\beta)$. Suppose the structure of the true correlation matrix $R$ is $R_{ij} = C_{ij}(d_{ij}, \lambda)$, where $C_{ij}(d_{ij}, \lambda)$ is a function increasing in $\lambda$ and decreasing in $d_{ij}$. Note that $\lambda$ is the spatial correlation parameter of the dependent variables, while $\rho$ is that of the latent error; these two parameters are generally different. Let $r_i = \check{u}_i/\sqrt{\check{v}_i}$, $i = 1, 2, \ldots, N$, be the standardized residuals; $\hat{C}_{ij}$ equals the sample correlation of $\check{u}_i/\sqrt{\check{v}_i}$ and $\check{u}_j/\sqrt{\check{v}_j}$. Let $\hat{R} \equiv R(D_G, \hat{\lambda})$, and let $\hat{R}_g$ stand for the correlation matrix $R_g(D_g, \hat{\lambda})$ of the $g$th group. The function $C_{ij}(d_{ij}, \lambda)$ is unknown, but we can choose a correlation function to approximate it and use that in estimation; for example, $C(d_{ij}, \lambda) = \lambda/d_{ij}$ or $\exp(-d_{ij}/\lambda)$. Using only the correlations within groups, an estimator of $\lambda$ is
\[
\hat{\lambda} = \arg\min_{\lambda} \sum_{g=1}^{G}\sum_{i=1}^{L}\sum_{j>i}^{L} \big[ \hat{r}_i\hat{r}_j - C_{ij}(d_{ij}, \lambda) \big]^2
\]
for $i, j$ in the same group $g$.

The second-step GEE estimator of $\beta$ is
\[
\hat{\beta} = \arg\min_{\beta} \sum_{g=1}^{G} [y_g - \Phi(X_g\beta)]'\, \hat{W}_g^{-1}\, [y_g - \Phi(X_g\beta)]. \tag{2.49}
\]
If one believes the working correlation matrix is correctly specified, the non-robust variance estimator of $\hat{\beta}$ is
\[
\widehat{\operatorname{Var}}_1(\hat{\beta}) = \Big( \sum_{g=1}^{G} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}, \tag{2.50}
\]
where $\hat{D}_g = \partial\Phi(X_g\hat{\beta})/\partial\beta' = \phi(X_g\hat{\beta})X_g$ (elementwise). $\hat{\beta}$ is consistent even under a misspecified spatial correlation structure. The variance estimator that is robust to misspecification of the spatial correlation is
\[
\widehat{\operatorname{Var}}_R(\hat{\beta}) = \Big( \sum_{g=1}^{G} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}
\Big( \sum_{g=1}^{G}\sum_{h=1}^{G} k(d_{gh})\, \hat{D}_g' \hat{W}_g^{-1} \hat{u}_g \hat{u}_h' \hat{W}_h^{-1} \hat{D}_h \Big)
\Big( \sum_{g=1}^{G} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}, \tag{2.51}
\]
where $k(d_{gh})$ is a kernel function depending on the distances between groups. An alternative approach is to specify the full multivariate normal distribution of the latent error and estimate the spatial correlation parameter of the latent error within an MLE framework; see, for example, Wang, Iglesias and Wooldridge (2012).
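A sketch of the second step (2.49) for the Probit case follows. It takes the first-step estimate and $\hat{\lambda}$ as given (the stand-in values below are assumptions), and it assumes the working correlation $\lambda/d_{ij}$ with unit within-group distances. Note that the weights are built from the first-step estimate and held fixed, which is the two-step design:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)

# Placeholder inputs: G groups of size L, plus stand-ins for the first-step
# estimate beta_check and the estimated correlation parameter lambda_hat
G, L = 100, 4
X = np.column_stack([np.ones(G * L), rng.normal(1.0, 1.0, G * L)])
y = (X @ np.array([-1.0, 1.0]) + rng.standard_normal(G * L) >= 0).astype(float)
beta_check, lam_hat = np.array([-1.0, 1.0]), 0.1
d_within = np.ones((L, L))  # assumed unit distances within a group

def working_W(Xg):
    # W_g = V_g^(1/2) R_g V_g^(1/2) with R_ij = lambda / d_ij off the diagonal
    p = norm.cdf(Xg @ beta_check)
    s = np.sqrt(p * (1 - p))
    R = np.where(np.eye(L, dtype=bool), 1.0, lam_hat / d_within)
    return s[:, None] * R * s[None, :]

def gee_obj(beta):
    # GEE criterion (2.49): sum of weighted quadratic forms over groups
    total = 0.0
    for g in range(G):
        Xg, yg = X[g * L:(g + 1) * L], y[g * L:(g + 1) * L]
        r = yg - norm.cdf(Xg @ beta)
        total += r @ np.linalg.solve(working_W(Xg), r)
    return total

beta_gee = minimize(gee_obj, x0=beta_check, method="Nelder-Mead").x
print("beta_GEE:", beta_gee)
```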
2.6.2 Example 2. A Poisson Model with a Multiplicative Spatial Error

A count variable takes on nonnegative integer values. Many variables that we would like to explain in terms of covariates come as counts, such as the number of times someone is arrested in a given year or the number of patents applied for by a firm in a year. Count data may also have an upper bound: for the number of children in a family who are high school graduates, the upper bound is the number of children in the family (Wooldridge 2010). A count data model is usually characterized by a density in the LEF together with a population mean. Let us use a specific example to demonstrate a count data model with a spatial error term. The conditional Poisson density is
\[
f(y \mid x, D) = \exp(-\mu)\,\mu^y/y!, \tag{2.52}
\]
where $y! = 1\cdot 2\cdots(y-1)\cdot y$, $0! = 1$, and $\mu$ is the conditional mean of $y$. For a given sample, specify the conditional mean in exponential form:
\[
E(y_i \mid x_i, D_N) = \exp(x_i\beta). \tag{2.53}
\]
Assume that the conditional mean function is correctly specified, and model the spatial correlation in the conditional mean through
\[
E(y_i \mid x_i, v_i, D_N) = v_i\exp(x_i\beta_0), \tag{2.54}
\]
where $v_i$ is the multiplicative spatial error term; let $v = (v_1, v_2, \ldots, v_N)$. The count data model can be written in conditional mean form,
\[
y_i = v_i\exp(x_i\beta_0) + \delta_i, \qquad E(\delta_i \mid x_i, v_i, D_N) = 0. \tag{2.55}
\]
A count data model with a multiplicative spatial error can be characterized by the following assumptions: (1) $\{(x_i, v_i, \delta_i)\}$ is a mixing sequence on the sampling space, with mixing coefficient $\alpha$ or $\phi$; (2) $E(y_i \mid x_i, v_i, D_N) = v_i\exp(x_i\beta_0)$; (3) $y_i$ and $y_j$ are independent conditional on $x_i, x_j, v_i, v_j, D_N$, for $i \neq j$; (4) $v_i$ is independent of $x_i$, with $E(v_i) = 1$, $\operatorname{Var}(v_i) = \tau^2$, and $\operatorname{Cov}(v_i, v_j) = \tau^2\cdot c(d_{ij})$, where $c(d_{ij})$ is the spatial correlation depending on the distance between observations $i$ and $j$.

Note that we specify only the conditional mean, not the full distribution. Even if the data do not follow the Poisson distribution, the quasi-MLE approach using the Poisson density gives a consistent estimator; indeed, $y$ need not even be a count. Under the above assumptions we can integrate out $v_i$ using the law of iterated expectations:
\[
E(y_i \mid x_i, D_N) = E[E(y_i \mid x_i, v_i, D_N) \mid x_i, D_N]
= \exp(x_i\beta_0)\,E(v_i \mid x_i, D_N) = \exp(x_i\beta_0)\,E(v_i) = \exp(x_i\beta_0). \tag{2.56}
\]
We can also calculate the variances and covariances of $y$ conditional on $x$:
\[
\operatorname{Var}(y_i \mid x_i, D_N)
= E[\operatorname{Var}(y_i \mid x_i, v_i, D_N) \mid x_i, D_N] + \operatorname{Var}[E(y_i \mid x_i, v_i, D_N) \mid x_i, D_N]
= \exp(x_i\beta_0) + \exp(2x_i\beta_0)\,\tau^2, \tag{2.57}
\]
and, since $\operatorname{Cov}(y_i, y_j \mid x_i, x_j, v_i, v_j, D_N) = 0$ by assumption (3),
\[
\operatorname{Cov}(y_i, y_j \mid x_i, x_j, D_N)
= \operatorname{Cov}\big( v_i\exp(x_i\beta_0),\ v_j\exp(x_j\beta_0) \mid x_i, x_j, D_N \big)
= \exp(x_i\beta_0)\exp(x_j\beta_0)\,\tau^2\,c(d_{ij}). \tag{2.58}
\]
The model can also be expressed as
\[
E(y_i \mid x_i, e_i, D_N) = \exp(x_i\beta_0 + e_i) = \exp(e_i)\exp(x_i\beta_0), \tag{2.59}
\]
so that $v_i = \exp(e_i)$. We can model the spatial correlation between $e_i$ and $e_j$, $i, j = 1, 2, \ldots, N$; let $e = (e_1, e_2, \ldots, e_N)$. In this paper only positive correlations are considered, for convenience; negative correlations⁹ are possible, and the method can be extended to them. For convenience, I assume $e$ follows a multivariate normal distribution $N(\mu, \Omega)$, and I use this in the simulations later; then $v$ follows a multivariate lognormal distribution. One could use other multivariate distributions for $e$, with $v$ following the corresponding transformed distribution, or specify a multivariate distribution for $v$ directly, as long as the first two moment assumptions on $v_i$ are satisfied.

⁹Bloom, Schankerman and Van Reenen's NBER working paper (2012) identifies a negative product-market rivalry effect.

Although $y$ does not follow a multivariate Poisson distribution, the derivation above shows that $\beta_0$ can be consistently estimated by partial Poisson maximum likelihood estimation. The density used for $y_i$ is
\[
f_i(y_i \mid x_i, D_N) = \exp[-\exp(x_i\beta)]\,[\exp(x_i\beta)]^{y_i}/y_i!, \tag{2.60}
\]
where $y_i! = 1\cdot 2\cdots(y_i-1)\cdot y_i$ and $0! = 1$. The log likelihood for each observation is
\[
l_i(\beta) = -\exp(x_i\beta) + y_i x_i\beta - \log(y_i!), \tag{2.61}
\]
the score is
\[
s_i(\beta) \equiv \partial l_i(\beta)/\partial\beta' = x_i'[y_i - \exp(x_i\beta)] = x_i'u_i, \tag{2.62}
\]
and the Hessian is
\[
H_i(\beta) = -\exp(x_i\beta)\,x_i'x_i. \tag{2.63}
\]
Let $A_i \equiv -E(H_i(\beta) \mid x_i, D_N) = \exp(x_i\beta)x_i'x_i$, let $A_0 \equiv E[A_i]$ over the distribution of $x$ and the spatial space $D$, and let $B_0 \equiv E[s_i(\beta_0)s_i(\beta_0)'] = E[u_i^2 x_i'x_i]$. The partial QMLE gives a consistent estimator of the mean parameters, and solves
\[
\check{\beta} = \arg\max_{\beta} L(\beta) = \sum_{i=1}^{N} l_i(\beta) = \sum_{i=1}^{N} y_i x_i\beta - \sum_{i=1}^{N}\exp(x_i\beta) - \sum_{i=1}^{N}\log(y_i!). \tag{2.64}
\]
A robust partial QMLE variance-covariance estimator is
\[
\Big( \sum_{i=1}^{N} \exp(x_i\check{\beta})\,x_i'x_i \Big)^{-1}
\Big( \sum_{i=1}^{N}\sum_{j=1}^{N} k(d_{ij})\, x_i'\hat{u}_i\hat{u}_j x_j \Big)
\Big( \sum_{i=1}^{N} \exp(x_i\check{\beta})\,x_i'x_i \Big)^{-1}, \tag{2.65}
\]
where $k(d_{ij})$ is a kernel function depending on the distance between observations $i$ and $j$.

The partial QMLE makes no use of the pairwise correlations; if we can use them, we may improve efficiency. The GEE approach is
\[
\hat{\beta} = \arg\min_{\beta} \sum_{g=1}^{G} [y_g - \exp(X_g\beta)]'\, \hat{W}_g^{-1}\, [y_g - \exp(X_g\beta)], \tag{2.66}
\]
where $\hat{W}_g$ is an estimated weighting matrix.
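Before constructing the weighting matrix, it is easy to confirm the moment formulas (2.56)-(2.57) by simulation. The sketch below draws a lognormal $v$ with $E(v) = 1$ at a single covariate value; all numerical values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Monte Carlo check of (2.56)-(2.57) at one x value:
# v lognormal with E(v) = 1 and Var(v) = tau2; y | v is Poisson(v * exp(x @ beta))
x = np.array([1.0, 0.5])
beta = np.array([1.0, -1.0])
mu = np.exp(x @ beta)

sigma2 = 1.0
tau2 = np.exp(sigma2) - 1.0  # Var(v) for a lognormal with E(v) = 1
R = 1_000_000
v = np.exp(rng.normal(-sigma2 / 2, np.sqrt(sigma2), R))
y = rng.poisson(v * mu)

print("mean:", y.mean(), "vs theory:", mu)
print("var :", y.var(), "vs theory:", mu + mu**2 * tau2)
```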
Again, as in the Probit model, $\hat{W}_g = \hat{V}_g^{1/2}\hat{R}_g\hat{V}_g^{1/2}$, where $\hat{V}_g$ is a diagonal matrix with the estimated variances on the diagonal and $\hat{R}_g$ is the estimated working correlation matrix. The most efficient weighting matrix is the true covariance matrix of $y - \exp(X\beta)$.

There are two pivotal unknown parameters in the estimation, $\tau^2$ and $\rho$. We can use the partial Poisson QMLE residuals to estimate them. Let $\check{\beta}$ be the partial Poisson regression estimator, and let $\check{u}_i^2 = [y_i - \exp(x_i\check{\beta})]^2$ be the squared residual. By (2.57), $\tau^2$ can be estimated as the coefficient $\hat{\tau}^2$ from regressing $\check{u}_i^2 - \exp(x_i\check{\beta})$ on $\exp(2x_i\check{\beta})$. The estimation of $\rho$ is a little more complicated. First, we do not know how $\sigma_{ij}$ depends on $\rho$ and $d_{ij}$; if we use the wrong structure, we will probably obtain a wrong estimator of $\rho$. Suppose the correlation structure is $c(d_{ij}) = \rho/d_{ij}$; then an estimator of $\rho$ is the coefficient from regressing $\check{u}_i\check{u}_j/[\exp(x_i\check{\beta})\exp(x_j\check{\beta})\,\hat{\tau}^2]$ on $1/d_{ij}$. However, this estimator sometimes suffers from negative products of the Poisson regression residuals, and the estimated parameter $\hat{\rho}$ is biased downward. $\hat{W}_g$ is obtained by plugging $\hat{\tau}^2$ and $\hat{\rho}$ back into the variance-covariance matrix. Minimizing (2.66) then yields the GEE estimator.

If Assumption 13 holds, a non-robust variance estimator for GEE is
\[
\widehat{\operatorname{Var}}(\hat{\beta}) = \Big( \sum_{g=1}^{G} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}, \tag{2.67}
\]
where $\hat{D}_g = \partial\exp(X_g\beta)/\partial\beta' = \exp(X_g\hat{\beta})X_g$ (elementwise). $\hat{\beta}$ remains consistent even under a misspecified spatial correlation structure, i.e., when Assumption 13 does not hold. The variance estimator that is robust to misspecification of the spatial correlation is
\[
\widehat{\operatorname{Var}}_R(\hat{\beta}) = \Big( \sum_{g} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}
\Big( \sum_{g}\sum_{h} k(d_{gh})\, \hat{D}_g' \hat{W}_g^{-1} \hat{u}_g \hat{u}_h' \hat{W}_h^{-1} \hat{D}_h \Big)
\Big( \sum_{g} \hat{D}_g' \hat{W}_g^{-1} \hat{D}_g \Big)^{-1}, \tag{2.68}
\]
where $k(d_{gh})$ is a kernel function depending on the distances between groups; the distance can be taken as the smallest distance between two observations belonging to different groups.

2.7 Monte Carlo Simulations

In this section, I use Monte Carlo simulations to investigate the efficiency gain of the GEE approach over the pooled QMLE. The simulation mechanism is as follows; a code sketch of these steps appears after the list.

(1) Each individual resides on an intersection of a square lattice, so the pairwise distances can be calculated from the coordinates of the observations.

(2) Generate the pairwise correlations from the pairwise distances. The simulations cover two cases of pairwise correlation: in one, each observation is correlated with all other observations; in the other, only observations in the same group are correlated with each other, i.e., groupwise independence is assumed. The groups are illustrated in the graph in the next subsection.

(3) After generating the correlated data, the first step is to compute the pooled QMLE estimator. The second step divides the observations into groups and uses only within-group information to estimate the correlation parameters. In the case of groupwise dependence, although every pair of observations is correlated, we do not use the pairwise correlations between observations in different groups; under groupwise independence, this restriction is natural. After estimating the spatial correlation parameters, the mean parameters are re-estimated by the GEE procedure.
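The following sketch implements the steps above in miniature for the count model, including the auxiliary estimators of $\tau^2$ and $\rho$ described in Section 2.6.2. Two simplifications are assumed: the true $\beta$ is used in place of the first-step estimate, and nonpositive log arguments are simply dropped (the text notes this estimator can be biased). The covariance $\rho/d_{ij}$ is taken to be positive definite for these parameter values, as in the chapter's design:

```python
import numpy as np

rng = np.random.default_rng(5)

# Step (1): 20 x 20 lattice locations and pairwise Euclidean distances
coords = np.array([(r, s) for r in range(1, 21) for s in range(1, 21)], dtype=float)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
n = len(coords)

# Step (2): correlated errors with Cov(e_i, e_j) = rho / d_ij
rho, sigma2 = 0.3, 1.0
Omega = np.where(d > 0, rho / d, sigma2)
e = rng.multivariate_normal(np.full(n, -sigma2 / 2), Omega, check_valid="ignore")
v = np.exp(e)  # lognormal, E(v) = 1

X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, n)])
beta = np.array([1.0, -1.0])
y = rng.poisson(v * np.exp(X @ beta))

# Step (3), abbreviated: use the true beta in place of the first-step estimate
m = np.exp(X @ beta)
u = y - m

# tau^2: regression of (u^2 - m) on m^2 through the origin
tau2_hat = np.sum((u**2 - m) * m**2) / np.sum(m**4)

# rho: average of log(u_i u_j / (m_i m_j) + 1) over distance-one pairs
i, j = np.where(np.triu(d == 1.0))
prod = u[i] * u[j] / (m[i] * m[j]) + 1.0
rho_hat = np.mean(np.log(prod[prod > 0]))
print("tau2_hat:", tau2_hat, "rho_hat:", rho_hat)
```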
2.7.1 Sampling Space

Graph 1 illustrates the case in which the sample size is 400: I create a $20 \times 20$ square lattice, and each observation resides on an intersection of the lattice. The locations of the data are $\{(r, s): r, s = 1, 2, \ldots, 20\}$. The distance $d_{ij}$ between locations $i$ and $j$ is Euclidean distance: if $A = (a_i, a_j)$ and $B = (b_i, b_j)$ are two points on the lattice, their distance is $d_{ij} = \sqrt{(a_i - b_i)^2 + (a_j - b_j)^2}$. The spatial correlation is then based on a given parameter $\rho$ and $d_{ij}$; the assumed correlation of the spatially correlated error term is $\rho/d_{ij}$, $i, j = 1, 2, \ldots, N$. For the Probit model this is the correlation of the latent error, and for the count data model it is the correlation of the multiplicative error. The data are divided into groups of 4 or 16. One could use more observations per group, say 25 or 100, but this increases the computational burden, and 25 or 100 is too large relative to the sample size of 400. In Graph 1, the upper left corner of the lattice illustrates a group size of four; the lower left corner illustrates a group size of 16; and the lower right part illustrates a group size of 100. The idea of the two-step method is to use a small number of pairwise correlations and conveniently obtain more efficient estimators; thus I use 4 or 16 observations per group. Note that, for convenience, I make the pairwise distances the same across groups.

2.7.2 Spatial Probit Data

For the Probit model, we cannot easily find the variances and covariances of the dependent variables conditional on the covariates. The correlated latent errors result in correlated binary response variables; however, the correlations in the latent error usually do not reflect the exact correlations in the binary dependent variables. The correlations are much smaller in the binary response variables because of the nonlinear transformation. In the following simulation, the sample size is 400 and the number of replications is 500. Consider the following data generating process:

1. $x_i = [1, x_{i1}]$, $x_{i1} \sim N(1, 1)$; $\beta = [-1, 1]'$.
2. $y_i^* = x_i\beta + e_i$, $e \sim \mathrm{MVN}(0, \Omega)$, $\operatorname{Var}(e_i) = 1$, $\operatorname{Cov}(e_i, e_j) = \rho/d_{ij}$; $\rho = 0, 0.1, 0.2, 0.3, 0.4$.
3. $y_i = 1$ if $y_i^* \geq 0$; $y_i = 0$ if $y_i^* < 0$.
4. Use $\operatorname{Corr}(y_i, y_j \mid x_i, x_j, D_N) = \lambda/d_{ij}$ as the working correlation of the dependent variables. This form is arbitrary; based on the available information it is not likely to be true, and $\lambda$ is unknown.

The simulation results are in Tables 2.1-2.2. $\hat{\lambda}$ is the estimator of $\lambda$, computed by the minimum distance criterion
\[
\min_{\lambda} \sum_{g=1}^{G}\sum_{i=1}^{L}\sum_{j\neq i} \left[ \frac{\check{u}_i\check{u}_j}{\sqrt{\Phi(x_i\check{\beta})[1-\Phi(x_i\check{\beta})]}\ \sqrt{\Phi(x_j\check{\beta})[1-\Phi(x_j\check{\beta})]}} - \frac{\lambda}{d_{ij}} \right]^2
\]
for $i, j$ in the same group $g$, where $L$ is the number of individuals in a group. Because $\lambda$ is unknown, we cannot calculate the bias of $\hat{\lambda}$. $\check{\beta}$ is the first-step partial QMLE estimator of $\beta$. $\hat{\beta}_1$ is the two-step weighted nonlinear least squares (WNLS) estimator that uses only the variances as weights in the second step. $\hat{\beta}_4$ is the two-step GEE estimator that uses a $4 \times 4$ variance-covariance matrix as the weight for each group, and $\hat{\beta}_{16}$ is the two-step GEE estimator that uses a $16 \times 16$ variance-covariance matrix for each group. Note that when $L = 4$, $\hat{\beta}_{16}$ is computed using the $\hat{\lambda}$ obtained with group size 4, and similarly for $\hat{\beta}_4$ when $L = 16$. The following two tables show two cases of the simulation: (1) N=400, L=4; (2) N=400, L=16.
Table 2.1: Binary Data, N=400, L=4 (T=500, β=1)

            ρ=0              ρ=0.1            ρ=0.2            ρ=0.3            ρ=0.4
            est.     s.d.    est.     s.d.    est.     s.d.    est.     s.d.    est.     s.d.
  λ̂        0.0032   0.0419  0.0281   0.0501  0.0616   0.0575  0.0903   0.0558  0.1281   0.0745
  β̌        1.0162   0.1111  1.0139   0.1166  1.0241   0.1103  1.0242   0.1138  1.0444   0.1128
  β̂1       1.0162   0.1011  1.0139   0.1166  1.0241   0.1103  1.0233   0.1147  1.0434   0.1148
  β̂4       1.0155   0.1144  1.0145   0.1165  1.0243   0.1111  1.0253   0.1142  1.0484   0.1185
  β̂16      1.0169   0.1118  1.1053   0.1172  1.0261   0.1136  1.0251   0.1157  1.0515   0.1162

Table 2.2: Binary Data, N=400, L=16 (T=500, β=1)

            ρ=0              ρ=0.1            ρ=0.2            ρ=0.3            ρ=0.4
            est.     s.d.    est.     s.d.    est.     s.d.    est.     s.d.    est.     s.d.
  λ̂       −0.0033   0.0267  0.0238   0.0338  0.0504   0.0407  0.0822   0.0539  0.1100   0.0601
  β̌        1.0042   0.1082  1.0165   0.1174  1.0295   0.1118  1.0336   0.1121  1.0483   0.1277
  β̂1       1.0043   0.1081  1.0165   0.1174  1.0280   0.1154  1.0336   0.1121  1.0483   0.1277
  β̂4       1.0037   0.1103  1.0164   0.1170  1.0300   0.1120  1.0346   0.1135  1.0494   0.1283
  β̂16      1.0050   0.1081  1.0166   0.1171  1.0303   0.1125  1.0366   0.1135  1.0516   0.1298

From the above results, $\hat{\lambda}$ is much smaller than $\rho$, which implies that a high spatial correlation in the latent error term may cause only a tiny correlation in the binary response variables. The PQMLE $\check{\beta}$ delivers almost the same efficiency as the other three two-step estimators. Since the correlations between the binary dependent variables are low, a two-step estimator may not be necessary. One could use other data generating processes that produce highly correlated binary dependent variables to examine the two-step method.

2.7.3 Spatial Count Data

In the count data case, the variances and covariances of the count dependent variable can be written in closed form. The spatial correlation in the underlying spatial error term is then transformed into spatial correlation in the response variable: knowing the correlations in the spatial error term, we know the correlations in the count dependent variable. This avoids the situation in which we cannot compare the estimated spatial correlation parameter with the true parameter. Recall the conditional mean assumption for the count data, $E(y_i \mid x_i, v_i, D_N) = \exp(x_i\beta + e_i) = v_i\exp(x_i\beta_0)$. Consider the following spatial count data generating process:

1. $e_i$, $i = 1, 2, \ldots, N$, follows a multivariate normal distribution with $E(e_i) = -\frac{1}{2}$ and $\operatorname{Var}(e_i) = \sigma^2 = 1$; $\operatorname{Cov}(e_i, e_j) = \rho/d_{ij}$, with $\rho = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6$. Therefore $v_i$ follows a multivariate lognormal distribution with $E(v_i) = 1$, $\operatorname{Var}(v_i) = e - 1$ and $\operatorname{Cov}(v_i, v_j) = \exp(\rho/d_{ij}) - 1$.
2. $x_i = [1, x_{i1}]$, $x_{i1} \sim \mathrm{Uniform}(0, 1)$.
3. $\beta = [1, -1]'$.
4. $m_i = \exp(x_i\beta + e_i)$.
The correlations are calculated separately for pairs of yi s that are 1, 2, 5 apart from each other. As pairwise distances increase, the correlations decrease. There is almost no correlation between two observations when distance is 5. Moreover, the spatial parameter ρ can be as high as 0.6, and the sample correlation of y can be as high as more than 0.3. 72 Table 2.3: Correlations in Simulated Count Data T=2000 dij 1 Nij 760 ρ ρy,d=1 0 -0.001 0.1 0.043 0.2 0.089 0.3 0.139 0.4 0.194 0.5 0.250 0.312 0.6 N=400 N=1600 2 5 1 2 5 720 1688 3170 3040 8028 ρy,d=2 ρy,d=5 ρy,d=1 ρy,d=2 ρy,d=5 -0.001 −0.000 0.000 0.000 -0.000 0.022 0.009 0.043 0.021 0.008 0.043 0.017 0.089 0.043 0.015 0.068 0.026 0.139 0.064 0.023 0.091 0.036 0.194 0.088 0.032 0.115 0.046 0.253 0.112 0.040 0.139 0.056 0.315 0.136 0.048 As long as we can find consistent estimators for τ 2 and ρ, we can do the second step estimation. As in 2.57 and 2.58, the conditional variances and covariances of count dependent variable can be written in a closed form according to which the spatial parameters can be estimated. Using the information above, τ 2 can be estimated by τ 2 as the coefficient by ˆ ˇ ˇ ˆ regressing u2 − exp xi β on exp 2xi β . Obviously τ 2 does not depend on distances. For ˇi 1 simplicity ρ is estimated by N 0 N i=1 N j=i log[ ui uj ˇ ˇ ˇ ˇ exp(xi β ) exp xj β + 1] for those pairs whose pairwise distance is one, where N0 is the number of pairs whose distance is one. Table 2.4 shows the simulation results of the case in which: N=1600, L=4 and L=16. ˇ ˆ ˆ ˆ Replication is 2000. β is the one-step QMLE estimator; β1 , β4 and β16 are three two-step ˆ estimators which uses different variance and covariance matrices of the count variable . β1 ˆ only uses the estimated variances as weights; β4 uses the covariance matrix with group size ˆ equal to four, and β16 uses the covariance matrix with group size equal to sixteen. We ˆ ˆ ˆ can see that all the three two-step estimators β1 , β4 and β16 are more efficient than the ˇ ˆ ˆ one-step estimator β. The two GEE estimators β4 and β16 has more improvement when ρ grows larger. When sample size increases, efficiency gains. ρ and τ 2 are both slightly biased ˆ ˆ downwards. 73 Table 2.4: Count Data N=1600, L=4 and 16 σ 2 = 1, N=1600 ρ 0 0.1 0.2 0.3 0.4 0.5 ρ ˆ estimate −0.002 bias −0.002 s.d. (0.043) estimate 0.095 bias −0.005 s.d. (0.048) estimate 0.191 bias -0.008 s.d. (0.055) estimate 0.287 bias −0.013 s.d. 0.063 estimate 0.384 bias −0.016 s.d. (0.070) estimate 0.480 bias −0.020 s.d. (0.076) β = −1 L=4 σ2 ˆ 0.984 0.016 (0.156) 0.980 −0.020 (0.158) 0.970 -0.030 (0.160) 0.969 -0.031 (0.161) 0.967 -0.033 (0.172) 0.974 −0.026 (0.196) ˇ β -1.000 0.000 (0.139) -1.000 0.000 (0.137) −1.001 -0.001 (0.131) -0.998 0.002 (0.138) −1.000 0.000 (0.136) -0.998 0.002 (0.134) ˆ ˆ ˆ β1 β4 β16 -0.998 -0.998 -0.998 0.002 0.002 0.002 (0.135) (0.135) (0.135) -0.998 -0.998 -0.998 0.002 0.002 0.002 (0.134) (0.134) (0.134) -0.998 -0.998 -0.999 0.002 0.002 0.001 (0.128) (0.126) (0.126) -0.996 -0.995 -0.997 0.004 0.005 0.003 (0.135) (0.132) (0.131) -0.998 -0.998 -0.999 0.002 0.002 0.001 (0.134) (0.128) (0.126) -0.997 -0.996 -0.998 0.003 0.004 0.002 (0.130) (0.120) (0.118) 74 L=16 2.8 Conclusions The spatial correlation in nonlinear models makes it more difficult to get efficient estimators. For a Probit model, because of the nonlinearity, it is hard to find the real form of spatial correlations between two dependent variables. Thus the estimated variance covariance matrix is likely to be misspecified. 
In the spatial count data model, since we can actually write down the variances and covariances of the dependent variables under certain assumptions, we can use multivariate nonlinear weighted least squares (MNWLS) to improve efficiency; accounting for spatial correlation improves efficiency in the count data example. Further study will focus on how to obtain a better estimator of the spatial correlation parameter, how to obtain a good approximation of the correlation structure, and how to incorporate spatial correlation into other nonlinear models, such as multi-category choice, fractional response, and two-part models.

Chapter 3

Conditions for the Numerical Equality of the OLS, GLS and Amemiya-Cragg Estimators

3.1 Introduction

Conditions under which the ordinary least squares (OLS) and generalized least squares (GLS) estimators are equal are well known. This paper extends these results in two ways. First, we give conditions under which GLS based on one assumed error variance matrix equals GLS based on a different assumed variance matrix. Second, we give conditions under which GLS equals the GMM estimator of Amemiya (1983) and Cragg (1983).

3.2 Equivalence of OLS and GLS

Consider the linear regression model
\[
y = X\beta + \varepsilon, \tag{3.1}
\]
where $y$ is $T \times 1$ and $X$ is $T \times K$. Let $\Sigma$ be a $T \times T$, positive definite, assumed or estimated variance matrix of $\varepsilon$. We consider the ordinary least squares (OLS) estimator $\hat{\beta}$ and the generalized least squares (GLS) estimator $\tilde{\beta}$ defined as follows:
\[
\hat{\beta} = (X'X)^{-1}X'y, \qquad \tilde{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y. \tag{3.2}
\]
The question is under what conditions on $X$ and $\Sigma$ it is the case that $\hat{\beta} = \tilde{\beta}$. Since these are conditions for numerical equality of the estimators, they do not depend on assumptions about $\varepsilon$; in particular, they do not depend on whether $\Sigma$ does or does not actually equal the variance matrix of $\varepsilon$. We assume only that $\Sigma$, $X'X$ and $X'\Sigma^{-1}X$ are positive definite.

This is an old and classic problem. The basic equivalence results were first given by Zyskind (1962, 1967), Rao (1967) and Kruskal (1968), and were summarized in textbook fashion by Amemiya (1985). Since then the number of equivalent conditions for equality of the two estimators has grown: the survey by Puntanen and Styan (1989) lists 20 such conditions. See also Baltagi (1989) and McAleer (1992) for more discussion and applications of these results.

A related but different question is under what conditions on $\Sigma$ we have OLS = GLS for all $X$. McElroy (1967) showed that, if $X$ contains an intercept, a necessary and sufficient condition is that $\Sigma$ have the "equicorrelated" form (all diagonal elements equal, and all off-diagonal elements equal). Balestra (1970) extended this result to the case in which a subset of the regressors must have a certain form and then OLS = GLS for all possible values of the remaining regressors. We do not seek to extend these results in this paper.

Because of the numerous different equivalent sets of conditions for the equality of OLS and GLS, we focus on the conditions given in Theorem 6.1.1 of Amemiya (1985, p. 182). The following theorem is a minor extension of that result.

THEOREM 1. Suppose that $\Sigma$, $X'X$ and $X'\Sigma^{-1}X$ are positive definite. Then the following statements are equivalent.
(A) $(X'X)^{-1}X'\Sigma X(X'X)^{-1} = (X'\Sigma^{-1}X)^{-1}$

(B) $\Sigma X = XB$ for some nonsingular $B$

(C) $(X'X)^{-1}X' = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}$

(D) $X = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Sigma$

(E) $Z'\Sigma X = 0$ for any $Z$ such that $Z'X = 0$

(F) $\Sigma = X\Gamma X' + Q\Theta Q' + c^2 I$ for some $\Gamma$, some $\Theta$, some $c$, and some $Q$ such that $Q'X = 0$

(F') $\Sigma = X\Psi X' + R\Phi R'$ for some $\Psi$, some $\Phi$, and some $R$ such that $R'X = 0$

Conditions (A)-(F) are, apart from a few changes in notation, as given by Amemiya (p. 182); condition (F') is new. Amemiya shows the equivalence of (A)-(E) and refers to Rao (though not to the correct paper by Rao) for a proof of the equivalence of (F). It is trivial that (F) implies (E), but the proof in Rao (1967, p. 364) that (E) implies (F) is not transparent (or even complete; it just says that "it is easy to verify . . ."). In Chapter 4 we show that (F') is equivalent to (F) and provide a straightforward proof that (D) implies (F').

3.3 Equivalence of Two Different GLS Estimators

We now consider two different GLS estimators:
\[
\tilde{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y, \qquad \ddot{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y. \tag{3.3}
\]
We will provide conditions (on $X$, $\Sigma$ and $\Omega$) such that these two estimators are equal. (Obviously the results of the previous section are the special case corresponding to $\Omega = I$.) The context we have in mind is that $\Sigma$ is the correct variance matrix of $\varepsilon$ and $\Omega$ is an incorrect or approximate variance matrix; in this case $\ddot{\beta}$ could be considered a pseudo-GLS estimate. However, our conditions are just conditions for equality of the two estimators, and so which (if either) is based on the correct variance matrix of $\varepsilon$ is irrelevant. Note also that any of the conditions given below must still hold if we reverse the roles of $\Sigma$ and $\Omega$, which is why there are two versions of results (A), (B), etc.

The phrase "pseudo-GLS" has been used in the literature with a different meaning, namely, GLS using the original error variance matrix but after some of the regressors have been partialled out; see, e.g., Fiebig, Bartels and Krämer (1996) or Gross and Puntanen (2000). This section does not apply to that topic, because we assume that both $\Sigma$ and $\Omega$ are nonsingular. Mathew (1983) gives conditions for the two different GLS estimators to be equal for all $X$ in a certain class; this is the same sort of question that was addressed for equality of OLS and GLS by McElroy (1967) and Balestra (1970). Our theorem below is different because it applies for a given $X$.

THEOREM 2. Suppose that $\Sigma$, $\Omega$, $X'X$, $X'\Sigma^{-1}X$ and $X'\Omega^{-1}X$ are positive definite. Then the following statements are equivalent.

(A1) $(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\Omega\Sigma^{-1}X(X'\Sigma^{-1}X)^{-1} = (X'\Omega^{-1}X)^{-1}$

(A2) $(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\Sigma\Omega^{-1}X(X'\Omega^{-1}X)^{-1} = (X'\Sigma^{-1}X)^{-1}$

(B1) $\Omega\Sigma^{-1}X = XB$ for some nonsingular $B$

(B2) $\Sigma\Omega^{-1}X = XB$ for some nonsingular $B$

(C) $(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}$

(D1) $\Sigma^{-1/2}X = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Sigma^{-1/2}\Omega\Sigma^{-1/2}$

(D2) $\Omega^{-1/2}X = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Omega^{-1/2}\Sigma\Omega^{-1/2}$

(E1) $Z'\Omega\Sigma^{-1}X = 0$ for any $Z$ such that $Z'X = 0$

(E2) $Z'\Sigma\Omega^{-1}X = 0$ for any $Z$ such that $Z'X = 0$

(F1) $\Omega = X\Gamma X' + \Sigma Q\Theta Q'\Sigma + c^2\Sigma$ for some $\Gamma$, some $\Theta$, some $c$, and some $Q$ such that $Q'X = 0$

(F2) $\Sigma = X\Gamma X' + \Omega Q\Theta Q'\Omega + c^2\Omega$ for some $\Gamma$, some $\Theta$, some $c$, and some $Q$ such that $Q'X = 0$

(F1') $\Omega = X\Psi X' + \Sigma R\Phi R'\Sigma$ for some $\Psi$, some $\Phi$, and some $R$ such that $R'X = 0$

(F2') $\Sigma = X\Psi X' + \Omega R\Phi R'\Omega$ for some $\Psi$, some $\Phi$, and some $R$ such that $R'X = 0$

The proof of these results is given in Chapter 4. Our proof is essentially an exercise in translation of Amemiya's conditions. An alternative that we do not pursue would be to use the results of Gourieroux and Monfort (1980).
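These algebraic conditions are easy to verify numerically. The sketch below (arbitrary random matrices; all values illustrative) checks condition (F) of Theorem 1 with $\Theta = 0$, and condition (F2) of Theorem 2 with $\Theta = 0$, $c = 1$, which is the random-coefficients case of Example 1 below:

```python
import numpy as np

rng = np.random.default_rng(7)

T, K = 50, 3
X = rng.standard_normal((T, K))
y = rng.standard_normal(T)
M = rng.standard_normal((K, K))
Gamma = M @ M.T  # any positive semidefinite Gamma

def gls(S):
    Si = np.linalg.inv(S)
    return np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

# Theorem 1 (F): Sigma = X Gamma X' + c^2 I  =>  GLS = OLS
Sigma1 = X @ Gamma @ X.T + 4.0 * np.eye(T)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(gls(Sigma1), beta_ols))  # True

# Theorem 2 (F2): Sigma = X Gamma X' + Omega  =>  GLS(Sigma) = GLS(Omega)
A = rng.standard_normal((T, T))
Omega = A @ A.T + np.eye(T)  # arbitrary positive definite Omega
Sigma2 = X @ Gamma @ X.T + Omega
print(np.allclose(gls(Sigma2), gls(Omega)))  # True
```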
The results in Theorem 2 have some applications that are similar to existing applications of Theorem 1 in the literature.

Example 1. Random coefficients. Consider the random coefficient model
\[
y = X\beta^* + u, \quad \beta^* = \beta + v, \qquad \text{so } y = X\beta + \varepsilon \text{ where } \varepsilon = u + Xv. \tag{3.4}
\]
Suppose that $u$ and $v$ are uncorrelated and $V(v) = \Gamma$. If $V(u) = \sigma^2 I$, then $V(\varepsilon) \equiv \Sigma = X\Gamma X' + \sigma^2 I$. Therefore condition (F) of Theorem 1 applies and GLS = OLS, as has been pointed out by Rao (1967), Amemiya (1985) and Baltagi (1989), among others. Now suppose instead that $V(u) = \Omega$. Then $\Sigma = X\Gamma X' + \Omega$, so condition (F2) of Theorem 2 holds, and GLS based on $\Sigma$ equals GLS based on $\Omega$, which, so far as we know, is a previously unknown result.

Example 2. SUR with the same regressors in every equation. Consider a set of $G$ seemingly unrelated regression equations, with $T$ observations per equation and a common regressor matrix $X$ of dimension $T \times K$. Using standard notation, write the stacked system of equations as
\[
y_* = X_*\beta_* + \varepsilon_*, \tag{3.5}
\]
where $y_*$ is $GT \times 1$, $X_* = (I_G \otimes X)$ is $GT \times KG$, etc. Suppose that $V(\varepsilon_*) = \Sigma \otimes I_T \equiv \Sigma_*$. It is well known that GLS using $\Sigma_*$ is the same as OLS, and Baltagi (1989) proves this using condition (B) of Theorem 1: $\Sigma_* X_* = X_* B$ where $B = \Sigma \otimes I_K$. Now suppose that $\Omega_* = \Omega \otimes I_T$ is another possible variance matrix. Then GLS based on $\Sigma_*$ is the same as GLS based on $\Omega_*$. This is probably obvious, since both must equal OLS, but it can also be proved using Theorem 2. Define $\Psi = \Omega\Sigma^{-1}$ and $\Psi_* = \Psi \otimes I_T = \Omega_*\Sigma_*^{-1}$. Then
\[
\Omega_*\Sigma_*^{-1}X_* = \Psi_* X_* = \Psi \otimes X = X_* B, \tag{3.6}
\]
where $B = \Psi \otimes I_K$, so condition (B1) of Theorem 2 holds and the two GLS estimators are the same.

3.4 Equivalence of GLS and the Amemiya-Cragg Estimator

We now consider the two estimators
\[
\tilde{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y \quad\text{and}\quad
\check{\beta} = [X'H(H'\Sigma H)^{-1}H'X]^{-1}X'H(H'\Sigma H)^{-1}H'y. \tag{3.7}
\]
The first of these is the GLS estimator, while the second is the estimator of Amemiya (1983) and Cragg (1983). We will provide conditions (on $X$, $H$ and $\Sigma$) such that these two estimators are equal. If $\Sigma$ is the correct error variance matrix, then the GLS estimator is efficient relative to the Amemiya-Cragg estimator, and the Amemiya-Cragg estimator is efficient relative to OLS if $X$ is contained in $H$. However, again, our results are purely algebraic results for the numerical equivalence of the two estimators.

THEOREM 3. Suppose that $\Sigma$, $X'\Sigma^{-1}X$ and $H'\Sigma H$ are positive definite and that $H'X$ has full column rank. Then the following statements are equivalent.

(A) $X'H(H'\Sigma H)^{-1}H'X = X'\Sigma^{-1}X$

(B) $\Sigma^{-1}X = HB$ for some $B$ [or, $X = \Sigma HB$ for some $B$]

(C) $[X'H(H'\Sigma H)^{-1}H'X]^{-1}X'H(H'\Sigma H)^{-1}H' = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}$

(D) $X = SA$ for some $A$, where the columns of $S$ are the eigenvectors corresponding to the nonzero eigenvalues of $\Sigma H(H'\Sigma H)^{-1}H'$

(E) $Z'\Sigma^{-1}X = 0$ for any $Z$ such that $Z'H = 0$ [or, $Z'X = 0$ for any $Z$ such that $Z'\Sigma H = 0$]

(F') $\Sigma = X\Gamma X' + Q\Theta Q'$ for some nonsingular $\Gamma$, some $\Theta$, some $Q$ and some $B$ (of dimension $L \times K$, where $H$ is $T \times L$) such that $X'HB$ is nonsingular and $Q'HB = 0$

The proofs of these results are given in Chapter 4.

Chapter 4

Proofs of Propositions and Theorems

4.1 Proofs of Propositions in Chapter 1

This section provides proofs of the propositions in Chapter 1.

Proof of Proposition 1

(1) Consistency.
\[
\hat{\beta}_{OLS} = (X'X)^{-1}X'Y
= \beta + \Big( N^{-1}\sum_{i=1}^{N} x_i'x_i \Big)^{-1}\Big( N^{-1}\sum_{i=1}^{N} x_i'u_i \Big)
= \beta + \Big[ \operatorname{plim}\Big( N^{-1}\sum_{i=1}^{N} x_i'x_i \Big)^{-1} + o_p(1) \Big]\Big[ \operatorname{plim}\Big( N^{-1}\sum_{i=1}^{N} x_i'u_i \Big) + o_p(1) \Big].
\]
Under Assumption 2, since $\{(x_i, u_i)\}$ is a mixing sequence, $x_i'u_i$ and $x_i'x_i$ are also mixing.
Chapter 4

Proofs of Propositions and Theorems

4.1 Proofs of Propositions in Chapter 1

This section provides proofs of the propositions in Chapter 1.

Proof of Proposition 1

(1) Consistency.

\[
\hat{\beta}_{OLS} = (X'X)^{-1}X'Y = \beta + \Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big)^{-1}\Big(N^{-1}\sum_{i=1}^N x_i'u_i\Big) = \beta + \Big[\operatorname{plim}\Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big)^{-1} + o_p(1)\Big]\Big[\operatorname{plim}\Big(N^{-1}\sum_{i=1}^N x_i'u_i\Big) + o_p(1)\Big].
\]

Under Assumption 2, since $\{(x_i, u_i)\}$ is a mixing sequence, $x_i'u_i$ and $x_i'x_i$ are also mixing. By the law of large numbers for mixing sequences (Arbia 2006, page 70),

\[
N^{-1}\sum_{i=1}^N x_i'u_i \xrightarrow{p} E\Big(N^{-1}\sum_{i=1}^N x_i'u_i\Big), \qquad N^{-1}\sum_{i=1}^N x_i'x_i \xrightarrow{p} E\Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big).
\]

Under Assumptions 3 and 4,

\[
E\Big(N^{-1}\sum_{i=1}^N x_i'u_i\Big) = N^{-1}\sum_{i=1}^N E(x_i'u_i) = 0, \qquad E\Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big) \to A_1,
\]

and therefore $\operatorname{plim}\big(N^{-1}\sum_{i} x_i'x_i\big)^{-1} = A_1^{-1}$. Thus $\hat{\beta}_{OLS} \xrightarrow{p} \beta + A_1^{-1}\cdot 0 = \beta$.

(2) Asymptotic normality.

\[
\sqrt{N}\big(\hat{\beta}_{OLS} - \beta\big) = \Big(N^{-1}\sum_{i=1}^N x_i'x_i\Big)^{-1}\Big(N^{-1/2}\sum_{i=1}^N x_i'u_i\Big).
\]

Under Assumption 3, $E\big(N^{-1}\sum_i x_i'u_i\big) = 0$. Under Assumption 5, $\operatorname{Var}\big(N^{-1/2}\sum_i x_i'u_i\big) = B_N \to B$. Let

\[
Z_N \equiv \Big[\operatorname{Var}\Big(N^{-1/2}\sum_{i=1}^N x_i'u_i\Big)\Big]^{-1/2}\Big(N^{-1/2}\sum_{i=1}^N x_i'u_i\Big).
\]

Since $x_i'u_i$ is mixing, by the central limit theorem of Wooldridge and White (1988) for mixing sequences, $Z_N \xrightarrow{d} N(0, I)$. Thus $N^{-1/2}\sum_{i=1}^N x_i'u_i \xrightarrow{d} N(0, B)$, and therefore

\[
\sqrt{N}\big(\hat{\beta}_{OLS} - \beta\big) \xrightarrow{d} N\big(0, A_1^{-1}BA_1^{-1}\big).
\]

Proof of Proposition 2

(1) Consistency. Since $\hat{\lambda} \xrightarrow{p} \lambda$, the estimator $\hat{\Omega}_N \equiv \Omega_N(\hat{\lambda})$ is consistent, and thus $N^{-1}X'\hat{\Omega}_N^{-1}X - N^{-1}X'\Omega_N^{-1}X \xrightarrow{p} 0$ and $N^{-1}X'\hat{\Omega}_N^{-1}u - N^{-1}X'\Omega_N^{-1}u \xrightarrow{p} 0$. Then

\[
\hat{\beta}_{FGLS} = (X'\hat{\Omega}_N^{-1}X)^{-1}X'\hat{\Omega}_N^{-1}Y = \beta + \big(N^{-1}X'\hat{\Omega}_N^{-1}X\big)^{-1}\big(N^{-1}X'\hat{\Omega}_N^{-1}u\big) = \beta + \big[\operatorname{plim}\big(N^{-1}X'\Omega_N^{-1}X\big)^{-1} + o_p(1)\big]\big[\operatorname{plim}\big(N^{-1}X'\Omega_N^{-1}u\big) + o_p(1)\big].
\]

For any symmetric positive definite matrix $\Omega_N$,

\[
N^{-1}X'\Omega_N^{-1}X \xrightarrow{p} E\big(N^{-1}X'\Omega_N^{-1}X\big) \to B_2, \qquad \operatorname{plim}\big(N^{-1}X'\Omega_N^{-1}X\big)^{-1} = B_2^{-1}.
\]

Under Assumption 6, $N^{-1}X'\Omega_N^{-1}u \xrightarrow{p} E\big(N^{-1}X'\Omega_N^{-1}u\big) = 0$. Thus $\hat{\beta}_{FGLS} = \beta + \big[B_2^{-1} + o_p(1)\big]\cdot o_p(1) \xrightarrow{p} \beta$.

(2) Asymptotic normality.

\[
\sqrt{N}\big(\hat{\beta}_{FGLS} - \beta\big) = \big(N^{-1}X'\hat{\Omega}_N^{-1}X\big)^{-1}\big(N^{-1/2}X'\hat{\Omega}_N^{-1}u\big) = \big(N^{-1}X'\Omega_N^{-1}X\big)^{-1}\big(N^{-1/2}X'\Omega_N^{-1}u\big) + o_p(1).
\]

Under Assumption 6, $E\big(N^{-1}X'\Omega_N^{-1}u\big) = 0$. Under Assumption 7,

\[
\operatorname{Var}\big(N^{-1/2}X'\Omega_N^{-1}u\big) = E\big(N^{-1}X'\Omega_N^{-1}X\big) \to B_2.
\]

Let $P_N$ denote the standardized version of $N^{-1/2}X'\Omega_N^{-1}u$; by the central limit theorem of Wooldridge and White (1988) for mixing sequences, $P_N \xrightarrow{d} N(0, I)$. Thus $N^{-1/2}X'\Omega_N^{-1}u \xrightarrow{d} N(0, B_2)$, and

\[
\sqrt{N}\big(\hat{\beta}_{FGLS} - \beta\big) \xrightarrow{d} N\big(0, B_2^{-1}B_2B_2^{-1}\big) = N\big(0, B_2^{-1}\big).
\]
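To make the sandwich asymptotics in Propositions 1 and 2 concrete, here is a small Monte Carlo sketch for the OLS case. It is only illustrative: it uses independent draws, so the estimated $B$ contains no spatial cross-terms, whereas the propositions cover mixing spatial data. The sample size, seed, and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, reps = 400, 2, 2000
beta = np.array([1.0, -0.5])
tstats = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((N, K))
    u = rng.standard_normal(N)
    y = X @ beta + u
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    A1 = X.T @ X / N                                    # estimates A_1
    Bh = (X * e[:, None]).T @ (X * e[:, None]) / N      # estimates B (no cross-terms here)
    V = np.linalg.inv(A1) @ Bh @ np.linalg.inv(A1) / N  # A_1^-1 B A_1^-1 / N
    tstats[r] = (b[0] - beta[0]) / np.sqrt(V[0, 0])
print(np.mean(np.abs(tstats) > 1.96))                   # rejection rate close to 0.05
```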
Proof of Proposition 3

To prove that $\hat{\beta}_{FPGLS}$ is consistent and asymptotically normal, we first show that $\hat{\beta}_{PGLS}$ is consistent and asymptotically normal, and then show that $\hat{\beta}_{FPGLS}$ and $\hat{\beta}_{PGLS}$ are asymptotically equivalent.

(1) Consistency.

\[
\hat{\beta}_{PGLS} = \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}y_g\Big) = \beta + \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g\Big).
\]

Under Assumptions P3-P5, since $\{(x_i, u_i)\}$ is a mixing sequence, $X_g'\Lambda_g^{-1}X_g$ and $X_g'\Lambda_g^{-1}u_g$ are also mixing sequences. By the law of large numbers for mixing sequences (Theorem 3.57 in White),

\[
G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g \xrightarrow{p} G^{-1}\sum_{g=1}^G Q_g \to Q, \qquad Q_g \equiv E\big(X_g'\Lambda_g^{-1}X_g\big),
\]
\[
G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g \xrightarrow{p} E\big(X_g'\Lambda_g^{-1}u_g\big) = 0.
\]

Thus $\hat{\beta}_{PGLS} = \beta + \big[Q^{-1} + o_p(1)\big]\cdot o_p(1) \xrightarrow{p} \beta$.

(2) Asymptotic normality.

\[
\sqrt{G}\big(\hat{\beta}_{PGLS} - \beta\big) = \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g\Big).
\]

By the central limit theorem for mixing sequences (Wooldridge and White 1988),

\[
G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g \xrightarrow{d} N(0, S), \qquad \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1} = Q^{-1} + o_p(1).
\]

Thus $\sqrt{G}\big(\hat{\beta}_{PGLS} - \beta\big) \xrightarrow{d} N\big(0, Q^{-1}SQ^{-1}\big)$.

(3) Asymptotic equivalence of $\hat{\beta}_{FPGLS}$ and $\hat{\beta}_{PGLS}$. Let $\hat{\Lambda}_g \equiv \Lambda_g(\hat{\lambda})$ and write the two estimators as

\[
\hat{\beta}_{FPGLS} = \Big(\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g\Big)^{-1}\Big(\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}y_g\Big), \qquad \hat{\beta}_{PGLS} = \Big(\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(\sum_{g=1}^G X_g'\Lambda_g^{-1}y_g\Big).
\]

To prove asymptotic equivalence we need the following two conditions to hold; the procedure follows Schmidt (1971):

\[
G^{-1}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g - G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g = o_p(1),
\]
\[
G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g - G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g = o_p(1).
\]

Lemma 1. $G^{-1}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g - G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g = o_p(1)$.

Proof: Since $\hat{\lambda} \xrightarrow{p} \lambda$ and $\Lambda_g^{-1}$ is a continuous function of $\lambda$, we have $\hat{\Lambda}_g^{-1} \xrightarrow{p} \Lambda_g^{-1}$. Because $G^{-1}\sum_{g}\big(X_g'\hat{\Lambda}_g^{-1}X_g - Q_g\big) = o_p(1)$ and $G^{-1}\sum_{g}\big(X_g'\Lambda_g^{-1}X_g - Q_g\big) = o_p(1)$, subtracting the two displays gives the result.

Lemma 2. $G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g - G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g = o_p(1)$.

By an argument similar to Lemma 1, $G^{-1}\sum_{g} X_g'\hat{\Lambda}_g^{-1}u_g \xrightarrow{p} 0$. Moreover, writing $\hat{\Lambda}_g^{-1} = \Lambda_g^{-1} + o_p(1)$,

\[
\hat{S}_G \equiv \operatorname{Var}\Big(G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g\Big) = G^{-1}\sum_{g=1}^G\sum_{h=1}^G E\big(X_g'\hat{\Lambda}_g^{-1}u_gu_h'\hat{\Lambda}_h^{-1}X_h\big),
\]

and each term expands as

\[
E\big(X_g'\hat{\Lambda}_g^{-1}u_gu_h'\hat{\Lambda}_h^{-1}X_h\big) = E\big(X_g'\Lambda_g^{-1}u_gu_h'\Lambda_h^{-1}X_h\big) + o(1),
\]

so $\lim_{G\to\infty}\hat{S}_G = \lim_{G\to\infty}S_G = S$. By the central limit theorem for mixing sequences in Wooldridge and White (1988),

\[
G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g \xrightarrow{d} N(0, S),
\]

which is the same limit distribution as that of $G^{-1/2}\sum_{g} X_g'\Lambda_g^{-1}u_g$. Thus the second condition holds.

Finally, we can write

\[
\sqrt{G}\big(\hat{\beta}_{FPGLS} - \beta\big) = \Big(G^{-1}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}u_g\Big) = \Big(G^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}X_g\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G X_g'\Lambda_g^{-1}u_g\Big) + o_p(1) = \sqrt{G}\big(\hat{\beta}_{PGLS} - \beta\big) + o_p(1),
\]

so $\sqrt{G}\big(\hat{\beta}_{FPGLS} - \hat{\beta}_{PGLS}\big) = o_p(1)$. Therefore $\hat{\beta}_{FPGLS}$ and $\hat{\beta}_{PGLS}$ are asymptotically equivalent.
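A minimal sketch of the PGLS estimator from Proposition 3 follows, assuming group-level data are already split into lists of arrays. The group sizes, working covariances, and equicorrelated structure in the usage example are hypothetical choices for illustration, not the dissertation's simulation design.

```python
import numpy as np

def pgls(X_groups, y_groups, Lambda_groups):
    # Pseudo-GLS: correlation is modeled inside each group via Lambda_g
    # and ignored across groups (the "tapered" covariance).
    K = X_groups[0].shape[1]
    XtLX, XtLy = np.zeros((K, K)), np.zeros(K)
    for Xg, yg, Lg in zip(X_groups, y_groups, Lambda_groups):
        LiX = np.linalg.solve(Lg, Xg)          # Lambda_g^{-1} X_g
        XtLX += Xg.T @ LiX
        XtLy += LiX.T @ yg
    return np.linalg.solve(XtLX, XtLy)

# Hypothetical usage with three groups of unequal size and
# equicorrelated within-group working covariances:
rng = np.random.default_rng(5)
Xs = [rng.standard_normal((n, 2)) for n in (8, 10, 12)]
ys = [X @ np.array([1.0, -0.5]) + rng.standard_normal(X.shape[0]) for X in Xs]
Ls = [np.eye(X.shape[0]) + 0.3 * np.ones((X.shape[0], X.shape[0])) for X in Xs]
print(pgls(Xs, ys, Ls))
```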
Proof of Proposition 4

As stated in Section 1.4, the HAC estimator that is robust to groupwise spatial correlation and misspecification, as in equation (1.26), is

\[
\widehat{\operatorname{Avar}}_{rob}\big(\hat{\beta}_{FPGLS}\big) = \Big(\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g\Big)^{-1}\Big(\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\hat{\Lambda}_g^{-1}\tilde{u}_g\tilde{u}_h'\hat{\Lambda}_h^{-1}X_h\Big)\Big(\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}X_g\Big)^{-1}.
\]

Define

\[
\hat{S} \equiv G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\hat{\Lambda}_g^{-1}\tilde{u}_g\tilde{u}_h'\hat{\Lambda}_h^{-1}X_h = G^{-1}\sum_{g=1}^G X_g'\hat{\Lambda}_g^{-1}\tilde{u}_g\tilde{u}_g'\hat{\Lambda}_g^{-1}X_g + G^{-1}\sum_{g=1}^G\sum_{h\neq g} d_{gh}^p\, X_g'\hat{\Lambda}_g^{-1}\tilde{u}_g\tilde{u}_h'\hat{\Lambda}_h^{-1}X_h,
\]

and define the following expressions:

\[
S^p = G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\Lambda_g^{-1}u_gu_h'\Lambda_h^{-1}X_h, \tag{4.1}
\]
\[
S_0^p = E\Big(G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\Lambda_g^{-1}u_gu_h'\Lambda_h^{-1}X_h\Big), \tag{4.2}
\]
\[
S_0 = E\Big(G^{-1}\sum_{g=1}^G\sum_{h=1}^G X_g'\Lambda_g^{-1}u_gu_h'\Lambda_h^{-1}X_h\Big). \tag{4.3}
\]

Then

\[
\big\|\hat{S} - S\big\| \le \big\|\hat{S} - S^p\big\| + \big\|S^p - S_0^p\big\| + \big\|S_0^p - S_0\big\| + \big\|S_0 - S\big\|. \tag{4.4}
\]

To prove Proposition 4 we need to show that each of the four terms on the right-hand side is $o_p(1)$. Since $\operatorname{plim}\hat{\lambda} = \lambda$, $\hat{\Lambda}_g \xrightarrow{p} \Lambda_g$ as $G \to \infty$. By Assumption P6(b), $\|S_0 - S\| = o_p(1)$. Rewrite the other three terms as follows:

\[
\hat{S} - S^p = G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\, X_g'\Lambda_g(\lambda)^{-1}\big(\tilde{u}_g\tilde{u}_h' - u_gu_h'\big)\Lambda_h(\lambda)^{-1}X_h + o_p(1), \tag{4.5}
\]
\[
S^p - S_0^p = G^{-1}\sum_{g=1}^G\sum_{h=1}^G d_{gh}^p\Big[X_g'\Lambda_g(\lambda)^{-1}u_gu_h'\Lambda_h(\lambda)^{-1}X_h - E\big(X_g'\Lambda_g(\lambda)^{-1}u_gu_h'\Lambda_h(\lambda)^{-1}X_h\big)\Big], \tag{4.6}
\]
\[
S_0^p - S_0 = E\Big(G^{-1}\sum_{g=1}^G\sum_{h=1}^G \big(d_{gh}^p - 1\big)X_g'\Lambda_g(\lambda)^{-1}u_gu_h'\Lambda_h(\lambda)^{-1}X_h\Big). \tag{4.7}
\]

First, we prove that the first term is $o_p(1)$. Write

\[
\tilde{u}_g\tilde{u}_h' = u_gu_h' - u_g\big(\hat{\beta}_{FPGLS} - \beta\big)'X_h' - X_g\big(\hat{\beta}_{FPGLS} - \beta\big)u_h' + X_g\big(\hat{\beta}_{FPGLS} - \beta\big)\big(\hat{\beta}_{FPGLS} - \beta\big)'X_h'. \tag{4.8}
\]

It suffices to show that the averages of the last three terms converge in probability to zero. The average of the vec of the second term can be written as $G^{-1}\sum_{g}\sum_{h}(X_h \otimes u_g)\big(\hat{\beta}_{FPGLS} - \beta\big)$, which is $o_p(1)$ because $\operatorname{plim}\big(\hat{\beta}_{FPGLS} - \beta\big) = 0$ and $G^{-1}\sum_{g}\sum_{h} X_h \otimes u_g \xrightarrow{p} 0$. The third term is the transpose of the second. The average of the last term can be written as $G^{-1}\sum_{g}\sum_{h}(X_h \otimes X_g)\operatorname{vec}\big[(\hat{\beta}_{FPGLS} - \beta)(\hat{\beta}_{FPGLS} - \beta)'\big]$, and $\operatorname{vec}\big[(\hat{\beta}_{FPGLS} - \beta)(\hat{\beta}_{FPGLS} - \beta)'\big] = o_p(1)$. Assuming $X_h$ and $X_g$ have finite second moments, each element of $G^{-1}\sum_{g}\sum_{h}(X_h \otimes X_g)$ is $O_p(1)$, and since $O_p(1)\cdot o_p(1) = o_p(1)$, the last term is $o_p(1)$. Thus $\tilde{u}_g\tilde{u}_h' - u_gu_h' = o_p(1)$, and, further assuming each element of $\Lambda_h$ is finite, each element of the first term is $o_p(1)$. The remaining terms can be shown to be $o_p(1)$ by similar arguments.
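A minimal code sketch of the robust variance estimator in Proposition 4 is given below. The Bartlett-type taper over between-group centroid distances is an assumed choice standing in for the weights $d_{gh}^p$; the dissertation's actual kernel and distance measure may differ, and all inputs are hypothetical.

```python
import numpy as np

def pgls_hac_avar(X_groups, u_groups, Lambda_groups, centroids, bandwidth):
    # Sandwich Avar(beta_FPGLS) with kernel weights over group pairs,
    # mirroring the formula above: bread^-1 * S * bread^-1.
    K = X_groups[0].shape[1]
    G = len(X_groups)
    Z = np.zeros((G, K))                 # Z_g = X_g' Lambda_g^-1 u_g (residual scores)
    bread = np.zeros((K, K))             # sum_g X_g' Lambda_g^-1 X_g
    for g, (Xg, ug, Lg) in enumerate(zip(X_groups, u_groups, Lambda_groups)):
        LiX = np.linalg.solve(Lg, Xg)
        Z[g] = LiX.T @ ug
        bread += Xg.T @ LiX
    S = np.zeros((K, K))
    for g in range(G):
        for h in range(G):
            d = np.linalg.norm(centroids[g] - centroids[h])
            w = max(0.0, 1.0 - d / bandwidth)   # Bartlett taper (assumed kernel)
            S += w * np.outer(Z[g], Z[h])
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ S @ bread_inv
```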
Proof of Proposition 5

By Newey (1991), a sufficient condition is that $L_G$ is stochastically equicontinuous. A mean value expansion gives

\[
L_G = \sum_{g=1}^G l_g(\beta, \lambda) = \sum_{g=1}^G l_g(\beta_0, \lambda_0) + \sum_{g=1}^G s_g(\ddot{\beta}, \ddot{\lambda})'\binom{\beta - \beta_0}{\lambda - \lambda_0},
\]

where $\ddot{\beta}$ lies between $\beta_0$ and $\beta$, and $\ddot{\lambda}$ lies between $\lambda_0$ and $\lambda$. Then

\[
\sup_\theta\big|L_G(\theta) - L_G(\theta_0)\big| = \sup_\theta\Big|\sum_{g=1}^G s_g(\ddot{\beta}, \ddot{\lambda})'\binom{\beta - \beta_0}{\lambda - \lambda_0}\Big| \le \sup_\theta\Big\|\sum_{g=1}^G s_g(\ddot{\beta}, \ddot{\lambda})\Big\|\cdot\Big\|\binom{\beta - \beta_0}{\lambda - \lambda_0}\Big\| < \infty.
\]

Thus $L_G(\theta)$ is stochastically equicontinuous, and the PQMLE is consistent.

The $(K + p) \times 1$ score of the log likelihood of group $g$ is

\[
s_g(\beta, \lambda) \equiv \binom{\nabla_\beta l_g(\beta, \lambda)}{\nabla_\lambda l_g(\beta, \lambda)},
\]

where

\[
\nabla_\beta l_g(\beta, \lambda) = X_g'\Lambda_g^{-1}(\lambda)\big(y_g - X_g\beta\big)
\]

and

\[
\nabla_\lambda l_g(\beta, \lambda) = -\tfrac{1}{2}\nabla_\lambda\Lambda_g(\lambda)'\operatorname{vec}\big(\Lambda_g^{-1}(\lambda)\big) + \tfrac{1}{2}\nabla_\lambda\Lambda_g(\lambda)'\operatorname{vec}\big(\Lambda_g^{-1}(\lambda)u_gu_g'\Lambda_g^{-1}(\lambda)\big) = \tfrac{1}{2}\nabla_\lambda\Lambda_g(\lambda)'\big[\Lambda_g^{-1}(\lambda)\otimes\Lambda_g^{-1}(\lambda)\big]\operatorname{vec}\big(u_gu_g' - \Lambda_g(\lambda)\big).
\]

Setting $\sum_{g=1}^G s_g(\beta, \lambda) = 0$ yields

\[
\hat{\beta}_{PQMLE} = \Big(\sum_{g=1}^G X_g'\Lambda_g^{-1}(\hat{\lambda})X_g\Big)^{-1}\sum_{g=1}^G X_g'\Lambda_g^{-1}(\hat{\lambda})y_g,
\]

while $\hat{\lambda}_{PQMLE}$ may or may not have a closed functional form.

The groupwise QMLE is asymptotically normally distributed. From $S_G(\hat{\theta}) = \sum_{g=1}^G s_g(\hat{\theta}) = 0$ and a mean value expansion about $\theta_0$,

\[
\sqrt{G}\big(\hat{\theta} - \theta_0\big) = \Big(-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta})\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0)\Big).
\]

By the ergodic theorem for mixing fields, $-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta}) \xrightarrow{p} A$, where $A = -E[H_g(\theta_0)]$. By the central limit theorem in Jenish and Prucha (2007),

\[
G^{-1/2}\sum_{g=1}^G s_g(\theta_0) \xrightarrow{d} N(0, B), \qquad B = \lim_{G\to\infty}\operatorname{Var}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0)\Big).
\]

Thus $\sqrt{G}\big(\hat{\theta} - \theta_0\big) \xrightarrow{d} N\big(0, A^{-1}BA^{-1}\big)$.

4.2 Proofs of Propositions in Chapter 2

This section provides proofs of the propositions in Chapter 2.

Proof of Proposition 6

The proof follows Theorem 2 in Jenish and Prucha (2009). A sufficient condition for consistency is that the objective function satisfies a uniform law of large numbers (ULLN); to prove the ULLN, we need a pointwise law of large numbers (LLN). Assumption 4 delivers a pointwise LLN, so $Q_N(\theta) - \bar{Q}_N(\theta) \to 0$ a.s., where $\bar{Q}_N(\theta) = E[Q_N(\theta)]$. Since $Q_N(\theta)$ is the sample average of the $q_i$, it is also stochastically equicontinuous. Then $\sup_\theta\big|Q_N(\theta) - \bar{Q}_N(\theta)\big| \to 0$ a.s. as $N \to \infty$, which is the uniform law of large numbers.

Proof of Proposition 7

The proof is similar to that of Proposition 6. Instead of writing the objective function as a sum over individuals, we write it as a sum over groups. The pointwise LLN becomes $Q_G(\theta) - \bar{Q}_G(\theta) \to 0$ a.s., where $\bar{Q}_G(\theta) = E[Q_G(\theta)]$. Together with the stochastic equicontinuity condition, $\sup_\theta\big|Q_G(\theta) - \bar{Q}_G(\theta)\big| \to 0$ a.s. as $G \to \infty$.

Proof of Proposition 8

From $S_G(\check{\theta}) = \sum_{g=1}^G s_g(\check{\theta}) = 0$ and a mean value expansion,

\[
\sqrt{G}\big(\check{\theta} - \theta_0\big) = \Big(-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta})\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0)\Big).
\]

By the uniform law of large numbers for mixing fields, $-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta}) \xrightarrow{p} A_0$, where $A_0 = -E[H_g(\theta_0)]$. By the central limit theorem in Jenish and Prucha (2009),

\[
G^{-1/2}\sum_{g=1}^G s_g(\theta_0) \xrightarrow{d} N(0, B_0), \qquad B_0 = \lim_{G\to\infty}\operatorname{Var}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0)\Big).
\]

Thus $\sqrt{G}\big(\check{\theta} - \theta_0\big) \xrightarrow{d} N\big(0, A_0^{-1}B_0A_0^{-1}\big)$.

Proof of Proposition 9

A mean value expansion gives

\[
\sqrt{G}\big(\hat{\theta}_G - \theta_0\big) = \Big(-G^{-1}\sum_{g=1}^G H_g(\ddot{\theta}; \hat{\gamma})\Big)^{-1}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0; \hat{\gamma})\Big) = A_0^{-1}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0; \hat{\gamma})\Big) + o_p(1),
\]

where $A_0 = \lim_{G\to\infty}-G^{-1}\sum_{g=1}^G E[H_g(\theta_0, \gamma^*)]$ by the law of large numbers for mixing sequences. Expanding in $\gamma$,

\[
G^{-1/2}\sum_{g=1}^G s_g(\theta_0; \hat{\gamma}) = G^{-1/2}\sum_{g=1}^G s_g(\theta_0; \gamma^*) + F_0\sqrt{G}\big(\hat{\gamma} - \gamma^*\big) + o_p(1),
\]

where $F_0$ is the $P \times J$ matrix $F_0 = \lim_{G\to\infty}G^{-1}\sum_{g=1}^G E\big[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)\big]$. Assume $\sqrt{G}(\hat{\gamma} - \gamma^*)$ can be written in the following form:

\[
\sqrt{G}\big(\hat{\gamma} - \gamma^*\big) = G^{-1/2}\sum_{g=1}^G r_g(\gamma^*) + o_p(1),
\]

where $r_g(\gamma^*)$ is a $J \times 1$ vector with $E[r_g(\gamma^*)] = 0$. Then

\[
\sqrt{G}\big(\hat{\theta} - \theta_0\big) = A_0^{-1}\Big(G^{-1/2}\sum_{g=1}^G e_g(\theta_0; \gamma^*)\Big) + o_p(1), \qquad e_g(\theta_0; \gamma^*) \equiv s_g(\theta_0; \gamma^*) + F_0\,r_g(\gamma^*),
\]

and defining

\[
D_0 \equiv \lim_{G\to\infty}G^{-1}\sum_{g=1}^G\sum_{h=1}^G E\big[e_g(\theta_0; \gamma^*)e_h(\theta_0; \gamma^*)'\big],
\]

we obtain $\operatorname{Avar}\sqrt{G}\big(\hat{\theta} - \theta_0\big) = A_0^{-1}D_0A_0^{-1}$.

In the case of QMLE, $m_g(\theta)$ and $\nabla_\theta m_g(\theta)$ do not depend on $\gamma$; when we take derivatives with respect to $\gamma$, only $W_g^{-1}$ is affected, so $\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)$ is a linear combination of the elements of $y_g - m_g(\theta_0)$. Since $E(y_g - m_g \mid w, D) = 0$, we have $E\big[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*) \mid w, D\big] = 0$, and by the law of iterated expectations $E\big[\nabla_\gamma s_g(w_g, \theta_0; \gamma^*)\big] = 0$. Thus $F_0 = 0$ and

\[
\operatorname{Avar}\sqrt{G}\big(\hat{\theta} - \theta_0\big) = A_0^{-1}B_0A_0^{-1}, \qquad B_0 = \lim_{G\to\infty}\operatorname{Var}\Big(G^{-1/2}\sum_{g=1}^G s_g(\theta_0, \gamma^*)\Big).
\]

Proof of Proposition 10

Let $\hat{m}_g = m_g(\hat{\theta})$ and $\nabla\hat{m}_g = \nabla_\theta m_g(\hat{\theta})$. The proof consists of two parts: (1) $\hat{A} \xrightarrow{p} A_0$; (2) $\hat{B}_2 \xrightarrow{p} B_0$.

Part (1):

\[
\hat{A} = G^{-1}\sum_{g=1}^G \nabla\hat{m}_g'\hat{W}_g^{-1}\nabla\hat{m}_g \xrightarrow{p} \lim_{G\to\infty}G^{-1}\sum_{g=1}^G E\big[\nabla m_g'W_g^{-1}\nabla m_g\big] = A_0.
\]

Part (2): We have

\[
\hat{B}_2 = G^{-1}\sum_{g=1}^G\sum_{h=1}^G k(d_{gh})\,\nabla\hat{m}_g'\hat{W}_g^{-1}\hat{u}_g\hat{u}_h'\hat{W}_h^{-1}\nabla\hat{m}_h
\]

and

\[
B_0 = \lim_{G\to\infty}G^{-1}\sum_{g=1}^G E\big[\nabla m_g'W_g^{-1}u_gu_g'W_g^{-1}\nabla m_g\big] + \lim_{G\to\infty}G^{-1}\sum_{g=1}^G\sum_{h\neq g}E\big[\nabla m_g'W_g^{-1}u_gu_h'W_h^{-1}\nabla m_h\big].
\]

Define $Z_g = \nabla m_g'W_g^{-1}u_g$ and $\hat{Z}_g = \nabla\hat{m}_g'\hat{W}_g^{-1}\hat{u}_g$, so that, for a given $G$,

\[
\hat{B}_2 = G^{-1}\sum_{g=1}^G\hat{Z}_g\hat{Z}_g' + G^{-1}\sum_{g=1}^G\sum_{h\neq g}k(d_{gh})\hat{Z}_g\hat{Z}_h',
\]

and define

\[
B_0^k = G^{-1}\sum_{g=1}^G E\big(Z_gZ_g'\big) + G^{-1}\sum_{g=1}^G\sum_{h\neq g}k(d_{gh})E\big(Z_gZ_h'\big), \qquad B^k = G^{-1}\sum_{g=1}^G Z_gZ_g' + G^{-1}\sum_{g=1}^G\sum_{h\neq g}k(d_{gh})Z_gZ_h'.
\]

Then

\[
\big\|\hat{B}_2 - B_0\big\| \le \big\|\hat{B}_2 - B^k\big\| + \big\|B^k - B_0^k\big\| + \big\|B_0^k - B_0\big\| \le \big\|\hat{B}_2 - B^k\big\| + G^{-1}\Big\|\sum_{g=1}^G\big(Z_gZ_g' - E(Z_gZ_g')\big)\Big\| + G^{-1}\Big\|\sum_{g=1}^G\sum_{h\neq g}k(d_{gh})\big(Z_gZ_h' - E(Z_gZ_h')\big)\Big\| + G^{-1}\Big\|\sum_{g=1}^G\sum_{h\neq g}\big(k(d_{gh}) - 1\big)E\big(Z_gZ_h'\big)\Big\|. \tag{4.9}
\]

Each term on the right-hand side can then be shown to go to zero. Defining $p_{gh} = Z_gZ_h' - E(Z_gZ_h')$ and applying a mean value expansion to $\hat{Z}_g\hat{Z}_h' - Z_gZ_h'$ makes the first term vanish as well, which completes the proof.
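The sandwich form $A_0^{-1}B_0A_0^{-1}$ that runs through Propositions 8-10 can be computed mechanically from per-group scores and Hessians. The sketch below is a generic illustration under assumed inputs (`scores`, `hessians`, and `pair_weights` are hypothetical arrays that a GEE/PQMLE routine would produce); with identity pair weights it reduces to the usual cluster-robust form, while Proposition 10's estimator corresponds to weights $k(d_{gh})$.

```python
import numpy as np

def sandwich_avar(scores, hessians, pair_weights=None):
    # scores: (G, P) array of s_g(theta_hat); hessians: (G, P, P) array of H_g(theta_hat).
    # Groups are NOT assumed independent: B sums score cross-products over group pairs.
    G = scores.shape[0]
    A = -hessians.mean(axis=0)                 # -G^-1 sum_g H_g, estimating A_0
    if pair_weights is None:
        pair_weights = np.eye(G)               # own-group products only (cluster style)
    B = scores.T @ pair_weights @ scores / G   # G^-1 sum_g sum_h w_gh s_g s_h'
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv                     # A_0^-1 B_0 A_0^-1
```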
4.3 Proofs of Theorems in Chapter 3

This section provides proofs of the theorems in Chapter 3.

Proof of Theorem 1

Amemiya (1985, pp. 182-183) showed the equivalence of conditions (A), (B), (C), (D) and (E). It is trivial that (F) implies (E) and that (F') implies (F). We still need to establish that one of (A)-(E) implies (F').

Proof that (D) implies (F'): Condition (D) says that $X = F_1A$ for some nonsingular $A$, where the eigenvectors of $\Sigma$ are $F = [F_1, F_2]$ and the eigenvalues form the diagonal matrix

\[
\Lambda = \begin{pmatrix}\Lambda_1 & 0\\ 0 & \Lambda_2\end{pmatrix}.
\]

We can choose $F$ such that $F'F = I$, and therefore $F_1'F_2 = 0$. Then the spectral representation of $\Sigma$ gives $\Sigma = F\Lambda F' = F_1\Lambda_1F_1' + F_2\Lambda_2F_2'$. Since $F_1 = XA^{-1}$,

\[
\Sigma = X\big(A^{-1}\Lambda_1A'^{-1}\big)X' + F_2\Lambda_2F_2'.
\]

Finally, $X'F_2 = A'F_1'F_2 = 0$, so (F') holds.

Because the result is somewhat counterintuitive, we also give a proof that (F) implies (F').

Proof that (F) implies (F'): Suppose that (F) holds, so $\Sigma = X\Gamma X' + Q\Theta Q' + c^2I$ with $Q'X = 0$. Define $P_X = X(X'X)^{-1}X'$ and $M_X = I - P_X$. Since $Q'X = 0$, the columns of $Q$ lie in the null space of $X'$, so $Q = M_XB$ for some $B$. Also, there exists a matrix $A$, of dimension $T \times (T - K)$, such that $M_X = AA'$, $A'A = I_{T-K}$ and $A'X = 0$. (See, e.g., Theil (1971, pp. 203-209).) So we can write

\[
\Sigma = X\Gamma X' + Q\Theta Q' + c^2(P_X + M_X) = X\big[\Gamma + c^2(X'X)^{-1}\big]X' + M_XB\Theta B'M_X + c^2M_X = X\big[\Gamma + c^2(X'X)^{-1}\big]X' + A\big[c^2I + A'B\Theta B'A\big]A'.
\]

Since $A'X = 0$, condition (F') holds.
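The following snippet numerically confirms the classical content of condition (F): when $\Sigma = X\Gamma X' + Q\Theta Q' + c^2I$ with $Q'X = 0$, OLS and GLS coincide. It is a sketch added here for illustration; the dimensions, seed, and the value $c^2 = 2$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
T, K = 50, 3
X = rng.standard_normal((T, K))

# Q with Q'X = 0: project random columns off the column space of X.
M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
Q = M @ rng.standard_normal((T, 4))

G1 = rng.standard_normal((K, K)); Gamma = G1 @ G1.T
G2 = rng.standard_normal((4, 4)); Theta = G2 @ G2.T
Sigma = X @ Gamma @ X.T + Q @ Theta @ Q.T + 2.0 * np.eye(T)   # condition (F), c^2 = 2

y = rng.standard_normal(T)
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
SiX = np.linalg.solve(Sigma, X)
b_gls = np.linalg.solve(X.T @ SiX, SiX.T @ y)
print(np.max(np.abs(b_ols - b_gls)))   # ~1e-13: OLS = GLS under (F)
```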
Proof of Theorem 2

The model is $y = X\beta + \varepsilon$, and $\tilde{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$ is GLS assuming that the variance matrix of $\varepsilon$ is $\Sigma$. Now define $X_* = \Omega^{-1/2}X$, $y_* = \Omega^{-1/2}y$ and $\Sigma_* = \Omega^{-1/2}\Sigma\Omega^{-1/2}$ (so $\Sigma_*^{-1} = \Omega^{1/2}\Sigma^{-1}\Omega^{1/2}$). Then $\ddot{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ is OLS of $y_*$ on $X_*$, and $\tilde{\beta} = (X_*'\Sigma_*^{-1}X_*)^{-1}X_*'\Sigma_*^{-1}y_*$ is GLS of $y_*$ on $X_*$ using the error variance matrix $\Sigma_*$. So Amemiya's results apply if we simply replace his $X$ by $X_*$ and his $\Sigma$ by $\Sigma_*$.

(A) $(X_*'X_*)^{-1}X_*'\Sigma_*X_*(X_*'X_*)^{-1} = (X_*'\Sigma_*^{-1}X_*)^{-1}$ becomes

\[
\big(X'\Omega^{-1}X\big)^{-1}X'\Omega^{-1}\Sigma\Omega^{-1}X\big(X'\Omega^{-1}X\big)^{-1} = \big(X'\Sigma^{-1}X\big)^{-1},
\]

which is condition (A2) of Theorem 2.

(B) $\Sigma_*X_* = X_*B$ for some nonsingular $B$ becomes $\Omega^{-1/2}\Sigma\Omega^{-1}X = \Omega^{-1/2}XB$, i.e. $\Sigma\Omega^{-1}X = XB$ for some nonsingular $B$, which is condition (B2) of Theorem 2.

(C) $(X_*'X_*)^{-1}X_*' = (X_*'\Sigma_*^{-1}X_*)^{-1}X_*'\Sigma_*^{-1}$ becomes, after postmultiplying both sides by $\Omega^{-1/2}$,

\[
\big(X'\Omega^{-1}X\big)^{-1}X'\Omega^{-1} = \big(X'\Sigma^{-1}X\big)^{-1}X'\Sigma^{-1},
\]

which is condition (C) of Theorem 2.

(D) $X_* = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Sigma_*$: this is $\Omega^{-1/2}X = HA$ for some nonsingular $A$, where the columns of $H$ are $K$ eigenvectors of $\Omega^{-1/2}\Sigma\Omega^{-1/2}$, which is condition (D2) of Theorem 2.

(E) $Z'\Sigma_*X_* = 0$ for any $Z$ such that $Z'X_* = 0$: this is $Z'\Omega^{-1/2}\Sigma\Omega^{-1}X = 0$ for any $Z$ such that $Z'\Omega^{-1/2}X = 0$. Defining $Z_* = \Omega^{-1/2}Z$, the condition becomes $Z_*'\Sigma\Omega^{-1}X = 0$ for any $Z_*$ such that $Z_*'X = 0$, which is condition (E2) of Theorem 2.

(F) $\Sigma_* = X_*\Gamma X_*' + Q\Theta Q' + c^2I$ with $Q'X_* = 0$: substituting $\Sigma_* = \Omega^{-1/2}\Sigma\Omega^{-1/2}$ and premultiplying and postmultiplying by $\Omega^{1/2}$ gives

\[
\Sigma = X\Gamma X' + \Omega^{1/2}Q\Theta Q'\Omega^{1/2} + c^2\Omega \quad \text{with } Q'\Omega^{-1/2}X = 0.
\]

Now let $Q_* = \Omega^{-1/2}Q$. Then we obtain $\Sigma = X\Gamma X' + \Omega Q_*\Theta Q_*'\Omega + c^2\Omega$ with $Q_*'X = 0$, which is condition (F2) of Theorem 2.

(F') The proof is essentially the same as for condition (F).

Proof of Theorem 3

Proof that (A) and (C) are equivalent: Define

\[
\Delta = \big[X'H(H'\Sigma H)^{-1}H'X\big]^{-1}X'H(H'\Sigma H)^{-1}H' - (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}.
\]

Then

\[
\Delta\Sigma\Delta' = \big[X'H(H'\Sigma H)^{-1}H'X\big]^{-1} - (X'\Sigma^{-1}X)^{-1}.
\]

So $\Delta = 0$ (which is condition (C)) is equivalent to $\big[X'H(H'\Sigma H)^{-1}H'X\big]^{-1} = (X'\Sigma^{-1}X)^{-1}$, which is condition (A).

Proof that (A) and (B) are equivalent:

\[
X'\Sigma^{-1}X - X'H(H'\Sigma H)^{-1}H'X = X'\Sigma^{-1/2}\big[I - P_{\Sigma^{1/2}H}\big]\Sigma^{-1/2}X,
\]

where $P_{\Sigma^{1/2}H}$ is the orthogonal projection onto the column space of $\Sigma^{1/2}H$. This equals zero if and only if $\Sigma^{-1/2}X$ is in the column space of $\Sigma^{1/2}H$, that is, $\Sigma^{-1/2}X = \Sigma^{1/2}HB$ for some $B$, or $\Sigma^{-1}X = HB$, or $X = \Sigma HB$, which is condition (B).

Proof that (B) and (E) are equivalent: The proof that (B) implies (E) is trivial. To show that (E) implies (B), suppose that $Z'H = 0$. Then $Z = M_HS$, where $M_H = I - P_H = I - H(H'H)^{-1}H'$. Then (E) says that $Z'\Sigma^{-1}X = 0$, or $S'M_H\Sigma^{-1}X = 0$. This is true for any $S$, so it must be true that $M_H\Sigma^{-1}X = 0$, that is, $\Sigma^{-1}X$ is in the column space of $H$, or $\Sigma^{-1}X = HB$ for some $B$. This is condition (B).

Proof that (B) and (D) are equivalent: Note that

\[
\big[\Sigma H(H'\Sigma H)^{-1}H'\big]\cdot\Sigma H = \Sigma H\cdot I.
\]

That is, the columns of $\Sigma H$ are eigenvectors of $\Sigma H(H'\Sigma H)^{-1}H'$, and the corresponding eigenvalues equal one. So, if (B) holds, $X = \Sigma HB$ and $X$ is a linear combination of these eigenvectors, so (D) holds. Conversely, if (D) holds, then $X = (\Sigma H)A$ and (B) holds.

Proof that (F') implies (B): Suppose (F') holds, so that $\Sigma = X\Gamma X' + Q\Theta Q'$ and $Q$ satisfies $Q'HB = 0$ for some $B$ such that $X'HB$ is nonsingular. Then, for that $B$,

\[
\Sigma HB = X\Gamma X'HB + Q\Theta Q'HB = X\Gamma X'HB,
\]

and so

\[
X = \Sigma H\big[B(X'HB)^{-1}\Gamma^{-1}\big].
\]

So condition (B) holds.

Proof that (B) implies (F'): Suppose (B) holds, so that $X = \Sigma HB$. Let $\Gamma = (B'H'\Sigma HB)^{-1}$, where the inverse must exist because $X$ has rank $K$, so $HB$ must have rank $K$. Then $\Sigma = X\Gamma X' + C$, where

\[
C = \Sigma - \Sigma HB(B'H'\Sigma HB)^{-1}B'H'\Sigma = \Sigma^{1/2}\big[I - P_{\Sigma^{1/2}HB}\big]\Sigma^{1/2} = QQ', \qquad Q = \Sigma^{1/2}\big[I - P_{\Sigma^{1/2}HB}\big].
\]

Then

\[
Q'HB = \big[I - P_{\Sigma^{1/2}HB}\big]\Sigma^{1/2}HB = 0.
\]

So (F') holds.
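The construction in the proof that (B) implies (F') can also be checked numerically. The sketch below (with arbitrary dimensions and draws, added here purely as an illustration) builds $\Gamma$ and $Q$ exactly as in the proof and verifies that $\Sigma = X\Gamma X' + QQ'$ and $Q'HB = 0$.

```python
import numpy as np

rng = np.random.default_rng(3)
T, L, K = 40, 6, 3
H = rng.standard_normal((T, L))
A = rng.standard_normal((T, T))
Sigma = A @ A.T + T * np.eye(T)
B = rng.standard_normal((L, K))
X = Sigma @ H @ B                                   # condition (B)

w, V = np.linalg.eigh(Sigma)
S12 = V @ np.diag(np.sqrt(w)) @ V.T                 # Sigma^{1/2}
M = S12 @ H @ B
P = M @ np.linalg.solve(M.T @ M, M.T)               # projection onto col(Sigma^{1/2} H B)
Q = S12 @ (np.eye(T) - P)
Gamma = np.linalg.inv(B.T @ H.T @ Sigma @ H @ B)

print(np.max(np.abs(Sigma - (X @ Gamma @ X.T + Q @ Q.T))))  # ~ 0
print(np.max(np.abs(Q.T @ H @ B)))                           # ~ 0
```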
BIBLIOGRAPHY

[1] Amemiya, T. (1983), "Partially Generalized Least Squares and Two-Stage Least Squares Estimators," Journal of Econometrics, 23, 275-283.

[2] Amemiya, T. (1985), Advanced Econometrics, Harvard University Press, Cambridge, MA.

[3] Anderson, T.W. (1971), The Statistical Analysis of Time Series, John Wiley and Sons, New York.

[4] Anselin, L. (2010), "Thirty Years of Spatial Econometrics," Papers in Regional Science, 89, 3-25.

[5] Arbia, G. (2006), Spatial Econometrics: Statistical Foundations and Applications to Regional Convergence, Springer.

[6] Baltagi, B.H. (1989), "Applications of a Necessary and Sufficient Condition for OLS to be BLUE," Statistics and Probability Letters, 8, 457-461.

[7] Balestra, P. (1970), "On the Efficiency of Ordinary Least Squares in Regression Analysis," Journal of the American Statistical Association, 65, 1330-1337.

[8] Bester, A.C., Conley, T.G., Hansen, C.B. and Vogelsang, T.J. (2008), "Fixed-b Asymptotics for Spatially Dependent Robust Nonparametric Covariance Matrix Estimators," working paper.

[9] Bloom, N., Schankerman, M. and Van Reenen, J. (2012), "Identifying Technology Spillovers and Product Market Rivalry," NBER working paper.

[10] Bollerslev, T. and Wooldridge, J.M. (1992), "Quasi-Maximum Likelihood Estimation and Inference in Dynamic Models with Time-Varying Covariances," Econometric Reviews, 11, 143-172.

[11] Bolthausen, E. (1982), "On the Central Limit Theorem for Stationary Mixing Random Fields," The Annals of Probability, 10, 1047-1050.

[12] Brown, J.R., Ivkovic, Z., Smith, P.A. and Weisbenner, S. (2008), "Neighbors Matter: Causal Community Effects and Stock Market Participation," The Journal of Finance, 63, 1509-1531.

[13] Christakos, G. (1987), "On the Problem of Permissible Covariance and Variogram Models," Water Resources Research, 20, 251-265.

[14] Cragg, J. (1983), "More Efficient Estimation in the Presence of Heteroscedasticity of Unknown Form," Econometrica, 51, 751-763.

[15] Conley, T.G. (1999), "GMM Estimation with Cross Sectional Dependence," Journal of Econometrics, 92, 1-45.

[16] Cressie, N.A.C. (1993), Statistics for Spatial Data, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York.

[17] Du, J. and Zhang, H. (2007), "Covariance Tapering in Spatial Statistics," working paper.

[18] Dubin, R.A. (1988), "Estimation of Regression Coefficients in the Presence of Spatially Autocorrelated Error Terms," The Review of Economics and Statistics, 70, 466-474.

[19] Fiebig, D.G., Bartels, R. and Krämer, W. (1996), "The Frisch-Waugh Theorem and Generalized Least Squares," Econometric Reviews, 15, 431-443.

[20] Gneiting, T. (2002), "Compactly Supported Correlation Functions," Journal of Multivariate Analysis, 83, 493-508.

[21] Gourieroux, C. and Monfort, A. (1980), "Sufficient Linear Structures: Econometric Applications," Econometrica, 48, 1083-1097.

[22] Gourieroux, C., Monfort, A. and Trognon, A. (1984), "Pseudo Maximum Likelihood Methods: Applications to Poisson Models," Econometrica, 52, 701-720.

[23] Gross, J. and Puntanen, S. (2000), "Remark on Pseudo-Generalized Least Squares," Econometric Reviews, 19, 131-134.

[24] Kelejian, H.H. and Prucha, I.R. (2007), "HAC Estimation in a Spatial Framework," Journal of Econometrics, 140, 131-154.

[25] Kruskal, W. (1968), "When Are Gauss-Markov and Least Squares Estimators Identical? A Coordinate-Free Approach," Annals of Mathematical Statistics, 39, 70-75.

[26] Jenish, N. and Prucha, I.R. (2009), "Central Limit Theorems and Uniform Laws of Large Numbers for Arrays of Random Fields," Journal of Econometrics, 150, 86-98.

[27] Kaufman, C., Schervish, M. and Nychka, D. (2008), "Covariance Tapering for Likelihood-Based Estimation in Large Spatial Datasets," Journal of the American Statistical Association, 103, 1545-1555.

[28] Lee, L.F. (2004), "Asymptotic Distributions of Quasi-Maximum Likelihood Estimators for Spatial Autoregressive Models," Econometrica, 72, 1899-1925.

[29] Matthew, T. (1983), "Linear Estimation with an Incorrect Dispersion Matrix in Linear Models with a Common Linear Part," Journal of the American Statistical Association, 78, 468-471.

[30] McAleer, M. (1992), "Efficient Estimation: The Rao-Zyskind Condition, Kruskal's Theorem and Ordinary Least Squares," Economic Record, 68, 65-72.

[31] McElroy, F.W. (1967), "A Necessary and Sufficient Condition that Ordinary Least Squares Estimators Be Best Linear Unbiased," Journal of the American Statistical Association, 63, 1302-1304.
[32] Newey, W.K. and West, K.D. (1987), "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703-708.

[33] Prentice, R.L. (1988), "Correlated Binary Regression with Covariates Specific to Each Binary Observation," Biometrics, 44, 1033-1048.

[34] Puntanen, S. and Styan, G.P.H. (1989), "The Equality of the Ordinary Least Squares Estimator and the Best Linear Unbiased Estimator," The American Statistician, 43, 153-161.

[35] Rao, C.R. (1967), "Least Squares Theory Using an Estimated Dispersion Matrix and Its Application to Measurement of Signals," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 355-372, University of California Press, Berkeley, CA.

[36] Theil, H. (1971), Principles of Econometrics, John Wiley and Sons, New York.

[37] Wang, H., Iglesias, E.M. and Wooldridge, J.M. (2012), "Partial Maximum Likelihood Estimation of Spatial Probit Models," Journal of Econometrics.

[38] Wooldridge, J.M. (2011), Econometric Analysis of Cross Section and Panel Data, MIT Press, Cambridge, MA.

[39] Zeger, S.L., Liang, K.Y. and Albert, P.S. (1988), "Models for Longitudinal Data: A Generalized Estimating Equation Approach," Biometrics, 44, 1049-1060.

[40] Zyskind, G. (1962), "On Conditions for Equality of Best and Simple Linear Least Squares Estimators (abstract)," Annals of Mathematical Statistics, 33, 1502-1503.

[41] Zyskind, G. (1967), "On Canonical Forms, Non-Negative Covariance Matrices and Best and Simple Least Squares Linear Estimators in Linear Models," Annals of Mathematical Statistics, 38, 1092-1109.