ESSAYS ON HETEROGENEITY IN ECONOMETRIC MODELS By Shengwu Shang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics–Doctor of Philosophy 2013

ABSTRACT

ESSAYS ON HETEROGENEITY IN ECONOMETRIC MODELS By Shengwu Shang

The dissertation consists of three parts; its theme is dealing with heterogeneity in econometric models for positive response variables. The first part studies models with multiplicative heterogeneity for cross-sectional data; the multiplicative heterogeneity arises from transforming the log-linear model with additive heterogeneity. We introduce the notions of the Average Partial Effect (APE) and the Conditional APE (CAPE); estimators and their asymptotic distributions are proposed. To preserve the positivity of the unknown conditional expectation function of the unobserved heterogeneity, we borrow the idea of power-series approximation of an unknown function from Newey (1993, 1994) and develop the "exponential sieves" estimator for the CAPE suggested in Wooldridge (1992a).

The second part of the dissertation extends the results for the CAPE in Chapter 1 to panel data sets. First, using the models in Wooldridge (1999), we compare three main estimation methods for positive response variables – the fixed effects method for the log-linear model (LFE), Poisson Quasi-Maximum Likelihood (PQML), and the Generalized Method of Moments (GMM) – by Monte Carlo simulation and with a real data set. It is not surprising that the LFE estimator is inconsistent when PQML is consistent; however, we also find circumstances where both the LFE and PQML estimators are consistent and LFE is more efficient. In this regard, we introduce GMM to improve the efficiency of the PQML estimator while maintaining its consistency; this also provides a solution to the problem raised in Wooldridge (1999). From the simulation results, we find that GMM can reduce the standard error of the PQML estimator by almost half. Second, an "exponential sieves" estimator for the CAPE is proposed in the panel data setting; the result automatically extends the results in Ai and Norton (2008) from the cross-sectional setting to panel data models. Third, we apply GMM to a US domestic airline data set, and the results show that GMM improves efficiency by 10% compared with PQML.

The third part investigates the effect of spatial correlation for fractional response variables. Using MEAP data for Michigan in the 2009/2010 school year, we revisit the effect of school financing reform on school performance studied by Papke (2005, 2008) and Papke and Wooldridge (2008); we use both the level math test pass rate (linear case) and its log odds ratio (nonlinear case) as the dependent variable in OLS and GLS regressions. Conley (1999) spatial-dependence-corrected standard errors are calculated, and we find that the statistical significance of some regressors hinges on the choice of cutoff points; however, there exist other factors whose statistical significance is robust to that choice. In this way we shed some light on how to pick the right window size. Moreover, by transforming the log odds ratio back to the level rate, we find that the spending effect estimated from the linear model is about 4–6% higher than that from the nonlinear one.

This thesis is dedicated to my mom, Jinxiang Mao, and my dad, Chuchuan Shang.

ACKNOWLEDGEMENTS

Understanding econometrics, in my opinion, is like peeling an onion. Each time you peel away one layer you discover that another awaits.
Each time you think you reach some understanding of intuition, estimation and application, you later discover that much more needs to be understood. I fear that I will never reach the core of the onion. But this is what makes the subject of econometrics so exciting.

I would never have been able to finish my dissertation without guidance, suggestions and support from several people. First, my deepest appreciation goes to my advisor, Professor Jeffrey M. Wooldridge. Despite many challenges and struggles during my studies at Michigan State University, I have never regretted working with him. He accepted me as one of his students when I felt lost and hopeless four years ago and helped me get through the most difficult time in my life. He provided his perspective and helpful comments at every stage, and his contributions to this dissertation were invaluable. I could never have reached the depths of this dissertation without his mentorship and insightful advice. In my mind, he is a genuine scientist; he sets an example of a world-class researcher and teacher for me to model myself after in my career.

I also want to express my deep appreciation to Professor Peter Schmidt for his guidance in my research. I did not take any courses from him, which I really regret; but his office door was always open for me whenever I had questions. His comments such as "econometric research is to tell a story" were really eye-opening for me. He amazed me with his thorough understanding of theories and his ability to express them in simple words. In addition, I cannot say enough thanks to Professor Steve Woodbury, who was always sharp in his comments and open-minded, providing advice that greatly improved this dissertation. My special gratitude goes to Professor Lijian Yang, who has been like a big brother to me, always ready to help throughout my entire graduate career.

Besides them, I would like to sincerely thank the following faculty and staff members at MSU: Leslie E. Papke, Jack Meyer, Byron Brown, Timothy Vogelsang, Richard Baillie, Todd Elder, Tony Doblas Madrid, Kun Ho Kim, Dorothy Pathak, Pramod Pathak, Ed Mahoney, Ashton Shortridge, Andrew O. Finley, Michael Frazier, R. V. Ramamoorthi and Margaret Lynch. I have been very fortunate to meet them at MSU, and they all helped me at various stages of my doctoral study. While I cannot mention everyone, I would like to thank my fellow graduate students, particularly Jeff Brown, Quentin Brummet, Paul Burkander, Dooyeon Cho, Sanders Chang, Simon Chang, Hon Foong Cheah, Yu-Wei Chu, Myoung-Jin Keay, Do Won Kwak, Jinyoung Lee, Cuicui Lu, Tan Lu, Kritkorn Nawakitphaitoon, Ilya Rahkovsky, Seunghwa Rho, Monthien Satimanon, Valentin Verdier, Wei-Siang Wang, and Yali Wang. We had numerous discussions, and I appreciate their thoughtful comments and enjoyable conversation.

I thank my parents, Jinxiang Mao and Chuchuan Shang, for their unwavering love and support throughout this endeavor; without their encouragement, I could never have accomplished this. I would like to express my gratitude to my elder brother Shengwen Shang, elder sister Shenghua Shang and younger brother Shenggang Shang, who took over my share of the responsibility for our parents, which made it possible for me to stay single-minded on my studies for so long. Last but not least, I would like to thank my wife, who was always there cheering me up and standing by me through the good and bad times.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1  ON ESTIMATING PARTIAL EFFECTS AFTER RETRANSFORMATION
1.1 Introduction
1.2 The Model and Partial Effects
1.3 Estimating the APEs
1.4 Estimating Conditional APEs
1.5 Application to Treatment Effects
1.6 Concluding Remarks

CHAPTER 2  ON THE USE OF EXPONENTIAL VERSUS LOG-LINEAR MODELS FOR PANEL DATA
2.1 Introduction
2.2 Model 1
2.3 PQML vs. LFE: a Simulation Approach
  2.3.1 Simulation 1
  2.3.2 Simulation 2
  2.3.3 Simulation 3
2.4 Estimating APE: Exponential Sieve Estimator
  2.4.1 Estimating APE
  2.4.2 Estimating CAPE
2.5 More Efficient Estimator: GMM
  2.5.1 Model 2
  2.5.2 GMM
  2.5.3 Simulation 3
  2.5.4 Optimal IV Estimator
2.6 Empirical Application
2.7 Conclusions

CHAPTER 3  A SPATIAL ANALYSIS OF SPENDING EFFECT ON MEAP
3.1 Introduction
3.2 Model
3.3 Data
  3.3.1 Data Characteristics and Sources
  3.3.2 Spatial Dependence Measurement
3.4 Linear Model
  3.4.1 Ordinary Least Squares (OLS)
  3.4.2 Generalized Least Squares (GLS)
3.5 Nonlinear Model
  3.5.1 Regression of Log Odds Ratio
  3.5.2 Estimating the APE for Level Rates
3.6 Conclusions

APPENDICES
Appendix A
Appendix B
Appendix C

BIBLIOGRAPHY
LIST OF TABLES

Table B.1 Estimation results: xtreg
Table B.2 Simulation results where Vit has Gamma distribution
Table B.3 Simulation results where Vit has Gamma distribution (continued)
Table B.4 Special case 2
Table B.5 Simulation results where Vit has log-normal distribution
Table B.6 Vit = exp(a·X²it + b·Xit·zit) with N = 500
Table B.7 Vit = exp(a·X²it + b·Xit·zit) with N = 1000
Table B.8 Simulation results with Vit = exp(−.125X̄i + .5X̄i·zit)
Table B.9 Simulation results for four estimators
Table B.10 Summary statistics
Table B.11 Dependent variable, passen
Table C.1 Summary statistics
Table C.2 OLS regression, dependent variable = math4
Table C.3 OLS regression with Conley S.E., dependent variable = math4
Table C.4 OLS regression with Conley S.E., dependent variable = math4
Table C.5 OLS regression with Conley S.E., dependent variable = math4
Table C.6 OLS regression with Conley S.E., dependent variable = math4
Table C.7 OLS regression, dependent variable = log(math4/(1 − math4))
Table C.8 OLS regression with Conley S.E., dependent variable = log(math4/(1 − math4))
Table C.9 OLS regression with Conley S.E., dependent variable = log(math4/(1 − math4))
Table C.10 OLS regression with Conley S.E., dependent variable = log(math4/(1 − math4))
Table C.11 OLS regression with Conley S.E. in nonlinear model
Table C.12 APEs with bootstrap S.E. in nonlinear model
Table C.13 QGLS with Conley S.E., dependent variable = math4
Table C.14 QGLS with Conley S.E., dependent variable = math4 in year 2010
Table C.15 SAR GLS, dependent variable = math4
Table C.16 SAR GLS, dependent variable = math4 in year 2010
Table C.17 SAR GLS, dependent variable = math4, contiguity
Table C.18 SAR GLS, dependent variable = math4, inverse distance
Table C.19 Summary of correlation
Table C.20 Summary of correlation (div)

LIST OF FIGURES

Figure 1.1 Picture of sieve estimator
Figure B.1 Bias of LFE and PQML with change of ρ
Figure B.2 Std. error of LFE and PQML with change of ρ
Figure B.3 Bias of LFE and PQML with change of ρ, N = 1000
Figure B.4 Std. error of LFE and PQML with change of ρ, N = 1000
Figure B.5 Histogram of passengers
Figure C.1 All school districts of Michigan in 2010
Figure C.2 All colleges of Michigan in 2010
Figure C.3 MEAP math pass rate for 4th graders of Michigan in 2010
Figure C.4 Colleges and math pass rate for 4th graders of Michigan in 2010
Figure C.5 Selection of 96 school districts
Figure C.6 Selection of 96 school districts with centroids
Figure C.7 Selection of 96 school districts with centroids in grids
Figure C.8 Conley coordinates of 96 school districts
Figure C.9 Histogram for all covariates
Figure C.10 APE w.r.t. average expenditure
Figure C.11 APE w.r.t. enroll
Figure C.12 APE w.r.t. lunch

Chapter 1

ON ESTIMATING PARTIAL EFFECTS AFTER RETRANSFORMATION

1.1 Introduction

Strictly positive response variables are very common in economics and other social sciences. Just a few examples include prices, populations, and firm sales. If Y > 0 is the response that we would like to explain, the most common approach for modeling Y is to use a linear model for its natural log, log(Y), and then estimate the linear model using an appropriate technique – usually ordinary least squares (OLS) or instrumental variables (IV).

There are at least two reasons modeling log(Y) may not be sufficient. First, one might wish to predict Y, not log(Y). When the prediction of Y is based on its expected value conditional on a vector of covariates – say, X – in general there is no way to recover E(Y|X) from E[log(Y)|X]. The difficulty in predicting Y given a model for log(Y) has long been recognized, and solutions are available under varying levels of assumptions. Duan (1983) covers the case where an additive error in the model for log(Y) is assumed independent of X, and this method is covered even in some introductory econometrics texts [for example, Wooldridge (2009, Chapter 6)]. Under distributional assumptions, such as assuming Y given X follows a lognormal distribution, parametric heteroskedasticity in Var[log(Y)|X] is easily allowed. More recently, Ai and Norton (2008) provide a semiparametric approach that produces consistent predictions under weak assumptions, although the approach they use allows the possibility that predictions of Y can be negative. Wooldridge (1992) proposes direct estimation of E(Y|X) via quasi-likelihood methods using flexible functional forms that ensure nonnegative predictions. Of course, nonparametric methods, of the type covered in Li and Racine (2007), can be used, too.

Related to the prediction issue is the calculation of partial effects. Wooldridge (1992) makes a case for basing partial effects on E(Y|X) (in cases where the elements of X are appropriately "exogenous"). Such an approach underlies the work by Ai and Norton (2008), who begin with a linear model for log(Y) but then employ nonparametric methods to recover an estimate of E(Y|X) – without placing further restrictions on D(Y|X), the conditional distribution of Y given X.
But basing partial effects on E(Y|X) (or even some other feature of the conditional distribution, such as the median) is not the only possibility. Lately, the notion of an average partial effect (APE) has become important in applied econometrics; see, for example, Wooldridge (2005, 2010). The APE is closely tied to Blundell and Powell's (2003) average structural function (ASF), which is defined by averaging out unobservables from a "structural" model. One potentially important implication of the ASF approach has largely gone unnoticed: the ASF approach can provide very different partial effects than those based on E(Y|X). A heteroskedastic probit example is provided by Wooldridge (2005): the partial effects based on E(Y|X) = P(Y = 1|X) and those based on the ASF need not even have the same sign, let alone similar magnitudes.

One reason the APE/ASF concept is appealing is that it can be easily applied to cases where explanatory variables should be treated as endogenous. Consequently, the focus on the ASF has led to a widely applicable class of control function estimators in nonlinear models – see Blundell and Powell (2003, 2004). Further, in a broad class of models, the sign of an APE and the underlying parameter on the covariate of interest are the same.

In this paper we highlight another useful feature of the APE approach: it provides justification for Duan-type retransformation estimators even when Duan's key assumption – independence between the underlying error U and the covariates X – is violated. We also consider the notion of a "conditional" average partial effect (CAPE) (for example, Wooldridge, 2004, 2005), which is considered more generally as the "local average response" in Altonji and Matzkin (2005). Interestingly, general consideration of CAPEs leads to essentially the same estimation problem described in Ai and Norton (2008), except that our approach here allows for endogenous explanatory variables. Moreover, when one restricts the nature of the conditioning set, simple strategies are available that do not require complicated nonparametric estimation.

After presenting the model and definitions of partial effects in Section 2, Section 3 discusses estimation of APEs – which, in the current setting, turns out to be nothing more than extending Duan's (1983) "smearing" estimate to more general settings. We consider the estimation of CAPEs in Section 4; Section 5 applies the results to treatment effects, and Section 6 contains a brief conclusion. Technical derivations are contained in an appendix.

1.2 The Model and Partial Effects

The setup we consider is a standard linear model with log(Y) as the dependent variable. Initially assume that, in the population,

log(Y) = Xβ + U,    (1.2.1)
E(U|X) = 0,    (1.2.2)

where X is a 1 × K vector of covariates with first element unity. We could consider nonlinear regression functions in place of Xβ, but retransformation methods are almost always applied when the transformed variable follows a linear-in-parameters model. As in Duan (1983), we could also consider other strictly monotonic transformations of Y, but the natural logarithm is by far the most popular. Given equation (1.2.1), we can write

Y = exp(Xβ + U) = exp(Xβ)exp(U).    (1.2.3)

Following the discussion in the introduction, we base the partial effects of interest on this equation because we are interested in how the Xj affect Y, not log(Y).

Ai and Norton (2008) focus on the conditional mean E(Y|X), which can be written generally as

E(Y|X) = exp(Xβ)E[exp(U)|X] ≡ exp(Xβ)r(X),    (1.2.4)

where r(X) ≡ E[exp(U)|X].
If Xj is a continuous variable, then the partial effect on µ(x) ≡ E(Y|X = x) is

∂µ(x)/∂xj = ∂E(Y|X = x)/∂xj = βj exp(xβ)r(x) + exp(xβ)∂r(x)/∂xj.    (1.2.5)

To estimate this partial effect we need to estimate r(·) in addition to β – the problem considered by Ai and Norton (2008). We reconsider this problem from a different perspective in Section 4. Equation (1.2.5) clearly shows that the partial effect of xj on E(Y|X = x) need not even have the same sign as βj, and the magnitude of the partial effect can depend on r(·) in a rather complicated way. In the special case E[exp(U)|X] = exp(Xδ), E(Y|X) = exp[X(β + δ)], and so the partial effect of xj on E(Y|X = x) has the same sign as βj + δj.

There is another definition of partial effects that is easier to summarize and, as it turns out, also easier to estimate. Following Blundell and Powell (2003), the average structural function (ASF) is defined as

ASF(x) = E[exp(xβ)exp(U)] = exp(xβ)E[exp(U)] ≡ η exp(xβ),    (1.2.6)

where η ≡ E[exp(U)]. The definition of the ASF is related to the notion of a "Marshallian structural function" defined in Heckman (2001). In defining the ASF it is important to see that the covariates are held fixed at specified values, with U averaged out. Once the ASF is obtained, we can see how this function changes as the xj change.

As discussed in Wooldridge (2005), the ASF is closely tied to the notion of an average partial effect (APE). From (1.2.3), the partial effect of Xj on Y is

∂Y/∂Xj = βj exp(Xβ)exp(U).

To get the APE we average U out of the partial effect for given covariate values x. In other words, the APE at x is

APEj(x) = E[βj exp(xβ)exp(U)] = η[βj exp(xβ)],

and this is easily seen to be the partial derivative of the ASF. A similar argument works if we use discrete changes in xj rather than a calculus approximation.

An attractive feature of the ASF is that its definition does not require us to take a stand on possible dependence between U and X. The definition is unchanged even if U and X are correlated. When U and X are independent, µ(x) = ASF(x), but generally these quantities differ when D(U|X) depends on X – even if X is exogenous in the sense of assumption (1.2.2). The potential difference between average partial effects and partial effects based on E(Y|X = x) has been pointed out by Wooldridge (2005), who uses a probit model with heteroskedasticity to illustrate that when U and X are not independent, the APEs and partial effects based on E(Y|X) will be different – perhaps very different. Unfortunately, it does not seem possible to resolve the issue of how one should compute partial effects. The choice between E(Y|X = x) and ASF(x) is essentially one of preference. The main contribution of the current paper is to show that it is easy to estimate the ASF in the retransformation context without taking a stand on D(U|X).

One shortcoming of APEs is that the heterogeneity is averaged across the entire population whereas the covariates are set at specific values. Altonji and Matzkin (2005) argue that finding partial effects conditional on specific outcomes x is generally more useful – what they call a "local average response" (LAR). Wooldridge (2005) also discusses partial effects averaged over a subset of the population, based on observable characteristics. As a general statement, suppose Y = g(X, U), so that, for a continuous covariate, the partial effect at X = x is ∂g(x, U)/∂xj.
Now suppose we wish to compute the expected value of this partial effect not across the entire distribution of U, but for the subset of the population with X = x. Then we can define a conditional average partial effect (CAPE) as

CAPEj(x) = E[∂g(x, U)/∂xj | X = x].

In the exponential case, the CAPE can be expressed as

CAPEj(x) = βj exp(xβ)E[exp(U)|X = x] ≡ βj exp(xβ)r(x).

Note that this is different from basing partial effects on E(Y|X = x), yet we need to estimate the same function, r(x). From an interpretation standpoint the CAPE has the convenient feature that it has the same sign as βj; we simply need to compute the scale factor, r(x), that multiplies βj exp(xβ). The function APEj(x) replaces r(x) with the mean value η = E[r(X)] = E[exp(U)]. As with the APE, the CAPE makes sense even if X includes endogenous elements. In fact, the CAPE includes as a special case average treatment effects for various subpopulations when treatment assignment is endogenous. (With binary treatments we would use changes, not derivatives.)

1.3 Estimating the APEs

The expression for ASF(x) makes it clear that a √N-consistent estimator of ASF(x) is available if √N-consistent estimators of β and η are available. By contrast, the dependence of µ(x) on the nonparametric function r(x) means that partial effects based on E(Y|X = x) are not generally estimable at the √N-rate (because nonparametric rates of convergence are slower – often much slower – than √N). Thus, we can estimate the ASF much more precisely than E(Y|X = x).

Because µ(x) and ASF(x) both depend on β, we first need to consistently estimate β. It is very common to use OLS as the estimator of β, even though it may not be asymptotically efficient under E(U|X) = 0. (For example, a weighted least squares estimator that attempts to exploit nonconstant Var(U|X) could be more efficient.) As is well known, the assumption E(U|X) = 0 does not substantively restrict r(x) ≡ E[exp(U)|X = x]: it can be virtually any positive function of x. Of course, if we suitably restrict D(U|X) then we can usually find r(x). A useful assumption is that U and X are independent, a case considered by Duan (1983); see also Wooldridge (2009, Section 6.4). Then

r(X) = E[exp(U)|X] = E[exp(U)] = η,    (1.3.1)

and it follows that

E(Y|X = x) = η exp(xβ) = ASF(x).    (1.3.2)

[If we specify the distribution of U, then we can sometimes write η in terms of higher moments of U. For example, if U ~ Normal(0, σ²) then η = exp(σ²/2); see Wooldridge (2009, Section 6.4).]

In the case considered by Duan (1983) – where U and X are assumed to be independent – estimation of η is straightforward. First, by the law of large numbers,

N^{-1} Σ_{i=1}^N exp(Ui) →p η.    (1.3.3)

The average is not an estimator because we do not observe the Ui. Instead, given a random sample {(Xi, Yi): i = 1, ..., N}, obtain β̂ from the OLS regression

log(Yi) on Xi, i = 1, ..., N,    (1.3.4)

and then let Ûi = log(Yi) − Xiβ̂ be the OLS residuals. Because β̂ →p β, it is not surprising that

η̂ = N^{-1} Σ_{i=1}^N exp(Ûi)    (1.3.5)

is generally consistent for η. Wooldridge (2010, Lemma 12.1) contains a general result that implies consistency under weak regularity conditions. Because U and X are independent, we estimate µ(x) and ASF(x) in exactly the same way:

µ̂(x) = η̂ exp(xβ̂) = ÂSF(x).    (1.3.6)

Wooldridge (2010, Problem 12.17) can be used to show that √N(η̂ − η) has a limiting normal distribution and to find its asymptotic variance. Below we consider the problem of estimating the joint asymptotic variance under very weak assumptions.
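To fix ideas, here is a minimal sketch in Python of the estimator in equations (1.3.4)–(1.3.6). The data-generating process – a heteroskedastic U, so that U and X are dependent even though E(U|X) = 0 holds – is our own illustrative assumption, not taken from the text:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N = 5000

# Illustrative DGP: Var(U|X) depends on X, so U and X are dependent,
# but E(U|X) = 0 still holds and OLS remains consistent for beta.
x = rng.normal(size=N)
u = rng.normal(size=N) * (0.5 + 0.4 * np.tanh(x))
logy = 1.0 + 0.3 * x + u                      # log(Y) = X*beta + U

X = sm.add_constant(x)
beta_hat = sm.OLS(logy, X).fit().params       # OLS regression (1.3.4)
u_hat = logy - X @ beta_hat                   # OLS residuals
eta_hat = np.exp(u_hat).mean()                # Duan smearing factor (1.3.5)

def asf(x0):
    """Estimated ASF(x) = eta_hat * exp(x*beta_hat), as in (1.3.6)."""
    return eta_hat * np.exp(beta_hat[0] + beta_hat[1] * x0)

# The APE of x at x0 is the derivative of the estimated ASF.
print(asf(0.5), beta_hat[1] * asf(0.5))
```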
For estimating the ASF, there is an important point about Duan's estimator in (1.3.6): it is a consistent estimator of η = E[exp(U)] even when U and X are dependent. In fact, the most we need to assume is

E(X'U) = 0,    (1.3.7)

as this is sufficient for OLS to consistently estimate β. Duan (1983) was interested in recovering E(Y|X), which is why he assumed independence between U and X; see also Abrevaya (2002), who obtained the asymptotic variance of the predictions under the assumption of independence. But when we view Duan's estimator as estimating the scale factor that appears in the ASF, the estimator is generally consistent provided β̂ is consistent for β. In settings where we intend X to be exogenous in (1.2.1), it suffices to assume (1.3.7) – in which case the OLS estimator from (1.3.4) is consistent but not necessarily unbiased.

In Section 2 we mentioned how the definition of the ASF is unchanged regardless of dependence between U and X. As can be seen from equation (1.2.6), we consistently estimate ASF(x) = η exp(xβ) provided we have consistent estimators of η and β, and (1.3.5) shows we just need a consistent estimator of β to consistently estimate η. Having some elements of X endogenous in (1.2.1), in the sense that Cov(X, U) ≠ 0, causes no problems for estimating ASF(x) provided we have suitable instrumental variables. In particular, suppose we have a 1 × L vector Z satisfying

E(Z'U) = 0
rank E(Z'X) = K    (1.3.8)
rank E(Z'Z) = L.

Under these assumptions, the 2SLS estimator (as well as other generalized method of moments estimators) is consistent for β; see, for example, Wooldridge (2010, Chapter 5). Then, we can let β̂ be the 2SLS estimator from

log(Yi) = Xiβ + Ui    (1.3.9)

using IVs Zi. The Ûi are now the 2SLS residuals, and η̂ is still computed from (1.3.5). Notice that it would be meaningless in this context to base partial effects on E(Y|X) or E(Y|X, Z), whereas the ASF can have a causal interpretation. We summarize the above as the following theorem.

Theorem 1.3.1. Under the assumptions in equation (1.3.8), allowing Cov(X, U) ≠ 0, let β̂ be the 2SLS estimator from equation (1.3.9) using IVs Zi and let η̂ be defined as in equation (1.3.5). Then

√N((β̂ − β)', η̂ − η)' →d N(0, Ω),

where Ω ≡ E[(Si', Qi)'(Si', Qi)], with Si and Qi defined in the appendix.

PROOF: See the appendix.

1.4 Estimating Conditional APEs

Estimation of CAPEs is more difficult due to the need to estimate the function r(x), which is the same problem faced by Ai and Norton (2008). The motivation for their general approach is straightforward. If we could observe the Ui then we could use nonparametric regression of exp(Ui) on Xi. Because we do not know β, we replace Ui with the OLS residuals Ûi and use exp(Ûi) as the dependent variable in a nonparametric regression. Because of technical complications, Ai and Norton propose linear series estimation, where exp(Ûi) is regressed on various functions of Xi. As in all nonparametric contexts, the rate of convergence of r̂(·) to r(·) is slower than √N, and much slower when the dimension of X is large. See Ai and Norton (2008) for details. From equation (1.2.4), we have:

r(x) ≡ E[Y/exp(Xβ) | X = x].    (1.4.1)

From now on, the focus is on how to estimate r(x). In the literature, there are at least two ways to estimate r(x). One is the traditional parametric method: we assume r(x) = g(xα), where g(·) is a real function satisfying certain conditions, and the remaining work is to estimate the parameter α.
If g(·) is linear in xα, then α can be estimated by classic ordinary least squares; if g(·) is nonlinear, then nonlinear least squares can be applied. For details, refer to Wooldridge (2010). The other way is the nonparametric method, which does not impose any parametric assumptions on r(x). For a comparison of the two methods, refer to a recent paper by Ackerberg et al. (2011).

Here we take the nonparametric approach and propose a sieve estimator for r(x); the idea of sieves is vividly explained by Figure 1.1.

[Figure 1.1: Picture of the sieve estimator – a triangle connecting the target r(x), its series approximation r∗(x), and the estimator r̂(x). For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.]

In Figure 1.1, r(x) is unknown and is our target; we approximate it with r∗(x). More approximating terms, the number of which is denoted M, lead to a closer approximation of r(x) by r∗(x); however, with more terms the error from estimating r∗(x) by r̂(x) increases. The triangle shows that we need to choose a balanced M. This can also be seen from the order of the mean squared error in Theorem 1.4.1. Theoretically, as long as r(x) is smooth enough, we can always find a reasonable M. Newey (1994) and Ai and Norton (2008) suggest a sample-based method, cross-validation, to decide M.

Note that r(x) > 0; positivity cannot be guaranteed if we follow the method in Ai and Norton (2008). At the end of his paper, Wooldridge (1992a) suggests "exponential sieves" to replace linear ones in order to span the unknown space composed of positive functions. Similarly, Hirano et al. (2003) propose a "logit series" estimator for probability distribution functions. Inspired by this previous work, we develop an "exponential sieve" estimator.

Let ||B|| = [trace(B'B)]^{1/2} be the Euclidean norm of a matrix B and ζ(M) = sup_{x∈Ξ} ||G^M(x)||; the power series G^M(x) we use here are the same as in Hirano et al. (2003). First, we list some assumptions:

Assumption 1: (Y1, X1), ..., (YN, XN) are i.i.d.; Var(Y|X) is bounded and bounded away from zero; the fourth moment of Y/exp(Xβ) − r(X) is bounded.

Assumption 2: (i) the smallest eigenvalue of E[G^M(X)'G^M(X)] is bounded away from zero uniformly in M; (ii) there is a sequence of constants ζ(M) and M(N) such that ζ(M)²M/N → 0 as N → ∞.

These assumptions are standard in the literature; since we use power series here, ζ0(M) = CM, and we use this equivalence many times in the derivations.

Assumption 3: If f: R^K → R is s times continuously differentiable and gM(x) = [1, x, ..., x^n]', M = n + 1 (note that gM(x) has powers in x at least up to n), there is an M-vector γM such that, for G^M(x) = AM gM(x) and on the compact set Ξ ⊆ R^K,

sup_{x∈Ξ} |f(x) − G^M(x)γM| < C1 n^{−s} ≤ C2 M^{−s}.    (1.4.2)

This assumption is used as a fact in Hirano et al. (2003), while Newey (1997) states it as an assumption. To ensure that the approximation of r(x) is positive, we first approximate the log of r(x):

sup_{x∈Ξ} |log(r(x)) − G^M(x)πM| < CM^{−s}.    (1.4.3)

So the exponential sieve estimator of r(x) is r̂(x) = exp(G^M(x)π̂M), where M is fixed and

π̂M = argmin_π Σ_{i=1}^N [Yi/exp(Xiβ̂) − exp(G^M(Xi)π)]².    (1.4.4)

For N → ∞, we have π̂M − π*M →p 0, where

π*M = argmin_π E{[Yi/exp(Xiβ) − exp(G^M(Xi)π)]²}.    (1.4.5)
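To make (1.4.4) concrete, here is a minimal sketch continuing the earlier Python example. A scalar covariate, a raw power-series basis in place of the orthogonalized G^M(x), and a fixed M are all illustrative simplifications of ours:

```python
from scipy.optimize import least_squares

def exp_sieve_fit(xvals, ratio, M):
    """Exponential sieve (1.4.4): fit r_hat(x) = exp(G(x) @ pi) to
    ratio_i = Y_i / exp(X_i beta_hat) by nonlinear least squares."""
    G = np.vander(xvals, M, increasing=True)      # basis 1, x, ..., x^{M-1}
    residual = lambda pi: np.exp(G @ pi) - ratio
    pi0 = np.zeros(M)
    pi0[0] = np.log(max(ratio.mean(), 1e-8))      # start from the constant fit
    pi_hat = least_squares(residual, pi0).x
    return lambda x0: np.exp(
        np.vander(np.atleast_1d(x0), M, increasing=True) @ pi_hat)

ratio = np.exp(logy) / np.exp(X @ beta_hat)       # Y_i / exp(X_i beta_hat)
r_hat = exp_sieve_fit(x, ratio, M=4)              # M chosen by cross-validation in practice
```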
Lemma 1: Suppose that: (1) the support Ξ of X is a compact subset of R^K; (2) r(x) is s times continuously differentiable, that is, r(x) ∈ C^s, with s ≥ 4; (3) r(x) is bounded away from zero; (4) the density of X is bounded away from zero on Ξ. Then:

|r(x) − exp(G^M(x)π*M)| = O(CM^{−s}).

PROOF: See appendix.

Lemma 1 corresponds to step 1 in the picture: as long as M is large enough, the deviation term vanishes, and with a higher M the convergence rate is faster.

Lemma 2: Suppose that the same four conditions as in Lemma 1 hold. In addition, suppose that: (5) M(N) is a sequence of values of M satisfying M(N) → ∞ and ζ(M(N))⁴/N → 0. Then

||π̂M(N) − π*M(N)|| = Op((M(N)/N)^{1/2}).

PROOF: See appendix.

Theorem 1.4.1. Given all the assumptions in Lemmas 1 and 2,

∫[r̂(x) − r(x)]² dF(x) = O(M/N + M^{−2s}).

PROOF: Note that

∫[r̂(x) − r(x)]² dF(x) = ∫[r(x) − exp(G^M(x)π*M) + exp(G^M(x)π*M) − exp(G^M(x)π̂M)]² dF(x)
≤ 2∫[r(x) − exp(G^M(x)π*M)]² dF(x) + 2∫[exp(G^M(x)π*M) − exp(G^M(x)π̂M)]² dF(x)
= O(M/N + M^{−2s}).

The last equality follows from Lemmas 1 and 2. From this theorem, we can see the two steps in the picture and the tradeoff from increasing M.

Let

ΣM ≡ E[G^M(Xi)'G^M(Xi)(Yi/exp(Xiβ) − r(Xi))² exp(2G^M(Xi)π*M)],
QM ≡ E[G^M(Xi)'G^M(Xi)exp(2G^M(Xi)π*M)],
VM(x) ≡ G^M(x)QM⁻¹ΣM QM⁻¹G^M(x)' exp(2G^M(x)π*M).

Lemma 3: Suppose that the same four conditions as in Lemma 1 hold. Then

√N VM(x)^{−1/2}(r̂(x) − r(x)) →d N(0, 1).    (1.4.6)

Note that VM(x)^{−1/2} is O(M^{−1/2}), so this result gives a convergence rate with upper bound Op((M/N)^{1/2}), which is slower than Op((1/N)^{1/2}); Ai and Norton (2008) find a similar result for the linear case. This result also coincides with many other results in the semiparametric and nonparametric literature.

The structure of the APE estimator is interesting: it is a scale factor times the estimator of the parameter of interest. The idea originates in the so-called average structural function of Blundell and Powell (2003), and Papke and Wooldridge (2008) advance it further to estimate the APE of the parameter of interest in a panel data probit model with and without endogenous explanatory variables. From Section 2, the CAPE can be expressed as

CAPEj(x) = βj exp(xβ)E[exp(U)|X = x] ≡ βj exp(xβ)r(x),

so the estimator of the CAPE follows as:

ĈAPEj(x) = β̂j exp(xβ̂)r̂(x) = β̂j exp(xβ̂)exp(G^M(x)π̂M).

Theorem 1.4.2. Given N^{1/2}M^{−(s+1)} → 0 as N → ∞ and βj ≠ 0, Assumptions 1–3, and all the assumptions in Lemmas 1 and 2, then

√N VM(x)^{−1/2}(ĈAPEj(x) − CAPEj(x)) →d N(0, V),

where V = βj² exp(2xβ).

PROOF: See appendix.

Note that Theorem 1.4.2 requires that at least one element of β is not zero: if β is zero completely, the model does not make any practical use; but it is possible that some elements of β are zero. If some element of β is zero, without loss of generality put it as βj = 0; then:

Corollary: Given all the assumptions in Theorem 1.4.2 except that βj = 0, then

√N(ĈAPEj(x) − CAPEj(x)) →d N(0, W),

where CAPEj(x) ≡ βj exp(xβ)r(x); for the detailed form of W, refer to the proof in the appendix.

PROOF: See appendix.

This is an interesting result: if some element of β is zero, then the estimator of the corresponding CAPE is √N-consistent, which is faster than the rates of the CAPEs with nonzero βj. From Theorem 1.4.2, we can see that under the scaling √N VM(x)^{−1/2}, ĈAPEj(x) degenerates to zero when βj = 0; so it is not surprising that we need a faster rate to get a non-degenerate asymptotic distribution. There are several such examples in Ferguson (1996), e.g., Example 3 in Chapter 7; see also Theorem 5.4 in Lee (2004).
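Continuing the earlier sketches (all names carried over from them), the CAPE estimator is just the coefficient times the fitted scale factor evaluated at x:

```python
def cape(x0):
    # CAPE_j(x) estimate: beta_j * exp(x*beta) * r_hat(x), per Section 1.2
    return beta_hat[1] * np.exp(beta_hat[0] + beta_hat[1] * x0) * r_hat(x0)

print(cape(0.5))   # compare the local scale r_hat(0.5) with the global eta_hat
```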
1.5 Application to Treatment Effects

As an example of where we might want to apply retransformation after IV estimation, consider the modern treatment effect literature. Let D denote a binary treatment and denote the strictly positive counterfactual outcomes as Y(0) and Y(1). We observe D and Y = (1 − D)Y(0) + DY(1). Assume we have covariates W such that

Y(0) = exp(α0 + Wβ0 + U)
Y(1) = exp(α1 + Wβ1 + U),

where, for simplicity, we assume only one source of heterogeneity, U, with E(U) = 0. Also, assume that U and W are independent. Then the average treatment effect, as a function of w, defined by τate(w) = E[Y(1) − Y(0)|W = w], is easily seen to be

τate(w) = η[exp(α1 + wβ1) − exp(α0 + wβ0)],

where η = E[exp(U)]. Furthermore, in terms of the observed outcome Y, τate(w) can be written in terms of its ASF, ASF(d, w). To see how, write Y = Y(0)^{1−D}Y(1)^D, and so

Y = exp(α0 + γD + Wβ0 + D·Wδ + U),

where γ = α1 − α0 and δ = β1 − β0. So the ASF for Y is

ASF(d, w) = exp(α0 + γd + wβ0 + d·wδ)E[exp(U)] = η exp(α0 + γd + wβ0 + d·wδ).

If we evaluate the ASF at d = 1 and d = 0 and difference, we get

ASF(1, w) − ASF(0, w) = η[exp(α0 + γ + wβ0 + wδ) − exp(α0 + wβ0)]
= η[exp(α1 + wβ1) − exp(α0 + wβ0)] = τate(w).

We can now apply a simple 2SLS strategy if we assume E(U|Z) = 0 for IVs Z that include W and at least one element not in W that predicts treatment status, D. Write

log(Yi) = α0 + γDi + Wiβ0 + Di·Wiδ + Ui

and use instruments, say, (1, Zi, Zi ⊗ Wi), or one can be selective with the interactions. Or, as described by Wooldridge (2010, Chapter 21), probit or logit fitted values can be used as IVs, in which case the list would look like

(1, Ĝi, Wi, Ĝi·Wi),

where the Ĝi are the fitted probabilities from a binary response model of Di on Zi. Given the 2SLS residuals Ûi, we compute η̂ exactly as in equation (1.3.5), and then the estimated ASF is

ÂSF(d, w) = η̂ exp(α̂0 + γ̂d + wβ̂0 + d·wδ̂).

For any w we estimate τ̂ate(w) = ÂSF(1, w) − ÂSF(0, w), and the unconditional average treatment effect, τate = E[Y(1) − Y(0)], is consistently estimated as

τ̂ate = N^{-1} Σ_{i=1}^N [ÂSF(1, Wi) − ÂSF(0, Wi)] = N^{-1} Σ_{i=1}^N η̂[exp(α̂1 + Wiβ̂1) − exp(α̂0 + Wiβ̂0)].

The average treatment effect on the treated is

τatt(w) = E[Y(1) − Y(0)|D = 1, W = w]
= [exp(α1 + wβ1) − exp(α0 + wβ0)]E[exp(U)|D = 1, W = w]
≡ [exp(α1 + wβ1) − exp(α0 + wβ0)]r(1, w),

where r(d, w) = E[exp(U)|D = d, W = w]. Now we use nonparametric regression of exp(Ûi) on Wi for the Di = 1 subsample to obtain r̂(1, w), and

τ̂att(w) = [exp(α̂1 + wβ̂1) − exp(α̂0 + wβ̂0)]r̂(1, w).

Often reported is

τatt = E[Y(1) − Y(0)|D = 1],

and a consistent estimator is

N1^{-1} Σ_{i=1}^N Di[exp(α̂1 + Wiβ̂1) − exp(α̂0 + Wiβ̂0)]r̂(1, Wi).
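A minimal end-to-end sketch of this 2SLS-plus-retransformation strategy, under an illustrative DGP of our own (a binary instrument z; 2SLS is computed by hand so no IV library is assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20000
w = rng.normal(size=N)                                # covariate W
z = rng.binomial(1, 0.5, size=N).astype(float)        # excluded instrument
u = 0.5 * rng.normal(size=N)                          # heterogeneity, E(U) = 0
d = (0.6 * z + 0.8 * u + rng.normal(size=N) > 0.5).astype(float)  # endogenous D

a0, a1, b0, b1 = 0.2, 0.6, 0.3, 0.1
logy = a0 + (a1 - a0) * d + b0 * w + (b1 - b0) * d * w + u

Xr = np.column_stack([np.ones(N), d, w, d * w])       # regressors
Zm = np.column_stack([np.ones(N), z, w, z * w])       # instruments

# 2SLS by hand: project the regressors on the instruments, run OLS of
# log(Y) on the projections; residuals are formed with the actual regressors.
Xhat = Zm @ np.linalg.lstsq(Zm, Xr, rcond=None)[0]
coef = np.linalg.lstsq(Xhat, logy, rcond=None)[0]
uhat = logy - Xr @ coef                               # 2SLS residuals
eta_hat = np.exp(uhat).mean()                         # smearing factor (1.3.5)

a0h, gh, b0h, dh = coef
tau_ate = np.mean(eta_hat * (np.exp(a0h + gh + (b0h + dh) * w)
                             - np.exp(a0h + b0h * w)))
print(tau_ate)                                        # estimate of E[Y(1) - Y(0)]
```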
1.6 Concluding Remarks

We have shown that a common retransformation method due to Duan (1983) can be used generally to estimate the parameters in the average structural function under weak assumptions on the dependence between the error and the covariates. In a standard regression setting, even a zero correlation assumption suffices. Further, the method easily applies when instrumental variables are needed to consistently estimate the coefficients in the log-linear model. Our derivation of the joint asymptotic distribution of the parameters indexing the ASF holds under weak assumptions – much weaker than the independence assumption used in the standard Duan (1983) setting, where errors and explanatory variables are independent.

In the case where X does not contain what we traditionally think of as endogenous variables – so that the dependence of D(U|X) comes through moments other than E(U|X) – one might question whether partial effects defined through the ASF are more useful than those obtained from E(Y|X = x). After all, for prediction purposes we prefer E(Y|X = x) to ASF(x) because the former is the minimum mean square predictor of Y. But if we are interested in getting the best predictor of Y, then we might just model E(Y|X) – say, by a flexible exponential function – and estimate it directly. Of course, we would get partial effects directly, too.

We also considered estimation of conditional APEs, which generally differ from the APEs when U and X are dependent (even though they may be mean independent). Here, we are led to an estimation problem essentially the same as that in Ai and Norton (2008). However, the calculation of partial effects differs, and the CAPEs have the same signs as the coefficients in the log-linear model. Some of this extends immediately to other transformations, such as log[Y/(1 − Y)] when 0 < Y < 1. More work is needed in treatment effect examples with more than one unobservable, as in Heckman-type switching regressions; a "simple" solution is to make distributional assumptions.

Chapter 2

ON THE USE OF EXPONENTIAL VERSUS LOG-LINEAR MODELS FOR PANEL DATA

2.1 Introduction

With the advance of technology, more and more data are collected over time for the same cross-sectional units, and a big chunk of applications deal with positive response variables – to name a few, the observed wage rates of workers over five years, or the numbers of patents applied for by firms over the last several years. The immediate extension of the results in Chapter 1 to the panel data setting would be to model the logarithmic transformation of the positive response variable as a linear function of covariates and to assume, at a minimum, that the error in each time period is uncorrelated with the explanatory variables in the same time period. Basically, we just stack the observations of all time periods for each cross-sectional unit and repeat the analysis of Chapter 1; everything follows as before, with a bigger sample size in applications. However, we know that assumption is too strong for certain panel data applications. In fact, a primary motivation for using panel data is to solve the omitted variables problem, and the error terms are hardly ever uncorrelated with the explanatory variables. So a more interesting extension would be to model the logarithmic transformation of the positive response variable in a "modern" panel data setting, which explicitly contains a time-constant unobserved effect treated as a random variable, drawn from the population along with the observed response and explanatory variables. We refer to the "linear panel data model (LPD)" in Wooldridge (2010) for more details.

Still, the problem of modeling the logarithmic transformation of a positive response variable in an LPD remains, and it can become even worse, because the unobserved time-invariant heterogeneity can either cause severe correlation among the error terms or itself be correlated with the covariates; even with it removed, as in the standard fixed effects (FE) transformation, the problem can hardly be solved.
For example, Blackburn (2007) compares the Poisson Quasi-Maximum Likelihood (PQML) method applied to the wage equation with first differencing applied to the log wage equation; using NLSY panel data from 1989 and 1993, he finds that the latter overestimates the union coverage effect on wages by almost 14%. We will also see this in an analysis of a data set on the American domestic airline market from 1997 to 2000. The results of the usual FE estimation are in Table B.1. Based on those results, we would say the price elasticity of demand in the market is over one in magnitude. We know this number is way too big and does not make sense intuitively; for example, Park et al. (2007) find that most city-pair airline routes for 12 main US carriers are inelastic in the short run. We will say more about this in the application section. This prompts us to question the estimation method itself. Considering the retransformation problem treated in Chapter 1, it is natural to ask: why should we use the logarithmic transformation of the response variable instead of modeling it directly?

Several attempts have been made in the literature. Most of the work uses nonlinear panel data models and relies on the method of conditional maximum likelihood (CML), where a sufficient statistic (the sum of the explained variable across time) is conditioned on to remove the unobserved effect. Examples include Chamberlain (1980, 1984) for binary responses and Hausman et al. (1984) for count data. Wooldridge (1999) investigates the robustness properties of these CMLEs to misspecification of the initially specified joint distribution and shows the CMLEs to be consistent when only the conditional mean in the unobserved effects multiplicative panel data model is correctly specified, which means that consistency is robust to arbitrary patterns of serial correlation. Moreover, the results hold not only for binary or count variables, but for any nonnegative variables. So it is not surprising that the estimation method is widely used in empirical applications, especially after Simcoe (2008) wrote a STATA code titled "xtpqml" (the code is updated to xtpoisson in STATA 12). It sheds new light on how to estimate the conditional mean parameters consistently for positive response variables without taking the logarithmic transformation in panel data models.

So this chapter starts with Wooldridge (1999). First, we maintain the conditional mean specification condition (equation (3.1) in Wooldridge (1999)) and show by Monte Carlo simulation what consequences the logarithmic transformation can have for the usual FE estimation in the LPD; second, we extend the "exponential sieve" estimator of Chapter 1 to the panel data model to estimate the average partial effects (APEs) of the explanatory variables of interest; third, we use the "exponential sieve" estimator to construct the optimal IVs, as in Newey (1993, 1994), to improve the efficiency of the GMM estimator suggested in Wooldridge (1999).

Section 2 introduces the model, under which simulations are constructed to compare the PQML and LFE estimators in Section 3. An "exponential sieve" estimator of the APE in the panel data model is formulated in Section 4. Section 5 discusses the GMM method for models with conditional mean and variance functions specified; the GMM can be understood as double PQML, one for the original scale of the dependent variable and the other for the squared scale. We also propose the optimal instrumental variables (OIV) estimator used in Newey (1993, 1994).
An empirical application to airfare data is provided in Section 6, and some remarks are contained in Section 7.

2.2 Model 1

The model we are considering is the following:

Yit = exp(Xitβ)Ci Vit,    (2.2.1)

where Ci is the unobserved heterogeneity and Vit is the multiplicative idiosyncratic error term; we assume both Ci and Vit are positive. The immediate question is how to estimate β. It is tempting to take the log transformation of both sides of equation (2.2.1):

log(Yit) = Xitβ + log(Ci) + log(Vit).    (2.2.2)

This is the typical linear panel data model as in Wooldridge (2010) if the relevant assumptions are made. Justifications for this log transformation include dealing with a dependent variable badly skewed to the right and interpreting β as the semi-elasticity of Y with respect to X; see, for example, Manning (1998) and Ai and Norton (2000). As suggested in Wooldridge (2010), it is natural to apply the usual FE transformation to remove the heterogeneity:

log(Yit)̈ = Ẍitβ + log(Vit)̈,    (2.2.3)

where Ẍit = Xit − T^{-1} Σ_{t=1}^T Xit, and similarly for the time-demeaned log(Yit)̈ and log(Vit)̈. So pooled ordinary least squares can be used to estimate β, which is exactly the FE method in Wooldridge (2010); we denote this method LFE. Whether LFE is consistent or not depends on the assumptions, one of which is that Ẍit and log(Vit)̈ are uncorrelated. Instead, we assume:

E(Vit|Xi, Ci) = 1.    (2.2.4)

Note that we can then easily get:

E(Yit|Xi, Ci) = exp(Xitβ + log(Ci)),    (2.2.5)

so the correct conditional mean specification condition as in Wooldridge (1999) is satisfied, and we can use Poisson Quasi-Maximum Likelihood (PQML) to estimate β consistently. It is interesting to compare the LFE estimator of β with PQML under equations (2.2.1) and (2.2.4). It is not difficult to see that the LFE estimator of β is not consistent in general, since equation (2.2.4) cannot guarantee that the error terms in equation (2.2.3), log(Vit)̈, are uncorrelated with the covariates Ẍit. Simulations comparing LFE with PQML are in the following section.

2.3 PQML vs. LFE: a Simulation Approach

In this section, we compare PQML with LFE by Monte Carlo simulations. The key point here is how to generate Vit that are positive and satisfy equation (2.2.4). In the first simulation, we specify that Vit has a Gamma distribution and find that the PQML estimator is consistent while LFE is not; in the second simulation, we specify that Vit has a log-normal distribution and find that both are consistent and LFE is more efficient than PQML.

2.3.1 Simulation 1

We follow the model in Section 2, i.e., equation (2.2.1). The data generating process is specified as follows:

• i = 1, 2, ..., N; t = 1, 2, ..., T,
• Xit ~ i.i.d. N(0, 1),
• Ci = exp(X̄i + N(0, 1)), where X̄i = T^{-1} Σ_{t=1}^T Xit,
• β = .1,
• Vit ~ i.i.d. Gamma(α, γ), where γ = 1/α = exp(a·X²it + b·Xit),
• T = 5; N = 500 unless otherwise specified,
• number of simulations = 1000.

We have two reasons to specify the Gamma distribution of Vit this way: (1) it guarantees that the Vit are positive; (2) it ensures that the condition in equation (2.2.4) is satisfied, since E(Vit|Xi, Ci) = α·γ = 1 for all a and b. The values of a and b will be specified in each setting; a single replication of this design is sketched in code below.
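Here is a minimal Python sketch of one replication (a = .05, b = .1 chosen for illustration). The PQML step maximizes the conditional (fixed effects) Poisson quasi-log-likelihood directly, which is free of Ci:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
N, T, beta = 500, 5, 0.1
a, b = 0.05, 0.1                       # one (a, b) pair from the tables

X = rng.normal(size=(N, T))
C = np.exp(X.mean(axis=1, keepdims=True) + rng.normal(size=(N, 1)))
shape = np.exp(-(a * X**2 + b * X))    # Gamma(shape, scale) with scale = 1/shape,
V = rng.gamma(shape, 1.0 / shape)      # so E(V|X) = shape * scale = 1, as in (2.2.4)
Y = np.exp(beta * X) * C * V

# LFE: within-demean log(Y) and X, then take the pooled OLS slope.
ly = np.log(Y)
ly_d = ly - ly.mean(axis=1, keepdims=True)
x_d = X - X.mean(axis=1, keepdims=True)
beta_lfe = (x_d * ly_d).sum() / (x_d**2).sum()

# PQML: maximize the conditional Poisson quasi-log-likelihood
# sum_i sum_t Y_it log( exp(b X_it) / sum_s exp(b X_is) ), free of C_i.
def negll(bb):
    e = np.exp(bb * X)
    return -(Y * np.log(e / e.sum(axis=1, keepdims=True))).sum()

beta_pqml = minimize_scalar(negll, bounds=(-1.0, 1.0), method="bounded").x
print(beta_lfe, beta_pqml)             # LFE is biased here; PQML is close to .1
```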
Tables B.2 and B.3 show the results for the PQML and LFE methods; note the true value of β is .1. The PQML estimates are very close to it for all the values of a and b. However, for LFE the bias is big in several cases, such as a = .01, b = .1; a = .05, b = .1; and a = .1, b = .1. These results strengthen the findings in Blackburn (2007); they tell us that we should be very careful using LFE to deal with positive response variables. On the other hand, as we can see from equation (2.2.3), it is the correlation between Ẍit and log(Vit)̈ that biases the LFE estimator; so whenever the value of b is relatively high compared with a, the correlation between X and log(Vit), denoted ρ_{x,lv}, is high and forces LFE away from its target.

We consider an extreme case here: let a = .1, b = 0 and run the simulations for three different values of N: 500, 1000 and 2000; the results are in the table titled "special case". We can see that all three LFE estimates beat PQML in terms of bias and efficiency. The bias of PQML decreases as the sample size increases, but in small samples it may not perform as well as LFE does.

2.3.2 Simulation 2

Simulation 2 is the same as Simulation 1 except for the distribution of Vit; here, Vit has a log-normal distribution. The data generating process is specified as follows:

• i = 1, 2, ..., N; t = 1, 2, ..., T,
• Xit ~ i.i.d. N(0, 1),
• Ci = exp(X̄i + N(0, 1)),
• β = .1,
• Vit = exp(a·X²it + b·Xit·zit), where zit ~ i.i.d. N(0, 1) unless otherwise specified,
• T = 5; N = 500 or 1000,
• number of simulations = 1000.

For this distribution of Vit, positivity is obviously satisfied, but the values of a and b cannot be arbitrary, since equation (2.2.4) must hold: because E[exp(b·Xit·zit)|Xit] = exp(b²X²it/2), we need a + b²/2 = 0, i.e., b² = −2a.

Table B.4 shows that the LFE estimator is as good as PQML in terms of consistency; moreover, the PQML estimator is more sensitive to the values of a and b. On the other hand, the standard errors of the LFE estimator are only about half those of PQML. This result is very interesting, since it adds one more advantage to the log transformation – greater efficiency. A code variant for this design is given below.
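In code, the only change relative to the Simulation 1 sketch is the draw of Vit (with a hypothetical pair a, b satisfying b² = −2a):

```python
# Simulation 2 variant: replace the Gamma draw in the earlier sketch with a
# log-normal V; b**2 = -2*a keeps E(V|X) = 1, as equation (2.2.4) requires.
b = 0.5
a = -b**2 / 2
z = rng.normal(size=(N, T))
V = np.exp(a * X**2 + b * X * z)
```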
However, we do see some fluctuation when ρ varies in its ranges: from Figure B.3, we can see that the biases of LFE are fluctuating around zero while all PQML estimators display downward bias; but there is no monotonic relationship between the bias and the ρ. When it comes to standard errors, the story is different: Figure B.4 shows that the standard errors of both estimators are increasing with crease of ρ; and LFE always beats PQML. We repeat the analysis for N = 1000, and similar patterns apply. So the serial correlation can cause problem to PQML, especially when sample size is not that big. 2.4 Estimating APE: Exponential Sieve Estimator After Wooldridge (1999) proves that PQML is appropriate for any non-negative dependent variable-not just count data that follow a Poisson distribution and the estimator is robust to arbitrary patterns of serial correlation, the method is widely used in all kinds of empirical works, especially after Simcoe (2008) writes a STATA command xtpqml. However, since most of the time, the model is nonlinear in observed explanatory variables with multiplicative unobserved heterogeneity, it is difficulty to interpret the estimates of parameters, which are the popular so called semi-elasticities in log linear models; this is one of the key drawbacks of the method compared with OLS/IV to the log linear models. Inspired by what we have done in chapter 1, we propose a nonparametric estimation method for APE in model as in equation (2.2.1). This way, we extend the analysis in chapter 1 from cross section to panel data settings. 28 2.4.1 Estimating APE Refer to equation (2.2.1), let Uit =Ci Vit be the whole error term. For the simplicity of denotation, we drop i for the time being. Similarly to what we have done in Chapter One, we define APE as follows: AP Ej (xt ) = E β j exp(xt β)Ut = η[β j exp(xt β)], (2.4.1) Here the idea is the same as in Chapter One: the expectation in equation (2.4.1) is taken only with respect to Ut . In order to estimate AP Ej (xt ), the key is to estimate η; from equation (2.2.5), we have, Yit X exp(Xit β) i E = E(Uit |Xi ), (2.4.2) So, η ≡ E(Uit ) = E Yit exp(Xit β) , Then the straight forward estimator for η is: ηˆ = 1 NT N T Yit ˆ exp(Xit β) i=1 t=1 , (2.4.3) Where βˆ is the PQML estimator of β in equation (2.2.1). Similarly to Theorem 1 in Chapter one, we have the following: Theorem 2.4.1. Let β be the PQML estimator from equation (2.2.1) with condition in equation (2.2.5)and η is defined in equation (2.4.3), then   √ β − β  d N  → N(0, Ω) η−η    Si  Where Ω ≡ E   × Si Qi Qi 29 PROOF: See appendix. With delta method, it is natural to have the asymptotic result for the device in equation (2.4.1): Corollary 2.4.2. √ d N AP E j (xt ) − AP Ej (xt ) → N(0, Θ ΩΘ) Where ˆ AP E j (xt ) = ηˆβ j exp(Xit β) Θ = ∇θ {η[β j exp(xt β)}, θ = β , η 2.4.2 Estimating CAPE First, we assume: D(Ci |Xi ) = D(Ci |Xi ), where Xi = T −1 T t=1 Xit , (2.4.4) which is the same as in the simulations in the previous section, represents the time-averaged Xit over the various panel periods. This assumption can date back to Mundlak (1978) and Chamberlain (1980, 1982, 1984), and more recently appears in Papke and Wooldridge (2008). Of course, we don’t need the normal distribution assumption as they do. 
Combining equations (2.2.1), (2.2.4) and (2.4.4), we get the following by iterated expectations:

$$E(Y_{it}|X_i) = E[E(Y_{it}|X_i, C_i)|X_i] = \exp(X_{it}\beta)E(C_i|X_i) = \exp(X_{it}\beta)E(C_i|\bar{X}_i) \qquad (2.4.5)$$

So we define

$$E\left[\frac{Y_{it}}{\exp(X_{it}\beta)}\,\Big|\,\bar{X}_i\right] = E(C_i|\bar{X}_i) \equiv r(\bar{X}_i), \qquad (2.4.6)$$

Here, $r(\bar{X}_i)$ is our estimation target, and we assume it has the same properties as in chapter 1. The only differences are that the conditioning variable is $\bar{X}_i$ instead of $X_i$, and that the variables on the left-hand side carry one more layer of subscript, $t$, besides $i$. So the exponential sieves estimator of $r(\bar{X}_i)$ is $\hat{r}(\bar{X}_i) = \exp(G_M(\bar{X}_i)\hat{\pi}_M)$, where

$$\hat{\pi}_M = \arg\min_\pi \sum_{i=1}^{N}\sum_{t=1}^{T}\left[\frac{Y_{it}}{\exp(X_{it}\hat{\beta})} - \exp(G_M(\bar{X}_i)\pi)\right]^2 \qquad (2.4.7)$$

As $N \to \infty$, we have $\hat{\pi}_M - \pi^*_M \overset{p}{\to} 0$, where

$$\pi^*_M = \arg\min_\pi E\sum_{t=1}^{T}\left[\frac{Y_{it}}{\exp(X_{it}\beta)} - \exp(G_M(\bar{X}_i)\pi)\right]^2 \qquad (2.4.8)$$

Lemma 2.4.3. Suppose that:

1. the support $\Xi$ of $\bar{X}$ is a compact set of $R^J$;

2. $r(\bar{X})$ is $s$ times continuously differentiable, that is, $r(\bar{X}) \in C^s$, with $s \geq 4$;

3. $r(\bar{X})$ is bounded away from zero;

4. the density of $\bar{X}$ is bounded away from zero on $\Xi$.

Then:

$$r(\bar{X}) - \exp(G_M(\bar{X})\pi^*_M) = O_p(CM^{-s}), \qquad (2.4.9)$$

PROOF: See appendix.

Lemma 2.4.4. Suppose that the conditions of Lemma 2.4.3 hold. In addition, suppose that (iv) $M(N)$ is a sequence of values of $M$ satisfying $M(N) \to \infty$ and $\zeta(M(N))^4/N \to 0$. Then

$$\hat{\pi}_{M(N)} - \pi^*_{M(N)} = O_p\left(\sqrt{\frac{M(N)}{N}}\right) \qquad (2.4.10)$$

PROOF: See appendix.

These two lemmas are almost the same as Lemmas 1 and 2 in chapter 1, except that here they apply to $\bar{X}_i$.

Theorem 2.4.5. Given all the assumptions in Lemma 2.4.3 and Lemma 2.4.4,

$$\int\left[r(\bar{X}) - \hat{r}(\bar{X})\right]^2 dF(\bar{X}) = O(M/N + M^{-2s}) \qquad (2.4.11)$$

PROOF: The proof is the same as that of Theorem 2 in chapter 1.

From this theorem, we can see the same kind of trade-off as in chapter 1: more terms, that is, a higher value of $M$, give a closer approximation to $r$; however, with more coefficients to estimate, the estimation error in $\hat{r}$ grows. The rate $M/N + M^{-2s}$ makes this trade-off explicit; refer to chapter 1 for a detailed illustration.

The next question is how to estimate the CAPE. First, we need to define the CAPE in the new setting; as usual, we need the functional form of $E(Y_{it}|X_{it})$. From equation (2.2.1):

$$\frac{\partial Y_{it}}{\partial x_{tj}} = \beta_j \exp(X_{it}\beta)C_i V_{it},$$

By iterated expectations, combining equations (2.4.5) and (2.4.6),

$$E\left[\frac{\partial Y_{it}}{\partial x_{tj}}\,\Big|\,\bar{X}_i\right] = \beta_j \exp(X_{it}\beta)\,r(\bar{X}_i)$$

So

$$CAPE_j(x) \equiv E\left[\frac{\partial Y_{it}}{\partial x_{tj}}\,\Big|\,X_i = x\right] = \beta_j \exp(x_t\beta)\,r(\bar{x}),$$

where $x = (x_1, \cdots, x_T)$ and $\bar{x} = T^{-1}\sum_{t=1}^{T}x_t$. The natural estimator for the CAPE is

$$\widehat{CAPE}_j(x) = \hat{\beta}_j \exp(x_t\hat{\beta})\,\hat{r}(\bar{x}),$$

Since $\beta$ is easily obtained by the PQML method, the focus is on how to estimate $r(\bar{x})$. Let

$$\Sigma_M \equiv E\left[\sum_{t=1}^{T}G_M(\bar{X}_i)'G_M(\bar{X}_i)\left(\frac{Y_{it}}{\exp(X_{it}\beta)} - r(\bar{X}_i)\right)^2\exp(2G_M(\bar{X}_i)\pi^*_M)\right],$$

$$Q_M \equiv E\left[G_M(\bar{X}_i)'G_M(\bar{X}_i)\exp(2G_M(\bar{X}_i)\pi^*_M)\right],$$

$$V_M(\bar{x}) \equiv G_M(\bar{x})Q_M^{-1}\Sigma_M Q_M^{-1}G_M(\bar{x})'\exp(2G_M(\bar{x})\pi^*_M).$$

Lemma 2.4.6. Suppose that the same four conditions as in Lemma 2.4.3 hold; then

$$\sqrt{N}\,V_M(\bar{x})^{-1/2}\left(\hat{r}(\bar{x}) - r(\bar{x})\right)\overset{d}{\to} N(0, 1)$$

Theorem 2.4.7. Given $N^{1/2}M^{-(s+1)} \to 0$ as $N \to \infty$ and $\beta_j \neq 0$, assumptions 1–3 and all the assumptions in Lemma 2.4.3 and Lemma 2.4.4, then

$$\sqrt{N}\,V_M^{-1/2}(\bar{x})\left(\widehat{CAPE}_j(x) - CAPE_j(x)\right)\overset{d}{\to} N(0, V), \quad \text{where } V = \beta_j^2\exp(2x_t\beta)$$

PROOF: See appendix.

Note that Theorem 2.4.7 requires $\beta_j \neq 0$. If $\beta$ is zero entirely, the model has no practical use; but it is possible that some elements of $\beta$ are zero. If some element of $\beta$ is zero, without loss of generality put it as $\beta_j = 0$; then:

Corollary 2.4.8.
Given all the assumptions in Theorem 2.4.7 except that now $\beta_j = 0$, then

$$\sqrt{N}\left(\widehat{CAPE}_j(x) - CAPE_j(x)\right)\overset{d}{\to} N(0, W)$$

where $CAPE_j(x) \equiv \beta_j\exp(x_t\beta)\,r(\bar{x})$ as defined above; for the detailed form of $W$, refer to the proof in the appendix.

PROOF: See appendix.

The above results are the analogs of the results in Chapter 1. But we do have a special result for the panel data case. From equation (2.4.5), we know that $E(Y_{it}|X_i)$ depends only on $X_{it}$ and $\bar{X}_i$ under the assumption in equation (2.4.4); so we do the following manipulation:

$$E(Y_{it}|X_{it}, \bar{X}_i) = E[E(Y_{it}|X_i)|X_{it}, \bar{X}_i] = E[\exp(X_{it}\beta)r(\bar{X}_i)|X_{it}, \bar{X}_i] = \exp(X_{it}\beta)\,r(\bar{X}_i) \qquad (2.4.12)$$

Following the same derivation as in Blundell and Powell (2003), we have:

$$ASF(X_t) = \int E(Y_{it}|X_{it} = X_t, \bar{X}_i)\,dF_{\bar{X}_i} = \exp(X_t\beta)\int r(\bar{X}_i)\,dF_{\bar{X}_i} \qquad (2.4.13)$$

If we let $E(Y_{it}|X_{it}, \bar{X}_i) = H(X_{it}, \bar{X}_i)$, then the RHS of equation (2.4.13) is $\int H(X_{it}, \bar{X}_i)\,dF_{\bar{X}_i}$, which is exactly the so-called average structural function in Blundell and Powell (2003); as in chapter 1, we define the APE based on the ASF. Denote the average partial effect with respect to $X_t$ as:

$$\lambda \equiv \frac{\partial ASF(X_t)}{\partial X_t} = \frac{\partial\exp(X_t\beta)}{\partial X_t}\int r(\bar{X}_i)\,dF_{\bar{X}_i} = \beta\exp(X_t\beta)\int r(\bar{X}_i)\,dF_{\bar{X}_i} \qquad (2.4.14)$$

The corresponding estimator is:

$$\hat{\lambda} = \hat{\beta}_{pqml}\exp(X_t\hat{\beta}_{pqml})\left[N^{-1}\sum_{i=1}^{N}\exp(G_M(\bar{X}_i)\hat{\pi}_M)\right] \qquad (2.4.15)$$

The structure of $\lambda$ is interesting: it depends only on $X_t$; the effect of cross-sectional heterogeneity is incorporated as a scale factor. Note that it sits between $APE$ and $CAPE(x)$: the former measures the overall effect of covariates and heterogeneity; the latter conditions on both the cross-section and time dimensions; $\lambda$ considers the effect of covariates in both dimensions but only the overall effect of heterogeneity. Papke and Wooldridge (2008) obtain a similar device in a panel data probit model with and without endogenous explanatory variables. To obtain an asymptotic result for $\hat{\lambda}$, we go further, following Papke and Wooldridge (2008):

$$\hat{\tau} = \hat{\beta}_{pqml}\left[(NT)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\exp(X_{it}\hat{\beta}_{pqml} + G_M(\bar{X}_i)\hat{\pi}_M)\right]$$

Note that $\hat{\lambda}$ can be considered a special case of $\hat{\tau}$: evaluating every $X_{it}$ at the fixed value $X_t$,

$$\hat{\beta}_{pqml}\left[(NT)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\exp(X_t\hat{\beta}_{pqml} + G_M(\bar{X}_i)\hat{\pi}_M)\right] = \hat{\beta}_{pqml}\exp(X_t\hat{\beta}_{pqml})\left[N^{-1}\sum_{i=1}^{N}\exp(G_M(\bar{X}_i)\hat{\pi}_M)\right]$$

So as long as we obtain asymptotic results for $\hat{\tau}$, the result for $\hat{\lambda}$ is easily achieved:

Theorem 2.4.9. Given $N^{1/2}M^{-s} \to 0$ as $N \to \infty$, and all the assumptions in Lemmas 2.4.3 and 2.4.4, then

$$\sqrt{N}(\hat{\tau} - \tau)\overset{d}{\to} N(0, \mathcal{V}) \qquad (2.4.16)$$

PROOF: See appendix.

The structure of the variance is a little complicated; the derivations and estimation are in the appendix. Note that it follows from equation (2.4.6) that

$$E\left[r(\bar{X}_i)\right] \equiv E\left[E(C_i|\bar{X}_i)\right] = E\left[E\left(\frac{Y_{it}}{\exp(X_{it}\beta)}\,\Big|\,\bar{X}_i\right)\right] = E\left[\frac{Y_{it}}{\exp(X_{it}\beta)}\right]$$

so there is automatically a "naive" estimator of the APE:

$$\hat{\beta}_{pqml}\exp(X_t\hat{\beta}_{pqml})\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{Y_{it}}{\exp(X_{it}\hat{\beta}_{pqml})}$$

With all these solved, we can propose the following estimating procedure (sketched in code below):

Step 1: run the PQML of $Y_{it}$ on $X_{it}$, and denote the estimator of $\beta$ by $\hat{\beta}_{pqml}$;

Step 2: use power series to approximate $r(\bar{X}_i)$, denoted $\hat{r}(\bar{X}_i)$;

Step 3: estimate the APE (or CAPE) of the parameter of interest and the corresponding standard error.

2.5 More Efficient Estimator: GMM

As we found in simulation 2 of section 3, when both the PQML and LFE estimators are consistent, the latter is more efficient than the former. Wooldridge (1999) suggests a GMM method to improve the efficiency of the PQML estimator. The key step of GMM is to derive a moment condition from equation (2.2.3).
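Before developing the GMM estimator, here is the promised sketch of the three-step procedure; `exponential_sieve` implements the nonlinear least squares of equation (2.4.7) with a simple polynomial basis for $G_M$ in a scalar $\bar{X}_i$, and the commented lines compose $\hat{\lambda}$ from (2.4.15). All names and the basis choice are assumptions of the sketch, not the chapter's exact implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def exponential_sieve(Y, X, beta_hat, M=3):
    """Step 2: fit r_hat(xbar) = exp(G_M(xbar)'pi) by NLS, equation (2.4.7)."""
    xbar = X.mean(axis=1)                              # time averages, scalar case
    G = np.vander(xbar, M + 1, increasing=True)        # [1, xbar, ..., xbar^M]
    u = Y / np.exp(X * beta_hat)                       # Y_it / exp(X_it * beta)
    resid = lambda pi: (u - np.exp(G @ pi)[:, None]).ravel()
    pi_hat = least_squares(resid, np.zeros(M + 1)).x
    return lambda xb: np.exp(np.vander(np.atleast_1d(xb), M + 1,
                                       increasing=True) @ pi_hat)

# Steps 1-3 composed (beta_hat from a PQML fit, as in the earlier sketch):
# Y, X = simulate_panel()
# r_hat = exponential_sieve(Y, X, beta_hat)
# lam = beta_hat * np.exp(x_t * beta_hat) * r_hat(X.mean(axis=1)).mean()  # (2.4.15)
```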
2.5.1 Model 2

As is well known, we need a model class within which to compare efficiency; the model we consider here is as follows:

$$E(Y_{it}|X_i, \phi_i, \psi_i) = \phi_i\,\mu(X_{it}, \beta_0), \qquad (2.5.1)$$

$$\mathrm{Var}(Y_{it}|X_i, \phi_i, \psi_i) = \psi_i^2\left[E(Y_{it}|X_i, \phi_i, \psi_i)\right]^2 = \psi_i^2\left[\phi_i\,\mu(X_{it}, \beta_0)\right]^2 \qquad (2.5.2)$$

$$\mathrm{Cov}(Y_{it}, Y_{ir}|X_i, \phi_i, \psi_i) = 0, \quad t \neq r \qquad (2.5.3)$$

This model originates in Wooldridge (1999), where it is posed as an open question. The function $\mu(\cdot,\cdot)$ can be any positive functional form, such as the exponential in section 2. Compared with model 1, we have one more equation specifying the relationship between the conditional variance and the square of the conditional mean. Note that $\psi_i$ is also a random variable, so this is an extension of the constant-coefficient-of-variation model.

2.5.2 GMM

From the above two equations, we can easily get:

$$E(Y_{it}^2|X_i, \phi_i, \psi_i) = (1 + \psi_i^2)\left[E(Y_{it}|X_i, \phi_i)\right]^2 = (1 + \psi_i^2)\phi_i^2\left[\mu(X_{it}, \beta_0)\right]^2 \qquad (2.5.4)$$

If we put $Y_{it}^2$ in place of $Y_{it}$, $(1 + \psi_i^2)\phi_i^2$ in place of $\exp(C_i)$ and $[\mu(X_{it}, \beta_0)]^2$ in place of $\exp(X_{it}\beta)$ in equation (2.2.5), we can apply the PQML method to $Y_{it}^2$. From here, we can derive the moment conditions. We define:

$$n_i^j \equiv \sum_{t=1}^{T}(Y_{it})^j, \quad j = 1, 2$$

$$p_t^j(X_i, \beta_0) \equiv \frac{\mu^j(X_{it}, \beta_0)}{\sum_{r=1}^{T}\mu^j(X_{ir}, \beta_0)}, \quad j = 1, 2$$

$$u_i^j(\beta) \equiv (Y_i)^j - p^j(X_i, \beta)\,n_i^j, \quad j = 1, 2$$

where $p^j(X_i, \beta) \equiv [p_1^j(X_i, \beta), \ldots, p_T^j(X_i, \beta)]'$, $Y_i = [Y_{i1}, \ldots, Y_{iT}]'$ and $(Y_i)^j$ is taken elementwise. By iterated expectations, we have:

$$E(u_i^j(\beta_0)|X_i, \phi_i, \psi_i) = E((Y_i)^j|X_i, \phi_i, \psi_i) - p^j(X_i, \beta_0)E(n_i^j|X_i, \phi_i, \psi_i) = 0, \quad j = 1, 2 \qquad (2.5.5)$$

So,

$$E\left(D^j(X_i, \beta)'u_i^j(\beta_0)\right) = 0, \quad j = 1, 2 \qquad (2.5.6)$$

where $D^j(X_i, \beta)$ is any appropriate function of $X_i$. Wooldridge (1999) suggests the following:

$$D^j(X_i, \beta) = \left[W^j(X_i, \beta)\nabla_\beta p^j(X_i, \beta)\ \big|\ \nabla_\beta p^j(X_i, \beta)\right], \qquad (2.5.7)$$

where $W^j(X_i, \beta) \equiv \left[\mathrm{diag}\{p_1^j(X_i, \beta), \ldots, p_T^j(X_i, \beta)\}\right]^{-1}$. If we define

$$D(X_i, \beta) = \begin{pmatrix}D^1(X_i, \beta) & 0\\ 0 & D^2(X_i, \beta)\end{pmatrix}, \qquad u_i(\beta) = \begin{pmatrix}u_i^1(\beta)\\ u_i^2(\beta)\end{pmatrix},$$

the two moment conditions can be combined as:

$$E\left(D(X_i)'u_i(\beta_0)\right) = 0 \qquad (2.5.8)$$

GMM follows easily:

$$\hat{\beta}_{gmm} = \arg\min_\beta\left[\sum_{i=1}^{N}D_i'u_i(\beta)\right]'\left[\sum_{i=1}^{N}D_i'\hat{u}_i\hat{u}_i'D_i\right]^{-1}\left[\sum_{i=1}^{N}D_i'u_i(\beta)\right] \qquad (2.5.9)$$

where $D_i = D(X_i, \hat{\beta}_{pqml})$ and $\hat{u}_i = u_i(\hat{\beta}_{pqml})$. As pointed out by Wooldridge (1999), GMM with only $D^1(X_i, \beta)$ and $u_i^1(\beta)$ is identical to PQML.

Theorem 2.5.1. As consistent estimators of $\beta_0$ in model 2, GMM is more efficient than PQML.

PROOF: From Wooldridge (2002, section 8.3.3), we know:

$$\mathrm{Avar}(\hat{\beta}_{gmm}) = (C'\Lambda^{-1}C)^{-1}, \qquad \mathrm{Avar}(\hat{\beta}_{pqml}) = (C_1'\Lambda_1^{-1}C_1)^{-1}$$

where $\Lambda = E\left[D_i'u_iu_i'D_i\right]$ and $C = E\left[D_i'\nabla_\beta u_i(\beta)\right]$; $\Lambda_1$ and $C_1$ are defined like $\Lambda$ and $C$, with $D_i$ and $u_i$ replaced by $D_i^1$ and $u_i^1$ respectively. From White (1984), the matrix $C'\Lambda^{-1}C - C_1'\Lambda_1^{-1}C_1$ is p.s.d.; the result follows immediately.

2.5.3 Simulation 3

The data generating process is specified as follows:

• $X_{it} \overset{\text{i.i.d.}}{\sim} N(0,1)$,

• $C_i = \bar{X}_i + N(0,1)$,

• $\beta = .1$,

• $V_{it} = \exp(-.125\,\bar{X}_i^2 + .5\,\bar{X}_i z_{it})$, where $z_{it} \overset{\text{i.i.d.}}{\sim} N(0,1)$,

• $Y_{it} = \exp(X_{it}\beta + C_i)V_{it}$,

• $T = 5$, number of simulations $= 5000$.

Here, $V_{it}$ has a form similar to simulations 1 and 2; however, since $V_{it}$ has to satisfy equation (2.2.2), we use $\bar{X}_i$ instead of $X_{it}$. We only use $a = -.125$, $b = .5$ here, since other values give similar results. Table B.8 has the results for all three methods. All three estimators are consistent, but the differences in standard errors are big: comparing PQML with GMM, the standard error of the latter is about half that of the former; so the extra moment condition does matter here, and GMM is the right direction for improving efficiency.
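A minimal sketch of the moment functions (2.5.5)–(2.5.8) and the two-step GMM objective (2.5.9), for $\mu(x, b) = \exp(xb)$ with a scalar parameter; the instrument is simplified to $D^j = W^j\nabla_b p^j$, and all names are illustrative rather than the chapter's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def moments(b, Y, X):
    """Per-unit stacked moments D^j' u^j for j = 1, 2 with mu(x, b) = exp(x b)."""
    g = []
    for j in (1, 2):
        e = np.exp(j * X * b)                         # mu(X, b)^j, shape (N, T)
        p = e / e.sum(axis=1, keepdims=True)          # p^j_t, a softmax over t
        u = Y**j - p * (Y**j).sum(axis=1, keepdims=True)   # u^j_i, eq. (2.5.5)
        dp = p * (j * X - (p * j * X).sum(axis=1, keepdims=True))  # d p^j / d b
        g.append(((dp / p) * u).sum(axis=1))          # W^j grad p^j as instrument
    return np.column_stack(g)                         # (N, 2) moment matrix

def gmm(Y, X, b_pqml):
    """Two-step GMM (2.5.9): weight by the inverse sample covariance of the
    moments evaluated at the first-step PQML estimate."""
    G0 = moments(b_pqml, Y, X)
    Winv = np.linalg.inv(G0.T @ G0 / len(G0))
    def obj(b):
        gbar = moments(b, Y, X).mean(axis=0)
        return gbar @ Winv @ gbar
    return minimize_scalar(obj, bounds=(-1.0, 1.0), method="bounded").x
```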
However, the standard errors of LFE are smaller than those of GMM. Theoretically, there are cases where LFE is inconsistent while GMM and PQML are consistent, but we failed to construct such a simulation.

2.5.4 Optimal IV Estimator

Considering Model 2, we can derive the so-called optimal instrumental variables (OIV) estimator of Newey (1990, 1993) to improve efficiency. For the moment condition $u_i^1(\beta) \equiv Y_i - p^1(X_i, \beta)n_i^1$, we first need to find its variance structure. From conditions (2.5.1)–(2.5.3) of model 2, we get:

$$\mathrm{Var}(u_i^1|X_i) = E(u_i^1u_i^{1\prime}|X_i) = E\left[\psi_i^2\phi_i^2|X_i\right]\Omega_i \qquad (2.5.10)$$

where, writing $S_1 \equiv \sum_{t=1}^{T}\mu(X_{it}, \beta)$ and $S_2 \equiv \sum_{t=1}^{T}\mu(X_{it}, \beta)^2$,

$$\Omega_i(r, s) = \begin{cases}\left[1 - \dfrac{2\mu(X_{is}, \beta)}{S_1} + \dfrac{S_2}{S_1^2}\right]\mu(X_{is}, \beta)^2, & r = s;\\[2ex]\left[\dfrac{S_2}{S_1^2} - \dfrac{\mu(X_{is}, \beta) + \mu(X_{ir}, \beta)}{S_1}\right]\mu(X_{is}, \beta)\,\mu(X_{ir}, \beta), & r \neq s.\end{cases}$$

In fact, with matrix algebra we can do the following: let $P_i$ be the $T \times T$ matrix each of whose columns equals $p^1(X_i, \beta) = (p_{i1}, \ldots, p_{iT})'$, so that

$$u_i^1(\beta) = (I_T - P_i)\,Y_i,$$

and we get:

$$\mathrm{Var}(u_i^1(\beta)|X_i, \phi_i, \psi_i) = A\,\mathrm{Var}(Y_i|X_i, \phi_i, \psi_i)\,A' = \psi_i^2\phi_i^2\,ABA'$$

where $A = I_T - P_i$ and $B = \mathrm{diag}\{\mu(X_{i1}, \beta)^2, \ldots, \mu(X_{iT}, \beta)^2\}$. So $\mathrm{Var}(u_i^1(\beta)|X_i) = E(\psi_i^2\phi_i^2|X_i)\,ABA'$; note that $\Omega_i = ABA'$. The remaining task is to find $E(\psi_i^2\phi_i^2|X_i)$; note that

$$\mathrm{Var}(u_{it}^1|X_i) = E\left[\psi_i^2\phi_i^2|X_i\right]\left[1 - \frac{2\mu(X_{it}, \beta)}{S_1} + \frac{S_2}{S_1^2}\right]\mu(X_{it}, \beta)^2,$$

so

$$E\left[\psi_i^2\phi_i^2|X_i\right] = E\left[\frac{(u_{it}^1)^2}{\left(1 - \frac{2\mu(X_{it}, \beta)}{S_1} + \frac{S_2}{S_1^2}\right)\mu(X_{it}, \beta)^2}\,\Bigg|\,X_i\right].$$

If we assume that

$$E\left[\psi_i^2\phi_i^2|X_i\right] = E\left[(1 + \psi_i^2)\phi_i^2|X_i\right] = h(X_i), \qquad (2.5.11)$$

here $h(\cdot)$ is an unknown function. Since $h(\cdot)$ is the conditional expectation of positive random variables, it is positive, so the main problem is still how to capture the positivity. Newey (1994) uses truncation, which works in the large-sample case; since the truncation threshold is rather arbitrary, it may not work well in finite samples. Next, we find $D(X_i)$ as in Newey (1990, 1993):

$$D(X_i) \equiv E(\nabla_\beta u_i^1(\beta)|X_i) = E\left(\nabla_\beta[Y_i - p^1(X_i, \beta)n_i^1]\,\big|\,X_i\right) = -\nabla_\beta[p^1(X_i, \beta)]\,E(n_i^1|X_i) = -\nabla_\beta[p^1(X_i, \beta)]\sum_{t=1}^{T}\mu(X_{it}, \beta)\,r(X_i)$$

Here again we use the "exponential sieve" to estimate $h(X_i)$ and $r(X_i)$ as in section 3; note that the method shows up twice in the OIV estimator. We use cross-validation as in Newey (1993) to choose the number of terms $M$. We propose the following estimating method:

Step 1: run the PQML of $Y_{it}$ on $X_{it}$, and denote the estimator of $\beta$ by $\hat{\beta}_{pqml}$; also define $\hat{u}_{it}^1 = Y_{it} - n_i^1 p_{it}(\hat{\beta}_{pqml})$;

Step 2: use power series to approximate $h(X_i)$, denoted $\hat{h}(X_i)$; define $B(X_i) = \hat{D}(X_i)'\hat{\Omega}_i^{-1}/\hat{h}(X_i)$;

Step 3: do the following GMM:

$$\hat{\beta}_{gmm} = \arg\min_\beta\left[\sum_{i=1}^{N}B_iu_i^1(\beta)\right]'\left[\sum_{i=1}^{N}B_iB_i'\right]^{-1}\left[\sum_{i=1}^{N}B_iu_i^1(\beta)\right] \qquad (2.5.12)$$

We end this section with a simulation.

Simulation 4

We follow model 2 of section 5, i.e., equations (2.5.1)–(2.5.3). A sketch of the OIV weight construction for one cross-sectional unit is given below, and the data generating process follows.
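A minimal sketch of the per-unit OIV weight $B(X_i) = \hat{D}(X_i)'\hat{\Omega}_i^{-1}/\hat{h}(X_i)$ for $\mu(x, b) = \exp(xb)$ with scalar $b$; since $\Omega_i = ABA'$ is singular, a pseudo-inverse is used, and `r_hat`/`h_hat` stand for fitted sieve functions evaluated at the time average, which is an assumption of the sketch along with all names.

```python
import numpy as np

def oiv_weight(Xi, b, r_hat, h_hat):
    """B(X_i) for one unit: D(X_i)' Omega_i^+ / h(X_i), mu(x, b) = exp(x b)."""
    mu = np.exp(Xi * b)                                   # T-vector mu(X_it, b)
    p = mu / mu.sum()
    A = np.eye(len(Xi)) - np.outer(p, np.ones(len(Xi)))   # u^1 = A Y_i
    Omega = A @ np.diag(mu**2) @ A.T                      # Omega_i = A B A'
    dp = p * (Xi - p @ Xi)                                # grad_b p_t, exponential mu
    D = -dp * mu.sum() * r_hat(Xi.mean())                 # E(grad_b u^1 | X_i)
    return D @ np.linalg.pinv(Omega) / h_hat(Xi.mean())
```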
The data generating process is specified as follows:

• $i = 1, 2, \ldots, N$; $t = 1, 2, \ldots, T$,

• $X_{it} \overset{\text{i.i.d.}}{\sim} N(0,1)$,

• $C_i = \exp(\bar{X}_i + N(0,1))$, where $\bar{X}_i = T^{-1}\sum_{t=1}^{T}X_{it}$,

• $\beta = .1$,

• $V_{it} \overset{\text{i.i.d.}}{\sim} \mathrm{Gamma}(\alpha, \gamma)$, where $\gamma = 1/\alpha = \exp(-.125\,\bar{X}_i^2 + .5\,\bar{X}_i z_{it})$ and $z_{it} \overset{\text{i.i.d.}}{\sim} N(0,1)$,

• $Y_{it} = \exp(X_{it}\beta)C_iV_{it}$,

• $T = 5$, $N = 500$ if not specified, number of simulations $= 5000$.

The estimation results are in Table B.9. For LFE, PQML and GMM, the estimates are almost the same as in Table B.8: all of them are very close to the true parameter value; in terms of efficiency, LFE is the best and PQML the worst. The new estimation method, OIV, performs very well in terms of standard errors, which are close to those of LFE. Finally, we have found a method that can compete with LFE.

2.6 Empirical Application

In this section we illustrate the GMM estimator as well as PQML by estimating the APE of price on demand in the airline market. A market is defined as in Park et al. (2007): a trip between origin and destination cities. For each route, i.e., a pair of cities, measurements are taken daily; to wash out daily fluctuations, such as weekends and holidays, yearly averages are taken from 1997 to 2000. Some routes lack records for these four consecutive years and have to be dropped, but the drop rate is less than 10% and we retain over 1000 routes, so the data set is a typical balanced panel. The key variables are as follows: for each route, passen measures the average number of daily passengers for the year; lfare is the log of the average one-way fare in US dollars; concen is the market share of the biggest carrier on the route. Refer to Table B.10 for details.

Table B.11 shows the estimation results for the three methods. The GMM method reduces the standard errors by about 10% compared with PQML, and the PQML and GMM estimates are close to each other. For practical purposes, from equation (2.4.12) in section 4, the APE for lfare is

$$\hat{\beta}_j\exp(x_t\hat{\beta}_{pqml})\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{Y_{it}}{\exp(X_{it}\hat{\beta}_{pqml})} \approx 512 \qquad (2.6.1)$$

$$\hat{\beta}_j\,N^{-1}\sum_{i=1}^{N}\exp(x_t\hat{\beta}_{pqml} + G_M(\bar{X}_i)\hat{\pi}_M) \approx 507 \qquad (2.6.2)$$

$$\hat{\beta}_j\exp(x_t\hat{\beta}_{pqml} + G_M(\bar{x})\hat{\pi}_M) \approx 421 \qquad (2.6.3)$$

where $x_t$ is evaluated at the median values of lfare and concen with the year dummy set to 1998, and $\bar{x}$ is evaluated at the mean values of lfare and concen. For the variable of interest, lfare, a 1% rise in the price of air tickets excludes about 500 passengers overall, and about 400 at the specific setting. Compare those results with LFE: a 1% price rise excludes about 6 passengers overall (here we evaluate passen at its mean). Calculating the corresponding standard errors, all the estimates are statistically significant at the 5% level. From Figure B.5, we can see that passen is skewed to the right and is not symmetric even after a logarithmic transformation. In terms of the partial effect of airfare on market demand, the advantage of equation (2.6.3) is that it gives the partial effect of airfare on demand for an arbitrary value of $x$. For example, consider the APE of airfare on demand for the route with the highest demand, 8497 passengers: in 1998, a 1% price rise in air tickets would exclude about 545 passengers from the route; this way, the price elasticity of demand is about 6%.

2.7 Conclusions

We have summarized the three usual estimation methods for positive response variables.
By simulation, we listed the advantages and disadvantages of each; basically there is a trade-off between consistency and efficiency. As long as the conditional mean function is correctly specified, PQML is robust to any distribution of the error terms; on the other hand, it suffers a loss of precision for errors with certain distributions. If we are willing to assume more about the distribution, we can use GMM to reduce the standard errors of the PQML estimator, and the application to the airfare data set shows that it works. As for the estimation of the APE, the proposed estimator remains distribution free and captures the positivity of the conditional mean function of the heterogeneity. Many issues can be studied in future research. For one, whether it is possible to develop a test to distinguish PQML and LFE, like the Hausman test for the RE and FE estimators. Another interesting problem is the consequence of misspecifying the conditional mean function.

Chapter 3

A SPATIAL ANALYSIS OF SPENDING EFFECT ON MEAP

3.1 Introduction

As we all know, public schools in the K-12 educational system in the U.S. are financed mostly by local revenue, primarily taxes levied on property. One disadvantage of this policy is that it can potentially lead to economic inequality across school districts within a state since, as is often argued, demand for (and affordability of) a good education increases with parental income and educational attainment. Take Michigan as an example: in 1992, per pupil expenditure in a rich school district (Bloomfield Hills School District, DCODE 63080) could reach more than 9 times that in a poor school district (Ionia Township S/D #2, DCODE 34360).¹ Against this background, Michigan in 1994 initiated a school finance reform called Proposal A, aimed at equalizing school finances across school districts within state boundaries. A great body of research has examined its impacts. Papke (2005, 2008) uses panel data sets, at either the school or the school district level, and finds in linear regression models a statistically significant positive relationship between student performance, measured by the pass rate on the math test for fourth graders, and expenditure. Moreover, as pointed out in Papke (2008), the magnitude of the effect for initially high-performing districts is lower than for initially low-performing ones², which clearly suggests nonlinearity in the relationship. Under this circumstance, Papke and Wooldridge (2008) extend the analysis to a nonlinear model with a school-district panel data set covering a shorter time span than Papke (2008). One of the challenges in Papke and Wooldridge (2008) is dealing with unobserved heterogeneity in a probit setting; with strictly exogenous explanatory variables, they propose a conditional normality assumption following the Mundlak (1978) and Chamberlain (1980) device. The usual shortcoming of nonlinear models, making inference about the average partial effect, is elegantly overcome.

¹ We get this from the data set used in Papke and Wooldridge (2008); in terms of household income, Chakrabarti and Roy (2012) find that the median income in a rich school district is more than three times that in a poor one.

² Chakrabarti and Roy (2011) even claim that the finance reform may have had a negative effect on student performance in the highest-spending districts.
More recently, with even broader data coverage, Chakrabarti and Roy (2012) study the impact of Proposal A on spatial segregation in the housing market and find continued high demand for residence in the highest-spending districts, suggesting the importance of neighborhood peer effects ("local" social capital). However, none of these studies accounts for spatial dependence in the unobserved cross-sectional fixed effect, whether at the building or the school district level. This is the first reason to revisit the problem. On the other hand, using a data set spanning 10 years, Papke (2008) finds that although spending inequality fell in the years immediately after Proposal A, equalization has slowed considerably since 2000. Chakrabarti and Roy (2012) also note the continued high demand for residence in the highest-spending communities, implying that even a comprehensive government aid program can fail to make a large impact on residential segregation. So, considering that our data are from 2010, it is interesting to investigate the effect of Proposal A after 15 years of implementation, especially with spatial effects controlled for.

Spatial econometrics can be understood as a parallel extension of time series, with the time index replaced by space. In the applied literature, issues relating to geographic proximity, transportation, spillover effects, etc., are important. Indeed, in recent years spatial analysis in economics has been booming: refer to Case (1991), Anselin and Florax (1995), Kelejian and Prucha (1999), Anselin (2010) and the literature therein. Modeling the spatial interaction that arises in spatially referenced data is traditionally done by incorporating the spatial dependence into the covariance structure via an autoregressive model. For example, Wall (2004) analyzed the SAT scores of all 48 contiguous states for 1999 with the two most commonly used models in spatial statistics: the conditional autoregressive (CAR) model and the simultaneous autoregressive (SAR) model. Both models incorporate spatial dependence in the covariance structure as a function of a neighbor matrix and, often, a fixed unknown spatial correlation parameter; refer to that paper and Banerjee et al. (2004) for more details. Obviously, that approach is not robust to misspecification of the covariance functional form. What's more, as Conley (1999) notes, whenever there are errors in the measurement of spatial dependence, which happens very often in applications, we cannot estimate the parameters of interest consistently without assuming distributions for the errors; and, as we all know, such distributional assumptions are usually naive in reality. Conley and Molinari (2007) show how poorly MLE can perform when the distribution is misspecified, while the method in Conley (1999) works well. Moreover, Conley extends spatial dependence to a broad economic system: general dependence within a cross section, not necessarily related to geographical features. For example, Conley and Ligon (2002) create a notion of "economic distance" among countries; physical distance certainly contributes to it, but it also reflects "border effects" as in Engel and Rogers (1996).
On the other hand, Conley also pioneered a nonparametric approach to covariance structure estimation; the basic idea of that estimator, as Keller and Shiue (2007) point out, is a spatial version of the Newey-West (1987) heteroskedasticity and autocorrelation robust covariance estimator. The method has been used successfully in many areas of economics, such as development, international economics and labor; refer to Conley and Ligon (2002), Conley and Topa (2002), Conley and Dupor (2003). However, Conley (1999) does not elaborate on how to choose the cutoff points, beyond the asymptotic condition of order one third of the sample size (or of the width of the sample region). By simulation, Conley and Molinari (2007) argue that the choice is relevant, so it may be hard for an empirical researcher to apply the method. Hence, the purpose of this paper is twofold. First, we use the Michigan MEAP data for 2010 to revisit the studies in Papke (2005, 2008), also in a nonlinear regression model; to investigate the effect of spatial dependence as specified by Conley (1999), we propose two ways to choose the cutoff points and show how the spatial-dependence-corrected standard errors depend on the choice. Second, to get the average partial effect (APE) as in Papke and Wooldridge (2008), the average structural function (ASF) device of Blundell and Powell (2003) is used to estimate the spending effect as well as the others.

The remainder of this paper is organized as follows. Section 2 introduces the linear and nonlinear models; the data description and related issues are in section 3. Section 4 presents the linear regression with the various standard-error calculations. Section 5 provides the results of the nonlinear model and how the APE is estimated correspondingly. Conclusions and discussion are in Section 6.

3.2 Model

Let $(Y_i, X_i)$ be a sequence of observations for areas $A_i$. First, following Papke (2005, 2008), the linear model is considered:

$$Y_i = X_i\beta + C_i, \qquad E(C_i|X_i) = 0 \qquad (3.2.1)$$

where $Y_i$ is a scalar, $X_i$ is a $1 \times K$ vector and $\beta$ is a $K \times 1$ vector of unknown parameters. For area $A_i$, $Y_i$ and $X_i$ can be understood as $Y(A_i)$ and $X(A_i)$ respectively, but for simpler notation we write them as $Y_i$ and $X_i$; $C_i$, or $C(A_i)$, is the unobserved spatial heterogeneity for area $A_i$. If there are $N$ areas in total, then $i = 1, 2, \cdots, N$. Under general conditions [e.g., Wooldridge (2010, Chapter 4)], the OLS estimator of $\beta$ based on the $N$ observations is:

$$\hat{\beta} = (X'X)^{-1}X'Y, \qquad (3.2.2)$$

where $Y \equiv [Y_1, Y_2, \cdots, Y_N]'$ is $N \times 1$ and $X \equiv [X_1', X_2', \cdots, X_N']'$ is $N \times K$. With a little algebra [refer to Wooldridge (2010, Chapter 4)], we get:

$$\sqrt{N}(\hat{\beta} - \beta) = \left(\frac{1}{N}\sum_{i=1}^{N}X_i'X_i\right)^{-1}\left(N^{-1/2}\sum_{i=1}^{N}X_i'C_i\right), \qquad (3.2.3)$$

So, under general conditions, $\hat{\beta}$ is weakly consistent for $\beta$ and asymptotically normally distributed; note that the asymptotic distribution really hinges on the second term in equation (3.2.3). Suppose we assume the following conditions:

$$\mathrm{Var}(C_i|X_i) = \sigma^2, \quad i = 1, 2, \cdots, N; \qquad (3.2.4)$$

$$\{X_i'C_i : i = 1, 2, \cdots, N\}\ \text{is an uncorrelated sequence.} \qquad (3.2.5)$$

Condition (3.2.4) is the appropriate homoskedasticity assumption and (3.2.5) is a no-correlation requirement for the sequence (it can also be strengthened to independence). Then the covariance matrix of $\hat{\beta}$ is $\sigma^2(X'X)^{-1}$. Let $\hat{\sigma}^2 = SSR/(N - K)$ be the usual squared standard error of the regression; the usual standard error of the $j$th OLS estimator $\hat{\beta}_j$ is the square root of the $j$th diagonal element of $\hat{\sigma}^2(X'X)^{-1}$.
That is also the "standard error" printed out by all regression packages. Without homoskedasticity [condition (3.2.4)] but with no correlation [condition (3.2.5)], we can allow an arbitrary form for the conditional variance of $C$ given $X$ through the Huber (1967) and White (1980) robust standard errors, which are the square roots of the diagonal terms of the matrix

$$(X'X)^{-1}\left(\sum_{i=1}^{N}X_i'\hat{C}_i^2X_i\right)(X'X)^{-1}.$$

Note that $\frac{1}{N}\sum_{i=1}^{N}X_i'\hat{C}_i^2X_i$ is a consistent estimator of $\mathrm{Var}\left(N^{-1/2}\sum_{i=1}^{N}X_i'C_i\right)$, the variance of the second term of equation (3.2.3); algebraic details are in Wooldridge (2010, Chapter 4).

Both of these types of standard errors allow no correlation within the sequence $\{X_i'C_i : i = 1, 2, \cdots, N\}$. To relax that assumption, we can go in two directions. One is to divide the $N$ observations into $G$ groups, $W_g = (X_i', \cdots, X_j')'$ and $\hat{v}_g = (\hat{C}_i, \cdots, \hat{C}_j)'$, $g = 1, \cdots, G$; then the group-corrected standard errors of the elements of $\hat{\beta}$ are the square roots of the diagonal terms of

$$\left(\sum_{g=1}^{G}W_g'W_g\right)^{-1}\left(\sum_{g=1}^{G}W_g'\hat{v}_g\hat{v}_g'W_g\right)\left(\sum_{g=1}^{G}W_g'W_g\right)^{-1},$$

which allows arbitrary correlation within each group while assuming independence across groups; that is the idea of the cluster-corrected robust standard errors in Wooldridge (2010, Chapter 20). Wang et al. (2013) also adopt that idea: they split the whole sample into many groups, each containing two observations with arbitrary dependence, and propose a bivariate probit method. Note that when $G = N$, so that each observation is its own group, the cluster-corrected standard errors reduce to the H-W standard errors.

Conley (1999) goes further in this direction: he divides the sequence $\{X_i'C_i : i = 1, 2, \cdots, N\}$ into two groups, with one group lagged relative to the other in two non-opposing directions, and allows arbitrary correlation between the two groups; with different lags we get different pairs of groups and the corresponding correlations between them—that is, the sequence is regrouped as many times as the number of lags. The final estimator of $\mathrm{Var}\left(\sum_{i=1}^{N}X_i'C_i\right)$, denoted $\hat{V}$, is the weighted sum of all the covariances. Let $\hat{C}_i$ be the OLS residual from equation (3.2.1); that is:

$$\hat{V} = \sum_{i=0}^{L_1(N)}\sum_{j=0}^{L_2(N)}W(i, j)\sum_{s=i+1}^{N}\sum_{t=j+1}^{N}\left(X_{s-i}'\hat{C}_{s-i}\hat{C}_{t-j}X_{t-j} + X_{t-j}'\hat{C}_{t-j}\hat{C}_{s-i}X_{s-i}\right) - \sum_{s=1}^{N}\sum_{t=1}^{N}X_s'\hat{C}_s\hat{C}_tX_t \qquad (3.2.6)$$

where the $W(i, j)$ are the weights for covariances at lag $i$ in one direction and lag $j$ in the other; without lags, that is $i = j = 0$, $W(i, j) = 1$. $L_1(N)$ and $L_2(N)$ are the cutoff points in the two directions, and both diverge at the rate $L_1(N) = o(N^{1/3})$ and $L_2(N) = o(N^{1/3})$ as $N \to \infty$. Intuitively, $\hat{V}$ can be understood as a Newey and West (1987) estimator of $\mathrm{Var}\left(\sum_{i=1}^{N}X_i'C_i\right)$ in two non-opposing directions, summed up with proper weights, so it is basically a nonparametric estimator. Hence, Conley's estimator of the variance-covariance matrix of $\hat{\beta}$ is $(X'X)^{-1}\hat{V}(X'X)^{-1}$ (a computational sketch is given below).

As for the nonlinear model, we follow Papke and Wooldridge (1996):

$$Y_i = G(X_i\beta + C_i), \qquad (3.2.7)$$

where $G(\cdot)$ is any continuous function on the real line with range $(0, 1)$; for example, we can have $G(x) = \frac{\exp(x)}{1 + \exp(x)}$ or $G(x) = \Phi(x)$, where $\Phi(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}\exp(-t^2/2)\,dt$. So it is natural to have:

$$G^{-1}(Y_i) = X_i\beta + C_i, \qquad (3.2.8)$$

Equation (3.2.8) is very appealing since its right-hand side has the same linear form as equation (3.2.1). For example, if $G(\cdot)$ takes the logistic form, then equation (3.2.8) becomes:

$$\log\left(\frac{Y_i}{1 - Y_i}\right) = X_i\beta + C_i, \qquad (3.2.9)$$

This is the popular linear model for the log odds ratio.
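As promised, a minimal computational sketch of the Conley variance estimator, assuming NumPy; rather than the lag-sum form of (3.2.6), it loops over pairs of observations and applies the product Bartlett weight within the cutoffs, which is the weighting scheme used later in section 3. All names are illustrative.

```python
import numpy as np

def conley_se(X, resid, coords, L1, L2):
    """Spatial HAC standard errors: Bartlett window in each coordinate direction,
    weight (1 - |dx|/L1)(1 - |dy|/L2) for pairs inside the cutoffs, else zero."""
    N, K = X.shape
    score = X * resid[:, None]                     # rows are X_i' C_i-hat
    V = np.zeros((K, K))
    for i in range(N):
        dx = np.abs(coords[:, 0] - coords[i, 0])
        dy = np.abs(coords[:, 1] - coords[i, 1])
        w = np.where((dx < L1) & (dy < L2),
                     (1 - dx / L1) * (1 - dy / L2), 0.0)
        V += np.outer(score[i], w @ score)         # sum_j w_ij s_i s_j'
    bread = np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(bread @ V @ bread))
```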
As in Papke and Wooldridge (1996), with the usual assumption $E(C_i|X_i) = 0$, we get $E\left[\log\left(\frac{Y_i}{1 - Y_i}\right)\Big|X_i\right] = X_i\beta$. So we can model the transformed response variable instead of the original one. This is the advantage of equation (3.2.7); but the responses must be strictly inside the interval $(0, 1)$, since 0 and 1 explode the transform. Note that the model in (3.2.9) is different from the specification $E(Y_i|X_i, C_i) = \frac{\exp(X_i\beta + C_i)}{1 + \exp(X_i\beta + C_i)}$, in which case the responses can be 0 and 1. Another issue is the average partial effect, which is hard to estimate in most nonlinear models. But from equation (3.2.7) we can get:

$$\frac{\partial Y_i}{\partial x_{ij}} = g(X_i\beta + C_i)\beta_j, \quad \text{where } g(x) = \frac{\partial G(x)}{\partial x}. \qquad (3.2.10)$$

The rest of the paper keeps models (3.2.1) and (3.2.8) and assumes $E(C_i|X_i) = 0$ plus the necessary rank condition, so the OLS estimators of $\beta$ in both models are consistent; we will examine the standard errors of $\hat{\beta}$ in (3.2.2) under different conditional covariance structures, together with the corresponding APE estimates.

3.3 Data

3.3.1 Data Characteristics and Sources

All the data are from the Michigan Department of Education: expenditure data are from Bulletin 1014; enrollment and free and reduced-price lunch data are from the Center for Educational Performance and Information (CEPI); the test results are from the Michigan Educational Assessment Program (MEAP). The data sources are the same as in Papke (2005, 2008) and Papke and Wooldridge (2008), though some program names have changed since we use data for 2010. Papke (2005) uses building-level data, but we use school-district-level data, as in Papke (2008). Beyond the reasons explained in Papke and Wooldridge (2008), we pick school districts for the convenience of the spatial analysis. The dependent variable, math4, is the pass rate on the math test for fourth graders in each school district in 2010, defined as in Papke (2005, 2008). The variable of interest, average expenditure per pupil in each school district, is collected differently than in Papke (2005, 2008); instead, we follow the definition in Papke and Wooldridge (2008) and take the average per pupil real expenditure over the last four years: $avgexp = (exp + exp_{-1} + exp_{-2} + exp_{-3})/4$. Since both papers find significant effects of previous spending on school performance, it is meaningful to collect the expenditure variable this way; the reason we take the last 4 years is that we are considering the test pass rates of fourth graders. Real dollars are computed in year-2010 prices using the Midwest Urban Price Index from the Bureau of Labor Statistics. For comparison, we also use the average per pupil expenditure of 2010 alone. Two other explanatory variables are the same as in the previous papers, except measured in 2010: enroll is student enrollment in each school district in academic year 2009/10, and lunch is the percentage of students eligible for the free or reduced-price lunch program. scdist is the distance in kilometers from each school district to its nearest 2-year or 4-year college. Inspired by Kane and Rouse (1995), we try to investigate the effect of higher education on the K-12 system, which has not been done in previous studies.
Intuitively, schools in a district closer to a college should have a better chance of performing well: it is easier for a fourth grader to find a tutor when college students are around, and students can take advantage of college facilities. On the other hand, most colleges are located in areas of higher socioeconomic status, which has positive externalities for public schools. The summary statistics for these variables are in Table C.1.

There are 551 school districts in the shape file from the Michigan Geographic Data Library. Combined with the MEAP data, only 518 school districts remain because of missing data. With the help of ArcMAP, we plot the map of all the school districts in Figure C.1. As for the colleges in Michigan, we downloaded all 131 Michigan 2-year and 4-year colleges from the National Center for Education Statistics website and plotted them on the map (Figure C.2). Plotting the test pass rates on the map (Figure C.3), we find some areas clustered with high rates and others with low rates; that is, high rates may be congested together, and likewise low rates. That is where the spatial dependence stems from. Adding the college data (Figure C.4), we do find that some colleges are in or close to areas with high math test pass rates, while others are in areas with low rates. How to measure the spatial dependence among the 518 school districts is the main part of the application. Since the shape of each district is irregular, with areas varying from 3.7 km² to 3317.7 km², we should treat them as areal data instead of point data. The next subsection explains how we set up the spatial dependence among the 518 available school districts.

3.3.2 Spatial Dependence Measurement

Since we treat each school district as an irregular lattice, the physical distance between the centroids of any two districts cannot perfectly measure the spatial dependence between them; for some districts, the centroid even lies outside the polygon. So a distance-based spatial weighting matrix, which is perfect for point data, is not really appropriate here; on the other hand, dependence measured only by contiguity ignores the size variation among districts. Keller and Shiue (2007) apply both methods and describe them excellently. Since Conley (1999) provides a measure of spatial dependence that is robust to measurement error (Conley and Molinari, 2007), we adopt his method. First, we get the projected latitude and longitude coordinates of the centroid of each school district³; then we use the smallest distance among the 518 centroids to construct little squares covering the whole state; whenever a centroid falls in a square, the coordinates of the upper-right corner of that square become the new coordinates of that school district.

³ The reason we use projected latitude/longitude instead of the original coordinates is that we want coordinates in a plane rather than on a sphere.

A picture is worth a thousand words: we choose part of the 518 school districts to illustrate the process. As in Figure C.5, we have 96 school districts; we plot the 96 centroids on the map (Figure C.6); then we divide the whole area into 70 × 70 little squares (Figure C.7) with a unit of about 2.5 kilometers, which is roughly the smallest distance among those 96 points.
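In code, the snapping step just described might look like the following NumPy sketch (the function name and cell-size choice are illustrative):

```python
import numpy as np

def conley_coords(x, y, cell):
    """Snap projected centroid coordinates to a square grid of side `cell`;
    the integer grid indices serve as the Conley coordinates."""
    return np.column_stack((np.ceil(x / cell), np.ceil(y / cell)))

# e.g., with `cell` set to the smallest centroid distance (about 2.5 km here):
# coords = conley_coords(cent_x, cent_y, cell=2.5)
```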
Now the new coordinates of each centroid are determined by the little square in which it lies; these are the Conley coordinates of each school district, shown in Figure C.8. From this we can see one advantage of the measurement: as long as a centroid is in a given square, it gets the same coordinates no matter where within the square it sits; this absorbs the damage from measurement error, which the traditional approaches cannot. Thereafter, we define the spatial weighting matrix $W = (w_{ij})$ this way:

$$w_{ij} = \begin{cases}\left(1 - \frac{|i|}{L_1}\right)\left(1 - \frac{|j|}{L_2}\right), & |i| < L_1,\ |j| < L_2\\ 0, & \text{otherwise}\end{cases}$$

We can see that the weight function is a Bartlett window in each direction. The choice of $L_1$ and $L_2$ is based on theory and practice. In theory, they should not exceed the cube root of the width of the sample region; among the 518 school districts, the sample region is about 1000 units horizontally and vertically, so we could take about 10, i.e., about 10 kilometers, since each coordinate unit is about one kilometer. In practice, however, we expect to need a longer distance; we explain more in the analysis. For coordinates beyond $L_1$ and $L_2$, we no longer consider the dependence; of course, we can set $L_1$ and $L_2$ very large so that all the sample districts are included. Missing data will certainly not cause big problems for a matrix defined this way, while a 0-1 measure of spatial dependence based on contiguity will: some elements would be one were data for all 551 districts available. This is another advantage over the conventional method. With this spatial weighting matrix, we use the Conley (1999) estimator described in section 2 to account for dependence among school districts.

3.4 Linear Model

As in Papke (2005, 2008), we run the following regressions:

$$math4_i = \beta_0 + \beta_1\log(avgexp)_i + \beta_2\log(enroll)_i + \beta_3 lunch_i + \beta_4\log(scdist)_i + C_i, \qquad (4.1a)$$

$$math4_i = \beta_0 + \beta_1\log(exp10)_i + \beta_2\log(enroll)_i + \beta_3 lunch_i + \beta_4\log(scdist)_i + C_i, \qquad (4.1b)$$

The only difference between the two models [equations (4.1a, 4.1b)] is the first explanatory variable: we replace avgexp in the first regression with exp10. From the data description, we know that some differences do exist between them, and we wonder how those differences matter for the test rate. The reason we take the logarithm of the explanatory variables avgexp and scdist is that their distributions are skewed, as is easily seen from the histograms (Figure C.9); and Manning (1998) argues that the logarithmic transformation is a good way to model such variables. The same analysis applies to enroll.

3.4.1 Ordinary Least Squares (OLS)

Here we first assume that the usual linear regression assumptions, as in Wooldridge (2010, Chapter 4), hold; the OLS results are in the first two columns of Table C.2. Note that the usual standard errors are calculated under the homoskedasticity assumption, along with the other assumptions needed for consistency. As we all know, the magnitudes of standard errors are crucial for inference: they decide whether the parameter estimates are significant at a given significance level.
From Table C.2, we can see that all the explanatory variables except scdist are significant at the 5% level: the effects of student expenditure and enrollment on school performance are positive and that of the lunch program is negative; interestingly, the effect of a school district's distance to its nearest college is positive, meaning the longer the better, which goes against the usual intuition. Fortunately, that effect is not significant at the 5% or even the 10% level. We see the same pattern in the second regression, and the effect of the last four years' expenditure on school performance is slightly bigger than that of year 2010 alone. However, if we abandon the homoskedasticity assumption and allow an arbitrary functional form for the conditional variance of the error term $C_i$ given the explanatory variables, we get the White standard errors, and the positive effect of student enrollment is no longer significant at the 5% level; the same holds for the second regression. Further, if we compute robust standard errors clustered by intermediate school district (ISD), even the positive effect of the year-2010 expenditure is not significant at the 5% level. Considering Michigan's geographical features, we can also divide the whole sample into two clusters, the Upper Peninsula and the Lower; if we correct for the peninsula clusters, neither the last four years' average expenditure nor the year-2010 expenditure has a significant positive effect on school performance. Now we see how important it is to have a reasonable method of calculating standard errors.

Next, we calculate the Conley (1999) standard errors. Based on the descriptions in sections 2 and 3, we have coordinates for each of the 518 school district centroids; the key step is how to choose the cutoff points. Since the coordinates range from 0 to 758 horizontally and from 0 to 880 vertically, theory suggests cutoff points no bigger than about 10, though we think the number should be larger; we design two schemes to see how the significance level changes with the corresponding cutoff points. First, we take 11 pairs of cutoff points, increasing by 50 or 100 at each step; the results are in Tables C.3 and C.4. The overall trend is that the standard-error magnitudes decrease as the window size increases, but there are local fluctuations: for example, in Table C.3, the Conley standard errors for the exp10 coefficient increase over the first three pairs of cutoff points. At the 5% level, the estimators of $\beta_1$ and $\beta_3$ are statistically significant under all the cutoff points; but the estimator of $\beta_2$ is not significant at the 5% level if any of the first three pairs is chosen. In Table C.4, a similar pattern holds for the regression with exp10 replaced by avgexp, with the corresponding errors slightly inflated. Second, we take 7 pairs of points, namely the 5%, 15%, 25%, 50%, 75%, 95% and 100% percentiles of the coordinates; the results are in Tables C.5 and C.6. Note that each pair covers that share of the 518 sample districts in the weight calculation. Surprisingly, a similar story emerges as under the first scheme: for $\beta_1$ and $\beta_3$ at the 5% level, there is no difference in significance between giving positive weights to only 5% of the sample and including the whole sample; for $\beta_2$ we do see a difference: the larger the share of the sample receiving positive weights, the more significant the estimate.
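Computationally, the cutoff exploration reduces to sweeping the window size in the earlier `conley_se` sketch, e.g.:

```python
# hypothetical sweep over cutoff pairs, reusing conley_se from section 2
for L in (50, 100, 200, 400, 800):
    se = conley_se(X, resid, coords, L1=L, L2=L)
    print(f"cutoff {L}: SEs {np.round(se, 4)}")
```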
Considering that each unit of the coordinates equals about 2 kilometers, it is not that hard to make a reasonable choice. There is a catch here, though: if the decreasing trend of the Conley standard errors persisted, we could make them as small as we liked; for example, with a pair of cutoffs of $10^6$, even $\beta_4$ becomes statistically significant at the 5% level, but that number makes no sense considering that none of the coordinates exceeds 900.

3.4.2 Generalized Least Squares (GLS)

The matrix form of equation (3.2.1) [or equations (4.1a, 4.1b)] is:

$$Y = X\beta + C \qquad (3.4.1)$$

where $Y$ and $X$ are as in equation (3.2.2) and $C = [C_1, C_2, \cdots, C_N]'$. Suppose we assume a spatial autoregressive (SAR) structure for the error term $C$:

$$C = \rho WC + \varepsilon, \qquad \varepsilon \sim N(0, I_N\sigma^2) \qquad (3.4.2)$$

where $\rho$ is the unknown spatial correlation parameter and $W$ is the weighting matrix; that is a conventional assumption in the spatial econometrics literature. Based on the theoretical results for the maximum likelihood estimator (MLE) and QMLE in Lee (2004), STATA has the built-in command "spreg ml" to implement MLE or QMLE given a weighting matrix. The first interesting result is that the estimates of the covariate parameters $\beta$ are close across specifications, while the estimates of the spatial correlation parameter $\rho$ differ by weighting matrix. From Table C.17 and Table C.18, we can see that the magnitude of the $\rho$ estimate under the contiguity weighting matrix is only one third of that under the inverse-distance weighting matrix; investigating further, we find that the contiguity weighting matrix is much smaller than the inverse-distance weighting matrix element by element; the usual summary statistics of the correlation matrix of the error terms are in Table C.19 and Table C.20.

One strong assumption of MLE or QMLE for the SAR model is homoskedasticity; without it, MLE (or QMLE) is not consistent. That result has been explored thoroughly in the literature, e.g., Arraiz et al. (2010), Kelejian and Prucha (2010) and Lin and Lee (2010), who all propose new methods to handle heteroskedasticity. On the other hand, the SAR structure itself might be misspecified. Here we combine the idea of the SAR structure with Conley standard errors, and a new form of standard errors is introduced. From equation (3.4.2):

$$C = (I_N - \rho W)^{-1}\varepsilon$$

Plugging into equation (3.4.1):

$$Y = X\beta + (I_N - \rho W)^{-1}\varepsilon \qquad (3.4.3)$$

With homoskedasticity, GLS in equation (3.4.3) is the same as MLE with $\varepsilon \sim N(0, I_N\sigma^2)$ (or QMLE without the normality assumption); that is also what "spreg ml" reports in STATA. Moreover, GLS is also equivalent to OLS in the following model:

$$(I_N - \rho W)Y = (I_N - \rho W)X\beta + \varepsilon \qquad (3.4.4)$$

Since $\rho$ is unknown, in the spirit of quasi-GLS we can replace $\rho$ with its GLS estimator $\hat{\rho}$ in equation (3.4.3). Then we apply Conley's method to the following model:

$$Y^* = X^*\beta + \varepsilon \qquad (3.4.5)$$

where $Y^* = (I_N - \hat{\rho}W)Y$ and $X^* = (I_N - \hat{\rho}W)X$. Note that OLS in equation (3.4.5) with Conley standard errors has at least two advantages: first, it allows heteroskedasticity through H-W standard errors and is robust to failures of the condition in equation (3.2.5); second, it is robust to misspecification of the SAR structure. Even if the SAR structure is wrong, this transformation merely adds one more possible structure for the error terms, and we know quasi-GLS is still consistent as long as $E(\varepsilon|X) = 0$, which means strictly exogenous covariates.
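The quasi-GLS transformation of (3.4.4)–(3.4.5) is one matrix multiplication; a minimal sketch, assuming NumPy and a previously estimated $\hat{\rho}$ and weighting matrix $W$:

```python
import numpy as np

def sar_quasi_gls(Y, X, W, rho_hat):
    """Pre-multiply by (I - rho_hat W); Conley standard errors are then
    computed from the OLS residuals of the transformed regression (3.4.5)."""
    A = np.eye(len(Y)) - rho_hat * W
    return A @ Y, A @ X

# Ystar, Xstar = sar_quasi_gls(Y, X, W, rho_hat)
# beta = np.linalg.lstsq(Xstar, Ystar, rcond=None)[0]
# se = conley_se(Xstar, Ystar - Xstar @ beta, coords, L1, L2)
```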
From the results in Table C.13, heteroskedasticity does make a difference for the magnitudes of the standard errors of expenditure and enrollment; the Conley standard errors show patterns similar to the OLS case. Since the efficiency gain from Conley's method only appears for high cutoff points, where the correlation is weak, we conclude that the SAR structure is not a good choice for the error term.

3.5 Nonlinear Model

3.5.1 Regression of Log Odds Ratio

We follow the model in equation (3.2.7) with the following form:

$$Y_i = \frac{\exp(X_i\alpha + e_i)}{1 + \exp(X_i\alpha + e_i)}, \qquad (3.5.1)$$

One advantage of equation (3.5.1) over equation (3.2.1) is that the logistic function on its right-hand side lies strictly in the range $(0, 1)$, which keeps predictions in the interval; however, this comes at the expense of more difficult estimation of the parameters and the causal effect. Fortunately, equation (3.5.1) can easily be transformed as follows:

$$\log\left(\frac{math4_i}{1 - math4_i}\right) = \alpha_0 + \alpha_1\log(avgexp)_i + \alpha_2\log(enroll)_i + \alpha_3 lunch_i + \alpha_4\log(scdist)_i + e_i, \qquad (5.1a)$$

$$\log\left(\frac{math4_i}{1 - math4_i}\right) = \alpha_0 + \alpha_1\log(exp10)_i + \alpha_2\log(enroll)_i + \alpha_3 lunch_i + \alpha_4\log(scdist)_i + e_i, \qquad (5.1b)$$

The left-hand sides of equations (5.1a, b) are the logarithm of the relative odds of passing the MEAP math test in each school district, popularly called the "log odds ratio" in the literature. The right-hand sides are exactly the same as in equations (4.1a, b). To avoid abusing notation, we use $\alpha$ for the parameters of interest; more importantly, the interpretations differ: the $\beta$ in equations (4.1a, b) are the partial effects of the covariates on the pass rate, while the $\alpha$'s in equations (5.1a, b) are partial effects of the covariates on the log ratio of each district's pass rate to its failure rate. Of course, we can recover the first kind of APE; that is done in the next subsection.

First, we regress the log odds ratio under four types of assumptions on the conditional covariance of $e_i$ given the regressors; the estimates are in the first column of Table C.7. As in the previous section, we consider two expenditure variables, avgexp and exp10. The signs of the estimates are as expected: the effects of expenditure and enrollment on the odds ratio are positive; although lscdist shows some positive effect, it is not statistically significant even at the 10% level, let alone 5%. We recall that those results also appeared in the level pass rate regressions of the previous section. However, the significance picture is different here. Under the i.i.d. assumption, increasing avgexp by 1% increases the odds ratio by about 5% (about 4% for exp10), and the effects are statistically significant at the 5% level. If we correct for arbitrary heteroskedasticity or correlation among the $e_i$'s, the significance claim no longer holds; nor does it under ISD or peninsula clustering. That motivates us to investigate the assumptions on the conditional covariance structure further. By contrast, the effect of enrollment on the pass odds ratio is robust to these assumptions: if we increase enrollment by 5%, the odds ratio increases by 9% with either avgexp or exp10 controlled for, and the effect is statistically significant at the 5% level. The same pattern holds for lunch: every 1 percentage point increase in the share of a district's students on free or reduced-price lunch reduces the odds ratio by almost 3%, and the effect is statistically significant at the 5% level.
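Computationally, equations (5.1a, b) are just OLS on the transformed response; a minimal sketch, assuming the data vectors are already assembled:

```python
import numpy as np

# log odds ratio of the pass rate; requires 0 < math4 < 1 for every district
lor = np.log(math4 / (1.0 - math4))
X = np.column_stack((np.ones_like(lor), np.log(avgexp), np.log(enroll),
                     lunch, np.log(scdist)))
alpha_hat = np.linalg.lstsq(X, lor, rcond=None)[0]
resid = lor - X @ alpha_hat          # reused below for the APE estimator
```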
Then we consider the Conley (1999) standard errors. As in the level pass rate case, it is important to find an appropriate window size for the weights, and we apply the same schemes. For the first 11 pairs of cutoff points, with results in Tables C.8 and C.9, the global trend of the standard-error magnitude for each variable is decreasing in the cutoff points; for some variables, like log(exp10), there is some local distortion, while for others, like lunch and log(enroll), the decrease is strict. At the 5% level, the statistical significance of the OLS estimators for lunch and log(enroll) is robust to the choice of cutoff points: all of them are significant; the estimator for log(scdist), by contrast, is never significant at the 5% level, even with a window size of 1000. So we are very confident in saying that enroll has a positive effect, and lunch a negative effect, on the odds ratio of the math test pass rate at the 5% significance level, and that scdist is not a significant factor in explaining the ratio. As for log(exp10), its effect becomes statistically significant at the 5% level only when we set the window size to 1000; a similar story holds when log(exp10) is replaced by log(avgexp), whose positive effect becomes significant at cutoff points of 600 to 700 and above. When the percentiles are used for the cutoff points, the results in Tables C.10 and C.11 tell the same story for lunch, log(enroll) and log(scdist): the statistical significance of their effects on the odds ratio at the 5% level is the same as under the first scheme of cutoff points. For log(exp10), the effect is not statistically significant at the 5% level even when all the points are included in the weights correcting for spatial dependence; the case of log(avgexp) is slightly different: when 95% or more of the sample is included in the weights, its effect is significant. The same comment applies as for the level test pass rates: as long as we make the cutoff size large enough, all the variables become statistically significant factors, although such a window magnitude makes no practical sense. In a word, if we try to explain the odds ratio of the test instead of the rates themselves, only two factors are statistically significant at the 5% level, lunch and log(enroll): the former keeps its significance from the level case, while the latter replaces the expenditure variables that were significant in the level pass rate case.

3.5.2 Estimating the APE for Level Rates

As we all know, economists are interested in what a model suggests for policy as well as in estimating the parameters of the model itself. Once we turn to the estimation of partial effects, we notice a disadvantage of the nonlinear model in equation (3.5.1): we cannot read the APE off the coefficients as smoothly as in the linear case, where the parameter estimates speak for themselves. There is also another issue, the difference between the APE and the partial effect at the average; refer to Wooldridge (2005, 2010) for a more detailed discussion. In this paper, we define the APE with respect to $x_j$ evaluated at $X$ as follows:

$$APE_j(X) = \alpha_j\int_{-\infty}^{+\infty}\frac{\exp(X\alpha + e)}{(1 + \exp(X\alpha + e))^2}f(e)\,de, \qquad (3.5.2)$$

Note that equation (3.5.2) follows the idea of the average structural function (ASF) in Blundell and Powell (2003); one advantage of the ASF is that it can deliver the partial effect at an individual value of $X$, not just on average.
Referring to equations (5.1a, b), $X\alpha = \alpha_0 + \alpha_1\log(avgexp) + \alpha_2\log(enroll) + \alpha_3 lunch + \alpha_4\log(scdist)$; for example, if we want to know the expenditure effect on school performance, then the APE with respect to $\log(avgexp)$ is of interest: by equation (3.5.2), $APE_1(X) = \alpha_1\int_{-\infty}^{+\infty}\frac{\exp(X\alpha + e)}{(1 + \exp(X\alpha + e))^2}f(e)\,de$. The notion here is the partial effect defined in equation (3.2.10) with the heterogeneity averaged out over its whole population. Letting $\hat{\alpha}$ be the parameter estimator and $\hat{e}_i$ the residuals from the log odds ratio OLS regression, it is straightforward to estimate the APE this way:

$$\widehat{APE}_j(X) = \hat{\alpha}_j\frac{1}{N}\sum_{i=1}^{N}\frac{\exp(X\hat{\alpha} + \hat{e}_i)}{(1 + \exp(X\hat{\alpha} + \hat{e}_i))^2}, \qquad (3.5.3)$$

The idea behind this method is the method of moments; it can also be understood as Duan's (1983) smearing estimator. Note that we do not need to assume independence of $X_i$ and $e_i$, which is an attractive feature. The asymptotic variance of $\widehat{APE}_j(X)$ can be obtained by the delta method, conveniently implemented using the method of moments approach in Newey and McFadden (1994); bootstrap methods can also be readily applied. Continuing the avgexp example, we estimate the APE at the mean value of all $X$ as .1151 with a standard error of .0663. The results for the other explanatory variables, evaluated at the mean and at the 25%, 50%, 75% and 95% percentiles, are in Table C.12. We can see that the magnitudes of the APEs do depend on the values at which $X$ is evaluated. Take the avgexp example: its APE decreases at higher percentiles of $X$, and the difference in APE between the 95% and 25% percentiles is about 2%: average expenditure in the lower-spending school districts kicks in at a higher rate than in the higher-spending ones. This coincides with the results in Papke (2008), who divides the sample into two groups at the median of average expenditure; thanks to the nonlinear model, we do not need to split the data, and we can investigate the question in an even finer setting: technically, we can get the APE for any value of $X$, which is definitely attractive to practitioners. We find similar trends for the APEs of the lunch program and enrollment, albeit with smaller changes. As for inference, the APEs of enrollment and lunch are statistically significant at the 5% level when avgexp controls for expenditure, avgexp itself being significant at the 10% level; when the 2010 expenditure is used instead, the significance story for enrollment and lunch is the same, while the expenditure is no longer statistically significant even at the 10% level. This is another difference from the linear model: if there is a temporally lagged effect of expenditure, the linear model cannot catch it while the nonlinear one does. To dig into the question deeper, we compare the estimates from the linear and nonlinear models in Figures C.10-C.12. First, for average expenditure, the linear estimate is about 4 ∼ 6% higher than the nonlinear ones, with the gap wider for the higher-spending school districts. But for enrollment, the trend is reversed: the effect of enrollment estimated in the nonlinear model is higher than in the linear one, although the difference is about half a percentage point and the gap shrinks as more students are enrolled. The positive effect of enrollment coincides with the "peer effect" in Lin (2010). And for the free lunch program, its estimated effects on school performance in the linear and nonlinear models are almost the same.
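A minimal sketch of the smearing estimator (3.5.3), reusing `alpha_hat` and `resid` from the earlier snippet; it exploits the identity $\exp(z)/(1 + \exp(z))^2 = p(1 - p)$ with $p$ the logistic function:

```python
import numpy as np

def ape_smearing(alpha_hat, resid, x, j):
    """Duan-type smearing estimator of APE_j(x), equation (3.5.3): average the
    logistic density term over the empirical distribution of the residuals."""
    p = 1.0 / (1.0 + np.exp(-(x @ alpha_hat + resid)))   # logistic at x*alpha + e_i
    return alpha_hat[j] * np.mean(p * (1.0 - p))
```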
3.6 Conclusions

We have investigated the effects of Proposal A in Michigan on school performance in a broader setting than the previous literature, allowing for a nonlinear model and for the interaction between K-12 public schools and colleges. To control for spatial dependence among the school districts, we adopt the method in Conley (1999) and explore the effect of different cutoff points. Since statistical significance depends on the standard errors, we find that the OLS estimates in the linear model for level test rates are statistically significant at the 5% level given a reasonable window size; the picture is different for the nonlinear model, which is linear in the log odds ratio of the level test rates: even for very large cutoff points, the spending effect is no longer statistically significant at the 5% level. Moreover, after transforming back from the log odds ratio to level rates, both the magnitudes and the statistical significance of the APEs change; one advantage of the nonlinear model is that it can capture how the effect varies across specific parts of the population. Also, the difference between some of the linear and nonlinear estimates is not negligible, which raises questions about the specification of the regression functional form. As for future work, the interaction between public schools and charter schools, as in Imberman (2011), is an interesting extension; making use of panel data is also a promising direction.

APPENDICES

Appendix A

PROOF OF THEOREM 1:

Note that here we prove the 2SLS case; OLS is a special case. Abrevaya (2002) mentions how the delta method can be used for the Duan estimator, but considers the case where $U$ and $X$ are independent, which is not necessarily true in our case. Let $\theta = (\beta', \eta)'$ and $\hat\theta = (\hat\beta', \hat\eta)'$. Then

$$\sqrt{N}(\hat\theta - \theta) = \begin{pmatrix} \sqrt{N}(\hat\beta - \beta) \\ \sqrt{N}(\hat\eta - \eta) \end{pmatrix} = \begin{pmatrix} N^{-1/2}\sum_{i=1}^N \mathbf{A} Z_i U_i \\ N^{-1/2}\sum_{i=1}^N \big(P'S_i + \exp(U_i) - \eta\big) \end{pmatrix} + o_p(1) \equiv N^{-1/2}\sum_{i=1}^N \begin{pmatrix} S_i \\ Q_i \end{pmatrix} + o_p(1),$$

where $\mathbf{A} \equiv (C'D^{-1}C)^{-1}C'D^{-1}$, $C \equiv E(Z_i X_i')$, $D \equiv E(Z_i Z_i')$, $P \equiv E(X_i \exp(U_i))$, $S_i \equiv \mathbf{A} Z_i U_i$ and $Q_i \equiv P'S_i + \exp(U_i) - \eta$. The central limit theorem then finishes the proof. The argument is similar to the solution of Problem 12.17 in Wooldridge (2010); for details, refer to Wooldridge (2011).

PROOF OF LEMMA 1:

From (1.4.3), and by monotonicity of the logarithmic function,

$$\exp(-CM^{-s} + G^M(x)\pi_M) - \exp(G^M(x)\pi_M) < r(x) - \exp(G^M(x)\pi_M) < \exp(CM^{-s} + G^M(x)\pi_M) - \exp(G^M(x)\pi_M). \qquad (A.1)$$

By the mean value theorem applied to the lower and upper bounds,

$$\exp(-CM^{-s} + G^M(x)\pi_M) - \exp(G^M(x)\pi_M) = -CM^{-s}\exp(\xi_1), \quad \xi_1 \in [-CM^{-s} + G^M(x)\pi_M,\; G^M(x)\pi_M],$$
$$\exp(CM^{-s} + G^M(x)\pi_M) - \exp(G^M(x)\pi_M) = CM^{-s}\exp(\xi_2), \quad \xi_2 \in [G^M(x)\pi_M,\; CM^{-s} + G^M(x)\pi_M].$$

So, for the $\pi_M$ that satisfies (A.1), we have

$$\sup_{x\in\Xi}\big|r(x) - \exp(G^M(x)\pi_M)\big| < CM^{-s},$$

and therefore $E\big[r(X_i) - \exp(G^M(X_i)\pi_M)\big]^2 \le CM^{-2s}$. Note that

$$E\Big[\frac{Y_i}{\exp(X_i\beta)} - \exp(G^M(X_i)\pi)\Big]^2 = E\,\mathrm{Var}\Big(\frac{Y_i}{\exp(X_i\beta)}\,\Big|\,X_i\Big) + E\big[r(X_i) - \exp(G^M(X_i)\pi)\big]^2.$$

Considering equation (1.4.5), the first term on the right is constant with respect to $\pi$, so we get

$$\pi^*_M = \arg\min_\pi E\big[r(X_i) - \exp(G^M(X_i)\pi)\big]^2.$$

So $E\big[r(X_i) - \exp(G^M(X_i)\pi^*_M)\big]^2 \le E\big[r(X_i) - \exp(G^M(X_i)\pi_M)\big]^2 \le CM^{-2s}$, and hence $r(x) - \exp(G^M(x)\pi^*_M) = O(CM^{-s})$.

PROOF OF LEMMA 2: This proof draws heavily on the proof of Lemma 2 in Hirano et al. (2003). In the sequel we write $M$ for $M(N)$. By the definition of $G^M(x)$,

$$S_M = \frac{1}{N}\sum_{i=1}^N G^M(X_i)G^M(X_i)'$$

has expectation equal to $I_M$. By Newey (1997), it satisfies $\|S_M - I_M\| = O_p\big(\zeta(M)\sqrt{M/N}\big)$, which converges to zero in probability by condition (5).
Hence the probability that the smallest eigenvalue of $S_M$ is larger than $1/2$ goes to one. Let

$$L_N(\pi) = -\sum_{i=1}^N\Big[\frac{Y_i}{\exp(X_i\beta)} - \exp(G^M(X_i)\pi)\Big]^2.$$

Next, we will show that

$$\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M) = O_p\big(\sqrt{M/N}\big). \qquad (A.2)$$

Consider

$$E\Big\|\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)\Big\|^2 = \frac{1}{N}\,\mathrm{tr}\,E\Big[\Big(\frac{Y_i}{\exp(X_i\beta)} - \exp(G^M(X_i)\pi^*_M)\Big)^2 \exp(2G^M(X_i)\pi^*_M)\,G^M(X_i)'G^M(X_i)\Big]$$
$$= \frac{1}{N}\,\mathrm{tr}\,E\Big[\Big(\mathrm{Var}\Big(\frac{Y_i}{\exp(X_i\beta)}\,\Big|\,X_i\Big) + \big(r(X_i) - \exp(G^M(X_i)\pi^*_M)\big)^2\Big)\exp(2G^M(X_i)\pi^*_M)\,G^M(X_i)'G^M(X_i)\Big] + o(1)$$
$$\le \frac{C}{N}\,\mathrm{tr}\,E\big[G^M(X_i)'G^M(X_i)\big] \le \frac{CM}{N},$$

and the Markov inequality implies (A.2). Next, let

$$\eta = \inf_{x\in\Xi,\,M}\big[2\exp(G^M(x)\pi^*_M) - r(x)\big]\exp(2G^M(x)\pi^*_M),$$

which by the assumptions and Lemma 1 is positive. For any $\varepsilon > 0$, choose $C$ such that for $N$ large enough

$$P\Big(\Big\|\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)\Big\| < \eta C\sqrt{M/N}\Big) \ge 1 - \frac{\varepsilon}{2}. \qquad (A.3)$$

Note that

$$\sup_{x\in\Xi,\,\|\pi-\pi^*\|<\eta C\sqrt{M/N}}\big|\exp(2G^M(x)\pi) - \exp(2G^M(x)\pi^*_M)\big| \le \sup_{x\in\Xi,\,\|\pi-\pi^*\|<\eta C\sqrt{M/N}}\big|CG^M(x)(\pi - \pi^*)\big| \le \zeta(M)\,C\sqrt{M/N},$$

which goes to zero, so that for large enough $N$

$$\inf_{x\in\Xi,\,\|\pi-\pi^*\|<\eta C\sqrt{M/N}}\big[2\exp(G^M(x)\pi) - r(x)\big]\exp(2G^M(x)\pi) \ge 4\eta.$$

Choose $N$ large enough so that this inequality holds, that (A.3) holds with probability at least $1-\varepsilon/2$, and that the probability that the smallest eigenvalue of $S_M$ is larger than $1/2$ is at least $1-\varepsilon/2$. Then the probability that both of these hold is at least $1-\varepsilon$, and for every $\pi$ with $\|\pi - \pi^*\| = \sqrt{M/N}$ a second-order expansion gives

$$\frac{1}{N}L_N(\pi) = \frac{1}{N}L_N(\pi^*_M) + \frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)(\pi - \pi^*) + \frac{1}{2N}(\pi - \pi^*)'\frac{\partial^2 L_N}{\partial\pi\partial\pi'}(\bar\pi)(\pi - \pi^*), \qquad (A.4)$$

where $\|\bar\pi - \pi^*\| \le \|\pi - \pi^*\| = \sqrt{M/N}$. We have

$$\frac{1}{2N}\frac{\partial^2 L_N}{\partial\pi\partial\pi'}(\bar\pi) = -\frac{1}{N}\sum_{i=1}^N\Big[2\exp(G^M(X_i)\bar\pi) - \frac{Y_i}{\exp(X_i\beta)}\Big]\exp(G^M(X_i)\bar\pi)\,G^M(X_i)'G^M(X_i)$$
$$= -\frac{1}{N}E\Big[\big(2\exp(G^M(X_i)\bar\pi) - r(X_i)\big)\exp(G^M(X_i)\bar\pi)\,G^M(X_i)'G^M(X_i)\Big] + o_p(1) \le -2\eta\,S_M + o_p(1),$$

with its eigenvalues bounded away from zero in absolute value by $\eta$. Then, rearranging (A.4) and using the triangle inequality, with probability greater than $1-\varepsilon$, for $\|\pi - \pi^*\| = \sqrt{M/N}$,

$$\frac{1}{N}L_N(\pi) - \frac{1}{N}L_N(\pi^*_M) \le \frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)(\pi - \pi^*) - \eta\|\pi - \pi^*\|^2 + o_p(1)$$
$$\le \Big\|\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)\Big\|\,\|\pi - \pi^*\| - \eta\|\pi - \pi^*\|^2 + o_p(1) = \Big(\Big\|\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)\Big\| - \eta\sqrt{M/N}\Big)\|\pi - \pi^*\| + o_p(1) < 0.$$

That is, with probability greater than $1-\varepsilon$, $\frac{1}{N}L_N(\pi) < \frac{1}{N}L_N(\pi^*_M)$ for all $\pi$ with $\|\pi - \pi^*\| = \sqrt{M/N}$. Since $L_N(\pi)$ is continuous, it has a maximum on the compact set $\{\pi : \|\pi - \pi^*\| \le \sqrt{M/N}\}$. By the last inequality, this maximum must occur at some $\hat\pi_M$ with $\|\hat\pi_M - \pi^*\| < \sqrt{M/N}$. Hence the first-order conditions are satisfied at $\hat\pi_M$, and by concavity of $L_N(\pi)$, $\hat\pi_M$ maximizes $L_N(\pi)$ over all of $\mathcal{G}_M$. Because the probability of this is greater than $1-\varepsilon$ with $\varepsilon$ arbitrary, we conclude that $\hat\pi_M$ exists and satisfies the first-order conditions with probability approaching one, and that $\hat\pi_{M(N)} - \pi^*_{M(N)} = O_p\big(\sqrt{M(N)/N}\big)$.
PROOF OF LEMMA 3:

$$\sqrt{N}\,V_M^{-1/2}\big(\hat r(x) - r(x)\big) = \sqrt{N}\,V_M^{-1/2}\big(\hat r(x) - \exp(G^M(x)\pi^*_M)\big) + \sqrt{N}\,V_M^{-1/2}\big(\exp(G^M(x)\pi^*_M) - r(x)\big) \equiv T_1 + T_2. \qquad (A.5)$$

By the mean value theorem,

$$T_1 = V_M^{-1/2}\sqrt{N}\big(\exp(G^M(x)\hat\pi_M) - \exp(G^M(x)\pi^*_M)\big) = V_M^{-1/2}\,G^M(x)\exp(G^M(x)\check\pi_M)\,\sqrt{N}(\hat\pi_M - \pi^*_M).$$

From equation (1.4.4), we know

$$\frac{1}{N}\sum_{i=1}^N\Big[\frac{Y_i}{\exp(X_i\hat\beta)} - \exp(G^M(X_i)\hat\pi_M)\Big]G^M(X_i)'\exp(G^M(X_i)\hat\pi_M) = 0. \qquad (A.6)$$

A Taylor expansion of the left-hand side of (A.6) around $\pi^*_M$, and solving for $\sqrt{N}(\hat\pi_M - \pi^*_M)$, gives

$$\sqrt{N}(\hat\pi_M - \pi^*_M) = \{\mathbf{E} - \mathbf{F}\}^{-1}\{\mathbf{G} + \mathbf{H} + \mathbf{J}\},$$

where

$$\mathbf{E} = \frac{1}{N}\sum_{i=1}^N \exp(2G^M(X_i)\tilde\pi_M)\,G^M(X_i)'G^M(X_i), \qquad \mathbf{F} = \frac{1}{N}\sum_{i=1}^N\Big[\frac{Y_i}{\exp(X_i\hat\beta)} - \exp(G^M(X_i)\tilde\pi_M)\Big]\exp(G^M(X_i)\tilde\pi_M)\,G^M(X_i)'G^M(X_i),$$
$$\mathbf{G} = \frac{1}{\sqrt N}\sum_{i=1}^N\Big[\frac{Y_i}{\exp(X_i\hat\beta)} - \frac{Y_i}{\exp(X_i\beta)}\Big]G^M(X_i)'\exp(G^M(X_i)\pi^*_M), \qquad \mathbf{H} = \frac{1}{\sqrt N}\sum_{i=1}^N\Big[\frac{Y_i}{\exp(X_i\beta)} - r(X_i)\Big]G^M(X_i)'\exp(G^M(X_i)\pi^*_M),$$
$$\mathbf{J} = \frac{1}{\sqrt N}\sum_{i=1}^N\big[r(X_i) - \exp(G^M(X_i)\pi^*_M)\big]G^M(X_i)'\exp(G^M(X_i)\pi^*_M).$$

So, from (A.6),

$$T_1 = V_M^{-1/2}\,G^M(x)\exp(G^M(x)\check\pi_M)\,\{\mathbf{E} - \mathbf{F}\}^{-1}\{\mathbf{G} + \mathbf{H} + \mathbf{J}\}.$$

Let

$$\Sigma_M \equiv E\Big[G^M(X_i)'G^M(X_i)\Big(\frac{Y_i}{\exp(X_i\beta)} - r(X_i)\Big)^2\exp(2G^M(X_i)\pi^*_M)\Big], \qquad Q_M \equiv E\big[G^M(X_i)'G^M(X_i)\exp(2G^M(X_i)\pi^*_M)\big],$$
$$V_M(x) \equiv G^M(x)\,Q_M^{-1}\Sigma_M Q_M^{-1}\,G^M(x)'\exp(2G^M(x)\pi^*_M).$$

Note that $\|\mathbf{E}\| \le O_p(M^2)$, $\|\mathbf{F}\| \le O_p(M^{2-s})$, $\|\mathbf{G}\| \le O_p(M)$ and $\|\mathbf{J}\| \le O_p(N^{1/2}M^{1-s})$. Since $s > 2$, $\mathbf{F} \to_p 0$ as $M \to \infty$, so $\{\mathbf{E} - \mathbf{F}\}^{-1}$ is asymptotically equivalent to $\mathbf{E}^{-1}$, while

$$\|\mathbf{E}^{-1}\mathbf{G}\| \le \|\mathbf{E}^{-1}\|\,\|\mathbf{G}\| \le O_p(M^{-2})O_p(M) = O_p(M^{-1}), \qquad \|\mathbf{E}^{-1}\mathbf{J}\| \le \|\mathbf{E}^{-1}\|\,\|\mathbf{J}\| \le O_p(M^{-2})O_p(N^{1/2}M^{-s}) = O_p(N^{1/2}M^{-(s+2)}).$$

Here we assume $N^{1/2}M^{-(s+1)} \to 0$ as $N \to \infty$; so

$$T_1 = V_M^{-1/2}\,G^M(x)\exp(G^M(x)\bar\pi_M)\,\mathbf{E}^{-1}\mathbf{H} + o_p(1) = V_M^{-1/2}\,G^M(x)\exp(G^M(x)\bar\pi_M)\,Q_M^{-1}\Big\{\frac{1}{\sqrt N}\sum_{i=1}^N\Big[\frac{Y_i}{\exp(X_i\beta)} - r(X_i)\Big]G^M(X_i)'\exp(G^M(X_i)\pi^*_M)\Big\} + o_p(1).$$

Next, let $\epsilon = [\epsilon_1, \cdots, \epsilon_N]'$ with $\epsilon_i = Y_i/\exp(X_i\beta) - r(X_i)$, let

$$O(\pi_M) = \frac{1}{\sqrt N}\big[\exp(G^M(X_1)\pi_M)G^M(X_1)', \ldots, \exp(G^M(X_N)\pi_M)G^M(X_N)'\big],$$

and define

$$Z_{iN} = V_M^{-1/2}\,G^M(x)\exp(G^M(x)\bar\pi_M)\,\{O(\hat\pi_M)O(\hat\pi_M)'\}^{-1}O(\pi^*_M)_i\,\epsilon_i/\sqrt N,$$

so that $\sum_{i=1}^N Z_{iN} = V_M^{-1/2}G^M(x)\exp(G^M(x)\bar\pi_M)\{O(\hat\pi_M)O(\hat\pi_M)'\}^{-1}O(\pi^*_M)\epsilon/\sqrt N$. Note that for each $N$, the $Z_{iN}$ ($i = 1, \cdots, N$) are i.i.d. Also $E[Z_{iN}] = 0$ and $\sum_{i=1}^N E[Z_{iN}^2] = 1$; and for all $\varepsilon > 0$,

$$N\,E\big[1(|Z_{iN}| > \varepsilon)Z_{iN}^2\big] = N\varepsilon^2 E\big[1(|Z_{iN}/\varepsilon| > 1)(Z_{iN}/\varepsilon)^2\big] \le N\varepsilon^2 E\big[(Z_{iN}/\varepsilon)^4\big]$$
$$\le \big\|V_M^{-1/2}G^M(x)\exp(G^M(x)\bar\pi_M)\big\|^2\,E\big[\|O(\pi^*_M)\|^2 E[\epsilon_i^4\,|\,x_i]\big]\big/(N\varepsilon^2) \le C\,\zeta_0(M)^2 M/N \longrightarrow 0.$$

Then by the Lindeberg-Feller central limit theorem, $\sum_{i=1}^N Z_{iN} \to_d N(0,1)$, i.e.
$$T_1 \to_d N(0,1).$$

As for the second term $T_2$ in equation (A.5),

$$\big|\sqrt{N}\,V_M^{-1/2}\big(\exp(G^M(x)\pi^*_M) - r(x)\big)\big| \le \sqrt{N}\,\|V_M^{-1/2}\|\,\big|\exp(G^M(x)\pi^*_M) - r(x)\big| \le O\big(\sqrt{N}M^{-(2+s)}\big) \longrightarrow 0.$$

So

$$\sqrt{N}\,V_M^{-1/2}\big(\hat r(x) - r(x)\big) \to_d N(0,1). \quad \text{QED.}$$

PROOF OF THEOREM 3:

Note that

$$\widehat{CAPE}_j(x) - CAPE_j(x) = \hat\beta_j\exp(x\hat\beta)\exp(G^M(x)\hat\pi_M) - \beta_j\exp(x\beta)r(x)$$
$$= \big[\hat\beta_j\exp(x\hat\beta)\exp(G^M(x)\hat\pi_M) - \hat\beta_j\exp(x\hat\beta)r(x)\big] + \big[\hat\beta_j\exp(x\hat\beta)r(x) - \beta_j\exp(x\beta)r(x)\big]$$
$$= \hat\beta_j\exp(x\hat\beta)\big[\hat r(x) - r(x)\big] + r(x)\big[\hat\beta_j\exp(x\hat\beta) - \beta_j\exp(x\beta)\big].$$

From the result in Theorem 1, we know $\sqrt{N}(\hat\beta_j - \beta_j) = O_p(1)$; so by the delta method $\sqrt{N}\big[\hat\beta_j\exp(x\hat\beta) - \beta_j\exp(x\beta)\big] = O_p(1)$; so

$$\big|\sqrt{N}\,V_M^{-1/2}\,r(x)\big[\hat\beta_j\exp(x\hat\beta) - \beta_j\exp(x\beta)\big]\big| \le C\,\|V_M^{-1/2}\| \le C M^{-1/2} \longrightarrow 0,$$

while from Lemma 3 and the Slutsky theorem

$$\sqrt{N}\,V_M^{-1/2}\,\hat\beta_j\exp(x\hat\beta)\big[\hat r(x) - r(x)\big] \to_d \beta_j\exp(x\beta)\,N(0,1).$$

Hence

$$\sqrt{N}\,V_M^{-1/2}\big(\widehat{CAPE}_j(x) - CAPE_j(x)\big) \to_d \beta_j\exp(x\beta)\,N(0,1). \quad \text{QED.}$$

PROOF OF COROLLARY:

Note that $\sqrt{N}\big(\widehat{CAPE}_j(x) - CAPE_j(x)\big) = \sqrt{N}(\hat\beta_j - \beta_j)\exp(x\beta)r(x) + o_p(1)$. From Theorem 1, we know that $\sqrt{N}(\hat\beta - \beta) \to_d N(0, \Omega)$; with the selection vector $\ell = (0, \cdots, 1, 0, \cdots, 0)'$ picking out the $j$th element, $\sqrt{N}(\hat\beta_j - \beta_j) = \ell'\sqrt{N}(\hat\beta - \beta)$. So

$$\sqrt{N}\big(\widehat{CAPE}_j(x) - CAPE_j(x)\big) \to_d N\big(0,\; r^2(x)\exp(2x\beta)\,\ell'\Omega\ell\big). \quad \text{QED.}$$

Appendix B

Tables and Figures

Table B.1: Estimation results: xtreg

  Variable     Coefficient   (Std. Err.)
  lfare        -1.163        (0.023)
  concen        0.145        (0.040)
  y98           0.045        (0.006)
  y99           0.104        (0.006)
  y00           0.197        (0.006)
  Intercept    11.769        (0.116)

Table B.2: Simulation results where V_it has Gamma distribution

                          a = 0                              a = .01
               b=.01      b=.05      b=.1        b=.01      b=.05      b=.1
  β̂_lfe       .0934      .0675      .0350       .0939      .0666      .0315
              (.0289)*   (.0283)    (.0284)     (.0290)    (.0230)    (.0294)
  β̂_pqml      .1001      .0988      .0967       .1011      .1002      .0994
              (.0415)    (.0404)    (.0421)     (.0410)    (.0411)    (.0439)
  se(β̂_lfe)   .0286      .0287      .0291       .0292      .0294      .0296
  se(β̂_pqml)  .0388      .0388      .0397       .0390      .0393      .0405
  ρ_{X,V}**    .0008     -.0004     -.0003       .0000     -.0004     -.0014
  ρ_{X,lv}**  -.0057     -.0250     -.0450      -.0046     -.0263     -.0523
  mean(lv)    -.5781     -.5767     -.5800      -.5836     -.5855     -.5884
  sd(lv)      1.2816     1.2839     1.2914      1.2919     1.2945     1.3021

  * Monte Carlo standard deviations in parentheses
  ** ρ_{X,V} = Corr(X, V), ρ_{X,lv} = Corr(X, lv), lv = log(V)

Table B.3: Simulation results where V_it has Gamma distribution (continued)

                          a = .05                            a = .1
               b=.01      b=.05      b=.1        b=.01      b=.05      b=.1
  β̂_lfe       .0921      .0630      .0222       .0922      .0495      .0032
              (.0327)    (.0319)    (.0318)     (.0365)    (.0376)    (.0373)
  β̂_pqml      .0997      .1023      .0998       .0996      .0972      .1006
              (.0449)    (.0454)    (.0454)     (.0484)    (.0471)    (.0516)
  se(β̂_lfe)   .0320      .0321      .0327       .0374      .0380      .0387
  se(β̂_pqml)  .0414      .0421      .0427       .0448      .0453      .0467
  ρ_{X,V}     -.0000      .0004      .0004      -.0001     -.0011      .0002
  ρ_{X,lv}    -.0057     -.0277     -.0575      -.0062     -.0350     -.0672
  mean(lv)    -.6145     -.6138     -.6169      -.6572     -.6592     -.6604
  sd(lv)      1.3358     1.3398     1.3492      1.4140     1.4197     1.4309

Table B.4: Special case a = .1, b = 0

  N             500        1000       2000
  β̂_lfe       .0997      .1004      .1006
              (.0372)    (.0266)    (.0185)
  β̂_pqml      .1017      .1021      .1011
              (.0461)    (.0334)    (.0248)
  se(β̂_lfe)   .0375      .0266      .0189
  se(β̂_pqml)  .0445      .0329      .0240
  ρ_{x,v}      .0001      .0005      .0006
  ρ_{x,lv}    -.0003      .0009      .0004
  mean(lv)    -.6574     -.6558     -.6558
  sd(lv)      1.4120     1.4126     1.4134

Table B.5: Simulation results where V_it has log-normal distribution

                     a = -.125, b = .5                 a = -.5, b = 1
  N            500        1000       2000       500        1000       2000
  β̂_lfe      0.10001    0.09990    0.09991    0.09827    0.10144    0.10127
             (.01998)   (.01411)   (.01015)   (.04807)   (.03315)   (.02410)
  β̂_pqml     0.09837    0.09655    0.09834    0.07994    0.09138    0.09515
             (.04476)   (.03261)   (.02412)   (.13131)   (.11863)   (.09854)
  se(β̂_lfe)  0.01969    0.01396    0.00987    0.04838    0.03424    0.02418
  se(β̂_pqml) 0.03758    0.02778    0.02084    0.08555    0.07361    0.05789
  ρ_{X,V}    -0.0012    -0.0010    -0.0005    -0.0004    0.00025    0.00051
  ρ_{X,lv}   0.00003    0.0003     -0.0003    -0.0011    0.00088    0.00074
  mean(lv)   -0.1249    -0.1254    -0.1251    -0.5004    -0.5001    -0.4998
  sd(lv)     0.52993    0.5305     0.53047    1.2255     1.2254     1.2243
Table B.6: V_it = exp(a·x²_it + b·x_it·z_it) with N = 500 (a = -.5, b = 1)

  ρ*           -0.95      -0.5       -0.1        0.1        0.5        0.95
  β̂_lfe       0.10010    0.09669    0.09978    0.10090    0.09768    0.10261
              (.04456)   (.04563)   (.04795)   (.04927)   (.05510)   (.06163)
  β̂_pqml      0.07839    0.08340    0.08476    0.08303    0.08036    0.08497
              (.10740)   (.13876)   (.12536)   (.13958)   (.12433)   (.13743)
  se(β̂_lfe)   0.04448    0.04558    0.04751    0.04909    0.05334    0.06217
  se(β̂_pqml)  0.06785    0.08034    0.08202    0.08652    0.08887    0.09637
  ρ_{x,v}     -0.00018   -0.00247    0.00054    0.00014   -0.00177    0.00190
  ρ_{x,lv}    -0.00005   -0.00268   -0.00007    0.00087   -0.00163    0.00235
  mean(lv)    -0.49941   -0.49949   -0.49859   -0.49969   -0.49972   -0.49948
  sd(lv)       1.22230    1.22259    1.22251    1.22435    1.22324    1.22320

  * z_i ∼ N(I_5, Σ), ρ = Corr(z_it, z_{it+1})

Table B.7: V_it = exp(a·x²_it + b·x_it·z_it) with N = 1000 (a = -.5, b = 1)

  ρ            -0.95      -0.5       -0.1        0.1        0.5        0.95
  β̂_lfe       0.09984    0.09959    0.09845    0.10023    0.10137    0.10145
              (0.03032)  (0.03281)  (0.03553)  (0.03549)  (0.03643)  (0.04594)
  β̂_pqml      0.08465    0.08509    0.08018    0.08802    0.08665    0.08614
              (0.10493)  (0.09423)  (0.10190)  (0.12003)  (0.11121)  (0.11349)
  se(β̂_lfe)   0.03152    0.03231    0.03370    0.03469    0.03764    0.04395
  se(β̂_pqml)  0.06084    0.06522    0.06652    0.07327    0.07415    0.08087
  ρ_{x,v}     -0.00129    0.00053   -0.00198   -0.00088    0.00101   -0.00066
  ρ_{x,lv}    -0.00039   -0.00024   -0.00124    0.00013    0.00090    0.00087
  mean(lv)    -0.50118   -0.49880   -0.49996   -0.49942   -0.49900   -0.50027
  sd(lv)       1.22425    1.22294    1.22432    1.22375    1.22204    1.22413

Table B.8: Simulation results with V_it = exp(-.125X_i² + .5X_i·z_it)

  N             2000       1000       500        250        100
  β̂_pqml      0.0998     0.1001     0.1001     0.0998     0.1001
              (0.0070)   (0.0095)   (0.0132)   (0.0183)   (0.0278)
  β̂_lfe       .10004     .09997     .09989     .09995     .09991
              (0.0025)   (0.0035)   (0.0050)   (0.0071)   (0.0112)
  β̂_gmm       0.1001     0.1003     0.1001     0.1002     0.1003
              (0.0042)   (0.0058)   (0.0079)   (0.0107)   (0.0168)
  se(β̂_pqml)  0.0074     0.0102     0.0142     0.0205     0.0384
  se(β̂_lfe)   0.0025     0.0035     0.0050     0.0070     0.0110
  se(β̂_gmm)   0.0044     0.0057     0.0073     0.0094     0.0136

Table B.9: Simulation results for four estimators

  N             2000       1000       500        250        100
  β̂_pqml      0.0999     0.1002     0.1003     0.0995     0.999
              (0.0068)   (0.0093)   (0.0146)   (0.0179)   (0.0265)
  β̂_lfe       .10002     .09989     .09998     .09992     .09995
              (0.0027)   (0.0038)   (0.0052)   (0.0069)   (0.0120)
  β̂_gmm       0.9999     0.1002     0.9998     0.1003     0.1004
              (0.0044)   (0.0055)   (0.0080)   (0.0110)   (0.0172)
  β̂_oiv       0.1001     0.9999     0.1001     0.9998     0.1006
              (0.0026)   (0.0038)   (0.0066)   (0.0077)   (0.0118)
  se(β̂_pqml)  0.0077     0.0110     0.0139     0.0207     0.0379
  se(β̂_lfe)   0.0028     0.0039     0.0052     0.0073     0.0109
  se(β̂_gmm)   0.0045     0.0060     0.0071     0.0090     0.0140
  se(β̂_oiv)   0.0024     0.0034     0.0053     0.0069     0.0106

Table B.10: Summary statistics

            Obs    Mean        Std. Dev.   Min        Max
  passen    4596   636.8242    812         2          8497
  lfare     4596   5.095601    0.4363999   3.610918   6.257668
  concen    4596   0.6101149   0.196435    0.1605     1

Table B.11: Dependent variable, passen

            β̂_pqml    β̂_lfe     β̂_gmm     se(β̂_pqml)  se(β̂_lfe)  se(β̂_gmm)
  lfare    -0.8658    -1.1632    -0.8515    0.0366       0.1101      0.0336
  concen   -0.1289     0.1455    -0.1450    0.0544       0.0890      0.0538
  y98       0.0427     0.0454     0.0431    0.0037       0.0049      0.0035
  y99       0.1093     0.1038     0.1081    0.0054       0.0063      0.0049
  y00       0.1899     0.1970     0.1911    0.0085       0.0101      0.0069

Figure B.1: Bias of LFE and PQML with change of ρ. [Line plot of bias_fe and bias_pqml against ρ.]

Figure B.2: Std. error of LFE and PQML with change of ρ. [Line plot of se_fe and se_pqml against ρ.]

Figure B.3: Bias of LFE and PQML with change of ρ, N = 1000.

Figure B.4: Std. error of LFE and PQML with change of ρ, N = 1000.
Figure B.5: Histogram of passengers. [Two panels: avg. passengers per day and log(passen).]

Proofs

PROOF OF THEOREM 2.4.1:

$$\sqrt{N}\begin{pmatrix}\hat\beta - \beta \\ \hat\eta - \eta\end{pmatrix} = \begin{pmatrix} N^{-1/2}\sum_{i=1}^N \mathbf{A} V_i \\ N^{-1/2}\sum_{i=1}^N \big(P'S_i + U_i - \eta\big)\end{pmatrix} + o_p(1) \equiv N^{-1/2}\sum_{i=1}^N \begin{pmatrix} S_i \\ Q_i \end{pmatrix} + o_p(1),$$

where

$$V_i = Y_i - p(X_i, \beta)n_i, \qquad \mathbf{A} = E\big(n_i\,\nabla_\beta p(X_i, \beta)'\,W(X_i, \beta)\,\nabla_\beta p(X_i, \beta)\big),$$
$$p(X_i, \beta) = \Big(\frac{\exp(X_{i1}\beta)}{\sum_{t=1}^T\exp(X_{it}\beta)},\; \cdots,\; \frac{\exp(X_{iT}\beta)}{\sum_{t=1}^T\exp(X_{it}\beta)}\Big)', \qquad W(X_i, \beta) = \Big[\mathrm{diag}\Big(\frac{\exp(X_{i1}\beta)}{\sum_{t=1}^T\exp(X_{it}\beta)},\; \cdots,\; \frac{\exp(X_{iT}\beta)}{\sum_{t=1}^T\exp(X_{it}\beta)}\Big)\Big]^{-1},$$
$$P = E(X_{it}U_i), \qquad Q_i = P'S_i + U_i - \eta.$$

PROOF OF LEMMA 2.4.3:

From equation (1.4.3), and by monotonicity of the logarithmic function,

$$\exp(-CK^{-s} + G^M(X)\pi_M) - \exp(G^M(X)\pi_M) < r(X) - \exp(G^M(X)\pi_M) < \exp(CK^{-s} + G^M(X)\pi_M) - \exp(G^M(X)\pi_M).$$

By the mean value theorem applied to the lower and upper bounds,

$$\exp(-CK^{-s} + G^M(X)\pi_M) - \exp(G^M(X)\pi_M) = -CK^{-s}\exp(\xi_1), \quad \xi_1 \in [-CK^{-s} + G^M(X)\pi_M,\; G^M(X)\pi_M],$$
$$\exp(CK^{-s} + G^M(X)\pi_M) - \exp(G^M(X)\pi_M) = CK^{-s}\exp(\xi_2), \quad \xi_2 \in [G^M(X)\pi_M,\; CK^{-s} + G^M(X)\pi_M].$$

So, for the $\pi_M$ that satisfies equation (1.4.3), we have

$$\sup_{X\in\Xi}\big|r(X) - \exp(G^M(X)\pi_M)\big| < CK^{-s},$$

and so $E\big[r(X_i) - \exp(G^M(X_i)\pi_M)\big]^2 \le CK^{-2s}$. Note that

$$E\sum_{t=1}^T\Big[\frac{Y_{it}}{\exp(X_{it}\beta)} - \exp(G^M(X_i)\pi)\Big]^2 = \sum_{t=1}^T E\,\mathrm{Var}\Big(\frac{Y_{it}}{\exp(X_{it}\beta)}\,\Big|\,X_i\Big) + T\,E\big[r(X_i) - \exp(G^M(X_i)\pi)\big]^2.$$

So $\pi^*_M = \arg\min_\pi E\big[r(X_i) - \exp(G^M(X_i)\pi)\big]^2$, and therefore $E\big[r(X_i) - \exp(G^M(X_i)\pi^*_M)\big]^2 \le E\big[r(X_i) - \exp(G^M(X_i)\pi_M)\big]^2 \le CK^{-2s}$. So, similarly, $r(X_i) - \exp(G^M(X_i)\pi^*_M) = O_p(CK^{-s})$.

PROOF OF LEMMA 2.4.4: This proof draws heavily on the proof of Lemma 2 in Hirano et al. (2003). In the sequel we write $M$ for $M(N)$. By the definition of $G^M(X)$,

$$\hat S_M = \frac{1}{N}\sum_{i=1}^N G^M(X_i)G^M(X_i)'$$

has expectation equal to $I_M$. By Newey (1997), it satisfies $\|\hat S_M - I_M\| = O_p\big(\zeta(M)\sqrt{M/N}\big)$, which converges to zero in probability by condition (iv). Hence the probability that the smallest eigenvalue of $\hat S_M$ is larger than $1/2$ goes to one. Let

$$L_N(\pi) = -\sum_{i=1}^N\sum_{t=1}^T\Big[\frac{Y_{it}}{\exp(X_{it}\hat\beta)} - \exp(G^M(X_i)\pi)\Big]^2.$$

Next, we will show that

$$\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M) = O_p\big(\sqrt{M/N}\big). \qquad (B.1)$$

Consider

$$E\Big\|\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)\Big\|^2 = \frac{1}{N}\,\mathrm{tr}\,E\Big[\sum_{t=1}^T\Big(\frac{Y_{it}}{\exp(X_{it}\hat\beta)} - \exp(G^M(X_i)\pi^*_M)\Big)^2\exp(2G^M(X_i)\pi^*_M)\,G^M(X_i)'G^M(X_i)\Big]$$
$$= \frac{T}{N}\,\mathrm{tr}\,E\Big[\Big(\mathrm{Var}\Big(\frac{Y_{it}}{\exp(X_{it}\beta)}\,\Big|\,X_i\Big) + \big(r(X_i) - \exp(G^M(X_i)\pi^*_M)\big)^2\Big)\exp(2G^M(X_i)\pi^*_M)\,G^M(X_i)'G^M(X_i)\Big] + o(1)$$
$$\le \frac{C}{N}\,\mathrm{tr}\,E\big[G^M(X_i)'G^M(X_i)\big] \le \frac{CK}{N},$$

and the Markov inequality implies (B.1). Next, let

$$\eta = \inf_{X\in\Xi,\,M}\big[2\exp(G^M(X)\pi^*_M) - r^*(X)\big]\exp(2G^M(X)\pi^*_M),$$

which by the assumptions and Lemma 1 is positive. For any $\varepsilon > 0$, choose $C$ such that for $N$ large enough

$$P\Big(\Big\|\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)\Big\| < \eta C\sqrt{M/N}\Big) \ge 1 - \frac{\varepsilon}{2}. \qquad (B.2)$$

Note that

$$\sup_{X\in\Xi,\,\|\pi-\pi^*\|<\eta C\sqrt{M/N}}\big|\exp(2G^M(X)\pi) - \exp(2G^M(X)\pi^*_M)\big| \le \sup_{X\in\Xi,\,\|\pi-\pi^*\|<\eta C\sqrt{M/N}}\big|CG^M(X)(\pi - \pi^*)\big| \le \zeta(M)\,C\sqrt{M/N},$$

which goes to zero, so that for large enough $N$

$$\inf_{X\in\Xi,\,\|\pi-\pi^*\|<\eta C\sqrt{M/N}}\big[2\exp(G^M(X)\pi) - r^*(X)\big]\exp(2G^M(X)\pi) \ge 4\eta.$$

Choose $N$ large enough so that this inequality holds, that (B.2) holds with probability at least $1-\varepsilon/2$, and that the probability that the smallest eigenvalue of $\hat S_M$ is larger than $1/2$ is at least $1-\varepsilon/2$. Then the probability that both of these hold is at least $1-\varepsilon$, and for every $\pi$ with $\|\pi - \pi^*\| = \sqrt{M/N}$ a second-order expansion gives

$$\frac{1}{N}L_N(\pi) = \frac{1}{N}L_N(\pi^*_M) + \frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)(\pi - \pi^*) + \frac{1}{2N}(\pi - \pi^*)'\frac{\partial^2 L_N}{\partial\pi\partial\pi'}(\bar\pi)(\pi - \pi^*), \qquad (B.3)$$

where $\|\bar\pi - \pi^*\| \le \|\pi - \pi^*\| = \sqrt{M/N}$.
We have

$$\frac{1}{2N}\frac{\partial^2 L_N}{\partial\pi\partial\pi'}(\bar\pi) = -\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\Big[2\exp(G^M(X_i)\bar\pi) - \frac{Y_{it}}{\exp(X_{it}\hat\beta)}\Big]\exp(G^M(X_i)\bar\pi)\,G^M(X_i)'G^M(X_i)$$
$$= -\frac{1}{N}E\Big[\big(2\exp(G^M(X_i)\bar\pi) - r^*(X_i)\big)\exp(G^M(X_i)\bar\pi)\,G^M(X_i)'G^M(X_i)\Big] + o_p(1) \le -2\eta\,\hat S_M + o_p(1),$$

with its eigenvalues bounded away from zero in absolute value by $\eta$. Then, rearranging (B.3) and using the triangle inequality, with probability greater than $1-\varepsilon$, for $\|\pi - \pi^*\| = \sqrt{M/N}$,

$$\frac{1}{N}L_N(\pi) - \frac{1}{N}L_N(\pi^*_M) \le \frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)(\pi - \pi^*) - \eta\|\pi - \pi^*\|^2 + o_p(1)$$
$$\le \Big\|\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)\Big\|\,\|\pi - \pi^*\| - \eta\|\pi - \pi^*\|^2 + o_p(1) = \Big(\Big\|\frac{1}{N}\frac{\partial L_N}{\partial\pi}(\pi^*_M)\Big\| - \eta\sqrt{M/N}\Big)\|\pi - \pi^*\| + o_p(1) < 0.$$

That is, with probability greater than $1-\varepsilon$, $\frac{1}{N}L_N(\pi) < \frac{1}{N}L_N(\pi^*_M)$ for all $\pi$ with $\|\pi - \pi^*\| = \sqrt{M/N}$. Since $L_N(\pi)$ is continuous, it has a maximum on the compact set $\{\pi : \|\pi - \pi^*\| \le \sqrt{M/N}\}$. By the last inequality, this maximum must occur at some $\hat\pi_M$ with $\|\hat\pi_M - \pi^*\| < \sqrt{M/N}$. Hence the first-order conditions are satisfied at $\hat\pi_M$, and by concavity of $L_N(\pi)$, $\hat\pi_M$ maximizes $L_N(\pi)$ over all of $\mathcal{G}_M$. Because the probability of this is greater than $1-\varepsilon$ with $\varepsilon$ arbitrary, we conclude that $\hat\pi_M$ exists and satisfies the first-order conditions with probability approaching one, and that $\hat\pi_{M(N)} - \pi^*_{M(N)} = O_p\big(\sqrt{M(N)/N}\big)$.

PROOF OF THEOREM 2.4.9:

Let $\omega_{it} = (X_{it}, G^M(X_i))$, $\theta = (\beta', \pi^{*\prime}_M)'$ and

$$j(w_{it}, \theta) = T^{-1}\sum_{t=1}^T \exp\big(X_{it}\beta + G^M(X_i)\pi^*_M\big)\,\beta.$$

So

$$\sqrt{N}(\hat\tau - \tau) = \sqrt{N}\big(\hat\tau - Ej(w_{it}, \theta)\big) + \sqrt{N}\big(Ej(w_{it}, \theta) - \tau\big) \equiv T_1 + T_2.$$

Note that

$$T_1 = \sqrt{N}(NT)^{-1}\sum_{i=1}^N\sum_{t=1}^T\Big[\exp\big(X_{it}\hat\beta_{pqml} + G^M(X_i)\hat\pi_M\big)\hat\beta_{pqml} - \exp\big(X_{it}\beta + G^M(X_i)\pi^*_M\big)\beta\Big]$$
$$\quad + \sqrt{N}(NT)^{-1}\sum_{i=1}^N\sum_{t=1}^T\Big[\exp\big(X_{it}\beta + G^M(X_i)\pi^*_M\big)\beta - Ej(w_{it}, \theta)\Big]$$
$$= N^{-1}\sum_{i=1}^N j(w_{it}, \theta)\,\sqrt{N}(\hat\beta_{pqml} - \beta) + N^{-1}\sum_{i=1}^N \nabla_\theta j(w_{it}, \tilde\theta)'(\hat\theta - \theta) + \sqrt{N}\Big[N^{-1}\sum_{i=1}^N j(w_{it}, \theta) - Ej(w_{it}, \theta)\Big] + o_p(1).$$

While $N^{-1}\sum_{i=1}^N \nabla_\theta j(w_{it}, \tilde\theta) = O_p(\zeta(M))$ and $\hat\theta - \theta = O_p\big(\sqrt{M(N)/N}\big)$, so the second term is of order $O_p\big(\zeta(M)\sqrt{M(N)/N}\big)$, which vanishes by the assumptions in Lemma 2. On the other hand,

$$T_2 \equiv \sqrt{N}\big(Ej(w_{it}, \theta) - \tau\big) = \sqrt{N}\,T^{-1}\sum_{t=1}^T E\Big[\exp(X_{it}\beta)\big(\exp(G^M(X_i)\pi^*_M) - r(X_i)\big)\Big]\beta.$$

From Lemma 1, we know the term in parentheses is of order $O_p(CK^{-s})$; as long as $N^{1/2}M^{-s} \to 0$, it can be ignored too. Hence

$$\sqrt{N}(\hat\tau - \tau) = N^{-1}\sum_{i=1}^N j(w_{it}, \theta)\,\sqrt{N}(\hat\beta_{pqml} - \beta) + \sqrt{N}\Big[N^{-1}\sum_{i=1}^N j(w_{it}, \theta) - Ej(w_{it}, \theta)\Big] + o_p(1).$$

From Wooldridge (1999), the first term satisfies

$$\sqrt{N}(\hat\beta_{pqml} - \beta) = N^{-1/2}\sum_{i=1}^N \nabla_\beta^2\,p_1(X_i, \beta)\,W_1(X_i, \beta)\big(Y_i - p_1(X_i, \beta)n_{1i}\big),$$

and the second term is $N^{-1/2}\sum_{i=1}^N\big(j(w_{it}, \theta) - Ej(w_{it}, \theta)\big)$. Therefore

$$\sqrt{N}(\hat\tau - \tau) = N^{-1/2}\sum_{i=1}^N\Big\{Ej(w_{it}, \theta)\,\nabla_\beta^2\,p_1(X_i, \beta)\,W_1(X_i, \beta)\big(Y_i - p_1(X_i, \beta)n_{1i}\big) + \big(j(w_{it}, \theta) - Ej(w_{it}, \theta)\big)\Big\} + o_p(1),$$

and it follows that $\sqrt{N}(\hat\tau - \tau) \Rightarrow N(0, V)$, where

$$V = \mathrm{Var}\Big\{Ej(w_{it}, \theta)\,\nabla_\beta^2\,p_1(X_i, \beta)\,W_1(X_i, \beta)\big(Y_i - p_1(X_i, \beta)n_{1i}\big) + \big(j(w_{it}, \theta) - Ej(w_{it}, \theta)\big)\Big\}. \quad \text{Q.E.D.}$$

The estimation of $V$ is straightforward: let

$$\hat V_{i1} = \Big[N^{-1}\sum_{i=1}^N j(w_{it}, \hat\theta)\Big]\nabla_\beta^2\,p_1(X_i, \hat\beta_{pqml})\,W_1(X_i, \hat\beta_{pqml})\big(Y_i - p_1(X_i, \hat\beta_{pqml})n_{1i}\big) + \Big[j(w_{it}, \hat\theta) - N^{-1}\sum_{i=1}^N j(w_{it}, \hat\theta)\Big];$$

then $\hat V = N^{-1}\sum_{i=1}^N \hat V_{i1}\hat V_{i1}'$. Note: for the notation here, please refer to sections 2 and 3.
GMM Simulation Setup:

• We use the following setup:

$$n_i \equiv \sum_{t=1}^T y_{it}, \qquad n_{i2} \equiv \sum_{t=1}^T y_{it}^2,$$
$$p_t(x_i, \beta) \equiv \frac{\exp(\beta x_{it})}{\sum_{r=1}^T \exp(\beta x_{ir})}, \qquad p_{t2}(x_i, \beta) \equiv \frac{\exp(2\beta x_{it})}{\sum_{r=1}^T \exp(2\beta x_{ir})},$$
$$p(x_i, \beta) \equiv [p_1(x_i, \beta), \ldots, p_T(x_i, \beta)]', \qquad p_2(x_i, \beta) \equiv [p_{12}(x_i, \beta), \ldots, p_{T2}(x_i, \beta)]',$$
$$u_{1i}(\beta) \equiv Y_i - p(x_i, \beta)n_i, \quad Y_i = [Y_{i1}, \cdots, Y_{iT}]', \qquad u_{2i}(\beta) \equiv Y_i^2 - p_2(x_i, \beta)n_{i2}, \quad Y_i^2 = [Y_{i1}^2, \cdots, Y_{iT}^2]'.$$

With $\bar x_i(\beta) \equiv \sum_{r=1}^T x_{ir}\exp(\beta x_{ir})/\sum_{r=1}^T\exp(\beta x_{ir})$ and $\bar x_i^{(2)}(\beta) \equiv \sum_{r=1}^T 2x_{ir}\exp(2\beta x_{ir})/\sum_{r=1}^T\exp(2\beta x_{ir})$, define

$$D_1(x_i, \beta) = \big(x_{i1} - \bar x_i(\beta),\; \ldots,\; x_{iT} - \bar x_i(\beta)\big),$$
$$D_2(x_i, \beta) = \big(2x_{i1} - \bar x_i^{(2)}(\beta),\; \ldots,\; 2x_{iT} - \bar x_i^{(2)}(\beta)\big),$$
$$D_3(x_i, \beta) = n_i\big(p_1(x_i, \beta)(x_{i1} - \bar x_i(\beta)),\; \ldots,\; p_T(x_i, \beta)(x_{iT} - \bar x_i(\beta))\big) = n_i\,\nabla_\beta p(x_i, \beta)',$$
$$D_i(x_i, \beta) = \begin{pmatrix} D_1(x_i, \beta) & 0 \\ 0 & D_2(x_i, \beta) \end{pmatrix}, \qquad u_i(\beta) = \begin{pmatrix} u_{1i}(\beta) \\ u_{2i}(\beta) \end{pmatrix}.$$

• Step 1: PQML.

$$\hat\beta_{pqml} = \arg\max_\beta \sum_{i=1}^N\sum_{t=1}^T Y_{it}\log\big(p_t(x_i, \beta)\big),$$
$$se(\hat\beta_{pqml}) = \Big\{\Big(\sum_{i=1}^N D_3 D_1'\Big)^{-1}\Big(\sum_{i=1}^N D_1 u_{1i}u_{1i}'D_1'\Big)\Big(\sum_{i=1}^N D_3 D_1'\Big)^{-1}\Big\}^{1/2},$$

where $D_1 = D_1(x_i, \hat\beta_{pqml})$, $D_3 = D_3(x_i, \hat\beta_{pqml})$ and $u_{1i} = u_{1i}(\hat\beta_{pqml})$.

• Step 2: GMM.

$$\hat\beta_{gmm} = \arg\min_\beta \Big[N^{-1}\sum_{i=1}^N D_i u_i(\beta)\Big]'\Big[N^{-1}\sum_{i=1}^N D_i u_i u_i' D_i'\Big]^{-1}\Big[N^{-1}\sum_{i=1}^N D_i u_i(\beta)\Big],$$
$$se(\hat\beta_{gmm}) = \Big\{\Big(\sum_{i=1}^N D_i\nabla_\beta u_i\Big)'\Big(\sum_{i=1}^N D_i u_i u_i' D_i'\Big)^{-1}\Big(\sum_{i=1}^N D_i\nabla_\beta u_i\Big)\Big\}^{-1/2},$$

where $D_i = D_i(x_i, \hat\beta_{pqml})$, $u_i = u_i(\hat\beta_{pqml})$, and $\nabla_\beta u_i(\hat\beta_{gmm}) = \big(\nabla_\beta u_{1i}(\hat\beta_{gmm})', \nabla_\beta u_{2i}(\hat\beta_{gmm})'\big)'$ with $t$-th elements

$$\nabla_\beta u_{1i,t}(\beta) = -n_i\,p_t(x_i, \beta)\big(x_{it} - \bar x_i(\beta)\big), \qquad \nabla_\beta u_{2i,t}(\beta) = -n_{i2}\,p_{t2}(x_i, \beta)\big(2x_{it} - \bar x_i^{(2)}(\beta)\big).$$

• Step 3: OIV.

$$\hat\beta_{oiv} = \arg\min_\beta \Big[N^{-1}\sum_{i=1}^N B_i u_{1i}(\beta)\Big]'\Big[N^{-1}\sum_{i=1}^N B_i B_i'\Big]^{-1}\Big[N^{-1}\sum_{i=1}^N B_i u_{1i}(\beta)\Big],$$
$$se(\hat\beta_{oiv}) = \Big\{\Big(\sum_{i=1}^N B_i\nabla_\beta u_{1i}\Big)'\Big(\sum_{i=1}^N B_i B_i'\Big)^{-1}\Big(\sum_{i=1}^N B_i\nabla_\beta u_{1i}\Big)\Big\}^{-1/2},$$

where $u_{1i} = u_{1i}(\hat\beta_{pqml})$, $\nabla_\beta u_{1i} = \nabla_\beta u_{1i}(\hat\beta_{oiv})$, and

$$B(X_i) = \hat D(X_i)\,\hat\Omega_i^{-1}/\hat g(X_i), \qquad \hat D_i = -\nabla_\beta\big[p_1(X_i, \hat\beta_{pqml})\big]'\sum_{t=1}^T \exp(\hat\beta_{pqml}X_{it})\,\hat r(X_i), \qquad \hat r(X_i) = \exp\big(G^M(X_i)\hat\pi_M\big).$$
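For concreteness, here is a minimal sketch of Steps 1 and 2 for the scalar-β design above; the helper names (pt, pqml, moments, gmm) are hypothetical, and the GMM step uses the two stacked moments $D_1 u_{1i}$ and $D_2 u_{2i}$ with the weighting matrix estimated at the first-step PQML estimate.

```python
# A minimal sketch of Step 1 (PQML) and Step 2 (GMM) for scalar beta;
# pt, pqml, moments and gmm are hypothetical helper names.
import numpy as np
from scipy.optimize import minimize_scalar

def pt(x, b):
    """Multinomial shares p_t(x_i, b) = exp(b x_it) / sum_r exp(b x_ir)."""
    ex = np.exp(b * x)                        # x has shape (N, T)
    return ex / ex.sum(axis=1, keepdims=True)

def pqml(x, y):
    """Step 1: maximize the quasi-log-likelihood sum_{i,t} y_it log p_t."""
    obj = lambda b: -np.sum(y * np.log(pt(x, b)))
    return minimize_scalar(obj, bounds=(-5.0, 5.0), method="bounded").x

def moments(b, x, y):
    """Per-unit stacked moments (D1_i u1_i, D2_i u2_i), shape (N, 2)."""
    n1 = y.sum(axis=1, keepdims=True)         # n_i  = sum_t y_it
    n2 = (y ** 2).sum(axis=1, keepdims=True)  # n_i2 = sum_t y_it^2
    p1, p2 = pt(x, b), pt(x, 2.0 * b)
    u1 = y - p1 * n1                          # u_1i(b)
    u2 = y ** 2 - p2 * n2                     # u_2i(b)
    d1 = x - (x * p1).sum(axis=1, keepdims=True)           # D_1(x_i, b)
    d2 = 2 * x - (2 * x * p2).sum(axis=1, keepdims=True)   # D_2(x_i, b)
    return np.column_stack([(d1 * u1).sum(axis=1), (d2 * u2).sum(axis=1)])

def gmm(x, y):
    """Step 2: minimize the quadratic form in the averaged moments, with
    the optimal weighting matrix evaluated at the PQML estimate."""
    b0 = pqml(x, y)
    g0 = moments(b0, x, y)
    W = np.linalg.inv(g0.T @ g0 / len(y))     # estimated weight matrix
    def obj(b):
        gbar = moments(b, x, y).mean(axis=0)
        return gbar @ W @ gbar
    return minimize_scalar(obj, bounds=(b0 - 1.0, b0 + 1.0),
                           method="bounded").x
```

The first moment block alone reproduces the PQML first-order condition, so the GMM step can only improve efficiency by exploiting the second (squared-data) block, which is the mechanism behind the standard-error reductions reported in Tables B.8 and B.9.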
Appendix C

Tables

Table C.1: Summary Statistics

               math4     avgexp      exp10       enroll      lunch     scdist
  Mean         37.73     9037        9251        2769.4      39.087    20.303
  Median       38.05     8646        8800        1596.5      37.792    16.935
  Std. dev.    15.107    1644.950    1862.571    4473.652    16.049    16.250
  Minimum      3.1       7258        6890        64          5.993     .134
  Maximum      81.3      28611       30379       75263       87.815    80.952
  Sample size  518       518         518         518         518       518

Table C.2: OLS Regression, dependent variable = math4

  Coef.          Estimate   Usual S.E.   H-W S.E.   ISD cluster S.E.
  log(avgexp)     0.1595    0.0405*      0.0558*    0.0671*
  lunch          -0.0058    0.0004*      0.0004*    0.0005*
  log(enroll)     0.0147    0.0068*      0.0076     0.0086
  log(scdist)     0.0030    0.0068       0.0068     0.0080
  constant       -0.9639    0.3744*      0.4973     0.6036

  log(exp10)      0.1289    0.0373*      0.0552*    0.0699
  lunch          -0.0058    0.0004*      0.0004*    0.0005*
  log(enroll)     0.0145    0.0068*      0.0077     0.0087
  log(scdist)     0.0017    0.0068       0.0068     0.0080
  constant       -0.6848    0.3451*      0.4895     0.6259

  * Significant at 5% level

Table C.3: OLS Regression with Conley S.E., dependent variable = math4

  Estimates:  log(exp10) 0.1289; log(enroll) 0.0145; lunch -0.00576; log(scdist) 0.0017; constant -0.6848
  H-W S.E.:   0.0552*;    0.0077;             0.00038;        0.0068;              0.4895

  Conley S.E. by cutoff pair:
  (cut1, cut2)   log(exp10)  log(enroll)  lunch      log(scdist)  constant
  (50, 100)      0.0606*     0.0086       0.00044*   0.0080       0.5507
  (100, 150)     0.0636*     0.0085       0.00044*   0.0079       0.5743
  (150, 200)     0.0640*     0.0077       0.00042*   0.0075       0.5737
  (200, 250)     0.0633*     0.0072*      0.00037*   0.0075       0.5696
  (350, 400)     0.0586*     0.0062*      0.00032*   0.0069       0.5322
  (400, 500)     0.0552*     0.0063*      0.00030*   0.0062       0.4981
  (500, 600)     0.0517*     0.0060*      0.00027*   0.0057       0.4682
  (600, 700)     0.0482*     0.0054*      0.00025*   0.0054       0.4366
  (700, 800)     0.0455*     0.0050*      0.00024*   0.0051       0.4134
  (800, 900)     0.0430*     0.0046*      0.00022*   0.0049       0.3926
  (1000, 1000)   0.0404*     0.0042*      0.00021*   0.0048       0.3709

Table C.4: OLS Regression with Conley S.E., dependent variable = math4

  Estimates:  log(avgexp) 0.1595; log(enroll) 0.0147; lunch -0.00579; log(scdist) 0.0030; constant -0.9639
  H-W S.E.:   0.0558;      0.0076;             0.00036;        0.0068;              0.4973

  (cut1, cut2)   log(avgexp)  log(enroll)  lunch      log(scdist)  constant
  (50, 100)      0.0616*      0.0086       0.00042*   0.0080       0.5640
  (100, 150)     0.0638*      0.0084       0.00042*   0.0080       0.5802
  (150, 200)     0.0634*      0.0076       0.00040*   0.0075       0.5710
  (200, 250)     0.0625*      0.0071*      0.00035*   0.0075       0.5655
  (350, 400)     0.0591*      0.0061*      0.00029*   0.0070       0.5412
  (400, 500)     0.0559*      0.0062*      0.00028*   0.0063       0.5112
  (500, 600)     0.0529*      0.0059*      0.00025*   0.0058       0.4856*
  (600, 700)     0.0497*      0.0053*      0.00023*   0.0055       0.4569*
  (700, 800)     0.0474*      0.0049*      0.00022*   0.0052       0.4371*
  (800, 900)     0.0452*      0.0046*      0.00021*   0.0050       0.4178*
  (1000, 1000)   0.0430*      0.0041*      0.00019*   0.0048       0.3991*

Table C.5: OLS Regression with Conley S.E., dependent variable = math4

  Estimates:  log(exp10) 0.1289; log(enroll) 0.0145; lunch -0.00576; log(scdist) 0.0017; constant -0.6848
  H-W S.E.:   0.0552;     0.0077;             0.00038;        0.0068;              0.4895

  (cut1, cut2)    log(exp10)  log(enroll)  lunch      log(scdist)  constant
  (242, 21)       0.0553*     .00758       .00044*    0.0073       0.4942
  (364, 41)       0.0570*     .00760       .00049*    0.0078       0.5102
  (435, 88)       0.0620*     0.0080       .00050*    0.0089       0.5602
  (546.5, 167.5)  0.0602*     0.0072*      .00045*    0.0093       0.5541
  (649, 284)      0.0587*     0.0059*      .00033*    0.0088       0.5454
  (719, 637)      0.0491*     0.0052*      .00025*    0.0057       0.4506
  (758, 880)      0.0437*     0.0047*      .00023*    0.0050       0.3976

Table C.6: OLS Regression with Conley S.E., dependent variable = math4

  Estimates:  log(avgexp) 0.1595; log(enroll) 0.0147; lunch -0.00579; log(scdist) 0.0030; constant -0.9639
  H-W S.E.:   0.0558;      0.0076;             0.00036;        0.0068;              0.4973

  (cut1, cut2)    log(avgexp)  log(enroll)  lunch      log(scdist)  constant
  (242, 21)       0.0561*      .00755       .00042*    0.0073       0.5037
  (364, 41)       0.0578*      .00760       .00046*    0.0077       0.5226
  (435, 88)       0.0629*      0.0080       .00047*    0.0089       0.5740
  (546.5, 167.5)  0.0617*      0.0072*      .00042*    0.0093       0.5747
  (649, 284)      0.0609*      0.0059*      .00030*    0.0089       0.5713
  (719, 637)      0.0515*      0.0051*      .00023*    0.0058       0.4783*
  (758, 880)      0.0457*      0.0047*      .00021*    0.0051       0.4218*
Table C.7: OLS Regression, dependent variable = log(math4/(1-math4))

  Coef.          Estimate   Usual S.E.  H-W S.E.  ISD cluster S.E.  Peninsula cluster S.E.
  log(avgexp)     0.5277    0.1976*     0.2946    0.3505            0.0937
  log(enroll)     0.0919    0.0332*     0.0387*   0.0446*           0.0021*
  lunch          -0.0275    0.0018*     0.0018*   0.0023*           0.0007*
  log(scdist)     0.0235    0.0332      0.0328    0.0398            0.0166
  constant       -5.0362    1.8271*     2.6355    3.1478            0.8732

  log(exp10)      0.4152    0.1815*     0.2811    0.3554            0.0631
  log(enroll)     0.0911    0.0333*     0.0389*   0.0450*           0.0000*
  lunch          -0.0273    0.0018*     0.0019*   0.0024*           0.0007*
  log(scdist)     0.0192    0.0331      0.0324    0.0396            0.0167
  constant       -4.0088    1.6817*     2.5017    3.1720            0.5806

Table C.8: OLS Regression with Conley S.E., dependent variable = log(math4/(1-math4))

  Estimates:  log(exp10) 0.4152; log(enroll) 0.0911; lunch -.0273; log(scdist) 0.0192; constant -4.0088
  H-W S.E.:   0.2811;     0.0389;             .00193;       0.0324;              2.5017

  (cut1, cut2)   log(exp10)  log(enroll)  lunch      log(scdist)  constant
  (50, 100)      0.3094      0.0432*      .00224*    0.0396       2.8198
  (100, 150)     0.3235      0.0430*      .00225*    0.0407       2.9325
  (150, 200)     0.3244      0.0397*      .00216*    0.0397       2.9270
  (200, 250)     0.3209      0.0376*      .00192*    0.0400       2.9141
  (350, 400)     0.2907      0.0331*      .00159*    0.0373       2.6737
  (400, 500)     0.2737      0.0335*      .00152*    0.0333       2.5026
  (500, 600)     0.2596      0.0319*      .00135*    0.0309       2.3823
  (600, 700)     0.2443      0.0287*      .00124*    0.0291       2.2454
  (700, 800)     0.2320      0.0263*      .00116*    0.0278       2.1394
  (800, 900)     0.2208      0.0245*      .00109*    0.0267       2.0409*
  (1000, 1000)   0.2079*     0.0222*      .00101*    0.0257       1.9330*

Table C.9: OLS Regression with Conley S.E., dependent variable = log(math4/(1-math4))

  Estimates:  log(avgexp) 0.5277; log(enroll) 0.0919; lunch -.0275; log(scdist) 0.0235; constant -5.0362
  H-W S.E.:   0.2946;      0.0387;             .00187;       0.0328;              2.6355

  (cut1, cut2)   log(avgexp)  log(enroll)  lunch      log(scdist)  constant
  (50, 100)      0.3248       0.0431*      .00218*    0.0401       2.9776
  (100, 150)     0.3345       0.0428*      .00217*    0.0411       3.0533
  (150, 200)     0.3323       0.0394*      .00208*    0.0400       3.0183
  (200, 250)     0.3292       0.0373*      .00184*    0.0405       3.0113
  (350, 400)     0.3037       0.0330*      .00147*    0.0381       2.8254
  (400, 500)     0.2856       0.0334*      .00141*    0.0342       2.6533
  (500, 600)     0.2720       0.0319*      .00125*    0.0317       2.5345*
  (600, 700)     0.2576*      0.0287*      .00116*    0.0298       2.4021*
  (700, 800)     0.2468*      0.0263*      .00109*    0.0285       2.3058*
  (800, 900)     0.2361*      0.0245*      .00102*    0.0273       2.2103*
  (1000, 1000)   0.2250*      0.0223*      .00095*    0.0263       2.1153*

Table C.10: OLS Regression with Conley S.E., dependent variable = log(math4/(1-math4))

  Estimates:  log(exp10) 0.4152; log(enroll) 0.0911; lunch -0.0273; log(scdist) 0.0192; constant -4.0088
  H-W S.E.:   0.2811;     0.0389;             0.0019;        0.0324;              2.5017

  (cut1, cut2)    log(exp10)  log(enroll)  lunch      log(scdist)  constant
  (242, 21)       0.2787      0.0389*      0.0022*    0.0368       2.5073
  (364, 41)       0.2855      0.0387*      .00236*    0.0400       2.5739
  (435, 88)       0.3076      0.0412*      .00237*    0.0473       2.8016
  (546.5, 167.5)  0.3007      0.0374*      0.0021*    0.0506       2.8079
  (649, 284)      0.3026      0.0314*      0.0016*    0.0479       2.8503
  (719, 637)      0.2478      0.0278*      0.0012*    0.0308       2.3079
  (758, 880)      0.2238      0.0250*      0.0011*    0.0268       2.0654

Table C.11: OLS Regression with Conley S.E. in Nonlinear Model

  Estimates:  log(avgexp) 0.5277; log(enroll) 0.0919; lunch -0.0275; log(scdist) 0.0235; constant -5.0362
  H-W S.E.:   0.2946;      0.0387;             0.0019;        0.0328;              2.6355

  (cut1, cut2)    log(avgexp)  log(enroll)  lunch      log(scdist)  constant
  (242, 21)       0.2952       0.0389*      0.0021*    0.0369       2.6643
  (364, 41)       0.3024       0.0387*      0.0023*    0.0399       2.7459
  (435, 88)       0.3230       0.0413*      0.0022*    0.0472       2.9710
  (546.5, 167.5)  0.3192       0.0376*      0.0020*    0.0511       3.0140
  (649, 284)      0.3274       0.0317*      0.0015*    0.0487       3.1091
  (719, 637)      0.2665*      0.0279*      0.0011*    0.0315       2.5120*
  (758, 880)      0.2386*      0.0251*      0.0010*    0.0275       2.2302*
Table C.12: APEs with Bootstrap S.E. in Nonlinear Model

  APE evaluated at:   Mean       25%        50%        75%        95%
  log(avgexp)         0.1151*    0.1193*    0.1156*    0.1100*    0.0982*
                      (.0663)    (.0683)    (.0663)    (.0636)    (.0590)
  log(enroll)         0.0200**   0.0208**   0.0201**   0.0191**   0.0171**
                      (.0083)    (.0085)    (.0084)    (.0082)    (.0078)
  lunch              -0.0060**  -0.0062**  -0.0060**  -0.0057**  -0.0051**
                      (.0004)    (.0004)    (.0004)    (.0004)    (.0004)
  log(scdist)         0.0051     0.0053     0.0052     0.0049     0.0044
                      (.0071)    (.0073)    (.0072)    (.0068)    (.0063)

  log(exp10)          0.0905     0.0940     0.0910     0.0863     0.0767
                      (.0631)    (.0652)    (.0632)    (.0603)    (.0553)
  log(enroll)         0.0199**   0.0206**   0.0200**   0.0189**   0.0168**
                      (.0084)    (.0086)    (.0084)    (.0082)    (.0077)
  lunch              -0.0060**  -0.0062**  -0.0060**  -0.0057**  -0.0051**
                      (.0004)    (.0004)    (.0004)    (.0004)    (.0004)
  log(scdist)         0.0042     0.0043     0.0042     0.0040     0.0035
                      (.0070)    (.0072)    (.0070)    (.0067)    (.0061)

  * Significant at 10% level; ** Significant at 5% level. Bootstrap standard errors in parentheses.

Figures

Figure C.1: All school districts of Michigan in 2010. (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.)

Figure C.2: All colleges of Michigan in 2010.

Figure C.3: MEAP math pass rate for 4th graders of Michigan in 2010. [Districts shaded by math4 quintile: 3.1-21.3, 21.4-33.3, 33.9-44.4, 44.6-57.6, 58.0-81.3.]

Figure C.4: Colleges and math pass rate for 4th graders of Michigan in 2010.

Figure C.5: Selection of 96 school districts.

Figure C.6: Selection of 96 school districts with centroids.

Figure C.7: Selection of 96 school districts with centroids in grids.

Figure C.8: Conley coordinates of 96 school districts.

Figure C.9: Histogram for all covariates. [Panels: exp, exp10, enroll, scdist.]

Figure C.10: APE w.r.t. average expenditure. [Nonlinear APE vs. linear OLS estimate, with 90% confidence interval, by evaluation point.]

Figure C.11: APE w.r.t. enroll. [Nonlinear APE vs. linear OLS estimate, with 95% confidence interval, by evaluation point.]

Figure C.12: APE w.r.t. lunch. [Nonlinear APE vs. linear OLS estimate, with 95% confidence interval, by evaluation point.]
Table C.13: QGLS with Conley S.E., dependent variable = math4

  Coef.          Estimate    Usual S.E.   H-W S.E.
  log(avgexp)     0.1690     0.0416*      0.0577*
  log(enroll)     0.0171     0.00705*     0.00778*
  lunch          -0.00589    0.00038*     0.00038*
  log(scdist)     0.00255    0.00707      0.00712
  constant       -1.0588     0.3862*      0.5137*

  Conley S.E. by cutoff pair:
  (cut1, cut2)   log(avgexp)  log(enroll)  lunch      log(scdist)  constant
  (50, 100)      0.0621*      0.0083       0.00041*   0.0083       0.5669
  (100, 150)     0.0634*      0.0079       0.00040*   0.0083       0.5739
  (150, 200)     0.0626*      0.0070       0.00038*   0.0079       0.5615
  (200, 250)     0.0615*      0.0065       0.00033*   0.0079       0.5550
  (350, 400)     0.0596*      0.0058*      0.00027*   0.0077       0.5469
  (400, 500)     0.0560*      0.0056*      0.00026*   0.0067       0.5131*
  (500, 600)     0.0534*      0.0053*      0.00024*   0.0063       0.4907*
  (600, 700)     0.0502*      0.0048*      0.00022*   0.0059       0.4614*
  (700, 800)     0.0479*      0.0044*      0.00021*   0.0057       0.4405*
  (800, 900)     0.0457*      0.0041*      0.0002*    0.0055       0.4205*
  (1000, 1000)   0.0433*      0.0037*      0.00018*   0.0054       0.4007*

Table C.14: QGLS with Conley S.E., dependent variable = math4 in Year 2010

  Coef.          Estimate    Usual S.E.   H-W S.E.
  log(exp10)      0.13496    0.03828      0.05759
  log(enroll)     0.0169     0.00708      0.00785
  lunch          -0.00585    0.00039      0.00039
  log(scdist)     0.00142    0.00708      0.00709
  constant       -0.74907    0.35662      0.51117

  (cut1, cut2)   log(exp10)  log(enroll)  lunch      log(scdist)  constant
  (50, 100)      0.0616*     0.0083*      0.00043*   0.0083       0.5580
  (100, 150)     0.0638*     0.0080*      0.00043*   0.0084       0.5729
  (150, 200)     0.0639*     0.0071*      0.0004*    0.0080       0.5696
  (200, 250)     0.0631*     0.0066*      0.00035*   0.0080       0.5648
  (350, 400)     0.0584*     0.0056*      0.00031*   0.0074       0.5295
  (400, 500)     0.0556*     0.0057*      0.00029*   0.0066       0.5010
  (500, 600)     0.0522*     0.0054*      0.00026*   0.0062       0.4724
  (600, 700)     0.0486*     0.0048*      0.00024*   0.0059       0.4395
  (700, 800)     0.0458*     0.0044*      0.00023*   0.0057       0.4149
  (800, 900)     0.0433*     0.0041*      0.00021*   0.0055       0.3929
  (1000, 1000)   0.0406*     0.0037*      0.0002*    0.0053       0.3694*

Table C.15: SAR GLS, dependent variable = math4

  GLS Estimates      Contiguity weight           Inverse distance weight
  Coef.              Coef.        Std. Err.      Coef.        Std. Err.
  log(avgexp)        0.1602579    0.0420131*     0.1690079    0.0426937*
  log(enroll)        0.015429     0.0069654*     0.0170857    0.0070499*
  lunch             -0.005761     0.0004002*    -0.0058864    0.0003816*
  log(scdist)        0.0034104    0.0072557      0.00255      0.0070406
  constant          -0.9785768    0.3874754     -1.058793     0.3950128
  ρ                  0.2677935    0.0590989*     0.8456347    0.1394274*
  σ²                 0.0133044    0.0008314*     0.013609     0.0008473*

Table C.16: SAR GLS, dependent variable = math4 in Year 2010

  GLS Estimates      Contiguity weight           Inverse distance weight
  Coef.              Coef.        Std. Err.      Coef.        Std. Err.
  log(exp10)         0.1245523    0.0387077*     0.1349613    0.0399863*
  log(enroll)        0.0153031    0.0069908*     0.0168974    0.0070563*
  lunch             -0.0057036    0.0004028*    -0.005847     0.0003847*
  log(scdist)        0.0023392    0.0072584      0.0014194    0.0070466
  constant          -0.6548259    0.3581062     -0.7490704    0.3715918
  ρ                  0.2638944    0.0592924*     0.8401031    0.1468289*
  σ²                 0.0134163    0.0008383*     0.0137166    0.000854*

Table C.17: SAR GLS, dependent variable = math4, contiguity

  math4     Coef.        Std. Err.    z        P>z      [95% Conf. Interval]
  lexp      0.1602579    0.0420131    3.81     0         0.0779136    0.2426021
  lenroll   0.015429     0.0069654    2.22     0.027     0.001777     0.029081
  lunch    -0.005761     0.0004002   -14.4     0        -0.0065453   -0.0049766
  lscdist   0.0034104    0.0072557    0.47     0.638    -0.0108106    0.0176313
  cons     -0.9785768    0.3874754   -2.53     0.012    -1.738015    -0.219139
  rho       0.2677935    0.0590989    4.53     0         0.1519618    0.3836253
  sigma2    0.0133044    0.0008314   16        0         0.0116749    0.014934
Table C.18: SAR GLS, dependent variable = math4, inverse distance

  math4     Coef.        Std. Err.    z        P>z      [95% Conf. Interval]
  lexp      0.1690079    0.0426937    3.96     0         0.0853298    0.252686
  lenroll   0.0170857    0.0070499    2.42     0.015     0.0032681    0.0309033
  lunch    -0.0058864    0.0003816   -15.43    0        -0.0066343   -0.0051385
  lscdist   0.00255      0.0070406    0.36     0.717    -0.0112493    0.0163492
  cons     -1.058793     0.3950125   -2.68     0.007    -1.833003    -0.2845826
  rho       0.8456345    0.1394278    6.07     0         0.572361     1.118908
  sigma2    0.013609     0.0008473   16.06     0         0.0119483    0.0152696

Table C.19: Summary of Correlation

  Overall: max 1; min 3.33E-38; mean 0.00346437; variance 0.002071544

  Row mean:
    Percentiles: 1% 0.0030197; 5% 0.0031132; 10% 0.0031967; 25% 0.0033227; 50% 0.0034535;
                 75% 0.0035881; 90% 0.0037387; 95% 0.0038485; 99% 0.004069
    Smallest: 2.85E-03, 0.0028733, 0.0029348, 0.0029758.  Largest: 0.004259, 0.0042819, 0.0043556, 0.0057957
    Obs 518; Sum of Wgt. 518; Mean 0.0034644; Std. Dev. 0.0002449; Variance 6.00E-08; Skewness 1.967204; Kurtosis 18.49351

  Row min:
    Percentiles: 1% 5.22E-38; 5% 1.34E-36; 10% 9.47E-36; 25% 9.24E-34; 50% 1.87E-29;
                 75% 1.49E-25; 90% 8.06E-23; 95% 1.47E-21; 99% 3.52E-20
    Smallest: 3.33E-38, 3.33E-38, 4.15E-38, 5.04E-38.  Largest: 4.02E-20, 4.70E-20, 1.34E-19, 1.92E-19
    Obs 518; Sum of Wgt. 518; Mean 1.35E-21; Std. Dev. 1.11E-20; Variance 1.24E-40; Skewness 13.54964; Kurtosis 209.841

  Row variance:
    Percentiles: 1% 0.0020325; 5% 0.0020376; 10% 0.0020417; 25% 0.002049; 50% 0.002062;
                 75% 0.0020927; 90% 0.0021256; 95% 0.0021453; 99% 0.0022243
    Smallest: 2.03E-03, 0.0020309, 0.0020312, 0.0020322.  Largest: 0.0022539, 0.002289, 0.0022999, 0.0023159
    Obs 518; Sum of Wgt. 518; Mean 0.0020755; Std. Dev. 0.0000389; Variance 1.51E-09; Skewness 2.174566; Kurtosis 10.42997

Table C.20: Summary of Correlation (div)

  Overall: max 1 (2nd largest .266828657); min 0.062464033; mean 0.079143268; variance 0.001712584

  Row mean:
    Percentiles: 1% 0.0694387; 5% 0.0710834; 10% 0.0731629; 25% 0.0765987; 50% 0.0794727;
                 75% 0.0817451; 90% 0.0848623; 95% 0.0865209; 99% 0.087731
    Smallest: 0.0692969, 0.0693381, 0.0693386, 0.0693432.  Largest: 0.0879126, 0.0879339, 0.0881153, 0.0881707
    Obs 518; Sum of Wgt. 518; Mean 0.0791433; Std. Dev. 0.0042475; Variance 0.000018; Skewness -0.2346403; Kurtosis 2.719421

  Row min:
    Percentiles: 1% 0.0625941; 5% 0.06363; 10% 0.063966; 25% 0.0647366; 50% 0.0656149;
                 75% 0.0668986; 90% 0.0682221; 95% 0.0692942; 99% 0.0701836
    Smallest: 0.062464, 0.062464, 0.0624661, 0.0625628.  Largest: 0.070315, 0.0703266, 0.070412, 0.0704638
    Obs 518; Sum of Wgt. 518; Mean 0.0658829; Std. Dev. 0.0016574; Variance 2.75E-06; Skewness 0.5907457; Kurtosis 2.952739

  Row variance:
    Percentiles: 1% 0.0016646; 5% 0.0016661; 10% 0.0016686; 25% 0.0016759; 50% 0.0016843;
                 75% 0.0017045; 90% 0.0017555; 95% 0.0017896; 99% 0.0018151
    Smallest: 0.0016636, 0.0016644, 0.0016644, 0.0016645.  Largest: 0.0018191, 0.0018207, 0.0018216, 0.0018218
    Obs 518; Sum of Wgt. 518; Mean 0.0016978; Std. Dev. 0.0000361; Variance 1.30E-09; Skewness 1.809396; Kurtosis 5.520294

BIBLIOGRAPHY

Abrevaya, J., (2002). Computing marginal effects in the Box-Cox model, Econometric Reviews 21, 383-393.

Ackerberg, D., Chen, X., and Hahn, J., (2012). A practical asymptotic variance estimator for two-step semiparametric estimators, Review of Economics and Statistics 94, 481-498.

Ai, C., Chen, X., (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions, Econometrica 71, 1795-1843.

Ai, C., Norton, E. C., (2000). Standard errors for the retransformation problem with heteroscedasticity, Journal of Health Economics 19, 697-718.

Ai, C., Norton, E. C., (2008). A semiparametric derivative estimator in log transformation models, Econometrics Journal 11, 538-553.
Altonji, J. G. and Matzkin, R. L., (2005). Cross section and panel data estimators for nonseparable models with endogenous regressors, Econometrica 73, 1053-1102.

Anselin, L., Florax, R. (Eds.), (1995). New Directions in Spatial Econometrics. Springer, Berlin.

Anselin, L., (2010). Thirty years of spatial econometrics, Papers in Regional Science 89(1), 3-25.

Arraiz, I., Drukker, D. M., Kelejian, H. H., Prucha, I. R., (2010). A spatial Cliff-Ord-type model with heteroskedastic innovations: small and large sample results, Journal of Regional Science 50(2), 592-614.

Bajari, P., Chernozhukov, V., Hong, H. and Nekipelov, D., (2009). Identification and efficient semiparametric estimation of a dynamic discrete game, working paper.

Banerjee, S., Carlin, B. P., Gelfand, A. E., (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman and Hall/CRC Press, Boca Raton.

Berndt, E., Showalter, M., Wooldridge, J. M., (1993). An empirical investigation of the Box-Cox model and a nonlinear least squares alternative, Econometric Reviews 12, 65-102.

Blackburn, M. L., (2007). Estimating wage differentials without logarithms, Labour Economics 14, 73-98.

Blundell, R., Powell, J., (2003). Endogeneity in nonparametric and semiparametric regression models, in M. Dewatripont, L. P. Hansen and S. J. Turnovsky (eds.), Advances in Economics and Econometrics, 312-357.

Blundell, R., Powell, J., (2004). Endogeneity in semiparametric binary response models, Review of Economic Studies 71, 655-679.

Case, A., (1991). Spatial patterns in household demand, Econometrica 59, 953-965.

Chakrabarti, R., Roy, J., (2012). Housing markets and residential segregation: impacts of the Michigan school finance reform on inter- and intra-district sorting, Federal Reserve Bank of New York Staff Reports, no. 565.

Chamberlain, G., (1980). Analysis of covariance with qualitative data, Review of Economic Studies 47, 225-238.

Chamberlain, G., (1982). Multivariate regression models for panel data, Journal of Econometrics 18, 5-46.

Chamberlain, G., (1984). Panel data, in Handbook of Econometrics, Volume 2, ed. Z. Griliches and M. D. Intriligator. Amsterdam: North Holland, 1247-1318.

Chen, X., (2007). Large sample sieve estimation of semi-nonparametric models, in J. J. Heckman and E. E. Leamer (eds.), Handbook of Econometrics, Vol. 6B, Chapter 76, North-Holland.

Conley, T. G., (1999). GMM estimation with cross sectional dependence, Journal of Econometrics 92, 1-45.

Conley, T. G., Ligon, E. A., (2002). Economic distance, spillovers, and cross country comparisons, Journal of Economic Growth 7, 157-187.

Conley, T. G., Topa, G., (2002). Socio-economic distance and spatial patterns in unemployment, Journal of Applied Econometrics 17(4), 303-327.

Conley, T. G., Dupor, B., (2003). A spatial analysis of sectoral complementarity, Journal of Political Economy 111(2), 311-352.

Conley, T. G., Molinari, F., (2007). Spatial correlation robust inference with errors in location or distance, Journal of Econometrics 140, 76-96.

Duan, N., (1983). Smearing estimate: a nonparametric retransformation method, Journal of the American Statistical Association 78, 605-610.

Engel, C., Rogers, J. H., (1996). How wide is the border?, The American Economic Review 86(5), 1112-1125.

Ferguson, T. S., (1996). A Course in Large Sample Theory, first edition. Chapman & Hall Press.

Hausman, J. A., Hall, B. H. and Griliches, Z., (1984). Econometric models for count data with an application to the patents-R&D relationship, Econometrica 52, 909-938.
Heckman, J. J., (2001). Micro data, heterogeneity, and the evaluation of public policy: Nobel lecture, Journal of Political Economy 109, 673-748.

Hirano, K., Imbens, G. W. and Ridder, G., (2003). Efficient estimation of average treatment effects using the estimated propensity score, Econometrica 71, 1161-1189.

Huber, P. J., (1967). The behavior of maximum likelihood estimates under non-standard conditions, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, CA.

Imberman, S. A., (2011). The effect of charter schools on achievement and behavior of public school students, Journal of Public Economics 95(7-8), 850-863.

Kane, T., Rouse, C. E., (1995). Labor market returns to two- and four-year college, American Economic Review 85(3), 600-614.

Kelejian, H. H., Prucha, I. R., (1999). A generalized moments estimator for the autoregressive parameter in a spatial model, International Economic Review 40(2), 509-533.

Kelejian, H. H., Prucha, I. R., (2010). Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances, Journal of Econometrics 157, 53-67.

Keller, W., Shiue, C., (2007). The origin of spatial interaction, Journal of Econometrics 140, 304-332.

Lee, L., (2004). Asymptotic distributions of quasi-maximum likelihood estimators for spatial econometric models, Econometrica 72, 1899-1926.

Lin, X., (2010). Identifying peer effects in student academic achievement by spatial autoregressive models with group unobservables, Journal of Labor Economics 28(4), 825-860.

Lin, X., Lee, L. F., (2010). GMM estimation of spatial autoregressive models with unknown heteroskedasticity, Journal of Econometrics 157, 34-52.

Li, Q., Racine, J. S., (2007). Nonparametric Econometrics: Theory and Practice. Princeton and Oxford: Princeton University Press.

Manning, W. G., (1998). The logged dependent variable, heteroscedasticity, and the retransformation problem, Journal of Health Economics 17(3), 283-295.

Mullahy, J., (1998). Much ado about two: reconsidering retransformation and the two-part model in health econometrics, Journal of Health Economics 17, 247-281.

Mundlak, Y., (1978). On the pooling of time series and cross section data, Econometrica 46, 69-85.

Newey, W. K., (1993). Efficient estimation of models with conditional moment restrictions, in G. S. Maddala, C. R. Rao, and H. D. Vinod (eds.), Handbook of Statistics, Volume 11: Econometrics.

Newey, W. K., (1994). Series estimation of regression functionals, Econometric Theory 10, 1-28.
Newey, W. K., (1997). Convergence rates and asymptotic normality for series estimators, Journal of Econometrics 79, 147-168.

Newey, W. K., McFadden, D., (1994). Large sample estimation and hypothesis testing, in R. F. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume 4. Amsterdam: North Holland, 2111-2245.

Papke, L. E., (2005). The effects of spending on test pass rates: evidence from Michigan, Journal of Public Economics 89(5-6), 821-839.

Papke, L. E., (2008). The effects of changes in Michigan's school finance system, Public Finance Review 36(4), 456-474.

Papke, L. E., Wooldridge, J. M., (1996). Econometric methods for fractional response variables with an application to 401(k) plan participation rates, Journal of Applied Econometrics 11(6), 619-632.

Papke, L. E., Wooldridge, J. M., (2008). Panel data methods for fractional response variables with an application to test pass rates, Journal of Econometrics 145, 121-133.

Park, B. U., Sickles, R. C., and Simar, L., (2007). Semiparametric efficient estimation of dynamic panel data models, Journal of Econometrics 136, 281-301.

Roy, J., (2011). Impact of school finance reform on resource equalization and academic performance: evidence from Michigan, Education Finance and Policy 6(2), 137-167.

Simcoe, T., (2008). XTPQML: Stata module to estimate fixed-effects Poisson (quasi-ML) regression with robust standard errors, RePEc online paper.

Wall, M. M., (2004). A close look at the spatial structure implied by the CAR and SAR models, Journal of Statistical Planning and Inference 121, 311-324.

Wang, H., Iglesias, E. and Wooldridge, J. M., (2013). Partial maximum likelihood estimation of a spatial bivariate probit model, Journal of Econometrics 172(1), 77-89.

White, H., (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity, Econometrica 48, 817-838.

Wooldridge, J. M., (1992a). A test for functional form against nonparametric alternatives, Econometric Theory 8, 452-475.

Wooldridge, J. M., (1992b). Some alternatives to the Box-Cox regression model, International Economic Review 33, 935-955.

Wooldridge, J. M., (1997). Multiplicative panel data models without the strict exogeneity assumption, Econometric Theory 13, 667-678.

Wooldridge, J. M., (1999). Distribution-free estimation of some nonlinear panel data models, Journal of Econometrics 90(1), 77-97.

Wooldridge, J. M., (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

Wooldridge, J. M., (2004). Estimating average partial effects under conditional moment independence assumptions, CeMMAP working paper CWP03/04, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.

Wooldridge, J. M., (2005). Unobserved heterogeneity and estimation of average partial effects, in D. W. K. Andrews and J. H. Stock (eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge: Cambridge University Press, 27-55.

Wooldridge, J. M., (2009). Introductory Econometrics: A Modern Approach, fourth edition. Cincinnati, OH: South-Western College Publishing.

Wooldridge, J. M., (2010). Econometric Analysis of Cross Section and Panel Data, second edition. Cambridge, MA: MIT Press.

Wooldridge, J. M., (2011). Solutions Manual and Supplementary Materials for Econometric Analysis of Cross Section and Panel Data, second edition. Cambridge, MA: MIT Press.