THREE ESSAYS ON ECONOMETRICS

By

Yimin Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2018

ABSTRACT

This dissertation consists of three chapters concerning the estimation and inference of multi-level models. The first chapter shows how various econometric methods can be extended to multi-level linear models. The second chapter is a natural extension of the first chapter which focuses on the estimation of nonlinear models with hierarchical structure. The third chapter provides practical guidance on choosing between the one-step GEE estimator and the two-step GEE estimator by conducting simulation studies.

More specifically, the first chapter considers a three-level linear model with both nested and non-nested structures. The stylized example is a model for the average score on some statewide test at the school level at various points in time; scores are observed at multiple points in time (level one) in multiple schools (level two) in various school districts (level three). In such a model, the schools are naturally nested within the districts, but time is not nested within schools or districts. This chapter extends several important econometric methods of analysis for panel data models to multi-level models. In particular, it generalizes the results of Hausman and Taylor and the subsequent literature to these multi-level models. This is a non-trivial extension because the three-level linear model now has more than one kind of heterogeneity and more than one kind of between regression. This chapter also generalizes the concept of testing the exogeneity assumptions by a simple variable addition test. It also provides a discussion of GMM estimation without the assumption of no conditional heteroskedasticity.

The second chapter considers the estimation of hierarchical nonlinear models.
It shows how to extend the generalized estimating equations (GEE) approach to models with more than two levels. When some covariates are correlated with the unobserved heterogeneity, the GEE estimation method can be combined with a modified Chamberlain-Mundlak device to account for the unbalanced data structure. Another important contribution is that this chapter provides a systematic discussion of the misspecification of the working correlation matrix. When the working correlation matrix is misspecified, some structured working correlation matrices may not have a well-defined limit, or their limiting matrix may not be positive semi-definite, which causes identification problems for the GEE estimator. When the conditional mean function associated with the heterogeneity has a multiplicative form, a more efficient estimator is available, which extends multivariate weighted nonlinear least squares estimation to multi-level models.

The third chapter investigates the small-sample performance of the one-step GEE estimator and the two-step GEE estimator, a comparison similar to that between the continuously updating GMM estimator and the two-step GMM estimator. The simulation study shows that these two estimators have similar finite-sample performance when the efficient working correlation matrix is used. When an inefficient working correlation matrix is adopted, the two-step GEE estimator usually has better finite-sample performance in terms of mean squared error. The simulation also shows the efficiency gains from relaxing the traditional generalized linear model (GLM) variance assumption. By allowing time-varying dispersion parameters, the efficiency of the GEE estimator can be further enhanced.

Copyright by
YIMIN YANG
2018

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to Jeffrey M. Wooldridge, the chair of my dissertation committee, for his continuous support and encouragement during these past five years.
His friendly guidance and insightful advice have been invaluable throughout all stages of the work. It was a pleasure and a great honor to study under his tutelage.

I am also very grateful to my committee members, Peter Schmidt and Tim Vogelsang, for their patience and generosity of time. I will always strive for those incredible personal and professional attributes that they have demonstrated.

My gratitude also extends to all the administrative staff who helped me to make smooth transitions throughout my Ph.D. study.

Last but not least, I would like to thank my family for all their love and trust. I am particularly indebted to my grandfather, who gave everything he had to help me pursue my dream. I would not have been able to finish this dissertation without their support.

TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1  AN ECONOMETRIC METHOD TO THE ESTIMATION OF MULTI-LEVEL MODELS (WITH PETER SCHMIDT)
    1.1 Introduction
    1.2 Model and Notation
    1.3 Comments on Alternative Notation
    1.4 GLS Estimation
    1.5 An IV Treatment of the Multi-Level Model
        1.5.1 A Hausman and Taylor-Type Treatment of the Multi-Level Model
        1.5.2 An Amemiya and MaCurdy and Breusch, Mizon and Schmidt-Type Treatment of the Multi-Level Model
    1.6 Exogeneity Tests
    1.7 GMM Estimation with or without the NCH Assumption
    1.8 Concluding Remarks
    APPENDIX
    BIBLIOGRAPHY
CHAPTER 2  AN EXTENDED GENERALIZED ESTIMATING EQUATIONS APPROACH TO NONLINEAR MODELS WITH HIERARCHICAL STRUCTURE
    2.1 Introduction
    2.2 Nonlinear Panel Data Models
    2.3 Generalized Estimating Equations Method for Nonlinear Panel Models
    2.4 Selection of Working Correlation Matrix
    2.5 Estimation of Higher-Level Nonlinear Models
        2.5.1 Extended GEE Estimation
        2.5.2 Some Examples
    2.6 Simulation
    2.7 Conclusion
    APPENDIX
    BIBLIOGRAPHY

CHAPTER 3  ONE-STEP GEE OR TWO-STEP GEE? A SIMULATION STUDY
    3.1 Introduction
    3.2 One-Step and Two-Step GEE Estimations
    3.3 Simulation Results
        3.3.1 Linear Models
        3.3.2 Nonlinear Models
    3.4 Conclusion
    APPENDIX
    BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Traditional GEE estimator and GEE estimator with time-varying dispersion parameters
Table 2.2: Pooled estimator and GEE estimator in a binary response panel model
Table 2.3: Pooled estimator and GEE estimator in a fractional response panel model
Table 2.4: Pooled estimator and EGEE estimator in a three-level binary response model
Table 2.5: EGEE and mixed-effects estimator in a three-level binary response model
Table 3.1: One-step GEE and two-step GEE for linear models (σ_c² = 1)
Table 3.2: One-step GEE and two-step GEE for linear models (σ_c² = 2)
Table 3.3: Two-step GEE estimators with and without time-varying dispersion parameters (α = 0.5)
Table 3.4: Two-step GEE estimators with and without time-varying dispersion parameters (α = 1)
Table 3.5: Coverage rate: one-step GEE and two-step GEE for linear models (σ_c² = 2)
Table 3.6: Coverage rate: two-step GEE estimators with and without time-varying dispersion parameters (α = 0.5)
Table 3.7: One-step GEE and two-step GEE for binary response models (σ_c² = 1)
Table 3.8: One-step GEE and two-step GEE for binary response models (σ_c² = 1.5)
Table 3.9: One-step GEE and two-step GEE for fractional response models (all scaled by 100)
Table 3.10: One-step GEE and two-step GEE for count response models without any scaling
Table 3.11: Coverage rate: one-step GEE and two-step GEE for binary response model (σ_c² = 1)
Table 3.12: Coverage rate: one-step GEE and two-step GEE for binary response model (σ_c² = 1.5)

CHAPTER 1

AN ECONOMETRIC METHOD TO THE ESTIMATION OF MULTI-LEVEL MODELS (WITH PETER SCHMIDT)

1.1 Introduction

In this paper we discuss the estimation of so-called “multi-level” (or “hierarchical”) models. The standard textbook panel data model with dependent variable y_it (an outcome for individual i at time t) is a two-level model, where level one is time and level two is individual. Each individual is observed for multiple time periods. In this paper we will consider a three-level model. Our stylized example will be a model for the average score on some statewide test at the school level at various points in time; scores are observed at multiple points in time (level one) in multiple schools (level two) in various school districts (level three).

There is a very large literature on multi-level models, which dates back at least to the seminal paper by Fuller and Battese (1973), in which there are individuals (level three), there are multiple measurements (level two) per individual, and there are multiple “determinations” (level one) per measurement. A good but somewhat dated overview treatment of this literature is Raudenbush and Bryk (2002). They consider (Chapter 8) a prototypical three-level model in which the dependent variable is student achievement, level one is student, level two is classroom and level three is school. A more recent general treatment is given by Garson (2012).

Most of the work on multi-level models has been done by individuals interested in educational research. There is little economic or econometric literature on the topic. Wooldridge (2010, pp.
876-883) discusses a three-level model under the heading “Cluster Samples with Unit-Specific Panel Data,” but he does not pursue any mathematical details. Also, after the semi-final draft of this paper was written, we became aware of Matyas (2017), an edited volume also in semi-final form that also deals with the econometrics of multi-level models. Several of its chapters overlap with our paper, as we will note below. Our paper and the chapters in Matyas (2017) were written independently of each other and more or less contemporaneously.

The econometric literature on panel data is clearly relevant to this topic, and we will discuss the estimation and testing of the three-level model in a way that follows the panel data literature as closely as is reasonably possible. We hope that this exposition will make multi-level models more accessible to econometricians, and also that it will make the relevant econometric results more accessible to non-economists who work on multi-level models.

We are specifically interested in extending the analysis of Hausman and Taylor (1981), and of Amemiya and MaCurdy (1986) and Breusch, Mizon and Schmidt (1989), to the three-level model. This is a non-trivial extension because now there are two time-invariant effects, there are the “within” regression and two “between” regressions, and a given variable may be correlated with either or both of the time-invariant effects. Currently existing applications of the Hausman-Taylor methods to the multi-level model, such as Rice, Jones and Goldstein (1998) and Kim and Frees (2007), do not identify all of the valid and relevant instruments and therefore do not yield asymptotically efficient estimators.

We also give a simple variable-addition form of the Hausman test of the exogeneity assumptions that we make, and we discuss Hausman-Taylor type estimation without the assumption of no conditional heteroskedasticity (NCH).
1.2 Model and Notation

We will consider the case that we have panel data covering T time periods for each of N schools, and the schools are grouped into G districts. In terms of standard panel data econometrics, this is just a panel model with cluster effects. In terms of the multi-level framework, time is level one, school is level two and district is level three.

In the text, for simplicity we will consider the balanced case in which the number of time periods is the same for all schools, and also in which the number of schools is the same for each district. We will let P = N/G denote the number of schools per district. The assumption that the number of schools is the same for each district is very unrealistic, but it simplifies the exposition considerably. In the Appendix we show the changes that are necessary to accommodate different numbers of schools per district and different numbers of time periods per school.

Asymptotics in this paper will be as G → ∞, which also implies that N → ∞. So we have many districts and many schools, but we do not require a large number of time periods or a large number of schools per district.

Let t = 1,...,T index time periods, i = 1,...,N index schools and g = 1,...,G index districts. For a given school (i), we know what district it is in, and so sometimes for clarity we will write g(i) to indicate the district to which school i belongs. We assume the following linear model:

y_{it} = x_{it}^T β + w_i^T α + z_{g(i)}^T γ + d_t^T θ + c_i + v_{g(i)} + u_{it}    (1.1)

Here y_{it} is the dependent variable (e.g. some school-wide average test score) for school i at time t; x_{it} is a vector of observable time-varying characteristics of school i at time t; w_i is a vector of observable time-invariant characteristics of school i; z_g is a vector of observable time-invariant characteristics of district g; and d_t is a vector of observable variables that depend only on time, not on school.
We assume that d_t includes dummies for time, so that we have time effects, which we treat as fixed since T is small. We also have unobserved school effects c_i and district effects v_g, to be treated as random, and an idiosyncratic error u_{it}. We will rewrite the model as

y_{it} = h_{it}^T ξ + ε_{it}    (1.2)

where h_{it}^T = (x_{it}^T, w_i^T, z_{g(i)}^T, d_t^T), ξ^T = (β^T, α^T, γ^T, θ^T) and ε_{it} = c_i + v_{g(i)} + u_{it}. Then for all NT observations we write (correspondingly to (1.1) and (1.2))

y = Xβ + Wα + Zγ + Dθ + c + v + u = Hξ + ε    (1.3)

The order of observations is: T observations for school 1; T observations for school 2; ...; T observations for school N. Thus, for example, y^T = (y_{11},...,y_{1T}; y_{21},...,y_{2T}; ...; y_{N1},...,y_{NT}). Also we will sort the school indices by district, so that the first P schools (and therefore the first PT observations) are in district 1, the next P schools are in district 2, etc.

1.3 Comments on Alternative Notation

The notation in equation (1.1) is not standard for the multilevel model. Consider, for example, the exposition in Wooldridge (2010, p. 877, equation (20.52)). His linear model is (with slight changes in notation to match ours more closely):

y_{gmt} = x_{gmt}^T β + w_{gm}^T α + z_g^T γ + d_t^T θ + c_{gm} + v_g + u_{gmt}    (1.4)

Here, as in equation (1.1), g = 1,...,G indexes districts and t = 1,...,T indexes time, but now m = 1,...,P indexes schools in district g, which is why m runs from 1 to P and not from 1 to N. This notation is very similar to that of other papers in the multi-level modeling literature, such as Fuller and Battese (1973), Rice, Jones and Goldstein (1998), and Kim and Frees (2007).

The obvious difference between this notation and ours is that in our notation a given school is represented by a single index i, whereas in Wooldridge (and the other papers cited) a given school is represented by a pair of indices (g,m). As it stands this is not a matter of substance because the models are really the same.
For example, if P = 4, the school that corresponds to i = 7 in our notation corresponds to g = 2, m = 3 in his notation. The reason that the models are the same is that in either case we have a school effect (our c_i, his c_{gm}) and a district effect (our v_{g(i)}, his v_g).

The triple-index representation would be useful if we wanted an effect for whatever “m” represents. That would not make sense in the case that we are discussing, because m is an arbitrarily assigned index for a school in a district. The second school in district 3 has nothing in common with the second school in district 5, for example, and correspondingly an effect for m = 2 would have no meaningful interpretation.

To better understand the issues involved in the number of indices used, we will use the words nested or non-nested to describe the relationship between levels of the data. In our example, level 2 (school) is nested in level 3 (district) because, if we know which school we have, we know which district it is in. That is, we can (and do) write g(i) to indicate the district in which we find school i. However, in our example, level 1 (time) is not nested in level 2 (school). If we know that an observation is from t = 1, that does not tell us what school it is from. (And similarly level 1 is not nested in level 3, because if we know that an observation is from t = 1, that does not tell us what district it is from.) So it is natural and useful to have two indices (i,t) to identify an observation. However, because g is a function of i, we do not need three indices. Also, because t = 1 has a meaning that does not depend on which school we are considering, it is not unreasonable to have a time effect. Note that the time effect can be treated as fixed because the fixed effects for time will not be collinear with the effects for higher levels, but we cannot have fixed effects for nested levels like school and district because they would be collinear.
We can contrast this with the fully nested example of Fuller and Battese (1973, p. 630). Here (in their notation) i indexes plots (level 3) of ground for an agricultural experiment, j indexes sub-plots (level 2) and k indexes sub-sub-plots (level 1), and y_{ijk} is an output measure for sub-sub-plot k of sub-plot j of plot i. This example is nested in the sense that if we know a sub-sub-plot, then we also know what sub-plot it is in and what plot it is in. (A similar example is discussed by Raudenbush and Bryk (2002, Chapter 8), who consider a three-level model in which the dependent variable is student achievement, level one is “student”, level two is “classroom” and level three is “school”. This data structure is fully nested in the same sense as above: there are multiple students in each classroom but each student is in only one classroom, and there are multiple classrooms in each school but each classroom is in only one school.) Once again, although the third-level (plot) index i is meaningful, the second-level (sub-plot) index j is not; we need two indices (i,j) to identify a sub-plot, and sub-plot j in plot 1 has nothing in common with sub-plot j in plot 2, for example.

In fact, because these examples are fully nested, it takes only one index to keep track of the data. Make a list of all of the sub-sub-plots and index them by n = 1,...,N. (So, for example, if there are five plots, four sub-plots per plot, and three sub-sub-plots per sub-plot, N = 60.) Then make a list of all of the sub-plots and index them by m = 1,...,M (where M = 20 in the simple numerical example). Note that m is a function of n, denoted m(n), and the plot index i is a function of m, say i(m), and therefore of n, namely i(m(n)). If we wish to have effects for plots, say v_i, and effects for sub-plots, say c_m, then we can simply write the model as

y_n = h_n^T ξ + c_{m(n)} + v_{i(m(n))} + u_n,  n = 1,...,N    (1.5)

Finally, we can consider completely non-nested models.
The first example in Fuller and Battese (1973) fits this description. Here level three is “individual,” there are multiple “measurements” (level two) per individual, and there are multiple “determinations” (level one) per measurement. They do not give a more tangible description, but we can consider the case that their “individuals” are schools, the “measurements” correspond to different achievement or aptitude tests like the SAT or ACT, and the “determinations” correspond to various subscores like the math or verbal portions of these tests. So y_{ijk} might be the school’s (k) average score on the math portion (i) of the SAT exam (j). This is non-nested in the obvious sense that knowing that a score is on the math portion of a test does not identify which test, and knowing which test does not identify which school.

A three-subscript representation is necessary because the model is non-nested and the different levels are meaningful in and of themselves. For example, we might want to have an effect for the math portion and an effect for the verbal portion, and another effect for the SAT and for the ACT. That is, we could have an error with the components c_i + v_j + d_k, each of which is meaningful. If we wish, these can all be fixed effects because they are not collinear with each other. A classic reference with a model with this structure is Matyas (1997). See Balazsi, Matyas and Wansbeek (2015, 2017) for more details of a fixed effects treatment of the three-level model.

1.4 GLS Estimation

We now return to the model as given in equations (1.1), (1.2) and (1.3) above. We can consider estimating the model by fixed effects, but we cannot include all of the time, school and district effects because the school effects would be collinear with the district effects. To put this another way, for a given district g, adding a constant to each of the school effects for the schools in that district, and subtracting the same constant from the district effect, leaves the sum c_i + v_{g(i)} unchanged.
This is a reflection of the fact that we have one level of nesting in the model. However, if we are interested only in controlling for the various types of effects in estimating the effect of x_{it} on y_{it}, we can simply ignore the district effects and have only time and school effects. This still leaves us with the important issue that we cannot identify the coefficients of time-invariant variables, and the coefficients of variables which vary more over schools and districts than over time will likely be estimated imprecisely.

We therefore wish to assume random school and district effects and estimate this model by generalized least squares (GLS). The algebra of GLS for this model has been known since Fuller and Battese (1973), but it will be convenient later to have given a brief derivation along the lines of Hausman and Taylor (1981). Balazsi, Baltagi, Matyas and Pus (2017) take a similar approach to ours, in a very general model. Our model is a special case of their general model, but none of the special cases that they analyze is the same as our model.

We make the following assumptions about the errors.

Assumption 1 (Strict exogeneity): E(c|H) = E(v|H) = E(u|H) = 0.

Assumption 1 implies that E(ε|H) = 0 and therefore ordinary least squares will be consistent for ξ in the regression of equation (1.3), subject to the usual technical conditions (e.g. no perfect multicollinearity) on H. However, we want to obtain more efficient estimates by GLS. To do so we need to say something about the variance matrix of ε.

Assumption 2 (No conditional heteroskedasticity, NCH): Conditional on H, (i) c, v and u are mutually independent; (ii) c_i, i = 1,...,N, are iid (0, σ_c²); (iii) v_g, g = 1,...,G, are iid (0, σ_v²); (iv) u_{it}, i = 1,...,N, t = 1,...,T, are iid (0, σ_u²).

Note that we have independence of the errors across groups. This will be automatically satisfied if we have a random sample across groups.
We do not have independence across schools, because of the group effects, and we do not have independence across time because of the school and group effects. Let Ω = V(ε). We want to derive expressions for Ω−1 and Ω−1/2, given the NCH assumption. It is easy to calculate that Ω = V(u) + V(c) + V(v) = σ 2 u INT + Tσ 2 c A + PTσ 2 v B (1.6) 7 where A = IN ⊗ ET , B = IG ⊗ EPT , and for any integer n, En is a matrix of dimension n× n with each element equal to 1 u Ωo where Ωo = INT + ( n. We write Ω = σ 2 )A + ( Tσ2c σ2u PTσ2v σ2u )B. Now we define the following: Q = INT − A = IN ⊗ (IT − ET ) R = A− B = IG ⊗ [(IP ⊗ ET )− EPT ] B = IG ⊗ EPT (as above) (1.7) We note that Q, R and B are idempotent; Q + R + B = INT ; and QR = QB = RB = 0. We also note that Q is the Hausman-Taylor matrix “Qv" that creates deviations from individual (school) means and that R + B(= A) is the Hausman-Taylor matrix “Pv" that creates individual means. For any data vector or matrix with observations in the order that we presented above, like y, we have (Qy)it = yit − ¯yi, (Ry)it = ¯yi − ¯¯yg(i) (By)it = ¯¯yg(i) (1.8) Here ¯yit is the mean value of y over time for school i, i.e. ¯yi = 1 of y over time and over the schools in district g, i.e. ¯¯yg = 1 y = Qy + Ry + By corresponds in scalar terms to yit = (yit − ¯yi) + (¯yi − ¯¯yg(i) + ¯¯yg(i). t=1 yit; and ¯¯yg is the mean value t=1 yit. So the identity T ∑T P ∑i(cid:51)g(i) 1 T ∑T As in the usual two-level model, these idempotent matrices have important properties in terms of what they do to the unobserved error components and the time-invariant and individually- invariant regressors. Specifically, Qc = Qv = Rv = 0; QW = 0, QZ = 0, RZ = 0, RD = 0. (1.9) Just above equation (1.7), we wrote Ωo in terms of I, A and B. We can rewrite this in terms of . It then follows σ2u +Tσ2c +PTσ2v σ2u +Tσ2c and φ2 = Q, R and B: Ωo = Q +φ1R +φ2B where φ1 = that σ2u σ2u Ω−1 = Q + 1 φ1 R + 1 φ2 B, Ω−1/2 o = Q + λ1R + λ2B (1.10) 8 (cid:113) 1 φ1 σu Ω−1/2 o = . 
where λ_1 = 1/√φ_1 = √(σ_u²/(σ_u² + Tσ_c²)) and λ_2 = 1/√φ_2 = √(σ_u²/(σ_u² + Tσ_c² + PTσ_v²)). Then Ω^{-1} = (1/σ_u²) Ω_o^{-1} and Ω^{-1/2} = (1/σ_u) Ω_o^{-1/2}.

These expressions for Ω^{-1} and Ω^{-1/2} have been known at least since Fuller and Battese (1973), but our simple derivation may have pedagogical value, and it serves to link our treatment of the next section to the work of Hausman and Taylor (1981).

GLS can be expressed as ξ̃_GLS = (H^T Ω^{-1} H)^{-1} H^T Ω^{-1} y, or equivalently as (H^T Ω_o^{-1} H)^{-1} H^T Ω_o^{-1} y, or equivalently as the result of OLS applied to the transformed equation

Ω_o^{-1/2} y = Ω_o^{-1/2} H ξ + Ω_o^{-1/2} ε    (1.11)

It may be useful to be explicit about the form of Ω_o^{-1/2} H. We have

(Ω_o^{-1/2} X)_{it} = (x_{it} − \bar{x}_i) + λ_1(\bar{x}_i − \bar{\bar{x}}_{g(i)}) + λ_2 \bar{\bar{x}}_{g(i)} = x_{it} − (1 − λ_1)\bar{x}_i + (λ_2 − λ_1)\bar{\bar{x}}_{g(i)}    (1.12a)
(Ω_o^{-1/2} W)_{it} = λ_1(w_i − \bar{\bar{w}}_{g(i)}) + λ_2 \bar{\bar{w}}_{g(i)} = λ_1 w_i + (λ_2 − λ_1)\bar{\bar{w}}_{g(i)}    (1.12b)
(Ω_o^{-1/2} Z)_{it} = λ_2 z_{g(i)}    (1.12c)
(Ω_o^{-1/2} D)_{it} = (d_t − \bar{d}) + λ_2 \bar{d} = d_t − (1 − λ_2)\bar{d}    (1.12d)

Given standard regularity conditions on H, plus Assumption 1 (strict exogeneity), GLS is consistent. (This is true whether or not the NCH assumption actually holds.) Given Assumption 1, OLS is also consistent, and in fact the consistency of OLS requires only the weaker exogeneity condition that E(ε_{it}|h_{it}) = 0. Given Assumption 1, the usual fixed effects (within) estimator is also consistent for β and θ (the coefficients of X and D), but we cannot estimate α or γ by fixed effects because QW = 0 and QZ = 0. The standard argument in favor of GLS is that, given also Assumption 2 (NCH), GLS is asymptotically more efficient than OLS or within.

The GLS estimator depends on σ_u², σ_c² and σ_v², and a feasible version of GLS will require consistent estimates of these variances. The estimation of these variances is discussed in the second part of the appendix.
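The algebra of Section 1.4 is easy to verify numerically for small dimensions. The following sketch (with arbitrary, illustrative variance values) checks the idempotency and orthogonality of Q, R and B, the decomposition of the identity, and the formula for Ω_o^{-1/2}:

```python
import numpy as np

G, P, T = 3, 2, 4                        # small dimensions for an exact check
N, NT = G * P, G * P * T

def E(n):
    # E_n: n x n matrix with every element equal to 1/n (an averaging matrix)
    return np.full((n, n), 1.0 / n)

A = np.kron(np.eye(N), E(T))             # averages over time within each school
B = np.kron(np.eye(G), E(P * T))         # averages over time and schools within district
Q = np.eye(NT) - A                       # deviations from school means
R = A - B                                # school means minus district means

# Idempotency, mutual orthogonality, and Q + R + B = I
for M in (Q, R, B):
    assert np.allclose(M @ M, M)
assert np.allclose(Q @ R, 0) and np.allclose(Q @ B, 0) and np.allclose(R @ B, 0)
assert np.allclose(Q + R + B, np.eye(NT))

# Omega_o = Q + phi1*R + phi2*B; check that (Q + lam1*R + lam2*B)^2 = Omega_o^{-1}
s_u, s_c, s_v = 1.0, 0.5, 0.25           # sigma_u^2, sigma_c^2, sigma_v^2 (illustrative)
phi1 = (s_u + T * s_c) / s_u
phi2 = (s_u + T * s_c + P * T * s_v) / s_u
Omega_o = Q + phi1 * R + phi2 * B
lam1, lam2 = 1 / np.sqrt(phi1), 1 / np.sqrt(phi2)
Om_half = Q + lam1 * R + lam2 * B        # Omega_o^{-1/2} as in (1.10)
assert np.allclose(Om_half @ Om_half @ Omega_o, np.eye(NT))
print("all identities verified")
```

The last assertion works because Q, R and B are orthogonal projections, so squaring Q + λ_1 R + λ_2 B simply squares the scalar weights, giving Ω_o^{-1} exactly as in (1.10).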
1.5 An IV Treatment of the Multi-Level Model

In this section we follow Hausman and Taylor (1981) and subsequent papers in considering models in which some of the regressors may be correlated with the school effects c_i and/or the district effects v_g. Consistent estimation will be by instrumental variables (IV), where the instruments are created from the regressors that are uncorrelated with one or both of the effects.

In Hausman and Taylor there is only one type of effect, which corresponds to our school effect. There are time-varying regressors (X) and time-invariant regressors (Z), and these are divided into regressors that are uncorrelated with the effects (X_1, Z_1) and those that are correlated with the effects (X_2, Z_2). Our case is philosophically similar, but we have more types of variables and two types of effects. Our classification of variables will be denoted as follows:

Group 1: X_1, W_1, Z_1, D = variables uncorrelated with (c, v)    (1.13a)
Group 2: X_2, W_2, Z_2 = variables uncorrelated with c but correlated with v    (1.13b)
Group 3: X_3, W_3, Z_2 = variables correlated with c    (1.13c)

We note two things. First, a district-specific variable is part of Z_2 if it is correlated with c or it is correlated with v. We do not need to distinguish these two cases because in either case it does not yield any valid instruments. Second, the variables D that depend only on t and not on i cannot be correlated with c or v, so they belong in group “1” above.

We now make the following exogeneity assumption (modified from Assumption 1 of Section 1.4) to reflect the classification of variables given above.

Assumption 3 (HT Exogeneity): (i) E(u|X,W,Z,D) = 0; (ii) E(c|X_1,X_2,W_1,W_2,Z_1,D) = 0; (iii) E(v|X_1,W_1,Z_1,D) = 0.

Given Assumption 3, plus standard regularity conditions on the regressors, the within estimator is consistent for β and θ (the coefficients of X and D). The ordinary least squares estimator is not consistent because some of the regressors in (1.1) are endogenous.
Consistent estimation of all of the parameters in (1.1) can be obtained if Assumption 3 implies enough valid and useful instruments.

1.5.1 A Hausman and Taylor-Type Treatment of the Multi-Level Model

Under Assumption 3, we can list a set of valid instruments that is a generalization of the Hausman and Taylor (HT) instrument set to our model. If M is a variable in group “1” above, then QM, RM and BM are valid instruments, if they are non-zero. If M is in group “2” above, then QM and RM are valid instruments, if they are non-zero. If M is in group “3” above, then QM is a valid instrument, if it is non-zero.

The restriction to non-zero instruments is obvious, and is needed because the following transformed variables are equal to zero: QW_1, QZ_1, RZ_1, RD, QW_2, QZ_2, RZ_2. Also BD is a constant and is unnecessary if Z_1 contains a constant, since then BZ_1 includes a constant. We therefore have the following list of valid and non-zero instruments:

J_HT = (QX_1, RX_1, BX_1, QD, RW_1, BW_1, BZ_1, QX_2, RX_2, RW_2, QX_3)    (1.14)

Let P_HT = J_HT(J_HT^T J_HT)^{-1} J_HT^T be the projection matrix onto the space spanned by J_HT. Then the IV estimator of ξ is ξ̂_IV = (H^T P_HT H)^{-1} H^T P_HT y. This estimator is consistent if Assumption 3 holds and if the standard rank condition holds (J_HT^T H is of full column rank). A necessary condition for this to hold is that we have at least as many instruments as regressors. If we adopt the convention that, for any data matrix M, K_M is the column dimension of M, we need the following condition to hold: K_X + 2K_{X1} + K_{X2} + K_D + 2K_{W1} + K_{W2} + K_{Z1} ≥ K_X + K_D + K_W + K_Z. With a little algebra, this becomes the following:

HT Order Condition: 2K_{X1} + K_{X2} + K_{W1} ≥ K_{W3} + K_{Z2}    (1.15)

Next we define the analogue of the Hausman-Taylor efficient estimator for our model. For this we need the following assumption, which is the modified NCH assumption for the present setting.

Assumption 4 (HT-NCH): Conditional on J_HT, (i) c, v and u are mutually independent.
(ii) c_i, i = 1, ..., N, are i.i.d. (0, σ_c^2).
(iii) v_g, g = 1, ..., G, are i.i.d. (0, σ_v^2).
(iv) u_it, i = 1, ..., N, t = 1, ..., T, are i.i.d. (0, σ_u^2).

This is the same as Assumption 2 above (NCH) except that now we are conditioning only on the instruments, which are exogenous, and not on the entire set of explanatory variables, some of which may be endogenous. Again assumption (iii) will be redundant if we randomly sample across groups.

Under Assumption 4, V(ε|J_HT) equals Ω as defined in equation (1.6) above. We then consider the transformed equation Ω_o^{-1/2} y = Ω_o^{-1/2} H ξ + Ω_o^{-1/2} ε as given in equation (1.11) above. We denote the analogue of the Hausman-Taylor "efficient" estimator as ξ̂_HT, equal to the instrumental variables estimator of the transformed model using the instrument set J_HT. Explicitly, ξ̂_HT = (H^T Ω_o^{-1/2} P_HT Ω_o^{-1/2} H)^{-1} H^T Ω_o^{-1/2} P_HT Ω_o^{-1/2} y. This estimator is referred to as "efficient" because, given the HT-NCH assumption, it is more efficient than the within estimator or the IV estimator ξ̂_IV. As we show below, it is the efficient GMM estimator given the orthogonality conditions implied by the HT-NCH assumptions.

There have been some previous papers that apply the Hausman and Taylor methodology to our three-level model. So far as we know, the closest to our paper is Rice, Jones and Goldstein (1998). Like us, they transform the model by Ω_o^{-1/2} and then seek a Hausman-Taylor type instrument set. This is all done correctly, but they do not find all of the available instruments. They distinguish three types of regressors: (1) X1, Z1 = regressors (time-varying and time-invariant, respectively) that are uncorrelated with both c and v; (2) X_2^u, Z_2^u = variables correlated with their u, which is our c; (3) X_2^v, Z_2^v = variables correlated with their v, which is also our v. Then their instrument set is J_RJG = (Q^u X1, Q^u X_2^u, Q^v X1, Q^v X_2^v, Q^v Z_2^v, X1, Z1). Here their Q^u = our Q and their Q^v = our (Q + R) = (I − B).
For purposes of comparison, we will split our X3 (variables correlated with c) as X3 = (X3*, X4), where X3* = variables correlated with c but not with v and X4 = variables correlated with both c and v. It is not clear to us whether X4 belongs in X_2^u or X_2^v or neither or both. From the verbal description given in Rice, Jones and Goldstein, it would appear that X4 belongs in both X_2^u and X_2^v. However, if X4 is part of X_2^v, then the instrument set contains (Q + R)X4, which is not a valid instrument, because [(Q + R)ε]_it contains (c_i − c̄_g(i)), which is still correlated with X4. Therefore we will proceed to analyze their instrument set under the interpretation that X4 is part of X_2^u but not part of X_2^v.

Under this interpretation, their X_2^u = (X3*, X4) = X3; their Z_2^u = (W3*, W4, Z3*, Z4) = (W3, Z3); their X_2^v = X2; and their Z_2^v = (W2, Z2). In our notation, their instrument set is equivalent to

J_RJG = (QX1, RX1, BX1, QD, W1, Z1, (Q + R)X2, RW2, QX3)   (1.16)

Comparing this to our instrument set J_HT in equation (1.14) above, there are two differences. First, they have W1, which equals (R + B)W1, whereas we have RW1 and BW1 separately. Second, they have (Q + R)X2 whereas we have QX2 and RX2 separately. In addition, if we had interpreted their definition of X_2^u to exclude X4, they would also be missing the instruments QX4.

Balazsi, Bun, Chan and Harris (2017) also provide a Hausman and Taylor-type treatment of a three-level model. Their approach is basically the same as ours, but the details differ because the models differ. Our model has fixed time effects and random school and district effects, which they refer to (Section 3.4) as a "mixed effects model." The details of their treatment differ from ours because in our model the two levels other than time (school and district) are nested, whereas in their model they are not.
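To make the construction concrete, the following Python sketch simulates a balanced three-level data set, applies the Q, R and B transformations, builds the instrument set (1.14) for a small model (one variable each of the types X1, X3, W1 and Z1; the X2, W2 and D groups are empty here, so the order condition 2K_X1 + K_X2 + K_W1 ≥ K_W3 + K_Z2 holds trivially), and computes the IV estimator ξ̂_IV = (H^T P_HT H)^{-1} H^T P_HT y. All sample sizes, coefficient values and variable names are illustrative, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
G, P, T = 200, 3, 4                      # districts, schools per district, periods
NT = G * P * T                           # data stacked by district, school, time

def Q(m):                                # deviations from school means
    a = m.reshape(G, P, T)
    return (a - a.mean(axis=2, keepdims=True)).reshape(NT)

def R(m):                                # school means minus district means
    a = m.reshape(G, P, T)
    s = np.broadcast_to(a.mean(axis=2, keepdims=True), a.shape)
    d = np.broadcast_to(a.mean(axis=(1, 2), keepdims=True), a.shape)
    return (s - d).reshape(NT)

def B(m):                                # district means
    a = m.reshape(G, P, T)
    return np.broadcast_to(a.mean(axis=(1, 2), keepdims=True), a.shape).reshape(NT)

# Effects and regressors: x1, w1, z1 exogenous; x3 correlated with the school effect c.
c = np.repeat(rng.normal(size=G * P), T)          # school effects
v = np.repeat(rng.normal(size=G), P * T)          # district effects
u = rng.normal(size=NT)                           # idiosyncratic errors
x1 = rng.normal(size=NT)                          # time-varying, exogenous (X1)
x3 = 0.8 * c + rng.normal(size=NT)                # time-varying, correlated with c (X3)
w1 = np.repeat(rng.normal(size=G * P), T)         # school-level, exogenous (W1)
z1 = np.repeat(rng.normal(size=G), P * T)         # district-level, exogenous (Z1)
y = 1.0 + x1 + x3 + w1 + z1 + c + v + u           # all true coefficients equal 1

H = np.column_stack([np.ones(NT), x1, x3, w1, z1])
# Instrument set (1.14) for this configuration: the group-1 variables contribute all
# of their non-zero transforms; the group-3 variable x3 contributes only Q x3.
J = np.column_stack([Q(x1), R(x1), B(x1), R(w1), B(w1), np.ones(NT), B(z1), Q(x3)])

JtJ = J.T @ J
PH = J @ np.linalg.solve(JtJ, J.T @ H)            # P_HT H without forming P_HT
Py = J @ np.linalg.solve(JtJ, J.T @ y)            # P_HT y
xi_iv = np.linalg.solve(H.T @ PH, H.T @ Py)
print(np.round(xi_iv, 2))                         # close to (1, 1, 1, 1, 1)
```

Pooled OLS on the same data is inconsistent for the coefficient on x3 because x3 is correlated with c; the IV estimator removes that bias by instrumenting x3 only with its within-school variation.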
1.5.2 An Amemiya and MaCurdy and Breusch, Mizon and Schmidt-Type Treatment of the Multi-Level Model

For the standard two-level panel data model, Amemiya and MaCurdy (1986) generated more instruments than Hausman and Taylor from the time-varying exogenous variables (our X1) and thus proposed a more efficient IV estimator. To express this construction simply, suppose that M is a data matrix, of dimension NT × L, say, whose rows are M_11, M_12, ..., M_1T, ..., M_NT, where each M_it is 1 × L. Then define M_i^o = (M_i1, M_i2, ..., M_iT), of dimension 1 × LT; define M_i^♯ = 1_T ⊗ M_i^o, of dimension T × LT; and define M* as the matrix of dimension NT × LT made up of the blocks (arranged vertically) M_1^♯, M_2^♯, ..., M_N^♯. Then Amemiya and MaCurdy replaced P_v X1 (in Hausman and Taylor notation) with X1*. Alternatively, Breusch, Mizon and Schmidt showed that this is equivalent to keeping the HT instrument set but adding (Q_v X1)* to it. Here the convention is that, because Q_v X1 is made up of deviations from means that add to zero, we use only the first T − 1 deviations from means. Therefore (Q_v X1)* is of dimension NT × (T − 1)K_X1, not NT × T K_X1.

For the model of this paper, the second expression is the more convenient. Therefore we define our Amemiya-MaCurdy type instrument set as

J_AM = (QX1, (QX1)*, RX1, BX1, QD, RW1, BW1, BZ1, QX2, RX2, RW2, QX3)   (1.17)

This is the same as J_HT as given in equation (1.14) above, except that we have added (QX1)*. These extra instruments are valid given Assumption 3, and so we have achieved an efficiency gain without additional assumptions.

While QX2 and QX3 are valid instruments given Assumption 3, (QX2)* and (QX3)* are not valid instruments without a further assumption. Following Breusch, Mizon and Schmidt, we make the following additional assumption.

Assumption 5 (BMS):
(i) E(x_2,it v_g(i)) is the same for all t.
(ii) E(x_3,it c_i) is the same for all t.
(iii) E(x_3,it v_g(i)) is the same for all t.

This condition implies that the time-specific deviations from means are uncorrelated with c_i and v_g(i), so that (QX2)* and (QX3)* are valid instruments. This leads us to the instrument set

J_BMS = (QX1, (QX1)*, RX1, BX1, QD, RW1, BW1, BZ1, QX2, (QX2)*, RX2, RW2, QX3, (QX3)*)   (1.18)

1.6 Exogeneity Tests

The various estimators discussed above are consistent under different exogeneity assumptions. Tests of these assumptions are called exogeneity tests, and they help researchers choose an appropriate estimation method. In this section we present a simple regression-based variable addition test, and we show the equivalence between this variable addition test and the classical Hausman specification test based on the difference between an efficient and an inefficient estimator. This is a known result for the usual two-level panel data model, but so far as we are aware it is new for the three-level model.

As mentioned in Section 1.4, the fixed effects estimator is consistent under Assumption 3(i), that E(u|X, W, Z, D) = 0, which is also necessary for the consistency of the GLS and HT estimators. A statistically significant difference between the fixed effects estimator and the HT (or GLS) estimator indicates a failure of the other exogeneity assumptions required for the HT (or GLS) estimator to be consistent. So the null hypothesis is that Assumptions 3(i), 3(ii) and 3(iii) hold, while only Assumption 3(i) is maintained under the alternative hypothesis. Another way to say this is that the null hypothesis is that the HT instruments are valid, and the alternative is that they are not.

In this section, we will focus on the comparison of the fixed effects estimator and the HT estimator, because the GLS estimator is a special case of the HT estimator when all of the explanatory variables are uncorrelated with the school and district effects.
That is, the HT estimator is GLS when X1 = X, W1 = W and Z1 = Z, so that there are no variables in Groups 2 or 3 in equation (1.13) above. Even though it is not true that P_HT = I, in this case it is still true that H^T Ω_o^{-1/2} P_HT Ω_o^{-1/2} H = H^T Ω_o^{-1} H, and so HT = GLS.

Before we proceed to any details, it is important to be specific about the set of parameters that we can compare. We will write the model as

y = H1 ξ1 + H2 ξ2 + ε   (1.19)

where H1 = (X, D), H2 = (W, Z), ξ1^T = (β^T, θ^T) and ξ2^T = (α^T, γ^T), so that H1 contains the time-varying variables. Both β and θ are identified, but the asymptotic variance of the difference between the fixed effects and HT estimators of ξ1 is singular. That is, V(ξ̂1,FE − ξ̂1,HT) = V(ξ̂1,FE) − V(ξ̂1,HT) has rank of at most K_X instead of K_X + K_D. Wooldridge (2010, Chapter 10) provides a proof of this result for the random effects and fixed effects estimators in the standard panel data model. Appendix 1.3 gives a proof of this result for the fixed effects and HT estimators in our multilevel model. For this reason we will focus on the Hausman test based on β (the coefficients of X) only.

A natural test statistic is the Hausman test statistic based on the difference between β̂_HT and β̂_FE. We will first treat the case that V(β̂_HT − β̂_FE) is nonsingular. Then the Hausman test statistic is

HT1 = (β̂_HT − β̂_FE)^T [V̂(β̂_HT − β̂_FE)]^{-1} (β̂_HT − β̂_FE)   (1.20)

Under the null hypothesis HT1 converges to a χ² distribution with K_X degrees of freedom. We can further simplify the asymptotic variance under the null if we impose the HT-NCH assumptions:

V(β̂_HT − β̂_FE) = V(β̂_FE) − V(β̂_HT)   (1.21)

because β̂_HT is efficient when the HT-NCH assumptions hold. (The efficiency of β̂_HT when the HT-NCH assumptions hold is proved explicitly in Section 1.7 below.)
Therefore, we have the following traditional Hausman test statistic:

HT2 = (β̂_HT − β̂_FE)^T [V̂(β̂_FE) − V̂(β̂_HT)]^{-1} (β̂_HT − β̂_FE)   (1.22)

HT2 has a limiting χ² distribution under the null hypothesis and the HT-NCH assumptions. This test statistic is not robust to the failure of the HT-NCH assumptions. See Wooldridge (1991) for further discussion of this point in a slightly different setting.

In the standard linear panel data model, it is well known that the Hausman test is equivalent to a variable addition test, where we add the time averages of the X's to the model and test the significance of their coefficients. The same kind of equivalence between the Hausman test HT1 and a variable addition test also holds in our multi-level linear model. The variable addition test is based on the following augmented regression model:

y = H1 ξ1 + H2 ξ2 + X̄ ψ + ε_A   (1.23)

where X̄ = (R + B)X is the matrix of the individual means of the X's and ε_A is the error term in the augmented regression. We first apply the GLS transformation to the augmented regression:

ÿ = Ḧ1 ξ1 + Ḧ2 ξ2 + X* ψ + ε̈_A   (1.24)

where (ÿ, Ḧ1, Ḧ2, X*, ε̈_A) = Ω_o^{-1/2} (y, H1, H2, X̄, ε_A). The HT estimator is the instrumental variables estimator of the transformed regression (1.24), with the HT instrument set, and with the restriction ψ = 0 imposed. We show in Appendix 1.3 that the instrumental variables estimator of β in the transformed regression with unrestricted ψ is identical to the fixed effects estimator, and that a Wald statistic testing ψ = 0 is equivalent to the Hausman statistic HT1.

The comparison between the GLS estimator and the fixed effects estimator is a special case. The GLS estimator is just the OLS estimator of the transformed regression when we impose ψ = 0. The OLS estimator of β in regression (1.24) with unrestricted ψ is identical to the fixed effects estimator.
Then the Wald statistic testing ψ = 0 is equivalent to the Hausman test HT1, which is based on the difference between the GLS estimator and the fixed effects estimator. The null hypothesis for this test is that all of the regressors are exogenous.

An advantage of the variable addition test is that we can easily make the exogeneity test robust to failure of the NCH assumptions. We simply have to use a robust estimate of V(ψ̂) to construct a valid Wald test statistic when we are not willing to assume NCH or HT-NCH.

Lastly we consider the case that V̂(β̂_FE) − V̂(β̂_HT) is singular. Nonsingularity of [V̂(β̂_FE) − V̂(β̂_HT)] requires that the number of exogeneity assumptions being tested is at least as large as K_X, so that we have enough instruments to identify ψ. If this is not true, we will require a generalized inverse, and the number of degrees of freedom of the limiting chi-squared distribution will be the rank of [V̂(β̂_FE) − V̂(β̂_HT)], not K_X. Also the form of the variable addition test is altered. These points are discussed in more detail in the Appendix.

1.7 GMM Estimation with or without the NCH Assumption

In this section we consider estimation of our model by GMM, using the moment conditions implied by the HT Exogeneity Assumption of Section 1.5. We will first consider estimation without imposing the NCH assumption (Assumption 4). Let J_HT be the set of instrumental variables given in equation (1.14). Then we have the following moment conditions:

E[J_HT^T (y − Hξ)] = 0   (1.25)

This leads to a GMM estimator

ξ̂_GMM = (H^T J_HT Λ J_HT^T H)^{-1} H^T J_HT Λ J_HT^T y   (1.26)

Here Λ is a positive definite weighting matrix. Subject to technical conditions including the HT Order Condition, this estimator is consistent regardless of the choice of Λ. Picking Λ = (J_HT^T J_HT)^{-1} yields the IV estimator discussed just below equation (1.14) in Section 1.5.1. But this is not the optimal choice of Λ. The optimal weighting matrix is (V[J_HT^T (y − Hξ)])^{-1} = (V[∑_{g=1}^G J_HT,g^T (y_g − H_g ξ)])^{-1}.
A consistent estimator of the optimal weighting matrix would be its sample analogue [(1/G) ∑_{g=1}^G J_HT,g^T (y_g − H_g ξ̂)(y_g − H_g ξ̂)^T J_HT,g]^{-1}, where ξ̂ is some preliminary consistent estimator such as the GMM estimator using the identity matrix as weighting matrix. This choice yields the optimal (minimum asymptotic variance) GMM estimator based on the moment conditions (1.25).

The weighting matrix estimator of the previous paragraph is an "unrestricted" estimator in the sense that it imposes independence over values of g but it does not impose any structure on the variance matrix of ε_g. However, now suppose that we impose the HT-NCH assumption. Then we have V[J_HT^T (y − Hξ)] = E[J_HT^T Ω J_HT], and the GMM estimator can be written as:

ξ̂_GMM = [H^T J_HT (J_HT^T Ω_o J_HT)^{-1} J_HT^T H]^{-1} H^T J_HT (J_HT^T Ω_o J_HT)^{-1} J_HT^T y   (1.27)

This estimator is consistent whether the NCH assumption actually holds or not. If the NCH assumption holds, it is asymptotically equivalent to the optimal GMM estimator discussed in the previous paragraph. Its advantage lies in the general folklore that GMM estimators using an analytically derived optimal weighting matrix have better finite sample properties than GMM estimators based on an unrestricted estimated weighting matrix.

We can use the GMM estimator in (1.27) to establish the efficiency of the HT estimator ξ̂_HT under the NCH assumption. Specifically, ξ̂_HT = ξ̂_GMM as long as the same estimate of Ω_o is used. This equality holds because

J_HT (J_HT^T Ω_o J_HT)^{-1} J_HT^T = Ω_o^{-1/2} P_HT Ω_o^{-1/2}   (1.28)

as is easily demonstrated using the facts that Ω_o^{-1/2} = Q + λ1 R + λ2 B, where Q, R and B are idempotent and orthogonal to each other, and that J_HT is of the form (QJ1, RJ2, BJ3).

The same discussion holds also if we consider the AM instrument set or the BMS instrument set. Just substitute J_AM or J_BMS for J_HT.

1.8 Concluding Remarks

In this paper we have considered the estimation of a three-level hierarchical model.
The most standard econometric panel data model corresponds to a two-level (individual and time) model. We show how the methods developed for the panel data model can be applied to the three-level hierarchical model. In the three-level model, we have the within regression and we have two different between regressions, and we give a systematic treatment of GLS and efficient IV estimation in this setting.

Part of the contribution of the paper is expository. We try to explain hierarchical models to panel data econometricians, and we also try to give a clear explanation of the econometric methodology for linear panel data models to people who use hierarchical models. However, we also provide some new theoretical results. First, we give an exhaustive list of the available instruments in the Hausman-Taylor, Amemiya-MaCurdy and Breusch-Mizon-Schmidt IV approaches to this model. Second, we show how a Hausman test of the exogeneity assumptions made for GLS or IV can be implemented using a simple variable addition test. This is a known result for the two-level model but it is new for the three-level model. Third, we show how the model can be estimated without imposing the no conditional heteroscedasticity (NCH) assumption and discuss the advantages and disadvantages of such an approach.

Obviously we have touched upon only a very small part of the panel data literature. One could make a long list of topics (dynamic models, nonlinear models, selectivity models, ...) that could profitably be extended to the multi-level setting. We hope that this paper provides a useful step in that direction.

APPENDIX FOR CHAPTER 1

Different Numbers of Schools in Different Districts

In this Appendix we show how to accommodate the realistic case that the number of schools is different for different districts. This requires only some minor modifications of our notation and does not change any of the main results in the paper.
We assume that there are P_g schools in district g and each school in district g is observed for the same T_g time periods. Let u_g denote the column vector of {u_it}_{i∈g, 1≤t≤T_g} that is properly stacked by the order of the schools of district g. Similarly we define c_g and v_g as the column vectors of c and v for district g.

In this case, we need to represent the variance matrix as Ω = diag{Ω_1, Ω_2, ..., Ω_G}, where

Ω_g = V(u_g) + V(c_g) + V(v_g) = σ_u^2 I_{P_g T_g} + T_g σ_c^2 A_g + P_g T_g σ_v^2 B_g

and where A_g = I_{P_g} ⊗ E_{T_g} and B_g = E_{P_g T_g}. With normalization by dividing by σ_u^2, we have

Ω_go = I_{P_g T_g} + (T_g σ_c^2 / σ_u^2) A_g + (P_g T_g σ_v^2 / σ_u^2) B_g.

With this notation, we can now define Q_g = I_{P_g} ⊗ (I_{T_g} − E_{T_g}) and R_g = A_g − B_g = (I_{P_g} − E_{P_g}) ⊗ E_{T_g}. It is easy to verify that Q_g, R_g and B_g are idempotent matrices and are orthogonal to each other. Similarly we can write Ω_go^{-1/2} = Q_g + λ_1g R_g + λ_2g B_g, where

λ_1g = [σ_u^2 / (σ_u^2 + T_g σ_c^2)]^{1/2} and λ_2g = [σ_u^2 / (σ_u^2 + T_g σ_c^2 + P_g T_g σ_v^2)]^{1/2}.

For the whole data set with G groups, we define Q = diag{Q_1, Q_2, ..., Q_G}, R = diag{R_1, R_2, ..., R_G} and B = diag{B_1, B_2, ..., B_G}. With this revised notation we obtain the GLS transformation matrix Ω_o^{-1/2} = diag{Ω_1o^{-1/2}, Ω_2o^{-1/2}, ..., Ω_Go^{-1/2}}, which can be represented as a linear combination of the matrices Q, R and B:

Ω_o^{-1/2} = Q + λ_1 R + λ_2 B   (A.1)

where λ_1 = diag{λ_11 I_{P_1 T_1}, λ_12 I_{P_2 T_2}, ..., λ_1G I_{P_G T_G}} and λ_2 is similarly defined. All of the other analysis in the paper goes through with this revised notation.

Estimation of the Error Variances

In practice, we do not know the true values of σ_u^2, σ_c^2 and σ_v^2, and so we need consistent estimates of these variances to construct a feasible version of the GLS or efficient HT estimator. Here we will provide two possible estimation approaches. We will focus on the case of GLS, which can be extended to the case of the efficient HT estimator with trivial changes.
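As a numerical check on the revised notation of Appendix 1.1, the following Python sketch builds Q_g, R_g and B_g for one district with arbitrary P_g and T_g, forms Ω_go and Ω_go^{-1/2} = Q_g + λ_1g R_g + λ_2g B_g, and verifies idempotency, mutual orthogonality, and that the transformation in fact inverts Ω_go. The variance values are arbitrary illustrations.

```python
import numpy as np

def district_matrices(Pg, Tg):
    """Q_g, R_g, B_g for one district with Pg schools observed for Tg periods."""
    E = lambda k: np.ones((k, k)) / k          # averaging matrix E_k
    Ag = np.kron(np.eye(Pg), E(Tg))            # school-mean operator A_g
    Bg = E(Pg * Tg)                            # district-mean operator B_g
    Qg = np.eye(Pg * Tg) - Ag                  # within-school (time) deviations
    Rg = Ag - Bg                               # school means minus district means
    return Qg, Rg, Bg

Pg, Tg = 3, 5
s2u, s2c, s2v = 1.0, 0.5, 0.25                 # sigma_u^2, sigma_c^2, sigma_v^2 (illustrative)
Qg, Rg, Bg = district_matrices(Pg, Tg)

# Omega_go = I + (Tg*s2c/s2u) A_g + (Pg*Tg*s2v/s2u) B_g, using A_g = R_g + B_g
Omega_go = (np.eye(Pg * Tg) + (Tg * s2c / s2u) * (Rg + Bg)
            + (Pg * Tg * s2v / s2u) * Bg)

lam1 = np.sqrt(s2u / (s2u + Tg * s2c))
lam2 = np.sqrt(s2u / (s2u + Tg * s2c + Pg * Tg * s2v))
Omega_go_mh = Qg + lam1 * Rg + lam2 * Bg       # Omega_go^{-1/2}

# Checks: idempotent, mutually orthogonal, and Omega_go^{-1/2} Omega_go Omega_go^{-1/2} = I
for M in (Qg, Rg, Bg):
    assert np.allclose(M @ M, M)
assert np.allclose(Qg @ Rg, 0) and np.allclose(Qg @ Bg, 0) and np.allclose(Rg @ Bg, 0)
assert np.allclose(Omega_go_mh @ Omega_go @ Omega_go_mh, np.eye(Pg * Tg))
print("all identities verified")
```

The trace facts used in the variance estimators below also hold for these matrices: trace(Q_g) = P_g(T_g − 1), trace(R_g) = P_g − 1 and trace(B_g) = 1.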
1. Error Variance Estimation Based on the Within and the Between Estimators

We first obtain a consistent estimator of σ_u^2 based on the fixed effects estimator. We consider the within-transformed regression

Qy = QHξ + Qε = QH_Q ξ_Q + Qε   (A.2)

where H_Q = (X, D) = the time-varying variables. (Recall that QW = QZ = 0.) Ordinary least squares applied to (A.2) gives the within estimator of ξ_Q = (β^T, θ^T)^T, say ξ̂_Q, which is consistent given standard regularity conditions. However, in this discussion we could also simply take the true value of ξ_Q as known, because consistent estimation of the regression coefficients does not alter the consistency of the variance estimators.

With some standard algebra, it can be shown that E(ε^T Qε) = E(u^T Qu) = σ_u^2 trace(Q). With the definition of Q in Appendix 1.1, we have trace(Q) = ∑_{g=1}^G trace(Q_g) = ∑_{g=1}^G P_g(T_g − 1). Therefore we have E(ε^T Qε) = σ_u^2 ∑_{g=1}^G P_g(T_g − 1). A consistent estimator of σ_u^2 is

σ̂_u^2 = [∑_{g=1}^G P_g(T_g − 1)]^{-1} ε̂_Q^T Q ε̂_Q   (A.3)

where ε̂_Q = y − H_Q ξ̂_Q.

We can then use the between estimator at the school level to get a consistent estimator of σ_c^2. So we consider the equation

Ry = RH_R ξ_R + Rε   (A.4)

where H_R = (X, W) and ξ_R = (β^T, α^T)^T. (Recall that RD = RZ = 0.) Ordinary least squares applied to (A.4) yields ξ̂_R. Now we have E(ε^T Rε) = trace[R E(εε^T)] = ∑_{g=1}^G trace(R_g Ω_g) = ∑_{g=1}^G (σ_u^2 + T_g σ_c^2)(P_g − 1). A consistent estimator of σ_c^2 is

σ̂_c^2 = [∑_{g=1}^G T_g(P_g − 1)]^{-1} [ε̂_R^T R ε̂_R − ∑_{g=1}^G (P_g − 1) σ̂_u^2]   (A.5)

where ε̂_R = y − H_R ξ̂_R.

Finally, using the between estimator at the district level gives a consistent estimator of σ_v^2. We consider the equation

By = BHξ + Bε   (A.6)

In this case we have E(ε^T Bε) = trace[B E(εε^T)] = ∑_{g=1}^G trace(B_g Ω_g) = ∑_{g=1}^G (σ_u^2 + T_g σ_c^2 + P_g T_g σ_v^2). We then have a consistent estimator of σ_v^2,

σ̂_v^2 = [∑_{g=1}^G T_g P_g]^{-1} [ε̂_B^T B ε̂_B − ∑_{g=1}^G (σ̂_u^2 + T_g σ̂_c^2)]   (A.7)

where ε̂_B = y − H ξ̂_B and ξ̂_B is OLS applied to (A.6).
2. Error Variance Estimation Based on the Pooled OLS Estimator

We could also use the pooled OLS estimator to construct consistent estimators of the error variances. The pooled OLS estimator can be written as ξ̂_OLS = (H^T H)^{-1} H^T y and is consistent under Assumption 1 (strict exogeneity) or under the weaker condition of contemporaneous exogeneity. Let ε̂ = y − H ξ̂_OLS be the OLS residuals. Then we can simply replace ε̂_Q in (A.3), ε̂_R in (A.5) and ε̂_B in (A.7) with ε̂ and we still have consistent estimates of the variances σ_u^2, σ_c^2 and σ_v^2. This is a reflection of the fact, noted above, that as far as consistent estimation of the variances is concerned, any consistent estimator of ξ is as good as the true value.

Proofs of Some Results of Section 1.6

This section provides proofs of the main results in Section 1.6. We will first prove that V(ξ̂1,FE) − V(ξ̂1,HT) is singular. To do so, we introduce the following notation. For any variable M, let M̈ = Ω_o^{-1/2} M = QM + (λ1 R + λ2 B)M ≡ M̃ + M*. Note that Z̈ = Z* (since QZ = 0), Ẅ = W* and D̈ = D̃ + λ2 D̄, where D̄ is constant over both i and t. Then

V(ξ̂1,FE) = σ_u^2 (H̃1^T H̃1)^{-1} = σ_u^2 [ X̃^T X̃   X̃^T D̃ ; D̃^T X̃   D̃^T D̃ ]^{-1}   (A.8)

To obtain V(ξ̂1,HT), we write the HT regression as follows:

P_HT ÿ = P_HT Ḧ1 ξ1 + P_HT Ḧ2 ξ2 + P_HT ε̈   (A.9)

Applying the Frisch-Waugh-Lovell Theorem, we have ξ̂1,HT = ((P_HT Ḧ1)^oT (P_HT Ḧ1)^o)^{-1} (P_HT Ḧ1)^oT ÿ, where (P_HT Ḧ1)^o is the part of P_HT Ḧ1 that is orthogonal to P_HT Ḧ2. It is clear that the part of P_HT D̈ so obtained is D̃, since D̄ cannot be orthogonal to P_HT Ḧ2 if H2 contains an intercept. Similarly (P_HT Ẍ)^o = X̃ + Λ, where Λ = (P_HT X*)^o is the part of P_HT X* orthogonal to P_HT Ḧ2. Note that Λ^T X̃ = 0 and Λ^T D̃ = 0. Then

V(ξ̂1,HT) = σ_u^2 [ X̃^T X̃ + Λ^T Λ   X̃^T D̃ ; D̃^T X̃   D̃^T D̃ ]^{-1}   (A.10)

and so

V(ξ̂1,FE) − V(ξ̂1,HT) = σ_u^{-2} V(ξ̂1,HT) [ Λ^T Λ   0 ; 0   0 ] V(ξ̂1,FE)

Then it is the case that V(ξ̂1,FE) − V(ξ̂1,HT) is singular. It has rank of at most K_X instead of K_X + K_D.
Next we prove the equivalence of the variable addition test and the Hausman test HT1. The augmented regression for the variable addition test, transformed by the projection matrix P_HT, can be written as follows:

P_HT ÿ = P_HT Ḧ1 ξ1 + P_HT Ḧ2 ξ2 + P_HT X* ψ + P_HT ε̈_A   (A.11)

When we impose ψ = 0, the OLS estimator of the restricted regression gives the HT estimator ξ̂_HT. With unrestricted ψ, it is easy to see that the OLS estimator of ξ1 is the fixed effects estimator, since the part of P_HT Ḧ1 that is orthogonal to (P_HT Ḧ2, P_HT X*) is just (X̃, D̃). To simplify the analysis we can first project out P_HT Ḧ2, so that the regression is just P_HT ÿ = (P_HT Ḧ1)^o ξ1 + (P_HT X*)^o ψ + error. Using a generic result on restricted regression, we have the following equality:

ξ̂1,HT = ξ̂1,FE + ((P_HT Ḧ1)^oT (P_HT Ḧ1)^o)^{-1} (P_HT Ḧ1)^oT (P_HT X*)^o ψ̂   (A.12)

With a little more algebra, we can calculate the difference between the fixed effects and HT estimators of β:

β̂_HT − β̂_FE = [X̃^T X̃ − X̃^T D̃ (D̃^T D̃)^{-1} D̃^T X̃ + (P_HT X*)^oT (P_HT X*)^o]^{-1} (P_HT X*)^oT (P_HT X*)^o ψ̂   (A.13)

Let M = X̃^T X̃ − X̃^T D̃ (D̃^T D̃)^{-1} D̃^T X̃ + (P_HT X*)^oT (P_HT X*)^o and A = (P_HT X*)^oT (P_HT X*)^o. Then we can write β̂_HT − β̂_FE = M^{-1} A ψ̂. Suppose first that A is nonsingular. Then ψ̂ = A^{-1} M (β̂_HT − β̂_FE), and it follows that V(ψ̂) = A^{-1} M V(β̂_HT − β̂_FE) M A^{-1}. Therefore the statistic for the variable addition test is

ψ̂^T V(ψ̂)^{-1} ψ̂ = (β̂_HT − β̂_FE)^T M A^{-1} [A^{-1} M V(β̂_HT − β̂_FE) M A^{-1}]^{-1} A^{-1} M (β̂_HT − β̂_FE)
= (β̂_HT − β̂_FE)^T [V(β̂_HT − β̂_FE)]^{-1} (β̂_HT − β̂_FE) = HT1   (A.14)

However, the matrix A can be singular. The rank of A is the same as the rank of (P_HT X*)^o, which has K_X columns. We will write P_HT = P(QJ1) + P(RJ2) + P(BJ3) where J1 = H1, J2 = (X1, W1, X2, W2), J3 = (X1, W1, Z1).
Since P(QJ1)(X*)^o = 0, apart from standard issues of multicollinearity the rank of (P_HT X*)^o is determined by the rank of (P(RJ2) + P(BJ3)), and nonsingularity of A requires that the number of variables in J2 plus the number of variables in J3 be greater than or equal to K_X. That is, nonsingularity of A requires that the number of exogeneity assumptions being tested must be at least K_X.

To simplify notation, let ∆ = β̂_HT − β̂_FE and V_∆ = V(∆). Then it is still the case that M∆ = A ψ̂, but when A is singular so is V_∆, and we will need to use generalized inverses. As a matter of notation, let r = rank(A) = rank(V_∆).

For arbitrary A, consider the following conditions on a matrix G: (i) AGA = A; (ii) GAG = G; (iii) AG and GA are symmetric. Then the following are standard definitions. (a) G is a generalized inverse of A if (i) holds, and we write G = A^-. (b) G is a reflexive generalized inverse of A if (i) and (ii) both hold, and we write G = A_r^-. (c) G is the Moore-Penrose pseudo inverse of A if (i), (ii) and (iii) hold, and we write G = A^+. The relevance of these facts to our discussion is the fact that, if Y ~ N(0, Σ), then Y^T Σ_r^- Y is distributed as chi-squared with r degrees of freedom, where r = rank(Σ). See Rao and Mitra (1972), p. 615, Corollary 7.1 and the discussion in the following paragraph.

When V_∆ is singular, it is well known that the Hausman test statistic HT1 = ∆^T V_∆^+ ∆ and that it is asymptotically distributed as chi-squared with r degrees of freedom. We now want to show that this is equivalent to a variable addition test, but this test has to be a test of the hypothesis that Aψ = 0, not just ψ = 0, since Aψ can be estimated but Aψ = 0 does not imply ψ = 0 when A is singular. That is, the test must depend on A ψ̂, not ψ̂. So our test statistic will be (A ψ̂)^T [V(A ψ̂)]_r^- (A ψ̂). But A ψ̂ = M∆ and V(A ψ̂) = M V_∆ M, so the variable addition test statistic is equal to ∆^T M (M V_∆ M)_r^- M∆, where (M V_∆ M)_r^- is a reflexive generalized inverse of M V_∆ M.
It is easy to verify that M^{-1} V_∆^+ M^{-1} is such a reflexive generalized inverse, since M V_∆ M · M^{-1} V_∆^+ M^{-1} · M V_∆ M = M V_∆ M and M^{-1} V_∆^+ M^{-1} · M V_∆ M · M^{-1} V_∆^+ M^{-1} = M^{-1} V_∆^+ M^{-1}. So the variable addition test statistic equals ∆^T M · M^{-1} V_∆^+ M^{-1} · M∆ = ∆^T V_∆^+ ∆ = HT1.

BIBLIOGRAPHY

Amemiya, T. and T.E. MaCurdy (1986). Instrumental Variable Estimation of an Error Components Model, Econometrica, 54, 869-880.

Balazsi, L., L. Matyas and T. Wansbeek (2015). The Estimation of Multidimensional Fixed Effects Panel Data Models, Econometric Reviews, published online, DOI:10.1080/07474938.2015.1032164.

Balazsi, L., L. Matyas and T. Wansbeek (2017). Fixed Effects Models, Chapter 1 of The Econometrics of Multi-dimensional Panels (L. Matyas, ed.), Springer-Verlag, forthcoming.

Balazsi, L., B.H. Baltagi, L. Matyas and D. Pus (2017). Random Effects Models, Chapter 2 of The Econometrics of Multi-dimensional Panels (L. Matyas, ed.), Springer-Verlag, forthcoming.

Balazsi, L., M. Bun, F. Chan and M.N. Harris (2017). Models with Endogenous Regressors, Chapter 3 of The Econometrics of Multi-dimensional Panels (L. Matyas, ed.), Springer-Verlag, forthcoming.

Breusch, T.S., G.E. Mizon and P. Schmidt (1989). Efficient Estimation Using Panel Data, Econometrica, 57, 695-700.

Chamberlain, G. (1980). Analysis of Covariance with Qualitative Data, Review of Economic Studies, 47, 225-238.

Fuller, W.A. and G.E. Battese (1973). Transformations for Estimation of Linear Models with Nested-Error Structure, Journal of the American Statistical Association, 68, 626-632.

Garson, G.D. (2012). Hierarchical Linear Modeling: Guide and Applications, Sage Publications.

Kim, J.-S. and E.W. Frees (2007). Multilevel Modeling with Correlated Effects, Psychometrika, 72, 505-533.

Hausman, J.A. and W.E. Taylor (1981). Panel Data and Unobservable Individual Effects, Econometrica, 49, 1377-1398.

Matyas, L. (1997). Proper Econometric Specification of the Gravity Model.
World Economy, 20, 363-369.

Matyas, L. (ed.) (2017). The Econometrics of Multi-dimensional Panels, Springer-Verlag, forthcoming.

Mundlak, Y. (1978). On the Pooling of Time-Series and Cross-Section Data, Econometrica, 46, 69-86.

Rao, C.R. and S.K. Mitra (1972). Generalized Inverse of a Matrix and Its Applications, Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 1, 601-620.

Raudenbush, S.W. and A.S. Bryk (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd edition, Sage Publications.

Rice, N., A.M. Jones and H. Goldstein (1998). Multilevel Models Where the Random Effects Are Correlated with the Fixed Predictors: A Conditioned Iterative Generalised Least Squares Estimator (CIGLS), Multilevel Modeling Newsletter, 10, 10-14.

Wooldridge, J.M. (1991). On the Application of Robust, Regression-Based Diagnostics to Models of Conditional Means and Conditional Variances, Journal of Econometrics, 47, 5-46.

Wooldridge, J.M. (2010). Econometric Analysis of Cross Section and Panel Data, 2nd edition, MIT Press.

CHAPTER 2

AN EXTENDED GENERALIZED ESTIMATING EQUATIONS APPROACH TO NONLINEAR MODELS WITH HIERARCHICAL STRUCTURE

2.1 Introduction

The generalized estimating equations (GEE) method was first proposed by Liang and Zeger (1986) for the analysis of panel data models. It is also common to see applications of the GEE method in models with cluster data. See, for example, Entorf and Sattarova (2016). The GEE estimation method has been well developed for the two-level model, which includes the panel data model as a special case. When the model has a more complicated nested structure, mixed-effects estimation is widely used rather than the GEE method. The main contribution of this paper is to extend GEE estimation to higher-level models, which serves as an alternative to mixed-effects estimation for multi-level models.
Compared with the usual mixed-effects estimator in multi-level models, the GEE estimator is robust to certain misspecifications. Because the mixed-effects estimator is based on maximum likelihood estimation, the computational burden escalates as the dimension of the heterogeneity increases. For some nonlinear models, the fractional response model for example, it is difficult to specify the underlying distribution needed to obtain mixed-effects estimators. Detailed discussions of mixed-effects models can be found in Raudenbush and Bryk (2002) and De Leeuw and Meijer (2008). In comparison with the pooled estimator, our simulation results show that the GEE estimator has good efficiency properties under various data generating processes. I show in detail how to obtain the GEE estimator in multi-level nonlinear models.

The second contribution of this paper is to provide a systematic discussion of the selection of the working correlation matrices that are needed to implement GEE estimators. An important ingredient of GEE estimation is the use of a working correlation matrix to approximate the correlations within a group. A necessary assumption for the consistency of the GEE estimator, as stated in Liang and Zeger (1986), is that the chosen working correlation matrix should converge to some nonrandom positive semi-definite matrix. Crowder (1995) exhibits a special case in which a misspecified working correlation matrix does not have a well-defined limit. To date, there is no systematic analysis of this issue. This paper examines the limits of several popular working correlation matrices under different misspecification settings. The results provide some useful guidelines for the application of the GEE method.

This paper also shows how to apply the GEE method to hierarchical nonlinear models with unbalanced data structure within the correlated random effects framework.
As suggested in Wooldridge (2010) for the panel binary response model, when the heterogeneity is correlated with the covariates, a modified Chamberlain-Mundlak device can be used to accommodate an unbalanced data structure. In multi-level models, we can have more than one source of heterogeneity. I extend the treatment of Wooldridge (2010) for the panel data model to the multi-level model with an unbalanced data structure and show how the GEE estimation method can be applied to the modified conditional mean function. Another contribution of this paper to the GEE literature is that we can further enhance the efficiency of the GEE estimator by allowing time-varying dispersion parameters. Traditional GEE estimation uses a time-constant dispersion parameter in the so-called generalized linear model variance assumption. In the panel data model, or more generally, for models with a non-nested data structure, we have the flexibility to allow time-varying dispersion parameters, each of which can be precisely estimated. Simulation evidence shows that this is a nontrivial efficiency improvement over the traditional GEE estimator. This paper also extends the multivariate weighted nonlinear least squares (MWNLS) method for the panel data model to multi-level models. In some special cases, we can obtain a more efficient estimator by using the multivariate weighted nonlinear least squares method. I illustrate this with the multi-level count response model, in which we can explicitly calculate the within-group correlations. The resulting MWNLS estimator is more efficient than the GEE estimator under certain assumptions. My stylized example in this paper is a model for passing rates of a standardized test at the school level at various time points. Data on passing rates and other variables are collected from schools in various districts at multiple points in time. This is a three-level model where time is level one, school is level two, and district is level three.
The model is set up to have both nested and non-nested structures. Schools in the same district are nested, and we can use a school's identity to track which district it is in. On the other hand, time is not nested within school or district. Throughout this paper, I focus on binary/fractional response models and count response models for expository purposes, though most methods are applicable to other nonlinear models. For the asymptotic analysis, I assume that the number of groups (districts) goes to infinity while the group sizes and time periods are finite. The remainder of the paper is organized as follows. In section 2.2, we first briefly review existing econometric methods for nonlinear panel models. Applications of the GEE method in nonlinear panel models and some extensions are discussed in section 2.3. Section 2.4 provides a discussion of the selection of working correlation matrices. I consider three-level nonlinear models with both nested and non-nested structures in section 2.5. Simulation results are presented in section 2.6. Section 2.7 concludes the paper.

2.2 Nonlinear Panel Data Models

This section briefly summarizes some existing estimation methods for nonlinear panel data models. For nonlinear models, we are interested not only in the structural parameters, which give us the direction of the effects and the relative effects, but also in partial effects and average partial effects (APEs). Instead of proceeding with abstract nonlinear models, I use the binary/fractional response model and the count response model for illustration.

Example 1 (binary/fractional response panel models): The binary response model and the fractional response model are closely related. When we use the same underlying distribution function, they have the same conditional mean.

y_{it} = 1[x_{it}\alpha + z_i\beta + c_i + u_{it} > 0]    (2.1)

where the subscript i, which runs from 1 to N, indexes the individuals. For individual i, t = 1, 2, ..., T_i; thus, an unbalanced panel structure is allowed.
The unobserved group heterogeneity c_i captures the within-group correlations, and u_{it} is the usual idiosyncratic error term. The random effects binary/fractional response panel model maintains the following assumptions:

D(y_{it} \mid x_i, z_i, c_i) = D(y_{it} \mid x_{it}, z_i, c_i)    (2.2)

D(y_i \mid x_i, z_i, c_i) = \prod_{t=1}^{T_i} D(y_{it} \mid x_i, z_i, c_i)    (2.3)

c_i \mid x_i, z_i \sim \text{Normal}(0, \sigma_c^2)    (2.4)

where y_i is a vector that contains all y_{it} for individual i, and x_i is similarly defined. Equation (2.2) is the strict exogeneity condition in the panel data model. If we further assume that P(y_{it} = 1 \mid x_{it}, z_i, c_i) = \Phi(x_{it}\alpha + z_i\beta + c_i), where \Phi(\cdot) is the cumulative distribution function of the standard normal random variable, then we have the so-called random effects probit model. A straightforward application of maximum likelihood estimation gives consistent estimators of all parameters as well as the partial effects. When the dependent variable y_{it} has range [0,1] and we are interested in estimating the conditional mean function, we can make the following assumption:

E(y_{it} \mid x_i, z_i, c_i) = E(y_{it} \mid x_{it}, z_i, c_i) = G(x_{it}\alpha + z_i\beta + c_i)    (2.5)

This is the strict exogeneity assumption stated in terms of the conditional mean. If assumption (2.4) holds, and we use the standard normal distribution function for G(\cdot), then with some algebra we can integrate out c_i and get E(y_{it} \mid x_i, z_i) = \Phi(x_{it}\alpha_c + z_i\beta_c), where \alpha_c and \beta_c are the original parameters scaled down by (1 + \sigma_c^2)^{1/2}. These are the parameters that define the average structural function (ASF), which leads to the estimation of the APEs. We can use either the pooled nonlinear least squares method or the pooled quasi-maximum likelihood method to obtain consistent estimators of the scaled coefficients. For hypothesis testing, robust variance estimators are available in both estimation methods.
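The scaling result above can be checked numerically. The sketch below (with made-up values for a single slope α and the heterogeneity standard deviation σ_c) integrates c_i out of the probit response probability by Monte Carlo and compares the answer with the closed form Φ(xα_c), where α_c = α/(1 + σ_c²)^{1/2}:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, sigma_c = 0.8, 1.5          # hypothetical slope and heterogeneity s.d.
x = 0.6                            # evaluate the response probability at one covariate value

# Monte Carlo: integrate c out of P(y = 1 | x, c) = Phi(x*alpha + c), c ~ N(0, sigma_c^2)
c = rng.normal(0.0, sigma_c, size=1_000_000)
p_mc = norm.cdf(x * alpha + c).mean()

# Closed form: the original coefficient scaled down by (1 + sigma_c^2)^{1/2}
alpha_c = alpha / np.sqrt(1.0 + sigma_c ** 2)
p_cf = norm.cdf(x * alpha_c)

print(p_mc, p_cf)                  # the two probabilities agree up to simulation error
```

The same mixing-of-normals device underlies the heteroskedastic probit estimating equations that follow, where the scaling factor is allowed to vary with T_i.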
Correlated random effects binary/fractional response panel models. When we suspect that the covariates are correlated with the heterogeneity, we can use either the Chamberlain (1980) or Mundlak (1978) device to model the conditional distribution of c_i as a function of the covariates. Here I adopt the Mundlak device to conserve parameters. The difficulty in modeling D(c_i \mid x_i, z_i) is that we have an unbalanced panel structure. Wooldridge (2010) shows that under a joint normality assumption on (c_i, x_i, z_i), both the conditional mean and the conditional variance of c_i depend on T_i, and he proposes several flexible parametric approaches to model the group heterogeneity. The key point is that we should at least allow D(c_i \mid x_i, z_i) to depend on T_i. If there is little variation in T_i across i, we can model the unobserved heterogeneity as follows:

c_i = \phi_{1T_i} + \bar{x}_i \phi_{2T_i} + \varepsilon_i, \quad \varepsilon_i \mid x_i, z_i \sim \text{Normal}(0, \sigma_{cT_i}^2)    (2.6)

where \bar{x}_i is the time average of x_{it}. With assumptions (2.2) and (2.6), we can use the pooled heteroskedastic probit method to identify all the parameters. Specifically, we can write the conditional probability as:

P(y_{it} = 1 \mid x_i, z_i) = \Phi\left( \frac{x_{it}\alpha + z_i\beta + \phi_{1T_i} + \bar{x}_i \phi_{2T_i}}{\sqrt{1 + \sigma_{cT_i}^2}} \right)    (2.7)

To implement the estimation, we can first create dummies for all but one value of T_i. We can then interact them with \bar{x}_i and include them in the conditional mean function as in equation (2.7). With these consistent estimators, we can estimate the average structural function (ASF), from which we can estimate the average partial effects. When we have many different group sizes, it might be too costly to include all these group-size dummies. Wooldridge (2010) suggests several flexible parametric functions to model the conditional mean and the conditional variance of the group heterogeneity.
For the correlated random effects fractional response model, we get the same estimating equation if we combine assumption (2.6) with (2.5):

E(y_{it} \mid x_i, z_i) = E[\Phi(x_{it}\alpha + z_i\beta + \phi_{1T_i} + \bar{x}_i \phi_{2T_i} + \varepsilon_i) \mid x_i, z_i] = \Phi\left( \frac{x_{it}\alpha + z_i\beta + \phi_{1T_i} + \bar{x}_i \phi_{2T_i}}{\sqrt{1 + \sigma_{cT_i}^2}} \right)    (2.8)

The estimation is implemented in the same way as for the pooled heteroskedastic binary response model addressed above. For fixed values of (x, z), the average structural function can be expressed as:

ASF(x, z) = E_{\bar{x}_i}\left[ \Phi\left( \frac{x\alpha + z\beta + \phi_{1T_i} + \bar{x}_i \phi_{2T_i}}{\sqrt{1 + \sigma_{cT_i}^2}} \right) \right]    (2.9)

With consistent estimators from the nonlinear least squares estimation or the quasi-maximum likelihood estimation, the ASF can be consistently estimated by

\widehat{ASF}(x, z) = N^{-1} \sum_{i=1}^{N} \Phi\left( \frac{x\hat{\alpha} + z\hat{\beta} + \hat{\phi}_{1T_i} + \bar{x}_i \hat{\phi}_{2T_i}}{\sqrt{1 + \hat{\sigma}_{cT_i}^2}} \right)    (2.10)

The APEs can be obtained by taking partial derivatives for continuous variables or by differencing for discrete variables. The bootstrap can be used to obtain valid inference for the APEs.

Example 2 (count response panel models): Many response variables in economics are integers: for example, years of education, the number of patent applications (Hausman, Hall, and Griliches (1984)), the number of new firms in an industry (Papke (1991)), and so on. Nonlinear models for count response variables have been well developed for panel data. Suppose the dependent variable of interest y_{it} takes nonnegative integer values and we model the conditional mean as follows:

E(y_{it} \mid x_i, z_i, c_i) = E(y_{it} \mid x_{it}, z_i, c_i) = c_i G(x_{it}\alpha + z_i\beta)    (2.11)

This equation assumes strict exogeneity as well as correct specification of the conditional mean. One popular choice for G(\cdot) is the exponential function for an unbounded count variable. Notice that equation (2.11) has a multiplicative form of the group heterogeneity, which is usually assumed in count panel data models.
As a consequence of the multiplicative heterogeneity, the elasticities and semi-elasticities do not depend on the unobserved effects. In the random effects framework, we assume the unobserved effect is mean independent of all covariates. Without loss of generality, we can normalize its mean to one, as shown in equation (2.12). Such a normalization is not restrictive in the sense that it does not affect the estimation results for the slope parameters.

E(c_i \mid x_i, z_i) = 1    (2.12)

Combining equations (2.11) and (2.12), we can integrate out the group heterogeneity and obtain the conditional mean E(y_{it} \mid x_i, z_i). If we make parametric assumptions on the conditional mean function, we can proceed with the pooled nonlinear least squares or the pooled quasi-maximum likelihood estimation. For inference, we should use robust standard errors since the score functions are serially correlated due to the presence of the individual heterogeneity. Other, more efficient estimation methods are available; see Wooldridge (2010) for applications of the multivariate weighted nonlinear least squares method to count response panel models. Fixed-effects estimators are also available, but we would not be able to obtain partial effects.

2.3 Generalized Estimating Equations Method for Nonlinear Panel Models

We have seen that pooled methods, including pooled NLS and pooled QMLE, can be applied to nonlinear panel models to estimate scaled parameters that identify the partial effects. Pooled methods are convenient to implement but are unlikely to be efficient, as they ignore the within-group correlations. In this section, I first briefly introduce the generalized estimating equations (GEE) method, which was introduced by Liang and Zeger (1986). The GEE method tries to enhance efficiency without imposing any stronger conditions. I then show how to extend the GEE estimation method to heteroskedastic binary response panel models.
Another extension of GEE estimation in panel data models is to allow time-varying dispersion parameters, which can lead to further efficiency improvements. In the previous section, we saw that our task reduces to estimating the conditional mean E(y_{it} \mid x_i, z_i). For the random effects estimation, E(y_{it} \mid x_i, z_i) = m(x_i, z_i; \theta), and for the correlated random effects estimation, E(y_{it} \mid x_i, z_i) has T_i included in z_i. As discussed earlier, the GEE method is intended to improve efficiency by approximating the within-group correlations, and it explicitly acknowledges that the approximation is different from the true conditional covariance matrix. As a result, we should always make inference robust to misspecification of the chosen correlation matrix. There is no guarantee that GEE estimators are more efficient than pooled estimators, but the idea is that a rough approximation of the within-group correlations, which are ignored in the pooled estimation, can improve efficiency. To proceed, I first define a diagonal matrix:

V(x_i, z_i; \theta, \sigma_o^2) = \text{diag}\{\text{var}^*(y_{i1} \mid x_i, z_i), \text{var}^*(y_{i2} \mid x_i, z_i), ..., \text{var}^*(y_{iT_i} \mid x_i, z_i)\}    (2.13)

Notice that the conditional variance var^*(y_{it} \mid x_i, z_i) used in (2.13) can be different from the true conditional variance var(y_{it} \mid x_i, z_i). The consistency of the GEE estimator does not require correct specification of the conditional variance function. If we have a binary dependent variable, then var^*(y_{it} \mid x_i, z_i) = var(y_{it} \mid x_i, z_i) provided that the conditional mean is correctly specified:

\text{var}(y_{it} \mid x_i, z_i) = E(y_{it} \mid x_i, z_i)[1 - E(y_{it} \mid x_i, z_i)]    (2.14)

If the dependent variable is fractional, and we adopt the Bernoulli distribution as the underlying distribution, the conditional variance can be written as:

\text{var}^*(y_{it} \mid x_i, z_i) = \sigma_o^2 E(y_{it} \mid x_i, z_i)[1 - E(y_{it} \mid x_i, z_i)]    (2.15)

where \sigma_o^2 is called the dispersion parameter. If we have a count response variable, the GEE approach associated with the Poisson distribution assumption would maintain the following variance assumption:

\text{var}^*(y_{it} \mid x_i, z_i) = \sigma_o^2 E(y_{it} \mid x_i, z_i)    (2.16)

Equations (2.15) and (2.16) are called the generalized linear model (GLM) variance assumption in the statistics literature. For panel data models, one possible extension is to allow the dispersion parameter \sigma_o^2 to vary with time. Since we have a fixed number of time periods, we can obtain precise estimates of the dispersion parameter in each time period. Take the random effects count response model as an example: if we have some preliminary estimators of the conditional mean, G(x_{it}\check{\alpha} + z_i\check{\beta}), the dispersion parameter \sigma_{ot}^2 can be consistently estimated by

\hat{\sigma}_{ot}^2 = \frac{1}{N - K} \sum_{i=1}^{N} \frac{\check{v}_{it}^2}{G(x_{it}\check{\alpha} + z_i\check{\beta})}

where \check{v}_{it} = y_{it} - G(x_{it}\check{\alpha} + z_i\check{\beta}) and K is the number of parameters to be estimated. Simulation results show that allowing different dispersion parameters across time usually leads to more efficient GEE estimators. Again, the consistency of GEE estimators only requires correct specification of the conditional mean; we could be arbitrarily wrong about the second moment assumptions. If we do not assume strict exogeneity, we can use pooled NLS or quasi-MLE methods to estimate the scaled coefficients, the so-called "population average" parameters. Consistency would only require the conditional mean of the marginal distribution to be correctly specified. For binary response models, if we add the strict exogeneity and conditional independence assumptions, we can obtain the full MLE estimator, which is more efficient than the pooled estimators. A middle ground between the pooled method and full maximum likelihood estimation is the generalized estimating equations (GEE) method. For GEE to work, we need to impose the strict exogeneity assumption. The purpose of using GEE is to recover some efficiency when we do not assume conditional independence.
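As an illustration of the time-varying dispersion estimator described above, the sketch below simulates a count response with a hypothetical exponential mean and computes the period-by-period dispersion estimates; under a true Poisson distribution each estimate should be close to one. To keep the sketch short, the residuals are formed at the true mean rather than at a first-step estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 2000, 4, 2
x = rng.normal(size=(N, T))
alpha = 0.5
mu = np.exp(0.1 + alpha * x)       # conditional mean G(.) with made-up parameters
y = rng.poisson(mu)                # under Poisson, var = mu, so sigma2_t = 1 for all t

v = y - mu                         # residuals (true mean used in place of a first-step fit)
# Time-varying dispersion parameters: sigma2_t = (N - K)^{-1} sum_i v_it^2 / G_it
sigma2_t = (v ** 2 / mu).sum(axis=0) / (N - K)
print(sigma2_t)                    # each entry should be close to 1 here
```

With overdispersed data (for example, a gamma-mixed Poisson whose mixing variance changes over t), the same formula would recover a different dispersion in each period, which is exactly the flexibility the extension exploits.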
GEE starts by specifying a set of conditional mean functions. In what follows, I focus on the correlated random effects fractional response panel model:

E(y_i \mid x_i, z_i) = m(x_i, z_i; \theta)    (2.17)

where m(x_i, z_i; \theta) is a T_i \times 1 vector whose t-th element is E(y_{it} \mid x_i, z_i) as specified in equation (2.8). As discussed in section 2.2, we include T_i as an extra argument in z_i in the correlated random effects framework. In contrast to pooled methods, the GEE approach accounts for the neglected within-group correlations by using the so-called "working" variance matrix, denoted W(x_i, z_i; \gamma). The working matrix can be expressed as:

W(x_i, z_i; \gamma) = V(x_i, z_i; \theta, \sigma_o^2)^{1/2} R_i(\rho) V(x_i, z_i; \theta, \sigma_o^2)^{1/2}    (2.18)

where R_i(\rho) is a T_i \times T_i positive definite matrix with r_{ts}^i denoting the element in the t-th row and s-th column; r_{tt}^i = 1 and r_{ts}^i \in (-1, 1) for t \neq s. The discussion of the selection of R_i(\rho) is provided in the next section. Specifically, GEE estimators solve the following minimization problem:

\min_{\theta} \sum_{i=1}^{N} [y_i - m(x_i, z_i; \theta)]^T [W(x_i, z_i; \check{\gamma})]^{-1} [y_i - m(x_i, z_i; \theta)]    (2.19)

where we have replaced \gamma with its consistent estimator \check{\gamma}, which can be obtained from some preliminary estimation, such as the pooled NLS method. It should be mentioned that, instead of using the two-step estimation method, we could treat \gamma as a function of \theta and other nuisance parameters; thus, one-step estimation is also available. The asymptotic distributions of the two-step and one-step estimators are the same as long as we use a \sqrt{N}-consistent estimator of \gamma in the two-step estimation. Here I use the two-step estimation method, which makes the later discussion of the selection of working correlation matrices easier. The GEE method is very similar to the multivariate weighted nonlinear least squares method in the econometrics literature. While MWNLS tries to model var(y_i \mid x_i, z_i) accurately, GEE acknowledges that the weighting matrix is inherently wrong.
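A minimal sketch of the minimization in (2.19), under strong simplifying assumptions: the conditional mean is linear (so the problem has a closed-form solution and the gradient of the mean is just X_i), the working variances are set to one (so W_i reduces to the correlation matrix R_i), and an exchangeable R_i with a made-up ρ is used on simulated equicorrelated data. A robust "sandwich" variance estimate is computed so that inference does not rely on R_i being correct:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, K, rho = 500, 5, 2, 0.3

# Exchangeable working correlation; unit working variances, so W_i = R_i here
W = np.full((T, T), rho) + (1.0 - rho) * np.eye(T)
Winv = np.linalg.inv(W)

# Simulated groups: linear mean x_it' theta plus equicorrelated errors
# (an individual effect plus idiosyncratic noise)
theta = np.array([1.0, -0.5])
X = rng.normal(size=(N, T, K))
u = rng.normal(size=(N, T)) + rng.normal(size=(N, 1))
y = np.einsum('itk,k->it', X, theta) + u

# For a linear mean, (2.19) has the closed form
# theta_hat = (sum_i X_i' W^-1 X_i)^-1 sum_i X_i' W^-1 y_i
A = np.zeros((K, K)); b = np.zeros(K)
for i in range(N):
    A += X[i].T @ Winv @ X[i]
    b += X[i].T @ Winv @ y[i]
theta_hat = np.linalg.solve(A, b)

# Robust sandwich variance: valid even though W is misspecified here
B = np.zeros((K, K))
for i in range(N):
    nu = y[i] - X[i] @ theta_hat
    s = X[i].T @ Winv @ nu
    B += np.outer(s, s)
Ainv = np.linalg.inv(A)
avar = Ainv @ B @ Ainv
print(theta_hat, np.sqrt(np.diag(avar)))
```

For a genuinely nonlinear mean, the closed form is replaced by iterating on the quasi-score equations, but the weighting and the sandwich construction are unchanged.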
It might be instructive to look at the first-order conditions of the above minimization problem:

\sum_{i=1}^{N} \nabla_\theta m(x_i, z_i; \hat{\theta})^T W(x_i, z_i; \check{\gamma})^{-1} [y_i - m(x_i, z_i; \hat{\theta})] = 0    (2.20)

The above equation looks very similar to the first-order condition of quasi-MLE. In fact, GEE estimators can be obtained by solving these "quasi-score" equations. As proved by Liang and Zeger (1986), under some mild conditions, GEE estimators are \sqrt{N}-consistent and converge to the following normal distribution:

\sqrt{N}(\hat{\theta} - \theta) \sim N\left( 0, \left( E[s_i^T W_i^{-1} s_i] \right)^{-1} E[s_i^T W_i^{-1} \nu_i \nu_i^T W_i^{-1} s_i] \left( E[s_i^T W_i^{-1} s_i] \right)^{-1} \right)    (2.21)

where s_i = \nabla_\theta m(x_i, z_i; \theta), \nu_i = y_i - m(x_i, z_i; \theta), and W_i is shorthand for W(x_i, z_i; \gamma). A consistent estimator of the variance-covariance matrix can be obtained as

\widehat{\text{Avar}}(\hat{\theta}) = \left( \sum_{i=1}^{N} \hat{s}_i^T \hat{W}_i^{-1} \hat{s}_i \right)^{-1} \left( \sum_{i=1}^{N} \hat{s}_i^T \hat{W}_i^{-1} \hat{\nu}_i \hat{\nu}_i^T \hat{W}_i^{-1} \hat{s}_i \right) \left( \sum_{i=1}^{N} \hat{s}_i^T \hat{W}_i^{-1} \hat{s}_i \right)^{-1}    (2.22)

In equation (2.22) we replace the expectations by sample averages and the unknown parameters by the corresponding consistent estimators. The efficiency of the GEE estimator obviously depends on how well the chosen working correlation matrix approximates the true correlations. If it happens to be true that W(x_i, z_i; \gamma) = var(y_i \mid x_i, z_i), then GEE estimators are efficient and the asymptotic variance can be further simplified. As mentioned earlier, we can further improve the efficiency of GEE estimators by allowing time-varying dispersion parameters; this would only require minor notational changes to equation (2.18).

2.4 Selection of the Working Correlation Matrix

If we look at the weighting matrix W(x_i, z_i; \gamma) in equation (2.18), once we have specified the functional form of E(y_i \mid x_i, z_i) and the functional form of V(x_i, z_i; \theta, \sigma_o^2), which is associated with the underlying distribution assumption, our task reduces to determining the working correlation matrix R_i(\rho).
In this section, I first introduce various popular choices of working correlation matrices and then discuss their validity in different situations. The working correlation matrix R_i(\rho) is intended to capture the within-group correlations among the Pearson residuals e_{it} = [y_{it} - m(h_{it}; \hat{\beta})] / \sqrt{\widehat{\text{var}}(y_{it} \mid x_i, z_i)}. In practice, we first need to estimate the conditional mean and the pseudo conditional variance to obtain the Pearson residuals. In the pooled estimation method, R_i(\rho) is set to be the identity matrix, and the consistency of the GEE estimator in this case does not require strict exogeneity:

R_i(\rho) = I_{T_i}    (2.23)

Using the identity working correlation matrix together with time-varying dispersion parameters is similar to applying the generalized least squares method to models with heteroskedasticity. The GEE estimator with the identity working correlation matrix is the same as the pooled estimator, which only requires contemporaneous exogeneity for consistency. The identity correlation matrix ignores the within-group correlations. For a nested structure, a popular choice of working correlation matrix is a matrix with an exchangeable structure. The exchangeable matrix has ones along its diagonal and some constant \rho everywhere else:

R_i(\rho) = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}    (2.24)

Essentially, the exchangeable correlation matrix assumes that the within-group correlations are constant for all pairs of group members. This is more realistic in nested models but somewhat restrictive for non-nested models such as panel data models. In the panel data context, the working correlation matrix can be assumed to be generated by some autoregressive process or moving average process of the Pearson residuals. For example, r_{ts}^i = \rho^{|t-s|} for an AR(1) process; for an MA(1) process, r_{ts}^i = \rho if |t - s| = 1 and r_{ts}^i = 0 if |t - s| > 1. It should be mentioned that the MA(1) process is a special case of the one-dependent process, which is often used in biology and statistics. For example, a sequence \{e_{it}\}_{t=1}^T follows a one-dependent process if corr(e_{it}, e_{it+1}) = \rho_t and corr(e_{it}, e_{is}) = 0 for |t - s| > 1. When \rho_t = \rho for t = 1, 2, ..., T-1, the one-dependent process is the same as the MA(1) process. A generalization to the m-dependent process is also available, which would resemble the MA(m) process. The unstructured working correlation matrix, in which each element is estimated by its corresponding sample average, is attractive when the group size is relatively small. In practice, we rarely know the true underlying correlation matrix, and our hope is that the chosen working matrix is able to capture enough information about the within-group correlations. As mentioned earlier, misspecification of R_i(\rho) does not cause inconsistency of the GEE estimator, but we do require that the chosen working matrix converge to some positive definite nonrandom matrix. This requirement poses some challenges for the use of the GEE method. For a concrete example, consider a T \times T correlation matrix with an exchangeable structure indexed by the parameter \rho. The exchangeable matrix is positive definite only if \rho > -1/(T-1). This means that even if we know the true correlation structure and can estimate the correlation parameter precisely, the resulting working correlation matrix may not be positive definite for a certain range of negative correlations. In other words, the working correlation matrix will not converge to a positive definite matrix when \rho < -1/(T-1); consequently, the GEE estimator is no longer consistent. Chaganty and Joe (2006) provide further discussion of the definiteness of working correlation matrices for dependent Bernoulli random variables.
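The positive-definiteness restrictions above are easy to verify numerically. The helper functions below (hypothetical names, written for this sketch) build the exchangeable and AR(1) correlation matrices and check the smallest eigenvalue; the exchangeable matrix fails just below ρ = −1/(T−1), while the AR(1) matrix is positive definite for any ρ in (−1, 1):

```python
import numpy as np

def exchangeable(T, rho):
    """Exchangeable correlation matrix, as in eq. (2.24)."""
    return np.full((T, T), rho) + (1.0 - rho) * np.eye(T)

def ar1(T, rho):
    """AR(1) correlation matrix: r_ts = rho^|t-s|."""
    t = np.arange(T)
    return rho ** np.abs(t[:, None] - t[None, :])

def is_pd(R):
    """Positive definite iff the smallest eigenvalue is strictly positive."""
    return np.linalg.eigvalsh(R).min() > 0

T = 5
# Exchangeable: positive definite only for rho > -1/(T-1) = -0.25 here
print(is_pd(exchangeable(T, -0.25 + 0.01)))   # True
print(is_pd(exchangeable(T, -0.25 - 0.01)))   # False
# AR(1): positive definite for any rho in (-1, 1)
print(is_pd(ar1(T, -0.9)))                    # True
```

The exchangeable eigenvalues are 1 + (T−1)ρ and 1 − ρ, which makes the −1/(T−1) boundary visible directly: the first eigenvalue crosses zero exactly there.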
A potentially important issue ignored by many applied GEE researchers is that, under misspecification of the working correlation matrix, the working correlation matrix may not have any well-defined limit; this was first investigated by Crowder (1995). An example given in his paper is one where the true correlation matrix has an exchangeable form and the chosen working correlation matrix is derived from an AR(1) process of the Pearson residuals. Crowder (1995) shows that there is no solution for the estimator \hat{\rho} for a certain range of the parameter associated with the exchangeable matrix. So far there is no systematic discussion of the convergence issue of the GEE working correlation matrices in the econometrics literature. Below I focus on three commonly used working correlation matrices and analyze the convergence issue in various situations. It should be noted that the estimation method for the working correlation matrix discussed in Crowder (1995) is not standard and can be seen as an inefficient moment estimation. For the following analysis, I use \xi to denote the parameter that indexes the true correlation matrix and \rho to denote the parameter associated with the chosen working matrix. For simplicity, I assume T_i = T for all i.

Case 1.1: The true correlation matrix associated with \{e_{it}\}_{t=1}^T has an exchangeable structure with parameter \xi, and the chosen working correlation matrix is derived from an AR(1) process determined by the parameter \rho. As previously discussed, a necessary condition for the exchangeable correlation matrix to be positive definite is \xi > -1/(T-1). Crowder (1995) suggests that the estimator of \rho can be obtained by solving the equation

\sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \left( \frac{1}{N} \sum_{i=1}^{N} e_{it} e_{is} - \hat{\rho}^{s-t} \right) = 0

As shown by Crowder (1995) (see more details in the appendix), when T is odd and \xi + 1/T < 0, no solution exists for \hat{\rho}; therefore, it does not have any sensible limit. In summary, the working correlation matrix in this case converges only if T is even, or T is odd with \xi + 1/T \geq 0. When the solution exists, the proposed estimator \hat{\rho} is an inefficient moment estimator. Another drawback of the estimation method in Crowder (1995) is that it is difficult to extend the result to unbalanced data structures in which T_i varies with i. Another convenient estimator of \rho can be obtained as \frac{1}{N(T-1)} \sum_{i=1}^{N} \sum_{t=1}^{T-1} e_{it} e_{it+1}, which converges to \xi at rate \sqrt{N}.

Case 1.2: The true correlation matrix associated with \{e_{it}\} has an exchangeable structure with parameter \xi, and the chosen working correlation matrix is derived from the one-dependent process determined by the parameters \{\rho_t\}_{t=1}^{T-1}. In this case, \rho_t is estimated by its sample analogue \frac{1}{N} \sum_{i=1}^{N} e_{it} e_{it+1}, which converges to \xi by the law of large numbers. For the special case where the one-dependent process degenerates to the MA(1) process, the single parameter \rho is estimated by \frac{1}{N(T-1)} \sum_{i=1}^{N} \sum_{t=1}^{T-1} e_{it} e_{it+1}, which still converges to \xi as N \to \infty.

Case 2.1: The Pearson residuals \{e_{it}\}_{t=1}^T follow an AR(1) process associated with the parameter \xi, and the chosen working correlation matrix has an exchangeable structure with parameter \rho. In this case, \rho is estimated by its sample analogue \frac{1}{NT(T-1)/2} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} e_{it} e_{is}, which can be shown to converge to \frac{1}{T(T-1)/2} \cdot \frac{\xi}{1-\xi} \left( T - \frac{1-\xi^T}{1-\xi} \right). This example shows that the parameters associated with the working correlation matrix do not necessarily have a meaningful statistical interpretation. One can verify that \rho > -1/(T-1) is satisfied for every positive integer T and \xi \in (-1, 1).

Case 2.2: The Pearson residuals \{e_{it}\}_{t=1}^T follow an AR(1) process associated with the parameter \xi, and the chosen working correlation matrix is derived from the one-dependent process determined by the parameters \{\rho_t\}_{t=1}^{T-1}. In this case, \rho_t is estimated by its sample analogue \frac{1}{N} \sum_{i=1}^{N} e_{it} e_{it+1}, which converges to \xi by the law of large numbers. Similarly, in the special case where the working
In this case, ρt is estimated by its sample analogue 1 N ∑N correlation matrix is derived from an MA(1) process, ˆρ can be estimated by 1 N(T−1) ∑N i=1 ∑T−1 t=1 eiteit+1. Case 3.1: Pearson residuals {eit}T t=1 follow the one-dependent process associated with param- eters ξ , and the chosen working correlation matrix has an exchangeable structure determined by pa- rameter ρ. We can estimate ρ by its corresponding sample analogue: 1 NT (T−1)/2 ∑N i=1 ∑T−1 t=1 ∑T s=t+1 eiteis, ∑T−1 t=1 ξt T (T−1)/2. The constraint that ρ > − 1 T−1 is automatically sat- 2 . For the special case where ξt = ξ , the converging limit simplifies to which can be shown to converge to isfied if ∑T−1 ξ T /2. t=1 ξt > −T Case 3.2: Pearson residuals {eit}T t=1 follow the one-dependent process associated with param- 44 eters ξ , and the chosen working correlation matrix is derived from AR(1) process with parameter ρ. Following Crowder (1995), the estimator of ρ can be obtained by solving ∑T−1 ˆρs−t) = 0. I show that the solution exists only if ∑T−1 t=1 ξt is sufficiently large (see the appendix for proof). Again, this estimation method would be inefficient and problematic in some cases. We s=t+1( 1 t=1 ∑T N ∑N i=1 eiteis− could instead estimate ρ by simplifies if {eit}T N(T−1) ∑N i=1 ∑T−1 t=1 follows an MA(1) process. 1 t=1 eiteit+1, which converges to ∑T−1 t=1 ξt T−1 . The analysis From the above analysis we see that, depending on the methods used for estimation, the work- ing correlation matrices do not have well-defined converging limits in some cases. For empirical researchers, the exchangeable matrix is a good choice as it does converge to some nonrandom matrix in most cases. We should also make sure that the chosen working correlation matrix is positive semi-definite, and this has to be examined on a case by case basis. 
It should be mentioned that the estimating equations for the parameters of the autoregressive working correlations presented in cases 1.1 and 3.2 are not standard in the econometrics literature, and they are also different from the estimation method proposed by Liang and Zeger (1986). The estimating equation in case 3.2 is just an analogue of the one proposed by Crowder (1995). Other methods are also available to construct the working correlation matrix. An example in the count response model is the MWNLS approach, in which the working correlation matrix converges to a positive semi-definite matrix by construction. We could also use information criteria to select among different working correlation matrices. A good, up-to-date review of the selection of working correlation matrices is provided by Wang (2014).

2.5 Estimation of Higher-Level Nonlinear Models

The previous sections introduced nonlinear panel models and some common estimation methods. In particular, I considered correlated random effects models with unbalanced group sizes and discussed the popular GEE method in detail. If our interest is in population average parameters, the GEE estimator has good efficiency properties relative to the pooled estimator. With simple reasoning, the previous results can be extended to models of higher levels. It is interesting to note that most analyses of higher-level models are based on mixed-effects estimation, which has no robustness properties if any of the distributional assumptions fails. As we will see later, the GEE method is robust to certain misspecifications of the distributional assumptions. In this section, I consider the estimation of the conditional mean function for a three-level nonlinear model that has both nested and non-nested features. I first start with a general framework without specifying any parametric functional form for the conditional mean. Both random effects and correlated random effects models are considered. Extended GEE estimation and inference are provided.
I use the binary/fractional response model and the count response model as concrete examples. Because of the special setup of the count response model, another attractive estimation method is also available.

2.5.1 Extended GEE Estimation

Suppose that we want to study the effects of some factors or policy changes on the passing rates of national tests, and we have standard panel data on schools across many different districts. In practice, many researchers would construct a standard panel data model for estimation and inference; however, as we will show later, ignoring the higher-level cluster structure leads to biased inference. This issue has been addressed by Moulton (1990) and Wooldridge (2003) for linear models, and a similar result holds in nonlinear models. Thus, it is very important to explicitly model the correlation of the data at the higher level. In our example, schools within the same district might be correlated with each other through the district effect, and this is a nested data structure. We also have the time dimension, which is not nested within the school identity or the district identity. Our interest is in estimating the conditional mean function, which relates to the partial effects and the average partial effects:

E(y_{it} \mid h_{it}, c_i, v_{g(i)}) = m^*(h_{it}, c_i, v_{g(i)}; \beta^*)    (2.25)

The notation is slightly different here. Let t = 1, ..., T index time periods, i = 1, ..., N index schools, and g = 1, ..., G index districts. Given a school identity i, we know which district it is in through the function g(i). The explanatory variables h_{it} = (x_{it}, z_i, w_{g(i)}) can include time-varying variables, school-constant variables, and district-constant variables. Of course, time dummy variables can be included in x_{it}. Here I use c_i to denote the school effect and v_{g(i)} to denote the district effect, and \beta^* is the vector of true structural parameters. For simplicity of notation, I assume a balanced panel structure as well as a balanced cluster structure.
In particular, I assume there are Q schools in each district and each school is observed for T periods. Analysis with unbalanced group sizes and varying time periods can be easily accommodated with minor notational changes. For the asymptotic analysis, I assume that the number of groups goes to infinity while the group sizes and time periods are fixed or relatively small. As we have seen in panel models, many estimation methods rely on the strict exogeneity assumption. Here I state the exogeneity assumption as follows:

$$E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = E(y_{it} \mid h_{it}, c_i, v_{g(i)}) \tag{2.26}$$

where $H_{g(i)}$ is a matrix that contains all explanatory variables $h_{it}$ within the same group across all periods. The above equation is twofold: covariates from other periods of the same school do not help to explain the current-period outcome, and covariates of other schools within the same district do not affect the outcome of the chosen school. These statements are all conditional on the school heterogeneity and the district heterogeneity. Strict exogeneity in the time dimension can be restrictive in the sense that it rules out feedback effects and lagged dependent variables. Strict exogeneity across schools within the same district essentially rules out peer effects. Strict exogeneity of the time-varying variables can be examined by adding a subset of next-period time-varying variables; a joint significance test of these added variables is a valid test of the strict exogeneity assumption in the time dimension. Testing for peer effects, on the other hand, is an ongoing research area and usually encounters identification issues. In some special panel data models, we can relax strict exogeneity to sequential exogeneity to identify the conditional mean; see Wooldridge (1997) for multiplicative panel data models. Relaxing the strict exogeneity condition is more difficult in nested models, as there is no meaningful "order" for schools within the same district.
Testing and relaxing the strict exogeneity assumption in nested models could be interesting topics for future research. Depending on the assumptions we make about the correlation between the unobserved effects and the explanatory variables, we have the random-effects framework or the correlated random-effects framework. In some special cases, we could use a fixed-effects estimation approach to identify some structural parameters, but we would not be able to obtain partial effects from the fixed-effects estimation. Since we are working with a general functional form, it is better to hold off on fixed-effects estimation. In a random-effects framework, we generally assume the heterogeneities are independent of the explanatory variables:

$$D(c_i \mid H_{g(i)}) = D(c_i), \qquad D(v_g \mid H_g) = D(v_g) \tag{2.27}$$

Undoubtedly these are very strong assumptions, but we still allow for dependence of $c_i$ on $v_{g(i)}$. This is different from the GLS-type assumptions made in linear models, in which $c_i$ and $v_{g(i)}$ are usually assumed to be independent of each other. With the above assumptions, we can integrate out the two heterogeneities:

$$E(y_{it} \mid H_{g(i)}) = E\big( E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) \mid H_{g(i)} \big) = \iint m^*(h_{it}, c, v; \beta^*)\, dF(c \mid H_{g(i)})\, dF(v \mid H_{g(i)}) \equiv m_1(h_{it}; \beta_o) \tag{2.28}$$

Note that I allow the functional form $m_1(\cdot)$ as well as the parameter $\beta_o$ to be different from the original structural function in equation (2.25). In the probit model, $\beta_o$ would be $\beta^*$ scaled by some constant. A more realistic case is that the unobserved heterogeneities are correlated with the explanatory variables. Since there are two different unobserved effects, we could have three different situations: both are correlated with the covariates, or only one of the two unobserved effects is correlated with the covariates.

$$D(c_i, v_{g(i)} \mid H_{g(i)}) = D(c_i, v_{g(i)} \mid \bar{H}_{g(i)}) \tag{2.29}$$

where $\bar{H}_{g(i)}$ are time averages, and possibly group averages, of some variables in $h_{it}$. Here I consider the first scenario, as the other two are just special cases of the first one.
Again I adopt the Mundlak-Chamberlain device for modeling the heterogeneities. This assumption allows correlation between the heterogeneities and the covariates. As in the random-effects assumption, $c_i$ can depend on $v_{g(i)}$ in arbitrary ways. A simple application of Bayes' rule shows that the above assumption can also be obtained from two separate assumptions: $D(c_i \mid v_{g(i)}, H_{g(i)}) = D(c_i \mid v_{g(i)}, \bar{H}_{g(i)})$ and $D(v_{g(i)} \mid H_{g(i)}) = D(v_{g(i)} \mid \bar{H}_{g(i)})$. By the law of iterated expectations we can integrate out these two heterogeneities:

$$E(y_{it} \mid H_{g(i)}) = E\big( E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) \mid H_{g(i)} \big) = \iint m^*(h_{it}, c, v; \beta^*)\, dF(c_i, v_{g(i)} \mid \bar{H}_{g(i)}) \equiv m_2(h_{it}, \bar{H}_{g(i)}; \beta_o, \phi_o) \tag{2.30}$$

Note that we now have some new parameters $\phi_o$, associated with $\bar{H}_{g(i)}$, to be estimated. The functional form $m_2(\cdot)$ could be different from the original conditional mean function in equation (2.25). If $m_2(\cdot)$ has a linear index form, then we would not be able to identify the original coefficients associated with the time-constant variables. As discussed for panel models, when we have unbalanced structures it is important to allow the conditional distribution of the heterogeneities to be group-size specific. If we do not have much variation in group sizes, we can simply index the parameters to be group-size specific; alternatively, we can explicitly model the distribution as a function of group size when there are many different group sizes. A consistent estimator of the average structural function can be obtained by averaging out $\bar{H}_{g(i)}$ across all schools while fixing $h_{it}$ at some constant vector. Later I use concrete examples to show how to estimate three-level models with an unbalanced data structure and how to obtain partial effects. In both the random-effects and correlated random-effects models, the problem boils down to the estimation of $E(y_{it} \mid H_{g(i)})$ after integrating out the heterogeneities. Without loss of generality, I focus on equation (2.28) only.
The following analysis also applies to correlated random-effects models by including some additional variables in the conditional mean function. A simple estimation method is just pooled nonlinear least squares:

$$\hat\beta_{p1} = \arg\min_{\beta} \sum_{g=1}^{G} \big(y_g - M(H_g;\beta)\big)' \big(y_g - M(H_g;\beta)\big) \tag{2.31}$$

where $M(H_g;\beta)$ is a $QT \times 1$ vector containing the $m(h_{it};\beta)$, sorted by school and arranged according to time, and $y_g$ is defined similarly, so that its first T rows are the observations of some school in district g. Under some regularity assumptions, we can show that

$$\sqrt{G}(\hat\beta_{p1} - \beta_o) \sim \text{Normal}(0, A_1^{-1} B_1 A_1^{-1}) \tag{2.32}$$

where $A_1 = E[H(M;\beta_o)]$ and $B_1 = E[\nabla_\beta M(H_g;\beta_o)' \nu_g \nu_g' \nabla_\beta M(H_g;\beta_o)]$, in which $H(M;\beta_o)$ is the Hessian matrix of the objective based on $M(H_g;\beta)$ evaluated at $\beta_o$, and $\nu_g = y_g - M(H_g;\beta_o)$. A consistent estimator of the variance matrix can be obtained by replacing $\beta_o$ with its consistent estimator $\hat\beta_{p1}$ and replacing expectations with sample averages. An interesting question to ask here is what would happen if we ignore the higher nested structure and treat the data as a standard panel. This issue has been addressed in linear models, and we have the same result in nonlinear models. If we take the data as a panel, we would solve:

$$\hat\beta_{p2} = \arg\min_{\beta} \sum_{i=1}^{N} \big(y_i - m(H_i;\beta)\big)' \big(y_i - m(H_i;\beta)\big) \tag{2.33}$$

where $m(H_i;\beta)$ is the $T \times 1$ vector of conditional mean functions for school i and $y_i' = (y_{i1}, ..., y_{iT})$. Note that the objective functions in equations (2.31) and (2.33) are essentially the same, so both $\hat\beta_{p1}$ and $\hat\beta_{p2}$ are consistent. For nonlinear panel data models, we have the following standard result:

$$\sqrt{N}(\hat\beta_{p2} - \beta_o) \sim \text{Normal}(0, A_2^{-1} B_2 A_2^{-1}) \tag{2.34}$$

where $A_2 = E[H(m;\beta_o)]$, $B_2 = E[\nabla_\beta m(H_i;\beta_o)' \nu_i \nu_i' \nabla_\beta m(H_i;\beta_o)]$, and $\nu_i = y_i - m(H_i;\beta_o)$. With some algebra, one can show that $A_1 = Q A_2$ and $B_1 = Q B_2 + E[\sum_{i \in g} \sum_{j \in g \setminus \{i\}} \nabla_\beta m(H_i;\beta_o)' \nu_i \nu_j' \nabla_\beta m(H_j;\beta_o)]$. From these results we can see that $\text{Avar}(\hat\beta_{p1})$ is not the same as $\text{Avar}(\hat\beta_{p2})$ unless the expectation of all cross products of scores for different schools within the same district is zero. If the second part of $B_1$ is zero, then one can show that $\text{Avar}(\hat\beta_{p1}) = \text{Avar}(\hat\beta_{p2})$, using the facts that $B_1 = Q B_2$ and $N = QG$. This result is not limited to the pooled NLS estimation method; it is general enough for all M-estimators. Suppose that $s_g(\beta_o)$ is the score of the objective function that accounts for the higher-level nested structure and $s_i(\beta_o)$ is the score of the objective function that ignores the higher-level cluster structure. The two estimators have the same asymptotic variance matrix only if $\sum_{i \in g} \sum_{j \in g \setminus \{i\}} E[s_i(\beta_o) s_j(\beta_o)'] = 0$. In general, the pooled estimator that ignores the higher-level nested structure is still consistent, but its asymptotic variances are not valid for inference except in some special cases. Later we will see an example in which this condition is violated whenever a group effect exists at the higher level. Differences between the asymptotic variances of the pooled estimators with and without the cluster structure can be used as an informal test of whether there is a higher-level nested structure. In the given example, if there are large differences between the variance estimators of the pooled panel estimation and the pooled cluster estimation, this could be a sign that we have ignored nested structure at the higher (district) level. For the same reason as previously discussed, the pooled estimation is unlikely to be efficient, as it ignores correlations across time and across schools within the district. The GEE method is attractive for improving efficiency without imposing stronger assumptions. In the three-level model, the working matrix is a little more complicated, as it intends to capture within-group correlations as well as correlation across time.
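The score cross-product condition above can be illustrated numerically. In the sketch below (all sizes, parameter values, and the linear mean are my own illustrative choices, not the chapter's designs), a district effect in the error and district-level correlation in the covariate make school-level clustering understate the sampling variability of the pooled slope estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
G, Q, T = 200, 3, 4                      # districts, schools per district, periods
N = G * Q

xg = rng.normal(size=(G, 1, 1))          # district-level component of x
x = xg + rng.normal(size=(G, Q, T))      # covariate correlated within district
v = rng.normal(size=(G, 1, 1))           # district effect
c = rng.normal(size=(G, Q, 1))           # school effect
y = 1.0 + 2.0 * x + v + c + rng.normal(size=(G, Q, T))

# Pooled OLS of y on (1, x); rows ordered district-major, then school, then time.
X = np.column_stack([np.ones(N * T), x.ravel()])
beta = np.linalg.lstsq(X, y.ravel(), rcond=None)[0]
u = y.ravel() - X @ beta
s = X * u[:, None]                       # score contribution of each (g, i, t)
A_inv = np.linalg.inv(X.T @ X)

def cluster_sandwich(scores, n_clusters):
    """Sandwich variance with scores summed within each of n_clusters blocks."""
    S = scores.reshape(n_clusters, -1, scores.shape[1]).sum(axis=1)
    return A_inv @ (S.T @ S) @ A_inv

se_school = np.sqrt(cluster_sandwich(s, N)[1, 1])    # treats the data as a panel
se_district = np.sqrt(cluster_sandwich(s, G)[1, 1])  # respects the nesting
print(se_school, se_district)
```

With this design the cross-school score products inside a district have positive expectation, so the district-clustered standard error is the larger (and valid) one, which is exactly the informal diagnostic described above.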
I will call it "Extended GEE" (EGEE) estimation when the GEE method is applied to higher-level models. With a minor change of notation, the objective function for EGEE can be written as

$$\min_{\beta} \sum_{g=1}^{G} [y_g - M(H_g;\beta)]' [W(H_g;\check\gamma)]^{-1} [y_g - M(H_g;\beta)] \tag{2.35}$$

Of course, the efficiency of the EGEE estimator depends on how well we approximate the true correlation matrix with our chosen working matrix $W(H_g;\check\gamma)$. As before, we can decompose the working matrix as $V(H_g,\alpha)^{1/2} R(\rho) V(H_g,\alpha)^{1/2}$, where $V(H_g,\alpha) = \text{diag}\{\text{var}^*(y_{i1} \mid H_g), ..., \text{var}^*(y_{iT} \mid H_g), ...\}_{i \in g}$. Again, the GEE method allows the possibility that the constructed conditional variance $\text{var}^*(y_{iT} \mid H_g)$ differs from the true conditional variance $\text{var}(y_{iT} \mid H_g)$. The interesting part is modeling $R(\rho)$. Several commonly used working correlation matrices are available, as discussed in Section 2.4. In our example, level one is time, and we can use an exchangeable, AR, or MA correlation structure to model the correlation across time. Level two is schools, which are nested within districts. For the nested data structure, it makes sense to assume an exchangeable correlation structure. To be explicit about $R(\rho)$, we can write it as

$$R(\rho) = \begin{pmatrix} R_1(\rho_1) & R_2(\rho_2) & \cdots & R_2(\rho_2) \\ R_2(\rho_2) & R_1(\rho_1) & \cdots & R_2(\rho_2) \\ \vdots & \vdots & \ddots & \vdots \\ R_2(\rho_2) & R_2(\rho_2) & \cdots & R_1(\rho_1) \end{pmatrix} \tag{2.36}$$

where $R_1(\rho_1)$ models the correlation across time, i.e. $\text{corr}(y_{it}, y_{is} \mid H_{g(i)})$, and $R_2(\rho_2)$ approximates the within-group correlation, i.e. $\text{corr}(y_{it}, y_{js} \mid H_{g(i)} = H_{g(j)})$. In both GEE and EGEE estimation, we use $R_1(\rho_1)$ to model the correlation across time; the new part of the EGEE estimation is $R_2(\rho_2)$. If we have chosen an exchangeable structure for $R_1(\rho_1)$, then we have good reason to assume $R_2(\rho_2)$ is a matrix of some constant $\rho_2$.
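The block structure in equation (2.36) can be assembled compactly with Kronecker products, since $R(\rho) = I_Q \otimes R_1(\rho_1) + (J_Q - I_Q) \otimes R_2(\rho_2)$ when $R_2(\rho_2)$ is a constant matrix ($J_Q$ is the all-ones matrix). A sketch, with the function name and argument conventions my own:

```python
import numpy as np

def working_corr(T, Q, rho1, rho2, time_structure="exchangeable"):
    """QT x QT working correlation for one district: R1 on diagonal blocks
    (correlation across time within a school) and a constant matrix R2 on
    off-diagonal blocks (correlation across schools in the district)."""
    if time_structure == "exchangeable":
        R1 = np.full((T, T), rho1)
        np.fill_diagonal(R1, 1.0)
    elif time_structure == "ar1":
        idx = np.arange(T)
        R1 = rho1 ** np.abs(idx[:, None] - idx[None, :])
    else:
        raise ValueError(time_structure)
    R2 = np.full((T, T), rho2)
    I_Q, J_Q = np.eye(Q), np.ones((Q, Q))
    # I_Q (x) R1 places R1 on the diagonal blocks; (J_Q - I_Q) (x) R2 fills the rest.
    return np.kron(I_Q, R1) + np.kron(J_Q - I_Q, R2)

R = working_corr(T=3, Q=2, rho1=0.5, rho2=0.2)
print(R)
```

For moderate $\rho_1, \rho_2$ the resulting matrix is positive definite, which can be checked via its eigenvalues before inverting it in (2.35).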
When $R_1(\rho_1)$ is approximated by the correlation matrix generated from an AR process or an MA process, we might still assume $R_2(\rho_2)$ is a matrix of a single constant if the within-group correlation is dominated by $v_g$, which is time invariant. Of course, one might suspect that $\text{corr}(y_{it}, y_{js} \mid H_{g(i)} = H_{g(j)})$ decreases as $|t - s|$ increases. In this scenario we could use a more sophisticated matrix for $R_2(\rho_2)$. One possibility is to approximate $\text{corr}(y_{it}, y_{js} \mid H_{g(i)} = H_{g(j)})$ by $\theta_1 \theta_2^{|t-s|}$, and in this case $\rho_2 = (\theta_1, \theta_2)$. The implementation of EGEE requires some initial consistent estimator of $\rho$. Here I outline the formulas to calculate $\check\rho$ from some initial consistent estimator $\check\beta$ when we choose exchangeable correlation matrices for both $R_1(\rho_1)$ and $R_2(\rho_2)$:

$$\check\rho_1 = \frac{2}{NT(T-1)} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \frac{\check\nu_{it}}{\check\mu_{it}} \frac{\check\nu_{is}}{\check\mu_{is}} \tag{2.37}$$

where $\check\nu_{it} = y_{it} - m(h_{it};\check\beta)$ and $\check\mu_{it}$ is a consistent estimator of $\sqrt{\text{var}^*(y_{it} \mid H_{g(i)})}$, which is usually associated with $m(h_{it};\beta)$ in many nonlinear models. A consistent estimator of $\rho_2$ is

$$\check\rho_2 = \frac{1}{G T^2 Q(Q-1)} \sum_{g=1}^{G} \left( \sum_{i \in g} \sum_{j \in g} \sum_{t=1}^{T} \sum_{s=1}^{T} \frac{\check\nu_{it}}{\check\mu_{it}} \frac{\check\nu_{js}}{\check\mu_{js}} - \sum_{i \in g} \sum_{t=1}^{T} \sum_{s=1}^{T} \frac{\check\nu_{it}}{\check\mu_{it}} \frac{\check\nu_{is}}{\check\mu_{is}} \right) \tag{2.38}$$

A robust variance matrix estimator of $\hat\beta_{EGEE}$ can be obtained from equation (2.22) with minor notational changes. When we have correctly specified the working matrix, the EGEE estimator is efficient. Even if the working matrix is misspecified, the EGEE estimator is shown to be more efficient than the pooled estimator in various simulation studies.

2.5.2 Some Examples

This subsection presents some examples of nonlinear panel models with a higher hierarchical structure. Specifically, I consider the binary/fractional response model and the count response model. Other estimation methods are available for the count response model because of the multiplicative functional form of its conditional mean.
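The estimators in equations (2.37) and (2.38) are straightforward to vectorize: both are averages of cross products of standardized residuals, and the identity $\sum_{t \neq s} r_t r_s = (\sum_t r_t)^2 - \sum_t r_t^2$ avoids explicit double loops. A sketch under the balanced structure (the function name and the (G, Q, T) array layout are my own conventions):

```python
import numpy as np

def rho_checks(resid):
    """Moment estimators (2.37)-(2.38) for the exchangeable parameters, from
    standardized residuals r[g, i, t] = nu_check/mu_check of shape (G, Q, T)."""
    G, Q, T = resid.shape
    r2 = resid.reshape(G * Q, T)
    # rho1: average over distinct-period pairs within a school; note that
    # 2 * sum_{t < s} r_t r_s = (sum_t r_t)^2 - sum_t r_t^2.
    rho1 = (r2.sum(axis=1) ** 2 - (r2 ** 2).sum(axis=1)).sum() / (G * Q * T * (T - 1))
    # rho2: average over pairs of different schools in the same district.
    S = resid.sum(axis=(1, 2))                  # district totals
    own = (resid.sum(axis=2) ** 2).sum(axis=1)  # same-school cross products
    rho2 = (S ** 2 - own).sum() / (G * Q * (Q - 1) * T ** 2)
    return rho1, rho2

print(rho_checks(np.ones((5, 3, 4))))  # perfectly correlated residuals -> (1.0, 1.0)
```

Feeding in residuals with no cross-sectional dependence drives both estimates toward zero, which is a convenient sanity check before plugging $\check\rho$ into the working matrix.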
Some results for the panel count response model can also be extended to higher-level models.

Three-Level Binary/Fractional Response Models

When our interest is only in the conditional mean function, we can model the conditional mean function of the binary or fractional response variable as $E(y_{it} \mid h_{it}, c_i, v_{g(i)}) = \Phi(h_{it}\beta^* + c_i + v_{g(i)})$. Depending on the assumption we make about the correlation between $(c_i, v_{g(i)})$ and $H_{g(i)}$, we could have $E(y_{it} \mid H_{g(i)}) = \Phi(h_{it}\beta_o)$ in the random-effects framework, or a more complicated functional form for correlated random effects after integrating out $(c_i, v_{g(i)})$ based on some normal distributional assumptions. It might be instructive to discuss in detail the estimation of three-level binary/fractional response models within the correlated random-effects framework. When we allow an unbalanced data structure, the number of schools varies across districts and the number of observation periods differs across schools. For the same reason as in unbalanced panel data models, we should, at least, allow the parameters and variances to depend on the group size and the number of time periods. In the correlated random-effects context, we could model the heterogeneity as follows:

$$c_i + v_{g(i)} = \bar{H}_{g(i)} \phi_{T_i Q_g} + \varepsilon_i + \varepsilon_g, \qquad \varepsilon_i + \varepsilon_g \mid H_{g(i)} \sim \text{Normal}(0, \sigma^2_{T_i Q_g}) \tag{2.39}$$

where $T_i$ denotes the number of time periods for school i and $Q_g$ denotes the number of schools in district g. If there is not much variation in $T_i$ and $Q_g$, we could first generate dummy variables for the different values of $\{T_i Q_{g(i)}\}_{i=1}^{N}$. Substituting equation (2.39) into the conditional mean function, we obtain a heteroskedastic binary/fractional response model:

$$E(y_{it} \mid H_{g(i)}) = \Phi\!\left( \frac{h_{it}\beta^* + \bar{H}_{g(i)}\phi_{T_i Q_g}}{\sqrt{1 + \sigma^2_{T_i Q_g}}} \right) \tag{2.40}$$

There are slight differences between the conditional variances of the binary response and fractional response models. For the binary response model, $\text{var}(y_{it} \mid H_{g(i)}) = \Phi(h_{it}\beta_o)(1 - \Phi(h_{it}\beta_o))$.
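The attenuation in equation (2.40) rests on the normal-mixture identity $E_\varepsilon[\Phi(a + \varepsilon)] = \Phi(a/\sqrt{1+\sigma^2})$ for $\varepsilon \sim \text{Normal}(0, \sigma^2)$. A quick Monte Carlo check of that identity (the index value and variance below are illustrative, not tied to any design in this chapter):

```python
import math
import numpy as np

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

rng = np.random.default_rng(0)
a, sigma2 = 0.7, 2.0                        # illustrative index value and variance

# Average Phi(a + eps) over draws of the normal heterogeneity...
mc = Phi(a + rng.normal(0.0, math.sqrt(sigma2), size=400_000)).mean()
# ...and compare with the attenuated index, as in equation (2.40).
closed_form = Phi(a / math.sqrt(1.0 + sigma2))
print(mc, closed_form)                      # agree up to Monte Carlo error
```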
For the fractional response model, we could allow more flexibility in the conditional variance, $\text{var}(y_{it} \mid H_{g(i)}) = \sigma^2_{ot} \Phi(h_{it}\beta_o)(1 - \Phi(h_{it}\beta_o))$, which is a modified Bernoulli GLM variance assumption. For the fractional response model, $\sigma^2_{ot}$ can be estimated by $\frac{1}{N}\sum_{i=1}^{N} \frac{\check\nu_{it}^2}{\Phi(h_{it}\check\beta)(1 - \Phi(h_{it}\check\beta))}$, where $\check\nu_{it} = y_{it} - \Phi(h_{it}\check\beta)$. In the mixed-effects literature, the binary response model with random intercepts at the school level and the district level makes the following assumption:

$$y_{it} = 1\{h_{it}\beta + c_i + v_{g(i)} + u_{it} > 0\} \tag{2.41}$$

Usually the heterogeneities and the idiosyncratic errors are assumed to be normally distributed. As we have seen, the GEE method also requires distributional assumptions on the heterogeneities, but the consistency of GEE estimators does not require the correct specification of $\text{var}(u_g)$. Mixed-effects models, on the other hand, are usually estimated by maximum likelihood, which gives us inconsistent estimators if we misspecify the variance-covariance matrix of the idiosyncratic errors. Clearly, there is a trade-off between efficiency and robustness. The simulation study in the next section compares GEE estimators and mixed-effects estimators.

Three-Level Count Response Models

For the count response model, or more generally for models with a non-negative dependent variable, we usually specify the conditional mean function as $E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = \exp(h_{it}\beta_o + c_i + v_{g(i)})$. Notice that I have implicitly imposed the strict exogeneity assumption. Other functional forms are also available, but the exponential function is by far one of the most popular choices in empirical research. Without loss of generality, we might just write $E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = c_i v_{g(i)} \exp(h_{it}\beta_o)$, where $(c_i, v_{g(i)})$ is strictly positive. The random-effects assumption for this multiplicative functional form can be stated in terms of conditional means.
$$E(c_i \mid H_{g(i)}) = E(v_g \mid H_g) = 1, \qquad c_i \perp\!\!\!\perp v_{g(i)} \mid H_g \tag{2.42}$$

We could allow different means for the school effects and the district effects, but this does not matter in the context of fixed group sizes and fixed time periods. Without such a normalization, we would get an estimator of a scaled $\beta_o$. With the above assumption, we can integrate out $c_i$ and $v_{g(i)}$ and get the estimating equation $E(y_{it} \mid H_{g(i)}) = \exp(h_{it}\beta_o)$. Notice that the above assumption rules out dependence between $c_i$ and $v_{g(i)}$, which could be restrictive in some cases. We could instead assume $E(c_i v_{g(i)} \mid H_g) = 1$, which allows the heterogeneities to be correlated. Other than pooled NLS estimation, we could also use pooled Poisson quasi-MLE estimation, since the Poisson distribution belongs to the so-called linear exponential family. As shown by Gourieroux, Monfort, and Trognon (1984), the consistency of the Poisson quasi-MLE estimator only requires the conditional mean to be correctly specified. We should use robust variance estimators to allow for an arbitrary conditional variance and the serial correlation induced by $c_i$ and $v_{g(i)}$. For the extended GEE method, we could assume $\text{var}^*(y_{it} \mid H_{g(i)}) = \sigma^2_{ot} \exp(h_{it}\beta_o)$. With these specific conditional mean and conditional variance functions, we can apply the extended GEE method to improve efficiency over the pooled estimators. Another version of the Poisson random-effects assumptions is also available. Hausman, Hall, and Griliches (1984) (HHG) provide a set of assumptions for count response panel models. Fixed-effects estimators are also available under the HHG assumptions, and Wooldridge (1999) further demonstrates that the fixed-effects Poisson estimator is consistent under fairly weak conditions. The fixed-effects estimator will not be the focus of this part because it does not deliver partial effects. Here I extend the HHG framework from panel data to three-level models. The Poisson random-effects assumptions in our framework are as follows.
Poisson random-effects assumptions:

$$y_{it} \mid H_{g(i)}, c_i, v_{g(i)} \sim \text{Poisson}\big(c_i v_{g(i)} \exp(h_{it}\beta_o)\big) \tag{2.43}$$
$$y_{it} \perp\!\!\!\perp y_{is} \mid H_{g(i)}, c_i, v_{g(i)}, \qquad D(c_i \mid H_{g(i)}) = D(c_i) \tag{2.44}$$
$$y_{it} \perp\!\!\!\perp y_{js} \mid H_{g(i)} = H_{g(j)}, c_i, c_j, v_{g(i)}, \qquad D(v_{g(i)} \mid H_{g(i)}) = D(v_{g(i)}) \tag{2.45}$$

With some parametric assumptions on the distributions of $c_i$ and $v_{g(i)}$, full conditional MLE is available and is efficient if all assumptions hold. The drawback of full MLE is that the estimator is not robust to the failure of any of the above assumptions. An alternative approach is to implement multivariate weighted nonlinear least squares estimation. The MWNLS estimator is robust to the failure of the above Poisson random-effects assumptions: it is consistent under the correct specification of the conditional mean along with the strict exogeneity assumption and equation (2.42). The MWNLS estimator could be more efficient than the extended GEE estimator if some of the Poisson random-effects assumptions are true, because the weighting matrix used in MWNLS then provides a better approximation of $\text{var}(y_g \mid H_g)$ than EGEE. This is a special case in which we can explicitly quantify the correlations across time and across schools because of the multiplicative functional form. Define $\nu_{it} = y_{it} - E(y_{it} \mid H_{g(i)})$ and $e_{it} = y_{it} - E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)})$. If we assume $E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = c_i v_{g(i)} E(y_{it} \mid H_{g(i)})$, as we did in the count response model, then $\nu_{it} = e_{it} + E(y_{it} \mid H_{g(i)})(c_i v_{g(i)} - 1)$. Under the Poisson random-effects assumptions, with some algebra, we have the following results:

$$E(\nu_{it}^2 \mid H_{g(i)}) = \alpha_1 m(h_{it};\beta_o) + \alpha_2 m(h_{it};\beta_o)^2 \tag{2.46}$$
$$E(\nu_{it}\nu_{is} \mid H_{g(i)}) = \alpha_2 m(h_{it};\beta_o) m(h_{is};\beta_o), \quad t \neq s \tag{2.47}$$
$$E(\nu_{it}\nu_{js} \mid H_{g(i)} = H_{g(j)}) = \alpha_3 m(h_{it};\beta_o) m(h_{js};\beta_o), \quad i \neq j \tag{2.48}$$

where $\alpha_1 = E[c_i v_{g(i)}]$, $\alpha_2 = E[(c_i v_{g(i)} - 1)^2]$, and $\alpha_3 = E[(c_i v_{g(i)} - 1)(c_j v_{g(j)} - 1)]$ for two different schools i and j in the same district.
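Given values for the $\alpha$'s and the fitted means, the implied working variance $\text{var}^*(y_g \mid H_g)$ for one district can be assembled cell by cell from (2.46)-(2.48). A sketch (the function name, array layout, and numerical values are my own):

```python
import numpy as np

def var_star(m, a1, a2, a3):
    """Assemble var*(y_g | H_g) from moments (2.46)-(2.48).
    m is the Q x T array of conditional means m(h_it; beta) for one district;
    the result is the QT x QT matrix in school-major, time-minor order."""
    Q, T = m.shape
    mu = m.ravel()
    school = np.repeat(np.arange(Q), T)
    same_school = school[:, None] == school[None, :]
    # (2.47) for same-school off-diagonals, (2.48) for cross-school cells...
    V = np.where(same_school, a2, a3) * np.outer(mu, mu)
    # ...and (2.46) overwrites the diagonal with the own variances.
    V[np.diag_indices_from(V)] = a1 * mu + a2 * mu ** 2
    return V

V = var_star(np.array([[1.0, 2.0], [3.0, 4.0]]), a1=1.0, a2=0.5, a3=0.2)
```

Because the off-diagonal blocks scale with the conditional means, the implied correlation matrix varies with the covariates, which is the point made in the text: a constant working correlation cannot reproduce this structure.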
Obviously, when the Poisson random-effects assumptions are true, a constant working correlation matrix would be inappropriate, because the conditional correlation matrix depends on the covariates. With the above results, we can specify $\text{var}^*(y_g \mid H_g)$ under the Poisson random-effects assumptions. Using $\text{var}^*(y_g \mid H_g)$ to replace the working matrix in the extended GEE estimation gives us the MWNLS estimator, which could be more efficient than the EGEE estimator. Notice that $\text{var}^*(y_g \mid H_g)$ could differ from the true conditional variance $\text{var}(y_g \mid H_g)$ if any Poisson distributional assumption fails. As usual, we need some preliminary estimators to implement the MWNLS estimation. In particular, we can estimate all the unknown $\alpha$'s based on the moment conditions (2.46)-(2.48):

$$\hat\alpha_2 = \frac{1}{NT(T-1)} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s \neq t} \frac{\check\nu_{it}\check\nu_{is}}{\check{m}_{it}\check{m}_{is}} \tag{2.49}$$
$$\hat\alpha_1 = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \frac{\check\nu_{it}^2}{\check{m}_{it}} - \hat\alpha_2 \check{m}_{it} \right) \tag{2.50}$$
$$\hat\alpha_3 = \frac{1}{GQ(Q-1)T^2} \sum_{g=1}^{G} \sum_{i \in g} \sum_{j \in g \setminus \{i\}} \sum_{t=1}^{T} \sum_{s=1}^{T} \frac{\check\nu_{it}\check\nu_{js}}{\check{m}_{it}\check{m}_{js}} \tag{2.51}$$

For inference about MWNLS estimators, we should use robust sandwich variance matrix estimators, as in equation (2.22), to allow for failure of the Poisson random-effects assumptions. Fixed-effects estimators are also available, but we would not be able to obtain partial effects from these estimators in the count response model, because we have little information about the correlation between the heterogeneities and the explanatory variables. A middle ground between random effects and fixed effects is the correlated random-effects framework, which allows us to explicitly model the correlation between the unobserved effects and the explanatory variables. We still maintain $E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = c_i v_{g(i)} \exp(h_{it}\beta_o)$. In addition, I assume $E(c_i v_{g(i)} \mid H_{g(i)}) = \exp(\bar{H}_{g(i)}\phi_{T_i Q_g})$. The law of iterated expectations then gives $E(y_{it} \mid H_{g(i)}) = \exp(\bar{H}_{g(i)}\phi_{T_i Q_g} + h_{it}\beta_o)$. With this moment condition, we can use the EGEE or MWNLS method to get consistent estimators of $\phi_o$ and $\beta_o$.
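The moment estimators in equations (2.49)-(2.51) can be sketched in NumPy for the balanced case; the simulated gamma heterogeneity below is my own illustrative design, chosen so the population values of $(\alpha_1, \alpha_2, \alpha_3)$ are known:

```python
import numpy as np

def alpha_hats(y, m, Q):
    """Moment estimators in the spirit of (2.49)-(2.51); y and m are (G*Q) x T
    arrays of outcomes and fitted means, rows ordered school-within-district.
    A balanced structure is assumed."""
    N, T = y.shape
    G = N // Q
    nu = y - m
    r = nu / m
    # alpha_2 from distinct-period cross products within a school, (2.49)
    a2 = (r.sum(axis=1) ** 2 - (r ** 2).sum(axis=1)).sum() / (N * T * (T - 1))
    # alpha_1 from own variances net of the alpha_2 term, (2.50)
    a1 = (nu ** 2 / m - a2 * m).mean()
    # alpha_3 from cross-school products within a district, (2.51)
    rg = r.reshape(G, Q, T)
    S = rg.sum(axis=(1, 2))
    own = (rg.sum(axis=2) ** 2).sum(axis=1)
    a3 = (S ** 2 - own).sum() / (G * Q * (Q - 1) * T ** 2)
    return a1, a2, a3

# Check on simulated Poisson data with gamma heterogeneity (made-up values):
rng = np.random.default_rng(1)
G, Q, T = 4000, 2, 3
c = rng.gamma(2.0, 0.5, size=(G * Q, 1))                # E[c] = 1, Var(c) = 0.5
v = np.repeat(rng.gamma(2.0, 0.5, size=G), Q)[:, None]  # district effect
m = np.exp(rng.normal(0.0, 0.3, size=(G * Q, T)))
y = rng.poisson(c * v * m).astype(float)
print(alpha_hats(y, m, Q))  # close to (1.0, 1.25, 0.5) in this design
```

Here $\alpha_1 = E[cv] = 1$, $\alpha_2 = E[c^2]E[v^2] - 1 = 1.25$, and $\alpha_3 = E[v^2] - 1 = 0.5$, so the printed estimates give a direct check on the formulas.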
The average structural function can be obtained by fixing $h_{it} = h$ and integrating out the heterogeneities:

$$\text{ASF}(h) = E_{\bar{H}_{g(i)}}\big[\exp(\bar{H}_{g(i)}\phi_{T_i Q_g} + h\beta_o)\big] \tag{2.52}$$

A consistent estimator of $\text{ASF}(h)$ is $\frac{1}{N}\sum_{i=1}^{N} \exp(\bar{H}_{g(i)}\hat\phi_{T_i Q_g} + h\hat\beta)$. Average partial effects can then be estimated either by differentiating with respect to continuous variables or by differencing with respect to discrete changes. Bootstrapping provides valid inference for the average partial effects.

2.6 Simulation

This section presents several simulation results comparing the efficiency of GEE estimators and other competing estimators under various data generating processes. It is also of practical interest to compare the efficiency of GEE estimators that use different working correlation matrices. Specifically, this section compares the Monte Carlo standard deviations of four groups of estimators: the traditional GEE estimator and the GEE estimator with time-varying dispersion parameters; the GEE estimator and the pooled estimator in two-level nonlinear models; the EGEE estimator and the pooled estimator in three-level nonlinear models; and the EGEE estimator and the mixed-effects estimator in three-level nonlinear models. In most nonlinear models, it is difficult to analytically derive the population correlations of the structural errors even if we have full knowledge of the data generating process. Take the random-effects probit model as an example. If $E(y_{it} \mid x_i, z_i) = \Phi(x_{it}\alpha_c + z_i\beta_c)$ and the structural error is $y_{it} - \Phi(x_{it}\alpha_c + z_i\beta_c)$, then the within-group correlation is

$$\rho_{ts} = \text{Corr}\big(y_{it} - \Phi(x_{it}\alpha_c + z_i\beta_c),\; y_{is} - \Phi(x_{is}\alpha_c + z_i\beta_c) \mid x_i, z_i\big)$$

for which it is difficult to derive a closed-form expression. To show the efficiency gain from allowing time-varying dispersion parameters, it is helpful to look at a simple linear panel data model in which the data generating process immediately determines the within-group correlations.
Comparison of the traditional GEE estimator and the GEE estimator with time-varying dispersion parameters:

$$y_{it} = \beta_0 + z_i\beta_1 + x_{it}\beta_2 + c_i + u_{it}$$

where $\beta_0 = 1$, $\beta_1 = 2$, and $\beta_2 = 3$ are the true parameters. I assume that $z_i \sim \text{Gamma}(2,3)$, $c_i \sim \text{Uniform}(-2,2)$, and $x_{it} = \sqrt{z_i} + 0.5 w_{it}$, where $w_{it} \sim \text{Normal}(0,9)$. I also assume a balanced panel structure with T = 3 for each i. Next, I specify several different DGPs for the error term $u_{it}$.

(DGP1): $\{u_{it}\}_{t=1}^{T}$ are independent across time and have a different variance in each period; specifically, $u_{it} \sim \text{Normal}(0, t^2)$. It is easy to verify that the true correlation of the composite errors $c_i + u_{it}$ does not follow an exchangeable correlation structure.

(DGP2): $\{u_{it}\}_{t=1}^{T}$ follows an AR(1) process with parameter $\rho$. In particular, $u_{it} = \rho u_{it-1} + \varepsilon_{it}$ with $\varepsilon_{it} \sim \text{Normal}(0,1)$, and the initial value $u_{i0}$ is drawn from $\text{Normal}(0,4)$. The true correlation of the composite errors $c_i + u_{it}$ does not follow from the standard AR(1) process because of the presence of $c_i$.

(DGP3): $\{u_{it}\}_{t=1}^{T}$ follows an MA(1) process with parameter $\theta$. In particular, $u_{it} = \theta\varepsilon_{it} + \varepsilon_{it-1}$ with $\varepsilon_{it} \sim \text{Normal}(0,4)$. Again, because of the presence of $c_i$, the composite error does not follow the standard MA(1) process.

Table 2.1 presents the simulation results for various estimators of $\beta_2$ under the different data generating processes. In particular, we consider the pooled OLS estimator, GEE estimators with three different working correlation matrices, and GEE estimators with time-varying dispersion parameters. In each replication, we obtain the robust standard errors of all these estimators, because we rarely know the true correlation of the structural errors in empirical research. It is now conventional to make inference robust to the failure of second-moment assumptions, and I follow this practice here. To carry out the efficiency comparison, I calculate the mean value of the ratio of the Monte Carlo standard deviations.
Each entry in the following table is the ratio of the Monte Carlo standard deviation of the corresponding row estimator to the Monte Carlo standard deviation of the pooled estimator. The sample size is 500 and the number of replications is 1000. In the following table, I use ∗ to indicate estimators that use time-varying dispersion parameters. From these results, we can easily see the efficiency improvements of GEE estimators over pooled estimators, which was studied in the original paper by Liang and Zeger (1986). The new result here is that we can further improve the efficiency of GEE estimators by allowing a different dispersion parameter for each period. This finding is useful for any non-nested data structure, such as panel data, in which we can get precise estimates of these dispersion parameters. The GEE estimator with the exchangeable working correlation matrix performs well across the data generating processes. Next, I compare the efficiency of GEE estimators and pooled estimators in nonlinear panel models.

Comparison of the pooled estimator and the GEE estimator in nonlinear panel models:

(DGP4): Random-effects binary response model: $y_{it} = 1\{\alpha_0 + z_i\alpha_1 + x_{it}\alpha_2 + c_i + u_{it} \geq 0\}$, where $\alpha_0 = 1$, $\alpha_1 = 2$, and $\alpha_2 = -3$ are the true parameters. I assume that $u_{it}$ follows the standard normal distribution, $z_i \sim \text{Gamma}(2,3)$, $c_i \sim \text{Normal}(0,\sigma^2_c)$, and $x_{it} = \sqrt{z_i} + 0.5 w_{it}$, where $w_{it} \sim \text{Normal}(0,9)$. I also assume the balanced data structure $T_i = 3$ for each i. As mentioned before, it is difficult to derive the within-group correlation in this model. The intuition, however, is that as the "importance" of $c_i$ increases, the within-group correlation increases. Thus, by varying the value of $\sigma^2_c$, we can compare the Monte Carlo standard deviations of the different estimators under different within-group correlations.
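As a concrete illustration, DGP4 can be generated as follows. This is a sketch: the RNG seed is arbitrary, and I read Gamma(2,3) as shape 2, scale 3, which is an assumption about the intended parameterization:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 3
sigma2_c = 1.0                                  # vary to change within-group correlation

z = rng.gamma(2.0, 3.0, size=N)                 # z_i ~ Gamma(2, 3), read as (shape, scale)
c = rng.normal(0.0, np.sqrt(sigma2_c), size=N)  # c_i ~ Normal(0, sigma2_c)
w = rng.normal(0.0, 3.0, size=(N, T))           # w_it ~ Normal(0, 9)
x = np.sqrt(z)[:, None] + 0.5 * w               # x_it = sqrt(z_i) + 0.5 w_it
u = rng.normal(size=(N, T))                     # u_it ~ Normal(0, 1)
index = 1.0 + 2.0 * z[:, None] - 3.0 * x + c[:, None] + u
y = (index >= 0).astype(float)                  # binary response, N x T
print(y.mean())
```

Re-running the draw with larger `sigma2_c` strengthens the within-group correlation of the composite error, which is how the comparison across columns of the table is generated.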
(DGP5): Random-effects fractional response model: suppose the true conditional mean function is $E(y_{it} \mid z_i, x_i, c_i) = \Phi(\alpha_0 + \alpha_1 z_i + \alpha_2 x_{it} + c_i)$, where $\alpha_1 = 0.1$ and $\alpha_2 = -0.3$ are the true parameters. $z_i$ and $x_{it}$ are generated from the same distributions as in DGP4. The group heterogeneity $c_i$ is generated from $\text{Normal}(0,\sigma^2_c)$. Generating the fractional random variable requires an intermediate step: generate $y^*_{it}$ from $\text{Binomial}(100, \Phi(\alpha_1 z_i + \alpha_2 x_{it} + c_i))$; then $y_{it} = y^*_{it}/100$ is a fractional variable with the desired conditional mean. Again, by varying the value of $\sigma^2_c$, we can change the within-group correlation.

In the above two nonlinear models we see results very similar to those for linear models. The GEE estimator performs better than the pooled estimator, and as the within-group correlation increases, we gain greater efficiency by applying the GEE method. In all cases, the GEE estimator with the exchangeable working correlation matrix performs better than the other GEE estimators with different working correlation structures. This lends some support to the use of the exchangeable working correlation matrix in applied work.

Comparison of the pooled estimator and the GEE estimator in three-level nonlinear models:

(DGP6): Three-level random-effects binary response model: $y_{it} = 1\{\alpha_0 + w_g\alpha_1 + z_i\alpha_2 + x_{it}\alpha_3 + v_g + c_i + u_{it} \geq 0\}$, where $\alpha_0 = 1$, $\alpha_1 = 2$, $\alpha_2 = 0.5$, and $\alpha_3 = -1$ are the true parameters. $w_g \sim \text{Uniform}(-1,1)$ and $x_{it} - w_g^2 \sim \text{Uniform}(-2,1)$. All other variables are independent of each other and normally distributed; specifically, $z_i \sim \text{Normal}(0,0.04)$, and $u_{it}$ has been normalized to follow the standard normal distribution. We can change the time-series correlation by varying $\sigma^2_c$ and the within-group correlation by varying $\sigma^2_v$. Again, I take the pooled estimator of $\alpha_3$ as the benchmark and calculate the ratio of the Monte Carlo standard deviation of each GEE estimator to the Monte Carlo standard deviation of the pooled estimator.
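The three-level design in DGP6 can be generated with nested draws. A sketch with the stated parameter values; the seed, the (G, Q, T) array layout, and reading the uniform conditions as additive offsets are my own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
G, Q, T = 500, 3, 4
sigma2_v, sigma2_c = 2.0, 1.0                       # district / school variances

wg = rng.uniform(-1.0, 1.0, size=G)                 # w_g ~ Uniform(-1, 1)
z = rng.normal(0.0, 0.2, size=(G, Q))               # z_i ~ Normal(0, 0.04)
x = wg[:, None, None] ** 2 + rng.uniform(-2.0, 1.0, size=(G, Q, T))  # x_it - w_g^2 ~ U(-2, 1)
v = rng.normal(0.0, np.sqrt(sigma2_v), size=(G, 1, 1))  # district effect
c = rng.normal(0.0, np.sqrt(sigma2_c), size=(G, Q, 1))  # school effect
u = rng.normal(size=(G, Q, T))                          # standardized idiosyncratic error
index = (1.0 + 2.0 * wg[:, None, None] + 0.5 * z[:, :, None]
         - 1.0 * x + v + c + u)
y = (index >= 0).astype(float)                          # G x Q x T binary outcomes
print(y.mean())
```

The broadcasting over the (G, Q, T) axes mirrors the nesting: $v$ is shared by all observations in a district, $c$ by all periods of a school, and $u$ varies freely.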
I have also added the panel GEE estimator to the comparison. The labels are a little different in the following table. For example, GEE(AR(1), EX) denotes the GEE estimator whose working matrix has AR(1)-type correlation in its diagonal blocks, namely $R_1(\rho)$, and an exchangeable structure in its off-diagonal blocks, namely $R_2(\rho)$. We use O to denote the null matrix; thus, GEE(EX, O) represents the working matrix with zero correlation among schools within the same district. I assume a balanced data structure; there are 500 groups with 3 units in every group, and each unit is observed for 4 periods. Two things stand out from the simulation results. (1) The extended GEE estimator with the correctly specified working matrix is the most efficient among all estimators that recognize the higher-level correlation. From the DGP, it is not difficult to see that both the within-group correlation and the time-series correlation are constant; therefore, it is no surprise that GEE(EX, EX) delivers efficient results. In all three cases, we see significant efficiency improvements over the pooled estimator. An interesting finding is that the Monte Carlo standard deviation of the extended GEE estimator that ignores within-group correlations, i.e. GEE(EX, O), is very close to that of the efficient estimator. This is expected, because the within-group correlations are not as strong as the time-series correlation in my DGP. (2) If we ignore the correlations within the district and treat the data as a standard panel, then we are being too optimistic about the estimates. Comparing the extended GEE estimator with the panel GEE estimator, the panel GEE estimator has a smaller Monte Carlo standard deviation in all three cases. As we increase the time-series correlation and decrease the within-group correlation, the distortion becomes less evident.
When there is no group correlation or district effect in our example, the asymptotic variances of the three-level estimators and the panel estimators should be the same.

Comparison of the EGEE estimator and the mixed-effects estimator in three-level nonlinear models

(DGP7): The three-level binary response model is specified the same as in DGP6 except that now we allow the idiosyncratic errors to be correlated across time. When all distributional assumptions hold, the mixed-effects estimators, which are based on maximum likelihood estimation, would be efficient. In this case, we would like to know the efficiency loss when we adopt the GEE estimation. On the other hand, if any distributional assumption fails, the mixed-effects estimator is inconsistent while the GEE estimator may remain consistent. In this part, I generate the idiosyncratic errors from the standard normal distribution with AR(1)-type correlations across time, i.e., Corr(uit, uis) = ρ^|t−s|. Mixed-effects estimators that assume independent idiosyncratic errors will be inconsistent if ρ ≠ 0. I report the GEE estimator and the mixed-effects estimator of the scaled parameter α3, which is −0.5 when we fix σ²v = 2 and σ²c = 1. Monte Carlo standard deviations of the corresponding estimators are reported in parentheses.

From the results in table 2.5, we can see that the GEE estimator has good efficiency properties in comparison to the mixed-effects estimator. The surprising result is that the mixed-effects method gives very similar estimates to the EGEE estimators when ρ ≠ 0. The intuition is that the correlations of the idiosyncratic errors induce overestimation of the variance of the heterogeneity, and this inconsistency somehow cancels out with the inconsistent estimator of α3 when we calculate the scaled parameters. In general, we should not expect the mixed-effects estimators to be consistent when the distributional assumptions are violated.
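The AR(1)-correlated errors used in DGP7 can be drawn by premultiplying i.i.d. standard normals by a Cholesky factor of the AR(1) correlation matrix. This is a standard construction, sketched here rather than taken from the chapter's code.

```python
import numpy as np

def ar1_normal_errors(N, T, rho, seed=0):
    """Draw u_i ~ N(0, R) with R[t, s] = rho**|t-s|: standard normal margins
    with AR(1) correlation across time, via a Cholesky factor of R."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    R = rho ** np.abs(t[:, None] - t[None, :])   # AR(1) correlation matrix
    L = np.linalg.cholesky(R)
    return rng.standard_normal((N, T)) @ L.T     # rows are correlated T-vectors
```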
2.7 Conclusion

This paper extends the GEE estimation to multi-level nonlinear models. In particular, I consider nonlinear panel data models with a higher nested structure. One stylized example is a model for passing rates of some standardized test at the school level at various time points. Observations are correlated across time through the school fixed effects and are correlated within the same area through the district effects. When we extend the GEE estimation to higher-level models, the working correlation matrix has a more complicated form to capture the various within-group correlations. Compared with the usual mixed-effects estimator used in hierarchical models, the GEE estimator is robust to certain violations of the distributional assumptions and is less intensive in terms of computation. For some nonlinear models, such as the fractional response model, it is difficult to find any distribution with which to implement the mixed-effects estimation. In comparison with the pooled estimator, simulation evidence indicates that the GEE estimator has relatively good efficiency properties. This paper also contributes to the GEE literature in other important ways. I show that in panel data models, as well as in other non-nested models, allowing time-varying dispersion parameters can further enhance the efficiency of the GEE estimator. Simulations show that this is a nontrivial improvement. I also provide a systematic discussion of the selection of working correlation matrices. I show that the exchangeable working correlation matrix has a well-defined limit under various misspecification settings, but its definiteness needs to be checked on a case-by-case basis. When the conditional mean associated with the heterogeneities has a multiplicative functional form, we can use the multivariate weighted nonlinear least squares method to explicitly quantify the various within-group correlations.
Under certain assumptions, the MWNLS estimator is more efficient than the GEE estimator. I show how to implement the MWNLS estimation in the count response panel model with a higher nested structure.

The GEE method can be carried out by using either the one-step or the two-step estimation. If we use the same working correlation matrix, the one-step estimation and the two-step estimation are asymptotically equivalent, which is similar to the comparison between the continuously updating GMM and the two-step GMM. Less is known about the finite sample performances of the one-step GEE estimator and the two-step GEE estimator. This is one of my ongoing research projects. Other topics on multi-level models for future research include dynamic models, nonlinear models with endogeneity, and so on.

APPENDIX FOR CHAPTER 2

Case 1.1 The estimator of ρ can be obtained by solving the equation

∑_{t=1}^{T−1} ∑_{s=t+1}^{T} ( (1/N) ∑_{i=1}^{N} e_it e_is − ρ̂^{s−t} ) = 0   (B.1)

With large-N asymptotics, (1/N) ∑_{i=1}^{N} e_it e_is → ξ. After some algebra, the above equation can be rewritten as

Q(ρ̂) ≡ ξ − (2/(T(T−1))) · (ρ̂/(1−ρ̂)) · ( T − (1−ρ̂^T)/(1−ρ̂) ) = 0   (B.2)

It can be shown that dQ(ρ̂)/dρ̂ is negative over ρ̂ ∈ (−1, 1). It is easy to verify that lim_{ρ̂→1} Q(ρ̂) = ξ − 1 < 0. Equation (B.2) therefore has a unique solution only if Q(−1) > 0. When T is even, Q(−1) = ξ + 1/(T−1) > 0 by the intrinsic constraint ξ > −1/(T−1); when T is odd, Q(−1) = ξ + 1/T, which has indefinite sign. When T is odd, the working correlation matrix is well defined only if ξ > −1/T.

Case 3.2 In this case, ρ̂ can be found from the equation

∑_{t=1}^{T−1} ∑_{s=t+1}^{T} ( (1/N) ∑_{i=1}^{N} e_it e_is − ρ̂^{s−t} ) = 0   (B.3)

With large-N asymptotics, (1/N) ∑_{i=1}^{N} e_it e_{i,t+1} → ξ_t and (1/N) ∑_{i=1}^{N} e_it e_is → 0 if |t − s| > 1. After some algebra, the above equation can be rewritten as

Q(ρ̂) ≡ ∑_{t=1}^{T−1} ξ_t − (ρ̂/(1−ρ̂)) · ( T − (1−ρ̂^T)/(1−ρ̂) ) = 0   (B.4)

Again it can be shown that dQ(ρ̂)/dρ̂ is negative over ρ̂ ∈ (−1, 1) and lim_{ρ̂→1} Q(ρ̂) = ∑_{t=1}^{T−1} ξ_t − T(T−1)/2 < 0.
Equation (B.4) has a unique solution only if Q(−1) > 0. When T is even, Q(−1) = ∑_{t=1}^{T−1} ξ_t + T/2; when T is odd, Q(−1) = ∑_{t=1}^{T−1} ξ_t + (T−1)/2. The working correlation matrix is well defined if ∑_{t=1}^{T−1} ξ_t > −T/2 when T is even, and if ∑_{t=1}^{T−1} ξ_t > −(T−1)/2 when T is odd.

SIMULATION RESULTS FOR CHAPTER 2

Table 2.1: traditional GEE estimator and GEE estimator with time-varying dispersion parameters

                    DGP1    DGP2 (ρ=0.25)  DGP2 (ρ=−0.25)  DGP3 (θ=0.6)  DGP3 (θ=−0.6)
pooled estimator    1.00    1.00           1.00            1.00          1.00
pooled estimator*   0.836   0.938          0.938           0.998         0.998
GEE(EX)             0.958   0.788          0.934           0.858         0.997
GEE*(EX)            0.774   0.703          0.857           0.857         0.996
GEE(AR)             0.966   0.741          0.974           0.784         0.984
GEE*(AR)            0.786   0.669          0.897           0.783         0.982
GEE(MA)             0.973   0.848          0.986           0.802         0.987
GEE*(MA)            0.800   0.797          0.914           0.801         0.986

Table 2.2: pooled estimator and GEE estimator in a binary response panel model

                    σ²c=0.5  σ²c=1  σ²c=4  σ²c=6
pooled estimator    1.00     1.00   1.00   1.00
GEE(AR)             0.999    0.996  0.983  0.978
GEE(MA)             0.999    0.996  0.987  0.984
GEE(EX)             0.999    0.994  0.978  0.971

Table 2.3: pooled estimator and GEE estimator in a fractional response panel model

                    σ²c=0.01  σ²c=0.02  σ²c=0.03  σ²c=0.04
pooled estimator    1.00      1.00      1.00      1.00
GEE(AR)             0.918     0.829     0.762     0.718
GEE(MA)             0.950     0.951     0.993     1.04
GEE(EX)             0.897     0.793     0.718     0.670

Table 2.4: pooled estimator and EGEE estimator in a three-level binary response model

                    σ²v=2, σ²c=1   σ²v=1, σ²c=1   σ²v=1, σ²c=2
pooled estimator    1.00           1.00           1.00
GEE(EX,EX)          0.817          0.871          0.834
GEE(EX,O)           0.828          0.882          0.838
GEE(AR(1),EX)       0.880          0.908          0.877
GEE(MA(1),EX)       0.909          0.916          0.894
panel GEE(EX)       0.770          0.848          0.823

Table 2.5: EGEE and mixed-effects estimator in a three-level binary response model

                    ρ=0             ρ=0.2           ρ=0.4           ρ=0.6
pooled estimator    -0.488 (0.028)  -0.482 (0.028)  -0.463 (0.028)  -0.474 (0.028)
GEE(EX,EX)          -0.488 (0.024)  -0.483 (0.024)  -0.464 (0.023)  -0.475 (0.024)
mixed-effects       -0.488 (0.023)  -0.483 (0.023)  -0.464 (0.022)  -0.475 (0.023)
BIBLIOGRAPHY

Ai, C. (1997). A Semiparametric Maximum Likelihood Estimator. Econometrica, 65(4), 933-963.

Altonji, J. G., and R. L. Matzkin (2005). Cross Section and Panel Data Estimators for Nonseparable Models with Endogenous Regressors. Econometrica, 73(4), 1053-1102.

Blundell, R., and J. L. Powell (2004). Endogeneity in Semiparametric Binary Response Models. Review of Economic Studies, 71, 655-679.

Chaganty, N. R., and H. Joe (2006). Range of Correlation Matrices for Dependent Bernoulli Random Variables. Biometrika, 93(1), 197-206.

Chamberlain, G. (1982). Multivariate Regression Models for Panel Data. Journal of Econometrics, 18(1), 5-46.

Crowder, M. (1995). On the Use of a Working Correlation Matrix in Using Generalized Linear Models for Repeated Measures. Biometrika, 82(2), 407-410.

Crowder, M. (2001). On Repeated Measures Analysis with Misspecified Covariance Structure. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(1), 55-62.

de Leeuw, J., and E. Meijer (2008). Handbook of Multilevel Analysis. Springer.

Entorf, H., and L. Sattarova (2016). The Analysis of Prison-Prisoner Data Using Cluster-Sample Econometrics: Prison Conditions and Prisoners' Assessments of the Future. IZA Discussion Paper No. 10209. Available at SSRN: https://ssrn.com/abstract=2840153.

Gourieroux, C., A. Monfort, and A. Trognon (1984). Pseudo-Maximum Likelihood Methods: Theory. Econometrica, 52, 681-700.

Gromping, U. (1996). A Note on Fitting a Marginal Model to Mixed Effects Log-Linear Regression Data via GEE. Biometrics, 52(1), 280-285. doi:10.2307/2533162

Hausman, J., B. Hall, and Z. Griliches (1984). Econometric Models for Count Data with an Application to the Patents-R&D Relationship. Econometrica, 52(4), 909-938.

Hsiao, C., and M. H. Pesaran (2004). Random Coefficient Panel Data Models. IZA Discussion Paper No. 1236.

Johnson, T. R., and J. Kim (2004).
A Generalized Estimating Equations Approach to Mixed-Effects Ordinal Probit Models. British Journal of Mathematical and Statistical Psychology, 57, 295-310.

Klein, R., and R. Spady (1993). An Efficient Semiparametric Estimator for Binary Response Models. Econometrica, 61(2), 387-421.

Laird, N., and J. Ware (1982). Random-Effects Models for Longitudinal Data. Biometrics, 38(4), 963-974. doi:10.2307/2529876

Liang, K.-Y., and S. L. Zeger (1986). Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73(1), 13-22.

Liang, K.-Y., S. L. Zeger, and B. Qaqish (1992). Multivariate Regression Analyses for Categorical Data. Journal of the Royal Statistical Society. Series B (Methodological), 54(1), 3-40.

Manski, C. F. (1975). Maximum Score Estimation of the Stochastic Utility Model of Choice. Journal of Econometrics, 3, 205-228.

Moulton, B. (1990). An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units. Review of Economics and Statistics, 72, 334-338.

Mundlak, Y. (1978). On the Pooling of Time Series and Cross Section Data. Econometrica, 46(1), 69-85.

Papke, L. E. (1991). Interstate Business Tax Differentials and New Firm Location: Evidence from Panel Data. Journal of Public Economics, 45(1), 47-68.

Papke, L. E., and J. M. Wooldridge (2008). Panel Data Methods for Fractional Response Variables With an Application to Test Pass Rates. Journal of Econometrics, 145(1), 121-133.

Verbeke, G., and E. Lesaffre (1996). A Linear Mixed-Effects Model with Heterogeneity in the Random-Effects Population. Journal of the American Statistical Association, 91(433), 217-221.

Wang, M. (2014). Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments. Advances in Statistics, vol. 2014, Article ID 303728, 11 pages.
doi:10.1155/2014/303728

Wang, P., G.-F. Tsai, and A. Qu (2012). Conditional Inference Functions for Mixed-Effects Models With Unspecified Random-Effects Distribution. Journal of the American Statistical Association, 107(498), 725-736. doi:10.1080/01621459.2012.665199

Wooldridge, J. M. (1997). Multiplicative Panel Data Models without the Strict Exogeneity Assumption. Econometric Theory, 13(5), 667-678.

Wooldridge, J. M. (1999). Distribution-Free Estimation of Some Nonlinear Panel Data Models. Journal of Econometrics, 90, 77-97.

Wooldridge, J. M. (2003). Cluster-Sample Methods in Applied Econometrics. The American Economic Review, 93(2), 133-138.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Zhang, H., Q. Yu, C. Feng, D. Gunzler, P. Wu, and X. M. Tu (2012). A New Look at the Difference Between the GEE and the GLMM When Modeling Longitudinal Count Responses. Journal of Applied Statistics, 39(9), 2067-2079. doi:10.1080/02664763.2012.700452

CHAPTER 3

ONE-STEP GEE OR TWO-STEP GEE? A SIMULATION STUDY

3.1 Introduction

The generalized estimating equations (GEE) method was first proposed by Liang and Zeger (1986) to analyze panel data models. The GEE method is also commonly used for models with a clustering data structure. The main advantage of adopting the GEE estimation approach is to obtain efficiency gains without specifying the joint distribution of a subject's observations. The idea of GEE is to capture the time series or within-group correlations by utilizing the working correlation matrix, which is used to approximate the true correlation structure. Much empirical and simulation evidence shows that GEE estimators are usually more efficient than other estimators that ignore the within-group correlation.

The estimating equations in the GEE method resemble the score equations from a likelihood analysis.
The GEE method is also related to the multivariate weighted nonlinear least squares (MWNLS) estimation in the econometrics literature. It can be shown that the equations associated with the first order conditions derived from the MWNLS estimation are the same as the estimating equations in the GEE method, provided that we use the same working matrix in each method.

The implementation of the GEE estimation requires the consistent estimation of the parameters associated with the working matrix. The original paper by Liang and Zeger (1986) suggested using the identity working correlation matrix to obtain a preliminary consistent estimator of the parameters associated with the marginal conditional mean function. Using these preliminary consistent estimators, we can get a consistent estimator of the working correlation matrix. Based on the estimated working correlation matrix, we can then update the estimating equations to get new consistent estimators of the conditional mean parameters. This is a two-step estimation procedure in which we use some preliminary consistent estimator to estimate the working matrix in order to exploit possible efficiency gains. The updating process could continue until the differences between the previous estimators and the updated estimators converge to a certain specified range. This results in the iterated GEE estimator. Alternatively, one can directly model the working matrix as a function of the unknown parameters associated with the conditional mean. In this special setup, we can then estimate the conditional mean function together with some nuisance parameters associated with the working matrix in one step. These different implementation procedures of the GEE estimation are very similar to the different GMM estimation methods. The two-step GMM estimator also uses some inefficient weighting matrix to obtain the preliminary consistent estimators.
The iterated GMM estimator keeps updating the weighting matrix with the newly obtained estimator until the difference between the sequence of estimators converges to a certain range. The continuously updating GMM estimator sets the weighting matrix as a function of the unknown parameters of the moment conditions so that all parameters can be estimated in one step. In the GMM literature, it has been well established that these three GMM estimators have the same first order asymptotic distributions as long as the optimal weighting matrix is used. Simulation evidence has shown that the continuously updating GMM estimator has better finite sample performance compared with the two-step GMM estimator.

The main purpose of this paper is to investigate the finite sample performances of the two-step GEE estimator and the one-step GEE estimator. We compare the bias and variance of the two-step GEE estimator and the one-step GEE estimator in various Monte Carlo simulation designs. The rest of the paper is structured as follows. Section 2 reviews the GEE method in one-step and two-step estimations. Section 3 provides the simulation results for the two GEE estimators. Section 4 summarizes the paper.

3.2 One-Step and Two-Step GEE Estimations

The GEE estimation method was first proposed by Liang and Zeger (1986) to analyze panel data models, and it can also be applied to models with a nested data structure. This paper will take the panel data model as the leading case. The same analysis can be carried out for other nested models with some minor notation changes. Suppose our primary interest focuses on the estimation of the following conditional mean function:

E(yit | Xi) = m(xit; β)   (3.1)

where i = 1, 2, ..., N and t = 1, 2, ..., T. The GEE method can be easily extended to cover models with unbalanced data. We assume a balanced panel data structure to simplify notation.
yit is the dependent variable and Xi = (xi1, xi2, ..., xiT) is the collection of all the covariates of subject i over time. Note that equation (3.1) has implicitly assumed strict exogeneity in the sense that E(yit | Xi) = E(yit | xit), which is needed to guarantee the consistency of the GEE estimator. The setup is general enough to cover limited dependent variables. For example, the dependent variable could be a binary response, a fractional response, a count response, or many others. To simplify notation, we define the following conditional mean function for subject i:

E(yi | Xi) = m(Xi; β)   (3.2)

where yi = (yi1, yi2, ..., yiT)' and m(Xi; β) = (m(xi1; β), m(xi2; β), ..., m(xiT; β))'. The GEE approach accounts for neglected time series correlations by utilizing the so-called "working" variance matrix, which is denoted by W(Xi; γ, β). We use γ to represent all the unknown parameters associated with the working matrix except for β. The working matrix can be expressed as

W(Xi; γ, β) = V(Xi; β, σ²o)^(1/2) Ri(ρ) V(Xi; β, σ²o)^(1/2)   (3.3)

where V(Xi; β, σ²o) = diag{var*(yi1 | Xi), var*(yi2 | Xi), ..., var*(yiT | Xi)} and Ri(ρ) is a T × T positive semi-definite matrix associated with the unknown parameter vector ρ. The conditional variance var*(yit | Xi) is usually determined by the nature of the underlying dependent variable and can differ from the true conditional variance function. For example, when yit is a binary response, var*(yit | Xi) = m(xit; β)[1 − m(xit; β)] by construction; when yit is a fractional response, the generalized linear model (GLM) variance assumption usually takes var*(yit | Xi) = σ²o m(xit; β)[1 − m(xit; β)], where σ²o is called the dispersion parameter; when yit is a count response, the conditional variance is modeled as σ²o m(xit; β). Later we use the simulation study to show the efficiency gain from using time-specific dispersion parameters.
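Equation (3.3) can be assembled directly. The sketch below does so for the binary-response choice of var*(yit | Xi) with an exchangeable R(ρ); the function name is an illustrative assumption.

```python
import numpy as np

def working_variance(mu, rho):
    """W = V^{1/2} R(rho) V^{1/2} from equation (3.3) for one subject, using
    the binary-response variance var*(y_it | X_i) = mu_t (1 - mu_t) and an
    exchangeable R(rho).  mu is the T-vector of fitted means m(x_it; beta)."""
    T = mu.shape[0]
    v = mu * (1.0 - mu)                            # var*(y_it | X_i)
    R = np.where(np.eye(T, dtype=bool), 1.0, rho)  # exchangeable correlation
    half = np.sqrt(v)
    return half[:, None] * R * half[None, :]       # V^{1/2} R V^{1/2}
```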
One of the most important components of the working matrix is R(ρ), which is used to approximate the true correlation. Several popular choices of R(ρ) include the identity matrix, correlation matrices derived from MA or AR processes, and the unstructured correlation matrix. Except for the identity matrix, we usually need some preliminary consistent estimator of β to obtain a consistent estimator of ρ. The GEE estimators can be obtained by solving the following equations:

∑_{i=1}^{N} ∇β m(Xi; β)' W(Xi; γ̌, β̌)^(−1) [yi − m(Xi; β)] = 0   (3.4)

where ∇β m(Xi; β) = (∇β m(xi1; β)', ∇β m(xi2; β)', ..., ∇β m(xiT; β)')' and ∇β m(xit; β) is a 1 × K vector. The GEE estimators can also be characterized as multivariate weighted nonlinear least squares estimators by solving the following minimization problem:

Min_β ∑_{i=1}^{N} [yi − m(Xi; β)]' [W(Xi; γ̌, β̌)]^(−1) [yi − m(Xi; β)]   (3.5)

It can easily be shown that the first order condition from equation (3.5) is exactly the same as the estimating equation in (3.4). It has been well established that, under certain regularity assumptions, the GEE estimators are consistent and have the following asymptotic distribution:

√N(β̂ − β) ∼ N( 0, (E[si' Wi^(−1) si])^(−1) E[si' Wi^(−1) ui ui' Wi^(−1) si] (E[si' Wi^(−1) si])^(−1) )   (3.6)

where si = ∇β m(Xi; β), ui = yi − m(Xi; β), and Wi is shorthand for W(Xi; γ, β). As mentioned above, the implementation of the GEE estimation requires some preliminary estimation of the unknown working correlation matrix. To obtain the preliminary consistent estimator of the working matrix, we could use the identity matrix in the first step to get a consistent estimator of β, which is denoted by β̌. From the first-stage residuals ǔi = yi − m(Xi; β̌), we can consistently estimate a working matrix of any particular specified form.
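A sample analogue of the sandwich formula in (3.6) is straightforward to compute once the gradients si, working matrices Wi, and residuals ui are stacked; the array layout conventions below are my own, not the chapter's.

```python
import numpy as np

def sandwich_avar(S, W, U):
    """Sample analogue of the sandwich in (3.6).  S is (N, T, K) with rows
    s_i, W is (N, T, T) with working matrices W_i, U is (N, T) with residuals
    u_i.  Returns the estimated asymptotic variance of sqrt(N)(beta_hat - beta)."""
    N, K = U.shape[0], S.shape[2]
    A = np.zeros((K, K))   # sample mean of s_i' W_i^{-1} s_i
    B = np.zeros((K, K))   # sample mean of s_i' W_i^{-1} u_i u_i' W_i^{-1} s_i
    for i in range(N):
        Wi_inv_s = np.linalg.solve(W[i], S[i])   # W_i^{-1} s_i
        Wi_inv_u = np.linalg.solve(W[i], U[i])   # W_i^{-1} u_i
        A += S[i].T @ Wi_inv_s
        g = S[i].T @ Wi_inv_u                    # score contribution of subject i
        B += np.outer(g, g)
    A /= N
    B /= N
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv
```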
Using the consistent estimator of the working matrix to replace the identity matrix, we obtain the two-step GEE estimator β̂2GEE. The updating process can continue until the differences between the sequence of estimators converge to a certain range, and this results in the fully iterated GEE estimator, which is denoted by β̂FGEE. It has been shown that, as long as we use any √N-consistent preliminary estimators when we estimate the working matrix, both β̂2GEE and β̂FGEE have the same limiting distribution as stated in equation (3.6). Less is known about the finite sample performance of these two estimators.

As we have seen, the working matrix is estimated based on some preliminary consistent estimator of β. This observation leads to the following one-step GEE estimation procedure, in which we explicitly model the unknown parameters γ as a function of β:

∑_{i=1}^{N} ∇β m(Xi; β)' W(Xi; γ(β), β)^(−1) [yi − m(Xi; β)] = 0   (3.7)

Or, equivalently, we can characterize the one-step GEE estimator by the MWNLS estimation approach:

Min_β ∑_{i=1}^{N} [yi − m(Xi; β)]' [W(Xi; γ(β), β)]^(−1) [yi − m(Xi; β)]   (3.8)

Equation (3.8) looks very similar to the objective function of the continuously updating GMM (CUGMM) estimation, in which the optimal weighting matrix is also a function of the unknown parameters. Based on various simulation studies, we have seen that the CUGMM estimators have better finite sample performances compared to the two-step GMM estimators. It can be shown that the one-step GEE estimator β̂1GEE has the same first order asymptotics as described in equation (3.6). The next section will use the simulation study to compare the small sample performances of the one-step GEE estimators and the two-step GEE estimators.

3.3 Simulation Results

We have discussed that the one-step GEE estimators and the two-step GEE estimators have the same asymptotic distribution as long as we impose the same structure on the working matrix.
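Before turning to the designs, the two-step procedure of Section 3.2 can be sketched for the linear-mean case with an exchangeable working matrix, where each step reduces to a weighted least-squares solve. This is a minimal illustration under those assumptions, not the chapter's exact implementation.

```python
import numpy as np

def two_step_gee_linear(y, X):
    """Two-step GEE for a linear conditional mean m(x; beta) = x'beta with an
    exchangeable working matrix.  y is (N, T); X is (N, T, K).  Step 1 uses
    the identity working matrix (pooled OLS); step 2 plugs in rho_hat
    estimated from the step-1 residuals."""
    N, T, K = X.shape

    def solve_gee(W):
        # beta = (sum_i X_i' W^{-1} X_i)^{-1} sum_i X_i' W^{-1} y_i
        A = np.zeros((K, K))
        b = np.zeros(K)
        W_inv = np.linalg.inv(W)
        for i in range(N):
            A += X[i].T @ W_inv @ X[i]
            b += X[i].T @ W_inv @ y[i]
        return np.linalg.solve(A, b)

    beta1 = solve_gee(np.eye(T))                    # step 1: identity working matrix
    e = y - np.einsum('ntk,k->nt', X, beta1)        # step-1 residuals
    sigma2 = e.var()
    rho = np.mean([np.mean(e[:, t] * e[:, s])       # average off-diagonal moment
                   for t in range(T) for s in range(t + 1, T)]) / sigma2
    R = np.where(np.eye(T, dtype=bool), 1.0, rho)   # exchangeable R(rho_hat)
    beta2 = solve_gee(sigma2 * R)                   # step 2: estimated working matrix
    return beta1, beta2, rho
```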
This section presents several simulation results comparing the small sample performances of these two types of GEE estimators. These results should be useful to empirical researchers in choosing among the different implementation procedures of the GEE estimation. The following simulation studies set the replication number to 1000 and assume a balanced data structure in which T = 6 and N varies from 50 to 150.

3.3.1 Linear Models

Consider the following linear regression model:

yit = βo + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci + uit   (3.9)

where βo = 1, β1 = β2 = 2, and β3 = β4 = 3. ci ∼ Normal(0, σ²c), zi1 ∼ Gamma(2,3), zi2 ∼ Gamma(1,2), xit1 ∼ Normal(0,3), xit2 ∼ Normal(0,4), and uit ∼ Normal(0,4). All variables are independent of each other. By varying σ²c, we can change the serial correlation across time. In the following table, we report the finite sample performances of the one-step GEE estimators and the two-step GEE estimators of β1 and β3. Let β̂1(1AR) denote the one-step GEE estimator of β1 with the autoregressive working matrix and β̂1(2EX) denote the two-step GEE estimator of β1 with the exchangeable working matrix. In table 3.1 and table 3.2, we scale the bias and MSE by 100.

From the simulation results in both tables, we can see that the one-step GEE estimator and the two-step GEE estimator have very similar finite sample performances for estimating linear models. Based on the DGP, it is not surprising that the GEE estimator with the exchangeable working correlation matrix is the most efficient. When we use different working correlation matrices, the two-step GEE estimator seems to behave marginally better than the one-step GEE estimator in terms of MSE.

Before we move to nonlinear models, it might be of some interest to investigate the efficiency gain from allowing time-specific dispersion parameters. Many statistical software packages with a built-in GEE command take the dispersion parameter as constant.
The two-step GEE estimator can be easily adjusted to allow time-varying dispersion parameters. The DGP is almost the same as in the previous model. We now normalize ci to be a standard normal random variable and generate uit from Normal(0, αt), where α is a constant to be specified.

Again we multiply the bias and MSE by 100 in the simulation results. In table 3.3, we report the results for the two-step GEE estimators with and without time-varying dispersion parameters when α = −0.5. We use β̂*1(2EX) to denote the two-step GEE estimator with the time-varying dispersion parameters. When the sample size is reasonably large, we see that the GEE estimator with time-varying dispersion parameters has smaller bias and smaller Monte Carlo standard deviation, and hence smaller MSE as well. In table 3.4, as we set α = 1, the gains from allowing time-varying dispersion parameters become more evident. In many cases, the GEE estimator using the identity working matrix with time-varying dispersion parameters is even better than the GEE estimator with the exchangeable working matrix but with time-constant dispersion parameters. The GEE estimator with the identity working matrix and time-varying dispersion parameters is essentially the same as the weighted least squares estimator, which only requires the contemporaneous exogeneity condition to be consistent.

Other than the bias and standard deviation, we also simulate the coverage rates for the linear models under various situations. In tables 3.5 and 3.6, we report the coverage rates of the 90% confidence interval and the 95% confidence interval for the one-step GEE estimator and the two-step GEE estimator. In particular, table 3.5 reports the results for the usual linear regression model. It shows a similar pattern: the one-step GEE estimator and the two-step GEE estimator have very similar coverage rates when we use the exchangeable working correlation matrix.
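Returning to the dispersion adjustment above: in the linear case it amounts to estimating one dispersion parameter per period from preliminary residuals and reweighting. A minimal sketch (my own notation) of the identity-working-matrix version, which the text notes is just weighted least squares:

```python
import numpy as np

def gee_time_varying_dispersion(y, X, beta0):
    """Identity working correlation with time-varying dispersion parameters:
    estimate sigma2_t = (1/N) sum_i e_it^2 from preliminary residuals and
    weight period t by 1/sigma2_t.  y is (N, T), X is (N, T, K), and beta0
    is any preliminary consistent estimate (e.g. pooled OLS)."""
    e = y - np.einsum('ntk,k->nt', X, beta0)
    sigma2_t = (e ** 2).mean(axis=0)          # one dispersion estimate per period
    w = 1.0 / sigma2_t
    A = np.einsum('t,ntk,ntj->kj', w, X, X)   # sum_{i,t} w_t x_it x_it'
    b = np.einsum('t,nt,ntk->k', w, y, X)     # sum_{i,t} w_t x_it y_it
    return np.linalg.solve(A, b), sigma2_t
```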
As shown in the previous simulation study, when we use the inefficient working correlation matrix, the two-step GEE estimator usually has a smaller standard deviation compared to the one-step GEE estimator. As a consequence of the smaller standard deviations, the two-step GEE estimator has a smaller coverage rate. Table 3.6 compares the coverage rates for the traditional GEE estimator and the GEE estimator with time-varying dispersion parameters when heteroskedasticity is present in the linear model. The coverage rates of the traditional GEE estimator are slightly higher than those of the improved GEE estimator. When the sample size is reasonably large, these two GEE estimators have very close coverage rates.

3.3.2 Nonlinear Models

The binary response model is by far one of the most popular examples of nonlinear models. The data generating process is usually specified as yit = 1[y*it ≥ 0], where y*it = β0 + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci + uit. The DGPs for all the independent variables are the same as in the linear model except that now the idiosyncratic error uit follows the standard normal distribution. We set βo = 0, β1 = 0.2, β2 = −0.2, β3 = 0.3, and β4 = −0.3.

In table 3.7 and table 3.8, we report the one-step GEE estimator and the two-step GEE estimator of β1 and β3 when σ²c = 1 and σ²c = 1.5, respectively. In both tables, we scale the bias and MSE by 100. From the simulation results for binary response models, we can see that the one-step GEE estimator and the two-step GEE estimator have very similar finite sample performances. It is interesting to observe that when we are not using the efficient working matrix, the two-step GEE estimator usually has a smaller MSE compared to the one-step GEE estimator with the same working matrix. This might suggest that the two-step GEE estimator should be favored by empirical researchers, as the efficient working correlation matrix is rarely known in reality.
Next we consider the one-step GEE estimator and the two-step GEE estimator in the fractional response model. To obtain the fractional random variable, we first generate y*it from Binomial(10000, Φ(β0 + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci)), where the independent variables are generated from the same distributions as in the linear model and ci follows N(0, 0.04); then yit = y*it/10000 is the desired fractional random variable with the conditional mean function Φ(β0 + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci). We set the true values of the underlying parameters to be the same as in the previous binary response model. In table 3.9, we also scale the bias and Monte Carlo standard deviation, as well as the MSE, by 100. Again we find similar evidence that the one-step GEE estimator and the two-step GEE estimator behave equally well in finite samples. The GEE estimators, both one-step and two-step, with the stationary working correlation matrix are not considered because of convergence issues.

To provide more simulation evidence, we look into the two GEE estimators for count response models. The data generating process is as follows: yit ∼ Poisson(exp(β0 + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci)). The distributions of the independent variables are the same as before. ci is generated from the standard normal distribution and is independent of all the explanatory variables. The true parameters are set as βo = 0, β1 = 0.4, β2 = −0.4, β3 = 0.5, and β4 = −0.5. In table 3.10, we report the simulation results without any scaling, and we omit the GEE estimator with the stationary working matrix because of convergence issues. We find a similar pattern: the two-step GEE estimator behaves no worse than the one-step GEE estimator when we use the efficient working correlation matrix. The mean squared error of the two-step GEE estimator is generally smaller than that of the one-step GEE estimator when we use an inefficient working matrix.
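The binomial intermediate step used for the fractional response above can be sketched in a few lines. The index below is a deliberately simplified stand-in (one regressor and made-up coefficients) rather than the chapter's full specification.

```python
import math
import numpy as np

def generate_fractional(N=500, T=6, n_trials=10000, seed=0):
    """Intermediate-step trick: draw y*_it from Binomial(n_trials,
    Phi(index_it)) and set y_it = y*_it / n_trials, so that
    E(y_it | x_it, c_i) = Phi(index_it) by construction."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((N, T))
    c = 0.2 * rng.standard_normal((N, 1))            # group heterogeneity
    index = 0.3 * x - 0.3 + c                        # simplified linear index
    norm_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
    p = norm_cdf(index)                              # conditional mean Phi(.)
    y = rng.binomial(n_trials, p) / n_trials         # fractional response in [0, 1]
    return y, p
```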
At the end of this subsection, we report the coverage rates of the two GEE estimators for the binary response model. Tables 3.11 and 3.12 show that the two-step GEE estimator with the exchangeable working correlation matrix has very similar coverage rates compared to the one-step GEE estimator with the exchangeable working correlation matrix. When the inefficient working correlation matrices are used, the one-step GEE estimator provides a better coverage rate. As the sample size increases, the coverage rates of both GEE estimators approach the nominal rates.

3.4 Conclusion

While it is known that the one-step GEE estimator and the two-step GEE estimator have the same first order asymptotic distribution, less is known about their small sample performances. This paper conducts Monte Carlo simulations to investigate the finite sample performances of the one-step GEE estimator and the two-step GEE estimator. We show that these two GEE estimators behave equally well when we use the efficient working correlation matrix. We also find that the two-step GEE estimator seems to outperform the one-step GEE estimator in terms of the mean squared error when we use an inefficient working correlation matrix. The simulation study also shows that the two-step GEE estimator has similar coverage rates to the one-step GEE estimator in both linear and nonlinear models. This might suggest that empirical researchers use the two-step estimation approach, as we rarely know the true correlation structure. The two-step GEE estimator is less computationally demanding compared with the one-step GEE estimator. Another advantage of the two-step GEE estimator is that it makes it easier to estimate time-specific dispersion parameters in the context of panel data models. The simulation study has shown that allowing time-varying dispersion parameters can further improve the efficiency of the GEE estimator.
APPENDIX FOR CHAPTER 3

Table 3.1: one-step GEE and two-step GEE for linear models (σ 2 c = 1) bias -0.1583 -0.1583 -0.1672 -0.1655 -0.1653 -0.1641 0.2098 0.2091 0.1482 0.1508 0.1355 0.1431 N=50 sd 0.0458 0.0458 0.0459 0.0459 0.0459 0.0458 0.0714 0.0714 0.0740 0.0738 0.0748 0.0743 RMSE 0.0458 0.0458 0.0459 0.0459 0.0459 0.0458 0.0714 0.0714 0.0740 0.0738 0.0748 0.0743 N=100 sd RMSE 0.0317 0.0316 0.0317 0.0317 0.0317 0.0316 bias 0.0778 0.0778 0.0791 0.0791 0.0807 0.0801 -0.0634 -0.0632 -0.0158 0.0514 -0.0219 0.0513 0.0060 0.0517 -0.0083 0.0515 bias 0.1286 0.0317 0.1286 0.0316 0.1299 0.0317 0.1297 0.0317 0.1308 0.0317 0.1302 0.0316 0.0506 0.0506 -0.1710 0.0506 0.0506 -0.1709 -0.1896 -0.1832 -0.1928 -0.1854 0.0514 0.0513 0.0517 0.0515 N=150 sd 0.0252 0.0252 0.0253 0.0253 0.0253 0.0252 0.0410 0.0410 0.0423 0.0423 0.0426 0.0425

Table 3.2: one-step GEE and two-step GEE for linear models (σ 2 c = 2) N=100 sd RMSE 0.0536 0.0536 0.0538 0.0537 0.0540 0.0537 bias 0.2097 0.0536 0.2097 0.0536 0.2253 0.0538 0.2219 0.0537 0.2283 0.0540 0.2227 0.0537 0.0520 0.0520 -0.1797 0.0520 0.0520 -0.1796 0.0568 -0.2420 -0.2291 0.0562 -0.3575 0.0765 0.0615 -0.2779 0.0568 0.0562 0.0765 0.0615 N=150 sd 0.0424 0.0424 0.0429 0.0427 0.0429 0.0426 0.0422 0.0422 0.0463 0.0461 0.0629 0.0505 bias -0.2260 -0.2260 -0.2638 -0.2563 -0.2381 -0.2435 0.2299 0.2290 0.1111 0.1183 0.0021 0.0687 N=50 sd 0.0773 0.0773 0.0786 0.0781 0.0786 0.0781 0.0743 0.0743 0.0825 0.0817 0.1097 0.0893 RMSE 0.0773 0.0773 0.0786 0.0781 0.0786 0.0781 0.0743 0.0743 0.0825 0.0817 0.1097 0.0893 bias 0.0626 0.0626 0.0449 0.0499 0.0423 0.0504 -0.0646 -0.0640 0.0628 0.0487 0.0344 0.0173 RMSE 0.0252 0.0252 0.0253 0.0253 0.0253 0.0253 0.0410 0.0410 0.0424 0.0423 0.0426 0.0425 RMSE 0.0424 0.0424 0.0429 0.0427 0.0429 0.0427 0.0422 0.0422 0.0463 0.0461 0.0629 0.0505 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA)
ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) Table 3.3: two-step GEE estimators with and without time-varying dispersion parameters (α = 0.5) bias -0.1092 -0.0832 -0.1061 -0.0422 -0.1095 -0.0776 -0.1112 -0.0814 0.2081 0.2362 0.1154 0.1349 0.1031 0.1297 0.0914 0.1191 N=50 sd 0.0396 0.0387 0.0396 0.0393 0.0398 0.0389 0.0396 0.0389 0.0560 0.0534 0.0487 0.0442 0.0518 0.0478 0.0535 0.0493 ˆβ1(2IN) ˆβ∗ 1(2IN) ˆβ1(2EX) ˆβ∗ 1(2EX) ˆβ1(2AR) ˆβ∗ 1(2AR) ˆβ1(2MA) ˆβ∗ 1(2MA) ˆβ2(2IN) ˆβ∗ 2(2IN) ˆβ2(2EX) ˆβ∗ 2(2EX) ˆβ2(2AR) ˆβ∗ 2(2AR) ˆβ2(2MA) ˆβ∗ 2(2MA) sd bias 0.0513 0.0283 0.0401 0.0276 0.0492 0.0283 0.0360 0.0275 RMSE 0.0396 0.0387 0.0396 0.0490 0.0393 0.0235 0.0398 0.0389 0.0396 0.0524 0.0389 0.0399 0.2082 0.0560 0.1790 0.0534 0.0487 0.1697 0.1532 0.0442 0.2162 0.0518 0.0478 0.1924 0.2303 0.0535 0.0493 0.2308 N=100 N=150 sd 0.0231 0.0222 0.0229 0.0221 RMSE 0.0283 0.0276 0.0283 0.0283 0.0276 0.0276 0.0283 0.0275 0.0283 0.0283 0.0276 0.0276 0.0388 0.0366 0.0327 0.0294 0.0349 0.0319 0.0360 0.0331 RMSE bias 0.0230 0.1107 0.1002 0.0222 0.1102 0.0229 0.0229 0.0896 0.0219 0.0219 0.1183 0.0231 0.1090 0.0222 0.1170 0.0231 0.0231 0.1081 0.0222 0.0222 0.0388 0.1317 0.0318 0.0318 0.0366 0.1179 0.0307 0.0307 0.0327 0.1869 0.0273 0.0273 0.0294 0.1656 0.0252 0.0252 0.0349 0.2060 0.0287 0.0288 0.0319 0.1791 0.0270 0.0271 0.0361 0.2038 0.0296 0.0296 0.0331 0.1770 0.0279 0.0279 84 Table 3.4: two-step GEE estimators with and without time-varying dispersion parameters (α = 1) bias -0.0134 0.0145 -0.0107 0.0407 -0.0258 0.0087 -0.0025 0.0100 -0.2966 -0.3458 -0.1402 -0.2096 -0.1375 -0.2272 -0.1413 -0.2297 N=50 sd 0.0441 0.0418 0.0441 0.0418 0.0444 0.0420 0.0443 0.0420 0.0735 0.0679 0.0677 0.0613 0.0704 0.0644 0.0710 0.0649 ˆβ1(2IN) ˆβ∗ 1(2IN) ˆβ1(2EX) ˆβ∗ 1(2EX) ˆβ1(2AR) ˆβ∗ 1(2AR) ˆβ1(2MA) ˆβ∗ 1(2MA) ˆβ2(2IN) ˆβ∗ 2(2IN) ˆβ2(2EX) ˆβ∗ 2(2EX) ˆβ2(2AR) ˆβ∗ 2(2AR) ˆβ2(2MA) ˆβ∗ 2(2MA) N=100 N=150 sd bias -0.0399 -0.0522 -0.0405 -0.0561 -0.0435 -0.0574 -0.0311 RMSE RMSE 0.0324 0.0324 0.0441 
0.0305 0.0305 0.0418 0.0324 0.0324 0.0441 0.0303 0.0418 0.0303 0.0444 -0.0533 0.0325 0.0325 0.0420 0.0304 0.0304 0.0325 0.0325 0.0443 0.0420 0.0304 0.0304 0.0735 -0.0589 0.0511 0.0511 0.0467 0.0467 0.0679 0.0677 0.0468 0.0468 0.0419 0.0419 0.0613 0.0704 -0.0420 0.0489 0.0489 0.0443 0.0443 0.0644 0.0495 0.0495 0.0710 0.0649 0.0449 0.0449 0.0840 -0.0474 0.0808 0.0699 -0.0607 0.0584 sd bias 0.0244 0.0255 0.0244 0.0255 0.0245 0.0255 0.0246 RMSE 0.0255 -0.0581 0.0245 -0.0406 0.0255 -0.0581 -0.0326 0.0246 -0.0632 0.0255 0.0255 -0.0461 0.0244 0.0255 -0.0635 -0.0467 0.0244 -0.1299 0.0405 0.0405 0.0370 -0.1459 -0.1008 0.0381 -0.1111 0.0341 -0.1240 0.0392 0.0392 -0.1335 0.0355 0.0396 -0.1264 -0.1365 0.0358 0.0370 0.0381 0.0341 0.0355 0.0396 0.0358 Table 3.5: coverage rate: one-step GEE and two-step GEE for linear models (σ 2 c = 2) N=50 N=100 N=150 90% CI 95% CI 95% CI 90% CI 90% CI 95% CI 85.36% 90.91% 85.98% 91.88% 87.97% 93.78% 85.36% 90.91% 85.98% 91.88% 87.97% 93.78% 84.42% 91.26% 86.21% 92.10% 87.55% 93.46% 84.65% 90.91% 86.10% 91.55% 87.87% 93.67% 84.42% 91.03% 85.87% 92.44% 87.97% 93.88% 84.53% 90.79% 85.87% 91.66% 88.19% 93.78% 88.19% 93.62% 88.10% 93.66% 88.92% 94.30% 88.43% 93.51% 88.10% 93.66% 88.92% 94.30% 86.78% 92.80% 88.99% 94.22% 88.08% 93.99% 86.89% 93.51% 88.99% 94.10% 89.35% 93.78% 85.83% 92.80% 90.43% 95.44% 89.03% 93.99% 86.78% 93.27% 89.99% 94.77% 89.56% 94.30% ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) 85 Table 3.6: coverage rate: two-step GEE estimators with and without time-varying dispersion parameters (α = 0.5) N=50 N=100 N=150 90% CI 95% CI 95% CI 90% CI 90% CI 95% CI 86.2% 92.8% 86.3% 92.5% 87.1% 92.3% 85.7% 91.9% 86.4% 91.9% 86.7% 92.6% 86.3% 93.0% 86.3% 92.5% 87.1% 92.3% 84.6% 91.5% 85.6% 91.8% 87.0% 93.2% 85.9% 92.9% 87.0% 93.0% 86.3% 92.2% 84.4% 91.9% 86.5% 91.6% 86.7% 92.4% 85.8% 93.2% 86.9% 93.1% 86.2% 91.9% 84.9% 91.7% 86.5% 91.6% 86.8% 92.6% 88.3% 94.0% 89.1% 94.7% 89.8% 
94.8% 87.3% 93.7% 88.8% 93.4% 88.6% 93.4% 88.2% 94.0% 90.4% 95.6% 89.8% 94.9% 87.4% 94.2% 90.5% 95.3% 88.9% 94.7% 87.5% 93.9% 90.6% 94.4% 90.5% 94.4% 86.7% 93.2% 89.6% 94.5% 87.9% 94.1% 87.9% 93.9% 90.6% 94.6% 89.7% 94.4% 87.9% 92.7% 89.7% 94.4% 89.0% 94.4% ˆβ1(2IN) ˆβ∗ 1(2IN) ˆβ1(2EX) ˆβ∗ 1(2EX) ˆβ1(2AR) ˆβ∗ 1(2AR) ˆβ1(2MA) ˆβ∗ 1(2MA) ˆβ2(2IN) ˆβ∗ 2(2IN) ˆβ2(2EX) ˆβ∗ 2(2EX) ˆβ2(2AR) ˆβ∗ 2(2AR) ˆβ2(2MA) ˆβ∗ 2(2MA) Table 3.7: one-step GEE and two-step GEE for the binary response models (σ 2 N=150 N=100 N=50 c = 1) bias 0.8588 0.8704 0.8479 0.8541 0.8564 0.8576 0.7515 0.7590 0.7376 0.7629 0.6952 0.7425 sd 0.0407 0.0407 0.0408 0.0408 0.0409 0.0408 0.0515 0.0515 0.0537 0.0536 0.0551 0.0543 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) sd RMSE bias 0.0415 0.3660 0.0416 0.3756 0.0417 0.0417 0.0417 0.0417 0.0520 0.2573 0.0520 0.2606 0.0542 0.0541 0.0555 0.0548 0.0288 0.0288 0.3698 0.0283 0.3752 0.0283 0.3728 0.0283 0.3770 0.0283 0.0344 0.0344 0.2963 0.0358 0.3036 0.0357 0.3161 0.0365 0.3158 0.0361 RMSE 0.0290 0.0290 0.0285 0.0285 0.0286 0.0285 0.0345 0.0345 0.0359 0.0358 0.0367 0.0362 bias 0.3850 0.3920 0.4004 0.3973 0.3988 0.3965 0.1520 0.1543 0.1578 0.1608 0.1664 0.1649 sd 0.0227 0.0228 0.0230 0.0229 0.0230 0.0229 0.0278 0.0278 0.0290 0.0290 0.0295 0.0292 RMSE 0.0231 0.0231 0.0233 0.0232 0.0233 0.0233 0.0278 0.0278 0.0290 0.0290 0.0295 0.0293 86 Table 3.8: one-step GEE and two-step GEE for the binary response models (σ 2 N=150 N=100 c = 1.5) N=50 bias 0.8070 0.8262 0.8197 0.8291 0.8260 0.8324 0.6899 0.6985 0.6998 0.7209 0.5757 0.6705 sd 0.0414 0.0415 0.0416 0.0417 0.0417 0.0416 0.0490 0.0490 0.0522 0.0521 0.0560 0.0537 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) sd RMSE bias 0.0422 0.3454 0.0423 0.3521 0.0424 0.0425 0.0425 0.0424 0.0495 0.2679 0.0495 0.2712 0.0527 0.0526 0.0563 0.0541 0.0297 0.0297 0.3690 0.0293 0.3754 0.0293 0.3777 0.0294 0.3791 
0.0293 0.0323 0.0323 0.2982 0.0341 0.3099 0.0339 0.3053 0.0360 0.3174 0.0348 RMSE 0.0299 0.0299 0.0295 0.0295 0.0296 0.0295 0.0324 0.0324 0.0342 0.0340 0.0361 0.0349 bias 0.3688 0.3745 0.3885 0.3886 0.3925 0.3911 0.1354 0.1373 0.1582 0.1588 0.1659 0.1630 sd 0.0228 0.0229 0.0232 0.0232 0.0233 0.0232 0.0262 0.0262 0.0276 0.0275 0.0288 0.0281 RMSE 0.0231 0.0231 0.0235 0.0235 0.0237 0.0235 0.0262 0.0262 0.0276 0.0275 0.0289 0.0281 Table 3.9: one-step GEE and two-step GEE for the fractional response models(all scale by 100) bias 0.3678 0.3677 0.3696 0.3679 0.5613 0.5612 0.5601 0.5597 N=50 sd 0.1933 0.1916 0.1967 0.1842 0.0761 0.0758 0.0873 0.0856 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) N=100 sd bias RMSE 0.4159 0.3630 0.4147 0.3632 0.4183 0.4111 0.5666 0.5580 0.5666 0.5581 0.5666 0.5666 0.1360 0.1353 0.3669 0.1410 0.3684 0.1291 0.0529 0.0528 0.5573 0.0613 0.5578 0.0603 RMSE 0.3873 0.3873 0.3924 0.3899 0.5604 0.5604 0.5604 0.5612 bias 0.3724 0.3724 0.3711 0.3715 0.5570 0.5570 0.5558 0.5561 N=150 sd 0.1029 0.1026 0.1074 0.1006 0.0446 0.0445 0.0507 0.0497 RMSE 0.3860 0.3860 0.3860 0.3847 0.5586 0.5586 0.5586 0.5586 87 Table 3.10: one-step GEE and two-step GEE for the count response models without any scaling bias -0.0116 -0.0118 -0.0116 -0.0124 0.0040 0.0007 0.0063 0.0002 N=50 sd 0.0932 0.0930 0.0935 0.0926 0.0862 0.0790 0.0870 0.0767 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) N=100 N=150 sd sd bias bias -0.0120 -0.0122 RMSE 0.0749 0.0938 0.0937 0.0748 0.0942 -0.0119 0.0747 0.0934 -0.0126 0.0743 0.0863 0.0790 0.0872 0.0766 -0.0002 -0.0029 -0.0007 0.0799 -0.0047 0.0712 0.0686 0.0685 RMSE RMSE 0.0697 -0.0124 0.0759 -0.0127 0.0696 0.0758 -0.0119 0.0685 0.0695 0.0756 -0.0128 0.0683 0.0694 0.0754 0.0789 0.0789 0.0800 0.0799 0.0018 0.0728 0.0729 0.0742 0.0742 -0.0011 0.0029 0.0799 0.0799 0.0693 0.0693 -0.0035 0.0799 0.0713 Table 3.11: coverage rate: one-step GEE and two-step GEE for binary response model (σ 2 c = 1) 
            N=50              N=100             N=150
            90% CI   95% CI   90% CI   95% CI   90% CI   95% CI
ˆβ1(1EX)    87.18%   92.84%   87.56%   94.08%   89.17%   93.59%
ˆβ1(2EX)    87.18%   92.84%   87.56%   94.08%   89.28%   93.49%
ˆβ1(1AR)    87.39%   92.13%   87.96%   94.08%   88.68%   93.39%
ˆβ1(2AR)    87.29%   92.33%   87.76%   94.38%   88.88%   93.69%
ˆβ1(1MA)    87.49%   92.53%   87.86%   94.18%   88.68%   93.29%
ˆβ1(2MA)    87.18%   92.43%   87.76%   94.28%   88.78%   93.59%
ˆβ2(1EX)    88.09%   93.54%   89.77%   93.98%   89.28%   94.39%
ˆβ2(2EX)    88.19%   93.74%   89.87%   93.98%   89.28%   94.39%
ˆβ2(1AR)    87.89%   93.04%   89.57%   94.18%   90.18%   94.99%
ˆβ2(2AR)    88.80%   92.94%   89.47%   94.28%   90.08%   94.79%
ˆβ2(1MA)    87.59%   92.84%   89.37%   94.58%   90.28%   95.39%
ˆβ2(2MA)    88.29%   92.84%   89.27%   94.78%   90.08%   94.89%

Table 3.12: coverage rate: one-step GEE and two-step GEE for binary response model (σ 2 c = 1.5)

            N=50              N=100             N=150
            90% CI   95% CI   90% CI   95% CI   90% CI   95% CI
ˆβ1(1EX)    87.83%   93.04%   87.59%   93.64%   88.77%   94.48%
ˆβ1(2EX)    87.83%   92.94%   87.59%   93.64%   88.87%   94.48%
ˆβ1(1AR)    88.14%   93.05%   87.99%   94.55%   87.46%   93.78%
ˆβ1(2AR)    87.53%   93.25%   88.09%   94.45%   87.76%   93.68%
ˆβ1(1MA)    87.53%   93.25%   87.89%   94.35%   87.36%   93.88%
ˆβ1(2MA)    87.53%   92.94%   87.89%   94.45%   87.76%   93.68%
ˆβ2(1EX)    88.55%   93.46%   89.71%   94.75%   90.07%   94.68%
ˆβ2(2EX)    88.55%   93.46%   89.91%   94.65%   89.97%   94.68%
ˆβ2(1AR)    87.83%   92.64%   89.51%   94.95%   89.67%   94.78%
ˆβ2(2AR)    87.22%   92.74%   90.01%   94.95%   89.57%   94.68%
ˆβ2(1MA)    86.30%   91.92%   89.51%   94.65%   89.77%   95.16%
ˆβ2(2MA)    86.81%   92.02%   90.21%   94.75%   89.47%   94.88%

BIBLIOGRAPHY

Laird, N., and J. Ware (1982). Random-Effects Models for Longitudinal Data. Biometrics, 38(4), 963-974. doi:10.2307/2529876

Liang, K.-Y., and S.L. Zeger (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.

Liang, K.-Y., S.L. Zeger and B. Qaqish (1992). Multivariate Regression Analyses for Categorical Data. Journal of the Royal Statistical Society, Series B (Methodological), 54(1), 3-40.

Lipsitz, S., G. Fitzmaurice, D. Sinha, N. Hevelone, J. Hu, and L.L. Nguyen (2017).
One-Step Generalized Estimating Equations with Large Cluster Sizes. Journal of Computational and Graphical Statistics, 26(3), 734-737.

Wang, M. (2014). Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments. Advances in Statistics, vol. 2014, Article ID 303728, 11 pages. doi:10.1155/2014/303728

Yang, Y. (2018). An Extended Generalized Estimating Equations Approach to Nonlinear Models with Hierarchical Structure. Working paper.