THREE ESSAYS ON ECONOMETRICS

By

Yimin Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2018

ABSTRACT

This dissertation consists of three chapters concerning the estimation and inference of multi-level models. The first chapter shows how various econometric methods can be extended to multi-level linear models. The second chapter is a natural extension of the first chapter which focuses on the estimation of nonlinear models with hierarchical structure. The third chapter provides practical guidance on choosing between the one-step GEE estimator and the two-step GEE estimator by conducting simulation studies.

More specifically, the first chapter considers a three-level linear model with both nested and non-nested structures. The stylized example is a model for the average score on some statewide test at the school level at various points in time; scores are observed at multiple points in time (level one) in multiple schools (level two) in various school districts (level three). In such a model, the schools are naturally nested within the districts, but time is not nested within schools or districts. This chapter extends several important econometric methods of analysis for panel data models to multi-level models. In particular, it generalizes the results of Hausman and Taylor and the subsequent literature to these multi-level models. This is a non-trivial extension because the three-level linear model now has more than one kind of heterogeneity and more than one kind of between regression. This chapter also generalizes the concept of testing the exogeneity assumptions by a simple variable addition test. It also provides a discussion of GMM estimation without the assumption of no conditional heteroskedasticity.

The second chapter considers the estimation of hierarchical nonlinear models.
It shows how to extend the generalized estimating equations (GEE) approach to models with more than two levels. When some covariates are correlated with the unobserved heterogeneity, the GEE estimation method can be combined with a modified Chamberlain-Mundlak device to account for the unbalanced data structure. Another important contribution is that this chapter provides a systematic discussion of the misspecification of the working correlation matrix. When the working correlation matrix is misspecified, some structured working correlation matrices may not have a well-defined limit, or their limiting matrix may not be positive semi-definite, which causes identification problems for the GEE estimator. When the conditional mean function associated with the heterogeneity has a multiplicative form, a more efficient estimator is available, which extends multivariate weighted nonlinear least squares estimation to multi-level models.

The third chapter investigates the small-sample performance of the one-step GEE estimator and the two-step GEE estimator, a comparison similar to that between the continuously updating GMM estimator and the two-step GMM estimator. The simulation study shows that these two estimators have similar finite-sample performance when the efficient working correlation matrix is used. When an inefficient working correlation matrix is adopted, the two-step GEE estimator usually has better finite-sample performance in terms of mean squared error. The simulation also shows the efficiency gains from relaxing the traditional generalized linear model (GLM) variance assumption. By allowing time-varying dispersion parameters, the efficiency of the GEE estimator can be further enhanced.

Copyright by
YIMIN YANG
2018

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to Jeffrey M. Wooldridge, the chair of my dissertation committee, for his continuous support and encouragement during these past five years.
His friendly guidance and insightful advice have been invaluable throughout all stages of the work. It was a pleasure and a great honor to study under his tutelage.

I am also very grateful to my committee members, Peter Schmidt and Tim Vogelsang, for their patience and generosity of time. I will always strive for those incredible personal and professional attributes that they have demonstrated.

My gratitude also extends to all the administrative staff who helped me to make smooth transitions throughout my Ph.D. study.

Last but not least, I would like to thank my family for all their love and trust. I am particularly indebted to my grandfather, who gave everything he had to help me pursue my dream. I would not have been able to finish this dissertation without their support.

TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1  AN ECONOMETRIC METHOD TO THE ESTIMATION OF MULTI-LEVEL MODELS (WITH PETER SCHMIDT)
    1.1 Introduction
    1.2 Model and Notation
    1.3 Comments on Alternative Notation
    1.4 GLS Estimation
    1.5 An IV Treatment of the Multi-Level Model
        1.5.1 A Hausman and Taylor-Type Treatment of the Multi-Level Model
        1.5.2 An Amemiya and MaCurdy and Breusch, Mizon and Schmidt-Type Treatment of the Multi-Level Model
    1.6 Exogeneity Tests
    1.7 GMM Estimation with or without the NCH Assumption
    1.8 Concluding Remarks
    APPENDIX
    BIBLIOGRAPHY
CHAPTER 2  AN EXTENDED GENERALIZED ESTIMATING EQUATIONS APPROACH TO NONLINEAR MODELS WITH HIERARCHICAL STRUCTURE
    2.1 Introduction
    2.2 Nonlinear Panel Data Models
    2.3 Generalized Estimating Equations Method for Nonlinear Panel Models
    2.4 Selection of Working Correlation Matrix
    2.5 Estimation of Higher-Level Nonlinear Models
        2.5.1 Extended GEE Estimation
        2.5.2 Some Examples
    2.6 Simulation
    2.7 Conclusion
    APPENDIX
    BIBLIOGRAPHY

CHAPTER 3  ONE-STEP GEE OR TWO-STEP GEE? A SIMULATION STUDY
    3.1 Introduction
    3.2 One-Step and Two-Step GEE Estimations
    3.3 Simulation Results
        3.3.1 Linear Models
        3.3.2 Nonlinear Models
    3.4 Conclusion
    APPENDIX
    BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Traditional GEE estimator and GEE estimator with time-varying dispersion parameters
Table 2.2: Pooled estimator and GEE estimator in a binary response panel model
Table 2.3: Pooled estimator and GEE estimator in a fractional response panel model
Table 2.4: Pooled estimator and EGEE estimator in a three-level binary response model
Table 2.5: EGEE and mixed-effects estimator in a three-level binary response model
Table 3.1: One-step GEE and two-step GEE for linear models (σ_c² = 1)
Table 3.2: One-step GEE and two-step GEE for linear models (σ_c² = 2)
Table 3.3: Two-step GEE estimators with and without time-varying dispersion parameters (α = 0.5)
Table 3.4: Two-step GEE estimators with and without time-varying dispersion parameters (α = 1)
Table 3.5: Coverage rate: one-step GEE and two-step GEE for linear models (σ_c² = 2)
Table 3.6: Coverage rate: two-step GEE estimators with and without time-varying dispersion parameters (α = 0.5)
Table 3.7: One-step GEE and two-step GEE for binary response models (σ_c² = 1)
Table 3.8: One-step GEE and two-step GEE for binary response models (σ_c² = 1.5)
Table 3.9: One-step GEE and two-step GEE for fractional response models (all scaled by 100)
Table 3.10: One-step GEE and two-step GEE for count response models without any scaling
Table 3.11: Coverage rate: one-step GEE and two-step GEE for binary response model (σ_c² = 1)
Table 3.12: Coverage rate: one-step GEE and two-step GEE for binary response model (σ_c² = 1.5)

CHAPTER 1

AN ECONOMETRIC METHOD TO THE ESTIMATION OF MULTI-LEVEL MODELS (WITH PETER SCHMIDT)

1.1 Introduction

In this paper we discuss the estimation of so-called “multi-level” (or “hierarchical”) models. The standard textbook panel data model with dependent variable y_it (an outcome for individual i at time t) is a two-level model, where level one is time and level two is individual. Each individual is observed for multiple time periods. In this paper we will consider a three-level model. Our stylized example will be a model for the average score on some statewide test at the school level at various points in time; scores are observed at multiple points in time (level one) in multiple schools (level two) in various school districts (level three).

There is a very large literature on multi-level models, which dates back at least to the seminal paper by Fuller and Battese (1973), in which there are individuals (level three), there are multiple measurements (level two) per individual, and there are multiple “determinations” (level one) per measurement. A good but somewhat dated overview treatment of this literature is Raudenbush and Bryk (2002). They consider (Chapter 8) a prototypical three-level model in which the dependent variable is student achievement, level one is student, level two is classroom and level three is school. A more recent general treatment is given by Garson (2012).

Most of the work on multi-level models has been done by individuals interested in educational research. There is little economic or econometric literature on the topic. Wooldridge (2010, pp.
876-883) discusses a three-level model under the heading “Cluster Samples with Unit-Specific Panel Data,” but he does not pursue any mathematical details. Also, after the semi-final draft of this paper was written, we became aware of Matyas (2017), an edited volume also in semi-final form that also deals with the econometrics of multi-level models. Several of its chapters overlap with our paper, as we will note below. Our paper and the chapters in Matyas (2017) were written independently of each other and more or less contemporaneously.

The econometric literature on panel data is clearly relevant to this topic, and we will discuss the estimation and testing of the three-level model in a way that follows the panel data literature as closely as is reasonably possible. We hope that this exposition will make multi-level models more accessible to econometricians, and also that it will make the relevant econometric results more accessible to non-economists who work on multi-level models.

We are specifically interested in extending the analysis of Hausman and Taylor (1981), and of Amemiya and MaCurdy (1986) and Breusch, Mizon and Schmidt (1989), to the three-level model. This is a non-trivial extension because now there are two time-invariant effects, there are the “within” regression and two “between” regressions, and a given variable may be correlated with either or both of the time-invariant effects. Currently existing applications of the Hausman-Taylor methods to the multi-level model, such as Rice, Jones and Goldstein (1998) and Kim and Frees (2007), do not identify all of the valid and relevant instruments and therefore do not yield asymptotically efficient estimators.

We also give a simple variable-addition form of the Hausman test of the exogeneity assumptions that we make, and we discuss Hausman-Taylor type estimation without the assumption of no conditional heteroskedasticity (NCH).
1.2 Model and Notation

We will consider the case that we have panel data covering T time periods for each of N schools, and the schools are grouped into G districts. In terms of standard panel data econometrics, this is just a panel model with cluster effects. In terms of the multi-level framework, time is level one, school is level two and district is level three.

In the text, for simplicity we will consider the balanced case in which the number of time periods is the same for all schools, and also in which the number of schools is the same for each district. We will let P = N/G denote the number of schools per district. The assumption that the number of schools is the same for each district is very unrealistic, but it simplifies the exposition considerably. In the Appendix we show the changes that are necessary to accommodate different numbers of schools per district and different numbers of time periods per school.

Asymptotics in this paper will be as G → ∞, which also implies that N → ∞. So we have many districts and many schools, but we do not require a large number of time periods or a large number of schools per district.

Let t = 1,...,T index time periods, i = 1,...,N index schools and g = 1,...,G index districts. For a given school (i), we know what district it is in, and so sometimes for clarity we will write g(i) to indicate the district to which school i belongs. We assume the following linear model:

y_{it} = x_{it}^T β + w_i^T α + z_{g(i)}^T γ + d_t^T θ + c_i + v_{g(i)} + u_{it}    (1.1)

Here y_{it} is the dependent variable (e.g. some school-wide average test score) for school i at time t; x_{it} is a vector of observable time-varying characteristics of school i at time t; w_i is a vector of observable time-invariant characteristics of school i; z_g is a vector of observable time-invariant characteristics of district g; and d_t is a vector of observable variables that depend only on time, not on school.
We assume that d_t includes dummies for time, so that we have time effects, which we treat as fixed since T is small. We also have unobserved school effects c_i and district effects v_g, to be treated as random, and an idiosyncratic error u_{it}. We will rewrite the model as

y_{it} = h_{it}^T ξ + ε_{it}    (1.2)

where h_{it}^T = (x_{it}^T, w_i^T, z_{g(i)}^T, d_t^T), ξ^T = (β^T, α^T, γ^T, θ^T) and ε_{it} = c_i + v_{g(i)} + u_{it}. Then for all NT observations we write (correspondingly to (1.1) and (1.2))

y = Xβ + Wα + Zγ + Dθ + c + v + u = Hξ + ε    (1.3)

The order of observations is: T observations for school 1; T observations for school 2; ...; T observations for school N. Thus, for example, y^T = (y_{11},...,y_{1T}; y_{21},...,y_{2T}; ...; y_{N1},...,y_{NT}). Also we will sort the school indices by district, so that the first P schools (and therefore the first PT observations) are in district 1, the next P schools are in district 2, etc.

1.3 Comments on Alternative Notation

The notation in equation (1.1) is not standard for the multilevel model. Consider, for example, the exposition in Wooldridge (2010, p. 877, equation (20.52)). His linear model is (with slight changes in notation to match ours more closely):

y_{gmt} = x_{gmt}^T β + w_{gm}^T α + z_g^T γ + d_t^T θ + c_{gm} + v_g + u_{gmt}    (1.4)

Here, as in equation (1.1), g = 1,...,G indexes districts and t = 1,...,T indexes time, but now m = 1,...,P indexes schools in district g, which is why m runs from 1 to P and not from 1 to N. This notation is very similar to that of other papers in the multi-level modeling literature, such as Fuller and Battese (1973), Rice, Jones and Goldstein (1998), and Kim and Frees (2007).

The obvious difference between this notation and ours is that in our notation a given school is represented by a single index i, whereas in Wooldridge (and the other papers cited) a given school is represented by a pair of indices (g,m). As it stands this is not a matter of substance because the models are really the same.
For example, if P = 4, the school that corresponds to i = 7 in our notation corresponds to g = 2, m = 3 in his notation. The reason that the models are the same is that in either case we have a school effect (our c_i, his c_{gm}) and a district effect (our v_{g(i)}, his v_g).

The triple-index representation would be useful if we wanted an effect for whatever “m” represents. That would not make sense in the case that we are discussing, because m is an arbitrarily assigned index for a school in a district. The second school in district 3 has nothing in common with the second school in district 5, for example, and correspondingly an effect for m = 2 would have no meaningful interpretation.

To better understand the issues involved in the number of indices used, we will use the words nested or non-nested to describe the relationship between levels of the data. In our example, level 2 (school) is nested in level 3 (district) because, if we know which school we have, we know which district it is in. That is, we can (and do) write g(i) to indicate the district in which we find school i. However, in our example, level 1 (time) is not nested in level 2 (school). If we know that an observation is from t = 1, that does not tell us what school it is from. (And similarly level 1 is not nested in level 3, because if we know that an observation is from t = 1, that does not tell us what district it is from.) So it is natural and useful to have two indices (i,t) to identify an observation. However, because g is a function of i, we do not need three indices. Also, because t = 1 has a meaning that does not depend on which school we are considering, it is not unreasonable to have a time effect. Note that the time effect can be treated as fixed because the fixed effects for time will not be collinear with the effects for higher levels, but we cannot have fixed effects for nested levels like school and district because they would be collinear.
We can contrast this with the fully nested example of Fuller and Battese (1973, p. 630). Here (in their notation) i indexes plots (level 3) of ground for an agricultural experiment, j indexes sub-plots (level 2) and k indexes sub-sub-plots (level 1), and y_{ijk} is an output measure for sub-sub-plot k of sub-plot j of plot i. This example is nested in the sense that if we know a sub-sub-plot, then we also know what sub-plot it is in and what plot it is in. (A similar example is discussed by Raudenbush and Bryk (2002, Chapter 8), who consider a three-level model in which the dependent variable is student achievement, level one is “student”, level two is “classroom” and level three is “school”. This data structure is fully nested in the same sense as above: there are multiple students in each classroom but each student is in only one classroom, and there are multiple classrooms in each school but each classroom is in only one school.) Once again, although the third-level (plot) index i is meaningful, the second-level (sub-plot) index j is not; we need two indices (i,j) to identify a sub-plot, and sub-plot j in plot 1 has nothing in common with sub-plot j in plot 2, for example.

In fact, because these examples are fully nested, it takes only one index to keep track of the data. Make a list of all of the sub-sub-plots and index them by n = 1,...,N. (So, for example, if there are five plots, four sub-plots per plot, and three sub-sub-plots per sub-plot, N = 60.) Then make a list of all of the sub-plots and index them by m = 1,...,M (where M = 20 in the simple numerical example). Note that m is a function of n, denoted m(n), and the plot index i is a function of m, say i(m), and therefore of n, namely i(m(n)). If we wish to have effects for plots, say v_i, and effects for sub-plots, say c_m, then we can simply write the model as

y_n = h_n^T ξ + c_{m(n)} + v_{i(m(n))} + u_n,  n = 1,...,N    (1.5)

Finally, we can consider completely non-nested models.
The first example in Fuller and Battese (1973) fits this description. Here level three is “individual,” there are multiple “measurements” (level two) per individual, and there are multiple “determinations” (level one) per measurement. They do not give a more tangible description, but we can consider the case that their “individuals” are schools, the “measurements” correspond to different achievement or aptitude tests like the SAT or ACT, and the “determinations” correspond to various subscores like the math or verbal portions of these tests. So y_{ijk} might be the school’s (k) average score on the math portion (i) of the SAT exam (j). This is non-nested in the obvious sense that knowing that a score is on the math portion of a test does not identify which test, and knowing which test does not identify which school.

A three-subscript representation is necessary because the model is non-nested and the different levels are meaningful in and of themselves. For example, we might want to have an effect for the math portion and an effect for the verbal portion, and another effect for the SAT and for the ACT. That is, we could have an error with the components c_i + v_j + d_k, each of which is meaningful. If we wish, these can all be fixed effects because they are not collinear with each other. A classic reference with a model with this structure is Matyas (1997). See Balazsi, Matyas and Wansbeek (2015, 2017) for more details of a fixed effects treatment of the three-level model.

1.4 GLS Estimation

We now return to the model as given in equations (1.1), (1.2) and (1.3) above. We can consider estimating the model by fixed effects, but we cannot include all of the time, school and district effects because the school effects would be collinear with the district effects. To put this another way, for a given district g, adding a constant to each of the school effects for the schools in that district, and subtracting the same constant from the district effect, leaves the sum c_i + v_{g(i)} unchanged.
This is a reflection of the fact that we have one level of nesting in the model. However, if we are interested only in controlling for the various types of effects in estimating the effect of x_{it} on y_{it}, we can simply ignore the district effects and have only time and school effects. This still leaves us with the important issue that we cannot identify the coefficients of time-invariant variables, and the coefficients of variables which vary more over schools and districts than over time will likely be estimated imprecisely.

We therefore wish to assume random school and district effects and estimate this model by generalized least squares (GLS). The algebra of GLS for this model has been known since Fuller and Battese (1973), but it will be convenient later to have given a brief derivation along the lines of Hausman and Taylor (1981). Balazsi, Baltagi, Matyas and Pus (2017) take a similar approach to ours, in a very general model. Our model is a special case of their general model, but none of the special cases that they analyze is the same as our model.

We make the following assumptions about the errors.

Assumption 1 (Strict exogeneity): E(c|H) = E(v|H) = E(u|H) = 0.

Assumption 1 implies that E(ε|H) = 0 and therefore ordinary least squares will be consistent for ξ in the regression of equation (1.3), subject to the usual technical conditions (e.g. no perfect multicollinearity) on H. However, we want to obtain more efficient estimates by GLS. To do so we need to say something about the variance matrix of ε.

Assumption 2 (No conditional heteroskedasticity, NCH): Conditional on H, (i) c, v and u are mutually independent; (ii) c_i, i = 1,...,N, are iid (0, σ_c²); (iii) v_g, g = 1,...,G, are iid (0, σ_v²); (iv) u_{it}, i = 1,...,N, t = 1,...,T, are iid (0, σ_u²).

Note that we have independence of the errors across groups. This will be automatically satisfied if we have a random sample across groups.
We do not have independence across schools, because of the group effects, and we do not have independence across time because of the school and group effects. Let Ω = V(ε). We want to derive expressions for Ω−1 and Ω−1/2, given the NCH assumption. It is easy to calculate that Ω = V(u) + V(c) + V(v) = σ 2 u INT + Tσ 2 c A + PTσ 2 v B (1.6) 7 where A = IN ⊗ ET , B = IG ⊗ EPT , and for any integer n, En is a matrix of dimension n× n with each element equal to 1 u Ωo where Ωo = INT + ( n. We write Ω = σ 2 )A + ( Tσ2c σ2u PTσ2v σ2u )B. Now we define the following: Q = INT − A = IN ⊗ (IT − ET ) R = A− B = IG ⊗ [(IP ⊗ ET )− EPT ] B = IG ⊗ EPT (as above) (1.7) We note that Q, R and B are idempotent; Q + R + B = INT ; and QR = QB = RB = 0. We also note that Q is the Hausman-Taylor matrix “Qv" that creates deviations from individual (school) means and that R + B(= A) is the Hausman-Taylor matrix “Pv" that creates individual means. For any data vector or matrix with observations in the order that we presented above, like y, we have (Qy)it = yit − ¯yi, (Ry)it = ¯yi − ¯¯yg(i) (By)it = ¯¯yg(i) (1.8) Here ¯yit is the mean value of y over time for school i, i.e. ¯yi = 1 of y over time and over the schools in district g, i.e. ¯¯yg = 1 y = Qy + Ry + By corresponds in scalar terms to yit = (yit − ¯yi) + (¯yi − ¯¯yg(i) + ¯¯yg(i). t=1 yit; and ¯¯yg is the mean value t=1 yit. So the identity T ∑T P ∑i(cid:51)g(i) 1 T ∑T As in the usual two-level model, these idempotent matrices have important properties in terms of what they do to the unobserved error components and the time-invariant and individually- invariant regressors. Specifically, Qc = Qv = Rv = 0; QW = 0, QZ = 0, RZ = 0, RD = 0. (1.9) Just above equation (1.7), we wrote Ωo in terms of I, A and B. We can rewrite this in terms of . It then follows σ2u +Tσ2c +PTσ2v σ2u +Tσ2c and φ2 = Q, R and B: Ωo = Q +φ1R +φ2B where φ1 = that σ2u σ2u Ω−1 = Q + 1 φ1 R + 1 φ2 B, Ω−1/2 o = Q + λ1R + λ2B (1.10) 8 (cid:113) 1 φ1 σu Ω−1/2 o = . 
where λ_1 = 1/√φ_1 = √(σ_u²/(σ_u² + Tσ_c²)) and λ_2 = 1/√φ_2 = √(σ_u²/(σ_u² + Tσ_c² + PTσ_v²)). Then Ω^{-1} = (1/σ_u²) Ω_o^{-1} and Ω^{-1/2} = (1/σ_u) Ω_o^{-1/2}.

These expressions for Ω^{-1} and Ω^{-1/2} have been known at least since Fuller and Battese (1973), but our simple derivation may have pedagogical value, and it serves to link our treatment of the next section to the work of Hausman and Taylor (1981).

GLS can be expressed as ξ̃_GLS = (H^T Ω^{-1} H)^{-1} H^T Ω^{-1} y, or equivalently as (H^T Ω_o^{-1} H)^{-1} H^T Ω_o^{-1} y, or equivalently as the result of OLS applied to the transformed equation

Ω_o^{-1/2} y = Ω_o^{-1/2} H ξ + Ω_o^{-1/2} ε    (1.11)

It may be useful to be explicit about the form of Ω_o^{-1/2} H. We have

(Ω_o^{-1/2} X)_{it} = (x_{it} − \bar{x}_i) + λ_1(\bar{x}_i − \bar{\bar{x}}_{g(i)}) + λ_2 \bar{\bar{x}}_{g(i)} = x_{it} − (1 − λ_1)\bar{x}_i + (λ_2 − λ_1)\bar{\bar{x}}_{g(i)}    (1.12a)
(Ω_o^{-1/2} W)_{it} = λ_1(w_i − \bar{\bar{w}}_{g(i)}) + λ_2 \bar{\bar{w}}_{g(i)} = λ_1 w_i + (λ_2 − λ_1)\bar{\bar{w}}_{g(i)}    (1.12b)
(Ω_o^{-1/2} Z)_{it} = λ_2 z_{g(i)}    (1.12c)
(Ω_o^{-1/2} D)_{it} = (d_t − \bar{d}) + λ_2 \bar{d} = d_t − (1 − λ_2)\bar{d}    (1.12d)

Given standard regularity conditions on H, plus Assumption 1 (strict exogeneity), GLS is consistent. (This is true whether or not the NCH assumption actually holds.) Given Assumption 1, OLS is also consistent, and in fact the consistency of OLS requires only the weaker exogeneity condition that E(ε_{it}|h_{it}) = 0. Given Assumption 1, the usual fixed effects (within) estimator is also consistent for β and θ (the coefficients of X and D), but we cannot estimate α or γ by fixed effects because QW = 0 and QZ = 0. The standard argument in favor of GLS is that, given also Assumption 2 (NCH), GLS is asymptotically more efficient than OLS or within.

The GLS estimator depends on σ_u², σ_c² and σ_v², and a feasible version of GLS will require consistent estimates of these variances. The estimation of these variances is discussed in the second part of the appendix.
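The algebra of Section 1.4 is easy to verify numerically for small dimensions. The following sketch (with arbitrary, illustrative variance values) checks the idempotency and orthogonality of Q, R and B, the decomposition of the identity, and the formula for Ω_o^{-1/2}:

```python
import numpy as np

G, P, T = 3, 2, 4                        # small dimensions for an exact check
N, NT = G * P, G * P * T

def E(n):
    # E_n: n x n matrix with every element equal to 1/n (an averaging matrix)
    return np.full((n, n), 1.0 / n)

A = np.kron(np.eye(N), E(T))             # averages over time within each school
B = np.kron(np.eye(G), E(P * T))         # averages over time and schools within district
Q = np.eye(NT) - A                       # deviations from school means
R = A - B                                # school means minus district means

# Idempotency, mutual orthogonality, and Q + R + B = I
for M in (Q, R, B):
    assert np.allclose(M @ M, M)
assert np.allclose(Q @ R, 0) and np.allclose(Q @ B, 0) and np.allclose(R @ B, 0)
assert np.allclose(Q + R + B, np.eye(NT))

# Omega_o = Q + phi1*R + phi2*B; check that (Q + lam1*R + lam2*B)^2 = Omega_o^{-1}
s_u, s_c, s_v = 1.0, 0.5, 0.25           # sigma_u^2, sigma_c^2, sigma_v^2 (illustrative)
phi1 = (s_u + T * s_c) / s_u
phi2 = (s_u + T * s_c + P * T * s_v) / s_u
Omega_o = Q + phi1 * R + phi2 * B
lam1, lam2 = 1 / np.sqrt(phi1), 1 / np.sqrt(phi2)
Om_half = Q + lam1 * R + lam2 * B        # Omega_o^{-1/2} as in (1.10)
assert np.allclose(Om_half @ Om_half @ Omega_o, np.eye(NT))
print("all identities verified")
```

The last assertion works because Q, R and B are orthogonal projections, so squaring Q + λ_1 R + λ_2 B simply squares the scalar weights, giving Ω_o^{-1} exactly as in (1.10).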
1.5 An IV Treatment of the Multi-Level Model

In this section we follow Hausman and Taylor (1981) and subsequent papers in considering models in which some of the regressors may be correlated with the school effects c_i and/or the district effects v_g. Consistent estimation will be by instrumental variables (IV), where the instruments are created from the regressors that are uncorrelated with one or both of the effects.

In Hausman and Taylor there is only one type of effect, which corresponds to our school effect. There are time-varying regressors (X) and time-invariant regressors (Z), and these are divided into regressors that are uncorrelated with the effects (X_1, Z_1) and those that are correlated with the effects (X_2, Z_2). Our case is philosophically similar, but we have more types of variables and two types of effects. Our classification of variables will be denoted as follows:

Group 1: X_1, W_1, Z_1, D = variables uncorrelated with (c, v)    (1.13a)
Group 2: X_2, W_2, Z_2 = variables uncorrelated with c but correlated with v    (1.13b)
Group 3: X_3, W_3, Z_2 = variables correlated with c    (1.13c)

We note two things. First, a district-specific variable is part of Z_2 if it is correlated with c or it is correlated with v. We do not need to distinguish these two cases because in either case it does not yield any valid instruments. Second, the variables D that depend only on t and not on i cannot be correlated with c or v, so they belong in group “1” above.

We now make the following exogeneity assumption (modified from Assumption 1 of Section 1.4) to reflect the classification of variables given above.

Assumption 3 (HT Exogeneity): (i) E(u|X,W,Z,D) = 0; (ii) E(c|X_1,X_2,W_1,W_2,Z_1,D) = 0; (iii) E(v|X_1,W_1,Z_1,D) = 0.

Given Assumption 3, plus standard regularity conditions on the regressors, the within estimator is consistent for β and θ (the coefficients of X and D). The ordinary least squares estimator is not consistent because some of the regressors in (1.1) are endogenous.
Consistent estimation of all of the parameters in (1.1) can be obtained if Assumption 3 implies enough valid and useful instruments.

1.5.1 A Hausman and Taylor-Type Treatment of the Multi-Level Model

Under Assumption 3, we can list a set of valid instruments that is a generalization of the Hausman and Taylor (HT) instrument set to our model. If M is a variable in group “1” above, then QM, RM and BM are valid instruments, if they are non-zero. If M is in group “2” above, then QM and RM are valid instruments, if they are non-zero. If M is in group “3” above, then QM is a valid instrument, if it is non-zero.

The restriction to non-zero instruments is obvious, and is needed because the following transformed variables are equal to zero: QW_1, QZ_1, RZ_1, RD, QW_2, QZ_2, RZ_2. Also BD is a constant and is unnecessary if Z_1 contains a constant, since then BZ_1 includes a constant. We therefore have the following list of valid and non-zero instruments:

J_HT = (QX_1, RX_1, BX_1, QD, RW_1, BW_1, BZ_1, QX_2, RX_2, RW_2, QX_3)    (1.14)

Let P_HT = J_HT(J_HT^T J_HT)^{-1} J_HT^T be the projection matrix onto the space spanned by J_HT. Then the IV estimator of ξ is ξ̂_IV = (H^T P_HT H)^{-1} H^T P_HT y. This estimator is consistent if Assumption 3 holds and if the standard rank condition holds (J_HT^T H is of full column rank). A necessary condition for this to hold is that we have at least as many instruments as regressors. If we adopt the convention that, for any data matrix M, K_M is the column dimension of M, we need the following condition to hold: K_X + 2K_{X1} + K_{X2} + K_D + 2K_{W1} + K_{W2} + K_{Z1} ≥ K_X + K_D + K_W + K_Z. With a little algebra, this becomes the following:

HT Order Condition: 2K_{X1} + K_{X2} + K_{W1} ≥ K_{W3} + K_{Z2}    (1.15)

Next we define the analogue of the Hausman-Taylor efficient estimator for our model. For this we need the following assumption, which is the modified NCH assumption for the present setting.

Assumption 4 (HT-NCH): Conditional on J_HT, (i) c, v and u are mutually independent.
(ii) c_i, i = 1, ..., N, are i.i.d. (0, σ_c^2).
(iii) v_g, g = 1, ..., G, are i.i.d. (0, σ_v^2).
(iv) u_it, i = 1, ..., N, t = 1, ..., T, are i.i.d. (0, σ_u^2).

This is the same as Assumption 2 above (NCH) except that now we are conditioning only on the instruments, which are exogenous, and not on the entire set of explanatory variables, some of which may be endogenous. Again assumption (iii) will be redundant if we randomly sample across groups.

Under Assumption 4, V(ε|J_HT) equals Ω as defined in equation (1.6) above. We then consider the transformed equation Ω_o^{-1/2} y = Ω_o^{-1/2} H ξ + Ω_o^{-1/2} ε as given in equation (1.11) above. We denote the analogue of the Hausman-Taylor "efficient" estimator as ξ̂_HT, equal to the instrumental variables estimator of the transformed model using the instrument set J_HT. Explicitly, ξ̂_HT = (H^T Ω_o^{-1/2} P_HT Ω_o^{-1/2} H)^{-1} H^T Ω_o^{-1/2} P_HT Ω_o^{-1/2} y. This estimator is referred to as "efficient" because, given the HT-NCH assumption, it is more efficient than the within estimator or the IV estimator ξ̂_IV. As we show below, it is the efficient GMM estimator given the orthogonality conditions implied by the HT-NCH assumptions.

There have been some previous papers that apply the Hausman and Taylor methodology to our three-level model. So far as we know, the closest to our paper is Rice, Jones and Goldstein (1998). Like us, they transform the model by Ω_o^{-1/2} and then seek a Hausman-Taylor type instrument set. This is all done correctly, but they do not find all of the available instruments. They distinguish three types of regressors: (1) X1, Z1 = regressors (time-varying and time-invariant, respectively) that are uncorrelated with both c and v; (2) X_2^u, Z_2^u = variables correlated with their u, which is our c; (3) X_2^v, Z_2^v = variables correlated with their v, which is also our v. Then their instrument set is J_RJG = (Q^u X1, Q^u X_2^u, Q^v X1, Q^v X_2^v, Q^v Z_2^v, X1, Z1). Here their Q^u = our Q and their Q^v = our (Q + R) = (I − B).
For purposes of comparison, we will split our X3 (variables correlated with c) as X3 = (X3*, X4), where X3* = variables correlated with c but not with v and X4 = variables correlated with both c and v. It is not clear to us whether X4 belongs in X_2^u or X_2^v or neither or both. From the verbal description given in Rice, Jones and Goldstein, it would appear that X4 belongs in both X_2^u and X_2^v. However, if X4 is part of X_2^v, then the instrument set contains (Q + R)X4, which is not a valid instrument, because [(Q + R)ε]_it contains (c_i − c̄_g(i)), which is still correlated with X4. Therefore we will proceed to analyze their instrument set under the interpretation that X4 is part of X_2^u but not part of X_2^v.

Under this interpretation, their X_2^u = (X3*, X4) = X3; their Z_2^u = (W3*, W4, Z3*, Z4) = (W3, Z3); their X_2^v = X2; and their Z_2^v = (W2, Z2). In our notation, their instrument set is equivalent to

J_RJG = (QX1, RX1, BX1, QD, W1, Z1, (Q + R)X2, RW2, QX3)   (1.16)

Comparing this to our instrument set J_HT in equation (1.14) above, there are two differences. First, they have W1, which equals (R + B)W1, whereas we have RW1 and BW1 separately. Second, they have (Q + R)X2 whereas we have QX2 and RX2 separately. In addition, if we had interpreted their definition of X_2^u to exclude X4, they would also be missing the instruments QX4.

Balazsi, Bun, Chan and Harris (2017) also provide a Hausman and Taylor-type treatment of a three-level model. Their approach is basically the same as ours, but the details differ because the models differ. Our model has fixed time effects and random school and district effects, which they refer to (Section 3.4) as a "mixed effects model." The details of their treatment differ from ours because in our model the two levels other than time (school and district) are nested, whereas in their model they are not.
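To make the construction concrete, the following Python sketch simulates a balanced three-level data set, applies the Q, R and B transformations, builds the instrument set (1.14) for a small model (one variable each of the types X1, X3, W1 and Z1; the X2, W2 and D groups are empty here, so the order condition 2K_X1 + K_X2 + K_W1 ≥ K_W3 + K_Z2 holds trivially), and computes the IV estimator ξ̂_IV = (H^T P_HT H)^{-1} H^T P_HT y. All sample sizes, coefficient values and variable names are illustrative, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
G, P, T = 200, 3, 4                      # districts, schools per district, periods
NT = G * P * T                           # data stacked by district, school, time

def Q(m):                                # deviations from school means
    a = m.reshape(G, P, T)
    return (a - a.mean(axis=2, keepdims=True)).reshape(NT)

def R(m):                                # school means minus district means
    a = m.reshape(G, P, T)
    s = np.broadcast_to(a.mean(axis=2, keepdims=True), a.shape)
    d = np.broadcast_to(a.mean(axis=(1, 2), keepdims=True), a.shape)
    return (s - d).reshape(NT)

def B(m):                                # district means
    a = m.reshape(G, P, T)
    return np.broadcast_to(a.mean(axis=(1, 2), keepdims=True), a.shape).reshape(NT)

# Effects and regressors: x1, w1, z1 exogenous; x3 correlated with the school effect c.
c = np.repeat(rng.normal(size=G * P), T)          # school effects
v = np.repeat(rng.normal(size=G), P * T)          # district effects
u = rng.normal(size=NT)                           # idiosyncratic errors
x1 = rng.normal(size=NT)                          # time-varying, exogenous (X1)
x3 = 0.8 * c + rng.normal(size=NT)                # time-varying, correlated with c (X3)
w1 = np.repeat(rng.normal(size=G * P), T)         # school-level, exogenous (W1)
z1 = np.repeat(rng.normal(size=G), P * T)         # district-level, exogenous (Z1)
y = 1.0 + x1 + x3 + w1 + z1 + c + v + u           # all true coefficients equal 1

H = np.column_stack([np.ones(NT), x1, x3, w1, z1])
# Instrument set (1.14) for this configuration: the group-1 variables contribute all
# of their non-zero transforms; the group-3 variable x3 contributes only Q x3.
J = np.column_stack([Q(x1), R(x1), B(x1), R(w1), B(w1), np.ones(NT), B(z1), Q(x3)])

JtJ = J.T @ J
PH = J @ np.linalg.solve(JtJ, J.T @ H)            # P_HT H without forming P_HT
Py = J @ np.linalg.solve(JtJ, J.T @ y)            # P_HT y
xi_iv = np.linalg.solve(H.T @ PH, H.T @ Py)
print(np.round(xi_iv, 2))                         # close to (1, 1, 1, 1, 1)
```

Pooled OLS on the same data is inconsistent for the coefficient on x3 because x3 is correlated with c; the IV estimator removes that bias by instrumenting x3 only with its within-school variation.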
1.5.2 An Amemiya and MaCurdy and Breusch, Mizon and Schmidt-Type Treatment of the Multi-Level Model

For the standard two-level panel data model, Amemiya and MaCurdy (1986) generated more instruments than Hausman and Taylor from the time-varying exogenous variables (our X1) and thus proposed a more efficient IV estimator. To express this construction simply, suppose that M is a data matrix, of dimension NT × L, say, whose rows are M_11, M_12, ..., M_1T, ..., M_NT, where each M_it is 1 × L. Then define M_i^o = (M_i1, M_i2, ..., M_iT), of dimension 1 × LT; define M_i^♯ = 1_T ⊗ M_i^o, of dimension T × LT; and define M* as the matrix of dimension NT × LT made up of the blocks (arranged vertically) M_1^♯, M_2^♯, ..., M_N^♯. Then Amemiya and MaCurdy replaced P_v X1 (in Hausman and Taylor notation) with X1*. Alternatively, Breusch, Mizon and Schmidt showed that this is equivalent to keeping the HT instrument set but adding (Q_v X1)* to it. Here the convention is that, because Q_v X1 is made up of deviations from means that add to zero, we use only the first T − 1 deviations from means. Therefore (Q_v X1)* is of dimension NT × (T − 1)K_X1, not NT × T K_X1.

For the model of this paper, the second expression is the more convenient. Therefore we define our Amemiya-MaCurdy type instrument set as

J_AM = (QX1, (QX1)*, RX1, BX1, QD, RW1, BW1, BZ1, QX2, RX2, RW2, QX3)   (1.17)

This is the same as J_HT as given in equation (1.14) above, except that we have added (QX1)*. These extra instruments are valid given Assumption 3, and so we have achieved an efficiency gain without additional assumptions.

While QX2 and QX3 are valid instruments given Assumption 3, (QX2)* and (QX3)* are not valid instruments without a further assumption. Following Breusch, Mizon and Schmidt, we make the following additional assumption.

Assumption 5 (BMS):
(i) E(x_2,it v_g(i)) is the same for all t.
(ii) E(x_3,it c_i) is the same for all t.
(iii) E(x_3,it v_g(i)) is the same for all t.

This condition implies that the time-specific deviations from means are uncorrelated with c_i and v_g(i), so that (QX2)* and (QX3)* are valid instruments. This leads us to the instrument set

J_BMS = (QX1, (QX1)*, RX1, BX1, QD, RW1, BW1, BZ1, QX2, (QX2)*, RX2, RW2, QX3, (QX3)*)   (1.18)

1.6 Exogeneity Tests

The various estimators discussed above are consistent under different exogeneity assumptions. Tests of these assumptions are called exogeneity tests, and they help researchers choose an appropriate estimation method. In this section we present a simple regression-based variable addition test, and we show the equivalence between this variable addition test and the classical Hausman specification test based on the difference between an efficient and an inefficient estimator. This is a known result for the usual two-level panel data model, but so far as we are aware it is new for the three-level model.

As mentioned in Section 1.4, the fixed effects estimator is consistent under Assumption 3(i), that E(u|X, W, Z, D) = 0, which is also necessary for the consistency of the GLS and HT estimators. A statistically significant difference between the fixed effects estimator and the HT (or GLS) estimator indicates a failure of the other exogeneity assumptions required for the HT (or GLS) estimator to be consistent. So the null hypothesis is that Assumptions 3(i), 3(ii) and 3(iii) hold, while only Assumption 3(i) is maintained under the alternative hypothesis. Another way to say this is that the null hypothesis is that the HT instruments are valid, and the alternative is that they are not.

In this section, we will focus on the comparison of the fixed effects estimator and the HT estimator, because the GLS estimator is a special case of the HT estimator when all of the explanatory variables are uncorrelated with the school and district effects.
That is, the HT estimator is GLS when X1 = X, W1 = W and Z1 = Z, so that there are no variables in Groups 2 or 3 in equation (1.13) above. Even though it is not true that P_HT = I, in this case it is still true that H^T Ω_o^{-1/2} P_HT Ω_o^{-1/2} H = H^T Ω_o^{-1} H, and so HT = GLS.

Before we proceed to any details, it is important to be specific about the set of parameters that we can compare. We will write the model as

y = H1 ξ1 + H2 ξ2 + ε   (1.19)

where H1 = (X, D), H2 = (W, Z), ξ1^T = (β^T, θ^T) and ξ2^T = (α^T, γ^T), so that H1 contains the time-varying variables. Both β and θ are identified, but the asymptotic variance of the difference between the fixed effects and HT estimators of ξ1 is singular. That is, V(ξ̂1,FE − ξ̂1,HT) = V(ξ̂1,FE) − V(ξ̂1,HT) has rank of at most K_X instead of K_X + K_D. Wooldridge (2010, Chapter 10) provides a proof of this result for the random effects and fixed effects estimators in the standard panel data model. Appendix 1.3 gives a proof of this result for the fixed effects and HT estimators in our multilevel model. For this reason we will focus on the Hausman test based on β (the coefficients of X) only.

A natural test statistic is the Hausman test statistic based on the difference between β̂_HT and β̂_FE. We will first treat the case that V(β̂_HT − β̂_FE) is nonsingular. Then the Hausman test statistic is

HT1 = (β̂_HT − β̂_FE)^T [V̂(β̂_HT − β̂_FE)]^{-1} (β̂_HT − β̂_FE)   (1.20)

Under the null hypothesis HT1 converges to a χ² distribution with K_X degrees of freedom. We can further simplify the asymptotic variance under the null if we impose the HT-NCH assumptions:

V(β̂_HT − β̂_FE) = V(β̂_FE) − V(β̂_HT)   (1.21)

because β̂_HT is efficient when the HT-NCH assumptions hold. (The efficiency of β̂_HT when the HT-NCH assumptions hold is proved explicitly in Section 1.7 below.)
Therefore, we have the following traditional Hausman test statistic:

HT2 = (β̂_HT − β̂_FE)^T [V̂(β̂_FE) − V̂(β̂_HT)]^{-1} (β̂_HT − β̂_FE)   (1.22)

HT2 has a limiting χ² distribution under the null hypothesis and the HT-NCH assumptions. This test statistic is not robust to the failure of the HT-NCH assumptions. See Wooldridge (1991) for further discussion of this point in a slightly different setting.

In the standard linear panel data model, it is well known that the Hausman test is equivalent to a variable addition test, where we add the time averages of the X's to the model and test the significance of their coefficients. The same kind of equivalence between the Hausman test HT1 and a variable addition test also holds in our multi-level linear model. The variable addition test is based on the following augmented regression model:

y = H1 ξ1 + H2 ξ2 + X̄ ψ + ε_A   (1.23)

where X̄ = (R + B)X is the matrix of the individual means of the X's and ε_A is the error term in the augmented regression. We first apply the GLS transformation to the augmented regression:

ÿ = Ḧ1 ξ1 + Ḧ2 ξ2 + X* ψ + ε̈_A   (1.24)

where (ÿ, Ḧ1, Ḧ2, X*, ε̈_A) = Ω_o^{-1/2} (y, H1, H2, X̄, ε_A). The HT estimator is the instrumental variables estimator of the transformed regression (1.24), with the HT instrument set, and with the restriction ψ = 0 imposed. We show in Appendix 1.3 that the instrumental variables estimator of β in the transformed regression with unrestricted ψ is identical to the fixed effects estimator, and that a Wald statistic testing ψ = 0 is equivalent to the Hausman statistic HT1.

The comparison between the GLS estimator and the fixed effects estimator is a special case. The GLS estimator is just the OLS estimator of the transformed regression when we impose ψ = 0. The OLS estimator of β in regression (1.24) with unrestricted ψ is identical to the fixed effects estimator.
Then the Wald statistic testing ψ = 0 is equivalent to the Hausman test HT1, which is based on the difference between the GLS estimator and the fixed effects estimator. The null hypothesis for this test is that all of the regressors are exogenous.

An advantage of the variable addition test is that we can easily make the exogeneity test robust to failure of the NCH assumptions. We simply have to use a robust estimate of V(ψ̂) to construct a valid Wald test statistic when we are not willing to assume NCH or HT-NCH.

Lastly we consider the case that V̂(β̂_FE) − V̂(β̂_HT) is singular. Nonsingularity of [V̂(β̂_FE) − V̂(β̂_HT)] requires that the number of exogeneity assumptions being tested is at least as large as K_X, so that we have enough instruments to identify ψ. If this is not true, we will require a generalized inverse, and the number of degrees of freedom of the limiting chi-squared distribution will be the rank of [V̂(β̂_FE) − V̂(β̂_HT)], not K_X. Also the form of the variable addition test is altered. These points are discussed in more detail in the Appendix.

1.7 GMM Estimation with or without the NCH Assumption

In this section we consider estimation of our model by GMM, using the moment conditions implied by the HT Exogeneity Assumption of Section 1.5. We will first consider estimation without imposing the NCH assumption (Assumption 4). Let J_HT be the set of instrumental variables given in equation (1.14). Then we have the following moment conditions:

E[J_HT^T (y − Hξ)] = 0   (1.25)

This leads to a GMM estimator

ξ̂_GMM = (H^T J_HT Λ J_HT^T H)^{-1} H^T J_HT Λ J_HT^T y   (1.26)

Here Λ is a positive definite weighting matrix. Subject to technical conditions including the HT Order Condition, this estimator is consistent regardless of the choice of Λ. Picking Λ = (J_HT^T J_HT)^{-1} yields the IV estimator discussed just below equation (1.14) in Section 1.5.1. But this is not the optimal choice of Λ. The optimal weighting matrix is (V[J_HT^T (y − Hξ)])^{-1} = (V[∑_{g=1}^G J_HT,g^T (y_g − H_g ξ)])^{-1}.
A consistent estimator of the optimal weighting matrix would be its sample analogue [(1/G) ∑_{g=1}^G J_HT,g^T (y_g − H_g ξ̂)(y_g − H_g ξ̂)^T J_HT,g]^{-1}, where ξ̂ is some preliminary consistent estimator such as the GMM estimator using the identity matrix as weighting matrix. This choice yields the optimal (minimum asymptotic variance) GMM estimator based on the moment conditions (1.25).

The weighting matrix estimator of the previous paragraph is an "unrestricted" estimator in the sense that it imposes independence over values of g but it does not impose any structure on the variance matrix of ε_g. However, now suppose that we impose the HT-NCH assumption. Then we have V[J_HT^T (y − Hξ)] = E[J_HT^T Ω J_HT], and the GMM estimator can be written as:

ξ̂_GMM = [H^T J_HT (J_HT^T Ω_o J_HT)^{-1} J_HT^T H]^{-1} H^T J_HT (J_HT^T Ω_o J_HT)^{-1} J_HT^T y   (1.27)

This estimator is consistent whether the NCH assumption actually holds or not. If the NCH assumption holds, it is asymptotically equivalent to the optimal GMM estimator discussed in the previous paragraph. Its advantage lies in the general folklore that GMM estimators using an analytically derived optimal weighting matrix have better finite sample properties than GMM estimators based on an unrestricted estimated weighting matrix.

We can use the GMM estimator in (1.27) to establish the efficiency of the HT estimator ξ̂_HT under the NCH assumption. Specifically, ξ̂_HT = ξ̂_GMM as long as the same estimate of Ω_o is used. This equality holds because

J_HT (J_HT^T Ω_o J_HT)^{-1} J_HT^T = Ω_o^{-1/2} P_HT Ω_o^{-1/2}   (1.28)

as is easily demonstrated using the facts that Ω_o^{-1/2} = Q + λ1 R + λ2 B, where Q, R and B are idempotent and orthogonal to each other, and that J_HT is of the form (QJ1, RJ2, BJ3).

The same discussion holds also if we consider the AM instrument set or the BMS instrument set. Just substitute J_AM or J_BMS for J_HT.

1.8 Concluding Remarks

In this paper we have considered the estimation of a three-level hierarchical model.
The most standard econometric panel data model corresponds to a two-level (individual and time) model. We show how the methods developed for the panel data model can be applied to the three-level hierarchical model. In the three-level model, we have the within regression and we have two different between regressions, and we give a systematic treatment of GLS and efficient IV estimation in this setting.

Part of the contribution of the paper is expository. We try to explain hierarchical models to panel data econometricians, and we also try to give a clear explanation of the econometric methodology for linear panel data models to people who use hierarchical models. However, we also provide some new theoretical results. First, we give an exhaustive list of the available instruments in the Hausman-Taylor, Amemiya-MaCurdy and Breusch-Mizon-Schmidt IV approaches to this model. Second, we show how a Hausman test of the exogeneity assumptions made for GLS or IV can be implemented using a simple variable addition test. This is a known result for the two-level model but it is new for the three-level model. Third, we show how the model can be estimated without imposing the no conditional heteroscedasticity (NCH) assumption and discuss the advantages and disadvantages of such an approach.

Obviously we have touched upon only a very small part of the panel data literature. One could make a long list of topics (dynamic models, nonlinear models, selectivity models, ...) that could profitably be extended to the multi-level setting. We hope that this paper provides a useful step in that direction.

APPENDIX FOR CHAPTER 1

Different Numbers of Schools in Different Districts

In this Appendix we show how to accommodate the realistic case that the number of schools is different for different districts. This requires only some minor modifications of our notation and does not change any of the main results in the paper.
We assume that there are P_g schools in district g and each school in district g is observed for the same T_g time periods. Let u_g denote the column vector of {u_it}_{i∈g, 1≤t≤T_g} that is properly stacked by the order of the schools of district g. Similarly we define c_g and v_g as the column vectors of c and v for district g.

In this case, we need to represent the variance matrix as Ω = diag{Ω_1, Ω_2, ..., Ω_G}, where

Ω_g = V(u_g) + V(c_g) + V(v_g) = σ_u^2 I_{P_g T_g} + T_g σ_c^2 A_g + P_g T_g σ_v^2 B_g

and where A_g = I_{P_g} ⊗ E_{T_g} and B_g = E_{P_g T_g}. With normalization by dividing by σ_u^2, we have

Ω_go = I_{P_g T_g} + (T_g σ_c^2 / σ_u^2) A_g + (P_g T_g σ_v^2 / σ_u^2) B_g.

With this notation, we can now define Q_g = I_{P_g} ⊗ (I_{T_g} − E_{T_g}) and R_g = A_g − B_g = (I_{P_g} − E_{P_g}) ⊗ E_{T_g}. It is easy to verify that Q_g, R_g and B_g are idempotent matrices and are orthogonal to each other. Similarly we can write Ω_go^{-1/2} = Q_g + λ_1g R_g + λ_2g B_g, where

λ_1g = [σ_u^2 / (σ_u^2 + T_g σ_c^2)]^{1/2} and λ_2g = [σ_u^2 / (σ_u^2 + T_g σ_c^2 + P_g T_g σ_v^2)]^{1/2}.

For the whole data set with G groups, we define Q = diag{Q_1, Q_2, ..., Q_G}, R = diag{R_1, R_2, ..., R_G} and B = diag{B_1, B_2, ..., B_G}. With this revised notation we obtain the GLS transformation matrix Ω_o^{-1/2} = diag{Ω_1o^{-1/2}, Ω_2o^{-1/2}, ..., Ω_Go^{-1/2}}, which can be represented as a linear combination of the matrices Q, R and B:

Ω_o^{-1/2} = Q + λ_1 R + λ_2 B   (A.1)

where λ_1 = diag{λ_11 I_{P_1 T_1}, λ_12 I_{P_2 T_2}, ..., λ_1G I_{P_G T_G}} and λ_2 is similarly defined. All of the other analysis in the paper goes through with this revised notation.

Estimation of the Error Variances

In practice, we do not know the true values of σ_u^2, σ_c^2 and σ_v^2, and so we need consistent estimates of these variances to construct a feasible version of the GLS or efficient HT estimator. Here we will provide two possible estimation approaches. We will focus on the case of GLS, which can be extended to the case of the efficient HT estimator with trivial changes.
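As a numerical check on the revised notation of Appendix 1.1, the following Python sketch builds Q_g, R_g and B_g for one district with arbitrary P_g and T_g, forms Ω_go and Ω_go^{-1/2} = Q_g + λ_1g R_g + λ_2g B_g, and verifies idempotency, mutual orthogonality, and that the transformation in fact inverts Ω_go. The variance values are arbitrary illustrations.

```python
import numpy as np

def district_matrices(Pg, Tg):
    """Q_g, R_g, B_g for one district with Pg schools observed for Tg periods."""
    E = lambda k: np.ones((k, k)) / k          # averaging matrix E_k
    Ag = np.kron(np.eye(Pg), E(Tg))            # school-mean operator A_g
    Bg = E(Pg * Tg)                            # district-mean operator B_g
    Qg = np.eye(Pg * Tg) - Ag                  # within-school (time) deviations
    Rg = Ag - Bg                               # school means minus district means
    return Qg, Rg, Bg

Pg, Tg = 3, 5
s2u, s2c, s2v = 1.0, 0.5, 0.25                 # sigma_u^2, sigma_c^2, sigma_v^2 (illustrative)
Qg, Rg, Bg = district_matrices(Pg, Tg)

# Omega_go = I + (Tg*s2c/s2u) A_g + (Pg*Tg*s2v/s2u) B_g, using A_g = R_g + B_g
Omega_go = (np.eye(Pg * Tg) + (Tg * s2c / s2u) * (Rg + Bg)
            + (Pg * Tg * s2v / s2u) * Bg)

lam1 = np.sqrt(s2u / (s2u + Tg * s2c))
lam2 = np.sqrt(s2u / (s2u + Tg * s2c + Pg * Tg * s2v))
Omega_go_mh = Qg + lam1 * Rg + lam2 * Bg       # Omega_go^{-1/2}

# Checks: idempotent, mutually orthogonal, and Omega_go^{-1/2} Omega_go Omega_go^{-1/2} = I
for M in (Qg, Rg, Bg):
    assert np.allclose(M @ M, M)
assert np.allclose(Qg @ Rg, 0) and np.allclose(Qg @ Bg, 0) and np.allclose(Rg @ Bg, 0)
assert np.allclose(Omega_go_mh @ Omega_go @ Omega_go_mh, np.eye(Pg * Tg))
print("all identities verified")
```

The trace facts used in the variance estimators below also hold for these matrices: trace(Q_g) = P_g(T_g − 1), trace(R_g) = P_g − 1 and trace(B_g) = 1.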
1. Error Variance Estimation Based on the Within and the Between Estimators

We first obtain a consistent estimator of σ_u^2 based on the fixed effects estimator. We consider the within-transformed regression

Qy = QHξ + Qε = QH_Q ξ_Q + Qε   (A.2)

where H_Q = (X, D) = the time-varying variables. (Recall that QW = QZ = 0.) Ordinary least squares applied to (A.2) gives the within estimator of ξ_Q = (β^T, θ^T)^T, say ξ̂_Q, which is consistent given standard regularity conditions. However, in this discussion we could also simply take the true value of ξ_Q as known, because consistent estimation of the regression coefficients does not alter the consistency of the variance estimators.

With some standard algebra, it can be shown that E(ε^T Qε) = E(u^T Qu) = σ_u^2 trace(Q). With the definition of Q in Appendix 1.1, we have trace(Q) = ∑_{g=1}^G trace(Q_g) = ∑_{g=1}^G P_g(T_g − 1). Therefore we have E(ε^T Qε) = σ_u^2 ∑_{g=1}^G P_g(T_g − 1). A consistent estimator of σ_u^2 is

σ̂_u^2 = [∑_{g=1}^G P_g(T_g − 1)]^{-1} ε̂_Q^T Q ε̂_Q   (A.3)

where ε̂_Q = y − H_Q ξ̂_Q.

We can then use the between estimator at the school level to get a consistent estimator of σ_c^2. So we consider the equation

Ry = RH_R ξ_R + Rε   (A.4)

where H_R = (X, W) and ξ_R = (β^T, α^T)^T. (Recall that RD = RZ = 0.) Ordinary least squares applied to (A.4) yields ξ̂_R. Now we have E(ε^T Rε) = trace[R E(εε^T)] = ∑_{g=1}^G trace(R_g Ω_g) = ∑_{g=1}^G (σ_u^2 + T_g σ_c^2)(P_g − 1). A consistent estimator of σ_c^2 is

σ̂_c^2 = [∑_{g=1}^G T_g(P_g − 1)]^{-1} [ε̂_R^T R ε̂_R − ∑_{g=1}^G (P_g − 1) σ̂_u^2]   (A.5)

where ε̂_R = y − H_R ξ̂_R.

Finally, using the between estimator at the district level gives a consistent estimator of σ_v^2. We consider the equation

By = BHξ + Bε   (A.6)

In this case we have E(ε^T Bε) = trace[B E(εε^T)] = ∑_{g=1}^G trace(B_g Ω_g) = ∑_{g=1}^G (σ_u^2 + T_g σ_c^2 + P_g T_g σ_v^2). We then have a consistent estimator of σ_v^2,

σ̂_v^2 = [∑_{g=1}^G T_g P_g]^{-1} [ε̂_B^T B ε̂_B − ∑_{g=1}^G (σ̂_u^2 + T_g σ̂_c^2)]   (A.7)

where ε̂_B = y − H ξ̂_B and ξ̂_B is OLS applied to (A.6).
2. Error Variance Estimation Based on the Pooled OLS Estimator

We could also use the pooled OLS estimator to construct consistent estimators of the error variances. The pooled OLS estimator can be written as ξ̂_OLS = (H^T H)^{-1} H^T y and is consistent under Assumption 1 (strict exogeneity) or under the weaker condition of contemporaneous exogeneity. Let ε̂ = y − H ξ̂_OLS be the OLS residuals. Then we can simply replace ε̂_Q in (A.3), ε̂_R in (A.5) and ε̂_B in (A.7) with ε̂ and we still have consistent estimates of the variances σ_u^2, σ_c^2 and σ_v^2. This is a reflection of the fact, noted above, that as far as consistent estimation of the variances is concerned, any consistent estimator of ξ is as good as the true value.

Proofs of Some Results of Section 1.6

This section provides proofs of the main results in Section 1.6. We will first prove that V(ξ̂1,FE) − V(ξ̂1,HT) is singular. To do so, we introduce the following notation. For any variable M, let M̈ = Ω_o^{-1/2} M = QM + (λ1 R + λ2 B)M ≡ M̃ + M*. Note that Z̈ = Z* (since QZ = 0), Ẅ = W* and D̈ = D̃ + λ2 D̄, where D̄ is constant over both i and t. Then

V(ξ̂1,FE) = σ_u^2 (H̃1^T H̃1)^{-1} = σ_u^2 [ X̃^T X̃   X̃^T D̃ ; D̃^T X̃   D̃^T D̃ ]^{-1}   (A.8)

To obtain V(ξ̂1,HT), we write the HT regression as follows:

P_HT ÿ = P_HT Ḧ1 ξ1 + P_HT Ḧ2 ξ2 + P_HT ε̈   (A.9)

Applying the Frisch-Waugh-Lovell Theorem, we have ξ̂1,HT = ((P_HT Ḧ1)^oT (P_HT Ḧ1)^o)^{-1} (P_HT Ḧ1)^oT ÿ, where (P_HT Ḧ1)^o is the part of P_HT Ḧ1 that is orthogonal to P_HT Ḧ2. It is clear that the part of P_HT D̈ so obtained is D̃, since D̄ cannot be orthogonal to P_HT Ḧ2 if H2 contains an intercept. Similarly (P_HT Ẍ)^o = X̃ + Λ, where Λ = (P_HT X*)^o is the part of P_HT X* orthogonal to P_HT Ḧ2. Note that Λ^T X̃ = 0 and Λ^T D̃ = 0. Then

V(ξ̂1,HT) = σ_u^2 [ X̃^T X̃ + Λ^T Λ   X̃^T D̃ ; D̃^T X̃   D̃^T D̃ ]^{-1}   (A.10)

and so

V(ξ̂1,FE) − V(ξ̂1,HT) = σ_u^{-2} V(ξ̂1,HT) [ Λ^T Λ   0 ; 0   0 ] V(ξ̂1,FE)

Then it is the case that V(ξ̂1,FE) − V(ξ̂1,HT) is singular. It has rank of at most K_X instead of K_X + K_D.
Next we prove the equivalence of the variable addition test and the Hausman test HT1. The augmented regression for the variable addition test, transformed by the projection matrix P_HT, can be written as follows:

P_HT ÿ = P_HT Ḧ1 ξ1 + P_HT Ḧ2 ξ2 + P_HT X* ψ + P_HT ε̈_A   (A.11)

When we impose ψ = 0, the OLS estimator of the restricted regression gives the HT estimator ξ̂_HT. With unrestricted ψ, it is easy to see that the OLS estimator of ξ1 is the fixed effects estimator, since the part of P_HT Ḧ1 that is orthogonal to (P_HT Ḧ2, P_HT X*) is just (X̃, D̃). To simplify the analysis we can first project out P_HT Ḧ2, so that the regression is just P_HT ÿ = (P_HT Ḧ1)^o ξ1 + (P_HT X*)^o ψ + error. Using a generic result on restricted regression, we have the following equality:

ξ̂1,HT = ξ̂1,FE + ((P_HT Ḧ1)^oT (P_HT Ḧ1)^o)^{-1} (P_HT Ḧ1)^oT (P_HT X*)^o ψ̂   (A.12)

With a little more algebra, we can calculate the difference between the fixed effects and HT estimators of β:

β̂_HT − β̂_FE = [X̃^T X̃ − X̃^T D̃ (D̃^T D̃)^{-1} D̃^T X̃ + (P_HT X*)^oT (P_HT X*)^o]^{-1} (P_HT X*)^oT (P_HT X*)^o ψ̂   (A.13)

Let M = X̃^T X̃ − X̃^T D̃ (D̃^T D̃)^{-1} D̃^T X̃ + (P_HT X*)^oT (P_HT X*)^o and A = (P_HT X*)^oT (P_HT X*)^o. Then we can write β̂_HT − β̂_FE = M^{-1} A ψ̂. Suppose first that A is nonsingular. Then ψ̂ = A^{-1} M (β̂_HT − β̂_FE), and it follows that V(ψ̂) = A^{-1} M V(β̂_HT − β̂_FE) M A^{-1}. Therefore the statistic for the variable addition test is

ψ̂^T V(ψ̂)^{-1} ψ̂ = (β̂_HT − β̂_FE)^T M A^{-1} [A^{-1} M V(β̂_HT − β̂_FE) M A^{-1}]^{-1} A^{-1} M (β̂_HT − β̂_FE)
= (β̂_HT − β̂_FE)^T [V(β̂_HT − β̂_FE)]^{-1} (β̂_HT − β̂_FE) = HT1   (A.14)

However, the matrix A can be singular. The rank of A is the same as the rank of (P_HT X*)^o, which has K_X columns. We will write P_HT = P(QJ1) + P(RJ2) + P(BJ3) where J1 = H1, J2 = (X1, W1, X2, W2), J3 = (X1, W1, Z1).
Since P(QJ1)(X*)^o = 0, apart from standard issues of multicollinearity the rank of (P_HT X*)^o is determined by the rank of (P(RJ2) + P(BJ3)), and nonsingularity of A requires that the number of variables in J2 plus the number of variables in J3 be greater than or equal to K_X. That is, nonsingularity of A requires that the number of exogeneity assumptions being tested must be at least K_X.

To simplify notation, let ∆ = β̂_HT − β̂_FE and V_∆ = V(∆). Then it is still the case that M∆ = A ψ̂, but when A is singular so is V_∆, and we will need to use generalized inverses. As a matter of notation, let r = rank(A) = rank(V_∆).

For arbitrary A, consider the following conditions on a matrix G: (i) AGA = A; (ii) GAG = G; (iii) AG and GA are symmetric. Then the following are standard definitions. (a) G is a generalized inverse of A if (i) holds, and we write G = A^-. (b) G is a reflexive generalized inverse of A if (i) and (ii) both hold, and we write G = A_r^-. (c) G is the Moore-Penrose pseudo inverse of A if (i), (ii) and (iii) hold, and we write G = A^+. The relevance of these facts to our discussion is the fact that, if Y ~ N(0, Σ), then Y^T Σ_r^- Y is distributed as chi-squared with r degrees of freedom, where r = rank(Σ). See Rao and Mitra (1972), p. 615, Corollary 7.1 and the discussion in the following paragraph.

When V_∆ is singular, it is well known that the Hausman test statistic HT1 = ∆^T V_∆^+ ∆ and that it is asymptotically distributed as chi-squared with r degrees of freedom. We now want to show that this is equivalent to a variable addition test, but this test has to be a test of the hypothesis that Aψ = 0, not just ψ = 0, since Aψ can be estimated but Aψ = 0 does not imply ψ = 0 when A is singular. That is, the test must depend on A ψ̂, not ψ̂. So our test statistic will be (A ψ̂)^T [V(A ψ̂)]_r^- (A ψ̂). But A ψ̂ = M∆ and V(A ψ̂) = M V_∆ M, so the variable addition test statistic is equal to ∆^T M (M V_∆ M)_r^- M∆, where (M V_∆ M)_r^- is a reflexive generalized inverse of M V_∆ M.
It is easy to verify that M^{-1} V_∆^+ M^{-1} is such a reflexive generalized inverse, since M V_∆ M · M^{-1} V_∆^+ M^{-1} · M V_∆ M = M V_∆ M and M^{-1} V_∆^+ M^{-1} · M V_∆ M · M^{-1} V_∆^+ M^{-1} = M^{-1} V_∆^+ M^{-1}. So the variable addition test statistic equals ∆^T M · M^{-1} V_∆^+ M^{-1} · M∆ = ∆^T V_∆^+ ∆ = HT1.

BIBLIOGRAPHY

Amemiya, T. and T.E. MaCurdy (1986). Instrumental Variable Estimation of an Error Components Model, Econometrica, 54, 869-880.

Balazsi, L., L. Matyas and T. Wansbeek (2015). The Estimation of Multidimensional Fixed Effects Panel Data Models, Econometric Reviews, published online, DOI:10.1080/07474938.2015.1032164.

Balazsi, L., L. Matyas and T. Wansbeek (2017). Fixed Effects Models, Chapter 1 of The Econometrics of Multi-dimensional Panels (L. Matyas, ed.), Springer-Verlag, forthcoming.

Balazsi, L., B.H. Baltagi, L. Matyas and D. Pus (2017). Random Effects Models, Chapter 2 of The Econometrics of Multi-dimensional Panels (L. Matyas, ed.), Springer-Verlag, forthcoming.

Balazsi, L., M. Bun, F. Chan and M.N. Harris (2017). Models with Endogenous Regressors, Chapter 3 of The Econometrics of Multi-dimensional Panels (L. Matyas, ed.), Springer-Verlag, forthcoming.

Breusch, T.S., G.E. Mizon and P. Schmidt (1989). Efficient Estimation Using Panel Data, Econometrica, 57, 695-700.

Chamberlain, G. (1980). Analysis of Covariance with Qualitative Data, Review of Economic Studies, 47, 225-238.

Fuller, W.A. and G.E. Battese (1973). Transformations for Estimation of Linear Models with Nested-Error Structure, Journal of the American Statistical Association, 68, 626-632.

Garson, G.D. (2012). Hierarchical Linear Modeling: Guide and Applications, Sage Publications.

Kim, J.-S. and E.W. Frees (2007). Multilevel Modeling with Correlated Effects, Psychometrika, 72, 505-533.

Hausman, J.A. and W.E. Taylor (1981). Panel Data and Unobservable Individual Effects, Econometrica, 49, 1377-1398.

Matyas, L. (1997). Proper Econometric Specification of the Gravity Model.
World Economy, 20, 363-369.

Matyas, L. (ed.) (2017). The Econometrics of Multi-dimensional Panels, Springer-Verlag, forthcoming.

Mundlak, Y. (1978). On the Pooling of Time-Series and Cross-Section Data, Econometrica, 46, 69-86.

Rao, C.R. and S.K. Mitra (1972). Generalized Inverse of a Matrix and Its Applications, Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 1, 601-620.

Raudenbush, S.W. and A.S. Bryk (2002). Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd edition, Sage Publications.

Rice, N., A.M. Jones and H. Goldstein (1998). Multilevel Models Where the Random Effects Are Correlated with the Fixed Predictors: A Conditioned Iterative Generalised Least Squares Estimator (CIGLS), Multilevel Modeling Newsletter, 10, 10-14.

Wooldridge, J.M. (1991). On the Application of Robust, Regression-Based Diagnostics to Models of Conditional Means and Conditional Variances, Journal of Econometrics, 47, 5-46.

Wooldridge, J.M. (2010). Econometric Analysis of Cross Section and Panel Data, 2nd edition, MIT Press.

CHAPTER 2

AN EXTENDED GENERALIZED ESTIMATING EQUATIONS APPROACH TO NONLINEAR MODELS WITH HIERARCHICAL STRUCTURE

2.1 Introduction

The generalized estimating equations (GEE) method was first proposed by Liang and Zeger (1986) for the analysis of panel data models. It is also common to see applications of the GEE method in models with cluster data. See, for example, Entorf and Sattarova (2016). The GEE estimation method has been well developed for the two-level model, which includes the panel data model as a special case. When the model has a more complicated nested structure, mixed-effects estimation is widely used rather than the GEE method. The main contribution of this paper is to extend GEE estimation to higher-level models, which serves as an alternative to mixed-effects estimation for multi-level models.
Compared with the usual mixed-effects estimator in multi-level models, the GEE estimator is robust to certain misspecifications. Because the mixed-effects estimator is based on maximum likelihood estimation, the computational burden escalates as the dimension of the heterogeneity increases. For some nonlinear models, the fractional response model for example, it is difficult to specify the underlying distribution needed to obtain mixed-effects estimators. Detailed discussions of mixed-effects models can be found in Raudenbush and Bryk (2002) and De Leeuw and Meijer (2008). In comparison with the pooled estimator, our simulation results show that the GEE estimator has good efficiency properties under various data generating processes. I show in detail how to obtain the GEE estimator in multi-level nonlinear models.

The second contribution of this paper is to provide a systematic discussion of the selection of the working correlation matrices that are needed to implement GEE estimators. An important ingredient of GEE estimation is the use of a working correlation matrix to approximate the correlations within a group. A necessary assumption for the consistency of the GEE estimator, as stated in Liang and Zeger (1986), is that the chosen working correlation matrix should converge to some nonrandom positive semi-definite matrix. Crowder (1995) exhibits a special case in which a misspecified working correlation matrix does not have a well-defined limit. To date, there is no systematic analysis of this issue. This paper examines the limits of several popular working correlation matrices under different misspecification settings. The results provide some useful guidelines for the application of the GEE method.

This paper also shows how to apply the GEE method to hierarchical nonlinear models with unbalanced data structure within the correlated random effects framework.
As suggested in Wooldridge (2010) for the panel binary response model, when the heterogeneity is correlated with the covariates, a modified Chamberlain-Mundlak device can be used to accommodate an unbalanced data structure. In multi-level models, we can have more than one source of heterogeneity. I extend the treatment of Wooldridge (2010) for the panel data model to the multi-level model with an unbalanced data structure and show how the GEE estimation method can be applied to the modified conditional mean function. Another contribution of this paper to the GEE literature is that we can further enhance the efficiency of the GEE estimator by allowing time-varying dispersion parameters. Traditional GEE estimation uses a time-constant dispersion parameter in the so-called generalized linear model variance assumption. In the panel data model, or more generally, for models with a non-nested data structure, we have the flexibility to allow time-varying dispersion parameters, each of which can be precisely estimated. Simulation evidence shows that this is a nontrivial efficiency improvement over the traditional GEE estimator. This paper also extends the multivariate weighted nonlinear least squares (MWNLS) method for the panel data model to multi-level models. In some special cases, we can obtain a more efficient estimator by using the multivariate weighted nonlinear least squares method. I illustrate this with the multi-level count response model, in which we can explicitly calculate the within-group correlations. The resulting MWNLS estimator is more efficient than the GEE estimator under certain assumptions. My stylized example in this paper is a model for passing rates of a standardized test at the school level at various time points. Data on passing rates and other variables are collected from schools in various districts at multiple points in time. This is a three-level model where time is level one, school is level two, and district is level three.
The model is set up to have both nested and non-nested structures. Schools in the same district are nested, and we can use a school's identity to track which district it is in. On the other hand, time is not nested within school or district. Throughout this paper, I focus on binary/fractional response models and count response models for expository purposes, though most methods are applicable to other nonlinear models. For the asymptotic analysis, I assume that the number of groups (districts) goes to infinity while the group sizes and time periods are finite. The remainder of the paper is organized as follows. In section 2.2, we first briefly review existing econometric methods for nonlinear panel models. Applications of the GEE method in nonlinear panel models and some extensions are discussed in section 2.3. Section 2.4 provides a discussion of the selection of working correlation matrices. I consider three-level nonlinear models with both nested and non-nested structures in section 2.5. Simulation results are presented in section 2.6. Section 2.7 concludes the paper.

2.2 Nonlinear Panel Data Models

This section briefly summarizes some existing estimation methods for nonlinear panel data models. For nonlinear models, we are interested not only in the structural parameters, which give us the direction of the effects and the relative effects, but also in partial effects and average partial effects (APEs). Instead of proceeding with abstract nonlinear models, I use the binary/fractional response model and the count response model for illustration.

Example 1 (binary/fractional response panel models): The binary response model and the fractional response model are closely related. When we use the same underlying distribution function, they have the same conditional mean.

y_{it} = 1[x_{it}\alpha + z_i\beta + c_i + u_{it} > 0]    (2.1)

where the subscript i, which runs from 1 to N, indexes the individuals. For individual i, t = 1, 2, ..., T_i; thus, an unbalanced panel structure is allowed.
The unobserved group heterogeneity c_i captures the within-group correlations, and u_{it} is the usual idiosyncratic error term. The random effects binary/fractional response panel model maintains the following assumptions:

D(y_{it} \mid x_i, z_i, c_i) = D(y_{it} \mid x_{it}, z_i, c_i)    (2.2)

D(y_i \mid x_i, z_i, c_i) = \prod_{t=1}^{T_i} D(y_{it} \mid x_i, z_i, c_i)    (2.3)

c_i \mid x_i, z_i \sim \text{Normal}(0, \sigma_c^2)    (2.4)

where y_i is a vector that contains all y_{it} for individual i, and x_i is similarly defined. Equation (2.2) is the strict exogeneity condition in the panel data model. If we further assume that P(y_{it} = 1 \mid x_{it}, z_i, c_i) = \Phi(x_{it}\alpha + z_i\beta + c_i), where \Phi(\cdot) is the cumulative distribution function of the standard normal random variable, then we have the so-called random effects probit model. A straightforward application of maximum likelihood estimation gives consistent estimators of all parameters as well as the partial effects. When the dependent variable y_{it} has range [0,1] and we are interested in estimating the conditional mean function, we can make the following assumption:

E(y_{it} \mid x_i, z_i, c_i) = E(y_{it} \mid x_{it}, z_i, c_i) = G(x_{it}\alpha + z_i\beta + c_i)    (2.5)

This is the strict exogeneity assumption stated in terms of the conditional mean. If assumption (2.4) holds, and we use the standard normal distribution function for G(\cdot), then with some algebra we can integrate out c_i and get E(y_{it} \mid x_i, z_i) = \Phi(x_{it}\alpha_c + z_i\beta_c), where \alpha_c and \beta_c are the original parameters scaled down by (1 + \sigma_c^2)^{1/2}. These are the parameters that define the average structural function (ASF), which leads to the estimation of the APEs. We can use either the pooled nonlinear least squares method or the pooled quasi-maximum likelihood method to obtain consistent estimators of the scaled coefficients. For hypothesis testing, robust variance estimators are available in both estimation methods.
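The scaling result above can be checked numerically. The sketch below (with made-up values for a single slope α and the heterogeneity standard deviation σ_c) integrates c_i out of the probit response probability by Monte Carlo and compares the answer with the closed form Φ(xα_c), where α_c = α/(1 + σ_c²)^{1/2}:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, sigma_c = 0.8, 1.5          # hypothetical slope and heterogeneity s.d.
x = 0.6                            # evaluate the response probability at one covariate value

# Monte Carlo: integrate c out of P(y = 1 | x, c) = Phi(x*alpha + c), c ~ N(0, sigma_c^2)
c = rng.normal(0.0, sigma_c, size=1_000_000)
p_mc = norm.cdf(x * alpha + c).mean()

# Closed form: the original coefficient scaled down by (1 + sigma_c^2)^{1/2}
alpha_c = alpha / np.sqrt(1.0 + sigma_c ** 2)
p_cf = norm.cdf(x * alpha_c)

print(p_mc, p_cf)                  # the two probabilities agree up to simulation error
```

The same mixing-of-normals device underlies the heteroskedastic probit estimating equations that follow, where the scaling factor is allowed to vary with T_i.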
Correlated random effects binary/fractional response panel models. When we suspect that the covariates are correlated with the heterogeneity, we can use either the Chamberlain (1980) or Mundlak (1978) device to model the conditional distribution of c_i as a function of the covariates. Here I adopt the Mundlak device to conserve parameters. The difficulty in modeling D(c_i \mid x_i, z_i) is that we have an unbalanced panel structure. Wooldridge (2010) shows that under a joint normality assumption on (c_i, x_i, z_i), both the conditional mean and the conditional variance of c_i depend on T_i, and he proposes several flexible parametric approaches to model the group heterogeneity. The key point is that we should at least allow D(c_i \mid x_i, z_i) to depend on T_i. If there is little variation in T_i across i, we can model the unobserved heterogeneity as follows:

c_i = \phi_{1T_i} + \bar{x}_i \phi_{2T_i} + \varepsilon_i, \quad \varepsilon_i \mid x_i, z_i \sim \text{Normal}(0, \sigma_{cT_i}^2)    (2.6)

where \bar{x}_i is the time average of x_{it}. With assumptions (2.2) and (2.6), we can use the pooled heteroskedastic probit method to identify all the parameters. Specifically, we can write the conditional probability as:

P(y_{it} = 1 \mid x_i, z_i) = \Phi\left( \frac{x_{it}\alpha + z_i\beta + \phi_{1T_i} + \bar{x}_i \phi_{2T_i}}{\sqrt{1 + \sigma_{cT_i}^2}} \right)    (2.7)

To implement the estimation, we can first create dummies for all but one value of T_i. We can then interact them with \bar{x}_i and include them in the conditional mean function as in equation (2.7). With these consistent estimators, we can estimate the average structural function (ASF), from which we can estimate the average partial effects. When we have many different group sizes, it might be too costly to include all these group-size dummies. Wooldridge (2010) suggests several flexible parametric functions to model the conditional mean and the conditional variance of the group heterogeneity.
For the correlated random effects fractional response model, we get the same estimating equation if we combine assumption (2.6) with (2.5):

E(y_{it} \mid x_i, z_i) = E[\Phi(x_{it}\alpha + z_i\beta + \phi_{1T_i} + \bar{x}_i \phi_{2T_i} + \varepsilon_i) \mid x_i, z_i] = \Phi\left( \frac{x_{it}\alpha + z_i\beta + \phi_{1T_i} + \bar{x}_i \phi_{2T_i}}{\sqrt{1 + \sigma_{cT_i}^2}} \right)    (2.8)

The estimation is implemented in the same way as for the pooled heteroskedastic binary response model addressed above. For fixed values of (x, z), the average structural function can be expressed as:

ASF(x, z) = E_{\bar{x}_i}\left[ \Phi\left( \frac{x\alpha + z\beta + \phi_{1T_i} + \bar{x}_i \phi_{2T_i}}{\sqrt{1 + \sigma_{cT_i}^2}} \right) \right]    (2.9)

With consistent estimators from the nonlinear least squares estimation or the quasi-maximum likelihood estimation, the ASF can be consistently estimated by

\widehat{ASF}(x, z) = N^{-1} \sum_{i=1}^{N} \Phi\left( \frac{x\hat{\alpha} + z\hat{\beta} + \hat{\phi}_{1T_i} + \bar{x}_i \hat{\phi}_{2T_i}}{\sqrt{1 + \hat{\sigma}_{cT_i}^2}} \right)    (2.10)

The APEs can be obtained by taking partial derivatives for continuous variables or by differencing for discrete variables. The bootstrap can be used to obtain valid inference for the APEs.

Example 2 (count response panel models): Many response variables in economics are integers: for example, years of education, the number of patent applications (Hausman, Hall, and Griliches (1984)), the number of new firms in an industry (Papke (1991)), and so on. Nonlinear models for count response variables have been well developed for panel data. Suppose the dependent variable of interest y_{it} takes nonnegative integer values and we model the conditional mean as follows:

E(y_{it} \mid x_i, z_i, c_i) = E(y_{it} \mid x_{it}, z_i, c_i) = c_i G(x_{it}\alpha + z_i\beta)    (2.11)

This equation assumes strict exogeneity as well as correct specification of the conditional mean. One popular choice for G(\cdot) is the exponential function for an unbounded count variable. Notice that equation (2.11) has a multiplicative form of the group heterogeneity, which is usually assumed in count panel data models.
As a consequence of the multiplicative heterogeneity, the elasticities and semi-elasticities do not depend on the unobserved effects. In the random effects framework, we assume the unobserved effect is mean independent of all covariates. Without loss of generality, we can normalize its mean to one, as shown in equation (2.12). Such a normalization is not restrictive in the sense that it does not affect the estimation results for the slope parameters.

E(c_i \mid x_i, z_i) = 1    (2.12)

Combining equations (2.11) and (2.12), we can integrate out the group heterogeneity and obtain the conditional mean E(y_{it} \mid x_i, z_i). If we make parametric assumptions on the conditional mean function, we can proceed with the pooled nonlinear least squares or the pooled quasi-maximum likelihood estimation. For inference, we should use robust standard errors since the score functions are serially correlated due to the presence of the individual heterogeneity. Other, more efficient estimation methods are available; see Wooldridge (2010) for applications of the multivariate weighted nonlinear least squares method to count response panel models. Fixed-effects estimators are also available, but we would not be able to obtain partial effects.

2.3 Generalized Estimating Equations Method for Nonlinear Panel Models

We have seen that pooled methods, including pooled NLS and pooled QMLE, can be applied to nonlinear panel models to estimate scaled parameters that identify the partial effects. Pooled methods are convenient to implement but are unlikely to be efficient, as they ignore the within-group correlations. In this section, I first briefly introduce the generalized estimating equations (GEE) method, which was introduced by Liang and Zeger (1986). The GEE method tries to enhance efficiency without imposing any stronger conditions. I then show how to extend the GEE estimation method to heteroskedastic binary response panel models.
Another extension of GEE estimation in panel data models is to allow time-varying dispersion parameters, which can lead to further efficiency improvements. In the previous section, we saw that our task reduces to estimating the conditional mean E(y_{it} \mid x_i, z_i). For the random effects estimation, E(y_{it} \mid x_i, z_i) = m(x_i, z_i; \theta), and for the correlated random effects estimation, E(y_{it} \mid x_i, z_i) has T_i included in z_i. As discussed earlier, the GEE method is intended to improve efficiency by approximating the within-group correlations, and it explicitly acknowledges that the approximation is different from the true conditional covariance matrix. As a result, we should always make inference robust to misspecification of the chosen correlation matrix. There is no guarantee that GEE estimators are more efficient than pooled estimators, but the idea is that a rough approximation of the within-group correlations, which are ignored in the pooled estimation, can improve efficiency. To proceed, I first define a diagonal matrix:

V(x_i, z_i; \theta, \sigma_o^2) = \text{diag}\{\text{var}^*(y_{i1} \mid x_i, z_i), \text{var}^*(y_{i2} \mid x_i, z_i), ..., \text{var}^*(y_{iT_i} \mid x_i, z_i)\}    (2.13)

Notice that the conditional variance var^*(y_{it} \mid x_i, z_i) used in (2.13) can be different from the true conditional variance var(y_{it} \mid x_i, z_i). The consistency of the GEE estimator does not require correct specification of the conditional variance function. If we have a binary dependent variable, then var^*(y_{it} \mid x_i, z_i) = var(y_{it} \mid x_i, z_i) provided that the conditional mean is correctly specified:

\text{var}(y_{it} \mid x_i, z_i) = E(y_{it} \mid x_i, z_i)[1 - E(y_{it} \mid x_i, z_i)]    (2.14)

If the dependent variable is fractional, and we adopt the Bernoulli distribution as the underlying distribution, the conditional variance can be written as:

\text{var}^*(y_{it} \mid x_i, z_i) = \sigma_o^2 E(y_{it} \mid x_i, z_i)[1 - E(y_{it} \mid x_i, z_i)]    (2.15)

where \sigma_o^2 is called the dispersion parameter. If we have a count response variable, the GEE approach associated with the Poisson distribution assumption would maintain the following variance assumption:

\text{var}^*(y_{it} \mid x_i, z_i) = \sigma_o^2 E(y_{it} \mid x_i, z_i)    (2.16)

Equations (2.15) and (2.16) are called the generalized linear model (GLM) variance assumption in the statistics literature. For panel data models, one possible extension is to allow the dispersion parameter \sigma_o^2 to vary with time. Since we have a fixed number of time periods, we can obtain precise estimates of the dispersion parameter in each time period. Take the random effects count response model as an example: if we have some preliminary estimators of the conditional mean, G(x_{it}\check{\alpha} + z_i\check{\beta}), the dispersion parameter \sigma_{ot}^2 can be consistently estimated by

\hat{\sigma}_{ot}^2 = \frac{1}{N - K} \sum_{i=1}^{N} \frac{\check{v}_{it}^2}{G(x_{it}\check{\alpha} + z_i\check{\beta})}

where \check{v}_{it} = y_{it} - G(x_{it}\check{\alpha} + z_i\check{\beta}) and K is the number of parameters to be estimated. Simulation results show that allowing different dispersion parameters across time usually leads to more efficient GEE estimators. Again, the consistency of GEE estimators only requires correct specification of the conditional mean; we could be arbitrarily wrong about the second moment assumptions. If we do not assume strict exogeneity, we can use pooled NLS or quasi-MLE methods to estimate the scaled coefficients, the so-called "population average" parameters. Consistency would only require the conditional mean of the marginal distribution to be correctly specified. For binary response models, if we add the strict exogeneity and conditional independence assumptions, we can obtain the full MLE estimator, which is more efficient than the pooled estimators. A middle ground between the pooled method and full maximum likelihood estimation is the generalized estimating equations (GEE) method. For GEE to work, we need to impose the strict exogeneity assumption. The purpose of using GEE is to recover some efficiency when we do not assume conditional independence.
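As an illustration of the time-varying dispersion estimator described above, the sketch below simulates a count response with a hypothetical exponential mean and computes the period-by-period dispersion estimates; under a true Poisson distribution each estimate should be close to one. To keep the sketch short, the residuals are formed at the true mean rather than at a first-step estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 2000, 4, 2
x = rng.normal(size=(N, T))
alpha = 0.5
mu = np.exp(0.1 + alpha * x)       # conditional mean G(.) with made-up parameters
y = rng.poisson(mu)                # under Poisson, var = mu, so sigma2_t = 1 for all t

v = y - mu                         # residuals (true mean used in place of a first-step fit)
# Time-varying dispersion parameters: sigma2_t = (N - K)^{-1} sum_i v_it^2 / G_it
sigma2_t = (v ** 2 / mu).sum(axis=0) / (N - K)
print(sigma2_t)                    # each entry should be close to 1 here
```

With overdispersed data (for example, a gamma-mixed Poisson whose mixing variance changes over t), the same formula would recover a different dispersion in each period, which is exactly the flexibility the extension exploits.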
GEE starts by specifying a set of conditional mean functions. In what follows, I focus on the correlated random effects fractional response panel model:

E(y_i \mid x_i, z_i) = m(x_i, z_i; \theta)    (2.17)

where m(x_i, z_i; \theta) is a T_i \times 1 vector whose t-th element is E(y_{it} \mid x_i, z_i) as specified in equation (2.8). As discussed in section 2.2, we include T_i as an extra argument in z_i in the correlated random effects framework. In contrast to pooled methods, the GEE approach accounts for the neglected within-group correlations by using the so-called "working" variance matrix, denoted W(x_i, z_i; \gamma). The working matrix can be expressed as:

W(x_i, z_i; \gamma) = V(x_i, z_i; \theta, \sigma_o^2)^{1/2} R_i(\rho) V(x_i, z_i; \theta, \sigma_o^2)^{1/2}    (2.18)

where R_i(\rho) is a T_i \times T_i positive definite matrix with r_{ts}^i denoting the element in the t-th row and s-th column; r_{tt}^i = 1 and r_{ts}^i \in (-1, 1) for t \neq s. The discussion of the selection of R_i(\rho) is provided in the next section. Specifically, GEE estimators solve the following minimization problem:

\min_{\theta} \sum_{i=1}^{N} [y_i - m(x_i, z_i; \theta)]^T [W(x_i, z_i; \check{\gamma})]^{-1} [y_i - m(x_i, z_i; \theta)]    (2.19)

where we have replaced \gamma with its consistent estimator \check{\gamma}, which can be obtained from some preliminary estimation, such as the pooled NLS method. It should be mentioned that, instead of using the two-step estimation method, we could treat \gamma as a function of \theta and other nuisance parameters; thus, one-step estimation is also available. The asymptotic distributions of the two-step and one-step estimators are the same as long as we use a \sqrt{N}-consistent estimator of \gamma in the two-step estimation. Here I use the two-step estimation method, which makes the later discussion of the selection of working correlation matrices easier. The GEE method is very similar to the multivariate weighted nonlinear least squares method in the econometrics literature. While MWNLS tries to model var(y_i \mid x_i, z_i) accurately, GEE acknowledges that the weighting matrix is inherently wrong.
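A minimal sketch of the minimization in (2.19), under strong simplifying assumptions: the conditional mean is linear (so the problem has a closed-form solution and the gradient of the mean is just X_i), the working variances are set to one (so W_i reduces to the correlation matrix R_i), and an exchangeable R_i with a made-up ρ is used on simulated equicorrelated data. A robust "sandwich" variance estimate is computed so that inference does not rely on R_i being correct:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, K, rho = 500, 5, 2, 0.3

# Exchangeable working correlation; unit working variances, so W_i = R_i here
W = np.full((T, T), rho) + (1.0 - rho) * np.eye(T)
Winv = np.linalg.inv(W)

# Simulated groups: linear mean x_it' theta plus equicorrelated errors
# (an individual effect plus idiosyncratic noise)
theta = np.array([1.0, -0.5])
X = rng.normal(size=(N, T, K))
u = rng.normal(size=(N, T)) + rng.normal(size=(N, 1))
y = np.einsum('itk,k->it', X, theta) + u

# For a linear mean, (2.19) has the closed form
# theta_hat = (sum_i X_i' W^-1 X_i)^-1 sum_i X_i' W^-1 y_i
A = np.zeros((K, K)); b = np.zeros(K)
for i in range(N):
    A += X[i].T @ Winv @ X[i]
    b += X[i].T @ Winv @ y[i]
theta_hat = np.linalg.solve(A, b)

# Robust sandwich variance: valid even though W is misspecified here
B = np.zeros((K, K))
for i in range(N):
    nu = y[i] - X[i] @ theta_hat
    s = X[i].T @ Winv @ nu
    B += np.outer(s, s)
Ainv = np.linalg.inv(A)
avar = Ainv @ B @ Ainv
print(theta_hat, np.sqrt(np.diag(avar)))
```

For a genuinely nonlinear mean, the closed form is replaced by iterating on the quasi-score equations, but the weighting and the sandwich construction are unchanged.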
It might be instructive to look at the first-order conditions of the above minimization problem:

\sum_{i=1}^{N} \nabla_\theta m(x_i, z_i; \hat{\theta})^T W(x_i, z_i; \check{\gamma})^{-1} [y_i - m(x_i, z_i; \hat{\theta})] = 0    (2.20)

The above equation looks very similar to the first-order condition of quasi-MLE. In fact, GEE estimators can be obtained by solving these "quasi-score" equations. As proved by Liang and Zeger (1986), under some mild conditions, GEE estimators are \sqrt{N}-consistent and converge to the following normal distribution:

\sqrt{N}(\hat{\theta} - \theta) \sim N\left( 0, \left( E[s_i^T W_i^{-1} s_i] \right)^{-1} E[s_i^T W_i^{-1} \nu_i \nu_i^T W_i^{-1} s_i] \left( E[s_i^T W_i^{-1} s_i] \right)^{-1} \right)    (2.21)

where s_i = \nabla_\theta m(x_i, z_i; \theta), \nu_i = y_i - m(x_i, z_i; \theta), and W_i is shorthand for W(x_i, z_i; \gamma). A consistent estimator of the variance-covariance matrix can be obtained as

\widehat{\text{Avar}}(\hat{\theta}) = \left( \sum_{i=1}^{N} \hat{s}_i^T \hat{W}_i^{-1} \hat{s}_i \right)^{-1} \left( \sum_{i=1}^{N} \hat{s}_i^T \hat{W}_i^{-1} \hat{\nu}_i \hat{\nu}_i^T \hat{W}_i^{-1} \hat{s}_i \right) \left( \sum_{i=1}^{N} \hat{s}_i^T \hat{W}_i^{-1} \hat{s}_i \right)^{-1}    (2.22)

In equation (2.22) we replace the expectations by sample averages and the unknown parameters by the corresponding consistent estimators. The efficiency of the GEE estimator obviously depends on how well the chosen working correlation matrix approximates the true correlations. If it happens to be true that W(x_i, z_i; \gamma) = var(y_i \mid x_i, z_i), then GEE estimators are efficient and the asymptotic variance can be further simplified. As mentioned earlier, we can further improve the efficiency of GEE estimators by allowing time-varying dispersion parameters; this would only require minor notational changes to equation (2.18).

2.4 Selection of the Working Correlation Matrix

If we look at the weighting matrix W(x_i, z_i; \gamma) in equation (2.18), once we have specified the functional form of E(y_i \mid x_i, z_i) and the functional form of V(x_i, z_i; \theta, \sigma_o^2), which is associated with the underlying distribution assumption, our task reduces to determining the working correlation matrix R_i(\rho).
In this section, I first introduce various popular choices of working correlation matrices and then discuss their validity in different situations. The working correlation matrix R_i(\rho) is intended to capture the within-group correlations among the Pearson residuals e_{it} = [y_{it} - m(h_{it}; \hat{\beta})] / \sqrt{\widehat{\text{var}}(y_{it} \mid x_i, z_i)}. In practice, we first need to estimate the conditional mean and the pseudo conditional variance to obtain the Pearson residuals. In the pooled estimation method, R_i(\rho) is set to be the identity matrix, and the consistency of the GEE estimator in this case does not require strict exogeneity:

R_i(\rho) = I_{T_i}    (2.23)

Using the identity working correlation matrix together with time-varying dispersion parameters is similar to applying the generalized least squares method to models with heteroskedasticity. The GEE estimator with the identity working correlation matrix is the same as the pooled estimator, which only requires contemporaneous exogeneity for consistency. The identity correlation matrix ignores the within-group correlations. For a nested structure, a popular choice of working correlation matrix is a matrix with an exchangeable structure. The exchangeable matrix has ones along its diagonal and some constant \rho everywhere else:

R_i(\rho) = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}    (2.24)

Essentially, the exchangeable correlation matrix assumes that the within-group correlations are constant for all pairs of group members. This is more realistic in nested models but somewhat restrictive for non-nested models such as panel data models. In the panel data context, the working correlation matrix can be assumed to be generated by some autoregressive process or moving average process of the Pearson residuals. For example, r_{ts}^i = \rho^{|t-s|} for an AR(1) process; for an MA(1) process, r_{ts}^i = \rho if |t - s| = 1 and r_{ts}^i = 0 if |t - s| > 1. It should be mentioned that the MA(1) process is a special case of the one-dependent process, which is often used in biology and statistics. For example, a sequence \{e_{it}\}_{t=1}^T follows a one-dependent process if corr(e_{it}, e_{it+1}) = \rho_t and corr(e_{it}, e_{is}) = 0 for |t - s| > 1. When \rho_t = \rho for t = 1, 2, ..., T-1, the one-dependent process is the same as the MA(1) process. A generalization to the m-dependent process is also available, which would resemble the MA(m) process. The unstructured working correlation matrix, in which each element is estimated by its corresponding sample average, is attractive when the group size is relatively small. In practice, we rarely know the true underlying correlation matrix, and our hope is that the chosen working matrix is able to capture enough information about the within-group correlations. As mentioned earlier, misspecification of R_i(\rho) does not cause inconsistency of the GEE estimator, but we do require that the chosen working matrix converge to some positive definite nonrandom matrix. This requirement poses some challenges for the use of the GEE method. For a concrete example, consider a T \times T correlation matrix with an exchangeable structure indexed by the parameter \rho. The exchangeable matrix is positive definite only if \rho > -1/(T-1). This means that even if we know the true correlation structure and can estimate the correlation parameter precisely, the resulting working correlation matrix may not be positive definite for a certain range of negative correlations. In other words, the working correlation matrix will not converge to a positive definite matrix when \rho < -1/(T-1); consequently, the GEE estimator is no longer consistent. Chaganty and Joe (2006) provide further discussion of the definiteness of working correlation matrices for dependent Bernoulli random variables.
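The positive-definiteness restrictions above are easy to verify numerically. The helper functions below (hypothetical names, written for this sketch) build the exchangeable and AR(1) correlation matrices and check the smallest eigenvalue; the exchangeable matrix fails just below ρ = −1/(T−1), while the AR(1) matrix is positive definite for any ρ in (−1, 1):

```python
import numpy as np

def exchangeable(T, rho):
    """Exchangeable correlation matrix, as in eq. (2.24)."""
    return np.full((T, T), rho) + (1.0 - rho) * np.eye(T)

def ar1(T, rho):
    """AR(1) correlation matrix: r_ts = rho^|t-s|."""
    t = np.arange(T)
    return rho ** np.abs(t[:, None] - t[None, :])

def is_pd(R):
    """Positive definite iff the smallest eigenvalue is strictly positive."""
    return np.linalg.eigvalsh(R).min() > 0

T = 5
# Exchangeable: positive definite only for rho > -1/(T-1) = -0.25 here
print(is_pd(exchangeable(T, -0.25 + 0.01)))   # True
print(is_pd(exchangeable(T, -0.25 - 0.01)))   # False
# AR(1): positive definite for any rho in (-1, 1)
print(is_pd(ar1(T, -0.9)))                    # True
```

The exchangeable eigenvalues are 1 + (T−1)ρ and 1 − ρ, which makes the −1/(T−1) boundary visible directly: the first eigenvalue crosses zero exactly there.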
A potentially important issue ignored by many applied GEE researchers is that, under misspecification of the working correlation matrix, the working correlation matrix may not have any well-defined limit; this was first investigated by Crowder (1995). An example given in his paper is one where the true correlation matrix has an exchangeable form and the chosen working correlation matrix is derived from an AR(1) process of the Pearson residuals. Crowder (1995) shows that there is no solution for the estimator \hat{\rho} for a certain range of the parameter associated with the exchangeable matrix. So far there is no systematic discussion of the convergence issue of the GEE working correlation matrices in the econometrics literature. Below I focus on three commonly used working correlation matrices and analyze the convergence issue in various situations. It should be noted that the estimation method for the working correlation matrix discussed in Crowder (1995) is not standard and can be seen as an inefficient moment estimation. For the following analysis, I use \xi to denote the parameter that indexes the true correlation matrix and \rho to denote the parameter associated with the chosen working matrix. For simplicity, I assume T_i = T for all i.

Case 1.1: The true correlation matrix associated with \{e_{it}\}_{t=1}^T has an exchangeable structure with parameter \xi, and the chosen working correlation matrix is derived from an AR(1) process determined by the parameter \rho. As previously discussed, a necessary condition for the exchangeable correlation matrix to be positive definite is \xi > -1/(T-1). Crowder (1995) suggests that the estimator of \rho can be obtained by solving the equation

\sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \left( \frac{1}{N} \sum_{i=1}^{N} e_{it} e_{is} - \hat{\rho}^{s-t} \right) = 0

As shown by Crowder (1995) (see more details in the appendix), when T is odd and \xi + 1/T < 0, no solution exists for \hat{\rho}; therefore, it does not have any sensible limit. In summary, the working correlation matrix in this case converges only if T is even, or T is odd with \xi + 1/T \geq 0. When the solution exists, the proposed estimator \hat{\rho} is an inefficient moment estimator. Another drawback of the estimation method in Crowder (1995) is that it is difficult to extend the result to unbalanced data structures in which T_i varies with i. Another convenient estimator of \rho can be obtained as \frac{1}{N(T-1)} \sum_{i=1}^{N} \sum_{t=1}^{T-1} e_{it} e_{it+1}, which converges to \xi at rate \sqrt{N}.

Case 1.2: The true correlation matrix associated with \{e_{it}\} has an exchangeable structure with parameter \xi, and the chosen working correlation matrix is derived from the one-dependent process determined by the parameters \{\rho_t\}_{t=1}^{T-1}. In this case, \rho_t is estimated by its sample analogue \frac{1}{N} \sum_{i=1}^{N} e_{it} e_{it+1}, which converges to \xi by the law of large numbers. For the special case where the one-dependent process degenerates to the MA(1) process, the single parameter \rho is estimated by \frac{1}{N(T-1)} \sum_{i=1}^{N} \sum_{t=1}^{T-1} e_{it} e_{it+1}, which still converges to \xi as N \to \infty.

Case 2.1: The Pearson residuals \{e_{it}\}_{t=1}^T follow an AR(1) process associated with the parameter \xi, and the chosen working correlation matrix has an exchangeable structure with parameter \rho. In this case, \rho is estimated by its sample analogue \frac{1}{NT(T-1)/2} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} e_{it} e_{is}, which can be shown to converge to \frac{1}{T(T-1)/2} \cdot \frac{\xi}{1-\xi} \left( T - \frac{1-\xi^T}{1-\xi} \right). This example shows that the parameters associated with the working correlation matrix do not necessarily have a meaningful statistical interpretation. One can verify that \rho > -1/(T-1) is satisfied for every positive integer T and \xi \in (-1, 1).

Case 2.2: The Pearson residuals \{e_{it}\}_{t=1}^T follow an AR(1) process associated with the parameter \xi, and the chosen working correlation matrix is derived from the one-dependent process determined by the parameters \{\rho_t\}_{t=1}^{T-1}. In this case, \rho_t is estimated by its sample analogue \frac{1}{N} \sum_{i=1}^{N} e_{it} e_{it+1}, which converges to \xi by the law of large numbers. Similarly, in the special case where the working
In this case, ρt is estimated by its sample analogue 1 N ∑N correlation matrix is derived from an MA(1) process, ˆρ can be estimated by 1 N(T−1) ∑N i=1 ∑T−1 t=1 eiteit+1. Case 3.1: Pearson residuals {eit}T t=1 follow the one-dependent process associated with param- eters ξ , and the chosen working correlation matrix has an exchangeable structure determined by pa- rameter ρ. We can estimate ρ by its corresponding sample analogue: 1 NT (T−1)/2 ∑N i=1 ∑T−1 t=1 ∑T s=t+1 eiteis, ∑T−1 t=1 ξt T (T−1)/2. The constraint that ρ > − 1 T−1 is automatically sat- 2 . For the special case where ξt = ξ , the converging limit simplifies to which can be shown to converge to isfied if ∑T−1 ξ T /2. t=1 ξt > −T Case 3.2: Pearson residuals {eit}T t=1 follow the one-dependent process associated with param- 44 eters ξ , and the chosen working correlation matrix is derived from AR(1) process with parameter ρ. Following Crowder (1995), the estimator of ρ can be obtained by solving ∑T−1 ˆρs−t) = 0. I show that the solution exists only if ∑T−1 t=1 ξt is sufficiently large (see the appendix for proof). Again, this estimation method would be inefficient and problematic in some cases. We s=t+1( 1 t=1 ∑T N ∑N i=1 eiteis− could instead estimate ρ by simplifies if {eit}T N(T−1) ∑N i=1 ∑T−1 t=1 follows an MA(1) process. 1 t=1 eiteit+1, which converges to ∑T−1 t=1 ξt T−1 . The analysis From the above analysis we see that, depending on the methods used for estimation, the work- ing correlation matrices do not have well-defined converging limits in some cases. For empirical researchers, the exchangeable matrix is a good choice as it does converge to some nonrandom matrix in most cases. We should also make sure that the chosen working correlation matrix is positive semi-definite, and this has to be examined on a case by case basis. 
It should be mentioned that the estimating equations for the parameters of the autoregressive working correlations presented in cases 1.1 and 3.2 are not standard in the econometrics literature, and they are also different from the estimation method proposed by Liang and Zeger (1986). The estimating equation in case 3.2 is just an analogue of the one proposed by Crowder (1995). Other methods are also available to construct the working correlation matrix. An example in the count response model is the MWNLS approach, in which the working correlation matrix converges to a positive semi-definite matrix by construction. We could also use information criteria to select among different working correlation matrices. A good, up-to-date review of the selection of working correlation matrices is provided by Wang (2014).

2.5 Estimation of Higher-Level Nonlinear Models

The previous sections introduced nonlinear panel models and some common estimation methods. In particular, I considered correlated random effects models with unbalanced group sizes and discussed the popular GEE method in detail. If our interest is in population average parameters, the GEE estimator has good efficiency properties relative to the pooled estimator. With simple reasoning, the previous results can be extended to models of higher levels. It is interesting to note that most analyses of higher-level models are based on mixed-effects estimation, which has no robustness properties if any of the distributional assumptions fails. As we will see later, the GEE method is robust to certain misspecifications of the distributional assumptions. In this section, I consider the estimation of the conditional mean function for a three-level nonlinear model that has both nested and non-nested features. I first start with a general framework without specifying any parametric functional form for the conditional mean. Both random effects and correlated random effects models are considered. Extended GEE estimation and inference are provided.
I use the binary/fractional response model and the count response model as concrete examples. Because of the special setup of the count response model, another attractive estimation method is also available.

2.5.1 Extended GEE Estimation

Suppose that we want to study the effects of some factors or policy changes on the passing rates of national tests, and we have standard panel data on schools across many different districts. In practice, many researchers would construct a standard panel data model for estimation and inference; however, as we will show later, ignoring the higher-level cluster structure leads to biased inference. This issue has been addressed by Moulton (1990) and Wooldridge (2003) for linear models, and a similar result holds in nonlinear models. Thus, it is very important to explicitly model the correlation of the data at the higher level. In our example, schools within the same district might be correlated with each other through the district effect, and this is a nested data structure. We also have the time dimension, which is not nested within the school identity or the district identity. Our interest is in estimating the conditional mean function, which relates to the partial effects and the average partial effects:

E(y_{it} \mid h_{it}, c_i, v_{g(i)}) = m^*(h_{it}, c_i, v_{g(i)}; \beta^*)    (2.25)

The notation is slightly different here. Let t = 1, ..., T index time periods, i = 1, ..., N index schools, and g = 1, ..., G index districts. Given a school identity i, we know which district it is in through the function g(i). The explanatory variables h_{it} = (x_{it}, z_i, w_{g(i)}) can include time-varying variables, school-constant variables, and district-constant variables. Of course, time dummy variables can be included in x_{it}. Here I use c_i to denote the school effect and v_{g(i)} to denote the district effect, and \beta^* is the vector of true structural parameters. For simplicity of notation, I assume a balanced panel structure as well as a balanced cluster structure.
In particular, I assume there are Q schools in each district and each school is observed for T periods. Analysis with unbalanced group sizes and varying time periods can be easily accommodated with minor notational changes. For the asymptotic analysis, I assume that the number of groups goes to infinity while the group sizes and time periods are fixed or relatively small. As we have seen in panel models, many estimation methods rely on the strict exogeneity assumption. Here I state the exogeneity assumption as follows:

$$E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = E(y_{it} \mid h_{it}, c_i, v_{g(i)}) \tag{2.26}$$

where $H_{g(i)}$ is a matrix that contains all explanatory variables $h_{it}$ within the same group across all periods. The above equation is twofold: covariates from other periods of the same school do not help to explain the current-period outcome, and covariates of other schools within the same district do not affect the outcome of the chosen school. These statements are all conditional on the school heterogeneity and the district heterogeneity. Strict exogeneity in the time dimension can be restrictive in the sense that it rules out feedback effects and lagged dependent variables. Strict exogeneity across schools within the same district essentially rules out peer effects. Strict exogeneity of the time-varying variables can be examined by adding a subset of next-period time-varying variables; a joint significance test of these added variables is a valid test of the strict exogeneity assumption in the time dimension. Testing for peer effects, on the other hand, is an ongoing research area and usually encounters identification issues. In some special panel data models, we can relax strict exogeneity to sequential exogeneity to identify the conditional mean; see Wooldridge (1997) for multiplicative panel data models. Relaxing the strict exogeneity condition is more difficult in nested models, as there is no meaningful "order" for schools within the same district.
Testing and relaxing the strict exogeneity assumption in nested models could be interesting topics for future research. Depending on the assumptions we make about the correlation between the unobserved effects and the explanatory variables, we have the random-effects framework or the correlated random-effects framework. In some special cases, we could use a fixed-effects estimation approach to identify some structural parameters, but we would not be able to obtain partial effects from the fixed-effects estimation. Since we are working with a general functional form, it is better to hold off on fixed-effects estimation. In a random-effects framework, we generally assume the heterogeneities are independent of the explanatory variables:

$$D(c_i \mid H_{g(i)}) = D(c_i), \qquad D(v_g \mid H_g) = D(v_g) \tag{2.27}$$

Undoubtedly these are very strong assumptions, but we still allow for dependence of $c_i$ on $v_{g(i)}$. This is different from the GLS-type assumptions made in linear models, in which $c_i$ and $v_{g(i)}$ are usually assumed to be independent of each other. With the above assumptions, we can integrate out the two heterogeneities:

$$E(y_{it} \mid H_{g(i)}) = E\big( E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) \mid H_{g(i)} \big) = \iint m^*(h_{it}, c, v; \beta^*)\, dF(c \mid H_{g(i)})\, dF(v \mid H_{g(i)}) \equiv m_1(h_{it}; \beta_o) \tag{2.28}$$

Note that I allow the functional form $m_1(\cdot)$ as well as the parameter $\beta_o$ to be different from the original structural function in equation (2.25). In the probit model, $\beta_o$ would be $\beta^*$ scaled by some constant. A more realistic case is that the unobserved heterogeneities are correlated with the explanatory variables. Since there are two different unobserved effects, we could have three different situations: both are correlated with the covariates, or only one of the two unobserved effects is correlated with the covariates.

$$D(c_i, v_{g(i)} \mid H_{g(i)}) = D(c_i, v_{g(i)} \mid \bar{H}_{g(i)}) \tag{2.29}$$

where $\bar{H}_{g(i)}$ are time averages, and possibly group averages, of some variables in $h_{it}$. Here I consider the first scenario, as the other two are just special cases of the first one.
Again I adopt the Mundlak-Chamberlain device for modeling the heterogeneities. This assumption allows correlation between the heterogeneities and the covariates. As in the random-effects assumption, $c_i$ can depend on $v_{g(i)}$ in arbitrary ways. A simple application of Bayes' rule shows that the above assumption can also be obtained from two separate assumptions: $D(c_i \mid v_{g(i)}, H_{g(i)}) = D(c_i \mid v_{g(i)}, \bar{H}_{g(i)})$ and $D(v_{g(i)} \mid H_{g(i)}) = D(v_{g(i)} \mid \bar{H}_{g(i)})$. By the law of iterated expectations we can integrate out these two heterogeneities:

$$E(y_{it} \mid H_{g(i)}) = E\big( E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) \mid H_{g(i)} \big) = \iint m^*(h_{it}, c, v; \beta^*)\, dF(c_i, v_{g(i)} \mid \bar{H}_{g(i)}) \equiv m_2(h_{it}, \bar{H}_{g(i)}; \beta_o, \phi_o) \tag{2.30}$$

Note that we now have some new parameters $\phi_o$, associated with $\bar{H}_{g(i)}$, to be estimated. The functional form $m_2(\cdot)$ could be different from the original conditional mean function in equation (2.25). If $m_2(\cdot)$ has a linear index form, then we would not be able to identify the original coefficients associated with the time-constant variables. As discussed for panel models, when we have unbalanced structures it is important to allow the conditional distribution of the heterogeneities to be group-size specific. If we do not have much variation in group sizes, we can simply index the parameters to be group-size specific; alternatively, we can explicitly model the distribution as a function of group size when there are many different group sizes. A consistent estimator of the average structural function can be obtained by averaging out $\bar{H}_{g(i)}$ across all schools while fixing $h_{it}$ at some constant vector. Later I use concrete examples to show how to estimate three-level models with an unbalanced data structure and how to obtain partial effects. In both the random-effects and correlated random-effects models, the problem boils down to the estimation of $E(y_{it} \mid H_{g(i)})$ after integrating out the heterogeneities. Without loss of generality, I focus on equation (2.28) only.
The following analysis also applies to correlated random-effects models by including some additional variables in the conditional mean function. A simple estimation method is just pooled nonlinear least squares:

$$\hat\beta_{p1} = \arg\min_{\beta} \sum_{g=1}^{G} \big(y_g - M(H_g;\beta)\big)' \big(y_g - M(H_g;\beta)\big) \tag{2.31}$$

where $M(H_g;\beta)$ is a $QT \times 1$ vector containing the $m(h_{it};\beta)$, sorted by school and arranged according to time, and $y_g$ is defined similarly, so that its first T rows are the observations of some school in district g. Under some regularity assumptions, we can show that

$$\sqrt{G}(\hat\beta_{p1} - \beta_o) \sim \text{Normal}(0, A_1^{-1} B_1 A_1^{-1}) \tag{2.32}$$

where $A_1 = E[H(M;\beta_o)]$ and $B_1 = E[\nabla_\beta M(H_g;\beta_o)' \nu_g \nu_g' \nabla_\beta M(H_g;\beta_o)]$, in which $H(M;\beta_o)$ is the Hessian matrix of the objective based on $M(H_g;\beta)$ evaluated at $\beta_o$, and $\nu_g = y_g - M(H_g;\beta_o)$. A consistent estimator of the variance matrix can be obtained by replacing $\beta_o$ with its consistent estimator $\hat\beta_{p1}$ and replacing expectations with sample averages. An interesting question to ask here is what would happen if we ignore the higher nested structure and treat the data as a standard panel. This issue has been addressed in linear models, and we have the same result in nonlinear models. If we take the data as a panel, we would solve:

$$\hat\beta_{p2} = \arg\min_{\beta} \sum_{i=1}^{N} \big(y_i - m(H_i;\beta)\big)' \big(y_i - m(H_i;\beta)\big) \tag{2.33}$$

where $m(H_i;\beta)$ is the $T \times 1$ vector of conditional mean functions for school i and $y_i' = (y_{i1}, ..., y_{iT})$. Note that the objective functions in equations (2.31) and (2.33) are essentially the same, so both $\hat\beta_{p1}$ and $\hat\beta_{p2}$ are consistent. For nonlinear panel data models, we have the following standard result:

$$\sqrt{N}(\hat\beta_{p2} - \beta_o) \sim \text{Normal}(0, A_2^{-1} B_2 A_2^{-1}) \tag{2.34}$$

where $A_2 = E[H(m;\beta_o)]$, $B_2 = E[\nabla_\beta m(H_i;\beta_o)' \nu_i \nu_i' \nabla_\beta m(H_i;\beta_o)]$, and $\nu_i = y_i - m(H_i;\beta_o)$. With some algebra, one can show that $A_1 = Q A_2$ and $B_1 = Q B_2 + E[\sum_{i \in g} \sum_{j \in g \setminus \{i\}} \nabla_\beta m(H_i;\beta_o)' \nu_i \nu_j' \nabla_\beta m(H_j;\beta_o)]$. From these results we can see that $\text{Avar}(\hat\beta_{p1})$ is not the same as $\text{Avar}(\hat\beta_{p2})$ unless the expectation of all cross products of scores for different schools within the same district is zero. If the second part of $B_1$ is zero, then one can show that $\text{Avar}(\hat\beta_{p1}) = \text{Avar}(\hat\beta_{p2})$, using the facts that $B_1 = Q B_2$ and $N = QG$. This result is not limited to the pooled NLS estimation method; it is general enough for all M-estimators. Suppose that $s_g(\beta_o)$ is the score of the objective function that accounts for the higher-level nested structure and $s_i(\beta_o)$ is the score of the objective function that ignores the higher-level cluster structure. The two estimators have the same asymptotic variance matrix only if $\sum_{i \in g} \sum_{j \in g \setminus \{i\}} E[s_i(\beta_o) s_j(\beta_o)'] = 0$. In general, the pooled estimator that ignores the higher-level nested structure is still consistent, but its asymptotic variances are not valid for inference except in some special cases. Later we will see an example in which this condition is violated whenever a group effect exists at the higher level. Differences between the asymptotic variances of the pooled estimators with and without the cluster structure can be used as an informal test of whether there is a higher-level nested structure. In the given example, if there are large differences between the variance estimators of the pooled panel estimation and the pooled cluster estimation, this could be a sign that we have ignored nested structure at the higher (district) level. For the same reason as previously discussed, the pooled estimation is unlikely to be efficient, as it ignores correlations across time and across schools within the district. The GEE method is attractive for improving efficiency without imposing stronger assumptions. In the three-level model, the working matrix is a little more complicated, as it intends to capture within-group correlations as well as correlation across time.
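The score cross-product condition above can be illustrated numerically. In the sketch below (all sizes, parameter values, and the linear mean are my own illustrative choices, not the chapter's designs), a district effect in the error and district-level correlation in the covariate make school-level clustering understate the sampling variability of the pooled slope estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
G, Q, T = 200, 3, 4                      # districts, schools per district, periods
N = G * Q

xg = rng.normal(size=(G, 1, 1))          # district-level component of x
x = xg + rng.normal(size=(G, Q, T))      # covariate correlated within district
v = rng.normal(size=(G, 1, 1))           # district effect
c = rng.normal(size=(G, Q, 1))           # school effect
y = 1.0 + 2.0 * x + v + c + rng.normal(size=(G, Q, T))

# Pooled OLS of y on (1, x); rows ordered district-major, then school, then time.
X = np.column_stack([np.ones(N * T), x.ravel()])
beta = np.linalg.lstsq(X, y.ravel(), rcond=None)[0]
u = y.ravel() - X @ beta
s = X * u[:, None]                       # score contribution of each (g, i, t)
A_inv = np.linalg.inv(X.T @ X)

def cluster_sandwich(scores, n_clusters):
    """Sandwich variance with scores summed within each of n_clusters blocks."""
    S = scores.reshape(n_clusters, -1, scores.shape[1]).sum(axis=1)
    return A_inv @ (S.T @ S) @ A_inv

se_school = np.sqrt(cluster_sandwich(s, N)[1, 1])    # treats the data as a panel
se_district = np.sqrt(cluster_sandwich(s, G)[1, 1])  # respects the nesting
print(se_school, se_district)
```

With this design the cross-school score products inside a district have positive expectation, so the district-clustered standard error is the larger (and valid) one, which is exactly the informal diagnostic described above.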
I will call it "Extended GEE" (EGEE) estimation when the GEE method is applied to higher-level models. With a minor change of notation, the objective function for EGEE can be written as

$$\min_{\beta} \sum_{g=1}^{G} [y_g - M(H_g;\beta)]' [W(H_g;\check\gamma)]^{-1} [y_g - M(H_g;\beta)] \tag{2.35}$$

Of course, the efficiency of the EGEE estimator depends on how well we approximate the true correlation matrix with our chosen working matrix $W(H_g;\check\gamma)$. As before, we can decompose the working matrix as $V(H_g,\alpha)^{1/2} R(\rho) V(H_g,\alpha)^{1/2}$, where $V(H_g,\alpha) = \text{diag}\{\text{var}^*(y_{i1} \mid H_g), ..., \text{var}^*(y_{iT} \mid H_g), ...\}_{i \in g}$. Again, the GEE method allows the possibility that the constructed conditional variance $\text{var}^*(y_{iT} \mid H_g)$ differs from the true conditional variance $\text{var}(y_{iT} \mid H_g)$. The interesting part is modeling $R(\rho)$. Several commonly used working correlation matrices are available, as discussed in Section 2.4. In our example, level one is time, and we can use an exchangeable, AR, or MA correlation structure to model the correlation across time. Level two is schools, which are nested within districts. For the nested data structure, it makes sense to assume an exchangeable correlation structure. To be explicit about $R(\rho)$, we can write it as

$$R(\rho) = \begin{pmatrix} R_1(\rho_1) & R_2(\rho_2) & \cdots & R_2(\rho_2) \\ R_2(\rho_2) & R_1(\rho_1) & \cdots & R_2(\rho_2) \\ \vdots & \vdots & \ddots & \vdots \\ R_2(\rho_2) & R_2(\rho_2) & \cdots & R_1(\rho_1) \end{pmatrix} \tag{2.36}$$

where $R_1(\rho_1)$ models the correlation across time, i.e. $\text{corr}(y_{it}, y_{is} \mid H_{g(i)})$, and $R_2(\rho_2)$ approximates the within-group correlation, i.e. $\text{corr}(y_{it}, y_{js} \mid H_{g(i)} = H_{g(j)})$. In both GEE and EGEE estimation, we use $R_1(\rho_1)$ to model the correlation across time; the new part of the EGEE estimation is $R_2(\rho_2)$. If we have chosen an exchangeable structure for $R_1(\rho_1)$, then we have good reason to assume $R_2(\rho_2)$ is a matrix of some constant $\rho_2$.
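The block structure in equation (2.36) can be assembled compactly with Kronecker products, since $R(\rho) = I_Q \otimes R_1(\rho_1) + (J_Q - I_Q) \otimes R_2(\rho_2)$ when $R_2(\rho_2)$ is a constant matrix ($J_Q$ is the all-ones matrix). A sketch, with the function name and argument conventions my own:

```python
import numpy as np

def working_corr(T, Q, rho1, rho2, time_structure="exchangeable"):
    """QT x QT working correlation for one district: R1 on diagonal blocks
    (correlation across time within a school) and a constant matrix R2 on
    off-diagonal blocks (correlation across schools in the district)."""
    if time_structure == "exchangeable":
        R1 = np.full((T, T), rho1)
        np.fill_diagonal(R1, 1.0)
    elif time_structure == "ar1":
        idx = np.arange(T)
        R1 = rho1 ** np.abs(idx[:, None] - idx[None, :])
    else:
        raise ValueError(time_structure)
    R2 = np.full((T, T), rho2)
    I_Q, J_Q = np.eye(Q), np.ones((Q, Q))
    # I_Q (x) R1 places R1 on the diagonal blocks; (J_Q - I_Q) (x) R2 fills the rest.
    return np.kron(I_Q, R1) + np.kron(J_Q - I_Q, R2)

R = working_corr(T=3, Q=2, rho1=0.5, rho2=0.2)
print(R)
```

For moderate $\rho_1, \rho_2$ the resulting matrix is positive definite, which can be checked via its eigenvalues before inverting it in (2.35).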
When $R_1(\rho_1)$ is approximated by the correlation matrix generated from an AR process or an MA process, we might still assume $R_2(\rho_2)$ is a matrix of a single constant if the within-group correlation is dominated by $v_g$, which is time invariant. Of course, one might suspect that $\text{corr}(y_{it}, y_{js} \mid H_{g(i)} = H_{g(j)})$ decreases as $|t - s|$ increases. In this scenario we could use a more sophisticated matrix for $R_2(\rho_2)$. One possibility is to approximate $\text{corr}(y_{it}, y_{js} \mid H_{g(i)} = H_{g(j)})$ by $\theta_1 \theta_2^{|t-s|}$, and in this case $\rho_2 = (\theta_1, \theta_2)$. The implementation of EGEE requires some initial consistent estimator of $\rho$. Here I outline the formulas to calculate $\check\rho$ from some initial consistent estimator $\check\beta$ when we choose exchangeable correlation matrices for both $R_1(\rho_1)$ and $R_2(\rho_2)$:

$$\check\rho_1 = \frac{2}{NT(T-1)} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \frac{\check\nu_{it}}{\check\mu_{it}} \frac{\check\nu_{is}}{\check\mu_{is}} \tag{2.37}$$

where $\check\nu_{it} = y_{it} - m(h_{it};\check\beta)$ and $\check\mu_{it}$ is a consistent estimator of $\sqrt{\text{var}^*(y_{it} \mid H_{g(i)})}$, which is usually associated with $m(h_{it};\beta)$ in many nonlinear models. A consistent estimator of $\rho_2$ is

$$\check\rho_2 = \frac{1}{G T^2 Q(Q-1)} \sum_{g=1}^{G} \left( \sum_{i \in g} \sum_{j \in g} \sum_{t=1}^{T} \sum_{s=1}^{T} \frac{\check\nu_{it}}{\check\mu_{it}} \frac{\check\nu_{js}}{\check\mu_{js}} - \sum_{i \in g} \sum_{t=1}^{T} \sum_{s=1}^{T} \frac{\check\nu_{it}}{\check\mu_{it}} \frac{\check\nu_{is}}{\check\mu_{is}} \right) \tag{2.38}$$

A robust variance matrix estimator of $\hat\beta_{EGEE}$ can be obtained from equation (2.22) with minor notational changes. When we have correctly specified the working matrix, the EGEE estimator is efficient. Even if the working matrix is misspecified, the EGEE estimator is shown to be more efficient than the pooled estimator in various simulation studies.

2.5.2 Some Examples

This subsection presents some examples of nonlinear panel models with a higher hierarchical structure. Specifically, I consider the binary/fractional response model and the count response model. Other estimation methods are available for the count response model because of the multiplicative functional form of its conditional mean.
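The estimators in equations (2.37) and (2.38) are straightforward to vectorize: both are averages of cross products of standardized residuals, and the identity $\sum_{t \neq s} r_t r_s = (\sum_t r_t)^2 - \sum_t r_t^2$ avoids explicit double loops. A sketch under the balanced structure (the function name and the (G, Q, T) array layout are my own conventions):

```python
import numpy as np

def rho_checks(resid):
    """Moment estimators (2.37)-(2.38) for the exchangeable parameters, from
    standardized residuals r[g, i, t] = nu_check/mu_check of shape (G, Q, T)."""
    G, Q, T = resid.shape
    r2 = resid.reshape(G * Q, T)
    # rho1: average over distinct-period pairs within a school; note that
    # 2 * sum_{t < s} r_t r_s = (sum_t r_t)^2 - sum_t r_t^2.
    rho1 = (r2.sum(axis=1) ** 2 - (r2 ** 2).sum(axis=1)).sum() / (G * Q * T * (T - 1))
    # rho2: average over pairs of different schools in the same district.
    S = resid.sum(axis=(1, 2))                  # district totals
    own = (resid.sum(axis=2) ** 2).sum(axis=1)  # same-school cross products
    rho2 = (S ** 2 - own).sum() / (G * Q * (Q - 1) * T ** 2)
    return rho1, rho2

print(rho_checks(np.ones((5, 3, 4))))  # perfectly correlated residuals -> (1.0, 1.0)
```

Feeding in residuals with no cross-sectional dependence drives both estimates toward zero, which is a convenient sanity check before plugging $\check\rho$ into the working matrix.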
Some results for the panel count response model can also be extended to higher-level models.

Three-Level Binary/Fractional Response Models

When our interest is only in the conditional mean function, we can model the conditional mean function of the binary or fractional response variable as $E(y_{it} \mid h_{it}, c_i, v_{g(i)}) = \Phi(h_{it}\beta^* + c_i + v_{g(i)})$. Depending on the assumption we make about the correlation between $(c_i, v_{g(i)})$ and $H_{g(i)}$, we could have $E(y_{it} \mid H_{g(i)}) = \Phi(h_{it}\beta_o)$ in the random-effects framework, or a more complicated functional form for correlated random effects after integrating out $(c_i, v_{g(i)})$ based on some normal distributional assumptions. It might be instructive to discuss in detail the estimation of three-level binary/fractional response models within the correlated random-effects framework. When we allow an unbalanced data structure, the number of schools varies across districts and the number of observation periods differs across schools. For the same reason as in unbalanced panel data models, we should, at least, allow the parameters and variances to depend on the group size and the number of time periods. In the correlated random-effects context, we could model the heterogeneity as follows:

$$c_i + v_{g(i)} = \bar{H}_{g(i)} \phi_{T_i Q_g} + \varepsilon_i + \varepsilon_g, \qquad \varepsilon_i + \varepsilon_g \mid H_{g(i)} \sim \text{Normal}(0, \sigma^2_{T_i Q_g}) \tag{2.39}$$

where $T_i$ denotes the number of time periods for school i and $Q_g$ denotes the number of schools in district g. If there is not much variation in $T_i$ and $Q_g$, we could first generate dummy variables for the different values of $\{T_i Q_{g(i)}\}_{i=1}^{N}$. Substituting equation (2.39) into the conditional mean function, we obtain a heteroskedastic binary/fractional response model:

$$E(y_{it} \mid H_{g(i)}) = \Phi\!\left( \frac{h_{it}\beta^* + \bar{H}_{g(i)}\phi_{T_i Q_g}}{\sqrt{1 + \sigma^2_{T_i Q_g}}} \right) \tag{2.40}$$

There are slight differences between the conditional variances of the binary response and fractional response models. For the binary response model, $\text{var}(y_{it} \mid H_{g(i)}) = \Phi(h_{it}\beta_o)(1 - \Phi(h_{it}\beta_o))$.
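The attenuation in equation (2.40) rests on the normal-mixture identity $E_\varepsilon[\Phi(a + \varepsilon)] = \Phi(a/\sqrt{1+\sigma^2})$ for $\varepsilon \sim \text{Normal}(0, \sigma^2)$. A quick Monte Carlo check of that identity (the index value and variance below are illustrative, not tied to any design in this chapter):

```python
import math
import numpy as np

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

rng = np.random.default_rng(0)
a, sigma2 = 0.7, 2.0                        # illustrative index value and variance

# Average Phi(a + eps) over draws of the normal heterogeneity...
mc = Phi(a + rng.normal(0.0, math.sqrt(sigma2), size=400_000)).mean()
# ...and compare with the attenuated index, as in equation (2.40).
closed_form = Phi(a / math.sqrt(1.0 + sigma2))
print(mc, closed_form)                      # agree up to Monte Carlo error
```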
For the fractional response model, we could allow more flexibility in the conditional variance, $\text{var}(y_{it} \mid H_{g(i)}) = \sigma^2_{ot} \Phi(h_{it}\beta_o)(1 - \Phi(h_{it}\beta_o))$, which is a modified Bernoulli GLM variance assumption. For the fractional response model, $\sigma^2_{ot}$ can be estimated by $\frac{1}{N}\sum_{i=1}^{N} \frac{\check\nu_{it}^2}{\Phi(h_{it}\check\beta)(1 - \Phi(h_{it}\check\beta))}$, where $\check\nu_{it} = y_{it} - \Phi(h_{it}\check\beta)$. In the mixed-effects literature, the binary response model with random intercepts at the school level and the district level makes the following assumption:

$$y_{it} = 1\{h_{it}\beta + c_i + v_{g(i)} + u_{it} > 0\} \tag{2.41}$$

Usually the heterogeneities and the idiosyncratic errors are assumed to be normally distributed. As we have seen, the GEE method also requires distributional assumptions on the heterogeneities, but the consistency of GEE estimators does not require the correct specification of $\text{var}(u_g)$. Mixed-effects models, on the other hand, are usually estimated by maximum likelihood, which gives us inconsistent estimators if we misspecify the variance-covariance matrix of the idiosyncratic errors. Clearly, there is a trade-off between efficiency and robustness. The simulation study in the next section compares GEE estimators and mixed-effects estimators.

Three-Level Count Response Models

For the count response model, or more generally for models with a non-negative dependent variable, we usually specify the conditional mean function as $E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = \exp(h_{it}\beta_o + c_i + v_{g(i)})$. Notice that I have implicitly imposed the strict exogeneity assumption. Other functional forms are also available, but the exponential function is by far one of the most popular choices in empirical research. Without loss of generality, we might just write $E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = c_i v_{g(i)} \exp(h_{it}\beta_o)$, where $(c_i, v_{g(i)})$ is strictly positive. The random-effects assumption for this multiplicative functional form can be stated in terms of conditional means.
$$E(c_i \mid H_{g(i)}) = E(v_g \mid H_g) = 1, \qquad c_i \perp\!\!\!\perp v_{g(i)} \mid H_g \tag{2.42}$$

We could allow different means for the school effects and the district effects, but this does not matter in the context of fixed group sizes and fixed time periods. Without such a normalization, we would get an estimator of a scaled $\beta_o$. With the above assumption, we can integrate out $c_i$ and $v_{g(i)}$ and get the estimating equation $E(y_{it} \mid H_{g(i)}) = \exp(h_{it}\beta_o)$. Notice that the above assumption rules out dependence between $c_i$ and $v_{g(i)}$, which could be restrictive in some cases. We could instead assume $E(c_i v_{g(i)} \mid H_g) = 1$, which allows the heterogeneities to be correlated. Other than pooled NLS estimation, we could also use pooled Poisson quasi-MLE estimation, since the Poisson distribution belongs to the so-called linear exponential family. As shown by Gourieroux, Monfort, and Trognon (1984), the consistency of the Poisson quasi-MLE estimator only requires the conditional mean to be correctly specified. We should use robust variance estimators to allow for an arbitrary conditional variance and the serial correlation induced by $c_i$ and $v_{g(i)}$. For the extended GEE method, we could assume $\text{var}^*(y_{it} \mid H_{g(i)}) = \sigma^2_{ot} \exp(h_{it}\beta_o)$. With these specific conditional mean and conditional variance functions, we can apply the extended GEE method to improve efficiency over the pooled estimators. Another version of the Poisson random-effects assumptions is also available. Hausman, Hall, and Griliches (1984) (HHG) provide a set of assumptions for count response panel models. Fixed-effects estimators are also available under the HHG assumptions, and Wooldridge (1999) further demonstrates that the fixed-effects Poisson estimator is consistent under fairly weak conditions. The fixed-effects estimator will not be the focus of this part because it does not deliver partial effects. Here I extend the HHG framework from panel data to three-level models. The Poisson random-effects assumptions in our framework are as follows.
Poisson random-effects assumptions:

$$y_{it} \mid H_{g(i)}, c_i, v_{g(i)} \sim \text{Poisson}\big(c_i v_{g(i)} \exp(h_{it}\beta_o)\big) \tag{2.43}$$
$$y_{it} \perp\!\!\!\perp y_{is} \mid H_{g(i)}, c_i, v_{g(i)}, \qquad D(c_i \mid H_{g(i)}) = D(c_i) \tag{2.44}$$
$$y_{it} \perp\!\!\!\perp y_{js} \mid H_{g(i)} = H_{g(j)}, c_i, c_j, v_{g(i)}, \qquad D(v_{g(i)} \mid H_{g(i)}) = D(v_{g(i)}) \tag{2.45}$$

With some parametric assumptions on the distributions of $c_i$ and $v_{g(i)}$, full conditional MLE is available and is efficient if all assumptions hold. The drawback of full MLE is that the estimator is not robust to the failure of any of the above assumptions. An alternative approach is to implement multivariate weighted nonlinear least squares estimation. The MWNLS estimator is robust to the failure of the above Poisson random-effects assumptions: it is consistent under the correct specification of the conditional mean along with the strict exogeneity assumption and equation (2.42). The MWNLS estimator could be more efficient than the extended GEE estimator if some of the Poisson random-effects assumptions are true, because the weighting matrix used in MWNLS then provides a better approximation of $\text{var}(y_g \mid H_g)$ than EGEE. This is a special case in which we can explicitly quantify the correlations across time and across schools because of the multiplicative functional form. Define $\nu_{it} = y_{it} - E(y_{it} \mid H_{g(i)})$ and $e_{it} = y_{it} - E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)})$. If we assume $E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = c_i v_{g(i)} E(y_{it} \mid H_{g(i)})$, as we did in the count response model, then $\nu_{it} = e_{it} + E(y_{it} \mid H_{g(i)})(c_i v_{g(i)} - 1)$. Under the Poisson random-effects assumptions, with some algebra, we have the following results:

$$E(\nu_{it}^2 \mid H_{g(i)}) = \alpha_1 m(h_{it};\beta_o) + \alpha_2 m(h_{it};\beta_o)^2 \tag{2.46}$$
$$E(\nu_{it}\nu_{is} \mid H_{g(i)}) = \alpha_2 m(h_{it};\beta_o) m(h_{is};\beta_o), \quad t \neq s \tag{2.47}$$
$$E(\nu_{it}\nu_{js} \mid H_{g(i)} = H_{g(j)}) = \alpha_3 m(h_{it};\beta_o) m(h_{js};\beta_o), \quad i \neq j \tag{2.48}$$

where $\alpha_1 = E[c_i v_{g(i)}]$, $\alpha_2 = E[(c_i v_{g(i)} - 1)^2]$, and $\alpha_3 = E[(c_i v_{g(i)} - 1)(c_j v_{g(j)} - 1)]$ for two different schools i and j in the same district.
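Given values for the $\alpha$'s and the fitted means, the implied working variance $\text{var}^*(y_g \mid H_g)$ for one district can be assembled cell by cell from (2.46)-(2.48). A sketch (the function name, array layout, and numerical values are my own):

```python
import numpy as np

def var_star(m, a1, a2, a3):
    """Assemble var*(y_g | H_g) from moments (2.46)-(2.48).
    m is the Q x T array of conditional means m(h_it; beta) for one district;
    the result is the QT x QT matrix in school-major, time-minor order."""
    Q, T = m.shape
    mu = m.ravel()
    school = np.repeat(np.arange(Q), T)
    same_school = school[:, None] == school[None, :]
    # (2.47) for same-school off-diagonals, (2.48) for cross-school cells...
    V = np.where(same_school, a2, a3) * np.outer(mu, mu)
    # ...and (2.46) overwrites the diagonal with the own variances.
    V[np.diag_indices_from(V)] = a1 * mu + a2 * mu ** 2
    return V

V = var_star(np.array([[1.0, 2.0], [3.0, 4.0]]), a1=1.0, a2=0.5, a3=0.2)
```

Because the off-diagonal blocks scale with the conditional means, the implied correlation matrix varies with the covariates, which is the point made in the text: a constant working correlation cannot reproduce this structure.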
Obviously, when the Poisson random-effects assumptions are true, a constant working correlation matrix would be inappropriate, because the conditional correlation matrix depends on the covariates. With the above results, we can specify $\text{var}^*(y_g \mid H_g)$ under the Poisson random-effects assumptions. Using $\text{var}^*(y_g \mid H_g)$ to replace the working matrix in the extended GEE estimation gives us the MWNLS estimator, which could be more efficient than the EGEE estimator. Notice that $\text{var}^*(y_g \mid H_g)$ could differ from the true conditional variance $\text{var}(y_g \mid H_g)$ if any Poisson distributional assumption fails. As usual, we need some preliminary estimators to implement the MWNLS estimation. In particular, we can estimate all the unknown $\alpha$'s based on the moment conditions (2.46)-(2.48):

$$\hat\alpha_2 = \frac{1}{NT(T-1)} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s \neq t} \frac{\check\nu_{it}\check\nu_{is}}{\check{m}_{it}\check{m}_{is}} \tag{2.49}$$
$$\hat\alpha_1 = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \frac{\check\nu_{it}^2}{\check{m}_{it}} - \hat\alpha_2 \check{m}_{it} \right) \tag{2.50}$$
$$\hat\alpha_3 = \frac{1}{GQ(Q-1)T^2} \sum_{g=1}^{G} \sum_{i \in g} \sum_{j \in g \setminus \{i\}} \sum_{t=1}^{T} \sum_{s=1}^{T} \frac{\check\nu_{it}\check\nu_{js}}{\check{m}_{it}\check{m}_{js}} \tag{2.51}$$

For inference about MWNLS estimators, we should use robust sandwich variance matrix estimators, as in equation (2.22), to allow for failure of the Poisson random-effects assumptions. Fixed-effects estimators are also available, but we would not be able to obtain partial effects from these estimators in the count response model, because we have little information about the correlation between the heterogeneities and the explanatory variables. A middle ground between random effects and fixed effects is the correlated random-effects framework, which allows us to explicitly model the correlation between the unobserved effects and the explanatory variables. We still maintain $E(y_{it} \mid H_{g(i)}, c_i, v_{g(i)}) = c_i v_{g(i)} \exp(h_{it}\beta_o)$. In addition, I assume $E(c_i v_{g(i)} \mid H_{g(i)}) = \exp(\bar{H}_{g(i)}\phi_{T_i Q_g})$. The law of iterated expectations then gives $E(y_{it} \mid H_{g(i)}) = \exp(\bar{H}_{g(i)}\phi_{T_i Q_g} + h_{it}\beta_o)$. With this moment condition, we can use the EGEE or MWNLS method to get consistent estimators of $\phi_o$ and $\beta_o$.
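The moment estimators in equations (2.49)-(2.51) can be sketched in NumPy for the balanced case; the simulated gamma heterogeneity below is my own illustrative design, chosen so the population values of $(\alpha_1, \alpha_2, \alpha_3)$ are known:

```python
import numpy as np

def alpha_hats(y, m, Q):
    """Moment estimators in the spirit of (2.49)-(2.51); y and m are (G*Q) x T
    arrays of outcomes and fitted means, rows ordered school-within-district.
    A balanced structure is assumed."""
    N, T = y.shape
    G = N // Q
    nu = y - m
    r = nu / m
    # alpha_2 from distinct-period cross products within a school, (2.49)
    a2 = (r.sum(axis=1) ** 2 - (r ** 2).sum(axis=1)).sum() / (N * T * (T - 1))
    # alpha_1 from own variances net of the alpha_2 term, (2.50)
    a1 = (nu ** 2 / m - a2 * m).mean()
    # alpha_3 from cross-school products within a district, (2.51)
    rg = r.reshape(G, Q, T)
    S = rg.sum(axis=(1, 2))
    own = (rg.sum(axis=2) ** 2).sum(axis=1)
    a3 = (S ** 2 - own).sum() / (G * Q * (Q - 1) * T ** 2)
    return a1, a2, a3

# Check on simulated Poisson data with gamma heterogeneity (made-up values):
rng = np.random.default_rng(1)
G, Q, T = 4000, 2, 3
c = rng.gamma(2.0, 0.5, size=(G * Q, 1))                # E[c] = 1, Var(c) = 0.5
v = np.repeat(rng.gamma(2.0, 0.5, size=G), Q)[:, None]  # district effect
m = np.exp(rng.normal(0.0, 0.3, size=(G * Q, T)))
y = rng.poisson(c * v * m).astype(float)
print(alpha_hats(y, m, Q))  # close to (1.0, 1.25, 0.5) in this design
```

Here $\alpha_1 = E[cv] = 1$, $\alpha_2 = E[c^2]E[v^2] - 1 = 1.25$, and $\alpha_3 = E[v^2] - 1 = 0.5$, so the printed estimates give a direct check on the formulas.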
The average structural function can be obtained by fixing $h_{it} = h$ and integrating out the heterogeneities:

$$\text{ASF}(h) = E_{\bar{H}_{g(i)}}\big[\exp(\bar{H}_{g(i)}\phi_{T_i Q_g} + h\beta_o)\big] \tag{2.52}$$

A consistent estimator of $\text{ASF}(h)$ is $\frac{1}{N}\sum_{i=1}^{N} \exp(\bar{H}_{g(i)}\hat\phi_{T_i Q_g} + h\hat\beta)$. Average partial effects can then be estimated either by differentiating with respect to continuous variables or by differencing with respect to discrete changes. Bootstrapping provides valid inference for the average partial effects.

2.6 Simulation

This section presents several simulation results comparing the efficiency of GEE estimators and other competing estimators under various data generating processes. It is also of practical interest to compare the efficiency of GEE estimators that use different working correlation matrices. Specifically, this section compares the Monte Carlo standard deviations of four groups of estimators: the traditional GEE estimator and the GEE estimator with time-varying dispersion parameters; the GEE estimator and the pooled estimator in two-level nonlinear models; the EGEE estimator and the pooled estimator in three-level nonlinear models; and the EGEE estimator and the mixed-effects estimator in three-level nonlinear models. In most nonlinear models, it is difficult to analytically derive the population correlations of the structural errors even if we have full knowledge of the data generating process. Take the random-effects probit model as an example. If $E(y_{it} \mid x_i, z_i) = \Phi(x_{it}\alpha_c + z_i\beta_c)$ and the structural error is $y_{it} - \Phi(x_{it}\alpha_c + z_i\beta_c)$, then the within-group correlation is

$$\rho_{ts} = \text{Corr}\big(y_{it} - \Phi(x_{it}\alpha_c + z_i\beta_c),\; y_{is} - \Phi(x_{is}\alpha_c + z_i\beta_c) \mid x_i, z_i\big)$$

for which it is difficult to derive a closed-form expression. To show the efficiency gain from allowing time-varying dispersion parameters, it is helpful to look at a simple linear panel data model in which the data generating process immediately determines the within-group correlations.
Comparison of the traditional GEE estimator and the GEE estimator with time-varying dispersion parameters:

$$y_{it} = \beta_0 + z_i\beta_1 + x_{it}\beta_2 + c_i + u_{it}$$

where $\beta_0 = 1$, $\beta_1 = 2$, and $\beta_2 = 3$ are the true parameters. I assume that $z_i \sim \text{Gamma}(2,3)$, $c_i \sim \text{Uniform}(-2,2)$, and $x_{it} = \sqrt{z_i} + 0.5 w_{it}$, where $w_{it} \sim \text{Normal}(0,9)$. I also assume a balanced panel structure with T = 3 for each i. Next, I specify several different DGPs for the error term $u_{it}$.

(DGP1): $\{u_{it}\}_{t=1}^{T}$ are independent across time and have a different variance in each period; specifically, $u_{it} \sim \text{Normal}(0, t^2)$. It is easy to verify that the true correlation of the composite errors $c_i + u_{it}$ does not follow an exchangeable correlation structure.

(DGP2): $\{u_{it}\}_{t=1}^{T}$ follows an AR(1) process with parameter $\rho$. In particular, $u_{it} = \rho u_{it-1} + \varepsilon_{it}$ with $\varepsilon_{it} \sim \text{Normal}(0,1)$, and the initial value $u_{i0}$ is drawn from $\text{Normal}(0,4)$. The true correlation of the composite errors $c_i + u_{it}$ does not follow from the standard AR(1) process because of the presence of $c_i$.

(DGP3): $\{u_{it}\}_{t=1}^{T}$ follows an MA(1) process with parameter $\theta$. In particular, $u_{it} = \theta\varepsilon_{it} + \varepsilon_{it-1}$ with $\varepsilon_{it} \sim \text{Normal}(0,4)$. Again, because of the presence of $c_i$, the composite error does not follow the standard MA(1) process.

Table 2.1 presents the simulation results for various estimators of $\beta_2$ under the different data generating processes. In particular, we consider the pooled OLS estimator, GEE estimators with three different working correlation matrices, and GEE estimators with time-varying dispersion parameters. In each replication, we obtain the robust standard errors of all these estimators, because we rarely know the true correlation of the structural errors in empirical research. It is now conventional to make inference robust to the failure of second-moment assumptions, and I follow this practice here. To carry out the efficiency comparison, I calculate the mean value of the ratio of the Monte Carlo standard deviations.
Each entry in the following table is the ratio of the Monte Carlo standard deviation of the corresponding row estimator to the Monte Carlo standard deviation of the pooled estimator. The sample size is 500 and the number of replications is 1000. In the following table, I use ∗ to indicate estimators that use time-varying dispersion parameters. From these results, we can easily see the efficiency improvements of GEE estimators over pooled estimators, which was studied in the original paper by Liang and Zeger (1986). The new result here is that we can further improve the efficiency of GEE estimators by allowing a different dispersion parameter for each period. This finding is useful for any non-nested data structure, such as panel data, in which we can get precise estimates of these dispersion parameters. The GEE estimator with the exchangeable working correlation matrix performs well across the data generating processes. Next, I compare the efficiency of GEE estimators and pooled estimators in nonlinear panel models.

Comparison of the pooled estimator and the GEE estimator in nonlinear panel models:

(DGP4): Random-effects binary response model: $y_{it} = 1\{\alpha_0 + z_i\alpha_1 + x_{it}\alpha_2 + c_i + u_{it} \geq 0\}$, where $\alpha_0 = 1$, $\alpha_1 = 2$, and $\alpha_2 = -3$ are the true parameters. I assume that $u_{it}$ follows the standard normal distribution, $z_i \sim \text{Gamma}(2,3)$, $c_i \sim \text{Normal}(0,\sigma^2_c)$, and $x_{it} = \sqrt{z_i} + 0.5 w_{it}$, where $w_{it} \sim \text{Normal}(0,9)$. I also assume the balanced data structure $T_i = 3$ for each i. As mentioned before, it is difficult to derive the within-group correlation in this model. The intuition, however, is that as the "importance" of $c_i$ increases, the within-group correlation increases. Thus, by varying the value of $\sigma^2_c$, we can compare the Monte Carlo standard deviations of the different estimators under different within-group correlations.
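As a concrete illustration, DGP4 can be generated as follows. This is a sketch: the RNG seed is arbitrary, and I read Gamma(2,3) as shape 2, scale 3, which is an assumption about the intended parameterization:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 3
sigma2_c = 1.0                                  # vary to change within-group correlation

z = rng.gamma(2.0, 3.0, size=N)                 # z_i ~ Gamma(2, 3), read as (shape, scale)
c = rng.normal(0.0, np.sqrt(sigma2_c), size=N)  # c_i ~ Normal(0, sigma2_c)
w = rng.normal(0.0, 3.0, size=(N, T))           # w_it ~ Normal(0, 9)
x = np.sqrt(z)[:, None] + 0.5 * w               # x_it = sqrt(z_i) + 0.5 w_it
u = rng.normal(size=(N, T))                     # u_it ~ Normal(0, 1)
index = 1.0 + 2.0 * z[:, None] - 3.0 * x + c[:, None] + u
y = (index >= 0).astype(float)                  # binary response, N x T
print(y.mean())
```

Re-running the draw with larger `sigma2_c` strengthens the within-group correlation of the composite error, which is how the comparison across columns of the table is generated.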
(DGP5): Random-effects fractional response model: suppose the true conditional mean function is $E(y_{it} \mid z_i, x_i, c_i) = \Phi(\alpha_0 + \alpha_1 z_i + \alpha_2 x_{it} + c_i)$, where $\alpha_1 = 0.1$ and $\alpha_2 = -0.3$ are the true parameters. $z_i$ and $x_{it}$ are generated from the same distributions as in DGP4. The group heterogeneity $c_i$ is generated from $\text{Normal}(0,\sigma^2_c)$. Generating the fractional random variable requires an intermediate step: generate $y^*_{it}$ from $\text{Binomial}(100, \Phi(\alpha_1 z_i + \alpha_2 x_{it} + c_i))$; then $y_{it} = y^*_{it}/100$ is a fractional variable with the desired conditional mean. Again, by varying the value of $\sigma^2_c$, we can change the within-group correlation.

In the above two nonlinear models we see results very similar to those for linear models. The GEE estimator performs better than the pooled estimator, and as the within-group correlation increases, we gain greater efficiency by applying the GEE method. In all cases, the GEE estimator with the exchangeable working correlation matrix performs better than the other GEE estimators with different working correlation structures. This lends some support to the use of the exchangeable working correlation matrix in applied work.

Comparison of the pooled estimator and the GEE estimator in three-level nonlinear models:

(DGP6): Three-level random-effects binary response model: $y_{it} = 1\{\alpha_0 + w_g\alpha_1 + z_i\alpha_2 + x_{it}\alpha_3 + v_g + c_i + u_{it} \geq 0\}$, where $\alpha_0 = 1$, $\alpha_1 = 2$, $\alpha_2 = 0.5$, and $\alpha_3 = -1$ are the true parameters. $w_g \sim \text{Uniform}(-1,1)$ and $x_{it} - w_g^2 \sim \text{Uniform}(-2,1)$. All other variables are independent of each other and normally distributed; specifically, $z_i \sim \text{Normal}(0,0.04)$, and $u_{it}$ has been normalized to follow the standard normal distribution. We can change the time-series correlation by varying $\sigma^2_c$ and the within-group correlation by varying $\sigma^2_v$. Again, I take the pooled estimator of $\alpha_3$ as the benchmark and calculate the ratio of the Monte Carlo standard deviation of each GEE estimator to the Monte Carlo standard deviation of the pooled estimator.
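The three-level design in DGP6 can be generated with nested draws. A sketch with the stated parameter values; the seed, the (G, Q, T) array layout, and reading the uniform conditions as additive offsets are my own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
G, Q, T = 500, 3, 4
sigma2_v, sigma2_c = 2.0, 1.0                       # district / school variances

wg = rng.uniform(-1.0, 1.0, size=G)                 # w_g ~ Uniform(-1, 1)
z = rng.normal(0.0, 0.2, size=(G, Q))               # z_i ~ Normal(0, 0.04)
x = wg[:, None, None] ** 2 + rng.uniform(-2.0, 1.0, size=(G, Q, T))  # x_it - w_g^2 ~ U(-2, 1)
v = rng.normal(0.0, np.sqrt(sigma2_v), size=(G, 1, 1))  # district effect
c = rng.normal(0.0, np.sqrt(sigma2_c), size=(G, Q, 1))  # school effect
u = rng.normal(size=(G, Q, T))                          # standardized idiosyncratic error
index = (1.0 + 2.0 * wg[:, None, None] + 0.5 * z[:, :, None]
         - 1.0 * x + v + c + u)
y = (index >= 0).astype(float)                          # G x Q x T binary outcomes
print(y.mean())
```

The broadcasting over the (G, Q, T) axes mirrors the nesting: $v$ is shared by all observations in a district, $c$ by all periods of a school, and $u$ varies freely.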
I have also added the panel GEE estimator to the comparison. The labels are a little different in the following table. For example, GEE(AR(1), EX) denotes the GEE estimator whose working matrix has AR(1)-type correlation in its diagonal blocks, namely $R_1(\rho)$, and an exchangeable structure in its off-diagonal blocks, namely $R_2(\rho)$. We use O to denote the null matrix; thus, GEE(EX, O) represents the working matrix with zero correlation among schools within the same district. I assume a balanced data structure; there are 500 groups with 3 units in every group, and each unit is observed for 4 periods. Two things stand out from the simulation results. (1) The extended GEE estimator with the correctly specified working matrix is the most efficient among all estimators that recognize the higher-level correlation. From the DGP, it is not difficult to see that both the within-group correlation and the time-series correlation are constant; therefore, it is no surprise that GEE(EX, EX) delivers efficient results. In all three cases, we see significant efficiency improvements over the pooled estimator. An interesting finding is that the Monte Carlo standard deviation of the extended GEE estimator that ignores within-group correlations, i.e. GEE(EX, O), is very close to that of the efficient estimator. This is expected, because the within-group correlations are not as strong as the time-series correlation in my DGP. (2) If we ignore the correlations within the district and treat the data as a standard panel, then we are being too optimistic about the estimates. Comparing the extended GEE estimator with the panel GEE estimator, the panel GEE estimator has a smaller Monte Carlo standard deviation in all three cases. As we increase the time-series correlation and decrease the within-group correlation, the distortion becomes less evident.
When there is no group correlation or district effect in our example, the asymptotic variances of the three-level estimators and the panel estimators should be the same.

Comparison of the EGEE estimator and the mixed-effects estimator in three-level nonlinear models

(DGP7): The three-level binary response model is specified the same as in DGP6 except that now we allow the idiosyncratic errors to be correlated across time. When all distributional assumptions hold, the mixed-effects estimators, which are based on maximum likelihood estimation, would be efficient. In this case, we would like to know the efficiency loss when we adopt the GEE estimation. On the other hand, if any distributional assumption fails, the mixed-effects estimator is inconsistent while the GEE estimator may remain consistent. In this part, I generate the idiosyncratic errors from the standard normal distribution with AR(1)-type correlations across time, i.e., Corr(uit, uis) = ρ^|t−s|. Mixed-effects estimators that assume independent idiosyncratic errors will be inconsistent if ρ ≠ 0. I report the GEE estimator and the mixed-effects estimator of the scaled parameter α3, which is −0.5 when we fix σ²v = 2 and σ²c = 1. Monte Carlo standard deviations of the corresponding estimators are reported in parentheses.

From the results in table 2.5, we can see that the GEE estimator has good efficiency properties in comparison to the mixed-effects estimator. The surprising result is that the mixed-effects method gives very similar estimates to the EGEE estimators when ρ ≠ 0. The intuition is that the correlations of the idiosyncratic errors induce overestimation of the variance of the heterogeneity, and this inconsistency somehow cancels out with the inconsistent estimator of α3 when we calculate the scaled parameters. In general, we should not expect the mixed-effects estimators to be consistent when the distributional assumptions are violated.
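The AR(1)-correlated errors used in DGP7 can be drawn by premultiplying i.i.d. standard normals by a Cholesky factor of the AR(1) correlation matrix. This is a standard construction, sketched here rather than taken from the chapter's code.

```python
import numpy as np

def ar1_normal_errors(N, T, rho, seed=0):
    """Draw u_i ~ N(0, R) with R[t, s] = rho**|t-s|: standard normal margins
    with AR(1) correlation across time, via a Cholesky factor of R."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    R = rho ** np.abs(t[:, None] - t[None, :])   # AR(1) correlation matrix
    L = np.linalg.cholesky(R)
    return rng.standard_normal((N, T)) @ L.T     # rows are correlated T-vectors
```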
2.7 Conclusion

This paper extends the GEE estimation to multi-level nonlinear models. In particular, I consider nonlinear panel data models with a higher nested structure. One stylized example is a model for passing rates of some standardized test at the school level at various time points. Observations are correlated across time through the school fixed effects and are correlated within the same area through the district effects. When we extend the GEE estimation to higher-level models, the working correlation matrix has a more complicated form to capture the various within-group correlations. Compared with the usual mixed-effects estimator used in hierarchical models, the GEE estimator is robust to certain violations of the distributional assumptions and is less intensive in terms of computation. For some nonlinear models, such as the fractional response model, it is difficult to find any distribution with which to implement the mixed-effects estimation. In comparison with the pooled estimator, simulation evidence indicates that the GEE estimator has relatively good efficiency properties. This paper also contributes to the GEE literature in other important ways. I show that in panel data models, as well as in other non-nested models, allowing time-varying dispersion parameters can further enhance the efficiency of the GEE estimator. Simulations show that this is a nontrivial improvement. I also provide a systematic discussion of the selection of working correlation matrices. I show that the exchangeable working correlation matrix has a well-defined limit under various misspecification settings, but its definiteness needs to be checked on a case-by-case basis. When the conditional mean associated with the heterogeneities has a multiplicative functional form, we can use the multivariate weighted nonlinear least squares method to explicitly quantify the various within-group correlations.
Under certain assumptions, the MWNLS estimator is more efficient than the GEE estimator. I show how to implement the MWNLS estimation in the count response panel model with a higher nested structure.

The GEE method can be carried out by using either the one-step or the two-step estimation. If we use the same working correlation matrix, the one-step estimation and the two-step estimation are asymptotically equivalent, which is similar to the comparison between the continuously updating GMM and the two-step GMM. Less is known about the finite sample performances of the one-step GEE estimator and the two-step GEE estimator. This is one of my ongoing research projects. Other topics on multi-level models for future research include dynamic models, nonlinear models with endogeneity, and so on.

APPENDIX FOR CHAPTER 2

Case 1.1 The estimator of ρ can be obtained by solving the equation

∑_{t=1}^{T−1} ∑_{s=t+1}^{T} ( (1/N) ∑_{i=1}^{N} e_it e_is − ρ̂^{s−t} ) = 0   (B.1)

With large-N asymptotics, (1/N) ∑_{i=1}^{N} e_it e_is → ξ. After some algebra, the above equation can be rewritten as

Q(ρ̂) ≡ ξ − (2/(T(T−1))) · (ρ̂/(1−ρ̂)) · ( T − (1−ρ̂^T)/(1−ρ̂) ) = 0   (B.2)

It can be shown that dQ(ρ̂)/dρ̂ is negative over ρ̂ ∈ (−1, 1). It is easy to verify that lim_{ρ̂→1} Q(ρ̂) = ξ − 1 < 0. Equation (B.2) therefore has a unique solution only if Q(−1) > 0. When T is even, Q(−1) = ξ + 1/(T−1) > 0 by the intrinsic constraint ξ > −1/(T−1); when T is odd, Q(−1) = ξ + 1/T, which has indefinite sign. When T is odd, the working correlation matrix is well defined only if ξ > −1/T.

Case 3.2 In this case, ρ̂ can be found from the equation

∑_{t=1}^{T−1} ∑_{s=t+1}^{T} ( (1/N) ∑_{i=1}^{N} e_it e_is − ρ̂^{s−t} ) = 0   (B.3)

With large-N asymptotics, (1/N) ∑_{i=1}^{N} e_it e_{i,t+1} → ξ_t and (1/N) ∑_{i=1}^{N} e_it e_is → 0 if |t − s| > 1. After some algebra, the above equation can be rewritten as

Q(ρ̂) ≡ ∑_{t=1}^{T−1} ξ_t − (ρ̂/(1−ρ̂)) · ( T − (1−ρ̂^T)/(1−ρ̂) ) = 0   (B.4)

Again it can be shown that dQ(ρ̂)/dρ̂ is negative over ρ̂ ∈ (−1, 1) and lim_{ρ̂→1} Q(ρ̂) = ∑_{t=1}^{T−1} ξ_t − T(T−1)/2 < 0.
Equation (B.4) has a unique solution only if Q(−1) > 0. When T is even, Q(−1) = ∑_{t=1}^{T−1} ξ_t + T/2; when T is odd, Q(−1) = ∑_{t=1}^{T−1} ξ_t + (T−1)/2. The working correlation matrix is well defined if ∑_{t=1}^{T−1} ξ_t > −T/2 when T is even, and if ∑_{t=1}^{T−1} ξ_t > −(T−1)/2 when T is odd.

SIMULATION RESULTS FOR CHAPTER 2

Table 2.1: traditional GEE estimator and GEE estimator with time-varying dispersion parameters

                    DGP1    DGP2 (ρ=0.25)  DGP2 (ρ=−0.25)  DGP3 (θ=0.6)  DGP3 (θ=−0.6)
pooled estimator    1.00    1.00           1.00            1.00          1.00
pooled estimator*   0.836   0.938          0.938           0.998         0.998
GEE(EX)             0.958   0.788          0.934           0.858         0.997
GEE*(EX)            0.774   0.703          0.857           0.857         0.996
GEE(AR)             0.966   0.741          0.974           0.784         0.984
GEE*(AR)            0.786   0.669          0.897           0.783         0.982
GEE(MA)             0.973   0.848          0.986           0.802         0.987
GEE*(MA)            0.800   0.797          0.914           0.801         0.986

Table 2.2: pooled estimator and GEE estimator in a binary response panel model

                    σ²c=0.5  σ²c=1  σ²c=4  σ²c=6
pooled estimator    1.00     1.00   1.00   1.00
GEE(AR)             0.999    0.996  0.983  0.978
GEE(MA)             0.999    0.996  0.987  0.984
GEE(EX)             0.999    0.994  0.978  0.971

Table 2.3: pooled estimator and GEE estimator in a fractional response panel model

                    σ²c=0.01  σ²c=0.02  σ²c=0.03  σ²c=0.04
pooled estimator    1.00      1.00      1.00      1.00
GEE(AR)             0.918     0.829     0.762     0.718
GEE(MA)             0.950     0.951     0.993     1.04
GEE(EX)             0.897     0.793     0.718     0.670

Table 2.4: pooled estimator and EGEE estimator in a three-level binary response model

                    σ²v=2, σ²c=1   σ²v=1, σ²c=1   σ²v=1, σ²c=2
pooled estimator    1.00           1.00           1.00
GEE(EX,EX)          0.817          0.871          0.834
GEE(EX,O)           0.828          0.882          0.838
GEE(AR(1),EX)       0.880          0.908          0.877
GEE(MA(1),EX)       0.909          0.916          0.894
panel GEE(EX)       0.770          0.848          0.823

Table 2.5: EGEE and mixed-effects estimator in a three-level binary response model

                    ρ=0             ρ=0.2           ρ=0.4           ρ=0.6
pooled estimator    -0.488 (0.028)  -0.482 (0.028)  -0.463 (0.028)  -0.474 (0.028)
GEE(EX,EX)          -0.488 (0.024)  -0.483 (0.024)  -0.464 (0.023)  -0.475 (0.024)
mixed-effects       -0.488 (0.023)  -0.483 (0.023)  -0.464 (0.022)  -0.475 (0.023)
BIBLIOGRAPHY

Ai, C. (1997). A Semiparametric Maximum Likelihood Estimator. Econometrica, 65(4), 933-963.

Altonji, J. G., and R. L. Matzkin (2005). Cross Section and Panel Data Estimators for Nonseparable Models with Endogenous Regressors. Econometrica, 73(4), 1053-1102.

Blundell, R., and J. L. Powell (2004). Endogeneity in Semiparametric Binary Response Models. Review of Economic Studies, 71, 655-679.

Chaganty, N. R., and H. Joe (2006). Range of Correlation Matrices for Dependent Bernoulli Random Variables. Biometrika, 93(1), 197-206.

Chamberlain, G. (1982). Multivariate Regression Models for Panel Data. Journal of Econometrics, 18(1), 5-46.

Crowder, M. (1995). On the Use of a Working Correlation Matrix in Using Generalized Linear Models for Repeated Measures. Biometrika, 82(2), 407-410.

Crowder, M. (2001). On Repeated Measures Analysis with Misspecified Covariance Structure. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(1), 55-62.

de Leeuw, J., and E. Meijer (2008). Handbook of Multilevel Analysis. Springer.

Entorf, H., and L. Sattarova (2016). The Analysis of Prison-Prisoner Data Using Cluster-Sample Econometrics: Prison Conditions and Prisoners' Assessments of the Future. IZA Discussion Paper No. 10209. Available at SSRN: https://ssrn.com/abstract=2840153.

Gourieroux, C., A. Monfort, and A. Trognon (1984). Pseudo-Maximum Likelihood Methods: Theory. Econometrica, 52, 681-700.

Gromping, U. (1996). A Note on Fitting a Marginal Model to Mixed Effects Log-Linear Regression Data via GEE. Biometrics, 52(1), 280-285. doi:10.2307/2533162

Hausman, J., B. Hall, and Z. Griliches (1984). Econometric Models for Count Data with an Application to the Patents-R&D Relationship. Econometrica, 52(4), 909-938.

Hsiao, C., and M. H. Pesaran (2004). Random Coefficient Panel Data Models. IZA Discussion Paper No. 1236.

Johnson, T. R., and J. Kim (2004).
A Generalized Estimating Equations Approach to Mixed-Effects Ordinal Probit Models. British Journal of Mathematical and Statistical Psychology, 57, 295-310.

Klein, R., and R. Spady (1993). An Efficient Semiparametric Estimator for Binary Response Models. Econometrica, 61(2), 387-421.

Laird, N., and J. Ware (1982). Random-Effects Models for Longitudinal Data. Biometrics, 38(4), 963-974. doi:10.2307/2529876

Liang, K.-Y., and S. L. Zeger (1986). Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73(1), 13-22.

Liang, K.-Y., S. L. Zeger, and B. Qaqish (1992). Multivariate Regression Analyses for Categorical Data. Journal of the Royal Statistical Society. Series B (Methodological), 54(1), 3-40.

Manski, C. F. (1975). Maximum Score Estimation of the Stochastic Utility Model of Choice. Journal of Econometrics, 3, 205-228.

Moulton, B. (1990). An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units. Review of Economics and Statistics, 72, 334-338.

Mundlak, Y. (1978). On the Pooling of Time Series and Cross Section Data. Econometrica, 46(1), 69-85.

Papke, L. E. (1991). Interstate Business Tax Differentials and New Firm Location: Evidence from Panel Data. Journal of Public Economics, 45(1), 47-68.

Papke, L. E., and J. M. Wooldridge (2008). Panel Data Methods for Fractional Response Variables With an Application to Test Pass Rates. Journal of Econometrics, 145(1), 121-133.

Verbeke, G., and E. Lesaffre (1996). A Linear Mixed-Effects Model with Heterogeneity in the Random-Effects Population. Journal of the American Statistical Association, 91(433), 217-221.

Wang, M. (2014). Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments. Advances in Statistics, vol. 2014, Article ID 303728, 11 pages.
doi:10.1155/2014/303728

Wang, P., G.-F. Tsai, and A. Qu (2012). Conditional Inference Functions for Mixed-Effects Models With Unspecified Random-Effects Distribution. Journal of the American Statistical Association, 107(498), 725-736. doi:10.1080/01621459.2012.665199

Wooldridge, J. M. (1997). Multiplicative Panel Data Models without the Strict Exogeneity Assumption. Econometric Theory, 13(5), 667-678.

Wooldridge, J. M. (1999). Distribution-Free Estimation of Some Nonlinear Panel Data Models. Journal of Econometrics, 90, 77-97.

Wooldridge, J. M. (2003). Cluster-Sample Methods in Applied Econometrics. The American Economic Review, 93(2), 133-138.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Zhang, H., Q. Yu, C. Feng, D. Gunzler, P. Wu, and X. M. Tu (2012). A New Look at the Difference Between the GEE and the GLMM When Modeling Longitudinal Count Responses. Journal of Applied Statistics, 39(9), 2067-2079. doi:10.1080/02664763.2012.700452

CHAPTER 3

ONE-STEP GEE OR TWO-STEP GEE? A SIMULATION STUDY

3.1 Introduction

The generalized estimating equations (GEE) method was first proposed by Liang and Zeger (1986) to analyze panel data models. The GEE method is also commonly used for models with a clustering data structure. The main advantage of adopting the GEE estimation approach is to obtain efficiency gains without specifying the joint distribution of a subject's observations. The idea of GEE is to capture the time series or within-group correlations by utilizing the working correlation matrix, which is used to approximate the true correlation structure. Much empirical and simulation evidence shows that GEE estimators are usually more efficient than other estimators that ignore the within-group correlation.

The estimating equations in the GEE method resemble the score equations from a likelihood analysis.
The GEE method is also related to the multivariate weighted nonlinear least squares (MWNLS) estimation in the econometrics literature. It can be shown that the equations associated with the first order conditions derived from the MWNLS estimation are the same as the estimating equations in the GEE method, provided that we use the same working matrix in each method.

The implementation of the GEE estimation requires the consistent estimation of the parameters associated with the working matrix. The original paper by Liang and Zeger (1986) suggested using the identity working correlation matrix to obtain a preliminary consistent estimator of the parameters associated with the marginal conditional mean function. Using these preliminary consistent estimators, we can get a consistent estimator of the working correlation matrix. Based on the estimated working correlation matrix, we can then update the estimating equations to get new consistent estimators of the conditional mean parameters. This is a two-step estimation procedure in which we use some preliminary consistent estimator to estimate the working matrix in order to exploit possible efficiency gains. The updating process could continue until the differences between the previous estimators and the updated estimators converge to a certain specified range. This results in the iterated GEE estimator. Alternatively, one can directly model the working matrix as a function of the unknown parameters associated with the conditional mean. In this special setup, we can then estimate the conditional mean function together with some nuisance parameters associated with the working matrix in one step. These different implementation procedures of the GEE estimation are very similar to the different GMM estimation methods. The two-step GMM estimator also uses some inefficient weighting matrix to obtain the preliminary consistent estimators.
The iterated GMM estimator keeps updating the weighting matrix with the newly obtained estimator until the difference between the sequence of estimators converges to a certain range. The continuously updating GMM estimator sets the weighting matrix as a function of the unknown parameters of the moment conditions so that all parameters can be estimated in one step. In the GMM literature, it has been well established that these three GMM estimators have the same first order asymptotic distributions as long as the optimal weighting matrix is used. Simulation evidence has shown that the continuously updating GMM estimator has better finite sample performance compared with the two-step GMM estimator.

The main purpose of this paper is to investigate the finite sample performances of the two-step GEE estimator and the one-step GEE estimator. We compare the bias and variance of the two-step GEE estimator and the one-step GEE estimator in various Monte Carlo simulation designs. The rest of the paper is structured as follows. Section 2 reviews the GEE method in one-step and two-step estimations. Section 3 provides the simulation results for the two GEE estimators. Section 4 summarizes the paper.

3.2 One-Step and Two-Step GEE Estimations

The GEE estimation method was first proposed by Liang and Zeger (1986) to analyze panel data models, and it can also be applied to models with a nested data structure. This paper will take the panel data model as the leading case. The same analysis can be carried out for other nested models with some minor notation changes. Suppose our primary interest focuses on the estimation of the following conditional mean function:

E(yit | Xi) = m(xit; β)   (3.1)

where i = 1, 2, ..., N and t = 1, 2, ..., T. The GEE method can be easily extended to cover models with unbalanced data. We assume a balanced panel data structure to simplify notation.
yit is the dependent variable and Xi = (xi1, xi2, ..., xiT) is the collection of all the covariates of subject i over time. Note that equation (3.1) has implicitly assumed strict exogeneity in the sense that E(yit | Xi) = E(yit | xit), which is needed to guarantee the consistency of the GEE estimator. The setup is general enough to cover limited dependent variables. For example, the dependent variable could be a binary response, a fractional response, a count response, or many others. To simplify notation, we define the following conditional mean function for subject i:

E(yi | Xi) = m(Xi; β)   (3.2)

where yi = (yi1, yi2, ..., yiT)' and m(Xi; β) = (m(xi1; β), m(xi2; β), ..., m(xiT; β))'. The GEE approach accounts for neglected time series correlations by utilizing the so-called "working" variance matrix, which is denoted by W(Xi; γ, β). We use γ to represent all the unknown parameters associated with the working matrix except for β. The working matrix can be expressed as

W(Xi; γ, β) = V(Xi; β, σ²o)^(1/2) Ri(ρ) V(Xi; β, σ²o)^(1/2)   (3.3)

where V(Xi; β, σ²o) = diag{var*(yi1 | Xi), var*(yi2 | Xi), ..., var*(yiT | Xi)} and Ri(ρ) is a T × T positive semi-definite matrix associated with the unknown parameter vector ρ. The conditional variance var*(yit | Xi) is usually determined by the nature of the underlying dependent variable and can differ from the true conditional variance function. For example, when yit is a binary response, var*(yit | Xi) = m(xit; β)[1 − m(xit; β)] by construction; when yit is a fractional response, the generalized linear model (GLM) variance assumption usually takes var*(yit | Xi) = σ²o m(xit; β)[1 − m(xit; β)], where σ²o is called the dispersion parameter; when yit is a count response, the conditional variance is modeled as σ²o m(xit; β). Later we use the simulation study to show the efficiency gain from using time-specific dispersion parameters.
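Equation (3.3) can be assembled directly. The sketch below does so for the binary-response choice of var*(yit | Xi) with an exchangeable R(ρ); the function name is an illustrative assumption.

```python
import numpy as np

def working_variance(mu, rho):
    """W = V^{1/2} R(rho) V^{1/2} from equation (3.3) for one subject, using
    the binary-response variance var*(y_it | X_i) = mu_t (1 - mu_t) and an
    exchangeable R(rho).  mu is the T-vector of fitted means m(x_it; beta)."""
    T = mu.shape[0]
    v = mu * (1.0 - mu)                            # var*(y_it | X_i)
    R = np.where(np.eye(T, dtype=bool), 1.0, rho)  # exchangeable correlation
    half = np.sqrt(v)
    return half[:, None] * R * half[None, :]       # V^{1/2} R V^{1/2}
```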
One of the most important components of the working matrix is R(ρ), which is used to approximate the true correlation. Several popular choices of R(ρ) include the identity matrix, correlation matrices derived from MA or AR processes, and the unstructured correlation matrix. Except for the identity matrix, we usually need some preliminary consistent estimator of β to obtain a consistent estimator of ρ. The GEE estimators can be obtained by solving the following equations:

∑_{i=1}^{N} ∇β m(Xi; β)' W(Xi; γ̌, β̌)^(−1) [yi − m(Xi; β)] = 0   (3.4)

where ∇β m(Xi; β) = (∇β m(xi1; β)', ∇β m(xi2; β)', ..., ∇β m(xiT; β)')' and ∇β m(xit; β) is a 1 × K vector. The GEE estimators can also be characterized as multivariate weighted nonlinear least squares estimators by solving the following minimization problem:

Min_β ∑_{i=1}^{N} [yi − m(Xi; β)]' [W(Xi; γ̌, β̌)]^(−1) [yi − m(Xi; β)]   (3.5)

It can easily be shown that the first order condition from equation (3.5) is exactly the same as the estimating equation in (3.4). It has been well established that, under certain regularity assumptions, the GEE estimators are consistent and have the following asymptotic distribution:

√N(β̂ − β) ∼ N( 0, (E[si' Wi^(−1) si])^(−1) E[si' Wi^(−1) ui ui' Wi^(−1) si] (E[si' Wi^(−1) si])^(−1) )   (3.6)

where si = ∇β m(Xi; β), ui = yi − m(Xi; β), and Wi is shorthand for W(Xi; γ, β). As mentioned above, the implementation of the GEE estimation requires some preliminary estimation of the unknown working correlation matrix. To obtain the preliminary consistent estimator of the working matrix, we could use the identity matrix in the first step to get a consistent estimator of β, which is denoted by β̌. From the first-stage residuals ǔi = yi − m(Xi; β̌), we can consistently estimate a working matrix of any particular specified form.
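A sample analogue of the sandwich formula in (3.6) is straightforward to compute once the gradients si, working matrices Wi, and residuals ui are stacked; the array layout conventions below are my own, not the chapter's.

```python
import numpy as np

def sandwich_avar(S, W, U):
    """Sample analogue of the sandwich in (3.6).  S is (N, T, K) with rows
    s_i, W is (N, T, T) with working matrices W_i, U is (N, T) with residuals
    u_i.  Returns the estimated asymptotic variance of sqrt(N)(beta_hat - beta)."""
    N, K = U.shape[0], S.shape[2]
    A = np.zeros((K, K))   # sample mean of s_i' W_i^{-1} s_i
    B = np.zeros((K, K))   # sample mean of s_i' W_i^{-1} u_i u_i' W_i^{-1} s_i
    for i in range(N):
        Wi_inv_s = np.linalg.solve(W[i], S[i])   # W_i^{-1} s_i
        Wi_inv_u = np.linalg.solve(W[i], U[i])   # W_i^{-1} u_i
        A += S[i].T @ Wi_inv_s
        g = S[i].T @ Wi_inv_u                    # score contribution of subject i
        B += np.outer(g, g)
    A /= N
    B /= N
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv
```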
Using the consistent estimator of the working matrix to replace the identity matrix, we obtain the two-step GEE estimator β̂2GEE. The updating process can continue until the differences between the sequence of estimators converge to a certain range, and this results in the fully iterated GEE estimator, which is denoted by β̂FGEE. It has been shown that, as long as we use any √N-consistent preliminary estimators when we estimate the working matrix, both β̂2GEE and β̂FGEE have the same limiting distribution as stated in equation (3.6). Less is known about the finite sample performance of these two estimators.

As we have seen, the working matrix is estimated based on some preliminary consistent estimator of β. This observation leads to the following one-step GEE estimation procedure, in which we explicitly model the unknown parameters γ as a function of β:

∑_{i=1}^{N} ∇β m(Xi; β)' W(Xi; γ(β), β)^(−1) [yi − m(Xi; β)] = 0   (3.7)

Or, equivalently, we can characterize the one-step GEE estimator by the MWNLS estimation approach:

Min_β ∑_{i=1}^{N} [yi − m(Xi; β)]' [W(Xi; γ(β), β)]^(−1) [yi − m(Xi; β)]   (3.8)

Equation (3.8) looks very similar to the objective function of the continuously updating GMM (CUGMM) estimation, in which the optimal weighting matrix is also a function of the unknown parameters. Based on various simulation studies, we have seen that the CUGMM estimators have better finite sample performances compared to the two-step GMM estimators. It can be shown that the one-step GEE estimator β̂1GEE has the same first order asymptotics as described in equation (3.6). The next section will use the simulation study to compare the small sample performances of the one-step GEE estimators and the two-step GEE estimators.

3.3 Simulation Results

We have discussed that the one-step GEE estimators and the two-step GEE estimators have the same asymptotic distribution as long as we impose the same structure on the working matrix.
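Before turning to the designs, the two-step procedure of Section 3.2 can be sketched for the linear-mean case with an exchangeable working matrix, where each step reduces to a weighted least-squares solve. This is a minimal illustration under those assumptions, not the chapter's exact implementation.

```python
import numpy as np

def two_step_gee_linear(y, X):
    """Two-step GEE for a linear conditional mean m(x; beta) = x'beta with an
    exchangeable working matrix.  y is (N, T); X is (N, T, K).  Step 1 uses
    the identity working matrix (pooled OLS); step 2 plugs in rho_hat
    estimated from the step-1 residuals."""
    N, T, K = X.shape

    def solve_gee(W):
        # beta = (sum_i X_i' W^{-1} X_i)^{-1} sum_i X_i' W^{-1} y_i
        A = np.zeros((K, K))
        b = np.zeros(K)
        W_inv = np.linalg.inv(W)
        for i in range(N):
            A += X[i].T @ W_inv @ X[i]
            b += X[i].T @ W_inv @ y[i]
        return np.linalg.solve(A, b)

    beta1 = solve_gee(np.eye(T))                    # step 1: identity working matrix
    e = y - np.einsum('ntk,k->nt', X, beta1)        # step-1 residuals
    sigma2 = e.var()
    rho = np.mean([np.mean(e[:, t] * e[:, s])       # average off-diagonal moment
                   for t in range(T) for s in range(t + 1, T)]) / sigma2
    R = np.where(np.eye(T, dtype=bool), 1.0, rho)   # exchangeable R(rho_hat)
    beta2 = solve_gee(sigma2 * R)                   # step 2: estimated working matrix
    return beta1, beta2, rho
```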
This section presents several simulation results comparing the small sample performances of these two types of GEE estimators. These results should be useful to empirical researchers in choosing among the different implementation procedures of the GEE estimation. The following simulation studies set the replication number to 1000 and assume a balanced data structure in which T = 6 and N varies from 50 to 150.

3.3.1 Linear Models

Consider the following linear regression model:

yit = βo + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci + uit   (3.9)

where βo = 1, β1 = β2 = 2, and β3 = β4 = 3. ci ∼ Normal(0, σ²c), zi1 ∼ Gamma(2,3), zi2 ∼ Gamma(1,2), xit1 ∼ Normal(0,3), xit2 ∼ Normal(0,4), and uit ∼ Normal(0,4). All variables are independent of each other. By varying σ²c, we can change the serial correlation across time. In the following table, we report the finite sample performances of the one-step GEE estimators and the two-step GEE estimators of β1 and β3. Let β̂1(1AR) denote the one-step GEE estimator of β1 with the autoregressive working matrix and β̂1(2EX) denote the two-step GEE estimator of β1 with the exchangeable working matrix. In table 3.1 and table 3.2, we scale the bias and MSE by 100.

From the simulation results in both tables, we can see that the one-step GEE estimator and the two-step GEE estimator have very similar finite sample performances for estimating linear models. Based on the DGP, it is not surprising that the GEE estimator with the exchangeable working correlation matrix is the most efficient. When we use different working correlation matrices, the two-step GEE estimator seems to behave marginally better than the one-step GEE estimator in terms of MSE.

Before we move to nonlinear models, it might be of some interest to investigate the efficiency gain from allowing time-specific dispersion parameters. Many statistical software packages with a built-in GEE command take the dispersion parameter as constant.
The two-step GEE estimator can be easily adjusted to allow time-varying dispersion parameters. The DGP is almost the same as in the previous model. We now normalize ci to be a standard normal random variable and generate uit from Normal(0, αt), where α is a constant to be specified.

Again we multiply the bias and MSE by 100 in the simulation results. In table 3.3, we report the results for the two-step GEE estimators with and without time-varying dispersion parameters when α = −0.5. We use β̂*1(2EX) to denote the two-step GEE estimator with the time-varying dispersion parameters. When the sample size is reasonably large, we see that the GEE estimator with time-varying dispersion parameters has smaller bias and smaller Monte Carlo standard deviation, and hence smaller MSE as well. In table 3.4, as we set α = 1, the gains from allowing time-varying dispersion parameters become more evident. In many cases, the GEE estimator using the identity working matrix with time-varying dispersion parameters is even better than the GEE estimator with the exchangeable working matrix but with time-constant dispersion parameters. The GEE estimator with the identity working matrix and time-varying dispersion parameters is essentially the same as the weighted least squares estimator, which only requires the contemporaneous exogeneity condition to be consistent.

Other than the bias and standard deviation, we also simulate the coverage rates for the linear models under various situations. In tables 3.5 and 3.6, we report the coverage rates of the 90% confidence interval and the 95% confidence interval for the one-step GEE estimator and the two-step GEE estimator. In particular, table 3.5 reports the results for the usual linear regression model. It shows a similar pattern: the one-step GEE estimator and the two-step GEE estimator have very similar coverage rates when we use the exchangeable working correlation matrix.
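Returning to the dispersion adjustment above: in the linear case it amounts to estimating one dispersion parameter per period from preliminary residuals and reweighting. A minimal sketch (my own notation) of the identity-working-matrix version, which the text notes is just weighted least squares:

```python
import numpy as np

def gee_time_varying_dispersion(y, X, beta0):
    """Identity working correlation with time-varying dispersion parameters:
    estimate sigma2_t = (1/N) sum_i e_it^2 from preliminary residuals and
    weight period t by 1/sigma2_t.  y is (N, T), X is (N, T, K), and beta0
    is any preliminary consistent estimate (e.g. pooled OLS)."""
    e = y - np.einsum('ntk,k->nt', X, beta0)
    sigma2_t = (e ** 2).mean(axis=0)          # one dispersion estimate per period
    w = 1.0 / sigma2_t
    A = np.einsum('t,ntk,ntj->kj', w, X, X)   # sum_{i,t} w_t x_it x_it'
    b = np.einsum('t,nt,ntk->k', w, y, X)     # sum_{i,t} w_t x_it y_it
    return np.linalg.solve(A, b), sigma2_t
```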
As shown in the previous simulation study, when we use the inefficient working correlation matrix, the two-step GEE estimator usually has a smaller standard deviation compared to the one-step GEE estimator. As a consequence of the smaller standard deviations, the two-step GEE estimator has a smaller coverage rate. Table 3.6 compares the coverage rates for the traditional GEE estimator and the GEE estimator with time-varying dispersion parameters when heteroskedasticity is present in the linear model. The coverage rates of the traditional GEE estimator are slightly higher than those of the improved GEE estimator. When the sample size is reasonably large, these two GEE estimators have very close coverage rates.

3.3.2 Nonlinear Models

The binary response model is by far one of the most popular examples of nonlinear models. The data generating process is usually specified as yit = 1[y*it ≥ 0], where y*it = β0 + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci + uit. The DGPs for all the independent variables are the same as in the linear model except that now the idiosyncratic error uit follows the standard normal distribution. We set βo = 0, β1 = 0.2, β2 = −0.2, β3 = 0.3, and β4 = −0.3.

In table 3.7 and table 3.8, we report the one-step GEE estimator and the two-step GEE estimator of β1 and β3 when σ²c = 1 and σ²c = 1.5, respectively. In both tables, we scale the bias and MSE by 100. From the simulation results for binary response models, we can see that the one-step GEE estimator and the two-step GEE estimator have very similar finite sample performances. It is interesting to observe that when we are not using the efficient working matrix, the two-step GEE estimator usually has a smaller MSE compared to the one-step GEE estimator with the same working matrix. This might suggest that the two-step GEE estimator should be favored by empirical researchers, as the efficient working correlation matrix is rarely known in reality.
Next we consider the one-step GEE estimator and the two-step GEE estimator in the fractional response model. To obtain the fractional random variable, we first generate y*it from Binomial(10000, Φ(β0 + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci)), where the independent variables are generated from the same distributions as in the linear model and ci follows N(0, 0.04); then yit = y*it/10000 is the desired fractional random variable with the conditional mean function Φ(β0 + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci). We set the true values of the underlying parameters to be the same as in the previous binary response model. In table 3.9, we also scale the bias and Monte Carlo standard deviation, as well as the MSE, by 100. Again we find similar evidence that the one-step GEE estimator and the two-step GEE estimator behave equally well in finite samples. The GEE estimators, both one-step and two-step, with the stationary working correlation matrix are not considered because of convergence issues.

To provide more simulation evidence, we look into the two GEE estimators for count response models. The data generating process is as follows: yit ∼ Poisson(exp(β0 + β1 zi1 + β2 zi2 + β3 xit1 + β4 xit2 + ci)). The distributions of the independent variables are the same as before. ci is generated from the standard normal distribution and is independent of all the explanatory variables. The true parameters are set as βo = 0, β1 = 0.4, β2 = −0.4, β3 = 0.5, and β4 = −0.5. In table 3.10, we report the simulation results without any scaling, and we omit the GEE estimator with the stationary working matrix because of convergence issues. We find a similar pattern: the two-step GEE estimator behaves no worse than the one-step GEE estimator when we use the efficient working correlation matrix. The mean squared error of the two-step GEE estimator is generally smaller than that of the one-step GEE estimator when we use an inefficient working matrix.
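The binomial intermediate step used for the fractional response above can be sketched in a few lines. The index below is a deliberately simplified stand-in (one regressor and made-up coefficients) rather than the chapter's full specification.

```python
import math
import numpy as np

def generate_fractional(N=500, T=6, n_trials=10000, seed=0):
    """Intermediate-step trick: draw y*_it from Binomial(n_trials,
    Phi(index_it)) and set y_it = y*_it / n_trials, so that
    E(y_it | x_it, c_i) = Phi(index_it) by construction."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((N, T))
    c = 0.2 * rng.standard_normal((N, 1))            # group heterogeneity
    index = 0.3 * x - 0.3 + c                        # simplified linear index
    norm_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
    p = norm_cdf(index)                              # conditional mean Phi(.)
    y = rng.binomial(n_trials, p) / n_trials         # fractional response in [0, 1]
    return y, p
```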
At the end of this subsection, we report the coverage rates of the two GEE estimators for the binary response model. Tables 3.11 and 3.12 show that the two-step GEE estimator with the exchangeable working correlation matrix has very similar coverage rates compared to the one-step GEE estimator with the exchangeable working correlation matrix. When the inefficient working correlation matrices are used, the one-step GEE estimator provides a better coverage rate. As the sample size increases, the coverage rates of both GEE estimators approach the nominal rates.

3.4 Conclusion

While it is known that the one-step GEE estimator and the two-step GEE estimator have the same first order asymptotic distribution, less is known about their small sample performances. This paper conducts Monte Carlo simulations to investigate the finite sample performances of the one-step GEE estimator and the two-step GEE estimator. We show that these two GEE estimators behave equally well when we use the efficient working correlation matrix. We also find that the two-step GEE estimator seems to outperform the one-step GEE estimator in terms of the mean squared error when we use an inefficient working correlation matrix. The simulation study also shows that the two-step GEE estimator has similar coverage rates to the one-step GEE estimator in both linear and nonlinear models. This might suggest that empirical researchers use the two-step estimation approach, as we rarely know the true correlation structure. The two-step GEE estimator is less computationally demanding compared with the one-step GEE estimator. Another advantage of the two-step GEE estimator is that it makes it easier to estimate time-specific dispersion parameters in the context of panel data models. The simulation study has shown that allowing time-varying dispersion parameters can further improve the efficiency of the GEE estimator.
APPENDIX FOR CHAPTER 3

Table 3.1: one-step GEE and two-step GEE for linear models (σ 2 c = 1) bias -0.1583 -0.1583 -0.1672 -0.1655 -0.1653 -0.1641 0.2098 0.2091 0.1482 0.1508 0.1355 0.1431 N=50 sd 0.0458 0.0458 0.0459 0.0459 0.0459 0.0458 0.0714 0.0714 0.0740 0.0738 0.0748 0.0743 RMSE 0.0458 0.0458 0.0459 0.0459 0.0459 0.0458 0.0714 0.0714 0.0740 0.0738 0.0748 0.0743 N=100 sd RMSE 0.0317 0.0316 0.0317 0.0317 0.0317 0.0316 bias 0.0778 0.0778 0.0791 0.0791 0.0807 0.0801 -0.0634 -0.0632 -0.0158 0.0514 -0.0219 0.0513 0.0060 0.0517 -0.0083 0.0515 bias 0.1286 0.0317 0.1286 0.0316 0.1299 0.0317 0.1297 0.0317 0.1308 0.0317 0.1302 0.0316 0.0506 0.0506 -0.1710 0.0506 0.0506 -0.1709 -0.1896 -0.1832 -0.1928 -0.1854 0.0514 0.0513 0.0517 0.0515 N=150 sd 0.0252 0.0252 0.0253 0.0253 0.0253 0.0252 0.0410 0.0410 0.0423 0.0423 0.0426 0.0425

Table 3.2: one-step GEE and two-step GEE for linear models (σ 2 c = 2) N=100 sd RMSE 0.0536 0.0536 0.0538 0.0537 0.0540 0.0537 bias 0.2097 0.0536 0.2097 0.0536 0.2253 0.0538 0.2219 0.0537 0.2283 0.0540 0.2227 0.0537 0.0520 0.0520 -0.1797 0.0520 0.0520 -0.1796 0.0568 -0.2420 -0.2291 0.0562 -0.3575 0.0765 0.0615 -0.2779 0.0568 0.0562 0.0765 0.0615 N=150 sd 0.0424 0.0424 0.0429 0.0427 0.0429 0.0426 0.0422 0.0422 0.0463 0.0461 0.0629 0.0505 bias -0.2260 -0.2260 -0.2638 -0.2563 -0.2381 -0.2435 0.2299 0.2290 0.1111 0.1183 0.0021 0.0687 N=50 sd 0.0773 0.0773 0.0786 0.0781 0.0786 0.0781 0.0743 0.0743 0.0825 0.0817 0.1097 0.0893 RMSE 0.0773 0.0773 0.0786 0.0781 0.0786 0.0781 0.0743 0.0743 0.0825 0.0817 0.1097 0.0893 bias 0.0626 0.0626 0.0449 0.0499 0.0423 0.0504 -0.0646 -0.0640 0.0628 0.0487 0.0344 0.0173 RMSE 0.0252 0.0252 0.0253 0.0253 0.0253 0.0253 0.0410 0.0410 0.0424 0.0423 0.0426 0.0425 RMSE 0.0424 0.0424 0.0429 0.0427 0.0429 0.0427 0.0422 0.0422 0.0463 0.0461 0.0629 0.0505 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA)
ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) Table 3.3: two-step GEE estimators with and without time-varying dispersion parameters (α = 0.5) bias -0.1092 -0.0832 -0.1061 -0.0422 -0.1095 -0.0776 -0.1112 -0.0814 0.2081 0.2362 0.1154 0.1349 0.1031 0.1297 0.0914 0.1191 N=50 sd 0.0396 0.0387 0.0396 0.0393 0.0398 0.0389 0.0396 0.0389 0.0560 0.0534 0.0487 0.0442 0.0518 0.0478 0.0535 0.0493 ˆβ1(2IN) ˆβ∗ 1(2IN) ˆβ1(2EX) ˆβ∗ 1(2EX) ˆβ1(2AR) ˆβ∗ 1(2AR) ˆβ1(2MA) ˆβ∗ 1(2MA) ˆβ2(2IN) ˆβ∗ 2(2IN) ˆβ2(2EX) ˆβ∗ 2(2EX) ˆβ2(2AR) ˆβ∗ 2(2AR) ˆβ2(2MA) ˆβ∗ 2(2MA) sd bias 0.0513 0.0283 0.0401 0.0276 0.0492 0.0283 0.0360 0.0275 RMSE 0.0396 0.0387 0.0396 0.0490 0.0393 0.0235 0.0398 0.0389 0.0396 0.0524 0.0389 0.0399 0.2082 0.0560 0.1790 0.0534 0.0487 0.1697 0.1532 0.0442 0.2162 0.0518 0.0478 0.1924 0.2303 0.0535 0.0493 0.2308 N=100 N=150 sd 0.0231 0.0222 0.0229 0.0221 RMSE 0.0283 0.0276 0.0283 0.0283 0.0276 0.0276 0.0283 0.0275 0.0283 0.0283 0.0276 0.0276 0.0388 0.0366 0.0327 0.0294 0.0349 0.0319 0.0360 0.0331 RMSE bias 0.0230 0.1107 0.1002 0.0222 0.1102 0.0229 0.0229 0.0896 0.0219 0.0219 0.1183 0.0231 0.1090 0.0222 0.1170 0.0231 0.0231 0.1081 0.0222 0.0222 0.0388 0.1317 0.0318 0.0318 0.0366 0.1179 0.0307 0.0307 0.0327 0.1869 0.0273 0.0273 0.0294 0.1656 0.0252 0.0252 0.0349 0.2060 0.0287 0.0288 0.0319 0.1791 0.0270 0.0271 0.0361 0.2038 0.0296 0.0296 0.0331 0.1770 0.0279 0.0279 84 Table 3.4: two-step GEE estimators with and without time-varying dispersion parameters (α = 1) bias -0.0134 0.0145 -0.0107 0.0407 -0.0258 0.0087 -0.0025 0.0100 -0.2966 -0.3458 -0.1402 -0.2096 -0.1375 -0.2272 -0.1413 -0.2297 N=50 sd 0.0441 0.0418 0.0441 0.0418 0.0444 0.0420 0.0443 0.0420 0.0735 0.0679 0.0677 0.0613 0.0704 0.0644 0.0710 0.0649 ˆβ1(2IN) ˆβ∗ 1(2IN) ˆβ1(2EX) ˆβ∗ 1(2EX) ˆβ1(2AR) ˆβ∗ 1(2AR) ˆβ1(2MA) ˆβ∗ 1(2MA) ˆβ2(2IN) ˆβ∗ 2(2IN) ˆβ2(2EX) ˆβ∗ 2(2EX) ˆβ2(2AR) ˆβ∗ 2(2AR) ˆβ2(2MA) ˆβ∗ 2(2MA) N=100 N=150 sd bias -0.0399 -0.0522 -0.0405 -0.0561 -0.0435 -0.0574 -0.0311 RMSE RMSE 0.0324 0.0324 0.0441 
0.0305 0.0305 0.0418 0.0324 0.0324 0.0441 0.0303 0.0418 0.0303 0.0444 -0.0533 0.0325 0.0325 0.0420 0.0304 0.0304 0.0325 0.0325 0.0443 0.0420 0.0304 0.0304 0.0735 -0.0589 0.0511 0.0511 0.0467 0.0467 0.0679 0.0677 0.0468 0.0468 0.0419 0.0419 0.0613 0.0704 -0.0420 0.0489 0.0489 0.0443 0.0443 0.0644 0.0495 0.0495 0.0710 0.0649 0.0449 0.0449 0.0840 -0.0474 0.0808 0.0699 -0.0607 0.0584 sd bias 0.0244 0.0255 0.0244 0.0255 0.0245 0.0255 0.0246 RMSE 0.0255 -0.0581 0.0245 -0.0406 0.0255 -0.0581 -0.0326 0.0246 -0.0632 0.0255 0.0255 -0.0461 0.0244 0.0255 -0.0635 -0.0467 0.0244 -0.1299 0.0405 0.0405 0.0370 -0.1459 -0.1008 0.0381 -0.1111 0.0341 -0.1240 0.0392 0.0392 -0.1335 0.0355 0.0396 -0.1264 -0.1365 0.0358 0.0370 0.0381 0.0341 0.0355 0.0396 0.0358 Table 3.5: coverage rate: one-step GEE and two-step GEE for linear models (σ 2 c = 2) N=50 N=100 N=150 90% CI 95% CI 95% CI 90% CI 90% CI 95% CI 85.36% 90.91% 85.98% 91.88% 87.97% 93.78% 85.36% 90.91% 85.98% 91.88% 87.97% 93.78% 84.42% 91.26% 86.21% 92.10% 87.55% 93.46% 84.65% 90.91% 86.10% 91.55% 87.87% 93.67% 84.42% 91.03% 85.87% 92.44% 87.97% 93.88% 84.53% 90.79% 85.87% 91.66% 88.19% 93.78% 88.19% 93.62% 88.10% 93.66% 88.92% 94.30% 88.43% 93.51% 88.10% 93.66% 88.92% 94.30% 86.78% 92.80% 88.99% 94.22% 88.08% 93.99% 86.89% 93.51% 88.99% 94.10% 89.35% 93.78% 85.83% 92.80% 90.43% 95.44% 89.03% 93.99% 86.78% 93.27% 89.99% 94.77% 89.56% 94.30% ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) 85 Table 3.6: coverage rate: two-step GEE estimators with and without time-varying dispersion parameters (α = 0.5) N=50 N=100 N=150 90% CI 95% CI 95% CI 90% CI 90% CI 95% CI 86.2% 92.8% 86.3% 92.5% 87.1% 92.3% 85.7% 91.9% 86.4% 91.9% 86.7% 92.6% 86.3% 93.0% 86.3% 92.5% 87.1% 92.3% 84.6% 91.5% 85.6% 91.8% 87.0% 93.2% 85.9% 92.9% 87.0% 93.0% 86.3% 92.2% 84.4% 91.9% 86.5% 91.6% 86.7% 92.4% 85.8% 93.2% 86.9% 93.1% 86.2% 91.9% 84.9% 91.7% 86.5% 91.6% 86.8% 92.6% 88.3% 94.0% 89.1% 94.7% 89.8% 
94.8% 87.3% 93.7% 88.8% 93.4% 88.6% 93.4% 88.2% 94.0% 90.4% 95.6% 89.8% 94.9% 87.4% 94.2% 90.5% 95.3% 88.9% 94.7% 87.5% 93.9% 90.6% 94.4% 90.5% 94.4% 86.7% 93.2% 89.6% 94.5% 87.9% 94.1% 87.9% 93.9% 90.6% 94.6% 89.7% 94.4% 87.9% 92.7% 89.7% 94.4% 89.0% 94.4% ˆβ1(2IN) ˆβ∗ 1(2IN) ˆβ1(2EX) ˆβ∗ 1(2EX) ˆβ1(2AR) ˆβ∗ 1(2AR) ˆβ1(2MA) ˆβ∗ 1(2MA) ˆβ2(2IN) ˆβ∗ 2(2IN) ˆβ2(2EX) ˆβ∗ 2(2EX) ˆβ2(2AR) ˆβ∗ 2(2AR) ˆβ2(2MA) ˆβ∗ 2(2MA) Table 3.7: one-step GEE and two-step GEE for the binary response models (σ 2 N=150 N=100 N=50 c = 1) bias 0.8588 0.8704 0.8479 0.8541 0.8564 0.8576 0.7515 0.7590 0.7376 0.7629 0.6952 0.7425 sd 0.0407 0.0407 0.0408 0.0408 0.0409 0.0408 0.0515 0.0515 0.0537 0.0536 0.0551 0.0543 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) sd RMSE bias 0.0415 0.3660 0.0416 0.3756 0.0417 0.0417 0.0417 0.0417 0.0520 0.2573 0.0520 0.2606 0.0542 0.0541 0.0555 0.0548 0.0288 0.0288 0.3698 0.0283 0.3752 0.0283 0.3728 0.0283 0.3770 0.0283 0.0344 0.0344 0.2963 0.0358 0.3036 0.0357 0.3161 0.0365 0.3158 0.0361 RMSE 0.0290 0.0290 0.0285 0.0285 0.0286 0.0285 0.0345 0.0345 0.0359 0.0358 0.0367 0.0362 bias 0.3850 0.3920 0.4004 0.3973 0.3988 0.3965 0.1520 0.1543 0.1578 0.1608 0.1664 0.1649 sd 0.0227 0.0228 0.0230 0.0229 0.0230 0.0229 0.0278 0.0278 0.0290 0.0290 0.0295 0.0292 RMSE 0.0231 0.0231 0.0233 0.0232 0.0233 0.0233 0.0278 0.0278 0.0290 0.0290 0.0295 0.0293 86 Table 3.8: one-step GEE and two-step GEE for the binary response models (σ 2 N=150 N=100 c = 1.5) N=50 bias 0.8070 0.8262 0.8197 0.8291 0.8260 0.8324 0.6899 0.6985 0.6998 0.7209 0.5757 0.6705 sd 0.0414 0.0415 0.0416 0.0417 0.0417 0.0416 0.0490 0.0490 0.0522 0.0521 0.0560 0.0537 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ1(1MA) ˆβ1(2MA) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) ˆβ2(1MA) ˆβ2(2MA) sd RMSE bias 0.0422 0.3454 0.0423 0.3521 0.0424 0.0425 0.0425 0.0424 0.0495 0.2679 0.0495 0.2712 0.0527 0.0526 0.0563 0.0541 0.0297 0.0297 0.3690 0.0293 0.3754 0.0293 0.3777 0.0294 0.3791 
0.0293 0.0323 0.0323 0.2982 0.0341 0.3099 0.0339 0.3053 0.0360 0.3174 0.0348 RMSE 0.0299 0.0299 0.0295 0.0295 0.0296 0.0295 0.0324 0.0324 0.0342 0.0340 0.0361 0.0349 bias 0.3688 0.3745 0.3885 0.3886 0.3925 0.3911 0.1354 0.1373 0.1582 0.1588 0.1659 0.1630 sd 0.0228 0.0229 0.0232 0.0232 0.0233 0.0232 0.0262 0.0262 0.0276 0.0275 0.0288 0.0281 RMSE 0.0231 0.0231 0.0235 0.0235 0.0237 0.0235 0.0262 0.0262 0.0276 0.0275 0.0289 0.0281 Table 3.9: one-step GEE and two-step GEE for the fractional response models(all scale by 100) bias 0.3678 0.3677 0.3696 0.3679 0.5613 0.5612 0.5601 0.5597 N=50 sd 0.1933 0.1916 0.1967 0.1842 0.0761 0.0758 0.0873 0.0856 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) N=100 sd bias RMSE 0.4159 0.3630 0.4147 0.3632 0.4183 0.4111 0.5666 0.5580 0.5666 0.5581 0.5666 0.5666 0.1360 0.1353 0.3669 0.1410 0.3684 0.1291 0.0529 0.0528 0.5573 0.0613 0.5578 0.0603 RMSE 0.3873 0.3873 0.3924 0.3899 0.5604 0.5604 0.5604 0.5612 bias 0.3724 0.3724 0.3711 0.3715 0.5570 0.5570 0.5558 0.5561 N=150 sd 0.1029 0.1026 0.1074 0.1006 0.0446 0.0445 0.0507 0.0497 RMSE 0.3860 0.3860 0.3860 0.3847 0.5586 0.5586 0.5586 0.5586 87 Table 3.10: one-step GEE and two-step GEE for the count response models without any scaling bias -0.0116 -0.0118 -0.0116 -0.0124 0.0040 0.0007 0.0063 0.0002 N=50 sd 0.0932 0.0930 0.0935 0.0926 0.0862 0.0790 0.0870 0.0767 ˆβ1(1EX) ˆβ1(2EX) ˆβ1(1AR) ˆβ1(2AR) ˆβ2(1EX) ˆβ2(2EX) ˆβ2(1AR) ˆβ2(2AR) N=100 N=150 sd sd bias bias -0.0120 -0.0122 RMSE 0.0749 0.0938 0.0937 0.0748 0.0942 -0.0119 0.0747 0.0934 -0.0126 0.0743 0.0863 0.0790 0.0872 0.0766 -0.0002 -0.0029 -0.0007 0.0799 -0.0047 0.0712 0.0686 0.0685 RMSE RMSE 0.0697 -0.0124 0.0759 -0.0127 0.0696 0.0758 -0.0119 0.0685 0.0695 0.0756 -0.0128 0.0683 0.0694 0.0754 0.0789 0.0789 0.0800 0.0799 0.0018 0.0728 0.0729 0.0742 0.0742 -0.0011 0.0029 0.0799 0.0799 0.0693 0.0693 -0.0035 0.0799 0.0713 Table 3.11: coverage rate: one-step GEE and two-step GEE for binary response model (σ 2 c = 1) 
            N=50              N=100             N=150
            90% CI   95% CI   90% CI   95% CI   90% CI   95% CI
ˆβ1(1EX)    87.18%   92.84%   87.56%   94.08%   89.17%   93.59%
ˆβ1(2EX)    87.18%   92.84%   87.56%   94.08%   89.28%   93.49%
ˆβ1(1AR)    87.39%   92.13%   87.96%   94.08%   88.68%   93.39%
ˆβ1(2AR)    87.29%   92.33%   87.76%   94.38%   88.88%   93.69%
ˆβ1(1MA)    87.49%   92.53%   87.86%   94.18%   88.68%   93.29%
ˆβ1(2MA)    87.18%   92.43%   87.76%   94.28%   88.78%   93.59%
ˆβ2(1EX)    88.09%   93.54%   89.77%   93.98%   89.28%   94.39%
ˆβ2(2EX)    88.19%   93.74%   89.87%   93.98%   89.28%   94.39%
ˆβ2(1AR)    87.89%   93.04%   89.57%   94.18%   90.18%   94.99%
ˆβ2(2AR)    88.80%   92.94%   89.47%   94.28%   90.08%   94.79%
ˆβ2(1MA)    87.59%   92.84%   89.37%   94.58%   90.28%   95.39%
ˆβ2(2MA)    88.29%   92.84%   89.27%   94.78%   90.08%   94.89%

Table 3.12: coverage rate: one-step GEE and two-step GEE for binary response model (σ 2 c = 1.5)

            N=50              N=100             N=150
            90% CI   95% CI   90% CI   95% CI   90% CI   95% CI
ˆβ1(1EX)    87.83%   93.04%   87.59%   93.64%   88.77%   94.48%
ˆβ1(2EX)    87.83%   92.94%   87.59%   93.64%   88.87%   94.48%
ˆβ1(1AR)    88.14%   93.05%   87.99%   94.55%   87.46%   93.78%
ˆβ1(2AR)    87.53%   93.25%   88.09%   94.45%   87.76%   93.68%
ˆβ1(1MA)    87.53%   93.25%   87.89%   94.35%   87.36%   93.88%
ˆβ1(2MA)    87.53%   92.94%   87.89%   94.45%   87.76%   93.68%
ˆβ2(1EX)    88.55%   93.46%   89.71%   94.75%   90.07%   94.68%
ˆβ2(2EX)    88.55%   93.46%   89.91%   94.65%   89.97%   94.68%
ˆβ2(1AR)    87.83%   92.64%   89.51%   94.95%   89.67%   94.78%
ˆβ2(2AR)    87.22%   92.74%   90.01%   94.95%   89.57%   94.68%
ˆβ2(1MA)    86.30%   91.92%   89.51%   94.65%   89.77%   95.16%
ˆβ2(2MA)    86.81%   92.02%   90.21%   94.75%   89.47%   94.88%

BIBLIOGRAPHY

Laird, N., and J. Ware (1982). Random-Effects Models for Longitudinal Data. Biometrics, 38(4), 963-974. doi:10.2307/2529876

Liang, K.-Y., and S.L. Zeger (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.

Liang, K.-Y., S.L. Zeger and B. Qaqish (1992). Multivariate Regression Analyses for Categorical Data. Journal of the Royal Statistical Society, Series B (Methodological), 54(1), 3-40.

Lipsitz, S., G. Fitzmaurice, D. Sinha, N. Hevelone, J. Hu, and L.L. Nguyen (2017).
One-Step Generalized Estimating Equations with Large Cluster Sizes. Journal of Computational and Graphical Statistics, 26(3), 734-737.

Wang, M. (2014). Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments. Advances in Statistics, vol. 2014, Article ID 303728, 11 pages. doi:10.1155/2014/303728

Yang, Y. (2018). An Extended Generalized Estimating Equations Approach to Nonlinear Models with Hierarchical Structure. Working paper.