This is to certify that the
dissertation entitled

Estimating Logistic Regression Models
for Correlated Binary Outcomes

presented by

Adolfo Navarro

has been accepted towards fulfillment
of the requirements for
the Ph.D. degree in ... and Research Design

Major professor

Date
ESTIMATING LOGISTIC REGRESSION MODELS FOR CORRELATED
BINARY OUTCOMES
By
Adolfo Navarro
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Department of Counseling, Educational
Psychology and Special Education
1998
ABSTRACT
ESTIMATING LOGISTIC REGRESSION MODELS FOR CORRELATED
BINARY OUTCOMES
By
Adolfo Navarro
The finite sample performance of two methods of estimation proposed by Liang and Zeger (1986) is evaluated when they are used to fit grouped logistic models: the independence estimating equations (IEE), which assume that the responses are independent, and the generalized estimating equations (GEE), which take the correlation into account to increase efficiency. Specifically, the performance of the GEE estimator, under a constant working correlation matrix across clusters, is compared to that of the IEE estimator in terms of bias, efficiency, and accuracy in testing a simple hypothesis about the mean parameters.
A Monte Carlo simulation is used to compare the estimators empirically under the
various conditions specified in this study. For example, the effects of the population
average pairwise correlation of the binary responses and the effect of the distribution of a
covariate on the IEE and the GEE estimator associated with that covariate are analyzed.
In this regard, a data generating mechanism (DGM) to simulate binary correlated
responses is proposed. This DGM has the advantage over the traditional random effects
logistic formulation (TRELM) in that the marginal probabilities, conditional only on the
covariates, have a simple logistic form and the mean parameters quantify the average
effect of the covariates on the population’s response.
The finite sample findings show that for highly correlated binary responses, when including either a binary or a standard normal within cluster covariate, the GEE outperforms the IEE in estimating the mean parameter associated with that covariate in terms of efficiency and accuracy in hypothesis testing, and both the IEE and GEE estimators are equally biased. However, for weakly correlated binary responses, both methods perform equally well.

In contrast, for highly correlated binary responses, when including a binary cluster specific covariate, the IEE estimator substantially outperforms the GEE in estimating the mean parameter associated with that covariate in terms of bias, efficiency, and accuracy in hypothesis testing. However, for weakly correlated binary responses the IEE method outperforms the GEE only in terms of efficiency, although the differences are not as substantial.
Copyright by
ADOLFO NAVARRO
1998
To Daniel Gustavo and John Paul
ACKNOWLEDGMENTS
This dissertation would not have been possible without the help of many people. I
am especially grateful to my dissertation director, Professor Jeffrey Wooldridge whose
guidance, insight and kindness have been invaluable to my progress. I would also like to
express my sincere gratitude to Professor James Stapleton. He has been a mentor and an
inspiration to me throughout my graduate studies. I will always be indebted to both of
them for their careful attention to and interest in my work. Finally, heartfelt thanks go to
Professor Robert Floden for extending his hand of support at the peak of my desperation.
Without his advice and support I would not have been able to complete my work.
I am grateful to Dr. Dozier Thornton for his enduring faith in me as a scholar. My
friend and colleague, Dr. Randy Fotiu, provided me with valuable computational help and
kept my SAS updated. I would also like to express my thanks to two organizations that
provided me with financial support. First, without the support provided to me by the
Universidad de Oriente, in Cumana, Venezuela, graduate school would have remained a
distant dream rather than a reality. I am also grateful to the King-Chavez-Parks Future
Faculty Fellowship program which awarded me with financial assistance during this final
phase of graduate school.
Finally, I want to say thanks to my lovely family for their devotion to me as I
finished my degree. Special thanks to my in-laws, Richard and Suzanne Johnson, for
their unconditional support, which has been ongoing and sustaining. I wish to express
my deep gratitude to my wife, Janet, for her constant and unconditional support and
encouragement all the way through graduate school.
TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1
INTRODUCTION
1.1 Correlated Binary Outcomes in Education Research
1.2 Linear Probability Models
1.3 Models for Binary Response Data
1.4 Models for Correlated Binary Response Data
1.5 Purpose of the Thesis
1.6 Objectives of the Thesis

CHAPTER 2
DATA GENERATING MECHANISM
2.1 Introduction
2.2 Data Generating Mechanism
2.3 Other Ways for Generating Correlated Binary Responses

CHAPTER 3
ESTIMATION METHODS FOR GROUPED LOGISTIC MODELS
3.1 Introduction
3.2 Independence Estimating Equations
3.2.1 How to Estimate the Variance of $\hat{\beta}_I$
3.3 Generalized Estimating Equations
3.3.1 How to Compute the Variance of $\hat{\beta}_G$
3.3.2 How to Compute $\hat{\beta}_G$
3.4 The Computer Programs

CHAPTER 4
SIMULATION RESULTS
4.1 Introduction
4.2 Simulation Design
4.3 Results of the Estimation Procedures

CHAPTER 5
SUMMARY
5.1 Overview
5.2 Parameter Estimation
5.2.1 Bias
5.2.2 Efficiency
5.2.3 Accuracy in Hypothesis Testing
5.3 Further Research

APPENDIX
SAS/IML Macro Computer Programs

BIBLIOGRAPHY
LIST OF TABLES

4.1 Averages of the mean parameter and standard errors of the estimate of $\beta_1$ results for the logistic marginal model, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$, fit by IEE. The binary within cluster covariate $x_{ij}$ equal to 1 or 0 with equal probabilities. Population average correlation $\delta = \rho^2 = 0.81$.

4.2 Averages of the mean parameter and standard errors of the estimate of $\beta_1$ results for the logistic marginal model, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$, fit by IEE and GEE. The binary within cluster covariate $x_{ij}$ equal to 1 or 0 with equal probabilities. Population average correlation $\delta = \rho^2 = 0.81$.

4.3 Averages of the mean parameter and standard errors of the estimate of $\beta_1$ results for the logistic marginal model, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$, fit by IEE. The binary within cluster covariate $x_{ij}$ equal to 1 or 0 with equal probabilities. Population average correlation $\delta = \rho^2 = 0.16$.

4.4 Averages of the mean parameter and standard errors of the estimate of $\beta_1$ results for the logistic marginal model, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$, fit by IEE and GEE. The binary within cluster covariate $x_{ij}$ equal to 1 or 0 with equal probabilities. Population average correlation $\delta = \rho^2 = 0.16$.

4.5 Averages of the mean parameter and standard errors of the estimate of $\beta_1$ results for the logistic marginal model, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_j$, fit by IEE. The binary cluster covariate, $x_j$, equal to 1 or 0 with equal probabilities. Population average correlation $\delta = \rho^2 = 0.64$.

4.6 Averages of the mean parameter and standard errors of the estimate of $\beta_1$ results for the logistic marginal model, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_j$, fit by IEE and GEE. The binary cluster covariate $x_j$ equal to 1 or 0 with equal probabilities. Population average correlation $\delta = \rho^2 = 0.64$.

4.7 Averages of the mean parameter and standard errors of the estimate of $\beta_1$ results for the logistic marginal model, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_j$, fit by IEE and GEE. The binary cluster covariate $x_j$ equal to 1 or 0 with equal probabilities. Population average correlation $\delta = \rho^2 = 0.16$.

4.8 Averages of the mean parameter and standard errors of the estimate of $\beta_1$ results for the logistic marginal model, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$, fit by IEE and GEE. The within cluster covariate $x_{ij} \sim N(0, 1)$. Population average correlation $\delta = \rho^2 = 0.81$.
Chapter 1
INTRODUCTION
1.1 Correlated Binary Outcomes in Education Research
Much of what goes on in education occurs within a group context. Schooling
activities occur within hierarchical organizations in which the sources of educational
influence on students occur in the group to which an individual belongs (Burstein, 1980).
Moreover, in many areas of educational research, one encounters individual responses in
the form of qualitative information. For example, student i, belonging to school j ,
either passes or fails a test, has finished high school or not, repeats a grade or not, or can
be grouped according to his/her status of learning as mastery against non-mastery of
certain educational material. All of this information can be captured by defining a binary
(zero-one) variable that takes on a value of one if the individual or student has the
particular characteristic, and zero otherwise. These data can be viewed as hierarchical or
nested structures in which students are nested within schools. For example, suppose that the researcher's interest is in whether preschool experience makes any difference in the students' probability of finishing high school. Traditionally, it is assumed that after
controlling for the observable school level variables the students’ responses are
independent. However, in many cases because of the nested structure of the data, even
after controlling for the observed school characteristics, it is natural to expect that the
presence of unique unobserved school effects will induce a correlation across responses
of individual students attending the same school. This correlation must be accounted for
to obtain a correct statistical analysis (Albert & McShane, 1995; Fitzmaurice, 1995;
Liang & Zeger, 1986; Raudenbush, 1988).
1.2 Linear Probability Model
The linear probability model (LPM) has been used to represent a regression model in which the dependent variable is a binary variable, taking the value of 1 if the event occurs and 0 otherwise. More precisely, let $y_{ij}$ denote the binary response for student $i$ in school $j$. Then the LPM assumes that the expected value of the binary response variable is a linear function of a set of covariates; that is, it assumes
$$\pi_{ij} = E(y_{ij}|x_{ij}) = \mathrm{pr}(y_{ij} = 1|x_{ij}) = x_{ij}'\beta, \qquad (1.1)$$
where $x_{ij}$ is a $p \times 1$ vector of covariates and $\beta$ is a $p \times 1$ vector of unknown parameters.

Under the LPM formulation, the most popular way of estimating $\beta$ is ordinary least squares (OLS). Given the OLS estimate $\hat{\beta}$, the predicted value of $y_{ij}$ is $\hat{y}_{ij} = x_{ij}'\hat{\beta}$. This is the estimated probability that the event will occur given a particular value of $x_{ij}$. The LPM has the advantages that it is easy to interpret and it is easy to add group effects. However, in practice the LPM suffers from two major difficulties. For one, the estimated probabilities can lie outside the admissible range (0, 1), which is problematic for predicting probabilities. A related point is that the marginal effects are constant, something that cannot strictly be true for all values of $x_{ij}$. Therefore, while the LPM may provide reasonable estimates of the true marginal effects for some values of $x_{ij}$ near the average, it might be a poor approximation for extreme values of $x_{ij}$.
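To make the two difficulties concrete, the following is a minimal numerical sketch (not from the thesis) of fitting an LPM by OLS to data whose true response probabilities are logistic; the sample size, coefficient values, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a binary outcome with one continuous covariate (illustrative values).
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(1.0 + 2.0 * x)))   # true logistic probabilities
y = rng.binomial(1, p_true)

# Fit the LPM by OLS: regress y on [1, x].
X = np.column_stack([np.ones(n), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predicted "probabilities" are linear in x and can escape the (0, 1) range.
y_hat = X @ beta_ols
print("share of fitted values outside (0, 1):", np.mean((y_hat < 0) | (y_hat > 1)))
```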
1.3 Models For Binary Response Data
Given the limitations of the linear probability model in the handling of the
dependence of the probability of response on certain covariates, several nonlinear models
have been proposed, each of which ensures that the fitted probabilities will lie between
zero and one. The logistic functional form is probably the most commonly used. Thus,
the model of choice for analyzing binary response data with covariates is the logistic
regression model. Under the logistic model formulation, one assumes that there are no group effects. Then the probability of "success" on the dependent variable can be written as
$$\pi_{ij} = \mathrm{pr}(y_{ij} = 1|x_{ij}) = \frac{1}{1 + \exp(-x_{ij}'\beta)}, \qquad (1.2)$$
where $x_{ij}$ is a $p \times 1$ vector of covariates, and $\beta$ is a $p \times 1$ vector of constant parameters. Likewise, the probit transformation, $\Phi(x_{ij}'\beta)$, of the "success" probability has been proposed as an alternative to the logistic transformation. However, the logistic functional form is the most widely used in different fields of application. This is because the logistic transformation is more convenient from the computational viewpoint, has a direct interpretation in terms of the logarithm of the odds in favor of a success, and, for models based on the logistic transformation, allows differences on the logit scale to be estimated regardless of whether the data are sampled prospectively or retrospectively (see Agresti, 1990; Bishop, Fienberg, & Holland, 1975; Cox, 1970; McCullagh & Nelder, 1989 for an extensive coverage of these models).
1.4 Models For Correlated Binary Response Data
The data consist of observations $(y_{ij}, x_{ij})$, where $y_{ij}$ is a binary response and $x_{ij}$ is a $p \times 1$ covariate vector for the $i$th observation, $i = 1, \ldots, n_j$, from the $j$th cluster, $j = 1, \ldots, J$. Traditionally in educational research, to study the effects of covariate(s) on the marginal probabilities for the binary responses, the standard logit maximum likelihood estimation (MLE) approach has been applied under the key assumption that the $y_{ij}$ are independent across students $i$ in school $j$, conditional on the $x_{ij}$. This approach, although appropriate for one-level binary responses, is not appropriate in terms of its efficiency for fitting nested structured data, since the presence of unique, unobserved contextual factors in groups (school effects) is expected to induce a correlation across responses of individual students attending the same school. While the estimates of the regression parameters of the marginal model are consistent, their variances will be wrongly estimated (Zeger & Liang, 1986). Moreover, with non-linear models for binary response data, such as logistic regression, different assumptions about the source of correlation can lead to regression parameter estimates with distinct interpretations (Diggle, Liang, & Zeger, 1994; Zeger, Liang, & Albert, 1988).
Most studies dealing with binary response variables for nested structure data use random effects logistic models (Stiratelli, Laird, & Ware, 1984). Typically, a random effect, $b_j$, is included in the same logit scale as the constant regression parameters to account for the correlation that may exist because of the nested structure of the data. Henceforth this procedure will be called the traditional random effects logistic models (TRELM) formulation. It is assumed that the $y_{ij}$, conditional on the vector of $x_{ij}$'s, $X_j$, and $b_j$, are independent Bernoulli random variables with response probabilities
$$\mu_{ij} = E(y_{ij}|X_j, b_j) = \mathrm{pr}(y_{ij} = 1|X_j, b_j) = \mathrm{pr}(y_{ij} = 1|X_j, z_{ij}, b_j) = \frac{1}{1 + \exp[-(x_{ij}'\beta + z_{ij}'b_j)]}.$$
Let $z_{ij}$ be a vector of covariates associated with the random effect, $b_j$, and $\mathrm{Var}(y_{ij}|b_j, X_j) = \mu_{ij}(1 - \mu_{ij})$. It is assumed that $b_j$ is independent of $X_j$ and $b_j \sim N(0, D)$, where $D$ is an unknown covariance matrix to be estimated.
Notice that $\partial\,\mathrm{pr}(y_{ij} = 1|X_j, b_j)/\partial x_{ij}$ depends on $b_j$. Thus, whereas the above models are helpful in providing a cluster specific interpretation of the regression coefficients and in producing individualized cluster predictions, they fail to ask the right questions from a policy perspective. An important question to ask is how a particular school program or policy that involves changing one of the variables in $X_j$ will affect the average population response. For example: What is the effect of a policy variable, pre-school experience, which takes on a value of 1 if the student attended pre-school and 0 otherwise, on the probability of dropping out of school for the entire population? Here, the interest is in quantifying uncertainty about the future observable $\mathrm{pr}(y_{ij} = 1|X_j)$ on the basis of known information. Thus, under the TRELM formulation the $\mathrm{pr}(y_{ij} = 1|X_j)$ can be obtained by integrating the distribution of $(y_{ij}|b_j, X_j)$ with respect to the distribution of the random effects, $b_j$ (i.e., the school effects). That is,
$$\mathrm{pr}(y_{ij} = 1|X_j) = E[E(y_{ij}|X_j, b_j)|X_j] = \int \mathrm{logit}^{-1}(x_{ij}'\beta + b_j)\, f(b_j)\, db_j.$$
For the integral above, exact solutions are not feasible. However, given estimates of the regression parameter $\beta$ and the density of the random effects, $b_j$, in principle, one can use
the above integral to obtain predicted probabilities and partial effects. For example, Im
and Gianola (1988) and Anderson and Aitkin (1985) applied numerical integration
(quadrature) methods combined with the EM algorithm as described by Dempster, Laird,
and Rubin (1977) to approximate the above integral; Stiratelli, et al. (1984) and Wong
and Mason (1985) have estimated the joint posterior means by approximating the
likelihood function based on the joint posterior modes of the random parameters given
approximated ML estimators of the covariance parameters. This approach is
implemented by means of the EM algorithm.
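As a rough illustration of the quadrature idea referenced above, the sketch below approximates the TRELM marginal probability for a random-intercept model with $b_j \sim N(0, \sigma_b^2)$ by Gauss-Hermite quadrature. It is not the thesis's code; the function name, number of quadrature points, and the example values of the linear predictor and $\sigma_b$ are assumptions.

```python
import numpy as np

def trelm_marginal_prob(x_beta, sigma_b, n_points=40):
    """Approximate pr(y_ij = 1 | X_j) = E_b[ logit^{-1}(x'beta + b) ] with b ~ N(0, sigma_b^2),
    using Gauss-Hermite quadrature (random-intercept case only)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables: b = sqrt(2) * sigma_b * node; the normal density contributes 1/sqrt(pi).
    b = np.sqrt(2.0) * sigma_b * nodes
    probs = 1.0 / (1.0 + np.exp(-(x_beta + b)))
    return np.sum(weights * probs) / np.sqrt(np.pi)

# Example: linear predictor x'beta = 1.0, random-intercept SD 1.5 (illustrative values).
print(trelm_marginal_prob(1.0, 1.5))
```

Note that the result is not $\mathrm{logit}^{-1}(x'\beta)$; the marginal probability under the TRELM is attenuated toward 0.5, which is the interpretational point made above.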
Other researchers have focused on approximating quasi-likelihood or log
likelihood functions. For example, Schall (1991) proposed a general algorithm for the
estimation of the regression parameters, random effects, and components of dispersion in
generalized linear models (McCullagh & Nelder, 1989) with random effects; Breslow and
Clayton (1993) considered the penalized quasi-likelihood (PQL) method of inference for
data generated under the traditional multilevel logit formulation; Goldstein (1991)
proposed a first order Taylor series approximation to analyze data under the traditional
multilevel logit formulation; Longford (1991) focused on the logistic regression for
correlated binary outcomes generated under the traditional multilevel logit formulation
and proposed a computational algorithm based on an approximation to the log likelihood
function.
However, several problems have been reported or associated with these
approaches. First, the distributional assumptions on the random effects are somewhat
arbitrary and untestable for small cluster size. Second, some difficulties in the
computational aspects of model fitting have been mentioned quite often. For example,
Zeger and Karim (1991) pointed out: 1) investigators have largely restricted their
attention to random intercept models to avoid higher order numerical integration; 2)
problems exist in the interpretation of the regression coefficient estimates; 3) inferences
about the regression coefficients have often been made conditional upon the estimated
random effects variance, again to avoid difficult integration. Zeger and Karim (1991)
used a Monte Carlo method, the Gibbs sampler, for estimating parameters in the
generalized linear model with Gaussian random effects. This methodology promises to
overcome numerical problems in solving higher dimensional integration as a result of
fitting generalized linear models in which the random effects are incorporated in the same
logit scale as the fixed regression coefficients.
Because of the computational and interpretational problems faced when estimating
marginal probabilities under TRELM formulation, researchers have been interested in the
cost of ignoring the correlation among components of the vector of binary responses, both
in terms of biased estimates and biased assessments of precision. Recently, Rodriguez
and Goldman (1995), using a Monte Carlo study design, generated data under the
TRELM formulation to evaluate two software packages designed for fitting multilevel
models to binary response data, namely VARCL (Longford, 1987) and ML3 (Goldstein,
1991).
Rodriguez and Goldman (1995) found that the multilevel estimates of the constant
parameters reveal substantial downward bias. The bias was moderate when the random
effects were small, but fairly substantial when at least one of the random effects was
large. The authors also studied the properties of OLS when applied to data generated
under the TRELM and compared results with those obtained by using two software
packages designed for fitting multilevel models to binary response data. The authors
reported that the estimates of the constant parameters were virtually the same. Since the
two approaches utilized in estimating the mean parameters were estimating different
quantities, it was expected that the mean parameters would also be different. Thus, it is
interesting that Rodriguez and Goldman found that the estimates of the regression
coefficients obtained, when applying different methods of estimation to the misspecified
model (i.e., ordinary logistic model) and to the correctly specified model (i.e., multilevel
logit model), were similar.
The Rodriguez and Goldman (1995) findings could mean that the numerical
approximations in which the two methods of estimation proposed by Goldstein (1991)
and Longford (1991), and evaluated by Rodriguez and Goldman, especially designed to
deal with multilevel logit models, need to be revised. They also reported that the
estimates of the variance components were very disturbing, as they showed a pronounced
downward bias. The standard errors were, on average, quite accurate; however, their
utility came into question since the order of bias was 2-3 and 4 standard errors for the
constant and random effects, respectively. Given the results of the above simulation
studies, Rodriguez and Goldman highlight the need for alternative estimation procedures
to handle multilevel models with binary response variables. The question is raised
“whether random effects in binary response models can ever be estimated at acceptable
levels of bias and precision when the average size of clusters is modest” (Rodriguez &
Goldman, 1995, p. 87).
Alternatively, Diggle et al. (1994) describe a marginal model approach that can be
utilized to model correlated binary responses. This marginal model has the advantage
that the marginal probabilities can be modeled as a function of covariates without
explicitly accounting for unobserved heterogeneity across clusters. Under this marginal
modeling approach for correlated binary response data, one models the marginal
probabilities given only the observable covariates, $X_j$, rather than the conditional probabilities given both the unobserved random effects, $b_j$, and the observables. Liang
and Zeger (1986), in the context of longitudinal data analysis, proposed the generalized
estimating equations (GEE) approach as a method for fitting logistic marginal models for
correlated binary responses. Most importantly, the marginal model approach via the GEE
method of estimation produces consistent estimates of the regression parameters and of
their variance, depending only on the correct specification of the marginal probabilities
for the binary outcomes, not on the correct choice of the correlation structure of the data
(Liang & Zeger, 1986). Furthermore, no further distribution assumptions are required,
other than the observations across units or clusters are independent.
This is to be contrasted with the random effects model for correlated binary responses under the TRELM formulation, where both the marginal probabilities and the random effects distribution must be correctly specified for consistent inferences (Zeger et al., 1988). That is, the marginal probabilities, $\mathrm{pr}(y_{ij} = 1|X_j)$, under the TRELM formulation depend on $X_j$ and the fixed parameter, $D$, of the distribution of the random effects, $b_j$. In addition, because an exact closed form expression for the marginal probability is unavailable, in order to estimate the average population change in probability for a unit change in a covariate, Zeger et al. (1988) proposed an algorithm to calculate the marginal moments, $\mu_j$ and $V_j$, from the conditional moments and the distribution of the random effects, $b_j$. Having evaluated $\mu_j$ and $V_j$ for each cluster, one iterates solving the following equations (for details, see Zeger et al., 1988, page 1054):
$$U(\beta) = \sum_{j=1}^{J} D_j' V_j^{-1}(y_j - \mu_j) = 0 \quad \text{and} \quad \hat{D} = (Z_j'Z_j)^{-1} Z_j' L_j^{-1}(V_j - A_j) L_j^{-1} Z_j (Z_j'Z_j)^{-1},$$
where $D_j = \partial\mu_j/\partial\beta$,
$L_j = \mathrm{diag}\{\partial h^{-1}(\mu)/\partial\mu,\ \mu = x_{ij}'\beta\}$, $h$ is the logit or probit function,
$Z_j$ is a vector of covariates,
$V_j = A_j^{1/2} R_j(\delta) A_j^{1/2}$,
$A_j = \mathrm{diag}\{\mathrm{Var}(y_{ij})\}$,
$R_j(\delta)$ is a "working" correlation matrix,
$\mu_j = E[y_j|X_j]$, $\mu_j = (\mu_{1j}, \ldots, \mu_{n_j j})'$, and $y_j = (y_{1j}, \ldots, y_{n_j j})'$.
The convergence of this algorithm depends on the accuracy of the linear approximations
proposed (Zeger et al. 1988). In addition, because the approximations involve the
nonlinear logit link function, the authors recommend examining the sensitivity of $\hat{\beta}$ to changes in $D$ in the neighborhood of $\hat{D}$. Clearly under the TRELM formulation, the
probability of success given only the observables does not have the simple logistic form.
That is, the logistic specification for the marginal probabilities is not appropriate. This
shows that the interpretation of the marginal probabilities under the TRELM formulation
is complicated.
1.5 Purpose of the Thesis
In this study I am interested in comparing independence estimating equations
(IEE) and generalized estimating equations (GEE) when used to estimate logistic
marginal models for correlated binary responses, in which the marginal probabilities of
the binary responses $y_{ij}$ conditional on $X_j$ have the simple logistic form. Most
importantly, the IEE and the GEE methods of estimation have been used for fitting
logistic marginal models for correlated binary responses. However, whereas the
asymptotic properties of the IEE and GEE estimates are well understood, their finite
sample properties are not well known. For this reason, I am interested in studying the small
sample properties of both methods of estimation.
1.6 Objectives of the Thesis
This thesis studies the properties of two different estimation approaches for
estimating the fixed parameter $\beta$ in the marginal logistic model as specified in (1.2), but the binary outcomes conditional on the $X_j$ are correlated. The first method is the
standard pooled logit, which, in the terminology of the estimating equations literature, is
called independence estimating equations (IEE). Under the IEE approach, the within
cluster correlation of the binary responses is ignored when estimating the regression
parameter, $\beta$. That is, the identity matrix is specified as the "working" correlation matrix, and the maximum likelihood method of estimation (MLE) is applied to the pooled data. MLE yields estimates of the regression parameters, $\beta$, that are consistent and asymptotically normal. However, the estimated variance matrix of the estimates can be inconsistent (Fitzmaurice, 1995; Liang & Zeger, 1986).
In addition to a “naive” variance estimate of the estimated parameters, I compute
a “robust” variance estimate proposed by Huber (1967) and Liang and Zeger (1986). I
compare both the “robust” and naive standard error of the estimates of the regression
coefficients with the average Monte Carlo standard error. It is suspected that as the correlation within clusters increases, these standard errors will move farther apart. This comparison allows us, at least, to see how much effect the variance correction has on the rejection probabilities and coverage probabilities when testing simple hypotheses about the regression parameters. Here, the main question is: How good are the standard errors when corrected for the within-group correlation?
The second estimating procedure is the generalized estimating equations form. In
particular, to estimate $\beta$ under the GEE, the conditional correlation matrix is assumed
constant, even though it may not be. Hence, this “constant working correlation matrix”
does not correspond to the true correlation of the binary responses, so I compute a
“robust” covariance matrix of the mean parameter estimate. This allows us to compare
IEE and GEE when each uses the inappropriate second moment matrix. The main
question here: Is the extra complexity of the GEE worthwhile in terms of efficiency?
Since neither the IEE nor the GEE are efficient, the next question is: Is it better to
use the GEE with the misspecified covariance structure than to use the IEE? This
question makes sense since the GEE approach is harder or more complex to apply. If it is
found that using the GEE with an incorrect covariance structure for a reasonable form of
correlation does not perform better than when using IEE, it may be concluded that IEE is
more adequate than GEE under that circumstance. When their sample variances are
compared, it is with the hope that, because the GEE approach uses some correlation structure, it will capture the qualitative structure of $\mathrm{Cov}(y_{1j}, \ldots, y_{n_j j}\,|\,X_j)$, so in this case it will be better than using the IEE method.
To study the finite sample performance of the IEE and the GEE methods of
estimation, I need a way to generate correlated binary outcomes that nevertheless follow
the logistic model. Thus, I develop a mechanism to generate correlated binary responses
according to the logistic marginal model. One unique feature of this marginal model
formulation is that the marginal probabilities, conditional only on the observables, have
the simple logistic form, yet the vector of binary responses are correlated. This logistic
marginal model formulation allows us to determine the effects of within cluster and
cluster constant level covariates on the marginal probabilities without assumptions about
the unobserved heterogeneity across clusters.
Chapter 2
DATA GENERATING MECHANISM
2.1 Introduction

As mentioned in Chapter 1, analyzing the properties of estimators of $\beta$ in Equation 1.2 requires a data generating mechanism that produces correlated responses. In this chapter, I propose a mechanism with the following features:

1. $\mu_{ij} = \mathrm{pr}(y_{ij} = 1|X_j) = \mathrm{pr}(y_{ij} = 1|x_{ij}) = \dfrac{1}{1 + \exp(-x_{ij}'\beta)}$, where $X_j = (x_{1j}, \ldots, x_{n_j j})'$;

2. $\mathrm{Corr}(y_{ij}, y_{hj}|X_j) \neq 0$; furthermore, the correlation between $y_{ij}$ and $y_{hj}$ is a function of the marginal means and an additional parameter, $\rho$;

3. The observations across groups or clusters (e.g., schools) are independent.

These features motivate the GEE approach to logistic models for clustered data analyzed by Diggle et al. (1994).
The mechanism I propose can be conceptualized as a mixture model
mechanism (MMM) to generate correlated binary outcomes with a marginal
logistic distribution. This procedure is dependent on two independent
mechanisms, like flipping a biased coin with probability $\rho$ of coming up heads.
If the result is heads, you pick from the group effect mechanism, otherwise you
pick from the student effect mechanism. This procedure captures both student
and school effects, and, as I show in the next section, produces logistic response
probabilities. As discussed in Chapter 1, the logistic marginal model approach
has the advantage over the TRELM formulation in that the marginal probabilities
do have a simple logistic form. This is the sense in which the logistic model
formulation for correlated binary responses correctly specifies the marginal means.
2.2 Data Generating Mechanism
In this section I formalize the mixture mechanism for generating correlated binary responses $y_{ij}$, $i = 1, \ldots, n_j$, $j = 1, \ldots, J$, with covariates, $X_j$, that vary or are constant within cluster, such that Equation 1.2 holds. Generate $y_{ij}$ as
$$y_{ij} = c_{ij} v_{ij} + (1 - c_{ij}) w_{ij}, \qquad i = 1, \ldots, n_j,\ j = 1, \ldots, J, \qquad (2.1)$$
where $j$ denotes the group and $i$ the unit within a group.

The assumptions are:

1. The $c_{ij}$'s are independent Bernoulli random variables, for $i = 1, \ldots, n_j$, $j = 1, \ldots, J$, with probability mass function given by $\mathrm{pr}(c_{ij} = 1) = \rho$ and $\mathrm{pr}(c_{ij} = 0) = 1 - \rho$, where $0 < \rho < 1$.

2. $v_{ij} = 1$ if $\phi_j > -x_{ij}'\beta$ and $v_{ij} = 0$ otherwise. The $\phi_j$'s are independent and identically distributed having the standard logistic distribution, $F(u) = \dfrac{1}{1 + \exp(-u)}$. Therefore $\mathrm{pr}(v_{ij} = 1) = \mathrm{pr}(\phi_j > -x_{ij}'\beta) = \mathrm{pr}(\phi_j < x_{ij}'\beta) = F(x_{ij}'\beta)$. The $\phi_j$'s represent the group (school) effects.

3. $w_{ij} = 1$ if $u_{ij} > -x_{ij}'\beta$ and $w_{ij} = 0$ otherwise. The $u_{ij}$'s are independent and identically distributed having the standard logistic distribution, $F(u) = \dfrac{1}{1 + \exp(-u)}$. Therefore $\mathrm{pr}(w_{ij} = 1) = \mathrm{pr}(u_{ij} > -x_{ij}'\beta) = \mathrm{pr}(u_{ij} < x_{ij}'\beta) = F(x_{ij}'\beta)$. They represent the within group effects.

4. The random variables $c_{ij}$, $\phi_j$, $u_{ij}$ are mutually independent, so the $c_{ij}$ are independent of $(v_{ij}, w_{ij})$ for $i = 1, \ldots, n_j$, $j = 1, \ldots, J$.

5. The $c_{ij}$, $\phi_j$, $u_{ij}$ are all independent of $X_j$.
Some consequences of assumptions (1), (2), (3), (4) and (5) are:

1. Conditional on $X_j$, $w_{1j}, \ldots, w_{n_j j}$ are independent binary random variables;

2. Conditional on $X_j$, $\{v_{ij}, w_{hj}:\ h, i = 1, \ldots, n_j\}$ are mutually independent binary random variables for all $j = 1, \ldots, J$; and

3. Conditional on $X_j$, $\{v_{ij}, v_{hj}:\ i \neq h;\ i, h = 1, \ldots, n_j\}$ are related binary random variables through the presence of the common school effect $\phi_j$, for all $j = 1, \ldots, J$.

We first show that under the above mechanism the marginal mean has the usual logistic form given in Equation 1.2. To see this, notice that:
$$\begin{aligned}
E(y_{ij}|x_{ij}) &= E\{[c_{ij} v_{ij} + (1 - c_{ij}) w_{ij}]\,|\,x_{ij}\} \\
&= E[c_{ij} v_{ij}|x_{ij}] + E[(1 - c_{ij}) w_{ij}|x_{ij}] \\
&= E(c_{ij}) E(v_{ij}|x_{ij}) + E(1 - c_{ij}) E(w_{ij}|x_{ij}) \qquad \text{(assumption 4)} \\
&= \rho\, E(v_{ij}|x_{ij}) + (1 - \rho)\, E(w_{ij}|x_{ij}) \qquad \text{(assumption 1)} \\
&= \rho\, \frac{1}{1 + \exp(-x_{ij}'\beta)} + (1 - \rho)\, \frac{1}{1 + \exp(-x_{ij}'\beta)} \qquad \text{(assumptions 2 and 3)} \\
&= \frac{1}{1 + \exp(-x_{ij}'\beta)} \\
&= F(x_{ij}'\beta).
\end{aligned}$$
Thus, $\mathrm{pr}(y_{ij} = 1|x_{ij})$ has the usual logistic form.
This mechanism will only be useful if it generates correlated binary
outcomes. In what follows I show how to compute the covariance and correlation
between any two pairs of observations across i within school j generated by this
mechanism.
By definition, the conditional covariance is
$$\mathrm{Cov}(y_{ij}, y_{hj}|X_j) = E(y_{ij} y_{hj}|X_j) - E(y_{ij}|X_j)\, E(y_{hj}|X_j).$$
Thus, we obtain the following expression for $\mathrm{Cov}(y_{ij}, y_{hj}|X_j)$. The conditioning on the $X_j$ is omitted for simplicity in the derivations.

Write
$$\begin{aligned}
y_{ij} y_{hj} &= [c_{ij} v_{ij} + (1 - c_{ij}) w_{ij}][c_{hj} v_{hj} + (1 - c_{hj}) w_{hj}] \\
&= c_{ij} c_{hj} v_{ij} v_{hj} + c_{ij} v_{ij}(1 - c_{hj}) w_{hj} + (1 - c_{ij}) c_{hj} w_{ij} v_{hj} + (1 - c_{ij})(1 - c_{hj}) w_{ij} w_{hj}.
\end{aligned}$$
Taking expectations gives
$$E(y_{ij} y_{hj}) = \rho^2 E[v_{ij} v_{hj}] + (1 - \rho)\rho\, E(v_{ij}) E(w_{hj}) + (1 - \rho)\rho\, E(v_{hj}) E(w_{ij}) + (1 - \rho)^2 E(w_{ij}) E(w_{hj}).$$
But
$$E(v_{ij} v_{hj}) = \mathrm{pr}(v_{ij} = 1, v_{hj} = 1) = \mathrm{pr}\big(\phi_j > \max(-x_{ij}'\beta, -x_{hj}'\beta)\big) = \min[F(x_{ij}'\beta), F(x_{hj}'\beta)] = \min(\mu_{ij}, \mu_{hj}),$$
where $\mu_{ij} = E(y_{ij}) = F(x_{ij}'\beta)$.

Collecting terms gives
$$E(y_{ij} y_{hj}) = \rho^2 \min(\mu_{ij}, \mu_{hj}) - \rho^2 \mu_{ij}\mu_{hj} + \mu_{ij}\mu_{hj}.$$
Thus, the expression for the $\mathrm{Cov}(y_{ij}, y_{hj}|X_j)$ can be written as
$$\mathrm{Cov}(y_{ij}, y_{hj}|X_j) = \rho^2[\min(\mu_{ij}, \mu_{hj}) - \mu_{ij}\mu_{hj}],$$
and so
$$\mathrm{Corr}(y_{ij}, y_{hj}|X_j) = \frac{\rho^2[\min(\mu_{ij}, \mu_{hj}) - \mu_{ij}\mu_{hj}]}{\sqrt{\mu_{ij}(1 - \mu_{ij})\,\mu_{hj}(1 - \mu_{hj})}}.$$

In general, the $\mathrm{Cov}(y_{ij}, y_{hj}|X_j)$ depends on the $X_j$ through the marginal means. However, if $\mu_{ij} = \mu_{hj} = \mu_0$ for all $i, h$, the correlation coefficient is constant and equal to $\rho^2$:
$$\mathrm{Corr}(y_{ij}, y_{hj}|X_j) = \frac{\rho^2(\mu_0 - \mu_0^2)}{\mu_0(1 - \mu_0)} = \rho^2.$$
The above result implies that for cluster specific covariates, the population pairwise correlation is an increasing function of $\rho$. This means that for a large $\rho$, we have a large population pairwise correlation. Furthermore, for within cluster covariates, we have that the population pairwise correlation is also an increasing function of the mass, $\rho$, the probability that each $c_{ij}$ is unity; however, the extent of the population pairwise correlation is attenuated somewhat downward by the difference between the marginal means. In particular, if $\mu_{ij} < \mu_{hj}$ we have
$$\mathrm{Corr}(y_{ij}, y_{hj}|X_j) = \rho^2 \sqrt{\frac{\mu_{ij}(1 - \mu_{hj})}{\mu_{hj}(1 - \mu_{ij})}}.$$
Notice that under this mechanism, the marginal correlations are functions of the
marginal probabilities.
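The following is a minimal sketch (not the thesis's SAS/IML program) of how the mixture mechanism in Equation 2.1 could be simulated; the function and variable names, the seed, and the example design (binary within cluster covariate, $n = 5$, $J = 200$, $\rho = 0.9$) are illustrative assumptions.

```python
import numpy as np

def generate_cluster(x, beta, rho, rng):
    """Simulate one cluster of correlated binary responses from the mixture DGM:
    y_ij = c_ij * v_ij + (1 - c_ij) * w_ij, with a shared logistic draw phi_j for the v's."""
    eta = x @ beta                                    # linear predictors x_ij' beta
    c = rng.binomial(1, rho, size=len(eta))           # mixing indicators, pr(c_ij = 1) = rho
    phi = rng.logistic()                              # common (school) effect, standard logistic
    u = rng.logistic(size=len(eta))                   # unit-specific effects, standard logistic
    v = (phi > -eta).astype(int)                      # v_ij = 1 if phi_j > -x_ij' beta
    w = (u > -eta).astype(int)                        # w_ij = 1 if u_ij > -x_ij' beta
    return c * v + (1 - c) * w

rng = np.random.default_rng(1)
beta = np.array([1.0, 0.4])                           # (beta_0, beta_1), as in Model 1
n, J, rho = 5, 200, 0.9                               # cluster size, number of clusters, mixing probability

data = []
for j in range(J):
    xj = np.column_stack([np.ones(n), rng.binomial(1, 0.5, size=n)])  # binary within-cluster covariate
    data.append(generate_cluster(xj, beta, rho, rng))
y = np.array(data)
print("overall mean response:", y.mean())             # should be near the average of F(x'beta)
```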
2.3 Other Ways for Generating Correlated Binary Responses
In Chapter 1, I described in detail the TRELM formulation as another way
to generate correlated binary outcomes, but it does not yield marginal logistic
models. In this section I discuss some other ways to generate correlated binary
outcomes.
The Bahadur (1961) parametric description of the joint distribution of clustered binary responses has been used to generate a binary vector with specified marginal means, $\mu_{ij}$, and marginal correlations, $\delta_{ihj}$. Bahadur's representation of the joint distribution of the $j$th cluster of binary responses can be written as
$$f(y_j) = \prod_{i=1}^{n_j} \mu_{ij}^{y_{ij}} (1 - \mu_{ij})^{1 - y_{ij}} \Big(1 + \sum_{i < h} \delta_{ihj}\, e_{ij} e_{hj} + \sum_{i < h < l} \delta_{ihlj}\, e_{ij} e_{hj} e_{lj} + \cdots + \delta_{12 \cdots n_j j}\, e_{1j} e_{2j} \cdots e_{n_j j}\Big),$$
where $e_{ij} = \dfrac{y_{ij} - \mu_{ij}}{\sqrt{\mu_{ij}(1 - \mu_{ij})}}$ and $\delta_{ihj} = E(e_{ij} e_{hj}), \ldots, \delta_{12 \cdots n_j j} = E(e_{1j} e_{2j} \cdots e_{n_j j})$. The $\delta_{ihj}$'s are pairwise correlations, the $\delta_{ihlj}$'s third-order correlations, and so on, up to $\delta_{12 \cdots n_j j}$, the $n_j$th-order correlation.
A drawback of the above representation is that it becomes computationally difficult as the cluster sizes increase. As a result, the higher order correlations are set to zero in practice. However, when the higher order correlations are taken to be zero, the range of the marginal correlations must satisfy certain linear inequalities determined by the marginal probabilities. That is, the marginal correlations are functions of the marginal probabilities (for details see Fitzmaurice, Laird and Rotnitzky, 1993).
Emrich and Piedmonte (1991) proposed an algorithm for generating $J$ binary outcomes, $y_1, \ldots, y_J$, with given marginal probabilities and pairwise correlations. The algorithm involves three basic steps: 1) solves the cumulative distribution for a standard bivariate normal variable with correlation coefficient $\rho_{jk}$,
$$\Phi[z(p_j), z(p_k);\ \rho_{jk}] = \delta_{jk}(p_j q_j p_k q_k)^{1/2} + p_j p_k,$$
for $\rho_{jk}$ $(j = 1, \ldots, J;\ k = 2, \ldots, J)$, where $z(p)$ represents the $p$th quantile of the standard normal distribution; 2) generates a $J$-dimensional multivariate normal variable, $Z = (Z_1, \ldots, Z_J)$, with means zero and correlation matrix $\Sigma = (\rho_{jk})$; and 3) for $j = 1, \ldots, J$, the desired vector of binary variables is obtained by setting $y_j = 1$ if $Z_j \le z(p_j)$ and $y_j = 0$ otherwise. Lee (1993) presented two methods
that allow for the generation of clustered binary data with covariates that differ
within clusters and determine the marginal probabilities via a logistic or probit
link function. However, an important drawback of these two approaches is that
one must solve a large number of non-linear equations that require numerical
integration.
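For concreteness, a sketch of the Emrich and Piedmonte (1991) construction for a single pair of binary variables is given below, assuming SciPy is available for the bivariate normal distribution function and the root finding; the marginal probabilities, target correlation, and function names are illustrative, and feasibility of the target correlation is assumed.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def ep_normal_corr(p1, p2, delta):
    """Emrich-Piedmonte step 1 for one pair: find the correlation rho of a bivariate
    normal so that dichotomizing at z(p1), z(p2) yields binary correlation delta."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    target = delta * np.sqrt(p1 * q1 * p2 * q2) + p1 * p2
    z1, z2 = norm.ppf(p1), norm.ppf(p2)

    def f(rho):
        cov = [[1.0, rho], [rho, 1.0]]
        return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([z1, z2]) - target

    return brentq(f, -0.99, 0.99)

# Steps 2-3 for the pair: draw bivariate normals and dichotomize.
p1, p2, delta = 0.7, 0.6, 0.3        # illustrative marginal probabilities and binary correlation
rho = ep_normal_corr(p1, p2, delta)
rng = np.random.default_rng(2)
Z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=20000)
Y = (Z <= norm.ppf([p1, p2])).astype(int)
print(rho, np.corrcoef(Y.T)[0, 1])   # empirical binary correlation should be near delta
```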
After reviewing the various mechanisms for generating correlated binary
responses, I found that even though the mechanism developed in this thesis shares
the drawback that the marginal correlations are functions of the marginal
probabilities, this mechanism is simple and allows for comparison between the
IEE and GEE methods of estimation.
Chapter 3
ESTIMATION METHODS FOR GROUPED LOGISTIC MODELS
3.1 Introduction
In the context of longitudinal data analysis, Liang and Zeger (1986) proposed a set
of estimating equations to estimate mean parameters for a general multivariate response.
These estimating equations are a multivariate extension of quasi-likelihood methods
(Nelder & Wedderburn,1972; Wedderburn, 1974; McCullagh & Nelder, 1989). Liang
and Zeger (1986) identified them as independence estimating equations (IEE) and
generalized estimating equations (GEE). The IEE arises by assuming that clustered data
are independent of one another. On the other hand, GEE generalizes the IEE by
introducing a “working correlation matrix” that accounts explicitly for the correlation for
each pair of responses within a multivariate response vector, $y_j = (y_{1j}, \ldots, y_{n_j j})'$, in
order to increase efficiency. The GEE approach is very appealing. First, researchers
need only correctly specify a known expectation for the outcomes to obtain consistent
and asymptotically normal estimators of the regression coefficients, as well as consistent
estimates of their covariance matrix, even if the correlation structure of the responses is
misspecified. Second, no further distribution assumptions are required, other than that
the observations across clusters are independent. Furthermore, if the correlation matrix is
correctly specified then the GEE is more efficient than IEE (Liang & Zeger, 1986).
3.2 Independence Estimating Equations
The data consist of observations $(y_{ij}, x_{ij})$, where $y_{ij}$ is a binary response and $x_{ij}$ is a covariate vector for the $i$th observation, $i = 1, \ldots, n_j$, from the $j$th cluster, $j = 1, \ldots, J$. Assume that the marginal distribution of $y_{ij}|x_{ij}$ is Bernoulli$(\mu_{ij})$,
$$f(y_{ij}) = \exp\{y_{ij}\eta_{ij} - \log[1 + \exp(\eta_{ij})]\},$$
where the marginal probabilities, defined as in Equation 1.2, are modeled as a function of the covariate $x_{ij}$ by the logit link function, that is, $\eta_{ij} = \mathrm{logit}(\mu_{ij}) = x_{ij}'\beta$. Thus, the first two moments of $y_{ij}|x_{ij}$ are given by
$$E(y_{ij}|x_{ij}) = \mu_{ij} \quad \text{and} \quad \mathrm{Var}(y_{ij}|x_{ij}) = \mu_{ij}(1 - \mu_{ij}).$$
The pairs $y_{ij}$, $y_{hj}$, $i \neq h$, $i, h = 1, \ldots, n_j$, within cluster $j$ are most likely correlated; however, by naively assuming independence of the pairs of binary responses, the IEE or logit QMLE estimator, $\hat{\beta}_I$, is the solution to the independence estimating equations (Liang & Zeger, 1986),
$$U(\beta) = \sum_{j=1}^{J} \frac{\partial \mu_j'}{\partial \beta}\, V_j^{-1}(Y_j - \mu_j) = 0,$$
where $Y_j = (y_{1j}, \ldots, y_{n_j j})'$, $\mu_j = (\mu_{1j}, \ldots, \mu_{n_j j})'$, and, under the identity working correlation, $V_j = \mathrm{diag}[\mu_{1j}(1 - \mu_{1j}), \ldots, \mu_{n_j j}(1 - \mu_{n_j j})]$.
3.2.1 How to Estimate the Variance of $\hat{\beta}_I$

The identity "working correlation matrix" adopted in solving the above IEE most likely does not correspond to the true correlation of the binary responses. The estimates of the regression coefficients are consistent, but the estimate of the covariance matrix under this quasi maximum likelihood approach is usually inconsistent (Liang & Zeger, 1986). To address this inconsistency, Liang and Zeger's (1986) Theorem 1 established that the IEE estimator, $\hat{\beta}_I$, of $\beta$, which is the solution to the IEE, is consistent and $(\hat{\beta}_I - \beta)$ is asymptotically multivariate Gaussian (as $J \to \infty$) with zero mean and covariance matrix given by
$$\Big(\sum_{j=1}^{J} X_j' A_j X_j\Big)^{-1} \Big(\sum_{j=1}^{J} X_j'\, \mathrm{Cov}(y_j)\, X_j\Big) \Big(\sum_{j=1}^{J} X_j' A_j X_j\Big)^{-1},$$
where $A_j = \mathrm{diag}[\mu_{1j}(1 - \mu_{1j}), \ldots, \mu_{n_j j}(1 - \mu_{n_j j})]$ and the $X_j$'s are treated as nonrandom. Liang and Zeger (1986) suggest that the asymptotic variance of $\hat{\beta}_I$ given in Theorem 1 can be consistently estimated by
$$\Big(\sum_{j=1}^{J} X_j' \hat{A}_j X_j\Big)^{-1} \Big(\sum_{j=1}^{J} X_j' \hat{r}_j \hat{r}_j' X_j\Big) \Big(\sum_{j=1}^{J} X_j' \hat{A}_j X_j\Big)^{-1},$$
where $X_j = (x_{1j}, \ldots, x_{n_j j})'$ is an $n_j \times p$ matrix of covariate values, $\hat{A}_j = \mathrm{diag}[\hat{\mu}_{1j}(1 - \hat{\mu}_{1j}), \ldots, \hat{\mu}_{n_j j}(1 - \hat{\mu}_{n_j j})]$, the residual vector $\hat{r}_j = y_j - \hat{\mu}_j$, where $\hat{\mu}_{ij} = \dfrac{\exp(x_{ij}'\hat{\beta})}{1 + \exp(x_{ij}'\hat{\beta})}$, and $\hat{\beta}$ is based on the logit QMLE or IEE.
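A minimal sketch of the robust ("sandwich") variance estimate above is given below, assuming the data for equal-sized clusters are stored as arrays y of shape (J, n) and X of shape (J, n, p) and that the IEE estimate has already been computed; the function and variable names are illustrative, not from the thesis.

```python
import numpy as np

def iee_robust_cov(beta_hat, X, y):
    """Sandwich estimate of Var(beta_hat_I): (sum X'AX)^-1 (sum X'rr'X) (sum X'AX)^-1.
    X has shape (J, n_j, p), y has shape (J, n_j); beta_hat is the IEE (pooled logit) estimate."""
    p = X.shape[-1]
    bread = np.zeros((p, p))
    meat = np.zeros((p, p))
    for Xj, yj in zip(X, y):
        mu = 1.0 / (1.0 + np.exp(-Xj @ beta_hat))      # fitted marginal probabilities
        Aj = np.diag(mu * (1.0 - mu))                  # diagonal of Bernoulli variances
        rj = yj - mu                                   # residual vector for cluster j
        bread += Xj.T @ Aj @ Xj
        meat += Xj.T @ np.outer(rj, rj) @ Xj
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv                # robust covariance of beta_hat_I
```

In this sketch, the model-based (naive) estimate corresponds to the inverse of the "bread" term alone; the robust estimate replaces $\mathrm{Cov}(y_j)$ with the cross-product of the cluster residuals.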
The IEE estimator $\hat{\beta}_I$ of $\beta$ and $\widehat{\mathrm{Var}}(\hat{\beta}_I)$ are consistent, even when the true correlation matrix is not correctly specified, given only that the marginal expectations of the responses, $E(y_{ij}|x_{ij})$, are correctly specified (Zeger & Liang, 1986). Furthermore, the estimator $\hat{\beta}_I$ and the variance of the estimate, $\widehat{\mathrm{Var}}(\hat{\beta}_I)$, are easy to compute with existing software. Some of these statistical software packages are: GLIM (Numerical Algorithms Group, Oxford, UK), MINITAB (Minitab Data Analysis Software, Pennsylvania, USA), SAS (SAS Institute, Raleigh, North Carolina, USA), and STATA (Computer Resource Center, Santa Monica, USA). However, the IEE estimator $\hat{\beta}_I$ may not have high efficiency in cases where the within cluster correlation of the binary responses is large (Liang and Zeger, 1986).
3.3 Generalized Estimating Equations
Under the DGM, as shown in Chapter 2, the first two marginal moments of the binary responses $Y_j|X_j$ correspond to
$$\mu_{ij} = \mathrm{pr}(y_{ij} = 1|X_j) = [1 + \exp(-x_{ij}'\beta)]^{-1},$$
$$\mathrm{Corr}(y_{ij}, y_{hj}|X_j) = \frac{\rho^2[\min(\mu_{ij}, \mu_{hj}) - \mu_{ij}\mu_{hj}]}{\sqrt{\mu_{ij}(1 - \mu_{ij})\,\mu_{hj}(1 - \mu_{hj})}}. \qquad (3.1)$$
The conditional mean parameter, $\beta$, can be estimated by solving the GEE
$$U_G(\beta) = \sum_{j=1}^{J} D_j' V_j^{-1}(Y_j - \mu_j) = 0, \qquad (3.2)$$
where $\mu_j = (\mu_{1j}, \ldots, \mu_{n_j j})'$, $Y_j = (y_{1j}, \ldots, y_{n_j j})'$, $\mu_j = E[Y_j|X_j]$,
$D_j = \partial\mu_j/\partial\beta$,
$V_j(\delta) = A_j^{1/2} R_j(\delta) A_j^{1/2}$,
$A_j = \mathrm{diag}\{\mathrm{Var}(y_{1j}|X_j), \ldots, \mathrm{Var}(y_{n_j j}|X_j)\}$.

Notice in Equation 3.1 that, in general, under this particular DGM the true pairwise correlations of the binary responses depend on the unknown parameter $\rho$ and the covariates, $X_j$, through a minimum function of the marginal probabilities. For efficiency, the correct specification of the $\mathrm{Corr}(y_{ij}, y_{hj}|X_j)$ matters when fitting the GEE to model correlated binary responses. However, the consistency of the GEE estimator and of the standard error of the estimator is preserved even when the correlation structure is misspecified, given only that the regression model for the marginal probability is correctly specified (Zeger & Liang, 1986). Thus, given primarily that the formula in Equation 3.1 is specific to this DGM, and it is not a trivial task to modify the existing statistical software packages in order to solve the GEE when using the optimum correlation matrix, I focus on a much simpler correlation matrix, other than the identity working correlation matrix, that explicitly accounts for the pairwise correlations of the binary responses. This "working correlation matrix" is not the optimum one, but I hope it captures most of the qualitative features of variation; the IEE approach ignores the correlation altogether. This allows us to compare IEE and GEE when each uses an inappropriate or non-optimum second moment matrix. Thus, to estimate the conditional mean parameter, $\beta$, via the GEE, I utilize an $n_j \times n_j$ "constant working correlation matrix" of the form
$$R_j(\delta) = \begin{pmatrix} 1 & \delta & \cdots & \delta \\ \delta & 1 & \cdots & \delta \\ \vdots & \vdots & \ddots & \vdots \\ \delta & \delta & \cdots & 1 \end{pmatrix}. \qquad (3.3)$$
To implement the GEE, we need to estimate $\delta$.
First, if $\delta = \mathrm{Corr}(y_{ij}, y_{hj}|X_j)$, then
$$\delta(X_j) = E\!\left[\frac{\varepsilon_{ij}\,\varepsilon_{hj}}{\sqrt{v_{ij}\,v_{hj}}}\,\Big|\,X_j\right]$$
for all $i \neq h$, where $\varepsilon_{ij} = y_{ij} - E(y_{ij}|X_j)$ and $v_{ij} = \mu_{ij}(1 - \mu_{ij})$. Then, by the law of iterated expectations with respect to $X_j$,
$$\delta = E[\delta(X_j)] = E\!\left[\frac{\varepsilon_{ij}\,\varepsilon_{hj}}{\sqrt{v_{ij}\,v_{hj}}}\right].$$
Thus a consistent estimator of $\delta$ (under standard regularity conditions) is easy to obtain by
$$\hat{\delta} = \frac{1}{\sum_{j=1}^{J} n_j(n_j - 1)/2} \sum_{j=1}^{J} \sum_{i > h} \frac{\hat{\varepsilon}_{ij}\,\hat{\varepsilon}_{hj}}{\sqrt{\hat{v}_{ij}\,\hat{v}_{hj}}}, \qquad (3.4)$$
where $\hat{\varepsilon}_{ij} = y_{ij} - \hat{\mu}_{ij}$ and $\hat{v}_{ij} = \hat{\mu}_{ij}(1 - \hat{\mu}_{ij})$,
$$\hat{\mu}_{ij} = \frac{\exp(x_{ij}'\hat{\beta})}{1 + \exp(x_{ij}'\hat{\beta})},$$
and $\hat{\beta}$ is based on the initial IEE estimation.

This constant "working correlation matrix" does not depend on the $X_j$, and it is independent of the data generating mechanism. It depends on the single parameter $\delta$, which is easy to estimate based on residuals after the initial IEE estimation. Thus the conditional mean parameter estimate $\hat{\beta}_G$ can be obtained by solving
$$U_G(\beta) = \sum_{j=1}^{J} \hat{D}_j'\,\big[\hat{A}_j^{1/2}\,\hat{R}_j(\hat{\delta})\,\hat{A}_j^{1/2}\big]^{-1}(y_j - \hat{\mu}_j) = 0,$$
where
$$\hat{D}_j = \begin{pmatrix} \partial\hat{\mu}_{1j}/\partial\beta_1 & \cdots & \partial\hat{\mu}_{1j}/\partial\beta_p \\ \vdots & & \vdots \\ \partial\hat{\mu}_{n_j j}/\partial\beta_1 & \cdots & \partial\hat{\mu}_{n_j j}/\partial\beta_p \end{pmatrix},$$
$\hat{A}_j = \mathrm{diag}[\hat{\mu}_{1j}(1 - \hat{\mu}_{1j}), \ldots, \hat{\mu}_{n_j j}(1 - \hat{\mu}_{n_j j})]$,
$$\hat{R}_j(\hat{\delta}) = \begin{pmatrix} 1 & \hat{\delta} & \cdots & \hat{\delta} \\ \hat{\delta} & 1 & \cdots & \hat{\delta} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\delta} & \hat{\delta} & \cdots & 1 \end{pmatrix},$$
$\hat{\mu}_j = (\hat{\mu}_{1j}, \ldots, \hat{\mu}_{n_j j})'$ and $y_j = (y_{1j}, \ldots, y_{n_j j})'$.
Liang and Zeger (1986) show that $\hat{\beta}_G$, the solution of the above estimating equations, is consistent and asymptotically ($J \to \infty$) Gaussian given only correct specification of the marginal
probabilities. In addition, a robust variance estimate is also consistent even when the true
correlation matrix is misspecified. The “robust” variance estimate is robust in the sense
of being consistent even if the working correlation used is not the true correlation.
3.3.1 How to Compute the Variance of $\hat{\beta}_G$

Liang and Zeger's (1986) Theorem 2 established the conditions under which the solution of Equation 3.2 is consistent and $\sqrt{J}(\hat{\beta}_G - \beta)$ is asymptotically ($J \to \infty$) multivariate Gaussian with covariance matrix $V_G$ given by
$$V_G = \lim_{J \to \infty} J \Big(\sum_{j=1}^{J} D_j' V_j^{-1} D_j\Big)^{-1} \Big\{\sum_{j=1}^{J} D_j' V_j^{-1}\, \mathrm{Cov}(y_j)\, V_j^{-1} D_j\Big\} \Big(\sum_{j=1}^{J} D_j' V_j^{-1} D_j\Big)^{-1},$$
where $\mathrm{Cov}(y_j)$, $A_j$ and $D_j$ are conditional on the $X_j$'s.

Furthermore, the variance $V_G$ of $\hat{\beta}_G$ can be consistently estimated by replacing $\mathrm{Cov}(y_j|X_j)$ in the equation above by $(y_j - \hat{\mu}_j)(y_j - \hat{\mu}_j)'$, and $\beta$, $\delta$ by their estimates. That is,
$$\hat{V}_G = \hat{M}_0^{-1}\, \hat{M}_1\, \hat{M}_0^{-1},$$
where
$$\hat{M}_0 = \sum_{j=1}^{J} (\partial\hat{\mu}_j'/\partial\beta)\, \hat{V}_j^{-1}\, (\partial\hat{\mu}_j/\partial\beta'), \quad \text{and}$$
$$\hat{M}_1 = \sum_{j=1}^{J} (\partial\hat{\mu}_j'/\partial\beta)\, \hat{V}_j^{-1}\, (y_j - \hat{\mu}_j)(y_j - \hat{\mu}_j)'\, \hat{V}_j^{-1}\, (\partial\hat{\mu}_j/\partial\beta').$$
Liang and Zeger (1986) show that the consistency of $\hat{\beta}_G$ and $\hat{V}_G$ depends only on the correct specification of the mean, not on the correct choice of $R(\delta)$. If $R(\delta)$ is the true correlation matrix, then $V_j$ is equal to $\mathrm{Cov}(y_j|X_j)$. Thus a consistent estimate of the asymptotic variance matrix of $\hat{\beta}_G$ is given by
$$\Big(\sum_{j=1}^{J} \hat{D}_j'\, \hat{V}_j^{-1}\, \hat{D}_j\Big)^{-1},$$
which is known as a model-based covariance estimate. However, this is not the case under the data mechanism utilized to generate correlated binary responses in this simulation study. Therefore, the robust standard errors of the estimates are computed.
3.3.2 How to Compute $\hat{\beta}_G$

To compute $\hat{\beta}_G$, one can iterate between an iteratively reweighted least squares algorithm for $\beta$ and the moment estimate of $\delta$. Starting with an initial value $\hat{\beta}_0$ based on the logit QMLE or IEE, and the current estimate $\hat{\delta}$, the values of $\hat{\beta}$ are updated according to
$$\hat{\beta}_{k+1} = \hat{\beta}_k - \Big\{\sum_{j=1}^{J} \hat{D}_j'(\hat{\beta}_k)\, \tilde{V}_j^{-1}(\hat{\beta}_k)\, \hat{D}_j(\hat{\beta}_k)\Big\}^{-1} \Big\{\sum_{j=1}^{J} \hat{D}_j'(\hat{\beta}_k)\, \tilde{V}_j^{-1}(\hat{\beta}_k)\, \hat{S}_j(\hat{\beta}_k)\Big\},$$
where $\hat{S}_j(\hat{\beta}_k) = \hat{\mu}_j(\hat{\beta}_k) - y_j$ and $\tilde{V}_j(\beta) = V_j[\beta, \hat{\delta}(\beta)]$.
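To illustrate the iteration just described, the sketch below alternates the moment estimate of $\delta$ (Equation 3.4) with a standard scoring step for $\beta$ under the constant working correlation. It is a simplified stand-in for the SAS/IML macro used in the thesis, written under the assumption of equal cluster sizes; the function names, the convergence rule (a fixed number of iterations), and the plus-signed scoring step are assumptions.

```python
import numpy as np

def fit_gee_exchangeable(X, y, beta_init, n_iter=25):
    """Minimal GEE sketch with a constant ('exchangeable') working correlation:
    alternate the moment estimate of delta with a scoring update for beta.
    X has shape (J, n, p), y has shape (J, n); beta_init is the IEE estimate.
    Equal cluster sizes are assumed."""
    J, n, p = X.shape
    beta = beta_init.copy()
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))                 # marginal probabilities
        resid = (y - mu) / np.sqrt(mu * (1.0 - mu))            # Pearson residuals
        # Moment estimate of delta: average cross-product over all within-cluster pairs.
        pair_sum = 0.5 * (resid.sum(axis=1) ** 2 - (resid ** 2).sum(axis=1))
        delta = pair_sum.sum() / (J * n * (n - 1) / 2)
        R = np.full((n, n), delta) + (1.0 - delta) * np.eye(n) # working correlation R(delta)
        lhs = np.zeros((p, p))
        rhs = np.zeros(p)
        for j in range(J):
            a = mu[j] * (1.0 - mu[j])
            Dj = X[j] * a[:, None]                             # d mu_j / d beta for the logit link
            Vj = np.sqrt(np.outer(a, a)) * R                   # A^{1/2} R(delta) A^{1/2}
            Vinv = np.linalg.inv(Vj)
            lhs += Dj.T @ Vinv @ Dj
            rhs += Dj.T @ Vinv @ (y[j] - mu[j])
        beta = beta + np.linalg.solve(lhs, rhs)                # scoring step
    return beta, delta
```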
3.4 The Computer Programs
A computer program based on equation (2.1) is coded using SAS/IML Macro
Language to generate correlated binary response data. Logistic marginal models (e.g.,
$\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$) are formulated to model the sets of simulated correlated binary responses. The IEE and the GEE estimators of the regression parameter $\beta$ and the
variance of the estimates are computed using the SAS/IML Macro Language version 2.03
(Groemping, 1993, 1994), which is a modified and extended version of the SAS macro
written by Karim and Zeger (Technical Report #674, Department of Biostatistics, the
Johns Hopkins University, 1989).
This SAS/IML Macro Language version 2.03 uses the GEE approach of Liang
and Zeger (1986) to model correlated data for a general class of outcome variable
including Gaussian, Poisson, binomial and gamma outcomes. The program uses an
iterative procedure to estimate the regression coefficients. That is, the program iterates
between the following steps: obtaining initial estimates of the regression parameters by
QMLE or IEE, if not given; estimating the “working covariance matrix” given the current
regression coefficients; updating the regression coefficients given the new correlation
estimates; calculating the model-based and the robust variance of the estimates of the
regression coefficients.
The Monte Carlo simulations were performed on a PC P5-166 Pentium.
Furthermore, to facilitate the use of the GEE SAS/IML macro, a program coded using
SAS/IML Macro Language was used for passing parameters.
Chapter 4
SIMULATION RESULTS
4.1 Introduction
In this Chapter, I study the finite properties of the estimating equations (EE)
approach of Liang and Zeger (1986) by numerical simulation. I simulate correlated
binary response data as described in Chapter 2. The conditional mean parameter $\beta_1$ in the logistic marginal models, $\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$, is estimated via IEE and GEE. I compare the performance of the IEE and GEE methods of estimation in terms of the bias, the efficiency, and the accuracy of the standard error estimation of the estimate $\hat{\beta}_1$. I evaluate the efficiency of both methods using the mean square error (MSE) of $\hat{\beta}_1$ as the criterion. Furthermore, the accuracy of the standard error estimation of the estimate $\hat{\beta}_1$ is described with a Wald-type test for testing simple hypotheses about the mean parameter $\beta_1$.
4.2 Simulation Design
The following factor combinations are specified in generating correlated binary responses: the cluster size $n$ is fixed at 5; the sample size, $J$, given by the number of clusters, is increased from 50 to 100 to 200. Furthermore, in the process of analyzing the effects of the covariate, $x_{ij}$, on the average marginal probability under the above factor combinations, I first consider a binary within cluster covariate, that is, a covariate that differs within the same cluster. For example, $x_{ij}$ might indicate whether student $i$ in school $j$ belongs to a minority group, and $y_{ij}$ might indicate whether the student either passes or fails a "National Test." Under this formulation, the logistic marginal model estimates the difference in promotion rates between minority and non-minority students. Second, I consider a binary cluster specific covariate, that is, a covariate that is identical for all the subjects within a cluster. Third, I consider a covariate that is standard normally distributed within cluster. Each data configuration is replicated 400 times. Furthermore,
under the DGM proposed in this study, the extent of the population pairwise correlation of the binary responses depends on an unobserved school effect through the probability mass function associated with the $c_{ij}$'s, independent Bernoulli random variables, and on the distribution of the covariates or observable predictors, $x_{ij}$. Thus, I simulate logistic marginal models with highly and weakly correlated pairwise binary responses by specifying the $c_{ij}$'s as independent Bernoulli random variables with probability mass function given by $\mathrm{pr}(c_{ij} = 1) = \rho$, where $0 < \rho < 1$. The higher $\rho$, the higher the population average correlation of the binary responses; $\rho$ governs the within cluster correlation. Thus, I evaluate the small sample performance of the IEE and the GEE methods of estimation in estimating the regression parameters under the following model specifications:
Model 1

$\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$, where $\beta_0 = 1$ and $\beta_1 = (-0.6, -0.4, 0, 0.4, 0.6)$, and the binary within cluster covariate $x_{ij}$ equals 1 or 0 with equal probabilities.

Model 2

$\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_j$, where $\beta_0 = 1$ and $\beta_1 = (-0.6, -0.4, 0, 0.4, 0.6)$, and the binary cluster covariate $x_j$ equals 1 or 0 with equal probabilities.

Model 3

$\mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{ij}$, where $\beta_0 = 1$ and $\beta_1 = (-1.0, -0.5, 0, 0.5, 1.0)$, and the within cluster covariate $x_{ij} \sim N(0, 1)$.
In this study, the analysis of the correlated binary responses consists of two
phases. During the first phase, I ignore the within cluster correlation of the binary
responses by naively assuming independence of the responses, and I formulate logistic
marginal models to analyze the sets of simulated data. Thus, in order to fit the IEE to the
logistic marginal model, I specify the identity matrix as the working correlation matrix.
Then, I compute the model based (Naive SE) and the robust estimates (Robust SE) for the
variance of the estimates of the regression parameters based on formulation of Liang and
Zeger’s Theorem 1.
In an attempt to better describe the behavior of the estimator of the conditional
mean parameter, $\beta_1$, and both standard errors of the estimate $\hat{\beta}_1$ (i.e., the Naive SE and Robust SE), I compare the estimates for the various simulated factor combinations mentioned above in terms of their bias and the accuracy of the standard error estimation of the estimate $\hat{\beta}_1$ when testing simple hypotheses about the conditional mean parameter $\beta_1$. The comparison of both estimates of the standard error of the estimate $\hat{\beta}_1$ is based on the calculated rejection probabilities (RP) when testing $H_0: \beta_1 = 0$ for given true values of $\beta_1$ and on the coverage probabilities (CP) when testing $H_0: \beta_1 = b_1$ for the given true values of $b_1$. The RP denotes the proportion of times that $H_0: \beta_1 = 0$ is rejected for the given true values of $\beta_1$. The CP is determined by the proportion of intervals containing the true parameter when testing $H_0: \beta_1 = b_1$ for the given true values of $\beta_1$.
During the second phase, in order to fit the GEE to the logistic marginal model, I
specify a constant (exchangeable) correlation matrix, R_j, of dimension n_j × n_j, as the
working correlation matrix; that is, I assume Corr(y_ij, y_hj | X_j) = δ for all i ≠ h,
i, h = 1, ..., n_j, j = 1, ..., J. This working correlation matrix is not optimal, but it is easy to
estimate and depends only on the single parameter δ (see Equation 3.2). Although there
may be some loss of efficiency relative to the optimal choice, this approach may still lead
to an estimator with higher efficiency than the IEE estimator. I suspect that this is more
likely to be true for a large population average pairwise correlation. I compare both
methods, IEE and GEE, in terms of their bias and the magnitude of their MSE. Finally,
the accuracy of the standard errors of the estimate β̂_1 is evaluated when testing simple
hypotheses about the conditional mean parameter β_1.
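A hypothetical sketch of both phases using an off-the-shelf GEE routine is shown below; it assumes the statsmodels interface (sm.GEE, the Independence and Exchangeable covariance structures, and the standard_errors method with its robust/naive options) and is not the software actually used in this study.

```python
import statsmodels.api as sm

# Reusing the simulated y, X, and cluster arrays from the sketches above.
iee = sm.GEE(y, X, groups=cluster, family=sm.families.Binomial(),
             cov_struct=sm.cov_struct.Independence()).fit()   # phase 1: identity working correlation
gee = sm.GEE(y, X, groups=cluster, family=sm.families.Binomial(),
             cov_struct=sm.cov_struct.Exchangeable()).fit()   # phase 2: constant working correlation R_j(delta)

print("IEE beta:", iee.params, " GEE beta:", gee.params)
print("IEE naive SE :", iee.standard_errors(cov_type="naive"))
print("IEE robust SE:", iee.standard_errors(cov_type="robust"))
print("GEE robust SE:", gee.standard_errors(cov_type="robust"))
```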
4.3 Results of the Estimation Procedures
In what follows I show the results of the evaluation of the small sample properties
of the IEE and GEE methods of estimation when used to fit logistic marginal models to
sets of correlated binary responses. I compare the bias of the IEE and GEE estimators,
β̂_1 of β_1, over a given range of true values of β_1. The relative bias of β̂_1 and the
absolute error are defined by ((β̄_1 − β_1)/β_1) × 100 for β_1 ≠ 0 and by β̄_1 − β_1 when
β_1 = 0, respectively, where β_1 is the true parameter value and β̄_1 is the average value of
β̂_1 over the simulations. Two scenarios are used to establish the accuracy of the standard
error estimates of β̂_1 under the IEE and GEE. First, I test H0: β_1 = b_1 for given true
values of b_1. In this case, the accuracy of the test is measured by the fraction of the
two-sided confidence intervals, given by β̂_1 ± z_0.975 √Var(β̂_1), which include the true
parameter value of β_1. Second, I test H0: β_1 = 0 for given true values of β_1. In this case
the accuracy of the test is measured by the fraction of times that |β̂_1 / √Var(β̂_1)| is greater
than z_0.975 = 1.96, where β̂_1 is the estimate of the regression parameter β_1 and Var(β̂_1)
is either the Naive or the Robust estimate of the variance of β̂_1.
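These summary quantities can be computed directly from the replicated estimates; a minimal sketch, with hypothetical argument names, is given below.

```python
import numpy as np

Z975 = 1.96  # z_0.975

def monte_carlo_summary(beta_hats, ses, beta_true):
    """Relative bias (or absolute error), rejection probability for
    H0: beta_1 = 0, coverage probability for the true value, and the
    Monte Carlo SD, computed over the replications."""
    beta_hats, ses = np.asarray(beta_hats), np.asarray(ses)
    mean_est = beta_hats.mean()
    if beta_true != 0:
        bias = 100.0 * (mean_est - beta_true) / beta_true        # relative bias (%)
    else:
        bias = mean_est - beta_true                              # absolute error
    rp = np.mean(np.abs(beta_hats / ses) > Z975)                 # reject H0: beta_1 = 0
    covered = ((beta_hats - Z975 * ses <= beta_true) &
               (beta_true <= beta_hats + Z975 * ses))
    cp = covered.mean()                                          # CI covers the true beta_1
    return mean_est, bias, rp, cp, beta_hats.std(ddof=1)
```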
In Table 4.1, IEE results for highly correlated binary responses, cluster size n
fixed at 5, sample size J equal to 50, 100, and 200, and covariate x_ij equal to 1 or 0
with equal probability, show the averages of the estimated population average correlation,
δ̂, the estimate β̂_1, the standard error estimates of β̂_1 (i.e., Naive SE and Robust SE),
and the true estimate of the precision of β̂_1 (Monte Carlo SD), as well as the calculated
RP and the calculated CP for 400 trials. From Table 4.1, observe that for
J = 50, 100, 200, the estimate β̂_1 is biased and tends to overestimate the regression
parameter β_1 in absolute value across the range of values considered. Also from Table
4.1, notice that the estimates of the regression parameter β_1 for the values (-0.6, -0.4) are
on average nearly unbiased. The relative biases vary between 6.8% and 2.30%, 1.50% and
-2.25%, and 1.83% and -1.0% for J = 50, 100, 200, respectively. But for the values
(0.4, 0.6), the relative biases are much larger; they vary between 14.80% and 15.0%, 4.75%
and 8.67%, and 2.50% and 1.0% for J = 50, 100, 200, respectively. The exceptions occur
only when β_1 = -0.4 and J = 100, 200, where the mean parameter β_1 is underestimated
in absolute value. However, the relative biases are on the order of 2.25% and 1%,
respectively, insubstantial amounts. And for β_1 = 0, the absolute error of the estimate β̂_1
equals 0.005, -0.004, -0.003 for J = 50, 100, 200, respectively. As expected, the bias
decreases as the sample size increases.
Regarding the Naive SE and the Robust SE of the estimate β̂_1, notice that both
estimates, on average, are consistently lower than the Monte Carlo SD for all sample sizes
and across the range of values of β_1 considered. It can also be observed that the Robust
SE is closer to the Monte Carlo SD than the Naive SE is across all the values of β_1
estimated. Moreover, as the sample size increases, the Robust SE becomes relatively
closer to the Monte Carlo SD than the Naive SE. The exceptions occur only when
β_1 = 0 and J = 50, 100; for each exception, the Naive SE, on average, is closer to the
Monte Carlo SD. As expected, the Robust SE is clearly more variable than the Naive SE
for all sample sizes and all values of β_1.
In addition, to evaluate the performance of the Naive SE and the Robust SE of the
estimate β̂_1 under IEE, two scenarios are considered: first, I test H0: β_1 = b_1 given
b_1 = (-0.6, -0.4, 0, 0.4, 0.6) as the true values; second, I test H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values.
In the first scenario, for J = 50 and H0: β_1 = b_1 given b_1 = 0 as the true value,
Table 4.1 shows that under the Naive SE, the calculated CP (93.3%) is clearly within the
limits of the nominal 95% level (92.9%, 97.1%). However, under the Robust SE, the
calculated CP (91%) substantially underestimates the limits of the nominal 95% level.
Furthermore, for the other values of b_1 investigated, Table 4.1 shows that under either the
Naive SE or the Robust SE, the calculated CP clearly underestimates the limits of the
nominal 95% level. However, under the Robust SE, as the sample size increases, the
calculated CP is within the limits, or substantially close to the limits, of the nominal 95%
level across all the values of b_1. By contrast, under the Naive SE, the calculated CP
continues to substantially underestimate the limits of the nominal 95% CP across all the
values of b_1.
In the second scenario, for J = 50 and H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values, Table 4.1 shows that H0: β_1 = 0 is falsely
accepted a substantial proportion of the time across all values of β_1 when using either
estimate of the standard error of β̂_1. The proportions vary from a minimum of 44.5% to a
maximum of 69.5% under the Naive SE, and from 53.2% to 72% under the Robust SE.
As expected, the situation improves as the sample size increases to J = 100, 200. As a
final point, note that the estimated population average correlation of the binary responses,
δ̂, on average, is comparable for all sample sizes and across all the values of β_1.
In Table 4.2, IEE and GEE results for highly correlated binary responses, cluster
size n fixed at 5, sample size J equal to 50, 100, and 200, and covariate x_ij equal to 1 or
0 with equal probability, show the averages of the estimated population average
correlation of the binary responses, δ̂, the estimate β̂_1, the standard error estimates of
β̂_1 (i.e., IEE Robust SE and GEE Robust SE), as well as the mean square errors (i.e., IEE
MSE and GEE MSE), the calculated RP, and the calculated CP for 400 trials.
In what follows I compare the performance of the IEE and GEE methods of
estimation in terms of the bias, the efficiency, and the accuracy of the estimate β̂_1. I
evaluate the efficiency of both methods using the MSE of β̂_1 as the criterion.
Furthermore, the accuracy of standard error estimation for β̂_1 is described with the
Wald-type test for testing simple hypotheses about the mean parameter β_1.
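For reference, the efficiency comparison reduces to the Monte Carlo MSE of each estimator and the percent difference reported in the %Δ MSE rows of the tables; a brief sketch of that arithmetic, which appears consistent with the tabulated values, is:

```python
import numpy as np

def mc_mse(beta_hats, beta_true):
    """Monte Carlo mean square error of an estimator over the replications."""
    beta_hats = np.asarray(beta_hats)
    return np.mean((beta_hats - beta_true) ** 2)

def pct_delta_mse(mse_gee, mse_iee):
    """Percent difference of the GEE MSE relative to the IEE MSE, as it
    appears to be reported in the %Δ MSE rows (negative values favor GEE)."""
    return 100.0 * (mse_gee - mse_iee) / mse_iee
```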
From Table 4.2, notice that the IEE and GEE estimates β̂_1 for the values
β_1 = (-0.6, -0.4) are, on average, nearly unbiased. That is, the relative biases are less
than 6.8%, -2.25%, and 0.8% for J = 50, 100, 200, respectively. But for the values
β_1 = (0.6, 0.4), the relative biases are substantially larger; they vary between 12% and
15.5%, 8.7% and 4.75%, and 2.8% and 1% for J = 50, 100, 200, respectively. Finally, note
that for J = 50, 100, 200 and β_1 = 0, the bias of the estimate β̂_1 under GEE is almost
nonexistent and substantially lower than the bias of the estimate β̂_1 under the IEE
method. As expected, the bias decreases substantially as the sample size increases to
J = 200 under both methods. The above shows that the extent of the bias of the estimate
β̂_1, under both methods, depends on the values of the mean parameter β_1 being
estimated.
To evaluate the efficiency of the IEE and the GEE methods in estimating the
mean parameter β_1, I focus on the MSE of β̂_1. For J = 50, Table 4.2 shows that the
MSE of β̂_1 under the GEE ranged from 36% to 74.5% lower than under the IEE across
the true parameter values of β_1 investigated. Notice also that this pattern remains as J
increases (i.e., 43.83% to 76.48% and 45.10% to 78.40% for J = 100, 200, respectively).
In evaluating the accuracy of the Robust SE of the estimate β̂_1 under IEE and
GEE, two scenarios are considered: first, I test H0: β_1 = b_1 given
b_1 = (-0.6, -0.4, 0, 0.4, 0.6) as the true values; second, I test H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values.
In the first scenario, for J = 50 and H0: β_1 = b_1 given b_1 = 0 as the true value,
Table 4.2 shows that under the GEE Robust SE, the calculated CP (95%) matches the
nominal 95% level exactly. However, under the IEE Robust SE, the calculated CP
(91%) substantially underestimates the limits (92.9%, 97.1%) of the nominal 95% level.
Furthermore, for the other values, b_1 = (-0.6, -0.4, 0.4, 0.6), each of the calculated
CP under the GEE Robust SE is somewhat closer to the limits of the nominal 95% level
than under the IEE Robust SE. As expected, as the sample size increases to J = 100, 200,
the calculated CP is within the limits of the nominal 95% level for almost all the values
of b_1. The only exception occurs under the IEE Robust SE when b_1 = -0.6 and J = 100,
where the calculated CP (91.5%) somewhat underestimates the limits of the nominal 95%
level.
In the second scenario, for J = 50 and H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values, Table 4.2 shows that the calculated RP are
substantially and uniformly higher under the GEE Robust SE than under the IEE Robust
SE. This means that the false null hypothesis, H0: β_1 = 0, is rejected substantially more
often under the GEE Robust SE than under the IEE Robust SE. Furthermore, as the
sample size increases to J = 100, 200, these results consistently hold. Clearly, under
these simulated conditions, the GEE method of estimation considerably outperforms the
IEE method.
In Table 4.3, IEE results for low correlated binary responses, cluster size n fixed
at 5, sample size J equal to 50, 100, and 200, and covariate x_ij equal to 1 or 0 with
equal probability, show the averages of the estimated population average correlation, δ̂,
the estimate β̂_1, the standard error estimates of β̂_1 (i.e., Naive SE and Robust SE), and
the true estimate of the precision of β̂_1 (Monte Carlo SD), as well as the calculated RP
and the calculated CP for 400 trials. Table 4.3 shows that for J = 50, the estimate β̂_1 is
biased and substantially overestimates in absolute value the regression parameter β_1
across the range of values considered. Furthermore, notice that as the sample size
increases, the bias of the estimate β̂_1 consistently decreases across all values of β_1.
Also for J = 50 in Table 4.3, notice that both estimates of the standard error of
β̂_1, the Naive SE and the Robust SE, perform somewhat erratically. For example, for
β_1 = -0.4 both estimates overestimate the Monte Carlo SD, and for β_1 = -0.6 both
estimates underestimate the Monte Carlo SD, although in both cases by an insubstantial
amount. On the other hand, for β_1 = (0, 0.4, 0.6) both estimates similarly underestimate
the Monte Carlo SD by a substantial amount; the relative bias ranges from a minimum of
8.4% to a maximum of 15.0%. Furthermore, as the sample size increases to
J = 100, 200, both estimates, the Naive SE and the Robust SE, are very close to the
Monte Carlo SD across all the values of β_1 estimated. Also, as expected, the Robust SE
are more variable than the Naive SE across all the values of β_1.
Additionally, to evaluate the performance of the Naive and Robust SE of the
estimate β̂_1 under IEE, two scenarios are considered: first, I test H0: β_1 = b_1 given
b_1 = (-0.6, -0.4, 0, 0.4, 0.6) as the true values; second, I test H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values.
In the first scenario, for J = 50 and H0: β_1 = b_1 given b_1 = (-0.6, -0.4, 0, 0.4, 0.6)
as the true values, Table 4.3 shows that under either the Naive SE or the Robust SE, the
calculated CP are clearly within the limits of the nominal 95% level. The only exception
occurs for b_1 = 0 under the Robust SE; in this case the calculated CP (92.2%) slightly
underestimates the limits of the nominal 95% level. Furthermore, as the sample size
increases, the calculated CP are within the limits of the nominal 95% CP under both
estimates of the standard error.
In the second scenario, for J = 50 and H0: β_1 = 0 given β_1 = (-0.6, -0.4, 0.4, 0.6)
as the true values, the calculated RP are almost always higher under the Naive SE than
under the Robust SE, although by an insubstantial amount. As expected, as the sample
size increases to J = 100, 200, there are no substantial differences when using either the
Naive SE or the Robust SE.
In Table 4.4, IEE and GEE results for low correlated binary responses, cluster size
n fixed at 5, sample size J equal to 50, 100, and 200, and covariate x_ij equal to 1 or 0
with equal probability, show the averages of the estimated population average correlation
of the binary responses, δ̂, the estimate β̂_1, the standard error estimates of β̂_1 (i.e., IEE
Robust SE and GEE Robust SE), as well as the mean square errors (i.e., IEE MSE and
GEE MSE), the calculated RP, and the calculated CP for 400 trials.
In what follows I compare the performance of the IEE and GEE methods of
estimation in terms of the bias, the efficiency, and the accuracy of the estimate β̂_1. I
evaluate the efficiency of both methods using the MSE of β̂_1 as the criterion.
Furthermore, the accuracy of standard error estimation for β̂_1 is described with the
Wald-type test for testing simple hypotheses about the mean parameter β_1.
In Table 4.4, for J = 50, notice that IEE and GEE overestimate β_1 in absolute
value and show similar bias in estimating the mean parameter β_1. The bias ranges
between 9% and 14.3% for β_1 = (-0.6, 0.4, 0.6) and is less than 5% for β_1 = (0, -0.4).
Furthermore, as the sample size increases, the bias of the IEE and the GEE estimates
decreases substantially across all the values of β_1.
In Table 4.4, for J = 50, notice that the results on the efficiency of IEE and GEE,
under the mean square error criterion, are mixed. That is, for β_1 = (-0.6, 0, 0.4) the MSE
of β̂_1 under IEE are lower than under GEE; but for β_1 = (-0.4, 0.6), the MSE of β̂_1
under IEE are higher than under GEE. However, as the sample size increases, the MSE of
β̂_1 under GEE are lower across all the values of β_1 estimated. The relative difference
between the MSE ranges from a 1.52% minimum to a 7.14% maximum and from a 3.85%
minimum to a 10.53% maximum for J = 100, 200, respectively.
Additionally, to evaluate the performance of the Robust SE of the estimate β̂_1
under IEE and GEE, two scenarios are considered: first, I test H0: β_1 = b_1 given
b_1 = (-0.6, -0.4, 0, 0.4, 0.6) as the true values; second, I test H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values.
In the first scenario, for J = 50 and H0: β_1 = b_1 given b_1 = 0 as the true value,
Table 4.4 shows that for both estimates, the IEE Robust SE and the GEE Robust SE, the
calculated CP (92.2%, 91.5%) underestimate the limits of the nominal 95% level.
However, for b_1 = (-0.6, -0.4, 0.4, 0.6) the calculated CP are clearly within the limits of
the nominal 95% level for all the values of b_1. Furthermore, as the sample size increases,
the calculated CP are clearly within the limits of the nominal 95% level for all values of
b_1 considered.
In the second scenario, for J = 50 and H0: β_1 = 0 given β_1 = (-0.6, -0.4, 0.4, 0.6)
as the true values, Table 4.4 shows that the calculated RP are higher under the GEE
Robust SE than under the IEE Robust SE; at the same time, the calculated CP under the
GEE Robust SE are within the limits of the nominal 95% level for all values of β_1
estimated. As the sample size increases, both estimates perform similarly and reasonably
well.
In Table 4.5, IEE results for highly correlated binary responses, cluster size n
fixed at 5, sample size J equal to 50, 100, and 200, and the cluster covariate x_j equal to
1 or 0 with equal probability, show the averages of the estimated population average
correlation, δ̂, the estimate β̂_1, the standard error estimates of β̂_1 (i.e., Naive SE and
Robust SE), and the true estimate of the precision of β̂_1 (Monte Carlo SD), as well as the
calculated RP and the calculated CP for 400 trials.
Table 4.5 shows that for J = 50, the estimate β̂_1, particularly for
β_1 = (-0.6, 0.4, 0.6), is biased and substantially overestimates the regression parameter
β_1. The percentages of bias equal 13.18%, 8.89%, and 31.83%, respectively. On the other
hand, for β_1 = (-0.4, 0) the averages of the estimates β̂_1 agree with β_1 to two decimal
places; the percentage of bias is less than 2%, an insubstantial amount. Furthermore, as
expected, the bias decreases substantially as the sample size increases across all the
values of β_1 estimated. Regarding the Naive SE and Robust SE of the estimate β̂_1,
Table 4.5 shows that for all sample sizes and across all values of β_1 considered, the
Robust SE are substantially closer to the Monte Carlo SD than are the Naive SE.
In addition, to evaluate the performance of the Naive and Robust SE of the
estimate β̂_1 under IEE, two scenarios are considered: first, I test H0: β_1 = b_1 given
b_1 = (-0.6, -0.4, 0, 0.4, 0.6) as the true values; second, I test H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values.
In the first scenario, for J = 50 and H0: β_1 = b_1 given
b_1 = (-0.6, -0.4, 0, 0.4, 0.6) as the true values, Table 4.5 shows that under the Robust SE
the calculated CP are clearly within, or only slightly underestimate, the limits of the
nominal 95% level. On the other hand, under the Naive SE, the calculated CP equal 69.8%,
72.0%, 71.8%, 70.8%, and 69.3%, which grossly underestimate the limits of the nominal
95% level. Notice that as the sample size increases, the calculated CP under the Robust
SE are clearly within the limits of the nominal 95% level, while under the Naive SE the
calculated CP still grossly underestimate the limits of the nominal 95% level.
In the second scenario, for J = 50 and H0: β_1 = 0 given β_1 = (-0.6, -0.4, 0.4, 0.6)
as the true values, Table 4.5 shows that the ability of the test to recognize H0: β_1 = 0 as
false is higher under the Naive SE than under the Robust SE for all values of β_1
considered.
In Table 4.6, IEE and GEE results for highly correlated binary responses, cluster
size n fixed at 5, sample size J equal to 50, 100, and 200, and the cluster covariate x_j
equal to 1 or 0 with equal probability, show the averages of the estimated population
average correlation of the binary responses, δ̂, the estimate β̂_1, the standard error
estimates of β̂_1 (i.e., IEE Robust SE and GEE Robust SE), as well as the mean square
errors (i.e., IEE MSE and GEE MSE), the calculated RP, and the calculated CP for 400
trials.
For J = 50, Table 4.6 shows that GEE and IEE are biased and substantially
overestimate the regression parameter β_1 in absolute value. Notice that the bias under
the GEE is substantially larger than under the IEE across all the values of β_1 estimated.
As the sample size increases to J = 100 the biases under GEE decrease somewhat, but
they remain substantially high and about twice the biases observed under IEE. For
J = 200, both estimates on average are nearly unbiased; the relative biases range
between 0.67% and 2.75% under IEE and between 0.5% and 3.83% under GEE.
Regarding the efficiency of IEE and GEE under the mean square error criterion,
for J = 50 Table 4.6 shows that the GEE MSE are substantially higher than the IEE
MSE across all the values of β_1. Specifically, the GEE MSE of β̂_1 range from 20.83%
to 52.18% higher than the IEE MSE. As expected, as the sample size increases to
J = 100, the MSE under both methods of estimation decrease substantially. However, the
GEE MSE remain somewhat higher than the IEE MSE, particularly for β_1 = (0.4, 0.6),
where the GEE MSE are 21.72% to 47.50% higher than the IEE MSE. As the sample size
increases to J = 200, both MSE decrease substantially across all values of β_1 estimated;
however, the GEE MSE are still somewhat higher than the IEE MSE.
In addition, to evaluate the performance of the Robust SE of the estimate β̂_1
under IEE and GEE, two scenarios are considered: first, I test H0: β_1 = b_1 given
b_1 = (-0.6, -0.4, 0, 0.4, 0.6) as the true values; second, I test H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values.
In the first scenario, for J = 50 and H0: β_1 = b_1 given
b_1 = (-0.6, -0.4, 0, 0.4, 0.6) as the true values, Table 4.6 shows that under the IEE Robust
SE, the calculated CP are within the limits of the nominal 95% level across all the values
of b_1. The only exception occurs for b_1 = -0.6, where the calculated CP (92%)
substantially underestimates the limits of the nominal 95% level. But under the GEE
Robust SE, the calculated CP substantially underestimate the limits of the nominal 95%
level. However, as the sample size increases, the calculated CP are clearly within the limits
of the nominal 95% level under both estimates of the standard error of β̂_1.
In the second scenario, for J = 50, when testing H0: β_1 = 0 given
β_1 = (-0.6, -0.4, 0.4, 0.6) as the true values, the ability of the test under either method of
estimation to recognize that H0: β_1 = 0 is false is considerably low across all values of β_1
estimated. As the sample size increases there is some improvement; however, under
either method we would erroneously accept H0: β_1 = 0 a substantial proportion of the
time across all values of β_1 estimated.
In Table 4.7, IEE results for highly correlated binary responses, cluster size n fixed
at 5, sample size J equal to 50, 100, and 200, and x_ij representing a standard normal
within-cluster covariate, show the averages of the estimated population average
correlation, δ̂, the estimate β̂_1, the standard error estimates of β̂_1 (i.e., Naive SE and
Robust SE), and the true estimate of the precision of β̂_1 (Monte Carlo SD), as well as the
calculated RP and the calculated CP for 400 trials.
For J = 50 in Table 4.7, notice that the estimate β̂_1 is somewhat biased and
overestimates β_1 in absolute value across the range of values considered. The relative
bias of β̂_1 varies between 6.91% and 9.28% for β_1 ≠ 0; for β_1 = 0, the averages of the
estimates β̂_1 differ from β_1 in the second decimal place. Furthermore, as the sample size
increases, the bias decreases substantially. For example, for β_1 ≠ 0 the relative biases
vary between 3.52% and 5.65% for J = 100, and between 1.24% and 1.58% for J = 200;
for β_1 = 0, the averages of the estimates β̂_1 differ from β_1 in the third decimal place for
J = 100, 200.
Regarding the Naive SE and the Robust SE of the estimate β̂_1, for J = 50 in
Table 4.7 notice that the Robust SE are closer to the Monte Carlo SD than the Naive SE.
Furthermore, as the sample size increases, the Robust SE become consistently closer to
the Monte Carlo estimates of the true precision of β̂_1, and at a faster rate than the
Naive SE.
In addition, to evaluate the performance of the Naive and Robust SE of the
estimate β̂_1 under IEE, two scenarios are considered: first, I test H0: β_1 = b_1 given
b_1 = (-1.0, -0.5, 0, 0.5, 1.0) as the true values; second, I test H0: β_1 = 0 given
β_1 = (-1.0, -0.5, 0.5, 1.0) as the true values.
In the first scenario, for J = 50 and H0: β_1 = b_1 given
b_1 = (-1.0, -0.5, 0, 0.5, 1.0) as the true values, Table 4.7 shows that under the Robust SE
the calculated CP underestimate the limits of the nominal 95% level (92.9%, 97.1%);
they vary between 89.3% and 92.5%. Similar results are obtained under the Naive SE;
however, for b_1 = 0 the calculated CP, equal to 96%, is clearly within the limits of the
nominal 95% level.
In the second scenario, for J = 50 and H0: β_1 = 0 given β_1 = (-1.0, -0.5, 0.5, 1.0)
as the true values, Table 4.7 shows that under either standard error, the ability of the test
to recognize that H0: β_1 = 0 is false is substantially high; it varies between 89.3% and
100% and between 86.3% and 100% for IEE and GEE, respectively. As the sample size
increases to J = 100, 200, under the Robust SE the calculated RP are very high; they vary
between 99.3% and 100%, while at the same time the calculated CP are clearly within the
limits of the nominal 95% CP. However, although under the Naive SE the ability of the
test to recognize that H0: β_1 = 0 is false is somewhat higher than under the Robust SE,
the calculated CP under the Naive SE still substantially underestimate the nominal 95%
level.
In Table 4.8, IEE and GEE results for highly correlated binary responses, cluster
size n fixed at 5, sample size J equal to 50, 100, and 200, and x_ij representing a
standard normal within-cluster covariate, show the averages of the estimated population
average correlation of the binary responses, δ̂, the estimate β̂_1, the standard error
estimates of β̂_1 (i.e., IEE Robust SE and GEE Robust SE), as well as the mean square
errors (i.e., IEE MSE and GEE MSE), the calculated RP, and the calculated CP for 400
trials.
Table 4.8 shows that for J = 50, the IEE and GEE estimates β̂_1 of the regression
parameter β_1 are, on average, biased and nearly equally overestimate the regression
parameter β_1 in absolute value. The relative biases range between 6.91% and 9.28%
under IEE and between 7.03% and 10.44% under GEE. Furthermore, for β_1 = 0 under
GEE, the average of the estimates β̂_1 differs from β_1 by only 0.001; however, under IEE
they differ by as much as 0.02, a substantial amount. As the sample size increases, the
bias under both methods of estimation decreases substantially across all the values of β_1.
The above analysis shows that for the small sample size, both methods of estimation are
practically equally biased across the values of β_1 estimated; the exception occurs only for
β_1 = 0, where the absolute bias under IEE is substantially larger than under GEE.
Regarding the efficiency of IEE and GEE under the mean square error criterion,
for J = 50 Table 4.8 shows that the GEE MSE, for almost all the values of β_1 estimated,
are substantially lower than the IEE MSE. Specifically, the GEE MSE range from 1.11%
to 73.37% lower than the IEE MSE. Notice also that this pattern remains as the sample
size increases.
Additionally, to evaluate the performance of the Robust SE of the estimate β̂_1
under IEE and GEE, two scenarios are considered: first, I test H0: β_1 = b_1 given
b_1 = (-1.0, -0.5, 0, 0.5, 1.0) as the true values; second, I test H0: β_1 = 0 given
β_1 = (-1.0, -0.5, 0.5, 1.0) as the true values.
In the first scenario, for J = 50 and H0: β_1 = b_1 given b_1 = (-1.0, -0.5, 0.5, 1.0) as
the true values, Table 4.8 shows that the calculated CP under the GEE Robust SE are
somewhat closer to the limits of the nominal 95% level. The exception occurs only for
b_1 = 1, where the calculated CP under the IEE method is closer to the limits of the
nominal 95% CP, although by an insubstantial amount. Furthermore, for b_1 = 0 under the
GEE Robust SE, the calculated CP equals 95.3%, which is clearly within the limits of the
nominal 95% level; at the same time, under the IEE Robust SE the calculated CP equals
92.5%, which underestimates the limits. As the sample size increases to 100, 200, the
calculated CP under both methods of estimation are within the limits of the nominal 95%
level. The exception occurs only for J = 100 and b_1 = -1.0 under GEE, where the
calculated CP (90.3%) substantially underestimates the limits of the nominal 95% level.
In the second scenario, for J = 50 and H0: β_1 = 0 given β_1 = (-1.0, -0.5, 0.5, 1.0)
as the true values, Table 4.8 shows that the calculated RP are uniformly higher under
GEE than under IEE across the values of β_1 considered. That is, the ability of the test to
recognize that H0: β_1 = 0 is false is substantially larger under GEE than under IEE.
Furthermore, as the sample size increases, the accuracy in hypothesis testing of both
methods of estimation improves substantially and similarly well.
Table 4.1
Averages of the mean parameter and standard errors of the estimate β̂_1: results for the
logistic marginal model, logit(μ_ij) = β_0 + β_1 x_ij, fit by IEE. The binary within-cluster
covariate x_ij equals 1 or 0 with equal probabilities. Population average correlation
δ = ρ² = 0.81.

J = 50, n = 5                        β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.698 (0.066)  0.735 (0.070)  0.802 (0.060)  0.732 (0.071)  0.700 (0.076)
Mean estimate                             -0.641         -0.409         0.005          0.459          0.690
%Δ bias                                   6.80           2.30           0.005          14.80          15.00
Naive SE                                  0.291 (0.020)  0.219 (0.019)  0.305 (0.025)  0.325 (0.036)  0.356 (0.036)
Robust SE                                 0.328 (0.042)  0.310 (0.036)  0.286 (0.047)  0.343 (0.053)  0.373 (0.067)
Monte Carlo SD                            0.353          0.358          0.321          0.437          0.616
Rejection probability for H0: β_1 = 0
  Naive SE                                0.555          0.305          0.067          0.340          0.488
  Robust SE                               0.468          0.280          0.090          0.300          0.438
Coverage probability
  Naive SE                                0.913          0.905          0.933          0.895          0.873
  Robust SE                               0.928          0.905          0.910          0.898          0.888

J = 100, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.703 (0.044)  0.734 (0.043)  0.812 (0.042)  0.733 (0.052)  0.701 (0.053)
Mean estimate                             -0.609         -0.391         -0.004         0.419          0.652
%Δ bias                                   1.50           -2.25          -0.004         4.75           8.67
Naive SE                                  0.198 (0.007)  0.200 (0.008)  0.207 (0.010)  0.228 (0.015)  0.231 (0.014)
Robust SE                                 0.230 (0.018)  0.219 (0.018)  0.200 (0.020)  0.242 (0.028)  0.269 (0.032)
Monte Carlo SD                            0.245          0.224          0.208          0.268          0.282
Rejection probability for H0: β_1 = 0
  Naive SE                                0.830          0.458          0.047          0.500          0.768
  Robust SE                               0.755          0.408          0.073          0.450          0.675
Coverage probability
  Naive SE                                0.898          0.925          0.953          0.893          0.898
  Robust SE                               0.915          0.950          0.927          0.928          0.943

J = 200, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.704 (0.031)  0.735 (0.031)  0.808 (0.028)  0.735 (0.034)  0.707 (0.040)
Mean estimate                             -0.611         -0.396         -0.003         0.410          0.606
%Δ bias                                   1.83           -1.00          0.003          2.50           1.00
Naive SE                                  0.138 (0.003)  0.140 (0.004)  0.144 (0.005)  0.154 (0.007)  0.159 (0.007)
Robust SE                                 0.162 (0.008)  0.156 (0.009)  0.142 (0.010)  0.173 (0.013)  0.187 (0.015)
Monte Carlo SD                            0.164          0.168          0.142          0.184          0.193
Rejection probability for H0: β_1 = 0
  Naive SE                                0.983          0.768          0.045          0.728          0.945
  Robust SE                               0.970          0.715          0.050          0.653          0.908
Coverage probability
  Naive SE                                0.918          0.895          0.955          0.893          0.908
  Robust SE                               0.950          0.940          0.950          0.930          0.948

Note: Numbers in parentheses are the Monte Carlo standard deviations.
Table 4.2
Averages of the mean parameter and standard errors of the estimate β̂_1: results for the
logistic marginal model, logit(μ_ij) = β_0 + β_1 x_ij, fit by IEE and GEE. The binary
within-cluster covariate x_ij equals 1 or 0 with equal probabilities. Population average
correlation δ = ρ² = 0.81.

J = 50, n = 5                        β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.698 (0.066)  0.735 (0.070)  0.802 (0.060)  0.732 (0.071)  0.700 (0.076)
IEE mean estimate                         -0.641         -0.409         0.029          0.459          0.690
  %Δ bias                                 6.80           2.30           0.029          14.80          15.0
GEE mean estimate                         -0.635         -0.410         0.005          0.449          0.693
  %Δ bias                                 5.80           2.50           0.005          12.30          15.5
IEE Robust SE                             0.328 (0.042)  0.310 (0.036)  0.286 (0.047)  0.343 (0.053)  0.373 (0.067)
  IEE MSE                                 0.126          0.128          0.104          0.194          0.388
GEE Robust SE                             0.254 (0.050)  0.223 (0.053)  0.151 (0.045)  0.266 (0.083)  0.321 (0.105)
  GEE MSE                                 0.081          0.060          0.027          0.085          0.175
%Δ MSE                                    -36.0          -53.2          -74.5          -56.0          -54.8
Rejection probability for H0: β_1 = 0
  IEE                                     0.468          0.280          0.090          0.300          0.438
  GEE                                     0.725          0.428          0.050          0.376          0.589
Coverage probability
  IEE                                     0.928          0.905          0.910          0.895          0.888
  GEE                                     0.923          0.923          0.950          0.927          0.917

J = 100, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.703 (0.044)  0.734 (0.043)  0.809 (0.040)  0.733 (0.052)  0.701 (0.053)
IEE mean estimate                         -0.609         -0.391         0.016          0.419          0.652
  %Δ bias                                 1.50           -2.25          0.016          4.75           8.67
GEE mean estimate                         -0.613         -0.400         0.000          0.419          0.642
  %Δ bias                                 2.17           0.25           0.000          4.75           7.00
IEE Robust SE                             0.230 (0.018)  0.219 (0.018)  0.202 (0.022)  0.242 (0.028)  0.269 (0.032)
  IEE MSE                                 0.060          0.050          0.045          0.072          0.082
GEE Robust SE                             0.175 (0.021)  0.156 (0.022)  0.010 (0.018)  0.176 (0.034)  0.215 (0.038)
  GEE MSE                                 0.031          0.023          0.011          0.036          0.047
%Δ MSE                                    -48.98         -53.35         -75.48         -50.27         -42.83
Rejection probability for H0: β_1 = 0
  IEE                                     0.755          0.418          0.0735         0.450          0.675
  GEE                                     0.978          0.763          0.052          0.688          0.90
Coverage probability
  IEE                                     0.915          0.950          0.928          0.928          0.943
  GEE                                     0.953          0.933          0.948          0.925          0.938

J = 200, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.704 (0.044)  0.735 (0.043)  0.808 (0.042)  0.735 (0.052)  0.707 (0.053)
IEE mean estimate                         -0.611         -0.396         -0.003         0.410          0.606
  %Δ bias                                 1.83           -1.00          0.003          2.50           1.00
GEE mean estimate                         -0.605         -0.404         0.004          0.411          0.611
  %Δ bias                                 0.80           1.00           0.004          2.80           1.80
IEE Robust SE                             0.162 (0.008)  0.156 (0.009)  0.142 (0.010)  0.173 (0.013)  0.187 (0.015)
  IEE MSE                                 0.0288         0.028          0.020          0.034          0.037
GEE Robust SE                             0.123 (0.010)  0.111 (0.011)  0.070 (0.008)  0.124 (0.015)  0.145 (0.000)
  GEE MSE                                 0.0147         0.014          0.004          0.018          0.021
%Δ MSE                                    -49.10         -48.98         -78.40         -47.75         -45.10
Rejection probability for H0: β_1 = 0
  IEE                                     0.970          0.715          0.050          0.653          0.908
  GEE                                     0.998          0.960          0.050          0.928          1.000
Coverage probability
  IEE                                     0.950          0.940          0.950          0.930          0.948
  GEE                                     0.950          0.940          0.950          0.945          0.955

Note: Numbers in parentheses are the Monte Carlo standard deviations.
Table 4.3
Averages of the mean parameter and standard errors of the estimate β̂_1: results for the
logistic marginal model, logit(μ_ij) = β_0 + β_1 x_ij, fit by IEE. The binary within-cluster
covariate x_ij equals 1 or 0 with equal probabilities. Population average correlation
δ = ρ² = 0.16.

J = 50, n = 5                        β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.138 (0.062)  0.142 (0.067)  0.158 (0.066)  0.147 (0.072)  0.136 (0.071)
Mean estimate                             -0.656         -0.410         0.014          0.435          0.672
%Δ bias                                   9.30           2.50           0.014          13.0           12.0
Naive SE                                  0.286 (0.010)  0.290 (0.011)  0.301 (0.014)  0.312 (0.021)  0.335 (0.025)
Robust SE                                 0.284 (0.028)  0.284 (0.029)  0.289 (0.033)  0.318 (0.038)  0.338 (0.044)
Monte Carlo SD                            0.296          0.281          0.321          0.367          0.369
Rejection probability for H0: β_1 = 0
  Naive SE                                0.650          0.303          0.060          0.273          0.518
  Robust SE                               0.645          0.300          0.078          0.280          0.515
Coverage probability
  Naive SE                                0.930          0.970          0.940          0.930          0.938
  Robust SE                               0.930          0.963          0.922          0.930          0.938

J = 100, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.156 (0.045)  0.143 (0.047)  0.158 (0.047)  0.141 (0.048)  0.140 (0.048)
Mean estimate                             -0.626         -0.420         0.008          0.425          0.653
%Δ bias                                   4.33           5.00           0.008          6.25           8.83
Naive SE                                  0.197 (0.005)  0.199 (0.005)  0.207 (0.006)  0.220 (0.009)  0.228 (0.011)
Robust SE                                 0.201 (0.014)  0.201 (0.014)  0.203 (0.015)  0.221 (0.018)  0.235 (0.021)
Monte Carlo SD                            0.189          0.204          0.202          0.233          0.248
Rejection probability for H0: β_1 = 0
  Naive SE                                0.913          0.573          0.043          0.480          0.810
  Robust SE                               0.905          0.563          0.045          0.490          0.770
Coverage probability
  Naive SE                                0.953          0.943          0.957          0.930          0.920
  Robust SE                               0.955          0.933          0.955          0.928          0.930

J = 200, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.139 (0.030)  0.143 (0.031)  0.161 (0.033)  0.144 (0.034)  0.137 (0.036)
Mean estimate                             -0.624         -0.407         0.0002         0.390          0.610
%Δ bias                                   3.98           1.65           0.0002         -2.46          1.67
Naive SE                                  0.138 (0.002)  0.139 (0.002)  0.144 (0.003)  0.153 (0.005)  0.159 (0.005)
Robust SE                                 0.142 (0.006)  0.142 (0.007)  0.143 (0.008)  0.157 (0.009)  0.164 (0.010)
Monte Carlo SD                            0.142          0.141          0.137          0.158          0.161
Rejection probability for H0: β_1 = 0
  Naive SE                                0.993          0.845          0.035          0.718          0.968
  Robust SE                               0.990          0.820          0.033          0.695          0.965
Coverage probability
  Naive SE                                0.935          0.948          0.965          0.948          0.945
  Robust SE                               0.940          0.953          0.968          0.953          0.950

Note: Numbers in parentheses are the Monte Carlo standard deviations.
Table 4.4
Averages of the mean parameter and standard errors of the estimate β̂_1: results for the
logistic marginal model, logit(μ_ij) = β_0 + β_1 x_ij, fit by IEE and GEE. The binary
within-cluster covariate x_ij equals 1 or 0 with equal probabilities. Population average
correlation δ = ρ² = 0.16.

J = 50, n = 5                        β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.138 (0.062)  0.142 (0.067)  0.158 (0.066)  0.147 (0.072)  0.136 (0.071)
IEE mean estimate                         -0.656         -0.410         0.014          0.435          0.672
  %Δ bias                                 9.30           2.50           0.014          13.00          12.00
GEE mean estimate                         -0.654         -0.418         0.014          0.440          0.686
  %Δ bias                                 9.00           4.50           0.014          10.00          14.30
IEE Robust SE                             0.284 (0.028)  0.284 (0.029)  0.289 (0.033)  0.318 (0.038)  0.338 (0.044)
  IEE MSE                                 0.091          0.079          0.104          0.135          0.141
GEE Robust SE                             0.278 (0.028)  0.277 (0.0291) 0.282 (0.036)  0.314 (0.042)  0.334 (0.047)
  GEE MSE                                 0.085          0.080          0.102          0.132          0.150
%Δ MSE                                    -6.5           1.26           -1.2           -2.7           5.93
Rejection probability for H0: β_1 = 0
  IEE                                     0.645          0.300          0.078          0.280          0.516
  GEE                                     0.673          0.340          0.085          0.290          0.543
Coverage probability
  IEE                                     0.930          0.963          0.922          0.925          0.940
  GEE                                     0.930          0.955          0.915          0.925          0.925

J = 100, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.146 (0.045)  0.143 (0.047)  0.158 (0.047)  0.141 (0.048)  0.140 (0.048)
IEE mean estimate                         -0.626         -0.420         0.008          0.425          0.653
  %Δ bias                                 4.28           5.01           0.008          6.25           8.85
GEE mean estimate                         -0.628         -0.418         0.004          0.428          0.657
  %Δ bias                                 4.72           4.50           0.004          7.00           9.45
IEE Robust SE                             0.201 (0.014)  0.201 (0.014)  0.203 (0.015)  0.221 (0.018)  0.235 (0.021)
  IEE MSE                                 0.036          0.042          0.041          0.054          0.064
GEE Robust SE                             0.196 (0.014)  0.196 (0.014)  0.197 (0.015)  0.216 (0.018)  0.230 (0.022)
  GEE MSE                                 0.035          0.039          0.039          0.053          0.063
%Δ MSE                                    -1.92          -7.14          -4.88          -1.85          -1.56
Rejection probability for H0: β_1 = 0
  IEE                                     0.905          0.563          0.045          0.490          0.770
  GEE                                     0.918          0.598          0.045          0.515          0.805
Coverage probability
  IEE                                     0.955          0.933          0.955          0.928          0.938
  GEE                                     0.943          0.940          0.955          0.933          0.937

J = 200, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.139 (0.030)  0.143 (0.031)  0.161 (0.033)  0.144 (0.034)  0.137 (0.036)
IEE mean estimate                         -0.624         -0.407         0.00           0.390          0.610
  %Δ bias                                 4.00           1.75           0.00           -2.50          1.67
GEE mean estimate                         -0.624         -0.405         0.004          0.392          0.608
  %Δ bias                                 4.00           1.25           0.004          -2.00          1.33
IEE Robust SE                             0.142 (0.006)  0.142 (0.007)  0.143 (0.008)  0.157 (0.009)  0.164 (0.010)
  IEE MSE                                 0.021          0.020          0.019          0.025          0.026
GEE Robust SE                             0.139 (0.006)  0.138 (0.007)  0.138 (0.008)  0.153 (0.009)  0.160 (0.010)
  GEE MSE                                 0.019          0.019          0.017          0.024          0.025
%Δ MSE                                    -9.52          -10.0          -10.53         -4.0           -3.85
Rejection probability for H0: β_1 = 0
  IEE                                     0.990          0.820          0.032          0.695          0.965
  GEE                                     0.993          0.843          0.035          0.740          0.968
Coverage probability
  IEE                                     0.940          0.953          0.968          0.953          0.950
  GEE                                     0.945          0.955          0.965          0.945          0.953

Note: Numbers in parentheses are the Monte Carlo standard deviations.
Table 4.5
Averages of the mean parameter and standard errors of the estimate β̂_1: results for the
logistic marginal model, logit(μ_ij) = β_0 + β_1 x_j, fit by IEE. The binary cluster
covariate x_j equals 1 or 0 with equal probabilities. Population average correlation
δ = ρ² = 0.64.

J = 50, n = 5                        β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.628 (0.071)  0.627 (0.070)  0.628 (0.079)  0.627 (0.090)  0.633 (0.106)
Mean estimate                             -0.679         -0.406         0.016          0.436          0.791
%Δ bias                                   13.18          1.40           0.016          8.89           31.83
Naive SE                                  0.297 (0.026)  0.300 (0.026)  0.310 (0.024)  0.333 (0.036)  0.781 (7.950)
Robust SE                                 0.530 (0.044)  0.536 (0.042)  0.553 (0.042)  0.589 (0.055)  0.604 (0.070)
Monte Carlo SD                            0.586          0.546          0.572          0.651          1.130
Rejection probability for H0: β_1 = 0
  Naive SE                                0.593          0.415          0.283          0.413          0.498
  Robust SE                               0.258          0.093          0.055          0.130          0.225
Coverage probability
  Naive SE                                0.698          0.720          0.718          0.708          0.693
  Robust SE                               0.920          0.955          0.945          0.928          0.920

J = 100, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.635 (0.049)  0.634 (0.049)  0.638 (0.053)  0.638 (0.056)  0.635 (0.058)
Mean estimate                             -0.628         -0.396         0.005          0.434          0.629
%Δ bias                                   4.67           1.00           0.005          8.48           4.83
Naive SE                                  0.201 (0.001)  0.202 (0.009)  0.210 (0.011)  0.223 (0.015)  0.231 (0.018)
Robust SE                                 0.370 (0.019)  0.372 (0.018)  0.387 (0.021)  0.411 (0.027)  0.434 (0.031)
Monte Carlo SD                            0.377          0.384          0.393          0.444          0.446
Rejection probability for H0: β_1 = 0
  Naive SE                                0.733          0.550          0.2925         0.510          0.648
  Robust SE                               0.390          0.185          0.055          0.203          0.320
Coverage probability
  Naive SE                                0.703          0.690          0.708          0.635          0.705
  Robust SE                               0.950          0.943          0.945          0.928          0.950

J = 200, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.638 (0.035)  0.637 (0.036)  0.637 (0.037)  0.638 (0.041)  0.637 (0.043)
Mean estimate                             -0.585         -0.389         0.002          0.395          0.604
%Δ bias                                   2.50           2.75           0.002          1.25           0.667
Naive SE                                  0.139 (0.004)  0.140 (0.004)  0.146 (0.005)  0.155 (0.007)  0.160 (0.009)
Robust SE                                 0.258 (0.007)  0.261 (0.008)  0.271 (0.010)  0.288 (0.012)  0.298 (0.015)
Monte Carlo SD                            0.251          0.257          0.257          0.302          0.309
Rejection probability for H0: β_1 = 0
  Naive SE                                0.898          0.683          0.258          0.660          0.833
  Robust SE                               0.650          0.305          0.043          0.278          0.528
Coverage probability
  Naive SE                                0.715          0.710          0.7425         0.7025         0.683
  Robust SE                               0.965          0.958          0.9575         0.9425         0.948

Note: Numbers in parentheses are the Monte Carlo standard deviations.
Table 4.6
Averages of the mean parameter and standard errors of the estimate β̂_1: results for the
logistic marginal model, logit(μ_ij) = β_0 + β_1 x_j, fit by IEE and GEE. The binary
cluster covariate x_j equals 1 or 0 with equal probabilities. Population average
correlation δ = ρ² = 0.64.

J = 50, n = 5                        β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.628 (0.071)  0.627 (0.070)  0.628 (0.079)  0.627 (0.090)  0.633 (0.106)
IEE mean estimate                         -0.679         -0.406         0.016          0.436          0.689
  %Δ bias                                 13.18          1.40           0.016          8.89           14.83
GEE mean estimate                         -0.747         -0.454         0.025          0.472          0.757
  %Δ bias                                 24.50          13.50          0.025          18.00          26.0
IEE Robust SE                             0.530 (0.044)  0.536 (0.042)  0.553 (0.042)  0.589 (0.055)  0.604 (0.070)
  IEE MSE                                 0.349          0.298          0.327          0.425          0.653
GEE Robust SE                             0.558 (0.097)  0.585 (0.294)  0.596 (0.096)  0.684 (0.356)  0.822 (1.313)
  GEE MSE                                 0.477          0.453          0.471          0.626          0.789
%Δ MSE                                    36.68          52.18          43.98          47.35          20.83
Rejection probability for H0: β_1 = 0
  IEE                                     0.258          0.093          0.055          0.130          0.225
  GEE                                     0.298          0.130          0.070          0.112          0.180
Coverage probability
  IEE                                     0.920          0.955          0.945          0.930          0.930
  GEE                                     0.907          0.948          0.928          0.903          0.880

J = 100, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.635 (0.049)  0.634 (0.049)  0.638 (0.053)  0.638 (0.056)  0.635 (0.058)
IEE mean estimate                         -0.628         -0.396         0.005          0.434          0.629
  %Δ bias                                 4.67           1.00           0.005          8.48           4.83
GEE mean estimate                         -0.651         -0.412         0.009          0.460          0.679
  %Δ bias                                 8.50           3.00           0.009          15.00          13.17
IEE Robust SE                             0.370 (0.019)  0.372 (0.018)  0.387 (0.021)  0.411 (0.027)  0.434 (0.031)
  IEE MSE                                 0.143          0.148          0.154          0.198          0.200
GEE Robust SE                             0.375 (0.024)  0.378 (0.023)  0.395 (0.026)  0.427 (0.038)  0.450 (0.097)
  GEE MSE                                 0.159          0.160          0.171          0.241          0.295
%Δ MSE                                    11.19          8.11           11.04          21.72          47.50
Rejection probability for H0: β_1 = 0
  IEE                                     0.390          0.185          0.055          0.203          0.320
  GEE                                     0.410          0.193          0.048          0.208          0.323
Coverage probability
  IEE                                     0.950          0.943          0.945          0.928          0.950
  GEE                                     0.940          0.935          0.953          0.930          0.940

J = 200, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.638 (0.035)  0.637 (0.036)  0.637 (0.037)  0.638 (0.041)  0.637 (0.043)
IEE mean estimate                         -0.585         -0.389         0.002          0.395          0.604
  %Δ bias                                 2.50           2.75           0.002          1.25           0.667
GEE mean estimate                         -0.595         -0.398         0.001          0.406          0.623
  %Δ bias                                 0.83           0.50           0.001          1.75           3.83
IEE Robust SE                             0.258 (0.007)  0.261 (0.008)  0.271 (0.010)  0.288 (0.012)  0.298 (0.015)
  IEE MSE                                 0.029          0.028          0.020          0.034          0.037
GEE Robust SE                             0.259 (0.009)  0.262 (0.009)  0.274 (0.011)  0.292 (0.014)  0.305 (0.020)
  GEE MSE                                 0.068          0.070          0.070          0.097          0.108
%Δ MSE                                    6.35           5.61           5.47           6.87           12.44
Rejection probability for H0: β_1 = 0
  IEE                                     0.650          0.305          0.0425         0.278          0.528
  GEE                                     0.660          0.310          0.045          0.280          0.533
Coverage probability
  IEE                                     0.965          0.958          0.9575         0.9425         0.948
  GEE                                     0.960          0.955          0.955          0.940          0.933

Note: Numbers in parentheses are the Monte Carlo standard deviations.
Table 4.7
Averages of the mean parameter and standard errors of the estimate β̂_1: results for the
logistic marginal model, logit(μ_ij) = β_0 + β_1 x_j, fit by IEE and GEE. The binary
cluster covariate x_j equals 1 or 0 with equal probabilities. Population average
correlation δ = ρ² = 0.16.

J = 50, n = 5                        β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.152 (0.0651) 0.145 (0.064)  0.147 (0.065)  0.154 (0.069)  0.154 (0.078)
IEE mean estimate                         -0.651         -0.418         0.003          0.446          0.645
  %Δ bias                                 8.41           12.13          0.003          11.58          7.473
GEE mean estimate                         -0.666         -0.428         0.003          0.464          0.657
  %Δ bias                                 10.97          6.90           0.003          16.01          9.55
IEE Robust SE                             0.350 (0.032)  0.353 (0.034)  0.367 (0.034)  0.389 (0.039)  0.406 (0.047)
  IEE MSE                                 0.140          0.134          0.180          0.222          0.218
GEE Robust SE                             0.353 (0.035)  0.357 (0.037)  0.374 (0.140)  0.400 (0.048)  0.432 (0.061)
  GEE MSE                                 0.154          0.146          0.195          0.256          0.249
%Δ MSE                                    10.43          8.45           8.468          15.51          13.78
Rejection probability for H0: β_1 = 0
  IEE                                     0.448          0.205          0.108          0.258          0.348
  GEE                                     0.453          0.213          0.118          0.258          0.330
Coverage probability
  IEE                                     0.943          0.940          0.893          0.898          0.898
  GEE                                     0.935          0.935          0.885          0.888          0.899

J = 100, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.155 (0.042)  0.155 (0.046)  0.158 (0.046)  0.156 (0.052)  0.150 (0.049)
IEE mean estimate                         -0.604         -0.431         0.019          0.381          0.624
  %Δ bias                                 0.73           7.68           0.019          -4.85          4.01
GEE mean estimate                         -0.610         -0.435         0.020          0.386          0.634
  %Δ bias                                 1.62           8.80           0.020          -3.58          5.65
IEE Robust SE                             0.247 (0.014)  0.250 (0.015)  0.260 (0.015)  0.275 (0.019)  0.285 (0.021)
  IEE MSE                                 0.072          0.067          0.067          0.087          0.096
GEE Robust SE                             0.248 (0.015)  0.251 (0.016)  0.262 (0.016)  0.278 (0.021)  0.289 (0.024)
  GEE MSE                                 0.073          0.069          0.068          0.091          0.102
%Δ MSE                                    2.52           3.31           2.70           3.78           6.37
Rejection probability for H0: β_1 = 0
  IEE                                     0.773          0.563          0.108          0.413          0.735
  GEE                                     0.670          0.445          0.060          0.293          0.590
Coverage probability
  IEE                                     0.945          0.950          0.938          0.920          0.930
  GEE                                     0.948          0.943          0.940          0.930          0.928

J = 200, n = 5                       β_1:  -0.60          -0.40          0.00           0.40           0.60
Avg. pairwise correlation δ̂               0.122 (0.060)  0.156 (0.031)  0.157 (0.032)  0.156 (0.035)  0.156 (0.156)
IEE mean estimate                         -0.602         -0.416         -0.002         0.412          0.596
  %Δ bias                                 0.26           3.98           -0.002         2.93           0.60
GEE mean estimate                         -0.604         -0.418         -0.003         0.414          0.600
  %Δ bias                                 0.63           4.47           0.003          3.52           0.05
IEE Robust SE                             0.174 (0.008)  0.176 (0.007)  0.183 (0.007)  0.193 (0.009)  0.200 (0.010)
  IEE MSE                                 0.031          0.029          0.035          0.040          0.038
GEE Robust SE                             0.175 (0.008)  0.177 (0.007)  0.183 (0.008)  0.194 (0.009)  0.202 (0.010)
  GEE MSE                                 0.032          0.030          0.036          0.041          0.039
%Δ MSE                                    0.96           0.34           1.36           1.73           2.29
Rejection probability for H0: β_1 = 0
  IEE                                     0.913          0.655          0.050          0.568          0.858
  GEE                                     0.913          0.653          0.050          0.568          0.858
Coverage probability
  IEE                                     0.918          0.948          0.950          0.935          0.963
  GEE                                     0.918          0.945          0.950          0.935          0.963

Note: Numbers in parentheses are the Monte Carlo standard deviations.
Table 4.8
Averages of the mean parameter and standard errors of the estimate β̂_1: results for the
logistic marginal model, logit(μ_ij) = β_0 + β_1 x_ij, fit by IEE and GEE. The within-cluster
covariate x_ij ~ N(0, 1). Population average correlation δ = ρ² = 0.81.

J = 50, n = 5                        β_1:  -1.00          -0.50          0.00           0.50           1.00
Avg. pairwise correlation δ̂               0.505 (0.085)  0.627 (0.083)  0.806 (0.063)  0.624 (0.074)  0.506 (0.086)
IEE mean estimate                         -1.093         -0.543         0.019          0.535          1.085
  %Δ bias                                 9.28           8.66           0.019          6.91           8.52
GEE mean estimate                         -1.104         -0.551         0.001          0.535          1.102
  %Δ bias                                 10.44          10.23          0.001          7.03           10.20
IEE Robust SE                             0.213 (0.033)  0.175 (0.026)  0.141 (0.023)  0.172 (0.024)  0.216 (0.044)
  IEE MSE                                 0.085          0.044          0.021          0.039          0.090
GEE Robust SE                             0.206 (0.044)  0.154 (0.035)  0.072 (0.025)  0.152 (0.035)  0.208 (0.070)
  GEE MSE                                 0.075          0.035          0.006          0.027          0.089
%Δ MSE                                    -11.69         -20.41         -73.37         -31.27         -1.11
Rejection probability for H0: β_1 = 0
  IEE                                     1.00           0.863          0.075          0.870          1.00
  GEE                                     1.00           0.970          0.047          0.963          1.00
Coverage probability
  IEE                                     0.893          0.918          0.925          0.915          0.900
  GEE                                     0.900          0.920          0.953          0.920          0.883

J = 100, n = 5                       β_1:  -1.00          -0.50          0.00           0.50           1.00
Avg. pairwise correlation δ̂               0.505 (0.059)  0.621 (0.049)  0.807 (0.044)  0.625 (0.050)  0.495 (0.054)
IEE mean estimate                         -1.035         -0.528         0.007          0.517          1.055
  %Δ bias                                 3.52           5.65           0.007          3.34           5.46
GEE mean estimate                         -1.043         -0.523         0.007          0.517          1.057
  %Δ bias                                 4.25           4.63           0.007          3.29           5.73
IEE Robust SE                             0.153 (0.017)  0.123 (0.012)  0.102 (0.012)  0.123 (0.012)  0.152 (0.016)
  IEE MSE                                 0.029          0.017          0.011          0.016          0.029
GEE Robust SE                             0.142 (0.018)  0.106 (0.013)  0.050 (0.011)  0.105 (0.014)  0.141 (0.017)
  GEE MSE                                 0.026          0.012          0.003          0.013          0.027
%Δ MSE                                    -10.69         -25.46         -74.07         -22.84         -6.62
Rejection probability for H0: β_1 = 0
  IEE                                     1.00           0.995          0.058          0.993          1.00
  GEE                                     1.00           1.00           0.060          1.00           1.00
Coverage probability
  IEE                                     0.928          0.935          0.942          0.948          0.940
  GEE                                     0.903          0.938          0.940          0.955          0.933

J = 200, n = 5                       β_1:  -1.00          -0.50          0.00           0.50           1.00
Avg. pairwise correlation δ̂               0.501 (0.041)  0.625 (0.037)  0.809 (0.030)  0.623 (0.038)  0.501 (0.038)
IEE mean estimate                         -1.013         -0.508         0.002          0.513          1.012
  %Δ bias                                 1.29           1.58           0.002          2.60           1.24
GEE mean estimate                         -1.014         -0.509         0.0006         0.515          1.015
  %Δ bias                                 1.42           1.74           0.0006         2.94           1.46
IEE Robust SE                             0.107 (0.009)  0.087 (0.006)  0.072 (0.006)  0.087 (0.007)  0.107 (0.008)
  IEE MSE                                 0.013          0.009          0.005          0.009          0.012
GEE Robust SE                             0.098 (0.009)  0.073 (0.006)  0.035 (0.005)  0.074 (0.007)  0.098 (0.009)
  GEE MSE                                 0.011          0.006          0.001          0.006          0.011
%Δ MSE                                    -15.50         -33.33         -80.0          -33.33         -8.33
Rejection probability for H0: β_1 = 0
  IEE                                     1.00           1.00           0.068          0.998          1.00
  GEE                                     1.00           1.00           0.038          1.00           1.00
Coverage probability
  IEE                                     0.933          0.920          0.932          0.948          0.945
  GEE                                     0.933          0.940          0.962          0.928          0.933

Note: Numbers in parentheses are the Monte Carlo standard deviations.
Chapter 5
SUMMARY
5.1 Overview
In many instances, educational researchers are interested in determining how a
particular school program or policy that involves changing one of the observable
variables, X, will affect the average population response. For example: What is the
effect of a policy variable, pre-school experience, on the probability of dropping out of
school for the entire population? Here, the interest is in quantifying uncertainty about
future observables on the basis of known information.
Diggle et al. (1994) describe population-averaged or marginal models that can be
used to model clustered response data. Under the marginal modeling approach for
correlated binary response data, one can model the marginal probabilities given only the
observables, X, rather than the conditional probabilities given both the unobserved random
effects and the observables, X. Furthermore, Liang and Zeger (1986) and Zeger et al.
(1988) describe estimating equations (EE) methods of estimation for fitting either
marginal or random effects models to correlated binary responses. Most importantly, the
marginal model approach via EE methods of estimation produces consistent estimates of
the regression parameters and of their variances, depending only on the correct
specification of the marginal probabilities for the binary outcomes, not on the correct
choice of the correlation structure of the data.
However, whereas the asymptotic properties of the EE estimators are well
understood, their finite sample properties are not well known. For this reason, in this
thesis I examined the finite sample properties of the independence estimating equations
(IEE) and the generalized estimating equations (GEE) methods of estimation via Monte
Carlo simulation. A natural way to evaluate the finite sample properties of IEE and GEE
in estimating the regression parameters for logistic marginal models is to simulate
correlated binary response data with known logistic marginal response probabilities.
Thus, a major effort in this thesis focused on developing a data generating mechanism
(DGM) to simulate correlated binary responses (see Chapter 2). This DGM approach to
generating correlated binary responses has the advantage over the TRELM formulation
in that the marginal probabilities, conditional only on the covariates, have a simple
logistic form. Using this logistic marginal model to analyze clustered binary responses
via EE is very appealing since, unlike the TRELM formulation, it does not require
numerical integration in evaluating the EE.
In the following, I summarize the results that illustrate the finite sample
performance of IEE and GEE when used to fit the marginal logistic model,
logit(μ_ij) = β_0 + β_1 x_ij, formulated for the sets of correlated binary responses.
Specifically, I compare the IEE and GEE estimators of the mean parameter β_1 in terms of
their bias, their efficiency, and their accuracy in testing simple hypotheses about β_1.
5.2 Parameter Estimation
5.2.1 Bias
• For high and low correlated binary responses; x_ij a binary within-cluster covariate.
For J = 50 and highly correlated binary responses, the effect of the covariate, x_ij,
was overestimated in absolute value by both methods, IEE and GEE, for all the values of
β_1 estimated. In particular, for β_1 = 0 the bias of the estimate β̂_1 under GEE was
notably low and substantially lower than the bias under IEE; for β_1 = (-0.6, -0.4) the
estimate β̂_1, under both methods, was nearly unbiased; while for β_1 = (0.6, 0.4) the
biases were similar and rather substantial. Similarly, for J = 50 and low correlated
binary responses, the effect of the covariate, x_ij, was overestimated in absolute value.
The biases of the estimate β̂_1, under both methods, were somewhat high and comparable
across all the values of β_1 estimated. Note that for β_1 = 0 and highly correlated binary
responses, the bias of the GEE estimate β̂_1 was substantially lower than the bias of the
GEE estimate β̂_1 for low correlated binary responses. Finally, as expected, for high and
low correlated binary responses and cluster size n fixed at 5, as the sample size increased,
the biases of the estimate β̂_1 decreased substantially for all the values of β_1 estimated.
• For high and low correlated binary responses; x_j a binary cluster-level covariate.
For J = 50 and highly correlated binary responses, the effect of the covariate, x_j,
was grossly overestimated by both methods, IEE and GEE, for all the values of β_1
estimated. It is noteworthy that the bias of the GEE estimate was at least 10% larger than
the bias of the IEE estimate for each of the values of β_1 estimated. As expected, as
the sample size increased, the bias of the GEE estimate β̂_1 decreased considerably, but it
remained substantially larger than the bias of the IEE estimate for each of the values of β_1
estimated. On the other hand, for low correlated binary responses, the IEE and GEE
estimators were equally biased and moderately overestimated the regression parameter
β_1 in absolute value. Furthermore, as the sample size increased, the bias under both
methods decreased substantially.
• For highly correlated binary responses; x_ij a standard normal within-cluster covariate.
For J = 50, the bias of the GEE estimate β̂_1 for β_1 = 0 was very low, and the
GEE estimate was substantially less biased than the IEE estimate. However, for β_1 ≠ 0
the GEE estimate β̂_1 was slightly more biased than the IEE estimate. As the sample size
increased, the biases under both methods of estimation were very similar, and they
decreased substantially.
5.2.2 Efficiency
• For high and low correlated binary responses; x_ij a binary within-cluster covariate.
For J = 50 and highly correlated binary responses, the GEE MSE of β̂_1 ranged
from 36% to 74% lower than the IEE MSE of β̂_1 for all the values of β_1 estimated.
Furthermore, this pattern remained as J increased. This shows that the estimation of the
population average correlation, δ, has less effect in larger sample sizes. Thus, under the
MSE criterion and the above simulated conditions, the GEE estimator of β_1 was more
efficient than the IEE estimator. On the other hand, for J = 50 and low correlated binary
responses, the GEE MSE and the IEE MSE of the estimate β̂_1 were very close;
however, in most cases the GEE MSE were somewhat smaller. As the sample size
increased, the similarity between both MSE was sustained across all the values of β_1
estimated.
• For high and low correlated binary responses; x_j a binary cluster-level covariate.
For J = 50 and highly correlated binary responses, when estimating the effect of
the covariate, x_j, the GEE MSE of β̂_1 was substantially higher than the IEE MSE for all
the values of β_1 estimated; the GEE MSE of β̂_1 ranged from 20.83% to 52.2% higher
than the IEE MSE. As expected, as the sample size increased, both MSE decreased
substantially; however, the GEE MSE was still somewhat higher than the IEE MSE. On
the other hand, for J = 50 and low correlated binary responses, the GEE MSE were also
higher than the IEE MSE across all the values of β_1 estimated. As the sample size
increased, the MSE under both methods decreased substantially and were nearly identical.
Thus, under the MSE criterion and the above simulated conditions, the IEE estimator
performed better than the GEE estimator.
• For highly correlated binary responses; x_ij a standard normal within-cluster covariate.
For J = 50 and highly correlated binary responses, when estimating the effect of
the covariate, x_ij, the GEE MSE of the estimate of β1 was substantially lower than the IEE MSE.
Remarkably, for β1 = 0, the GEE MSE of the estimate of β1 was 71.43%, 72.73%, and
80% lower than the IEE MSE for J = 50, 100, and 200, respectively. As expected, as the
sample size increased for the other values of β1 estimated, both MSE decreased substantially;
however, the IEE MSE remained somewhat higher than the GEE MSE. Clearly, under
the MSE criterion and the above simulated conditions, the GEE estimator performed better
than the IEE estimator.
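As noted at the end of the first bullet above, the following sketch illustrates one way the MSE comparisons and the GEE-to-IEE relative efficiency can be computed. It assumes the same hypothetical ESTIMATES dataset described earlier (one record per replicate with METHOD, B1, and BETA1HAT) and is an illustration only, not the code used in the study.

/* Minimal sketch: MSE of the estimate of beta1 and the GEE-to-IEE relative   */
/* efficiency, using the same hypothetical ESTIMATES dataset as above.        */
data sqerr;
  set estimates;
  sqerr = (beta1hat - b1)**2;         /* squared error for one replicate      */
run;

proc means data=sqerr noprint;
  class b1 method;
  var sqerr;
  output out=mse mean=mse;            /* cell means = empirical MSE           */
run;

proc transpose data=mse(where=(_type_=3)) out=msewide;
  by b1;                              /* one row per true value of beta1      */
  id method;                          /* columns named GEE and IEE            */
  var mse;
run;

data releff;
  set msewide;
  ratio = gee / iee;                  /* values below 1 favor GEE             */
run;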
5.2.3 Accuracy in Hypothesis Testing
In evaluating the accuracy of the Robust SE of the estimate of β1 under IEE and
GEE, two hypothesis scenarios were considered: first, I tested H0: β1 = b_j for a given
range of true values of b_j; second, I tested H0: β1 = 0 for a given range of true values
b_j ≠ 0.
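A minimal sketch of these two accuracy criteria is given below, assuming a hypothetical ESTIMATES dataset with one record per replicate containing METHOD, B1 (the true value b_j), BETA1HAT, and SEROB (the Robust SE of the estimate); the 1.96 critical value corresponds to the nominal 95% level. This illustrates the criteria only and is not the code used in the study.

/* Minimal sketch of the two accuracy criteria. ESTIMATES, METHOD, B1,        */
/* BETA1HAT, and SEROB (the Robust SE) are hypothetical placeholder names.    */
data tests;
  set estimates;
  z_true = (beta1hat - b1) / serob;   /* Wald statistic against the true value */
  cover  = (abs(z_true) <= 1.96);     /* 1 if the 95% interval covers b_j      */
  z_null = beta1hat / serob;          /* Wald statistic for H0: beta1 = 0      */
  reject = (abs(z_null) > 1.96);      /* 1 if H0: beta1 = 0 is rejected        */
run;

proc means data=tests mean;
  class method b1;
  var cover reject;                   /* mean of COVER = CP, mean of REJECT = RF */
run;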
• For highly correlated binary responses; x_ij a binary within-cluster covariate
In the first scenario, for J = 50 and 100, when testing H0: β1 = 0 given b_j = 0 as the
true value, the calculated CP under GEE equalled 95% and 95.2%, clearly within
the limits of the nominal 95% level. Under IEE, however, the calculated CP equalled
91% and 92.7%, underestimating the limits of the nominal 95% level. Furthermore, for
the other values of β1 estimated, the calculated CP under GEE were closer to the limits
of the nominal 95% level than under IEE. As expected, as the sample size increased, the
calculated CP under GEE were within the limits of the nominal 95% level in almost
all cases; even for the exception, the calculated CP was closer to the limits of the
nominal 95% level than under IEE.
In the second scenario, when testing H0: β1 = 0 given b_j = (−0.6, −0.4, 0.4, 0.6)
as the true values, the ability of the test to recognize a false H0: β1 = 0 was substantially
higher under GEE than under IEE. These results show that, for the above simulated
conditions, the GEE method outperformed the IEE method in terms of accuracy in testing
simple hypotheses about the mean parameter β1. For low correlated binary responses,
for J = 50 and H0: β1 = 0 given b_j = (−0.6, −0.4, 0.4, 0.6) as the true values, the ability
of the test to recognize a false H0: β1 = 0 was higher under GEE than under IEE; at the
same time, the calculated CP were clearly within the limits of the nominal 95% level for
almost all the values of β1 estimated. The only exceptions occurred for b_j = 0, where the
calculated CP (92.2% and 91.5%) slightly underestimated the limits of the nominal 95%
level under IEE and GEE, respectively. Furthermore, as the sample size increased, the
ability of the test to recognize a false H0: β1 = 0 remained higher under GEE than under
IEE, while the calculated CP remained clearly within the limits of the
nominal 95% level for all values of b_j. The above results show that, even for low
correlated binary responses, the GEE method retained an advantage over the IEE method.
• For highly correlated binary responses; x_j a binary cluster-level covariate
In the first scenario, for J = 50, when testing H0: β1 = b_j given
b_j = (−0.6, −0.4, 0, 0.4, 0.6) as the true values, the calculated CP were within, or closer to,
the limits of the nominal 95% level under IEE than under GEE. In the second scenario,
for J = 50, when testing H0: β1 = 0 given b_j = (−0.6, −0.4, 0.4, 0.6) as the true values, the
ability of the test to recognize the false null hypothesis H0: β1 = 0 was better under IEE
than under GEE for most values of b_j considered. However, as the sample size
increased, the proportion of times that GEE recognized the false H0: β1 = 0 became somewhat
higher than under IEE across all the values of b_j.
• For highly correlated binary responses; x_ij a standard normal within-cluster covariate.
To evaluate the performance of the Robust SE of the estimates of β1 under IEE
and GEE, two scenarios were considered: first, I tested H0: β1 = b_j given
b_j = (−1.0, −0.5, 0, 0.5, 1.0) as the true values; second, I tested H0: β1 = 0 given
b_j = (−1.0, −0.5, 0.5, 1.0) as the true values.
In the first scenario, for J = 50 and H0: β1 = b_j given b_j = (−1.0, −0.5, 0, 0.5, 1.0) as
the true values, the calculated CP under the GEE Robust SE were somewhat closer to the
limits of the nominal 95% level (93%, 97%). The only exception occurred for b_j = 1,
where the calculated CP under the IEE method was closer to the limits of the nominal 95%
level, although by an insubstantial amount. Furthermore, for b_j = 0 under the GEE Robust
SE, the calculated CP equalled 95.3%, clearly within the limits of the nominal 95% level;
at the same time, under the IEE Robust SE the calculated CP equalled 92.5%, which
underestimated the limits. As the sample size increased, the calculated CP under both
methods of estimation were within the limits of the nominal 95% level.
In the second scenario, for J = 50 and H0: β1 = 0 given b_j = (−1.0, −0.5, 0.5, 1.0)
as the true values, the calculated RF were uniformly higher under GEE than under IEE
across the values of b_j considered; that is, the ability of the test to recognize that
H0: β1 = 0 is false was substantially greater under GEE than under IEE. Furthermore, as
the sample size increased, the accuracy in hypothesis testing of both methods of
estimation improved substantially and similarly well.
5.3 Further Research
In the present study, for high and low population intracluster correlation of the
binary responses, I compared the finite sample performance of IEE to that of GEE when
estimating the effects of a binary within-cluster covariate with equal probabilities, a
standard normal within-cluster covariate, and a binary cluster-level covariate, also with
equal probabilities. The cluster size n was fixed at 5, while the sample size, J,
increased from 50 to 100 and 200.
The identity and a constant working correlation matrix were specified in the
estimation of the mean parameter, β1, via IEE and GEE, respectively. Further research
examining the performance of IEE relative to GEE for a severely skewed distribution of a
continuous within-cluster covariate across clusters, as well as for disproportionate patterns
across clusters of a binary within-cluster covariate, would be important to ascertain the
efficiency of the IEE approach compared to the GEE approach under the constant working
correlation matrix used to fit a marginal logistic model to correlated binary responses.
It would also be helpful to study how the performance of both approaches is affected by
increasing the cluster size while holding the sample size constant. Furthermore, studying
how the distribution of the cluster sizes affects the performance of both approaches
would provide further insight into the efficiency of the IEE and GEE methods
of estimation.
In general, as shown in Equation 2.1, the true correlation matrix under the DGM
is a function of the unknown parameter p and the covariates. However, a constant
working correlation matrix was specified in this study in solving the GEE. This constant
working correlation matrix is easy to compute and does not depend on the covariates, but
it is not entirely general. Thus, future empirical work ought to specify a more general
working correlation matrix, R_i(δ), that depends on the covariates, and should compare the
performance of the IEE under the identity working correlation matrix to that of the GEE under
the more general working correlation matrix across the various simulation conditions
suggested above. Furthermore, because of the complexity of implementing this more
general correlation matrix, the performance of the GEE method under the constant
working correlation matrix should also be compared to that of the GEE under the more general
correlation matrix.
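For reference, the sketch below contrasts the identity working correlation used by IEE with a constant (exchangeable) working correlation of the kind used by GEE in this study; the cluster size and the value assigned to delta are illustrative placeholders rather than estimates from the simulations.

proc iml;
  /* Illustrative only: identity versus constant (exchangeable) working       */
  /* correlation for one cluster of size n; the value of delta is a           */
  /* placeholder, not an estimate from the simulations.                       */
  n     = 5;
  delta = 0.3;
  R_iee = I(n);                                 /* identity working correlation */
  R_gee = (1-delta)*I(n) + delta*j(n,n,1);      /* constant off-diagonal delta  */
  print R_iee R_gee;
quit;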
APPENDIX
PROGRAM 1
This SAS macro contains a SAS/IML program that creates "test," a SAS
dataset of correlated binary responses generated by the DGM under the assumptions
specified in Section 2.2. It then invokes the GEE SAS/IML macro to analyze the dataset
of correlated binary responses in two instances: first, the GEE SAS/IML macro is
invoked with the identity matrix specified as the working correlation matrix; second, the
GEE SAS/IML macro is invoked with a constant correlation matrix specified as the
working correlation matrix. In the first instance, the GEE macro output includes the IEE
estimates of the regression parameters, along with the naive and robust estimates of the
variance of those estimates. In the second instance, the output includes the GEE
estimates of the regression parameters, along with the naive and robust estimates of the
variance of those estimates.
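For orientation, a hypothetical invocation of the GENDATA macro is shown below; the argument values and seeds are illustrative only, corresponding to J = 50 clusters of size n = 5, k = 3 columns in the working matrix, and p = 0.8 for the probability that, in the code below, determines whether the shared cluster-level error is used.

/* Hypothetical invocation only; argument values and seeds are illustrative. */
%GENDATA(50,5,3,0.8,1237,4561,7893,9011);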
%macro GENDATA(J,n,k,p,seed1,seed2,seed3,seed4);
%* This segment of the macro program creates "test", a SAS dataset of ;
%* correlated binary responses under the DGM assumptions specified in ;
%* Section 2.2.                                                        ;
proc iml;
  i=0;                            /* row counter for the output matrix            */
  beta0=0;                        /* true intercept                               */
  beta1=0;                        /* true slope for the within-cluster covariate  */
  t=&j*&n;                        /* total number of observations                 */
  m=j(t,&k,0);                    /* columns: covariate, response, cluster id     */
  do school=1 to &j;
    uj=ranuni(&seed1);
    rj=(uj/(1-uj));
    theta=log(rj);                /* cluster-level standard logistic error        */
    do student=1 to &n;
      i=i+1;
      m[i,1]=ranbin(&seed2,1,.5); /* binary within-cluster covariate, Pr = .5     */
      ui=ranuni(&seed3);
      ri=(ui/(1-ui));
      u=log(ri);                  /* individual-level standard logistic error     */
      c=ranbin(&seed4,1,&p);      /* with probability p use the shared error      */
      if c=1 then do;
        v=beta0+m[i,1]*beta1+theta;
        if v > 0 then y=1;
        else y=0;
      end;
      if c=0 then do;
        w=beta0+m[i,1]*beta1+u;
        if w > 0 then y=1;
        else y=0;
      end;
      predict=m[i,1];             /* the covariate column is renamed PREDICT below */
      m[i,2]=y;                   /* binary response                               */
      m[i,3]=school;              /* cluster identifier                            */
    end;
  end;
  create a from m;                /* write the matrix out as SAS dataset A         */
  setout a;
  append from m;
quit;
run;
***proc print data=a;
data test;
  set a;
  rename col1 = predict
         col2 = y
         col3 = school;
  intercpt=1;                     /* constant column for the intercept             */
***proc print data = test;
run;
%* Assuming independence of the binary responses, the SAS/IML GEE macro ;
%* is invoked. The input to the GEE macro is "test," the SAS dataset    ;
%* created above. The output contains the IEE estimates of the          ;
%* regression parameters, as well as the model-based and robust         ;
%* estimates of the variance of those estimates.                        ;
%gee(data=test,
     yvar=y,
     xvar=intercpt predict,
     id=school,
     link=3,
     vari=3,
     het=1,
     matrix=residual,
     est_stor=mat,
     out=residual.dat);
run;
%* Based on the IEE residuals, the pairwise population average correlation ;
%* is estimated as specified in Equation 3.3.                              ;
data _null_;
  set residual.dat;
  file xyz;
  put fit 10-20 .8 res 24-34 .8;     /* write the fitted values and residuals */
run;
data summary;
  array f{5} f1 f2 f3 f4 f5;         /* fitted values for one cluster of size 5 */
  array r{5} r1 r2 r3 r4 r5;         /* residuals for the same cluster          */
  delta1=0;
  infile xyz;
  input #1 f1 10-20 .8 r1 24-34 .8;
  input #2 f2 10-20 .8 r2 24-34 .8;
  input #3 f3 10-20 .8 r3 24-34 .8;
  input #4 f4 10-20 .8 r4 24-34 .8;
  input #5 f5 10-20 .8 r5 24-34 .8;
  do i=1 to 5;
    do j=1 to 5;
if(i