emu: w‘...w...4uq a? u. (L J... ( i THESlS 2. 9i .i j“ > Illufllulijlfifimn‘llTWIEHIHHIMI 01772 1147 This is to certify that the dissertation entitled Estimating Logistic Regression Models for Correlated Binary Outcomes presented by Adolfo Navarro has been accepted towards fulfillment of the requirements for Ph.D. degree in macs and Research Design WK («7% Major professor Dateflujc—ut 257‘ (‘77? MS U is an Affirmative Action/Equal Opportunity Institution 0-12771 __———'-_~_-_—_'.— _A . _fl.-_ ~._——"——.- LIBRARY Michigan State University PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINE return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE FEB 1* O‘zbm 118 mu ESTIMATING LOGISTIC REGRESSION MODELS FOR CORRELATED BINARY OUTCOMES By Adolfo Navarro A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling, Educational Psychology and Special Education 1998 ABSTRACT ESTIMATING LOGISTIC MODELS FOR CORRELATED BINARY OUTCOMES By Adolfo Navarro The finite sample performance of the independence estimating equations (IEE), which assumes that the responses are independent, and generalized estimating equations (GEE), which take the correlation into account to increase efficiency, methods of estimation proposed by Liang and Zeger (1986), are evaluated when used to fit grouped logistic models. Specifically, the performance of the GEE estimator, under a constant working correlation matrix across clusters, is compared to the IEE estimators in terms of their bias, their efficiency, and their accuracy in testing a simple hypothesis about the mean parameters. A Monte Carlo simulation is used to compare the estimators empirically under the various conditions specified in this study. For example, the effects of the population average pairwise correlation of the binary responses and the effect of the distribution of a covariate on the IEE and the GEE estimator associated with that covariate are analyzed. In this regard, a data generating mechanism (DGM) to simulate binary correlated responses is proposed. This DGM has the advantage over the traditional random effects logistic formulation (TRELM) in that the marginal probabilities, conditional only on the covariates, have a simple logistic form and the mean parameters quantify the average effect of the covariates on the population’s response. The finite sample findings show that for high correlated binary responses, when including either a binary or a standard normal within cluster covariate, the GEE outperforms the IEE in estimating the mean parameter associated with that covariate in terms of their efficiency and their accuracy in hypothesis testing, and both the IEE and GEE estimators are equally biased. However, for low correlated binary responses, both methods perform equally well. In contrast, for highly correlated binary responses, when including a binary cluster specific covariate, the IEE estimator substantially outperforms the GEE in estimating the mean parameter associated with that covariate in terms of their bias, their efficiency, and their accuracy in hypothesis testing. However, for low correlated binary responses the I BE method outperforms the GEE only in terms of their efficiency, although the differences are not as substantial. Copyright by ADOLFO NAVARRO 1 998 To Daniel Gustavo and John Paul ACKNOWLEGMENTS This dissertation would not have been possible without the help of many people. I am especially grateful to my dissertation director, Professor Jeffrey Wooldridge whose guidance, insight and kindness have been invaluable to my progress. I would also like to express my sincere gratitude to Professor James Stapleton. He has been a mentor and an inspiration to me throughout my graduate studies. I will always be indebted to both of them for their careful attention to and interest in my work. Finally, heartfelt thanks go to Professor Robert F loden for extending his hand of support at the peak of my desperation. Without his advice and support I would not have been able to complete my work. I am grateful to Dr. Dozier Thornton for his enduring faith in me as a scholar. My friend and colleague, Dr. Randy F otiu, provided me with valuable computational help and kept my SAS updated. I would also like to express my thanks to two organizations that provided me with financial support. First, without the support provided to me by the Universidad de Oriente, in Cumana, Venezuela, graduate school would have remained a distant dream rather than a reality. I am also grateful to the King-Chavez-Parks Future Faculty Fellowship program which awarded me with financial assistance during this final phase of graduate school. Finally, I want to say thanks to my lovely family for their devotion to me as I finished my degree. Special thanks to my in-laws, Richard and Suzanne Johnson, for their unconditional support, which has been ongoing and sustaining. I wish to express vi my deep gratitude to my wife, Janet, for her constant and unconditional support and encouragement all the way through graduate school. vii TABLE OF CONTENTS LIST OF TABLES .................................................................. CHAPTER 1 INTRODUCTION .............................................................. 1.1 Correlated Binary Outcomes in Education Research ......... 1.2 Linear Probability Models ....................................... 1.3 Models for Binary Response Data .............................. 1.4 Models for Correlated Binary Response Data ................ 1.5 Purpose of the Thesis ............................................ 1.6 Objectives of the Thesis .......................................... CHAPTER 2 DATA GENERATING MECHANISM 2.1 Introduction ........................................................ 2.2 Data Generating Mechanism .................................... 2.3 Other Ways for Generating Correlated BinaryResponses. .. CHAPTER 3 ESTIMATION METHODS FOR GROUPED LOGISTIC MODELS... 3.1 Introduction ......................................................... 3.2 Independence Estimating Equations ............................ 3.2.1 How to Estimate the Variance of A ................... 3.3 Generalized Estimating Equations ............................. 3.3.1 How to Compute the Variance of [BAG ................... 3.3.2 How to Compute £0, ..................................... 3.4 The Computer Programs ........................................ CHAPTER 4 SIMULATION RESULTS ...................................................... 4.1 Introduction ......................................................... 4.2 Simulation Design ................................................ 4.3 Results of the Estimation Procedures ............................ viii AWNr—‘n—n p—n—a N— 15 16 22 25 25 26 27 28 32 34 34 36 36 36 40 CHAPTER 5 SUMMARY ..................................................................... 5.1 Overview ......................................................... 5.2 Parameter Estimation ........................................... 5.2.1 Bias ....................................................... 5.2.2 Efficiency ................................................. 5.2.3 Accuracy in Hypothesis Testing ...................... 5.3 Further Research ................................................. APPENDIX ................................................................. SAS/IML Macro Computer Programs ............................ BIBLIOGRAPHY ............................................................. ix 80 80 82 82 84 85 89 91 91 97 Table 4.1 4.2 4.3 4.4 4.5 LIST OF TABLES Page Averages of the mean parameter and standard errors of the estimate of ,6] results for the logistic marginal model, log it( fly) = :60 + :31 xi}. , fit by IEE. The binary within cluster covariate xi]. equal to 1 or 0 with equal probabilities. Population average correlation 5:102 = 0.81 ................... 59 Averages of the mean parameter and standard errors of the estimate of ,6] results for the logistic marginal model, log it( ,uy) = '80 + fl] xi]. , fit by IEE and GEE. The binary within cluster covariate x”. equal to 1 or 0 with equal probabilities. Population average correlation 5=p2 = 0.81 .................. 61 Averages of the mean parameter and standard errors of the estimate of ’3' results for the logistic marginal model, log it( '11”) = :60 + ’31 xi}. , fit by IEE. The binary within cluster covariate x”. equal 1 or 0 with equal probabilities. Population average correlation 5 = p2 = 0.16 .................................. 64 Averages of the mean parameter and standard errors of the estimate of '3' results for the logistic marginal model, log it( '11”) = '60 + :31 x”. , fit by IEE and GEE. The binary within cluster covariate x0. equal to 1 or 0 with equal probabilities. Population average correlation 5 = p2 = 0.16 ................. 66 Averages of the mean parameter and standard errors of the estimate of ,6] results for the logistic marginal model, log it( lug) = 160 + ,3] xj , fits by IEE. The binary cluster covariate, xj , equal to 1 or 0 with equal probabilities. Population average correlation 6 = p2 = 0.64 ................................. 69 4.6 4.7 4.8 Averages of the mean parameter and standard errors of the estimate of ,6] results for the logistic marginal model, logit(luij) = :50 + ,5] xj. , fit by IEE and GEE. The binary cluster covariate x}. equal to 1 or 0 with equal probabilities. Population average correlation 6 = p2 = 0.64 ................ 71 Averages of the mean parameter and standard errors of the estimate of A results for the logistic marginal model, log it( '11”) = '30 + '3' xj , fit by IEE and GEE. The binary cluster covariate xj equal to l or O with equal probabilities. Population average correlation 5 = p2 = 0.16 .................. 74 Averages of the mean parameter and standard errors of the estimate of :61 results for the logistic marginal model, log it( lug.) = flo + :3: x”. , fit by IEE and GEE. The within cluster covariate xi]. ~ N(0, 1) . . _ 2 Population average correlation §=p = 0.81 ................................... 77 xi Chapter 1 INTRODUCTION 1.1 Correlated Binary Outcomes in Education Research Much of what goes on in education occurs within a group context. Schooling activities occur within hierarchical organizations in which the sources of educational influence on students occur in the group to which an individual belongs (Burstein, 1980). Moreover, in many areas of educational research, one encounters individual responses in the form of qualitative information. For example, student i, belonging to school j , either passes or fails a test, has finished high school or not, repeats a grade or not, or can be grouped according to his/her status of learning as mastery against non-mastery of certain educational material. All of this information can be captured by defining a binary (zero-one) variable that takes on a value of one if the individual or student has the particular characteristic, and zero otherwise. These data can be viewed as hierarchical or nested structures in which students are nested within schools. For example, suppose that the researcher interest is in whether preschool experience makes any difference in the students’ probability of finishing high school. Traditionally, it is assumed that after controlling for the observable school level variables the students’ responses are independent. However, in many cases because of the nested structure of the data, even after controlling for the observed school characteristics, it is natural to expect that the presence of unique unobserved school effects will induce a correlation across responses of individual students attending the same school. This correlation must be accounted for to obtain a correct statistical analysis (Albert & McShane, 1995; Fitzmaurice, 1995; Liang & Zeger, 1986; Raudenbush, 1988). 1.2 Linear Probability Model The linear probability model (LPM) has been used to represent a regression model in which the dependent variable is a binary variable, taking the value of 1 if the event occurs and 0 otherwise. More precisely, let yr] denote the binary response for student i, in school j. Then the LPM assumes that the expected value of the binary response variable is a linear function of a set of covariates; that is, it assumes 7r,- = E(y,.,lx,.,.) = my”. = 1|x.,~) = x;fl, (1.1) where x”. is a p x 1 vector of covariates and ,6 is a p x 1 vector of unknown parameters. Under the LPM formulation, the most popular way of estimating ,6 is ordinary least square (OLS). Given the OLS estimate ,6 , the predicted value of y” is 3],,- = x; ,3. This is the estimated probability that the event will occur given a particular value of x”. . The LPM has the advantages that it is easy to interpret and it is easy to add group effects. However, in practice the LPM suffers from two major difficulties. For one, the estimated probabilities can lie outside the admissible range (0,1), which is problematic for predicting probabilities. A related point is that the marginal effects are constant, something that cannot strictly be true for all values of xii. Therefore, while the LPM may provide reasonable estimates of the true marginal effects for some values of x”. near the average, it might be a poor approximation for extreme values of x”. . 1.3 Models For Binary Response Data Given the limitations of the linear probability model in the handling of the dependence of the probability of response on certain covariates, several nonlinear models have been proposed, each of which ensures that the fitted probabilities will lie between zero and one. The logistic functional form is probably the most commonly used. Thus, the model of choice for analyzing binary response data with covariates is the logistic regression model. Under the logistic model formulation, one assumes that there are not group effects. Then the probability of “success” on the dependent variable can be written as 72",,- = pro/0. = 1|x,,-) (1.2) 1 + eXP(- x; fl) where x”. is a p x 1 vector of covariates, and ,8 is p x 1 vector of constant parameters. Likewise, the probit transformation (DOC;- fl) of the “success” probability have been proposed as an alternative to the logistic transformation. However, the logistic functional form is the most widely used in different fields of application. This is because the logistic transformation is more convenient from the computational viewpoint; as well as having a direct interpretation in terms of the logarithm of odds in favor of a success, and for models based on the logistic transformation, differences in the logit scale can be estimated regardless of whether the data are sampled prospectively or retrospectively (see Agresti, 1990; Bishop, F ienberg, & Holland, 1975; Cox, 1970; McCullagh & Nelder, 1989 for an extensive coverage of these models). 1.4 Models For Correlated Binary Response Data The data consist of observations (yg‘ , x0.) , where yy is a binary response and x”. is a p x 1 covariate vector for ith observation, i = l, ..... ,nj , from the jth cluster, j = l, ..... , J . Traditionally in educational research, to study the effects of covariate(s) on the marginal probabilities for the binary responses, the standard logit maximum likelihood estimation (MLE) approach has been applied under the key assumption that the yg are independent across student 1' , in school j conditional on the xii. This approach, although appropriate to deal with one-level binary responses, is not appropriate in terms of its efficiency for fitting nested structured data, since the presence of unique, unobserved contextual factors in group--school effects--are expected to induce a correlation across responses of individual students attending the same school. While the estimation of the regression parameters of the marginal model are consistent, their variances will be wrongly estimated (Zeger & Liang, 1986). Moreover, with non-linear models for binary response data, such as logistic regression, different assumptions about the source of correlation can lead to regression parameter estimates with distinct interpretations (Diggle, Liang, & Zeger, 1994; Zeger, Liang, & Albert, 1988). Most studies dealing with binary response variables for nested structure data use random effects logistic models (Stiratelli, Laird, & Ware, 1984). Typically, a random effect, b]. , is included in the same logit scale as the constant regression parameters to account for the correlation that may exist because of the nested structure of the data. Henceforth this procedure will be called the traditional random effects logistic models (TRELM) formulation. It is assumed that the Y, conditional on the vector of xij's , X j , and b}. are independent Bernoulli random variables with response probabilities 1 1+ CXPI-(x;fl+z;b,)] #1; = E(yile./’b.i) = pr(y” : llXj’bj) = pr(yij =1lX,,aZ.~pb,-)= Let Z.-,- be a vector of covariates associated with the random effect, b}. , and Var(yy|bj, X j) = lug/(1 - 'uij). It is assumed that b}. is independent of X]. and b1 ~ N(0, D), where D is an unknown covariance matrix to be estimated. aprO/y. = llXj’bj) é’x Notice that depends on b]. . Thus, whereas the above 0. models are helpful in providing a cluster specific interpretation of the regression coefficients, outfitting individualize cluster predictions, they fail to ask the right questions from a policy perspective. An important question to ask is how a particular school program or policy that involves changing one of the variables X.) will affect average population response. For example: What is the effect of a policy variable, pre-school experience, which takes on a value of 1 if the student attended pre-school and 0 otherwise, on the probability of dropping out of school for the entire population? Here, the interest is in quantifying uncertainty about future observable pr(yl_j : 1| X1) on the basis of information known. Thus, under the TRELM formulation the pr(y” = 1| X1) can be obtained by integrating the distribution of (yy lb]. , X_,.) with respect to the distribution of random effects, bl. (i.e., the school effects). That is, my, = 1|X,,.) = E[E(yy.|X_,.,b,.)lX,-] = I Iogit‘ ‘(x;,fl+ b,)f(b,)db_,-- For the integral above, exact solutions are not feasible. However, given estimates of the regression parameter [3 and the density of random effects, b1 , in principle, one can use the above integral to obtain predicted probabilities and partial effects. For example, Im and Gianola (1988) and Anderson and Aitkin (1985) applied numerical integration (quadrature) methods combined with the EM algorithm as described by Dempster, Laird, and Rubin (1977) to approximate the above integral; Stiratelli, et al. (1984) and Wong and Mason (1985) have estimated the joint posterior means by approximating the likelihood function based on the joint posterior modes of the random parameters given approximated ML estimators of the covariance parameters. This approach is implemented by means of the EM algorithm. Other researchers have focused on approximating quasi-likelihood or log likelihood functions. For example, Schall (1991) proposed a general algorithm for the estimation of the regression parameters, random effects, and components of dispersion in generalized linear models (McCullagh & Nelder, 1989) with random effects; Breslow and Clayton (1993) considered the penalized quasi-likelihood (PQL) method of inference for data generated under the traditional multilevel logit formulation; Goldstein (1991) proposed a first order Taylor series approximation to analyze data under the traditional multilevel logit formulation; Longford (1991) focused on the logistic regression for correlated binary outcomes generated under the traditional multilevel logit formulation and proposed a computational algorithm based on an approximation to the log likelihood function. However, several problems have been reported or associated with these approaches. First, the distributional assumptions on the random effects are somewhat arbitrary and untestable for small cluster size. Second, some difficulties in the computational aspects of model fitting have been mentioned quite ofien. For example, Zeger and Karim (1991) pointed out: 1) investigators have largely restricted their attention to random intercept models to avoid higher order numerical integration; 2) problems exist in the interpretation of the regression coefficient estimates; 3) inferences about the regression coefficients have often been made conditional upon the estimated random effects variance, again to avoid difficult integration. Zeger and Karim (1991) used a Monte Carlo method, the Gibbs sampler, for estimating parameters in the generalized linear model with Gaussian random effects. This methodology promises to overcome numerical problems in solving higher dimensional integration as a result of fitting generalized linear models in which the random effects are incorporated in the same logit scale as the fixed regression coefficients. Because of the computational and interpretational problems faced when estimating marginal probabilities under TRELM formulation, researchers have been interested in the cost of ignoring the correlation among components of the vector of binary responses, both in terms of biased estimates and biased assessments of precision. Recently, Rodriguez and Goldman (1995), using a Monte Carlo study design, generated data under the TRELM formulation to evaluate two software packages designed for fitting multilevel models to binary response data, namely VARCL (Longford, 1987) and ML3 (Goldstein, 1991). Rodriguez and Goldman (1995) found that the multilevel estimates of the constant parameters reveal substantial downward bias. The bias was moderate when the random effects were small, but fairly substantial when at least one of the random effects were large. The authors also studied the properties of OLS when applied to data generated under the TRELM and compared results with those obtained by using two software packages designed for fitting multilevel models to binary response data. The authors reported that the estimates of the constant parameters were virtually the same. Since the two approaches utilized in estimating the mean parameters were estimating different quantities, it was expected that the mean parameters would also be different. Thus, it is interesting that Rodriguez and Goldman found that the estimates of the regression coefficients obtained, when applying different methods of estimation to the misspecified model (i.e., ordinary logistic model) and to the correctly specified model (i.e., multilevel logit model), were similar. The Rodriguez and Goldman (1995) findings could mean that the numerical approximations in which the two methods of estimation proposed by Goldstein (1991) and Longford (1991), and evaluated by Rodriguez and Goldman, especially designed to deal with multilevel logit models, need to be revised. They also reported that the estimates of the variance components were very disturbing, as they showed a pronounced downward bias. The standard errors were, on average, quite accurate; however, their utility came into question since the order of bias was 2-3 and 4 standard errors for the constant and random effects, respectively. Given the results of the above simulation studies, Rodriguez and Goldman highlight the need for alternative estimation procedures to handle multilevel models with binary response variables. The question is raised “whether random effects in binary response models can ever be estimated at acceptable levels of bias and precision when the average size of clusters is modest” (Rodriguez & Goldman, 1995, p. 87). Alternatively, Diggle et al. (1994) describe a marginal model approach that can be utilized to model correlated binary responses. This marginal model has the advantage that the marginal probabilities can be modeled as a function of covariates without explicitly accounting for unobserved heterogeneity across clusters. Under this marginal modeling approach for correlated binary response data, one models the marginal probabilities given only the observable covariates, X1 , rather than the conditional probabilities given both the unobserved random effects, [3], , and the observables. Liang and Zeger (1986), in the context of longitudinal data analysis, proposed the generalized estimating equations (GEE) approach as a method for fitting logistic marginal models for correlated binary responses. Most importantly, the marginal model approach via the GEE method of estimation produces consistent estimates of the regression parameters and of their variance, depending only on the correct specification of the marginal probabilities for the binary outcomes, not on the correct choice of the correlation structure of the data (Liang & Zeger, 1986). Furthermore, no further distribution assumptions are required, other than the observations across units or clusters are independent. This is to be contrasted with the random effects for correlated binary responses under the TRELM formulation, where both the marginal probabilities and the random effects distribution must be correctly specified for consistent inferences (Zeger et al. 1988). That is, the marginal probabilities, pr(yij = 1| X1.) , under the TRELM formulation, depend on X j. and the fixed parameter, D , of the distribution of the random effects, b]. . In addition, because an exact closed form expression for the marginal probability is unavailable, in order to estimate the average population changes in probability for unit change in a covariate, Zeger et al. (1988) proposed an algorithm to calculate the marginal moments, ,ui and V1" from the conditional moments and the distribution of the random effects, b]. . Having evaluated la] and V}. for each cluster, one iterates solving the following equations (for details, see Zeger et al.1988, page 1054): .l l -I (ma) = 2D,.V, (y, - it) = 0 and 1:1 _ _ I ‘I l —l —| I -l Dz (2121') Z} L1 (Vj— A.) L} ZJ(ZIZJ) ’ where D}. = 071'} mp, L,- = diag{§ h" (,u) / 5p, ,u = x2, A}. his the logit or probit function, Z I. is a vector of covariates 10 V,- =AER,-(5) A}. A, = diagtVarwj», [21(5) is a “working” correlation matrix, #1: ElyleJ, luv]. 4.11.,» ...... nun/j) , yj = (yu, ...... , ya)" The convergence of this algorithm depends on the accuracy of the linear approximations proposed (Zeger et al. 1988). In addition, because the approximations involve the nonlinear logit link function, the authors recommend examining the sensitivity of ,6 to changes in D in the neighborhood of D. Clearly under the TRELM formulation, the probability of success given only the observables does not have the simple logistic form. That is, the logistic specification for the marginal probabilities is not appropriate. This shows that the interpretation of the marginal probabilities under the TRELM formulation is complicated. 1.5 Purpose of the Thesis In this study I am interested in comparing independence estimating equations (IEE) and generalized estimating equations (GEE) when used to estimate logistic marginal models for correlated binary responses, in which the marginal probabilities of 11 the binary responses ya conditional on X j have the simple logistic form. Most importantly, the IEE and the GEE methods of estimation have been used for fitting logistic marginal models for correlated binary responses. However, whereas the asymptotic properties of the IEE and GEE estimates are well understood, their finite sample pr0perties are not well known. For that, I am interested in studying the small sample properties of both methods of estimation. 1.6 Objectives of the Thesis This thesis studies the properties of two different estimation approaches for estimating the fixed parameter ,6 in the marginal logistic model as specified in (1.2), but the binary outcomes conditional on the X j are correlated. The first method is the standard pooled logit, which, in the terminology of the estimating equations literature, is called independence estimating equations (IEE). Under the IEE approach, the within cluster correlation of the binary responses is ignored when estimating the regression parameter, ,6. That is, the identity matrix is specified as the “working” correlation matrix, and the maximum likelihood method of estimation (MLE) is applied to the pool data. MLE yields estimates of the regression parameters, ,6 , that are consistent and asymptotically normal. However, the estimate of variance matrix of the estimates can be inconsistent (F itzmaurice, 1995; Liang & Zeger, 1986). In addition to a “naive” variance estimate of the estimated parameters, I compute a “robust” variance estimate proposed by Huber (1967) and Liang and Zeger (1986). I compare both the “robust” and naive standard error of the estimates of the regression 12 coefficients with the average Monte Carlo standard error. It is suspected that as the correlation within the clusters goes up these standard errors would get farther apart. It allows, at least, to see how much effect the variance correction has in terms of the rejection probabilities and coverage probabilities when testing simple hypotheses about the regression parameters. Here, the main question is: How good are the standard errors when corrected for the correlation within group? The second estimating procedure is the generalized estimating equations form. In particular, to estimate ,6 under the GEE, the conditional correlation matrix is assumed constant, even though it may not be. Hence, this “constant working correlation matrix” does not correspond to the true correlation of the binary responses, so I compute a “robust” covariance matrix of the mean parameter estimate. This allows us to compare IEE and GEE when each uses the inappropriate second moment matrix. The main question here: Is the extra complexity of the GEE worthwhile in terms of efficiency? Since neither the IEE nor the GEE are efficient, the next question is: Is it better to use the GEE with the misspecified covariance structure than to use the IEE? This question makes sense since the GEE approach is harder or more complex to apply. If it is found that using the GEE with an incorrect covariance structure for a reasonable form of correlation does not perform better than when using IEE, it may be concluded that IEE is more adequate than GEE under that circumstance. When their sample variances are compared, it will be with the hope that because the GEE approach uses some correlation structure, it will capture the qualitative structure of C0v(ylj ,....,y ~| X I.) , so in this n); — case it will be better than using the IEE method. 13 To study the finite sample performance of the IEE and the GEE methods of estimation, I need a way to generate correlated binary outcomes that nevertheless follow the logistic model. Thus, I develop a mechanism to generate correlated binary responses according to the logistic marginal model. One unique feature of this marginal model formulation is that the marginal probabilities, conditional only on the observables, have the simple logistic form, yet the vector of binary responses are correlated. This logistic marginal model formulation allows us to determine the effects of within cluster and cluster constant level covariates on the marginal probabilities without assumptions about the unobserved heterogeneity across clusters. 14 Chapter 2 DATA GENERATING MECHANISM 2.1 Introduction 7’ As mentioned in Chapter 1, analyzing the properties of estimators of ,5 in L7 Equation 1.2 requires a data generating mechanism that produces correlated responses. In this chapter, I propose a mechanism with the following features: 1 ( IIX ) ( 1| ) “mm . : r : _ : r = _ : I M. p y , p y" x" l+exp(x,-,-fl) where X} = (xi/9 ...... , xn,1)' 2. C 0rr(yy ,y’yl X1) ¢ 0; furthermore, the correlation between y” and y,” is a function of the marginal means and an additional parameter, p. 3. The observations across groups or clusters (e.g., schools) are independent. These features motivate the GEE approach to logistic models for cluster data analde by Diggle et al. (1994). The mechanism I propose can be conceptualized as a mixture model mechanism (MMM) to generate correlated binary outcomes with a marginal logistic distribution. This procedure is dependent on two independent 15 mechanisms, like flipping a biased coin with probability p of coming up heads. If the result is heads, you pick from the group effect mechanism, otherwise you pick from the student effect mechanism. This procedure captures both student and school effects, and, as I show in the next section, produces logistic response probabilities. As discussed in Chapter 1, the logistic marginal model approach has the advantage over the TRELM formulation in that the marginal probabilities do have a simple logistic form. This is the sense in which the logistic model formulation for correlated binary responses correctly specify the marginal means. 2.2 Data Generating Mechanism In this section I formalize the mixture mechanism for generating correlated binary responses y , i=1,...,nl. , j=l,...,J,with covariates,Xj,that vary or ii are constant within cluster, such that Equation 1.2 holds. Generate y” as: Y, = c.,v,-,-+(1-c,,.)w,.,. i=1,...,n’,. ,j=1,...,J where j denotes group and 1' unit within a group. The assumptions are: 1. The c”. 's are independent Bernoulli random variables, for i = 1,..., n]. , j = 1,..., J with probability mass function given by Pr(C,-j=1)=p and pr(C,,=0)=1-p,where 00 and v”. = 0 otherwise. The @jfs are independent and identically distributed having the standard logit 1 distribution, F (u) = —————. l + exp(—u) Therefore m V, = 1) = me), > —x§,/3) = MG),- < x;fl) = F0 and W),- = 0 otherwise. The u”. ' s are independent and identically distributed having the standard logit 1 distributions, F (u) = ——-——. l + exp(-—u) Therefore pr(w,-,- = 1) = Mu,- > -x§,fl) = pr(u,, < xL-fl) = F (x,- fl)- They represent the within group effects. 4. The random variables of c”. , (4)]. , u”. are mutually independent, so the C.) are independent of View.) for i = 1,....,n,. , j= 1, ..... ,J. 17 l" 5. The cg. , @j. , u”. are all independent of X J. . . Some consequences of assumptions (1), (2), (3), (4) and (5) are: 1. Conditional on X ,,- , Wi/"""Wn./ are independent binary random variables; 2. Conditional on X J. , { V,,,, w”: h,i = 1,...., 71,-} are mutually independent binary random variables for all j = 1,..., J ; and 3. Conditional on X; , {V0, VI”: i¢ h; i,h =1,....,nj} are related binary random variables by the presence of common school effect (4)}. for all j=l,...,J. We first show that under the above mechanism the marginal mean has the usual logistic form given in Equation 1.2. To see this, notice that: E(y,-,Ixii) = EHCU‘VU + (l _ Ca) wiluxii} = EKCII vii I xii” + E[(1— Cu) Wu I xv] = E(Cg)E(V,-j | x0.) + E(1— ng)E(WIj | x0.) , assumption (4) = pEw, Ix,,.) + (1 - p)E(w,,.| x,,.). assumption (1) 18 l (1 - p) 1 , assumptions (2) and (3) = ID I + l 1+6XP(- xyfl) 1+eXp(-— x,,-fl) l 1 + €XP(- x2, fl) = F(xijfl) Thus, pr(yl_j = II x”) has the usual logistic form. This mechanism will only be useful if it generates correlated binary outcomes. In what follows I show how to compute the covariance and correlation between any two pairs of observations across i within school j generated by this mechanism. By definition, the conditional covariance is: Cov(y,.y,,IX,,) = my, y,,IX_,.) — E(yy.|X_,-)E(y,jl X,) Thus, we obtain the following expression for Cov(yii , yhj | X j) . The conditioning on the X I. is omitted for simplicity in the derivations. Write y”. yij = [Ci/Vli+(1— Cy) nglchj V,”- +(1— Chj) why] =ci/Chi vii viii +611 Vii“ _ Chi) Whi +(1_ Cy) Ch} WI,- vhf 19 + (I — c,,)(1" CM) wt] W’J/ ' Taking expectations gives E (ya. y,”.) = p2 E [v,, v,,] +(1- p)pE(v.,-)E(w,.,) +(1— p)pE(v,.,.)E(w,,,-)+(1- p)’E(w,-,-)E(w,,,) . But E(v,,. v,,,-) = pr(v,., =13v,.,- =1) = MG),- > max(— xLfli- Joy-3)) = min[F(+ x; fl), F(+ x;,, m] = min(,u,.j , ,um), where p” = my”) = F(x;fl); Collecting terms gives E(yijyly‘) = p min(#g’#hj) _ p #4,- Iuhj + lug [uhj Thus, the expression for the CW0,”- , y,” l X!) can be written as: 2 . C0V(y”.ayhle.,-) = P [mm(/’lg’#h/) _ lui‘iluhi] and so p2 [min(#,‘,' ’flhj) - #1} #hj] Corr = expiy, 77,, — logrl + eXP(77,j)}], where the marginal probabilities, defined as in Equation 1.2, are modeled as a function of the covariate x”. by the logit link function, that is 77”, = log it( #g‘) = x; ,8. Thus, the first two moments of y” | x”. are given by: E(y,,.lx,,.) = ,u, and Var(y,|x,,~) = ,Ll,,(1- #0.)- The pairs yg' , hi i = 1,.., )1]. within cluster j , are most likely correlated; however, by naively assuming independence of the pairs of binary responses, the IEE or logit QMLE estimator, ,3, , is the solution to the independence estimating equations (Liang & Zeger, 1986), 0”,!!! U<fl>=2 VJ' (Y,-#,.> = 0 j=| é’fl ' where Y, = (yU, ..... ’y,,,)' , 'uj : (lull, ...... ,Iunj)' , . 26 ET ”fit—fin“. . I '1 3.2.1 How to Estimate the Variance of ,6] The identity “working correlation matrix” adopted in solving the above IEE most likely does not correspond to the true correlation of the binary responses. The estimate of regression coefficients is consistent, but the estimate of covariance matrix is under this quasi maximum likelihood approach are usually inconsistent (Liang & Zeger, 1986). To solve this problem of inconsistency, Liang and Zeger’s (1986) Theorem 1 established that the IEE estimator, [3, , of ,6 , which is the solution to the IEE, is consistent and (,6, — ,6) is asymptotically multivariate Gaussian as (J —-) 00) with zero mean and covariance matrix given by (élei Aij)-l(é.X’/C0V(yj) X1) (,éX’J A_i)(j)fll ’ where A}, = diag[/10(l— tug)...” Ian/(l — lump] the Xj's are treated as nonrandom. Liang and Zeger (1986) suggest that the asymptotic variance of the ,3, given in Theorem 1 can be consistently estimated by —l ‘I ’ A J 1 A A! J ' A (.IEIX} AIX!) (jEIXjrjrij) (jEIXj Ain) where X ,- = (xu’ ------ ’x’”). represents a n; x p matrix of covariate values, A,- = diaglflyil-'[,1”_),....,#nlj(l- [any], . .. A A exr3(x;fl) the resrdual vector r]. = y, - ,u/ , where ,U”. = , A , and ,6 is based on the 1 + exp(xijfl) logit QMLE or IEE. 27 .....l l in- The IEE estimator ,6, of ,6 and Var(fil) are consistent, even when the true correlation matrix is not correctly specified, given only that the marginal expectation of the responses, E(yu| x”) , are correctly specified (Zeger & Liang, 1986). F urtherrnore, the estimator fl] and the variance of the estimate, Var(,él) , are easy to compute with existing software. Some of statistical software are: GLIM (Numerical Algorithm Group, Oxford, UK), MINITAB (Minitab Data Analysis Software, Pennsylvania, USA), SAS (SAS Institute, Raleigh, North Carolina, USA), and STATA (Computer Resource Center, Santa Monica, USA). However, the IEE estimator ,6, may not have high efficiency in cases where the within cluster correlation of the binary responses is large (Liang and Zeger, 1986). 3.3 Generalized Estimating Equations Under the DGM, as shown in Chapter 2, the first two marginal moments of the binary responses Y," X 1. correspond to I —l #17 = pr(y”‘ = IIXI) : [1+eXp(—x”-fl)] p2 [KIRK/J” ’flhj)_#ij #hj] (3.1) J#. MAI-M.) C0rr(yy , y,” | X!) = The conditional mean parameter, ,6 , can be estimated by solving the GEE W?) = EDS-V7 (My, — y) = 0 (3.2) 28 where [U], = (qu’ ...... nu") , Y, = 010’ ..... ,ym)', ,l.1= E[y,.|X_,.] .I D, = egg/(9,8, V16) = A,5 R10) A? , A]. = diag{Var(yU|Xj.),....,Var(y"j|Xj)}. Notice in Equation 3.1 that, in general, under this particular DGM the true pairwise correlations of the binary responses depend on the unknown parameter p and the covariate, X I. , through a minimum function of the marginal probabilities. For efficiency, the correct specification of the Corr(yij , yhjl X J.) matters when fitting GEE to model correlated binary responses. However, the consistency of the GEE estimator and the standard error of the estimator are preserved even when the correlation structure is misspecified given only that the regression model for the marginal probability is correctly specified (Zeger & Liang, 1986). Thus, given primarily that the formula in Equation 3.1 is specific to the this DGM, and it is not a trivial task to modify the existent statistical software packages in order to solve the GEE when using the optimum correlation matrix, I focus on a much simpler correlation matrix, other than the identity working correlation matrix, that explicitly accounts for the pairwise correlations of the binary responses. This “working correlation matrix” is not the optimum one, but I hope it captures most of the qualitative features of variation; the IEE approach ignores the correlation altogether. This allows us to compare IEE and GEE when each uses the inappropriate or non-optimum second moment matrix. Thus, to estimate the conditional 29 mean parameter, 6 , via the GEE, I utilize a n,- x n, “constant working correlation matrix” of the form 15, ......... ,5 5 1, ........ ,6 131(5): .................... . (3.2) pa, .......... ,1_ To implement GEE, we need to estimate 6 . First, if 6 = Corr(yil,yh/|Xj) then Comm are, X, = E r—— vii VIII 9 VI! vhf 5(X,)= X1 for all i at h, where g”. = y” — E(y”|Xj) and V”. = [ti/(1 —,uij). Then by the law of the iterated expectations respect to X ,,- , (8,,- 8b,) ivy-vi,- ' 6: E6(X_,.) = E 30 Thus a consistent estimator of 6 (under standard regularity conditions) is easy to obtain by (3-3) .. l" 88;, 6=_ ’I I ,h .’ Jiznjm, -1)/2.Z ;_ iv v,_,, >1 A where g”. = ya — ,u” and v”. = [Jail - #0.), A _ exp(xifl) ’u” l+exp(x:l.6)’ and 6 is based on the initial IEE estimation. This constant “working correlation matrix” does not depend on the X I. , and it is independent of the data generating mechanism. It depends on the single parameter 6 , which is easy to estimate based on residuals after initial IEE estimation. Thus the conditional mean parameter 6 can be obtained by solving (10(13): EDI-([713 131(5),?)- 1(3’, — [1]) = 0, where ' A A 1 all” it: é’fl.’ ’é‘fl. 1);: ........................ ii: ”i W.’ M _ 31 14,-: diagIfiUU — 6H),”...I’Infi - I1,” "18, ..... ,5 I ~ 31, ..... ,5 R,(5)= 65, ...... ,1_ [liq/1”, ...... ,6)’ and y,=(yU, ..... ym)’. Liang and Zeger (1986) show that 6 , the solution of the above EE is consistent and asymptotically (J —> 00) Gaussian given only correct specification of the marginal probabilities. In addition, a robust variance estimate is also consistent even when the true correlation matrix is misspecified. The “robust” variance estimate is robust in the sense of being consistent even if the working correlation used is not the true correlation. 3.3.1 How to Compute the Variance of 6” Liang and Zeger’s (1986) Theorem 2 established the conditions under which the solution of the Equation 3.2 is consistent and JJ(,é(; — fl ) is asymptotically (J —-> oo) multivariate Gaussian and covariance matrix V0. given by V. = 1m 1(sz V;' D.)-' {2 0.2V? any, V7 1).} (2 D:- V? D)" where the Cov(y,_), A]. and D}. are conditional on the X j's. 32 Furthermore, the variance estimate V(, of 6‘, can consistently be estimated by replacing Cov(yj_| X .1) in the equation above by (X — [LII/)(yj — 1&1), and fl , 6, by their estimates. That is, I}... = M? M. M: , where M ., = 2(5/1; / 076) WW '13.,- / an) , and A I -I M1 = 26/2; /6’6 VjO/ . — ,quy. " I?) I} J I I 56’ mp. I Liang and Zeger (1986) show that the consistency of 60, and I?“ depends only on the correct specification of the mean, not on the correct choice of R(6) . If 12(6) is the true correlation matrix, then V]. is equal to C0v(yl_| X1) . Thus a consistent estimate of the asymptotic variance matrix ,6“, is given by: -l i]: A, A_] A (j=iDjVj Di) ’ which is known as a model-based covariance estimate. However, this is not the case under the data mechanism utilized to generate correlated binary responses in this simulation study. Therefore, the robust standard errors of the estimates are computed. 33 3.3.2 How To Compute 6“ To compute 6, , one can iterate between an iteratively reweighted least squares algorithm for 6 and the moment estimate of 6. Starting with the initial value of 6k based on logit QMLE or IEE, and the current estimate 6 , the values of 6 are updated according to e... = p, — {,§,D2(fi.)V,(fi.)Di(fii)} {ng<fii>I7’.} where I71 (6 ) = VI,-[,5,5t {3}]- 3.4 The Computer Programs A computer program based on equation (2.1) is coded using SAS/IML Macro Language to generate correlated binary response data. Logistic marginal models (e.g., logit(lu”) = 60 + ,6] x”) are formulated to model the sets of simulated correlated binary responses. The IEE and the GEE estimators of the regression parameter 6 and the variance of the estimates are computed using the SAS/IML Macro Language version 2.03 (Groemping, 1993, 1994), which is a modified and extended version of the SAS macro written by Karim and Zeger (Technical Report #674, Department of Biostatistics, the Johns Hopkins University, 1989). This SAS/IML Macro Language version 2.03 uses the GEE approach of Liang and Zeger (1986) to model correlated data for a general class of outcome variable including Gaussian, Poisson, binomial and gamma outcomes. The program uses an 34 iterative procedure to estimate the regression coefficients. That is, the program iterates between the following steps: obtaining initial estimates of the regression parameters by QMLE or IEE, if not given; estimating the “working covariance matrix” given the current regression coefficients; updating the regression coefficients given the new correlation estimates; calculating the model-based and the robust variance of the estimates of the regression coefficients. The Monte Carlo simulations were performed on a PC P5-166 Pentium. Furthermore, to facilitate the use of the GEE SAS/IML macro, a program coded using SAS/IML Macro Language was used for passing parameters. 35 Chapter 4 SIMULATION RESULTS 4.1 Introduction In this Chapter, I study the finite properties of the estimating equations (EE) approach of Liang and Zeger (1986) by numerical simulation. I simulate correlated binary response data as described in Chapter 2. The conditional mean parameter A in .j , is estimated via IEE and GEE. I l the logistic marginal models, logit(,u'_j) = '30 + ,3] x compare the performance of the IEE and GEE methods of estimation in terms of the bias, the efficiency and the accuracy of the estimation of the estimate 6' . I evaluate the efficiency of both methods using the mean square error (MSE) of 6' as the criteria. Furthermore, the accuracy of standard error estimation of the estimate 6| is described with the Wald-type test for testing simple hypotheses about the mean parameter ,6l . 4.2 Simulation Design The following factor combinations are specified in generating correlated binary responses: the cluster size it fixed at 5; the sample size, J , given by the number of clusters, increased from 50 to 100 to 200. Furthermore, in the process of analyzing the 36 effects of the covariate, x”. , on the average marginal probability under the above factor combinations, I first consider binary within cluster covariate, that is, a covariate that is different within the same cluster. For example, x”. might indicate whether the student i in school j belongs to a minority group, and ya might indicate whether the student either passes or fails a “National Test.” Under this formulation, the logistic marginal model estimates the difference in promotion rates between minority and non-minority. Second, I consider a binary cluster specific covariate, that is, a covariate that is identical ; for all the subjects within a cluster. Third, I consider a covariate that is standard normally distributed within cluster. Each data configuration is replicated 400 times. Furthermore, under the DGM proposed in this study, the extent of the population pairwise correlation of the binary responses depends on an unobserved school effect through the probability mass function associated with the cg. ' 5 independent Bernoulli random variables and the distribution of the covariates or observables predictors, x0. Thus, I simulate a logistic marginal model with high and low correlated pair-wise binary responses by specifying c”. ' s as independent Bernoulli random variables, with probability mass function given by pr(C,, = 1) = p where 0 < p < l . The higher p, the higher the population average correlation of the binary responses. p is the within cluster correlation coefficient. Thus, I evaluate the small sample performance of the IEE and the GEE methods of estimation in estimating the regression parameters under the following models specifications: 37 Model 1 log it( ’11”) = .30 + '61 xi]. , where :60 = l and ,61 = (—O.6,— 0.4, 0, 04,06) , and the binary within cluster covariate x”. equal 1 or 0 with equal probabilities. Model 2 log it( lug) = flo + 131x} , where :60 = 1 and I61 = (—0.6,— 0.4, 0, 04,06) , and the cluster covariate x}. equal 1 or 0 with equal probabilities. Model 3 log it( lug) = :50 + ,6I x0. , where '60 = l and ,6l = (—1.0, 0.5, 0, 0.5, 1.0) , and the within covariate x”. ~ N(0, 1) . In this study, the analysis of the correlated binary responses consists of two phases. During the first phase, I ignore the within cluster correlation of the binary responses by naively assuming independence of the responses, and I formulate logistic marginal models to analyze the sets of simulated data. Thus, in order to fit the IEE to the logistic marginal model, I specify the identity matrix as the working correlation matrix. Then, I compute the model based (Naive SE) and the robust estimates (Robust SE) for the variance of the estimates of the regression parameters based on formulation of Liang and Zeger’s Theorem 1. In an attempt to better describe the behavior of the estimator of the conditional mean parameter, :31 , and both standard errors of the estimate 6| , (i.e., the Nai've SE and 38 Robust SE), 1 compare the estimates for the various simulated factor combinations mentioned above in terms of their bias and the accuracy of standard error estimation of the estimate 6, when testing simple hypotheses about the conditional mean parameter 6’. The comparison of both estimates of the standard error of the estimate 6,, is based on the calculated rejection probabilities (RP) when testing H0: ,6l = 0 for given true values of ,8l and coverage probabilities (CP) when testing H0: ,3,- = b.) for the given true values of 1,}. The RP denotes the proportion of the times that H0216l = 0 is rejected for the given true values of ,6'. The CP are determined by the proportion of intervals containing the true parameter when testing Ho: ,8’ = b; for the given the true values of fl] ' During the second phase, in order to fit the GEE to the logistic marginal model, I specify a constant correlation matrix, R/ , whose dimension is n}. x nj , as the working correlation matrix; additionally, I assume correlation is 6 = Corr(y,-, ’ y,,,»'X,) for all i ¢ h , i,h = 1,...,ni , j = 1,..,J. This working correlation matrix is not the optimum, but it is easy to estimate and it depends only on the single parameter 6 (see Equation 3.2). Although there may be a loss of efficiency, this approach may lead to an estimator with a higher efficiency than the IEE estimator. I suspect that this is more likely to be true for large population average pairwise correlation. I compare both methods, IEE and GEE, in terms of their bias and the magnitude of their MSE. Finally, the accuracy of the standard 39 errors of the estimate 6,, is evaluated when testing simple hypotheses about the conditional mean parameter ,6]. 4.3 Results of the Estimation Procedures In what follows I show the results of the evaluation of the small sample properties of IEE and GEE methods of estimation when used to fit logistic marginal models to analyze sets of correlated binary responses. I compare the bias of the IEE and GEE estimator, 6/ of [B], , for a given range of true values of '6’. The relative bias of 6’ and -- A ' .i .I 7 the absolute error are defined by b x 100 for ,6} ¢ 0 and fl ,- — :6,- when flj = 0 j respectively. ,6], is the true parameter value and ’37} is the average value of 6’ from the simulations. Two scenarios are used to establish the accuracy of the standard error estimation of 6,, under the IEE and GEE. First, I test Ho: ,6], = b,- for given true values of b_,-- For this case, the accuracy of the test is obtained by the fraction of the two-sided /\ confidence intervals, given by 6/1- Zn975 Var(6i) , which include the true parameter value of ,6}. Second, I test H0216l = O for given true values of :81 . In this case the A accuracy of the test is obtained by the fraction of times that is greater than lVar 0 then y=1; else y=0; end; if c=0 then do; w=beta0+m[i,1]*beta1+u; ifw> 0 then y=1; else y=0; end; predict=m[i, l ]; mli.2]=y; m[i,3]=school; end; end; create a from m; setout a; append from m; quit; FUD; * * *proc print data=a; data test; SET a; rename col] = predict c012 = y col3 = school; intercpt=1 ; ***proc print data = test; run; 92 %* Assuming independence of the binary responses, the SAS/[ML %* GEE macro is invoked. The input to the GEE macro is %* “test,” the SAS dataset created above. The output contains the IEE %* estimates of the regression parameters, as well as the model based on %* the robust variance of the estimates of the regression parameters. %gee (data=test, yvar=y, xvar=intercpt predict, id=school, link=3, vari=3, het=1 , matrix=residual, est_stor=mat, out=residual.dat); FUD; %* Based on the IEE residuals, we estimate the pairwise population average %* correlation as specified in Equation 3.3 data _null_; set residual.dat; file xyz; put fit 10-20 .8 res 24-34 .8; data summary; array f{5} fl f2 f3 f4 f5; array r{5} r1 r2 r3 r4 r5; delta1=0; infile xyz; input #1 f1 10-20 .8 rl 24-34 .8; input #2 12 10-20 .8 r2 24—34 .8; input #3 f3 10-20 .8 r3 24-34 .8; input #4 f4 10-20 .8 r4 24-34 .8; input #5 f5 10-20 .8 r5 24-34 .8; do i=1 to 5; doj=1 to 5; if(i