LHBRARY Michigan State Unlverslty PLACE N RETURN BOX to roman this checkout from you: "cord. TO AVOID FINES Mum on or before data duo. DATE DUE DATE DUE DATE DUE MSU I: An Aflimm ActtoNEwnl Opportunity Initiation mm. TOWARD A MULTILEVEL GENERALIZED LINEAR MODEL: THE CASE FOR POISSON DISTRIBUTED DATA By Wing Shing Chan A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling, Educational Psychology and Special Education 1995 ABSTRACT TOWARD A MULTILEVEL GENERALIZED LINEAR MODEL: THE CASE FOR POISSON DISTRIBUTED DATA by Wing Shing Chan This study presents a maximum likelihood approach for estimating a random coefficient generalized linear model via the Monte Carlo EM algorithm. Monte Carlo integration through simulating multivariate normal variates is used to integrate out the random effects in the E-step. The M-step is equivalent to maximizing a weighted sum of log likelihood for an exponential family and a multivariate normal distribution. The former weighted likelihood is maximized by one step of the Iteratively Reweighted Least Square algorithm while the latter is maximized in closed form by choosing appropriate weights. Standard errors for the fixed effect parameters and variance components are obtained through the tractable observed complete data information matrix. The Poisson distribution is used for illustration. A simulation study with a non-diagonal variance-covariance matrix shows better accuracy than an earlier comparable Gibbs sampling approach. Computing efficiency is demonstrated as the FORTRAN program converges in about ii 40 iterations within 30 minutes for a two-variance component model with 700 Poisson observations (100 groups each with 7 within-group units) on an IBM compatible 486 DX/33MHz personal computer. iii Copyright by WING SHING CHAN 1995 iv Dedicated to the land and people of Michigan ACKNOWLEDGMENTS I wish to express my thanks to Prof. Betsy J. Becker (Measurement and Quantitative Methods), Prof. Ralph L. Levine (Psychology), Prof. Stephen W. Raudenbush (Measurement and Quantitative Methods) and Prof. James H. Stapleton (Statistics and Probability) for serving on my dissertation committee and giving me many useful suggestions. I am grateful to Prof. Raudenbush for introducing me the concepts and principles of the Hierarchical Linear Models and to Prof. Stapleton for teaching me five graduate courses in mathematical statistics. Moreover, the large and beautiful campus of the Michigan State University has provided a favorable environment for building the foundation of my graduate education. I also wish to express my gratitude for the hospitality from the Department of Maternal and Child Health at Harvard University where I visited as a research specialist in 1991-1992. I wrote my dissertation proposal while residing in Somerville of Massachusetts during this period. vi I wish to thank my wife, Li Ming Wei, for her patience, encouragement and support as well as for her assistance in computer programming. She accompanied me to live in Forest Hills, New York where I conducted my independent dissertation research in 1992-1994. In 1993 the Queens College of CUNY had offered me some helpful part-time research and teaching opportunities and an Internet access to the rest of the world. I wish to thank my family members in Hong Kong for their needful financial support during the last phase of the dissertation research and writing. Last but not least, I wish to thank my Shih-Fu, Ch’an Master Dr. Sheng-Yen in Elmhurst, New York. From 1991 to 1995, his teaching in Buddhist meditation is of great help to my personal life and academic research as I attain a calmer and clearer mind. W. S. CHAN Queens, NY March 1995 TABLE OF CONTENTS LIST OF TABLES LIST OF FIGURES Chapter I. INTRODUCTION . An illustration for the application of a multilevel generalized linear model to educational research 11. BACKGROUND . Implications for the present research . Exponential family of distributions . Generalized linear models . III. STATISTICAL MODEL AND ESTIMATION STRATEGY A multilevel generalized linear model Maximum likelihood estimation via the Monte Carlo EM algorithm . IV. PARAMETER ESTIMATION The expectation step of the EM algorithm . The maximization step of the EM algorithm Estimation for the fixed effect parameters Estimation for the variance component parameters . viii Page xi 12 18 20 22 24 24 27 3o 30 32 32 35 VI. VII. VIII. VARIANCE OF PARAMETER ESTIMATES Estimation for the variance of the fixed effect estimates Estimation for the variance of the variance component estimates EMPIRICAL PROPERTIES OF THE STATISTICAL MODEL Maximization of the likelihood function . Convergence of estimated parameters Convergence from different starting values Influence of the size of Monte Carlo samples SIMULATION STUDY Simulation test for a random intercept model Simulation test for a random coefficient model . CONCLUSIONS Further research APPENDIX A: Mathematical notes A1 Expectation of the function of a random variable by a joint distribution . A2 Expected information matrix with weights A3 Conditional expectation by a missing data distribution via the Monte Carlo method . A4 Conditional expectation of the score function APPENDIX B: Estimation of starting values LIST OF REFERENCES: ix 4o 41 44 48 49 52 54 54 58 58 61 67 69 74 74 76 77 79 80 83 Table LIST OF TABLES Poisson, binomial and normal distributions as members of the exponential family . Comparing parameter and likelihood values obtained by two starting values (Parameter values for y and T are respectively -O.7 and 0.16) Results of 2 simulation studies. Case 1: Random intercept model. Case 2: Random coefficient model Log expected mean and expected mean (in parenthesis) by time and sex . Page 21 54 61 62 LIST OF FIGURES Page Figure 1. Log likelihood function as a function of the number of iterations for two sets of starting values . . . 51 2. Convergent paths of model parameters from two initialpositions............... 53 3. Influence of the size of Monte Carlo samples on the estimated log likelihood . . . . . . . . . 56 xi CHAPTER I INTRODUCTION Educational data concerning students often arise within classrooms or schools. Since different classrooms have teachers of various experiences and qualifications, and different schools have various school climates, policies and facilities, educational results vary according to the quality of the educational context. One of the most important questions in educational research is what kind of teachers or schools will enhance learning most. To answer this question empirically, we must collect data about students' academic performance and characteristics of the teachers, classrooms and schools. For research results to be readily used for academic theory construction or public policy making, large samples of classrooms or schools must be studied to make broad generalizations. However, traditional statistical research tools such as linear regression analysis fail to account for the variability due to the varying effects of each classroom or school. Ordinary random effects analysis of variance models also fail to allow for unbalanced data and flexible covariate adjustments. Because the scores of students within the same classroom or school are correlated, the standard statistical independence assumption is violated when data from large numbers of classrooms or schools are analyzed. As a remedy to these deficiencies in traditional statistical tools, multilevel or hierarchical linear statistical models have been developed for educational and social research applications (Aitkin, Anderson, and Hinde, 1981, Goldstein, 1986, Longford, 1987 and Raudenbush and Bryk, 1986). Multilevel linear models have two or more levels of regression equations. For example, in two-level models, a level- one equation relates students' outcomes (e.g., math achievement or verbal ability) to a set of independent variables (e.g., previous GPA, IQ, sex or socioeconomic status). However, each school is allowed to have its own regression equation. The slopes or coefficients of the level-one regression equation for each school are considered randomly distributed. The level-one coefficients are predicted by another set of independent variables related to the level-two units, that is, the school variables (e.g., private or public school, rural or urban school and racial composition of the school). The explicit modeling of level-one coefficients by level-two variables helps educational researchers study the intriguing relationships among schools, teachers and students. For example, Raudenbush and Bryk (1986) were able to demonstrate through a multilevel model that the type of school (Catholic or public) has a differentiating effect on the influence of students' socioeconomic status (SES) on Math achievement. Specifically, they found that Catholic schools have a flatter slope in a regression of Math achievement on SES, which supports the contention that Catholic schools are more egalitarian than public schools. Moreover, educational researchers making use of multilevel models need not be restricted by either choosing the student level or the aggregated school level for regression analysis. The estimated residual variances of the regression equations from both levels not only account for the components of variances at the student and school levels, they also give information about the relative amount of unexplained variation for the two levels. When repeated measurements on students over time are taken as the level-one units and the students are correspondingly taken as level-two units, the multilevel model can be used as an appropriate model to study academic growth over time. Student variables such as sex, race or income can be used to predict the effect of time on academic achievement (i.e., the rate of growth or change). Therefore multilevel models are a means to solving two of the most persistent methodological problems in educational research, namely the assessment of multilevel effects and the measurement of change or growth (Bryk and Raudenbush, 1992). Growing numbers of educational studies, for example, about vocabulary growth, school effectiveness and teaching styles have been conducted in a multilevel perspective (e.g., Aitkin, Anderson, and Hinde, 1981, Raudenbush and Bryk, 1986 and Huttenlocher et al., 1991). A review of educational applications of multilevel models is included in Raudenbush (1988) Notwithstanding the growing popularity of multilevel models, most of the research applications are confined to continuous outcome variables with normal error distributions. Details about the methodology and applications of normal multilevel models in social and educational research can be found in the books written by Goldstein (1987), Bryk and Raudenbush (1992) and Longford (1993). However, a wide range of educational outcomes are not normally distributed, for example, pass or failure (dichotomous variable), absence from school (discrete counts) and time before graduation or dropout of school (censored survival time data). These non-normal outcomes can be dealt with, respectively, by statistical models based on the binomial, Poisson and exponential distributions. These common distributions are subsets of the exponential family of distributions. A unified approach for maximum likelihood estimation of single-level regression models based on the exponential family of distributions is the generalized linear model ( Nelder and Wedderburn, 1972). Computer software for such statistical models is now also available (Aitkin et al., 1989). Many attempts have been made to develop multilevel non- normal models and they are described in the next chapter. However, a unified approach for a multilevel generalized linear model with full maximum likelihood estimation has not been developed. This dissertation is an attempt to extend the multilevel normal models to include other non-normal outcome variables so that a wider range of educational dependent variables can be investigated within the multilevel framework. In other words, the goal of the dissertation is to develop a unified maximum likelihood estimation approach for the multilevel generalized linear model. Since the generalized linear model covers a wide range of distributions, the dissertation will demonstrate statistical estimation of one member of the exponential family: the Poisson distribution. The same approach can be followed for the other distributions of the exponential family. Though the Poisson distribution is not commonly used in educational research, its potential for application should not be underestimated. The Poisson distribution is a standard model for independent count data. Much educational data exists in terms of counts, for example, school absence (Aitkin et al., 1989, p.223), classroom behaviors such as speaking in class, altruistic behaviors, antisocial behaviors, vocabularies of children’s utterances, number of peer-reviewed publications, teenage pregnancies, frequencies of using school facilities such as the library, gymnasium and counseling service, and numbers of times repeating a difficult required course or a certifying exam of a professional institution. The Poisson model is especially useful if the mean occurrence of counts per unit time is low. In these instances many people will have zero observed counts. Approximation of the empirical data by the normal distribution often fails to account for the positive skewness of the data. Since the educational outcome variables in the above paragraph also arise within educational settings, for example, classrooms, high schools and tutorial schools, the multilevel approach will often need to be applied. The multilevel Poisson model has an additional advantage of being able to help resolve the over-dispersion problem (Cox, 1983) due to more than expected variance. In theory, data of the Poisson distribution should have its mean approximately equal to its variance. However data arising from groups (e.g., educational institutions) are statistically dependent within a group, so the observed variance of the whole set of data can be much larger than the corresponding mean. A multilevel Poisson model accounts for the variation due to grouping by including a second or third level of variation. Thus it helps remove the over- dispersion due to the natural grouping of the data. Similar problems also exist for the binomial distribution and the multilevel approach can offer the same benefit. In essence, this dissertation will undertake research that fulfills the objectives below: 1) To develop a multilevel statistical model for the exponential family of distributions. 2) To provide maximum likelihood estimation of the parameters and standard errors for the parameters of the model. 3) To use the Poisson distribution to demonstrate the details ofthe estimation method. 4) To write an iterative computer program to obtain statistical estimation for the multilevel Poisson model by iteration. 5) To use a small simulation study to demonstrate the validity of the computer program. Before describing the state of previous research or the technical details for the formulation and estimation of the multilevel generalized linear model, an example on how the model can be applied could be illuminating. An illustration for the application of multileveQneraleed linear model to educaLional research To understand the formulation of the multilevel generalized linear model, it might be best to study a simple hypothetical example. A similar data analytic scenario from a real national school survey, with a normal outcome variable analyzed through the Hierarchical Linear Model, can be found in Raudenbush and Bryk (1986). Suppose a survey on the number of altruistic behaviors of the recent month is made on students from a large number of schools. Some of the schools have an ethics education program. The socioeconomic status (SES) of each student was also recorded. The research question is to study whether schools with ethics education programs affect the differentiation effect of SES on the exhibition of altruistic behavior in school. The analysis requires that we model the random effects due to each individual school, thus a multilevel model is formulated. Since the dependent variable takes on a random natural number greater or equal to zero, a Poisson model can be used to fit the data. As the Poisson model is a member of the exponential family, this is an example of the multilevel generalized linear model. In level 1, the 1"" student's altruistic behavior ya. in school j is modeled by his/her SES level: 1 log/1,). =60]. +,BUSES,.J.. Here 1,] refers to the mean of a student's counts of altruistic behavior. The response variable yijllij is assumed to be Poisson distributed with parameter iv. The logarithmic function is the 'link' function (Aitkin, Anderson, Francis and Hinde, 1989, p.76) that maps the natural numeric count onto the set of real numbers being predicted linearly by the student's SES variable. ,6,” is the random intercept for schoolj. ,6”. is the random slope for the effect of SES on the logarithm of the mean altruistic behavior for students in schoolj. A positive ,3”. will mean that the higher 10 the SES of a student is, the more altruistic behavior the student will elicit. For illustrations, we now assume that empirically ,6”. is positive. However, the implementation of an ethics education program can either suppress or elevate the influence of SES on altruistic behavior. To study the effect of higher level variables on the relationships between the outcome and the lower level variables, we need the level-2 equations: floj = 700 +701(ETHICS PRGM)j +uoj, :61} = y“, +yn(ETHICS PRGM)J. +ulj. The random intercepts and slopes of the schools are being predicted by the dummy variable for the ethics education program which is zero for absence and one for presence. The residual errors or random effects uj's at the school level are assumed to form a multivariate normal distribution with zero expectations. As an example, if schools with ethics education are found to have higher ,Boj's on average (i.e., yo] >0), it means that ethics education can help students behave more altruistically. If schools with an ethics education program have smaller fllj's on average (i.e., 7/“ <0), then we might infer that ethics education can suppress the influence of one's social class background on ll one's exhibition of altruistic behaviors in schools. In other words, ethics education enhances the egalitarian tendency of students toward acting altruistically. This example has demonstrated the potential of applying multilevel generalized linear models for real world educational research involving student-school interaction effects with a discrete count outcome variable. 1To facilitate understanding for readers who are previously acquainted with the Hierarchical Linear Model (Raudenbush and Bryk, 1986), the mathematical symbols are chosen to be similar to those of HLM. CHAPTER II BACKGROUND Much work has already been done on the multilevel normal models. A review of the methodology and applications of such models in educational research is given by Raudenbush (1988). Representative approaches that have been widely applied in educational research include the models and computer programs formulated by Goldstein (1986, 1987), Longford (1987), and Raudenbush and Bryk (1986). Since the present dissertation focuses on the multilevel extension for non-normal models, the review on multilevel normal models will not be repeated here. Multilevel extensions of various particular non-normal models have also been published. For example, Anderson and Aitkin (1985), Conaway (1990), Stiratelli, Laird and Ware (1984) and Wong and Mason (1985) have published papers on the binomial models; Goldstein (1991) on the log-linear model; and Albert (1985, 1988, 1992), Morton (1987), Tsutakawa (1985, 1988) on the Poisson models. Albert's (1985, 1988, 1992) models are Bayesian models with a gamma prior distribution. Morton (1987) uses the quasi-likelihood approach in an analysis of variance framework. Tsutakawa (1985, 1988) employs 12 13 approximating techniques for both Bayesian and empirical Bayes estimation. All of these past Poisson models do not have the flexibility to allow for a full scale two-level model with multiple explanatory variables and variance-covariance components. In terms of the whole exponential family of distributions, sometimes labeled as the generalized linear model with random effects, there are models developed by Anderson and Hinde (1988), Longford (1988), Zeger, Liang and Albert (1988), Schall (1991), Zeger and Karim (1991) and Breslow & Clayton (1993). Various estimation approaches and algorithms have been adopted in the above past research. Statistical estimation approaches include the Bayesian (e.g., Zeger and Karim, 1991), maximum likelihood (e.g., Anderson and Hinde, 1988) and the maximum quasi-likelihood method (e.g., Longford, 1988). The Bayesian approach is especially relevant when the level-2l sample size is small and the asymptotic normal approximation of the posterior distribution using the maximum likelihood approach becomes inadequate. Contrarily, the maximum likelihood approach generally involves less computation and simpler analytic methods. It also gives results similar to those of the Bayesian approach when the sample size is large. The maximum quasi-likelihood approach is similar to the maximum 14 likelihood approach except that the quasi-likelihood function only specifies the relationship between the mean and the variance and does not contain the full parametric likelihood (For details, see Wedderburn, 1974 and McCullagh and Nelder, 1989, p.325). Although the quasi-likelihood approach requires fewer assumptions than the ordinary likelihood approach, it can fail to give reasonable results in some cases (Crowder, 1987). There is also some loss of efficiency when the data depart from a natural exponential family (Firth, 1987). The algorithms relevant for the present research are: the EM algorithm (Dempster, Laird and Rubin, 1977), Fisher-scoring (Longford, 1987), iterative generalized least squares (Goldstein, 1986, 1989), data augmentation (Tanner and Wong, 1987) and Gibbs sampling (Gelfand, Hills, Racine-Poon and Smith, 1990). Tanner (1991) provides an excellent introduction to many data augmentation methods, broadly defined. It should be noted that a given algorithm might not be restricted to be used only for a particular estimation approach. For example, the EM algorithm can be used for maximizing a general likelihood (Dempster et al., 1977), for empirical Bayes estimation (Dempster, Rubin and Tsutakawa, 1981) or for Bayesian estimation (Racine-Poon, 1985) 15 Although several models already exist for formulating a multilevel generalized linear model, there has not been a full maximum likelihood (ML) model available for applications in educational and social studies. For example, although Anderson and Hinde (1988) formulate a maximum likelihood approach via the EM algorithm, their estimation method can only allow for a single random intercept model. The potential extension to high dimensions of random coefficients is also hindered by their use of Gaussian quadrature integration technique that works only for small dimensions (Rubinstein, 1981). Longford's (1988) non- normal extension to the multilevel normal model uses an approximation of the quasi-likelihood for estimation. A recent simulation study by Rodriguez and Goldman (1993) concludes that the approximate quasi-likelihood method of Longford (1988) is equivalent to Goldstein’s (1991) approximate generalized least square approach with regard to the multilevel logit model. The simulation results reveal substantial biases in the estimates of the fixed effects and/or the variance components whenever the random effects are large enough to be interesting. Moreover, the multilevel estimates of the fixed effects from the approximate quasi-likelihood method are virtually the same as those obtained using standard logit 16 models that ignore the hierarchical structure of the data (Rodriguez and Goldman, 1993, p.15). Zeger, Liang and Albert (1988) use a generalized estimating equation approach (Liang and Zeger, 1986) which combines quasi-likelihood and robust variance estimation for the marginal or so called ‘population average’ model. The estimating equation (Zeger et al., 1988, p.1053) is essentially a variant of the quasi-likelihood estimation (c.f. McCullagh and Nelder, 1989, p.327). Schall’s (1991) approach is an approximate maximum likelihood and quasi-likelihood estimation by means of a first order Taylor's expansion of the linked data. Zeger and Karim (1991) adopt the Bayesian approach via the Gibbs sampling method which is heavily computing intensive. Karim and Zeger (1992) have reported that a disadvantage of the Gibbs sampler is the computational burden. In their analysis of an ecological data set on Salamander mating with 360 binary responses, it takes about 5 hours of computer time on a 14 MIP DEC station 3100 microcomputer to generate 2000 simulated values from the posterior distribution, each obtained after 80 iterations of the Gibbs algorithm. They consider that the time required is sufficiently long as to possibly discourage the fitting of several different models (Karim et al., 1992, p.643). This 17 would not be feasible for educational applications with large data sets. In their paper on approximate inference for generalized linear mixed models, Breslow and Clayton (1993) use Laplace’s method for integral approximation in conjunction with a quasi-likelihood approach. Even for a particular distribution of the generalized linear model, there has not been a true maximum likelihood model. For example, Stiratelli, Laird and Ware (1984) estimate a multilevel binomial model via the EM algorithm. Because the joint posterior distribution of the fixed effect parameters and variance components is intractable, it is approximated by a multivariate normal approximation. The properties of their estimates will depart from those of ML estimates. Goldstein (1991) uses a first order Taylor's expansion for the nonlinear part of the likelihood and the resultant Iterative Reweighted Least Square estimates do not remain maximum likelihood estimates (See Rodriguez and Goldman, 1993). Overall, Zeger and Karim (1991) have provided the most promising simulation results so far for a multilevel Bayesian logit model with two variance components. They have demonstrated a simulation study using a diagonal variance- covariance matirx. A random intercept model is used to analyze 18 a clinical data set for illustration. Breslow and Clayton (1993) present a comparative simulation study against Zeger and Karim (1991). Their results approach that of Zeger and Karim (1991) when the sample size is increased. Impligtions for the present research Vis-a-vis the present status of research in the field of multilevel generalized linear models, this dissertation will attempt to provide an alternative of a full maximum likelihood estimation approach with multiple random regression coefficients. In order to reduce the expected intensive computation, I choose to adopt the maximum likelihood approach instead of the Bayesian approach because the latter generally requires solving high dimensional multiple integrals (Smith et al., 1985) or many repeated rounds of simulations (e.g., Zeger and Karim, 1991). The maximum likelihood approach is used in preference to the quasi-likelihood approach because of the former's well-behaved asymptotic properties of consistency, unbiasedness and efficiency (Rice, 1987, p.234-254). As for the algorithm that maximizes the likelihood function, the EM algorithm (Dempster et al., 1977) is used for 19 the following reasons. Since much educational data involves national surveys with huge numbers of schools, the number of unobserved random effects due to schools could be large (e.g., in the thousands). A naive and direct maximization of the likelihood involves simultaneously estimating the fixed effects and all the random effects. These huge number of parameters are unstable to estimate for most common maximization routine. However, in maximizing the marginal likelihood, the EM algorithm treats the random effects as missing data and essentially estimates just the fixed effect and variance component parameters. The EM algorithm is preferable to the other Newton-type algorithms because it does not require computing the inverse of the information matrix. Generally, simple closed form solutions are attainable for the iterating steps of the EM algorithm. Iterates are also confined within the parameter space during the execution of the EM algorithm. It is also proved that the EM algorithm increases the marginal likelihood of the observations for every iteration (Dempster et al., 1977). A disadvantage of the EM algorithm is its relatively slow linear convergent rate. Because I have adopted Monte Carlo integration to use with the EM algorithm, a slower convergent rate could prove to be an advantage because the 20 iterates will not so easily jump out of the parameter space due to the randomness of the Monte Carlo simulations. More details about the EM algorithm will be provided in the next chapter. Before giving the mathematical details in formulating and estimating the multilevel generalized linear model, the definitions and properties of the exponential family of distributions and the generalized linear model are described as follows. Exponential family of distributions The distribution of a random variable Y belongs to the exponential family if it can be expressed in the form (McCullagh and Nelder, 1989, p.28-29.): f (yl 49. ¢) = exvflyé' - 11(6)) / a(¢) + c(y. ¢)] (2. 1) where a(.),b(.)and c(.) are some known functions. The parameter 6 is called the canonical parameter and ¢ is the dispersion parameter. The function a(¢)=¢/m, where m is the prior weight for the data. This function reduces to ¢ if the data is unweighted. It can be easily shown that the first two moments are (McCullagh and Nelder, 1989, p.29): E(Y) = 5(9), Var(Y) = b"(6)a(¢). (2. 2) 21 Therefore, the mean of the observations is related to the canonical parameter 0. The variance depends on the canonical parameter (and hence on the mean) as well as on the dispersion parameter ¢. Other representations of the exponential family of distributions can be found in Dempster, Laird and Rubin (1977) and Zeger and Karim (1991). For example, the Poisson, binomial, and normal distributions can be expressed in the form of the exponential family of distributions as shown below (McCullagh and Nelder, 1989, p.30): Table l. Poisson, binomial and normal distributions as members of the exponential family Distribution Notation ¢ 19 b(0) C(y,¢) Poisson [9(1) 1 log}. exp(6l) —logy! Binomial B(n,7I)/n l/n log(——”—) log(l+e9) log(n] l-rr ny 2 Normal N(,u,02) 02 ,u 92/2 -%[y?+108(27r¢)] 22 Generalized linw models The linear model based on the exponential family of distributions is introduced by Nelder and Wedderburn (1972) as the generalized linear model. The model is defined by N independent random variables Y1,Y2,...,YN sampled from the exponential family of distribution with corresponding canonical parameters 61,62,...,6N and a common dispersion parameter ¢. The joint probability density function of the Y's (unweighted) is therefore f(yny2,-~,y~|91,92,~-,9~,¢)= 6XP{_Z[y.-9.- - b(49,-)1/ ¢+ZC(y.-,¢)l}- (2 . 3) In actual modeling, the expected value ,u, of Y, is predicted by a linear combination of explanatory variables xl,x2,...,xN as follows: g(#.-)=X.'fl, (2-4) where g(.) is a monotone link function (Aitkin et al., 1989, p.76), ,6 is the p x 1 vector of parameters and X1 is the vector of explanatory variables. The link function relates the expected values of Y to its linear predictors through a transformation function. The principal usage of the link function is to map the limited domain of the mean of the observations onto the real line. For example, the domain of the mean number of counts from a Poisson distribution is never negative, the logarithmic 23 function puts the mean onto the real number domain. The commonly used link functions for the Poisson, binomial and normal distributions are the logarithm, logit and identity functions respectively. These link functions have the property of 6,=g(,u,.)=x,',6 and are named as the canonical links. ' Personal communication suggested by Dr. Stephen W. Raudenbush. CHAPTER III STATISTICAL MODEL AND ESTIMATION STRATEGY A multilevel generalized linear model In the multilevel framework, an observation is represented by yij, i=1,2,...nj; j=1,2,...,J. The subscript i refers to the units in level-one, e.g., pupils, andj refers to the units in level-two, e.g., schools. P is the number of level-one predictors and Q, is the number of level-two predictors for the pth random coefficient in level-one. Conditional on a P+1 x 1 random effect vector u), the observations are random samples from the exponential family of distributions with density (McCullagh and Nelder, 1989): f(y.,-| “1:6y9¢) = eXPKyy-Q-j - b(011W ¢ +C(yy-,¢)l- (3- 1) The conditional moments are 1,, s E(y.-,-|u,) = my), Var(y.-,-lu,) = we...» ’ (3.2) Let the level-l equation be: g(#ij)Enjj =fl0j+flljxlij+'“+flR/’xl’gj (3-3) 24 f1) x1y x23 \x Pg} The level-2 equations are 25 .30) flu I31),- flOj = 700 +701w01j +"°+7OQOWOQ0j +u0j. :81} = 710 +71lwllj+'”+71QleQ,j +ulj. 161’} = 7P0 +ylePlj +---+7 PQPWPQPJ +tu‘ The above can be expressed as 180) 1,wmj,...,w0Qaj 4,. _ 0 fl.) 0 (700‘ 701 7 OQO 710 711 719, 7P0 7P1 K7 P9,) (3.4) 0J' “u 1' In matrix notation, the level-1 (3.3) and level-2 (3.4) equations can be rewritten compactly as filly); 771} = xijflp (3.5) 26 flj=w17+ur (3.6) P The matrix wJ has dimension equal to (P+1) x Z(Qp+l) and the p=0 P vector 7's dimension is equal to 2(Qp+l) x 1. If Qp=Q for all p, p=0 then ZPXQP +1)=(P+l)(Q+ l ). Combining the equations (3.5) and (3.6), we have g(,u,.j)=17,.j =1r'“wj}’+x'uuJ (3.7) =Auy+Bou (say). (3.8) The random effect vector 111 is assumed to vary with a multivariate normal distribution with zero expectations and variance-covariance matrix T with dimension P+l x P+l. The principal objective of the statistical estimation for this model is to estimate the P+l fixed effects 7 and the symmetric matrix T with (P+1)(P+2)/2 unique variance-covariance components. As noted earlier, direct maximization of the marginal likelihood function log f(Y|}/,T,u) is not feasible because the number of unknown vectors 11], which depends on the number of schools, can be very large. Alternatively, one can integrate out 27 the random effects u and maximize the log likelihood by partial differentiation. Let (p=(}/,T), we have a—i—logflWhgloglflmuvmmdu.1 (3.9) However, except for normal distributions, the above integral cannot be solved analytically. High dimensional numerical integration of the above requires the knowledge of (o the unknown parameter. Thus we need some kind of iterative procedure to maximize the log likelihood. MXimum likelihood estimation via the Monte Carlo EM flgorithm Treating the random effects 11 as missing data, the EM algorithm (Dempster, Laird and Rubin, 1977) can be used to maximize the analytically intractable marginal log likelihood function f(Y|y,T) indirectly through iterations derived from a combination of the E (expectation) step and the M (maximization) step. The E step computes a Q function which is the conditional expectation of logf(Y,u|7,T), the 'complete data' log likelihood, with respect to the distribution of the 'missing data' 11 28 conditioned on the observed data Y and the parameter values (0") at the 1th iteration. Q(<0;<0"’)= j108f(Y,u|¢)f(UI<0"’,Y)du (3. 10) The Q function is a function of (a and is maximized to obtain the parameter values gym” for the 1+1”I E step. That is, the M step solves the following equation: a . (I) _ , Q((0,(0 )—0. (3.11) “P Iteration between the E and M steps continues until the parameter values converge. Dempster, Laird and Rubin (1977) have shown that the EM algorithm increases the marginal likelihood f(Y|7,T) for every iteration and the algorithm converges to at least a local maxima. Wu (1983) discovered an error in the proof for the convergence of the EM algorithm in Dempster et al. (1977). However, Wu (1983) shows that under mild regularity conditions the EM sequence converges to the maximum likelihood estimate. Wu (1983, p.95) contends that Dempster et al.'s (1977) results on the monotonicity of likelihood sequence and the convergence rate of the EM sequence remain valid. In other words, employment of the EM algorithm by itself guarantees maximum likelihood estimation if 29 the starting values are close to the maxima and convergence is achieved. In practice, the integral for the expectation step may be hard to obtain. One can use the so called Monte Carlo EM algorithm (Wei and Tanner, 1990) to bypass the high dimensional multiple integral. The Monte Carlo EM algorithm is an EM algorithm with the expectation in the E-step approximated by a Monte Carlo integration. The conditional expectation of the complete data log likelihood is now estimated by the arithmetic mean of M realizations of the complete data log likelihood. with the unobserved random effects substituted by their simulated values: Q(¢;¢“’)-=- fizlogflvmnw), (3.12) m=l where um is generated from the distribution f(u|(o‘”,Y). It should be noted that the accuracy of the above approximation can be improved to any degree by increasing the number of random samples in the calculation. 1For simplicity of notation, the domain of the multiple integral in this dissertation is not printed. Interested readers can refer to the notation used in Appendix A1 for a more rigorous presentation. CHAPTER IV PARAMETER ESTIMATION Convergence to maximum likelihood estimates is achieved by alternative implementations of the E-step and M-step of the EM algorithm. The E-step consists of computing the conditional expectation of the 'complete data' log likelihood function below: The expeat_ation step of the EM algorithm Q(¢;¢"’) = I logf(Y,ul¢)f(ulr/)"’.Y)du <4. 1) = i[Zlogf(Y,,u,l¢)1f(u|go“’,Y)du j=l = zjlogm.u,1¢>/(m¢“>.v>du. (4.2) 1:1 By applying the results from Appendix A1, Corollary 1, expression (4.2) becomes 2j108f(Y,,u,l¢)f(u,l¢"’,Y)du, pi = ZjlogflYpu,l¢)f(u,|¢"’,Y,)du,, (4 . 3) 1:1 due to the independence between level-2 units. Applying Bayes' theorem, 30 31 f (Y,|u,,¢"’)f(u,lco‘”) f (Y1l (PM) i ’ Q(<0;¢‘”)= Zjlogmpum 1:1 = Zzlrlk’gf(Yp“1|¢)f(Y1l “1,¢"’)f(u,|¢“’)du,. (4. 4) Fl 1 The constant C,:f(Y,|(p“>)=jf(Y,|u,,¢“>)f(u,|¢<”)du, is computed by the Monte Carlo integration method (Rubinstein, 1981). l M C1zfi2f(leunp¢m)a (4-5) -=l where “11a“21w-sum ~f(uj|(0“’), which are M samples of simulations from the multivariate normal distribution. Applying the Monte Carlo integration technique again to (4.4): Wm Q(<0;co"’)~ Z—h:?zlogf(v,,u..l¢)f(v,law“) (4.6) 1:1 1 111:1 logf(Y,,u...l¢)f(Y,lawe”) °<= Z icing/(Y1 “w(0)+logf(u_u|¢)]f(yll W11.) 111:1 1:1 1 M .1 ocZZUogf(Y,Iu.,,7)+logf(u...lT)]w...., where -:1 1:1 (4.7) f(Y,|u...,,¢"’) "‘ ’ f(Y,lu.,,¢“’) C 22 C . 1 n=11=1 1 32 The M-step maximizes Q(.) to give the new parameter (1+1) (1+1) values (p . Parameters y are obtained by solving the equations M J 1 Y , 2.11213 Dwain" ”We” (4.8) and T“”’ by solving M ’ logf(u...,|T) _ 2:3 Wee” y/mj—O. (4.9) The maximization step of the EM algorithm The solution for maximizing an unweighted single level analogue of the likelihood function in equation (4.8) can be found in Aitkin et a1. (1989, p.322-325) for the generalized linear models. Anderson and Hinde (1988, p.3851-3852) have also derived the solution to a random intercept multilevel model with weighted likelihood using the numerical Gaussian quadrature method. In this dissertation, the solution to a random coefficient multilevel model with weighted likelihood function involving Monte Carlo integrations is derived below. Estimation for the fixed-effect parameters 33 The maximization of Q(.) with respect to y is equivalent to maximizing a weighted sum of log likelihoods of the conditional distribution of the outcome variable given the random effect variables “-41- The inclusion of the outer summation sign in subscript m is equivalent to expanding the original data set by M times with the simulated random effects um, appropriately substituted (c.f. Anderson and Hinde, 1988). We have to maximize the following weighted log likelihood function of the exponential family with respect to y: L<7>=222{u.6.. —b<9..)1/¢+c(y..¢)w.j. (4.10) m=lj=l i=1 with g(,um,.j)=9m,.j =77my =x11w17+xi1un1§Au7+Buunu using the canonical link function. The derivative of (4.10) w.r.t. 7 is M J "j 56 .. fl=ZZZIy.-b'wm.>1jlwmj/¢. <4“) 697 nr=l j=l 1:1 For exponential families, p=E(y)=b'(t9) and Var(y)=¢b"(6), thus 59mg? _ 86mg alum!) . film” ay "44.1%.... 0’7 = ”1 . 1 .Au. (4.12) b (any) g'(/ijj) M J "r - A . Therefore 1:2220” pm”) "W”. (4.13) 5’7 m=l 1:1 1:1 meg'UImy) 34 To form a Fisher's scoring algorithm (c.f. Seber, 1989, p.685) as in the GLIM program (Aitkin et al., 1989), we need the expected value of the second derivative of the weighted log likelihood: By results proved in Appendix A2, 52L) M J "I 1 [01,“, 01mg.) E——— :— —E —— 4.14 t... 2.2,: .. .. < > z 9142': 1 EU. w... 121114.113..- "1:1 1:1 1:1 Wm} mez [g'U‘my )]2 111:1 1:1 1:1 mgig'(#m1j )]2 MJ"; =‘ZZwajAuA1 (4'15) "1:1 1:1 1:1 : —A'QA (4.16) where A={[A,,...,AJ]m=l,...,[A1,...,AJ]m=M}' with A11={A11a---,A..,1}' and!) is . = WM] '1 Vmglg’ (11....) )]2 a diagonal weight matrix with elements (am (4.17). For the 1th iteration of the Fisher's scoring algorithm, the (1+1) new estimate 7 is given by d“) y"+”:y‘”+(A'Q‘”A)"— . (4.18) 37 3L M J "1 NOW 5=ZZZ(yy—#mfi)g(flmy)Aflwmij (4-19) m=l j=l 1:1 =A'Q"’6 (4.20) 35 where 5:[(51,]...6'1,“)...(6'“J...6'1,,JJ)],...,[(§',m...6'M,,ll)...(5:,m...6'mlj)]}' with 6..=0. For er and subject to 211/}:1, the weighted log 1:1 J J likelihood Zy/jlogflule) is maximized uniquely at T = Zy/jujuj'. 1:1 j=1 Proof: The following two lemmas are required for the proof of this theorem. Lemma 1: Consider the matrix function f, where f(T)=log|T| + trfl‘"]. If (I) is positive definite, subject to T being positive definite, f(T) is minimized uniquely at T= (Watson, 1964). 37 Lemma 2: Let U'=(u1,u2,...,uJ), where the u are J 1 independent p-dimensional vectors of random variables, and let ‘I’ be a positive semidefinite JxJ matrix of rank r (rzp). Suppose that for eachj and all h (b¢0) and c, pr(b'u,-=c)=0. Then U"I’U is positive definite with probability 1 (Das Gupta, 1971). Let c: ”€ij log271, then I 1 1 , _ 2V1 logf(ule)=c-:2—Zt//j longl—EZV’J‘U T 1“} j J J 1 1 , _ =c——log|T|Zr//j——Zy/jtr[ujT'uj] 2 j 2 1 l l -l 1 =c-—log|T|-—trT ijujuj 2 2 , =c—%{log|T|+tr[T"Zwjujuj'D. (4.26) J J Let U'=(u,,u2,...,uJ) and ‘I’=diag(1//l,y/2,...,1//J), then ijujuj' 1:1 = U"I’U. ‘1’ is positive semidefinite of rank r because r is the count of V11>0- Since 11,- ~ Np(0,T) and T is positive definite, then for all h (b¢0) and c, b'uj ~ Np(0, b'Tb). This distribution 38 is nondegenerate as b'Tb>0. Hence pr(b'u,-=c)=0. Therefore for J er, ijujuj' is positive definite with probability 1 by lemma 2. 1:1 J Directly applying lemma 1 to (4.26), Zyleogflule) is 1=1 J maximized uniquely at T =Zy/jujuj'. D 1:1 Applying the above theorem, the weighted likelihood of the multivariate normal distribution ZZlogflumleyymj in equation 1» 1 m} "'1' (4.9) is maximized by T"“’=ZZz//mju u ' because it could be m J easily verified that 232me =1. m 1 In practice, because the M-step needs only one iteration of an IRLS, this EM algorithm becomes a generalized EM (GEM) algorithm (Dempster et al., 1977). The fixed parameters are thus expressed as a weighted least squares expression (4.22). Applying the theorem above, a simple closed form solution exists for the variance component parameters. Hence, the advantage of using the EM algorithm, which often involves closed form solutions in the M-step, is maintained. 39 The convergence to the maximum likelihood estimate can be monitored by tracing through the expected complete data log likelihood function, Q(.). The reason is that every step of the EM algorithm will increase the log likelihood function through the increase of Q(.) (Dempster, Laird and Rubin, 1977). With Monte Carlo EM algorithm, the complete data likelihood or any other estimated iterates converge in a stochastic fluctuating manner. By plotting the iterative values of Q(.), the algorithm is stopped after the Q(.) function has risen to a maximum plateau. In an alternative parallel way, one can also monitor convergence by noting the increment of the model parameters by plotting parameter values against the number of iteration. After convergence, the algorithm can be allowed to run for more steps with an increased number of Monte Carlo samples so as to decrease the stochastic variation of the estimated parameters. CHAPTER V VARIANCE OF PARAMETER ESTIMATES The variance for the estimates of the fixed parameters and variance components can be estimated using the inverse of the information matrix (see Aitkin et al., 1989, p.81): flogfmwj“ . (51) 3‘02 W14 6’ The diagonal elements of V correspond to the large sample variances of the maximum likelihood estimator for the parameters (p. Louis (1982) shows that the Fisher information matrix of the incomplete data likelihood can be expressed as a function of the conditional expectations of the complete data information matrix and the cross-products of the complete data score function. In the terminology of this study, Louis’s (1982) result can be re-expressed as: _ 0"210gf(Y|(0) = _E[62 logf(Y9ul¢)]_E[31°gf(Ya“|(P) 510gf(Y, Ill¢)] flgoé'go' 560590. &(p flw' +§logf(Yl¢)§108f(Yl¢), (5 .2) 410 310' with the expectation taken with respect to f(ulY,go). The observed information matrix is thus 40 41 _ 4: log/”(11141 : -111 4041' I, 4144' mag/”(um E15 1011101111,) 5 ”gm “"o’i (5 3) 6’ W W for the last term in equation (5.2) above vanishes at the maximum likelihood estimator (,7). This method makes use of the complete data distribution f(Y,u|(o) rather than the marginal distribution f(Ylgo) which does not have an analytic derivation. Similar to its application in the EM algorithm, the method of Monte Carlo can be applied to (5.3) to ease the integration problem (Tanner, 1991, p.39). Since maximum likelihood estimators are asymptotically normally distributed, the estimated standard errors can be easily used to create confidence intervals and hypothesis tests. Estimation for the variance of the fixed effect estimates The components of equation (5.3) required for the evaluation the variance of the fixed effect estimates are derived as follows: éalogf(Y,u|¢) 6%ng ,u,I¢) Y d“ E1 (W 1=1§§W f(ul .10) g M J 1 flzlogflY uwlco) EEMC. 44 f(Y jl unpg’b)a (see Appendix A3) 42 ___M , 1 3210.5“wa (0) 421816.114] - §§Mqi 414' 5757 “Y '“ “”"0’ M , 1 a’lgfl | ) . zgqui o Ergu'fljflY lu..,,¢) (5.4) 0"210gf(Y’|uw,¢) ‘1 59,819,,” "J AEAE' h = -b"6’ = . 5.5 6W Z ( a) 31' ¢ Emmet”)? ( ) Moreover, [610g f(Y, ul (0) 418161.411 4) (91' _E[(4Iogf(vlu 4) ++4logfR, and ngP”—-)R and let f(u)=f(ul)f(u2),...,f(uJ) be a joint distribution function ofJ multivariate random variables u1,...,uJ each with dimension P+1. Then the expectation E[g(u9)] = Lg(u))f(u)du = I,g(u,)f(u.>du, . Proof: Lg(u,-)f(u)du= Lg110, )f(u)du = Z [38(14))f(u))du)jsg(uj- )f(u,- >424) Menu-.9; J.J'=1:1==f Proof: It follows directly from summing both sides of the results from Appendix A1, Corollary 2. CI 76 A2 Expected information mairix with weights 1 Let L((0)=Zl//,logf(y,|(p) be a weighted log likelihood i=1 function with 1 independent random samples and L,((p)=i;/,logf(y,lgo). f(y,|gp) is a probability density function with parameter (a and w, is some weight of given value . Then 46%») ___ {I} _1_ 401.00) (1.02)) . flaw. i=1 V11 flip 580' Proof: 4521472)) 5959' Eb"; .Zw.logf(y.-I¢)) =Zw, 455;, .logf(y9|¢)) ' 3108f(y9|¢)3108f(y9|¢)) — E E“ l a» 49' (Seber, 1989, p.685), = {12:45114. 108f(y9| con—5:17!) 108 mm) H; w, 610 550' 77 A3 Conditional expectation by the missing data distribution via Monte Carlo method Let E[g(u)] be the expectation of a function g(.) of u, consisted of independent variates u,,u,,...,u,, conditioned on the dataY with independent componentsYl,Y2,...,YJ and the current parameter value (0“) at iteration I. Then E[g(u)]= Igfdu M J 1 z z —g(ll )f(Y|u 91p") EEMC’ l1 .1 III where 1 M (I) CJ~fiZf(Y,Iu...,¢ ) Ina! and uu,uzj,...,uMl ~f(u1|(p‘”) are M samples of simulations from the multivariate distribution f(u,I)du, (A3. 1) I J=l where the constant Clsf(Yj|¢7“’)=[f(YJIul,¢(”)f(u’|(0(")a‘uJ is computed by the Monte Carlo integration method (Rubinstein, 1981): l M , C] z—M—Zf(leufla¢())a 111:] where uljauzp-Haum ~f(ujl¢(l) )3 are M samples of simulations from the multivariate distribution f(u,I¢"’)- Applying the Monte Carlo integration technique again, equation (A3.1) is approximated by E-filc—Jywovwiuww‘”) =ZZ$3(“u)f(YlI“-m¢m)~ D (A3'2) n=lj=l j 79 A4: Conditional expectation of the score function Lemma (c.f. a similar argument from Tanner, 1991, p.37): 3108 f (YJI w) 5108f(uI¢) I &’ f(u,IY,¢)du,= at Proof: f(Y,,u,I¢) Y = f( ,Ico) f(u,|Y,¢) f(leup¢)f(ujl¢) Y = f( JI40) f(u,|Y,,¢) 5108f(Y,l¢) : 5108f(Y,|IIp(/J) + 3108f(u,|¢) _ 5108f(u,|Y,,(0) fir at 31 at _ 0+ 5108f(u,|¢) _ alogflHIIchD) fir 37' ' Taking the expected value for both sides w.r.t. f(ujlYJ,(p): é’logf(YJ|(o) = Ialogf(u,|(o) logf(u,IY .77) a f(u,IY .mdu, —I f(ujlYpfldII, 0'11 0"! 52' 5’10 /( I71) 4101mm = I gar“ f(u,IY,,¢)du,- I ’5, du. =J‘alogf(uj1¢) at f(u,IY .mdu, - gi— I f(u,IY,,¢)du, = I mag/(um 3r f(uJIY,(o)du,. CI APPENDIX B Estimation of Starting Varlues Starting values are vital to the efficient and successful implementation of any iterative algorithm. In our case, if the starting values are too distant from the final ML estimates, it will take a long time for the program to converge. In the worse case, the program will fail. It is because the starting values may not be within a quadratic likelihood region of the maxima which is a requirement for Newton-type maximization algorithm to work (Seber and Wild, 1989). Precise estimation of starting values may sometimes require iterations and becomes time consuming. Hence a good choice of starting values are those which are easy to compute and close to the final ML estimates. Starting values for the fixed effects 7 are calculated by regressing the logarithm of the dependent variable on the independent variables related to the fixed part of the model. The random part of the model is ignored for simplicity: log(,u,.j)=A,j;/+B,ju szy. (B.l) 80 81 By taking the datum Y ,1. as an approximation for mean #0., we can obtain starting values 7“” as the least square solutions of the model: log(Yy)=Au7(°’+eij, (B.2) and 7(0) =(A'A)-1A'Z, (3.3) where A=(A;,A;,...,A'J)' and Z:[(logYu,...,logYn‘1),...,(logYU,...,logYan )]' are, respectively, the stacked matrix of the independent variables for the fixed effects and the stacked vector of the logarithm of the dependent variable across all the level-2 units. An arbitrary small negative value in place of zero is used to avoid undefined value for the logarithmic function: For example, if Y.)- =0, Yijis set to be 10". The above computed starting values for fixed effects are used to calculate the starting values for the variance-covariance matrix of the random effects: Let Z; =log(Yy.)—Au7‘°) zBou. (B.4) Random effects of each level-2 units can be approximated by regressing the difference between the logarithm of the dependent variables and the predicted fixed part of the systematic 82 component on the independent variables related to the random effects: uj" =(B'JBJ)"B'JZ;. (B.5) Starting values for the variance-covariance matrix are estimated by J T"’=%Zu§°’uj”', (Seber, 1984). (B6) j=1 The above estimation has the advantage of being positive definite with probability 1, which is a requirement for generating the multivariate normal random variables in perfoming Monte Carlo integrations. LIST OF REFERENCES Aitkin, M., Anderson, D., and Hinde, J. (1981). Statistical modeling of data on teaching styles. Journal of the Royfl Statistigal Society. Series A. 144(4). 419-461. Aitkin, M., Anderson, D., Francis, B. and Hinde, J. (1989). Mistical modeling in GLIM. NY: Oxford University Press. Albert, J.H. (1985). Simultaneous estimation of Poisson means under exchangeable and independence models. Journal of Statistical Computation and Simulflion. 2;, 1-14. Albert, J.H. (1988). Bayesian estimation of Poisson means using a Hierarchical log-linear model. In J.M. Bernado, M.H. Degroot, D.V. Lindley and A.F.M. Smith (Eds.), Bayesian statistics 3, (519-531). NY : Clarendon Press. Albert, J. (1992). A Bayesian analysis of a Poisson random effects model for home run hitters. The American Statisticia_n_, 4a,246-253. Anderson, D.A. and Hinde, JP. (1988). Random effects in generalized linear models and the EM algorithm. 83 84 Communications in Sfltistics Theogy and Methods. Q(ll), 3847-3856. Anderson, D.A. and Aitkin, M. (1985). Variance component models with binary response: interviewer variability. Journal of the Royal Statistiaral Society. Ser. B. 47, 203-210. Becker, B.J. (1988). Synthesizing standardized mean-change measures. British Journal of Mathematical_and Statisticfl Psychology, 41_, 257-278. Box, G.E.P. and Tiao, G.C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley. Breslow, NE. and Clayton, D.G. (1993). Approximate inference iueneralized linear mixed models. Journal of the America Statistical Association. 88. 9-25. Bryk, A.S. and Raudenbush, SW. (1992). Hierarchical lineg models for sociafimd behavioral research: Appligtions and data analysis methods. Beverly Hills, CA: Sage. Crowder, M. (1987). On linear and quadratic estimating functions. Biometrika, 7_4_, 591-597. Conaway, M.R. (1990). A random effects model for binary data. Biometrics. 46. 317-328. Cox, D.R. (1983). Some remarks on over-dispersion. Biometrika, 7_0, 269-274. 85 Das Gupta, S. (1971). Nonsingularity of the sample covariance matrix. SMhya A. 33. 475-478. Dempster, A.P., Laird, N.M. and Rubin, DB. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journfl of the Royal Statistical Society. Ser. B. Q, 1-38. Firth, D. (1987). On the efficiency of quasi-likelihood estimation. Biometrika, 14, 233-245. Gelfand, A.E., Hills, S.I., Racine-Poon, A., and Smith, A.F.M. (1990). Illustration of Bayesian inference in Normal data methods using Gibbs sampling. Journfl of the America_n_ Mstical Association. 90, 972-985. Goldstein, H.I. (1986). Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika, 7_3, 43- 56. Goldstein, H.I. (1987). Multilevel models in educational and social research. London: Oxford University Press. Goldstein, H.I. (1989). Restricted unbiased iterative generalised least squares estimation. Biometrika, _‘_7_6_, 622-623. Goldstein, H.I. (1991). Nonlinear multilevel models, with an application to discrete response data. Biometrika, 3, 45-51. 86 Huttenlocher, J.E., Haight, W., Bryk, A.S. and Seltzer, M. (1991). Early vocabulary growth: Relationship to language input and gender. Developmeflal Psycholgy, Ml).- 236-249. Jamshidian, M. and Jennrich, R. (1993). Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association, 88, 221-228. Johnson, L.W. and Riess, R.D. (1982). Numerical Angysis, Philippines: Addison-Wesley Karim, M.R. and Zeger, S.L. (1992) Generalized linear models with random effects; Salamander mating revisited. Biometrics, 48,631-644. Liang, KY. and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 7_3, 13-22. Longford, N.T. (1987). A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika, fl, 817-827. Longford, N.T. (1988). A quasi-likelihood adaption for variance component analysis. Proc. Sect. Comp. Stgstq Am. Statist. Assoc., 137-142. Longford, N.T. (1993). R_andom coefficient models. NY: Oxford university press. 87 Louis, T.A. (1982). Finding observed information using the EM algorithm. Journal of the Royal Statistigl Society. Ser. B, 41, 98-130. McCullagh, P. and Nelder, J.A. (1989). Generalized lineg models. London: Chapman and Hall. Morton, R. (1987). A generalized linear model with nested strata of extra-Poisson variation. Biometrika, 71, 247-257. Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal StatistiLaI Society. Ser. A, 135, 370-384. Press, W.H. (1986). Numerical recipes: The art of scientific computing. NY: Cambridge University Press. Raudenbush, S.W. and Bryk, A.S. (1986). A hierarchical model for studying school effects. Socio-logyof Education. 59, 1-17. Raudenbush, S.W. (1988). Educational applications of hierarchical linear models: a review. Journal of Educationfl Statistics, 13, 85-116. Raudenbush, S.W. and Chan, W.S. (1992). Growth curve analysis in accelerated longitudinal design with application to the National Youth Survey. Journfl of Crime and Delinquency. 2__9_, 387-411. 88 Raudenbush, S.W. and Chan, W.S. (1993). Application of a Hierarchical Linear Model to the study of adolescent deviance in an overlapping cohort design. Journal of Consultingfl Clinical Psychology, 6;, 941-951. Racine-Poon, A. (1985). A Bayesian approach to Non-linear random effects models.Biometrics. 41. 1015-1023. Rice, J.A. (1987). M_athemaatical statistics an_d data analysis. California: Wadsworth Inc. Rodriguez, G. and Goldman, N. (1993). An assessment of estimation procedures for multilevel models with binary responses. Paper presented at a workshop on multilevel analysis organized by the Rand Corporation, Santa Monica, California, July, 1993. Rubinstein, R.Y. (1981). Simulation and the Monte Carlo method. NY: Wiley. Schall, R. (1991). Estimation in generalized linear models with random effects. Biometrika, _‘Zfi, 719-727. Seber, G.A.F. (1984). Multivariate observations.NY: Wiley. Seber, G.A.F. (1989). Nonlinm raggssion. NY: Wiley. Smith, A.F.M., Skene, A.M., Shaw, J.E.H., Naylor, J.C. and Dransdield, M. (1985). The implementation of the Bayesian paradigm. Communications in Statistics: Theory 89 @d Methods. MU), 1079-1102. Stiratelli, R., Laird, N. and Ware, J.H. (1984). Random effects models for serialobservations with binary response. Biometrics. 40. 961-971. Tanner, M.A. and Wong, W.H. (1987) The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association. 8_2_, 528-550. Tanner, M.A. (1991). Tools for stagstical inference: Observed data and data augmentation methods. Berlin: Springer-Verlag. Tsutakawa, R.K. (1985). Estimation of cancer mortality rates: A Bayesian analysis of small frequencies. Biometrics. 41. 69-79. Tsutakawa, R.K. (1988). Mixed model for analyzing geographic variablity in mortality rates. Journgfl of the Americaa §_t_21tistiaal Association. 83. 37-42. Watson, G.S. (1964). A note on maximum likelihood. Sankhya A, aa,303-304. Wei, G.C.G. and Tanner, M.A. (1990). A Monte Carlo implementation of the EM algorithm and the Poor Man's Data Augmentation algorithms. Journal of the American Statistical Association, 8;, 699-704. 90 Wong, G.Y. and Mason, W.M. (1985). The hierarchical logistic regression model for multilevel analysis. Journal of the American StlatisticQAssociation. 80, 513-524. Wu, C.F.J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics. 31, 144-148. Zeger, S.L., Liang, K.Y. and Albert, P.S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics. 44, 1049-1060. Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random effects: a Gibbs sampling approach. Journal of the American Statistical Association. 86, 79-86. "‘IIIIIIIIIIIIIIIIIS