This is to certify that the dissertation entitled "Increasing the Efficiency in Estimating Multilevel Bernoulli Models," presented by Meng-Li Yang, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement and Quantitative Methods.

Major professor

Date: May 15, 1998

INCREASING THE EFFICIENCY IN ESTIMATING MULTILEVEL BERNOULLI MODELS

By

Meng-Li Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

Spring 1998

Copyright by Meng-Li Yang, 1998

ABSTRACT

INCREASING THE EFFICIENCY IN ESTIMATING MULTILEVEL BERNOULLI MODELS

By Meng-Li Yang

Multilevel linear models are useful tools for educational research, where observations are often nested within clusters. If both the response and the random effects have normal distributions, maximum likelihood inferences for the fixed effects and the random-effects variances can be obtained analytically. For dichotomous responses such as dropping out or repeating a grade, logistic regression is often used to model the relationship between the responses and the covariates. In such multilevel Bernoulli models, however, estimation has been a problem. Because the responses have a Bernoulli distribution, which is not conjugate to the normal distribution of the random effects, rough approximation or numerical integration must be used to approximate the marginal distribution of the responses in order to obtain maximum likelihood estimates. Previously proposed strategies include the penalized quasi-likelihood approach, approximate maximum likelihood using Monte Carlo methods or the Gauss-Hermite quadrature technique, and the Bayes approach.

This dissertation proposes applying a Laplace approximation to the marginal distribution and then using approximate Fisher scoring to obtain maximum likelihood inferences for the parameters. To achieve this goal, the infinite multivariate Taylor series is first derived. Via this series, the Laplace approximation can be extended to any order and any dimension. Preliminary experiments showed, however, that approximation up to the sixth order produces sufficiently accurate estimates; the resulting program is therefore called Laplace6. Laplace6 is investigated on various simulated data sets by comparing its estimates with those of PQL, PQL2 and Gauss-Hermite quadrature. Laplace6 was found to have, in general, the highest efficiency among all the methods compared. The 1988 National Survey of Primary Education in Thailand was also analyzed using all of the above programs. Laplace6 estimates were found to be close to those produced by Gauss-Hermite quadrature using 30 and 40 quadrature points. In addition, to check the consistency of the approximate maximum likelihood estimates produced by Laplace6, 400 bivariate data sets were generated.
Half of the 400 had 200 clusters at the second level and the other half had 2000 clusters. Laplace6 estimates were found to be normally distributed with a small negative bias. Moreover, the variances of the estimates from the data sets with 2000 clusters were ten times smaller than those with 200 clusters. Thus, Laplace6 estimates were approximately consistent.

To my grandmother and my father

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to my advisor, Dr. Stephen W. Raudenbush, for his enormous amount of instruction, guidance, support and push throughout my doctoral program. The dissertation would not exist without him. My special thanks go warmly to Richard Congdon, our computer programmer, who is so efficient and so pleasant to work with. I also want to thank my committee, Dr. James Stapleton, Dr. Ken Frank and Dr. Alka Indurkhya, for their comments and suggestions. Dr. Stapleton's suggestions about improving the last part of the simulation study especially helped provide more useful information about the behavior of the approximation method proposed here. I have always wanted to say "thank you" to Dr. Shyh-Leh Chen, who is now an associate professor in Taiwan, and to my colleague, Matheos Yosef. Their kindness and knowledge helped me through the confusion and awkwardness of a freshman in applied statistics. I am indebted to my friend, Dr. Su-hao Tu, for her support and friendship over the past few years. I am grateful that my parents allowed me to spend so many years studying abroad. I don't think I can ever fully appreciate their love for me. I am thankful to God that I have wonderful sisters and a brother to grow up together with through happier times and harder times.

TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1: INTRODUCTION

CHAPTER 2: BACKGROUND AND SIGNIFICANCE
  Bayes Approach
  Full Likelihood Approach
  Quasi-Likelihood/Approximate Likelihood Approach

CHAPTER 3: METHOD
  Introduction
  The Logistic Model
    Submodel
  Likelihood
    Likelihood of the Submodel
  Approximation to the Log-Likelihood
    Approximation to the Submodel
  Approximate Fisher Scoring
    Implicit Differentiation
    Score Function of the Fixed Effects
    Score Function of the Variance-Covariance Components

CHAPTER 4: AN ILLUSTRATIVE EXAMPLE
  Introduction
  Thailand Data
  Results

CHAPTER 5: EVALUATION WITH SIMULATED DATA
  Introduction
  Univariate Random Effect Data Sets (Models 1-6)
    Results of Model 1
    Results of Model 2
    Results of Model 3
    Results of Model 4
    Results of Model 5
    Results of Model 6
    Relative Efficiencies Under Models 1 to 6
  Bivariate Data Sets
    Model 7 and its Results
    Model 8 and its Results

CHAPTER 6: CONCLUSION AND DISCUSSION

APPENDICES
  PRE-APPENDIX: Formulae and Lemmas for Appendices A to D
  APPENDIX A: Multivariate Taylor Series Expansion
  APPENDIX B: The Six Moments of a Multivariate Normal Distribution
  APPENDIX C: Proof of the Substitutions
  APPENDIX D: The Expectation of the Third Order Term Squared
  APPENDIX E: Computational Algorithm

LIST OF REFERENCES

LIST OF TABLES

Table 1 - Estimates of Thailand Data
Table 2 - Averages and Mean Squared Errors of Model 1
Table 3 - Averages of S.E.'s and S.D.'s of Estimates of Model 1
Table 4 - Averages and Mean Squared Errors of Model 2
Table 5 - Averages of S.E.'s and S.D.'s of Estimates of Model 2
Table 6 - Averages and Mean Squared Errors of Model 3
Table 7 - Averages of S.E.'s and S.D.'s of Estimates of Model 3
Table 8 - Averages and Mean Squared Errors of Model 4
Table 9 - Averages of S.E.'s and S.D.'s of Estimates of Model 4
Table 10 - Averages and Mean Squared Errors of Model 5
Table 11 - Averages of S.E.'s and S.D.'s of Estimates of Model 5
Table 12 - Averages and Mean Squared Errors of Model 6
Table 13 - Averages of S.E.'s and S.D.'s of Estimates of Model 6
Table 14 - Laplace6 Relative Efficiency Under Models with D00 = 1
Table 15 - Laplace6 Relative Efficiency Under Models with D00 = .25
Table 16 - Averages of Estimates Under Model 7
Table 17 - Mean Squared Errors of Estimates Under Model 7
Table 18 - Laplace6 Relative Efficiency Under Model 7
Table 19 - Standard Deviations of the Estimates Under Model 7
Table 20 - Averages of Standard Errors Under Model 7
Table 21 - Contrasts Between Different Cluster Sizes for Variance Components
Table 22 - Contrasts Between Different Cluster Sizes for Fixed Effects

Chapter 1

INTRODUCTION

People are concerned about the quality of education. They want to know what factors -- educational policies, programs, school environments, characteristics of teachers, or instructional approaches -- contribute to students' best learning. Educational research tries to find answers to these concerns. Large samples of students from different classrooms or schools are often drawn in order to support generalizable conclusions. However, students are nested within classrooms, classrooms are nested within schools, and schools within districts. Thus student learning is embedded within clusters, i.e., classrooms or schools. Because each cluster has a special climate due to its components, such as students and teachers, not only will individual students differ from one another, but there will also be group differences among clusters. Longitudinal data, with repeated measurements from the same person, can also be regarded as nested data: each person is considered a cluster, with observations of the same person more similar than observations from different people.

Bennett (1976) found a significant difference between two styles ('formal' and non-formal) of teaching when he used multiple regression analysis, ignoring the grouping of the students into classes. However, when Aitkin et al. (1981) analyzed the same data accounting for the nesting effect, the difference disappeared. In fact, with such a nested data structure, traditional statistical tools such as ANOVA and multiple regression either have great limitations or cannot work appropriately.
Instead, as a statistical tool developed specifically for nested designs, hierarchical or multilevel linear regression models, also referred to as random coefficient models (Rosenberg, 1973) or covariance components models (Dempster, Rubin, and Tsutakawa, 1981), allow each cluster to have its own intercept and regression coefficients. These intercepts and regression coefficients are often considered normally distributed, with means equal to the effects of cluster characteristics as specified in the higher-level, between-cluster model. That is, at the higher level these coefficients are each predicted by a set of cluster characteristics, and the errors of prediction are referred to as the random effects, normally distributed with mean zero and a variance-covariance matrix.

For example, in the first level of a two-level model, observation $j$ in the $i$th cluster, $y_{ij}$, is modeled by a vector of independent variables, $x_{ij}$:

$$y_{ij} = x_{ij}^T a_i + e_{ij},$$

where $a_i$ is a vector of regression coefficients (including the intercept) for the $i$th cluster, and $e_{ij}$ is the random term for $y_{ij}$, assumed normally distributed with mean 0 and variance $\sigma^2$, i.e., $e_{ij} \sim N(0, \sigma^2)$. In the second level, the $a_i$ from each cluster are collected together as the dependent variables to be predicted by cluster characteristics. Assume $a_i = [a_{i1}, a_{i2}]^T$. Then there are two equations for level 2:

$$a_{i1} = w_i^T \beta_1 + b_{i1}, \qquad a_{i2} = z_i^T \beta_2 + b_{i2},$$

where $w_i$ and $z_i$ are both vectors of characteristics of the $i$th cluster; $\beta_1$ and $\beta_2$ are vectors of regression coefficients for $a_{i1}$ and $a_{i2}$, respectively; and $b_{i1}$ and $b_{i2}$ are the random effects of cluster $i$ for $a_{i1}$ and $a_{i2}$. The vector $(b_{i1}, b_{i2})^T$ is assumed multivariate normally distributed with mean 0 and variance matrix $D$.

Such modeling gives educational researchers a clearer look at the mechanisms of interaction among students, teachers, schools, society and policies, and resolves the statistical difficulties encountered by ANOVA and multiple regression. The resulting variance components from each level additionally quantify the respective amounts of unexplained variation. Such models for continuous outcomes have been well developed by researchers using different estimation methods. For example, Raudenbush (1984) used the EM algorithm; Goldstein (1986) used iterative generalized least squares; de Leeuw and Kreft (1986) and Longford (1987) used Fisher scoring. A brief review reveals different applications of the models: school effectiveness as related to student achievement scores (Raudenbush and Bryk, 1986; Aitkin and Longford, 1986; Young, 1996), school effects and their stability (Raudenbush and Willms, 1991), how teacher interaction outside the classroom affects student learning (Louis, 1994), program evaluation (Marks, 1995; Lee, 1995; Mac Iver and Plank, 1996), adolescent attitude change toward deviance (Raudenbush and Chan, 1993), and the effects of ratee and rater race on performance evaluations (Waldman and Avolio, 1991). Goldstein (1987), Bryk and Raudenbush (1992) and Longford (1993) gave detailed accounts of applications and methodology of these models in social and educational contexts. Bock (1989) and Raudenbush and Willms (1991) provided applications in education.
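To make the nesting concrete before turning to binary outcomes, the following minimal sketch (illustrative values and variable names only, not from the dissertation's analyses) simulates the two-level normal model above and shows how the cluster random effects induce within-cluster correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

I, n = 150, 20                 # number of clusters, observations per cluster
beta0, beta1 = 0.5, 1.0        # level-2 coefficients
tau, sigma = 0.7, 1.0          # SDs of the random intercept and the level-1 error

w = rng.normal(size=I)                             # cluster characteristic w_i
b = tau * rng.normal(size=I)                       # random effects b_i ~ N(0, tau^2)
a0 = beta0 + beta1 * w + b                         # level 2: a_0i = beta0 + beta1*w_i + b_i
y = a0[:, None] + sigma * rng.normal(size=(I, n))  # level 1: y_ij = a_0i + e_ij

# Within-cluster dependence induced by b_i: the intraclass correlation
print("theoretical ICC:", tau**2 / (tau**2 + sigma**2))   # about 0.33
print("variance of the cluster means:", y.mean(axis=1).var())
```

With normal errors at both levels, the marginal distribution of $y$ stays normal, which is exactly why maximum likelihood is tractable here; the next paragraphs show why the same trick fails for binary outcomes.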
Nevertheless, although it is easy to conceive of a continuous, normal distribution for human characteristics such as intelligence, abilities and achievements, certain types of valuable educational data cannot be normally distributed. For example, whether a student has repeated a grade, dropped out of school, been admitted to college, or persisted in the pursuit of higher education might all be informative about the educational environment and policies. Since these outcomes are either yes or no (typically coded as 1 or 0), the usual linear model that assumes a normal random error fails. Instead, logistic regression, one of the generalized linear models (McCullagh and Nelder, 1989), is used to model such outcomes. The logistic model uses the logit (the log of the odds) of the dependent variable as the outcome variable. For example, to model the probability of dropping out for student $j$ in school $i$ using his or her personal information, $x_{ij}$, the model is

$$\log\!\left(\frac{p_{ij}}{1-p_{ij}}\right) = x_{ij}^T a_i,$$

where $p_{ij} = E(y_{ij} = 1 \mid b_i)$ is the conditional probability of dropping out, $y_{ij}$ being the observed datum with dropout = 1, non-dropout = 0.

For analyzing nested non-normal data such as binary data and count data, the multilevel generalized linear model with random effects is a natural outgrowth of both generalized linear models and hierarchical linear models. It incorporates generalized linear models into the framework of the hierarchical linear model: in the first level, the linear regression model is replaced by a generalized linear model, while the second level remains a linear model. For such models, researchers face the major task of obtaining a good estimate of the marginal distribution of the data. This marginal distribution is the integral of the product of the first-level likelihood, $f(y_i \mid b_i)$, and the assumed second-level distribution, $p(b_i)$, with respect to the random cluster effects, i.e.,

$$\int f(y_i \mid b_i)\, p(b_i)\, db_i.$$

In most cases, the second-level model is assumed to have a normal distribution. The difficulty in evaluating the integral arises from the fact that the normal distribution is not the conjugate prior for non-normal distributions, such as the binomial and Poisson, that are assumed in generalized linear models.

Many statisticians have studied estimation in generalized linear models with random effects, especially in the logistic model with random effects (e.g., Stiratelli, Laird, and Ware, 1984; Wong and Mason, 1985; Schall, 1991). What makes the logistic model with random effects interesting and difficult is that there is no closed form for the marginal distribution of the outcome (Zeger et al., 1988). As a result, estimates of the parameters, including the variance components and the fixed effects, have to be derived through approximation, if not through intensive Monte Carlo computation. Moreover, because the approximations are generally rough, the resulting estimates of the variance components are often subject to underestimation (Breslow and Clayton, 1993; Rodriguez and Goldman, 1995), which in turn leads to underestimation of the fixed effects coefficients, especially when the number of random effects increases with the sample size and the binomial denominator is small (Breslow and Lin, 1995).
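The integral above has no closed form even in the simplest random-intercept case, but it is easy to evaluate numerically for a single cluster. A minimal sketch (illustrative values only, not from the dissertation):

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

y = np.array([1, 0, 0, 1, 0])   # binary responses in one cluster
eta_fixed = -0.5                # fixed part x_ij' beta, held constant here
theta = 1.0                     # variance of the random intercept b_i

def integrand(b):
    p = 1.0 / (1.0 + np.exp(-(eta_fixed + b)))       # mu_ij = E(y_ij | b_i)
    lik = np.prod(p**y * (1 - p)**(1 - y))           # Bernoulli likelihood f(y_i | b_i)
    return lik * norm.pdf(b, scale=np.sqrt(theta))   # times the N(0, theta) density

marginal, _ = integrate.quad(integrand, -10, 10)     # f(y_i) = ∫ f(y_i|b) p(b) db
print(marginal)
```

Every cluster contributes one such integral, and the integrals must be re-evaluated at every iteration of any likelihood-based algorithm, which is why the choice of approximation matters so much.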
Breslow and Lin (1995) proposed first-order and second-order Laplace approximations of the marginal likelihood integral in the context of asymptotic bias correction, the second order corresponding to a fourth-order Taylor expansion and the first order to a second-order Taylor expansion. Incidentally, the first-order approximation has terms exactly the same as those in the penalized quasi-likelihood approximation (Breslow and Clayton, 1993). Therefore, the second-order approximation can be regarded as an improvement over penalized quasi-likelihood approaches such as Raudenbush's posterior modal estimation method (1993). Later, Lin and Breslow (1996) extended the approximation to the case of multiple random effects, although with zero correlation among them. This assumption of zero correlation among the random effects limits the use of the approximation in real-world research.

This dissertation builds on the work of Breslow and Lin (1995) and Lin and Breslow (1996), using Magnus's (1988) and Magnus and Neudecker's (1988) ideas as the toolbox. It generalizes the Laplace approximation to multiple random effects with a general variance-covariance matrix. Moreover, simple simulations showed that the contribution of the eighth-order term of the Taylor expansion to the approximate log-likelihood is negligible while those of the lower orders are not. Therefore, this dissertation expands the Taylor series up to the sixth order to obtain a satisfactory Laplace approximation to the log-likelihood. The purpose of the extension, however, is not bias correction. Treating the resulting approximate marginal likelihood as the exact likelihood, Fisher scoring is applied to estimate the fixed effects and the variance-covariance matrix of the random effects simultaneously.

Because the Laplace approximation is not stable in some cases (Breslow and Lin, 1995), to ensure convergence and more efficient estimation, the output from Raudenbush's posterior modal algorithm (1993) is used to supply starting values for the parameters to be estimated. The resulting estimation method is called "Laplace6". It is tested and evaluated with extensive simulation studies. Its efficiency, in terms of mean squared error, is also compared with those of Raudenbush's posterior modal estimation (1993), which is equivalent to penalized quasi-likelihood (PQL) (Breslow and Clayton, 1993); of Goldstein and Rasbash's (1996) second-order penalized quasi-likelihood (PQL2); and of approximate maximum likelihood using the Gauss-Hermite quadrature technique (Gauss), as implemented in Hedeker and Gibbons's MIXOR (1994, 1996).

To achieve the above goals, the dissertation will:

- derive the multivariate Taylor series;
- derive the six moments of the multivariate normal distribution;
- prove that the fourth and sixth moments can be substituted by simpler forms for use in the approximation;
- find the approximate log-likelihood using a Laplace sixth-order expansion of the joint density of the data and the random effects;
- find first derivatives of the approximate log-likelihood with respect to both the fixed effects and the random-effects variance matrix, in order to estimate both by approximate Fisher scoring;
- work out a computational algorithm, based on the derivatives and the approximate log-likelihood, for computer programming;
- analyze the data set of the 1988 National Survey of Primary Education in Thailand (Thailand data) using the above methods as an example;
- generate data sets with different models and parameters, the structure of which generally follows that of Rodriguez and Goldman (1995); and
- investigate the performance of the methods by analyzing the estimators in terms of their biases and mean squared errors.
According to experience so far, when the variance and especially the conditional expectation, $E(y_{ij} \mid b_i)$, are both very small (e.g., .25 and .01, respectively), all of the above methods except PQL have difficulty converging. On the other hand, data sets with extremely small random-effect variance and conditional expectation are not of much practical or theoretical interest anyway. Moreover, because of the symmetry of probability in dichotomous situations, a conditional expectation above .5 is equivalent to one below .5. Therefore, I limit the range of the conditional expectation to (.1, .3) and the variance to (.25, 2). In addition, for a single-variance-component model, the within-group sample size will be around 20 and the between-group sample size around 150. For a model with a variance-covariance matrix, even larger within- and between-group sample sizes are necessary.

Chapter 2

BACKGROUND AND SIGNIFICANCE

Among the generalized linear models with random effects, the logistic model with random effects especially poses numerical difficulties. The obstacle occurs when the marginal likelihood is needed for estimation of the parameters. The marginal likelihood is obtained by integrating the random effects out of the joint likelihood of the data and the random effects. In the logistic model with random effects, the data have a Bernoulli distribution, while the random effects are usually assumed to have a multivariate normal distribution. Moreover, while the conditional expectation of the response, $\mu_{ij} = E(y_{ij} = 1 \mid b_i)$, and the sum of the fixed and random effects are linked by a canonical link function (McCullagh and Nelder, 1989), the marginal expectation, $E(y_{ij})$, is not: researchers cannot find an exact closed-form relationship between the logit link and the marginal expectation (Zeger et al., 1988). Hence there have been different approaches to estimation. A brief review of the various approaches highlights the difficulty.

Bayes Approach

Zeger and Karim (1991) used the Bayesian paradigm, applying the Gibbs sampler technique (Geman and Geman, 1984; Gelfand et al., 1990; Gelfand and Smith, 1990) to find the posterior distributions of the parameters and the random effects in the context of generalized linear models with random effects. The strength of the Bayes approach lies in its flexibility in assessing the uncertainty in the random effects and in functions of model parameters (Breslow and Clayton, 1993). The greatest advantage of the Gibbs sampler is its ease of implementation. However, it is computationally intensive. Moreover, Hobert and Casella (1996) and Natarajan and McCulloch (1995) found that for models with random effects the posterior distribution of the parameters may not exist under diffuse priors, that this problem may go undetected during computation, and that wrong estimates can result (McCulloch, 1997).

Full Likelihood Approach

Anderson and Aitkin (1985), Hedeker and Gibbons (1994) and McCulloch (1997) approached the problem with a full likelihood approach. Anderson and Aitkin (1985) used Gaussian quadrature to approximate the integral in maximum likelihood estimation for the logistic model with a single random effect. Hedeker and Gibbons (1994) also used the Gauss-Hermite quadrature technique to find marginal maximum likelihood estimators in ordinal regression models with multiple random effects.
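A minimal sketch of the quadrature idea for a single random intercept (illustrative values, not from the dissertation): with nodes $t_k$ and weights $w_k$ from the Gauss-Hermite rule, $\int f(y_i \mid b)\,\phi(b; 0, \theta)\,db \approx \pi^{-1/2}\sum_k w_k\, f(y_i \mid b = \sqrt{2\theta}\, t_k)$.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

y = np.array([1, 0, 0, 1, 0])   # one cluster's binary responses
eta_fixed, theta = -0.5, 1.0    # fixed part of the linear predictor; random-effect variance

def cond_lik(b):
    p = 1.0 / (1.0 + np.exp(-(eta_fixed + b)))
    return np.prod(p**y * (1 - p)**(1 - y))

nodes, weights = hermgauss(20)                       # 20 quadrature points
b_vals = np.sqrt(2 * theta) * nodes                  # change of variable b = sqrt(2*theta)*t
vals = np.array([cond_lik(b) for b in b_vals])
print(np.sum(weights * vals) / np.sqrt(np.pi))       # matches the quad() value above
```

With $q$ correlated random effects the same rule needs a $q$-dimensional product grid, so the number of evaluations grows as $20^q$; this is the exponential cost discussed next.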
The advantage of the Gaussian quadrature technique is that the precision of the estimation can be improved by increasing the number of quadrature points. However, as the number of random effects increases, the number of quadrature points that must be summed over increases exponentially, and so does the computational time. Bock, Gibbons and Muraki (1988) pointed out, though, that the number of points per random effect can be reduced as the number of random effects increases, without hurting the accuracy of the approximation.

McCulloch (1997) adapted the Monte Carlo version of the EM algorithm (MCEM) (Tanner, 1993; Ledolter and Chan, 1994) for use in generalized linear models with random effects by incorporating a Metropolis-Hastings step. He also proposed a Monte Carlo version of the Newton-Raphson algorithm (MCNR), and improved the performance of the simulated maximum likelihood method developed by Geyer and Thompson (1992) and Gelfand and Carlin (1993) by preceding it with MCEM or MCNR. He compared these methods with the penalized quasi-likelihood approach on simulated data with large variance, which is known to be where penalized quasi-likelihood suffers serious downward bias. The three methods were found to perform better than penalized quasi-likelihood. However, the Monte Carlo methods share with the Gaussian quadrature technique the problem that estimation takes time. Besides, the convergence is stochastic: when the iterations converge, the convergent value varies randomly within a small range of the maximum likelihood estimate (Chan and Ledolter, 1994). This makes it difficult to decide whether MCEM or MCNR has really converged.
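The stochastic-convergence point is easy to see: even a plain Monte Carlo estimate of one cluster's marginal likelihood moves from seed to seed. A small sketch (same illustrative cluster as above):

```python
import numpy as np

y = np.array([1, 0, 0, 1, 0])
eta_fixed, theta = -0.5, 1.0

def mc_marginal(seed, draws=2000):
    rng = np.random.default_rng(seed)
    b = rng.normal(scale=np.sqrt(theta), size=draws)      # b ~ N(0, theta)
    p = 1.0 / (1.0 + np.exp(-(eta_fixed + b[:, None])))   # draws x n matrix of mu_ij
    return np.mean(np.prod(p**y * (1 - p)**(1 - y), axis=1))

print([round(mc_marginal(s), 5) for s in range(4)])
# Four seeds give four slightly different values; an algorithm whose objective
# is itself simulated therefore never settles on exactly one number.
```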
Quasi-Likelihood / Approximate Likelihood Approach

Goldstein (1991) and Longford (1984, 1988a) arrived at the same results via different routes (Goldstein, 1991; Rodriguez and Goldman, 1995). Goldstein (1991) avoided marginal likelihood estimation entirely: he used the linearized dependent variable (McCullagh and Nelder, 1989) to borrow the strength of normal-theory methodology and proposed iterative generalized least squares to do the computation. Longford (1984, 1988a) arrived at his approximation to the marginal likelihood integral by using a second-order Taylor expansion around zero of the random effects and by taking advantage of normal theory. Breslow and Clayton (1993) considered such approaches marginal quasi-likelihood (MQL), because the conditional expectation in both cases is expanded around zero for the random effects. Rodriguez and Goldman (1995) conducted simulations on both packages, as well as on Goldstein's second-order MQL (1991), and found the estimates to suffer substantial downward bias when the variances of the random effects are large.

Raudenbush (1993) extended Stiratelli, Laird, and Ware's (1984) posterior modal approach for binary responses to generalized linear models with random effects and improved the efficiency of the approach by adopting Schall's framework (1991). He also used the linearized dependent variable, with the conditional expectation expanded around the current estimates of both the random and fixed effects. As a result, although motivated in seemingly very different ways, the estimating equations used by Breslow and Clayton (1993) and Raudenbush (1993) are the same. Nevertheless, the fixed effects and variance estimates also suffer underestimation (Yang, 1994), which, though not as severe as in the case of MQL, can still be serious when the variance of the random effects is large.

Breslow and Clayton (1993) used Laplace's method to derive the score function of the penalized quasi-likelihood (PQL) derived by Stiratelli et al. (1984). For the estimation of the fixed and random effects, they modified Green's (1987) Fisher scoring for estimating equations so as to borrow the strength of the normal-theory linear model. For the estimation of the variance components, they derived estimating equations, again using normal theory, from the "REML version" of the profile likelihood of the approximate marginal likelihood, ignoring the dependence between the fixed effects and the variance. However, they showed PQL to be downward biased for estimates of both the fixed effects and the variance components (Breslow and Clayton, 1993).

In an attempt to asymptotically correct the biases of approximate estimators of the regression coefficients and the variance in generalized linear models with a single variance component, Breslow and Lin (1995) expanded the joint distribution of the data and the random effects in a Taylor series around the current estimates, to the second and to the fourth order, and then used Laplace's method to approximate the marginal likelihood. They termed the approximation up to the second-order Taylor expansion the "first-order Laplace approximation" and that up to the fourth-order expansion the "second-order Laplace approximation". They found the first-order Laplace approximation of the variance estimator to be seriously biased, while the second-order approximation was better. Lin and Breslow (1996) extended the approximation further to models with multiple components of dispersion, with zero correlations among them.

Given the computational burden and other problems of the Bayes and full likelihood approaches, the approximate/quasi-likelihood approach seems worth exploring further. The idea of Breslow and Lin (1995) and Lin and Breslow (1996) provides a way for the approximate/quasi-likelihood approach to better approximate the marginal likelihood and thus the estimation. Working in the approximate likelihood paradigm and aiming to reduce the underestimation, this dissertation extends their idea to the most general case, in which multilevel logistic models may have an arbitrary number of random effects with a general variance-covariance matrix.

Chapter 3

METHOD

Introduction

This chapter finds the approximate marginal log-likelihood using Laplace's method. It then finds the derivative of each term in the approximate log-likelihood in order to apply approximate Fisher scoring for the estimation of the fixed effects and the variance-covariance components of the random effects. The following is a list of all the formulae from Magnus and Neudecker (1988) and Magnus (1988) needed for the derivations; in the proofs of this chapter they are referred to only by these equation numbers. Here $A$, $B$ and $C$ denote constant matrices, $X$ a matrix of variables, $x$, $a$ and $b$ column vectors, $\otimes$ the Kronecker product, and $\mathrm{vec}$ the column-stacking operator.

$$\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}\,B \tag{F1}$$
$$\mathrm{vec}(A)^T \mathrm{vec}(B) = \mathrm{tr}(A^T B) \tag{F2}$$
$$(A \otimes B)(C \otimes D) = AC \otimes BD \tag{F3}$$
$$\mathrm{tr}(A \otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B) \tag{F4}$$
$$\mathrm{tr}(Axx^T) = x^T A x \tag{F5}$$
$$d\log|F| = \mathrm{tr}(F^{-1}\,dF) \tag{F6}$$
$$\mathrm{tr}(A\,dX) = (\mathrm{vec}\,A^T)^T\,\mathrm{vec}\,dX \tag{F7}$$
$$\mathrm{vec}\,d(AXB) = \mathrm{vec}\,A(dX)B = (B^T \otimes A)\,\mathrm{vec}\,dX \tag{F8}$$
$$\mathrm{vec}\,dX^{-1} = -\big((X^T)^{-1} \otimes X^{-1}\big)\,\mathrm{vec}\,dX \tag{F9}$$
$$[\mathrm{vec}(ABC)]^T = [(C^T \otimes A)\,\mathrm{vec}\,B]^T = (\mathrm{vec}\,B)^T(C^T \otimes A)^T = (\mathrm{vec}\,B)^T(C \otimes A^T) \tag{F10}$$
$$(b^T \otimes a^T) = (\mathrm{vec}(ab^T))^T \tag{F11}$$
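These identities are easy to spot-check numerically. The following sketch verifies F1 and F2 for random matrices, using the column-major vec convention of Magnus and Neudecker:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

vec = lambda M: M.reshape(-1, order="F")   # column-stacking vec operator

# F1: vec(ABC) = (C' ⊗ A) vec(B)
lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))                             # True

# F2: vec(A)' vec(B) = tr(A'B)
print(np.isclose(vec(A) @ vec(B), np.trace(A.T @ B)))    # True
```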
The Logistic Model

We consider dichotomous responses from individuals nested within groups:

$$y_i = \mu_i + e_i,$$

where $y_i$ is an $n_i \times 1$ response vector of 1's and 0's for cluster $i$, with elements $y_{ij}$, $i$ ranging from 1 to $I$ and $j$ from 1 to $n_i$; $e_i$ is an $n_i \times 1$ column vector of error terms; and $\mu_i$ is an $n_i \times 1$ column vector, the conditional mean of the $i$th cluster given $b_i$, each element being

$$\mu_{ij} = E(y_{ij} = 1 \mid b_i) = \frac{1}{1 + \exp(-X_{ij}^T\beta - Z_{ij}^T b_i)} = \frac{1}{1 + \exp(-\eta_{ij})}. \tag{1}$$

Thus each element $\mu_{ij}$ of $\mu_i$ is related to the corresponding element $\eta_{ij}$ of $\eta_i$ through the link function

$$\eta_{ij} = g(\mu_{ij}) = \log\!\left(\frac{\mu_{ij}}{1-\mu_{ij}}\right). \tag{2}$$

Here $\eta_i = X_i\beta + Z_i b_i$ is a column vector, the linear predictor of the $i$th cluster; $X_i$ is an $n_i \times p$ design matrix for the fixed effects $\beta$, a $p \times 1$ vector; and $Z_i$ is an $n_i \times q$ design matrix for the random effects $b_i$, a $q \times 1$ column vector with distribution $N(0, D)$, where $D$ is the $q \times q$ variance-covariance matrix of $b$.

Submodel

If $D = \theta$, a scalar, then $b_i$ is a scalar and $Z_i$ becomes an $n_i \times 1$ vector of 1's. As a result, the conditional expectation becomes $\mu_{ij} = E(y_{ij} = 1 \mid b_i) = 1/\{1 + \exp(-X_{ij}^T\beta - b_i)\}$ and $\eta_{ij} = X_{ij}^T\beta + b_i$.

Likelihood

For each observation $j$ in the $i$th cluster, the conditional density of $y_{ij}$ given $b_i$ is

$$f(y_{ij} \mid b_i) = \mu_{ij}^{y_{ij}}(1-\mu_{ij})^{1-y_{ij}}. \tag{3}$$

Then, for the $i$th cluster, the conditional density is

$$f(y_i \mid b_i) = \prod_{j=1}^{n_i}\mu_{ij}^{y_{ij}}(1-\mu_{ij})^{1-y_{ij}}. \tag{4}$$

The log of Equation 4 is $l_i = y_i^T\eta_i + s_i$, where $s_i = \sum_{j=1}^{n_i}s_{ij}$, with $s_{ij} = \log(1-\mu_{ij})$. Thus, the conditional log-likelihood of all the clusters is

$$\sum_{i=1}^{I}l_i = \sum_{i=1}^{I}y_i^T\eta_i + \sum_{i=1}^{I}s_i. \tag{5}$$

To get the marginal likelihood, we wish to integrate $b_i$ out of the joint density of $y_i$ and $b_i$:

$$L = \prod_{i=1}^{I}\int f(y_i \mid b_i)\,p(b_i)\,db_i = \prod_{i=1}^{I}(2\pi)^{-q/2}|D|^{-1/2}\int\exp\!\Big(l_i - \tfrac{1}{2}b_i^T D^{-1}b_i\Big)\,db_i. \tag{6}$$

Likelihood of the Submodel

For $D = \theta$, the joint density of $y_i$ and $b_i$ gives

$$L = \prod_{i=1}^{I}\int f(y_i \mid b_i)\,p(b_i)\,db_i = \prod_{i=1}^{I}(2\pi\theta)^{-1/2}\int\exp\!\Big(l_i - \frac{b_i^2}{2\theta}\Big)\,db_i.$$

Approximation to the Log-Likelihood

Direct integration of Equation 6 is impossible and numerical integration cumbersome. Breslow and Lin (1995) used Laplace's method to approximate the integral, limiting the variance of the random effect(s) to either a scalar (Breslow and Lin, 1995) or a diagonal matrix (Lin and Breslow, 1996). That is, they first approximated the integrand of Equation 6 by a Taylor expansion. Then, regarding the second-order term of the Taylor series as the kernel of a normal distribution, $N(\tilde{b}_i, -(\tilde{l}_i^{(2)} - D^{-1})^{-1})$, with $\tilde{l}_i^{(2)}$ being the second derivative of $l_i$ evaluated at $\tilde{b}_i$, they took the expectation of the other terms of the series under this new normal density and approximated the integral as a sum of moments (up to the fourth moment). Note that $Q_i = -(\tilde{l}_i^{(2)} - D^{-1})^{-1}$ is also the posterior variance of $b_i$ given $y_i$, $D$ and $\beta$ if the linearized dependent variable (McCullagh and Nelder, 1989), $y_i^*$, is assumed normally distributed given $b_i$, i.e., $y_i^* \mid b_i \sim N(X_i\beta + Z_i b_i,\, W_i^{-1})$, with $W_i$ as defined below.

In this section we generalize their approach to allow covariances among the random effects. In addition, we improve the accuracy of the approximation by including terms up to the sixth moment.
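For intuition, the following sketch carries out the basic (second-order) Laplace step for the scalar submodel: maximize $l_i(b) - b^2/(2\theta)$ over $b$, then integrate the fitted Gaussian. Values are illustrative; the dissertation's method adds the higher-order correction terms derived next.

```python
import numpy as np
from scipy import integrate

y = np.array([1, 0, 0, 1, 0])
eta_fixed, theta = -0.5, 1.0

def h(b):   # l_i(b) - b^2/(2 theta), the exponent in Equation 6
    eta = eta_fixed + b
    mu = 1/(1 + np.exp(-eta))
    return np.sum(y*eta + np.log(1 - mu)) - b**2/(2*theta)

# Newton iterations for the maximizer b~ (posterior mode)
b = 0.0
for _ in range(25):
    mu = 1/(1 + np.exp(-(eta_fixed + b)))
    grad = np.sum(y - mu) - b/theta              # l'(b) - b/theta
    hess = -len(y)*mu*(1 - mu) - 1/theta         # l''(b) - 1/theta
    b -= grad/hess

q = -1.0/hess                                    # Q_i, the Laplace variance
laplace = np.sqrt(q/theta) * np.exp(h(b))        # (2πθ)^(-1/2) e^{h(b~)} (2πq)^(1/2)
exact = integrate.quad(lambda t: np.exp(h(t)), -10, 10)[0] / np.sqrt(2*np.pi*theta)
print(laplace, exact)   # close already; the higher-order terms shrink the gap further
```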
First, we approximate $l_i - \tfrac{1}{2}b_i^T D^{-1}b_i$, the exponent of the integrand, by the multivariate Taylor expansion around a current estimate $\tilde{b}_i$ of $b_i$ that maximizes the exponent (see Appendix A: Multivariate Taylor Series Expansion):

$$
l_i - \tfrac{1}{2}b_i^T D^{-1}b_i \approx \tilde{l}_i - \tfrac{1}{2}\tilde{b}_i^T D^{-1}\tilde{b}_i
+ (\tilde{l}_i^{(1)} - \tilde{b}_i^T D^{-1})(b_i - \tilde{b}_i)
+ \tfrac{1}{2}(b_i - \tilde{b}_i)^T(\tilde{l}_i^{(2)} - D^{-1})(b_i - \tilde{b}_i)
$$
$$
+ \tfrac{1}{3!}\big[\otimes^2(b_i - \tilde{b}_i)^T\big]\tilde{l}_i^{(3)}(b_i - \tilde{b}_i)
+ \tfrac{1}{4!}\big[\otimes^3(b_i - \tilde{b}_i)^T\big]\tilde{l}_i^{(4)}(b_i - \tilde{b}_i)
+ \tfrac{1}{5!}\big[\otimes^4(b_i - \tilde{b}_i)^T\big]\tilde{l}_i^{(5)}(b_i - \tilde{b}_i)
+ \tfrac{1}{6!}\big[\otimes^5(b_i - \tilde{b}_i)^T\big]\tilde{l}_i^{(6)}(b_i - \tilde{b}_i),
$$

where $\otimes^k$ denotes the $k$-fold Kronecker product and the derivative arrays are as follows.

1. $\tilde{l}_i$ is the value of $l_i = y_i^T\eta_i + s_i$ evaluated at the current estimates $\beta$, $D$ and $\tilde{b}_i$.

2. $\tilde{l}_i^{(1)} = \dfrac{\partial l_i}{\partial b_i^T} = (y_i - \tilde{\mu}_i)^T Z_i = (y_i^* - \tilde{\eta}_i)^T W_i Z_i$, so that $\tilde{l}_i^{(1)T} = Z_i^T W_i(y_i^* - \tilde{\eta}_i)$. Here $W_i = \mathrm{diag}(w_{ij})$, with $w_{ij} = \mu_{ij}(1-\mu_{ij})$, the first derivative of $\mu_{ij}$ with respect to $\eta_{ij}$, and $y_i^* = W_i^{-1}(y_i - \tilde{\mu}_i) + \tilde{\eta}_i$ is the 'linearized dependent variable' (McCullagh and Nelder, 1989).

3. $\tilde{l}_i^{(2)} = \dfrac{\partial^2 l_i}{\partial b_i\,\partial b_i^T} = -Z_i^T W_i Z_i$.

4. $\tilde{l}_i^{(3)} = \dfrac{\partial\,\mathrm{vec}\,\tilde{l}_i^{(2)}}{\partial b_i^T} = -(Z_i^T \otimes Z_i^T)A_i Z_i$, where $A_i = \sum_j a_{ij}(E_{ij}E_{ij}^T \otimes E_{ij})$, an $n_i^2 \times n_i$ matrix, with $a_{ij} = \mu_{ij}(1-\mu_{ij})(1-2\mu_{ij}) = w_{ij}(1-2\mu_{ij})$, the second derivative of $\mu_{ij}$ with respect to $\eta_{ij}$, and $E_{ij}$ an $n_i \times 1$ vector whose $j$th entry is 1 and whose other entries are 0.

5. $\tilde{l}_i^{(4)} = \dfrac{\partial\,\mathrm{vec}\,\tilde{l}_i^{(3)}}{\partial b_i^T} = -(Z_i^T \otimes Z_i^T \otimes Z_i^T)G_i Z_i$, where $G_i = \sum_j g_{ij}(E_{ij}E_{ij}^T \otimes E_{ij} \otimes E_{ij})$, an $n_i^3 \times n_i$ matrix, with $g_{ij} = w_{ij}(1-6w_{ij})$, the third derivative of $\mu_{ij}$ with respect to $\eta_{ij}$.

6. $\tilde{l}_i^{(5)} = \dfrac{\partial\,\mathrm{vec}\,\tilde{l}_i^{(4)}}{\partial b_i^T} = -(Z_i^T \otimes Z_i^T \otimes Z_i^T \otimes Z_i^T)H_i Z_i$, where $H_i = \sum_j h_{ij}(E_{ij}E_{ij}^T \otimes E_{ij} \otimes E_{ij} \otimes E_{ij})$, an $n_i^4 \times n_i$ matrix, with $h_{ij} = a_{ij}(1-12w_{ij})$, the fourth derivative of $\mu_{ij}$ with respect to $\eta_{ij}$.

7. $\tilde{l}_i^{(6)} = \dfrac{\partial\,\mathrm{vec}\,\tilde{l}_i^{(5)}}{\partial b_i^T} = -(Z_i^T \otimes Z_i^T \otimes Z_i^T \otimes Z_i^T \otimes Z_i^T)F_i Z_i$, where $F_i = \sum_j f_{ij}(E_{ij}E_{ij}^T \otimes E_{ij} \otimes E_{ij} \otimes E_{ij} \otimes E_{ij})$, an $n_i^5 \times n_i$ matrix, with $f_{ij} = g_{ij}(1-12w_{ij}) - 12a_{ij}^2$, the fifth derivative of $\mu_{ij}$ with respect to $\eta_{ij}$.

All of the above are derived using Equation F8, regarding the matrix being differentiated as $X$ and the matrices on its two sides as $A$ and $B$.
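The scalar weights $w_{ij}$, $a_{ij}$, $g_{ij}$, $h_{ij}$ and $f_{ij}$ are just successive derivatives of the logistic mean, and the closed forms above can be checked symbolically. A small sketch:

```python
import sympy as sp

eta = sp.symbols("eta")
mu = 1 / (1 + sp.exp(-eta))
w = mu * (1 - mu)

exact = [sp.diff(mu, eta, k) for k in range(1, 6)]   # mu', mu'', ..., mu^(5)
a = w * (1 - 2*mu)                                   # claimed mu''
g = w * (1 - 6*w)                                    # claimed mu'''
h = a * (1 - 12*w)                                   # claimed mu''''
f = g * (1 - 12*w) - 12*a**2                         # claimed mu^(5)

for claimed, deriv in zip([w, a, g, h, f], exact):
    print(sp.simplify(claimed - deriv) == 0)         # True, five times
```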
We will now show that the marginal log-likelihood is approximately

$$\log(L) \approx L_6 = \sum_{i=1}^{I}\Big\{-\tfrac{1}{2}\log|D| - \tfrac{1}{2}\log|Q_i^{-1}| + \tilde{l}_i - \tfrac{1}{2}\tilde{b}_i^T D^{-1}\tilde{b}_i + \log M_i\Big\}, \tag{7}$$

where

$$Q_i^{-1} = -(\tilde{l}_i^{(2)} - D^{-1}) = D^{-1} + Z_i^T W_i Z_i, \qquad B_{ij} = Z_{ij}^T Q_i Z_{ij},$$

and

$$M_i = 1 - \tfrac{1}{8}\sum_j g_{ij}B_{ij}^2 - \tfrac{1}{48}\sum_j f_{ij}B_{ij}^3 + \tfrac{1}{72}k_i^T Q_i k_i, \qquad k_i = \sum_j a_{ij}B_{ij}Z_{ij}.$$

Proof. Substituting the integrand by its Taylor expansion, the marginal likelihood becomes

$$
L \approx \prod_{i=1}^{I}(2\pi)^{-q/2}|D|^{-1/2}\int\exp\Big\{\tilde{l}_i - \tfrac{1}{2}\tilde{b}_i^T D^{-1}\tilde{b}_i + (\tilde{l}_i^{(1)} - \tilde{b}_i^T D^{-1})(b_i - \tilde{b}_i) + \tfrac{1}{2}(b_i - \tilde{b}_i)^T(\tilde{l}_i^{(2)} - D^{-1})(b_i - \tilde{b}_i)
$$
$$
+ \tfrac{1}{3!}\big[\otimes^2(b_i - \tilde{b}_i)^T\big]\tilde{l}_i^{(3)}(b_i - \tilde{b}_i) + \tfrac{1}{4!}\big[\otimes^3(b_i - \tilde{b}_i)^T\big]\tilde{l}_i^{(4)}(b_i - \tilde{b}_i)
+ \tfrac{1}{5!}\big[\otimes^4(b_i - \tilde{b}_i)^T\big]\tilde{l}_i^{(5)}(b_i - \tilde{b}_i) + \tfrac{1}{6!}\big[\otimes^5(b_i - \tilde{b}_i)^T\big]\tilde{l}_i^{(6)}(b_i - \tilde{b}_i)\Big\}\,db_i, \tag{8}
$$

where $\tilde{b}_i$ is the maximizing value, i.e., $\tilde{b}_i$ solves $\tilde{l}_i^{(1)T} - D^{-1}\tilde{b}_i = 0$. That is, $\tilde{b}_i = D\,\tilde{l}_i^{(1)T} = D Z_i^T W_i(y_i^* - X_i\beta - Z_i\tilde{b}_i)$. To find $\tilde{b}_i$, collect $\tilde{b}_i$ on the left-hand side: $(I + D Z_i^T W_i Z_i)\tilde{b}_i = D Z_i^T W_i(y_i^* - X_i\beta)$. Thus

$$\tilde{b}_i = (I + D Z_i^T W_i Z_i)^{-1}D Z_i^T W_i(y_i^* - X_i\beta) = Q_i Z_i^T W_i(y_i^* - X_i\beta).$$

Consequently, the first-order term $(\tilde{l}_i^{(1)} - \tilde{b}_i^T D^{-1})(b_i - \tilde{b}_i)$ vanishes, and the second-order term is retained as the kernel of the new normal density $N(\tilde{b}_i, Q_i)$.

Next we consider the higher-order terms. Let $R$ denote the sum of the third- through sixth-order terms. Expanding $\exp(R) \approx 1 + R + \tfrac{1}{2}R^2$ and discarding contributions of higher than sixth order leaves $1 + R$ plus the square of the third-order term:

$$\exp(R) \approx 1 + \tfrac{1}{3!}\big[\otimes^2(b_i-\tilde{b}_i)^T\big]\tilde{l}_i^{(3)}(b_i-\tilde{b}_i) + \tfrac{1}{4!}\big[\otimes^3(b_i-\tilde{b}_i)^T\big]\tilde{l}_i^{(4)}(b_i-\tilde{b}_i)$$
$$+ \tfrac{1}{5!}\big[\otimes^4(b_i-\tilde{b}_i)^T\big]\tilde{l}_i^{(5)}(b_i-\tilde{b}_i) + \tfrac{1}{6!}\big[\otimes^5(b_i-\tilde{b}_i)^T\big]\tilde{l}_i^{(6)}(b_i-\tilde{b}_i) + \tfrac{1}{2}\Big(\tfrac{1}{3!}\big[\otimes^2(b_i-\tilde{b}_i)^T\big]\tilde{l}_i^{(3)}(b_i-\tilde{b}_i)\Big)^2.$$

Then Equation 8 becomes

$$L \approx \prod_{i=1}^{I}(2\pi)^{-q/2}|D|^{-1/2}\exp\big(\tilde{l}_i - \tfrac{1}{2}\tilde{b}_i^T D^{-1}\tilde{b}_i\big)\int\exp\Big[-\tfrac{1}{2}(b_i-\tilde{b}_i)^T Q_i^{-1}(b_i-\tilde{b}_i)\Big]\exp(R)\,db_i. \tag{9}$$

Using Laplace's approximation method, multiplied and divided by the normalizing constant $(2\pi)^{q/2}|Q_i|^{1/2}$, the integral in Equation 9 becomes the expectation of 1 plus the third-, fourth-, fifth- and sixth-order moments of a multivariate normal distribution with mean 0 and variance $Q_i$, contracted against $\tilde{l}_i^{(3)}$, $\tilde{l}_i^{(4)}$, $\tilde{l}_i^{(5)}$, $\tilde{l}_i^{(6)}$ and $(\tilde{l}_i^{(3)})^2$, respectively. The odd moments of a normal distribution are 0. Thus the likelihood can be approximated as (see Appendix B: The Six Moments of a Multivariate Normal Distribution; Appendix C: Proof of the Substitutions; and Appendix D: The Expectation of the Third Order Term Squared)

$$
L \approx \prod_{i=1}^{I}|D|^{-1/2}|Q_i|^{1/2}\exp\big(\tilde{l}_i - \tfrac{1}{2}\tilde{b}_i^T D^{-1}\tilde{b}_i\big)\Big(1 + \tfrac{1}{8}\{\mathrm{vec}[Q_i \otimes Q_i]\}^T\mathrm{vec}\,\tilde{l}_i^{(4)} + \tfrac{1}{48}\{\mathrm{vec}[Q_i \otimes Q_i \otimes Q_i]\}^T\mathrm{vec}\,\tilde{l}_i^{(6)} + \tfrac{1}{72}k_i^T Q_i k_i\Big).
$$

The array pre-multiplying $\mathrm{vec}\,\tilde{l}_i^{(4)}$ results from $\tfrac{1}{4!}$ times the fourth moment of the normal distribution; the array pre-multiplying $\mathrm{vec}\,\tilde{l}_i^{(6)}$ from $\tfrac{1}{6!}$ times the sixth moment; the derivation of the last term is in Appendix D. Therefore the approximate log-likelihood is

$$
\log(L) \approx \sum_{i=1}^{I}\Big\{-\tfrac{1}{2}\log|D| - \tfrac{1}{2}\log|Q_i^{-1}| + \tilde{l}_i - \tfrac{1}{2}\tilde{b}_i^T D^{-1}\tilde{b}_i
+ \log\big(1 + \tfrac{1}{8}\{\mathrm{vec}[Q_i\otimes Q_i]\}^T\mathrm{vec}\,\tilde{l}_i^{(4)} + \tfrac{1}{48}\{\mathrm{vec}[Q_i\otimes Q_i\otimes Q_i]\}^T\mathrm{vec}\,\tilde{l}_i^{(6)} + \tfrac{1}{72}k_i^T Q_i k_i\big)\Big\}. \tag{10}
$$

Substituting $\tilde{l}_i^{(4)}$ and $\tilde{l}_i^{(6)}$ into Equation 10 gives

$$
\log(L) \approx \sum_{i=1}^{I}\Big\{-\tfrac{1}{2}\log|D| - \tfrac{1}{2}\log|Q_i^{-1}| + \tilde{l}_i - \tfrac{1}{2}\tilde{b}_i^T D^{-1}\tilde{b}_i
+ \log\big(1 - \tfrac{1}{8}\{\mathrm{vec}[Q_i\otimes Q_i]\}^T\mathrm{vec}[(Z_i^T\otimes Z_i^T\otimes Z_i^T)G_iZ_i]
$$
$$
- \tfrac{1}{48}\{\mathrm{vec}[Q_i\otimes Q_i\otimes Q_i]\}^T\mathrm{vec}[(Z_i^T\otimes Z_i^T\otimes Z_i^T\otimes Z_i^T\otimes Z_i^T)F_iZ_i] + \tfrac{1}{72}k_i^T Q_i k_i\big)\Big\}. \tag{11}
$$
Now we simplify the fourth-order term in Equation 11. Let $e_i = \{\mathrm{vec}[Q_i \otimes Q_i]\}^T\mathrm{vec}[(Z_i^T\otimes Z_i^T\otimes Z_i^T)G_iZ_i]$, ignoring the constant $-\tfrac{1}{8}$. First, by Equation F1, regarding $G_i$ as the $B$ in the formula,

$$
\mathrm{vec}[(Z_i^T\otimes Z_i^T\otimes Z_i^T)G_iZ_i]
= \mathrm{vec}\Big[(Z_i^T\otimes Z_i^T\otimes Z_i^T)\sum_j g_{ij}(E_{ij}E_{ij}^T\otimes E_{ij}\otimes E_{ij})(Z_i\otimes 1\otimes 1)\Big]
= \sum_j g_{ij}\,\mathrm{vec}\big[Z_i^TE_{ij}E_{ij}^TZ_i\otimes Z_i^TE_{ij}\otimes Z_i^TE_{ij}\big], \tag{12}
$$

where 1 is the scalar one and $Z_{ij} = Z_i^TE_{ij}$ is the $q\times1$ vector of random-effect design values for person $j$ in group $i$. Since Equation 12 is the vectorization of Kronecker products of the same vector $Z_{ij}$, it can be rewritten (by F11; see also PA-12 in the Pre-Appendix) as

$$\mathrm{vec}[(Z_i^T\otimes Z_i^T\otimes Z_i^T)G_iZ_i] = \sum_j g_{ij}\,\mathrm{vec}(Z_{ij}Z_{ij}^T\otimes Z_{ij}Z_{ij}^T). \tag{13}$$

As a result,

$$e_i = \{\mathrm{vec}[Q_i\otimes Q_i]\}^T\sum_j g_{ij}\,\mathrm{vec}(Z_{ij}Z_{ij}^T\otimes Z_{ij}Z_{ij}^T). \tag{14}$$

Since $Q_i$ is symmetric, so is $Q_i\otimes Q_i$. Applying Equation F2, Equation 14 becomes $e_i = \sum_j g_{ij}\,\mathrm{tr}\{[Q_i\otimes Q_i](Z_{ij}Z_{ij}^T\otimes Z_{ij}Z_{ij}^T)\}$. Furthermore, by Equations F3 and F4, $e_i$ simplifies to

$$e_i = \sum_j g_{ij}\{\mathrm{tr}[Q_iZ_{ij}Z_{ij}^T]\}^2. \tag{15}$$

To remove the trace, we use the fact that $Z_{ij}$ is a $q\times1$ vector together with Equation F5, so that finally

$$e_i = \sum_j g_{ij}[Z_{ij}^TQ_iZ_{ij}]^2 = \sum_j g_{ij}B_{ij}^2. \tag{16}$$

The sixth-order term is simplified in exactly the same way. That is,

$$\mathrm{vec}[(Z_i^T\otimes Z_i^T\otimes Z_i^T\otimes Z_i^T\otimes Z_i^T)F_iZ_i] = \sum_j f_{ij}\,\mathrm{vec}(Z_{ij}Z_{ij}^T\otimes Z_{ij}Z_{ij}^T\otimes Z_{ij}Z_{ij}^T),$$

so that, writing $q_i$ for the sixth-order contribution,

$$q_i = \{\mathrm{vec}[Q_i\otimes Q_i\otimes Q_i]\}^T\sum_j f_{ij}\,\mathrm{vec}(Z_{ij}Z_{ij}^T\otimes Z_{ij}Z_{ij}^T\otimes Z_{ij}Z_{ij}^T) = \sum_j f_{ij}(Z_{ij}^TQ_iZ_{ij})^3 = \sum_j f_{ij}B_{ij}^3. \tag{17}$$

Putting $-\tfrac{1}{8}e_i$ and $-\tfrac{1}{48}q_i$ back into the approximate log-likelihood, it becomes

$$\log(L) \approx L_6 = \sum_{i=1}^{I}\Big\{-\tfrac{1}{2}\log|D| - \tfrac{1}{2}\log|Q_i^{-1}| + \tilde{l}_i - \tfrac{1}{2}\tilde{b}_i^TD^{-1}\tilde{b}_i + \log\big(1 - \tfrac{1}{8}\sum_j g_{ij}B_{ij}^2 - \tfrac{1}{48}\sum_j f_{ij}B_{ij}^3 + \tfrac{1}{72}k_i^TQ_ik_i\big)\Big\}. \tag{18}$$

Hence the proof is finished.

Approximation to the Submodel

For $D = \theta$, the log-likelihood of the data is obtained by substituting $\theta$ for $D$ and the scalar 1 for each $Z_{ij}$. Then $B_{ij} = Q_i = (\theta^{-1} + \sum_j w_{ij})^{-1}$, $k_i = Q_i\sum_j a_{ij}$, and

$$\log(L) \approx L_6 = \sum_{i=1}^{I}\Big\{-\tfrac{1}{2}\log\theta - \tfrac{1}{2}\log\big(\theta^{-1} + \sum_j w_{ij}\big) + \tilde{l}_i - \frac{\tilde{b}_i^2}{2\theta} + \log\Big(1 - \tfrac{1}{8}Q_i^2\sum_j g_{ij} - \tfrac{1}{48}Q_i^3\sum_j f_{ij} + \tfrac{1}{72}Q_i^3\big(\sum_j a_{ij}\big)^2\Big)\Big\}.$$
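The submodel formula is simple enough to code directly. The sketch below (illustrative data; a univariate transcription of Equation 18, not the dissertation's program) computes one cluster's Laplace6 contribution and compares it with adaptive quadrature:

```python
import numpy as np
from scipy import integrate

y = np.array([1, 0, 0, 1, 0])
eta_fixed, theta = -0.5, 1.0

# Posterior mode b~ by Newton's method
b = 0.0
for _ in range(25):
    m = 1/(1 + np.exp(-(eta_fixed + b)))
    b -= (np.sum(y - m) - b/theta) / (-len(y)*m*(1 - m) - 1/theta)

mu = np.full(len(y), 1/(1 + np.exp(-(eta_fixed + b))))
w = mu*(1 - mu)                        # w_ij
a = w*(1 - 2*mu)                       # a_ij
g = w*(1 - 6*w)                        # g_ij
f = g*(1 - 12*w) - 12*a**2             # f_ij
Q = 1.0/(1/theta + w.sum())            # Q_i ; B_ij = Q_i in the submodel

l_tilde = np.sum(y*(eta_fixed + b) + np.log(1 - mu))
M = 1 - Q**2*g.sum()/8 - Q**3*f.sum()/48 + Q**3*a.sum()**2/72
L6 = (-0.5*np.log(theta) - 0.5*np.log(1/theta + w.sum())
      + l_tilde - b**2/(2*theta) + np.log(M))

def joint(t):   # exp(l_i(t) - t^2/(2 theta))
    m = 1/(1 + np.exp(-(eta_fixed + t)))
    return np.exp(np.sum(y*(eta_fixed + t) + np.log(1 - m)) - t**2/(2*theta))

exact = np.log(integrate.quad(joint, -10, 10)[0] / np.sqrt(2*np.pi*theta))
print(L6, exact)   # the two per-cluster log-likelihood values should agree closely
```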
Approximate Fisher Scoring

This section finds the first partial derivatives of the terms in the approximate log-likelihood, $L_6$, with respect to the fixed effects $\beta$ and the variance components $D$ separately, and then merges the derivatives into one score-function vector, $S_i$, for each group $i$, so as to form the approximate Fisher scoring step (Green, 1984)

$$\hat{\theta}_{\mathrm{new}} = \hat{\theta}_{\mathrm{old}} + \Big(\sum_i S_iS_i^T\Big)^{-1}\sum_i S_i, \qquad S_i = \begin{bmatrix}S_{\beta i}\\ S_{Di}\end{bmatrix}.$$

Note that the derivative of a scalar $s$ with respect to a column vector $v$ is the same as its derivative with respect to $v^T$ followed by a transpose, i.e., $\partial s/\partial v = (\partial s/\partial v^T)^T$. The latter is used here because it is more straightforward when applying the formulae of Magnus and Neudecker (1988).

Implicit Differentiation

The posterior mode of the random effects, $\tilde{b}_i$, depends on the variance $D$ and the fixed effects $\beta$: $\tilde{b}_i = \tilde{b}_i(\beta, D) = D\,\tilde{l}_i^{(1)T}$. In finding score functions for $\beta$ and $D$ we must take this relation into account through the differentials $\partial\tilde{b}_i/\partial\beta^T$ and $\partial\tilde{b}_i/\partial(\mathrm{vec}\,D)^T$, obtained by implicit differentiation:

$$\frac{\partial\tilde{b}_i}{\partial\beta^T} = D\frac{\partial\tilde{l}_i^{(1)T}}{\partial\beta^T} = -DZ_i^TW_i\Big(X_i + Z_i\frac{\partial\tilde{b}_i}{\partial\beta^T}\Big).$$

Collecting $\partial\tilde{b}_i/\partial\beta^T$ on the left-hand side, $(I + DZ_i^TW_iZ_i)\,\partial\tilde{b}_i/\partial\beta^T = -DZ_i^TW_iX_i$; thus, pre-multiplying each side by $D^{-1}$,

$$\frac{\partial\tilde{b}_i}{\partial\beta^T} = -(D^{-1} + Z_i^TW_iZ_i)^{-1}Z_i^TW_iX_i = -Q_iZ_i^TW_iX_i, \tag{19}$$

$$\frac{\partial\tilde{\eta}_i}{\partial\beta^T} = \frac{\partial(X_i\beta + Z_i\tilde{b}_i)}{\partial\beta^T} = X_i - Z_iQ_iZ_i^TW_iX_i. \tag{20}$$

Similarly,

$$\frac{\partial\tilde{b}_i}{\partial(\mathrm{vec}\,D)^T} = \big[(y_i^* - \tilde{\eta}_i)^TW_iZ_i \otimes Q_iD^{-1}\big], \tag{21}$$

$$\frac{\partial\tilde{\eta}_i}{\partial(\mathrm{vec}\,D)^T} = \big[(y_i^* - \tilde{\eta}_i)^TW_iZ_i \otimes Z_iQ_iD^{-1}\big]. \tag{22}$$

In the derivatives below of each term in $L_6$, all quantities that are functions of $\tilde{b}_i$, e.g., $\tilde{\eta}_i$, $\tilde{\mu}_i$, $W_i$, $A_i$, $G_i$ and $F_i$, have partial derivatives not only with respect to the apparent $\beta$ and $\mathrm{vec}\,D$ but also through $\tilde{b}_i$, using the derivatives above.

Score Function of the Fixed Effects

I will prove that the score function of $\beta$ for group $i$ is

$$
S_{\beta i} = -\tfrac{1}{2}\sum_j a_{ij}B_{ij}\,d_{ij} + X_i^TW_i(y_i^* - \tilde{\eta}_i)
+ \frac{1}{M_i}\bigg(-\tfrac{1}{8}\sum_k\Big[h_{ik}B_{ik}^2 - 2\sum_j g_{ij}B_{ij}a_{ik}(Z_{ik}^TQ_iZ_{ij})^2\Big]d_{ik}
$$
$$
- \tfrac{1}{48}\sum_k\Big[p_{ik}B_{ik}^3 - 3\sum_j f_{ij}B_{ij}^2a_{ik}(Z_{ik}^TQ_iZ_{ij})^2\Big]d_{ik}
+ \tfrac{1}{72}\Big[2\sum_j g_{ij}(k_i^TQ_iZ_{ij})B_{ij}\,d_{ij} - \sum_k\Big(2\sum_j a_{ij}a_{ik}(k_i^TQ_iZ_{ij})(Z_{ik}^TQ_iZ_{ij})^2 + a_{ik}(k_i^TQ_iZ_{ik})^2\Big)d_{ik}\Big]\bigg),
$$

with the shorthand $d_{ij} = X_{ij} - X_i^TW_iZ_iQ_iZ_{ij}$ (the transpose of row $j$ of Equation 20) and $p_{ij} = h_{ij}(1-12w_{ij}) - 36a_{ij}g_{ij}$, the derivative of $f_{ij}$ with respect to $\eta_{ij}$.

Proof. Making use of Equation F6 first, and then F7 and F8, the derivative of the second term in $L_6$ is

$$\frac{\partial}{\partial\beta^T}\Big(-\tfrac{1}{2}\log|D^{-1} + Z_i^TW_iZ_i|\Big) = -\tfrac{1}{2}\mathrm{tr}\Big(Q_i\frac{\partial(Z_i^TW_iZ_i)}{\partial\beta^T}\Big) = -\tfrac{1}{2}\sum_j a_{ij}Z_{ij}^TQ_iZ_{ij}\,\big[X_{ij}^T - Z_{ij}^TQ_iZ_i^TW_iX_i\big], \tag{23}$$

where

$$\frac{\partial\,\mathrm{vec}(Z_i^TW_iZ_i)}{\partial\beta^T} = (Z_i^T\otimes Z_i^T)A_i(X_i - Z_iQ_iZ_i^TW_iX_i) = \sum_j a_{ij}(Z_{ij}\otimes Z_{ij})\big(X_{ij}^T - Z_{ij}^TQ_iZ_i^TW_iX_i\big). \tag{24}$$

Equation 24 is obtained by regarding $Z_i^T$ as the $A$ in Equation F8 and $Z_i$ as the $B$: we vectorize $W_i$ and take its derivative, using Equation 20, to get $A_i(X_i - Z_iQ_iZ_i^TW_iX_i)$; the special structure $A_i = \sum_j a_{ij}(E_{ij}E_{ij}^T\otimes E_{ij})$ then decomposes the result observation by observation, which, with Equations F2 and 19, yields Equation 23. The transpose of Equation 23 is the first term in the score function.

$$\frac{\partial\tilde{l}_i}{\partial\beta^T} = (y_i - \tilde{\mu}_i)^T(X_i - Z_iQ_iZ_i^TW_iX_i) = (y_i^* - \tilde{\eta}_i)^TW_i(X_i - Z_iQ_iZ_i^TW_iX_i), \tag{25}$$

the derivation being similar to that of $\partial l_i/\partial b_i^T$ in the derivation of the approximate log-likelihood.

$$\frac{\partial}{\partial\beta^T}\Big(-\tfrac{1}{2}\tilde{b}_i^TD^{-1}\tilde{b}_i\Big) = -\tilde{b}_i^TD^{-1}\frac{\partial\tilde{b}_i}{\partial\beta^T} = \tilde{b}_i^TD^{-1}Q_iZ_i^TW_iX_i. \tag{26}$$

Since $\tilde{b}_i = DZ_i^TW_i(y_i^* - \tilde{\eta}_i)$, we have $\tilde{b}_i^TD^{-1} = (y_i^* - \tilde{\eta}_i)^TW_iZ_i$. In adding Equations 25 and 26, the terms $\mp(y_i^* - \tilde{\eta}_i)^TW_iZ_iQ_iZ_i^TW_iX_i$ cancel, leaving only $(y_i^* - \tilde{\eta}_i)^TW_iX_i$, the transpose of which is the second term in the score function.

The derivative of $\log M_i$ in the approximate log-likelihood is

$$\frac{\partial\log M_i}{\partial\beta^T} = \frac{1}{M_i}\frac{\partial M_i}{\partial\beta^T}. \tag{27}$$

We take the derivative of the first term in $M_i$:

$$\frac{\partial}{\partial\beta^T}\sum_j g_{ij}B_{ij}^2 = \sum_j h_{ij}B_{ij}^2\frac{\partial\tilde{\eta}_{ij}}{\partial\beta^T} + \sum_j 2g_{ij}B_{ij}\frac{\partial B_{ij}}{\partial\beta^T} = \sum_k\Big[h_{ik}B_{ik}^2 - 2\sum_j g_{ij}B_{ij}a_{ik}(Z_{ik}^TQ_iZ_{ij})^2\Big]\big(X_{ik}^T - Z_{ik}^TQ_iZ_i^TW_iX_i\big), \tag{28}$$

where the first piece results from differentiating $g_{ij}$ ($h_{ij}$ being its derivative with respect to $\eta_{ij}$) and the second from differentiating $B_{ij}$. For the latter, Equation F9 gives

$$\frac{\partial\,\mathrm{vec}\,Q_i}{\partial\beta^T} = -(Q_i\otimes Q_i)\frac{\partial\,\mathrm{vec}(D^{-1} + Z_i^TW_iZ_i)}{\partial\beta^T}, \tag{29}$$

and substituting Equation 24 into Equation 29 and applying F3 yields $\partial B_{ij}/\partial\beta^T = -\sum_k a_{ik}(Z_{ik}^TQ_iZ_{ij})^2(X_{ik}^T - Z_{ik}^TQ_iZ_i^TW_iX_i)$. The transpose of Equation 28,

$$\frac{\partial}{\partial\beta}\sum_j g_{ij}B_{ij}^2 = \sum_k\Big[h_{ik}B_{ik}^2 - 2\sum_j g_{ij}B_{ij}a_{ik}(Z_{ik}^TQ_iZ_{ij})^2\Big]d_{ik}, \tag{30}$$

multiplied by $-\tfrac{1}{8}$, is the contribution of the fourth-order term of the Taylor series to the score function of $\beta$. Similarly,

$$\frac{\partial}{\partial\beta^T}\sum_j f_{ij}B_{ij}^3 = \sum_k\Big[p_{ik}B_{ik}^3 - 3\sum_j f_{ij}B_{ij}^2a_{ik}(Z_{ik}^TQ_iZ_{ij})^2\Big]\big(X_{ik}^T - Z_{ik}^TQ_iZ_i^TW_iX_i\big), \tag{31}$$

whose transpose,

$$\frac{\partial}{\partial\beta}\sum_j f_{ij}B_{ij}^3 = \sum_k\Big[p_{ik}B_{ik}^3 - 3\sum_j f_{ij}B_{ij}^2a_{ik}(Z_{ik}^TQ_iZ_{ij})^2\Big]d_{ik}, \tag{32}$$

multiplied by $-\tfrac{1}{48}$, is the derivative of the second-to-last term in $M_i$.
Itwill provethat S _-_l D-l l D-l -l __1_.X -l T '_ TWZ Di ' 2 W“ )+ 2 vec( QID ) 2 duvedD QIZyZy QIZgO’: ’71) I I) J . +%vec(D"3, 1370") 1 1 - . 1 " - .. +7{‘“8'i hyB; vec(D lQIZyU’I -' ’71)].le ) -228yBgV8¢‘(D 1912112;le 1) I J J +-i$a.g.2. (2.79.2 ) vecw Q. 2.vec + (2? ® 2.’)A.«y.-' - WW. 2 2. Q.D"')1. (37) Take the transpose off Equation 36 to get the second and third terms of the score function SD. I W213)— " [veC(D“Q.Z’W (y.- me.’ - 17.)"W.Z. )1’. (33) using Equations 22 and F10. 38 - ~ ~ ~ ~ -1 ~ ~ erc(-ilb,rb"b,)=-%(bf®b,’)ée—c(l—)—)- bTD“ 5'" o’KvecD)T - ' 5(vecD)’ l _ ~ ~ __ _ _ ~ 0 = ElvedD ‘bl birD 1)]1. '[vedD lQID lbl(yl " '71)T”,lzl)]r ’ (39) using Equations F8, F9, F11 and 21. In adding up Equations 38 and 39, however, because of the fact that $1 = DZITW’IU: " ’71): [vec(D"Q,Z,TW, (y; " TLXJ’: “ ”Dru/121 )]T ‘ [vec(D-lQID-‘El(yl. - '11)]. ”’12: >11. = O 2 with only %[vec(D"5, 127D")? left, the transpose of which is the fourth term in the score function. For the derivative ofthe first term in M, 555).: 2,2; = 222.32; to: - n)’W.2. ® Q.D"] 457“, g.B.(2.’ ® Z.’ XQ. a Q.) a’e“;:e:,,z)':mz’) = :th; [vec(D"Q,Z,,(y; - 17, )T W,Z, )]T +2: gyB, [vec(D"Q,Z,,Z,,TQ,D'l )]T -2:2:5a.g.2.(2.’Q.2. )zlvec(D"Q.Zu(y.° - n. )TW.2. )i’ , (40) using Equations 22, F8, F 9, 37 and F10. The transpose of Equation 40 multiplied by :81 is the derivative of %l X g, B; with respect to vecD. j 39 Similarly, dch) —_va8;= £1,113; 33W 912;: (yr- 771),. WZ,) +33: faBzvedD"Q,Z,,Z;Q,D") I -BX:a.f.B;(2;Q.2.)’vec(D"Q.2.(y: -— n.)’W.2.). (41) 1 Equation 41 multiplied by i; is the derivative of :7: i f, B; with respect to vecD. I The derivative of the last term in M, is a Q'IWCD)T k'TQ' 1" = Zk'TQ' 1: gflizinazar [(y.‘ - rI.)’W.Z. <8) Q.D"] azec(D" + Z,TW,Z,) o'KvecD)T "t {Ml} ® 2.2;)(Q. ® Q.) I aveew" + 2,’W,2,) dvecD)’ “(klr ® leXQI 8 Q!) = 22 2.059. Z.)B.Ivec(D"Q.Z.(y: - 77. W2. )1’ 1 +22 a.(k.’Q.2. )[vec(D"Q.Z.,Z.’ Q. D" )i’ J -2 jZauaAkaTQZaXZJQZu)zlvec(D"Q.Zu(y.' — n.)’W.2. )1’ + [vec(D-lglkl kir QID -1 )1T ' 2 0&(leQlZlk )2 [vec(D-IQIZW (y: - ’7: )1 W121 )1T - (42) k 40 Again, we take derivative by using Equation F8, F9, F3, 22, 37 and F11. The transpose of Equation 42 multiplied by g- is the derivative of the last term in M, with respect to vecD. Chapter 4 AN ILLUSTRATIVE EXAMPLE Introduction This chapter presents an analysis on the data set of 1988 National Survey of Primary Education in Thailand (Thailand data). The analysis serves mainly as an 1 example of the use of the multilevel Bernoulli model. It will also explore the difl'erences and similarities among the forn' methods, namely, the first order Taylor expansion of the conditional expectation of the response (PQL), the second order Taylor expansion (PQL2), the sixth-order Laplace approximation to the log-likelihood (Laplace6) and Gauss-Hermite Quadratme approximation to the log-likelihood (Gauss). In addition, the difi‘erences produced by Gauss in using different numbers of quadrature points will also be of interest in this chapter. Thailand Data The Thailand data (U SAID contract DPE-5824-A00-5076-00) were collected in 1988 by a research team from College of Education, Michigan State University, and Royal Thai Government, Office of the National Educational System. Information gathered includes survey and case studies. The purpose of the project was to “provide reliable data related to outcomes and costs of education and to allow study of policy 41 42 alternatives to improve the quality of primary education.” (T aoklam et. 
Chapter 4

AN ILLUSTRATIVE EXAMPLE

Introduction

This chapter presents an analysis of the data set of the 1988 National Survey of Primary Education in Thailand (the Thailand data). The analysis serves mainly as an example of the use of the multilevel Bernoulli model. It also explores the differences and similarities among four methods: the first-order Taylor expansion of the conditional expectation of the response (PQL), the second-order Taylor expansion (PQL2), the sixth-order Laplace approximation to the log-likelihood (Laplace6), and the Gauss-Hermite quadrature approximation to the log-likelihood (Gauss). In addition, the differences produced by Gauss when using different numbers of quadrature points are also of interest in this chapter.

Thailand Data

The Thailand data (USAID contract DPE-5824-A00-5076-00) were collected in 1988 by a research team from the College of Education, Michigan State University, and the Royal Thai Government, Office of the National Educational System. The information gathered includes surveys and case studies. The purpose of the project was to "provide reliable data related to outcomes and costs of education and to allow study of policy alternatives to improve the quality of primary education" (Taoklam et al., 1992; see also Raudenbush and Bhumirat, 1992; Raudenbush, Bhumirat and Kamali, 1992).

The survey employed a multi-stage stratified sampling design. Samples were drawn at the levels of schools and individuals. First, 405 schools were selected randomly within provinces. Then, one sixth-grade class per school that had engaged in the national assessment project was selected at random from each selected school. At the individual level, samples were drawn from four population groups: principals, teachers, parents, and students. The student data are the interest of the current study. Information about the schools from which the classes were drawn was also collected. Altogether, 405 schools were sampled, within which data on 8,582 pupils were collected. However, after deleting cases with missing school information, data on 7,877 students in 376 schools were used for the current analysis.

Before the survey began, Thailand had launched various programs since 1980 to improve the quality of education. These included a pre-primary education program, a national testing program, and various staff development programs for principals and teachers. The purpose of the pre-primary education program was to improve each student's readiness for schooling. At the same time, the government tried to promote the quality of administration and classroom teaching through the staff development programs, and to hold educators accountable for student learning through the national testing program. By requiring students to demonstrate basic skills before they could advance to the next grade, the country strove to ensure the quality of the product of school education -- student learning. It would be expected that these programs helped elevate educational efficiency.

Therefore, the research question here is whether students receiving pre-primary education, controlling for student and school background, had a smaller probability of repeating a grade.

However, an important variable that needs to be taken into account before making any claims about our focus of interest is socioeconomic status (SES). SES has always been found to be positively correlated with student achievement, and I suspect this would also be true in Thailand. Whether students have adequate nutrition, especially breakfast, is another interesting variable. Students who did not have breakfast every day either came from poor families that could not afford breakfast every day, or had parents who did not pay much attention to them. Either way, not having breakfast interferes with students' concentration on learning, which might increase the probability of repeating grades. Whether a student spoke the central Thai dialect could also affect his or her probability of repeating a grade, since central Thai was the language used in class: a student who could not speak central Thai would have difficulty understanding the instruction. Finally, student gender is also an interesting covariate to include, in order to see whether girls do differently from boys in grade repeating.
Therefore, the student-level variables include: the response variable, whether the student repeated a grade (REP1, 1 = yes, 0 = no); the variable of interest, whether the student received pre-primary education (PPEDID, 1 = yes, 0 = no); and the concomitant variables, which are the student's gender (DSSEX, 1 = male, 0 = female); the student's family socio-economic background (SESC, grand-mean centered); whether the student had breakfast every day (BREF1, 1 = yes, 0 = no); and whether the student spoke dialects other than central Thai (DIALCT1, 0 = yes, 1 = no).

On the other hand, school environments may also affect student learning. Schools located in urban and affluent areas would have larger enrollments and more resources than schools in rural and poor areas. Students from poor families who attend a big school might then have a better chance in education than students going to a poorer school. The average SES of students in a school is also a good indicator of the resources in a school. The average number of textbooks per student in a school is a direct indicator of the instructional resources to which students have access. Without sufficient textbooks, it would be very difficult for students to learn. Thus, the school information of interest includes: the natural log of school enrollment, grand-mean centered (L_ENRC); the average of students' SES, grand-mean centered (MSESC); and the average number of books per student, grand-mean centered (MTXBKC).

Results

After some preliminary runs, I found that the regression coefficients for the first-level variables either did not have a significant amount of variance themselves, such as PPEDID (pre-primary education) and DSSEX (student gender), or had variation that could be explained by level-2 variables, such as SESC (student family SES), DIALCT1 (student spoke dialect), and BREF1 (breakfast). Therefore, I decided on a univariate random effect model. The first-level model for the data set is

\[
\eta_{ij} = a_{0j} + a_{1j}(\mathrm{SESC})_{ij} + a_{2j}(\mathrm{DSSEX})_{ij} + a_{3j}(\mathrm{DIALCT1})_{ij} + a_{4j}(\mathrm{BREF1})_{ij} + a_{5j}(\mathrm{PPEDID})_{ij},
\]

while in level 2,

\[
\begin{aligned}
a_{0j} &= \beta_{00} + \beta_{01}(\mathrm{L\_ENRC})_j + \beta_{02}(\mathrm{MSESC})_j + b_j\\
a_{1j} &= \beta_{10} + \beta_{11}(\mathrm{MSESC})_j\\
a_{2j} &= \beta_{20}\\
a_{3j} &= \beta_{30} + \beta_{31}(\mathrm{MTXBKC})_j\\
a_{4j} &= \beta_{40} + \beta_{41}(\mathrm{L\_ENRC})_j\\
a_{5j} &= \beta_{50},
\end{aligned}
\]

where \(b_j \sim N(0, D_{00})\).

Table 1 - Estimates of Thailand Data

            PQL       PQL2      Gauss-10  Gauss-20  Gauss-30  Gauss-40  Laplace6
β00        -2.0137   -2.2166   -2.2353   -2.2009   -2.1990   -2.1998   -2.1940
           (.1409)   (.1524)   (.1429)   (.1420)   (.1421)   (.1421)   (.1421)
β01        -.4031    -.4136    -.4614    -.4095    -.4159    -.4156    -.4147
           (.1600)   (.1781)   (.1933)   (.1909)   (.1915)   (.1914)   (.1914)
β02        -.6794    -.7889    -.7884    -.7845    -.7814    -.7809    -.7753
           (.2606)   (.2958)   (.3055)   (.3079)   (.3079)   (.3079)   (.3076)
β10        -.4971    -.5223    -.5325    -.5220    -.5220    -.5220    -.5223
           (.1003)   (.1056)   (.1027)   (.1035)   (.1034)   (.1034)   (.1034)
β11         .4657     .5003     .5321     .4962     .4976     .4972     .4978
           (.1408)   (.1562)   (.1658)   (.1651)   (.1651)   (.1651)   (.1651)
β20         .5549     .5819     .5840     .5825     .5825     .5825     .5827
           (.0728)   (.0764)   (.0710)   (.0704)   (.0704)   (.0704)   (.0704)
β30         .3005     .3358     .3658     .3255     .3336     .3336     .3319
           (.1262)   (.1384)   (.1235)   (.1304)   (.1300)   (.1301)   (.1300)
β31        -.1012    -.1112    -.1513    -.1052    -.1114    -.1109    -.1104
           (.0593)   (.0655)   (.0671)   (.0781)   (.0776)   (.0776)   (.0777)
β40        -.4154    -.4327    -.4214    -.4354    -.4335    -.4340    -.4337
           (.1032)   (.1081)   (.1041)   (.1026)   (.1028)   (.1028)   (.1028)
β41         .2739     .2907     .2905     .2911     .2910     .2910     .2910
           (.1355)   (.1461)   (.1447)   (.1440)   (.1440)   (.1440)   (.1440)
β50        -.4146    -.4501    -.4555    -.4462    -.4489    -.4482    -.4478
           (.0947)   (.1007)   (.0993)   (.0994)   (.0994)   (.0994)   (.0994)
D00        1.0703    1.444     1.473     1.383     1.390     1.388     1.3771
           (.1187)   (.1543)                                           (.1830)

** Numbers inside the parentheses are standard errors.
Table 1 gives the estimates of the fixed effects and the variance by Laplace6, Gauss-10, Gauss-20, Gauss-30 (30 quadrature points) and Gauss-40 (40 quadrature points), PQL, and PQL2. Although the methods give different estimates of the parameters (Table 1), they agree at the .05 significance level for all estimates, except for β31 by Gauss-10. In fact, the independent variables, except β31, are all very powerful predictors of grade repetition. In particular, the school-level variables L_ENRC (log-enrollment) and MSESC (school mean SES) not only have great impact on the intercept, β00, but also help predict, respectively, the impact of whether the student had breakfast every day (BREF1) and that of the student's personal SES background (SESC). However, their impacts on BREF1 and SESC are smaller than, and in opposite directions to, those on the intercept. That is, while students who did not have breakfast every day and who came from families with low SES had an increased risk of repeating grades, the effects of these adverse personal backgrounds are weakened if such students attended big schools (high enrollment) with a higher school mean SES. In other words, an affluent school environment provides a cushion for students from poverty, helping prevent them from failing in school.

Pre-primary education, our focus of interest, also helped prevent a student from repeating grades. According to the preliminary runs, there is not much variation in its effect; the effect of pre-primary education on grade repetition was therefore quite stable across schools. On the other hand, students speaking central Thai also tended to have an advantage in their learning. Having textbooks reduced the disadvantage of speaking dialects other than central Thai by about one third of that effect. This makes sense, since students could learn little by little on their own if they had textbooks at hand. However, this effect is not significant at the .05 level; its p-value is around .15. Finally, girls did seem to learn better than boys at the primary school level. Holding all other variables at the average, a boy had a logit of repeating grades about .58 higher than a girl's.

The comparison among the estimates by the different methods is another interesting issue. As shown in the table, many of the differences between Gauss-10 and Gauss-20 are in the second decimal place. Gauss-20 and Gauss-30 differ in the third decimal place. Gauss-30 and Gauss-40 do not differ by much, only in the fourth decimal place. Some of the Laplace6 results differ from Gauss-40 in the third decimal place and some in the fourth. The Laplace6 results are generally closer to Gauss-30 and Gauss-40 than to Gauss-10 and Gauss-20. PQL2 and PQL are further away. PQL consistently gives estimates that are smaller in absolute value. The PQL2 results are actually fairly close to those of Gauss with larger numbers of points, but they are not as close as those of Laplace6.
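To make the comparison concrete, the following minimal Python sketch, with made-up inputs, evaluates the marginal log-likelihood of a single cluster under a random-intercept Bernoulli model in the two ways compared above: Gauss-Hermite Quadrature with a chosen number of points, and a plain second-order Laplace approximation (the building block that Laplace6 extends to sixth order). The function and variable names are illustrative, not those of any of the programs discussed.

```python
import numpy as np

def cluster_loglik_gh(y, x, beta, tau, n_quad=20):
    """Marginal log-likelihood of one cluster, random-intercept logit:
    y_j | b ~ Bernoulli(logistic(x_j @ beta + b)), b ~ N(0, tau),
    integrated over b by Gauss-Hermite quadrature with n_quad points."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    # Change of variable b = sqrt(2*tau)*node maps the N(0, tau) integral
    # onto the Gauss-Hermite weight function exp(-node**2).
    b = np.sqrt(2.0 * tau) * nodes                      # (n_quad,)
    eta = x @ beta + b[:, None]                         # (n_quad, n_obs)
    logp = y * eta - np.log1p(np.exp(eta))              # Bernoulli log-density
    return np.log(np.sum(weights * np.exp(logp.sum(axis=1))) / np.sqrt(np.pi))

def cluster_loglik_laplace(y, x, beta, tau, n_newton=25):
    """Second-order Laplace approximation to the same integral:
    log L ~ h(b_hat) + 0.5*log(2*pi) - 0.5*log(-h''(b_hat)), where
    h(b) = log N(b; 0, tau) + sum_j log p(y_j | b)."""
    b = 0.0
    for _ in range(n_newton):                           # Newton steps to find b_hat
        p = 1.0 / (1.0 + np.exp(-(x @ beta + b)))
        g = np.sum(y - p) - b / tau                     # h'(b)
        H = -np.sum(p * (1.0 - p)) - 1.0 / tau          # h''(b) < 0
        b -= g / H
    p = 1.0 / (1.0 + np.exp(-(x @ beta + b)))
    h = (-0.5 * np.log(2 * np.pi * tau) - 0.5 * b**2 / tau
         + np.sum(y * np.log(p) + (1 - y) * np.log1p(-p)))
    return h + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(np.sum(p * (1 - p)) + 1.0 / tau)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 2))                            # 16 observations, 2 covariates
y = rng.binomial(1, 0.4, size=16).astype(float)
beta, tau = np.array([0.5, -0.3]), 1.0
for n in (10, 20, 40):
    print(n, cluster_loglik_gh(y, x, beta, tau, n))
print("Laplace2", cluster_loglik_laplace(y, x, beta, tau))
```

In the quadrature version, accuracy is controlled by n_quad, mirroring the Gauss-10 through Gauss-40 comparison in Table 1; the Laplace value is obtained in a single pass over the data, which is the source of the computational speed advantage reported in Chapter 5.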
Chapter 5

EVALUATION WITH SIMULATED DATA

Introduction

This chapter compares the four methods, Laplace6, Gauss-Hermite Quadrature (Gauss) (Hedeker and Gibbons, 1994, 1996), PQL (Raudenbush, 1993) and PQL2 (Goldstein and Rasbash, 1996), by analyzing data sets simulated under eight different models. The comparison will be in terms of 1) the unbiasedness of the estimates \(\hat\theta = (\hat\beta, \operatorname{vec}\hat D)\) across data sets under the same model; 2) the mean squared errors of the estimates; 3) the average of the standard errors from the outputs; 4) the standard deviation of the estimates across data sets, \(SD(\hat\theta)\); and 5) the relative efficiency of Laplace6 to the other methods.

Eight different models were used to simulate the data sets. The first six models (Models 1 to 6) were univariate random effect models that had a wide range (.52, .2, .1) of average conditional expectations of the response \(y\) given \(b_j = 0\), \(\bar\pi^{(0)} = E(E(y_{ij}\mid b_j = 0))\), and two different values of the random effect variance, namely 1 and .25. The data sets were generated by Yang (1994). The purpose of using six different models was to investigate whether the methods performed differently depending on the parameter values. Presumably, models with \(\bar\pi^{(0)}\) close to .5 will be the easiest for all methods, because of the symmetry of the data sets. As \(\bar\pi^{(0)}\) becomes smaller, the estimation task becomes more difficult. While PQL was already known to have a large negative bias for larger variances, the performance of the other methods for large variances was of interest.

Two bivariate random effects models (Models 7 and 8) were constructed to assess the four methods with dependent random effects, in two ways. Model 7 explored the performance of the methods under severe conditions, with a small \(\bar\pi^{(0)} = .143\) and extreme values (1.625, .25) for the variances with a small covariance (.1). The interest of Model 8 lay in inspecting the consistency property of the maximum likelihood estimates produced by Laplace6. The investigation compared estimates under the same model but with two different numbers of clusters in the second level, the first set being 10 times smaller than the second. The property of consistency would be revealed if there were little bias in the estimates and the variances of the estimates became smaller as the sample size increased.

The basic structure of the data sets followed Rodriguez and Goldman (1995). In the first level, we had

\[
\eta_{ij} = \log[\mu_{ij}/(1-\mu_{ij})] = a_{0j} + a_{1j}(\mathrm{childc})_{ij}.
\]

In the second level,

\[
a_{0j} = \beta_{00} + \beta_{01}(\mathrm{commuc})_j + b_{0j}, \qquad b_{0j} \sim N(0, D_{00}).
\]

Here \(a_{1j} = \beta_{10}\) was fixed for the first six univariate random effects models. For the bivariate random effects models (Models 7 and 8), \(a_{1j} = \beta_{10} + b_{1j}\) was random, with \(b_{1j} \sim N(0, D_{11})\) and \(\operatorname{cov}(b_{0j}, b_{1j}) = D_{01}\). The values of \(\beta_{01}\) and \(\beta_{10}\) were both set to 1. The values of \(\beta_{00}\) were manipulated in order to get different values of \(\bar\pi^{(0)}\). The level-1 covariate, childc, was sampled from a normal distribution with mean .0955621 and variance .0676, while the level-2 predictor, commuc, was sampled from a normal distribution with mean -.6857591 and variance .2304. In the bivariate random effects models, the means of both covariates remained unchanged, while their variances were both changed to 1. There were no missing values under any of these models.
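A minimal Python sketch of this data-generating structure follows, using the Model 1 parameter values reported in the next section (β00 = .665267, β01 = β10 = 1, D00 = 1, 161 clusters of 16 observations). The function name and seed are illustrative only.

```python
import numpy as np

def simulate_model1(n_clusters=161, n_per=16, b00=0.665267, b01=1.0, b10=1.0,
                    d00=1.0, seed=0):
    """One data set under the Rodriguez-Goldman structure used for Models 1-6:
    eta_ij = a0_j + a1_j * childc_ij,  a0_j = b00 + b01*commuc_j + u_j,
    u_j ~ N(0, d00), and a1_j = b10 (fixed slope in the univariate models)."""
    rng = np.random.default_rng(seed)
    commuc = rng.normal(-0.6857591, np.sqrt(0.2304), size=n_clusters)
    childc = rng.normal(0.0955621, np.sqrt(0.0676), size=(n_clusters, n_per))
    u = rng.normal(0.0, np.sqrt(d00), size=n_clusters)
    a0 = b00 + b01 * commuc + u
    eta = a0[:, None] + b10 * childc
    mu = 1.0 / (1.0 + np.exp(-eta))
    y = rng.binomial(1, mu)
    return y, childc, commuc

y, childc, commuc = simulate_model1()
print(y.shape, y.mean())   # 161 x 16 binary responses and the overall event rate
```

For Models 7 and 8, the same skeleton would apply with the slope made random, the pair (u0_j, u1_j) drawn from a bivariate normal with the stated covariance matrix, and the covariate variances set to 1.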
Univariate Random Effect Data Sets (Models 1 - 6)

The six univariate random effect models used three different values for the average conditional probability, \(\bar\pi^{(0)}\), namely .52, .2, and .1, and two different values of the variance \(D_{00}\), 1 and .25, where 1 is usually considered large and .25 fairly small. The values of \(\beta_{00}\) were .665267, -.7960974 and -1.62 for \(\bar\pi^{(0)}\) equal to .52, .2 and .1, respectively. Each data set had 16 observations for each cluster in the first level, and 161 clusters in the second level. For each combination of the parameters, 50 data sets were generated. Model 1 had \(\bar\pi^{(0)} = .52\) and \(D_{00} = 1\), while Model 2 had the same value of \(\bar\pi^{(0)}\) but a smaller variance, \(D_{00} = .25\). Model 3 had \(\bar\pi^{(0)} = .2\) and \(D_{00} = 1\), whereas Model 4 differed from Model 3 by a smaller variance, \(D_{00} = .25\). Similarly, under Model 5, \(\bar\pi^{(0)} = .1\) and \(D_{00} = 1\); under Model 6, \(\bar\pi^{(0)} = .1\) and \(D_{00} = .25\).

Gauss results were computed using 10 quadrature points (Gauss-10). The results were obtained from Yosef (1997). Ten points were specified because, according to the MIXOR manual (1993), 8 to 10 points produce satisfactory results for univariate data sets, whereas fewer points can be specified for higher-dimensional data sets. PQL was not compared here, since Yosef (1997) found it to consistently underestimate the fixed effects and the variance components, in accordance with previous results (e.g., Rodriguez and Goldman, 1995; Breslow and Clayton, 1993).

Results of Model 1

Table 2 - Averages and Mean Squared Errors of Model 1

                   Laplace6          Gauss-10          PQL2
                average   mse     average   mse     average   mse
D00 = 1         1.0135   .0294    1.0142   .0300    1.0361   .0336
β00 = .665267   .6679    .0251    .6677    .0251    .6750    .0258
β01 = 1         .9812    .0430    .9835    .0429    .9913    .0440
β10 = 1         .9891    .0400    .9901    .0401    .9944    .0405

The clearest pattern in Table 2 is that, under Model 1, the averages and mean squared errors of the three methods were very close to one another, although PQL2 consistently had slightly larger mean squared errors than the other two. The biases of the three methods were small, and the directions of the biases were the same as well. Another clear pattern is that PQL2 always gave the largest estimate of every parameter, whether the bias of the three methods was negative or positive for that parameter. The positive bias of PQL2 for the variance was 3.6% of the parameter, which seemed a little too large compared to that of the other two methods.

Table 3 - Averages of S.E.'s and S.D.'s of Estimates of Model 1

                   Laplace6            Gauss-10            PQL2
                avg. s.e.  s.d.     avg. s.e.  s.d.     avg. s.e.  s.d.
D00 = 1         .1759      .1725    .1740      .1742    NA         .1816
β00 = .665267   .1698      .1599    .1683      .1602    .1648      .1621
β01 = 1         .2014      .2085    .1999      .2087    .1953      .2116
β10 = 1         .1789      .2018    .1788      .2021    .1749      .2032

The standard error of \(D_{00}\) was not available in the PQL2 program. The averages of the standard errors were the average amounts of uncertainty the methods predicted for the estimates; the standard deviations of the estimates indicate the actual amount of uncertainty in the estimation. The discrepancy between the prediction and the reality gathered from the 50 data sets was, for all three methods, largest for \(\beta_{10}\), for which all three methods underestimated the variability, and smallest for \(D_{00}\) by Laplace6 and Gauss-10. The differences between the averages of the standard errors and the standard deviations of the estimates were smallest for Gauss-10.
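The five comparison criteria are mechanical to compute once the per-data-set estimates and reported standard errors are collected. The hedged Python sketch below shows one way to produce the quantities tabulated in this chapter (averages, bias, MSE, average standard error, empirical standard deviation, relative efficiency, and the empirical tail probability used later for Model 8); the array layout and names are assumptions for illustration.

```python
import numpy as np

def summarize(est, se, truth):
    """est, se: (n_datasets, n_params) arrays of estimates and reported
    standard errors from one method; truth: (n_params,) true values."""
    avg  = est.mean(axis=0)
    bias = avg - truth
    mse  = ((est - truth) ** 2).mean(axis=0)
    sd   = est.std(axis=0, ddof=1)          # empirical variability
    avg_se = se.mean(axis=0)                # variability the method predicts
    # Share of estimates outside truth +/- 1.96 empirical s.d.
    # ("emp. prob. 1" in the Model 8 tables).
    outside = (np.abs(est - truth) > 1.96 * sd).mean(axis=0)
    return dict(avg=avg, bias=bias, mse=mse, sd=sd, avg_se=avg_se,
                emp_prob_1=outside)

def relative_efficiency(mse_other, mse_laplace6):
    """Ratio of another method's MSE to Laplace6's, as in Tables 14-18;
    values above 1 favor Laplace6."""
    return mse_other / mse_laplace6

# Illustrative call with fabricated replication output:
rng = np.random.default_rng(1)
truth = np.array([1.0, 0.665267, 1.0, 1.0])
est = truth + rng.normal(0, 0.17, size=(50, 4))
se  = np.full((50, 4), 0.17) + rng.normal(0, 0.01, size=(50, 4))
print(summarize(est, se, truth)["mse"])
```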
Results of Model 2

Table 4 - Averages and Mean Squared Errors of Model 2

                   Laplace6          Gauss-10          PQL2
                average   mse     average   mse     average   mse
D00 = .25       .2656    .0048    .2658    .0048    .2662    .0049
β00 = .665267   .6759    .0111    .6760    .0111    .6771    .0112
β01 = 1         1.0123   .0158    1.0124   .0158    1.0141   .0159
β10 = 1         1.0010   .0380    1.0011   .0380    1.0025   .0381

Model 2 differed from Model 1 only in the value of \(D_{00}\). Again, the three methods were very similar in both bias and mean squared error. The mean squared errors of Laplace6 and Gauss-10 were identical. With a smaller value of \(D_{00}\), the three methods all had positive bias, although again small. The largest positive bias appeared for \(D_{00}\), at about 6% of the parameter by all three methods.

Table 5 - Averages of S.E.'s and S.D.'s of Estimates of Model 2

                   Laplace6            Gauss-10            PQL2
                avg. s.e.  s.d.     avg. s.e.  s.d.     avg. s.e.  s.d.
D00 = .25       .0684      .0683    .0685      .0684    NA         .0687
β00 = .665267   .1074      .1060    .1075      .1060    .1045      .1061
β01 = 1         .1264      .1265    .1264      .1265    .1243      .1267
β10 = 1         .1686      .1970    .1686      .1970    .1643      .1973

The similar values of the standard deviations under Model 2 in Table 5 are consistent with the close similarity of the mean squared errors. The predictions of uncertainty by the three methods were generally quite close to the empirical results. All three methods underestimated the variability of \(\beta_{10}\) by the largest amount, as in Model 1.

Results of Model 3

Table 6 - Averages and Mean Squared Errors of Model 3

                    Laplace6          Gauss-10          PQL2
                 average   mse     average   mse     average   mse
D00 = 1          .9396    .0472    .9362    .0463    .9772    .0515
β00 = -.7960974  -.7794   .0196    -.7801   .0199    -.7836   .0199
β01 = 1          1.0254   .0328    1.0261   .0330    1.0361   .0347
β10 = 1          1.0322   .0364    1.0324   .0364    1.0356   .0369

For Model 3, the Laplace6 results also followed those of Gauss-10 closely. Contrary to the situation in Model 1, the three methods had negative bias for \(D_{00}\) and very small positive biases for the \(\beta\)'s. The underestimation of \(D_{00}\) by PQL2 was around 2% of the parameter, while that by Laplace6 and Gauss-10 was much larger, around 6%. The biases of the \(\beta\)'s were very close across the three methods. However, the mean squared errors for PQL2 were all slightly larger than those of the other two.

Table 7 - Averages of S.E.'s and S.D.'s of Estimates of Model 3

                    Laplace6            Gauss-10            PQL2
                 avg. s.e.  s.d.     avg. s.e.  s.d.     avg. s.e.  s.d.
D00 = 1          .1789      .2108    .1766      .2076    NA         .2276
β00 = -.7960974  .1633      .1404    .1625      .1415    .1613      .1418
β01 = 1          .2029      .1810    .2018      .1817    .1994      .1845
β10 = 1          .2077      .1899    .2076      .1900    .2017      .1908

A notable pattern in Table 7 is that, for Model 3, all three methods seemed to under-predict the variation of the estimates of \(D_{00}\) and over-predict those of all the other parameters. The largest difference between prediction and empirical results occurred for \(D_{00}\). Laplace6 had the largest discrepancy among the three for all parameters, over-predicting the variation of the three fixed effects, while PQL2 had the smallest discrepancy.

Results of Model 4

Table 8 - Averages and Mean Squared Errors of Model 4

                    Laplace6          Gauss-10          PQL2
                 average   mse     average   mse     average   mse
D00 = .25        .2435    .0059    .2427    .0059    .2501    .0060
β00 = -.7960974  -.7854   .0077    -.7853   .0077    -.7873   .0077
β01 = 1          1.0057   .0144    1.0057   .0144    1.0075   .0145
β10 = 1          1.0060   .0498    1.0060   .0498    1.0071   .0499

With a small value of \(D_{00}\), the mean squared errors of the three methods were almost identical, as in Model 2. The bias of \(D_{00}\) by PQL2 was almost 0, while the negative bias by Laplace6 and Gauss-10 was around 3% of the parameter.
The \(\hat\beta\)'s were pretty much unbiased by all three methods.

Table 9 - Averages of S.E.'s and S.D.'s of Estimates of Model 4

                    Laplace6            Gauss-10            PQL2
                 avg. s.e.  s.d.     avg. s.e.  s.d.     avg. s.e.  s.d.
D00 = .25        .0816      .0774    .0814      .0770    NA         .0782
β00 = -.7960974  .1123      .0878    .1122      .0878    .1099      .0880
β01 = 1          .1437      .1212    .1436      .1212    .1392      .1215
β10 = 1          .2014      .2254    .2014      .2254    .1957      .2256

All three methods tended to over-predict the variation of the estimates, except for \(\beta_{10}\). The discrepancies between the predicted and empirical variation of the estimates were very close across the three methods, too, although PQL2 had a slightly smaller discrepancy than the other two; the discrepancies for Laplace6 and Gauss-10 were almost identical.

Results of Model 5

Table 10 - Averages and Mean Squared Errors of Model 5

                  Laplace6          Gauss-10          PQL2
               average   mse     average   mse     average   mse
D00 = 1        .9742    .0601    .9720    .0580    1.0511   .0803
β00 = -1.62    -1.6122  .0318    -1.6138  .0322    -1.6306  .0335
β01 = 1        .9994    .0580    1.0016   .0593    1.0117   .0592
β10 = 1        .9990    .0499    .9981    .0495    1.0006   .0502

Again, the results of Laplace6 tracked those of Gauss-10 closely under Model 5. For \(D_{00}\), the negative biases of the two methods were both around 2.5% of the parameter, while PQL2 had a positive bias of 5%. This differed from the experience with the models above, where PQL2 always had the same sign of bias (positive or negative) as the other two methods. On the other hand, the \(\beta\)'s by all three methods were almost unbiased. Gauss-10's mean squared error for \(D_{00}\) was a little smaller than that of Laplace6, while the \(\beta\)'s of Laplace6 had smaller mean squared errors. PQL2 generally had the largest mean squared errors for all parameters, as before.

Table 11 - Averages of S.E.'s and S.D.'s of Estimates of Model 5

                  Laplace6            Gauss-10            PQL2
               avg. s.e.  s.d.     avg. s.e.  s.d.     avg. s.e.  s.d.
D00 = 1        .2221      .2462    .2184      .2417    NA         .2815
β00 = -1.62    .1811      .1800    .1807      .1811    .1789      .1845
β01 = 1        .2317      .2432    .2310      .2460    .2291      .2456
β10 = 1        .2485      .2256    .2483      .2248    .2481      .2262

The standard deviation of \(D_{00}\) for PQL2 in Table 11 was much larger than those of the other two methods. This contributed to the large value of its mean squared error in Table 10. The three methods under-predicted the variation of \(D_{00}\) and over-predicted that of \(\beta_{10}\). The discrepancies for the other two \(\beta\)'s were small. The discrepancies for Gauss-10 were smaller than those of Laplace6 and PQL2.

Results of Model 6

Table 12 - Averages and Mean Squared Errors of Model 6

               Laplace6 (49 obs.)   Gauss-10 (48 obs.)   PQL2 (50 obs.)
               average   mse        average   mse        average   mse
D00 = .25      .2389    .0117       .2370    .0118       .2593    .0142
β00 = -1.62    -1.6139  .0123       -1.6119  .0124       -1.6214  .0123
β01 = 1        .9995    .0259       1.0044   .0252       1.0063   .0266
β10 = 1        .9933    .0765       .9883    .0770       1.0017   .0777

For Model 6, Laplace6 gave converged results for 49 of the 50 data sets, and Gauss-10 for 48, while PQL2 had no difficulty with any of the data sets, as shown in Table 12. The Laplace6 results were again very close to those of Gauss-10, both in averages and in mean squared errors. PQL2 seemed less biased than the other two, but it gave larger mean squared errors for the parameters. The negative biases of \(D_{00}\) by Gauss-10 and Laplace6 were both around 5% of the parameter, whereas PQL2's negative bias was a little smaller, around 3.5%. The three methods' estimates of the \(\beta\)'s were almost unbiased.
Table 13 - Averages of S.E.'s and S.D.'s of Estimates of Model 6

               Laplace6            Gauss-10            PQL2
               avg. s.e.  s.d.     avg. s.e.  s.d.     avg. s.e.  s.d.
D00 = .25      --         .1089    .1051      .1091    NA         .1198
β00 = -1.62    .1301      .1119    .1299      .1123    .1257      .1121
β01 = 1        .1652      .1625    .1649      .1605    .1642      .1648
β10 = 1        .2568      .2794    .2575      .2802    .2506      .2816

The standard deviations of the PQL2 estimates in Table 13 were again the largest. However, PQL2 gave the smallest standard errors of the estimates, as in the models discussed above. Its underestimation of the variation was most severe for \(\beta_{10}\). The discrepancies between the predicted and empirical variation for the other two \(\beta\)'s were smallest for PQL2. Laplace6 and Gauss-10 were very similar in their errors of prediction for the variation of \(\beta_{00}\) and \(D_{00}\); both were smaller than those of PQL2.

Relative Efficiency

Table 14 - Laplace6 Relative Efficiency Under Models with D00 = 1

        Model 1 (π̄(0)=.52)    Model 3 (π̄(0)=.2)     Model 5 (π̄(0)=.1)
        Gauss-10   PQL2       Gauss-10   PQL2       Gauss-10   PQL2
D00     1.0204     1.1429     .9809      1.0911     1.0362     1.3361
β00     1.0000     1.0279     1.0153     1.0153     1.0354     1.0535
β01     .9977      1.0233     1.0061     1.0579     1.0224     1.0207
β10     1.0025     1.0125     1.0000     1.1014     .9920      1.0060

Table 15 - Laplace6 Relative Efficiency Under Models with D00 = .25

        Model 2 (π̄(0)=.52)    Model 4 (π̄(0)=.2)     Model 6 (π̄(0)=.1)
        Gauss-10   PQL2       Gauss-10   PQL2       Gauss-10   PQL2
D00     1.0000     1.0208     1.0000     1.0169     1.0085     1.2137
β00     1.0000     1.0090     1.0000     1.0000     1.0081     1.0000
β01     1.0000     1.0063     1.0000     1.0069     .9730      1.0270
β10     1.0000     1.0026     1.0000     1.0020     1.0065     1.0000

Tables 14 and 15 give the efficiencies of Laplace6 relative to Gauss-10 and PQL2. The relative efficiency of Laplace6 to Gauss-10 for \(D_{00}\), say, is the ratio of Gauss-10's mean squared error for \(D_{00}\) to that of Laplace6. Therefore, Laplace6 has the higher efficiency for \(D_{00}\) if the ratio is larger than one, and vice versa. From the two tables, the efficiencies of Laplace6 relative to Gauss-10 were mostly larger than or equal to 1. The relative efficiencies of Laplace6 were slightly higher for the larger \(D_{00}\); for the smaller \(D_{00}\), the relative efficiencies were mostly 1. The only exception was that of \(\beta_{01}\) under Model 6, where Gauss-10 had the higher efficiency. The effect of a larger variance and a smaller average conditional expectation on the loss of efficiency relative to Laplace6 was even more apparent for PQL2. Laplace6 was more efficient than PQL2 for the fixed effects and the variance. Gauss-10 was also more efficient than PQL2.

In conclusion, the three methods performed reasonably well under the six models. However, the values of \(D_{00}\) and \(\bar\pi^{(0)}\) did have an impact on how precisely and accurately the three methods estimated the parameters. With a smaller \(D_{00}\), the variation (standard deviation) of the \(\beta_{10}\) estimates by all three methods was larger than what they predicted (the average of the standard errors). On the other hand, when the value of \(\bar\pi^{(0)}\) became smaller (.1 and .2), the three methods tended to have negative biases for \(D_{00}\). Moreover, with small values of \(\bar\pi^{(0)}\) (.1, .2) and a large \(D_{00}\), the variation of \(\hat D_{00}\) was underestimated by all three methods.

The biases of all three programs were generally very small in these univariate models. The three programs almost always agreed in the direction of the biases. PQL2 always gave the largest absolute values of the estimates; because of the largest variation in its estimates, its mean squared errors were usually the largest among the three methods as well, although the difference was usually small. The Laplace6 estimates were very close to those of Gauss-10 in terms of averages and mean squared errors.
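Since the efficiency comparisons recur through the rest of this chapter, it is worth writing out the measure used in Tables 14, 15, and 18. For a parameter \(\theta\) estimated over \(R\) replicated data sets, the relative efficiency of Laplace6 to a competing method \(M\) is

\[
\mathrm{RE}_{M}(\hat\theta) \;=\; \frac{\mathrm{MSE}_{M}(\hat\theta)}{\mathrm{MSE}_{\mathrm{Laplace6}}(\hat\theta)}, \qquad
\mathrm{MSE}(\hat\theta) = \frac{1}{R}\sum_{r=1}^{R}\big(\hat\theta_r - \theta\big)^{2},
\]

so values above 1 indicate that Laplace6 is the more efficient estimator across the replications.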
Even in the discrepancies in the prediction of variation, the Laplace6 results were very similar to those of Gauss-10, although the discrepancies seemed smaller for Gauss-10 in more cases. The largest disagreements between Laplace6 and Gauss-10, in terms of both the averages of the standard errors and the standard deviations of the estimates, were in the third decimal place, while the largest disagreement between either Laplace6 or Gauss-10 and PQL2 was in the second decimal place. However, Laplace6 and PQL2 gave estimates to the fourth decimal place, whereas Gauss gave variance and covariance estimates only to the third decimal place, computed from the values of the Cholesky decomposition rounded to the second decimal place. Therefore, because of rounding error, the mean squared errors of the variance and covariance estimates from Gauss might look larger than they really were, and the comparison of variance and covariance estimates might not be exact.

Bivariate Data Sets

Results of Model 7

Model 7 contained 100 data sets with parameters \(\beta_{00} = -1.2\), \(\beta_{01} = \beta_{10} = 1\), \(\bar\pi^{(0)} = .143\), and

\[
D_{00} = 1.625,\quad D_{01} = .1,\quad D_{11} = .25,\qquad
\begin{pmatrix} b_{0j}\\ b_{1j}\end{pmatrix} \sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix},\ \begin{pmatrix}1.625 & .1\\ .1 & .25\end{pmatrix}\right).
\]

Each data set contained 20 observations in the first level and 200 clusters in the second level. Laplace6 was computed on an old UNIX machine, using the converged parameter estimates from PQL as starting values. The 100 data sets took Laplace6 altogether 2 hours to analyze. PQL2 was computed on the same UNIX machine, using a similar amount of time for the 100 data sets. The second-order Taylor expansion of the conditional likelihood was set to start at the second iteration, while the starting values were being computed based on PQL. In our experience, starting the Taylor expansion earlier costs only a few more iterations but has no effect on the converged values; however, if the starting values happened to converge quickly and the Taylor expansion was set to begin after that, the resulting estimates would be just PQL. All three programs (Laplace6, PQL, and PQL2) were run under UNIX. The Gauss-Hermite Quadrature method (using the MIXOR package) was run on a Pentium 233.

Out of curiosity about whether more quadrature points improve the accuracy of the estimation, and also intrigued by the pattern in the analysis of the Thailand data, where the results of Laplace6 seemed more similar to Gauss with more points than with fewer points, both Gauss-10 (Gauss-Hermite Quadrature using 10 points) and Gauss-20 (using 20 quadrature points) were used in analyzing the 100 data sets. The estimates from the four programs were compared in terms of their unbiasedness and mean squared errors. Gauss-20 used about 20 hours altogether for the 100 data sets, and Gauss-10 used 5 hours. Although it was impossible to compare directly the time used by the three UNIX methods (PQL, PQL2, and Laplace6) with that of Gauss, 6 data sets were randomly selected and analyzed on the same Pentium 233. The time used by PQL plus Laplace6 for these 6 data sets ranged from 7 seconds to 20 seconds; for the same 6 data sets, Gauss-20 needed around 12 minutes per data set, and Gauss-10 about 3 to 5 minutes. Thus it was very clear that Laplace6 was significantly more efficient in terms of computational time.

The Gauss program produces the variances and covariances from the Cholesky decomposition and gives standard errors for the decomposed terms only. It was impossible to find the standard errors of the variances and covariance from what is available.
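The Cholesky point is easy to make concrete. In the sketch below (with hypothetical factor values, not output from any of the programs discussed), the reported lower-triangular factor \(L\) determines \(D = LL^{T}\); standard errors reported for the entries of \(L\) do not carry over to the entries of \(D\) without a delta-method step, which is why the variance and covariance standard errors were unavailable.

```python
import numpy as np

# Hypothetical Cholesky factor of a 2x2 random-effects covariance matrix,
# rounded to two decimals the way the quadrature program reported it.
L = np.array([[1.27, 0.00],
              [0.08, 0.49]])

D = L @ L.T          # variances/covariance implied by the factor
print(D)             # D[0,0] ~ 1.61, D[0,1] ~ 0.10, D[1,1] ~ 0.25

# A standard error for, e.g., D[0,0] = L[0,0]**2 would require the
# delta method: var(D00) ~ (2*L00)**2 * var(L00), and the off-diagonal
# entries additionally need the covariances among the L entries,
# not just their individual standard errors.
```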
Table 16 - Averages of Estimates Under Model 7

               Laplace6   Gauss-10   Gauss-20   PQL       PQL2
D00 = 1.625    1.6352     1.6532     1.6546     1.2752    1.7296
D01 = .1       .0960      .1003      .0995      .0538     .0864
D11 = .25      .2667      .2575      .2562      .1614     .2927
β00 = -1.2     -1.2007    -1.1977    -1.2045    -1.0904   -1.21
β01 = 1        1.0029     1.0081     1.0148     .9004     1.0231
β10 = 1        .9975      .9971      .9976      .9114     1.0050

The estimates from all the programs were fairly normally distributed. PQL again consistently had negative bias for all the parameters. Its bias for the variance components ranged from 22% (\(D_{00}\)) to 46% (\(D_{01}\)) of the parameters, while those for the \(\beta\)'s were around 9%. PQL2 had the second highest bias, either positive or negative. It had positive bias for \(D_{00}\), by about 6.5%, and for \(D_{11}\), by 17%, but negative bias for \(D_{01}\), by 14%. Even though the positive biases of its \(\beta\) estimates were only around 1%, they were still larger than those of the two Gauss runs and Laplace6. The advantage of Gauss-20 over Gauss-10 was not very clear from the table, since the averages of the two were very similar. The Laplace6 results were again close to those of the Gauss runs. Laplace6 had more bias, both negative and positive, than the Gauss runs for the smaller variance components, but had a smaller positive bias than they did for the large variance. For the \(\beta\)'s, the three methods were pretty much unbiased.

Table 17 - Mean Squared Errors of Estimates Under Model 7

               Laplace6   Gauss-10   Gauss-20   PQL       PQL2
D00 = 1.625    .0563      .0737      .0633      .1522     .0838
D01 = .1       .0108      .0115      .0120      .0080     .0143
D11 = .25      .0075      .0073      .0072      .0113     .0115
β00 = -1.2     .0190      .0231      .0196      .0271     .0203
β01 = 1        .0164      .0193      .0175      .0236     .0178
β10 = 1        .0051      .0051      .0053      .0116     .0055

The mean squared errors in Table 17 tell a different story. Laplace6 produced the smallest, or among the smallest, mean squared errors among the five sets of estimates. Considering that the estimates of \(D_{01}\) and \(D_{11}\) by Laplace6 had relatively large amounts of bias, yet their mean squared errors were either smaller than or only trivially larger than those of Gauss-20, the variation of the variance-component estimates by Laplace6 must have been much smaller than that of the Gauss runs. The comparison of Gauss-10 with Gauss-20 is clearer in terms of mean squared errors: Gauss-20 most of the time had much smaller mean squared errors than Gauss-10, and its values were closer to those of Laplace6 as well. The mean squared errors of PQL were notably larger. Its mean squared error for \(D_{01}\) was nonetheless the smallest of the five, even though its underestimation of \(D_{01}\), from Table 16, was 46%. This again indicates that PQL performs well for small values of the random effect (co)variances. On the other hand, considering that \(D_{01} = 0.1\), and that the mean squared errors for \(D_{01}\) of the other methods were all larger than their respective mean squared errors for \(D_{11} = 0.25\), it seems that all the methods other than PQL had a difficult time giving a reasonable estimate of \(D_{01}\). PQL2 produced the second largest mean squared errors for the variances and covariance of the random effects. However, its mean squared errors for the three \(\beta\)'s were only a little larger than those of Gauss-20. PQL2 seemed to perform better for the fixed effects than for the variance components.
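The reasoning in this section leans on the standard decomposition of mean squared error into variance and squared bias, which is worth writing out, since it explains how Laplace6 can be more biased than Gauss-20 for \(D_{01}\) and \(D_{11}\) and still match or beat its MSE:

\[
\mathrm{MSE}(\hat\theta) \;=\; E\big[(\hat\theta-\theta)^{2}\big] \;=\; \operatorname{Var}(\hat\theta) + \big[\operatorname{Bias}(\hat\theta)\big]^{2}.
\]

A method with slightly larger bias but distinctly smaller sampling variance, the pattern Table 19 documents for Laplace6, can therefore come out ahead on MSE.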
Table 18 - Laplace6 Relative Efficiency Under Model 7

               Gauss-10   Gauss-20   PQL       PQL2
D00 = 1.625    1.3091     1.1243     2.7034    1.4885
D01 = .1       1.0648     1.1111     .7407     1.3241
D11 = .25      .9733      .9600      1.5066    1.5333
β00 = -1.2     1.2158     1.0316     1.4263    1.0684
β01 = 1        1.1768     1.0671     1.4390    1.0854
β10 = 1        1.0000     1.0392     2.2745    1.0784

The efficiencies of Laplace6 relative to all the other programs in Table 18 give a clear picture of the comparison of the mean squared errors. Laplace6 was more efficient than all the other methods in general, except for \(D_{11}\), for which Laplace6 had a positive bias of 6% of its value (Table 16). However, even though Laplace6 also had a negative bias of 4% for \(D_{01}\), it was still more efficient there than Gauss-10 and Gauss-20, whose estimates were almost unbiased. The extra ten quadrature points in Gauss were a mixed blessing: Gauss-20 was not as inefficient as Gauss-10 where the latter fell far behind Laplace6, but at times it was a little less efficient than Gauss-10 where the latter was only slightly less efficient than Laplace6.

In Table 19, the standard deviations of the estimates by PQL were again the smallest among the five methods. PQL2 had the largest standard deviations of the estimates of \(D_{01}\), \(D_{11}\) and \(\beta_{10}\). Gauss-10 had the largest standard deviations of the estimates of \(\beta_{00}\), \(\beta_{01}\) and \(D_{00}\). The variation of the Laplace6 estimates was the smallest among the four programs, leaving PQL aside. The small amount of variation in the estimates, in addition to the small biases, contributed to the significantly smaller mean squared errors and higher efficiency. The variation of the estimates by Gauss-20 was not necessarily smaller than that of Gauss-10, although its variation for \(D_{00}\) was indeed much smaller.

Table 19 - Standard Deviations of the Estimates

               Laplace6   Gauss-10   Gauss-20   PQL       PQL2
D00 = 1.625    .2383      .2714      .2510      .1737     .2713
D01 = .1       .1045      .1076      .1102      .0768     .1194
D11 = .25      .0853      .0857      .0853      .0593     .0988
β00 = -1.2     .1387      .1529      .1403      .1233     .1421
β01 = 1        .1288      .1393      .1320      .1171     .1320
β10 = 1        .0716      .0721      .0730      .0615     .0744

Table 20 - Averages of Standard Errors

               Laplace6   Gauss-10   Gauss-20   PQL       PQL2
D00 = 1.625    .2684      .2563      .2688      .1831     .2420
D01 = .1       .1156      NA         NA         .0786     .1046
D11 = .25      .0956      NA         NA         .0662     .0894
β00 = -1.2     .1290      .1198      .1293      .1110     .1273
β01 = 1        .1175      .1105      .1170      .0999     .1162
β10 = 1        .0755      .0744      .0747      .0602     .0701

Comparing Tables 19 and 20, the averages of the standard errors produced by the methods can be set against their respective empirical standard deviations. PQL seemed to have the smallest discrepancies between the two tables for all parameters. PQL2 had the largest discrepancies; the variation of all of its estimates was under-predicted by the standard errors it reported. Laplace6 over-predicted the variation of the variance components and of \(\beta_{10}\), and under-predicted that of the other two \(\beta\)'s. Gauss-10 and Gauss-20 had the same pattern as Laplace6 for the fixed effects, but Gauss-10 under-predicted the variation of \(\hat D_{00}\) while Gauss-20 erred in the opposite direction. Apart from PQL, Gauss-20 seemed to have the smallest discrepancy between the theoretical standard errors of its estimates and their empirical standard deviations.

In summary, Laplace6 produced estimates that were very close to those of Gauss-20, both in averages and in mean squared errors. Their biases were reasonably small. Overall, Laplace6 had the smallest mean squared errors and the highest efficiency, thanks to its having the least variability across data sets.
The discrepancy between the theoretical variation of the estimates and their empirical variation was smallest for PQL; Gauss-20 had the second smallest discrepancy. The advantage of 20 quadrature points over 10 was also clear for the parameters where Gauss-10 had substantially larger mean squared errors than Laplace6: for those parameters, the mean squared errors of Gauss-20 were much smaller. However, for parameters where Gauss-10 had only slightly larger, or even smaller, mean squared errors than Laplace6, Gauss-20 might do slightly worse than Gauss-10. This may suggest that the advantage of using a larger number of points for Gauss appears only where accurate estimation is difficult with a smaller number of points. With real data sets, however, it is impossible to judge the accuracy of the estimates.

The biases (as percentages of the parameters) of PQL2 under the current model were larger than under the univariate models. The efficiencies of all its estimates relative to Laplace6 also decreased considerably in this bivariate model. As to the prediction of the variation of the estimates, consistent with the univariate models, PQL2 gave smaller estimates of the variation than the empirical results.

Results of Model 8

Model 8 contained 400 data sets with parameters

\[
\beta_{00} = -.508403,\quad \beta_{01} = \beta_{10} = 1,\quad D_{00} = 2,\quad D_{01} = .2,\quad D_{11} = .75,\qquad
\begin{pmatrix} b_{0j}\\ b_{1j}\end{pmatrix} \sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix},\ \begin{pmatrix}2 & .2\\ .2 & .75\end{pmatrix}\right).
\]

Each of the data sets had 20 observations for each cluster in the first level. In the second level, 200 of the 400 data sets had 200 clusters each, while the other 200 data sets had 2,000 clusters each. The interest was to check whether the Laplace6 estimates were consistent as the number of clusters increased by a factor of 10. That is, if the Laplace6 estimates from the latter 200 data sets have smaller biases and 10 times smaller variances than those from the former 200, the Laplace6 method is considered consistent. We would also be interested in the percentage of estimates falling beyond the 95% confidence interval around the true parameter (i.e., the true value ± 1.96 standard deviations), using the empirical standard deviations. This gives a sense of how well the estimation went. Besides, if the estimates are normally distributed, the theoretical probability of observations falling beyond the 95% interval around the average (i.e., the average ± 1.96 standard deviations) is .05. Therefore, the statistics produced for each set of estimates using the data sets with 200 clusters are contrasted with those using the data sets with 2,000 clusters in the second level. These include the averages, the biases, the variances, the empirical probabilities of observations falling outside the range of the true value ± 1.96 standard deviations, and the empirical probabilities of observations falling outside the range of the average ± 1.96 standard deviations.

In the following tables, "emp. prob. 1" is the empirical probability of observations falling outside ± 1.96 standard deviations around the true value, whereas "emp. prob. 2" is the empirical probability of observations falling outside ± 1.96 standard deviations around the average.

Table 21 - Contrasts Between Different Cluster Sizes for Variance Components

                    D00 = 2              D01 = .2             D11 = .75
cluster size     200      2000        200      2000        200      2000
average          1.9187   1.9375      .1733    .1766       .7416    .7343
bias             -.0813   -.0625      -.0267   -.0234      -.0084   -.0157
variance         .0715    .0070       .0165    .0016       .0199    .0017
S.D.             .2674    .0839       .1285    .0404       .1412    .0413
emp. prob. 1     .08      .09         .07      .08         .055     .065
emp. prob. 2     .06      .06         .05      .045        .055     .06
Table 22 - Contrasts Between Different Cluster Sizes for Fixed Effects

                    β00 = -.5084         β01 = 1              β10 = 1
cluster size     200      2000        200      2000        200      2000
average          -.5107   -.5108      .9844    .9742       .9861    .9909
bias             -.0023   -.0024      -.0156   -.0258      -.0139   -.0091
variance         .0163    .0022       .0123    .0012       .0061    .0007
S.D.             .1278    .0473       .1107    .0342       .0778    .0255
emp. prob. 1     .045     .06         .075     .11         .06      .065
emp. prob. 2     .04      .065        .06      .045        .05      .05

Tables 21 and 22 show that most of the estimates had fairly small negative biases, but the biases did not decrease for the data sets with the larger number of clusters. In percentage terms, \(D_{01} = .2\) had more than 10% negative bias for both cluster sizes. The similar amounts of variation for \(D_{01} = .2\) and \(D_{11} = .75\) signal that Laplace6 had difficulty estimating \(D_{01}\), the same phenomenon shown by the other methods (Gauss and PQL2) under the previous model. Histograms of the estimates, not presented here, showed that all of the estimates were fairly normally distributed. In effect, the empirical probability 2 (emp. prob. 2) also showed that the percentage of observations falling outside the 95% interval around the average was around 5 for all estimates. The variances of the estimates using 200 clusters were all approximately 10 times larger than their counterparts using 2,000 clusters. This indicates that, as the number of clusters increases, the variance of the Laplace6 estimates decreases in proportion to the number of clusters, so that the sampling distribution eventually concentrates at a single point. Nevertheless, that point would be slightly off the true parameter, because of the negative bias. The empirical probability 1 also showed that a little more than 5% of the estimates fell outside the 95% confidence interval around the true value. This is coherent with the finding of small negative biases: since the estimates had a negative bias, the sample mean was shifted a little to the left of the true value (treating the sampling distributions centered at the sample mean and at the true value as normal with the same standard deviation), so more estimates at the lower end than at the upper end of the empirical distribution would be rejected as plausible values from the distribution centered at the true value.

Chapter 6

DISCUSSION AND CONCLUSION

This dissertation uses Laplace's approximation method to solve the estimation problems encountered in multilevel logistic models. In the process, I first deduced the multivariate Taylor expansion, for use in extending the Laplace approximation to multivariate situations. Secondly, I derived the first six moments of a multivariate normal distribution through its moment generating function. Then, I found the analogy between univariate moments and multivariate moments in carrying out Laplace's method. Using the above findings, I obtained the marginal likelihood of the multilevel logistic regression model as a simplified, scalar function of matrices. In finding the maximum likelihood estimates of the fixed effects and the variance components of the random effects, I used implicit differentiation to take into account the dependence of the current estimate of the random effects on the parameters of interest. The result is the Laplace6 program in HLM (Bryk, Raudenbush, and Congdon, 1996), which uses the converged PQL estimates as starting values. Both univariate and bivariate random effects simulation studies and a real data analysis were carried out to evaluate Laplace6.
The estimates were compared to those of the approximate maximum likelihood method using Gauss-Hermite Quadrature (Gauss) (MIXOR, 1993, 1994, 1997), the method of second-order Taylor expansion around the conditional expectation (PQL2) (Goldstein and Rasbash, 1996), and PQL (Breslow and Clayton, 1993; Raudenbush, 1993).

The analysis of the Thailand data is an example of how the multilevel Bernoulli model can be used to understand how student background, school background and national programs, such as pre-primary education, interact and affect an educational outcome, here grade repetition. It was found that girls had a smaller risk of repeating a grade, that students with better nutrition (breakfast) had a lower risk, and that higher family socioeconomic status and richer school resources (SES and enrollment) helped reduce the risk. However, there were interactions between student SES and school SES, and between student nutrition (breakfast) and school resources (log-enrollment), respectively. They indicated that students with poorer family backgrounds took more advantage of richer school resources than students with better family backgrounds in reducing the risk of repeating a grade. Moreover, with all the background factors controlled, students who had received pre-primary education still had a significantly lower risk of repeating grades, and the effect of pre-primary education did not vary across schools. Therefore, pre-primary education did have a positive effect in preparing children for primary school.

By using four different numbers of quadrature points in analyzing the Thailand data, the Laplace6 results were found to be more similar to those based on the larger numbers of points, i.e., 40 or 30 points, than were the results of any of the other methods. Thus, Laplace6 seems to be a quite accurate approximation to the marginal Bernoulli likelihood with a normal prior.

The extensive univariate simulation study indicated that all the programs except PQL perform reasonably well, with small biases at times, although PQL2 tended to have slightly larger mean squared errors than Gauss-10 and Laplace6. Its efficiencies for the variance estimates especially trailed behind those of Laplace6. Laplace6 had the highest efficiency relative to all the methods for most of the parameters in all six univariate models, although the difference between the results of Laplace6 and those of Gauss-10 was small. In the bivariate model with 100 data sets, however, the similarity in mean squared errors between Gauss-10 and Laplace6 disappeared; Laplace6 performed even better than in the univariate cases. While Gauss-10, Gauss-20 and Laplace6 were all approximately unbiased, Laplace6 had the highest efficiencies over all the methods for most of the parameters. Its performance was even better than that of Gauss-20, which in turn was better than that of Gauss-10. PQL and PQL2 both had much smaller efficiencies than Laplace6.

The eighth model showed that the Laplace6 estimates were normally distributed and had a small amount of negative bias, while the variance of the estimates did go down in proportion to the second-level sample size. Thus, the approximate maximum likelihood estimates produced by Laplace6 are indeed approximately consistent.

The analysis of the Thailand data raises the question of how different the programs are and how much more useful Laplace6 or Gauss-Hermite Quadrature is in practice.
The suggestion is that, for univariate random effect models, PQL2 may do as well as Gauss-Hermite Quadrature and Laplace6; for multivariate random effects models, however, PQL2 may not give results as good as the latter two programs. PQL, for its part, has serious negative bias when the variances of the random effects are large. Although under the current model for the Thailand data all the programs happened to find the same set of predictors significant at the .05 level, it remains likely that for other models PQL or PQL2 would reach very different, and wrong, conclusions, originating from the negative bias in their parameter estimates. Therefore, it is of both theoretical interest and practical usefulness to have programs such as Gauss-Hermite Quadrature or Laplace6.

The advantage of Gaussian Quadrature lies in its flexibility: the estimates can be made as accurate as the user wishes simply by specifying a larger number of quadrature points. The Laplace approximation can be made as accurate as one wishes too, but only by the programmer, not the user. Computation time, on the other hand, is a big advantage of Laplace6 over Gauss-Hermite Quadrature. As the experience with several random samples of the 100 bivariate data sets shows, Laplace6 was much faster than Gauss-Hermite Quadrature with 10 quadrature points, which was in turn much faster than Gauss-Hermite Quadrature with 20 points. Yet, given the exploration with the different data sets here, 10 quadrature points could barely produce estimates as accurate as those of Laplace6, so the time Gauss-Hermite Quadrature needs to produce sufficiently accurate results will be much longer than that of Laplace6. Therefore, for educational research that is interested in dichotomous responses, such as grade repetition, high school dropout, or college admission, and that often involves longitudinal and/or nested designs, Laplace6 is an accurate tool that converges quickly. Although Laplace6 is currently available only for 2-level models, it should be straightforward to extend it to 3-level models.

On the other hand, the contribution of Laplace6 to the field of applied statistics/mathematics lies in the foundation upon which it is built, i.e., the multivariate Taylor series and the parallelism between the moments of the univariate normal distribution and those of the multivariate normal distribution in applying the extended Laplace approximation. I expect that the method can be applied to give quite accurate solutions to problems involving integrals that must be evaluated numerically.

APPENDICES

PRE-APPENDIX

Formulae and Lemmas for Appendices A to D

\[
\mathrm{d}\operatorname{vec}(ABC) = (C^{T}\otimes A)\,\mathrm{d}\operatorname{vec}B. \tag{PA-1}
\]

Lemma 1. For a matrix \(F\) and a scalar function \(a\) of \(t\), where \(F\) does not involve \(t\),

\[
\frac{\partial\operatorname{vec}(aF)}{\partial t^{T}} = \operatorname{vec}(F)\otimes\frac{\partial a}{\partial t^{T}}. \tag{PA-2}
\]

Proof. Write \(F = (f_{uv})\), so that \(aF = (af_{uv})\). Following Magnus, we vectorize the matrix before taking its derivative with respect to a row vector. Each element of \(\operatorname{vec}(aF)\) is \(af_{uv}\), and, since \(F\) is not a function of \(t\), each \(f_{uv}\) acts as a scalar with respect to the row vector \(\partial/\partial t^{T}\); stacking the derivatives \(f_{uv}\,\partial a/\partial t^{T}\) gives \(\operatorname{vec}(F)\otimes\partial a/\partial t^{T}\).

For any two column vectors \(a\) and \(b\),

\[
a\otimes b^{T} = ab^{T}. \tag{PA-3}
\]

For \(F(X,Y) = X\otimes Y\), with \(X\) an \(n\times q\) matrix and \(Y\) a \(p\times r\) matrix,

\[
\mathrm{d}\operatorname{vec}F(X,Y) = \operatorname{vec}(\mathrm{d}X\otimes Y) + \operatorname{vec}(X\otimes\mathrm{d}Y),
\]

where

\[
\operatorname{vec}(\mathrm{d}X\otimes Y) = (I_{q}\otimes K_{rn}\otimes I_{p})(I_{nq}\otimes\operatorname{vec}Y)\,\mathrm{d}\operatorname{vec}X, \tag{PA-4}
\]
\[
\operatorname{vec}(X\otimes\mathrm{d}Y) = (I_{q}\otimes K_{rn}\otimes I_{p})(\operatorname{vec}X\otimes I_{rp})\,\mathrm{d}\operatorname{vec}Y, \tag{PA-5}
\]

with \(I_{n}\) an \(n\times n\) identity matrix and \(K_{rn}\) an \(rn\times rn\) commutation matrix.
(Magnus and Neudecker, 1988, p. 188)

Suppose \(a\) is a \(p\times 1\) vector and \(b\) a \(q\times 1\) vector. Then \(a\otimes b^{T}\) is a \(p\times q\) matrix, and so is \(b^{T}\otimes a\); thus

\[
a\otimes b^{T} = b^{T}\otimes a. \tag{PA-6}
\]

Suppose \(s = a^{T}b\), with \(a\) and \(b\) both \(n\times 1\) vectors. Then

\[
s = \operatorname{tr}(b^{T}a) = \operatorname{tr}(ba^{T}). \tag{PA-7}
\]

For same-order matrices \(A\) and \(B\),

\[
\operatorname{tr}(A^{T}B) = (\operatorname{vec}A)^{T}(\operatorname{vec}B). \tag{PA-8}
\]

For two vectors \(a\) and \(b\), \(ab^{T} = a\otimes b^{T}\). (PA-9)

For two vectors \(a\) and \(b\), \(\operatorname{vec}(ab^{T}) = b\otimes a\). (PA-10)

\(A\otimes B\otimes C = (A\otimes B)\otimes C = A\otimes(B\otimes C)\). (PA-11)

Lemma 2. For a \(p\times q\) matrix \(A\) and an \(s\times 1\) column vector \(c\),

\[
\operatorname{vec}(A\otimes c) = \operatorname{vec}(A)\otimes c. \tag{PA-12, PA-13}
\]

If \(A\) and \(B\) are square matrices, \(\operatorname{tr}(A\otimes B) = \operatorname{tr}(A)\operatorname{tr}(B)\). (PA-14)

If \(AB\) and \(CD\) exist, then \((A\otimes B)(C\otimes D) = AC\otimes BD\). (PA-15)

If \(AB\) is a square matrix, then \(\operatorname{tr}(AB) = \operatorname{tr}(BA)\); and \(\operatorname{vec}(ABC) = (C^{T}\otimes A)\operatorname{vec}B\). (PA-16)

APPENDIX A

Multivariate Taylor Series Expansion

This appendix will derive the following theorem.

Theorem. The \(m\)-th order term in the \(n\)-variable Taylor series is

\[
\frac{1}{m!}\Big(\bigotimes^{m-1} b^{T}\Big)\, f^{(m)}\, b,
\]

where \(b\) is an \(n\times 1\) vector; \(f^{(m)}\) is the \(m\)-th derivative of the function \(f\), a function of \(b\), obtained by first vectorizing the \((m-1)\)-th derivative and then differentiating with respect to \(b^{T}\); and \(\bigotimes^{m-1} b^{T}\) is the Kronecker product of \(b^{T}\) repeated \(m-1\) times.

Proof. Fulks (1978, p. 331) gives the multivariate Taylor series theorem: let \(f\) and all its partial derivatives up through order \(n\) be continuous in a neighborhood \(N\) of \(Q_{0}\); then for \(P\) in \(N\), \(f(P) = f(Q_{0}) + \cdots\)

APPENDIX B

The Six Moments of the Multivariate Normal Distribution

With \(m = m(t)\) the moment generating function, the concluding terms of the fifth derivative are

\[
m(I_{q}\otimes(\Sigma t)\otimes I_{q^{2}})(K_{qq}\otimes I_{q})(\Sigma\otimes\operatorname{vec}\Sigma)
+ m(t^{T}\Sigma\otimes\Sigma\otimes I_{q^{2}})\{\cdots\}\otimes(t^{T}\Sigma)
+ m(I_{q}\otimes\Sigma t\otimes I_{q^{2}})[\operatorname{vec}\Sigma\otimes\Sigma]
+ m[\Sigma\otimes(t^{T}\Sigma)\otimes I_{q^{2}}](K_{q^{2},q}\otimes I_{q^{2}})(\Sigma\otimes\operatorname{vec}I_{q^{2}}).
\]

The last term is

\[
\frac{\partial}{\partial t^{T}}\operatorname{vec}\{m[\Sigma\otimes\operatorname{vec}(\Sigma t\,t^{T}\Sigma)]\}
= m\{\operatorname{vec}[\Sigma\otimes\operatorname{vec}(\Sigma t\,t^{T}\Sigma)]\otimes t^{T}\Sigma\}
+ m(\operatorname{vec}\Sigma\otimes I_{q^{2}})\{(\Sigma\otimes\Sigma t)+(\Sigma t\otimes\Sigma)\},
\]

where the first piece is obtained by taking the derivative of \(m\) and the second by the derivative of the vec function. By setting \(t\) of the fifth derivative to 0, the fifth moment of a multivariate normal distribution is 0.

Sixth Derivative

Theorem 6. The sixth derivative is

\[
\begin{aligned}
\frac{\partial\operatorname{vec}f^{(5)}}{\partial t^{T}} ={}& m\operatorname{vec}[\operatorname{vec}(\Sigma\otimes\operatorname{vec}\Sigma)\otimes t^{T}\Sigma]\otimes(t^{T}\Sigma) + m(\Sigma\otimes\operatorname{vec}(\Sigma\otimes\operatorname{vec}\Sigma))\\
&+ m\operatorname{vec}\{\operatorname{vec}[\operatorname{vec}(\operatorname{vec}\Sigma\,t^{T}\Sigma)\,t^{T}\Sigma]\otimes(t^{T}\Sigma)\}\otimes(t^{T}\Sigma) + m(\Sigma\otimes\operatorname{vec}\{\operatorname{vec}[\operatorname{vec}\Sigma\,t^{T}\Sigma]\,t^{T}\Sigma\})\\
&+ m(\Sigma t\otimes I_{q^{4}})\{(\Sigma\otimes\operatorname{vec}[\operatorname{vec}\Sigma\,t^{T}\Sigma]) + ((\Sigma t)\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}\Sigma)\}\\
&+ m\operatorname{vec}(\Sigma\otimes\operatorname{vec}[\operatorname{vec}\Sigma\,t^{T}\Sigma])\otimes(t^{T}\Sigma) + m(\operatorname{vec}\Sigma\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}\Sigma)\\
&+ m\operatorname{vec}[(\Sigma t\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}\Sigma)]\otimes(t^{T}\Sigma) + m[\Sigma\otimes(\operatorname{vec}\Sigma)^{T}\otimes I_{q^{4}}](K_{q^{3},q}\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}I_{q^{3}})\\
&+ \cdots + m(I_{q}\otimes(\operatorname{vec}\Sigma\otimes I_{q^{3}}))[(\operatorname{vec}\Sigma\otimes\Sigma) + (K_{qq}\otimes I_{q})(\Sigma\otimes\operatorname{vec}\Sigma)],
\end{aligned}
\]

where the elided terms are the analogous products generated, in the same fashion, by the remaining terms of the fifth derivative.
Proof. Each term of the fifth derivative is differentiated in turn with respect to \(t^{T}\), using the product rule together with Equations PA-2, PA-3, PA-6 through PA-12, PA-15 and PA-16. For the first term,

\[
\frac{\partial}{\partial t^{T}}\operatorname{vec}\{m[\operatorname{vec}(\Sigma\otimes\operatorname{vec}\Sigma)\otimes t^{T}\Sigma]\}
= m\operatorname{vec}[\operatorname{vec}(\Sigma\otimes\operatorname{vec}\Sigma)\otimes t^{T}\Sigma]\otimes(t^{T}\Sigma) + m(\Sigma\otimes\operatorname{vec}(\Sigma\otimes\operatorname{vec}\Sigma)),
\]

the first piece coming from the derivative of \(m\) and the second from the derivative of the vec function. Similarly,

\[
\frac{\partial}{\partial t^{T}}\operatorname{vec}\{m\operatorname{vec}[\operatorname{vec}(\operatorname{vec}\Sigma\,t^{T}\Sigma)\,t^{T}\Sigma]\otimes(t^{T}\Sigma)\}
= m\operatorname{vec}\{\operatorname{vec}[\operatorname{vec}(\operatorname{vec}\Sigma\,t^{T}\Sigma)\,t^{T}\Sigma]\otimes(t^{T}\Sigma)\}\otimes(t^{T}\Sigma)
+ m(\Sigma\otimes\operatorname{vec}\{\operatorname{vec}[\operatorname{vec}\Sigma\,t^{T}\Sigma]\,t^{T}\Sigma\})
+ m(\Sigma t\otimes I_{q^{4}})\{(\Sigma\otimes\operatorname{vec}[\operatorname{vec}\Sigma\,t^{T}\Sigma]) + ((\Sigma t)\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}\Sigma)\},
\]

using

\[
\frac{\partial}{\partial t^{T}}\operatorname{vec}[\operatorname{vec}\Sigma\,t^{T}\Sigma] = (\Sigma\otimes\operatorname{vec}\Sigma).
\]

Likewise,

\[
\frac{\partial}{\partial t^{T}}\operatorname{vec}\{m(\Sigma\otimes\operatorname{vec}(\operatorname{vec}\Sigma\,t^{T}\Sigma))\}
= m\operatorname{vec}(\Sigma\otimes\operatorname{vec}(\operatorname{vec}\Sigma\,t^{T}\Sigma))\otimes(t^{T}\Sigma) + m(\operatorname{vec}\Sigma\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}\Sigma),
\]

\[
\frac{\partial}{\partial t^{T}}\operatorname{vec}\{m(\Sigma t\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}\Sigma)\}
= m\operatorname{vec}[(\Sigma t\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}\Sigma)]\otimes(t^{T}\Sigma)
+ m[\Sigma\otimes(\operatorname{vec}\Sigma)^{T}\otimes I_{q^{4}}](K_{q^{3},q}\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}I_{q^{3}}),
\]

the latter using \(\partial\operatorname{vec}(\Sigma t\otimes I_{q^{3}})/\partial t^{T} = (K_{q^{3},q}\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}I_{q^{3}})\), and

\[
\frac{\partial}{\partial t^{T}}\operatorname{vec}\{m(\Sigma t\otimes I_{q^{3}})(\operatorname{vec}\Sigma\otimes\Sigma)\}
= m\operatorname{vec}[(\Sigma t\otimes I_{q^{3}})(\operatorname{vec}\Sigma\otimes\Sigma)]\otimes(t^{T}\Sigma)
+ m[(\operatorname{vec}\Sigma)^{T}\otimes\Sigma\otimes I_{q^{4}}](K_{q^{3},q}\otimes I_{q^{3}})(\Sigma\otimes\operatorname{vec}I_{q^{3}}).
\]

The remaining terms of the fifth derivative, including those involving the commutation matrix through

\[
\frac{\partial}{\partial t^{T}}\operatorname{vec}(\Sigma t\otimes\Sigma) = (K_{qq}\otimes I_{q})(\Sigma\otimes\operatorname{vec}\Sigma)
\qquad\text{and}\qquad
\frac{\partial}{\partial t^{T}}\operatorname{vec}(\Sigma\otimes\Sigma t) = (I_{q}\otimes K_{qq}\otimes I_{q})(\operatorname{vec}\Sigma\otimes I_{q})\Sigma = (\operatorname{vec}\Sigma\otimes\Sigma),
\]

are differentiated in exactly the same manner; collecting all of the resulting pieces yields the statement of Theorem 6.
Finally,

$$\frac{\partial}{\partial t}\mathrm{vec}\{m\,\mathrm{vec}[\Sigma\otimes\mathrm{vec}(\Sigma tt^{T}\Sigma)]\otimes t^{T}\Sigma\}=m\,\mathrm{vec}\{\mathrm{vec}[\Sigma\otimes\mathrm{vec}(\Sigma tt^{T}\Sigma)]\otimes(t^{T}\Sigma)\}\otimes(t^{T}\Sigma)+m\{\Sigma\otimes\mathrm{vec}[\Sigma\otimes\mathrm{vec}(\Sigma tt^{T}\Sigma)]\}+m(\Sigma t\otimes I_{q^{4}})(\mathrm{vec}\,\Sigma\otimes I_{q^{2}})[(\Sigma\otimes\Sigma t)+(\Sigma t\otimes\Sigma)]$$

and

$$\frac{\partial}{\partial t}\mathrm{vec}\{m(\mathrm{vec}\,\Sigma\otimes I_{q^{2}})[(\Sigma\otimes\Sigma t)+(\Sigma t\otimes\Sigma)]\}=m\,\mathrm{vec}\{(\mathrm{vec}\,\Sigma\otimes I_{q^{2}})[(\Sigma\otimes\Sigma t)+(\Sigma t\otimes\Sigma)]\}\otimes(t^{T}\Sigma)+m(I_{q}\otimes(\mathrm{vec}\,\Sigma\otimes I_{q^{2}}))[(\mathrm{vec}\,\Sigma\otimes\Sigma)+(K_{qq}\otimes I_{q})(\Sigma\otimes\mathrm{vec}\,\Sigma)].$$

The sixth moment of a multivariate normal distribution is obtained by setting $t$ of the sixth derivative to 0; every term still carrying a factor $\Sigma t$ or $t^{T}\Sigma$ vanishes, leaving

$$\begin{aligned} m_{6}=E[(b\otimes b\otimes b\otimes b\otimes b)b^{T}] =\;&\Sigma\otimes\mathrm{vec}(\Sigma\otimes\mathrm{vec}\,\Sigma)\\ &+(\mathrm{vec}\,\Sigma\otimes I_{q^{3}})(\Sigma\otimes\mathrm{vec}\,\Sigma)\\ &+[\Sigma\otimes(\mathrm{vec}\,\Sigma)^{T}\otimes I_{q^{4}}](K_{q^{3},q}\otimes I_{q^{3}})(\Sigma\otimes\mathrm{vec}\,I_{q^{3}})\\ &+(\mathrm{vec}\,\Sigma\otimes I_{q^{3}})(\mathrm{vec}\,\Sigma\otimes\Sigma)\\ &+((\mathrm{vec}\,\Sigma)^{T}\otimes\Sigma\otimes I_{q^{4}})(K_{q^{3},q}\otimes I_{q^{3}})(\Sigma\otimes\mathrm{vec}\,I_{q^{3}})\\ &+\Sigma\otimes\mathrm{vec}(\mathrm{vec}\,\Sigma\otimes\Sigma)\\ &+(\mathrm{vec}\,\Sigma\otimes I_{q^{3}})(K_{qq}\otimes I_{q})(\Sigma\otimes\mathrm{vec}\,\Sigma)\\ &+[(\Sigma\otimes(\mathrm{vec}\,\Sigma)^{T})(K_{qq}\otimes I_{q})\otimes I_{q^{4}}](K_{q^{3},q}\otimes I_{q^{3}})(\Sigma\otimes\mathrm{vec}\,I_{q^{3}})\\ &+\Sigma\otimes\mathrm{vec}\{(K_{qq}\otimes I_{q})(\Sigma\otimes\mathrm{vec}\,\Sigma)\}\\ &+[(\Sigma\otimes(\mathrm{vec}\,\Sigma)^{T})(K_{qq}\otimes I_{q})\otimes I_{q^{4}}](I_{q}\otimes K_{q^{2},q}\otimes I_{q^{3}})(\mathrm{vec}\,I_{q}\otimes I_{q^{5}})(K_{q^{2},q}\otimes I_{q^{2}})(\Sigma\otimes\mathrm{vec}\,I_{q^{2}})\\ &+[(\Sigma\otimes(\mathrm{vec}\,I_{q^{2}})^{T})(K_{q^{2},q^{2}}\otimes I_{q})\otimes I_{q^{4}}][\Sigma\otimes\mathrm{vec}(\Sigma\otimes I_{q^{3}})]\\ &+((\mathrm{vec}\,\Sigma)^{T}\otimes\Sigma\otimes I_{q^{4}})(I_{q}\otimes K_{q^{2},q}\otimes I_{q^{3}})(\mathrm{vec}\,I_{q}\otimes I_{q^{5}})(K_{q^{2},q}\otimes I_{q^{2}})(\Sigma\otimes\mathrm{vec}\,I_{q^{2}})\\ &+[(\Sigma\otimes(\mathrm{vec}\,I_{q^{2}})^{T})(K_{q^{2},q^{2}}\otimes I_{q})\otimes I_{q^{4}}](I_{q^{2}}\otimes K_{q^{3},q}\otimes I_{q^{3}})(I_{q^{3}}\otimes\mathrm{vec}\,I_{q^{3}})(\mathrm{vec}\,\Sigma\otimes\Sigma)\\ &+(I_{q}\otimes(\mathrm{vec}\,\Sigma\otimes I_{q^{2}}))[(\mathrm{vec}\,\Sigma\otimes\Sigma)+(K_{qq}\otimes I_{q})(\Sigma\otimes\mathrm{vec}\,\Sigma)], \end{aligned}$$

fifteen matrices in all (the final bracket contains two).

APPENDIX C

Proof of the Substitutions

This appendix proves the substitutions of the expectations of the fourth and sixth moments with simpler forms.

The Fourth Moment Substitution

We have $[(b_{j}-\tilde b_{j})^{T}\otimes(b_{j}-\tilde b_{j})^{T}\otimes(b_{j}-\tilde b_{j})^{T}]\tilde T^{(4)}(b_{j}-\tilde b_{j})$ in the approximate marginal likelihood, where the $q\times1$ vector $(b_{j}-\tilde b_{j})$ has a multivariate normal distribution $N(0,\Sigma)$, $\Sigma=(D^{-1}-\tilde T^{(2)})^{-1}$. For simplicity I will use $\Sigma$ in the derivation, use $b$ for $(b_{j}-\tilde b_{j})$, and ignore the subscript in $\tilde T^{(4)}$ here:

$$r=[b^{T}\otimes b^{T}\otimes b^{T}]\,\tilde T^{(4)}\,b.\qquad(1)$$

Theorem 1

$$E(r)=3[\mathrm{vec}(\Sigma\otimes\Sigma)]^{T}\mathrm{vec}(\tilde T^{(4)}).\qquad(2)$$

Proof

The dimension of the first part in $r$, $[b^{T}\otimes b^{T}\otimes b^{T}]\tilde T^{(4)}$, is $1\times q$. The second part, $b$, is $q\times1$. Since $r$ is a scalar, by using PA-7 and regarding the first part as $a^{T}$ and the second part as $b$, it becomes

$$r=\mathrm{tr}\{b[b^{T}\otimes b^{T}\otimes b^{T}]\tilde T^{(4)}\}.\qquad(3)$$

Separate the inside of the trace function into two parts, $r_{1}=b[b^{T}\otimes b^{T}\otimes b^{T}]$, a $q\times q^{3}$ matrix, and $r_{2}=\tilde T^{(4)}$, a $q^{3}\times q$ matrix. Using Equation PA-8, (3) becomes

$$r=(\mathrm{vec}((b\otimes b\otimes b)b^{T}))^{T}\mathrm{vec}(\tilde T^{(4)})=(\mathrm{vec}(r_{1}^{T}))^{T}\mathrm{vec}(r_{2}).\qquad(4)$$
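The step from (3) to (4) is the trace identity $\mathrm{tr}(r_{1}r_{2})=(\mathrm{vec}\,r_{1}^{T})^{T}\mathrm{vec}\,r_{2}$ (PA-8). A quick numerical confirmation, with an arbitrary random stand-in for the constant $\tilde T^{(4)}$ (an illustrative sketch only, not part of the proof):

```python
# Check tr(R1 R2) = (vec R1^T)^T vec R2, the step labeled PA-8.
import numpy as np

def vec(A):
    return A.reshape(-1, 1, order="F")

rng = np.random.default_rng(1)
q = 3
b = rng.standard_normal((q, 1))
T4 = rng.standard_normal((q**3, q))       # stand-in for the constant T~(4)

R1 = b @ np.kron(b, np.kron(b, b)).T      # b [b' kron b' kron b'], q x q^3
R2 = T4                                   # q^3 x q

lhs = np.trace(R1 @ R2)
rhs = (vec(R1.T).T @ vec(R2)).item()
assert np.isclose(lhs, rhs)

# and the scalar r itself equals both forms
r = (np.kron(b, np.kron(b, b)).T @ T4 @ b).item()
assert np.isclose(r, lhs)
```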
Because $\tilde T^{(4)}$ is a constant and $b$ has a multivariate normal distribution $N(0,\Sigma)$, the expectation of (4) becomes

$$E(r)=E\{(\mathrm{vec}[(b\otimes b\otimes b)b^{T}])^{T}\}\mathrm{vec}(\tilde T^{(4)})=E\{(\mathrm{vec}(r_{1}^{T}))^{T}\}\mathrm{vec}(r_{2}),\qquad(5)$$

where $E(r_{1}^{T})$ is the fourth moment of a normal distribution, $m_{4}$. Therefore, the expectation of (1) is

$$E(r)=[\mathrm{vec}(m_{4})]^{T}\mathrm{vec}(\tilde T^{(4)}),\qquad(6)$$

$$m_{4}=(\Sigma\otimes\mathrm{vec}\,\Sigma)+[\mathrm{vec}\,\Sigma\otimes\Sigma]+\{(K_{qq}\otimes I_{q})(\Sigma\otimes\mathrm{vec}\,\Sigma)\}=E_{1}+E_{2}+E_{3},\qquad(7)$$

where $K_{qq}$ is a commutation matrix that permutes rows of a matrix, and $I$ is an identity matrix. (See APPENDIX B: The Six Moments of a Multivariate Normal Distribution.)

Each matrix $E_{1}$, $E_{2}$, and $E_{3}$ in $m_{4}$ is a Kronecker product of $\mathrm{vec}\,\Sigma$ and $\Sigma$. By the definition of the Kronecker product, $A\otimes B=\begin{pmatrix}a_{11}B&a_{12}B&\cdots&a_{1n}B\\a_{21}B&a_{22}B&\cdots&a_{2n}B\\\vdots&&&\vdots\\a_{m1}B&a_{m2}B&\cdots&a_{mn}B\end{pmatrix}$, the sum of the elements inside $E_{1}$ is equal to that inside $E_{2}$ as well as that inside $E_{3}$, which, in turn, are all the same as that inside $\Sigma\otimes\Sigma$. Therefore $m_{4}$ has exactly the same sum of elements as $3(\Sigma\otimes\Sigma)$. Moreover, according to Anderson (1958), the fourth moment of a multivariate normal is

$$E(b_{k}b_{l}b_{m}b_{r})=\sigma_{kl}\sigma_{mr}+\sigma_{km}\sigma_{lr}+\sigma_{kr}\sigma_{lm},\qquad(8)$$

where $\sigma_{kl}$ is the covariance of $b_{k}$ and $b_{l}$; $k$, $l$, $m$, $r$ can be equal. Note that the power of $b_{k}b_{l}b_{m}b_{r}$ corresponds to that of any of $\sigma_{kl}\sigma_{mr}$, $\sigma_{km}\sigma_{lr}$, and $\sigma_{kr}\sigma_{lm}$, even though they are three different products of covariance terms. On the other hand, since (1) is the fourth-order term (without the factorial) of the fourth-order Taylor expansion, with $\tilde T^{(4)}$ the fourth derivative of the likelihood, when multiplied out it becomes

$$E(r)=E\Big[\sum_{k}b_{k}^{4}\sum_{j}g_{j}z_{jk}^{4}+4\sum_{k\neq l}b_{k}^{3}b_{l}\sum_{j}g_{j}z_{jk}^{3}z_{jl}+6\sum_{k<l}b_{k}^{2}b_{l}^{2}\sum_{j}g_{j}z_{jk}^{2}z_{jl}^{2}+12\sum_{k\neq l<m}b_{k}^{2}b_{l}b_{m}\sum_{j}g_{j}z_{jk}^{2}z_{jl}z_{jm}+\sum_{k\neq l\neq m\neq r}b_{k}b_{l}b_{m}b_{r}\sum_{j}g_{j}z_{jk}z_{jl}z_{jm}z_{jr}\Big]$$
$$=3\sum_{k}\sigma_{kk}^{2}\sum_{j}g_{j}z_{jk}^{4}+12\sum_{k\neq l}\sigma_{kk}\sigma_{kl}\sum_{j}g_{j}z_{jk}^{3}z_{jl}+6\sum_{k<l}(\sigma_{kk}\sigma_{ll}+2\sigma_{kl}^{2})\sum_{j}g_{j}z_{jk}^{2}z_{jl}^{2}+12\sum_{k\neq l<m}(\sigma_{kk}\sigma_{lm}+2\sigma_{kl}\sigma_{km})\sum_{j}g_{j}z_{jk}^{2}z_{jl}z_{jm}+\sum_{k\neq l\neq m\neq r}(\sigma_{kl}\sigma_{mr}+\sigma_{km}\sigma_{lr}+\sigma_{kr}\sigma_{lm})\sum_{j}g_{j}z_{jk}z_{jl}z_{jm}z_{jr}.\qquad(9)$$

Note that the power of $z_{jk}$ corresponds to the power of $b_{k}$, the power of $z_{jl}$ corresponds to the power of $b_{l}$, and so on. That is, $\sum_{j}g_{j}z_{jk}^{s_{1}}z_{jl}^{s_{2}}z_{jm}^{s_{3}}z_{jr}^{s_{4}}$, with $s_{1},s_{2},s_{3},s_{4}=0,\ldots,4$ and $s_{1}+s_{2}+s_{3}+s_{4}=4$, is multiplied by $b_{k}^{s_{1}}b_{l}^{s_{2}}b_{m}^{s_{3}}b_{r}^{s_{4}}$. Thus the power of $z_{jk}^{s_{1}}z_{jl}^{s_{2}}z_{jm}^{s_{3}}z_{jr}^{s_{4}}$ corresponds to that of $b_{k}^{s_{1}}b_{l}^{s_{2}}b_{m}^{s_{3}}b_{r}^{s_{4}}$, which in turn corresponds to that of the products of the variances and/or covariances as we see above. We have shown that the sum of the terms in $3(\Sigma\otimes\Sigma)$ is equal to that of $m_{4}$. Therefore, to prove the substitution, we have to prove that the power of the product of covariances and/or variances in each position of $3[\mathrm{vec}(\Sigma\otimes\Sigma)]^{T}$ corresponds to that of $z_{jk}^{s_{1}}z_{jl}^{s_{2}}z_{jm}^{s_{3}}z_{jr}^{s_{4}}$ in $\tilde T^{(4)}$, so that $3[\mathrm{vec}(\Sigma\otimes\Sigma)]^{T}\mathrm{vec}\,\tilde T^{(4)}$ will also result in (9).

Because $(\mathrm{vec}\,m_{4})^{T}$ is the expectation of $(\mathrm{vec}((b\otimes b\otimes b)b^{T}))^{T}$, the power of the sum of the three products of variances and/or covariances in each position of the former corresponds to the power in each position of the $b_{k}b_{l}b_{m}b_{r}$'s in the latter. On the other hand, because $\Sigma=E(bb^{T})$, the power of the products of variances and/or covariances in each position of $3(\Sigma\otimes\Sigma)$ corresponds to that in each position of $(bb^{T}\otimes bb^{T})$. Combining these two facts, the proof reduces to proving that $[\mathrm{vec}(r_{1}^{T})]^{T}=[\mathrm{vec}((b\otimes b\otimes b)b^{T})]^{T}=[\mathrm{vec}(bb^{T}\otimes bb^{T})]^{T}$. Since transposing a vector does not change the order of the elements in the vector, I will ignore the transpose on both sides and prove:

$$\mathrm{vec}((b\otimes b\otimes b)b^{T})=\mathrm{vec}(r_{1}^{T})=\mathrm{vec}(bb^{T}\otimes bb^{T}).\qquad(10)$$
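Since (10) is an algebraic identity in $b$, it can be checked directly for any vector, and the fourth-moment formula (7) can be checked by simulation. The sketch below does both; it is illustrative modern Python/NumPy with arbitrary dimensions and sample size, not part of the original proof.

```python
# Direct check of identity (10) and a Monte Carlo look at the fourth moment (7).
import numpy as np

def vec(A):
    return A.reshape(-1, 1, order="F")

def commutation(r, c):
    K = np.zeros((r * c, r * c))
    for i in range(r):
        for j in range(c):
            K[i * c + j, j * r + i] = 1.0
    return K

rng = np.random.default_rng(2)
q = 3
b = rng.standard_normal((q, 1))

# identity (10): vec((b kron b kron b) b') = vec(bb' kron bb'), exactly, for any b
lhs = vec(np.kron(b, np.kron(b, b)) @ b.T)
rhs = vec(np.kron(b @ b.T, b @ b.T))
assert np.allclose(lhs, rhs)

# fourth moment (7): E[(b kron b kron b) b'] for b ~ N(0, Sigma)
A = rng.standard_normal((q, q))
Sigma = A @ A.T
m4 = (np.kron(Sigma, vec(Sigma)) + np.kron(vec(Sigma), Sigma)
      + np.kron(commutation(q, q), np.eye(q)) @ np.kron(Sigma, vec(Sigma)))

n = 50_000
draws = rng.multivariate_normal(np.zeros(q), Sigma, size=n)
est = np.zeros((q**3, q))
for row in draws:
    bi = row.reshape(q, 1)
    est += np.kron(bi, np.kron(bi, bi)) @ bi.T
est /= n
print(np.max(np.abs(est - m4)))   # shrinks roughly like 1/sqrt(n)
```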
By Equation PA-3,

$$r_{1}^{T}=[b\otimes b\otimes b]\otimes b^{T}.\qquad(11)$$

Putting (11) back into the left-hand side of (10) and using Equations PA-11 and PA-6,

$$\mathrm{vec}(r_{1}^{T})=\mathrm{vec}[b\otimes b\otimes b\otimes b^{T}]=\mathrm{vec}[b\otimes c\otimes b^{T}]=\mathrm{vec}[b\otimes(c\otimes b^{T})]=\mathrm{vec}[b\otimes(b^{T}\otimes c)]=\mathrm{vec}[b\otimes b^{T}\otimes c]=\mathrm{vec}[(b\otimes b^{T})\otimes c]=\mathrm{vec}(b\otimes b^{T})\otimes c,\qquad(12)$$

where $c=(b\otimes b)$, $c$ being a $q^{2}\times1$ vector. The last equality of (12) is obtained from PA-12 by regarding $(b\otimes b^{T})$ as a matrix.

Similarly, the right-hand side of (10) is

$$\mathrm{vec}[bb^{T}\otimes bb^{T}]=\mathrm{vec}[(b\otimes b^{T})\otimes(b\otimes b^{T})]=\mathrm{vec}[(b^{T}\otimes b)\otimes(b^{T}\otimes b)]=\mathrm{vec}[b^{T}\otimes(b\otimes b^{T})\otimes b]=\mathrm{vec}[b^{T}\otimes(b^{T}\otimes b)\otimes b]=\mathrm{vec}[b^{T}\otimes b^{T}\otimes b\otimes b]=\mathrm{vec}[(b^{T}\otimes b^{T})\otimes c]=\mathrm{vec}(b^{T}\otimes b^{T})\otimes c=\mathrm{vec}(b\otimes b^{T})\otimes c,\qquad(13)$$

which is equal to the left-hand side. Therefore I have proved Equation 10, and thus the substitution.

The Sixth Moment Substitution

Theorem 2

$$E\,\mathrm{tr}\{(b^{T}\otimes b^{T}\otimes b^{T}\otimes b^{T}\otimes b^{T})\tilde T^{(6)}b\}=15[\mathrm{vec}(\Sigma\otimes\Sigma\otimes\Sigma)]^{T}\mathrm{vec}\,\tilde T^{(6)}.\qquad(14)$$

The theorem can be proved by following exactly the same reasoning as above.

Theorem 3

$$E\,\mathrm{tr}\{(b\otimes b)[[\mathrm{vec}(bb^{T})]^{T}\otimes[\mathrm{vec}(bb^{T})]^{T}](\tilde T^{(3)}\otimes\tilde T^{(3)})\}=15\,\mathrm{tr}\{\mathrm{vec}\,\Sigma([\mathrm{vec}\,\Sigma]^{T}\otimes[\mathrm{vec}\,\Sigma]^{T})(\tilde T^{(3)}\otimes\tilde T^{(3)})\}.\qquad(15)$$

Proof

The argument for the substitution is similar to the above. Each of the fifteen different matrices in the sixth moment (see APPENDIX B: The Six Moments of a Multivariate Normal Distribution) is a Kronecker product of the variance matrix arranged in a special way by identity matrices and commutation matrices. Therefore, the total of the elements inside each of the matrices is the same as that of $\mathrm{vec}\,\Sigma([\mathrm{vec}\,\Sigma]^{T}\otimes[\mathrm{vec}\,\Sigma]^{T})$. Since Equation 15 is a scalar, the effect of the identity matrices and commutation matrices in the sixth moment disappears. I only need to show that

$$\mathrm{vec}((b\otimes b)[[\mathrm{vec}(bb^{T})]^{T}\otimes[\mathrm{vec}(bb^{T})]^{T}])=\mathrm{vec}(b\otimes b\otimes b\otimes b\otimes b\otimes b^{T}).\qquad(16)$$

The left-hand side of Equation 16 can be simplified to

$$\mathrm{vec}((b\otimes b)[[\mathrm{vec}(bb^{T})]^{T}\otimes[\mathrm{vec}(bb^{T})]^{T}])=\mathrm{vec}((b\otimes b)[b^{T}\otimes b^{T}\otimes b^{T}\otimes b^{T}])=b\otimes b\otimes b\otimes b\otimes b\otimes b,$$

which is also the simplified version of the right-hand side. Hence we can substitute the fifteen matrices of the sixth moment with $15\,\mathrm{vec}\,\Sigma([\mathrm{vec}\,\Sigma]^{T}\otimes[\mathrm{vec}\,\Sigma]^{T})$.
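Identity (16), the core of Theorem 3, likewise holds for every $b$ and is easy to confirm numerically; a small illustrative sketch (the dimension $q=2$ is an arbitrary choice, and the code is not part of the original proof):

```python
# Direct check that vec((b kron b)[vec(bb')' kron vec(bb')']) = b^(kron 6).
import numpy as np

def vec(A):
    return A.reshape(-1, 1, order="F")

rng = np.random.default_rng(3)
q = 2
b = rng.standard_normal((q, 1))

v = vec(b @ b.T)                            # vec(bb'), a q^2 x 1 vector
lhs = vec(np.kron(b, b) @ np.kron(v, v).T)  # (b kron b)[vec(bb')' kron vec(bb')']
rhs = np.kron(b, np.kron(b, np.kron(b, np.kron(b, np.kron(b, b)))))
assert np.allclose(lhs, rhs)
```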
APPENDIX D

The Expectation of the Third Order Term Squared

This appendix finds the expectation of the squared third-order term in the approximation to the Taylor series inside the exponential. To simplify equations, I will use the same notation as in APPENDIX C; that is, $b=b_{j}-\tilde b_{j}$ and $b\sim N(0,\Sigma)$, with $\Sigma=(D^{-1}-\tilde T^{(2)})^{-1}$. This appendix shows that the expectation of $L_{3}^{2}$ is

$$E(L_{3}^{2})=\frac{15}{36}\Big(\sum_{i=1}^{n_{j}}a_{ij}\,z_{ij}^{T}\Sigma z_{ij}\,z_{ij}^{T}\Big)\Sigma\Big(\sum_{i=1}^{n_{j}}a_{ij}\,z_{ij}z_{ij}^{T}\Sigma z_{ij}\Big),$$

when $L_{3}=\frac{1}{3!}[b^{T}\otimes b^{T}]\tilde T^{(3)}b$, with $\tilde T^{(3)}=-(Z_{j}^{T}\otimes Z_{j}^{T})A_{j}Z_{j}$.

Proof

Since $L_{3}^{2}$ is a scalar, a trace function of the scalar will not change its value:

$$L_{3}^{2}=\frac{1}{36}\{\mathrm{tr}(b[b^{T}\otimes b^{T}]\tilde T^{(3)})\}^{2};$$

then, by PA-13,

$$=\frac{1}{36}\mathrm{tr}[(b[b^{T}\otimes b^{T}]\tilde T^{(3)})\otimes(b[b^{T}\otimes b^{T}]\tilde T^{(3)})]=\frac{1}{36}\mathrm{tr}[((b[b^{T}\otimes b^{T}])\otimes(b[b^{T}\otimes b^{T}]))(\tilde T^{(3)}\otimes\tilde T^{(3)})]\quad\text{(by PA-14)}$$
$$=\frac{1}{36}\mathrm{tr}[((b\otimes b)[[b^{T}\otimes b^{T}]\otimes[b^{T}\otimes b^{T}]])(\tilde T^{(3)}\otimes\tilde T^{(3)})].$$

Take the expectation of the above and substitute the sixth moment with $15\,\mathrm{vec}\,\Sigma([\mathrm{vec}\,\Sigma]^{T}\otimes[\mathrm{vec}\,\Sigma]^{T})$ (see APPENDICES B and C):

$$E(L_{3}^{2})=\frac{15}{36}\mathrm{tr}[(\mathrm{vec}\,\Sigma([\mathrm{vec}\,\Sigma]^{T}\otimes[\mathrm{vec}\,\Sigma]^{T}))(\tilde T^{(3)}\otimes\tilde T^{(3)})]=\frac{15}{36}(([\mathrm{vec}\,\Sigma]^{T}\tilde T^{(3)})\otimes([\mathrm{vec}\,\Sigma]^{T}\tilde T^{(3)}))\,\mathrm{vec}\,\Sigma\quad\text{(by PA-14, PA-15 and PA-8)}$$
$$=\frac{15}{36}[\mathrm{vec}\,\Sigma]^{T}\tilde T^{(3)}\,\Sigma\,\tilde T^{(3)T}\mathrm{vec}\,\Sigma.\quad\text{(by PA-16)}$$

Furthermore, $\tilde T^{(3)}=-(Z_{j}^{T}\otimes Z_{j}^{T})A_{j}Z_{j}$ and $A_{j}=\sum_{i=1}^{n_{j}}a_{ij}(E_{i}E_{i}^{T}\otimes E_{i})$, where $E_{i}$ is an $n_{j}\times1$ vector with the $i$th element being 1 and the others being 0. Thus

$$\tilde T^{(3)}=-(Z_{j}^{T}\otimes Z_{j}^{T})\Big[\sum_{i}a_{ij}(E_{i}E_{i}^{T}\otimes E_{i})\Big]Z_{j}=-\sum_{i}a_{ij}(z_{ij}z_{ij}^{T}\otimes z_{ij}),$$

where $z_{ij}^{T}$ is the $i$th row of $Z_{j}$. Therefore, applying PA-16 and its transpose to $[\sum_{i}a_{ij}(z_{ij}z_{ij}^{T}\otimes z_{ij})]^{T}\mathrm{vec}\,\Sigma$ and $[\mathrm{vec}\,\Sigma]^{T}[\sum_{i}a_{ij}(z_{ij}z_{ij}^{T}\otimes z_{ij})]$ respectively, $E(L_{3}^{2})$ becomes

$$E(L_{3}^{2})=\frac{15}{36}[\mathrm{vec}\,\Sigma]^{T}\Big[\sum_{i}a_{ij}(z_{ij}z_{ij}^{T}\otimes z_{ij})\Big]\Sigma\Big[\sum_{i}a_{ij}(z_{ij}z_{ij}^{T}\otimes z_{ij})\Big]^{T}\mathrm{vec}\,\Sigma=\frac{15}{36}\Big(\sum_{i=1}^{n_{j}}a_{ij}\,z_{ij}^{T}\Sigma z_{ij}\,z_{ij}^{T}\Big)\Sigma\Big(\sum_{i=1}^{n_{j}}a_{ij}\,z_{ij}z_{ij}^{T}\Sigma z_{ij}\Big).$$

APPENDIX E

Computational Algorithm

To adapt to the PQL computer program (Bryk, Raudenbush and Congdon, 1996), the notation of Chapter 2 ($D$, $X_{j}$, $Z_{j}$, $\theta$, $b_{j}$) is mapped onto the corresponding arrays of the program.

For the fixed effects, the score contribution of cluster $j$ reduces to

$$S_{\theta j}=(A1PY)_{j}-(A1PA1)_{j}\hat\theta_{1}-(A1PA2)_{j}\hat\theta_{2j}+f_{j}+\sum_{i}c_{ij},\qquad(2)$$

where $(A1PY)_{j}=A_{1j}^{T}W_{j}\tilde z_{j}$, $(A1PA1)_{j}=A_{1j}^{T}W_{j}A_{1j}$, $(A1PA2)_{j}=A_{1j}^{T}W_{j}A_{2j}$, and $c_{ij}$ collects the higher-order corrections assembled in steps 4 through 11 of the program flow below from the scalar weights $a_{ij}$, $g_{ij}$, $h_{ij}$, $r_{ij}$, $p_{ij}$ and the quantities

$$B_{ij},\qquad m_{j}=\sum_{i}g_{ij}B_{ij}\,C_{j}^{-1}A_{2ij}A_{2ij}^{T}C_{j}^{-1},\qquad b_{j}=\sum_{i}r_{ij}B_{ij}^{2}\,C_{j}^{-1}A_{2ij}A_{2ij}^{T}C_{j}^{-1},$$
$$s_{ij}=A_{2ij}^{T}C_{j}^{-1}k_{j},\qquad e_{j}=\sum_{i}a_{ij}s_{ij}\,C_{j}^{-1}A_{2ij}A_{2ij}^{T}C_{j}^{-1}.$$

For the variance components, the score function for $\phi$ in the log-likelihood is

$$S_{\phi j}=\frac{1}{2}\Big\{E^{T}\mathrm{vec}[T^{-1}(\tilde T_{j}-T)T^{-1}]-\frac{1}{2}\sum_{i}a_{ij}B_{ij}E^{T}\mathrm{vec}[Q_{ij}]+\frac{1}{3}\sum_{i}e_{ij}E^{T}\mathrm{vec}[Q_{ij}]$$
$$+E^{T}\Big[\sum_{i}\Big(-\frac{1}{2}g_{ij}B_{ij}-\frac{15}{36}r_{ij}B_{ij}^{2}+\frac{15}{36}a_{ij}s_{ij}\Big)\mathrm{vec}(F_{ij})+\mathrm{vec}\{T^{-1}C_{j}^{-1}k_{j}k_{j}^{T}C_{j}^{-1}T^{-1}\}\Big]\Big\},\qquad(3)$$

where $\tilde T_{j}=\hat\theta_{2j}\hat\theta_{2j}^{T}+C_{j}^{-1}$, $T$ is from the previous iteration, $E=\frac{d\,\mathrm{vec}\,T}{d\phi^{T}}$, $\phi$ being the unique elements in $\mathrm{vec}\,T$,

$$(A2PR)_{j}=A_{2j}^{T}W_{j}\tilde z_{j}-A_{2j}^{T}W_{j}A_{1j}\hat\theta_{1}-A_{2j}^{T}W_{j}A_{2j}\hat\theta_{2j},$$
$$Q_{ij}=T^{-1}C_{j}^{-1}A_{2ij}(A2PR)_{j}^{T},\qquad F_{ij}=T^{-1}C_{j}^{-1}A_{2ij}A_{2ij}^{T}C_{j}^{-1}T^{-1}.$$

The Fisher scoring step for the parameters is

$$\begin{pmatrix}\theta^{s+1}\\ \mathrm{vec}\,\phi^{s+1}\end{pmatrix}=\begin{pmatrix}\theta^{s}\\ \mathrm{vec}\,\phi^{s}\end{pmatrix}+H^{-1}S,\qquad S_{j}=\begin{pmatrix}S_{\theta j}\\ S_{\phi j}\end{pmatrix},\quad S=\sum_{j}S_{j},\quad H=\sum_{j}S_{j}S_{j}^{T}.\qquad(4)$$

A schematic version of this update appears after the program flow below.

Program Flow:
1. Start with HLM2 estimates of $T$ and $\theta$.
2. Iteratively solve $C_{j}^{-1}(A_{2j}^{T}W_{j}\tilde z_{j}-A_{2j}^{T}W_{j}A_{1j}\theta_{1})=\theta_{2j}$ for $\theta_{2j}$, with $W_{j}$ and $\tilde z_{j}$ computed holding $T$ and $\theta_{1}$ constant from the last iteration. Thus $\tilde z_{j}$, $\eta_{j}$, $\mu_{j}$, $W_{j}$, $(A2PA2)_{j}$, $(A2PY)_{j}$, $C_{j}^{-1}$, and $(A2PA1\theta_{1})_{j}$, along with $\theta_{2j}$, result from this iterative process.
3. Compute $(A1PA1)_{j}$, $(A1PA2)_{j}$, and $(A1PY)_{j}$ from $W_{j}$.
4. Compute $a_{ij}$, $g_{ij}$, $h_{ij}$, $r_{ij}$ and $p_{ij}$ from $W_{j}$ and $\mu_{j}$.
5. Compute $B_{ij}$ from $C_{j}^{-1}$.
6. Compute $f_{j}$ and $k_{j}$ from $a_{ij}$ and $B_{ij}$.
7. Compute $m_{j}$ from $B_{ij}$ and $C_{j}^{-1}$.
8. Compute $s_{ij}$ from $k_{j}$ and $C_{j}^{-1}$.
9. Compute $b_{j}$ from $r_{ij}$, $B_{ij}$ and $C_{j}^{-1}$.
10. Compute $e_{j}$ from $C_{j}^{-1}$, $s_{ij}$, and $a_{ij}$.
11. Compute $c_{ij}$ from $B_{ij}$, $m_{j}$, $b_{j}$, $e_{j}$, $h_{ij}$, and $p_{ij}$.
12. Compute $(A2PR)_{j}$ from $(A2PY)_{j}$, $(A2PA1\theta_{1})_{j}$, and $(A2PA2)_{j}$.
13. Compute $X_{j}$ and $I_{j}$ from $(A1PA2)_{j}$ and $C_{j}^{-1}$.
14. Compute $\tilde T_{j}$ from $\hat\theta_{2j}$ and $C_{j}^{-1}$.
15. Compute $Q_{ij}$ from $C_{j}^{-1}$ and $(A2PR)_{j}$.
16. Compute $F_{ij}$ from $C_{j}^{-1}$.
17. Complete the two $S$'s and hence the new $T$ and $\theta$ via (4).
18. Monitor convergence as in the current HLM. At convergence, compute standard errors from the square roots of the diagonal elements of $H^{-1}$.
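Equation (4) is a scoring update that uses the outer product of the cluster scores as the information estimate. The following schematic sketch shows the shape of steps 1, 17 and 18 of the program flow in modern Python/NumPy; it is not the dissertation's program, and `score_j` is a hypothetical stand-in for the per-cluster score $S_{j}$.

```python
# Schematic scoring update of equation (4): theta_new = theta + H^{-1} S,
# with S = sum_j S_j and H = sum_j S_j S_j'.
import numpy as np

def scoring_step(theta, clusters, score_j):
    # One Fisher-scoring step; H is the outer-product-of-scores estimate.
    p = theta.shape[0]
    S = np.zeros((p, 1))
    H = np.zeros((p, p))
    for data_j in clusters:
        s = score_j(theta, data_j).reshape(p, 1)  # hypothetical S_j for cluster j
        S += s
        H += s @ s.T
    return theta + np.linalg.solve(H, S), H

def fit(theta0, clusters, score_j, tol=1e-6, max_iter=100):
    # Step 1: theta0 would come from HLM2 estimates; steps 17-18: iterate,
    # monitor convergence, then take standard errors from diag(H^{-1}).
    theta, H = theta0, np.eye(theta0.shape[0])
    for _ in range(max_iter):
        theta_new, H = scoring_step(theta, clusters, score_j)
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return theta, se
```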
LIST OF REFERENCES

Aitkin, M., Anderson, D., and Hinde, J. (1981). Statistical modelling of data on teaching styles (with discussion). Journal of the Royal Statistical Society, A, 144, 148-61.

Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. New York: John Wiley & Sons.

Bennett, N. (1976). Teaching Styles and Pupil Progress. London: Open Books.

Bock, R. D. (ed.) (1989). Multilevel Analysis of Educational Data. New York: Academic Press.

Bock, R. D., Gibbons, R. D., and Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12, 261-80.

Breslow, N. E., and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-25.

Breslow, N. E., and Lin, X. (1995). Bias correction in generalised linear mixed models with a single component of dispersion. Biometrika, 82, 81-91.

Bryk, A. S., and Raudenbush, S. W. (1992). Hierarchical Linear Models: Applications and Data Analysis Methods. Newbury Park, CA: Sage.

Bryk, A. S., Raudenbush, S. W., and Congdon, R. T. (1996). HLM2 and HLM3: Computer Programs and User's Guide, v. 4.25. Chicago: Scientific Software International.

Chan, K. S., and Ledolter, J. (1995). Monte Carlo EM estimation for time series models involving counts. Journal of the American Statistical Association, 90, 242-52.

Dempster, A. P., Rubin, D. B., and Tsutakawa, R. K. (1981). Estimation in covariance components models. Journal of the American Statistical Association, 76, 341-53.

Fellner, W. H. (1986). Robust estimation of variance components. Technometrics, 28, 51-60.

Fellner, W. H. (1987). Sparse matrices, and the estimation of variance components by likelihood methods. Communications in Statistics, B 16, 439-63.

Fulks, W. (1978). Advanced Calculus: An Introduction to Analysis (3rd ed.). New York: John Wiley & Sons.

Gelfand, A. E., and Carlin, B. P. (1993). Maximum likelihood estimation for constrained- or missing-data problems. Canadian Journal of Statistics, 21, 303-11.

Gelfand, A. E., Hills, S. E., Racine-Poon, A., and Smith, A. F. M. (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association, 85, 972-85.

Gelfand, A. E., and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-41.

Geyer, C. J., and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, B, 54, 657-99.

Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika, 73, 43-56.

Goldstein, H. (1987). Multilevel Models in Educational and Social Research. London: Oxford University Press.

Goldstein, H. (1991). Nonlinear multilevel models, with an application to discrete response data. Biometrika, 78, 45-51.

Goldstein, H., and Rasbash, J. (1996). Improved approximations for multilevel models with binary responses. Journal of the Royal Statistical Society, A, 159, 505-13.

Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, B, 46, 149-192.

Green, P. J. (1987). Penalized likelihood for general semi-parametric regression models. International Statistical Review, 55, 245-59.

Hedeker, D., and Gibbons, R. D. (1994). A random-effects ordinal regression model for multilevel analysis. Biometrics, 50, 933-44.

Hedeker, D., and Gibbons, R. D. (1996). MIXOR: A computer program for mixed-effects ordinal probit and logistic regression analysis. Computer Methods and Programs in Biomedicine, 49, 157-176.

Hobert, J., and Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1461-73.

Laird, N. M., and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-74.
Lee, V. E. (1995). Another look at high school restructuring: More evidence that it improves student achievement and more insight into why. Issues in Restructuring Schools, 9, 1-10.

Lee, Y., and Nelder, J. A. (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society, B, 58, 619-78.

Lin, X., and Breslow, N. E. (1996). Bias correction in generalized linear mixed models with multiple components of dispersion. Journal of the American Statistical Association, 91, 1007-1016.

Longford, N. T. (1987). A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika, 74, 817-27.

Longford, N. T. (1988a). A quasi-likelihood adaptation for variance component analysis. Proceedings of the Statistical Computing Section, American Statistical Association.

Longford, N. T. (1988b). VARCL: Software for Variance Component Analysis of Data with Hierarchically Nested Random Effects (Maximum Likelihood). Princeton: Educational Testing Service.

Longford, N. T. (1994). Logistic regression with random coefficients. Computational Statistics & Data Analysis, 17, 1-15.

Louis, K. S., Marks, H. M., and Kruse, S. (1994). Teachers' professional community in restructuring schools. Center on Organization and Restructuring of Schools. (ERIC Document Reproduction Service No. ED 381 871)

Mac Iver, D. J., and Plank, S. B. (1996). The Talent Development Middle School. Creating a motivational climate conducive to talent development in middle schools: Implementation and effects of Student Team Reading (Report No. 4). Center for Research on the Education of Students Placed at Risk. (ERIC Document Reproduction Service No. ED 402 388)

Magnus, J. R. (1988). Linear Structures. New York: Oxford University Press.

Magnus, J. R., and Neudecker, H. (1988). Matrix Differential Calculus with Applications in Statistics and Econometrics. New York: John Wiley & Sons.

Marks, H. M. (1995). Student engagement in the classrooms of restructuring schools. Center on Organization and Restructuring of Schools. (ERIC Document Reproduction Service No. ED 381 884)

Mason, W. M., Wong, G. M., and Entwisle, B. (1983). Contextual analysis through the multilevel linear model. In Sociological Methodology, 1983-1984, 72-103.

McCullagh, P., and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London: Chapman and Hall.

McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association, 92, 162-70.

Natarajan, R., and McCulloch, C. E. (1995). A note on the existence of the posterior distribution for a class of mixed models for binomial responses. Biometrika, 82, 639-43.

Nelder, J. A., and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, A, 135, 370-84.

Patterson, H. D., and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545-54.

Pinheiro, J. C., and Bates, D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixed effects model. Journal of Computational and Graphical Statistics, 4, 12-35.

Raudenbush, S. W. (1993). Posterior modal estimation for hierarchical generalized linear models with application to dichotomous and count data. Unpublished manuscript.

Raudenbush, S. W. (1988). Educational applications of hierarchical linear models: A review. Journal of Educational Statistics, 13, 85-116.
Raudenbush, S. W., and Bhumirat, C. (1992). The distribution of resources for primary education and its consequences for educational achievement in Thailand. International Journal of Educational Research, 17, 143-164.

Raudenbush, S. W., Bhumirat, C., and Kamali, M. (1992). Predictors and consequences of primary teachers' sense of efficacy and students' perceptions of achievement in Thailand. International Journal of Educational Research, 17, 165-177.

Raudenbush, S. W., and Bryk, A. S. (1986). A hierarchical model for studying school effects. Sociology of Education, 59, 1-17.

Raudenbush, S. W., and Chan, W. (1993). Application of a hierarchical linear model to the study of adolescent deviance in an overlapping cohort design. Journal of Consulting and Clinical Psychology, 61, 941-51.

Raudenbush, S. W., and Willms, J. D. (1991). Pupils, Classrooms, and Schools: International Studies of Schooling from a Multilevel Perspective. New York: Academic Press.

Rodriguez, G., and Goldman, N. (1995). An assessment of estimation procedures for multilevel models with binary responses. Journal of the Royal Statistical Society, A, 158, 73-89.

Rosenberg, B. (1973). Linear regression with randomly dispersed parameters. Biometrika, 60, 65-72.

Schall, R. (1991). Estimation in generalized linear models with random effects. Biometrika, 78, 719-27.

Stiratelli, R., Laird, N., and Ware, J. H. (1984). Random-effects models for serial observations with binary response. Biometrics, 40, 961-71.

Waldman, D. A., and Avolio, B. J. (1991). Race effects in performance evaluations: Controlling for ability, education and experience. Journal of Applied Psychology, 76, 897-901.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61, 439-47.

Willms, J. D., and Raudenbush, S. W. (1989). A longitudinal hierarchical linear model for estimating school effects and their stability. Journal of Educational Measurement, 26, 209-32.

Yang, M. (1994). A simulation study for the assessment of the non-linear hierarchical model estimation via approximate maximum likelihood. Unpublished apprenticeship paper, Michigan State University.

Yosef, M. (1997). Two-level hierarchical mixed-effects logistic regression analysis: A comparison of maximum likelihood and penalized quasi-likelihood estimates. Unpublished apprenticeship paper, Michigan State University.

Young, D. J. (1996). Science achievement and educational productivity: A hierarchical linear model. Journal of Educational Research, 89, 272-78.

Zeger, S. L., and Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86, 79-86.

Zeger, S. L., Liang, K., and Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44, 1049-60.