This is to certify that the dissertation entitled "The Application of Gibbs Sampling to Nested Variance Components Models with Heterogenous Within-Group Variance," presented by Rafa M. Kasim, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement and Quantitative Methods.

Dr. Stephen W. Raudenbush

THE APPLICATION OF GIBBS SAMPLING TO NESTED VARIANCE COMPONENTS MODELS WITH HETEROGENOUS WITHIN-GROUP VARIANCE

By

Rafa M. Kasim

A DISSERTATION

Submitted to Michigan State University, College of Education, in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1994

ABSTRACT

THE APPLICATION OF GIBBS SAMPLING TO NESTED VARIANCE COMPONENTS MODELS WITH HETEROGENOUS WITHIN-GROUP VARIANCE

By

Rafa M. Kasim

Research in social sciences such as education, psychology, and sociology often involves analyzing hierarchically structured data. The use of Bayesian procedures with hierarchical linear regression models to analyze such data has helped researchers obtain answers to questions where standard approaches seem to fail. It has also improved the estimates of the regression coefficients by providing their empirical Bayes estimates. Empirical Bayes estimates of the regression coefficients at the two levels of the hierarchy are usually conditioned on the maximum likelihood estimates of the variance components, and often assume homogeneity of within-group variance. There are times, however, when research interest is focused on studying the variance components themselves, especially when there is clear evidence of heterogeneity of variance, and when there is concern about the effect of the uncertainty in estimating these variances on the empirical Bayes estimates of the regression coefficients.

A mixed model with a random intercept, which permits heterogenous within-group variances $\{\sigma^2_j\}$ for $j = 1, \ldots, k$ across the $k$ units, is introduced within the Bayesian approach. Information about the within-group variance and the intercept from these $k$ units is represented by their prior distributions in a second level of the analysis. The marginal posterior distributions of the variance components and of the regression coefficients are obtained via Gibbs sampling. The goal is to obtain Bayes estimates of the parameters of interest, especially those of the within-group variances, and to study the effect of adjusting for the uncertainty in estimating the variance components on the estimation of the regression coefficients. Bayes estimates of the parameters of the specified model with heterogenous within-group variance, obtained via Gibbs sampling, are compared to their empirical Bayes estimates obtained via HLM analysis assuming homogeneity of variance.
The process of Gibbs sampling was applied to several artificial data sets, and to one real data set, to obtain the marginal posterior distributions of the parameters of the specified model. The real data set represents a nationwide random sample of high schools in the U.S. Eighteen artificial data sets were generated to represent three models. The variation among the 18 data sets covers the complexity of the model used (the number of predictors at the two levels), the degree of heterogeneity of within-group variance, and the number of groups at level 2. These data sets were generated to reflect the different models used in analyzing the real data set. There appears to be no substantial difference between the Bayes estimates (posterior means) and the empirical Bayes estimates of the regression coefficients. Note that the empirical Bayes estimates are based on the assumption of homogeneity of within-group variance, $\sigma^2_j = \sigma^2$. When it comes to the estimation of the variance components, the HLM estimate of the within-group variance component $\sigma^2$ is found to be positively biased, especially when there is clear evidence of heterogeneity of variance.

ACKNOWLEDGEMENTS

I would like to thank my dissertation director, Dr. Stephen Raudenbush, for his constant help, support, and advice throughout my study and the writing of this dissertation. I am very fortunate to have him as a teacher, as an advisor, and best of all as a friend. I would also like to express my appreciation to Dr. Dennis Gilliland for his valuable suggestions and comments on my work; having him as a member of my dissertation committee helped me learn a great deal from him. I would also like to thank the rest of the dissertation committee members, Dr. Betsy Becker and Dr. Richard Houang, for their support and encouragement. I would also like to express my appreciation to Dr. Michael Seltzer for his insightful comments on my work. My wife Hripsime and my kids Nader, Neda, and Nabeel deserve special thanks for their understanding and patience while I finished my dissertation. Finally, but most importantly, I would like to express my deepest gratitude to my parents and my brothers and sister for their endless love and support.

TABLE OF CONTENTS

I   Introduction
      Purpose of the Study
      Objectives of the Study
      Bayes Solution
      Importance of the Study
      Organization of the Study

II  Statement of the Problem
      Problems Associated with the Estimation of $\sigma^2$ and $T$
      Problems Associated with Estimating $T$
      Problems Associated with an Invalid Assumption of Homogeneity of Variance
      The Analysis of Many Group Variances $\sigma^2_j$
      Bayesian Approach
      Comments on the Analysis of Many Variances
      Adjusting for the Uncertainty in Estimating $\sigma^2_j$ and $T$

III Model Specification
      Statistical Model
      Assumptions of the Model
      The Joint Density Function of the Parameters in the Model

IV  Obtaining Marginal Posterior Distributions Via Gibbs Sampling
      Data Augmentation
      Example
      Gibbs Sampling
      The Application of Gibbs Sampling to the Model of the Study
V   Obtaining Marginal Posterior Distributions of Parameters of the Model in 3.1
      Finding the Conditional Distributions for the Parameters of the Model in 3.1
        The Conditional Distribution of $U_j$ Given $\lambda$, $\tau^2$, $\{\sigma^2_j\}$, and $Y$
        The Conditional Distribution of $\sigma^2_j$ Given $\lambda$, $\theta$, $\sigma^2_0$, $\{U_j\}$, and $Y$
        The Conditional Distribution of $\lambda$ Given $\{U_j\}$, $\{\sigma^2_j\}$, and $Y$
        The Conditional Distribution of $\tau^2$ Given $\{U_j\}$ and $Y$
        The Conditional Distribution of $\sigma^2_0$ Given $\theta$ and $\{\sigma^2_j\}$
        The Conditional Distribution of $\theta$ Given $\sigma^2_0$ and $\{\sigma^2_j\}$
      Empirical Applications of Gibbs Sampling
        Artificial Data Analysis
          Data Specifications
          Data Creation
          Initial Estimates for Gibbs Sampling
          Assessing the Heterogeneity of Variance
          Random Number Generation
          Specifying the Criteria for Convergence
        Real Data Analysis

VI  Results
      Required Number of Iterations for Convergence
      Artificial Data
      High School and Beyond Data
      Comparing Estimates of Variance Components
        Estimates of $\sigma^2_0$
        Estimates of $\theta$
        Estimates of $\tau^2$
        Estimates of $\{\sigma^2_j\}$
      Comparing Estimates of the Regression Coefficients
        Estimates of $\lambda$
        Estimates of $\{U_j\}$

VII Discussion
      Bayes and Empirical Bayes Estimation
      Suggestions and Recommendations for Future Research

References

CHAPTER 1

Introduction

Social and educational research is usually conducted in natural settings such as schools or communities rather than in a controlled experimental setting. Natural settings often have a hierarchical structure. In education, for example, students are nested within classrooms, and classrooms are nested within schools. This hierarchical structure is reflected in the data collected from these nested levels. Observations at one level share common characteristics. In teacher-effectiveness studies, for example, students within a classroom share common characteristics of the teacher and his/her teaching method. Student learning may be viewed as a result of social interaction within the classroom and the school system (Bandura, 1977). Students are assigned to classrooms, and they are taught in a planned and structured manner. Even though individual students respond differently to the same teaching process, their responses will have commonality. Therefore, observations on students' learning often cannot be assumed to be independent. However, independence is one of the assumptions of many statistical procedures that researchers often use in their analyses.

Ignoring the hierarchical structure of the data leads to other statistical problems (Burstein, 1980; Cooley, Bond and Mao, 1981; Cronbach and Webb, 1975; Knapp, 1977; Raudenbush and Bryk, 1988), such as aggregation bias.
Aggregation bias occurs when the relationship between two variables varies (in magnitude and sometimes even in direction) across different levels of data analysis (Cooley, Bond and Mao, 1981; Cronbach and Webb, 1975; Robinson, 1950). For example, the relationship between socioeconomic status (SES) and academic achievement is more likely to be higher in school-level data than it is in student-level data. This difference in the relationship is attributed to the effect of aggregation of the data in each school. In general, school membership is related to SES; schools with higher levels of SES are likely to have higher quality educational programs, which lead to higher achievement in those schools.

Information loss is another substantive problem that results from ignoring the hierarchical structure in data analysis. After the research or evaluation design is determined, a type of data analysis is usually chosen to answer questions related to one level of data. If the research questions are asked about students, it is likely that they will be answered with a student-level analysis. If questions are asked about classrooms, then it is likely that they will be answered with a classroom-level analysis. When the analysis is done at the classroom level (i.e., using the classroom as the unit of analysis), it is very possible that some important relationships between variables that were measured at the student level will be ignored.

However, there has been recent development of statistical methods which are appropriate for analyzing hierarchically structured data, based on multi-stage (often two-stage) linear models (Aitkin and Longford, 1986; Burstein, 1980; Goldstein, 1986; Lindley and Smith, 1972; Smith, 1973; Rao, 1972; Rosenberg, 1973; Raudenbush and Bryk, 1986). General linear mixed models or hierarchical linear models (HLM) are some of the names used to describe the statistical models developed for analyzing multilevel data. The multi-stage methods share the common characteristic of treating the within-unit regression parameters as outcomes for the between-unit models. They differ, however, in the procedure for estimating variance components. Some of these procedures use the iterative generalized least squares method suggested by Goldstein (1986), some use the EM algorithm as applied by Raudenbush and Bryk (1986), and others use the Fisher scoring approach (Longford, 1987). The estimates produced by all these numerical procedures are maximum likelihood point estimates. They are consistent, efficient, asymptotically unbiased, and normally distributed (Harville, 1977).

A two-stage HLM can be presented in the following general form. Within unit $j$, for $j = 1, \ldots, k$ units, we have

$$Y_j = X_j\beta_j + e_j, \qquad (1.1)$$

where $Y_j$ is an $n_j \times 1$ vector of observations, $X_j$ is an $n_j \times p$ matrix of within-unit predictor variables, $\beta_j$ is a $p \times 1$ parameter vector which captures the relationships between $Y_j$ and $X_j$ within each unit, and $e_j$ is an $n_j \times 1$ vector of random errors which is assumed to have a multivariate normal distribution with mean vector $0$ and variance-covariance matrix $I\sigma^2_j$; that is, $e_j \sim N(0, I\sigma^2_j)$. For $X_j$ with $\mathrm{rank}(X_j) = p$, the ordinary least squares (OLS) estimate of $\beta_j$ is given by $\hat{\beta}_j = (X_j'X_j)^{-1}X_j'Y_j$, with a sampling variance of $\mathrm{Var}(\hat{\beta}_j|\beta_j) = \sigma^2_j(X_j'X_j)^{-1}$. By allowing the within-unit regression coefficients to vary as a function of between-unit variables, we have a multilevel model that describes hierarchically structured data.
Therefore, in the second (between-unit) stage, we have

$$\beta_j = W_j\gamma + U_j, \qquad (1.2)$$

where $W_j$ is a $p \times q$ matrix of known between-unit predictors and $\gamma$ is a $q \times 1$ vector of parameters that capture the effect of the between-unit predictors $W_j$ on the within-unit parameters. $U_j$ is assumed to be $N(0, T)$, where $T$ is the residual variance-covariance matrix for $\beta_j$ after accounting for the effects of $W_j$. Thus, we can obtain a more general representation of the multilevel model by substituting 1.2 into 1.1:

$$Y_j = X_jW_j\gamma + E_j, \qquad (1.3)$$

where $E_j = X_jU_j + e_j$.

One might estimate the effects in 1.3 by using an ordinary least squares (OLS) multiple regression approach to the estimation of multiple simultaneous effects. However, one of the basic statistical assumptions for OLS is the homogeneity of variance of $E_j$. Because $E_j$ is equal to $X_jU_j + e_j$, this assumption can only be valid when either $U_j$ is set equal to zero or $X_j = X$ for all $j$. Setting either $U_j$ equal to zero or $X_j = X$, however, implies holding all the within-unit regression parameters constant across all units and ignoring the hierarchical structure of the data. This leads to the violation of the assumption of independence. The total variance $\mathrm{Var}(\hat\beta_j|W_j)$ of the within-unit regression estimates is made up of their sampling variance $\mathrm{Var}(\hat\beta_j|\beta_j)$ plus their residual variance $\mathrm{Var}(\beta_j|W_j)$. That is,

$$\mathrm{Var}(\hat\beta_j|W_j) = \sigma^2_j(X_j'X_j)^{-1} + T. \qquad (1.4)$$

Many applications have been found which can utilize the statistical model in 1.1 and 1.2. In longitudinal studies (Bryk and Raudenbush, 1987; Laird and Ware, 1982; Strenio, Weisberg and Bryk, 1983), the growth model parameters for each subject become the multivariate outcomes for the between-unit models. Subject characteristics (either observed or controlled) can serve as predictors for these outcomes. In research synthesis (Raudenbush and Bryk, 1985), a simple measurement model sets the observed effect size equal to the true effect size plus some error in each study. This measurement model is a special form of the within-unit model in 1.1. The parameter $\beta_j$ becomes the true effect size in study $j$, and $X_j$ becomes a scalar with a value of one. The true effect sizes vary across studies as a function of known study characteristics plus error, in the form of the between-unit model in 1.2.

A major benefit of fitting models 1.1 and 1.2 to hierarchical data is the possibility of obtaining empirical Bayes estimates of the within-unit regression parameters. Empirical Bayes estimates are more stable and outperform the classical estimates with regard to expected mean squared error. Information from other groups can be used to obtain improved estimates of the within-unit regression parameters. In providing a prediction equation, Braun, Jones, Rubin, and Thayer (1983) presented a situation where there are several predictors and few cases in subgroups of the population, which makes it hard to obtain least squares prediction equations. The objective of their study was to provide separate predictive equations for the Graduate Management Admission Test (GMAT) for the white and minority students in each business school. Only 4% of 8500 students sampled from 59 business schools were minority students. The number of minority students within each school ranged from 2 to 10; this made it hard to obtain a prediction equation for minority students in a particular school. With the HLM analysis, information from other schools was borrowed to obtain the predictive equation for the minority students within each school.
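To make the two-stage structure concrete before turning to the empirical Bayes estimators below, here is a minimal simulation sketch of models 1.1 and 1.2. It is illustrative only and not part of the original study; the group count, sample sizes, predictor values, and parameter values are all invented, and the between-unit equation is arranged so that each column of `gamma` holds the between-unit coefficients for one within-unit coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_j, p, q = 50, 30, 2, 2                       # invented: 50 groups, 30 cases each
sigma2 = rng.uniform(0.5, 2.0, size=k)            # heterogenous within-unit variances sigma^2_j
T = np.diag([0.30, 0.10])                         # residual covariance of beta_j (eq 1.2)
W = np.column_stack([np.ones(k), rng.normal(size=k)])   # intercept plus one between-unit predictor
gamma = np.array([[1.0, 0.5],
                  [0.2, -0.3]])                   # column m: between-unit effects on coefficient m

beta_hat = np.empty((k, p))
for j in range(k):
    beta_j = W[j] @ gamma + rng.multivariate_normal(np.zeros(p), T)       # eq 1.2
    X_j = np.column_stack([np.ones(n_j), rng.normal(size=n_j)])           # within-unit design
    Y_j = X_j @ beta_j + rng.normal(0.0, np.sqrt(sigma2[j]), size=n_j)    # eq 1.1
    beta_hat[j] = np.linalg.solve(X_j.T @ X_j, X_j.T @ Y_j)               # OLS: (X'X)^-1 X'Y

# The dispersion of the OLS estimates mixes sampling variance and T, as in eq 1.4
print(np.cov(beta_hat, rowvar=False))
```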
Assuming that $\sigma^2_j$ and $T$ are known, empirical Bayes estimates of $\beta_j$ and $\gamma$ are given by

$$\beta^*_j = \Lambda_j\hat\beta_j + (I - \Lambda_j)W_j\gamma^*, \qquad (1.5)$$

where

$$\Lambda_j = T(V_j + T)^{-1}, \qquad (1.6)$$

$$V_j = \sigma^2_j(X_j'X_j)^{-1}, \qquad (1.7)$$

and

$$\gamma^* = \Big\{\sum_j W_j'\big[\mathrm{Var}(\hat\beta_j|W_j)\big]^{-1}W_j\Big\}^{-1}\Big\{\sum_j W_j'\big[\mathrm{Var}(\hat\beta_j|W_j)\big]^{-1}\hat\beta_j\Big\}. \qquad (1.8)$$

The first component of $\beta^*_j$ in 1.5 is the OLS estimate of $\beta_j$ from the data within each unit, weighted by its reliability $\Lambda_j$. The second component, $\gamma^*$, is an empirical Bayes estimate of the between-unit parameters $\gamma$ from 1.2. More comprehensive presentations of the HLM and of different methods of estimating the parameters of the model are given by Lindley and Smith (1972), Smith (1973), and Raudenbush (1984, 1988).

Although empirical Bayes methods have been around for a long time, they became increasingly popular and were applied to many types of problems in the early 1970s. Efron and Morris (1972, 1973, 1975, 1977) discussed many applications of empirical Bayes methods. Empirical Bayes estimates of the parameters in hierarchical linear models are obtained when the true values of $\sigma^2_j$ and $T$ are known. In practice, however, these values are rarely known to practitioners and often are estimated from the data. Common estimates used for these two parameters are their maximum likelihood (ML) estimates. Since estimates of the regression coefficients of the HLM are functions of the ML estimates of $\sigma^2_j$ and $T$, they are also considered to be ML estimates (Raudenbush, 1988).

In classical ANOVA procedures, $\sigma^2_j$ and $T$ represent the within- and between-group variance components, respectively. In balanced designs these variance components can be estimated by solving a set of simultaneous linear equations obtained from equating the observed mean squares (which are quadratic forms of the observations) to their expected values (Searle, 1971). For unbalanced designs, there is more than one set of quadratic forms to use for estimating these variance components. Depending on the particular set used, inconsistent estimates of the variance components may be obtained (Searle, 1971).

Introducing HLM to the analysis of educational data can improve the quality of data analysis in education. It also helps applied statisticians obtain sensible answers to questions where standard approaches seem to fail. However, there continue to be obstacles related to HLM applications and theoretical derivations. In some applications where the assumption of homogeneity of variance cannot be verified, drawing inferences on the variance components becomes part of the research interest, especially for the variances of the within-unit residuals (Leonard, 1975; Rao, 1970; Raudenbush and Bryk, 1987). The motivation for making such inferences on the variances can be attributed to the intrinsic interest in the variances themselves and to the fact that they are part of the standard errors of the estimates of the regression coefficients. In other applications, where we have small samples within each group and a small number of groups, maximum likelihood estimates of the variance components $\sigma^2_j$ and $T$ might become unstable. Substituting these unstable estimates as true values for the variance components will distort the regression effect estimates.
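As an illustration only (not code from the dissertation), the estimators in 1.5 through 1.8 can be written directly, with $\sigma^2_j$ and $T$ supplied as if they were known; the function and argument names below are mine.

```python
import numpy as np

def empirical_bayes(beta_hat, XtX_list, W_list, sigma2, T):
    """Compute gamma* (eq 1.8) and beta*_j (eq 1.5) treating sigma2_j and T as known."""
    p = T.shape[0]
    # Var(beta_hat_j | W_j) = sigma2_j (X_j'X_j)^-1 + T   (eq 1.4, 1.7)
    V = [s2 * np.linalg.inv(XtX) for s2, XtX in zip(sigma2, XtX_list)]
    D = [Vj + T for Vj in V]
    A = sum(Wj.T @ np.linalg.solve(Dj, Wj) for Wj, Dj in zip(W_list, D))
    b = sum(Wj.T @ np.linalg.solve(Dj, bh) for Wj, Dj, bh in zip(W_list, D, beta_hat))
    gamma_star = np.linalg.solve(A, b)                                   # eq 1.8
    beta_star = []
    for bh, Vj, Wj in zip(beta_hat, V, W_list):
        Lam = T @ np.linalg.inv(Vj + T)                                  # eq 1.6 reliability matrix
        beta_star.append(Lam @ bh + (np.eye(p) - Lam) @ (Wj @ gamma_star))   # eq 1.5
    return gamma_star, np.array(beta_star)
```

Groups whose sampling variance $V_j$ is large relative to $T$ have low reliability $\Lambda_j$ and are pulled strongly toward the predicted value $W_j\gamma^*$.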
Purpose of the Study

The purpose of this study is to obtain marginal posterior distributions for all effect parameters and variance components in a linear mixed model with a random intercept, without relying on the assumption of homogeneity of the residual variance $\sigma^2_j$ across groups. Non-informative priors will be posed on the hyper-parameters that describe the distributions of the random parameters. The marginal posterior distribution of each parameter will then be approximated using the Gibbs sampling method (Tanner, 1993). When research interest is focused on making inferences on variances, finding the entire posterior distribution of each of the variance components is more informative than having only a point estimate of the variance. Having the entire posterior distribution at his/her disposal allows the researcher the flexibility to choose an appropriate estimate for the variance. Further, s/he can establish probability intervals around the estimate.

Objectives of the Study

The idea of approximating the marginal posterior distributions of all parameters in the model leads to the question of how feasibly to do so. Therefore, the questions of this study are:

1- How do the posterior mean (Gibbs) estimates of the parameters of a mixed model with heterogenous within-group variance differ from their empirical Bayes estimates, which are based on the assumption of homogeneity of within-group variance? (Empirical Bayes estimates of the within- and between-unit regression coefficients were conditioned on ML estimates of the variance components under homogeneity of within-group variance.)

2- How do inferences about the regression coefficients change when taking into account the uncertainty in estimating the variance components?

3- Can a single estimate be used as a typical value for all the different residual variances $\{\sigma^2_j\}$?

4- How precise are the posterior mean (Gibbs) estimates of the residual variances $\{\sigma^2_j\}$ in estimating their single typical value?

The last two questions focus on the parameters of the prior distribution of the residual variances $\{\sigma^2_j\}$. Inferences can be made on these two parameters from their approximated marginal posterior distributions.

Bayes Solution

If reasonable priors are posed for all the unknown parameters in the model, then, theoretically, the marginal posterior distributions of the parameters of interest can be obtained by integrating out the other, nuisance parameters in the model. Bayes estimates for these parameters can then be derived from their respective marginal posterior distributions based on a certain loss function. From a Bayesian viewpoint, estimating a parameter $\theta$, given its sufficient estimate $\hat\theta$ from the sample, implies selecting a decision function, say $D$, so that $D(\hat\theta)$ is a predicted value of $\theta$ when both the computed value $\hat\theta$ and the posterior density function $f(\theta|\hat\theta)$ are known. Often, the researcher predicts an experimental value of any random variable to be its expected value; however, the median or the mode can also be used as predicted values. In many cases, it is desirable that the choice of the decision function $D$ depend upon the loss function, say $L[\theta, D(\hat\theta)]$, that we try to minimize (Hogg and Tanis, 1977). For example, when the loss function is given by

$$L[\theta, D(\hat\theta)] = [\theta - D(\hat\theta)]^2, \qquad (1.9)$$

then the Bayes solution that minimizes this loss function is given by $D(\hat\theta) = E(\theta|\hat\theta)$. It is also possible to obtain Bayes probability confidence intervals for the estimates directly from their respective marginal posterior distributions.
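Once a Gibbs sampler (Chapter 4) supplies draws from a marginal posterior, the Bayes estimate under the squared-error loss in 1.9 is simply the mean of those draws, and probability intervals come from the empirical quantiles. A minimal sketch follows; the draws below are an arbitrary stand-in, not output from the study's sampler.

```python
import numpy as np

draws = np.random.default_rng(1).chisquare(df=5, size=5000)   # placeholder for Gibbs draws of one parameter

posterior_mean = draws.mean()                       # Bayes estimate under squared-error loss (eq 1.9)
posterior_median = np.median(draws)                 # Bayes estimate under absolute-error loss
interval_95 = np.quantile(draws, [0.025, 0.975])    # central 95% posterior probability interval
```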
A problem associated with the application of Bayesian methods to hierarchical models is the difficulty of the mathematics involved in the derivation of the posterior distributions (Lindley and Smith, 1972). The multiple dimensions of the parameter space in the model make it difficult to express some of the required equations in closed form. Therefore, certain marginal posterior distributions cannot be obtained analytically through direct evaluation of an integral, and in many applications integrating out the nuisance parameters is a difficult job. Consequently, empirical Bayes estimates are used to approximate truly Bayesian estimates. In some situations we might find that the focus of the research is centered on making inferences about $\sigma^2_j$ and $T$ rather than about the regression coefficients in the model. Then, obtaining the marginal posterior distributions of these parameters is more informative than obtaining their single point estimates. Even when concern focuses on the regression coefficients, it is useful to ensure that posterior uncertainty regarding these coefficients fully reflects posterior uncertainty about the variance-covariance components.

Importance of the Study

In educational settings, where we have several classrooms or several schools, interest is mainly focused on studying the factors that affect students' academic achievement. Traditionally, student achievement is evaluated by comparing schools or classrooms on their means on some measure of academic achievement, usually students' test scores. Often, important decisions such as school funding or teacher promotion are based on this evaluation. The fairness of these decisions depends on whether the test score means in different schools or classrooms truly represent students' academic achievement. Many factors that could affect these means can be controlled statistically through model specification. However, it is still possible that for a slightly higher test score mean, one school will get more funds than others, or a specific teacher may get promoted. It is also possible that the school that got extra funds or the teacher who got promoted produced higher variability in their students' academic achievement. A successful educational program or an effective teacher should arguably strive not only to increase and facilitate students' academic achievement, but also to reduce the gap between students' learning and produce equality in their achievement (Bloom, 1984). Therefore, the evaluation process should be based not only on average achievement, but also on the equity of the achievement. Consequently, any improvement in the methods of estimating these two criteria certainly helps the decision maker in his/her evaluation of the educational system.

Another example is drawn from research syntheses (Raudenbush and Bryk, 1985; Rubin, 1981), where a simple model sets the observed effect size equal to the unknown true effect size plus some error for each study. The objective is to obtain an efficient estimate of each study's effect size as well as its precision. It is also common in some cases to have a number of groups, each with a small number of observations obtained under the same conditions. For example, in longitudinal studies, the growth of subjects of the same age can be observed over several points in time. One might become interested in studying the growth rate of a particular subject in comparison to other subjects.
The replicate observations on the subject can then be used for variance estimation.

In all of the above examples, there are two sets of parameters (means and variances), with several parameters of the same type in each set. Within each set, these parameters are related by common circumstances: schools or classrooms in the first example, studies in the second example, and the individual subject in the last example. Within the Bayesian framework, it is logical to assume that these many parameters that share common circumstances have a common parametric prior distribution. This distribution summarizes the information about these parameters prior to data collection. This allows the researcher to use such prior knowledge about these parameters to obtain improved estimates.

With the help of modern computers and developments in simulation theory, research interest is being directed towards approximating the posterior distributions of the parameters in the model being investigated and obtaining their Bayesian estimates (Tanner and Wong, 1987). In his work on hierarchical linear models, Seltzer (1988) adopted the data augmentation method to obtain Bayesian estimates of variance components. He also investigated Bayes estimates of the within-school regression coefficients when they are assumed to have a t-distribution (see also Seltzer, 1993). Fotiu (1989) also applied the method of data augmentation in estimating the joint posterior distributions of the effect parameters (within- and between-unit) when analyzing many groups. Both Seltzer and Fotiu approximated the posterior distribution of $\sigma^2_j$ under the assumption that it is equal to $\sigma^2$ across all groups. This assumption is usually made for convenience, rather than because it necessarily holds true. By making this assumption, the researcher can pool all the observations from all the groups to get a large-sample point estimate of $\sigma^2$. However, when the marginal posterior distributions of all the parameters in the model can be approximated, this assumption is no longer essential to the analysis. In fact, sometimes the main objective of the statistical analysis is to estimate the posterior distribution of each $\sigma^2_j$.

In conclusion, the objective of this study is to investigate the use of a mixed linear model with heterogenous variances $\{\sigma^2_j\}$ across groups. A linear regression model with a random intercept will be used to represent the relationship between the criterion variable and the predictors in each group. By obtaining the marginal posterior distributions of all the parameters in the model, the practitioner can obtain Bayes estimates of the effect parameters as well as of the variance-covariance components. Based on the theoretical formulation of the model, a number of simulated data sets will be generated to verify the accuracy of the estimation under different parameter specifications. These specifications reflect the degree of heterogeneity of variance, different group sizes, and the complexity of the regression model for each group. The estimation process will also be applied to a national sample of U.S. high schools in which math achievement has been studied as a function of school and student characteristics. One important feature of this data set is the heterogeneity of the residual variance across schools. Raudenbush and Bryk (1987) investigated the heterogeneity of variance in this data set through the application of hierarchical linear models.
Organization of the Study

Chapter 2 will present a statement of the problem and a review of the literature on the topic. That chapter will also highlight some of the problems in variance estimation and their relation to the HLM. In Chapter 3, a general linear mixed model with a random intercept will be presented to represent the several applications described in this study. In addition, a comprehensive description of the assumptions associated with the model will be provided within the Bayesian framework. Chapter 4 will explain the Gibbs sampler and its application in achieving the objectives of the study. An example will be provided to illustrate the iteration process of Gibbs sampling. Chapter 5 will present the derivation of the conditional distributions of the parameters needed in the iteration process. It will also provide a description of the computational steps for the iteration process presented in Chapter 4. Chapter 6 will present the results of the small simulation study as well as the results of an analysis of mathematics achievement data from U.S. high schools. Discussion of the results for the generated data and the U.S. high school data, as well as conclusions, will be presented in Chapter 7.

CHAPTER 2

Statement of the Problem

Variance estimation plays an important role in almost every kind of quantitative research. A basic requirement of nearly all forms of analysis is that a measure of precision be provided for each estimate derived from the data. The most commonly used measure of precision of an estimator is the reciprocal of its variance. The sample variance is used to estimate the precision of the mean or other location estimates. It is also used to check for gross errors affecting a single observation. Further, it is used to provide an estimate of a variance component, such as a pooled within-sample variance, in several kinds of statistical procedures.

In HLM analysis, the two variance-covariance components $\sigma^2_j$ and $T$ are used in obtaining empirical Bayes estimates of the effect parameters in equations 1.1 and 1.2. Compared to the Bayesian approach, the empirical Bayes approach does not require the specification of prior distributions for $\sigma^2_j$ and $T$. The two-level hierarchical linear model and its assumptions for the empirical Bayes approach can be restated as

$$Y_j = X_j\beta_j + e_j, \quad \text{where } e_j \sim N(0, I\sigma^2_j), \qquad (2.1)$$

for the first level, and

$$\beta_j = W_j\gamma + U_j, \quad \text{where } U_j \sim N(0, T) \text{ and } \gamma \sim N(\Theta, \Gamma), \qquad (2.2)$$

for the second level. The joint posterior distribution of the effect parameters $\beta_j$ and $\gamma$ is conditioned on the values of $\sigma^2_j$ and $T$ (Lindley and Smith, 1972). The joint density function $f_1(\beta_j, \gamma | Y_j, \sigma^2_j, T)$ for this posterior distribution can be expressed as

$$f_1(\beta_j, \gamma | Y_j, \sigma^2_j, T) \propto f_2(Y_j|\beta_j, \sigma^2_j)\, f_3(\beta_j|\sigma^2_j, \gamma, T)\, f_4(\gamma), \qquad (2.3)$$

where $f_2(Y_j|\beta_j,\sigma^2_j)$ is the likelihood function of the data, $f_3(\beta_j|\sigma^2_j,\gamma,T)$ represents the conditional prior density function of $\beta_j$, and $f_4(\gamma)$ is the density function of a noninformative prior for $\gamma$. The values of $\sigma^2_j$ and $T$ are often estimated by one of several methods of numerical estimation. One common method is the EM algorithm (Dempster, Laird and Rubin, 1977), which produces their maximum likelihood estimates. These estimates are asymptotically normally distributed, unbiased, and efficient.

Problems Associated with the Estimation of $\sigma^2$ and $T$

In some applications, the sample size within each group, as well as the number of groups, might not be large enough for the estimates of $\sigma^2_j$ and $T$ to attain these asymptotic properties.
Nevertheless, practitioners often use maximum likelihood estimates of $\sigma^2_j$ and $T$ in place of their true values in order to obtain the conditional posterior distributions of the effect parameters.

Problems Associated with Estimating $T$

In 1.2 of the HLM, $T$ represents the residual variance-covariance matrix of the within-group parameters $\beta_j$ across all the groups. The shape of the sampling distribution of this variance-covariance matrix depends on the number of groups in the study. Consider the model investigated in this study, for example, where only the intercept is considered random. The matrix $T$ then becomes the scalar $\tau^2$, which is the variance of the intercept across all the groups. When the number of groups is relatively small, it is possible that the distribution of an estimate of $\tau^2$ becomes highly skewed. Using the mode of that distribution (the maximum likelihood estimate) as an estimate of $\tau^2$ might not be a good representation of its true value.

In a validation study conducted on eight law schools, Rubin (1983) pointed out the problem of using maximum likelihood estimates of $\tau^2$ when there is a small number of schools in the study. He stated that "The specific problem here is that the likelihood of $\tau^2$ is not at all symmetric about the maximum likelihood estimate, and thus this estimate is not representative of reasonable values for $\tau^2$. Integrating over $\tau^2$ in such cases is a much more reasonable way to summarize evidence than to fix $\tau^2$ at some value" (Rubin, 1983, p. 15). Notice that the estimated value of $\tau^2$ is substituted into 1.4 to 1.8 to obtain the precision weights for the empirical Bayes estimates of the regression coefficients of the HLM. Consequently, invalid estimation of this parameter causes the estimates of the regression coefficients in 1.5 and 1.8, and their reliability estimates in 1.6, to be distorted.

Problems Associated with an Invalid Assumption of Homogeneity of Variance

When maximum likelihood estimates are obtained for the variance components, the $\{\sigma^2_j\}$ for $j = 1, \ldots, k$ are often assumed homogeneous. Researchers often assume homogeneity of variance across all groups out of convenience rather than conviction. This assumption allows the pooling of the observations from all the groups to obtain one large-sample maximum likelihood estimate of $\sigma^2$ for all the groups. Berlin (1984) stated that "large data sets are rarely homogeneous in their precision and that usual statistical analysis fails if it does not take such differences into consideration" (p. 209).

When the number of observations in each group is large enough to allow for consistent maximum likelihood estimation, Mason, Wong and Entwisle (1984) advised that a separate variance estimate should be obtained for each group. However, when the number of observations in each group is not large enough to allow for consistent maximum likelihood estimation and the variances are not equal, pooling the observations from all the groups to obtain a single ML estimate of $\sigma^2$ might give an inaccurate estimate of the variance. Assuming homogeneity of within-group variance, Bassiri (1988) showed that the within-group variance estimate becomes unstable when the groups' sample sizes are relatively small. This could lead to invalid estimates of the regression coefficients $\beta_j$ and of their precision (see equations 1.4 and 1.5). Consequently, inferences about, or confidence intervals for, the regression effects would be invalid.
Equation 1.4 shows that, when the true values of $\sigma^2_j$ and $T$ are used, part of the variability in the estimated variance of the regression effect $\beta_j$ stems from the variability in $X_j$ across groups. Another part stems from the variability in the $\sigma^2_j$'s when no homogeneity of variance assumption is made. By assuming homogeneity of variance, $\sigma^2_j = \sigma^2$, we are ignoring part of the variability in the variances of the regression effects. The part being ignored is attributable to the variability in the residual variances.

For illustrative purposes, consider the case where $X_j$ is assumed equal to $X$ across the $k$ groups, for $j = 1, \ldots, k$. Thus, the within-unit regression coefficient estimates become $\hat\beta_j = (X'X)^{-1}X'Y_j$. When homogeneity of variance holds, the variance-covariance matrix of these estimates can be derived from 1.4 as

$$\mathrm{Var}(\hat\beta_j|W_j) = \sigma^2(X'X)^{-1} + T. \qquad (2.4)$$

Equation 2.4 implies that the within-unit effects are estimated with the same precision across all $k$ groups. However, if there is considerable variability in the residual variances across groups, this estimate of precision becomes invalid, and the estimates of the regression effects no longer have the same precision.

In a classical analysis, when researchers study a treatment effect, they compare the means of the criterion variable from different treatments. One of the assumptions they often make is homogeneity of variance. The justification for this assumption is to obtain a single pooled estimate of the within-treatment variance to represent the error term in the analysis so that the treatment effect can be tested. Bryk and Raudenbush (1988) showed that heterogeneity of variance in the traditional analysis of treatment effects can be seen as evidence of an interaction between treatment and subject-specific characteristics. They warned researchers about the problem of obtaining biased estimates of the treatment effect when ignoring heterogenous variances in the classical analysis.

Consider the classical case, often referred to as the Behrens-Fisher case, of testing the hypothesis $\mu_a - \mu_b = 0$ with unequal population variances, $\sigma^2_a \ne \sigma^2_b$. Using $\bar{X}_a - \bar{X}_b$ as an estimate of $\mu_a - \mu_b$, the sampling distribution of the statistic

$$Z = \frac{(\bar{X}_a - \bar{X}_b) - (\mu_a - \mu_b)}{\sqrt{\dfrac{\sigma^2_a}{n_a} + \dfrac{\sigma^2_b}{n_b}}} \qquad (2.5)$$

is $N(0, 1)$. The given hypothesis cannot be tested without knowing $\sigma^2_a$ and $\sigma^2_b$. When the sample sizes $n_a$ and $n_b$ are both large enough, we can substitute the unbiased estimates of the variances, $s^2_a$ and $s^2_b$, for the corresponding parameters, and the resulting statistic has a sampling distribution approximated by $N(0, 1)$. In practice, however, when $n_a$ and $n_b$ are both relatively small and noticeably unequal, the resulting statistic

$$Z^* = \frac{(\bar{X}_a - \bar{X}_b) - (\mu_a - \mu_b)}{\sqrt{\dfrac{s^2_a}{n_a} + \dfrac{s^2_b}{n_b}}} \qquad (2.6)$$

has what is known as the Behrens-Fisher distribution (Winer, 1971). Therefore, when there are both heterogeneity of variance and a noticeable difference in sample sizes between treatments, the homogeneity of variance assumption becomes questionable.
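The statistic in 2.6 is easy to compute; the difficulty lies in its reference distribution. One common workaround, not discussed above, is the Welch-Satterthwaite approximation, which replaces the Behrens-Fisher distribution with a $t$ distribution whose degrees of freedom are estimated from the data. A small sketch with invented samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
xa = rng.normal(0.0, 1.0, size=8)     # small, unequal samples with unequal variances
xb = rng.normal(0.3, 2.5, size=20)

va, vb = xa.var(ddof=1) / xa.size, xb.var(ddof=1) / xb.size
z = (xa.mean() - xb.mean()) / np.sqrt(va + vb)          # statistic in eq 2.6 under mu_a = mu_b
df = (va + vb) ** 2 / (va**2 / (xa.size - 1) + vb**2 / (xb.size - 1))   # Welch-Satterthwaite df
p_value = 2 * stats.t.sf(abs(z), df)
```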
The Analysis of Many Group Variances $\sigma^2_j$

A common situation where many variances need to be estimated arises when there is heterogeneity of variance in linear models. Rao (1970) introduced a method called Minimum Norm Quadratic Unbiased Estimation (MINQUE) for estimating the residual variances in linear models when these variances are found to be heterogeneous. MINQUE estimates are a linear function of the distinct variances under certain conditions. These conditions are related to the choice of the matrix that creates a quadratic form in the outcome variable. If some of the variances are the same, their corresponding coefficients in the linear function can be chosen to be the same. The variances estimated using the MINQUE method can be used to obtain improved estimates of the coefficients of the linear model. Further, they can be used in obtaining an estimate of the precision of the simple least squares estimator of the regression coefficients, or of any linear function of these coefficients.

Cook and Weisberg (1983) modeled the variability in the variances as a function of some explanatory variables. The $n$ residuals from a linear regression model are assumed to have a multivariate normal distribution with mean zero and variance-covariance matrix $\sigma^2 W$, where $W$ is a diagonal matrix with all diagonal elements $w_i > 0$ for $i = 1, \ldots, n$. Estimating the residual variances implies estimating the elements of $W$. The variability in $w_i$, which represents the variability in the variances, is expressed as a function of a $1 \times q$ row vector of known explanatory variables $z_i = (z_{ij})$ for $j = 1, \ldots, q$ and a $q \times 1$ vector of unknown parameters $\lambda$, as in the following two models:

$$w_i = \exp\Big(\sum_{j=1}^{q} \lambda_j z_{ij}\Big), \qquad (2.7)$$

or

$$w_i = \exp\Big(\sum_{j=1}^{q} \lambda_j \log(z_{ij})\Big). \qquad (2.8)$$

Note that the variables $z_{ij}$ can take negative values in the first model but not in the second. The predicted values of $w_i$ from the above models can then be used to estimate the residual variances from their variance-covariance matrix $\sigma^2 W$.

Cox and Solomon (1986) have also dealt with the issue of estimating many variances. They provided examples illustrating the problem of estimating many variances from small samples. One of the examples covers the situation where there is a systematic difference in variance between samples. Observations within a sample are assumed to come from a $N(\mu_i, \sigma^2_i)$ population, where $i = 1, \ldots, k$. The $i$th variance $\sigma^2_i$ is a function either of an explanatory variable $z_i$ characterizing the $i$th population, or of $\mu_i$. They adopted similar versions of the models in 2.7 and 2.8 to represent the systematic change in the variances.

They also presented another example for the case where the changes in the variances are considered random. They consider it a complement to the above case (where changes in the variances are systematic) and call it the "overdispersion model" (Cox and Solomon, 1986, p. 544). In this case, the different population variances are considered to be independent unobserved values of a random variable $\sigma^2$. This variable has an inverse gamma distribution with density

$$\Big(\tfrac{1}{2}\nu_0\sigma^2_0\Big)^{\nu_0/2}\,\big(\sigma^2\big)^{-(\nu_0/2 + 1)}\,\exp\Big(-\frac{\nu_0\sigma^2_0}{2\sigma^2}\Big)\,\Big\{\Gamma\Big[\tfrac{1}{2}\nu_0\Big]\Big\}^{-1}, \qquad (2.9)$$

where $\nu_0$ represents the degrees of freedom and $\sigma^2_0$ is some constant. Given the inverse gamma density in 2.9 and considering the $i$th population mean $\mu_i$ as a nuisance parameter, the estimation of the $i$th population variance $\sigma^2_i$ is based on the marginal likelihood of the $i$th sample. The underlying idea of the above model suggested by Cox and Solomon (1986) coincides very closely with Bayesian thinking.
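The overdispersion model in 2.9 is straightforward to simulate, since a scaled inverse chi-square draw is just $\nu_0\sigma^2_0$ divided by a $\chi^2_{\nu_0}$ draw. The sketch below (with invented parameter values) generates many such variances and checks their average against the theoretical mean $\nu_0\sigma^2_0/(\nu_0 - 2)$:

```python
import numpy as np

rng = np.random.default_rng(3)
nu0, s0_sq, k = 8.0, 1.5, 10_000     # invented degrees of freedom, scale, and number of groups

sigma2 = nu0 * s0_sq / rng.chisquare(df=nu0, size=k)   # draws from the density in eq 2.9

print(sigma2.mean(), nu0 * s0_sq / (nu0 - 2))          # sample mean vs. theoretical mean
```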
Bayesian Approach

Lindley (1971) used the Bayesian approach in the estimation of many different means and variances. In the case where both means and variances are unknown but have exchangeable distributions, he used the conjugate priors for these distributions to derive the joint posterior density of the means and variances. The joint density of all the parameters was based on a three-stage hierarchical model. The first stage describes the data and its likelihood function given the means and the variances. That is, given $\{\mu_j\}$ and $\{\sigma^2_j\}$, the data $Y_{ij}$ are independent and normally distributed with $E(Y_{ij}) = \mu_j$ and $\mathrm{Var}(Y_{ij}) = \sigma^2_j$ for $i = 1, \ldots, n_j$; $j = 1, \ldots, k$. In the second stage, the means $\{\mu_j\}$ and the variances $\{\sigma^2_j\}$ are assumed to be independent, with two different exchangeable distributions. The exchangeable distribution for the means $\{\mu_j\}$ is assumed to be normal with mean $\theta$ and variance $\tau$. The exchangeable distribution for the variances $\{\sigma^2_j\}$ is an inverse chi-square, where the variable $\nu_0\sigma^2_0/\sigma^2_j$ is distributed as $\chi^2$ with $\nu_0$ degrees of freedom and $\sigma^2_0$ is a typical value for $\sigma^2_j$. The third stage describes the prior knowledge about the parameters in the second stage. For the parameters of the normal distribution, a conjugate prior distribution for the mean $\theta$ with vague prior knowledge produces a locally uniform prior. Similar to $\sigma^2_j$, the variance $\tau$ is assumed to have an inverse chi-square prior, where $\nu_\tau\sigma^2_\tau/\tau$ is distributed as $\chi^2$ with $\nu_\tau$ degrees of freedom and $\sigma^2_\tau$ is a typical value for $\tau$. Since $\sigma^2_j$ is distributed as an inverse chi-square, a conjugate prior for $\sigma^2_0$ given $\nu_0$ is a function of $\chi^2$ with $r$ and $\lambda$ as two constants describing the distribution. The parameter $\nu_0$ is assumed to have a uniform prior on $(0, \infty)$.

The joint posterior density of the means $\{\mu_j\}$ and the variances $\{\sigma^2_j\}$ was derived by integrating out all other parameters ($\theta$, $\tau$, $\nu_0$, and $\sigma^2_0$) from the overall joint posterior density function. The derived estimates were based on a simple model in which there were several groups with different means and variances. If a more complex linear model with several predictors were used, the estimation process would become more complex. The added coefficients for the predictors in the linear model make integrating the joint density function of the parameters analytically extremely difficult.

Leonard (1975) provided a procedure for modeling the variability in the means and the variances from several normal populations using the Bayesian approach. His model for the means, when the variances are known, represents a general case of the one presented by Lindley (1971). When the population means are known, Leonard expressed the log-transformed variances as a function of some explanatory variables using a linear model similar to the one for the means. The log-transformed variances were assumed to have an asymptotic normal exchangeable distribution similar to the one for the means. Expressing the variances, or any transformation of them, as a function of other variables is similar to what Cook and Weisberg (1983) did, as shown in equation 2.7. The major difference here is that Cook and Weisberg consider $\lambda_j$ in 2.7 to be fixed unknown parameters that need to be estimated, while Leonard proposed the use of prior knowledge about a similar parameter in the linear model for the log-transformed variances. The resulting estimates of the means and variances are shrinkage estimates.
He substitutes the shrinkage variance estimates in the expression for the means to obtain their new estimates. The new estimates of the means are then used to find new shrinkage estimates for the variances. Raudenbush and Bryk (1987) tackled the issue of estimating variances from many groups as a special case of a more general problem of estimating the parameters of a two— stage HLM. They provided two different methods for obtaining empirical Bayes estimates for the variances. The conceptualization of their models is similar to those presented by Lindley (1971) and Leonard (1975). In the first method, estimates were derived using the exact distribution (chi—square) of the variance estimate within each group. That is, sf ~ oixifl/(ni—l) . (2.10) 32 Assuming that the variances have an exchangeable prior distribution, the variable vgof/oi has a chi-square distribution with vo degrees of freedom and 03 represents the typical value for the variances prior to observing the data. This distribution is the natural conjugate prior for the one in 2.10. Thus, the variance 0% is distributed as vqofxfij, where xfij denotes a ‘variable with an inverse chi-square distribution. In the second method, however, estimates were derived using the asymptotic normal approximation to the sampling distribution of the logarithmic transformed variance estimate within each group. For the first stage of the HLM di=6i+ei, where ei~.N(0,(ni—1)”/2) for i=1, ,k, (2-11) where cQ=-%[log(sf)-ci], Ci is a bias correction, and 6i=-%log(o§). For the second stage of the HLM 5. = W/y + U1, where U1” N(O,‘;1) for i=1, ,k , (2.12) .1 where W1 is a Mx1 vector of known predictors, y is a Mx1 vector of effect parameters, and [Q is a random error normally distributed with mean equals zero and variance equal ¥;. The use of the normal approximation to the sampling distributions of the variances permits the use of the above formulation and allows for the hypothesis testing associated with it. Estimates from both methods (exact distribution and normal approximation) are shrinkage estimates. They are 33 weighted averages of two components. For the exact distribution, the conditional posterior mean estimate for the variance is given by 32 = W , (2.13) ni+ v0— 3 The first component is the usual estimate of the variance based on the observations within each group weighted proportionally by its degrees of freedom. The second is the typical value of of the variances in their prior distribution weighted by the concentration of these variances va around that typical value. For the normal theory, however, the empirical Bayes estimate for the logarithmic transformed variance is given by (2.14) ~ 51 = Xidi + (1_Xi)Wi/T r where Ai=(ni-1)/(ni+9-1), and 9 is estimated numerically. The first component is the ordinary estimate of the logarithmic transformation of the variance estimate within a group, weighted proportionally by its degrees of freedom. The second component is a predicted value based on information from all groups, weighted proportionally by the concentration of the parameters estimated by the first component around that predicted value. The resulting estimates fronlboth.methods are conditional estimates. In the method where exact sampling distributions of the variance estimates are being used, the empirical Bayes 34 estimates for the variances are conditioned on knowing the true 'values for the parameters of and v0. In .practice, however, the true values of these two parameters are usually unknown. 
To obtain the shrinkage estimates of the variances in 2.13, $\sigma^2_0$ and $\nu_0$ have to be estimated. Based on the EM algorithm approach to numerical estimation (Dempster, Laird and Rubin, 1977), Kasim (1986) developed a procedure for estimating $\sigma^2_0$ and $\nu_0$. The complete data for estimating $\sigma^2_0$ and $\nu_0$ are $\{s^2_j\}$ and $\{\sigma^2_j\}$ for $j = 1, \ldots, k$, where only the $\{s^2_j\}$ have been observed. The parameters $\sigma^2_0$ and $\nu_0$ are estimated by maximizing their likelihood function $L(\{\sigma^2_j\}|\nu_0, \sigma^2_0)$ (M-step), assuming that the $\{\sigma^2_j\}$ have been observed. The sufficient statistics for estimating $\sigma^2_0$ and $\nu_0$ are functions of the $\{\sigma^2_j\}$. Since the $\{\sigma^2_j\}$ cannot be observed, the posterior expectations (E-step) of these sufficient statistics are used to estimate $\sigma^2_0$ and $\nu_0$. Finding the posterior expectations of the sufficient statistics, however, depends on knowing the values of $\sigma^2_0$ and $\nu_0$ as well as the data $\{s^2_j\}$. This dependency between the posterior expectations and $\sigma^2_0$ and $\nu_0$ is used in an iterative process to estimate $\sigma^2_0$ and $\nu_0$. Therefore, given the observed data $\{s^2_j\}$ and initial estimates of $\sigma^2_0$ and $\nu_0$, the values of the sufficient statistics are estimated by their posterior expectations. The estimated sufficient statistics are then used in maximizing $L(\{\sigma^2_j\}|\nu_0, \sigma^2_0)$ to obtain new estimates of $\sigma^2_0$ and $\nu_0$. Going back and forth between the expectation step and the maximization step, we obtain reasonable estimates of $\sigma^2_0$ and $\nu_0$. In fact, each step of this iteration process increases $L(\{\sigma^2_j\}|\nu_0, \sigma^2_0)$, which provides better estimates of $\sigma^2_0$ and $\nu_0$ than those from the step before. The iteration process is terminated when the absolute change in the values of $\sigma^2_0$ and $\nu_0$ between any two steps is sufficiently small.

In the method that uses the normal approximation to the sampling distribution of the variance estimates, the empirical Bayes estimates of the variances are conditioned on the residual variance $V$ of the second-stage model (see 2.12). The true value of this parameter is unknown, and it must be estimated. Similar to the way that $\sigma^2_0$ and $\nu_0$ are estimated, the maximum likelihood estimate $\hat V$ is obtained numerically via the EM algorithm.

The work of Lindley (1971), Leonard (1975), and Raudenbush and Bryk (1987) on estimating variances from many groups shares the characteristic of borrowing information from other groups to obtain an improved estimate of the variance for a particular group. As an application of this idea, Singh and Sedransk (1988) provided an example of obtaining improved estimates of strata variances when strata sample sizes are small. The improved variance estimator uses data borrowed from other strata initially thought to be similar. The resulting estimate is a shrinkage variance estimate for the stratum, similar to the one in 2.14.

Comments on the Analysis of Many Variances

There are still several theoretical and practical weaknesses in estimating and investigating the variability in many variances. Empirical Bayes estimates of the variances, such as those suggested by Raudenbush and Bryk (1987), for example, are not adjusted for the uncertainty in estimating the conditioning parameters $\sigma^2_0$ and $\nu_0$ in the exact method, and $V$ in the normal approximation method. This is similar to the problem (Tanner and Wong, 1987) of obtaining empirical Bayes estimates of the effect parameters in HLM without adjusting these estimates for the uncertainty in estimating the variance components. The focus on obtaining conditional point estimates of the variances in the model, rather than deriving the entire posterior distribution of each variance, can cause some problems.
In some cases, the misrepresentation of the true values of the conditioning (given) parameters will lead to inaccurate conditional estimates of the variances. For example, in the case of the exact theory of Raudenbush and Bryk (1987), the variability of the group variances is inversely proportional to the conditioning parameter $\nu_0$ plus some constant. Therefore, if there is clear evidence of heterogeneity of variance, then the likelihood of $\nu_0$ is not at all symmetric around its maximum likelihood estimate. Consequently, this estimate is an inaccurate representation of the true value of $\nu_0$. Conditioning the variance estimates for the different groups on the estimated value of this parameter will distort these variance estimates. This problem is similar to the one presented earlier by Rubin (1983) for estimating $\tau^2$ when the number of groups is small.

The normal approximation to the distribution of the estimated variances is another crucial point in the analysis of many variances. Bartlett and Kendall (1946) reported that the normal approximation is sufficiently accurate when the sample size in each group is at least 10. For smaller samples, the normal approximation to the distribution of the estimated variances might be less accurate. Inaccuracy in the normal approximation can lead to an inappropriate representation of the model. A severe loss of efficiency in estimation can occur when sample sizes are less than 5 (Cox and Solomon, 1986).

Adjusting for the Uncertainty in Estimating $\sigma^2_j$ and $T$

Standard methods do not exist for adjusting the conditional posterior distribution of the effect parameters to reflect the increased variability due to the uncertainty of estimating $\sigma^2_j$ and $T$ (Tanner and Wong, 1987). The estimates of the regression coefficients given in 1.5 to 1.8 are conditioned on knowing the true values of $\sigma^2_j$ and $T$. In practice, however, these parameters are usually unknown, and their ML estimates are often used to allow for empirical Bayes estimation of the regression coefficients. Therefore, part of the variability in the empirical Bayes estimates of the regression coefficients should be attributed to the uncertainty in estimating $\sigma^2_j$ and $T$. Having the marginal posterior distributions of the regression coefficients will facilitate obtaining their estimates without conditioning on $\sigma^2_j$ and $T$. Consequently, the estimates will be correctly adjusted for the uncertainty of estimating the variance components $\sigma^2_j$ and $T$.

In summary, the problem in getting empirical Bayes estimates of the effect parameters in multilevel models is centered on getting appropriate estimates of $\sigma^2_j$ and $T$. When the sample sizes and the number of groups are large enough, maximum likelihood estimates of $\sigma^2_j$ and $T$ are a good solution to the problem of getting those empirical Bayes estimates. However, when there is heterogeneity of variance and the number of observations within each group, as well as the number of groups, is not large enough to allow asymptotic estimation, maximum likelihood estimates of $\sigma^2$ and $T$ may not be appropriate; it is essential that one move toward being fully Bayesian in this analysis.

CHAPTER 3

Model Specification

This chapter provides a description of the statistical model used in this study and its underlying theory. The basic assumptions for this model are provided within the Bayesian framework. The joint posterior distribution of the parameters in the model is also presented.
A common characteristic often shared by evaluation studies that involve social interventions, like a new educational program, drug rehabilitation, or health care promotion, is that the data collection process is based on more than one level of hierarchy. The most common case is where there are measures for program or group characteristics and other measures for characteristics of subjects who are nested within the groups or programs. Any model set up for investigating relationships between measures from this kind of data needs to acknowledge the hierarchical structure of the data. statistical Model This section presents a general linear model that captures the hierarchical structure of nested data. The model and its assumptions can be described in three stages. The 39 40 first stage describes data obtained from k groups, where j=1,-~, k. Given the set of parameters A, U. and 03:, let Yj J! be a vector of nj independent observations from a normal distribution with a mean ZjA + ljUj and a variance Ijog. A linear model with random intercept and fixed effect parameters for both group and subject level variables can be specified as where 2 Uj is a random error for the intercept, F Y.- 71 l le Wj lel .. lep ' z]. = s s 2 s s , and A = «(Q 1 W71 qu anjl anjp pl (3.2) Lfipi The part [27]., qu] of the matrix Zj represents q known group level variables, the part [X1- ] represents p known fixed 7'1 Xijp effect subject level variables, and A is p+q+1 vector of parameters that capture the effect of group and subject level variables on the outcome variable Yj. In the second stage both the intercept y°+Uj and the residual variance 0; are assumed to be independent and vary randomly across the k groups. Given A and t2, the random part of the intercept U]. is distributed as normal with mean zero and variance 12. Given 0.2 as a typical value of 03, the 41 02 variable 67'; is distributed as x2,%) with % degrees of freedom. The last stage represents vague prior knowledge about the hyper-parameters A, 12, 6, and of for the two distributions presented in the second stage. The prior for each one of these four parameters is assumed proportional to some constant. The model in 3.1 and its associated assumptions can be summarized as follows: 1- The observed data Given A, Hi and c@ the data I} is distributed as 2 / 2 2- The parameters of exchangeable distributions Here a; and 09 are assumed independent, Ule,1:2 ~ N(0,1:2) , (3.4) and 0.2 2 " xii) ' (3.5) 60] 9 3- The prior knowledge about the hyper-parameters. The joint prior distribution p(A,tz,6,o§) can.be written as ._....._. ____.‘—-— 1». ...__-._IH~.—-_..—-——.-q.~.- —— _... -. —»— “illuililita "'— 42 p(),.2,e,e§) « p(1)p(r2)p0 the above density function increases from zero at 03‘- = O to a maximum ate?- = 03v,/(v,+2) and then tends to zero as og-ow. The density has therefore the same general shape for all degrees of freedom as the chi-square distribution itself has for degrees of freedom in excess of two." (Lindley, 1965, p. 28). In his interpretation of the above density as a prior density for 0%, Lindley said: "To take it as a prior density for o? is equivalent to saying that values that are very large or very small are improbable, the most probable value is o§==o§vJ%v,+2) and, because the decrease from the maximum as oj increases2 is less than the corresponding decrease as 05 diminishes, the values above the maximum are more probable than those a similar distance below the maximum." (Lindley, 1965 p. 28). 
The above interpretation to the prior distribution of og‘makes choosing such a prior more realistic than assuming equal likelihood for all values of 03- (i.e., a locally uniform prior). The mean and the variance of 0% can be easily found by realizing that the density in 3.7 is an inverse Gamma function with or = 1/26 and [3 = 26/03. Therefore, 46 2 2 2 _ l = O. E(°jle’°°) _. [3(a — 1) 1 — 20 ’ and 4 (3.8, Var(o§|9,o§) = 1 = 200° 32(a - 1)2(a - 2) (1 - 28)2(1 — 48) The results in 3.8 reveal that 6 has to be less than .25 for the variance of a? to be defined» As 6 gets smaller, the mean and the variance are approximated to of and 2603. In many analyses cfiz is used as a measure of precision for each estimate derived from the data. From 3.7, the prior distributionofcr}2 is Gamma with a = 1/26 and B = 26/03. Its mean and variance are found to be E(o}2|6,o§) = ab = 0:2 , and (3.9) Var(o§2|6,o§) = «[32 = 260:4. Another measure that describes 03- and has the characteristics of being invariant to the change in the mean of its distribution, and depending only on 6, is the coefficient of variation (Linhart, 1965). From 3.8 and 3.9 this coefficient is found to be 1:46 for a? and J23 for 032. Small values of 6 indicate precise knowledge about a? prior to observing the data. The last set of assumptions concerns the prior knowledge about the parameters A,1:2,6, and of. They are the hyper- parameters for the exchangeable distributions for the parameters {Uj} and {03.}. The two questions presented earlier by Bretthorst (1988) might serve as general guidelines for choosing prior distributions. However, when. the model in 3.1 is being 47 applied, prior knowledge about the state of the hyper- parameters A, 1:2, 6, and of is usually limited or non-existent. That is, the higher we move in the hierarchical structure of the parameters in the model, the less information we have about those parameters prior to observing the data. In other words, one is usually more informed about the distribution of the {Uj} and the {0%}, prior to observing the data, than about the distribution of A,t2,6, and 03. Therefore, the prior distributions for A, t2, 6, and of should have the characteristic of being non-informative priors. This implies that the posterior distributions for these parameters should rely heavily on the information obtained from the data, with little or no weight given to the prior knowledge about the parameters. For the parameter A, we take its conjugate prior distribution, (which is normal) with vague prior knowledge about the parameter itself. Therefore, A is distributed uniformly over its parameter space R that is p(A) °< CA, q+p+1' where C). is constant. The subscript A is being used with the constant to identify it with the parameter. Similar to 0;, a natural conjugate prior for 12 would be an inverse chi-square with v12 degrees of freedom. A vague prior for “t2 implies v12 = 0, which produces what is often called Jeffery's prior (i.e., p(tz) °<1/1:2). This prior distribution is found to cause some difficulties when the number of groups is relatively small (Seltzer, 1988). The 48 effect of this prior with t2 near zero has a great influence on the density function of the data such that the marginal posterior distribution of 1:2 becomes heavily concentrated around 12 = 0 (Morris, 1983 and Lindley, 1983). As an alternative, a uniform prior distribution is chosen for‘l:2 (Seltzer 1989). That is p(tz) °< 0,2 for (1:2 > 0), where C12 is a constant. 
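Returning for a moment to the moments in 3.8 and 3.9, they are easy to confirm by simulation: draw values of the group variance from the inverse-Gamma prior with α = 1/(2θ) and β = 2θ/σ.² and compare the sample mean and variance with σ.²/(1−2θ) and 2θσ.⁴/[(1−2θ)²(1−4θ)]. The values θ = 0.1 and σ.² = 30 below are illustrative choices, not values taken from the study.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma_dot_sq = 0.1, 30.0           # illustrative; theta must be < 0.25 in 3.8
    alpha = 1.0 / (2.0 * theta)               # shape of the inverse-Gamma prior in 3.7
    scale = sigma_dot_sq / (2.0 * theta)      # equals 1/beta, with beta = 2*theta/sigma_dot_sq

    # If G ~ Gamma(alpha, 1), then scale / G ~ Inverse-Gamma(alpha, scale).
    sigma_j_sq = scale / rng.gamma(alpha, 1.0, size=1_000_000)

    print(sigma_j_sq.mean(), sigma_dot_sq / (1 - 2 * theta))          # both near 37.5
    print(sigma_j_sq.var(),
          2 * theta * sigma_dot_sq**2 / ((1 - 2 * theta)**2 * (1 - 4 * theta)))  # both near 468.75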
The prior distributions for both A and 1:2 will then become insignificant in their contribution to their posterior distributions. For 6 and of, we write their joint prior distribution proportional to the product of the conditional prior distribution of of given 6 and the prior distribution of 6. That is 6)pfl6). (3.10) p(of, 6) °< p(o.2 We specify p(of [6) to have conjugate prior to the density in 3.7, which is of a chi-square form as _ 2 PM.2 6,I,w) °< (01.2)“1 exp[ (mo) , (3.11) 26 where r and w are two parameters describing the prior for of (Lindley, 1971). A vague prior for of implies setting r=0 and (0:0 which will produce what is known as Jeffrey's prior. However, the fact that the marginal posterior distribution of of is heavily dominated by the data undermines the effect of the prior of 0.2 on its marginal posterior distribution. Thus, for mathematical convenience, a constant prior Co? is specified 49 for 0.2 . In the second set of the assumptions, the hyper-parameter 6 is presented as a function of the degrees of freedom for the exchangeable distributions for {03-}. It also, represents the variability of {03.}, and mathematically there is no reason why it should not be any positive number (Lindley, 1971). Therefore, we assume that p(6) has a uniform distribution on the real line. That is p(6) °< Ce: where C6 is a constant. The Joint Density Function of the Parameters in the Model Applying the Bayesian approach to the model in 3.1 allowed the parameters in the last two stages of the model to be treated as unknown parameters. In the first stage Hid and {03-} for j = 1,---,k are the parameters with exchangeable priors. The second stage covers the hyper-parameters A, 1:2, 6, and of, where A is defined in 3.2. They are the parameters of the prior distributions for id; and (0?. Vague prior distributions for A” 12,6L and.o§ are taken from their usual conjugate families. Treating all the parameters in the model as unknown parameters, allows their joint distribution along with the data I“ to be distributed as a product of conditional .7 distributions as follows: 2 2 2 2 2 jfi1p(yjilr thoj)p(Uj|A"1'-2)p(ojle’ WPW‘ '9'°°)- (3.12) Substituting the density functions for the assumed 50 distributions from the previous section for each of the above conditional distributions produces a joint density function proportional to -n fi ( 2)Tj 1f / -2 .=1 Oj exp -§j=1(Yj—ZjA—ljUj) Oj (Yj—ZjA—ljUj) ___k l x (1:2) 2 exp - U]? 21:2 j=1 (3.13) Theoretically, the marginal posterior distribution of any parameter in the model can be obtained by integrating the joint density function in 3.13 over the space of the other parameters. Being able to obtain the marginal posterior distribution of any parameter in the model implies being able to obtain interval and point posterior estimates of that.parameteru For example, an estimate of A based on its marginal posterior distribution will not be affected by the uncertainty in estimating the variance components, avoiding an undesirable characteristic of empirical Bayes estimates of A. Further, individual group estimates of {03-} for j =l,-~-,k become useful when we are investigating groups' heterogeneity of variance. Also, individual estimates of‘iqfi become useful for checking the normality assumption for the error term associated with each group intercept. Moreover, inferences on the hyper- 51 parameters 12, of, and 6 can be made using their respective marginal posterior distributions. 
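For reference, assembling the densities assumed in 3.3 through 3.6 (with the flat hyper-priors contributing only constants) gives the following transcription of the joint density in 3.13; it is a reconstruction from the stated assumptions rather than a verbatim copy of the printed equation:

    p(\{U_j\},\{\sigma_j^2\},\Lambda,\tau^2,\theta,\sigma_{.}^{2},Y) \;\propto\;
        \prod_{j=1}^{k}(\sigma_j^2)^{-n_j/2}
        \exp\Big\{-\tfrac{1}{2}\sum_{j=1}^{k}
            (Y_j - Z_j\Lambda - 1_jU_j)'\,\sigma_j^{-2}\,(Y_j - Z_j\Lambda - 1_jU_j)\Big\}
        \;\times\;(\tau^2)^{-k/2}\exp\Big\{-\tfrac{1}{2\tau^2}\sum_{j=1}^{k}U_j^2\Big\}
        \;\times\;\prod_{j=1}^{k}
            \frac{(\sigma_{.}^{2}/2\theta)^{1/2\theta}}{\Gamma(1/2\theta)}\,
            (\sigma_j^2)^{-(1/2\theta+1)}
            \exp\Big\{-\frac{\sigma_{.}^{2}}{2\theta\,\sigma_j^2}\Big\}.

The first line is the contribution of the observed data, the second that of the exchangeable intercept errors {Uⱼ}, and the third that of the exchangeable variances {σⱼ²}; the conditional distributions derived in Chapter 5 refer back to these three lines.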
With the help of modern computers and developments in simulation theories, we are now able to approximate these marginal posterior distribution numerically, to the desired degree of accuracy. The methods of Data Augmentation (Tanner and Wong, 1987) and Gibbs sampling (Tanner, 1993) are being used for numerical integration to obtain marginal distributions. Morris (1987) shows how one can use these methods for hierarchical Bayes models. CHAPTER 4 Obtaining Marginal Posterior Distributions Via Gibbs Sampling This chapter presents the method of Gibbs sampling as an procedure for calculating marginal posterior iterative The process of Gibbs sampling is best distributions. understood within the context of data augmentation (Tanner and the basic idea of the data An example, using A Wong, 1987). Therefore, augmentation procedure will be explained. the normal distribution, will illustrate its application. simple modification to the idea of data augmentation will facilitate the understanding of Gibbs sampling procedure. This procedure will then be used in chapter 5 to approximate the marginal posterior distributions of the parameters in the model in 3.1. Data Augmentation (1964) made in their The argument that Box and Tiao investigation of the importance of the assumptions applied.to the comparison of variances, using the Bayesian approach, can facilitate our understanding of data augmentation. Very often the distribution of observations Y depends on more than one and Z2, one of which, say say two parameters Z1 For'non-Bayesianmmethods, parameter, 2&, is of immediate interest to us. 52 53 this could cause an extremely difficult problem in dealing However, under the Bayesian with the other parameter Z2. approach making inferences about Z1 is simplified by finding the marginal posterior distribution LMZIIY). This marginal posterior distribution can be obtained by integrating out the of parameters Z2 from the joint posterior other set «distribution p(Zl,Z2|Y). In the data augmentation process, Z2 latent variable or missing data can be thought of as 1987). augmenting the observed data Y (Tanner and Wong, In the following discussion all distributions are posterior distributions, therefore, the term "posterior" is deleted from the names of the distributions for simplicity. If we write the joint distribution p(Z1,Z2|Y) as a product of the conditional distribution of Z1 and the marginal distribution of Z2, p(Z1,ZzlY) =p p . (4.1) the marginal distribution of Z1 can then be written as p(ZIIY) =fTP(Z1iZ2I Y) p(ZzlY) aZ2 ’ (4-2) where T is the parameter space for Z2. As Box and Tiao (1964) pointed out: the marginal posterior distribution of the parameter ,p(Z2|Y) acts as a weight function multiplying the conditional distribution P(Z1IZ2r Y) of the parameter of interest. It is frequently helpful in understanding the problem and the nature of the conclusions which can safely be drawn to consider not only p(Z1|Y) but also the components of the integral on the right-hand side One is thus led to consider the of Z1 for particular of equation [4.2]. conditional distribution 54 values of the nuisance parameter Z2 in relation to the probability of occurrence of the postulated values of the nuisance parameter." (p. 153). 
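Written out, the two relations used in this passage are simply

    p(Z_1, Z_2 \mid Y) = p(Z_1 \mid Z_2, Y)\, p(Z_2 \mid Y),                   (4.1)

    p(Z_1 \mid Y) = \int_{T} p(Z_1 \mid Z_2, Y)\, p(Z_2 \mid Y)\, dZ_2,         (4.2)

where T is the parameter space of Z_2.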
These two basic ideas, integrating out the nuisance parameter Z2 from the joint distribution p (Z1, Z2 I Y) , and using the marginal distribution of the nuisance parameterp(Z2|Y) as a weight function in that integral, can be generalized to the process of data augmentation when a direct solution to the integral in 4.2 can not be found. Tanner (1993) defines p(ZZIY) in the integral in 4.2 as the predictive distribution which can be used with the observed data Y to obtain the posterior distribution p(leY) . In many cases however, p(ZZIY) is not known, which makes it impossible to obtain p(leY) . The joint distribution of Z1 and Z2 as in 4.1 and 4.2 can then be used to obtain the marginal distribution ,p(Z2[Y). In other words, the joint distribution of Z1 andZ2 in 4.1 can be written as: p(Z1.Z2IY) =p p(Z.|Y> (4.3) The marginal distribution for the second parameter p (Z2\Y) can be obtained by integrating the joint distribution in 4.3 over the parameter space R of the first parameter le p(ZZIY) =pr(Z2!Z1,Y)p(Z1|Y)6Z1. (4.4) the marginal distribution of the Just as in 4.2, in 4.4 can be thought of as a weight parameter p(leY) function multiplying the conditional distribution of the parameter P(Z.2(Z1r Y) . Integrating this weighted conditional 55 distribution over all admissible values of Z1 will give us the marginal distribution of Z2. Carrying out the integration in equation 4.4 analytically requires knowing the parametric form of the marginal distribution of the first parameter p (leY) . Likewise, carrying out the integration in equation 4.2 analytically requires knowing the parametric form of the marginal distribution of the second parameter p(ZZIY) . But these two marginal distributions are unknown to us and we are interested in finding them. Therefore, this dependency between the two marginal distributions for the two parameter becomes the key point of the iteration process of data augmentation. In 4.1 and 4.3, the joint distribution of Z1 and Z2 is being expressed as a function of four other distributions, two of which have unknown parametric forms. Therefore, it is hard to sample from them. They are the marginal distributions p(leY) and p(Z2|Y) . The other two have known parametric forms, and one can easily sample from them. They are the conditional distributions p (Z1|Z2, Y) and p (Z2 IZ1, Y) . The iteration process, therefore, involves repeated sampling of Z1 and Z2 from their conditional distributions p(lezz, Y) and p(ZZIZI, Y), which accomplishes the numerical integration of the joint distributions in 4.2 and 4.4, and produces numerical approximations to the marginal distributions for Z1 and Z2. To exploit the dependency between the two distributions in 4.2 and 4.4, we start with an initial estimate of Z1. One 56 can think of this initial estimate as being sampled from a poor approximation of its marginal posterior distribution, p(leY) . A sample of M values of Z2 can then be drawn from its conditional distribution p(ZZIZ1,Y) , where the initial estimate of Z1 is used along with data Y. For each of theM values of Z2, we form a conditional distribution of Z1 given Z2=Z2m defined as p(ZIIZZm, Y) , for m=1, «-,M. The weighing of each of these conditional distributions is done empirically by the sampling process from the marginal posterior distribution p(ZzlY) in 4.2. That is, a large proportion of M of these conditional distributions are conditioned on values of Z2 which are near the mode of p(ZZIY) . 
Rubin (1987) refers to the process of sampling Z2 from p (ZZIY) , in order to form the conditional distribution of the first set of parameters p(ZIIZZ, Y) as a multiple imputation process. The mixture of the M conditional distributions of Z1 given Z21", m=1,~~-,M represented by their weighted average, where the weighing process is done empirically (as explained previously), produces the marginal posterior distribution p(leY) at the tth iteration M P" x M‘1 Zp.xp{2-_12[(.-n.2 + nut—7m} . (4.9) 0 Based on 4.9, and where 02 is known the conditional distribution p(IJ-loz, Y) is normal with mean 57 and variance 02/11. When 02 is unknown, however, P(l-|v|Y) represents a t- distribution with mean )7 and variance Sz/n and (n-1) degrees of freedom, resulting from the integration of 4.9 with respect to 02 (Box and Tiao, 1973). When p. is known, the sample variance is defined to be 5'2 = 1nd};1 (yi - (1)2. The variable n.5'3/o2 is therefore distributed as chi-square with 11 degrees of freedom. Consequently, ozlu, Y is distributed as nszxfi) where xii, is an inverted chi-square variable with 11 degrees of freedom. When u is unknown however, p(ozlY) represents the probability density of (n-1)Szx2,2,-1) , where x7534) is an inverted chi—square variable with (n-l) degrees of freedom, resulting from the integration of 4.9 with respect to n. For illustrative purpose, let us assume that the forms for the distributions of p(ply) and p(ozlY) are unknown, and need to be approximated. Our objective in this example is to show how these two distributions can be numerically approximated, using the process of data augmentation presented in the previous section. The joint posterior distribution of p. and 02 is written as a product of the conditional distribution of p. and the marginal posterior distribution of 02 61 p(wzm =p(p.|02,Y) p(ozly) . (4.10) Then the marginal posterior distribution of u can be defined by p(plY) =fozp()1|02,Y) p(ozlY)d02. (4-11) Similarly, the joint.posterior distribution.in 4.10 is written as a product of two new distributions p(p.02|Y) =p(02|u,y) P(|J,|Y) (4.12) Then the marginal posterior distribution of 02 can be defined by p(ozlY) =f p(ozlp,Y)p(plY>du. (4-13) p. Examining 4.11 and 4.13 reveals that obtaining' the marginal posterior distribution of one parameter depends on obtaining the marginal posterior distribution of the other parameter. Therefore, this dependency between the two equations can be used to ShOW' how the method of data augmentation can be used to approximate the two marginal posterior distributions in the following steps. 1. As a starting point for the iteration process, u can be estimated from the data Y by )1 = gin/fl. This estimate can be thought of as p, which has been sampled from the current approximation of its marginal posterior distribution p(ulY). 62 Using the current estimate ii to represent the true value of u, find the sample variance as S2 =n’1£1(yi - (1)2. 1- This variance estimate will be used, in conjunction with the sampled values of a chi-square random variable, in the next step to sample 02 from its conditional distribution p(02 In, Y) . Sample M values of x201)”, for m=1,~,M from a chi-square distribution with 11 degrees of freedom. For each sampled value of )(Zm)m find 63,, as 83,, = nSz/xinm. Applying steps 2 and 3 resembles the process of sampling M values of 02 from its conditional posterior distribution p(ozlil ,y) . The mixture of these M conditioned values of 02 forms an initial approximation to the marginal posterior distribution p(ozlY) . 
This approximation, however, is an inaccurate one since the sampled values of 02 are conditioned on a poor estimate of p. Given the data Y and the M values of 0,3, which are sampled in step 3, we sample M new values of pm from p(plog, Y) , which is N(iog/n) , where 57 is the sample mean from the data. The mixture of the M sampled values of the mean pm represents an approximation to its marginal posterior distribution p(p. | Y) . These new sampled values of pm from step 4 can then be used to get M new estimates of 8; as in step 2 for a new cycle of approximation. As we continue iterating between steps 2 to 5, the mixture of the M values of 0,3, which 63 are sampled in step 3, and the mixture of the M'values of p” which are sampled in step 4 become increasingly accurate in representing the marginal posterior distributions of u and 02. Gibbs Sampling The process of data augmentation was presented for the case:of approximating the marginal.posterior distributions for only two parameters, Z1 and Z2 with M >1. When M=1, Tanner (1993) defines the iteration process as "chained data augmentation". In the normal data example given above, the two parameters were represented by p and 02‘with.At>1. There are cases however, where there are more than two parameters for which we require numerical approximations to their marginal posterior distributions. To obtain them, a simple modification to the logic of data augmentation process can be used. When. the iteration. process. of data augmentation is generalized to more than two parameters with M'set equal to one, the process is called "multivariate chained data augmentation” (Tannery 1993) or' Gibbs sampling" (Gelfand, Hills, Racine-Poon and Smith, 1990). Consider the previous case, where the distribution of the observations Y depends not only on Z1 and Z2 but also on a third parameter say Z3. Under the Bayesian approach, making 64 inferences about any of the three parameters is simplified by finding its marginal posterior distribution, by integrating out the other sets of parameters from the joint distribution of Z1, Z2 and Z3. The joint distribution of Z1, Z2 and Z3 can be written as a product of the conditional distribution of Z1 and the joint distribution of Z2 and Z3, p(Z1,Z2,Z3|Y) =p(Z1|Z2,Z3,Y) p(Z2,Z3|Y) . (4.“) Then, the marginal posterior distribution of Z1 can be written as P(Z1)Y) = foWp(Z1]Z2,Z3,Y) p(Z2,Z3lY) 6Z3 8Z2 ’ (4-15) where T and W are the parameter spaces for Z2 and Z3, respectively. Similar to 4.2, the joint distribution p(Z2,Z3|Y) can be thought of as a weight function multiplying the conditional distribution of the parameter of interest p(ZIIZ2,Z3, Y) . In other words, we consider the conditional distribution of Z1 for particular values of parameters Z2 and Z3 in relation to the probability of getting those values of Z2 and Z3. Similarly, the marginal posterior distribution for each of Z2 and Z3 can be obtained by integrating out the other parameters from the joint.posterior distribution of Z1, Z2.and Z3 as follows : p(ZzlY) = IR pr(Z2lZ1,Z3. Y) P(Z1IZ:))Y) 6Z3 621 ' (4,15) and 65 p(ZBIY) = f. pr(z3|zl,z2,y) p(Z1,Z2IY) az2 azl. “.17, As in data augmentation, the three equations 4.15, 4.16 and 4.17 define the iteration process of Gibbs sampling. Further, it is assumed.that there are only three distributions that are easy to sample from. They are the three conditional distributions of Z1, Z2 and Z3, represented by the first part of the right-hand side in each of 4.15, 4.16, and 4.17. 
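Before continuing with the three-parameter case, the chained (M = 1) version of the normal example above is worth making concrete, since it is exactly this scheme that Gibbs sampling generalizes: μ is drawn from N(ȳ, σ²/n) and σ² from n S̃²(μ)/χ²ₙ, the two conditional distributions given earlier. The sketch below is only an illustration; the simulated data, the number of iterations, and the burn-in are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.normal(5.0, 2.0, size=50)              # illustrative data
    n, ybar = len(y), y.mean()

    mu = ybar                                       # starting value, as in step 1 above
    draws = []
    for _ in range(5000):
        s2_mu = np.mean((y - mu) ** 2)              # S~^2 computed about the current mu
        sigma2 = n * s2_mu / rng.chisquare(n)       # sigma^2 | mu, Y  ~  n S~^2 / chi^2_n
        mu = rng.normal(ybar, np.sqrt(sigma2 / n))  # mu | sigma^2, Y  ~  N(ybar, sigma^2/n)
        draws.append((mu, sigma2))

    kept = np.array(draws[1000:])                   # drop an arbitrary burn-in
    print(kept.mean(axis=0))                        # approximate posterior means of (mu, sigma^2)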
The iteration process therefore involves repeated sampling of Z1 , Z2, and Z3 from these conditional distributions to accomplish the. numerical approximations of 'their’2marginal posterior distributions. Starting with initial values of ZS”, ZS”, and the data Y, sample ZFJ from its conditional distribution p(z1|z2‘°’, Z3)”, Y) . Given the values of Z)” , ZS” , and the data 2(1) from its conditional distribution Y, sample P(Z2(ZFJ,Z§”,IO. Finally, given the values of Z9), ZEJ, and the data Y, sample Z?) from its conditional distribution P(Z3IZ1(1),Z2(1), Y). The three sampling processes complete one iteration of Gibbs sampling. After a large number of iterations say X, sampling' any' of’ the three ‘parameters resemble sampling that parameter from its marginal posterior distributitwn To simulate a marginal distribution for Z3, for example, after an X iterations, sample m values of Z3 from (Z3 |Z{X), Z2(X),Y) . The mixture of these m sampled values of Z3 represents a numerical approximation to its marginal distribution. That is, the sampled values Z3901, ...... , Z3“), can be 66 viewed as a numerical approximation to the marginal posterior distribution _p(Z3|Y). Similarly; the :marginal. posterior distributions for Z1 and Z2 can be approximated. Gibbs sampling can be generalized to as many parameters as the investigated model has. The basic idea is to write the joint distribution of all parameters in forms similar to 4.15 to 4.17. Repeated sampling of the parameters from their corresponding conditional distributions results in approximation of their marginal distribution. A special case of Gibbs sampling is found for certain applications, where some of the parameters depend only on a selected number of other parameters. For instance, in the case of approximating the marginal posterior distributions for the three sets of parameters Z1, Z2 and Z3, it is possible that the conditional distribution of say, Z3 depends only on Z2 and the data 1%. While equations 4.15 and 4.16 for the marginal distributions for Z1 and Z2 do not change, equation 4.17, however, becomes p(Z3IY) =pr(Z3|Z2,Y) P ah‘—‘i;) , (5.1., 6(0j) 6h 72 where (—C::) = (032)”. Therefore, 6(0j) 31.1- - I? 3 g(0}2|A,6,0§,{Uj}, Y) °< (032) 2 2° leXP{_°72(szg+i)} I (5.19) In general, a continuous random variable X with a density function f is said to have a Gamma distribution with parameters a and [5 if f(x) = X“—1 exp(——pi{) , O s X 3 9° . (5.20) __1_ PM) B“ The expected value and the variance of X are given by (5.21) E(X) = a6 , Var(X) = «(32. The expression in 5.21, therefore has the kernel of Gamma density function. Thus, given A, 6, 0.2, {Uj}, and Y, the parameter 032 is distributed as Gamma with a -2 = E + _1_ , (30.. = ’2—6—7 (5.22) “i 2 26 J 11].st + 0,, To sample 03-, simply sample 0? from 5.21 then use the _ — _ 2 relation h(0j2) = (of) 1 to get 03-. ' - a The conditional expectation and the variance of oj are easily found by applying 5.21, 76 E(O}2|1,0,03,{Uj}, Y) =((u)jSJa + (1—(i)j)0..2)'1 , and Var(072|A e 02 {U} Y) = 2"” 32 + (1 — ) 2'2 J I I 0] j] n. [(1)-7' j Qj 0.] , (5023) .7 where wj =-——£fi—— nj + 6‘1 The conditional expectation in 5.23 has a form similar to Stein's (1956) shrinkage estimate. It is a form of a weighted estimate of the variance Si, and the asymptotic overall mean of the variances 03. The weights depend, in a natural way, on the degrees of freedom for obtaining S? and 03. As 6 approaches zero, indicating homogenous variances, so does wj leaving the expected value of of equal to 0?. 
When 6 gets very large however, the value of wj approaches one, leaving the expected value of 032 equal to the inverse of the sample variance 392. The operational definition of reliability is the ratio of the "true" variance to the observed variance. Therefore, one ——£::—— as a reliability index for 032, nj + 6'1 expressed in terms of its precision. can think of 1 - wj = It represents the . - 4 proportion of the precision in the OLS estimates of oj that is parameter precision. - 2 The conditional distribution of A g1ven {U}, {0fi, and Y From line 1 of the joint density in 3.13, the conditional distribution of AIQGL{O?,Y' iS found to have a denSIty function proportional to 77 exp()5—.l (ir'.—z.A—1.U.)/0’.2 (y —Z )_1 (1)) (5.24) 1 2 .7 .7 .7 J J j j j j ' Let Yj-lj Uj = Dj. The expression in 5.24 can now be written as eXp{-%€3[(Dj—zji)’o;2(Dj—ZjA)]} . (5.25) Multiplying the terms inside the exponential, and ignoring the terms that do not depend on A, the conditional distribution of A becomes proportional to eXp(--%€3 032 (A’Zj-ZjA — zit/25%)) . (5.26) By completing the square, 5.26 is rewritten as -1 I exp [—%[A - (g (Zj-Zj) 0332) )::(z,’.o,) of] $(zj4zj) of (5.27) -1 x {A - ($2,421.) of) $2,419,) 052)] 1 When Yj—lj Uj is substituted back for Dj in 5.27, the resulting expression represents a kernel of a normal density. Thus A |{Uj}, {03-}, Y is distributed as normal with mean A“ and variance V,“ where -1 1* = (gm/.21.) of] (ganja?) — $Z§1jUj<732J , .7 -1 V, = (fir/.21.) of) . 1 (5.28) 78 The mean in 5.28 is a weighted least square estimate with Yj-ljUj as an outcome, Z- as predictors, and 032 as the .7 weighing factor. The conditional distribution of t2 given UL; and 1’ From line 2 of the joint density in 3.13, the conditional distribution of tzlflgh Y is found to have a density function proportional to 1. 9:50,? (5.29) (12) 2 exp 212 Compared to the general form of an Inverse Gamma function in 5.16, the expression in 5.29 has a similar density function. Therefore, tzlhgh Y is distributed as an Inverse Gamma variable with “:2 = k ——1 fi2=—— - ' t (5.30) 2 iv The conditional expectation and the variance of 12 are easily found to be 79 EU]? E(1:2|{Uj}, Y) = 1:_4 , and 2 262%?) 1 (k-4)2(k-6) (5.31) Vartc2 I{Uj), Y) = To sample 1:2 from its conditional distribution, adopt the same strategy used for sampling 0?; that is, let 1:2 = h(’E-2) = (.c-Z)-1 . (5.32) The density function for t“ can then be found as - 6h(t’2) (1: 2 {U.}, r) = rd?- {0.}, Y) ,_ (5.33) where M = (t‘2)‘2. Therefore, 6(1‘2) _1_<_2 “17—sz3g h(t‘2|{Uj}, Y) °< (t’z)2 exp 21 (5'34) Compared to the general form of a Gamma function in 5.20, the expression in 5.34 is similar. Therefore, t‘2|{Uj},Y is distributed as Gamma variable with - 1 , [3.4 = - (5.35) {‘Uj The conditional expectation and the variance of 1‘2 are 80 E(t‘2|{Uj}, Y) = k—2 and (5.35) Var(t‘2|{Uj}, y) = 2(k-2) ($412 To sample 1:2, simply sample 1'2 from 5.34, then use the relation in 5.32 to get 12. The conditional distribution of of given 6 and k@} From line 3 of the joint density in 3.13, the conditional distribution of 03 (L{0? is found to have a density function proportional to .2 a 1 20 exp -0. 1_2 fi(0§_)“—26*1’ _ (5.37) 0.,iL))k 26 1 o§ 1 26 Let G and H represent the geometric and the harmonic means of 0§'s as follows: G: fi(0 971‘, Thus flog-=61c l (5.38) k H: k , Thus 20 1 2.5 ale I ml» Q eels Substituting the equalities of 5.38 into 5.37 produces 81 J” (ale-)2 we)" 9"“ 1 -k02 3 ” ° (5.39) (0 ) exp( 26H) . 
The first term of the above expression does not involve 03, therefore, the above expression can be rewritten proportional to (of)?!£6 ‘kOEJ, (5.40) exp( 26H The above expression represents the kernel of a Gamma density with k 26E! “0.2 = 2—6 + l I pa? = T . (5041) c o c c 2 The cond1t1onal expectation and variance for (n therefore, are E(of e,{o§}) = H+ 3; , and (5.42) Var(0§|6,{0§}) = (H + 3;)3-2-5 . Notice that the first.part of the above expectation represents information depicted by the harmonic mean of the groups variances {03.}. The second part involves 6, which reflects the variability of these variances. As 6 approaches the zero, implying homogeneity of variance situation, the second.part in that expectation diminishes, and the conditional expectation 82 of 0.2 becomes equal to the harmonic mean of the group variances. Similar to 03- and 1:2, we can find the conditional distribution of 0:2 by letting 02 = 9(0‘2) = (0-2)-1 (5.43) The density function for 0:2 can then be found as '2 5.44 h(0§2|{0§-},6) = f(0§ (03),.) Ml < > am?) -2 where M =(O;Z)‘2. Therefore, 6(0Z2) k- _ h(o;2l{o§},e) o. (0J2)7’32exp(——k—:-2-) _ (5.45) 26H0. The expression above has the same general form as the inverse Gamma function in 5.16. Therefore, 0Z2l{0§},6 is distributed as an Inverse Gamma variable with «0:2 and (30:2 given in 5.41. o '2 The mean and variance of 0. are E (0:2 H03}, 6) , and _1_ H (5.46) H Var(0.f2 H02), 6) -——-l-——— 1 His-1) 26 e o 2 The Conditional Distribution of 6 g1ven 03 and {oj} From line 3 of the joint density in 3.13, the conditional distribution of 6|03,{0§-} is found to have a density function proportional to 83 k (03) 2—0 1 2 1 -(i+1) __ 0 ex —2§_ fi 02. 23 . (5047) ——P ( ) p 2—6 1 a; 1< 3) Let G and H be defined as in 5.38. Further, to simplify the term 1"(2—10) in 5.47 we use the first term of Stirling's approximation. The general expression for Stirling‘s approximation for a variable P is given by l + 1 _ 139 ,(5.48) 12p 288p2 51840p3 NP) a (21:) %p (p-é) exp (-p) [1 + Substituting 1/26 for p in the first term of 5.48 will produces the following approximation. k (F(-§%))k z ((21):)3 (2_16.)2_°3 exp(%6.)) _ (5.49) Substituting G, H, and 5.49 in 5.47 results in k i 2" (<33)E - i+ —ko§ (20) G 1((29 1) exp( 26H . (5050) The expression in 5.50 reduces to 2i ——1 [°°)”(-1-) 1.“; G 6 (26) uln- —.Iq and nj>p. However, for practicality; three versions of that model will be presented. In the first version, no between and within-unit variables were specified” This is the simplest case, where the model in 3.1 reduces to a simple ANOVA model. One between and one within-unit.variable will be included.in the second version of 87 the model with a random intercept and fixed effects predictors, i.e., q==g3= 1. The last version covers the case of two between and three within-unit variables with random intercept and fixed effects predictors , i.e., q = 2 and ;)= 3 (see Table 5.1). The above three models range in their complexity to represent. conditions often found. in. educational research projects. The first is a simple version of the model, used to facilitate the understanding of the statistical procedure and its application, The third version of the model is similar to the example of the real data used in the study. Data Specifications Two primary but not completely mutually exclusive criteria were used to create the data. The first deals with the degree of heterogeneity of variance. The second criterion deals with the number of groups k to be generated. 
For each combination of these two criteria along with each of the three versions of the model a data set was generated. Based on the first criterion (degree of heterogeneity of variance) and given that the coefficient of variation (C.V.) Ikur{o? is equal to where O<6 “0 ...... 1 Wj le Yo k=100 . Homogeneity Zj : : I and A : Yl case k=40 1 W5 X%J p (6—0.02) k=15 k=100 Model (3) Heterogeneity f case k=40 Yj = sz + ijUj + ej, or (6=0.2) (Yo. k=l5 71 l le sz X1j1 X1j2 X1j3 72 k=100 Zj : . I . : 1 and l: B I H mogeneity 1 o l le sz anjl anjg anj3 p2 (eggsgz) k=40 .ij k=15 90 Data Creation Given the model in 3.1, all eighteen data sets were generated in two general steps. The first step covered the generation of data for the group level variables. The second step covered the generation of data within each group based on the data generated for the group level variables in the first step. These two steps reflect the inherent nesting of the two levels of the model. step one - Between-group data The following random vectors with k elements (k = number of groups) in each vector were generated: 1. Within-group sample sizes nj vector, for j = 1, ..., k. First, k random numbers were generated from a UNIFORM distribution between 5 and.60. Each number was rounded to the nearest integer to represent a sample size nj for each of the k groups. Variation in sample sizes are considered to include a wide range from small (nj=5) to large (raj-=60). The following is an example of sample size vector: Group Sample size nj 1 n1=40 2 n2=15 91 2. Within-group variances {0;} vector. For pre-specified values of 6 and 0.2 the variances {03-} were generated from their respective distribution. f(o§-|6,o§) ~ I“1(a,[3), where (5.59) a z i = 2_6 26 ’ a: 3. Based on 3.9, the k elements of the {Uj} vector were generated from a normal distribution with mean equal zero and variance equal 12. 4. Now, q vectors of W = [W1, W2, ---, Wq] and p vectors of 2?: [§1,)?2,~~,5{_p] each has k elements were generated from a multivariate normal distribution with mean vector of zero and pre-specified variance-covariance matrix 2. The values of the p vectors of §= [§1,)?z,m,fp] , were then used, in step two, as means of the p vectors for the within-group predictors Xj = [Xj1,Xj2, -~,ij] in the design matrix Z in 3.2. Step two - Within-group data The generated data in the second step is for the within- group variables. It depends on the values of the between- group variables generated in the first step. For each j, where j = 1, ~--, k groups, an p variables ofXj = [Xj1,Xj2,o--,ij] within-group predictors and an error term ej were generated. The number of observations for these variables within each group is equal to the sample size nj generated in the first data set. . -._—._———_ —____._ —.— 92 1. Based on the model in 3.1, e1.j ~ N(O,Io§-) . Therefore, given {03-} for j = 1,2,---,k, which are generated in the first step, nj values of eij, for i = 1,2,---,nj, are generated for every one of the k groups. The eij's are normal with mean zero and variance 0?. 2. For each of the k elements of the p vectors of I? = [§1,)?2,---,)?p] generated in the first step, 12]. vectors of observations were generated for each of Xij = [Xv-1, X1721“! Xijp] , where 1' = 1, 2, ~--, nj and j = 1, 2, --~, k, from normal distribution with mean fjp and a pre-specified value for the variance air,»- 3. 
For pre—specified values of the parameter Vector A’ = [yu yl, ..., yq, [31, [32, ..., [3p] the values of the outcome variable Y1-j in a particular data set were then generated by substituting the generated values of all the between- and within-group variables in the right hand side of equation 3.1. In summary, one artificial data set for a model with one between and one within-group variables, for example, can have the following form: 1- k = 40 number of groups , nj ~ Uniform(5, 60) 2- 0:0.2, 03:50. 1 26 _ 2, ~ _1 -_- :I I a = __ I = — 3 a] I‘ (a 2.5 , [3 008) y 26 0.2 2 - Var(Uj) = 9.0 , Thus Uj ~ N(0,9.0) __ 4 1 93 2 = COV(W,X) =( ), oi“, = 81 , thus W~N(0,4) , 81) , and eij~N(0,o§-) 6- y°=5.0, 71:3.0, B=O.5, where p=q=l 7- Yij ~ N( 5+3wj+.5Xij+Uj , 03- ) Tables 5.2, 5.3 and 5.4 presents the true values of the parameters used to generate the 18 artificial data sets used in this study. Note that the within-group sample sizes nj is a random sample from uniform (5, 60) for all the artificial data sets. 94 Table 5.2 The true values of the parameters used to generate the artificial data sets with k = 100 k = 100 e = 0.2 e = 0.02 MOdel (1) Model (1) 03:30 12:6.25 A=yg=6 03:15 12:9 A=Yg=5 Model (2) Model (2) 03:30 12:4.00 03:35 12:6.25 Y°=3 . 00 Yo=12 . 5 A=yl=l.50 A=yl=6,00 31:3-50 81:2.50 4.00 8.00 E: 2: ] 0.21 3.00 0.70 4.00 0i = 400 a; = 400 Model (3) Model (3) 03:30 12:6.25 03:30 12:3.0625 ry,,=8.00' 3158.00) 71:3-50 yl=3.50 A =YZ=2'7S A::72=2.75 01:1.00 01:1.00 82:3.50 82:3.50 03:0.75) _ _p3=0.75) V4.00 “ 3.00 0.21 3.00 0.11 1.50 E==0.72 0.33 7.00 2==0.32 0.35 2.50 0.62 0.21 0.20 4.00 0.09 0.22 0.20 3.00 _0.30 0.20 0.21 0.11 5.00) _0.42 0.20 0.14 0.11 1.00“ 2_ 2_ og=676 02:400 02=484 02:225 ofi—289 qa—loo O 0 ' —' 12 is the variance—covariance matrix between the W s and.X s 95 Table 5.3 The true values of the parameters used to generate the artificial data sets with k = 40 k = 40 6 = 0.2 6 = 0.02 Model (1) Model (1) 03:25 12:2.25 A=yy=8 03:15 12:9 X=Yg=5 Model (2) Model (2) 03:40 12:2.25 03:30 12:9 Yo=3.00 Y°=l2 A==yl=1.50 1: 71:4 fi1=3.50 61:2 _4b0 2_6.00 J 0.21 3.00 0.30 4.00 a; = 169 a; = 400 Model (3) Model (3) 03:25 13:2.25 03:15 12:6.25 Fy°=8.oo' 'y.=8.00' Yl=3-50 Y1=3.50 =2.75 Y =2 75 A=Y2 A: 2 01:1.00 31:1.00 02:3.50 02:3.50 p3=o_75 _ _B3=O.75‘ F6.00 ' ‘ 2.00 0.11 3.00 0.13 2.50 2==0.32 0.35 3.50 2==0.12 0.15 1.50 0.12 0.22 0.20 4.00 0.09 0.05 0.10 3.00 0.30 0.20 0.21 0.11 5.00, _0.02 0.20 0.14 0.11 1.00, 2 2‘_ 02:121 02:169 02:196 02:64 oa=121 ofi—lOO 2 is the variance-covariance matrix between the W's and X's 96 Table 5.4 The true values of the parameters used to generate the artificial data sets with k = 15 = 15 6 = 0.2 0 = 0.02 Model (1) Model (1) 03:10 12:1 A=Y°=5 03:10 t2=300625 l=yo=5 Model (2) Model (2) 03:20 12:16 03:10 12:36 Yo=8'00 Y°:8.0 A=yl=3.50 1:71:35 51:1.50 61:1'5 3.00 3.00 2: 2: ] 0.11 2.00 0.11 2.00 2 2 _ ox1 = 49 0x1 — 49 Model (3) Model (3) 03:10 12:2.25 03:30 12:4 'Y.=8 . 00' (756 . 00' 7123-50 Y1=4.50 72: .75 lY2=2.50 A: A: _ pl: .00 81—2.00 82:3.50 82:3.50 p3=0.75) . _BB=1.75) 1 '6.00 ' 2.00 0.11 3.00 0.13 2.50 2==0.32 0.35 3.50 2==0.12 0.15 1.50 0.12 0.22 0.20 4.00 0.09 0.05 0.10 3.00 0.30 0.20 0.21 0.11 5.00) _0.02 0.20 0.14 0.11 1.00, 2 2.. 2.. 02:121 og=169 ag=196 ofi=25 qh—49 ofi—ie 9 ' ' —' 2} is the variance-covariance matrix between the W s and X S 97 Initial Estimates for Gibbs Sampling We recall from the section titled "Empirical Application of Gibbs Sampling" on page 87, that initial estimates for A, of, {Ufi' and. (a? 
are required to start the iteration process of Gibbs sampling. Reasonable initial estimates for these parameters could be their empirical Bayes estimates. However, to avoid the dependency in comparing empirical Bays estimates of these parameters with those produced by Gibbs sampling, the formal estimates will not be used as initial estimates for Gibbs sampling. Alternatively, least square estimates are used as initial estimates for these parameters as follows: 1- Using the data Y and the design matrix Zlin 3.2, The parameter A. is estimated by A“) =(J§1Z§Zj) 124le. The superscript (O) is used with the parameter A to represents the initial estimate of the parameter. 2- Given the data Y and the computed value of A”) from the previous step, each element of the set {U9 is estimated n- n m)__ l 2 _ i m) b U- - —— Y- Z-A ). Y 3 nj 14 1 id 3 3- Similarly, each element of the set {0? is estimated by 03(0) = nj‘l(yj _ 21.1w) — Uj‘°>)/(yj — zjl<°> — Uj‘”) . The values of A”) and (#0) were computed from the previous two steps. 4- Finally, given the set of estimated values {ojmn from step 3, of is estimated by{fhe harmonic mean H'of {ofiww as follow 02(0) = k 2092(0)) . The above estimates can be thought of as sampled values 98 of the parameters from the current approximation to their marginal posterior distributions. These sampled values are then used to simulate the distributions for the rest of the parameters. The simulated distributions however, are poor ones because they are based on poor estimates of A, of, {Uj}, and {03-}. Going through several iterations should improve the approximations of all the marginal posterior distributions. Another parameter estimated from the data is 6. While the estimate of this parameter was not needed to start the iteration process of Gibbs sampling, its estimated value, however, is used for empirical confirmation of its Gibbs (posterior mean) estimate. Except for 6, empirical Bayes estimates for all other parameters of the model in 3.1 can be easily found and compared to their counter part estimates from Gibbs sampling. Therefore, an estimate of 6 from Gibbs sampling can be compared to a one that is based on the log transformation of the estimated group variances (Raudenbush and Bryk, 1987). That is, when S]? for j=1,~-,k is used to estimate the group residual variance with vj degrees of freedom, the transformation dj, where 1 _ 5.60 is used to estimate 6]. =—:—109(O§-). Each dj is approximately distributed as N[5j, 1/ (2Vj—1)], for j=1, ..., k. Similarly, using the assumption in 3.5, where (IE/(60%) is distributed as a chi- square variable with 6‘1 degrees of freedom, 5. is distributed J as NHL-g) , where A is the average of all 6].. ') 99 A total variance estimate of dj approximately consists of a sampling variance and a residual parameter variance Var(dj|A) = Var(dj|6j) + Var(6j|A) , and f j=1 which has a large sample chi-square distribution withk-l degrees of freedom (Raudenbush and Bryk, 1987). 100 Random number generation Three computer programs, using the FORTRAN language, were written to implement the generation of the data and the iteration process of Gibbs sampling. The first two programs are for creating the artificial data sets in two steps. The third program applies the iterative steps of Gibbs sampling. 
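The four least-squares initialization steps above translate directly into code. The sketch below assumes, purely for notational convenience, that the data are held as per-group arrays Y[j] of length n_j and design matrices Z[j] of order n_j × (q+p+1); this is not the data layout of the study's FORTRAN programs.

    import numpy as np

    def initial_estimates(Y, Z):
        """Least-squares starting values for Gibbs sampling:
        Lambda, {U_j}, {sigma_j^2}, and sigma_dot^2 (harmonic mean)."""
        k = len(Y)
        # Step 1: pooled least-squares estimate of Lambda.
        ZtZ = sum(Zj.T @ Zj for Zj in Z)
        ZtY = sum(Zj.T @ Yj for Zj, Yj in zip(Z, Y))
        lam0 = np.linalg.solve(ZtZ, ZtY)
        # Steps 2 and 3: group intercept residuals and residual variances.
        U0, sig2 = np.empty(k), np.empty(k)
        for j, (Yj, Zj) in enumerate(zip(Y, Z)):
            resid = Yj - Zj @ lam0
            U0[j] = resid.mean()
            sig2[j] = np.mean((resid - U0[j]) ** 2)
        # Step 4: initialize sigma_dot^2 at the harmonic mean of {sigma_j^2}.
        sigma_dot2 = k / np.sum(1.0 / sig2)
        return lam0, U0, sig2, sigma_dot2

The third FORTRAN program, described next, computes these same starting values from its input data file before entering the sampling loop.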
In this program sufficient statistics are calculated from an input data file (either the artificial data created by the first two programs or the real data) and used to calculate the initial estimates of the parameters A, of, HG},anui {0%, the chi-square statistic for testing the hypothesis of homogeneity of variance as well as the moment estimate of 6. The initial estimates of the hyper-parameters are then treated as being sampled from the current approximation of their marginal posterior distributions and. they are 'used to start the iteration process between six subroutines. Each subroutine is written to sample one value of each of the parameters 6, 03, 12, A and one set of {Uj} and {03-} for j=1,---,k from its corresponding conditional distribution. The iteration process is then terminated when the number of iterations required for convergence is reached. The conditional distributions of the parameters involved in the iteration process of Gibbs sampling have three general parametric forms: normal, gamma and inverse gamma. The International mathematical and Statistical library (IMSL), 101 Version 10 is used with the programs to generate random numbers from these distributions. Sampling a random variable from an inverse gamma distribution with certain parameters was established by inverting a random variable that was sampled from gamma distribution with the same parameters. Only 600 observations are printed out for each of the parameters as a final representation of its marginal posterior distribution. To avoid the dependency between two successive iterations, the 600 observations are drawn from the last 3600 iteration of the program in a systematic order with a 6 iterations jump between selected observations. Specifying the Criteria for Convergence Gelfand, Hills, Racine-Poon and Smith (1990) have listed several procedures for checking convergence in Gibbs sampling. One of these procedures, they recommend, is the overlay plotting of the estimated density of the simulated distribution at several points of the iteration process. One can assume convergence when these plots become equivalent. Another procedure is the use of a Quantile-Quantile plot. Two equal size samples, each drawn several iterations away from the other, are ordered and plotted. As the number of iterations before drawing the two samples increases, the plot of the ordered samples moves toward a 45° line as an indication of convergence. The above two procedures were used 102 in this study for checking the convergence in approximating the marginal posterior distributions. The number of iterations required for convergence for model 3 with 6=0.2 and k=15 was used for the subsequent runs. Real Data Analysis Description of the variables available in the High School and Beyond data set is given in table 5.5. Table 5.5 Description of Variables in High School and Beyond data Variable Description Student Characteristics MATHACH Mathematic achievement (outcome). SES A measure of socioeconomic status. MINORITY An indicator of minority status. GENDER An indicator for females. School Characteristics SECTOR School affiliation, Public VS. Catholic. MEANSES Mean Of SES. 103 Previous research (Raudenbush and Bryk, 1987) has demonstrated the use of hierarchical linear model in studying the effect of the organizational characteristics of schools on dispersion in mathematics achievement (MATHACH). They have used the model for estimating the residual dispersion in mathematics achievement for each school using information from the whole sample. 
The resulting estimates of the residual dispersions are empirical Bayes estimates, which are conditioned on ML estimates of the hyper-parameters of their assumed distribution. This study goes beyond getting a single estimate of the residual dispersion for each school by approximating not just the marginal distribution for the dispersions but also the marginal distribution for every parameter of the model in 3.1, including the hyper-parameters. Clearly, this gives the researcher more flexibility in making inferences about any parameter of the model. Table 5.6 presents the different models used with the High School and Beyond data set.

Table 5.6  Models used with the High School and Beyond data set
1  MATHACH = BASE + SECTOR + MEANSES + MINORITY + GENDER + SES
2  MATHACH = BASE + SECTOR + GENDER + SES
3  MATHACH = BASE + MEANSES + GENDER + SES
4  MATHACH = BASE + SECTOR + SES
5  MATHACH = BASE + SECTOR + GENDER
6  MATHACH = BASE + MEANSES + SES
7  MATHACH = BASE + SECTOR
8  MATHACH = BASE + SES
9  MATHACH = BASE

Similar to the models in table 5.1, all school- and student-level variables in any of the models in table 5.6 are assumed to be fixed effects except the intercept, which is assumed random.

CHAPTER 6
Results
This chapter provides a discussion of the results of the empirical application of Gibbs sampling. The discussion covers the results from the artificial data and the High School and Beyond data set. Answers to the questions concerning the objectives of this study, which were presented in chapter 1, are provided through this discussion.

Required Number of Iterations for Convergence
Several procedures have been suggested for checking the convergence of the iteration process of Gibbs sampling (Casella & George, 1992; Gelfand, Hills, Racine-Poon and Smith, 1990; and Lewis and Orav, 1989). In this study two graphical criteria were used in deciding about the convergence of the iteration process of the program. The first criterion is the overlay plotting of the density function for each parameter as approximated by different samples. These samples were generated from several runs of the program on the same data set, with a different number of iterations in each run. The second criterion is based on "Quantile-Quantile" (Q-Q) plots. Observations from two different samples based on the same model, but a different number of iterations, are sorted and plotted against each other in a scatter plot. Using the first criterion, one can assume convergence when the graphs of the density, from different samples, for the marginal posterior distribution of a particular parameter become very close or identical to each other. Using the second criterion, convergence can be detected when the scatter plot of the sorted observations from two samples forms a 45-degree line.
Model 3 from table 5.1 with θ = 0.2 and k = 15 was used as a reference to decide about the appropriate number of iterations required for convergence for all data sets. It is assumed that this model represents the scenario requiring the largest number of iterations for convergence because it has the largest number of parameters and the smallest number of groups, k = 15, with heterogeneous variances. Using the specified model the program was run 4 times. The total numbers of iterations in the four runs were 4600, 8600, 10600 and 15600 respectively. Four samples, one from each run of the program, with 600 observations in each sample, were generated.
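The systematic draws and the Q-Q comparison used in these checks are straightforward to reproduce. The sketch below assumes each run's output for a single parameter is stored as a one-dimensional NumPy array; the run lengths and the 6-iteration jump follow the description in the text, while the plotting details are illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    def thin_last(chain, last=3600, step=6, keep=600):
        """Keep `keep` draws taken systematically (every `step` iterations)
        from the last `last` iterations of a run."""
        return chain[-last:][::step][:keep]

    def qq_compare(sample_a, sample_b, label_a, label_b):
        """Q-Q plot of two equal-size samples; points falling on the
        45-degree reference line suggest the two runs approximate the
        same marginal posterior distribution."""
        a, b = np.sort(sample_a), np.sort(sample_b)
        plt.scatter(a, b, s=5)
        lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
        plt.plot([lo, hi], [lo, hi])      # 45-degree reference line
        plt.xlabel(label_a)
        plt.ylabel(label_b)
        plt.show()

    # Illustrative use, with chain_8600 and chain_10600 as hypothetical stored runs:
    # qq_compare(thin_last(chain_8600), thin_last(chain_10600), "8600 run", "10600 run")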
To avoid dependency between consecutive observations within each sample (Casella & George, 1992), the 600 observations were drawn systematically from the last 3600 iterations of each run. Thus, the first sample consists of 600 observations drawn systematically from the last 3600 iterations, after 1000 iterations, in the first run. The second sample consists of another 600 observations drawn systematically from the last 3600 iterations, after 5000 iterations, in the second run. The third sample likewise consists of 600 observations drawn systematically from the last 3600 iterations, but after 7000 iterations, in the third run. Finally, the fourth sample consists of another 600 observations drawn systematically from the last 3600 iterations, after 12000 iterations, in the last run.
Figures 6.1 and 6.2 show the overlay graphs of the densities for the marginal posterior distributions of the hyper-parameters in model 3 of table 5.1. Four densities, each based on one sample, were drawn for each parameter. These densities are smoothed by the "kernel" method (Silverman, 1986), which superimposes a univariate nonparametric kernel density estimator. The estimator shows the areas where the observations are most concentrated in the sample. Figures 6.3 and 6.4 present two Q-Q plots for each hyper-parameter. One plot is based on the samples generated from the runs of 8600 and 10600 iterations, and the other plot is based on the samples generated from the runs of 8600 and 15600 iterations.
Density graphs for the exchangeable parameters {Uⱼ} and {σⱼ²} for j = 1, ..., 15, for all 15 groups in the chosen data set, were considered in deciding about the convergence of the iteration process. However, only 4 groups, selected at random, are presented here for illustrative purposes. Figures 6.5 and 6.6 provide the overlay graphs of the densities of the marginal posterior distributions and the Q-Q plots for σⱼ² for the 4 selected groups. The overlay graphs of the densities and the Q-Q plots for Uⱼ are provided in figures 6.7 and 6.8.
In general the overlay graphs show that the densities from the four samples are almost identical for most of the parameters. Densities for θ, σ.², τ², and γ₀ based on the 4600-iteration sample show little divergence from the rest of the samples. Considering the overlay graphs and the Q-Q plots of all parameters, it was decided that 8600 iterations are sufficient to achieve convergence. The Q-Q plots show that observations from the sample based on 8600 iterations form a 45-degree line when they are plotted against the observations of the samples based on 10600 and 15600 iterations. Divergence from the 45-degree line occurs only for a small number of observations that fall at the tail of the distribution for every parameter. These observations can be thought of as outliers; they have much lower probabilities of being in the distribution than the majority of the observations.
Figure 6.1  Overlay graphs of the estimated densities of the marginal posterior distributions for the hyper-parameters θ, σ.², τ², and γ₀  [plots not reproduced]
Figure 6.2  Overlay graphs of the estimated densities of the marginal posterior distributions for the hyper-parameters γ₁, β₁, β₂, and β₃  [plots not reproduced]
Figure 6.3  Q-Q plots of the observations of the marginal posterior distributions for the hyper-parameters θ, σ.², τ², and γ₀  [plots not reproduced]
Figure 6.4  Q-Q plots of the observations of the marginal posterior distributions for the hyper-parameters γ₁, β₁, β₂, and β₃  [plots not reproduced]
Figure 6.5  Overlay graphs of the estimated densities of the marginal posterior distributions of selected groups' variances {σⱼ²}  [plots not reproduced]
Figure 6.6  Q-Q plots of the points of the marginal posterior distributions of selected groups' variances {σⱼ²}  [plots not reproduced]
Figure 6.7  Overlay graphs of the estimated densities of the marginal posterior distributions of selected groups' intercept errors {Uⱼ}  [plots not reproduced]
Figure 6.8  Q-Q plots of the points of the marginal posterior distributions of selected groups' intercept errors {Uⱼ}  [plots not reproduced]
Artificial Data

There are 18 data sets (see table 5.1), created by crossing the three models, the two values chosen for δ, and the three choices of k. The marginal posterior distributions of δ, σ₀², τ², the regression coefficients, {σⱼ²}, and {Uⱼ}, j = 1, ..., k, were obtained for each data set. Posterior means and standard deviations were calculated from the approximated marginal posterior distributions. Information about the exchangeable parameters {σⱼ²} and {Uⱼ} is presented for only 10 groups, selected by systematic sampling, for the data sets with k = 100 or k = 40, and for only 8 groups for the data sets with k = 15. That is, for k = 100, groups 1, 11, 21, ..., 91 are presented; for k = 40, groups 1, 5, 9, ..., 37; and for k = 15, groups 1, 3, 5, ..., 15.

Empirical Bayes estimates of the regression coefficients, with their standard errors, were obtained for each of the 18 data sets using the HLM program. Maximum likelihood (ML) estimates of the variance components τ² and σ₀² were also obtained from the HLM analysis. (Note that the HLM estimate of σ₀² is a pooled within-group ML variance estimate.)

Because of the large cost of the analysis and the limited time and resources available, each of the 18 data sets represents just one sample generated from a population that differs from the populations of the other data sets. There is therefore a chance that a given data set is not a good representative of its true population and the associated parameters. The implication is that the estimates from a given data set are based on a single sample, which limits our assessment of error. A more complete study would generate multiple data sets from the same population; estimates of a given parameter could then be obtained from each generated sample via Gibbs sampling, and mean squared errors, for example, could be calculated for that parameter.

High School and Beyond Data

As pointed out earlier, these data come from a probability sample of 160 U.S. high schools. Table 5.5 describes the variables in the data set. The outcome variable in all of the models is a standardized mathematics achievement score. Nine models (see Table 5.6) were applied to the HSB data. The models differ in the number and type (school vs. student) of predictor variables. Variation in the school-level variables is expected to influence mainly the estimated value of τ², while variation in the student-level variables affects all estimates of the variance components. Gibbs sampling was run for 8600 iterations for all of the models.
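For concreteness, the sketch below shows one way an artificial data set of the kind described above might be generated: a two-level random-intercept model in which each group receives its own residual variance σⱼ² drawn around a typical value σ₀². All numerical settings, the scaled inverse chi-square draw used to induce heterogeneity, and the variable names are illustrative assumptions; they do not reproduce the exact design of table 5.1 or the prior parameterization used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings only (not the design values of table 5.1).
k = 15                         # number of level-2 groups
n_j = 40                       # observations per group
gamma = np.array([3.0, 1.0])   # fixed intercept and level-1 slope
tau2 = 0.5                     # variance of the random intercept errors U_j
sigma2_0 = 9.0                 # typical within-group residual variance
nu = 5                         # concentration of the sigma2_j around sigma2_0
                               # (a role analogous to, not identical with, delta)

# Heterogeneous within-group variances: scaled inverse chi-square draws.
sigma2_j = nu * sigma2_0 / rng.chisquare(nu, size=k)

# Random intercept errors at level 2.
U = rng.normal(0.0, np.sqrt(tau2), size=k)

y, x, group = [], [], []
for j in range(k):
    x_j = rng.normal(size=n_j)                             # level-1 predictor
    e_j = rng.normal(0.0, np.sqrt(sigma2_j[j]), size=n_j)  # group-specific noise
    y_j = gamma[0] + U[j] + gamma[1] * x_j + e_j           # outcome for group j
    y.append(y_j); x.append(x_j); group.append(np.full(n_j, j))

y, x, group = map(np.concatenate, (y, x, group))
```

Data generated in this way can be analyzed both by the Gibbs sampler and, under the homogeneity-of-variance assumption, by a program such as HLM, which is the comparison reported in the sections that follow.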
As with the artificial data, information about the exchangeable parameters {σⱼ²} and {Uⱼ} is presented for only 10 schools, selected by systematic sampling. Empirical Bayes estimates of the regression coefficients in all models were also obtained via the HLM program.

Comparing Estimates of Variance Components

Part A of Tables 6.1 to 6.18 provides estimates of the variance components σ₀², δ, τ², and {σⱼ²} for each of the 18 artificial data sets. Variance component estimates for the High School and Beyond data set are provided in part A of Tables 6.19 to 6.27.

Estimates of σ₀²

The hyper-parameter σ₀² represents the typical value that any of the parameters {σⱼ²}, j = 1, ..., k, can take in their prior distribution. One objective of this study is to find a way of estimating that typical value without assuming homogeneity of variance. Taking the fully Bayesian approach and treating σ₀² as random facilitated the derivation of its marginal posterior distribution as well as the marginal posterior distributions of each group's residual variance.

The marginal posterior distribution of σ₀² is not far from symmetric even when k is small (see figure 6.1), so the difference between the posterior mean and posterior mode estimates is negligible. If the marginal posterior distribution of σ₀² departs from symmetry, it does so by positive skew, in which case the posterior mean exceeds the posterior mode. With this in mind, we find that the HLM estimates of σ₀² are larger than the Gibbs posterior mean estimates of σ₀² in all data sets, including the High School and Beyond data set. When compared to the actual values of σ₀² (part A of Tables 6.1 to 6.18), the HLM estimates of σ₀² are still larger than the actual value of the parameter in every data set. Furthermore, the overestimation of σ₀² by HLM is more pronounced for the data sets with extreme heterogeneity of variance (see the tables with δ = 0.20). This indicates that the HLM estimate of the within-group residual variance σ₀² is biased upward when there is clear evidence of heterogeneity of variance.

One obvious implication of overestimating the within-group residual variance is its effect on the empirical Bayes estimates of the regression coefficients. As equations 1.5 to 1.7 show, under the homogeneity-of-variance assumption (i.e., σⱼ² = σ², j = 1, ..., k), a larger value of σ² makes λⱼ in equation 1.5 smaller, so that less weight is given to the group's own data in forming its empirical Bayes estimate. Another implication stems from the homogeneity-of-variance assumption itself: groups whose actual residual variances σⱼ² are small are assigned an inflated residual variance, equal to σ², under that assumption.

Estimates of δ

The hyper-parameter δ is a scale parameter of the prior distribution of the exchangeable parameters {σⱼ²}, j = 1, ..., k. It is inversely related to the concentration of the σⱼ² around their typical value σ₀². Finding the posterior distribution of δ and its estimate answers the question: "How precise are the posterior mean estimates of the exchangeable parameters {σⱼ²} in estimating the typical value σ₀²?" The marginal posterior distribution of δ is found to cover its actual value within a 95 percent confidence interval in all of the artificial data sets.
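The coverage statement above corresponds to a simple check on the Gibbs output: form a central 95 percent interval from the retained draws of δ for a data set and ask whether the value of δ used to generate that data set falls inside it. A minimal sketch follows; the draws shown are placeholders rather than actual sampler output.

```python
import numpy as np

def central_interval_covers(draws, true_value, level=0.95):
    """Central posterior interval from the draws, and whether it covers true_value."""
    alpha = 100 * (1.0 - level) / 2.0
    lo, hi = np.percentile(draws, [alpha, 100.0 - alpha])
    return lo, hi, lo <= true_value <= hi

# Placeholder: 600 retained draws of delta for one artificial data set.
delta_draws = np.random.default_rng(1).gamma(shape=4.0, scale=0.05, size=600)

lo, hi, ok = central_interval_covers(delta_draws, true_value=0.20)
print(f"95% interval ({lo:.3f}, {hi:.3f}); covers generating value: {ok}")
```

The same check applies unchanged to the individual within-group variances σⱼ², whose coverage is discussed below.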
In most of the data sets, and especially in the High School and Beyond data set, the Gibbs estimates of δ are of about the same magnitude as its moment estimates, which are based on a log transformation of the residual variances (see chapter 5 for the derivation). Although the two are close in magnitude, the Gibbs estimates of δ from the data sets with k = 15 are larger than their moment estimates (see the tables with k = 15). This finding is expected: when the number of groups is relatively small, the marginal posterior distribution of δ becomes more positively skewed, which pulls the posterior mean (the Gibbs estimate) upward.

Estimates of τ²

The hyper-parameter τ² represents the variance of the prior distribution of the exchangeable parameters {Uⱼ}. In general, the Gibbs estimates of τ² are of the same magnitude as the HLM estimates for both the artificial data sets and the High School and Beyond data set. In most of the data sets, however, the Gibbs estimates of τ² are slightly higher than the HLM estimates, and for the artificial data sets with a small number of groups (k = 15) they are noticeably higher. The reason is that as the number of groups gets smaller, the marginal posterior distribution of τ² becomes more positively skewed, causing the posterior mean estimate of τ² (the Gibbs estimate) to exceed the posterior mode produced by HLM.

Estimates of {σⱼ²}

One advantage of this study is the ability to obtain the marginal posterior distribution of each within-group residual variance σⱼ², j = 1, ..., k. In all of the artificial data sets, the actual values of {σⱼ²} fall within the 95 percent confidence intervals derived from their marginal posterior distributions. The posterior means (the Gibbs estimates of {σⱼ²}), the actual values of {σⱼ²}, the standard deviations, the coefficients of variation (C.V.), and the sample sizes nⱼ for the selected groups in each artificial data set are presented. When compared to their actual values, the Gibbs estimates