This is to certify that the dissertation entitled "Large and Small Sample Properties of Maximum Likelihood Estimates for the Hierarchical Linear Model," presented by Dina Bassiri, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology and Special Education. Date: 11-4-88.

LARGE AND SMALL SAMPLE PROPERTIES OF MAXIMUM LIKELIHOOD ESTIMATES FOR THE HIERARCHICAL LINEAR MODEL

By Dina Bassiri

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
Department of Counseling, Educational Psychology and Special Education
1988

ABSTRACT

LARGE AND SMALL SAMPLE PROPERTIES OF MAXIMUM LIKELIHOOD ESTIMATES FOR THE HIERARCHICAL LINEAR MODEL

by Dina Bassiri

The multilevel character of educational data has implications of a general methodological nature. Interest in these methodological problems has recently been stimulated by the development of the EM algorithmic approach to variance component models. The EM algorithm produces maximum likelihood estimates of variance components with known large-sample properties. That is, the estimates are consistent and asymptotically efficient, with known large-sample normal distributions. However, at present little is known about the small-sample behavior of the parameter estimates.

The primary purpose of this Monte Carlo investigation is to understand the properties of maximum likelihood estimates in small and moderate samples using a two-stage hierarchical linear model with standardized normal predictors at both levels of the hierarchy (i.e., a standardized two-stage hierarchical linear model). Specifically, this research investigates the effects of variance estimation via the EM algorithm on the properties of parameter estimates at the second stage of the hierarchy, that is, the macro or fixed effects γ_00, γ_01, γ_10, and γ_11. These are the regression coefficients in the equations for the mean and slope at the second stage of the hierarchy. A secondary purpose is to evaluate the robustness and power of asymptotic z-tests of the macro parameters under various conditions determined by the number of groups, the group size, and the effect size.

The following are the major conclusions drawn from the investigation. (1) Macro estimators are unbiased, consistent, and asymptotically efficient, with asymptotically known normal distributions. (2) Error estimates of macro parameters are considerably affected by the number of groups, but not so much by the group size. (3) Precision of macro parameters is directly proportional to the number of groups and inversely proportional to the intraclass correlation coefficient. Increasing group size increases precision as well, yet the effect of one is not proportional to that of the other. (4) The micro parameter variance estimators for the slope and intercept of the first-stage regression model are biased but consistent and asymptotically efficient. Increasing the number of groups has a determinative effect on the parameter variance in slopes, but the parameter variance in intercepts is more influenced by group size.
(5) Within-group error variance estimates (σ²) are unbiased, consistent, and asymptotically efficient, and are considerably more affected by group size than by the number of groups. (6) The precision of variance component estimates, in contrast to that of the macro parameters, is directly related to the intraclass correlation coefficient. (7) Departures of empirical Type I error rates from nominal alpha for tests of macro parameters are typically within 99% confidence intervals. When outside the probability intervals, empirical significance levels are all liberal. No pattern emerged between empirical Type I error rates and the number of groups, group size, or effect size. (8) For all macro parameters, power increases as total sample size, number of groups, group size, or effect size increases. However, group size has a consistent, determinative effect on power relative to the number of groups.

To Mohammad and Yashaar

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank my committee chairperson, counselor, and friend, Dr. Stephen W. Raudenbush, for his invaluable support, insightful comments, and understanding. Working with him contributed greatly to my professional development. I would also like to thank Dr. Richard F. Houang, who has been a constant source of inspiration, encouragement, and genuine support throughout my graduate studies, as well as for his intellectual persuasion throughout this research. I wish to express my appreciation to the rest of my committee, Drs. Dennis Gilliland, William H. Schmidt, and Robert E. Floden, for reviewing my work and providing suggestions for improvement. I would further like to take this opportunity to acknowledge the support I received from the Spencer Foundation. Most importantly, my deepest gratitude goes to my dearest friend and husband, Dr. Mohammad Ali Chaichian, without whose understanding, patience, and moral support this work certainly would not have been completed, the person who always knew I could do it. I wish to express my appreciation to my parents, who have been a source of love and strength from before my time at Michigan State University. Last but not least, my deepest appreciation goes to my son Yashaar for his patience and understanding while I worked on my dissertation.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

I. STATEMENT OF THE PROBLEM

II. REVIEW OF THE LITERATURE
    Uni-Level Techniques
    Multilevel Techniques with Random Intercepts and Fixed Slopes
    The Multilevel Technique with Random Intercepts and Random Slopes: HLM
    Estimation of Dispersion Matrices
    Asymptotic Properties of Maximum Likelihood Estimates

III. TWO-STAGE HIERARCHICAL LINEAR MODEL (HLM)
    Estimation Under Known Variance Components
    Estimation Under Unknown Variance Components
    The Logic of EM Estimation
    Effects of Having to Estimate Variance Components

IV. METHOD
    Standardized Two-Stage Hierarchical Linear Model
    Parameters of the Study
    Design of the Study
    Description of the Generation Routine
    Monte Carlo Techniques
    Random Number Generation
    Analysis Routine
    Checking the EM Algorithm
    Type I Error Rate and Power

V. RESULTS
    Results for the Estimation Phase
        Are the Macro Parameters Asymptotically Unbiased and Consistent?
        Are the Macro Parameters Asymptotically Efficient?
        Do the Macro Parameters Have an Asymptotic Normal Distribution?
        Are the Variance Components Asymptotically Unbiased and Consistent?
        Are the Variance Components Asymptotically Efficient?
    Results for the Hypothesis Testing Phase
        Robustness Under Various Conditions
            Total Sample Size and Robustness
            Number of Groups and Robustness
            Sample Size and Robustness
            Intraclass Correlation Coefficient and Robustness
        Power Under Various Conditions
            Total Sample Size and Power
            Number of Groups and Power
            Sample Size and Power
            Intraclass Correlation Coefficient and Power
        EM Algorithm: Rate of Convergence

VI. DISCUSSION
    Conclusions
    Guidelines for the Researcher
    Suggestions for Future Research

APPENDIX
REFERENCES

LIST OF TABLES

    Layout of Type RBFP-2⁵ Design in Two Blocks of Treatment Combinations
    Alias Pattern for Type RBFP-2⁵ Design
    Varying Combinations of K and n for Blocks 0 and 1 of the RBFP-2⁵ Design
    Actual Values of Macro Parameters
    Expected Errors of Estimate in the Macro Parameters
    Error Estimates in γ_00
    Error Estimates in γ_01
    Error Estimates in γ_10
    Error Estimates in γ_11
    Measures of Dispersion in Macro Parameters
    Differences in Measures of Dispersion in Macro Parameters Estimated by HLM via the EM Algorithm (VAR) and the Cramér-Rao Lower Bound (CRLB) for Different Numbers of Groups
    Differences in Measures of Dispersion in Macro Parameters Estimated by HLM via the EM Algorithm (VAR) and the Cramér-Rao Lower Bound (CRLB) for Different Group Sizes
    Differences in Measures of Dispersion in Macro Parameters Estimated by HLM via the EM Algorithm (VAR) and the Cramér-Rao Lower Bound (CRLB) for Different Intraclass Correlation Coefficients
    Means and Variances of the z-Statistics for the Macro Parameters
    Error Estimates in τ_μ|W
    Error Estimates in τ_β|W
    Error Estimates in σ²
    Standard Errors for Nominal Alpha Levels and Number of Replications Used in the Study
    Probability Intervals for Nominal Alpha Levels and Number of Replications Used in the Study
    Type I Error Rates for Tests of Macro Estimators Under a True Null
    Type I Error Rates of Macro Parameter γ_00 Under a True Null
    Type I Error Rates of Macro Parameter γ_01 Under a True Null
    Type I Error Rates of Macro Parameter γ_10 Under a True Null
    Type I Error Rates of Macro Parameter γ_11 Under a True Null
    Power for Tests of Macro Parameters
    Power for Tests of Macro Parameter γ_01
    Power for Tests of Macro Parameter γ_10
    Power for Tests of Macro Parameter γ_11
    Average Convergence Rate of the EM Algorithm

LIST OF FIGURES

    Error Estimates in γ_00 and γ_01 for Different Combinations of K and n with d = .10
    Error Estimates in γ_10 and γ_11 for Different Combinations of K and n with d = .10
    Plot of Stabilized VAR and CRLB for γ_00 and γ_01
    Plot of Stabilized VAR and CRLB for γ_10 and γ_11
    Normal Probability Plot of z-Statistics for γ_00, γ_01, γ_10 and γ_11
    Error Estimates in Transformed τ_μ|W and τ_β|W for Different Combinations of K and n with d = .10
    Error Estimates in Transformed σ² for Different Combinations of K and n with d = .10
    Plot of Transformed Estimated and Asymptotic Variance of τ_μ|W and τ_β|W
    Plot of Transformed Estimated and Asymptotic Variance of σ̂²
    Power Curves of γ_01, γ_10 and γ_11 for Different Numbers of Groups, K
    Power Curves of γ_01, γ_10 and γ_11 for Different Group Sizes, n

CHAPTER I
STATEMENT OF THE PROBLEM

Science's main job is to "explain" natural phenomena by discovering and studying the relations among variables. In the behavioral sciences, variability is itself a phenomenon of great scientific curiosity and interest. In their attempts to explain the variability of a phenomenon of interest (often referred to as the dependent variable), scientists study its relations or covariations with other variables (referred to as the independent variables). Educational researchers seek to explain the variance of school achievement by studying its relations with intelligence, aptitude, social class, race, home background, school atmosphere, teacher characteristics, and so on.

Various analytic techniques have been developed for the purpose of studying relations between independent variables and dependent variables, or the effects of the former on the latter (Pedhazur, 1973). Perhaps the most powerful method of doing this is regression analysis, whose simplest form is one in which the effect of a single independent variable on a dependent variable is studied (Pedhazur, 1973). Under this simple conception the two parameters of interest are the slope and intercept, which are usually called regression coefficients. The test statistic used for either of the regression coefficients is the z-test (or the t-test if the sampling variance, σ², is replaced by its unbiased estimator).

So long as we deal with a situation where the variables have the same level of aggregation (e.g., both at the individual or the group level) and where our measurement processes are assumed to be error free, there is no real drawback to this approach. But in educational research many, if not most, data have multilevel characteristics. For example, students are nested within classrooms, and classrooms are nested within grade levels, which are themselves nested within schools, districts, or program sites. Thus, we can have variables at different levels, describing students, classes, schools, and so on. Variables such as family background, prior achievement, parental educational level, and the like are individual (or micro) variables identified with students, and variables such as whether the school is public or private are group (or macro) variables. In a multilevel problem we want to investigate the relations between variables at different levels of the hierarchy as well as interactions across levels. There has been a great surge of interest in educational statistics over the past decade in the search for appropriate statistical methods for hierarchical, multilevel data.
As a result of this search, a general approach to the problem of multilevel data, referred to as hierarchical linear models (HLM) by Sternio (1981), has emerged. The basic idea of a hierarchical linear model is fairly simple. When data are available at two levels of aggregation, for example on students and the schools to which they belong, the model is specified in two sets of equations: one within schools, and one between schools. The within-school model is defined separately for each school, with student level predictors and a student level outcome variable. This is a familiar linear regression model with one major exception: the within-school parameters, the regression coefficients, are allowed to vary randomly across schools. This conception poses a second, or between-school, model. The between-school model then regresses the within-school regression coefficients onto the school level predictors.

Two sets of parameters evolve from this formulation: micro parameters, or random effects, and macro parameters, or fixed effects. Research interest has focused on estimation of both micro and macro parameters. As Raudenbush (1988: 87) has pointed out:

"Two fundamentally different types of problems have motivated the development of these HLM models. In the first type, interest focuses on the micro parameters or random effects. One seeks to estimate, for instance, a regression equation for a particular school, the effect size for a particular study, or the growth rate of a particular child when the data available for that school, study, or child are sparse. The empirical Bayes approach strengthens estimation for each unit by utilizing data from many similar units: schools, studies, or children. In the second type of application, attention focuses on the macro parameters, or fixed effects. One asks why some kinds of schools have smaller regression slopes than others, why some studies report larger effects than others, and why some children grow faster than others."

The conception that "micro" parameters vary randomly across the population of groups as a function of "macro" parameters not only justifies the "slopes as outcomes" idea (Burstein, 1980), but also introduces a new source of variation in micro parameters, referred to as random effect variance or parameter variance. This is the variance among the micro parameters themselves, which is distinguished from the sampling variance resulting from using a sample within each macro unit to estimate these parameters.

The new advances in analyzing multilevel data have evolved from statistical theory stimulated by the seminal contributions of Lindley and Smith (1972), Novick, Jackson & Thayer (1972), and Smith (1973), who developed Bayesian estimation procedures for hierarchically structured data. When the variance components (i.e., the sampling and parameter variances) are known, estimates for the micro and macro parameters can be derived from alternative estimation theories: least squares, Bayesian, and maximum likelihood (see, for example, Raudenbush, 1984). The crucial difference between the Bayesian/maximum likelihood approach and the least squares approach is the difference in assumptions. In most applications, however, these variance components have to be estimated. Unfortunately, no simple closed-form estimate is available. However, a variety of numerical approaches to maximum likelihood estimation of covariance components are available, among which the EM algorithm (Dempster, Laird & Rubin, 1977) is especially conceptually appealing.
The EM algorithm produces maximum likelihood estimates for variance components with known large-sample properties. That is, the estimates are consistent and asymptotically efficient, with known large-sample normal distributions. The fact that the sampling distributions are known becomes especially important when inferences are to be made based on the parameter estimates. The test statistic for a macro regression coefficient is a z-test (an asymptotic z-test, or a t-test, if the variance components are replaced by their maximum likelihood estimates). But before asymptotic results become exact, the number of levels of each random factor must increase to infinity (Miller, 1977). That is, for example, both the number of schools (call it K) and the number of pupils (call it n) within each school must approach infinity.

At present little is known about the small-sample behavior of these parameter estimates. To date it is not clear how large n and K have to be in order for the estimates and their standard errors to become acceptable, thus justifying the use of large sample theory.

The goal of this research is to understand the properties of maximum likelihood estimates obtained from small and moderate samples, and to evaluate their implications for research design. Because analytic study of these properties becomes intractable in the case of unknown variances and covariances, empirical studies are needed. Clearly, to gain a comprehensive understanding of the inferential strength of the hierarchical linear model, and to understand the small-sample properties of its parameter estimates, many simulation studies are needed. In other words, alternative HLM methods with different model specifications and assumptions, or at least the most interesting and realistic ones, have to be studied. This research will take the initial step and will address these issues by considering the two-stage standardized hierarchical linear model. Specifically, this research, through simulated data generated for different values of K and n, will investigate the effects of variance estimation via the EM algorithm on inferences about parameters at the second stage of the hierarchy, that is, about the macro or fixed effects.

The following chapters will review the literature with respect to statistical approaches to multilevel data, discuss maximum likelihood parameter estimation, present the method used for investigating the small-sample properties of parameter estimates, and provide results and a discussion of their implications for research design.

CHAPTER II
REVIEW OF THE LITERATURE

A long-standing problem associated with educational research has been the failure of many quantitative studies to attend to the complexity of data usually produced by hierarchical, multilevel educational field research (Cronbach, 1976; Haney, 1980; Burstein, 1980; Cooley, Bond & Mao, 1981; Rogosa, 1978). Cronbach (1976) remarked that the majority of studies of educational effects carried out until 1976 conceal more than they reveal, and that "the established methods have generated false conclusions in many studies" (p. 1).

Uni-Level Techniques

Traditionally, statistical approaches have attempted to adapt uni-level techniques to multilevel situations. This can often be done by using aggregation or disaggregation. A student (micro) variable, such as intelligence, can be aggregated to the school level by assigning to a school the average intelligence of its students.
A school (macro) variable, such as whether it is public or private, can be disaggregated to the student level by assigning to each student the type of school. But as de Leeuw and Kreft (1986) pointed out, "the operations of aggregation and disaggregation are highly nontrivial, both from the methodological and from the statistical point of view." Conceptually, by aggregating, a change in the meaning of the variables occurs. Statistically, this means we are ignoring all within-school variation, which sometimes results in a dramatic increase in the correlation between aggregated variables. Robinson (1950) showed that not only does the correlation fluctuate as a function of grouping, but that the sign may even be different at different levels. As a result, we can no longer make inferences at the student level without committing the 'ecological fallacy' (Alker, 1969; Cronbach and Webb, 1975; Hannan, 1971; Robinson, 1950). This refers to the practice of interpreting correlations between aggregated variables as if they were correlations between variables measured on individuals (i.e., cross-level inference). This is the most commonly cited flaw in early methodological treatments of hierarchical data. On the other hand, if we disaggregate, we have to take into account the fact that students within the same school do not respond independently to a school level variable. But traditional linear models require the assumption that subjects respond independently to educational programs. Also, by ignoring the nested structure of the data we will misestimate the precision of parameter estimates, resulting in serious inferential problems (Aitkin, Anderson & Hinde, 1981; Knapp, 1977; Walsh, 1947).

In the late 1960s and early 1970s, the topic of aggregation and the proper choice of analytic units (using the student versus using the group) gained popularity in educational field research (see Burstein, 1980 for a review of these issues). This increased interest may be viewed as a natural by-product of the then growing emphasis in educational research on the evaluation of social and educational programs; evaluations that had to be designed and analyzed in such a way as to take into account the ever-present natural hierarchy found in all school systems. With the awareness that students within a class and teachers within a school cannot be considered truly independent, and that responses to treatment may rightfully vary dramatically across groups, came an increased interest in how to deal with non-independence and with differential effects for distinct population groups, often labeled aptitude-treatment interactions (Cronbach and Webb, 1975).

Up to the early 1970s, then, considerable research had been done on the effects of aggregation on bias and efficiency under various grouping strategies. However, little of this research was grounded in practical applications. As Burstein (1980), in his concluding remarks on the choice of units of analysis, points out, if the goal is to learn something about the effect of educational process on student achievement, the "discussions about the choice of an appropriate unit are simply unnecessary digressions" (p. 196). The emphasis should be on choosing an appropriate analytical model that accounts for the relationship among variables observed at both levels of aggregation. Rogosa (1978: 83) remarked, "no one level is uniquely responsible for the delivery of and the response to educational programs.... confining substantive questions to any one level of analysis is unlikely to be a productive research strategy."
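Robinson's point, that correlations computed at different levels of aggregation can differ even in sign, is easy to reproduce with simulated data. The following sketch is only an illustration of the ideas above and is not part of the original study; the number of groups, group size, effect sizes, and variable names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    K, n = 30, 50                       # hypothetical: 30 schools of 50 students

    # Group means of X and Y are negatively related across schools ...
    mu_x = rng.normal(0.0, 1.0, K)
    mu_y = -1.0 * mu_x + rng.normal(0.0, 0.3, K)

    # ... while within every school the individual-level relation is positive.
    x = np.repeat(mu_x, n) + rng.normal(0.0, 1.0, K * n)
    y = np.repeat(mu_y, n) + 0.5 * (x - np.repeat(mu_x, n)) + rng.normal(0.0, 1.0, K * n)
    g = np.repeat(np.arange(K), n)

    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    # Correlation computed on individuals, ignoring grouping entirely
    r_total = corr(x, y)

    # Correlation of the aggregated (school mean) variables
    xbar, ybar = x.reshape(K, n).mean(1), y.reshape(K, n).mean(1)
    r_between = corr(xbar, ybar)

    # Pooled within-school correlation (individuals centered on school means)
    r_within = corr(x - xbar[g], y - ybar[g])

    print(f"total r = {r_total:.2f}, between r = {r_between:.2f}, within r = {r_within:.2f}")

Depending on how the group means are arranged, the total and between-group correlations can carry a different sign from the within-group correlation, which is precisely the hazard of cross-level inference described above.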
Burstein and his associates (Burstein, Linn & Capell, 1978; Burstein, Miller & Linn, 1979; Burstein, 1980) argue that when the relationships between the dependent and independent variables are different in different groups, single level analyses at either the individual or the aggregate level will produce misleading results. For example, conducting analyses at the individual level without regard for group membership might lead to spurious null effects or to spuriously large effects; in either case, the actual effects will only be uncovered through within-group analysis. Thus, these researchers advocate conducting selected within-class analyses and using the results of these regressions in aggregate level analyses. Cooley et al. (1981) reached the same conclusion and pointed out: "We must not ignore the possibility of variation among groups (e.g., classrooms or schools) in estimating a variable's effect. Examining this variation can reveal grouping effects or specification error; ignoring it will conceal them." (p. 74)

Such criticisms of single level analyses suggest that multilevel approaches are needed in many settings. Such models would aim simultaneously to discover: 1) what is happening within macro units; 2) what differences there are between macro units; and 3) how those differences influence the quality of what is going on within the macro units. To be valid, statistical analyses must account simultaneously for effects at both levels.

Multilevel Techniques With Random Intercepts and Fixed Slopes

In the mid-seventies, the problem of aggregation bias was resolved by analyzing multilevel data with multilevel techniques. Some of these alternative analytical strategies are: the separate between-group and pooled within-group analyses suggested by Cronbach (1976) and Cronbach and Webb (1975); a two-stage hierarchical analysis proposed by Keesling and Wiley (1974) and Wiley (1976); and a "full model" analysis suggested by Keesling (1977). All three strategies obtain their estimators through the ordinary least squares (OLS) technique, but they differ in the approach by which the estimators are obtained. Notice that in the case of random intercepts, OLS is an appropriate estimation method only when balanced designs are considered. Schmidt and Houang (1983) compare and contrast these three approaches with respect to parameter estimation. They concluded that these strategies differ in the way the relationship between the between-group effects and the within-group effects is conceptualized. Analytically, they showed that all three procedures give the same estimate for the within-group regression coefficient. With respect to the between-group regression coefficient, the estimate obtained by Cronbach's approach is different from, but related to, that of Keesling's. As far as the third approach is concerned (i.e., Keesling and Wiley's two-stage analysis), no estimate for the between-group regression coefficient is available. These differences reflect different conceptualizations in the three strategies. That is, whereas in Cronbach's and Keesling's approaches the individual level variables are conceptualized to have a direct impact on the outcome variables, in Keesling and Wiley's approach their influence is indirect and mediated through other group level variables. This difference in conceptualization of the situation sets a criterion for choosing among these three multilevel techniques (Schmidt and Houang, 1983).
The model proposed by Wisenbaker and Schmidt (1979) may be viewed as the extension of these multilevel techniques to their multivariate form. The application of a "components of covariance structure" method (Schmidt, 1969) to the proposed multivariate random effects model (with random intercepts and fixed structural parameters) allows the simultaneous estimation of between-group and within-group effects and their standard errors via a maximum likelihood procedure. The model potentially permits different specifications for the relations at the two levels of the hierarchy.

Although these analytical strategies accounted for aggregation bias, the other technical problem, misestimated precision, remained unresolved. The problem is that these methods allow for random intercepts, but they assume constant within-group slopes. The technical consequences of ignoring slope heterogeneity when in fact it exists are inefficient estimation of the regression coefficients and negatively biased standard errors of the regression coefficients, which inflate the Type I error rate.

Cronbach (1976) cites three sources of variation in within-class slopes: 1) sampling variability and stability problems due to small class sizes when the processes operating in the classes are basically the same; 2) differences in the selection factors operating to form the classes; and 3) differences in the causal processes going on in the classrooms. If we can rule out chance effects and different selection rules as reasonable explanations, the variation in within-class slopes becomes a potent source of information for researchers and policy makers.

Tate and Wongbundhit (1983) argued that random coefficient models with random slopes and intercepts are more appropriate than random coefficient models with only a random intercept for multilevel analysis in educational research. De Leeuw and Kreft (1986) further added that random coefficient models are more general and that fixed constants are special random variables. They argued that "whether something is random or fixed should be decided by considering what would happen if we replicate the experiment. Would it be realistic to suppose that regression coefficients stayed the same under replications? If not, then random coefficients are appropriate" (p. 59).

Boyd and Iversen (1979) discussed a "separate equations" approach in which both intercept and slope are allowed to be random. But their estimation procedure is ordinary unweighted least squares for both sets of coefficients, which ignores the information provided by the random coefficient model. One class of multilevel approaches in which random coefficients are estimated (Burstein and Miller, 1980; Cooley et al., 1981) first estimates the relationships within each school; these regression coefficients then serve as outcome variables for an assessment of the importance of school policies and practices. Again, this approach is not free from problems. Raudenbush and Bryk (1986) discussed some of the technical difficulties associated with the slopes-as-outcomes approach. Among these problems are weak statistical power to detect real differences in slopes, the need for a multivariate formulation so that several regression coefficients per unit can be studied, and the need for a statistical model which matches the complexity of the hierarchical, multilevel character of most educational field research data.
An additional problem with Cooley et al.'s proposed model is that all random variation in the effects of macro units is assumed to be explained by the predictors included in the model, so that the only unexplained variation results from sampling of micro units (i.e., unsystematic or sampling variation). In traditional analysis of variance terms, such a model is a fixed effects model. Assuming the model is completely specified, there is no drawback to this approach. However, when the model specification is incomplete, which will commonly be the case, the parameter estimates of the regression coefficients and their standard errors are untrustworthy.

The Multilevel Technique With Random Intercepts and Random Slopes: HLM

A general approach to the problem of multilevel data (Aitkin and Longford, 1986; de Leeuw and Kreft, 1986; Goldstein, 1986; Mason, Wong & Entwisle, 1984; and Raudenbush and Bryk, 1986) incorporates the idea of "slopes as outcomes" without its various deficiencies. This general approach with random effects at each sampling level has been proposed under a variety of names: variance component models (Harville, 1977), mixed model ANOVA (Elston and Grizzle, 1962), regression with random coefficients (Rao, 1972; Swamy, 1973; Rosenberg, 1973; and Dielman, 1983), Bayesian estimation for linear models (Lindley and Smith, 1972; Smith, 1973; Dempster, Rubin & Tsutakawa, 1981; and Morris, 1983), multilevel linear models (Mason et al., 1984), mixed linear models (Goldstein, 1986), and hierarchical linear models (HLM) (Sternio, Weisberg & Bryk, 1983). The present study employs the term hierarchical linear model, labeled HLM for convenience.

The HLM has a hierarchical structure in the sense that parameters at a lower level of aggregation (i.e., micro parameters) are assumed to vary over a population of groups as a function of the parameters at the next higher level (i.e., the macro level). Micro parameters may be as diverse as means, proportions, variances, linear regression coefficients, and logit linear regression coefficients (see Raudenbush, 1988). Through such models, it is possible to assess the strength of the relationship between macro predictors and micro parameters. This quality, along with the "slopes as outcomes" idea, enables investigators to go beyond traditional questions (e.g., why do some schools have higher achievement than others?) and ask more fundamental questions about why structural relationships vary across groups. This class of questions (e.g., why is the effect of social class or race stronger in some schools than in others?) reflects the "slopes as outcomes" conceptualization popularized by Burstein (1980). The HLM identifies both slope and intercept heterogeneity and tries to explain them via related macro predictors. Not only do such models enrich the class of research questions asked about educational effects occurring within and between educational units, they also solve the problems of aggregation bias and misestimated precision long associated with multilevel data.

Estimators of micro and macro parameters are available through empirical Bayes methods. The empirical Bayes estimates of the micro parameters (also called shrinkage or Stein estimators) provide an improvement over the least squares estimators. This improvement is most pronounced when some or all groups have sparse data and when there is heterogeneity among the micro parameters, some of which can be explained by group characteristics.
Estimation of the micro parameters can be improved by shrinkage of least squares estimates around a grand mean (known as "unconditional shrinkage") in the first situation, and by shrinkage toward a conditional expectation (known as "conditional shrinkage") in the second situation. The empirical Bayes approach also yields estimates for the macro parameters. This estimator, which is recognizable as the generalized least squares estimator, weights each OLS estimate of the micro parameters proportional to its precision. Estimation of the macro parameters is of great importance, not primarily because it improves estimation of the micro parameters, but because it enriches the class of research questions asked about educational effects far beyond what was plausible prior to the advent of HLM models.

Research interest has focused on estimation of both micro parameters and macro parameters, each addressing fundamentally different types of questions. Studies with the goal of improving micro estimators (by either conditional or unconditional shrinkage) include Laird and Ware, 1982; Raudenbush and Bryk, 1985; and Sternio et al., 1983 for the first type of shrinkage, and Braun, Jones & Rubin, 1983; DerSimonian and Laird, 1983; Novick et al., 1972; Novick and Jackson, 1974; Rubin, 1980 and 1981; and Shigemasu, 1976 for the second type of shrinkage. However, numerous investigators have recently found that the macro parameters themselves may be of greater interest (Aitkin and Longford, 1986; Aitkin et al., 1981; de Leeuw and Kreft, 1986; Goldstein, 1986; Laird and Ware, 1982; Lee, 1986; Mason et al., 1984; Raudenbush and Bryk, 1985 and 1986; and Sternio et al., 1983).

The HLM model has broad applicability in educational research. The study of individual growth (Laird and Ware, 1982; Sternio et al., 1983; Bock, 1983), the measurement of change (Bryk and Raudenbush, 1987), contextual effects in cross-national fertility research (Mason et al., 1984), and research synthesis or "meta-analysis" (Raudenbush and Bryk, 1985) are examples of HLM's broad applicability. The major problem with this development is the mathematical complexity of Bayesian covariance components estimation. Fortunately, a variety of numerical approaches to maximum likelihood estimation of covariance components are now available.

Estimation of Dispersion Matrices

Estimation of dispersion matrices in multilevel linear models with fixed and random effects (i.e., mixed models) can be complex, particularly in the unbalanced case. The traditional 'ANOVA' approach is essentially the only method in use for balanced data. This method consists of equating the observed sums of squares and cross-products matrices to their expected values. For unbalanced data, the 'ANOVA' approach leads to biased estimators of the variance components. Henderson (1953) developed analogous techniques to correct this deficiency. Searle (1968, 1971a, and 1971b) gives excellent descriptions of Henderson's methods and indicates various generalizations. One problem with Henderson's methods for estimating variance and covariance components is that the methods are not necessarily well defined. Moreover, except for balanced data cases, little is known about the properties of the Henderson estimators, other than that they are unbiased and translation invariant. It is known that, at least in particular cases, there are biased estimators that have uniformly (assuming normality) smaller MSEs than the Henderson estimators (see Klotz, Milton and Zacks, 1969).
Seely (1975) and Olsen, Seely, and Birkes (1976) proved that, at least in the case of most unbalanced mixed or random effects models having one random factor, there exist estimators that have uniformly smaller variance than the Henderson estimators. These locally best estimators are closely related to maximum likelihood estimators (Hocking and Kutner, 1975).

Maximum likelihood and related procedures, which are reviewed by Harville (1977), have received increased attention in the past ten years. However, the maximum likelihood approach has been somewhat ignored by practitioners because of computational complexities and because it takes no account of the loss in degrees of freedom (df) from the estimation of fixed effects, leading in some instances to large biases and large mean squared errors (Patterson and Thompson, 1974). Improved computational procedures are now available, and Patterson and Thompson (1971, 1974) have devised a modified ML approach known as 'restricted maximum likelihood' that adjusts automatically for losses in df. As Harville (1977: 320) states: "Certain deficiencies of various other methods are not shared by maximum likelihood. In particular, the maximum likelihood approach is 'always' well defined, even for the many useful generalizations of the ordinary ANOVA models, and, with maximum likelihood, nonnegativity constraints on the variance components or other constraints on the parameter space cause no conceptual difficulties. Moreover, the maximum likelihood estimates and the information matrix for a given parameterization of the model can be obtained readily from those for any other parameterization."

Asymptotic Properties of Maximum Likelihood Estimates

The attractive features of maximum likelihood estimates of variance-covariance components, discussed by Harville (1977), are important. The maximum likelihood estimates are functions of sufficient statistics and are consistent, i.e., they converge to the population values as the sample size becomes indefinitely large. Their joint distribution is approximated by the multivariate normal distribution with mean equal to the population value and variance-covariance matrix equal to the negative inverse of the matrix of second derivatives of the likelihood function. Moreover, the maximum likelihood estimators are asymptotically efficient (in the sense described by Miller, 1973 and 1977), attaining the Cramér-Rao lower bound for the covariance matrix under mild regularity conditions. There is, however, no guarantee of unbiasedness or efficiency in small samples.

In order to obtain asymptotic results in the mixed model, the number of levels of each random factor must increase to infinity. More often, in the analysis of variance, a conceptual sequence of experiments with the number of levels of each of the random factors increasing to infinity is considered. Hartley and Rao (1967) were the first to attempt an asymptotic theory that would be truly appropriate for the more complicated of the ordinary ANOVA models. They proved that under certain restrictions the estimates were consistent and asymptotically normal as the size of the experimental design increased. However, one of their assumptions is that the number of observations at any level of any factor must remain less than some fixed constant for all designs in the sequence. This assumption eliminates many crossed designs where the number of observations at a given level of one factor is proportional to the number of levels of another factor.
An alternative way of obtaining asymptotic results in the mixed model is by considering repetitions of a given experiment. Anderson (1969, 1971) considered maximum likelihood estimates in a more general class of models (multivariate models where the covariance matrix has linear structure) and proposed a different solution; he proved that the estimates were consistent and asymptotically normal as the entire design was repeated. Miller (1973) developed an asymptotic theory for the ordinary ANOVA models which, while similar to that presented by Hartley and Rao (1967), does not exclude any cases of real interest. He considered asymptotic properties of the maximum likelihood estimates for a large class of design sequences whose size increases to infinity; this class of design sequences contains all sequences treated by Hartley and Rao and most sequences which could occur in practice. In other words, he took the basic model of Hartley and Rao, rewrote it in the form used by Anderson, and proved consistency and asymptotic normality of the estimates in the model.

Raudenbush (1988), in his paper entitled "Educational Applications of Hierarchical Linear Models: A Review," provides a comprehensive review of the HLM model with respect to estimation theory and application. In his concluding remarks he states, "despite the clear potential of such models, important questions about their statistical properties remained unanswered. The questions concern small sample properties, implications for research design and robustness of violations of assumptions" (p. 111). This research will take the initial step and will address questions about the small-sample properties of the estimators and their implications for research design by considering a two-stage standardized hierarchical linear model.

CHAPTER III
TWO-STAGE HIERARCHICAL LINEAR MODEL (HLM)

In this chapter, a mathematical model for the general two-stage hierarchical linear model (HLM) is presented. This is followed by a description of parameter estimation when the variance components are known and when they are unknown. Then the logic of the EM algorithm, along with the steps involved in its implementation, is discussed. Finally, the effects of estimating variance components on the macro or fixed parameters are described.

For reasons of simplicity and clarity, a two-stage hierarchical linear model is considered, although the statistical theory permits more levels (see Goldstein, 1986). The basic idea of HLM is reasonably simple. We begin by supposing that the researcher has data at two levels of aggregation, for example, on students and the schools to which they belong. The model is specified in two sets of equations: one within schools, and one between schools. Our fundamental assumption is that the outcome variable in some way depends on the student level predictors and that the micro regression coefficients may vary systematically as a function of the school level predictors. The within-school model is defined separately for each school. This is a familiar linear regression model, with student level predictors and a student level outcome variable.
In this case of a simple univariate regression model the within-school model (or micro model) becomes Yij II “1 + 81 xij '5‘ R11 (3.1) and 2 Rij~N(0.Oj) where Yij is the outcome score for student 1 in school j; where j = 1,..., n “j and 31 are the micro level regression coeffi- cients within school j; Xij is the micro level predictor for student 1 in school j; and RLj represents random error, assumed independently normally distributed with zero mean and variance 2 0'3 ' By centering the micro level predictor around its respective group mean, x21 , “1 represents the mean on the 24 outcome variable in school j. Equation (3.1) is a standard linear "full rank" regression model with one major exception; the within-school parameters, u and B are 1 j allowed to vary randomly across schools. This conception poses a second or between-school model. The between-school model (or macro model) may be either unconditional (involving no macro level predictors) or con- ditional (involving macro level predictors). The uncondi- tional model is: “j ' “+U0j’ (3.2) B - 8 + U j 11, (3.3) U ~ N (0. T“). 0.1 Ulj ~ N (0, TB), cov ( U U ) ' T Oj’ lj H8 that is, p1 and 31 are viewed as a functions of their respective grand mean across all schools plus random error. Under this simple model, TU and 1 represent the parameter 8 variances in U01 and U11 respectively. That is, they signify the variability in the true intercept and slope across the population of schools, and that THE signifies the covariance between them. Treating W as potential determinant of u 1 and 31 , leads to the following conditional between-school model: “3 - Yoo + Y01 Wj + "03 (3.4) 25 8j - 710 + Yll Wj + Ulj (3.5) and U ~ N ( 0 . T ) ulW 01 U11 ~ N ( 0 , TBIW ) cov ( UOj , UIj ) - TUBIW where THIW and-rBIW are the conditional parameter variances in 00;) and U11 respectively, and TIJBIW is. their conditional covariance. The micro errors are assumed independent of the macro errors. Equations (3.4) and (3.5) represent the effects of macro predictor W on the two micro parameters, pj and 31 . These two equations combined with equation (3.1) define a multilevel model that can be written equiva- lently as a single equation by substituting (3.4) and (3.5) into (3.1): (3.6) + (U + X U + R ) Y 0:1 ij 13 11 11 ' Yoo + Yo1wj + Yloxij + Yllxijwj The brackets in equation (3.6) enclose error terms that complicate the expression considerably, as they do its estimation. The presence of macro error terms in (3.4) and (3.5) make (3.6) a mixed model, because it contains fixed coefficients (the y'S) and random coefficients ( the U's ). The model shown in equation (3.6) is quite general in that a number of familiar models can be derived from it. If the macro errors are suppressed, the hierarchical linear model (3.6) becomes equivalent to an ordinary regression 26 model (or fixed effect specification) that includes student level variable Xij , school level variable, W1 , and their interaction effect, and its estimation poses no X W , 113 special problem. Under this model we assume that all of the variation in the micro parameters, p1 and sj , has been perfectly explained by knowledge of the macro level variable, Wj , whereas equation (3.6) allows for error. When random effects remain (i.e., 051 and/or U13 zero), application of ordinary least squares to (3.6) is are not equal to inefficient, and the estimated standard errors are too small. Another model that has received some attention is "random intercept regression model". 
Another model that has received some attention is the "random intercept regression model". This model considers the within-school intercepts, μ_j, as random, but the regression slopes, β_j, as fixed. Some variant of this model has been employed by Aitkin et al. (1981), Cronbach (1976), Keesling (1977), Keesling and Wiley (1974), and Wisenbaker and Schmidt (1979). There are hypothesis tests in each case to decide whether or not it is justifiable to make these simplifications (Raudenbush and Bryk, 1986). Mason et al. (1984) provide a detailed discussion of the relationship between the general hierarchical linear model (3.6) and other simplified sub-models of potential interest that can be derived from it.

Estimation Under Known Variance Components

Estimates for the parameters in HLM models assuming known variance components can be derived from alternative estimation theories: least squares, Bayesian, and maximum likelihood (see, for example, Raudenbush, 1984). Using matrix notation to generalize the model, equation (3.1) becomes

    Y_j = X_j θ_j + R_j,        R_j ~ N(0, Σ_j),                   (3.7)

where Σ_j = σ_j² I_nj (I_nj is the identity matrix of order n_j),

    θ_j = [μ_j, β_j]',   Y_j = [Y_1j, ..., Y_njj]',   R_j = [R_1j, ..., R_njj]',

and the unconditional between-school model, equations (3.2) and (3.3), reduces to a single equation of the form

    θ_j = μ_θ + U_j,        U_j ~ N(0, τ),                         (3.8)

where

    μ_θ = [μ, β]',   U_j = [U_0j, U_1j]',   τ = | τ_μ    τ_μβ |
                                                | τ_μβ   τ_β  |.

Under the Bayesian approach, assuming the variance components are known, the minimum mean squared error point estimators for the micro and macro parameters are

    θ_j* = Λ_j θ̂_j + (I − Λ_j) μ_θ*                                (3.9)

and

    μ_θ* = (Σ_j Λ_j)⁻¹ (Σ_j Λ_j θ̂_j),                              (3.10)

where Λ_j = τ (τ + V_j)⁻¹ and θ̂_j is the ordinary least squares estimate of θ_j for each school, with sampling error

    V_j = var(θ̂_j | θ_j) = σ_j² (X_j' X_j)⁻¹.

Λ_j represents a "multivariate ratio" of the true parameter variance in θ_j to the total observed variance in θ̂_j. This ratio signifies the reliability of θ̂_j as an estimator of school j's intercept and slope. It follows that θ̂_j = (X_j' X_j)⁻¹ X_j' Y_j is normally distributed with mean μ_θ and variance V_j + τ, i.e.,

    var(θ̂_j) = var(θ̂_j | θ_j) + var(θ_j) = V_j + τ.

The empirical Bayes estimator μ_θ* is a generalized least squares estimate of μ_θ, in which each outcome vector θ̂_j (i.e., the OLS estimates of the micro regression coefficients) is weighted by its precision. The empirical Bayes estimator θ_j* is a weighted combination: first, of θ̂_j, the OLS estimate derived for each school based only on the student data from that school; and second, of μ_θ*, the estimated mean for the population of schools. That is, θ_j* is a vector whose elements lie somewhere between the elements of θ̂_j, derived entirely from within macro unit j, and the elements of μ_θ*, the estimated mean vector for the entire sample.

The properties of these estimators are reviewed by Efron and Morris (1975) and Morris (1983). Such estimators are conditionally biased, i.e., the bias is largest for θ_j values far from the average. However, in general θ_j* is a more precise estimator (i.e., it has smaller expected mean squared error) than θ̂_j, its OLS counterpart. Sternio (1981) reasoned, however, that the precision of θ_j* could be improved even further by shrinking the estimates θ̂_j not toward a grand mean, μ_θ*, but toward a conditional mean, W_j γ*.
This is obtained by regressing θ_j onto the macro predictor W as follows:

    θ_j = W_j γ + U_j,                                             (3.11)

where

    γ = [γ_00, γ_01, γ_10, γ_11]'   and   W_j = | 1   W_j   0   0   |
                                                | 0   0     1   W_j |.

Under this formulation the empirical Bayes estimators, or equivalently the posterior means, for the micro and macro parameters are

    θ_j* = Λ_j θ̂_j + (I − Λ_j) W_j γ*,                             (3.12)

    γ* = (Σ_j W_j' Δ_j⁻¹ W_j)⁻¹ (Σ_j W_j' Δ_j⁻¹ θ̂_j),              (3.13)

where Λ_j = (τ|W)(τ|W + V_j)⁻¹ and Δ_j = (τ|W) + V_j. These results generalize to the case of multiple X's and multiple W's. The posterior dispersions of θ_j and γ are given in equations (3.14) and (3.15) respectively (Raudenbush, 1988: 91):

    D_θj* = Λ_j V_j + (I − Λ_j) S_j (I − Λ_j)',                    (3.14)

where

    S_j = W_j (Σ_j W_j' Δ_j⁻¹ W_j)⁻¹ W_j',

and

    D_γ* = (Σ_j W_j' Δ_j⁻¹ W_j)⁻¹.                                 (3.15)

It is worth noting that the crucial difference between the three alternative estimation theories (least squares, Bayesian, and maximum likelihood) is the difference in assumptions. With regard to θ, the Bayesian and maximum likelihood methods lead to identical results, but these differ from least squares. This is because the least squares method makes no assumptions about the prior distribution of θ_j. On the other hand, both Bayes and maximum likelihood assume normality of the θ_j in order to derive θ_j*. With regard to γ, all three approaches effectively assume no prior distribution and therefore produce identical results (Raudenbush, 1984).
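Equations (3.12) through (3.15) translate directly into a short computation once τ|W and σ_j² are treated as known. The sketch below is a minimal illustration of that computation for the hypothetical data generated earlier; it assumes the true variance components are supplied, and the function name, data format, and printed output are illustrative only, not part of the original study.

    import numpy as np

    def known_variance_estimates(data, tau, sigma2):
        """Empirical Bayes / GLS estimates of gamma* and theta_j* for known tau and sigma2
        (equations (3.12)-(3.15)); `data` is a list of (Y, x, W_j) tuples as generated above."""
        A = np.zeros((4, 4))          # accumulates sum_j W_j' Delta_j^-1 W_j
        b = np.zeros(4)               # accumulates sum_j W_j' Delta_j^-1 theta_hat_j
        per_school = []

        for Y, x, w in data:
            Xj = np.column_stack([np.ones_like(x), x])           # within-school design matrix
            XtX_inv = np.linalg.inv(Xj.T @ Xj)
            theta_hat = XtX_inv @ Xj.T @ Y                       # OLS (mu_hat_j, beta_hat_j)
            Vj = sigma2 * XtX_inv                                # sampling dispersion of theta_hat_j
            Wj = np.array([[1.0, w, 0.0, 0.0],
                           [0.0, 0.0, 1.0, w]])                  # macro design matrix
            Dinv = np.linalg.inv(tau + Vj)                       # Delta_j^-1
            A += Wj.T @ Dinv @ Wj
            b += Wj.T @ Dinv @ theta_hat
            per_school.append((theta_hat, Vj, Wj, Dinv))

        gamma_star = np.linalg.solve(A, b)                       # equation (3.13)
        D_gamma = np.linalg.inv(A)                               # equation (3.15)

        theta_star = []
        for theta_hat, Vj, Wj, Dinv in per_school:
            Lambda_j = tau @ Dinv                                # shrinkage "reliability" matrix
            theta_star.append(Lambda_j @ theta_hat +
                              (np.eye(2) - Lambda_j) @ (Wj @ gamma_star))  # equation (3.12)
        return gamma_star, D_gamma, theta_star

    tau_true = np.array([[0.2, 0.0], [0.0, 0.1]])
    # `data` comes from generate_hlm_data() in the earlier sketch
    gamma_star, D_gamma, theta_star = known_variance_estimates(data, tau_true, sigma2=1.0)
    print("gamma* =", np.round(gamma_star, 3))

The per-school loop makes the weighting explicit: schools whose OLS estimates are imprecise (large V_j) contribute less to γ* and are shrunk more strongly toward W_j γ*.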
Similarly, they generated maximum likelihood estimates for unknown variance-covariance compo- nents via EM akgorithm (Dempster, et al., 1977), and then replaced the true parameter values in their model by these estimates. As Harville (1977: 320) points out, ".... except in relatively simple settings (cases), the computation of maximum likelihood estimates requires the numerical solution of a constrained non-linear optimization problem". For unbalanced data maximum likelihood estimates of variance components are not available in closed form and one has to resort to iterative solutions to obtain them. A variety of numerical approaches to maximum likelihood estimation of variance-covariance components are available. Among them EM algorithm is specially gaining prominence. Dempster, et al. (1977), review many areas where the EM algorithm has successfully been applied, or has potential applications. These include missing value situations, aplli- cations to grouped, censored or truncated data, variance 33 component estimations, iteratively reweighted least squares, fixed mixture models, hyperparameter estimation and factor analysis. They also derive theorems showing the monotonic behavior of the likelihood function and the convergence of the algorithm. some of the applications include Aitken, et al. (1981), Dempster, et al. (1981), Laird and Ware (1982), Mason, et al. (1984), Raudenbush and Bryk (1986), Rubin (1980), and Sternio, et al. (1983). Other numerical approaches to maximum likelihood estimation of covariance components are the iterative generalized least squares (Goldstein, 1986) and the Fisher scoring method (Longford, 1985; de Leeuw and Kreft, 1986). All these three iterative methods avoid the inversion of large matrices. Thus, they are computationally more feasible than Newton-Raphson, which requires inversion of large matrices at each iteration. S.J. Haberman, one of the discussants of the paper by Dempster et al. (1977), pointed out that the numerical stability and simplicity of implementation of the EM algorithm are in its favor. The Newton-Raphson and scoring algorithms are not especially difficult to implement. However, convergence of the EM algo- rithm is often slow. In contrast, the Newton-Raphson and scoring method are superior from the point of view of rate of convergence near a maximum since they converge quadrati- cally rather than linearly. However, they do not have the property of always increasing the likelihood, and can in some instances move toward a local minimum. Consequently, 34 the choice of starting value may be more important under Newton-Raphson and scoring method (Dempster, et al., 1977). The £23 3.9; EM Estimation The EM algorithm of Dempster, et al. (1977) provides an iterative method of finding the maximum likelihood variance estimates. The EM algorithm is a very general method for finding maximum likelihood estimates. . In the variance estimation situation, the EM algorithm alternates two steps in each iteration. The E ("expectation") step finds the posterior expectation of the sufficient statistics based on the complete data (in our case y, e ) given the observed data (in our case y) and given current estimates of parame- ters (in our case r and o; ). The M ("maximizing") step then uses the expected sufficient statistics to produce new ML parameter estimates of variance components. Each step of the EM algorithm increases the likelihood. This sequence of alternate steps guarantees convergence to a local maximum of the likelihood function. 
If the data are normally distributed, the local maximum will also be the absolute maximum, since the normal likelihood is unimodal. One difficulty with the EM algorithm is that it may require many iterations to converge (Sternio, 1981). Thus, it is a slow process of maximum likelihood estimation, particularly with poor starting values (Mason, et al., 1984). Nonetheless, in favor of the EM algorithm are simplicity of implementation and numerical stability.

The process of the EM algorithm, along with the computational details, is provided by Dempster, et al. (1981) for a special version of the model considered in this research. They considered the model in which there are no macro predictors (covariates) related to the micro parameters, i.e., W_j = I, and in which the \Sigma_j have the special form \Sigma_j = \sigma^2 I_{n_j}, where \sigma^2 is equal across all individuals. Sternio (1981) has broadened this approach to include estimation of \tau and \sigma^2 in more general cases and provides a unified discussion of theory and computation in such cases. Bryk, Raudenbush, Seltzer, and Congdon (1987) have extended this approach even further to the general mixed model, in which the assumption of full rank of the within-group predictor matrix X_j and the assumption that the micro parameters are random are no longer required. Hence, relaxing these two restrictive assumptions broadens the range of application of the model (Braun, et al., 1983; Rubin, 1983).

To illustrate the logic of the EM algorithm, consider the simple conditional univariate HLM prescribed in equations (3.7) and (3.11). The logic of EM works like this. First, assume that \tau and \sigma_j^2 are known. Equations (3.12) through (3.15) provide posterior means and dispersions of \gamma and \beta_j. Next, suppose that \gamma and \beta_j were known, i.e., R_j and U_j had been observed, and we want to estimate \tau and \sigma_j^2. It can be shown easily that the following two equations, (3.16) and (3.17), are maximum likelihood estimates for \tau and \sigma^2, respectively:

\hat\tau = K^{-1} \sum_j U_j U_j' ,    (3.16)

\hat\sigma^2 = \bigl(\sum_j n_j\bigr)^{-1} \sum_j R_j' R_j .    (3.17)

The EM algorithm utilizes the dependence of the estimators \beta_j^* and \gamma^* on knowledge of the dispersion matrices, and the dependence of the ML estimators of these matrices on knowledge of \beta_j and \gamma, via an iterative process with the following steps:

(1) Generate reasonable starting values for the unknown variances, \sigma^2 and \tau. Perhaps, as suggested by Raudenbush (1988), the within-group and between-group residuals from ordinary least squares regression can be used.

(2) These starting values are substituted into equations (3.12) and (3.13), yielding starting values of \beta_j^* and \gamma^*.

(3) To derive new estimates for \tau and \sigma^2, substitute for the sufficient statistics \sum_j U_j U_j' and R_j' R_j in equations (3.16) and (3.17) their posterior expectations. These posterior expectations are derived by Dempster, et al. (1981) and are as follows:

E\{R_j' R_j \mid Y\} = (Y_j - X_j \beta_j^*)'(Y_j - X_j \beta_j^*) + tr(X_j' X_j D_{\beta_j}) ,

E\{\sum_j U_j U_j' \mid Y\} = \sum_j U_j^* U_j^{*\prime} + \sum_j D_{\beta_j} ,

where U_j^* = \beta_j^* - W_j \gamma^* and D_{\beta_j} is the posterior dispersion given in equation (3.14).

(4) The new estimates of \sigma_j^2 and \tau are then used in a repetition of step 2 to yield new values for \beta_j^* and \gamma_j^*. The process iterates until the estimates converge to any degree of accuracy required. The estimated variance components after the final iteration are then the maximum likelihood estimates of the variances, conditional on the values of the structural parameters (i.e., regression coefficients). The proof of convergence to the maximum likelihood estimates is given by Dempster, et al. (1977).
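The following sketch spells out one iteration of steps (2) and (3) for the conditional univariate HLM, again assuming NumPy. It is an illustration of the logic, under the simplifying assumption of a common sigma^2 across groups; it is not a reproduction of the HLM program's internals.

```python
import numpy as np

def em_iteration(Y, X, W, tau, sigma2):
    """One EM iteration for the conditional HLM (steps 2-3 above).

    Y : list of (n_j,) outcome vectors       X : list of (n_j, p) micro design matrices
    W : list of (p, q) macro design matrices tau : (p, p) current T   sigma2 : current sigma^2
    Returns the updated (tau, sigma2).
    """
    K, p = len(Y), tau.shape[0]

    # Per-group OLS estimates and their sampling dispersions V_j = sigma^2 (X_j'X_j)^{-1}.
    beta_hat, V = [], []
    for j in range(K):
        XtX_inv = np.linalg.inv(X[j].T @ X[j])
        beta_hat.append(XtX_inv @ X[j].T @ Y[j])
        V.append(sigma2 * XtX_inv)

    # Posterior means (eqs. 3.12-3.13) and dispersions (3.14-3.15).
    A = sum(W[j].T @ np.linalg.inv(V[j] + tau) @ W[j] for j in range(K))
    b = sum(W[j].T @ np.linalg.inv(V[j] + tau) @ beta_hat[j] for j in range(K))
    D_gamma = np.linalg.inv(A)
    gamma = D_gamma @ b

    tau_new, ss_resid, N = np.zeros_like(tau), 0.0, 0
    for j in range(K):
        Lam = tau @ np.linalg.inv(tau + V[j])
        IL = np.eye(p) - Lam
        beta_star = Lam @ beta_hat[j] + IL @ (W[j] @ gamma)
        D_beta = Lam @ V[j] + IL @ (W[j] @ D_gamma @ W[j].T) @ IL.T   # eq. 3.14
        # E-step expectations of the complete-data sufficient statistics.
        U_star = beta_star - W[j] @ gamma
        tau_new += np.outer(U_star, U_star) + D_beta
        resid = Y[j] - X[j] @ beta_star
        ss_resid += resid @ resid + np.trace(X[j].T @ X[j] @ D_beta)
        N += len(Y[j])

    # M step (eqs. 3.16-3.17 with the expectations substituted).
    return tau_new / K, ss_resid / N
```

Iterating this function until tau and sigma2 stop changing (to the .0001 criterion used later in the study, for instance) reproduces the alternation described in steps (1) through (4).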
Effects of Having to Estimate Variance Components

Best linear unbiased estimators of the fixed and random effects (i.e., macro and micro parameters, respectively) of mixed linear models are available when the true values of the variance components are known. If the true values are replaced by estimated values, the mean squared errors of the estimators of the macro and micro parameters increase in size (Kackar and Harville, 1984). Clearly the magnitude of this increase is unknown to us. Another problem resulting from this situation is that the parametric family of the distribution of the micro and macro parameter estimates will remain unknown. Thus, any statistical inference concerning these parameters, if not impossible, will be inaccurate. Fortunately, we can use large sample theory to find asymptotic distributions of the macro parameter estimates. But finding an analogous sampling distribution for the micro parameters is not possible (Dempster, et al., 1981), because we cannot simultaneously maximize the joint likelihood function of all four parameters (\beta, \gamma, \tau and \sigma_j^2). The data simply will not support the estimation of so many parameters. But the focus of the present research is on the effect of variance estimation on inferences about macro parameters.

Of course, when the variance components are unknown, substituting their maximum likelihood estimates \hat\tau and \hat\sigma_j^2 in the definition of \Delta_j, and then estimating \gamma^* by replacing \hat\Delta_j for \Delta_j in equation (3.13), is a natural idea. That is, we follow the empirical Bayes approach of first deriving Bayesian estimates based on known variances and then substituting ML estimates for the unknown variances in the estimation formulas. The resulting empirical Bayes estimator \gamma^* is a true maximum likelihood estimator. Therefore, this estimate shares the desirable properties of maximum likelihood estimators. But maximum likelihood estimators rely on large sample theory. According to large sample theory we know that:

1. \gamma^* = (\sum_j W_j' \hat\Delta_j^{-1} W_j)^{-1} \sum_j W_j' \hat\Delta_j^{-1} \hat\beta_j is the maximum likelihood estimate of \gamma if \hat\Delta_j is the maximum likelihood estimate of \Delta_j. This is the case since functions of ML estimates are ML estimates of the same functions of the parameters.

2. Under regularity conditions the large sample distribution of the ML estimate \gamma^*, for K \to \infty with the n's fixed, is

(\gamma^* - \gamma) \sim N\bigl(0, (\sum_j W_j' \Delta_j^{-1} W_j)^{-1}\bigr),

where (\sum_j W_j' \Delta_j^{-1} W_j)^{-1} is the Cramer-Rao lower bound for the covariance matrix of \gamma^*. But for n, K \to \infty with n_j/N fixed,

(\sum_j W_j' \Delta_j^{-1} W_j)^{-1} = (\sum_j W_j' T^{-1} W_j)^{-1} ,

since \Delta_j^{-1} = (V_j + T)^{-1} = (\sigma^2 (X_j'X_j)^{-1} + T)^{-1} \to T^{-1} as \sigma^2 (X_j'X_j)^{-1} \to 0. Thus \gamma^* is indeed asymptotically efficient.

It is clear that we can use the asymptotic distribution of \gamma^* for confidence intervals and hypothesis testing:

(\gamma^* - \gamma) \;\overset{ASY}{\sim}\; N\bigl(0, (\sum_j W_j' \Delta_j^{-1} W_j)^{-1}\bigr)

or

(\gamma_h^* - \gamma_h) / S.E.(\gamma_h^*) \;\overset{ASY}{\sim}\; N(0, 1),

where the subscript h refers to the elements in the \gamma vector, i.e., (\gamma_{00}, \gamma_{01}, \gamma_{10}, \gamma_{11}). Thus, even though the estimates of the macro parameters are numerically computed, their large sample properties are well defined, which facilitates large sample hypothesis testing and interval estimation. The EM algorithm yields estimates of the dispersion matrices \tau and \Sigma which, in conjunction with \gamma^*, maximize the marginal density of Y. In other words, the EM estimates of T and \Sigma (i.e., the maximum likelihood estimates of T and \Sigma), when substituted into equation (3.13), make \gamma^* a true maximum likelihood estimator.
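The standard normal form of the test statistic above translates directly into a z test. The short sketch below is an illustration of that computation, assuming NumPy and SciPy; the function name and interface are illustrative, not part of the original analysis routines.

```python
import numpy as np
from scipy.stats import norm

def z_tests(gamma_star, D_gamma, gamma_null=None):
    """Large-sample z statistics and two-sided p-values for each macro
    parameter, using the estimated dispersion matrix from eq. (3.15)."""
    gamma_star = np.asarray(gamma_star, dtype=float)
    if gamma_null is None:
        gamma_null = np.zeros_like(gamma_star)       # H0: gamma_h = 0
    se = np.sqrt(np.diag(D_gamma))                   # estimated standard errors
    z = (gamma_star - gamma_null) / se
    p = 2 * (1 - norm.cdf(np.abs(z)))
    return z, p
```

These are exactly the statistics whose empirical type I error rates and power are examined in the results chapter.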
These asymptotic properties of ML estimates are of value only if there is reason to believe that the data are extensive enough that the properties hold. For these properties to hold exactly it is sufficient that the number of groups (K in our case) approaches infinity (Miller, 1977). However, it would be interesting to observe the behavior of the estimates as K and n (i.e., the number of individuals within each group) each increase to infinity. This does not imply that K and n be of the same order of magnitude (Miller, 1977).

In the present research, the main question we set out to investigate concerns the small sample behavior of the macro estimators (i.e., \gamma_{00}, \gamma_{01}, \gamma_{10}, and \gamma_{11}). The purposes of this study are three.

(1) To check on the EM algorithm, we can look at the properties of the macro estimators and make sure that the algorithm behaves as expected; that is, that the macro estimators are consistent, unbiased, asymptotically efficient, and have a known asymptotic normal distribution. It is also worthwhile to look at how well the EM algorithm does at estimating the variance components. Again, this concerns the bias, consistency and asymptotic efficiency of these estimators. A side concern with this algorithm is its rate of convergence under varying combinations of K and n. This question is addressed by examining the total number of iterations prior to convergence to the ML estimates.

(2) Investigating the effect of variance estimation on inferences about macro parameters with respect to both robustness and power.

(3) By constructing data sets that differ in K and n, we investigate how different combinations of K and n affect the properties of the macro estimators as well as the inferences about them.

Specific questions of interest concentrate on estimation and hypothesis testing. The key issue in the estimation phase concerns the bias, consistency and efficiency of the macro estimators, \gamma, and the effect of different combinations of K and n on these properties. For hypothesis testing, interest centers on type I errors and power.

CHAPTER IV
METHOD

The procedures employed in the study to answer the research questions presented in the previous chapter will now be discussed. The chapter begins by presenting the standardized two-stage hierarchical linear model. Next, a description of the population parameters and the manner in which they were chosen will be given. In the third section details are presented about the computer routine utilized to generate the data. The fourth section looks at the analysis routines. Finally, the measures of bias, consistency, efficiency, type I errors and power will be described.

Standardized Two-Stage Hierarchical Linear Model

This model, which is a special case of the two-stage HLM presented in the preceding chapter, is adopted for generating data in the present research. The standardized HLM takes exactly the same form as equations (3.1) through (3.5) for the unconditional and conditional cases, but with somewhat different assumptions. That is, the micro and macro predictors are both assumed to be standardized normal variables with mean zero and variance one. Clearly, this reduction in the number of unknown parameters simplifies the data generation process.

Within-School Model

Y_{ij} = u_j + \beta_j X_{ij} + R_{ij} ,    with R_{ij} \sim N(0, 1).

Between-School Model (unconditional)

u_j = \bar{u} + U_{0j} ,
\beta_j = \bar{\beta} + U_{1j} ,

with U_{0j} \sim N(0, \tau_u), U_{1j} \sim N(0, \tau_\beta), and cov(U_{0j}, U_{1j}) = \tau_{u\beta}.
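For intuition, the within-school and unconditional between-school equations above can be simulated directly. This is a naive individual-level sketch assuming NumPy (the study itself generates group-level sufficient statistics rather than individual scores, as described later, and the covariance tau_u_beta is taken as zero here as it is in the standardized model below).

```python
import numpy as np

rng = np.random.default_rng(1988)   # illustrative seed, not the one used in the study

def simulate_school_unconditional(n, u_bar, b_bar, tau_u, tau_b):
    """Draw one school of size n from the standardized within-school model
    and the unconditional between-school model shown above."""
    u_j = u_bar + rng.normal(0.0, np.sqrt(tau_u))    # U_0j ~ N(0, tau_u)
    b_j = b_bar + rng.normal(0.0, np.sqrt(tau_b))    # U_1j ~ N(0, tau_beta)
    X = rng.standard_normal(n)                       # X_ij ~ N(0, 1)
    R = rng.standard_normal(n)                       # R_ij ~ N(0, 1)
    Y = u_j + b_j * X + R
    return X, Y
```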
Between-School Model (conditional)

u_j = \gamma_{00} + \gamma_{01} W_j + U_{0j} ,
\beta_j = \gamma_{10} + \gamma_{11} W_j + U_{1j} ,

with U_{0j} \sim N(0, \tau_{u|W}), U_{1j} \sim N(0, \tau_{\beta|W}), and cov(U_{0j}, U_{1j} \mid W) = \tau_{u\beta|W}.

Further we assume that the micro and macro predictors are each unit normal random variables, i.e.,

X_{ij} \sim N(0, 1) ,   W_j \sim N(0, 1) ,

and that X_{ij}, R_{ij}, U_{0j} and U_{1j} are mutually independent. This implies that cov(U_{0j}, U_{1j}) = 0, and that the dispersion matrix T is diagonal (but we will still investigate estimates of this covariance between macro level errors).

In order to generate data we need to define the following parameters:

1. c = \tau_\beta / \tau_u, so \tau_\beta = c\,\tau_u.    (4.1)

2. \bar\beta^2 = \bar\rho_{xy}^2 (\bar\sigma_y^2 / \bar\sigma_x^2), where \bar\beta is the pooled within-group slope, \bar\rho_{xy} is the pooled within-group correlation coefficient, \bar\sigma_y^2 is the pooled within-group (unconditional) variance in Y, and \bar\sigma_x^2 is the pooled within-group variance in X. But \bar\sigma_x^2 = 1, and

\bar\sigma_y^2 = \tau_\beta + \bar\beta^2 + var(R) ,    (4.2)

so that

\bar\beta^2 = \bigl(\bar\rho_{xy}^2 / (1 - \bar\rho_{xy}^2)\bigr)(c\,\tau_u + 1).    (4.3)

3. d = \tau_u / (\tau_u + \bar\sigma_y^2),    (4.4)

where d is the intraclass correlation of Y. By substituting expression (4.3) into expression (4.2) we get

\bar\sigma_y^2 = c\,\tau_u + \bigl(\bar\rho_{xy}^2 / (1 - \bar\rho_{xy}^2)\bigr)(c\,\tau_u + 1) + 1 ,

and by substituting \bar\sigma_y^2 into (4.4) and solving for \tau_u we have

\tau_u = d / \bigl((1 - d)(1 - \bar\rho_{xy}^2) - c\,d\bigr).    (4.5)

This expression implies that the larger the intraclass correlation, the larger the parameter variance \tau_u, and that the larger the pooled within-group correlation, the smaller the \tau_u. That is, \tau_u is directly related to d, but inversely to \bar\rho_{xy}^2. But \tau_u and \tau_\beta are both positive quantities; therefore c is constrained to be in the following range of values:

0 < c < \bigl((1 - d)/d\bigr)(1 - \bar\rho_{xy}^2).    (4.6)

Also the conditional parameter variances in intercept and slope are

\tau_{u|W} = \tau_u (1 - \rho_{uW}^2) ,    (4.7)
and
\tau_{\beta|W} = \tau_\beta (1 - \rho_{\beta W}^2) , respectively.    (4.8)

The above specification of the standardized hierarchical linear model reduces to five parameters. These parameters are c, d, \rho_{xy}, \rho_{uW} and \rho_{\beta W}; if they are predetermined, one can generate a large number of samples under these known population parameters and investigate the properties of the resulting statistics (i.e., point estimates and their standard errors) by observing their sampling distributions.

Parameters of the Study

In order to investigate the small sample properties of the macro parameters (i.e., \gamma_{00}, \gamma_{01}, \gamma_{10} and \gamma_{11}), two more parameters need to be added to the list of five model parameters previously mentioned. These two parameters are the number of groups, K, and the group size, n. This brings the number of parameters considered in the present study to a total of seven (K, n, d, \rho_{xy}, \rho_{uW}, \rho_{\beta W}, and c).

The first three parameters (K, n and d) are especially of great concern in the present study because of their significant implications for sampling and the design of a study. In two-stage random sampling (or two-stage cluster sampling, using sampling design terminology), the coefficient of intraclass correlation (d) measures the homogeneity of the elements within clusters. For a fixed total sample size of N = nK, the larger the intraclass correlation, the larger the number of groups (K) and the smaller the number of individuals within groups that need to be sampled for optimum efficiency in design given fixed cost. In contrast, the smaller the d, the fewer the number of groups and the larger the number of individuals within groups, the better the precision (Kish, 1983). However, to consider asymptotic properties of macro estimators it is sufficient that only K converges to infinity (Miller, 1977).
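Equations (4.1) through (4.8) fully determine the variance components once the five standardized parameters are fixed. A small sketch, assuming NumPy, that solves these equations (the function name and the example cell are illustrative):

```python
import numpy as np

def model_variances(c, d, rho_xy, rho_uw, rho_bw):
    """Variance components implied by eqs. (4.1)-(4.8)."""
    denom = (1 - d) * (1 - rho_xy**2) - c * d
    assert denom > 0, "c violates the constraint in eq. (4.6)"
    tau_u = d / denom                                                  # eq. (4.5)
    tau_b = c * tau_u                                                  # eq. (4.1)
    beta_bar = np.sqrt(rho_xy**2 / (1 - rho_xy**2) * (c * tau_u + 1))  # eq. (4.3)
    tau_u_w = tau_u * (1 - rho_uw**2)                                  # eq. (4.7)
    tau_b_w = tau_b * (1 - rho_bw**2)                                  # eq. (4.8)
    return tau_u, tau_b, beta_bar, tau_u_w, tau_b_w

# Example: all five factors at the low values used later in the design.
print(model_variances(c=0.10, d=0.10, rho_xy=0.25, rho_uw=0.25, rho_bw=0.25))
```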
Accordingly, we may occasionally define the population solely in terms of levels of K, on other occasions in terms of varying combinations of K and n, and still on some occasions redefine it in terms of all three parameters. Now the values assigned to each of these parameters will be given.

(A) Number of Groups, K: Small, moderate and large numbers of groups, with K = 10, 30, 60 and 150, are simulated in this study.

(B) Group Size, n: Situations with n = 5, 25, 60 and 150 in each group are simulated. In deciding the values of K and n, the main concern was to select those values that provide reasonable grounds to investigate the small sample properties of the macro parameters of interest. The other concern was to have reasonable coverage of those combinations of n and K which occur in real research situations. These realistic situations include: 1) the study of growth models (Bock, 1983; Goldstein, 1986; Laird and Ware, 1982; and Sternio, et al., 1983), in which K is small and n ranges from small to moderate to large; 2) school effects research (Aitken and Longford, 1986; Raudenbush and Bryk, 1986), where K is moderate or large and n is either small or moderate; and 3) sociological and contextual research (Mason, et al., 1984; and Wong and Mason, 1985), in which K is small to moderate and n is large.

(C) Intraclass Correlation of Y, d: Two values of d = .10 and .25 are considered. These two values appear to be of reasonable magnitude on the following grounds. The intraclass correlation of Y may be large if \tau_u is large compared with \bar\sigma_y^2, and zero only when \tau_u = 0, that is, when there is no variation in the outcome variable among schools in the population of schools, which will rarely happen in practice (see expression 4.4). But as a general rule, intraclass correlations in educational research are small positive values, mostly under .15 (Kish, 1983). The range of values was chosen to reflect the values often obtained from educational field research. In the school effects research conducted by Raudenbush and Bryk (1986) the actual value of the intraclass correlation was .177; the midpoint of the values considered in this study is .175.

(D), (E) and (F) Correlation Coefficients \rho_{xy}, \rho_{uW} and \rho_{\beta W}: For each of these correlation coefficients two values are considered. These values, which are considered to be of moderate and almost high magnitude (considering educational field research data), are .25 and .75.

(G) Ratio of \tau_\beta to \tau_u, c: Two values of c = .10 and .50 are considered in this study. Both fall within the range of permissible values for c given by expression (4.6). Two interrelated factors have affected the selection of these two values. First, as a general rule, regression coefficients have considerably greater sampling variability than sample means (Burstein and Miller, 1980; Wiley, 1970). Mathematically, the total variability in intercepts and slopes can be decomposed into two parts, parameter variance and sampling variance. Logically, it follows that the parameter variance in intercepts is of larger magnitude than that in slopes. Second, in many applications one would expect much of the observed variation in slopes to be sampling variation. For example, in the school effects research conducted by Raudenbush and Bryk (1986), which utilized a sample of 10,231 students in 176 schools, student samples per school ranged from 10 to 70, samples of less than 45 were rare, and the value of c was equal to .10.
Consequently, this value is chosen in this study to act as a baseline, and will be compared to a less realistic but certainly not impossible larger value of c i.e., .50. __Design o_f m eta—av Considering the number of factors (total of seven) and number of levels in each factor (K and n each have 4 levels, and the remaining 5 factors each have 2 levels), if we were to include all factor combinations in our study, we would 5 2 have a ( 2 x 4 ) design matrix with a total of 512 possible 50 cells, which is unmanagable given the large cost of implementing the EM algorithm. As a practical alternative, this study adopted a fractional factorial design by which only a fraction of factor combinations of a complete fractional design will be considered. Specifically, this study has adopted a "one- half" randomized block fractional factorial (RBFF). Kirk (1968:386-87) made the following comment concerning fractional factorial designs: "the use of a fractional design can lead to a sizeable reduction in the number of treatment combinations that must be included in a study. This is accomplished by confounding main effects with higher order interactions ..... however, if certain information concerning the outcome of the experiment is of negligible interest, an experimenter can employ confounding so as to sacrifice only this information." As a result of treatment-interaction confounding, considerable ambiguity may exist in interpreting the results of such experiments. This is the case since every sums of squares can be given two or more designations referred to as "aliases". To minimize this ambiguity, careful attention must be given to the alias pattern of a proposed design. Treatments are customarily aliased with next to highest- order interactions which can be assumed to equal zero. This is accomplished by using the highest-order interaction as the "defining contrast" which is used to divide the treat- ment combination into two blocks. The higher order interac- 51 tions are then pooled to form a residual error term. "If these pooled interactions are insignificant, a complete factorial design would have been a better design choice for the data than the fractional factorial design. On the other hand, if some of the interactions are significant, the present analysis (i.e., fractional factorial design) offers the advantage of a larger number of degrees of freedom for experimental error and a within-all error term" (Kirk, 1968: 394). Designs with mixed treatments (or factors), i.e., having unequal number of levels, present special problems with respect to layout and analysis (see Kempthorne, 1952: 419). But this is the case in the present study which contains mixed treatments of the form 25 x 42 design. As a reasonable alternative this study, adopted a RBFF- 25 design for the five factors with two levels (i.e., c , d , p , xy puw , and 08w ), and to compensate for the two remaining factors, K and n each with four levels, every two blocks of the design layout of RBFF- 25 was crossed with different level combinations of K and n. Next, steps involved for laying out one-half replication of a type RBFF- 25 fractional factorial will be discussed. (1) Choose a defining contrast. Following Kirk's guideline the highest order interaction (i.e., five order interaction) is chosen in this study as the defining contrast. The 32 treatment combinations ( 25 = 32 ) of a complete factional design can be reduced to one-half of that by the use of the 52 defining contrast. 
(2) Confound an interaction with between-block variation. The interaction which serves as the confounding interaction must be insignificant and also different from the defining contrast. For this purpose the interaction between \rho_{uW} and \rho_{\beta W} is chosen, which is thought to be insignificant. For confounding an interaction with blocks, see Kirk (1968, Chapter 9). As a result of this process, the 16 treatment combinations are assigned to two blocks of eight combinations each. For simplicity the five factors are assigned the following notations:

A = \rho_{uW}   B = \rho_{\beta W}   C = \rho_{xy}   D = d   E = c

If (ABCDE) is used as the defining contrast and (AB) as the confounding interaction, the design shown in Table 4.1 will be obtained. Levels of each factor are denoted as zero and one, where zero corresponds to the low value and one to the high value. All treatments and interactions except AB (the confounding interaction), its alias CDE, and the defining contrast ABCDE are within-block effects. All main effects are aliased with four-factor interactions. The alias pattern for this design appears in Table 4.2.

[Table 4.1 (layout of the one-half 2^5 design in two blocks) and Table 4.2 (alias pattern) are printed sideways in the original scan and their entries are not legible here.]

A careful examination of the alias pattern in Table 4.2 reveals an interesting feature of this one-half fractional factorial design. The incomplete five-treatment design contains all of the treatment combinations of a complete four-treatment design. This implies that the computational procedures for a one-half replication of a 2^5 design are identical to those for a complete replication of a 2^4 design. That is, by ignoring one of the treatments, the analysis of an incomplete design can be carried out as if all the treatment combinations were included in the experiment. The choice of which treatment to ignore is arbitrary (Kirk, 1968).

As mentioned earlier, different combinations of K and n will be crossed with the blocks contained in the RBFF-2^5 design. There are a total of 4^2 = 16 different combinations of K and n, grouped into eight "trials" for convenience. Within each trial the first K by n level combination will be crossed with "block 0" of the RBFF design and the second K by n level combination will be crossed with "block 1". Table 4.3 contains all the different combinations of K and n and their designated block, along with the trial number. By using a one-half RBFF-2^5 design and by crossing this design with particular combinations of K and n, a total of 128 (16 cells x 8 trials = 128) treatment combinations will result. Notice that although levels of K are crossed with both blocks of the design matrix, levels of n are not.
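Because Table 4.1 is not legible in the scan, the following sketch illustrates the kind of construction the text describes: a one-half fraction of the 2^5 layout obtained from the defining contrast ABCDE, with the AB interaction confounded with blocks. Which of the two halves the dissertation actually used cannot be read from the table, so the choice of half below is illustrative only (Python with the standard library).

```python
from itertools import product

def half_fraction_blocks():
    """One illustrative half of the 2^5 design (ABCDE defining contrast),
    split into two blocks by confounding AB with between-block variation."""
    cells = []
    for a, b, c_, d, e in product((0, 1), repeat=5):
        # Keep one of the two halves defined by ABCDE (even parity here);
        # the other half of the 32 combinations is discarded.
        if (a + b + c_ + d + e) % 2 == 0:
            block = (a + b) % 2          # AB confounded with blocks
            cells.append({"A": a, "B": b, "C": c_, "D": d, "E": e, "block": block})
    return cells

cells = half_fraction_blocks()
print(len(cells), "cells;", sum(c["block"] == 0 for c in cells), "in block 0")
# 16 cells; 8 in block 0
```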
[Table 4.3 (combinations of K and n assigned to each trial and block) and Table 4.4 (actual values of the macro parameters in each design cell) are printed sideways in the original scan and their entries are not legible here.]

Specifically, samples of size 5 and 60 occur only in "block 0" and samples of 25 and 150 only in "block 1". Thus, each trial consists of either the two lowest levels of n, 5 and 25 (call it n'), or the two highest levels, 60 and 150 (call it n"). As a result of this design, with sixteen varying combinations of factors A, B, C, D and E, we obtain sixteen different parameter values for \gamma_{01} and \gamma_{11}, as shown in Table 4-4. With regard to \gamma_{10}, the total number of parameter values reduces to half of this size, since \gamma_{10} is defined only in terms of C, D and E; thus it is not affected by the high and low values of A and B. Irrespective of the design, \gamma_{00} is pre-fixed at zero.

Description of the Generation Routine

In generating data the present study makes use of five sufficient statistics. Generally speaking, sufficient statistics are useful in that they reduce the number of observations, say from n to r statistics (where r <= n). This is because these r statistics contain all the "information" about \theta (i.e., the parameters of the study) that the n observations contain (Graybill, 1976). If r is appreciably less than n, as it is in the present study (here, five statistics per group regardless of the group size n), then the very fact that we have to consider only r, rather than n, simplifies our data generation routine. The five sufficient statistics are \Sigma X, \Sigma X^2, \Sigma R, \Sigma R^2, and \Sigma XR. Assume that

X \overset{iid}{\sim} N(0, 1) ,   R \overset{iid}{\sim} N(0, 1) ,   and \rho_{XR} = 0

(i.e., the population correlation coefficient is zero). The generation procedure is composed of the following steps:

(1) Generate \Sigma X_j \sim N(0, n).

(2) First generate \Sigma (X_j - \bar X)^2 \sim \chi^2_{(n-1)} and then compute \Sigma X_j^2 = \Sigma (X_j - \bar X)^2 + (\Sigma X_j)^2 / n.

(3) Generate \Sigma R_j \sim N(0, n).

(4) First generate \Sigma (R_j - \bar R)^2 \sim \chi^2_{(n-1)} and then compute \Sigma R_j^2 = \Sigma (R_j - \bar R)^2 + (\Sigma R_j)^2 / n.

(5) To generate \Sigma XR, first generate t with (n - 2) degrees of freedom, then compute r = t / (t^2 + n - 2)^{1/2}, and finally compute \Sigma X_j R_j = (n - 1)\, r\, S_X S_R + (\Sigma X_j)(\Sigma R_j)/n, where S_X and S_R are the sample standard deviations obtained from steps (2) and (4). (Note: t = Z / (\chi^2_{(n-2)} / (n - 2))^{1/2}.)

After completing the steps involved in generating the sufficient statistics, we can actually compute \Sigma Y, \Sigma Y^2 and \Sigma XY. But before doing so we need to generate three more random variables which are contained in the conditional between-group model. These random variables, which are part of the expressions for u_j and \beta_j and thus part of Y_{ij}, are W_j, U_{0j} and U_{1j}. Also we need to assign values to the four macro parameters of interest. Assignment of values to the slopes \gamma_{01} and \gamma_{11} is accomplished through the following expressions:

\gamma_{01} = \rho_{uW} \sqrt{\tau_u} ,   \gamma_{11} = \rho_{\beta W} \sqrt{\tau_\beta} ,

while \gamma_{00} is assumed equal to zero, and \gamma_{10} is assumed equal to \bar\beta. Now proceed with the steps in the generation routine.

(6) Generate W_j \sim N(0, 1).

(7) Generate U_{0j} \sim N(0, \tau_{u|W}).

(8) Generate U_{1j} \sim N(0, \tau_{\beta|W}).

Also notice that prior to generating U_{0j} and U_{1j} we need to assign values to the five model parameters, i.e., c, d, \rho_{xy}, \rho_{uW} and \rho_{\beta W}. The final step in the generation routine is to compute \Sigma Y, \Sigma Y^2 and \Sigma XY.

(9) Compute

\Sigma Y = \Sigma (u_j + \beta_j X_{ij} + R_{ij}) = n u_j + \beta_j \Sigma X_{ij} + \Sigma R_{ij} .

Compute

\Sigma Y^2 = \Sigma (u_j + \beta_j X_{ij} + R_{ij})^2 = n u_j^2 + \beta_j^2 \Sigma X_{ij}^2 + \Sigma R_{ij}^2 + 2 u_j \beta_j \Sigma X_{ij} + 2 u_j \Sigma R_{ij} + 2 \beta_j \Sigma X_{ij} R_{ij} .

Compute

\Sigma XY = u_j \Sigma X_{ij} + \beta_j \Sigma X_{ij}^2 + \Sigma X_{ij} R_{ij} .
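A compact sketch of steps (1) through (9), assuming NumPy. The original program was written in Fortran with IMSL subroutines (described below), so the generator, seed, and function names here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2024)   # illustrative seed, not the one used in the study

def group_sufficient_statistics(n):
    """Steps (1)-(5): the five within-group sufficient statistics for a group
    of size n, drawn from their sampling distributions under X, R iid N(0,1)
    with population correlation zero."""
    sum_x = rng.normal(0.0, np.sqrt(n))              # step (1): Sum X ~ N(0, n)
    css_x = rng.chisquare(n - 1)                     # step (2): centered SS ~ chi-square(n-1)
    sum_x2 = css_x + sum_x**2 / n
    sum_r = rng.normal(0.0, np.sqrt(n))              # step (3)
    css_r = rng.chisquare(n - 1)                     # step (4)
    sum_r2 = css_r + sum_r**2 / n
    t = rng.standard_t(n - 2)                        # step (5): t with n-2 df
    corr = t / np.sqrt(t**2 + n - 2)                 # implied sample correlation r
    s_x, s_r = np.sqrt(css_x / (n - 1)), np.sqrt(css_r / (n - 1))
    sum_xr = (n - 1) * corr * s_x * s_r + sum_x * sum_r / n
    return sum_x, sum_x2, sum_r, sum_r2, sum_xr

def group_summaries(n, gamma, tau_u_w, tau_b_w):
    """Steps (6)-(9): draw the group-level random terms and assemble the
    outcome sums without ever forming individual scores."""
    g00, g01, g10, g11 = gamma
    sum_x, sum_x2, sum_r, sum_r2, sum_xr = group_sufficient_statistics(n)
    w = rng.standard_normal()                               # step (6): W_j ~ N(0,1)
    u_j = g00 + g01 * w + rng.normal(0, np.sqrt(tau_u_w))   # step (7)
    b_j = g10 + g11 * w + rng.normal(0, np.sqrt(tau_b_w))   # step (8)
    sum_y = n * u_j + b_j * sum_x + sum_r                   # step (9)
    sum_y2 = (n * u_j**2 + b_j**2 * sum_x2 + sum_r2
              + 2 * u_j * b_j * sum_x + 2 * u_j * sum_r + 2 * b_j * sum_xr)
    sum_xy = u_j * sum_x + b_j * sum_x2 + sum_xr
    return w, sum_x, sum_x2, sum_y, sum_y2, sum_xy
```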
The generation program completes each of the nine steps as one observation is formed. The number of replications chosen is five, so five such vectors Y comprise one sample. Thus, beginning with the first "trial", values of 10 and 5 will be assigned to K and n, respectively, of "block 0" in the RBFF-2^5 design, and similarly values of 10 and 25 to K and n of "block 1". Then, starting with cell one of the design layout, first the remaining five parameters will be assigned values (according to the zeros and ones), and then the nine data generation steps will be completed and repeated for five replications. This process will be repeated for each and every one of the 16 cells as one trial is completed. A total of 80 (16 cells x 5 observations) data points will be generated upon the completion of this trial. Next, we move to the second trial, assign values to K and n, and repeat the same cycle as in the first trial. This process continues until all eight trials are completed and a total of 640 (80 sample points in each trial x 8 trials) sample points is generated. In other words, 128 (16 cells x 8 trials) distinct samples, each containing five replications, will be generated.

Along with the generation of the sample points, the generation program will compute two indices of dispersion in the macro parameters. These indices are the mean squares within and the Cramer-Rao lower bound, which is [-E(\partial^2 \log L / \partial\gamma \partial\gamma')]^{-1} and equal to the asymptotic dispersion of \gamma^*, i.e., (\sum_j W_j' \Delta_j^{-1} W_j)^{-1}. The first analysis routine (i.e., the HLM program) accepts both raw data and summary statistics of the sample means and sample covariance matrix. Considering the efficiency of summary statistics, for each sample the mean and the covariance matrix are computed to be used as input in the analysis phase.

Monte Carlo Techniques

As recognized by Hammersley and Handscomb (1964), a Monte Carlo method is a general technique, with different areas of application, for solving a model by using random (or pseudo-random) numbers. One application is the generation of sampling distributions. Through repeated sampling under known population parameters, one can investigate the properties of estimators by observing their empirical (sampling) distributions. The present study is a Monte Carlo study aimed at generating sampling distributions of the macro estimators. These empirical distributions are then compared to the nominal distribution (in this case the normal distribution) obtained under asymptotic theory (i.e., when K and n converge to infinity). A Fortran program is used to generate a total of 640 sample points: five observations for each of 128 experimental conditions.

Random Number Generation

The use of random numbers is considered to be an integral part of a Monte Carlo study. Random numbers are of two types: purely random numbers and pseudo-random numbers. However, for a computer-based Monte Carlo study, purely random numbers are inefficient compared with pseudo-random numbers. There are two advantages in using pseudo-random numbers: (1) the computer itself can generate the sequence of numbers by applying an algorithm, and (2) the same sequence of numbers can be reproduced exactly for future use. Pseudo-random numbers are generated sequentially from a completely specified algebraic formula. At best they behave as if they are random (i.e., uniformly distributed and mutually independent). These algebraic formulas are devised in such a way as to resist any significant deviation from randomness.
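The nesting of trials, blocks, cells and replications described above (8 trials x 16 cells x 5 replications = 640 samples) can be summarized structurally as follows. This is a sketch only: `generate_sample` is a hypothetical wrapper around the generation steps sketched earlier, and the pairing of group sizes with blocks is inferred from the prose, since Table 4-3 itself is not legible.

```python
from itertools import product

K_LEVELS = (10, 30, 60, 150)
N_PAIRS = ((5, 25), (60, 150))        # (block 0, block 1) group sizes: n' and n"
REPLICATIONS = 5

def run_design(cells, generate_sample):
    """Iterate the 8 trials x 16 cells x 5 replications = 640 samples.
    `cells` is the output of half_fraction_blocks(); `generate_sample(K, n, cell)`
    is a hypothetical callable supplied by the user."""
    samples = []
    for K, (n_block0, n_block1) in product(K_LEVELS, N_PAIRS):   # 8 trials
        for cell in cells:                                        # 16 cells per trial
            n = n_block0 if cell["block"] == 0 else n_block1
            for _ in range(REPLICATIONS):
                samples.append(generate_sample(K, n, cell))
    return samples   # 640 generated samples in all
```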
However, there are many statisti- cal tests that can be used to determine if this is the case. Typically run tests, serial tests, and various Chi-square tests for independence are applied to relatively short sec- tions of the pseudo-random sequence. See Knuth (1969) for discussions on many of these tests. Two subroutines, GGNML and GGCHS from the International Mathematical and Statistical Library (IMSL) were used to obtain a sequence of pseudo-random normal (R), distributed N (0,1), and Chi-square random deviates with n degrees of freedom respectively. Once the procedure is started by an initial number, called the seed, each new seed number will be determined from the previous one. Random normal (0, n) deviates can be obtained by transforming GGNML output according to Y (I) = R (I) x nl/Z, for I in (1, 2, ...., K). This transformation was done in steps (1) and (3) of the generation routine. In steps (7) and (8) a similar transformation was performed of the form Y (I) = R (I) x Vl/2 where V represents TuW or TBW whichever the case may be. 63 Analysis Routine Output from the generation program consists of summary statistics for each sample. This serves as input to the first analysis routine. From the first routine, HLM (Bryk, et al., 1987), we obtain a vector of the empirical Bayes estimates of the macro parameters, 7* (as in equation 3.13), the*empirical Bayes estimates of their dispersion matrix, DY (as in equation 3.15), estimates of parameter variances Tu and TB , estimate of 02 , and number of iterations . These estimates are numerically computed via EM algorithm. The convergence criterion for the log likelihood function was set at .0001 with the maximum number of iterations allowed fixed at 500. The empirical Bayes estimates of the macro parameters and their dispersion, parameter variances and 02 , yet serve as input to the second analysis routine. The analysis routine computes: (1) the required summary statistics for the estimation phase, (2) the proportion of times the values of each test statistics exceeded its criti- cal values for a given nominal significance level under true null hypothesis, (3) the noncentrality parameter (ncp) as is defined in the last section of this chapter, and (4) tabu- lates population effect size ( y in our case ) against ncp as a way of demonstrating power functions. 64 Checking the EM Algorithm As a check on the algorithm, first we might wish to examine the properties of the numerically computed estimates of the macro parameters,‘Y . The ML estimators are functions of every sufficient statistic and are consistent and asymptotically normal and efficient. Additionally, given normal data, ML estimates of regression coefficients are unbiassed. Key issues in estimation concentrate on bias and effi- ciency of an estimator. An estimator is unbiased if its expected value is equal to the population value of the parameter. In other words, if an estimator is unbiased, the estimated value minus its parameter value should have zero expectation, i.e., E ( Y - v) = o where Y ‘ Y00’ 701’ Y10’ Y11 Thus, by deviating parameter estimate from the known population value and averaging over the entire sample, one can determine the degree of bias, if any, present in the estimation procedure. An estimator is said to be relatively efficient if it has the smallest standard error term among the set of unbiased estimators. Three estimates of the variance are computed. 
The first is the variance of the macro estimators estimated by HLM via the EM algorithm:

\widehat{var}(\gamma_h^*) = Diag\,(\sum_j W_j' \hat\Delta_j^{-1} W_j)^{-1} .

To the extent that the variance components \tau and V_j are misestimated because of the small sample problem, the estimated variance of the macro estimators will be in error. The second estimate of variance is the mean squares within (MSW):

MSW = \sum_{i=1}^{5} (\hat\gamma_i^* - \bar\gamma^*)^2 / 5 ,   where \bar\gamma^* = \sum_{i=1}^{5} \hat\gamma_i^* / 5 .

The last measure is the average squared bias, or mean square error (MSE):

MSE = \sum_{i=1}^{5} (\hat\gamma_i^* - \gamma)^2 / 5 .

These last two measures of variance are similar, except that MSE takes advantage of the fact that the population value (\gamma) is known. Since the maximum likelihood estimates are asymptotically normally distributed, it is of interest to discover whether they are asymptotically efficient in the sense of attaining the Cramer-Rao lower bound for the covariance matrix. This minimum variance bound is the inverse of the Fisher information matrix and is equal to the asymptotic dispersion of \gamma^*, i.e., (\sum_j W_j' \Delta_j^{-1} W_j)^{-1}. Consequently, all three measures of variance are averaged over the entire sample and then compared with the asymptotic variance of the macro parameters, i.e., the diagonal elements of the matrix (\sum_j W_j' \Delta_j^{-1} W_j)^{-1}. Computational formulae for these various measures of dispersion are:

1) VAR = \sum_{j=1}^{K} \sum_{i=1}^{n} \widehat{var}(\hat\gamma^*) / Kn, where K is the number of groups and n is the group size: the average estimate of the variance of the macro estimators estimated by HLM via the EM algorithm from each sample.

2) MSW = \sum_{j=1}^{K} \sum_{i=1}^{n} (\hat\gamma^* - \bar\gamma^*)^2 / Kn: the average estimate of the variance of the macro estimators, based on the squared difference between the estimates and the mean estimate.

3) MSE = \sum_{j=1}^{K} \sum_{i=1}^{n} (\hat\gamma^* - \gamma)^2 / Kn: the average estimate of the variance of the macro estimators, based on the squared difference between the estimators and the population parameters.

4) CRLB = Diag\,(\sum_j W_j' \Delta_j^{-1} W_j)^{-1}: the average of the values of the minimum variance bound of the macro estimators.

ML estimators of the macro parameters have another desirable property: their asymptotic sampling distribution is known and normal. That is,

(\gamma^* - \gamma) \;\overset{ASY}{\sim}\; N\bigl(0, (\sum_j W_j' \Delta_j^{-1} W_j)^{-1}\bigr),

or equivalently

(\gamma_h^* - \gamma_h) / S.E.(\gamma_h^*) \;\overset{ASY}{\sim}\; N(0, 1),

where the subscript h refers to the elements of the \gamma vector, i.e., (\gamma_{00}, \gamma_{01}, \gamma_{10}, \gamma_{11}). One way to assess this property is through the use of normal probability plots.

Similarly, we can determine the degree of bias in the variance components produced by the EM algorithm. But the estimates of the variance components are unstable. The size of the sampling variance of these estimates depends on the size of the parameter variances they estimate; i.e., the larger the parameter variances, the larger the sampling variance of the statistics. To stabilize these estimates, a logarithmic transformation is performed on each of the variance components. This is then followed by a correction for bias. The formulas for \tau_{u|W}, \tau_{\beta|W} and \sigma^2 are given below:

\log\tilde\tau_{u|W} = \log\hat\tau_{u|W} + 1/K ,
\log\tilde\tau_{\beta|W} = \log\hat\tau_{\beta|W} + 1/K ,
\log\tilde\sigma^2 = \log\hat\sigma^2 + 1/nK ,

(note: \log\sigma^2 = \log 1 = 0), where 1/K, 1/K and 1/nK are the Cramer-Rao lower bounds for \log\hat\tau_{u|W}, \log\hat\tau_{\beta|W} and \log\hat\sigma^2, respectively, and are used for the bias corrections (Pitman, 1938). (Note: E(\log S^2) \neq \log\sigma^2, but E(\log S^2 + 1/\nu) = \log\sigma^2, where \nu is the correction for bias in S^2.)
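The dispersion measures and the bias-corrected log transforms above are simple enough to be sketched directly; the following assumes NumPy and an array layout that is illustrative rather than the one used in the original analysis routine.

```python
import numpy as np

def dispersion_measures(gamma_hat, var_hat, gamma_true, crlb_diag):
    """Summary dispersion measures for one experimental condition.

    gamma_hat : (R, 4) EM estimates of the macro parameters over R replications
    var_hat   : (R, 4) estimated sampling variances (diagonal of eq. 3.15)
    gamma_true: (4,)   known population values
    crlb_diag : (4,)   diagonal of the Cramer-Rao lower bound
    """
    VAR = var_hat.mean(axis=0)                                      # average estimated variance
    MSW = ((gamma_hat - gamma_hat.mean(axis=0))**2).mean(axis=0)    # mean squares within
    MSE = ((gamma_hat - gamma_true)**2).mean(axis=0)                # mean squared error
    return VAR, MSW, MSE, crlb_diag

def bias_corrected_log(tau_u_hat, tau_b_hat, sigma2_hat, K, n):
    """Log transform with the first-order bias corrections 1/K, 1/K and 1/nK."""
    return (np.log(tau_u_hat) + 1.0 / K,
            np.log(tau_b_hat) + 1.0 / K,
            np.log(sigma2_hat) + 1.0 / (n * K))
```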
The efficiency of these variance components will be examined by plotting the log of the squared error estimates in'r 2 uIW ' TBIW 0 against their respective asymptotic variance, 2/K , 2/K, , and and 2/nK (see Bartlett and Kendall, 1946, for derivation of these asymptotic variances). As a last check we look at the FM convergence rate under varying combinations of K and n. Type I Error Rate and Power There are two ways to commit an error when making an inference: (1) rejecting a null hypothesis when is true (type I error), and (2) not rejecting a null hypothesis when is false (type II error). Where a probability of type I error, probability of type II error, 8 and 1 - B = power An experimenter wants to avoid errors and select a statistical procedure which is powerful enough to detect an "experimental effect" if it exists and in which the level of significance (<2) is accurate, i.e., neither inflated, nor conservative. One empirical question is what effect does the estimated variance components have on type I error and power . 69 Three specified significance levels .01, .05 and .10 are considered in this study . For a given nominal alpha, (100t1 )% of the values in a test statistic's distribution will exceed the appropriate critical value under a true null ( Ho : y = 7*) where Ha : y = o ) with known variance compo- nents. Actual significance level relates to the proportion of the values in a test statistic's distribution that exceed the appropriate critical value under true null and esti- mated variance components. Hence, an empirical estimate of the probability of type I error (i.e., actual signi- ficance level under unknown variance components) is deter- mined by counting the frequency with which the test statis- tics ( z - ( Y* - y )/S.E. ( Y*) ) in each replication exceeds the corresponding critical value, and the dividing by the total number of replications. Nominal power relates to the proportion of the values in a test statistic's distribution that exceed the appropriate critical value under a true alternative ( Ha : Y = 7* where H0 : y = 0 ) and known variance components. Notice that the null and the alternative hypotheses are the same as the ones under robustness but have switched their position. An empirical estimate of power (i.e., actual power when variance components are estimated) is determined by counting the frequency with which the observed test statistics ( z . (y*-y)/S.E.(y*) ) in each replication exceeds the corrgggonding critical value, and then dividing by the total number of replications. This count is made at all three 70 nominal significance levels. Power is a function of the discrepancy between central and noncentral distribution for a test statistics. In this study actual noncentrality parameters (ncpi ) is defined as the expected value of the observed test statistics i.e., ,1, * ncp = E ( z ) = E Y / S.E. ( Y ) OBS where S.E. (7* ) is the standard error of the macro estima- tors estimated by HLM via EM algorithm, and z .. N (ncp. 
1) 0135 Actual power under noncentral distribution and unknown variance components is simply equal to the probability of z exceeding the corresponding critical values: OBS Actual power = P (‘ 2035 > C-V° (9/2) ) where z - ncp = z, and 2 ~ N ( 0,1 ) OBS a Empirical estimates of actual power are then compared to the nominal power in which the nominal noncentrality para- meter (ncp ) is defined as: n * 1!: ncp = E ( Y /o ( Y ) = Y/OY n where 0(y*) is square root of the asymptotic dispersion of y* ( i.e., the Cramer Rao minimum variance bound). 71 CHAPTER V RESULTS The results of the study are presented in this chapter in three sections. The first section is a check on the EM algorithm examining the properties of the maximum likelihood estimates of macro parameters and variance components. The second section will address robustness and power issues and the implication of variance estimation on inferences about macro parameters. The last section presents the rate of convergence of the EM algorithm under varying combinations of K and n used in this study. Results for Estimation Phase The objective for this phase of the study is to check the EM algorithm with respect to macro parameters and variance components (the vector notation “Y and.jz are used ~ throughout this section to refer to the macro parameters Yoo , Y01 , 710 and Y11 , and their estimates Yoo , YOI , Ylo A and'y11 respectively). The question could be phrased: Does the algorithm behave as expected ? That is, are the macro I-<> estimators (1) unbiased, i.e., E ( "I ) = 0; (2) asymptoti- cally efficient; and (3) with known and asymptotic normal distribution? Operationally this question could be phrased: 72 Does the estimation get better as a function of K and n ? That is, are the macro parameters consistent, less biased, and more efficient ? Similar questions will be addressed A with regard to the variance components,r 'thq and 02 . ulW ' Are the Macro Parameters Asymptotically Unbiased and Consistent 3 The error estimates in macro parameters are calculated by subtracting the estimated values {E from their corresponding parameter values I. . These values are then averaged over the entire sample, 640 sample points. The expected errors of all four macro parameters and their 95 percent confidence intervals are shown in Table (5-1) which suggest that the maximum likelihood estimates of the macro parameters are unbiased. Table 5—1 Expected Errors of Estimate in the Macro Parameters* Y00 Y01 Y10 Y11 .002178 .003224 -.001964 .005195 (-.OO78, .0118) (—.0088, .0148) (-.OO98, .0058) (-.0028, .0128)** *From 640 replications **9SZ confidence intervals To assess the differential effects of K, n and, d on the. error of estimates and to examine more explicitly the differences among levels of each factor, Tables 5-2 through 73 5-5 give error of estimates in i reflecting these three factors. In all four tables the same patterns emerged. Since this was consistent across all four macro parameters, only the results for the yoowill be discussed. The first eight rows relate to the averaged within-cell error of estimates. Generally these values considering the small number of replications (five) tend to be small. In the lower part of the tables absolute errors of estimates are summed within: 1) levels of d ; 2) levels of n (n' vs. n"); and 3) levels of K. Within each level of n, error increased as d in- creased. 
For example, with K = 10 under n', the absolute error of estimates in \hat\gamma_{00} (Table 5-2) went from .517 under low intraclass correlation to 1.013 with a high degree of d (d = .25). With K = 30, 60 and 150 under n', absolute error increased from .325, .112 and .145 to .558, .276 and .172, respectively. This upward trend was remarkably consistent among all macro parameters. The only difference among the four parameters was one of magnitude. With \hat\gamma_{00} and \hat\gamma_{01}, errors of estimate tended to be slightly higher than for \hat\gamma_{10} and \hat\gamma_{11}. The reason for this difference in the errors for slopes and intercepts seems to be due to the assumed ratio of the parameter variance in slope to that of intercept used to generate the data. This ratio is prefixed to be either 1/10 or 1/2 (i.e., c = .10, .50). Having shown that the error of estimates responds differently to the levels of d, the following is an attempt to ...

[Tables 5-2 through 5-5 (error of estimates in \hat\gamma_{00}, \hat\gamma_{01}, \hat\gamma_{10} and \hat\gamma_{11} by levels of K, n and d), together with the discussion and tables on the pages that follow them, are printed sideways or as scanner artifacts in the original and are not legible here. The legible text resumes with Figure 5-9, which plots the transformed estimated variance of \hat\sigma^2 against its asymptotic variance, 2/nK.]

Figure 5-9.
Plot of transformed estimated and asymptotic variance of \hat\sigma^2, where: 1 = 2/n4k4, 2 = 2/n3k4, 3 = 2/n4k3, 4 = 2/n4k2, 5 = 2/n2k4, 6 = 2/n3k3, 7 = 2/n3k2, 8 = 2/n4k1, 9 = 2/n2k3, 10 = 2/n2k2, 11 = 2/n1k4, 12 = 2/n3k1, 13 = 2/n1k3, 14 = 2/n2k1, 15 = 2/n1k2, 16 = 2/n1k1; k1 = 10, k2 = 30, k3 = 60, k4 = 150; n1 = 5, n2 = 25, n3 = 60, n4 = 150. Note: Symbols A-Z and * signify frequencies 10-36, respectively.

... combinations of K and n along with five other factors described in Chapter IV. The design for this part of the study allows for an assessment of robustness and power under unknown variance components when: (1) total sample size, N, is considered; (2) the number of groups is varied (K = 10, 30, 60, and 150); (3) group size is varied (n = 5, 25, 60, and 150); and (4) the intraclass correlation coefficient is varied (d = .10 and .25). The data for the first part consist of 640 replications; parts two and three contain 160 replications for each of four combinations of K and n, and the last part is based on 320 replications for each of two values of d.

Robustness Under Various Conditions

This section evaluates the effect of variance estimation on tests of macro parameters based on total sample size, N, different levels of K and n, and the intraclass correlation coefficient, d. Since the data are randomly generated via Monte Carlo methods, random error in the data must be considered. To take this error into account, the standard error (S.E.) of a proportion for a sample size equal to the number of replications is employed. The S.E. for a proportion is estimated by (P(1 - P)/N)^{1/2}, where P is the true value of the proportion and N equals the number of replications. Since the true value of P (i.e., nominal alpha) is known, this formula is used to calculate the S.E. at the three nominal alpha levels considered. These are given in Table 5-14.

Table 5-14
Standard Errors for Nominal Alpha Levels and Number of Replications Used in the Study

Alpha     N=640     N=320     N=160
.01       .0039     .0056     .0079
.05       .0086     .0122     .0172
.10       .0119     .0168     .0237

With a reduction in the number of generated data sets comes an increase in standard errors. Given known parameters (i.e., nominal alpha levels), the standard error of a proportion may be used to calculate confidence intervals around the known parameters instead of probability intervals around the sample estimates. Using the standard procedure, 95 and 99 percent confidence intervals for the three nominal levels considered are presented in Table 5-15. Thus, obtained alpha levels within these intervals may be considered to be within sampling error of nominal alpha.

Total Sample Size and Robustness

Table 5-16 contains the actual alpha levels for all parameters under the central, unknown-variance-components situation when all combinations of K and n are considered together.

Table 5-15
Probability Intervals for Nominal Alpha Levels and Number of Replications Used in the Study

a) 95% Probability Intervals
Alpha     N=640              N=320              N=160
.01       (.0024, .0176)     (.0000, .0210)     (.0000, .0255)
.05       (.0331, .0669)     (.0261, .0739)     (.0163, .0837)
.10       (.0767, .1233)     (.0671, .1329)     (.0535, .1465)

b) 99% Probability Intervals
Alpha     N=640              N=320              N=160
.01       (.0000, .0201)     (.0000, .0245)     (.0000, .0304)
.05       (.0278, .0722)     (.0185, .0815)     (.0056, .0944)
.10       (.0693, .1307)     (.0567, .1433)     (.0389, .1612)

Table 5-16
Type I Error Rates for Tests of Macro Estimators Under a True Null*

            e.s.     alpha=.01     alpha=.05     alpha=.10
gamma_00    0        .020**        .059          .119
gamma_01    .31      .022**        .066          .119
gamma_10    .76      .013          .061          .112
gamma_11    .16      .019**        .072**        .122

*From 640 replications with e.s. effect size.
**Outside the 95% confidence interval.
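The quantities in Tables 5-14 through 5-16 follow directly from the standard error of a proportion and from counting exceedances of the critical value. A short sketch of both computations, assuming NumPy and SciPy (the 99 percent intervals use z = 2.576 in place of 1.96):

```python
import numpy as np
from scipy.stats import norm

def alpha_interval(alpha, n_reps, z=1.96):
    """Standard error of a proportion and the interval around a nominal alpha
    (the computation behind Tables 5-14 and 5-15)."""
    se = np.sqrt(alpha * (1 - alpha) / n_reps)
    return se, (max(alpha - z * se, 0.0), alpha + z * se)

def empirical_rejection_rate(z_stats, alpha):
    """Proportion of replications whose |z| exceeds the two-sided critical value:
    an empirical type I error rate under a true null, or empirical power under a
    true alternative (the counts summarized in Table 5-16 and later tables)."""
    cv = norm.ppf(1 - alpha / 2)
    return np.mean(np.abs(np.asarray(z_stats)) > cv)

# Example reproducing the first entry of Table 5-14: alpha = .01, N = 640.
print(round(alpha_interval(0.01, 640)[0], 4))   # 0.0039
```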
The empirical type I error rates are consistently exceeding the nominal error rates across all macro 106 Table 5-15 Probability Intervals for Nominal Alpha Levels and Number of Replications Used in the Study a) 95% Probability Intervals Alpha N=640 N=320 N=160 .01 (.0024, .0176) (.0000, .0210) .0000, .0255) .05 (.0331, .0669) (.0261, .0739) .0163, .0837) .10 (.0767, .1233) (.0671, .1329) .0535, .1465) b) 99% Probability Intervals Alpha N=640 N=320 N=160 .01 (.0000, .0201) (.0000, .0245) .0000, .0304) .05 (.0278, .0722) (.0185, .0815) .0056, .0944) .10 (.0693, .1307) (.0567, .1433) .0389, .1612) Table 5-16 Type I Error Rates for Tests of Macro Estimators Under a True Nu11* e.s 01 =.Ol 0‘ -.OS 0% =.10 Y 00 0 .020“ .059 .119 Y 01 .31 .022** .066 .119 Y 10 .76 .013 .061 .112 Y 11 .16 .019** .072** .122 *From 640 replications with e.s. effect size. #*Outside the 95% confidence interval. 107 parameters. Especially the error rates are relatively large 01 and 111 than that of 710 most values tend to be within 95% confidence intervals. The when testing YOO ,y However, exceptions are for y and Y01 at a = .01, and for at 00 Y11 .01 and .05 alpha where all are within 99% confidence interval but Y01 . When outside the probability intervals, empirical alpha levels are all liberal. Number of Groups and Robustness Tables 5-17 through 5-20 (part a) present the type I error rates for tests of macro estimators Y00 and ' Y01’ Y10 Yll' respectively when the number of groups varied, with K = 10, 30, 60, and 150. The values for all macro parameters tend to be within 95% confidence intervals of the nominal alpha across all K levels. When outside the confidence interval, empirical significance levels are all liberal. Exceptions are for'y10 with K = 30 at .01 and .05 alpha and for'y11 with K = 30 and 150 at .05, .10 and .05 alpha respectively. However, these values are typically within 99% confidence interval. An unexpected finding from this set of results is that for a given macro parameter the largest type I error occurred randomly regardless of number of groups . Sample Size and Robustness Tables 5-17 through 5-20 (part b) give the type I error rates for tests of macro parameters-y00 7 and‘rll ' Y01’ Y10 respectively for experimental conditions with n = 5, 25, 60, 108 Table 5-17 Type I Error Rates for Test of Macro Parameter,'Y00 , Under a True Null a) For different number of groups, k.* k e.s. 0‘=.01 o‘=.05 0‘=.10 10 0 .025 .075 .119 30 .019 .075 .138 60 .013 .031 .087 150 .025 - .056 .131 *From 160 replications with e.s. effect size. b) For different group size, n.* n e.s. ' 6=.01 OL=.05 OL=.10 5 o .025 .063 .112 25 .019 .050 .119 60 .019 .050 .119 150 .019 .075 .125 *From 160 replications with e.s. effect size. c) For different intraclass correlation coefficients, d.* d e.s. 9:.01 “=.05 “=.10 .10 0 .022** .047 .100 .25 .019 .072 .138 *From 320 replications with e.s. effect size. **Outside the 95% confidence interval. 109 Table 5—18 Type I Error Rates for Tests of Macro Parameter,)%)1, Under a True Null a) For different number of groups, k.* k e.s. q =.01 a =.05 9.=.10 10 .31 .025 .081 .138 30 .019 ' .056 .112 60 .025 .050 .100 150 .019 .075 .125 *From 160 replications with e.s. effect size. b) For different group size, n.* n e.s. o.=.01 a.=.05 a.=.10 ** 5 .31 .031 .075 .138 25 .32 .013 .050 .081 ** ** 60 .31 .031 .081 .156 150 .32 .013 .056 .100 *From 160 replications with e.s. effect size. c) For different intraclass correlation coefficients, d.* d e.s. 
01:.01 0t=.05 0:.10 .10 .22 .019 .053 .109 ** ** .25 .41 .025 .078 .128 *From 320 replications with e.s. effect size. **Outside the 95% confidence interval. 110 Table 5-19 Type I Error Rates for Tests of Macro Parameter,Y10 , Under a True Null a) For different number of groups, k.* k e.s. 0:.01 10:.05 01:.10 10 .76 .013 .044 .106 ** ** 30 .031 .094 .119 60 .006 .056 .106 150 .000 .050 .119 *From 160 replications with e.s. effect size. b) For different group size, n.* h e.s 01:.01 01 =.05 0t =.10 5 .73 .006 .050 .100 25 .78 .006 .075 .144 60 .73 .025 .081 .125 150 .78 .013 .038 .081 *From 160 replications with e.s. effect size. c) For different intraclass correlation coefficients, d.* d e.s a =.01 a =.05 a =.10 .10 .72 .019 .059 .103 .25 .79 .006 .063 .122 *From 320 replications with e.s. effect size. **0utside the 95% confidence interval. 111 Table 5-20 Type I Error Rates for Tests of Macro Parameter,Y11. Under a True Null a) For different number of groups, k.* k e.s. a =.01 a =.05 a =.10 10 .16 .019 .069 .112 ** ** 30 .025 .087 .150 60 .019 .038 .087 ** 150 .013 .094 .138 *From 160 replications with e.s. effect size. b) For different group size, n.* n e.s. a=.01 a=.05 a=.10 5 .16 .006 .044 .081 ** 25 .17 .031 .069 .106 60 .16 .013 .081 .138 150 .17 .025 .094** .162** *From 160 replications with e.s. effect size. c) For different intraclass correlation coefficients, d.* d e.s. 0:.01 OL=.05 01:.10 .10 .11 .025“ .081** .128 .25 .22 g .013 .063 .116 *From 320 replications with e.s. effect size. **Outside the 95% confidence interval. 112 and 150. For all macro parameters, actual significance levels tend to be within the 95% probability intervals of nominal values across all levels of n. When outside the confidence intervals, empirical alpha levels are all liberal. Exceptions are forY01 with n = 5 at .01 alpha and with n = 60 at .01 and .10 alpha, and forYEI with n = 25 at .01 alpha and with n = 150 at .05 and .10 alpha levels. However, all values are typically within 99% probability intervals of the nominal alpha. Again no pattern emerged. That is, for a given macro parameter the largest type I error occurred randomly regardless of sample size and effect size. The only exception is for Y01 where departure from nominal alpha was fairly small with an increase in effect size with n = 25 and 150. Intraclass Correlation Coefficient and Robustness Tables 5-17 through 5-20 (part c) report the type I error rates for tests of macro estimators YOO , Y01 , Ylo , and'Y11 , respectively when the intraclass correlation coefficient varied, with d = .10 and .25. Once again most values tend to be within 95% confidence intervals of the nominal alpha for both levels of d. The values outside of this interval are all liberal. However, values not contained within 95% confidence intervals are all within 99% probability interval. These values are: YOO with d = .10 at a = .01; with d = .25 at a = .01 and .05; and Y11 Yo1 with d = .10 at a = .01 and .05. Again no pattern emerged 113 with respect to d and /or effect sizes. Power Under Various Conditions The goal of this portion of the study is to evaluate the power of the tests of macro parameters in rejecting the null hypothesis under unknown variance components situation by considering total sample size, N, different levels of K, n and intraclass correlation coefficient, d. The empirical estimates of power (P') may also be compared to the theoretical values of power (P") obtained through nominal noncentrality parameter discussed in Chapter IV. 
Because of the way the null and true alternative hypotheses are set up (see Chapter IV), the implementation and discussion of the power analysis is limited to the macro parameters γ01, γ10, and γ11. The true value of γ00 is set at zero, so a power analysis cannot be applied to it.

Total Sample Size and Power

As shown in Table 5-21, the empirical power for all macro parameters, pooling all four combinations of K and n, is quite high. Of the three macro parameters, γ10 consistently attains the highest power, followed by γ01 and γ11. This ordering is highly consistent with the magnitudes of the effect sizes of the macro parameters. Within each macro parameter, power is always larger at larger nominal alpha levels.

Table 5-21
Power for Tests of Macro Parameters*

                  α = .01            α = .05            α = .10
         e.s.    P'**     P''***    P'       P''       P'       P''
  γ01    .31     .9382    .9495     .9846    .9881     .9932    .9949
  γ10    .76     .9999    .9999     .9999    .9999     .9999    .9999
  γ11    .16     .8023    .7995     .9292    .9265     .9625    .9616

*From 640 replications; e.s. = effect size.
**Empirical power.
***Nominal power.

The power estimates for the macro parameters are equal or very close to the theoretical values (the differences are statistically insignificant). Within each nominal alpha level, the empirical power is smaller than the nominal power for γ01; the situation is reversed for γ11.
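The claim that the differences between P' and P'' are statistically insignificant can be examined with the same binomial reasoning that underlies Table 5-15: over N replications the empirical power is a proportion whose approximate standard error, evaluated at the nominal value, is sqrt(P''(1 - P'')/N). The check sketched below is an illustration of this reasoning, not a procedure quoted from the study.

    import math

    def differs_from_nominal(p_emp, p_nom, n_reps, z=1.96):
        """Check whether an empirical power estimate falls outside the
        normal-approximation interval implied by the nominal power."""
        se = math.sqrt(p_nom * (1.0 - p_nom) / n_reps)
        return abs(p_emp - p_nom) > z * se

    # gamma_11 at alpha = .01 in Table 5-21: P' = .8023, P'' = .7995,
    # based on 640 replications.  The difference is well inside the interval.
    print(differs_from_nominal(0.8023, 0.7995, 640))   # prints False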
Number of Groups and Power

Tables 5-22 through 5-24 (part a) give the empirical and nominal power for the macro parameters γ01, γ10, and γ11, respectively, when the number of groups varied, with K = 10, 30, 60, and 150. Again, γ10 has the highest power across all levels of K and all alpha levels. γ01, with effect size e.s. = .31, takes the next highest place and reaches the same degree of power (.9999) with K = 150 at all alpha levels. γ11 attains that power at the same value of K, but only at α = .05 and .10.

Table 5-22
Power for Tests of Macro Parameter γ01

a) For different numbers of groups, K.*

                  α = .01           α = .05           α = .10
  K      e.s.    P'**     P''***   P'       P''      P'       P''
  10     .31     .2546    .2119    .4801    .4286    .6064    .5517
  30             .6736    .7054    .8577    .8770    .9162    .9292
  60             .9686    .9772    .9934    .9955    .9974    .9983
  150            .9999    .9999    .9999    .9999    .9999    .9999

*From 160 replications; e.s. = effect size.

b) For different group sizes, n.*

                  α = .01           α = .05           α = .10
  n      e.s.    P'       P''      P'       P''      P'       P''
  5      .31     .6141    .7517    .8186    .9015    .8888    .9463
  25     .32     .9406    .9564    .9854    .9898    .9936    .9959
  60     .31     .9772    .9778    .9956    .9957    .9984    .9984
  150    .32     .9893    .9846    .9982    .9973    .9993    .9989

*From 160 replications; e.s. = effect size.

c) For different intraclass correlation coefficients, d.*

                  α = .01           α = .05           α = .10
  d      e.s.    P'       P''      P'       P''      P'       P''
  .10    .22     .8997    .9265    .9713    .9808    .9864    .9913
  .25    .41     .9641    .9664    .9920    .9927    .9968    .9971

*From 320 replications; e.s. = effect size.
**Empirical power.
***Nominal power.

Table 5-23
Power for Tests of Macro Parameter γ10

a) For different numbers of groups, K.*

                  α = .01           α = .05           α = .10
  K      e.s.    P'**     P''***   P'       P''      P'       P''
  10     .76     .9999    .9999    .9999    .9999    .9999    .9999
  30             .9999    .9999    .9999    .9999    .9999    .9999
  60             .9999    .9999    .9999    .9999    .9999    .9999
  150            .9999    .9999    .9999    .9999    .9999    .9999

*From 160 replications; e.s. = effect size.

b) For different group sizes, n.*

                  α = .01           α = .05           α = .10
  n      e.s.    P'       P''      P'       P''      P'       P''
  5      .73     .9999    .9999    .9999    .9999    .9999    .9999
  25     .78     .9999    .9999    .9999    .9999    .9999    .9999
  60     .73     .9999    .9999    .9999    .9999    .9999    .9999
  150    .78     .9999    .9999    .9999    .9999    .9999    .9999

*From 160 replications; e.s. = effect size.

c) For different intraclass correlation coefficients, d.*

                  α = .01           α = .05           α = .10
  d      e.s.    P'       P''      P'       P''      P'       P''
  .10    .72     .9999    .9999    .9999    .9999    .9999    .9999
  .25    .79     .9999    .9999    .9999    .9999    .9999    .9999

*From 320 replications; e.s. = effect size.
**Empirical power.
***Nominal power.

Table 5-24
Power for Tests of Macro Parameter γ11

a) For different numbers of groups, K.*

                  α = .01           α = .05           α = .10
  K      e.s.    P'**     P''***   P'       P''      P'       P''
  10     .16     .1423    .1314    .3264    .3050    .4443    .4247
  30             .5040    .4840    .7357    .7190    .8264    .8159
  60             .8665    .8621    .9582    .9564    .9798    .9783
  150            .9996    .9997    .9999    .9999    .9999    .9999

*From 160 replications; e.s. = effect size.

b) For different group sizes, n.*

                  α = .01           α = .05           α = .10
  n      e.s.    P'       P''      P'       P''      P'       P''
  5      .16     .2743    .3156    .5040    .5517    .6293    .6736
  25     .17     .7422    .7357    .8962    .8944    .9429    .9406
  60     .16     .9162    .9082    .9767    .9744    .9896    .9881
  150    .17     .9767    .9693    .9955    .9936    .9982    .9974

*From 160 replications; e.s. = effect size.

c) For different intraclass correlation coefficients, d.*

                  α = .01           α = .05           α = .10
  d      e.s.    P'       P''      P'       P''      P'       P''
  .10    .11     .6915    .6808    .8686    .8621    .9236    .9207
  .25    .22     .8849    .8849    .9656    .9656    .9834    .9834

*From 320 replications; e.s. = effect size.
**Empirical power.
***Nominal power.

As shown in Figure 5-10, with K = 150 the power curves for the three macro parameters are indistinguishable. With respect to the number of groups, power is best with K = 60 and 150 and worst with K = 10 when γ01 is considered, and best with K = 150 and worst with K = 10 for γ11. Empirical powers are close to but smaller than the theoretical powers for γ01, except with K = 10, and close to but larger than the nominal powers for γ11, except with K = 150 (the differences between P' and P'' are statistically insignificant).

Sample Size and Power

Tables 5-22 through 5-24 (part b) give the actual and nominal power for the macro parameters γ01, γ10, and γ11, respectively, for groups of size 5, 25, 60, and 150. Across all levels of n, power improved relative to what it was when the levels of K were considered. This was consistent across all macro parameters and all three alpha levels (to evaluate the simultaneous effect of K and n on empirical and theoretical power, see Tables A-4 through A-6 in the Appendix). γ10, the macro parameter with the largest effect size, gained the most power. Within each macro parameter, power increased as the effect size, the group size, or the alpha level increased. As shown in Figure 5-11, again with n = 150 the power curves for the three macro parameters are indistinguishable. With regard to group size, power is best with n = 25 or above for γ11, but only with n = 60 and 150 when γ01 is considered.

Figure 5-10. Power curves of γ01, γ10, and γ11 for different numbers of groups, K.

Figure 5-11. Power curves of γ01, γ10, and γ11 for different group sizes, n.
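The power curves of Figure 5-10 can be redrawn directly from the part (a) entries of Tables 5-22 through 5-24. The sketch below does so for the α = .05 columns; the plotting library is an assumption of the sketch, and the arrays simply transcribe the tabled empirical power values.

    import matplotlib.pyplot as plt

    # Empirical power at alpha = .05, transcribed from part (a) of
    # Tables 5-22, 5-23, and 5-24.
    k_levels = [10, 30, 60, 150]
    empirical_power = {
        "gamma_01 (e.s. = .31)": [0.4801, 0.8577, 0.9934, 0.9999],
        "gamma_10 (e.s. = .76)": [0.9999, 0.9999, 0.9999, 0.9999],
        "gamma_11 (e.s. = .16)": [0.3264, 0.7357, 0.9582, 0.9999],
    }

    for label, values in empirical_power.items():
        plt.plot(k_levels, values, marker="o", label=label)

    plt.xlabel("Number of groups, K")
    plt.ylabel("Empirical power at alpha = .05")
    plt.title("Power curves of the macro parameters (cf. Figure 5-10)")
    plt.legend()
    plt.show()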