WEIGHTING IN MULTILEVEL MODELS

By

Bing Tong

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Measurement and Quantitative Methods -- Doctor of Philosophy

2019

ABSTRACT

WEIGHTING IN MULTILEVEL MODELS

By

Bing Tong

Large-scale survey programs usually use complex sampling designs, such as unequal probabilities of selection, stratification, and/or clustering, to collect data while saving time and money. This makes it necessary to incorporate sampling weights into multilevel models in order to obtain accurate estimates and valid inferences. However, weighted multilevel estimators have been developed only recently, and there is minimal guidance on how to use sampling weights in multilevel models and on which estimator is most appropriate. The goal of this study is to examine the performance of multilevel pseudo maximum likelihood (MPML) estimation methods using different scaling techniques under informative and non-informative conditions in the context of a two-stage sampling design with unequal probabilities of selection. Monte Carlo simulation methods are used to evaluate the impacts of three factors: the informativeness of the sampling design, the intraclass correlation coefficient (ICC), and the estimation method. Simulation results indicate that including sampling weights in the model still produces biased estimates of the school-level variance. In general, the weighted methods outperform the unweighted method in estimating the intercept and the student-level variance, while the unweighted method outperforms the weighted methods for school-level variance estimation in the informative condition. Overall, the cluster scaling estimation method is recommended under an informative sampling design. Under the non-informative condition, the unweighted method can be considered a better choice than the weighted methods for all parameter estimates.
In addition, the ICC has clear effects on the school-level variance estimates in the informative condition, while in the non-informative condition it also affects the intercept estimates. An empirical study is included to illustrate the model.

Copyright by
BING TONG
2019

This dissertation is dedicated to my family.

ACKNOWLEDGEMENTS

I have received a great deal of support and assistance throughout the writing of my dissertation. This dissertation could not have been completed without that help. I am especially indebted to my advisor and dissertation chair, Dr. Kimberly S. Kelly. With her encouragement, I chose the MQM program. During my PhD career, she gave me tremendous help in my academic studies, and in my spiritual life as well. Her expertise was invaluable in formulating the research topic. I would like to acknowledge my committee members, Dr. Yuehua Cui, Dr. Richard Houang, and Dr. William Schmidt. I am grateful to them and appreciate their enlightening feedback, enormous support, and patient guidance. My special thanks go to my CSTAT colleagues, including Dr. Frank Lawrence, Dr. Steven Pierce, Dr. Dhruv Sharman, Dr. Wenjuan Ma, and Dr. Sarah Hession. In the last four and a half years, they have become my family, and I love working with them. They have shared tremendous resources and insightful ideas with me. More importantly, they never hesitate to help me whenever I encounter any problems. I will never forget them and will miss every single one of them. Nobody has been more important to me in the pursuit of this dissertation than my family members. I would like to thank my mom and my sisters, who always love and support me unconditionally. Most importantly, I thank my beloved daughter, Shiyuan, who provides unending support. She is always there for me.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS

CHAPTER 1 INTRODUCTION

CHAPTER 2 THEORETICAL BACKGROUND AND LITERATURE REVIEW
2.1 Research Goal
2.2 Multistage Sampling
2.3 Multilevel Model
2.4 Multilevel Pseudo-Maximum Likelihood (MPML) Estimation Methods
2.5 Scaling Sampling Weights for Multilevel Models
2.6 Intraclass Correlation Coefficient (ICC)
2.7 Informativeness of Selection

CHAPTER 3 METHODS
3.1 Empirical Data
3.1.1 Data and Variables
3.1.2 Statistical Models
3.2 Simulations
3.2.1 Simulation Design
3.2.2 Model
3.2.3 Sampling Selection
3.2.4 Mplus and Data Analysis
3.2.5 Evaluation Criteria

CHAPTER 4 RESULTS
4.1 Simulation Results
4.1.1 Research Question One
4.1.1.1 (Absolute) Relative Bias
4.1.1.1.1 Informative Design
4.1.1.1.2 Non-Informative Design
4.1.1.2 RMSE
4.1.1.2.1 Informative Design
4.1.1.2.2 Non-Informative Design
4.1.1.3 Coverage Rate
4.1.1.3.1 Informative Design
4.1.1.3.2 Non-Informative Design
4.1.2 Research Question Two
4.1.2.1 (Absolute) Relative Bias
4.1.2.1.1 Informative Design
4.1.2.1.2 Non-Informative Design
4.1.2.2 RMSE
4.1.2.2.1 Informative Design
4.1.2.2.2 Non-Informative Design
4.1.2.3 Coverage Rate
4.1.2.3.1 Informative Design
4.1.2.3.2 Non-Informative Design
4.1.3 Simulated Standard Errors and Standard Deviations
4.2 Results for ECLS-K:2011

CHAPTER 5 SUMMARY AND DISCUSSION
5.1 Summary of This Study
5.2 Discussion of Results
5.3 Implications
5.4 Limitations and Future Studies

APPENDICES
APPENDIX A. Stata Simulation Syntax in the Informative Sampling Design
APPENDIX B. Stata Simulation Syntax in the Non-Informative Sampling Design
APPENDIX C. Mplus Syntax

REFERENCES

LIST OF TABLES

Table 3.1. ECLS-K:2011 Variable Descriptive Statistics
Table 3.2. Simulation Design
Table 4.1. RB (%), RMSE, 95% CI CR for Covariates in the Informative Design
Table 4.2. RB (%), RMSE, 95% CI CR for Intercept and Variance Components in the Informative Design
Table 4.3. RB (%), RMSE, 95% CI CR for Covariates in the Non-Informative Design
Table 4.4. RB (%), RMSE, 95% CI CR for Intercept and Variance Components in the Non-Informative Design
Table 4.5. Simulation Standard Deviations and Standard Errors of Estimates in the Informative Design
Table 4.6. Simulation Standard Deviations and Standard Errors of Estimates in the Non-Informative Design
Table 4.7. Null Model for ECLS-K:2011 Mathematics and Reading
Table 4.8. Model with Student-Level Predictors for ECLS-K:2011 Mathematics and Reading
Table 4.9. Full Model for ECLS-K:2011 Mathematics and Reading
Table 5.1. Summary of Comparisons of the Estimators
Table 5.2. ICC Effect

LIST OF FIGURES

Figure 4.1. Relative bias (%) for covariates in the informative design
Figure 4.2. Relative bias (%) for intercept and variance components in the informative design
Figure 4.3. Relative bias (%) for covariates in the non-informative design
Figure 4.4. Relative bias (%) for intercept and variance components in the non-informative design
Figure 4.5. RMSE for covariates in the informative design
Figure 4.6. RMSE for intercept and variance components in the informative design
Figure 4.7. RMSE for covariates in the non-informative design
Figure 4.8. RMSE for intercept and variance components in the non-informative design
Figure 4.9. Coverage rate for covariates in the informative design
Figure 4.10. Coverage rate for intercept and variance components in the informative design
Figure 4.11. Coverage rate for covariates in the non-informative design
Figure 4.12. Coverage rate for intercept and variance components in the non-informative design
Figure 4.13. Relative bias (%) for covariates in the informative design
Figure 4.14. Relative bias (%) for intercept and variance components in the informative design
Figure 4.15. Relative bias (%) for covariates in the non-informative design
Figure 4.16. Relative bias (%) for intercept and variance components in the non-informative design
KEY TO ABBREVIATIONS

ECLS-K:2011  Early Childhood Longitudinal Study, Kindergarten Class of 2010-2011
PML  Pseudo Maximum Likelihood
MPML  Multilevel Pseudo Maximum Likelihood
PWIGLS  Probability Weighted Iterative Generalized Least Squares
ICC  Intraclass Correlation Coefficient
RB  Relative Bias
RMSE  Root Mean Square Error
CR  Coverage Rate
UW  Unweighted Estimation Method
RW  Estimation Method with Raw Weights
CS  Estimation Method with Cluster Scaling
ES  Estimation Method with Effective Scaling
NAEP  National Assessment of Educational Progress
NCES  National Center for Education Statistics
NSF  National Science Foundation

CHAPTER 1 INTRODUCTION

A survey is a data collection tool commonly used in social science to collect self-report data from study participants. It allows researchers to collect a large amount of data quickly and inexpensively. Moreover, the samples in survey research are often large, and a wide variety of variables can be examined (Boslaugh, 2007; Koziol, Bovaird, & Suarez, 2017), including personal facts, attitudes, previous behaviors, and opinions. A survey can also often be created quickly and administered easily. Thus, secondary data analysis is becoming increasingly popular (Stapleton, 2006). Many large-scale survey programs in social science use complex sampling designs to collect data, such as unequal probabilities of selection, stratification, and/or cluster sampling, owing to the impracticality of simple random sampling.
In educational research, large-scale data collection efforts such as the National Assessment of Educational Progress (NAEP hereafter), the Early Childhood Longitudinal Study, Kindergarten Class of 1998-1999 (ECLS-K hereafter), and the Early Childhood Longitudinal Study, Kindergarten Class of 2010-2011 (ECLS-K:2011 hereafter), available through the National Center for Education Statistics (NCES) or the National Science Foundation (NSF), use complex sampling plans. These three-stage surveys first involve sampling geographic areas with different probabilities of selection according to their characteristics. These areas are often termed primary sampling units (PSUs). Schools are then sampled with different probabilities from the selected areas, and lastly students are sampled from each of the selected schools, resulting in a cluster sampling design. Students chosen from the same school tend to be more alike than students chosen from other schools, and these groups of students show some degree of dependence when compared to students from other schools (Hox & Kreft, 1994; Kish, 1965; Skinner, Holt & Smith, 1989). This type of sampling design brings challenges when performing statistical analyses. If we disaggregate higher-order variables to individual variables, ignoring the nested structure of the data and assuming each observation is independent, the assumption of independence of observations is not tenable. Conventional parametric analytic methods (e.g., regression, analysis of variance, t-tests) do not work well because they violate the assumption of independence of observations (Cohen, West, & Aiken, 2003). The standard errors of the point estimates are estimated incorrectly, which could lead to erroneous conclusions arising from increased Type I errors due to the violation of this assumption (Arceneaux & Nickerson, 2009; Clarke, 2008; Hahs-Vaughn, 2005; Heck & Mahoe, 2004; Judd, McClelland, & Ryan, 2009; Musca et al., 2011).
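The size of this standard-error distortion can be quantified with Kish's (1965) design effect, DEFF = 1 + (n - 1) × ICC for clusters of size n. The following simulation sketch is my own illustration with made-up numbers (it is not part of this study's design): it compares the actual variance of a sample mean under cluster sampling with the variance a simple-random-sampling analysis would assume.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: m = 500 schools of n = 20 students, ICC = 0.2,
# with the total variance of the outcome fixed at 1.
m, n, icc = 500, 20, 0.2
s2_u, s2_e = icc, 1 - icc              # between-school and within-school variance

reps = 2000
means = np.empty(reps)
for r in range(reps):
    u = rng.normal(0.0, np.sqrt(s2_u), size=m)                   # shared school effects
    y = u[:, None] + rng.normal(0.0, np.sqrt(s2_e), size=(m, n)) # student outcomes
    means[r] = y.mean()

emp_var = means.var()                  # observed variance of the sample mean
srs_var = 1.0 / (m * n)                # variance an SRS analysis would assume
deff = 1 + (n - 1) * icc               # Kish's design effect, 4.8 in this setup
print(emp_var / srs_var, deff)         # the empirical ratio is close to DEFF
```

With these numbers, naive standard errors are understated by a factor of about sqrt(4.8), which is exactly the mechanism behind the inflated Type I error rates cited above.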
However, if all the individual-level variables are aggregated to the higher level, then important information could be lost. Multilevel models, or hierarchical linear models (HLM), were proposed and have been widely used in education because they account for clustering and allow the variance of the dependent variable to be partitioned explicitly into within-group and between-group variance (Lee & Fish, 2010; Lubienski & Lubienski, 2006; Palardy, 2010; Raudenbush & Bryk, 2002; Snijders & Bosker, 2012). They are an alternative to some of the approaches used in survey analysis for dealing with nested data structures. Furthermore, some groups of the population are oversampled for various reasons. Units with higher data collection costs may be drawn with lower selection probabilities, and individuals from small subpopulations of particular interest may be sampled with higher probabilities. For example, both ECLS-K and ECLS-K:2011 oversampled Asians, Native Hawaiians, and Other Pacific Islanders at a rate of 2.5 relative to other racial groups. This feature calls for applying sampling weights in the model to reflect the unequal probabilities of selection whenever the selection probabilities are related to the outcome variable after conditioning on the covariates in the model. The sampling design is said to be informative in this case (Fuller, 2009; Grilli & Pratesi, 2004). Ignoring this feature and not using weights, parameter estimates can be severely biased (Korn & Graubard, 1995; Pfeffermann, Skinner & Goldstein, 1998; Rodriguez & Goldman, 1995, 2001; Zaccarin & Donati, 2008). However, using weights appropriately is not an easy task. In large-scale data sets such as ECLS-K:2011, there are many sampling weight variables, including school-level and student-level weights.
At the student level, these include weights generated for the child assessments, the teacher-level questionnaire, the student-level questionnaire, the parent interview, and the care provider questionnaire. Appropriate use of complex sampling weights is of great importance because ignoring them may produce erroneous standard errors and, consequently, inaccurate statistical inference. Yet there is not much guidance on how to incorporate sampling weights into multilevel models, even though work on this problem dates back to the late 1980s (e.g., Pfeffermann & LaVange, 1989). The pseudo maximum likelihood (PML) method, developed by Skinner (1989) following the ideas of Binder (1983), is a well-established estimation procedure for weighted single-level models. However, flexible techniques for estimating weighted multilevel models have been developed only recently (cf. Asparouhov, 2004, 2006; Grilli & Pratesi, 2004; Rabe-Hesketh & Skrondal, 2006; Koziol et al., 2017). One possible reason for this is that multilevel weights are often unavailable, which is frequently the case for public-release data files (Kovačević & Rai, 2003; Stapleton, 2012). A second reason might be that weighted multilevel modeling requires scaling of the lower-level sampling weights (Pfeffermann et al., 1998). Currently, there is no well-established, generally consistent multilevel estimation method incorporating weights, and it remains controversial whether to weight or not (Bertolet, 2008; Kish, 1992; Skinner, 1994; Smith, 1988; Xia & Torian, 2013). On the one hand, some researchers (e.g., Graubard & Korn, 1996; Korn & Graubard, 1995, 2003; Lohr & Liu, 1994) suggested using sampling weights in the model, as mentioned above, to account for the complex sampling scheme. On the other hand, Winship and Radbill (1994) preferred unweighted estimators because the estimates were unbiased and consistent and because they produced smaller standard errors.
However, although the use of sampling weights increases variance because of the unequal inclusion probabilities, weighting is still required and necessary because it prevents biased parameter estimates under informative sampling in multilevel models (Pfeffermann et al., 1998; Kim & Skinner, 2013), protects against misspecification, and makes full use of population-level information (Kim & Skinner, 2013). Estimation quality can be affected by a number of factors, some of which have been investigated in past research across different conditions: cluster size, distribution of the response variable, estimator/software program, informativeness of the sampling design, intraclass correlation coefficient (ICC), model type, invariance of selection across clusters, number of clusters, relative variance of the weights, sample design features, and weight approximation method. In this study, I focus on the multilevel pseudo maximum likelihood (MPML) estimation method. First, although various conditions have been examined, conclusions remain inconclusive and depend on the particular model or sampling mechanism. Second, only a limited number of studies have evaluated MPML (i.e., Asparouhov, 2006; Asparouhov & Muthén, 2006; Cai, 2013; Grilli & Pratesi, 2004; Koziol et al., 2017; Rabe-Hesketh & Skrondal, 2006; Stapleton, 2012). Third, MPML is more flexible than other estimators. Therefore, more studies are needed to evaluate MPML. The purpose of the present study is to evaluate the performance of MPML using different scaling procedures in the context of a two-stage sampling design with unequal probabilities of selection, in the informative and non-informative conditions, across different levels of ICC, using a
Monte Carlo simulation methods are used to estimate the relative bias (RB), root mean square error (RMSE) and coverage rate/probability (CR) of the corresponding 95% confidence in te rval estimators. The following factors are manipulated: ( a ) informat iveness; (b) ICC of the unconditional model ; and (c) estimation method. All factors are fully crossed. Cai (2013) conducted Monte Carlo simulations and found that the unweighted estimat or produce s biased estimates for the intercept and school - level variance, while the estimates for fixed effects and student - level var ia nce are nearly unbiased within 10% of the true value in terms of Muth n and Muth n (2002). Generally speaking, the MP ML estimators h a ve higher coverage rates than the unweighted estimator in the informative condition. I ncluding sampling weights increase s MSE substantially and produces biased estimates for the intercept and school - level variance in the informative samplin g design. Furth ermore, ignoring informative sampling design could produce biased estimates. Pfeffermann et al . (1998) pointed out that the unweighted method only produced biased est i m a tes for the intercept and school - level variance, not for student - level v ar iance when th e design is informative at school - level variance. Prior studies (e.g., Asparouhov & Muth n, 2006 ; Kova evi & Rai , 2003) show that as the ICC increase s the bias decrease s for all the parameters using an unconditional model. Asparouhov a nd Muth n (2007) also found that the MPML estimator outperforms substantially the other estimators. The plan of this study is as follows. Chapter 2 di s cusse s theoretical ba ckground and reviews the related literature . We briefly review multistage design and general multilevel models . Pseudo maximum likelihood estimation (MPML) method is presented , followed by two scaling methods. Intraclass correlation coefficient (ICC) an d i nformativeness are also describ ed in this section. 
In Chapter 3, I introduce the empirical data set used in this study, the ECLS-K:2011, and the simulation procedures for the present study. Chapter 4 presents the results of the empirical data analysis and the simulation analysis. Chapter 5 provides a discussion of the overall findings, limitations, and topics for future research.

CHAPTER 2 THEORETICAL BACKGROUND AND LITERATURE REVIEW

2.1 Research Goal

Using empirical and simulated data, the present study focuses on examining the performance of MPML in the context of a two-stage sampling design with unequal probabilities of selection. Since MPML is newly developed compared with PML, far fewer studies have examined it, and no consensus has been reached on which of the existing weighted multilevel estimators performs best and under which conditions. MPML is considered the most flexible and popular method for multilevel data when both the consistency of the estimates and the computational burden are taken into account. But it is also clear that weighted estimators produce larger standard errors than unweighted methods do, so it remains controversial whether to weight or not, and more studies are needed to compare these estimators. Furthermore, the previous literature on the effect of the scaling used in multilevel estimation is inconclusive. Lastly, to my knowledge, except for one study (cf. Koziol et al., 2017), all previous simulation studies manipulating ICC values use only an unconditional random-intercept model. Therefore, the main goal of this study is to examine the impact of sampling weights and to evaluate the performance of the MPML methods with different scaling techniques in the context of two-stage informative and non-informative sampling designs with unequal probabilities of selection, across different values of ICC, using a random-intercept model with covariates at both levels.
Monte Carlo simulation methods are used to evaluate several factors: (a) informativeness of the sample design (non-informative vs. informative at both stages); (b) ICC, with five different values; and (c) estimation method (unweighted, raw/unscaled weights, cluster scaling, effective scaling). All the factors are fully crossed, which gives rise to 2 × 5 × 4 = 40 combinations of conditions.

This study makes several contributions to the complex survey data literature. First, it provides a comparison between unweighted and weighted multilevel approaches in the context of unequal probabilities of selection. Second, it compares estimation methods between informative and non-informative sampling designs. Third, it compares estimation methods under different levels of ICC. In order to fill the gaps in the current body of literature, the following research questions are addressed:

1. How do MPML estimators differ from the unweighted estimator in multilevel models in the informative and non-informative sampling designs in terms of relative bias, root mean square error, and 95% confidence interval coverage rate?

2. How does the intraclass correlation influence the performance of the estimators under the informative and non-informative conditions in terms of relative bias, root mean square error, and 95% confidence interval coverage rate?

Large-scale surveys in the social sciences usually use complex sampling designs based on the characteristics of the population to glean information that addresses various research questions. This feature brings challenges to the analysis. This chapter covers several topics that are central to understanding weighted multilevel analysis of survey data.

2.2 Multistage Sampling

Multistage designs are commonly used in many practical settings.
For two-stage sampling in an educational setting, for example, clusters or PSUs such as schools are selected in the first stage. In the second stage, individual units, such as students, are sampled from the selected clusters. Each sampling stage corresponds to a level of the multilevel model: the second stage corresponds to Level 1 and the first stage to Level 2. At the first stage, cluster j is sampled with probability π_j, j = 1, ..., m, where m is the number of clusters to be sampled from the total number of clusters in the population, M. At the second stage, individual i is sampled from the cluster j selected at the first stage with conditional probability π_{i|j}, i = 1, ..., n_j, where n_j is the cluster sample size. Usually, clusters are sampled with probabilities proportional to their sizes, that is, to the number N_j of individual units in cluster j:

π_j = m N_j / N, (2.1)

and the weight at the cluster level is the inverse of that probability, w_j = 1/π_j. Each unit i is sampled from cluster j with conditional probability (assuming that an equal number of units is sampled from each cluster)

π_{i|j} = n_j / N_j, (2.2)

and the weight for individual i given cluster j is the inverse of the conditional probability, w_{i|j} = 1/π_{i|j}. The unconditional probability is then defined as

π_{ij} = π_j π_{i|j} = m n_j / N. (2.3)

2.3 Multilevel Model

A typical two-level linear model can be specified with two equations. The first equation describes the relationship between the dependent variable and the covariates at the student level, within each group. Some or all of the parameters of the student-level equation are viewed as varying randomly across the groups. The second equation, the school-level equation, defines these parameters as dependent variables with school-level variables as covariates. Combining the two, a two-level linear mixed model can be specified in matrix-vector form, following Laird and Ware (1982), as

y_j = X_j β + Z_j u_j + e_j. (2.4)

In the above equation, j indexes the cluster, j = 1, ..., m, where m is the number of clusters. For the cluster with size n_j, y_j is an n_j × 1 vector of observed responses, X_j is an n_j × p observed matrix for the fixed effects, β is a p × 1 vector of unknown coefficients, Z_j denotes an n_j × q random-effect design matrix, u_j is a q × 1 vector of cluster-specific random effects, and e_j is an n_j × 1 vector of random residual errors, where p is the number of unknown coefficients including the intercept and q is the number of random effects. Since a random intercept model is used in the current study, q equals 1. Either full maximum likelihood (ML/FIML) or restricted maximum likelihood (REML) estimation is typically used to estimate the unknown parameters of a general linear mixed model, such as the fixed regression coefficients and the variance components. Searle, Casella, and McCulloch (1992) define the likelihood function for a linear mixed model as

L(β, θ | y) = ∏_j (2π)^(−n_j/2) |V_j|^(−1/2) exp{−(1/2)(y_j − X_j β)′ V_j^(−1) (y_j − X_j β)}, (2.5)

where V_j is the covariance matrix of the vector y_j, V_j = Z_j G Z_j′ + σ² I, G denotes the covariance matrix of the random-effect vector u_j (in our case a scalar, τ), and σ² is the variance of the error term. For computational convenience, the log-likelihood is used more often than the likelihood itself. It is specified as

ℓ(β, θ | y) = −(n/2) log(2π) − (1/2) Σ_j log|V_j| − (1/2) Σ_j (y_j − X_j β)′ V_j^(−1) (y_j − X_j β), (2.6)

where n is the total number of observations, n = Σ_j n_j.

2.4 Multilevel Pseudo-Maximum Likelihood (MPML) Estimation Methods

To achieve valid inference for the population, sampling weights must be used at all levels of the data. However, the literature does not clearly describe when and how to use sampling weights properly in multilevel models. Using single-level weights in place of multilevel weights is not always appropriate, for the following reasons. First, in a single-level regression, sampling weights enter the sums of squares and cross-products, and final-level weights are the product of the multilevel weights.
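The two-stage probabilities and weights of equations (2.1)-(2.3) can be sketched in code. The population sizes below are hypothetical illustration values, not ECLS-K figures; NumPy is assumed.

```python
import numpy as np

# Hypothetical two-stage population: M = 5 clusters with sizes N_j.
N_j = np.array([100, 200, 300, 150, 250])
N = N_j.sum()        # total population size (1000)
m = 2                # clusters sampled at stage 1
n_j = 10             # students sampled per selected cluster (equal takes)

# Stage 1, PPS inclusion probability (eq 2.1): pi_j = m * N_j / N
pi_j = m * N_j / N
w_j = 1 / pi_j                    # cluster-level weight

# Stage 2, conditional probability (eq 2.2): pi_{i|j} = n_j / N_j
pi_i_given_j = n_j / N_j
w_i_given_j = 1 / pi_i_given_j    # within-cluster weight

# Unconditional probability (eq 2.3): pi_ij = pi_j * pi_{i|j}
pi_ij = pi_j * pi_i_given_j

# The single final-level weight is the product of the level weights,
# but the product alone cannot be decomposed back into its components.
w_ij = w_j * w_i_given_j

print(pi_ij)   # PPS with equal takes is self-weighting: pi_ij = m*n_j/N
print(w_ij)
```

With PPS at stage 1 and equal takes at stage 2, every student has the same unconditional probability m·n_j/N = 0.02, yet the two level-specific weights differ across clusters; that level-specific information is exactly what a multilevel estimator needs and what a single final-level weight loses.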
Based on Christ, Biemer, and Wiesen (2007), using final-level weights may lead to biased estimates in multilevel models. Second, Pfeffermann et al. (1998) noted that single final-level weights, or overall inclusion probabilities, may not contain sufficient information to correct for unequal sampling probabilities at the higher levels, because units at either level can be selected with differential probabilities. Therefore, multilevel weights need to be used in multilevel models. We use the sample data and the sampling weights to estimate the unknown parameters by maximizing the weighted sample likelihood. Researchers have explored different estimation methods that incorporate sampling weights for complex surveys, such as multilevel pseudo maximum likelihood (MPML) (Asparouhov, 2004, 2006; Grilli & Pratesi, 2004; Rabe-Hesketh & Skrondal, 2006), probability-weighted iterative generalized least squares (PWIGLS) (Pfeffermann et al., 1998), sample distribution methods (Eideh & Nathan, 2009; Pfeffermann, Moura, & Silva, 2006), weighted composite likelihood (WCL) estimation (Rao, Verret, & Hidiroglou, 2013), and pseudo empirical likelihoods (Chaudhuri, Handcock, & Rendall, 2010; Chen & Sitter, 1999; Francisco & Fuller, 1991; Fuller, 1984; Lin, Steel, & Chambers, 2004; Rao & Wu, 2010; Scott & Holt, 1982). As Asparouhov and Muthén (2006) stated, there is no single best estimation method for multilevel models when sampling weights are used. MPML and PWIGLS are the two most widely used estimation methods for multilevel models incorporating sampling weights. Compared with PWIGLS, MPML is more flexible and more widely applied from the perspective of software implementation. Currently, MPML has been implemented in Stata, Mplus, and SAS, while PWIGLS has been used in LISREL, HLM, and MLwiN. Different software packages can generate different output (Chantala, Blanchette, & Suchindran, 2011; Chantala & Suchindran, 2006).
The application of MPML, compared with PWIGLS, is less computationally intensive and much more flexible (Kovačević & Rai, 2003; Rabe-Hesketh & Skrondal, 2006). Besides, MPML can be applied to any general multilevel model (Rabe-Hesketh & Skrondal, 2006), just as the PML method can be used in any single-level model. A third advantage is that MPML is versatile and can be modified to address different estimation issues (Asparouhov, 2004; Asparouhov & Muthén, 2006). In addition, MPML can account for stratification and extra non-substantive clustering levels in the estimation of standard errors without having to incorporate such design features into the parameterization of the model (Asparouhov & Muthén, 2006; Koziol et al., 2017; Rabe-Hesketh & Skrondal, 2006). Because of these advantages, only MPML, with different scaling techniques, is considered in the present study. Let θ = (θ_1, θ_2) be the parameters. The likelihood function for a general two-level model can be expressed as

L(θ) = ∏_j ∫ [∏_i f(y_ij | u_j, x_ij; θ_1)] g(u_j | x_j; θ_2) du_j, (2.7)

where y_ij is the response of individual i in cluster j and u_j is the cluster-specific random effect; x_ij are the student-level covariates and x_j the cluster-level covariates; f is the density function of y_ij and g the density function of u_j; and θ_1 and θ_2 are the parameters to be estimated for the student level and the school level, respectively. When weighting is incorporated into the analysis, with scaling procedures also applied in order to reduce the bias arising from unequal probabilities of selection in complex survey data, the population log-likelihood is estimated directly by weighting the sample log-likelihood,

log L_w(θ) = Σ_j s_2 w_j log ∫ exp{Σ_i s_1 w_{i|j} log f(y_ij | u_j, x_ij; θ_1)} g(u_j | x_j; θ_2) du_j, (2.8)

where w_{i|j} = 1/π_{i|j} is the student-level weight, with π_{i|j} the conditional inclusion probability of the ith unit in the jth cluster given that the jth cluster is sampled; w_j = 1/π_j is the school-level weight, with π_j the inclusion probability of the jth cluster; and s_2 and s_1 are the scaling factors for the school-level and individual-level sampling weights, respectively. Numerical techniques are needed to integrate out the unobserved school-level random effect to approximate the weighted likelihood. A sandwich variance estimator is employed to obtain standard errors because such estimators are robust to nonnormality and heterogeneity (e.g., Huber, 1967; White, 1980). The asymptotic covariance matrix of the parameter estimates under this method is

Var(θ̂) = [ℓ″(θ̂)]⁻¹ Var[ℓ′(θ̂)] [ℓ″(θ̂)]⁻¹, (2.9)

where ℓ′ and ℓ″ refer to the first and second derivatives of the log-likelihood with respect to the parameters. Mplus (Muthén & Muthén, 1998-2017) implements this method using a robust variance estimator of the form

[Σ_j ℓ_j″(θ̂)]⁻¹ [Σ_j ℓ_j′(θ̂) ℓ_j′(θ̂)′] [Σ_j ℓ_j″(θ̂)]⁻¹. (2.10)

2.5 Scaling Sampling Weights for Multilevel Models

In the multilevel weighted estimation literature, one of the main problems is that the parameter estimates are usually only approximately unbiased. Many factors substantially influence the quality of the estimation, such as cluster sample size, informativeness of selection, variability of the sampling weights, intraclass correlation, and scaling method (Asparouhov, 2006; Asparouhov & Muthén, 2006; Bertolet, 2008; Cai, 2013; Grilli & Pratesi, 2004; Jia, Stokes, Harris, & Wang, 2011; Kovačević & Rai, 2003; Pfeffermann et al., 1998; Rabe-Hesketh & Skrondal, 2006). For instance, parameter estimation can be severely biased when the cluster sample size is not sufficiently large (Asparouhov, 2006; Rabe-Hesketh & Skrondal, 2006). In order to correct for this, two scaling methods were proposed by Pfeffermann et al. (1998).
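The two scaling methods of Pfeffermann et al. (1998), defined as equations (2.11) and (2.12) below, can be sketched in code. The within-cluster weight values here are hypothetical; NumPy is assumed.

```python
import numpy as np

# Hypothetical within-cluster (level-1) weights for one sampled cluster.
w = np.array([2.0, 2.0, 4.0, 8.0])
n_j = len(w)

# Method 1 / effective cluster scaling (ES), eq (2.11):
# scaled weights sum to the "effective" cluster size, (sum w)^2 / sum(w^2).
lam_es = w.sum() / (w ** 2).sum()
w_es = lam_es * w

# Method 2 / cluster scaling (CS), eq (2.12):
# scaled weights sum to the actual cluster sample size n_j.
lam_cs = n_j / w.sum()
w_cs = lam_cs * w

print(w_es.sum())   # effective cluster size, 16^2 / 88 ~ 2.909
print(w_cs.sum())   # 4.0, the actual cluster sample size
```

Both methods leave the relative sizes of the weights within a cluster untouched; they only change the total to which the weights sum, which is what drives the small-cluster bias of the variance-component estimates.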
The scaling method indicates how the weights are normalized at each level (Asparouhov, 2006). The first method, which assumes the individual-level weights are approximately non-informative, may produce approximately unbiased estimators for both variance components. This approach chooses a scaling factor such that the scaled individual-level weights sum to the effective cluster sample size (Longford, 1995, 1996; Pfeffermann et al., 1998). The scaling factor, referred to as s_1, is specified as

s_1 = Σ_i w_{i|j} / Σ_i w_{i|j}². (2.11)

Method 2 in Pfeffermann et al. (1998) is used when both levels of the sampling design are assumed to be informative. Its scaling factor is defined as

s_1 = n_j / Σ_i w_{i|j}, (2.12)

where n_j is the number of sample units in the jth cluster. This scaling factor is set so that the scaled individual-level weights sum to the actual cluster sample size. These two scaling methods are termed effective cluster scaling (ES) and cluster scaling (CS), respectively, in the current study. Currently, there is no consensus about which scaling method works better and under what conditions. For example, Pfeffermann et al. (1998) found that Method 2 (cluster scaling) works better at reducing bias in simulations under an informative sampling design, while Stapleton (2002) found that Method 1 (effective cluster scaling) produces unbiased estimates in multilevel SEM analysis. Asparouhov (2006) noted that different scaling methods may have different effects on different estimation techniques: a scaling method that performs well with the MPML approach does not necessarily perform well with other techniques such as PWIGLS. Sometimes the choice of scaling method depends on the purpose of the research: if the main interest is in point estimates, the cluster scaling method is recommended; if cluster variance estimates are of primary interest, the effective scaling method might be used (Asparouhov, 2006; Carle, 2009).

2.6 Intraclass Correlation Coefficient (ICC)

Besides cluster sample size, informativeness of selection, variability of the sampling weights, and scaling method, the ICC also affects estimation quality (Asparouhov, 2006; Asparouhov & Muthén, 2006; Bertolet, 2008; Cai, 2013; Grilli & Pratesi, 2004; Jia et al., 2011; Kovačević & Rai, 2003; Pfeffermann et al., 1998; Rabe-Hesketh & Skrondal, 2006). Prior simulation studies that manipulated the ICC using random intercept models without covariates at either level have found that the larger the ICC, the less biased the estimates (Asparouhov, 2006; Jia et al., 2011; Kovačević & Rai, 2003). The ICC is one of the factors examined in this study. It can be used in model construction because it helps to determine which predictors are most important for accounting for the outcome variable (Raudenbush & Bryk, 2002). It is also used as an index for including a cluster level in multilevel modeling when the ICC is not close to zero. Larger ICC values usually represent larger cluster-level variation, indicating that a larger proportion of the total variance in the response variable is accounted for by the clustering, and thus a larger clustering effect. In addition, the ICC value is informative for planning group-randomized experiments in education (Hedges & Hedberg, 2007, 2013). To estimate the ICC for a given outcome, y, a multilevel model is fit for the ith student in the jth school,

y_ij = γ_00 + u_j + e_ij, (2.13)

and the REML estimates of the variance of u_j (labeled τ), which is the variation between schools, and of the variance of e_ij (labeled σ²), which represents variation at the student level, are used to compute the ICC. The ICC estimate, ρ̂, is then defined as

ρ̂ = τ̂ / (τ̂ + σ̂²), (2.14)

which is the proportion of the total variability in scores that is due to school-to-school differences. Moreover, the ICC is used to calculate the design effect, which shows how much standard errors are underestimated.
The design effect is defined as follows:

Design effect = 1 + (average cluster size − 1) × ICC. (2.15)

Based on Kish (1965), a design effect greater than 2 indicates that the clustering in the data needs to be taken into account during estimation.

2.7 Informativeness of Selection

The informativeness of selection, according to Asparouhov (2006), indicates how biased the selection is. If the sampling design is informative, the inclusion probabilities are related to the response variable after conditioning on the variables in the model (Fuller, 2009; Grilli & Pratesi, 2004); otherwise, it is non-informative. Pfeffermann (1993) and Cai (2013) pointed out that informative weights are quite influential on the results and should therefore be considered in the multilevel analysis. However, if the sampling design or weights are not informative, the effect of the weights may be negligible and it is not necessary to include them in the analysis. It is therefore necessary to check whether the sampling design/weights are informative. Following Laukaityte and Wiberg (2018), weights are informative if the effective sample size is smaller than the actual sample size. The effective sample size for two-level models can be defined as follows. The effective sample size at level 2 (between schools) is calculated as

n_eff(2) = (Σ_j w_j)² / Σ_j w_j², (2.16)

and the effective sample size at level 1 (within school j) is obtained by

n_eff(1, j) = (Σ_i w_{i|j})² / Σ_i w_{i|j}². (2.17)

Pfeffermann (1993) developed a test to evaluate whether the sampling design is informative. The informativeness of the sampling design is examined with the statistic

I = (β̂_w − β̂_u)′ (V̂_w − V̂_u)⁻¹ (β̂_w − β̂_u), (2.18)

where β̂_w and β̂_u are the estimates from the weighted and unweighted analyses, respectively, and V̂_w and V̂_u are their variance estimates. The informativeness statistic follows a χ² distribution with p = dim(β) degrees of freedom.
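The effective sample size of equation (2.16) and the design effect of equation (2.15) can be sketched as follows; the level-2 weights, the ICC value, and the average cluster size of 19 are assumed illustration values, and NumPy is assumed.

```python
import numpy as np

# Hypothetical level-2 (school) weights for four sampled schools.
w2 = np.array([10.0, 10.0, 20.0, 40.0])

# Effective sample size at level 2, eq (2.16): (sum w)^2 / sum(w^2).
n_eff_L2 = w2.sum() ** 2 / (w2 ** 2).sum()
# Heuristically, weights are informative when n_eff < the actual n.
print(n_eff_L2, len(w2))

# Design effect, eq (2.15): 1 + (average cluster size - 1) * ICC.
icc = 0.2
avg_cluster_size = 19   # assumed average take per cluster
deff = 1 + (avg_cluster_size - 1) * icc
print(deff)             # greater than 2, so per Kish (1965) the
                        # clustering must be accounted for
```

Note that equal weights make n_eff equal the actual sample size; the more variable the weights, the smaller the effective sample size relative to n.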
CHAPTER 3
METHODS

Two primary sections are included in this chapter: one introduces the methods for the empirical data; the other introduces the simulation design.

3.1 Empirical Data

3.1.1 Data and Variables

This study uses data from the public-use Early Childhood Longitudinal Study, Kindergarten Class of 2010-2011 (ECLS-K:2011; see Mulligan, Hastedt, & McCarroll, 2012, for an overview), which is sponsored by the National Center for Education Statistics (NCES). It is the latest early childhood longitudinal study, following a nationally representative U.S. sample of students from kindergarten entry in 2010-11 through the spring of 2016 (fifth grade). ECLS-K:2011 provides descriptive information collected on children's family, classroom, and school environments, along with individual-level variables for studying how cognitive, social, and emotional development relates to them. The ECLS-K:2011 data are not a simple random sample of individuals or clusters. The study employed a three-stage cluster sampling design. At stage 1, 90 geographic areas (counties or groups of counties) were sampled as the primary sampling units (PSUs). At stage 2, samples of public and private schools were selected from the selected PSUs. At stage 3, five-year-old children were randomly sampled within the selected schools. Stratification and probability-proportional-to-size sampling were used at the first two stages of selection; stratification and unequal-probability sampling were used at the final stage. In the base year, Asian, Native Hawaiian, and other Pacific Islander children were oversampled. The ECLS-K:2011 kindergarten data file and electronic codebook, public version (Tourangeau et al., 2015), offers an excellent overview of the characteristics of complex sample designs, including clustering, stratification, unequal probabilities of selection, non-response, and poststratification.
The analytic samples in this paper include only kindergarteners with data collected in both the fall and the spring semesters. Approximately 18,200 children enrolled in about 970 schools during the 2010-11 school year participated during their kindergarten year. Although the use of sampling weights increases variance because of the unequal inclusion probabilities, it is still required and necessary: it prevents biased parameter estimates under informative sampling in multilevel models (Pfeffermann et al., 1998; Kim & Skinner, 2013), protects against misspecification, and makes full use of population-level information (Kim & Skinner, 2013). The supplied sampling weights, which adjust for school-level nonresponse and are inverses of the estimated student-level response probabilities, are used. Weights for the first sampling stage are not available. At the student level, I use composite variables based on the parent survey as the primary independent variables of interest, and I control for the student's fall test score in order to predict the spring score. Because the parent survey is a primary component of the analysis, the child base weight adjusted for non-response associated with either the fall or spring kindergarten parent interview (W1_2P0) is a good choice of weight. For the school level, the school base weight adjusted for non-response associated with the school administrator questionnaire (W2SCH0) is used. The academic outcome variables in this study are reading and mathematics scale scores calibrated using item response theory (IRT) procedures. The reading assessment (Mulligan et al., 2012) measures basic skills (print familiarity, letter recognition, beginning and ending sounds, rhyming words, word recognition), vocabulary knowledge, and reading comprehension.
Reading comprehension consists of questions that identify specific information in the text, make complex inferences within and across texts, and consider the text objectively to judge its appropriateness and quality. The mathematics assessment measures skills in conceptual knowledge, procedural knowledge, and problem solving. Construct validity has been established for the ECLS-K:2011 assessments: national and state performance standards in each of the domains were examined, and specifications for reading and mathematics were established based on the NAEP framework. Furthermore, curriculum specialists in the subject areas were recruited, and the pool of items created was examined for content and framework strand design, accuracy, lack of ambiguity in the response options, and appropriate formatting. The reliability of the reading score is 0.95 for both fall and spring kindergarten; the reliability of the mathematics score is 0.92 for fall kindergarten and 0.94 for spring. The kindergarten mean score was 61.26 (SD = 13.56). To model mathematics and reading achievement, we use three student-level covariates and two school-level covariates. Descriptive statistics of these variables are presented in Table 3.1.

Table 3.1. ECLS-K:2011 Variable Descriptive Statistics
Note: SD = standard deviation; MIN = minimum; MAX = maximum.

3.1.2 Statistical Models

Multilevel models can be used to estimate the unexplained variance in the outcomes of interest among randomly sampled clusters (e.g., schools), as well as the effects of covariates at each level. Researchers can use models with random intercepts to account for the correlations within clusters caused by longitudinal or clustered designs (West et al., 2015). In a survey with multistage samples there are always various levels of clustering, but usually only the lowest level of clustering has the greatest impact on individual outcomes (Asparouhov & Muthén, 2006).
Furthermore, Stapleton and Kang (2016) found that disregarding a first-stage sampling design beyond the levels included in the model has only minor impacts on inference, with no detectable difference. For large-scale data sets such as ECLS-K:2011, the first-stage weights are usually not provided; hence the first-stage sampling design is not considered in this study. Therefore, for simplicity, two-level random intercept regression models, in which individual students are nested in schools, are fit to the two academic dependent variables, the reading and mathematics IRT scale scores. IRT measurement error is not taken into account in the analysis. Three different two-level models with different sets of covariates are examined. Model 1 is an unconditional model without covariates at either level, Model 2 includes all the student-level predictors, and Model 3 is a full model consisting of all the student-level and school-level predictors.

Model 1: unconditional model
Level 1: y_ij = β_0j + e_ij (3.1)
Level 2: β_0j = γ_00 + u_0j (3.2)
Combined: y_ij = γ_00 + u_0j + e_ij (3.3)

Model 2: student model with three student-level predictors
Level 1: y_ij = β_0j + β_1j·Female_ij + β_2j·SES_ij + β_3j·Pretest_ij + e_ij (3.4)
Level 2: β_0j = γ_00 + u_0j (3.5)
Combined: y_ij = γ_00 + γ_10·Female_ij + γ_20·SES_ij + γ_30·Pretest_ij + u_0j + e_ij (3.6)

Model 3: full model including covariates at both levels
Level 1: y_ij = β_0j + β_1j·Female_ij + β_2j·SES_ij + β_3j·Pretest_ij + e_ij (3.7)
Level 2: β_0j = γ_00 + γ_01·Rural_j + γ_02·Suburban_j + u_0j (3.8)
Combined: y_ij = γ_00 + γ_10·Female_ij + γ_20·SES_ij + γ_30·Pretest_ij + γ_01·Rural_j + γ_02·Suburban_j + u_0j + e_ij (3.9)

Since many factors affect the quality of estimation under complex sampling designs, it is worthwhile to investigate both unweighted and weighted models.
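As an illustrative sketch, the unconditional model (3.1)-(3.3) can be simulated and its variance components recovered. The sizes and parameter values below are hypothetical, and simple one-way ANOVA (method-of-moments) estimators stand in for the REML estimates used in the study; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate the unconditional model: y_ij = gamma00 + u_j + e_ij.
# Hypothetical sizes and parameters; ICC = 12 / (12 + 48) = 0.2.
J, n = 200, 19
gamma00, tau, sigma2 = 17.0, 12.0, 48.0
u = rng.normal(0.0, np.sqrt(tau), J)                       # school effects
y = gamma00 + u[:, None] + rng.normal(0.0, np.sqrt(sigma2), (J, n))

# One-way ANOVA (method-of-moments) variance-component estimates.
ybar_j = y.mean(axis=1)
sigma2_hat = ((y - ybar_j[:, None]) ** 2).sum() / (J * (n - 1))  # within
tau_hat = ybar_j.var(ddof=1) - sigma2_hat / n                    # between
icc_hat = tau_hat / (tau_hat + sigma2_hat)                       # eq (2.14)
print(sigma2_hat, tau_hat, icc_hat)   # close to 48, 12, and 0.2
```

The between-school variance of the school means overstates τ by σ²/n, which is why the within-school variance estimate is subtracted off before forming the ICC.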
In this study, all three multilevel models above are explored using the following four estimation methods: (a) maximum likelihood with no weights (UW); (b) MPML using raw/unscaled weights (RW); (c) MPML using cluster scaling (CS); and (d) MPML using effective cluster scaling (ES). Missing data at level 1 range from 0.2% for female to 14.2% for the math pretest. Listwise deletion is used to handle level-1 missing data in the empirical study. Multiple imputation could be used here, but the exact models for the real data are of secondary importance, so listwise deletion is used to simplify the problem. Missing data at level 2 amount to 3.6% for rural and suburban. Level-2 missing values cannot simply be removed because they affect the lower level. Schafer and Graham (2002) noted that if the probabilities of missingness depend only on observed items, the missing data can be assumed to be missing at random (MAR hereafter). Therefore, I assume the missingness at level 2 is MAR. Two methods are recommended for handling MAR data: multiple imputation (Rubin, 1987; Enders, 2010; Howell, 2008) and full-information maximum likelihood (FIML) (Danielsen, Wiium, Wilhelmsen, & Wold, 2010; Enders, 2010; Laukaityte & Wiberg, 2018). I use FIML to handle the level-2 missing data in this study.

3.2 Simulations

3.2.1 Simulation Design

The informativeness of the sampling design (Asparouhov, 2006; Cai, 2013) and the intraclass correlation (Asparouhov, 2006; Jia et al., 2011; Kovačević & Rai, 2003) have been found to influence the performance of weighted estimation in multilevel models. Monte Carlo simulation methods are applied to evaluate the effect of the ICC and to examine the performance of MPML with different scaling techniques in the context of two-stage informative and non-informative sampling designs (see Table 3.2). All the conditions are fully crossed.
The full study design results in a total of 2 × 5 × 4 = 40 simulation conditions.

Table 3.2. Simulation Design
Note: UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling.

Five different ICC values are used in this simulation: 0.5, 0.3, 0.2, 0.1, and 0.01. The unconditional ICCs typically found in educational and psychological research in the United States are in the range of 0.15 to 0.25 for large-scale academic assessments (Bloom, Bos, & Lee, 1999; Bloom, Richburg-Hayes, & Black, 2007; Hedges & Hedberg, 2007, 2013; Kreft & Yoon, 1994; Schochet, 2008). Accordingly, the values 0.1, 0.2, and 0.3 are chosen for this study. The lowest ICC value found in Hedges and Hedberg (2013) is 0.02, for students nested in grades within each state. Raykov (2015) showed that the lower bound of a 95% confidence interval for the ICC can be as low as 0.014. Murray and Short (1995) found that in school-based intervention designs, ICC values were generally smaller, in the range of 0.01 to 0.05. The current study also considers that students may be nested in school districts, or even larger geographic areas, which may result in a lower ICC value. Therefore 0.01, a very small non-zero value, is chosen as well, because even a small ICC affects the estimates of standard errors if the dependency is ignored; Musca et al. (2011) showed that a small ICC can inflate the Type I error rate dramatically. Different values of τ and σ² are used while the total variance of y is kept fixed, τ + σ² = 60. This value is determined from the empirical data results (see Table 3.5). The five ICC values 0.5, 0.3, 0.2, 0.1, and 0.01 are obtained by setting τ to 30, 18, 12, 6, and 0.6, respectively, with σ² = 60 − τ, i.e., 30, 42, 48, 54, and 59.4, correspondingly.
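The fully crossed design and the mapping from ICC to variance components (with the total variance fixed at 60) can be enumerated as follows; the tuple layout is an illustrative choice.

```python
from itertools import product

# The fully crossed simulation design: 2 designs x 5 ICCs x 4 estimators.
designs = ["informative", "non-informative"]
iccs = [0.5, 0.3, 0.2, 0.1, 0.01]
estimators = ["UW", "RW", "CS", "ES"]

# Total variance fixed at 60, so tau = ICC * 60 and sigma2 = 60 - tau.
conditions = [
    (design, icc, est, icc * 60, 60 - icc * 60)
    for design, icc, est in product(designs, iccs, estimators)
]

print(len(conditions))    # 2 x 5 x 4 = 40
print(conditions[0])      # ('informative', 0.5, 'UW', 30.0, 30.0)
```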
3.2.2 Model

To evaluate the performance of the MPML approach for a linear two-level regression model under the informative and non-informative sampling conditions, the Monte Carlo simulation mimics the sampling design of ECLS-K:2011. Specifically, about 18,200 kindergarteners from 970 schools were sampled, i.e., about 19 students on average from each school. Mulligan et al. (2012) indicated that the school and student selection probabilities (i.e., sampling rates) are 0.02 and 0.25, respectively, so the overall student selection probability is 0.02 × 0.25 = 0.005. The school population is categorized into six groups based on the percentages of public schools in ECLS-K:2011: 5.69% of schools have 16 to 24 students; 11.49% have 25 to 49; 43.53% have 50 to 99; 25.3% have 100 to 149; 8.59% have 150 to 199; and 5.22% have more than 200 students. Finally, 150 schools and 3,915 students are drawn from the population, matching the expected sampling rates for schools and students in ECLS-K:2011. The true parameter values are all obtained from the empirical data set ECLS-K:2011 using maximum likelihood estimation (see Table 3.5). Thus, the data are generated using the following model:

y_ij = 17.43 + 0.91·Female_ij + 1.06·SES_ij + 0.92·Pretest_ij + 1.04·Rural_j + u_j + e_ij, (3.10)

where u_j is the school-level random effect and e_ij is the student-level error term; u_j and e_ij are normally distributed with mean 0, with the variance of u_j set to τ = 30, 18, 12, 6, or 0.6 and the corresponding variance of e_ij set to 60 − τ. The explanatory variables (female, socioeconomic status (SES), pretest, rural, and suburban) were chosen because they contribute significantly to the model and are also variables of interest to other researchers (e.g., Hedberg, 2016; Hedges & Hedberg, 2007).
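As a sketch, a finite population can be generated from model (3.10). The school count follows the text (970), while the equal per-school size of 75 and the ICC = 0.2 condition are assumed for illustration; the covariate distributions follow the values reported in this chapter, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(2019)

# Hypothetical finite population generated from model (3.10).
N_schools, n_per_school = 970, 75
tau, sigma2 = 12.0, 48.0                       # the ICC = 0.2 condition

u = rng.normal(0.0, np.sqrt(tau), N_schools)   # school-level random effects
rural = rng.binomial(1, 0.22, N_schools)       # school-level covariate

school_scores = []
for j in range(N_schools):
    female = rng.binomial(1, 0.49, n_per_school)
    ses = rng.normal(-0.05, np.sqrt(0.66), n_per_school)
    pretest = rng.normal(46.92, np.sqrt(132.22), n_per_school)
    e = rng.normal(0.0, np.sqrt(sigma2), n_per_school)
    y = (17.43 + 0.91 * female + 1.06 * ses + 0.92 * pretest
         + 1.04 * rural[j] + u[j] + e)
    school_scores.append(y)

y_all = np.concatenate(school_scores)
print(y_all.shape)    # (72750,)
```

The dissertation itself generates the population in Stata (Appendices A and B); this Python sketch only illustrates the structure of the generating model.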
Female follows a Bernoulli distribution with probability 0.49. Socioeconomic status (SES) follows a normal distribution with mean −0.05 and variance 0.66 (SD = 0.81). The pretest score follows a normal distribution with mean 46.92 and variance 132.22 (SD = 11.50). Suburban follows a Bernoulli distribution with probability 0.36, and rural follows a Bernoulli distribution with probability 0.22.

3.2.3 Sampling Selection

A finite population is generated according to the model described above. The expected sampling rates used in this study are again 0.02 for schools and 0.25 for students, as in ECLS-K:2011, which results in an overall sampling rate of 0.005. The sampling selection depends on whether the sampling design is informative or non-informative. In order to introduce unequal-probability sampling at both levels and make the sampling design informative, the present study uses a plan similar to those of Asparouhov (2006), Cai (2013), and Koziol et al. (2017). Poisson sampling is used to select the jth school with probability

prob(I_j = 1) = 1 / (1 + exp(a_2 + u_j*/2)), (3.11)

where u_j* equals u_j (the random intercept effect for the jth cluster) but is rescaled to have a variance of 2. For a selected school, Poisson sampling is used to select the ith student within the jth school with probability
A variance of 2 f or both random variables and the slope coefficients (1/2) are selected to have approximately 0.3 of informativeness for both the school level and student level , which Asparouhov (2006) used as a moderate level of informativeness in his simulation s . The intercept values ( 4.12 and 1.23 for school level and student level, respectively) are determined using expected sampling rates (0.02 and 0.25 for the school level and the student level, respectively) and the formulas above (equation 3.11 and 3 .12 ) to obtain desired sample s izes. Under the non - informative sampling condition , and are replaced by other variables that are not part of the population model. Still Po i sson sampling is used to sel ect the j th school with probability prob ( I j = 1) = ( 3.13 ) where ~ N (0, 2 ) and is not related to any variables in the model. Conditional on the selected school, Poisson sampling is used to select the i th student in the j th school with probability of prob ( I i|j = 1) = . ( 3.14 ) 29 where ~ N (0, 2 ) and is not related to any variables in the model. Although th is design uses unequal probability of selection, it is not informative, because the selectio n probability is not related to the response variable. Data are generated using the software Stata. The syntax for data generation is provided in A PPENDIX A and A PPENDIX B . 3.2.4 M plus and D ata A nalysis Each simulation is replicated 10 00 times for each st udy condition. Each 1000 replications are analyzed in M plus Version 8 ( Muthén & Muthén , 1998 - 201 7 ) using the TYPE = MONTECARLO option under the M plus DATA command. The M plus Muthén & Muthén , 1998 - 2017) provides guidance on how to incorporat e sampling weights and how to use scaling methods in a two - level model. The two scaling methods that are used are referred to E CLUSTER and C LUSTER respectively in M plus documentation , which correspond to effective cluster scaling and clustering scaling respectively in this study. 
Altogether, four estimation methods are considered: (a) the unweighted estimation method (UW); (b) MPML using raw/unscaled weights (RW); (c) MPML using cluster-scaled (CS) weights; and (d) MPML using effective-cluster-scaled (ES) weights. Sandwich variance estimators (ESTIMATOR = MLR) are used in all instances. The TYPE option is set to TWOLEVEL, and the appropriate variables are identified with the CLUSTER, WEIGHT, and BWEIGHT options. For the MPML models, WTSCALE and BWTSCALE are also specified according to the scaling method: UNSCALED and UNSCALED for the raw-weight method, CLUSTER and SAMPLE for the cluster scaling method, and ECLUSTER and SAMPLE for the effective scaling method, respectively. For the general multilevel model that ignores weighting in the present study, WTSCALE and BWTSCALE are not used under the VARIABLE command.

3.2.5 Evaluation Criteria

Empirical (absolute) relative bias, root mean square error (RMSE), and 95% confidence interval coverage rate are used as the primary criteria for evaluating the performance of the estimators, as in previous simulation studies (e.g., Cai, 2013; Eideh & Nathan, 2009). In measurement or sampling situations, bias is the systematic difference between the expected value of the measurements or test results and the true value; the true value can be under- or overestimated. Since a large number of replications is used in this study, even small values of bias may be deemed significantly different from 0. As such, relative bias is used instead of bias. The relative bias is defined as

RBias(θ̂) = (1/R) Σ_r (θ̂_r − θ) / θ, (3.15)

where θ is the true value, θ̂_r is the estimate in replication r, and R is the number of replications. Muthén and Muthén (2002) note that if the absolute relative bias is less than 10% of the true value, the parameter estimates can be considered unbiased. A common accuracy measure, the mean square error (MSE), is the mean of the squared differences between the estimates and the true value. It indicates how close the estimate is to the true value.
It indicates how close the estimate is to the true value. This measure incorporates both bias and precision, because it equals the sum of the variance of the estimates and the squared bias. The root MSE (RMSE) tells us how far the estimate is from the true value on average; it is used because it penalizes large errors. It is computed with the formula

RMSE(θ̂) = sqrt( (1/R) Σ_{r=1}^{R} (θ̂_r − θ)^2 ),   (3.16)

where θ̂_r is the estimate in replication r and θ is the true value. The smaller the RMSE, the better the estimate. The coverage rate/probability (CR) in this study is set at 95%. It evaluates the proportion of replications in which the interval estimator contains the population parameter value (Muthén & Muthén, 1998-2017). Muthén and Muthén (2002) recommend that the coverage rate be at least 0.91; that is, at least 91% of replications should have the true parameter value within the 95% confidence interval. The Mplus syntax for the analysis is provided in APPENDIX C.

CHAPTER 4
RESULTS

This chapter consists of two primary sections: one for the simulation results and the other for the empirical study results.

4.1 Simulation Results

The primary evaluation criteria are (absolute) relative bias, root mean square error (RMSE), and the coverage rate of the interval estimators. Simulation results are depicted in Tables 4.1-4.6 and Figures 4.1-4.16. Tables 4.1-4.2 present the Monte Carlo estimates of relative bias, RMSE, and 95% confidence interval coverage rate for the fixed effects, intercept, and variance components in the informative condition; Tables 4.3-4.4 present those for the non-informative condition. Tables 4.5-4.6 display the average standard errors of the estimates and the standard deviations in the informative and non-informative designs, respectively. Figures 4.1-4.2 and Figures 4.13-4.14 plot relative bias for the four covariates, intercept, and variance components in the informative condition, and Figures 4.3-4.
4 and Figures 4.15-4.16 for those in the non-informative condition. Dashed horizontal lines indicate the bounds for acceptable levels of relative bias (|RB%| ≤ 10; Muthén & Muthén, 2002). Figures 4.5-4.6 plot RMSE for the four covariates, intercept, and variance components in the informative design, and Figures 4.7-4.8 for those in the non-informative design. Figures 4.9-4.10 plot the coverage rate for the four covariates, intercept, and variance components in the informative design, and Figures 4.11-4.12 for those in the non-informative design. Dashed horizontal lines indicate the nominal coverage rate of 95%. Results are organized by research question and evaluation criterion. Under each evaluation criterion, the results are presented for the informative and non-informative conditions respectively.

4.1.1 Research Question One

Research question one allows me to evaluate the performance of the weighted and unweighted estimators under the informative and non-informative conditions in terms of (absolute) relative bias, RMSE, and 95% confidence interval coverage rate. Comparing the unweighted and weighted estimators gives a picture of whether differences among them are due to the application of sampling weights and which estimator performs best.

4.1.1.1 (Absolute) Relative Bias

In general, all the fixed effects are estimated with little bias in both the informative and non-informative conditions if the criterion of Muthén and Muthén (2002) is applied. However, a different story emerges for the intercept and variance component estimates. On average, the absolute relative bias is larger in magnitude under the informative condition than under the non-informative condition. The most variability in absolute relative bias occurs for the school-level variance estimators in both conditions.
4.1.1.1.1 Informative Design

From the simulation results presented in Table 4.1 and Figure 4.1, it is evident that the absolute relative bias estimates for the four fixed effects are all less than 10% of the true value across the four estimators and can be considered unbiased under the criterion of Muthén and Muthén (2002). The absolute relative biases for the three student-level covariates (i.e., female, SES, and pretest) are less than or close to 1%. Although the relative bias for the school-level covariate (i.e., rural) is higher than those of the student-level covariates, it is still within 10% of the true value. Table 4.2 and Figure 4.2 show that the intercept and student-level variance are estimated without bias (in terms of Muthén & Muthén, 2002) except for the intercept estimate in the unweighted case.

Table 4.1. RB(%), RMSE, 95% CI CR for Covariates in the Informative Design
Note: RB = relative bias; RMSE = root mean square error; CR = 95% confidence interval coverage rate; UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling.

Table 4.2. RB(%), RMSE, 95% CI CR for Intercept and Variance Components in the Informative Design
Note: RB = relative bias; RMSE = root mean square error; CR = 95% confidence interval coverage rate; UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling.

Figure 4.1. Relative bias (%) for covariates in the informative design
Figure 4.2. Relative bias (%) for intercept and variance components in the informative design

The three weighted estimators perform almost equally well, since all the relative biases of the intercept estimates they produce are less than 2%.
The unweighted estimator performs the worst, producing substantially larger relative bias than the weighted estimators do. As for the student-level variance, the absolute relative biases are all less than or close to 10%. Among the four estimators, the unweighted method produces larger absolute relative bias than the weighted methods do, and the cluster scaling method has the smallest values. Therefore, in terms of (absolute) relative bias, the cluster scaling method works best and the unweighted method works worst for the student-level variance. As for the school-level variance estimates, none of the four estimators performs well, and all have very large relative biases when the ICC is extremely small. To be specific, the relative bias is over 600% with the raw weighted method; even the best estimator, the unweighted one, has a relative bias of over 80%, far beyond the 10% standard used in the present study. In general, the raw weighted estimator performs the worst and the unweighted estimator performs the best for the school-level variance across all the ICC levels. In all, the weighted models perform quite similarly to each other and outperform the unweighted estimator for the intercept and student-level variance, while the unweighted model has smaller relative bias and outperforms the weighted estimators for the school-level variance. The intercept is always overestimated and the student-level variance is underestimated. The school-level variance is in most cases overestimated, except with the unweighted method and the effective scaling method when the ICC equals 0.5. The student-level variables female and SES are underestimated and pretest is overestimated. The school-level variable, rural, is overestimated in the weighted case and underestimated in the unweighted case.

4.1.1.1.2 Non-Informative Design

Table 4.3 and Figure 4.
3 show that the absolute relative biases of the four covariate estimates are all smaller than 10% in the non-informative condition, meaning that these four covariates can be considered unbiasedly estimated in terms of Muthén and Muthén (2002). The two continuous covariates also have smaller absolute relative biases than the two dichotomous covariates. At the same time, the unweighted method produces absolute relative bias for the four fixed effects that is lower than or equal to that of the three weighted estimators, so the unweighted estimator performs the best for all the fixed effects among the four estimators. The intercept is precisely estimated, since all the absolute relative biases are no more than 0.205 (see Table 4.4 and Figure 4.4). The unweighted method outperforms the other estimators when the ICC equals 0.01, 0.1, and 0.2, while it performs the worst when the ICC equals 0.5. Results also show that the student-level variance is estimated unbiasedly, since the absolute relative biases are all less than 5% across all the estimators. Among them, the raw weighted method has the largest relative bias, indicating it works the worst; the effective scaling and unweighted methods outperform the other two. As for the school-level variance, all four estimators produce substantially large relative bias when the ICC is extremely small, and none of them works well when the ICC is 0.01. Comparatively, the raw weighted method works the worst while the unweighted method performs the best across different levels of the ICC for the school-level variance estimates.

Table 4.3.
RB(%), RMSE, 95% CI CR for Covariates in the Non-Informative Design
Note: RB = relative bias; RMSE = root mean square error; CR = 95% confidence interval coverage rate; UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling.

Table 4.4. RB(%), RMSE, 95% CI CR for Intercept and Variance Components in the Non-Informative Design
Note: RB = relative bias; RMSE = root mean square error; CR = 95% confidence interval coverage rate; UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling.

Figure 4.3. Relative bias (%) for covariates in the non-informative design
Figure 4.4. Relative bias (%) for intercept and variance components in the non-informative design

4.1.1.2 RMSE

An overview of the RMSE of the fixed-effect point estimators and the intercept and variance component estimators across informativeness and ICC levels is provided in Tables 4.1-4.4 and Figures 4.5-4.8. There is not much difference in the RMSE for the fixed effects between the informative and non-informative conditions, although on average the RMSE is somewhat larger under the informative condition.

4.1.1.2.1 Informative Design

Compared with the weighted estimators, the unweighted estimator has smaller RMSE values for the four covariates under the informative condition (see Table 4.1 and Figure 4.5). The weighted estimates of the RMSE show almost the same patterns for the four covariates. The unweighted estimator performs the most efficiently among the four. As with the relative biases of the intercept and variance components, similar results are obtained for the RMSE.
For example, the unweighted method has a much larger RMSE for the intercept than the weighted estimators do, and the three weighted estimators perform very similarly to each other (see Table 4.2 and Figure 4.6). The unweighted estimator produces the largest RMSE for the student-level variance and performs the least efficiently among the four; the cluster scaling method performs the most efficiently. As for the school-level variance, the unweighted estimator has the smallest RMSE and performs the most efficiently among the four, while the raw weighted estimator is the least efficient. In all, in terms of RMSE in the informative design, the unweighted estimator performs the worst for the intercept and student-level variance estimates but the best for the school-level variance estimates.

Figure 4.5. RMSE for covariates in the informative design
Figure 4.6. RMSE for intercept and variance components in the informative design

4.1.1.2.2 Non-Informative Design

Table 4.3 and Figure 4.7 show that the unweighted method has smaller RMSE for the four covariates than the weighted methods do, and in most cases there is not much difference across the weighted methods at different levels of the ICC. Therefore, the unweighted method performs the best among the four estimators for all the fixed effects. The unweighted method also has the smallest RMSE for the intercept and the two variance components across all the conditions in the non-informative design (see Table 4.4 and Figure 4.8) and performs the most efficiently among the four estimators across all levels of the ICC, while the raw weighted method produces the largest RMSE for the intercept and the two variance component estimates.

Figure 4.7. RMSE for covariates in the non-informative design
Figure 4.8.
RMSE for intercept and variance components in the non-informative design

4.1.1.3 Coverage Rate

An overview of the coverage of the fixed-effect, intercept, and variance component estimators across informativeness and ICC levels is provided in Tables 4.1-4.4 and Figures 4.9-4.12. All the fixed effects are estimated without much bias (<10%) in both the informative and non-informative conditions if the criterion of Muthén and Muthén (2002) is applied; the corresponding coverage rates are good, with little difference among them. For the intercept and variance components, on average, the coverage rates are much lower under the informative condition than under the non-informative condition. Under the informative condition, the most variability in coverage occurs for the intercept estimators, whereas under the non-informative condition, it occurs for the school-level variance estimators.

4.1.1.3.1 Informative Design

Because the four covariates are estimated precisely or with only slight bias, their coverage rates are all above or close to 0.91, especially for the three level-one predictors (see Table 4.1 and Figure 4.9). Because the unweighted method produces substantially larger biases for the intercept and student-level variance estimates, it has very poor coverage rates for both (see Table 4.2 and Figure 4.10): 0 for the intercept and less than 3% for the student-level variance. The three weighted methods perform almost equally well, with coverage rates around or over 0.91 for the intercept. However, even the best student-level variance estimator, the cluster scaling estimator, has coverage rates of no more than 0.63. For the school-level variance estimates, the raw weighted method performs the worst while the unweighted estimator performs the best.

Figure 4.9.
Coverage rate for covariates in the informative design
Figure 4.10. Coverage rate for intercept and variance components in the informative design

4.1.1.3.2 Non-Informative Design

The coverage rates for the four covariate estimates in the non-informative condition are all above or close to 0.95. Among the four estimators, the unweighted method performs the best. The unweighted method also has the highest coverage rates for the intercept among the four estimators, all above or around 0.94. As for the student-level variance, the effective scaling method has the highest coverage rates, around 0.92, whereas the raw weighted method has the lowest, around 0.65; the unweighted estimator has coverage rates very similar to the effective scaling method. The coverage rates for the school-level variance with the unweighted method are the highest among all the estimators and are all larger than 0.93 except when the ICC is 0.01, while the raw weighted method has the smallest.

Figure 4.11. Coverage rate for covariates in the non-informative design
Figure 4.12. Coverage rate for intercept and variance components in the non-informative design

4.1.2 Research Question Two

Research question two addresses the effect of the ICC on the different estimation methods in the informative and non-informative designs.

4.1.2.1 (Absolute) Relative Bias

4.1.2.1.1 Informative Design

Table 4.1 and Figure 4.13 show that, as the ICC increases, the absolute relative biases for the two continuous covariates (i.e., SES and pretest) decrease. For the covariate female, there is no monotonic pattern in the relative bias: as the ICC increases, it first increases and then starts to decrease. For the covariate rural, the relative bias increases as the ICC increases in the weighted case, while the absolute relative bias decreases in the unweighted case.
Therefore, there is no overall consistent pattern for the fixed effects. It is evident (see Figure 4.14) that the absolute relative bias for the intercept estimate with the unweighted method increases as the ICC increases, but no consistent monotonic pattern can be found for the relative biases of the intercept estimates with the weighted methods, and they do not vary much across the weighted methods at different levels of the ICC (see Table 4.2). The absolute relative biases for the student-level variance estimates decrease as the ICC decreases with all four estimators, but the rate of decrease is very small and hard to see in Figure 4.14. There is an obvious increasing pattern in the relative bias of the school-level variance estimates as the ICC decreases with the four estimators (see Table 4.2 and Figure 4.14).

Figure 4.13. Relative bias (%) for covariates in the informative design
Figure 4.14. Relative bias (%) for intercept and variance components in the informative design

4.1.2.1.2 Non-Informative Design

Clear patterns can be found under the non-informative sampling design. Table 4.3 and Figure 4.15 indicate that, as the ICC increases, the absolute relative bias decreases for the three student-level covariates and increases for rural, the school-level covariate. Simulation results show that, as the ICC increases, the absolute relative bias for the intercept decreases with the three weighted methods, whereas it increases with the unweighted model (see Table 4.4 and Figure 4.16). The relative bias of the student-level variance decreases as the ICC decreases, but the rate of decrease is so small that similar patterns hold across the estimators at different ICC values. The relative bias for the school-level variance increases as the ICC decreases.

Figure 4.15. Relative bias (%) for covariates in the non-informative design
Figure 4.16.
Relative bias for intercept and variance components in the non-informative design

4.1.2.2 RMSE

4.1.2.2.1 Informative Design

Contrary to the relative bias, there are clear RMSE patterns for all the fixed effects (see Table 4.1 and Figure 4.5). As the ICC increases, the RMSE decreases for all the student-level fixed effects and increases for the school-level fixed effect with all the estimators. Table 4.2 and Figure 4.6 show that the RMSE for the intercept increases as the ICC increases. The rate of increase is quite obvious with the unweighted method, but so small with the three weighted methods that not much variation can be seen across ICC values. As for the variance components, there are clear patterns for both: as the ICC increases, the RMSE of the student-level variance decreases, whereas the RMSE of the school-level variance increases.

4.1.2.2.2 Non-Informative Design

As the ICC increases, the RMSE decreases for the three student-level covariates and increases for rural, the school-level covariate, with all four estimators (see Table 4.3 and Figure 4.7). Figure 4.8 shows that the RMSE for the intercept remains almost unchanged across different levels of the ICC with the four estimators, but Table 4.4 shows that the RMSE does increase consistently as the ICC increases. It is clear that, as the ICC increases, the RMSE for the student-level variance decreases whereas the RMSE for the school-level variance increases with all the estimators.

4.1.2.3 Coverage Rate

4.1.2.3.1 Informative Design

Table 4.1 and Figure 4.9 show that, as the ICC increases, there is not much variation in the coverage rates for the fixed effects. The coverage rates for the intercept and student-level variance remain almost the same as the ICC increases (see Table 4.2 and Figure 4.10).
For the school-level variance, although the coverage rate changes as the ICC increases, no consistent pattern can be seen for the estimators except with the raw weighted method. Overall, the coverage rate is not sensitive to changes in the ICC in the informative case.

4.1.2.3.2 Non-Informative Design

No obvious ICC effect can be found on the coverage rate for the parameter estimates except for the school-level variance (see Tables 4.3-4.4 and Figures 4.11-4.12). The coverage rates for the fixed effects, intercept, and student-level variance remain almost unchanged as the ICC increases. Although there are some variations in the coverage rates for the school-level variance, there is no clear pattern across the four estimators. For example, the coverage rates with the unweighted model and the cluster scaling method first increase and then decrease as the ICC increases; the coverage rate keeps increasing with the effective scaling method and decreasing with the raw weighted method as the ICC decreases. In sum, no ICC effect on the coverage rate can be found for the parameters in the non-informative condition.

4.1.3 Simulated Standard Errors and Standard Deviations

If we were to repeat the Monte Carlo simulation and record the sample mean each time, the distribution of the sample mean would be approximately normal (by the central limit theorem). To assess how well the standard errors of the estimates approximate the true sampling variation, the sample standard deviation of the point estimates across replications, that is, the Monte Carlo standard deviation, can be compared to the average of the estimated standard errors. We would expect the sample standard deviation to be close to the average of the standard errors; that is, the standard error is a good estimate of the standard deviation of the sampling distribution if the sample size is sufficiently large.
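The comparison just described can be sketched in a few lines of Python; the function names and toy values are mine.

```python
import math

def monte_carlo_sd(estimates):
    """Sample standard deviation of the point estimates across replications."""
    n = len(estimates)
    mean = sum(estimates) / n
    return math.sqrt(sum((x - mean) ** 2 for x in estimates) / (n - 1))

def sd_vs_average_se(estimates, standard_errors):
    """Compare the Monte Carlo SD with the average estimated SE; a small
    difference suggests the SEs track the true sampling variation."""
    sd = monte_carlo_sd(estimates)
    avg_se = sum(standard_errors) / len(standard_errors)
    return sd, avg_se, sd - avg_se

# toy replications: point estimates and their estimated standard errors
ests = [2.1, 1.9, 2.0, 2.2, 1.8]
ses = [0.15, 0.16, 0.14, 0.15, 0.15]
sd, avg_se, diff = sd_vs_average_se(ests, ses)
print(round(sd, 3), round(avg_se, 3), round(diff, 3))
```

In the study, this comparison is made per parameter over the 1000 replication estimates reported in Tables 4.5-4.6.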
The differences between the standard deviations and the averaged standard errors of the 1000 point estimates are calculated for all seven parameters: the four regression coefficients for female, SES, pretest, and rural; the intercept; and the two random effects (the student-level variance and the school-level variance). Table 4.5 presents the standard deviations of the simulation and the standard errors of the estimates in the informative sampling design. The differences for the four fixed effects and the intercept are in the second or even third decimal place. The differences for the student-level variance and school-level variance are somewhat larger, but still less than or close to 1. Clearly, the unweighted method produces the smallest standard errors and works best compared with the weighted estimators. Table 4.6 contains the standard deviations of the simulation and the standard errors of the estimates in the non-informative sampling design. It tells the same story as the informative setting. The differences for all the parameter estimates are even smaller, and the largest absolute difference is 0.273, indicating that the estimation performs quite well. Again, the unweighted method has the smallest standard errors and performs best compared with the three weighted models.

Table 4.5. Simulation Standard Deviations and Standard Errors of Estimates in the Informative Design
Note: UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling; SD = standard deviation; SE = standard error; Diff = difference.

Table 4.6.
Simulation Standard Deviations and Standard Errors of Estimates in the Non-Informative Design
Note: UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling; SD = standard deviation; SE = standard error; Diff = difference.

4.2 Results for ECLS-K:2011

First, the informativeness of the weights is examined following Laukaityte and Wiberg (2018). The student-level effective sample sizes are all smaller than the actual sample sizes except in those schools that have only one student. The school-level effective sample size is 614, which is smaller than the actual number of schools. Therefore, the weights at both levels are informative and would affect the results of the multilevel analysis. Three two-level HLM models with different sets of covariates are used to fit two dependent variables: reading achievement scores and mathematics achievement scores. The first model is a null model; the second adds student-level predictors (I label it the student model); and the third is a full model with both student-level and school-level predictors. Table 4.7 presents the results of the unweighted and weighted null models. Even this simple model shows important differences in the estimates of the variance components. Using no weights produces the largest estimates of student-level variance, whereas using raw weights produces the largest estimates of school-level variance. The intercept estimates are in the same direction and of similar size across the four estimators for reading and mathematics; still, the weighted intercept estimates are consistently larger than the unweighted estimate. Overall, the unweighted method consistently has the smallest standard errors and the largest test statistics among the four estimators.
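The informativeness check above rests on comparing effective with actual sample sizes. The usual (Kish) effective-sample-size formula, n_eff = (Σw)² / Σw², is an assumption of this sketch, since the exact formula used by Laukaityte and Wiberg (2018) is not reproduced here.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / sum of w^2.
    Equal weights give n_eff = n; unequal weights give n_eff < n,
    which is the signal that the weights carry information."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

print(effective_sample_size([1.0] * 10))        # 10.0: equal weights
print(effective_sample_size([1.0, 3.0, 1.0]))   # 25/11, about 2.27 < 3
```

A cluster with a single student necessarily has n_eff equal to its actual size, matching the exception noted above.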
In addition, the two scaling methods perform similarly, with much closer point estimates, standard errors, and consequently test statistics. The ICCs (see Table 4.7) show that 19.6% and 16.2% of the total variance in mathematics and reading achievement, respectively, are attributable to schools. Based on Equation 2.15, the design effects are 13.61 and 13.65 for mathematics and reading, respectively. Both are greater than 2, indicating that using a multilevel model to analyze these data is reasonable.

Table 4.7. Null Model for ECLS-K:2011 Mathematics and Reading
Note: UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling; SE = standard error. *p < .05; **p < .01; ***p < .001.

The results of the model with student-level predictors are depicted in Table 4.8. Contrary to the null model, the intercept estimates in the weighted models are smaller than those in the unweighted model. As in the null model, the unweighted model produces the largest estimate of student-level variance and the raw weighted model produces the largest estimate of school-level variance. Furthermore, the goodness-of-fit indices AIC, BIC, and deviance are substantially larger when the raw weighted estimation method is applied. Compared with the null model, the standard errors for the intercept increase, while the standard errors for the student-level and school-level variances decrease. The within-school variance decreases by 67% for both mathematics and reading; the between-school variance decrease ranges from 68% to 72% for mathematics and from 61% to 64% for reading. Similar results are obtained when both student-level and school-level weights are used in the model.
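Equation 2.15 is not reproduced in this chunk; the common cluster design-effect formula, 1 + (c − 1) × ICC with c the average cluster size, is assumed in this sketch, and the variance components and cluster size are hypothetical rather than the ECLS-K:2011 values.

```python
def icc(tau2, sigma2):
    """Intraclass correlation: school-level share of the total variance."""
    return tau2 / (tau2 + sigma2)

def design_effect(avg_cluster_size, rho):
    """Common cluster design effect, 1 + (c - 1) * ICC; values above 2 are
    usually read as a warning that single-level SEs would be understated."""
    return 1.0 + (avg_cluster_size - 1.0) * rho

rho = icc(19.6, 80.4)            # hypothetical variance components
deff = design_effect(21.0, rho)  # hypothetical 21 students per school
print(round(rho, 3), round(deff, 2))   # 0.196 and about 4.92
```

The same two-step computation, ICC from the null-model variance components and then the design effect, underlies the decision rule applied in the text.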
The standard errors of all the parameters with the unweighted method are consistently smaller than those of the weighted methods, and the test statistics of the unweighted estimator are consistently larger than those of the weighted estimators, as expected. The significance is stable for all the parameters as well.

Table 4.8. Model with Student-Level Predictors for ECLS-K:2011 Mathematics and Reading
Note: UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling; SE = standard error. AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion. *p < .05; **p < .01; ***p < .001.

Table 4.9 reports the results of the full model. The covariate suburban is found not to contribute significantly to the model for either the reading or the mathematics data. Another model excluding suburban is also run, and the two models, one with suburban and one without, are compared using a likelihood ratio test. No significant difference is found. Therefore, I simplify the model and include the three student-level predictors and only one school-level predictor, rural, as the full model in this study.

Table 4.9. Full Model for ECLS-K:2011 Mathematics and Reading
Note: UW = unweighted estimation method; RW = estimation method with raw weights; CS = estimation method with cluster scaling; ES = estimation method with effective cluster scaling; SE = standard error. AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion. *p < .05; **p < .01; ***p < .001.

The findings from the comparison of weighted and unweighted analyses are similar to those obtained from the model with only student-level predictors. The estimates, standard errors, and consequently the test statistics do not differ much between the full model and the student model.
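The suburban-versus-no-suburban comparison uses a standard likelihood ratio test for nested models; a generic sketch follows, where the log-likelihood values are hypothetical, not the ECLS-K:2011 results.

```python
import math

def lr_test_df1(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter:
    the statistic 2 * (llf - llr) is compared to a chi-square with 1 df,
    whose upper-tail probability is erfc(sqrt(x / 2))."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# hypothetical log-likelihoods: dropping 'suburban' costs little fit
stat, p = lr_test_df1(-10234.6, -10233.9)
print(round(stat, 2), p > 0.05)   # 1.4 True -> retain the simpler model
```

A non-significant p-value, as in this toy case, supports dropping the extra covariate, which is the logic behind removing suburban from the full model.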
However, one can see that the significance remains unchanged for all the parameters except the school-level covariate rural. For the mathematics data, it changes from being significant at 0.01 with the raw weighted model to being significant at 0.001 with the other three models. For the reading data, the estimate for rural is significant at 0.001 with the unweighted model but at 0.01 with the three weighted models. In general, for both the reading and mathematics data in ECLS-K:2011, the weighted approaches produce larger standard errors and smaller test statistics than the unweighted model does. The larger standard errors and the resulting smaller test statistic values suggest that, given a different model, the chance of committing a Type II error (failing to detect a true effect) will increase substantially when weights are used. Although the rejection of the hypotheses remains the same across the weighted approaches, the raw weighted method produces larger standard errors than the other two weighted methods do. The two scaling methods perform quite similarly for all the parameters in all models.

CHAPTER 5
SUMMARY AND DISCUSSION

This chapter provides a summary, a discussion, and the limitations of the results. It consists of four sections. The first section summarizes the research objectives and results. The second section presents a discussion of the major findings, followed by the implications. Limitations of this study and directions for future research are discussed in the final section.

5.1 Summary of This Study

The primary aim of this study is to examine the performance of the four estimators and analyze the impact of sampling weights in multilevel models in the context of two-stage informative and non-informative sampling designs. Large-scale data in social science usually come from complex sampling designs, such as clustering and unequal probabilities of selection, which bring challenges for statistical analysis.
Using multilevel models to analyze complex large-scale assessment data while accounting for clustering is becoming more and more popular, but when and how to use sampling weights in such models to correct for unequal probability of selection remains an open question. For example, there is controversy over whether to use weights at all; the argument between the model-based and design-based schools has a long history. Even once we have decided to use weights, in a two-level model, for instance, it is debatable whether to use a single-level weight derived from the product of the weights from each level or to use multilevel weights. I use multilevel weights in this study because a single-level weight may not carry adequate information to correct for unequal probability of selection at each level. The analysis with real data shows that incorporating sampling weights in the model does produce parameter estimates, standard errors, test statistics, and sometimes even the significance of a particular variable that differ from those obtained without weights, much as in the simulation when both levels are informative. The weighted models have larger standard errors and smaller test statistics than the unweighted model does, and the cluster scaling and effective scaling methods produce results more similar to each other than to those of the unweighted and raw weighted models. Therefore, caution should be exercised when weights are applied in multilevel analysis. In this study, Monte Carlo simulations are conducted to evaluate the performance of the four estimation methods in informative and non-informative sampling designs with a linear random-intercept model, because prior studies (e.g., Cai, 2013) found that the estimates were biased if the informativeness was ignored. A summary of the comparisons of the estimators is given in Table 5.1. Substantial differences are found among the four estimation methods in estimating the intercept and variance components.
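The distinction between multilevel weights and a single-level product weight can be made concrete with a small sketch; the inclusion probabilities below are hypothetical, chosen only for illustration.

```python
# Hypothetical two-stage inclusion probabilities: pj for a school and
# pi_j for each sampled student conditional on that school being selected.
pj = 0.05              # P(school j selected)
pi_j = [0.5, 0.25]     # P(student i selected | school j selected)

# Multilevel weights keep the two levels separate:
wj = 1 / pj                      # level-2 (school) weight
wi_j = [1 / p for p in pi_j]     # level-1 (conditional student) weights

# A single-level weight collapses both into one product per student,
# discarding the information about which level drove the selection.
wij = [wj * w for w in wi_j]

print(wj, wi_j, wij)  # 20.0, [2.0, 4.0], [40.0, 80.0]
```

Two students with the same product weight wij can come from very different combinations of school-level and student-level selection, which is precisely the information MPML estimation needs at each level.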
In the informative design, in terms of bias, the weighted estimators outperform the unweighted estimator for the intercept and student-level variance estimation, whereas the unweighted estimator works best for school-level variance estimation. Although the three weighted estimators produce almost unbiased estimates for the intercept and student-level variance, they perform quite differently from one another. The three weighted estimators perform almost equally well for intercept estimation, while the cluster scaling estimator performs best for student-level variance estimation. The raw weighted method works worst and should be used with caution when estimating the school-level variance. The weighted methods give better coverage rates for the intercept and student-level variance, but the unweighted method does for the school-level variance in the informative design. In the non-informative setting, the unweighted method gives the better coverage rate for all the parameter estimates, and it performs best or second best in terms of relative bias. Furthermore, including sampling weights decreases the RMSE for the intercept and student-level variance and increases the RMSE for the school-level variance in the informative design; it increases the RMSE for the intercept, student-level variance, and school-level variance in the non-informative design. Therefore, the unweighted method works most efficiently for all the parameter estimates across different levels of the ICC in the non-informative design. Tentatively, the cluster scaling estimator and effective scaling estimator might be preferred in the informative condition.

Table 5.1. Summary of Comparisons of the Estimation Methods
Note: RB = relative bias; RMSE = root mean square error; 95% CR = 95% confidence interval coverage rate.
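The cluster scaling and effective scaling transformations compared above can be sketched as follows. The formulas are the commonly used cluster-size and effective-cluster-size scalings of within-cluster weights (cf. Pfeffermann et al., 1998); the weight vector is a hypothetical example, not taken from the simulation.

```python
def cluster_scale(w):
    """Scale within-cluster weights so they sum to the cluster
    sample size n_j (cluster scaling)."""
    factor = len(w) / sum(w)
    return [wi * factor for wi in w]

def effective_scale(w):
    """Scale within-cluster weights so they sum to the effective
    cluster size (sum w)^2 / (sum w^2) (effective scaling)."""
    factor = sum(w) / sum(wi ** 2 for wi in w)
    return [wi * factor for wi in w]

# Hypothetical level-1 weights for one sampled school
w = [1.2, 1.2, 2.5, 4.0]
print(sum(cluster_scale(w)))    # = 4, the raw cluster size
print(sum(effective_scale(w)))  # = (sum w)^2 / sum(w^2), the effective size
```

When the within-cluster weights are nearly equal, the two scalings (and hence the two estimators) give nearly identical results, consistent with their similar performance observed here.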
The ICC is one of the factors that influence the quality of estimation (e.g., Asparouhov & Muthén, 2006; Kovačević & Rai, 2003); therefore, it is manipulated in this study. Simulation results are summarized in Table 5.2 and show that the effect of the ICC appears in the relative bias and RMSE, but the coverage rate is not sensitive to it. As the ICC increases, the bias for the student-level variance increases and the bias for the school-level variance decreases in both conditions. These changes are quite obvious for the school-level variance but hard to see for the student-level variance. No monotonic pattern in the relative bias can be found for the fixed effects and intercept as the ICC increases in the informative condition, but clear patterns can be seen for the fixed effects and intercept in the non-informative condition. The RMSE shows similar patterns in both conditions for all the parameters: as the ICC increases, the RMSE decreases for the three student-level fixed effects and the student-level variance, and increases for the school-level fixed effect and variance with all four estimators. Take the scenario with ICC = 0.3 as an example. In the informative condition, the simulation results show that the cluster scaling estimator works best for the intercept and student-level variance in terms of relative bias, RMSE, and coverage rate. Although it is not the best weighted estimator for the school-level variance, it gives the best coverage rate and only a slightly higher RMSE than the best weighted estimator, the effective scaling estimator, and it produces unbiased estimates for the school-level variance. Therefore, in the informative setting, the cluster scaling estimator is preferred in most cases.

Table 5.2. ICC Effect
Note: RB = relative bias; RMSE = root mean square error; 95% CR = 95% confidence interval coverage rate.
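The ICC conditions manipulated in the simulation follow directly from the variance decomposition of the random-intercept model: a small sketch, assuming the total variance is fixed at 60 as in the simulation design described earlier.

```python
def icc(tau00, sigma2):
    """Intraclass correlation: share of total variance at the school level."""
    return tau00 / (tau00 + sigma2)

# The simulation fixes the total variance at 60 and varies the level-2
# share, giving the five ICC conditions used in this study.
for tau00 in (30, 18, 12, 6, 0.6):
    print(tau00, round(icc(tau00, 60 - tau00), 2))  # 0.5, 0.3, 0.2, 0.1, 0.01
```

The last condition (tau00 = 0.6, ICC = 0.01) is the "extremely small ICC" case in which all estimators struggle with the school-level variance.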
In the non-informative condition, when the ICC is 0.3, the unweighted estimator has the smallest (absolute) relative bias, the smallest RMSE, and the highest coverage rate in almost all cases. Therefore, the unweighted estimator is preferred in the non-informative condition.

5.2 Discussion of Results

The design of the current simulation captures the general features of large-scale data sets available in social studies, for example, a large number of clusters of different sizes, unequal probability of selection, and moderate informativeness values. Some of the findings from previous studies are confirmed in this study, and some are not. For example, prior studies showed that the unweighted method produces biased estimates for the intercept and school-level variance when the sampling design is informative at both levels (Cai, 2013; Pfeffermann et al., 1998). Pfeffermann et al. (1998) pointed out that when the design is informative at the cluster level, the unweighted method only produces biased estimates for the intercept and school-level variance, not for the student-level variance. However, the current study shows that the unweighted method works quite well most of the time for school-level variance estimation; it only fails in the informative design when the ICC is extremely small, and in that case none of the estimators works well. This is expected because, based on Equation 3.15, the denominator is then very small (0.6), which yields a very large relative bias compared with the conditions in which the ICC is comparatively larger. As for the student-level variance, although the unweighted estimator works worst in the informative condition, it still produces unbiased estimates. In addition, Cai (2013) pointed out that including the sampling weights substantially increases the MSE. This is confirmed only in the non-informative setting, not in the informative setting, in the current study.
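The denominator effect noted above is simple arithmetic. As a sketch, the absolute error of 0.3 below is a hypothetical illustration, not a simulated result:

```python
def relative_bias(estimate, true_value):
    """Relative bias of a point estimate with respect to the true value."""
    return (estimate - true_value) / true_value

# The same absolute error of 0.3 in the school-level variance estimate
# produces wildly different relative biases depending on the true value:
print(relative_bias(0.9, 0.6))    # ICC = 0.01 condition: 50% relative bias
print(relative_bias(30.3, 30.0))  # ICC = 0.5 condition: 1% relative bias
```

This is why the ICC = 0.01 condition looks catastrophic on the relative-bias metric even when the absolute estimation error is modest.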
All the fixed effects are estimated nearly without bias according to the criteria of Muthén and Muthén (2002). This is confirmed in both studies. In general, including sampling weights still produces biased estimates; this is confirmed by all the studies. Asparouhov and Muthén (2007) reported that the MPML estimator substantially outperforms the other estimators. This is partially confirmed in the present study, since the cluster scaling estimator performs better than the others in the informative condition, while the raw weighted estimator needs to be used with caution, especially when estimating variance components in the informative condition. Previous studies (e.g., Asparouhov & Muthén, 2006; Kovačević & Rai, 2003) found that the bias increases for all the parameters as the ICC decreases. This is only partially confirmed in the current study: the current results do not show monotonic patterns of relative bias for the fixed effects and intercept, but the bias increases for the student-level variance and decreases for the school-level variance as the ICC increases in the informative condition. In the non-informative condition, an increase in the ICC decreases the bias for the student-level fixed effects and variance, and increases the bias for the school-level fixed effect and variance. Therefore, the tentative conclusion is that the weighted estimators with cluster scaling and effective scaling weights are preferred when the ICC is not extremely small in the informative design, and the unweighted method could be used in the non-informative design. The differences above might be due to the different simulation settings: the estimators are examined either with a random-intercept model with no covariates at either level (cf. Asparouhov & Muthén, 2006; Kovačević & Rai, 2003) or with a linear random-intercept model with no school-level predictors (cf. Cai, 2013).
Therefore, it is possible that our results might not be replicated in different settings.

5.3 Implications

The major finding from this study confirms that including sampling weights in the analysis produces different estimates in the informative sampling design and that the unweighted method works best in the non-informative sampling design. The comparison between the weighted and unweighted methods, and between the informative and non-informative designs, suggests using sampling weights in the informative design and the unweighted estimation method in the non-informative design. Calculating the informativeness is therefore necessary, since it tells us the extent to which the design is informative and indicates whether sampling weights need to be included. Second, researchers should examine the ICC and evaluate the magnitude and significance of the variance components to determine whether multilevel modeling is necessary. Last but not least, caution should be taken in using sampling weights when the ICC is extremely small.

5.4 Limitations and Future Studies

There are several limitations in this study. The primary limitation is that only a simple linear random-intercept model is applied. It would be more realistic to allow random slopes and other types of outcome variables, such as Poisson or nominal outcomes; this might provide a clearer picture of which estimator works best. Second, besides scaling the sampling weights, trimming the weights can be an alternative, which is not considered in this study. Third, I only roughly divide the situations into two, informative or non-informative. It might be better to include different levels of informativeness, for example low, medium, and high, in the analysis; this might tell us under which degree of informativeness the parameter estimates can be estimated without bias.
Fourth, multistage sample selection is more complicated in real life, so the simulation design may not fully reflect reality. Not all the findings of the prior studies are confirmed in this study; therefore, more studies are needed to evaluate MPML performance in different settings. For example, different types of outcome variables, such as discrete responses or count data, can be used; there is more and more research focusing on them (Chaudhuri, Handcock, & Rendall, 2008; Natarajan, Lipsitz, Fitzmaurice, Moore, & Gonin, 2008; Nordberg, 1989; Rodriguez & Goldman, 1995, 2001). Higher-level HLM models (e.g., three-level models) can also be used. Furthermore, as is true of any simulation, conclusions from this study are restricted to a particular sampling design and modeling context; future research is necessary to see whether comparable findings hold in alternative situations. In this study, the simulation is conducted on the basis of a large number of clusters, but small samples are possible in practice, and the performance of the estimators might suffer from a small number of clusters (Asparouhov & Muthén, 2005; Li & Redden, 2015; Maas & Hox, 2005). Research examining the performance of different estimation methods in less-than-ideal conditions is necessary. Above all, future research is needed to enhance weighted multilevel models. Asparouhov and Muthén (2010) stated that, given informative priors, Bayesian estimation could be an alternative to maximum likelihood estimation when sample sizes are small, but few comparisons have been made in the context of informative sampling designs.

APPENDICES

APPENDIX A.
Stata Simulation Syntax in the Informative Sampling Design

/****************************************************************************/
set more off
local info 30 18 12 6 0.6                  /*level 2 variance*/
forvalues i = 1/1000 {                     /*to repeat the process 1000 times*/
display "iteration `i'"
foreach j in `info' {
clear
display "l2var `j'"
*generate school level data
quietly: set seed 1`i'1
quietly: set obs 75000
quietly: gen uj = rnormal(0, sqrt(`j'))    /*need sd here, so need to square root j*/
*uj rescaled
quietly: egen ujmean = mean(uj)
quietly: egen ujsd = sd(uj)
quietly: gen uj_scaled = ((uj - ujmean)/ujsd)*sqrt(2)
quietly: gen pj = 1/(1+exp(4.12 - uj_scaled/2))
quietly: gen wj = 1/pj
quietly: gsample 150 [aw=pj]               /*draws an unequal probability sample with sampling probabilities pj*/
quietly: gen index = 1
quietly: gen school = _n
*school covariates
quietly: gen rand = runiform()
quietly: gen locale = cond(rand < 0.22, 1, cond(rand < 0.58, 2, 3))
quietly: gen rural = locale==1
quietly: gen suburb = locale==2
quietly: gen urban = locale==3
*expand students based on percentages of different types of schools
quietly: expand 16+int((24-10+1)*runiform()) if school<=8                   /*5.69% of 150 schools: 8*/
quietly: expand 25+int((49-25+1)*runiform()) if school>=9 & school<=25      /*11.49% of 150 schools: 17*/
quietly: expand 50+int((99-50+1)*runiform()) if school>=26 & school<=91     /*43.53% of 150 schools: 66*/
quietly: expand 100+int((149-100+1)*runiform()) if school>=92 & school<=129 /*25.48% of 150 schools: 38*/
quietly: expand 150+int((199-150+1)*runiform()) if school>=130 & school<=142 /*8.59% of 150 schools: 13*/
quietly: expand 200+int((600-200+1)*runiform()) if school>=143 & school<=150 /*5.22% of 150 schools: 8*/
quietly: bysort school: generate student = _n
*generate student data
quietly: gen eij = rnormal(0, sqrt(60-`j'))
*eij rescaled
quietly: egen eijmean = mean(eij)
quietly: egen eijsd = sd(eij)
quietly: gen eij_scaled = ((eij - eijmean)/eijsd)*sqrt(2)
quietly: gen pi_j = 1/(1+exp(1.23 - eij_scaled/2))
quietly: gen wi_j = 1/pi_j
quietly: gen pij = pi_j*pj
quietly: gen wij = 1/pij
*generate correlated data for female, SES and pretest
quietly: local p = 0.49
quietly: matrix m = (0, -0.05, 46.92)
quietly: matrix sd = (0.5, 0.81, 11.50)
quietly: matrix input c = (1, 0.005, 1, 0.07, 0.409, 1)
quietly: corr2data female SES pretest, corr(c) means(m) sds(sd) cstorage(lower)
/* Steps 2-3 for the one Bernoulli variable */
quietly: replace female = cond(normal(female)>=(1-`p'),1,0)
/*merge two level data*/
quietly: gen yij = 17.43+0.91*female+1.06*SES+0.92*pretest+1.04*rural+uj+eij
quietly: rename yij achieve
quietly: rename wj schwgt
quietly: rename wi_j stdwgt
*select final sample
quietly: keep if index == 1
quietly: gsample 3915 [aw=pi_j]
if `j' == 30 local r = 1
if `j' == 18 local r = 2
if `j' == 12 local r = 3
if `j' == 6 local r = 4
if `j' == 0.6 local r = 5
quietly: keep student schwgt school locale rural suburb urban stdwgt female SES pretest achieve
gen iteration = `i'
}
}
/****************************************************************************/

APPENDIX B.
Stata Simulation Syntax in the Non-Informative Sampling Design

/****************************************************************************/
set more off
local info 30 18 12 6 0.6                  /*level 2 variance*/
forvalues i = 1/1000 {                     /*to repeat the process 1000 times*/
display "iteration `i'"
foreach j in `info' {
clear
display "l2var `j'"
*generate school level data
quietly: set seed 1`i'1
quietly: set obs 75000
quietly: gen uj = rnormal(0, sqrt(`j'))
*betaj rescaled
quietly: gen betaj = rnormal(0, sqrt(2))
quietly: egen betajmean = mean(betaj)
quietly: egen betajsd = sd(betaj)
quietly: gen betaj_scaled = ((betaj - betajmean)/betajsd)*sqrt(2)
quietly: gen pj = 1/(1+exp(4.12 - betaj_scaled/2))
quietly: gen wj = 1/pj
quietly: gsample 150 [aw=pj]               /*draws an unequal probability sample with sampling probabilities pj*/
quietly: gen index = 1
quietly: gen school = _n
*school covariates
quietly: gen rand = runiform()
quietly: gen locale = cond(rand < 0.22, 1, cond(rand < 0.58, 2, 3))
quietly: gen rural = locale==1
quietly: gen suburb = locale==2
quietly: gen urban = locale==3
*expand students based on percentages of different types of schools
quietly: expand 16+int((24-10+1)*runiform()) if school<=8                   /*5.69% of 150 schools: 8*/
quietly: expand 25+int((49-25+1)*runiform()) if school>=9 & school<=25      /*11.49% of 150 schools: 17*/
quietly: expand 50+int((99-50+1)*runiform()) if school>=26 & school<=91     /*43.53% of 150 schools: 66*/
quietly: expand 100+int((149-100+1)*runiform()) if school>=92 & school<=129 /*25.48% of 150 schools: 38*/
quietly: expand 150+int((199-150+1)*runiform()) if school>=130 & school<=142 /*8.59% of 150 schools: 13*/
quietly: expand 200+int((600-200+1)*runiform()) if school>=143 & school<=150 /*5.22% of 150 schools: 8*/
quietly: bysort school: generate student = _n
*generate student data
quietly: gen eij = rnormal(0, sqrt(60-`j'))
*rij rescaled
quietly: gen rij = rnormal(0, sqrt(2))
quietly: egen rijmean = mean(rij)
quietly: egen rijsd = sd(rij)
quietly: gen rij_scaled = ((rij - rijmean)/rijsd)*sqrt(2)
quietly: gen pi_j = 1/(1+exp(1.23 - rij_scaled/2))
quietly: gen wi_j = 1/pi_j
quietly: gen pij = pi_j*pj
quietly: gen wij = 1/pij
*generate correlated data for female, SES and pretest
quietly: local p = 0.49
quietly: matrix m = (0, -0.05, 46.92)
quietly: matrix sd = (0.5, 0.81, 11.50)
quietly: matrix input c = (1, 0.006, 1, 0.07, 0.409, 1)
quietly: corr2data female SES pretest, corr(c) means(m) sds(sd) cstorage(lower)
/* Steps 2-3 for the one Bernoulli variable */
quietly: replace female = cond(normal(female)>=(1-`p'),1,0)
/*merge two level data*/
quietly: gen yij = 17.43+0.91*female+1.06*SES+0.92*pretest+1.04*rural+uj+eij
quietly: rename yij achieve
quietly: rename wj schwgt
quietly: rename wi_j stdwgt
*select final sample
quietly: keep if index == 1
quietly: gsample 3915 [aw=pi_j]
if `j' == 30 local r = 1
if `j' == 18 local r = 2
if `j' == 12 local r = 3
if `j' == 6 local r = 4
if `j' == 0.6 local r = 5
quietly: keep student schwgt school locale rural suburb urban stdwgt female SES pretest achieve
gen iteration = `i'
}
}
/****************************************************************************/

APPENDIX C.
Mplus Syntax

/***************************** Mplus VERSION 8 *****************************/
/*********************** Unweighted estimation method **********************/
Title: READING with NO weights;
Data: File is iteration_list.csv;
      Type = MONTECARLO;
Variable: Names are schwgt school locale rural suburb urban student
      stdwgt female SES pretest achieve iteration;
      USEVARIABLES are achieve school female SES pretest rural;
      CLUSTER = school;
      WITHIN = female SES pretest;
      BETWEEN = rural;
MODEL:
      %WITHIN%
      achieve on female*.91 SES*1.06 pretest*.92;
      achieve*30;        !variance at level 1
      %BETWEEN%
      achieve on rural*1.04;
      [achieve*17.43];   ![gamma00]
      achieve*30;        !variance at level 2
ANALYSIS: TYPE = TWOLEVEL;

/******************* Estimation method with raw weights ********************/
Title: READING with raw weights (unscaled);
Data: File is iteration_list.csv;
      Type = MONTECARLO;
Variable: Names are schwgt school locale rural suburb urban student
      stdwgt female SES pretest achieve iteration;
      USEVARIABLES are achieve school female SES pretest rural;
      CLUSTER = school;
      WITHIN = female SES pretest;
      BETWEEN = rural;
      Weight is stdwgt;
      Bweight = schwgt;
      Wtscale = UNSCALED;
      Bwtscale = UNSCALED;
MODEL:
      %WITHIN%
      achieve on female*.91 SES*1.06 pretest*.92;
      achieve*30;        !variance at level 1
      %BETWEEN%
      achieve on rural*1.04;
      [achieve*17.43];   ![gamma00]
      achieve*30;        !variance at level 2
ANALYSIS: TYPE = TWOLEVEL;
      algorithm = integration;
      estimator = MLR;

/***************** Estimation method with cluster scaling ******************/
Title: READING with scaling1;
Data: File is iteration_list.csv;
      Type = MONTECARLO;
Variable: Names are schwgt school locale rural suburb urban student
      stdwgt female SES pretest achieve iteration;
      USEVARIABLES are achieve school female SES pretest rural;
      CLUSTER = school;
      WITHIN = female SES pretest;
      BETWEEN = rural;
      Weight is stdwgt;
      Bweight = schwgt;
      Wtscale = cluster;
      Bwtscale = sample;
MODEL:
      %WITHIN%
      achieve on female*.91 SES*1.06 pretest*.92;
      achieve*30;        !variance at level 1
      %BETWEEN%
      achieve on rural*1.04;
      [achieve*17.43];   ![gamma00]
      achieve*30;        !variance at level 2
ANALYSIS: TYPE = TWOLEVEL;
      algorithm = integration;
      estimator = MLR;

/********** Estimation method with effective scaling (ecluster) ************/
Title: READING with scaling2;
Data: File is iteration_list.csv;
      Type = MONTECARLO;
Variable: Names are schwgt school locale rural suburb urban student
      stdwgt female SES pretest achieve iteration;
      USEVARIABLES are achieve school female SES pretest rural;
      CLUSTER = school;
      WITHIN = female SES pretest;
      BETWEEN = rural;
      Weight is stdwgt;
      Bweight = schwgt;
      Wtscale = ecluster;
      Bwtscale = sample;
MODEL:
      %WITHIN%
      achieve on female*.91 SES*1.06 pretest*.92;
      achieve*30;        !variance at level 1
      %BETWEEN%
      achieve on rural*1.04;
      [achieve*17.43];   ![gamma00]
      achieve*30;        !variance at level 2
ANALYSIS: TYPE = TWOLEVEL;
      algorithm = integration;
      estimator = MLR;

REFERENCES

Arceneaux, K., & Nickerson, D. W. (2009). Modeling certainty with clustered data: A comparison of methods. Political Analysis, 17, 177-190. doi:10.1093/pan/mpp004

Asparouhov, T. (2004). Weighting for unequal probability of selection in multilevel modeling (Mplus Web Notes: No. 8). Available from http://www.statmodel.com/

Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation Modeling, 12(3), 411-434.

Asparouhov, T. (2006). General multi-level modeling with sampling weights. Communications in Statistics - Theory and Methods, 35(3), 439-460.

Asparouhov, T., & Muthén, B. (2005). Multivariate statistical modeling with survey data (Mplus Web Notes). Los Angeles, CA: Muthén & Muthén.

Asparouhov, T., & Muthén, B. (2007). Testing for informative weights and weights trimming in multivariate modeling with survey data.
Retrieved August 2 1, 2012 from http://www.statmodel.com/download/JSM2007000745.pdf Asparouhov, T. , & Muth n , B. (2 010). Bayesian analysis o f latent variable models using Mplus (M plus Technical Re port Versi on 4). Los Angeles, CA: Muthén & Muthén . Retrieved from http://www.statmodel.com/download /Ba yes - Advantages18.pdf As parouhov, T. , & Muthén , B. (2006). Multilevel modeling o f complex survey data. Paper pres ented at the Proceedings of the Joint S tatistical Meeting in Seattle. Bainbridge, T. R. (1985). The Committee on standards: precision and bia s. ASTM Standardization News 13, 44 - 46. Bertolet, M. (200 8). To weight or not to weight? Incorporating sampling designs into model - based analyses. (Ph . D.), Carnegie Mellon University, Ann Arbor. Binder, D. A. (1983). On the variances of asymptotica lly normal estimators from c omplex surveys. International Stati stical Review, 51 ( 3), 279 - 292. Bloom, H. S., Bos, J. M., & Lee , S. (1999). Using cluster ran dom assignment to measure program impacts: statistical implications for the evaluation of education pro grams. Evaluation Review, 23 (4), 445 - 469. 82 Bloom, H. S., Ric hburg - Hayes, L., & Bl ack, A. R. ( 2007 ) . Using covariat es to improve precision for studies tha t randomize schools to evaluate educational interventions. Educational Evaluation and Policy Analysi s, 29 (1 ) , 30 - 59 . doi: 10.310 2/0162373707299550Schochet, 2008 Boslaugh, S. (2007). S econdary data sou rces for public health: A practical guide. New York, NY : Cambridge University Press. Cai, T. (2013). Investigation of ways to handle sampling weights for mul tilevel model analyses. S ociological Methodology, 43 (1), 178 - 219. Carle, A. C. ( 2009). Fitting multilevel models in complex survey data with design weig hts: Recommendations. BMC Medic al Research Methodology . doi:10.1186/1471 - 2288 - 9 - 49 Chantala, K. , & Suc hindran, C. M. (2006). 
Ad justing for unequal selection probability in multilevel models: a comparison of software packages. Proceedings of the American S tatistical Association, Seattle, WA: American Statistical Association, 2815 - 2824. Chantala, K., Bla nch ette, D., & Suchindran, C . M. (2011). Software to compute sampling weights for mu ltilevel analysis . Available from ht tp://www.cpc.unc.edu/rese arch/tools/data_analysis/ml_sampling_weights/Compute%20 W eights%20for%20Multilevel%20Analy sis.pdf . Chaudhuri, S., Handcock, M. S., & Rendall, M. S. (2008). Generalized linear models incorporating population level information: a n e mpirical - likelihood - based approach. Journal of the Royal Statistical Society: Ser ies B (Statistical Methodology), 70 ( 2), 311 - 328. Chaudhuri, S., Handcoc k, M. S., & Rendall, M. S. (2010). A conditional empirical likelihood approach to combine sampling d esi gn and population level i nformation. Technical report No. 3/2010, National Univer sity of Singapore, Singapore, 117 546. Chen , J. , & Sitter, R. R. (1999). A pseudo empirical likelihood approach to the effective use of auxiliary information in complex sur vey s. Statistical Sinica, 9 ( 2), 385 - 406. Christ, S., Biemer, P., & Wiesen, C. (2007 ). Guidelines for applying multil evel model ing t o the NSCAW data . Ithaca, NY: National Data Archive on Child Abuse and Neglect. Clarke, P. (2008). When can group level clu ste ring be ignored? Multilev el models versus single - level models with sparse data. J ournal of Epidemiology and Commun ity Health , 62 , 752 - 758. doi: 1 0. 1136/jech.2007.060798 Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regress ion/correlation analysis for the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum. 83 Danielsen, A. G., Wiium, N., Wilhelmsen, B . U., & Wold, B. (201 0). Perceived support provided - reporte d academic initiative. J ournal of School Psychology , 48 (3), 247 - 67. doi:10.1016/j.jsp.2010.02.002 Eideh, A. , & Nathan, G. 
(2009). Two - stage informative clu ster sampling wi th application in small area estimation. Journal of Statistical Planning and Inferen ce, 139 , 3088 - 3101. Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press. Fra n cisco, C. A. , & Fuller, W. A. (1991). Quantile estimation with a complex survey design. The Annals of Statistics, 19 (1), 454 - 469. Fuller, W. (198 4). Least squares and related analyses for complex survey design. The Annals of Statistics, 10 (1 ), 99 - 118 . F uller, W. (2009). Sampling Statistics. Hoboken: Wiley. Goldste in, H. (1986). Multilevel mixed linear model analysis using iterative generalized l east squares. Biometr ika, 73 , 43 - 56. Graubard, B. I. , & Korn, E. L. (1996). Modeling the sampling desig n i n the analysis of health surveys. Statistical me thods in medical research, 5 (3), 43 - 56. Grilli, L. , & Pratesi, M. (2004). Weighted estimation in mu ltilevel ordinal and binary models in the presence of informative sampling designs. Survey Methodology, 3 0 ( 1 ) , 93 - 103. Hahs - Vaughn, D. L. (2005). A primer for using and un derstanding weights with national datasets. The Journal of Experimental Education, 7 3 (3) , 221 - 248. d oi: 1 0.3200/JEXE.73.3.221 - 248 Heck , R. H. , & Mahoe, R. (2004). An example of the impact of s ample weights and centering on multilevel SEM m odels. Paper pre sented at the annual meeting of the American Educational Research Association, San D iego, CA. Hedges, L. V. , & Hedberg, E. C. (2007). Intraclass correlation values for planning grou p - randomiz e d trials in education. Educational Evaluation a nd Policy Analys is, 29 (1) , 60 - 87. doi: 10.3102/0162373707299706 Hedges, L. V. , & Hedberg, E. C. (20 13). Intraclass corre lations and covariate outcome correlations for planning two - and three - level cluster - ra n domized experiments in education. Evaluation Re view, 37 (6 ), 445 - 489. Howell, D. C. (2008). The analysis of missing data. 
In Handbook of social sc ience methodology, ed . W. Outhwaite and S. Turner, (208 - 224). London, GB: Sage. Hox, J. J. , & Kre ft, I. G. ( 1994). Multilevel analysis methods. Sociologica l Methods & Rese arch, 22 (3) , 283 - 299. 84 Huber, P. J. (1967). The behavior of maximum likelihood estima tes under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical S tatistics a nd Probability (Vol. 1, pp. 221 - 233). Berkeley, CA: University of California Press. https://projecteucl id.org/euclid.bsmsp/1200512988 Jia, Y., Stokes, L., Harris, I., & Wang, Y. (2011). Pe r formance of random effects model estimators und er complex sampl ing designs. Journal of Educational and Behavioral Statistics, 36 ( 1), 6 - 32. Judd, C . M., McClelland, G. H., & Ryan, C. S. (2009). Data an alysis: A model comparison approach. New York, NY: Rou t ledge. Kim, J. K. , & Skinner, C. J. (2013). We ighting in surve y analysis under informative sampling. Biometrika, 100 (2 ), 385 - 398. https://www.js to r.org/stable/43304565 Kish, L. (1965). Survey samplin g . New York: Wiley. Kish, L. (1992). Weighting for unequal Pi. Journal of Official Statistics, 8 ( 2), 183 - 200. Korn, E. L. , & Graubard, B. I. (1995 ). Examples of differ ing weighted and unweighted es tim ates from a sample survey. The American Statistician, 4 9 ( 3), 291 - 295. Korn, E. L. , & Graubard, B. I. (2003). Estimati ng variance components by using survey data. Journal of the Royal Statistical Societ y : Series B (Statisti cal Methodology), 65 (1 ), 175 - 1 90. Kova evi , M. S. , & Rai, S. N. (2003). A pseud o maximum likelihood approach to multi - level mod eling of survey data. Communications in Statistics - Theory and Methods, 32 ( 1), 103 - 121. Koziol, N. A ., Bovaird, J. A., & Suarez, S. (2017). A compariso n of population - averaged and cluster - specific approaches i n the context of unequal probabilities of selec tion. Multivaria te Behavioral Research, 52 (3 ) , 325 - 349 . 
doi: 10.1080/00273171.2 - 17.12921 15 Kreft, I . G. G. , & Yoon, B. ( 1994). Are multilevel techniqu es necessary ? An attempt at demystification . Retrieved fr o m http://eric.ed .gov/?id=ED371033 Laird, N. M. , & Ware, J. H. (1982). Random - effects m odels for lo ngitudinal data. Biom etrics, 38, 963 - 974. Laukaity te, I. , & Wiberg, M. (2018). Importance of sampling weigh t s in multilevel modeling of international large - scale assessmen t data. Communications in Statistics - Theory and Methods, 47 ( 20), 4991 - 50 12. https://doi.org/10.1080/03610926.2017.1383429 Lee, J. , & Fish, R. M. (2010). International and inters tate gaps in val ue - added math - achievement: multilevel instrumental variable analysis of age effect a nd grade effect. Amer ican Journal of Education, 117 ( 1), 109 - 137. 85 Li, P. , & Redden, D. T. (2015). Small sampl e performance of bias - corrected sandwich estimat ors for cluster - randomized trials with binary outcomes. Statistics in Medicine, 34 , 281 - 296. http://d x.doi.org/10.1002/sim.6344 Lin, Y. X., Steel, D., & Cha m bers, R. L. (2004). Restricted quasi - score esti mating functions for sample survey data. Journal of Applied Probability, 41 , 119 - 130. Longford, N. T. (1995). Model - base d methods for analysis of data from 1990 NAEP trial state a ssessment. Washington, DC. L ongford, N. T. (1995). Random coefficient model s . Handbook of S tatistical Modeling for the Social and Behavioral Sciences, 519 - 570. Lubienski, S. T. , & Lubienski, C. ( 2006). School sector and acade mic achievement: a multilevel analysis of NAEP mathematic s data. American Educational Research Journal, 4 3 ( 4), 651 - 698. Mass, C. J. M. , & Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling . Methodology, 1 (3) , 86 - 92. http://dx.doi.org/10.1 0 27/1614 - 1881.1.3.86 Mels, G. (2006). LISREL for windows: get ting started guide. Lincolnwood, IL: Scientific Software International. Mulligan, G . M., Hastedt, S., & McCarroll, J. C. 
(2012). First-Time Kindergartners in 2010-11: First Findings From the Kindergarten Rounds of the Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011) (NCES 2012-049). U.S. Department of Education. Washington, DC: National Center for Education Statistics.
Murray, D. M., & Short, B. (1995). Intraclass correlation among measures related to alcohol use by young adults: Estimates, correlates, and applications in intervention studies. Journal of Studies on Alcohol, 56(6), 681-694.
Musca, S. C., Kamiejski, R., Nugier, A., Méot, A., Er-Rafiy, A., & Brauer, M. (2011). Data with hierarchical structure: Impact of intraclass correlation and sample size on Type-I error. Frontiers in Psychology, 2(74). doi: 10.3389/fpsyg.2011.00074
Muthén, L. K., & Muthén, B. O. (1998-2017). Mplus user's guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling, 9(4), 599-620.
Natarajan, S., Lipsitz, S. R., Fitzmaurice, G., Moore, C. G., & Gonin, R. (2008). Variance estimation in complex survey sampling for generalized linear models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 57(1), 75-87.
Nordberg, L. (1989). Generalized linear modeling of sample survey data. Journal of Official Statistics, 5(3), 223.
Palardy, G. J. (2010). The multilevel crossed random effects growth model for estimating teacher and school effects: Issues and extensions. Educational and Psychological Measurement, 70(3), 401-419.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61(2), 317-337. doi: 10.2307/1403631
Pfeffermann, D., & LaVange, L. (1989). Regression models for stratified multi-stage cluster samples. In C. J. Skinner, D. Holt, & T. M. F.
Smith (Eds.), Analysis of complex surveys (pp. 237-260). New York, NY: John Wiley & Sons.
Pfeffermann, D., Krieger, A. M., & Rinott, Y. (1998). Parametric distributions of complex survey data under informative probability sampling. Statistica Sinica, 8(4), 1087-1114.
Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society: Series B, 60(1), 23-40.
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A, 169(4), 805-827. https://doi.org/10.1111/j.1467-985X.2006.00426.x
Rao, J. N. K., & Wu, C. (2010). Bayesian pseudo-empirical-likelihood intervals for complex surveys. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 533-544.
Rao, J. N. K., Verret, F., & Hidiroglou, M. A. (2013). A weighted composite likelihood approach to inference for two-level models from survey data. Survey Methodology, 39(2), 263-282.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Thousand Oaks, CA: SAGE.
Raykov, T. (2011). Intraclass correlation coefficients in hierarchical designs: Evaluation using latent variable modeling. Structural Equation Modeling, 18(1), 73-90. doi: 10.1080/10705511.2011.534319
Raykov, T., & Marcoulides, G. A. (2015). Intraclass correlation coefficient in hierarchical design studies with discrete response variables: A note on a direct interval estimation procedure. Educational and Psychological Measurement, 75(6), 1063-1071.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
Rodriguez, G., & Goldman, N. (1995). An assessment of estimation procedures for multilevel models with binary responses.
Journal of the Royal Statistical Society: Series A (Statistics in Society), 73-79.
Rodriguez, G., & Goldman, N. (2001). Improved estimation procedures for multilevel models with binary response: A case study. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164(2), 339-355.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177. doi: 10.1037//1082-989X.7.2.147
Schochet, P. Z. (2008). Statistical power for random assignment evaluations of educational programs. Journal of Educational and Behavioral Statistics, 22(1), 62-87. doi: 10.3102/1076998607302714
Scientific Software International. (2005-2012). Multilevel models. LISREL documentation. Retrieved July 22, 2011, from http://www.ssicentral.com/lisrel/complexdocs/chapter4_web.pdf
Scott, A. J., & Holt, D. (1982). The effect of two-stage sampling on ordinary least squares methods. Journal of the American Statistical Association, 77(380), 848-854.
Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components. New York: Wiley.
Skinner, C. J. (1994). Sample models and weights. Paper presented at the Proceedings of the Section on Survey Research Methods.
Skinner, C. J., Holt, D., & Smith, T. M. F. (1989). Analysis of complex surveys. Chichester, UK: Wiley.
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). London: Sage Publications Ltd.
Stapleton, L. M. (2006). An assessment of practical solutions for structural equation modeling with complex sample data. Structural Equation Modeling: A Multidisciplinary Journal, 13, 28-58. doi: 10.1207/s15328007sem1301_2
Stapleton, L. M. (2012). Evaluation of conditional weight approximations for two-level models. Communications in Statistics - Simulation and Computation, 41, 182-204.
doi: 10.1080/03610918.2011.579700
Stapleton, L. M., & Kang, Y. (2018). Design effects of multilevel estimates from national probability samples. Sociological Methods & Research, 47(3), 430-457.
Tourangeau, K., Nord, C., Lê, T., Sorongon, A. G., Hagedorn, M. C., Daly, P., & Najarian, M. (2015). Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011) User's Manual for the ECLS-K:2011 Kindergarten Data File and Electronic Codebook, Public Version (NCES 2015-074). National Center for Education Statistics.
West, B. T., Beer, L., Gremel, G. W., Weiser, J., Johnson, C. H., Garg, S., & Skarbinski, J. (2015). Weighted multilevel models: A case study. American Journal of Public Health, 105(11), 2214-2215.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817-830. doi: 10.2307/1912934
Winship, C., & Radbill, L. (1994). Sampling weights and regression analysis. Sociological Methods & Research, 23(2), 230-257.
Xia, Q., & Torian, L. V. (2013). To weight or not to weight in time-location sampling: Why not do both? AIDS and Behavior, 17(9), 3120-3123.
Zaccarin, S., & Donati, C. (2008). The effects of sampling weights in multilevel analysis of PISA data (Working Paper No. 119). Università degli Studi di Trieste, Dipartimento di Scienze Economiche e Statistiche. Retrieved from http://www2.units.it/nirdses/sito_inglese/working%20papers/files%20for%20wp/wp119.pdf