This is to certify that the dissertation entitled USING THE MULTIVARIATE MULTILEVEL LOGISTIC REGRESSION MODEL TO DETECT DIF: A COMPARISON WITH HGLM AND LOGISTIC REGRESSION DIF DETECTION METHODS presented by Tianshu Pan has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement and Quantitative Methods.

USING THE MULTIVARIATE MULTILEVEL LOGISTIC REGRESSION MODEL TO DETECT DIF: A COMPARISON WITH HGLM AND LOGISTIC REGRESSION DIF DETECTION METHODS

By Tianshu Pan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2008

ABSTRACT

This study presents the Multivariate Multilevel Logistic Regression (MMLR) models to detect Differential Item Functioning (DIF), which are likely to detect DIF when the responses of an examinee are not locally independent. The study also compares the uses of the three MMLR models, three modified versions of Kamata’s Hierarchical Generalized Linear Model (HGLM) and the standard logistic regression model as DIF detection methods. The comparison between these statistical procedures for DIF detection will be made using Michigan Educational Assessment Program reading test and simulated data. The simulation study evaluates their performances in the detection of uniform DIF.
Simulated data are generated from 3-parameter logistic Item Response Theory models under varying conditions: sample size (400, 700, and 1000 examinees), test length (20, 40, and 60 items), the difference in the parameter b (0.25, 0.50, and 0.75), and ability distributions with different means and variances for the reference and focal groups. These test conditions are completely crossed and replicated 500 times. In these analyses, the total score and the IRT ability estimate are each used in turn as the matching variable. The results show that MMLR can be used for DIF detection. It is also found that heterogeneous variances of the two groups influence the power and Type I error rates of these methods, and that the HGLM DIF models are unsuitable for identifying DIF.

This dissertation is dedicated to my grandmother Tao Cheng.

ACKNOWLEDGEMENT

I am very grateful to the many people without whose help, assistance, and encouragement this dissertation could not have been completed. I am deeply thankful to Dr. Mark Reckase, the chair of my dissertation committee, for his direction and patience throughout my doctoral studies. I wish to express my gratitude to my other committee members, Dr. Tenko Raykov, Dr. Connie Page, and Dr. Yuehua Cui, for their comments. In addition, I am thankful for the help of Dr. Yeow Meng Thum and Dr. Kimberly Maier.

Finally, I am greatly indebted to my family. I thank my father, Pan Konggen, and my mother, Sun Baozhen. They have always supported me and encouraged me to pursue a Ph.D. degree in the USA. I also thank my wife, Chen Yumin, who did most of the chores, which enabled me to complete my doctoral study and dissertation, and my daughter, Pan Yuqi, whose loveliness made me forget any annoyance and hardship in my study and work. I give my most special thanks and long yearning to my grandmother, Tao Cheng, who took care of me from the time I was born until I got married.
Unfortunately she did not have a chance to see me study abroad and complete the dissertation. I hope she is proud of me in Heaven and happily knows that I have finished my Ph.D. studies.

TABLE OF CONTENTS

LIST OF TABLES
Chapter 1 Introduction
  1.1 Background
  1.2 The Statement of Purpose
Chapter 2 DIF Detection Methods
  2.1 Classifications of DIF Detection Methods
  2.2 Some Standard Observed-Score DIF Detection Methods
    2.2.1 The Mantel-Haenszel Procedure
    2.2.2 The Logistic Regression DIF Detection Method
    2.2.3 The Standardized Difference Method
    2.2.4 The SIBTEST Procedure
  2.3 The DIF Detection Method Based on Factor Analysis
  2.4 The DIF Detection Methods Based on GLMM
Chapter 3 Multivariate Multilevel Models
  3.1 The Development of Multivariate Multilevel Models
  3.2 Multivariate Multilevel Linear Models
  3.3 Multivariate Multilevel Logistic Regression Models
  3.4 MMLR DIF Detection Model
Chapter 4 HGLM DIF Detection Methods
  4.1 Kamata’s HGLM DIF Model
  4.2 The Modification of Kamata’s HGLM
  4.3 Differences between MMLR and HGLM
Chapter 5 Estimation Methods
  5.1 Linearization and Integral Approximation Methods
  5.2 SAS GLIMMIX Procedure
  5.3 Pseudo-Likelihood Estimation Based on Linearization
Chapter 6 Simulation Study
  6.1 Simulated Data
  6.2 Some Practical Issues of the Simulation Study
    6.2.1 Reduced LR DIF Model
    6.2.2 Matching Variables
    6.2.3 Evaluation Indexes
  6.3 Results of the Simulation Study
Chapter 7 Real Test Study
  7.1 Data
  7.2 Results of Two-Level MMLR and HGLM DIF Detection Methods
  7.3 Results of Three-Level MMLR and HGLM DIF Detection Methods
Chapter 8 Discussions
  8.1 Using of the MMLR DIF Detection Method
  8.2 HGLM Is Unsuitable for DIF Detection
  8.3 Effects of the Heterogeneous Variances on the DIF Detection Methods
  8.4 Comparisons of the Coefficients and Their Errors in MMLR and HGLM
  8.5 Limitations
APPENDICES
REFERENCE

LIST OF TABLES

Table 1 Comparison of Means and Variances for the Matching Variables
Table 2 Outputs of Multivariate Analysis of Variance
Table 3 Multiple Comparisons of Power and Type I Error Rates
Table 4 Power by Methods for the Same Ability Distributions
Table 5 Power by Methods for Different Means of Ability Distributions
Table 6 Power by Methods for Different Variances of Ability Distributions
Table 7 Type I Error Rates by Methods for the Same Ability Distributions
Table 8 Type I Error Rates by Methods for Different Means of Ability Distributions
Table 9 Type I Error Rates by Methods for Different Variances of Ability Distributions
Table 10 Power and Type I Error Rates by Method and Ability Distributions
Table 11 Similarity Rates (%) of Results of the Seven Models
Table 12 Power and Type I Error Rates Comparisons for the Different Matching Variables
Table 13 Means of the Matching Variables by Gender
Table 14 Results of the Reduced Logistic Regression DIF Detection Models
Table 15 Results of the DIF Detection (the Matching Variable: Total Score)
Table 16 Results of the DIF Detection (the Matching Variable: Ability Estimate)
Table 17 Results of the MMLR DIF Detection Models with CS Variance Matrix
Table 18 Results of the 3-Level MMLR DIF Detection Methods

Chapter 1 Introduction

In this chapter, the background and the importance of this study are described first. Then, related literature is reviewed. The chapter makes clear what is new in the study and states the purposes of the study.

1.1 Background

Sometimes, examinees in different demographic groups who have the same level of ability have different probabilities of answering a particular item correctly. This difference is defined as differential item functioning (DIF).
Differential item functioning (DIF) refers to differences in the functioning of an item among groups after the groups have been matched with respect to the ability or attribute that the item purportedly measures (Dorans & Holland, 1993). In the item response theory (IRT) framework, DIF means that the test item has a different item response function for one examinee group than for others (Lord, 1980), and it can be defined as a difference in the conditional probabilities that persons of the same ability answer an item correctly in two or more groups (Hidalgo & López-Pina, 2004). DIF is viewed as a necessary but not a sufficient condition for item bias (Clauser & Mazor, 1998). In the case of two groups, they are identified as the reference and focal groups. The reference group is composed of the majority or advantaged group, while the focal group is composed of the minority or disadvantaged group and is the subject of the DIF analysis.

Mellenbergh (1982) classified DIF as uniform or nonuniform. In the framework of IRT models, uniform DIF occurs when the item characteristic curves (ICCs) for the two groups differ only in the difficulty parameter, so that the relative advantage of the reference or focal group is uniform across the score scale. Nonuniform DIF is present when the ICCs differ because of differences in the discrimination parameters and/or pseudo-guessing parameters (Clauser & Mazor, 1998). Statistically, uniform DIF exists when there is no interaction between ability level and group membership. Nonuniform DIF exists when there is an interaction between ability level and group membership. This study will compare the performance of three types of methods for detecting uniform DIF.
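The distinction between uniform and nonuniform DIF can be illustrated numerically. The following is a minimal sketch, not taken from the study: the item parameters are hypothetical, and the pseudo-guessing parameter is set to zero so that the logit gap between the two groups has a simple closed form. With equal discrimination a, the gap is the constant a(b_focal - b_reference); with unequal discrimination it changes with ability and can change sign.

```python
import math

def icc_3pl(theta, a, b, c=0.0):
    """3PL item characteristic curve: probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

# Hypothetical item parameters (illustrative only, not from the dissertation).
a, b_ref = 1.2, 0.0
b_foc = b_ref + 0.5   # uniform DIF: only the difficulty differs between groups
a_foc = 0.6           # nonuniform DIF: the discrimination differs between groups

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform DIF: the logit gap equals a * (b_foc - b_ref) at every ability level.
uniform_gaps = [logit(icc_3pl(t, a, b_ref)) - logit(icc_3pl(t, a, b_foc)) for t in thetas]

# Nonuniform DIF: the logit gap depends on theta (an ability-by-group interaction).
nonuniform_gaps = [logit(icc_3pl(t, a, b_ref)) - logit(icc_3pl(t, a_foc, b_ref)) for t in thetas]

for t, u, n in zip(thetas, uniform_gaps, nonuniform_gaps):
    print(f"theta={t:+.1f}  uniform gap={u:.3f}  nonuniform gap={n:+.3f}")
```

The constant gap in the first case is what a group main effect captures in a logit model, while the changing gap in the second case corresponds to an ability-by-group interaction term.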
In the 1980s and the beginning of the 1990s, many procedures were developed to identify DIF, such as Mantel-Haenszel (MH) (Holland & Thayer, 1988), logistic regression (Swaminathan & Rogers, 1990), standardized difference (Dorans & Kulik, 1983; 1986), and SIBTEST (Shealy & Stout, 1993). But these procedures can analyze only one item or a small number of related items at a time (Swanson et al., 2002). Confirmatory Factor Analysis was used to identify all DIF items in a test at one time (Muthén & Lehman, 1985), but a simulation study shows that the method has extremely low power to find DIF (Finch & French, 2007). Since the 1990s, some types of Generalized Linear Mixed Models (GLMM) have been applied to identify DIF, for example, the Hierarchical Generalized Linear Model (HGLM) (Kamata, 1998; 2001), the Hierarchical Logistic Regression Model (HLRM) (Swanson et al., 2002), and the Logistic Mixed Model (LMM) (Van den Noortgate & De Boeck, 2005). These three DIF approaches are able to analyze all items in one computer run.

Different authors present Kamata’s HGLM approach with different equations or mathematical forms. Kamata’s HGLM and Van den Noortgate’s LMM approaches both treat person ability as random. But random person ability makes them unsuitable for detecting DIF when the examinees of the two groups have different expected ability or proficiency. This study will analyze Kamata’s model and try to modify it so that it can be used when the two groups have different ability means. Swanson’s HLRM method is not able to include the variation in the proficiency of examinees from different groups, e.g. classes or schools. In addition, none of the methods mentioned here can take into account the probable dependence between the binary responses of the same examinee. In order to address these disadvantages, a Multivariate Multilevel Logistic Regression (MMLR) model will be introduced to detect DIF in this study.
1.2 The Statement of Purpose

The multivariate logistic regression model in this study is not the one usually used for the analysis of multinomial or ordinal data, but is an extension of the Multivariate Multilevel Linear Model. The model is presented by Griffiths et al. (2004), McLeod (2001), and Yang et al. (2000). It is known as the Multivariate Multilevel Logistic Regression (MMLR) model since it has at least two levels. The main purposes of this study are to introduce the MMLR model to identify uniform DIF, to set up an MMLR DIF detection procedure, and to compare the performance of the MMLR, HGLM, and LR DIF detection procedures in identifying uniform DIF through a simulation study and an application to real test data. These DIF detection approaches are compared because all of them use a logit transformation. Additionally, LR can be regarded as one of the standard DIF detection methods, since it is presented in the “Test Fairness” chapter (Camilli, 2006) of the book Educational Measurement, sponsored jointly by the National Council on Measurement in Education and the American Council on Education. MMLR is considered acceptable as a DIF detection procedure if it performs as well as LR when applied to detect DIF; otherwise it is not acceptable.

Second, as noted, there are confusing equations and a potential problem in Kamata’s HGLM DIF procedure, which will be shown in the study. So, the secondary purpose of the study is to modify Kamata’s HGLM DIF procedure to extend the conditions under which it can be applied.

Chapter 2 DIF Detection Methods

This chapter reviews the literature on DIF detection methods. It gives more details about the DIF approaches mentioned in Chapter 1 and their disadvantages. First, the chapter shows how the current DIF detection methods are classified, and then it gives the details of some methods and their disadvantages.

2.1 Classifications of DIF Detection Methods

Dozens of DIF procedures have been presented in the literature.
They have been grouped under two major types by Camilli (2006). One uses IRT models and the other relies on analyzing the observed scores. The former includes Lord’s difference in IRT parameters (Lord, 1980), Raju’s signed and unsigned area indexes (Raju, 1988; 1990), the likelihood ratio tests (Thissen et al., 1988), and so on. The latter includes Mantel-Haenszel, logistic regression, standardized difference, SIBTEST, HGLM, and MMLR. The last method will be presented in this study. “While IRT methods provide useful results when the item models fit the data and a sufficient sample size exists for obtaining accurate estimates of IRT parameters, observed-score methods are frequently used with smaller sample sizes” than the IRT methods (Camilli, 2006: p. 236).

Among the IRT DIF methods, the IRT models for the reference and focal groups are estimated separately first, and then either the differences between the item response functions of the two groups are calculated and tested for each item, e.g. Raju’s indexes, or the differences of the parameters are computed and tested, e.g. Lord’s approach. In the likelihood ratio test, by contrast, the likelihood ratio of two IRT models, i.e. the models with and without DIF, is calculated for each item and has a large-sample chi-square distribution (Camilli, 2006). Since the method proposed in this study belongs to the observed-score methods, more details of the IRT methods are not given in the dissertation.

2.2 Some Standard Observed-Score DIF Detection Methods

Here the methods that appear in the chapter by Camilli (2006) are labeled as standard methods. In this type of method, scored item responses from the focal group are compared with those from the reference group in order to identify items that function differently in the two groups, using one or more additional covariates to control for individual differences on the construct to be measured. The covariate is called the matching variable or conditioning variable.
Usually the total raw score is used as a matching variable since it is easy and convenient to obtain.

2.2.1 The Mantel-Haenszel Procedure

The Mantel-Haenszel procedure was introduced to identify DIF by Holland and Thayer (1988). The MH DIF statistic is computed by matching examinees in each group on total test score and then forming $2 \times 2 \times A$ contingency tables for each item, where $A$ is the number of score levels on the matching variable, which is usually the total test score. At each score level $S$, a 2-by-2 contingency table is created for each item:

                   Correct    Incorrect    Total
Reference group    C_RS       I_RS         n_RS
Focal group        C_FS       I_FS         n_FS
Total              C_TS       I_TS         n_TS

where $C_{RS}$ stands for the number of reference group examinees at score level $S$ who answer the item correctly. The other variables in the table have similar definitions. Then the effect size measure of DIF is obtained by

\[ \alpha_{MH} = \frac{\sum_S C_{RS} I_{FS} / n_{TS}}{\sum_S C_{FS} I_{RS} / n_{TS}}. \]

The statistic is typically converted to the log-odds scale, i.e. $\delta_{MH} = \log(\alpha_{MH})$. At Educational Testing Service, it is put on the delta scale with the transformation

\[ \text{MH D-DIF} = \Delta_{MH} = -2.35\,\delta_{MH}. \]

Zieky (1993) divided the DIF magnitude into three categories according to the magnitude of $\Delta_{MH}$.

2.2.2 The Logistic Regression DIF Detection Method

Swaminathan and Rogers (1990) first introduced logistic regression (LR) to detect DIF. They also showed that the Mantel-Haenszel (MH) procedure can be considered as being based on an LR model when the ability variable is discrete and no interaction term between the group variable and ability is specified. The logistic regression procedure employs the item response as the dependent variable, with a group membership variable, the abilities of examinees, and the interaction between them as independent variables. The standard logistic regression model is expressed as

\[ \ln\left(\frac{p_j}{1 - p_j}\right) = \beta_0 + \beta_1 W_j + \beta_2 G_j + \beta_3 W_j G_j \tag{1} \]

where $p_j$ is the probability of examinee $j$ answering an item correctly, $W_j$ is the matching variable, which could be the ability estimate or total score of examinee $j$, and $G_j$ is the group membership of examinee $j$. The regression coefficients in the above equation can be estimated using maximum likelihood and can be tested for significance. If the item is unbiased, only $\beta_0$ and $\beta_1$ should be significantly different from zero. If $\beta_2$ is nonzero and $\beta_3$ equals zero, the item shows uniform DIF. If the interaction parameter $\beta_3$ is nonzero, the item has nonuniform DIF whether the other coefficients are equal to zero or not. Generally, total raw scores are used to indicate the proficiency of examinees.

When the differentiating factors are assumed to function in different patterns for examinees with the same characteristics in different units, e.g. classes or schools, the standard logistic regression DIF model can be extended to a multilevel logistic regression model (e.g. van den Bergh et al., 1995).

2.2.3 The Standardized Difference Method

The standardized difference approach was introduced to analyze DIF by Dorans and Kulik (1983, 1986). First, they calculate

\[ \Delta p_S = p_{RS} - p_{FS} = C_{RS}/n_{RS} - C_{FS}/n_{FS}, \]

using notation similar to that used for the contingency table of MH. Then, after these individual differences are summarized across the levels of the matching variable by applying a standardized weighting function to the differences, a standardized p-difference (STD P-DIF) can be obtained by

\[ \text{STD P-DIF} = \frac{\sum_S w_S \,\Delta p_S}{\sum_S w_S}. \]

The weights $w_S$ can be defined in several ways. When the numbers of examinees at level $S$ in the focal group, $n_{FS}$, are used as weights, the standard error of STD P-DIF was given as follows by Dorans and Holland (1993: p. 50):

\[ SE = \ldots \]
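The Mantel-Haenszel and standardized difference statistics of Sections 2.2.1 and 2.2.3 can be sketched in a few lines. This is an illustrative sketch only: the counts are made up, the function name `mh_and_std` is hypothetical, and the focal-group counts are used as the standardization weights. The sketch does not compute standard errors or significance tests.

```python
import math

def mh_and_std(tables):
    """Compute alpha_MH, MH D-DIF and STD P-DIF for one item.

    tables: list of (C_R, I_R, C_F, I_F) counts, one tuple per score level S.
    """
    num = den = 0.0        # Mantel-Haenszel numerator and denominator
    w_sum = wd_sum = 0.0   # sums for the standardized p-difference
    for c_r, i_r, c_f, i_f in tables:
        n_r, n_f = c_r + i_r, c_f + i_f
        n_t = n_r + n_f
        if n_r == 0 or n_f == 0:
            continue       # a level with an empty group carries no information
        num += c_r * i_f / n_t
        den += c_f * i_r / n_t
        wd_sum += n_f * (c_r / n_r - c_f / n_f)   # weight w_S = n_FS (an assumption)
        w_sum += n_f
    alpha = num / den
    delta = math.log(alpha)
    return {"alpha_MH": alpha, "MH_D_DIF": -2.35 * delta, "STD_P_DIF": wd_sum / w_sum}

# Hypothetical counts at three score levels (made up for illustration).
tables = [(20, 30, 10, 40), (30, 20, 20, 30), (40, 10, 30, 20)]
result = mh_and_std(tables)
print(result)
```

In this made-up example the reference group does better at every score level, so the common odds ratio is above 1, MH D-DIF is negative, and STD P-DIF is positive.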
" i=1 (7) yij=pij+eij\/pij(1_pij) ,012 1 0'21 0'22 and E(ej)=6,Var(ej)= (8) 2 01:1 01:2 0k where the 25 are dummy variables used to distinguish between the individual’s responses to different items; k is the number of items. If the examinees are nested in classes or schools, another level also could be added into the model. Here ej is called the R-side 18 random effect in SAS PROC GLIMNflX (SAS Institute, 2008), which is introduced with the G-side random effect in Chapter 5. If Equation (2) is used to model random effects in MMLR, the model is not multivariate, but univariate. When the multivariate multilevel model was presented in the last section, there was no random effect for level-1. But for the multilevel logistic regression model, the level-1 model has a constant variance “20 for the logit transformation (Snijder & Bosker, 1999) when the scale parameter is set to be 1. This implies there is variation between the different binary responses. So, if the multiple dependent variables have the same scale and measure the same thing, for instance, repeated measures, and they are thought to be nested in a level-2 unit, the variance matrix can be set as in Equation (2). However, Equation (7) does not have a matching variable. Since “DIF is defined as item performance differences between examinees of comparable proficiency” (Camilli, 2007: p. 236), the MMLR model must include a matching variable to control the disparities of ability estimate between two groups. Suppose Wj is the ability estimate of the fh person, then for MMLR, Equation (7) will be rewritten as follows: pi' k log(1—:LT) = :12")- (flm + fllth + flZIWj ) . (9) Pa Now this model looks like a 2 Parameter Logistic (2PL) IRT model with DIF parameters. fig corresponds to the discrimination parameter, and -(fl0,+ finGj) to the product of the discrimination and difficulty pararnters. 
Alternatively, in terms of the Rasch model, when the discrimination parameters of all items are set to the same value, the model can be rewritten as:

\[ \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \alpha W_j + \sum_{t=1}^{k} z_{tij}\,(\beta_{0t} + \beta_{1t} G_j). \tag{10} \]

Equation (10) can be regarded as the special case of Equation (9) when $\beta_{21} = \cdots = \beta_{2k} = \alpha$.

The variance matrix shown in Equation (8) makes it possible for MMLR to deal with the correlations between the responses of the same examinee to different items. For DIF detection, however, only the diagonal, compound-symmetric, or unconstrained structures can be selected. Autoregressive structures are not reasonable since they cannot be interpreted in practice. In the simulation study, the simplest variance structure is assumed. First, $\sigma_{ij} = 0$ ($i, j = 1, 2, \ldots, k$ and $i \neq j$) is set in the matrix of Equation (8). If $\sigma_{ij} \neq 0$, the responses of an examinee to different items are correlated, or locally dependent, which conflicts with the local item independence assumption. But the simulated data are generated based on that assumption, so the constraint $\sigma_{ij} = 0$ is set. Then, $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \phi$ is also assumed to simplify the operation. Now, $\operatorname{Var}(e_j) = \phi I$, where $I$ stands for the identity matrix. SAS PROC GLIMMIX can be employed to estimate this type of model.

Chapter 4 HGLM DIF Detection Methods

This chapter describes the Hierarchical Generalized Linear Model DIF detection method, points out the confusing equations and potential problems, and modifies the model to solve these problems.

4.1 Kamata’s HGLM DIF Model

The earlier papers about multilevel models for binary data were published in 1985 (Guo & Zhou, 2000). This type of multilevel model is included in the Hierarchical Generalized Linear Model (HGLM) framework (Raudenbush & Bryk, 2002). Kamata (1998, 2001) outlined the extension of HGLM to IRT-style item analysis and DIF analysis.
His model assumes that, given the item effects and the test-takers’ abilities, $y_{ij}$ takes on a value of 1 with probability $p_{ij}$. The level-1 model is

\[ y_{ij} \mid p_{ij} \sim \text{Bernoulli}(p_{ij}), \qquad \eta_{ij} = \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \pi_{0j} + \sum_{q=1}^{k-1} \pi_{qj} z_{qij} \tag{11} \]

where $z_{qij}$ is a dummy variable that takes on a value of unity if response $i$ for person $j$ is to item $q$, and 0 otherwise; $\pi_{qj}$ is thus the difference in log-odds of a correct response between item $q$ and a “reference item” for examinee $j$; and $k$ is the number of items. While there are $k$ items, only $(k-1)$ dummy variables are included. The item not included is the reference item, whose difficulty is arbitrarily set to zero. Then, an unconditional model is formulated for the abilities, and all the item effects in the level-2 model are fixed:

\[ \pi_{0j} = \beta_{00} + u_{0j}, \quad u_{0j} \sim N(0, \tau_{00}); \qquad \pi_{qj} = \beta_{q0} \ \text{for } q > 0 \tag{12} \]

where $u_{0j}$ is a random component of $\pi_{0j}$ and is normally distributed with mean 0 and variance $\tau_{00}$. According to the studies of Kamata (1998; 2001), $u_{0j}$ is considered to be the ability of person $j$, which is consistent with the one from BILOG based on the Rasch model, and $-(\beta_{q0} + \beta_{00})$ corresponds to the difficulty in the Rasch model. Now the ability of persons is a random variable. In SAS PROC GLIMMIX, $u_{0j}$ is the G-side random effect (SAS Institute Inc., 2008).

When the HGLM model is used to find DIF, especially uniform DIF, the level-2 model is changed into the one shown as follows:

\[ \pi_{0j} = \beta_{00} + u_{0j}, \quad u_{0j} \sim N(0, \tau_{00}); \qquad \pi_{qj} = \beta_{q0} + \beta_{q1} G_j \ \text{for } q > 0 \tag{13} \]

where $G_j$ denotes the group membership of person $j$. If the fixed coefficient of an item, $\beta_{q1}$, is significant, the item is thought to have uniform DIF. However, this model does not examine whether the reference item has DIF if no group membership variable appears in the random intercept $\pi_{0j}$ in Equation (13).

On this problem, Kamata’s position causes some confusion. Sometimes he (e.g.
Kamata, 1998; Kamata et al., 2005) put a group membership variable in the random intercept $\pi_{0j}$ of Equation (13) to show whether the reference item has DIF. It is feasible to do so theoretically and practically. But sometimes he (e.g. Kamata & Binici, 2003) deleted it from this random intercept $\pi_{0j}$. Besides, other researchers, such as Kim (2003) and Luppescu (2002), also gave the same model as Kamata and Binici did in their article of 2003. The likely reason is that they met a problem when interpreting a reference item with DIF. According to the definition of Raudenbush and Bryk (2002), $\pi_{0j}$ is the ability of examinee $j$, so $\beta_{00}$ should be the average ability of all examinees. Therefore, adding a group membership variable for $\pi_{0j}$ can test whether the examinees in the different groups have different average abilities. Of course, $\beta_{00}$ can also be viewed as the difficulty of the reference item. But if its difficulty varies across groups, how can it be a reference?

4.2 The Modification of Kamata’s HGLM

To reduce this difficulty of interpretation, and using the notation of Equations (11), (12), and (13), Kamata’s unconditional model is reformulated as follows:

Level-1:

\[ y_{ij} \mid p_{ij} \sim \text{Bernoulli}(p_{ij}), \qquad \eta_{ij} = \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \pi_{0j} + \sum_{q=1}^{k} \pi_{qj} z_{qij} \tag{14} \]

Level-2:

\[ \pi_{0j} = u_{0j}, \quad u_{0j} \sim N(0, \tau_{00}); \qquad \pi_{qj} = \beta_{q0} \ \text{for } q > 0 \]

As compared with Equation (9), $z_{qij}$ has the same meaning, but $\pi_{qj}$ or $\beta_{q0}$ is now the log-odds of person $j$’s response to item $q$, since no reference item is defined in the model. The unconditional model is still algebraically equivalent to Kamata’s model, i.e., his generalized Rasch model with random person ability (Kamata, 1998; 2001). The random intercept $\pi_{0j}$ represents person ability, and $-\beta_{q0}$ still corresponds to the estimate of the item’s difficulty. Equation (14) reparameterizes Equation (11).
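As a numerical check on this reparameterization, the following sketch computes the linear predictor for every item under both forms and confirms that they agree. It is illustrative only: the coefficient values are hypothetical, the last item is arbitrarily taken as the reference item, and the function names are not from the dissertation.

```python
def eta_reference(u, beta00, effects, item):
    """Equations (11)-(12): intercept beta00 plus k-1 item effects.

    effects[q-1] is beta_q0 for items q = 1..k-1; item k is the reference item,
    whose effect is fixed at zero.
    """
    k = len(effects) + 1
    item_effect = effects[item - 1] if item < k else 0.0
    return beta00 + u + item_effect

def reparameterize(beta00, effects):
    """Item log-odds for Equation (14): one coefficient per item, no intercept."""
    return [beta00 + e for e in effects] + [beta00]

def eta_no_reference(u, item_logodds, item):
    """Equation (14): random person effect plus the item's own coefficient."""
    return u + item_logodds[item - 1]

# Hypothetical values for a 4-item test (illustrative only).
beta00, effects, u = 0.4, [-1.0, 0.3, 0.8], 0.25
item_logodds = reparameterize(beta00, effects)

pairs = [(eta_reference(u, beta00, effects, q), eta_no_reference(u, item_logodds, q))
         for q in range(1, 5)]
for q, (old, new) in enumerate(pairs, start=1):
    print(f"item {q}: reference form {old:.3f}, no-reference form {new:.3f}")
```

The two forms give the same predicted log-odds for every item; only the labeling of the coefficients changes, which is why the choice between them is a matter of interpretation rather than fit.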
In Equation (11), if dummy variables for all items are used, the matrix of independent variables of the equation will not be invertible, so the coefficient of one dummy variable is zeroed out and its relevant item is defined as the reference item; this is called reference parameterization by Giesbrecht and Gumpertz (2004). But the matrix can also be made invertible by deleting the level-1 intercept in Equation (11) and keeping the dummy variables for all items. The model then changes into Equation (14). So, they are algebraically equivalent, but Equation (14) has $k$ dummy variables at level-1 and no reference item. When the model shown by Equation (14) is applied to look for DIF, the level-2 models of Equation (14) may be rewritten as:

\[ \pi_{0j} = u_{0j}, \quad u_{0j} \sim N(0, \tau_{00}); \qquad \pi_{qj} = \beta_{q0} + \beta_{q1} G_j \ \text{for } q > 0 \tag{15} \]

This DIF model is also mathematically equivalent to the HGLM DIF model which Kamata and others presented in their papers in 1998 and 2005, but it is more easily interpreted than Kamata’s. $\beta_{q1}$ will be used to test whether an item shows DIF. In this study, the new DIF models will be used to identify DIF items. The new model will be called the HGLM DIF model or detection procedure, and the original model of Kamata will be called Kamata’s HGLM.

In contrast with the standard logistic regression DIF procedure, the HGLM approach has a disadvantage both before and after being modified. The random variable $u_{0j}$ in HGLM corresponds to the matching variable in the standard logistic regression DIF model. The random variable should be the matching variable in HGLM because a matching variable must be constructed to create comparable subsets of examinees in DIF techniques (Camilli, 2007). These HGLM procedures assume that the abilities of persons conform to the same normal distribution, that is, $N(0, \tau_{00})$, as mentioned in the above equations.
So, theoretically, the abilities of all examinees should have the same expected value 0 and variance $\tau_{00}$, although $u_{0j}$ can be regarded as the estimate of a person's ability in practice. The assumption is not always true. Practically, after the group membership variable is added into the models, the random ability or residual $u_{0j}$ in HGLM Equations (12) and (14) changes and is no longer the ability estimate, so it cannot serve as a matching variable to adjust the abilities of different groups or to match the two groups. The subsequent simulation study gives evidence to support this conclusion.

Therefore, Kim (2003) set up an HGLM DIF detection procedure with a matching variable, which can identify uniform and nonuniform DIF. In his procedure, the matching variable is the estimate of person ability, which is the residual $u_{0j}$ of Equation (12) as suggested by Kamata (1998, 2001) and mentioned in Section 4.1. But the random effect was used as fixed in his method, so the HGLM procedure also needs a fixed matching variable $W_j$. If his procedure is used only to look for uniform DIF, Equation (15) can be reformulated as follows:

$$\pi_{0j} = u_{0j}, \quad u_{0j} \sim N(0, \tau_{00})$$
$$\pi_{qj} = \beta_{q0} + \beta_{q1} G_j + \beta_{q2} W_j \quad \text{for } q > 0 \qquad (16)$$

In this model, the item parameters $\pi_{qj}$ change according to the person ability. In addition, in light of the Rasch model, Equation (15) can be written as:

$$\pi_{0j} = \beta_{01} W_j + u_{0j}, \quad u_{0j} \sim N(0, \tau_{00})$$
$$\pi_{qj} = \beta_{q0} + \beta_{q1} G_j \quad \text{for } q > 0 \qquad (17)$$

$\pi_{0j}$ is still an estimate of person ability, and the random error can then be described as the error of the measurement of person ability.

4.3 Differences between MMLR and HGLM

Generally, there are the following differences between the MMLR model and HGLM.

- MMLR is a genuinely multivariate analytic method and deals with multiple dependent variables if $\sigma_{ij} \neq 0$ ($i \neq j$) in the covariance matrix of Equation (6), while HGLM uses a univariate method to do so, as mentioned, and is actually a special case of Hox's MMLR model.
When every coefficient in Equation (11) is set as random, it becomes Hox's unconditional MMLR model. So, keeping only one random coefficient changes the model from multivariate to univariate.
- MMLR can test whether the extra-binomial or scale parameter is equal to 1, that is, whether the dependent variables follow the Bernoulli distribution. In HGLM the parameter is constrained to 1, so it is not possible to test whether the assumption is reasonable.
- As mentioned before, MMLR has no random effect at the first level and has one at the second level (Yang et al., 2000), while HGLM has random effects at both level 1 and level 2.
- According to the definition of Kamata (1998, 2001), the residuals $u_{0j}$ can be regarded as estimates of person ability that are consistent with the estimates from the BILOG program. MMLR, however, does not give the residual or the estimate of ability in that way.
- Finally, MMLR may deal with multidimensional tests more easily than HGLM when they are used to detect DIF in a test and appropriate matching variables are applied, for example, in a science test including biology and physics items. If HGLM is used, another dummy variable needs to be added to the model to indicate the different dimensions (Kamata, 1998), while MMLR does not need an additional dummy variable because it is a genuinely multivariate analysis.

Chapter 5 Estimation Methods

In this chapter, the estimation methods of MMLR and HGLM are introduced. The first section presents the linearization and integral approximation methods and explains why the linearization-based methods are selected in the study. The second section introduces some estimation methods in the SAS GLIMMIX procedure. Finally, some details about these estimation methods are given.

5.1 Linearization and Integral Approximation Methods

It is easy to estimate the LR models by Maximum Likelihood. They are estimated by SAS PROC LOGISTIC in the study.
It is much more difficult and complicated to estimate MMLR and HGLM by Maximum Likelihood. First, the marginal distribution of $Y$ must be approximated. The relevant approaches can mainly be classified into two broad categories, linearization and integral approximation methods (Schabenberger & Pierce, 2002). A linearization method approximates the nonlinear mixed model by a Taylor series to arrive at a pseudo-model, whereas integral approximation methods use quadrature or Monte Carlo integration to calculate the marginal distribution of the data and maximize its likelihood (Schabenberger & Pierce, 2002). The linearization methods subsume Pseudo-Likelihood (PL), Penalized or Predictive Quasi-Likelihood (PQL) and Marginal Quasi-Likelihood (MQL). PL is almost the same as PQL and MQL except that PL explicitly estimates the extra-binomial or extra-dispersion parameter (Guo & Zhao, 2000). The integral approximation methods include Laplace, quadrature and Markov Chain Monte Carlo methods.

The linearization-based methods have a relatively simple form of the linearized model, which typically can be fit using only the mean and variance in the linearized form. They can fit models for which the joint distribution is difficult or impossible to ascertain, as well as models with correlated errors, a large number of random effects, crossed random effects, and multiple types of subjects. However, the disadvantages of these approaches include the absence of a true objective function for the overall optimization process and potentially biased estimates of the covariance parameters, especially for binary data (SAS Institute, 2008). These approaches are also such crude approximations that fit statistics based on the likelihood (e.g., deviance, Akaike's and Bayesian Information Criteria) are not recommended for use (Hox, 2002, p. 110; Snijders & Bosker, 1999, p. 220).
In contrast with the linearization-based methods, integral approximation methods provide an actual objective function for optimization, which enables researchers to perform likelihood ratio tests among nested models and to compute likelihood-based fit statistics (SAS Institute, 2008). The integral approximation methods are also more accurate than the linearization methods; for example, the Laplace approximation is more precise than PQL and MQL (Raudenbush, Yang & Yosef, 2000). But integral approximation methods have difficulty accommodating crossed random effects, multiple subject effects, and complex R-side covariance structures, and they are practically feasible only for a small number of random effects (SAS Institute, 2008).

In light of these considerations, a linearization method has to be selected to estimate MMLR in this study because MMLR has complex R-side covariance structures. SAS PROC GLIMMIX is therefore used to estimate MMLR, since it implements one linearization-based method, Pseudo-Likelihood. For the purpose of comparison, it is better to apply the same software to HGLM, so HGLM is also estimated by the procedure.

5.2 SAS GLIMMIX Procedure

SAS PROC GLIMMIX implements Pseudo-Likelihood. As noted, PL is almost the same as PQL and MQL except that PL explicitly estimates the extra-binomial or extra-dispersion parameter (Guo & Zhao, 2000). Some scholars (e.g., Van den Noortgate et al., 2003) even say that the SAS GLIMMIX macro (procedure) employs PQL and MQL. For consistency, it is also used to estimate HGLM.
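The contrast drawn in Section 5.1 between evaluating the model at a fixed expansion point and integrating over the random effect can be illustrated numerically. The sketch below is not the GLIMMIX algorithm: it uses a plain midpoint rule (an assumption chosen for simplicity, rather than the Gauss-Hermite or adaptive quadrature that software typically uses) to approximate the marginal success probability of a random-intercept logistic model, and compares it with the inverse logit evaluated at $u = 0$. The function names and the values of `eta` and `tau` are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_prob(eta, tau, n_points=2001, width=8.0):
    """Midpoint-rule approximation of E[expit(eta + u)] with u ~ N(0, tau)."""
    if tau <= 0:
        return expit(eta)
    sd = math.sqrt(tau)
    lo, hi = -width * sd, width * sd
    h = (hi - lo) / n_points
    total = 0.0
    for i in range(n_points):
        u = lo + (i + 0.5) * h
        density = math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += expit(eta + u) * density * h
    return total


eta, tau = 1.0, 1.0  # hypothetical fixed part and random-intercept variance
print("inverse logit at u = 0:      ", round(expit(eta), 4))
print("integrated over u ~ N(0, tau):", round(marginal_prob(eta, tau), 4))
```

The integrated probability is pulled toward 0.5 relative to the value at $u = 0$, which gives some intuition for why approximations that expand around the mean of the random effects can be biased when the random-effect variance is not small.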
Although PQL and MQL were found to have downward bias (Breslow & Clayton, 1993; Rodriguez & Goldman, 1995), the SAS GLIMMIX macro (procedure) is likely to be adequate for most projects undertaken in social science (Guo & Zhao, 2000), and even the first-order MQL is able to give acceptable estimates for less extreme datasets (Goldstein & Rasbash, 1996).

According to the SAS/STAT User's Guide (SAS Institute, 2008), the GLIMMIX procedure can use four linearization-based methods to estimate these models: RSPL, MSPL, RMPL and MMPL. The abbreviation "PL" stands for pseudo-likelihood techniques. The first letter determines whether estimation is based on a residual likelihood ("R") or a maximum likelihood ("M"). The second letter identifies the expansion locus for the underlying approximation. The expansion locus of the first-order Taylor series expansion is either the vector of random effects solutions ("S") or the mean of the random effects ("M"). The expansions are also referred to as the "S"ubject-specific and "M"arginal expansions. RSPL is the default estimation method of PROC GLIMMIX. Of the four, RSPL and MSPL correspond to PQL, and RMPL and MMPL correspond to MQL.

In the process of parameter estimation, several optimization techniques can be selected in the SAS procedure. For the Generalized Linear Mixed Model, the default is Quasi-Newton Optimization. To obtain convergent outputs, other techniques are also used, such as Newton-Raphson Optimization with Line Search and Newton-Raphson Ridge Optimization. The details of these optimization techniques are given in the SAS/STAT User's Guide.
Newton-Raphson Ridge Optimization is the default for pseudo-likelihood estimation with binary data in the procedure (SAS Institute Inc., 2008). When the simulated data are analyzed, the four estimation methods and these optimization techniques are used in turn until the outputs converge. RMPL and MMPL are used first, since the first-order MQL is the most stable among the first- and second-order MQL and PQL (Snijders & Bosker, 1999).

5.3 Pseudo-Likelihood Estimation Based on Linearization

Following Wolfinger and O'Connell (1993) and the SAS/STAT® User's Guide (SAS Institute Inc., 2008), here are some details about pseudo-likelihood estimation based on linearization.

Suppose $Y$ is the ($n \times 1$) vector of observed data and $\gamma$ is an ($r \times 1$) vector of random effects. A Generalized Linear Mixed Model assumes that

$$E[Y \mid \gamma] = g^{-1}(\eta) = \mu$$

where $\eta = X\beta + Z\gamma$; $g(\cdot)$ is a differentiable monotonic link function and $g^{-1}(\cdot)$ stands for its inverse. The matrix $X$ is a ($n$ [...] $> 0$

Level-3:
$$\beta_{00l} = r_{00l}, \quad r_{00l} \sim N(0, \tau_{000})$$
$$\beta_{qsl} = \gamma_{qs0} \quad \text{for } q > 0;\ s = 0, 1, 2$$

If Equation (17) is used, then the level-2 and level-3 models can be shown as:

Level-2:
$$\pi_{0jl} = \beta_{00l} + \beta_{01l} W_{jl} + u_{0jl}, \quad u_{0jl} \sim N(0, \tau_{00})$$
$$\pi_{qjl} = \beta_{q0l} + \beta_{q1l} G_{jl} \quad \text{for } q > 0$$
Level-3:
$$\beta_{00l} = r_{00l}, \quad r_{00l} \sim N(0, \tau_{000})$$
$$\beta_{01l} = \gamma_{010}$$
$$\beta_{qsl} = \gamma_{qs0} \quad \text{for } q > 0;\ s = 0, 1$$

The fixed part of the 3-level HGLM is the same as the one in HGLM 16 and 17. Because the estimates of the variances at level 2 and level 3 were both still zero, the results did not change.

Table 18: Results of the 3-Level MMLR DIF Detection Methods

        Matching variable: Total Score          Matching variable: IRT Ability Estimate
        MMLR 9             MMLR 10              MMLR 9             MMLR 10
Item    Est.       SE      Est.       SE        Est.       SE      Est.       SE
1       -0.103     0.084   -0.042     0.079     -0.122     0.086   -0.038     0.078
2        0.011     0.055    0.001     0.055      0.015     0.056   -0.008     0.055
3        0.014     0.052   -0.078     0.056      0.034     0.052   -0.087     0.056
4       -0.024     0.064    0.005     0.061     -0.035     0.065    0.001     0.060
5        0.004     0.056    0.018     0.054     -0.069     0.062    0.007     0.054
6       -0.069     0.062   -0.054     0.060     -0.070     0.062   -0.058     0.059
7        0.022     0.053   -0.031     0.055      0.028     0.053   -0.043     0.055
8       -0.182*    0.084   -0.063     0.075     -0.233**   0.087   -0.060     0.074
9        0.243**   0.082    0.269**   0.078      0.215*    0.084    0.262**   0.077
10       0.147     0.097    0.220*    0.089      0.096     0.102    0.218*    0.088
11      -0.210**   0.055   -0.232**   0.055     -0.193**   0.055   -0.242**   0.055
12       0.001     0.076    0.084     0.068     -0.036     0.079    0.081     0.067
13       0.006     0.080    0.029     0.077     -0.018     0.082    0.029     0.076
14       0.345**   0.080    0.362**   0.077      0.322**   0.082    0.353**   0.076
15      -0.007     0.056   -0.055     0.058      0.002     0.056   -0.061     0.057
16       0.315**   0.100    0.374**   0.091      0.266*    0.107    0.369**   0.089
17       0.070     0.067    0.072     0.066      0.060     0.068    0.068     0.065
18      -0.063     0.079    0.010     0.072     -0.099     0.082    0.010     0.072
19       0.113     0.070    0.148*    0.066      0.092     0.072    0.142     0.065
20      -0.112*    0.052   -0.195**   0.055     -0.093     0.053   -0.204**   0.055
21      -0.009     0.076    0.034     0.072     -0.029     0.078    0.035     0.071
23       0.057     0.081    0.105     0.076      0.043     0.082    0.104     0.075
24      -0.159**   0.059   -0.165**   0.059     -0.151*    0.060   -0.168**   0.058
25       0.070     0.060    0.061     0.060      0.074     0.060    0.054     0.059
26      -0.181*    0.079   -0.097     0.073     -0.201*    0.080   -0.094     0.072
27      -0.023     0.061   -0.081     0.063     -0.006     0.061   -0.083     0.063
28      -0.109*    0.052   -0.176**   0.055     -0.092     0.053   -0.187**   0.055
29      -0.051     0.065   -0.058     0.064     -0.050     0.065   -0.060     0.063
30      -0.075     0.072   -0.052     0.070     -0.080     0.073   -0.051     0.069
φ        0.940              0.922                0.983              0.921
τ        0.038              0.045                0.034              0.045

Note: The table displays the coefficients that are relevant to the DIF detection and their standard errors. In each cell, the first number is the estimate of the coefficient and the second is its standard error. * means 0.01 < p < 0.05 and ** means p < 0.01. The last row is the variance estimate of the random effect of the model.
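The significance flags in Table 18 can be approximately reproduced from the reported estimates and standard errors. The following is a minimal sketch that assumes a two-sided Wald test with a standard normal reference distribution; the software may instead use a t distribution with model-based degrees of freedom, which gives nearly the same result in large samples. The function names are hypothetical, and the three example rows are taken from the MMLR 9 column with total score as the matching variable.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value of estimate / std_error under a standard normal."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


def stars(p):
    """Flag used in the table note: ** for p < 0.01, * for 0.01 <= p < 0.05."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# (item, estimate, standard error) from Table 18, MMLR 9, total score
rows = [(2, 0.011, 0.055), (8, -0.182, 0.084), (11, -0.210, 0.055)]
for item, est, se in rows:
    p = wald_p_value(est, se)
    print(f"item {item}: z = {est / se:.2f}, p = {p:.4f} {stars(p)}")
```

Because the table reports rounded values, a flag computed this way can differ from the printed one for coefficients whose p-value lies very close to 0.05 or 0.01.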
Chapter 8 Discussions

This chapter summarizes the findings of the study, shows the advantages and disadvantages of the MMLR DIF detection methods, explains why some of the results appear, and describes the limitations of the study.

8.1 Using the MMLR DIF Detection Method

In the simulation study, the MMLR DIF models were shown to have greater power for DIF detection than RLR, and in the real test study these models showed similar results. Although their Type I error rates were also greater than RLR's, the similarity rate of the results between RLR and these MMLR models is greater than 95%; in particular, the rate for MMLR 9 is 99.6% when the diagonal variance matrix is applied. If the unstructured or other reasonable variance matrices are employed, it is expected that MMLR will give more accurate results for DIF detection. So, if LR can be used to detect DIF, MMLR should also be applicable, especially when large power is needed.

In contrast with other DIF detection methods, the main advantage is that MMLR is able to model the related items of a test. As a natural multilevel model, MMLR can include the variation of examinees from different groups, such as classes, schools and districts. The standard logistic regression DIF model can identify nonuniform DIF, and so can MMLR if an interaction term between the group membership and the matching variable is put in Equation (9). Equation (9) is then rewritten as follows:

$$\log\frac{p_{ij}}{1-p_{ij}} = \sum_{t=1}^{k} z_{tij}\left(\beta_{t0} + \beta_{t1} G_j + \beta_{t2} W_j + \beta_{t3} G_j W_j\right)$$

If $\beta_{t3}$ is significantly different from zero, then item $t$ has nonuniform DIF. By a similar process, the 3-level MMLR nonuniform DIF model is also developed.

Owing to the limitations of the estimation software, MMLR is not appropriate when a great number of examinees take a test or a test has too many items.
Even when the sample size is small and the test does not have too many items, the computer still takes much more time and resources to deal with MMLR than with HGLM and LR. For the 2-level MMLR model, different estimates of the R-side variance matrix only seem to influence the standard errors of the coefficients when the matrix is diagonal, i.e., $\sigma_{ij} = 0$ ($i, j = 1, 2, \ldots, k$ and $i \neq j$). But convergence is another problem if complex variance matrices are applied. When a test has $k$ items, an unconstrained matrix has $k(k+1)/2$ distinct parameters to be estimated. In the real test study, the MEAP reading test has 29 items, so 435 parameters (29 variances and 406 covariances) need estimating if an unconstrained matrix is assumed, and over a hundred thousand students took the test. It is impossible for SAS PROC GLIMMIX to estimate the MMLR model with the unconstrained variance matrix and thousands of examinees.

As shown by the results of this study, MMLR inflates the Type I error rate (see Tables 7, 8 and 9, and Appendix II) if the sample size and DIF effect size are large. The inflated Type I error rate may result from the pseudo-guessing parameter. Jodoin and Gierl (2003) suggested reducing the Type I error rate of LR by using an effect size measure for LR. Possibly, such an additional statistic also needs to be developed for MMLR in the future when it is used to analyze data fitted by the 3PL model.

Given the characteristics of MMLR, the study suggests some recommendations for using MMLR to detect DIF.

First, care is needed when selecting the appropriate variance matrix for MMLR. The diagonal R-side variance matrix is simple and helps to save estimation time, but it should be used only when the local independence assumption is known to be tenable. Methods for measuring local item dependence, such as those discussed by Yen and Fitzpatrick (2006, p. 141), should be employed before using this variance matrix.
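The estimation burden of the different R-side variance matrices can be made concrete by counting their free parameters. The sketch below assumes the standard counts for a symmetric $k \times k$ matrix; the function name and the structure labels are hypothetical and are not GLIMMIX keywords.

```python
def covariance_parameter_count(k, structure):
    """Number of free parameters in a k x k covariance matrix (assumed standard counts)."""
    if structure == "common_variance_diagonal":
        return 1                 # one shared variance, all covariances fixed at 0
    if structure == "diagonal":
        return k                 # one variance per item, covariances fixed at 0
    if structure == "compound_symmetric":
        return 2                 # one common variance and one common covariance
    if structure == "unstructured":
        return k * (k + 1) // 2  # k variances plus k(k-1)/2 covariances
    raise ValueError(f"unknown structure: {structure}")


for k in (20, 29, 40, 60):
    counts = {s: covariance_parameter_count(k, s)
              for s in ("diagonal", "compound_symmetric", "unstructured")}
    print(k, counts, "covariances only:", k * (k - 1) // 2)
```

For 29 items the unstructured matrix has 435 free parameters, 406 of which are covariances, whereas the compound-symmetric matrix needs only two regardless of test length. This is one way to see why a more parsimonious structure is attractive for long tests.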
When it is uncertain whether the assumption is reasonable, the unconstrained variance matrix should be used if the test is not too long, or the compound-symmetric variance matrix if the test has many items.

Second, the IRT ability estimate, if available, should be used as the matching variable in MMLR instead of the total score. The reason is that, in the simulation study, MMLR had larger power and smaller Type I error rates when the IRT ability estimate was used as the matching variable than when the total score was used.

Third, the MMLR model with Equation (9) can be used to detect both nonuniform and uniform DIF, so the interaction term should be included when the model is applied, in case nonuniform DIF would otherwise be missed.

Finally, a third level should be included if it is known that the examinees are nested within clusters, such as schools or states, and these data are available. The hierarchical structure of MMLR is its basic advantage, and the use of the third level may improve the estimation of the model and the DIF detection.

8.2 HGLM Is Unsuitable for DIF Detection

According to the simulation study in Chapter 6 and the analysis in Section 4.2, HGLM is not able to identify DIF unless it has a fixed matching variable. However, when the total score or the IRT ability estimate is added to the HGLM DIF models as the matching variable, the variance estimate of the level-2 random effect is zero in both the simulation and the real test studies. This variance estimate is not reasonable. It means either that neither of the matching variables should be included in the model, or that the random effect should be excluded. If the two variables are inappropriate for the HGLMs, then it is difficult to find a matching variable. If the random effect is excluded, then the model becomes a regular logistic regression model.

Why is the variance estimate zero?
It may be because the total score or any ability estimate is highly correlated with the means of the dependent variable $y$ or the linear predictor $\eta$ across the level-2 units in Equation (11) or (14). Then, for the HGLM DIF models as a whole, most of the variation is explained by the matching variable, so the residual variance is so small that it is close to zero, and SAS PROC GLIMMIX finally sets it to zero. If this hypothesis is correct, then any matching variable might be highly correlated with the means across the level-2 units, i.e., the examinees. Even if it is not correct, neither the total score nor the IRT ability estimate is a good matching variable. The method of Kim (2003), which uses the ability estimate from the residual $u_{0j}$ in Equation (12) as a matching variable, was also tried when analyzing the MEAP data, but the variance estimate was still zero. It remains unclear why the study of Kim (2003) could not be replicated here. It is therefore very difficult to find an appropriate matching variable for the HGLMs, yet the model must have one if it is to be used to find DIF. This is a dilemma. So, HGLM may not be suitable for identifying DIF.

8.3 Effects of the Heterogeneous Variances on the DIF Detection Methods

In the study, it is found that different variances of the ability distributions of the reference and focal groups influence the power and Type I error rates of the LR, HGLM and MMLR DIF approaches. There are two explanations. First, if the variance of one group's ability distribution is much different from that of the other group, one group may have more persons with extreme ability than the other. When these persons are matched in the DIF methods by a matching variable, the regular values of one group are sometimes matched with the outliers of the other group, and poor results then appear. Second, for the HGLM models, the random matching variable for all persons is assumed to follow the same normal distribution.
The assumption is not tenable, whether or not the models include the group membership variable, when the ability distributions of the two groups have different variances. It must influence the results of the DIF detection procedures.

8.4 Comparisons of the Coefficients and Their Errors in MMLR and HGLM

The simulation study shows that, for the three types of MMLR models, MMLR 7, 9 and 10, which correspond respectively to HGLM 15, 16 and 17, the estimates of the corresponding coefficients in the paired models are very similar, and their standard errors in MMLR are smaller than their counterparts in the corresponding HGLM. Why does that happen? It is related to the specification and estimation of the two kinds of models. For MMLR 7, combining the two parts of Equation (7), and given the independent variables, the following equation is obtained:

$$y_{ij} = \left\{1+\exp\left[-\sum_{t=1}^{k} z_{tij}(\beta_{t0}+\beta_{t1}G_j)\right]\right\}^{-1} + e_{ij}\left\{1+\exp\left[-\sum_{t=1}^{k} z_{tij}(\beta_{t0}+\beta_{t1}G_j)\right]\right\}^{-2}\exp\left[-\sum_{t=1}^{k} z_{tij}(\beta_{t0}+\beta_{t1}G_j)\right]$$

Then the expected value of $y_{ij}$ is

$$E(y_{ij} \mid X) = \left\{1+\exp\left[-\sum_{t=1}^{k} z_{tij}(\beta_{t0}+\beta_{t1}G_j)\right]\right\}^{-1} \qquad (18)$$

where $X$ denotes all independent variables in the models. For the modified HGLM DIF procedure, the outcome variable $y_{ij}$ can be written as $y_{ij} = p_{ij} + e_{ij}$, where $E(e_{ij}) = 0$ and $\text{Var}(e_{ij}) = p_{ij}(1 - p_{ij})$ (Snijders & Bosker, 1999). Then, combining Equations (12) and (14), $y_{ij}$ can be written as follows:

$$y_{ij} = \left\{1+\exp\left[-\sum_{q=1}^{k} z_{qij}(\beta_{q0}+\beta_{q1}G_j) - u_{0j}\right]\right\}^{-1} + e_{ij}$$

Since $E(e_{ij}) = 0$, if the first-order Taylor series expansion is used, the expected value of $y_{ij}$ is approximated by

$$E(y_{ij} \mid X) \approx \left\{1+\exp\left[-\sum_{q=1}^{k} z_{qij}(\beta_{q0}+\beta_{q1}G_j) - E(u_{0j})\right]\right\}^{-1}$$

Because $E(u_{0j}) = 0$,

$$E(y_{ij} \mid X) \approx \left\{1+\exp\left[-\sum_{q=1}^{k} z_{qij}(\beta_{q0}+\beta_{q1}G_j)\right]\right\}^{-1} \qquad (19)$$

So, comparing Equation (18) with Equation (19), the expected values of the outcome variables in MMLR 7 and HGLM 15 have the same expressions.
This implies that the estimates of the fixed effects in the two models will be similar if the same data set is analyzed, the random effect of HGLM is omitted, and the coefficients of the two models are estimated by approaches that apply the first-order Taylor series expansion to obtain the approximation, such as PQL, MQL and the PL methods. This situation arises when RMPL or MMPL is used to estimate MMLR 7 and HGLM 15. The two methods are similar to MQL. The difference between PQL and MQL is that the Taylor series is expanded around $u_{0j} = 0$ in MQL, while it is expanded around the approximate posterior mode in PQL (Raudenbush & Bryk, 2002). So, if MQL (RMPL or MMPL) is used to estimate HGLM, $u_{0j} = 0$ is applied; MMLR 7 has no G-side random effect, so $u_{0j} = 0$ there as well, and the coefficients in HGLM 15 and MMLR 7 are estimated from the same equation. They therefore have similar coefficient estimates, as shown in the results of the simulation study.

However, HGLM actually has two random effects, $u_{0j}$ at level 2 (G-side) and a random effect at level 1 (R-side), whereas MMLR 7 has only the R-side random effect. The scale parameter influences the standard errors of the regression coefficients, as it does in the Generalized Linear Model, but it has little effect when it is close to 1. The parameter is in fact close to 1 in the simulation study. So, the estimates of the fixed effects in MMLR have smaller standard errors. Since the estimates are similar and HGLM is estimated by RMPL in the simulation study, hypothesis tests reach significance more easily in MMLR 7 than in HGLM 15. Therefore, MMLR 7 always has greater power and a greater Type I error rate than HGLM 15 under the same condition.

By the same reasoning, MMLR 9 and HGLM 16, and MMLR 10 and HGLM 17, respectively have the same expressions for the fixed effects, and the estimates of these fixed effects in the former also have smaller standard errors.
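The effect of the scale parameter on the standard errors discussed above can be sketched numerically. The example assumes the usual quasi-likelihood result that a coefficient's standard error is multiplied by the square root of the scale parameter; the estimate, standard error and scale value are hypothetical and are not taken from the simulation study.

```python
import math


def scaled_standard_error(se_unscaled, scale):
    """Standard error after multiplying by the square root of the scale parameter."""
    return se_unscaled * math.sqrt(scale)


def wald_z(estimate, std_error):
    """Wald statistic for a single coefficient."""
    return estimate / std_error


estimate = 0.150     # hypothetical DIF coefficient
se_unscaled = 0.080  # hypothetical standard error with the scale fixed at 1
scale = 0.92         # hypothetical estimated scale parameter, below 1

se_scaled = scaled_standard_error(se_unscaled, scale)
print("unscaled SE:", se_unscaled, "z =", round(wald_z(estimate, se_unscaled), 3))
print("scaled SE:  ", round(se_scaled, 4), "z =", round(wald_z(estimate, se_scaled), 3))
```

With the same coefficient estimate, a scale parameter below 1 shrinks the standard error and enlarges the test statistic, so both true and false rejections become more likely. This matches the pattern of higher power together with higher Type I error rates.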
So, if an appropriate matching variable is used in these models and the estimate of the variance of the level-2 random effect is not zero, the former may still have greater power and Type I error rates than the latter when RMPL is employed to estimate the latter.

In the study, the estimate of the G-side variance is zero in HGLM 16 and 17, so the HGLM models reduce to regular logistic regression models. They are shown as follows:

$$\log\frac{p_{ij}}{1-p_{ij}} = \sum_{q=1}^{k} z_{qij}\left(\beta_{q0} + \beta_{q1} G_j + \beta_{q2} W_j\right) \qquad (20)$$

and

$$\log\frac{p_{ij}}{1-p_{ij}} = \beta_{01} W_j + \sum_{q=1}^{k} z_{qij}\left(\beta_{q0} + \beta_{q1} G_j\right) \qquad (21)$$

Equations (20) and (21) are the regular logistic regression models corresponding to HGLM 16 and 17, respectively. In this situation, the coefficient estimates are still similar in the two kinds of models, and the differences depend on their standard errors. Under this condition, the standard error of a coefficient in MMLR is its counterpart in the regular logistic regression model multiplied by the corresponding $\sigma_i$, i.e., the square root of the corresponding diagonal entry in the variance matrix of the 2-level MMLR model. So, if both have the same fixed part, the standard errors of the coefficient estimates in MMLR are smaller than those in the regular logistic regression model when the scale parameters are less than 1. Most of the scale parameters of MMLR in the simulation study are indeed smaller than 1. Therefore, in most cases, the MMLR models have larger power and Type I error rates than the corresponding HGLM models.

However, the simulation and the real test studies show that HGLM 16 without the random effect has the same coefficient estimates and standard errors as RLR. Why? Like the standard logistic regression DIF model, RLR runs an individual analysis for each item.
If $k$ dummy variables are used to identify these models for the different items, the individual RLR models are combined by these variables and merged into one model, which is HGLM 16 without any random effect, as expressed in Equation (20). This model is simply the collection of the RLR models for all items. The coefficient estimates and their standard errors in the combined model do not change although the individual RLR models are put together (the estimates come from SAS PROC GLIMMIX and PROC LOGISTIC respectively, so some small differences still exist between the corresponding estimates in the simulation study). In this way, the LR DIF model is also able to analyze all items in a single run.

8.5 Limitations

In the simulation study, $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \phi$ and $\sigma_{ij} = 0$ ($i, j = 1, 2, \ldots, k$ and $i \neq j$) are assumed. But these assumptions may not be correct or reasonable for real tests. At the same time, constraining the covariances to 0 may make the multivariate model effectively univariate, so that it loses its advantage over the univariate model.

This simulation study is not designed to explore the effects of the sample size ratio between the reference and focal groups or of the proportion of item contamination. The sample size ratio between the reference and focal groups may influence DIF detection. If the proportion of items contaminated with DIF is set at a different percentage, the results may be different.

In some statistical tests of the simulation study, e.g., Wilks' Lambda and paired t tests, the calculated power and Type I error rates of the 7 models are the dependent variables. They may not be normally distributed. For the Wilks' Lambda tests in MANOVA, the test of the heterogeneity of variances cannot be carried out because the results of RLR and HGLM 16 are too similar, so it is unknown whether they have heterogeneous variance matrices. These factors affect the robustness of the statistical tests.
For the real data, an unconstrained R-side variance matrix may be more reasonable than the others. As mentioned, convergence is a big problem. Even if a convergent output exists, SAS PROC GLIMMIX takes a long time, possibly several days, to produce it. Other multilevel modeling software, for example MLwiN, may need to be tried.

Finally, this study does not address nonuniform DIF. It is mentioned in this chapter that MMLR 9 can be extended to identify nonuniform DIF. HGLM also has an extension for nonuniform DIF. As noted in Section 4.2, Kim (2003) extended Kamata's model to identify nonuniform DIF, and Equation (16) is the reduced form of his model. His level-2 model is written as follows:

$$\pi_{0j} = u_{0j}, \quad u_{0j} \sim N(0, \tau_{00})$$
$$\pi_{qj} = \beta_{q0} + \beta_{q1} G_j + \beta_{q2} W_j + \beta_{q3}(G_j \times W_j) \quad \text{for } q > 0$$

If this model and the MMLR nonuniform DIF model in Section 8.1 are applied to identify uniform and nonuniform DIF in the 2005 MEAP reading test, they give the same results as the full logistic regression DIF model when all of them use the same matching variable, but the estimated variance of $u_{0j}$ is still 0. If the different matching variables, total score and IRT ability estimate, are used in these methods, they give different results.
APPENDICES

Appendix I: The Calculated Power Rates for LR, HGLM and MMLR

The following numbers are calculated when the IRT ability estimate is used as the matching variable:

Test length  Ability distribution  Sample size  Difference b  LR     HGLM 16  HGLM 17  MMLR 9  MMLR 10
20           N(0,1)                400          0.25          0.139  0.139    0.141    0.140   0.142
20           N(0,1)                400          0.50          0.374  0.374    0.372    0.373   0.373
20           N(0,1)                400          0.75          0.679  0.679    0.672    0.676   0.673
20           N(0,1)                700          0.25          0.215  0.215    0.209    0.215   0.209
20           N(0,1)                700          0.50          0.581  0.581    0.577    0.582   0.576
20           N(0,1)                700          0.75          0.854  0.854    0.842    0.855   0.842
20           N(0,1)                1000         0.25          0.279  0.279    0.278    0.284   0.281
20           N(0,1)                1000         0.50          0.697  0.697    0.695    0.699   0.696
20           N(0,1)                1000         0.75          0.920  0.920    0.909    0.920   0.909
20           N(0,9)                400          0.25          0.135  0.135    0.134    0.135   0.133
20           N(0,9)                400          0.50          0.401  0.401    0.399    0.399   0.396
20           N(0,9)                400          0.75          0.666  0.666    0.663    0.665   0.660
20           N(0,9)                700          0.25          0.203  0.203    0.199    0.203   0.198
20           N(0,9)                700          0.50          0.576  0.576    0.579    0.576   0.576
20           N(0,9)                700          0.75          0.847  0.847    0.833    0.847   0.833
20           N(0,9)                1000         0.25          0.278  0.278    0.278    0.279   0.277
20           N(0,9)                1000         0.50          0.706  0.706    0.700    0.706   0.698
20           N(0,9)                1000         0.75          0.921  0.921    0.903    0.921   0.902
20           N(5,1)                400          0.25          0.128  0.128    0.127    0.127   0.125
20           N(5,1)                400          0.50          0.389  0.389    0.386    0.387   0.382
20           N(5,1)                400          0.75          0.657  0.657    0.648    0.655   0.645
20           N(5,1)                700          0.25          0.202  0.202    0.201    0.202   0.199
20           N(5,1)                700          0.50          0.574  0.574    0.573    0.574   0.572
20           N(5,1)                700          0.75          0.851  0.851    0.841    0.851   0.840
20           N(5,1)                1000         0.25          0.265  0.265    0.265    0.265   0.265
20           N(5,1)                1000         0.50          0.697  0.697    0.686    0.698   0.686
20           N(5,1)                1000         0.75          0.920  0.920    0.907    0.920   0.906
40           N(0,1)                400          0.25          0.087  0.087    0.090    0.092   0.097
40           N(0,1)                400          0.50          0.193  0.193    0.193    0.200   0.206
40           N(0,1)                400          0.75          0.388  0.388    0.365    0.399   0.382
40           N(0,1)                700          0.25          0.112  0.112    0.119    0.118   0.130
40           N(0,1)                700          0.50          0.299  0.299    0.288    0.312   0.305
40           N(0,1)                700          0.75          0.558  0.558    0.546    0.568   0.562
40           N(0,1)                1000         0.25          0.136  0.136    0.133    0.145   0.147
40           N(0,1)                1000         0.50          0.399  0.399    0.370    0.416   0.395
40           N(0,1)                1000         0.75          0.653  0.653    0.646    0.664   0.656
40           N(0,9)                400          0.25          0.073  0.073    0.079    0.077   0.081
40           N(0,9)                400          0.50          0.183  0.183    0.173    0.186   0.174
40           N(0,9)                400          0.75          0.362  0.362    0.352    0.367   0.352
40           N(0,9)                700          0.25          0.101  0.101    0.105    0.106   0.109
40           N(0,9)                700          0.50          0.282  0.282    0.271    0.291   0.274
40           N(0,9)                700          0.75          0.538  0.538    0.538    0.546   0.540
40           N(0,9)                1000         0.25          0.131  0.131    0.137    0.139   0.140
40           N(0,9)                1000         0.50          0.376  0.376    0.369    0.385   0.373
40           N(0,9)                1000         0.75          0.646  0.646    0.637    0.655   0.640
40           N(5,1)                400          0.25          0.083  0.083    0.087    0.084   0.088
40           N(5,1)                400          0.50          0.178  0.178    0.172    0.181   0.172
40           N(5,1)                400          0.75          0.352  0.351    0.345    0.353   0.346
40           N(5,1)                700          0.25          0.104  0.104    0.105    0.107   0.106
40           N(5,1)                700          0.50          0.276  0.276    0.271    0.282   0.273
40           N(5,1)                700          0.75          0.537  0.537    0.533    0.542   0.534
40           N(5,1)                1000         0.25          0.122  0.122    0.129    0.127   0.142
40           N(5,1)                1000         0.50          0.367  0.367    0.330    0.371   0.368
40           N(5,1)                1000         0.75          0.651  0.651    0.623    0.657   0.647
60           N(0,1)                400          0.25          0.148  0.148    0.145    0.152   0.148
60           N(0,1)                400          0.50          0.384  0.384    0.390    0.385   0.391
60           N(0,1)                400          0.75          0.676  0.676    0.715    0.680   0.718
60           N(0,1)                700          0.25          0.198  0.198    0.192    0.202   0.193
60           N(0,1)                700          0.50          0.588  0.588    0.633    0.596   0.640
60           N(0,1)                700          0.75          0.854  0.854    0.915    0.854   0.915
60           N(0,1)                1000         0.25          0.269  0.269    0.287    0.275   0.290
60           N(0,1)                1000         0.50          0.708  0.708    0.768    0.714   0.770
60           N(0,1)                1000         0.75          0.931  0.931    0.971    0.932   0.970
60           N(0,9)                400          0.25          0.125  0.125    0.135    0.125   0.135
60           N(0,9)                400          0.50          0.376  0.376    0.388    0.376   0.385
60           N(0,9)                400          0.75          0.664  0.664    0.718    0.664   0.715
60           N(0,9)                700          0.25          0.190  0.190    0.202    0.190   0.202
60           N(0,9)                700          0.50          0.580  0.580    0.609    0.582   0.608
60           N(0,9)                700          0.75          0.850  0.850    0.906    0.850   0.906
60           N(0,9)                1000         0.25          0.253  0.253    0.260    0.255   0.260
60           N(0,9)                1000         0.50          0.712  0.712    0.773    0.714   0.774
60           N(0,9)                1000         0.75          0.925  0.925    0.969    0.925   0.969
60           N(5,1)                400          0.25          0.133  0.133    0.140    0.133   0.139
60           N(5,1)                400          0.50          0.376  0.376    0.386    0.375   0.384
60           N(5,1)                400          0.75          0.672  0.672    0.716    0.671   0.713
60           N(5,1)                700          0.25          0.199  0.199    0.209    0.200   0.209
60           N(5,1)                700          0.50          0.571  0.571    0.608    0.572   0.608
60           N(5,1)                700          0.75          0.849  0.849    0.904    0.850   0.904
60           N(5,1)                1000         0.25          0.258  0.258    0.273    0.260   0.273
60           N(5,1)                1000         0.50          0.703  0.703    0.753    0.704   0.752
60           N(5,1)                1000         0.75          0.928  0.928    0.969    0.928   0.969

Appendix II: The Calculated Type I Error Rates for LR, HGLM and MMLR

The following numbers are calculated when the IRT ability estimate is used as the matching variable:

Test length  Ability distribution  Sample size  Difference b  LR     HGLM 16  HGLM 17  MMLR 9  MMLR 10
20           N(0,1)                400          0.25          0.055  0.055    0.053    0.055   0.053
20           N(0,1)                400          0.50          0.068  0.068    0.067    0.067   0.067
20           N(0,1)                400          0.75          0.088  0.088    0.084    0.088   0.084
20           N(0,1)                700          0.25          0.061  0.061    0.058    0.062   0.058
20           N(0,1)                700          0.50          0.085  0.085    0.083    0.086   0.083
20           N(0,1)                700          0.75          0.125  0.125    0.120    0.126   0.120
20           N(0,1)                1000         0.25          0.062  0.062    0.064    0.063   0.064
20           N(0,1)                1000         0.50          0.096  0.096    0.092    0.098   0.092
20           N(0,1)                1000         0.75          0.156  0.156    0.148    0.157   0.149
20           N(0,9)                400          0.25          0.054  0.054    0.054    0.053   0.053
20           N(0,9)                400          0.50          0.070  0.070    0.069    0.069   0.069
20           N(0,9)                400          0.75          0.088  0.088    0.086    0.087   0.085
20           N(0,9)                700          0.25          0.060  0.060    0.060    0.061   0.059
20           N(0,9)                700          0.50          0.089  0.089    0.085    0.089   0.084
20           N(0,9)                700          0.75          0.123  0.123    0.119    0.123   0.118
20           N(0,9)                1000         0.25          0.064  0.064    0.063    0.064   0.063
20           N(0,9)                1000         0.50          0.098  0.098    0.093    0.098   0.092
20           N(0,9)                1000         0.75          0.161  0.161    0.151    0.161   0.151
20           N(5,1)                400          0.25          0.054  0.054    0.055    0.054   0.054
20           N(5,1)                400          0.50          0.067  0.067    0.067    0.066   0.065
20           N(5,1)                400          0.75          0.090  0.090    0.088    0.089   0.086
20           N(5,1)                700          0.25          0.061  0.061    0.061    0.061   0.060
20           N(5,1)                700          0.50          0.082  0.082    0.079    0.082   0.078
20           N(5,1)                700          0.75          0.121  0.121    0.118    0.121   0.117
20           N(5,1)                1000         0.25          0.063  0.063    0.060    0.063   0.060
20           N(5,1)                1000         0.50          0.099  0.099    0.093    0.099   0.093
20           N(5,1)                1000         0.75          0.155  0.155    0.148    0.155   0.147
40           N(0,1)                400          0.25          0.103  0.103    0.109    0.108   0.111
40           N(0,1)                400          0.50          0.122  0.122    0.126    0.128   0.127
40           N(0,1)                400          0.75          0.145  0.145    0.151    0.151   0.151
40           N(0,1)                700          0.25          0.135  0.135    0.144    0.143   0.146
40           N(0,1)                700          0.50          0.167  0.167    0.167    0.177   0.168
40           N(0,1)                700          0.75          0.215  0.215    0.217    0.224   0.217
40           N(0,1)                1000         0.25          0.173  0.173    0.178    0.183   0.178
40           N(0,1)                1000         0.50          0.222  0.222    0.225    0.232   0.225
40           N(0,1)                1000         0.75          0.281  0.281    0.274    0.293   0.272
40           N(0,9)                400          0.25          0.102  0.102    0.108    0.106   0.110
40           N(0,9)                400          0.50          0.122  0.122    0.127    0.126   0.127
40           N(0,9)                400          0.75          0.148  0.148    0.151    0.152   0.151
40           N(0,9)                700          0.25          0.142  0.142    0.145    0.148   0.148
40           N(0,9)                700          0.50          0.173  0.173    0.173    0.179   0.175
40           N(0,9)                700          0.75          0.219  0.219    0.213    0.225   0.215
40           N(0,9)                1000         0.25          0.175  0.175    0.175    0.182   0.178
40           N(0,9)                1000         0.50          0.229  0.229    0.222    0.237   0.226
40           N(0,9)                1000         0.75          0.282  0.282    0.262    0.289   0.265
40           N(5,1)                400          0.25          0.105  0.105    0.111    0.107   0.111
40           N(5,1)                400          0.50          0.124  0.124    0.126    0.127   0.126
40           N(5,1)                400          0.75          0.149  0.149    0.149    0.151   0.148
40           N(5,1)                700          0.25          0.138  0.138    0.144    0.142   0.146
40           N(5,1)                700          0.50          0.179  0.179    0.176    0.184   0.177
40           N(5,1)                700          0.75          0.219  0.219    0.213    0.224   0.214
40           N(5,1)                1000         0.25          0.179  0.179    0.196    0.185   0.182
40           N(5,1)                1000         0.50          0.224  0.224    0.240    0.230   0.219
40           N(5,1)                1000         0.75          0.282  0.282    0.292    0.288   0.269
60           N(0,1)                400          0.25          0.056  0.056    0.070    0.057   0.071
60           N(0,1)                400          0.50          0.072  0.072    0.083    0.073   0.084
60           N(0,1)                400          0.75          0.088  0.088    0.099    0.088   0.100
60           N(0,1)                700          0.25          0.052  0.052    0.084    0.054   0.086
60           N(0,1)                700          0.50          0.085  0.085    0.112    0.087   0.115
60           N(0,1)                700          0.75          0.114  0.114    0.141    0.116   0.143
60           N(0,1)                1000         0.25          0.061  0.061    0.100    0.064   0.102
60           N(0,1)                1000         0.50          0.100  0.100    0.136    0.102   0.137
60           N(0,1)                1000         0.75          0.151  0.151    0.192    0.155   0.193
60           N(0,9)                400          0.25          0.055  0.055    0.072    0.055   0.071
60           N(0,9)                400          0.50          0.070  0.070    0.082    0.070   0.082
60           N(0,9)                400          0.75          0.084  0.084    0.098    0.085   0.097
60           N(0,9)                700          0.25          0.058  0.058    0.085    0.058   0.085
60           N(0,9)                700          0.50          0.083  0.083    0.110    0.084   0.109
60           N(0,9)                700          0.75          0.118  0.118    0.144    0.119   0.144
60           N(0,9)                1000         0.25          0.064  0.064    0.101    0.064   0.101
60           N(0,9)                1000         0.50          0.100  0.100    0.137    0.102   0.137
60           N(0,9)                1000         0.75          0.153  0.153    0.190    0.153   0.190
60           N(5,1)                400          0.25          0.057  0.057    0.072    0.057   0.072
60           N(5,1)                400          0.50          0.070  0.070    0.084    0.070   0.083
60           N(5,1)                400          0.75          0.086  0.086    0.099    0.086   0.098
60           N(5,1)                700          0.25          0.055  0.055    0.085    0.055   0.085
60           N(5,1)                700          0.50          0.085  0.085    0.110    0.086   0.109
60           N(5,1)                700          0.75          0.120  0.120    0.142    0.120   0.142
60           N(5,1)                1000         0.25          0.063  0.063    0.099    0.064   0.099
60           N(5,1)                1000         0.50          0.096  0.096    0.137    0.096   0.137
60           N(5,1)                1000         0.75          0.144  0.144    0.183    0.145   0.183

Appendix III: Results of Analyses with the IRT Ability Estimate

The results displayed here are obtained when the IRT ability estimate is used as the matching variable.

1. Output of multivariate analysis of variance for power and Type I error rates of LR, MMLR7, MMLR9, MMLR10, HGLM15, HGLM16 and HGLM17.

Wilks' Den. Num. p- Effect Lambda F D.F . D.F.
value Test Length 1.565x10'5 53.947 28 6 0000 Sample Size 8.471x10‘8 736.038 28 6 0.000 b Difference 4.808x10'9 3090-202 28 6 0.000 Distribution 1.087x10'll 65005.161 28 6 0.000 Test Length x Sample Size 3.457x10'6 6.020 56 13.8 0.000 Test Length x b Dif. 2.179x10'3 0.948 56 13.8 0.584 Test Length x Distribution 1.272x10'6 7.856 56 13.8 0000 Sample Size x b Dif. 7.060x10‘9 30.554 56 13.8 0000 Sample Size x Distribution 2.486x10'll 131.364 56 13.8 0.000 b Dif. x distribution 7.774x10’12 177.200 56 13.8 0.000 Test Length x Sample Size 2.303).“)6 1.514 112 32.7 0.087 x b Dif. Test Length x Sample Size 3,490x10‘8 2.681 112 32.7 0.001 x distribution Test Length x b Dif. x 3,139x1045 1.481 112 32.7 0.099 Distribution Sample Size x b Dif. x 2,572x10-11 9.095 112 32.7 0.000 Distribution 2. Multiple comparisons of power and Type I error rates of LR, MMLR7, MMLR9, MMLRIO, HGLM15, HGLM16 and HGLM17. When repeated measure analysis Of variance (Wilks’ Lambda test) is used, for power, A = 0.09, F =124.28, degrees of freedom are 6 and 75, and p < 0.0001; for Type I error rate, A = 0.17, F=59.27, degrees of freedom are 6 and 75, and p < 0.0001. When the pairs that are of interest are 82 compared, the results of the paired t-tests are shown in the follow table. In every cell, the upper number is the average difference and the lower is its standard error, and * means p < 0.0056 by a Bonferroni correction. HGLM HGLM HGLM MMLR MMLR MMLR l 5 16 17 7 9 10 Power RLR 0.10* 0.000 -0.006 0.059 -0.003 * -0.009* 0.03 0.000 0.003 0.030 0.0004 0.002 HGLM 15 -0.04* 0.002 HGLM 16 -0.003 * 0.0004 HGLM 17 -0.003* 0.001 Type I RLR -0.173 * 0.000 -0.009* -0.21* -0.002* -0.008* Error 0.030 0.000 0.002 0.030 0.0004 0.002 rate HGLM 15 -0.04* 0.002 HGLM 16 -0.002* 0.0004 HGLM 17 -0.001 0.0004 3. Similarity rates of DIF detection between the 7 models with the IRT ability estimate. 
Model    HGLM15  HGLM16  HGLM17  MMLR7  MMLR9  MMLR10
LR       70.30   100.00  94.81   67.98  99.73  94.83
HGLM15           70.30   70.40   95.94  70.22  70.47
HGLM16                   94.82   67.99  99.73  94.84
HGLM17                           68.09  94.80  99.70
MMLR7                                   67.94  68.16
MMLR9                                          94.83
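The similarity rates reported in item 3 of Appendix III can be illustrated with a minimal sketch. It assumes that a similarity rate is the percentage of item-level DIF decisions on which two methods agree; this definition is not stated in the appendix itself. The function names and the example flags below are hypothetical and serve only as an illustration, not as the procedure used to produce the reported values.

```python
from itertools import combinations


def similarity_rate(flags_a, flags_b):
    """Percentage of decisions on which two methods agree.

    Assumption: each list holds one True/False DIF decision per item
    (or per item-by-replication), in the same order for both methods.
    """
    if len(flags_a) != len(flags_b):
        raise ValueError("flag vectors must have equal length")
    if not flags_a:
        raise ValueError("flag vectors must be non-empty")
    agree = sum(a == b for a, b in zip(flags_a, flags_b))
    return 100.0 * agree / len(flags_a)


def similarity_table(flags_by_method):
    """Pairwise similarity rates for every pair of methods (upper triangle)."""
    return {
        (m1, m2): similarity_rate(flags_by_method[m1], flags_by_method[m2])
        for m1, m2 in combinations(flags_by_method, 2)
    }


# Hypothetical decisions (True = item flagged as DIF), for illustration only.
example_flags = {
    "LR":     [True, False, False, True, False, False, True, False],
    "HGLM16": [True, False, False, True, False, False, True, False],
    "MMLR9":  [True, False, True, True, False, False, True, False],
}

table = similarity_table(example_flags)
for (m1, m2), rate in table.items():
    print(f"{m1} vs {m2}: {rate:.2f}")
```

Only the upper triangle is computed because the rate is symmetric in the two methods, which matches the layout of the table in item 3.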