This is to certify that the dissertation entitled

SOME GENERALIZATIONS OF THE RASCH MODEL: AN APPLICATION OF THE HIERARCHICAL GENERALIZED LINEAR MODEL

presented by Akihito Kamata has been accepted towards fulfillment of the requirements for the degree of Doctor of Philosophy in Counseling, Educational Psychology and Special Education.

Major professor
Date: December 15, 1998

SOME GENERALIZATIONS OF THE RASCH MODEL: AN APPLICATION OF THE HIERARCHICAL GENERALIZED LINEAR MODEL

By

Akihito Kamata

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1998

ABSTRACT

SOME GENERALIZATIONS OF THE RASCH MODEL: AN APPLICATION OF THE HIERARCHICAL GENERALIZED LINEAR MODEL

By Akihito Kamata

In this dissertation the Rasch model is generalized as a special case of the hierarchical generalized linear model (HGLM), facilitating various extensions. First, the standard binary-response Rasch model is reformulated according to the HGLM specifications. Since the reformulated model is a special case of the logistic HGLM, it is referred to as a one-parameter hierarchical generalized linear logistic model (1-P HGLLM). Illustrative analyses using hypothetical data sets reveal that parameters estimated via the HLM program are very similar to the estimates from the BILOG program. A parameter recovery study reveals that both item and person parameters are estimated properly by the HLM program. Three extensions are presented, including (a) a model with a person-level predictor, (b) a multidimensional model, and (c) a multi-level model. In the first extension, a person-level predictor is added to the model such that the person abilities, as well as the item parameters, are decomposed into a linear combination of more than one parameter. The coefficient of the person-level predictor is properly estimated in the parameter recovery study, as well as in an illustrative analysis. In the second extension, a multidimensional model is formulated. It is shown that the correlation coefficients between latent traits are properly estimated by the HLM program, and that the multidimensional analysis can distinguish people who have the same raw scores. In the last extension, a three-level 1-P HGLLM, with an additional level for schools, is formulated. An illustrative analysis demonstrates that the empirical Bayes estimation enables one to distinguish people who have the same raw scores based on which school each individual attended. Also, the analysis shows that the empirical Bayes estimation in the three-level model can improve estimation of school means, as well as of person abilities.
Contributions of this work include (a) a pedagogical presentation of this formulation to learners of item response theory (IRT) models, to facilitate understanding of IRT models from a different perspective; (b) an easy application of this formulation to conduct a one-step analysis of binary-response test data; and (c) readily accessible multidimensional, as well as multi-level, analyses of binary-response test data. Several suggestions for future research are also mentioned.

This dissertation is dedicated to my parents, Toshikatsu and Toshiko Kamata.

ACKNOWLEDGEMENT

This dissertation would not have been completed without the assistance and encouragement of many people. I am deeply thankful to Dr. Betsy Becker, my advisor, friend, and chair of my dissertation committee, for her support, direction, patience, and encouragement throughout my doctoral studies. I wish to express my deep gratitude to all my committee members, Dr. Stephen Raudenbush, Dr. Kenneth Frank, Dr. Alexander von Eye, and Dr. Susan Phillips, for their thoughtful and valuable comments. They all encouraged me to pursue and complete this dissertation topic. I also thank all the other people who gave me valuable insights and comments on this dissertation work: Yasuo Miyazaki, Michael Rodriguez, and Dr. Mark Reckase, to name a few. I am also thankful to my mentors from my time as an undergraduate student at Yamanashi University, Professor Ryoichi Yamada and Dr. Kunihiko Ogawa, who encouraged me to pursue graduate degrees. They fought for their lives while I was working on this dissertation. Unfortunately, they did not have a chance to see its completion. Finally, I thank my parents, who taught me the value of the hard work that enabled me to complete this challenging project. Also, I thank my wife, Yasuyo, who always believes in me and has been so generous about the time I have spent on my doctoral studies and this dissertation.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1  INTRODUCTION
1.1. Background
1.2. Review of Literature
1.2.1. Linear Logistic Test Model
1.2.2. Many-Facet Model
1.2.3. Investigating Item Parameter Drift
1.2.4. Reformulating IRT Models as GLM
1.2.5. Random Coefficient Multinomial Logit Model
1.3. Statement of Purpose

CHAPTER 2  GENERALIZING THE STANDARD BINARY-RESPONSE RASCH MODEL
2.1. Model Formulation
2.1.1. The Standard Rasch Model
2.1.2. Formulation of 1-P HGLLM
2.1.3. Estimation
2.2. Illustrative Analysis of Hypothetical Data
2.3. Parameter Recovery Study
2.3.1. Methods
2.3.2. Results
2.4. Comments on the Simulated Data
2.5. Summary and Comments on Practical Issues

CHAPTER 3  MODEL WITH PERSON-LEVEL PREDICTORS
3.1. Model
3.2. Illustrative Analysis
3.3. Parameter Recovery
3.4. DIF Model
3.5. Summary and Comments on Practical Issues

CHAPTER 4  MULTIDIMENSIONAL MODELS
4.1. Model
4.2. Illustrative Analysis
4.3. Parameter Recovery
4.4. Summary and Comments on Practical Issues

CHAPTER 5  THREE-LEVEL MODEL
5.1. Model
5.2. Illustrative Analysis
5.3. Summary and Comments on Practical Issues

CHAPTER 6  CONCLUSIONS
6.1. Summary
6.2. Comments on Practical Issues
6.3. Suggestions for Future Research and Recommendations

APPENDIX A
APPENDIX B
REFERENCES
LIST OF TABLES

1. The item difficulties used in illustrative analyses and simulation studies
2. Person parameter estimates from the HLM and BILOG programs
3. Item parameter estimates from the HLM and BILOG programs
4. Results of parameter recovery study for the 1-P HGLLM
5. Item parameter estimates from the model with a person-level predictor
6. The layout of the simulation study for the model with a person-level predictor variable
7. The results of parameter recovery study for the model with a person-level predictor
8. Item parameter estimates from the DIF model
9. Person parameter estimates from the DIF model
10. Item parameter estimates of multidimensional data
11. Person parameter estimates of multidimensional data
12. The layout of the simulation study for the multidimensional model
13. The results from the parameter recovery study for the multidimensional model
14. Item parameter estimates from 2-level and 3-level models
15. Estimates of school means from 2-level and 3-level models

LIST OF FIGURES

1. Mean correlation between true and estimated item parameters and their standard deviations
2. Root mean squared errors of τ̂ and standard deviations of τ̂
3. The relationship between person parameter estimates from the model with a predictor and estimates from the model without a predictor
4. The root mean squared errors of the coefficient of the predictor
5. The mean and the standard deviation of the estimate of the slope coefficient of the predictor
6. Plot of person parameter estimates from UD and MD models
7. Root mean squared errors of τ̂
8. Mean correlation between true and estimated item parameters and their standard deviations
9. Mean correlation between latent traits and their standard deviations
10. Person parameter estimates from 2-level and 3-level models
11. The relationship between the estimates from the 2-level model and the linear combination of the estimates from the 3-level model
12. Person parameter estimates from 2-level and 3-level models: comparison to the true values

Chapter 1
Introduction

In this chapter, some background for this research, along with the importance of this study, is described first. Second, related literature is reviewed. The review clarifies that similar approaches were explored in the past using other models and, at the same time, what is new about this study. Third, the purposes of this study are stated.

1.1. Background

When investigating the effects of student characteristics on student performance on a test, henceforth termed "ability," a two-step analysis of test data with student-level predictors is common practice. Student abilities are estimated via a standard item response theory (IRT) model as the first step in such a two-step analysis. Student ability scores from an IRT model are originally expressed on the logit scale, but the scores are commonly transformed linearly to another scale when they are reported. Then, in the second step, the ability scores are used as an outcome variable, and student characteristic variables are used as predictors in a simple linear model, such as a multiple regression. Such an analysis may be done routinely whenever existing IRT-scale-based test scores are used as an outcome variable in a regression analysis. Two possible problems are associated with such a two-step analysis.
First, students' ability estimates obtained via an IRT model have standard errors of different magnitudes at different ability levels. Scores in the middle of the distribution are associated with lower standard errors, while scores farther from the middle of the distribution are associated with higher standard errors (Hambleton & Swaminathan, 1985). Thus the test scores, as an outcome variable, have heteroscedastic measurement errors. However, a two-step analysis typically ignores this heteroscedastic nature of the standard errors of measurement of the dependent variable. For this reason, a two-step analysis may not provide accurate results. Second, when marginal maximum likelihood estimation (MMLE) is used to estimate item parameters, person ability estimates derived from either the maximum likelihood or the mean or mode of the posterior distribution (i.e., EAP or MAP) are biased and inconsistent (Goldstein, 1980; Lord, 1984). Inconsistency of the outcome variable is problematic in regression models.

One possible solution to the first problem is to take into account the different standard errors of measurement of the dependent variable in the second step of the analysis, e.g., by applying a weighted least squares (WLS) solution in the regression estimation instead of the ordinary least squares solution. Unfortunately, this approach addresses only the problem of unequal standard errors of the dependent variable; it will not solve the problem of bias and inconsistency of the dependent variable. Also, this approach can produce distorted results if the standard errors are estimated poorly. Another solution is to perform a one-step analysis that includes student characteristic variables as predictors in an IRT model (e.g., Zwinderman, 1991). By including student-level predictors in an IRT model, the second step of the two-step analysis is embedded in the first step. In other words, a regression model that estimates the effects of student characteristics is embedded within an IRT model that estimates students' abilities. This way, the effects of the student-level predictors are estimated simultaneously with the item effects and the person parameters. As a result, the heteroscedastic nature of the standard errors, as well as the bias and inconsistency of the person parameter estimates, does not affect the estimates of the effects of student-level predictors on the outcome variable. Also, one can expect improved estimation of the effects of predictors on a latent trait via a one-step analysis rather than a two-step analysis. This study follows in this tradition, as described in the subsequent chapters. In addition, we can expect improved precision for estimates of item and person parameters (Mislevy, 1987), although that is not the focus of this study.

1.2. Review of Literature

To date, attempts to generalize IRT models have been made by several authors (detailed below). Such generalizations are achieved mainly by adding predictor variables as a linear combination of parameters to IRT models. Also, attempts have been made to reformulate IRT models in terms of the generalized linear model (GLM). This section summarizes such models that have been proposed in the past.

1.2.1. Linear Logistic Test Model

Fischer (1973, 1983a, 1995) was probably the first to incorporate a regression analysis in an IRT model. He generalized the standard binary Rasch model by decomposing the item difficulty parameter into a linear combination of more than one item-varying parameter.
More specifically, in his generalization the item difficulty parameter δ_i for item i is decomposed into p parameters α_l (l = 1, …, p), such that

δ_i = Σ_{l=1}^{p} w_il α_l + c,

where the α_l are called "basic parameters" (Fischer, 1973, 1983a), which are the decomposed parameters; w_il is a coefficient, or weight, for parameter l and item i; and c is a normalization constant. This approach enables one to include person characteristic variables as linear constraints in the Rasch model. The model can be applied to measure such things as the effect of experimental conditions on item difficulty and the impact of educational treatments on ability. This approach has also been applied to measuring change in unidimensional latent traits (Fischer, 1983a, 1983b). In this specific application of the model, any change in person parameters occurring between time points is described as a change in the item parameters, instead of a change in the person parameters.

1.2.2. Many-Facet Model

Linacre (1989), on the other hand, added an indicator variable for raters as a linear constraint to polytomous Rasch models. This becomes important when students' scores are rated by different raters, in order to detect different degrees of severity between the raters, such as in a performance test. Linacre considered the indicator variable for raters a "facet" additional to those for items and examinees. Therefore, he called the model a "many-facet Rasch model" (Linacre, 1989). This approach can also be seen as a three-factorial logit model.

1.2.3. Investigating Item Parameter Drift

Bock, Muraki, and Pfeiffenberger (1988) extended the three-parameter logistic model by adding a variable for the time points at which an item is tested. They decomposed each item difficulty parameter into a linear combination of an "item-specific" parameter and a time-point indicator. More specifically, the difficulty parameter for item j at time point k was expressed as

b_jk = b_j + β_1j t_k + β_2j t_k²,

where b_j is the "item-specific" difficulty parameter, t_k is an indicator for time point k, β_1j is a linear effect of time on item j, and β_2j is a quadratic effect of time on item j. In other words, the decomposition can be viewed as a "growth model" for the item parameters. As a result, this approach was used for detecting change in item difficulty parameter values across multiple time points, in effect, item difficulty parameter drift. In estimating the parameters, they treated the person parameters as random parameters and incorporated marginal maximum likelihood estimation (MMLE) to estimate the item parameters and the effects of time points.

1.2.4. Reformulating IRT Models as GLM

Mellenbergh and Vijn (1981) reformulated the Rasch model as a log-linear model. However, like Fischer's and Linacre's generalizations, their reformulation treated both item and person parameters as fixed parameters. As a result, in most applications the model would have so many parameters to be estimated (the number of items plus the number of examinees) that it could not rely on the parameter estimation algorithms used for log-linear models in general. Instead, Mellenbergh and Vijn had to rely on a modified conditional maximum likelihood (CML) estimation. On the other hand, Mellenbergh (1994) later successfully summarized different IRT models in terms of the generalized linear model (GLM) (McCullagh & Nelder, 1989), including both dichotomous and polytomous one- and two-parameter IRT models.
The three-parameter logistic model is classified as a so-called "left-side added" model (Thissen & Steinberg, 1986) and cannot be reformulated within the GLM family. Generally speaking, the GLM provides a way to estimate a function of the mean responses as a linear function of the values of some set of predictors. However, Mellenbergh did not address the issue of providing parameter-estimation algorithms for the reformulated models.

One critical limitation of the above generalizations is that, except for the model of Bock et al. (1988), they treat all parameters, including the item and person parameters and the parameters associated with predictors, as fixed parameters. As a result, the models are formulated within a single level, despite the fact that the predictor variables may represent measures at different levels. This becomes especially obvious when dealing with variables other than person or item characteristics as predictors. In the many-facet Rasch model, for example, when raters are included as a predictor variable and rater-characteristic variables are also of interest, the above approach would treat all variables as being associated with parameters that are fixed and independent of all other parameters. However, it may be more reasonable to treat each rater effect as a random parameter and to let the rater-characteristic variables be associated with fixed parameters that characterize all raters, because rater characteristics vary among raters.

1.2.5. Random Coefficient Multinomial Logit Model

More recently, Adams and Wilson (1996) proposed another model with person-level predictors, called the random coefficient multinomial logit model (RCMLM). It is general enough to include a wide range of models from the Rasch family, both dichotomous and polytomous. This model, unlike the other models mentioned above, considers person parameters to be random (or random coefficients, in their words). Adams, Wilson, and Wang (1997) further generalized the RCMLM to a multidimensional model (MRCMLM). These models are technically formulated as multi-level models because of the presence of the random parameters. However, the authors at first did not explicitly recognize them as multi-level models. The random-parameter formulation was used mainly for the purpose of parameter estimation and for decomposing each of the person parameters into a linear combination of multiple parameters. Later, Adams, Wilson, and Wu (1997) explicitly recognized the RCMLM and MRCMLM as multi-level models, in which person-characteristic variables could be added as fixed parameters related to the latent traits. This was probably the first time that a regular IRT model was explicitly conceptualized as a multi-level model. However, their approach was limited to a two-level formulation; therefore, their model was only able to include person-varying variables as linear predictors.

1.3. Statement of Purpose

This study shows another way to generalize the Rasch model as a multi-level model. As a result of this generalization, the new model can include level-2 predictors when the Rasch model is formulated as a two-level model. Although similar generalizations have been proposed by several others (e.g., Adams and his colleagues; Zwinderman, 1991), the generalization in this study is distinct from earlier work in two ways.
First, the generalized model in this study makes explicit connections between two seemingly unrelated but well understood models: the hierarchical generalized linear model (HGLM) (Raudenbush, 1995; Stiratelli, Laird, & Ware, 1984; Wong & Mason, 1985) and the Rasch model. Second, the generalized model in this study enables one to formulate models with more than two levels and to estimate their parameters, while the previous formulation (Adams, Wilson, & Wu, 1997) allowed only two levels.

First, the Rasch model is reformulated as a special case of the HGLM. The HGLM is an extension of the generalized linear model (McCullagh & Nelder, 1989) to hierarchical data. This study, accordingly, treats item response data as hierarchical data, in which items are nested within people. The generalized model is referred to as the one-parameter hierarchical generalized linear logistic model (1-P HGLLM). Estimation of person and item parameters per se is not the purpose of the 1-P HGLLM. In order for the approach to be applied in extended models, however, it is essential to properly estimate person and item parameters for the Rasch-equivalent case (the simplest case among the generalized models). Therefore, one primary purpose of this study is to demonstrate the equivalence between the 1-P HGLLM and the Rasch model, both algebraically and numerically.

This study then shows extensions of the generalized model. First, a model with a person-level predictor is presented. This includes a case where the person ability parameters are decomposed into two parameters, as well as a differential item functioning (DIF) model, where the item parameters are composed of two parameters, including a person-level predictor. Second, a multidimensional model, in which more than one latent trait is assumed, is presented. An illustrative analysis and a parameter recovery study are given for this case. Finally, a three-level model is presented, where an additional level (e.g., for schools, classes, etc.) is added. These extensions suggest various possible applications useful to educators, test practitioners, and researchers alike.

Chapter 2
Generalizing the Standard Binary-Response Rasch Model

In this chapter the standard binary-response Rasch model is generalized as a special case of the HGLM. First, the model is formulated according to the HGLM framework. Second, an illustrative data analysis is presented in order to show how an actual data analysis using the model would take place. Third, a simulation study demonstrates parameter estimation by the HLM program. This chapter deals with a very basic model, the Rasch model equivalent. However, estimating the Rasch item and person parameters per se is not the primary purpose of this generalization, as mentioned in the previous chapter. The primary purpose of the generalization is to expand the model to a Rasch model with predictors, a multidimensional Rasch model, and a multilevel Rasch model, which will be discussed in the following chapters. This chapter provides a sound basis for those extensions.

2.1. Model Formulation

In this section, the standard binary-response Rasch model is first presented for the purpose of making a clear connection with the HGLM. Then, the Rasch model is carefully reformulated using the HGLM framework. Specifications of the HGLM framework include a sampling distribution for the item responses, its expectation and variance, a link function, a level-1 structural model, and level-2 models.
2.1.1. The Standard Rasch Model

Let p_ij be the probability that person j (j = 1, …, n) gets item i correct, θ_j be the latent trait of person j, δ_i be the difficulty of item i (i = 1, …, k), and y_ij be a binary outcome indicating the score of the jth person on the ith item (y_ij = 1 if the person answers the item correctly, and y_ij = 0 if the person answers incorrectly). Then, the conditional distribution of the outcome y_ij, given p_ij, is a binomial distribution with parameters 1 and p_ij, which is also known as a Bernoulli distribution with parameter p_ij. Specifically,

y_ij | p_ij ~ B(1, p_ij).   (1)

Based on the above probability model, the Rasch model is defined to be

p_ij = exp[θ_j - δ_i] / (1 + exp[θ_j - δ_i]) = 1 / (1 + exp[-(θ_j - δ_i)]),   (2)

following Wright and Stone (1979), which is equivalent to stating

log[p_ij / (1 - p_ij)] = θ_j - δ_i.   (3)

In the above Rasch model, the parameters θ_j and δ_i are considered to be fixed; in effect, no distributional assumptions are made for the parameters. There are n + k - 2 parameters to be estimated in the model. However, the unique number of parameters to be estimated is reduced to 2k + 1, because people with the same raw score have the same ability estimate in the Rasch model.

As already mentioned, attempts to reformulate Rasch models in terms of a linear model have been made by several other authors. Some of the parameter estimation methods proposed by those authors are based on conditional maximum likelihood estimation (CMLE) (Andersen, 1972). In CMLE both θ and δ are considered to be fixed parameters, as the above Rasch model assumes. The CMLE is based on the sufficient statistics for the Rasch model, in effect, the numbers of correct responses for each person and each item. One advantage of CMLE is that the likelihood function does not contain θ = (θ_1, …, θ_n), yet it produces consistent and efficient estimates of the item parameters. Also, CMLE does not require any assumptions about the population distribution of the θ values, because both item and person parameters are considered to be fixed parameters. However, this approach is strictly limited to the Rasch family of models.

Another perspective on the standard Rasch model is that the latent trait θ_j can be considered to be a random variable (e.g., Andersen and Madsen, 1977; Thissen, 1982). In other words, the examinees represent a random sample from a population in which ability is distributed according to a specified density function. In general, the density function of the latent trait is a function of θ_j given ζ, where ζ is the vector containing the parameters of the examinee population ability distribution; we write

g(θ | ζ).   (4)

Often the standard normal distribution, θ_j ~ N(0, 1), is used for the density function g, but it could have other forms, of course. Item and person parameters based on the assumption of random person parameters are estimated via marginal maximum likelihood estimation (MMLE) (Bock & Aitkin, 1981; Bock & Lieberman, 1970). To date, MMLE is considered to be the only available likelihood method that makes use of the population distribution of θ. With MMLE, one integrates the person parameters out of the likelihood function so that the estimation of the item parameters does not depend on the person parameters. As a result, MMLE estimates the item parameters first and then estimates the person parameters afterwards, under the assumption that the estimated item parameters have known parameter values.
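To make the idea of integrating the person parameters out of the likelihood concrete, the short sketch below computes the marginal probability of one response pattern under the Rasch model with a standard normal ability distribution, using Gauss-Hermite quadrature. This is only a didactic illustration of the MMLE integrand, not the estimation routine used by BILOG; the function name and the choice of 21 quadrature points are my own.

```python
import numpy as np

def marginal_likelihood(y_j, deltas, n_quad=21):
    """Marginal probability of one person's response pattern y_j under the
    Rasch model, integrating the ability theta out over a N(0, 1) prior
    with Gauss-Hermite quadrature (a didactic sketch of the MMLE idea)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    theta = np.sqrt(2.0) * nodes            # change of variable for N(0, 1)
    w = weights / np.sqrt(np.pi)
    # P(y_ij = 1 | theta, delta_i) at every quadrature point, for every item
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - deltas[None, :])))
    like = np.prod(np.where(y_j[None, :] == 1, p, 1.0 - p), axis=1)
    return float(np.sum(w * like))

deltas = np.array([-1.0, 0.0, 1.0])         # hypothetical item difficulties
print(marginal_likelihood(np.array([1, 1, 0]), deltas))
```

Maximizing the product of such marginal probabilities over all examinees with respect to the item difficulties is what MMLE does before the person parameters are estimated in a second step.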
2.1.2. Formulation of 1-P HGLLM

In this section, the unidimensional 1-P HGLLM is formulated following the GLM framework. Then the 1-P HGLLM is shown to be algebraically equivalent to the Rasch model with θ_j being a random variable. According to the GLM framework, a sampling distribution of item responses, its expectation and variance, a link function, and a linear predictor model have to be specified. Then, following the HLM framework, level-2 models are formulated. Here the linear predictor model is considered to be the level-1 model, in effect, an item-level model. Consequently, the level-2 models are person-level models. In this formulation, the ability parameter θ_j is considered to be a random parameter.

For item i (i = 1, …, k) and person j (j = 1, …, n), a binomial sampling model with one trial is employed. This is the same as Equation 1 in the regular Rasch model. Thus, the expected value and variance of y_ij are

E(y_ij | p_ij) = p_ij  and  Var(y_ij | p_ij) = p_ij(1 - p_ij).   (5)

When the level-1 sampling model is binomial, a GLM can utilize one of several link functions, including the logit, probit, and complementary log-log functions. In this case, the logit link function

η_ij = log[p_ij / (1 - p_ij)]   (6)

is used. This is equivalent to Equation 3 if η_ij = θ_j - δ_i. Now the level-1 structural model, in effect the level-1 linear predictor model, is the item-level model,

η_ij = log[p_ij / (1 - p_ij)] = β_0j + β_1j X_1ij + β_2j X_2ij + … + β_(k-1)j X_(k-1)ij
     = β_0j + Σ_{q=1}^{k-1} β_qj X_qij,   (7)

where X_qij is the qth dummy variable for person j, with value -1 when q = i and 0 when q ≠ i, for item i. The coefficient β_0j is an intercept term, and β_qj is the coefficient associated with X_qij, where q = 1, …, k - 1. Equation 7 reduces to

log[p_ij / (1 - p_ij)] = η_ij = β_0j - β_qj   (8)

for the item i that is associated with the qth dummy variable, given X_qij = -1 for q = i and 0 for q ≠ i. This way, β_qj represents the effect of the qth dummy variable, and consequently the effect of item i when i = q. Further, Equation 7 can be written in matrix form as

η_j = W_j B_j,   (9)

where W_j = [d_j  X_j] and B_j = (β_0j, β_1j, …, β_(k-1)j)'. The purpose of writing Equation 7 in matrix form is to show how the data are laid out. Refer to Appendix A for the full-matrix representation of Equation 9. Here, d_j is a k × 1 column vector whose elements are all 1, and X_j is a k × (k - 1) matrix whose diagonal elements for rows 1 through k - 1 are -1 and whose off-diagonal elements are 0. No indicator variable is associated with the kth item because we set the constraint β_kj = 0. As a result, all elements in row k of X_j are zeros. This constraint is needed so that the design matrix has full rank. A parameter other than β_kj could be constrained to 0, of course, but β_kj was chosen for convenience. Here, β_0j is an intercept term, and a value of 1 is assigned to d_ij for all observations. Therefore, β_0j is considered to be an overall effect common to all items. On the other hand, β_qj represents the specific effect of the qth dummy variable, for q = 1, …, k. The constraint β_kj = 0 means that the effect of the kth item, which is subtracted from the overall effect, is assumed to be zero. Then the probability that person j answers item i correctly is expressed as

p_ij = 1 / (1 + exp[-η_ij]),   (10)

which follows from Equation 6.
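Because the level-1 model treats each item response as a separate record nested within a person, the usual person-by-item response matrix has to be rearranged into item-level records carrying the k - 1 dummy variables of Equation 7 before a two-level program can be applied. The following minimal sketch shows one way to build that layout; the function name and column ordering are my own and do not reflect the exact file format any particular program expects.

```python
import numpy as np

def to_item_level(y):
    """Convert an n-by-k binary response matrix into item-level records:
    one row per person-item pair, containing the person id, item id, the
    response y_ij, and k-1 dummy variables X_1 ... X_(k-1) coded -1 for the
    matching item and 0 otherwise (item k is the reference item)."""
    n, k = y.shape
    rows = []
    for j in range(n):
        for i in range(k):
            x = [0] * (k - 1)
            if i < k - 1:
                x[i] = -1            # dummy for item i, as in Equation 7
            rows.append([j + 1, i + 1, int(y[j, i])] + x)
    return np.array(rows)

y = np.array([[1, 0, 1],             # a tiny 2-person, 3-item example
              [0, 0, 1]])
print(to_item_level(y))
```

Stacking the rows for person j reproduces the matrix W_j of Equation 9 (with the response column set aside), so the long-format file and the matrix formulation carry the same information.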
It may at first seem a little odd to have the person subscript j on the βs in Equation 7, because the effects of items (item difficulties) should be constant across people. However, the level-1 model is the item-level model, and we do not assume that the β_qj are constant across people at this level of the model. In fact, the β_qj are not the parameters that are considered to be item difficulties. The parameters of interest are defined in the level-2 model, and may be characterized as constant across people in the level-2 model.

The level-2 models are person-level models. In the level-1 model, β_0j is the mean item effect (or overall mean) with the constraint β_kj = 0. Since β_0j is treated as a parameter that represents the common effect of all items in the level-1 model, it must be assumed in the level-2 models that β_0j is a random effect across people. This way, a latent trait that is common to all items but varies across people can be modeled. Also, while the level-1 model did not assume that β_1j through β_(k-1)j were equal across people, the level-2 models may specify that the item effects are constant across people. Therefore, the level-2 models are

β_0j = γ_00 + u_0j,
β_1j = γ_10,
β_2j = γ_20,
  ⋮
β_(k-1)j = γ_(k-1)0,   (11)

where u_0j is the random component of β_0j and is distributed as N(0, τ); that is, u_0j is normally distributed with a mean of 0 and a variance of τ. The level-1 model together with the level-2 models shows that the item parameters are fixed across people and vary across items, because no random terms are added to β_1j through β_(k-1)j, while the latent trait (person parameter) varies across people and is fixed across items. As a result, when the level-1 and level-2 models are combined, the linear predictor model, Equation 7, becomes η_ij = γ_00 + u_0j - γ_q0 for a specific person j and a specific item i that is associated with the qth dummy variable. Then, the probability that person j answers a specific item i correctly is expressed as

p_ij = 1 / (1 + exp{-[u_0j - (γ_q0 - γ_00)]}),   (12)

where i = q. This is algebraically equivalent to the Rasch model in Equation 2, where θ_j = u_0j and δ_i = γ_q0 - γ_00 for i = q. Note that both δ_i and γ_q0 - γ_00 are fixed item parameters. For the person parameters, θ_j in the Rasch model is a fixed parameter, while u_0j in the HGLM framework is a random variable, with u_0j ~ N(0, τ).

According to the work of Neyman and Scott (1948) on non-linear models for panel data, inconsistency of parameter estimators occurs if item and person parameters are estimated simultaneously. The 1-P HGLLM approach avoids this problem by treating the person parameters as random components of the intercept term, in effect, residuals in the level-2 model. In other words, it does not treat the person parameters as parameters to be estimated. As a result, only k + 1 parameters need to be estimated (i.e., the number of items plus one). When the Rasch model is reformulated in terms of a non-hierarchical GLM, both person and item parameters have to be treated as fixed parameters at the same level of the model (Mellenbergh, 1994). This results in many parameters to be estimated in a regression equation (i.e., k + n - 2 regression coefficients), and leads to inconsistency of parameter estimates. This is one strong advantage of applying the HGLM over the GLM when item responses are formulated in the framework of a linear logistic model.

2.1.3. Estimation
To estimate parameters in the 1-P HGLLM, I use the currently available algorithm in the HLM program (Bryk, Raudenbush, & Congdon, 1996), in which the HGLM is incorporated. The HGLM estimation is referred to as a "doubly-iterative algorithm" (Raudenbush, 1995), that is, a combination of the GLM and HLM estimation procedures. Both the GLM and HLM estimation procedures are performed iteratively, within and between the two procedures, and this is why the combination of the two procedures is called a doubly-iterative algorithm. The iterations in which the linearized dependent variable and weights are updated (the GLM step) are referred to as macro iterations, while the iterations of the weighted HLM analysis within each macro iteration are referred to as micro iterations. In the GLM estimation, the penalized quasi-likelihood (PQL) (Breslow & Clayton, 1993) is maximized in order to obtain the most plausible estimates of the linearized dependent variables, Z_ij, and the weights, w_ij. The Z_ij and w_ij are defined as

w_ij = p_ij(1 - p_ij)   (13)

and

Z_ij = η_ij + (y_ij - p_ij) / w_ij,   (14)

where p_ij, y_ij, and η_ij are defined above. The HLM estimation of the level-2 residuals u_0j, the person parameter estimates, is performed by empirical Bayes (EB) methods, and the estimation of the level-2 coefficients γ_q0, the item parameters, is done by generalized least squares (GLS).

The doubly-iterative algorithm works in the following way. First, initial estimates of the predicted probability values, p_ij^(0), the linearized dependent variables, Z_ij^(0), and the weights, w_ij^(0), are computed by Equations 13 and 14. Then, given those initial estimates, a weighted HLM analysis with Z_ij^(0) as the level-1 outcome is computed (micro iterations). Based on the new predicted values from the weighted HLM analysis, new linearized dependent variables, Z_ij^(1), and weights, w_ij^(1), are computed (macro iteration). Then, a new weighted HLM analysis with Z_ij^(1) as the level-1 outcome is computed (micro iterations). This iterative process is repeated until it reaches a predetermined convergence criterion. As a result, the HGLM produces approximate joint posterior modes of the distributions of the level-1 and level-2 parameters, given a variance-covariance matrix estimated from a normal approximation to the restricted likelihood, in effect, the PQL. This approach is able to estimate parameters in the presence of missing data. Also, abilities for people with zero or perfect scores can be estimated, whereas such cases are simply discarded by some Rasch IRT software. For more details on parameter estimation, readers are referred to Raudenbush (1995).
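To make the structure of the doubly-iterative scheme concrete, the sketch below implements a deliberately simplified PQL-style loop for the 1-P HGLLM in Python: each macro iteration recomputes the working variate and weights of Equations 13 and 14, and the micro step solves the mixed-model equations of the resulting weighted linear model. The function name, the dense-matrix linear algebra, and the EM-type update of τ are my own simplifications; this is not the algorithm as implemented in the HLM program.

```python
import numpy as np

def pql_1p_hgllm(y, n_macro=100, tol=1e-6):
    """Simplified PQL-style estimation for the 1-P HGLLM.

    y is an n-by-k binary response matrix. Returns gamma (gamma_00 followed
    by gamma_10 ... gamma_(k-1)0), the empirical Bayes residuals u_0j, and
    the estimated variance tau. A didactic sketch only.
    """
    n, k = y.shape
    # Level-1 design (Equation 7): a column of ones plus k-1 dummies coded -1;
    # item k is the reference item and has no dummy variable.
    x_item = np.hstack([np.ones((k, 1)),
                        np.vstack([-np.eye(k - 1), np.zeros((1, k - 1))])])
    X = np.tile(x_item, (n, 1))                  # fixed-effect design, (n*k, k)
    Z = np.kron(np.eye(n), np.ones((k, 1)))      # person indicators, (n*k, n)
    yv = y.reshape(-1).astype(float)

    gamma, u, tau = np.zeros(k), np.zeros(n), 1.0
    for _ in range(n_macro):                     # macro iterations
        eta = X @ gamma + Z @ u
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)                        # Equation 13
        z = eta + (yv - p) / w                   # Equation 14 (working variate)
        XtW, ZtW = X.T * w, Z.T * w
        # "Micro" step: mixed-model equations for the weighted linear model
        # z = X*gamma + Z*u + e, with Var(u_0j) = tau.
        A = np.block([[XtW @ X, XtW @ Z],
                      [ZtW @ X, ZtW @ Z + np.eye(n) / tau]])
        b = np.concatenate([XtW @ z, ZtW @ z])
        sol = np.linalg.solve(A, b)
        gamma_new, u_new = sol[:k], sol[k:]
        # EM-type update of tau using the posterior variances of u_0j.
        post_var = np.diag(np.linalg.inv(A))[k:]
        tau_new = float(np.mean(u_new ** 2 + post_var))
        converged = (np.max(np.abs(gamma_new - gamma)) < tol
                     and abs(tau_new - tau) < tol)
        gamma, u, tau = gamma_new, u_new, tau_new
        if converged:
            break
    return gamma, u, tau
```

On the Rasch scale, the item difficulties are then recovered as γ̂_q0 - γ̂_00 for items 1 through k - 1 and -γ̂_00 for item k, while the empirical Bayes residuals u_0j serve as the person ability estimates.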
2.2. Illustrative Analysis of Hypothetical Data

Although estimating person and item parameters per se is not the primary purpose of the 1-P HGLLM, it is essential that the model can estimate person and item parameters appropriately for the simplest case of the model. As mentioned in the previous section, the simplest case of the 1-P HGLLM is the case with no predictor variable in the model, essentially the Rasch-model-equivalent case. For the purpose of illustrating possible data analyses, a numerical example is presented here. A set of hypothetical data is used to illustrate the model and to demonstrate how parameters are estimated. First, person and item parameters in the 1-P HGLLM are estimated via the HLM program. The same data set is then analyzed with the BILOG program (Mislevy & Bock, 1990) to estimate person and item parameters, and the results are compared. The BILOG program was chosen for the comparison because it treats the latent trait as a random variable, as does the 1-P HGLLM. Also, since the PQL is an approximation of marginal maximum likelihood estimation (MMLE), the comparison should indicate how well the 1-P HGLLM approach estimates parameters. It is important to know how differently parameters are estimated in comparison to other methods.

For this study, the BILOG program is configured so that it estimates item parameters via MMLE with no prior distribution specified, and person parameters via an expected a posteriori (EAP) algorithm with the standard normal prior distribution.

Both person- and item-parameter estimates may be rescaled onto a specific common scale by adjusting their means and standard deviations, for example, to mean = 0 and variance = 1. Since the rescaling is based on linear transformations, it is sufficient to see that the estimates from both programs are highly correlated before rescaling. Once the estimates are obtained, it is quite simple to transform both sets of estimates to a common scale, and they will still be correlated at the same magnitude. Here, rescaling is done, but solely for the purpose of illustrating the procedure. Rescaling of the Rasch parameter estimates is done by the BILOG program in the following manner. First, the person parameter estimates are transformed to have the mean and standard deviation of the prior distribution. In this case, a mean of 0 and a standard deviation of 1 are used to standardize the estimates. Then the item parameter estimates are linearly transformed, using the slope and intercept constants from the linear transformation of the person parameter estimates. Several other rescaling options are available in the BILOG program (see Mislevy & Bock, 1990). Rescaling of the parameter estimates from the HLM program was done in the same manner.

Table 1
The item difficulties used in illustrative analyses and simulation studies

Item  Difficulty    Item  Difficulty
1     -2.000        16    -1.250
2     -1.500        17    -0.250
3     -1.000        18     0.250
4     -0.500        19     1.750
5     -0.250        20     0.000
6      0.500        21    -1.875
7      1.000        22    -1.625
8      1.500        23    -1.375
9      2.000        24    -1.125
10     0.000        25    -0.875
11    -1.750        26    -0.625
12    -0.750        27    -0.375
13     0.125        28    -0.125
14     0.750        29     2.125
15     1.250        30    -2.125

In the hypothetical data set, 10 items and 250 examinees are assumed. The item-difficulty-parameter values are arbitrarily chosen; the values used are shown in Table 1. Item difficulty values for 30 items are shown in this table because the table is also used to determine item difficulty values for other illustrative analyses and simulation studies, in which the number of items is as large as 30. According to the table, the item difficulties (-2.00, -1.50, -1.00, -0.50, -0.25, 0.00, 0.50, 1.00, 1.50, 2.00) are used. The 250 person-parameter values are independently sampled from the standard normal distribution, N(0, 1). These values are used as the true item difficulties and true ability measures. Conventional procedures are used to generate simulated dichotomous item responses according to the Rasch model (Equation 2). The probability of a correct response is computed from the Rasch model with the given true item- and sampled person-parameter values. Then, the probability value is compared with a random number sampled from a uniform distribution with a range between 0 and 1. A simulated response is scored correct and a value of 1 is assigned if the probability of a correct response is greater than or equal to the sampled number; the response is scored incorrect and a value of 0 is assigned otherwise. Then, the generated data set is analyzed by the HLM program and the BILOG program, and item and person parameters are estimated.
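As a concrete version of the data-generating step just described, the sketch below produces a 250 x 10 dichotomous response matrix under the Rasch model; the function name and the use of numpy's random number generator are my own choices, and any equivalent simulation routine would do.

```python
import numpy as np

def simulate_rasch(thetas, deltas, seed=0):
    """Generate dichotomous responses under the Rasch model (Equation 2):
    score 1 if the model probability of a correct response is greater than
    or equal to a uniform(0, 1) draw, and 0 otherwise."""
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-(thetas[:, None] - deltas[None, :])))
    return (p >= rng.uniform(size=p.shape)).astype(int)

deltas = np.array([-2.00, -1.50, -1.00, -0.50, -0.25,
                   0.00, 0.50, 1.00, 1.50, 2.00])    # the 10 difficulties above
thetas = np.random.default_rng(1).normal(size=250)   # true abilities ~ N(0, 1)
y = simulate_rasch(thetas, deltas)
print(y.shape, y.mean())
```

The resulting matrix y is what would then be restructured into item-level records and passed to the HLM and BILOG analyses.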
The model for this example is defined by Equations 7, 8, and 9, where k = 10, n = 250, and the dummy variable for the tenth item is dropped in order to achieve full rank for W_j:

η_ij = β_0j + β_1j X_1ij + β_2j X_2ij + β_3j X_3ij + β_4j X_4ij + β_5j X_5ij + β_6j X_6ij + β_7j X_7ij + β_8j X_8ij + β_9j X_9ij.   (15)

In matrix notation,

η_j = W_j B_j,   (16)

where η_j = (η_1j, η_2j, …, η_10j)', B_j = (β_0j, β_1j, …, β_9j)', and

       [ 1  -1   0   0   0   0   0   0   0   0 ]
       [ 1   0  -1   0   0   0   0   0   0   0 ]
       [ 1   0   0  -1   0   0   0   0   0   0 ]
       [ 1   0   0   0  -1   0   0   0   0   0 ]
W_j =  [ 1   0   0   0   0  -1   0   0   0   0 ]   (17)
       [ 1   0   0   0   0   0  -1   0   0   0 ]
       [ 1   0   0   0   0   0   0  -1   0   0 ]
       [ 1   0   0   0   0   0   0   0  -1   0 ]
       [ 1   0   0   0   0   0   0   0   0  -1 ]
       [ 1   0   0   0   0   0   0   0   0   0 ]

Then the entire design matrix across people is W = (W_1', …, W_250')'. The level-2 models are defined by Equation 11, where k = 10:

β_0j = γ_00 + u_0j,
β_1j = γ_10,
β_2j = γ_20,
β_3j = γ_30,
β_4j = γ_40,
β_5j = γ_50,
β_6j = γ_60,
β_7j = γ_70,
β_8j = γ_80,
β_9j = γ_90,   (18)

where u_0j ~ N(0, τ). Then, the item difficulty for item 10 is -γ_00, the item difficulties for the other items are γ_q0 - γ_00, and the ability for person j is u_0j.

Person parameter estimates from both the HLM and BILOG programs before transformation are shown in the first two columns of Table 2, and item parameter estimates before transformation are shown in the first two columns of Table 3.

Table 2
Person parameter estimates from the HLM and BILOG programs

Raw Score   Mean True Value   HLM (before)†   BILOG (before)‡   HLM (rescaled)*   BILOG (rescaled)*
0           -2.125            -1.681          -2.037            -2.439            -2.554
1           -1.447            -1.323          -1.546            -1.919            -1.939
2           -0.811            -0.988          -1.135            -1.432            -1.425
3           -0.780            -0.677          -0.772            -0.967            -0.969
4           -0.368            -0.354          -0.419            -0.514            -0.527
5           -0.227            -0.047          -0.054            -0.067            -0.069
6            0.391             0.263           0.330             0.381             0.412
7            0.580             0.578           0.701             0.839             0.885
8            0.936             0.905           1.050             1.312             1.313
9            1.490             1.248           1.357             1.810             1.698
10           1.589             1.616           1.648             2.343             2.061
Mean**      -0.007             0.000           0.001             0.000             0.000
Std. Dev.**  0.987             0.689           0.798             1.000             1.000

Note. * Both HLM and BILOG estimates are standardized. ** Means and standard deviations are weighted by the number of people in each raw score category. † Root mean squared error is 0.642. ‡ Root mean squared error is 0.643.

Table 3
Item parameter estimates from the HLM and BILOG programs

Item    True Value   HLM (before)   BILOG (before)   HLM (rescaled)*   BILOG (rescaled)*
1       -2.00        -1.516         -1.823           -2.198            -2.286
2       -1.50        -1.386         -1.671           -2.010            -2.096
3       -1.00        -0.754         -0.915           -1.093            -1.149
4       -0.50        -0.389         -0.470           -0.564            -0.591
5       -0.25        -0.064         -0.071           -0.093            -0.091
6        0.00         0.128          0.150            0.186             0.190
7        0.50         0.225          0.284            0.326             0.354
8        1.00         0.541          0.670            0.784             0.838
9        1.50         1.265          1.536            1.835             1.923
10       2.00         1.629          1.955            2.361             2.449
Mean                 -0.058         -0.065           -0.084            -0.084
Std. Dev.             0.976          1.241            1.415             1.556

Note. * Rescaled estimates are transformed based on the same linear transformation constants used for person parameter rescaling.

Although the two programs show a nearly perfect linear relationship (r = 0.999) for both person and item estimates, the BILOG estimates vary more than the HLM estimates do, for both person and item parameter estimates (standard deviations of 0.798 vs. 0.689 for person estimates, and 1.241 vs. 0.976 for item estimates). This is also reflected in larger absolute values for the BILOG estimates than for the HLM estimates. This is because of the empirical Bayes (EB) estimation of the person parameters. The EB estimation shrinks the person parameter estimates based on the overall estimated variance of the distribution of u_0j (i.e., τ).
In this example, τ was estimated to be 0.80, which is a considerable underestimation of the true value τ = 1.00. This underestimation of τ largely accounts for the smaller variance of the HLM parameter estimates relative to the true values. The root mean squared errors of the BILOG and HLM person parameter estimates are 0.643 and 0.642, respectively. These values indicate that the HLM and BILOG estimates are almost equal in terms of deviations from the true values. They also indicate that neither estimates the true values particularly well.

Transformed person and item parameter estimates are shown in the last two columns of Tables 2 and 3, respectively. It is evident that the rescaled person parameter estimates are very similar between the HLM and BILOG programs. This makes sense because they are correlated almost perfectly before transformation, and they were transformed to have the same mean and the same standard deviation. However, the transformed item parameter estimates appear somewhat different between the HLM and BILOG programs; some differ by 0.3 logits. This occurs because each set of item parameters is transformed according to the transformation applied to the corresponding set of person-parameter estimates; the two sets of item parameters are not transformed to have the same mean and standard deviation. They are still correlated almost perfectly (r = 0.999), because they are transformed linearly.

2.3. Parameter Recovery

The purpose of this simulation study is twofold. First, the simulation is intended to demonstrate parameter recovery for the simplest model presented, namely the binary-response Rasch model reformulated in the HGLM framework. In the previous section, I presented a numerical example based on only one set of simulated data, which illustrated how a set of item response data can be analyzed using the reformulated model and the HLM program. However, I was not able to evaluate the quality of those estimates, because the results were based on only one set of simulated data. In this simulation study, I replicate more than one data set under the same conditions to show that the HLM program can consistently reproduce parameter values.

Second, this simulation also explores the role of the number of replications in this specific application of the HLM program. In the past, the number of replications used in simulation studies of IRT models has varied across studies and has seemed arbitrary. For example, Drasgow (1989) used 10 replications to study the two-parameter logistic (2PL) model, and Seong (1990) used only 5 replications for 2PL models. Although Stone (1992) used 100 replications, pointing out that 5 or 10 replications might result in unstable results, it was not clear how much stability he gained from the increased number of replications. On the other hand, Yang (1995) used 50 replications and was able to present convincing results for binary-outcome hierarchical models with continuous explanatory variables using the HLM software. This simulation study examines several different numbers of replications and explores how differently they produce parameter estimates. It is hoped that this will give a good indication, although not an exact answer, of how many replications should be used for the other two simulation studies that I will conduct with more complicated models.
2.3.1. Methods

The variables of interest in this simulation study are (a) the number of replications (g = 3, 5, 10, 20, 50, and 100); (b) the sample size (n = 250, 500, and 1000), representing small, medium, and large sample sizes; and (c) the number of items (k = 10 and k = 20), representing small and large numbers of items (short and long tests). These three variables produce 6 x 3 x 2 = 36 conditions to investigate. Although 20 is not that large a number of items in real test settings, it was selected here to roughly double the number of parameters to be estimated. The exact number of parameters to be estimated is 11 when k = 10 (10 fixed parameters and 1 random parameter, the variance of the person parameters) and 21 when k = 20 (20 fixed parameters and 1 random parameter).

For each replication in each of the 36 conditions, person ability values are sampled from the standard normal distribution, N(0, 1). Item difficulty parameter values are determined by Table 1: the first ten difficulty values are used for the 10-item test, and the first 20 difficulties are used for the 20-item test. The response data are generated in the same manner as described in Section 2.2. Then, the generated item-level data are analyzed by the HLM program, and item and person parameters are estimated.

Estimated parameter values are summarized across the 36 conditions using several statistics and indicators. I compute (a) the mean correlation coefficient between estimated and true item-parameter values across replications; (b) the standard deviation of these correlation coefficients across replications; (c) the root mean squared error (RMSE) of τ̂ across replications; (d) the mean of the τ̂ values; and (e) the standard deviation of the τ̂ values. The RMSE of τ̂ is defined to be

RMSE(τ̂) = sqrt[ (1/g) Σ_{l=1}^{g} (τ̂_l - τ)² ],   (19)

where τ̂_l is the estimate of τ in the lth (l = 1, …, g) replication. As mentioned in the previous section, the measurement scales for item and person parameter estimates are arbitrary, and their values would have to be rescaled to be compared directly across replications. Here, since the non-rescaled estimated values are compared, direct comparisons between parameter estimates are not made. Instead, correlations between true and estimated item parameters are compared across replications. Also, the estimate of the variance of the person parameters, τ̂, is compared across replications.
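As a small illustration of how these summary statistics are assembled once the replications have been run, the sketch below assumes that each replication has produced a vector of item-difficulty estimates and a value of τ̂; the function and variable names are hypothetical, and the illustrative inputs are made up rather than results from the study.

```python
import numpy as np

def summarize_replications(true_deltas, est_deltas, tau_hats, true_tau=1.0):
    """Summary statistics across g replications: mean and SD of the
    correlation between true and estimated item difficulties, and the
    RMSE (Equation 19), mean, and SD of the tau estimates."""
    rs = np.array([np.corrcoef(true_deltas, d)[0, 1] for d in est_deltas])
    tau_hats = np.asarray(tau_hats, dtype=float)
    rmse_tau = np.sqrt(np.mean((tau_hats - true_tau) ** 2))
    return {"mean_r": rs.mean(), "sd_r": rs.std(ddof=1),
            "RMSE_tau": rmse_tau,
            "mean_tau": tau_hats.mean(), "sd_tau": tau_hats.std(ddof=1)}

# hypothetical results from g = 3 replications of a 10-item condition
true_deltas = np.array([-2.0, -1.5, -1.0, -0.5, -0.25, 0.0, 0.5, 1.0, 1.5, 2.0])
est_deltas = [true_deltas + np.random.default_rng(s).normal(0, 0.1, 10)
              for s in range(3)]
print(summarize_replications(true_deltas, est_deltas, tau_hats=[0.80, 0.82, 0.78]))
```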
However, despite the generally small differences across the 36 conditions, we can still observe some apparent differences between the conditions when the values are examined carefully. In Figure 1, the first 35 Table 4 Results of parameter recovery study for the 1-P HGLLM sample Results items size replication mean(r) sd(r) m_sd(,v) RMSE( 2) mean( t) sd( 2') 3 0.9884 0.0022 0.1799 0.2617 0.7970 0.0409 5 0.9947 0.0016 0.1194 0.2073 0.8188 0.0126 250 10 0.9945 0.0036 0.1317 0.2441 0.7801 0.0125 20 0.9944 0.0032 0.1319 0.1892 0.8461 0.0128 50 0.9934 0.0031 0.1418 0.2125 0.8506 0.0233 100 0.9933 0.0031 0.1406 0.2038 0.8488 0.0189 3 0.9946 0.0051 0.0972 0.1815 0.8263 0.0042 5 0.9983 0.0005 0.0655 0.1715 0.8464 0.0073 10 500 10 0.9960 0.0015 0.1088 0.1307 0.9124 0.0104 20 0.9973 0.0012 0.0922 0.1666 0.8502 0.0056 50 0.9965 0.0020 0.1014 0.1888 0.8349 0.0085 100 0.9962 0.0021 0.1059 0.2065 0.8130 0.0078 3 0.9983 0.0006 0.0700 0.1801 0.8212 0.0007 5 0.9989 0.0004 0.0630 0.2213 0.7808 0.0011 1000 10 0.9985 0.0006 0.0715 0.1776 0.8319 0.0037 20 0.9981 0.0009 0.0745 0.1647 0.8516 0.0054 50 0.9983 0.0009 0.0676 0.2032 0.8057 0.0036 100 0.9983 0.0008 0.0703 0.1999 0.8088 0.0034 3 0.9892 0.0053 0.1476 0.2236 0.8040 0.0173 5 0.9929 0.0015 0.1339 0.1812 0.8537 0.0143 250 10 0.9927 0.0027 0.1272 0.1089 0.9155 0.0052 20 0.9921 0.0019 0.1376 0.1003 0.9170 0.0033 50 0.9914 0.0029 0.1370 0.1494 0.8925 0.0110 100 0.9916 0.0030 0.1406 0.1394 0.9022 0.0100 3 0.9957 0.0006 0.0908 0.0978 0.9025 0.0001 5 0.9951 0.0021 0.0949 0.1005 0.9144 0.0035 20 500 10 0.9948 0.0015 0.1009 0.1412 0.8740 0.0045 20 0.9954 0.0020 0.0973 0.1138 0.9117 0.0054 50 0.9957 0.0013 0.0995 0.1264 0.8891 0.0037 100 0.9959 0.0013 0.0971 0.1204 0.9032 0.0052 3 0.9971 0.0018 0.0737 0.1057 0.8950 0.0002 5 0.9977 0.0008 0.0656 0.1116 0.8950 0.0018 1000 10 0.9978 0.0010 0.0664 0.1498 0.8543 0.0013 20 0.9979 0.0007 0.0675 0.1288 0.8808 0.0025 50 0.9980 0.0007 0.0666 0.1153 0.8942 0.0022 100 0.9979 0.0007 0.0675 0.1189 0.8923 0.0026 36 Figure 1 Mean correlation between true and estimated item parameters and their standard deviations 8 m=250 Q .. C .9 E 93 6 0 C m a) 2 ID (I) o, . d 1 l l T I 0 20 40 60 80 100 replications to n=250 O o_ . O — k=10 . ' ------- =20 U) C .9. 31‘: 05> fa o . “5 0 co 0. i o l W l f I l 0 20 40 60 80 100 replications 37 Figure 1 (cont’d) 100 =20 80 I 60 500 — k=1O k n 40 20 0 SN: 8? 8.88 8? 5:30:00 coo—2 replications 0 a . 1 10 8 00 12 = = kk :0 s 0 am. .10. m -o m a 4 10 2 q d d d 4 q 1 I0 wood wood wood 06 95:90:00 Lo om Figure 1 (cont’d) k=10 k=20 n=1000 4O com: 8?. 8? 5:96:00 cams. mmmd 100 80 60 0 replications 0 fl 1 w 1 O 8 0 0 1 2 = __ k k S m em 1 m .n. 4. W 9 r O m 3 4 r O 2 .... c ...... o 1 . a J - . . 0 wood Sod wood o.o mCOzm—w—LOO ho Cw three plots show mean correlations for three different sample sizes. First, higher correlation coefficient values between true and estimated item parameters are observed when there are more examinees. This observation makes sense because, in theory, the quality of item parameter estimates depends on how large the sample size is, and it does not depend on the number of items in a test. On the other hand, these three plots show that correlation coefficients between true and estimated item parameters are slightly weaker for larger numbers of items. The difference between the mean rs for k = 10 and 20 becomes smaller as the number of examinees increases. The correlation coefficients are lower for the longer test, while holding the number of examinees constant. 
This makes sense because, holding the number of examinees constant, the ratio of the number of items to the number of examinees is higher for the longer test. Since the difference between the two ratios is smaller when there are more examinees, the correlation coefficient does not decrease as much when the number of examinees is large, as with n = 1000, as it does when n is smaller. The same trends appear in the plots of the standard deviations of the correlations, the last three plots in Figure 1. However, those plots magnify the differences, which are all at the third decimal place.

The seventh column in Table 4 contains the RMSE of τ̂, defined above. The values are also plotted in the first three plots in Figure 2. The standard deviations of τ̂ across replications are shown in the last column of the table and plotted in the last three plots in Figure 2. Also, the actual mean values of τ̂ are shown in the eighth column of the table.

Figure 2. Root mean squared errors of τ̂ and standard deviations of τ̂, plotted against the number of replications for n = 250, 500, and 1000 (k = 10 and k = 20). [plots omitted]

As can be seen from the table, the mean of τ̂ is consistently smaller than the true value, 1.0, in all 36 cases. This result is consistent with those of Yang (1995), who empirically showed that the algorithm used in the HLM program, PQL estimation, tends to underestimate τ with binary-outcome models. Despite this limitation of the algorithm, consistently low RMSE values are observed. The results show that the τ̂ estimates are closer to the true value when there are more items: when k = 10 the mean τ̂ is around 0.8, and when k = 20 the mean τ̂ is around 0.9. On the other hand, the number of examinees does not seem to affect the RMSE values. Again, this makes sense because the precision of the person parameters is affected by the number of items in a test, but it is not affected by the number of examinees. The standard deviations of τ̂ become smaller with larger k, and the difference between k = 10 and k = 20 becomes smaller with larger n. Although I was not able to incorporate the improvement in this study, Yang (1998) has recently improved the PQL algorithm for estimating τ. By incorporating Yang's algorithm, the underestimation of τ would be expected to be less severe than observed in this study.

Among the values plotted in Figures 1 and 2, the estimates are somewhat unstable when the number of replications is 3, 5, or 10. Sometimes the patterns for k = 10 and k = 20 even reverse when the number of replications is 3, 5, or 10.
From these observations, the simulation might still produce reasonable results with g = 20, but it makes more sense to use g = 50 for the subsequent simulations, because results are much more stable with g = 50 than with g = 20. No added benefit is apparent with 100 replications.

2.4. Comments on the Simulated Data

In the conventional method of generating item response data described above, the marginal sums for both persons and items are determined as a result of independently generated individual 0-1 responses. In other words, it is not always true that a set of generated responses represents a given set of item and person parameter values. For this reason, a set of marginal sums is not actually randomly sampled from the possible sets of marginal sums. Since the marginal sums are sufficient statistics in the Rasch model, it is important to have a set of marginal sums that is randomly sampled from the population of possible marginal sums for the purpose of reproducing parameter values in a simulation study.

An alternative way to generate item responses is the method proposed by Snijders (1991). Snijders' method first generates marginal sums for both items and persons based on a set of item and person parameters. Then, it generates individual 0-1 responses based on those marginal sums. In this way, Snijders' method guarantees that a set of marginal sums is a random sample from the set of all possible marginal sums given the parameter values. As a result, the distributions of both sets of parameter estimates will be exactly the same as their theoretical distributions. One benefit of using Snijders' method is that we can be sure that any statistics computed from the parameter estimates are sampled from their theoretical sampling distributions. Therefore, this simulation method would allow us to illustrate how parameter estimates behave better than the conventional simulation method does. However, application of Snijders' method is limited to the Rasch model, because marginal sums are not sufficient statistics for two- and three-parameter IRT models.

2.5. Summary and Comments on Practical Issues

In this chapter it was shown that the regular binary-response Rasch model can be reformulated as a special case of the HGLM. Also, it was shown that the HLM program could estimate item and person parameters similar to those computed by the BILOG program, an MMLE-based IRT program. The results of the simulation study revealed that the variance of the person parameters tends to be underestimated with the HGLM because of the PQL estimation. The degree of underestimation depended on the number of items in a test and the number of examinees.

As stated earlier in this chapter, the primary purpose of this generalization is not to estimate Rasch parameters per se. Various Rasch/IRT parameter estimation programs are already available for the purpose of estimating item and person parameters. Thus, there is no need to use the HLM program solely for the purpose of estimating item and person parameters. The real purpose of the generalization is to extend the model to a Rasch model with predictors, a multidimensional Rasch model, and a multilevel Rasch model. These extensions are discussed in the following chapters. However, this reformulation of the regular binary-response Rasch model can be considered an important clarification of the connection between two seemingly unrelated statistical and psychometric models, in effect, an IRT model and the HGLM.
From this perspective, estimating item and person parameters using the HLM program could be meaningful for didactic purposes. IRT models, including the Rasch model, are often thought of as specialized statistical and psychometric models for item response data. Furthermore, specialized estimation algorithms, often in specialized software, are thought necessary for parameter estimation. For this reason, IRT models and parameter estimation are often treated completely separately from other statistical models by the learner of IRT models. This generalization clarifies that the Rasch model is a special case of the HGLM. Also, this study clarifies that item and person parameters can be estimated using the currently available algorithms in the HLM program, which are used for more general purposes.

Furthermore, the interpretation of IRT parameters also tends to be specialized and separated from other statistical models. More specifically, an item parameter is simply interpreted as an item difficulty in the Rasch model. This interpretation is accurate and useful in the context of item response data. However, at the same time, this interpretation is isolated from more general contexts, such as logistic regression, GLM, and HGLM, where it can be interpreted as an effect of the item. As a result, it is sometimes very difficult for the learner of IRT models to understand what the parameter really means. On the other hand, if the Rasch model is presented as a special case of the HGLM, it is relatively easy to interpret the parameters, because the item parameters are coefficients of dummy variables, and the learner can bring to bear their knowledge of logistic regression to make sense of IRT analyses.

Chapter 3
Model with Person-Level Predictors

In this chapter an extension of the 1-P HGLLM is presented. As indicated in Chapter 2, a direct extension of the 1-P HGLLM is to include person-level predictors in the model, in effect, person-characteristic variables. This approach achieves a one-step analysis of test data with person-level predictor variables. Through such a single analysis, rather than a two-step analysis, one can expect improved estimation of the effects of the predictors on the latent trait, because the effects of the predictors are estimated simultaneously with the ability parameters. As a result, the heteroscedastic nature of the standard errors of the ability estimates is taken into account, and inconsistency of the ability estimates is avoided.

3.1. Model

Assume a situation where one wishes to analyze the effect of the amount of instruction on a specific topic (Time) on reading achievement. In most cases, such an analysis is done with a simple regression model,

(Score)_j = β0 + β1(Time)_j + e_j ,    (20)

where (Score)_j is the reading test score for person j, and (Time)_j is the amount of instruction on the specific topic for person j. One would then be interested in the magnitude of β1. If β1 is different from 0, it would be interpreted as evidence that the amount of instruction has an impact on the reading achievement test score. However, an analysis based on this model may not give accurate results, especially when test scores are based on an IRT scale. The reason is that estimated scores based on IRT are associated with different magnitudes of standard errors across estimates. Extreme scores, such as zero or perfect scores, are associated with larger standard errors than scores in the middle of the range. In other words, the measurement errors of the dependent variable are not homoscedastic.
Therefore, unless those different standard errors of measurement are taken into account in the regression analysis, the result may be misleading. One way to avoid this problem is to perform a one-step analysis of test data with person-level predictors in the 1-P HGLLM. The level-1 model shown in Equation 7 is used, that is,

η_ij = β0j + β1j X1ij + β2j X2ij + ... + β(k-1)j X(k-1)ij .

Here, for the level-2 models, let the random component of β0j in the level-1 model be u0j*, in order to distinguish it from u0j in the model without any predictors (Equation 11). Then the level-2 models are

β0j = γ00 + γ01 W1j + ... + γ0p Wpj + u0j*
β1j = γ10
...
β(k-1)j = γ(k-1)0 ,    (21)

where Wsj (s = 1, ..., p) are person-level predictor measures for predictor s and person j. For the specific example with Time as a predictor, we write

β0j = γ00 + γ01(Time)_j + u0j*
β1j = γ10
...
β(k-1)j = γ(k-1)0 .    (22)

In this way, the regression model (Equation 20) is embedded in the 1-P HGLLM, which is equivalent to saying that the regression model is embedded in the Rasch model. Therefore, there is no need to further account for different magnitudes of the standard errors of measurement of the outcome variable. More technically, γ01 in the above equation will be estimated simultaneously with the person-specific ability, u0j*. The combined model then becomes

p_ij = 1 / (1 + exp(-{[γ01(Time)_j + u0j*] - (γq0 - γ00)})) ,    (23)

for i = q, where var(u0j*) = τ*. Therefore, the overall person ability in this formulation is γ01(Time)_j + u0j*, in contrast to u0j in the model without any predictor variables (see Equation 12). The relationship of u0j* to u0j is given algebraically by u0j = γ01(Time)_j + u0j*. In other words, the person ability from the simple Rasch model is decomposed into two parts in this model. On the other hand, the item parameters are the same as in the model without predictor variables, that is, γq0 - γ00 for i = q.

3.2. Illustrative Analysis

As a numerical example, a set of hypothetical data for a 20-item test is analyzed. In order to simulate a data set, it is assumed that the mean of the Time measures equals zero and the variance of Time equals one. Note that in practice this variable may not be normally distributed, because the amount of instruction time is often represented by two groups: people who received instruction on the topic and people who did not. However, the shape of the distribution of a predictor variable does not affect the estimate of the effect of the predictor, so the Time measures are assumed to be normally distributed for convenience. Second, it is assumed that Time and the true student abilities are correlated to a certain degree. Therefore, the student abilities and the Time measures are simultaneously sampled from a bivariate normal distribution. In this analysis, 250 students are assumed, and I specify that the student abilities and Time are correlated with ρ = 0.3. Then, the simulated true student ability measures, along with true item difficulties, were used to generate a zero-one response for each item. The zero-one response data are then analyzed by two models: (a) the model without Time, and (b) the model with Time as a predictor. Then, results from these two analyses are compared. The slope for Time, γ01, is estimated to be 0.270 in the second model, a one-step analysis. Since both ability and Time are standardized, the slope is equivalent to a correlation coefficient.
This value is reasonable compared to the true correlation coefficient of 0.30 between Time and student ability, although it is about a 10% underestimation of the true value. Also, the p-value associated with the γ̂01 estimate is smaller than 0.0005, indicating that the effect of Time is significantly different from zero.

Table 5 shows item parameter estimates from the two different models. The first column contains the estimates from the model with Time as a predictor, and the second column contains the estimates from the model without Time. Item parameters are estimated very similarly for the two models: they differ at most in the second decimal place and by no more than 0.04. This makes sense because the item parameters from the two models are algebraically the same and ought to be estimated similarly. The last column shows the true item difficulties. The RMSE is 0.225 for the model with Time and 0.240 for the model without Time.

Table 5
Item parameter estimates from the model with a person-level predictor

Item  with Time  without Time  true value
1     -1.698     -1.688        -2.000
2     -1.294     -1.289        -1.500
3     -0.846     -0.847        -1.000
4     -0.552     -0.558        -0.500
5     0.312      0.290         0.250
6     0.549      0.524         0.500
7     0.957      0.925         1.000
8     1.123      1.088         1.500
9     1.451      1.411         2.000
10    -0.065     -0.080        0.000
11    -1.667     -1.658        -1.750
12    -0.770     -0.772        -0.750
13    0.245      0.225         0.125
14    0.480      0.456         0.750
15    0.898      0.866         1.250
16    -1.178     -1.175        -1.250
17    -0.147     -0.160        -0.250
18    -0.049     -0.064        0.250
19    1.503      1.463         1.750
20    -0.098     -0.080        0.000

On the other hand, the person abilities in the model without Time are u0j, while the person abilities in the model with Time are γ01(Time)_j + u0j*, as mentioned above. Since u0j = γ01(Time)_j + u0j*, u0j* can be thought of as the person-specific component of the ability, while γ01(Time)_j is the contribution that Time makes to the jth person's ability in this particular example. In other words, u0j* is the ability when Time is taken into account. The relationship between û0j* and û0j is plotted in the first plot of Figure 3. The û0j are clustered into 21 raw-score groups. However, û0j* varies within raw-score groups, depending on what Time scores the individuals have. When γ̂01 = 0.27, the contribution of Time, is taken into account, the jth person's ability is computed as 0.27(Time)_j + û0j*. The relationship between û0j and 0.27(Time)_j + û0j* is plotted in the second graph of Figure 3. The graph shows that the two estimates are almost identical when the effect of Time is included. This confirms the algebraic relationship u0j = γ01(Time)_j + u0j* described above.

Figure 3. The relationship between person parameter estimates from the model with a predictor and estimates from the model without a predictor (û0j* versus û0j, and 0.27(Time)_j + û0j* versus û0j). [plots omitted]

3.3. Parameter Recovery

In this simulation study, the same two-level model with a person-level predictor variable, given in Equations 7 and 22, is examined. In order to simulate 50 replicated data sets, the method described in the previous section is utilized: students' abilities and Time measures are simultaneously sampled from a bivariate normal distribution. Item difficulties are determined based on Table 1, as in the simulation study in the previous chapter.
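The data-generation scheme just described can be sketched as follows. This is hypothetical Python/NumPy code, not part of the original analysis (which used the HLM program for estimation); the difficulty values are placeholders standing in for the Table 1 values.

import numpy as np

rng = np.random.default_rng(1)

def simulate_predictor_data(n, difficulties, rho):
    """Sample (ability, Time) pairs from a standard bivariate normal with
    correlation rho, then generate Rasch responses from the abilities."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    ability, time = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulties[None, :])))
    responses = (rng.random(p.shape) < p).astype(int)
    return ability, time, responses

# For example, the illustrative condition: 250 examinees, 20 items, rho = 0.3.
difficulties = np.linspace(-2.0, 2.0, 20)   # placeholder difficulties
ability, time, y = simulate_predictor_data(250, difficulties, rho=0.3)
print(np.corrcoef(ability, time)[0, 1])     # should be near 0.3 in a large sample

Because ability and Time are both standardized, the population slope of the embedded regression (Equation 22) equals the correlation rho used here, which is why the estimated slope can be read as a correlation coefficient.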
The variables of interest in this simulation study are (a) sample size (n = 250, 500, and 1000), (b) the number of items (k = 10 and 20), and (c) the magnitude of the correlation between the person parameter and the person-level predictor (ρ = 0.2, 0.5, and 0.9). These three variables produce 3 x 2 x 3 = 18 conditions. Table 6 shows the layout of the design of this simulation study.

Table 6
The layout of the simulation study for the model with a person-level predictor variable

          ρ = 0.2    ρ = 0.5    ρ = 0.9
k = 10    n = 250    n = 250    n = 250
          n = 500    n = 500    n = 500
          n = 1000   n = 1000   n = 1000
k = 20    n = 250    n = 250    n = 250
          n = 500    n = 500    n = 500
          n = 1000   n = 1000   n = 1000

I present the results in Table 7. Overall, the RMSEs of the γ̂01 estimates are very small (i.e., all smaller than 0.01). Also, the means of the estimates of the coefficient are all smaller than the true values, implying that the coefficient tends to be underestimated. The standard deviations of the estimates across the 50 replications are also small; they range approximately between 0.02 and 0.07.

Table 7
The results of the parameter recovery study for the model with a person-level predictor

ρ    k   n     RMSE(γ̂01)  mean(γ̂01)  sd(γ̂01)
0.2  10  250   0.0064      0.1770      0.0772
0.2  10  500   0.0033      0.1728      0.0508
0.2  10  1000  0.0020      0.1695      0.0335
0.2  20  250   0.0044      0.1758      0.0623
0.2  20  500   0.0030      0.1701      0.0458
0.2  20  1000  0.0018      0.1696      0.0298
0.5  10  250   0.0064      0.4432      0.0568
0.5  10  500   0.0058      0.4431      0.0511
0.5  10  1000  0.0036      0.4497      0.0330
0.5  20  250   0.0099      0.4295      0.0706
0.5  20  500   0.0049      0.4448      0.0436
0.5  20  1000  0.0047      0.4388      0.0310
0.9  10  250   0.0040      0.8715      0.0571
0.9  10  500   0.0029      0.8637      0.0396
0.9  10  1000  0.0017      0.8700      0.0292
0.9  20  250   0.0025      0.8744      0.0436
0.9  20  500   0.0015      0.8755      0.0310
0.9  20  1000  0.0015      0.8676      0.0210

Figure 4. The root mean squared errors of the coefficient of the predictor, plotted against sample size for ρ = 0.2, 0.5, and 0.9 (k = 10 and k = 20). [plots omitted]

Figure 4 shows plots of the RMSE values, separated by the three different true coefficient values. When ρ = 0.9, the RMSE values are lower than when either ρ = 0.2 or ρ = 0.5, while the RMSE values are relatively high when ρ = 0.5. However, these differences are only at the third decimal place. When ρ = 0.2 or ρ = 0.9, the RMSE values are somewhat lower with k = 20 than with k = 10, especially for smaller samples. However, this relationship is not observed when ρ = 0.5.

Figure 5 shows plots of the means and the standard deviations of the estimated coefficient across the 50 replications. The means of the estimates show that the coefficient tends to be underestimated, although it is underestimated only by about 0.025 when ρ = 0.2 and ρ = 0.9, while it is underestimated by about 0.05 to 0.075 when ρ = 0.5. The sample size does not seem to affect the mean of the estimates much; the means are about the same across the three sample sizes for all true coefficient values. On the other hand, the standard deviations of the estimates across replications are smaller with larger samples and with more items. The only exception is when ρ = 0.5 and n = 250, where the standard deviation is smaller for k = 10. In summary, the coefficient of the predictor variable tends to be underestimated. The degree of underestimation depends on the true coefficient values.
Although the results revealed that the coefficient tends to be underestimated more when the true coefficient is 0.5, it is not clear why this happens, and further investigation is suggested to reveal the reason for this observation. Also, the results revealed that the estimates are similar across different sample sizes. However, sample size affects the consistency of the estimates; that is to say, the larger the sample size, the smaller the standard deviation of the estimates. This simulation study is limited to only one predictor in the model. Therefore, a model with more than one predictor needs to be investigated.

Figure 5. The mean and the standard deviation of the estimate of the slope coefficient of the predictor, plotted against sample size for ρ = 0.2, 0.5, and 0.9 (k = 10 and k = 20; the true value is shown as a reference line). [plots omitted]

3.4. DIF Model

In the previous example, a person characteristic variable was added only to the first equation of the level-2 models. This was because the person parameter was to be decomposed into more than one parameter: person ability and an effect of the predictor. Similarly, we can specify that the item parameters be decomposed into more than one parameter. In other words, we can examine whether item parameters function differently depending on person characteristics.

As an example of the simplest case, gender is used as a predictor and added to the level-2 models of the 1-P HGLLM, Equation 11. Since the level-2 models are person-level models, the person characteristic variable should be added to the level-2 models, while Equation 7 is used as is. As a result, the level-2 models are

β0j = γ00 + γ01(gender)_j + u0j
β1j = γ10 + γ11(gender)_j
...
β(k-1)j = γ(k-1)0 + γ(k-1)1(gender)_j ,    (24)

where (gender)_j is a dummy variable for which 1 is given to one of the gender groups and 0 to the other, say 1 for females and 0 for males in this example. Then, the linear predictor for a specific item i, after the level-1 and level-2 models are combined, is

η_ij = γ00 + γ01(gender)_j + u0j - [γq0 + γq1(gender)_j]
     = u0j + γ00 - γq0 - (γq1 - γ01)(gender)_j
     = u0j - [γq0 - γ00 + (γq1 - γ01)(gender)_j] ,    (25)

for i = q (i = 1, ..., k, and q = 1, ..., k - 1). The combined model shows that γq0 - γ00 + (γq1 - γ01)(gender)_j is the difficulty of the item associated with the qth dummy variable for gender group j, while -γ00 - γ01(gender)_j is the difficulty for the reference item, the item whose dummy variable has been dropped, for that group.
Therefore, γq1 - γ01 is the effect of (gender)_j on the item associated with the qth dummy variable, for q = 1, ..., k - 1, while γ01 is the effect of (gender)_j on the reference item. In other words, for the items with q = 1, ..., k - 1, γq0 - γ00 + (γq1 - γ01) is the difficulty for females and γq0 - γ00 is the difficulty for males, while -γ00 - γ01 is the difficulty for females and -γ00 is the difficulty for males for the reference item. If any of the values γ̂q1 - γ̂01, for q = 1, ..., k - 1, or γ̂01 for the reference item, are significantly different from zero, they indicate that males and females perform differently on those items, given the same ability. This suggests that the item may be biased against one of the gender groups. However, such a statistical difference does not always indicate bias, because performance differences between gender groups might occur because of real differences between the genders. A gender difference may result in an effect that looks like item bias (some may argue that such an effect is sufficient evidence), but a gender difference is not conclusive evidence of item bias per se.

The above example raises an important issue, because it is sometimes of interest to test publishers whether a specific item is biased against a particular group of people. If an item is statistically detected to function differently between sub-populations, the situation is referred to as differential item functioning (DIF). In IRT, an item is considered to show DIF when its item characteristic curve (ICC) for the target sub-population (the focal group) and the ICC for the rest of the population (the reference group) are different. Since the Rasch model, and consequently the 1-P HGLLM, can differ only in terms of item difficulties, ICCs can differ only in their locations along the x-axis (difficulty), but not in their shapes (i.e., their slopes or lower asymptotes). Therefore, if we look at the ICCs for the 1-P HGLLM, we will only be comparing values of item difficulties between the target sub-population and the rest of the population. If the target group's performance is considerably lower than that of the rest of the population, given the same ability, the item has a lower probability of a correct response, and hence a higher estimated item difficulty, for that group.

The most widely used method to detect DIF for the Rasch model is described by Wright and Stone (1979) and Wright and Masters (1982). It simply estimates item difficulties and their standard errors separately for the focal group and the reference group, and tests whether they are significantly different from each other. The method described in this paper is equivalent to the conventional method because both approaches compare the difficulties between two sub-populations. However, the method described in this paper does the job in a one-step analysis, while the conventional method requires a two-step analysis. Again, this one-step analysis may increase the precision of the item-parameter estimates, which reduces the magnitude of the standard errors of the estimates. This makes the test statistics more sensitive for rejecting the null hypothesis and concluding that the two groups perform differently, given the same ability.

As a numerical example, a set of hypothetical data is analyzed in which one of 10 items is assumed to be biased against one sub-group of the population. The same item difficulties as in the first numerical example are assumed, except that the first item in the test is assumed to be biased.
The biased item 69 is assumed to be harder for people in the target gender group (say, female, in this example), so that they have less chance to answer the item correctly than males. 2000 person ability values are sampled from the standard normal distribution, and item responses are generated using the same method as mentioned in the previous section. The level-1 model is the same as the first example for the unidimensional l-P HGLLM, while the level-2 models are r160, =7oo +701(ge"der), +“o, 161; =710 +7“(gender)J flz, =72.) +721(gender), 163} =730+73l(gender)j +A”. = y“, +741(gender)j 765) =7so +75,(gender)j 75:3,- =760+761(gender), 767, =770+771(gender), ,4, =780 +781(ge”der), L160, =790+79,(gender)j , where 140} ~ N(O, r). (26) In this example, it is specified that 70,, the effect of gender on the first dummy variable, equals 0.5 in the logit scale, where the first item is the biased item. Here item indicator variable for the first item is dropped. There is no bias for the rest of the items (i.e., items 2 through 10). This is equivalent to saying 70, - 7,1, = 0 for q = 1, , 9, which correspond to items 2 through 10. Then, the data are analyzed by the HLM program. The null 70 hypotheses Ho : 70, = 0 (27) and H0:7ql-}’Ol=0 (28) for q = 1, ..... , 9, are tested separately. The first hypothesis is tested directly by the t-test for )0, that is provided by the standard HLM output. On the other hand, the rest of the null hypotheses are tested by general linear hypothesis tests, which result in Wald-type asymptotic chi-square tests with df= 1. This type of hypothesis testing is also readily available in the HLM program. Table 8 shows estimates of gender effects on each item from the model with a linear constraint. The first column shows values of iql, the estimated coefficients for the gender dummy variables, and the second column shows the values of 77,“ — f0, , the effect of gender on each item. The third column shows the p-values for the t-tests on fro, and the general linear hypothesis tests on )3,“ — 170,, described in the previous section. The value of fro, = — 0.444 indicates the parameter value of yo, = — 0.5 is recovered fairly well, although it is not strong evidence because this result is based on only one data set. The p-value associated with f0, shows that 70, is significantly different from 0, 71 that is, men and women perform differently on the first item. On the other hand, fr” — go, through i9, - fro, are not significantly different from 0. One exception is item 5 for which p = 0.037. This item is assumed no difference between the gender groups, however, the result shows significant difference between the gender groups, indicating a Type I error. For other items, the result does not provide evidence that men and women perform differently on these items. Table 9 shows person parameter estimates from a model with the person-level predictor (gender) and a model with no predictor. The values are not exactly the same, but they are correlated with r = 0.999. This result confirms that the inclusions of the linear constraints affect the item parameter estimates, not person parameter estimates. These results are based on only one replication of a set of simulated data, and more extensive simulation study is expected to confirm the recovery of the parameter values, including I. 
For future work, it is suggested that standard errors for the person parameters be compared between the two models on the same data, to examine Mislevy's (1987) claim that such a linear constraint should improve the precision of the person parameter estimates. Also, investigations of the statistical power for rejecting the null hypothesis, in comparison with other conventional methods, would be beneficial in order to assess the usefulness of this approach in detecting biased items.

Table 8
Item parameter estimates from the DIF model

                  γ̂q1     γ̂q1 - γ̂01  p-value
q = 0 (item 1)    -0.444   N/A         < 0.0005
q = 1 (item 2)    -0.377   0.066       > 0.500
q = 2 (item 3)    -0.409   0.035       > 0.500
q = 3 (item 4)    -0.513   -0.069      > 0.500
q = 4 (item 5)    -0.238   0.206       0.037
q = 5 (item 6)    -0.573   -0.129      0.188
q = 6 (item 7)    -0.668   -0.223      0.079
q = 7 (item 8)    -0.491   -0.049      > 0.500
q = 8 (item 9)    -0.526   -0.082      > 0.500
q = 9 (item 10)   -0.290   0.154       0.118

Table 9
Person parameter estimates from the DIF model

Raw Score  No Predictor  With Predictor
0          -1.720        -1.695
1          -1.347        -1.322
2          -0.997        -0.972
3          -0.662        -0.638
4          -0.337        -0.311
5          -0.015        0.005
6          0.308         0.324
7          0.637         0.649
8          0.976         0.983
9          1.337         1.333
10         1.713         1.707

3.5. Summary and Comments on Practical Issues

In this chapter two examples were presented under one family of extensions of the 1-P HGLLM: the inclusion of a person-level predictor variable. Although this study deals with only one person-level predictor in the model, more than one predictor can be included.

In the first example a person-level predictor is included in the first equation of the level-2 models. As a result, person abilities are decomposed into two parts. This type of analysis is analogous to conducting a two-step analysis of a person characteristic variable on test scores. Such results might be used for accountability purposes, such as accreditation of schools in a state-wide testing program. Also, such results can be integrated into test construction processes in order to detect possible bias in a test or its items. If this is the case, a reduction of the variance of the person abilities (τ) in the model with a person characteristic variable, in comparison to the unconditional model with no predictor variable, is of interest. If τ is considerably reduced, the predictor variable accounts for a large portion of the variation in test performance. This could be an indication of bias in the test, unless there is evidence of similar variation in the criterion. A likelihood ratio test can be used to test whether the reduction in τ is significant.

This approach can also be used to investigate effects of test conditions on test scores. For example, if one is interested in the effect of allowing more time for some examinees, an indicator variable describing the extra time given to each examinee can be included in the model as a predictor variable. This would be a useful analysis if more time were accidentally allowed for some groups of people in a large-scale testing program. Again, the amount of reduction in τ is of interest.

Another practical issue concerns the use of the adjusted ability estimates (u0j*), instead of u0j, as estimates of abilities. Technically, it is reasonable to use u0j* because it provides the estimate of ability after the effects of the predictor variables are taken into account, and the estimates are associated with a smaller overall error. However, in some settings this may not be reasonable for philosophical and political reasons.
According to this approach, two people who obtained the same raw score on the test would have different ability estimates, depending on their measures on the predictor variables. For example, suppose the amount of instruction on a topic (Time) has a positive relationship with ability level. If the coefficient for Time is positive, a person with more instruction would have a lower u0j* value than a person with less instruction, given the same raw score. One disadvantage of this approach is that the person with more instruction is penalized just because that person received more instruction. This could be an unfair judgement of one's ability, especially when performance on the test itself is of interest, as with a criterion-referenced test. On the other hand, it would be an advantage if appropriate justifications are made. An example would be when one believes that test scores are confounded and therefore need to be adjusted.

In summary, the one-step analysis described in this chapter is analogous to conducting a two-step analysis of test data. The two-step analysis involves biased and inconsistent estimates of student abilities, as well as heteroscedastic errors of measurement. Although it is obvious that these are problematic in a two-step analysis, the amount of improvement with the one-step procedure was not investigated here. However, other sources that use different formulations (e.g., Adams, Wilson & Wu, 1997; Zwinderman, 1991) indicate that considerable improvement can be expected when a large set of data is analyzed. This is also true when results are compared with results from an analysis using raw scores, because raw scores are susceptible to both ceiling and floor effects. In future research it is expected that improvements for this specific model will be investigated.

Chapter 4
Multidimensional Models

In this chapter another generalization of the 1-P HGLLM is presented, namely, a multidimensional 1-P HGLLM. It is shown that confirmatory multidimensional Rasch analyses, with both between- and within-item multidimensional structures, can be formulated and analyzed under the multidimensional 1-P HGLLM.

4.1. Model

A multidimensional model can be treated as an extension of the generalized Rasch model. In the unidimensional 1-P HGLLM of the previous sections, only one person-specific latent trait related to the probability of getting a specific item correct was modeled. Now, more than one person-specific latent trait determining such probabilities is assumed. For item i (i = 1, ..., k), person j (j = 1, ..., n), and latent trait s (s = 1, ..., m), the level-1 structural model is expressed as follows:

η_ij = β01j X01ij + ... + β0mj X0mij + β1·j X1ij + ... + β(k-m)·j X(k-m)ij
     = Σ_{s=1}^{m} β0sj X0sij + Σ_{q=1}^{k-m} βq·j Xqij .    (29)

Here, β0sj is a parameter associated with the sth latent trait. A value of 1 is assigned to the dummy variable X0sij if item i is associated with the sth latent trait, and 0 otherwise. As in the unidimensional model, βq·j is the effect of the qth dummy variable. Notice that the second subscript, which would indicate the corresponding latent trait, is dropped and represented by a dot (·), because each item does not have to be associated with only one latent trait. As for the unidimensional model, Equation 29 can be written in matrix form, to show the data layout, as
η_j = W_j B_j ,  with W_j = [D  X]  and  B_j = (B0j', B1')' ,    (30)

where B0j is now an m x 1 column vector consisting of the m latent-trait parameters β01j, ..., β0mj; D is a k x m matrix in which the sth column is a vector of dummy variables for the sth latent trait; B1 is a (k - m) x 1 column vector consisting of the k - m item parameters; and X is a k x (k - m) design matrix consisting of the k - m dummy variables. Note that one dummy variable from each of the m latent traits has to be dropped in order to achieve full rank in the W_j matrix. Then the level-2 models are

β01j = γ010 + u01j
...
β0mj = γ0m0 + u0mj
β1·j = γ1·0
...
β(k-m)·j = γ(k-m)·0 ,    (31)

where

(u01j, ..., u0mj)' ~ N(0, T) ,    (32)

and T is the m x m covariance matrix with elements τ_ss' (s, s' = 1, ..., m). Here, u0sj is the random component of the sth latent trait for the jth person, which implies that each person has a unique value for each latent trait. The variance of the sth latent trait is τ_ss, and it is constant across people. Also, τ_ss' (s ≠ s') is the covariance between the sth and s'th latent traits, and it too is constant across people. Note that when m = 1, all items are associated with the same latent trait and the model (Equations 29, 31, and 32) is exactly the same as the unidimensional 1-P HGLLM.

This multidimensional formulation can be applied directly for confirmatory analysis purposes. It was already mentioned that each item does not need to be associated with only one latent trait. Assume k_s items are associated with the sth latent trait; then k1 + ... + km ≥ k. A test is considered to be multidimensional between items (Adams, Wilson, & Wang, 1997) if k1 + ... + km = k, in effect, when the test consists of several unidimensional subscales. On the other hand, a test is considered to be multidimensional within items (Adams, Wilson, & Wang, 1997) if k1 + ... + km > k, in effect, when at least one of the items is associated with more than one latent trait. Both types of multidimensionality can be modeled using Equation 29, depending upon how the D matrix is defined.

In such multidimensionally constructed tests, one of our interests is how strongly the latent traits are correlated. The multidimensional 1-P HGLLM approach is able to estimate the variance-covariance matrix of the vector of latent traits, (u01j, ..., u0mj)', in Equation 32; consequently, correlation coefficients between latent traits can be estimated. If the latent traits are highly correlated, use of a unidimensional IRT model might still be reasonable or provide meaningful parameter estimates. However, if they are not highly correlated, this would suggest that the use of a unidimensional IRT calibration of item and person parameters is questionable, and multidimensional analysis should be encouraged.

In order to describe the two different model formulations (i.e., between- and within-item multidimensional formulations), assume a 15-item test in which 2 latent traits are involved. Then, the level-1 model is

η_ij = β01j X01ij + β02j X02ij + β1·j X1ij + ... + β13·j X13ij ,    (33)

following Equation 29. There are only 13 dummy variables, because one item from each item group (trait) does not have a dummy variable, that is, (8 - 1) + (7 - 1) = 13. The level-2 models are

β01j = γ010 + u01j
β02j = γ020 + u02j
β1·j = γ1·0
...
β13·j = γ13·0 ,  where  (u01j, u02j)' ~ N( (0, 0)', [τ11 τ12; τ21 τ22] ) ,    (34)

following Equation 31. As mentioned above, both "between" and "within" multidimensionality can be modeled depending on how the D matrix, contained in W_j, is defined.
First, assume between-item multidimensionality in the 15-item test, in which the first 8 items are associated with the first latent trait and the other 7 items are associated with the second latent trait. Then η_j = (η_1j, η_2j, ..., η_15j)', B_j = (β01j, β02j, β1·j, β2·j, ..., β13·j)', and

        [ 1  0  -1   0   0   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0  -1   0   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0  -1   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0  -1   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0  -1   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0   0  -1   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0   0   0  -1   0   0   0   0   0   0 ]
 W_j =  [ 1  0   0   0   0   0   0   0   0   0   0   0   0   0   0 ]    (35)
        [ 0  1   0   0   0   0   0   0   0  -1   0   0   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0  -1   0   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0  -1   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0  -1   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0  -1   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0   0  -1 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0   0   0 ]

represent the between-item multidimensionality. Notice that the dummy variables for the eighth and the fifteenth items are dropped so that W_j has full rank. As a result, -γ010 and -γ020 are the difficulties of items 8 and 15, respectively. Also, γ(1)·0 - γ010, γ(2)·0 - γ010, ..., γ(7)·0 - γ010 are the difficulties of items 1 through 7, while γ(9)·0 - γ020, γ(10)·0 - γ020, ..., γ(14)·0 - γ020 are the difficulties of items 9 through 14. The ability of person j is represented by the vector (u01j, u02j).

On the other hand, assume that the 15-item test has a within-item multidimensional structure, in which items 1 through 10 are associated with the first latent trait and items 9 through 15 are associated with the second latent trait. Then, with η_j and B_j as above,

        [ 1  0  -1   0   0   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0  -1   0   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0  -1   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0  -1   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0  -1   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0   0  -1   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0   0   0  -1   0   0   0   0   0   0 ]
 W_j =  [ 1  0   0   0   0   0   0   0   0   0   0   0   0   0   0 ]    (36)
        [ 1  1   0   0   0   0   0   0   0  -1   0   0   0   0   0 ]
        [ 1  1   0   0   0   0   0   0   0   0  -1   0   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0  -1   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0  -1   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0  -1   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0   0  -1 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0   0   0 ]

represents the within-item multidimensionality. Here, the entries of 1 in columns 1 and 2 of W_j for rows 9 and 10 show the items that are associated with both traits. Again, the dummy variables for the eighth and the fifteenth items are dropped in order to achieve full rank for W_j. As mentioned in the previous chapter, any dummy variables can be dropped, as long as the W_j matrix remains of full rank. Here, dropping the dummy variable for the tenth item, instead of the one for the eighth item, would have been more consistent with the earlier presentations, because the tenth item is the last item associated with the first latent trait. However, the eighth item is chosen to be dropped because it is the last item associated with only the first latent trait. As a result, -γ010 and -γ020 are the difficulties of items 8 and 15, respectively. For items 1 through 7, the difficulties are γ(1)·0 - γ010, γ(2)·0 - γ010, ..., γ(7)·0 - γ010, and for items 11 through 14, the difficulties are γ(11)·0 - γ020, γ(12)·0 - γ020, γ(13)·0 - γ020, and γ(14)·0 - γ020. However, since items 9 and 10 relate to both latent traits, their difficulties are γ(9)·0 - γ010 - γ020 and γ(10)·0 - γ010 - γ020, respectively. The abilities of person j are, again, represented by the vector (u01j, u02j).
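To make the two design matrices concrete, the sketch below builds W_j = [D X] for the 15-item example under both specifications, dropping the dummy variables for items 8 and 15 as above. It is a hypothetical Python/NumPy helper, not code from the original study.

import numpy as np

def build_W(trait_items, dropped_items, k=15):
    """Build W = [D  X] for a between- or within-item multidimensional model.
    trait_items: one list of item numbers (1-based) per latent trait.
    dropped_items: items whose -1 dummy column is dropped to achieve full rank."""
    m = len(trait_items)
    D = np.zeros((k, m))
    for s, items in enumerate(trait_items):
        for i in items:
            D[i - 1, s] = 1.0
    kept = [i for i in range(1, k + 1) if i not in dropped_items]
    X = np.zeros((k, len(kept)))
    for col, i in enumerate(kept):
        X[i - 1, col] = -1.0
    return np.hstack([D, X])

# Between-item structure: items 1-8 on trait 1, items 9-15 on trait 2 (Equation 35).
W_between = build_W([list(range(1, 9)), list(range(9, 16))], dropped_items={8, 15})
# Within-item structure: items 1-10 on trait 1, items 9-15 on trait 2 (Equation 36).
W_within = build_W([list(range(1, 11)), list(range(9, 16))], dropped_items={8, 15})
print(W_between.shape, np.linalg.matrix_rank(W_between))   # (15, 15), rank 15
print(W_within.shape, np.linalg.matrix_rank(W_within))     # (15, 15), rank 15

The rank checks illustrate the point made above: with one dummy variable dropped per latent trait, both versions of W_j are of full rank.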
4.2. Illustrative Analysis

As a numerical example, a set of hypothetical data for 15 items and 250 people is generated based on the between-item multidimensional model described above. Two latent traits are modeled in the data set, where the first 8 items are associated with the first latent trait and the next 7 items are associated with the second latent trait. The following item parameter values are arbitrarily chosen: -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, and 2.0. Also, 250 person parameter vectors are independently sampled from the standard bivariate normal distribution, where the two latent traits are assumed to be correlated with magnitude ρ = 0.5, such that

(u01j, u02j)' ~ N( (0, 0)', [1  0.5; 0.5  1] ) .    (37)

Then, dichotomous item response data are generated according to the unidimensional Rasch model for the two item groups separately. Person and item parameters and the τs are estimated by the HLM program.

Table 10
Item parameter estimates of multidimensional data

                 item  MD      UD
Latent Trait 1   1     -2.460  -2.672
                 2     -1.593  -1.653
                 3     -0.928  -0.941
                 4     -0.509  -0.511
                 5     0.064   0.064
                 6     0.654   0.657
                 7     1.240   1.261
                 8     1.724   1.781
Latent Trait 2   9     -1.390  -1.404
                 10    -0.882  -0.884
                 11    -0.273  -0.274
                 12    0.549   0.554
                 13    0.842   0.857
                 14    1.647   1.715
                 15    1.754   1.833

Table 10 shows the estimates of the item parameters from both the unidimensional (UD) and multidimensional (MD) models on the same data. For the UD analysis, two separate UD analyses were conducted for the two item groups. For the MD analysis, the two item groups are analyzed at the same time, using the MD model described above. Item estimates are very close between the UD and MD models; in fact, they are correlated with r = 0.999. This occurs because the model assumes between-item multidimensionality, where no item is associated with both latent traits. As a result, item parameters are estimated as if they were from two separate item groups.

On the other hand, there are at least two notable results for the person parameter estimates from the MD 1-P HGLLM. First, in this particular example, there are 72 possible combinations of raw scores from the two item groups, (k1 + 1)(k2 + 1) = 72, although only 55 patterns are observed here. However, if a single unidimensional model were employed to analyze the 15 items, there would be only 16 possible raw scores and 16 possible corresponding ability estimates. This shows that the multidimensional model lets us distinguish people more finely than the unidimensional model does. Second, although up to 72 possible pairings of the raw scores could be obtained if the two item groups were analyzed separately via unidimensional models, that approach determines the person parameter estimates assuming that the two item groups are independent. As a result, two people who have the same raw score on one item group will have the same person ability estimate for that item group no matter what raw scores they have on the other item group. On the other hand, when the data are analyzed using the multidimensional model, person parameter estimates are determined by the raw score on the target item group as well as the raw scores on the other item groups. In other words, two people who have the same raw score on one item group can have different person parameter estimates if they have different raw scores on any of the other item groups. As a result, in this example, there are 72 possible different person parameter estimates for each of the two item groups.
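The counting argument above can be illustrated with a small simulation sketch. This is hypothetical code, assuming the same between-item structure, item difficulties, and ρ = 0.5 used to generate the illustrative data set.

import numpy as np

rng = np.random.default_rng(2)

k1, k2, n = 8, 7, 250
rho = 0.5
difficulties = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5,
                         -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])

# Correlated abilities for the two latent traits (Equation 37).
theta = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

# Each item is driven by the trait of its own subscale (between-item structure).
trait_of_item = np.array([0] * k1 + [1] * k2)
p = 1.0 / (1.0 + np.exp(-(theta[:, trait_of_item] - difficulties)))
y = (rng.random(p.shape) < p).astype(int)

raw1 = y[:, :k1].sum(axis=1)
raw2 = y[:, k1:].sum(axis=1)
print((k1 + 1) * (k2 + 1))                          # 72 possible (raw1, raw2) pairs
print(len(set(zip(raw1.tolist(), raw2.tolist()))))  # pairs actually observed (55 in the dissertation's data)
print(len(set((raw1 + raw2).tolist())))             # distinct total raw scores observed (at most 16)

A single unidimensional analysis can distinguish only the total raw scores counted by the last line, whereas the multidimensional analysis can assign a different estimate to each observed (raw1, raw2) pair.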
Table 11 shows estimates of the person parameters from both the UD and MD models. The first five columns contain estimates for the first latent trait, and the next five columns contain estimates for the second latent trait. The columns labeled UD contain estimates from the separate unidimensional analyses, while the columns labeled MD contain estimates from the single multidimensional analysis. Also, the columns labeled LS contain least squares (LS) estimates and the columns labeled EB contain empirical Bayes (EB) estimates. The estimates are grouped by the raw scores on the first latent trait, and the estimates for the second latent trait are ordered by the raw scores on trait 2 within the raw-score groups for the first latent trait.

Table 11
Person parameter estimates of multidimensional data

      Latent Trait 1                           Latent Trait 2
raw   UD LS   UD EB   MD LS   MD EB     raw   UD LS   UD EB   MD LS   MD EB
0     -2.851  -1.279  -2.913  -1.383    1     -1.461  -0.746  -1.542  -1.095
0     -2.851  -1.279  -2.853  -1.276    2     -0.600  -0.324  -0.609  -0.667
0     -2.851  -1.279  -2.800  -1.176    3     0.140   0.078   0.155   -0.267
0     -2.851  -1.279  -2.753  -1.082    4     0.840   0.471   0.851   0.117
0     -2.851  -1.279  -2.710  -0.991    5     1.561   0.865   1.544   0.493
1     -1.955  -0.928  -2.020  -1.158    0     -2.595  -1.203  -2.764  -1.441
1     -1.955  -0.928  -1.990  -1.047    1     -1.461  -0.746  -1.519  -0.986
1     -1.955  -0.928  -1.963  -0.946    2     -0.600  -0.324  -0.609  -0.567
1     -1.955  -0.928  -1.939  -0.852    3     0.140   0.078   0.148   -0.173
1     -1.955  -0.928  -1.917  -0.762    4     0.840   0.471   0.847   0.207
1     -1.955  -0.928  -1.896  -0.674    5     1.561   0.865   1.547   0.583
2     -1.201  -0.591  -1.227  -0.830    0     -2.595  -1.203  -2.683  -1.327
2     -1.201  -0.591  -1.217  -0.725    1     -1.461  -0.746  -1.495  -0.883
2     -1.201  -0.591  -1.208  -0.628    2     -0.600  -0.324  -0.607  -0.472
2     -1.201  -0.591  -1.199  -0.538    3     0.140   0.078   0.144   -0.083
2     -1.201  -0.591  -1.190  -0.451    4     0.840   0.471   0.844   0.295
2     -1.201  -0.591  -1.182  -0.364    5     1.561   0.865   1.551   0.670
2     -1.201  -0.591  -1.174  -0.278    6     2.362   1.269   2.318   1.049
3     -0.525  -0.264  -0.530  -0.512    0     -2.595  -1.203  -2.609  -1.220
3     -0.525  -0.264  -0.529  -0.412    1     -1.461  -0.746  -1.472  -0.785
3     -0.525  -0.264  -0.528  -0.319    2     -0.600  -0.324  -0.603  -0.381
3     -0.525  -0.264  -0.526  -0.231    3     0.140   0.078   0.142   0.005
3     -0.525  -0.264  -0.524  -0.145    4     0.840   0.471   0.843   0.381
3     -0.525  -0.264  -0.522  -0.060    5     1.561   0.865   1.557   0.755
3     -0.525  -0.264  -0.520  0.027     6     2.362   1.269   2.337   1.137
4     0.117   0.059   0.119   -0.202    0     -2.595  -1.203  -2.540  -1.117
4     0.117   0.059   0.117   -0.105    1     -1.461  -0.746  -1.448  -0.691
4     0.117   0.059   0.116   -0.015    2     -0.600  -0.324  -0.597  -0.292
4     0.117   0.059   0.116   0.072     3     0.140   0.078   0.141   0.090
4     0.117   0.059   0.116   0.157     4     0.840   0.471   0.843   0.465
4     0.117   0.059   0.116   0.243     5     1.561   0.865   1.563   0.841
4     0.117   0.059   0.116   0.330     6     2.362   1.269   2.357   1.225
4     0.117   0.059   0.114   0.421     7     3.332   1.694   3.301   1.626
5     0.755   0.381   0.752   0.104     0     -2.595  -1.203  -2.477  -1.017
5     0.755   0.381   0.753   0.199     1     -1.461  -0.746  -1.425  -0.598
5     0.755   0.381   0.754   0.289     2     -0.600  -0.324  -0.592  -0.204
5     0.755   0.381   0.756   0.375     3     0.140   0.078   0.141   0.176
5     0.755   0.381   0.760   0.546     5     1.561   0.865   1.570   0.928
5     0.755   0.381   0.762   0.635     6     2.362   1.269   2.379   1.315
5     0.755   0.381   0.763   0.729     7     3.332   1.694   3.351   1.721
6     1.423   0.707   1.406   0.504     1     -1.461  -0.746  -1.402  -0.507
6     1.423   0.707   1.414   0.593     2     -0.600  -0.324  -0.585  -0.117
6     1.423   0.707   1.422   0.680     3     0.140   0.078   0.141   0.262
6     1.423   0.707   1.431   0.767     4     0.840   0.471   0.844   0.637
6     1.423   0.707   1.440   0.855     5     1.561   0.865   1.578   1.016
6     1.423   0.707   1.450   0.947     6     2.362   1.269   2.401   1.407
6     1.423   0.707   1.459   1.044     7     3.332   1.694   3.406   1.819
7     2.159   1.040   2.109   0.812     1     -1.461  -0.746  -1.380  -0.415
7     2.159   1.040   2.150   0.991     3     0.140   0.078   0.142   0.349
7     2.159   1.040   2.173   1.080     4     0.840   0.471   0.845   0.725
7     2.159   1.040   2.197   1.172     5     1.561   0.865   1.586   1.107
7     2.159   1.040   2.222   1.267     6     2.362   1.269   2.426   1.503
8     3.023   1.386   2.990   1.312     3     0.140   0.078   0.142   0.439
8     3.023   1.386   3.211   1.711     7     3.332   1.694   3.532   2.034

As mentioned above, all 55 observed combinations of raw scores have unique values for both item groups when the EB estimates are obtained from the MD model. For example, the trait-1 person parameter estimate for a person who answered 6 items correctly in the first item group can range from 0.5039 to 1.0438, depending on the raw score on the second item group. On the other hand, when EB estimates are obtained by the UD models separately for the two item groups, that examinee gets the estimate of 0.7069 for latent trait 1 no matter what the raw score on the second item group is.

The plots in Figure 6 are based on the values in Table 11. The two plots in Figure 6-a show the person-parameter estimates from the two separate UD models; the first shows LS estimates and the second shows EB estimates. The EB estimates are considerably shrunken, while the shape of the plot is the same. The two plots in Figure 6-b show the estimates from the MD model. Again, the EB estimates are considerably shrunken, but the shape of the plot also differs from that for the LS estimates. Comparing the two EB plots clearly shows the difference between the estimates from the two models. From the MD analysis, the estimated variance matrix is

var(u01j, u02j)' = [0.719  0.421; 0.421  0.761] .

Figure 6. Plots of person parameter estimates from the UD and MD models: (a) estimates from two unidimensional models; (b) estimates from one multidimensional model (LS and EB estimates for the second latent trait plotted against those for the first latent trait). [plots omitted]
As a result, the correlation coefficient between the two latent traits estimated from the variances and covariance for the MD model is τ̂12/√(τ̂11 τ̂22) = 0.421/√(0.719 x 0.761) = 0.570. On the other hand, the observed correlation between the ability estimates for the two latent traits from the MD analysis is r̂ = 0.658, while r̂ = 0.251 for the UD analysis. Since the data were generated so that the two latent traits are correlated with magnitude ρ = 0.500 in the population, none of the correlation estimates is particularly close. However, the estimate of the correlation based on the estimated variances and covariance of the two latent traits from the MD analysis shows that the MD model is more appropriate, at least for that purpose. Since this result is based on only one replication of simulated data, a more extensive simulation study is needed in order to confirm the recovery of the correlation coefficients between the latent traits, as well as of the τ values themselves, for the MD model.

4.3. Parameter Recovery

In this simulation study I deal with the between-item multidimensional model, as in the previous numerical example. The variables of interest are (a) the number of latent traits (m = 2 and m = 3); (b) sample size (n = 250, 500, and 1000), representing small, medium, and large samples; (c) the magnitude of the correlation between the latent traits (ρ = 0.2, 0.5, and 0.9), intended to represent weak, medium, and strong correlations; and (d) the number of items per latent trait (k_s = 5 and k_s = 10). These variables create 2 x 3 x 3 x 2 = 36 conditions, and the analysis is replicated 50 times for each of the 36 conditions. Table 12 shows the layout of the 36 conditions to be investigated. The item difficulty parameter values are determined based on Table 1, as described previously. For each replication in each of the 36 conditions, person ability values are sampled from a standard multivariate normal distribution,

(u01j, u02j)' ~ N( (0, 0)', [1  ρ; ρ  1] )    (38)

when m = 2, and

(u01j, u02j, u03j)' ~ N( (0, 0, 0)', [1  ρ  ρ; ρ  1  ρ; ρ  ρ  1] )    (39)

when m = 3. Correlation coefficients between the latent traits are then computed as

r̂12 = τ̂12 / √(τ̂11 τ̂22)    (40)

for the 18 conditions with m = 2, and as

r̂12 = τ̂12/√(τ̂11 τ̂22),  r̂13 = τ̂13/√(τ̂11 τ̂33),  and  r̂23 = τ̂23/√(τ̂22 τ̂33)    (41)

for the 18 conditions with m = 3.

Estimated parameter values are compared across the 36 conditions using several statistics and indicators. Those include (a) the mean correlation coefficient between estimated and true parameter values, (b) the standard deviation of the correlation coefficients between estimated and true item-parameter values across the g = 50 replications, and (c) the root mean squared error (RMSE) of τ̂. The RMSE of τ̂ is defined to be

RMSE(τ̂_ss') = √[ Σ_{l=1}^{g} (τ̂_ss',l - τ_ss')² / g ]    (42)

for s ≠ s', where s, s' = 1, 2 when m = 2 and s, s' = 1, 2, 3 when m = 3, and where l indicates the lth replication (l = 1, ..., g). The RMSE of τ̂ is computed for each of the 36 conditions, and the RMSE values are compared across the 36 conditions in order to make inferences about the recovery of the τ values.
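The computations in Equations 40 through 42 are simple enough to sketch directly. The code below is a hypothetical illustration: it uses the covariance matrix reported for the two-trait illustrative analysis above, and made-up covariance estimates for the RMSE example.

import numpy as np

def trait_correlations(tau_hat):
    """Equations 40-41: r_ss' = tau_ss' / sqrt(tau_ss * tau_s's')."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    d = np.sqrt(np.diag(tau_hat))
    return tau_hat / np.outer(d, d)

def rmse(estimates, true_value):
    """Equation 42: RMSE of an estimated covariance element across g replications."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - true_value) ** 2))

# Covariance matrix reported for the two-trait illustrative analysis:
tau_hat = np.array([[0.719, 0.421],
                    [0.421, 0.761]])
print(trait_correlations(tau_hat)[0, 1])   # about 0.570

# Hypothetical covariance (tau_12) estimates from g = 3 replications, true value 0.5:
print(rmse([0.52, 0.47, 0.56], true_value=0.5))

Because the true variances equal 1 in Equations 38 and 39, the true off-diagonal element τ_ss' equals ρ, which is why the RMSE in Equation 42 speaks to the recovery of the correlation between the latent traits.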
Table 12
The layout of the simulation study for the multidimensional model

                      m = 2                     m = 3
              k_s = 5     k_s = 10      k_s = 5     k_s = 10
rho = 0.2     n = 250     n = 250       n = 250     n = 250
              n = 500     n = 500       n = 500     n = 500
              n = 1000    n = 1000      n = 1000    n = 1000
rho = 0.5     n = 250     n = 250       n = 250     n = 250
              n = 500     n = 500       n = 500     n = 500
              n = 1000    n = 1000      n = 1000    n = 1000
rho = 0.9     n = 250     n = 250       n = 250     n = 250
              n = 500     n = 500       n = 500     n = 500
              n = 1000    n = 1000      n = 1000    n = 1000

The RMSE values are compared across the 36 conditions in order to make inferences about the recovery of the correlation coefficients. Table 13 shows the obtained values of the statistics described above.

Overall, the RMSE(r̂) values are small, less than 0.1 in more than half of the conditions, indicating that in those conditions the r̂ estimates differ from the true correlation by no more than about 0.1 on average. Figure 7 shows three plots of RMSE(r̂). The first plot shows the results when the true ρ equals 0.2. The cases are roughly grouped into two clusters, one consisting of the cases with k_s = 5 and the other of the cases with k_s = 10. This indicates that the RMSE of r̂ is affected more by the number of items within each latent trait than by the number of latent traits to be estimated. In other words, r appears to be estimated more poorly when there are fewer items within each latent trait. This result is consistent with the unidimensional case, where the number of items affects the precision of the estimate of τ. Here, in the multidimensional case, the number of items within each latent trait, rather than the total number of items in the test, is the important factor.

The second plot in Figure 7 is for ρ = 0.5. In this plot, the tendency described above is even more obvious; the RMSEs for k_s = 10 are much smaller than those for k_s = 5. The last plot is for ρ = 0.9. In this plot, the differences between k_s = 5 and k_s = 10 are very small; the correlation coefficients seem to be estimated equally well in all the cases. It is difficult to explain why the differences are small when ρ = 0.9 and large when ρ = 0.5.
Table 13
The results from the parameter recovery study for the multidimensional model

m   k_s   rho   n      RMSE    mean r̂   var(r̂)   r(γ)    se(γ)
2   5     0.2   250    0.205   0.215    0.030     0.993   0.003
2   5     0.2   500    0.133   0.216    0.014     0.997   0.002
2   5     0.2   1000   0.077   0.234    0.007     0.998   0.001
2   5     0.5   250    0.171   0.576    0.032     0.994   0.003
2   5     0.5   500    0.139   0.585    0.013     0.997   0.002
2   5     0.5   1000   0.110   0.581    0.005     0.998   0.001
2   5     0.9   250    0.076   0.943    0.003     0.994   0.003
2   5     0.9   500    0.061   0.950    0.001     0.997   0.002
2   5     0.9   1000   0.060   0.954    0.001     0.998   0.001
2   10    0.2   250    0.106   0.234    0.011     0.992   0.003
2   10    0.2   500    0.064   0.210    0.004     0.996   0.001
2   10    0.2   1000   0.063   0.208    0.003     0.998   0.001
2   10    0.5   250    0.103   0.508    0.008     0.991   0.003
2   10    0.5   500    0.069   0.518    0.002     0.996   0.002
2   10    0.5   1000   0.043   0.526    0.002     0.998   0.001
2   10    0.9   250    0.054   0.925    0.003     0.991   0.003
2   10    0.9   500    0.052   0.944    0.001     0.995   0.001
2   10    0.9   1000   0.044   0.937    0.001     0.998   0.001
3   5     0.2   250    0.170   0.232    0.024     0.991   0.003
3   5     0.2   500    0.123   0.225    0.012     0.996   0.001
3   5     0.2   1000   0.085   0.233    0.006     0.998   0.001
3   5     0.5   250    0.179   0.548    0.028     0.992   0.003
3   5     0.5   500    0.132   0.578    0.010     0.996   0.002
3   5     0.5   1000   0.104   0.574    0.005     0.998   0.001
3   5     0.9   250    0.073   0.927    0.003     0.992   0.003
3   5     0.9   500    0.059   0.937    0.001     0.997   0.001
3   5     0.9   1000   0.050   0.946    0.000     0.998   0.001
3   10    0.2   250    0.111   0.213    0.011     0.992   0.002
3   10    0.2   500    0.073   0.212    0.006     0.996   0.001
3   10    0.2   1000   0.062   0.207    0.002     0.998   0.001
3   10    0.5   250    0.091   0.510    0.009     0.992   0.002
3   10    0.5   500    0.065   0.519    0.003     0.996   0.001
3   10    0.5   1000   0.051   0.517    0.002     0.998   0.001
3   10    0.9   250    0.057   0.912    0.003     0.992   0.003
3   10    0.9   500    0.047   0.926    0.001     0.996   0.001
3   10    0.9   1000   0.037   0.925    0.001     0.998   0.001

Note. RMSE, mean r̂, and var(r̂) refer to the estimated correlations between latent traits across the 50 replications; r(γ) and se(γ) are the mean and standard deviation of the correlations between true and estimated item parameters.

Figure 7. Root mean squared errors of r̂, plotted against sample size (n = 250 to 1000) for the four combinations of s = 2, 3 latent traits and k_s = 5, 10 items per trait; separate panels are shown for ρ = 0.2, ρ = 0.5, and ρ = 0.9.

The top three graphs in Figure 8 (the top graph in each panel) show the mean correlation coefficients between the true and estimated item parameters when the three different correlations between the latent traits are used to generate the data. The three graphs are very similar and show no notable differences; the magnitude of the correlation between the latent traits does not seem to affect the accuracy of item parameter estimation. Item parameters were estimated somewhat less precisely with smaller sample sizes, such as n = 250, but these differences are all in the third decimal place, and all of the correlation coefficients are greater than 0.990. The standard deviations of those correlations are plotted in the bottom three graphs of Figure 8. Although the graphs show a tendency for the standard deviations to be smaller with larger sample sizes, the differences are again in the third decimal place; all of the values are smaller than 0.004.

Figure 8. Mean correlation between true and estimated item parameters and their standard deviations, plotted against sample size for s = 2, 3 and k_s = 5, 10; separate panels are shown for ρ = 0.2, ρ = 0.5, and ρ = 0.9 (top graphs: mean correlations; bottom graphs: standard deviations).

The mean correlation coefficients between latent traits are plotted in the top three graphs of Figure 9. The tendencies are quite similar to those in the plots of the RMSEs of r̂ in Figure 7. This makes sense because the correlations are computed from the τ̂ components. Overall, the correlation coefficients tend to be overestimated, especially when k_s = 5. This happens because the variance components of τ tend to be underestimated, while the covariance components of τ tend not to be underestimated as much as the variance components are. This is evident from the results, where RMSE(τ̂_ss) = 0.225 for the variance components and RMSE(τ̂_ss') = 0.086 for the covariance components (s ≠ s').
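To see how this pattern inflates the correlation estimate, consider a purely hypothetical example (the numbers below are chosen only for illustration and are not taken from the study): suppose the true variances and covariance are τ11 = τ22 = 1.0 and τ12 = 0.50, so ρ = 0.50, but the variances are estimated with a 20% downward bias while the covariance is biased downward by only 4%. Then

\[
r = \frac{0.50}{\sqrt{1.0 \times 1.0}} = 0.50,
\qquad
\hat{r} = \frac{0.48}{\sqrt{0.80 \times 0.80}} = 0.60,
\]

so the estimated correlation exceeds the true value even though every component is underestimated.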
When the true correlation between the latent traits is ρ = 0.9, the differences between the mean correlations for the four conditions in Figure 9 are very small; the mean r̂ values are all between 0.9 and 0.95. In addition, when ρ = 0.9, the standard deviations of r̂ across the 50 replications are all small; they are always less than 0.003 (the standard deviations of r̂ are plotted in the bottom graph of each panel of Figure 9). Similarly, when ρ = 0.2, the differences in the mean r̂ between the four groups are small; the mean r̂ values are all between 0.2 and 0.24. However, the standard deviations of those r̂ values are not as small as when ρ = 0.9. When ρ = 0.5, the mean r̂ values range between 0.5 and 0.6; they are fairly close to 0.5 when k_s = 10, but closer to 0.6 when k_s = 5. The patterns observed for the standard deviations are very similar to those for ρ = 0.2.

In summary, the simulation study shows that the τ values and the correlation coefficients between latent traits are well estimated when there are more items within each latent trait, regardless of the number of dimensions to be estimated. Item parameters are well estimated regardless of the number of items within each latent trait.

Figure 9. Mean correlation between latent traits and their standard deviations, plotted against sample size for s = 2, 3 and k_s = 5, 10; separate panels are shown for ρ = 0.2, ρ = 0.5, and ρ = 0.9 (top graphs: mean correlations; bottom graphs: standard deviations).

4.4. Summary and Comments on Practical Issues

In this chapter the 1-P HGLLM was extended to both between-item and within-item multidimensional models. Between-item multidimensionality can exist in a test that is constructed so that more than one group of items is intended to measure different abilities. A good example is a testlet-based test. A testlet is defined to be "a group of items related to a single content that is developed as a unit and contains a fixed number of predetermined paths that an examinee may follow" (Wainer & Kiely, 1987, p. 190). For example, a series of reading-test items that are based on the same reading passage can be thought of as a testlet.
Another example is a series of science-test items that are based on the same scenario. When a test is composed of several testlets, it is referred to as a "testlet-based test". Examples include the reading section of the Test of English as a Foreign Language (TOEFL) and the Michigan Educational Assessment of Progress (MEAP) reading and science tests. On the other hand, within-item multidimensionality can exist in a test that is constructed so that some items measure distinct latent traits and some items measure more than one latent trait. A good example is a science test in which some items are strictly about either physical or natural sciences (not both) and some items require knowledge of both physical and natural sciences.

The multidimensional extension of the 1-P HGLLM can be utilized to estimate the magnitude of correlation between latent traits in a multidimensionally constructed test, such as the examples given above. It will also be a useful tool in construct validation for test and questionnaire construction. In such a case, a group of items intended to measure the same psychological construct is represented by a common latent trait.

Chapter 5
Three-Level Model

In this chapter another extension of the 1-P HGLLM, a three-level model, is presented. In the three-level formulation an additional level is considered; here the school level is used as the example. An illustrative analysis is presented as well.

5.1. Model

Suppose the third level of the model represents schools. Let p_ijm be the probability that person j in school m answers item i correctly. The additional subscript m indicates schools, in contrast to the two-level model, in which p has only two subscripts. The level-1 model is an item-level model, as in the two-level case. It is written as

\[
\eta_{ijm} = \beta_{0jm} + \beta_{1jm}X_{1ijm} + \beta_{2jm}X_{2ijm} + \cdots + \beta_{(k-1)jm}X_{(k-1)ijm}, \tag{43}
\]

where i = 1, ..., k, j = 1, ..., n, and m = 1, ..., r. It is identical to the level-1 model in the two-level model, except for the additional subscript for schools. X_qijm is the qth dummy variable (q = 1, ..., k − 1) for item i, for person j in school m; β_0jm is the overall effect across items, and β_qjm is the effect of the qth item, the effect of the kth item being constrained to 0. As in the two-level formulation, the β_qjm are not assumed to be constant across people, or across schools, in this level-1 model.

The level-2 models for the β_qjm parameters are person-level models; they allow the intercept β_0jm to vary across people while specifying that the item effects are constant across people. The person-level models for person j in school m are written as

\[
\begin{cases}
\beta_{0jm} = \gamma_{00m} + u_{0jm} \\
\beta_{1jm} = \gamma_{10m} \\
\beta_{2jm} = \gamma_{20m} \\
\quad\vdots \\
\beta_{(k-1)jm} = \gamma_{(k-1)0m},
\end{cases} \tag{44}
\]

where u_0jm ~ N(0, τ_π). Again, these models are identical to the level-2 models in the two-level formulation, except for the extra subscript m. Here u_0jm indicates how much person j in school m deviates from the average ability of school m, which is denoted r_00m below. γ_00m is the overall effect of school m across items, and γ_q0m is the effect of the qth item in school m, the effect of the kth item again being 0. The level-2 models do not assume that item effects are constant across schools.

Now an additional level-3 model, a school-level model, specifies that the item effects are constant across schools and that the overall effect across items (γ_00m) contains a school effect as well. For school m, we have

\[
\begin{cases}
\gamma_{00m} = \gamma_{000} + r_{00m} \\
\gamma_{10m} = \gamma_{100} \\
\gamma_{20m} = \gamma_{200} \\
\quad\vdots \\
\gamma_{(k-1)0m} = \gamma_{(k-1)00},
\end{cases} \tag{45}
\]

where γ_000 is the fixed component of γ_00m and r_00m is the random component of γ_00m.
On the other hand, γ_10m through γ_(k−1)0m have only fixed components. Here r_00m ~ N(0, τ_β). As a result, when the same dummy coding is used as in the two-level model, the combined model is

\[
p_{ijm} = \frac{1}{1 + \exp\{-[(r_{00m} + u_{0jm}) - (\gamma_{i00} - \gamma_{000})]\}}, \tag{46}
\]

where γ_i00 − γ_000 is the item difficulty for item i (i = 1, ..., k − 1) and −γ_000 is the item difficulty for item k. This parallels the two-level model, in which the item difficulty for the ith item is expressed as γ_i0 − γ_00. In this three-level formulation, r_00m + u_0jm is the ability parameter for person j in school m. Unlike the ability parameters in the two-level model, the ability parameters in the three-level model consist of two parts. First, r_00m is the random effect associated with school m and can be interpreted as the average ability of students in school m. Second, u_0jm is the person-specific ability of person j in school m, indicating how much the ability of person j deviates from the average ability of students in school m.

5.2. Illustrative Analysis

In order to illustrate parameter estimation in this 3-level model, a hypothetical data set is analyzed. In the data set, binary response data are generated based on the following specifications. A sample of 1000 students is assumed to come from the standard normal distribution. The 1000 students are further assumed to be in 20 different schools, with 50 students in each school. The 20 school means are sampled from a normal distribution with a mean of 0 and a variance of 0.5. This data generation is achieved by generating school mean abilities and individual abilities separately. First, 20 school means μ_m (m = 1, ..., 20) are sampled from N(0, σ²), where σ² = 0.5. Then 50 individual student ability measures are sampled separately for each school from N(μ_m, 1 − σ²), where μ_m is the school mean for school m generated in the previous step, and σ² = 0.5 in this case. This two-step data generation results in a sample of 1000 students from N(0, 1), while the school means are from N(0, 0.5). First, the data are analyzed with the 2-level model, in which school effects are ignored and only item and person parameters are estimated. Then the same data set is analyzed with the 3-level model, in which school mean abilities are estimated in addition to the item and person parameters.

The item parameter estimates from both the 2- and 3-level models are shown in Table 14. As the table shows, the item parameters are estimated almost identically in the 2- and 3-level models; they differ only in the fourth decimal place, at most. This makes sense because γ_i0 − γ_00 in Equation 12 is algebraically equivalent to γ_i00 − γ_000 in Equation 46.

Similar results are observed for the least squares (LS) person ability estimates. Here, let û_0j, for j = 1, ..., n, be the LS estimates of person abilities from the 2-level model, in order to distinguish them from the empirical Bayes (EB) estimates that have been used throughout this study. Also, let û_0jm + r̂_00m be the LS estimates of person abilities from the 3-level model. Figure 10-a shows the relationship between û_0j and û_0jm + r̂_00m. The two sets of values are almost identical, except for a small amount of variation at the highest and lowest raw scores. The computed correlation coefficient between the two estimates is 0.9996, further evidence that û_0j and û_0jm + r̂_00m are essentially identical. This observation makes sense because u_0j equals u_0jm + r_00m algebraically, from Equations 12 and 46.
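The following is a minimal sketch, in Python, of the two-step data generation described in Section 5.2 together with response generation under Equation 46. The ten item difficulties and the random seed are hypothetical choices for illustration; the estimation of the 2- and 3-level models themselves is carried out with the HLM program and is not reproduced here.

```python
# Two-step data generation for the illustrative 3-level analysis:
# 20 schools, 50 students per school, school-mean variance 0.5.
import numpy as np

rng = np.random.default_rng(1)

n_schools, n_per_school, sigma2 = 20, 50, 0.5
difficulties = np.linspace(-2.0, 2.0, 10)           # hypothetical item difficulties

# Step 1: school means r_00m ~ N(0, sigma2).
school_means = rng.normal(0.0, np.sqrt(sigma2), size=n_schools)
# Step 2: person deviations u_0jm ~ N(0, 1 - sigma2) within each school,
# so that the marginal ability distribution is N(0, 1).
person_dev = rng.normal(0.0, np.sqrt(1.0 - sigma2), size=(n_schools, n_per_school))
ability = school_means[:, None] + person_dev         # r_00m + u_0jm

# Equation 46: p_ijm = 1 / (1 + exp(-[(r_00m + u_0jm) - delta_i])),
# where delta_i = gamma_i00 - gamma_000 is the difficulty of item i.
logits = ability[:, :, None] - difficulties[None, None, :]
responses = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))   # (school, person, item)

print("marginal ability SD:", ability.std())          # should be close to 1
print("SD of school means:", school_means.std())      # should be close to sqrt(0.5)
```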
On the other hand, when EB estimation is employed, the person abilities are estimated quite differently in the 2- and 3-level models. Figure 10-b shows a scatter plot of the EB person ability estimates from the 2- and 3-level models. Here the EB person ability estimates from the 2-level model are û_0j, and the EB estimates from the 3-level model are r̂_00m + û_0jm.

Table 14
Item parameter estimates from 2-level and 3-level models

          2-level      3-level      true value
item1    -1.620590    -1.621330    -2.00
item2    -1.180070    -1.180280    -1.50
item3    -0.762940    -0.762980    -1.00
item4    -0.294100    -0.294100    -0.50
item5     0.212798     0.212799     0.25
item6     0.455718     0.455726     0.50
item7     0.919703     0.919794     1.00
item8     1.398376     1.398796     1.50
item9     1.710175     1.711024     2.00
item10    0.096074     0.096074     0.00

Figure 10. Person parameter estimates from the 2-level and 3-level models: (a) LS estimates, û_0jm + r̂_00m from the 3-level model plotted against û_0j from the 2-level model; (b) EB estimates, r̂_00m + û_0jm from the 3-level model plotted against û_0j from the 2-level model.

People who obtained the same raw score have the same EB ability estimate in the 2-level model, because the 2-level model is identical to the Rasch model. However, people with the same raw score have different EB ability estimates in the 3-level model. Through EB estimation, the 3-level model estimates person abilities so that individual abilities are normally distributed within schools. As a result, via EB estimation, the 3-level model estimates abilities differently for people with the same raw score, depending on which school they are in.

The correlation coefficient between û_0j and r̂_00m + û_0jm is 0.91. Although this indicates a strong relationship, û_0j and r̂_00m + û_0jm are quite different, as Figure 10 shows. However, if the weights for r̂_00m and û_0jm are allowed to differ from 1 so as to maximize the correlation between û_0j and a·r̂_00m + b·û_0jm, where a and b are constants, the multiple correlation of û_0j with a·r̂_00m + b·û_0jm becomes 0.9996, which is almost 1.00. More specifically, the correlation between û_0j and 0.591·r̂_00m + 1.489·û_0jm is 0.9996 (see Figure 11). This is because the EB estimates are the result of a "double shrinkage," in which the LS estimates are shrunk by both the level-2 and level-3 reliabilities.

The fact that the EB 3-level estimates distinguish among people who obtained the same raw score improves the relationship between the estimated abilities and the true abilities.

Figure 11. The relationship between the estimates from the 2-level model and the linear combination of the estimates from the 3-level model (0.591 × school mean + 1.489 × person deviation, plotted against the estimates from the 2-level model).
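How such weights can be obtained is sketched below: the 2-level EB estimates are regressed (without intercept) on the two 3-level EB components, and the multiple correlation is computed. The arrays in the sketch are synthetic placeholders, since the actual EB estimates come from the HLM program; only the procedure is illustrated.

```python
# Least-squares recovery of the weights a and b in a*r00m-hat + b*u0jm-hat.
import numpy as np

def eb_weights(eb_2level, eb_school, eb_person):
    """Return the least-squares weights (a, b) and the multiple correlation."""
    X = np.column_stack([eb_school, eb_person])
    coef, *_ = np.linalg.lstsq(X, eb_2level, rcond=None)
    fitted = X @ coef
    r = np.corrcoef(fitted, eb_2level)[0, 1]
    return coef, r

# Hypothetical placeholder estimates, only to make the sketch runnable:
rng = np.random.default_rng(2)
eb_school = rng.normal(size=1000)
eb_person = rng.normal(size=1000)
eb_2level = 0.6 * eb_school + 1.5 * eb_person + rng.normal(scale=0.05, size=1000)

(a, b), r = eb_weights(eb_2level, eb_school, eb_person)
print(f"a = {a:.3f}, b = {b:.3f}, multiple correlation = {r:.4f}")
```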
The first scatter plot in Figure 12 shows the relationship between the EB person ability estimates from the 2-level model and the true parameter values. There are only 11 possible ability groups, because the 2-level model is equivalent to the Rasch model, and the Rasch model estimates person abilities based only on raw scores. This scatter plot shows a large amount of estimation error. For example, in the 2-level model, people who obtained a raw score of 8 are estimated as having an ability of 1.04 on the logit scale, yet the true ability values for those people range from approximately −0.2 to +2.9 (see the third cluster from the right in the plot). This large variability is seen in the other score groups as well. When there are more items in a test, this variability should decrease to some degree, as discussed in Chapter 2. The variability is somewhat reduced when the person abilities are obtained from EB estimation in the 3-level model, as the lower graph in Figure 12 shows. The root mean squared error (RMSE) of the person parameter estimates is 0.628 for the EB 2-level estimation and 0.538 for the EB 3-level estimation, about 14% less than the RMSE from the 2-level estimation. In other words, EB 3-level estimation reduces the amount of estimation error in the person ability parameters. This also suggests that the 3-level model has the capacity to include level-3 predictors and estimate their effects, provided abilities are clustered within level-3 units.

Figure 12. Person parameter estimates from the 2-level and 3-level models compared to the true values (upper graph: true values plotted against the EB estimates from the 2-level model; lower graph: true values plotted against the EB estimates from the 3-level model).

Estimation of school means is dramatically improved when EB 3-level estimation is employed. The EB estimates of the school means from both the 2- and 3-level models, along with the true school means, are listed in Table 15. The school means for the 2-level model are obtained by simply computing the mean of the ability estimates within each school, in effect the mean of û_0j within each school, while the school means from the 3-level model are r̂_00m. The estimates from the 3-level model are closer to the true values than the estimates from the 2-level model for all but two of the 20 schools in this analysis (see the asterisks in Table 15). The root mean squared error of the school means is 0.120 for the 3-level model, a 66% reduction from 0.354 for the 2-level model. On the other hand, the estimated school means from the 3-level model are less variable, resulting in a smaller standard deviation of estimated school means (0.518 for the 3-level model) than the standard deviation of the true school means (0.636). This is again a result of the double shrinkage, in which the level-3 random effects are corrected by the reliabilities of both the second and third levels. Overall, I conclude that the 3-level model approach in this example enables one to estimate school means better than the conventional Rasch model or the 2-level 1-P HGLLM approach.
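As a quick arithmetic check of the reductions just reported, the percentages follow directly from the RMSE values given in the text:

```python
# Percent reduction in RMSE, using the values reported above
# (0.628 vs. 0.538 for person abilities, 0.354 vs. 0.120 for school means).
def pct_reduction(rmse_2level, rmse_3level):
    return 100.0 * (rmse_2level - rmse_3level) / rmse_2level

print(f"person abilities: {pct_reduction(0.628, 0.538):.1f}% reduction")  # ~14%
print(f"school means:     {pct_reduction(0.354, 0.120):.1f}% reduction")  # ~66%
```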
Table 15
Estimates of school means from 2-level and 3-level models

school    2-level    3-level    true value
1          0.6463     1.0489     1.0241
2*        -0.0037    -0.0025    -0.0500
3         -0.1987    -0.3063    -0.2732
4*         0.3185     0.5116     0.3950
5          0.2287     0.3670     0.3175
6          0.7211     1.1774     1.1659
7          0.2299     0.3671     0.3246
8          0.3079     0.4935     0.5247
9         -0.1777    -0.2738    -0.3143
10        -0.1756    -0.2726    -0.3504
11        -0.8477    -1.4355    -1.6018
12        -0.5384    -0.8717    -1.0193
13        -0.6587    -1.0831    -1.3294
14         0.4360     0.7012     0.8988
15         0.6667     1.0831     1.1190
16        -0.2320    -0.3658    -0.5703
17         0.0269     0.0479    -0.03964
18        -0.2413    -0.3779    -0.4936
19        -0.0393    -0.0554    -0.1419
20        -0.4690    -0.7540    -0.8681

Note. Only the schools marked with an asterisk show 2-level estimates closer to the true value.

In summary, while u_0j equals u_0jm + r_00m algebraically, this relationship holds only for the LS estimates; it does not hold for the EB estimates. Instead, the EB weighting scheme is such that û_0j equals 0.591·r̂_00m + 1.489·û_0jm in this example. The EB estimates from the 3-level model thus distinguish among people who obtain the same raw score. This yields improved person ability estimates as well as improved school mean estimates, at least in this specific illustrative analysis. The relationship is worth investigating further in a more extensive manner.

5.3. Summary and Comments on Practical Issues

The three-level formulation becomes quite useful when school mean abilities are of interest in addition to students' individual abilities. For example, when a large-scale assessment is used for the purpose of school accreditation, indicators that represent students' average performance within schools are needed. For a criterion-referenced test, the percentage of students who exceed the passing score is often used as an indicator of school performance. That is a good indicator if the percentage of students who do not exceed the passing score is the concern for accreditation. However, if the concern is more with the average performance of students within schools, estimates of the mean ability of each school are more appropriate. The three-level 1-P HGLLM approach can provide accurate estimates of school mean abilities, as described in this chapter.

This three-level formulation further enables one to include school-characteristic variables, as well as student-characteristic variables, in more complex models. This would be analogous to conducting a two-level HLM analysis with the Rasch model embedded therein. As mentioned in Section 3.1, this is a one-step analysis that avoids the bias and inconsistency of MMLE-based person parameter estimates, as well as heteroscedastic measurement errors in the outcome variable. Furthermore, the fact that the 3-level formulation reduces measurement error in both the person ability estimates and the school mean estimates should improve the estimates of the effects of person- and school-characteristic variables. Again, this one-step analysis of test data can be applied for the purpose of school accreditation. Judging students' performance based only on a test might be unfair for accreditation; differences in demographic variables, such as socioeconomic status (SES), might confound a school's average performance. If this is the case, using the level-3 school mean ability to represent students' performance after adjusting for the effects of such confounding variables would be appropriate.

One drawback of this 3-level formulation is that the model assumes equal variances of student abilities across schools.
If this assumption is not met, it can result in unfair judgments about individual abilities. Robustness to violations of this assumption should be investigated. Also, the fact that students in better schools would get credit just because they are in good schools may not be acceptable philosophically and politically, even though it is evident that the 3-level formulation improves person parameter estimates on average. A similar argument applies to the use of students' ability estimates when person-characteristic variables are taken into account, as discussed in Chapter 2. The impact of using person parameter estimates from the 3-level formulation is another potential area for future investigation. Despite such uncertainties, the EB 3-level analysis is promising as a way of both conducting a one-step analysis of student- and school-characteristic variables and estimating school mean abilities.

Chapter 6
Conclusions

In this chapter a summary of this dissertation is provided. Also, several comments on practical issues are given, some of them in addition to the comments given in the previous three chapters. Finally, several recommendations for future research are suggested.

6.1. Summary

One purpose of this study was to show the equivalence of the Rasch model and the one-parameter hierarchical generalized linear logistic model (1-P HGLLM), both algebraically and numerically. In Chapter 2, I showed that the standard binary-response Rasch model can be reformulated as a special case of the HGLM. This study also confirmed that person parameters estimated by the currently available algorithms in the HLM program were very close to the estimates from the BILOG program, although the HLM estimates had smaller variance. Also, item parameter estimates from the HLM and BILOG programs were somewhat different when rescaling was done.

Another purpose of this study was to present various possible extensions of the 1-P HGLLM. In the previous three chapters, three extensions of the 1-P HGLLM were presented: a two-level model with a person-level predictor, a two-level multidimensional model, and a three-level model. These extensions are not exhaustive. Other possible extensions of the 1-P HGLLM include (a) a two-level model with more than one person predictor variable in the level-2 models, (b) a two-level model with one or more item-characteristic variables in the level-1 model, (c) a two-level multidimensional model with person-characteristic variables in the level-2 models or item-characteristic variables in the level-1 model, (d) a three-level multidimensional model, and (e) a three-level multidimensional model with school-characteristic variables in the level-3 models, person-characteristic variables in the level-2 models, or item-characteristic variables in the level-1 model. Parameters for all of these cases can be estimated with the currently available HLM program.

These extensions have a variety of potential uses in practical settings. For example, it would be possible to predict item difficulties from item characteristics, using a model with item-characteristic variables, instead of item indicators, in the level-1 model. This would be very useful for a test in which item tryouts are not desirable or not possible for some reason, such as high item security. One would be able to learn which item characteristics determine item difficulty from previously administered items; a minimal regression sketch of this idea follows.
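The sketch below illustrates the idea with an ordinary linear regression of estimated item difficulties on item characteristics. The two characteristics (word count and number of solution steps) and all numeric values are hypothetical; in the 1-P HGLLM itself the same idea would be implemented in one step by replacing the item indicator variables in the level-1 model with the item-characteristic variables.

```python
# Predicting item difficulty from item characteristics (hypothetical features).
import numpy as np

rng = np.random.default_rng(3)

n_items = 40
word_count = rng.integers(10, 120, size=n_items)          # hypothetical feature 1
n_steps = rng.integers(1, 6, size=n_items)                # hypothetical feature 2
difficulty = -1.0 + 0.01 * word_count + 0.4 * n_steps + rng.normal(0, 0.3, n_items)

X = np.column_stack([np.ones(n_items), word_count, n_steps])
coef, *_ = np.linalg.lstsq(X, difficulty, rcond=None)

# Predicted difficulty of a newly written (untried) item:
new_item = np.array([1.0, 80.0, 3.0])                     # intercept, words, steps
print("estimated coefficients:", coef)
print("predicted difficulty for the new item:", new_item @ coef)
```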
Then, item-characteristic analyses of newly developed items could predict their difficulties before the items are given to examinees. A combination of this approach and a live equating procedure would help ensure a sound assessment even when items are not tried out.

The most important contribution of this study is that this generalization of the Rasch model as a hierarchical model allows one to add an additional level of data in a relatively simple way. Thus one can include multi-level linear predictors (e.g., school-level predictors as well as person-level predictors) in one analysis, although I did not deal with this extension in this study. This is an important contribution because the inclusion of linear predictors in an IRT model has been limited to two-level models in the past. In addition, I have shown that the parameters can be estimated with the currently available algorithms in the HLM program. Also, I showed that the generalized model can represent a multidimensional Rasch model. This, too, is an important contribution, because it can be readily applied to confirmatory multidimensional analysis in a multi-level Rasch model as well as in a regular single-level Rasch model.

Furthermore, IRT models, including the Rasch model, are often thought of as specialized statistical and psychometric models for item response data. Specialized estimation algorithms, often in specialized software, have been considered necessary for parameter estimation. For this reason, IRT models and their parameter estimation are often treated completely separately from other statistical models. This study clarified that the Rasch model is a special case of the HGLM and that its parameters can be estimated within the framework of the HGLM using algorithms in the HLM program, which are used for more general purposes. This perspective would be useful for didactic purposes, as discussed in Chapter 2.

6.2. Comments on Practical Issues

As discussed in the previous three chapters, this HGLM approach to test data analysis is widely applicable. However, I am not insisting that the 1-P HGLLM should replace the Rasch model. As I mentioned in Chapter 2, I do not believe that the 1-P HGLLM should be used for the sole purpose of item and person parameter estimation. The advantage of the 1-P HGLLM is that it can handle tasks that a conventional Rasch analysis does not deal with, including simultaneous analysis of person-level predictors and item responses, confirmatory multidimensional analysis, and 3-level analysis of item response data.

In a large-scale testing program, for example, the 1-P HGLLM could provide advantages at various stages of test construction and test result reporting. Including person-characteristic variables in an analysis could provide important information during pilot testing, in order to identify possibly biased items. Conducting a confirmatory multidimensional analysis would provide additional information for construct validation if the test is multidimensionally constructed. Also, when school-level performance is reported instead of individual performance, the 3-level model would provide better estimates of mean school ability, as discussed in detail in Chapter 5.

Another practical benefit, not mentioned in previous chapters but important, is that the 1-P HGLLM can handle missing data. This means that all students do not have to take the same set of items for the data to be analyzed.
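A minimal sketch of the data layout that makes this possible is shown below: in the long (person-by-item) format used for the level-1 model, a student simply contributes no record for an item he or she was not administered. The three-booklet design and all values are hypothetical.

```python
# Long-format item response data with missingness by design.
import pandas as pd

records = [
    # person, school, item, response
    (1, 1, "item01", 1), (1, 1, "item02", 0), (1, 1, "item03", 1),   # booklet A
    (2, 1, "item03", 1), (2, 1, "item04", 1), (2, 1, "item05", 0),   # booklet B
    (3, 2, "item01", 0), (3, 2, "item04", 1), (3, 2, "item05", 1),   # booklet C
]
long_data = pd.DataFrame(records, columns=["person", "school", "item", "response"])
print(long_data)
# Item 3 links booklets A and B, and items 1, 4, and 5 link booklet C to the
# others, so all item parameters can be placed on a common scale.
```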
This flexibility would allow one to estimate person and school abilities from matrix-sampled test data, along with student- and school-level predictors, all at the same time. One could also perform anchor-item test equating in one analysis, because all item parameters can be estimated on a common scale even if students take different sets of items.

Despite these many advantages, one drawback of the 1-P HGLLM in practical settings is that it does not provide standard errors for the person ability estimates, although it provides an estimate of the variance of the latent trait distribution. This may become a concern if a testing program decides to report person-ability estimates from the 1-P HGLLM. Parameter estimates are expected to be accompanied by their standard errors, because estimates with smaller standard errors are preferable to ones with larger standard errors. For this reason, it will be crucial for the 1-P HGLLM to provide standard errors for the random components of the model, in effect the individual ability estimates and the school abilities, if these are to be used in reporting ability estimates. This limitation does not, however, affect the reliability of the item-parameter estimates or of the estimates of the effects of predictor variables.

This study did not conduct a systematic investigation of the sample sizes required for 1-P HGLLM analyses; therefore, only approximate recommendations can be given. The three simulation studies in Chapters 2, 3, and 4 provided similar results in terms of the effect of sample size on the quality of the estimates. A sample size of 250 produced means of the estimates across replications that were as good as those from sample sizes of 500 and 1000. However, the standard deviations of the estimates across replications became much smaller when the sample size was increased to 500, while the differences between 500 and 1000 were not as large. Therefore, a sample size of 500 is recommended for item-parameter estimation, estimation of the effects of person-characteristic variables, and estimation of the correlation coefficients between latent traits. However, no sample size between 250 and 500 was investigated in the simulation studies, and more complicated designs with more parameters to be estimated were not investigated either. Further systematic investigation is desirable to provide more comprehensive recommendations on this issue.

Overall, anyone who analyzes test data in relation to demographic variables is encouraged to have software available for conducting 1-P HGLLM analyses. Although it takes some effort to become familiar with the software, the flexibility of the 1-P HGLLM analysis is worth the investment of time. The same recommendation applies to a testing program that wants or needs to estimate and report school-level results.

6.3. Suggestions for Future Research and Recommendations

Several recommendations can be made regarding practical issues. First, the reformulation of the Rasch model in terms of the HGLM should be presented for didactic purposes, as discussed in Chapter 2. Presenting this formulation would provide an alternative view of IRT models, which might help learners of IRT obtain a more general view of it, in terms of both model formulation and the interpretation of parameters. Second, the practical impact of using person parameter estimates from the 3-level formulation should be investigated.
It was noted in Chapter 5 that estimating person abilities from the 3-level model generally gives an advantage to people who are in a group with a higher mean; this may be acceptable in many cases, but not always. Finally, the obvious next step is to utilize this generalized model to answer real research questions using real data sets. For example, using the 3-level formulation, an evaluation of school performance along with person- and school-level characteristic variables could be of interest to a testing program, as mentioned in Chapter 5. Other applications of the 1-P HGLLM, compared with traditional approaches, are also encouraged.

Several technical problems also remain unsolved. First, it was demonstrated in Chapter 2 that item parameter estimates were somewhat different between the 1-P HGLLM and BILOG, while person parameter estimates were very close. The difference might be a result of differences in the error structures of the two models: the Rasch model considers errors in both item and person parameters, while the 1-P HGLLM is formulated so that all random variation is attributed to person variation. If this is the case, then together with the fact that the 1-P HGLLM yields a smaller variance of the person estimates (see Chapter 2), it might indicate superior parameter estimation for the 1-P HGLLM (i.e., the 1-P HGLLM may be better at controlling error). This issue needs further attention.

Second, it was mentioned in Chapter 5 that the 3-level formulation assumes equal variances of the level-3-unit distributions. In the illustrative analysis, the model assumed equal variances for the 20 schools. For this reason, results from data with unequal variation in level-3 units may be misleading. Robustness to violations of this assumption should be investigated in future research.

Third, it was mentioned in the previous section that test equating can be done in one analysis, because the 1-P HGLLM can handle missing data. More accurate equating results may be obtained because of the possible advantage that the 1-P HGLLM handles errors better (see above) and because of the one-step nature of the analysis in comparison to a two-step analysis (see Chapter 3). However, since there is no evidence at this point that the 1-P HGLLM always provides better results, a valuable task would be to investigate how much accuracy can be gained compared with other conventional equating procedures.

Fourth, there is a possibility that an estimate of the variance of a level-2 random component (τ) could be used as a measure of item fit for the corresponding item. If a model is specified so that each level-2 equation has its own random component, each random component can be conceptualized as the variation of the corresponding item difficulty across people. Large variation in the random component would mean that the effect of the item varies considerably across people (i.e., different difficulty for different people). Since an item should have the same difficulty across people, this would serve as a measure of item fit. This approach also has potential for use in differential item functioning (DIF) analysis along with level-2 predictor variables. It is important that this approach be examined further in future research.

Finally, although this study was limited to the Rasch model (i.e., the one-parameter logistic model), I hope to extend the generalization to the two-parameter logistic model, as well as to polytomous response models.
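As a sketch of the fourth suggestion above (using the two-level notation of Chapter 2; this specification was not estimated in this study), each level-2 equation would receive its own random term, and the estimated variance of that term would serve as the item-fit indicator for item q:

\[
\beta_{qj} = \gamma_{q0} + u_{qj}, \qquad u_{qj} \sim N(0, \tau_q), \qquad q = 1, \ldots, k-1,
\]

where a large estimate of τ_q indicates that the effect of item q varies considerably across people, that is, poor fit of item q to the Rasch model.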
In summary, the application of the hierarchical generalized linear model (HGLM) to item response theory (IRT) is new, yet applicable to a wide range of applied research, and the formulation is useful for pedagogical purposes. However, further investigation is needed to explore the behavior of parameter estimates from the HLM program, especially for the multidimensional models. Some practical benefits and advantages of using parameter estimates from this approach also remain to be investigated.

APPENDICES

APPENDIX A

The full matrix representation of Equation 9, for person j

For η_ij, X, and β_j as defined in Section 2.1.2,

\[
\begin{pmatrix}
\eta_{1j} \\ \eta_{2j} \\ \vdots \\ \eta_{(k-1)j} \\ \eta_{kj}
\end{pmatrix}_{(k\times1)}
=
\begin{pmatrix}
X_{01j} & X_{11j} & X_{21j} & \cdots & X_{(k-1)1j} \\
X_{02j} & X_{12j} & X_{22j} & \cdots & X_{(k-1)2j} \\
\vdots  & \vdots  & \vdots  &        & \vdots      \\
X_{0(k-1)j} & X_{1(k-1)j} & X_{2(k-1)j} & \cdots & X_{(k-1)(k-1)j} \\
X_{0kj} & X_{1kj} & X_{2kj} & \cdots & X_{(k-1)kj}
\end{pmatrix}_{(k\times k)}
\begin{pmatrix}
\beta_{0j} \\ \beta_{1j} \\ \vdots \\ \beta_{(k-1)j}
\end{pmatrix}_{(k\times1)} . \tag{A1}
\]

Assign −1 to X_qij if q = i, and 0 if q ≠ i, except when q = 0, in which case X_0ij = 1. Then,

\[
\begin{pmatrix}
\eta_{1j} \\ \eta_{2j} \\ \vdots \\ \eta_{(k-1)j} \\ \eta_{kj}
\end{pmatrix}
=
\begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
1 & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 0 & 0 & \cdots & -1 \\
1 & 0 & 0 & \cdots & 0
\end{pmatrix}
\begin{pmatrix}
\beta_{0j} \\ \beta_{1j} \\ \vdots \\ \beta_{(k-1)j}
\end{pmatrix} . \tag{A2}
\]

The equation above represents a set of k equations for person j, specifically

\[
\eta_{1j} = \beta_{0j} - \beta_{1j}, \quad
\eta_{2j} = \beta_{0j} - \beta_{2j}, \quad \ldots, \quad
\eta_{(k-1)j} = \beta_{0j} - \beta_{(k-1)j}, \quad
\eta_{kj} = \beta_{0j} . \tag{A3}
\]

APPENDIX B

The full matrix representation of Equation 17

For η_ij, X_0sij, X_qij, β_0sj, and β_qj as defined in Section 3.2.1,

\[
\begin{pmatrix}
\eta_{1j} \\ \eta_{2j} \\ \vdots \\ \eta_{(k-1)j} \\ \eta_{kj}
\end{pmatrix}_{(k\times1)}
=
\begin{pmatrix}
X_{011j} & \cdots & X_{0m1j} & X_{11j} & X_{21j} & \cdots & X_{(k-m)1j} \\
X_{012j} & \cdots & X_{0m2j} & X_{12j} & X_{22j} & \cdots & X_{(k-m)2j} \\
\vdots   &        & \vdots   & \vdots  & \vdots  &        & \vdots      \\
X_{01(k-1)j} & \cdots & X_{0m(k-1)j} & X_{1(k-1)j} & X_{2(k-1)j} & \cdots & X_{(k-m)(k-1)j} \\
X_{01kj} & \cdots & X_{0mkj} & X_{1kj} & X_{2kj} & \cdots & X_{(k-m)kj}
\end{pmatrix}_{(k\times k)}
\begin{pmatrix}
\beta_{01j} \\ \vdots \\ \beta_{0mj} \\ \beta_{1j} \\ \beta_{2j} \\ \vdots \\ \beta_{(k-m)j}
\end{pmatrix}_{(k\times1)} . \tag{B1}
\]

Assign 1 to X_0sij if item i is associated with the sth latent trait of the m traits, and 0 otherwise. Assign −1 to X_qij if q = i, and 0 if q ≠ i.
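The coding scheme in Appendices A and B can be illustrated with a short sketch (k = 5 items and the β values below are arbitrary choices for the illustration):

```python
# Dummy-coded design matrix of Appendix A: column 0 is the intercept indicator
# (X_0ij = 1), and column q carries -1 in the row of item q (q = 1, ..., k-1),
# so that eta_ij = beta_0j - beta_ij for i < k and eta_kj = beta_0j (Equation A3).
import numpy as np

def design_matrix(k):
    X = np.zeros((k, k))
    X[:, 0] = 1.0                      # X_0ij = 1 for every item
    for i in range(1, k):
        X[i - 1, i] = -1.0             # X_qij = -1 when q = i, 0 otherwise
    return X

k = 5
X = design_matrix(k)
beta = np.array([0.3, -1.0, -0.5, 0.0, 0.8])   # hypothetical beta_0j ... beta_(k-1)j
print(X)
print("eta_j =", X @ beta)             # last entry equals beta_0j = 0.3
```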
REFERENCES

Adams, R. J., & Wilson, M. (1996). Formulating the Rasch model as a mixed coefficients multinomial logit. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory and practice (Vol. 3, pp. 143-166). Norwood, NJ: Ablex.

Adams, R. J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1-23.

Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22(1), 47-76.

Andersen, E. B. (1972). The solution of a set of conditional estimation equations. Journal of the Royal Statistical Society, 34, 42-54.

Andersen, E. B., & Madsen, M. (1977). Estimating parameters of the latent population distribution. Psychometrika, 42, 357-374.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46, 443-459.

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-187.

Bock, R. D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25(4), 275-285.

Breslow, N., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-25.

Bryk, A. S., Raudenbush, S. W., & Congdon, R. (1996). HLM: Hierarchical linear and nonlinear modeling with the HLM/2L and HLM/3L programs. Chicago: Scientific Software Inc.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983a). Logistic latent trait models with linear constraints. Psychometrika, 48(1), 3-26.

Fischer, G. H. (1983b). Some latent trait models for measuring change in qualitative observations. In D. J. Weiss (Ed.), New horizons in testing (pp. 309-329). New York: Academic Press.

Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Goldstein, H. (1980). Dimensionality, bias, independence and measurement scale problems in latent trait test score models. British Journal of Mathematical and Statistical Psychology, 33, 234-260.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago: MESA Press.

Lord, F. M. (1984). Maximum likelihood and Bayesian parameter estimation in IRT (RR-84-30-ONR). Princeton, NJ: Educational Testing Service.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman and Hall.

Mellenbergh, G. J. (1994). Generalized linear item response theory. Psychological Bulletin, 115(2), 300-307.

Mellenbergh, G. J., & Vijn, P. (1981). The Rasch model as a loglinear model. Applied Psychological Measurement, 5(3), 369-376.

Mislevy, R. J. (1987). Exploiting auxiliary information about examinees in the estimation of item parameters. Applied Psychological Measurement, 11(1), 81-91.

Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analysis and test scoring with binary logistic models. Chicago: Scientific Software Inc.

Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.

Raudenbush, S. W. (1995). Posterior modal estimation for hierarchical generalized linear models with application to dichotomous and count data. Unpublished manuscript, Michigan State University.

Snijders, T. A. B. (1991). Enumeration and simulation methods for 0-1 matrices with given marginals. Psychometrika, 56, 397-417.

Stiratelli, R., Laird, N., & Ware, J. H. (1984). Random effects models for serial observations with binary responses. Biometrics, 40, 961-971.

Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175-186.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24(3), 185-201.

Wong, G. Y., & Mason, W. M. (1985). The hierarchical logistic regression model for multilevel analysis. Journal of the American Statistical Association, 80, 513-524.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Yang, M. (1995). A simulation study for the assessment of the non-linear hierarchical model estimation via approximate maximum likelihood. Unpublished manuscript, Michigan State University.

Yang, M. (1998). Increasing the efficiency in estimating multilevel Bernoulli models. Unpublished doctoral dissertation, Michigan State University.

Zwinderman, A. H. (1991). A generalized Rasch model for manifest predictors. Psychometrika, 56(4), 589-600.