This is to certify that the dissertation entitled THE ASYMPTOTIC DISTRIBUTION OF AN IRT MEASURE FOR ITEM FIT BASED ON PSEUDOCOUNTS presented by Deping Li has been accepted towards fulfillment of the requirements for the Ph.D. degree in Education.

THE ASYMPTOTIC DISTRIBUTION OF AN IRT MEASURE FOR ITEM FIT BASED ON PSEUDOCOUNTS

By

Deping Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counselling, Educational Psychology and Special Education

2005

ABSTRACT

The item fit measure $Q_{DM}$ is formed from the posterior distribution of proficiency (the pseudocounts) instead of the proficiency estimates. The reference distribution of $Q_{DM}$ is not $\chi^2$ but a quadratic function of normal variates.
A consistent estimator of the covariance matrix of pseudocounts is found and used to approximate the true asymptotic distribution of $Q_{DM}$. The data-based estimate of the covariance matrix depicts the interrelations among pseudocounts and shows reasonably good agreement with the true covariance matrix among pseudocounts for sample sizes as large as 1000. Results from simulation studies show that the method based on pseudocounts has adequate power for detecting item misfit and low Type I error rates. The method is robust to the underlying ability distribution and to the number of quadrature points. Real data applications suggest that the method provides more helpful information for assessing model-data fit than the $\chi^2$ test, even when the sample size is large.

Copyright by DEPING LI 2005

ACKNOWLEDGEMENTS

I am indebted to many people for criticism, suggestions, reviews, and constructive conversations. I wish to express my sincere thanks to the committee: Dr. Mark Reckase (chair), Dr. Kimberly Maier, Dr. Lijian Yang, and Dr. John Donoghue. Each contributed tremendously to the work by sharing their extensive professional knowledge and ideas.

I am especially grateful to Dr. Donoghue and Dr. Catherine McClellan for their continued advice, criticism, and wisdom, from the summer research experience through the completion of this work.

I would like to thank Educational Testing Service for its financial support, through both the summer intern research and the fellowship offered for this research. I would also like to thank the Center for Educational Performance and Information and Dr. Oren Christmas, whose assistance enabled the completion of my doctoral study. Thanks are also due to Hongwen One for her insightful critiques and helpful comments.

The encouragement of my wife, Yanlin Jiang, and her support in all aspects did much to reduce the burden of the work involved.

Contents

List of Tables
List of Figures

1 Introduction to IRT Measures of Item Fit
  1.1 Item Fit in General Context of Assessing the Fit of the IRT models
  1.2 Item Fit Analysis Based on Ability Estimates
  1.3 Item Fit Analysis Based on Raw Scores
  1.4 Item Fit Analysis Based on Pseudocounts
  1.5 Approximation by Observed Covariance Among Pseudocounts
  1.6 Reformulating the Item Fit Measure $Q_{DM}$

2 Item Fit Analysis Based on Pseudocounts
  2.1 Definitions and Notations
  2.2 Asymptotic Distributions of Pseudocounts
  2.3 The Asymptotic Distribution of the Item Fit Measure $Q_{DM}$
    2.3.1 Reformulated $Q_{DM}$ and Its Asymptotic Distribution
    2.3.2 Asymptotic Distribution of Q
  2.4 The Observed Covariance Matrix of Interrelations among Pseudocounts
  2.5 Estimation of the Asymptotic Distribution for $Q_{DM}$

3 Simulation Studies on Item Fit
  3.1 Type I Error Rates
  3.2 Coefficients for the Asymptotic Distributions
  3.3 Item Misfit and Power with Known Item Parameters
  3.4 Item Misfit and Power with Item Parameter Estimates
  3.5 True Asymptotic Distribution Versus the Approximation
  3.6 Sensitivity Analysis
    3.6.1 Non-normal Proficiency Populations
    3.6.2 The Number of Quadrature Points and Item Fit
  3.7 Computing Time and Programs

4 Real Data Applications
  4.1 Assumptions
  4.2 Two Approaches on Item Fit Analysis for Real Data
  4.3 Graphic Approach

5 Concluding Remarks and Future Research Directions

BIBLIOGRAPHY

List of Tables

3.1 True Item Parameters for the Test of 15 Items
3.2 Type I Error Rate for Sample Size 500
3.3 Type I Error Rate for Sample Size 1000
3.4 Type I Error Rate for Sample Size 5000
3.5 The 20 Positive Eigenvalues from True Covariance Matrix
3.6 20 Eigenvalues for True Item Parameters (N = 500)
3.7 20 Eigenvalues for True Item Parameters (N = 1000)
3.8 20 Eigenvalues for True Item Parameters (N = 5000)
3.9 20 Eigenvalues for Item Parameter Estimates (N = 500)
3.10 20 Eigenvalues for Item Parameter Estimates (N = 1000)
3.11 20 Eigenvalues for Item Parameter Estimates (N = 5000)
3.12 The Power for Test Data Generated by 3PL Model with True Item Parameters
3.13 The Power for Test Data Generated by 2PL Model with True Item Parameters
3.14 The Power for Test Data Generated by 1PL Model with True Item Parameters
3.15 The Power for Test Data Generated by 3PL Model with Item Parameter Estimates
3.16 The Power for Test Data Generated by 2PL Model with Item Parameter Estimates
3.17 The Power for Test Data Generated by 1PL Model with Item Parameter Estimates
3.18 Type I Error Rates for Non-normal Ability Population and Data-Based Item Parameter Estimates
3.19 RMSE for Non-normal Ability Population
3.20 Type I Error Rates for Three Numbers of Quadrature Points
4.1 MEAP 2000 Fall High School Science Test Items with the 3PL Model (N = 7088)
4.2 MEAP 2000 Fall High School Mathematics (N = 6857)
4.3 MEAP 2000 Fall High School Science Items (N = 7088)
4.4 MEAP 2000 Fall High School Mathematics Items (N = 6857)

List of Figures

3.1 True Asymptotic Probabilities Versus Approximation (N = 500)
3.2 True Asymptotic Probabilities Versus Approximation (N = 1000)
3.3 True Asymptotic Probabilities Versus Approximation (N = 5000)
3.4 Beta Distribution versus Standard Normal Distribution
3.5 Item Fit Statistics $Q_{DM}$ and Number of Quadrature Points
3.6 Asymptotic Probabilities and Number of Quadrature Points
3.7 Item Fit Statistics $Q_{DM}$ and Number of Quadrature Points
3.8 Asymptotic Probabilities and Number of Quadrature Points
4.1 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (1-4)
4.2 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (5-8)
4.3 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (9-12)
4.4 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (13-16)
4.5 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (17-19)
5.1 Item Response Functions for the 3PL, 2PL, and 1PL Model (Item 1, 8, 10, 15)

Chapter 1

Introduction to IRT Measures of Item Fit

1.1 Item Fit in General Context of Assessing the Fit of the IRT models

Item response theory (IRT) has become one of the most important tools for the design of educational and psychological tests and for the analysis of test data. IRT provides a philosophical framework for test design and many other applications (e.g., differential item functioning, test equating, and computer adaptive testing). The advantages of IRT may not be fully realized if the test data do not adequately fit the item response models.
Assessing model-data fit is fundamental in psychometrics and has always been an issue of enormous interest. Model-data fit should be a primary concern when applying IRT models to test data. However, there is no consensus on the diagnostic tools for model-data fit.

There are other aspects of model-data fit, such as person fit and other types of misfit, including violations of local independence and unidimensionality (Hambleton & Swaminathan, 1985, pp. 151-195; Embretson & Reise, 2000, pp. 238-246; Glas & Meijer, 2003; Hoijtink, 2001; Sinharay & Johnson, 2003), but this research is limited to item fit.

In IRT, there is no need to fit a set of data with the same model for all items, because a test can combine different types of items (e.g., dichotomous, polytomous, or constructed response items). Even items with the same type of responses may be represented by different mathematical models, and separate IRT models may be needed for adequate fit. Therefore, attention should be paid to the fit of the IRT model on an item-by-item basis.

Item fit analysis should also play an important role in decisions about the retention of items in the assessment pool. Poorly fitting items undermine the validity of decisions based on measurement results. In this chapter, various measures of item fit and the corresponding statistical approaches for testing goodness of fit at the item level are reviewed.

Generally speaking, there are two basic approaches to assessing item fit: graphical (or heuristic) procedures and statistical tests. Graphical procedures are intuitive but more subjective in deciding the adequacy of model-data fit. Statistical tests of goodness of fit (e.g., the $\chi^2$ or likelihood ratio test) are probably the most widely used in current operational research.
In graphical procedures, the adequacy of item fit is typically evaluated by comparing an empirical item response function, obtained from the sample of test data, with a hypothetical item response function. Detailed descriptions of graphical procedures can be found in most IRT literature on model-data fit (e.g., Hambleton & Swaminathan, 1985; Embretson & Reise, 2000, p. 234). Plots of the empirical and hypothetical item response functions can reveal areas along the proficiency continuum where the two functions disagree. The discrepancies indicate the degree of item misfit.

1.2 Item Fit Analysis Based on Ability Estimates

Much research on item fit has been conducted via significance tests. This section reviews Wright and Panchapakesan's (1969) $\chi^2$ test, Bock's (1972) $\chi^2$ test, the likelihood ratio test, and the standardized residuals test.

The procedure advocated by Wright and Panchapakesan (1969) is a commonly used statistical test. The procedure defines a standardized variable

$$y_{ij} = \frac{f_{ij} - E(f_{ij})}{\sqrt{\mathrm{Var}(f_{ij})}},$$

where $f_{ij}$ represents the frequency of examinees at the $i$th ability level answering the $j$th item correctly. The measure of item fit is then $\chi^2 = \sum_{i=1}^{G} y_{ij}^2$, where $G$ is the number of ability levels. Wright and his colleagues assume this measure to have a chi-square distribution.

The Bock (1972) chi-square index is defined as

$$\chi^2_{Bock} = \sum_{g=1}^{G} \frac{N_g (O_{ig} - E_{ig})^2}{E_{ig}(1 - E_{ig})},$$

where $O_{ig}$ is the observed proportion correct on item $i$ for interval group $g$, $E_{ig}$ is the expected proportion correct based on the hypothetical item response function evaluated at the median proficiency estimate within the interval, and $N_g$ is the number of examinees whose ability estimates fall within proficiency interval $g$, which comes from the classification of the proficiency estimates. This index is assumed to be asymptotically distributed as a $\chi^2$ variable with degrees of freedom equal to $G - m$, where $m$ represents the number of item parameters to be estimated.
High values of the item fit index indicate that the data may not fit the hypothetical model reasonably well on the item.

The Wright and Mead (1977) statistic is based on a number-correct grouping approach for the Rasch model. The statistic is given by

$$\chi^2_{WM} = \sum_{g=1}^{G} \frac{N_g (O_{ig} - E_{ig})^2}{E_{ig}(1 - E_{ig}) - S_{ig}^2},$$

where $S_{ig}^2 = \frac{1}{N_g} \sum_{k \in g} \left(P_i(\theta_k) - E_{ig}\right)^2$ and $P_i(\theta_k)$ is the proportion correctly answering item $i$ in score group $k$. The degrees of freedom are $G$, the number of intervals for the proficiency estimates, minus the number of parameters estimated.

Yen's (1981) $Q_1$ statistic uses the mean proficiency within each proficiency category to obtain the predicted item response function. Furthermore, Yen fixes 10 categories of proficiency in calculating the $Q_1$ index, which is assumed to be approximately distributed as $\chi^2$ with degrees of freedom equal to the number of categories minus the number of parameters.

The likelihood ratio $G^2$ is implemented in BILOG-3 (Mislevy and Bock, 1990) and BILOG-MG (Zimowski, Muraki, Mislevy, and Bock, 1996). $G^2$ is computed by comparing the observed frequencies with those predicted from the hypothetical model:

$$G^2_{BILOG} = 2 \sum_{i=1}^{G} \left[ R_i \log \frac{R_i}{N_i P(\bar{\theta}_i)} + (N_i - R_i) \log \frac{N_i - R_i}{N_i \left(1 - P(\bar{\theta}_i)\right)} \right].$$

This test of item fit was designed for long tests (e.g., more than 20 items). In this test, the EAP estimate of proficiency for each examinee is computed based on the item parameter estimates and then assigned to a proficiency interval. The summation is performed over the $G$ groups on the ability scale $\theta$, $R_i$ is the number of correct responses within group $i$, and $N_i$ is the number of examinees in group $i$. This $G^2$ is also assumed to be distributed as $\chi^2$ with degrees of freedom equal to the number of proficiency groups.

Standardized residuals are used to assess item fit in the Rasch model context (e.g., Masters & Wright, 1996). In this procedure, the expected response $E(X_{si})$ for a particular person $s$ responding to item $i$ is described by $E(X_{si}) = \sum_{k} k\, P_{ik}(\theta_s)$, where the sum runs over the score categories $k$ of item $i$.
The variance of $X_{si}$ can be calculated by $\mathrm{Var}(X_{si}) = \sum_{k} \left(k - E(X_{si})\right)^2 P_{ik}(\theta_s)$. Let $Z_{si}$ denote the standardized residual; then

$$Z_{si} = \frac{X_{si} - E(X_{si})}{\sqrt{\mathrm{Var}(X_{si})}}.$$

A mean square fit statistic, i.e., $\sum_{i=1}^{n} Z_{si}^2 / n$, can then be computed as an item fit measure. The summation is performed over the $n$ items in the test.

The above measures of item fit and the corresponding statistical tests are open to criticism. The most common criticism is that these measures and their significance tests often require parameter estimation (i.e., item and ability estimates) and are often viewed as inconclusive evidence of adequate fit. The most commonly used measures of item fit (e.g., Bock, 1972; Yen, 1981) use model-based estimates of the latent proficiency of examinees, such as the maximum likelihood estimate (MLE) or the expected a posteriori (EAP) estimate. In computing these fit measures, the proficiency estimates are generally treated as point estimates containing no error, an obviously false assumption. That is, even if the model fits the data perfectly, the proficiency estimate for an individual is hardly ever equal to the true value because of estimation error. This problem is especially pronounced for short tests, where proficiency estimates have larger error. In addition, the proficiency estimates are grouped into intervals that serve as the basis of a contingency table measure of fit. Because of the uncertainty in proficiency estimation, the estimates are subject to errors of classification, which makes the use of the chi-square reference distribution questionable. Several studies (e.g., Reise, 1980; Rogers and Hattie, 1987; McKinley and Mills, 1985) have indicated that the sampling distributions of these measures are not $\chi^2$. Moreover, in some contexts researchers point out that the $\chi^2$ statistic for a single item is insensitive to certain types of misfit (e.g., Van den Wollenberg, 1982; Drasgow et al., 1995).
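The grouping-based statistics reviewed in this section share one recipe: sort examinees by a proficiency point estimate, cut them into $G$ groups, and compare observed with model-implied proportions correct. The sketch below illustrates that recipe for a Bock-type index. Several details are assumptions made only for illustration and do not come from the text: equal-count groups stand in for proficiency intervals, a 3PL response function is used, and the names `irf_3pl` and `bock_chi_square` are hypothetical.

```python
import math
from statistics import median


def irf_3pl(theta, a, b, c):
    """3PL item response function: P(correct | theta)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def bock_chi_square(theta_hat, responses, a, b, c, n_groups=10):
    """Bock-type chi-square for one item.

    theta_hat : proficiency point estimates, one per examinee
    responses : 0/1 responses to the studied item, in the same order
    Groups are equal-count slices of the sorted estimates (an assumption;
    the index in the text groups by proficiency intervals).
    """
    pairs = sorted(zip(theta_hat, responses))
    n = len(pairs)
    chi2 = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        n_g = len(group)
        observed = sum(u for _, u in group) / n_g           # O_ig
        expected = irf_3pl(median(t for t, _ in group), a, b, c)  # E_ig at the group median
        chi2 += n_g * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2


# One group of four examinees at theta = 0: E = 0.5, O = 0.75.
value = bock_chi_square([0.0, 0.0, 0.0, 0.0], [1, 1, 1, 0],
                        a=1.0, b=0.0, c=0.0, n_groups=1)
```

Note that `theta_hat` enters the computation as if it were error-free, both when forming the groups and when evaluating the expected proportion. That is precisely the step the criticisms in this section call into question.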
1.3 Item Fit Analysis Based on Raw Scores

Because of the shortcomings of measures based on point estimates of ability, alternative measures have been developed. In the past 10 years, two main approaches have been put forth. The first approach was suggested by Orlando and Thissen (2000, 2003). Their approach computes IRT-based expected values for each level of the total score on the test (the raw or number-correct score). It then uses the observed frequencies for the total scores to compute a fit measure (likelihood ratio $G^2$ or Pearson $\chi^2$). The item fit statistics for item $i$ suggested by Orlando and Thissen (2000) are of the form

$$S\text{-}X_i^2 = \sum_{k=1}^{I-1} N_k \frac{(p_{ik} - E_{ik})^2}{E_{ik}(1 - E_{ik})}$$

and

$$S\text{-}G_i^2 = 2 \sum_{k=1}^{I-1} N_k \left[ p_{ik} \log \frac{p_{ik}}{E_{ik}} + (1 - p_{ik}) \log \frac{1 - p_{ik}}{1 - E_{ik}} \right],$$

with $k$ standing for the raw score category, $k = 0, 1, 2, \ldots, I$, $N_k$ for the number of examinees with score $k$, and $p_{ik}$ and $E_{ik}$ respectively representing the observed and expected proportions correct for item $i$ in raw score group $k$. Orlando and Thissen then compare the statistic to a chi-square distribution (the two statistics are assumed to have asymptotic $\chi^2(I - 4)$ distributions under the null hypothesis that the fitted model is true). Unfortunately, their statistic is not distributed exactly as chi-square when item parameters are estimated by MMLE (Donoghue, McClellan, and Oranje, 2004; Sinharay, 2005). However, the departure from $\chi^2$ appears to be relatively small, a result supported by several simulation studies (e.g., Orlando and Thissen, 2000; Stone and Zhang, 2002); the departure of the distributions of $S\text{-}X^2$ and $S\text{-}G^2$ from the referred $\chi^2(I - 4)$ distribution may be severe for a short test.

Glas and Suarez-Falcon (2003) suggest an item fit statistic that is based on the Lagrange multiplier test (or the equivalent efficient score test) and uses number-correct scores to form examinee groups.
For item $i$, the statistic is used to test the null hypothesis $H_0$ (e.g., the 3PL model is correct) against an alternative hypothesis in which the model is defined as

$$p(u_i \mid \theta, a_i, b_i, c_i, \delta_s, s) = c_i + (1 - c_i) \frac{1}{1 + e^{-a_i(\theta - b_i - \delta_s)}},$$

where $s$ indicates the raw score group an examinee belongs to, $a_i$, $b_i$, $c_i$ are the parameters for item $i$, and $\delta_s$ adjusts the item difficulty $b_i$ for score group $s$. The test statistic, which is defined as $h_i' \Sigma_i^{-1} h_i$, has an asymptotic $\chi^2(S_i - 1)$ distribution. In computing the test statistic, $h_i$ is a vector of differences between the observed proportion correct and its posterior expectation for a raw score group computed based on MMLE, and $\Sigma_i$ is the estimated covariance matrix of $h_i$. Even though this test statistic appears to have a strong theoretical basis, Glas and Suarez-Falcon (2003, p. 97) found that the overall characteristics of their test statistic are worse than those of $S\text{-}X^2$ and $S\text{-}G^2$. Researchers (e.g., Sinharay, 2005) point out that assessing item fit using number-correct scores to form examinee groups is not entirely satisfactory and that there is substantial scope for further research in this area.

Recently, Sinharay (2005) suggested, from a Bayesian perspective, using the $\chi^2$-type and $G^2$-type test statistics of Orlando and Thissen (2000) as summary measures of discrepancy, with posterior predictive distributions as the reference distributions. The resulting Bayesian p-values provide probability statements about the fit of the data to the model on the items. This method also has a strong theoretical basis. However, posterior predictive model checking depends heavily on resampling via the MCMC algorithm and hence is computationally intensive.

1.4 Item Fit Analysis Based on Pseudocounts

The second approach, a fit measure called $Q_{DM}$, is proposed by Donoghue and McClellan (e.g., 2004, 2003b, 2003a, 2001b, 2001a, 1999).
In this approach, the asymptotic distribution of an alternative IRT measure of item fit, referred to as $Q_{DM}$, is derived and well justified as an asymptotically quadratic form of normal variables. $Q_{DM}$ is based on pseudocounts, as opposed to counts of the number of examinees whose proficiency estimates fall within a proficiency interval. It is a natural by-product of the MML-EM estimation (Bock and Lieberman, 1970; Bock and Aitkin, 1981) used by most IRT calibration programs. This measure has generated much study (e.g., Stone, 2000; Stone, Ankerman, Lane, and Liu, 1993; Stone and Hansen, 2000; Stone, Mislevy and Mazzeo, 1994; Stone and Zhang, 2002; Donoghue and Isham, 1998; Donoghue and Hombo, 1999, 2001ab, 2003ab; Hombo and Donoghue, 1999, 2000, 2001; Hombo, Donoghue and Oranje, 2003). Simulation studies (Hombo and Donoghue, 1999, 2000) have found that the asymptotic distribution functioned extremely well, even with samples as small as 1000 examinees. Both Q-Q plots and Type I error rates indicated very good agreement between the asymptotic distribution and the observed values. Moreover, the measure has good power to detect misfit when it is present in items (Hombo and Donoghue, 2001).

The difference between the second approach and the first one is that $Q_{DM}$ is based on the distribution of ability at each quadrature point. The term "pseudocount" used by Donoghue, McClellan and Oranje (e.g., 2004) refers to the fact that real counts of the number of examinee proficiency estimates falling within an interval on the scale are not used. Rather, counts are estimated from the sum of posterior distributions. Pseudocounts are the basic building blocks for the item fit measure $Q_{DM}$. Pseudocounts of examinees at a given quadrature point are computed by summing the posterior expectations of an $M$-category item for score level $k$ and proficiency level $\theta_q$. Then $Q_{DM}$ is defined as

$$Q_{DM} = \sum_{q=1}^{Q} \sum_{k=0}^{M-1} \frac{(O_{qk} - E_{qk})^2}{E_{qk}}. \qquad (1.1)$$

Here $O$ represents the observed response counts and $E$ represents the expected response counts. Assuming that item parameters are known, $Q_{DM}$ has been shown to be asymptotically distributed as a quadratic form of normal variables (Donoghue and Hombo, 1999). This distribution can be represented as a weighted sum of independent $\chi^2_{(1)}$ variates (e.g., Johnson and Kotz, 1970):

$$Q_{DM} \sim \sum_{i=1}^{m} \lambda_i \chi^2_{(1)},$$

where $\lambda_i$, $i = 1, 2, \ldots, m$, are the non-zero eigenvalues of the matrix $L' \Sigma L$, $L$ is a special form of matrix with dimension $2Q \times Q$ for dichotomous items ($Q$ is the number of quadrature points used in the computation), and $\Sigma$ is the covariance matrix of the pseudocounts (Donoghue, McClellan, and Oranje, 2004). A routine by Davies (1980) can be used to evaluate this probability.

However, further work is needed to establish the utility of the result in practical testing situations. Hombo and Donoghue (1999, 2000) examined some possible limiting factors, including potentially prohibitive sample size requirements for achieving sampling distribution properties that approach those of the asymptotic distribution. A major limitation to practical application of the findings is the computational burden of computing the asymptotic distribution of $Q_{DM}$. The computation requires the evaluation of all possible item response patterns: $2^J$ for a test of $J$ dichotomous items, for example. For tests of short to moderate length (10-15 items) the number of patterns (1024-32768) is manageable. For tests of 20 items, the evaluation of slightly over one million response patterns per item begins to become burdensome.

1.5 Approximation by Observed Covariance Among Pseudocounts

The work on the asymptotic distribution of the item fit measure $Q_{DM}$ represents a major advance along this line of research.
To avoid evaluating all possible response patterns when calculating the covariance matrix of pseudocounts, and thus to make applications to operational research possible, Donoghue, McClellan, and Oranje (2004) propose a consistent estimator $S$ of the covariance matrix $\Sigma$; the true asymptotic distribution is then approximated using the observed matrix of interrelations among pseudocounts. To understand and construct the matrix $S$, consider the joint probability consisting of positive values $p(U = u_i, \theta_q)$ and 0 for $p(U \neq u_i, \theta_q)$, for dichotomous item $i$, given response $u_i$, and any quadrature point $\theta_q$, $q = 1, 2, \ldots, Q$ and $i = 1, 2, \ldots, J + 1$. Then $S$ can be seen as a simple covariance matrix with every examinee contributing to all of the $2Q$ quadrature points. The matrix $S$ is a consistent estimator of $\Sigma$. Therefore, a natural idea is to use the data-based estimator $L'SL$ in place of $L' \Sigma L$. Because the distribution of $Q_{DM}$ is an asymptotic result, for very large $N$ (approaching infinity) $S$ is arbitrarily close to $\Sigma$ and intuitively should yield the correct estimate of the asymptotic distribution of $Q_{DM}$. Indeed, the use of the observed matrix of interrelations among pseudocounts yields the hoped-for accuracy and computational simplicity, and the approximation based on it opens up the possibility of an operationally feasible and theoretically defensible statistical test of item misfit.

Results from Li, Donoghue, and McClellan (2005) demonstrate how accurate the approximation is relative to the asymptotic distributions across three different sample sizes. The simulation results show that the approximation works extremely well in many situations. The cumulative probability, mean, and variance of the approximation are very close to the true values. These results can also be generalized to the case of polytomous items, as in Donoghue and Hombo (2001a), when item parameters are known constants.
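Once the eigenvalues $\lambda_i$ of $L' \Sigma L$ (or of its data-based counterpart $L'SL$) are available, the reference distribution is a weighted sum of independent $\chi^2_{(1)}$ variates, and a tail probability for an observed $Q_{DM}$ can be approximated. The sketch below uses plain Monte Carlo simulation as a stand-in; it is not the Davies (1980) routine mentioned in the text, and the function names, the number of draws, and the example eigenvalues are hypothetical.

```python
import random


def weighted_chi2_tail(eigenvalues, q_obs, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(sum_i lambda_i * Z_i**2 > q_obs), Z_i iid N(0, 1)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        total = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in eigenvalues)
        if total > q_obs:
            exceed += 1
    return exceed / n_draws


def weighted_chi2_moments(eigenvalues):
    """Exact mean and variance of sum_i lambda_i * chi2_(1)."""
    mean = sum(eigenvalues)
    variance = 2.0 * sum(lam ** 2 for lam in eigenvalues)
    return mean, variance


# Sanity check: a single eigenvalue of 1 reduces to an ordinary chi-square
# with 1 degree of freedom, whose upper 5% point is about 3.841.
p_single = weighted_chi2_tail([1.0], 3.841)

# Made-up eigenvalues, for illustration only.
mean, variance = weighted_chi2_moments([0.5, 0.3])
```

Comparing the mean and variance implied by the eigenvalues of $L'SL$ with those implied by $L' \Sigma L$ is one simple way to see how close the approximation is, in the spirit of the comparisons reported above.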
However, the asymptotic distribution of $Q_{DM}$ was derived under the assumption that the item parameters are fixed and known. When the item parameters are data-based estimates, the theoretical results of Donoghue and Hombo (1999) do not hold. Several studies (Donoghue and Isham, 1998; Hombo and Donoghue, 1999; Donoghue and Hombo, 2001ab; Stone and Zhang, 2002) have repeatedly found that, when item parameters are data-based estimates, Type I error rates from $Q_{DM}$ are much too conservative, and that the distribution of $Q_{DM}$ computed with estimated parameters is stochastically smaller than that of $Q_{DM}$ computed with known parameters. This study is an attempt to overcome the disadvantage of working with estimated item parameters by reformulating the measure of item fit based on pseudocounts.

1.6 Reformulating the Item Fit Measure $Q_{DM}$

The form of $Q_{DM}$ defined in (1.1) is a Pearson-type measure of goodness of fit. Donoghue and Hombo (e.g., 1999) suggest that the expectation of the pseudocounts can be found through a binomial approximation. That is, the expectation of the pseudocounts is the product of the total pseudocounts and the hypothetical item response function at the given quadrature points (please refer to the first section of Chapter 2). The asymptotic distribution of $Q_{DM}$ can be shown through a Taylor expansion of the fit statistic. As sample size increases, the asymptotic distribution of the second order Taylor expansion of $Q_{DM}$ converges to the true asymptotic distribution of $Q_{DM}$.

The idea of reformulating $Q_{DM}$ is simply to replace the expectation of pseudocounts by its theoretical expectation under the null hypothesis. The reformulated version of the statistic allows researchers to derive the true asymptotic reference distribution for $Q_{DM}$ and to extend the results to data-based item parameter estimates.

Chapter 2

Item Fit Analysis Based on Pseudocounts

The item fit measure $Q_{DM}$ by Donoghue and McClellan (e.g., 2004, 2003b, 2003a, 2001b, 2001a, 1999) is similar in form to a Pearson $\chi^2$.
However, as noted before, the distribution of $Q_{DM}$ is not $\chi^2$, but a quadratic function of normal variates. This chapter first introduces the basic concept of pseudocounts, on which the measure of item fit (i.e., $Q_{DM}$) is based. Next, the reformulation of $Q_{DM}$ is discussed with the help of the fundamental concept of pseudocounts. Then the asymptotic distribution of the reformulated measure of item fit is derived in a different way. Finally, the observed interrelations among pseudocounts are examined to obtain a consistent estimator of the true covariance matrix among pseudocounts.

2.1 Definitions and Notations

Let $\theta_q$ be the discrete proficiency at quadrature point $q$, and let $w(\theta_q) = w_q$ be the density of $\theta$, i.e., $P(\theta = \theta_q) = w_q$. The prior $w$ will often be chosen to approximate a continuous distribution, such as $N(\mu, \sigma^2)$. Denote by $U$ a random variable representing the response to the dichotomously scored studied item. In the study of item level model fit, test items are classified into two groups: the studied item (only one item) and the remaining items (containing $J$ items). Thus the total number of items in the test is $J + 1$. Let $f_{q1} = f(\theta_q)$ be the item response function for the studied item, i.e., $P(U = 1 \mid \theta = \theta_q)$. Let $N$ be the sample size or number of examinees, and let $t$ index the patterns of responses to the remaining $J$ items $Y$ on a test. For dichotomous items, $t = 1, \ldots, T = 2^J$. Let $n_{tk}$ be the number of examinees who got score pattern $(U = k, Y = y_t)$. Suppose $\hat{\pi}$ is the vector of observed proportions for the sample response patterns $(U = k, Y = y_t)$. Then $\hat{\pi}_{tk} = n_{tk}/N$, and $l_{tq} = P(Y = y_t \mid \theta = \theta_q)$, where $k$ represents the category of the studied item (e.g., for the dichotomous case, $k = 0, 1$), and $l_{tq}$ is the likelihood of the remaining item response pattern $(Y = y_t)$ at quadrature point $q$. Denote by $\pi_{tk}$ the model-based prediction of the probability of response pattern $(U = k, Y = y_t)$, that is, the marginal probability of $(U = k, Y = y_t)$.
For the dichotomous case (i.e., $k = 0, 1$), it is easy to see that

$$\pi_{t1} = P(U = 1, Y = y_t) = \sum_{q=1}^{Q} w_q f_{q1} l_{tq}.$$

Similarly, $\pi_{t0} = \sum_{q=1}^{Q} w_q (1 - f_{q1}) l_{tq}$. Let $p^k_{tq}$ be the posterior of $\theta$ at quadrature point $\theta = \theta_q$ given response pattern $(U = k, Y = y_t)$. Then, writing $f_{q0} = 1 - f_{q1}$,

$$p^k_{tq} = P(\theta = \theta_q \mid U = k, Y = y_t) = \frac{w_q f_{qk} l_{tq}}{\pi_{tk}}.$$

In the dichotomous case, the posterior for $\theta = \theta_q$ given the response pattern $(U = 1, Y = y_t)$ is $p^1_{tq} = \frac{w_q f_{q1} l_{tq}}{\pi_{t1}}$, and $p^0_{tq} = \frac{w_q (1 - f_{q1}) l_{tq}}{\pi_{t0}}$ given the response pattern $(U = 0, Y = y_t)$. The posterior distributions provide the best information about the distribution of examinees' proficiency levels. Thus, it is the posterior distribution of proficiency, rather than the proficiency point estimates, that is used here for assessing model-data fit at the item level.

Define the pseudocount $s_{qk}$ for response $U = k$ to the studied item at quadrature point $\theta_q$ as the sum, over examinees, of the posteriors $P(\theta = \theta_q \mid U = k, Y = y_t)$, $t = 1, 2, \ldots, T$. For example, the pseudocount for the correct response to the studied item at quadrature point $\theta_q$ is

$$s_{q1} = \sum_{t=1}^{T} n_{t1} p^1_{tq} = w_q f_{q1} \sum_{t=1}^{T} \frac{n_{t1} l_{tq}}{\pi_{t1}}.$$

Here $T$ is the number of all possible response patterns for the remaining items in the test. In a similar fashion, define $s_{q0}$ as $s_{q0} = w_q (1 - f_{q1}) \sum_{t=1}^{T} \frac{n_{t0} l_{tq}}{\pi_{t0}}$. Denote $s_q = s_{q1} + s_{q0}$; $s_q$ is the total pseudocount at quadrature point $\theta_q$, $q = 1, 2, \ldots, Q$. $Q$ is the total number of quadrature points (designated in the study; in this case 41, ranging from -4 to 4).

Now consider the following vectors in the dichotomous case:

$n^T = (n_{11}, n_{21}, \ldots, n_{T1}, n_{10}, n_{20}, \ldots, n_{T0})$,

$\hat{\pi}^T = (\hat{\pi}_{11}, \hat{\pi}_{21}, \ldots, \hat{\pi}_{T1}, \hat{\pi}_{10}, \hat{\pi}_{20}, \ldots, \hat{\pi}_{T0}) = n^T / N$,

$\pi^T = (\pi_{11}, \pi_{21}, \ldots, \pi_{T1}, \pi_{10}, \pi_{20}, \ldots, \pi_{T0})$, the model-based probabilities,

$s^T = (s_{11}, s_{21}, \ldots, s_{Q1}, s_{10}, s_{20}, \ldots, s_{Q0})$, the observed pseudocounts,

$\tilde{s}^T = (s_1, s_2, \ldots, s_Q)$.

The vector $n$ describes the frequencies of all possible patterns of the response data for the $J + 1$ items in a test.
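That is, $n$ contains the frequencies of the mutually exclusive response patterns from the sample data. If $N$ examinees are available, then $\sum_{t=1}^{T} (n_{t1} + n_{t0}) = N$. The model-based probability of the $t$th pattern of the remaining items together with a correct response on the studied item (i.e., $(U = 1, Y = y_t)$) is $\pi_{t1} = P(U = 1, Y = y_t)$, $t = 1, 2, \ldots, T$. Similarly, the probability of observing response $(U = 0, Y = y_t)$ is $\pi_{t0} = P(U = 0, Y = y_t)$, $t = 1, 2, \ldots, T$.

For the convenience of studying the statistical properties of pseudocounts, two posterior matrices $P$ and $\tilde P$ are constructed. $P$ is a matrix consisting of all posteriors and having dimension $2T$ by $2Q$. That is,
\[
P_{2T \times 2Q} =
\begin{pmatrix}
p^1_{11} & p^1_{12} & \cdots & p^1_{1Q} & 0 & 0 & \cdots & 0 \\
p^1_{21} & p^1_{22} & \cdots & p^1_{2Q} & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
p^1_{T1} & p^1_{T2} & \cdots & p^1_{TQ} & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & p^0_{11} & p^0_{12} & \cdots & p^0_{1Q} \\
0 & 0 & \cdots & 0 & p^0_{21} & p^0_{22} & \cdots & p^0_{2Q} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0 & p^0_{T1} & p^0_{T2} & \cdots & p^0_{TQ}
\end{pmatrix}.
\]
The matrix $\tilde P$ is a $2T \times Q$ matrix defined as
\[
\tilde P =
\begin{pmatrix}
p^1_{11} & p^1_{12} & \cdots & p^1_{1Q} \\
p^1_{21} & p^1_{22} & \cdots & p^1_{2Q} \\
\vdots & \vdots & & \vdots \\
p^1_{T1} & p^1_{T2} & \cdots & p^1_{TQ} \\
p^0_{11} & p^0_{12} & \cdots & p^0_{1Q} \\
p^0_{21} & p^0_{22} & \cdots & p^0_{2Q} \\
\vdots & \vdots & & \vdots \\
p^0_{T1} & p^0_{T2} & \cdots & p^0_{TQ}
\end{pmatrix}.
\]
With the matrix notation, the pseudocount vectors can be expressed as $s = P'n$ and $\tilde s = \tilde P'n$. The matrix $P$ can be written in column form as $P = (P^1_1, P^1_2, \ldots, P^1_Q, P^0_1, P^0_2, \ldots, P^0_Q)$, where $P^j_q$, $q = 1, 2, \ldots, Q$ and $j = 1, 0$, denotes the column of $P$ corresponding to the posteriors at quadrature point $\theta_q$ with response $U = j$ to the studied item. Then $s_{qj} = P^{j\prime}_q n$. Similarly, write the matrix $\tilde P$ as $\tilde P = (\tilde P_1, \tilde P_2, \ldots, \tilde P_Q)$, where $\tilde P_q$ represents the $q$th column of $\tilde P$, $q = 1, 2, \ldots, Q$. Then $s_q = \tilde P_q' n$. The pseudocount vector $s$ or $\tilde s$ can be considered a random vector, since it is a linear function of the frequency vector $n$, which follows a multinomial distribution with probability vector $\pi$, denoted $n \sim M_{2T}(N, \pi)$ with $\sum_{t=1}^{T} (\pi_{t1} + \pi_{t0}) = 1$.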
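The definition of the pseudocounts as frequency-weighted sums of posteriors can be illustrated with a small numerical sketch. The function name `pseudocounts`, the three-point quadrature grid, and all probabilities and counts below are hypothetical values chosen for illustration only; they are not taken from the study.

```python
from itertools import product

def pseudocounts(weights, f1, item_probs, counts):
    """Return (s1, s0): pseudocounts for U = 1 and U = 0 at each quadrature point.

    weights[q]       : prior weight w_q at quadrature point q
    f1[q]            : P(U = 1 | theta_q) for the studied item
    item_probs[j][q] : P(Y_j = 1 | theta_q) for remaining item j
    counts[(k, y)]   : number of examinees with U = k and remaining pattern y
    """
    Q = len(weights)
    s = {1: [0.0] * Q, 0: [0.0] * Q}
    for (k, y), n_tk in counts.items():
        # likelihood l_tq of the remaining-item pattern at each quadrature point
        lik = []
        for q in range(Q):
            l = 1.0
            for j, y_j in enumerate(y):
                p = item_probs[j][q]
                l *= p if y_j == 1 else 1.0 - p
            lik.append(l)
        # joint term w_q * f_qk * l_tq; its sum over q is the marginal pi_tk
        joint = [weights[q] * (f1[q] if k == 1 else 1.0 - f1[q]) * lik[q]
                 for q in range(Q)]
        pi_tk = sum(joint)
        # add n_tk times the posterior p^k_tq to the pseudocount s_qk
        for q in range(Q):
            s[k][q] += n_tk * joint[q] / pi_tk
    return s[1], s[0]

# Hypothetical toy set-up: Q = 3 quadrature points, J = 2 remaining items,
# 5 examinees observed in each of the 2 * 2^J = 8 response patterns.
weights = [0.25, 0.5, 0.25]
f1 = [0.2, 0.5, 0.8]
item_probs = [[0.3, 0.5, 0.7], [0.1, 0.4, 0.9]]
counts = {(k, y): 5 for k in (0, 1) for y in product((0, 1), repeat=2)}
s1, s0 = pseudocounts(weights, f1, item_probs, counts)
```

Because each posterior sums to one over the quadrature points, the pseudocounts for a given response category add up to the number of examinees giving that response, which is a convenient sanity check on any implementation.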
To establish the results regarding the asymptotic distribution of the pseudocounts vector $s$, the following two vectors are useful:
\[
v^T = \left( \frac{n_{11} - N\pi_{11}}{\sqrt{N\pi_{11}}}, \frac{n_{21} - N\pi_{21}}{\sqrt{N\pi_{21}}}, \ldots, \frac{n_{T1} - N\pi_{T1}}{\sqrt{N\pi_{T1}}}, \frac{n_{10} - N\pi_{10}}{\sqrt{N\pi_{10}}}, \frac{n_{20} - N\pi_{20}}{\sqrt{N\pi_{20}}}, \ldots, \frac{n_{T0} - N\pi_{T0}}{\sqrt{N\pi_{T0}}} \right),
\]
\[
\varphi^T = \left( \sqrt{\pi_{11}}, \sqrt{\pi_{21}}, \ldots, \sqrt{\pi_{T1}}, \sqrt{\pi_{10}}, \sqrt{\pi_{20}}, \ldots, \sqrt{\pi_{T0}} \right).
\]
The object is to study the properties of the pseudocounts, which are linear combinations of the observed frequency vector $n$ of response patterns.

2.2 Asymptotic Distributions of Pseudocounts

Before showing the theorems regarding the pseudocounts, define a matrix $B$ with fixed elements, $B = D_\pi^{1/2} P$, where $P$ is the matrix of posteriors defined as before, and $D_\pi^{1/2}$ is a diagonal matrix with the square roots of the model-based prediction vector $\pi$ as its diagonal entries:
\[
D_\pi^{1/2} = \mathrm{diag}\left( \sqrt{\pi_{11}}, \sqrt{\pi_{21}}, \ldots, \sqrt{\pi_{T1}}, \sqrt{\pi_{10}}, \sqrt{\pi_{20}}, \ldots, \sqrt{\pi_{T0}} \right).
\]
If the item parameters are all known constants, so is each component of the posterior matrix $P$ and each element of the diagonal matrix $D_\pi^{1/2}$. Simply put, the product matrix $B$ has entries of fixed values. Denote each column of $B$ as $b^1_q$ or $b^0_q$, $q = 1, 2, \ldots, Q$. Then $B$ can be expressed as $B = (b^1_1, b^1_2, \ldots, b^1_Q, b^0_1, b^0_2, \ldots, b^0_Q)$. Each $b^1_q$ or $b^0_q$ is a fixed vector of dimension $2T$, with $b^1_q = D_\pi^{1/2} P^1_q$ and $b^0_q = D_\pi^{1/2} P^0_q$.

Theorem 2.2.1 (Marginal Distribution of Pseudocounts) The asymptotic distribution of $\sqrt{N}\left( s_{qj}/N - P^{j\prime}_q \pi \right)$ for each element $s_{qj}$ of the pseudocount vector $s$ defined as above is normal with mean 0 and variance $P^{j\prime}_q (D_\pi - \pi\pi') P^j_q$, $q = 1, 2, \ldots, Q$ and $j = 0, 1$.

Proof: Let the vectors $v$, $b^1_q$, and $b^0_q$, $q = 1, 2, \ldots, Q$, be defined as above. Then the asymptotic distribution of the linear function $b^{j\prime}_q v$ is normal with mean 0 and variance $b^{j\prime}_q b^j_q - (b^{j\prime}_q \varphi)^2 = b^{j\prime}_q (I - \varphi\varphi') b^j_q$.

Similarly, the asymptotic distribution of $\sqrt{N}\left( \tilde s/N - \tilde P'\pi \right)$ for the total pseudocounts vector $\tilde s$ is $N_Q\left(0, \tilde P'(D_\pi - \pi\pi')\tilde P\right)$ as $N \to \infty$.

Now one can see why pseudocounts contain essential information for assessing the degree of item fit.
They are the sum of posterior distributions across all possible response patterns and over all examinees. The posterior probability of proficiency, instead of the count of grouped proficiency estimates themselves, provides the best information for evaluating the degree of model-data fit. The proportions of pseudocounts $s$ over the total number of examinees $N$ give empirical values that can be compared to the values predicted by the IRT model. A measure of the correspondence between the empirical and predicted values represents the degree of adequacy of model-data fit at the item level. However, it is often difficult or impossible to judge from plots whether the differences between the empirical values based on pseudocounts and the model-based predicted values are meaningful. A statistical significance test is very desirable. The following section reformulates $Q^*_{DM}$ and finds its reference distribution based on pseudocounts.

2.3 The Asymptotic Distribution of the Item Fit Measure $Q^*_{DM}$

The statistic $Q^*_{DM}$ suggested by Donoghue and McClellan (e.g., 2003) is defined by approximating the expectation of the pseudocounts binomially:
\[
Q^*_{DM} = \sum_{q=1}^{Q} \left( \frac{(s_{q1} - Es_{q1})^2}{Es_{q1}} + \frac{(s_{q0} - Es_{q0})^2}{Es_{q0}} \right)
= \sum_{q=1}^{Q} \left( \frac{(s_{q1} - f_{q1}s_q)^2}{f_{q1}s_q} + \frac{(s_{q0} - f_{q0}s_q)^2}{f_{q0}s_q} \right)
= \sum_{q=1}^{Q} \frac{(s_{q1} - f_{q1}s_q)^2}{f_{q1}(1 - f_{q1})s_q}.
\]
Donoghue and Hombo (2003b) expand the above expression of $Q^*_{DM}$ about $\hat\pi = \pi$ as a Taylor series to show that the measure is asymptotically a quadratic form of normal variables:
\[
Q^*_{DM}(\hat\pi) = \sqrt{N}(\hat\pi - \pi)'\, C\, (\hat\pi - \pi)\sqrt{N} + O(N^{-1/2}).
\]
The matrix $C$ is the same as that in Donoghue, McClellan, and Oranje (2004, p. 10). That is,
\[
C = \sum_{q=1}^{Q} w_q \frac{(v^*_{q1} - f_{q1}v^*_{q2})(v^*_{q1} - f_{q1}v^*_{q2})'}{f_{q1}(1 - f_{q1})},
\]
where
\[
v^*_{q1} = \left( \frac{f_{q1}l_{1q}}{\pi_{11}}, \frac{f_{q1}l_{2q}}{\pi_{21}}, \ldots, \frac{f_{q1}l_{Tq}}{\pi_{T1}}, 0, \ldots, 0 \right)'
\]
and
\[
v^*_{q2} = \left( \frac{f_{q1}l_{1q}}{\pi_{11}}, \frac{f_{q1}l_{2q}}{\pi_{21}}, \ldots, \frac{f_{q1}l_{Tq}}{\pi_{T1}}, \frac{(1 - f_{q1})l_{1q}}{\pi_{10}}, \frac{(1 - f_{q1})l_{2q}}{\pi_{20}}, \ldots, \frac{(1 - f_{q1})l_{Tq}}{\pi_{T0}} \right)'.
\]

2.3.1 Reformulated $Q^*_{DM}$ and Its Asymptotic Distribution

In this study, $Q^*_{DM}$ is also defined as a Pearson $\chi^2$-like statistic.
That is,
\[
Q^*_{DM} = \sum_{q=1}^{Q} \left( \frac{(s_{q1} - Es_{q1})^2}{Es_{q1}} + \frac{(s_{q0} - Es_{q0})^2}{Es_{q0}} \right).
\]
As previously defined, $s_{q1}$ or $s_{q0}$ is the pseudocount at quadrature point $\theta_q$, $q = 1, 2, \ldots, Q$, and $Es_{q1}$ or $Es_{q0}$ denotes the corresponding expectation. First simplify the expression for the expectation of $s_{q1}$ and $s_{q0}$. Notice that the expectation of the pseudocount, $Es_{qj} = N P^{j\prime}_q \pi$ for $j = 0, 1$, can be expressed as
\[
Es_{qj} = E\left( w_q f_{qj} \sum_{t=1}^{T} \frac{n_{tj} l_{tq}}{\pi_{tj}} \right) = N w_q f_{qj} \sum_{t=1}^{T} l_{tq} = N w_q f_{qj}.
\]
That is, $Es_{q1} = N w_q f_{q1}$ and $Es_{q0} = N w_q f_{q0} = N w_q (1 - f_{q1})$. Therefore, the expectation of the pseudocounts vector $s$ is
\[
Es = N (w_1 f_{11}, w_2 f_{21}, \ldots, w_Q f_{Q1}, w_1 f_{10}, w_2 f_{20}, \ldots, w_Q f_{Q0})^T.
\]
The expression for $Es$ is the same as that derived from the theorem on the joint distribution of the pseudocounts vector.

Now turn to the asymptotic distribution of the reformulated $Q^*_{DM}$. Let $D_{Es}$ be a diagonal matrix with the expectations of the pseudocounts as its diagonal elements. Obviously, $D_{Es}$ is a $2Q$ by $2Q$ matrix, and $D_{Es}^{-1} = D_{Es}^{-1/2} D_{Es}^{-1/2}$, where $D_{Es}^{-1/2}$ can be expressed as
\[
D_{Es}^{-1/2} = \mathrm{diag}\left( \frac{1}{\sqrt{N w_1 f_{11}}}, \ldots, \frac{1}{\sqrt{N w_Q f_{Q1}}}, \frac{1}{\sqrt{N w_1 f_{10}}}, \ldots, \frac{1}{\sqrt{N w_Q f_{Q0}}} \right).
\]
With the matrix $D_{Es}^{-1/2}$, $Q^*_{DM}$ can be further simplified:
\[
\begin{aligned}
Q^*_{DM} &= \sum_{q=1}^{Q} \left( \frac{(s_{q1} - Es_{q1})^2}{Es_{q1}} + \frac{(s_{q0} - Es_{q0})^2}{Es_{q0}} \right) \\
&= \left\| D_{Es}^{-1/2}(s - Es) \right\|^2 \\
&= (s - Es)' D_{Es}^{-1} (s - Es) \\
&= (P'n - NP'\pi)' D_{Es}^{-1} (P'n - NP'\pi) \\
&= (n - N\pi)' P D_{Es}^{-1} P' (n - N\pi) \\
&= \sqrt{N}(\hat\pi - \pi)'\, N P D_{Es}^{-1} P'\, \sqrt{N}(\hat\pi - \pi).
\end{aligned}
\]
As is known, $\sqrt{N}(\hat\pi - \pi)$ is asymptotically distributed as a multivariate normal vector with mean vector 0 and covariance matrix $G = D_\pi - \pi\pi'$ (e.g., Bishop, Fienberg, and Holland, 1975, p. 470). Thus, $Q^*_{DM}$ is asymptotically a quadratic function of normal variables. Obviously, the matrix $N P D_{Es}^{-1} P'$ is nonnegative definite, since all of the diagonal components of the matrix $D_{Es}$ are nonnegative.
Following Stapleton's (1995, p. 65) expression for a quadratic form, denote $y = \sqrt{N}(\hat\pi - \pi) \sim N_{2T}(0, G)$ and the nonnegative definite matrix $A = N P D_{Es}^{-1} P'$; let $G^{1/2}$ be the unique symmetric square root of $G$, and let $G^{-1/2}$ be its inverse. Thus
\[
Q^*_{DM} = y'Ay = (y'G^{-1/2})(G^{1/2} A G^{1/2})(G^{-1/2}y) = z'Cz,
\]
where $z = G^{-1/2}y$ and $C = G^{1/2} A G^{1/2}$. Then $\mathrm{Var}(z) = G^{-1/2} G G^{-1/2}$ is the $2T \times 2T$ identity matrix, so that $z \sim N_{2T}(0, I)$. Let $C = T \Lambda T'$ be the spectral decomposition of $C$. Then $\Lambda$ is the diagonal matrix of eigenvalues of $C$, and $T$ is the $2T \times 2T$ orthogonal matrix whose columns are the corresponding eigenvectors of $C$. Hence,
\[
Q^*_{DM} = z' T \Lambda T' z = (T'z)' \Lambda (T'z) = \omega' \Lambda \omega
\]
for $\omega = T'z$, with $\mathrm{Var}(\omega) = T'IT = I_{2T}$ and $\omega \sim N_{2T}(0, I)$. Denoting the eigenvalues of $C$ by $\lambda_1, \lambda_2, \ldots, \lambda_{2T}$, $Q^*_{DM} = \sum_{i=1}^{2T} \lambda_i \omega_i^2$, where $\omega' = (\omega_1, \omega_2, \ldots, \omega_{2T})$. Therefore, $Q^*_{DM}$ is a linear combination, with coefficients $\lambda_1, \lambda_2, \ldots, \lambda_{2T}$, of independent $\chi^2_1$ random variables. The coefficients $\lambda_1, \lambda_2, \ldots, \lambda_{2T}$ are the eigenvalues of $N G^{1/2} P D_{Es}^{-1} P' G^{1/2}$, and also the eigenvalues of $N P D_{Es}^{-1} P' G$ or of $N G P D_{Es}^{-1} P'$. By Theorem 2.2.1 of Stapleton (1995, p. 51), the expectation of $Q^*_{DM}$ is
\[
E(Q^*_{DM}) = \mathrm{trace}(AG) = \mathrm{trace}(N P D_{Es}^{-1} P' G).
\]
The asymptotic distribution of $Q^*_{DM}$ can be simplified further as the reduced sum of independent $\chi^2_{(1)}$ variates (e.g., Johnson and Kotz, 1970). That is, $Q^*_{DM} \sim \sum_{i=1}^{m} \lambda_i \chi^2_{(1)}$, where the $\lambda_i$ are the $m$ non-zero eigenvalues of the $2T \times 2T$ matrix $N P D_{Es}^{-1} P' G$. The non-zero eigenvalues of the matrix $N P D_{Es}^{-1} P' G$ are the same as the non-zero eigenvalues of the matrix $N L'GL = N D_{Es}^{-1/2} P' (D_\pi - \pi\pi') P D_{Es}^{-1/2}$ for $L = P D_{Es}^{-1/2}$. This can be seen by letting $u$ be a non-zero vector and $\lambda$ a scalar satisfying the defining equation $N P D_{Es}^{-1} P' G u = \lambda u$, i.e., $N L L' G u = \lambda u$. This implies that $N L' G u = \lambda L^{-} u$, where $L^{-}$ represents the generalized inverse of the matrix $L$, and $N L' G L (L^{-}u) = \lambda (L^{-}u)$. Hence the result. A routine by Davies (1980) can be used to evaluate this probability distribution. This result about the asymptotic distribution of $Q^*_{DM}$ is stated in the following theorem.
Theorem 2.3.1 (Asymptotic Distribution of $Q^*_{DM}$) The Pearson $\chi^2$-like measure of item fit $Q^*_{DM}$ defined as above is a quadratic function of normal random variables with mean vector 0 and covariance matrix $D_\pi - \pi\pi'$.

Take a close look at the covariance matrix $N D_{Es}^{-1/2} P' (D_\pi - \pi\pi') P D_{Es}^{-1/2}$. Denote this matrix product as $A$ (i.e., $A = N D_{Es}^{-1/2} P' (D_\pi - \pi\pi') P D_{Es}^{-1/2}$). Let the set of distinct eigenvalues of $A$ (the spectrum of $A$) be denoted $\sigma(A)$. The maximum magnitude of the eigenvalues, denoted $\rho(A) = \max |\lambda|$, $\lambda \in \sigma(A)$, satisfies $\rho(A) \le \|A\|$ for every matrix norm (Meyer, 2000, p. 497), i.e., $|\lambda| \le \|A\|$ for all $\lambda \in \sigma(A)$. Since all the components of the matrices $D_{Es}^{-1/2} P$ and $D_\pi - \pi\pi'$ concern probabilities, the maximum absolute value of the components of the product $A$ is less than or equal to 1. Thus the maximum eigenvalue of the matrix $A$ is equal to 1.

2.3.2 Asymptotic Distribution of $\tilde Q$

The statistical distance between the observed pseudocounts and their expectations (the external and fixed values) also represents the degree of model-data fit. Let the distance be defined by $\tilde Q = (\tilde s - E\tilde s)\tilde\Sigma^{-1}(\tilde s - E\tilde s)'$. It is seen from theorem 2.3.1 and theorem 2.3.2 that the asymptotic distribution of the sequence $\sqrt{N}(\tilde s/N - \tilde P'\pi)$ for $\tilde s$ is $N_Q(0, \tilde P'(D_\pi - \pi\pi')\tilde P)$ with a nonsingular covariance matrix. As is easy to see, the expectation and covariance of $\tilde s$ are $E\tilde s = N\tilde P'\pi$ and $\mathrm{Var}(\tilde s) = N\tilde P'(D_\pi - \pi\pi')\tilde P$. Since $(\tilde s - E\tilde s)/\sqrt{N} = \sqrt{N}(\tilde s/N - \tilde P'\pi)$, the following states the result for the asymptotic distribution of $\tilde Q$ (e.g., Johnson and Wichern, 2002, p. 163).

Theorem 2.3.2 (Asymptotic Distribution of $\tilde Q$) The asymptotic distribution of the item fit measure defined as $\tilde Q = (\tilde s - E\tilde s)\tilde\Sigma^{-1}(\tilde s - E\tilde s)'$ is $\chi^2$ with $Q$ degrees of freedom, where the covariance matrix $\tilde\Sigma$ is $N\tilde P'(D_\pi - \pi\pi')\tilde P$.

Let $\Sigma$ denote the $2Q \times 2Q$ covariance matrix of the pseudocounts vector $s$, i.e., $\Sigma = NP'(D_\pi - \pi\pi')P$. Then the covariance matrix of the total pseudocounts vector $\tilde s$, $\tilde\Sigma$, is $N\tilde P'(D_\pi - \pi\pi')\tilde P$. The following section introduces a consistent estimator of $\Sigma$ and $\tilde\Sigma$.
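The reference distribution in Section 2.3.1, a linear combination of independent one-degree-of-freedom chi-square variates with eigenvalue coefficients, can be explored numerically. The sketch below is not the Davies (1980) routine cited in the text; it substitutes a plain Monte Carlo approximation of the upper-tail probability, and the function names and the example coefficients are hypothetical.

```python
import random

def weighted_chisq_tail(weights, observed, draws=20000, seed=0):
    """Monte Carlo estimate of P(sum_i lambda_i * Z_i^2 >= observed), Z_i iid N(0, 1).

    weights  : the coefficients lambda_i (non-zero eigenvalues)
    observed : the observed value of the fit statistic
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        total = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in weights)
        if total >= observed:
            hits += 1
    return hits / draws

def weighted_chisq_mean(weights):
    """Mean of sum_i lambda_i * chi2_1, which equals the sum (trace) of the lambda_i."""
    return sum(weights)

# Hypothetical coefficients; with a single coefficient of 1 the mixture
# reduces to an ordinary chi-square with one degree of freedom.
p_value = weighted_chisq_tail([0.9, 0.6, 0.2], observed=4.0)
```

A Monte Carlo estimate carries simulation error of order $1/\sqrt{\text{draws}}$, so an exact numerical inversion such as the one cited in the text is preferable when small tail probabilities matter.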
2.4 The Observed Covariance Matrix of Interrelations among Pseudocounts

Although the covariance matrix of pseudocounts $\Sigma = P'(D_\pi - \pi\pi')P$ has dimension $2Q \times 2Q$ and the dimension of $\tilde\Sigma = \tilde P'(D_\pi - \pi\pi')\tilde P$ is $Q \times Q$, the estimation of $\Sigma$ and $\tilde\Sigma$ involves evaluating the $2T \times 2Q$ matrix $P$, the $2T \times Q$ matrix $\tilde P$, and the $2T \times 2T$ matrix $D_\pi - \pi\pi'$. Note that $T$ is the number of all possible response patterns for the remaining $J$ items. In the dichotomous case, $T = 2^J$. For a long test, the numerical computation of $\Sigma$ seems impractical for most operational work. To reduce the computational complexity, $\Sigma$ is estimated from the observed covariance matrix $S$ of interrelations among pseudocounts $s$, and $\tilde\Sigma$ is estimated from the observed covariance matrix $\tilde S$ of interrelations among pseudocounts $\tilde s$. It is known that $n \sim M_{2T}(N, \pi)$, a multinomial distribution with $2T - 1$ parameters and covariance matrix $N(D_\pi - \pi\pi')$. $\hat\pi_{tj}$ is a uniformly minimum variance unbiased (UMVU) estimator of $\pi_{tj}$, $t = 1, 2, \ldots, T$ and $j = 0, 1$ (e.g., Lehmann and Casella, 1998, p. 106; Bickel and Doksum, 2001, p. 187). It is natural to think of the matrix $D_{\hat\pi} - \hat\pi\hat\pi'$ as an estimate of the matrix $D_\pi - \pi\pi'$.

Let the vector $x_i$ indicate the posterior contribution of the $i$th examinee on the studied item across the array of $Q$ quadrature points given the response pattern $(U, Y = y_t)$, $i = 1, 2, \ldots, N$. Then $x_i$ is a $2Q$-dimensional vector, $x_i = (X^1_{i1}, X^1_{i2}, \ldots, X^1_{iQ}, X^0_{i1}, X^0_{i2}, \ldots, X^0_{iQ})$. Each component of the vector $x_i$ is $X^j_{iq}$, $q = 1, 2, \ldots, Q$, $i = 1, 2, \ldots, N$, and $j = 1, 0$. Or $X^1_{iq} = U p^1_{tq}$ [...] $\mathrm{Cov}(s_{qj}, s_{q'j'})$. It is not hard to find that $\mathrm{Cov}(\tilde s_q, \tilde s_{q'}) = N\tilde P_q'(D_\pi - \pi\pi')\tilde P_{q'}$, $q, q' = 1, 2, \ldots, Q$.
Form the $N \times Q$ posterior matrix $\tilde X$, with each row vector $\tilde x_i$, $i = 1, 2, \ldots, N$, holding the $Q$ posteriors of the $i$th examinee. Then $\tilde x_i = (X_{i1}, X_{i2}, \ldots, X_{iQ})$, $i = 1, 2, \ldots, N$, where
\[
\tilde x_i = \left( (p^1_{t1})^U (p^0_{t1})^{1-U}, (p^1_{t2})^U (p^0_{t2})^{1-U}, \ldots, (p^1_{tQ})^U (p^0_{tQ})^{1-U} \right).
\]
In the same way, the vector $\tilde x_i$ is one realization of the vector $\tilde x = (X_1, X_2, \ldots, X_Q)$. Let the covariance of the vector $\tilde x$ be denoted $\tilde S = \mathrm{Cov}(\tilde x)$. It can be seen that $\mathrm{Cov}(X_q, X_{q'}) = \tilde P_q'(D_\pi - \pi\pi')\tilde P_{q'}$.

In summary, the observed covariance matrix $S$ of the interrelations among pseudocounts is a consistent estimator of the average covariance of the pseudocounts vector $s$. $S$ can be arbitrarily close to $\frac{1}{N}\Sigma$ when the sample size $N$ is large enough. Similarly, the observed matrix $\tilde S$ is a consistent estimator of $\frac{1}{N}\tilde\Sigma$. A noticeable computational simplification is obtained by using $S$, which is computed from a constructed $N$ by $2Q$ matrix of posteriors. This simplification of the computation of the covariance matrix among pseudocounts makes hypothesis testing of goodness of fit at the item level feasible using the measure of item fit $Q^*_{DM}$.

2.5 Estimation of the Asymptotic Distribution for $Q^*_{DM}$

The true asymptotic distribution of $Q^*_{DM}$ is a function of the covariance matrix $\Sigma$ of pseudocounts. The relation between the asymptotic distribution and the covariance $\Sigma$ runs through the non-zero eigenvalues $\lambda$ obtained from the matrix $\Sigma$. The asymptotic distribution of $Q^*_{DM}$ can be written as $\sum_{i=1}^{m} \lambda_i \chi^2_{(1)}$, and the non-zero coefficients $\lambda_i$ come from the matrix $\Sigma$. Denote the non-zero eigenvalues from the observed covariance $S$ as $\hat\lambda_i$. Then the difference between the true asymptotic distribution and the estimated asymptotic distribution is $\sum_{i=1}^{m} (\hat\lambda_i - \lambda_i)\chi^2_{(1)}$. It is easy to see that the estimated distribution is arbitrarily close to the true asymptotic distribution as long as the $\hat\lambda_i$ are arbitrarily close to the true $\lambda_i$. As $N \to \infty$, owing to the consistency of the estimated covariance for $\Sigma$, $\hat\lambda_i$ is arbitrarily close to $\lambda_i$, $i = 1, 2, \ldots, m$.
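The computational shortcut described above, estimating the covariance from an $N$ by $2Q$ matrix of per-examinee posteriors and then extracting the non-zero eigenvalues, can be sketched as follows. This is a minimal illustration assuming NumPy is available. The function names are hypothetical, and the rescaling by the expected proportions $w_q f_{qj}$ is one reading of the form $N D_{Es}^{-1/2} P'(D_\pi - \pi\pi')P D_{Es}^{-1/2}$ given in Section 2.3.1, not a confirmed detail of the author's program.

```python
import numpy as np

def observed_covariance(X):
    """Sample covariance (divisor N) of the rows of the N x 2Q posterior matrix X."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    return centered.T @ centered / X.shape[0]

def scaled_eigenvalues(S, expected_props, tol=1e-10):
    """Non-zero eigenvalues of D^{-1/2} S D^{-1/2}, in descending order.

    expected_props holds the expected pseudocount proportions w_q * f_qj
    (an assumption about the scaling; see the lead-in above).
    """
    d = 1.0 / np.sqrt(np.asarray(expected_props, dtype=float))
    M = S * np.outer(d, d)          # elementwise form of D^{-1/2} S D^{-1/2}
    vals = np.linalg.eigvalsh(M)    # symmetric matrix, real eigenvalues
    return np.sort(vals[vals > tol])[::-1]

# Hypothetical two-column example with two examinees.
S = observed_covariance([[1.0, 0.0], [0.0, 1.0]])
lam = scaled_eigenvalues(S, [0.5, 0.5])
```

The resulting eigenvalues would serve as the estimated coefficients $\hat\lambda_i$ of the chi-square mixture; the tolerance `tol` used to discard numerically zero eigenvalues is likewise an illustrative choice.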
For the estimate of the asymptotic distribution of $\tilde Q$, replace the covariance matrix $\tilde\Sigma$ in the middle of $(\tilde s - E\tilde s)\tilde\Sigma^{-1}(\tilde s - E\tilde s)'$ with its consistent estimator. Then the asymptotic distribution of the estimate is arbitrarily close to its true asymptotic distribution. Therefore, the asymptotic distributions of the fit measures $Q^*_{DM}$ and $\tilde Q$ with the true covariance matrix among pseudocounts are the same as the asymptotic distributions of $Q^*_{DM}$ and $\tilde Q$, respectively, with the observed covariance matrix of interrelations among pseudocounts used as the consistent estimator of the true covariance matrix.

Assuming that item parameters are known constants is not realistic in many applications. This section investigates the relation between the item parameter estimates and the asymptotic distribution of the reformulated item fit measure $Q^*_{DM}$ for data-based item parameter estimates. Since the item response functions $f_{qj}$ are continuous functions of the item parameters given each quadrature point $\theta_q$, $q = 1, 2, \ldots, Q$ and $j = 0, 1$, $f_{qj}(\hat a, \hat b, \hat c, \hat\theta_q) \to f_{qj}(a, b, c, \theta_q)$ in probability as $N \to \infty$ (in short, $\hat f_{qj} \to f_{qj}$) if both item and ability parameter estimates are consistent (e.g., Rao, 1976, p. 124; Lehmann and Casella, 1998, p. 74, 8.4). It is also not hard to demonstrate that $\hat l_{tq} \to l_{tq}$ in probability, $\hat\pi_{tj} \to \pi_{tj}$ in probability, $\hat s_{qj} \to s_{qj}$ in probability, and $\hat E s_{qj} \to E s_{qj}$ in probability, for $q = 1, 2, \ldots, Q$, $j = 0, 1$, and $t = 1, 2, \ldots, T$. In the same way, the estimates of $Q^*_{DM}$ and $\tilde Q$ tend to the true $Q^*_{DM}$ and $\tilde Q$, respectively, in probability. Moreover, by the convergence together theorem (e.g., Rao, 1976, p. 122; Durrett, 1996, p. 91), the estimates of $s_{qj}$, $Q^*_{DM}$, and $\tilde Q$ have the same asymptotic distributions as $s_{qj}$, $Q^*_{DM}$, and $\tilde Q$, correspondingly, for $q = 1, 2, \ldots, Q$ and $j = 0, 1$.
Therefore, provided consistent estimates of the item parameters are available, the results on the item fit measure $Q^*_{DM}$ and its asymptotic distribution extend, in theory, to the situation in which item parameters are data-based estimates.

Chapter 3

Simulation Studies on Item Fit

Several simulation studies on the item fit measure $Q^*_{DM}$ are presented in this chapter. One important purpose of the simulation studies is to examine how much additional error is induced when the asymptotic distribution is approximated from the observed covariance rather than computed from the true covariance, and to find out under what conditions the approximation is practically useful. To investigate the accuracy of the approximation, a test consisting of 15 items is simulated. Such a short test is chosen because most personal computers can handle the computation involving all possible response patterns of 15 items, which is required for computing the true asymptotic probability. For dichotomously scored responses, there are $2^{15} = 32768$ possible response patterns in all. Thus, the true asymptotic distribution, the approximation of the true asymptotic distribution based on the observed covariance matrix of interrelations among pseudocounts, and the approximation based on data-based item parameter estimates can be compared to each other. For a longer test (e.g., a 30-item test), the number of possible response patterns may be too large (e.g., 1073741824 for a 30-item test) to compute the true asymptotic probabilities. Without the true asymptotic probabilities, it is difficult to have an intuitive sense of how good the approximation is.
The comparison of the true parameters and parameter estimates (e.g., the true covariance among pseudocounts versus the covariance estimate, the true asymptotic probability versus the approximation, the true eigenvalues versus the estimated eigenvalues from the observed covariance matrix, the true item parameters versus the item parameter estimates) is viewed as an oracle analysis. In applications, there is no need to evaluate all possible response patterns for the sake of the true covariance matrix among pseudocounts if the approximation is sufficiently close to the true value or the induced errors are negligible for practical use. The true covariance matrix among pseudocounts and the true asymptotic distribution for a given $Q^*_{DM}$ are computed here merely so that one can see how good the approximation can be. With this asymptotic method and approximation approach, there should be no practical concern about the computation of item fit analysis for longer tests. Therefore, the method is not limited to short tests. It can be applied to longer tests as long as the sample size is large enough for the approximation to work well.

Three different sample sizes are chosen for this study to determine how large a sample is sufficient for this asymptotic method, and to attempt to provide a guideline on the sample size needed for the method to work well. The parameters of the 15 items are also generated by computer. Discriminating power parameters are simulated from a uniform distribution on .6 to 2.6, i.e., $U(.6, 2.6)$, difficulty parameters are generated from the standard normal distribution $N(0, 1)$, and the asymptote parameters are from the uniform $U(0, .25)$. Table 3.1 contains all of the true parameter values for the 15 items.

Table 3.1: True Item Parameters for the Test of 15 Items

Item   Discrimination a   Difficulty b   Asymptote c
  1         .672             1.410          .177
  2        1.652             1.493          .013
  3         .747              .935          .005
  4        1.486             1.706          .165
  5        1.286              .967          .080
  6        1.357              .820          .086
  7        1.140             -.411          .159
  8        1.107             1.060          .083
  9        1.465              .388          .085
 10         .920             1.643          .145
 11         .740             -.668          .173
 12         .803             1.125          .040
 13        1.407             -.451          .067
 14         .662              .077          .124
 15        1.845             1.166          .148

Three groups of examinees are generated from $N(0, 1)$ with sample sizes 500, 1000, and 5000, which represent small, medium, and large samples, respectively. For each sample, dichotomous response data are simulated from 3PL IRT models. To account for the randomness in the response data, 1000 replications are conducted for each sample size. More specifically, the 15-item test is administered to 1000 groups of 500 examinees each from $N(0, 1)$, to 1000 groups of 1000 examinees, and to 1000 groups of 5000 examinees. Combining the sample size and replication conditions, 3000 data sets in all are generated for the simulation studies.

3.1 Type I Error Rates

To allow comparisons, the true asymptotic distribution, the approximation of the true asymptotic distribution based on the observed covariance matrix of interrelations among pseudocounts, and the estimated asymptotic distribution based on item parameter estimates are computed along with the corresponding item fit measure $Q^*_{DM}$. Type I error rates are calculated and compared across the different sample sizes (500, 1000, and 5000). Under the null hypothesis that the response data simulated from the 3PL model fit the hypothetical 3PL model (in this example, the same form of mathematical model, the 3PL model, is assumed for all items in the short test), the observed item fit measure $Q^*_{DM}$ is asymptotically distributed as a quadratic form of normal variables, as addressed in Chapter 2.
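The data-generation step of the simulation design (abilities drawn from $N(0, 1)$, dichotomous responses drawn from the 3PL model) can be sketched as below. The three items use the first three rows of Table 3.1. The logistic form with a scaling constant of 1.0, the function names, and the seed are assumptions made for illustration, since the source does not state the scaling constant or the generating program.

```python
import math
import random

def p_3pl(theta, a, b, c, scale=1.0):
    """3PL probability of a correct response; `scale` is an assumed constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

def simulate_responses(items, n_examinees, seed=0):
    """Draw abilities from N(0, 1) and 0/1 responses from the 3PL model.

    items : list of (a, b, c) tuples, one per item
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        row = [1 if rng.random() < p_3pl(theta, a, b, c) else 0
               for a, b, c in items]
        data.append(row)
    return data

# First three items of Table 3.1, one replication with N = 500.
items = [(0.672, 1.410, 0.177), (1.652, 1.493, 0.013), (0.747, 0.935, 0.005)]
data = simulate_responses(items, 500)
```

Repeating the call with different seeds would give the independent replications used to estimate rejection rates.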
For a given observed item fit statistic $Q^*_{DM}$, the asymptotic probability of observing such a value or greater can be evaluated through the routine by Davies (1980). For each item, count the number of replications in which the hypothesized item model is rejected. If the number is greater than 50 over 1000 replications (i.e., the type I error rate is greater than .05), the type I error rate is said to be greater than expected. Otherwise, the type I error rate is acceptable. Tables 3.2 through 3.4 show the type I error rates for each item in the test over 1000 replications for the three sample sizes (500, 1000, and 5000). A good model-data fit test requires a low type I error rate. The lower the type I error rate, the fewer mistakes are made by rejecting a correct hypothesis.

Table 3.2: Type I Error Rate for Sample Size 500

              Type I Error                        RMSE
Item   True(Full)  True(Appr.)  Item Esti.     a      b      c
  1       .026        .023        .000       .187   .254   .040
  2       .029        .029        .006       .401   .190   .019
  3       .016        .017        .002       .254   .257   .089
  4       .020        .023        .001       .506   .276   .023
  5       .011        .013        .000       .228   .153   .030
  6       .016        .019        .000       .234   .135   .030
  7       .023        .024        .002       .169   .093   .035
  8       .018        .020        .001       .200   .178   .033
  9       .011        .014        .001       .231   .108   .040
 10       .026        .031        .000       .221   .247   .026
 11       .023        .025        .003       .115   .136   .036
 12       .017        .013        .000       .233   .224   .062
 13       .017        .018        .008       .630   .112   .092
 14       .031        .035        .001       .148   .243   .081
 15       .019        .020        .000      1.148   .152   .023

Table 3.3: Type I Error Rate for Sample Size 1000

              Type I Error                        RMSE
Item   True(Full)  True(Appr.)  Item Esti.     a      b      c
  1       .027        .023        .000       .162   .174   .035
  2       .012        .012        .000       .292   .092   .012
  3       .012        .012        .000       .208   .180   .070
  4       .014        .012        .000       .303   .146   .018
  5       .016        .013        .000       .192   .098   .022
  6       .019        .017        .000       .203   .092   .022
  7       .024        .020        .003       .123   .082   .032
  8       .010        .010        .000       .184   .105   .025
  9       .011        .010        .000       .179   .089   .030
 10       .033        .029        .000       .199   .148   .022
 11       .020        .020        .001       .087   .122   .037
 12       .020        .016        .000       .192   .146   .046
 13       .018        .013        .003       .193   .113   .075
 14       .025        .023        .000       .118   .211   .070
 15       .020        .019        .000       .317   .085   .018

Table 3.4: Type I Error Rate for Sample Size 5000

              Type I Error                        RMSE
Item   True(Full)  True(Appr.)  Item Esti.     a      b      c
  1       .054        .045        .000       .079   .083   .023
  2       .020        .019        .000       .130   .043   .004
  3       .049        .045        .001       .099   .093   .035
  4       .040        .034        .000       .175   .061   .008
  5       .039        .036        .000       .093   .046   .011
  6       .037        .032        .000       .096   .042   .011
  7       .036        .033        .000       .070   .062   .035
  8       .045        .038        .000       .084   .051   .012
  9       .029        .026        .000       .090   .042   .014
 10       .050        .040        .000       .099   .072   .013
 11       .047        .043        .000       .047   .097   .037
 12       .035        .032        .000       .077   .065   .019
 13       .020        .020        .001       .095   .067   .034
 14       .044        .039        .000       .056   .114   .038
 15       .026        .024        .000       .176   .042   .009

To examine the type I error rates of the item fit test, the data are generated from a particular mathematical model (e.g., the 3PL model) and fitted back to the same item model, so that the hypothesis under test is known to be correct. Therefore, the item fit test, if it behaves properly, should lead to accepting the correct hypothesis except for an acceptable level of error (type I error) due to randomness; in other words, the item fit test is simply employed to verify a known fact. The type I error rates in the tables are calculated from 1000 replications for each sample size. Tables 3.2 through 3.4 give the type I error rates when item parameters are known (denoted "Full" and "Appr.") and the type I error rates when item parameters are estimated from the response data (denoted "Item Esti."), along with the root mean square errors (denoted "RMSE") of each item parameter estimate.

It can be seen from the three tables (Tables 3.2, 3.3, and 3.4) that the type I error rates across the different sample sizes are generally very low, lower than .05, the level of significance.
Only one item (the first item in the 5000 case in Table 3.4) has a type I error rate of .054 on the true asymptotic distribution, slightly above the significance level. One major feature of the type I error rates in the tables is that, when item parameters are known constants, the type I error rate based on the true asymptotic distribution is close to its counterpart from the approximation by the observed covariance matrix. However, the type I error rates from the data-based item parameter estimates are in general lower than those from the true item parameters and are very conservative regardless of the sample size. It can be seen from these tables that for most of the items these type I error rates are near zero.

As with any estimation program in IRT, item parameter estimates contain estimation error even if the data adequately fit the mathematical models used for the estimation. To examine the conservative performance of $Q^*_{DM}$ when item parameters are estimated, root mean square errors (RMSE) of the item parameter estimates are calculated from the data sets. RMSE is defined as the square root of the mean squared difference between the item parameter estimates and the true item parameter over $r$ replications ($r$ in this example is 1000). Let $\eta$ denote the item parameter (e.g., discriminating power parameter $a$, difficulty parameter $b$, or asymptote parameter $c$) and $\hat\eta_i$ the item parameter estimate from the $i$th replication. Then RMSE can be calculated by
\[
\mathrm{RMSE} = \sqrt{\frac{1}{r} \sum_{i=1}^{r} (\hat\eta_i - \eta)^2}.
\]
RMSE provides a summary index for assessing the accuracy of item parameter estimates. The larger the RMSE of the item parameter estimates, the worse the estimation. For a simulation study, an adequate fit of model and data is assumed, and thus the differences in the item parameter estimates may depend on the estimation procedure and some other factors (e.g., the sample size of examinees). Tables 3.2 through 3.4 also contain the RMSE over 1000 replications for each item parameter in the test.
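The RMSE described above, the square root of the mean squared difference between the estimates and the true parameter value over the replications, can be computed with a few lines of code. The function name and the example numbers are hypothetical.

```python
import math

def rmse(estimates, true_value):
    """Root mean square error of replicated estimates around one true parameter."""
    r = len(estimates)  # number of replications
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / r)

# Hypothetical example: four replicated estimates of a discrimination
# parameter whose true value is 1.0.
example = rmse([0.9, 1.1, 1.2, 0.8], 1.0)
```

In the setting of the text, `estimates` would hold the 1000 replicated estimates of one parameter of one item, giving one cell of the RMSE columns in Tables 3.2 through 3.4.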
The estimation procedure used in this study, in BILOG-MG3, is Bayesian MML with the default item prior distributions. That is, for all item parameters, $a \sim \mathrm{lognormal}(0, 0.5)$, $b \sim N(0, 2)$, and $c \sim \mathrm{beta}(5, 17)$. The three tables show that the RMSE decreases as the sample size increases, indicating that better item parameter estimates are obtained, as expected. In general, the RMSE is largest for the sample size of 500 and smallest for 5000. For the same sample size, the RMSE for the discriminating power parameter is in general larger than that for the difficulty and asymptote parameters.

3.2 Coefficients for the Asymptotic Distributions

Since the asymptotic distribution depends on the coefficients in the linear combination, i.e., the eigenvalues extracted from the covariance among pseudocounts, it is important to compare the coefficients from the true covariance matrix (the full covariance matrix that comes from evaluating all possible response patterns for a given test), from the approximation of the covariance matrix, and from the estimated covariance matrix based on the data-based item parameter estimates across different sample sizes. The purpose of comparing those coefficients is to examine how much additional error is induced through the coefficients of the asymptotic distribution. Table 3.5 through Table 3.11 include 20 ordered positive eigenvalues extracted from the true covariance matrix (Table 3.5) and from the approximated covariance matrix of the observed covariance among pseudocounts (Table 3.6 through Table 3.8 show the 20 ordered positive eigenvalues from the true item parameters; Table 3.9 through Table 3.11 show those from the data-based item parameter estimates). In these tables, the rows represent the 20 positive eigenvalues and the columns indicate the 15 items in the test. The other extracted eigenvalues are omitted and not used for calculating the asymptotic probabilities because of their trivial magnitudes.
Note that the values in these tables are from one replication. Similar results are obtained from the other 999 replications and hence are not reported here. The 20 ordered positive eigenvalues from the true covariance matrix, which depends only on the number of items, are used to compute the true asymptotic distributions; the 20 ordered positive eigenvalues extracted from the observed covariance matrix of interrelations among pseudocounts from the true item parameters are used to compute the approximation of the asymptotic probabilities; and the 20 ordered positive eigenvalues from the observed covariance matrix of interrelations among pseudocounts based on the item parameter estimates are the coefficients for computing the estimated asymptotic probabilities.

It can be seen from these tables that, for each sample size, the 20 eigenvalues from the estimated covariance matrix across the three sample sizes are very close to their counterparts from the true covariance matrix, whether the item parameters are true constants or data-based estimates. These values are the estimated coefficients for the linear combination of $\chi^2_1$ random variables, which are eventually used to calculate the asymptotic probabilities. Except for the 20 values from the true covariance matrix in Table 3.5, the coefficients in Table 3.6 through Table 3.11, which are from observed covariance matrices of pseudocounts, are data-based estimates and vary as the data change. So do the resulting approximations of the true asymptotic probabilities.

Table 3.5: The 20 Positive Eigenvalues from True Covariance Matrix

Table 3.7: 20 Eigenvalues for True Item Parameters (N = 1000)
For example, over 1000 replications of the 15-item test with 500 examinees, there are 1000 different observed covariance matrices of pseudocounts, and correspondingly the 20 ordered positive eigenvalues extracted from these matrices vary across data sets. However, it is found that, for the 20 ordered positive eigenvalues extracted from each observed covariance matrix of pseudocounts, the differences from their true counterparts are so small that the approximation of the distribution is close to the true asymptotic distribution even for the small sample size of 500. Similar results are found for the sample sizes of 1000 and 5000. In addition, as the sample size increases, the observed covariance matrix of pseudocounts becomes closer to the true covariance matrix of pseudocounts, and hence the approximation of the asymptotic distribution gets closer to the true asymptotic distribution.

Table 3.8: 20 Eigenvalues for True Item Parameters (N = 5000)

Table 3.9: 20 Eigenvalues for Item Parameter Estimates (N = 500)

Table 3.10: 20 Eigenvalues for Item Parameter Estimates (N = 1000)

3.3 Item Misfit and Power with Known Item Parameters

A good significance test also requires high power for detecting model-data misfit. The higher the power of a hypothesis test, the higher the probability of rejecting the null hypothesis when it is actually incorrect. In this section, power is not computed analytically for the hypothesis test but is estimated empirically from simulated data.

To estimate the power, the 3PL model, for instance, is used to generate dichotomous response data, and the generated data are then fitted with the 2PL or the 1PL model, respectively. The type I error rate is expected to be low when fitting the data back with the 3PL model, but the power is expected to be high when fitting the data with the 2PL or 1PL models over 1000 replications.
Similarly, low Type I error rates are expected when the dichotomous response data generated by the 2PL model are fitted with the hypothetical 2PL model over 1000 replications, whereas the power is expected to be high when the same data are fitted with the hypothetical 1PL model.

Table 3.12 through Table 3.14 show the power for all items at the nominal level for different sample sizes, provided all item parameters are known constants. From Table 3.12, it is easily seen that fitting the data generated by the 3PL model with the hypothetical 2PL or 1PL model is not adequate when the item parameters are known. In most of the 1000 replications the incorrect hypothesis is rejected, which is the correct decision for the model-data misfit tests.

Table 3.11: 20 Eigenvalues for Item Parameter Estimates (N = 5000)

Table 3.12: The Power for Test Data Generated by 3PL Model with True Item Parameters

Table 3.13: The Power for Test Data Generated by 2PL Model with True Item Parameters

Table 3.14: The Power for Test Data Generated by 1PL Model with True Item Parameters

From the perspective of hypothesis testing, the null hypothesis (e.g., $H_0$: the data fit the hypothetical 2PL or 1PL model) is rejected in almost all of the 1000 replications when the data are actually generated by the 3PL model and the item parameters are true constants. A rejection rate of 1 means the incorrect hypothesis is correctly rejected in every replication across the three sample size conditions (500, 1000, and 5000); that is, the hypothesis tests for model-data misfit have perfect power.
Similarly, Table 3.13 shows high power for testing the hypothesis of fitting the data generated by the 2PL model with the 3PL model, and Table 3.14 shows adequate power for testing the hypothesis of fitting the data generated by the 1PL model with the 3PL model, regardless of the sample size, provided the item parameters are known constants.

Power is a function of the sample size: as the sample size increases, power also increases. This feature is apparent in Table 3.13 and Table 3.14 when the same hypothesis test is compared across the three sample sizes (500, 1000, and 5000). For example, in Table 3.13, for testing the hypothesis that the model for item 10 is the 1PL model using data that are actually generated by the 2PL model, the power is .005 at a sample size of 500, .117 at 1000, and .986 at 5000.

However, the power is not uniformly high across items, in particular for the sample size of 500, when testing the hypothesis that the correct model is the 1PL model using the data generated by the 2PL model with known item parameters. For example, in the third column of Table 3.13, item 3 through item 14 have very low power at the sample size of 500. In fact, the power varies as the value of the $a$ parameter changes from item to item.

The results in Table 3.13 and Table 3.14 also indicate that when the data are fitted with a model that has more item parameters than the data-generating model (e.g., in Table 3.13, fitting the 3PL model to data generated by the 2PL model, and in Table 3.14, fitting the 3PL or 2PL model to data generated by the 1PL model), the power is generally high provided the item parameters are known constants (except item 8 and item 10 in Table 3.14).
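The empirical power and Type I error rates discussed in this section are simply rejection proportions over replications. A minimal sketch, assuming the asymptotic p-values are stored with one row per replication and one column per item, and assuming a nominal level of .05 (the level is not restated in this section), is:

```python
import numpy as np

def rejection_rates(p_values, alpha=0.05):
    """Per-item proportion of replications with p < alpha.

    This is the empirical power when the hypothesized model is wrong and
    the empirical Type I error rate when it is the generating model.
    alpha = 0.05 is an assumed nominal level.
    """
    p = np.asarray(p_values, dtype=float)
    return (p < alpha).mean(axis=0)

def mc_standard_error(rate, n_replications):
    """Binomial standard error of an estimated rejection rate."""
    return float(np.sqrt(rate * (1.0 - rate) / n_replications))
```

The standard error gives a rough sense of how much an estimated rate can move from one set of 1000 replications to another.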
3.4 Item Misfit and Power with Item Parameter Estimates

The simulation study in this section is similar to the one above on power estimates, except that the item parameters are not known constants but data-based estimates. The response data are generated by the 3PL model (a known fact in the simulation study) and are then fitted with the 3PL, 2PL, and 1PL models, respectively, on the basis of item parameter estimates. Low Type I error rates over 1000 replications are expected when testing the hypothesis that the data fit the 3PL model with item parameters estimated under the 3PL model, whereas high rejection rates (power) are expected when testing the hypothesis of the 2PL or 1PL model with item parameters estimated under that model. In addition, as seen in the previous section, the power is expected to increase as the sample size increases.

Table 3.15: The Power for Test Data Generated by 3PL Model with Item Parameter Estimates

Table 3.15 through Table 3.17 show the power on the basis of item parameter estimates under three different data-generating conditions. One apparent characteristic of the three tables is that the power increases as the sample size increases. For example, when the sample size increases to 5000, the power reaches 1 at the nominal level for testing the hypothesis of the 2PL or 1PL model using the data generated by the 3PL model (Table 3.15), and for testing the hypothesis of the 1PL model using the data generated by the 2PL model (Table 3.16). Another expected feature is that, for the same sample size, the power is generally greater when testing the hypothesis of the 2PL model (i.e., $H_0$: the correct model is the 2PL) than when testing the hypothesis of the 1PL model (i.e., $H_0$: the correct model is the 1PL) (column 1 versus column 2 for the sample size of 500; column 3 versus column 4 for the sample size of 1000).
For the sample size of 500, there is not enough power for testing the hypotheses of the 2PL and the 1PL models using the data generated by the 3PL model, except for a small number of items (e.g., in testing the hypothesis of the 1PL model, item 1, item 2, item 4, and item 10 have power greater than or close to .80). When the sample size increases to 1000, the tests of both hypotheses ($H_0$: the correct model is the 2PL model, and $H_0$: the correct model is the 1PL model) reach power of about .90 or greater, except for item 3, item 11, and item 12 when testing the hypothesis that the correct model is the 1PL model. In Table 3.16, the power for testing the hypothesis that the correct model is the 1PL model using the data generated by the 2PL model is less than .5 when the sample size is 500, and 8 items (item 4 through item 8, item 10, item 12, and item 13) still have power less than .5 for the same hypothesis when the sample size increases to 1000. In general, there is not enough power for testing the hypothesis of the 1PL model using the data generated by the 2PL model when the item parameters are data-based estimates, in particular when the $a$ parameters in the 2PL model are close to 1.

Table 3.16: The Power for Test Data Generated by 2PL Model with Item Parameter Estimates

Table 3.17: The Power for Test Data Generated by 1PL Model with Item Parameter Estimates

As expected, the power is low when the hypothesized model has more item parameters than the data-generating model. For example, in Table 3.16, the power is low when the hypothesis is $H_0$: the correct model is the 3PL model and the data are generated by the 2PL model, no matter what the sample size is. That is, in most of the 1000 replications the item fit analysis does not reject the hypothesis that the data generated by the 2PL model fit the 3PL model.
Similarly, Table 3.17 shows that the item fit analysis does not have enough power to reject the hypothesis that the correct model is the 3PL or 2PL model using the data generated by the 1PL model when the item parameters are data-based estimates.

3.5 True Asymptotic Distribution Versus the Approximation

A plot of the true asymptotic probabilities based on the full covariance matrix against the approximated probabilities based on the observed covariance matrix of pseudocounts shows directly how well the approximation works across sample sizes: points lying along the reference line $y = x$ indicate a small difference between the true and approximated values. The plots for different sample sizes may provide practical recommendations as to how large a sample is required for an adequate approximation. For example, the following three figures (Figure 3.1 through Figure 3.3) plot the true asymptotic probabilities against their approximations for item 1, item 3, item 5, and item 7 in the 15-item test over 1000 replications for the three sample sizes (500, 1000, and 5000). The plots for the other items are omitted here since they are very similar to those for these four items.
As can be seen from the three figures (Figure 3.1 through Figure 3.3), the points spread widely around the middle of the reference line for the sample size of 500, become more concentrated for the sample size of 1000, and fall almost on a straight line when the sample size increases to 5000. The approximation based on the observed covariance matrix of pseudocounts clearly works well for sample sizes of 1000 and 5000. The results are somewhat dispersed for 500 examinees. In the case of a short test with a small sample size (e.g., 500), it is advisable to use the true asymptotic probability instead of the approximated one.

Figure 3.1: True Asymptotic Probabilities Versus Approximation (N = 500). Panels: Item 1, Item 3, Item 5, Item 7; horizontal axis: True Asymptotic Probabilities; vertical axis: Approximation.

Figure 3.2: True Asymptotic Probabilities Versus Approximation (N = 1000). Panels: Item 1, Item 3, Item 5, Item 7; horizontal axis: True Asymptotic Probabilities; vertical axis: Approximation.

Figure 3.3: True Asymptotic Probabilities Versus Approximation (N = 5000). Panels: Item 1, Item 3, Item 5, Item 7; horizontal axis: True Asymptotic Probabilities; vertical axis: Approximation.

3.6 Sensitivity Analysis

3.6.1 Non-normal Proficiency Populations

Psychometricians are interested in whether a method developed in one context applies to other psychological and educational testing contexts. For example, in the above studies the examinees are assumed to come from a standard normal population (i.e., $N(0, 1)$), as is typical in simulation studies. How does the method work with a non-normal population?
This is a practical issue for many tests, in which the examinees' abilities do not follow an exact standard normal distribution. This study examines the effects of the ability population distribution on the asymptotic method developed in Chapter 2.

To investigate the potential effects of the underlying ability distribution, the population is chosen to be a four-parameter Beta distribution ranging from -4 to 4. One reason for choosing the four-parameter Beta distribution as the comparison to the standard normal distribution is that its shape and range are relatively convenient to manipulate. The following briefly introduces the expectation, variance, and probability density function of the distribution. The Type I error rates will be examined for the above 15-item test but with a non-normal population, namely the four-parameter Beta distribution.

The four-parameter Beta distribution, denoted $B(\alpha, \beta, L, U)$, is determined by two shape parameters ($\alpha$ and $\beta$) and two range parameters (the lower limit $L$ and the upper limit $U$ of the distribution). Let $x$ be a random variable from $B(\alpha, \beta, L, U)$, i.e., $x \sim B(\alpha, \beta, L, U)$, $L < x < U$. Then the density is given by

\[ f(x) = \frac{(x - L)^{\alpha-1}(U - x)^{\beta-1}}{(U - L)^{\alpha+\beta-1}\,\mathrm{Beta}(\alpha, \beta)}, \]

where $\mathrm{Beta}(\alpha, \beta)$ is the Beta function defined for $\alpha > 0$, $\beta > 0$ by

\[ \mathrm{Beta}(\alpha, \beta) = \int_0^1 u^{\alpha-1}(1 - u)^{\beta-1}\,du. \]

The expectation and variance of $x$ can be expressed as

\[ E(x) = \frac{\alpha U + \beta L}{\alpha + \beta}, \qquad \mathrm{Var}(x) = \frac{(U - L)^2\,\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}. \]

If $\alpha = \beta$, then $x$ has a symmetric distribution within its lower and upper limits. For $\beta > \alpha > 0$, $x$ has a positively skewed distribution; for $\alpha > \beta > 0$, $x$ has a negatively skewed distribution. For $L = 0$, $U = 1$, the four-parameter Beta distribution reduces to the regular Beta distribution that is often presented in basic statistics textbooks. In particular, for $\alpha = \beta = 1$ and $L = 0$, $U = 1$, $x$ reduces to a uniform distribution on the interval from 0 to 1.

Figure 3.4 compares the four-parameter Beta distribution with the standard normal distribution.
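The four-parameter Beta population can be sketched by rescaling a standard Beta variate to the interval from $L$ to $U$. The function names below are hypothetical and the sampling approach is an assumption for illustration, not necessarily how abilities were drawn in the study.

```python
import numpy as np

def beta4_mean_var(alpha, beta, lower, upper):
    """Expectation and variance of B(alpha, beta, L, U) from the formulas above."""
    mean = (alpha * upper + beta * lower) / (alpha + beta)
    var = ((upper - lower) ** 2 * alpha * beta
           / ((alpha + beta) ** 2 * (alpha + beta + 1)))
    return mean, var

def beta4_sample(alpha, beta, lower, upper, size, rng):
    """Draw abilities by rescaling a standard Beta(alpha, beta) variate to (L, U)."""
    return lower + (upper - lower) * rng.beta(alpha, beta, size=size)
```

For $B(4, 4, -4, 4)$ the formulas give a mean of 0 and a variance of $16/9 \approx 1.78$, so this population is symmetric like $N(0, 1)$ but more spread out.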
One can see that $B(4, 4, -4, 4)$ and the standard normal distribution are both symmetric but obviously have different probability distributions. The shoulders of $B(4, 4, -4, 4)$ are wider and lower than those of $N(0, 1)$. Also, the standard normal distribution is not restricted to the range from -4 to 4. The figure also shows that $B(2, 4, -4, 4)$ is positively skewed and $B(4, 2, -4, 4)$ is negatively skewed. In this study, the examinees are assumed to come from $B(4, 4, -4, 4)$, as compared with $N(0, 1)$, to see whether the ability distribution has substantial effects on the results of the item fit analysis.

Figure 3.4: Beta Distribution versus Standard Normal Distribution. Curves: Beta4(4,4,-4,4), Beta4(4,2,-4,4), Beta4(2,4,-4,4), N(0,1).

Table 3.18 shows that even if the underlying ability distribution is not normal, the item fit test still has low Type I error rates and is conservative, as in the case of the standard normal population. Again, the Bayesian procedure with MML is used to calibrate all item parameters with default item prior distributions when calibrating the item parameters with the 3PL model using the data generated by the 3PL model.

Table 3.18: Type I Error Rates for Non-normal Ability Population and Data-Based Item Parameter Estimates

Item   N=500   N=1000   N=5000
1      .000    .000     .000
2      .004    .000     .002
3      .004    .001     .040
4      .002    .000     .001
5      .000    .000     .000
6      .001    .001     .002
7      .003    .000     .000
8      .000    .000     .001
9      .000    .001     .001
10     .000    .000     .001
11     .001    .001     .000
12     .003    .001     .002
13     .010    .005     .004
14     .000    .001     .002
15     .003    .000     .000

It can be seen from Table 3.18 that when the underlying ability distribution is different from the standard normal distribution, the method still provides low Type I error rates, which in some sense may be viewed as too conservative.
The results show that the method is robust with respect to the underlying ability distribution, although the item parameter estimates contain large errors in the case of the non-normal ability population. Further evidence can be found in the RMSE for each item parameter estimate in Table 3.19. The RMSE for each item in the test at the three sample sizes (N = 500, N = 1000, and N = 5000) over 1000 replications are generally larger than the RMSE in the case of the standard normal ability distribution.

Table 3.19: RMSE for Non-normal Ability Population

       RMSE (N = 500)       RMSE (N = 1000)      RMSE (N = 5000)
Item   a      b      c      a      b      c      a      b      c
1      .3     .374   .032   .276   .365   .03    .175   .377   .025
2      .692   .357   .017   .699   .374   .009   .488   .375   .003
3      .487   .156   .066   .47    .155   .047   .313   .207   .021
4      .4     .423   .024   .359   .428   .018   .319   .416   .009
5      .553   .241   .024   .559   .237   .016   .393   .259   .009
6      .565   .208   .024   .6     .193   .016   .437   .224   .01
7      .873   .116   .029   .493   .174   .035   .384   .089   .034
8      .476   .259   .025   .473   .259   .017   .337   .278   .009
9      .618   .101   .03    .682   .073   .019   .504   .116   .012
10     .343   .413   .025   .323   .417   .022   .218   .403   .014
11     .326   .211   .03    .325   .257   .039   .247   .147   .04
12     .448   .236   .044   .426   .242   .028   .272   .28    .009
13     .667   .174   .06    .751   .234   .052   .565   .134   .024
14     .345   .124   .054   .34    .125   .048   .228   .068   .026
15     .512   .317   .023   .61    .311   .017   .508   .31    .008

The RMSE for each item parameter at the sample size N = 500 are given in the first three columns of Table 3.19, corresponding to the discrimination, difficulty, and asymptote item parameters, respectively. Similarly, the second three columns give the RMSE for the sample size of 1000, and the last three columns give the RMSE for the sample size of 5000. One can see from the study that the effects of the ability distribution on the results of the item fit analysis are confounded with the item parameter estimation.
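For reference, the RMSE reported for each item parameter can be computed as in the following minimal sketch, where the estimates across replications are compared with the known generating value. The function name is hypothetical.

```python
import numpy as np

def rmse(estimates, true_value):
    """Root mean squared error of replicated estimates around the true value."""
    e = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((e - true_value) ** 2)))
```

Because the RMSE combines bias and sampling variability, a large value under the non-normal population does not by itself show which of the two is responsible.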
The conservative Type I error rates show that the population distribution itself should not be a factor in the results of the item fit analysis, but that it can severely influence the item parameter estimates, as reflected by the large RMSE in Table 3.19.

3.6.2 The Number of Quadrature Points and Item Fit

The item fit measure $Q^*_{DM}$ and its corresponding asymptotic distribution rely on a discrete underlying ability distribution (i.e., $p(\theta = \theta_q) = w_q$ for $q = 1, 2, \ldots, Q$), which is used to approximate the continuous distribution $N(0, 1)$. Here $Q$ represents the number of quadrature points. How the item fit diagnostic procedure depends on the number of quadrature points $Q$ is an important practical issue regarding the stability of the method. With a larger number of quadrature points, the discrete proficiency distribution approximates the continuous proficiency distribution more closely. For the previous simulation studies, the number of quadrature points $Q$ was chosen as 41, ranging from -4 to 4. To compare the stability of the results across different numbers of quadrature points, 21 and 81 quadrature points are also selected within the range of -4 to 4; similar results for the same data indicate that the method is stable with respect to the number of quadrature points.

In this simulation study, a test of 30 items is simulated and administered to a sample of 1000 examinees from a standard normal population. The dichotomous response data are simulated using the 3PL model. For a given data set and a stable method, in which the number of quadrature points does not have substantial effects on the item fit analysis, each item fit statistic and its corresponding asymptotic probability would not be expected to differ much as the number of quadrature points changes from 21 to 41 to 81.
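The discrete ability distribution can be sketched as follows, assuming equally spaced quadrature points with weights proportional to the standard normal density and normalized to sum to 1. The spacing and weighting scheme are assumptions for illustration; the source states only the number of points and the range.

```python
import numpy as np

def normal_quadrature(n_points, lower=-4.0, upper=4.0):
    """Equally spaced points theta_q with weights w_q approximating N(0, 1).

    Assumption: weights are the normal density at each point, renormalized
    so that they sum to 1.
    """
    theta = np.linspace(lower, upper, n_points)
    w = np.exp(-0.5 * theta ** 2)
    return theta, w / w.sum()
```

Under this scheme the mean and variance of the discrete distribution are already very close to 0 and 1 for 21, 41, and 81 points, which is in line with the stability reported in this section.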
Similarly, the Type I error rates at the nominal level over 1000 replications would not be expected to differ as the number of quadrature points varies. Table 3.20 shows the true item parameters in the first three columns, the RMSE in the second three columns, and the Type I error rates in the last three columns for Q = 41, Q = 21, and Q = 81, respectively. The item parameter estimates are MML estimates using the 3PL model in BILOG-MG3. The true item parameters in this study take a wide variety of values, intended to simulate more general practical contexts for the test items. The discrimination parameters range from .139 to 2.675; the difficulty parameters range from -1.821 to 2.233; and most of the asymptote parameters are around .2, with the highest being .29.

Figure 3.5 through Figure 3.8 show the results for the three different numbers of quadrature points (Q = 21, 41, and 81). It can be seen from these figures that the plots of both the item fit statistics (i.e., $Q^*_{DM}$) and the corresponding asymptotic probabilities for the four items (Item 1, Item 3, Item 5, and Item 7) over 1000 replications lie closely around the reference line $y = x$, indicating that these values are very close to each other no matter what the number of quadrature points is. However, on careful examination, one can find that some regions of the plots of Q = 21 versus Q = 41 are more scattered, implying that some large differences occur. Similar results are obtained for the other items in the same test but are not listed or plotted here. These results show that the item fit analysis based on the pseudocounts approach developed in Chapter 2 is not overly sensitive to the number of quadrature points, indicating that the results are stable and robust.
From these nearly interchangeable results on the item fit statistics and the corresponding asymptotic probabilities, one can conclude that the number of quadrature points is, practically, not a factor that affects the results of the item fit analysis. Further evidence for this conclusion can be seen from the Type I error rates at the nominal level over 1000 replications in Table 3.20.

Table 3.20: Type I Error Rates for Three Numbers of Quadrature Points

       True                      RMSE                   Type I Errors
Item   a       b        c        a      b      c        Q=41   Q=21   Q=81
1      1.899   -.054    .24      .254   .077   .036     .000   .000   .000
2      1.411   1.107    .243     .235   .097   .026     .000   .000   .000
3      2.656   -1.326   .255     .520   .136   .077     .002   .002   .002
4      2.159   1.083    .057     .279   .062   .011     .001   .001   .001
5      1.545   .735     .048     .178   .063   .018     .000   .000   .000
6      2.605   .619     .273     .415   .064   .026     .000   .000   .000
7      .771    .416     .016     .159   .162   .071     .007   .006   .007
8      2.474   1.18     .085     .362   .065   .011     .000   .000   .000
9      .941    .096     .022     .159   .137   .067     .004   .004   .004
10     2.423   .708     .246     .383   .065   .023     .000   .000   .000
11     .653    .35      .129     .119   .170   .056     .000   .000   .000
12     1.543   -.088    .226     .195   .084   .039     .003   .003   .003
13     1.832   .559     .239     .259   .072   .027     .000   .000   .000
14     1.959   .536     .096     .226   .056   .018     .001   .001   .001
15     2.587   -1.821   .096     .506   .109   .088     .002   .002   .002
16     .241    .135     .115     .166   .872   .170     .001   .003   .001
17     2.117   .838     .146     .286   .061   .018     .001   .001   .001
18     1.045   -.19     .037     .158   .128   .068     .012   .012   .012
19     .139    .211     .286     .113   .459   .048     .023   .023   .023
20     .474    1.879    .164     .178   .219   .045     .000   .001   .000
21     1.39    1.522    .222     .269   .121   .022     .000   .000   .000
22     1.972   -.963    .028     .316   .097   .082     .025   .023   .025
23     1.635   .558     .233     .229   .076   .028     .001   .001   .001
24     .381    .877     .29      .126   .312   .058     .003   .003   .003
25     .795    -.329    .197     .108   .138   .048     .002   .002   .002
26     .174    2.233    .078     .293   .439   .193     .009   .009   .009
27     1.69    2.211    .014     .297   .166   .006     .000   .000   .000
28     2.195   1.435    .066     .340   .078   .010     .002   .002   .002
29     1.268   -.331    .077     .151   .095   .050     .008   .007   .008
30     2.675   -.139    .094     .331   .052   .023     .002   .002   .002
The largest difference in the Type I error rates at the nominal level is .002, on item 22 (i.e., the Type I error rate is .025 for Q = 41 and Q = 81, and .023 for Q = 21), which can be attributed to random error in the sample data. One can use the results of this simulation study to reduce the computational burden for a large data set, since computing $Q^*_{DM}$ with Q = 41 takes less time than with Q = 81. However, using a smaller number of quadrature points (e.g., Q = 21) is not advised in applications since, in a small number of cases, large disturbances occur as Q gets smaller. When Q is greater than or equal to 41, Figure 3.7 and Figure 3.8 show stable results for both $Q^*_{DM}$ and its asymptotic probabilities. Therefore, Q = 41 is generally recommended for computing item fit in applications.

3.7 Computing Time and Programs

Several C++ programs have been implemented for the simulation studies. Three sets of C++ programs are coded for simulating response data, computing the item fit measure $Q^*_{DM}$ for each item in a test, and evaluating the asymptotic probabilities through the Davies (1980) routine. The computing time depends on both the sample size and the test length. Longer tests or larger samples of examinees take more time for computing the item fit statistics. The computing time also depends on the computer equipment. The computer used for this simulation study is equipped with a 2.39 GHz Pentium IV processor and 512 MB of RAM.

Figure 3.5: Item Fit Statistics $Q^*_{DM}$ and Number of Quadrature Points. Panels: Item 1, Item 3, Item 5, Item 7; horizontal axis: Item Fit Statistics (Q = 41).

Figure 3.6: Asymptotic Probabilities and Number of Quadrature Points. Panels: Item 1, Item 3, Item 5, Item 7; horizontal axis: Asymptotic Probabilities (Q = 41).
Figure 3.7: Item Fit Statistics $Q^*_{DM}$ and Number of Quadrature Points. Panels: Item 1, Item 3, Item 5, Item 7; axes: Item Fit Statistics (Q = 41) and Item Fit Statistics (Q = 81).

Figure 3.8: Asymptotic Probabilities and Number of Quadrature Points. Panels: Item 1, Item 3, Item 5, Item 7; axes: Asymptotic Probabilities (Q = 41) and Asymptotic Probabilities (Q = 81).

Computing the item fit statistics ($Q^*_{DM}$) for each item in the 15-item test administered to 500 examinees takes a quarter of a minute; the same test administered to 1000 examinees takes around one third of a minute; and for 5000 examinees it takes about one and a half minutes. Computing the statistics for the test of 30 items and 1000 examinees takes around one minute. These computing times are based on 41 quadrature points. The number of quadrature points is also a factor that affects the computation time. Generally speaking, the method is robust with respect to the number of quadrature points, as seen in Section 3.6.2. However, the computation takes less time for the same data set when a smaller number of quadrature points is chosen. The computation of the asymptotic probabilities takes less than a second for each data set. Overall, the computation is efficient and practical for most applications.

Chapter 4

Real Data Applications

One advantage of doing a simulation study on item fit analysis is that information is available about whether or not the test data fit the hypothetical IRT models.
For real test data, item fit analysis is often confounded with parameter estimation (in particular, item parameter estimation), which makes the decision on whether or not the test data fit the hypothetical models much more complex.

4.1 Assumptions

Before the real data analysis, some conditions must be assumed so that the results can be interpreted reasonably. One assumption is that the parameter estimation is accurate and reliable; that is, both item and ability parameters are correctly estimated. To satisfy this condition, the standard procedures in most IRT software (e.g., MML for item parameter estimation and EAP for ability estimation, as recommended in BILOG-MG3) are used to estimate the parameters for the real data in this chapter. Parameter estimation is often confounded with model-data fit issues. Poor parameter estimates may be caused by inadequate model-data fit or by other factors, for example, insufficient sample size or test length, or violations of the dimensionality or local independence conditions. Therefore, the major assumption here is that when testing the hypothesis that the test data fit a hypothetical model, the parameter estimates obtained under this model are assumed to have no errors. For example, if one tests the hypothesis that the observed data fit a 3PL model, then both the item and ability parameters are assumed to be correctly estimated using this 3PL model. If the parameter estimates are incorrect, the only explanation is that the test data do not adequately fit the hypothetical 3PL model, rather than a failure of the estimation procedure itself.

The other assumptions of item response theory are also made here. For example, unidimensionality and local independence are assumed for the analysis in this chapter.
4.2 Two Approaches on Item Fit Analysis for Real Data

In this section, the two real data sets are from the anonymous Fall 2000 high school science and math tests of the Michigan Educational Assessment Program (MEAP). The MEAP science data set used here consists only of the dichotomous responses of 7088 examinees to 19 items; the MEAP math data set likewise contains only 19 dichotomous items and 6857 examinees.

In this chapter, both the science and math data are fitted with the 3PL, 2PL, and 1PL models, respectively. The item parameters are estimated using the MML method in BILOG-MG3. The results of the item fit analysis in BILOG-MG3, which uses 10 ability groups and 30 quadrature points for its $\chi^2$ test, are compared with the results of the item fit measure ($Q^*_{DM}$) based on pseudocounts. Table 4.1 and Table 4.2 give the item parameter estimates for the science data (Table 4.1) and the math data (Table 4.2) when the data are fitted with the 3PL model.

Since the two sample sizes are large, the item fit $\chi^2$ tests in BILOG-MG3 show that all items have statistically significant deviations between the test data and the model predictions in both the science and mathematics tests (i.e., their p-values are all less than .05), as indicated by the large values of the $\chi^2$ statistics and the low p-values in Table 4.1 and Table 4.2. The $\chi^2$ test is known to be sensitive to examinee sample size. Almost any departure in the data from the item model under consideration, even one of trivial practical significance, leads to rejection of the null hypothesis of model-data fit if the sample size is sufficiently large. On the other hand, for a small sample size, even large discrepancies between model and data may go undetected because of low power.
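The sensitivity of a $\chi^2$ fit test to sample size can be shown with a small worked sketch. The statistic below is a generic Pearson-type form over ability groups, assumed here for illustration; it is not the exact statistic computed by BILOG-MG3, and the proportions and group sizes are made-up values.

```python
import numpy as np

def pearson_chi2(observed_p, expected_p, group_sizes):
    """Pearson-type item fit statistic summed over ability groups.

    Illustrative form: sum_g N_g * (O_g - E_g)**2 / (E_g * (1 - E_g)).
    """
    o = np.asarray(observed_p, dtype=float)
    e = np.asarray(expected_p, dtype=float)
    n = np.asarray(group_sizes, dtype=float)
    return float(np.sum(n * (o - e) ** 2 / (e * (1.0 - e))))

# Hypothetical case: 10 equal-sized groups, model predicts .60 in each,
# data show .63 in each (the same small discrepancy at both sample sizes).
expected = np.full(10, 0.60)
observed = np.full(10, 0.63)
small = pearson_chi2(observed, expected, np.full(10, 50))    # N = 500
large = pearson_chi2(observed, expected, np.full(10, 700))   # N = 7000
```

With the discrepancy held fixed at .03, the statistic is roughly 1.9 at N = 500 and roughly 26 at N = 7000, a factor of 14 that matches the ratio of the sample sizes. The same misfit is therefore far from significant in the small sample and clearly significant in the large one.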
Hambleton and Rogers (in Hambleton, “Principles and selected applications of item response theory,” in Educational Measurement, 3rd edition, edited by Linn, 1993, p. 173) suggest that

“statistical tests of model fit do appear to have some value. Because they are sensitive to sample size and because they are not uniformly powerful, the use of any of these statistics as the sole indicator of model fit is clearly inadvisable. But two situations can be identified in which these tests may lead to relatively clear interpretations. When sample sizes are small and the statistics indicate model misfit, or when sample sizes are large and model fit is obtained, the researcher may have reasonable confidence that, in the first case, the model does misfit the data, and in the second, that the model fits the data. These possibilities make it worthwhile to employ statistical tests of fit despite the alternate possibility of equivocal results.”

Table 4.1: MEAP 2000 Fall High School Science Test Items with the 3PL Model (N = 7088)

Item   a       b        c      $Q^*_{DM}$   $p^*$   $\chi^2$   $p$
1      .416    -2.742   .000   8.387        .064    53.5       .000
2      1.455   2.047    .343   2.585        .772    20.7       .023
3      .462    -1.187   .000   7.626        .093    55.8       .000
4      .598    -.181    .018   5.776        .228    64.0       .000
5      .240    -2.112   .000   3.76         .530    22.9       .011
6      .824    -.048    .198   2.681        .752    63.1       .000
7      .752    -.173    .315   2.352        .818    39.6       .000
8      .514    -1.733   .000   3.38         .606    56.1       .000
9      .556    -.664    .500   12.097       .009    37.2       .000
10     .808    .943     .256   2.411        .806    33.1       .000
11     1.048   1.255    .317   2.462        .796    31.3       .000
12     .641    -1.368   .000   4.294        .431    90.2       .000
13     .621    -1.027   .000   1.065        .796    92.5       .000
14     .635    .330     .422   3.259        .631    29.2       .001
15     .973    .588     .091   2.331        .822    107.6      .000
16     .722    .125     .325   2.548        .779    31.0       .000
17     .936    .752     .269   2.616        .765    48.8       .000
18     .747    -1.035   .000   2.888        .709    121.1      .000
19     .465    -1.783   .000   4.350        .422    68.1       .000

According to the above guideline by Hambleton and Rogers, the item fit analysis results from BILOG-MG3 might not provide useful information that can lead to “relatively clear
interpretation” because of the large samples of examinees in both tests. In other words, for these two examples it is difficult to evaluate whether the science and math test data fit the hypothetical 3PL model if the only information available is from the $\chi^2$ tests in BILOG-MG3.

Table 4.2: MEAP 2000 Fall High School Mathematics (N = 6857)

Item   a       b        c      $Q^*_{DM}$   $p^*$   $\chi^2$   $p$
1      .524    -2.638   .000   4.079        .530    43.9       .000
2      .704    .074     .122   3.387        .663    38.9       .000
3      .851    1.326    .172   4.072        .581    40.3       .000
4      .771    .340     .228   3.431        .654    27.5       .002
5      .541    -2.883   .000   4.981        .379    48.6       .000
6      .912    -.173    .098   3.019        .735    58.0       .000
7      .571    -.165    .158   3.284        .683    43.0       .000
8      1.171   .559     .104   5.589        .296    57.9       .000
9      1.390   -.657    .223   3.844        .574    78.2       .000
10     .900    -.980    .198   2.849        .768    49.9       .000
11     .582    -.254    .138   3.135        .712    26.8       .003
12     1.135   -.808    .212   3.174        .705    60.4       .000
13     1.236   -.135    .226   3.400        .660    35.7       .000
14     1.329   .529     .153   4.581        .442    32.6       .000
15     .713    -1.125   .000   4.924        .387    68.0       .000
16     .414    -.620    .000   6.464        .202    49.1       .000
17     .455    -2.189   .000   8.672        .072    70.6       .000
18     .611    -.840    .500   7.110        .151    25.5       .004
19     .104    -6.671   .000   24.252       .000    77.4       .000

Consider now the results of the item fit analysis for both the science and math test data on the basis of pseudocounts. The item fit measure ($Q^*_{DM}$) and its corresponding asymptotic probabilities are computed using the data-based item parameters (from the standard MML item parameter estimation procedure) for the science and math tests, and are listed in Table 4.1 and Table 4.2, respectively. The results show that item 9 in the science test and item 19 in the math test have significant deviations between the test data and the hypothetical 3PL model (i.e., p-value less than .05). Data from the other items in both the science and math tests are consistent with predictions based on the hypothetical 3PL model. One can also see that when the hypothetical model is rejected, the corresponding fit statistic $Q^*_{DM}$ is considerably larger than for the other items in the two tests.
According to Hambleton and Rogers's guideline, with such a large sample of examinees, the adequate fit observed between the test data and the hypothetical 3PL model in both tests (except for item 9 in the science test and item 19 in the math test) should lead to a “relatively clear interpretation”. One attractive property shown by this real data application is that the item fit analysis based on pseudocounts (i.e., Q_DM) is able to provide useful item fit information even for a sample size as large as 7000.

If both the science and math test data are fitted with the 2PL or 1PL model, the results show that all hypothesis tests of item fit based on pseudocounts (Q_DM) are rejected (Tables 4.3 and 4.4). That is, the test data in both the science and math tests do not fit the 2PL or 1PL model adequately. However, BILOG-MG3 gives different results for these two hypotheses. The χ² tests from BILOG-MG3 show that the test data for three items (item 1, item 5, and item 11) in the math test have reasonable fit to the 2PL model and that three items (item 1, item 11, and item 17), also in the math test, have reasonable fit to the 1PL model (Table 4.4). Interestingly, in the math test the same three items (item 1, item 5, and item 11) show reasonable fit to the 2PL model but inadequate fit to the 3PL model, which is hard to make sense of. Similarly, it is difficult to imagine a situation in which the data from the same three items (item 1, item 11, and item 17) in the math test fit the 1PL model reasonably well but fail to fit the 3PL model. These results seem to conflict with the general principle that, purely from the model-data fit perspective, the more parameters a model has, the better the fit that can be achieved.

Table 4.3: MEAP 2000 Fall High School Science Items (N = 7088)
Table 4.4: MEAP 2000 Fall High School Mathematics Items (N = 6857)

4.3 Graphic Approach

One more interesting question is what other evidence can further support a decision between the apparently different results of the two approaches to item fit analysis above (i.e., the χ² test and Q_DM). An alternative approach, the graphic approach, might provide some intuitive sense of whether or not the test data from the MEAP science and math tests fit the hypothetical IRT models.

Figures 4.1 through 4.5 plot the hypothetical 3PL item response function (the solid curve in each graph) against the observed empirical item response curve (the dots in each graph) for the 19 items in the science test. One can see from these plots that most of the items do fit the 3PL model reasonably well, assuming the estimation is correct and the other IRT assumptions (e.g., local independence and unidimensionality) are satisfied. Item 9 is diagnosed as having a significant deviation between the data and the 3PL model, which can be seen in the first plot of Figure 4.3 as a large discrepancy (i.e., a deviation of more than .5) between the hypothetical IRF and the empirical IRF at the lower end of the ability scale. In fact, other items (item 1, item 3, item 5, item 8, and item 19) also show large discrepancies at the lower end of the ability scale yet are judged to fit reasonably well. One possible explanation for this finding is that the ability estimates may contain large errors, which lead to misclassification of examinees into groups. The possibly large errors in ability estimation, particularly for estimates at the two ends of the ability scale, may be attributable to the small number of items in the science test (i.e., a 19-item test can be considered a short test). That is part of the reason why BILOG-MG recommends using the χ² test for tests with more than 20 items.
Combining the diagnostic plots with the results of the item fit analysis, one can conclude that the Q_DM test provides helpful information for assessing model-data fit. Moreover, the Q_DM test for item fit can be applied to short tests and to large samples of examinees, which broadens the settings for item fit analysis.

Figure 4.1: Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (1-4) [x-axis: Ability Scale; y-axis: Probability or Proportion]

Figure 4.2: Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (5-8) [x-axis: Ability Scale; y-axis: Probability or Proportion]

Figure 4.3: Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (9-12) [x-axis: Ability Scale; y-axis: Probability or Proportion]

Figure 4.4: Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (13-16) [x-axis: Ability Scale; y-axis: Probability or Proportion]

Figure 4.5: Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (17-19) [x-axis: Ability Scale; y-axis: Probability or Proportion]

Chapter 5

Concluding Remarks and Future Research Directions

The simulation studies in Chapter 3 demonstrate that the approach to detecting item fit or misfit is reliable and promising. The approach achieves the expected computational efficiency by approximating the true asymptotic probabilities using the observed covariance matrix of the pseudocounts (e.g., Figure 1), which makes the approach applicable to most operational research. The approximation not only simplifies computation but also produces accurate item fit results, as the oracle analysis in Chapter 3 shows. When other sources of error are controlled, for example when item parameters are known, the item fit statistic Q_DM, the coefficients of the asymptotic distribution (Tables 3.5 through 3.11), the asymptotic probabilities (Figures 1 to 3), the type I error rates (Tables 3.2, 3.3, and 3.4), and the decisions on whether the test data fit the hypothetical models all show good agreement between the approximation and the true asymptotic distribution. However, the approximation based on the observed covariance matrix of the pseudocounts does introduce additional error into the assessment of item fit, and the error may be large when the test is long and only a small number of examinees is available.

The utility of this approach is not limited by test length. For short tests, for example a test with 10 items or fewer, one can directly use the true asymptotic distribution rather than its approximation to evaluate whether or not the test data fit the hypothetical model, because computing the true asymptotic distribution requires evaluating at most 2^10 = 1024 possible response patterns, no matter how large the sample size is. However, a sample of at least 1000 examinees is advised to achieve a better approximation.
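The remark that a 10-item test involves only 1024 response patterns, regardless of sample size, can be made concrete with a short sketch. The item parameters, the equally spaced quadrature grid on [-4, 4], and the logistic metric without a 1.7 scaling constant are all assumptions made for illustration; they are not taken from the simulation design.

```python
import itertools
import math


def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric, no 1.7 scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def quadrature(num_points=41, lo=-4.0, hi=4.0):
    """Equally spaced ability points with normalized standard-normal weights."""
    step = (hi - lo) / (num_points - 1)
    points = [lo + k * step for k in range(num_points)]
    dens = [math.exp(-0.5 * t * t) for t in points]
    total = sum(dens)
    return points, [d / total for d in dens]


def pattern_probability(pattern, items, points, weights):
    """Marginal probability of one response pattern under local independence."""
    prob = 0.0
    for theta, w in zip(points, weights):
        like = 1.0
        for u, (a, b, c) in zip(pattern, items):
            p = irf_3pl(theta, a, b, c)
            like *= p if u == 1 else 1.0 - p
        prob += w * like
    return prob


# Hypothetical 10-item test; these (a, b, c) values are illustrative only.
items = [(0.8 + 0.1 * j, -1.5 + 0.3 * j, 0.2) for j in range(10)]
points, weights = quadrature()
patterns = list(itertools.product((0, 1), repeat=len(items)))
probs = [pattern_probability(u, items, points, weights) for u in patterns]
print(len(patterns), "patterns; total probability =", round(sum(probs), 6))
```

The number of patterns grows as 2 to the power of the test length, which is why exhaustive enumeration is practical only for short tests.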
It can be seen from Figure 1 that the approximation looks somewhat dispersed when the sample size is 500 but improves when the sample size increases to 1000.

This approach has a strong theoretical basis, because its fundamental concept is “pseudocounts,” that is, the posterior distribution of ability rather than ability estimates, which is believed to provide better information for assessing item fit. One direct theoretical advantage of using “pseudocounts” rather than “ability estimates” to evaluate item fit is that the approach avoids the additional sources of error that are confounded with ability estimation in item fit analysis, particularly for short tests. For a short test, the ability scale might not be well defined, so the large errors in the ability estimates and the classification errors from grouping examinees make the results of the χ² item fit test questionable, as in the real data applications in Chapter 4. This is not a problem for the approach based on pseudocounts, because observed counts based on ability estimates are not required for the analysis. The other advantages and limitations are summarized below.

First of all, the approach has reasonable type I error rates (Tables 3.2, 3.3, 3.4, 3.18, and 3.20). In Tables 3.2, 3.3, and 3.4, when the item parameters are known constants, the type I error rates range from 0 to .05, with most items having type I error rates around .02, which is acceptable. However, when the item parameters are data-based estimates, almost all items have conservative type I error rates, regardless of the sample size and of how good the item parameter estimates are. The overly conservative type I error rates under estimated item parameters result from underestimates of the item fit statistic Q_DM, which also lead to underestimates of the corresponding asymptotic probabilities.
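To make the notion of pseudocounts concrete, the following minimal sketch accumulates each examinee's posterior over quadrature points instead of assigning the examinee to a single ability group. It follows the expected-count form familiar from the E-step of Bock and Aitkin (1981); treating that as the exact definition used in Chapter 2 is an assumption, and the items and responses are hypothetical.

```python
import math


def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric, no 1.7 scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def quadrature(num_points=41, lo=-4.0, hi=4.0):
    """Equally spaced ability points with normalized standard-normal weights."""
    step = (hi - lo) / (num_points - 1)
    points = [lo + k * step for k in range(num_points)]
    dens = [math.exp(-0.5 * t * t) for t in points]
    total = sum(dens)
    return points, [d / total for d in dens]


def posterior(pattern, items, points, weights):
    """Posterior probability of each quadrature point given one response pattern."""
    joint = []
    for theta, w in zip(points, weights):
        like = 1.0
        for u, (a, b, c) in zip(pattern, items):
            p = irf_3pl(theta, a, b, c)
            like *= p if u == 1 else 1.0 - p
        joint.append(w * like)
    total = sum(joint)
    return [x / total for x in joint]


def pseudocounts(responses, items, item_index, points, weights):
    """Expected examinees (n) and expected correct responses (r) at each point."""
    n = [0.0] * len(points)
    r = [0.0] * len(points)
    for pattern in responses:
        post = posterior(pattern, items, points, weights)
        for k, p_k in enumerate(post):
            n[k] += p_k
            if pattern[item_index] == 1:
                r[k] += p_k
    return n, r


# Hypothetical 3-item test and five response patterns (illustration only).
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.7, 0.2)]
responses = [(1, 1, 0), (0, 0, 0), (1, 1, 1), (1, 0, 0), (0, 1, 1)]
points, weights = quadrature()
n, r = pseudocounts(responses, items, item_index=0, points=points, weights=weights)
print(round(sum(n), 6), round(sum(r), 6))
```

Because every examinee contributes a full posterior rather than one point estimate, the counts at each quadrature point are fractional, and no grouping of examinees by estimated ability is needed.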
In Chapter 2, it is shown that the asymptotic distribution can be expressed as a linear combination of independent χ² variables. For each item, the coefficients of the linear combination computed from the item parameter estimates are very close to those computed from the true item parameters (see Tables 3.5 through 3.11). One interpretation is that the conservative type I error rates for data-based item parameter estimates are attributable to estimation errors (e.g., errors in estimating the covariance matrix, the eigenvalues, and the item parameters), which result in underestimates of the item fit statistic (Q_DM) and its asymptotic probability. Since Tables 3.5 through 3.11 suggest that the eigenvalues are well estimated, the conservative type I error rates could result from underestimation of the item fit statistic due to errors in estimating the item parameters. Note that extending the item fit results to the context of estimated item parameters relies on the availability of consistent item parameter estimates. Although the RMSEs of the item parameter estimates for a sample size of 5000 are much smaller than those for sample sizes of 500 and 1000 (see Tables 3.2 through 3.4), the item parameter estimates still contain a large amount of error for each item. If the item parameters contained no estimation error, one could expect type I error rates similar to those obtained when the item parameters are known constants. It is possible that item parameters poorly recovered from the observed data cause the poor item fit results in the simulation studies. Therefore, it is necessary to discern whether poor item fit arises because the test data really fit the item models inadequately or because the item parameters are poorly estimated, possibly due to a bad estimation procedure.
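Because the reference distribution is a weighted sum of independent χ² variables rather than a single χ², its tail probability has no simple closed form. The sketch below estimates it by plain Monte Carlo, which is swapped in here for simplicity; numerical inversion methods such as the algorithm of Davies (1980) listed in the bibliography are alternatives. The coefficients and the observed statistic are hypothetical and are not taken from Tables 3.5 through 3.11.

```python
import random


def weighted_chi2_tail(q_obs, weights, draws=20000, seed=12345):
    """Monte Carlo estimate of P(sum_k w_k * Z_k**2 > q_obs) for iid N(0, 1) Z_k."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(draws):
        value = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if value > q_obs:
            exceed += 1
    return exceed / draws


# Hypothetical coefficients (eigenvalues) and observed statistic.
coefficients = [0.9, 0.6, 0.3, 0.1]
p_value = weighted_chi2_tail(q_obs=4.0, weights=coefficients)

# Sanity check: a single unit weight reduces to a chi-square with 1 df,
# whose upper 5% critical value is about 3.841.
check = weighted_chi2_tail(q_obs=3.841, weights=[1.0])
print(round(p_value, 3), round(check, 3))
```

The sanity check illustrates why an ordinary χ² reference would be misleading here: only when all coefficients equal 1 does the weighted sum reduce to a standard χ² variable.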
That is, although item fit analysis with data-based item parameter estimates does not rely on ability estimates, detecting item fit or misfit based on pseudocounts requires item parameter estimates, which inevitably confounds model-data fit issues with estimation issues. Poorly recovered item parameters lead to questionable model-data fit analysis. It is also true that inadequate model-data fit results in questionable item parameter estimates. Further research is still needed to investigate the effects of item parameter estimates on model-data fit analysis. For example, further effort is needed to examine what causes the underestimation of the item fit statistic and how to correct for the effects of item parameter estimation. One possible approach is found in Donoghue & Hombo (2003), who explicitly examine the effect of item parameter estimation and deepen the understanding of its effect on the distribution of the item fit measure.

Secondly, the approach has adequate power to detect item misfit in the simulation studies (Tables 3.12 through 3.17). When item parameters are true values, the power estimates for many items are around .9 even when the sample size is as small as 1000 (see Tables 3.12 to 3.14). Items 5 through 13 in Table 3.13 and items 8 and 10 in Table 3.14 show power below .9 that varies across items as their discriminating power (a parameters) gets closer to 1, which can be explained by the relations between their item response functions. In general, when item parameters are true values, the greater the separation between the IRFs of the true model and the hypothetical model, the easier it is to detect item misfit, and the higher the power that can be expected even for a small sample size (e.g., 500). For example, the 3PL model is more likely to be separated from the 2PL or the 1PL model because of the presence of the asymptote parameter.
However, the 2PL and the 1PL models can hardly be separated from each other, particularly when the discriminating power parameters are close to 1 and the 2PL model nearly reduces to the 1PL model; such misfit is difficult to detect from test data. Therefore, for detecting misfit of the 2PL or 1PL model, the power should be a function of the item discriminating power parameter, and the power curve over the a parameter can be expected to be “U” shaped, with the lowest power associated with a parameters close to 1.

The item response function can provide information for diagnosing the item fit testing process. In the simulation process, it is the item response function that determines the simulated dichotomous response data. Chapter 2 also shows how an IRF influences the pseudocounts (the sum of the posteriors over all possible response patterns for the remaining items in a test), how an IRF directly affects the theoretical expectation of the pseudocounts, and eventually how an IRF influences the item fit measure Q_DM and its corresponding asymptotic distribution. If two IRFs are very close to each other, one can expect that the two models will fit a data set equally well or will both fail to fit it. Thus the power may be low when the two IRFs are close, and a large sample size may be required to detect the misfit. For example, the 2PL model and the 1PL model have the same lower asymptote. When an IRF from a 2PL model is very similar to an IRF from a 1PL model (i.e., the a parameter of the 2PL model is close to 1) and the data fit the hypothetical 2PL model reasonably well, the data can also be expected to fit the hypothetical 1PL model well, and vice versa.
Looking at the true item parameters in Table 3.1, the discriminating power parameters (a) of items 5 through 13 are close to 1, in particular for item 8 and item 10, whose discriminating power parameters equal 1.107 and .92, respectively. If the asymptote parameter c is disregarded, the 2PL and the 1PL (treating all a values as 1) IRFs differ only slightly. Therefore, although the data sets are generated from the 2PL model, the power for rejecting the 1PL model should be low (see Table 3.13) because the two IRFs are too close to each other. Similarly, the power should also be low for item 8 and item 10 when data generated by the 1PL model are fitted with the 2PL model, as reported in Table 3.14.

Figure 5.1 compares the 3PL, 2PL, and 1PL IRFs for item 1, item 8, item 10, and item 15. It is apparent that the 3PL, 2PL, and 1PL IRFs for item 1 and item 15 are well separated, and thus these two items have higher power even for a sample size of 500. On the other hand, the 2PL and 1PL curves for item 8 and item 10 are too close to separate, as seen in the figure, and thus these items have lower power even for a sample size as large as 1000. However, their IRFs are well separated from that of the 3PL model, and these two items also show higher power for detecting misfit of the 3PL model.

Figure 5.1: Item Response Functions for the 3PL, 2PL, and 1PL Model (Item 1, 8, 10, 15) [x-axis: Ability Scale]

In short, for detecting item misfit, an IRF from a 3PL model can easily be separated from an IRF from the other models (e.g., the 2PL and the 1PL). This is why higher power is observed when fitting a 2PL or 1PL model to data generated by a 3PL model.
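The separation argument can be checked numerically with a small sketch that measures the largest vertical gap between two IRFs over the ability scale. The discrimination values 1.107 and .92 are those quoted for items 8 and 10; the difficulty b = 0, the asymptote c = .2, and the logistic metric without a 1.7 scaling constant are hypothetical choices made only for illustration.

```python
import math


def irf(theta, a=1.0, b=0.0, c=0.0):
    """Logistic IRF: c = 0 gives the 2PL; a = 1 with c = 0 gives the 1PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def max_gap(f, g, lo=-4.0, hi=4.0, steps=801):
    """Largest absolute difference between two IRFs on an ability grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(abs(f(t) - g(t)) for t in grid)


b, c = 0.0, 0.2  # hypothetical difficulty and lower asymptote
gaps = {}
for a in (1.107, 0.92):  # discrimination values quoted for items 8 and 10
    two_pl = lambda t, a=a: irf(t, a, b)
    one_pl = lambda t: irf(t, 1.0, b)
    three_pl = lambda t, a=a: irf(t, a, b, c)
    gaps[a] = (max_gap(two_pl, one_pl), max_gap(three_pl, two_pl))

for a, (gap_2pl_1pl, gap_3pl_2pl) in gaps.items():
    print(a, round(gap_2pl_1pl, 3), round(gap_3pl_2pl, 3))
```

Under these assumed values the 2PL and 1PL curves never differ by more than a few hundredths, while the 3PL curve departs from the 2PL curve by nearly the full asymptote at low ability, which matches the pattern of low and high power described above.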
However, when the test data are generated by a 2PL model, the hypothesis is that the data fit a 1PL model, and the two IRFs are not well separated, it is hard to expect adequate power unless a sufficiently large sample is available.

When item parameters are data-based estimates, the power for detecting misfit (of the 2PL or 1PL model) is .9 or greater for most items when the data are generated using the 3PL model and the sample size is large (1000), as seen in the third and fourth columns of Table 3.15. The three items with the lowest power (item 3, item 11, and item 12) have power around .7. However, when data are generated from the 2PL model, the test for fitting the data with the 1PL model shows very low power, as can be seen in Table 3.16, in particular for items 4 through 13, whose IRFs are close to that of the 1PL model.

Next, the method is robust with respect to the ability distribution and is insensitive to changes in the number of quadrature points. Although the type I error rates in Table 3.19 for a non-normal ability population (a Beta distribution in the example) show that the method is robust to the underlying ability distribution, overly conservative type I error rates are observed when the item parameters are poorly recovered, as can be seen from their root mean square errors. Here the problem of a non-normal ability population leads back to the discussion of the effects of item parameter estimates on item fit analysis. Just as a poorly recovered set of item parameters cannot yield a correct decision on whether or not the test data fit the hypothetical item models even in simulation studies, the results of an item fit analysis based on bad item parameter estimates may not support the conclusion that the test data fit the hypothetical model even though the data are generated by that model. That is, one can obtain unacceptably high type I error rates using a set of bad item parameter estimates.
The question is how bad the item parameter estimates can be before the results of the item fit analysis can no longer be used. The study of the non-normal ability population provides only a general sense, in terms of root mean square errors, of how bad item parameter estimates can affect the item fit analysis. In the RMSEs reported in Table 3.19 for the non-normal ability population across three sample sizes (500, 1000, and 5000), most of the RMSEs are greater than .5 for the discriminating power parameters, greater than .3 for the difficulty parameters, and greater than .03 for the asymptote parameters. More research is needed on how much item parameter estimation error the item fit analysis can tolerate.

As for the effects of the number of quadrature points on the results of the item fit analysis, Table 3.20 and Figures 3.5 through 3.8 show that the results based on Q = 21 and Q = 41 differ slightly, whereas the results based on Q = 41 and Q = 81 agree extremely well. Thus, it is advised to carry out the item fit analysis with 41 quadrature points to achieve both computational accuracy and efficiency.

Finally, although the method takes an asymptotic approach, it works extremely well even for a sample size of 1000, and the test of item fit is not sensitive to the number of examinees. When the sample size is as large as 5000, the χ² test for item fit tends to reject the hypothesis for most items, whereas the Q_DM test still provides useful information for diagnosing item fit, as is evident in Tables 3.2 through 3.4 and in the real data applications in Chapter 4. The high school science and math MEAP data include large samples of examinees, which makes it hard to diagnose whether or not the test data fit the hypothetical 3PL model using χ², as shown in Tables 4.1 and 4.2.
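The role of the number of quadrature points can be illustrated with a simple quantity: the marginal probability of a correct response to one 3PL item under a standard normal ability distribution, evaluated with Q = 21, 41, and 81. The item parameters, the grid on [-4, 4], and the normalized-density weights are assumptions; this quantity is much smoother than the item fit statistic itself, so the sketch shows only how Q enters the computation and does not reproduce Table 3.20.

```python
import math


def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric, no 1.7 scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def marginal_p_correct(a, b, c, num_points, lo=-4.0, hi=4.0):
    """Marginal P(correct) under N(0, 1) using num_points equally spaced points."""
    step = (hi - lo) / (num_points - 1)
    points = [lo + k * step for k in range(num_points)]
    dens = [math.exp(-0.5 * t * t) for t in points]
    total = sum(dens)
    return sum(d / total * irf_3pl(t, a, b, c) for t, d in zip(points, dens))


# Hypothetical item parameters (a, b, c), chosen for illustration only.
results = {q: marginal_p_correct(1.2, 0.5, 0.2, q) for q in (21, 41, 81)}
for q, value in results.items():
    print(q, round(value, 6))
```

Doubling Q doubles the work for every response pattern, so the practical question raised in the text is where further refinement of the grid stops changing the item fit results.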
Additional evidence from the plots of the hypothetical IRF against the empirical IRF for each item in the science test (Figures 4.1 through 4.5) shows that the results of the Q_DM analysis provide reliable information, which agrees with the results obtained from the graphic approach. One can conclude from the real data applications that the Q_DM item fit test is able to provide more helpful information for assessing model-data fit.

In short, the reformulation of Q_DM does not appear to yield correct type I error rates (they are conservative) when item parameters are data-based estimates. However, the reformulation does provide a convenient theoretical framework for studying item fit based on pseudocounts.

Bibliography

[1] Birch, M. W. (1964). A new proof of the Pearson-Fisher theorem. Annals of Mathematical Statistics, 35, 818-824.

[2] Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate analysis. Cambridge, MA: The MIT Press.

[3] Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

[4] Bock, R. D., and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46, 443-449.

[5] Bock, R. D., and Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197.

[6] Davies, R. B. (1980). Algorithm AS 155: Distribution of a linear combination of non-central chi-squared random variables. Applied Statistics, 29, 323-333.

[7] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (2003b). Some asymptotic results on the distribution of an IRT measure of item fit. Psychometrika (conditionally accepted).

[8] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (2003a, April). A corrected asymptotic distribution of an IRT fit measure that accounts for the effects of item parameter estimation. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

[9] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (2001b, June). The behavior of an IRT measure of item fit in the presence of the item parameter estimation. Paper presented at the Annual Meeting of the Psychometric Society, Valley Forge, PA.

[10] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (2001a, April). The distribution of an item-fit measure for polytomous items. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Seattle, WA.

[11] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (1999, June). Some asymptotic results on the distribution of an IRT measure of item fit. Paper presented at the Annual Meeting of the Psychometric Society, Valley Forge, PA.

[12] Donoghue, J. R., and Isham, S. P. (1998). A comparison of procedures to detect item parameter drift. Applied Psychological Measurement, 22, 33-51.

[13] Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM).

[14] Glas, C. A. W., and Meijer, R. R. (2003). A Bayesian approach to person fit analysis in item response theory models. Applied Psychological Measurement, 27(3), 217-233.

[15] Glas, C. A. W., and Suarez-Falcon, J. C. (2003). A comparison of item fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87-106.

[16] Hambleton, R. K., and Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer Academic Publishers.

[17] Hoijtink, H. (2001). Conditional independence and differential item functioning in the two-parameter logistic model. In A. Boomsma, M. A. J. van Duijn, and T. A. B. Snijders (Eds.), Essays on item response theory (pp. 109-130). New York: Springer.

[18] Hombo, C. M. [McClellan, C. A.], and Donoghue, J. R. (2001, July). A power study of an IRT measure of item fit. Paper presented at the Annual Meeting of the Psychometric Society, King of Prussia, PA.

[19] Hombo, C. M. [McClellan, C. A.], and Donoghue, J. R. (2000, July). Some properties of the distribution of an IRT measure of item fit. Paper presented at the 2000 Annual Meeting of the Psychometric Society, Vancouver, British Columbia.

[20] Hombo, C. M. [McClellan, C. A.], and Donoghue, J. R. (1999, June). A simulation study of distribution of an IRT measure of item fit. Paper presented at the Annual Meeting of the Psychometric Society, Lawrence, KS.

[21] Hombo, C. M. [McClellan, C. A.], Donoghue, J. R., and Oranje, A. H. (2003, April). Evaluating item fit in 2002 NAEP writing data. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

[22] Hsu, Y. (2000). On the Bock-Aitkin procedure - from an EM perspective. Psychometrika, 65, 547-549.

[23] Johnson, N. L., and Kotz, S. (1970). Continuous univariate distributions - 2. Boston: Houghton Mifflin.

[24] Li, D., Donoghue, J. R., and McClellan, C. A. (2005). Approximate the asymptotic distribution of an IRT measure for item fit based on observed interrelations among pseudocounts. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Montreal, Canada.

[25] Linn, R. L. (1993). Educational Measurement, 3rd edition. The Oryx Press.

[26] McKinley, R. L., and Mills, C. N. (1985). A comparison of several goodness-of-fit statistics. Applied Psychological Measurement, 9, 49-57.

[27] Orlando, M., and Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50-64.

[28] Orlando, M., and Thissen, D. (2003). Further investigation of the performance of S-χ²: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289-298.

[29] Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-36.

[30] Reise, S. P. (1990). A comparison of item- and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14, 127-137.

[31] Rogers, H. J., and Hattie, J. A. (1987). A Monte Carlo investigation of several person and item fit statistics. Applied Psychological Measurement, 11, 47-57.

[32] Sinharay, S. (2005). Bayesian item fit analysis for unidimensional item response theory models. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Montreal, Canada.

[33] Sinharay, S., and Johnson, M. S. (2003). Simulation studies applying posterior predictive model checking for assessing fit of common item response theory models (ETS RR-03-33). Princeton, NJ: ETS. Available from http://www.ets.org/research/newpubs.html.

[34] Stone, C. A. (2000). Monte Carlo based null distribution for an alternative goodness-of-fit statistic in IRT models. Journal of Educational Measurement, 37, 58-75.

[35] Stone, C. A., Ankenmann, R. D., Lane, S., and Liu, M. (1993, April). Scaling QUASAR's performance assessments. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA.

[36] Stone, C. A., and Hansen, M. A. (2000, April). Using resampling methods to evaluate the significance of a goodness-of-fit statistics in item response theory model. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.

[37] Stone, C. A., Mislevy, R. J., and Mazzeo, J. (1994, April). Misclassification error and goodness-of-fit in IRT models. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA.

[38] Stone, C. A., and Zhang, B. (2002). Comparing three approaches for assessing goodness-of-fit of IRT models. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA.

[39] Yen, W. M. (1981). Using simulation to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.