This is to certify that the
dissertation entitled

THE ASYMPTOTIC DISTRIBUTION OF AN IRT MEASURE FOR
ITEM FIT BASED ON PSEUDOCOUNTS

presented by

Deping Li

has been accepted towards fulfillment
of the requirements for the

Ph.D. degree in Education

 

 

THE ASYMPTOTIC DISTRIBUTION OF AN IRT MEASURE FOR
ITEM FIT BASED ON PSEUDOCOUNTS

By

Deping Li

A DISSERTATION
Submitted to
Michigan State University

in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Department of Counselling, Educational Psychology and Special Education

2005

ABSTRACT
THE ASYMPTOTIC DISTRIBUTION OF AN IRT MEASURE FOR
ITEM FIT BASED ON PSEUDOCOUNTS
By

Deping Li

The item fit measure $Q_{DM}$ is based on the posterior distribution of proficiency (the pseudocounts) rather than on proficiency estimates. The reference distribution of $Q_{DM}$ is not $\chi^2$ but a quadratic function of normal variates. A consistent estimator of the covariance matrix of the pseudocounts is found and used to approximate the true asymptotic distribution of $Q_{DM}$. The data-based estimate of the covariance matrix depicts the interrelations among pseudocounts and shows reasonably good agreement with the true covariance matrix among pseudocounts for sample sizes as large as 1000. Results from simulation studies show that the method based on pseudocounts has adequate power for detecting item misfit and low Type I error rates. The method is robust to the underlying ability distribution and the number of quadrature points. Real data applications suggest that, even when the sample size is large, the method provides more helpful information for assessing model-data fit than the $\chi^2$ test.

Copyright by
DEPING LI
2005

ACKNOWLEDGEMENTS

I am indebted to many people for criticism, suggestions, reviews, and constructive
conversations. I wish to express my sincere thanks to the committee: Dr. Mark
Reckase (chair), Dr. Kimberly Maier, Dr. Lijian Yang, and Dr. John Donoghue.
Each contributed tremendously to the work by sharing their extensive professional
knowledge and ideas.

I am especially grateful to Dr. Donoghue and Dr. Catherine McClellan for their continued advice, criticism, and wisdom, from the summer research experience through the completion of this work. I would like to thank Educational Testing Service for its financial support, through both the summer intern research and the fellowship offered for this research.

I would also like to thank the Center for Educational Performance and Information and Dr. Oren Christmas, whose assistance enabled the completion of my doctoral study. Thanks are also due to Hongwen One for her insightful critiques and helpful comments.

The encouragement of my wife, Yanlin Jiang, and her support in all aspects did much to reduce the burden of the work involved.

Contents

List of Tables
List of Figures

1 Introduction to IRT Measures of Item Fit
1.1 Item Fit in General Context of Assessing the Fit of the IRT Models
1.2 Item Fit Analysis Based on Ability Estimates
1.3 Item Fit Analysis Based on Raw Scores
1.4 Item Fit Analysis Based on Pseudocounts
1.5 Approximation by Observed Covariance Among Pseudocounts
1.6 Reformulating the Item Fit Measure $Q_{DM}$

2 Item Fit Analysis Based on Pseudocounts
2.1 Definitions and Notations
2.2 Asymptotic Distributions of Pseudocounts
2.3 The Asymptotic Distribution of the Item Fit Measure $Q_{DM}$
2.3.1 Reformulated $Q_{DM}$ and Its Asymptotic Distribution
2.3.2 Asymptotic Distribution of Q
2.4 The Observed Covariance Matrix of Interrelations among Pseudocounts
2.5 Estimation of the Asymptotic Distribution for $Q_{DM}$

3 Simulation Studies on Item Fit
3.1 Type I Error Rates
3.2 Coefficients for the Asymptotic Distributions
3.3 Item Misfit and Power with Known Item Parameters
3.4 Item Misfit and Power with Item Parameter Estimates
3.5 True Asymptotic Distribution Versus the Approximation
3.6 Sensitivity Analysis
3.6.1 Non-normal Proficiency Populations
3.6.2 The Number of Quadrature Points and Item Fit
3.7 Computing Time and Programs

4 Real Data Applications
4.1 Assumptions
4.2 Two Approaches on Item Fit Analysis for Real Data
4.3 Graphic Approach

5 Concluding Remarks and Future Research Directions

BIBLIOGRAPHY

List of Tables

3.1 True Item Parameters for the Test of 15 Items
3.2 Type I Error Rate for Sample Size 500
3.3 Type I Error Rate for Sample Size 1000
3.4 Type I Error Rate for Sample Size 5000
3.5 The 20 Positive Eigenvalues from True Covariance Matrix
3.6 20 Eigenvalues for True Item Parameters (N = 500)
3.7 20 Eigenvalues for True Item Parameters (N = 1000)
3.8 20 Eigenvalues for True Item Parameters (N = 5000)
3.9 20 Eigenvalues for Item Parameter Estimates (N = 500)
3.10 20 Eigenvalues for Item Parameter Estimates (N = 1000)
3.11 20 Eigenvalues for Item Parameter Estimates (N = 5000)
3.12 The Power for Test Data Generated by 3PL Model with True Item Parameters
3.13 The Power for Test Data Generated by 2PL Model with True Item Parameters
3.14 The Power for Test Data Generated by 1PL Model with True Item Parameters
3.15 The Power for Test Data Generated by 3PL Model with Item Parameter Estimates
3.16 The Power for Test Data Generated by 2PL Model with Item Parameter Estimates
3.17 The Power for Test Data Generated by 1PL Model with Item Parameter Estimates
3.18 Type I Error Rates for Non-normal Ability Population and Data-Based Item Parameter Estimates
3.19 RMSE for Non-normal Ability Population
3.20 Type I Error Rates for Three Numbers of Quadrature Point
4.1 MEAP 2000 Fall High School Science Test Items with the 3PL Model (N = 7088)
4.2 MEAP 2000 Fall High School Mathematics (N = 6857)
4.3 MEAP 2000 Fall High School Science Items (N = 7088)
4.4 MEAP 2000 Fall High School Mathematics Items (N = 6857)

List of Figures

3.1 True Asymptotic Probabilities Versus Approximation (N = 500)
3.2 True Asymptotic Probabilities Versus Approximation (N = 1000)
3.3 True Asymptotic Probabilities Versus Approximation (N = 5000)
3.4 Beta Distribution versus Standard Normal Distribution
3.5 Item Fit Statistics $Q_{DM}$ and Number of Quadrature Points
3.6 Asymptotic Probabilities and Number of Quadrature Points
3.7 Item Fit Statistics $Q_{DM}$ and Number of Quadrature Points
3.8 Asymptotic Probabilities and Number of Quadrature Points
4.1 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (1-4)
4.2 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (5-8)
4.3 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (9-12)
4.4 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (13-16)
4.5 Empirical versus Hypothetical Item Response Functions for MEAP 2000 High School Science Items (17-19)
5.1 Item Response Functions for the 3PL, 2PL, and 1PL Model (Item 1, 8, 10, 15)

Chapter 1

Introduction to IRT Measures of
Item Fit

1.1 Item Fit in General Context of Assessing the
Fit of the IRT models

Item response theory (IRT) is becoming one of the most important tools for educational and psychological tests, in both test design and test data analysis. IRT provides a philosophical framework for test design and many other applications (e.g., differential item functioning, test equating, and computerized adaptive testing). The advantages of IRT may not be fully realized, however, if the test data do not adequately fit the item response models. Assessing model-data fit is fundamental in psychometrics and has always been an issue of enormous interest. Model-data fit should be a primary concern when applying IRT models to test data. However, there is no consensus on the diagnostic tools for model-data fit.

There are other aspects of model-data fit, such as person fit analysis and the analysis of other types of misfit, including violations of local independence and unidimensionality (Hambleton & Swaminathan, 1985, pp. 151-195; Embretson & Reise, 2000, pp. 238-246; Glas & Meijer, 2003; Hoijtink, 2001; Sinharay & Johnson, 2003), but this research is limited to item fit only. In IRT, there is no need to fit a set of data with the same model for all items, because a test can be a combination of different types of items (e.g., dichotomous, polytomous, or constructed-response items). Even items with the same type of responses may be represented by different mathematical models, and separate IRT models may be used for adequate fit. Therefore, attention should be paid to the fit of the IRT model on an item-by-item basis.

Item ﬁt analysis should also play an important role in decisions about the reten-
tion of items in the assessment pool. Poorly ﬁtting items undermine the validity of
decisions based on measurement results. In this chapter, various measures of item ﬁt
and the corresponding statistical approaches for testing goodness-of-ﬁt at the item
level will be reviewed.

Generally speaking, there are two basic approaches to assessing item fit: graphical (or heuristic) procedures and statistical tests. Graphical procedures are intuitive but more subjective in deciding the adequacy of model-data fit. Statistical tests of goodness of fit (e.g., $\chi^2$ or likelihood ratio tests) are probably the most widely used in current operational research.

In graphical procedures, the adequacy of item fit is typically evaluated on the basis of a comparison between an empirical item response function and a hypothetical item response function. The empirical function is obtained from the sample of test data. Detailed descriptions of graphical procedures can be found in most of the IRT literature dealing with model-data fit (e.g., Hambleton & Swaminathan, 1985, p. 234; Embretson & Reise, 2000). Plots of the empirical and hypothetical item response functions can reveal areas along the proficiency continuum where the two functions differ. The discrepancies indicate the degree of item misfit.

1.2 Item Fit Analysis Based on Ability Estimates

Much research on the analysis of item fit has been conducted via significance tests. This section reviews Wright and Panchapakesan's (1969) $\chi^2$ test, Bock's (1972) $\chi^2$ test, the likelihood ratio test, and the standardized residuals test.

The procedure advocated by Wright and Panchapakesan (1969) is a commonly used statistical test. The procedure defines a standardized variable

\[ y_{ij} = \frac{f_{ij} - E(f_{ij})}{\sigma(f_{ij})}, \]

where $f_{ij}$ represents the frequency of examinees at the $i$th ability level answering the $j$th item correctly. Then the measure of item fit is $\chi^2 = \sum_{i=1}^{G} y_{ij}^2$. Wright and his colleagues assume this measure to have a chi-square distribution.

The Bock (1972) chi-square index is defined as

\[ \chi^2_{Bock} = \sum_{g=1}^{G} \frac{N_g (O_{ig} - E_{ig})^2}{E_{ig}(1 - E_{ig})}, \]

where $O_{ig}$ is the observed proportion correct on item $i$ for interval group $g$, $E_{ig}$ is the expected proportion correct based on the hypothetical item response function at the within-interval median proficiency estimate, and $N_g$ is the number of examinees whose ability estimates fall within proficiency interval $g$, obtained by classifying the proficiency estimates. This index is assumed to be asymptotically distributed as a $\chi^2$ variable with degrees of freedom equal to $G - m$, where $m$ is the number of item parameters estimated. High values of the item fit index indicate that the data may not fit the hypothetical model reasonably well on the item.
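
As a hedged illustration of the index just defined, the following Python sketch evaluates the Bock-type sum for a single item. The function name `bock_chi_square` and the group sizes and proportions are hypothetical values invented for the example; they do not come from any data set in this study.

```python
def bock_chi_square(n_g, obs_prop, exp_prop):
    """Sum over groups g of N_g (O_g - E_g)^2 / (E_g (1 - E_g)) for one item."""
    total = 0.0
    for n, o, e in zip(n_g, obs_prop, exp_prop):
        total += n * (o - e) ** 2 / (e * (1.0 - e))
    return total

# Hypothetical data: G = 4 proficiency groups of 100 examinees each.
n_g = [100, 100, 100, 100]
obs = [0.30, 0.50, 0.70, 0.90]   # observed proportions correct (made up)
exp = [0.25, 0.50, 0.75, 0.90]   # model-predicted proportions (made up)

chi2 = bock_chi_square(n_g, obs, exp)
print(round(chi2, 4))
```

With $G = 4$ groups and, say, $m = 2$ estimated item parameters, the text above would refer this value to a $\chi^2$ distribution with $G - m = 2$ degrees of freedom.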

The Wright and Mead (1977) statistic is based on a number-correct grouping approach for the Rasch model. The statistic is given by

\[ \chi^2_{WM} = \sum_{g=1}^{G} \frac{N_g (O_{ig} - E_{ig})^2}{E_{ig}(1 - E_{ig}) - S^2_{ig}}, \]

where $S^2_{ig} = N_g^{-1} \sum_{k \in g} (P_i(\theta_k) - E_{ig})^2$ and $P_i(\theta_k)$ is the proportion correctly answering item $i$ in score group $k$. The degrees of freedom are $G$, the number of intervals for the proficiency estimates, minus the number of parameters estimated.

Yen's (1981) $Q_1$ statistic uses the mean proficiency within each proficiency category to obtain the predicted item response function. Furthermore, Yen fixes the number of proficiency categories at 10 in calculating the $Q_1$ index, which is assumed to be approximately distributed as $\chi^2$ with degrees of freedom equal to the number of categories minus the number of parameters.

The likelihood ratio $G^2$ is implemented in BILOG-3 (Mislevy and Bock, 1990) and BILOG-MG (Zimowski, Muraki, Mislevy, and Bock, 1996). $G^2$ is computed by comparing the observed frequencies with those predicted from the hypothetical model:

\[ G^2_{BILOG} = 2 \sum_{i=1}^{G} \left[ R_i \log \frac{R_i}{N_i P(\bar{\theta}_i)} + (N_i - R_i) \log \frac{N_i - R_i}{N_i (1 - P(\bar{\theta}_i))} \right]. \]

This test of item fit was designed for long tests (e.g., more than 20 items). In this test, an EAP estimate of proficiency is computed for each examinee based on the item parameter estimates, and the examinee is then assigned to a proficiency interval. The summation is performed over the $G$ groups on the ability scale $\theta$, $R_i$ is the number of correct responses within group $i$, $N_i$ is the number of examinees in group $i$, and $P(\bar{\theta}_i)$ is the model-predicted probability of a correct response at the mean proficiency of group $i$. This $G^2$ is also assumed to be distributed as $\chi^2$ with degrees of freedom equal to the number of proficiency groups.

Standardized residuals are used to assess item fit in the Rasch model context (e.g., Masters & Wright, 1996). In this procedure, the expected response $E(X_{si})$ for a particular person $s$ responding to item $i$ is $E(X_{si}) = \sum_{k} k P_{ik}(\theta_s)$, where the sum runs over the score categories $k$ of item $i$. The variance of $X_{si}$ is $Var(X_{si}) = \sum_{k} (k - E(X_{si}))^2 P_{ik}(\theta_s)$. Let $Z_{si}$ denote the standardized residual; then

\[ Z_{si} = \frac{X_{si} - E(X_{si})}{\sqrt{Var(X_{si})}}. \]

A mean square fit statistic, i.e., $\sum_{i=1}^{n} Z_{si}^2 / n$, can then be computed as an item fit measure. The summation is performed over the n items in the test.

The above measures of item fit and the corresponding statistical tests are open to criticism. The most common criticism is that these measures and their significance tests often require parameter estimation (i.e., item and ability estimates) and are often viewed as inconclusive evidence of adequate fit. The most commonly used measures of item fit (e.g., Bock, 1972; Yen, 1981) use model-based estimates (e.g., the maximum likelihood estimate (MLE) or the expected a posteriori (EAP) estimate) of the latent proficiency of examinees. In computing these fit measures, the proficiency estimates are generally treated as point estimates containing no error, an obviously false assumption. That is, even if the model fits the data perfectly, the proficiency estimate for an individual is hardly ever equal to the true value because of estimation error. This problem is especially pronounced for short tests, where proficiency estimates have larger error. In addition, the proficiency estimates are then grouped into intervals that serve as the basis of a contingency table measure of fit. Because of the uncertainty in proficiency estimation, the proficiency estimates are subject to errors of classification, which makes the use of the chi-square reference distribution questionable. Several studies (e.g., Reise, 1980; Rogers and Hattie, 1987; McKinley and Mills, 1985) have indicated that the sampling distributions of these measures are not $\chi^2$. Moreover, in some contexts, researchers point out that the $\chi^2$ statistic for a single item is insensitive to certain types of misfit (e.g., Van den Wollenberg, 1982; Drasgow et al., 1995).

1.3 Item Fit Analysis Based on Raw Scores

Because of the shortcomings of measures based on point estimates of ability, alternative measures have been developed. In the past 10 years, two main approaches have been put forth. The first approach was suggested by Orlando and Thissen (2000, 2003). Their approach computes IRT-based expected values for each level of the total test score (the raw score or number-correct score). They then use the observed frequencies for the total scores to compute a fit measure (likelihood ratio $G^2$ or Pearson $\chi^2$). The item fit statistics for item $i$ suggested by Orlando and Thissen (2000) are of the form

\[ S\text{-}X^2_i = \sum_{k=1}^{I-1} N_k \frac{(p_{ik} - E_{ik})^2}{E_{ik}(1 - E_{ik})} \]

and

\[ S\text{-}G^2_i = 2 \sum_{k=1}^{I-1} N_k \left[ p_{ik} \log \frac{p_{ik}}{E_{ik}} + (1 - p_{ik}) \log \frac{1 - p_{ik}}{1 - E_{ik}} \right], \]

with $k$ standing for the raw score category, $k = 0, 1, 2, \ldots, I$, $N_k$ for the number of examinees with score $k$, and $p_{ik}$ and $E_{ik}$ respectively representing the observed and expected proportions correct for item $i$ in raw score group $k$. Orlando and Thissen then compare the statistics to a chi-square distribution (the two statistics are assumed to have asymptotic $\chi^2(I - 4)$ distributions under the null hypothesis that the fitted model is true). Unfortunately, the statistics are not distributed exactly as chi-square when item parameters are estimated by MMLE (Donoghue, McClellan, and Oranje, 2004; Sinharay, 2005). The departure from $\chi^2$ appears to be relatively small, a result supported by several simulation studies (e.g., Orlando and Thissen, 2000; Stone and Zhang, 2002); however, the departure of the distributions of $S\text{-}X^2$ and $S\text{-}G^2$ from the reference $\chi^2(I - 4)$ distribution may be severe for a short test.

Glas and Suarez-Falcon (2003) suggest an item fit statistic that is based on the Lagrange multiplier test (or the equivalent efficient score test) and uses number-correct scores to form examinee groups. For item $i$, the statistic is used to test the null hypothesis $H_0$ (e.g., the 3PL model is correct) against an alternative hypothesis in which the model is defined as

\[ p(u_i \mid \theta, a_i, b_i, c_i, \delta_s, s) = c_i + (1 - c_i) \frac{1}{1 + e^{-a_i(\theta - b_i - \delta_s)}}, \]

where $s$ indicates the raw score group an examinee belongs to, $a_i$, $b_i$, $c_i$ are the parameters for item $i$, and $\delta_s$ adjusts the item difficulty $b_i$ for score group $s$. The test statistic, which is defined as $h_i' \Sigma_i^{-1} h_i$, has an asymptotic $\chi^2(S_i - 1)$ distribution. In computing the test statistic, $h_i$ is a vector of differences between the observed proportion correct and its posterior expectation for a raw score group, computed based on MMLE, and $\Sigma_i$ is the estimated covariance matrix of $h_i$. Even though this test statistic appears to have a strong theoretical basis, Glas and Suarez-Falcon (2003, p. 97) found that the overall characteristics of their test statistic are worse than those of $S\text{-}X^2$ and $S\text{-}G^2$. Researchers (e.g., Sinharay, 2005) point out that assessing item fit using number-correct scores to group examinees is not entirely satisfactory and that there is substantial scope for further research in this area.

Recently, Sinharay (2005), from a Bayesian perspective, suggested using the $X^2$-type and $G^2$-type test statistics of Orlando and Thissen (2000) as summary measures of discrepancy, but with posterior predictive distributions as the reference distributions. The resulting Bayesian p-values provide probability statements about the fit of the data to the model for the items. This method also has a strong theoretical basis. However, posterior predictive model checking depends heavily on resampling through the MCMC algorithm and hence is computationally intensive.

1.4 Item Fit Analysis Based on Pseudocounts

The second approach, a fit measure called $Q_{DM}$, was proposed by Donoghue and McClellan (e.g., 2004, 2003b, 2003a, 2001b, 2001a, 1999). In this approach, the asymptotic distribution of an alternative IRT measure of item fit, referred to as $Q_{DM}$, is derived and shown to be a quadratic form of normal variables. $Q_{DM}$ is based on pseudocounts, as opposed to counts of the examinees falling within a proficiency interval on the basis of proficiency estimates. The pseudocounts are a natural by-product of the MML-EM estimation (Bock and Lieberman, 1970; Bock and Aitkin, 1981) used by most IRT calibration programs. This measure has generated much study (e.g., Stone, 2000; Stone, Ankenmann, Lane, and Liu, 1993; Stone and Hansen, 2000; Stone, Mislevy, and Mazzeo, 1994; Stone and Zhang, 2002; Donoghue and Isham, 1998; Donoghue and Hombo, 1999, 2001ab, 2003ab; Hombo and Donoghue, 1999, 2000, 2001; Hombo, Donoghue, and Oranje, 2003). Simulation studies (Hombo and Donoghue, 1999, 2000) have found that the asymptotic distribution functioned extremely well, even with samples as small as 1000 examinees. Both Q-Q plots and Type I error rates indicated very good agreement between the asymptotic distribution and the observed values. Moreover, the measure has good power to detect misfit when it is present in items (Hombo and Donoghue, 2001).

The difference between the second approach and the first is that $Q_{DM}$ is based on the distribution of ability at each quadrature point. The term "pseudocount," used by Donoghue, McClellan, and Oranje (e.g., 2004), refers to the fact that real counts of the examinee proficiency estimates falling within an interval on the scale are not used. Rather, counts are estimated from the sum of posterior distributions. Pseudocounts are the basic building blocks of the item fit measure $Q_{DM}$. Pseudocounts of examinees at a given quadrature point are computed by summing the posterior expectations for score level $k$ of an $M$-category item and proficiency level $\theta_q$. Then $Q_{DM}$ is defined as

\[ Q_{DM} = \sum_{q=1}^{Q} \sum_{k=0}^{M-1} \frac{(O_{qk} - E_{qk})^2}{E_{qk}}. \qquad (1.1) \]

Here $O$ represents the observed response counts and $E$ represents the expected response counts. Assuming that the item parameters are known, $Q_{DM}$ has been shown to be asymptotically distributed as a quadratic form of normal variables (Donoghue and Hombo, 1999). This distribution can be represented as a weighted sum of independent $\chi^2_{(1)}$ variates (e.g., Johnson and Kotz, 1970): $Q_{DM} \sim \sum_{i=1}^{m} \lambda_i \chi^2_{(1)}$, where $\lambda_i$, $i = 1, 2, \ldots, m$, are the non-zero eigenvalues of the matrix $L'\Sigma L$, $L$ is a special matrix of dimension $2Q \times Q$ for dichotomous items ($Q$ is the number of quadrature points used in the computation), and $\Sigma$ is the covariance matrix of the pseudocounts (Donoghue, McClellan, and Oranje, 2004). A routine by Davies (1980) can be used to evaluate this probability.
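
The tail probability of such a weighted sum can be illustrated with a small simulation. The sketch below is not the Davies (1980) routine cited above; it swaps in plain Monte Carlo sampling, and the function names and the eigenvalues in `lams` are hypothetical. It relies only on the facts that a $\chi^2_{(1)}$ variate is a squared standard normal with mean 1 and variance 2.

```python
import math
import random

def weighted_chi2_moments(lams):
    """Mean and variance of sum_i lam_i * chi2_(1) for independent variates."""
    mean = sum(lams)
    var = 2.0 * sum(lam * lam for lam in lams)
    return mean, var

def weighted_chi2_tail(lams, x, n_draws=20000, seed=12345):
    """Monte Carlo estimate of P(sum_i lam_i * Z_i^2 > x), Z_i iid N(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        q = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lams)
        if q > x:
            hits += 1
    return hits / n_draws

# Hypothetical non-zero eigenvalues, standing in for those of L' Sigma L.
lams = [0.5, 0.3, 0.2]
mean, var = weighted_chi2_moments(lams)
p_value = weighted_chi2_tail(lams, 2.5)

# Sanity check: equal unit weights give a chi-square with 2 degrees of
# freedom, whose upper tail probability at x is exp(-x / 2).
p_check = weighted_chi2_tail([1.0, 1.0], 2.0)
print(mean, var, p_value, p_check, math.exp(-1.0))
```

A simulation of this kind is only accurate to a few decimal places; an exact numerical inversion method would be preferable for small tail probabilities.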

However, further work is needed to establish the utility of the result in practical testing situations. Hombo and Donoghue (1999, 2000) examined some possible limiting factors, including potentially prohibitive sample size requirements for achieving sampling distribution properties approaching those of the asymptotic distribution. A major limitation to practical application of the findings is the computational burden of computing the asymptotic distribution of $Q_{DM}$. The computation requires the evaluation of all possible item response patterns ($2^J$ for a test of $J$ dichotomous items, for example). For tests of short to moderate length (10 to 15 items), the number of patterns (1,024 to 32,768) is manageable. For tests of 20 items, the evaluation of slightly over one million response patterns per item begins to become burdensome.
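
The growth in the number of response patterns can be checked directly. The short Python sketch below is only an illustration; the helper names `n_patterns` and `enumerate_patterns` are invented for it.

```python
from itertools import product

def n_patterns(j):
    """Number of response patterns for j dichotomous items."""
    return 2 ** j

def enumerate_patterns(j):
    """All 0/1 response patterns for j dichotomous items, as tuples."""
    return list(product((0, 1), repeat=j))

for j in (10, 15, 20):
    print(j, n_patterns(j))   # 1024, 32768, 1048576

small = enumerate_patterns(3)  # 8 patterns, from (0, 0, 0) to (1, 1, 1)
```

Because each added item doubles the count, full enumeration is practical only for short tests.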

1.5 Approximation by Observed Covariance Among
Pseudocounts

The derivation of the asymptotic distribution of the item fit measure $Q_{DM}$ represents a major advance along this line of research. To avoid evaluating all possible response patterns when calculating the covariance matrix of the pseudocounts, and thus to make applications in operational research possible, Donoghue, McClellan, and Oranje (2004) propose a consistent estimator $S$ for the covariance matrix $\Sigma$; the true asymptotic distribution is then approximated using the observed matrix of interrelations among pseudocounts. To understand and construct the matrix $S$, consider the joint probability consisting of positive values $p(U = u_i, \theta_q)$ and 0 for $p(U \neq u_i, \theta_q)$ for dichotomous item $i$, given response $u_i$ and any quadrature point $\theta_q$, $q = 1, 2, \ldots, Q$, and $i = 1, 2, \ldots, J + 1$. Then $S$ can be seen as a simple covariance matrix with every examinee contributing to all of the $2Q$ quadrature points. The matrix $S$ is a consistent estimator of $\Sigma$. Therefore, a natural idea is to use the data-based estimator $L'SL$ in place of $L'\Sigma L$. Because the distribution of $Q_{DM}$ is an asymptotic result, for very large $N$ (approaching infinity) $S$ is arbitrarily close to $\Sigma$ and intuitively should yield the correct estimate of the distribution of $Q_{DM}$.
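
One possible reading of this construction can be sketched in Python. This is an assumption-laden illustration, not the estimator as implemented by Donoghue, McClellan, and Oranje (2004): each examinee is assumed to contribute a vector of length $2Q$ that holds that examinee's posterior over the quadrature points in the block for the observed response and zeros in the other block, and a plain sample covariance of these vectors is taken. The function names and the four posteriors are hypothetical.

```python
def contribution(u, posterior):
    """2Q vector: the posterior in the block for response u, zeros elsewhere.

    Block order assumed here: the Q cells for U = 1, then the Q cells for U = 0.
    """
    q_points = len(posterior)
    z = [0.0] * (2 * q_points)
    offset = 0 if u == 1 else q_points
    for q, p in enumerate(posterior):
        z[offset + q] = p
    return z

def sample_covariance(rows):
    """Sample covariance matrix (divisor N - 1) of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[c] for r in rows) / n for c in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        dev = [r[c] - mean[c] for c in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += dev[a] * dev[b] / (n - 1)
    return cov

# Hypothetical examinees: (response to the studied item, posterior over Q = 2 points).
examinees = [(1, [0.2, 0.8]), (1, [0.4, 0.6]), (0, [0.7, 0.3]), (0, [0.9, 0.1])]
z = [contribution(u, post) for u, post in examinees]
S_unit = sample_covariance(z)   # covariance of one examinee's contribution
```

Since the pseudocounts are sums of these contributions over independent examinees, their covariance would be $N$ times the per-examinee covariance under this reading. Each contribution vector sums to one, so every row of `S_unit` sums to zero and the matrix is singular.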

Indeed, the use of the observed matrix of interrelations among pseudocounts yields the hoped-for accuracy and computational simplicity, and the approximation of the distribution of $Q_{DM}$ based on this matrix opens up the possibility of an operationally feasible and theoretically defensible statistical test of item misfit. Results from Li, Donoghue, and McClellan (2005) demonstrate how accurate the approximation is relative to the asymptotic distribution across three different sample sizes. The simulation results show that the approximation works extremely well in many situations. The cumulative probabilities, means, and variances of the approximation are very close to the true values. These results can also be generalized to polytomous items, as in Donoghue and Hombo (2001a), when the item parameters are known constants.

However, the asymptotic distribution of $Q_{DM}$ was derived under the assumption that the item parameters are fixed and known. When the item parameters are data-based estimates, the theoretical results of Donoghue and Hombo (1999) do not hold. Several studies (Donoghue and Isham, 1998; Hombo and Donoghue, 1999; Donoghue and Hombo, 2001ab; Stone and Zhang, 2002) have repeatedly found that, when item parameters are data-based estimates, Type I error rates from $Q_{DM}$ are much too conservative, and that the distribution of the $Q_{DM}$ statistic is stochastically smaller than its asymptotic reference distribution. This study is an attempt to overcome the disadvantage of working with estimated item parameters by reformulating the measure of item fit based on pseudocounts.

1.6 Reformulating the Item Fit Measure $Q_{DM}$

The form of $Q_{DM}$ defined in (1.1) is a Pearson-type measure of goodness of fit. Donoghue and Hombo (e.g., 1999) suggest that the expectation of the pseudocounts can be found through a binomial approximation. That is, the expectation of the pseudocounts is the product of the total pseudocount and the hypothetical item response function at the given quadrature point (please refer to the first section of Chapter 2). The asymptotic distribution of $Q_{DM}$ can be shown through a Taylor expansion of the fit statistic. As sample size increases, the asymptotic distribution of the second-order Taylor expansion of $Q_{DM}$ converges to the true asymptotic distribution of $Q_{DM}$.

The idea of reformulating $Q_{DM}$ is simply to replace the expectation of the pseudocounts by its theoretical expectation under the null hypothesis. The reformulated version of the statistic $Q_{DM}$ allows researchers to derive the true asymptotic reference distribution for $Q_{DM}$ and to extend the results to data-based item parameter estimates.

Chapter 2

Item Fit Analysis Based on
Pseudocounts

The item fit measure $Q_{DM}$ by Donoghue and McClellan (e.g., 2004, 2003b, 2003a, 2001b, 2001a, 1999) is similar in form to a Pearson $\chi^2$. However, as noted before, the distribution of $Q_{DM}$ is not $\chi^2$ but a quadratic function of normal variates. This chapter first introduces the basic concept of pseudocounts, on which the measure of item fit (i.e., $Q_{DM}$) is based. Next, the reformulation of $Q_{DM}$ is discussed with the help of the fundamental concept of pseudocounts. Then the asymptotic distribution of the reformulated measure of item fit is derived in a different way. Finally, the observed interrelations among pseudocounts are examined to obtain a consistent estimator of the true covariance matrix among pseudocounts.

2.1 Deﬁnitions and Notations

Let $\theta_q$ be the discrete proficiency at quadrature point $q$, and let $w(\theta_q) = w_q$ be the density of $\theta$, i.e., $P(\theta = \theta_q) = w_q$. The prior $w$ will often be chosen to approximate a continuous distribution, such as $N(\mu, \sigma^2)$. Denote by $U$ a random variable representing the response to the dichotomously scored studied item. In the study of item-level model fit, test items are classified into two groups: the studied item (only one item) and the remaining items (containing $J$ items). Thus the total number of items in the test is $J + 1$.

Let $f_{q1} = f(\theta_q, 1)$ be the item response function for the studied item, i.e., $P(U = 1 \mid \theta = \theta_q)$. Let $N$ be the sample size or number of examinees, and let $t$ index patterns of responses to the remaining $J$ items $Y$ on a test. For dichotomous items, $t = 1, \ldots, T = 2^J$. Let $n_{tk}$ be the number of examinees who produced the response pattern $(U = k, Y = y_t)$. Suppose $\hat{\pi}$ is the vector of observed proportions for the sample response patterns $(U = k, Y = y_t)$. Then $\hat{\pi}_{tk} = n_{tk}/N$, and $l_{tq} = P(Y = y_t \mid \theta = \theta_q)$, where $k$ represents the category for the studied item (e.g., for the dichotomous case, $k = 0, 1$), and $l_{tq}$ is the likelihood of the remaining-item response pattern $(Y = y_t)$ at quadrature point $q$. Denote by $\pi_{tk}$ the model-based prediction of the probability of response pattern $(U = k, Y = y_t)$, or the marginal probability of $(U = k, Y = y_t)$. For the dichotomous case (i.e., $k = 0, 1$), it is easy to see that

\[ \pi_{t1} = P(U = 1, Y = y_t) = \sum_{q=1}^{Q} w_q f_{q1} l_{tq}. \]

Similarly, $\pi_{t0} = \sum_{q=1}^{Q} w_q (1 - f_{q1}) l_{tq}$. Let $p^k_{tq}$ be the posterior of $\theta$ at quadrature point $\theta = \theta_q$ given response pattern $(U = k, Y = y_t)$. Then,

\[ p^k_{tq} = P(\theta = \theta_q \mid U = k, Y = y_t) = \frac{w_q f_{qk} l_{tq}}{\pi_{tk}}, \]

where $f_{q0} = 1 - f_{q1}$.

In the dichotomous case, the posterior distribution for $\theta = \theta_q$ is $p^1_{tq} = \frac{w_q f_{q1} l_{tq}}{\pi_{t1}}$ given the response pattern $(U = 1, Y = y_t)$, and $p^0_{tq} = \frac{w_q (1 - f_{q1}) l_{tq}}{\pi_{t0}}$ given the response pattern $(U = 0, Y = y_t)$. The posterior distributions provide the best information about the distribution of examinees' proficiency levels. Thus, it is the posterior distribution of proficiency rather than the proficiency point estimates that is used for assessing model-data fit at the item level.

Define the pseudocount $s_{qk}$ of response $U = k$ for the studied item at quadrature point $\theta_q$ as the sum of the posteriors $P(\theta = \theta_q \mid U = k, Y = y_t)$ over all response patterns $t = 1, 2, \dots, T$, each weighted by its observed frequency. For example, the pseudocount of the correct response for the studied item at quadrature point $\theta_q$ is

$$s_{q1} = \sum_{t=1}^{T} n_{t1}\, p^1_{tq} = w_q f_{q1} \sum_{t=1}^{T} \frac{n_{t1}\, l_{tq}}{\pi_{t1}}.$$

Here $T$ is the number of all possible response patterns for the remaining items in the test. In a similar fashion, define $s_{q0} = w_q (1 - f_{q1}) \sum_{t=1}^{T} n_{t0}\, l_{tq} / \pi_{t0}$. Denote $\tilde{s}_q = s_{q1} + s_{q0}$; $\tilde{s}_q$ is the total pseudocount at quadrature point $\theta_q$, $q = 1, 2, \dots, Q$, where $Q$ is the total number of quadrature points (in this study, 41 points ranging from -4 to 4).
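The definitions above can be illustrated with a small numerical sketch. The item parameters, the three remaining items, the normal-density quadrature weights, and the logistic form without a 1.7 scaling constant are all assumptions made for illustration; they are not taken from the study.

```python
import itertools
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric without 1.7 scaling: an assumption)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Quadrature grid: Q = 41 equally spaced points on [-4, 4], normal-density weights.
Q = 41
theta = np.linspace(-4.0, 4.0, Q)
w = np.exp(-0.5 * theta**2)
w /= w.sum()

# Hypothetical parameters: one studied item plus J = 3 remaining items.
f1 = irf_3pl(theta, 1.2, 0.3, 0.15)                      # f_{q1}, shape (Q,)
remaining = [(0.8, -0.5, 0.10), (1.5, 0.0, 0.20), (1.0, 1.0, 0.05)]
p_rem = np.array([irf_3pl(theta, *par) for par in remaining])  # shape (J, Q)

# Likelihood l_{tq} of every pattern y_t of the remaining items, shape (T, Q).
patterns = np.array(list(itertools.product([0, 1], repeat=len(remaining))))
lik = np.array([np.prod(np.where(y[:, None] == 1, p_rem, 1.0 - p_rem), axis=0)
                for y in patterns])

# Marginal pattern probabilities pi_{t1}, pi_{t0} and posteriors p^k_{tq}.
pi1 = lik @ (w * f1)
pi0 = lik @ (w * (1.0 - f1))
post1 = lik * (w * f1) / pi1[:, None]          # each row sums to 1
post0 = lik * (w * (1.0 - f1)) / pi0[:, None]  # each row sums to 1

# Simulated pattern frequencies n_{t1}, n_{t0} for N examinees.
rng = np.random.default_rng(0)
N = 1000
counts = rng.multinomial(N, np.concatenate([pi1, pi0]))
n1, n0 = counts[:len(pi1)], counts[len(pi1):]

# Pseudocounts s_{q1}, s_{q0} and total pseudocounts at each quadrature point.
s1 = n1 @ post1
s0 = n0 @ post0
s_total = s1 + s0
```

Because every posterior row sums to one, the total pseudocounts add up to the sample size $N$, which gives a quick sanity check on any implementation.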

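Now consider the following vectors in the dichotomous case:

$$n^T = (n_{11}, n_{21}, \dots, n_{T1}, n_{10}, n_{20}, \dots, n_{T0}),$$

$$\hat{\pi}^T = (\hat{\pi}_{11}, \hat{\pi}_{21}, \dots, \hat{\pi}_{T1}, \hat{\pi}_{10}, \hat{\pi}_{20}, \dots, \hat{\pi}_{T0}) = n^T / N,$$

$$\pi^T = (\pi_{11}, \pi_{21}, \dots, \pi_{T1}, \pi_{10}, \pi_{20}, \dots, \pi_{T0}), \text{ the model-based probabilities},$$

$$s^T = (s_{11}, s_{21}, \dots, s_{Q1}, s_{10}, s_{20}, \dots, s_{Q0}), \text{ the observed pseudocounts},$$

$$\tilde{s}^T = (\tilde{s}_1, \tilde{s}_2, \dots, \tilde{s}_Q).$$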

The vector $n$ describes the frequencies of all possible patterns of the response data for the $J + 1$ items in a test. That is, $n$ contains the frequencies of the mutually exclusive response patterns from the sample data. If $N$ examinees are available, then $\sum_{t=1}^{T} (n_{t1} + n_{t0}) = N$. The model-based probability of the $t$th pattern of the remaining items together with a correct response on the studied item (i.e., $(U = 1, Y = y_t)$) is $\pi_{t1} = P(U = 1, Y = y_t)$, $t = 1, 2, \dots, T$. Similarly, the probability of observing response $(U = 0, Y = y_t)$ is $\pi_{t0} = P(U = 0, Y = y_t)$, $t = 1, 2, \dots, T$.

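For the convenience of studying the statistical properties of pseudocounts, two posterior matrices $P$ and $\tilde{P}$ are constructed. $P$ is a matrix consisting of all posteriors and having dimension $2T$ by $2Q$. That is,

$$P_{2T \times 2Q} = \begin{pmatrix}
p^1_{11} & p^1_{12} & \cdots & p^1_{1Q} & 0 & 0 & \cdots & 0 \\
p^1_{21} & p^1_{22} & \cdots & p^1_{2Q} & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
p^1_{T1} & p^1_{T2} & \cdots & p^1_{TQ} & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & p^0_{11} & p^0_{12} & \cdots & p^0_{1Q} \\
0 & 0 & \cdots & 0 & p^0_{21} & p^0_{22} & \cdots & p^0_{2Q} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0 & p^0_{T1} & p^0_{T2} & \cdots & p^0_{TQ}
\end{pmatrix}.$$

The matrix $\tilde{P}$ is a $2T \times Q$ matrix defined as

$$\tilde{P} = \begin{pmatrix}
p^1_{11} & p^1_{12} & \cdots & p^1_{1Q} \\
p^1_{21} & p^1_{22} & \cdots & p^1_{2Q} \\
\vdots & \vdots & & \vdots \\
p^1_{T1} & p^1_{T2} & \cdots & p^1_{TQ} \\
p^0_{11} & p^0_{12} & \cdots & p^0_{1Q} \\
p^0_{21} & p^0_{22} & \cdots & p^0_{2Q} \\
\vdots & \vdots & & \vdots \\
p^0_{T1} & p^0_{T2} & \cdots & p^0_{TQ}
\end{pmatrix}.$$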

With the matrix notation, the pseudocount vectors can be expressed as $s = P'n$ and $\tilde{s} = \tilde{P}'n$. The matrix $P$ can be written in column form as $P = (P^1_1, P^1_2, \dots, P^1_Q, P^0_1, P^0_2, \dots, P^0_Q)$, where $P^j_q$, $q = 1, 2, \dots, Q$ and $j = 1, 0$, denotes the column of $P$ corresponding to the posteriors at quadrature point $\theta_q$ with response $U = j$ for the studied item. Then $s_{qj} = (P^j_q)'n$. Similarly, write the matrix $\tilde{P}$ as $\tilde{P} = (\tilde{P}_1, \tilde{P}_2, \dots, \tilde{P}_Q)$, where $\tilde{P}_q$ represents the $q$th column of $\tilde{P}$, $q = 1, 2, \dots, Q$. Then $\tilde{s}_q = \tilde{P}_q' n$.

The pseudocount vector $s$ or $\tilde{s}$ can be considered a random vector since it is a linear function of the frequency vector $n$, which follows a multinomial distribution with probability vector $\pi$, denoted $n \sim M_{2T}(N, \pi)$ with $\sum_{t=1}^{T} (\pi_{t1} + \pi_{t0}) = 1$.

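To establish the results regarding the asymptotic distribution of the pseudocounts vector $s$, the following two vectors are useful:

$$v^T = \left( \frac{n_{11} - N\pi_{11}}{\sqrt{N\pi_{11}}}, \frac{n_{21} - N\pi_{21}}{\sqrt{N\pi_{21}}}, \dots, \frac{n_{T1} - N\pi_{T1}}{\sqrt{N\pi_{T1}}}, \frac{n_{10} - N\pi_{10}}{\sqrt{N\pi_{10}}}, \frac{n_{20} - N\pi_{20}}{\sqrt{N\pi_{20}}}, \dots, \frac{n_{T0} - N\pi_{T0}}{\sqrt{N\pi_{T0}}} \right),$$

$$\varphi^T = \left( \sqrt{\pi_{11}}, \sqrt{\pi_{21}}, \dots, \sqrt{\pi_{T1}}, \sqrt{\pi_{10}}, \sqrt{\pi_{20}}, \dots, \sqrt{\pi_{T0}} \right).$$

The object is to study the properties of the pseudocounts, which are linear combinations of the observed frequency vector $n$ of response patterns.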
2.2 Asymptotic Distributions of Pseudocounts

Before showing the theorems regarding the pseudocounts, define a matrix $B$ with fixed elements, $B = D_\pi^{1/2} P$, where $P$ is the matrix of posteriors defined as before, and $D_\pi^{1/2}$ is a diagonal matrix with the square roots of the elements of the model-based prediction vector $\pi$ as its diagonal entries:

$$D_\pi^{1/2} = \mathrm{diag}\left( \sqrt{\pi_{11}}, \sqrt{\pi_{21}}, \dots, \sqrt{\pi_{T1}}, \sqrt{\pi_{10}}, \sqrt{\pi_{20}}, \dots, \sqrt{\pi_{T0}} \right).$$

If item parameters are all known constants, so are each component of the posterior matrix $P$ and each element of the diagonal matrix $D_\pi^{1/2}$. Simply put, the product matrix $B$ has entries of fixed values. Denote the columns of $B$ as $b^1_q$ or $b^0_q$, $q = 1, 2, \dots, Q$. Then $B$ can be expressed as $B = (b^1_1, b^1_2, \dots, b^1_Q, b^0_1, b^0_2, \dots, b^0_Q)$. Each $b^1_q$ or $b^0_q$ is a fixed vector of dimension $2T$, with $b^1_q = D_\pi^{1/2} P^1_q$ and $b^0_q = D_\pi^{1/2} P^0_q$.

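Theorem 2.2.1 (Marginal Distribution of Pseudocounts) The asymptotic distribution of $\sqrt{N}\left( \frac{s_{qj}}{N} - (P^j_q)'\pi \right)$ for each element $s_{qj}$ of the pseudocount vector $s$ defined as above is normal with mean 0 and variance $(P^j_q)'(D_\pi - \pi\pi')P^j_q$, $q = 1, 2, \dots, Q$ and $j = 0, 1$.

Proof: Let the vectors $v$, $b^1_q$, and $b^0_q$, $q = 1, 2, \dots, Q$, be defined as above. Then the asymptotic distribution of the linear function $(b^j_q)'v$ is normal with mean 0 and variance $(b^j_q)'b^j_q - ((b^j_q)'\varphi)^2 = (b^j_q)'(I - \varphi\varphi')b^j_q$, for $j = 1, 0$ (p383, Rao, 1973). Therefore,

$$\begin{aligned}
(b^j_q)'v &= (P^j_q)' D_\pi^{1/2} v \\
&= \frac{1}{\sqrt{N}} (P^j_q)'(n - N\pi) \\
&= \frac{1}{\sqrt{N}} \sum_{t=1}^{T} \left( n_{tj}\, p^j_{tq} - N \pi_{tj}\, p^j_{tq} \right) \\
&= \sqrt{N} \left( \frac{s_{qj}}{N} - \sum_{t=1}^{T} \pi_{tj}\, p^j_{tq} \right) \\
&= \sqrt{N} \left( \frac{s_{qj}}{N} - (P^j_q)'\pi \right).
\end{aligned}$$

Moreover, $\mathrm{Var}((b^j_q)'v) = (P^j_q)' D_\pi^{1/2} (I - \varphi\varphi') D_\pi^{1/2} P^j_q = (P^j_q)'(D_\pi - \pi\pi')P^j_q$, since $D_\pi^{1/2}\varphi = \pi$. Hence the theorem.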

The expectation and variance of the pseudocounts, $E s_{qj}$ and $\mathrm{Var}(s_{qj})$ respectively, can be found easily: $E s_{qj} = E (P^j_q)'n = N (P^j_q)'\pi = N w_q f_{qj}$ and $\mathrm{Var}(s_{qj}) = (P^j_q)' \mathrm{Var}(n) P^j_q = N (P^j_q)'(D_\pi - \pi\pi')P^j_q$, $q = 1, 2, \dots, Q$ and $j = 1, 0$. It can be seen from the theorem that each pseudocount is a random variable and is asymptotically normally distributed. For the pseudocount $s_{11}$, for example, $\sqrt{N}\left( \frac{s_{11}}{N} - (P^1_1)'\pi \right) \sim N\left( 0, (P^1_1)'(D_\pi - \pi\pi')P^1_1 \right)$ asymptotically as $N \to \infty$, where the vector $P^1_1$ can be written as $(P^1_1)' = (p^1_{11}, p^1_{21}, \dots, p^1_{T1}, 0, 0, \dots, 0)$. Also, $E(s_{11}) = N \sum_{t=1}^{T} \pi_{t1}\, p^1_{t1}$ and $\mathrm{Var}(s_{11}) = N \left( \sum_{t=1}^{T} (p^1_{t1})^2 \pi_{t1} - \left( \sum_{t=1}^{T} p^1_{t1} \pi_{t1} \right)^2 \right)$.

In the same way, it can be shown that the total pseudocount $\tilde{s}_q$ at the quadrature point $\theta_q$ is also asymptotically normally distributed: $\sqrt{N}\left( \frac{\tilde{s}_q}{N} - \tilde{P}_q'\pi \right) \sim N\left( 0, \tilde{P}_q'(D_\pi - \pi\pi')\tilde{P}_q \right)$, $q = 1, 2, \dots, Q$, where $\tilde{P}_q$ is the $q$th column of the matrix $\tilde{P}$. Note that the expectation of $\tilde{s}_q$ is $E\tilde{s}_q = N \tilde{P}_q'\pi = N w_q$. Interestingly, $E\tilde{s}_q$ does not depend on the hypothesized model; it depends only on the quadrature weight $w_q$.
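The moment formulas above can be checked numerically on a toy test in which every response pattern can be enumerated. The following is a minimal sketch under assumed values: the item parameters, the 11-point quadrature with normal-density weights, and the helper name `posterior_matrices` are all hypothetical.

```python
import itertools
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric without 1.7 scaling: an assumption)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def posterior_matrices(theta, w, studied, remaining):
    """Return pi (2T,), P (2T x 2Q), P_tilde (2T x Q) and f_{q1} for known parameters."""
    f1 = irf_3pl(theta, *studied)
    p_rem = np.array([irf_3pl(theta, *par) for par in remaining])
    pats = np.array(list(itertools.product([0, 1], repeat=len(remaining))))
    lik = np.array([np.prod(np.where(y[:, None] == 1, p_rem, 1.0 - p_rem), axis=0)
                    for y in pats])
    pi1, pi0 = lik @ (w * f1), lik @ (w * (1.0 - f1))
    post1 = lik * (w * f1) / pi1[:, None]
    post0 = lik * (w * (1.0 - f1)) / pi0[:, None]
    T, Q = lik.shape
    P = np.zeros((2 * T, 2 * Q))
    P[:T, :Q], P[T:, Q:] = post1, post0      # block-diagonal layout of P
    P_tilde = np.vstack([post1, post0])      # stacked layout of P tilde
    return np.concatenate([pi1, pi0]), P, P_tilde, f1

theta = np.linspace(-4.0, 4.0, 11)
w = np.exp(-0.5 * theta**2)
w /= w.sum()
pi, P, P_tilde, f1 = posterior_matrices(
    theta, w, (1.2, 0.3, 0.15),
    [(0.8, -0.5, 0.10), (1.5, 0.0, 0.20), (1.0, 1.0, 0.05)])

N = 1000
G = np.diag(pi) - np.outer(pi, pi)           # D_pi - pi pi'
mean_s = N * P.T @ pi                        # E s = N P' pi
cov_s = N * P.T @ G @ P                      # Var(s) = N P'(D_pi - pi pi')P
mean_total = N * P_tilde.T @ pi              # E s tilde = N P tilde' pi
```

In this sketch `mean_s` reproduces $N w_q f_{qj}$ and `mean_total` reproduces $N w_q$, so the expected total pseudocounts are the same whatever item parameters are supplied.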

Theorem 2.2.2 (Joint Distribution of Pseudocounts) The asymptotic distribution of $\sqrt{N}\left( \frac{s}{N} - P'\pi \right)$ for the pseudocounts vector $s$ is multivariate normal with mean vector 0 and dispersion matrix $P'(D_\pi - \pi\pi')P$, where $P$, $D_\pi$, and $\pi$ are defined as above.

Proof: The asymptotic distribution of the $2Q$ linear functions $B'v$, where $B$ is a $2T$ by $2Q$ matrix of rank $2Q - 2$ defined as above, is multivariate normal with mean vector zero and dispersion matrix $B'(I - \varphi\varphi')B$ (p383, Rao, 1973). It is easy to see that

$$B'v = (b^1_1, b^1_2, \dots, b^1_Q, b^0_1, b^0_2, \dots, b^0_Q)'v = \sqrt{N}\left( \frac{s}{N} - P'\pi \right).$$

After a little algebra, it can be shown that the pseudocounts vector is asymptotically multivariate normal with mean vector 0 and covariance matrix $B'(I - \varphi\varphi')B$, i.e., $\sqrt{N}\left( \frac{s}{N} - P'\pi \right) \sim N_{2Q}\left( 0, P'(D_\pi - \pi\pi')P \right)$ asymptotically as $N \to \infty$. Similarly, the asymptotic distribution of $\sqrt{N}\left( \frac{\tilde{s}}{N} - \tilde{P}'\pi \right)$ for the total pseudocounts vector $\tilde{s}$ is $N_Q\left( 0, \tilde{P}'(D_\pi - \pi\pi')\tilde{P} \right)$ as $N \to \infty$.

Now one can see why pseudocounts contain essential information for assessing the degree of item fit. They are the sum of posterior distributions across all possible response patterns and over all examinees. The posterior probability of proficiency, rather than the count of grouped proficiency estimates, provides the best information for evaluating the degree of model-data fit. The proportions of pseudocounts $s$ over the total number of examinees $N$ give empirical values that can be compared to the values predicted by the IRT model. A measure of the correspondence between the empirical and predicted values represents the degree of adequacy of model-data fit at the item level. However, it is often difficult or impossible to judge from plots whether the differences between the empirical values based on pseudocounts and the model-based predicted values are large enough to indicate misfit. A statistical significance test is therefore very desirable. The following section reformulates $Q^*_{DM}$ and finds its reference distribution based on pseudocounts.

2.3 The Asymptotic Distribution of the Item Fit Measure $Q^*_{DM}$

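The statistic $Q^*_{DM}$ suggested by Donoghue and McClellan (e.g., 2003) is defined through a binomial approximation to the expectation of the pseudocounts as

$$\begin{aligned}
Q^*_{DM} &= \sum_{q=1}^{Q} \left( \frac{(s_{q1} - E s_{q1})^2}{E s_{q1}} + \frac{(s_{q0} - E s_{q0})^2}{E s_{q0}} \right) \\
&= \sum_{q=1}^{Q} \left( \frac{(s_{q1} - f_{q1}\tilde{s}_q)^2}{f_{q1}\tilde{s}_q} + \frac{(s_{q0} - f_{q0}\tilde{s}_q)^2}{f_{q0}\tilde{s}_q} \right) \\
&= \sum_{q=1}^{Q} \frac{(s_{q1} - f_{q1}\tilde{s}_q)^2}{f_{q1}(1 - f_{q1})\tilde{s}_q}.
\end{aligned}$$

Donoghue and Hombo (2003b) expand the above expression of $Q^*_{DM}$ about $\hat{\pi} = \pi$ as a Taylor series to show that the measure $Q^*_{DM}$ is asymptotically a quadratic form of normal variables:

$$Q^*_{DM}(\hat{\pi}) = \sqrt{N}(\hat{\pi} - \pi)'\, C\, (\hat{\pi} - \pi)\sqrt{N} + O(N^{-1/2}).$$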

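The matrix $C$ is the same as that in Donoghue, McClellan, and Oranje (2004, p 10). That is,

$$C = \sum_{q=1}^{Q} w_q \frac{(V^*_{q1} - f_{q1} V^*_{q2})(V^*_{q1} - f_{q1} V^*_{q2})'}{f_{q1}(1 - f_{q1})},$$

where

$$V^{*\prime}_{q1} = \left( \frac{f_{q1} l_{1q}}{\pi_{11}}, \frac{f_{q1} l_{2q}}{\pi_{21}}, \dots, \frac{f_{q1} l_{Tq}}{\pi_{T1}}, 0, \dots, 0 \right)$$

and

$$V^{*\prime}_{q2} = \left( \frac{f_{q1} l_{1q}}{\pi_{11}}, \frac{f_{q1} l_{2q}}{\pi_{21}}, \dots, \frac{f_{q1} l_{Tq}}{\pi_{T1}}, \frac{(1 - f_{q1}) l_{1q}}{\pi_{10}}, \frac{(1 - f_{q1}) l_{2q}}{\pi_{20}}, \dots, \frac{(1 - f_{q1}) l_{Tq}}{\pi_{T0}} \right).$$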

2.3.1 Reformulated $Q^*_{DM}$ and Its Asymptotic Distribution

In this study, $Q^*_{DM}$ is also defined as a Pearson $\chi^2$-like statistic. That is,

$$Q^*_{DM} = \sum_{q=1}^{Q} \left( \frac{(s_{q1} - E s_{q1})^2}{E s_{q1}} + \frac{(s_{q0} - E s_{q0})^2}{E s_{q0}} \right).$$

As previously defined, $s_{q1}$ or $s_{q0}$ is the pseudocount at quadrature point $\theta_q$, $q = 1, 2, \dots, Q$, and $E s_{q1}$ or $E s_{q0}$ denotes the corresponding expectation. First simplify the expression for the expectation of $s_{q1}$ and $s_{q0}$, $q = 1, 2, \dots, Q$. Notice that the expectation of the pseudocount, $E s_{qj} = N (P^j_q)'\pi$ for $j = 0, 1$, can be expressed as

$$\begin{aligned}
E s_{qj} &= E\left( w_q f_{qj} \sum_{t=1}^{T} \frac{l_{tq}}{\pi_{tj}}\, n_{tj} \right) \\
&= N w_q f_{qj} \sum_{t=1}^{T} l_{tq} \\
&= N w_q f_{qj},
\end{aligned}$$

since $E n_{tj} = N\pi_{tj}$ and $\sum_{t=1}^{T} l_{tq} = 1$. That is, $E s_{q1} = N w_q f_{q1}$ and $E s_{q0} = N w_q f_{q0} = N w_q (1 - f_{q1})$. Therefore, the expectation of the pseudocounts vector $s$ is $Es = N(w_1 f_{11}, w_2 f_{21}, \dots, w_Q f_{Q1}, w_1 f_{10}, w_2 f_{20}, \dots, w_Q f_{Q0})^T$. The expression for $Es$ is the same as that derived from the theorem on the joint distribution of the pseudocounts vector.

Now turn to the asymptotic distribution of the reformulated $Q^*_{DM}$. Let $D_{Es}$ be a diagonal matrix with the expectations of the pseudocounts as its diagonal elements. Obviously, $D_{Es}$ is a $2Q$ by $2Q$ matrix, and $D_{Es}^{-1} = D_{Es}^{-1/2} D_{Es}^{-1/2}$, where $D_{Es}^{-1/2}$ can be expressed as

$$D_{Es}^{-1/2} = \mathrm{diag}\left( \frac{1}{\sqrt{N w_1 f_{11}}}, \dots, \frac{1}{\sqrt{N w_Q f_{Q1}}}, \frac{1}{\sqrt{N w_1 f_{10}}}, \dots, \frac{1}{\sqrt{N w_Q f_{Q0}}} \right).$$

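With the matrix $D_{Es}^{-1/2}$, $Q^*_{DM}$ can be further simplified:

$$\begin{aligned}
Q^*_{DM} &= \sum_{q=1}^{Q} \left( \frac{(s_{q1} - E s_{q1})^2}{E s_{q1}} + \frac{(s_{q0} - E s_{q0})^2}{E s_{q0}} \right) \\
&= \left\| D_{Es}^{-1/2}(s - Es) \right\|^2 \\
&= (s - Es)' D_{Es}^{-1} (s - Es) \\
&= (P'n - N P'\pi)' D_{Es}^{-1} (P'n - N P'\pi) \\
&= (n - N\pi)' P D_{Es}^{-1} P' (n - N\pi) \\
&= \sqrt{N}(\hat{\pi} - \pi)'\, N P D_{Es}^{-1} P'\, \sqrt{N}(\hat{\pi} - \pi).
\end{aligned}$$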

As is known, $\sqrt{N}(\hat{\pi} - \pi)$ is asymptotically distributed as a multivariate normal variate with mean vector 0 and covariance matrix $G = D_\pi - \pi\pi'$ (e.g., p470, Bishop, Fienberg, and Holland, 1975). Thus, $Q^*_{DM}$ is asymptotically a quadratic function of normal variables. Obviously, the matrix $N P D_{Es}^{-1} P'$ is nonnegative definite since all of the diagonal components of the matrix $D_{Es}$ are nonnegative. Following Stapleton's (1995, p65) expression for a quadratic form, denote $y = \sqrt{N}(\hat{\pi} - \pi) \sim N_{2T}(0, G)$ and the nonnegative definite matrix $A = N P D_{Es}^{-1} P'$, let $G^{1/2}$ be the unique symmetric square root of $G$, and let $G^{-1/2}$ be its inverse. Thus $Q^*_{DM} = y' G^{-1/2} (G^{1/2} A G^{1/2}) G^{-1/2} y = z'Cz$, where $z = G^{-1/2}y$ and $C = G^{1/2} A G^{1/2}$. Then $\mathrm{Var}(z) = G^{-1/2} G G^{-1/2}$ is the $2T \times 2T$ identity matrix, so that $z \sim N_{2T}(0, I)$.

Let $C = T \Lambda T'$ be the spectral decomposition of $C$. Then $\Lambda$ is the diagonal matrix of eigenvalues of $C$, and $T$ is the $2T \times 2T$ orthogonal matrix whose columns are the corresponding eigenvectors of $C$. Hence, $Q^*_{DM} = z' T \Lambda T' z = (T'z)' \Lambda (T'z) = \omega' \Lambda \omega$ for $\omega = T'z$, with $\mathrm{Var}(\omega) = T' I T = I_{2T}$ and $\omega \sim N_{2T}(0, I)$. Denoting the eigenvalues of $C$ by $\lambda_1, \lambda_2, \dots, \lambda_{2T}$, $Q^*_{DM} = \sum_{i=1}^{2T} \lambda_i \omega_i^2$, where $\omega' = (\omega_1, \omega_2, \dots, \omega_{2T})$. Therefore, $Q^*_{DM}$ is a linear combination, with coefficients $\lambda_1, \lambda_2, \dots, \lambda_{2T}$, of independent $\chi^2_1$ random variables. The coefficients $\lambda_1, \lambda_2, \dots, \lambda_{2T}$ are the eigenvalues of $N G^{1/2} P D_{Es}^{-1} P' G^{1/2}$, and also the eigenvalues of $N P D_{Es}^{-1} P' G$ or of $N G P D_{Es}^{-1} P'$. By Theorem 2.2.1 (Stapleton, 1995, p51), the expectation of $Q^*_{DM}$ is $E(Q^*_{DM}) = \mathrm{trace}(AG) = \mathrm{trace}(N P D_{Es}^{-1} P' G)$.

The asymptotic distribution of $Q^*_{DM}$ can be further simplified as a reduced sum of independent $\chi^2_{(1)}$ variates (e.g., Johnson and Kotz, 1970). That is, $Q^*_{DM} \sim \sum_{i=1}^{m} \lambda_i \chi^2_{(1)}$, where the $\lambda_i$ are the $m$ non-zero eigenvalues of the $2T \times 2T$ matrix $N P D_{Es}^{-1} P' G$. The non-zero eigenvalues of the matrix $N P D_{Es}^{-1} P' G$ are the same as the non-zero eigenvalues of the matrix $N L'GL = N D_{Es}^{-1/2} P'(D_\pi - \pi\pi')P D_{Es}^{-1/2}$ for $L = P D_{Es}^{-1/2}$. This can be seen by letting $u$ be a non-zero vector (of dimension $2T$) and $\lambda$ a non-zero scalar satisfying the defining equation $N P D_{Es}^{-1} P' G u = \lambda u$, that is, $N L L' G u = \lambda u$. This implies that $N L' G u = \lambda L^{-} u$, where $L^{-}$ represents the generalized inverse of the matrix $L$, and hence $N L'GL (L^{-}u) = \lambda (L^{-}u)$. Hence the result. A routine by Davies (1980) can be used to evaluate this probability distribution. The result about the asymptotic distribution of $Q^*_{DM}$ is stated in the following theorem.

Theorem 2.3.1 (Asymptotic Distribution of $Q^*_{DM}$) The Pearson $\chi^2$-like measure of item fit $Q^*_{DM}$ defined as above is a quadratic function of random variables with mean vector 0 and covariance matrix $D_\pi - \pi\pi'$.

Take a close look at the covariance matrix $N D_{Es}^{-1/2} P'(D_\pi - \pi\pi')P D_{Es}^{-1/2}$. Denote this matrix product as $A$ (i.e., $A = N D_{Es}^{-1/2} P'(D_\pi - \pi\pi')P D_{Es}^{-1/2}$). Let the set of distinct eigenvalues of $A$ (the spectrum of $A$) be denoted $\sigma(A)$. The maximum magnitude of the eigenvalues, $\rho(A) = \max |\lambda|$ over $\lambda \in \sigma(A)$, satisfies $\rho(A) \le \|A\|$ for every matrix norm (Meyer, 2000, p497), i.e., $|\lambda| \le \|A\|$ for all $\lambda \in \sigma(A)$. Since all the components in the matrices $D_{Es}^{-1/2}$, $P$, and $D_\pi - \pi\pi'$ concern probabilities, the maximum absolute value of the components in the product $A$ is less than or equal to 1. Thus the maximum eigenvalue of the matrix A is equal to 1.
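The eigenvalue structure described above can be explored numerically on a toy test. In the sketch below, the item parameters, the 11-point quadrature, and the function names are hypothetical, and the tail probability is obtained by plain Monte Carlo simulation as a stand-in for the Davies (1980) routine, which is not reproduced here.

```python
import itertools
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric without 1.7 scaling: an assumption)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def scaled_covariance(theta, w, studied, remaining):
    """A = N D_Es^{-1/2} P'(D_pi - pi pi')P D_Es^{-1/2}; N cancels because D_Es = N diag(e)."""
    f1 = irf_3pl(theta, *studied)
    p_rem = np.array([irf_3pl(theta, *par) for par in remaining])
    pats = np.array(list(itertools.product([0, 1], repeat=len(remaining))))
    lik = np.array([np.prod(np.where(y[:, None] == 1, p_rem, 1.0 - p_rem), axis=0)
                    for y in pats])
    pi1, pi0 = lik @ (w * f1), lik @ (w * (1.0 - f1))
    T, Q = lik.shape
    P = np.zeros((2 * T, 2 * Q))
    P[:T, :Q] = lik * (w * f1) / pi1[:, None]
    P[T:, Q:] = lik * (w * (1.0 - f1)) / pi0[:, None]
    pi = np.concatenate([pi1, pi0])
    G = np.diag(pi) - np.outer(pi, pi)
    e = np.concatenate([w * f1, w * (1.0 - f1)])      # E s / N
    A = (P.T @ G @ P) / np.sqrt(np.outer(e, e))
    return (A + A.T) / 2.0                             # symmetrize against rounding

def tail_prob_weighted_chi2(q_obs, lam, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(sum_i lam_i * chi2_1 >= q_obs)."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_draws, lam.size)) @ lam
    return float(np.mean(draws >= q_obs))

theta = np.linspace(-4.0, 4.0, 11)
w = np.exp(-0.5 * theta**2)
w /= w.sum()
A = scaled_covariance(theta, w, (1.2, 0.3, 0.15),
                      [(0.8, -0.5, 0.10), (1.5, 0.0, 0.20), (1.0, 1.0, 0.05)])

eig = np.linalg.eigvalsh(A)
lam = eig[eig > 1e-10]           # non-zero eigenvalues (coefficients of the chi-squares)
expected_q = np.trace(A)         # E(Q*_DM) = trace(AG) in the notation of the text
p_value = tail_prob_weighted_chi2(expected_q, lam)
```

With three remaining items there are $T = 8$ patterns, so at most $2T - 1 = 15$ coefficients are non-zero in this toy case, and the largest one should come out numerically equal to 1.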
2.3.2 Asymptotic Distribution of Q

The statistical distance between the observed total pseudocounts and their expectations (the external and fixed values) also represents the degree of model-data fit. Define the distance by $Q = (\tilde{s} - E\tilde{s})'\tilde{\Sigma}^{-1}(\tilde{s} - E\tilde{s})$. It is seen from Theorem 2.2.1 and Theorem 2.2.2 that the asymptotic distribution of the sequence $\sqrt{N}\left( \frac{\tilde{s}}{N} - \tilde{P}'\pi \right)$ for $\tilde{s}$ is $N_Q\left( 0, \tilde{P}'(D_\pi - \pi\pi')\tilde{P} \right)$ with a nonsingular covariance matrix. As is easy to see, the expectation and covariance of $\tilde{s}$ are $E\tilde{s} = N\tilde{P}'\pi$ and $\mathrm{Var}(\tilde{s}) = N\tilde{P}'(D_\pi - \pi\pi')\tilde{P}$. Since $\frac{\tilde{s} - E\tilde{s}}{\sqrt{N}} = \sqrt{N}\left( \frac{\tilde{s}}{N} - \tilde{P}'\pi \right)$, the following states the result for the asymptotic distribution of $Q$ (e.g., p163, Johnson and Wichern, 2002).

Theorem 2.3.2 (Asymptotic Distribution of $Q$) The asymptotic distribution of the item fit measure defined as $Q = (\tilde{s} - E\tilde{s})'\tilde{\Sigma}^{-1}(\tilde{s} - E\tilde{s})$ is $\chi^2$ with $Q$ degrees of freedom, where the covariance matrix $\tilde{\Sigma}$ is $N\tilde{P}'(D_\pi - \pi\pi')\tilde{P}$.

Let $\Sigma$ denote the $2Q \times 2Q$ covariance matrix of the pseudocounts vector $s$, i.e., $\Sigma = N P'(D_\pi - \pi\pi')P$. Then the covariance matrix of the total pseudocounts vector $\tilde{s}$, $\tilde{\Sigma}$, is $N\tilde{P}'(D_\pi - \pi\pi')\tilde{P}$. The following section introduces consistent estimators of $\Sigma$ and $\tilde{\Sigma}$.

2.4 The Observed Covariance Matrix of Interrelations among Pseudocounts

Although the covariance matrix of pseudocounts $\Sigma = N P'(D_\pi - \pi\pi')P$ has dimension $2Q \times 2Q$ and the dimension of $\tilde{\Sigma} = N\tilde{P}'(D_\pi - \pi\pi')\tilde{P}$ is $Q \times Q$, the evaluation of $\Sigma$ and $\tilde{\Sigma}$ involves the $2T \times 2Q$ matrix $P$, the $2T \times Q$ matrix $\tilde{P}$, and the $2T \times 2T$ matrix $D_\pi - \pi\pi'$. Note that $T$ is the number of all possible response patterns for the remaining $J$ items. In the dichotomous case, $T = 2^J$. For a long test, the numerical computation of $\Sigma$ is impractical for most operational work. To reduce the computational complexity, $\Sigma$ is estimated from the observed covariance matrix $S$ of interrelations among pseudocounts $s$, and $\tilde{\Sigma}$ is estimated from the observed covariance matrix $\tilde{S}$ of interrelations among total pseudocounts $\tilde{s}$. It is known that $n \sim M_{2T}(N, \pi)$, a multinomial distribution with $2T - 1$ parameters and covariance matrix $N(D_\pi - \pi\pi')$. The sample proportion $\hat{\pi}_{tj}$ is a uniformly minimum variance unbiased (UMVU) estimator of $\pi_{tj}$, $t = 1, 2, \dots, T$ and $j = 0, 1$ (e.g., Lehmann and Casella, 1998, p106; Bickel and Doksum, 2001, p187). It is natural to think of the matrix $D_{\hat{\pi}} - \hat{\pi}\hat{\pi}'$ as an estimate of the matrix $D_\pi - \pi\pi'$.

Let the vector $x_i$ indicate the posterior contribution of the $i$th examinee on the studied item across the array of $Q$ quadrature points given the response pattern $(U, Y = y_i)$, $i = 1, 2, \dots, N$. Then $x_i$ is a $2Q$-dimensional vector, $x_i = (X^1_{i1}, X^1_{i2}, \dots, X^1_{iQ}, X^0_{i1}, X^0_{i2}, \dots, X^0_{iQ})$. The components of the vector $x_i$ are $X^j_{iq}$, $q = 1, 2, \dots, Q$, $i = 1, 2, \dots, N$, and $j = 1, 0$, where

$$X^1_{iq} = U\, P(\theta = \theta_q \mid U = 1, Y = y_i),$$

$$X^0_{iq} = (1 - U)\, P(\theta = \theta_q \mid U = 0, Y = y_i).$$

Therefore, an $N$ by $2Q$ matrix representing the contributions of each examinee to the posteriors at the $Q$ quadrature points given the observed test data is available. If all of the item parameters are known constants, then each posterior can be thought of as a fixed value. Each component $X^j_{iq}$, $q = 1, 2, \dots, Q$, $i = 1, 2, \dots, N$, and $j = 1, 0$, can then be viewed as a random variable, because $U$ is a Bernoulli random variable. Clearly, these $2Q$ random variables are not independent. The realizations of the components $X^j_{iq}$ constitute an $N$ by $2Q$ matrix, a much smaller and more manageable matrix for computation, which can then be used to estimate the covariance matrix among pseudocounts. It can be shown that the sum of the row vectors of the $N$ by $2Q$ matrix over all examinees is actually the pseudocounts vector $s$. In this sense, the vector $x_i$ can be regarded as the unit pseudocount, and for $i = 1, 2, \dots, N$, $x_i$ can be viewed as the $i$th realization of the random vector $x = (X^1_1, X^1_2, \dots, X^1_Q, X^0_1, X^0_2, \dots, X^0_Q)$. The $N$ by $2Q$ matrix contains all the information on each examinee's unit pseudocount at each quadrature point.

The observed covariance matrix of the vector $x$ from sample data gives the covariance $S$ of interrelations among pseudocounts. What follows states the relations between the variance of the unit pseudocount $X^j_q$ and that of the overall pseudocounts $s_{qj}$, $q = 1, 2, \dots, Q$ and $j = 1, 0$. By definition, the sample variance of $X^j_q$ is given by

$$\mathrm{Var}(X^j_q) = \frac{1}{N} \sum_{i=1}^{N} \left( X^j_{iq} - \frac{s_{qj}}{N} \right)^2.$$

Denote $\bar{r}_{qj} = \frac{s_{qj}}{N} = \sum_{t=1}^{T} \hat{\pi}_{tj}\, p^j_{tq}$ for $j = 1, 0$; then

$$\mathrm{Var}(X^j_q) = \frac{1}{N} \sum_{i=1}^{N} \left( X^j_{iq} - \bar{r}_{qj} \right)^2.$$

When the item parameters are all known constants, the posterior $p^j_{tq}$ is also known and fixed, $t = 1, 2, \dots, T$ and $q = 1, 2, \dots, Q$. The difference between the variance of the unit pseudocount, $\mathrm{Var}(X^1_q)$, and the average variance of the pseudocounts, $\frac{1}{N}\mathrm{Var}(s_{q1})$, is

$$\mathrm{Var}(X^1_q) - \frac{1}{N}\mathrm{Var}(s_{q1}) = \sum_{t=1}^{T} (\hat{\pi}_{t1} - \pi_{t1})(p^1_{tq})^2 - \left( \sum_{t=1}^{T} (\hat{\pi}_{t1} - \pi_{t1})\, p^1_{tq} \right) \left( \sum_{t=1}^{T} (\hat{\pi}_{t1} + \pi_{t1})\, p^1_{tq} \right).$$

Since the sample proportion $\hat{\pi}_{t1}$ is a consistent estimator of the parameter $\pi_{t1}$, $\hat{\pi}_{t1} \to \pi_{t1}$ in probability as $N \to \infty$. Thus, as the sample size $N$ goes to infinity, $\mathrm{Var}(X^1_q) \to \frac{1}{N}\mathrm{Var}(s_{q1})$ in probability. That is, $\mathrm{Var}(X^1_q)$ is consistent for estimating $\frac{1}{N}\mathrm{Var}(s_{q1})$, $q = 1, 2, \dots, Q$. Similarly, it can be shown that $\mathrm{Var}(X^0_q)$ is a consistent estimator of $\frac{1}{N}\mathrm{Var}(s_{q0})$, $q = 1, 2, \dots, Q$.

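Now consider the relations between the covariance $\mathrm{Cov}(X^j_q, X^{j'}_{q'})$ and the covariance $\mathrm{Cov}(s_{qj}, s_{q'j'})$, $q, q' = 1, 2, \dots, Q$ and $j, j' = 1, 0$. First express $\mathrm{Cov}(s_{qj}, s_{q'j'})$ as

$$\begin{aligned}
\mathrm{Cov}(s_{qj}, s_{q'j'}) &= \mathrm{Cov}\left( \sum_{t=1}^{T} n_{tj}\, p^j_{tq}, \sum_{t=1}^{T} n_{tj'}\, p^{j'}_{tq'} \right) \\
&= \mathrm{Cov}\left( (P^j_q)'n, n'P^{j'}_{q'} \right) \\
&= N (P^j_q)'(D_\pi - \pi\pi')P^{j'}_{q'}.
\end{aligned}$$

The vectors $P^j_q$ and $P^{j'}_{q'}$ are the two columns of the matrix $P$ corresponding to quadrature points $q$, $q'$ and responses $j$, $j'$, respectively.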

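Next, study the covariance $\mathrm{Cov}(X^j_q, X^{j'}_{q'})$. By definition,

$$\begin{aligned}
\mathrm{Cov}(X^j_q, X^{j'}_{q'}) &= \frac{1}{N} \sum_{i=1}^{N} \left( X^j_{iq} - \frac{s_{qj}}{N} \right) \left( X^{j'}_{iq'} - \frac{s_{q'j'}}{N} \right) \\
&= \frac{1}{N} \sum_{i=1}^{N} \left( X^j_{iq} X^{j'}_{iq'} - X^j_{iq}\bar{r}_{q'j'} - X^{j'}_{iq'}\bar{r}_{qj} \right) + \bar{r}_{qj}\bar{r}_{q'j'} \\
&= (P^j_q)' D_{\hat{\pi}} P^{j'}_{q'} - (P^j_q)'\hat{\pi}\hat{\pi}'P^{j'}_{q'}.
\end{aligned}$$

Again, it is seen that $\mathrm{Cov}(X^j_q, X^{j'}_{q'}) \to \frac{1}{N}\mathrm{Cov}(s_{qj}, s_{q'j'})$. It is not hard to find that $\mathrm{Cov}(\tilde{s}_q, \tilde{s}_{q'}) = N\tilde{P}_q'(D_\pi - \pi\pi')\tilde{P}_{q'}$, $q, q' = 1, 2, \dots, Q$. Form the $N \times Q$ posterior matrix $\tilde{X}$ with each row vector $\tilde{x}_i$, $i = 1, 2, \dots, N$, representing the $i$th examinee's $Q$ posteriors; then $\tilde{x}_i = (\tilde{X}_{i1}, \tilde{X}_{i2}, \dots, \tilde{X}_{iQ})$, $i = 1, 2, \dots, N$, where

$$\tilde{x}_i = \left( (p^1_{i1})^U (p^0_{i1})^{1-U}, (p^1_{i2})^U (p^0_{i2})^{1-U}, \dots, (p^1_{iQ})^U (p^0_{iQ})^{1-U} \right).$$

In the same way, the vector $\tilde{x}_i$ is one realization of the vector $\tilde{x} = (\tilde{X}_1, \tilde{X}_2, \dots, \tilde{X}_Q)$. Let the covariance of the vector $\tilde{x}$ be denoted $\tilde{S} = \mathrm{Cov}(\tilde{x})$. It can be seen that $\mathrm{Cov}(\tilde{X}_q, \tilde{X}_{q'}) = \tilde{P}_q'(D_{\hat{\pi}} - \hat{\pi}\hat{\pi}')\tilde{P}_{q'}$.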

In summary, the observed covariance matrix $S$ of the interrelations among pseudocounts is a consistent estimator of the average covariance of the pseudocounts vector $s$: $S$ can be made arbitrarily close to $\frac{1}{N}\Sigma$ when the sample size $N$ is large enough. Similarly, the observed matrix $\tilde{S}$ is a consistent estimator of $\frac{1}{N}\tilde{\Sigma}$. A noticeable computational simplification is obtained by using $S$, which is computed from a constructed $N$ by $2Q$ matrix of posteriors. This reduction in the computational complexity of the covariance matrix among pseudocounts makes hypothesis testing of goodness of fit at the item level feasible using the measure of item fit $Q^*_{DM}$.
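The construction of the $N$ by $2Q$ matrix of unit pseudocounts can be sketched as follows. The simulated item parameters, the sample size, the random seed, and the normal-density quadrature weights are assumptions for illustration only. Note that the sketch never enumerates the $2^J$ response patterns; it works from each examinee's own response vector.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric without 1.7 scaling: an assumption)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

rng = np.random.default_rng(1)
Q, N, J = 41, 2000, 14                      # 41 quadrature points, 1 studied + 14 remaining items
theta_q = np.linspace(-4.0, 4.0, Q)
w = np.exp(-0.5 * theta_q**2)
w /= w.sum()

# Hypothetical known item parameters; item 0 plays the role of the studied item.
a = rng.uniform(0.6, 2.6, J + 1)
b = rng.normal(0.0, 1.0, J + 1)
c = rng.uniform(0.0, 0.25, J + 1)
prob_q = irf_3pl(theta_q[None, :], a[:, None], b[:, None], c[:, None])   # (J+1, Q)

# Simulated dichotomous responses for N examinees.
ability = rng.normal(0.0, 1.0, N)
prob_i = irf_3pl(ability[:, None], a[None, :], b[None, :], c[None, :])   # (N, J+1)
resp = (rng.random((N, J + 1)) < prob_i).astype(int)
u = resp[:, 0]                                                            # studied-item response U

# Posterior of theta_q given each examinee's full response pattern (U, Y = y_i).
loglik = resp @ np.log(prob_q) + (1 - resp) @ np.log(1.0 - prob_q)       # (N, Q)
post = w * np.exp(loglik)
post /= post.sum(axis=1, keepdims=True)

# Unit pseudocounts: X^1_{iq} = U * posterior, X^0_{iq} = (1 - U) * posterior.
X = np.hstack([u[:, None] * post, (1 - u)[:, None] * post])              # (N, 2Q)

s = X.sum(axis=0)                            # pseudocounts vector s (column sums)
S = np.cov(X, rowvar=False, bias=True)       # observed covariance with divisor N
```

Here `S` plays the role of the observed covariance matrix that estimates $\frac{1}{N}\Sigma$; its cost grows with $N$ and $Q$ rather than with the $2^J$ response patterns.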

2.5 Estimation of the Asymptotic Distribution for $Q^*_{DM}$

The true asymptotic distribution of $Q^*_{DM}$ is a function of the covariance matrix $\Sigma$ of pseudocounts. The relation of the asymptotic distribution to the covariance $\Sigma$ runs through the non-zero eigenvalues $\lambda$ obtained from the matrix $\Sigma$. The asymptotic distribution of $Q^*_{DM}$ can be written as $\sum_{i=1}^{m} \lambda_i \chi^2_{(1)}$, where the nonzero coefficients $\lambda_i$ come from the matrix $\Sigma$. Denote the non-zero eigenvalues from the observed covariance $S$ as $\hat{\lambda}_i$. Then the difference between the true asymptotic distribution and the estimated asymptotic distribution is $\sum_{i=1}^{m} (\hat{\lambda}_i - \lambda_i)\chi^2_{(1)}$. It is easy to see that the estimated distribution is arbitrarily close to the true asymptotic distribution as long as the $\hat{\lambda}_i$ are arbitrarily close to the true $\lambda_i$. As $N \to \infty$, due to the consistency of $\hat{\Sigma}$ for $\Sigma$, $\hat{\lambda}_i$ is arbitrarily close to $\lambda_i$, $i = 1, 2, \dots, m$.

For the estimate of the asymptotic distribution of $Q$, replace the covariance matrix $\tilde{\Sigma}$ in the middle of $(\tilde{s} - E\tilde{s})'\tilde{\Sigma}^{-1}(\tilde{s} - E\tilde{s})$ with its consistent estimator. Then the asymptotic distribution of the estimate is arbitrarily close to its true asymptotic distribution. Therefore, the asymptotic distributions of the fit measures $Q^*_{DM}$ and $Q$ computed with the true covariance matrix among pseudocounts are the same as the asymptotic distributions of $Q^*_{DM}$ and $Q$, respectively, computed with the observed covariance matrix of interrelations among pseudocounts as the consistent estimator of the true covariance matrix.
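A minimal sketch of the estimation step follows: the observed covariance $S$ of the unit pseudocounts replaces $\frac{1}{N}\Sigma$, the estimated coefficients $\hat{\lambda}_i$ are its scaled eigenvalues, and the tail probability is approximated by Monte Carlo simulation as a stand-in for the Davies (1980) routine. The function name, item parameters, sample size, and 21-point quadrature are hypothetical.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric without 1.7 scaling: an assumption)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def qdm_with_estimated_reference(X, w, f1, n_draws=20_000, seed=0):
    """Q*_DM, estimated eigenvalues, and a Monte Carlo p-value from unit pseudocounts X."""
    n_examinees = X.shape[0]
    e = np.concatenate([w * f1, w * (1.0 - f1)])            # E s / N = w_q f_qj
    s = X.sum(axis=0)                                        # pseudocounts
    q_stat = float(np.sum((s - n_examinees * e) ** 2 / (n_examinees * e)))
    S = np.cov(X, rowvar=False, bias=True)                   # estimates Sigma / N
    A_hat = S / np.sqrt(np.outer(e, e))                      # D_Es^{-1/2} (N S) D_Es^{-1/2}
    lam_hat = np.linalg.eigvalsh((A_hat + A_hat.T) / 2.0)
    lam_hat = lam_hat[lam_hat > 1e-10]                       # keep non-zero eigenvalues
    rng = np.random.default_rng(seed)
    ref = rng.chisquare(1, size=(n_draws, lam_hat.size)) @ lam_hat
    return q_stat, lam_hat, float(np.mean(ref >= q_stat))

# Hypothetical example: studied item (index 0) plus five remaining items.
rng = np.random.default_rng(7)
Q, N = 21, 1000
theta_q = np.linspace(-4.0, 4.0, Q)
w = np.exp(-0.5 * theta_q**2)
w /= w.sum()
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9, 1.3])
b = np.array([0.3, -0.5, 0.0, 1.0, -1.0, 0.5])
c = np.array([0.15, 0.10, 0.20, 0.05, 0.12, 0.08])
prob_q = irf_3pl(theta_q[None, :], a[:, None], b[:, None], c[:, None])   # (6, Q)

# Abilities drawn from the discrete quadrature distribution so the model holds exactly.
ability = rng.choice(theta_q, size=N, p=w)
resp = (rng.random((N, a.size)) < irf_3pl(ability[:, None], a, b, c)).astype(int)

loglik = resp @ np.log(prob_q) + (1 - resp) @ np.log(1.0 - prob_q)
post = w * np.exp(loglik)
post /= post.sum(axis=1, keepdims=True)
u = resp[:, 0]
X = np.hstack([u[:, None] * post, (1 - u)[:, None] * post])

q_stat, lam_hat, p_value = qdm_with_estimated_reference(X, w, prob_q[0])
```

The sketch uses the true item response function of the studied item for $f_{q1}$, which corresponds to the known-parameter case in the text; substituting estimated parameters would correspond to the data-based case discussed next.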

Assuming that item parameters are known constants is not realistic in many applications. The remainder of this section investigates the relation between item parameter estimates and the asymptotic distribution of the reformulated item fit measure $Q^*_{DM}$ when item parameters are estimated from the data.

Since the item response functions $f_{qj}$ are continuous functions of the item parameters given each quadrature point $\theta_q$, $q = 1, 2, \dots, Q$ and $j = 0, 1$, $f_{qj}(\hat{a}, \hat{b}, \hat{c}, \theta_q) \to f_{qj}(a, b, c, \theta_q)$ in probability as $N \to \infty$ (in short, $\hat{f}_{qj} \to f_{qj}$) if both item and ability parameter estimates are consistent (e.g., p124, Rao, 1976; p74, 8.4, Lehmann and Casella, 1998). It is also not hard to demonstrate that $\hat{l}_{tq} \to l_{tq}$ in probability, $\hat{\pi}_{tj} \to \pi_{tj}$ in probability, $\hat{s}_{qj} \to s_{qj}$ in probability, and $\hat{E}s_{qj} \to E s_{qj}$ in probability, $q = 1, 2, \dots, Q$, $j = 0, 1$, and $t = 1, 2, \dots, T$. In the same way, the estimates of $Q^*_{DM}$ and $Q$ tend to the true $Q^*_{DM}$ and $Q$, respectively, in probability. Moreover, by the convergence together theorem (e.g., p122, Rao, 1976; p91, Durrett, 1996), the estimates of $s_{qj}$, $Q^*_{DM}$, and $Q$ have the same asymptotic distributions as $s_{qj}$, $Q^*_{DM}$, and $Q$, correspondingly, $q = 1, 2, \dots, Q$, $j = 0, 1$. Therefore, provided that consistent estimates of the item parameters are available, the results on the item fit measure $Q^*_{DM}$ and its asymptotic distribution can in theory be extended to the situation in which item parameters are data-based estimates.

Chapter 3

Simulation Studies on Item Fit

Several simulation studies on the item fit measure $Q^*_{DM}$ are presented in this chapter. One important purpose of the simulation studies is to examine how large the additional errors induced by approximating the asymptotic distribution from the observed covariance might be, compared to the true asymptotic distribution, and to find out what conditions make the approximation practically useful. To investigate the accuracy of the approximation, a test consisting of 15 items is simulated. Such a short test is chosen because most personal computers can handle the computation involving all possible response patterns of 15 items, which is required for computing the true asymptotic probability. For dichotomously scored responses, there are $2^{15} = 32768$ possible response patterns in all. Thus, the true asymptotic distribution, the approximation of the true asymptotic distribution based on the observed covariance matrix of interrelations among pseudocounts, and the approximation based on data-based item parameter estimates can be compared to each other. For a longer test (e.g., a 30-item test), the number of possible response patterns may be too large (e.g., 1073741824 for a 30-item test) to compute the true asymptotic probabilities.

Without the true asymptotic probabilities, it is difficult to have an intuitive sense of how good the approximation is. The comparison of the true parameters and parameter estimates (e.g., the true covariance among pseudocounts versus the covariance estimate, the true asymptotic probability versus the approximation, the true eigenvalues versus the estimated eigenvalues from the observed covariance matrix, the true item parameters versus the item parameter estimates) is viewed as an oracle analysis. In applications, there is no need to enumerate all possible response patterns for the sake of the true covariance matrix among pseudocounts if the approximation is sufficiently close to the true value, or the induced errors are negligible for practical use. The true covariance matrix among pseudocounts and the true asymptotic distribution for a given $Q^*_{DM}$ are computed here merely to provide a benchmark against which one can see how good the approximation is. With this asymptotic method and approximation approach, there should be no practical concerns about the computation of item fit analysis for longer tests. Therefore, the method is not limited to short tests. It can be applied to longer tests as long as the sample size is large enough for the approximation to work well.

Three different sample sizes are chosen for this study, in an attempt to provide a guideline on how large a sample is sufficient for this asymptotic method to work well. The item parameters for the 15 items are also computer generated. Discriminating power parameters are simulated from the uniform distribution on the range .6 to 2.6, i.e., U(.6, 2.6), difficulty parameters are generated from the standard normal distribution N(0, 1), and the asymptote parameters are from the uniform distribution U(0, .25). Table 3.1 contains all of the true parameter values for the 15 items.

Table 3.1: True Item Parameters for the Test of 15 Items

Item   Discrimination a   Difficulty b   Asymptote c
  1          .672             1.410          .177
  2         1.652             1.493          .013
  3          .747              .935          .005
  4         1.486             1.706          .165
  5         1.286              .967          .080
  6         1.357              .820          .086
  7         1.140             -.411          .159
  8         1.107             1.060          .083
  9         1.465              .388          .085
 10          .920             1.643          .145
 11          .740             -.668          .173
 12          .803             1.125          .040
 13         1.407             -.451          .067
 14          .662              .077          .124
 15         1.845             1.166          .148

Three groups of examinees are generated from N(0, 1) with sample sizes 500, 1000, and 5000, which represent small, medium, and large samples, respectively. For each sample, dichotomous response data are simulated from 3PL IRT models. To account for the randomness in the response data, 1000 replications are conducted for each sample size. More specifically, the 15-item test is administered to 1000 groups of examinees of size 500 each drawn from N(0, 1), to 1000 groups of size 1000, and to 1000 groups of size 5000. Combining the sample size and replication conditions, 3000 data sets in all are generated for the simulation studies.
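The simulation design just described can be sketched in a few lines. The seed and the function name are hypothetical, the logistic form without a 1.7 scaling constant is an assumption, and the parameters drawn here will differ from those in Table 3.1; only one data set per sample size is generated to keep the example light.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic metric without 1.7 scaling: an assumption)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def simulate_responses(n_examinees, a, b, c, rng):
    """Draw abilities from N(0, 1) and dichotomous responses from the 3PL model."""
    theta = rng.normal(0.0, 1.0, n_examinees)
    prob = irf_3pl(theta[:, None], a[None, :], b[None, :], c[None, :])
    return (rng.random(prob.shape) < prob).astype(int)

rng = np.random.default_rng(2005)           # arbitrary seed
n_items = 15
a = rng.uniform(0.6, 2.6, n_items)          # discrimination ~ U(.6, 2.6)
b = rng.normal(0.0, 1.0, n_items)           # difficulty ~ N(0, 1)
c = rng.uniform(0.0, 0.25, n_items)         # asymptote ~ U(0, .25)

# Number of possible dichotomous response patterns for 15 and 30 items.
n_patterns_15 = 2 ** n_items
n_patterns_30 = 2 ** 30

sample_sizes = (500, 1000, 5000)
n_replications = 1000
n_data_sets = len(sample_sizes) * n_replications

# One replication per sample size; a full study would loop n_replications times.
first_data = {n: simulate_responses(n, a, b, c, rng) for n in sample_sizes}
```

Enumerating all patterns is feasible for 15 items (32768 patterns) but not for 30 items (1073741824 patterns), which is why the short test is used as the benchmark.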


3.1 Type I Error Rates

To allow comparisons, the true asymptotic distribution, the approximation of the true asymptotic distribution based on the observed covariance matrix of interrelations among pseudocounts, and the estimated asymptotic distribution based on item parameter estimates are all computed along with the corresponding item fit measure $Q^*_{DM}$. Type I error rates are calculated and compared across the different sample sizes (500, 1000, and 5000). Under the null hypothesis that the response data simulated from the 3PL model fit the hypothesized 3PL model (in this example, the same mathematical model, the 3PL, is assumed for all items in the short test), the observed item fit measure $Q^*_{DM}$ is asymptotically distributed as a quadratic form of normal variables, as addressed in Chapter 2. For a given observed item fit statistic $Q^*_{DM}$, the asymptotic probability of observing such a value or greater can be evaluated through the routine by Davies (1980). For each item, count the number of replications in which the hypothesized item model is rejected. If the number is greater than 50 out of 1000 replications (i.e., the type I error rate is greater than .05), the type I error rate is said to be greater than expected. Otherwise, the type I error rate is acceptable. Tables 3.2 through 3.4 show the type I error rates for each item in the test over 1000 replications for the three different sample sizes (500, 1000, and 5000).
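The counting rule above amounts to the proportion of replications with a p-value below the nominal level. The sketch below uses hypothetical uniform p-values in place of real simulation output, and also reports the Monte Carlo standard error of a rate at the nominal level, which is an addition not discussed in the text.

```python
import numpy as np

def type1_error_rate(p_values, alpha=0.05):
    """Empirical rejection rate and the Monte Carlo standard error at the nominal level."""
    p = np.asarray(p_values, dtype=float)
    rate = float(np.mean(p < alpha))                       # share of replications rejected
    se_nominal = float(np.sqrt(alpha * (1.0 - alpha) / p.size))
    return rate, se_nominal

# Hypothetical p-values for one item over 1000 replications under the null hypothesis.
rng = np.random.default_rng(3)
p_values = rng.uniform(0.0, 1.0, 1000)
rate, se_nominal = type1_error_rate(p_values)
n_rejections = int(round(rate * p_values.size))            # compare with the threshold of 50
```

With 1000 replications the standard error at the .05 level is about .007, so an observed rate such as .054 is well within sampling error of the nominal level.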

A good model-data fit test requires a low type I error rate. The lower the type I error rate, the fewer correct hypotheses are mistakenly rejected.

Table 3.2: Type I Error Rate for Sample Size 500

               Type I Error                          RMSE
Item   True(Full)  True(Appr.)  Item Esti.      a       b       c
  1       .026        .023         .000        .187    .254    .040
  2       .029        .029         .006        .401    .190    .019
  3       .016        .017         .002        .254    .257    .089
  4       .020        .023         .001        .506    .276    .023
  5       .011        .013         .000        .228    .153    .030
  6       .016        .019         .000        .234    .135    .030
  7       .023        .024         .002        .169    .093    .035
  8       .018        .020         .001        .200    .178    .033
  9       .011        .014         .001        .231    .108    .040
 10       .026        .031         .000        .221    .247    .026
 11       .023        .025         .003        .115    .136    .036
 12       .017        .013         .000        .233    .224    .062
 13       .017        .018         .008        .630    .112    .092
 14       .031        .035         .001        .148    .243    .081
 15       .019        .020         .000       1.148    .152    .023

 

 

 

 

 

 

 

Table 3.3: Type I Error Rate for Sample Size 1000

               Type I Error                          RMSE
Item   True(Full)  True(Appr.)  Item Esti.      a       b       c
  1       .027        .023         .000        .162    .174    .035
  2       .012        .012         .000        .292    .092    .012
  3       .012        .012         .000        .208    .180    .070
  4       .014        .012         .000        .303    .146    .018
  5       .016        .013         .000        .192    .098    .022
  6       .019        .017         .000        .203    .092    .022
  7       .024        .020         .003        .123    .082    .032
  8       .010        .010         .000        .184    .105    .025
  9       .011        .010         .000        .179    .089    .030
 10       .033        .029         .000        .199    .148    .022
 11       .020        .020         .001        .087    .122    .037
 12       .020        .016         .000        .192    .146    .046
 13       .018        .013         .003        .193    .113    .075
 14       .025        .023         .000        .118    .211    .070
 15       .020        .019         .000        .317    .085    .018
 

 

 

 

 

 

 

Table 3.4: Type I Error Rate for Sample Size 5000

               Type I Error                          RMSE
Item   True(Full)  True(Appr.)  Item Esti.      a       b       c
  1       .054        .045         .000        .079    .083    .023
  2       .020        .019         .000        .130    .043    .004
  3       .049        .045         .001        .099    .093    .035
  4       .040        .034         .000        .175    .061    .008
  5       .039        .036         .000        .093    .046    .011
  6       .037        .032         .000        .096    .042    .011
  7       .036        .033         .000        .070    .062    .035
  8       .045        .038         .000        .084    .051    .012
  9       .029        .026         .000        .090    .042    .014
 10       .050        .040         .000        .099    .072    .013
 11       .047        .043         .000        .047    .097    .037
 12       .035        .032         .000        .077    .065    .019
 13       .020        .020         .001        .095    .067    .034
 14       .044        .039         .000        .056    .114    .038
 15       .026        .024         .000        .176    .042    .009

 

 

 

 

 

 

 

 

 

To examine the Type I error rates of the item fit test, the data are generated from a
particular mathematical model (e.g., the 3PL model) and then fit with the same item
model, so the null hypothesis is known to be correct. If the item fit test performs
properly, it should retain the correct hypothesis except for an acceptable rate of
rejections (Type I errors) due to randomness. The Type I error rates in the tables
are calculated from 1000 replications for each sample size. Tables 3.2 through
3.4 give the Type I error rates when the item parameters are known (denoted as “Full”
and “Appr.”) and when the item parameters are estimated from the response data
(denoted as “Item Esti.”), along with the root mean square error (denoted as “RMSE”)
of each item parameter estimate.
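The rejection rates reported in the tables can be computed along the following lines. This is a minimal sketch, not the author's program: the function names and the example p-values are hypothetical, and a nominal level of .05 is assumed.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose asymptotic probability is below alpha."""
    if not p_values:
        raise ValueError("need at least one replication")
    return sum(p < alpha for p in p_values) / len(p_values)


def monte_carlo_se(rate, n_replications):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_replications)


# Hypothetical p-values from five replications of a correctly specified model.
example_p = [0.62, 0.03, 0.48, 0.91, 0.27]
print(rejection_rate(example_p))             # one of five values is below .05
print(round(monte_carlo_se(0.05, 1000), 4))  # sampling error with 1000 replications
```

With 1000 replications the standard error of a rejection rate near .05 is about .007, so an observed rate such as .054 lies within sampling error of the nominal level.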


It can be seen from the three tables (Tables 3.2, 3.3, and 3.4) that the Type I error
rates are low across sample sizes, generally below the .05 level of significance.
Only one item (the first item in the N = 5000 case in Table 3.4) has a Type I error
rate, .054, slightly above the significance level under the true asymptotic
distribution.

One major feature of the Type I error rates in the tables is that, when the item
parameters are known constants, the Type I error rates based on the true asymptotic
distribution are close to their counterparts from the approximation based on the
observed covariance matrix. However, the Type I error rates based on the data-based
item parameter estimates are in general lower than those based on the true item
parameters and are very conservative regardless of the sample size. For most of the
items, these Type I error rates are close to zero.

As with any IRT estimation program, item parameter estimates contain estimation
error even if the data adequately fit the mathematical model used for the
estimation. To examine the conservative performance of $Q^*_{DM}$ when the item
parameters are estimated, the root mean square errors (RMSE) of the item parameter
estimates are calculated from the data sets. RMSE is defined as the square root of
the mean squared difference between the item parameter estimates and the true item
parameter over r replications (r = 1000 in this example). Let $\eta$ denote an item
parameter (e.g., the discriminating power parameter a, the difficulty parameter b,
or the asymptote parameter c) and $\hat{\eta}_i$ its estimate in replication i. Then
RMSE can be calculated by

\[ \mathrm{RMSE}(\hat{\eta}) = \sqrt{\frac{1}{r} \sum_{i=1}^{r} (\hat{\eta}_i - \eta)^2}. \]


RMSE provides a summary index of the accuracy of the item parameter estimates: the
larger the RMSE, the less accurate the estimation. In a simulation study the model
is known to fit the data, so differences between the estimates and the true item
parameters depend on the estimation procedure and on other factors (e.g., the
sample size of examinees).
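The RMSE described above (the square root of the mean squared difference between the estimates and the true parameter over replications) can be sketched as follows. The function name and the example estimates are hypothetical and serve only as an illustration.

```python
import math


def rmse(estimates, true_value):
    """Root mean square error of replicated estimates around the true parameter."""
    if not estimates:
        raise ValueError("need at least one replication")
    mean_sq = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mean_sq)


# Hypothetical estimates of a discrimination parameter whose true value is 1.0.
a_hats = [1.10, 0.95, 1.20, 0.85]
print(round(rmse(a_hats, 1.0), 4))
```

In the simulation design described here, the list would hold the 1000 replicated estimates of one parameter of one item.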

Tables 3.2 through 3.4 also contain the RMSE over 1000 replications for each item
parameter in the test. The estimation procedure used in this study in BILOG-MG3 is
Bayesian MML with the default item prior distributions, that is, a ~ lognormal(0, 0.5),
b ~ N(0, 2), and c ~ beta(5, 17). The three tables show that the RMSE decreases as
the sample size increases, indicating that better item parameter estimates are
obtained, as expected. In general, the RMSE is largest for the sample size of 500
and smallest for the sample size of 5000. For the same sample size, the RMSE of the
discriminating power parameter is in general larger than that of the difficulty and
asymptote parameters.

3.2 Coefficients for the Asymptotic Distributions

Since the asymptotic distribution depends on the coefficients in the linear
combination, i.e., the eigenvalues extracted from the covariance matrix of the
pseudocounts, it is important to compare the coefficients from three sources across
sample sizes: the true covariance matrix (the full covariance matrix that comes from
evaluating all possible response patterns for a given test), the approximation of
the covariance matrix, and the estimated covariance matrix based on the data-based
item parameter estimates. The purpose of comparing these coefficients is to examine
how much additional error is induced through the coefficients of the asymptotic
distribution. Table 3.5 through Table 3.11 contain the 20 ordered positive
eigenvalues extracted from the true covariance matrix (Table 3.5) and from the
observed covariance matrix of the pseudocounts (Table 3.6 through Table 3.8 are
based on the true item parameters; Table 3.9 through Table 3.11 on the data-based
item parameter estimates). In these tables, the rows represent the 20 positive
eigenvalues and the columns indicate the 15 items in the test. The other extracted
eigenvalues are omitted and not used for calculating the asymptotic probabilities
because of their trivial magnitudes. Note that the values in these tables are from
one replication; similar results are obtained from the other 999 replications and
hence are not reported here. The 20 ordered positive eigenvalues from the true
covariance matrix, which depends only on the number of items, are used to compute
the true asymptotic distributions; those extracted from the observed covariance
matrix of the pseudocounts based on the true item parameters are used to compute
the approximation of the asymptotic probabilities; and those from the observed
covariance matrix of the pseudocounts based on the item parameter estimates are the
coefficients for computing the estimated asymptotic probabilities.
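To illustrate how a set of eigenvalues turns into an asymptotic probability, the sketch below estimates the upper-tail probability of a weighted sum of chi-square variables by Monte Carlo simulation. It is a stand-in for illustration only and not the exact numerical routine used in this study; it assumes each chi-square variable has one degree of freedom, and the function name and the three coefficients are hypothetical.

```python
import random


def weighted_chisq_tail(coefficients, observed, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(sum_j lambda_j * Z_j**2 > observed), Z_j ~ N(0, 1)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        total = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in coefficients)
        if total > observed:
            exceed += 1
    return exceed / n_draws


# Hypothetical coefficients (eigenvalues) and a hypothetical observed statistic.
lambdas = [0.9, 0.5, 0.2]
print(weighted_chisq_tail(lambdas, observed=4.0))
```

Because the coefficients enter only through the weighted sum, small differences between estimated and true eigenvalues translate into small differences in the resulting probability.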

It can be seen from these tables that, for each sample size, the 20 eigenvalues
from the estimated covariance matrix are very close to their counterparts from the
true covariance matrix, whether the item parameters are true constants or
data-based estimates. These values are the estimated coefficients for the linear
combination of chi-square random variables, which are eventually used to calculate
the asymptotic probabilities. Except for the 20 values from the true covariance
matrix in Table 3.5, the coefficients in Table 3.6 through Table 3.11, which come
from observed covariance matrices of pseudocounts, are data-based estimates and
vary as the data change, and so do the resulting approximations of the true
asymptotic probabilities. For example, over 1000 replications of the 15-item test
with 500 examinees, there are 1000 different observed covariance matrices of
pseudocounts, and correspondingly the 20 ordered positive eigenvalues extracted
from these matrices vary across data sets. However, the differences between the 20
ordered positive eigenvalues extracted from each observed covariance matrix of
pseudocounts and their true counterparts are so small that the approximation is
close to the true asymptotic distribution even for the small sample size of 500.
Similar results are found for the sample sizes of 1000 and 5000. In addition, as
the sample size increases, the observed covariance matrix of pseudocounts becomes
closer to the true covariance matrix of pseudocounts, and hence the approximation
gets closer to the true asymptotic distribution.

Table 3.5: The 20 Positive Eigenvalues from True Covariance Matrix

Table 3.7: 20 Eigenvalues for True Item Parameters (N = 1000)

Table 3.8: 20 Eigenvalues for True Item Parameters (N = 5000)

Table 3.9: 20 Eigenvalues for Item Parameter Estimates (N = 500)

3.3 Item Misfit and Power with Known Item Parameters

A good significance test also requires high power for detecting model-data misfit.
The higher the power of a hypothesis test, the higher the probability of rejecting
the null hypothesis when it is actually false. In this section, power is not
computed analytically but is estimated empirically from simulated data. For
instance, the 3PL model is used to generate dichotomous response data, and the data
are then fit with the 2PL or the 1PL model. The Type I error rate is expected to be
low when the data are fit with the 3PL model, but the power is expected to be high
when the data are fit with the 2PL or 1PL model over 1000 replications. Similarly,
low Type I error rates are expected when dichotomous response data generated by the
2PL model are fit with the hypothesized 2PL model over 1000 replications, whereas
the power is expected to be high when the data are fit with the hypothesized 1PL
model. Tables 3.12 through 3.14 show the power at the nominal level for all items
in the test for different sample sizes, provided all item parameters are known
constants.

Table 3.10: 20 Eigenvalues for Item Parameter Estimates (N = 1000)
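The data-generating step of such a power study can be sketched as follows. This is an illustration under stated assumptions rather than the program used in the study: the logistic form of the 3PL model is written with a scaling constant D that defaults to 1.0 (whether the study used D = 1.7 is not stated here), and the function names and item parameters are hypothetical.

```python
import math
import random


def p_correct(theta, a, b, c=0.0, D=1.0):
    """3PL probability of a correct response; c = 0 gives the 2PL, a = 1 and c = 0 the 1PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(thetas, items, seed=0):
    """Dichotomous response matrix: one row per examinee, one column per item."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_correct(t, *item) else 0 for item in items]
            for t in thetas]


ability_rng = random.Random(1)
thetas = [ability_rng.gauss(0.0, 1.0) for _ in range(500)]  # N(0, 1) examinees
items = [(1.2, -0.5, 0.2), (0.8, 0.3, 0.15)]                # hypothetical (a, b, c)
data = simulate_responses(thetas, items)
print(len(data), len(data[0]))
```

Fitting a 2PL or 1PL model to data generated this way, and recording how often the item fit test rejects over the replications, gives the empirical power described above.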

From Table 3.12, it can easily be seen that fitting the data generated by the 3PL
model with the hypothesized 2PL or 1PL model is not adequate when the item
parameters are known. In most of the 1000 replications the incorrect hypothesis is
rejected, which is the correct decision for the model-data misfit test. From the
perspective of hypothesis testing, the null hypothesis (e.g., H0: the data fit the
hypothesized 2PL or 1PL model) is rejected in almost all of the 1000 replications
when the data are actually generated by the 3PL model under the condition of true
item parameters. A rejection rate of 1 means that the incorrect hypothesis is
correctly rejected in every replication across the three sample size conditions
(500, 1000, and 5000); that is, the hypothesis tests for model-data misfit have
perfect power.

Table 3.11: 20 Eigenvalues for Item Parameter Estimates (N = 5000)

Table 3.12: The Power for Test Data Generated by 3PL Model with True Item
Parameters

Table 3.13: The Power for Test Data Generated by 2PL Model with True Item
Parameters

Table 3.14: The Power for Test Data Generated by 1PL Model with True Item
Parameters

Similarly, Table 3.13 shows high power for testing the hypothesis of fitting the
data generated by the 2PL model with the 3PL model, and Table 3.14 shows adequate
power for testing the hypothesis of fitting the data generated by the 1PL model with
the 3PL model, regardless of the sample size, provided the item parameters are
known constants.

It is known that power is a function of the sample size: as the sample size
increases, power also increases. This feature is apparent in Tables 3.13 and 3.14
when the same hypothesis test is compared across the three sample sizes (500, 1000,
and 5000). For example, in Table 3.13, for testing the hypothesis that the item
model for item 10 is the 1PL model using data that are actually generated by the
2PL model, the power is .005 at the sample size of 500, .117 at 1000, and .986 at
5000.

However, the power is not uniformly high across items, in particular for the sample
size of 500, when testing the hypothesis that the correct model is the 1PL model
using data generated by the 2PL model, provided that the item parameters are known
constants. For example, in the third column of Table 3.13, item 3 through item 14
have very low power at the sample size of 500. In fact, the power varies as the
value of the a parameter changes from item to item.

The results in Tables 3.13 and 3.14 also indicate that when the data are fit with a
model that has more item parameters than the data generating model (e.g., in Table
3.13, fitting data generated by the 2PL model with the 3PL model, and in Table 3.14,
fitting data generated by the 1PL model with the 3PL or 2PL model), the power is
generally high provided the item parameters are known constants (except for item 8
and item 10 in Table 3.14).

3.4 Item Misfit and Power with Item Parameter Estimates

The simulation study in this section is similar to the preceding one on power,
except that the item parameters are not known constants but data-based estimates.
The response data are generated by the 3PL model (a known fact in the simulation
study) and are then fit with the 3PL, 2PL, and 1PL models, respectively, on the
basis of item parameter estimates. Low Type I error rates over 1000 replications
would be expected for testing the hypothesis that the data fit the 3PL model when
the 3PL model is used to estimate the item parameters, and high rejection rates, or
power, would be expected for testing the hypothesis of the other models (the 2PL or
1PL) when the item parameters are estimated with the 2PL or 1PL model. In addition,
as seen in the preceding section, the power would also be expected to increase as
the sample size increases. Table 3.15 through Table 3.17 show the power on the
basis of item parameter estimates under three different data generating conditions.

Table 3.15: The Power for Test Data Generated by 3PL Model with Item Parameter
Estimates
One apparent characteristic in the three tables (Table 3.15 through Table 3.17) is
that the power increases as the sample size increases. For example, when the sample
size increases to 5000, the power reaches 1 at the nominal level for testing the
hypothesis of the 2PL or 1PL model using data generated by the 3PL model (Table
3.15), or for testing the hypothesis of the 1PL model using data generated by the
2PL model (Table 3.16). Another expected feature is that the power is generally
greater when testing the hypothesis of the 2PL model (i.e., H0: the correct model is
2PL) than when testing the hypothesis of the 1PL model (i.e., H0: the correct model
is 1PL) given the same sample size (column 1 versus column 2 for the sample size of
500; column 3 versus column 4 for the sample size of 1000). For the sample size of
500, there is not enough power for testing the hypotheses of the 2PL and the 1PL
models using data generated by the 3PL model, except for a small number of items
(e.g., in testing the hypothesis of the 1PL model, item 1, item 2, item 4, and item
10 seem to have adequate power, greater than or close to .80). When the sample size
increases to 1000, the tests of both hypotheses (i.e., H0: the correct model is the
2PL model, or H0: the correct model is the 1PL model) reach power of about .90 or
greater, except for item 3, item 11, and item 12 when testing the hypothesis that
the correct model is the 1PL model.

Table 3.16: The Power for Test Data Generated by 2PL Model with Item Parameter
Estimates

Table 3.17: The Power for Test Data Generated by 1PL Model with Item Parameter
Estimates

In Table 3.16, the power for testing the hypothesis that the correct model is the
1PL model using data generated by the 2PL model is less than .5 when the sample
size is 500, and 8 items (item 4 through item 8, item 10, item 12, and item 13) have
power less than .5 for the same hypothesis even when the sample size increases to
1000. In general, there is not enough power for testing the hypothesis of the 1PL
model using data generated by the 2PL model when the item parameters are data-based
estimates, in particular when the a parameters in the 2PL model are close to 1.
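One way to see why power is low when the a parameters are close to 1 is to compare the two response curves directly. The sketch below assumes, for illustration only, that the 1PL curve has its discrimination fixed at 1 and shares the difficulty b with the 2PL curve; the function names and the grid are hypothetical.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def max_curve_gap(a, b=0.0, lo=-4.0, hi=4.0, steps=161):
    """Largest |P_2PL - P_1PL| over an ability grid, with the 1PL slope fixed at 1."""
    gap = 0.0
    for k in range(steps):
        theta = lo + (hi - lo) * k / (steps - 1)
        diff = abs(logistic(a * (theta - b)) - logistic(theta - b))
        gap = max(gap, diff)
    return gap


for a in (1.0, 1.1, 1.5, 2.0):
    print(a, round(max_curve_gap(a), 3))
```

The gap is zero at a = 1 and grows as a moves away from 1, so under this assumption items with a near 1 give the test little misfit to detect.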

As expected, the power is low when the hypothesized model has more item parameters
than the data generating model. For example, in Table 3.16, the power is low for
H0: the correct model is the 3PL model when the data are generated by the 2PL model,
no matter what the sample size is. That is to say, in most of the 1000 replications
the item fit analysis does not have enough power to reject the hypothesis that the
data generated by the 2PL model fit the 3PL model. Similarly, Table 3.17
demonstrates that the item fit analysis does not have enough power to reject the
hypothesis that the correct model is the 3PL or 2PL model using data generated by
the 1PL model when the item parameters are data-based estimates.

3.5 True Asymptotic Distribution Versus the Approximation

A plot of the true asymptotic probabilities, based on the full covariance matrix,
against the approximated probabilities, based on the observed covariance matrix of
the pseudocounts, shows directly how well the approximation works across sample
sizes: points lying along the reference line y = x indicate small differences
between the true and approximated values. The plots for different sample sizes may
also provide practical recommendations as to how large a sample is required for an
adequate approximation. Figures 3.1 through 3.3 plot the true asymptotic
probabilities against their approximations for item 1, item 3, item 5, and item 7 in
the 15-item test over 1000 replications for the three sample sizes (500, 1000, and
5000). Plots for the other items are omitted here since their results are very
close to those for these items.

As can be seen from the three figures (Figure 3.1 through Figure 3.3), the points
spread widely around the middle of the reference line for the sample size of 500,
become narrower for the sample size of 1000, and form almost a straight line when
the sample size increases to 5000. The approximation based on the observed
covariance matrix of the pseudocounts works well for the sample sizes of 1000 and
5000, while the results are somewhat dispersed for 500 examinees. For a short test
with a small sample size (e.g., 500), it is advised to use the true asymptotic
probability instead of the approximated one.

Figure 3.1: True Asymptotic Probabilities Versus Approximation (N = 500)
[Panels: Item 1, Item 3, Item 5, Item 7. Horizontal axis: True Asymptotic
Probabilities. Vertical axis: Approximation.]

Figure 3.2: True Asymptotic Probabilities Versus Approximation (N = 1000)
[Panels: Item 1, Item 3, Item 5, Item 7. Horizontal axis: True Asymptotic
Probabilities. Vertical axis: Approximation.]

Figure 3.3: True Asymptotic Probabilities Versus Approximation (N = 5000)
[Panels: Item 1, Item 3, Item 5, Item 7. Horizontal axis: True Asymptotic
Probabilities. Vertical axis: Approximation.]

3.6 Sensitivity Analysis

3.6.1 Non-normal Proﬁciency Populations

Psychometricians are interested in whether a method developed in one context
applies to other psychological and educational testing contexts. For example, in
the studies above, the examinees are assumed to come from a standard normal
population (i.e., N(0, 1)), as is typical in simulation studies. How does the
method work with a non-normal population? This is a practical issue for many tests,
in which the examinees do not follow an exact standard normal distribution. This
study examines the effects of the ability population distribution on the asymptotic
method developed in Chapter 2.

To investigate the potential effects of the underlying ability distribution, the
population is chosen to be a four-parameter Beta distribution ranging from -4 to 4.
One reason for choosing the four-parameter Beta distribution is that its shape and
range are relatively convenient to manipulate. The following paragraphs briefly
introduce the expectation, variance, and probability density function of the
distribution. The Type I error rates are then examined for the 15-item test above,
but with a non-normal population following a four-parameter Beta distribution.

The four-parameter Beta distribution, denoted as $B(\alpha, \beta, L, U)$, is
determined by two shape parameters ($\alpha$ and $\beta$) and two range parameters
(the lower limit $L$ and the upper limit $U$ of the distribution). Let $x$ be a
random variable from $B(\alpha, \beta, L, U)$, i.e., $x \sim B(\alpha, \beta, L, U)$,
$L < x < U$. Then the density is given by

\[ f(x) = \frac{1}{(U - L)^{\alpha+\beta-1}\,\mathrm{Beta}(\alpha, \beta)} (x - L)^{\alpha-1} (U - x)^{\beta-1}, \]

where $\mathrm{Beta}(\alpha, \beta)$ is the Beta function defined for $\alpha > 0$,
$\beta > 0$ by

\[ \mathrm{Beta}(\alpha, \beta) = \int_0^1 u^{\alpha-1} (1 - u)^{\beta-1}\,du. \]

The expectation and variance of $x$ can be expressed as

\[ E(x) = \frac{\alpha U + \beta L}{\alpha + \beta}, \qquad
   \mathrm{Var}(x) = \frac{(U - L)^2\,\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \]

If $\alpha = \beta$, then $x$ has a symmetric distribution within its lower and upper
limits. For $\beta > \alpha > 0$, $x$ has a positively skewed distribution; for
$\alpha > \beta > 0$, $x$ has a negatively skewed distribution. For $L = 0$, $U = 1$,
the four-parameter Beta distribution reduces to the regular Beta distribution that
is often presented in basic statistics textbooks. In particular, for
$\alpha = \beta = 1$ and $L = 0$, $U = 1$, $x$ follows a uniform distribution on
(0, 1).
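The density, moments, and sampling of the four-parameter Beta distribution can be sketched as follows. The function names are hypothetical, and sampling by rescaling a standard Beta variate is one convenient choice for illustration, not necessarily the method used in the study.

```python
import math
import random


def beta_fn(alpha, beta):
    """Beta function expressed through gamma functions."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)


def beta4_pdf(x, alpha, beta, low, high):
    """Density of the four-parameter Beta distribution on (low, high)."""
    if not low < x < high:
        return 0.0
    scale = (high - low) ** (alpha + beta - 1) * beta_fn(alpha, beta)
    return (x - low) ** (alpha - 1) * (high - x) ** (beta - 1) / scale


def beta4_mean(alpha, beta, low, high):
    return (alpha * high + beta * low) / (alpha + beta)


def beta4_var(alpha, beta, low, high):
    return (high - low) ** 2 * alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))


def beta4_sample(n, alpha, beta, low, high, seed=0):
    """Draw n values by rescaling standard Beta(alpha, beta) variates."""
    rng = random.Random(seed)
    return [low + (high - low) * rng.betavariate(alpha, beta) for _ in range(n)]


print(beta4_mean(4, 4, -4, 4), round(beta4_var(4, 4, -4, 4), 4))
print(round(beta4_pdf(0.0, 4, 4, -4, 4), 4))
```

Under these formulas, B(4, 4, -4, 4) has mean 0, variance about 1.78, and density about .27 at 0, compared with variance 1 and density about .40 at 0 for N(0, 1).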

Figure 3.4 compares four-parameter Beta distributions with the standard normal
distribution. B(4, 4, -4, 4) and the standard normal distribution are both
symmetric but obviously have different probability distributions: the density of
B(4, 4, -4, 4) is wider and flatter than that of N(0, 1). Also, unlike
B(4, 4, -4, 4), the standard normal distribution is not restricted to the range
from -4 to 4. The figure also shows that B(2, 4, -4, 4) is a positively skewed
distribution and B(4, 2, -4, 4) a negatively skewed distribution. In this study,
the examinees are assumed to come from B(4, 4, -4, 4), and the results are compared
with those for N(0, 1) to see whether the ability distribution has substantial
effects on the results of the item fit analysis.

Figure 3.4: Beta Distribution versus Standard Normal Distribution
[Curves: Beta4(4,4,-4,4), Beta4(4,2,-4,4), Beta4(2,4,-4,4), N(0,1).]

Table 3.18 shows that even if the underlying ability distribution is not normal,
the item fit test still has low Type I error rates, which are conservative, as in
the case of the standard normal population. Again, Bayesian MML with the default
item prior distributions is used to calibrate all item parameters with the 3PL
model using data generated by the 3PL model. The results show that the method is
robust with respect to the underlying ability distribution, although the item
parameter estimates contain large errors in the case of the non-normal ability
population. Further evidence can be found in the RMSE of each item parameter
estimate in Table 3.19. The RMSE for each item in the test at the three sample
sizes (N = 500, N = 1000, and N = 5000) over 1000 replications are generally larger
than those in the case of the standard normal ability distribution. In Table 3.19,
the first three columns give the RMSE of the discriminating, difficulty, and
asymptote parameters, respectively, for the sample size N = 500; the second three
columns give the RMSE for the sample size 1000; and the last three columns give the
RMSE for the sample size 5000. The study shows that the effects of the ability
distribution on the results of the item fit analysis are confounded with the item
parameter estimation. The conservative Type I error rates suggest that the
population distribution itself is not a factor in the results of the item fit
analysis, but it can severely influence the item parameter estimates, as reflected
by the large RMSE in Table 3.19.

Table 3.18: Type I Error Rates for Non-normal Ability Population and Data-Based
Item Parameter Estimates

Item   N=500   N=1000   N=5000
 1     .000    .000     .000
 2     .004    .000     .002
 3     .004    .001     .040
 4     .002    .000     .001
 5     .000    .000     .000
 6     .001    .001     .002
 7     .003    .000     .000
 8     .000    .000     .001
 9     .000    .001     .001
10     .000    .000     .001
11     .001    .001     .000
12     .003    .001     .002
13     .010    .005     .004
14     .000    .001     .002
15     .003    .000     .000

Table 3.19: RMSE for Non-normal Ability Population

        RMSE N = 500         RMSE N = 1000        RMSE N = 5000
Item    a     b     c        a     b     c        a     b     c
 1     .3    .374  .032     .276  .365  .03      .175  .377  .025
 2     .692  .357  .017     .699  .374  .009     .488  .375  .003
 3     .487  .156  .066     .47   .155  .047     .313  .207  .021
 4     .4    .423  .024     .359  .428  .018     .319  .416  .009
 5     .553  .241  .024     .559  .237  .016     .393  .259  .009
 6     .565  .208  .024     .6    .193  .016     .437  .224  .01
 7     .873  .116  .029     .493  .174  .035     .384  .089  .034
 8     .476  .259  .025     .473  .259  .017     .337  .278  .009
 9     .618  .101  .03      .682  .073  .019     .504  .116  .012
10     .343  .413  .025     .323  .417  .022     .218  .403  .014
11     .326  .211  .03      .325  .257  .039     .247  .147  .04
12     .448  .236  .044     .426  .242  .028     .272  .28   .009
13     .667  .174  .06      .751  .234  .052     .565  .134  .024
14     .345  .124  .054     .34   .125  .048     .228  .068  .026
15     .512  .317  .023     .61   .311  .017     .508  .31   .008

3.6.2 The Number of Quadrature Points and Item Fit

The item fit measure $Q^*_{DM}$ and the corresponding asymptotic distribution rely
on a discrete underlying ability distribution (i.e., $p(\theta = \theta_q) = w_q$
for $q = 1, 2, \ldots, Q$), which is used to approximate the continuous distribution
N(0, 1). Here Q represents the number of quadrature points. How the item fit
diagnostic procedure depends on the number of quadrature points Q is an important
practical issue regarding the stability of the method. As is known, with a larger
number of quadrature points, the discrete proficiency distribution gets closer to
the continuous proficiency distribution. In the previous simulation studies, the
number of quadrature points Q was 41, ranging from -4 to 4. To compare the
stability of the results across different numbers of quadrature points, 21 and 81
quadrature points are also selected within the range from -4 to 4; similar results
for the same data would indicate that the method is stable with respect to the
number of quadrature points. In this simulation study, a test of 30 items is
simulated and administered to a sample of 1000 examinees from a standard normal
population. The dichotomous response data are simulated using the 3PL model. For a
given data set and a stable method, in which the number of quadrature points does
not have substantial effects on the item fit analysis, each item fit statistic and
its corresponding asymptotic probability would not be expected to differ much as
the number of quadrature points changes from 21 to 41 to 81. Similarly, the Type I
error rates at the nominal level over 1000 replications would not be expected to
differ as the number of quadrature points varies. Table 3.20 shows the true item
parameters in the first three columns, the RMSE in the second three columns, and
the Type I error rates in the last three columns for Q = 41, Q = 21, and Q = 81,
respectively. The item parameter estimates are MML estimates using the 3PL model in
BILOG-MG3. The true item parameters in this study take a wide variety of values,
intended to simulate more general practical contexts for the test items. The
discriminating power parameters range from .139 to 2.67; the difficulty parameters
range from -1.821 to 2.233; and most of the asymptote parameters are around .2,
with the highest being .29.
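A discrete ability distribution of the kind described above can be sketched as follows. The sketch assumes equally spaced quadrature points on [-4, 4] with weights proportional to the standard normal density and normalized to sum to one; the spacing rule and the function name are assumptions for illustration, not details confirmed by the study.

```python
import math


def normal_quadrature(n_points, lo=-4.0, hi=4.0):
    """Equally spaced points on [lo, hi] with normalized N(0, 1) density weights."""
    step = (hi - lo) / (n_points - 1)
    points = [lo + k * step for k in range(n_points)]
    densities = [math.exp(-0.5 * t * t) for t in points]
    total = sum(densities)
    weights = [d / total for d in densities]
    return points, weights


for q in (21, 41, 81):
    pts, w = normal_quadrature(q)
    variance = sum(wi * t * t for t, wi in zip(pts, w))
    print(q, round(sum(w), 6), round(variance, 4))
```

Comparing the printed variances for Q = 21, 41, and 81 gives a quick check of how closely each grid reproduces the unit variance of N(0, 1).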

Figure 3.5 through Figure 3.8 show the results for the three different numbers of
quadrature points (Q = 21, 41, and 81). It can be seen from these figures that the
plots of both the item fit statistics (i.e., $Q^*_{DM}$) and the corresponding
asymptotic probabilities for the four items (Item 1, Item 3, Item 5, and Item 7)
over 1000 replications lie close to the reference line y = x, indicating that these
values are very close to each other no matter what the number of quadrature points
is. However, careful examination shows some scatter in the plots of Q = 21 versus
Q = 41, implying that some large differences occur. Similar results are obtained
for the other items in the same test but are not listed or plotted here. These
results show that the item fit analysis based on the pseudocounts approach
developed in Chapter 2 is not overly sensitive to the number of quadrature points,
indicating that stable and robust results are achieved. From these nearly
interchangeable results on the item fit statistics and the corresponding asymptotic
probabilities, one can conclude that the number of quadrature points is,
practically, not a factor that affects the results of the item fit analysis.

Table 3.20: Type I Error Rates for Three Numbers of Quadrature Points

            True                   RMSE             Type I Errors
Item    a       b      c       a     b     c       41    21    81
 1    1.899   -.054   .24    .254  .077  .036    .000  .000  .000
 2    1.411   1.107   .243   .235  .097  .026    .000  .000  .000
 3    2.656  -1.326   .255   .520  .136  .077    .002  .002  .002
 4    2.159   1.083   .057   .279  .062  .011    .001  .001  .001
 5    1.545    .735   .048   .178  .063  .018    .000  .000  .000
 6    2.605    .619   .273   .415  .064  .026    .000  .000  .000
 7     .771    .416   .016   .159  .162  .071    .007  .006  .007
 8    2.474   1.18    .085   .362  .065  .011    .000  .000  .000
 9     .941    .096   .022   .159  .137  .067    .004  .004  .004
10    2.423    .708   .246   .383  .065  .023    .000  .000  .000
11     .653    .35    .129   .119  .170  .056    .000  .000  .000
12    1.543   -.088   .226   .195  .084  .039    .003  .003  .003
13    1.832    .559   .239   .259  .072  .027    .000  .000  .000
14    1.959    .536   .096   .226  .056  .018    .001  .001  .001
15    2.587  -1.821   .096   .506  .109  .088    .002  .002  .002
16     .241    .135   .115   .166  .872  .170    .001  .003  .001
17    2.117    .838   .146   .286  .061  .018    .001  .001  .001
18    1.045   -.19    .037   .158  .128  .068    .012  .012  .012
19     .139    .211   .286   .113  .459  .048    .023  .023  .023
20     .474   1.879   .164   .178  .219  .045    .000  .001  .000
21    1.39    1.522   .222   .269  .121  .022    .000  .000  .000
22    1.972   -.963   .028   .316  .097  .082    .025  .023  .025
23    1.635    .558   .233   .229  .076  .028    .001  .001  .001
24     .381    .877   .29    .126  .312  .058    .003  .003  .003
25     .795   -.329   .197   .108  .138  .048    .002  .002  .002
26     .174   2.233   .078   .293  .439  .193    .009  .009  .009
27    1.69    2.211   .014   .297  .166  .006    .000  .000  .000
28    2.195   1.435   .066   .340  .078  .010    .002  .002  .002
29    1.268   -.331   .077   .151  .095  .050    .008  .007  .008
30    2.675   -.139   .094   .331  .052  .023    .002  .002  .002

Further evidence for this conclusion can be seen in the Type I error rates at the
nominal level over 1000 replications in Table 3.20. The largest difference in the
Type I error rates at the nominal level is .002 on item 22 (i.e., the Type I error
rate is .025 for Q = 41 and Q = 81, and .023 for Q = 21), which can be attributed
to random error in the sample data. One can use the results of this simulation
study to reduce the computation for a large data set, since computing $Q^*_{DM}$
with Q = 41 takes less time than with Q = 81. However, it is not advised to use a
smaller number of quadrature points (e.g., Q = 21) in applications since, in a
small number of cases, large disturbances occur as Q gets smaller. When Q is
greater than or equal to 41, Figure 3.7 and Figure 3.8 show stable results for both
$Q^*_{DM}$ and its asymptotic probabilities. Therefore, Q = 41 is generally
recommended for computing item fit in applications.

3.7 Computing Time and Programs

Several C++ programs have been implemented for the simulation studies. They consist
of three parts: simulating response data, computing the item fit measure
$Q^*_{DM}$ for each item in a test, and evaluating the asymptotic probabilities
through the routine of Davies (1980). The computing time depends on both the sample
size and the test length: longer tests or larger samples of examinees take more
time for computing the item fit statistics. The computing time also depends on the
computer equipment. The computer used for this simulation study has a Pentium IV
processor running at 2.39 GHz and 512 MB of RAM.


Figure 3.5: Item Fit Statistics Q_DM and Number of Quadrature Points
[Scatter plots for Item 1, Item 3, Item 5, and Item 7; horizontal axis: Item Fit Statistics (Q = 41).]

Figure 3.6: Asymptotic Probabilities and Number of Quadrature Points
[Scatter plots for Item 1, Item 3, Item 5, and Item 7; horizontal axis: Asymptotic Probabilities (Q = 41).]


Figure 3.7: Item Fit Statistics Q_DM and Number of Quadrature Points
[Scatter plots for Item 1, Item 3, Item 5, and Item 7; axes: Item Fit Statistics (Q = 41) and Item Fit Statistics (Q = 81).]

Figure 3.8: Asymptotic Probabilities and Number of Quadrature Points
[Scatter plots for Item 1, Item 3, Item 5, and Item 7; axes: Asymptotic Probabilities (Q = 41) and Asymptotic Probabilities (Q = 81).]


Computing the item fit statistic (Q_DM) for every item in a 15-item test
administered to 500 examinees takes about a quarter of a minute; for the same test
administered to 1000 examinees it takes about one third of a minute, and for 5000
examinees about one and a half minutes. For a 30-item test administered to 1000
examinees, the computation takes about one minute. These timings are based on 41
quadrature points. The number of quadrature points also affects the computation
time. Generally speaking, the method is robust to the number of quadrature points,
as seen in Section 3.6.2, but the same data set takes less time when fewer
quadrature points are used. Computing the asymptotic probabilities takes less than
a second for each data set. Overall, the computation is efficient enough for most
applications.


Chapter 4

Real Data Applications

One advantage of a simulation study of item fit is that it is known whether or not
the test data fit the hypothetical IRT models. For real test data, item fit analysis
is often confounded with parameter estimation (in particular, item parameter
estimation), which makes the decision on whether or not the test data fit the
hypothetical models much more complex.
4.1 Assumptions

Before the real data analysis, some conditions must be assumed so that the results
can be interpreted reasonably. Several assumptions are involved in item fit analysis
of real data. One assumption is that the parameter estimation is accurate and
reliable, that is, both item and ability parameters are correctly estimated. To
satisfy this condition, the standard procedures found in most IRT software (e.g.,
MML for item parameter estimation and EAP for ability estimation, as recommended in
BILOG-MG3) are used to estimate the parameters for the real data in this chapter.
Parameter estimation is known to be confounded with model-data fit issues. Poor
parameter estimates may be caused by inadequate model-data fit or by other factors,
for example insufficient sample size or test length, or violations of the
dimensionality or local independence conditions. Therefore, the major assumption
for the real data analysis here is that, when testing the hypothesis that the test
data fit a hypothetical model, the parameter estimates obtained under this
hypothetical model are assumed to contain no errors. For example, if one tests the
hypothesis that the observed data fit a 3PL model, then both the item and ability
parameters are assumed to be correctly estimated under this 3PL model. If the
parameter estimates are incorrect, the only explanation is that the test data do not
adequately fit the hypothetical 3PL model, rather than a failure of the estimation
procedure itself. The other assumptions of item response theory are also made here.
For example, unidimensionality and local independence are assumed for the analysis
in this chapter.

4.2 Two Approaches on Item Fit Analysis for Real
Data

The two real data sets in this section are from the anonymous Fall 2000 high school
science and math tests of the Michigan Educational Assessment Program (MEAP). The
MEAP science data set used here consists only of the dichotomous responses of 7088
examinees to 19 items; the MEAP math data set likewise contains only 19 dichotomous
items, with 6857 examinees.

In this chapter, both the science and math data are fitted with the 3PL, 2PL, and
1PL models in turn. The item parameters are estimated using the MML method in
BILOG-MG3. The item fit results from BILOG-MG3, whose χ² test is run with 10 ability
groups and 30 quadrature points, are compared with the item fit results (Q_DM)
based on pseudocounts.

Table 4.1 and Table 4.2 give the item parameter estimates for the science data
(Table 4.1) and the math data (Table 4.2) when the data are fitted with the 3PL
model. Because both samples are large, the item fit χ² tests in BILOG-MG3 indicate
that every item in both the science and mathematics tests shows a statistically
significant deviation between the test data and the model predictions (i.e., all
p-values are less than .05), as reflected in the large χ² statistics and the low
p-values in Tables 4.1 and 4.2. The χ² test is known to be sensitive to examinee
sample size. Almost any departure in the data from the item model under
consideration (even one whose practical significance is trivial) leads to rejection
of the null hypothesis of model-data fit if the sample size is sufficiently large.
On the other hand, for small samples, even large discrepancies between model and
data may go undetected because of low power. Hambleton and Rogers (in Educational
Measurement, 3rd edition, edited by Linn, 1993, p. 173, "Principles and selected
applications of item response theory" by Hambleton) suggest that

“statistical tests of model fit do appear to have some value. Because
they are sensitive to sample size and because they are not uniformly
powerful, the use of any of these statistics as the sole indicator of model fit
is clearly inadvisable. But two situations can be identified in which these
tests may lead to relatively clear interpretations. When sample size are
small and the statistics indicate model misfit, or when sample size are
large and model fit is obtained, the researcher may have reasonable
confidence that, in the first case, the model does misfit the data, and in the
second, that the model fits the data. These possibilities make it worthwhile
to employ statistical tests of fit despite the alternate possibility of
equivocal results.”

Table 4.1: MEAP 2000 Fall High School Science Test Items with the 3PL Model (N = 7088)

Item       a        b      c     Q_DM   p (Q_DM)      χ²   p (χ²)
   1    .416   -2.742   .000    8.387       .064    53.5     .000
   2   1.455    2.047   .343    2.585       .772    20.7     .023
   3    .462   -1.187   .000    7.626       .093    55.8     .000
   4    .598    -.181   .018    5.776       .228    64.0     .000
   5    .240   -2.112   .000    3.76        .530    22.9     .011
   6    .824    -.048   .198    2.681       .752    63.1     .000
   7    .752    -.173   .315    2.352       .818    39.6     .000
   8    .514   -1.733   .000    3.38        .606    56.1     .000
   9    .556    -.664   .500   12.097       .009    37.2     .000
  10    .808     .943   .256    2.411       .806    33.1     .000
  11   1.048    1.255   .317    2.462       .796    31.3     .000
  12    .641   -1.368   .000    4.294       .431    90.2     .000
  13    .621   -1.027   .000    1.065       .796    92.5     .000
  14    .635     .330   .422    3.259       .631    29.2     .001
  15    .973     .588   .091    2.331       .822   107.6     .000
  16    .722     .125   .325    2.548       .779    31.0     .000
  17    .936     .752   .269    2.616       .765    48.8     .000
  18    .747   -1.035   .000    2.888       .709   121.1     .000
  19    .465   -1.783   .000    4.350       .422    68.1     .000

According to the above guideline by Hambleton and Rogers, the item fit results
from BILOG-MG3 may not provide information that leads to a “relatively clear
interpretation,” because both tests use large samples of examinees. For these two
examples, it is difficult to evaluate whether the science and math test data fit the
hypothetical 3PL model if the only information available is the χ² test results
from BILOG-MG3.
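
The sample-size sensitivity described above can be illustrated with a small Python sketch. It uses a generic Pearson-type statistic summed over ability groups, which is an assumption made for illustration and not necessarily the exact statistic that BILOG-MG3 reports; the group proportions, the equal group sizes, and the use of 10 degrees of freedom are likewise hypothetical. The same small discrepancy (.03 in every group) is held fixed while only the group size changes.

```python
def pearson_item_fit(n_per_group, observed, expected):
    """Pearson-type fit statistic summed over ability groups.

    observed / expected are proportions correct in each group;
    every group is assumed to contain n_per_group examinees.
    """
    return sum(
        n_per_group * (o - e) ** 2 / (e * (1.0 - e))
        for o, e in zip(observed, expected)
    )

# Hypothetical example: 10 ability groups, model probability .50 in each,
# observed proportion .53 in each (a practically trivial misfit).
expected = [0.50] * 10
observed = [0.53] * 10

CRITICAL_05_DF10 = 18.307  # upper 5% point of chi-square with 10 df (df assumed)

small = pearson_item_fit(50, observed, expected)    # 500 examinees in total
large = pearson_item_fit(700, observed, expected)   # 7000 examinees in total

print(f"N = 500:  chi-square = {small:.1f}, reject = {small > CRITICAL_05_DF10}")
print(f"N = 7000: chi-square = {large:.1f}, reject = {large > CRITICAL_05_DF10}")
```

With the discrepancy fixed, the statistic grows in proportion to the sample size, so the same trivial misfit is retained at N = 500 and rejected at N = 7000.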

Table 4.2: MEAP 2000 Fall High School Mathematics (N = 6857)

Item       a        b      c     Q_DM   p (Q_DM)      χ²   p (χ²)
   1    .524   -2.638   .000    4.079       .530    43.9     .000
   2    .704     .074   .122    3.387       .663    38.9     .000
   3    .851    1.326   .172    4.072       .581    40.3     .000
   4    .771     .340   .228    3.431       .654    27.5     .002
   5    .541   -2.883   .000    4.981       .379    48.6     .000
   6    .912    -.173   .098    3.019       .735    58.0     .000
   7    .571    -.165   .158    3.284       .683    43.0     .000
   8   1.171     .559   .104    5.589       .296    57.9     .000
   9   1.390    -.657   .223    3.844       .574    78.2     .000
  10    .900    -.980   .198    2.849       .768    49.9     .000
  11    .582    -.254   .138    3.135       .712    26.8     .003
  12   1.135    -.808   .212    3.174       .705    60.4     .000
  13   1.236    -.135   .226    3.400       .660    35.7     .000
  14   1.329     .529   .153    4.581       .442    32.6     .000
  15    .713   -1.125   .000    4.924       .387    68.0     .000
  16    .414    -.620   .000    6.464       .202    49.1     .000
  17    .455   -2.189   .000    8.672       .072    70.6     .000
  18    .611    -.840   .500    7.110       .151    25.5     .004
  19    .104   -6.671   .000   24.252       .000    77.4     .000

Consider the results of the item fit analysis for both the science and math test
data on the basis of pseudocounts. The item fit measure (Q_DM) and its corresponding
asymptotic probabilities are computed using the data-based item parameter estimates
(from the standard MML estimation procedure) for the science and math tests, and
are listed in Tables 4.1 and 4.2, respectively. They show that item 9 in the science
test and item 19 in the math test have significant deviations between the test data
and the hypothetical 3PL model (i.e., p-values less than .05). Data from the other
items in both tests are consistent with predictions based on the hypothetical 3PL
model. One can also see that when the hypothetical model is rejected, the
corresponding fit statistic Q_DM is larger than for the other items in the two
tests. According to Hambleton and Rogers' guideline, the test data (except item 9
of the science test and item 19 of the math test) show adequate fit with the
hypothetical 3PL model in both tests for such large samples of examinees, which
should lead to a “relatively clear interpretation.”

One attractive property shown by this real data example is that the item fit
approach based on pseudocounts (i.e., Q_DM) is able to provide informative item fit
tests even for sample sizes as large as 7000. If both the science and math test
data are fitted with the 2PL or 1PL model, the results show that all hypothesis
tests of item fit (Q_DM) based on pseudocounts are rejected (Tables 4.3 and 4.4).
That is, the test data in both the science and math tests do not have adequate fit
with the 2PL or 1PL model. However, BILOG-MG3 gives different results for these two
hypotheses. The χ² tests from BILOG-MG3 show that the test data for three items
(i.e., item 1, item 5, and item 11) in the math test have reasonable fit to the 2PL
model and that three items (i.e., item 1, item 11, and item 17), also in the math
test, have reasonable fit to the 1PL model (Table 4.4). Interestingly, in the math
test the same three items (i.e., item 1, item 5, and item 11) show reasonable fit
with the 2PL model but inadequate fit with the 3PL model, which is hard to make
sense of. Similarly, it is difficult to imagine a situation in which the data from
the same three items (item 1, item 11, and item 17) in the math test have reasonable
fit with the 1PL model but fail to fit the 3PL model. These results seem to conflict
with the general principle that, purely from a model-data fit perspective, the more
parameters a model has, the better the fit that can be achieved.

Table 4.3: MEAP 2000 Fall High School Science Items (N = 7088)

Table 4.4: MEAP 2000 Fall High School Mathematics Items (N = 6857)

 

4.3 Graphic Approach

A further question is what other evidence can support a decision between the
apparently different results of the above two approaches (i.e., the χ² test and
Q_DM) to item fit analysis. An alternative, graphic approach may provide some
intuition for assessing whether or not the test data from the MEAP science and math
tests fit the hypothetical IRT models.

Figure 4.1 through Figure 4.5 plot the hypothetical 3PL item response function
(the solid curve in each graph) together with the observed empirical item response
curve (the dots in each graph) for the 19 items in the science test.
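
As a rough illustration of how such a plot can be built, the Python sketch below groups examinees into ability intervals, computes the observed proportion correct in each interval (the dots), and evaluates a 3PL curve at the interval midpoints (the solid line). The logistic form without a scaling constant, the interval edges, and the tiny data set are assumptions for illustration; the item 9 estimates from Table 4.1 are used only as sample input, and the function names are hypothetical.

```python
import math

def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic form, no scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def empirical_irf(thetas, responses, edges):
    """Observed proportion correct within each ability interval [lo, hi).

    Returns a list of (midpoint, proportion, count); empty intervals are skipped.
    """
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        scores = [u for t, u in zip(thetas, responses) if lo <= t < hi]
        if scores:
            points.append(((lo + hi) / 2.0, sum(scores) / len(scores), len(scores)))
    return points

# Hypothetical ability estimates and 0/1 responses to one item.
thetas = [-1.5, -1.2, 0.1, 0.3, 0.4, 1.5]
responses = [0, 1, 1, 0, 1, 1]
edges = [-2.0, -1.0, 0.0, 1.0, 2.0]

a, b, c = 0.556, -0.664, 0.500  # item 9 estimates from Table 4.1, used as sample input
for mid, prop, n in empirical_irf(thetas, responses, edges):
    model = irf_3pl(mid, a, b, c)
    print(f"theta = {mid:+.1f}  observed = {prop:.2f} (n = {n})  model = {model:.2f}")
```

Intervals with few examinees give unstable proportions, which is one reason discrepancies at the ends of the ability scale should be read with caution.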

One can see from these plots that most of the items do have reasonable fit with the
3PL model, assuming that the estimation is correct and that the other IRT
assumptions (e.g., local independence and unidimensionality) are satisfied. Item 9
is diagnosed as having a significant deviation between the data and the 3PL model,
which can be seen in the first plot of Figure 4.3: there is a large discrepancy
(i.e., a deviation of more than .5) between the hypothetical IRF and the empirical
IRF at the lower end of the ability scale. In fact, other items (item 1, item 3,
item 5, item 8, and item 19) also show large discrepancies at the lower end of the
ability scale but still result in reasonable fit. One possible explanation for this
finding is that the ability estimates may contain large errors, which lead to
misclassification of examinees into groups. The possibly large errors in ability
estimation, in particular at the two ends of the ability scale, may be attributable
to the small number of items in the science test (i.e., a 19-item test can be
considered a short test). This is part of the reason why BILOG-MG recommends using
the χ² test for tests with more than 20 items. Combining the plots with the results
of the item fit analysis, one can conclude that the Q_DM test provides helpful
information for assessing model-data fit. Moreover, the Q_DM test for item fit can
be applied to short tests and large samples of examinees, which broadens the
settings for item fit analysis.


Figure 4.1: Empirical versus Hypothetical Item Response Functions for MEAP 2000
High School Science Items (1-4)
[Plots for Item 1 through Item 4; horizontal axis: Ability Scale.]

Figure 4.2: Empirical versus Hypothetical Item Response Functions for MEAP 2000
High School Science Items (5-8)
[Plots for Item 5 through Item 8; horizontal axis: Ability Scale.]

Figure 4.3: Empirical versus Hypothetical Item Response Functions for MEAP 2000
High School Science Items (9-12)
[Plots for Item 9 through Item 12; horizontal axis: Ability Scale; vertical axis: Probability or Proportion.]

 

 

Figure 4.4: Empirical versus Hypothetical Item Response Functions for MEAP 2000
High School Science Items (13-16)
[Plots for Item 13 through Item 16; horizontal axis: Ability Scale; vertical axis: Probability or Proportion.]

 

Figure 4.5: Empirical versus Hypothetical Item Response Functions for MEAP 2000
High School Science Items (17-19)
[Plots for Item 17 through Item 19; horizontal axis: Ability Scale; vertical axis: Probability or Proportion.]

 

Chapter 5

Concluding Remarks and Future
Research Directions

The simulation studies in Chapter 3 demonstrate that the approach to detecting item
fit or misfit is reliable and promising. The approach achieves the expected
computational efficiency by approximating the true asymptotic probabilities from
the observed covariance matrix of the pseudocounts (e.g., Figure 1), which makes the
approach applicable to most operational research. The approximation not only
simplifies the computation but also produces accurate results for assessing item
fit, as shown by the oracle analysis in Chapter 3. When other sources of error are
controlled, for example when the item parameters are known, the item fit test
statistic Q_DM, the coefficients of the asymptotic distribution (Table 3.5 through
Table 3.11), the asymptotic probabilities (Figures 1 to 3), the type I error rates
(Tables 3.2, 3.3, and 3.4), and the decisions on whether the test data fit the
hypothetical models all show good agreement under the approximation. However, the
approximation based on the observed covariance matrix of the pseudocounts
introduces additional error into the assessment of item fit, and this error may be
large when the test is long and only a small number of examinees is available.

The utility of this approach is not limited by test length. For short tests, for
example a test with 10 items or fewer, one can use the true asymptotic distribution
directly rather than its approximation to evaluate whether or not the test data fit
the hypothetical model, because computing the true asymptotic distribution for a
10-item test requires evaluating only 2^10 = 1024 possible response patterns, no
matter how large the sample is. When the approximation is used, however, a sample of
at least 1000 is advised. It can be seen from Figure 1 that the approximation looks
somewhat dispersed when the sample size is 500 but improves when the sample size
increases to 1000.
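
The count of 1024 is the number of dichotomous response patterns for 10 items. A minimal Python sketch of the enumeration follows; it also computes the marginal probability of each pattern by summing over a quadrature grid. The logistic 3PL form, the normal prior grid with 41 points, and the item parameters are hypothetical choices for illustration and are not the computation used in this study.

```python
import math
from itertools import product

def irf_3pl(theta, a, b, c):
    """3PL item response function (logistic form, no scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def normal_grid(q=41, lo=-4.0, hi=4.0):
    """Equally spaced nodes with normalized standard normal weights (assumed prior)."""
    step = (hi - lo) / (q - 1)
    nodes = [lo + k * step for k in range(q)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]

def pattern_probabilities(items, q=41):
    """Marginal probability of every 0/1 response pattern for the given items."""
    nodes, weights = normal_grid(q)
    probs = {}
    for pattern in product((0, 1), repeat=len(items)):
        total = 0.0
        for t, w in zip(nodes, weights):
            like = w
            for u, (a, b, c) in zip(pattern, items):
                p = irf_3pl(t, a, b, c)
                like *= p if u == 1 else 1.0 - p
            total += like
        probs[pattern] = total
    return probs

# Ten hypothetical items given as (a, b, c).
items = [(0.8 + 0.1 * j, -1.0 + 0.2 * j, 0.2) for j in range(10)]
probs = pattern_probabilities(items)
print(len(probs))  # 1024 patterns, independent of the number of examinees
print(round(sum(probs.values()), 6))
```

The loop over patterns does not involve the examinees at all, which is why the cost of the exact computation depends on test length rather than on sample size.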

This approach has a strong theoretical basis, because its fundamental concept is
“pseudocounts,” that is, the posterior distribution of ability rather than ability
estimates, which is believed to provide better information for assessing item fit.
One direct theoretical advantage of using pseudocounts rather than ability
estimates to evaluate item fit is that the approach avoids the additional sources of
error that are confounded with ability estimation in item fit analysis, in
particular for short tests. For a short test, the ability scale may not be well
defined, and the large errors in the ability estimates, together with the
classification errors from grouping examinees, make the results of the χ² item fit
test questionable, as in the real data example of Chapter 4. This is not a problem
for the approach based on pseudocounts, because observed counts based on ability
estimates are not required for the analysis. Other advantages and limitations are
summarized below.

First of all, the approach has reasonable type I error rates (Tables 3.2, 3.3, 3.4,
3.18, and 3.20). In Tables 3.2, 3.3, and 3.4, when the item parameters are known
constants, the type I error rates range from 0 to .05, with most items having type I
error rates around .02, which is acceptable. However, when the item parameters are
data-based estimates, almost all items have conservative type I error rates,
regardless of the sample size and of how good the item parameter estimates are. The
overly conservative type I error rates when item parameters are estimated result
from underestimates of the item fit statistic Q_DM, which also lead to
underestimates of the corresponding asymptotic probabilities. Chapter 2 shows that
the asymptotic distribution can be expressed as a linear combination of independent
χ² variables. The coefficients of this linear combination computed from item
parameter estimates are, for each item, very close to those computed from the true
item parameters (see Table 3.5 through Table 3.11).
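
For intuition, the tail probability of such a linear combination can be approximated by simulation. The Python sketch below is a Monte Carlo stand-in and is not Davies' (1980) algorithm, which the programs in this study use; it assumes central χ² variables with one degree of freedom each, and the coefficients and the observed statistic in the example are hypothetical.

```python
import random

def tail_prob_linear_chi2(weights, q, n_draws=100_000, seed=12345):
    """Monte Carlo estimate of P(sum_i w_i * Z_i^2 > q) with Z_i independent N(0, 1).

    Each Z_i^2 is a central chi-square variable with 1 df (an assumption here).
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        value = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if value > q:
            exceed += 1
    return exceed / n_draws

weights = [1.2, 0.9, 0.6, 0.3]   # hypothetical coefficients of the linear combination
q_obs = 8.0                      # hypothetical observed item fit statistic
print(tail_prob_linear_chi2(weights, q_obs))
```

A simulation of this kind is much slower and less precise than a numerical inversion routine, so it is useful mainly as a sanity check on the tail probabilities.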

One interpretation of the conservative type I error rates for data-based item
parameter estimates is that they are attributable to estimation errors (e.g., errors
in estimating the covariance matrix, the eigenvalues, and the item parameters),
which result in underestimates of the item fit statistic (Q_DM) and its asymptotic
probability. Since the eigenvalues appear to be well estimated (Table 3.5 through
Table 3.11), the conservative type I error rates could result from underestimation
of the item fit statistic caused by errors in estimating the item parameters. Note
that extending the item fit results to the case of estimated item parameters relies
on the availability of consistent estimates of the item parameters. Although the
RMSEs of the item parameter estimates for a sample size of 5000 are much smaller
than those for sample sizes of 500 and 1000 (see Table 3.2 through Table 3.4), the
item parameter estimates still contain a large amount of error for each item. If the
item parameters contained no estimation error, one would expect type I error rates
similar to those obtained when the item parameters are known constants. It is
possible that poorly recovered item parameters cause the poor item fit results in
the simulation studies. It is therefore necessary to discern whether poor item fit
arises because the test data really do not fit the item models or because the item
parameters are poorly estimated, possibly owing to a bad estimation procedure. That
is, although item fit analysis with data-based item parameter estimates does not
rely on ability estimates, detecting item fit or misfit based on pseudocounts still
requires item parameter estimates, which inevitably confounds model-data fit issues
with estimation issues. Poorly recovered item parameters lead to questionable
model-data fit analysis, and inadequate model-data fit in turn results in
questionable item parameter estimates. Further research is needed to investigate the
effects of item parameter estimation on model-data fit analysis. For example,
further work is needed to examine what causes the underestimation of the item fit
statistic and how to correct for the effects of item parameter estimation. One
possible approach is that of Donoghue & Hombo (2003), who explicitly examine the
effect of item parameter estimation and so deepen the understanding of its effect on
the distribution of the item fit measure.

Secondly, the approach has adequate power to detect item misfit in the simulation
studies (Tables 3.12, 3.13, 3.14, 3.15, 3.16, and 3.17). When the item parameters
are the true values, the power estimates for many items are around .9 even when the
sample size is as small as 1000 (see Table 3.12 through Table 3.14). For item 5
through item 13 in Table 3.13 and for item 8 and item 10 in Table 3.14, the power is
less than .9 and varies across items as their discriminating power (a parameters)
gets closer to 1, which can be explained by the relations between their item
response functions. In general, when item parameters are the true values, the
greater the separation between the IRF of the true model and that of the
hypothetical model, the easier it is to detect item misfit and the higher the power
that can be expected, even for a small sample size (e.g., 500). For example, the 3PL
model is more easily separated from the 2PL or the 1PL model because of the presence
of the asymptote parameter. However, the 2PL and the 1PL models can hardly be
separated from each other when the discriminating power parameters are close to 1,
so that the 2PL model nearly reduces to the 1PL model; such misfit is difficult to
detect from test data. Therefore, for detecting misfit of the 2PL or 1PL model, the
power should be a function of the item discriminating power parameter, and the power
curve over the a parameter can be expected to be U-shaped, with the lowest power
associated with a parameters close to 1.

The item response function provides information for diagnosing the item fit testing
process. In the simulation, it is the item response function that determines how the
dichotomous response data are generated. Chapter 2 also shows how an IRF influences
the pseudocounts (the sum of the posteriors over all possible response patterns for
the remaining items in a test), how it directly affects the theoretical expectation
of the pseudocounts, and eventually how it influences the item fit measure Q_DM and
its corresponding asymptotic distribution.

If two IRFs are very close to each other, one can expect that the two models will
either fit a data set equally well or both fail to fit it. Thus the power may be low
when the two IRFs are close, and a large sample may be required to detect the
misfit. For example, the 2PL model and the 1PL model have the same lower asymptote.
When an IRF from a 2PL model is very similar to an IRF from a 1PL model (i.e., the a
parameter of the 2PL model is close to 1) and the data fit the hypothetical 2PL
model reasonably well, the data can also be expected to fit the hypothetical 1PL
model well, and vice versa. In the true item parameters in Table 3.1, the
discriminating power parameters (a) of item 5 through item 13 are close to 1, in
particular those of item 8 and item 10, which equal 1.107 and .92, respectively. If
the asymptote parameter c is disregarded, the 2PL and the 1PL (treating all a values
as 1) IRFs differ only slightly. Therefore, although the data sets are generated
from the 2PL model, the power for rejecting the 1PL model should be low (see Table
3.13) because the two IRFs are too close to each other. Similarly, the power should
also be low for item 8 and item 10 when data generated by the 1PL model are fitted
with the 2PL model, as reported in Table 3.14.
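
The closeness of two IRFs can be quantified directly. The Python sketch below computes the largest vertical gap between two curves over an ability grid. It assumes the logistic form without a scaling constant; the discrimination values 1.107 and .92 are those quoted above for item 8 and item 10, while the difficulty b = 0, the asymptote c = .2, and the grid are hypothetical.

```python
import math

def irf(theta, a=1.0, b=0.0, c=0.0):
    """Logistic IRF; c = 0 gives the 2PL, and a = 1 with c = 0 gives the 1PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def max_gap(params_1, params_2, lo=-4.0, hi=4.0, steps=801):
    """Largest absolute difference between two IRFs on an equally spaced grid."""
    grid = [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
    return max(abs(irf(t, *params_1) - irf(t, *params_2)) for t in grid)

b = 0.0  # hypothetical difficulty
for a in (1.107, 0.92):
    gap_2pl_1pl = max_gap((a, b, 0.0), (1.0, b, 0.0))
    gap_3pl_2pl = max_gap((a, b, 0.2), (a, b, 0.0))  # c = .2 is hypothetical
    print(f"a = {a}: 2PL vs 1PL gap = {gap_2pl_1pl:.3f}, 3PL vs 2PL gap = {gap_3pl_2pl:.3f}")
```

Under these assumptions the 2PL and 1PL curves differ by only a few hundredths when a is close to 1, whereas a nonzero lower asymptote opens a gap close to c at the low end of the ability scale.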

Figure 5.1 compares the 3PL, 2PL, and 1PL IRFs for item 1, item 8, item 10, and
item 15. The 3PL, 2PL, and 1PL IRFs for item 1 and item 15 are clearly well
separated, and thus these two items have higher power even for a sample size of 500.
On the other hand, the 2PL and 1PL curves for item 8 and item 10 are too close to be
separated from each other, as seen in the figure, and thus these items have lower
power even for a sample size as large as 1000. However, their IRFs are well
separated from that of the 3PL model, and these two items also show higher power
for detecting misfit of the 3PL model.

Figure 5.1: Item Response Functions for the 3PL, 2PL, and 1PL Model (Item 1, 8, 10, 15)
[Plots for Item 1, Item 8, Item 10, and Item 15; horizontal axis: Ability Scale.]

In short, for detecting item misfit, an IRF from a 3PL model can be easily
separated from an IRF from the other models (e.g., the 2PL and the 1PL). This is why
higher power is observed when fitting a 2PL or 1PL model to data generated by a 3PL
model. However, when the test data are generated by a 2PL model and the hypothesis
is that the data fit a 1PL model, adequate power is hard to expect if the two IRFs
are not well separated, unless a sufficiently large sample is available.

When item parameters are data-based estimates, the power for detecting misfit
(e.g., of the 2PL or 1PL model) is .9 or greater for most items when the data are
generated from the 3PL model and the sample size is large (1000), as seen in the
third and fourth columns of Table 3.15. The three items with the lowest power (item
3, item 11, and item 12) have power around .7. However, when the data are generated
from the 2PL model, the test of fit of the 1PL model shows very low power, as can be
seen in Table 3.16, in particular for item 4 through item 13, whose IRFs are close
to that of the 1PL model.

Next, the method is robust with respect to the ability distribution and is
insensitive to changes in the number of quadrature points. Although the type I error
rates in Table 3.19 for a non-normal ability population (i.e., a Beta distribution
in the example) show that the method is robust to the underlying ability
distribution, overly conservative type I error rates are observed when the item
parameters are poorly recovered, as can be seen from their root mean square errors.
Here the problem of a non-normal ability population leads back to the discussion of
the effects of item parameter estimates on item fit analysis. Just as a poorly
recovered set of item parameters cannot yield a correct decision on whether or not
the test data fit the hypothetical item models, even in simulation studies, the
results of item fit analysis based on bad item parameter estimates may fail to
support the fit of the test data to the hypothetical model even though the data
were generated by that model. That is, one can obtain unacceptably high type I error
rates using a set of bad item parameter estimates. The question is how bad the item
parameter estimates can be while the results of item fit analysis remain usable. The
study of the non-normal ability population provides only a general sense of how bad
item parameter estimates affect item fit analysis, in terms of the root mean square
errors. In Table 3.19, which reports RMSEs for the non-normal ability population
across three sample sizes (500, 1000, and 5000), one can see that most of the RMSEs
are greater than .5 for the discriminating power parameters, greater than .3 for the
difficulty parameters, and greater than .03 for the asymptote parameters. More
research is needed on how well item fit analysis tolerates errors in item parameter
estimates.
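
For reference, the root mean square error used to judge parameter recovery can be computed as in the Python sketch below. The estimates are hypothetical values standing in for estimates of one parameter across replications, and averaging over replications is an assumption about how the RMSE is formed.

```python
import math

def rmse(estimates, true_value):
    """Root mean square error of repeated estimates around the true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))

# Hypothetical estimates of a discriminating power parameter whose true value is 1.0.
a_hats = [1.0, 1.4, 0.6]
print(round(rmse(a_hats, 1.0), 3))
```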

As for the effect of the number of quadrature points on the results of item fit
analysis, Table 3.20 and Figure 3.5 through Figure 3.8 show that the results based
on Q = 21 and Q = 41 differ slightly, whereas the results based on Q = 41 and Q = 81
agree extremely well. It is therefore advised to compute item fit using 41
quadrature points to obtain both computational accuracy and efficiency.

Finally, although the method takes an asymptotic approach, it works extremely
well even for a sample size of 1000, and the test of item fit is not sensitive to the
number of examinees. When the sample size is as large as 5000, the χ² test for item
fit tends to reject the hypothesis for most items, whereas the Q_DM test still
provides useful information for diagnosing item fit, as is evident in Tables 3.2
through 3.4 and in the real data applications in Chapter 4. The high school science
and math MEAP data include large samples of examinees, which makes it hard to
diagnose with χ² whether the test data fit the hypothesized 3PL model, as shown in
Tables 4.1 and 4.2. Additional evidence comes from the plots of the hypothesized IRF
against the empirical IRF for each item in the science test (Figures 4.1 through
4.5): the results of the Q_DM analysis provide reliable information that agrees
with the results of the graphical approach. One can conclude from the real data
applications that the Q_DM test provides more helpful information than the χ²
test for assessing model-data fit.
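The sensitivity of χ² to sample size can be illustrated with a generic Pearson statistic. The sketch below is not the item fit statistic used in this study; it only shows, for hypothetical cell proportions, that a fixed small discrepancy between observed and model-implied proportions produces a statistic that grows in proportion to the number of examinees.

```python
def pearson_x2(n, observed_props, model_props):
    """Pearson X^2 for n examinees given observed and model-implied cell proportions."""
    return n * sum((o - m) ** 2 / m for o, m in zip(observed_props, model_props))

# Hypothetical proportions correct / incorrect in one cell (illustration only).
observed = [0.52, 0.48]
model = [0.50, 0.50]

for n in (500, 1000, 5000):
    print(n, round(pearson_x2(n, observed, model), 2))
```

With these numbers the statistic is 0.8, 1.6, and 8.0 for the three sample sizes. Only the last exceeds 3.84, the .05 critical value of χ² with one degree of freedom, even though the discrepancy of .02 is the same in all three cases.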

In short, the reformulation of Q_DM does not appear to correct the conservative
Type I error rates when the item parameters are estimated from the data. However,
the reformulation does provide a convenient theoretical framework for studying
item fit based on pseudocounts.


Bibliography

[1] Birch, M. W. (1964). A new proof of the Pearson-Fisher theorem. Annals of
Mathematical Statistics, 35, 818-824.

[2] Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete
multivariate analysis. Cambridge, MA: The MIT Press.

[3] Bock, R. D. (1972). Estimating item parameters and latent ability when
responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

[4] Bock, R. D., and Aitkin, M. (1981). Marginal maximum likelihood estimation
of item parameters: An application of the EM algorithm. Psychometrika, 46,
443-459.

[5] Bock, R. D., and Lieberman, M. (1970). Fitting a response model for n
dichotomously scored items. Psychometrika, 35, 179-197.

[6] Davies, R. B. (1980). Algorithm AS 155: Distribution of a linear combination of
non-central chi-squared random variables. Applied Statistics, 29, 323-333.

[7] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (2003b). Some
asymptotic results on the distribution of an IRT measure of item fit. Psychometrika
(conditionally accepted).

[8] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (2003a, April). A corrected
asymptotic distribution of an IRT fit measure that accounts for the effects of item
parameter estimation. Paper presented at the Annual Meeting of the American
Educational Research Association, Chicago, IL.

[9] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (2001b, June). The
behavior of an IRT measure of item fit in the presence of the item parameter
estimation. Paper presented at the Annual Meeting of the Psychometric Society,
Valley Forge, PA.

[10] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (2001a, April). The
distribution of an item-fit measure for polytomous items. Paper presented at the
Annual Meeting of the National Council on Measurement in Education, Seattle,
WA.


[11] Donoghue, J. R., and Hombo, C. M. [McClellan, C. A.] (1999, June). Some
asymptotic results on the distribution of an IRT measure of item fit. Paper
presented at the Annual Meeting of the Psychometric Society, Valley Forge, PA.

[12] Donoghue, J. R., and Isham, S. P. (1998). A comparison of procedures to detect
item parameter drift. Applied Psychological Measurement, 22, 33-51. Efron, B.
(1982). The jackknife, the bootstrap, and other resampling plans. Philadelphia,
PA: Society for Industrial and Applied Mathematics (SIAM).

[13] Glas, C. A. W., and Meijer, R. R. (2003). A Bayesian approach to person fit
analysis in item response theory models. Applied Psychological Measurement, 27(3),
217-233.

[14] Glas, C. A. W., and Suarez-Falcon, J. C. (2003). A comparison of item fit
statistics for the three-parameter logistic model. Applied Psychological
Measurement, 27(2), 87-106.

[15] Hambleton, R. K., and Swaminathan, H. (1985). Item response theory:
Principles and applications. Boston: Kluwer Academic Publishers.

[16] Hoijtink, H. (2001). Conditional independence and differential item functioning
in the two-parameter logistic model. In A. Boomsma, M. A. J. van Duijn, and T.
A. B. Snijders (Eds.), Essays on item response theory (pp. 109-130). New York:
Springer.

[17] Hombo, C. M. [McClellan, C. A.], and Donoghue, J. R. (2001, July). A power
study of an IRT measure of item fit. Paper presented at the Annual Meeting of
the Psychometric Society, King of Prussia, PA.

[18] Hombo, C. M. [McClellan, C. A.], and Donoghue, J. R. (2000, July). Some
properties of the distribution of an IRT measure of item fit. Paper presented at the
2000 annual meeting of the Psychometric Society, Vancouver, British Columbia.

[19] Hombo, C. M. [McClellan, C. A.], and Donoghue, J. R. (1999, June). A simulation
study of distribution of an IRT measure of item fit. Paper presented at the Annual
Meeting of the Psychometric Society, Lawrence, KS.

[20] Hombo, C. M. [McClellan, C. A.], Donoghue, J. R., and Oranje, A. H. (2003,
April). Evaluating item fit in 2002 NAEP writing data. Paper presented at the
annual meeting of the American Educational Research Association, Chicago, IL.

[21] Hsu, Y. (2000). On the Bock-Aitkin procedure - from an EM perspective.
Psychometrika, 65, 547-549.

[22] Johnson, N. L., and Kotz, S. (1970). Continuous univariate distributions-2.
Boston: Houghton Mifflin.


[23] Li, D., Donoghue, J. R., and McClellan, C. A. (2005). Approximate the
asymptotic distribution of an IRT measure for item fit based on observed interrelations
among pseudocounts. Paper presented at the Annual Meeting of the National
Council on Measurement in Education, Montreal, Canada.

[24] Linn, R. L. (1993). Educational Measurement, 3rd edition. The Oryx Press.

[25] McKinley, R. L., and Mills, C. N. (1985). A comparison of several goodness-of-ﬁt
statistics. Applied Psychological Measurement, 9, 49-57.

[26] Orlando, M., and Thissen, D. (2000). Likelihood-based item-fit indices for
dichotomous item response theory models. Applied Psychological Measurement,
24(1), 50-64.

[27] Orlando, M., and Thissen, D. (2003). Further investigation of the performance of
S-X²: An item fit index for use with dichotomous item response theory models.
Applied Psychological Measurement, 27(4), 289-298.

[28] Reckase, M. D. (1997). The past and future of multidimensional item response
theory. Applied Psychological Measurement, 21, 25-36.

[29] Reise, S. P. (1990). A comparison of item- and person-fit methods of assessing
model-data fit in IRT. Applied Psychological Measurement, 14, 127-137.

[30] Rogers, H. J., and Hattie, J. A. (1987). A Monte Carlo investigation of several
person and item fit statistics. Applied Psychological Measurement, 11, 47-57.

[31] Sinharay, S. (2005). Bayesian item fit analysis for unidimensional item response
theory models. Paper presented at the Annual Meeting of the National Council
on Measurement in Education, Montreal, Canada.

[32] Sinharay, S., and Johnson, M. S. (2003). Simulation studies applying
posterior predictive model checking for assessing fit of common item
response theory models (ETS RR-03-33). Princeton, NJ: ETS. Available from
http://www.ets.org/research/newpubs.html.

[33] Stone, C. A. (2000). Monte Carlo based null distribution for an alternative
goodness-of-fit statistic in IRT models. Journal of Educational Measurement,
37, 58-75.

[34] Stone, C. A., Ankenmann, R. D., Lane, S., and Liu, M. (1993, April). Scaling
QUASAR’S performance assessments. Paper presented at the Annual Meeting of
the American Educational Research Association, Atlanta, GA.

[35] Stone, C. A., and Hansen, M. A. (2000, April). Using resampling methods to
evaluate the significance of a goodness-of-fit statistics in item response theory
model. Paper presented at the Annual Meeting of the American Educational
Research Association, Montreal, Canada.


[36] Stone, C. A., Mislevy, R. J., and Mazzeo, J. (1994, April). Misclassification error
and goodness-of-fit in IRT models. Paper presented at the Annual Meeting of the
American Educational Research Association, New Orleans, LA.

[37] Stone, C. A., and Zhang, B. (2002). Comparing three approaches for assessing
goodness-of-fit of IRT models. Paper presented at the Annual Meeting of the
National Council on Measurement in Education, New Orleans, LA.

[38] Yen, W. M. (1981). Using simulation to choose a latent trait model. Applied
Psychological Measurement, 5, 245-262.
