
LIBRARY
Michigan State
University

 

 

 

This is to certify that the
dissertation entitled

ESTIMATING THE PARAMETERS FOR
MULTIDIMENSIONAL ITEM RESPONSE THEORY MODELS
BY MCMC METHODS

presented by

Yanlin Jiang

has been accepted towards fulﬁllment
of the requirements for the

Ph. D degree in Education

 

 

 

Major Professor’s Signature

 

Date

MSU is an Afﬁrmative Action/Equal Opportunity Institution

 


 

 

ESTIMATING PARAMETERS FOR MULTIDIMENSIONAL ITEM
RESPONSE THEORY MODELS BY MCMC METHODS

By

Yanlin Jiang

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Department of Counseling, Educational Psychology and Special Education

2005

ABSTRACT

ESTIMATING PARAMETERS FOR MULTIDIMENSIONAL ITEM

RESPONSE THEORY MODELS BY MCMC METHODS

By

Yanlin Jiang

Efforts to apply Markov chain Monte Carlo (MCMC) methods to three-parameter linear
logistic multidimensional IRT models are addressed using the Metropolis-Hastings
within Gibbs approach. Bayesian modal estimators of both item and proficiency
parameters are obtained in a single simultaneous process rather than in separate
parameter estimation procedures. It is shown that the approach is effective when the
individual item discrimination parameters and the proficiency dimension parameters
are blocked and treated without reference to the other item and proficiency
parameters. Both simple and complex structures of item dimensions are included. In
addition, various proficiency dimensional structures are considered for the three-
and five-dimensional cases, respectively. The effects of four potential factors on
model parameter estimation are investigated. Simulation studies are conducted across
different designs for the one-, three-, and five-dimensional cases. Results show that
the parameter estimators based on MCMC are accurate in terms of correlations and root
mean square errors. Numerical examples for the estimates of the standard errors
demonstrate that the estimation is statistically stable and accurate.

ACKNOWLEDGEMENTS

I am grateful to my dissertation committee, Dr. Mark Reckase (chair), Dr.
Kimberly Maier, Dr. Richard Houang, and Dr. James Stapleton, for their constructive
comments and valuable suggestions. Without their input, this dissertation would not
have been completed.

I would like to express my sincere gratitude to my academic advisor, Dr. Mark
Reckase, for his constant support, direction, and encouragement over the past five
years. I would also like to thank the Center for the Study of Curriculum and my
supervisor, Dr. Richard Houang, whose ﬁnal assistance supported the completion of
the dissertation research and enabled the completion of my doctoral study. Working
with him has been a tremendously rewarding experience for me.

Special thanks go to my husband, Deping Li, for his support, patience, and
understanding in my life.


Contents

LIST OF TABLES
LIST OF FIGURES

1 Introduction
1.1 Item Response Theory Models
1.1.1 The Uni-dimensional Item Response Theory Models
1.1.2 The Multi-dimensional IRT Models
1.2 Estimation Methods for IRT Models
1.2.1 Commonly Used Estimation Methods and Their Limitations
1.2.2 Applications of MCMC methods to Estimation of IRT-based Models
1.3 The Importance of the Study

2 MCMC Methods for Parameter Estimation for Logistic MIRT Model
2.1 Overview of Markov Chain Monte Carlo Methods
2.2 Likelihood Functions for the Linear Logistic MIRT Models
2.3 M-H within Gibbs for Parameter Estimation for MIRT Models
2.3.1 Complete Conditional Functions for Model Parameters
2.3.2 Modelling the Covariance Structure for Multidimensional Abilities
2.3.3 Random Walk Metropolis Algorithm within Gibbs
2.4 Unbiased and Consistent Estimators of Parameters

3 Simulation Studies and Results
3.1 Prior Distributions for Model Parameters
3.2 Diagnosing the Convergence of Markov Chains
3.3 Initial Values and Iterations
3.4 Estimating the Unidimensional 3PL Model
3.4.1 Assessing Convergence
3.5 Estimating the 3-Dimensional MIRT Model
3.5.1 Generating Proficiency Parameters
3.5.2 The Number of Proficiency Dimension and Sample Size
3.5.3 Proficiency Structure
3.5.4 Generating Item Parameters
3.5.5 The Estimation Accuracy and Stability for the 3-Dimensional MIRT Model
3.6 Estimating the 5-dimensional Model
3.7 Proficiency Structure Estimation
3.8 Computing Time

4 Concluding Remarks and Future Research Directions

BIBLIOGRAPHY

List of Tables

3.1 True Item Parameters for 30-Item Test (Dim = 1)
3.2 True Item Parameters for 45-Item Test (Dim = 1)
3.3 Estimates from three chains for 30-Item Test (Dim = 1, N = 2000)
3.4 Item Parameter Estimates for 30-Item Test (Dim = 1)
3.5 Item Parameter Estimates for 30-Item Test In BILOG-MG3 (Dim = 1)
3.6 Item Parameter Estimates for 45-Item Test (Dim = 1)
3.7 Item Parameter Estimates for 45-Item Test (Dim = 1), cont.
3.8 RMSE for Estimating Uni-dimensional Models (Dim = 1)
3.9 Correlations Between True Proficiency and Estimates (Dim = 1)
3.10 True Item Parameters for 30-Item Test (Dim = 3)
3.11 True Item Parameters for 45-Item Test (Dim = 3)
3.12 RMSE for Multi-dimensional Test (Dim = 3, p = .2)
3.13 RMSE for Multi-dimensional Test (Dim = 3, p = general)
3.14 Correlations Between True Proficiency and Estimates (Dim = 3, p = .2)
3.15 Correlations Between True Proficiency and Estimates (Dim = 3, p = general)
3.16 True Item Parameters for 30-Item Test (Dim = 5)
3.17 True Item Parameters for 45-Item Test (Dim = 5)
3.18 True Item Parameters for 45-Item Test (Dim = 5), cont.
3.19 RMSE for Multi-dimensional Test (Dim = 5, p = .2)
3.20 RMSE for Multi-dimensional Test (Dim = 5, p = general)
3.21 Correlations Between True Proficiency and Estimates (Dim = 5, p = .2)
3.22 Correlations Between True Proficiency and Estimates (Dim = 5, p = general)
3.23 Estimates of Covariance Matrix, Dim = 3, p = .2
3.24 Estimates of Covariance Matrix, Dim = 3, p = general
3.25 Estimates of Covariance Matrix, Dim = 5, p = general
3.26 Estimates of Covariance Matrix, Dim = 5, p = .2
3.27 Computing time for 1-, 3-, and 5-Dimension data
4.1 TESTFACT Item Parameters Estimates for 30-Item Test (Dim = 3)
4.2 TESTFACT Item Parameters Estimates for 30-Item Test (Dim = 5)

List of Figures

3.1 Sample ACF for series of a6, Dim = 1
3.2 Sample draw at first 3000 iterations for series of a, b and c
3.3 True Proficiency Versus Estimates (Dim = 1)
3.4 True a Parameter Versus Estimates (Dim = 1)
3.5 True b Parameter Versus Estimates (Dim = 1)
3.6 True c Parameter Versus Estimates (Dim = 1)
3.7 True Proficiency Versus Estimates (Dim = 3, p = general, n = 30, N = 5000)
3.8 True Proficiency Versus Estimates (Dim = 3, p = general, n = 45, N = 2000)
3.9 True Proficiency Versus Estimates (Dim = 3, p = general, n = 45, N = 2000)
3.10 True a1 Parameter Versus Estimates (Dim = 3, p = .2)
3.11 True a2 Parameter Versus Estimates (Dim = 3, p = .2)
3.12 True a3 Parameter Versus Estimates (Dim = 3, p = .2)
3.13 True d Parameter Versus Estimates (Dim = 3, p = .2)

Chapter 1

Introduction

1.1 Item Response Theory Models

Item response theory (IRT) has become increasingly important in psychological and
educational testing. This philosophical and theoretical framework not only provides
useful analytical tools (e.g., differential item functioning analysis and test
equating), but also provides an effective tool for test design. The importance of the
IRT framework cannot be realized, however, unless the model parameters are accurately
estimated, given that the model assumptions are satisfied and the model fits the
observed data adequately.
In this chapter, both uni-dimensional and multi-dimensional logistic IRT models
are introduced, some of the existing estimation methods are then reviewed,
and finally the importance of a new method for estimating multidimensional IRT
models is addressed.

1.1.1 The Uni-dimensional Item Response Theory Models

Classical test theory (CTT) has been the mainstream of educational and psychological
testing research and practice for many decades. Gulliksen's "Theory of Mental
Tests" (1950) is one of the earliest books on measurement theory and one of its
milestones. However, CTT suffers from a number of limitations that are often noted in
the literature (e.g., Embretson & Reise, 2000; Hambleton & Swaminathan, 1985). For
example, item statistics (e.g., item difficulty) are sample dependent; reliability and
the standard error of measurement, which are fundamental concepts in true score
theory, do not take the proficiency differences among examinees into account. Hence,
only a single reliability estimate is obtained for a test. Furthermore, CTT cannot
probabilistically predict examinees' responses to items unless the items have
previously been administered to similar individuals. In many testing contexts, such as
adaptive testing, it is important to predict the probability of an examinee's response
in order to select the next item for that examinee. As Lord states,

“we need to describe the items by item parameters and the examinees
by examinee parameters in such a way that we can predict probabilistically
the response of any examinees to any items, even if similar examinees have
never taken similar items before” (Lord, 1980, p. 11).

Unfortunately, CTT does not have this property. Item response theory is a model-based
measurement framework. IRT provides a more complete rationale for model-based
measurement than CTT and overcomes a number of the limitations of CTT (for
details, see Embretson & Reise, 2000). The development of IRT is due largely to
the work of Lord (1952, 1953), Birnbaum (1957, 1958a, 1958b), Lord and
Novick (1968), and Rasch (1960). Various IRT-based models have been developed
in the literature, for example, the normal ogive models (Lord, 1952) and the
logistic models (Rasch, 1960; Birnbaum, 1957, 1958a, 1958b, 1968; Wright & Stone,
1979) for binary data, and the graded response model (Samejima, 1969), the partial
credit model (Masters, 1982), and the nominal response model (Bock, 1972) for
polytomous data. There are other uni-dimensional IRT models (e.g., the continuous
response model; Samejima, 1972), but they are not discussed here since this study
focuses on applying a new method to the logistic IRT models. One common feature of
these models is that they explicitly predict the probability of a correct response to
an item given the person and item parameters. More comparisons of other
characteristics of CTT and IRT can be found in Embretson and Reise (2000).

In the family of IRT models, the three-parameter logistic model (3PL model) is
one of the most widely used. It was proposed by Birnbaum (1968). For a
dichotomous item, the item response function (IRF), also called the item
characteristic curve (ICC), gives the probability of a correct response to the item.
This probability can be represented by the function (Lord, 1980)

\[
P_i(\theta_j) \equiv p(U_{ij} = 1 \mid a_i, b_i, c_i, \theta_j)
  = c_i + (1 - c_i)\,
    \frac{\exp[1.7 a_i(\theta_j - b_i)]}{1 + \exp[1.7 a_i(\theta_j - b_i)]},
\tag{1.1}
\]

where

$P_i(\theta_j)$ is the probability of a correct answer to item i given the jth
examinee's proficiency level $\theta_j$;

$U_{ij}$ is the item response, either 0 (incorrect) or 1 (correct), for examinee j on
item i;

$a_i$ is the discriminating power of the ith item; it is usually a positive number;

$b_i$ is the difficulty of the ith item;

$c_i$ is the lower asymptote of the ith item, also called the pseudo-guessing
parameter; and

1.7 is a scale constant.
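
As an illustration only, Equation (1.1) can be evaluated numerically. The sketch below is in Python, which is not the software used in this study; the function name `p_3pl` and the item parameter values are hypothetical and chosen purely for demonstration.

```python
import math

D = 1.7  # scale constant from Equation (1.1)


def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the 3PL model, Equation (1.1)."""
    z = D * a * (theta - b)
    # exp(z) / (1 + exp(z)) written in the equivalent form 1 / (1 + exp(-z))
    logistic = 1.0 / (1.0 + math.exp(-z))
    return c + (1.0 - c) * logistic


# Hypothetical item: a = 1.2, b = 0.5, c = 0.2 (illustrative values only).
for theta in (-3.0, 0.5, 3.0):
    print(theta, round(p_3pl(theta, a=1.2, b=0.5, c=0.2), 4))
```

When the proficiency equals the item difficulty, the logistic term is 1/2 and the probability is c + (1 - c)/2, which is halfway between the lower asymptote and 1. As the proficiency decreases, the probability approaches c rather than 0.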

If there is no lower asymptote parameter in the above model, i.e., $c_i = 0$, the 3PL
model reduces to the 2PL model. Furthermore, if the discriminating power parameter
$a_i$ is treated as a constant in the model, then the model becomes the 1PL model, or
Rasch model, because only one item parameter (i.e., item difficulty) remains in the
model. Note that the 3PL, the 2PL, and the 1PL models contain only one proficiency
parameter for each examinee, an important assumption for these models, which are
therefore labelled uni-dimensional IRT models.

In addition to unidimensionality, another important assumption for IRT models
is local independence. For a single examinee, the responses to the test items are
related to each other only through this examinee's proficiency parameter(s). Hence,
local independence can be understood as conditional independence. It assumes that an
examinee's responses to items are independent of each other after controlling for the
examinee's proficiency parameter(s). The mathematical expression of local
independence is given by

\[
p(u_1, u_2, \ldots, u_n \mid \theta) = \prod_{i=1}^{n} p_i(u_i \mid \theta),
\tag{1.2}
\]

where $u_i$ is the item response on the ith item for a single examinee and
$i = 1, 2, \ldots, n$. Equation (1.2) implies that, given a fixed proficiency
parameter, the joint distribution $p$ of responses to n items is the product of the
marginal distributions $p_i$ for all items.
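
A minimal sketch of Equation (1.2) follows, assuming 3PL items as in Equation (1.1). The helper names and the three hypothetical items are illustrative assumptions, not part of the original study.

```python
import math
from itertools import product


def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response, Equation (1.1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def pattern_likelihood(responses, items, theta):
    """Equation (1.2): product over items of p_i(u_i | theta)."""
    like = 1.0
    for u, (a, b, c) in zip(responses, items):
        p = p_3pl(theta, a, b, c)
        like *= p if u == 1 else 1.0 - p
    return like


# Hypothetical (a, b, c) triples for a three-item test.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]

# For a fixed theta, the probabilities of all 2^n response patterns sum to 1.
total = sum(pattern_likelihood(u, items, theta=0.3)
            for u in product((0, 1), repeat=len(items)))
print(total)
```

Because each factor is a Bernoulli probability, summing the product over every possible response pattern recovers 1 for any fixed proficiency, which is a convenient sanity check on an implementation.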

1.1.2 The Multi-dimensional IRT Models

In multi-dimensional item response theory (MIRT), it is assumed that an examinee needs
multiple abilities to respond correctly to an item. Under this circumstance, the
uni-dimensional IRT models are not adequate for the response data. A family of IRT
models that contain multiple proficiency parameters is needed to reflect each
examinee's proficiency level on the different dimensions.

MIRT is an extension of uni-dimensional IRT. Like uni-dimensional IRT, MIRT
models an examinee's behavior (i.e., item response) given person and item
characteristics. The essential difference between MIRT and uni-dimensional IRT is that
in MIRT, multiple proficiency parameters are used to model person abilities and a
vector of item parameters is used to characterize each item.

To describe MIRT-based models, it is necessary to introduce the concept of the
complete latent space. Lord defined it as the collection of all those latent variables
$\theta_k$, $k = 1, 2, \ldots, p$, that discriminate among groups of examinees (Lord &
Novick, 1968), where p is the number of proficiency dimensions. Denote the complete
latent space by the vector

\[
\boldsymbol{\theta} \equiv (\theta_1, \theta_2, \ldots, \theta_p)'.
\tag{1.3}
\]

These variables can be thought of as “psychological dimensions necessary for the
psychological description of individuals” (p.359). For the population of examinees,
every single examinee possesses a value for each of the latent variables in the space.
For uni-dimensional IRT models, the complete latent space has only one variable. For
multi-dimensional IRT models, it is assumed that two or more latent variables are
needed to characterize an examinee’s proﬁciency.

Several MIRT-based models have been proposed. Early MIRT models for binary data came
from the work of McDonald (1967) and Lord and Novick (1968). Other models can
also be found in the literature, for example, the multidimensional Rasch model
(Stegelmann, 1983), the multidimensional two-parameter normal ogive IRT model
(Bock, Gibbons and Muraki, 1988), and the multicomponent latent trait model (MLTM;
Whitely, 1980). Reckase (1985, 1996) extended the uni-dimensional three-parameter
logistic model to a multi-dimensional form. He pointed out that

“After reviewing many possible models that include vector parameters for
both examinee and item characteristics [see McKinley and Reckase (1982)
for a summary], the model given below was selected for further develop-
ment because it was reasonable given what is known about item response
data, consistent with simpler, uni-dimensional item response theory mod-
els, and estimable with commonly attainable numbers of examinees and
test items” (p. 272).

The model is given by

\[
P_i(\boldsymbol{\theta}_j) \equiv
p(U_{ij} = 1 \mid \mathbf{a}_i, d_i, c_i, \boldsymbol{\theta}_j)
  = c_i + (1 - c_i)\,
    \frac{\exp(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)}
         {1 + \exp(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)},
\tag{1.4}
\]

where

$p(U_{ij} = 1 \mid \mathbf{a}_i, d_i, c_i, \boldsymbol{\theta}_j)$ is the probability
of a correct response (score of 1) for examinee j on test item i;

$U_{ij}$ is a dichotomous random variable representing the item response for examinee
j on item i;

$\boldsymbol{\theta}_j$ is the vector of abilities for examinee j, i.e.,
$\boldsymbol{\theta}_j \equiv (\theta_{j1}, \theta_{j2}, \ldots, \theta_{jp})'$;

$\mathbf{a}_i$ is a vector of parameters related to the discriminating power of test
item i (the rate of change of the probability of a correct response with changes in
the trait levels of the examinees);

$d_i$ is a parameter related to the difficulty of item i;

$c_i$ is the probability of a correct response that is approached when the abilities
assessed by item i are very low; $c_i$ is usually called the lower asymptote, or less
correctly, the guessing parameter.
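
The following sketch evaluates Equation (1.4) in Python for illustration. The function name `p_mirt` and the parameter values are hypothetical; the example is only meant to show how the linear combination of abilities enters the model.

```python
import math
from typing import Sequence


def p_mirt(theta: Sequence[float], a: Sequence[float], d: float, c: float) -> float:
    """Probability of a correct response under the model in Equation (1.4)."""
    # Linear combination a_i' theta_j + d_i of the examinee's abilities.
    z = sum(a_k * theta_k for a_k, theta_k in zip(a, theta)) + d
    return c + (1.0 - c) / (1.0 + math.exp(-z))


# Hypothetical two-dimensional item with equal discrimination on both dimensions.
a, d, c = (1.0, 1.0), 0.0, 0.2

# A high ability on one dimension can offset a low ability on the other,
# because only the weighted sum of the abilities enters the exponent.
print(p_mirt((1.0, -1.0), a, d, c))
print(p_mirt((0.0, 0.0), a, d, c))
```

With a single dimension, setting the discrimination to 1.7a and the intercept to d = -1.7ab reproduces the 3PL model of Equation (1.1), which is one way to check that the two parameterizations agree.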

The unique contribution of the model above, as summarized by Reckase (1997),
is that it focuses on the characteristics of the test items and the way they interact
with the examinee population. This model has proved to be useful for a variety of
applications and has helped in conceptualizing a number of psychometric problems,
including the assessment of differential functioning and test parallelism (Ackerman,
1990, 1992).

1.2 Estimation Methods for IRT Models

1.2.1 Commonly Used Estimation Methods and Their Limitations

IRT models contain at least two types of parameters: person parameters (also called
latent trait, proficiency, or ability parameters) and item parameters. Person
parameters for IRT models are frequently estimated using one of three
methods: (1) maximum likelihood (ML); (2) maximum a posteriori (MAP); and
(3) expected a posteriori (EAP). The ML method estimates person parameters by
maximizing the likelihood of an examinee's item responses. One critical problem
with the ML method is that it cannot estimate person parameters for examinees
who have all-correct or all-incorrect response patterns (Embretson & Reise, 2000,
p. 162). In addition, ML estimates are consistent only as the sample size
increases (here sample size refers to the number of test items, or test length), which
in practice is not an easy condition to meet because a test is often viewed as a fixed
set of items.

Both EAP and MAP come from the Bayesian perspective. The MAP scoring method (also
called Bayesian modal estimation) uses prior information about person
proficiency in conjunction with the likelihood function to estimate the proficiency
level by maximizing the posterior distribution. The advantage of MAP is that
proficiency can be estimated for all possible response patterns, including perfect
patterns. A perfect pattern could be an all-correct response pattern, an all-incorrect
response pattern, or some odd pattern that makes it difficult for the ML procedure to
find a solution (e.g., no solution, or multiple solutions). A criticism of Bayesian
modal estimation is that the proficiency estimates may depend heavily on the choice of
the prior distribution of the proficiency parameters, especially when the sample size
(i.e., test length) is small. EAP estimates proficiency by the mean of the posterior
distribution. One advantage of the EAP estimator is that it “has minimum mean square
error over the population of ability” (Bock & Mislevy, 1982, p. 439). However, the
estimates from EAP are biased (Wainer & Thissen, 1987).
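
The contrast among the three scoring methods can be sketched on a grid of proficiency values. The code below is an illustrative assumption rather than the procedure of any particular program: it uses a standard normal prior, an evenly spaced grid on [-4, 4], and five hypothetical 3PL items, and it scores an all-correct response pattern.

```python
import math


def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response, Equation (1.1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def likelihood(theta, responses, items):
    """Likelihood of one response pattern under local independence."""
    like = 1.0
    for u, (a, b, c) in zip(responses, items):
        p = p_3pl(theta, a, b, c)
        like *= p if u == 1 else 1.0 - p
    return like


def grid_estimates(responses, items, lo=-4.0, hi=4.0, n_points=161):
    """Return grid-based (ML, MAP, EAP) estimates under an assumed N(0, 1) prior."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + k * step for k in range(n_points)]
    like = [likelihood(t, responses, items) for t in grid]
    prior = [math.exp(-0.5 * t * t) for t in grid]  # unnormalized N(0, 1) density
    post = [l * g for l, g in zip(like, prior)]
    ml = grid[like.index(max(like))]      # maximizer of the likelihood
    map_est = grid[post.index(max(post))]  # posterior mode
    eap = sum(t * w for t, w in zip(grid, post)) / sum(post)  # posterior mean
    return ml, map_est, eap


# Hypothetical items (a, b, c) and an all-correct response pattern.
items = [(1.0, b, 0.2) for b in (-1.0, -0.5, 0.0, 0.5, 1.0)]
ml, map_est, eap = grid_estimates([1, 1, 1, 1, 1], items)
print(ml, map_est, eap)
```

For the all-correct pattern the likelihood keeps increasing with proficiency, so the grid-based ML estimate sits at the upper end of the grid, whatever bound is chosen. The MAP and EAP estimates stay in the interior because the prior pulls them back, which also shows their dependence on the prior.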

Item parameters in IRT models are usually estimated by the maximum likelihood
(ML) approach. The commonly used methods under this approach are (a) joint
maximum likelihood (JML), (b) marginal maximum likelihood (MML), and (c) conditional
maximum likelihood (CML).

It is known that the maximum likelihood estimator of the person parameters is
consistent only when the item parameters are known and the number of
items increases. Similarly, consistent item parameter estimates can be obtained
when the person parameters are known and the number of examinees increases. The
JML procedure simultaneously estimates person and item parameters for all items
and examinees by jointly maximizing the likelihood function of the response data.
In principle, this procedure is straightforward. However, it has several drawbacks in
practice, as some researchers have pointed out. First, the nonlinear (i.e., S-shaped)
item characteristic curve (ICC) results in nonlinear likelihood equations, and solving
systems of nonlinear equations is often a formidable task (Hambleton & Swaminathan,
1985). Secondly, when JML is used with the 3PL model, large numbers of examinees
(e.g., more than 1000) are required for accurate item parameter estimation (e.g., Lord
& Novick, 1968; Swaminathan & Gifford, 1979). Thirdly, increasing the number of
examinees does not guarantee improved estimation (Hulin, Lissak, & Drasgow, 1982).
That is, the consistency property does not always hold, because the numbers of item
(structural) and person (incidental) parameters increase simultaneously.

When sufficient statistics are available for the person parameters, one may avoid
having the person parameters appear in the likelihood function. For the Rasch
model, since the number-correct score (also called the total score) is a sufficient
statistic for the proficiency parameter, it is possible to express the likelihood
function $L(U \mid \theta, b)$ in terms of total scores instead of proficiency
parameters. The CML procedure can then be used to estimate the item parameters, and
the corresponding estimates are consistent (Hambleton & Swaminathan, 1985). However,
since CML requires a sufficient statistic for estimating trait level, it is restricted
to the Rasch model family. In more complex models such as the 2PL, the 3PL and the
MIRT models, proficiency estimates are dependent on item characteristics. Therefore
the total score is no longer a sufficient statistic for estimating proficiency. In
addition, Embretson and Reise (2000) pointed out several other disadvantages of the
CML estimation procedure: no estimates for items or persons are available for perfect
response patterns (p. 218), and numerical problems often occur for long tests,
complicated patterns of missing data, or polytomous data.
Item parameters can be estimated if the likelihood function can be
expressed without any reference to the person parameters. Assuming that the underlying
distribution of proficiency is continuous and known, the essence of MML is to
integrate over the proficiency distribution so that the item parameters are estimated
from the marginal distribution (Bock & Lieberman, 1970). This procedure removes the
dependency of the item parameter estimates on the proficiency estimates. The advantage
of MML is that its estimates are consistent, since increasing the number
of examinees does not require the estimation of additional proficiency parameters
(Kiefer & Wolfowitz, 1956). The MML approach is implemented within the framework of
the EM algorithm (Baker, 1992, p. 190). Although MML/EM has many attractive features
and has become a standard for item parameter estimation, Baker (1992, p. 190)
pointed out that the approach has certain limitations in practice. For example,
items that are answered correctly or incorrectly by all examinees have to be
eliminated before calibration, an obvious loss of information;
certain data sets can yield item difficulty estimates with large absolute values and
other deviant item parameter estimates. Once these deviant values are used for
proficiency estimation, they can cause the estimation process to fail. In addition,
although many researchers have studied accelerated versions of the EM algorithm, its
convergence is slow when estimating high-dimensional models.
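
The marginalization step that defines MML can be sketched numerically. The example below is an assumption-laden illustration, not the algorithm of any specific program: it takes a standard normal proficiency distribution, approximates the integral with evenly spaced nodes and normalized weights, and uses hypothetical 3PL items and response data.

```python
import math
from itertools import product


def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response, Equation (1.1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def likelihood(theta, responses, items):
    """Likelihood of one response pattern under local independence."""
    like = 1.0
    for u, (a, b, c) in zip(responses, items):
        p = p_3pl(theta, a, b, c)
        like *= p if u == 1 else 1.0 - p
    return like


def normal_quadrature(n_points=41, lo=-4.0, hi=4.0):
    """Evenly spaced nodes with normalized N(0, 1) weights (a simple approximation)."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + k * step for k in range(n_points)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    return nodes, [w / total for w in dens]


def marginal_probability(responses, items, nodes, weights):
    """Pattern probability with proficiency integrated out of the likelihood."""
    return sum(w * likelihood(t, responses, items) for t, w in zip(nodes, weights))


def marginal_log_likelihood(data, items, nodes, weights):
    """Objective that MML maximizes over the item parameters only."""
    return sum(math.log(marginal_probability(u, items, nodes, weights)) for u in data)


items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]  # hypothetical items
data = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 0)]            # hypothetical responses
nodes, weights = normal_quadrature()
print(marginal_log_likelihood(data, items, nodes, weights))
```

The objective depends on the item parameters alone, so adding examinees adds terms to the sum without adding unknowns. An EM or other optimization routine would then be needed to maximize it; that step is omitted here.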

If prior information about the item parameters is available, Bayesian estimation
methods are possible for IRT-based models. Swaminathan and Gifford (1982, 1985, 1986)
derived Bayesian estimation procedures for the one-, two-,
and three-parameter logistic models, in which item parameter estimation takes place
without any marginalization. Mislevy (1986b) and Tsutakawa and Lin (1986) took a
different approach, which inherits the properties of MML by integrating (i.e.,
marginalizing) the proficiency parameter out of the likelihood function. Marginal
Bayesian modal estimation is also implemented within the framework of the EM algorithm
(Baker, 1992). However, marginalized Bayesian item parameter estimates may depend
heavily on the item priors, in particular for small sample sizes, and hence with
informative priors the resulting item parameter estimates are shrunk toward the mode
of the corresponding prior distribution.

The frequently used estimation methods and their limitations have been summarized
in this section. For one-dimensional IRT models, although joint maximum likelihood
estimation is available in some programs to estimate item and proficiency parameters
simultaneously (e.g., LOGIST uses the joint maximum likelihood estimation paradigm
formulated by Allan Birnbaum in 1968), the estimates of the proficiency parameters
need not be consistent as the sample size increases (e.g., Neyman & Scott, 1948;
Little & Rubin, 1983). In addition, for some extreme response patterns, the maximum
likelihood procedure can give positive or negative infinite estimates of the
proficiency parameters.

The MML/EM procedure has become a central methodology for parameter estimation
in the IRT framework. However, when test settings become more complex (e.g., in the
presence of missing data and polytomously scored data) and IRT models become more
complicated (e.g., the MIRT models), application of the EM algorithm becomes less
straightforward (Patz & Junker, 1999a).

In Section 1.3, the importance of a new method for parameter estimation in linear
logistic MIRT models will be addressed.

1.2.2 Applications of MCMC methods to Estimation of IRT-based Models

A new estimation approach that could avoid some shortcomings of the estimation
procedures discussed above is desired to improve the estimation accuracy in particular
for the more complicated testing practices and the complex IRT models. Markov
Chain Monte Carlo (MCMC) methods, which are from a Bayesian perspective, can
be applied to estimating parameters for IRT models.

Researchers have been interested in MCMC methods for several decades (e.g.,
Metropolis et al., 1953). MCMC methods have been successful in many Bayesian
applications because they allow one to draw samples from a wide range of posterior
distributions of interest, including many for which simulation methods were previously
much more difficult to implement (e.g., Gilks, Richardson, & Spiegelhalter, 1996).
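
To make the idea of drawing samples from a posterior distribution concrete, the sketch below applies a random-walk Metropolis sampler to the proficiency of a single examinee. It is a deliberately simplified illustration and not the algorithm developed in this study: the item parameters are treated as known, the prior is assumed to be standard normal, and all names and numerical settings are hypothetical.

```python
import math
import random


def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response, Equation (1.1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_posterior(theta, responses, items):
    """Log-likelihood plus log of an assumed N(0, 1) prior, up to a constant."""
    logp = -0.5 * theta * theta
    for u, (a, b, c) in zip(responses, items):
        p = p_3pl(theta, a, b, c)
        logp += math.log(p if u == 1 else 1.0 - p)
    return logp


def random_walk_metropolis(responses, items, n_iter=20000, step=1.0, seed=1):
    """Draw a Markov chain whose stationary distribution is the posterior of theta."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, responses, items)
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)  # symmetric normal proposal
        candidate = log_posterior(proposal, responses, items)
        # Accept with probability min(1, posterior ratio); 1 - random() lies in
        # (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        draws.append(theta)
    return draws


items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]  # hypothetical items
draws = random_walk_metropolis([1, 1, 0], items)
kept = draws[2000:]  # discard an arbitrary burn-in period
print(sum(kept) / len(kept))
```

The average of the retained draws approximates the posterior mean of the proficiency without evaluating any integral explicitly. The burn-in length, proposal step size, and chain length used here are arbitrary choices for the illustration and would need to be checked with convergence diagnostics in practice.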


MCMC methods have also recently been implemented for parameter estimation
and inference through stochastic simulation for IRT models. Patz and Junker (1999a)
demonstrate that MCMC techniques are well suited to complex models with IRT
assumptions and that the MCMC methodology can be routinely implemented in
IRT contexts. They further address the strategies and issues involved in extending
the basic MCMC methods for Bayesian inference to complex IRT settings such as
non-response, designed missingness, multiple raters, guessing behaviors, and partial
credit (i.e., polytomous) test items (Patz & Junker, 1999b). Earlier work can be
traced back to Albert (1992), who estimated the two-parameter normal ogive model for
augmented data using the Gibbs sampler. Various applications of MCMC methods have also
been developed in the literature for item parameter recovery (e.g., Wollack, Bolt,
Cohen, & Lee, 2002; Mathews & Hombo, 2001; Kim & Cohen, 1998; de la Torre &
Patz, 2001; Maris & Maris, 2002; Fox, 2002; Williamson, Johnson, Sinharay & Bejar,
2002) and for coefficient alpha estimation (Li & Woodruff, 2001), among others.

Unlike the Bayesian modal estimates discussed in Section 1.2.1, the MCMC
estimates of the parameters are no longer dependent on the prior distribution, and the
parameter estimates are not shrunk toward the mean of the prior distribution.

1.3 The Importance of the Study

Recently, Segall (1996, 2001) has advanced multidimensional adaptive testing (MAT)
and the measurement of general proficiency using a linear logistic MIRT model. He
found that MAT could provide equal or higher reliability with fewer items than are
required in one-dimensional adaptive testing. He concludes that in addition to
increasing measurement efficiency, MAT can also be used as a tool for ensuring
adequate and efficient coverage of content for examinees at different levels of
proficiency (Segall, 1996). However, as he emphasizes, further study is needed before
MAT can be routinely applied, and item parameter estimation for MIRT models must be
refined.

In estimating parameters for MIRT models, simple structure (i.e., each item
measures only one dimension of proficiency) is sometimes assumed (e.g., de la Torre &
Patz, 2001). The multi-unidimensional approach, as suggested by Segall (e.g., 1996),
is an example of a simple structure. In this approach, several sub-tests measuring
different contents are given at one test administration. There are two ways to
estimate the model parameters for the multi-unidimensional approach. One is to
estimate the model parameters for the sub-tests separately (i.e., independently),
which is not realistic since the contents to be measured are usually correlated. The
other is to treat each content as one dimension and then estimate the model parameters
simultaneously using a multidimensional model. Segall (1996) pointed out that although
the multi-unidimensional approach is appealing in terms of its simple structure, it
may suffer from at least two undesirable features. One is the possibly poor
specification of the elements of the covariance matrix of the proficiency vector, and
the other is that the assumption of simple structure may lead to some poorly specified
loadings (p. 350). In addition, developing a common metric and orientation for the
item parameter estimates of MIRT models is inconvenient and may even be unachievable.
Segall (1996) notes that when developing large item pools with several dimensions, it
is often necessary to divide the pools into subsets of items. This design, however,
may raise several issues concerning the metric of the latent dimensions. Therefore, a
new methodology is desirable for the concurrent estimation of item parameters for MIRT
models when building item pools, before MAT can be more reliably implemented.

Both item and proficiency parameters in MIRT models can be estimated simultaneously
using MCMC methods. Parameter estimation using MCMC methods is
different from a number of other approaches for estimating MIRT models (Carlson, 1987;
Fraser, 1988; McDonald, 1985; McKinley & Reckase, 1983; Muthen, 1984). Efforts to
apply MCMC methods to multidimensional models have been reported in the literature.
For example, Beguin and Glas (1998) generalized the Albert (1992) procedure
to the unidimensional 3PL normal ogive model and Q-multidimensional normal ogive
models. However, the study assumes that the underlying covariance matrix for abilities
is an identity matrix, which is not realistic since the proficiency dimensions in one
test are more likely to be correlated. Moreover, the values of the item parameters in
the study are restricted to a small range (e.g., a is from 0 to 1, d is from -1 to 1),
which is also not realistic for a general and more complex testing context.

De la Torre and Patz (2001) examine simultaneous proficiency estimation for
MIRT models using an MCMC approach. However, the study assumes only simple
structure. In addition, to estimate the proficiency parameters, the study assumes that the
item parameters are known, which is not the case in many applications.

Bolt and Lall (2003) investigate the item parameter estimation of compensatory
and noncompensatory MIRT models using the MCMC method. In their study, the
guessing parameter was not included in the MIRT models and only a two-dimensional
model was considered. In addition, the item parameters cover only a small range of
values.

However, not much attention has been paid to three-parameter MIRT models, which
have proven useful for a variety of applications in the literature. It is necessary
to study parameter estimation using MCMC methods in more general, complex,
and realistic situations. For example, in the current study a guessing parameter is
included in the model, complex item dimension structures (i.e., each item measures one
or more than one dimension of abilities) are considered in the test design with an
exploratory solution, and the inter-correlation among proficiency dimensions is estimated
rather than limited to the identity matrix or a special pattern of covariance matrix
(e.g., all off-diagonal elements are the same). Moreover, the current study intends to
examine the impact of four factors (the test length, the number of dimensions, the
sample size, and the proficiency covariance structure) on the accuracy and stability
of parameter estimates for MIRT models.


Chapter 2

MCMC Methods for Parameter
Estimation for Logistic MIRT
Model

2.1 Overview of Markov Chain Monte Carlo Methods

Statistical inference is a procedure for drawing conclusions about population
parameters from the observed sample data. Bayesian statistical conclusions about a
parameter are typically made in terms of a probability statement conditioned on the
observed data, that is, the posterior distribution of the parameter of interest. A sample
generated by MCMC methods can be used for statistical inference, including point
estimation, the construction of a marginal density, prediction, estimation of moments,
and so on.

Gill (2002) defined a Markov chain as
“a stochastic process with the property that any specified state in the
series, $\theta^{(t)}$, is only dependent on the previous value of the chain.” Or, in a
probability expression (p.302):

$$p(\theta^{(t)} \in A \mid \theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(t-2)}, \theta^{(t-1)}) = p(\theta^{(t)} \in A \mid \theta^{(t-1)}), \tag{2.1}$$

where $A$ is an event or range of events in the complete state space; $t$ is a positive
number referring to the $t$th time interval; and $\theta$ is a random quantity taking values in
some known state space, $\Theta$.

The Monte Carlo method summarizes a theoretical distribution of interest using
random samples from that distribution instead of calculating quantities from its
analytical form.

Generally speaking, Markov Chain Monte Carlo methods involve two steps.
First, a chain is produced in which each value depends only on the previous value.
Second, once this chain has converged to the desired posterior distribution, the Monte
Carlo method is used to summarize the distribution of interest.

There are two basic methods in MCMC: (1) Gibbs sampler; (2) Metropolis-
Hastings algorithm.

The Gibbs sampler, named by Geman and Geman (1984), is one of the most widely
used MCMC techniques. Let $Q$ be the vector of model parameters with $k$ components,
and let $q_i$ be the $i$th model parameter in $Q$. Denote $Q \equiv (q_1, q_2, \ldots, q_i, \ldots, q_k)$ and
$Q_{-i} \equiv (q_1, q_2, \ldots, q_{i-1}, q_{i+1}, \ldots, q_k)$. Then $Q$ can be expressed as $Q \equiv Q_{-i} \cup q_i$.
Denote the complete conditional function of the $i$th parameter by
$P(q_i \mid Q_{-i}) \equiv P(q_i \mid q_1, q_2, \ldots, q_{i-1}, q_{i+1}, \ldots, q_k)$.

The Gibbs sampler sequentially samples from the complete conditional distributions
$P(q_i \mid Q_{-i}, y)$, $i = 1, \ldots, k$, where $y$ indicates observed data.

Then the Gibbs sampling algorithm can be defined as follows:

1. Specify the starting values for the model parameter vector $Q$, i.e.,
$Q^{(0)} = (q_1^{(0)}, q_2^{(0)}, \ldots, q_k^{(0)})$.

2. At the $(t+1)$th iteration, simulate sequentially

$q_1^{(t+1)}$ from $p(q_1 \mid q_2^{(t)}, q_3^{(t)}, \ldots, q_k^{(t)})$,
$q_2^{(t+1)}$ from $p(q_2 \mid q_1^{(t+1)}, q_3^{(t)}, \ldots, q_k^{(t)})$,
...
$q_i^{(t+1)}$ from $p(q_i \mid q_1^{(t+1)}, q_2^{(t+1)}, \ldots, q_{i-1}^{(t+1)}, q_{i+1}^{(t)}, \ldots, q_k^{(t)})$,
...
$q_k^{(t+1)}$ from $p(q_k \mid q_1^{(t+1)}, q_2^{(t+1)}, \ldots, q_{k-1}^{(t+1)})$.

3. Set $t = t + 1$ and repeat step 2 until convergence.
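The three steps above can be sketched on a toy problem. The example below is only an illustration, not part of the study: it assumes a standard bivariate normal target with correlation `rho`, for which both complete conditionals are univariate normals, and the function name `gibbs_bivariate_normal` is hypothetical.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, start=(0.0, 0.0), seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    q1, q2 = start                        # step 1: starting values Q^(0)
    cond_sd = np.sqrt(1.0 - rho ** 2)     # sd of each complete conditional
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):               # steps 2-3: cycle through the components
        q1 = rng.normal(rho * q2, cond_sd)    # q1^(t+1) ~ p(q1 | q2^(t))
        q2 = rng.normal(rho * q1, cond_sd)    # q2^(t+1) ~ p(q2 | q1^(t+1))
        draws[t] = (q1, q2)
    return draws

draws = gibbs_bivariate_normal(rho=0.6)
kept = draws[500:]                        # discard an initial burn-in period
sample_corr = np.corrcoef(kept.T)[0, 1]   # should be close to rho
```

Each component is always updated conditional on the most recent values of the others, which is what distinguishes the sequential scheme in step 2 from drawing all components at once.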

The second frequently used method is the Metropolis-Hastings algorithm (M-H
algorithm; Metropolis et al., 1953; Hastings, 1970). This method is applied when
it is difficult to simulate from the complete conditional distributions by traditional
methods (for example, by rejection sampling or by a known generator).

A Markov chain using the M-H algorithm can be obtained as follows.

For any parameter $\theta$,
1. Assign an initial value for the parameter $\theta$.

2. Specify a proposal density $r(\theta^t, \theta^{(t+1)})$, which defines the proposal density from
state $\theta^t$ to state $\theta^{(t+1)}$.

3. Given the current state $\theta^t$, the candidate $\theta^*$ for the next state $\theta^{(t+1)}$ in the chain
is sampled from $r(\theta^t, \theta^{(t+1)})$.

4. $\theta^*$ is accepted as the next value $\theta^{(t+1)}$, i.e., $\theta^{(t+1)} = \theta^*$, with probability
$\alpha(\theta^t, \theta^*)$, where

$$\alpha(\theta^t, \theta^*) = \min\left\{ \frac{g(\theta^*)\, r(\theta^*, \theta^t)}{g(\theta^t)\, r(\theta^t, \theta^*)},\ 1 \right\}, \tag{2.2}$$

and $g(\cdot)$ is the density of the target distribution.

5. If $\theta^*$ is rejected, then the next value stays at the current state, i.e., assign
$\theta^{(t+1)} = \theta^t$.

The M-H algorithm first simulates candidate values from a proposal distribution that
differs from the desired distribution for the parameter, and then uses the acceptance
probability to accept or reject each value, so that the resulting Markov chain
has the target posterior as its stationary distribution.
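A minimal sketch of the M-H steps with a non-symmetric proposal may help to show where the proposal densities enter the acceptance ratio. The target (an Exponential(1) density) and the log-normal proposal are assumptions chosen purely for illustration, and the names `log_target`, `log_proposal`, and `metropolis_hastings` are hypothetical.

```python
import numpy as np

def log_target(x):
    """Unnormalised log density g(.) of the target; here an Exponential(1)."""
    return -x if x > 0 else -np.inf

def log_proposal(frm, to, scale):
    """log r(frm, to) for a log-normal proposal centred at log(frm).
    The normalising constant is dropped because it cancels in the ratio."""
    return -np.log(to) - 0.5 * ((np.log(to) - np.log(frm)) / scale) ** 2

def metropolis_hastings(n_iter=20000, start=1.0, scale=1.0, seed=1):
    rng = np.random.default_rng(seed)
    theta = start                                     # step 1: initial value
    chain = np.empty(n_iter)
    accepted = 0
    for t in range(n_iter):
        cand = theta * np.exp(scale * rng.normal())   # step 3: candidate from r(theta^t, .)
        # log of the ratio inside the min{., 1} of the acceptance probability
        log_alpha = (log_target(cand) + log_proposal(cand, theta, scale)
                     - log_target(theta) - log_proposal(theta, cand, scale))
        if np.log(rng.uniform()) < min(log_alpha, 0.0):   # step 4: accept
            theta = cand
            accepted += 1
        chain[t] = theta                              # step 5: otherwise stay put
    return chain, accepted / n_iter

chain, rate = metropolis_hastings()
posterior_mean = chain[2000:].mean()                  # Exponential(1) has mean 1
```

Working on the log scale avoids numerical underflow when the target density is a product of many small probabilities, as is the case for the IRT likelihoods in the following sections.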

It has been shown that the Gibbs sampler is a special case of the M-H algorithm
in which the probability of accepting the candidate value is always one (p.436, Gelman
1992; p.182, Tanner 1996). The distinction between the Gibbs sampler and the M-H
algorithm is that the Gibbs sampler requires sampling from the complete conditional
distributions and so is more restrictive (p.166, Gamerman 1997; Besag et al. 1995;
Tierney 1991).

The combination of the Gibbs sampler and the M-H algorithm is a hybrid algorithm.
Within each Gibbs cycle, a parameter whose complete conditional distribution cannot be
sampled directly is updated by an M-H step, followed by the next Gibbs step. Like the
Gibbs sampler and the M-H algorithm, the M-H within Gibbs algorithm also produces a
Markov chain with the correct stationary distribution.

2.2 Likelihood Functions for the Linear Logistic
MIRT Models

If pre-calibrated item parameters are available, maximum likelihood estimates or
Bayesian modal estimates of the proficiency parameters can be obtained. Suppose the
assumption of local independence holds for the MIRT models. Then the probability
of a set of observed responses $u_j = (u_{1j}, u_{2j}, \ldots, u_{ij}, \ldots, u_{nj})$ for the $j$th examinee
with proficiency vector $\theta_j$ on $n$ items is equal to the product of the probabilities
associated with the response to each item:

$$L(u_j \mid \theta_j, \Sigma, A, d, c) = p(u_{1j}, u_{2j}, \ldots, u_{ij}, \ldots, u_{nj} \mid \theta_j) \tag{2.3}$$

$$= \prod_{i=1}^{n} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}}, \tag{2.4}$$

where
$u_{ij}$ is the response (0 or 1) of the $j$th examinee on the $i$th item;
$\theta_j$ is a $p$-dimensional proficiency vector, i.e., $\theta_j = (\theta_{j1}, \theta_{j2}, \ldots, \theta_{jp})$; and
$p_i(\theta_j)$ is the probability of the $j$th examinee correctly answering the $i$th item.

Similarly, the probability of a set of $N$ observed responses
$v_i = (v_{i1}, v_{i2}, \ldots, v_{ij}, \ldots, v_{iN})$ for the $i$th item is given by

$$L(v_i \mid \Theta, \Sigma, a_i, d_i, c_i) = p(v_{i1}, v_{i2}, \ldots, v_{ij}, \ldots, v_{iN} \mid \Theta, \Sigma, a_i, d_i, c_i) = \prod_{j=1}^{N} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}}.$$


According to Bayes theorem, the posterior density function of $\theta_j$, for $j = 1, 2, \ldots, N$,
can be expressed as

$$f(\theta_j \mid u_j) = \frac{L(u_j \mid \theta_j)\, \pi_\theta(\theta_j)}{m(u_j)} \propto L(u_j \mid \theta_j)\, \pi_\theta(\theta_j), \tag{2.5}$$

where
$L(u_j \mid \theta_j)$ is the likelihood function given by (2.3);
$\pi_\theta$ is the prior distribution of $\theta$;
$m(u_j)$ is the marginal probability density of $u_j$; and
$N$ is the number of examinees.

Assume the prior distribution of $\theta$ is multivariate normal with mean vector $\mu$
and covariance matrix $\Sigma$. Then the density $\pi_\theta(\theta_j)$ is

$$\pi_\theta(\theta_j) = (2\pi)^{-p/2}\, |\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2}(\theta_j - \mu)' \Sigma^{-1} (\theta_j - \mu)\right]. \tag{2.6}$$

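Maximizing the posterior $f(\theta_j \mid u_j)$ yields the Bayesian modal estimate of an individual
proficiency parameter vector $\theta_j$, $\forall j = 1, 2, \ldots, N$. That is, one solves the equations

$$\frac{\partial}{\partial \theta_{jk}} \log f(\theta_j \mid u_j) = 0, \quad \forall k = 1, 2, \ldots, p;\ j = 1, 2, \ldots, N. \tag{2.7}$$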
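As a worked illustration of the modal equations, suppose that the item response function of equation (1.4), which is not restated in this section, has the usual compensatory three-parameter form $p_i(\theta_j) = c_i + (1 - c_i)\,\frac{\exp(a_i'\theta_j + d_i)}{1 + \exp(a_i'\theta_j + d_i)}$. Under that assumption, and with the multivariate normal prior (2.6), differentiating the log of the posterior gives the following system; it is offered as a sketch under the stated assumption rather than as the form used in the study.

```latex
% Derivative of the response function with respect to the k-th proficiency:
\frac{\partial p_i(\theta_j)}{\partial \theta_{jk}}
  = a_{ik}\,\frac{\bigl(p_i(\theta_j) - c_i\bigr)\bigl(1 - p_i(\theta_j)\bigr)}{1 - c_i}

% Modal equation for the k-th component of theta_j:
\frac{\partial}{\partial \theta_{jk}} \log f(\theta_j \mid u_j)
  = \sum_{i=1}^{n} a_{ik}\,
    \frac{\bigl(u_{ij} - p_i(\theta_j)\bigr)\bigl(p_i(\theta_j) - c_i\bigr)}
         {(1 - c_i)\, p_i(\theta_j)}
    \;-\; \bigl[\Sigma^{-1}(\theta_j - \mu)\bigr]_k
  = 0
```

The first term is the contribution of the likelihood and the second is the pull of the prior toward $\mu$; the system has no closed-form solution and is normally solved iteratively.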

Nevertheless, in many applications the item parameters are not available, or both
item and proficiency parameters must be estimated from the observed data. The
following section addresses the simultaneous estimation of the item and proficiency
parameters using MCMC methods.

2.3 M-H within Gibbs for Parameter Estimation
for MIRT Models

2.3.1 Complete Conditional Functions for Model Parameters

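Under the assumption of local independence, the overall likelihood function of
responses for $N$ examinees on $n$ items can be written as

$$L(U \mid \Theta, \Sigma, A, d, c) = \prod_{j=1}^{N} L(u_j \mid \theta_j, \Sigma, A, d, c)$$

$$= \prod_{i=1}^{n} L(v_i \mid \Theta, \Sigma, a_i, d_i, c_i)$$

$$= \prod_{j=1}^{N} \prod_{i=1}^{n} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}},$$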

where

$\Theta$ is an $N \times p$ matrix representing all proficiency parameters, i.e.,

$$\Theta \equiv (\theta_1, \theta_2, \ldots, \theta_j, \ldots, \theta_N)' =
\begin{pmatrix}
\theta_{11} & \theta_{12} & \cdots & \theta_{1p} \\
\theta_{21} & \theta_{22} & \cdots & \theta_{2p} \\
\vdots & \vdots & & \vdots \\
\theta_{N1} & \theta_{N2} & \cdots & \theta_{Np}
\end{pmatrix};$$

$\Sigma$ is a $p \times p$ variance-covariance matrix for $\theta_j$ under the assumption that each
examinee comes from a multivariate normal population, i.e., $\theta \sim N_p(0, \Sigma)$; $p$ is the
number of dimensions of proficiency parameters;

$A$ is an $n \times p$ matrix representing all $a$ parameters for $n$ items, i.e.,

$$A = (a_1, a_2, \ldots, a_i, \ldots, a_n)' =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots & & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ip} \\
\vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{np}
\end{pmatrix};$$

$a_i$ is a row vector with $p$ components representing all of the parameters related to
discriminating power for the $i$th item, i.e., $a_i = (a_{i1}, a_{i2}, \ldots, a_{ip})$;
$d$ is a vector of $d$ parameters for $n$ items,
i.e., $d \equiv (d_1, d_2, \ldots, d_i, \ldots, d_n)'$;
$c$ represents a vector of all pseudo-guessing parameters for a test of $n$ items,
i.e., $c \equiv (c_1, c_2, \ldots, c_i, \ldots, c_n)'$;
$U$ is an $N \times n$ matrix of response data for all $N$ examinees on $n$ items;
$v_i$ is a response vector for all $N$ examinees on the $i$th item,
i.e., $v_i \equiv (v_{i1}, v_{i2}, \ldots, v_{ij}, \ldots, v_{iN})'$;
$u_j$ is a row vector for the $j$th examinee's responses on all items,
i.e., $u_j \equiv (u_{1j}, u_{2j}, \ldots, u_{nj})$.

The above equations show that the likelihood function for the $N \times n$ response matrix
can be expressed as either the product of the likelihood functions across all examinees
or the product of the likelihood functions across all items.
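The equivalence of the two factorizations can be checked numerically. The sketch below assumes the compensatory three-parameter logistic form for $p_i(\theta_j)$ (equation (1.4) is not restated in this chapter), and the function names and parameter ranges are hypothetical choices made only for the illustration.

```python
import numpy as np

def m3pl_prob(theta, A, d, c):
    """P(correct) for every examinee-item pair under the assumed compensatory form.
    theta: (N, p) proficiencies; A: (n, p) discriminations; d, c: (n,) vectors."""
    z = theta @ A.T + d                       # (N, n) values of a_i' theta_j + d_i
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def log_likelihood(U, theta, A, d, c):
    """log L(U | Theta, A, d, c) under local independence."""
    P = m3pl_prob(theta, A, d, c)
    return np.sum(U * np.log(P) + (1 - U) * np.log(1.0 - P))

rng = np.random.default_rng(0)
N, n, p = 200, 10, 2                          # examinees, items, dimensions
theta = rng.normal(size=(N, p))
A = rng.uniform(0.5, 1.5, size=(n, p))
d = rng.normal(size=n)
c = rng.uniform(0.0, 0.25, size=n)
U = (rng.uniform(size=(N, n)) < m3pl_prob(theta, A, d, c)).astype(int)

P = m3pl_prob(theta, A, d, c)
terms = U * np.log(P) + (1 - U) * np.log(1.0 - P)
by_examinee = terms.sum(axis=1)               # log L(u_j | .) for each examinee j
by_item = terms.sum(axis=0)                   # log L(v_i | .) for each item i
total = log_likelihood(U, theta, A, d, c)     # equals both by_examinee.sum() and by_item.sum()
```

Because the same matrix of terms is summed by row or by column, a sampler can update one examinee using only that examinee's row and one item using only that item's column.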

Let $\pi(\Theta, \Sigma, A, d, c)$ denote the joint prior distribution of all parameters in the
model, $\Theta$, $\Sigma$, $A$, $d$, and $c$. Assume that the prior distributions of the item and
person parameters are independent. Then the joint prior distribution can be written as

$$\pi(\Theta, \Sigma, A, d, c) = \pi_\theta(\Theta \mid \Sigma)\, \pi_\Sigma(\Sigma)\, \pi_A(A)\, \pi_d(d)\, \pi_c(c)$$

$$= \left(\prod_{j=1}^{N} \pi_\theta(\theta_j \mid \Sigma)\right) \pi_\Sigma(\Sigma) \prod_{i=1}^{n} \pi_a(a_i)\, \pi_d(d_i)\, \pi_c(c_i).$$

The joint posterior function for the model parameters can be expressed as

$$p(\Theta, \Sigma, A, d, c \mid U) \propto L(U \mid \Theta, \Sigma, A, d, c)\, \pi(\Theta, \Sigma, A, d, c).$$

Clearly, we cannot simulate samples from the joint posterior distribution directly,
since the joint posterior is not a known distribution for direct sampling.

In order to sample values for the model parameters from the joint posterior
distribution, the Metropolis-Hastings within Gibbs (Gibbs/M-H) algorithm is implemented;
this algorithm has been found to be effective in experimenting with new models (Patz &
Junker, 1997). The complete conditional distributions of the parameters in MIRT
models are expressed analytically as follows.

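The complete conditional distribution of the proficiency parameters by Bayes
theorem is

$$P_\theta(\theta_j \mid \Theta_{-j}, \Sigma, A, d, c, U) = P(\theta_j \mid u_j, \Sigma, A, d, c)$$

$$\propto L(u_j \mid \theta_j, A, d, c)\, \pi_\theta(\theta_j)$$

$$= \prod_{i=1}^{n} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}} \pi_\theta(\theta_j),$$

where $\Theta_{-j} = (\theta_1, \theta_2, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_N)'$. Note that $\Theta = \theta_j \cup \Theta_{-j}$.
Similarly, we can obtain the complete conditional distributions for the item parameters.
That is,

$$P_a(a_i \mid A_{-i}, \Theta, \Sigma, d, c, U) = \frac{P(v_i \mid a_i, \Theta, \Sigma, d_i, c_i)\, P(a_i, \Theta, \Sigma, d_i, c_i)}{P(\Theta, \Sigma, v_i, d_i, c_i)}$$

$$\propto L(v_i \mid \Theta, a_i, d_i, c_i)\, \pi_a(a_i)$$

$$= \prod_{j=1}^{N} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}} \pi_a(a_i).$$

It can be shown that the complete conditional distributions for $d_i$ and $c_i$ have the
following expressions:

$$P_d(d_i \mid d_{-i}, \Theta, \Sigma, A, c, U) \propto L(v_i \mid \Theta, a_i, d_i, c_i)\, \pi_d(d_i) = \prod_{j=1}^{N} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}} \pi_d(d_i),$$

$$P_c(c_i \mid c_{-i}, \Theta, \Sigma, A, d, U) \propto L(v_i \mid \Theta, a_i, d_i, c_i)\, \pi_c(c_i) = \prod_{j=1}^{N} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}} \pi_c(c_i),$$

where
$\pi_a$, $\pi_d$, and $\pi_c$ are the prior distributions for $a$, $d$, and $c$, respectively;
$A_{-i}$ is an $(n - 1) \times p$ matrix, i.e., $A_{-i} = (a_1, a_2, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n)'$;
$d_{-i}$ is a vector with $(n - 1)$ components, i.e., $d_{-i} = (d_1, d_2, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n)$;
$c_{-i}$ is a vector with $(n - 1)$ components, i.e., $c_{-i} = (c_1, c_2, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n)$; and
$p_i(\theta_j)$ is as previously defined in equation (1.4).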

2.3.2 Modelling the Covariance Structure for
Multidimensional Abilities

For a test measuring several different proficiency dimensions, it is assumed that each
examinee's proficiency follows a $p$-variate normal distribution with mean vector $\mu$
and variance-covariance matrix $\Sigma$. That is, $\theta_j \sim N_p(\mu, \Sigma)$, $\forall j = 1, 2, \ldots, N$.
Since there is not much meaning in comparing abilities across dimensions, the mean
of each proficiency dimension is set to zero. Thus, the mean vector for proficiency is
set to a $p$-component zero vector.

Modelling the covariance matrix is very important but difficult because (1) there
are $p(p+1)/2$ parameters to estimate, where $p$ is the number of dimensions; and (2) the
matrix is required to be non-negative definite. To estimate the variance-covariance
matrix $\Sigma$, this study uses the inverse-Wishart ($W^{-1}$) distribution, a multivariate
generalization of the scaled inverse-$\chi^2$ distribution, as the prior distribution of the
matrix $\Sigma$, i.e.,

$$\Sigma \sim W^{-1}(m, \Psi), \tag{2.8}$$

as suggested by Gelman, Carlin, Stern, and Rubin (2004). This distribution is the
conjugate prior distribution for the covariance matrix in a multivariate normal
distribution; $m$ and $\Psi$ are the degrees of freedom and the scale matrix of the
inverse-Wishart distribution on $\Sigma$. The advantage of using the inverse-Wishart as the
prior distribution for $\Sigma$ is that the posterior distribution of $\Sigma$ also follows
the $W^{-1}$ distribution (e.g., Gelman, Carlin, Stern, and Rubin, 2004):

$$\Sigma \mid \Theta \sim W^{-1}\left(m + n,\ (n - 1)S + \Psi + \frac{n\tau}{n + \tau}\, \bar{\theta}\bar{\theta}'\right), \tag{2.9}$$

where $n$ is the number of examinees, $S$ is the sum of squares and cross product matrix
about the sample mean,

$$(n - 1)S = \sum_{j=1}^{N} \theta_j \theta_j', \tag{2.10}$$

$\tau$ is the number of prior measurements, $\theta_j$ is a $p$-dimensional vector, and $\bar{\theta}$ is a
$p$-dimensional sample mean vector. Since the posterior distribution of $\Sigma$ is a known
distribution, $\Sigma \mid \Theta$ can be sampled directly.

Let $\Sigma_k$ be the $k$th sample covariance matrix drawn from
$W^{-1}\left(m + n,\ (n - 1)S + \Psi + \frac{n\tau}{n + \tau}\bar{\theta}\bar{\theta}'\right)$. Let $s_{ijk}$ be the $(i, j)$th component of $\Sigma_k$.
Then the estimate of the proficiency structure is the average of the drawn covariance
matrix samples:

$$\hat{s}_{ij} = \frac{1}{N} \sum_{k=1}^{N} s_{ijk},$$

where $N$ is the total number of randomly drawn samples; $i, j = 1, 2, \ldots, p$.
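Drawing $\Sigma$ from an inverse-Wishart distribution and averaging the draws can be sketched as follows. This is an illustration under simplifying assumptions, not the study's implementation: the sampler uses the fact that the inverse of a Wishart($df$, $\Psi^{-1}$) matrix is inverse-Wishart($df$, $\Psi$), it requires an integer $df$, and the posterior scale used in the usage lines is the textbook known-zero-mean form $\Psi + \sum_j \theta_j\theta_j'$ rather than the full expression in (2.9). The function name and the prior settings `m` and `psi` are hypothetical.

```python
import numpy as np

def sample_inverse_wishart(df, scale, rng):
    """Draw one matrix from W^{-1}(df, scale) for an integer df >= dimension."""
    p = scale.shape[0]
    chol = np.linalg.cholesky(np.linalg.inv(scale))
    X = rng.standard_normal((df, p)) @ chol.T     # rows are N_p(0, scale^{-1}) draws
    return np.linalg.inv(X.T @ X)                 # inverse of a Wishart(df, scale^{-1}) draw

rng = np.random.default_rng(42)
true_sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
theta = rng.multivariate_normal(np.zeros(2), true_sigma, size=1000)  # N x p proficiencies

m, psi = 4, np.eye(2)                     # hypothetical prior degrees of freedom and scale
post_df = m + theta.shape[0]
post_scale = psi + theta.T @ theta        # simplified posterior scale (mean known to be zero)

draws = np.array([sample_inverse_wishart(post_df, post_scale, rng) for _ in range(200)])
sigma_hat = draws.mean(axis=0)            # element-wise average of the sampled matrices
```

Every draw is symmetric and positive definite by construction, so the averaged estimate automatically satisfies the non-negative definiteness requirement noted above.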

There are alternative approaches to modelling the underlying proficiency structure.
Another method for estimating the proficiency structure is illustrated through a
two-dimensional example. For a two-dimensional IRT model, assume the proficiency
parameters come from a bivariate normal distribution $N_2(0, \Sigma)$, where $\Sigma$ is the
standardized covariance matrix or correlation matrix, i.e.,

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

Assume $\rho$ has a prior density which is the uniform distribution on $(-1, 1)$. Then the
posterior for $\rho$, $f_\rho(\rho \mid \Theta)$, can be expressed as

$$f_\rho(\rho \mid \Theta) \propto p(\Theta \mid \rho)\, 1_{(-1,1)}, \tag{2.12}$$

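where $p(\Theta \mid \rho)$ is the probability function given by

$$p(\Theta \mid \rho) = \prod_{j=1}^{N} \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left[-\frac{\theta_{1j}^2 - 2\rho\,\theta_{1j}\theta_{2j} + \theta_{2j}^2}{2(1 - \rho^2)}\right]$$

$$\propto (1 - \rho^2)^{-N/2} \exp\left[-\frac{1}{2(1 - \rho^2)} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\,\theta_{1j}\theta_{2j} + \theta_{2j}^2\right)\right],$$

and $1_{(-1,1)}$ is a range indicator function.
Therefore, the posterior for $\rho$ is

$$f_\rho(\rho \mid \Theta) \propto (1 - \rho^2)^{-N/2} \exp\left[-\frac{1}{2(1 - \rho^2)} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\,\theta_{1j}\theta_{2j} + \theta_{2j}^2\right)\right] 1_{(-1,1)}. \tag{2.13}$$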

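Let $\rho = \frac{e^{\xi} - 1}{e^{\xi} + 1}$. Then $\xi = \log\frac{1 + \rho}{1 - \rho}$, and

$$f_\xi(\xi \mid \Theta) = f_\rho(\rho \mid \Theta) \left|\frac{d\rho}{d\xi}\right| = f_\rho(\rho \mid \Theta)\, \frac{2e^{\xi}}{(1 + e^{\xi})^2}.$$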

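Suppose $\hat{\xi}$ is the maximum likelihood estimate of $\xi$, and $\hat{\sigma}^2$ represents the estimated
variance of $\hat{\xi}$. $\hat{\xi}$ can be obtained by letting

$$\hat{\rho} = \arg\max_{\rho}\, p(\Theta \mid \rho) = \arg\max_{\rho}\, \log p(\Theta \mid \rho),$$

where

$$\log p(\Theta \mid \rho) = c - \frac{N}{2}\log(1 - \rho^2) - \frac{1}{2(1 - \rho^2)} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\,\theta_{1j}\theta_{2j} + \theta_{2j}^2\right),$$

and $c$ is a constant. Solve the likelihood equation

$$\frac{\partial \log p(\Theta \mid \rho)}{\partial \rho} = 0.$$

The equation above implies

$$\frac{N\rho}{1 - \rho^2} - \frac{\rho}{(1 - \rho^2)^2} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\,\theta_{1j}\theta_{2j} + \theta_{2j}^2\right) + \frac{1}{1 - \rho^2} \sum_{j=1}^{N} \theta_{1j}\theta_{2j} = 0,$$

i.e., $\hat{\rho}$ satisfies

$$N\rho - \frac{\rho}{1 - \rho^2} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\,\theta_{1j}\theta_{2j} + \theta_{2j}^2\right) + \sum_{j=1}^{N} \theta_{1j}\theta_{2j} = 0.$$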

, . . - ~ 1 ‘
l\ote here, the pI'IOI' 7rp(p) = U(-1,1). So pmle = pmode' Thus 5 = log 113% The

Fisher information

Me) = wag—Iowa I p»

N
- —E5p—2'108P(9IPI- 1_p2-

 

Then the asymptotic distribution of pmle is approximated by N (p, ) as N —t

__1_
N1(/))

00.

30

Therefore, by the delta method, 5 has the asymptotic distribution as

 

 

. 1 1+p 2 .
—+ N h ,—h’ , where h p =10 = , h’ = ——,1.e.,
t ((12) 1W) 0») (> g1_p 5 (p) H),
g N (5 e—£-+—1) ”2 - e£ + 1 Hence aMetro olis-Hastin al orithm can be written

to generate 5 from f5(£ I 9) using N (E , 6%) as the proposal density.

Since the target function f5(€ I 9) —t N (E , 62). The sampling density is N (5 , 62).
The transition function can be expressed as r£(.) = N (E , 62). The M-H algorithm is
as follows: Given 5‘, simulate y from N (f , 62), then 5‘“ = y with 0(5‘, y) and 6‘ with

1 — a(§‘, y), where

A

My | emf—E)
a(£,y)=min 6.. ,1

no: I eat—3)

y—
at

 

Repeat this step.
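The procedure for the correlation can be sketched as an independence M-H sampler on the transformed scale. The code is a hedged illustration rather than the study's program: it assumes zero means and unit variances for both dimensions, finds $\hat{\rho}$ by a simple grid search instead of solving the likelihood equation, and sets the proposal variance from the standard Fisher information of a bivariate normal correlation, $(1 + \rho^2)/(1 - \rho^2)^2$, through the delta method. All function names are hypothetical.

```python
import numpy as np

def log_post_rho(rho, s11, s12, s22, N):
    """log f_rho(rho | Theta) up to a constant, following (2.13)."""
    q = s11 - 2.0 * rho * s12 + s22
    return -0.5 * N * np.log(1.0 - rho ** 2) - q / (2.0 * (1.0 - rho ** 2))

def log_post_xi(xi, s11, s12, s22, N):
    """log f_xi(xi | Theta): posterior of rho plus the log Jacobian of rho(xi)."""
    rho = np.tanh(xi / 2.0)                       # equals (e^xi - 1) / (e^xi + 1)
    log_jac = np.log(2.0) + xi - 2.0 * np.log1p(np.exp(xi))
    return log_post_rho(rho, s11, s12, s22, N) + log_jac

def sample_rho(theta, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    N = theta.shape[0]
    s11, s22 = np.sum(theta[:, 0] ** 2), np.sum(theta[:, 1] ** 2)
    s12 = np.sum(theta[:, 0] * theta[:, 1])
    grid = np.linspace(-0.999, 0.999, 3999)       # grid search for rho_hat
    rho_hat = grid[np.argmax(log_post_rho(grid, s11, s12, s22, N))]
    xi_hat = np.log((1.0 + rho_hat) / (1.0 - rho_hat))
    sd_hat = np.sqrt(4.0 / (N * (1.0 + rho_hat ** 2)))   # delta-method standard deviation

    def log_r(x):                                 # log proposal density, constant dropped
        return -0.5 * ((x - xi_hat) / sd_hat) ** 2

    xi = xi_hat
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = rng.normal(xi_hat, sd_hat)            # candidate does not depend on xi^t
        log_alpha = (log_post_xi(y, s11, s12, s22, N) + log_r(xi)
                     - log_post_xi(xi, s11, s12, s22, N) - log_r(y))
        if np.log(rng.uniform()) < log_alpha:
            xi = y
        chain[t] = np.tanh(xi / 2.0)              # store the draw on the rho scale
    return chain

rng = np.random.default_rng(3)
theta = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)
rho_chain = sample_rho(theta)
rho_estimate = rho_chain[500:].mean()
```

Sampling on the $\xi$ scale keeps every draw of $\rho$ inside $(-1, 1)$ without any explicit boundary check.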

2.3.3 Random Walk Metropolis Algorithm within Gibbs

Since the complete conditional distributions given in Section 2.3.1 are not convenient
for direct sampling, a Metropolis step is needed in the sampling process, in which a
proposal distribution has to be specified for each parameter or block. Patz and Junker
(1997) point out that there is much freedom in choosing
the proposal distributions. For example, to sample a proposal value for $\theta_j$ at step
$t + 1$, a multivariate normal distribution can be chosen as a convenient proposal
distribution.

The random walk algorithm chooses the candidate state via a random walk
mechanism. The candidate state is not chosen independently of the current state, and,
unlike in the Gibbs sampler, the candidate state is not always accepted. Specifically,
let $\mathbb{R}^p$ be the $p$-dimensional Euclidean space, and let $r$ be a density on $\mathbb{R}^p$, so that
the transition function is defined as $R(y, B) = \int_B r(z - y)\,dz$. Define the acceptance
probability $\alpha$ by

$$\alpha(y, z) = \min\left\{\frac{g(z)\, r(y - z)}{g(y)\, r(z - y)},\ 1\right\}, \tag{2.14}$$

where $g(\cdot)$ is the density of the target distribution (e.g., the above posterior
for each examinee proficiency parameter, $P_\theta(\theta_j \mid \Theta_{-j}, \Sigma, A, d, c, U)$, or the complete
conditional distribution for each item parameter, $P_a(a_i \mid A_{-i}, \Theta, \Sigma, d, c, U)$,
$P_d(d_i \mid d_{-i}, \Theta, \Sigma, A, c, U)$,
and $P_c(c_i \mid c_{-i}, \Theta, \Sigma, A, d, U)$). If the denominator is zero, set $\alpha = 1$.
Suppose $Y_t = y$. Generate a “candidate” observation $z$ from the distribution $R(y, \cdot)$;
accept this observation (set $Y_{t+1} = z$) with probability $\alpha(y, z)$. Otherwise, reject
this observation (set $Y_{t+1} = Y_t = y$). Another way to describe the procedure is as
follows. Start at $y$. Generate a candidate step $w$ from the distribution $R$ defined by
$R(B) = \int_B r(x)\,dx$; with probability $\alpha(y, y + w)$ move forward to $y + w$; otherwise stay
at $y$.

In the MIRT context, for instance, denote $r_\theta(\theta_j^t, \theta_j^{t+1})$ as the transition function
for the constructed Markov chain for sampling the $j$th examinee's abilities. For the
random walk Metropolis algorithm, the transition kernel can have the form

$$r_\theta(\theta_j^t, \theta_j^{t+1}) = \exp\left\{-\tfrac{1}{2}(\theta_j^t - \theta_j^{t+1})'\, \Sigma^{-1}\, (\theta_j^t - \theta_j^{t+1})\right\}. \tag{2.15}$$

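Then the acceptance probability for the new candidate $\theta_j^*$, $j = 1, 2, \ldots, N$, from the
transition kernel $r_\theta(\theta_j^t, \theta_j^{t+1})$ is

$$\alpha(\theta_j^t, \theta_j^*) = \min\left\{\frac{g_\theta(\theta_j^*)\, r_\theta(\theta_j^*, \theta_j^t)}{g_\theta(\theta_j^t)\, r_\theta(\theta_j^t, \theta_j^*)},\ 1\right\}. \tag{2.16}$$

Note that the target distribution $g_\theta(\cdot)$ is the complete conditional distribution defined
previously, i.e.,

$$g_\theta(\theta_j) = P_\theta(\theta_j \mid \Theta_{-j}, \Sigma, A, d, c, U) \propto L(u_j \mid \theta_j, A, d, c)\, \pi_\theta(\theta_j \mid \Sigma)$$

$$= \prod_{i=1}^{n} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}} \pi_\theta(\theta_j \mid \Sigma).$$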

Similarly, the acceptance probability for a new candidate of item parameters $a_i^*$ for
item $i$, $i = 1, 2, \ldots, n$, from the transition kernel $r_a(a_i^t, a_i^{t+1})$ is

$$\alpha(a_i^t, a_i^*) = \min\left\{\frac{g_a(a_i^*)\, r_a(a_i^*, a_i^t)}{g_a(a_i^t)\, r_a(a_i^t, a_i^*)},\ 1\right\}, \tag{2.17}$$

where $g_a(\cdot)$ is the complete conditional distribution for $a_i$, i.e.,
$g_a(a_i^*) \propto L(v_i \mid \Theta, a_i^*, d_i, c_i)\, \pi_a(a_i^*)$. In the same way, we can find $\alpha(d_i^t, d_i^*)$ and $\alpha(c_i^t, c_i^*)$.
The following are the proposal densities corresponding to person and item
parameters, which are chosen for the purpose of convenience and efficiency.

The proposal density for $\theta^{t+1}$ is $N_p(\theta^t, \Sigma_\theta^t)$.
The proposal density for each component $a_{ik}$ of $a_i^{t+1}$ is $U(a_{ik}^t - h,\ a_{ik}^t + h)$.
The proposal density for $d_i^{t+1}$ is $N(d_i^t, \sigma^2)$.
The proposal density for $c_i^{t+1}$ is $U(c_i^t - \delta,\ c_i^t + \delta)$.

Here $h$, $\delta$, and $\sigma^2$ are constants. In this study, $h = 0.3$, $\delta = .03$, and $\sigma^2 = 1$.

Once the complete conditional distribution for each parameter in the
multidimensional model has been derived, the corresponding acceptance probabilities
can be calculated. Once the proposal densities are also specified, parameter samples
can be drawn.

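The steps for drawing parameter samples for the MIRT model are:

1. Draw $\theta_j^* \sim N_p(\theta_j^t, \Sigma_\theta^t)$, $\forall j = 1, 2, \ldots, N$. $\theta_j^{t+1} = \theta_j^*$ has acceptance
probability $\alpha(\theta_j^t, \theta_j^*)$.

2. Draw $\Sigma \mid \Theta \sim W^{-1}\left(m + n,\ (n - 1)S + \Psi + \frac{n\tau}{n + \tau}\bar{\theta}\bar{\theta}'\right)$.

3. Draw each $a_{ik}^* \sim U(a_{ik}^t - h,\ a_{ik}^t + h)$; $a_{ik}^{t+1} = a_{ik}^*$ with probability
$\alpha(a_{ik}^t, a_{ik}^*)$, $\forall k = 1, 2, \ldots, p$ and $i = 1, 2, \ldots, n$, where $p$ is the total number of
dimensions.

4. Draw $d_i^* \sim N(d_i^t, \sigma^2)$ with acceptance probability $\alpha(d_i^t, d_i^*)$, $\forall i = 1, 2, \ldots, n$.

5. Draw $c_i^* \sim U(c_i^t - \delta,\ c_i^t + \delta)$ with acceptance probability $\alpha(c_i^t, c_i^*)$, $\forall i =
1, 2, \ldots, n$. Here $h$ and $\delta$ are known constants.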
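The sampling cycle above can be sketched compactly. The code below is a simplified illustration, not the C++ program used in the study: it assumes the compensatory three-parameter logistic form for $p_i(\theta_j)$, holds $\Sigma$ at the identity matrix (so step 2 is skipped), uses a scaled-identity random-walk proposal for $\theta_j$, and takes uniform priors on $(0, a_{max})$ and $(0, c_{max})$ with hypothetical bounds. Only $h = 0.3$, $\delta = .03$, and $\sigma^2 = 1$ come from the text; every function and argument name is hypothetical.

```python
import numpy as np

def loglik_terms(U, theta, A, d, c):
    """(N, n) matrix of log-likelihood terms under the assumed compensatory form."""
    P = c + (1 - c) / (1 + np.exp(-(theta @ A.T + d)))
    P = np.clip(P, 1e-12, 1 - 1e-12)                 # guard against log(0)
    return U * np.log(P) + (1 - U) * np.log(1 - P)

def mh_within_gibbs(U, p, n_iter=200, h=0.3, delta=0.03, sigma=1.0,
                    theta_step=0.5, a_max=3.0, c_max=0.5, seed=0):
    rng = np.random.default_rng(seed)
    N, n = U.shape
    theta = rng.normal(size=(N, p))                  # random starting values
    A = rng.uniform(0.5, 1.5, size=(n, p))
    d = rng.normal(size=n)
    c = rng.uniform(0.05, 0.2, size=n)
    chains = {"A": np.empty((n_iter, n, p)), "d": np.empty((n_iter, n)),
              "c": np.empty((n_iter, n))}
    for t in range(n_iter):
        # Step 1: random-walk update of every theta_j, prior N_p(0, I) in this sketch.
        cand = theta + theta_step * rng.normal(size=(N, p))
        new = loglik_terms(U, cand, A, d, c).sum(axis=1) - 0.5 * (cand ** 2).sum(axis=1)
        old = loglik_terms(U, theta, A, d, c).sum(axis=1) - 0.5 * (theta ** 2).sum(axis=1)
        accept = np.log(rng.uniform(size=N)) < new - old
        theta[accept] = cand[accept]
        # Step 2 (the draw of Sigma) is omitted: Sigma is held at I here.
        current = loglik_terms(U, theta, A, d, c).sum(axis=0)   # per-item log-likelihood
        # Step 3: each a_ik in turn, uniform window of half-width h, prior U(0, a_max).
        for k in range(p):
            cand = A.copy()
            cand[:, k] += rng.uniform(-h, h, size=n)
            inside = (cand[:, k] > 0) & (cand[:, k] < a_max)
            cand[~inside, k] = A[~inside, k]         # candidates outside the prior support
            proposed = loglik_terms(U, theta, cand, d, c).sum(axis=0)
            accept = inside & (np.log(rng.uniform(size=n)) < proposed - current)
            A[accept, k] = cand[accept, k]
            current = np.where(accept, proposed, current)
        # Step 4: d_i with a N(d_i, sigma^2) proposal and a N(0, 1) prior.
        cand = d + sigma * rng.normal(size=n)
        proposed = loglik_terms(U, theta, A, cand, c).sum(axis=0)
        accept = np.log(rng.uniform(size=n)) < (proposed - 0.5 * cand ** 2) - (current - 0.5 * d ** 2)
        d[accept] = cand[accept]
        current = np.where(accept, proposed, current)
        # Step 5: c_i with a uniform window of half-width delta, prior U(0, c_max).
        cand = c + rng.uniform(-delta, delta, size=n)
        inside = (cand > 0) & (cand < c_max)
        cand[~inside] = c[~inside]
        proposed = loglik_terms(U, theta, A, d, cand).sum(axis=0)
        accept = inside & (np.log(rng.uniform(size=n)) < proposed - current)
        c[accept] = cand[accept]
        chains["A"][t], chains["d"][t], chains["c"][t] = A, d, c
    return chains

rng = np.random.default_rng(1)
N, n, p = 300, 8, 2
true_theta = rng.normal(size=(N, p))
true_A = rng.uniform(0.5, 1.5, size=(n, p))
true_d = rng.normal(size=n)
true_c = np.full(n, 0.15)
P = true_c + (1 - true_c) / (1 + np.exp(-(true_theta @ true_A.T + true_d)))
U = (rng.uniform(size=(N, n)) < P).astype(int)
chains = mh_within_gibbs(U, p)
```

All proposals in this sketch are symmetric, so the proposal densities cancel and each acceptance ratio reduces to the ratio of complete conditionals. The run is far too short for inference and is meant only to show the order of the updates.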

2.4 Unbiased and Consistent Estimators of
Parameters

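Let $\hat{\theta}_{jk}$, $\hat{a}_{ik}$, $\hat{d}_i$, $\hat{c}_i$ be the model estimators, $\forall j = 1, 2, \ldots, N$, $i = 1, 2, \ldots, n$, $k =
1, 2, \ldots, p$. For example, if the samples from the complete conditional distributions
of $\theta_j$, $a_i$, $d_i$, $c_i$ are drawn from the constructed Markov chain, then

$$\hat{\theta}_{jk} = \frac{1}{M}\sum_{m=1}^{M} \theta_{jk}^{m}, \quad \hat{a}_{ik} = \frac{1}{M}\sum_{m=1}^{M} a_{ik}^{m}, \quad \hat{d}_i = \frac{1}{M}\sum_{m=1}^{M} d_i^{m}, \quad \text{and} \quad \hat{c}_i = \frac{1}{M}\sum_{m=1}^{M} c_i^{m},$$

where $M$ is the sample size used for the estimates after a certain length of the
burn-in period.

Obviously, $E(\hat{\theta}_j) = \theta_j$, $E(\hat{a}_i) = a_i$, $E(\hat{d}_i) = d_i$, $E(\hat{c}_i) = c_i$, since $E\theta_{jk}^{m} = \theta_{jk}$, $Ea_{ik}^{m} =
a_{ik}$, $Ed_i^{m} = d_i$, and $Ec_i^{m} = c_i$. That is, the estimators are unbiased.

The variance of the estimates is $Var(\hat{\theta}_{jk}) = \frac{Var(\theta_{jk})}{M} \to 0$, $\forall k = 1, 2, \ldots, p$, as $M \to
\infty$. Likewise, $Var(\hat{a}_{ik}) = \frac{Var(a_{ik})}{M} \to 0$, $Var(\hat{d}_i) = \frac{Var(d_i)}{M} \to 0$, and
$Var(\hat{c}_i) = \frac{Var(c_i)}{M} \to 0$ as $M \to \infty$. By the law of large numbers, $\hat{\theta}_j \to \theta_j$, $\hat{a}_i \to a_i$,
$\hat{d}_i \to d_i$, and $\hat{c}_i \to c_i$ in probability. Therefore, the estimates are consistent.

By the central limit theorem,

$$\frac{\hat{\theta}_{jk} - \theta_{jk}}{\sqrt{Var(\hat{\theta}_{jk})}} \Rightarrow N(0, 1), \tag{2.18}$$

as $M \to \infty$, for $j = 1, 2, \ldots, N$. This can give a confidence interval for the estimate of
proficiency parameters. Similarly, the results also hold for item parameter estimates.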
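The estimator and the interval based on (2.18) can be computed from a stored chain as sketched below. The function name and the synthetic draws are hypothetical stand-ins for real sampler output. The standard error here is the simple $\sqrt{Var/M}$ form used in the text, which treats the retained draws as independent; successive MCMC draws are usually correlated, in which case this form understates the uncertainty.

```python
import numpy as np

def summarize_chain(chain, burn_in):
    """Posterior-mean estimate, naive standard error, and 95% interval
    from the M draws retained after the burn-in period."""
    kept = np.asarray(chain, dtype=float)[burn_in:]
    M = kept.shape[0]
    est = kept.mean()                          # (1/M) * sum of the retained draws
    se = kept.std(ddof=1) / np.sqrt(M)         # sqrt(Var / M); ignores autocorrelation
    return est, se, (est - 1.96 * se, est + 1.96 * se)

rng = np.random.default_rng(5)
fake_chain = rng.normal(0.8, 0.1, size=6000)   # stand-in for the sampled values of one parameter
est, se, interval = summarize_chain(fake_chain, burn_in=1000)
```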


Chapter 3

Simulation Studies and Results

The derivations for the application of MCMC methods to the 3PL linear logistic
multidimensional IRT model are presented in Chapter 2. This approach is implemented
in a C++ program, which provides an efficient computational tool for parameter
estimation of MIRT models. In this chapter, the parameter estimates for MIRT models
obtained from the application of the program are reported. The accuracy
and stability of the MCMC estimates are examined by simulating various testing
situations for the one-, three-, and five-dimensional MIRT models, respectively.
Various simulation studies are presented in this chapter in an attempt to examine
the effects of four potential factors on the recovery of item and underlying proficiency
parameters. These factors are: the number of proficiency dimensions, the proficiency
structure (i.e., the covariance matrix for the proficiency distribution), the test length
(i.e., the number of test items), and the sample size (i.e., the number of examinees).
Using simulated data to investigate parameter estimation has at least two advantages:
(1) since the true person and item parameters are available, they can be used to
assess the accuracy of parameter estimates, with smaller root mean square errors
(RMSE) between the true parameters and the parameter estimates indicating more
accurate estimation; (2) the number of dimensions is known from the simulated data,
which is similar to confirmatory factor analysis, where the factor structure is known
before the data are analyzed. Knowing the number of dimensions, researchers do not
have to carry out additional analyses to determine how many dimensions each item
measures and what these dimensions are, a strategy that helps separate dimensionality
analysis from the issue of parameter estimation. It is necessary to point out that
determining the statistical dimension based on the observed data is itself a complex
and active research area. For example, researchers suggest detecting the underlying
dimension structure by parametric approaches (e.g., Reckase, Ackerman, & Carlson,
1988; Miller & Hirsh, 1992) and nonparametric approaches (e.g., Roussos, 1995). The
topic of detecting dimension structure from the observed data is beyond the scope of
this research. Therefore, controlling the dimensional structure in the simulated data
instead of diagnosing it will facilitate an effective examination of the MCMC
estimation approach.

In addition, the performance of parameter estimation by the MCMC approach is
examined in this research using only simulation experiments because: (1) real data
analysis brings in the model-data fit issue, which is often confounded with the issue
of parameter estimation and is not the focus of this study; (2) with real data it is
more difficult to evaluate the accuracy of estimation due to the lack of true parameter
information.

3.1 Prior Distributions for Model Parameters

The MCMC approach to parameter estimation is Bayesian in perspective.
The item and proficiency parameters are not treated as fixed values but as random
variables with probability distributions. The role of the prior distributions for both item
and proficiency parameters is to provide additional information on the parameters
before data collection and parameter estimation. In this study, the prior distribution
for the proficiency vector is $N_p(0, \Sigma_\theta)$. That is, the group of examinees is assumed to
come from the multivariate normal population $N_p(0, \Sigma_\theta)$, where $p$ is the number of
dimensions. The prior distribution for each component of each a parameter is a
uniform distribution, the prior distribution for each d parameter is the standard normal
distribution, and the prior for each c parameter is also a uniform distribution.

3.2 Diagnosing the Convergence of Markov Chains

There are many approaches to diagnosing the convergence of a Markov chain.
The purpose of this analysis is to ensure that the Markov chains constructed through the
Metropolis-Hastings within Gibbs algorithm for the posterior distributions of both item
and proficiency parameters have reached their target stationary distributions
before samples are taken for Monte Carlo estimation. Reliable estimation requires
that the chain for each parameter converge to its stationary distribution.

Gelfand and Smith (1990) suggested several approaches to checking convergence
based on graphical techniques. For m parallel chains, plot a histogram of the m values
at the kth iteration and, after skipping a certain number of iterations (say p iterations),
plot a histogram of the m values at the (k + p)th iteration. Convergence is assumed if
the histograms have very similar patterns.

Gelman, Carlin, Stern, and Rubin (p.294, 2004) recommended an approach to
inference and to assessing convergence based on several independent parallel chains.
First, simulate several independent sequences with over-dispersed initial values. If
the multiple chains with different starting values are well mixed after a certain number
of iterations, then one can conclude that the chains have converged.
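One common numerical summary of whether parallel chains are well mixed is the potential scale reduction factor, which compares between-chain and within-chain variability. The text does not say that this statistic was computed in the study, so the sketch below is an illustrative assumption; the function name is hypothetical.

```python
import numpy as np

def potential_scale_reduction(chains):
    """chains: (m, n) array holding m parallel chains of n retained draws
    for a single parameter. Values near 1 suggest the chains have mixed."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # average within-chain variance
    var_plus = (n - 1) / n * W + B / n           # pooled estimate of the posterior variance
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(7)
mixed = rng.normal(size=(3, 2000))                      # three chains from the same distribution
stuck = mixed + np.array([[0.0], [2.0], [4.0]])         # three chains centred in different places
r_mixed = potential_scale_reduction(mixed)
r_stuck = potential_scale_reduction(stuck)
```

A value well above 1, as for the shifted chains here, indicates that the chains still disagree and that more iterations or better starting values are needed.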

3.3 Initial Values and Iterations

The choice of initial values should not affect the item and proficiency estimates,
because the final estimates rely on the samples from the posterior distributions of the
parameters once the chains have reached their stationary state. The initial draws are
usually discarded before computing Monte Carlo estimates for the parameters. However,
the initial values may affect the convergence speed of each chain.
Thus, carefully selected starting values will accelerate convergence and
yield an effective Markov chain. For example, Beguin and Glas (1998) suggested
using a = 1, d = 0, and the true c parameter or its estimates from BILOG as starting
values and concluded that 1000 burn-in iterations were sufficient.

In this study, random initial values are used each time for the estimation. To
ensure the convergence of each chain, a large number of iterations (for example,
10,000) are run. Moreover, multiple chains (e.g., 3 chains) are constructed for
each data set to assess the convergence of each chain and to evaluate the accuracy and
stability of the estimates by comparing the estimates from chains with different
random initial values. The starting values used for estimating proficiency
parameters in this study are randomly drawn from $N_p(0, I)$, and the initial values for
item parameters are randomly sampled from uniform distributions.

Since three independent replications of Markov chains are constructed with
different initial values for each data set, the final estimates for the parameters are the
mean of the estimates from the three independent chains. For each independent chain,
the parameter estimate $\hat{H}$ is the average of the sample from the posterior distribution,
i.e.,

$$\hat{H} = \frac{1}{n}\sum_{i=1}^{n} H_i, \tag{3.1}$$

where $H_i$ is the $i$th sampled value and $n$ is the number of samples drawn from the
stationary Markov chain for the posterior distribution. Thus the final estimate of a
parameter for each data set, $\bar{H}$, is the average of the estimates from the multiple
independent chains,

$$\bar{H} = \frac{1}{m}\sum_{i=1}^{m} \hat{H}_i, \tag{3.2}$$

where $\hat{H}_i$ is the estimate from the $i$th chain and $m$ is the number of replications, i.e.,
$m = 3$ in this study.
All of the data sets are randomly sampled from the linear logistic multidimensional
IRT model under various conditions (e.g., test length, the sample size of examinees, the
number of dimensions, and different proficiency covariance matrices). To minimize
the sampling effects on parameter estimation, three replications are simulated for each
condition. The four factors considered in the simulation studies result in a total of 60
dichotomous response data sets. Therefore, the precision of parameter estimates can
be compared across the sample sizes, the test lengths, and the proficiency structures.

3.4 Estimating the Unidimensional 3PL Model

The form of the unidimensional 3PL model is given in equation (1.1) in the first
section of Chapter 1. This section discusses parameter estimation using dichotomous
response data simulated from the unidimensional 3PL model. The main difference
between estimating the unidimensional model and estimating the multidimensional
model is that no underlying proficiency dimension structure needs to be estimated.
To resolve the model indeterminacy and establish a fixed metric for both item and
proficiency parameter estimates, the sampled proficiency parameters are standardized
at each step of the sampler. Therefore, the final proficiency parameter estimates are
placed on the (0, 1) metric, that is, mean 0 and standard deviation 1.
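
The standardization step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the draw is simulated, and the population standard deviation (ddof = 0) is assumed because the text does not say which form was used.

```python
import numpy as np

def standardize(theta):
    """Rescale one sweep of sampled proficiencies to mean 0 and SD 1."""
    theta = np.asarray(theta, dtype=float)
    return (theta - theta.mean()) / theta.std()  # population SD assumed

rng = np.random.default_rng(0)
# Hypothetical proficiency draw for 2000 examinees on an arbitrary metric.
theta_draw = rng.normal(loc=0.3, scale=1.7, size=2000)
theta_std = standardize(theta_draw)
```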

For the simulation study in this section, the underlying proficiency parameters
and the item difficulty parameters are generated from the standard normal distribution
N(0, 1); the discriminating power and asymptote item parameters are generated from
uniform distributions. Two tests, with 30 and 45 items, were simulated, and each test
was administered to samples of 2000 and 5000 examinees. The combination of test
length, sample size, and replications yields 12 (i.e., 2 x 2 x 3) data sets. Table 3.1
and Table 3.2 give the true item parameters for the two tests.
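
A data set of this kind could be generated along the following lines. This is a sketch under assumptions, not the generating program of this study: it assumes the standard logistic 3PL form without a scaling constant (equation (1.1) may differ), and the uniform ranges for a and c are guesses based on the values in Tables 3.1 and 3.2.

```python
import numpy as np

def simulate_3pl(n_items, n_examinees, rng):
    """Simulate dichotomous responses from a unidimensional 3PL model."""
    theta = rng.standard_normal(n_examinees)   # proficiency ~ N(0, 1)
    b = rng.standard_normal(n_items)           # difficulty ~ N(0, 1)
    a = rng.uniform(0.5, 2.5, size=n_items)    # assumed range for discrimination
    c = rng.uniform(0.0, 0.25, size=n_items)   # assumed range for asymptote
    logit = a * (theta[:, None] - b)           # examinees x items
    p = c + (1.0 - c) / (1.0 + np.exp(-logit)) # probability of a correct response
    responses = (rng.uniform(size=p.shape) < p).astype(int)
    return responses, theta, a, b, c

rng = np.random.default_rng(2005)
responses, theta, a, b, c = simulate_3pl(n_items=30, n_examinees=2000, rng=rng)
```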

It can be seen from Table 3.1 and Table 3.2 that both tests contain a wide variety
of item parameter values. For example, in the 30-item test, the discriminating
power parameter a ranges from .54 to 2.43, the difficulty parameter from -1.64 to
1.60, and the asymptote parameter from 0 to .25. In the 45-item test, the
discriminating parameters range from .50 to 2.45, the difficulty parameters from
-1.78 to 2.85, and the asymptote parameters from 0 to .25.

3.4.1 Assessing Convergence

Table 3.3 shows the three independent estimates, one from each chain started at
different initial values, for the data set generated by administering the 30-item test
to 2000 examinees. The final item parameter estimates are the mean of these three
estimates. Clearly, the estimates from the three independent chains are very stable
and consistent. For example, item 28 has the same a and c estimates in all three
chains, and its b estimates differ by only .01. The largest change in the a estimates
over the three chains is on item 1: 1.82 for the first chain, 1.67 for the second
chain (a difference of .15), and 1.72 for the third chain. The small variation in
the estimates of each item parameter across the three independent chains indicates
that the MCMC estimates are stable. More importantly, one can assess the convergence
of the posterior distributions by the stability of the estimates over multiple chains,

Table 3.1: True Item Parameters for 30-Item Test (Dim = 1)

Item  Discriminating (a)  Difficulty (b)  Asymptote (c)
1     1.67                -1.17           0.14
2     0.89                 0.28           0.09
3     0.55                -1.64           0.14
4     1.85                -0.72           0.16
5     2.07                 0.50           0.08
6     1.40                 0.46           0.18
7     2.43                 1.37           0.21
8     0.85                -0.04           0.14
9     1.39                 0.91           0.09
10    1.25                 0.14           0.09
11    1.52                -0.19           0.21
12    1.34                -0.80           0.25
13    1.64                -0.44           0.05
14    0.99                 0.57           0.12
15    1.48                -1.11           0.16
16    0.54                 0.48           0.09
17    1.78                 1.60           0.03
18    1.10                 0.21           0.14
19    2.09                -0.31           0.01
20    2.26                 1.10           0.04
21    1.53                 0.65           0.24
22    0.79                -0.41           0.11
23    2.40                 0.57           0.11
24    0.73                -1.21           0.25
25    0.56                 0.62           0.02
26    0.56                -1.43           0.19
27    1.01                 1.51           0.04
28    2.07                 1.31           0.18
29    2.05                -0.25           0.17
30    1.48                -1.62           0.10

Table 3.2: True Item Parameters for 45-Item Test (Dim = 1)

Item  a         b  c      Item  a         b  c
1     1.73   0.03  .09    24    1.98  -0.88  .09
2     2.45   0.00  .16    25    0.83   0.20  .02
3     2.35   0.45  .02    26    0.84   0.98  .07
4     1.04   0.15  .22    27    1.95   0.90  .01
5     2.37   0.27  .23    28    1.99  -0.51  .19
6     0.95  -1.78  .05    29    0.92  -1.80  .20
7     2.06   1.08  .08    30    1.26   2.85  .00
8     1.43  -0.59  .11    31    1.77  -1.19  .02
9     1.63  -0.67  .11    32    0.50  -0.44  .17
10    2.00   0.54  .18    33    1.78  -0.62  .08
11    2.13   0.33  .25    34    0.61   0.64  .00
12    1.27  -0.56  .17    35    2.21  -0.57  .19
13    1.45  -0.64  .22    36    2.31   0.54  .09
14    2.04  -1.31  .05    37    2.30   0.27  .07
15    0.53   1.16  .19    38    1.51   1.48  .07
16    1.51  -1.53  .13    39    2.26   0.45  .10
17    2.29   0.70  .10    40    0.85  -1.05  .14
18    0.62  -0.18  .07    41    1.33  -0.33  .15
19    2.10  -1.08  .11    42    1.02  -1.24  .02
20    1.69   0.64  .23    43    0.73   1.74  .11
21    1.55  -1.32  .08    44    1.21   1.53  .07
22    1.34   0.03  .18    45    1.99   1.31  .19
23    1.50   1.21  .03

Figure 3.1: Sample ACF for series of a6, Dim = 1

[Figure: sample autocorrelation function (ACF) plotted against lag]

as suggested by Gelman, Carlin, Stern, and Rubin (2004, p. 294). Table 3.3 thus
provides numerical evidence that the chains have converged to their stationary
distribution. Similar results are obtained for the sample size of 5000 and for the
45-item test but are omitted here.
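
The cross-chain comparison can be made concrete with the values for items 1 and 28 in Table 3.3. The sketch below computes the largest difference between chains for each parameter; the range is used here only as a simple illustrative summary and is not a statistic reported in this study.

```python
import numpy as np

# Estimates from Table 3.3: rows are chains 1-3, columns are (a, b, c).
item1 = np.array([[1.82, -1.00, 0.23],
                  [1.67, -1.07, 0.18],
                  [1.72, -1.05, 0.20]])
item28 = np.array([[2.22, 1.36, 0.19],
                   [2.22, 1.37, 0.19],
                   [2.22, 1.37, 0.19]])

def chain_spread(estimates):
    """Largest minus smallest estimate across chains, per parameter."""
    estimates = np.asarray(estimates, dtype=float)
    return estimates.max(axis=0) - estimates.min(axis=0)

spread1 = chain_spread(item1)    # spread of a, b, c for item 1
spread28 = chain_spread(item28)  # spread of a, b, c for item 28
```

For item 1 the spread in a is .15, the largest in the table, and for item 28 the only nonzero spread is .01 in b, matching the description above.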

Figure 3.1 shows the estimated autocorrelation function (ACF) of the series of
discriminating power draws for the 5th item after discarding the burn-in draws. The
autocorrelation becomes negligible at lags greater than 28. Figure 3.2
illustrates the behavior of the Markov chains constructed by the M-H within Gibbs
algorithm for item 5 in the 30-item test. The upper panel shows the first 2000 draws
for the posterior distribution of the a parameter, the middle panel shows the 2000 draws for the

Table 3.3: Estimates from three chains for 30-Item Test (Dim = 1, N = 2000)

Item  a1    a2    a3       b1     b2     b3  c1   c2   c3
1     1.82  1.67  1.72  -1.00  -1.07  -1.05  .23  .18  .20
2     0.87  0.88  0.88   0.23   0.27   0.27  .07  .08  .08
3     0.46  0.45  0.43  -1.78  -1.80  -1.97  .20  .20  .17
4     1.66  1.58  1.60  -0.71  -0.75  -0.73  .16  .13  .14
5     2.09  2.12  2.12   0.56   0.58   0.57  .08  .09  .09
6     1.37  1.37  1.38   0.56   0.59   0.58  .20  .20  .20
7     2.09  2.11  2.10   1.43   1.44   1.44  .21  .21  .21
8     0.99  0.97  0.96   0.26   0.26   0.24  .26  .25  .25
9     1.50  1.49  1.51   0.90   0.91   0.90  .07  .07  .07
10    1.45  1.42  1.45   0.18   0.19   0.20  .08  .08  .09
11    1.70  1.71  1.71  -0.12  -0.10  -0.11  .22  .23  .23
12    1.17  1.13  1.17  -0.83  -0.87  -0.82  .23  .20  .24
13    1.68  1.68  1.63  -0.33  -0.31  -0.34  .06  .07  .05
14    1.01  1.00  0.99   0.64   0.65   0.64  .12  .12  .11
15    1.51  1.54  1.53  -1.16  -1.12  -1.14  .05  .07  .06
16    0.64  0.68  0.66   0.83   0.90   0.89  .18  .20  .19
17    1.82  1.80  1.81   1.68   1.70   1.69  .04  .04  .04
18    1.19  1.20  1.21   0.17   0.20   0.20  .12  .13  .13
19    2.06  2.06  2.05  -0.31  -0.29  -0.30  .00  .00  .00
20    2.35  2.35  2.36   1.17   1.19   1.18  .04  .04  .04
21    1.60  1.60  1.59   0.76   0.77   0.77  .23  .23  .22
22    0.79  0.80  0.75  -0.31  -0.26  -0.37  .14  .16  .12
23    2.24  2.24  2.22   0.60   0.63   0.62  .12  .12  .12
24    0.67  0.65  0.67  -1.52  -1.55  -1.50  .07  .06  .08
25    0.58  0.59  0.59   0.78   0.79   0.81  .04  .04  .04
26    0.59  0.57  0.60  -1.33  -1.37  -1.30  .20  .20  .22
27    1.14  1.18  1.14   1.45   1.45   1.46  .04  .04  .04
28    2.22  2.22  2.22   1.36   1.37   1.37  .19  .19  .19
29    2.34  2.31  2.34  -0.20  -0.20  -0.19  .18  .18  .18
30    1.72  1.72  1.69  -1.41  -1.39  -1.43  .23  .24  .22

Figure 3.2: Sample draw at first 3000 iterations for series of a, b and c

[Figure: three trace plots of sample value against iteration (0 to 3000), one panel each for a, b, and c; the first two panels are labeled a = 1.4 and b = .46]

posterior distribution of the b parameter, and the lower panel gives the first 2000
draws from the posterior distribution of the asymptote parameter. The path plots in
Figure 3.2 show that the chains for the fifth item's parameters mixed well even in the
first 2000 draws. The path plots for the other items in the 30-item and 45-item tests
are similar and are not shown.
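
The sample autocorrelation function used for the check in Figure 3.1 can be computed as in the sketch below. The chain here is a hypothetical AR(1) series standing in for post-burn-in draws of one parameter; it is not output from the sampler used in this study.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a series at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# Hypothetical autocorrelated chain (AR(1) with coefficient 0.8).
rng = np.random.default_rng(1)
chain = np.empty(5000)
chain[0] = 0.0
for t in range(1, chain.size):
    chain[t] = 0.8 * chain[t - 1] + rng.standard_normal()

acf = sample_acf(chain, max_lag=40)
```

Plotting `acf` against lag gives a figure of the same kind as Figure 3.1; the lag at which the values become negligible suggests how strongly successive draws depend on each other.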

Columns 2 through 7 of Table 3.4 (denoted $\hat{a}$, $\hat{b}$, $\hat{c}$, $S(a)$,
$S(b)$, $S(c)$) are the item parameter estimates and their corresponding standard
errors for the 30-item test with sample size 2000 from the first replication of
response data. The last six columns are the values for the sample size 5000. Table 3.5
shows the item parameter estimates from BILOG-MG3 using the MML procedure for the same

Table 3.4: Item Parameter Estimates for 30-Item Test (Dim = 1)

      N = 2000                            N = 5000
Item  a         b  c    S(a)  S(b)  S(c)  a         b  c    S(a)  S(b)  S(c)
1     1.74  -1.04  .20  .16   .07   .06   1.70  -1.14  .13  .10   .07   .05
2     0.88   0.26  .08  .09   .09   .04   0.83   0.22  .04  .04   .04   .02
3     0.45  -1.85  .19  .05   .29   .09   0.58  -1.35  .23  .04   .17   .06
4     1.61  -0.73  .14  .18   .10   .07   1.88  -0.65  .17  .12   .05   .03
5     2.11   0.57  .09  .19   .04   .01   2.07   0.55  .07  .11   .02   .01
6     1.37   0.58  .20  .15   .05   .02   1.41   0.56  .18  .09   .03   .01
7     2.10   1.44  .21  .27   .05   .01   2.17   1.42  .21  .20   .03   .03
8     0.97   0.25  .25  .09   .08   .03   0.75  -0.10  .09  .05   .07   .03
9     1.50   0.90  .07  .14   .04   .01   1.38   0.98  .08  .10   .02   .01
10    1.44   0.19  .08  .10   .04   .02   1.20   0.20  .09  .06   .03   .02
11    1.71  -0.11  .23  .15   .05   .03   1.64  -0.08  .24  .10   .04   .02
12    1.16  -0.84  .22  .10   .10   .06   1.29  -0.78  .22  .08   .07   .04
13    1.66  -0.33  .06  .12   .04   .03   1.55  -0.42  .01  .07   .02   .01
14    1.00   0.64  .12  .12   .08   .03   0.93   0.63  .12  .07   .06   .02
15    1.53  -1.14  .06  .13   .08   .06   1.62  -0.95  .18  .11   .06   .04
16    0.66   0.87  .19  .10   .13   .04   0.55   0.57  .11  .04   .08   .03
17    1.81   1.69  .04  .25   .07   .01   1.83   1.66  .03  .14   .03   .00
18    1.20   0.19  .13  .12   .07   .03   1.16   0.23  .13  .07   .04   .02
19    2.06  -0.30  .00  .13   .03   .01   2.04  -0.27  .01  .09   .01   .01
20    2.35   1.18  .04  .15   .04   .01   2.24   1.15  .05  .15   .02   .00
21    1.60   0.77  .23  .18   .05   .02   1.60   0.69  .24  .12   .02   .01
22    0.78  -0.31  .14  .06   .12   .05   0.71  -0.51  .04  .04   .09   .04
23    2.23   0.62  .12  .18   .04   .01   2.21   0.62  .10  .14   .02   .01
24    0.66  -1.52  .07  .04   .14   .06   0.70  -1.41  .10  .03   .10   .06
25    0.59   0.79  .04  .06   .10   .03   0.58   0.74  .03  .04   .09   .03
26    0.59  -1.33  .21  .06   .26   .09   0.55  -1.69  .04  .03   .11   .04
27    1.15   1.45  .04  .13   .07   .01   1.09   1.54  .05  .08   .04   .01
28    2.22   1.37  .19  .21   .05   .01   2.35   1.35  .18  .14   .03   .01
29    2.33  -0.20  .18  .16   .04   .02   2.14  -0.14  .20  .13   .02   .02
30    1.71  -1.41  .23  .17   .08   .06   1.79  -1.35  .23  .15   .08   .06

 

Table 3.5: Item Parameter Estimates for 30-Item Test in BILOG-MG3 (Dim = 1)

      N = 2000           N = 5000
Item  a         b  c     a         b  c
1     1.83  -1.00  .25   1.71  -1.17  .15
2     0.88   0.21  .08   0.81   0.18  .04
3     0.61  -0.65  .50   0.58  -1.44  .21
4     1.67  -0.73  .16   1.90  -0.68  .18
5     2.09   0.52  .08   2.04   0.51  .07
6     1.39   0.54  .20   1.39   0.53  .18
7     2.20   1.38  .21   2.14   1.39  .21
8     0.95   0.17  .24   0.75  -0.12  .11
9     1.49   0.85  .07   1.38   0.94  .09
10    1.45   0.15  .08   1.17   0.14  .08
11    1.66  -0.17  .22   1.62  -0.13  .24
12    1.18  -0.83  .25   1.27  -0.83  .21
13    1.65  -0.37  .05   1.53  -0.47  .01
14    1.00   0.58  .11   0.93   0.60  .12
15    1.46  -1.23  .00   1.64  -0.97  .19
16    0.67   0.84  .19   0.57   0.58  .12
17    1.82   1.62  .04   1.81   1.63  .03
18    1.23   0.17  .13   1.15   0.18  .12
19    2.02  -0.33  .00   2.02  -0.31  .01
20    2.52   1.12  .04   2.21   1.12  .04
21    1.60   0.72  .23   1.57   0.66  .24
22    0.78  -0.35  .14   0.66  -0.61  .00
23    2.28   0.56  .12   2.19   0.59  .10
24    0.64  -1.67  .00   0.72  -1.31  .17
25    0.58   0.73  .03   0.55   0.64  .01
26    0.57  -1.52  .15   0.51  -1.87  .00
27    1.16   1.40  .04   1.08   1.51  .05
28    2.41   1.31  .19   2.47   1.31  .18
29    2.51  -0.21  .19   2.09  -0.19  .20
30    1.72  -1.39  .28   1.82  -1.36  .26

Table 3.6: Item Parameter Estimates for 45-Item Test (Dim = 1)

      N = 2000                            N = 5000
Item  a         b  c    S(a)  S(b)  S(c)  a         b  c    S(a)  S(b)  S(c)
1     1.54  -0.01  .08  .10   .03   .02   1.59   0.01  .07  .07   .03   .01
2     2.42   0.06  .17  .11   .03   .02   2.31   0.08  .17  .13   .02   .01
3     2.31   0.46  .02  .15   .02   .01   2.41   0.50  .02  .10   .02   .00
4     1.00   0.20  .24  .10   .07   .03   0.95   0.20  .20  .05   .03   .02
5     2.37   0.31  .26  .14   .04   .02   2.44   0.35  .24  .09   .03   .01
6     0.96  -1.65  .09  .08   .14   .07   0.99  -1.55  .11  .05   .08   .05
7     2.10   1.13  .07  .20   .03   .01   2.06   1.16  .08  .16   .02   .01
8     1.35  -0.60  .14  .14   .11   .06   1.41  -0.49  .15  .07   .04   .03
9     1.55  -0.69  .11  .12   .06   .04   1.66  -0.62  .12  .08   .03   .02
10    1.93   0.55  .16  .18   .04   .02   2.08   0.58  .17  .12   .02   .01
11    1.87   0.33  .24  .17   .05   .02   2.03   0.37  .24  .12   .03   .01
12    1.39  -0.42  .23  .14   .10   .06   1.33  -0.47  .18  .08   .04   .03
13    1.28  -0.71  .18  .15   .12   .06   1.36  -0.63  .18  .06   .04   .03
14    2.29  -1.24  .08  .17   .06   .05   2.23  -1.19  .05  .12   .03   .03
15    0.50   1.01  .15  .08   .17   .05   0.43   1.02  .12  .05   .13   .04
16    1.53  -1.53  .21  .15   .11   .06   1.57  -1.43  .20  .18   .15   .11
17    1.89   0.76  .09  .17   .04   .01   2.16   0.75  .09  .12   .02   .01
18    0.61  -0.13  .10  .06   .16   .06   0.59  -0.25  .04  .03   .10   .04
19    2.28  -1.07  .14  .15   .05   .04   1.91  -1.07  .07  .13   .05   .03
20    1.62   0.70  .23  .17   .06   .02   1.77   0.71  .25  .13   .03   .01
21    1.60  -1.30  .09  .13   .08   .06   1.43  -1.35  .02  .07   .04   .02
22    1.38   0.04  .19  .12   .06   .03   1.45   0.11  .19  .07   .02   .01

 

Table 3.7: Item Parameter Estimates for 45-Item Test (Dim = 1), cont.

      N = 2000                            N = 5000
Item  a         b  c    S(a)  S(b)  S(c)  a         b  c    S(a)  S(b)  S(c)
23    1.40   1.25  .02  .12   .05   .01   1.51   1.25  .03  .08   .03   .00
24    1.84  -0.96  .04  .14   .05   .03   2.08  -0.78  .12  .09   .02   .02
25    0.85   0.24  .02  .06   .06   .02   0.83   0.19  .00  .03   .03   .01
26    1.03   1.19  .12  .14   .07   .02   0.87   1.13  .09  .06   .04   .01
27    1.69   1.01  .01  .16   .04   .01   1.85   0.98  .01  .10   .02   .00
28    1.80  -0.52  .24  .15   .04   .03   1.95  -0.45  .21  .12   .05   .03
29    0.87  -1.95  .13  .10   .22   .10   0.85  -1.94  .07  .05   .09   .05
30    1.19   3.17  .00  .20   .29   .00   1.26   2.98  .00  .09   .10   .00
31    1.72  -1.22  .02  .11   .05   .03   1.69  -1.12  .04  .12   .07   .04
32    0.44  -0.72  .11  .04   .21   .06   0.50  -0.67  .05  .02   .10   .04
33    1.66  -0.63  .06  .13   .05   .03   1.76  -0.56  .05  .08   .03   .02
34    0.64   0.69  .02  .06   .09   .02   0.66   0.72  .03  .05   .05   .02
35    2.22  -0.58  .20  .20   .06   .04   2.40  -0.48  .21  .11   .03   .02
36    2.39   0.55  .08  .12   .03   .01   2.40   0.58  .09  .11   .02   .01
37    2.37   0.32  .08  .14   .03   .01   2.46   0.33  .08  .07   .02   .01
38    1.37   1.56  .07  .19   .06   .01   1.47   1.54  .07  .11   .04   .01
39    2.33   0.50  .10  .14   .03   .01   2.37   0.49  .09  .11   .02   .01
40    0.76  -1.24  .05  .05   .09   .05   0.80  -1.13  .07  .04   .10   .05
41    1.18  -0.43  .06  .08   .05   .03   1.37  -0.24  .16  .06   .03   .02
42    1.32  -0.94  .21  .10   .08   .06   1.21  -1.01  .14  .05   .05   .04
43    0.77   1.76  .11  .13   .09   .02   0.77   1.78  .11  .08   .08   .01
44    1.19   1.59  .06  .13   .06   .01   1.27   1.60  .08  .09   .04   .01
45    1.84   1.36  .19  .24   .05   .01   1.94   1.34  .19  .16   .03   .01

 

data set; MML is a standard item parameter estimation procedure in most IRT
calibration software. Comparing the item parameter estimates from the two procedures,
one can see that the results are very close to each other and to the true item
parameters, indicating that the two estimation methods are comparable.
Tables 3.6 and 3.7 show the item parameter estimates and the corresponding standard
errors for the 45-item test.

Table 3.8: RMSE for Estimating Uni-dimensional Models (Dim = 1)

     30 x 2000  30 x 5000  45 x 2000  45 x 5000
a    .15        .07        .11        .07
b    .11        .08        .08        .08
c    .05        .04        .04        .03

As is true of many IRT estimation programs, item parameter estimates contain
estimation error even if the data fit the mathematical model perfectly. To examine
the accuracy of the item parameter estimates, root mean square errors (RMSE) of the
item parameter estimates are calculated over the data replications and chains. In
this study, three data replications are generated for both tests (i.e., the 30-item
test and the 45-item test). Here a data replication means that, for example, the
30-item test is administered to three different groups of examinees drawn from the
same population, N(0, 1). Therefore, there are three sets of item parameter estimates
corresponding to the three groups of examinees. For each data set, the program runs
three chains from different initial values to verify that the MCMC approach provides
stable parameter estimates. Each chain independently gives estimates of the item
parameters, so combining three data replications and three chains per data set
yields nine sets of item parameter estimates. For each data set, the final item
parameter estimates are the average of the estimates from the three chains. RMSE is
defined as the square root of the mean squared difference between the item parameter
estimates and the true item parameters over $r$ data replications and across $n$ items
($r$ in this example is 3, and $n$ is 30 or 45). Let $\eta$ denote an item parameter
(e.g., the discriminating power parameter $a$, the difficulty parameter $b$, or the
asymptote parameter $c$) and $\hat{\eta}$ its estimate. Then RMSE can be calculated by

\[ \mathrm{RMSE}(\eta) = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{r} (\eta_{ij} - \hat{\eta}_{ij})^{2}}{r \times n}}, \]

where $i$ indexes items and $j$ indexes data replications.

RMSE gives a summary index of the accuracy of the item parameter estimates: the
larger the RMSE for a data set, the less accurate the item parameter estimates. In a
simulation study, perfect fit between model and data is assumed, so differences
between the true item parameters and their estimates depend on the estimation
procedure and on other factors (e.g., the examinee sample size).
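
The RMSE definition above amounts to the following computation. The sketch is illustrative only: the function name is hypothetical, and the small arrays are made-up values rather than results from this study.

```python
import numpy as np

def rmse(estimates, true_values):
    """Root mean square error over r replications (rows) and n items (columns)."""
    estimates = np.asarray(estimates, dtype=float)      # shape (r, n)
    true_values = np.asarray(true_values, dtype=float)  # shape (n,)
    return float(np.sqrt(np.mean((estimates - true_values) ** 2)))

# Hypothetical example: n = 2 items, r = 2 replications.
true_a = [1.0, 2.0]
est_a = [[1.1, 2.0],
         [0.9, 2.2]]
value = rmse(est_a, true_a)
```

Taking the mean over all r x n squared differences is the same as dividing the double sum by r x n.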

Table 3.8 contains the RMSE for the item parameters. It shows that, for the same
test, the larger the sample size, the smaller the RMSE and hence the estimation error.
The largest RMSE for a is .15, in the 30-item test with 2000 examinees; the smallest
is .07, in both tests when the sample size is 5000. The largest RMSE for b is .11, in
the 30-item test with 2000 examinees. The RMSE for c is generally smaller than the
RMSE for a and b, with the largest value, .05, in the 30-item test with 2000
examinees.

Table 3.9: Correlations Between True Proficiency and Estimates (Dim = 1)

Tests      N = 2000  N = 5000
30-items   .9546     .9554
45-items   .9712     .9718

Table 3.9 shows the correlations between the true proficiencies and the estimates
from the MCMC approach. For the 30-item test, the correlations are around .96. The
correlations in the 45-item test are about .97, slightly higher than those in the
30-item test. That is, the longer test gives a higher correlation between the true
and estimated proficiencies, implying better proficiency parameter estimation.
Figure 3.3 plots true proficiency against the estimates for the four correlations in
Table 3.9. One can see that the proficiency estimates from the longer test (i.e., the
45-item test) cluster more closely around the reference line y = x, reflecting the
higher correlation between the true values and the estimates. Figures 3.4 through 3.6
plot the true item parameters against the estimates for parameters a, b, and c,
respectively. Most of the points are close to the reference line y = x. In the panels
with the larger sample size, the points are closer to the reference line, implying
better item parameter estimates.

Figure 3.3: True Proficiency Versus Estimates (Dim = 1)

[Figure: scatter plots of estimate against true ability; panels n = 30, N = 2000; n = 45, N = 2000; n = 30, N = 5000; n = 45, N = 5000]

Figure 3.4: True a Parameter Versus Estimates (Dim = 1)

[Figure: scatter plots of estimates against parameter a; panels include n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 5000]

Figure 3.5: True b Parameter Versus Estimates (Dim = 1)

[Figure: scatter plots of estimates against parameter b; panels n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000]

Figure 3.6: True c Parameter Versus Estimates (Dim = 1)

[Figure: scatter plots of estimates against parameter c; panels include n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 5000]

3.5 Estimating the 3-Dimensional MIRT Model

This section discusses the simulation studies of parameter estimation for the
3-dimensional model, which differs from estimation for the unidimensional model
because the number of parameters in the multidimensional model is much greater than
that in the unidimensional model. In addition, new parameters (e.g., the proficiency
structure parameters that appear as components of the covariance matrix of the
underlying proficiency distribution) must be estimated along with the item and
proficiency parameters. One more concern for MIRT model parameter estimation is the
indeterminacy inherent in the form of the MIRT model: one needs to impose constraints
to ensure that the MIRT model parameters have a fixed solution. The following sections
discuss the design of the simulation studies, including how the item and proficiency
parameters, the response data, and the underlying proficiency covariance are generated,
how constraints are placed on the items in a test to establish a fixed scale for the
parameter estimates, and how the accuracy and stability of the parameter estimation
are assessed.

3.5.1 Generating Proficiency Parameters

Assume that the proficiency vector of each examinee follows the multivariate normal
distribution with mean vector $\mu$ and covariance matrix $\Sigma_\theta$. That is,
$\theta_j \sim N_p(\mu, \Sigma_\theta)$, where $j = 1, 2, \ldots, N$. Proficiency
parameters for each examinee are randomly drawn from $N_p(0, \Sigma_\theta)$, where
$p$ is the number of dimensions and $\Sigma_\theta$ is the generating covariance
matrix, which corresponds to the dimensional structure and is discussed further in
Section 3.5.3. The mean vector $\mu$ is set to 0 because each dimension represents
one hypothetical construct, so comparisons among dimensions are not necessary.
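
Drawing the proficiency vectors can be sketched with NumPy's multivariate normal generator. This is an illustration rather than the generating code of this study; the seed is arbitrary, and the covariance matrix shown is the equal-correlation structure with off-diagonal value .2 described in Section 3.5.3.

```python
import numpy as np

rng = np.random.default_rng(42)
p, n_examinees = 3, 2000

# Generating covariance matrix with unit variances and correlations of .2.
sigma = np.array([[1.0, 0.2, 0.2],
                  [0.2, 1.0, 0.2],
                  [0.2, 0.2, 1.0]])

# One row per examinee, one column per proficiency dimension.
theta = rng.multivariate_normal(mean=np.zeros(p), cov=sigma, size=n_examinees)
sample_corr = np.corrcoef(theta, rowvar=False)
```

The sample correlations in `sample_corr` should be close to .2 but will not match exactly because of sampling error.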


3.5.2 The Number of Proﬁciency Dimension and Sample Size

One factor that might indirectly affect parameter estimation in the MIRT model is
the number of proficiency dimensions (i.e., the number of latent variables in the
complete latent space). The unidimensional IRT model (dim = 1) has 3 parameters for
each item and one proficiency parameter for each examinee. For a test with n items
and N examinees, the total number of parameters to be estimated is 3n + N. In the
3-dimensional MIRT model, however, there are 5 parameters for each item (i.e., three
a parameters plus the d and c parameters), 3 proficiency parameters for each
examinee, and 3 more parameters for the components of the proficiency covariance
matrix. Therefore, for a test with n items and N examinees, the total number of model
parameters to be estimated is 5n + 3N + 3, far more than in the unidimensional model.
The larger number of parameters in the MIRT model makes estimation more difficult for
a given test length n and sample size N, since more information is required to
achieve the same level of estimation precision.
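
The two parameter counts can be compared for the test lengths and sample sizes used in this study. The function names below are hypothetical; the formulas are the ones stated above.

```python
def n_params_unidimensional(n_items, n_examinees):
    """3 item parameters per item plus 1 proficiency per examinee."""
    return 3 * n_items + n_examinees

def n_params_3d(n_items, n_examinees):
    """5 item parameters per item, 3 proficiencies per examinee, 3 correlations."""
    return 5 * n_items + 3 * n_examinees + 3

for n, N in [(30, 2000), (30, 5000), (45, 2000), (45, 5000)]:
    print(n, N, n_params_unidimensional(n, N), n_params_3d(n, N))
```

For the 30-item test with 2000 examinees, for example, the count rises from 3(30) + 2000 = 2090 to 5(30) + 3(2000) + 3 = 6153.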

The simulation studies here consider two numbers of dimensions for estimating MIRT
models: three and five proficiency dimensions. That is, three or five proficiency
dimensions are required to answer the items correctly in the simulation studies.

The stability of the Monte Carlo estimates may depend on the sample size (this is
also the case for maximum likelihood and Bayesian modal estimation). To investigate
the effect of sample size on the accuracy and stability of the estimation, response
data for samples of 2000 and 5000 examinees are independently generated from the
multivariate normal population. The sample size of 2000 is considered moderate, and
5000 large.

3.5.3 Proﬁciency Structure

For multivariate analysis, estimating the covariance matrix is an important step,
because the covariance structure reveals information about the interrelations among
the variables of interest. Since comparisons among proficiency dimensions are not
useful in testing practice, one can standardize the proficiency components so that
the variance of each proficiency dimension equals 1, which reduces the number of
parameters in the proficiency covariance matrix. For example, if a test requires 3
proficiency dimensions, only three additional parameters (the off-diagonal
components) are needed to describe the proficiency covariance. The off-diagonal
components represent the interrelations among the required proficiency dimensions,
and the pairwise correlations in the matrix may vary. For the MIRT model, the
generating covariance matrices used are, for simplicity, of the form

\[ \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}, \]

where $\rho$ in the proficiency structure matrix equals .2; this matrix is denoted as

\[ \Sigma_{\theta,.2} \equiv \begin{pmatrix} 1 & .2 & \cdots & .2 \\ .2 & 1 & \cdots & .2 \\ \vdots & \vdots & \ddots & \vdots \\ .2 & .2 & \cdots & 1 \end{pmatrix}. \]

For a more general case, $\rho$ takes different values for the off-diagonal components.
For example, the generating covariance matrix for the 3-dimensional model has
off-diagonal components from .2 to .7, denoted as

\[ \Sigma_{\theta,g} \equiv \text{[3 x 3 matrix; entries not recoverable]} \]
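
The equal-correlation structure can be built for any number of dimensions as in the sketch below, which also checks that the result is a valid (positive definite) covariance matrix. The function name is hypothetical, and the eigenvalue check is an added illustration, not a step described in this study.

```python
import numpy as np

def equicorrelation(p, rho):
    """p x p matrix with unit diagonal and rho in every off-diagonal cell."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

sigma3 = equicorrelation(3, 0.2)   # structure for the 3-dimensional model
sigma5 = equicorrelation(5, 0.2)   # same structure for five dimensions

# Eigenvalues in ascending order; all must be positive for a valid covariance.
eig3 = np.linalg.eigvalsh(sigma3)
```

For this structure the eigenvalues are 1 - rho (repeated p - 1 times) and 1 + (p - 1) rho, so with rho = .2 and p = 3 they are .8, .8, and 1.4.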

3.5.4 Generating Item Parameters

It is natural to assume that some items in a test measure only one proficiency
dimension (call such items uni-items), while other items measure two or more
dimensions (call such items multi-items). A test can be composed of both uni-items
and multi-items. Two tests that include both uni-items and multi-items, with 30 and
45 items respectively, are generated in this simulation study on estimation for the
3-dimensional MIRT model.

Table 3.10 contains the true item parameters for the 30-item test. The first 15
items measure only one proficiency dimension, and the remaining 15 items measure all
three dimensions. The components of the parameter vector a range from 0 to 2.45. Note
that, among the items that measure three dimensions, some items have an a component
that dominates the others (e.g., items 20, 21, 24), and some items have very close a
values on two or three dimensions (e.g., items 19, 25, 26, 27). The values of the d
parameters are simulated from the standard normal distribution N(0, 1). The lowest d
value is -1.63 and the highest is 2.38, indicating that a wide range of d values is
included in the test. Asymptote parameters c are drawn from the uniform distribution
U(0, .25). High guessing parameters are not expected for good test items, as is the
case in this example. Combining the number of items (30 and 45), the sample size
(2000 and 5000), the underlying proficiency structure ($\Sigma_{\theta,.2}$ and
$\Sigma_{\theta,g}$), and three replications per condition, 24 dichotomous response
data sets are generated in all.
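
Response data for the 3-dimensional model could be generated as sketched below. The sketch assumes the compensatory linear logistic form P = c + (1 - c) / (1 + exp(-(a'theta + d))); the model equation itself is given in an earlier chapter and may use a different parameterization. Only four items from Table 3.10 (items 1, 2, 3, and 16) are used, and the function name and seed are hypothetical.

```python
import numpy as np

def mirt_prob(theta, a, d, c):
    """Correct-response probabilities: theta (N, p), a (n, p), d and c (n,)."""
    z = theta @ a.T + d                       # linear predictor, shape (N, n)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

# Items 1, 2, 3, and 16 from Table 3.10.
a = np.array([[1.30, 0.00, 0.00],
              [0.00, 0.50, 0.00],
              [0.00, 0.00, 2.10],
              [1.34, 2.23, 1.98]])
d = np.array([-0.23, 0.02, -1.00, -1.24])
c = np.array([0.21, 0.00, 0.24, 0.05])

rng = np.random.default_rng(7)
sigma = np.full((3, 3), 0.2) + 0.8 * np.eye(3)   # unit variances, correlations .2
theta = rng.multivariate_normal(np.zeros(3), sigma, size=2000)
prob = mirt_prob(theta, a, d, c)
responses = (rng.uniform(size=prob.shape) < prob).astype(int)
```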

To solve the indeterminacy problem and establish a fixed scale for the model
parameter estimates, the first three items are chosen to be unidimensional items,
each assumed to measure only the first, the second, or the third dimension,
respectively. More specifically, the a values for the first item are zero on the
second and third dimensions, the a values for the second item are zero on the first
and third dimensions, and the a values for the third item are zero on the first and
second dimensions. These three items are viewed as anchor items: they are placed in
the first three positions in the test and all are unidimensional, which serves as a
constraint that settles the metric issue, or the indeterminacy problem, inherent in
MIRT models. It is argued that the model can be identified by setting the mean vector
of the proficiency parameters to zero and standardizing the covariance matrix,
together with the above constraints, which are also used
in the exploratory option of NOHARM (Fraser, 1988).
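
One way to impose the anchor-item constraint on a matrix of a values is sketched below. How the sampler in this study enforces the constraint is not shown here, so this is an assumption: the sketch simply zeroes the off-dimension a values of the first p items, and the function name and simulated values are hypothetical.

```python
import numpy as np

def apply_anchor_constraints(a):
    """Force item k (k = 1..p) to load only on dimension k; return a copy and a mask."""
    a = np.array(a, dtype=float)            # copy so the input is not modified
    p = a.shape[1]
    free = np.ones(a.shape, dtype=bool)     # True where an a value is estimated
    free[:p, :] = np.eye(p, dtype=bool)     # anchor items: only the diagonal is free
    a[~free] = 0.0
    return a, free

rng = np.random.default_rng(3)
a_draw = rng.uniform(0.5, 2.5, size=(30, 3))   # hypothetical sampled a values
a_constrained, free = apply_anchor_constraints(a_draw)
```

The mask `free` marks the 84 a values that remain to be estimated in a 30-item, 3-dimensional test; the other 6 are fixed at zero.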

Table 3.11 contains the true parameters for the 45-item test. The first thirty
items measure only one proficiency dimension. Item 1 and items 4 through 12 load only
on the first dimension, item 2 and items 13 through 21 load only on the second
dimension, and item 3 and items 22 through 30 load only on the third dimension.
The remaining 15 items of the test, items 31 through 45, measure all three
dimensions. The a parameters in the 45-item test also cover a wide range, from 0
(item 2) to 2.43 (item 41). The minimum value of the d parameter is -2.06 (item 26)
and the maximum is 2.07 (item 6). The c parameters are within the range of .01 to .24
in this test.

The first three items in the 45-item test are also unidimensional items placed in
the first three positions in the test; that is, these three items are assumed to
measure only the first, the second, and the third dimension, respectively. As in the
30-item test, the purpose of placing the three unidimensional items in the first
three positions is to settle the indeterminacy problem and establish a fixed scale
for the item and proficiency parameter estimates.

3.5.5 The Estimation Accuracy and Stability for
the 3-Dimensional MIRT Model

Table 3.12 contains the RMSE for the item parameters in the 3-dimensional model
for both tests with sample sizes 2000 and 5000, under the condition that all of the
off-diagonal components of the proficiency covariance matrix are equal to .2. Note
that the item parameter estimates are the means of the three individual estimates
obtained from the three chains with different random initial values. By taking the
means of the individual estimates based on multiple chains for the same data set,
one can expect the final estimates to be more stable and accurate

Table 3.10: True Item Parameters for 30-Item Test (Dim = 3)

Item  a1    a2    a3        d  c
1     1.30  0     0     -0.23  .21
2     0     0.50  0      0.02  .00
3     0     0     2.10  -1.00  .24
4     1.93  0     0      0.61  .06
5     0.81  0     0      0.31  .24
6     1.62  0     0      1.76  .00
7     0.59  0     0      1.56  .06
8     0     2.45  0     -0.38  .08
9     0     1.88  0     -0.86  .14
10    0     0.57  0     -0.51  .02
11    0     1.15  0      1.25  .03
12    0     0     1.35  -0.29  .25
13    0     0     0.98   2.38  .09
14    0     0     1.46  -1.45  .12
15    0     0     1.49  -0.30  .21
16    1.34  2.23  1.98  -1.24  .05
17    1.84  2.34  0.90   0.08  .00
18    0.86  1.04  1.76   1.13  .03
19    1.93  1.65  1.96   0.61  .11
20    0.56  0.87  1.97   1.23  .09
21    2.20  0.96  1.16  -1.01  .05
22    1.58  1.48  2.29  -1.58  .19
23    1.26  1.68  1.45  -0.07  .16
24    2.37  0.75  0.52  -1.37  .02
25    1.94  1.99  1.16  -1.63  .04
26    0.89  1.32  0.92   0.35  .20
27    1.25  1.56  1.64   1.08  .06
28    2.07  1.71  2.43   0.79  .06
29    1.41  0.96  2.12   0.46  .20
30    0.98  2.30  1.64  -0.43  .08

Table 3.11: True Item Parameters for 45-Item Test (Dim = 3)

Item  a1    a2    a3        d  c      Item  a1    a2    a3        d  c
1     1.12  0     0      0.18  .09    24    0     0     0.24   0.56  .13
2     0     1.51  0      1.28  .08    25    0     0     1.96  -0.23  .14
3     0     0     1.24  -0.46  .19    26    0     0     0.44  -2.06  .04
4     2.03  0     0     -1.74  .05    27    0     0     0.91   0.24  .07
5     1.92  0     0      1.24  .14    28    0     0     2.46   0.05  .07
6     1.84  0     0      2.07  .19    29    0     0     0.77   1.06  .01
7     2.43  0     0      0.42  .11    30    0     0     1.57   0.51  .02
8     0.94  0     0      1.04  .03    31    2.26  0.52  1.85  -2.05  .06
9     0.89  0     0      0.27  .13    32    1.20  1.05  0.79   0.28  .04
10    0.52  0     0     -0.69  .22    33    2.31  1.25  1.98  -0.31  .13
11    0.30  0     0     -0.75  .23    34    1.38  0.64  1.62   0.80  .02
12    0.94  0     0      0.65  .12    35    0.53  2.21  1.23  -0.55  .02
13    0     0.53  0     -0.92  .22    36    0.95  1.09  1.02  -0.99  .23
14    0     0.91  0      1.28  .03    37    0.22  0.92  0.80   0.04  .23
15    0     0.22  0      0.02  .17    38    1.77  2.50  0.78   1.33  .01
16    0     1.03  0     -1.64  .02    39    1.32  2.19  1.32   1.30  .13
17    0     1.87  0     -1.69  .02    40    1.85  1.08  1.22  -0.45  .09
18    0     0.79  0     -1.11  .03    41    1.27  2.21  2.43   1.98  .10
19    0     1.81  0     -0.47  .17    42    0.22  1.15  2.00   0.50  .24
20    0     1.75  0      1.31  .12    43    0.36  1.25  0.21  -0.43  .16
21    0     0.73  0     -1.07  .21    44    1.95  1.60  1.35  -0.90  .03
22    0     0     2.05  -1.22  .04    45    1.11  2.21  1.07  -0.40  .07
23    0     0     2.05   0.47  .05

Table 3.12: RMSE for Multi-dimensional Test (Dim = 3, ρ = .2)

Estimates  30 x 2000  30 x 5000  45 x 2000  45 x 5000
a1         .15        .08        .15        .06
a2         .12        .08        .16        .04
a3         .18        .08        .11        .05
d          .19        .10        .22        .11
c          .07        .03        .06        .03

because the fluctuation of the item parameter estimates induced by the initial values
and sampling errors is taken into account. It shows that for a given test (e.g., the
30-item or the 45-item test), the larger the sample size, the smaller the RMSE. For the
30-item test, the largest RMSE for the a parameters is .18 when the sample size is 2000,
but .08 when the sample size is 5000. The RMSE for the d parameter is .19 when the
sample size is 2000 and .10 when it is 5000. The RMSE for the c parameter is .07 for
sample size 2000 but .03 for 5000. Similar results are found for the 45-item test. The
smallest RMSE, .04, is for an a parameter in the 45-item test with sample size 5000.
Note that within the same test and with the same sample size, the RMSE for $a_i$,
$i = 1, 2, 3$, are close to each other, which implies that the estimation can achieve
the same level of precision across dimensions. The RMSE for c is also generally smaller
than the RMSE for a and d, with the largest value, .07, occurring in the 30-item test
with 2000 examinees.
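
The RMSE values discussed above can be illustrated with a minimal sketch. It assumes the usual definition of root mean squared error (the square root of the mean squared difference between true values and estimates); the function name `rmse` and the sample numbers are hypothetical and are not taken from the study's C++ program or data.

```python
import math


def rmse(true_values, estimates):
    """Root mean squared error between true parameters and their estimates.

    Assumes the standard definition; the study may additionally average
    over chains or replications.
    """
    if len(true_values) != len(estimates):
        raise ValueError("inputs must have the same length")
    if not true_values:
        raise ValueError("inputs must not be empty")
    squared_errors = [(e - t) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true and estimated a1 values, for illustration only.
true_a1 = [1.12, 2.03, 1.92, 0.0]
est_a1 = [1.20, 1.95, 2.05, 0.04]
print(round(rmse(true_a1, est_a1), 4))
```

Computing one such value per parameter type and per simulation condition would give a table with the same layout as Table 3.12.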

Table 3.13 gives the RMSE for the situation in which the underlying proficiency
covariance is a general one, that is, it does not follow a special pattern (e.g., all
off-diagonal components of the proficiency covariance matrix being the same). The
results of the parameter estimation for this condition are very similar to those for
the case in which the off-diagonal components of the covariance matrix are equal to .2.
This implies that the underlying proficiency covariance does not affect the item
parameter estimates, which is expected because the item and proficiency parameters
are estimated independently.

Compared to the RMSE for the unidimensional model in Table 3.8, the RMSE


Table 3.13: RMSE for Multi-dimensional Test (Dim = 3, ρ = general)

Estimates   30 x 2000   30 x 5000   45 x 2000   45 x 5000
a1          .10         .12         .13         .06
a2          .14         .06         .11         .07
a3          .13         .12         .13         .06
d           .13         .11         .21         .15
c           .06         .03         .06         .04

 

 

 

 

 

 

 

Table 3.14: Correlations Between True Proficiency and Estimates (Dim = 3, ρ = .2)

                                 30 x 2000   30 x 5000   45 x 2000   45 x 5000
corr(\theta_1, \hat{\theta}_1)   .8765       .8737       .9144       .9136
corr(\theta_2, \hat{\theta}_2)   .8677       .8703       .9125       .9121
corr(\theta_3, \hat{\theta}_3)   .8531       .8649       .9109       .9146

 

 

 

 

 

 

 

for the item parameter estimates of the 3-dimensional MIRT model in Tables 3.12 and
3.13 are generally higher. Clearly, given the same amount of data, the more parameters
there are to estimate, the larger the estimation error.

It can be seen that for the same test, a larger sample size gives a smaller RMSE. The
RMSE for the a parameters across dimensions are close to each other, ranging from
.10 to .14 for sample size 2000 and from .06 to .12 for sample size 5000.
The largest RMSE for d is .21, which occurs in the 45-item test with 2000 examinees;
the smallest is .11, in the 30-item test with sample size 5000. Generally speaking,
the RMSE for parameter c are smaller than those for parameters a and d, varying
from .03 to .06, because c is restricted to a very small range. The RMSE of c for
sample size 5000 are about half of those for 2000 examinees.


Table 3.15: Correlations Between True Proficiency and Estimates (Dim = 3, ρ = general)

                                 30 x 2000   30 x 5000   45 x 2000   45 x 5000
corr(\theta_1, \hat{\theta}_1)   .8876       .8943       .9198       .9259
corr(\theta_2, \hat{\theta}_2)   .8878       .8966       .9211       .9255
corr(\theta_3, \hat{\theta}_3)   .8474       .8602       .9111       .9101

 

 

 

 

 

 

 

The correlations between the true abilities and their estimates are presented in
Tables 3.14 and 3.15 for ρ = .2 and for varying ρ, respectively. Table 3.14 shows that
for the 30-item test the correlations between the true values and the estimates are
around .87, within a very small range from .8531 to .8765. The correlations for the
45-item test also differ only slightly, from .9109 to .9146. The 45-item test in
general has higher correlations (around .91) between the true and the estimated
abilities than the 30-item test. This implies that the proficiency estimates improve
for the longer test; that is, the estimation precision for proficiency in the longer
test is better than that in the shorter test (i.e., the 30-item test).
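
The correlations reported in these tables can be sketched as a plain Pearson correlation between the true proficiencies on one dimension and their estimates. This is a minimal illustration, assuming the ordinary product-moment correlation is the statistic used; the function name and the sample vectors are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical true abilities and estimates on one dimension.
theta_true = [-1.2, -0.4, 0.1, 0.8, 1.5]
theta_hat = [-0.9, -0.5, 0.3, 0.6, 1.1]
print(round(pearson_correlation(theta_true, theta_hat), 4))
```

Applying the function once per proficiency dimension and per simulation condition would fill one cell of a table such as Table 3.14.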

Table 3.15 presents the correlations between the true proficiency ($\theta$) and the
estimates ($\hat{\theta}$) for the situation in which the off-diagonal components of
the proficiency covariance matrix take different values. The 30-item test gives
correlations from .8474 to .8966. Higher correlations are again found for the 45-item
test, with a range from .9101 to .9259. No noticeable difference in the correlations
is found across dimensions. For example, for the 30-item test with 2000 examinees, the
correlation between the first proficiency dimension and its estimates is
corr($\theta_1$, $\hat{\theta}_1$) = .8876, the correlation between the second
proficiency dimension and its estimates is corr($\theta_2$, $\hat{\theta}_2$) = .8878,
and the correlation for the third dimension is corr($\theta_3$, $\hat{\theta}_3$) =
.8474. Comparing Table 3.14 with Table 3.15, slightly higher correlations appear when
ρ takes different values than in the fixed ρ = .2 condition, but the difference is
negligible.

In general, the correlations for the unidimensional model in Table 3.9 are higher
than those for the 3-dimensional model in Tables 3.14 and 3.15. This is because, as
the number of dimensions increases from 1 to 3, the number of parameters to be
estimated increases from 2090 to 6153 for the 30-item test with 2000 examinees.
Therefore, more estimation error appears in the item and proficiency estimates
for the 3-dimensional model.

Figures 3.7 through 3.9 show the plots of the true proficiency versus the estimates
for the 30-item and the 45-item tests across different sample sizes. The plots in
these three figures demonstrate that the true values and the estimates are closer to
the reference line y = x for the longer test (45 items), which is consistent with the
findings on the correlations in Tables 3.14 and 3.15. Figures 3.10 through 3.13 are
the plots of the true item parameters versus their estimates. The points all lie
tightly around the reference line, showing that stable and accurate estimates are
obtained under the various simulation conditions regarding the test length, the
examinee sample size, and the underlying proficiency covariance. It is worth pointing
out that in Figures 3.10, 3.11, and 3.12, for a parameters with true value 0, the
estimates are close to zero. The estimates in the tests with the larger sample size
(e.g., N = 5000) are even closer to zero, with the biggest difference between the true
parameters and the estimates less than .2.


Figure 3.7: True Proficiency Versus Estimates (Dim = 3, ρ = general, n = 30, N = 5000)

[Plot: Estimate versus Ability 1. Legible panel title: n = 45, N = 5000.]

Note that the plots are for the situation in which the underlying proficiency
covariance matrix is $\Sigma_{\theta,.2}$. Similar results are also obtained when the
proficiency covariance is $\Sigma_{\theta,g}$, in which the pairwise correlations vary,
but those plots are omitted here.
3.6 Estimating the 5-dimensional Model

The two tests for the simulation studies in this section have the same numbers of
items as before (i.e., n = 30 and n = 45) and are again administered to groups of
examinees of size N = 2000 and N = 5000, respectively. The difference is that both
tests are assumed to require five dimensions of proficiency to answer the items
correctly. Since the tests measure five dimensions of ability, the total number


Figure 3.8: True Proficiency Versus Estimates (Dim = 3, ρ = general, n = 45, N = 2000)

[Plot: Estimate versus Ability 2. Legible panel titles: n = 30, N = 5000; n = 45, N = 5000.]

Figure 3.9: True Proficiency Versus Estimates (Dim = 3, ρ = general, n = 45, N = 2000)

[Plot: Estimate versus Ability 3. Legible panel titles: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000.]


Figure 3.10: True a1 Parameter Versus Estimates (Dim = 3, ρ = .2)

[Plot: Estimates versus Parameter a1. Panels: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000.]

Figure 3.11: True a2 Parameter Versus Estimates (Dim = 3, ρ = .2)

[Plot: Estimates versus Parameter a2. Panels: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000.]


Figure 3.12: True a3 Parameter Versus Estimates (Dim = 3, ρ = .2)

[Plot: Estimates versus Parameter a3. Panels: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000.]

Figure 3.13: True d Parameter Versus Estimates (Dim = 3, ρ = .2)

[Plot: Estimates versus Parameter d. Panels: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000.]


of parameters to be estimated is (5 + 2)n + 5N + 10, where n stands for the test length
and N for the examinee sample size. For the 30-item test administered to 2000
examinees, for example, the total number of model parameters to be estimated from
the observed data is 10220, which is much greater than the sample size of 2000. If
this test is administered to a group of 5000 examinees, the number of model
parameters is 25220. Similarly, for the 45-item test administered to a group of 2000
examinees, the total number of model parameters is 10325, and it is 25325 if the test
is administered to a sample of 5000 examinees.
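
The parameter counts quoted above can be checked with a short sketch. The generalization to an arbitrary number of dimensions D, namely (D + 2)n + DN + D(D - 1)/2, is an assumption inferred from the counts reported in this chapter for one, three, and five dimensions (the last term being the number of off-diagonal covariance components); the function name is hypothetical.

```python
def parameter_count(n_items, n_examinees, n_dims):
    """Number of model parameters under the assumed general formula.

    (D + 2) item parameters per item (D slopes, d, and c), D proficiencies
    per examinee, and D(D - 1)/2 off-diagonal covariance components.
    """
    item_params = (n_dims + 2) * n_items
    proficiency_params = n_dims * n_examinees
    covariance_params = n_dims * (n_dims - 1) // 2
    return item_params + proficiency_params + covariance_params


for n_items in (30, 45):
    for n_examinees in (2000, 5000):
        total = parameter_count(n_items, n_examinees, n_dims=5)
        print(f"{n_items} x {n_examinees}: {total}")
```

With five dimensions the covariance term equals 10, which matches the constant in the formula given in the text.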

The design of the 30-item test that is assumed to measure five dimensions of ability
follows the same pattern as that of the three-dimensional tests. To impose constraints
for model identification and to establish a fixed scale for the parameter estimates,
the first five items are unidimensional items, each measuring only one dimension of
proficiency. More specifically, these items are called anchor items: the first item
measures only the first dimension of proficiency, the second item measures only the
second dimension, and so on. Tables 3.16 and 3.17 contain the true item parameters for
the 30-item test and the 45-item test, respectively. It can be seen that the anchor
items have a wide range of values on the a parameters (from .65 to 2.04 for the
30-item test, and from 1.38 to 2.32 for the 45-item test). In the 30-item test, there
are two additional unidimensional items for each dimension (item 6 through item 15),
and the rest of the items (item 16 through item 30) are assumed to measure all five
dimensions of ability. For the 45-item test, only one additional unidimensional item
for each dimension is present (item 6 through item 10). The rest of the items in this
test are supposed to measure all five dimensions of ability. In the 30-item test, each
dimension of proficiency is designed to be measured by only 17 items, whereas in the
45-item test, each dimension of proficiency is measured by 42 items, many more than in
the 30-item test. Given this design, one would reasonably expect the proficiency
estimates for the 45-item test to be better, since more items are designed to measure
each dimension of proficiency.

Note that the true item parameters for both tests in Tables 3.16, 3.17, and 3.18
cover a wide range of values on each item parameter. For example, the largest
value of an a parameter is 2.44 and the lowest is 0 in the 30-item test, and the
largest and lowest a values in the 45-item test are 2.32 and 0, respectively. The
values of the d parameters for both tests have a reasonable range; both sets are drawn
from a standard normal distribution. All the asymptote parameters are kept within the
range between 0 and .3.

The five-dimensional proficiency parameters are randomly generated from a multivariate
normal distribution with mean vector 0 and covariance matrix $\Sigma_\theta$ (i.e.,
$N(0, \Sigma_\theta)$). As in the case of the three-dimensional tests in Section 3.5,
the mean vector of the underlying proficiency distribution is set to 0 to establish
the same scale for each proficiency dimension. In the same way, the covariance matrix
$\Sigma_\theta$ is standardized and thus is actually the correlation matrix among these
dimensions of ability. The pairwise correlations among these five dimensions (or the
off-diagonal


Table 3.16: True Item Parameters for 30-Item Test (Dim = 5)

Item a1 a2 a3 a4 a5 d c
1 0.65 0 0 0 0 1.76 .20
2 0 1.74 0 0 0 -0.69 .23
3 0 0 2.04 0 0 0.13 .15
4 0 0 0 1.38 0 1.13 .24
5 0 0 0 0 0.98 -0.64 .14
6 1.14 0 0 0 0 0.30 .07
7 1.64 0 0 0 0 -0.11 .10
8 0 0.67 0 0 0 -0.62 .23
9 0 1.21 0 0 0 0.73 .25
10 0 0 1.49 0 0 -1.12 .12
11 0 0 0.99 0 0 -1.10 .12
12 0 0 0 1.18 0 1.34 .04
13 0 0 0 1.41 0 2.02 .09
14 0 0 0 0 1.91 0.49 .16
15 0 0 0 0 0.88 -1.28 .24
16 2.44 1.24 2.18 1.88 0.85 0.85 .03
17 1.81 1.85 2.28 1.21 2.44 -1.64 .13
18 1.02 2.14 1.77 1.80 2.02 0.91 .07
19 0.60 1.75 2.14 2.19 2.35 2.73 .03
20 0.94 1.23 2.07 1.91 1.42 1.43 .19
21 1.01 1.39 2.17 2.26 0.98 0.95 .12
22 1.13 1.47 2.50 1.08 1.84 2.30 .08
23 1.32 1.29 1.59 2.20 0.80 0.48 .22
24 0.73 2.28 2.00 0.86 0.87 0.51 .15
25 2.43 1.08 1.84 1.15 2.03 0.20 .15
26 1.73 1.30 2.42 1.29 1.15 0.21 .00
27 1.98 1.69 1.50 2.28 1.46 -0.71 .15
28 2.00 1.39 2.15 0.59 1.10 -0.86 .09
29 1.62 1.92 1.56 2.07 1.91 -0.09 .10
30 0.81 1.70 2.13 1.39 1.28 0.75 .06

 


 

Table 3.17: True Item Parameters for 45-Item Test (Dim = 5)

Item a1 a2 a3 a4 a5 d c
1 2.32 0 0 0 0 -0.17 .18
2 0 1.94 0 0 0 0.16 .22
3 0 0 1.53 0 0 0.36 .13
4 0 0 0 1.38 0 0.30 .19
5 0 0 0 0 1.71 0.47 .12
6 1.51 0 0 0 0 0.71 .23
7 0 1.74 0 0 0 -1.61 .21
8 0 0 1.90 0 0 -0.88 .03
9 0 0 0 2.14 0 -1.15 .16
10 0 0 0 0 1.34 -0.13 .18
11 1.30 1.61 1.93 2.05 0.83 -0.73 .00
12 1.03 2.24 0.73 2.20 1.94 2.12 .23
13 2.05 1.56 1.09 0.92 1.83 -0.75 .04
14 1.36 0.93 0.90 1.89 1.45 1.12 .17
15 1.42 2.11 0.88 1.22 0.80 -0.07 .20
16 1.10 0.95 1.83 0.80 1.34 0.00 .23
17 1.65 1.52 2.15 1.09 1.38 1.01 .15
18 1.48 1.25 1.00 1.19 1.85 2.17 .14
19 0.82 1.49 0.62 2.01 1.84 -0.58 .21
20 0.87 1.79 1.61 1.10 1.31 -0.92 .02
21 0.93 2.07 1.49 1.11 1.85 0.80 .03
22 1.03 2.14 1.76 2.33 1.49 0.01 .01

 

 


 

Table 3.18: True Item Parameters for 45-Item Test (Dim = 5), cont.

Item a1 a2 a3 a4 a5 d c
23 1.97 2.17 2.32 2.10 1.57 -0.44 .08
24 1.79 1.25 1.93 1.87 2.34 0.17 .24
25 1.29 0.76 2.20 1.70 1.60 -1.35 .10
26 1.50 1.90 2.03 1.31 1.07 -0.74 .17
27 1.65 0.90 1.42 1.81 0.69 -0.31 .12
28 2.31 0.82 1.91 1.50 1.75 -2.08 .19
29 0.93 2.35 2.34 1.70 1.12 0.36 .08
30 1.99 0.73 1.58 1.68 1.04 -1.36 .08
31 1.34 1.20 1.88 2.18 1.60 -0.81 .18
32 1.49 1.50 1.76 2.00 1.63 -0.25 .12
33 1.95 2.22 1.39 1.59 1.09 -0.29 .11
34 0.64 1.26 0.80 1.21 0.95 -1.55 .23
35 1.06 1.51 1.69 1.64 1.17 -0.60 .09
36 1.45 0.82 1.92 1.66 0.49 0.50 .13
37 1.52 2.22 0.87 1.70 0.71 0.82 .13
38 2.04 1.45 0.97 2.28 1.81 0.96 .24
39 0.90 2.06 1.27 1.55 1.25 1.83 .00
40 1.93 2.09 1.65 1.25 0.80 0.78 .04
41 1.44 1.01 0.81 2.13 1.22 0.19 .25
42 0.74 1.78 1.94 0.92 2.07 -1.01 .04
43 1.93 1.81 0.69 0.90 1.79 0.08 .09
44 1.11 1.91 1.83 0.86 1.06 1.60 .22
45 2.05 1.25 1.55 0.89 1.79 0.89 .02

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


components in $\Sigma_\theta$) can be the same or can differ from each other. In this
section, two covariance matrices are used for $\Sigma_\theta$, denoted
$\Sigma_{\theta,.2}$ and $\Sigma_{\theta,g}$, respectively. As the notation suggests,
all the off-diagonal components of the former take the same value (.2), whereas the
off-diagonal components of the latter vary from .2 to .6. The matrix
$\Sigma_{\theta,g}$ is shown as

$\Sigma_{\theta,g}$ = [5 x 5 matrix; entries illegible in the source]
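
The generation of proficiencies from a multivariate normal distribution with a standardized covariance matrix can be sketched as follows for the equal-correlation case (all off-diagonal components equal to .2). This is only an illustrative sketch using a hand-written Cholesky factorization and Python's standard library; the function names, the seed, and the sampling method are assumptions and do not describe the study's C++ program.

```python
import random


def equicorrelation(dim, rho):
    """Correlation matrix with 1 on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(dim)] for i in range(dim)]


def cholesky(matrix):
    """Lower-triangular factor L with L * L^T = matrix (positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = (matrix[i][i] - partial) ** 0.5
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def draw_proficiencies(n_examinees, sigma, seed=0):
    """Draw proficiency vectors from N(0, sigma) as theta = L z, z ~ N(0, I)."""
    rng = random.Random(seed)
    lower = cholesky(sigma)
    dim = len(sigma)
    draws = []
    for _ in range(n_examinees):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        draws.append(
            [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(dim)]
        )
    return draws


sigma = equicorrelation(5, 0.2)       # equal-correlation case, rho = .2
theta = draw_proficiencies(2000, sigma)
print(len(theta), len(theta[0]))
```

The same `draw_proficiencies` function would accept a general correlation matrix in place of `sigma`, provided the matrix is positive definite.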

Crossing the test length (30 and 45), the sample size (2000 and 5000), the
proficiency covariance ($\Sigma_{\theta,.2}$ and $\Sigma_{\theta,g}$), and the
replications yields 24 response data sets for the simulation studies of the
five-dimensional case. For each data set, multiple chains (3 chains per data set) are
constructed. To give more stable and accurate estimates, the final estimate of each
item parameter is the mean of the three individual estimates from the chains, which
start from different initial values. Therefore, there are 72 runs in all for the
parameter estimates in this section.
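
The bookkeeping of the simulation design can be sketched as below. Three replications per condition is an assumption (it is what 24 data sets over 8 conditions implies), and the labels, variable names, and the helper `pooled_estimate` are hypothetical.

```python
from itertools import product

test_lengths = [30, 45]
sample_sizes = [2000, 5000]
covariances = ["equal_0.2", "general"]   # labels for the two covariance matrices
replications = [1, 2, 3]                 # assumed: 24 data sets / 8 conditions
chains_per_data_set = 3

# One response data set per combination of condition and replication.
data_sets = list(product(test_lengths, sample_sizes, covariances, replications))

# One MCMC run per chain and data set.
runs = [
    (data_set, chain)
    for data_set in data_sets
    for chain in range(1, chains_per_data_set + 1)
]


def pooled_estimate(chain_estimates):
    """Final estimate of one parameter: mean of the per-chain estimates."""
    return sum(chain_estimates) / len(chain_estimates)


print(len(data_sets), len(runs))
print(pooled_estimate([1.0, 2.0, 3.0]))
```

The first printed line reproduces the 24 data sets and 72 runs stated in the text.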

Tables 3.19 and 3.20 give the RMSE of the item parameter estimates under the eight
simulation conditions for each item parameter. The two tables differ in the underlying
proficiency covariance: the results in Table 3.19 are based on $\Sigma_{\theta,.2}$
and those in Table 3.20 on $\Sigma_{\theta,g}$. Most of the RMSE values in the tables
are less than .2. The highest RMSE value (.29) is for the d parameter in the condition
of 5000 examinees on the 45-item test with covariance $\Sigma_{\theta,g}$.


It is clear from the two tables that the precision of the item parameter estimates
does not change with the proficiency covariance used. In other words, the underlying
proficiency covariance is not a factor that affects the item parameter estimates,
which is expected because the item and proficiency parameters are sampled
independently. It is also clear that the RMSE are generally smaller when the sample
size is 5000 than when it is 2000, which is also expected since more examinees
provide more information for item parameter estimation. However, the estimation
seems better for the 30-item test, since the RMSE are in general slightly higher for
the 45-item test regardless of the sample size, which is not expected.
One possible reason is that the dimensional structure of the 30-item test (only 17
items measuring all 5 dimensions) is much simpler than that of the 45-item test (32
items measuring all 5 dimensions). In addition, more items with extreme values, which
are difficult to estimate, might appear in the 45-item test.

Compared to the RMSE of the item parameter estimates for the unidimensional model
(Table 3.8) and the 3-dimensional model (Tables 3.12 and 3.13), the RMSE of the
item parameter estimates for the 5-dimensional model (Tables 3.19 and 3.20) are
generally higher. Again, this implies that for the same amount of data, the more
parameters there are to estimate as the number of dimensions increases, the larger
the estimation error.

Table 3.21 shows the correlations between the true proficiency parameters and their
estimates when the underlying covariance matrix is $\Sigma_{\theta,.2}$, that is, when
the off-diagonal components of the covariance matrix of the proficiency distribution
are equal to .2.


Table 3.19: RMSE for Multi-dimensional Test (Dim = 5, ρ = .2)

Estimates   30 x 2000   30 x 5000   45 x 2000   45 x 5000
a1          .15         .13         .21         .20
a2          .24         .16         .20         .14
a3          .16         .11         .23         .15
a4          .20         .14         .22         .22
a5          .18         .15         .20         .14
d           .21         .16         .22         .24
c           .05         .05         .03         .03

 

 

 

 

 

Table 3.20: RMSE for Multi-dimensional Test (Dim = 5, ρ = general)

Estimates   30 x 2000   30 x 5000   45 x 2000   45 x 5000
a1          .17         .15         .26         .17
a2          .18         .16         .27         .18
a3          .16         .15         .20         .19
a4          .18         .16         .25         .21
a5          .21         .18         .24         .25
d           .25         .17         .28         .29
c           .06         .04         .03         .03

 

 

 

 

 


 

 

Table 3.21: Correlations Between True Proficiency and Estimates (Dim = 5, ρ = .2)

                                 30 x 2000   30 x 5000   45 x 2000   45 x 5000
corr(\theta_1, \hat{\theta}_1)   .7899       .7829       .7935       .7976
corr(\theta_2, \hat{\theta}_2)   .7508       .7499       .7984       .8006
corr(\theta_3, \hat{\theta}_3)   .8038       .8067       .8088       .8195
corr(\theta_4, \hat{\theta}_4)   .7606       .7641       .7934       .7818
corr(\theta_5, \hat{\theta}_5)   .7594       .7559       .8010       .7915

 

 

 

 

 

 

 

In general, the correlation for each dimension in this study is around .8, and the
correlations are close between the two proficiency covariance conditions, indicating
that the proficiency covariance does not affect the proficiency estimates. Compared
with the correlations for the unidimensional model (Table 3.9) and the 3-dimensional
model (Tables 3.14 and 3.15), the correlations for the 5-dimensional model in Tables
3.21 and 3.22 are generally smaller, which is expected because more parameters must be
estimated as the number of dimensions increases. The lowest correlations are for the
short test (the 30-item test), which is expected because each dimension of proficiency
is measured by only 17 items. The longer test (the 45-item test) has slightly higher
correlation coefficients. Low correlations indicate large estimation errors in the
proficiency estimates. Nevertheless, the estimation is not significantly improved in
the 45-item test, although each dimension is measured by 32 items. One possible
interpretation is that the number of parameters to be estimated increases
substantially as the number of dimensions increases to five.


Table 3.22: Correlations Between True Proficiency and Estimates (Dim = 5, ρ = general)

                                 30 x 2000   30 x 5000   45 x 2000   45 x 5000
corr(\theta_1, \hat{\theta}_1)   .7835       .7935       .8076       .7999
corr(\theta_2, \hat{\theta}_2)   .7548       .7617       .7882       .7983
corr(\theta_3, \hat{\theta}_3)   .8098       .8034       .8221       .8139
corr(\theta_4, \hat{\theta}_4)   .8171       .8241       .8264       .8326
corr(\theta_5, \hat{\theta}_5)   .7745       .7971       .8244       .8333

 

 

 

 

 

 

3.7 Proﬁciency Structure Estimation

The estimates of the underlying proficiency structure have potential effects on the
convergence speed, since at each sampling step the proficiency samples are drawn
from the multivariate normal distribution with mean vector 0 and a covariance drawn
from the inverse Wishart distribution based on the sample covariance of the abilities.
Good recovery of the covariance structure makes for an effective Markov chain.

The components of the underlying proficiency covariance are also estimated, along
with the item and proficiency parameters, by the MCMC procedure. For each data set,
one estimate of the covariance is obtained from each chain replication with different
initial values. The final covariance matrix estimate is the mean of the three estimates
from the independent chains. Note that for each chain, the proficiency covariance
estimate is the mean of the 1000 samples of the covariance drawn from the inverse
Wishart distribution, which is itself based on the sample covariance. Good estimates
of the covariance matrix better recover the interrelations among the proficiency
dimensions. Table 3.23 gives the estimates for each chain of the 30-item test in the
three-dimensional case with


 

Table 3.23: Estimates of Covariance Matrix, Dim = 3, ρ = .2

Data     30 x 2000           30 x 5000
Rep 1    1.02  0.21  0.15    1.01  0.15  0.13
               1.01  0.14          1.03  0.13
                     0.97                0.98
Rep 2    1.04  0.18  0.17    0.99  0.18  0.18
               0.99  0.12          1.00  0.18
                     0.96                1.01
Rep 3    0.99  0.13  0.17    1.00  0.12  0.14
               1.03  0.21          1.01  0.17
                     1.01                1.00
 

Table 3.24: Estimates of Covariance Matrix, Dim = 3, ρ = general

Data     45 x 2000           45 x 5000
Rep 1     .95   .58   .15     .99   .69   .16
                .94   .29          1.00   .25
                      .98                1.01
Rep 2    1.04   .65   .13    1.01   .68   .14
               1.02   .24          1.01   .25
                      .99                1.01
Rep 3     .99   .60   .18     .98   .70   .19
                .97   .22          1.02   .27
                     1.05                 .99

 

 

 

ρ taking the same value of .2 for all off-diagonal components. The table shows that
the diagonal elements are all close to 1, ranging from .96 to 1.04, and that the
off-diagonal elements range from .12 to .21. Similarly, Table 3.24 shows the covariance
estimates for the 45-item test in the three-dimensional case with true covariance
$\Sigma_{\theta,g}$. Clearly, the estimate of each component is close to its true
value. Results for the five-dimensional case in Tables 3.25 and 3.26 also indicate
reasonably good recovery of the proficiency structure.


Table 3.25: Estimates of Covariance Matrix, Dim = 5, ρ = general

Data     30 x 2000                       30 x 5000
Rep 1    1.03   .14   .24   .23   .47    1.00   .26   .26   .21   .47
               1.05   .18   .39   .23           .99   .22   .46   .25
                     1.00   .29   .19                1.00   .32   .31
                            .99   .54                       .99   .43
                                 1.05                             .97
Rep 2    1.01   .27   .30   .28   .44    1.01   .20   .29   .26   .39
               1.03   .28   .55   .29          1.03   .07   .54   .20
                     1.02   .39   .25                 .98   .35   .24
                           1.07   .44                      1.03   .52
                                 1.02                            1.00
Rep 3    1.03   .10   .17   .22   .55    1.01   .20   .30   .22   .53
               1.01   .25   .45   .21          1.03   .21   .44   .20
                     1.03   .37   .30                 .98   .45   .29
                           1.00   .50                      1.03   .41
                                 1.03                            1.00

 

 

 

Table 3.26: Estimates of Covariance Matrix, Dim = 5, ρ = .2

Data     45 x 2000                       45 x 5000
Rep 1     .99   .18   .23   .38   .21    1.02   .21   .20   .33   .18
               1.05   .22   .29   .06           .99   .21   .24   .22
                     1.00   .31   .21                1.01   .28   .15
                            .98   .25                       .99   .14
                                 1.05                             .97
Rep 2    1.04   .17   .21   .17   .21    1.02   .16   .18   .33   .15
               1.02   .15   .28   .21           .99   .24   .27   .18
                     1.04   .30   .18                1.01   .26   .18
                           1.00   .27                       .99   .17
                                 1.00                             .97
Rep 3    1.03   .14   .24   .28   .35    1.01   .19   .20   .33   .17
                .98   .32   .22   .13          1.03   .24   .26   .17
                     1.03   .35   .26                 .98   .25   .18
                           1.04   .33                      1.04   .17
                                 1.03                            1.00

 

 

 


 

 

3.8 Computing Time

One open criticism of the MCMC approach is its extensive computation, which depends
on the program efficiency, the size of the data, the convergence speed, and the
computer equipment. Program efficiency involves the design and the algorithms of the
source code. Many researchers now use application software (e.g., WinBUGS, BUGS, SAS,
S-PLUS, MATLAB) to run MCMC procedures (e.g., Patz and Junker, 1999a, use S-PLUS;
Bolt, 2004, uses WinBUGS). Some researchers use programming languages (e.g., S, R,
FORTRAN, Java) to code their own programs. In this study, the code is written in C++
with an efficient algorithm that uses MCMC to estimate the IRT model parameters. The
size of the data involves the test length, the examinee sample size, and the number of
dimensions and parameters to be estimated. In general, the longer the test, the more
time is needed. Similarly, the larger the number of examinees and of proficiency
dimensions, the longer the computing time. For a given data set, the more parameters
there are to estimate, the longer the computing time. The convergence speed is
associated with the priors chosen for the item and proficiency parameters, and also
with the data structure. If a chain is diagnosed as not mixing well, or as not having
converged to the target posterior distribution, more iterations are required, and thus
more time. Finally, a better equipped computer system gives faster computation for the
same program. The computing time for 11000 iterations using the C++ program is given
in Table 3.27; it was measured on a computer


Table 3.27: Computing time for 1-, 3-, and 5-Dimension data

Data          30 x 2000     30 x 5000     45 x 2000     45 x 5000
1-dimension   37 min        1 hr 17 min   42 min        1 hr 33 min
3-dimension   59 min        2 hr 30 min   1 hr 20 min   3 hr 35 min
5-dimension   1 hr 35 min   4 hr 5 min    2 hr 8 min    5 hr 17 min

 

 

 

 

 

 

 

with 512 MB RAM and a 3300 AMD Athlon 64 processor. The shortest time, 37
minutes, is for the parameter estimation of the unidimensional model with 30 items
and 2000 examinees. The longest time is for the case with 45 items, 5000 examinees,
and 5 dimensions of proficiency, which takes 5 hours and 17 minutes to finish the
11000 iterations. The times required for the other conditions fall between the
shortest and the longest.


 

Chapter 4

Concluding Remarks and Future
Research Directions

This research involves extensive simulation studies on parameter estimation for mul-
tidimensional IRT models in various conditions in terms of the test length, the sample
size of examinees, the number of dimensions, and the underlying proﬁciency structures
using the MCMC approach. Results on parameter estimates from these conditions
are compared to investigate the inﬂuence of the potential factors on the accuracy and
stability of the estimation.

This study is an extensive examination of the MCMC approach to parameter
estimation in terms of the test length, the examinee sample size, the number of
dimensions, the proficiency covariance, the range of the item parameters, and the
dimensional structure of each simulated test. For example, the study includes both
unidimensional and multidimensional items in a test, and it covers a wide variety
of parameter values (not limited to a certain range of values). Moreover, the study
does not focus only on simple structure but also considers complex structure.


The MCMC approach provides a convenient and flexible framework for parameter
estimation in complex IRT models, as shown in Chapter 3 for multidimensional models.
The C++ program is used to estimate not only the simple IRT model (e.g.,
unidimensional) but also more complex models (e.g., multidimensional IRT models). The
framework covers the estimation of every type of parameter in the IRT models (i.e.,
item parameters, proficiency parameters, and the proficiency covariance). One can use
the framework to estimate item and proficiency parameters simultaneously, or one can
obtain estimates of the proficiency covariance matrix to infer the interrelations
among the proficiency dimensions. In simpler situations, for example when only the
item parameter estimates, only the proficiency estimates, or only the interrelations
among the proficiency dimensions are required, the program can run the required
estimation procedures and skip the estimation of the other parameters without loss of
generality. In this case, the MCMC approach is faster because fewer parameters are
estimated, and thus less computing time is needed. In addition, under this framework,
one can obtain the item parameter estimates first and then treat them as true values
to produce the proficiency parameter estimates and the proficiency covariance
estimates (even by other procedures, for example, an ML procedure). Alternatively, one
can estimate all the model parameters simultaneously, as is done in this study.
Finally, the framework is not restricted to short tests or lower-dimensional tests. It
is particularly useful for estimating higher-dimensional and long tests with large
numbers of examinees, and for contexts in which the IRT model is so complicated that
other estimation approaches become infeasible.


The MCMC approach is effective and the computation is efficient. For parameter
estimation in unidimensional models, about half an hour is enough for a test with 30
items and 2000 examinees for 11000 iterations, and about an hour and a half is needed
for longer tests and larger samples, for example a 45-item test with 5000 examinees.
The path plots of the posterior distributions of the item parameters shown in Figure
3.2 imply that the constructed chains are well mixed even within the first 3000
iterations. If some parameters are not required, less time is needed and the resulting
estimates are not affected by omitting the estimation of the other parameters. For
example, item parameter estimation takes less time if no proficiency parameter
estimation is involved, and the item parameter estimates are not affected, because the
item and proficiency parameters are estimated independently. Moreover, a better
equipped computer system gives faster computation for the parameter estimation.

An important aspect of the MCMC approach to parameter estimation for IRT models is
the reasonable accuracy and stability of the estimates. A simulation study allows a
straightforward comparison between the estimates and the true parameters, which are
known before the estimation. The accuracy of the item parameter estimates increases
as the sample size increases, but decreases as the number of dimensions increases.

The estimation accuracy for the item parameters can be seen directly from the
comparison between the true values and the estimates, presented in the RMSE tables
(e.g., Tables 3.8, 3.12, 3.13, 3.19, and 3.20) and in the plots (e.g., Figures 3.4
through 3.6 and Figures 3.10 through 3.13) in Chapter 3 for the various simulation
conditions. For the unidimensional case, the item parameter estimates for both tests
(the 30-item test and the 45-item test) are listed in Tables 3.4, 3.6, and 3.7 for
sample sizes 2000 and 5000, along with the standard errors. For multidimensional model
parameter estimation, the individual item parameter estimates are not listed in a
table but are plotted against the corresponding true parameters. A small difference
between the true values and the estimates of the item parameters indicates reasonable
estimation. One can see in Tables 3.4, 3.6, and 3.7 that, for the unidimensional case,
most of the absolute differences between the true values and the estimates are less
than .1, and many of the standard errors of the estimates are also less than .1. More
results are found in the summary statistic, the RMSE. For the unidimensional case, the
RMSE for the a parameters is less than .15 and reaches .07 when the sample size
increases to 5000 (Table 3.8). The RMSE for the b parameter is less than .11 and that
for the c parameter is less than .05. For parameter estimation in the multidimensional
case, the RMSE is generally higher than in the unidimensional models. For example, in
Tables 3.12 and 3.13 for 3-dimensional model estimation, the RMSE for each a parameter
estimate is generally higher than the RMSE for the a parameter in the unidimensional
case; the RMSE in 5-dimensional item parameter estimation (Tables 3.19 and 3.20) are
in general higher than in both the unidimensional and the 3-dimensional cases. One can
conclude that as the dimension of proficiency in the model increases, the RMSE of the
item parameter estimates become larger, indicating poorer item parameter recovery.
One simple interpretation of this observation is that the number of parameters to be
estimated increases substantially as the proficiency dimension increases. Given the

90

same data structure and information, the more parameters need to estimate (as in the
3—dimensional and 5-dimensional model), the less information that the data contains
for parameter estimation, and thus the less accurate the item parameter estimates. It
is expected that the RMSE are larger in the 5—dimensional models than those in the
3—dimensional or unidimensional model. The good recovery of the item parameters
can also be found from the plots of the true item parameters versus the estimates
(e.g., Figure 3.4 through 3.6, Figure 3.10 through 3.13). In these ﬁgures the plots are
closely around the reference line, indicating good estimates are obtained.
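
The growth in the number of parameters can be made concrete with a rough count. The sketch below assumes that each item carries one discrimination parameter per dimension plus one intercept (or difficulty) and one pseudo-guessing parameter, and that each examinee carries one proficiency per dimension. It ignores the anchor-item constraints and the proficiency covariance matrix, so the counts are illustrative rather than the exact numbers estimated in this study; `count_parameters` is a hypothetical helper name.

```python
def count_parameters(n_items, n_dims, n_examinees):
    """Count item and proficiency parameters, ignoring identification constraints."""
    # Assumption: n_dims discriminations + 1 intercept + 1 pseudo-guessing per item.
    item = n_items * (n_dims + 2)
    # One proficiency per examinee per dimension.
    proficiency = n_examinees * n_dims
    return item, proficiency


n_items, n_examinees = 30, 2000
responses = n_items * n_examinees  # the amount of data stays fixed at 60000 responses
for n_dims in (1, 3, 5):
    item, proficiency = count_parameters(n_items, n_dims, n_examinees)
    per_parameter = responses / (item + proficiency)
    print(n_dims, item, proficiency, round(per_parameter, 2))
```

For the 30-item test with 2000 examinees, this count gives 90, 150, and 210 item parameters for 1, 3, and 5 dimensions, while the number of scored responses per parameter falls from about 29 to about 6.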

The precision of the proficiency estimates is assessed in terms of the correlations
between the true proficiency parameters and the estimates, and in terms of plots of
one against the other. Larger correlations are obtained for the longer test (the
45-item test), while lower correlations are associated with higher-dimensional tests
(e.g., the 5-dimensional test). The proficiency covariance matrix has negligible
effects on proficiency parameter estimation.

The correlation tables show the correlations between the true proficiency parameters
and the estimates by number of dimensions, sample size, and test length (e.g.,
Tables 3.9, 3.14, 3.15, 3.21, and 3.22). The correlations for the unidimensional case
are the highest, exceeding .95 for every condition in the simulation studies
(Table 3.9). The correlations for the multidimensional cases (3 dimensions and
5 dimensions) are generally lower than those for the unidimensional models, ranging
from about .8 to .93 for each proficiency dimension. The plots of the true
proficiencies versus the estimates in Figures 3.4 through 3.6 show that the estimates
lie close to the reference line for the unidimensional model. However, the plots in
Figures 3.10 through 3.13 for the multidimensional cases show the estimates spread
more widely around the line. A possible explanation for the relation between the
correlations and the number of proficiency dimensions concerns the information
contained in the data. One can expect better proficiency estimates, or higher
correlations, for the lower-dimensional models, in particular for the unidimensional
model, because fewer parameters have to be estimated from a data set of the same size,
so that more of the information in the data is available for each proficiency
estimate. When a higher-dimensional model is used, better proficiency estimates are
expected for longer tests.
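
The per-dimension correlations reported in such tables could be computed along the following lines. This is a sketch under stated assumptions: the data are synthetic, the "estimates" are simply the true proficiencies plus normal noise, and the noise levels in `noise_sd` are hypothetical values chosen to mimic dimensions measured with different amounts of information.

```python
import numpy as np


def dimension_correlations(theta_true, theta_hat):
    """Pearson correlation between true and estimated proficiencies, per dimension."""
    theta_true = np.asarray(theta_true, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    if theta_true.ndim != 2 or theta_true.shape != theta_hat.shape:
        raise ValueError("inputs must be 2-D arrays of identical shape")
    n_dims = theta_true.shape[1]
    return np.array(
        [np.corrcoef(theta_true[:, k], theta_hat[:, k])[0, 1] for k in range(n_dims)]
    )


rng = np.random.default_rng(0)
theta_true = rng.standard_normal((2000, 3))      # 2000 examinees, 3 dimensions
noise_sd = np.array([0.3, 0.5, 0.7])             # hypothetical estimation noise
theta_hat = theta_true + rng.standard_normal((2000, 3)) * noise_sd
r = dimension_correlations(theta_true, theta_hat)
print(np.round(r, 2))
```

With unit-variance proficiencies and independent noise, the population correlation is 1 / sqrt(1 + sd^2), so noisier dimensions yield lower correlations, which is the pattern described above for the higher-dimensional models.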

The estimation accuracy for both item and proficiency parameters under the MCMC
approach can also be seen by comparing the results with those from other procedures.
For example, for the unidimensional case, the item parameters of the 30-item test
are also calibrated with the standard MML/EM procedure in BILOG-MG3, and the results
are shown in Table 3.5. The results from the two approaches are comparable. However,
the MCMC procedure, although it takes a Bayesian perspective, is flexible and
convenient for much more complex IRT models. Furthermore, as Patz and Junker point
out, one advantage of the MCMC procedure over traditional methods is that it is
capable of estimating the exact joint posterior distribution of the parameters
(Patz and Junker, 1999a).

The accuracy of the MCMC estimation can also be seen from the agreement of the
estimates across replicated data sets and across multiple chains. This agreement is
also an aspect of the stability of parameter estimation under the MCMC approach.
Table 3.3 shows that, for the unidimensional case, the three independent chains yield
very stable estimates of the item parameters for the 30-item tests. Similar results
are obtained for the 45-item tests and for the higher-dimensional models. For the
same data set, the parameter estimates are stable across three independent chains
with different initial values, indicating that the chains for the model parameters
have reached their stationary distribution. That is why the parameter estimates do
not depend on the initial values. The item parameter estimates are stable not only
across the multiple chains but also across data sets (e.g., Tables 3.8 and 3.9).
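
Agreement among independent chains can also be summarized numerically. The sketch below implements the basic potential scale reduction factor of Gelman and Rubin (1992) for a single parameter. It is offered as one standard diagnostic, not as the procedure used to produce Table 3.3; the simulated chains, their mean of 1.2, and the function name are assumptions made only for illustration.

```python
import numpy as np


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R-hat for one parameter; chains has shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    chain_means = chains.mean(axis=1)
    between = n_draws * chain_means.var(ddof=1)       # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()        # within-chain variance W
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))


rng = np.random.default_rng(1)
# Three hypothetical chains that sample the same posterior for one item parameter.
mixed = rng.normal(loc=1.2, scale=0.05, size=(3, 3000))
# Same draws, but the third chain is shifted, as if it had not left its starting region.
stuck = mixed.copy()
stuck[2] += 0.5
print(round(potential_scale_reduction(mixed), 3))
print(round(potential_scale_reduction(stuck), 3))
```

Values close to 1 are consistent with chains that have reached a common stationary distribution, whereas a value well above 1 signals that the chains still depend on their initial values.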

It seems difficult to increase the estimation precision of both item and proficiency
parameter estimates in IRT models at the same time. When the sample size increases
for a fixed number of items in a test, the item parameter estimates are expected to
improve. For a fixed group of examinees, the proficiency parameter estimates
are expected to improve as the number of items in the test increases. One can argue
that for a fixed number of items, the number of item parameters to be estimated is
fixed, and increasing the sample size of examinees provides more information
for estimating them. Therefore, the standard errors of the estimates decrease.
Likewise, for a fixed number of examinees, the number of proficiency parameters to be
estimated is fixed, and the proficiency estimates improve as the number of items
in the test increases, because the test provides more information for estimating
proficiency. The same holds for parameter estimation using the MCMC procedure.
It is seen from Tables 3.13 and 3.14 that for a fixed test (e.g., the 30-item
test or the 45-item test), the item parameter estimates improve in terms of RMSE when
the sample size increases from 2000 to 5000.

The proficiency covariance matrix is well recovered by the MCMC procedure, and the
estimation of the proficiency covariance matrix does not affect the item parameter
estimates.

The relations between the estimates and the design variables for a test (e.g., the
test length, the sample size of examinees, and the number of dimensions) are helpful
for suggesting general guidelines for parameter estimation. For example, to obtain
accurate item parameter estimates for the unidimensional model, assuming perfect
model-data fit, 2000 examinees are sufficient for a test of 30 items. But to estimate
item parameters for the same 30 items under the 3-dimensional model, more than 2000
examinees (e.g., 5000) may be needed to achieve the same estimation precision.
Similarly, for the 5-dimensional model, more than 5000 examinees (e.g., 8000 or more)
could help to reach the same estimation precision. For proficiency estimates under
the unidimensional model, 30 items can provide reasonably good estimation, as seen in
the correlations in Table 3.9 and the plot in Figure 3.3. But for the 3-dimensional
test, 45 items may be needed for reasonably good proficiency estimates, as seen in
Tables 3.14 and 3.15 and Figure 3.17. For the 5-dimensional model, more than 45 items
(e.g., 60 items) could help to obtain reasonably good proficiency estimates.

One limitation of the MCMC approach to estimating multidimensional IRT model
parameters, apart from the extensive computation, is that the number of dimensions
must be given. But the number of dimensions is generally not known in real data
analysis. How would the MCMC approach perform if the specified number of dimensions
were smaller or larger than the number of dimensions actually required by the test?
This is an interesting practical issue that deserves further research. It is in fact
a model-data fit issue, or an issue of the sensitivity of parameter estimation under
the MCMC approach, rather than a parameter estimation issue (the focus of this
research). In practice, the estimates are acceptable only when the model fits the
data. However, the MCMC approach does not provide any mechanism for diagnosing
whether or not the data fit the estimating model. How much additional error would be
introduced by inadequate model-data fit? This practical issue poses a challenge for
MCMC estimation.

In the simulation studies, the proficiency covariance matrix is varied from a special
pattern (e.g., all off-diagonal elements equal) to a general one, and the effects of
the proficiency covariance matrix on the parameter estimation are carefully examined.
Throughout, the proficiency population is assumed to be multivariate normal or
standard normal. If the examinee groups are not from a normal distribution, does the
approach still yield accurate and stable estimates? This issue also deserves further
research, because in many applications the examinees might not come exactly from a
normal population.
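
To make this simulation design concrete, the sketch below draws proficiencies from a multivariate normal population with equal off-diagonal covariances and generates dichotomous responses. It assumes the compensatory three-parameter logistic form P = c + (1 - c) / (1 + exp(-(a'theta + d))). The item parameters, the covariance value of .3, and the function name `simulate_responses` are hypothetical and are not the conditions used in this study.

```python
import numpy as np


def simulate_responses(a, d, c, sigma, n_examinees, rng):
    """Simulate 0/1 responses under an assumed compensatory 3PL multidimensional model."""
    a = np.asarray(a, dtype=float)   # discriminations, shape (n_items, n_dims)
    d = np.asarray(d, dtype=float)   # intercepts, shape (n_items,)
    c = np.asarray(c, dtype=float)   # pseudo-guessing, shape (n_items,)
    n_dims = a.shape[1]
    # Proficiencies from a zero-mean multivariate normal population.
    theta = rng.multivariate_normal(np.zeros(n_dims), sigma, size=n_examinees)
    logits = theta @ a.T + d
    prob = c + (1.0 - c) / (1.0 + np.exp(-logits))
    # A response is correct when a uniform draw falls below its probability.
    responses = (rng.random(prob.shape) < prob).astype(int)
    return theta, prob, responses


# Special covariance pattern: unit variances, all off-diagonal elements equal.
sigma = np.full((3, 3), 0.3)
np.fill_diagonal(sigma, 1.0)

# Four hypothetical items on three dimensions.
a = [[1.2, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.5], [0.8, 0.6, 0.7]]
d = [0.0, -0.5, 0.5, 0.2]
c = [0.2, 0.2, 0.15, 0.25]

rng = np.random.default_rng(2)
theta, prob, responses = simulate_responses(a, d, c, sigma, 500, rng)
print(responses.shape, responses.mean(axis=0).round(2))
```

Replacing the call to `multivariate_normal` with a skewed or heavy-tailed generator would be one way to set up the robustness question raised above.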

In addition, the metric for both item and proficiency parameters is established
by a well-defined set of anchor items, which are often placed in the first positions
in the tests. The anchor items help to resolve the indeterminacy problems that
are inherent in many IRT models. However, the choice of anchor items is often
subjective, and may therefore influence the establishment of the proficiency scales.
Further research is needed to investigate the effects of the anchor items on parameter
estimation using the MCMC approach. In real data applications, how can one choose
a useful set of anchor items that helps with model identification while
ensuring accurate parameter estimation?

Finally, the item parameter estimates from the MCMC methods are compared with the
estimates from TESTFACT, and the results show that the MCMC estimates are better
than those from TESTFACT. Table 4.1 shows the item parameter estimates from TESTFACT
for the 30-item test with 3 dimensions administered to 2000 examinees (i.e., the
first replication of the data). Table 4.2 shows the item parameter estimates from
TESTFACT for the 30-item test with 5 dimensions administered to 2000 examinees
(also the first replication of the data). The pseudo-guessing parameters supplied
as input are the true values of the c parameters. Compared with the true item
parameters (Tables 3.10 and 3.16) and the MCMC estimates (Tables 3.13, 3.19, and
3.20), the item parameter estimates from TESTFACT in Tables 4.1 and 4.2 in general
seem somewhat worse. In addition, the results from TESTFACT contain some deviant
values (e.g., items 11, 17, 18, 19, and 21 in Table 4.2 for the 5-dimensional case).


Table 4.1: TESTFACT Item Parameter Estimates for 30-Item Test (Dim = 3)

Item      a1      a2      a3       d
   1    1.25    -0.2   -0.17   -0.27
   2   -0.09    0.65   -0.16    0.02
   3   -0.45   -0.37    2.28    -1.2
   4    1.97   -0.33   -0.35    0.53
   5    0.86   -0.21   -0.13    0.22
   6    1.37   -0.22   -0.21    1.47
   7    0.65   -0.22   -0.08    1.57
   8   -0.55    2.64   -0.41   -0.53
   9   -0.27    1.86   -0.41      -1
  10   -0.11    0.67   -0.15    -0.6
  11   -0.16    1.08   -0.15    1.16
  12   -0.29   -0.34    1.47   -0.33
  13   -0.24    0.04    0.89    2.22
  14   -0.03   -0.32    1.31   -1.46
  15   -0.28   -0.36    1.83   -0.56
  16    0.44    1.54    1.26   -1.48
  17    1.11    1.78    0.19   -0.24
  18     0.3    0.71    1.19    0.86
  19    1.28       1    1.17    0.21
  20    0.04    0.48     1.5    0.95
  21    2.06     0.4    0.53   -1.39
  22    1.29    0.75    1.96    -2.2
  23     0.8    1.16    0.78   -0.42
  24    2.27    0.19   -0.09   -1.55
  25    1.35    1.45    0.45   -1.91
  26    0.58    0.98    0.44    0.21
  27    0.61    0.91    1.02    0.66
  28    1.13    0.77    1.31    0.37
  29    0.95    0.41    1.62    0.16
  30    0.47    1.99    1.09   -0.87

Table 4.2: TESTFACT Item Parameter Estimates for 30-Item Test (Dim = 5)

Item      a1      a2      a3      a4      a5       d
   1    0.88   -0.07   -0.29   -0.11   -0.02    1.59
   2   -0.54    3.67   -0.59   -0.43   -0.66   -0.76
   3   -0.05   -1.67    7.62   -0.88   -1.03   -1.84
   4   -0.16    -0.4    -0.4    1.71   -0.25    1.48
   5   -0.21       0   -0.25   -0.38    1.47   -0.58
   6    1.96   -0.27   -0.39   -0.35   -0.24   -0.06
   7    2.67   -0.31   -0.33   -0.54   -0.51   -0.75
   8   -0.06    0.83   -0.14    0.03   -0.21    -0.5
   9   -0.17    2.43   -0.38   -0.42   -0.42    1.17
  10    0.14    -0.2     1.3   -0.16   -0.25   -1.24
  11  -22.14   -5.89   98.05  -15.65  -22.07  -76.86
  12   -0.27   -0.22   -0.31    1.89   -0.41    1.66
  13   -0.15   -0.16   -0.25    1.44   -0.31    2.13
  14   -1.17   -2.15   -1.93   -1.48   10.28    2.83
  15    0.01   -0.11   -0.14   -0.08    0.45   -0.42
  16    3.19   -0.12    1.26    1.25   -0.51     0.3
  17   15.39    1.78   26.26    7.73    26.4  -35.36
  18   -0.62     6.8    0.43    6.35   10.72    7.27
  19   -4.07    3.32    5.28    8.76    9.89   14.02
  20    0.36   -0.06    1.26    1.74    1.24    1.57
  21      -5     2.5   15.72   43.88     8.1   24.18
  22    1.23    0.66    2.01     0.1    1.68    1.94
  23    0.78     0.2    0.07    0.63    -0.1    0.72
  24    0.81    2.59    1.17   -0.29   -0.03    0.21
  25    2.02   -0.18    0.36    0.15    0.82     0.1
  26    1.59    0.07    1.14    0.42    0.56   -0.26
  27     0.7    0.21    0.05    0.45    0.03    0.12
  28    1.13    0.25    0.55   -0.09   -0.02   -0.48
  29    0.87    0.35   -0.05    0.48    0.21    0.25
  30     0.3    0.28    0.74    0.52     0.5     0.5

Bibliography

[1] Ackerman, T. A. (1990). An evaluation of the multidimensional parallelism of
the EAAP Mathematics Test. Paper presented at the Meeting of the American
Educational Research Association, Boston, MA.

[2] Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and
item validity from a multidimensional perspective. Journal of Educational
Measurement, 29(1), 67-91.

[3] Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves
using Gibbs sampling. Journal of Educational Statistics, 17, 251-269.

[4] Baker, F. B. (1990). Some observations on the metric of BILOG results. Applied
Psychological Measurement, 14, 139-150.

[5] Beguin, A. A., & Glas, C. A. W. (1998). ED428100. MCMC Estimation of
Multidimensional IRT Models. Research Report 98-14.

[6] Besag, J., Green, P. J., Higdon, D. M., and Mengersen, K. L. (1995). Bayesian
Computation and Stochastic Systems (with discussion). Statistical Science, 10,
3-66.

[7] Birnbaum, A. (1957). Efficient design and use of tests of a mental ability for
various decision-making problems. (Series Report No. 58-16. Project No. 7755-
23). USAF School of Aviation Medicine, Randolph Air Force Base, Texas.

[8] Birnbaum, A. (1958a). Further considerations of efficiency in tests of a mental
ability. Technical Report No. 17. Project No. 7755-23, USAF School of Aviation
Medicine, Randolph Air Force Base, Texas.

[9] Birnbaum, A. (1958b). On the estimation of mental ability. Series Report No. 15.
Project No. 7755-23, USAF School of Aviation Medicine, Randolph Air Force
Base, Texas.

[10] Birnbaum, A. (1968). Some latent trait models and their use in inferring an
examinee's ability. In F. M. Lord and M. R. Novick (Eds.), Statistical Theories
of Mental Test Scores (pp. 397-472). Reading, MA: Addison-Wesley.


[11] Bock, R. D. (1972). Estimating item parameters and latent ability when
responses are scored in two or more nominal categories. Psychometrika.

[12] Bock, R. D., Gibbons, R., & Muraki, E. J. (1988). Full information item factor
analysis. Applied Psychological Measurement, 12, 261-280.

[13] Carlson, J. E. (1987). Multidimensional item response theory estimation: A
computer program (Research Report ONR 87-2). Iowa City, IA: The American
College Testing Program.

[14] Bock, R. D. and Lieberman, M. (1970). Fitting a response model for n
dichotomously scored items. Psychometrika, 35, 179-197.

[15] Bolt, D. M. and Lall, V. F. (2003). Estimation of compensatory and
noncompensatory multidimensional item response models using Markov Chain Monte
Carlo. Applied Psychological Measurement, 27(6), 395-414.

[16] De-la-Torre, J., & Patz, R. J. (2001). ED 464143. Item Response Theory Equating
Using Bayesian Informative Priors. Paper presented at the Annual Meeting of
the National Council on Measurement in Education (Seattle, WA, April 11-13,
2001).

[17] Embretson, S. E. and Reise, S. P. (2000). Item response theory for psychologists.
Lawrence Erlbaum Associates.

[18] Fox, J. P. (2002) Multilevel IRT Using Dichotomous and polytomous Response
Data. Research Report.

[19] Fraser, C. (1988). NOHARM II. A Fortran program for fitting unidimensional
and multidimensional normal ogive models of latent trait theory. Armidale,
Australia: The University of New England, Center for Behavioral Studies.

[20] Gamerman, D. (1997). Markov Chain Monte Carlo. New York: Chapman & Hall.

[21] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data
Analysis. Second Edition. Chapman & Hall.

[22] Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using
multiple sequences. Statistical Science, 7(4), 457-472.

[23] Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions
and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 6, 721-741.

[24] Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., eds. (1996). Markov Chain
Monte Carlo in Practice. London: Chapman and Hall.


[25] Gill, J. (2002). Bayesian methods for the social and behavioral sciences. Chapman
& Hall/CRC.

[26] Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.

[27] Hambleton, R. K. and Swaminathan, H. (1985). Item Response Theory: Principles
and Applications. Kluwer Nijhoff Publishing.

[28] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains
and their applications. Biometrika, 57, 97-109.

[29] Hulin, C. L., Lissak, R. L., and Drasgow, F. (1982). Recovery of two and three
parameter logistic item characteristic curves: A Monte Carlo study. Applied
Psychological Measurement, 6, 249-260.

[30] Kiefer, J., and Wolfowitz, J. (1956). Consistency of maximum likelihood estimates
in the presence of infinitely many incidental parameters. Annals of Mathematical
Statistics, 27, 887-890.

[31] Kim, S. H., & Cohen, A. S. (1998). ED420689. An Evaluation of a Markov Chain
Monte Carlo Method for the Two-parameter Logistic Model.

[32] Lehmann, E. L., & Casella, G. (1998). Theory of point estimation. Second edition.
Springer-Verlag New York, Inc.

[33] Li, J. C., Woodruff, D. J. (2001). ED 462419. Bayesian Statistical Inference for
Coefﬁcient Alpha. ACT Research Report Series.

[34] Little, R. J. A., and Rubin, D. B. (1983). On jointly estimating parameters and
missing data by maximizing the complete-data likelihood. The American
Statistician, 37, 218-220.

[35] Lord, F. (1952). A theory of test scores. Psychometric Monograph, No. 7.

[36] Lord, F. (1953). The relation of test score to the trait underlying the test. Edu-
cational and Psychological Measurement, 13, 517-548.

[37] Lord, F., & Novick, M. R. (1968). Statistical theories of mental test scores. Read-
ing, MA: Addison-Wesley.

[38] Lord, F. (1980). Applications of item response theory to practical testing problems.
Hillsdale, NJ: Lawrence Erlbaum Associates.

[39] Maris, G. & Maris, E. (2002). A MCMC-Method for Models with Continuous
Latent Responses. Psychometrika Vol. 67, No. 3, 335-350.

[40] Matthews-Lopez, J. L., Hombo, C. M. (2001). ED 454268. Modeling the Hyper-
distribution of Item Parameter to Improve the Accuracy of Recovery in Estima-
tion Procedures.


[41] McDonald, R. P. (1967). Nonlinear factor analysis. Psychometric monographs,
No. 15.

[42] McDonald, R. P. (1985). Unidimensional and multidimensional models for item
response theory. In D. J. Weiss (Ed.), Proceedings of the 1982 Computerized
Adaptive Testing Conference (pp. 127-148). Minneapolis: University of Minnesota,
Department of Psychology, Psychometric Methods Program.

[43] McKinley, R. L., & Reckase, M. D. (1983). MAXLOG: A computer program for
the estimation of the parameters of a multidimensional logistic model. Behavior
Research Methods & Instrumentation, 15, 389-390.

[44] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E.
(1953). Equation of state calculations by fast computing machines. The Journal
of Chemical Physics, 21, 1087-1092.

[45] Muthen, B. (1984). A general structural equation model with dichotomous,
ordered categorical, and continuous latent variable indicators. Psychometrika, 49,
115-132.

[46] Neyman, J., and Scott, E. L. (1948). Consistent estimates based on partially
consistent observations. Econometrica, 16(1), 1-32.

[47] Patz, R. J., & Junker, B. W. (1999a). A Straightforward Approach to Markov Chain
Monte Carlo Methods for Item Response Models. Journal of Educational and
Behavioral Statistics, 24(2), 146-178.

[48] Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT:
multiple item types, missing data, and rated responses. Journal of Educational
and Behavioral Statistics, 24(4), 342-366.

[49] Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.
Copenhagen: Danish Institute for Educational Research.

[50] Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests.
Chicago: The University of Chicago Press.

[51] Reckase, M. (1996). A linear logistic multidimensional model for dichotomous
item response data. In W. van der Linden & R. Hambleton (Eds.), Handbook of
modern item response theory (pp. 271-286). New York: Springer-Verlag.

[52] Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Unidimensional data
from multidimensional tests and multidimensional data from unidimensional
tests.

[53] Reckase, M. D., & Hirsh, T. M. (1991). Interpretation of number-correct scores
when the true number of dimensions assessed by a test is greater than two.
Paper presented at the annual meeting of the National Council on Measurement
in Education, Chicago.

[54] Roussos, L. A. (1995). A new dimensionality estimation tool for multiple-item
tests and a new DIF analysis paradigm based on multidimensionality and
construct validity. Unpublished doctoral dissertation, University of Illinois at
Urbana-Champaign.

[55] Samejima, F. (1969). Estimation of latent ability using a response pattern of
graded scores. Psychometric Monograph, No. 17.

[56] Samejima, F. (1972). A general model for free-response data. Psychometric
Monograph, No. 18.

[57] Segall, D. O. (2001). General ability measurement: an application of multidi-
mensional item response theory. Psychometrika. Vol. 66, No. 1, 79-97.

[58] Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika. Vol. 61,
No. 2, 331-354.

[59] Stegelmann, W. (1983). Expanding the Rasch model to a general model having
more than one dimension. Psychometrika, 48, 259-267.

[60] Tierney, L. (1991). Exploring Posterior Distributions Using Markov Chains. In
Computing Science and Statistics: Proceedings of the 23rd Symposium on the
Interface. E. M. Keramidas (ed.). Fairfax Station, VA: Interface Foundation. pp.
563-570.

[61] van der Linden, W. J., & Hambleton, R. K. (1996). Handbook of modern item
response theory. New York: Springer.

[62] Whitely, S. E. (1980). Multicomponent latent trait models for ability tests.
Psychometrika, 45, 479-494.

[63] Williamson, D. M., Johnson, M. S., Sinharay, S., & Bejar, I. I. (2002). ED
464948 Hierarchical IRT Examination of Isomorphic Equivalence of Complex
Constructed Response Tasks.

[64] Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA.

[65] Wollack, J. A., Bolt, D. M., Cohen, A. S, & Lee, Y. S. Recovery of Item Pa-
rameter in the Nominal Response Model: A Comparison of Marginal Maximum
Likelihood Estimation and Markov Chain Monte Carlo Estimation. Applied Psy-
chological Measurement, 26(3), 339-352.
